When is a unique id not unique? How can a file exist and not exist? Both questions came up while chasing a mysterious bug in the JFFNMS reachability interface, which goes to show the limitations of the PHP function uniqid().
For quite some time modern versions of JFFNMS have had a problem. In large installations hosts would randomly appear as down with the reachability interface going red. All other interface types worked, just this one.
Reachability interfaces are odd because they call fping or fping6 to do the work. To run a ping program you need root access to open a raw socket, and that is far too difficult and scary to do in PHP, which is what JFFNMS is written in.
To capture the output of fping, the program is executed and the output captured to a temporary file. For my tiny setup this worked fine, and for a lot of small setups it was also fine. For larger setups, it was not fine at all. There were random failed interfaces and, most bizarrely of all, a file that seemed to disappear. The program checked that a file existed and then ran stat in a loop to see if data was there. The file-exists check worked but the stat said file not found.
At first I thought it was some odd load related problem, perhaps the filesystem not being happy and having a file there but not really there. That was, until someone said “Are these numbers supposed to be the same?”
The numbers he was referring to were the filename ids of the temporary files. They were most DEFINITELY not supposed to be the same. They were supposed to be unique. Why were they always unique for me and not for large setups?
The problem is with the uniqid() function. It is basically a hex representation of the current time, down to the microsecond. Large setups often have large numbers of child processes for polling devices. As the number of poller children increases, so does the chance that two child processes start the reachability poll at the same time and get the same uniqid. That is why the problem happened, but only some of the time.
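To see why two processes can collide, it helps to pull a uniqid apart. The following is only a sketch: decode_uniqid() is a made-up helper name, not part of JFFNMS, and it assumes uniqid() is called with no prefix and without the more_entropy flag, in which case the result is 8 hex digits of seconds followed by 5 hex digits of microseconds.

```php
<?php
// Hypothetical helper (not part of JFFNMS): split a plain uniqid()
// into the two time fields it is built from.
function decode_uniqid(string $id): array
{
    return [
        // First 8 hex digits: seconds since the Unix epoch.
        'seconds' => hexdec(substr($id, 0, 8)),
        // Last 5 hex digits: microseconds within that second.
        'microseconds' => hexdec(substr($id, 8, 5)),
    ];
}

$id = uniqid();
$parts = decode_uniqid($id);
print "uniqid $id was made at " . date('c', $parts['seconds'])
    . " plus {$parts['microseconds']} microseconds\n";
```

Nothing in there identifies the process, so two children asking in the same microsecond get the same answer.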
The stat error was another symptom of this bug, what would happen was:
Child 1 starts the poll, temp filename abc123
Child 2 starts the poll in the same microsecond, temp filename is also abc123
The wait pollers for child 1 and child 2 start, see that the temp file exists and go into a loop of stat and wait until there is a result
Child 1 finishes, grabs the details, deletes the temporary file
Child 2 loops, tries to run stat but finds no file
Who finishes first depends entirely on how quickly fping returns, and that depends on how quickly the remote host responds to pings, so it's kind of random.
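The sequence above can be replayed in a single process to show how a file can exist and then not exist. This is a sketch for illustration only, not the JFFNMS poller code: the file name and variable names are invented, and the two "children" are just consecutive lines.

```php
<?php
// Both children computed the same name, so they share one file.
// The path is an invented demo name, not what JFFNMS uses.
$shared = sys_get_temp_dir() . '/reach_demo_abc123_' . getmypid();

file_put_contents($shared, "fping output\n"); // fping writes its result

$child2_saw_file = file_exists($shared);      // child 2: existence check passes

unlink($shared);                              // child 1: done, deletes the file

clearstatcache();                             // make sure PHP asks the filesystem again
$child2_stat = @stat($shared);                // child 2: stat now fails and returns false

print "exists check: " . var_export($child2_saw_file, true)
    . ", stat worked: " . var_export($child2_stat !== false, true) . "\n";
```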
The fix was a minor patch to use tempnam() instead of uniqid(), adding the interface ID into the mix for good measure (no two children will poll the same interface; the parent’s scheduler makes sure of that). The initial response is that it is looking good.
JFFNMS version 0.9.4 was released today. This version fixes some bugs that have recently appeared in previous versions.
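A rough sketch of that approach follows. It is not the actual JFFNMS patch: the function name and the "fping_" prefix are assumptions made up for this example. The point is that tempnam() creates the file itself and picks a name that does not already exist, so two callers never receive the same path.

```php
<?php
// Hypothetical helper, not the real JFFNMS code: build a temp file
// whose name includes the interface ID and is guaranteed unused.
function reachability_temp_file(int $interface_id): string
{
    // tempnam() creates the file and returns its path, or false on failure.
    $path = tempnam(sys_get_temp_dir(), "fping_{$interface_id}_");
    if ($path === false) {
        throw new RuntimeException('Could not create temporary file');
    }
    return $path;
}

// Even with the same interface ID, two calls give two different files.
$a = reachability_temp_file(42);
$b = reachability_temp_file(42);
print "$a\n$b\n";
unlink($a);
unlink($b);
```

Some platforms only use the first few characters of the prefix, so the interface ID is an extra safeguard rather than the thing that guarantees uniqueness.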
The triggers rules editor had a problem where some of the rules clicked off the triggers would not appear or could not be edited correctly.
Most of the Admin screens have the ability to sort the rows. Unfortunately the sorting didn’t work, but the functionality has been restored.
Most users are probably unaware of this, but the database schema is first created for MySQL and is then converted for PostgreSQL. The conversion process is far from ideal and hasn’t worked until this release. More testing is required for PostgreSQL support but it should be a lot better.
You might (or not if you don’t visit) notice all my websites were down. A rushed apt-get dist-upgrade and I found two problems:
PHP5 got removed, which is bad if you run a WordPress site, which needs PHP to run
The Apache configuration has changed.
Yes, the NEWS entries did warn me, if I read them fully. Yes, I didn’t read them enough.
Apache now ignores configuration files that don’t end in .conf. To give a completely non-theoretical example, if you have your virtual hosts in files such as /etc/apache2/sites-enabled/enc.com.au then this will not be recognised and your sites will show the default “It works” page.
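A quick way to find the files that will be skipped is to list everything in the directory that lacks the suffix. The sketch below is only an illustration: the function name is invented, and the path is the Debian default mentioned above, so adjust it for other layouts.

```php
<?php
// Hypothetical helper: return the names in $dir that do not end in
// ".conf" and so would no longer be picked up.
function files_missing_conf_suffix(string $dir): array
{
    $missing = [];
    // scandir() returns false on failure; treat that as an empty list.
    foreach (scandir($dir) ?: [] as $name) {
        if ($name === '.' || $name === '..') {
            continue;
        }
        if (substr($name, -5) !== '.conf') {
            $missing[] = $name;
        }
    }
    return $missing;
}

$dir = '/etc/apache2/sites-enabled';
if (is_dir($dir)) {
    foreach (files_missing_conf_suffix($dir) as $name) {
        print "Will be ignored: $dir/$name\n";
    }
}
```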
Stuff that doesn’t fall in the usual places where website stuff should go, which for my setup is a lot of things, will also be denied as the developers have tightened up the rules around what is permitted. Pretty simple to fix with a few <Directory blah> clauses.
This isn’t a criticism of the Debian apache developers. They do an awesome job of keeping the package workable, flexible but secure which isn’t easy. Now it’s all back working, I actually agree with the changes they have made. It is just that the latest changes are, well, tricky so be forewarned.
JFFNMS version 0.9.3 has been released today. This is a vast improvement over the earlier 0.9.x releases and anyone using that train is strongly recommended to upgrade. So what changed? What didn’t change! A nice summary would be fixing a lot of things that were broken or needed some tweaking. A really, really big thanks to Marek for all the testing and bug reports and also the patient “just run this and tell me what it says” tests he did too. If something wasn’t right before and works now, it is quite likely it is working because Marek told me how it broke.
I recently had a bug report in JFFNMS that the SLA checks were failing with bizarre calculations. Things like 300% disk drive utilization and the like. Briefly, JFFNMS is written in PHP and checks values that come out of rrdtool and makes various comparisons like have you used more than 80% of your disk or have there been too many errors.
The logs showed strange input variables coming in, all were integers below 10. I don’t know of many 1 or 3 kB sized disk drives. What was going on? I ran a rrdtool fetch command on the relevant file and got output of something like 1,780000e+07 which for an 18GB drive seemed ok. Notice the comma, in this locale that’s a decimal point… hmm.
In lib/api.rrdtool.inc.php there is this line around the rrdtool_fetch area:
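A quick check and I was finding that my 1,7…e+07 was coming back as 1. We had a float conversion problem. Or more specifically, PHP has a float conversion problem. I built a small check script like the following:
setlocale(LC_ALL, 'de_DE.UTF-8'); // assumption: any installed locale with a comma decimal point
$pi = '3,14'; // assumption: a string, as it would arrive from rrdtool
$linfo = localeconv();
print "Decimal is \"$linfo[decimal_point]\". Pi is $pi and ".(float)($pi)."\n";
print "Half is ".(1/2)."\n";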
Which gave the output of:
Decimal is “,”. Pi is 3,14 and 3
Half is 0,5
So… PHP is saying that the decimal point is a comma and it uses it, BUT if a string comes in with a comma, it’s not a decimal point. Really?? Are they serious here? I tried various combinations and could not make it parse correctly.
The fix was made easier for me because I know rrdtool fetch only outputs values in scientific notation. That means if there is a string with a comma, then it must be a decimal point as it could never be used for a thousands mark. By using str_replace to replace any comma with a period the code worked again and didn’t even need the locale to be set correctly, or that the locale for the shell where rrdtool is run is the same as the locale in php.
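In code, the idea looks roughly like this. It is a sketch rather than the actual change in lib/api.rrdtool.inc.php, and rrd_value_to_float() is a name made up for the example. It relies on the point above: the input is always scientific notation, so a comma can only be a decimal point.

```php
<?php
// Hypothetical helper, not the real JFFNMS code: turn one value from
// rrdtool fetch output into a float, whatever locale produced it.
function rrd_value_to_float(string $raw): float
{
    // A comma can only be a decimal point here, so swap it for a period
    // before PHP's locale-unaware string-to-float conversion runs.
    return (float) str_replace(',', '.', trim($raw));
}

print rrd_value_to_float('1,780000e+07') . "\n"; // comma locale
print rrd_value_to_float('1.780000e+07') . "\n"; // period locale
```

Both calls give the same number, which is what lets the PHP side ignore the locale of the shell that ran rrdtool.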