Technical News


The news items below address various issues requiring more technical detail than would fit in the regular news section on our front page. These news items are all posted first in the Technical News discussion forum, with additional comments/questions from our participants.

(available as an RSS feed.)

26 Aug 2008 22:53:45 UTC
Ah, yes - here we go again - the regular Tuesday outage for mysql database backup/compression and other tasks better suited to happen during "quiescent" time.

For example, this week we replaced the failed drive in the workunit storage server with a new drive. That was painless. We also spent a bunch of time experimenting with the new-ish RAID server. I say "new-ish" as it's new to us, but it is an old system. For example, it can't handle logical volumes greater than 2TB. However, today we confirmed it can handle (a) single physical drives at least 750GB in size, and (b) physical volumes greater than 2TB (i.e. putting three 750GB drives together, 2.25TB raw, to make a 1.5TB RAID5).
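
A rough sketch of the capacity arithmetic behind that RAID5 example, assuming identical drives and the standard RAID5 layout where one drive's worth of space goes to parity. The function name is made up for illustration and is not part of any tool we run.

```python
def raid5_capacity_gb(drive_gb, n_drives):
    """Return (raw_gb, usable_gb) for a RAID5 set of identical drives."""
    if n_drives < 3:
        raise ValueError("RAID5 needs at least three drives")
    raw = drive_gb * n_drives
    # One drive's worth of capacity is consumed by parity.
    usable = drive_gb * (n_drives - 1)
    return raw, usable


raw, usable = raid5_capacity_gb(750, 3)
print(raw, usable)  # 2250 1500
```

So three 750GB drives are 2.25TB raw (over the 2TB mark) while the resulting logical volume is 1.5TB (under it).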

We also tested that this system is keeping up pretty well doing a continual backup of our upload directory. That is, we're doing a constant rsync with the upload directory to keep a "hot backup" around on a separate system. We didn't have the bandwidth/storage capacity to do this ourselves before (and daily backups to tape were too expensive).

Anyway.. the extended length of the outage today was mostly due to revamping the way we're doing the backups. We're working to include better query blocking (to ensure the database is totally update-free) and figure out the best way to maximize our time, thus ultimately shortening these outages.

- Matt


25 Aug 2008 22:56:00 UTC
I've been out for a couple weeks. I really need to get the others around here to chime in while I'm away, but it's hard to convince people who aren't as hypergraphic as I. Anyway, it seems like whatever happened most everybody survived. Another problem: what I end up blathering on about in these posts is hardly comprehensive, and given arbitrary priority based on whatever is on my mind at the given time. This can be confusing, I imagine.

I might also just go ahead and start only posting here when I really need to (during *real* server issues) and post less important day-to-day type things in the blog. We'll see how that goes. It might help keeping specific issues contained to one meaningful thread.

In any case, a brief rundown of the past two weeks: A drive failed on the workunit storage server. Usual drill there except it hung after the failure, however once rebooted it recovered just fine using a spare drive. Outside of that were more minor issues (another server hung requiring reboot, the mysql replica stopped for no apparent reason and took a few days to catch up, etc...) causing various queues to drain or fill too fast, bottlenecks were exercised, and we had a couple temporary complete/partial public server outages... all told nothing out of the ordinary. We are still running a bit "hot" due to the Astropulse release - by "hot" I mean we're using far more storage/network resources than we'd like, but we're otherwise okay.

Going back to catching up from the absence...

- Matt


7 Aug 2008 22:11:38 UTC
Towards the end of the afternoon yesterday we put in a new scheduler to fix a bug with "anonymous platforms" and the way they handle Astropulse workunits. This is working fine as far as I know, but at first there were some brief issues with uploads in general (human error when installing new scheduler).

Today we got our new NAS machine into the closet. We're close to removing the old NetApp filer, which still works great after so many years, but the drives are too small and we can't afford support on this system, and buying new replacement drives is prohibitively expensive. Plus the thing is just physically huge - a whole rack taking up a third of our closet for only 3 TB raw space. We're replacing it with a 3U system that will ultimately have about 7 TB raw space. Getting that into the closet meant I was able to fire up another server-to-be today in our prep lab and get that configured.

Traffic-wise we're still trying to get a feel for our demand and our bottlenecks. Eric wrote a script that is busy deleting antique workunits/results that exist on disk but not in the database (not sure why the antique deleter built into BOINC isn't working...). This will clear up additional much needed room but this is pretty much all we can do short of getting a whole new workunit storage server.
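
The idea of finding files that exist on disk but not in the database comes down to a set difference. The sketch below is not Eric's script. The function name, the flat directory layout, and the assumption that the database can hand back a list of known file names are all hypothetical.

```python
import os


def find_orphans(directory, names_in_db):
    """Return sorted names of files in `directory` that the database does not know about."""
    known = set(names_in_db)
    orphans = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only regular files are candidates; subdirectories are left alone.
            if entry.is_file() and entry.name not in known:
                orphans.append(entry.name)
    return sorted(orphans)
```

A real deleter would want to report the list first and only remove files after someone has looked it over.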

Looks like web code was updated just now, breaking a thing or two. I think Dave's addressing that stuff. I've been mostly catching up on several behind-the-scenes programming projects today.

- Matt


6 Aug 2008 21:11:48 UTC
Generally speaking, the wealth of issues we've been experiencing were simply due to Astropulse adding about 10-20 more Mbits/sec to our general average. This was a little higher than we expected, hence the initial air of mystery, but still quite within our abilities given current infrastructure. This traffic might go down a bit once everybody requesting their first Astropulse workunit gets their single copy of the Astropulse client.

So this explains the big rush once we released the first workunits and the longer "catching up" period, especially given the fact we were constrained all weekend due to lack of workunit storage space.

Today I've been mostly working on build scripts and testing recent database code fixes. Getting back on the "development" train for a bit... We are also close to getting that new home-grown NAS into production.

- Matt


5 Aug 2008 23:15:08 UTC
Today was another one of them "outage days" where we shut everything down to do basic weekly maintenance (database backup and whatnot). We had a particularly large task list this time around. A lot of it was fairly mundane - like moving/compressing files to make more room on various storage systems.

The sidious crash the other day did in fact break the mysql replica again. No big deal, but that meant recreating the database from the master - a seemingly weekly occurrence. It's easy to do, just adds extra time to the whole operation.

Also, we tried to fix that broken index on the science database. We found the corruption was actually not on the RAID system we thought (the one that required a drive replacement). Huh. Anyway.. the index repair on the whole table was taking too long. We might just go ahead and drop/rebuild the specific index later now that we are more sure what's what.

We brought all our backend services (feeder, transitioner, validator, etc.) up to spec on current BOINC code for the first time in a long time, so we carefully turned these on one at a time to observe the logs/results and make sure nothing got all screwy with the updated code.

So we're back up, more or less. The current mystery is why we are using so much bandwidth. Too many factors at play to make a clear determination - lots of known network bottlenecks, lots of database bottlenecks, unknown Astropulse behavior, etc. We'll give this a closer look tomorrow after (hopefully) some of the traffic jams disappear.

- Matt


4 Aug 2008 21:37:18 UTC
Another wacky weekend for us. Astropulse is still ramping up - we're creating work, sending it out, receiving results back and assimilating them. However the validator stopped granting credit for these workunits - something we'll fix and we can also retroactively give people their credit. The workunit storage server ran low on room again, the bottleneck that's been giving everybody headaches over the weekend as the splitters could only create work as fast as workunits got deleted off disk. Right now things are generally running slow as I'm moving stuff off the workunit server to make room causing lots of excess internal i/o. As an added bonus the mysql database replica server crashed this morning - it ran out of memory. No harm done, but it looks like it'll take a while to catch up again (it's been lagging behind all weekend). I would like to try to split the numbers on the status page between the two different applications (SETI@home/Astropulse) but those extra "where" clauses make the queries run forever.

In better news, looks like we got our new home-grown NAS/RAID box working as we'd like it, so we may start employing that sooner than later (thus freeing up lots of room/power in our server closet). Also all drive issues on our science database server over the past couple of weeks have been completely dealt with at this point. Well.. there's one lingering corrupted index which we'll try to rebuild tomorrow during the outage.

I was actually out of the loop since Thursday as I went up to Seattle to play a gig on the main stage at the Microsoft Techready conference at Bell Harbor. Anybody around here attend that thing? Fun show/event, but the stage tent was completely inadequate and the entire band got soaked by rain and sea mist. I'm amazed none of us were electrocuted.

- Matt


30 Jul 2008 20:10:28 UTC
Looks like we're pretty much out of the woods regarding recent issues. Plus the stats dumps are working again (for the first time in days) so there was an artificially inflated bump in BOINC world-wide productivity for a moment there.

Following on with the science database server stuff. I continue to play the RAID "shell game" to get the root filesystems back on the actual root drives (just for our own sanity, mostly). I also still have to drop/rebuild that one index which gave us trouble a couple weeks ago (apparently "checking" the index didn't fix it) - all very minor issues.

Regarding our experience with drive failures... We see the obvious stuff - drives fail either (a) immediately, (b) after 2-4 years, or (c) never ever. I remind people that our original SETI@home data recorder contained drives that were already heavily used for about 5-6 years when we installed them down at Arecibo in 1998, and then they were reading/writing successfully until a couple years ago. They would still probably be working but we have since switched to the newer multibeam data recorder system. Anyway, we don't have enough data to prove that high temps or heavy loads kill drives faster. My gut feeling is they don't as much as you think. My gut feeling is also that more than half our "failures" are bogus - for example, we had a lot of fibre channel errors, or RAID card bugs, or smartd being oversensitive making it seem like perfectly good drives were unhappy. Many times we just remove and re-add the "broken" drive and it works just fine. In the current case we believe the drive replacement was necessary.

Regarding linux OS re-installs... We've been using Fedora for a while now. Each OS rev has about 18 months of support, and we like to keep up to date for various compatibility/security/bug-fix reasons. It's easy to "yum upgrade" to the next OS rev, but after doing this a couple times you find configuration files get out of whack, and your system is littered with "rpmnew" files. Package conflicts arise. Plus every few years you learn enough that you might want to rethink your file systems/adjust partition sizes, etc. So a fresh install is more just "spring cleaning" than anything else.

- Matt


29 Jul 2008 23:13:57 UTC
Today we had our usual Tuesday outage which was a bit longer than usual as we had extra things to take care of (outside of the usual BOINC database table compression and backup to disk).

I failed to mention yesterday (though many have noticed) that db_dump hasn't been working for days, which means our stats have flatlined all weekend. This was because our mysql replica failed (we run these expensive stats lookups on the replica so they don't affect the more important updates running on the master). So part of the outage today was to rebuild this replica from scratch via the dump from the master. It was easy - we do this regularly anyway - just takes a long time.

Also, Jeff and I replaced a failed drive on thumper (the science database server). There are 48 drives on the thing so disk failures are common, and we get Sun support on this important system. We ask for a drive, they send one, we put it in and ship the old one back. Easy as pie. Unfortunately the software RAID on this system made some bogus complaints upon restart (unrelated to the device that required the new drive). I'm not sure why mdadm gets confused - for example I converted a couple spare drives to a new RAID device, which works fine, but upon reboot (many months later) mdadm freaks out that those spares are missing. Anyway, this was mostly harmless, and another warning we really need a fresh OS install on this system sooner than later (that'll be scary).

We're running full bore now. It'll take a while to catch up, and we may temporarily run out of work again (still not a comfortable amount of free disk space on the workunit storage). But it'll all clear up eventually.

- Matt


28 Jul 2008 21:27:00 UTC
Wow. What a weird weekend. A lot of little minor things went wrong causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out in no particular order and in varying amounts of detail.

Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so does their respective workunit. Plus with Astropulse ramping up those giant workunits were filling up the storage faster than usual. Eric did already put in code for the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix was only deployed in beta so far. The result, there are about 20000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project.

We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We haven't built a new scheduler process in a while which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but caused random hosts to be unable to connect. For no real good reason a lot of hosts reporting problems were Macs which added to the difficulty of diagnosis - we thought it was an architecture dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication and the new "broken" scheduler was turned on again last Friday for about a day.

On Sunday our bandwidth dropped to zero. At this point we threw up our hands and figured we'll figure this out when we're all in the lab together on Monday (today). Remember we do have a policy that it is perfectly okay for our project to be down for a day or two as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues) and we had to kick it a couple times (including a full system reboot due to messed up shared memory segments) to get it going again. Once going, both download servers choked. So I had to kick both of them as well.

Then we ran out of work. Remember how I said we put a fix in the splitter to keep from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if the storage server was over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta so that'll clear up space soon enough and we'll generate fresh work.
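
A minimal sketch of that kind of "check before writing" guard. This is not the splitter's actual code (which isn't shown here). The function names are invented, and only the 90% figure comes from the description above.

```python
import shutil


def over_threshold(total_bytes, used_bytes, max_fraction=0.90):
    """True if the used fraction of the volume exceeds max_fraction."""
    return used_bytes / total_bytes > max_fraction


def ok_to_write(path, max_fraction=0.90):
    """Check the filesystem holding `path` before attempting to write to it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return not over_threshold(usage.total, usage.used, max_fraction)
```

With a threshold like this, writing stops well before the disk is actually full, which is the "extra cautious" behavior that left us low on work.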

And oh yeah.. we were slashdotted again on Sunday.

That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out!

- Matt


24 Jul 2008 21:35:24 UTC
Astropulse release progress has been slowed by various things. Some necessary updates were made to the generic BOINC scheduler which we then employed on Monday. After that we found several weird problems including computers being refused work because their hardware was wrongly deemed inappropriate. At first this seemed like a "Mac only" problem but as far as I could tell some Macs were still able to get work. In any case, we ultimately fell back to the "old" scheduler this morning. This improved things according to some rough, immediate analysis. The complete set of scheduler problems, their causes, and their solutions is still unclear. We'll chip away at that as Dave works his way through a large e-mail backlog.

Yesterday Dave, Jeff, and I had a "work stoppage" and went for a hardcore hike in the Desolation Wilderness (near Lake Tahoe) - something we've been talking about doing for way too long, as we are all avid hikers. We were joined by my wife and Daniel, a visiting BOINC developer from Spain. Since this is technical news, the technical details are thus: We took the Twin Bridges trailhead (at 6200') up to and beyond Horsetail Falls. This included some surprisingly dangerous boulder scrambling which sapped more energy than originally expected. Our plan to bag Ralston Peak (9200') was reduced to basic exploration up to (and ultimately into) Lake of the Woods (over 8000'). The boulder scrambling downward was even worse, but all knees/ankles survived intact. All told, about 7-8 miles of hiking/scrambling, almost 2000 vertical feet gained and lost, taking about 8 hours including lengthy breaks. I felt poorly acclimated, even though I easily conquered a similar hike in Yosemite (up to the top of Nevada Falls and back) six days earlier. Dave was acclimated but started the hike a bit exhausted as he did about 800 feet of rock climbing in upper Yosemite the previous day.

- Matt


22 Jul 2008 21:16:02 UTC
Yesterday afternoon we installed a new scheduler which included some updates necessary for the upcoming Astropulse rollout. However, our network performance took an immediate hit. After about 10 minutes trying to figure out what was causing this Jeff and I realized our scheduler switch perfectly coincided with several expensive credit-analysis queries Eric was running, also in regards to the Astropulse rollout. So it wasn't the scheduler - just the database getting overloaded. That got cleared up quickly.

Last night I noticed people complaining about Mac computers being denied work. This is still an issue, probably with the new scheduler implementation, and we'll address it shortly.

We had the regular weekly outage today during which I tackled some extra things. First off, due to continuing mysql database performance issues we completely dropped the credited_job table (before we just dropped the indexes). Reminder: this is the table that connects user ids in the mysql database to result ids in the science database, so we know who did what. This is also the only table in the mysql database that grows without bounds, and therefore has been the cause of many headaches as of late. Don't worry - we have all this data archived in three formats in three different locations, and will continue to collect this data in flat file format. I also checked the integrity of the database filesystem now that it was cleaner. No problems there. I started up the projects and mysql is currently handling well over 2000 queries/sec without breaking a sweat.

- Matt


21 Jul 2008 18:49:42 UTC
I was out of the lab since last Wednesday hence the dearth of tech news reports. Though not all that much to report. We had a couple of the usual/typical blips that required minor maintenance, most notably the db_purge process (the thing that keeps the result/workunit tables trim by actually deleting database rows from the BOINC database once the scientific data has been inserted into the science database) - this process hung for some unknown reason and the BOINC db grew great in size. A simple restart fixed that.

As for that index corruption in the science database I mentioned last week, that index was rebuilt just fine, but only after we took one drive in the particular RAID holding these indexes off line - smartd was reporting a lot of errors so we think that drive was the culprit of the corruption. We'll try to replace it sooner or later (the system is now down to only 47 out of 48 500GB drives).

I haven't fully caught up yet from being gone but I imagine there will be some AstroPulse ramping up to report sooner or later. I see scheduler updates have been made (and I think put into beta). I'll meet with Jeff/Eric later and discuss.

Looks like there will be a campus network outage that affects us this upcoming Wednesday morning - it will last about a half hour, starting at 6:30am (Pacific Time). A couple router upgrades from what I can tell.

- Matt


15 Jul 2008 22:42:09 UTC
Had the typical weekly outage today - the results of which were much happier than last week. We were also hoping to fsck the mysql data drive that gave us grief last week to make sure it's okay, but the outage was taking too long so we'll do that later. We did fire off our weekly science database backup which quickly failed due to finding a corrupt page or two. This happens from time to time - and turns out this particular corruption is within an index that we can easily drop and recreate if the usual data-cleanup utility doesn't work. Also science database replication broke at some recent point, probably because the primary database catching up on backlogged inserts caused some kind of handshake timeout. No big deal - replication is catching up now.

The campus network graphs are all out, which is how we confirm what our current bandwidth usage is. I hope this will get fixed soon. I feel like a doctor without a stethoscope.

- Matt


14 Jul 2008 23:07:47 UTC
So the second half of last week was spent trying to figure out why our database server was so painfully slow. Bob, Jeff, Eric, and I were scratching our heads, trying this and that to diagnose and fix this mysterious problem. Everything was fine before the Tuesday outage, nothing changed during the outage, but upon restarting the project we couldn't handle very much load.

We were quick to blame mysql, as it has had random episodes in the past of secretive bookkeeping causing us grief. We ruled this out. We started blaming the "credited job" table which is growing infinitely. This is the table keeping track of which user did which workunit. We do nothing but insert into this table (no random access selects), so why would that be a problem? Nevertheless we turned off inserts (back to writing similar info to flat files for later parsing) to no avail.

Maybe it was hardware? Did a disk fail? Is a disk about to fail? We ruled all that out as well, which brought the focus back on mysql with dozens of server tuneables that we tweaked for various reasons over the years. Did we go too far with some of those variables? We convinced ourselves that wasn't it.

Of course in hindsight the ultimate solution seems obvious: the filesystem where all the data is kept. Just because the hardware seems okay, and I/O rates are normal, doesn't mean the filesystem is happy. And the focus was back on "credited job" as this table is constantly growing and therefore a big ol' file - much bigger than anything else. A file that is constantly growing during all other inserts and updates that happen as the project is running will likely become interleaved and fragmented to the nth degree. Without fearing data loss we dropped the credited job indexes and that alone broke the dam. Well, jeez.

We're still catching up from the backlog, but mysql is performing incredibly well at this point. This is good, as we're hoping to release Astropulse before the end of the week. More on that later.

Happy Bastille Day, by the way.

- Matt


8 Jul 2008 23:19:59 UTC
Weekly outage day (to compress/backup BOINC database). It lasted a little longer than usual due to some confusion - unbeknownst to me a recent web code update was made that broke the "stop_web" mechanism which keeps the database quiescent during the outage. It's also taking a long time to recover. Not sure why but we'll see if the clog pushes through. I took advantage of the outage to move server anakin into the closet. We also upgraded the RAID card BIOS to see if that fixes our minor issues with ptolemy's current hardware RAID setup. Well, its logical volume initialization is still way too slow, but maybe we'll live with that if all future resyncs are fast.

Just wrapped up the scoring meeting I mentioned yesterday. The bottom line being our current scoring algorithms for individual signals (spikes, gaussians, pulses, triplets) are sound, the multiplet scores (interesting groups of signals of a single type) are 99.9% sound, and metacandidate scores (of single sky pixels containing "candidates" like individual signals, multiplets, or stuff observed from previous SETI projects, as well as interesting celestial objects) are still way up for debate as this is where individual philosophies differ, but we'll probably just go with the easiest solution (multiply all the candidate probabilities together) and see what that list looks like. Jeff will write all this up. Maybe we'll even have a science newsletter.
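
The "easiest solution" mentioned above, multiplying the candidate probabilities together, can be sketched as follows. This is only an illustration: the function name and sample values are made up, and treating each candidate as contributing one probability between 0 and 1 is an assumption rather than a description of the nitpicker's code.

```python
import math


def metacandidate_score(probabilities):
    """Combine per-candidate probabilities for one sky pixel by simple multiplication."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p!r}")
    return math.prod(probabilities)


# Hypothetical pixel with three candidates.
score = metacandidate_score([0.5, 0.2, 0.1])
```

Whether a smaller or larger product counts as "more interesting" depends on what the individual probabilities mean, which is exactly the part still up for debate.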

Jeez... still having a hard time recovering...

- Matt


7 Jul 2008 22:23:11 UTC
Rather dull holiday weekend except for the fact I was up in Oregon and remotely dealing with several server issues hidden from the public - nothing really newsworthy. Various previously mentioned projects are continuing along: I'm installing an OS on ptolemy in the hopes we can flash upgrade the current RAID cards' software and see if that helps, otherwise we're buying new cards that we *know* work. I might do a bit of physical server shuffling during the weekly outage tomorrow - get some of the newer stuff into the closet - maybe.

Looks like the big "scoring meeting" is also tomorrow where we will try to settle on our candidate scoring algorithms. Basically we need to pool together our scoring techniques from previous reobservation runs and apply it to the nitpicker which, unlike all prior data analysis, runs and updates in real time as signals flow in. It was easier before, at least in the candidate analysis I've done. You'd turn the crank, look at the results, adjust some variables and turn the crank again. Not so easy to be as casual and change algorithms when the crank is turning 24/7 and a million signals are added every day.

Oh yeah - back to the "ALFA running" problem on the science status page. Turns out we need to recompile our program that peeks at the observatory status broadcasts for our own status pages. This hasn't been recompiled in ages, and much has changed in the meantime. An added complication is that this is running on a Solaris machine down in Puerto Rico, making recompiling old, stale code a challenge. Jeff is tackling that.

- Matt


3 Jul 2008 21:11:53 UTC
Crazy day getting ready for the long July 4th weekend. There was more testing on ptolemy with more depressing results (why isn't it picking up the hot spare when I pulled a drive out from an active array?!). I actually yanked the whole server out of the closet (which required me temporarily shutting down one of the download servers which was physically in the way - but nobody seemed to notice much). We opened it up and found the RAID is indeed on cards and not the motherboard, which is good as this means if we can't get this to ultimately work we can get some 3ware cards (or some such) instead.

Meanwhile, with ptolemy pretty much gone we've been having mounting problems with servers still requesting its disks. No matter how hard you try there's always some dependencies that hide until too late. So it's been a morning full of killing automounter processes, cleaning up stale mounts, deleting bogus trigger files, restarting services, etc. This was mostly hidden from the public - except for several status pages being out of whack. Actually the assimilators all froze but this was hidden behind the stale server status page. Now the queue is pretty large, but it should drain out just fine.

Eric and Jeff are still getting to the bottom of the database/esql interface woes, doing some extreme programming over by Jeff's desk. Converting lists with cryptic, undocumented size limits to blobs. One of the last major hurdles for the first rev of the nitpicker. Then it's doing all the scoring algorithms, which we'll discuss next week.

- Matt


2 Jul 2008 22:29:10 UTC
Working on ptolemy's conversion into a NAS box today, with the focus on putting bigger drives in it and testing out its onboard RAID controllers. We're finding the hardware RAID to be a bit outdated and not exactly everything we want. For example, it has a 2TB logical drive size limit, and we can't create logical drives using more than half the physical drives (they are split over two separate controllers). I guess we can deal.

Some web/user interfaces got broken over the past 24 hours. First, the credit certificates. Incomplete updates were made which were confusing. Dave cleaned that up. Second, the "special user" tags got reset by accident - this also got cleaned up but in the process we temporarily gave some users extra powers (the mysql table dumps were comma delimited so forum signatures containing commas offset the values, blah blah blah).
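
To show how a comma inside a field shifts everything after it in a comma delimited dump, here is a small sketch. The three-column row (user id, forum signature, special-user tag) is invented for illustration and is not the real table layout. The fix shown uses Python's standard csv module, which quotes fields that contain the delimiter.

```python
import csv
import io

# Hypothetical row: user id, forum signature, special-user tag.
row = ["1042", "Greetings, Earthlings", "moderator"]

# Naive dump and reload: the comma in the signature splits it in two,
# so the tag lands one column to the right of where it belongs.
naive_fields = ",".join(row).split(",")

# Quoted dump and reload: the csv module wraps the signature in quotes,
# so the row comes back with its three original fields.
buf = io.StringIO()
csv.writer(buf).writerow(row)
quoted_fields = next(csv.reader(io.StringIO(buf.getvalue())))

print(len(naive_fields), len(quoted_fields))  # 4 3
```

In the naive case whatever lands in the "tag" column is a fragment of the signature, which is how values can end up attached to the wrong field.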

Regarding the "ALFA running" bit on the science status page - I think I fixed this, but we haven't collected ALFA data since, and won't for a while, so I don't have truly positive confirmation yet. Not a big crisis either way, though I hope we get more ALFA time soon.

- Matt


1 Jul 2008 22:09:19 UTC
Today's Tuesday, which means we went through the usual database cleanup/backup outage. That went smoothly. As I may have already noted before, the replica mysql server has been regularly failing when actually writing the dump to disk. Our suspicion was that this server was having difficulty reaching the NAS via NFS - and mysql has been ultra-sensitive to any NFS issues. The master server doesn't have this problem, but maybe that's because it's attached to the NAS via a single switch (as opposed to the replica, which is going through at least three switches). Anyway.. we dumped the replica database locally and it worked fine. Our theory was strengthened, though not 100% confirmed.

While the project was down we plucked out an old (and pretty much unused) serial console server from the closet. That saves us an IP address (we get charged per IP address per month as part of university overhead - which is another reason I try to keep our server pool lean and trim). I also cleaned up our current Hurricane Electric network IP address inventory, and found and cleaned up some old, dead entries in the DNS maps. Not sure if this is what has been causing lingering scheduler-connection problems. We shall see.

As noted in the previous tech news thread, the science status page has been continually showing Alfa (the receiver from which we currently collect data) as "not running" for a while now. This was lost in the noise as Alfa actually hasn't been running much recently, but it still should have been shown as "running" every so often as data trickles in here and there. Looking back at the logs there has been a problem for some time now. We get the telescope specific data (pointing information, what receivers are on, etc.) every few seconds as they are broadcast to all the projects around the observatory. Perhaps the timing/format of these broadcasts has changed? In any case, I'm finding our script that reads these broadcasts is occasionally missing information, so I made it more insistent. We'll see if that helps.

- Matt


30 Jun 2008 21:58:57 UTC
A rather static weekend which is always welcome. This morning found that, despite DNS changes made several days ago many clients are still connecting to the old scheduling server. I find this particularly frustrating as there is no legitimate reason for anything to be caching bogus domain information for more than 5 days, especially if said domain had a 5 minute time to live. We need to get to work on this server, so I opened up a currently unused port on one of our non-public servers and gave it the old scheduler IP address to forward along to the new address, thereby acting as a "detour" so we can get to work. Hopefully over time clients will get wind of the correct IP address so we can turn off this detour as well.

Eric's back in town. Overheard him and Jeff talking a bit about current nitpicker/database programming woes. Seems like an effective new strategy is being enacted. Other than that, no real news to report and nothing but chores and meetings all day today for me, pretty much.

- Matt


26 Jun 2008 21:07:44 UTC
The new scheduler continues to handle its new duties just fine. Slowly but surely people are moving their connections over to this new server, but I'm not convinced the change rate is fast enough to do a wholesale cutover by next week. We shall see.

Funny aside: while getting new-ish donated server "clarke" up yesterday I was annoyed to find that Fedora Core 9 was booting to run level 5 (where it loads the X windowing environment). We don't need X on these servers, so we typically set our servers to boot to run level 3 via a change in /etc/inittab. In doing so, I'd comment out the old line with a "#" and enter in a new line with the adjusted run level. It was still booting up in X. Why? Turns out the latest inittab parser (new with FC9, I guess) ignores "#" comments in inittab, and just looks for lines containing the string "initdefault" and parses the first one it finds. Since I left the old line in there commented out (or so I thought) it was superseding the line I wanted. So much for standards (and clear documentation stating when/how standards change).
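To make the inittab gotcha concrete, here is a toy model of the two parsing behaviours. This is only a sketch of what I observed, not the actual FC9 code; the function names and the sample file contents are made up for illustration.

```python
# Hypothetical inittab contents mirroring the situation described above:
# the old line is "commented out" and the new line follows it.
INITTAB = """\
#id:5:initdefault:
id:3:initdefault:
"""

def naive_runlevel(text):
    """Use the first line containing 'initdefault', ignoring '#' markers."""
    for line in text.splitlines():
        if "initdefault" in line:
            return int(line.split(":")[1])
    return None

def comment_aware_runlevel(text):
    """Same search, but skip blank lines and lines starting with '#'."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if "initdefault" in stripped:
            return int(stripped.split(":")[1])
    return None

print(naive_runlevel(INITTAB))          # picks up the commented-out line
print(comment_aware_runlevel(INITTAB))  # picks up the line I actually wanted
```

The practical workaround is simply to delete the old line rather than comment it out.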

Nitpicker weirdness: While finally getting around to testing the few optimizations I made to Jeff's code I found that multiple runs of the nitpicker on the same pixel were producing slightly different results each time. We believe this is due to the order which the database pulls out rows - unless requested otherwise databases generally pull things out in random order, i.e. the order which requires the least I/O at that exact point in time (mostly due to page caching or where the many drive arms are currently located in our RAID set). Sorting query output adds significant (and usually unnecessary) overhead. But there are a lot of "fuzzy compares" in the nitpicker (due to floating point computations on different chips you can't expect decimal values to be "exactly exact"). When two items are close enough to be called "duplicates" you only need one, but which one you pick may cause different results down the road. So Jeff is elbow deep in this problem right now.
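A small sketch of why row order matters with fuzzy compares. This is not the nitpicker's actual code; the tolerance, the values, and the "keep the first one seen" rule are assumptions chosen only to show the effect.

```python
def fuzzy_dedupe(values, tol=0.01):
    """Keep a value only if it is more than tol away from every kept value."""
    kept = []
    for v in values:
        if all(abs(v - k) > tol for k in kept):
            kept.append(v)
    return kept

# The same three rows, returned by the database in two different orders.
order_a = [1.000, 1.008, 1.016]
order_b = [1.008, 1.000, 1.016]

print(fuzzy_dedupe(order_a))  # 1.008 is absorbed by 1.000; 1.016 survives
print(fuzzy_dedupe(order_b))  # 1.008 comes first and absorbs both neighbours

# Sorting first makes the result repeatable, at the cost of the sort.
print(fuzzy_dedupe(sorted(order_b)) == fuzzy_dedupe(sorted(order_a)))
```

Sorting is the blunt fix. Whether it is cheap enough for our query volumes is exactly the open question.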

Apropos of nothing, the entire northern half of state of California is on fire. The smoke ending up here in the Bay Area is intense. I feel like I'm smoking a couple packs a day just walking around outside. I can smell it sitting here at my desk.

- Matt


25 Jun 2008 22:23:54 UTC
This morning we turned off the scheduling server on ptolemy and started it up on anakin. This basically worked right out of the box. Pretty quickly we determined the lower traffic rates were due to DNS rollout. Despite having the TTL (time to live) on the download name (boinc2.ssl.berkeley.edu) set to 5 minutes, it sometimes takes weeks to fully convince the world the change has been made. This is due to various types of DNS caching I still don't fully understand (why don't they all obey the TTL?). Stopping/restarting the BOINC client sometimes resolves this.

However, after an hour or so I decided to play nice and turn ptolemy back on, set in a way using apache to forward all lagging scheduling requests over to anakin with a "permanently moved" warning. I guess I should have done this from the get-go, but better late than never. Immediately this seemed to help, but only the uploads. Download traffic still remained under some rather low ceiling.

So I checked the two redundant download servers (bane and vader). Turns out bane wasn't serving any download requests. Was it even getting any? That part is a total mystery - nothing changed in any configurations pertaining to these servers. I double checked the DNS updates. No smoking guns there, either. Well, bane had weird dns/mounting/apache problems before that a quick reboot cleared up, so after rebooting it seemed to be "better" but not by much. Instead of 0 requests per second before reboot, it started serving 2 or 3 - vader is serving around 10. What's the deal, then? Perhaps this has to do with our "pound" load balancing utility recognizing bane was having trouble (strangely coincident but unrelated to the anakin switch) and has been favoring vader until bane got better. I filed this under "unrelated and currently harmless problem."

Anyway.. I then noticed (in between doing other tasks, hence the lag) the upload traffic was increasing way beyond expectations. I assumed everything was okay as all the apache logs were reporting no errors, but indeed the requests forwarded from ptolemy to anakin were failing. Why? Because the http headers were missing variables, including the all-important "Content-Length." Why?!! This I have no idea, but apparently, thanks to apache (and/or the boinc client), redirected traffic ends up with different and less informative http headers. And so the schedulers on anakin were saying, "I don't know what you want - try again in 10 seconds." This got worse and worse as more clients wrapped up their current workunits and tried to connect.

The solution to all that was to *not* do apache redirects (both 301 and 302 redirects had the same effect) but to use good ol' pound to simply shovel ptolemy's packets towards anakin. This helped all our DNS-lagging clients to finally connect again, but won't help to inform them that the scheduling server has indeed changed. Hopefully the clients will learn on their own in the coming days. We plan to turn off ptolemy outright early next week.

Nitpicker progress has been slowed by database programming issues. Informix has undocumented limits on user-defined lists in certain contexts. We may have to work around all that using something other than lists. Jeff's been banging on this and other similar programming hurdles for a while, hence the lack of recent info. Plus we have yet to sit down and discuss candidate scoring algorithms which will only happen if we can manage to get the four parties involved (Dan, Eric, Jeff, and me) in the same room at the same time without greater problems hanging over our heads. This hasn't happened in, well, months. At least glacial speeds are non-zero speeds.

- Matt


24 Jun 2008 21:50:01 UTC
Had the usual outage today. No news there, and we're recovering normally at the moment.

Continuing along the hardware vs. software RAID theme, we have vast experience getting bitten by both - in the early days of SETI@home we got burned by hardware RAID, hence our current general affinity towards software. However, today Jeff and I got over the (very small) hump of learning how to query the recently donated IBM Xseries on-board RAID from within linux and decided that we're going to learn to enjoy living with a zillion different kinds of RAID, each employed based on current needs and resources.

Tomorrow we're going to attempt converting our scheduler to the new-used system "anakin" so we can then convert the current scheduler (ptolemy) into a NAS box (to ultimately replace the NAS taking up one third of our server closet). Expect funky DNS rollout issues.

- Matt


23 Jun 2008 22:22:22 UTC
Another weekend without much ado. Our assimilator queue is low but not exactly pegged at zero. What's causing it to not run as fast as all the other backend processes? Not entirely sure, but we know of several things that happen from time to time which may be the problem (i.e. cause extra load on the science database), or at least aggravate the problem. But for now, it's not even close to a tragedy, so we're just keeping our eye on it.

I guess we did have a disk failure on thumper (the master science database server), or at least a disk complaint. It didn't cause any downtime or data loss, but it's getting us to reconsider our current stance on software vs. hardware RAID. We've been sticking with software RAID due to ease of use and quickness of warning, but we're finding it sometimes doesn't behave the exact way we expect, or sometimes not the best way. So this event inspired some additional R&D on that front.

I just rebooted the main web server, so that was offline for a couple minutes. No big deal - just some mounting issues that needed to be cleared out.

- Matt


19 Jun 2008 19:41:22 UTC
We're still maintaining an assimilator queue, but it is indeed draining over time. Besides the nitpicker CPU consumption issues addressed yesterday, we're also doing several data transfers down to HPSS (our off-site storage) including a large science database backup, as well as several raw data files (we keep copies of all raw data down there). All these things - the backups, the raw data storage, the nitpicker, and the assimilation of new results - run on thumper (because that's where all the data are). So there's basic I/O contention at the moment.

Other than that I have nothing to report - I've been mostly occupied by bureaucratic/policy tasks for the past while. I was also annoyed to find somebody threw away my plastic fork, which I admit has been sitting used and unwashed on my desk for days, but nevertheless I came to work expecting to eat my lunch with it. The lab kitchen is oddly devoid of utensils. I did find a pile of aged wooden coffee stirrers, out of which I fashioned a pair of makeshift chopsticks.

There's a halo around the sun at the moment. Cool.

- Matt


18 Jun 2008 23:16:03 UTC
The assimilator queue grew again. The main culprit this time was the NTPCkr - from here on out I'll simply refer to it as the nitpicker - as a reminder this is the program that is pretty much the culmination of all our SETI@home data collection and analysis, i.e. it's the thing that'll find the aliens if they exist. All other analyses so far using SETI@home data were cursory by comparison.

Anyway.. we're finding every so often that we have "deep" pixels containing tens of thousands of multiplets, each containing thousands of signals. When my "science status page updater" hits one of these it hangs on for quite a long time, causing a heavy CPU load on the database server as it tries to wade through this flood of signals gathering statistics. My optimizations (mentioned earlier in the week) helped, but not enough. We may devise/implement more. In any case, the heavy nitpicker load made the assimilators slow down. We killed those particular processes and I think we're catching up again. Slowly.

So the donation processing suite had been choked for a couple weeks and nobody noticed. This was caused by a suddenly (and silently) more stringent firewall, and masked by several things. We've been getting the donations, just no confirmations. So there's quite a few missing green stars I imagine. Not exactly sure what to do about that just yet.

- Matt


17 Jun 2008 20:44:23 UTC
Ho hum weekend, which is good. The air conditioning people came up yesterday (Monday) and today to do follow-up inspection of our server closet system (which failed last week) and found a couple more leaks which have been repaired. We seem to really be pushing it beyond its limits. Had the usual database outage today. No big whoop there.

Somebody noted earlier that their results were getting validated surprisingly quickly. We didn't change anything. This may have been due to a longer-than-usual period this past weekend of fast workunits - the average turnaround time was roughly 10 hours (about 20%) shorter than normal, meaning pairs were getting matched up that much faster.

A lot of what's been going on the past couple of days has been post-vacation catchup (half the staff was out of town). While I have a zillion other things to do I discovered a couple ways to optimize the NTPCkr so I coded that up and I'm testing it now. Every little speedup on this front helps. Jeff's still working on the scoring part. We're getting there...

- Matt


11 Jun 2008 21:25:25 UTC
Some general BOINC code got updated on our servers this morning, which broke a couple things (some pages went blank, and the php "magic quotes" got messed up causing all kinds of backslashes to appear everywhere). I whined to Dave and he fixed it, which is usually how these particular problems sort themselves out. The problem with the web code is that it is being completely or partially used by all kinds of BOINC projects, so a "fix" for one project may end up unexpectedly being a "bug" for another, which is why this kind of thing happens from time to time. We try to keep SETI@home as up to date with the BOINC source tree as possible, even if that means we're on the "bleeding edge." Of course this is all web code, so problems like these are cosmetic and relatively minor in the grand scheme of things. We do more thorough alpha/beta testing of the important back-end functions - you know, the ones that update millions of database records every day.

Other than that today has seen more OS installs/RAID manipulations on various donated servers that have been anxiously waiting their call to duty (I got beyond the issues I was having yesterday). Slowly but surely we'll get these up and running. I also got a bunch of data drives from Arecibo - it's been a while since we got a batch of fresh data up here, so I'm now lost in data pipeline management mode.

- Matt


10 Jun 2008 22:20:19 UTC
Normal Tuesday outage. Didn't really do anything special this time around. I did mess around with server "anakin" a bit (the presumptive replacement scheduling server) - for starters it keeps booting up in X (though the inittab says not to) and one of its drives got marked as "defunct" (the hardware RAID is rather confusing - I can't figure out how to "unfail" the drive). Both really minor issues. At least there was zero fallout from the air conditioner failure yesterday. Other than that I'm mostly working on mundane sys admin chores and catching up on some back-end diagnostic/analysis stuff.

- Matt


9 Jun 2008 20:52:35 UTC
Over the weekend the scheduler ceased operations on its own again. I was able to remotely fix this Saturday morning and recovery was swift. This was the same problem as earlier in the week but this time we had a smoking gun: the CGI output log file was maxed out at 2GB in size (this is running on a 32 bit system). Cleaning out the logs solved the problem. The thing is: We've been letting these logs grow to 2GB in size for months without any issue. So why is this a problem all of a sudden? However strange, I put a log rotation script in place to prevent this from happening again any time soon. Funny side note: I would have gotten the alerts faster but coincidentally the lab-wide mail servers conked out as well Saturday morning. Other than that, nothing much to report the past couple of days.
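For the curious, size-based rotation can be sketched in a few lines. This is not the script actually running on the scheduler, just a minimal illustration; the function name, the half-of-2GB threshold, and the number of generations kept are all assumptions.

```python
import os

# Largest size a signed 32-bit file offset can represent (the "2GB" wall).
MAX_32BIT_FILE = 2**31 - 1

def rotate_if_large(path, limit=MAX_32BIT_FILE // 2, keep=3):
    """Rotate path -> path.1 -> path.2 ... once it exceeds limit bytes.

    Returns True if a rotation happened, False otherwise.
    """
    if not os.path.exists(path) or os.path.getsize(path) <= limit:
        return False
    # Shift older generations up by one; the oldest is overwritten.
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")
    os.replace(path, f"{path}.1")
    open(path, "a").close()  # recreate an empty log file
    return True
```

One caveat with any rename-based scheme: a process that already has the log open keeps writing to the renamed file until it reopens its log, so the server usually has to be told to do that after rotating.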

Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers warmed up over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down.

- Matt


5 Jun 2008 21:24:59 UTC
Another mild day in server land. Lots of minor apache issues. There was an annoying web scrape yesterday afternoon that gummed up the works for a moment. This morning I found a bug in the web log rotation script that prevented our public web server from restarting - so it's been running for weeks non-stop during which the httpd processes bloated in size (apparently there are small/tolerable memory leaks in php/apache/boinc code somewhere). Then later our scheduling server was suddenly unable to run the scheduler cgi. We were dropping connections so I got alerts right away about this. I had to stop/restart apache twice, though, to get it working again. Not sure why the first restart didn't take.

Jeff's adding more star catalog data to our database. Bob worked on another alert script to better check our current database storage allocations (and prevent another minor mishap like earlier this week). Eric and I swapped drives between his hydrogen server "ewen" and ptolemy (for when the latter becomes a storage server) - ewen freaked out a little bit unexpectedly - we umounted the filesystems before pulling the drives, but an xfs daemon woke up and thought that particular partition should still be around, etc. No big deal - just a lot of alert e-mails that were scary at first.

- Matt


4 Jun 2008 20:06:25 UTC
Things are continuing to clear up nicely since the science database kerfuffle earlier this week. The assimilator queue is still large, but now that everything is more or less "caught up" it's draining at a pretty good clip.

Nobody probably noticed but for a while there this morning (actually still as I type this sentence) we had two scheduling servers - ptolemy and anakin. I finally got anakin up and configured and made it a secondary scheduler to test it out. Once we're ready to convert ptolemy into something else, we now have another scheduling server in our back pocket.

- Matt


3 Jun 2008 21:46:01 UTC
Good news. The science database problems were far less severe than we thought. Short story: we ran out of space. Long story: due to a slightly confusing configuration we thought we ran out of extents for reasons unclear. Informix categorizes all usable storage space into dbspaces, fragments, chunks, extents... maybe more things I'm not sure. We've had problems in the past where we ran out of extents long before running out of actual disk space and we thought this is what happened again. The solution for such is painful - basically like rebuilding a RAID system (unload everything, recreate, and reload). Luckily we discovered we had some fragments/chunks misaligned (some fragments had more chunks than others) so all we had to do was add more chunks, and we had plenty of disk space for that. We added enough to get by for now, and will do more when we catch up from the queue draining/filling.

We had our usual outage today (for BOINC database backup/compression, etc.). Between the usual recovery for that and the recovery for all the above it may be a bumpy ride for the next 24 hours or so.

Yesterday afternoon server "bane" (one of the two download servers) was having mounting issues which required a reboot to clean up. I was home at the time and rebooted it remotely. Of course, like my desktop last week, a new kernel was yum'ed in during the recent past and messed up grub for some reason, so it wouldn't load the OS. I had to get Jeff, who was still at the lab, to deal with booting from the emergency DVD and boot from an older kernel. While bane was down half the downloads connections were failing, but usually retries were successful as we have the two redundant servers.

Today I got server anakin more officially racked up (actually just sitting in a rack directly on top of a UPS) to ultimately become the new scheduler. It's a recently donated Dual Xeon (used) that is actually less powerful than our current scheduler, ptolemy, but should be able to handle the job just fine. We plan on making ptolemy, with its 16 mostly unused drive bays, a network storage server to replace our ageing Network Appliance server, which fell out of service long ago and its many drives are dying with regularity - infrequent but still worrisome.

- Matt


2 Jun 2008 18:58:32 UTC
Early Sunday morning I discovered the assimilators were all failing. Immediate analysis uncovered zero smoking guns. All the assimilators were choking on the same subset of results, and all while inserting pulses. Plus the actual processes were seg-faulting before they could produce any useful error codes. Checking the failing result files and database entries showed nothing obvious (all different sizes, submitted at different times, created by different clients, etc.). I did all I could do. I told the other guys (Bob, Jeff, Eric) - Bob's checking the database now for any subtle weird behaviour (once again I found no obvious problems yesterday) and Jeff's recompiling the assimilator code (perhaps a version that outputs useful error information). In the meantime, the assimilation queue grows, and our disk usage grows with it (as we haven't deleted anything in over a day) - sooner than later I'll have to stop the splitters to prevent storage disasters. I'll update this thread if we figure out what's up on that front.

The only other real gripe right now is that our data recorder system at Arecibo is only seeing one of two data drives. Not a tragedy - we can still record data but this will put additional strain on the operators down there until we figure out why.

- Matt


29 May 2008 22:40:14 UTC
I spent the entire day so far (and will certainly continue after writing this missive) doing nothing anybody will ever care about - mostly revolving around php programming for the upcoming letter drive (more on that later). My desktop was getting funky X errors so I decided it was due for a reboot, and then it wouldn't come up again. This new Fedora Core 9 distro apparently yum'ed in something which broke the boot loader. An hour or two spent trying to suss that out and ultimately reinstalling the OS and I'm back in business.

We did have a software meeting earlier - we're getting back on track with various stagnant analysis/database projects. Also discussed the Google Sky map stuff - they get their images from many different sources, so it's still unclear what epoch the coordinates are in. No simple official statements like, "Google Sky coordinates are entirely in J2000." So we're going to have this cosmetic issue where the image data on the science status page may not exactly line up with our reality (which is J2000). In any case, this is hardly a scientific issue as it doesn't affect our analysis - just what's in that neat little Google window.

- Matt


28 May 2008 20:04:41 UTC
People noticed there were short network "hiccups" during the course of the evening, ending this morning. All of it was quite mysterious - no database problems, no workunit storage server problems, and at first no obvious download server problems. Upon further examination I found the DNS configuration was "lopsided" towards one of the two download servers. We have load balancing software on both machines so they were sending equal numbers of workunits, but all initial requests hit only one of the two. This hasn't been a problem before, but apparently this week's outage caused enough strain on apache such that every few hours the load got fairly high and log rotation would take abnormally long (several minutes) and nothing could get through during that time. We are also at our highest active user level in over a year (about 10% higher than a couple months ago), so maybe that added to the apache/server stress level, and what we were seeing were outage "aftershocks." In any case, I fixed the DNS so perhaps this won't be so drastic next week (and hopefully for many weeks to come).

Work on the NTPCkr continues - Jeff uploaded the Hipparcos Catalog to the database, so I added a star count on the science status page for the pixel we are currently observing. Of course, the more stars in a pixel the higher the score. However, there are only about 100,000 catalogued stars and 15,000,000 pixels. So odds are pretty high we are observing zero (known) stars at any given moment.
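To put a rough number on "odds are pretty high", here is a back-of-the-envelope sketch. It assumes the catalogued stars are spread uniformly over the pixels, which is not really true (stars bunch up toward the galactic plane), so treat it as an estimate only; the counts are the approximate ones quoted above.

```python
import math

catalogued_stars = 100_000   # approximate figure quoted above
pixels = 15_000_000          # approximate figure quoted above

# Average number of catalogued stars per pixel.
mean_stars_per_pixel = catalogued_stars / pixels

# Assumption: uniform spread, so the per-pixel count is roughly Poisson
# and the chance of seeing no stars at all is exp(-mean).
p_zero = math.exp(-mean_stars_per_pixel)

print(f"mean stars per pixel: {mean_stars_per_pixel:.4f}")
print(f"chance of zero catalogued stars in a pixel: {p_zero:.1%}")
```

Under that assumption the average is about one catalogued star per 150 pixels, so a pixel with a nonzero star count is the exception.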

Oh yeah the idle splitter processes - a couple were shirking their duties. I told them to stop slacking off and get back to work. Not that we needed them but it looks bad to have 'em sitting around doing nothing (in reality they were stuck on some stale trigger files).

- Matt


27 May 2008 21:23:45 UTC
Long holiday weekend (Memorial Day). On the actual day off (yesterday) the BOINC web/download server was misbehaving. In theory I should have been able to connect to the KVM from home but that wasn't working properly (couldn't access via the web due to incompatibilities with newer JRE versions, couldn't access via the standalone client since I ain't got no Windows machines and the client only works on Windows, etc.) so I had to drive up to the lab to kick it in person. No big deal - just a runaway job that clobbered the process queue. Had the usual database backup outage today. Not much news to report.

To answer RHWhelan from my last thread: > ...it seems that most of the data we analyze gets dumped soon after we report. Not sure what you mean by dumped but nothing important is getting thrown out. Your SETI@home client reduces about 350K of raw data into a few signals which get plopped into a result file and uploaded to our server. Once these signals are verified and put into our master database the result file (and its sister row in the database) are deleted to make way for more. The signals themselves never get deleted.

> It also appears that the real staff spends more time transferring, storing and manipulating data and hardware than actually analyzing the results. I don't mean to be critical, I am actually very devoted to the philosophy of SETI but I must admit it seems a bit futile. It appears that way because it's completely true. And there's nothing wrong with that. To be clear, the "real staff" running the entire show is me, Jeff, Eric, and Bob - all working part time (combined we're about 3 full time employees). Anyway... I understand the feelings of frustration due to perceived futility - science takes time, underfunded/understaffed science takes even more. We're only just now turning the corner on the analysis. Until final results start appearing, we're still productively collecting/reducing data - not as interesting, but still quite useful. I don't expect everybody to maintain interest until we have some real data products, and then I expect interest to jump.

> Are there ever any "HITS" or even slightly suspicious data streams? There are hits and then there are HITS. We haven't really looked for the HITS yet as we've been unable to until very recently (that part is working now in beta). There are no data "streams" as data don't come to us in streams - the earth rotates so signals that persist over time that are actually originating from outer space will only last a few seconds as our beam passes over it.

When I first started working on SETI in 1997 the group here (just Dan and Jeff at the time) were wrapping up final analysis on SERENDIP III. Didn't find anything really interesting. Then we started collecting data for SERENDIP IV. We were starting to dig into the final analysis of that data set (about 60GB) when SETI@home came into being and derailed that, though Jeff and I have been plotting to wrap that up sometime soon (once we get the SETI@home final analysis rolling). SERENDIP IV is actually interesting, even with 11 year old data - the analysis is hardly as deep as SETI@home, but much wider: the frequency range is about 35 times bigger than SETI@home. We are also doing Optical SETI, and pulsar searching... The point is SETI@home isn't all we do, nor is our lab here at Berkeley the only SETI lab on the planet. Nevertheless we do have the biggest, bestest search going by far.

- Matt


22 May 2008 22:35:37 UTC
More database poking/prodding today. Tweaking different mysql variables (and even adding "noatime" and "nodiratime" to the mount options of the data partitions) didn't really help all that much in regards to the transaction committing stuff I was whining about yesterday. So be it. Bob and I also found this morning that our science database indexes were in need of rebuilding as well. Every few weeks we need to run an "update statistics" query to keep those indexes in line.

Slowly working my way through the OS upgrade queue. We're getting FC9 installed on one of three recently donated servers (dual 2.80GHz Xeon / 4 GB RAM) so we can finally start getting these (and another equally powerful P4 server with more RAM, also recently donated) thrown into the fold. The use of these is still up for debate, though they all will be perfectly good general backup/redundant/compute servers. We are definitely missing some redundancy on the backend. I mean, we do have server "maul" sitting around which is quite powerful but being a test model donated by Intel it has an engineering motherboard with keyboard/mouse issues, so we don't want to trust it with anything that needs to have 24/7 uptime - instead it's up and running as a test/compute server, i.e. if it goes off line for any period of time we won't be sad.

Anything else? Just some work on more internal data plots for data integrity checking, and the final bits and pieces of that proposal which is due tomorrow.

- Matt


21 May 2008 22:16:59 UTC
The BOINC mysql replica wrapped up its resync. This morning Bob did some testing to see if we can improve our failure/recovery situation. MySQL allows different levels of log commitments to disk: commit only when the buffer is full, commit at least once a second, or commit on every transaction. We've been sticking with the middle option, as that affords us the most protection without heavy disk I/O - the worst case is that we lose one second's worth of data. However, we've proven a couple times now that we do many updates per second (i.e. hundreds) and that's enough to bring the master/replica majorly out of sync if one crashes before being able to commit. So today we tried the last option and expected an increase of disk I/O and sure enough this commit level brought the database to its knees almost instantaneously. We tried this first on the replica and thought it was its software RAID or low number of spindles causing the headache, but applying this to the heftier master had the same effect. So it's back to the drawing board on that front: we don't have the server capacity to commit on every transaction. Maybe there are other screws we can tighten to make this possible. Bob's looking into that. More tests to come, or we'll just put this on the back burner.
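The trade-off between the last two options can be written down as a toy model. This is a sketch, not a measurement: the policy names and the rate of 300 updates per second are illustrative assumptions ("hundreds" is all we really know), and it ignores any batching the database or the disks may do.

```python
def flush_tradeoff(updates_per_second, policy):
    """Return (disk flushes per second, worst-case updates lost in a crash)."""
    if policy == "every_transaction":
        # One flush per update; nothing committed is ever lost.
        return updates_per_second, 0
    if policy == "once_per_second":
        # One flush per second; up to a full second of updates can vanish.
        return 1, updates_per_second
    raise ValueError(f"unknown policy: {policy}")

rate = 300  # hypothetical update rate
for policy in ("once_per_second", "every_transaction"):
    flushes, lost = flush_tradeoff(rate, policy)
    print(f"{policy}: {flushes} flushes/s, up to {lost} updates lost")
```

Going from one flush per second to hundreds is the jump in disk I/O that knocked the database over in our test.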

Other than that... Got FC9 running on my desktop. So two computers are upgraded now, and I'm getting to understand all the gotchas. Also Jeff and I actually are discussing SERENDIP again. You ever hear of that? That's the project we were working on before SETI@home happened, and it's been in limbo for about 10 years. But as Dan continues to build SERENDIP-like spectrometer boards to help other SETI scientists around the world, these other projects may want to incorporate our data collection/analysis software, so we better dust that off sooner than later. In the process we can maybe throw the old SERENDIP IV data into the same database as SETI@home to buff up our sensitivity even more. That's the hope, anyway.

- Matt


20 May 2008 20:44:57 UTC
Today's weekly backup/compression outage was more or less normal, running the "recover replica from backup" drill without ado or incident. That's all continuing now behind the scenes as we already have the main project up and going through its usual quick recovery.

In the previous thread Joker mentions some (broken) changes on the account page, etc. I see that a lot of php files were updated on our web site. We sync our web site from time to time with the most current versions in the BOINC html repository, and of course this may alter behavior of certain pages or break them altogether. The appropriate parties have been notified.

- Matt


19 May 2008 23:11:32 UTC
Fairly straightforward weekend, server-wise. We're still without our BOINC mysql replica database (see previous note) but we'll clean all that up tomorrow during the usual Tuesday outage. We'll also test some mysql configuration options which may protect us from such failures but at the expense of increasing disk I/O. Basically mysql could write every transaction immediately to disk as opposed to writing all queued transactions in a batch once per second - which doesn't sound like much but we can do hundreds of updates per second at times.

Still fighting with Fedora Core 9 on the test system. Ultimately trying to yum up from FC6 failed, and trying an upgrade from DVD failed - I just couldn't get X to work. So I did a clean install and that fixed the X problem, but there are some surprising but minor issues I'm working around. For example, a bug (or feature) prevented the ifcfg-eth0 script from having a "GATEWAY=" line, so I had to add that by hand to get network connectivity. And autofs wasn't installed by default. I yum'ed it in and it isn't working. I'm debugging that now. Oh I see - "grpid" isn't a valid mount option anymore (?!).

I did add yet more info of nonzero interest to the science status page - namely a link to a chart noting our entire SETI@home data distribution history. I made this chart for internal use originally, but decided it may be fun for the public to see when exactly we observed and roughly how much we analyzed per day. I know I added a couple of web features under the radar lately - I figure we'll publicize all the fun new tidbits in bulk at some point.

- Matt


15 May 2008 23:35:49 UTC
Okay today wasn't so great, but it could have been worse. Eric had continuing problems with ewen so he tackled that for a couple hours this morning, finally getting the thing to recognize its new SCSI drives upon reboot. The general network malaise that happens when ewen is offline masked the fact that, like before, BOINC mysql database server jocelyn suddenly rebooted itself for no apparent reason, causing the mysql engine to shut down ungracefully and requiring a lengthy cleanup.

So that's why we were offline most of the day. Upon recovering, the replica server (sidious) was out of sync - no big surprise there but that means we'll have to rebuild the replica database yet again. What a pain! In theory we should be able to swap roles between these two servers easily during such crises, but we haven't gotten a well oiled procedure in place yet for that. Maybe we'll start running drills on this soon. Thing is we didn't want to get fancy as we're near the end of the week, people are bogged down with the proposal, and I'm actually going out of town tomorrow for a quick private corporate gig in LA so I'm going to be completely out of touch for the next 40 hours starting.... now!

- Matt


14 May 2008 23:48:03 UTC
More of the same today. General progress slowed by grant proposal effort and continuing ewen debugging - as mentioned in yesterday's note, when ewen is down everything still works, more or less, just veeeeery sloooowly. I'm also experiencing some growing pains trying to install Fedora Core 9 on one of our test servers (which also, as it happens, sends out the "reminder" e-mails). Ran into problems with a standard "yum" live upgrade. Fair enough - I went to upgrade it from DVD but only then realized the system has only a CD drive. Sigh. So I had to pluck a DVD drive out of a defunct system. Then finally after the install X isn't working. I'm hoping a yum update at this point will fix that. On the bright side I continued Jeff's effort on Google Sky and converted our science status page to use it. Fun! I'll make a formal announcement of server status updates when I add one or two more things...

- Matt


13 May 2008 22:11:58 UTC
The standard weekly outage chores (database compression/backup, log rotation, general housecleaning) went by without much incident. It's the extra stuff we try to do at the same time that may or may not be as easy. Today Eric wanted to add a donated (and upgraded) 12TB disk array to his Hydrogen database server, ewen. We also took the opportunity to move a few things around in the closet now that there was rack space (and rack rails that fit!). The moving was fine - however ewen is having problems booting now. Eric added a couple SCSI cards, so maybe there's confusion about where the boot disk is, etc.

In any case, ewen isn't really a SETI@home/BOINC server, but contains enough shared stuff that when it disappears, there's a general malaise in the BOINC backend. Uploads and downloads are fine - it's the splitter, validating, assimilating, etc. that's not going so well (if at all). Eric's beating his head on that. Meanwhile, random unix commands sometimes work immediately, sometimes take 30 seconds to respond. Not so fun. We hope to be beyond this before day's end.

I did fight the crowds and downloaded Fedora Core 9 for soon-to-be server upgrades. I'm upgrading one test case now - so far so good.

Jeff has been figuring out the Google Sky API. We'll probably replace the Sloan Survey pix on the science status page with this, as well as use Google Sky to show our current top candidates as they start rolling in via the NTPCkr.

- Matt


12 May 2008 23:26:00 UTC
Not really much of an exciting weekend server-wise, which is typically a good thing. Lots of little bits and pieces being put together to get the new project and scientific analysis software rolling, but nothing really to report outside of mundane details. Progress in general is temporarily slowed this week - we're a man down as Eric is lost in grant proposal land.

Fedora Core 9 is coming out tomorrow. If the mirrors aren't swamped I may upgrade a test machine or two during the usual Tuesday outage. I'll also start bringing some recently donated servers on line which have been waiting on this release (I didn't want to install 8 just to have it become obsolete that much faster). We may also do some server closet shuffling during the downtime.

Happy belated Mother's Day!

- Matt


8 May 2008 21:17:25 UTC
I'll start with hardware - just some minor things. First: the boinc.berkeley.edu website (and alpha projects) were down for a while this morning because the BOINC server froze. Still not sure why, but a power cycle cleared that up. Second: currently AstroPulse scientific data only exists in the "beta" realm - Bob and company are now creating the db spaces on the master science database server along with SETI@home. This may slow things down temporarily due to heavy disk I/O. Third: we got our second new enclosure (the previous one was broken) so we're starting to archive data off site again via our ISP, hence the slightly noticeable bump on our traffic graphs. I guess from this point on you shouldn't assume all transferred bits depicted on said graphs are due to workunit/result exchange.

Software wise, we're chugging along on the various projects mentioned in previous threads. When we all get into programming mode this generally tends to uncover bugs/issues that went unnoticed during network manager mode (or scientist mode, or administrator mode, or ...). Things like being able to insert workunit_groups of any size, but only able to read ones under 8K. Not a problem when all we're doing is inserting, but now that we have to read them back in to do some precess adjustments, this constraint uncovered a few such groups that were extra-large in size. Why? Well, that's what I mean - one little headscratcher leads to another. I've been on this all day, and Jeff's been beating his head on this "ragged file" problem causing some splitters to error out - but when we restart them on the same files they work. Why? Why?! Actually, these problems are kinda fun as when we do discover the root cause there's a happy "a-HA!" moment.

- Matt


5 May 2008 22:44:09 UTC
Typical weekend - a couple weird things but nothing tragic. For example the assimilator queue ballooned for a while, but then worked its way back down to zero on its own. There might have been mysql database load causing some general malaise like the above - no smoking guns have been found yet.

Otherwise general progress. With the servers doing well I continue to send out reminder e-mails to users who haven't returned results in a while. We consistently fight a general downward trend as people buy new computers and forget to reinstall BOINC. Looking at the recent active user graphs out there I'd say about 10% of the reminder e-mails result in a returning user. Most of them bounce (or get spam filtered). Also a large fraction of these e-mails are currently going to users who haven't sent results back in years. So I imagine the success rate will increase over time, but on the other hand I imagine we won't be sending out such mails as often in the future (the number of people who could be deemed "ready to remind" is finite).

Meanwhile I'm working on finally running the precess fixer (ran into some embedded sql issues this afternoon), while Jeff is almost ready to throw the NTPCkr into beta. We actually discussed public data visualization of candidates at our general meeting this afternoon. And it sounds like AstroPulse is pretty much ready for prime time as well. Woo-hoo!

Happy Cinco de Mayo!

- Matt


1 May 2008 21:03:51 UTC
Happy May Day!

Not much to report these past couple of days. We've mostly been bogged down doing actual software development, which for me has meant trying to wrap my brain around how to pull useful information out of the science database in an efficient manner. The "efficient" part is the crux given the size of the database. Nevertheless, I will be restarting the skymap processing again - watch for new maps soon, albeit of coarser resolution, but perhaps animated over time. We shall see. Jeff's been in NTPCkr land, mostly, though we've been working through continuing data flow issues together as well. Note how I added a third color (gray) to the splitter status section of the server status page. This denotes files that didn't complete due to error which, at this point, is always due to "ragged" files (i.e. missing blocks at the head/tail containing the radar blanking signal).

We had lingering problems rebuilding the BOINC db replica. Despite getting a clean dump from the master, upon reload the replica complained of broken tables that needed repair. These tables did break in the recent past but have since been fixed, but maybe there were lingering error flags hanging around. Anyway Bob cleaned all that up and it's catching up now (again).

EDIT: in case you're watching the network graphs, we just figured out how to send more data to our archives over the ISP - so the spike is raw data archival traffic, not some kind of sudden workunit download frenzy.


- Matt


29 Apr 2008 22:08:03 UTC
During today's outage, Jeff and I did yet more reorganization of room 329, culminating in finally, for the first time ever, putting sidious in a rack. This was a major step in filling this particular rack, which will hopefully replace one of the three racks in the closet sooner than later. We also did the steps to rebuild the replica database, which is happening in the background now. May complete tonight or tomorrow, and then it shall "catch up" quickly after that and we'll be back in business on that front.

Clarifying the bottleneck I mentioned yesterday - this is strictly due to our current data processing rate. Drives with raw data come in, which we always archive to off site storage as well as copy into our processing directory (where the splitters read them to make workunits). In a perfect world, we'd be processing data as fast as we archive them, but to do so would require a lot more active users. So frequently our 8 terabyte processing directory fills up with unsplit data, and everything logjams. So this isn't a database bottleneck - it's a data bottleneck. More people/computers is the solution.
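
The logjam described above is just arithmetic on two rates, sketched here in Python. Only the 8 terabyte directory size comes from the post; the inflow and splitting rates and the function name are illustrative assumptions.

```python
from typing import Optional


def days_until_full(capacity_tb: float, used_tb: float,
                    inflow_tb_per_day: float, split_tb_per_day: float) -> Optional[float]:
    """Days until the processing directory fills, or None if it never fills.

    inflow_tb_per_day: raw data copied in from incoming drives (assumed figure).
    split_tb_per_day: raw data the splitters consume, which is limited by how
    fast users process workunits (assumed figure).
    """
    net_growth = inflow_tb_per_day - split_tb_per_day
    if net_growth <= 0:
        return None  # splitting keeps up with (or outpaces) incoming data
    return (capacity_tb - used_tb) / net_growth


# 8 TB directory from the post; the other numbers are made up for illustration.
print(days_until_full(8.0, 2.0, inflow_tb_per_day=1.0, split_tb_per_day=0.5))
print(days_until_full(8.0, 2.0, inflow_tb_per_day=0.5, split_tb_per_day=0.5))
```

More active users raise the splitting rate, and once it matches the inflow the directory stops filling.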

Still, people asked for more info about the quality/quantity of database throughput. Here's a short essay about that. This is by no means complete but it's a good start.

We have two databases, the mysql database which is BOINC specific (running on jocelyn, replicated on sidious - we call it the "BOINC" database), and the informix database which is SETI specific (running on thumper, replicated on bambi - we call it the "science" database).

The science database, while very very large (billions of rows) is not a problem under normal conditions, even as we insert over a million new rows every day. This is because inserts are generally at the ends of tables, so it's all pretty much sequential writes and that's it. With the introduction of actual scientific data analysis comes large numbers of random access reads. Earlier this year, tests using the NTPCkr (our software to do such analysis) showed this will be a problem so we spent a couple months reconfiguring the science database server/RAID systems to optimize random access performance. We seem to be in the clear for now as we continue NTPCkr testing.

The BOINC database is largely where problems arise, partially because this is our public facing database, i.e. users notice quickly when it isn't working. This contains all data pertaining to user stats, the web site, result/workunit flow, and the whole BOINC backend state machine. On average it gets about 600 queries per second, peaking at well over 2000 per second (like now, as we recover from today's outage). Thanks to many years of gaining expertise forming proper queries and creating proper indexes, 99% of these queries are super duper fast. But there are still unavoidable issues.

The lifetime of a particular workunit and its constituent results is long, as they are created, sit on disk waiting to be sent, hang out in the database as users process them after which they succumb to the whole validation/assimilation/deletion cycle, and finally get purged after a 24 hour grace period (so users can still see finished results up on the web for some time after completion).

Due to this lifetime at any given point we have roughly 3 million workunits and 6 million results in the BOINC database. This is all important data, but it's mostly metadata - the scientific stuff is contained on larger files on disk. So even with these large tables, and the user/host tables, and forum/post/thread tables, all the commonly accessed parts of the database fit into memory cache when it's all "tightly packed."

We create upwards of a million workunits/results a day in this database, which means the tables would immediately grow too large to be useful, which is why we purge (i.e. delete) them when they are finished - the useful data has been assimilated into the science database at this point anyhow. But deleting isn't in sequence - it's random as results don't return in sequential order. When rows are deleted from a mysql table, it doesn't free up space until ALL rows from the entire database page are deleted - something that isn't likely when done in random order. So even though row counts remain stagnant on these two tables, the tables bloat to roughly twice the size on disk by week's end, and mysql memory cache takes a major hit. This is why we have a weekly outage to, among other things, compress the tables (or "repack" them).
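
To see why random deletes free so little space, here's a toy simulation in Python. It's only a sketch: the page and row counts are made up, the function name is hypothetical, and it models nothing about mysql beyond the "a page is reclaimed only when every row on it is gone" rule described above.

```python
import random


def pages_freed(num_pages: int, rows_per_page: int, rows_to_delete: int,
                in_order: bool, seed: int = 0) -> int:
    """Count pages that end up completely empty after deleting rows.

    Rows are numbered so that row r lives on page r // rows_per_page.
    in_order=True deletes the lowest-numbered rows first; in_order=False
    deletes a random selection, the way results come back from the field.
    """
    row_ids = list(range(num_pages * rows_per_page))
    if not in_order:
        random.Random(seed).shuffle(row_ids)

    rows_left_on_page = [rows_per_page] * num_pages
    for row_id in row_ids[:rows_to_delete]:
        rows_left_on_page[row_id // rows_per_page] -= 1

    return sum(1 for rows_left in rows_left_on_page if rows_left == 0)


# Delete half of all rows from a 1000-page table, 100 rows per page (assumed sizes).
sequential = pages_freed(1000, 100, 50_000, in_order=True)
scattered = pages_freed(1000, 100, 50_000, in_order=False)
print("pages freed, sequential deletes:", sequential)
print("pages freed, random deletes:", scattered)
```

Deleting in order empties whole pages, while deleting the same number of rows at random leaves nearly every page partly occupied, so the table keeps its full footprint on disk until it gets repacked.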

Meanwhile, there are daily unavoidable long queries, for example to do user/host/team stats dumps. To dump all this data means reading in whole tables into memory (not just pertinent rows/fields) - queries like this temporarily choke memory cache. Indexes won't help - we're reading in everything no matter what.

Also meanwhile, I haven't mentioned the "credited_job" table which is actually the largest table in the BOINC database. We're still just inserting into it (harmless sequential writes) but I'm afraid this is a disaster waiting to happen once we start actually reading from it.

Bottom line, the BOINC/mysql database is usually fine as of now. It beautifully handles a stunning variety of queries from several public servers and a rather busy backend. A perfect open source solution that folds nicely into the general BOINC philosophy (keep it standard and free). SETI@home is rather large compared to other BOINC projects, so we had to put a lot more TLC into maintaining our mysql servers, and we pass our improvements on to the general BOINC community.

- Matt


28 Apr 2008 22:59:14 UTC
Back from a relatively painless weekend. Except the replica mysql database is screwed up again - it got stuck on a duplicate ID (not sure why) which is relatively harmless but this caused its logs to grow at an inordinate rate, filling up the data drives and bringing the whole thing out of sync. Fine. We'll recreate the replica again during the outage tomorrow (much like we did a couple weeks ago).

Since we've been fairly stable the past couple of weeks I continued to send out the "reminder" e-mails today which has already rocketed our active user base back over 200,000. This is good, as our current data flow bottleneck is the amount of data we are able to process, so the more computers the better. Tell your friends!

- Matt


24 Apr 2008 20:33:28 UTC
Work week wrapup. No major news outside of things I already posted here and elsewhere. People are out sick. Man there's been a lot of nasty bugs going around this year. I've been catching up on minor nagging items. Mostly cleaning up the lab - some recently donated servers are stuck waiting on Fedora Core 9 to be released as well as having no place to physically put the things to set them up. We have a lunch table in the center of the lab piled with random stuff so we're all eating lunch at our desks. Also worked on donation system upgrades. The IT people on campus are now allowing us to pass hidden user ids which will vastly increase my ability to match green stars to specific donors (we've been relying on people entering the right e-mail address on the donation form). Some updates to the boinc web interface broke a few pages - I fixed all that. Yeah.. lots of the usual day-to-day tasks.

- Matt


22 Apr 2008 22:27:41 UTC
Back from a long weekend out of town. Didn't seem to miss very much. I checked the network graphs while I was away and saw no dips, so that's a pretty good sign things were generally healthy in my absence. There was another seemingly bogus disk failure on thumper. Is smartd being too sensitive? The drive tagged as potentially faulty was failed/re-added without much ado. Today had the usual outage. Nothing out of the ordinary there.

One funny thing - for an unspecified amount of time nobody on the Berkeley campus (outside of the space lab) was able to connect to our servers to receive/send SETI@home data. This was due to asymmetrical routing - a problem on our public facing servers that send data over our ISP (as opposed to via the campus LAN). Jeff found and fixed the problem and I updated the network scripts to make sure a reboot doesn't break it again.

Jeff just spent an hour or so walking me through the current nitpicker (i.e. the candidate-finder) code. This really is one of those simple concepts that requires a complex solution. I find it frustrating to describe why, as the reasons are hardly obvious, and the problems are nested. We used to do this stuff with our own human brains which can find patterns and detect duplicates and RFI quickly as long as the data fits on a couple pages. This isn't so much the case anymore, and getting the computers to smartly (and efficiently) do the same grouping, comparing, and discarding is difficult. Think of it this way: you have a bunch of friends and you realize two of them are single and, based on many different variables, perhaps quite compatible - so you set them up on a date. Easy, no? Now try to run a completely automated dating service trying to accurately pair up every single person on the planet with the best possible mate. Not as easy. In any case, I might start throwing random output from it on the science status page which is of anecdotal interest. Like extra info about where we're currently pointing and what we've seen there before. Check for that in the next day or so.

- Matt


16 Apr 2008 21:34:36 UTC
So far so good with the new workunit server. We recovered from the recent spate of outages fairly quickly. The assimilator queue is starting to drain at a good clip, too. If anybody's looking at the traffic graphs and noticing a "bump" over the last hour or so - that's us sending our raw data to HPSS over the Hurricane pipe (in addition to sending it over the standard campus pipe). With the recently purchased (and employed) disk enclosure this extra bandwidth is now possible, and every little bit helps (pun intended).

Mostly working on programming today. Wrapping up work on the precess recalculator - will probably deploy next week. Astropulse and the ntpckr are both just around the corner as well. I know we've been saying that a while, but it's getting truer every day. Lots of big things coming down the pike.

- Matt


15 Apr 2008 22:24:02 UTC
As mentioned yesterday the kind folks at Adaptec/SnapAppliance replaced our server. The leading theory for its failure is still localized to the ribbon cable connecting the faceplate to the motherboard, but they swapped out the whole thing anyway just to be safe. The RAID devices had to be massaged a bit and then spent all night resyncing. That wrapped up around 4am, but one of the RAID1 pairs needed to be resynced again. Once that finished, I tackled the usual Tuesday database compression/backup. Since that began early this week (no reason not to since we were already off line) that completed around 12:30pm and I started the public/beta projects. We'll be catching up for a while, I imagine.

The assimilator queue blossomed again, but this (I think) was mostly due to one of the four assimilators being stuck on one particular result where the uploaded file got garbled and therefore became un-parseable. I blew this result away and that one assimilator seems to have pushed through for now.

Jeff is trying to debug a new problem with the splitters - despite additional smarts/logic some are failing mid-file, unable to find the radar blanking signal. But when we look at the file by hand, we see the signal (or at least where the signal should be). Insert sound of head scratching here. In any case, if there are fewer splitters running than normal, that's why.

Happy Tax Day, my U.S. compatriots.

- Matt


14 Apr 2008 19:03:42 UTC
Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives have failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday).

So while we are down we'll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc.

- Matt


10 Apr 2008 17:53:43 UTC
We thought we had the hardware problem with the workunit download server diagnosed, but looks like we were wrong. False positive. The good news is that the kind folks who donated the thing have another ready to ship. But until we get it, that probably means potential random resets all weekend. Jeff just put an /etc/rc script in place so that upon reset/reboot there's a chance it'll be operational, meaning short glitches instead of multi-hour outages. That's the hope anyway. We might actually test that later today (if it doesn't reset itself on its own). There was discussion about how to implement a second workunit storage server so we don't have this single point of failure anymore. Not as easy as it sounds.

- Matt


9 Apr 2008 21:24:22 UTC
Continuing on from yesterday's tech news note, we had a "take two" outage today for database maintenance. We "repaired" several tables (the word repair is in quotes because, while MySQL locked the tables due to potential corruption, the repair query found zero errors). Then we dumped the master database and are recreating the replica from that dump. This is actually happening now, and will probably take all afternoon, but since the master is back in one piece we started up the projects and are catching up, draining backlogs, etc. We'll start the replica once it's ready and it should catch up as well.

Outside of that, Jeff and I are tackling the current state of data flow to/from Arecibo. We have a lot of scripts in place to automate most things, but there are still some parts we do by hand based on the situation. Do we need to empty the drives as soon as possible and get them back to Arecibo to collect more data? What if there's no space available on the splitter system? Things like that. So I'll be coding up more robust scripts in the near term.

- Matt


8 Apr 2008 23:43:16 UTC
Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system). Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage.

Why were we down today for so long? What happened?! Seems like last week's database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with error thus preventing us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn't gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can't work, so they're offline for the night. We'll discuss tomorrow what to do next. We still haven't backed up or re-synced our databases. There might be an extra outage tomorrow.

We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. Long story short, data buffers are collected and stored in pairs, one which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn't fixed and may reverse "polarity" at any time. So there's a good chance the first buffer in a data file is missing its sibling and therefore can't find any blanking information. This is a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are not critical (we'll just skip the first buffer). Anyway, this only affects a couple months' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file.
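
The pairing logic above might look roughly like this in Python. This is a sketch under assumptions, not the actual splitter code: the Buffer class, the pair_id field, and the function name are all hypothetical stand-ins for however the real data files tie siblings together.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Buffer:
    pair_id: int             # hypothetical tag tying the two siblings of a pair together
    blanking: Optional[str]  # the radar blanking signal, or None if the sibling carries it


def usable_buffers(buffers: List[Buffer]) -> List[Buffer]:
    """Give every buffer its pair's blanking signal; skip orphans instead of failing.

    Works whichever sibling comes first, so a "polarity" flip doesn't matter.
    A buffer whose sibling (and so its blanking signal) is not in this file
    is skipped rather than treated as a critical error.
    """
    blanking_by_pair = {b.pair_id: b.blanking for b in buffers if b.blanking is not None}

    usable = []
    for buf in buffers:
        signal = blanking_by_pair.get(buf.pair_id)
        if signal is None:
            continue  # half pair: no blanking info anywhere in this file
        usable.append(Buffer(buf.pair_id, signal))
    return usable


# A file that starts with a half pair, followed by pairs in both orientations.
data_file = [
    Buffer(0, None),                    # sibling was left behind in the previous file
    Buffer(1, "sig1"), Buffer(1, None),
    Buffer(2, None), Buffer(2, "sig2"),
]
for buf in usable_buffers(data_file):
    print(buf.pair_id, buf.blanking)
```

The design choice mirrors the fix described above: losing one buffer at the head of a file costs very little, while a critical error stalls a whole splitter.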

- Matt


3 Apr 2008 21:31:19 UTC
Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week.

Meanwhile, spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there... Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, on every possible flat surface O'Reilly manuals (or good ol' K&R) lying open to specific pages, empty soft drink containers...

In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing.

- Matt


2 Apr 2008 22:54:30 UTC
So far so good, running with the faceplate off the workunit download server. If this remains the case we'll get a free replacement faceplate from Adaptec. This little exercise has proven that this server is a bad single point of failure - if we actually lost all the data, it isn't a scientific disaster, but a BOINC disaster - there would be hundreds of thousands of workunits "in the field" that no longer exist, and are no longer verifiable. We can regenerate the workunits, but it would be a big waste of CPU time not to mention a public relations disaster (not like we haven't weathered those before).

Remember radar blanking? Here's a recap: unlike the classic data, the multibeam data is blitzed with radar sources, adding a lot of noise to a small subset of our workunits. The radar's time frequency is short but random, making it very hard to remove by simply randomizing data based on certain thresholds. This is more an annoyance than a threat to science. Arecibo implemented a "radar blanking signal" which we now get in our data, telling us exactly when the radar is on so we can "blank" the data exactly at that time. Among other things, we've been working to get this coded up and tested in the splitter for a while now. Jeff has been managing this recently and this morning had some final data and plots from workunits sent to our clients with the radar blanking and without. Looks like we solved the problem. Expect slightly fewer RFI workunits on average in the near future.
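
In spirit, blanking boils down to masking samples wherever the blanking signal says the radar was on. Here's a minimal Python sketch of that idea. The function name is hypothetical, and filling with a constant is an assumption made to keep the example short; the post doesn't say what the real splitter substitutes for blanked data.

```python
from typing import List, Sequence


def blank_samples(samples: Sequence[float], radar_on: Sequence[bool],
                  fill: float = 0.0) -> List[float]:
    """Replace every sample taken while the radar was on with a fill value.

    radar_on[i] is True when the blanking signal flags sample i.
    The constant fill is a placeholder assumption, not the real method.
    """
    if len(samples) != len(radar_on):
        raise ValueError("samples and radar_on must be the same length")
    return [fill if on else sample for sample, on in zip(samples, radar_on)]


samples = [0.2, 9.7, 9.9, -0.1, 0.3]
radar_on = [False, True, True, False, False]
print(blank_samples(samples, radar_on))
```

Since the blanking signal marks exactly when the radar is on, only the flagged samples get touched and the rest of the data passes through unchanged.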

With Arecibo slated to be decommissioned in the not-too-distant coming years (write your local congressperson!) this has been an unintentional temporary boon for us as the observatory is prioritizing sky surveys to appease its current/remaining projects. That means we're collecting a lot more data than we originally intended, which means we can't seem to get disk drives back and forth between Arecibo and Berkeley fast enough. The bottleneck is our limited bandwidth to copy fresh data that arrives here down to HPSS (offsite archival storage) before erasing drives and sending them back. We're going to purchase another cheap SATA drive enclosure and try to use some of our excess Hurricane Electric bandwidth to speed up the archiving process.

Outside of that (and countless day-to-day chores) I got the basic plumbing of the "precess fix" program working. We unknowingly double-precessed all multibeam signal coordinates, so they aren't in J2000 as much as J1993 (the observatory's multibeam receiver code had coordinate precession built in, unlike classic receiver code). Not a major tragedy, and easy to revert - but this is one of those things where you want to make sure the math and logic are correct before updating billions of rows in a database.

Edit: Oh yeah, and I also sent out about 10000 reminder e-mails today. See other threads about waning user interest for more info. I'll send more each day.

- Matt


1 Apr 2008 22:15:39 UTC
Last night the workunit storage server acted up again. I attempted to reconfigure it at midnight last night, but then it reset itself an hour later, and again every hour since. So whatever the problem is, it's gotten worse. Jeff and I did some diagnosing during the regular weekly database backup outage today. The reigning theory is still a faulty faceplate sending erroneous resets to the motherboard. So as it stands now the server is running without its faceplate (and therefore no control panel - which makes powering on quite difficult)! And so far no resets. If this stays stable for a week I think we'll have nailed the problem. Meanwhile the kind folks at Adaptec already have a complete replacement at the ready if we need it - we might just need to replace the faceplate.

No other real big shakes about today's outage. I added more machines to the new kvm (which meant being able to pull more cables out of the closet) and we added a new field to the workunit table in the BOINC database - so far that hasn't broken anything as far as we can tell. The beta uploads are failing again, but hopefully that will clear up on its own like last time (I'd still like an explanation, however).

Happy April Fools, by the way!

- Matt


31 Mar 2008 21:46:51 UTC
The last few days were a little bumpy, with our workunit storage server disappearing out from underneath us at random (see previous posts for more info). This is still not quite clearly understood. The reigning theory is there's some faulty connection somewhere between the front face of the system (where the reset button is located) and the internal circuitry. This isn't too hard to imagine as there are some servers sitting right on top of it, and pressing ever-so-slightly down on the server's faceplate. A month ago we added that new heavy router to the stack. Perhaps this is the problem, which leads us to the general (and incredibly annoying) rack standards issue: all server racks are by default non-standard size and shape, and therefore we aren't properly racking as much as stacking.

One upshot of this was that beta uploads were failing all weekend in various ways, most likely due to partially broken mounts between the upload server and the storage server (which contains the beta uploads as well as workunits - SETI@home public uploads are kept right on the upload server itself). This was very difficult to understand, but even worse: it just suddenly started working again - and during a meeting no less (when nobody was actually sitting at a computer doing any tweaking).

I'm leaving early today to have a meeting down on campus with the donation department. Exchanging general ideas for improvement.

- Matt


29 Mar 2008 5:16:39 UTC
I was joking in my last post about machines dying at midnight starting this three day weekend. At least they were nice enough to wait 18 hours into the weekend to start failing.

In this case, our workunit download server which failed earlier in the week croaked again. I happened to notice during my usual random check in from home that we weren't sending out any bits, which immediately led me to the faulty machine. For a short time I was able to log into it via a serial connection but it was in some funny, unhelpful single-user mode with a broken network config. Unable to do much I tried quitting out of that and it then basically became unreachable. Since its network configuration has reset, and the serial connection now shows no pulse, there's no option except to drive up to the lab and kick the thing in person.

Except it's 10pm on a Friday night, and it's raining, and the known fix will take an hour or two to enact. No thanks. Even if I wanted to go up to the lab, there's no guarantee any fix would work. And even if I did get it running, given current history there's no guarantee it would stay running through the night or the weekend, so I'm staying home.

Bottom line: no workunits until somebody is in physical contact with the server. This may happen sometime before Monday, but don't count on it. I sent warnings to the others but not sure any of them will be free to go up to the lab. I have a gig tomorrow so my next 36 hours are occupied.

- Matt


27 Mar 2008 22:40:40 UTC
There's not much news to report on the technical front - but that doesn't mean I haven't been busy. I've mostly been engrossed in tasks that have little effect on the public servers, so anything I've been working on is either (a) too complicated to describe to everybody's satisfaction (including my own), or (b) relatively uninteresting.

I've been lax in sending out regular "reminder" e-mails to participants who lapsed (i.e. have stopped processing data for N days) or never succeeded in processing work. We wanted to start these up in the fall, but there were server woes - and it's not good form to send "please come back" messages to people only to frustrate them with connection failures. Then everybody went on vacation at different times. Then it was donation season, and we try not to send e-mails to people more than quarterly, so that postponed the reminders until a month ago, but at that point we were having the science database/router woes. Anyway.. now seems like a good time to try and start again. Perhaps starting early next week.
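
The selection rule for these reminders is simple enough to sketch. This is a rough illustration only, not our actual mailing code: the function name and the 60-day and 90-day defaults are assumptions standing in for "N days" and "quarterly".

```python
from datetime import datetime, timedelta


def needs_reminder(last_result, last_email, now,
                   lapse_days=60, min_gap_days=90):
    """Decide whether a participant should get a reminder e-mail.

    last_result -- datetime of the last successfully processed result,
                   or None if the participant never succeeded
    last_email  -- datetime of the last e-mail we sent them, or None
    lapse_days and min_gap_days are hypothetical values for "N days"
    and "no more than quarterly".
    """
    # Lapsed: never processed work, or nothing returned for lapse_days.
    lapsed = (last_result is None
              or now - last_result >= timedelta(days=lapse_days))
    # Polite: never mailed before, or the last mail is old enough.
    polite = (last_email is None
              or now - last_email >= timedelta(days=min_gap_days))
    return lapsed and polite
```

For example, someone who last returned work 120 days ago but received a donation e-mail 30 days ago would be skipped until the quarterly gap has passed.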

Tomorrow is a University Holiday, thus making this a three day weekend. Perhaps start an office pool involving which server will croak at midnight tonight.

- Matt


24 Mar 2008 22:28:55 UTC
Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. It's a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection).

There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all host/IP information. For all we could tell at first it lost everything! We had to connect to it over serial (most difficult part: finding the right cables) but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things and we were back in business. Why did it reset itself? That remains a mystery.

Another minor gripe: I spent a man-day last week working on testing mdadm's "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same RAID group - mdadm's way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper is less two spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren't being utilized. Long story short - I actually recompiled my own mdadm with fprintf's all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we're talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked, so apparently this bad behavior has been fixed. But Fedora doesn't have the latest version available yet, at least via "yum update," so we're pretty much waiting on the new version to become available before implementing a less trusted version, even if it seems to work better.
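
To make the "hot spare pool" behavior concrete, here is a toy model of what I expected to see in my test cases. It is not mdadm's code or configuration: the dictionary layout, the function name, and the array names are all made up for illustration.

```python
def handle_failure(arrays, failed):
    """Toy model of a spare group reacting to a failed drive.

    arrays -- dict mapping a (hypothetical) array name to a dict with
              "group" (spare group name) and "spares" (spare count)
    failed -- name of the array that just lost a drive
    Returns the name of the array that supplied the spare, or None if
    no spare was available and the array stays degraded.
    """
    target = arrays[failed]
    # Easy case: the array has its own spare and rebuilds onto it.
    if target["spares"] > 0:
        target["spares"] -= 1
        return failed
    # Otherwise borrow a spare from another array in the same group.
    for name, other in arrays.items():
        if (name != failed and other["group"] == target["group"]
                and other["spares"] > 0):
            other["spares"] -= 1
            return name
    return None
```

With two arrays in one group and only the second holding a spare, a failure on the first should pull the spare across. That is the step that never happened under 2.6.2 in my tests.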

- Matt


18 Mar 2008 21:15:54 UTC
Today during the outage I installed the new network kvm in the closet and hooked up one of the servers. We're waiting on green cables to arrive (so we can tell them apart from other cables in the closet) before hooking up the other servers. Putting this server in actually maxed out our 24 port DLink gigabit switch - so I chained in an old reliable Netgear 100 Mbit switch to occupy the stuff that doesn't talk gigabit anyway - UPS's, service processors, older servers...

Bill, who donated our previous and current routers, came by to pick up the 2811 we're no longer using, now that the current one has proven itself to be able to handle what we give it. Apparently this 2811 is off to Beirut. What an adventurous life this router is leading.

Otherwise, a lot of my time the past couple of days has been spent mostly on generic network/systems administration not worth mentioning here (i.e. mundane drudgery).

- Matt


14 Mar 2008 17:52:11 UTC
We turned off the resend of old WU on client reset because of a huge IO load on the MySQL db. It was slowing down result validation, the main function.

We have done a number of things to improve the db performance and reduce IO rates, and hope to turn the resend feature back on in the near future for a test period.

If the IO load is manageable the feature will remain enabled.



13 Mar 2008 21:25:40 UTC
A few small items today.

Still messing with the new science database indexes. Bob just started dropping/recreating these one at a time, which may slow down the assimilator inserts, but we'll see. Having the indexes on a different volume can only help.

We just got a used Raritan 16-port network KVM donated to us - I believe the donor would like to remain anonymous (if you're reading this, thank you!). Eric got this hooked up to a test server pretty quickly - it's pretty sweet. We'll get this in the closet sometime next week, and then we'll have the ability to reboot systems from home, which should minimize down time over the long haul.

With the regular BOINC database performing quite well these days, we may attempt turning on the "resend lost results" features again early next week and see if we can handle it.

I have a gig tonight where I have to sing, but with my lingering cold/congestion I currently sound kinda like Brad Garrett. Should be interesting.

- Matt


12 Mar 2008 22:32:31 UTC
As for science database improvements... While getting the new science database RAID1 volume set up we discovered that the lvm gui doesn't allow for resizing of logical volumes containing xfs filesystems. Huh. We were able to grow these on the command line (first the logical volume and then the filesystem itself), so we'll just have to use the command line in instances like these. At any rate, Bob is building new db spaces for the indexes on this new volume. We'll recreate indexes there after dropping them from the old spaces (which are in I/O contention with the actual data). This will happen gradually over the next few weeks.

And yes, there were still lingering issues with the donation script. Actually I should point out that the problems were not in my parsing script, nor the whole system I set up to garner information from campus. The problem is that the format of the confirmations from campus changes every so often. And by "changes" I mean they suddenly contain random line feeds in unexpected locations for no explicable reason. So my parsing script needs to be "improved" every so often to pick up the exciting new places these line feeds might happen to turn up. Anyway, it's fixed, and a couple "clogged" donations pushed through just now.
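
One general way to make a parser shrug off stray line feeds is to flatten all whitespace before looking for fields. The sketch below shows the idea only; it is not the real donation script, and the "Donor", "Amount" and "Date" labels are a hypothetical confirmation format.

```python
import re

# Hypothetical field labels; the real confirmations differ.
KEYS = ("Donor", "Amount", "Date")


def parse_confirmation(text, keys=KEYS):
    """Parse "Label: value" fields, ignoring where line feeds fall.

    Every run of whitespace (including line feeds) is collapsed to a
    single space first, so a line feed between words cannot break a
    field. A line feed in the middle of a label or a number would
    still cause trouble.
    """
    flat = " ".join(text.split())
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r"):\s*")
    # Splitting on a capturing group yields
    # [preamble, label1, value1, label2, value2, ...].
    parts = pattern.split(flat)
    return {label: value.strip()
            for label, value in zip(parts[1::2], parts[2::2])}
```

Because the fields are found by label rather than by line position, a confirmation that arrives with its values wrapped onto new lines parses the same as one that arrives on a single line.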

- Matt


11 Mar 2008 22:09:13 UTC
Typical Tuesday. The weekly outage went along just fine. This is the first time in many weeks the result table has been "lean" - i.e. no large excess of result entries due to blocked queues, waiting for purging, etc. How nice.

Despite the happy current performance of our servers, we're still keen on improving science database throughput. We met today to discuss a plan to shuffle disks/RAID/LVMs around to optimize performance on thumper. I'm building the first RAID1 pair - it's syncing up now - where we'll start recreating indexes as soon as tomorrow.

- Matt


10 Mar 2008 18:58:22 UTC
Hello, folks - just getting over a really really bad cold. I rarely ever get sick like this so it's a bummer when I do. Anyway, I'm back, though still only about 80-90%.

In the meantime, nothing much happened, except that the happy mixture of (a) enough download bandwidth to ensure an even flow of work, (b) a consistently long average workunit turnaround time, and (c) no unexpected other stresses allowed us to finally, albeit slowly, catch up on the assimilator queue over the past week. At first I thought our queues were benefiting from the new splitter which might have been generating less noisy workunits (and therefore less prone to quick overflow and return), but the opposite was true: the new splitter was generating annoying broken workunits that errored out immediately. Sorry about that. In any case we're still in dire need of database server improvements, mostly in the RAID re-configuration realm. We're also getting smartd errors more and more - these drives are approaching retirement already. Can you believe it?

- Matt (sniff cough)


4 Mar 2008 23:27:02 UTC
Some positive progress today: During the weekly database backup outage I removed old kosh/penguin from the server closet, and replaced them both with bruno (the upload server) and its disk array. So the only backend servers still outside the closet are sidious and vader. In order to accommodate the new server I also put in a second KVM and did some recabling to daisy chain it with our current one. The upshot is that thinman (the web server), which was up until today totally headless, now has a spot on the KVM, which gives us some warm fuzzies.

Even better: Thanks to the "help wanted" post, user Gerry Green found the bug causing those occasional broken queries tying up our database. It was a bad function call lost in the "ask a friend" web code. Thank you Gerry!

However, the outage was slowed due to our database simply getting larger and larger, and then we tried to let the assimilator queue drain a little bit before starting up again. A new splitter is also being rolled out today - the only difference is correcting a minor precession bug (for better accuracy we still have to un-precess our coordinates in all the previous signals up to this point - which we plan to do sooner rather than later).

I'm reverting to four assimilators. Doesn't seem like 12 helps, and it only caused memory problems on bruno. We're really going to have to do some major reconfiguration on thumper before we can catch up again.

- Matt


3 Mar 2008 23:13:14 UTC
So it was a rough weekend, mostly due to the excess assimilators being employed to knock down the ridiculously large backlog of results waiting to be entered into the science database. Long, long ago we had chronic problems with a memory leak in the assimilators, but that hasn't been a problem so much lately as we've moved them to a more powerful server and got BOINC going. Now they all get restarted every week due to the database backup outage. Anyway... having 12 running at once seemed to exercise the memory problem enough to cause the upload server to lock up a couple times. This created a general malaise on the backend, aggravated by a current period of fast workunits creating a heavy load on everything.

This morning bruno was rebooted and log jams were cleared. Servers are trying to get on top of their queues. But in the positive progress department, check out the most recent traffic graph (green = outbound, blue = inbound). Can you guess when we switched over to the new router?



Yay! We've now increased our bandwidth capacity by about 50%. The roving bottlenecks are surfacing elsewhere, though until we get beyond the current period of catchup we don't have a good sense of what's normal or what to expect. We still have a ways to go to fully capitalize on the full gigabit of bandwidth Hurricane Electric is offering us, but this is still a vast improvement for now.

In regards to one comment in the previous thread: despite our small staff and minuscule pay scale we're generally close to 24/7 system monitoring, what with all of us on different schedules checking in regularly at random. And nope - I still don't have a cell phone. Never had one and, if possible, never will.

- Matt


28 Feb 2008 21:25:13 UTC
Fully recovered from the long outages earlier this week. I also employed more assimilators (and even more just now) to try to capitalize on periods of low I/O to help catch up on the big assimilator queue backlog. Seems to be working, sort of. W