Technical News - 2006
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants.
(Available as an RSS feed.)
December 8, 2006 - 00:15 UTC End of the week update. Like always we've had our hands full with behind-the-scenes stuff that usually goes unreported. The new multibeam splitter is close to completed, and will be beta-tested shortly. This means new analysis on hot, fresh data is coming down the pike. The Astropulse project is also making leaps and bounds forward. The data collection down at Arecibo is fully automated at this point. After drives fill up with data the operators get warnings and swap them out with empty ones. When they have enough full drives they ship them up here where they are copied over the internet to HPSS until we are able to turn them into workunits. Meanwhile, a lot of attention was paid towards the user experience - a department which we've had (and continue to have) zero resources for development. Among other things, users can get reacquainted with their abandoned classic credits with a new form, and we now have a growing bank of volunteers who are giving tech support via internet phone (i.e. Skype). And finally an addendum to the addendum in the previous tech news item: The Intel donation is in fact a quad dual-core Xeon processor system. Turns out only half the processors were coming on line, but were hyperthreading, making it look like 8 total processors. After some tweaking in the BIOS the system finally recognized the other two processors, so now, with hyperthreading, there are 16 3GHz processors according to the OS. December 5, 2006 - 23:15 UTC Today we had the usual database outage a little bit later than normal (the person who usually does this was out sick today). Nevertheless, it was short enough that it still remained within our claimed 4 hour window. An addendum to the previous tech news item: The recent Intel donation is not a quad dual-core Xeon processor system - it is a quad single-core Xeon processor system using hyper-threading (which shows up as 8 processors according to the OS). 
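The shifting processor counts in the two entries above all come from the same multiplication: processors seen by the OS = sockets online x cores per socket x threads per core. Here is a minimal sketch of that arithmetic. The function name is ours, and it assumes hyper-threading presents exactly two threads per core.

```python
def logical_processors(sockets_online, cores_per_socket, hyperthreading):
    """Number of processors the OS would report for this configuration."""
    threads_per_core = 2 if hyperthreading else 1  # assumption: HT doubles the count
    return sockets_online * cores_per_socket * threads_per_core

# December 5 reading: four single-core Xeons with hyper-threading.
assumed_dec5 = logical_processors(4, 1, True)
# What was actually happening: only two of the four dual-core Xeons online.
before_bios_fix = logical_processors(2, 2, True)
# After the BIOS tweak: all four dual-core Xeons online.
after_bios_fix = logical_processors(4, 2, True)

print(assumed_dec5, before_bios_fix, after_bios_fix)
```

Two different hardware configurations both show up as 8 processors, which is why the OS count alone could not settle which one we had.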
November 29, 2006 - 19:00 UTC Last night we had a blip of an outage due to our data download server losing its mount of the file server holding the workunits. This was strictly a random failure, and it worked itself out on its own. We saw similar behavior back when we had a heavier load on the entire system. We released the enhanced client which greatly reduced the rate of workunit/result exchange and therefore reduced the occurrence of these load-related problems. Thanks to Moore's Law and an ever-increasing user base, we'll need to address this issue sooner rather than later. The other lingering, randomly-occurring problem has to do with "rough periods" accessing the database (see earlier tech news items for details). Basically, what's going on is this: every 24 hours or so a process dumps all useful user/host/team stats to XML files which other sites can download to generate leader boards, graphs, etc. These tables have continually grown in size, and apparently when this process runs they can knock the result table out of memory. The feeder process, which keeps a healthy queue of available work to send out to users, needs the result table in memory or else a sub-second query to select more work becomes a multi-minute query to read the whole result table back into memory from disk. We're looking into making these queries more efficient. We're also looking at setting up a new BOINC database server (remember that the BOINC database is separate from the SETI@home-specific science database which already is on a new server and working well). Recently Intel donated several pieces of hardware to us, including a quad dual-core Xeon processor system (i.e. 8 3GHz processors total). We're currently working out some system quirks, but when we begin trusting it we'll make this our master BOINC database server, and the current one will be a replica. This will provide an immediate backup if needed, and remove the necessity for the weekly outages. More to come on that. 
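To see why a slow feeder query leaves the scheduler with nothing to send, here is a rough back-of-the-envelope sketch. The queue size of 100 and the roughly 1000-second query time come from the October 12 entry. The demand rate of 10 results per second is a made-up number for illustration only, and the function name is ours.

```python
def starved_seconds(queue_size, demand_per_sec, query_seconds):
    """Seconds the queue sits empty while one refill query is running.

    Assumes the queue is full when the query starts and that the
    scheduler hands out results at a constant rate (a simplification).
    """
    seconds_until_empty = queue_size / demand_per_sec
    return max(0.0, query_seconds - seconds_until_empty)

QUEUE_SIZE = 100          # from the October 12 entry
HYPOTHETICAL_DEMAND = 10  # results/second; illustrative assumption only

normal = starved_seconds(QUEUE_SIZE, HYPOTHETICAL_DEMAND, 0.5)   # sub-second query
rough = starved_seconds(QUEUE_SIZE, HYPOTHETICAL_DEMAND, 1000)   # "rough period" query

print(normal, rough)
```

Under these assumed numbers a sub-second query never lets the queue run dry, while a 1000-second query leaves it empty for nearly the whole duration of the query.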
Another recently donated Intel system has already been set up and is being used as a backend science CPU server (and to read new data from hard drives sent up from Arecibo). The last of the known never-touched classic data tapes was read last week and is in the splitter queue. Next we will start reading tapes that have gone through the pipeline in some form or another, but for some reason never made it into our master database. Possible reasons include: bad data (but hopefully not), a tape drive failure that caused the tapes to remain unread (surprisingly more common than you'd think), poor initial analysis or database corruption leading to failure during redundancy checking. So don't be upset when tapes from the late 90's appear on the queue. Data from 1998 is worth the same as data taken in 2006. The ETs we are looking for come from light years away. A few years won't make any difference when looking for signals consistently repeating over time. November 14, 2006 - 23:45 UTC So the download server stopped sending out work yesterday afternoon, and it didn't get caught and fixed until this morning. We are in the process of major network upgrades (stay tuned for news on that). A configuration change made for testing the new plumbing tickled a creeping problem where long, cryptic symbolic link chains break unexpectedly. In other words: oops. November 8, 2006 - 00:30 UTC Earlier today we had the usual BOINC database backup outage. No big shakes there, but we took the time to clean up the server closet a bit. There were lots of tangled ethernet cables that needed to be rerouted. Further cleaning will require turning servers off, so we'll need a coordinated outage for that. The new data recorder is close to fully functional. It has been operating for weeks now, but as it becomes more automated we'll be collecting data at a faster rate (instead of continually waiting for human intervention as various problems arise). 
Most of our effort is focused on this, as well as making a splitter to create workunits containing the new data and a client which can process them. October 31, 2006 - 20:30 UTC Today during the regular outage we completed the final steps to shut down the server "castelli" which had up until recently been the science database server. This database is now on "thumper." At least this is how it looks on the surface. In reality, the machine that was powered off today was "galileo," the scheduling server. Since castelli had a couple more CPUs, another GB of RAM, and a more recently updated operating system, we simply turned off the old galileo and changed castelli's name to galileo. Follow? Anyway, a new scheduler has been compiled and will be implemented sometime soon. While the project is back up, expect some rough seas ahead as we may be shutting off the systems briefly to exchange scheduler software and test it. As well, we're having wild post-backup BOINC database performance like last week, which took a few hours to clear up. October 25, 2006 - 19:00 UTC We're currently starting some minor database testing to diagnose recent issues (see previous note below). Namely, we're purging all the finished results from the database as soon as they are complete, instead of letting them linger around for at least a day. This means you won't be able to see completed results on your account page as they will disappear from our user database as soon as the data is assimilated into our science database. This condition is only aesthetic and will be temporary. The hope is that we don't have any "orphaned" results in our database and cleaning out all the rows that are spoken for will help determine this. October 18, 2006 - 18:30 UTC Completely unrelated to server issues over the past week we are still having these "rough spots" as noted two entries below. We had another such event this morning which lasted until we finally stopped/restarted the whole backend. 
What would cause a query that normally takes one second to suddenly take over 1000 seconds? What's odd is that (a) no other queries are similarly affected, and (b) running the very same query on the command line on the same machine as the same user, even during these rough periods, returns in less than a second. Nothing is amiss according to the server logs. It's quite a poser. Perhaps a mysql upgrade is in order. October 16, 2006 - 19:30 UTC Busy weekend. On Thursday afternoon our mail server up and died. We were planning on replacing it anyway but this more or less forced the issue. We are still in the process of recovery and dealing with configuration complexities. Mail behavior is mostly normal at this point, with occasional unexpected quirks that need to be ironed out. On Sunday, our upload server (kryten) lost the ability to NFS mount another machine and this had a domino effect on the whole server back end. This has been a problem before (the sudden loss of NFS mounts) with far less tragic results, but the solution for this was and still is rather simple: reboot the machine. However, upon reboot the RAID device storing the uploaded results forced a resync, so we stopped the project for a few hours to let this happen in peace. Once back on line, the backlogged traffic caught up relatively quickly. In general, server uptime has been excellent over the past 6-12 months when compared to days of yore. So naturally as soon as we started sending out a mass mail to inactive users asking them to rejoin the project, several servers decide to flake out on us. Talk about bad timing. Oh well. October 12, 2006 - 19:00 UTC A new scheduler was installed the other day and there was a small bug regarding user preferences which has since been found and fixed. The fixed version will be employed shortly. And now a separate scheduler problem. There were a couple of rough spots during the past day or so when work wasn't getting sent out very fast. 
The feeder process keeps a queue of 100 results to send out via the scheduler, and is continually querying the database for the next batch of results to keep its queue full. For some reason this query, which normally takes less than a second, started taking about 1000 seconds, during which the tiny queue would empty and there would be no work to send. There were periods of about 2-3 hours where this was happening - the feeder query would mostly return rapidly, but occasionally take 1000 seconds. Eventually it would go back to normal and stay that way. We're looking into that. Meanwhile the new science database is doing well, though we're keeping the old science database server on line as a backup just in case things go unexpectedly awry. In a few weeks once we're satisfied we'll decommission the old server. Data-wise, we're still collecting new multi-beam data (over 1 TB so far!), but still have to create a new splitter, make minor cosmetic changes to the client, and test this in beta before this data goes public. Meanwhile, we're working on old classic data that has never been touched before. We passed on these tapes back in the day mostly due to potential RFI corruption, and we had cleaner tapes which got preference at the time. October 9, 2006 - 20:15 UTC Turns out a bunch of the tapes we had slated for workunit creation over the weekend were corrupt and unreadable. So this morning we are reading good tapes as fast as we can, and only now just ran out of work. A new image is just about to come online as this is being typed, and we'll catch up soon after that. October 5, 2006 - 21:30 UTC Last night we successfully finished migrating all the science data off the old server (castelli) onto the new one (thumper). So far, so good. However, we still have to move the beta data over to thumper and until then the beta project is offline. 
The new database still needs some minor tuning, but this will happen during regular (or very quick) outages, none of which would affect the usual result/workunit flow. By the way, the name "thumper" is just the pet name that Sun gave the new line of Sun Fire X4500 systems. If we get another, it'll be called "bambi." October 4, 2006 - 23:30 UTC The new science database is up and running, but unfortunately needs some unexpected performance tweaks before it can be used in production. Right now it's in the middle of an index build that's taking way too long. We need to figure out exactly why and tune it, but also don't want to break the current index build mid-stream. So we just have to grin and bear it. Hopefully we'll be back up tonight. Probably tomorrow morning. That's life in the big city. Addendum: The index build finished just after writing this note. We are sending out new work now. October 3, 2006 - 21:15 UTC As this item is being written, we are busy finishing the final steps of the science database migration. This is why the splitters and assimilators are currently turned off - the science database cannot handle inserts or updates at this time. So no new work is being generated right now. We plan to be back up tomorrow and should have enough work to last until then. Addendum: Since we also have to migrate the science data for the beta project, we are leaving that project entirely off for now. September 28, 2006 - 21:30 UTC The new drive enclosure (to read drives full of data from Arecibo) didn't work. Long story, but the bottom line is we aren't exactly sure why and we're tired of screwing around with what will be the new master science database server. So we're changing plans a bit. Basically we'll put the enclosure on some other server - we just aren't exactly sure which one yet (need to first determine bandwidth requirements, PCI card support, etc.). We'll get the data on-line soon somehow, and then splitting it into workunits will follow. 
What this all means is that we are no longer waiting on anything to do the final step of the science database migration. And what that means is a semi-outage (during which splitters and assimilators will be turned off) of a day or so as we move all the remaining data to the new server. Then a complete (but shorter) outage to point everything over to the new database and make sure it is behaving properly. This may happen during our usual Tuesday outage next week. We'll post something on the front page as the big event approaches. It should be stated that about a month ago we finally got the SETHI project up and running again. At least some science is getting done with the data we are collecting at Arecibo - one of the most detailed maps of hydrogen in the local interstellar medium. But what about interesting SETI candidates for reobservation? What happened to those? Well, work on the next phase of candidate generation has been blocked (for a very long time) behind other big tasks (moving the whole project over to BOINC, getting a new data recorder developed, installed, tested, and put into production, migrating data and bringing the new science database server online, putting out countless figurative fires...). Once the new database is in play we still have to do significant analysis and data correction before we can get to work finding E.T. hiding within a terabyte of spikes, gaussians, pulses, and triplets. September 22, 2006 - 17:00 UTC Some news: The new data recorder is fully functioning down at Arecibo now. We already have a couple 500GB drives full of data here at the lab. We're waiting on drive enclosure parts before reading these drives. Some tweaks to the splitter (which converts data to workunits) are being made so we can start sending out this new data. Meanwhile, we're still sending out regular data in its current form. Some have noticed that older tapes are being converted into workunits. 
We ran out of "new" data to analyze so the current data tapes being used have one of the following qualities:
September 11, 2006 - 22:00 UTC We're back on line, more or less, after the science database crash last Friday. Work is being generated, and the assimilators are catching up on the backlogged results. But we're far from out of the woods. We aren't quite finished getting the data unloaded from the old (broken) database. There is no data corruption, but we are currently operating on only one half of a RAID 10 mirror. In other words, one more drive failure and the whole database is toast. What happened? The two mirrors are on two separate drive arrays, and one of the two enclosures went belly-up, causing all our headaches this past Friday, and our cautionary measures over the weekend. The rest of the data should be unloaded within the next 24 hours. Then we have to check the data, and load it onto the new server. Then we check to make sure the data on both databases match. And then we shut down the whole project for a day and migrate all the workunits created and results uploaded since we started this whole project a week ago. Finally, we turn things over to the new database and start the project back up. If all is well, we'll be completely on the new server and can fully retire the old one. Meanwhile there is positive news about the data recorder. We haven't been able to take much data with the new multi-beam recorder because of DLT drive problems galore. In an effort to move away from that technology, we successfully implemented and tested using swappable SATA drives to store the data. As of this morning we have all the parts working at Arecibo. Soon we will get the parts ordered/installed/tested up here at the lab. Eventually, instead of shipping data back and forth on tapes, we'll be shipping whole 500GB drives. September 8, 2006 - 22:00 UTC We are out of work to send out, and probably will be for the whole weekend. Here's why: Recently Sun donated a new server (see below for details) which we decided to make our new science database. 
Our current science database works just fine, but is on an older (slower) system with a used set of fibre channel disks that frequently fail for one reason or another. We finally got the new server set up to our liking last week and started unloading all the database tables this week from the old system. The increased disk activity caused the aforementioned disks to completely freak out last night. We had to shut down the science database this morning and we are still in the process of recovering the system. While most of the BOINC backend functions without any dependency on the science database, the splitters and assimilators do not. The assimilators being off are no big deal - this just means a delay in moving results on disk into the database. But when the splitters are off no new work can be created, and our queue of work to send already ran dry this afternoon. It is highly unlikely we will get the database back up before the end of the day, or anytime this weekend. Even if we do, our highest priority will be to unload the remaining few database tables before the disks crash again. If you want to keep your computers busy, you can always work on multiple projects. August 22, 2006 - 22:30 UTC We had our usual database backup outage today, during which we also replaced a set of UPS batteries, upgraded the OS on a fileserver, and updated software versions of various server processes. All went well, except we ran out of work. This is likely due to high noise levels in our current data, meaning workunits are taking a lot less time to process than normal. While our result to send queue is currently at 0, this only means we have 0 extra workunits to send out at any given time. We're only slightly failing to keep up with current demand. As we get more tape images on line this trend should reverse. August 10, 2006 - 14:00 UTC How about an update? During the regular weekly outage the web servers were also taken down to replace batteries in a failing UPS. 
Something fishy happened during this procedure and a wire shorted out, causing a massive spark and one of the new battery connectors to vaporize to nothing. It was exciting and luckily nobody got hurt. The servers in question are on a spare UPS until the replacement batteries get replaced. The new data recorder is still currently unable to collect data because of inexplicable tape drive errors. While we are researching that we are attempting to use disk drives to collect data in the meantime. These are hot-swappable SATA II drives in trays that will be shipped back and forth between Berkeley and Arecibo. Speaking of tape drive errors, the last of our original set of DLT IV tape drives finally bit the dust. Sure, these drives were old, but we still have a lot of SETI@home data on these tapes to read! We've been using a Super DLT drive to read tapes in the meantime, and a replacement drive is on order. The new Sun server (see July 26 news item for more information) is almost done being configured. Why is it taking so long? There was some confusion about how linux recognized the boot drives. Long story short, in a 24 drive configuration, the boot drive is /dev/sdm, and the secondary boot drive is /dev/sdo. Of course, the Fedora Core 5 installer does everything in its power to install on /dev/sda. Once the OS was installed, the creation of several large (>2TB) linux volume managed filesystems sitting on top of software RAID simply took a lot of time - the initial RAID syncs took hours, as did putting filesystems on the new RAID devices. The plan is for this server to become the sole science database server, replacing an E3500 system with flaky fibre channels and failing disk drives. We will have to unload all the data to files and reload it into the new database, meaning a massive outage, but most of this will happen "behind the scenes" without much disruption to normal user activity - only the splitters and assimilators will be offline during this procedure. 
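For readers wondering where /dev/sdm and /dev/sdo sit among 24 drives: Linux names SCSI/SATA disks sda, sdb, sdc and so on in the order it enumerates them, so sdm is the 13th disk and sdo the 15th. Below is a small sketch of that lettering rule. The helper name is ours, and it only illustrates the naming, not how the X4500's controllers actually order the drives.

```python
import string

def sd_name(index):
    """Linux-style sd device name for a 0-based enumeration index."""
    letters = ""
    n = index
    while True:
        # Least significant letter first; names roll over from sdz to sdaa.
        letters = string.ascii_lowercase[n % 26] + letters
        n = n // 26 - 1
        if n < 0:
            break
    return "/dev/sd" + letters

# The 24 names a fully enumerated 24-drive system would use: sda through sdx.
names = [sd_name(i) for i in range(24)]

# 1-based positions of the two boot drives mentioned in the August 10 entry.
print(names.index("/dev/sdm") + 1, names.index("/dev/sdo") + 1)
```

This makes it easier to see why an installer that defaults to the first enumerated disk (/dev/sda) ends up nowhere near the intended boot drives.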
August 2, 2006 - 17:30 UTC We had to bring down the project briefly to reboot the upload server. Once in a while it loses random NFS mounts, this time resulting in the inability for the assimilator to insert into the science database, and for results to be uploaded in general. July 26, 2006 - 23:00 UTC Late last week a power fault crashed our science database. One of the RAID mirrors failed, but ultimately we were able to get all the failed drives re-synced and back online without too much trouble. Nevertheless, we are anxious to get a new science database server, both for speed and reliability. Sun recently donated a Thumper (X4500) system to us to beta test and potentially use as a database server. We are in the process of configuring this system, which has two dual-core opterons, 8 GB of RAM, and is half full with 24 500GB SATA drives (12 TB total). The science database itself only needs about 1 TB - the remaining space may be used for temporary tape image storage, as the new data recorder (science newsletter coming soon) records up to 300GB a day. July 7, 2006 - 17:30 UTC This morning starting at around 5:00am (12:00 UTC) there was a lab-wide power outage to diagnose electrical problems in the nearby Lawrence Hall of Science building. We're not sure exactly why our lab also had to be shut down, but there you have it. All they did today was hunt for the problem. From what we've been told, a longer outage will be necessary in the future to actually fix it. Since nobody wanted to get up at 3:00am to shut everything down before the outage, we shut off everything before the end of the work day yesterday. Outside of some minor dependency issues (i.e. server x hung until server y came on line), all systems/services cleanly came back up this morning. April 11, 2006 - 23:00 UTC Today we had an outage for hardware reconfiguration. Due to cabling issues we only ended up being able to do half of what we intended. This is why the outage only lasted a couple hours. 
What we did accomplish was doubling the memory (from 8GB to 16GB) in our BOINC database server. This rather simple, mundane procedure was slowed by confusing messages from the 40z service processor. After adding the DIMMs, they weren't recognized until we cleared out old "critical" errors from the event log. In order to gather some data about improved throughput we actually didn't restart using all the new memory - only about half of it. We're spending far less time in i/o wait, which is exactly what we hoped. After tomorrow's weekly database backup/compression we plan to use all the available memory for the database. We'll also set db_purge so that it keeps results around longer. April 6, 2006 - 18:00 UTC The UPS protecting the BOINC database server and the boinc.berkeley.edu web server started flaking out last night. This didn't affect the former, as it had redundant power elsewhere, but the BOINC web site disappeared a couple times during the evening. A half hour ago we brought the project down briefly to move these servers onto more reliable power in the meantime. March 30, 2006 - 22:30 UTC Okay. We seem to be out of the woods for now. We vastly reduced the size of the result table in the BOINC database. How? By clearing out the results-to-delete queue, and by cranking up db_purge so that it removes all results from the database as soon as they are deleted from disk. Under normal circumstances we keep at least a day's worth of old result rows around so participants can see those recently finished in their personal lists. Over the weekend we'll keep this clamped down for observational and catch-up purposes. Next week we might relax the db_purge parameters to allow completed result rows to linger like before. Anyway.. having a shrunken database has enabled most of it to fit in RAM, which has vastly improved performance for now. 
This will change quickly, of course, as we continue to grow (and tables get fragmented on disk), but we are hoping to obtain more memory (4x2GB DIMMs for a Sun v40z) very soon. As a side note, it should be mentioned that the old SETI@home classic data server (a big ol' Sun E450) has proven itself still quite useful during yesterday's server closet reconfiguration outage. Since it is fairly large and on sturdy wheels, it made a perfectly good cart for transporting heavy rack hardware from the lab into the closet. March 30, 2006 - 00:30 UTC Today we had an outage to finally move some of the newer (and very noisy) hardware into the server closet. Namely, the Sun v40z which has been the main BOINC database for the past year or so and the Dell linux server which is the boinc.berkeley.edu webserver and alpha project, among other things. Moving out of the closet was cyclops, a Sun 450 which held the Classic "non-master" science database. Soon we will be backing up that old data and dropping it. March 27, 2006 - 23:30 UTC We're in the middle of another pathological situation where the result/workunit queues have gotten too big over time and we need to quiet our systems to let them drain a bit. We've run out of disk space for the workunit storage while the database is unable to handle the extra load caused by keeping so many result records around. So we won't be sending out any new work until we have the resources to do so. March 21, 2006 - 20:00 UTC Quick update: As you may have noticed connections to our data servers are spotty at this moment in time. There are two reasons for this. First, we are one day before our weekly database compression/backup and this is when the database is at its worst (it gets fragmented and bloated over the course of the week, so it requires more disk I/O to find data than when the data is nicely compressed in a smaller physical space). Second, we just quickly rebooted the scheduler to attach it to a new console server. 
Outside of that, it's been business as usual the past few weeks. The new data recorder is being wrapped up and tested, and the SETI@home enhanced client is almost out the door. March 2, 2006 - 21:15 UTC So it turns out the master database storage arrays had three drive failures during the long and thorough RAID resync process. We had two hot spares and a spare drive on the shelf. This, along with the fact that the array was RAID 10, means that we shouldn't have lost any data, but the resync process took extra time to deal with these lost drives. Why did we lose so many drives? These are old storage arrays donated to us a while ago, and the disks came with heavy wear and tear. We already had several other disks fail in this system so this is no big surprise. Once everything is resync'ed (in about 20 minutes from the time of writing) we'll start up the master database, check its tables (which may take as long as 24 hours), do some other hardware testing, and if all is well start up the assimilators/splitters again. If not, we might be out for an extra day as we continue to clean up. February 28, 2006 - 21:15 UTC We had a planned outage today to remove a couple more items from the server closet (the Classic SETI@home data server and several large, heavy disk arrays which contained the old science database). In order to safely do so, we wanted to power down several important machines so they wouldn't accidentally get bumped and go down ungracefully. The Bay Area is having a rough winter, and a storm today brought lightning which knocked out power to the entire campus, including our lab, around 8am. Most of the servers went down without a hitch. And with the power off anyway we went ahead and cleaned up the closet as planned. We can now get behind the racks again without painful contortion. Powering up the entire network is painful, as servers need to revive in a set order, and many hidden mounting issues come to light (that only get tickled by a reboot). 
Plus some drives needed some fsck'ing. Everything eventually booted up just fine, except for the master science database. One of the fibre channel loops disappeared on this particular server. Bad cable? Bad GBIC? Not sure just yet, as the terminal wasn't working well enough to give us all the boot diagnostics. We hooked up a laptop and fought with hyperterm to see these messages, but by the time we got that working the machine booted just fine for no explicable reason... but all the metadevices needed to be resynced. This resync could take up to 24 hours, during which the master science database will be down. That means no splitting and no assimilating, and we'll probably run out of work to send before too long. Oh well. February 28, 2006 - 00:30 UTC Just a quick update so you know we haven't disappeared. We've entered a phase of massive cleanup - moving machines around in preparation to put newer ones in the server closet. Since we were cracking the whole system open we figured we might as well bite the bullet and clean all our /usr/local's, update old versions of software, etc. So naturally, everything broke. The last couple of weeks have been spent playing a non-stop game of Whac-a-Mole, trying to fix one minor broken thing after another. You may have noticed some of these failures. For example, the user-of-the-day selection was stuck for a week due to a broken path. There were some other minor issues. One of the assimilators kept crashing with no error messages - after some painful debugging we found it was freaked out by a single corrupt record in the database. But other than that there has been slow, steady progress. The new data recorder is nearing completion (being stress tested at this point), and we're planning to move more old servers out of the closet tomorrow. February 16, 2006 - 23:00 UTC Today we had another quick database backup/compression, and then upgraded the MySQL version again (to the latest 4.1.x). 
It was a painless upgrade, and a couple problems seemed to have cleared up. Most notably, users are now able to "merge computers" again via our web site. This query had been locking up the system. February 14, 2006 - 23:00 UTC We had a couple of outages over the past few days. One was unintentional - we are still having database lock issues involving the "merge computer" function on our web site, and this was turned back on accidentally. Yesterday we had a standard database backup/compression. We're going to begin doing these twice a week as we continue to figure out why we are having throughput issues. Today we replaced a disk enclosure that was part of our workunit storage array. It was a relatively painless procedure except that the system wasn't recognizing the old disks upon restart. Eventually this was diagnosed: The system's qlogic fibre channel card needed a configuration "refresh." This required hooking up a console and simply entering/exiting the BIOS without editing anything. Anyway, the whole system is back up and running now. February 7, 2006 - 23:00 UTC When SETI@home enhanced is available, it will vastly reduce the load on our servers since enhanced workunits take longer to process (increased credit per workunit will reflect this). Until then, the load on our database server is getting worse, and we had to move the Wednesday table compression/backup outage a day earlier. We might move to a two-per-week outage schedule until we can handle the traffic. We've had other server issues over the past week, including some machines that needed to be rebooted to clean up stale mounts and whatnot. The Classic backend server shutdown is aggressively moving forward since we need to squeeze newer production machines into the air-conditioned closet. We are getting too many warnings because of overheating to wait much longer. Right now half the BOINC backend is still sitting in a small lab meant for humans - not loud, multiple-CPU systems. 
The outage today took a little longer than usual as we were dropping the old master science database. This will free up four huge A5000 disk enclosures which we'll probably remove from the closet tomorrow. We also had to add some fields to the signal tables on the current master science database. This is why the assimilators are still off at the time of writing - they are being recompiled and tested against the new tables. February 1, 2006 - 22:30 UTC We took the database crash on Monday as a reminder that we really should upgrade our MySQL engine. We were running 4.0. Today during the usual outage we upgraded to 4.1. After the usual data backup, the procedure couldn't have been simpler. Basically we untar'red the new version, moved the data symlink in place, and started it up. We individually tested each back-end process and it all worked well, so we started everything up. Immediately the database server ground to a halt. We figured this had something to do with new innodb table locking issues and twiddled those parameters. We also rebuilt several indexes, all to no avail. Eventually we discovered the one query behind which everything was piling up: the "merge hosts" update. This is a function on our web site to allow users to combine hosts that for one reason or another show up as multiple hosts in our database. Who could have predicted this would be the culprit? So we shut it off until we figure out exactly why. The reigning theory: this is the only update to an innodb table via php, and we may have to recompile php to pick up newer MySQL libs. Anyway, we're back up now, and will hopefully remain so for the night. Meanwhile, since it wasn't really doing anything else and the server closet has been running hot, we actually shut down sagan today for good. This was the SETI@home Classic data server. So every Classic workunit ever sent and every result received went through its ethernet port. We're not sure what plans we have for it. 
Auction, perhaps? January 30, 2006 - 23:15 UTC The BOINC database went down around noon. It freaked out because it was in an inconsistent state. When this happens it automatically shuts down and goes through a lengthy recovery process. After that we slowly brought the project back on line. This has been happening more frequently due to increasing load on the project. We are planning to upgrade the MySQL version perhaps later this week. January 24, 2006 - 22:00 UTC Well, we had an outage today to move some disks around. A replacement workunit storage server enclosure arrived and it should have been a simple disk exchange. But things kind of exploded and we didn't even get to that. To shut down the workunit storage server we had to first unmount it on all the splitter/download machines. But koloth and kryten were having all kinds of mounting troubles. Eventually we had to reboot kryten to clear the pipes. But then it took 45 minutes to shut down for reasons we are still unclear about. We didn't want to power cycle the thing as the result storage disks attached to it were quite busy doing something. Eventually the disks fell quiet, but then nothing happened for a good 15 minutes. We gave in and powered it down. We flipped the switch, but it didn't power back up. We tried again, and then smoke and sparks came out from around the power cord. Uh oh. The cord was slightly melted. We threw it out, got a new cord and a better surge protector in place, and kryten powered up (phew) but died within 15 seconds. Apparently we had a bad power supply. By some divine luck we happened to have one spare E3500 power supply kicking around in the basement. It was an easy replacement and then kryten powered up just fine. That was a major relief, as we don't really have a good backup server for kryten at this point in time. After careful inspection and remounting everything we eventually came back on line. January 20, 2006 - 23:30 UTC The master science database merge is complete. 
All validated signals ever returned by Classic and BOINC SETI@home clients are currently on one single server. Now we may begin our next phase of scientific data analysis programming. The final step of the merge started on Monday and took longer than expected, resulting in us running out of work to send out. This no-work state lasted for over a day, but we are now back online and serving workunits. The merge actually finished yesterday, but due to compilation issues with new splitters and assimilators we couldn't generate any new work until just now. Of course, like all extended outages, it may be a while before everything "catches up". January 17, 2006 - 23:30 UTC The final big stage of the database merge started this morning, which is why the assimilators and splitters are off. Basically, we are now merging the final bits of data that were collected in the last month (since we started the merge). This should take a couple of days - hopefully we have enough work to send to last that long. Over the long weekend we had a couple of connectivity issues. Both the upload and download servers have some kind of automounter problem where, under heavy load, they suddenly lose a random subset of their mounts. When this happens they still function but in a somewhat degraded mode, and eventually need to be rebooted. We also still have that feeder bug which clogs its shared memory segment so it is unable to cache enough work to send out. Both of these problems are being diagnosed. There are also continuing issues with the lab firewall/routers which are being handled by the lab administrators. Things are much better now in general, but there may be a day or two of the usual "catch up". On the plus side, since we turned off the classic data server, we freed up one of our precious few Cogent IP addresses. 
So we are working toward moving the scheduler onto our private Cogent link, which will keep the bandwidth off the lab routers and therefore protected from those problems. January 12, 2006 - 18:30 UTC We quickly wrapped up another database compression/backup this morning. It only took two hours as the database is now at its minimum size given our current operational status. Some extra time was spent rebooting the database server to change a BIOS setting (that was causing harmless but annoying messages to clog the server logs). We started early as we were hoping there were batteries arriving to replace dead ones in one of our UPS'es, but they didn't show up this morning. Had they been here we would have shuffled some UPS'es around during this outage. Maybe next time. At any rate, since we are nice and trim and all caught up we shall go back to weekly Wednesday outages for database compression/backup. It should be noted we are currently processing 1.2 million results a day, which is far above our original hopes of being able to process 1 million results a day. But apparently our servers are pushing their limits, and certain events can trigger a month-long back-end malaise. Future SETI@home applications will have workunits that take longer to process, which will help. We also hope to acquire newer replacement hardware at some point, like a database replica server so we wouldn't need these weekly outages. The main part of the master database merge is done. We are now planning the shape and scope of the outage for the final part of the merge. Unlike what was stated in the previous tech news item, it may take longer than a day to do this - we want to make this a "partial" outage (during which users can still upload/download work) so it will take some careful planning to minimize any "full" outage time. In other fun news: We finally got around to adjusting Classic credit for users that showed obvious signs of cheating. 
In Classic, it was very easy to cheat the system to get credit without doing any actual work. We ended up partially or entirely removing credit for about 900 of the top 10000 users (all of whom had about 20000 or more credits). Below that there wasn't enough data to show obvious signs of rampant cheating (not to mention enough time and disk space to run the checks on the remaining several million users). These adjusted credits should be sync'ed up with the BOINC databases soon if not already. Case closed. January 9, 2006 - 21:30 UTC There were problems over the weekend that caused internet connections all over campus to disappear at irregular intervals for minutes at a time. We believe this has been resolved. We had another short outage today to further compress the BOINC database and back it up. Each successive compression has vastly helped the database server throughput. For reasons stated in previous tech news items, the database swelled up to 31GB, all because of results that were clogged in the last few queues of the BOINC backend system. It shrank to 25GB by January 2, and 19GB by January 5. After today's compression it was only 13GB. So the backend servers are much happier. We're pushing out data at 80Mbits/sec without dropping any connections. We might even be able to turn our server state counts back on once the initial rush clears. Meanwhile, the master database merge is wrapping up. The spike and gaussian tables are already merged - all that is left are the pulse and triplet tables. We'll still need to do the final cleanup of results/signals sent since mid-December. The size and scope of that outage (or whether or not we'll need an outage) has yet to be determined, but it won't be anything drastic (i.e. probably less than a day, maybe just a few hours). 
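For readers who want the compression numbers from the January 9 entry above in one place, here is a small illustrative sketch in Python. It only does arithmetic on the four sizes quoted in that entry; the function name and the "peak" label are our own for illustration and are not part of any SETI@home or BOINC tooling.

```python
# Database sizes in GB as quoted in the January 9, 2006 entry.
# The "peak" label is ours; that entry does not date the 31GB figure.
sizes_gb = [("peak", 31), ("January 2", 25), ("January 5", 19), ("January 9", 13)]


def reductions(sizes):
    """Return (label, GB dropped since previous size, percent of peak remaining)."""
    peak = sizes[0][1]
    rows = []
    # Pair each measurement with the one before it.
    for (_, prev), (label, cur) in zip(sizes, sizes[1:]):
        rows.append((label, prev - cur, round(100 * cur / peak, 1)))
    return rows


for label, dropped, pct in reductions(sizes_gb):
    print(f"{label}: down {dropped} GB, {pct}% of peak size")
```

Each of the three compressions removed 6GB, leaving the database at roughly 42% of its peak size.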
January 5, 2006 - 23:30 UTC We're back on line after a day-and-a-half long outage during which we were able to clear out enough of our BOINC database to get us back to the same levels as in early December. Of course, there's going to be a painful period as all the BOINC clients clog our servers with demands for work. This will push through eventually as it always has. Meanwhile the master database merge is going along swimmingly - the result and signals tables are copying over much faster than expected. January 4, 2006 - 21:30 UTC Well right now we're in the middle of another self-imposed outage to clear the pipes. The BOINC database compression on Monday helped a bunch - enough that the final troublesome queues were draining, but not nearly as fast as we would like. Here are some numbers: Normally the entire BOINC database, when unloaded to an uncompressed ASCII file, is about 17GB. This is largely due to the huge workunit/result tables. Because of assimilator/file deleter issues, by December 21 the database swelled up to 26GB. On December 28 it was 31GB. Unable to purge old workunits and results, our database ended up nearly double its normal size! No wonder we were in a world of hurt. After the new year we recovered a bit, and the compression on Monday brought it back down to 26GB. But we still had this annoying backlog of a million workunits and four million results, and until we got those purged, we weren't going to get any smaller. This morning we turned off the scheduler to allow these queues to finally drain. We estimate it will take until tomorrow morning to clear everything out once and for all. Better to just bite the bullet and fix it rather than watch and hope for improvement during the coming weeks. When the queues all hit zero, we'll do one more quick database compression/backup and get back to work. January 2, 2006 - 19:15 UTC Happy New Year! 
The holiday season has been a bit of a headache, as several nagging problems kept the BOINC backend from running optimally. Luckily, most of us were around town and able to stop/start/reboot/kick things as needed to keep the project rolling as much as possible. Most of the issues stem from an excessive load on the BOINC database. Remember that the BOINC database is the one that contains all the information pertaining to the distributed computing side of things: like users and teams, but also cursory workunit and result data for scheduling and sending/receiving purposes. We are still in the middle of the master database merge (see below for more information about that). This is an entirely different database that contains all the scientific products of SETI@home (both Classic and BOINC). So while we are busy merging the old and new scientific databases into one, this has no bearing on the problems people are having connecting to our servers, posting messages on the forums, etc. The merge process will be continuing for many weeks, in fact. In a nutshell, the BOINC database issues started when we built up a large "waiting to assimilate" queue in mid-December. Then we got hit with, among other things, an influx of new users, a network outage beyond our control, a failed disk, a full disk, a spate of noisy workunits, and a database crash. All events were handled effectively, but the queues weren't draining as fast as we wished. The sum of all this ended up being large, unwieldy tables in the BOINC database, as old workunits and results weren't being purged and more entries were being inserted. All the backend processes that enumerate these tables (the validator, the assimilator, the file_deleter, etc.) slowed down. It got to the point that just doing a "select count(*)" on these tables would take 30 minutes, which is why we shut off the counts on the status page. To help with all this, we rebuilt some of the backend processes. 
Those who pay close attention may have noticed that workunit names have changed over the past week or so. It used to be a tape name followed by four dot-delimited numbers. Now there are five numbers. The new number (which is currently "1" for all workunits) is a scientific configuration setting. Having this number in the name saves us two expensive database reads to look up these configuration settings. This change vastly improved the assimilator throughput, but we were already mired in the problems listed above. Without this change, though, we would have been dead in the water, as the deleters would back up behind the assimilators. We would have then run out of workunit space, the splitters would halt, and no new work would be created. Adding insult to injury, we found the feeder has some kind of bug in it. The feeder is the process that keeps a stash of results in shared memory that the scheduler reads to find out what to send to clients requesting work. Over time the feeder gets less and less able to keep a full stash. Eventually the feeder can't keep up with the scheduler's demand for results, and then clients get "no work available" messages. These clients retry quickly, and these extra connections cause stress on the server, which then starts dropping these connections. So every day or so we've been restarting the feeder to clear out its clogged shared memory segments, and that temporarily improves connectivity. We're looking into it. Since the last database compression/backup on Wednesday, we purged about 8 million results from the result table. So we decided to have another compression/backup outage today (Monday) to reap the benefit of a much smaller result table sooner rather than later. |
| Copyright © 2009 University of California |