Technical News


The news items below address various issues requiring more technical detail than would fit in the regular news section on our front page. These news items are all posted first in the Technical News discussion forum, with additional comments/questions from our participants.

(available as an RSS feed.)


23 Nov 2009 22:46:08 UTC
How about that? We made it through the weekend without a server crash! We haven't done much to improve the situation, so maybe we're just getting lucky (or maybe we've just been unlucky). Anyway, we've been happily shovelling data through the pipeline and collecting results.

However, we're still working on getting the corruption out of the science database. Every step takes a long time (days), as we're playing a large shell game with a database table that is reaching 100GB in size. That doesn't sound like much in some regards, but this is all being done on a row-by-row basis, plus we have to ensure data integrity at each step, etc. It's slow.

Back to the mysql database for a second - one thing we'll try tomorrow is moving mysql to commit-on-every-transaction behavior. Normally now it commits either once a second, or when the buffer is full. We tried this before and it was a major failure - the disks array on jocelyn couldn't handle it. But now we're on mork, where the logs are on solid state drives. Worth a shot. Normally we're processing hundreds of queries per second - so this new behavior will prevent up to hundreds of queries from disappearing during a crash, not to mention keep the replica in sync as well so we don't have to go through the painful exercise of recreating it every time the master goes nuts.

Still.. I admit I'm feeling fairly certain that we won't be able to stay this way very long and have to revert back to our current behavior. It'll be fun to try, though. This may make the recovery after the outage more painful than usual.

It's also rapidly approaching beg-for-donations season. A mass e-mail probably won't happen for a couple weeks (given everybody's holiday schedules). Once again it's up to me to figure out how to squeak out a large pile of e-mails before we're (wrongly) spam blocked - a mystical art.

Also, for our non-U.S. folks, this upcoming Thursday is our Thanksgiving holiday, so please forgive the short work week in advance.

- Matt


17 Nov 2009 22:48:26 UTC
Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.

We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.

I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.

Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.

And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there's both pros and cons going this route. I'm currently outvoted on this front, so we stick with the status quo.

- Matt


13 Nov 2009 0:01:13 UTC
Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that on line before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can just delete it using sql - we have to dump the entire database fragment (containing 25% of all the ap signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them.

- Matt


10 Nov 2009 22:58:34 UTC
Today's Tuesday - that means we had our normal weekly maintenance outage, and we're recovering from that now. Outside of the normal database compression, backup, and log rotation type tasks we also took care of the following:

1. Replaced the faulty drive on thumper (the primary science database server). This system is on Sun Service so such hardware failures are trivial. A drive fails, we call Sun, they send us a new drive right away, we plop it in, we send back the old drive, done. However there are still nagging problems on thumper at the OS/database level that still require our attention (a corrupt row in the Astropulse signal database and that funky root/RAID configuration that can only be fixed during a clean OS install).

2. Upgraded mysql on both the master and replica servers (mork and jocelyn) to version 5.1.37. This was finally made available in the Fedora distros and from what I've been told may fix those unload/reload formatting bugs. While we were at it, we yum'ed up pretty much everything.

3. Rebooted mork and ptolemy to pick up crash-dump parameters for the kernels. We were going to install debug versions of the kernels but Jeff was having odd results with that while testing one on his desktop, so we're holding off for now. Rebooted jocelyn to pick up a new kernel as well.

That's about it for the outage. Recovery will continue for a while. I'm rebuilding the replica mysql database right now using the dump from today. When that's finished we'll start up the replica (maybe tomorrow morning).

Speaking of tomorrow morning, it's a holiday (Veteran's Day), so I won't be up at the lab (probably just doing the usual "check in from home every few hours and tweak this and that").

- Matt


10 Nov 2009 0:24:48 UTC
Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a "fluke" - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of "grave concern" about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup, Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've have longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that may hopefully give us clues if mork decides to disappear again.

- Matt


5 Nov 2009 22:53:58 UTC
Eeeeoooo. Looks like this minor corruption in the science database is really snagging us, at least right now. We're talking one or two rows of the zillions in the astropulse signal table - but informix isn't being very informative about which row or two, nor what to do about it. Meanwhile, this broke the replication of astropulse - or at least we think it broke replication. This may very well have failed for some other reason.

This hasn't been a public data flow issue - we can still split/assimilate multibeam and astropulse work for the most part. Still, it's been preventing us from doing any science for a while now. So it's roll-up-our-sleeves time. We're doing a more robust table check (and hopefully repair) overnight tonight, and had to shut off astropulse splitting for now. Which means only multibeam workunits for the near term.

Meanwhile we filled up the raw data drive during all this software blanking analysis. I forgot to carry the one or something. Anyway, no big deal, some minor cleanup this morning, and we're back on track with that.

- Matt


4 Nov 2009 23:28:41 UTC
Our internal file server ptolemy crashed again early this morning and Eric had it rebooted by the time I got in. This is getting to be more than a minor concern. We're going to start collecting kernel crash dumps so we can at least get a clue what's wrong if this happens again.

Informix tweaking continues. Some page corruption did get uncovered during the last science database backup, probably due to the RAID hiccup last week. Not a big deal, but that's just another thing on the list of "maybe that's the problem" when trying to get the database to do anything outside of the usual splitting/assimilating.

Meanwhile, version 2 of the raw data pipeline is getting more and more automated - you'll should see a few more files appear on the to-split queue throughout the evening without any intervention from me.

- Matt


3 Nov 2009 22:13:35 UTC
Tuesday is our outage/maintenance day. This was the first database compression/backup using the solid state drives on mork for the innodb logs - there are a lot of variables at play (like the result table only being 80% the size it was last week), but at first glance it seems like that alone shaved quite a bit off the compression time. Cool. Bob also tweaked another informix parameter, bounced the science database, did some table checks, etc. - maybe this will improve our science database performance (which has been strangely prone to "locking up" as of late). Or maybe not (after restarting the project we still had some queries lock everything up - some work still to be done, I guess).

I also got a couple scripts in order such that I'm getting on top of the data pipeline again. Hopefully we won't run out of workunits again as badly as this past weekend.

Just got back from a meeting discussing the university's current furlough plan - yeah, due to state budget cuts we are being forced to take days off - a kind, gentle way of enacting pay cuts, but not pay cuts really in our case - since we aren't paid by state funds (it's all donations) we are only being forced to take days off for "parity" but SETI still gets to keep its funds. Fair enough, as I understand we're all swimming in the same bowl of soup and belts are being tightened all around. And I already take several days off a month without pay, so in my particular case it's a complete wash.

- Matt


2 Nov 2009 22:57:50 UTC
In case you haven't noticed, we've been low on workunits. As warned in several previous tech news items (and now on the front page) we're still in the process of converting our data pipeline to use the new radar blanking suite (to vastly reduce noise/interference). This conversion process has been slowed by several factors, including these two: it takes a long time to bring up old data from our archives (approximately 4 hours per 50 GB file), and it turns out a lot of these files contain garbage that make it impossible to process (which we can only discover after spending the time to bring the files up here). We are also low of current data because ALFA has been offline for a month due to maintenance.

In better news, ALFA is back up and we're collecting new data again. As well I moved the "testing phase" version of the data pipeline onto the main production data file server, which should generally help as we'll at least speed up disk i/o. Also our assimilator queue finally drained to zero again. I see that people are complaining about lack of work on various threads. We don't guarantee a steady stream of work, but do understand that such a steady stream is important for maintaining public interest. We're doing what we can. I'm getting another file on line as I type this - should be splittable (I hope) sometime this evening.

Our science database server (thumper) lost another disk over the weekend. No big deal, and the RAID recovered with a spare just fine - but nevertheless this is just another reminder that we really need to reconfigure the disk arrays on that system - they are unwieldy and inefficient.

- Matt


29 Oct 2009 21:35:05 UTC
As predicted the data well temporarily ran dry overnight, but I'm trying my best to keep up with demand today (and set it up for over the weekend).

Weird thing today - I've been noticing intermittent problems connecting to the science database to make the most trivial queries. We thought this, and the assimilator queue backing up, were probably due to Bob's recent configuration changes to the informix database engine perhaps not helping so much. But then I noticed one of assimilators was inserting thousands and thousands of signals as fast as it possibly could from a single result file... since 7:40am yesterday morning!

This is not normal. Result files usually contain a handful of signals, maybe a few dozen tops. If they reach 30K in size they are automatically "cut off" and sent back to us. I tracked down the result file with all the signals - it was 1.6 gigabytes in size! Not sure how this happened, nor how it passed validation (though I have my theories), but it sure contained a lot of signals repeated over and over and over again. I moved that out of the way and hopefully that'll improve performance in general around here.

- Matt


28 Oct 2009 22:43:56 UTC
Jeff is back in town and back in action here at the lab. He's now working on the NTPCkr/RFI stuff (which has been languishing due to lack of effort and the science database throughput woes which I've been alluding to lately).

As predicted, I did finally get the astropulse version of the splitter to compile (just some library/linking bugs that had to be hunted down and exterminated). So astropulse workunits using the software radar blanking system are going out! Meanwhile, I hit some more management snags with the multibeam stuff - I'm trying to blank/split really old files which we recorded before we had all the kinks worked out. Long story short, some files I spent a lot of time (days) pulling up from our archives and doing the first stages of radar analysis are unsplittable. Darn. I was hoping to just get beyond the dearth of data in the nick of time, but it looks like I got to pull more files up from the archives, and we'll run a bit dry before they are splittable.

Today is a particularly windy day, which means it's fairly clear. Here's a picture taken from my iPhone looking out from the lab patio onto the Bay. That the Lawrence Hall of science directly below me, then downtown Berkeley, then the Bay itself, then San Francisco, the Golden Gate Bridge, and the Marin Headlands in the distance. The detail isn't so great, so you can't see that the Bay Bridge is completely devoid of cars right now (it's shut down due to technically difficulties), which is quite rare and quite odd.



- Matt


27 Oct 2009 21:59:49 UTC
As many of you already know, Tuesday is the regular outage day where we dry clean the mysql database and pack it down tight. We're recovering from it now. Today I also did some testing of the newly employed solid state RAID 1 on mork (the master mysql server). It seemed fine, so this device now holds the mysql/innodb logical logs, thus resulting in far less competing writes with the data RAID 10 (where the logs used to be kept). Will this help much? I dunno. A non-zero amount at least.

I'm still assembling the new data pipeline. Got a few files in the queue now for multibeam analysis, but I can't seem to get a new astropulse splitter to compile. I need to recompile so that it reads the software radar blanking bit instead of the hardware one, but I'm hitting some library/include issues. Sigh. One of those problem you know you'll get working eventually but right now the path isn't exactly clear, and everything will be annoying until you finally get a successful "make."

- Matt


26 Oct 2009 21:51:34 UTC
Okay, so where are we... Over the weekend the raw data queue shrunk down pretty far, but don't fear. Astropulse ran out of work to do, and multibeam has maybe another day or two, tops. Meanwhile I'm working behind the scenes actually splitting a bunch of software-radar-blanked data from 2006. This is actually going out now to people, but just doesn't show up on the server status pages. I'd have to do some minor hacking to get these files to show up on that page, but that'll be moot fairly soon as all data will be software-radar-blanked and I'll just point the script to look in the new data directory (as opposed to having it look through two directories and figure out the combined status of everything).

Anyway, there's that. We might run a little dry over the next few days as I'm still scraping together disk/memory resources to get these old files pulled up from the archives, analysed, and embedded with the new blanking signal. Only then can these files be split into workunits. I'm working on it.

Meanwhile, we're still having sporadic problems with informix locking up on us. It's getting to be really frustrating, as you don't really notice anything is wrong until the workunit queue runs dry or something like that. The idea of migrating to another database engine is on the table again. Also, bruno was having some nagging mount issues so I just now rebooted it. You may have noticed the whole project disappearing for a half hour there. That was me.

Rumor has it Jeff is back in town. He was away for several weeks hiking in the Himalayas. I imagine he has jet lag and other kinds of recovery to deal with, and he'll appear maybe later this week.

- Matt


22 Oct 2009 20:52:51 UTC
Eeewww. Last night ptolemy (an internal-use file server) crashed. Eric rebooted it this morning, and I still had a bunch of cleanup to do after that which took me until just about now. Other systems had to be rebooted, nfs/autofs daemons kicked, stale trigger files removed, etc. I also bounced informix as it seems like the science database was locked, but this happened to be two different coincident problems, one affecting the splitters, one affected the assimilators, and making it seem like both were hanging on the science database.

The latter problem was a real nuisance. I had to reboot vader, mess around with iptables/network configs, /etc/exports, etc. all of which seemed to do nothing. The problem was that vader couldn't mount the result storage device (which is exported from bruno) while all other systems had no trouble mounting it. I never figured out the exact problem, but yum'ing in the latest nfs-utils package seemed to massage the right muscle and suddenly it was visible on vader. Fine. Everything is sort of catching up now. Bob also got the mysql replica in working order again, so that's good.

Hopefully this isn't a sign that ptolemy is on its way out... Ugh.

- Matt


21 Oct 2009 22:50:09 UTC
The mysql replica was turned back on today, then turned back off - Bob noticed it was still misconfigured, so he's re-recreating it and will probably turn it back on soon (within the next 24 hours).

Meanwhile, the software blanking pipeline is still warming up - actually some workunits from 2006 will go out (secretly) very soon. It's hard to tell how fast I can get this data up from HPSS and blanked. It all may be too slow to keep up with workunit demand, but we'll do what we can. It's hard to automate this stuff - I find the more I automate things the more time I spend cleaning up large, unexpected disasters.

- Matt


20 Oct 2009 22:19:49 UTC
Recovering from the weekly maintenance outage right now, during which we took care of a couple extra things (above and beyond the usual mysql database compression/dumping). Eric replaced a failed root drive on his hydrogen database server. While he was at it, he upgraded the system's OS (it was way out of date). Meanwhile, I took the opportunity to finally bite the bullet and remove the SETI network's reliance on this server, as it hosted (for only historic reasons) the 32-bit libraries for informix - so when this server went down pretty much everything hung waiting for it to return. So this pointless dependency is no more, which is a bit of a relief.

I also added a couple recently donated solid state drives to mysql master database server mork, if only to create a tiny RAID1 on which to put mysql logs, and thus hopefully reduce disk contention on the data drives (which currently also hold the logs). We'll implement that new mini RAID over the course of the coming week.

Also, it turns out mysql replication was broken for the beta project this whole past week. Oops. So tomorrow we'll start the recovery of that (using today's mysql dump). I also turned of "show tasks/results" as the project recovers. Maybe I'll turn that on tomorrow after the smoke clears.

I'm still pulling up files for future software radar blanking analysis/processing. It's really slow given our various network bottlenecks (real or imposed).

Oh yeah.. I guess this is also technical news: Most days I take the train to downtown Berkeley, walk across campus, then ride the hill shuttle up to the lab (which is 1.5 miles up a very tall/steep hill). The shuttle's brakes failed on the way home - or at least showed enough signs of pre-failure such that the driver refused to go any further. He called dispatch to get another bus, but nobody was responding to his pleas. Given my tight schedule (and lack of cell phone service on the hill) I had no choice but to walk all the way down the hill myself, which wasn't the first time, and was no big deal - just terribly annoying.

- Matt


19 Oct 2009 21:03:04 UTC
Happy Monday. We had some "brown-outs" during the weekend brought on by our science database getting clobbered. We're still not exactly sure why it locks up the way it does, but we'll improve the underlying disk i/o subsystem someday, and that could only help (usually when it has fits I find the respective disk arrays are almost to completely 100% utilized).

It's another rainy day around here, which means the air conditioner isn't as efficient, and the server temps are on the rise again. Scary, but there's not much we can do about it right away.

I'm actually pulling up old (2006) data from our off-site archives to be the first "production" data processed using the software radar blanking. We shall see how well it works (in both multibeam and astropulse) later these week, I imagine, if all goes well.

- Matt


15 Oct 2009 19:08:57 UTC
FYI, the replica database server caught up and I turned the result views back on, etc. There are complaints that some results may have disappeared from the beta database. Bob and Eric are looking into it.

In the last thread, this old post of mine (from 2005) was quoted to point out the comedic irony that little has improved despite my claims:

> The SETI@home Classic backend is a tangled mess. There have been many problems over the years, most of which were invisible to the participants. None of these problems were fatal to the project or its science, but have resulted in an obnoxious web of ridiculous dependencies, confusing configurations, and unweildy databases. I am practically drooling dreaming of day when we get to turn all that stuff off and be done with it already. The BOINC backend is sooooo much easier to deal with.

I can see why this is funny (and I agree that it is), but allow me to point out in case people want to use this as some sort of sign of failure:

1. With the old server backend there was 0% chance that science would ever get done. Things like the NTPCkr were impossible in the old days.

2. We had a larger staff at the time of that post. Since we are currently working with less labor resources comparisons are unfair.

3. Our uptime has been much better since moving to BOINC, and downtime has been far more productive (users can work on other projects, etc.).

4. Yeah I admit there are still ridiculous dependencies, confusing configurations, and unweildy databases, but it's a completely different set than what I was referring to back then, and generally things are better across the board.

- Matt


14 Oct 2009 17:51:33 UTC
Got finished kinda late yesterday hence the lack of tech news report. So last week we had lots of database cleanup to deal with due to server crashes over the preceding weekend. The mysql replica database suffered quite a bit, so we planned to recover it using a standard mysql dump file, except we discovered that the latest version of mysql is buggy and the dump files sometimes contain syntax errors. Great.

So this week we recovered the replica by copying all the myisam and innodb data files (and logs) from mork (the master database server). We actually did the first rsync on Monday to help speed things along on Tuesday, but there are so many large files it took forever to even to just do the "delta" rsync. That's why the outage was so long (this final rsync could only happen when the master database was quiescent).

This morning Bob and I made sure all the right config tweaks were made in /etc/my.cnf and started up the replica server. Only one minor snag at first which we fixed, now it's running again and catching up! We still have to figure out how to get mysqldump to work 100% of the time syntax-error-free. That's actually kind of scary.

Meanwhile, the Bay Area was hit with a record breaking storm yesterday. Yeah, I grew up in New York so I can say it was only really a "storm" by Bay Area standards but still we had low temperatures and high humidity. This wreaks havoc on our air conditioner, and the server closet has been hovering a few degrees away from disaster for a couple days now. In fact, ewen (Eric's hydrogen survey server) just lost a drive in the root RAID. It shouldn't be a big deal to replace, except that when ewen goes down for maintenance everything hangs (as there are lots of informix libs living on that system - we really need to move them off but have been loathe to do so for fear of breaking something else).

- Matt


12 Oct 2009 23:02:21 UTC
The latest software blanking tests were also a success, so we'll start putting older pre-hardware-blanked data into production, now that we can remove the radar. Yay! May take a few days to rev up this engine. Meanwhile Eric has been making progress on the "zone RFI" rejection software/algorithms, so we can start getting rid of the garbage that makes up our current "top candidates."

The mysql replica was pretty much rendered useless by all our poking and prodding last week. We'll recreate it from scratch tomorrow (we hope). We are still concerned that we suddenly don't have a reliable backup mechanism, if mysqldump occasionally gives us dumps containing hidden syntax errors!

- Matt


7 Oct 2009 20:35:19 UTC
The replica recovery is on hold for a while. We've experiencing random, intermittent issues when trying to recover one database with a mysqldump from another. This used to always work perfectly, but then something in mysql 5.1 screwed up the quoting and backslashing. I was able to get around this before by writing a script that parsed the large dump files one line at a time, but even that isn't working now. Bob has found other complaints on the web about this, so maybe there's a bug fix somewhere (we're certainly not going to pore through 20GB of ascii looking for missing backslashes and whatnot). We might have to do another dump from scratch, which won't happen until next week, which means the replica may be offline until then. Still, when we recover from the outage (and the weekend backlog) I might still turn on the "show results" flag so users can see recent result history, etc.

I'm working on a second test file to help solidify our warm feelings towards the software radar blanking suite. This will get split/sent out tomorrow (unless I suddenly disappear on an impromptu vacation, which is known to happen from time to time).

Due to popular demand I put a little effort in this morning to cleaning up the technical news main page - it's been a while since I have done so, so the page has gotten rather large and painful to load for people on slow connections.

- Matt


6 Oct 2009 22:43:01 UTC
Quick post-weekly-outage wrapup: everything went fine, albeit a little slow given recent events. The replica recovery is going on now. Hopefully it'll continue along safely overnight and we can turn the replica back on sometime tomorrow.

One hilarious note. All our server reboots over the weekend dislodged several instances of sendmail, which then went on to send forth unexpectedly large queues of cronjob/server related e-mails to me, Jeff, and Eric. We're talking about 35,000 e-mails, all of which went through the lab spam firewall first, thus clobbering everybody's e-mail in the entire Space Lab for about 24 hours. Fun.

- Matt


5 Oct 2009 20:43:16 UTC
Okay that was an ugly weekend. On Saturday morning I came to realize that our master mysql database server (mork) had crashed. I was the only one available at the time so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing though that the replica would be out of sync (and need to be regenerated from scratch during the next weekly backup).

But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.

So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter, if not all, of our problems. We have a pseudo user account is the "user" that runs a lot of stuff, apache processes, cron jobs, some of the BOINC back end servers, etc. For some reason the .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file all these dams broke free and we were able to safely recover all the databases/etc. throughout our long morning.

- Matt


1 Oct 2009 19:41:59 UTC
Some random news items as the work week winds down. First, we did finally get some data drives from Arecibo - the last of them until we start observing again in early November (at the earliest). So that'll tide us over for a short while. Second, it seems like the third time's the charm: preliminary results from the third software radar blanked data test are looking good! We might roll this into production as early as next week. This means we can start analyzing a wealth of pre-2008 multibeam data that was otherwise useless.

We're still having some science database throughput issues that's keeping us from running the NTPCkr as much as we'd like. More and more this is becoming my number one priority.

- Matt


29 Sep 2009 21:36:20 UTC
Hello all - usual outage day again today. It's an interesting battle between our two mysql database servers. Okay maybe not that interesting. But mork has far more RAM, and jocelyn has a much faster disk array. And we see what we expect - mork is a much better master server as it can hold the database in memory and do all kinds of random access, but during the outages jocelyn does its database table compression much faster, as it involves a lot of sequential writes to disk. Anyway, we're back up - not much shakin' on that front.

We did have an outage last night for an hour. This was a known event involving some network infrastructure maneuvering down on campus. It was unclear how long this would take, so we didn't bother with any kind of panicky warning on the home page that we were going to be down for an unspecified amount of time. I think you're all used to that by now anyway. Plus, the good news is that this was one more task out of the way such that campus can get back to determining our bandwidth upgrade needs.

I found yet another radar blanking bug. I least I'm *finding* the bugs I guess, and it's much easier to fix them once they are spotted. Anyway, iteration 3 will commence sometime in the next day or so.

And thank you to Tiaan Geldenhuys, who donated a bit of javascript to our NTPCkr page such that if you zoom in on the Google skymap you'll see the border of the "pixel" which makes up this candidate.

- Matt


24 Sep 2009 19:29:14 UTC
Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).

To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually required more rigorous testing has changed, thus obviating the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I did this in beta it would take about a month (literally) before I would have realized there was a problem.

To sort of answer another question: the software blanker actually finds two kinds of radar - FAA and Aerostat - the latter of which hits us less frequently but is equally bad when it's there. The hardware blanker only locks onto FAA, and as we find misses some echoes, goes out of phase occasionally, or just isn't there in the data. Once we trust the software blanker, we'll probably just stick with that.

On the upload front: Sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good looking result files, and the database queues are normal/stable. Also Eric has been tweaking this himself so I didn't want to step on his work. Nevertheless, I just took his load balancing fixes out of the way on the upload server and put my own fixes in - one that sends every 4th result upload requests to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP specific or something like that...

I'll slowly start of the processes that hit the science database - the science status page generator, the NTPCkrs, etc. We'll see if Bob's recent database optimizations have helped.

- Matt


23 Sep 2009 20:46:09 UTC
Had more science database woes at the end of the day yesterday - processes (including splitters) getting logjammed. I'm hoping a couple "update stats" commands will fix all that.

Speaking of splitters, I'm actually running (drumroll please) the first software radar blanked data through a splitter right now, and workunits will be distributed to the public fairly soon. This is still in test phase - we shall see if the software blanking performs better than (or worse, or the same as) the hardware blanking. I'm guess with a couple tweaks here and there my code will be far better.

- Matt


22 Sep 2009 20:43:14 UTC
Today was an outage day, with nothing special to report on that front. One interesting note is that our master mysql database server (mork) has 24 processors and 64 GB of memory, and the replica server (jocelyn, which used to be the master) has 4 processors and 28 GB of memory. Eric recently cleaned out really old rows from the beta result table - now the entire database fits better in memory on jocelyn, and in turn this database engine generally performs better than mork. How could this be? Because despite have far less memory and processors, jocelyn has more disk spindles (and faster disks, for that matter) than mork. Not really all that surprising, but it's fun to see our suspicions about disk performance confirmed with memory being less of a bottleneck. In any case, both servers are zippy and today's outage wasn't very long, was it?

So the weekend went by with nary a blip, or even a single alert from my web of alert scripts. This pretty much never happens. We always get kind of warning, severe or otherwise - high load on this server, replica database is falling behind, rising temperatures in the closet... but nope. Everything was just fine.

However yesterday we did have one short traffic dip due to the science database getting locked up on too many internal user queries, so the splitters weren't creating work for a couple hours there. No biggie - we killed the queries and informix sprung back to life. It is a bit worrisome how locked up the database can get, though, and it's hardly predictable when (or why) it does.

I'm actually running my software radar blanker through an entire 50GB test file right now. It processes in roughly twice real time (meaning a file containing n hours of data takes 2n hours to find radar and blank it). Not to worry - we can run many of these in parallel. I could also make several code optimizations if need be. Anyway, I'm hoping by the end of the week to trust this suite of software enough to start processing our large backlog of 2007-2008 data by next month.

Oh yeah one more thing - we do know that "queries/second" field is blank on the server status page. For some reason the same exact informational query on one server returns in a different format
than the other, so our general "db stats" script is sorta broken. Bob is fixing it.

- Matt


17 Sep 2009 19:55:10 UTC
As Josef pointed out in yesterday's thread we are indeed unable to get any new data from the telescope until early November. This is a problem because we have only a few drives full of data on our shelf, and maybe a few drives down at Arecibo (which we asked to have shipped up to Berkeley).

The silver lining is that Jeff has been putting effort into getting the data recorder crashing issues fixed - now that project can be back-burnered and he can focus on RFI issues. Meanwhile I'm cracking on the software radar blanking stuff. I actually made a significant advance this morning, discovering that at any given time the radar patterns we are locking onto can drift as little as 0.1 samples, with drastic results in our ability to find the radar. I've solved that little bit, and it's all pretty much plumbing/testing/deploying at this point. Hopefully I can get this rolling before we completely run out of data. Of course, I always feel that running out of data shouldn't be that big a deal.

By the way, one of the reasons I've been lax with these threads lately is that I'm getting tired of being the sole focus for tech support/donation queries/etc. Please don't be insulted if I address roughly 0% of your requests that are personally addressed to me. I simply don't have the time. I keep asking for additional web presence and user interaction from the others or perhaps the hiring of actual web support staff, to no avail.

- Matt


16 Sep 2009 20:53:41 UTC
Hello again. Sorry about the lack of information lately. I was out sick a large chunk of last week.

Anyway... it's been business as usual more or less. The raw data pipeline really shrunk down but fresh data finally arrived from Arecibo, so we were able to flood the queues again. But I see that we're in a period of about two weeks of zero observations, so we might tighten the belt again before too long. The new mysql setup (mork as master, jocelyn as replica) has been working quite well the past couple of weeks. We have another mork-like server (tentatively called mindy) but, like most of our equipment around here, was a donated system of unknown quality. Several hours of fighting with it yesterday makes me believe mindy may be a dud (processor errors during boot, etc.).

There have been complaints lately about uploads. I don't see any immediate problems on my end. I see files appeared on the server at the normal rate. The traffic graphs don't show anything vastly awry. Eric's been messing with the apache/balance settings on that system so I defer all questions to him.

Eric and Jeff are working on the first gross-level RFI removal infrastructure. Once that's in place the NTPCkr data will start making slightly more sense (the top candidates are all pretty much junk right now). Until then, I will only upload the top ten list by hand every so often.

- Matt


3 Sep 2009 17:32:56 UTC
Sorry about the delay in posting. I've been around, just busy. Those interested in more info should note that we are posting general weekly meeting updates at seti.berkeley.edu.

Outside of lots of little network/system hiccups which have been addressed in our usual whac-a-mole manner, there has been continuing data pipeline issues. The data recorder at Arecibo has been crashing, seemingly randomly. This wouldn't be a big deal but it requires human intervention to reboot, so when it locks up at night, we can miss hours of data. Meanwhile, our reserves are pretty much running dry. We do expect a shipment of at least 4 full data drives by early next week. We may run out of data of the weekend, but that's okay. And yes we are aware of splitters stuck on certain files.

On a more positive note, server mork (a new 24 processor/64 GB RAM intel system) is working beautifully as our master mysql database server (handling a sustained 2500 queries/second without breaking a sweat). Meanwhile we reconfigured jocelyn to be the replica server now. There are some gotchas we've been working around so not all pieces have fallen into place on that front, but we're close. The former replica server, sidious, has been retired (it's actually powered off and sitting on a lab bench).

I haven't updated the NTPCkr candidate list in a while as the candidate scorer program seems to lock up the primary science database. I'll mess around with that today (mainly trying to force it to connect to the secondary science database server).

Little progress on the radar blanking front, though still non-zero progress. Finding the time is difficult.

- Matt


19 Aug 2009 22:07:01 UTC
Okay. Spent a large chunk of the day hacking the last final bits of the NTPCkr web page together and made it available for public viewing. Yippee! There's a link on the front page in the news section if you're looking for it.

There's still a ton more work to be done on this page, as well as the NTPCkr itself, and this is still just the first step in many as far as final data analysis is concerned. We haven't even touched radio frequency interference removal yet (outside of the tools we already have from other SETI projects that we could retrofit for SETI@home). Still, it's a (seemingly rare) major step in the right direction around here.

I also had a code walkthrough with Jeff/Eric about my radar blanking difficulties. Eric had several good things to try, which I'll get started on once I post this message. Actually I might look into the stuck science status page first...

- Matt


18 Aug 2009 22:41:06 UTC
Outage day, usual drill: shut everything down, back up the mysql databases, fire off a science database backup as well while we're at it, compress the mysql tables (which get fragmented over the course of a week), and start everything back up. As far that was concerned, everything went smoothly.

However, we were hoping to hook up a couple extra solid state drives to the new replica server mork. The plan was to put some mysql logs on these drives to help unload extra i/o from the rest of the database drives. We got all the hardware in place and hooked it up today, only to find the server BIOS wasn't seeing these drives. In the time allotted for this task I determined this was either due to (1) bad cables, or (2) motherboard weirdness. Since this is an Intel donated server with an "experimental" motherboard, all best are off. I did prove we could see the SSDs when I swapped cables around, but given the current setup we couldn't run normally like that (long story). In any case, I think we're fine without these drives for now, and may still go along with the plan to make mork the master next week.

Other than that, radar blanking woes continue. I'm going to have Eric and Jeff look at my code tomorrow and point out what I'm doing wrong, if anything. I also hope to get some version of the NTPCkr page online tomorrow (he says with little fanfare).

- Matt


17 Aug 2009 21:22:19 UTC
Okay things haven't been running so well the past couple of days. First, there were some mount problems in the middle of last week which caused our assimilator queue to clog up. This inflates our result table causing all kinds of table fragmentation which never helps the general pipeline. Later in the week I noticed the spike table in the science table was running out of space, so Bob added a few more database chunks. That process eats up a bunch of disk i/o, causing splitters/assimilators to slow down temporarily. But then we hit some major chokepoint causing work production to grind to a halt.

Actually it was worse than that - things were working normally, but only really slowly. This makes it hard to find an obvious smoking gun. Usually this is a symptom of heavy disk/database i/o on thumper. We were testing all that this morning by turning processes off but to no avail.

So.. remember how I mentioned in my last note how we just got new raw data from Arecibo? Well, the script copying it over to the raw data storage server failed to register the file system was full, and packed it up tight. Turns out this caused the storage server some distress, and when I finally checked into it this morning the load was high and all the nfsd's were in disk wait. I deleted one excess file, the nfsd's sprung to life and the whole dam broke, the splitters charged full steam ahead, and the network bandwidth is now tapped out trying to catch up on demand. Fair enough.

- Matt


13 Aug 2009 20:22:28 UTC
I was actually out the past couple of days. Family stuff, including an adventure where we had to tow our Prius almost 100 miles back to Oakland (it freaked out and lost power on I-5). It's in the shop now - luckily these newfangled cars store debugging information so they were able to locate the problem (flakey potentiometer causing erratic accelerator information, and as a failsafe the Prius cut its own power).

Anyway.. during the past couple of days Jeff and Bob handled the Tuesday outage, and Eric tackled a couple general network issues as well (the upload server got misconfigured somehow and was dropping excess connections, and then the assimilators were dead in the water for a while there, causing the queue to back up, the workunit disks to fill up, and finally the splitters to shut down - which is why we ran out of work to send out last night). All seems much better now, albeit jammed with traffic.

In better news we did finally get the first two data drives from Arecibo as recorded by the upgraded data recorder and new external drive docks under normal operations. So we're not going to run out of raw data after all, or at least not just yet. I'm copying those raw data files onto our local drives as I type.

- Matt


10 Aug 2009 21:30:52 UTC
Happy new work week (for those with "standard" work week formation)! The weekend was rather quiet - no major outages or glitches. We burned through all the data we have on line for Astropulse, but still have plenty to process for multibeam. We do have at least a couple drives containing hot, fresh data coming up from Arecibo any day now. We're also hoping the amount of time ALFA gets to observe actually increases, or else we'll always continually be dangerously close to being, if not completely, out of data to process. As far as problems/concerns go, this is a good one.

I got the first rev of the daily cronjob running right now which creates an updated "top ten" candidate list (via the NTPCkr) to be parsed by some PHP for public consumption. It's running now, and taking a long time. I'll see how long it takes before making anything live, being as how we'd like to run this every day, but may be forced to pace it slower than that.

As for radar blanking, I'm finding the correlations still aren't clearly defining which is radar and which isn't. I'm going to talk to Dan about that shortly.

- Matt


5 Aug 2009 21:57:21 UTC
Raw data pipeline: Jeff and I are mining old files that were only partially done for one reason or another. Hopefully these can keep us crunching until we get more data from Arecibo. To add insult to injury, it seems the observatory has been suffering from several power outages the past few days, probably due to thunderstorms.

MySQL databases: So far so good with mork as the replica. We recovered pretty quickly from the outage yesterday. I'm hoping the freeze yesterday was a fluke, or caused by some temporary variable which has since changed, or at the very least next time it happens we'll have some kind of smoking gun somewhere on the system. We're looking into getting its twin "mindy" on line sooner than later.

NTPCkr: Jeff and I met this morning to discuss the current status of what we need to do to get this thing on line. To be clear, Jeff has been doing pretty much all the work on the NTPCkr engine, and I've been helping with the cosmetic/web stuff. Anyway, Jeff has a couple bugs to clear up. Nothing major - things like the reporting mechanism sometimes spits out the same candidate twice. I've been working on web site stuff, like putting in all the hooks to allow people to discuss candidates amongst themselves on a separate forum. Once we clear up our current set of bugs/updates I'll fire up a daily cronjob which will (a) generate the current "top ten" list, (b) pull all the data from the science database from these candidates (if not already on disk) for plotting purposes, and (c) create discussion threads for each candidate (if they don't already exist). Then we're live, but we'll have many "version 2.0" tasks to address right away.

- Matt


4 Aug 2009 22:52:27 UTC
Tuesday is our usual outage day, as many of you are firmly aware. Today was the usual drill, except we have two replica databases to deal with. We set the "alter table" scripts on these two systems simultaneously, prepared to laugh at how much faster mork will perform than sidious.

And it was doing great, even faster than the master database (jocelyn)... until it crashed. And it was the worst kind of crash - the system simply froze, requiring a hard reset, and there was not a trace of any evidence anywhere upon reboot about what happened. So now we have the completely opposite of a warm fuzzy feeling about mork, but nevertheless even with this setback, and the ensuing innodb database recovery, it still wrapped up all its tasks around the same time as the master database, and so both master/replica are back online and serving requests. I didn't need to temporarily turn off the "show tasks" pages because we can handle them, even right after an outage. The old replica (sidious) is still chugging away on its table compression tasks, and will probably be done with those around midnight.

Meanwhile the rest of the day I've been gathering data and making plots to better understand the radars that clobber our Arecibo data. Selecting thresholds is rather difficult, as it changes from file to file where the baby ends and the bathwater begins. Sigh. But we're close, and can do a rough enough job of getting most of the radar out without losing too much data.

People asked about the NTPCkr pages. Oh yeah.. That.. Jeff and I were pushing on those last month, then I disappeared on vacation, and then we both were at the OSCON in San Jose, and then the new replica server finally started working so that's been occupying our time, along with scrounging data together to process. Sorry about the delays. I know we're close to publishing something. This is kind of an important addition to the web site so we want to make it kinda works before embarrassing ourselves with broken/misleading information.

- Matt


3 Aug 2009 23:19:10 UTC
A relatively spotless weekend (though I did arrive this morning to find 1000 e-mails in my inbox - all warnings about mount issues from a behind-the-scenes compute server). The new replica server "mork" caught up pretty much instantly last week once the whole database was read into memory (about 32GB) and is now actually serving as the main replica for now, if only to stress test it. We still may crack it open and reconfigure it if we find the drive configuration is a bottleneck. In any case, if you're looking at results on this web site, you're pulling them off mork.

We are also getting close to running out of data. Just as we got the data recorder working again they had two weeks without any Alfa observations. We're currently trying to split raw data files that were only partially split for one reason or another, but after that... looks like my software radar blanker project has been bumped up in priority. No need to panic, at any rate - we probably have a couple weeks, I think, and we might get a burst of new data from Arecibo during that time.

- Matt


30 Jul 2009 19:46:03 UTC
So we seem to have gotten over the hump with this new replica server. I should point out working on this server has had zero effect on the rest of the normal project operations, except for perhaps eating up all my time. Anyway, my script got around the dump/restore bug, and after some configuration headaches this morning we are successfully replicating on mork! Of course, sidious continues to be the replica we are using for production, while mork is considered "beta test."

It is catching up on the backlog far slowly than we hoped, especially given the power of the machine. Of course, power is measured in network, disk, memory and cpu. This system certainly has cpu (24 processors!) but word on the street is that mysql actually *drops* in performance after n processors. What "n" is, and what the penalty is remains unclear. Also, this system has fewer disk spindles than sidious (8 compared to 10), and they are slower disks, I think. So we may be seeing a disk i/o hit, but iostat doesn't really show anything amiss. The system is also in our lab and not in the closet, so there may be an extra network hop or two slowing things down. Anyway, as it progresses we'll gauge its performance and act accordingly.

As for changing linux flavors, the current issue here is mysql versions, and not so much linux distributions. As mentioned elsewhere we're trying to adhere to a homogenous setup, and we have less than zero time to mess around with anything experimental like trying new OSes on for size. In any case, Fedora works well enough, and while I generally swear by open source software for both philosophical and practical reasons, I do understand that you get what you pay for.

- Matt


29 Jul 2009 22:32:28 UTC
So.. getting mork on line as a test replica server still continues to be one headache after another. We finally got the hardware working, finally got the drive configuration set up, finally got the OS installed, finally got MySQL fired up, and we were populating the databases using Tuesday's dump files.

Then we hit a completely mysterious error and consistently at the same point in the dump file. Long story short, I spent pretty much all day today trying to find the cause of this error. At this point we're about 90% convinced it's an actual bug in the MySQL version that comes standard with Fedora Core 11 (version 5.1.35) where it fails reading mysqldumps containing large text fields. This seems like a major problem, no? Anyway, the same mysqldump worked on a test 5.0.x database engine. So I'm looking to upgrade this version beyond what's in the current Fedora repositories. What a pain!!!

I just turned on the "show results" flag, even though our current replica is still far behind reality.

- Matt


28 Jul 2009 22:32:02 UTC
Had the usual Tuesday outage today for database maintenance. Nothing too exciting to report about that except we continue to have progress getting new server mork on line as secondary replica (and hopefully someday primary master). MySQL is running on it, and all the tables are being populated as I type this.

A note about the "old junk" I mentioned yesterday. I was talking about real junk (gutting parts servers, shipping boxes, etc.). We still have the E450s that were our various servers during the "classic" phase of SETI@home. We keep talking about auctioning those off but I doubt any of us will ever have the time to coordinate that. Maybe we'll donate them to the Smithsonian.

- Matt


27 Jul 2009 21:54:56 UTC
Not much time to report very much, but the good news is that we finally got one of those new Intel machines working. Eric was in over the weekend installing a new disk controller card, and Jeff and I wrapped up the OS install/configuration today. We now have a new system with 24 processors (4x6 2.13 GHz) and 64 GB ram. We'll try to make this a replica mysql server (in addition to sidious) and see how it does, maybe tomorrow...?

Data-wise, we're finding the Alfa receiver isn't on as much as we thought, and we're running low of data from our archives, as well as data currently on-line. Actually, that's not true at all - we have plenty of data taken between January and April 2008 which has the hardware radar blanking signal (so we can reject RFI), but was accidentally pre-precessed (so we have to unprecess after the fact). Not that big a deal.

About to disappear into the basement and throw out a bunch of old computer junk we haven't used in many years (various people are complaining about how much space it's taking up, which is fair).

- Matt


23 Jul 2009 20:21:14 UTC
Oh, hello. I was out of town most of last week on vacation, then Jeff and I were at OSCON 2009 down in San Jose until today. Despite being billed as an open source developer conference we got all kinds of linux sysadmin and mysql tips and tricks from various experts that we may apply towards better diagnosing of system/network/database issues in the future.

That all said, I haven't had the time to catch up on the lengthy discussions here in this forum during my absence. I imagine it has been mostly about our continuing network struggles. This may all become quite moot quite fast as Eric started rolling out the updated scientific analysis configuration, which is an easy knob to turn as we can increase sensitivity, thus improving our science, with the additional happy side benefit of reducing demand on our servers. I think, though, that we have now just reached the limits of that particular knob before getting diminishing returns.

Apparently there were a few servers that needed to be kicked while I was away. Jeff and Eric took care of all that. Mount issues and the like. We also seem to have our new disk arrays set up both at Arecibo and here, so the raw data pipeline should be kicking into full swing again soon. This is good as we're down to our last 10 files that we've been bringing up from the archives (there are a lot more files, but they require the radar blanking software to work in order to be processed, and I haven't gotten around to that yet).

- Matt


13 Jul 2009 22:11:48 UTC
The data pipeline over the weekend seemed to be more or less okay, thanks to running out of Astropulse workunits and not having any raw data to split to create new ones. Of course, I shovelled some more raw data to the pile this morning, and our bandwidth shot right back up again. This pretty much proves that our recent headaches have been largely due to the disparity of workunit sizes/compute times between multibeam/Astropulse, but that's all academic at this point as Eric is close to implementing a configuration change which will increase the resolution of chirp rates (thus increasing analysis/sensitivity) and also slowing clients down so they don't contact our servers as often. We should be back to lower levels of traffic soon enough.

We are running fairly low on data from our archives, which is a bit scary. We're burning through it rather quickly. Luckily, Andrew is down at Arecibo now, with one of our new drive bays - he'll plug it in perhaps today and we'll hopefully be collecting data later tonight...?

To be clear, we actually have hundreds of raw data files in our archives, but most of them suffer from (a) lack of embedded hardware radar signals (therefore making it currently impossible to analyse without being blitzed by RFI), or (b) accidental extra coordinate precession, or (c) both of the above. Software is in the works (mostly waiting on me) to solve all the above.

- Matt


9 Jul 2009 22:09:13 UTC
Not much news. Eric, Jeff, and I are still poking and prodding the servers trying to figure out ways to improve the current bandwidth situation. It's all really confusing, to tell you the truth. The process is something like: scratch head, try tuning the obvious parameter, observe the completely opposite effect, scratch head again, try tuning it the other direction just for kicks, it works so we celebrate and get back to work, we check back five minutes later and realize it wasn't actually working after all, scratch head, etc.

Thanks for all the suggestions the past couple of days (actually the past ten years). Bear in mind I'm actually more of a software guy, so I'm firmly aware that there's far more expertise out there regarding the nitty gritty network stuff. That said, like all large ventures of this sort the set of resources and demands are quite random, complicated, and unique - so solutions that seems easy/obvious solution may be impossible to implement for unexpected reasons - or there's some key details that are misunderstood. This doesn't make your suggestions any less helpful/brilliant.

Okay.. back to multitasking..

- Matt


8 Jul 2009 19:03:43 UTC
Once again it took the replica all night to recover. I started it up this morning, and it's catching up now. Well, almost. I'll turn the "show tasks/results" feature back on once it really starts catching up.

There's been a lot of discussion lately about our bandwidth woes. I actually talked to Blurf this morning on the phone regarding the (rather generous) push to donate money/hardware towards solving this problem. Let me try to paint a big picture here.

We pay for a gigabit of bandwidth from our private ISP (Hurricane Electric), but can only use 100 Mbits given current campus infrastructure. Most of campus is on gigabit already, but our lab is all the way up the hill - so it's much harder and more expensive to improve the old wiring/routing. The entire rest of the Space Lab uses about 10 Mbits/sec, so there is absolutely zero push by anybody else to spend money/effort on this project. Luckily, there was a spare 100 Mbit cable which is what we are using for the Hurricane Electric link.

While we pay for our bits, they still have to route through campus in order to ultimately hook up with the right backbones. That means we have to adhere to campus's network specs, which in turn means we can only use very specific brands/models of hardware, and can only act once they've fully researched our needs. We opened up a ticket months ago asking to start this research. We got word a couple days ago this research has more or less finally begun. Not much progress, but still non-zero. This may seem impossibly slow, but campus really pretty much always has much bigger fish to fry. Plus our requests usually present them with something new they haven't dealt with before, and therefore they are far more careful.

Ultimately, we should be presented with a couple options from campus which include exact pieces of hardware to be obtained. It's still not clear how much cable has to be upgraded and where, but we know we'll need two new routers, if not also other hardware. When campus gives us this final report, only then can we start figuring out how to obtain the necessary hardware.

As for other options, like going wireless... There actually used to be a building down in the flats that got wireless bandwidth from us. The experience was that it was quite slow and prone to suffering during bad weather, etc. This was a while ago, but still there is enough concern about reliability that nobody seems to want to go down this path.

Of course, another option is relocating our whole project down the hill (where gigabit links are readily available), or at least the server closet. Since the backend is quite complicated with many essential and nested dependencies it's all or nothing - we can't just move one server or functionality elsewhere - we'd have to move everything (this has been explained by me and others in countless other threads over the years). If we do end up moving (always a possibility) then all the above issues are moot.

Another important thing to consider is that we can always reduce are bandwidth demands via other means, which I also explained in another recent threads. Things like removing redundancy (and putting a cap on workunit downloads per day per host), or adding scientific analysis. Or, to be a little extreme, calling SETI@home done, turning off the downloads for good, and moving on to the next thing (something I am actually in favor of doing sooner than later, but the others around here seem to disagree).

I definitely appreciate past and current efforts to help us get beyond the current bandwidth crisis. However, as noted above, there are enough variables involved that I'd hate for you all to start collecting money directed towards a solution to a problem which might just go away. In the meantime, thanks as always for your patience (and crunching time when you actually do get workunits) - we'll keep working with what we got and see if we can't get beyond the storm sooner.

- Matt


7 Jul 2009 22:35:01 UTC
Had the usual weekly database maintenance outage again today. It looks like our mysql database has shrunk for two weeks in a row now (due to less results out in the field). This is a good thing as it means more internal I/O resources. We're recovering from the outage now as I type this. I still expect it to take a while (maybe a full 24 hours?) before we stop dropping connections left and right.

As for that raw data storage server issue mentioned yesterday... turns out it was, uh, user error. A partition filled up. Oops. Still, not sure why the data trasnfer tools (to pull data up from our off-site archive) wasn't noticing that the disk was full and kept trying to write to it over and over and over again.

Question: does anybody out there actually *use* NetworkManager? Or does it exist simply to confound and annoy? I'm willing to believe it's a useful tool, but unfortunately my experience pretty much shows the latter - it randomly and unexpectedly breaks network connections without remorse. I have made it a habit to remove that package and all my machines whenever I find it. Of course, I just installed a clean OS on my desktop. Suddenly firefox is starting up in "work offline" mode, even though I uncheck the box every time. I did some research and found, ha ha, it was my old nemesis NetworkManager getting in the way - it got reinstated with the new OS install. One quick "yum erase" and firefox was once again starting up actually connected to the internet, which I think is preferable, no?

- Matt


6 Jul 2009 23:00:18 UTC
It's still pretty ugly out there - we're maxed out our bandwidth and mysql resources. We were able to squeeze out a few more cycles from the upload/scheduling servers this morning, but generally it's been quite impossible the past week or so. Clearly this is a result of increasing our user base, and the growing percentage of results being processed by cuda clients.

To solve this problem we have several options. There is non-zero but nevertheless slow progress in both the bandwidth and mysql fronts, so we're effectively stuck with what we got for now. We could go to single redundancy and keep the split rate the same. This will immediately divide out outgoing bandwidth in half, but people will, on average, get less work to chew on. We could also increase the resolution of chirp rates that we process, thus lengthening the time it takes to process a workunit. We may do both. From what Eric tells me compressing workunits only helps multibeam, and only by about 20%. Almost not worth considering, since that will get us 5-10 Mbits back, and we need something like 50.

The other annoying thing is that on Friday/Saturday our raw data storage server got hung up while we were copying a file up from our archives. This caused splitting to slow down until we ran out of work to send. Not sure why this was the case, as I killed that transfer and everything worked fine after that. Even more mysterious is that, while bringing the same file up again this morning it choked our server once more. Why this one particular file is having such a random and extreme negative effect is beyond me at this point, but we're doing other tests, etc.

You know, I should point out that while I write these daily missives I tend to disagree with a lot of policies that end up getting enacted around here, which it makes it difficult for me to defend one practice or another that might be discussed on these threads. Anyway, don't blame the messenger.

- Matt


2 Jul 2009 18:24:11 UTC
Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.

Regarding the videos I linked to yesterday, there were plans to get the powerpoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?

We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.

- Matt


1 Jul 2009 19:38:40 UTC
Sorry about the forums (and other web site features) being shut off for over a day. These Tuesday outages are really taking forever. I guess we've been really busy, which means our tables get ridiculously fragmented throughout the week. Plus I noticed our database is easily 50% larger than it has been about 2 months ago. And the replica lost a couple of its CPUs recently (it's a used/donated system and the CPUs were known to be flakey from the start). Anyway, since the normal recovery procedure was so painful last week I opted to keep all web page database lookups offline until the replica was caught up. Once again, I'm sorry for the inconvenience.

To make up for that, how about some videos from the SETI@home 10 year anniversary? I'll link these to the home page soon enough. Consider this a sneak preview for those who read these threads. Let me know if there are problems downloading/viewing these mpegs.

Data recorder-wise... After all the effort to work with what we got, we're finally throwing in the towel on the current set of data drive enclosures. We have a plan B and plan C already in place - just a matter of deciding which one to enact. Meanwhile, I'm pulling old data off the archives at a pretty good clip - hopefully fast enough to keep up with demand.

Otherwise, I'm still working on NTPCkr and radar stuff. And I adjusted the stats scripts that generate the numbers on the server status page. The Astropulse numbers up until this morning reflected version 5 workunits/results, now they reflect version 5.05.

- Matt


29 Jun 2009 22:16:48 UTC
Another wacky weekend. Sounds like we were sending out a bunch of short workunits, which strains our bandwidth resources. Plus uploads were clogged for a while. The server was too busy and dropping connections, so the uploads weren't even reaching the server. On Saturday morning I did some TCP tweaking and seemed to clear up that log jam for the time being.

This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffices (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text
/plain."

The SERENDIP web page was updated for the first time in many years. There's a link on the front page about that.

We plan to get the public NTPCkr candidate lists on line this week, ready or not. Trying to squeeze a couple more features in at the last minute, but I'm sure there will be bugs to work out and more features to add later on.

- Matt


25 Jun 2009 20:59:16 UTC
Fallout continues from the outage on Tuesday. Turns out the minor corruption in various MyISAM tables is messing up replication. Every so often a duplicate entry appears on the replica queue which is easy to remove but requires human intervention. This is causing the replica to fall further and futher behind. I'm loathe to give up on it, though, as that means being forced to point all queries, including non-essential ones, at the master. And that'll break everything.

We also had to fall back to using two download servers, but we did so using simple DNS round-robin load balancing. Obviously this wasn't working out so well. DNS rollout/caching is never balanced (we saw this several times before, especially during the feeder mod polarity issues a year or two ago). So this morning we fell all the way back to using "pound" - which forces exactly 50% of all incoming connections to go to the first server, and the rest to the second one. This immediate broke the current download log jam, though of course we're still maxed out bandwidth-wise as I write this paragraph.

Seems like there are a lot of frustrated people on these threads. There's no right or wrong way to feel about these outages. We're kind of a special case. At the core we're an academic project with no deadlines - normally nobody gets hurt if science is delayed a day or a month or a decade. On the other hand, we're forced to be "professional" since we're asking for various forms of support from many thousands of people, and you can't have that large a number of people involved without some sort of professional grade management and public relations. It's a daily puzzle marrying the two completely separate worlds.

- Matt


24 Jun 2009 19:56:44 UTC
Despite efforts to reduce the outage time yesterday, the database was bloated enough (for various reasons) to take all day compressing/backing up. The replica wasn't even close to being ready to done by the time I left the lab, and still wasn't done before I went to bed last night. That meant all queries had to be aimed at the master, including all the read-only stuff that usually hits the replica - stats collection scripts, result state count scripts, the daily credit multiplier calculation (which is rather expensive), and lots of annoying web scraping queries.

All those excess things pretty much killed us throughout the evening. The replica was finally available in the morning, albeit fairly far behind the master. Nevertheless I was able to start cleaning up the mess. However, two other problems were revealed.

First, going to one download server wasn't a good thing. It seems impossible to me that apache can't handle all the downloads on one system - especially given the abundance of free resources. It drops connections regardless of how much network/httpd.conf tweaking I do. So we fell back to using two download servers, and that immediately solved everything. Of course, we've been offline for 24 hours, so there's gonna be lots of traffic for a while making it hard to upload/download anything.

Second, there was minor corruption in the MyISAM tables in the mysql database. Not sure what caused that but given the database was clogged all night all bets are off. The most notable effect of this was some weird behavior in the forums. Some simple "repair table" commands found the problems and claims to have fixed them.

Anyway.. it's clear we still have much work to do cleaning up our current mysql situation. Sigh.

In better news, looks like me and Jeff are going to the OSCON 2009 in San Jose in July - the O'Reilly open source convention. Maybe we'll get some hot tips about improving the linux/apache/mysql/php performance around here. Tim O'Reilly himself helped hook us up with free passes (he's been nice to us over the years).

- Matt


23 Jun 2009 23:09:29 UTC
Usual outage today (which happens every Tuesday for mysql database compression/backup). It went really long - I guess we've been busy inserting/deleting all last week. We went back to an older policy of doing simultaneous compression on both the master and replica, which should vastly speed up post-outage recovery. Until today we've been letting the compression commands (i.e. "alter table user type = innodb") to pass from the master to replica via the usual channels, but they wouldn't happen in parallel (as the loooong queries had to complete successfully on the master before the replica would start processing them). This caused the replica to be as many as four hours behind when the project started up again in the afternoon. The benefit of doing it that way was less work/management and accidental updates/inserts during the outage wouldn't get lost. Going back to doing it in parallel, we have to stop the replica before we start and reset the master after we're done, thus increasing the chance of these lost queries, but so far we've had 0 such incidents during these weekly outages since we started using mysql years ago.

A weekly planned outage is usually a good time to take care of some offline chores. Today I cleaned up lots of unnecessary mounts in a effort to reduce our automounter maps as much as possible (so we don't have such a tangled web which can be quite painful when one server disappears). I also made vader the sole download server, thus freeing bane to be whatever we want - which will be useful to handle certain services temporarily as we go around upgrading the out-of-date operating systems on lots of these machines. I think vader can handle the load alone.

I hear the presentations from the 10th anniversary celebration have all been converted to mpegs. It's a few gigs worth of stuff on a computer down on campus. A flash drive containing all that will appear up here at our lab sometime in the near future. Or it may be hosted on an interim server. We shall see.

- Matt


22 Jun 2009 20:53:56 UTC
It's fairly clear that the recent updates we made to the general mysql/state counts/splitter fold has vastly improved our recent weekend woes. There were still a couple dips here and there, but no wild swings like before.

Except this morning one particular query - from the scheduler - was clogging the works. We figured we'll just let it push through, i.e. let nature take its course. We assumed it was an expensive lookup, but after a couple hours of waiting I ran the same query on the replica and found there was only one (!) row in question. So what the heck is mysql doing? We killed the query and eventually the logjam cleared.

I'm finally scraping up enough space to pull a lot more work up from our archives, so Astropulse will be kicking in again, at least at some low level. This should also help reduce the deman on our limited resources since those workunits take longer to process, which means a lighter load on our database/download/upload/scheduling servers.

- Matt


18 Jun 2009 22:36:40 UTC
Some things got lost in the server reboot chaos/mayhem yesterday. One being that results were not being correctly stored on disk, despite all diagnostics showing otherwise (the incoming traffic looked normal, the upload apache servers were responding with "200" status, all the BOINC backend queues were nice and low). However, after rebooting the upload server yesterday the result RAID partition failed to mount. Actually this is a known quantity - there's something odd about this particular RAID partition that requires human intervention after every reboot to get going. Well, that human intervention didn't happen. Oops. Anyway, this was ultimately discovered thanks to various complaints from various parties, and fixed. Hopefully not too much headache/annoyance out there as the backlog of failed results clears out and corrects itself.

The new splitter method is now in production - where we're getting counts from a regularly updated table rather than each splitter process making the same redundant query over and over again. This would seem like a job for triggers, and we may go that route, but we already had the programming/plumbing in place to make this table (i.e. the process that collects numbers for the server status page, which already displays those same counts) - so this was easier to implement. We'll see if we get less network dips over the next few days...

- Matt


17 Jun 2009 20:16:11 UTC
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, thus resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.

This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work). This is a simple count on the result table, but if we're in a "bad state" this count which normally takes a second could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiple this problem by six.

We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program that runs every so often do these counts and populate the proper table. This way instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.

Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work.. and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable" so this wasn't a surprise. I let it go as it was late.

As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work. That is, until our main /home account server crashed! When that happens, it kills *everything*.

Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.

So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well or else we'll start hitting those bad mysql periods really soon.

- Matt


11 Jun 2009 21:52:12 UTC
Spent the morning clearing out my mail spool - something that could easily eat up a full day if I let it. It's amazing how these "this will only take 5 minutes, tops" tasks add up, especially when there are about 100 of them.

Bob found the mysql replica has been falling behind a bit more than he though it should, and after some poking around I found iptables getting in the way. So I did some reconfiguration on that system, rebooted it, and now let's see if it is operating any faster... This wasn't the crux of our mysql woes, but it may help a little bit (less chance the stats queries will rely on the master if the replica is always caught up). Actually as I write this I see we're in another difficult period. Eric was actually just up here and suggested a workaround for one of the queries that has been given us the most headaches lately. We might implement that in the near future. We also should try throwing some of this new hardware at the problem (if we could ever get it working).

The dust is settling after the anniversary a bit - still haven't gotten any video from the students putting it all together. Dan, having spent some time in Arecibo recently has new insight about the radar problems we've been having - so I may get yet another code rewrite on my plate in the near future. Hopefully this will be the final revision that will actually get completely and be used to clean up a huge backlog of dirty data (waiting to be processed). Jeff and I hope to also get some NTPCkr far enough along to present something to the public. I know I've been saying that a while.

- Matt


10 Jun 2009 22:12:33 UTC
Playing around installing the new Fedora Core on my desktop today. So far so good. It seems any time anybody in any context mentions a specific flavor of linux this inspires discussion, usually in an incredulous tone, about why in god's name would you even consider using version x instead of version y, etc. I understand the pros and cons, and we're not going to change anytime soon, if ever. Personally I'm waiting for the day when operating systems disappear and we can all get back to work.

Still haven't gotten any of the Intel systems up and running for various reasons. I'm abandoning all of them for now. Very frustrating - every time I solve one problem another takes its place.

And the inability to collect data at Arecibo continues - the problem has been narrowed down to the (very old) EDT card working on a newer OS. The good folks atEDT are working on it (even though they don't even sell this card anymore, I don't think...).

- Matt


9 Jun 2009 22:27:28 UTC
Well I employed my database code adjustments yesterday afternoon... and they seemed to have had a decidedly opposite effect than what we expected. So I reverted them back last night. Back to the drawing board on that front - I'll think we'll basically move from understanding the problem to eventually adding more hardware so it isn't a problem.

The key to that is getting hardware to work. Eric figured out the issues we were having on one of the newer Intel servers (the RAID controller card had to have a jumper moved around, even though I checked the jumpers already and they matched a similar card in a similar system that is working just fine). Of course, it's a hardware RAID controller, and it won't let me do JBOD, so I was forced to make 8 individual RAID groups, each containing one drive. This is annoying enough, but they RAID bios contains primitive enough mouse drivers that each step of pointing the mouse and clicking on the appropriate button took anywhere from 5 to 60 seconds. So it took me about 90 minutes to configure the RAID. Of course, we could have just stuck with using hardware RAID but for benchmark purposes we're comparing this system to one with similar software RAID. So there ya go.

Our BOINC server - one system that handles the boinc.berkeley.edu website, all the alpha project stuff, etc. - has been having more and more problems as of late, all resulting in the CPU load spiralling out of control. We're in the process of getting another one of these new Intel servers up and running to replace this older server. Of course, we're hitting all kinds of other problems trying to boot the thing. More on that tomorrow if it's still offline.

Downloading FC11 today. All the mirrors are jammed.

And of course we had our weekly outage. No big developments there - Bob took care of all that. He did notice during our weekly science database backup that we had some corrupt database pages. This may be because of something else he discovered - the disk space made available for Astropulse had filled up sooner than expected. So he added more disk space to those tables.

- Matt


8 Jun 2009 21:31:03 UTC
Dan and company are wrapping up their work at Arecibo and heading home today (I think). It was a painful weekend trying to get our data recorder working again (and installing the new SERENDIP V data recorder) but all is well, more or less. We even did some observations of the crab nebula (and its known pulse) which Josh then found in the data using Astropulse, providing a good end-to-end test. We'll send workunits using that data once we get that raw data up here. We ultimately found our SATA drive enclosures were a major part of the headache, and we're planning to replace those with USB enclosures... probably.

It was a painful weekend network-wise - the increased active user load (mixed with the lack of long Astropulse workunits to send out) means a lot more activity on the result table in the database, which means periods of mysql choking. We're adjusting some code to do "dirty reads" which may help conserve resources. For example, the count of the result table to determine the current size of the ready-to-send queue doesn't have to be 100% accurate, so locking the table to do such a query is overkill. We'll see if that works, or helps.

We hope to replace these database servers, or at least the mysql replica, with one of these new Intel servers. They have tons of CPUs and gobs of memory, but the disk controller doesn't work. Actually, that's unclear - we replace the card with one we know works, and that wasn't behaving either. Until we can figure that out we're stuck with what we got.

- Matt


4 Jun 2009 22:27:11 UTC
A day full of troubleshooting. Still trying to get one of these Intel servers up - everything in the system works except the hardware RAID. We got the new drives in the mail today, but still can't get into the RAID bios. We do have a card we know works in another Intel server which we'll swap in sometime but we're tabling this project in general for now...

That's because Dan and a bunch of the CASPAR students are down at Arecibo to install their new SERENDIP V data recorder. They'd like to test it while they are there, of course, which means comparing its functionality with our recorder, as well as do some observations of the Crab Nebula to run through Astropulse, etc. What does that mean for us? That means we really need to get our SETI@home data recorder SATA drives/enclosures working. They have been off line for well over a month now, but now that we have our own people with immediate access to the machine it's speeding up the debugging process. Still, there are plenty of mysteries that seem impossible to figure out. Jeff's frustration with SATA/USB/drivers/linux is palpable coming all the way from the other side of the room. In fact I just heard him tell the gang down there to install a new OS on the system (the current OS is ancient, and quite possibly the source of our woes).

Meanwhile Jeff and I are continue to tinker with NTPCkr stuff. I've been trying to optimize the NTPCkr page, finding that it spends most of its time parsing the XML of the zillion multiplets (groups of similar signals) within each candidate. So at this late hour we may change how we divide the multiplets up into "barycentric" (tight in frequency space) and "non-barycentric" and just score them according to frequency tightness. This may not only yield far less multiplets, but they may be ranked better as far as how interesting they are. There's gonna be more tweaking/testing on that front.

- Matt


3 Jun 2009 22:00:49 UTC
Today started messing with one of the new Intel servers. We're still waiting on drives to ship before doing much with it, but at least it boots off of DVD. There are some other kinks to work out as well. I think we're going to call it "mork." We hope to at least replace sidious with this machine, and if we get the other servers working, than replace others. In general we always wish to reduce the hardware we need to maintain - i.e. have less machines doing more stuff. However, we'd like to do so without increasing our single points of failure (redundancy is nice). And given we never buy anything we have to generally stick to a "work with what you got" philosophy.

A small note about the front page "weekly outage" status - that's a line at the top of our project_news data file which is commented out. Every Tuesday morning I uncomment it (if I remember to) so people can see it, and hopefully later that day (if I remember to) I comment it back into oblivion. Sometimes I forget, or recovery is slow enough that I keep that warning there so people can get some idea why they're having trouble connecting. In any case, it's human controlled and therefore prone to error.

- Matt


2 Jun 2009 23:29:04 UTC
Had the weekly outage today - the normal database/compression/cleanup stuff was by the book, however we took the time to address some other hardware issues. First and foremost, we replaced the failed drive on thumper. I was griping about this yesterday and how this means we'll have to reboot, which means we're forced to resync the root RAID devices. Well, that's happening now. I also upgraded the kernel on worf. That sort of went well - except upon coming back on line one of the spare drives was marked as failed. We're dealing with that now.

Coming out of these weekly outages has gotten painful given our increased rate of traffic lately, and these web queries that continue to clobber us. I try to aim these at the replica, which helps, but right after outages the replica is effectively offline for many hours as it is still busy recreating the giant tables. So I have to temporarily aim those web queries at the master, which makes recovery even slower. We gotta figure this all out, come up with a better weekly backup/reorg policy, or get that new replica server up and running sooner than later. We did order drives for it - should be here later in the week.

- Matt


1 Jun 2009 22:27:24 UTC
Lots to talk about today. Let's start with the weekend: we had the usual drill of running out of raw data files for the Astropulse splitters to chew on. Due to file transfer speeds up from our off-site archival storage (NERSC) we can only put a few files up a day, which Astropulse goes through in no time. This isn't a big deal, but in order to regulate this a little better we adjusted the weights of the two applications so that the feeder gives 97% of its slots to multibeam, and 3% to Astropulse. This shouldn't change the current regular behavior, but will help smooth out the peak periods I think. There's still some BOINC logic changes that have to happen to keep Astropulse from taking over too many systems.

Some good news: Intel once again came through with a slew of donations - five servers to be exact. These are mostly test/used systems so three require some TLC to bring on line (a couple of those may be used as parts to boost up one of our current compute servers). However one of the remaining two will get our attention right away and became the new mysql replica server. I haven't confirmed the specs, but I've been told they each contain four 6-core CPUs and 64GB RAM. Intel would like us to do some benchmark tests right away, so expect a new server (or two) in the fold in the coming weeks. I guess I need to update the hardware donation page...

Of course, the release of Fedora Core 11 has slipped a couple times, but I hope to start a major wave of OS upgrades (or installs) next week as well.

The other big project is dealing with thumper - our science database server. We're replacing a bad drive tomorrow, which means rebooting it, which in turn means it will go through some painful RAID resync upon coming back up (due to its drive naming issues). We know we can fix this resync problem by reinstalling the OS, which we'll do when FC11 is out and we tested a similar install on bambi (the secondary science database server) first. Once that's working, we'd like to re-RAID the data drives (from RAID5 to RAID10) to vastly speed up throughput (necessary for NTPCkr performance). But to do that we need to get all the raw data off first. And to do that we need to first install a kernel update on worf (the NAS from Overland Storage which we are beta testing) so we can safely move all our raw data there. Oy. So many ducks to get in a row. Anyway.. one step at a time...

- Matt


28 May 2009 20:37:47 UTC
Question: so what's up with the near time persistency checker (NTPCkr)? If the live web streaming were working last Thursday you would have seen the tail end of my and Jeff's talk where Jeff went into a little details about the current status of things. Basically, we have some screws to tighten here and there, but the general thing is working. We're up against some database throughput issues which we hope to fix sooner than later, plus we are still tweaking the scoring algorithms. We hope to have a public page available soon where you can peer into the progress of things. Until then, here's version 0.0.1 of the NTPCkr FAQ.

It's becoming clearer that we need to adjust the weight of our applications so that we send out more SETI@home/multibeam workunits. We have things effectively set such that Astropulse work gets sent out as soon as it becomes available. This was partly to expedite getting as many Astropulse results back as possible (in the interest of getting that science done) but this is getting less and less possible given our resources and current participant demands. Things on this front may shift in the near future.

We've been near our bandwidth limit for the past day since unclogging the mysql database, providing more data for Astropulse to split, and our active user base going up about 15% over the past couple of weeks. This may account for recent upload/download difficulty. It looks like it's getting better, as least for the moment.

- Matt


27 May 2009 22:07:00 UTC
Had a few more bandwidth woes early in the morning. Turns out this was due to the replica recovery yesterday - a lot of long queries were still being aimed at the master. I turned the replica on, which immediately helped (though it is about 10-15 hours behind and slowly catching up so some stats may seem a little screwy).

Before we figured that out Jeff and I were a bit stumped as we thought this had to do with Astropulse work availability. In the process of looking for clues we discovered that for a long time Astropulse had an extra defunct project sitting in our applications table. This meant the feeder was saving a third of its slots for a project that will never have any work. I fixed that. I don't think that was causing any major problems lately, but it sure wouldn't help them, either.

This morning I dusted off some code - a program that would fix our doubly-precessed signals. I was hoping some changes Eric had since made to the (incredibly arcane) database code would have fixed some long standing problems, but they didn't. This isn't Eric's fault - it's some garbage in the esql libraries that won't let me do updates to rows with user-defined types. This normally isn't a problem as we can insert signals just fine. Updating them, however, is the problem, at least using esql. So I'll shelve this project once again - in the meantime we have a patch of signals that we cannot use to find candidates as their coordinates are slightly wrong.

Oh yeah - people were asking: I'm not sure when video of our anniversary talks will become available. The students involved in the filming/editing are also working on SERENDIP V, and they're in a mad scramble to get that ready for deployment down at Arecibo next week.

- Matt


26 May 2009 22:32:21 UTC
We're back after the long holiday/anniversary weekend. Phew! That was fun, and now we can get back to work on some outstanding projects.

First off it should be noted the weekend had some issues. For some reason the "forum preferences" table broke again, which wouldn't be that big a deal, except this messes up replication. I kicked it every few hours over the past couple of days which didn't help very much. So we're reloading the replica from scratch yet again. This'll take some time, so the recovery from today's regular outage may be particularly painful.

Meanwhile a random drive on thumper failed. No surprise - there are 48 drives in that thing. It's RAIDed, we're getting a spare from Sun, no big deal. Still, this will exercise our problems with rebooting thumper at this point - so this bumps up in priority our need to reinstall the OS on the thing.

I'm still trying to move data from our archives up here for Astropulse as fast as I can. We have over 100 files yet to transfer. I hope we get the data recorder back in working order before we use up all these files.

- Matt


20 May 2009 21:47:21 UTC
Another short note just to check in. Good news is that I finally was able to get more than just 1 or 2 files up from HPSS for Astropulse to chew on. In fact, I got 4 files! Well, that's still not very much, but more are on the way. We'll really have to get crackin' on the data recorder issues once this week is through.

It also seems that we have continuing problem with these difficult web queries clobbering us from time to time. I put a "hack" in place yesterday that I thought was helping, but Dave noted our problem may be from persistent mysql connections. Since php is embedded in apache, whenever it starts up it opens a database connection and keeps it open through multiple page requests. While we put explicit code to use the replica on the result pages, apparently php won't flip from master to replica (or vice versa) during these persistent connections, so we need better logic to handle all that. In the meantime it seems like we're in another ugly long query phase clogging the pipes. Still very annoying.

This is my last tech news item until next week, probably. Will be busy tomorrow with the big event and all.

- Matt


19 May 2009 23:29:17 UTC
It's Tuesday, that means outage time (for database backup/compression/etc.). Today's outage was by the book, and we're recovering from that now. We're still sloooowly getting more data back up here from our archives at NERSC, though the Astropulse splitters are tearing through those pretty fast. We were also having continuing issues with loooong queries on the mysql master database. We thought we fixed that yesterday. Looks like we didn't. Dave and I poking around with that for a while.

Other than that, chipping away on NTPCkr stuff for Jeff, getting things in order for the big event on Thursday. Wow - I got exactly 48 hours from now to get my little talk straight.

- Matt


18 May 2009 23:13:38 UTC
Happy Anniversary! Though we're officially celebrating later this week it was actually ten years ago yesterday that we launched this thing. We didn't know what to expect, and our ftp server was immediately clobbered from thousands of people simultaneously attempting to download the client. I remember a blur of chaos as we procured other ftp servers (and a remote mirror) that day. I still joke that we've been trying to catch up ever since.

The general workunit/result flow was a little weird lately. First, we ran out of data for Astropulse to process. The splitters kinda burned through a lot of these files - I'm wondering if there's something else going on - or maybe just data quality issues. We also updated some web code which broke our (temporary) master/replica code when looking up results via the web, so the database got clobbered again for a while. This morning Dave re-enacted these changes to use the replica and checked the code in. And once again we had a couple weird mounting issues - bruno was hung on bambi, lando was hung on thumper. This sudden rash of mounting problems is getting annoying if not worrisome. We had to reboot both bruno and lando, which I did this morning. I'm also pulling up some data from Arecibo to get Astropulse rolling again at least from time to time.

- Matt


14 May 2009 20:40:07 UTC
We are quite preoccupied with anniversary stuff so we've been doing the bare minimum amount of systems administration to get by until after the event. Still, it should be mentioned we continue to have SATA/driver issues on our data recorder at Arecibo, and haven't collected new data for about a month now. While we have a pile of data yet to crunch readily available on disk, I started pulling up unanalyzed data from our offsite archives.

Before doing so I went through the whole data inventory rigamarole this morning. We have 1787 raw multi-beam data files (mostly all 50GB in size) archived, of which 338 haven't been split at all. However, a portion of these files were recorded before 2008, i.e. before we had a hardware radar blanking signal embedded in the data. So until we get my software radar blanker working (a project postponed until post-anniversary) we can't chew on these files without dealing with major radio frequency interference. This isn't a major problem: 1225 of the 1787 archived data files are from 2008 or later, and of these 249 have yet to be split. So we got plenty of numbers to crunch until we get the data recorder working again.

- Matt


13 May 2009 19:24:37 UTC
No real server news today, but I'll respond to a couple things mentioned in the previous thread.

I said we have about 150 CPUs in our server fold. Of course, looking at the list of machines on the server status page you see about 40. First, this isn't a complete list - it only contains public facing or critical servers. We have a lot of other systems that are doing tangential tasks or behind-the-scenes stuff. We also have several appliances (like the NAS's) which contain multiple CPUs as well. Still, this number may be inflated a bit due to hyperthreading on some servers. I think the actual number of physical CPUs is still above 100 though. Plus, as I was calculating this just now I found that two of the CPUs on sidious have apparently died. This is no surprise - it's a used/experimental machine and had CPU issues since day one, which is why it is the replica mysql server and not the master.

The talk (which happens next week) should be viewable over the net after it happens. I don't think we're going to do live streaming or anything like that. We're going to meet and discuss early next week what our options are.

- Matt


12 May 2009 21:32:39 UTC
Today's Tuesday, which means regular outage day for us. The project is already coming back to life as I write this sentence, though Bob still has some work to do to sync the beta replica database up again (a process which failed last week due to one of the tables unexpectedly needing repair).

I got a funny call out of the blue yesterday from a person who works at a music production facility in LA. They do a lot of CPU intensive work there, and were surprised to find a bunch of BOINC clients running on their systems slowing things down. I'm guessing a former employee (or current employee afraid to speak up) planted them on as many CPUs as possible. Anyway, I'm not sure how he got my number, and even less why he chose to call me of all people, especially since the clients were all apparently running Einstein@home. Nevertheless, I gave him some uninstall tips, and that was that.

Still working on the talk, which is slowly coming into shape. I'm trying to squeeze in 10 years' worth of digressions about work creation/distribution, databases, web sites, and networks, as well as back-end server war stories into about 20 minutes. It's been a trip down memory lane, and we're kind of kicking ourselves for not taking as many pictures back
in the day of our puny little setup. I can't believe we got this thing off the ground with 3 Sun Ultra 10's (all doubling as desktops for me, Jeff, and Dan) and 2 IPC's. Our current server closet contains about 150 CPUs, 100 TB of disk, and 150 GB of RAM.

- Matt


11 May 2009 21:08:02 UTC
Over the weekend we hit a bit of a traffic "depression" - in other words we were sending out far less work than we should and so our outgoing bandwidth dropped. Why? Well, due to a single garbled astropulse file the astropulse assimilator was bailing, and so the queue was growing, and so workunits were staying on disk longer, and so we ran out of workunit storage, and so the splitters revved down. Eric kicked the assimilator in question yesterday, and we caught up more or less.

This morning I found bruno (the upload/BOINC general admin server) was having similar mounting problems that thumper was having the end of last week - it was hanging on a mount to anakin (the scheduling server) of all things. This didn't affect anything major, but the server status page was stuck since yesterday. Anyway this time I cut to the chase and reboot the system, which helped, but the drive arrays are configured in such a way that requires human intervention on boot to get fully working again. No big deal, but some result uploads were failing for a minute or two there.

Jeff and I practiced the first rev of our anniversary talk this morning. We need to trim it down by 15 minutes. I guess there's a lot to talk about (nothing regular readers of these threads don't already know).

- Matt


7 May 2009 22:03:43 UTC
I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?

Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but informix has always been perfect recovering from such ugly shutdowns). With an empty process queue we still had mounting problems.

Normally one of the first things to try is a reboot as this is easy and usually works, but we were loathe to reboot thumper since (as you might remember if you are an avid reader of these threads) that its root RAID has some funkiness where, even if it's healthy, will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news, the bad news is that our fears were realized, and we're in the middle of another long painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.

Well, that ate up my whole morning. Then moved onto my Powerpoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.

- Matt


6 May 2009 20:39:57 UTC
We recovered fairly well after the outage, despite all the minor annoyances as of late. We still have to resync the beta database on the replica - turns out there was corruption in those tables that didn't get noticed until after we brought everything up again. Well, not so much corruption as a bit somewhere that told mysql to not bother dumping the beta database because it thinks there's corruption. So when I tried to rebuild the replica with the dump (when the beta project was back on line) and found the dump was zero length, I issued the proper repair statement and mysql responded "0 errors" but then was able to dump everything. Whatever. It's fine for now - and it is just the beta database, so we'll clean that up next week.

As for fears of running out of data while we're waiting for the data recorder to get fixed: we still have plenty on line, and a few drives on the shelf full of data sent up from Arecibo as part of the last shipment they made before the SATA card went kaput. Plus we have a bunch (how much? not sure, but a lot) of data in our archives at HPSS which we haven't processed yet. So we're good for now, and maybe even a month or two.

As for those network graphs talked about in the previous thread: that particular graph is for a router down on campus which handles the tunneled traffic to/from our lab and destined for our router at the PAIX (where we hook up with our ISP bandwidth). So yeah, green shows "incoming" from the lab, which is what we see as "outgoing" i.e. downloads. And vice versa for the uploads. Of course, there's a tiny tiny bit of noise due to scheduler traffic which also goes over that link.

- Matt


5 May 2009 21:42:36 UTC
There were indeed some weird lingering problems with the mysql database from this weekend. Some tables had bungled indexes. We think we cleaned that up during the usual weekly maintenance outage today. We also needed to regenerate the replica mysql database from scratch, so that'll be behind until later this evening (or tomorrow). The result pages may be out of whack until then. In fact, I just turned them off for now as they were eating too many resources.

By the way, we're still unable to collect data at Arecibo due to problems with the data recorder being unable to see the drives. Turns out the card we bought, which was an exact replacement of the previous card, is having driver issues. Why? Well, unbeknownst to us we weren't actually using the previous card - we were using a totally different card (i.e. one we didn't buy) this whole time. It's a mystery why the original card was swapped out and replaced with this third one, but we're kinda back at square one again. Sigh. Due to time zone/scheduling conflicts each iteration on this front takes about 24 hours (the staff at Arecibo is providing support for free, after all).

- Matt


4 May 2009 22:27:44 UTC
The weekend was a little bumpy. The mysql database was showing signs of trouble Saturday. Eric was the only one paying attention at the time, so he restarted the database. Everything seemed fine, except he made some posts of the forum and then they all disappeared. This is still a mystery (the cause, the exact effects, and if it still a problem). Eric is trying to recreate and diagnose.

But we were still getting web scraped to death. I played a gig Saturday night, getting home around 1:30am. I noticed the lingering problems at that point and blocked a couple more IP addresses and kicked off the long queries. Things more or less recovered on their own after that (except for the validators, which I fixed in the morning).

So this is getting to be a regular problem, which I partially addressed this morning. I dug through the php code and quickly figured out how to get a couple of the offensive long queries to point at the replica database. This seemed to be quite helpful, but the replica is still behind due to the other problems mentioned above. So people are seeing about a day in the past when checking out their current results on our web site. It's confusing, but not the worst tragedy in the world, and it's a problem that will correct itself shortly. It'll all be caught up after the outage tomorrow.

To keep things interesting, we seem to be in a middle of a spate of weird workunits - ones where the data isn't kosher and therefore returning quickly. Eric is also on top of that one. In the meantime, our outgoing traffic is a bit pegged.

Less than three weeks until the anniversary. I'm getting my powerpoint together now. And I couldn't think of a worthy thread title theme this month, so how about apt titles for a change?

- Matt


30 Apr 2009 21:21:40 UTC
We're officially three weeks away from the 10th anniversary celebration - I think Dave just put the official announcement of such on the front page. Jeff and I are bashing out all the details we can beforehand. I guess I will finally learn how to use powerpoint (at least the openoffice version).

So there were some splitters stuck after the outage so we ran out of work to send Tuesday night, but that got kicked back in line Wednesday morning. I wasn't involved with the outage and didn't notice until everything was better - I was taking the day off entertaining visiting family (which also explains the spotty nature of these current tech news items - sorry).

There are still lingering problems trying to record data at Arecibo. We sent them a new SATA card, which worked, but even though the part # was the same of the old card the connectors were different (I instead of L). Jeez. So we sent them the right cables. Now the drivers won't load - the system recognizes the card, but not the drive. What a headache.

Oh yeah. This is the last tech news item for the month, so after much anticipation (not) the thread title theme this month is revealed: names of cats I lived with throughout my life, some adored, some not so much. By far the best kitty ever was Normal (he and his littermates had Geek Love references as names). Our current cats (i.e. still alive and/or hanging around our house) are Olga (Alexei's sister) and Fner (Fnerina's feral half brother). Too bad our dog Laszlo - a purebred Doberman we recently rescued as an adult from the pound - still requires much effort in the ways of socialization, including reducing his desire to hunt down and eat smaller animals. We're working on it.


28 Apr 2009 22:35:46 UTC
Busy busy busy, though not many fun adventures to report in the server realm. The weekend was fairly smooth, as was the regular database backup outage today. Bob went to the MySQL conference last week, so yesterday we discussed some plans for mysql upgrades, tweaks, etc. which we won't implement until the end of next month (i.e. after the anniversary). Of course, there was discussion about the Oracle buyout of Sun, and how that will affect the future of mysql. Apparently panic is unwarranted and we were reminded that the innodb engine, which is mostly what we use within mysql, was already partly an Oracle project. Anyway we shall see.

Jeff and I are continuing to spend our time doing what we can to get the NTPCkr rolling before the anniversary, as well as scraping a talk to present together about the general data pipeline (which we hope to end with the "unveiling" of the NTPCkr). Jeff's been hitting some execution efficiency hurdles (mostly involving many long database queries), but we discovered some more significant optimizations (mostly involving getting around having to query the database in the first place). These speed-ups require some logic changes, which then means fresh code walkthroughs. Extreme programming time.

- Matt


23 Apr 2009 23:07:53 UTC
Today included more messing around with gnuplot and various web programming tasks. I also helped Dan format a pdflatex document. I'm kind of cursed with being really fast at working with these formatting markup languages, so such tasks get thrown onto the end of my work queue a lot.

I noticed we were having a network dip in the afternoon and found once again our web site was being DOS'ed. Somebody (or some robot) was scraping our site, completely ignoring our robots.txt file, etc. Quite infuriating. I wonder if it is officially unethical to make public IP addresses which exhibit this kind of foul behavior. The worrisome part is this kind of activity clobbers mysql (and thus the whole project), and last time this happened everything seemed to recover, and then the database crashed twice over the weekend. We shall see, I guess. It's recovering now.

- Matt


22 Apr 2009 22:33:18 UTC
Looks like there were some beta project problems after the outage yesterday caused by a missing executable. That got replaced, and I think that everything should be okay now on that front. I heard rumors that regular users were seeing beta errors, but I'm hoping that was just confusion. I haven't heard anything since.

Other than that today was more or less a day of system/web plumbing. The web stuff I'm working on is becoming a major kludge due to time constraints. It's actually a conglomeration of C code and perl, php, and C-shell scripts. You know, whatever works. I'm a big fan of getting things working as soon as possible, then making it pretty later.

- Matt


21 Apr 2009 22:16:04 UTC
Tuesday means weekly outage day for mysql database backup/compression. Since the replica got messed up during the duet of crashes over the weekend, we are using this backup today to recover the entire replica database from scratch right now. Should be ready to go in a few hours or so. I think the regular boinc stats xml dumps also broke over the weekend but those should be generating normally again now.

The secondary science database is also suffering some kind of malaise. Not sure what the deal is, but it's slowing down my NTPCkr web site development. I thought it was excess disk activity on the system (caused by writing a primary database backup image to one of its spare drives) wreaking havoc, so I waited for that to end, but still no dice. Had to stop/restart the engine and even then it went through some phase of vague recovery before I could access it again.

Finally got that replacement sata card for the datarecorder down at Arecibo. Jeff and I tested it in a system up here (mostly to make sure we didn't need to update its firmware) and I just put it in a box heading to Puerto Rico (along with a set of blank data drives). Hopefully it'll be a quick swap and we'll be back to recording data again.

Jeff and I are really getting into the mode of programming/development. I think we found a way to speed up the NTPCkr a little bit more this morning, which is always a good thing. I'm still mostly working on internal visualization tools (with some simultaneous thought to what the first rev of the publicly available pages may look like). Don't get too excited yet - it's mostly just a table of numbers.

- Matt


20 Apr 2009 23:04:44 UTC
The mysql database crashed on Friday, then again on Saturday. The reasons are mysterious, though we've had similar crashes in the past - just not two in immediate succession like that. Most of the large, important tables (user, host, workunit, result) are using the innodb engine, while the many others (including team, forum preferences, posts, etc.) are using mysql's standard myisam engine. There's worry we may have lost a few rows in some of the myisam tables, though they seem to check out okay. The replica database, though, is in a confused state so we just shut it off for the time being. We're going to save any remaining cleanup for tomorrow's usual outage. As stated elsewhere, Jeff and I have adopted a policy of no-system-changes (except for emergencies) until after the anniversary. So as long as mysql continues to run well, we're not going to worry about this so much.

I know I write all these missives and therefore I get the brunt of the accolades (or otherwise) but Jeff/Bob pretty much took care of the entire mess above. I did log in on Sunday and cleaned up the server status page and the validators (which for some reason *have* to start on the command line, as opposed to the usual cron job which restarts stopped processes), but that's the usual drill (we're always logging in on nights/weekends to kick one process or another).

- Matt


16 Apr 2009 21:39:09 UTC
Slow steady progress since the last tech news item. The science database continues to be massaged into shape from the past month of nastiness. It's working, but some indexes are still missing, and some queries are taking longer than we'd like. Sometime, probably next week, I'll turn the science status page updates back on - until then the numbers are old and/or flat out wrong.

We're narrowing down the cause of our data recorder woes to either the SATA card or the system itself. We're trying the former first. A new one is on order and we'll have to get it configured remotely (which is a lot easier than configuring a whole new system remotely).

We're also finding that we don't have the processing power we'd like. It seems like we lost a lot of active users over the past few months. I blame the recession. You could also blame Astropulse, I guess. In any case, we need more people. We're hoping the 10th anniversary buzz will help. And speaking of that, Jeff and I are putting all focus on the NTPCkr, just so we have something fun/new/interesting to present in time for any p.r. blitz. That means very little effort in systems/upgrades/etc. for the next 5-6 weeks. Simply don't have the time/manpower.

Sorry about the lull in tech news items. I was on vacation visiting 23 relatives. Many are under 5 years old, which meant a lot of them have colds, which meant I got sick immediately upon my return, earlier in the week.

- Matt


8 Apr 2009 20:00:28 UTC
The science database choked last night. Nothing terrible - it was just unable to deal with the pulse index rebuilds as well as the usual outage recovery. So the assimilators got a little hung up for a while until the current index build was finished. It's still a mystery why this was as big an issue as it was - we've built indexes before on live, fully functional databases. Hmm. Apparently we have to be a little less cavalier about it.

Turning off a server for good always has unintended consequences. Shutting down milkyway yesterday caused mail from the web server to fail. A couple red herrings later I found the problem - the milkyway mail server replacement (clarke) wasn't configured to allow relaying from the web server machine. Easy squeezy problem to fix. Now reset password requests, forum moderation notices, private message alerts, etc. are being sent.

Spent way too much time hunting down the cause of a seg fault in my NTPCker web page code. It's kinda hard when it's a C program that's being executed within a c-shell script, which in turn is being called by a php script, and which is all running under apache. It's frustrating when everything works on a command line, but not within apache. Anyway I finally figured it out, or at least got it working. The irony is this code was to produce a tiny close-up waterfall plot around any given signal (to immediately spot symptoms of RFI), and once it was running Jeff and I realized our database query logic was slightly wrong, and the correct logic would take too long to be of any use in a dynamically generated plot on the web anyway. Sigh. Looks like we'll have to batch job it or something like that.

- Matt


7 Apr 2009 23:15:25 UTC
Outage day today. No big news there on the mysql backup/compression front. We're busy building indexes that were lost during the pulse table rebuild, so that's adding some load to the science database. That may slow splitters/assimilators down at points over the next few days. We shall see.

I did shut down server milkyway for good today, which was our last solaris system still running. This makes me sad. In general, I still prefer solaris over linux, for what that's worth. And I definitely have had much better luck with Sun hardware than with anything else.

Lost in radar/ntpckr coding, hence the short note today. Now I have to catch a bus...

- Matt


6 Apr 2009 22:32:20 UTC
Much progress over the weekend on the science database front. The pulse table has successfully been rebuilt, we started up the assimilators, and the queue drained to zero. With the influx of resources the splitters revved up and more workunits went out. All was well until the logical log on thumper filled up. This is a log of transactions which is necessary for database replication, and given all the pulse table activity it's no surprise it did get clogged up with extra transactions. When the log fills, the database engines have no choice but to hold still until there's log space again. Jeff noticed the dip in the traffic graph and got that all sorted out.

Just now there was another dip in the traffic caused by some DOS'ing on our web site causing some mysql database overload. Damn robots skimming stats off our sites... I made a quick route rule to block the offending IP. This damaging effect was probably unintentional but still very annoying.

- Matt


2 Apr 2009 22:44:31 UTC
The science database issues slowly get better. The root drives are now all sync'ed up, but as I mentioned before this is only a temporary condition. This will fail again upon next bootup. That's fine because this forces the issue of reformatting the data RAIDs on the system which is something I've been wanting to do for a year now - might as well reformat the whole system, root, data, and all. The pulse table continues to get populated and assimilators remain off - at least for another day. We're about to run out of workunit disk storage (again) so expect another workunit shortage period in the very near future. My new rough estimate for the pulse load to finish is sometime tomorrow, and then we can turn the assimilators on, and we will be as back to normal (whatever that means).

One of the download servers (bane) has been having mounting issues the past few days, hence the locking-up of the server status page. I just rebooted the thing. Let's see if that holds.

Once again today was mostly a coding day. I've been annoyed by the radar blanking stuff, being as how the design has changed underneath me thus rendering a week (or two) of my effort moot. The old understanding was that we should only being seeing one type of radar at a time, but my output was showing this to be far from the case. Nevertheless once I got a quick handle on the fftw routines I made quick work of the correlation code and am already spotting radar quicker and more effectively. However a lot of graphing/threshold tweaking is in order before I can really start locking on and blanking.

- Matt


1 Apr 2009 22:01:27 UTC
Let's see.. we're *still* waiting for the RAID resync's to finish and likewise the pulse table rebuild. Another day or two? Meanwhile, I cleared off enough space on the workunit machine such that we can keep producing/sending out work. We still can't assimilate very much until the pulse table rebuild is over, but at least the people can do science and get credit. I'm worried about mysql bloat with the large result table (over 2 million waiting for assimilation), but we've been here many times before and lived.

Lost in the chaos of outage recovery yesterday was a bunch of "make science status page" processes piling up on top of each other, causing extra stress on the science database, and eventually making the splitters jam up. Oops. I killed all those this morning and that particular dam broke. Now that we're catching up on satisfying workunit demand I think we'll be maxed out traffic-wise for a while, which isn't the worst of problems (that means work *is* flowing as fast as we can send it).

Lots of code walkthroughs with Jeff today regarding the NTPCker. It's getting to be a mature piece of code. Scoring mechanisms are almost all in place (though they still may need major tuning once we sift through enough real data). We're still concerned about our ability to actually keep it running "near time," i.e. will the database be able to handle the load? We shall see. A lot of database improvements to help this have unfortunately been blocked on the last couple of weeks' worth of problems with thumper.

Happy April Fool's Day! Don't believe anything anybody says! Actually that's good advice regardless of the day of the year.


31 Mar 2009 22:48:04 UTC
Another Tuesday, another planned outage. We did the usual database compression and backup but it still took a long time as we're bloated with 2 million extra results waiting to be assimilated.

No big deal there, but of course we're still mired in the thumper projects. It's becoming a two-weeker (since the original crash the Friday before last). Remember we're fighting on two fronts: rebuilding the root drive RAID and rebuilding the pulse table. Starting with the former, all we (thought we) had left to do was install grub on one of the two bootable drives (even though the weird drive numbering causes grub to read the actual kernel image off a third, non-bootable drive). Before launching into that I rebooted the system just to make sure everything was working.

This system has very large ext3 file systems, and so I used tune2fs a while back to prevent a long (6-8 hour) forced file system check every 180 days (the default). Unbeknownst to us, it would *also* force a check every N mounts. So I was very displeased to find the system going through a round of forced checks when all I wanted to do was quickly reboot the thing. I was just going to let it go, but after a half hour I got sufficiently annoyed to just halt the check (gracefully) and re-tune2fs'ed to prevent this from happening again.

And upon coming up I was further displeased to find the only root drive (of the three) that appeared in the RAID was the one in the non-bootable slot. We're stumped as to why. Well, even though this RAID was seriously degraded, we powered down, did the planned drive swapping and brought the system up. Even though drives were swapped the only root drive this time in the RAID was the (new) one in the non-bootable slot. Fine. I'm pretty much of the opinion we need to reinstall the OS on this point to clean everything up, but until that happens we have some (oddly long) drive resyncs to un-degrade the RAID. Of course, this will all fail again upon next boot as far as I can tell.

Meanwhile, the pulse table reload that started yesterday failed last night. Since we have redundant database servers now, the informix engine is sensitive to anything that may bring the primary/secondary systems out of whack. This includes really long queries, like the one we started yesterday to copy 500 million pulses from one table to another. Back to square one. Jeff wrote a script that breaks this one query up into many smaller ones, thus hopefully circumventing any "long query" issues. We estimate this will be done Thursday sometime.

I did start up one assimilator - the trickery I mentioned yesterday (to let assimilation run alongside pulse table insertions) does work, however as the pulse table gets populated it eats up a lot of database locks, and the assimilator can barely get an insert in edgewise. In any case, I found a rich source of stuff to move off the workunit storage server, so at least that bottleneck will be temporarily alleviated.

Oh, yeah - end of the month, so that's the end of the current thread title theme. I think the only person who came close to describing the theme was QuietDad yesterday (apologies if others got it earlier). Anyway, the official theme was: Apple II hackers/game programmers who, as a budding young programmer myself in the 70's/80's, I thought were super heroes such that I fondly honor their names (real or otherwise). It takes a real game programmer to do *everything* - not just the game logic but also the design, the graphics, the animation, the sound, the music... and do it all in machine language (and 6 colors, including black and white, in 280x192 "hi-res" graphics).

- Matt


30 Mar 2009 21:58:54 UTC
Monday, Monday. There was little done on the science database/pulse table problem over the weekend - we hit a couple snags so we tabled it until we were all here in the lab today. It looks like we're doing the big move successfully now (taking the 500 million pulses from the old table and inserting them into a new, better formatted table with more extent space). I was hoping that we'd be able to do some trickery to get assimilation flowing again simultaneously, but it looks like that isn't in the cards.

With the assimilator queue clogged we can't delete anything, which means we ran out of room to create new workunits, or at least enough to keep up with demand. Hang in there, folks. Work is on the way.

- Matt


26 Mar 2009 20:25:02 UTC
So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs since that's been backed up for a week.

Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of db spaces. This is basically all you can do when you run out of extents - unload the table, then reload into new db spaces. I roughly estimate the unload will take at least 24 more hours.

Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But being that we've been running only astropulse for a day that actually helped push a lot of ap workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (seems like even the one splitter is sensitive to the current heavy load on thumper).

Tomorrow is actually an official university holiday (the staff gets its one day of spring break). However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend.

- Matt


25 Mar 2009 21:07:21 UTC
Mmm-kay. So where are we at with the science database...? The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the kvm monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we found the problem only to reboot and (five minutes later) finding we were wrong, then having to reboot again off of DVD (taking another five minutes).

Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was ennumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc.

Well, whatever. It's been a two-day-long game like a demented version Towers-of-Hanoi - swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. On hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of informix configuration which also exists on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again) and will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will stil try to get Astropulse running in some form later on today/tonight.

Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple hours ago, but we did discuss how when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the suddenly influx of free CPU time. We joked how the these thumper issues strangely coincided with their arrival last week.

Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw.

- Matt


24 Mar 2009 20:27:33 UTC
The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again.

Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d] which means lots of disk swapping and rebooting and resyncing.

However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently.

To add insult to injury our pulse table in the science database on thumper ran out of extents last night, which basically means the tables are full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.

When it rains it pours, but we'll be back to normal again soon enough.

- Matt


23 Mar 2009 19:30:51 UTC
We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd has been issuing complaints recently about bad sectors, and then the whole system crashed Thursday sometime in the early evening. While I was able to get the machine back up and RAID resyncing from home that night, the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart").

The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when installing linux on this the first time a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda.

Since we never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]) how were things working all along? After some head scratching and drive swapping they got thumper back on line. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly I still don't quite get it as I write this up but I'm hoping I will after we go through the whole procedure.

And then yesterday jocelyn (the primary mysql server) had some issues. Eric restarted it, and things seemed to clear up without much ado in due time. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow.

- Matt


19 Mar 2009 20:44:53 UTC
Another work week is drawing to a close for me (I don't come in to the lab on Fridays - sometimes I work from home - sometimes not). The servers continue to hold on as long as we have the hardware/network resources available (when will they become unavailable? Hours? Days? Weeks? Months?). Yesterday I mostly worked on NTPCKr web programming - stuff for mostly internal use, but a "lite" version will be made public eventually. Why the "lite" version? It's not because we have something to hide - we just don't have the web server/database resources to handle the traffic. The hope is that the public version will at least have a regularly updated list (every hour?) of the current most interesting pixels on the sky, and you can click on them and see where they are in the sky, and get some sense of why they scored well (numbers of signals, they line up with stars/extrasolar planets, etc.). The internal version will have, among other things, additional clicks so we could pull a window of signals out of the database, plot them, and we can scan for RFI - you can see why this would add a big load on our servers. Nevertheless, we'll see what we can manage, and try to much as much information as possible available to everybody.

Today I spent way too long dealing with confusing subversion/trac configuration. Annoying. I guess I should be getting back to radar blanking (sigh).

- Matt


17 Mar 2009 21:37:38 UTC
Hello again. Sorry about missing a couple days there. The end of last week I did write a tech news item that I neglected to post as I got suddenly very busy at the end of the day with random programming tasks, and yesterday I was lost in many meetings and other post-weekend catchup. So be it. Here I am now.

The end of last week I was a stand-still with various projects, so I chipped away at neglected chores and other nagging annoyances. Like our new mail server's log filling up with cryptic automounter messages regarding a machine we haven't had on line in five years - I finally tracked this down to Eric's home-grown spam challenge script which made reference to this machine in its LD_LIBRARY_PATH. I also tried and failed to figure out why one of our systems, configured exactly like the others, refuses to acknowledge the lab-wide legato backup server. And I cleaned my keyboard for the first time ever (which was gross after years of eating at my desk, and this was probably not helping the lingering ant problem). Then I got lost in NTPCker stub web page design.

Yesterday there was much discussion about radar. Dan, recently back from Arecibo, confirmed some things and had news about others. The radar blanking code I took over and improved upon had faulty logic, caused by some early misunderstandings (not mine) about how the radar behaved. Most of the radar we see is from the airport, and that's all the hardware blanker thwarts. However, there are 5 other patterns we detect, including the aerostat balloon radar. So one problem is that at times we're seeing a jumble of various radars, making it very difficult to "lock on" and blank them. I'm working on that now. One other point is that the radar frequencies are all pretty much out of our band (typically around 1.3GHz - we're looking around 1.42GHz), but nevertheless are so loud they jam our receivers. However, sometimes if certain projects call for it the Arecibo operators turn on a high pass filter so that the radar frequencies under ~1.4GHz are completely silent. When this happens (about 20-25% of the time) our data are incredibly clean, even without hardware blanking. Of course, since we're piggybacking we can't control when the filter is on, but we do keep track of it in our data headers. We might prioritize this cleaner data for astropulse, which is far more sensitive to radar than SETI@home.

Today had the usual outage for mysql database backup/compression. I took extra time while everything was quiet to move a lot of big files around the raw data storage server - that's mostly why we were slow to get out of the outage this time around, but at least now I can start emptying the latest shipment of drives from Arecibo. Speaking of drives, there was some discussion about that, too. We may start trying to partially send data over the net, if not completely. We thought this was impossible due to bandwidth constraints, but operators at Arecibo told us to give it a shot. This is low priority since, however annoying, the drives, their enclosures, and the shipping rigamarole works well enough right now.

In general the public-facing servers continue to behave themselves. It's been a good couple of weeks. I don't believe in jinxes so I don't mind saying as much. I will say that the workunit storage server is filling up again - a factor of astropulse actually performing well, and workunits sitting around a long time waiting to validate. If it does fill up we'll have to deal with it.

- Matt


11 Mar 2009 20:43:03 UTC
Lots of machine rebooting today as Eric is getting his new hydrogen server online, and I'm finishing work on moving mail servers around. This shouldn't have affected the outside world. During all this Eric gave Jeff and I a quick tutorial on merged file systems. Wacky stuff.

Radar wise, I got some lengthy notes from Phil down at Arecibo. Turns out by far most of the radar we see is from the airport, which was news to me, and that's the only thing the hardware blanker checks for. Discussions will continue.

Dan, while at Arecibo earlier this week, replaced our non-working raw data drive enclosure with one we've been using up here. It's unclear whether this helped or not. We're learning that SATA drives (and enclosures/backplanes) aren't necessarily meant for excessive hot-swapping, and will fail after N "mating cycles." This may be what we're coming up against.

- Matt


10 Mar 2009 22:45:02 UTC
Tuesday means weekly outage day. Nothing really interesting or scary today. The only sysadmin thing I did during the quiet time was moving mail service off one machine (which we plan to retire soon) onto another. Still have a couple steps to go on that front.

I should mention that we upgraded our network connection from our auxiliary lab to the server closet from 100Mbit to 1Gbit. In practice this meant simply replacing an old cheap switch which a new cheap one. This was mostly for the benefit of Eric and his new compute server, but on the side helps vader (which handles half the downloads and all the assimilators) and our other compute servers maul and marvin (all of which still sit in the other lab, awaiting room in the closet).

Finally stopped being sidetracked enough to work on radar blanking again today. I'm finding some data is very clean and would like to not enforce blanking if it seems unnecessary. E-mails were sent to the experts for advice.

- Matt


9 Mar 2009 22:38:16 UTC
Happy Monday, everybody. It was a pretty smooth weekend, so not much to report there. Today I mostly took care of chores and the less glamorous/interesting side of systems administration. Eric bought a new server for his hydrogren projects. We needed to put it somewhere, so we decided to put it in our current auxiliary rack, which is currently sitting in our other lab waiting to replace one of the smaller (and less useful) racks in the closet. One of the download servers (vader) is actually in this auxiliary rack already. Anyway, we discovered that yet again the rails for this server are ever-so-slightly too big given the current rail configuration. Annoyed but determined Eric and I put forth the effort of taking vader out of this rack (which is why it was offline for an hour there) and adjust the stupid rails. Now everything fits. Good.

To answer PhonAcq's question ("Now what is on the agenda to improve things to the next level of performance??"), there is always some looking ahead to what we'll need soon. First up is more memory in our mysql server (jocelyn). When all is well it can easily handle a mixed bag of 2000 queries/sec, but during peaks or other crises it may start to page and cause massive disk i/o. Given the current memory configuration it'll be quite easy to add 4GB ram to the system, which will help. Of course we're simultaneously scanning different avenues of download/upload bandwidth increase. We still have yet to do the whole project of converting thumper's RAIDs from 5 to 10, which will boost science database (and likewise splitter/assimilator) performance. There's more, but that's a good start.

- Matt


5 Mar 2009 22:01:59 UTC
Once again not much hardware/server stuff to report. I guess the ap_validator "2" is failing due to seg faults. A fact that is obscured on the server status page (due to automatic parsing of configuration files) is that the ap_validator "2" does strictly astropulse_v5 workunits, while ap_validator "1" validates older astropulse workunits. In any case, I warned Josh, he's looking into it, etc. Probably a broken result file/database entry is causing it to seg fault and quit before doing very much.

Today was mostly conceptualizing/programming again for me, though focused back on radar blanking stuff as I should really get this done. I'm getting bogged down with "ragged files" - where the chunks of data aren't nearly ordered, thus causing confusion about where the software/hardware blanking bits are. This usually isn't a problem, except when a particular raw data file is ragged at the top or bottom, and the chunk containing blanking information needed by adjacent chunks is actually at the end of the previous file, or at the beginning of the next, or nowhere to be found at all.

- Matt


5 Mar 2009 0:26:41 UTC
Don't really have much to report today, tech-wise. The replica problems I mentioned yesterday ended up not being problems at all. There was some network security stuff I got bogged down with yesterday afternoon and again this morning - campus is ultra paranoid, so when they see what they think is nefarious activity (false positive or otherwise), or even potential security holes that haven't yet been exploited, you have to pretty much drop everything and act on it, which is fair enough.

I spent pretty much my entire day getting the ball rolling on the "visualization" of the NTPCkr output. Jeff has some code working which dumps out giant blobs of xml detailing the "current best" points on the sky. So I spent the day writing up some php which digests this xml and makes nice tables, plots, etc. It's all very basic so far, but it's a start.

We're getting large bursts of network activity at midnight every day now. Not sure what's up with that. Somebody's got a cronjob somewhere doing something.

- Matt


3 Mar 2009 23:11:39 UTC
Usual outage day today (database backup/maintenance, etc.). Actually it would have been "usual" except that certain finagling by us in the background may have messed the replica up. That remains to be seen - if it needs intervention the fix would be trivial.

Oh look. Somebody updated web code. Pretty colors. I think I overheard Dave talking to Rom about new forum features. I have no idea what they are.

Helped Jeff walk through NTPCkr code this morning, tracking down bugs, etc. In essence the goal of this program is simple - to find groups of signals in our data that fall within a certain window of frequency/space but have been seen over multiple observations, and preferably near stars/planets. But it's actually quite complicated - there's a lot of set analysis/manipulation requiring chunks of dense code where bugs can hide if you're not careful. Plus there are always new "special cases" we find (or dream up before we find them) that we need to consider. In any case, we're pressing to get this thing rolling and producing non-zero results before the 10 year anniversary of the SETI@home launch in May.

- Matt


2 Mar 2009 23:01:18 UTC
Not much going on (SETI@home-wise) over the weekend. The fallout from those traffic woes over a week ago are pretty much all behind us (I think completely after we do the database compression tomorrow). The average temperature in the server closet has risen slightly, but we think this is mostly a function of current weather (it seems that during rainy/foggy periods the air conditioner is less efficient).

I did get another server online - something donated by Intel a while ago but only now found the time to set it up, add more memory, etc. It's going to mostly used as a compute server for Eric's hydrogen study project, which is good for SETI as that means his IDL processes won't be competing with our NTPCkr/radar blanking tests.

We continue to have raw data drive enclosure problems. This time the set down at Arecibo is getting funky. Very hard to debug remotely.

- Matt


26 Feb 2009 19:46:29 UTC
Random day today for me. Catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous.

The question was raised: Why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority. First, the programming overhead to the splitters, clients, etc. - however minor it may be it's still labor and (even worse) testing. Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80). Third, the amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so. Fourth, this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We might be actually implementing better download logic to prevent coral cache from being a redirect, so that may solve this latter issue. Anyway.. this idea comes up from time to time within our group and we usually determine we have bigger fish to fry. Or lower hanging fruit.

Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to.

- Matt


25 Feb 2009 22:48:42 UTC
It looked like we got beyond the current deluge without too much intervention. Good. Then our bandwidth spiked again. Bad. But then it recovered once more. Good. Oh well, whatever. We're still just in "wait and see if it gets better on its own" mode around here - if we hit our bandwidth limits (and we understand why) there's not much else we can do.

Spent a chunk of the day tracking down current donation processing issues. What a pain. I really need to document the whole crazy donation system so other people around here can fix these problems when they arise. Maybe I'll do that later today. Other than that, just some data pipeline/sysadmin type stuff.

A note about the server status page: Every 10 minutes a BOINC script runs which does several things including: 1. start/restart servers that aren't running but should be, and 2. run a bunch of "task" scripts, like the one that generates the server status page. Since this status page script runs once every ten minutes, it is only a snapshot in time - not a continuum. It also could take several minutes to run its course, as it is scanning many heavily loaded servers. So the data towards the top of the page is representative of a minute or two earlier than the data towards the bottom. And server processes, like ap_validator, hiccup from time to time and get restarted every 10 minutes, then maybe process a few hundred workunits, but fail again a second before the status page checks its status. So even though it was running the past couple of minutes it shows up as "Not Running." In short, don't trust anything on that page at first glance.

- Matt


25 Feb 2009 0:16:11 UTC
Had our weekly maintenance outage today, including the usual chores. I took the opportunity to replace a failed drive on one of our administrative file servers. I also issued the long-overdue final "shutdown" command on another administrative server, kang, which we no longer use. Many years ago, during the early days of SETI@home, several Sun representatives came by one day to discuss our progress. We thought it was just an informal touching-base kind of meeting, but they told us at the end they were going to donate a whole rack full of 6 state-of-the-art Sun servers and 2 disk arrays. Sun has always been nice to us, but this was completely unexpected. We eventually dubbed this the "k-rack" as we named every server after a sci-fi character starting with "k" (kang, kodos, kosh, klaatu, kryten, koloth). Well, kang, was the last one to go - the end of an era. We're still using the rack itself, though - very useful.

Network bandwidth woes continue, moreso now that we're coming out of the weekly outage. Lots of discussion about this in the previous thread - let me see if I can wrap up all the major points quickly. There are three potential solutions to our bandwidth limitations that we are actively entertaining/researching with the related parties. They are: 1. get a full 1Gbit link up to our server closet (pros: zero migration, cons: time/cost - about $80K in parts/labor), 2. collocation on campus (pros: minimal cost/migration, cons: almost impossible nuisance having to administer from a distance), 3. have a third-party entity host/administer everything (pros: we can ditch sysadmin for once and get back to work, cons: major cost, major migration). Each of these solutions requires a major amount of "getting ducks in row" (due to equipment policies, contract terms, general scheduling issues, etc.) - it's hardly just a money issue. Of course there are other options, too, like putting all efforts into final data analysis and shutting down SETI@home. One major issue is that our server closet (roughly 100 CPUs, 100 TB disk, 200 GB RAM) operates atomically - it's all or nothing. We can't just move one piece somewhere else. It's long and complicated - please don't make me explain why unless there's a free pitcher of beer involved.

- Matt


23 Feb 2009 21:06:51 UTC
Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.

After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, somebody wants to download the latest astropulse client, they go to our download server, and then they redirected automatically to the coral cache server. The redirect is of the form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client download requests hit originate from sources outside our lab, thus saving us lots of bandwidth.

That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave."

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior is older, yet still commonly used, boinc clients. Dave said most of that has been addressed, but if they're still bugs they'll be fixed.

In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.

Eric actually had most of this figured out before we arrived today, and already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get those partially working again (instead of only 2% uploads getting through, now it's about 50%).

The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.

- Matt


19 Feb 2009 20:41:57 UTC
As we move toward the weekend we're sticking with the current raw data storage workarounds, which means servers are loaded heavier than we'd like, but at least data is still flowing. I wouldn't be surprised if there are network hiccups or if the assimilator queue swells during the weekend.

So far this morning lots of chores. Bob and I got a shipment of empty data drives bundled up to be sent to Arecibo. I finished getting the new CPU server configured (now me, Eric, Josh, and Jeff are in less competition for cycles). I made more strides towards retiring the last two Solaris machines. Honestly, depending on the development/production environment I'd still probably prefer Solaris over linux. So I'm sad to see these systems go, but they are both very old Sparc machines that we simply don't need anymore.

Late last week Eric, Jeff and I had a quick meeting to discuss current candidate scoring algorithms - we're pretty sure we'll have to tweak them as we go, but we're in enough agreement to get started implementing this part of the NTPCker. Jeff's been all over that this week. I'm just now turning my focus back to actual development, too. My software radar blanker now agrees with the hardware blanker 90% of the time, which is a very good start. I can add an additional 5% just by adjusting thresholds, but the real test is to run software blanked data through the pipeline and see which workunits generate more RFI (the ones using hardware blanking or the ones using software blanking).

- Matt


18 Feb 2009 23:39:35 UTC
Still having ups and downs with the raw data storage. Possibly a second disk failure. We'll get to the bottom of it soon enough. Traffic may be a bit rocky at times, but hopefully not so much. We also just noticed a drive failed on our upload backup storage. That RAID pulled in a spare without anybody realizing what happened until Jeff and I saw the little orange light in the closet today. We really need better monitoring tools. Actually, we have the tools - we just need time to implement them. Still, it's not a super-critical logical drive (it contains backup data from a separate RAID device) so we're not panicked trying to procure a new spare... yet.

I wish I had more positive things to report today. This details I'm failing to mention out aren't all that fun either. Not my day today, I guess.

- Matt


17 Feb 2009 23:42:50 UTC
Over the long (President's Day) weekend one of our storage servers had a headache. Not a big deal, and we got to the bottom of it today (pretty much just a RAID drive failure). We were able to get a workaround in place so we could start generating/saving workunits again, and will slowly transition back to normal over the next day or two. It has been a bit rocky the last few days because the workaround involves a different RAID with far less I/O throughput.

There's always a bright side during work transmission failures: we get to catch up on backlogged queues. So by the time we had our usual database compression/backup outage today the result table was relatively small, and therefore got packed down nice and tight. That's always helpful.

Spent most of the day with the fallout of the above, while also getting a couple systems configured for new duty - mostly administrative/CPU servers that will replace a some older clunkers.

- Matt


12 Feb 2009 20:20:58 UTC
Looks like "Astropulse V5" was finally released yesterday night. As far as I know so far, so good - work is going out, results are being validated. However, it seems like jocelyn (the master mysql database server) had a long period of mysterious pain over night, and recovered on its own this morning. This happens from time to time on our mysql servers, perhaps due to its own nebulous data scrubbing, or perhaps due to lack of memory which is becoming more a problem as the database continues to grow and less of it fits in RAM. Unless anybody out there has a couple Sun-qualified 2GB DIMMs that work in Sun v40z's kicking around, we're going to purchase a few. Currently the system has 28GB of RAM - 12 slots with 2GB DIMMs, the remaining 4 with 1GB. We hope to at least upgrade those four to 2GB. It is unclear whether or not our version of the v40z can take 4GB DIMMs (and go over 32GB total).

As for radar blanking, let me clear up the general picture.

Now that we are using the ALFA receiver (since 2006) we are susceptible to military radar, which causes many overflows in our SETI@home/astropulse analysis. The transmitter is aimed right at us approximately every 12 seconds, and then echoes bounce all over the mountains surrounding the telescope the rest of the time. Even the echoes cause us to overflow. The radar is fairly unpredictable - the military isn't very forthcoming about their transmission patterns, and when they are going to change to another pattern. Nevertheless, it is predictable enough: there are about 6 known "patterns" us civilians can lock on to.

Luckily, Arecibo solved this problem for us. They have a hardware device that broadcasts a bit letting all projects at the observatory know when it thinks the radar is on (1 for on, 0 for off). This we call the "hardware blanker" - and we inject this bit into an unused channel in our raw data. This has been quite helpful: when the bit is "1" we'd randomize the data, thus squashing the overflows. At least in theory - there were still three problems.

Problem 1: We only got the hardware blanker working sometime in 2007, so there is no such blanking information in the previous years' worth of data, thus rendering it fairly useless.

Problem 2: The hardware blanker sometimes isn't on like it should be, or even worse is mis-locked onto a wrong pattern and going out of phase with the actual radar, which also renders data quite useless.

Here's where my code comes in: The "software radar blanker." Actually, this is code/logic written by a summer student, Luke, and then I cleaned it up and (apparently so far) got it working. In short, the software radar blanker does a statistical analysis of the raw data - basically looking to see when we're blasted by radar and then trying to lock on to known patterns, and extrapolate from there. Luckily there's another free bit available in the raw data, so the ultimate plan is for raw data to come up here, go through the software radar blanker, and then process. The splitter will use the software and hardware radar blanker bits (exactly how is still up for discussion) to randomize the data. This brings us to...

Problem 3: The randomization shouldn't be totally random. Initially we were injecting white noise into the data when we were blanking. Turns out this causes edge effects and other artifacts during the client analysis. This noise was eventually shaped to fall in line with noise we'd expect to see from a quiet Arecibo. The exact mathematical details of this are left to others who aren't me. I was out of this loop.

All the above was taking too long, so Josh actually implemented code in the astropulse client to reduce some of these radar problems until they are completely solved. He isn't radar "blanking" (which happens during workunit creation) as much as having the clients find stuff that is probably radar and treating it accordingly. For what it's worth, one of the CASPER guys, Andrew, has been having the same exact military radar problems with the pulsar data they've been collecting at the ATA, so he's been simultaneously working on his own radar mitigation techniques. Man, the earth is noisy.

In any case, I figure it'll be about a month of testing/tweaking before we're actually using the software blanker.

- Matt


11 Feb 2009 23:01:41 UTC
Before releasing the astropulse application Eric had to add a couple fields to the result tables in the science database that are now necessary. These are large fields, and it's taking informix forever to update the table. The job was started 24 hours ago and is still chugging along. I guess it doesn't help that the assimilator queue is still rather large (though it is draining). So the release is delayed until this job finishes.

The radar blanking stuff I was whining about the other day has nothing to do with the astropulse release, in case there was some confusion about that. Josh and I are working on two completely separate and different forms of radar mitigation. Mine is to better clean up data before any splitting/analysis, Josh's is to deal with radar that squeaked through the first pass and made it all the way to the client. The good news is that I made significant progress on mine today.

- Matt


10 Feb 2009 23:03:00 UTC
Today's Tuesday - that means weekly outage. Outside of the normal database backup/compression drill I went through the rigamarole of changing the user id of mysql on the master database server (and updated the ownership of all its files), if only for administrative ease now that it matches the same user id as all other instances of mysql here in our group.

I also decided to yum up several servers that were lagging behind since we have been getting ugly yet harmless kernel warning messages for a while now. Unfortunately, this general update included a buggy nfs package (which I knew was buggy months ago but assumed they must have fixed this by now) which then locked up one of our main file servers, thus grinding everything to a halt. It was an annoying hour or so trying to figure this all out, and ultimately the only solution was to fall back to an old version of nfs. Not sure why this nfs-utils package is *still* in the repositories.

Josh is working on getting another astropulse client out into the world today, and is fighting with the code signing machine as I type this sentence.

Here's another problem we've been having over the past couple weeks, and it doesn't seem to be getting better: ants. I typically don't take a lunch break, and just nosh all day by my computer during small cracks of time. Dave and Jeff are the same way, and have the next two desks adjacent to me. Even though we're on the third floor the ants finally found the motherload of crumbs and unwashed utensils left on our desktops. There's not enough of them to find their exact point of entry nor plot their general plan of action. So throughout the day I've been mashing the little buggers as I spot them. Hopefully they'll just give up and disappear - meanwhile my work space is smelling more and more like formic acid.

- Matt


9 Feb 2009 22:48:18 UTC
My mondays are generally spent (a) figuring out what went wrong over the weekend (if anything), (b) cleaning up the data pipeline which has been running on its own for three days, and (c) preparing for or sitting in meetings. Today wasn't so different.

Between my radar blanking tests, Jeff's NTPCkr tests, Josh's astropulse development, and Eric's hydrogen studies we're suddenly finding ourselves woefully low on CPU/memory power. Sure, we have 100 CPUs in our closet, but I'm kind of a fuddy-duddy when it comes to running non-critical processes on our high-availability public facing machines. This is frustrating to others as these machines are the ones best suited for the testing/development we're doing. Luckily, we have one server, maul, which can never be a critical system as it has a test motherboard which would be fine except it intermittently loses contact with the keyboard. So this is our one CPU server which is now usually overloaded to the point of unusability.

We do have two machines coming to the rescue: One from Intel, actually donated around the same time as maul. We haven't gotten around to installing an OS on it until today. Why? Well, that means also needing an IP address for it. The university charges us monthly per IP address we use, so to conserve funds we've been keen on only bringing systems online we actually intend to use, preferably to replace a current system. The second machine is a similarly powerful one that we received from a private donor last week.. but the motherboard was DOA. At least that's our theory. We'll get that replaced soon. Both systems will go a long way towards reducing our current development/testing constraints - something we haven't been worried about too much over the past decade because we've been mostly in a mode of data collection/reduction instead of final data analysis... in case you haven't noticed. I'm happy this is changing (or at least portending to change).

- Matt


5 Feb 2009 23:57:23 UTC
Spent a large chunk of the day actually programming, which is nice. It seems like the network bandwidth bottleneck part of our malaise over the past couple of weeks has finally gone away - we're back down to a floor of 60 Mbits/sec. However, the mysql database is still quite clogged up. Looks like as I type this sentence we're still having fits as the splitters/feeders/etc. can't get their queries through fast enough. I'm hoping the bandwidth drop means the excess results were all finally downloaded, which means in the next few days they'll return, and we can finally get them validated/assimilated/deleted and out of our hair.

There was a sweeping change in web code brought on line this afternoon. This broke web account authentication, making it impossible for people to log in. Oops. Not my bad - don't kill the messenger. Anyway it was fixed quickly enough.

- Matt


4 Feb 2009 22:01:23 UTC
Moving on... We seem to have eventually recovered just fine from the replica resync, as well as the outage in general. Traffic is still very high, but at least just below the point of impossibility. The assimilator queue is indeed dropping, which is a good thing, as that means we're inching closer to removing all the excess workunits and results from the disk, as well as the database. We still seem to be dealing with the result indigestion I described two days ago, but this too is sloooowly getting better over time.

We've been having some load issues on the web server (thinman). There were no obvious signs of being DOS'ed or over-spidered, if anything it seemed like apache developed a memory leak. I yum'ed in the latest kernel, rebooted the machine (in case anybody noticed a 5 minute outage earlier today), and it looks okay at this point. Maybe just a simple case of reboot-itis.

Just found another potential problem with the radar blanking code. Sigh... (Don't worry - it's not a C++ issue).

- Matt


4 Feb 2009 0:35:10 UTC
So then. We had our weekly outage today. We knew it would be a long one - the result table is bloated for various reasons so it took forever to compress. This may help get past this period of "indigestion" I mentioned in the previous thread, but there's no sign of it getting much better any time soon. Expect continuing network pain. Plus Bob is resync'ing the mysql replica, so that'll be behind a bit in the near term.

Quite often we recompile all the back-end servers with code thoroughly tested in beta and switch in these new versions in the public project during the outage. We did so today, and the splitters and assimilators all freaked out upon starting up this afternoon with library linking errors. What a hassle. It seems like our servers are slowly getting more and more out of sync, given some are 32-bit, some are 64-bit, some are running this rev of the OS, some are running that rev, some have this package installed, some don't, etc. and this is apparently becoming a problem. Like we have time to clean this all up.

<obnoxious rant>
I was having an offline discussion with a friend who insists that C++ is a vast improvement on C, and that C programmers who complain about C++'s major failings are living in the past or "just don't understand." I wouldn't mind the debate except C++ afficianados usually adopt a smug, condescending tone regarding C programmers that reminds me of republicans describing democrats. In any case there was a programming mystery today that ate up a man-hour of my and Jeff's time. If the object in question was just a struct it would have been painfully obvious. Instead the problem was obscured in vague assignment operator behavior. Does anybody have an actual, simple example of C++ code that is (a) easier to debug than analogous C code, (b) required less manpower to generate, and (c) will be forever useful and understood? I'm willing to be convinced, but it hasn't happened yet. Maybe it's just a different (and not necessarily better) kind of brain that loves C++, but I tend to think it stemmed from the evil part of our monkey mind that turns a blind eye toward unnecessary complication for everybody in the hope that things may be easier for ourself later on. Or the other evil part of our monkey mind that foists contorted methodology on others as some sort of sick competition (which may be fun but is hardly productive). K&R = 200 pages. Stroustrup = 1000 pages. Is C++ really 500% better that it requires 500% the pages to describe? Nope. Case closed.
</obnoxious rant>

- Matt


2 Feb 2009 21:54:21 UTC
Happy Monday everybody. I guess I should move on from the January thread title theme (odd little towns/places/features in southern Utah which I've been to during many nearly-annual backpacking/hiking adventures in the area - easily one of the best parts of the U.S.).

We did almost run out of data files to split (to generate workunits) over the weekend. This was due to (a) awaiting data drives to be shipped up from Arecibo and (b) HPSS (the offsite archival storage) was down for several days last week for an upgrade - so we couldn't download any unanalysed data from there until the weekend. Jeff got that transfer started once HPSS was back up. We also got the data drives, and I'm reading in some now.

The Astropulse splitters have been deliberately off for several reasons, including to allow SETI@home to catch up. We also may increase the dispersion measure analysis range which will vastly increase the scientific output of Astropulse while having the beneficial side effect of taking longer to process (and thus helping to reduce our bandwidth constraint woes). However, word on the street is that some optimizations have been uncovered which may speed Astropulse back up again. We shall see how this all plays out. I'm all for optimized code, even if that means bandwidth headaches.

Speaking of bandwidth, we seem to be either maxed out or at zero lately. This is mostly due to massive indigestion - a couple weeks ago a bug in the scheduler sent out a ton of excess work, largely to CUDA clients. It took forever for these clients to download the workunits but they eventually did, and now the results are coming back en masse. This means the queries/sec rate on mysql went up about 50% on average for the past several days, which in turn caused the database to start paging to the point where queries backed up for hours, hence the traffic dips (and some web site slowness). We all agreed this morning that this would pass eventually and it'll just be slightly painful until it does. Maybe the worst is behind us.

- Matt


29 Jan 2009 23:25:26 UTC
The replica mysql database on sidious recovered more or less just fine. It may be ever so slightly out of sync with the master database. This means we'll probably rebuild it during the next weekly outage just to be sure.

The scheduling server was up and down yesterday afternoon and this morning. The scheduler CGIs have been segfaulting and adding core dumps caused the system to grind to a halt, needing a reboot. Turns out the problem wasn't in the CGI, but in apache itself (or the fastcgi module). This has been a problem in the past. We seem to have to tweak various apache parameters at random times, based on a chaotic, unpredictable equation involving current resources/demands, mysql health, network health, system health, various queue sizes, etc. Simply reducing the MaxClients to a much lower number caused the segfaults to disappear while still servicing all incoming requests.

We're running low on data to send out, and we're in a murky period where the weekend is rapidly approaching and we are still awaiting the latest shipment of raw data drives from Arecibo. We could pull up as-yet-unanalysed data from our archives, but the offsite storage archive (HPSS) is undergoing several upgrades and have been offline for days. We'll see how this all pans out...

- Matt


28 Jan 2009 23:24:18 UTC
Last night sidious (mysql replica database server) rebooted itself. Yeah, we did just move this into the closet, so there's non-zero worry that something may have gotten injured in transit, or it's unhappy in its new home. On the flip side, our servers are rebooting themselves from time to time for no apparent reason except maybe high stress. I love all operating systems (this is sarcasm). Anyway, that meant mysql crashed ungracefully and has been recovering all day - however succesful this recovery is remains to be seen. It is just the replica, so no big shakes, really.

And this afternoon we ran out of work to send out. This was due to our science database getting "brain freeze" which is what I'm calling it these days. If you run the wrongly formatted query the whole engine silently grinds to a halt, effectively blocking all splitter and assimilator access. I found and killed the errant queries and the dam burst. So yet again we're recovering from an unexpected semi-outage this week.

Regarding the setisvn server (from last thread)... I'm fully aware of the poor configuration of that virtual domain. Low on my priority list.

- Matt


27 Jan 2009 22:40:49 UTC
Last night, due to the high traffic I was grousing about yesterday, the workunit storage filled and therefore no new work could be generated, so we ran out of stuff to send to clients. This cleared up on its own this morning, but then we started the regular weekly database maintenance outage, so we'll be in a bit of connectivity pain for a while.

During the outage I tested the stability of our secondary science database server (bambi). In other words: will it survive reboot without missing drives? It did. So that project is more or less done, and we'll start focusing on the primary science database server (thumper) next.

Even more exciting is that Jeff and I added a couple more servers to the closet today: sidious and casper. The latter is a multi-purpose machine used by the tangentially related CASPER project. The former is the replica mysql database. We were happy to finally get it out of our "test lab" and into the closet because it's big, noisy, and there's a chance its particular network hangups will be solved by moving it physically closer to its friends (all talking over one switch, as opposed to traversing at least three). We have only one major server left to move into the closet: vader. This is all good news but we're kind of maxed out on power usage in the closet, and need to do some breaker tests before adding anything else.

- Matt


26 Jan 2009 23:17:39 UTC
Due to various bugs on the scheduler/client side of things some users have been getting far too much work to do. This results in excess workunit downloads which eats up our bandwidth and makes it generally difficult for anything to happen, then queues start backing up, etc. The scheduler fix has already been employed late last week, a client bug-fix is in the works.

I have little to do with the above, and the problems should clear up on their own once traffic settles down. Today has been a catch-up-on-mundane-sys-admin tasks kind of day for me, which is fine once in a while.

- Matt


22 Jan 2009 23:34:01 UTC
We continue to have problems mounting our raw data drives (which we fill down at Arecibo and drain up here). The symptoms are random, the error messages are random, and where these messages actually appear is random. Jeff and I are pretty much giving up trying to figure it out. We'll most likely remove as many moving parts from the whole system and deal with continuing issues as they arise. Not sure who/what to blame. Linux? SATA? USB? The enclosures? The cables? The drives themselves?

I actually got the software radar blanker working. Whether or not the output it generates is worth anything remains to be seen, but at first glance it looks pretty good. The proof is when I run this on a whole file and make some workunits, and then see if these workunits explode.

- Matt


21 Jan 2009 22:18:50 UTC
The secondary science database finally recovered. As we poke and prod at this new configuration we're still finding things we might have done differently, but we're planning to just seal it up and call this project done. Actual gains in speed/performance are to be tested.

As many of you regular/avid readers know the last release of the cuda client got a little messed up - people were getting checksum errors meaning the files were corrupted. Bob did the code signing procedure this last time around from his desktop machine which has recently had problems with its memory DIMMs. This is our best, albeit vague and unsatisfying, theory as to why a small subset of files got corrupted when simply copying from one directory to another.

Continuing progress on radar blanking and the NTPCkr. Jeff and I are anxious to get these projects done already.

- Matt


20 Jan 2009 22:58:04 UTC
Welcome back from another long weekend - we had MLK Day off yesterday, and the whole country has been running a little late this morning. Things went mostly well in server land. The astropulse validator was (still) choking on various results so the backlog grew and thus the workunit storage filled up again for a minute there. That means the splitters halted, and we ran low on work to send out for half a day. Other than that, no major events.

Today we began the final stages of the secondary science database shuffle. We were a bit disappointed by the results at first, and did some more reconfiguration/testing before learning to not trust the output of iostat so much as the other evidence that shows we may have improved our peak science db throughput by 10x. Well maybe not so much - we'll see - if it's 2x I'll be psyched. More work tomorrow on that (the secondary is still catching up from being offline for 5+ days).

A followup on a recent story about our Overland Storage servers. I recently mentioned we hit an unexpected 4 TB file system limit on our workunit storage server (gowron). Turns out we actually hit a physical extent limit, and this will be fixed in the latest OS release. This is really just an academic point - we could only grow to 4.25 TB max anyway, given the number of drives. Thanks again to Overland for continued support.

- Matt


15 Jan 2009 21:45:17 UTC
This morning moved on to the next phase of the bambi RAID shuffle - destroying all current volumes and building a series of RAID1 mirrors in their wake. The initial sync will take until tomorrow. Sigh. We'll continue then.

Eric's server ewen (mostly used for studying interstellar hydrogen) crashed this morning. This should be a non-issue except due to various dependencies it hung some of our other servers. Upon restart it was having networking issues thanks to NetworkManager - something we try to uninstall on every system but apparently didn't on ewen. This is a piece of software that comes with linux distibutions which, as far as I can tell, exists strictly to create random network problems to keep your workday interesting. In better news, Bob's desktop is working again. The problem was actually a bad internal SATA cable. Or at least things are working since removing it.

The ap_validator is still offline, mostly. It restarts every 10 minutes, maybe gets a few results done, then segfaults. The astropulse people (not me) are working on it. I know nothing beyond that.

- Matt


15 Jan 2009 0:09:47 UTC
Today started the process of reconfiguring the underlying RAID devices on the secondary science database server (bambi). I was able to scrape together enough spare drives within the system to make temporary space so I could shuffle things around. Given the amount of data each shuffle takes a long long time. In fact, we're kinda stuck on this project until tomorrow. Anyway.. the database is sitting on three concatenated 6-drive RAID5's. Actually, given the way LVM is handling things it's mostly all on one 6-drive RAID5. Don't ask me why we set it up this way. The plan is to convert these 18 drives into a giant RAID10. More spindles, better striping, etc. and we can take the hit in usable storage.

Other than that, and messing around with Bob's desktop (which seems to have gotten a weird case of OS rot), I'm still elbow deep in programming. I hate C++ so very much but I admit the standard template library is helpful once you wrap your brain around it all.

- Matt


13 Jan 2009 22:58:50 UTC
Typical weekly outage (for database cleanup/backup). During so Jeff and I did some more server closet reconfiguration - we consolidated all the Overload Storage stuff (servers gowron and worf, and their combined 16 TB of raw storage) into one rack, along with our router (that connects us to our private ISP separate from campus). This gave us enough room to (finally) add another UPS to the fold - which is good as older ones have been complaining/dying. Our UPS situation is far from optimal, but we're working with what we got. We also (finally) got server clarke into the closet, which will act as a much-needed build/compute server, among other things.

Steady progress is being made on both NTPCkr and radar blanking fronts - in fact I should working on the latter. Tomorrow I may tackle the RAID re-configuration project on our secondary science server, which may vastly reduce i/o and therefore increase NTPCkr throughput.

- Matt


12 Jan 2009 23:58:40 UTC
A rather quiet weekend, though the astropulse validator seems to have gotten locked up on something. Josh and Eric and looking into that. This morning was a little weird. An old UPS we were using as a glorified power strip just up and stopped working, thus removing power to various sundry items in our secondary lab which wouldn't have been a big deal but one of those items was a switch, so sidious and vader (and casper for that matter) disappeared from the network for a short while there. Nobody seemed to notice. In the afternoon Jeff and I plotted some physical server moves for tomorrow's outage. We'll see how much we get done - and as always we take small steps with these big projects.

Various cuda-related items were discussed in our server meeting today. A bug that was causing the triplet overflows was found, and the blue screen of death issue with slower nVidia boards is getting a workaround. New client and application releases in the near future should clear some of this up.

Back to work - which means plotting lots of radar data for me.

- Matt


8 Jan 2009 22:26:13 UTC
I actually should be programming all day, but when I dive head first into such activity I have to take frequent breaks to let the CPU in my head cool off as I draw odd diagrams on the dry-erase board to solidify the logic and pseudo-code tumbling around my brain. During these moments of respite I may tend to more enjoyable things, like messing around with the raw data pipeline, or figuring out why, all of a sudden, we're not sending out any work.

The last thing was due to a problem we're seeing more and more around here. As we ramp up doing actual science where hitting the science database with one-off queries that somewhere contain the phrase "order by." This seems to give informix fits when it's busy. Apparently we need to free up, or create, more resources so the db engine has more scratch space to do sorting. Otherwise it jams up in a slow, quiet manner, and nobody notices until we observe side effects - like the traffic graph dropping to zero. So we're looking into that general problem now.

- Matt


7 Jan 2009 23:56:34 UTC
Now it's Wednesday, which usually means my focus should shift towards programming tasks. This actually hasn't happened in a while due to holiday schedules and other crises, but the radar blanking code really needs to be hammered into shape already. See the plans page for more info on that. Lots of mental paging-in of C++ programming trickery.

But this morning I was still busy with a bunch of things on my systems task list. Our informix replica server bambi was having fits with exporting/mounting so I had to go through the rigamarole of rebooting the system - which always seems to be the fastest way to fix things when things go awry. I also plugged away moving tons of data around our internal network for eventual filesystem rebuilding, tending to the raw data pipeline, etc. - the stuff I've been talking about for a while.

I've been using an old "Solaris 8" software box (coupled with the shell of a long-defunct external SCSI hard drive enclosure) as a stand for my desktop monitor, unaware how over the years the box has been slowly morphing out of square and sinking towards the left thus slanting the screen more and more. That might explain the crick in my neck I've had the past six months. This unergonomic situation was finally pointed out today by fellow SSL sysadmin Robert. Anywho, I now have the monitor sitting onto my shuttle enclosure, and even though it's perfectly level it seems it's slanting to the right. Talk about accommodation - my brain really got used to the old lean.

- Matt


7 Jan 2009 0:06:20 UTC
It's Tuesday, so that means database maintenance outage - the usual drill. We are recovering from that now. During the downtime I added more space to the workunit storage - actually reaching an unexpected 4 terabyte logical limit on that volume. This isn't a big deal, and we converted the two drives we can't use on this volume into extra spares which are always welcome. I also rolled up my sleeves and drew up a brand new power map of the closet which was until now sorely outof date. After we get Dan to measure the current draw directly at the breakers we can start safely adding machines to the closet.

Over the holiday break, at least since I last posted anything, there was only one real incident. Our scheduling server went kaput and required reboot. Dan and Eric actually took care of that as I was happily making a chunk of change playing a New Year's Eve gig at the time. The surprise outage had the benefit of reducing demand on our resources so we could finally drain our back-end queues, and we recovered nicely once everything was back up and running.

Jeff found the bug in the validator today that's been causing some confusion when comparing cuda vs. non-cuda processed workunits. He's working on the fallout/cleanup from all that while we're still trying to figure out why some cuda clients are overflowing on certain workunits.

By the way, welcome to 2009. I'm only now just getting back into the lab (was out of town between new year's day and yesterday). I have hopes of progress regarding UC Berkeley's SETI project in general.

- Matt


30 Dec 2008 23:16:29 UTC
Yep, we had our usual Tuesday outage. Nothing special, except that the result table is vastly bloated due to the back-end queues being clogged for one reason or another. So the "compression" part of our outage took an extra hour (roughly). So be it. Hopefully the wheels were greased enough to continue letting these drain without much intervention on my or Jeff's part. In any case except a slightly painful recovery as we continue to catch up. We're also pulling up a bunch more unanalyzed raw data to keep the splitters happy during the long weekend. Other than that today.. a lot of planning and preparing for various bigger projects to tackle once the holidays are over and we're all back in the lab - adding yet more workunit storage, reconfiguring database/raw data storage, adding more stuff to the closet, upgrading OSes, retiring older machines, bringing newer ones on line already. That's all well and good, except that Eric, Jeff, and I have three separate higher-priority tasks to tackle before anything else if possible. Those are (a) wrapping up all radar blanking efforts (we still get too many result overflows due to missed and therefore unblanked radar), (b) noise shaping (the noise we're injecting to reduce the effect of the radar is causing predictable and removable but nevertheless messy analysis artifacts), and (c) the NTPCker (the real-time candidate finder/reporter - so we might have something positive to mention come our 10th year anniversary in May).

That's it - the last tech news update (from me at least) for 2008. I'm already looking forward to 2009. Maybe we'll get some or all of the above done.

- Matt


29 Dec 2008 23:56:24 UTC
One short holiday week is behind us, now here comes another one.

We did fairly well over the weekend, considering we were pretty much maxed out the whole time. The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind.

In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine what processes needs what resource, and which are claiming too much. After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing eating up valuable memory/cpu. Anyway, long story short I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again).

We will have the usual outage drill tomorrow, followed by another set of "days off."

- Matt


23 Dec 2008 23:00:32 UTC
Today had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn't leave any secret lingering corruption. That's all happening now as I type this and the project is revving back up to speed.

Had a conference call with our Overland Storage connections to clean up a couple cosmetic issues with their new beta server. That's been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I did anything silly and blew away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam.

I also noticed this morning the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alert us accordingly. This process was mounting the upload directories over NFS and doing long directory listings, etc. which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we'll see if it has any positive effect.

Jeff's been hard at work on the NTPCker. It's actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. This may have been the problem we've been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can't find enough. Something to add to the list of "things to address in the new year."

- Matt


Technical News Archives: 2008 2007 2006 2005 2004

Copyright © 2009 University of California