<?xml version="1.0" encoding="ISO-8859-1" ?>
    <rss version="2.0">
    <channel>
    <title>SETI@home</title>
    <link>http://setiathome.berkeley.edu/</link>
    <description>BOINC project SETI@home: Technical News</description>
    <copyright>University of California</copyright>
    <lastBuildDate>Thu, 26 Nov 2009 20:00:09 GMT</lastBuildDate>
    <language>en-us</language>
    <image>
        <url>http://setiathome.berkeley.edu/rss_image.gif</url>
        <title>SETI@home</title>
        <link>http://setiathome.berkeley.edu/</link>
    </image>
        <item>
            <title>Technical News 26 Nov 2009 17:07:21 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#142</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#142</guid>
            <description>Oh well, we tried. We thought we would just have to put in some extra minutes monitoring the data pipeline over the weekend (after wasting a lot of time bringing up many broken files), which wouldn't have been too bad, but...

Then bambi crashed last night - it's our secondary science database server but also manages a lot of the data pipeline stuff. I happened to be free so I drove up to the lab around 10:30pm and rebooted it. After that, the pipeline zipped right along.

That is... until 11pm when the router up and died. Or something along the entire Hurricane Electric network path died. We have no idea. Jeff and I fought with it (both remotely) this morning, but we're throwing up our hands at this point and going on holiday.

Might as well have everything fail at once, and at the start of a long holiday weekend. Why not?

- Matt</description>
            <pubDate>Thu, 26 Nov 2009 17:07:21 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 25 Nov 2009 23:34:12 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#141</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#141</guid>
            <description>Okay then. The mysql commit behavior we were testing was an absolute failure - though for expected reasons (not enough disk i/o, even with the solid state drives). It was worth a shot, but we fell back to the old commit behavior for now.

However, this caused a lot of backend processes to clog up including the transitioners, which ultimately meant the splitters burned through all kinds of raw data files before they realized we had more than enough work on disk. This could have been bad, i.e. filled up our workunit storage server, but luckily it didn't even come close to doing that.

Anyway, we reverted this morning and all the dams broke for a while... until we ran out of work to send out. Turns out the last 10 files I brought up from Arecibo are all broken. &lt;sad trombone&gt;Fwa wa wa waaaaa&lt;/sad trombone&gt;. This is particularly frustrating as I was busting my hump trying to get enough work on line before the long holiday weekend, and now we have zero. So it'll be up to me and Jeff to check in over the next few days and kick the pipeline along. We'll be out of real work to send out until this evening at the earliest, and quite probably hit long periods of no work throughout the weekend. Fine.

In better news, we did the last bits to get the Astropulse signal table fully copied over to another database fragment - only losing a few rows here and there (as opposed to many thousands as originally thought). Work will resume on Monday to swap the old and new fragments, and hopefully the science database will be much happier.

That's it for now.

- Matt
</description>
            <pubDate>Wed, 25 Nov 2009 23:34:12 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 24 Nov 2009 22:46:11 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#140</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#140</guid>
            <description>At the end of the day yesterday our raw data file server lost a drive. The bottom line as far as you're concerned is that we had to stop the creation of workunits until we got on top of the RAID resync issues this morning. But by then we were into our normal weekly outage, so you've been unable to get any work for a while, and will continue to not be able to do so until I start splitting up again - probably later this evening.

Meanwhile, every other part of the project is coming back online. We're testing the new mysql commit behavior (mentioned in yesterday's post). It's not looking good right out of the gate, but that may be due to mysql needing to read everything back into memory again after a bounce to pick up the configuration change. I may have to bounce it again if it continues to be a problem. I hope not, but it's no big deal either way.

Looks like Bob got most, if not all, of the corrupt astropulse table finally copied over to another table so we can drop/recreate the data and get rid of this corruption (which has been causing us random headaches over the past month or two). I just ran some preliminary tests on the data integrity. Looks good.

- Matt
</description>
            <pubDate>Tue, 24 Nov 2009 22:46:11 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 23 Nov 2009 22:46:08 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#139</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#139</guid>
            <description>How about that? We made it through the weekend without a server crash! We haven't done much to improve the situation, so maybe we're just getting lucky (or maybe we've just been unlucky). Anyway, we've been happily shovelling data through the pipeline and collecting results.

However, we're still working on getting the corruption out of the science database. Every step takes a long time (days), as we're playing a large shell game with a database table that is reaching 100GB in size. That doesn't sound like much in some regards, but this is all being done on a row-by-row basis, plus we have to ensure data integrity at each step, etc. It's slow.

Back to the mysql database for a second - one thing we'll try tomorrow is moving mysql to commit-on-every-transaction behavior. Normally now it commits either once a second, or when the buffer is full. We tried this before and it was a major failure - the disk array on jocelyn couldn't handle it. But now we're on mork, where the logs are on solid state drives. Worth a shot. Normally we're processing hundreds of queries per second - so this new behavior will prevent up to hundreds of queries from disappearing during a crash, not to mention keep the replica in sync as well so we don't have to go through the painful exercise of recreating it every time the master goes nuts.

Still... I admit I'm feeling fairly certain that we won't be able to stay this way very long and will have to revert to our current behavior. It'll be fun to try, though. This may make the recovery after the outage more painful than usual.

It's also rapidly approaching beg-for-donations season. A mass e-mail probably won't happen for a couple weeks (given everybody's holiday schedules). Once again it's up to me to figure out how to squeak out a large pile of e-mails before we're (wrongly) spam blocked - a mystical art.

Also, for our non-U.S. folks, this upcoming Thursday is our Thanksgiving holiday, so please forgive the short work week in advance.

- Matt
</description>
            <pubDate>Mon, 23 Nov 2009 22:46:08 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 17 Nov 2009 22:48:26 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#138</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#138</guid>
            <description>Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.

We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.

I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.

Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.

And there was recent mention of SETI@home perhaps suffering from &quot;feature/scope creep.&quot; I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home &quot;take a break&quot; to devote all our efforts towards the science part for a while, but I admit there are both pros and cons to going this route. I'm currently outvoted on this front, so we stick with the status quo.

- Matt
</description>
            <pubDate>Tue, 17 Nov 2009 22:48:26 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 13 Nov 2009 0:01:13 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#137</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#137</guid>
            <description>Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that on line before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can't just delete it using sql - we have to dump the entire database fragment (containing 25% of all the ap signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them.

- Matt</description>
            <pubDate>Fri, 13 Nov 2009 00:01:13 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 10 Nov 2009 22:58:34 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#136</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#136</guid>
            <description>Today's Tuesday - that means we had our normal weekly maintenance outage, and we're recovering from that now. Outside of the normal database compression, backup, and log rotation type tasks we also took care of the following:

1. Replaced the faulty drive on thumper (the primary science database server). This system is on Sun Service so such hardware failures are trivial. A drive fails, we call Sun, they send us a new drive right away, we plop it in, we send back the old drive, done. However there are still nagging problems on thumper at the OS/database level that still require our attention (a corrupt row in the Astropulse signal database and that funky root/RAID configuration that can only be fixed during a clean OS install).

2. Upgraded mysql on both the master and replica servers (mork and jocelyn) to version 5.1.37. This was finally made available in the Fedora distros and from what I've been told may fix those unload/reload formatting bugs. While we were at it, we yum'ed up pretty much everything.

3. Rebooted mork and ptolemy to pick up crash-dump parameters for the kernels. We were going to install debug versions of the kernels but Jeff was having odd results with that while testing one on his desktop, so we're holding off for now. Rebooted jocelyn to pick up a new kernel as well.

That's about it for the outage. Recovery will continue for a while. I'm rebuilding the replica mysql database right now using the dump from today. When that's finished we'll start up the replica (maybe tomorrow morning).

Speaking of tomorrow morning, it's a holiday (Veterans Day), so I won't be up at the lab (probably just doing the usual &quot;check in from home every few hours and tweak this and that&quot;).

- Matt
</description>
            <pubDate>Tue, 10 Nov 2009 22:58:34 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 10 Nov 2009 0:24:48 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#135</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#135</guid>
            <description>Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a &quot;fluke&quot; - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of &quot;grave concern&quot; about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup, and Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've had longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that may hopefully give us clues if mork decides to disappear again.

- Matt
</description>
            <pubDate>Tue, 10 Nov 2009 00:24:48 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 5 Nov 2009 22:53:58 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#134</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#134</guid>
            <description>Eeeeoooo. Looks like this minor corruption in the science database is really snagging us, at least right now. We're talking one or two rows of the zillions in the astropulse signal table - but informix isn't being very informative about which row or two, nor what to do about it. Meanwhile, this broke the replication of astropulse - or at least we think it broke replication. This may very well have failed for some other reason.

This hasn't been a public data flow issue - we can still split/assimilate multibeam and astropulse work for the most part. Still, it's been preventing us from doing any science for a while now. So it's roll-up-our-sleeves time. We're doing a more robust table check (and hopefully repair) overnight tonight, and had to shut off astropulse splitting for now. Which means only multibeam workunits for the near term.

Meanwhile we filled up the raw data drive during all this software blanking analysis. I forgot to carry the one or something. Anyway, no big deal, some minor cleanup this morning, and we're back on track with that.

- Matt
</description>
            <pubDate>Thu, 05 Nov 2009 22:53:58 GMT</pubDate>
            </item>
        
    </channel>
    </rss>
