CPU Computation errors
Author | Message |
---|---|
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
I am trying to run my AMD 2700 box at 3.7 GHz. Apparently the CPU is throwing errors every day. https://setiathome.berkeley.edu/results.php?hostid=8684146&offset=0&show_names=0&state=6&appid= I have upped the CPU voltage offset again today. Is there anything else I should be tinkering with? Tom A proud member of the OFA (Old Farts Association). |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
You could tell us what the error messages are :-) On a sample of three, all had "Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT" and "finish file present too long", and in one case "Restarted at 27.00 percent." three times. I think the poor CPU (and disk) is stressed out servicing all those GPUs. What you could do is to try building a new client from master, with #3019: 'When an app finishes, it writes a "finish file",'. Or you could wait for the official release of client version 7.16: that should start moving any day now, as soon as Keith Myers and I sign off on the 'work fetch with max_concurrent' bug. |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"You could tell us what the error messages are :-)" I couldn't seem to find the error message. Sorry. Since I am running TBar's All-in-One, unless the "client" resides outside of the GPU processing exe it's not going to be helpful. I can replace most of the executables in the BOINC folder. But replacing the GPU task means someone has to fix/recompile from the same source TBar was using. Tom A proud member of the OFA (Old Farts Association). |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
"I couldn't seem to find the error message. Sorry." Not to worry. It took a while to load a list that long, but they arrived eventually and I know where to look. "Since I am running TBar's All-in-One, unless the "client" resides outside of the GPU processing exe it's not going to be helpful. I can replace most of the executables in the BOINC folder. But replacing the GPU task means someone has to fix/recompile from the same source TBar was using." The 'client', in modern (post-2005) terminology, is the boinc or boinc.exe binary executable program. In older reference works, you'll sometimes see the SETI programs referred to as clients, but that dates from the old 'Classic' days when SETI did its own communications with the server closet, without the BOINC layer in the middle. TBar has chosen to manage, maintain and document his work himself: I did download his package (all 222 MB) a couple of weeks ago to see if he'd incorporated my improved workaround for the Manager scrolling bug, but the docs suggested he'd still taken out the whole patch. Anyway, that was Manager and this is Client, so it would make no difference. The boinc client program is listed there, as one of the five files you need to install: I suggest you offer to test a build with #3019 (remember, it's only in master, not a numbered branch) if he's prepared to build you one. |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
Richard, Thank you for clarifying what exactly "the client" refers to. If the change is in the "boinc.exe" I have less trouble with it. Tom A proud member of the OFA (Old Farts Association). |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
It also occurs to me that if I go ahead and reserve 1 CPU per GPU, then the CPU cores that are driving the GPUs would no longer be "struggling" to spend time on the CPU app tasks. The downside is it takes away 6 CPU threads from CPU processing. Maybe there is a compromise, like 0.33 CPUs per GPU. Until the upgrade comes out :) Tom A proud member of the OFA (Old Farts Association). |
rob smith Joined: 7 Mar 03 Posts: 22713 Credit: 416,307,556 RAC: 380 |
What you lose by having one core per GPU is more than made up for by the improved performance of the GPUs. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"What you lose by having one core per GPU is more than made up for by the improved performance of the GPUs." The GPUs weren't having a problem. It was the CPU threads, which is why I am mourning the (temporary, I hope) loss of them. At least that is what I think was going on. I haven't had any computation errors since yesterday, when I upped the CPU voltage. But given it also had something to do with writing a file to the HD, I figured I wanted at least some days of "no errors" before I try bringing back more CPU threads to a "shared" status. Tom A proud member of the OFA (Old Farts Association). |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
It wasn't really a 'computation error' - the actual processing ran to completion with no sign of errors. It was more of a 'housekeeping error', involving memory, disk, and the operating system. The sequence should be: 1. The SETI app writes a data file to disk for upload. 2. The SETI app writes a 'finished' file to disk. 3. The SETI app shuts down and removes itself from memory [at which point, another SETI app will probably start up and load itself from disk to memory]. 4. The BOINC client checks that both 2 and 3 have happened within 10 seconds. If it takes more than 10 seconds for that to happen, then the OS is thrashing. It might be memory paging as David suggests in his comment (in which case, more RAM might help). Or it might be the OS being too busy starting the next task (in which case, fewer concurrent operations requiring disk access might help). Or perhaps an SSD data disk. |
Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
Beware of HIP WUs; I've got some suspicious pulses with this sort of WU: blc22_2bit_guppi_58406_01923_HIP116971_0034.25185.0.21.44.103.vlar. Take a look at this one: http://setiathome.berkeley.edu/workunit.php?wuid=3421361418 The WU restarted several times on my CPU, and likewise for my wingman ... |
Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
Messages I got: "[SETI@home] task postponed 300.000000 sec: Impossible Autocorr power, retrying from checkpoint." Other ones, after stopping the computer and restarting from cold: "Task postponed: Suspicious pulse results, host needs reboot or maintenance" |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
Looks like I am now having GPU computation errors. Going to switch to a 1-to-1 CPU-to-GPU ratio. Tom A proud member of the OFA (Old Farts Association). |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I'm glad I visited this thread. I wasn't aware, Richard, of the #3019 patch for the "finish file present too long" bug. I am plagued by that on a lot of my hosts because they are too busy to service the task in the short time frame. Very happy to see the timeout increased. Typically my only errors are of this type. I am not running the kind of GPU count that Tom M. is running. Most of my hosts only run 3 GPUs, with a couple running 4 GPUs. But I get the finish file error on even the 3-card hosts. Maybe ... one a week. @Richard, I wasn't aware you were waiting on me for the work_fetch workaround. I haven't tried out the latest dpa_work_fetch_mc branch lately. I never got any feedback on why the simulator again refuses to work for me. It takes my uploaded files without complaint or error but a scenario never shows up. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"Looks like I am now having GPU computation errors. Going to switch to a 1-to-1 CPU-to-GPU ratio." I haven't had any GPU computation errors (or invalid results) since I switched to 1 CPU to 1 GPU in the app_config.xml file. Tom A proud member of the OFA (Old Farts Association). |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"Looks like I am now having GPU computation errors. Going to switch to a 1-to-1 CPU-to-GPU ratio." Looks like I am getting CPU "invalid results" though. Tom A proud member of the OFA (Old Farts Association). |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
"@Richard, I wasn't aware you were waiting on me for the work_fetch workaround. I haven't tried out the latest dpa_work_fetch_mc branch lately. I never got any feedback on why the simulator again refuses to work for me. It takes my uploaded files without complaint or error but a scenario never shows up." No point just at the moment. I found a second bug after the last outage here: David ignored it for a week and then asked for a scenario at midnight Tuesday. He got it Wednesday morning, and for once the simulator actually showed the bug in action. Ball's back in his court. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13903 Credit: 208,696,464 RAC: 304 |
"Typically my only errors are of this type. I am not running the kind of GPU count that Tom M. is running. Most of my hosts only run 3 GPUs, with a couple running 4 GPUs. But I get the finish file error on even the 3-card hosts. Maybe ... one a week." I would also get the occasional "Finish file present too long" when restarting my systems after updates had been installed and were waiting on a restart. I've just gotten into the habit of exiting BOINC at least 30 seconds before restarting the system. Grant Darwin NT |
Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
"It wasn't really a 'computation error' - the actual processing ran to completion with no sign of errors." Interesting... I wonder if that will also solve some of the issues some people, including myself, have with Rosetta. I have seen a few tasks error out exactly when they were about to finish the work and exit. |
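
The completion sequence Richard Haselgrove describes earlier in the thread (the app writes its finish file, the app exits, and the client expects both within 10 seconds) can be sketched roughly as below. This is only an illustration of the logic as described in the thread, not the BOINC client's actual code: `TaskState`, `check_finish` and the field names are hypothetical, and the only values taken from the thread are the 10-second window and exit status 194 (0xC2, EXIT_ABORTED_BY_CLIENT).

```python
from dataclasses import dataclass
from typing import Optional

# Values quoted in the thread; everything else here is a hypothetical sketch.
FINISH_FILE_TIMEOUT_S = 10.0      # window Richard mentions in his explanation
EXIT_ABORTED_BY_CLIENT = 194      # 0xC2, from the quoted task output


@dataclass
class TaskState:
    # Time (seconds) at which the client first saw the finish file, or None.
    finish_file_seen_at: Optional[float]
    # True while the science app's process is still in memory.
    process_running: bool


def check_finish(task: TaskState, now: float,
                 timeout: float = FINISH_FILE_TIMEOUT_S) -> Optional[int]:
    """Return an exit status if the task should be aborted, else None."""
    if task.finish_file_seen_at is None:
        return None  # step 2 has not happened yet: app is still computing
    if not task.process_running:
        return None  # steps 2 and 3 both done: a clean finish
    if now - task.finish_file_seen_at > timeout:
        # Finish file exists but the app has not exited in time:
        # the "finish file present too long" case.
        return EXIT_ABORTED_BY_CLIENT
    return None  # still inside the grace window, keep waiting


# A busy host: finish file seen at t=85 s, process still alive at t=100 s.
status = check_finish(TaskState(finish_file_seen_at=85.0, process_running=True), now=100.0)
print(status)
```

Passing a larger `timeout` models the effect of a longer grace period, which is the kind of change the thread attributes to #3019; the new value is not stated in the thread, so none is assumed here.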
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.