Experiment for server operations check...

Message boards : News : Experiment for server operations check...

Claggy
Volunteer tester
Joined: 29 May 06
Posts: 905
Credit: 7,510,367
RAC: 3,738
Message 44323 - Posted: 14 Nov 2012, 19:42:06 UTC - in response to Message 44321.

O.K Thanks

Claggy

Claggy
Volunteer tester
Joined: 29 May 06
Posts: 905
Credit: 7,510,367
RAC: 3,738
Message 44332 - Posted: 16 Nov 2012, 18:15:45 UTC
Last modified: 16 Nov 2012, 18:41:53 UTC

New scheduler is on. Let me know if you have problems with anything.
Looks to be operating correctly now:

16/11/2012 18:07:42 SETI@home Beta Test [sched_op_debug] Starting scheduler request
16/11/2012 18:07:42 SETI@home Beta Test Sending scheduler request: To fetch work.
16/11/2012 18:07:42 SETI@home Beta Test Requesting new tasks for GPU
16/11/2012 18:07:42 SETI@home Beta Test [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
16/11/2012 18:07:42 SETI@home Beta Test [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
16/11/2012 18:07:42 SETI@home Beta Test [sched_op_debug] ATI GPU work request: 10562.03 seconds; 0.00 GPUs
16/11/2012 18:08:01 SETI@home Beta Test Scheduler request completed: got 6 new tasks
16/11/2012 18:08:01 SETI@home Beta Test [sched_op_debug] Server version 701
16/11/2012 18:08:01 SETI@home Beta Test Message from server: Resent lost task 05ap10al.19223.2117.140733193388043.14.245_1
16/11/2012 18:08:01 SETI@home Beta Test Message from server: Resent lost task 05ap10al.19223.2117.140733193388043.14.246_1
16/11/2012 18:08:01 SETI@home Beta Test Message from server: Resent lost task 05ap10al.19223.2117.140733193388043.14.247_1
16/11/2012 18:08:01 SETI@home Beta Test Message from server: Resent lost task 05ap10al.19223.2117.140733193388043.14.248_1
16/11/2012 18:08:01 SETI@home Beta Test Message from server: Resent lost task 05ap10al.19223.2117.140733193388043.14.249_1
16/11/2012 18:08:01 SETI@home Beta Test Message from server: Resent lost task 05ap10al.19223.2117.140733193388043.14.250_1
16/11/2012 18:08:01 SETI@home Beta Test Project requested delay of 7 seconds
16/11/2012 18:08:01 SETI@home Beta Test [sched_op_debug] estimated total CPU job duration: 0 seconds
16/11/2012 18:08:01 SETI@home Beta Test [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
16/11/2012 18:08:01 SETI@home Beta Test [sched_op_debug] estimated total ATI GPU job duration: 9322 seconds
16/11/2012 18:08:01 SETI@home Beta Test [sched_op_debug] Deferring communication for 7 sec
16/11/2012 18:08:01 SETI@home Beta Test [sched_op_debug] Reason: requested by project

and:

16/11/2012 18:26:21 SETI@home Beta Test [sched_op_debug] Starting scheduler request
16/11/2012 18:26:21 SETI@home Beta Test Sending scheduler request: To fetch work.
16/11/2012 18:26:21 SETI@home Beta Test Requesting new tasks for CPU and GPU
16/11/2012 18:26:21 SETI@home Beta Test [sched_op_debug] CPU work request: 54.41 seconds; 0.00 CPUs
16/11/2012 18:26:21 SETI@home Beta Test [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
16/11/2012 18:26:21 SETI@home Beta Test [sched_op_debug] ATI GPU work request: 5456.83 seconds; 0.00 GPUs
16/11/2012 18:26:37 SETI@home Beta Test Scheduler request completed: got 5 new tasks
16/11/2012 18:26:37 SETI@home Beta Test [sched_op_debug] Server version 701
16/11/2012 18:26:37 SETI@home Beta Test Message from server: Resent lost task 05ap10al.21593.15205.140733193388042.14.18_0
16/11/2012 18:26:37 SETI@home Beta Test Message from server: Resent lost task 05ap10al.21593.15205.140733193388042.14.19_0
16/11/2012 18:26:37 SETI@home Beta Test Message from server: Resent lost task 05ap10al.21593.15205.140733193388042.14.20_0
16/11/2012 18:26:37 SETI@home Beta Test Message from server: Resent lost task 05ap10al.21593.15205.140733193388042.14.21_0
16/11/2012 18:26:37 SETI@home Beta Test Message from server: Resent lost task 05ap10al.21593.15205.140733193388042.14.22_0
16/11/2012 18:26:37 SETI@home Beta Test Project requested delay of 7 seconds
16/11/2012 18:26:37 SETI@home Beta Test [sched_op_debug] estimated total CPU job duration: 9445 seconds
16/11/2012 18:26:37 SETI@home Beta Test [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
16/11/2012 18:26:37 SETI@home Beta Test [sched_op_debug] estimated total ATI GPU job duration: 6199 seconds
16/11/2012 18:26:37 SETI@home Beta Test [sched_op_debug] Deferring communication for 7 sec
16/11/2012 18:26:37 SETI@home Beta Test [sched_op_debug] Reason: requested by project

and normal requests work fine:

16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] Starting scheduler request
16/11/2012 18:34:31 SETI@home Beta Test Sending scheduler request: To fetch work.
16/11/2012 18:34:31 SETI@home Beta Test Requesting new tasks for CPU and GPU
16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] CPU work request: 269.28 seconds; 0.00 CPUs
16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] NVIDIA GPU work request: 14124.74 seconds; 0.00 GPUs
16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] ATI GPU work request: 20406.80 seconds; 0.00 GPUs
16/11/2012 18:36:53 SETI@home Beta Test Scheduler request completed: got 29 new tasks
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] Server version 701
16/11/2012 18:36:53 SETI@home Beta Test Project requested delay of 7 seconds
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] estimated total CPU job duration: 9467 seconds
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] estimated total NVIDIA GPU job duration: 12990 seconds
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] estimated total ATI GPU job duration: 18677 seconds
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] Deferring communication for 7 sec
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] Reason: requested by project
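For anyone wanting to compare what was requested with what the server estimated it sent, the [sched_op_debug] lines can be paired up mechanically. This is only a rough sketch: the regular expressions are written against the line format shown in the excerpts above, and the function name summarize is made up for illustration. It is not part of BOINC.

```python
import re

# Patterns written against the [sched_op_debug] lines quoted in this post.
REQUEST = re.compile(r"\[sched_op_debug\] (.+?) work request: ([\d.]+) seconds")
ESTIMATE = re.compile(r"\[sched_op_debug\] estimated total (.+?) job duration: (\d+) seconds")


def summarize(log_text):
    """Return {resource: (requested_seconds, estimated_seconds)} for one exchange."""
    requested, estimated = {}, {}
    for line in log_text.splitlines():
        m = REQUEST.search(line)
        if m:
            requested[m.group(1)] = float(m.group(2))
            continue
        m = ESTIMATE.search(line)
        if m:
            estimated[m.group(1)] = float(m.group(2))
    # Resources with no estimate line default to 0 seconds.
    return {res: (requested[res], estimated.get(res, 0.0)) for res in requested}


# Lines copied from the third exchange above.
SAMPLE = """\
16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] CPU work request: 269.28 seconds; 0.00 CPUs
16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] NVIDIA GPU work request: 14124.74 seconds; 0.00 GPUs
16/11/2012 18:34:31 SETI@home Beta Test [sched_op_debug] ATI GPU work request: 20406.80 seconds; 0.00 GPUs
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] estimated total CPU job duration: 9467 seconds
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] estimated total NVIDIA GPU job duration: 12990 seconds
16/11/2012 18:36:53 SETI@home Beta Test [sched_op_debug] estimated total ATI GPU job duration: 18677 seconds
"""

if __name__ == "__main__":
    for resource, (req, est) in summarize(SAMPLE).items():
        print(f"{resource}: requested {req:.2f} s, server estimated {est:.0f} s")
```

The sketch handles one request/reply pair at a time; a full client log would need to be split on the "Starting scheduler request" lines first.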

Note: this has all been done using a proxy to get round the scheduler timeout that both the Main and Beta projects are suffering from.

Claggy

Juha
Volunteer tester
Joined: 18 Jun 08
Posts: 41
Credit: 25,493
RAC: 17
Message 44345 - Posted: 19 Nov 2012, 23:59:55 UTC - in response to Message 43941.

Eric, check validator logic once again.
Such a state should never happen:
http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=4007211


That is weird. I'll have to go through the logs to see what happened.

I found the problem. The result file for the additional result (#5) doesn't exist. I don't know how that would happen.

I haven't read the server code carefully enough to be sure, but there might be a small window of opportunity for a late report just as the files are being scheduled for deletion. Or it might be something else.

Anyway, I made some changes that should take care of the stuck results for good.

The first part is the same as before. If the result is invalid or can't be opened, the Astropulse side of the validator lies to the BOINC side and tells it to retry the validation later. I fixed that by making sure a retry is signalled only when necessary, which is when your file server is not accessible.

Previously the code couldn't tell the difference between a missing file and a missing file server. I improved the ResultFile class to do a better diagnosis of the problem. The information is then carried in the ResultFileError exception to the rest of the code. In the case of a missing file server the BOINC side is told to retry later; otherwise a missing file will get the result marked with Validate Error and Invalid.
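To make the idea concrete, here is a minimal sketch in Python. It is not the actual validator code. Only the name ResultFileError comes from the description above; Cause, open_result_file, validate and the "retry" / "invalid" / "ok" labels are hypothetical, and an absent directory is assumed to stand in for an unreachable file server.

```python
import enum
import os


class Cause(enum.Enum):
    # Hypothetical diagnosis codes carried by the exception.
    FILE_MISSING = "result file missing"
    SERVER_UNAVAILABLE = "file server not accessible"


class ResultFileError(Exception):
    """Carries the diagnosed cause so the caller can decide what to do."""

    def __init__(self, cause, path):
        super().__init__(f"{cause.value}: {path}")
        self.cause = cause
        self.path = path


def open_result_file(path, server_root):
    """Open a result file, diagnosing why it cannot be opened.

    Assumption: a missing server_root directory means the file server
    is not reachable (for example, the mount is gone).
    """
    if not os.path.isdir(server_root):
        raise ResultFileError(Cause.SERVER_UNAVAILABLE, path)
    try:
        return open(path, "rb")
    except FileNotFoundError:
        raise ResultFileError(Cause.FILE_MISSING, path) from None


def validate(path, server_root):
    """Return 'retry', 'invalid' or 'ok' (hypothetical outcome labels)."""
    try:
        with open_result_file(path, server_root) as f:
            f.read(1)
    except ResultFileError as err:
        if err.cause is Cause.SERVER_UNAVAILABLE:
            return "retry"    # transient: ask the BOINC side to try again later
        return "invalid"      # permanent: mark Validate Error / Invalid
    return "ok"
```

The point of the design is that the retry decision is made in one place, from information the exception carries, instead of every failure being treated as "try again later".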

Also the log messages got a bit of a touch-up. Instead of logging that an error occurred, the code now tries to tell where and what kind of error occurred and what caused it in the first place.

And the rest of the changes. The exception handling in the code was a very fine example of how not to do exception handling. Basically it was just emulating the traditional way of returning an error code. Cleaning that up made the code easier to follow imho.
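As a generic illustration of that pattern (again a Python sketch, not the Astropulse code, and every name in it is made up), compare exceptions that are caught at each layer and turned back into status codes with exceptions that are allowed to propagate and are handled once:

```python
class ParseError(Exception):
    pass


def parse_count(text):
    """Hypothetical helper: raise ParseError on bad input."""
    try:
        return int(text)
    except ValueError as exc:
        raise ParseError(f"not an integer: {text!r}") from exc


# Anti-pattern: each layer catches the exception and converts it to a status code.
def load_count_status(text):
    try:
        value = parse_count(text)
    except ParseError:
        return -1, None
    return 0, value


def total_status(texts):
    total = 0
    for t in texts:
        status, value = load_count_status(t)
        if status != 0:
            return status, None  # the cause of the failure is lost here
        total += value
    return 0, total


# Cleaner: let the exception propagate and handle it once, where it matters.
def total(texts):
    return sum(parse_count(t) for t in texts)


def report(texts):
    try:
        return f"total={total(texts)}"
    except ParseError as err:
        return f"error: {err}"
```

In the second version the intermediate code has no error plumbing at all, and the message that reaches the top still says what went wrong.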

Combining all three changes into one made the patch a bit messy, but considering that all of them are to the same parts of the code I don't think separating the changes would have made the patch that much easier to follow. (Or I'm just too lazy to redo it.)

There are more changes than last time, so instead of inlining the patch I'm just going to give a link to the patch file(1). It was made with git-svn. I don't know if svn patch can handle it, but it does seem to be readable by patch. You very likely need to be in the astropulse directory to apply the patch.

Just in case there are some problems with the patch, I packaged the changed files. The end result should be the same whichever you choose to use. Link to package(2).


(1) That was a direct link. In case it doesn't work here's the patch via Google Drive UI.
(2) Same as (1). Package via UI.

Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 1628
Credit: 10,078,489
RAC: 7,716
Message 44348 - Posted: 20 Nov 2012, 15:31:34 UTC

Today I received a fresh pack of cuda22 tasks along with about the same number of cuda23 tasks.
cuda22 is almost twice as slow on this host. Does that mean this issue remains unfixed?

Richard Haselgrove
Volunteer tester
Joined: 3 Jan 07
Posts: 1115
Credit: 2,898,641
RAC: 2,588
Message 44350 - Posted: 20 Nov 2012, 16:09:38 UTC - in response to Message 44348.

Today I received a fresh pack of cuda22 tasks along with about the same number of cuda23 tasks.
cuda22 is almost twice as slow on this host. Does that mean this issue remains unfixed?

Different bug. Message 44303.



Copyright © 2014 University of California
AstroPulse is funded in part by the NSF through grant AST-0307956