Occasional WUs stall

Message boards : SETI@home Enhanced : Occasional WUs stall

CElliott
Volunteer tester

Joined: 16 Aug 05
Posts: 79
Credit: 71,936,490
RAC: 0
United States
Message 44315 - Posted: 11 Nov 2012, 22:57:52 UTC

Occasionally some WUs stall; sending 'quit' to the offending BOINC client restarts the WU. Unfortunately, this happens on only one computer (ID: 58772, the only one I have with AVX) and only on WUs processed on the CPU. Here are the messages:

Nov 11, 2012 2:02:06 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.12456.3753.140733193388039.14.158 is not advancing on 192.168.1.81
Prev: 0.320191, Present: 0.320191 Fraction done in 900.703605 seconds.
Nov 11, 2012 2:02:27 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 11, 2012 5:39:45 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.8909.16841.140733193388038.14.111 is not advancing on 192.168.1.81
Prev: 0.339201, Present: 0.339201 Fraction done in 900.675553 seconds.
Nov 11, 2012 5:40:05 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 10, 2012 5:21:15 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.31522.1708.140733193388038.14.39 is not advancing on 192.168.1.81
Prev: 0.395996, Present: 0.395996 Fraction done in 965.567264 seconds.
Nov 10, 2012 5:21:36 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 10, 2012 10:53:03 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.5819.11115.140733193388037.14.231 is not advancing on 192.168.1.81
Prev: 0.405652, Present: 0.405652 Fraction done in 916.725705 seconds.
Nov 10, 2012 10:53:23 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 10, 2012 11:08:40 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.8909.16432.6.14.10 is not advancing on 192.168.1.81
Prev: 0.215151, Present: 0.210665 Fraction done in 936.702477 seconds.
Nov 10, 2012 11:08:40 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 11, 2012 6:30:58 AM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU ap_22my12ad_B5_P0_00179_20121105_13391.wu is not advancing on 192.168.1.81
Prev: 0.225225, Present: 0.225225 Fraction done in 901.340058 seconds.
Nov 11, 2012 6:31:19 AM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81
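
For anyone reading along: the messages above appear to come from a watchdog that polls each task's fraction done and sends 'quit' when the value has not increased after roughly 900 seconds. The tool that produced them is a Java desktop application whose source is not shown in this thread, so the following is only a minimal Python sketch of that kind of check. The class name StallDetector, its update method, and the 900-second default are assumptions made for illustration; fetching the fraction done from the client and sending the quit command are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StallDetector:
    """Flags tasks whose fraction done has not increased for min_interval seconds."""

    # Assumed threshold, chosen to match the ~900 s gaps seen in the log.
    min_interval: float = 900.0
    # Task name -> (time of last recorded value, fraction done at that time).
    _last: Dict[str, Tuple[float, float]] = field(
        default_factory=dict, init=False, repr=False
    )

    def update(self, now: float, fractions: Dict[str, float]) -> List[str]:
        """Record the latest fractions and return the names of stalled tasks."""
        stalled = []
        for name, present in fractions.items():
            seen = self._last.get(name)
            if seen is None:
                # First sighting of this task: just remember it.
                self._last[name] = (now, present)
                continue
            then, prev = seen
            if present > prev:
                # Progress was made, so restart the clock for this task.
                self._last[name] = (now, present)
            elif now - then >= self.min_interval:
                # No increase (or a decrease) for at least min_interval seconds.
                stalled.append(name)
                self._last[name] = (now, present)
        # Forget tasks that are no longer reported as running.
        for name in list(self._last):
            if name not in fractions:
                del self._last[name]
        return stalled


detector = StallDetector()
detector.update(0.0, {"wu_a": 0.320191, "wu_b": 0.10})
# wu_a is unchanged after 900.7 s, wu_b has advanced.
print(detector.update(900.7, {"wu_a": 0.320191, "wu_b": 0.25}))  # ['wu_a']
```

A caller would invoke update on a timer with whatever fraction-done values it reads from the client, and send 'quit' to the host for any task name returned. Treating a decrease the same as no change matches the one log entry above where Present is lower than Prev.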




Josef W. Segur
Volunteer tester

Joined: 14 Oct 05
Posts: 1137
Credit: 1,848,733
RAC: 0
United States
Message 44316 - Posted: 13 Nov 2012, 2:34:45 UTC - in response to Message 44315.  

Occasionally some WUs stall; sending 'quit' to the offending BOINC client restarts the WU. Unfortunately, this happens on only one computer (ID: 58772, the only one I have with AVX) and only on WUs processed on the CPU. Here are the messages:

Nov 11, 2012 2:02:06 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.12456.3753.140733193388039.14.158 is not advancing on 192.168.1.81
Prev: 0.320191, Present: 0.320191 Fraction done in 900.703605 seconds.
Nov 11, 2012 2:02:27 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 11, 2012 5:39:45 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.8909.16841.140733193388038.14.111 is not advancing on 192.168.1.81
Prev: 0.339201, Present: 0.339201 Fraction done in 900.675553 seconds.
Nov 11, 2012 5:40:05 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 10, 2012 5:21:15 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.31522.1708.140733193388038.14.39 is not advancing on 192.168.1.81
Prev: 0.395996, Present: 0.395996 Fraction done in 965.567264 seconds.
Nov 10, 2012 5:21:36 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 10, 2012 10:53:03 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.5819.11115.140733193388037.14.231 is not advancing on 192.168.1.81
Prev: 0.405652, Present: 0.405652 Fraction done in 916.725705 seconds.
Nov 10, 2012 10:53:23 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 10, 2012 11:08:40 PM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU 05ap10al.8909.16432.6.14.10 is not advancing on 192.168.1.81
Prev: 0.215151, Present: 0.210665 Fraction done in 936.702477 seconds.
Nov 10, 2012 11:08:40 PM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

Nov 11, 2012 6:30:58 AM desktopapplication1.DesktopApplication1$MyThread run
SEVERE: WU ap_22my12ad_B5_P0_00179_20121105_13391.wu is not advancing on 192.168.1.81
Prev: 0.225225, Present: 0.225225 Fraction done in 901.340058 seconds.
Nov 11, 2012 6:31:19 AM desktopapplication1.DesktopApplication1$MyThread sendQuit
SEVERE: Sent quit to 192.168.1.81

I've turned the WU names into links to your task details, for the convenience of anyone else who wants to see the outcome, etc. So far, I haven't been able to spot anything unusual about the chosen functions or anything else in the stderr information. Whatever the problem is, it differs from the long-standing one where a few hosts sometimes hang during function testing.

The last one you listed is of course an AP v6 task done on the GPU. All that proves is that nothing is ever as near perfect as we hope.

I'm not familiar with whatever you're using as a watchdog timer. It's certainly a good idea; can you clarify what it is? I tried to get Dr. Anderson to implement an option in the BOINC API which would have served that function, but he didn't see the need.
Joe



 
©2021 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.