Lost "Ghost" task recovery protocol
Author | Message |
---|---|
Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
|
Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I thought it would be a good time to test, since I have some to recover as well. Even though the server is currently 'broken', it is handing out lost tasks. |
Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
|
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
You have to have room in your cache for the resends. So you need to set NNT (No New Tasks) and keep reporting finished work until you fall below your normal GPU cache task allotment by 20 tasks. That way you have room for the resends.

You also have to make sure you do not get a completed scheduler request acknowledgement before you stop network activity. That involves watching the Event Log closely for the first sign of the scheduler request and quickly clicking the Suspend Network Activity selection in the Manager.

Sun 05 May 2019 12:48:41 PM PDT | SETI@home | [sched_op] Starting scheduler request
Sun 05 May 2019 12:48:41 PM PDT | SETI@home | Sending scheduler request: To fetch work.

When you see "Sending scheduler request: To fetch work", click the Suspend Network Activity selection in the Activity menu with the mouse. If you see

Sun 05 May 2019 12:48:50 PM PDT | SETI@home | Scheduler request completed: got 64 new tasks

you have missed the timing on stopping network activity. All you can do is wait out the next 5 minute scheduler connection and try again.

The most resends the scheduler can send out at any one time is 20 tasks. So if you have many ghosts, you might have to spend an hour running through the protocol to clear them.

Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
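The timing rule above comes down to telling three kinds of Event Log line apart. The following is only a minimal sketch of that decision, assuming the log lines keep the "timestamp | project | message" layout quoted in the post. The function name `classify` and its return labels are made up for illustration and are not part of BOINC.

```python
import re

# Layout assumed from the quoted Event Log lines: "<timestamp> | <project> | <message>".
LOG_LINE = re.compile(r"^(?P<stamp>.+?) \| (?P<project>.+?) \| (?P<message>.+)$")


def classify(line):
    """Return 'suspend_now', 'missed', or 'ignore' for one Event Log line.

    The labels are hypothetical names used only in this sketch.
    """
    match = LOG_LINE.match(line.strip())
    if match is None:
        return "ignore"
    message = match.group("message")
    if message.startswith("Sending scheduler request: To fetch work"):
        # This is the moment to click Suspend Network Activity.
        return "suspend_now"
    if message.startswith("Scheduler request completed"):
        # The reply already arrived, so this attempt was too slow.
        return "missed"
    return "ignore"


lines = [
    "Sun 05 May 2019 12:48:41 PM PDT | SETI@home | [sched_op] Starting scheduler request",
    "Sun 05 May 2019 12:48:41 PM PDT | SETI@home | Sending scheduler request: To fetch work.",
    "Sun 05 May 2019 12:48:50 PM PDT | SETI@home | Scheduler request completed: got 64 new tasks",
]
results = [classify(line) for line in lines]
print(results)
```

The sketch only labels lines. It does not suspend networking itself, which in the procedure described here is still done by hand in the Manager.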
Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
|
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
OK, thanks for the commentary. I see where you are confused. I will rewrite the procedure for better comprehension. It is rather easy to perform, and for those of us who have been doing it for years, it is nothing but muscle memory. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
I finally understand that I have a humongous number of ghost tasks. Can I run the work cache down significantly and try to get hundreds of resends at once? Tom A proud member of the OFA (Old Farts Association). |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, there is a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
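To put rough numbers on the 20-per-request limit, here is a small back-of-the-envelope sketch. It assumes one resend batch per scheduler request and about 5 minutes between requests (the interval mentioned earlier in the thread). The function names are illustrative only. The result is a best case: it assumes every attempt is timed correctly and every ghost is still in the database.

```python
import math

RESENDS_PER_REQUEST = 20   # limit stated in the thread
REQUEST_INTERVAL_S = 300   # assumption: roughly 5 minutes between scheduler requests


def rounds_needed(ghosts):
    """Number of successful resend rounds needed to cover `ghosts` tasks."""
    return math.ceil(ghosts / RESENDS_PER_REQUEST)


def minimum_minutes(ghosts):
    """Best-case time in minutes if every round succeeds on the first try."""
    return rounds_needed(ghosts) * REQUEST_INTERVAL_S / 60


for ghosts in (140, 1000):
    print(ghosts, rounds_needed(ghosts), minimum_minutes(ghosts))
```

Missed attempts each cost another scheduler interval, and the cache also has to be run down between rounds, so the real time is longer than this estimate.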
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them." That explains the "20" in the directions. I will be spending more than an hour a day at this because I don't like screwing things up this way. Maybe I can nail this in the next week or so. Tom A proud member of the OFA (Old Farts Association). |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them." So far my number of outstanding tasks is increasing. I am wondering if I have the reflexes to do this. I suppose I could reset the project and stop trying so hard. Or just stop trying so hard :( Tom A proud member of the OFA (Old Farts Association). |
Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
The recovery process shouldn't cause more ghosts. If you aren't fast enough, then you will get more WUs and then have to wait to have the free space to try again. I got very frustrated trying to do this. I had to walk away and do other things in between tries. The system will recover eventually even if you do nothing, so don't feel guilty. Even if you manage to partially recover some, that will help, so it isn't an all or nothing scenario. Take breaks and don't let it get to you. Any idea what caused the problem in the first place? Was it issues on your end, or the server issues that we had? Good Luck |
rob smith Joined: 7 Mar 03 Posts: 22656 Credit: 416,307,556 RAC: 380 |
In the situation where you have over a thousand ghosts, you really have to make a call. Do you drive yourself insane going through a process that you are struggling to do, or do you just accept that those tasks will time out and be run by someone else? Either way round, you should try to work out how you've managed to accrue so many in the first place. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"In the situation where you have over a thousand ghosts then you really have to make a call. Do you drive yourself insane going through a process that you are struggling to do, or do you just accept that those tasks will timeout and be run by someone else. Either way round you should try and work out how you've managed to accrue so many in the first place." I don't have a clue how I accrued so many in the first place. I have been using the same procedures on two different multi-GPU systems, and one has ghosts and one doesn't. I am going to change to a standard setup on the machine that is having issues, so the worst that can happen is GPUs x 100 + 100, which will be a bunch smaller! Tom A proud member of the OFA (Old Farts Association). |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Actually, you should be able to set it up so that, instead of sending back 20 tasks, it will 'Expire' all your 'Lost tasks' in one move. In your Preferences, where it lists SETI@home v8: yes, change it to No. The resend sends tasks according to your Preferences, so if SETI@home is No, it won't send any when triggered. Set it here: https://setiathome.berkeley.edu/prefs.php?subset=project Personally, I trigger the resend by waiting until there is a task to report, copying client_state.xml to another directory, hitting Update to report the task, and then stopping BOINC. I copy the old client_state.xml back to the BOINC directory, add 1 to the <rpc_seqno></rpc_seqno> number, and then start BOINC. I usually remove all the Active tasks from the old client_state.xml when changing the <rpc_seqno>, but I don't think it really matters as long as you have it set to Not checkpoint. |
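The "add 1 to the <rpc_seqno></rpc_seqno> number" step can be sketched as a plain text substitution. This is only an illustration of that one step, not a tested recovery tool. The helper name `bump_rpc_seqno` and the sample fragment are made up, and the sketch changes only the first <rpc_seqno> it finds, so on a host attached to several projects the SETI@home entry would have to be located first. It operates on a string; reading and writing the actual file, with BOINC stopped and a backup kept, is left to the reader.

```python
import re

RPC_SEQNO = re.compile(r"(<rpc_seqno>)(\d+)(</rpc_seqno>)")


def bump_rpc_seqno(state_text, increment=1):
    """Return `state_text` with the first <rpc_seqno> value raised by `increment`."""

    def replace(match):
        new_value = int(match.group(2)) + increment
        return f"{match.group(1)}{new_value}{match.group(3)}"

    new_text, count = RPC_SEQNO.subn(replace, state_text, count=1)
    if count == 0:
        raise ValueError("no <rpc_seqno> element found")
    return new_text


# Hypothetical, heavily simplified fragment used only to exercise the function.
sample = "<project>\n    <rpc_seqno>1234</rpc_seqno>\n</project>\n"
bumped = bump_rpc_seqno(sample)
print(bumped)
```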
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
"In the situation where you have over a thousand ghosts then you really have to make a call. Do you drive yourself insane going through a process that you are struggling to do, or do you just accept that those tasks will timeout and be run by someone else. Either way round you should try and work out how you've managed to accrue so many in the first place." Jumping up and down and screaming..... I did it, I did it, I did it. Once :) A proud member of the OFA (Old Farts Association). |
Joined: 15 May 99 Posts: 3827 Credit: 1,114,826,392 RAC: 3,319 |
@Keith: Thank you very much for this easy-to-follow process... it was worth the rewrites! Because I was checking all my caches due to the "shortie storm" yesterday... oops, this machine had 540 in progress with 3xGPUs unspoofed; it should have been max. 400, so there were at least 140 ghosts. I expected there might be some, as there is a failing GTX980 which sometimes overheats and can lock the system up, but not that many. As it turned out, it was worse: I kept performing the process even at 400, and I think there were possibly up to 100 more of them. The only issue I had is that sometimes the servers would be too fast: I would suspend networking right after the second "[sched_op]", not get a "Scheduler request completed", and tasks to report would still show, but when I restarted the client I'd still get "Not sending work - last request too recent" from the scheduler. The simple workaround for this was just to exit BOINC for a minimum of 303 seconds, and after this resends would occur as expected. |
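The "540 in progress, max. 400, so at least 140 ghosts" reasoning can be written out as a short sketch. It assumes the limit implied by the figures in this thread, 100 tasks per GPU plus 100 for the CPU, and the function names are illustrative only. The result is a lower bound, because the host may actually be holding fewer tasks than the limit allows, which matches the "possibly up to 100 more" seen here.

```python
TASKS_PER_GPU = 100  # assumption: per-GPU limit implied by the figures in the thread
CPU_TASKS = 100      # assumption: CPU allowance implied by the same figures


def task_limit(gpu_count):
    """Maximum tasks a host should be able to hold under the assumed limits."""
    return gpu_count * TASKS_PER_GPU + CPU_TASKS


def minimum_ghosts(in_progress_on_server, gpu_count):
    """Lower bound on ghosts: anything the server lists beyond the limit."""
    return max(0, in_progress_on_server - task_limit(gpu_count))


print(task_limit(3))           # 400
print(minimum_ghosts(540, 3))  # 140
```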
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
As soon as I see "Sending scheduler request", I slam the mouse button on Suspend Network Activity. I give the host about 20 seconds after exiting before firing it back up, just to make sure I have let 305 seconds elapse since the last scheduler request. If you wait the 305 seconds after shutting down, you can guarantee you won't be asking too soon. Glad you were able to remove your ghosts. I had noticed them on your hosts when I looked for your 1M RAC milestone. Congratz. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
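The 305-second rule amounts to simple timestamp arithmetic. The sketch below assumes the Event Log timestamp layout quoted earlier in the thread and an English locale for day and month names. It drops the timezone abbreviation rather than interpreting it, and the helper names are made up for illustration.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(seconds=305)          # gap used in the post above
STAMP_FORMAT = "%a %d %b %Y %I:%M:%S %p"  # e.g. "Sun 05 May 2019 12:48:41 PM"


def parse_stamp(stamp):
    """Parse an Event Log timestamp, ignoring the trailing timezone abbreviation."""
    without_zone = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(without_zone, STAMP_FORMAT)


def seconds_to_wait(last_request, now):
    """Seconds still to wait before the next scheduler request is safe to send."""
    remaining = MIN_GAP - (now - last_request)
    return max(0.0, remaining.total_seconds())


last = parse_stamp("Sun 05 May 2019 12:48:41 PM PDT")
print(seconds_to_wait(last, parse_stamp("Sun 05 May 2019 12:50:00 PM PDT")))
print(seconds_to_wait(last, parse_stamp("Sun 05 May 2019 12:54:00 PM PDT")))
```

Both timestamps are taken from the same log, so ignoring the timezone does not affect the difference between them.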
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them." . . Are you remembering to set 'No New Tasks' for the project BEFORE you begin the exercise? I have made that mistake and created a lot of extra work ... Stephen |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . One good thing is that if the ghosted tasks are old enough they will be abandoned as soon as you try to recover them and that can clear quite a lot of ghosts in one fell swoop. That alone is worth the effort. Stephen :) |
Joined: 15 May 99 Posts: 3827 Credit: 1,114,826,392 RAC: 3,319 |
Unfortunately, due to a hard drive failure as well as numerous other issues on my largest host, I've made thousands of ghosts on it again. In a dozen attempts to trigger a resend, I was only able to cause it once, no matter how fast I am at disabling network access. I have no idea why the max. number of resends is set so arbitrarily low at 20. As the BOINC group is meeting, I've asked that this be looked into as well. I think that, if anything, it should be the same number as the max. number of new work units that can be assigned in a single scheduler request. |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.