Request for help- BOINC server software configuration.

Message boards : Number crunching : Request for help- BOINC server software configuration.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13179
Credit: 208,696,464
RAC: 304
Australia
Message 2044454 - Posted: 13 Apr 2020, 1:17:02 UTC
Last modified: 13 Apr 2020, 2:14:10 UTC

Rosetta has a problem: if a batch of tasks results in instant errors, those instant bomb-out times are used in the Estimated time to completion calculations.
End result: systems, even those with small caches, get work they have no chance of finishing.

We don't have that happen here at Seti, so I figure the servers here are set to only use Validated Task times to determine Estimated completion times. Would any of those familiar with BOINC happen to know what is required, and where this can be configured?

I've had no luck finding any reference to it here: https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Thanks.


Edit: looking into things more closely makes it even uglier; it all ties in with CreditNew.
It looks like all Invalid & Error Task completion times are taken into account in Estimated completion time calculations, when those that are significantly different shouldn't be.
Grant
Darwin NT
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14350
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044467 - Posted: 13 Apr 2020, 6:28:19 UTC - in response to Message 2044454.  

The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker

SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.
in MB tasks, and the percentage of radar blanking in AP tasks.

Tell them to look at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers
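In code terms, the runtime-outlier mechanism described above amounts to something like the following. This is an illustrative Python sketch only: BOINC's actual validator is project-specific C++, and the function names and exact marker handling here are assumptions.

```python
# Illustrative sketch: a validator flags early-terminating (overflow) results
# as "runtime outliers" so their short runtimes don't feed the server's
# duration-estimation statistics. Names and marker handling are hypothetical.

OVERFLOW_MARKER = "-9 result_overflow"  # marker text quoted from MB task stderr

def is_runtime_outlier(stderr_txt: str) -> bool:
    """Flag results that ended early (overflow), so their short runtimes
    are excluded from duration-estimation statistics."""
    return OVERFLOW_MARKER in stderr_txt

def update_runtime_average(avg: float, elapsed: float, stderr_txt: str,
                           weight: float = 0.01) -> float:
    """Exponentially weighted runtime average that skips flagged outliers."""
    if is_runtime_outlier(stderr_txt):
        return avg  # outlier: leave the host's runtime estimate untouched
    return (1.0 - weight) * avg + weight * elapsed
```

With a gate like this, an overflow result that finished in seconds leaves the running average alone, while a normal result nudges it as usual.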
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13179
Credit: 208,696,464
RAC: 304
Australia
Message 2044468 - Posted: 13 Apr 2020, 6:39:16 UTC - in response to Message 2044467.  

The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker

SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.
in MB tasks, and the percentage of radar blanking in AP tasks.

Tell them to look at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers
Excellent.
Thank you.
Grant
Darwin NT
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15182
Credit: 4,362,181
RAC: 3
Netherlands
Message 2044723 - Posted: 14 Apr 2020, 16:43:55 UTC

Of course it still doesn't help that Rosetta has a replication of just 1 task per workunit.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14350
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044725 - Posted: 14 Apr 2020, 16:54:44 UTC - in response to Message 2044723.  

Of course it still doesn't help that Rosetta has a replication of just 1 task per workunit.
I think even with replication 1, projects still need a validator - it can act as a sanity-check that the result file is properly formatted and complete. All they need is to design a rule to sort the sheep from the goats.
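A "sheep from the goats" rule of that sort might look like the following sketch. Everything here is invented for illustration: the completeness marker, the one-score-per-line format, and the plausibility bounds are not Rosetta's actual output format.

```python
# Hypothetical single-replication sanity check: with replication 1 there is no
# wingman to compare against, but a validator can still reject results that
# are empty, truncated, or malformed. The file format here is invented.

def validate_single_result(output: str) -> bool:
    """Accept a result only if it is complete and well-formed."""
    lines = output.strip().splitlines()
    if not lines:
        return False                      # empty output file
    if lines[-1] != "END_OF_RESULT":      # assumed completeness marker
        return False                      # truncated upload
    body = lines[:-1]
    try:
        scores = [float(x) for x in body] # each data line must parse as a score
    except ValueError:
        return False                      # malformed line
    # reject physically implausible values (illustrative bounds)
    return all(-1e6 < s < 1e6 for s in scores)
```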
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15182
Credit: 4,362,181
RAC: 3
Netherlands
Message 2044729 - Posted: 14 Apr 2020, 16:58:52 UTC - in response to Message 2044725.  

I know that; it's just that with a replication of more than 1 task per WU, if something goes wrong with a (batch of) task(s), it's easier to check whether this is a bad host or a bad (batch of) task(s).
Kissagogo27 (Special Project $75 donor)
Joined: 6 Nov 99
Posts: 708
Credit: 8,032,827
RAC: 62
France
Message 2044777 - Posted: 14 Apr 2020, 20:02:25 UTC
Last modified: 14 Apr 2020, 20:03:02 UTC

After reading this: https://ralph.bakerlab.org/forum_thread.php?id=84

each WU we process is part of a piece of research ... perhaps each WU we process is partly cross-checked against the others?
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13179
Credit: 208,696,464
RAC: 304
Australia
Message 2044808 - Posted: 14 Apr 2020, 23:00:17 UTC - in response to Message 2044723.  

Of course it still doesn't help that Rosetta has a replication of just 1 task per workunit.
I thought the same thing considering some of the noise that systems have pumped out over the years here at Seti, but my understanding is that each Task is seeded with a random number when it starts, so even for a Task that is resent, even if it were to the same computer, the result produced would be different. So comparing 2 Results from the same Task/WU wouldn't work.
I guess they find out just how Valid it is when they use it in their actual models.
Grant
Darwin NT
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 20013
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2044885 - Posted: 15 Apr 2020, 7:13:42 UTC

Rosetta jobs are nothing like those of SETI.
As I understand it, each Rosetta task follows on from its predecessor and calculates for a set amount of time; the endpoint of the task is determined not by reaching a particular conclusion, but by reaching a specified duration, regardless of what progress towards the final end-point has been made in that time. Further, there is a degree of checking within the task to ensure that it is being executed correctly. The fact that the tasks are run-time dependent, not goal dependent, makes validation of an individual task almost impossible. The next task for that job picks up where its predecessor left off and carries on for its duration, and so on until the job is complete.
SETI, however, had discrete work units, each of which was evaluated against a set of goals by a pair of independent computers. Within limits, the tasks ran for as long as each computer required to finish them. Once both computers had finished, their results were compared.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14350
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044894 - Posted: 15 Apr 2020, 8:10:48 UTC - in response to Message 2044454.  

I suppose it depends on what the exact problem is.

Grant wrote:
Rosetta has a problem- if a batch of tasks results in instant errors, those instant bomb out times are used in the Estimated time to completion calculations.
If that was an 'error while computing' on the volunteer's computer, the server code should discard the runtime for the task with no effect on future estimates.

But if it was a fubar in creating the workunit, and the volunteer successfully completed calculating nop 1000 times, or whatever, then that's a problem.

LHC has a similar problem with its sixtrack application. The job is to simulate possibly stable, possibly unstable, orbits round the LHC. The unstable orbits are the target for elimination: a very short unstable run is a valid and significant data point. But it messes up BOINC's estimates. LHC have requested a change to the server code, but I don't understand what they're asking. They want to move the 'runtime outlier' check to the WU - they have normal replication-of-two validation - whereas outliers are strictly a host-result level problem. It doesn't affect wingmates - their estimates are based on their own runtimes.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13179
Credit: 208,696,464
RAC: 304
Australia
Message 2044904 - Posted: 15 Apr 2020, 8:54:20 UTC - in response to Message 2044894.  
Last modified: 15 Apr 2020, 8:56:58 UTC

I suppose it depends on what the exact problem is.

Grant wrote:
Rosetta has a problem- if a batch of tasks results in instant errors, those instant bomb out times are used in the Estimated time to completion calculations.
If that was an 'error while computing' on the volunteer's computer, the server code should discard the runtime for the task with no effect on future estimates.

But if it was a fubar in creating the workunit, and the volunteer successfully completed calculating nop 1000 times, or whatever, then that's a problem.
The error was in the WU creation, in that the application doesn't recognise the Task as a valid format, so it comes up as a Computation error.
     Outcome Computation error
Client state Compute error


ERROR: Cannot determine file type. Current supported types are: PDB, CIF, SRLZ, MMTF
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 380
BOINC:: Error reading and gzipping output datafile: default.out

Run time 40sec or less, Target CPU time 8hrs.


I had a look at Job Runtime Estimation, which ties in closely with CreditNew.
Because Rosetta Tasks run for a fixed period of time (selectable from 2 hrs to 36 hrs) and have a 4 hr grace period, after which a Watchdog timer will end the Task, as near as I could figure out the extremely large wu.fpops_bound value (necessary for the 4 hour Watchdog cutoff) appears to break the sanity check. So extremely short completion times (i.e. Tasks erroring out in seconds), even those that are an Error, are included in Estimated completion time calculations instead of being excluded.

Someone also mentioned (but I haven't checked for myself) that wu.fpops_est for Tasks is set at 80,000 GFLOPs, regardless of whether the Target CPU runtime is 2 hrs or 36 hrs. That would probably explain the APR values for their applications (along with the huge variability in granted Credit), and it also means the wu.fpops_bound values would have to be truly huge to allow for 2 hr to 36 hr Tasks with a 4 hr extension.
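Grant's guess about the oversized bound can be illustrated with a toy calculation. The form of the sanity check, the host speed, and the bound multiplier below are all assumptions; only the 80,000 GFLOPs figure comes from the posts above.

```python
# Hypothetical sketch of the suspected failure mode: if the server's sanity
# check only rejects runtimes above a ceiling derived from wu.fpops_bound,
# a huge bound lets 40-second error-outs through, and they then drag the
# runtime estimate down. All numbers except fpops_est are illustrative.

def passes_sanity_check(elapsed_s: float, fpops_bound: float,
                        host_flops: float) -> bool:
    """Assumed check: accept any runtime under the bound-derived ceiling."""
    return elapsed_s <= fpops_bound / host_flops

host_flops = 10e9                  # a 10 GFLOPS host (assumed)
fpops_est = 80_000e9               # 80,000 GFLOPs per task, as reported above
fpops_bound = 100 * fpops_est      # oversized bound for 2-36 hr + 4 hr grace

assert passes_sanity_check(28800, fpops_bound, host_flops)  # a real 8 hr run
assert passes_sanity_check(40, fpops_bound, host_flops)     # a 40 s error-out also passes

# Once both feed the duration average, a few 40 s errors collapse the estimate:
samples = [28800] + [40] * 5
print(sum(samples) / len(samples))  # about 4833 s, i.e. ~80 min instead of 8 hrs
```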
Grant
Darwin NT
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14350
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044906 - Posted: 15 Apr 2020, 9:01:00 UTC - in response to Message 2044904.  

But this:
Outcome Computation error
Client state Compute error
should trump everything else. That should mean the result goes nowhere near the runtime estimation code.
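That expectation amounts to a simple gate. A hypothetical sketch (the real logic lives in BOINC's C++ server code; these enum-like strings and the function name are invented):

```python
# Hypothetical gate: only successful, validated, non-outlier results should
# ever feed the runtime-estimation statistics. String values are illustrative.

def usable_for_estimation(outcome: str, validate_state: str,
                          runtime_outlier: bool) -> bool:
    """Return True only for results whose runtimes may update estimates."""
    return (outcome == "SUCCESS"
            and validate_state == "VALID"
            and not runtime_outlier)
```

Under this rule, a result stamped "Compute error" is rejected at the first test and never reaches the estimation code.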
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13179
Credit: 208,696,464
RAC: 304
Australia
Message 2044908 - Posted: 15 Apr 2020, 9:05:48 UTC - in response to Message 2044906.  
Last modified: 15 Apr 2020, 9:07:43 UTC

But this:
Outcome Computation error
Client state Compute error
should trump everything else. That should mean the result goes nowhere near the runtime estimation code.
*shrug*
I'd have thought so too: that only Valid Tasks are used in Runtime Estimation calculations.

But it was a widespread problem; I had about 5 or 6 of those Tasks on my systems, and they errored out in 20-40 sec. The next batch of new work I got for that application had an estimated completion time of around 39 min (pretty sure the Estimated time prior to that was around 7 hours or so).
Grant
Darwin NT
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14350
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044915 - Posted: 15 Apr 2020, 9:14:07 UTC - in response to Message 2044908.  

It's up to the system admins now. Either they disabled the 'failed task' check when they wrote their bespoke target time code, or there's a bug in Credit New (surprise!) which they'll have to take up with David.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13179
Credit: 208,696,464
RAC: 304
Australia
Message 2044920 - Posted: 15 Apr 2020, 9:29:37 UTC - in response to Message 2044915.  
Last modified: 15 Apr 2020, 9:29:52 UTC

It's up to the system admins now. Either they disabled the 'failed task' check when they wrote their bespoke target time code, or there's a bug in Credit New (surprise!) which they'll have to take up with David.
I forwarded your link and sent my WAG (Wild Arse Guess) relating to bound values etc.
So it's all in their court now.
Grant
Darwin NT
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14350
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044924 - Posted: 15 Apr 2020, 9:35:42 UTC - in response to Message 2044920.  

Now that you've clarified the error status, the link is probably redundant - although it points to an alternate solution.

The other issue (a fixed fpops_est for tasks of different durations) is a weakness I've seen at other projects too, but it's a second-order problem: without the errors, the project should survive. It would help if they checked their WUs before sending them out, but that's probably too much to hope for.

©2021 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.