Message boards : SETI@home Enhanced : Runtime estimates and transitional matters
Joined: 3 Jan 07 | Posts: 1451 | Credit: 3,272,268 | RAC: 0
As Eric said in message 40417, completion estimates are way off, but David assures me that they will even out soon enough. I've been monitoring two hosts closely to see how well 'soon enough' pans out. They are:

- i5 laptop, HT enabled, running the Windows Beta_v7 v6.91 CPU app.
- Q6600 + 9800GT, running the Windows cuda23 v6.09 app.

Because we haven't got a CUDA app for testing here in Beta at the moment, I've had to assume the Main and Beta schedulers are behaving the same - they seem to be, as far as I can tell. To keep things simple, I restricted both hosts to a single app for test purposes: the i5 is running 'pure MB CPU' here at Beta, and the 9800GT is running 'pure MB CUDA' at Main (though both are attached to other projects, using other resources). Both hosts started with a clean app_version record for the app in question - no prior validations.

The initial estimate for the i5, for a VLAR at DCF=1.00, was 17hr 37min. By the time my 10th validation was about to kick in, DCF (and estimates) were down to almost half that.

The initial estimate for the 9800GT was 3hr 45min for mid-AR. I'm not sure where that came from - it seems too quick for CPU, and too slow for CUDA. Again, I started at DCF=1.00: by the time I was ready to hit 10 validations, I was down to DCF=0.14, and estimates (correct for this card) of around 30 minutes.

What happened next surprised me. Most readers here will be familiar with optimised apps, and will have experienced (possibly without realising it) the tenth-validation transition. To summarise, taking the CUDA case to demonstrate the point more dramatically:

With an app_info, once 10 tasks have been validated, the server starts adjusting the runtime estimate of each new task. So, for each old task I finished and reported, I would get seven new tasks (each with one-seventh the estimated runtime). That would continue with a near-enough steady DCF while I continued to crunch 'old' tasks.
Eventually, when the first 'new' task reaches the head of the queue and completes, DCF is corrected to near 1.0, and the whole cache suddenly becomes seven times the desired size. Ooops. No work fetch, and quite possibly some EDF, until the extra is worked off.

Without an app_info, things are different. There's still a transition after the 10th validation, but the server handles it by sending a new speed (<flops>) estimate in the <app_version> the next time work is fetched. BOINC re-calculates all the estimates as if the host is (in my case) suddenly seven times faster. Big shortfall, big work fetch request. And then, when the very next task is completed, DCF jumps up to 1.0 or something, and the cache is back to normal size - plus any work allocated in the meantime.

Once our beta testing work here is finished, I assume the plan will be to deploy sah_v7 - across all platforms? all processors, CPU and GPU? - at the main project. There has only been one new application deployed on the main site since 2009 - the Windows v6.10 CUDA app for Fermi cards, in June 2010. That was bad enough, though there would have been comparatively few cards in operation at that time requiring the new app. By the time of the forthcoming transition, Fermis will be commonplace: many people will see the effect of the transition at first hand, whether the 'app_info' or stock variant. And we've also seen in the past the effect that a big new application rollout can have on the download channel - imagine if a version change requires a new, larger, cufft.dll as well?

It seems we have two choices:

a) Install the new apps, turn on the splitters, and see what happens. A very real beta test of the new scheduler and estimation code. I suspect that when the dust settles, we may be in a position to tell David 'I told you so'.

b) Try to take some pre-emptive action in mitigation.
Advice to users will be insufficient - we all know that the vast bulk of the user base can't be reached through the message boards. I would consider:

i) Better initial estimates - whether propagating each host's sah_v6 record forward as a seed for sah_v7, or doing better a priori speed calculations, especially for GPUs.

ii) Smoothing the 10-validation transition (both types, but especially the app_info one: seeing a 10-day cache suddenly re-estimated to 70 days is not a pretty sight, and my test card is modest by current standards).

iii) Re-imposing the limitations on max_wus_in_progress when v7 is launched, at least until a high proportion of hosts have reached 10 validations (remembering that large caches can delay wingmates' validations).

And probably more. Ideas, anyone?
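The cache blow-up in the app_info case falls out of simple arithmetic: the client's estimate is fpops divided by speed, scaled by DCF. A minimal sketch of that arithmetic, where the per-task fpops figure and benchmark speed are illustrative assumptions (only the one-seventh factor and DCF=0.14 come from the observations above), not values from the server code:

```python
# Minimal sketch of the app_info transition arithmetic described above.
# The fpops and flops figures are illustrative assumptions; the 7x factor
# and DCF=0.14 are the values observed in the post.

def estimate_hours(rsc_fpops_est, flops, dcf):
    """BOINC-style runtime estimate: fpops / speed, scaled by host DCF."""
    return rsc_fpops_est / flops / 3600.0 * dcf

flops = 2.5e9        # benchmark-derived speed the client reports (assumed)
old_fpops = 3.2e13   # per-task fpops estimate before the transition (assumed)
dcf = 0.14           # DCF the host settled at on 'old' tasks

old_est = estimate_hours(old_fpops, flops, dcf)

# After the 10th validation, new tasks arrive with ~1/7 the estimated runtime:
new_fpops = old_fpops / 7.0
new_est = estimate_hours(new_fpops, flops, dcf)

# When the first 'new' task completes in roughly old_est, DCF snaps back
# toward 1.0 and every cached estimate inflates roughly sevenfold:
inflated_est = estimate_hours(new_fpops, flops, 1.0)
print(round(inflated_est / new_est, 1))  # -> 7.1: cache ~7x the desired size
```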
Joined: 14 Oct 05 | Posts: 1137 | Credit: 1,848,733 | RAC: 0
... As S@H v7 is a new application, there's no reason it has to use the same fpop standard as previous applications. Here at Beta the splitter code is unchanged from Enhanced in that respect, but testing will generate enough data to make appropriate adjustments before v7 goes to Main.

Your i5 has "Measured floating point speed 2508.35 million ops/sec" from the misinterpreted WMIPS of the BOINC Whetstone benchmark, but for v7 work it has "Average processing rate 5.0261114958408". If the splitter were saying each WU had half as many fpops, that rate would be halved and would very nearly match the WMIPS value. The initial estimates would have been close to right on your host, DCF would have remained near 1, etc.

The tape file being split has VLAR, midrange, and VHAR parts, though I'm not sure if the mix is similar to long-term proportions. All of the first channel has been sent; maybe when nearly all the results are back, Eric could run a query to get statistics on that ratio averaged over the set of hosts here. Or maybe it's better to wait until the second channel is done as well.

Adjusting rsc_fpops_est for CPUs isn't enough to fully correct for GPU processing, though it goes in the right direction. The GPU app_plans derive a speed estimate for the combination of the GPU with the CPU, in the same code which estimates what fraction of a CPU is needed to support the GPU. That fraction gets sent to the client as <avg_ncpus>, and maybe the combined speed as <flops>. Neither is accurate, of course, but it should be possible to adjust the way they are calculated to better match S@H reality. That won't be possible until there are some v7 GPU applications to generate actual speed measurement data, though.

Joe
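The halved-fpops point is easy to verify from the two quoted numbers, taking both figures in GFLOPS (a back-of-envelope check, not anything from the scheduler code):

```python
# Back-of-envelope check of the halved-fpops point, taking both quoted
# figures in GFLOPS (2508.35 million ops/sec = 2.50835 GFLOPS).

whetstone_gflops = 2.50835        # "Measured floating point speed"
apr_gflops = 5.0261114958408      # "Average processing rate" for v7 work

# If each WU claimed half as many fpops, the APR would be halved:
halved_apr = apr_gflops / 2
print(round(halved_apr, 3))                      # -> 2.513
print(round(halved_apr / whetstone_gflops, 3))   # -> 1.002, a near match
```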
Joined: 3 Jan 07 | Posts: 1451 | Credit: 3,272,268 | RAC: 0
> Your i5 has "Measured floating point speed 2508.35 million ops/sec" from the misinterpreted WMIPS of the BOINC Whetstone benchmark, but for v7 work it has "Average processing rate 5.0261114958408". If the splitter were saying each WU had half as many fpops, that rate would be halved and very nearly match the WMIPS value. The initial estimates would have been close to right on your host, DCF would have remained near 1, etc.

The same i5 got

    <app_version>
        <app_name>einstein_S5GC1HF</app_name>
        <version_num>306</version_num>
        <platform>windows_intelx86</platform>
        <avg_ncpus>1.000000</avg_ncpus>
        <max_ncpus>1.000000</max_ncpus>
        <flops>4010298189.207595</flops>
        <plan_class>S5GCESSE2</plan_class>
        ...
    </app_version>
    <app_version>
        <app_name>einsteinbinary_BRP3</app_name>
        <version_num>107</version_num>
        <platform>windows_intelx86</platform>
        <avg_ncpus>0.200000</avg_ncpus>
        <max_ncpus>1.000000</max_ncpus>
        <flops>25087740348.401230</flops>
        <plan_class>BRP3cuda32</plan_class>
        ...
    </app_version>

from Einstein - that's via app_plan only; they haven't loaded (or activated) any of the CreditNew stuff yet. I suspect some hand tuning to match their own app efficiencies - note that SSE2 line, with <flops> well above Whetstone - but the CUDA estimate is reasonable for their current app. If we could do the same here, it would give the averager a much better start.
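For anyone following along, this is roughly how a client turns such an <app_version> into a runtime estimate: seconds = rsc_fpops_est / <flops> (times DCF). A sketch using the BRP3 entry above; the task size is a hypothetical value chosen for illustration, and the parsing is mine, not BOINC's:

```python
# Sketch: turning a server-supplied <flops> into a runtime estimate.
# The <app_version> is the BRP3 entry quoted above (trimmed); the task
# size rsc_fpops_est is a hypothetical value chosen for illustration.
import xml.etree.ElementTree as ET

APP_VERSION = """\
<app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>107</version_num>
    <avg_ncpus>0.200000</avg_ncpus>
    <flops>25087740348.401230</flops>
    <plan_class>BRP3cuda32</plan_class>
</app_version>
"""

av = ET.fromstring(APP_VERSION)
flops = float(av.findtext("flops"))

rsc_fpops_est = 2.2e14           # hypothetical task size, in fpops
est_hours = rsc_fpops_est / flops / 3600.0
print(round(est_hours, 2))       # -> 2.44 hours at DCF = 1
```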
Joined: 14 Oct 05 | Posts: 1137 | Credit: 1,848,733 | RAC: 0
Hmm, the Einstein BRP3cuda32 app_plan may just be figuring <flops> = 10X CPU Whetstones for CUDA. Although that might work out OK for typical combinations of CPU and GPU, I hope we can do better.

Joe
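The 10x guess checks out arithmetically against the figures quoted earlier in the thread (the i5's Whetstone speed and the BRP3cuda32 <flops>):

```python
# Arithmetic check of the '10x Whetstone' guess, using the i5's benchmark
# speed and the BRP3cuda32 <flops> quoted earlier in the thread.

whetstone_ops = 2508.35e6            # 2508.35 million ops/sec
cuda_flops = 25087740348.401230      # <flops> sent for BRP3cuda32

ratio = cuda_flops / (10 * whetstone_ops)
print(round(ratio, 4))  # -> 1.0002: consistent with a simple 10x rule
```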
Joined: 3 Jan 07 | Posts: 1451 | Credit: 3,272,268 | RAC: 0
> Hmm, the Einstein BRP3cuda32 app_plan may just be figuring <flops> = 10X CPU Whetstones for CUDA. Although that might work out OK for typical combinations of CPU and GPU, I hope we can do better.

You could be right. I've just tried one on my GTX 470 Fermi (I don't usually run Einstein GPU on that; it's fast enough to use GPUGrid as the alternate project). I got a similar <flops> count, so likely 10x the Q9300 Whetstone, but a task estimated at 2hr 26min finished in 1:10 - and gave DCF a good kicking in the process. Back to the drawing board....
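The size of that kicking follows directly from the two runtimes quoted: DCF is driven by the ratio of actual to estimated runtime (real BOINC smooths the adjustment rather than applying it in one step). A quick check:

```python
# The size of the DCF 'kicking': DCF is driven by actual / estimated
# runtime (BOINC smooths the adjustment rather than applying it at once).

est_s = 2 * 3600 + 26 * 60   # estimated: 2 hr 26 min
act_s = 1 * 3600 + 10 * 60   # actual:    1 hr 10 min

ratio = act_s / est_s
print(round(ratio, 2))  # -> 0.48: the task ran in under half its estimate
```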
©2023 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.