AstroPulse 7.00 released for Linux 32&64, Win 32&64, Win32+AMD/NVIDIA/Intel GPU
Author | Message |
---|---|
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
No problems here on my GTX-750. The most important change is blanking handling. There were many incremental changes and one quite big one, FFA_TWIN, so everything that will appear in the still-delayed new Lunatics installer release for GPU AstroPulse can already be seen here. |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
Here is how iGPU gone mad could look on AP task: http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17284447 Single pulse: peak_power=2.997e+004 dm=-4528 fft_num=6291456 peak_bin=6291462 scale=0 Perhaps we need to add some sanity checking to AstroPulse as we did for MultiBeam recently. |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
BOINC 6.10.60 can't get ATi work: 7/7/2014 01:40:05 AM SETI@home Beta Test Requesting new tasks for GPU 7/7/2014 01:40:05 AM SETI@home Beta Test [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs 7/7/2014 01:40:05 AM SETI@home Beta Test [sched_op_debug] ATI GPU work request: 503430.59 seconds; 1.00 GPUs 7/7/2014 01:40:08 AM SETI@home Beta Test Scheduler request completed: got 0 new tasks 7/7/2014 01:40:08 AM SETI@home Beta Test [sched_op_debug] Server version 703 7/7/2014 01:40:08 AM SETI@home Beta Test Project requested delay of 7 seconds 7/7/2014 01:40:08 AM SETI@home Beta Test [sched_op_debug] Deferring communication for 7 sec 7/7/2014 01:40:08 AM SETI@home Beta Test [sched_op_debug] Reason: requested by project 7/7/2014 01:37:55 AM ATI GPU 0: ATI unknown (CAL version 1.4.1848, 256MB, 44 GFLOPS peak) There is no OpenCL in this BOINC. Will we support such versions with APv7? If yes, a new plan class is needed. |
Joined: 14 Oct 05 Posts: 1137 Credit: 1,848,733 RAC: 0 |
"Here is how iGPU gone mad could look on AP task:" IMO, what makes that set of signals definitely wrong is that it would surely have been detected at earlier dispersions with lower powers (but plenty to be reported and to reach the single pulse limit). Perhaps a solitary single pulse with that kind of power is possible; I'm not sure. I'm in favor of sanity checks wherever they can be added without slowing processing significantly. A simple check might take only a couple of nanoseconds on a current processor, so doing it 500 million times would add about 1 second to the run time. More complex checks could be carefully placed so they do not run too often, of course. The other difficult thing is deciding what's best to do when there's an apparent problem. For a case with significant progress on a task, IMO restarting from the last checkpoint is most sensible and can simply use BOINC's temporary exit feature. That does a fresh initialization and rereads the WU file, so it can cure some corrupted data cases. BOINC doesn't tell the application anything about restarts, though, so if there's no cure the app will try the temporary exit again and again until BOINC stops the cycle for too many exits. Perhaps we should consider adding application code to keep track of restarts, to allow shifting to a different strategy. That task and the similar invalid AP v6 task 15788381 last January both show some variation in the peaks, but having the peaks occur at 8 sample intervals in both cases is possibly significant. It's really the single pulse code claiming it sees a repetitive pulse sequence with a 3.2 usec period, too fast to be seen by even the short FFA. Joe |
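
A minimal sketch of the restart bookkeeping Joe suggests, in Python for brevity. It assumes the application keeps a small counter file next to its checkpoint; the file name, the `MAX_RESTARTS` threshold and the strategy names are all hypothetical, and nothing here is a BOINC API call.

```python
import json
import os
import tempfile

MAX_RESTARTS = 3  # hypothetical threshold, not a BOINC setting


def record_restart(path):
    """Increment the persisted restart counter and return the new count."""
    count = 0
    if os.path.exists(path):
        with open(path) as f:
            count = json.load(f).get("restarts", 0)
    count += 1
    with open(path, "w") as f:
        json.dump({"restarts": count}, f)
    return count


def choose_strategy(count, max_restarts=MAX_RESTARTS):
    """Retry from the checkpoint a few times, then switch to something else."""
    if count <= max_restarts:
        return "restart_from_checkpoint"
    return "fallback"  # hypothetical: e.g. report an error instead of looping


if __name__ == "__main__":
    # Simulate five consecutive sanity-check failures on the same task.
    with tempfile.TemporaryDirectory() as workdir:
        counter_file = os.path.join(workdir, "restarts.json")
        for _ in range(5):
            n = record_restart(counter_file)
            print(n, choose_strategy(n))
```

Because the counter lives on disk, it survives the fresh initialization that a temporary exit causes, which is the information BOINC itself does not pass to the app.
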
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
I rather looked at the fft_num and peak_bin values. A peak_bin of 6.3M with only 32k in the array... quite a strong sign of error, IMO. The restart-from-checkpoint approach (as we already discussed) will not cure signals already logged. Hence it could not save the task from turning invalid in the end; it would just add some time for restarting and reprocessing. Restarting from checkpoint is applicable only when we are sure we detected the very first attempt to log an invalid signal. |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
403de6: 8b 0d c0 b5 4e 00 mov 0x4eb5c0,%ecx 403dec: 8b 15 c8 eb 4c 00 mov 0x4cebc8,%edx 403df2: 8b 04 91 mov (%ecx,%edx,4),%eax 403df5: 50 push %eax 403df6: ff 15 b8 90 4a 00 call *0x4a90b8 Symbol table: https://dl.dropboxusercontent.com/u/60381958/for_APv7_00_AP7_win_x86_SSE2_OpenCL_NV.pdb.7z |
Joined: 14 Oct 05 Posts: 1137 Credit: 1,848,733 RAC: 0 |
"I rather looked at the fft_num and peak_bin values." fft_num and peak_bin are locations within the full 32 Mebisample array. fft_num marks the beginning of the particular data chunk; the difference between peak_bin and fft_num is the location within the data chunk. Those differences go 6, 14, 22, 30... for your task 17284447, a pattern with period 8. Restarting from checkpoint absolutely discards any signals which were logged after the checkpoint. Those signals have not been written to disk; starting from checkpoint loads the signal vector from the pulse.out file written at the checkpoint. In this case all 30 single pulses are in a single data chunk at a single dispersion, so a sanity check which could detect the error pattern before the next checkpoint is all that's needed. The restart would begin with no signals. Making a smart sanity check which could be triggered by having reached the single pulse limit is one possibility; since it would only be run once, it could be quite complex. Determining the parameters it should use is the difficult part. Joe |
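
To make the index arithmetic above concrete, here is a rough Python sketch of the kind of check being discussed. It is not the AstroPulse code: the function names and the `min_count` parameter are made up for illustration, and the 32k chunk size is taken from the "32k in array" remark in this thread.

```python
CHUNK_SIZE = 32 * 1024          # "32k in array", as stated in the thread
TOTAL_SAMPLES = 32 * 1024 ** 2  # the full 32 Mebisample array


def offset_in_chunk(fft_num, peak_bin):
    """Location of the peak inside its data chunk."""
    return peak_bin - fft_num


def indices_plausible(fft_num, peak_bin,
                      chunk_size=CHUNK_SIZE, total=TOTAL_SAMPLES):
    """Range check: chunk start inside the array, peak inside the chunk."""
    offset = offset_in_chunk(fft_num, peak_bin)
    return 0 <= fft_num < total and 0 <= offset < chunk_size


def looks_periodic(pulses, min_count=4):
    """True if all pulses share one chunk and are evenly spaced within it.

    pulses is a list of (fft_num, peak_bin) pairs. min_count is a
    hypothetical tuning parameter, not a value from the application.
    """
    if len(pulses) < min_count:
        return False
    if len({fft_num for fft_num, _ in pulses}) != 1:
        return False
    offsets = sorted(offset_in_chunk(f, p) for f, p in pulses)
    steps = {b - a for a, b in zip(offsets, offsets[1:])}
    return len(steps) == 1 and 0 not in steps


if __name__ == "__main__":
    # First reported pulse of task 17284447, plus the 6, 14, 22, 30 pattern.
    print(offset_in_chunk(6291456, 6291462))
    suspect = [(6291456, 6291456 + o) for o in (6, 14, 22, 30)]
    print(looks_periodic(suspect))
```

Note that the range check alone passes for the reported pulse (offset 6 is well inside the chunk), so it is the even spacing, not the raw size of peak_bin, that flags this result.
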
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
(Can't post the image; it changes every time someone adds a new screenshot to the service.) In short, the beta project properties on that host say there is no ATi app, and hence no ATi work is requested :( How so? News about SETI opt app releases: https://twitter.com/Raistmer |
Joined: 15 Jul 05 Posts: 176 Credit: 1,674,830 RAC: 0 |
I got some errors on one Windows 7 x64 machine with AstroPulse v7 v7.00 (sse2) and AstroPulse v7 v7.00 (sse). I updated to the latest alpha of BOINC (7.4.8) last Friday -> same behavior. I got some additional work on that machine with AstroPulse v7 v7.00 (sse + sse2): http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17297264 <core_client_version>7.4.8</core_client_version> On the other hand, I could finish results from AstroPulse v7 v7.00 on both BOINC alpha versions, so there must be a difference in the sse + sse2 builds: http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17246249 <core_client_version>7.3.19</core_client_version> http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17271866 <core_client_version>7.4.8</core_client_version> Now I've gone back to the latest stable BOINC 7.2.42 and am waiting for sse and sse2 results. Matthias |
Joined: 4 Jun 10 Posts: 6 Credit: 526,721 RAC: 0 |
Would it be possible for the CPUs that are SSE2 capable to only get the SSE2 work units? The non-SSE2 work units take about 2 1/2 times as long to run on the same computer. AMD Phenom XII, 64-bit Linux: AstroPulse v7 v7.00: 92,979 seconds; AstroPulse v7 v7.00 SSE2: 37,436 seconds. Thanks, Conan |
Joined: 14 Oct 05 Posts: 1137 Credit: 1,848,733 RAC: 0 |
"Would it be possible for the CPUs that are SSE2 capable to only get the SSE2 work units?" This is Beta testing and all the application versions need to be tested. But the server code favors the faster application, so your host should not get many tasks for the slower generic version. Joe |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
"(can't put image, it changed every time someone added new screenshot to service)." This host: http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=39394 can't receive ATi work on beta, because the project properties state that this project has no application for AMD/ATi. Please fix! What can I do client-side to solve this issue? The host receives APv7 CPU work OK. |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
"Would it be possible for the CPUs that are SSE2 capable to only get the SSE2 work units?" Also, that very mechanism is under testing too. Hence a "true tester" should not abort slower tasks until it is evident that the mechanism has gone wrong (more than a few dozen tasks processed for the slowest app while the host still can't collect 10 tasks for the faster app), as, for example, on this host (an example of the wrong behavior): http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=70383 |
Joined: 29 May 06 Posts: 1037 Credit: 8,440,339 RAC: 0 |
"(can't put image, it changed every time someone added new screenshot to service)." Update to BOINC 7.2.42. From [boinc_alpha] "Boinc 7.2.18, after removing a device specific app_info, Boinc won't ask for work for other devices": "It now clears the flags on every scheduler RPC; that should suffice." The workaround is to remove the <no_rsc_apps>type of device</no_rsc_apps> entry from the client_state.xml. Claggy |
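
For anyone who would rather script that workaround than hand-edit the file, a small Python sketch follows. The sample XML is invented for illustration (the real layout of client_state.xml is not shown in this thread), and the function name is hypothetical. It would be prudent to stop the BOINC client and keep a backup of the file first, since the client may rewrite it while running.

```python
import re

# Matches one whole line holding a <no_rsc_apps>...</no_rsc_apps> entry,
# including its indentation and trailing newline.
NO_RSC_RE = re.compile(r"[ \t]*<no_rsc_apps>.*?</no_rsc_apps>[ \t]*\r?\n?")


def strip_no_rsc_apps(xml_text):
    """Return xml_text with every <no_rsc_apps> entry removed."""
    return NO_RSC_RE.sub("", xml_text)


if __name__ == "__main__":
    # Invented fragment; "type of device" is the placeholder from the post.
    sample = (
        "<project>\n"
        "    <master_url>http://example.invalid/</master_url>\n"
        "    <no_rsc_apps>type of device</no_rsc_apps>\n"
        "</project>\n"
    )
    print(strip_no_rsc_apps(sample))
```

A plain text substitution is used here instead of an XML parser so that the rest of the file is left byte-for-byte untouched.
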
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
"The workaround is to remove the <no_rsc_apps>type of device</no_rsc_apps> entry from the client_state.xml" Thanks, Claggy! Will try. EDIT: Yes, it asks for work now! Thanks again! |
Joined: 12 Nov 10 Posts: 1149 Credit: 32,460,657 RAC: 1 |
Hi. Running on a Windows 7 (64-bit) machine, BOINC 7.2.42. In the task list, an AstroPulse v7.7.00 (sse) work unit shows 100% progress and Computation Error, and the following message is in the Event Log: 09/07/2014 09:20:32 | SETI@home Beta Test | Task ap_10fe09ab_B4_P0_00143_20140707_28726.wu_0 exited with zero status but no 'finished' file 09/07/2014 09:20:32 | SETI@home Beta Test | If this happens repeatedly you may need to reset the project. It has just started another work unit (sse2); the elapsed time gets up to 09 seconds and then it restarts. SusieQ |
Joined: 4 Jun 10 Posts: 6 Credit: 526,721 RAC: 0 |
Hi. The error log says the work unit is losing contact with the BOINC client, saying it "no longer exists and is dead". The work unit then starts over again. This continues until BOINC says "Too Many Exits" and terminates the work unit. I have no idea why. Conan |
Joined: 15 Jul 05 Posts: 176 Credit: 1,674,830 RAC: 0 |
I found an invalid AstroPulse v7 v7.00 (opencl_intel_gpu_100) result: http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=6478986 Valid are AstroPulse v7 v7.00 (opencl_ati_100) and AstroPulse v7 v7.00 (opencl_nvidia_100), with: state.fold_buf_size_short=65536; state.fold_buf_size_long=262144 single pulses: 2 repetitive pulses: 2 percent blanked: 0.00 Invalid is AstroPulse v7 v7.00 (opencl_intel_gpu_100): state.fold_buf_size_short=65536; state.fold_buf_size_long=262144 Found 30 single pulses and 30 repeating pulses, exiting. percent blanked: 0.00 How can it be possible to find so many additional pulses on the same WU? Currently I also get valid results for AstroPulse v7 v7.00 (opencl_intel_gpu_100) on that machine, e.g.: http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17298162 Matthias |
Joined: 12 Nov 10 Posts: 1149 Credit: 32,460,657 RAC: 1 |
Hi. I've got a couple of AstroPulse 7.7.00 work units on the go now that have been running for 4 hrs+ and 1 hr+ respectively, so it looks as though the error was only on the (sse) and (sse2) work units. SusieQ |
Joined: 14 Oct 05 Posts: 1137 Credit: 1,848,733 RAC: 0 |
... The exact "how" generally cannot be pinned down. It could be a software bug, hardware related, or a combination. If the GPU were being used to play a game or a video, the same fault might be a brief flash of wrong color or brightness someplace on the screen which you might not notice at all. A GPU manufacturer might even know that the hardware can spontaneously do that sometimes, but release it anyhow because it happens rarely enough. Almost any GPU or CPU will do similar things if it gets too hot, and RAM is also more subject to bit flips if too hot. A PSU can contribute to the problem if it cannot stay well regulated under peak loads. All in all, doing distributed science processing with consumer grade equipment requires some kind of checking of the results. Redundant processing and validation using cross-checking is the method in use here, and appears to be effective though probably not perfect. Joe |
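
As a toy illustration of that cross-checking idea (not the project's actual validator, whose comparison rules are not described in this thread), the Python sketch below simply takes the majority result summary as canonical. The function name, the `min_quorum` parameter and the use of (single pulses, repetitive pulses) counts as the summary are assumptions; the counts are the ones reported above for workunit 6478986.

```python
from collections import Counter


def cross_check(results, min_quorum=2):
    """Pick the most common result summary as canonical.

    results maps a host or app label to a hashable summary.
    Returns (canonical, valid_labels, invalid_labels), or None
    if no summary is reported by at least min_quorum results.
    """
    if not results:
        return None
    summary, votes = Counter(results.values()).most_common(1)[0]
    if votes < min_quorum:
        return None
    valid = sorted(k for k, v in results.items() if v == summary)
    invalid = sorted(k for k, v in results.items() if v != summary)
    return summary, valid, invalid


if __name__ == "__main__":
    # (single pulses, repetitive pulses) as posted for workunit 6478986.
    reported = {
        "opencl_ati_100": (2, 2),
        "opencl_nvidia_100": (2, 2),
        "opencl_intel_gpu_100": (30, 30),
    }
    print(cross_check(reported))
```

With three results the two agreeing ones outvote the iGPU result; with only two disagreeing results the function returns None, which corresponds to needing another host to process the task.
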
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.