Workunits with inconsistent results with ATi 6.99 app involved

Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 44108 - Posted: 17 Oct 2012, 17:36:46 UTC
Last modified: 17 Oct 2012, 17:37:02 UTC

Please post links to such workunits here, with a short description of the possible problem to investigate.
ID: 44108

Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 44111 - Posted: 17 Oct 2012, 17:46:06 UTC

http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=4153109

CUDA 3.2
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.

Flopcounter: 23190166560580.547000

Spike count: 3
Autocorr count: 24
Pulse count: 3
Triplet count: 0
Gaussian count: 0


ATi:

WU true angle range is : 0.422314
SpikeR2: score:-0.21765, peak=24.23342, time=100.7, d_freq=1418824476.73, chirp=6.8802, fft_len=128k
SpikeR2: score:-0.20135, peak=25.16003, time=100.7, d_freq=1418824476.73, chirp=6.8839, fft_len=128k
SpikeR2: score:-0.20957, peak=24.68848, time=100.7, d_freq=1418824476.73, chirp=6.8876, fft_len=128k
Pulse: peak=7.976645, time=12.73, period=2.744, d_freq=1418826529.16, score=1.077, chirp=-14.489, fft_len=256
Pulse: peak=7.526055, time=12.73, period=2.744, d_freq=1418826522.8, score=1.017, chirp=-14.989, fft_len=256

Best spike: peak=25.16003, time=100.7, d_freq=1418824476.73, chirp=6.8839, fft_len=128k
Best autocorr: peak=17.6995, time=73.82, delay=3.591, d_freq=1418820625.37, chirp=-27.181, fft_len=128k
Best gaussian: peak=3.197366, mean=0.5137201, ChiSq=1.255641, time=32.72, d_freq=1418826187.73,
score=0.4408815, null_hyp=2.186755, chirp=-89.131, fft_len=16k
Best pulse: peak=7.976645, time=12.73, period=2.744, d_freq=1418826529.16, score=1.077, chirp=-14.489, fft_len=256
Best triplet: peak=0, time=-2.121e+011, period=0, d_freq=0, chirp=0, fft_len=0


Flopcounter: 19057184035267.117000

Spike count: 3
Autocorr count: 0
Pulse count: 2
Triplet count: 0
Gaussian count: 0
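
The two count summaries above can be compared mechanically. This is a minimal Python sketch. It assumes the stdout format quoted in this post, and it assumes that an overflow means the counts sum to a 30-signal storage limit (inferred from the CUDA counts here: 3 + 24 + 3 = 30). The function names are hypothetical and not part of any SETI@home tool.

```python
import re

# Assumption inferred from this thread: a task reports result_overflow when
# the total number of signals reaches 30 (the CUDA counts above sum to 30).
MAX_SIGNALS = 30

COUNT_RE = re.compile(
    r"^(Spike|Autocorr|Pulse|Triplet|Gaussian) count:\s*(\d+)\s*$",
    re.MULTILINE,
)

def signal_counts(stdout_text):
    """Return {signal type: count} parsed from a task's stdout summary."""
    return {name: int(value) for name, value in COUNT_RE.findall(stdout_text)}

def is_overflow(counts):
    """True if the summed counts reach the assumed storage limit."""
    return sum(counts.values()) >= MAX_SIGNALS

def count_differences(a, b):
    """Return {signal type: (count in a, count in b)} where the two differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n, 0), b.get(n, 0)) for n in names if a.get(n, 0) != b.get(n, 0)}

CUDA_STDOUT = """\
Spike count: 3
Autocorr count: 24
Pulse count: 3
Triplet count: 0
Gaussian count: 0
"""

ATI_STDOUT = """\
Spike count: 3
Autocorr count: 0
Pulse count: 2
Triplet count: 0
Gaussian count: 0
"""

cuda = signal_counts(CUDA_STDOUT)
ati = signal_counts(ATI_STDOUT)
print("CUDA overflow:", is_overflow(cuda))
print("ATi overflow:", is_overflow(ati))
print("Differences (CUDA, ATi):", count_differences(cuda, ati))
```

On this workunit the sketch flags only the CUDA result as an overflow, and the disagreement is confined to the autocorr and pulse counts.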
ID: 44111

Richard Haselgrove
Volunteer tester
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 44117 - Posted: 17 Oct 2012, 18:16:10 UTC - in response to Message 44111.  

http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=4153109

We've noted already - I pointed out in the news thread - that Peter@H.K.'s host 59712 has a tendency to over-report autocorrs.
ID: 44117

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 44122 - Posted: 17 Oct 2012, 20:26:05 UTC - in response to Message 44117.  
Last modified: 17 Oct 2012, 23:30:00 UTC

http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=4157131

Both are ATI hosts. One found 6 signals (result ending 608), the other (result ending 609) found zero and has an empty stdout.

The host for result #609 appears to consistently fail on GPU results. (Driver revision, maybe?) This host has successfully completed two ATI Astropulse results.

Result number #608 does not include the signal counts in its stdout.

The host for #608 has returned validated v7 ATI results
ID: 44122

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 44125 - Posted: 17 Oct 2012, 23:36:39 UTC
Last modified: 17 Oct 2012, 23:39:19 UTC

Another host that is failing some/many ATI SAH_v7 results (55791). It is exiting after finding 30 autocorrs in every result.
ID: 44125

Mike
Volunteer tester
Joined: 16 Jun 05
Posts: 2530
Credit: 1,074,556
RAC: 0
Germany
Message 44130 - Posted: 18 Oct 2012, 7:45:58 UTC - in response to Message 44122.  

http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=4157131

Both are ATI hosts. One found 6 signals (result ending 608), the other (result ending 609) found zero and has an empty stdout.

The host for result #609 appears to consistently fail on GPU results. (Driver revision, maybe?) This host has successfully completed two ATI Astropulse results.

Result number #608 does not include the signal counts in its stdout.

The host for #608 has returned validated v7 ATI results


Host 608 has a heavily overclocked Juniper (HD 5770), so no wonder.

Host 609 probably has thermal issues or dust bunnies.

I will try to contact him.

With each crime and every kindness we birth our future.
ID: 44130

Mike
Volunteer tester
Joined: 16 Jun 05
Posts: 2530
Credit: 1,074,556
RAC: 0
Germany
Message 44132 - Posted: 18 Oct 2012, 7:50:10 UTC - in response to Message 44125.  

Another host that is failing some/many ATI SAH_v7 results (55791). It is exiting after finding 30 autocorrs in every result.


This host is using driver 11.11, which is buggy.


With each crime and every kindness we birth our future.
ID: 44132

Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 44133 - Posted: 18 Oct 2012, 10:20:24 UTC - in response to Message 44132.  
Last modified: 18 Oct 2012, 10:26:08 UTC

Another host that is failing some/many ATI SAH_v7 results (55791). It is exiting after finding 30 autocorrs in every result.


This host is using driver 11.11, which is buggy.



In addition:

Driver version: CAL 1.4.1607 (VM)
Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (793.1)


As I said before, the app was built with APP SDK 2.6. Here we see an APP SDK 2.5 driver.
Drivers below 11.12 may or may not work. This is a case where it does not.

When I tried to build under APP SDK 2.7, the drivers I had on my PC at that time refused to work with the app. So there is no backward compatibility.
If we establish (Eric, perhaps only you can give us such statistics) that the share of pre-SDK 2.6 hosts in the SETI@home host pool is high enough, I can try to make a build based on SDK 2.5. Another way is to use driver version restrictions.

FYI:
Driver Conformance

AMD APP SDK v2.6       AMD Catalyst™ 11.12 (8.92)
AMD APP SDK v2.5       AMD Catalyst™ 11.7 (8.872)
AMD APP SDK v2.4       ATI Catalyst™ 11.4 Update Driver (8.841)
AMD APP SDK v2.3       ATI Catalyst™ 10.12 (8.801)
ATI Stream SDK v2.2    ATI Catalyst™ 10.7 Update Driver for OpenCL™ 1.1 Support (8.753.1)
ATI Stream SDK v2.1    ATI Catalyst™ 10.4 (8.723)
ATI Stream SDK v2.01   ATI Catalyst™ 10.2 (8.701)
ATI Stream SDK v2.0    ATI Catalyst™ 9.12 (8.682)
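
A driver version restriction of the kind mentioned above could be sketched as follows. This is a minimal Python illustration, not the project's scheduler code. The table is copied from the conformance list above. The 11.12 minimum is an assumption that follows from the app being built with SDK 2.6, and the function names are hypothetical. The sketch also shows one pitfall: Catalyst versions are year.month pairs, so 11.7 is older than 11.12 even though it is larger as a decimal number.

```python
# SDK release -> matching Catalyst release, copied from the list above.
SDK_TO_CATALYST = {
    "2.6": "11.12",
    "2.5": "11.7",
    "2.4": "11.4",
    "2.3": "10.12",
    "2.2": "10.7",
    "2.1": "10.4",
    "2.01": "10.2",
    "2.0": "9.12",
}

# Assumed minimum: the Catalyst release paired with SDK 2.6.
MIN_CATALYST = SDK_TO_CATALYST["2.6"]

def catalyst_key(version):
    """Split 'year.month' into integers so that 11.7 sorts before 11.12."""
    year, month = version.split(".")
    return int(year), int(month)

def driver_allowed(version, minimum=MIN_CATALYST):
    """True if the Catalyst version is at least the minimum release."""
    return catalyst_key(version) >= catalyst_key(minimum)

for sdk, catalyst in SDK_TO_CATALYST.items():
    status = "allowed" if driver_allowed(catalyst) else "blocked"
    print(f"SDK {sdk}: Catalyst {catalyst} -> {status}")

# Comparing the versions as decimals gets the order wrong:
# 11.7 >= 11.12 is true numerically, although 11.7 is the older driver.
print(float("11.7") >= float("11.12"))
```

How the scheduler actually encodes the driver revision is not shown in this thread, so the year.month parsing here is only one plausible way to express the restriction.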
ID: 44133

Richard Haselgrove
Volunteer tester
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 44134 - Posted: 18 Oct 2012, 11:50:03 UTC

Since Raistmer has just asked for this, other people may find it useful too.

http://www.hal6000.com/seti/boinc_ati_gpu_cheat_sheet.htm
ID: 44134

Mike
Volunteer tester
Joined: 16 Jun 05
Posts: 2530
Credit: 1,074,556
RAC: 0
Germany
Message 44136 - Posted: 18 Oct 2012, 12:44:30 UTC


Additional note:

I would suggest not allowing SDK 2.5, since driver sets 11.7 to 11.11 had a lot of bugs, including the 100% CPU usage bug.

Mike

With each crime and every kindness we birth our future.
ID: 44136

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 44148 - Posted: 18 Oct 2012, 18:04:49 UTC - in response to Message 44136.  

There must be something wrong with how I specified the min driver revision... I'll look into it.
ID: 44148

Claggy
Volunteer tester
Joined: 29 May 06
Posts: 1037
Credit: 8,440,339
RAC: 0
United Kingdom
Message 44151 - Posted: 18 Oct 2012, 19:08:16 UTC - in response to Message 44148.  

There must be something wrong with how I specified the min driver revision... I'll look into it.

While you're looking into that, can you make a change so that if BOINC 7 doesn't detect an OpenCL device (either ATI or Nvidia), and so doesn't get AP 6.04 (opencl_ati_100) WUs, it also doesn't get 6.04 (ati_opencl_100) WUs,
as they won't work when OpenCL support is missing:

Astropulse V6 6.04 (opencl_ati_100) Computation Error

Claggy
ID: 44151

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 44152 - Posted: 18 Oct 2012, 19:10:34 UTC - in response to Message 44148.  

OK. I've fixed the minimum driver revision to 11.12. That may annoy people who don't want to update their drivers.
ID: 44152

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 44154 - Posted: 18 Oct 2012, 19:14:00 UTC - in response to Message 44151.  


While you're looking into that, can you make a change so that if BOINC 7 doesn't detect an OpenCL device (either ATI or Nvidia), and so doesn't get AP 6.04 (opencl_ati_100) WUs, it also doesn't get 6.04 (ati_opencl_100) WUs,
as they won't work when OpenCL support is missing:


I'll see what I can do.
ID: 44154

Claggy
Volunteer tester
Joined: 29 May 06
Posts: 1037
Credit: 8,440,339
RAC: 0
United Kingdom
Message 44155 - Posted: 18 Oct 2012, 19:18:20 UTC - in response to Message 44154.  


While you're looking into that, can you make a change so that if BOINC 7 doesn't detect an OpenCL device (either ATI or Nvidia), and so doesn't get AP 6.04 (opencl_ati_100) WUs, it also doesn't get 6.04 (ati_opencl_100) WUs,
as they won't work when OpenCL support is missing:


I'll see what I can do.

Thanks,

Claggy
ID: 44155

Eric J Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 15 Mar 05
Posts: 1547
Credit: 27,183,456
RAC: 0
United States
Message 44157 - Posted: 18 Oct 2012, 19:31:46 UTC - in response to Message 44155.  

Apparently, I can't at this point. Core client version is not a part of plan_class_spec.xml plan classes.
ID: 44157

Mike
Volunteer tester
Joined: 16 Jun 05
Posts: 2530
Credit: 1,074,556
RAC: 0
Germany
Message 44163 - Posted: 18 Oct 2012, 20:46:32 UTC - in response to Message 44152.  

OK. I've fixed the minimum driver revision to 11.12. That may annoy people who don't want to update their drivers.


Thanks, Eric.
I don't think many people still use such old drivers.


With each crime and every kindness we birth our future.
ID: 44163

Richard Haselgrove
Volunteer tester
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 44176 - Posted: 19 Oct 2012, 10:21:57 UTC

I'm a little bit worried about WU 4158699

It's not inconsistent - indeed, it's validated - but both hosts are on our list of hosts which are prone to overflow on autocorrelations, as in this case.

Preliminary checks suggest it shouldn't be an overflow task, but I'll run a full bench (including CPU) later. The danger is that valid science might be lost, as is sometimes the case with problematic CUDA hosts.
ID: 44176

Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 44177 - Posted: 19 Oct 2012, 11:07:01 UTC - in response to Message 44176.  

I'm a little bit worried about WU 4158699

It's not inconsistent - indeed, it's validated - but both hosts are on our list of hosts which are prone to overflow on autocorrelations, as in this case.

Preliminary checks suggest it shouldn't be an overflow task, but I'll run a full bench (including CPU) later. The danger is that valid science might be lost, as is sometimes the case with problematic CUDA hosts.


This driver version should not receive any ATi MB work in the first place.
SDK 2.5 ...
ID: 44177

Richard Haselgrove
Volunteer tester
Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 44182 - Posted: 19 Oct 2012, 14:13:10 UTC

Two more workunits have emerged: WUs 4158694 and 4159327. Both are cases where apparently-plausible results (subject to bench verification, of course) have been outvoted by two ATI hosts with autocorr overflows.

We've known since the beginning of GPU computing that there are problem hosts and problem cards which generate false overflows. But they fall into two classes:

Random thermal, overclocking, power supply, memory-corruption, or even cosmic ray events. These tend to introduce random errors into the result files too, and the bad results are usually weeded out by validation.

But there are also systemic overflow errors. I'm reminded of the early days of the NV 'Fermi' cards, when incompatible applications (608, 609, VLAR_KILL) caused whole swathes of tasks to overflow with identical errors. The problem with these systemic errors, like the current autocorrs, is that they validate each other - and fail to pass the valid signals we're here to find into the science database.

In the Fermi case, it was a relatively simple programming problem: NVidia themselves had been using an undocumented optimisation which wasn't forward-compatible into the new environment (see message 1068390, 'volatile' variable declarations).

We could, of course, adopt the brute-force approach, and exclude a class of GPU-enabled volunteers from participation in the project - but I'm wondering if we really have to do that? Other projects seem to manage without. I've posted a call for information at Einstein (message 119669) - it may even be that they have an undiagnosed problem with SDK (sic!) 2.5, and need to exclude certain drivers too! They were quick enough to exclude NV driver version 295.51 and 296.10 volunteers from their project because of the 'sleepy monitor' bug - they may need to do the same as we're contemplating here.
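
Why the two classes of error behave differently under validation can be shown with a toy model. This is a minimal Python sketch, assuming a simple majority vote over three results per workunit. The host names and result signatures (spike, autocorr, pulse counts) are made up for illustration. The project's actual validator is not shown in this thread and will differ.

```python
from collections import Counter

def canonical_result(results):
    """Return the result reported by a strict majority of hosts, or None."""
    tally = Counter(results.values())
    winner, votes = tally.most_common(1)[0]
    return winner if votes * 2 > len(results) else None

# Hypothetical signatures: (spike count, autocorr count, pulse count).
GOOD = (3, 0, 2)
OVERFLOW = (0, 30, 0)

# Random error: one host returns garbage, the two good hosts agree.
random_case = {"host_a": GOOD, "host_b": GOOD, "host_c": (1, 7, 0)}

# Systemic error: two hosts fail in the same way and outvote the good host.
systemic_case = {"host_a": GOOD, "host_b": OVERFLOW, "host_c": OVERFLOW}

print("random error   ->", canonical_result(random_case))
print("systemic error ->", canonical_result(systemic_case))
```

In the first case the bad result is weeded out. In the second, the identical overflows validate each other and the plausible result is discarded, which is the situation described for WUs 4158694 and 4159327.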
ID: 44182


 
©2021 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.