Problems running MB v7.00 CUDA apps (x41zc) - report here

William
Volunteer tester
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 45281 - Posted: 18 Mar 2013, 12:25:28 UTC

This thread is for reporting and troubleshooting problems with the v7.00 CUDA apps.
ID: 45281
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 45342 - Posted: 24 Mar 2013, 0:32:43 UTC - in response to Message 45281.  

Is that a tumbleweed rolling past?
ID: 45342
arkayn
Volunteer tester
Joined: 16 Jan 07
Posts: 155
Credit: 194,400
RAC: 0
United States
Message 45343 - Posted: 24 Mar 2013, 7:35:56 UTC

I can also hear the crickets chirping.
ID: 45343
AndrewM
Volunteer tester

Joined: 19 Apr 13
Posts: 3
Credit: 328,489
RAC: 0
Australia
Message 45762 - Posted: 9 May 2013, 4:42:17 UTC

Do inconclusives count as problems?
Workunit 524941
My 680 (cuda50) missed the 5 spikes that my wingman's 660 Ti (cuda32) found.
ID: 45762
Richard Haselgrove
Volunteer tester

Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 45763 - Posted: 9 May 2013, 8:07:30 UTC

Might want to check WU 5254770.

cuda32 (GTX 670) missed an autocorr that Urs found with Linux/CPU v7.01
ID: 45763
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 45765 - Posted: 9 May 2013, 11:59:32 UTC - in response to Message 45762.  

Do inconclusives count as problems?
Workunit 524941
My 680 (cuda50) missed the 5 spikes that my wingman's 660 Ti (cuda32) found.


They 'can' do, though in this example your wingman looks rather suspect, so it's unlikely to be an application-code issue. If a pattern like that shows up when comparing many reliable hosts, then it'd be worth looking deeper.
ID: 45765
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 45766 - Posted: 9 May 2013, 12:14:09 UTC - in response to Message 45763.  
Last modified: 9 May 2013, 12:43:03 UTC

Might want to check WU 5254770.

cuda32 (GTX 670) missed an autocorr that Urs found with Linux/CPU v7.01


Might be worth keeping a copy aside in case it turns out to be something, or for later development or investigation. The autocorrs (ACs) are pretty tight because the logic is simple, apart from the known issues with applying absolute thresholds to results built with differing compiler technologies etc.

Comparing cross-platform CPU against GPU, we used to see an inconclusive rate of around 5% from the Cuda-on-Windows end (on V6, excluding overflows). My self-imposed target for V7 is fewer than 0.5% inconclusive, excluding overflows (one tenth of the non-overflowing resends). [ around 367 KiB x 5/1000 = ~1.8 KiB bandwidth overhead per task, instead of ~18 KiB ]
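
The bandwidth figures in brackets can be checked with a few lines of Python. This is only a rough sketch: it assumes a task download of about 367 KiB and one full resend per inconclusive result, and the function name is made up for illustration.

```python
# Rough check of the resend bandwidth estimate.
# Assumptions: ~367 KiB per task, one full resend per inconclusive result.
TASK_SIZE_KIB = 367.0


def resend_overhead_kib(inconclusive_rate, task_size_kib=TASK_SIZE_KIB):
    """Expected extra download per task, averaged over all tasks."""
    return task_size_kib * inconclusive_rate


v6_overhead = resend_overhead_kib(0.05)   # ~5% inconclusive rate seen on V6
v7_overhead = resend_overhead_kib(0.005)  # 0.5% target for V7

print(f"V6 observed: {v6_overhead:.3f} KiB per task")
print(f"V7 target:   {v7_overhead:.3f} KiB per task")
```

Cutting the inconclusive rate by a factor of ten cuts the averaged resend overhead by the same factor, which is the point of the target.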

If the Linux side sees many more than that against other devices or platforms, then maybe something needs looking at there, or here. That 0.5% rate includes the known limitations of the Cuda Gaussfitting implementation.
ID: 45766
Richard Haselgrove
Volunteer tester

Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 45817 - Posted: 13 May 2013, 17:15:03 UTC

Noting WU 5285224 (another inconclusive on autocorrs), so we can find it again after the tiebreaker validates it and see what happened.
ID: 45817
Urs Echternacht
Volunteer tester
Joined: 18 Jan 06
Posts: 1038
Credit: 18,734,730
RAC: 0
Germany
Message 45828 - Posted: 13 May 2013, 21:29:35 UTC - in response to Message 45763.  
Last modified: 13 May 2013, 21:32:04 UTC

Might want to check WU 5254770.

cuda32 (GTX 670) missed an autocorr that Urs found with Linux/CPU v7.01

Note, so that the right app can be chosen if necessary: that's the 32-bit Linux app on a 64-bit OS.
_\|/_
U r s
ID: 45828
William
Volunteer tester
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 45845 - Posted: 14 May 2013, 7:23:47 UTC - in response to Message 45828.  
Last modified: 14 May 2013, 7:26:41 UTC

Might want to check WU 5254770.

cuda32 (GTX 670) missed an autocorr that Urs found with Linux/CPU v7.01

Note, so that the right app can be chosen if necessary: that's the 32-bit Linux app on a 64-bit OS.

Urs, if you have a spare moment, could you run a bench on that WU with both apps, please?
The results from Windows CPU and x41zc are

<autocorr_thresh>17.7999992</autocorr_thresh>

while best AC in x41zc is
<best_autocorr>
<peak_power>17.267876412045</peak_power>

Reference Best AC is:
<best_autocorr>
<peak_power>17.267883616732</peak_power>


Which isn't really that close to threshold, so we may want to keep an eye on the Linux stock app.
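
To put numbers on "not that close", here is a small sketch using only the values quoted above. The helper names are invented for illustration, and relative difference is just one convenient way to compare the two figures, not necessarily the criterion the validator applies.

```python
AUTOCORR_THRESH = 17.7999992      # <autocorr_thresh> from the results
X41ZC_PEAK = 17.267876412045      # best autocorr peak_power, x41zc
REFERENCE_PEAK = 17.267883616732  # best autocorr peak_power, reference


def margin_below_threshold(peak, thresh):
    """Fraction of the threshold by which the peak falls short."""
    return (thresh - peak) / thresh


def relative_difference(a, b):
    """Symmetric relative difference between two positive values."""
    return abs(a - b) / max(abs(a), abs(b))


margin = margin_below_threshold(REFERENCE_PEAK, AUTOCORR_THRESH)
rel_diff = relative_difference(X41ZC_PEAK, REFERENCE_PEAK)

print(f"reference peak is {margin:.2%} below threshold")
print(f"x41zc vs reference relative difference: {rel_diff:.2e}")
```

The best autocorr sits roughly 3% under the threshold, while the two apps differ from each other by well under one part in a million, so rounding differences between builds alone should not push this signal over the line.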
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 45845
Urs Echternacht
Volunteer tester
Joined: 18 Jan 06
Posts: 1038
Credit: 18,734,730
RAC: 0
Germany
Message 45866 - Posted: 15 May 2013, 8:53:37 UTC - in response to Message 45845.  

Might want to check WU 5254770.

cuda32 (GTX 670) missed an autocorr that Urs found with Linux/CPU v7.01

Note, so that the right app can be chosen if necessary: that's the 32-bit Linux app on a 64-bit OS.

Urs, if you have a spare moment, could you run a bench on that WU with both apps, please?
The results from Windows CPU and x41zc are

<autocorr_thresh>17.7999992</autocorr_thresh>

while best AC in x41zc is
<best_autocorr>
<peak_power>17.267876412045</peak_power>

Reference Best AC is:
<best_autocorr>
<peak_power>17.267883616732</peak_power>



Which isn't really that close to threshold, so we may want to keep an eye on the Linux stock app.

Did that rerun last night on my i7 host:
<autocorr_thresh>17.7999992</autocorr_thresh>

64bit Linux Best AC has :
<best_autocorr>
<peak_power>17.267879486084</peak_power>

32bit Linux Best AC has :
<best_autocorr>
<peak_power>17.26788080417</peak_power>


Looks fine.
Maybe checking that on the original host would show a problem on different hardware (-> Core 2)?
_\|/_
U r s
ID: 45866
William
Volunteer tester
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 45867 - Posted: 15 May 2013, 9:05:44 UTC - in response to Message 45866.  


Looks fine.
Maybe checking that on the original host would show a problem on different hardware (-> Core 2)?

Thanks a lot, so no basic problem there.
The original host is yours, 16343, isn't it? That's why we asked.
It would be good if you could run a bench on that one as well, to see if there's a problem or just a hiccup.
We've been seeing a few hosts that occasionally come up with extra autocorrs on the Windows side; maybe Linux is affected, too (whatever it is).
Unfortunately unless such extra autocorrs turn up on an Alpha host, we have no easy means to get a rerun for debugging.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 45867
Urs Echternacht
Volunteer tester
Joined: 18 Jan 06
Posts: 1038
Credit: 18,734,730
RAC: 0
Germany
Message 45895 - Posted: 16 May 2013, 12:45:08 UTC - in response to Message 45867.  


Looks fine.
Maybe checking that on the original host would show a problem on different hardware (-> Core 2)?

Thanks a lot, so no basic problem there.
The original host is yours, 16343, isn't it? That's why we asked.
It would be good if you could run a bench on that one as well, to see if there's a problem or just a hiccup.
We've been seeing a few hosts that occasionally come up with extra autocorrs on the Windows side; maybe Linux is affected, too (whatever it is).
Unfortunately unless such extra autocorrs turn up on an Alpha host, we have no easy means to get a rerun for debugging.

Did the same WU again on 16343 in standalone mode, but could not reproduce the result of the live Beta run, so that "extra" autocorr seems to be a glitch for the moment.

stock 64bit Linux app :
<best_autocorr>
<peak_power>17.267881393433</peak_power>

stock 32bit Linux app :
<best_autocorr>
<peak_power>17.267882090053</peak_power>


I'll repeat once more and keep the host under load this time.
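
For anyone who wants to line up the figures posted so far, a short sketch that pulls peak_power out of the fragments quoted in this thread and reports the spread. The labels are illustrative, and a regular expression is assumed to be good enough here because the fragments as posted are not complete XML.

```python
import re

AUTOCORR_THRESH = 17.7999992  # <autocorr_thresh> as posted

# Result fragments as posted in this thread; the labels are illustrative.
FRAGMENTS = {
    "x41zc": "<best_autocorr>\n<peak_power>17.267876412045</peak_power>",
    "reference": "<best_autocorr>\n<peak_power>17.267883616732</peak_power>",
    "i7, 64-bit Linux": "<best_autocorr>\n<peak_power>17.267879486084</peak_power>",
    "i7, 32-bit Linux": "<best_autocorr>\n<peak_power>17.26788080417</peak_power>",
    "16343, 64-bit Linux": "<best_autocorr>\n<peak_power>17.267881393433</peak_power>",
    "16343, 32-bit Linux": "<best_autocorr>\n<peak_power>17.267882090053</peak_power>",
}

PEAK_RE = re.compile(r"<peak_power>\s*([0-9.]+)\s*</peak_power>")


def best_autocorr_peak(fragment):
    """Extract the first peak_power value from a result fragment."""
    match = PEAK_RE.search(fragment)
    if match is None:
        raise ValueError("no <peak_power> element found")
    return float(match.group(1))


peaks = {label: best_autocorr_peak(text) for label, text in FRAGMENTS.items()}
spread = max(peaks.values()) - min(peaks.values())
all_below = all(peak < AUTOCORR_THRESH for peak in peaks.values())

for label, peak in sorted(peaks.items(), key=lambda item: item[1]):
    print(f"{label:22s} {peak:.12f}")
print(f"spread: {spread:.3e}, all below threshold: {all_below}")
```

All six reruns stay below the threshold and differ from each other only around the sixth decimal place, which fits the reading that the extra autocorr in the live run was a one-off.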
_\|/_
U r s
ID: 45895
Urs Echternacht
Volunteer tester
Joined: 18 Jan 06
Posts: 1038
Credit: 18,734,730
RAC: 0
Germany
Message 45943 - Posted: 18 May 2013, 1:37:40 UTC

The extra autocorr signal did not reproduce under load. So, anything more?
_\|/_
U r s
ID: 45943
William
Volunteer tester
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 46019 - Posted: 22 May 2013, 12:02:10 UTC - in response to Message 45943.  

The extra autocorr signal did not reproduce under load. So, anything more?

Don't think so - some glitch then.
But thanks for taking the time to look into it in detail.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 46019
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46035 - Posted: 23 May 2013, 14:21:15 UTC
Last modified: 23 May 2013, 14:40:08 UTC

Not sure what happened with the recent batch of Cuda 3.2 on my host. The errors show transfer problems for the DLLs (?). I looked, and these DLLs are present in the beta project folder. Maybe this happened while my ISP was doing upgrade work.

Could someone with a working stock-distributed Cuda 3.2 please post the MD5 checksums of the two Cuda DLLs? I'll check the sizes in my client state and run FileAlyzer's MD5 on the files present.

[Edit:] Hmmm, my sizes look different from those in client state. I'll reset the project & see if I get new files next time 3.2 is tried.

[Edit2:] Well, whatever originally happened to the DLLs, resetting the project seems to have fixed it. 3.2 is processing now.
ID: 46035
William
Volunteer tester
Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 46036 - Posted: 23 May 2013, 14:51:04 UTC

I take it I don't need to try and find out how to get MD5 checksums of files then?
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 46036
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46037 - Posted: 23 May 2013, 14:57:47 UTC - in response to Message 46036.  

I take it I don't need to try and find out how to get MD5 checksums of files then?


Nope, thanks! Seeing the incorrect file sizes turned out to be enough to decide they were broken. But for future reference, there are online MD5 calculators, or I just use FileAlyzer (from the Spybot Search & Destroy authors' website), which adds a right-click shell menu function.

Good to know everything came good with a project reset, and I only have to worry about putting my cfg settings back.
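
Another option, if Python happens to be installed, is a few lines using the standard library. This is a minimal sketch rather than a recommended tool; the function name is made up, and the paths are whatever files you pass on the command line.

```python
import hashlib
import os
import sys


def file_md5_and_size(path, chunk_size=1 << 20):
    """Return (md5_hex_digest, size_in_bytes) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


if __name__ == "__main__":
    # Usage: python md5check.py <file> [<file> ...]
    for file_path in sys.argv[1:]:
        md5_hex, size = file_md5_and_size(file_path)
        print(f"{md5_hex}  {size:>12d}  {file_path}")
```

Comparing the size first is cheap and, as happened here, is often enough to spot a broken download; the checksum covers the case where the size matches but the content does not.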
ID: 46037
Juha
Volunteer tester

Joined: 18 Jun 08
Posts: 76
Credit: 113,089
RAC: 0
Finland
Message 46217 - Posted: 4 Jun 2013, 20:49:05 UTC

So now that the new apps have been released, this must be a good time to start looking for any issues in them, right?

Workunit 5391995 has had two tries from CUDA hosts.

GTX 680, driver: 314.22, cuda42, and
GTX 660 Ti, driver: 314.22, cuda42

Both found

Spike count: 1
Autocorr count: 1
Pulse count: 4
Triplet count: 0
Gaussian count: 0

My copy is now ~93% done. So far the signal counts are the same. I can put my result online somewhere if there is any interest in it, and keep it suspended if anyone wants to grab a copy of the other results from the server.
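
A quick way to compare signal counts between results, sketched under the assumption that the counts have already been read off the task pages by hand. The dictionary layout and function name are illustrative only, and matching counts say nothing about whether the signals themselves agree.

```python
# Signal counts as reported above for both CUDA results.
SIGNAL_TYPES = ("spike", "autocorr", "pulse", "triplet", "gaussian")

gtx_680 = {"spike": 1, "autocorr": 1, "pulse": 4, "triplet": 0, "gaussian": 0}
gtx_660_ti = dict(gtx_680)  # identical counts were reported for both hosts


def count_mismatches(first, second):
    """Return {signal_type: (first_count, second_count)} where counts differ."""
    return {
        kind: (first.get(kind, 0), second.get(kind, 0))
        for kind in SIGNAL_TYPES
        if first.get(kind, 0) != second.get(kind, 0)
    }


print(count_mismatches(gtx_680, gtx_660_ti))  # empty dict: the counts agree
```

An empty result only means the totals line up; two results with the same counts can still be marked inconclusive if the reported signals differ.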
ID: 46217
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 46224 - Posted: 5 Jun 2013, 0:53:31 UTC - in response to Message 46217.  
Last modified: 5 Jun 2013, 0:56:26 UTC

So now that the new apps have been released, this must be a good time to start looking for any issues in them, right?

Workunit 5391995 has had two tries from CUDA hosts.


Haha, yep. That's an easy one because the wingman's host is a known troublemaker ;) Take a look at the impressive number of invalids he's clocked up.

http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=56392

Well OK, more burned APs than V7 MBs, but either way he's pushing the stability envelope there.
ID: 46224


 