Problem running new GPU apps on BETA

Message boards : SETI@home Enhanced : Problem running new GPU apps on BETA
Questor
Volunteer tester

Joined: 18 Apr 07
Posts: 6
Credit: 33,296
RAC: 0
United Kingdom
Message 56142 - Posted: 21 Jan 2016, 14:15:19 UTC
Last modified: 21 Jan 2016, 14:32:00 UTC

I can't see any reports like this on the board but mention it in case anyone else is experiencing similar problems. Correction - I now see there is a thread which says "openCL is crashing my GPU", but that doesn't mention the PC locking up, and I am only running stock V8 CPU tasks on main. There is also a very old thread from last January relating to splitter problems causing display issues, but no hanging there either.

I am a bit late to the V8 BETA party but rejoined BETA on my two machines today.

One machine is running OK but my main PC is having problems.

I have selected both PCs to run GPU tasks only on BETA via prefs and am still running stock apps on SETI main.

The PC up until this point was stable and running SETI Main OK.

After running BETA tasks for a while the PC completely locked up - no keyboard/mouse control and the taskbar clock frozen.

I rebooted and the same thing happened a short while later.

I had app_config configured to run 2 GPU tasks simultaneously and have now set it to run only 1. This seems to have stopped the hanging, but now my display periodically blanks and Windows gives an error popup saying "Display driver stopped responding and has recovered". As far as I know I have never previously had this random display driver crash.
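
For reference, the change was just the gpu_usage value in app_config.xml - something along these lines (the app name below is only a placeholder; use whatever name your client_state.xml shows for the Beta app):

<app_config>
   <app>
      <!-- placeholder name; check client_state.xml for the real one -->
      <name>setiathome_v8</name>
      <gpu_versions>
         <!-- 1.0 = one task per GPU; 0.5 would allow two at once -->
         <gpu_usage>1.0</gpu_usage>
         <!-- fraction of a CPU core budgeted per GPU task -->
         <cpu_usage>0.25</cpu_usage>
      </gpu_versions>
   </app>
</app_config>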

The running GPU task at the time then shows as postponed.

BOINC event log shows e.g. :-

21/01/2016 08:58:03 | SETI@home Beta Test | task postponed 180.000000 sec: Cuda runtime, memory related failure, threadsafe temporary Exit

21/01/2016 09:04:12 | SETI@home Beta Test | task postponed 180.000000 sec: Cuda runtime, memory related failure, threadsafe temporary Exit

21/01/2016 09:09:38 | SETI@home Beta Test | task postponed 180.000000 sec: Cuda runtime, memory related failure, threadsafe temporary Exit

The tasks later recover and continue to run.

I am running Nvidia driver 361.43. I have no ATI card but do have on-CPU Intel graphics with drivers loaded, though I do not receive any tasks allocated to that device (perhaps there is no app for it yet). The card is not overclocked and its temperature is regulated via EVGA Precision X to about 64 deg C (it was about 72 deg C running 2 tasks).

Event log shows :-

21/01/2016 13:39:10 | | Starting BOINC client version 7.6.22 for windows_x86_64
21/01/2016 13:39:10 | | log flags: file_xfer, sched_ops, task, cpu_sched
21/01/2016 13:39:10 | | Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
21/01/2016 13:39:10 | | Data directory: C:\ProgramData\BOINC
21/01/2016 13:39:10 | | Running under account xxxx
21/01/2016 13:39:21 | | CUDA: NVIDIA GPU 0: GeForce GTX 570 (driver version 361.43, CUDA version 8.0, compute capability 2.0, 1280MB, 1179MB available, 1536 GFLOPS peak)
21/01/2016 13:39:21 | | OpenCL: NVIDIA GPU 0: GeForce GTX 570 (driver version 361.43, device version OpenCL 1.1 CUDA, 1280MB, 1179MB available, 1536 GFLOPS peak)
21/01/2016 13:39:21 | | OpenCL CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 4.2.0.148, device version OpenCL 1.2 (Build 148))
21/01/2016 13:39:22 | | Host name: Zzzzzzz
21/01/2016 13:39:22 | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz [Family 6 Model 60 Stepping 3]
21/01/2016 13:39:22 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx tm2 pbe fsgsbase bmi1 smep bmi2
21/01/2016 13:39:22 | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
21/01/2016 13:39:22 | | Memory: 15.89 GB physical, 39.82 GB virtual
21/01/2016 13:39:22 | | Disk: 195.21 GB total, 13.93 GB free
21/01/2016 13:39:22 | | Local time is UTC +0 hours
21/01/2016 13:39:22 | | VirtualBox version: 5.0.14
21/01/2016 13:39:22 | SETI@home | Found app_config.xml
21/01/2016 13:39:22 | SETI@home Beta Test | Found app_config.xml
21/01/2016 13:39:22 | | Config: simulate 8 CPUs
21/01/2016 13:39:22 | | Config: report completed tasks immediately
21/01/2016 13:39:22 | | Config: use all coprocessors

The computer in question is :-

http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=77829

and no tasks show as having actually errored.

Has anyone else experienced anything similar to this?

Thanks,

John.
ID: 56142
William
Volunteer tester

Joined: 14 Feb 13
Posts: 606
Credit: 588,843
RAC: 0
Message 56152 - Posted: 21 Jan 2016, 15:12:48 UTC
Last modified: 21 Jan 2016, 15:17:05 UTC

Can you find and link an affected task please?

Never mind, found one: http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=22228759

Is it limited to cuda32 tasks or are other variants affected too?

NB: your particular driver has been called into question elsewhere, so you might want to drop to an earlier version and see if that helps.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 56152
Richard Haselgrove
Volunteer tester

Joined: 3 Jan 07
Posts: 1451
Credit: 3,272,268
RAC: 0
United Kingdom
Message 56153 - Posted: 21 Jan 2016, 15:29:28 UTC - in response to Message 56142.  

Has anyone else experienced anything similar to this?

Most of the discussion has taken place in the News announcement thread - not good form, I know, but that's how it goes.

I see driver 361.43 mentioned as a possible problem in

Message 55773 (Jeff Buck)
Message 55865 (Ulrich Metzner)
Message 56038 (Rob Smith)
(and the discussions following on from each of those posts)

- and the NVidia OpenCL app seems to be affected too, not just CUDA. I'd suggest you revert back to 359.xx and try that, before blaming anything else.
ID: 56153
Questor
Volunteer tester

Joined: 18 Apr 07
Posts: 6
Credit: 33,296
RAC: 0
United Kingdom
Message 56155 - Posted: 21 Jan 2016, 16:56:58 UTC - in response to Message 56153.  

Has anyone else experienced anything similar to this?


I see driver 361.43 mentioned as a possible problem in

- and the NVidia OpenCL app seems to be affected too, not just CUDA. I'd suggest you revert back to 359.xx and try that, before blaming anything else.

Is it limited to cuda32 tasks or are other variants affected too?


Thanks Richard/William.

Apologies - I'd checked for errored tasks rather than checking completed results for errors, so I mistakenly assumed they were completing normally despite the driver crash. I see from the example that they weren't.

I have been running 2 tasks on the GPU on the machine which doesn't lock up, so I don't know why that one is not suffering - it's running the same 361.43 Nvidia driver, also on a GTX 570. I've searched the event log since BOINC was restarted at 12:23 and there have been no postpone events or machine lockups since then. If the problem only occurs with particular workunits then I guess I was just unlucky and got them all on one PC.

Just spotted one task postponing and it was CUDA32. The example you gave was also CUDA32. Checking further I've also found a CUDA50 but no OPEN_CL or CUDA42 - not to say there won't be any.

I will rollback the Nvidia driver on the problem PC to 359.06 and see how that goes.

John.
ID: 56155
Jeff Buck
Volunteer tester

Joined: 11 Dec 14
Posts: 96
Credit: 1,240,941
RAC: 0
United States
Message 56156 - Posted: 21 Jan 2016, 17:07:02 UTC - in response to Message 56153.  

Has anyone else experienced anything similar to this?

Most of the discussion has taken place in the News announcement thread - not good form, I know, but that's how it goes.

I see driver 361.43 mentioned as a possible problem in

Message 55773 (Jeff Buck)
Message 55865 (Ulrich Metzner)
Message 56038 (Rob Smith)
(and the discussions following on from each of those posts)

- and the NVidia OpenCL app seems to be affected too, not just CUDA. I'd suggest you revert back to 359.xx and try that, before blaming anything else.

I only have circumstantial evidence pointing to 361.43 being at the root of my lockup problems but, in addition to the case with OpenCL apps that I first reported, I also ran into a very similar situation when I was testing with Cuda. Running 1 or 2 per GPU worked fine for several days, but shortly after I bumped it up to 3 per GPU, the lockup happened again. About the only difference when I was running Cuda was that the machine didn't eventually reboot itself. I had to manually power it down and reboot.

At that point, I just reverted back to my previous 335.23 driver (since I didn't need a 350+ driver for Cuda) and the problem never occurred again.

So, for me, it appeared that once my GPUs got loaded to about 80% or more, either by a single instance of the OpenCL app, or 3 instances of Cuda, 361.43 runs into problems. As I said, just circumstantial evidence, but good enough for me to avoid 361.43.
ID: 56156
jason_gee
Volunteer tester

Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 56157 - Posted: 21 Jan 2016, 17:27:39 UTC
Last modified: 21 Jan 2016, 17:29:09 UTC

If you're not able to replicate a similar situation under 359-series or earlier drivers, that might suggest a few things about the 360+ drivers. It gets a little technical from there, but reducing the app's pulsefind settings as per the readme could conceivably reduce pressure on the driver stack, if the new drivers (with all those bundled services) introduced some kind of extra latency. If problems persist in further updates, I'll probably try on the 980 myself and see if I can figure out what they've changed.
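
(The pulsefind settings I mean live in the mbcuda.cfg file alongside the Cuda app. Sketching from memory of the readme here, so treat the key names and values below as illustrative only and check them against your own copy:)

[mbcuda]
; lower values shorten each pulsefind kernel launch, easing pressure on the driver
; key names and values from memory of the readme - verify against your install
pfblockspersm = 4
pfperiodsperlaunch = 50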

Mine's not the best reference comparison for a vanilla driver+OS install at the moment, because suspect Win7 updates and all the unused nVidia services are disabled here. One factor I won't be able to check yet is whether they broke past-gen or mixed-GPU setups while wedging in Cuda 7.5. Each of those has happened before, and been fixed by going through the registered developer portal. Usually a fairly precise description and step-by-step instructions to replicate can get things fixed fairly quickly, if it comes to that.
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz
ID: 56157
Rob Smith
Volunteer moderator
Volunteer tester

Joined: 21 Nov 12
Posts: 1015
Credit: 5,459,295
RAC: 0
United Kingdom
Message 56172 - Posted: 21 Jan 2016, 22:00:07 UTC

I've just rolled back to 359.06.


First thing I notice is that I've got the stagger back when running a couple of OpenCL tasks, one per GPU; the recent bout of CUDA (4.2 & 5) work was quite smooth in comparison.
I'll have to have a try at the tuning that Raistmer suggested the other day, but not tonight.... (I'm still at 100/256 - which is where I ran out of time last time)
ID: 56172
Questor
Volunteer tester

Joined: 18 Apr 07
Posts: 6
Credit: 33,296
RAC: 0
United Kingdom
Message 56187 - Posted: 21 Jan 2016, 23:28:04 UTC

I rolled back to 359.06 but that still has the GPU memory errors (didn't test for lockups).

Will try rolling back to another earlier version tomorrow to see if I can get back to a stable setup. Earliest 35x.yy version I have on disk is 352.86 so I'll give that a go.


John.
ID: 56187
Questor
Volunteer tester

Joined: 18 Apr 07
Posts: 6
Credit: 33,296
RAC: 0
United Kingdom
Message 56209 - Posted: 22 Jan 2016, 9:13:59 UTC
Last modified: 22 Jan 2016, 9:50:00 UTC

Well, an unanticipated event overnight.

Just after rolling back the driver to 352.86, SETI Main downloaded some GPU Apps and has been running tasks since then with no BETA workunits being processed.

There were also no postponed tasks overnight, so all MAIN GPU tasks ran without incident.

I just suspended all MAIN GPU tasks, allowing the BETAs to run, and immediately there was some display blanking and tasks being postponed.

At around the same time BETA downloaded a new version of the app - 8.01. The apps downloaded from MAIN say version 8.00 but they are identical to the 8.01 files downloaded from BETA. The file details show the MAIN release version and BETA 8.01 to be Lunatics version zi (the previous BETA 8.00 being Lunatics zh).

All my original tasks from BETA were allocated to run as 8.00 - newer tasks as 8.01. When I suspended the MAIN GPU tasks it was the BETA 8.00 which then started to run and had problems.

Question. Was there a significant change between version zh and zi to fix a problem relating to driver issues?

It seems odd that the BETA testing version was not the release candidate which went LIVE on MAIN last night/early this morning. Edit: I now see from a thread in the news forum that 8.01 has been running on BETA at some stage.

I haven't yet had a problem with any 8.01 tasks so will continue to run those only and try reapplying newer Nvidia drivers to see if I can repeat my problems of yesterday with the newer apps.

MAIN was already set up via app_config.xml to run 2 tasks on the GPU, so the 8.01 tasks have also been running without locking up my PC (but on an earlier Nvidia driver)!! I won't abort any of my 8.00 tasks yet so I can test back against them if 8.01 has no problems.

John.

P.S. Throughout this the second PC has continued to run everything without any problems. The problem PC is an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz [Family 6 Model 60 Stepping 3] http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=77829 and the one that is OK is an older Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Family 6 Model 26 Stepping 5] http://setiweb.ssl.berkeley.edu/beta/show_host_detail.php?hostid=77830

Do the GPU Apps run differently depending on the capabilities of the CPU which could account for the different behaviour?
ID: 56209
Questor
Volunteer tester

Joined: 18 Apr 07
Posts: 6
Credit: 33,296
RAC: 0
United Kingdom
Message 56211 - Posted: 22 Jan 2016, 10:41:31 UTC

Just playing with BETA again, and I noticed that with the zh app the CPU usage seems to be much higher than with zi (not an extensive analysis).

zh is about 5-8% of total CPU (out of the ~13% that one full thread represents) [8 CPU threads, so about 50% of a thread].

zi is about 1-3% (out of that same ~13%).
ID: 56211
Questor
Volunteer tester

Joined: 18 Apr 07
Posts: 6
Credit: 33,296
RAC: 0
United Kingdom
Message 56213 - Posted: 22 Jan 2016, 11:33:03 UTC

Update: zi apps running on MAIN - no display problems and the PC no longer locking up, using the latest Nvidia driver 361.43 and running 2 tasks on the GPU.
ID: 56213
