Cuda 100, 90, 91 - OH MY

Questions and Answers : Unix/Linux : Cuda 100, 90, 91 - OH MY

Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1974601 - Posted: 10 Jan 2019, 22:35:27 UTC
Last modified: 10 Jan 2019, 22:41:33 UTC

Have a fairly new machine that I am setting up to run the "Special Apps". The CPU is overclocked to 4.5 GHz.

After installing the repository versions (apt install boinc-client boinc-manager, plus apt install nvidia-cuda-toolkit) and NVIDIA driver version 415.25, and running it for a few weeks, I am trying to download the "Special App" everyone talks about.

I loosely followed this guide https://setiathome.berkeley.edu/forum_thread.php?id=83274#1952052 for CUDA 90 and downloaded TBar's all-in-one listed here https://setiathome.berkeley.edu/forum_thread.php?id=83274&postid=1961607#1961607

Besides the -nobs parameter, does anyone have any suggestions on fine-tuning the app? Tom M mentioned CUDA 91, but when I look at the log file it lists CUDA 10.0 as loaded.

======================================================================================================
Event Log

Thu 10 Jan 2019 04:17:54 PM CST | | Starting BOINC client version 7.8.3 for x86_64-pc-linux-gnu
Thu 10 Jan 2019 04:17:54 PM CST | | log flags: file_xfer, sched_ops, task, sched_op_debug
Thu 10 Jan 2019 04:17:54 PM CST | | Libraries: libcurl/7.58.0 OpenSSL/1.0.2n zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Thu 10 Jan 2019 04:17:54 PM CST | | Data directory: /home/justaguy/BOINC
Thu 10 Jan 2019 04:17:54 PM CST | | CUDA: NVIDIA GPU 0: GeForce GTX 1080 (driver version 415.25, CUDA version 10.0, compute capability 6.1, 4096MB, 3980MB available, 9070 GFLOPS peak)
Thu 10 Jan 2019 04:17:54 PM CST | | OpenCL: NVIDIA GPU 0: GeForce GTX 1080 (driver version 415.25, device version OpenCL 1.2 CUDA, 8116MB, 3980MB available, 9070 GFLOPS peak)
Thu 10 Jan 2019 04:17:54 PM CST | SETI@home | Found app_info.xml; using anonymous platform
Thu 10 Jan 2019 04:17:54 PM CST | | Host name: beeblebrox
Thu 10 Jan 2019 04:17:54 PM CST | | Processor: 16 GenuineIntel Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz [Family 6 Model 158 Stepping 12]
Thu 10 Jan 2019 04:17:54 PM CST | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d arch_capabilities
Thu 10 Jan 2019 04:17:54 PM CST | | OS: Linux Ubuntu: Ubuntu 18.04.1 LTS [4.15.0-43-generic]
Thu 10 Jan 2019 04:17:54 PM CST | | Memory: 15.60 GB physical, 980.00 MB virtual
Thu 10 Jan 2019 04:17:54 PM CST | | Disk: 253.59 GB total, 222.38 GB free
Thu 10 Jan 2019 04:17:54 PM CST | | Local time is UTC -6 hours
Thu 10 Jan 2019 04:17:54 PM CST | | Config: use all coprocessors
Thu 10 Jan 2019 04:17:54 PM CST | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 8643244; resource share 100
ID: 1974601
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1974607 - Posted: 10 Jan 2019, 23:13:39 UTC
Last modified: 10 Jan 2019, 23:23:15 UTC

Just curious - where do I specify -nobs? Is the line below just a new line in the app_info.xml file?
<cmdline>-nobs</cmdline>

Found it in README_x41p_V0.97.txt in the project directory's DOCS folder. I added this as the very last line in the XML file. The README says:

"If you wish to use 100% CPU per task, add the command -nobs to the app_info.xml.
<cmdline>-nobs</cmdline>
There isn't any requirement to use a full CPU per task, but it may be a few seconds faster."
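In case it helps anyone else following along, here is a rough sketch of where that line sits in app_info.xml. The surrounding layout is the usual BOINC app_info structure, but the app name, version number, plan class, and file name below are illustrative placeholders, not copied from TBar's package - match them to what your file already contains:

```xml
<app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <plan_class>cuda90</plan_class>
    <!-- the -nobs switch goes in the cmdline element of the GPU app_version -->
    <cmdline>-nobs</cmdline>
    <coproc>
        <type>CUDA</type>
        <count>1</count>
    </coproc>
    <file_ref>
        <!-- hypothetical executable name; use the one in your own app_info -->
        <file_name>setiathome_x41p_V0.97_x86_64-pc-linux-gnu_cuda90</file_name>
        <main_program/>
    </file_ref>
</app_version>
```

BOINC only re-reads app_info.xml on client restart, so restart the client after editing it.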
ID: 1974607
Profile Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 4692
Credit: 275,639,431
RAC: 204,885
Message 1974705 - Posted: 11 Jan 2019, 7:44:15 UTC - in response to Message 1974607.  

Just curious - where do I specify -nobs or this line below? is it just a new line in the app_info.xml file?
<cmdline>-nobs</cmdline>

Found it in README_x41p_V0.97.txt in the project directory DOCS folder, I added this to the very last line in the XML file.

If you wish to use 100% CPU per task, add the command -nobs to the app_info.xml.
<cmdline>-nobs</cmdline>
There isn't any requirement to use a Full CPU per task, but, it may be a few seconds faster.


If you are crowding your CPUs, the GPUs will get starved without the -nobs parameter. This mostly shouldn't be an issue on an Intel running 90% of its CPUs, or on the 2950WX and lower Ryzen models running at 90% of the CPU.

On the 2990WX, if you try to run much more than 26 threads it starts starving the GPUs without the -nobs parameter. Yes, the 32c/64t 2990WX will starve the GPUs if you run too many CPU threads. And yes, CPU crunching gets slower above 26 CPU threads.

HTH,
Tom
A proud member of the OFA (Old Farts Assoc).
Former member of the YFA (Young Farts Assoc.)
ID: 1974705
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11484
Credit: 1,159,385,486
RAC: 860,072
United States
Message 1974712 - Posted: 11 Jan 2019, 8:22:13 UTC

The special app itself is a CUDA 9.1 app. Your drivers are CUDA 10. The CUDA 10 drivers support the older CUDA library that the special app was built with, so there are no issues with either the app or the drivers, and you are not losing much if any performance. There are also CUDA 9.2 and CUDA 10 versions of the special app, but again, there are no performance differences to speak of between the versions. So you can continue to use the version from your All-in-One package with no regrets.

If you have a CPU thread to spare, you can speed up your GPU tasks by setting the -nobs parameter, either in the app_info.xml <cmdline>-nobs</cmdline> or by putting <cmdline>-nobs</cmdline> into an app_config.xml file. Either location will work.

From your observed run_times, I would guess you are overcommitted on your CPU threads. A 1080 should be running the special app a LOT faster than 3 minutes.

My 1080 is doing the BLC34 tasks in 80 seconds with the -nobs parameter set. https://setiathome.berkeley.edu/result.php?resultid=7323343388
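For what it's worth, a minimal app_config.xml sketch of the second option. The app_name and plan_class values here are illustrative guesses - check the <app_name> and <plan_class> entries in your own app_info.xml and use those exact names:

```xml
<app_config>
    <app_version>
        <app_name>setiathome_v8</app_name>
        <plan_class>cuda90</plan_class>
        <cmdline>-nobs</cmdline>
    </app_version>
</app_config>
```

Save it as app_config.xml in the projects/setiathome.berkeley.edu directory; it is read at client startup or when you re-read config files from the Manager.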
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1974712
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1974765 - Posted: 11 Jan 2019, 16:54:56 UTC - in response to Message 1974712.  

Thanks for the advice, Tom and Keith. How do I avoid overcrowding the CPU? I am running an i9-9900K 8-core with a GTX 1080.

I did put the -nobs option at the very end of the app_info.xml file. Should I put it in the ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt file instead? I noticed that it contains -unroll 14 among other options.

Just to note that when looking at the properties of the finished WUs, they state (cuda90) and 3-minute runtimes. The GPU on the special app is less busy, bouncing around from 15% to 80% utilization according to conky and the nvidia-smi -l option. These WUs seem less busy on the GPU compared to the default app in the repositories, or it could possibly be more efficient, I'm not sure?

Can you provide me a link to the CUDA 91 special app?

I will be building another machine in the next few weeks with hopefully somewhere between 4-8 GPUs, and would like to fine-tune this one as practice for the new one.
ID: 1974765
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1974793 - Posted: 11 Jan 2019, 20:08:53 UTC
Last modified: 11 Jan 2019, 20:14:16 UTC

I found these two apps for download, Linux_Pascal+0.97b2_Special and Linux_MultiGPU-v0.97b1_Special - the app_info file states cuda92. Should I be using these instead for the 1080 because of its Pascal architecture?

Also, I found a Lunatics app_config file that lists

<gpu_usage>0.5</gpu_usage>
<cpu_usage>.04</cpu_usage>

instead of the app_info's

<avg_ncpus>0.1</avg_ncpus>
<max_ncpus>0.1</max_ncpus>

I'm not sure of the differences between each set, whether I should use the app_config instead of the app_info, or whether I need them both.
ID: 1974793
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1974796 - Posted: 11 Jan 2019, 20:59:14 UTC - in response to Message 1974793.  

I think I'm on the right track now. Below are the steps I took to get things running better (I think) - at least the GPU is pegged at 100% and the percentages during processing climb tons faster. I will let it run and see how it goes for the next few days.
First I followed Tom M's guide for TBar's all-in-one setup. Then, to use the newer programs, I simply dropped the Linux_Pascal+0.97b2_Special version files into the SETI project folder (BOINC > projects > setiathome.berkeley.edu), overwriting the existing files. Don't forget to make a backup of the project folder just in case it blows up. By doing this I went from 3 to 4.5 minutes per work unit to roughly 1.5 minutes per work unit.

Tom M's guide from here: https://setiathome.berkeley.edu/forum_thread.php?id=83274&postid=1952639#1952639
Special CUDA files from here: https://setiathome.berkeley.edu/forum_thread.php?id=81271&postid=1969844#1969844

Thank you to Tom, TBar and those who take the time to assist and make the apps better :)
ID: 1974796
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11484
Credit: 1,159,385,486
RAC: 860,072
United States
Message 1974877 - Posted: 12 Jan 2019, 6:39:15 UTC

Looks like you got things running OK. To answer your question, the -nobs parameter goes into a command line statement in either app_info or app_config; you've accomplished that, it seems.

The CUDA apps are not related to Lunatics in any fashion, nor are they related to the OpenCL AP application. That ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt file contains the tuning parameters for the AP OpenCL app and doesn't affect the CUDA special app in any way. The default parameters that TBar provides for AP can be tuned more aggressively for a 1080: the -unroll can be increased to 20. The unroll usually works best when it matches the number of SM units of the GPU, which for the 1080 is 20. Read the AP readme for more information on a more appropriate tuning line for a high-end card like the 1080.

The special app uses all of the resources of the Nvidia card at once for a single task, so you should only run one task per card. Don't worry about the gpu_usage parameters in the app_info; they don't have anything to do with the running of the GPU task. Only the application itself determines its needs. You should just leave the default parameters that TBar set in the app_info alone.

To prevent overcommitting the CPU by running too many CPU tasks, you can either reduce the number of CPU threads used by BOINC by setting a Local Preference in the Manager's Options menu, or use a <max_concurrent> or <project_max_concurrent> statement in an app_config. You should probably limit your total concurrent CPU and GPU tasks to around 10-12, or about 60% of the CPU in the Local Preferences. That way you won't starve the GPU tasks of CPU support; with the -nobs setting, each GPU task needs a full CPU thread to support it. Also, by reducing the total number of CPU tasks running you won't be overcommitted on the CPU, your run_times will more closely match your actual compute cpu_time, and you will finish your CPU tasks faster.

See https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration for the app_config and cc_config file parameters and usage.
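As a concrete (hypothetical) example of the app_config route, capping a host at 12 concurrent SETI tasks total would look like this - the value 12 is illustrative, tune it to your own core count:

```xml
<app_config>
    <!-- limit total running tasks (CPU + GPU) for this project -->
    <project_max_concurrent>12</project_max_concurrent>
</app_config>
```

Save it as app_config.xml in the projects/setiathome.berkeley.edu directory, then re-read config files from the Manager's Advanced menu.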
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1974877
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1975424 - Posted: 16 Jan 2019, 15:47:50 UTC - in response to Message 1974877.  

Yes, it is running well now, thanks. I did leave the CPU at 100% till today, then I just used the computing preferences to set it at 80%. Looks like that reduced it to 14 CPU threads running. It will also be nicer trying to work while it's crunching, too, lol.
ID: 1975424
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11484
Credit: 1,159,385,486
RAC: 860,072
United States
Message 1975469 - Posted: 16 Jan 2019, 20:38:07 UTC - in response to Message 1975424.  

That did the trick. Your cpu_times match your run_times on the CPU tasks.

You still have your AP tasks running at stock settings and are leaving a lot of performance in the tank. You should put this command line into the ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt file:

-unroll 20 -oclFFT_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 8 1 -tune 2 64 8 1

You could improve your RAC greatly with that tuning and the abundance of AP tasks over the last couple of weeks. Might as well get while the getting is good. AP is so rare these days that it would be a shame not to take the immediate improvement to your RAC bottom line.
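If you'd rather do it from a terminal, here is one hedged way to drop that line in while keeping a backup of the old tuning. The APDIR path is an assumption about where your BOINC data directory lives - adjust it to your own setup:

```shell
# Sketch: replace the AP OpenCL tuning line, backing up the old file first.
# APDIR is an assumption -- point it at your actual SETI project directory.
APDIR="${APDIR:-$HOME/BOINC/projects/setiathome.berkeley.edu}"
mkdir -p "$APDIR"
CMDFILE="$APDIR/ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt"

# keep a backup of the existing tuning line, if the file is already there
[ -f "$CMDFILE" ] && cp "$CMDFILE" "$CMDFILE.bak"

# write the new tuning line as the file's entire contents
printf '%s\n' '-unroll 20 -oclFFT_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 8 1 -tune 2 64 8 1' > "$CMDFILE"
cat "$CMDFILE"
```

Restart BOINC (or re-read config files) afterwards so the new tuning is picked up.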
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1975469
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1975596 - Posted: 17 Jan 2019, 16:38:48 UTC - in response to Message 1975469.  

The current line is
-sbs 256 -unroll 20 -oclFFT_plan 256 16 256 -ffa_block 2304 -ffa_block_fetch 1152

Do I remove the -sbs 256 portion?
ID: 1975596
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1975598 - Posted: 17 Jan 2019, 16:47:27 UTC - in response to Message 1975469.  
Last modified: 17 Jan 2019, 16:48:08 UTC

Wow, thanks Keith for the suggestion. That really seems to be running better from the few WU completion times I'm seeing now - an improvement from 1:45 to 1:15. I'll watch them for a bit to see if those times are consistent.
ID: 1975598
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11484
Credit: 1,159,385,486
RAC: 860,072
United States
Message 1975682 - Posted: 18 Jan 2019, 2:00:53 UTC - in response to Message 1975598.  

If you made the change to the AP file, it hasn't been read yet; you are still using the default AP tunings. You have to either restart BOINC or use the Manager to re-read configuration files from the Advanced menu. Your stderr.txt output is still showing a very small buffer size, and the workgroup size is still the same. If you use the command line I posted previously, your stderr.txt output would look like this:

DATA_CHUNK_UNROLL set to:20
oclFFT plan class overrides requested: global radix 256; local radix 16; max workgroup size 256
FFA thread block override value:16384
FFA thread fetchblock override value:8192
TUNE: kernel 1 now has workgroup size of (64,8,1)
TUNE: kernel 2 now has workgroup size of (64,8,1)
OpenCL platform detected: NVIDIA Corporation

Yours looks like this.

Maximum single buffer size set to:256MB
DATA_CHUNK_UNROLL set to:20
oclFFT plan class overrides requested: global radix 256; local radix 16; max workgroup size 256
FFA thread block override value:2304
FFA thread fetchblock override value:1152

You need to remove the -sbs 256 parameter and use the new FFA block override values to increase the amount of RAM the task uses on the card. Unless you are running two AP tasks per card - in that case you could leave the FFA block override alone, or increase it a bit more. Either way, still remove the -sbs 256 parameter.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1975682
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1975805 - Posted: 18 Jan 2019, 18:52:25 UTC - in response to Message 1975682.  

Thanks Keith, I did make the changes as you suggested in the ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt file, then shut down and restarted both the app and the machine.

I did leave out the -sbs 256 at the beginning of the file.

The properties of the WUs show improvement from 1:45 to about 1:10 - some vary more or less in completion times. Maybe I am not looking at them correctly?
ID: 1975805
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1975806 - Posted: 18 Jan 2019, 19:00:07 UTC - in response to Message 1975682.  

I did find
app_info.xml
and
ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt
both in the BOINC root directory and in the BOINC > projects > setiathome.berkeley.edu folder, so I just made the changes in both sets of files in both locations.

We'll see if that helps, as I am not sure which location it reads the files from.
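In case it helps anyone else, here is how I listed every copy of the file (the BOINCDIR default is my setup's data directory - adjust it if yours differs). My understanding is that the copy inside projects/setiathome.berkeley.edu/ is the one the AP app actually reads:

```shell
# List every copy of the AP command-line file under the BOINC data directory.
# BOINCDIR is an assumption -- point it at your actual data directory.
BOINCDIR="${BOINCDIR:-$HOME/BOINC}"
find "$BOINCDIR" -name 'ap_cmdline_*.txt' -print 2>/dev/null
```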
ID: 1975806
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11484
Credit: 1,159,385,486
RAC: 860,072
United States
Message 1975809 - Posted: 18 Jan 2019, 19:20:51 UTC - in response to Message 1975805.  

I think you may be confusing Multiband tasks with Astropulse tasks. I don't know of any AP task that can finish in 100 seconds or less, unless it is an early overflow or 100% radar blanking. Normally an AP task runs around 12 minutes.

The command line I gave you was for AP tasks, not MB. The tuning line goes into the ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt file. You only need to put it into one of two places: either that file, or the <cmdline></cmdline> statement in app_info.xml in the AstroPulse GPU section.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1975809
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1975825 - Posted: 18 Jan 2019, 22:12:06 UTC - in response to Message 1975809.  

When looking at the client, I click on the Tasks tab, highlight the completed CPU + NVIDIA GPU task, and click Properties - it states elapsed time. That's what I was looking at.
ID: 1975825
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 11484
Credit: 1,159,385,486
RAC: 860,072
United States
Message 1975840 - Posted: 18 Jan 2019, 23:04:56 UTC - in response to Message 1975825.  

There are two types of tasks distributed by the project: Multiband and Astropulse. An MB task's name begins with the calendar date if it comes from Arecibo Observatory, or with "blc" if it comes from Green Bank Observatory. AP tasks only come from observations at Arecibo, and their names always begin with "ap". Examples:

Arecibo MB task - 14ja19ac.23414.268313.12.39.17

Arecibo AP task - ap_10ja19ac_B3_P0_00111_20190111_10320.wu

Greenbank MB task - blc12_2bit_guppi_58406_32047_HIP21036_0118.27213.409.21.44.194.vlar

Unless you are looking at a finished task in your Manager whose name begins with "ap", it is a Multiband (MB) task, and the times you have referenced would be appropriate. You haven't provided any times for one of your finished AP tasks. These are available at the website under your account in the Tasks view: open the menu item labeled AstroPulse v7 to see the AP tasks you have finished or that are still in your host's cache to be crunched. The ones in Valid or Validation Pending will show your computation times, like this one: https://setiathome.berkeley.edu/workunit.php?wuid=3314373950
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1975840
Profile J3P-0
Joined: 1 Dec 11
Posts: 45
Credit: 25,106,289
RAC: 82,655
United States
Message 1978963 - Posted: 6 Feb 2019, 20:04:58 UTC - in response to Message 1975840.  

Thanks for explaining the differences.
ID: 1978963

©2020 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.