Message boards : Number crunching : Question about the special app for linux
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728
Looking to improve the performance of the special app, I have a question, probably mainly for Petri. As far as I can see on my hosts, there is an interval of a couple of seconds from when a task completes until the next one is fully loaded and up and running. Would it be possible to reduce the wasted time by setting BOINC up to run two tasks at a time, and then using semaphores to let two instances of the special app synchronize between themselves, approximately in the following manner:

- Instance 1 starts, acquires the semaphore, loads its data and starts working.
- Instance 2 starts and does all possible initialization, but holds off on activities that load the GPU.
- Instance 1 completes its work on the GPU, signals instance 2 to start working (releases the semaphore), and then finishes up the tasks that do not load the GPU.
- Instance 2 acquires the semaphore and immediately starts working while instance 1 is finishing up.
- Instance 1 then starts its next task, does all possible initialization, and so on from step 2 above.

Could there be anything to gain from such an approach?
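To make the proposed handoff concrete, here is a minimal sketch built on a named POSIX semaphore. It is only an illustration of the idea above: the semaphore name, the three work-phase functions and the build line are placeholders, not anything taken from the special app.

```cpp
// Hypothetical sketch of the handoff described above; not the special app's code.
// Build (assumed): g++ -O2 handoff.cpp -o handoff -pthread
#include <semaphore.h>
#include <fcntl.h>
#include <cstdio>

// Placeholders for the real phases of a task.
void init_without_gpu()   { /* read the WU, allocate buffers, ... */ }
void gpu_work()           { /* the part that actually loads the GPU */ }
void finish_without_gpu() { /* write result files, clean up, ... */ }

int main() {
    // One named semaphore shared by every instance on the host (name is illustrative).
    sem_t *gpu_token = sem_open("/special_app_gpu", O_CREAT, 0600, 1);
    if (gpu_token == SEM_FAILED) { std::perror("sem_open"); return 1; }

    init_without_gpu();   // step 2: everything that does not touch the GPU
    sem_wait(gpu_token);  // wait until the other instance releases the GPU
    gpu_work();           // exclusive use of the GPU
    sem_post(gpu_token);  // hand the GPU over to the waiting instance...
    finish_without_gpu(); // ...while this instance finishes up off the GPU

    sem_close(gpu_token);
    return 0;
}
```

Each instance only blocks at sem_wait(); everything before and after it can overlap with the other instance's GPU work, which is exactly the overlap being asked about.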
rob smith · Joined: 7 Mar 03 · Posts: 22713 · Credit: 416,307,556 · RAC: 380
The special app is designed to use as much of the GPU's resources as possible when running a single task. The overhead of using semaphores in the manner you describe may well reduce performance. It is rumoured that Petri has come up with a few more wrinkles that should further improve performance.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640
Petri is working on a new version of the app with a similar goal of reducing some of the wasted time, though he may be using a different method. From what I remember he's claiming a 10-15% speed boost, but it's still in the testing phase.

One thing you can do now, if you aren't already, is add the -nobs command-line argument to your app_info file in the appropriate location. It will work your CPU harder, but you'll get maybe a 5% speedup.

Seti@Home classic workunits: 29,492
CPU time: 134,419 hours
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728
"The overhead of using semaphores in the manner you describe may well reduce performance."

I respectfully disagree. My suggestion is that the app would still run as a single task for as long as it has work to do on the GPU. The semaphore (or mutex) would only be acquired once, i.e. before GPU processing starts, and it would then be held for the duration of the GPU work.

But it is a real question whether there would be anything substantial to gain from such an approach.
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
"Looking to improve the performance of the special app, I have a question, probably mainly for Petri."

Hi Oddbjörnik,

In short: yes! Running one at a time and initializing one in the background makes sense. I'm glad you noticed that too. I like to let my GPU cool off for those seconds, but the super crunchers with their water-cooled units would (I guess) like to have that feature right now, or preferably yesterday. Those seconds could really make a difference, especially when running a long batch of shorties.

Implementing such a scheme is not so hard. I'd be happy to include it in the code if someone has time to experiment, develop and test. The source code is available, and I'd be glad if someone had the time to do so. The upcoming version has a much reduced memory footprint, so you will all be able to experiment. (You can try with the current code to set -unroll 1 and run 2 at a time. My machine was slow with that.)

Petri

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728
I've started fooling around with mutexes/semaphores/futexes on the Linux platform. I don't know whether anything useful will come out of it yet, but it's fun to play with.

I'm a bit surprised by the apparent lack of a simple, robust mutex on Linux. A lot of error handling and cleanup needs to be taken care of, while the only thing I'm really interested in is: do I own the mutex or don't I? I don't care whether the other task/previous holder died a natural death, quit in a controlled manner, or was shot in the head with kill -9. However, such a carefree mutex mechanism seems elusive on the Linux platform. Please correct me if I'm wrong, and point me in the right direction.
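The standard POSIX answer to this is the robust, process-shared mutex, which is exactly what the program in the next post uses. Stripped of the shared-memory plumbing, the core of the mechanism looks roughly like this (a sketch only, not code from the app):

```cpp
#include <pthread.h>
#include <cerrno>

// Create a mutex that survives its owner dying while holding it,
// and that can be shared between processes (it must then live in shared memory).
void init_robust(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}

// Lock it; if the previous owner died (crash, kill -9, ...) we still get the lock,
// we just have to mark the mutex consistent before eventually unlocking it.
void lock_robust(pthread_mutex_t *m) {
    int ret = pthread_mutex_lock(m);
    if (ret == EOWNERDEAD) {
        pthread_mutex_consistent(m);
    }
    // any other non-zero ret would need real error handling
}

int main() {
    pthread_mutex_t m;      // for cross-process use this would sit in shared memory
    init_robust(&m);
    lock_robust(&m);
    // ... critical section ...
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return 0;
}
```

EOWNERDEAD is what covers the kill -9 case: the next locker gets the mutex anyway and only has to call pthread_mutex_consistent() before unlocking it.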
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728
I think I have a (relatively) robust test / proof of concept up and running. The idea is that you can start as many instances of this program as you like, and they will hold the mutex lock one at a time. If you then kill one of the running instances, one of the others will inherit the mutex, do the cleanup and continue as if nothing happened. If anyone would care to take a closer look at it, here's the code:

```cpp
//============================================================================
// Name   : robust.cpp
// Author : Oddbjornik
//============================================================================

#include <string.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <iostream>

using namespace std;

pthread_mutex_t *robustMutex;
pthread_mutexattr_t robustAttr;

int main(int argc, char *argv[]) {
    int delay = 2;
    if (argc > 1) {
        delay = atoi(argv[1]);
    }

    printf("Robust interprocess mutex test program\n");

    int ret;
    if ((ret = pthread_mutexattr_init(&robustAttr)) != 0) {
        printf("pthread_mutexattr_init: %d - %s!\n", ret, strerror(ret));
        exit(9);
    }
    if ((ret = pthread_mutexattr_setrobust(&robustAttr, PTHREAD_MUTEX_ROBUST)) != 0) {
        printf("pthread_mutexattr_setrobust: %d - %s!\n", ret, strerror(ret));
        exit(9);
    }
    if ((ret = pthread_mutexattr_setpshared(&robustAttr, PTHREAD_PROCESS_SHARED)) != 0) {
        printf("pthread_mutexattr_setpshared: %d - %s!\n", ret, strerror(ret));
        exit(9);
    }

    // Mutex must be created in named shared memory.
    // Code partly stolen from https://stackoverflow.com/questions/4068974/initializing-a-pthread-mutex-in-shared-memory
    const char *shmName = "/obliMutex_1";
    int shm = shm_open(shmName, (O_CREAT | O_RDWR | O_EXCL), (S_IRUSR | S_IWUSR));
    if (shm == -1) {
        // We failed, so someone else probably already owns the mutex.
        if (errno == EEXIST) {
            // Yes, right, wait for that other task to properly initialize
            usleep(1000);
            // Then just open it
            shm = shm_open(shmName, O_RDWR, (S_IRUSR | S_IWUSR));
            if (shm == -1) {
                printf("shm_open(O_RDWR): %d - %s!\n", errno, strerror(errno));
                exit(9);
            }
            // And find the already working mutex in shared memory
            robustMutex = (pthread_mutex_t*)mmap(NULL, sizeof *robustMutex,
                                                 PROT_READ | PROT_WRITE, MAP_SHARED, shm, 0);
            // Check for memory mapping failure
            if (robustMutex == (pthread_mutex_t*)-1) {
                printf("mmap: %d - %s!\n", errno, strerror(errno));
                exit(9);
            }
        } else {
            // Some other error occurred
            printf("shm_open(O_CREAT | O_RDWR | O_EXCL): %d - %s!\n", errno, strerror(errno));
            exit(9);
        }
    } else {
        // We successfully created the shared memory. Now we must initialize the mutex inside it.
        // First allocate the necessary space
        if ((ret = ftruncate(shm, sizeof *robustMutex)) != 0) {
            printf("ftruncate: %d - %s!\n", errno, strerror(errno));
            exit(9);
        }
        // Then map the memory to our mutex pointer
        robustMutex = (pthread_mutex_t*)mmap(NULL, sizeof *robustMutex,
                                             PROT_READ | PROT_WRITE, MAP_SHARED, shm, 0);
        if (robustMutex == (pthread_mutex_t*)-1) {
            printf("mmap: %d - %s!\n", errno, strerror(errno));
            exit(9);
        }
        // And initialize the mutex
        if ((ret = pthread_mutex_init(robustMutex, &robustAttr)) != 0) {
            printf("pthread_mutex_init: %d - %s!\n", ret, strerror(ret));
            exit(9);
        }
    }

    // Mutex object has been obtained, one way or the other.
    // Now loop for days and see if anything fails
    int count = 0;
    while (true) {
        // Obtain the lock
        if ((ret = pthread_mutex_lock(robustMutex)) != 0) {
            if (ret == EOWNERDEAD) {
                printf("That one died in a bad way, or maybe it just died.\n");
                if ((ret = pthread_mutex_consistent(robustMutex)) != 0) {
                    printf("pthread_mutex_consistent: %d - %s!\n", ret, strerror(ret));
                    exit(9);
                }
                // No need to (re-)lock the mutex in here, since EOWNERDEAD means we got the lock,
                // we just have to clean it up before unlocking it.
            } else if (ret == ENOTRECOVERABLE) {
                printf("Not recoverable!\n");
                if ((ret = pthread_mutex_consistent(robustMutex)) != 0) {
                    printf("pthread_mutex_consistent: %d - %s!\n", ret, strerror(ret));
                    exit(9);
                }
                printf("We marked it consistent anyway!\n");
                if ((ret = pthread_mutex_lock(robustMutex)) != 0) {
                    printf("But relock failed with pthread_mutex_lock: %d - %s!\n", ret, strerror(ret));
                    exit(9);
                }
            } else {
                printf("pthread_mutex_lock: %d - %s!\n", ret, strerror(ret));
                exit(9);
            }
        }

        printf("Mutex loop count: %d\n", ++count);
        sleep(delay);
        pthread_mutex_unlock(robustMutex);
        usleep(1); // Yield to next process
    }

    return 0;
}
```
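For anyone who wants to try it: the program should build with something along the lines of g++ robust.cpp -o robust -pthread -lrt (exact flags may vary; newer glibc no longer needs -lrt for shm_open). Start a few instances with different delays, kill one of them with kill -9 while it holds the lock, and watch another instance pick up the lock and carry on.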
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
Hi Oddbjornik,

There is a global int variable gCUDADevPref that holds the -device num parameter value. Each GPU should be allowed to run one task at a time. See PM for additional details.

Petri

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
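One way to honour the one-task-per-GPU rule is to derive the lock's shared-memory or semaphore name from that device number, so that each GPU gets its own lock. A hypothetical sketch follows; gCUDADevPref is the global Petri mentions, while the stand-in definition and the name format are made up for illustration:

```cpp
#include <cstdio>

// Stand-in for the app's global that holds the -device N value (per Petri's post).
int gCUDADevPref = 0;

int main() {
    char name[64];
    // One lock name per GPU, so each device runs one task at a time.
    // The "/seti_gpu_lock_" prefix is invented for this sketch.
    std::snprintf(name, sizeof name, "/seti_gpu_lock_%d", gCUDADevPref);
    std::printf("would open shared memory / semaphore named %s\n", name);
    return 0;
}
```

The same name can then be fed to shm_open() or sem_open(), so tasks assigned to different GPUs never wait on each other.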
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728
"Each GPU should be allowed to run one task at a time."

That makes sense!
MarkJ · Joined: 17 Feb 08 · Posts: 1139 · Credit: 80,854,192 · RAC: 5
Going off topic here. I recently upgraded one machine from a GTX 1060 to a 1660 Ti, and took the opportunity to also move from the CUDA 8.0 app to the 10.1 one while I was at it. Below is the output from one of each. Should I be worried that the CUDA 10.1 app has decided to use -pfp 1 on the GTX 1660 Ti while the GTX 1060 decided to use -pfp 9? They both have autotune in the command line.

GTX 1660 Ti:

```
unroll limits: min = 1, max = 256. Using unroll autotune.
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1660 Ti, 5914 MiB, regsPerBlock 65536
     computeCap 7.5, multiProcs 24
     pciBusID = 9, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1660 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 1660 Ti
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special
```

GTX 1060:

```
unroll limits: min = 1, max = 256. Using unroll autotune.
setiathome_CUDA: Found 1 CUDA device(s):
  Device 1: GeForce GTX 1060 3GB, 3019 MiB, regsPerBlock 65536
     computeCap 6.1, multiProcs 9
     pciBusID = 9, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 1060 3GB is okay
SETI@home using CUDA accelerated device GeForce GTX 1060 3GB
Unroll autotune 9. Overriding Pulse find periods per launch. Parameter -pfp set to 9

setiathome v8 enhanced x41p_zi3v, Cuda 8.00 special
CUDA 8.0 Special version by petri33.
```

BOINC blog
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
No, the new autotune at -unroll 1 is faster in the new 0.98b1 app. You can prove it to yourself by running both unroll values in the benchmark.

Seti@Home classic workunits: 20,676
CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)