Developing a Multi-Threaded Benchmarking App for Linux
Author | Message |
---|---|
Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785 |
"I'm currently running the r3711 SSE41 app against the AVX2 app since that one wasn't included in Rick's set of default apps for some reason. Also some anomalous behavior in the number of GPU instances that can be invoked for some reason." I was just able to reproduce the problem. It happens when there are fewer GPU jobs than GPUs. Working on it... |
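The thread doesn't say what caused the failure, so the following is only a hypothetical sketch of how a launcher could hand out work units so that having fewer GPU jobs than GPUs is handled naturally. The function name `assign_jobs_to_gpus`, the job names, and the GPU ids are assumptions for illustration and are not taken from Rick's benchmark code.

```python
def assign_jobs_to_gpus(jobs, gpu_ids):
    """Distribute jobs round-robin over the available GPUs.

    Returns a dict mapping gpu_id -> list of jobs. GPUs that receive
    no work are simply absent from the result, so a caller never
    tries to start an instance on an idle GPU.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")

    assignment = {}
    for index, job in enumerate(jobs):
        gpu = gpu_ids[index % len(gpu_ids)]
        assignment.setdefault(gpu, []).append(job)
    return assignment


# One job, two GPUs: only GPU 0 gets an entry.
print(assign_jobs_to_gpus(["wu_a"], [0, 1]))
```

Building the mapping from the jobs rather than from the GPUs means the "fewer jobs than GPUs" case needs no special handling.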
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Sent you a PM with the file contents you requested. I'll have to try the new beta now that I understand better how to use it. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785 |
Here is my comparison of all of the Linux CPU apps that I have. I ran 7 Arecibo and 8 GBT WUs twice each on each system, using only 30 threads per system. The 2990WX had SMT disabled and the 1950X had it enabled. The MB, cooling solution, memory, BIOS, and OS are all the same between the 2 systems. BIOS settings are also the same, with the exception of manual Vcore and CPU core ratio. LLC is -L2 on the 2990WX and -L1 on the 1950X. [Image: benchmark results] Based on these results, the r3711_sse41 app is the fastest, though the 2 newer apps have a noticeable reduction in Similarity. Not sure whether that difference is significant, though. GitHub: Ricks-Lab Instagram: ricks_labs |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the updated benchmark runs with the All-in-One package standard app, Rick. That is what I too have always found: the r3711 SSE41 app has always been the fastest on my Ryzens. I even found it faster on my i7-6850K system, though Juan found the Intel AVX2 app the fastest on his i7-6850K host. I'll give the new benchmark executable a run again. I would like to get some reference runs on my Ryzen 2700X host so that I will have them later for comparison against the upcoming Threadripper 2920X system. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I even found it faster on my i7-6850K system, though Juan found the Intel AVX2 app the fastest on his i7-6850K host." I redid the test now that I'm running without hyperthreading (4 GPU + 2 CPU only and NO -nobs), and both apps give me very similar numbers: 33-36 min to crunch a BLC11 WU. I will leave it running with SSE41 for some more time, waiting for new types of WU, to be sure. As expected, with this configuration my CPU temps drop to <43C and CPU usage is in the 50-60% range. Good for keeping the host cool even with an external temp of +36C. I use a TT Water 3.0 cooler. |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Theoretically, any non-AVX code should be fastest on Intel CPUs, since most motherboards impose an AVX penalty: the BIOS applies an AVX offset that reduces the clock frequency because of the high load and heat produced when Intel CPUs run AVX instructions. Unless you set a very low offset (and I found it can't be set to zero or completely disabled in the BIOS), the CPU will run an AVX instruction or app at a much lower clock frequency than the set CPU clock frequency. So if you choose an app that doesn't run AVX or AVX2 instructions, you will keep your CPU clocks at their max set value at all times. The exception is when the performance benefit of running AVX instructions at reduced clocks still outweighs the non-AVX app. |
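To put rough numbers on the trade-off described in this post, here is a minimal sketch of the arithmetic, assuming the AVX offset is subtracted from the core ratio and the base clock is the nominal 100 MHz. The ratio of 40 and offset of 3 are made-up example values, not settings reported by anyone in the thread.

```python
def effective_clock_mhz(core_ratio, avx_offset, bclk_mhz=100.0):
    """Clock while AVX code runs: (core ratio - AVX offset) x base clock."""
    if avx_offset >= core_ratio:
        raise ValueError("AVX offset must be smaller than the core ratio")
    return (core_ratio - avx_offset) * bclk_mhz


def break_even_speedup(core_ratio, avx_offset):
    """Per-clock speedup the AVX app needs just to match the non-AVX app."""
    if avx_offset >= core_ratio:
        raise ValueError("AVX offset must be smaller than the core ratio")
    return core_ratio / (core_ratio - avx_offset)


# Hypothetical example: 40x ratio with an AVX offset of 3.
print(effective_clock_mhz(40, 3))            # 3700.0 MHz under AVX load
print(round(break_even_speedup(40, 3), 3))   # AVX must be this much faster per clock
```

If the AVX app's per-clock gain is below the break-even figure, the non-AVX app should finish first on such a host.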
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
"Theoretically, any non-AVX code should be fastest on Intel CPUs, since most motherboards impose an AVX penalty: the BIOS applies an AVX offset that reduces the clock frequency because of the high load and heat produced when Intel CPUs run AVX instructions. Unless you set a very low offset (and I found it can't be set to zero or completely disabled in the BIOS), the CPU will run an AVX instruction or app at a much lower clock frequency than the set CPU clock frequency." Not all Intel CPUs act this way. The AVX turbo penalty only showed up with the Haswell line and beyond (socket 2011-v3, E5 v3+ Xeons). Earlier socket 2011 Sandy Bridge EP and Ivy Bridge EP chips (think v1 and v2 E5 Xeons) didn't see this behavior. Theoretically AVX and AVX2 should be faster, as they are capable of processing larger data chunks per cycle, but I guess it's up to the app to actually do that? Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"Theoretically AVX and AVX2 should be faster, as they are capable of processing larger data chunks per cycle, but I guess it's up to the app to actually do that?" Yes, though modern Intel CPUs can execute AVX-512 instructions, very few apps are actually able to use them. Our AVX apps predate AVX-512, so they can't use it. Our code branch is ancient, so even with our developers using modern compilers with modern CPU flags, the code base can't really utilize the newer AVX instructions to their maximum potential. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
"Theoretically AVX and AVX2 should be faster, as they are capable of processing larger data chunks per cycle, but I guess it's up to the app to actually do that?" Keith, you're confusing AVX/AVX2 and AVX-512. AVX and AVX2 only have 256-bit registers. AVX2 came with FMA (technically a separate FMA3 extension that Intel introduced alongside it on Haswell). AVX-512 expands that to 512 bits. AVX-512 is relatively new and first showed up, I think, in the Xeon Phi co-processor cards. It's on some of the more recent HEDT chips (Skylake and beyond) and the big Xeon server chips. SSE4, I believe, is 128-bit. |
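The register widths mentioned in this post translate directly into how many values one instruction can process. A small sketch of that arithmetic, where the `lanes` helper is a name made up for illustration:

```python
# Register widths in bits, as stated in the post above.
REGISTER_BITS = {"SSE4.1": 128, "AVX/AVX2": 256, "AVX-512": 512}


def lanes(register_bits, element_bits=32):
    """Number of elements of the given size that fit in one SIMD register."""
    return register_bits // element_bits


for name, bits in REGISTER_BITS.items():
    print(f"{name}: {lanes(bits)} x float32 or {lanes(bits, 64)} x float64 per register")
```

These are upper bounds per instruction. Whether an app reaches them depends on how its inner loops are written, which is the point being argued in the thread.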
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"Theoretically AVX and AVX2 should be faster, as they are capable of processing larger data chunks per cycle, but I guess it's up to the app to actually do that?" No, I am not confusing the different AVX instructions or the register widths. The point I was trying to make is that AVX apps can only be used efficiently on Intel hardware, since Intel has had full-width 256-bit AVX execution for a long time. AMD CPUs have so far been stuck with 128-bit execution units and have to split each 256-bit AVX instruction into two 128-bit operations, which is inefficient. That is only going to change with the new Zen 2 based Ryzen CPUs next year, which will get the usual 256-bit width for AVX. So SSE41, with its 128-bit instructions, falls right into the current AMD CPU wheelhouse. |
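A deliberately simplified model of the argument in this post: when the execution datapath is narrower than the instruction, the instruction is broken into several internal operations, so the work done per operation does not grow. The function names are hypothetical, and the model ignores everything else that affects real throughput (number of execution ports, decode limits, memory bandwidth).

```python
import math


def micro_ops(instruction_bits, datapath_bits):
    """Internal operations needed to execute one SIMD instruction."""
    return math.ceil(instruction_bits / datapath_bits)


def floats_per_micro_op(instruction_bits, datapath_bits, element_bits=32):
    """float32 values processed per internal operation."""
    elements = instruction_bits // element_bits
    return elements / micro_ops(instruction_bits, datapath_bits)


# 128-bit datapath: SSE and AVX both come out at 4 floats per operation.
print(floats_per_micro_op(128, 128), floats_per_micro_op(256, 128))
# 256-bit datapath: AVX doubles the work per operation.
print(floats_per_micro_op(256, 256))
```

Under this model a 256-bit AVX build gains nothing per operation on a 128-bit datapath, which is consistent with the SSE41 app holding its own on the Ryzen and Threadripper hosts discussed here.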
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Yeah, I guess I didn't see the relevance of mentioning AVX-512, since there isn't an app for it and very, very few people here even have hardware capable of running it anyway. I was strictly comparing SSE4 to AVX with the 128 vs 256-bit register differences. But my initial comment about AVX was in reference to the Intel chips specifically. In theory they should be faster, but I guess either the app isn't coded in a way to utilize it to its potential, or the type of computations we do on the SETI WUs can't use the larger registers. |
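For anyone unsure which of these instruction sets a given Linux host supports, the kernel lists them on the `flags` line of `/proc/cpuinfo` on x86 systems. A minimal sketch of reading that line follows; the helper name `simd_support` is made up, while the flag spellings (`sse4_1`, `avx`, `avx2`, `fma`, `avx512f`) are the ones the kernel uses.

```python
def simd_support(cpuinfo_text):
    """Report which SIMD extensions appear on the first 'flags' line
    of /proc/cpuinfo-style text (x86 Linux)."""
    wanted = {
        "SSE4.1": "sse4_1",
        "AVX": "avx",
        "AVX2": "avx2",
        "FMA": "fma",
        "AVX-512F": "avx512f",
    }
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like "flags : fpu vme ... avx2 ..."
            flags = set(line.split(":", 1)[1].split())
            break
    return {name: flag in flags for name, flag in wanted.items()}


# Typical use on a Linux host:
#     with open("/proc/cpuinfo") as handle:
#         print(simd_support(handle.read()))
```

The function takes the file contents as a string so it can be tried on saved output from another machine as well.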
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"But my initial comment about AVX was in reference to the Intel chips specifically. In theory they should be faster, but I guess either the app isn't coded in a way to utilize it to its potential, or the type of computations we do on the SETI WUs can't use the larger registers." Which goes back to my original comment about the app codebase, which Joe Segur et al. worked on back in the 2000s. I'm sure it was cutting edge at the time, but hardware has moved on, and the codebase is still from that period. No developers have added anything new to it since then. So we are stuck with rudimentary AVX code for all our AVX apps. Nothing you can do with a compiler can change that. The base code has to be updated to make better and more efficient use of AVX2 and AVX-512. That ain't happening until we get some new code writers. |
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Well, if I understand correctly, unless SETI can use FMA (fused multiply-add, i.e. multiply-accumulate), there probably won't be much improvement between AVX and AVX2. I thought it was all FP code, in which case AVX vs AVX2 would make minimal difference. But I get your point that it's all old code. Sounds like even the AVX app was more of a straight port from the SSE4 code. |
Joined: 25 Nov 01 Posts: 21681 Credit: 7,508,002 RAC: 20 |
Good work there, looks interesting... Can this sort of benchmarking also be used to test for host system bottlenecks such as contention for CPU execution units, CPU cache exhaustion/poisoning, memory bandwidth limits, or whatever other bottlenecks? Happy cool crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Joined: 17 Feb 01 Posts: 34486 Credit: 79,922,639 RAC: 80 |
"But my initial comment about AVX was in reference to the Intel chips specifically. In theory they should be faster, but I guess either the app isn't coded in a way to utilize it to its potential, or the type of computations we do on the SETI WUs can't use the larger registers." No, no, no. The last code change by Joe Segur was in 2015. That's just 3 years ago. Each Windows application has been hand-optimized. Also don't forget these changes were made for Windows, and not all of them have the same effect on Linux. Different compilers are just one reason. For AVX2 there simply is no data that would benefit from it. With each crime and every kindness we birth our future. |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Sure, Rick has been using it extensively to determine the optimal number of CPU tasks to run on his mega-CPU cruncher with 64 threads. Any CPU benchmark can be used to find bottlenecks when you change a single variable at a time. But other common benchmarks run only a single CPU core or all CPU cores, with no choice in between. His benchmark can closely mimic our actual CPU and GPU loading with our specific CPU and GPU apps, replicating actual crunching conditions on the host. |
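The "change one variable" approach described in this post can be sketched as a sweep over the number of concurrent tasks. This is not Rick's tool. It is a minimal illustration in which `run_task` is a hypothetical callable standing in for one work unit (in practice it might launch the science app with `subprocess.run`), and all function names are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(run_task, num_tasks, workers):
    """Run num_tasks calls of run_task with `workers` in flight at once
    and return completed tasks per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so any exception in a task is raised here.
        list(pool.map(lambda _: run_task(), range(num_tasks)))
    elapsed = time.perf_counter() - start
    return num_tasks / elapsed


def sweep_worker_counts(run_task, num_tasks, worker_counts):
    """Measure throughput for each concurrency level, changing nothing else."""
    return {w: measure_throughput(run_task, num_tasks, w) for w in worker_counts}


def best_worker_count(results):
    """Concurrency level with the highest measured throughput."""
    return max(results, key=results.get)
```

Threads are enough here as long as each task spends its time waiting on an external process. A throughput curve that flattens or falls as the worker count rises points to a shared bottleneck such as cache or memory bandwidth.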
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"No, no, no." OK, so where do we go to look up the app commit changes for all the apps, for those of us who don't have perfect eidetic memory? |
Joined: 17 Feb 01 Posts: 34486 Credit: 79,922,639 RAC: 80 |
"No, no, no." That's in the development section of Lunatics. Only a few of us Lunatics have access to it. |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"No, no, no." OK, but without knowledge of a hidden developer area that most of us can't access, all anyone can do about the development history of the apps is guess from the release dates and any docs accompanying the apps. My guesstimate was off by a decade. |
Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Way off topic, but does anyone know if Joe was from NY? |