Large difference in SoG speed in Mac and Windows?
Author | Message |
---|---|
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
I finally put SoG on a Windows 10 computer with an ATI 7990, a card that has two 7970s on it running at 1000MHz. I had been fairly pleased with the SoG run times on my Mac Pro, which has D700 cards (it was getting around 53,000 RAC before the Greenback wu's nuked everyone's RAC). The D700s are the same as the 7990 except that they run at 850MHz, so depending on boost speeds and all of that they should be around 15-17% slower. However, run times are more than 40% slower on the Mac, despite its stronger CPU/memory subsystem. I made the settings the same on the two systems and it didn't really bring them any closer at all. Could this difference be due to the lack of CPU_Lock on the Mac? In the Mac's console logs I also see system diagnostic reports about "wakeups", on the order of 1200 per second for around 30 seconds. These happen every 5 minutes or so according to the logs. The Mac Pro was running 3 at a time until around 17:00 UTC, at which point I switched over to 2 at a time to match the Windows settings. We've discussed the Mac Pro a couple of times before because it has a high number of inconclusives, typically because it has missed a Gaussian. Unfortunately there aren't many other Macs running SoG, so I don't have any good comparison. There is another Mac on beta running a couple of SoG wu's a day, and one of its inconclusives misses a Gaussian too (https://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=8476208). That said, all the other Mac Pros on beta are only running a few a day, vs the 400-500 a day I'm pushing through, so it's statistically hard to compare the occurrences on the other machines. I'll be getting another Mac Pro with D500 cards in it next week, so I'll have a better collection of data once I have it up and running. That wasn't the point of this message, but I did want to make sure it was known we are talking about the same machine that has been discussed before. 
The two systems are here: Windows http://setiathome.berkeley.edu/results.php?hostid=7330085 Mac Pro http://setiathome.berkeley.edu/results.php?hostid=6105482 Just curious if anyone had any ideas why there would be this much difference between the two systems running basically the same cards, albeit at slightly different clock speeds. Thanks, Chris |
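For reference, the "15-17% slower" estimate can be reproduced with a quick sketch. It assumes run time scales inversely with GPU core clock, which is an idealization (memory clocks, boost behaviour, and driver differences are ignored), and the function name is made up for illustration.

```python
def expected_slowdown(fast_mhz: float, slow_mhz: float) -> float:
    """Fractional extra run time on the slower card, assuming time ~ 1/clock."""
    return fast_mhz / slow_mhz - 1.0

# Clock figures from the post: 7990 at 1000 MHz, D700 at 850 MHz
clock_deficit = 1.0 - 850 / 1000           # how much lower the D700 clock is
extra_time = expected_slowdown(1000, 850)  # extra run time expected from clock alone

print(f"clock deficit: {clock_deficit:.1%}")
print(f"expected extra run time: {extra_time:.1%}")
```

Under that assumption the clock gap alone accounts for somewhere between 15% and 18% of the difference, which is well short of the 40%+ being observed.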
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
What about the speed of other, non-SoG ATI apps on that Mac? News about SETI opt app releases: https://twitter.com/Raistmer |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
I haven't tried them on Main to get a good variety of angle ranges, but on beta the non-SoG apps have been slower than the SoG apps by 10-15%, so I expect moving to the non-SoG version will just be slower still. I'd need to verify that more rigorously before I say so definitively. The difference in AstroPulse run times between the computers correlates almost exactly with the difference in GPU clock speed. I'll see about running the non-SoG version of 8.10 for a bit this evening to see how it does. Chris |
Joined: 18 Jan 06 Posts: 1038 Credit: 18,734,730 RAC: 0 |
Chris, have you thought about using different optimization settings on OS X than on Windows? Apple might have altered the GPU driver in a way that they think works "best". _\|/_ U r s |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
I've played around a lot with the different settings on the Mac, much more so than on the PC. That said, I won't claim that I have found the very best settings on the Mac. I usually run the Mac with a higher SBS, for example. But frankly, aside from setting the oclfft local radix size to 16 and dropping the Pulsefind iterations down, nothing produces a very noticeable change in performance. Another quirk I still see: on Windows, when BOINC reports 100% complete, the run time stops. On the Mac, when it reaches 100% the wu will still run for another 20-30 seconds even though GPU and CPU utilization on that wu has stopped. But that's only 20-30 seconds, and I'm seeing around 300 seconds overall difference between the two machines running 2 at a time. I've been pretty swamped the last few days, but I'll try playing around with the settings some more. I'll have my new Mac Pro next week too, so I'll be able to see if the errors, inconclusives, and general slowness persist as well. Thanks, Chris |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
As for the idle wake ups (2000+ per second), this same SoG app does not cause them on my laptop, which has an AMD 6750M. Maybe it's because it is much slower, not sure. I'm not even sure they are related to the slow processing by the D700s. I will say, looking at all the D700s on main, that they all seem to be running slower than expected at stock. They are barely faster than the D500s running stock, and in most cases no faster at all. Considering the D500s have only 24 CUs compared to the 32 in the D700, I would expect to see somewhat better numbers. Of course I have no idea what those folks are doing with their computers, so until I have mine with a D500 next week I can't really do a fair comparison. Any suggestions on which variables to change? Astropulse seems to respond well to changing the oclfft plan class, and my MB times go down with it as well. I've tried a couple of options for Tune, but there is an insignificant difference. The SpikeFind threshold doesn't seem to matter a whole lot whether it is there or not. I'm not really sure what the oclfft max local fft size does and which memory on the card it's trying to fit into best, i.e. register memory, local memory, global, etc. I did change the number of local memory banks to 32 from the recommended 64 because the Tahiti-based chips only have 32 per AMD documentation. I can't say there was any noticeable difference from changing that, though. Lastly, I'm not sure what, if anything, the coalesced widths do for me. The AMD chips don't support coalesced reads, I think; it might be writes, I don't remember off the top of my head which their documentation said. Anyway, I'll fiddle around with some of the settings this evening and see if there is any noticeable difference. Thanks, Chris |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
Some quick testing of different versions. The following settings were used and are the only ones that seem to make a difference in run times on the Mac: `-tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -period_iterations_num 5`. Mac: AR .39, 1 wu, 3430 SoG: 549s; AR .39, 1 wu, 3430 non-SoG: 653s; AR .39, 1 wu, 3347 Tbar build: 590s; AR .39, 2 wu, 3430 SoG: 975s; AR .39, 2 wu, 3430 non-SoG: 1055s. Windows: AR .38, 2 wu, 3430 SoG: 675s. It should also be noted that only build 3430, both SoG and non-SoG, causes the idle wake ups to be in the thousands per second. Tbar's build (the only "stock" type build I readily had available) causes only minimal wake ups, 200-300 per second. Again, I have no idea if that has anything to do with the slow run times or not. Thanks, Chris |
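The timings above can be turned into an effective time per work unit. This is only a sketch using the numbers as posted: it assumes the "2 wu" figures are the wall time of each task while two run concurrently, and it ignores the small angle-range difference (AR .39 on the Mac, AR .38 on Windows). The labels and the helper name are made up for illustration.

```python
# Run times in seconds as posted: (tasks run concurrently, wall time per task)
runs = {
    "Mac 3430 SoG x1": (1, 549),
    "Mac 3430 non-SoG x1": (1, 653),
    "Mac 3347 Tbar x1": (1, 590),
    "Mac 3430 SoG x2": (2, 975),
    "Mac 3430 non-SoG x2": (2, 1055),
    "Windows 3430 SoG x2": (2, 675),
}

def seconds_per_task(concurrent: int, wall_seconds: float) -> float:
    """Effective time per work unit when `concurrent` tasks share the GPU."""
    return wall_seconds / concurrent

effective = {name: seconds_per_task(n, t) for name, (n, t) in runs.items()}
for name, secs in effective.items():
    print(f"{name:22s} {secs:7.1f} s per task")

# Mac vs Windows, both running 2 at a time with the 3430 SoG build
ratio = effective["Mac 3430 SoG x2"] / effective["Windows 3430 SoG x2"]
print(f"Mac takes {ratio - 1:.0%} longer than Windows")
```

On these figures the Mac needs roughly 44% more time per task than the Windows box when both run two at a time, compared with the 15-18% that the clock difference alone would suggest.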
Joined: 18 Jan 06 Posts: 1038 Credit: 18,734,730 RAC: 0 |
Do you run with BOINC's default priority settings, or have you set <no_priority_change>1</no_priority_change> in "cc_config.xml"? _\|/_ U r s |
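For anyone who has not set that option before, here is a minimal sketch of what the file could look like, generated from Python so the nesting is explicit. It assumes the option sits inside the `<options>` section of `cc_config.xml`; the helper name is made up, and the sketch only prints the text rather than touching an existing config file.

```python
import xml.etree.ElementTree as ET

def build_cc_config(no_priority_change: bool = True) -> str:
    """Return cc_config.xml text with <no_priority_change> under <options>."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "no_priority_change")
    flag.text = "1" if no_priority_change else "0"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")

print(build_cc_config())
```

If a `cc_config.xml` already exists, the element should be merged into its existing `<options>` block by hand rather than overwritten, and the client has to re-read its config files or be restarted before the change takes effect.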
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
I've never fiddled with that, so it's probably default. I'll make that change and take a look. Thanks, Chris |
Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0 |
I was compiling a couple of CUDA apps in Mountain Lion and decided to see how the SoG compile would go. The App seems to work about the same as my older non-SoG App, with the exception of the progress switching every few seconds. This can be annoying if you have the tasks sorted by progress. Is this normal? It can switch from just a few percent to more than double, triple, or higher right up until it's finished. The amount of the switch seems to vary as to the actual percentage completed. I'm also seeing the console message: 5/5/16 6:57:55.000 AM kernel[0]: process MBv8_8.08r3452_a[6300] caught causing excessive wakeups. Observed wakeups rate (per sec): 570; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 45053. All of my machines have been using the <no_priority_change> option for years. If anyone wants to test this app they are more than welcome. I'm going to put the nVidia cards back in the machine. |
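To compare wakeup rates between builds without reading the console by eye, the kernel message can be parsed. This is a rough sketch that assumes the message always has exactly the wording of the line quoted above; the pattern and the helper name are illustrative, not anything Apple documents.

```python
import re

LINE = ("5/5/16 6:57:55.000 AM kernel[0]: process MBv8_8.08r3452_a[6300] caught "
        "causing excessive wakeups. Observed wakeups rate (per sec): 570; "
        "Maximum permitted wakeups rate (per sec): 150; Observation period: "
        "300 seconds; Task lifetime number of wakeups: 45053")

PATTERN = re.compile(
    r"process (?P<name>\S+)\[(?P<pid>\d+)\] caught causing excessive wakeups\. "
    r"Observed wakeups rate \(per sec\): (?P<observed>\d+); "
    r"Maximum permitted wakeups rate \(per sec\): (?P<limit>\d+); "
    r"Observation period: (?P<period>\d+) seconds; "
    r"Task lifetime number of wakeups: (?P<lifetime>\d+)"
)

def parse_wakeups(line: str):
    """Return the fields of an 'excessive wakeups' kernel message, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the process name is numeric
    return {k: (v if k == "name" else int(v)) for k, v in fields.items()}

info = parse_wakeups(LINE)
over_limit = info["observed"] / info["limit"]
print(info["name"], f"{over_limit:.1f}x the permitted rate")
```

For the message quoted here, the observed rate of 570 per second is 3.8 times the permitted 150 per second; feeding in lines from different builds would give a like-for-like comparison.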
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
Yup, the progress bounces around on mine as well. I think there has been some discussion on here about it on Windows at least, and there it manifests itself as an erratic jump in percent complete, vs 10% to 45% and back to 10% over and over. I think it basically was because the GPU is not reporting anything back to the CPU on a regular basis like it was before. For what it's worth, changing the priority setting didn't change my run times in a perceptible way. Those are the same console messages I'm getting as well, except mine are on the order of 2000-3000 per second. I've seen it as high as 17,000 as well. Chris |
Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0 |
Hmmm, I thought the progress 'anomaly' had been solved. I guess I haven't been paying attention. Hey, those last couple of tasks with the sbs set at 256 look interesting; normal times on the 6850 were around 23-24 minutes for that AR. Just over 20 minutes could hint at an improvement, http://setiathome.berkeley.edu/result.php?resultid=4910016299. Too bad I don't have a second Mac I could place those old ATI cards in; I have a few lying around. |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
My D700 actually seems to like an SBS value of 1536; luckily each card has 6GB, so running 3 at a time is doable. My computer with D500s shows up at work Tuesday, so I'll be interested to see how it fares compared to my D700. It should be pretty comparable to your Cayman card: same CUs but clocked a hair less. That said, at this point I'm hoping there is something wrong with my D700s, as it will be easier to have them pulled and replaced than it will be to track down whatever the issue is that is slowing their processing down so much. That said, they work perfectly with Astropulse, actually a little faster, as there is only a 15% difference between the two machines. Did you put your app on CA? I thought the progress problem was resolved too, but then I figured it was just resolved on the Windows side. May be an "anomaly" in my thinking however.=) You can pick up a Mac Pro 4,1 for pretty cheap these days; some of the 5,1's aren't bad either. Thanks, Chris |
Joined: 30 Dec 13 Posts: 258 Credit: 12,340,341 RAC: 0 |
"Yup, the progress bounces around on mine as well. I think there has been some discussion on here about it on Windows at least, and there it manifests itself as an erratic jump in percent complete, vs 10% to 45% and back to 10% over and over." That was something I brought up with Raistmer when SoG first came out. Since the majority of the work was being done on the GPU, it wasn't moving any of the data or completion checks back to the CPU (if I am getting the terminology right). It was waiting until near completion, then it would move those results, resulting in CPU usage going from 40% to 100%. What it was doing was running at a set rate until 70% complete, then crawling until 100% completion. The time it took for the first 2/3 of the work was equal to the time for the last 1/3. He eventually fixed that. The other issue had to do with the number of work units exceeding the number of cores. They resolved that as well, fixing something with the cpu_lock and commands in 2 separate areas, one in the command line txt and one in the app_config.xml. Both were needed to specify the total number of work units per card and a whole number in total. Before that, what was happening was that work progressed to different points, then started over again; for example, it would say 40 or 60% complete, then drop down again to 0 or 10%. Eventually they said it was an issue with the work not being tied to a physical core. That is what was corrected in the later versions of SoG. Don't know if either of these is what you are seeing. |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
What is that "wake up" thing about? Also, for the listed completion times I don't see a difference between SoG and non-SoG. But the SoG times fluctuate too much. News about SETI opt app releases: https://twitter.com/Raistmer |
Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0 |
From what I can tell, excessive wake ups are an indication that the App is not responsive. The higher the number, the less responsive the App. Usually, excessive wakes happen prior to a hang. You might want to research it yourself, but that's how it appears to me. "Did you put your app on CA?" I was waiting to see if the tasks validated; it appears most have. I'm also a little reluctant to post something that has the progress skipping around so much. Maybe a different route than posting it. |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
"Yup, the progress bounces around on mine as well. I think there has been some discussion on here about it on Windows at least, and there it manifests itself as an erratic jump in percent complete, vs 10% to 45% and back to 10% over and over." May well be what's happening, since I don't believe there is a -cpu_lock implemented on the Mac side of things. Anyone have any idea how to check that possibility on the Mac? I don't see anything in Activity Monitor that shows which CPU a thread is running on. Thanks, Chris |
Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0 |
"My D700 actually seems to like an SBS value of 1536; luckily each card has 6GB, so running 3 at a time is doable. My computer with D500s shows up at work Tuesday, so I'll be interested to see how it fares compared to my D700. It should be pretty comparable to your Cayman card: same CUs but clocked a hair less. That said, at this point I'm hoping there is something wrong with my D700s, as it will be easier to have them pulled and replaced than it will be to track down whatever the issue is that is slowing their processing down so much. That said, they work perfectly with Astropulse, actually a little faster, as there is only a 15% difference between the two machines..." Is that a Beta OS? The highest I can find at Apple is 10.11.4. Seems strange that other machines don't have the Gaussian problem. Do you have an external drive or some other method to possibly run Darwin 15.4? |
Joined: 27 Aug 12 Posts: 56 Credit: 127,133 RAC: 0 |
Yes it is; however, it was doing it on 15.4 as well. I keep updating, hoping that there is a fix in their latest driver, but no luck so far. Is there a verbose setting that would show what the app is doing/finding during the Gaussian search, to try and figure out if it is even running at all? Are there any settings that affect the sensitivity/optimizations for the Gaussian searches? It's bizarre that neither card ever seems to find any... Thanks, Chris |
Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0 |
"Yes it is; however, it was doing it on 15.4 as well. I keep updating, hoping that there is a fix in their latest driver, but no luck so far. Is there a verbose setting that would show what the app is doing/finding during the Gaussian search, to try and figure out if it is even running at all? Are there any settings that affect the sensitivity/optimizations for the Gaussian searches? It's bizarre that neither card ever seems to find any..." Compare the counter values with your wingman's and report whether they differ in the "PoT transferred to CPU" part. News about SETI opt app releases: https://twitter.com/Raistmer |
©2023 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.