Astropulse error with AMD/ATI GPU?
Author | Message |
---|---|
Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462 |
I am running the optional video driver but I am not getting AP's for the GPU. On my CPU it runs "fine" (just a bit slow though). Might check the MB website for chipset drivers. I just reviewed my "all tasks" and I have 4 AP tasks aimed at my GPU. All have a "zero" time remaining. So I expect they will behave just like you have described. It's odd. Tom A proud member of the OFA (Old Farts Association). |
Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
"Reset the project on that computer." I did a reset, and I have one AMD AP7 task in the queue with a 0:00 estimated completion time, so I'm assuming that will error out. I will try the remove option in a few days. Seti@home classic: 1,456 results, 1.613 years CPU time |
Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
APs ran fine on my old HD7870, but never did on the RX470 I replaced it with. Something in the RX's is different that Astropulse really doesn't like, so they err within a minute. |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Did the AP task failures on RX cards ever get brought to Raistmer's attention? Just curious. Never have messed with ATI/AMD cards before. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
"Did the AP task failures on RX cards ever get brought to Raistmer's attention? Just curious. Never have messed with ATI/AMD cards before." I guess I have never "filed a complaint" with anyone about why they are not working. I could see how the new Ryzen APUs have kinks that need to be worked out. Is Raistmer the guy for this? For the little bit of attention I pay to these boards I know he does a lot of development for the project (at least with special builds), but I never assumed he is the one to fix it. Edit: If it matters, when I was waiting for NNT to run its course, I had MilkyWay@Home running as my backup project, and all GPU tasks for that project ended in error. I could see this being an error with my setup, but since I haven't tweaked much of anything OOB I'm not sure what needs fixing. Seti@home classic: 1,456 results, 1.613 years CPU time |
Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
I rechecked that. APs on my RX470 tend to run from start to finish, they'll just never validate as they'll always have zero spikes and a large percentage radar blanking. I did report that here. |
Phil Burden Joined: 26 Oct 00 Posts: 264 Credit: 22,303,899 RAC: 0 |
APs ran fine on my old HD7870, but never did on the RX470 I replaced it with. Something in the RX's is different that Astropulse really doesn't like, so they err within a minute. Although AP's also ran fine on my HD6950, the cursor didn't, it used to glitch (stick) something awful, if I needed to use the pc for anything serious, I had to suspend GPU work. The RX480 that replaced it kept giving occasional errors, so I stopped processing AP's on it. P. |
Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
So I have been looking into this a little bit more, and I discovered a few things. Task 7340502156 has the exit status of: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED The stderr output for this task is: <core_client_version>7.14.2</core_client_version> In looking at the stderr for the MilkyWay@home GPU tasks that ended in error, they had a similar output of exceeding the time limit. I also noticed for MW@H that I have three warnings listed in the task. I'm not trying to get my MW@H problems solved here, but I can't help but think these are related somehow. So I don't know how this works, but what I suspect is happening is that I download a task, and for whatever reason, it is given an estimated completion time of 0:00. Because the time is 0 (or presumably really really small), the task cannot be completed within this time and errors out. So, why is the task assigned such a small estimated completion time in the first place? I don't know if that is something happening in the BOINC software, parameters my computer is set at for my hardware, or something in between. Seti@home classic: 1,456 results, 1.613 years CPU time |
Ben Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
APs ran fine on my old HD7870, but never did on the RX470 I replaced it with. Something in the RX's is different that Astropulse really doesn't like, so they err within a minute. Same here with an rx 570. |
rob smith Joined: 7 Mar 03 Posts: 22739 Credit: 416,307,556 RAC: 380 |
I just found a clue - in the first part of the output file for your failed AP task I found this line: Device peak FLOPS 14,513,557.69 GFLOPS The comparable line for the GTX1080 on one of my computers is: Device peak FLOPS 8,875.52 GFLOPS Since this value is used to guess at the estimated run time and yours is about 1600 times larger than that for a faster processor (it should be smaller), I think you can see where the problem lies. I think I've found a solution - you need to edit the "coproc_info.xml" file, look for the section that has info about your GPU, then edit the line that starts with "<peak_flops>" and reduce the very big number that follows that by a factor of say 2000. Save the file, and restart BOINC. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
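To make the link between peak FLOPS and a 0:00 estimate concrete, here is a minimal sketch. It assumes the run-time estimate is simply the estimated operation count divided by the device speed, which is a simplification and not BOINC's actual code. The function name and the 1e16-operation task size are hypothetical; only the two peak-FLOPS figures come from the post above.

```cpp
// Simplified model (an assumption, not BOINC's actual code):
// estimated run time in seconds = estimated operations / device speed in FLOPS.
double estimated_seconds(double task_fpops, double device_flops) {
    return task_fpops / device_flops;
}

// Peak-FLOPS figures quoted in the thread, converted from GFLOPS to FLOPS.
const double kInflatedFlops = 14513557.69e9;  // the failing AMD device
const double kGtx1080Flops  = 8875.52e9;      // the GTX 1080 used for comparison

// Hypothetical task size, chosen only to make the effect visible.
const double kTaskFpops = 1e16;
```

Under these assumptions, `estimated_seconds(kTaskFpops, kGtx1080Flops)` comes to roughly 1127 seconds, while `estimated_seconds(kTaskFpops, kInflatedFlops)` is under one second, which would plausibly show up as 0:00. If the time limit is derived from the same device speed, it would shrink by the same factor.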
Phil Burden Joined: 26 Oct 00 Posts: 264 Credit: 22,303,899 RAC: 0 |
I just found a clue - in the first part of the output file for your failed AP task I found this line: Not sure that would work, unless I'm reading my data wrong, my rx470's copro file shows peak flops as "<peak_flops>5990400000000.000000</peak_flops>" which seems less than your 1080. P. |
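The two numbers being compared here are in different units: the stderr line reports GFLOPS, while `<peak_flops>` appears to hold raw FLOPS. A small sketch of the conversion follows; the function name is hypothetical and the interpretation of the XML value as raw FLOPS is an assumption based on the figures in this thread.

```cpp
// Converts a raw FLOPS value, as it appears in <peak_flops>, to GFLOPS.
// Assumption: the XML value is in FLOPS and 1 GFLOPS = 1e9 FLOPS.
double flops_to_gflops(double flops) {
    return flops / 1.e9;
}
```

With the value quoted above, `flops_to_gflops(5990400000000.0)` gives about 5990.4 GFLOPS, which is indeed lower than the 8,875.52 GFLOPS reported for the GTX 1080.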
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I guess I have never "filed a complaint" with anyone with why they are not working. I could see how the new Ryzen APUs have kinks that need to be worked out. Is Raistmer the guy for this? For the little bit of attention I pay to these boards I know he does a lot of development for the project (at least with special builds), but I never assumed he is the one to fix it. Yes, Raistmer is the prime developer of all the OpenCL apps. That includes the AP app too. The other place that a complaint should be lodged is on the Questions and Answers, GPU Applications forum so that it is noticed by the developers. I see Jord already commented so he also could lodge a complaint with the developers. This eventually needs to be logged into the BOINC/SETI GitHub repository as a bug for the application. From my irregular perusal of all the logged issues, I have seen similar issues logged already about the failure of the platform to properly identify the processing power of GPUs. I believe Richard Haselgrove just logged something very similar about the issue. https://github.com/BOINC/boinc/issues/2949 Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
rob smith Joined: 7 Mar 03 Posts: 22739 Credit: 416,307,556 RAC: 380 |
Phil, Looking at the output from one of your valid results you have: Device peak FLOPS 5,990.40 GFLOPS compared with one from one of my GTX1080: Device peak FLOPS 8,875.52 GFLOPS An RX470 is a slower GPU (for SETI) than a GTX1080, so I would expect the peak-flops (effectively the processing rate) to be lower for the RX470. Bill's figure is "stupidly high", he might solve the zero expected runtime by just knocking three zeros off the value (divide by a thousand). Doing so would at least be a guide as to this being the right place to look. There is of course a question - Why did the peak-flops value go so high in the first place? I have a few ideas, some are "easy" to rule out, others are going to need some digging. A few of the simplest are: - did this happen very soon after installing the new GPU? - are you heavily re-scheduling work between processors? - have you just stopped using the I-GPU? After that we are heading into the murky depths of BOINC, and that could well be a very messy can of worms. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
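The "divide by a thousand" edit can be sketched as a small text transformation. This is only an illustration of the manual edit described above, not a BOINC tool: the function name is hypothetical, and the sample value in the usage note is reconstructed from the 14,513,557.69 GFLOPS figure rather than copied from an actual coproc_info.xml.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Returns a copy of `xml` with the first <peak_flops> value divided by `divisor`.
// The text is returned unchanged when no complete <peak_flops> element is found.
// Note: std::stod throws if the element content is not a number, and the
// caller is expected to pass a non-zero divisor.
std::string scale_peak_flops(const std::string& xml, double divisor) {
    const std::string open_tag = "<peak_flops>";
    const std::string close_tag = "</peak_flops>";

    const std::size_t start = xml.find(open_tag);
    if (start == std::string::npos) return xml;
    const std::size_t value_begin = start + open_tag.size();
    const std::size_t value_end = xml.find(close_tag, value_begin);
    if (value_end == std::string::npos) return xml;

    const double old_value =
        std::stod(xml.substr(value_begin, value_end - value_begin));

    // Keep the six-decimal format seen in the value quoted earlier in the thread.
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.6f", old_value / divisor);

    return xml.substr(0, value_begin) + buf + xml.substr(value_end);
}
```

For example, calling `scale_peak_flops("<peak_flops>14513557690000000.000000</peak_flops>", 1000.0)` would yield a line carrying 14513557690000.000000. Working on a backup copy of the file while BOINC is stopped seems prudent.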
Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
"So, it's perfectly clear that the GFLOP calculation both from Boinc, and from the apps are totally and completely F'ed up, and nothing to count on at all. :-)" BOINC doesn't calculate the flops value, it reads it from the CUDA and OpenCL files provided by the drivers. |
Joined: 30 Nov 05 Posts: 282 Credit: 6,916,194 RAC: 60 |
"- did this happen very soon after installing the new GPU?" Technically, yes. The computer was built from scratch and has been running for about a month. I didn't get any AP tasks until a few weeks ago. "- are you heavily re-scheduling work between processors?" No, I am just letting it run its course. "- have you just stopped using the I-GPU?" No, I am using it for s@h v8 all the time with no problems, both CPU and GPU tasks. I have either been suspending any AP v7 GPU tasks or updating my app config to not allow AP v7 GPU tasks to download until I have a better idea of what can be done to fix this. "After that we are heading into the murky depths of BOINC, and that could well be a very messy can of worms." And that is above my level of expertise. I'd love to help, but I don't know that I have the time or experience to try to solve this myself in a reasonable amount of time. Seti@home classic: 1,456 results, 1.613 years CPU time |
rob smith Joined: 7 Mar 03 Posts: 22739 Credit: 416,307,556 RAC: 380 |
...actually BOINC does calculate the value for peak flops, based on data collected from the driver(s). This triggers a couple of thoughts, first have AMD in their wisdom changed the API (RPC) for getting such data? and second, is the data coming from the driver "correct"? If either is true then it is a "bit of a problem" that may cause a lot of head scratching. Here's an extract from coproc.cpp (for release 7.7.8) where the peak flops is calculated. I can see at least one place where an apparently innocent change could blow things out in the manner that Bill and others are seeing.

    void COPROC_ATI::set_peak_flops() {
        double x = 0;
        if (attribs.numberOfSIMD) {
            x = attribs.numberOfSIMD * attribs.wavefrontSize * 5 * attribs.engineClock * 1.e6;
            // clock is in MHz
        } else if (opencl_prop.amd_simd_per_compute_unit) {
            // OpenCL w/ cl_amd_device_attribute_query extension
            // Per: https://www.khronos.org/registry/cl/extensions/amd/cl_amd_device_attribute_query.txt
            //
            // Single precision performance is calculated as two times the number of shaders multiplied by the base core clock speed.
            // Per: https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units
            //
            // clock is in MHz
            x = opencl_prop.max_compute_units * opencl_prop.amd_simd_per_compute_unit * opencl_prop.amd_simd_width * opencl_prop.amd_simd_instruction_width * 2 * (opencl_prop.max_clock_frequency * 1.e6);
        } else if (opencl_prop.max_compute_units) {
            // OpenCL gives us only:
            // - max_compute_units
            //   (which I'll assume is the same as attribs.numberOfSIMD)
            // - max_clock_frequency (which I'll assume is the same as engineClock)
            // It doesn't give wavefrontSize, which can be 16/32/64.
            // So let's be conservative and use 16
            //
            x = opencl_prop.max_compute_units * 16 * 5 * opencl_prop.max_clock_frequency * 1e6;
        }
        peak_flops = (x>0)?x:5e10;
    }

    void COPROC_ATI::fake(double ram, double avail_ram, int n) {
        safe_strcpy(type, proc_type_name_xml(PROC_TYPE_AMD_GPU));
        safe_strcpy(version, "1.4.3");
        safe_strcpy(name, "foobar");
        count = n;
        available_ram = avail_ram;
        have_cal = true;
        memset(&attribs, 0, sizeof(attribs));
        memset(&info, 0, sizeof(info));
        attribs.localRAM = (int)(ram/MEGA);
        attribs.numberOfSIMD = 32;
        attribs.wavefrontSize = 32;
        attribs.engineClock = 50;
        for (int i=0; i<count; i++) {
            device_nums[i] = i;
        }
        set_peak_flops();
    }

...and a few places that may well have been affected by later drivers..... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
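As a worked example of the first branch in the extract above, the sketch below repeats the same arithmetic as a free function. The function name is hypothetical. The inputs in the usage note are the placeholder values from `COPROC_ATI::fake()`, and the kHz scenario is purely a hypothetical illustration of how a unit change could inflate the result, not a diagnosis of what the AMD driver actually does.

```cpp
// Same arithmetic as the first branch of COPROC_ATI::set_peak_flops():
// SIMD count * wavefront size * 5 * clock, with the clock expected in MHz.
double cal_peak_flops(int number_of_simd, int wavefront_size,
                      double engine_clock_mhz) {
    return number_of_simd * wavefront_size * 5 * engine_clock_mhz * 1.e6;
}
```

With the `fake()` values, `cal_peak_flops(32, 32, 50.0)` gives 2.56e11, or 256 GFLOPS. If the same clock were handed over in kHz (50000.0) while the code still treated it as MHz, the result would be 2.56e14, a thousand times too high. An error of that kind would be in the same ballpark as the roughly 1600-fold discrepancy discussed earlier in the thread.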
Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
I'd been looking through the source code for something like that and couldn't find it, just that it read and set peak_flops. I do remember that the science app calculates the flops separately on its own. |
rob smith Joined: 7 Mar 03 Posts: 22739 Credit: 416,307,556 RAC: 380 |
I too have been plodding around the code, trying to find out if there is anywhere else those values are used. Nothing yet, and a glass of amber nectar beckons me. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Did ATI make any kind of change between those families with respect to how they calculate the number of cores per SM on the card? Richard submitted a change in the code to accommodate the new Turing cards from Nvidia which changed the number of cores per SM. Pascal seems to use cores_per_proc = 128 but the new Turing cards use cores_per_proc = 64. From the bug report: The BOINC client decodes the 'Peak Flops' value for NVidia GPUs according to the architecture used in each succeeding card generation. Did something similar happen with ATI and nobody noticed the change in how the drivers report the number of cores? And now BOINC decodes incorrectly the peak FLOPS for certain cards. https://github.com/BOINC/boinc/issues/2706 Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
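The effect of a wrong cores_per_proc can be sketched with a simplified model: multiprocessors times cores per multiprocessor times two operations per clock times the clock rate. The function name, the factor of two, and the inputs in the usage note (20 multiprocessors at 1733.5 MHz) are assumptions; the inputs were chosen because they reproduce the 8,875.52 GFLOPS GTX 1080 figure quoted earlier in the thread, and the code is not taken from BOINC.

```cpp
// Simplified peak-FLOPS model for an NVIDIA GPU (an assumption, not BOINC code):
// multiprocessors * cores per multiprocessor * 2 ops per clock * clock in Hz.
double nvidia_peak_flops(int multiprocessors, int cores_per_proc,
                         double clock_mhz) {
    return multiprocessors * cores_per_proc * 2.0 * clock_mhz * 1.e6;
}
```

Here `nvidia_peak_flops(20, 128, 1733.5)` gives about 8.8755e12, matching the GTX 1080 line. Decoding a card that really has 64 cores per multiprocessor with the Pascal value of 128 would double the result, which shows how an architecture change can skew the estimate without any change in the client code.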
andivb Joined: 9 Aug 09 Posts: 7 Credit: 14,510,909 RAC: 2 |
I recently installed an AMD RX 580 GPU and just noticed that Astropulse results are invalid. 2 bad results are ID 7584720987 and 7584330887. Was hoping for an answer/fix in this thread, but I'm not really sure I understand what the resolution to the problem is. Any suggestions? http://setiathome.berkeley.edu/result.php?resultid=7584720987 http://setiathome.berkeley.edu/result.php?resultid=7584330887 |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.