SETI@home v8 beta to begin on Tuesday
| Author | Message |
|---|---|
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
Thanks. There are only two distinct tests on API version in the current (head) BOINC client code:

api_version_at_least(6, 0)
api_version_at_least(7, 5)

I don't think that any change between 7.5.0 and 7.7.0 will make any difference - and I doubt even 7.5.0 by itself is causing this problem. It would still be helpful to see the matching <app_version> for the Main project (derived from your app_info), to see what difference might be allowing device selection to work properly there.

Then, as a definitive test, could you please try this:
1) Cache up with a few extra Beta tasks from the server, with current deployment settings.
2) Set 'no new tasks' for Beta - finish and report any Beta tasks in progress, but suspend the new 'Ready to start' tasks so they don't run until...
3) Once the Beta project is idle, shut down the BOINC client and delete the entire <api_version>7.5.0</api_version> line from the Beta copy of <app_version>.
4) Restart BOINC, and allow the cached tasks to run - ideally on all three GPUs at the same time. But don't allow BOINC to fetch any new Beta work yet.
5) Report the completed tasks run without an <api_version>, and give us the task numbers so we can examine stderr.

If Jason's code detects the command line device number correctly under those conditions, we have our smoking gun.

Edit - for a second, confirmation, test, try changing the <api_version> to 7.3.0, instead of removing it entirely. Note that the <api_version> is likely to be reset to 7.5.0 every time new work is fetched from the server. |
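For readers following along, the two version tests named in the post can be pictured with a small C++ sketch. This is not the BOINC client's code: only the name `api_version_at_least` and the thresholds (6, 0) and (7, 5) come from the post. The `ApiVersion` struct, the `parse_api_version` helper, and the choice to treat a missing or malformed string as 0.0 are assumptions made for illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the version fields of an <app_version> record.
struct ApiVersion {
    int major = 0;
    int minor = 0;
};

// Parse "major.minor.release" (for example "7.5.0"). An empty or malformed
// string yields 0.0; that default is an assumption of this sketch.
ApiVersion parse_api_version(const std::string& text) {
    ApiVersion v;
    int release = 0;
    if (std::sscanf(text.c_str(), "%d.%d.%d", &v.major, &v.minor, &release) < 2) {
        v = ApiVersion{};
    }
    return v;
}

// True when the version is at least major.minor.
bool api_version_at_least(const ApiVersion& v, int major, int minor) {
    if (v.major != major) return v.major > major;
    return v.minor >= minor;
}
```

Under these assumptions 7.5.0 and 7.7.0 land on the same side of both tests, while 7.3.0 and 6.2.18 fall below the (7, 5) threshold, which is why the post suggests 7.3.0 for the confirmation test.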
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
OK, device 1 (the 750Ti) is running task 24472343 Both are complete, and both ran on Jason's Device 1 (BOINC's Device 0) - one of the GTX 950 cards. We need to solve this. |
|
Send message Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0
|
I had already started another test by turning off networking, stopping BOINC, and changing 7.5 to 7.7... but that didn't help. Still with networking off, I stopped BOINC and deleted the line <api_version>7.7.0</api_version>. It appears that may have worked. It's finishing the running tasks. Once it starts new tasks I'll turn networking back on and see how it goes.

In cudaAcc_initializeDevice(): Boinc passed DevPref 3
setiathome_CUDA: CUDA Device 3 specified, checking...
Device 3: GeForce GTX 950 is okay
SETI@home using CUDA accelerated device GeForce GTX 950

The app_info section on Main is:
<app>
<name>setiathome_v8</name>
</app>
<file_info>
<name>setiathome_8.10_x86_64-apple-darwin__cuda75_mac</name>
<executable/>
</file_info>
<file_info>
<name>libcudart.6.5.dylib</name>
<executable/>
</file_info>
<file_info>
<name>libcufft.6.5.dylib</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_v8</app_name>
<platform>x86_64-apple-darwin</platform>
<version_num>802</version_num>
<plan_class>cuda75</plan_class>
<avg_ncpus>0.1</avg_ncpus>
<max_ncpus>0.1</max_ncpus>
<cmdline></cmdline>
<coproc>
<type>CUDA</type>
<count>1</count>
</coproc>
<file_ref>
<file_name>setiathome_8.10_x86_64-apple-darwin__cuda75_mac</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>libcudart.6.5.dylib</file_name>
</file_ref>
<file_ref>
<file_name>libcufft.6.5.dylib</file_name>
</file_ref>
</app_version>

That seems to have fixed it, https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24474446
Amazing how that one line causes such problems.

It's summer. We just had a storm move through with a few power outages, so there are a few restarts in the last few tasks. |
Raistmer Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0
|
So, the OS X flavour doesn't interact well with BOINC's new way of passing the device ID. It accepts only a device ID sent via the command line. This explains all the observed behaviour. So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses the new BOINC device ID passing protocol.

News about SETI opt app releases: https://twitter.com/Raistmer |
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
"That seems to have fixed it, https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24474446"

We're definitely making progress, but we're not out of the woods yet. I posted the BOINC client source code tests for api_versions a while back: from those, 7.7.0 was never going to make any difference, but 7.3.0 should help.

We also need to check that BOINC's device numbering, and Jason's "Boinc passed DevPref n", match up properly: once the storm and the power outages have passed, could you please try to let one Beta task run through from beginning to end without interruption, and note which device number BOINC displays while it's running. Then, link the result, and we can check whether the DevPref shows the appropriate number (should be one higher).

After that, I need to tell both you and Eric how to stop this happening again. |
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
"So, the OS X flavour doesn't interact well with BOINC's new way of passing the device ID. It accepts only a device ID sent via the command line."

See my conversation with Jason at Main.
1) Jason never bothered to implement device passing via init_data.xml, as he should have done starting in September 2011.
2) David Anderson disabled the fallback command line device numbering, with effect from API version 7.5.0 in July 2014. |
|
Send message Joined: 18 Jun 08 Posts: 76 Credit: 113,089 RAC: 0
|
"So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses the new BOINC device ID passing protocol."

There have been API fixes since 7.5.0. I think the next step should be T-Bar rebuilding the app with the very latest code from master. |
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
"So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses the new BOINC device ID passing protocol."

No, that won't help, unless the core application code makes the appropriate API calls to work with the new(er) interface. That would need Jason to get his act up to speed, and he shows no inclination towards doing so.

My workround would be for the application to declare itself - accurately - as being written to use older API calls, thus re-enabling the fallback calling convention. |
|
Send message Joined: 18 Jun 08 Posts: 76 Credit: 113,089 RAC: 0
|
"So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses the new BOINC device ID passing protocol."

Yes, I have to take that back. It's clearly a limitation in the app.
main.cpp
analyzeFuncs.cpp (and another a few lines later)
And I imagine the Windows versions work with multiple GPUs present because Jason is building them with 6.whatever API.

"My workround would be for the application to declare itself - accurately - as being written to use older API calls, thus re-enabling the fallback calling convention."

With normal deployment, the API version is extracted from the app automatically. You'd have to get Eric to agree to set the API version manually every time he updates the apps. |
Raistmer Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0
|
Yes, it's a workaround that will work for a while, but with both the OS X OpenCL and CUDA MB apps in existence we will "test" all those difficulties the new device ID passing protocol should shield us from. So it would still be better to get this path working. Still to be explained is why we don't see the same on the Windows side... |
Raistmer Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0
|
So, that explains the Windows part too. Well, perhaps TBar should rebuild with the same ancient BOINC API version too, to save Eric from this new versioning nightmare. Otherwise, real patching of the CUDA app is needed. |
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
"And I imagine the Windows versions work with multiple GPUs present because Jason is building them with 6.whatever API."

Jason is working with a forked (self-modified) API library, with several (Windows only) improvements over the BOINC code. But unfortunately he and David A have never established a constructive dialog, so the two sets of improvements have never been cross-migrated (in either direction). Jason's version still declares itself as being 6.2.18 (I checked x41zi today - see below), so his own Windows builds are well within the safety zone for this particular problem.

"With normal deployment, the API version is extracted from the app automatically. You'd have to get Eric to agree to set the API version manually every time he updates the apps."

The BOINC API has never been properly (independently) versioned: it simply inherits a string from the (roughly contemporaneous) client development version. That client version string is almost entirely useless in determining the actual API calls implemented. So I propose that we - quite safely - hack it.

Depending on TBar's precise build operations, there are three possible places where the hack could be made, and in some cases be persistent across build versions.
1) His copy of the API source code - if he builds the API library from code every time.
2) The API library object file, prior to linking into the application.
3) The finished binary executable sent to Eric. (not persistent)

(2) and (3) could be simply achieved using a hex editor. TBar's will show 7.5.0 (that's where the app_version will have picked it up from, when Eric ran the deployment script). I'd suggest the simplest possible one-byte change, to say 7.3.0 |
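The one-byte edit proposed in the post can be illustrated in C++ on an in-memory copy of a file. This is a sketch only: `patch_version_string` is a hypothetical helper, and it assumes the version appears in the binary as the plain ASCII text "7.5.0", which the thread implies but does not show. It replaces the first occurrence it finds; in a real binary the offset would need to be confirmed by eye in the hex editor, since the same bytes could occur elsewhere.

```cpp
#include <cstddef>
#include <string>

// Replace the first occurrence of `from` with `to` inside `image`.
// Both strings must have the same length so that no byte offsets shift.
// Returns true when a replacement was made.
bool patch_version_string(std::string& image,
                          const std::string& from,
                          const std::string& to) {
    if (from.empty() || from.size() != to.size()) return false;
    const std::size_t pos = image.find(from);
    if (pos == std::string::npos) return false;
    image.replace(pos, from.size(), to);
    return true;
}
```

Calling `patch_version_string(image, "7.5.0", "7.3.0")` changes a single byte (the '5' becomes '3') and leaves the file length untouched, which is the property that makes this kind of edit comparatively safe.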
Raistmer Send message Joined: 18 Aug 05 Posts: 2423 Credit: 15,878,738 RAC: 0
|
And will this ensure OpenCL/CUDA enumeration consistency ? |
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
"And will this ensure OpenCL/CUDA enumeration consistency ?"

We're only talking NVidia cards here. My experience from Windows is that all NVidia CUDA cards are also OpenCL capable, and that all NVidia Drivers install both CUDA and OpenCL runtimes, without any option to make choices. But of course Microsoft drivers from Windows Update may have bits missing.

TBar, please advise whether the similar combined CUDA+OpenCL behaviour applies to Apple's OS X drivers.

I think I've seen Linux users complain (recently*) that NVidia graphics drivers for Linux don't necessarily contain either CUDA or OpenCL runtimes, and they may have to be downloaded separately. Downloading one but not the other might break enumeration compatibility...

I'm simply suggesting this as a workround that might enable Eric to re-deploy the application after maybe 30 seconds of patching - then testing could continue properly, and I'll bow out while the Mac development team decide how to proceed in the longer term (I haven't done this since using ResEdit for similar binary edits on System 7, 20 or more years ago). But something tells me it's going to take more than 30 seconds to persuade Jason to update his API calls...

* source |
|
Send message Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0
|
As far as I know, Apple uses their own OpenCL driver. You can see it listed here: https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24419654

OpenCL platform detected: Apple
cl_APPLE_fp64_basic_ops supported.

nVidia only supports an nVidia video-only driver and a CUDA driver. If you look at an AMD Mac it also says Apple: https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24458743
You don't have to install an OpenCL driver in OS X; it's built into the system.

"And will this ensure OpenCL/CUDA enumeration consistency ?" |
|
Send message Joined: 18 Jun 08 Posts: 76 Credit: 113,089 RAC: 0
|
"But something tells me it's going to take more than 30 seconds to persuade Jason to update his API calls..."

Shrug. Following the docs, main.cpp:
boinc_parse_init_data_file();
boinc_get_init_data(app_init_data);
+#if (BOINC_MAJOR_VERSION >= 8) || \
+ ((BOINC_MAJOR_VERSION == 7) && (BOINC_MINOR_VERSION >= 5))
+ if (app_init_data.gpu_device_num >= 0) {
+ gCUDADevPref = app_init_data.gpu_device_num + 1;
+ }
+#endif
if (!strlen(app_init_data.project_dir)) {
#ifdef _WIN32
edit: Fixed for 1-based indexing. This preserves the existing behaviour of running on

FWIW, app_init_data.gpu_device_num was added in a858fe79 (API version 6.13.3) and fixed in 6b81e2ff (API version 7.1.0). |
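The selection order that the patch above is aiming for can be shown as a standalone C++ sketch: use the device number from init_data.xml when it is present, otherwise fall back to a device number given on the command line, and convert BOINC's 0-based number to the 1-based "DevPref" in both cases. The function `select_dev_pref`, the `--device N` flag spelling, and the use of 0 to mean "no preference" are assumptions for illustration, not the application's actual code.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper returning a 1-based device preference, or 0 when
// neither source names a device.
//   init_gpu_device_num: value read from init_data.xml, negative if absent.
//   argc/argv: the application's command line, searched for "--device N"
//              (flag spelling assumed for this sketch).
int select_dev_pref(int init_gpu_device_num, int argc, const char* const* argv) {
    if (init_gpu_device_num >= 0) {
        return init_gpu_device_num + 1;  // 0-based BOINC number -> 1-based
    }
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "--device") == 0) {
            return std::atoi(argv[i + 1]) + 1;  // same 1-based conversion
        }
    }
    return 0;  // nothing specified
}
```

With this ordering an application keeps working whether or not the client still passes the device on the command line, which is the situation described earlier in the thread for API versions on either side of 7.5.0.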
|
Send message Joined: 30 Dec 13 Posts: 258 Credit: 12,340,341 RAC: 0
|
I was able to download some work but the internet crashed again. I don't think they will get to my house today. Sorry, it will probably be a few days before I can get those back to you. Tomorrow I'm at a different location and will try installing the new app there to test. Different internet provider there. |
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
Thanks. Cross-referenced in the discussion we were having with Jason at Main. |
|
Send message Joined: 2 Jul 13 Posts: 505 Credit: 5,019,318 RAC: 0
|
Any decision on a fix yet? Right now what works is to suspend network activity, stop BOINC, remove <api_version>7.5.0</api_version> from client_state.xml, then start BOINC. Run about a dozen tasks, then enable networking, report tasks, then rinse and repeat. It's easy to tell if the api line is back: the cards start working in slow motion. A while back I attempted to find an older boinc-master and could only find "boinc-client_release-7-7.4.14_android_hotfix". I don't think that's old enough; it's not much older than the 7.5 I'm using. |
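The manual workaround described above amounts to deleting one line of text while BOINC is stopped. A minimal C++ sketch of that text operation follows. `strip_api_version_lines` is a hypothetical helper that works on a string already read from the file; it does not parse XML, and it removes every line containing an `<api_version>` element, so a real tool would have to restrict itself to the Beta `<app_version>` block, as the thread specifies.

```cpp
#include <sstream>
#include <string>

// Return a copy of `xml` without any line that contains an <api_version>
// element. Plain line filtering only; no XML parsing is attempted.
std::string strip_api_version_lines(const std::string& xml) {
    std::istringstream in(xml);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        if (line.find("<api_version>") != std::string::npos) continue;
        out += line;
        out += '\n';
    }
    return out;
}
```

As noted in the thread, the line is likely to come back whenever new work is fetched from the server, so this only helps for tasks already cached.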
|
Send message Joined: 3 Jan 07 Posts: 1451 Credit: 3,272,268 RAC: 0
|
Let Eric have his weekend in peace. I think we've diagnosed the problem thoroughly, and I've given *you* three alternative hotfixes whereby *you* could prepare a fixed binary which *you* could email to Eric with a request for a replacement deployment (explain why it's needed, by reference to this thread if necessary). All three hotfix possibilities are single byte binary edits, in your choice of one out of three possible files. No need to dig into ancient archaeology to find older copies of unrecognisable files. (edit - and I'm off to bed now. You're on your own for the next 8 hours.) |
©2020 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.