SETI@home v8 beta to begin on Tuesday

Message boards : News : SETI@home v8 beta to begin on Tuesday

Richard Haselgrove
Message 59193 - Posted: 30 Jul 2016, 13:52:58 UTC - in response to Message 59192.  
Last modified: 30 Jul 2016, 14:11:14 UTC

Thanks. There are only two distinct tests on API version in the current (head) BOINC client code:

api_version_at_least(6, 0)
api_version_at_least(7, 5)

I don't think that any change between 7.5.0 and 7.7.0 will make any difference - and I doubt even 7.5.0 by itself is causing this problem.
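For readers following along, the effect of those two tests can be sketched as a plain major/minor comparison. This is an illustration only, not the BOINC client source: the `ApiVersion` struct and the `parse_api_version` helper are hypothetical, only the name `api_version_at_least` comes from the post above, and treating a missing version as 0.0.0 is an assumption.

```cpp
#include <cstdio>
#include <string>

// Hypothetical holder for a parsed "major.minor.release" string.
struct ApiVersion {
    int major_num = 0;
    int minor_num = 0;
    int release_num = 0;
};

// Parse e.g. "7.5.0". Fields that cannot be read stay 0, so an empty
// string (assumed here to stand for "no <api_version> line") gives 0.0.0.
ApiVersion parse_api_version(const std::string& s) {
    ApiVersion v;
    std::sscanf(s.c_str(), "%d.%d.%d", &v.major_num, &v.minor_num, &v.release_num);
    return v;
}

// True when v is at or above major.minor; the release field is ignored.
bool api_version_at_least(const ApiVersion& v, int major_num, int minor_num) {
    if (v.major_num != major_num) return v.major_num > major_num;
    return v.minor_num >= minor_num;
}
```

Under this reading, 7.5.0 and 7.7.0 land on the same side of both tests, while 7.3.0 falls below the (7, 5) threshold.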

It would still be helpful to see the matching <app_version> for the Main project (derived from your app_info), to see what difference might be allowing device selection to work properly there.

Then, as a definitive test, could you please try this:

1) Cache up with a few extra Beta tasks from the server, with current deployment settings.
2) Set 'no new tasks' for Beta - finish and report any Beta tasks in progress, but suspend the new 'Ready to start' tasks so they don't run until...
3) Once the Beta project is idle, shut down the BOINC client and delete the entire <api_version>7.5.0</api_version> line from the Beta copy of <app_version>.
4) Restart BOINC, and allow the cached tasks to run - ideally on all three GPUs at the same time. But don't allow BOINC to fetch any new Beta work yet.
5) Report the completed tasks run without an <api_version>, and give us the task numbers so we can examine stderr. If Jason's code detects the command line device number correctly under those conditions, we have our smoking gun.

Edit - for a second, confirmatory test, try changing the <api_version> to 7.3.0 instead of removing it entirely. Note that the <api_version> is likely to be reset to 7.5.0 every time new work is fetched from the server.
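The manual edit described in this post could also be scripted. A minimal sketch, assuming the <api_version> element sits on a line of its own as quoted above; `rewrite_api_version` is a hypothetical helper, it is purely line-based (no real XML parsing), and it touches every <api_version> line in the text it is given, so it would have to be fed only the Beta project's <app_version> block. As with the manual edit, the BOINC client should be shut down first.

```cpp
#include <sstream>
#include <string>

// Returns a copy of `xml` in which each line containing an <api_version>
// element is either dropped (empty replacement) or rewritten to the given
// version string. Hypothetical helper for illustration only.
std::string rewrite_api_version(const std::string& xml,
                                const std::string& replacement = "") {
    std::istringstream in(xml);
    std::string out, line;
    while (std::getline(in, line)) {
        if (line.find("<api_version>") != std::string::npos) {
            if (replacement.empty()) continue;  // first test: delete the line
            // Second test: keep the indentation, swap in e.g. "7.3.0".
            const std::string indent = line.substr(0, line.find('<'));
            line = indent + "<api_version>" + replacement + "</api_version>";
        }
        out += line;
        out += '\n';
    }
    return out;
}
```

Calling it with no replacement corresponds to the first test (line removed); calling it with "7.3.0" corresponds to the confirmation test.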

Richard Haselgrove
Message 59194 - Posted: 30 Jul 2016, 14:25:40 UTC - in response to Message 59190.  

OK, device 1 (the 750Ti) is running task 24472343
Device 2 (one of the 950s) is running task 24472056

Both are complete, and both ran on Jason's Device 1 (BOINC's Device 0) - one of the GTX 950 cards. We need to solve this.

TBar
Message 59195 - Posted: 30 Jul 2016, 14:32:54 UTC - in response to Message 59193.  
Last modified: 30 Jul 2016, 15:29:18 UTC

I had already started another test by turning off networking, stopping BOINC, and changing 7.5 to 7.7...but that didn't help. Still with networking off I stopped BOINC and deleted the line <api_version>7.7.0</api_version>. It appears that may have worked. It's finishing the running tasks. Once it starts new tasks I'll turn networking back on and see how it goes.

In cudaAcc_initializeDevice(): Boinc passed DevPref 3
setiathome_CUDA: CUDA Device 3 specified, checking...
   Device 3: GeForce GTX 950 is okay
SETI@home using CUDA accelerated device GeForce GTX 950


The app_info section on Main is:
    <app>
        <name>setiathome_v8</name>
    </app>
    <file_info>
        <name>setiathome_8.10_x86_64-apple-darwin__cuda75_mac</name>
        <executable/>
    </file_info>
    <file_info>
        <name>libcudart.6.5.dylib</name>
        <executable/>
    </file_info>
    <file_info>
        <name>libcufft.6.5.dylib</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_v8</app_name>
        <platform>x86_64-apple-darwin</platform>
        <version_num>802</version_num>
        <plan_class>cuda75</plan_class>
        <avg_ncpus>0.1</avg_ncpus>
        <max_ncpus>0.1</max_ncpus>
        <cmdline></cmdline>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>
        <file_ref>
            <file_name>setiathome_8.10_x86_64-apple-darwin__cuda75_mac</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>libcudart.6.5.dylib</file_name>
        </file_ref>
        <file_ref>
            <file_name>libcufft.6.5.dylib</file_name>
        </file_ref>
    </app_version>


That seems to have fixed it, https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24474446
Amazing how that one line causes such problems.
It's Summer. We just had a Storm move through with a few Power outages, so, there are a few restarts in the last few tasks.

Raistmer
Message 59196 - Posted: 30 Jul 2016, 15:42:40 UTC - in response to Message 59195.  
Last modified: 30 Jul 2016, 15:43:25 UTC

So, the OS X flavour doesn't interact well with the new way BOINC passes the device ID. It accepts only a device ID sent via the command line.
This explains all the observed behaviour.

So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses BOINC's new device ID passing protocol.
News about SETI opt app releases: https://twitter.com/Raistmer

Richard Haselgrove
Message 59197 - Posted: 30 Jul 2016, 15:48:40 UTC - in response to Message 59195.  

> That seems to have fixed it, https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24474446
> Amazing how that one line causes such problems.
> It's Summer. We just had a Storm move through with a few Power outages, so, there are a few restarts in the last few tasks.

We're definitely making progress, but we're not out of the woods yet.

I posted the BOINC client source code tests for api_versions a while back: from those, 7.7.0 was never going to make any difference, but 7.3.0 should help.

We also need to check that BOINC's device numbering, and Jason's "Boinc passed DevPref n", match up properly: once the storm and the power outages have passed, could you please try to let one Beta task run through from beginning to end without interruption, and note which device number BOINC displays while it's running. Then, link the result, and we can check whether the DevPref shows the appropriate number (should be one higher).
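The check requested here can be expressed in a few lines. A rough sketch, assuming the stderr text contains a line of the form shown earlier in the thread ("Boinc passed DevPref 3"); the helper names `parse_dev_pref` and `dev_pref_matches` are hypothetical, and the BOINC device number has to be read off the Manager by hand.

```cpp
#include <cstdlib>
#include <string>

// Extract n from a stderr line such as
// "In cudaAcc_initializeDevice(): Boinc passed DevPref 3".
// Returns -1 when the marker is absent or no number follows it.
int parse_dev_pref(const std::string& stderr_text) {
    const std::string marker = "Boinc passed DevPref ";
    const std::size_t pos = stderr_text.find(marker);
    if (pos == std::string::npos) return -1;
    const char* start = stderr_text.c_str() + pos + marker.size();
    char* end = nullptr;
    const long n = std::strtol(start, &end, 10);
    return end == start ? -1 : static_cast<int>(n);
}

// The app counts devices from 1 while BOINC counts from 0, so a task that
// BOINC shows on device d should report DevPref d + 1.
bool dev_pref_matches(int boinc_device, const std::string& stderr_text) {
    return parse_dev_pref(stderr_text) == boinc_device + 1;
}
```

For example, a task that BOINC displayed on device 2 would be expected to show DevPref 3 in its stderr.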

After that, I need to tell both you and Eric how to stop this happening again.

Richard Haselgrove
Message 59198 - Posted: 30 Jul 2016, 15:54:56 UTC - in response to Message 59196.  

> So, the OS X flavour doesn't interact well with the new way BOINC passes the device ID. It accepts only a device ID sent via the command line.
> This explains all the observed behaviour.
>
> So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses BOINC's new device ID passing protocol.

See my conversation with Jason at Main.

1) Jason never bothered to implement device passing via init_data.xml, as he should have done starting in September 2011.
2) David Anderson disabled the fallback command line device numbering, with effect from API version 7.5.0 in July 2014.
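How the two points combine can be shown with a toy model. This is not BOINC or application source: the `AppTraits` struct and both function names are hypothetical, and the model only encodes what is stated in this thread (an app that ignores init_data.xml, and a command line fallback that stops at API 7.5).

```cpp
// Toy model of the two points above; all names are hypothetical.
struct AppTraits {
    int api_major;         // from the declared <api_version>
    int api_minor;
    bool reads_init_data;  // does the app honour the device in init_data.xml?
};

// Point 2: the command line fallback is only offered below API 7.5.
bool cmdline_fallback_active(const AppTraits& app) {
    return app.api_major < 7 || (app.api_major == 7 && app.api_minor < 5);
}

// Returns the device the app ends up learning about, or -1 when it learns
// nothing and falls back to its own default choice.
int device_seen_by_app(const AppTraits& app, int assigned_device) {
    if (app.reads_init_data) return assigned_device;           // newer protocol
    if (cmdline_fallback_active(app)) return assigned_device;  // older protocol
    return -1;  // neither channel works: every task lands on the default GPU
}
```

In this model an app declaring 7.5 that does not read init_data.xml never learns its device, while the same app declaring 7.3 or 6.x does.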

Juha
Message 59199 - Posted: 30 Jul 2016, 16:01:38 UTC - in response to Message 59196.  

> So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses BOINC's new device ID passing protocol.


There have been API fixes since 7.5.0. I think the next step should be TBar rebuilding the app with the very latest code from master.

Richard Haselgrove
Message 59200 - Posted: 30 Jul 2016, 16:10:57 UTC - in response to Message 59199.  

>> So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses BOINC's new device ID passing protocol.
>
> There have been API fixes since 7.5.0. I think the next step should be TBar rebuilding the app with the very latest code from master.

No, that won't help, unless the core application code makes the appropriate API calls to work with the new(er) interface. That would need Jason to get his act up to speed, and he shows no inclination towards doing so.

My workround would be for the application to declare itself - accurately - as being written to use older API calls, thus re-enabling the fallback calling convention.

Juha
Message 59201 - Posted: 30 Jul 2016, 16:24:59 UTC - in response to Message 59200.  

>>> So, the next step should be a CUDA MB code walk with the OS X path selected, to see where it misses BOINC's new device ID passing protocol.
>
>> There have been API fixes since 7.5.0. I think the next step should be TBar rebuilding the app with the very latest code from master.
>
> No, that won't help, unless the core application code makes the appropriate API calls to work with the new(er) interface. That would need Jason to get his act up to speed, and he shows no inclination towards doing so.


Yes, I have to take that back. It's clearly a limitation in the app.

main.cpp
analyzeFuncs.cpp (and another a few lines later)

And I imagine the Windows versions work with multiple GPUs present because Jason is building them with 6.whatever API.

> My workround would be for the application to declare itself - accurately - as being written to use older API calls, thus re-enabling the fallback calling convention.


With normal deployment, the API version is extracted from the app automatically. You'd have to get Eric to agree to set the API version manually every time he updates the apps.

Raistmer
Message 59202 - Posted: 30 Jul 2016, 16:25:04 UTC - in response to Message 59200.  


> My workround would be for the application to declare itself - accurately - as being written to use older API calls, thus re-enabling the fallback calling convention.

Yes, it's a workaround that will work for a while, but with both OS X OpenCL and CUDA MB in existence we will "test" all those difficulties the new device ID passing protocol should shield us from.
So it would still be better to get this path working.

It still has to be explained why we don't see the same on the Windows side...

Raistmer
Message 59203 - Posted: 30 Jul 2016, 16:28:10 UTC - in response to Message 59201.  


> And I imagine the Windows versions work with multiple GPUs present because Jason is building them with 6.whatever API.
>
>> My workround would be for the application to declare itself - accurately - as being written to use older API calls, thus re-enabling the fallback calling convention.
>
> With normal deployment, the API version is extracted from the app automatically. You'd have to get Eric to agree to set the API version manually every time he updates the apps.


So, that explains the Windows part too.
Well, perhaps TBar should rebuild with the same ancient BOINC API version too, to save Eric from this new versioning nightmare.
Otherwise, real patching of the CUDA app is needed.

Richard Haselgrove
Message 59204 - Posted: 30 Jul 2016, 17:01:06 UTC - in response to Message 59201.  

> And I imagine the Windows versions work with multiple GPUs present because Jason is building them with 6.whatever API.

Jason is working with a forked (self-modified) API library, with several (Windows only) improvements over the BOINC code. But unfortunately he and David A have never established a constructive dialog, so the two sets of improvements have never been cross-migrated (in either direction).

Jason's version still declares itself as being 6.2.18 (I checked x41zi today - see below), so his own Windows builds are well within the safety zone for this particular problem.

> With normal deployment, the API version is extracted from the app automatically. You'd have to get Eric to agree to set the API version manually every time he updates the apps.

The BOINC API has never been properly (independently) versioned: it simply inherits a string from the (roughly contemporaneous) client development version. That client version string is almost entirely useless in determining the actual API calls implemented. So I propose that we - quite safely - hack it.

Depending on TBar's precise build operations, there are three possible places where the hack could be made, and in some cases be persistent across build versions.

1) His copy of the API source code - if he builds the API library from code every time.
2) The API library object file, prior to linking into the application.
3) The finished binary executable sent to Eric. (not persistent)

(2) and (3) could be simply achieved using a hex editor.

TBar's will show 7.5.0 (that's where the app_version will have picked it up from, when Eric ran the deployment script). I'd suggest the simplest possible one-byte change, to say 7.3.0.
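The same one-byte change could be made with a few lines of code rather than a hex editor. A minimal sketch, assuming the version is stored in the file as the plain ASCII text 7.5.0 and that this text occurs exactly once; neither assumption is confirmed in this thread, and `patch_api_version` is a hypothetical helper. It refuses to touch the data unless there is exactly one match.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replace the single occurrence of `from` with `to` in a binary image.
// Both strings must have the same length so that no other byte moves.
// Returns true only if exactly one occurrence was found and patched.
bool patch_api_version(std::vector<unsigned char>& image,
                       const std::string& from = "7.5.0",
                       const std::string& to = "7.3.0") {
    if (from.empty() || from.size() != to.size() || image.size() < from.size())
        return false;
    std::size_t hits = 0, where = 0;
    for (std::size_t i = 0; i + from.size() <= image.size(); ++i) {
        bool match = true;
        for (std::size_t j = 0; j < from.size() && match; ++j)
            match = image[i + j] == static_cast<unsigned char>(from[j]);
        if (match) { ++hits; where = i; }
    }
    if (hits != 1) return false;  // ambiguous or missing: leave the data alone
    for (std::size_t j = 0; j < to.size(); ++j)
        image[where + j] = static_cast<unsigned char>(to[j]);
    return true;
}
```

Reading the library or executable into the vector and writing it back out is left to the caller; working on a copy of the file would be prudent.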

Raistmer
Message 59205 - Posted: 30 Jul 2016, 17:20:07 UTC - in response to Message 59204.  

And will this ensure OpenCL/CUDA enumeration consistency?

Richard Haselgrove
Message 59206 - Posted: 30 Jul 2016, 17:37:52 UTC - in response to Message 59205.  

> And will this ensure OpenCL/CUDA enumeration consistency?

We're only talking NVidia cards here. My experience from Windows is that all NVidia CUDA cards are also OpenCL capable, and that all NVidia Drivers install both CUDA and OpenCL runtimes, without any option to make choices. But of course Microsoft drivers from Windows Update may have bits missing.

TBar, please advise whether the similar combined CUDA+OpenCL behaviour applies to Apple's OS X drivers.

I think I've seen Linux users complain (recently*) that NVidia graphics drivers for Linux don't necessarily contain either CUDA or OpenCL runtimes, and they may have to be downloaded separately. Downloading one but not the other might break enumeration compatibility...

I'm simply suggesting this as a workround that might enable Eric to re-deploy the application after maybe 30 seconds of patching - then testing could continue properly, and I'll bow out while the Mac development team decide how to proceed in the longer term (I haven't done this since using ResEdit for similar binary edits on System 7, 20 or more years ago). But something tells me it's going to take more than 30 seconds to persuade Jason to update his API calls...

* source

TBar
Message 59207 - Posted: 30 Jul 2016, 17:59:56 UTC - in response to Message 59206.  
Last modified: 30 Jul 2016, 18:17:50 UTC

As far as I know Apple uses their own OpenCL driver. You can see it listed here: https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24419654
OpenCL platform detected: Apple
cl_APPLE_fp64_basic_ops supported.

nVidia only supports an nVidia video-only driver and a CUDA driver.

If you look at an AMD Mac it also says Apple, https://setiweb.ssl.berkeley.edu/beta/result.php?resultid=24458743
You don't have to install an OpenCL driver in OSX, it's built into the system.


>> And will this ensure OpenCL/CUDA enumeration consistency?
>
> We're only talking NVidia cards here. My experience from Windows is that all NVidia CUDA cards are also OpenCL capable, and that all NVidia Drivers install both CUDA and OpenCL runtimes, without any option to make choices. But of course Microsoft drivers from Windows Update may have bits missing.
>
> TBar, please advise whether the similar combined CUDA+OpenCL behaviour applies to Apple's OS X drivers.
>
> I think I've seen Linux users complain (recently*) that NVidia graphics drivers for Linux don't necessarily contain either CUDA or OpenCL runtimes, and they may have to be downloaded separately. Downloading one but not the other might break enumeration compatibility...
>
> I'm simply suggesting this as a workround that might enable Eric to re-deploy the application after maybe 30 seconds of patching - then testing could continue properly, and I'll bow out while the Mac development team decide how to proceed in the longer term (I haven't done this since using ResEdit for similar binary edits on System 7, 20 or more years ago). But something tells me it's going to take more than 30 seconds to persuade Jason to update his API calls...
>
> * source

Juha
Message 59208 - Posted: 30 Jul 2016, 19:32:28 UTC - in response to Message 59206.  
Last modified: 30 Jul 2016, 19:46:42 UTC

> But something tells me it's going to take more than 30 seconds to persuade Jason to update his API calls...


Shrug. Following the docs, main.cpp:

     boinc_parse_init_data_file();
     boinc_get_init_data(app_init_data);

+#if (BOINC_MAJOR_VERSION >= 8) || \
+    ((BOINC_MAJOR_VERSION == 7) && (BOINC_MINOR_VERSION >= 5))
+    if (app_init_data.gpu_device_num >= 0) {
+        gCUDADevPref = app_init_data.gpu_device_num + 1;
+    }
+#endif

     if (!strlen(app_init_data.project_dir)) {
         #ifdef _WIN32


edit: Fixed for 1-based indexing.

This preserves the existing behaviour of running on the best GPU (whateva) if the GPU number wasn't specified in init_data.xml or on the command line. And Jason can keep using the ancient API while the rest of the world can upgrade freely.

FWIW, app_init_data.gpu_device_num was added in a858fe79 (API version 6.13.3) and fixed in 6b81e2ff (API version 7.1.0).
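The selection logic implied by the patch in this post can be written out as a standalone function. A sketch under stated assumptions: `choose_dev_pref` and `kNoPreference` are hypothetical names, using 0 to mean "no preference" is an assumption, and the thread does not show in which order the real application consults the command line and init_data.xml, so this sketch simply lets init_data.xml win.

```cpp
// Hypothetical sentinel: "no device requested, let the app pick".
constexpr int kNoPreference = 0;

// init_data_device: 0-based device from init_data.xml, negative if absent.
// cmdline_device:   0-based device from a command line argument, negative if absent.
// Returns a 1-based preference, or kNoPreference when neither source names a device.
int choose_dev_pref(int init_data_device, int cmdline_device) {
    if (init_data_device >= 0) return init_data_device + 1;
    if (cmdline_device >= 0) return cmdline_device + 1;
    return kNoPreference;
}
```

The + 1 mirrors the 1-based indexing fix noted in the edit to this post.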

Zalster
Message 59209 - Posted: 30 Jul 2016, 19:40:35 UTC - in response to Message 59170.  

I was able to download some work but the internet crashed again. I don't think they will get to my house today. Sorry, it will probably be a few days before I can get those back to you. Tomorrow I'm at a different location and will try installing the new app there to test. There's a different internet provider there.

Richard Haselgrove
Message 59210 - Posted: 30 Jul 2016, 20:03:38 UTC - in response to Message 59208.  

Thanks. Cross-referenced in the discussion we were having with Jason at Main.

TBar
Message 59213 - Posted: 30 Jul 2016, 22:59:04 UTC

Any decision on a fix yet? Right now what works is to suspend network activity, stop BOINC and remove <api_version>7.5.0</api_version> from client_state.xml, then start BOINC. Run about a dozen tasks then enable networking, report tasks, then rinse and repeat. It's easy to tell if the api line is back, the cards start working in slow motion.

A while back I attempted to find an older boinc-master and could only find "boinc-client_release-7-7.4.14_android_hotfix". I don't think that's old enough, it's not much older than the 7.5 I'm using.

Richard Haselgrove
Message 59214 - Posted: 30 Jul 2016, 23:07:52 UTC - in response to Message 59213.  
Last modified: 30 Jul 2016, 23:09:58 UTC

Let Eric have his weekend in peace. I think we've diagnosed the problem thoroughly, and I've given *you* three alternative hotfixes whereby *you* could prepare a fixed binary which *you* could email to Eric with a request for a replacement deployment (explain why it's needed, by reference to this thread if necessary).

All three hotfix possibilities are single byte binary edits, in your choice of one out of three possible files. No need to dig into ancient archaeology to find older copies of unrecognisable files.

(edit - and I'm off to bed now. You're on your own for the next 8 hours.)
 
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.