Cannot kill stuck tasks.

Joseph Stateson (Project Donor)
Volunteer tester
Joined: 27 May 99
Posts: 268
Credit: 70,614,948
RAC: 35,802
United States
Message 2020803 - Posted: 27 Nov 2019, 19:03:52 UTC
Last modified: 27 Nov 2019, 19:06:54 UTC

I tried the following:
jstateson@h110btc:/usr/bin$ boinccmd --quit
can't connect to local host

root@h110btc:/var/lib/boinc/projects# sudo killall -v boinc
boinc: no process found

sudo kill -9 12374

However, the tasks are all still running, hours after they timed out. Their CPU % changes as they get time slices, and occasionally the shared memory shows a change. I would hope that if a task is "stuck" and the client cannot kill it, the client would know not to assign subsequent tasks to the same device.

In other news, my cheap p102-100 "mining gtx1080ti" seems to work fine with that special SETI client, unlike the even cheaper p104-90, one of which got stuck.
ID: 2020803
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15060
Credit: 4,314,065
RAC: 2,244
Netherlands
Message 2020824 - Posted: 27 Nov 2019, 20:46:38 UTC - in response to Message 2020803.  

Try stopping the science applications that are still running. It looks like they were orphaned when the BOINC client process stopped, quit, or crashed. That's why you can't stop BOINC: it's no longer running. Your image shows only the science apps running, not the client. It does show that the BOINC user runs those science apps, but the BOINC user isn't the same thing as the BOINC client.

Kill the processes by PID: https://www.linux.com/tutorials/how-kill-process-command-line/

E.g., with SIGNAL replaced by the signal to send, such as -15 (SIGTERM) or -9 (SIGKILL):
kill SIGNAL 14245


You can always try to reboot.
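
A minimal sketch of the suggestion above, assuming the science apps run as a user named `boinc` and have `setiathome` in their command line (both are assumptions; adjust to your setup). The helper name `orphaned_apps` is made up for illustration. It reads `ps` output and prints the PID of every matching process whose parent is PID 1, which is where orphaned processes normally end up once the client that started them is gone.

```shell
# orphaned_apps USER PATTERN   (hypothetical helper, not part of BOINC)
# Reads lines of "PID PPID USER ARGS..." on stdin, as produced by
#   ps -e -o pid= -o ppid= -o user= -o args=
# and prints the PID of each process owned by USER whose command line
# contains PATTERN and whose parent is PID 1 (i.e. it was orphaned).
orphaned_apps() {
    awk -v user="$1" -v pat="$2" '
        $2 == 1 && $3 == user {
            args = $0
            # Drop the three leading fields so only the command line is searched.
            sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", args)
            if (index(args, pat) > 0) print $1
        }'
}

# Typical use:
#   ps -e -o pid= -o ppid= -o user= -o args= | orphaned_apps boinc setiathome
# Then, for each PID printed, try a polite signal before a forced one:
#   sudo kill -TERM 14245
#   sudo kill -KILL 14245   # only if the process is still there a few seconds later
```

If nothing is printed, the apps may still have a live parent, or they may run under a different user name than assumed here.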
ID: 2020824

Joseph Stateson
Message 2020830 - Posted: 27 Nov 2019, 21:19:53 UTC - in response to Message 2020824.  

Thanks, Jord! Will try that next time.
Makes sense.
ID: 2020830

Jord
Message 2020834 - Posted: 27 Nov 2019, 21:45:02 UTC - in response to Message 2020830.  

So what did you do this time then? I thought you still had the processes running.
ID: 2020834

Joseph Stateson
Message 2020869 - Posted: 28 Nov 2019, 2:29:49 UTC - in response to Message 2020834.  
Last modified: 28 Nov 2019, 2:44:27 UTC

I rebooted, which cleared everything.

However, that same GPU failed again and I cannot get rid of its stuck tasks. I do not think I need to, though, as I excluded the GPU and it is no longer being assigned tasks. I also aborted the tasks it was executing, as the "exclude_gpu" setting did not abort them and I did not want to wait for them to time out. Below is a text grab from the screen:

6238  boinc  30  10  36.8G  30100  10880  R  50.5  0.1   2h32:11  ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5
6644  boinc  30  10  36.8G  25560  10728  R  56.0  0.1   2h08:57  ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5
7571  boinc  30  10  36.8G  30104  10880  R  85.0  0.1   1h25:15  ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5
8622  boinc  30  10  36.8G  25548  10724  R  105.  0.1  30:22.42  ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5


I tried various kills on 6238 through 8622 but nothing happened. Is it harmless to leave these alone? I do see the %cpu changing constantly so I assume they are getting a time slice. I am going to pull the card. It is a p106-90 and is not very efficient and is probably overheating.
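
One way to see why a `kill` appears to do nothing is to look at the process state on Linux. The sketch below reads the one-letter state from `/proc/<pid>/status`. The function names `proc_state` and `explain_state`, and the optional second argument (an alternative root directory, there only so the function can be tried against test data), are illustrative. As a general rule a process acts on a signal, even SIGKILL, only when it gets back out of kernel code, so a task stuck inside a driver call may ignore every kill until the driver lets go. Whether that is what happened here is an assumption.

```shell
# proc_state PID [PROCROOT]   (hypothetical helper; Linux only)
# Prints the one-letter state of PID, read from PROCROOT/PID/status.
# PROCROOT defaults to /proc. Returns 1 if the process cannot be read.
proc_state() {
    pid=$1
    procroot=${2:-/proc}
    [ -r "$procroot/$pid/status" ] || return 1
    while read -r key value _rest; do
        if [ "$key" = "State:" ]; then
            echo "$value"
            return 0
        fi
    done < "$procroot/$pid/status"
    return 1
}

# explain_state LETTER: a short reminder of what each state means.
explain_state() {
    case $1 in
        R) echo "running or runnable" ;;
        S) echo "interruptible sleep: signals are handled promptly" ;;
        D) echo "uninterruptible sleep: signals wait until the kernel call returns" ;;
        Z) echo "zombie: already dead, waiting for its parent to collect it" ;;
        T|t) echo "stopped" ;;
        *) echo "other state, see proc(5)" ;;
    esac
}

# Typical use, with the PIDs from the listing above:
#   for pid in 6238 6644 7571 8622; do
#       s=$(proc_state "$pid") && echo "$pid $s: $(explain_state "$s")"
#   done
```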

As it is, I cannot reset that device because it still has tasks associated with it:

jstateson@h110btc:~$ sudo nvidia-smi -i 5 -r
GPU 00000000:08:00.0 is currently in use by another process.

1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
jstateson@h110btc:~$
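
Before retrying the reset, it can help to list which processes the driver thinks are holding that GPU. A sketch, assuming your `nvidia-smi` supports `--query-compute-apps` with the `gpu_bus_id`, `pid` and `process_name` fields (check `nvidia-smi --help-query-compute-apps` on your driver version). `pids_on_gpu` is a made-up helper that filters that CSV output by bus ID. If the driver has lost the GPU entirely, `nvidia-smi` itself may fail or hang, in which case this will not help.

```shell
# pids_on_gpu BUS_ID   (hypothetical helper)
# Reads "gpu_bus_id, pid, process_name" CSV lines on stdin and prints the
# PIDs whose bus ID matches BUS_ID (compared case-insensitively, since the
# hex digits may be printed in either case).
pids_on_gpu() {
    awk -F', *' -v bus="$1" 'toupper($1) == toupper(bus) { print $2 }'
}

# Typical use, with the bus ID from the error message above:
#   nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name \
#              --format=csv,noheader | pids_on_gpu 00000000:08:00.0
```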
ID: 2020869
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 18301
Credit: 411,437,300
RAC: 42,888
United Kingdom
Message 2020898 - Posted: 28 Nov 2019, 7:49:34 UTC

If a GPU is failing consistently, then the best thing to do is physically remove it from the stack rather than trying to get software to solve a hardware problem. Excluding it in software can be done, but the board will still consume power, and may even affect other devices on the same bus.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
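
For reference, the software-side exclusion looks roughly like this. The sketch prints an `<exclude_gpu>` block for the client's `cc_config.xml`. The helper name `exclude_gpu_xml` is invented, and the project URL and device number in the usage line are assumptions based on the listing earlier in the thread.

```shell
# exclude_gpu_xml PROJECT_URL DEVICE_NUM   (hypothetical helper)
# Prints a minimal cc_config.xml that tells the BOINC client not to use
# GPU number DEVICE_NUM for the project at PROJECT_URL.
exclude_gpu_xml() {
    cat <<EOF
<cc_config>
  <options>
    <exclude_gpu>
      <url>$1</url>
      <device_num>$2</device_num>
    </exclude_gpu>
  </options>
</cc_config>
EOF
}

# Typical use (URL and device number are assumptions, see above):
#   exclude_gpu_xml http://setiathome.berkeley.edu/ 5
# If a cc_config.xml already exists in the BOINC data directory
# (/var/lib/boinc in this thread), merge the <exclude_gpu> element into its
# <options> section instead of overwriting the file, then restart the client.
```

As the post above says, this only stops new work from being sent to the card; it does nothing for tasks that are already stuck on it.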
ID: 2020898
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5086
Credit: 769,944,016
RAC: 1,782,047
United States
Message 2020911 - Posted: 28 Nov 2019, 12:30:16 UTC - in response to Message 2020869.  
Last modified: 28 Nov 2019, 12:34:13 UTC

Joseph Stateson wrote: "I rebooted which clear all ... 1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi) ..."
When I have that problem, it's caused by the driver losing contact with the GPU. You can check by opening NVIDIA X Server Settings and looking at the PowerMizer tab for that GPU. If the current values are listed as Unknown, then the driver has lost communication with the GPU and there isn't any way to control it; you must reboot to regain communication. I've always been able to fix that problem by simply rearranging the power connections to the GPU. If using power adapters, change adapters, or just swap the cable with another GPU. It helps if you know the GPU itself is working, for instance because it works in other configurations or machines. Always make sure the GPU is connected to just one power supply.
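
On a machine without a desktop session, the kernel log is another place to look. A sketch, assuming the NVIDIA driver reports its errors in lines containing `NVRM: Xid` (the exact wording varies by driver version). `xid_events` is a hypothetical helper that filters those lines, optionally narrowing them to one PCI address.

```shell
# xid_events [PCI_ADDR]   (hypothetical helper)
# Reads kernel log lines on stdin and prints those that contain "NVRM: Xid".
# If PCI_ADDR is given (e.g. 08:00), only lines mentioning it are printed.
xid_events() {
    awk -v bus="${1:-}" '
        index($0, "NVRM: Xid") && (bus == "" || index(toupper($0), toupper(bus)))'
}

# Typical use:
#   sudo dmesg | xid_events 08:00
#   journalctl -k | xid_events
```

An empty result does not prove the card is healthy; it only means the driver logged nothing in that form.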
ID: 2020911

Joseph Stateson
Message 2020941 - Posted: 28 Nov 2019, 17:29:58 UTC

I thought there was some hope, as there was an "R" in the state column, but if nvidia-smi says "can't find device, please reboot" then it would appear that not much can be done.

I put another board in its place, a quality EVGA 1060. It is working fine with the same riser and cable, and I have not had a problem.
ID: 2020941



 
©2020 University of California
 