13 GPU NVIDIA Rig stuck and not restarting (all GPUs) after an hour

Geesu · January 22, 2018, 1:41pm

I’m running a rig with 4 1060s, 2 1070s, 1 1080, and 8 1080 Tis. Unfortunately, after about an hour it completely stops: all GPUs sit at 0 Sol/s with the following error:

ERROR: Looks like GPUX are stuck he not respond.
ERROR: Looks like GPUX are stopped. Restart attempt.

I’m not entirely sure how to debug further. I have the cards spread across 6 power supplies.

I just unplugged the last 3 cards and started it again to see whether a single card is causing the problem.

Any other suggestions on how to debug this? Given it doesn’t fail until after an hour this won’t be fun to debug.

CitricAcid · January 22, 2018, 1:47pm

What is the wattage of each PSU, and what is connected to it (GPUs, risers, HD, mobo)? Details…

How are the PSUs connected together? Post a picture or a link to what you are using.

Motherboard (MB):
Overclocked (Yes or No):
Overclock Settings:
O/s
Mining software / version
Additional Information (Please include bat or config settings):

Geesu · January 22, 2018, 1:53pm

Motherboard (MB): ASRock H110 Pro BTC+ 13GPU
Overclocked (Yes or No): No
Overclock Settings: N/A
O/s ethos 1.2.9
Mining software / version: EWBF’s Zcash CUDA miner. 0.3.4b

Additional Information (Please include bat or config settings):
Photo of rig (sorry, I’m not home, so I can post a better one later): Dropbox - mining rig.JPG

Geesu · January 22, 2018, 1:53pm

My local.conf for ethos:

maxgputemp 85
stratumproxy enabled

globalminer ewbf-zcash
proxywallet MYWALLET
proxypool1 zec-us-east1.nanopool.org:6666
proxypool2 zec-us-west1.nanopool.org:6666
flags --cl-global-work 8192 --farm-recheck 200
globalfan 85

johnwisdom · January 22, 2018, 2:35pm

“The Hanging Towers”. Ha!

Here are some thoughts that could help brainstorming a solution:

Did you electrically isolate the risers on the motherboard so that the PCI connectors don’t touch each other? (I mean the space between the risers on the MB side.)

Did you try EWBF 0.3.4c or dstm, by any chance?

What are your PSUs, and where are they connected?

I see your details now:

If you are running all cards at stock settings, you might be pulling power right at the limit. Try power-limiting your system to 60% and see whether everything mines smoothly. If it does, your issue is power related.
NVIDIA cards are most efficient with a lowered power limit in any case, so even if that is not the problem I would strongly suggest you consider it.
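
To put rough numbers on the power question, here is a minimal sketch. It assumes NVIDIA's reference board power for each model (partner cards can differ) and uses the card counts from the opening post. It ignores the motherboard, risers, and PSU efficiency, so treat the result as a lower bound on wall draw, not a measurement.

```python
# Reference board power (TDP) in watts for NVIDIA's stock designs.
# Assumption: partner cards may be rated higher or lower than these.
STOCK_TDP_W = {
    "GTX 1060": 120,
    "GTX 1070": 150,
    "GTX 1080": 180,
    "GTX 1080 Ti": 250,
}


def estimated_gpu_draw(card_counts, power_limit=1.0):
    """Sum the estimated GPU draw in watts at a given power-limit fraction."""
    total = sum(STOCK_TDP_W[model] * count for model, count in card_counts.items())
    return total * power_limit


# Card counts as listed in the opening post of this thread.
rig = {"GTX 1060": 4, "GTX 1070": 2, "GTX 1080": 1, "GTX 1080 Ti": 8}

stock_watts = estimated_gpu_draw(rig)
limited_watts = estimated_gpu_draw(rig, power_limit=0.60)
print(f"stock: {stock_watts:.0f} W, at 60%: {limited_watts:.0f} W")
```

Comparing the per-PSU share of that total against each PSU's rating is the useful part; the same function works on the subset of cards attached to one supply.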

Geesu · January 22, 2018, 3:59pm

How do you suggest I electrically isolate the PCI risers? They are damn close to each other given the tight spacing, and I’m sure they’re touching a bit. Interestingly enough, I unplugged the last 3 and the remaining 10 cards have been running fine for about 2 hours.

Geesu · January 22, 2018, 5:35pm

It just failed again after about 2.5 hours. I rebooted it via the command line and it’s mining again. But here is something interesting: all solvers were set to 0 except for one. Maybe a bad riser on that card?

CUDA: Device: 0 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 1 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 2 GeForce GTX 1070, 8114 MB i:64
CUDA: Device: 3 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 4 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 5 GeForce GTX 1060 3GB, 3013 MB i:64
CUDA: Device: 6 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 7 GeForce GTX 1080, 8114 MB i:64
CUDA: Device: 8 GeForce GTX 1070, 8114 MB i:64
CUDA: Device: 9 GeForce GTX 1060 3GB, 3013 MB i:64
CUDA: Device: 0 Selected solver: 0
CUDA: Device: 1 Selected solver: 0
CUDA: Device: 4 Selected solver: 0
CUDA: Device: 2 Selected solver: 0
CUDA: Device: 3 Selected solver: 2
CUDA: Device: 6 Selected solver: 0
CUDA: Device: 7 Selected solver: 0
CUDA: Device: 8 Selected solver: 0
CUDA: Device: 9 Selected solver: 0
CUDA: Device: 5 Selected solver: 0
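
Spotting the odd one out by eye gets tedious with this many cards. Below is a small sketch that pulls the "Selected solver" lines out of a miner log and reports any device whose solver differs from the most common value. The function name and the sample text are only illustrative; the line format is copied from the log above. Whether a different solver actually indicates a fault is not something this thread establishes.

```python
import re
from collections import Counter

# Matches lines such as "CUDA: Device: 3 Selected solver: 2".
SOLVER_LINE = re.compile(r"CUDA: Device: (\d+) Selected solver: (\d+)")


def odd_solver_devices(log_text):
    """Return {device: solver} for devices whose solver differs from the most common one."""
    solvers = {int(dev): int(sol) for dev, sol in SOLVER_LINE.findall(log_text)}
    if not solvers:
        return {}
    common, _ = Counter(solvers.values()).most_common(1)[0]
    return {dev: sol for dev, sol in solvers.items() if sol != common}


sample = """\
CUDA: Device: 3 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 2 Selected solver: 0
CUDA: Device: 3 Selected solver: 2
CUDA: Device: 6 Selected solver: 0
"""
print(odd_solver_devices(sample))
```

Running it across the logs from several reboots would show whether the odd solver follows one physical card or moves around.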

Geesu · January 22, 2018, 5:42pm

Just rebooted, now device 0 is using solver 2, which doesn’t make any sense.

dbfusion · January 22, 2018, 5:46pm

I have no idea, but I can contribute something I saw on YouTube that has to do with this board.
For this motherboard, one guy was wrapping masking tape around the risers on the motherboard end so that they don’t touch each other.
It seemed simple and effective.

johnwisdom · January 22, 2018, 7:15pm

is this a question or statement?

I have 2 of these MB and I saw how close risers get so I isolated them with some cardboard (to make sure they don’t touch, don’t want to risk a fire or a gpu burning).

Try with some other risers making sure they don’t touch and keep us updated!

Geesu · January 22, 2018, 7:25pm

OK, I’ll give it a try, thanks. Although I’m not sure how you used cardboard; they are SO close.

johnwisdom · January 22, 2018, 7:36pm

a type of cardboard* my bad!
I don’t know what to call it in English; let’s say it’s more like paper (:

Best of luck!

crazycraig · January 23, 2018, 8:08pm

Can you run a separate worker for each GPU? That way, if one hangs, it doesn’t crash the rest of the cards.
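
A rough sketch of what one worker per GPU could look like, building one command line per card. It assumes the miner accepts `--server`, `--port`, `--user`, and `--cuda_devices` as in EWBF's command-line help; the `./miner` path and the `.gpuN` worker suffix are hypothetical. ethOS normally launches the miner itself, so this would only apply when starting the miner by hand.

```python
def per_gpu_commands(miner_path, server, port, wallet, gpu_count):
    """Build one miner command line per GPU so each card runs in its own process."""
    commands = []
    for gpu in range(gpu_count):
        commands.append([
            miner_path,
            "--server", server,
            "--port", str(port),
            "--user", f"{wallet}.gpu{gpu}",  # hypothetical worker naming
            "--cuda_devices", str(gpu),      # assumed flag: restrict to one card
        ])
    return commands


# Print the commands instead of running them; nothing is launched here.
for cmd in per_gpu_commands("./miner", "zec-us-east1.nanopool.org", 6666, "MYWALLET", 3):
    print(" ".join(cmd))
```

Each list could be handed to a process launcher separately, so a hang in one process is at least visible per card.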

crazycraig · January 23, 2018, 8:18pm

Also, is there a possibility that with so many cards you are getting memory allocation issues, with GPUs trying to use the same OS memory space and causing conflicts? It used to be a common problem, with admins having to adjust memory allocations manually in the OS to resolve it.

Geesu · January 23, 2018, 8:25pm

So I spent a LOT of time on this last night and I think I got it going. I unplugged all cards, then I plugged each one in one at a time. Then I let ethos run for a minute or two to mine, then I’d add another card.
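
For anyone repeating this, the one-card-at-a-time procedure boils down to the loop below. It is only a sketch: `is_stable` stands in for "let it mine for a while and watch for 0 Sol/s", and `fake_check` and the `GPU7` fault are made up for the demonstration.

```python
def first_unstable_card(cards, is_stable):
    """Add cards one at a time; return the first card whose addition breaks stability, or None."""
    active = []
    for card in cards:
        active.append(card)
        if not is_stable(active):
            return card
    return None


# Hypothetical stand-in for a real soak test of the currently connected cards.
def fake_check(active):
    return "GPU7" not in active


cards = [f"GPU{i}" for i in range(13)]
print(first_unstable_card(cards, fake_check))
```

With a fault that takes an hour or more to show up, the soak time per step matters more than the loop itself; a minute or two per card may miss it.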

I also only had the relevant power supplies on for each card I plugged in (I know, seems obvious). And I also stopped running a single PCIe power cable across more than one GPU (although I didn’t follow this in one case, for 2 GeForce 1060s).

I also cut some of the anti-static plastic and put those in between each riser so they weren’t touching directly.

Now it’s been running for about 15 hours with no hiccups at all!

Thanks so much everyone!

ZC93 · January 23, 2018, 11:29pm

What color anti-static plastic? Gray and black are conductive; pink is not conductive, but some have lower resistivity. If you want to isolate something, you DON’T use something that is conductive.

Geesu · January 24, 2018, 1:03am

I’m using the plastic that graphics cards arrive in. The anti-static plastic.

ZC93 · January 24, 2018, 2:40am

The silvery gray anti-static plastic bags are conductors, not insulators. You have basically shorted your risers together with a 1K to 10K ohm resistor. I would shut your rig down, get those out of there, and replace them with something that is not designed to conduct electricity.
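
Taking the 1K to 10K ohm figure at face value, Ohm's law gives a feel for the leakage involved. This sketch assumes the worst case where the bag bridges a 12 V pin directly to ground; real contact resistance and which pins actually touch are unknown.

```python
def leakage_current_ma(voltage_v, resistance_ohm):
    """Ohm's law, I = V / R, returned in milliamps."""
    return voltage_v / resistance_ohm * 1000.0


# Assumed worst case: the full 12 V rail across the bag's resistance.
for resistance in (1_000, 10_000):
    current = leakage_current_ma(12.0, resistance)
    print(f"{resistance} ohm across 12 V: {current:.1f} mA")
```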

Geesu · January 24, 2018, 2:05pm

Damn I’ll definitely change it. I guess that wasn’t the problem though given it has been 2 days and there hasn’t been an issue?

Any suggestion on what material I should use? I thought about electrical tape but the risers are already so close to each other.

dbfusion · January 24, 2018, 2:17pm

It’s in Chinese, but in the first minute you will see the electrical tape covering it.

I saw it in another video also, but I can’t find it now.

Good luck!
