13 GPU NVIDIA Rig stuck and not restarting (all GPUs) after an hour

I’m running a rig with 4 1060s, 2 1070s, 1 1080, and 8 1080 Tis. Unfortunately, after about an hour it completely stops: all GPUs sit at 0 Sol/s with the following error:

ERROR: Looks like GPUX are stuck he not respond.
ERROR: Looks like GPUX are stopped. Restart attempt.

I have the cards spread across 6 power supplies, and I’m not entirely sure how to debug further.

I just unplugged the last 3 cards and restarted it to see whether a single card is causing the problem.

Any other suggestions on how to debug this? Given it doesn’t fail until after an hour this won’t be fun to debug.

What is the wattage of each PSU, and what is connected to it (GPUs, risers, HDD, motherboard)? Details…

How are the PSUs connected together? Please post a picture or a link to what you are using.

Motherboard (MB):
Overclocked (Yes or No):
Overclock Settings:
O/S:
Mining software / version:
Additional Information (Please include bat or config settings):

Motherboard (MB): ASRock H110 Pro BTC+ 13GPU
Overclocked (Yes or No): No
Overclock Settings: N/A
O/S: ethOS 1.2.9
Mining software / version: EWBF’s Zcash CUDA miner. 0.3.4b

Additional Information (Please include bat or config settings):
Photo of the rig (sorry, I’m not home, so I can post a better one later): https://www.dropbox.com/s/2hehdad7n66heer/mining%20rig.JPG?dl=0

My local.conf for ethos:

maxgputemp 85
stratumproxy enabled

globalminer ewbf-zcash
proxywallet MYWALLET
proxypool1 zec-us-east1.nanopool.org:6666
proxypool2 zec-us-west1.nanopool.org:6666
flags --cl-global-work 8192 --farm-recheck 200
globalfan 85

“The Hanging Towers”. Ha!

Here are some thoughts that could help brainstorming a solution:

Did you electrically isolate the risers on the motherboard so that the PCI connectors don’t touch each other? (I mean the space between each riser on the motherboard side.)

Did you try EWBF 0.3.4c or dstm, by any chance?

What are your PSUs, and what is each one connected to?
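To put rough numbers on the PSU question, here is a small Python sketch that adds up the estimated draw per power supply. The TDP figures are NVIDIA's reference board power values (partner cards can differ), while the layout, the 150 W allowance for motherboard and risers, and the 80% headroom rule are purely hypothetical assumptions for illustration, not details from this rig.

```python
# Reference board power in watts for NVIDIA reference designs.
# Partner cards and overclocked models may draw more.
TDP_W = {"1060": 120, "1070": 150, "1080": 180, "1080ti": 250}


def psu_load(cards, extra_w=0):
    """Estimated draw in watts for a list of card keys plus other loads."""
    return sum(TDP_W[card] for card in cards) + extra_w


def check_psus(layout, headroom=0.8):
    """Return {psu_name: (load_w, budget_w, ok)}.

    layout maps a PSU name to (rated_watts, [card keys], extra_watts).
    headroom is the fraction of the rated wattage we allow ourselves to use
    (0.8 is a rule-of-thumb assumption, not a specification).
    """
    report = {}
    for name, (rated_w, cards, extra_w) in layout.items():
        load = psu_load(cards, extra_w)
        budget = rated_w * headroom
        report[name] = (load, budget, load <= budget)
    return report


# Hypothetical layout: replace with the real PSU ratings and card assignment.
layout = {
    "psu1": (1000, ["1080ti", "1080ti", "1080ti"], 150),  # plus board and risers
    "psu2": (750, ["1080ti", "1080ti"], 0),
}

for name, (load, budget, ok) in check_psus(layout).items():
    status = "OK" if ok else "OVER"
    print(f"{name}: {load} W of {budget:.0f} W budget -> {status}")
```

If a PSU comes out over budget at stock settings, that would point toward a power issue, though only measuring at the wall or testing with fewer cards can confirm it.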

Now that I have read your reply:

If you are running all cards at stock settings, you might be pulling power right up to the limit. Try lowering the power limit to 60% and see whether everything mines smoothly; if it does, your issue is power related.
NVIDIA cards are most efficient with a lowered power limit anyway, so even if that is not the problem I would strongly suggest you consider it.
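As a rough illustration of the 60% idea, the Python sketch below turns each card's default power limit into an `nvidia-smi -i <index> -pl <watts>` command. The wattages in the example are assumed reference values, the helper name is made up for this post, and on ethOS the power limit may be managed through the OS configuration instead, so treat this as a sketch rather than the way to do it on this rig.

```python
def power_limit_commands(default_limits_w, fraction=0.6, min_limits_w=None):
    """Build nvidia-smi commands that cap each GPU at a fraction of its default limit.

    default_limits_w: {gpu_index: default power limit in watts}
    min_limits_w:     optional {gpu_index: lowest limit the card accepts}
    """
    min_limits_w = min_limits_w or {}
    commands = []
    for index in sorted(default_limits_w):
        target = round(default_limits_w[index] * fraction)
        # Do not go below the card's minimum enforceable limit, if we know it.
        target = max(target, min_limits_w.get(index, 0))
        commands.append(f"nvidia-smi -i {index} -pl {target}")
    return commands


# Assumed default limits: GPU 0 is a 1080 Ti (250 W), GPU 1 is a 1060 (120 W).
for command in power_limit_commands({0: 250, 1: 120}):
    print(command)
```

The commands are only printed here; running them normally requires root, and the real default and minimum limits for each card can be read with `nvidia-smi -q -d POWER`.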

How do you suggest I electrically isolate the PCI risers? They are damn close to each other given the tight spacing, and I’m sure they’re touching a bit. Interestingly enough, I unplugged the last 3 and the 10 cards have been running fine for about 2 hours.

It just failed again after about 2.5 hours. I rebooted it via command line and it’s mining again. But here is something interesting, all solvers were set to 0 except for one. Maybe a bad riser on that card?

CUDA: Device: 0 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 1 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 2 GeForce GTX 1070, 8114 MB i:64
CUDA: Device: 3 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 4 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 5 GeForce GTX 1060 3GB, 3013 MB i:64
CUDA: Device: 6 GeForce GTX 1080 Ti, 11172 MB i:64
CUDA: Device: 7 GeForce GTX 1080, 8114 MB i:64
CUDA: Device: 8 GeForce GTX 1070, 8114 MB i:64
CUDA: Device: 9 GeForce GTX 1060 3GB, 3013 MB i:64
CUDA: Device: 0 Selected solver: 0
CUDA: Device: 1 Selected solver: 0
CUDA: Device: 4 Selected solver: 0
CUDA: Device: 2 Selected solver: 0
CUDA: Device: 3 Selected solver: 2
CUDA: Device: 6 Selected solver: 0
CUDA: Device: 7 Selected solver: 0
CUDA: Device: 8 Selected solver: 0
CUDA: Device: 9 Selected solver: 0
CUDA: Device: 5 Selected solver: 0

Just rebooted, now device 0 is using solver 2, which doesn’t make any sense.
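To compare solver selection across reboots without reading the log by eye, something like the following Python sketch could help. It assumes the log lines look exactly like the ones pasted above; the function names are made up for this post, and a differing solver number is not necessarily a fault.

```python
import re
from collections import Counter

# Matches lines such as "CUDA: Device: 3 Selected solver: 2".
SOLVER_RE = re.compile(r"CUDA: Device: (\d+) Selected solver: (\d+)")


def solvers_by_device(log_text):
    """Return {device_index: solver_number} parsed from miner log text."""
    return {int(dev): int(solver) for dev, solver in SOLVER_RE.findall(log_text)}


def odd_ones_out(solvers):
    """Return the devices whose solver differs from the most common one."""
    if not solvers:
        return []
    common, _count = Counter(solvers.values()).most_common(1)[0]
    return sorted(dev for dev, solver in solvers.items() if solver != common)


sample = """\
CUDA: Device: 2 Selected solver: 0
CUDA: Device: 3 Selected solver: 2
CUDA: Device: 6 Selected solver: 0
"""
print(odd_ones_out(solvers_by_device(sample)))
```

Running this on the log from each boot would show whether the odd solver follows one physical card or moves around, as it seems to here.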

I’m not sure, but I’m passing along something I saw on YouTube that has to do with this board.
For this motherboard, one guy was wrapping masking tape around the risers on the motherboard end so that they don’t touch each other.
It seemed simple and effective.

Is this a question or a statement?

I have 2 of these motherboards and saw how close the risers get, so I isolated them with some cardboard (to make sure they don’t touch; I don’t want to risk a fire or a GPU burning).

Try with some other risers, making sure they don’t touch, and keep us updated!

OK, I’ll give it a try, thanks. Although I’m not sure how you used cardboard; they are SO close.

A type of cardboard, my bad!
I don’t know what it’s called in English; let’s say it’s more like paper (:

Best of luck!

Can you run a separate worker for each GPU? That way, if one hangs, it doesn’t crash the rest of the cards.
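One way to picture this is one miner process per card. The Python sketch below only builds the command lines. It assumes EWBF's `--server`, `--port`, `--user`, `--pass` and `--cuda_devices` options (check them against `--help` for your version), a hypothetical `./miner` path, and a `wallet.worker` naming scheme. ethOS normally launches the miner itself, so this would sit outside its management.

```python
def per_gpu_commands(gpu_count, server, port, wallet, miner_path="./miner"):
    """Build one miner command line per GPU so each card runs in its own process."""
    commands = []
    for gpu in range(gpu_count):
        commands.append([
            miner_path,                        # hypothetical path to the EWBF binary
            "--server", server,
            "--port", str(port),
            "--user", f"{wallet}.gpu{gpu}",    # assumed wallet.worker naming
            "--pass", "x",
            "--cuda_devices", str(gpu),        # restrict this process to one card
        ])
    return commands


# Pool and wallet placeholder taken from the local.conf posted earlier.
for argv in per_gpu_commands(3, "zec-us-east1.nanopool.org", 6666, "MYWALLET"):
    print(" ".join(argv))
```

Each list could be handed to `subprocess.Popen` and restarted on its own when that one process dies, at the cost of more processes to watch.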

Also, is there a possibility that with so many cards you are getting memory allocation issues, with GPUs trying to use the same OS memory space and causing conflicts? It used to be a common problem, with admins having to adjust memory allocations manually in the OS to resolve it.

So I spent a LOT of time on this last night and I think I got it going. I unplugged all cards, then I plugged each one in one at a time. Then I let ethos run for a minute or two to mine, then I’d add another card.
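The one-card-at-a-time approach can be written down as a simple loop. In this Python sketch, `rig_is_stable` is a hypothetical callback standing in for "power up, let it mine for a while, and check that every card still reports a hashrate"; in practice that step was done by hand.

```python
def first_failing_card(cards, rig_is_stable):
    """Add cards one at a time and return the first one that breaks the rig.

    cards:         card labels in the order they will be plugged in
    rig_is_stable: callback taking the list of currently connected cards and
                   returning True if the rig mines without errors
    Returns None if the rig stays stable with every card connected.
    """
    active = []
    for card in cards:
        active.append(card)
        if not rig_is_stable(list(active)):
            return card
    return None


# Toy stand-in for the manual check: pretend the card labelled "gpu7" is bad.
def fake_check(active_cards):
    return "gpu7" not in active_cards


cards = [f"gpu{i}" for i in range(13)]
print(first_failing_card(cards, fake_check))
```

The sketch stops at the first failure; a failure might also come from the combination of cards on one PSU or cable rather than from the last card added.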

I also only had the relevant power supplies on for each card I plugged in (I know, seems obvious). I also stopped running a single PCIe power cable across more than one GPU (although I didn’t follow this in one case: 2 GeForce 1060s).

I also cut some of the anti-static plastic and put those in between each riser so they weren’t touching directly.

Now it’s been running for about 15 hours with no hiccups at all!

Thanks so much everyone!

What color anti-static plastic? Gray and black are conductive; pink is not conductive, but some have lower resistivity. If you want to isolate something, you DON’T use something that is conductive.

I’m using the plastic that graphics cards arrive in. The anti-static plastic.

The silvery gray anti-static plastic bags are conductors, not insulators. You have basically shorted your risers together with a 1K to 10K ohm resistor. I would shut your rig down, get those out of there, and replace them with something that is not designed to conduct electricity.
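For a sense of scale, Ohm's law gives the current such a resistance would pass. This Python sketch assumes the worst case is the 12 V rail sitting across the bag material; that voltage is an assumption for illustration, and the 1K to 10K range is the figure quoted above.

```python
def leakage(voltage_v, resistance_ohm):
    """Return (current in mA, dissipated power in mW) using I = V/R and P = V^2/R."""
    current_ma = voltage_v / resistance_ohm * 1000
    power_mw = voltage_v ** 2 / resistance_ohm * 1000
    return current_ma, power_mw


for resistance in (1_000, 10_000):
    current_ma, power_mw = leakage(12.0, resistance)  # 12 V is an assumed worst case
    print(f"{resistance} ohm at 12 V: {current_ma:.1f} mA, {power_mw:.1f} mW")
```

This only puts a number on the leakage; it says nothing about what stray current does to signal pins, so swapping in a real insulator is still the safer choice.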

Damn, I’ll definitely change it. I guess that wasn’t the problem, though, given it has been 2 days and there hasn’t been an issue?

Any suggestion on what material I should use? I thought about electrical tape but the risers are already so close to each other.

It’s in Chinese, but in the first minute you will see the electrical tape covering it.

I saw it in another video too but can’t find it now :)

Good luck!