I’m running a rig with 4 1060s, 2 1070s, 1 1080, and 8 1080 Tis. Unfortunately, after about an hour it just completely stops: all GPUs sit at 0 Sol/s with the following errors:
ERROR: Looks like GPUX are stuck he not respond.
ERROR: Looks like GPUX are stopped. Restart attempt.
I have the cards spread across 6 power supplies, and I’m not entirely sure how to debug further.
I just unplugged the last 3 cards and restarted to see if maybe it’s a single card causing the problem.
Any other suggestions on how to debug this? Given it doesn’t fail until after an hour, this won’t be fun to debug.
What’s the wattage of each PSU, and what is connected to it (GPU, riser, HD, mobo)? Details…
How are the PSUs connected together? A picture or link to what you’re using would help.
Motherboard (MB):
Overclocked (Yes or No):
Overclock Settings:
OS:
Mining software / version:
Additional Information (Please include bat or config settings):
Here are some thoughts that could help brainstorm a solution:
Did you electrically isolate the risers on the motherboard so that the PCIe connectors don’t “touch” each other (in the gap between each riser on the MB side)?
Did you try EWBF 0.3.4c or dstm, by any chance?
What are your PSUs, and where are they connected?
Just read your update:
If you are running all cards at stock settings you might be pulling power to the limit. Try power-limiting your system to 60% and see if everything mines smoothly; if it does, your issue is power related.
Lowering the power limit is the best efficiency setting for Nvidia cards in any case, so even if it isn’t the problem I would highly suggest giving it a thought. A quick sketch of how to try it is below.
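If your rig is on Linux, you can do this with nvidia-smi (a minimal sketch, assuming the Nvidia driver’s nvidia-smi is on the PATH; the 150 W figure is just a placeholder, so check your card’s supported range with the query first):

    # query the current, default, and min/max power limits
    nvidia-smi -q -d POWER
    # enable persistence mode so settings stick while no client is attached
    sudo nvidia-smi -pm 1
    # set a lower power limit on GPU 0 (placeholder: 150 W, about 60% of a 1080 Ti's 250 W default)
    sudo nvidia-smi -i 0 -pl 150

Repeat the -pl line for each GPU index, or omit -i to apply it to every card that supports it.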
How do you suggest I electrically isolate the PCIe risers? They are damn close to each other given the tight spacing, and I’m sure they’re touching a bit. Interestingly enough, I unplugged the last 3 and the remaining cards have been running fine for about 2 hours.
It just failed again after about 2.5 hours. I rebooted it via the command line and it’s mining again. But here’s something interesting: all solvers were at 0 except for one. Maybe a bad riser on that card?
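Next time it hangs I’ll check the kernel log before rebooting. As I understand it, Nvidia driver faults show up as Xid errors in dmesg along with the PCI bus ID of the offending card, which should help pin down a bad riser. Something like:

    # look for Xid errors (driver-reported GPU faults) in the kernel log
    dmesg | grep -i xid
    # map the PCI bus ID from the Xid line back to a specific card
    nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv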
I have no idea, but I’m passing along something I saw on YouTube somewhere that has to do with this board.
For this motherboard, one guy was wrapping masking tape around the risers on the motherboard end so that they don’t touch each other.
Seemed simple and effective.
I have 2 of these MBs, and I saw how close the risers get, so I isolated them with some cardboard (to make sure they don’t touch; I don’t want to risk a fire or a GPU burning).
Try some other risers, make sure they don’t touch, and keep us updated!
Also, is there a possibility that with so many cards you’re getting memory allocation issues, with GPUs trying to use the same OS memory space and causing conflicts? It used to be a common problem, with admins having to adjust memory allocations manually in the OS to resolve it.
So I spent a LOT of time on this last night and I think I’ve got it going. I unplugged all the cards, then plugged them back in one at a time, letting ethOS mine for a minute or two before adding the next card.
I also only turned on the relevant power supplies for each card as I plugged it in (I know, seems obvious). And I stopped daisy-chaining a single PCIe power cable across more than one GPU (although I broke that rule in one case, with 2 GeForce 1060s).
I also cut up some anti-static plastic and put pieces between the risers so they weren’t touching directly.
Now it’s been running for about 15 hours with no hiccups at all!
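In case it hangs again while I’m not watching, I also threw together a rough watchdog sketch that restarts the miner whenever any GPU reports 0% utilization. It assumes nvidia-smi is on the PATH; minestop and minestart are what I believe the ethOS miner-control commands are, so swap in whatever restarts your miner:

    #!/bin/bash
    # rough watchdog: if any GPU reports 0% utilization, restart the miner
    while true; do
        if nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | grep -qx '0'; then
            echo "$(date): a GPU dropped to 0% utilization, restarting miner" >> /var/log/gpu-watchdog.log
            minestop
            sleep 10
            minestart
        fi
        sleep 60
    done

A momentary dip could trigger a false restart, so in practice you’d probably want to require a few consecutive zero readings before bouncing anything.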
What color anti-static plastic? Gray and black are conductive; pink is non-conductive, though some pink material still has fairly low resistivity. If you want to isolate something you DON’T use something that is conductive.
The silvery gray anti-static plastic bag material is a conductor, not an insulator. You have basically shorted your risers together through a 1K to 10K ohm resistance. I would shut your rig down, get those out of there, and replace them with something that is not designed to conduct electricity.