Tromp's solvers

You have open sourced your work – Thank you! – please keep my small donation.

No dependencies, I think?! Let me know if you find otherwise…

git clone https://github.com/tromp/equihash

works for me.

I get these results:

3 solutions
3 total solutions
1.75user 0.10system 0:01.86elapsed 99%CPU (0avgtext+0avgdata 216080maxresident)k
0inputs+0outputs (0major+7298minor)pagefaults 0swaps

Which CPU are you using?

Intel Core i7-4790K @ 4.00 GHz

Hey, I thought you were going to open source it! But from what I see, it is proprietary software that nobody else has the right to use or redistribute without prior permission from the author. :wink:

If you want a suggestion, you could add something like this:

Copyright 2016 John Tromp
You may use this package under the MIT Licence. You may use this package under the Transitive Grace Period Public Licence, version 1.0, or at your option, any later version. (You may choose to use this package under the terms of either licence, at your option.) See the file COPYING.MIT for the terms of the MIT Licence. See the file COPYING.TGPPL for the terms of the Transitive Grace Period Public Licence, version 1.0. See TGPPL.PDF for why the TGPPL exists, graphically illustrated on three slides.

I am getting core dumps on 64-bit Ubuntu in VirtualBox on a 64-bit Debian host:

jank@ubuntu-modeli:~/equihash$ make all
g++ -march=native -m64 -maes -mavx -std=c++11 -Wall -Wno-deprecated-declarations -D_POSIX_C_SOURCE=200112L -O3 -pthread  -DATOMIC equi_miner.cpp blake/blake2b.cpp -o equi
g++ -march=native -m64 -maes -mavx -std=c++11 -Wall -Wno-deprecated-declarations -D_POSIX_C_SOURCE=200112L -O3 -pthread  -DSPARK equi_miner.cpp blake/blake2b.cpp -o equi1
g++ -march=native -m64 -maes -mavx -std=c++11 -Wall -Wno-deprecated-declarations -D_POSIX_C_SOURCE=200112L -O3 -pthread  -DJOINHT -DATOMIC equi_miner.cpp blake/blake2b.cpp -o faster
g++ -march=native -m64 -maes -mavx -std=c++11 -Wall -Wno-deprecated-declarations -D_POSIX_C_SOURCE=200112L -O3 -pthread  -DJOINHT equi_miner.cpp blake/blake2b.cpp -o faster1
g++ -g equi.c blake/blake2b.cpp -o verify
time ./equi -h "" -n 0 -t 1 -s | grep ^Sol | ./verify -h "" -n 0
Verifying size 512 proof for equi("",0)
Command terminated by signal 4
0.00user 0.00system 0:00.20elapsed 0%CPU (0avgtext+0avgdata 2760maxresident)k
0inputs+0outputs (0major+124minor)pagefaults 0swaps
time ./equi1
Looking for wagner-tree on ("",0) with 10 20-bits digits and 1 threads
Command terminated by signal 4
0.00user 0.00system 0:00.20elapsed 0%CPU (0avgtext+0avgdata 2692maxresident)k
0inputs+0outputs (0major+120minor)pagefaults 0swaps
Makefile:47: recipe for target 'spark' failed
make: *** [spark] Error 132
jank@ubuntu-modeli:~/equihash$ ./equi
WARNING: use of atomics hurts single threaded performance!
Looking for wagner-tree on ("",0) with 10 20-bits digits and 1 threads
Illegal instruction (core dumped)
jank@ubuntu-modeli:~/equihash$ ./equi1
Looking for wagner-tree on ("",0) with 10 20-bits digits and 1 threads
Illegal instruction (core dumped)
jank@ubuntu-modeli:~/equihash$ ./faster
WARNING: use of atomics hurts single threaded performance!
Looking for wagner-tree on ("",0) with 10 20-bits digits and 1 threads
Illegal instruction (core dumped)
jank@ubuntu-modeli:~/equihash$ cat /proc/version
Linux version 4.4.0-42-generic (buildd@lgw01-13) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #62-Ubuntu SMP Fri Oct 7 23:11:45 UTC 2016
jank@ubuntu-modeli:~/equihash$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609

Thank you, @tromp! I am testing this on our “super” box, which you also have an account on and can use for testing now that the code is (almost) open source (it still needs a license, as Zooko pointed out). There, eqcuda and feqcuda sometimes fail to find solutions, and take multiple seconds to complete when that happens. For example, the first time I ran them, they reported 0 solutions. With other nonce values I got non-zero solution counts, and then trying nonce 0 again finally gave the expected 3 solutions. Retrying after some other tests gave 0 solutions again. You probably have an uninitialized variable somewhere.

Failing run:

$ time ./eqcuda -n 0
Looking for wagner-tree on ("",0) with 10 20-bits digits and 8192 threads (128 per block)
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
9 rounds completed in 3.900 seconds.
0 solutions
0 total solutions

real    0m5.344s
user    0m2.875s
sys     0m2.281s

Working run:

$ time ./eqcuda
Looking for wagner-tree on ("",0) with 10 20-bits digits and 8192 threads (128 per block)
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
9 rounds completed in 0.096 seconds.
3 solutions
3 total solutions

real    0m1.532s
user    0m0.081s
sys     0m1.265s

A round time of 0.096 seconds would suggest 1.88/0.096 = 19.6 Sol/s, right? Per nvidia-smi, this runs on the Maxwell Titan X. The box also has an old Kepler Titan, but you don’t seem to have included an option to choose the CUDA device.

I also tried CPU runs. Works great on i7-4770K, but the scaling to 32 threads on 2x E5-2670 in this “super” box is poor - perhaps running some independent instances with fewer threads each (maybe just 1 thread/instance) would be faster (but would eat up more RAM, which is fine at least for testing - got 128 GB here). Feel free to experiment with this, too.

Edit: “-t 12288” (upping CUDA thread count in accordance with the difference between GTX 980 and GTX Titan X) somehow makes the speed slightly worse for eqcuda, but improves it for feqcuda, which now gets (also not all the time, but when it’s lucky):

$ time ./feqcuda -t 12288
Looking for wagner-tree on ("",0) with 10 20-bits digits and 12288 threads (128 per block)
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
9 rounds completed in 0.076 seconds.
3 solutions
3 total solutions

real    0m1.524s
user    0m0.070s
sys     0m1.328s

This is apparently 1.88/0.076 = 24.7 Sol/s.
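The Sol/s estimates in this post can be reproduced with a small helper. This is only a sketch of the arithmetic: the 1.88 average solutions per run is taken from the post as given, the function name is made up for illustration, and the time used is the "rounds completed" figure, not the wall-clock time of the whole process (which includes roughly 1.5 seconds of startup in the runs shown).

```python
# Assumed average number of solutions per solver run, as used in the post.
AVG_SOLUTIONS_PER_RUN = 1.88


def sols_per_second(seconds_per_run, avg_solutions=AVG_SOLUTIONS_PER_RUN):
    """Estimate the solution rate from the time taken by one solver run."""
    if seconds_per_run <= 0:
        raise ValueError("seconds_per_run must be positive")
    return avg_solutions / seconds_per_run


if __name__ == "__main__":
    # Round times reported by eqcuda and feqcuda in the runs above.
    for label, seconds in [("eqcuda", 0.096), ("feqcuda -t 12288", 0.076)]:
        print(f"{label}: {sols_per_second(seconds):.1f} Sol/s")
```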

MIT LICENSE added…

Thank you! Looks like blake2b.cu is third-party code (right?) - are you sure its author is OK with the code being placed under the MIT license? Was it already released under an MIT-compatible license?

// Blake2-B CUDA Implementation
// tpruvot@github July 2016

There is still a bug in faster[1] with the -r option that I’ll try to iron out soon.

Thanks for pointing that out, solardiz. Seems I jumped the fence there. Let me enquire with the author…

“-maes -mavx” should be removed. “-march=native” is sufficient anyway.
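The "Illegal instruction" crashes reported above are plausibly caused by forcing -maes and -mavx on a virtualized CPU that does not expose those features to the guest, although the thread does not confirm the exact missing feature. A quick way to check on Linux is to look at the flags line in /proc/cpuinfo. The sketch below assumes a Linux x86 guest; the function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, required=("aes", "avx")):
    """List the required features that the CPU does not advertise."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = None  # not Linux, or /proc is unavailable
    if text is None:
        print("cannot read /proc/cpuinfo on this system")
    else:
        print("missing features:", missing_features(text) or "none")
```

If either feature is listed as missing inside the VM, binaries built with the corresponding -m flag can be expected to die with SIGILL (signal 4) as soon as they execute such an instruction.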

ok, compiler flags fixed…

Mr. Tromp, is the BTC address you provided permanent, or will you remove it?

The Cuckoo Cycle Bounty Fund address will be permanent. I will soon update my Cuckoo Cycle project page to list it there…

16 threads:

real    0m0.363s
user    0m4.017s
sys     0m0.452s

16 concurrent instances of equi1 (showing one of them here, the rest are similar):

real    0m3.504s
user    0m3.344s
sys     0m0.160s

3500/16 = ~220 ms, so faster than 363 ms seen with threads.
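A measurement like the one above could be scripted roughly as follows. This is a sketch, not the procedure actually used for the numbers in this post: it assumes the single-threaded binary is ./equi1 in the current directory and that running it with no arguments is enough, and the helper name is hypothetical.

```python
import os
import subprocess
import time


def run_concurrently(command, instances):
    """Start `instances` copies of `command` at once and wait for all of them.

    Returns the wall-clock seconds until the last one exits, plus the exit codes.
    """
    start = time.perf_counter()
    procs = [
        subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(instances)
    ]
    exit_codes = [proc.wait() for proc in procs]
    return time.perf_counter() - start, exit_codes


if __name__ == "__main__":
    binary = "./equi1"  # assumption: built with "make" in the equihash checkout
    instances = 16
    if os.access(binary, os.X_OK):
        elapsed, codes = run_concurrently([binary], instances)
        print(f"{instances} instances: {elapsed:.3f} s total, "
              f"{elapsed * 1000 / instances:.0f} ms per run")
    else:
        print(f"{binary} not found; build it first")
```

Dividing the total wall-clock time by the number of instances gives the effective time per run, which is the figure compared against the multi-threaded run.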

32 threads:

real    0m0.297s
user    0m5.042s
sys     0m0.992s

32 concurrent instances of equi1 (showing one of them here, although their running time varies from 5.5 to 6.2 seconds):

real    0m5.892s
user    0m5.637s
sys     0m0.255s

5900/32 = ~185 ms, so faster than 297 ms seen with threads.

This is over 10 Sol/s on this dual-CPU system. That’s still with memory (de)allocation kept in the loop (which means unnecessary zeroization of memory by the kernel, which should be a measurable overhead at speeds like this), and without explicit use of huge pages. Both are easy things to improve (before going SIMD, like xenoncat did).
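The comparison between threads and independent instances comes down to dividing the wall-clock time by the number of runs that finished in that time. The sketch below redoes that arithmetic from the "real" times quoted in this post; the 1.88 solutions per run is the figure assumed earlier in the thread, and the function name is made up for illustration.

```python
# Assumed average number of solutions per run, as used earlier in the thread.
AVG_SOLUTIONS_PER_RUN = 1.88


def effective_ms_per_run(wall_seconds, concurrent_runs=1):
    """Wall-clock milliseconds per completed solver run."""
    return wall_seconds * 1000.0 / concurrent_runs


# (label, "real" seconds from the post, solver runs completed in that time)
MEASUREMENTS = [
    ("16 threads", 0.363, 1),
    ("16 instances", 3.504, 16),
    ("32 threads", 0.297, 1),
    ("32 instances", 5.892, 32),
]

if __name__ == "__main__":
    for label, wall, runs in MEASUREMENTS:
        ms = effective_ms_per_run(wall, runs)
        rate = AVG_SOLUTIONS_PER_RUN * 1000.0 / ms
        print(f"{label}: {ms:.0f} ms per run, about {rate:.1f} Sol/s")
```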

Yes, the compile flags change worked.

3 solutions
3 total solutions
1.58user 0.08system 0:01.66elapsed 99%CPU (0avgtext+0avgdata 216088maxresident)

That is around 1.9 H/s. For comparison, this system ran 0.15 H/s with the non-optimised solver.

Yes, I should forget about optimizing for average memory use, and just statically allocate all storage like xenoncat does, matching the peak memory.

Btw, about my use of blake. I rewrote some of the code in the blake subdirectory to minimize the blake state and more importantly, to replace the lazy (delayed) calls to compress by eager ones, which helps for headers of length at least 128 bytes, as in zcash, since I pass the midstate around.
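The general idea of passing a midstate around can be illustrated with Python's hashlib, which lets a BLAKE2b state be copied after the fixed part of the input has been absorbed. This is only an analogy, not the code in the blake subdirectory: hashlib buffers input internally and does not expose when compress is called, so it shows state reuse but not the eager-compress change. The header contents, the 4-byte nonce layout, and the class name are all assumptions made for the example.

```python
import hashlib


def digest_from_scratch(header, nonce_bytes):
    """Hash header + nonce with a fresh BLAKE2b state every time."""
    state = hashlib.blake2b(digest_size=64)
    state.update(header)
    state.update(nonce_bytes)
    return state.digest()


class MidstateHasher:
    """Absorb the fixed header once, then reuse that state for every nonce."""

    def __init__(self, header):
        self._midstate = hashlib.blake2b(digest_size=64)
        self._midstate.update(header)

    def digest(self, nonce_bytes):
        state = self._midstate.copy()  # the saved midstate itself is left untouched
        state.update(nonce_bytes)
        return state.digest()


if __name__ == "__main__":
    header = bytes(128)  # hypothetical 128-byte header, all zeros
    hasher = MidstateHasher(header)
    for nonce in range(3):
        nonce_bytes = nonce.to_bytes(4, "little")  # assumed nonce encoding
        assert hasher.digest(nonce_bytes) == digest_from_scratch(header, nonce_bytes)
    print("midstate digests match full digests")
```

With a header of at least one full 128-byte BLAKE2b block, the work of absorbing that block is paid once instead of once per nonce.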