Tromp's solvers

1.2 Joule/Sol would be a new efficiency record.
Can you give any further details for reproducing/verifying this?

Thanks for the update. I will have a couple of hours this weekend to have a go at the CPU challenge.

A laptop? What CPU? Also, how was the code benchmarked?

I’ve not yet exceeded 124 watts (0.124 kW) on these cards under full load.

This Cuckoo Cycle you’re talking about: is it related to Zcash and Equihash? I’m not sure, since it seems different, but you’re discussing it here (just asking anyway). Thanks

Cuckoo Cycle, my own memory-bound PoW design at

GitHub - tromp/cuckoo: a memory-bound graph-theoretic proof-of-work system,

was considered for use as Zcash’s PoW, and may be reconsidered in future.
Cuckoo Cycle and Equihash are the only known instantly verifiable memory-hard PoWs.

Sorry, I probably made a mistake; it seems to be 1.5 or 2 W per Sol/s for these two laptops. I was looking at the 35 W rating instead of the 45 W TDP to get 29.3 S/s. I’m at 3 W per Sol/s on a 3rd-generation i5 (2012, AVX1) as a full desktop (55 W to get 18 S/s while connected to the pool via xenoncat), so I would have thought 1 W per Sol/s is possible with a new chip and some system efficiency.

I am guessing his total laptop power was 45 W with a TDP of 45 W

Here’s another whose CPU’s TDP in watts is 2x his S/s:

Here’s someone telling me his 470 GPU got 45 Sol/s at 70 watts, but 90 W system-wide, so it’s still neck and neck with CPUs.
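(That works out to 70 W / 45 Sol/s ≈ 1.6 J/Sol for the card alone, or 90 W / 45 Sol/s = 2 J/Sol system-wide.)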

The cheapest capital cost I’ve seen is a 270X ($100) getting 45 Sol/s with Claymore. This beats buying used desktops by 2.5x but costs 5 J/Sol.

@tromp Thanks for the explanation

Are CUDA implementations using this sorting method:

“Our radix sort is the fastest GPU sort reported in the literature, and is up to 4 times faster than the graphics-based GPUSort. It is also highly competitive with CPU implementations, being up to 3.5 times faster than comparable routines on an 8-core 2.33 GHz Intel Core2 Xeon system.”

No; I haven’t seen any Equihash GPU implementation using those radix sorts.
Bucket sorting has the advantage of being simpler to implement and takes advantage of the uniform distribution of input values to avoid bucket overflow.
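
For illustration, here’s a minimal sequential sketch of that idea (the names and the 12-bit bucket split are mine, not from any actual solver): with uniformly distributed 20-bit keys, a fixed per-bucket capacity of about twice the mean fill almost never overflows, so no counting pass is needed.

    #include <cstdint>
    #include <vector>

    typedef uint32_t u32;

    // Bucket sort uniform 20-bit keys on their top 12 bits.
    // Uniformity lets every bucket share one small fixed capacity
    // (about 2x the mean fill); the rare overflow is simply dropped,
    // so no counting pass or dynamic allocation is needed.
    void bucket_sort(const std::vector<u32>& keys) {
      const u32 NBUCKETS = 1 << 12;                    // top 12 of 20 bits
      const u32 CAPACITY = 2 * keys.size() / NBUCKETS; // ~2x mean fill
      std::vector<u32> bucket(NBUCKETS * CAPACITY);
      std::vector<u32> fill(NBUCKETS, 0);
      for (u32 key : keys) {
        u32 b = key >> 8;                              // bucket index
        if (fill[b] < CAPACITY)                        // overflow: drop (rare)
          bucket[b * CAPACITY + fill[b]++] = key;
      }
      // each bucket now holds keys sharing their top 12 bits;
      // finish by sorting each bucket on the remaining 8 bits
    }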

That paper’s algorithm was made available in CUDPP:

"Harris et al’s adaptation of radix sort to GPUs uses the radix 2 (i.e., eachphase sorts on a bit of the key) and uses thebitsplittechniqueof [3] in each phase of the radix sort to reorder records bythe bit being considered in that phase. This implementationof radix sort is available in the CUDA Data Parallel Primitive(CUDPP) library "

The Equihash paper’s 4x benefit refers to parallel sorting, not a bunch of CPU-like threads on GPUs. GPU wattage is not being maxed out, giving me the impression more parallel work can be done.

Parallel (GPUs): Quick sort and merge sort
Sequential (CPUs): Radix

The Equihash paper’s citations mention radix and merge sort, and their [49], on which the 4x was based, cites the above paper by CUDPP researchers Satish and Harris.

Here is a very good comparison of sequential and parallel sorting, again citing Harris and Satish for radix and merge sorts. Quicksort also did well in parallel.

The radixSort based on Satish and Harris’s results is in the CUDA SDK via CUDPP.

http://cudpp.github.io/cudpp/1.1.1/index.html

But the 4x seems to apply only “if GPU RAM is limited”, so a sequential method may be best for both CPU and GPU, hence radixSort for both.

Yes, but it generates lots of tiny scattered writes, which wastes memory bandwidth. GPUs cannot write less than 256 contiguous bits at a time. Even an entire row of the 200,9 table is less than that, and a single cell (the unit used in sorting) is only 20 bits.

Mergesort and radix (for “normal” digit sizes) saturate the memory bus with long streaming reads and writes. Here, “normal” means smaller than log_2(sram_size).
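
To make that concrete, a minimal sequential sketch of one such pass (illustrative only, not anyone’s actual kernel): a histogram plus prefix sum gives every digit value a contiguous output region, so each pass is two streaming reads and one write.

    #include <cstdint>
    #include <vector>

    typedef uint32_t u32;

    // One LSD radix pass on a 4-bit digit: histogram, exclusive
    // prefix sum, then scatter into 16 contiguous output regions.
    // Five such passes cover a 20-bit key.
    void radix_pass(std::vector<u32>& keys, u32 shift) {
      u32 count[16] = {0};
      for (u32 k : keys)                  // streaming read: histogram
        count[(k >> shift) & 15]++;
      u32 start[16], sum = 0;
      for (u32 d = 0; d < 16; d++) {      // exclusive prefix sum
        start[d] = sum;
        sum += count[d];
      }
      std::vector<u32> out(keys.size());
      for (u32 k : keys)                  // streaming read + write
        out[start[(k >> shift) & 15]++] = k;
      keys.swap(out);
    }

On a GPU the per-digit output is staged in SRAM so the writes go out as long bursts, which is where the digit-size bound above comes from: you need one staging buffer per digit value.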

Vote Tromp!

The slots that are written to random memory locations vary in size from 28 bytes in the first round down to 8 bytes in the last round.

My GPU code for slot writing looks like this:

    slot1 &xs = htl.hta.trees1[r/2][xorbucketid][xorslot]; // RANDOM ACCESS
    xs.attr = tree(bucketid, s0, s1, xhash);               // record which pair of slots this xor came from
    for (u32 i = htl.dunits; i < htl.prevhashunits; i++)   // xor the remaining hash words
      xs.hash[i-htl.dunits].word = pslot0->hash[i].word ^ pslot1->hash[i].word;

xs.attr is 4 bytes and so is each xs.hash[…].word
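Those sizes line up with the 28-to-8-byte range above: after round r roughly 200 − 20·(r+1) hash bits remain to be xored, so the first round stores six 4-byte hash words plus the 4-byte attr (28 bytes), while the last round stores just one (4 + 4 = 8 bytes).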

I remember doing a test where I replaced the random access with a sequential one (just to see the performance, while producing bogus results). I was kind of surprised that it was only 2 to 3 times faster.

So avoiding the random writes by doing what those radix sorts do, even at 2 bits at a time, would take 10 steps to cover 20 bits, and thus be way slower…

Edit: I deleted this post because I had assumed Equihash was keys-only data.

Can someone help me with the following?

  1. Can we join a number of mining machines together as a single rig, or combine different CPUs’ hash power for solo mining?

  2. Which is the fastest option for solo mining: CPU or GPU?

  3. What are the best specs to achieve maximum hashpower with an investment of about $15,000?

I would appreciate it if someone with an in-depth assessment could guide me further.

Hi, I don’t remember what it was called, but I saw researchers bind multiple servers together as one for use with hashcat. But I’m not sure it would be of any use for mining.

Thanks @lexele for the prompt reply.
I would be thankful if someone could guide me in this regard. Moreover, is it advisable to solo mine with a relatively high hash rate, like 500-700 H/s?
Someone please guide me.

Such large memory transfers are not necessary. You only need to sort the current 20-bit-wide column.
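
Concretely (my packing, just to illustrate the point): the sort key can be the 20-bit column digit plus a row index packed into a single 32-bit word, so the sort moves only 4 bytes per entry, and the remaining hash bits are fetched by index just for actual collisions.

    #include <cstdint>

    typedef uint32_t u32;

    // Pack the 20-bit column digit with a 12-bit row index so one
    // 32-bit word is all the sort has to move; sorting these words
    // groups equal digits, and the full slot is only read back
    // (via the index) when two digits actually collide.
    inline u32 pack(u32 digit20, u32 row12) {
      return (digit20 << 12) | row12;    // digit in the high bits
    }
    inline u32 digit_of(u32 key) { return key >> 12; }
    inline u32 row_of(u32 key)   { return key & 0xfff; }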

If the switch to 144,5 occurs, this effect will become even more pronounced.

Can you express this statement in terms of concepts in the Equihash paper?

Well, sure: if you insist on shipping around “28 bytes in the first round down to 8 bytes in the last round”, that’s between 224 and 64 bits per access, which is around 3x to 10x more data than actually needs to cross the memory bus. So it’s not surprising that adjusting the access order didn’t improve things.

You can radix sort on digits which are quite a bit larger than 2 bits!
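With d-bit digits, 20-bit keys need only ceil(20/d) passes: 10 passes at 2 bits, but 5 at 4 bits and 4 at 5 bits.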

PS congratulations on your election victory. I am available for cabinet appointments.

The “world record holder” among many sorts is CUDA’s CUDPP radixSort, developed in 2009; it uses 4-bit digits. Thrust and CLOGS (OpenCL) are second. I believe 4-bit digits are ideally fast on 4^2 = 16-bit keys, so 20 bits should be a lot faster than 32. It does not seem much better than merge sort for key-value data, but it’s about 20x faster than the next fastest GPU sort when the data is keys-only and uniform. It pulls away from the pack as the keys get longer.

You also need to compute the xor on all remaining hash bits, so you can identify later collisions.

The Equihash paper didn’t consider this DAG representation, so I refer you to the algorithm description in

lines 184-218 in particular.