I (finally) tried changing the number of buckets used in my solver, and I seem to get a speedup of more than 50% with 2^12 buckets instead of the former 2^16. I'm still cleaning up the code; it should be committed soon…
And done…
Next on the to-do list: port the recent improvements over to CUDA…