Looks like Optiminer uses a Global Worksize of 524288, 4, 1 and a Workgroup Size of 256, 1, 1 for each round.
And it uses a Global Worksize of 640000, 4, 1 and a Workgroup Size of 64, 1, 1 for Sol detection.
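To put those launch sizes in perspective, here is a rough sketch that turns them into work-item and work-group counts. The helper name `launch_shape` is made up for illustration, and it assumes the global size is an exact multiple of the work-group size in every dimension (which holds for both sets of numbers above).

```python
from math import prod

def launch_shape(global_size, local_size):
    """Return (total work-items, work-groups per dimension, total work-groups).

    Hypothetical helper: assumes each global dimension is an exact
    multiple of the matching work-group dimension.
    """
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of the work-group size")
    groups = tuple(g // l for g, l in zip(global_size, local_size))
    return prod(global_size), groups, prod(groups)

# Sizes quoted above for one round and for Sol detection.
round_launch = launch_shape((524288, 4, 1), (256, 1, 1))
sol_launch = launch_shape((640000, 4, 1), (64, 1, 1))

print("round:", round_launch)
print("sol:  ", sol_launch)
```

So each round launch is about 2.1 million work-items in 8192 work-groups, and the Sol detection launch is 2.56 million work-items in 40000 smaller work-groups.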
I just don't know the exact specifics of how that can be used for better efficiency. I looked at the IL/ISA code, and it's very hard to read pure IL properly.
Each round has around 30% occupancy, with the limiting factor being LDS (on my Hawaii card).
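For anyone wondering how LDS ends up capping occupancy, here is a back-of-the-envelope sketch. The hardware limits are my assumptions for a GCN part like Hawaii (64 KB of LDS per compute unit, 4 SIMDs per CU, at most 10 wavefronts per SIMD, 64 work-items per wavefront), and the 20 KB per work-group figure is purely hypothetical, picked only because it reproduces roughly 30%. The real LDS usage has to come from the profiler or the ISA dump.

```python
# Assumed per-CU limits for a GCN card such as Hawaii; verify against your profiler.
LDS_BYTES_PER_CU = 64 * 1024
SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 10
WAVE_SIZE = 64

def lds_limited_occupancy(lds_bytes_per_group, group_size):
    """Fraction of wavefront slots on one CU that can be filled when LDS is the only limit."""
    if lds_bytes_per_group <= 0:
        raise ValueError("LDS usage per work-group must be positive")
    max_waves_per_cu = SIMDS_PER_CU * MAX_WAVES_PER_SIMD
    waves_per_group = -(-group_size // WAVE_SIZE)            # ceiling division
    groups_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_group  # whole work-groups that fit in LDS
    waves_per_cu = min(groups_per_cu * waves_per_group, max_waves_per_cu)
    return waves_per_cu / max_waves_per_cu

# Hypothetical 20 KB of LDS per 256-item work-group: 3 groups fit, 12 of 40 wave slots used.
occupancy = lds_limited_occupancy(20 * 1024, 256)
print(f"{occupancy:.0%}")
```

Under these assumptions, a 256-item work-group would need to drop to 16 KB of LDS or less before a fourth work-group fits on the CU.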
I'm not sure where I'm going with this post; I'm just hoping someone with more knowledge than me can use it …
Silentarmy seems to be limited by VGPRs instead (vector GPRs per work item).
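A similar rough estimate works when VGPRs are the limit. This sketch assumes the usual GCN budget of 256 VGPRs per SIMD lane and a cap of 10 wavefronts per SIMD, and it ignores register allocation granularity. The register counts in the loop are only illustrative, not measured from Silentarmy.

```python
# Assumed GCN limits; allocation granularity is ignored in this sketch.
VGPRS_PER_LANE = 256
MAX_WAVES_PER_SIMD = 10

def vgpr_limited_waves(vgprs_per_work_item):
    """Wavefronts per SIMD that fit when VGPR usage is the only limit."""
    if vgprs_per_work_item <= 0:
        raise ValueError("VGPR count must be positive")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // vgprs_per_work_item)

# Illustrative register counts only.
for vgprs in (24, 64, 84, 128):
    waves = vgpr_limited_waves(vgprs)
    print(f"{vgprs:3d} VGPRs -> {waves} waves/SIMD ({waves / MAX_WAVES_PER_SIMD:.0%} occupancy)")
```

The point is that occupancy falls off in steps: a kernel using more than 128 VGPRs can only keep one wavefront per SIMD in flight.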
I find it easier to read the ISA, honestly - dump that.
He seems to use three kernels per round, and because he’s a good coder and marks params that are only inputs as const, we can infer which params are inputs, which are in/outs, and which are pure outputs.
OK, I'll try looking at the ISA. I'll screenshot the kernel execution; maybe that will help figure stuff out too.
Screenshot? What do you mean? A regular screenshot is pretty useless - stepping through the instructions leading up to the clEnqueueNDRangeKernel() call with a debugger and finding out what he does right before might be helpful.
Yeah, I guess the screenshot would only show the overall kernel calls. I have to find the clEnqueueNDRangeKernel DLL call in the debugger, which I think is possible… I'm still figuring out how to use the ARK disassembler.
I’d just use GDB and breakpoint the damn thing.