When you guys start looking at optimizations, have a look at tromp's CUDA solver.
He just made a 50% speed improvement!