Shared memory version is fast, but has limitations
By transferring the cases/controls data to shared memory and calculating the likelihoods from there, a data set with 1000 populations and 2500 grid points now takes about 15 seconds. That is a big improvement after the previous setback. The limitation is that shared memory is only 16 KB per SM (streaming multiprocessor), which severely limits how much case/control data fits at once.
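To make the 16 KB limit concrete, here is a rough sketch of the arithmetic. The record layout (`CaseControl`, one 4-byte case count and one 4-byte control count per population) and the helper names are my assumptions for illustration, not the layout the real code necessarily uses.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical record layout: one case count and one control count per population.
struct CaseControl {
    float cases;
    float controls;
};

// How many records fit in a given shared-memory budget (in bytes).
inline size_t recordsThatFit(size_t sharedBytes)
{
    return sharedBytes / sizeof(CaseControl);
}

// Shared memory available to one block on `device`, or 0 if the query fails.
inline size_t sharedBytesPerBlock(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}
```

Under that assumed layout, `recordsThatFit(16 * 1024)` gives 2048: the 1000-population data set (8000 bytes) fits, but anything beyond about 2000 populations does not, and nothing else can share the space.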
My solution is to calculate intermediate window data. This is how it works:
1. Load (next) segment i of cases/controls data into SMEM.
2. For each nearest neighbor n:
   - If n is in the range of i, add the appropriate cases/controls to a window array.
3. Repeat steps 1 and 2 (using next segment i).
4. Load (next) window array segment j into shared memory.
5. Iterate over windows in j, calculating likelihoods.
6. Repeat steps 4 and 5 (using next segment j).
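A minimal sketch of how these steps could look as two kernels. Everything named here is an assumption for illustration: the float arrays, the neighbor-major layout of `neighbors`, the segment and tile sizes, and especially `logLikelihood`, which is a stand-in binomial log-likelihood and not the real likelihood function. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

#define SEG  1024  // populations per segment: 2 arrays * 1024 * 4 B = 8 KB of shared memory
#define TILE 256   // windows per tile; both kernels are launched with TILE threads per block

// Placeholder likelihood (an assumption): binomial log-likelihood at the observed case fraction.
__device__ float logLikelihood(float c, float u)
{
    if (c <= 0.0f || u <= 0.0f) return 0.0f;  // fraction is 0 or 1, so the log-likelihood is 0
    float n = c + u;
    return c * logf(c / n) + u * logf(u / n);
}

// First three steps: one thread per window, every block walks the segments in order.
// neighbors is stored neighbor-major (neighbors[j * numWindows + w]) so its reads coalesce.
__global__ void accumulateWindows(const float* cases, const float* controls, int numPops,
                                  const int* neighbors, int k, int numWindows,
                                  float* winCases, float* winControls)
{
    __shared__ float sCases[SEG];
    __shared__ float sControls[SEG];
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    float accCases = 0.0f, accControls = 0.0f;

    for (int segStart = 0; segStart < numPops; segStart += SEG) {
        int segLen = min(SEG, numPops - segStart);
        // Load the segment: consecutive threads read consecutive addresses (coalesced).
        for (int t = threadIdx.x; t < segLen; t += blockDim.x) {
            sCases[t]    = cases[segStart + t];
            sControls[t] = controls[segStart + t];
        }
        __syncthreads();
        // Add only the neighbors that fall inside the current segment.
        if (w < numWindows) {
            for (int j = 0; j < k; ++j) {
                int n = neighbors[j * numWindows + w] - segStart;
                if (n >= 0 && n < segLen) {
                    accCases    += sCases[n];
                    accControls += sControls[n];
                }
            }
        }
        __syncthreads();  // all threads are done before the segment is overwritten
    }
    if (w < numWindows) {
        winCases[w]    = accCases;
        winControls[w] = accControls;
    }
}

// Last three steps: stage a tile of window totals in shared memory, then evaluate each window.
__global__ void windowLikelihoods(const float* winCases, const float* winControls,
                                  int numWindows, float* loglik)
{
    __shared__ float sCases[TILE];
    __shared__ float sControls[TILE];
    for (int tileStart = blockIdx.x * TILE; tileStart < numWindows; tileStart += gridDim.x * TILE) {
        int w = tileStart + threadIdx.x;
        if (w < numWindows) {
            sCases[threadIdx.x]    = winCases[w];
            sControls[threadIdx.x] = winControls[w];
        }
        __syncthreads();
        if (w < numWindows)
            loglik[w] = logLikelihood(sCases[threadIdx.x], sControls[threadIdx.x]);
        __syncthreads();
    }
}

// Host driver: copies inputs to the device, runs both passes, returns one value per window.
std::vector<float> runTwoPass(const std::vector<float>& cases, const std::vector<float>& controls,
                              const std::vector<int>& neighbors, int k)
{
    int numPops = (int)cases.size(), numWindows = (int)neighbors.size() / k;
    size_t popBytes = numPops * sizeof(float), winBytes = numWindows * sizeof(float);
    float *dCases, *dControls, *dWinCases, *dWinControls, *dLoglik;
    int* dNeighbors;
    cudaMalloc(&dCases, popBytes);     cudaMalloc(&dControls, popBytes);
    cudaMalloc(&dWinCases, winBytes);  cudaMalloc(&dWinControls, winBytes);
    cudaMalloc(&dLoglik, winBytes);    cudaMalloc(&dNeighbors, neighbors.size() * sizeof(int));
    cudaMemcpy(dCases, cases.data(), popBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dControls, controls.data(), popBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dNeighbors, neighbors.data(), neighbors.size() * sizeof(int), cudaMemcpyHostToDevice);
    int blocks = (numWindows + TILE - 1) / TILE;
    accumulateWindows<<<blocks, TILE>>>(dCases, dControls, numPops, dNeighbors, k, numWindows,
                                        dWinCases, dWinControls);
    windowLikelihoods<<<blocks, TILE>>>(dWinCases, dWinControls, numWindows, dLoglik);
    std::vector<float> loglik(numWindows);
    cudaMemcpy(loglik.data(), dLoglik, winBytes, cudaMemcpyDeviceToHost);
    cudaFree(dCases);     cudaFree(dControls);  cudaFree(dWinCases);
    cudaFree(dWinControls); cudaFree(dLoglik);  cudaFree(dNeighbors);
    return loglik;
}
```

In this simplified form each thread keeps its window totals in registers across segments and only ever reads its own window in the second kernel, so the shared-memory staging there mirrors the steps rather than buying speed. It would only pay off if threads in a block reuse each other's window data.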
I suspect this will be faster than scattered, per-neighbor global memory accesses, because loading data from global memory in large, coalesced blocks is much faster than fetching it piecemeal. I also suspect this will hold true even if I have to load the cases/controls many times. We will see tomorrow.