Archive

Archive for June, 2009

Shared memory version works fast but with limitations

June 29th, 2009 Stephen Larew Comments off

By transferring the cases/controls data to shared memory and calculating likelihoods, a 1000 population/2500grid point data set now takes about 15 seconds. Big improvements after previous setback. The limitation is that shared memory is 16KB per SM. This limits case/control data severely.

My solution is to calculate intermediate window data. This is how it works:

  1. Load (next) segment i of cases/controls data into SMEM.
  2. For each nearest neighbor n:
    1. If n is in the range of i, add the appropriate cases/controls to a window array.
  3. Repeat steps 1 and 2 (using next segment i).
  4. Load (next) window array segment j into shared memory.
  5. Iterate over windows in j, calculating likelihoods.
  6. Repeat steps 4 and 5 (using next segment j).

I suspect this will be faster than all chaotic global memory access because loading data from global memory in large blocks in a coalesced fashion is much faster. And I suspect this will hold true even if I have to load the cases/controls many times. We will see tomorrow.

Categories: SURF Tags:

Shared memory, occupancy, and other stuff

June 26th, 2009 Stephen Larew Comments off

This afternoon I spent some time experimenting with how the block sizes affect performance and specifically occupancy. As it turns out, maxing out SM occupancy is usually only beneficial when a program is bandwidth bound. Not my case here.

I also tried changing the way the compute distances kernel retrieves global memory. This didn’t work and I found out that the kernel was executing in thousandths of a second so no need to optimize this further yet.

I then spent a lot of time experimenting with using shared memory when computing likelihoods since this would provide the most immediate benefit to the program. But I haven’t got this working yet. More debugging tomorrow.

And, btw, I started my own local, private svn repository to keep track of my changes a couple days ago.

Todos:

  • Experiment with different gpu sorting algorithms.
  • Debug and time shared memory implementation of ScanLocationKernel
Categories: SURF Tags:

GPU Sorting is slow

June 26th, 2009 Stephen Larew Comments off

When computing the distances and then sorting them, it turns out that it is really slow to sort these distance arrays on the GPU. It’s actually faster to transfer the data to the CPU memory and then sort it there and transfer it back to the device.

The odd thing is that when I sort the data on the CPU, although I get a faster soft, the overall program slows down! And when I do GPU sorting the gpu sorting takes longer but the program speeds up overall!

Categories: SURF Tags:

Computing distances is slow

June 25th, 2009 Stephen Larew Comments off

I timed the distance computations and they are taking 22 seconds out of 36 seconds for Input 3 (see previous post). This is almost certainly due to the warps being serialized due to bank conflict in shared memory. I will investigate this tomorrow. This is an independent problem from the slow down due to scanning over all locations.

Categories: SURF Tags:

Profiler Results and some questions

June 25th, 2009 Stephen Larew Comments off

I looked at the profiler output. I used input 3 (previous post) for input data. The results:

My kernels that calculate the scan statistic are limited by the number of registers. My ScanLocationKernel is using 20 registers per thread. A solution would be to use the shared memory instead of registers. Only 0.0625 of the shared memory is being used. Another solution would be to split the kernel into two somehow.

My ComputeDistances kernel might have shared memory bank conflicts. Will investigate further.

Also, some questions. What are realistic data input sizes that the CUDA implementation should be able to handle? How big is max number of locations, and how big would be the maximum number of grid points? If the # of locations or grid points reaches a certain point, there won’t be enough memory on the gpu and the algorithm will have to work on partial data sets and combine results.

Categories: SURF Tags:

Good and bad news

June 25th, 2009 Stephen Larew Comments off

I have the corrected implementation done now. My output matches SatScan’s exactly. that’s the good news.

The bad news is that this new implementation is slow and much more memory hungry.

An array of distances must be calculated and stored for each grid point. Each array of distances is the size the number of locations. This means that for a data input with 2500 grid points and 10,000 locations (each with x cases and y controls), an array of 2500*10,000 floats needs to be calculated and sorted. But then, another array of the same size is needed to store indexes referring to the locations so that when the distances array is sorted, the location that each sorted distance points to is known as well.

Also, the complexity of the current algorithm is O(GMR) (G=# of grid points, M=# of locations, R=number of monte carlo repetitions) compared to what I was previously doing which was O(G^2 R).

Some timing data:
Input 1: 10,000 population, 987 cases, 2500 grid points.
CUDA: 40.94 seconds
SatScan: 109 seconds
CUDA is 2.67x faster.

Input 2: 10,000 population, 987 cases, 625 grid points.
CUDA: 28.6 seconds
SatScan: 29 seconds
About EVEN speed.

Input 3: 1000 population, 98 cases, 10,000 grid points.
CUDA: 34.38 seconds
SatScan: 36 seconds
Nearly even speed.

I didn’t know what to expect for inputs 1 and 2. Input 3 is a surprise for me though. Currently, the cuda algorithm is parallelizing based on grid points. For each grid point, a thread is started in parallel. So I expected the implementation to shine with a high number of parallel threads such as in Input 3. This is not so.

I don’t have an explanation for why Input 1 was faster. Yet. I am analyzing profiler results now.

Categories: SURF Tags:

Grid Data

June 24th, 2009 Stephen Larew Comments off

I just realized I am likely implementing the grid incorrectly which is causing my results to differ from SatScan’s. According to algorithm 14.3.6 from Kulldorff (1999), at every grid point, as the scanning window is expanded, only one new location (not grid point) should be added. I had been adding grid points for each window expansion. So, I spent much of the afternoon changing this. I expect significant slow downs for the data sets where I was using 10,000 population points but only 2500 grid points. I have no results yet, but I will hopefully have some tomorrow morning.

Categories: SURF Tags:

Kernel Instruction Timing

June 24th, 2009 Stephen Larew Comments off

I did a quick experiment with instruction timing. When calculating the likelihood values, we area actually calculating the log of the likelihood. Say L = A^x*B^y/C^z. Then we’re calculating the log of L: ln(L) = ln(A^x*B^y/C^z) = x*ln(A)+y*ln(B)-z*ln(C). In the kernel code, i had been calculating L as x*ln(A)+y*ln(B)-z*ln(C). I tried using the pow function and calculating A^x*B^y/C^z and then taking the ln of that and it turned out to be nearly 10 seconds slower execution time. The only two math functions I use in the kernels are sqrt and log. Log appears to be faster than using powers and I can’t substitute or not use the sqrt. Maybe I can look at other inefficient instructions?

Categories: SURF Tags:

More on Texture Caching

June 24th, 2009 Stephen Larew Comments off

I have decided to remove the texture caching for now. I will explain why.

When finding the maximum likelihood (scan statistic), a window is placed at each grid point. For each grid point, the window is grown in size, increasing enough each time to include the closest grid point, and updates the number of cases and controls in the window as it grows to include new grid points.

As the window grows, it must access an array with # of cases and # of controls at each grid point. The problem is that its access into these cases and controls arrays is unpredictable. Each thread accesses a random location in the arrays, resulting in two (very slow) global memory access per thread per window size. This adds up very fast!

The easiest solution would be to limit the number of grid points so that the grid point cases and controls arrays could be transferred to shared memory (as fast as registers). But without a limitation on grid size, this isn’t possible.

So, my solution was to try using the texture caching/fetching feature of CUDA to possibly help the cases/controls array access. CUDA and the graphics card cache parts of the cases and controls array into fast memory so that accesses are served quickly. But cache misses result in global memory reads which slow down the code. Since I’m accessing rather random locations in the cases and controls arrays, their are lots of cache misses. This is why I’m going back to the old global memory accesses.

There are two possibilities: 1) Limit the size of the grid. 2) Completely change the way the algorithm scans the data.

Categories: SURF Tags:

Texture Fetching not working as expected

June 23rd, 2009 Stephen Larew Comments off

I implemented texture fetching in the likelihood calculation function. As the scanning window grows in radius, it must sample in an unpredictable order from an array of points containing case/control data. This array is too large to fit into device shared memory (fast!) so it must stay in global memory (slow…). By using texture fetching on this array though, the device might be able to cache parts of the array and increase speed. Or so I thought.

It seems that the texture fetching is same speed or slower as the non texture fetching version. In addition, the texture fetching version’s run time is more variable compared to the non texture fetching version. I am speculating that the texture fetches are causing too many cache misses. For comparison, a data set with 2500 grid points ran at 15.01 seconds when using global memory access. The texture fetching version on the other hand has not reached that speed and has run from around 16 to 18 seconds long.

It might be that texture fetching is not going to work. Tomorrow, I will see if the profiler will provide any clues as to the slow down.

Categories: SURF Tags: