
Memory Limits and Timing

July 10th, 2009 Stephen Larew

Today I finished my in-place transpose kernel for large arrays.  Unfortunately, it took considerable debugging to get working, but the good news is that it halves the required memory.

I said earlier that the maximum number of grid points is approximately 95×95.  Here’s why.  Take 95×95 = 9025 grid points.  For each grid point, I have to calculate the distance to every other grid point and store both that distance and the index of the other grid point.  That means 95^4 floats and 95^4 ints, at 4 bytes each:  95^4 * (8 bytes) / 1024^2 = 621 MB.  For 96×96 grid points: 648 MB; 97×97: 675 MB; 98×98: 704 MB; 99×99: 733 MB.  This computer’s graphics card (GeForce 8800 GTX) has 768 MB, some of which goes to the display buffer.  So really, I might be able to do a maximum of 99×99 grid points if I turned the resolution and bit depth way down.

Unfortunately, this theoretical maximum on this hardware isn’t reachable yet.  Currently, I compute “windows” for each grid point.  This requires a second array the same size as the distances array, which cuts the maximum from 99×99 to about 83×83 grid points.  I see two options.  I could move the computation to the CPU, which has access to the larger system memory.  Or I could somehow reuse the distances array for the windows calculation.  The former would be slow and the latter tricky, if not impossible.

A third option is to run the computation on a better GPU.  Current consumer cards appear to top out at 2 GB, which would allow 127×127 grid points (counting the distances array alone).
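As a rough check on the memory figures above, here is a small C sketch that recomputes them.  It assumes what the arithmetic above implies: one 4-byte float and one 4-byte int per ordered pair of grid points, so 8 bytes per pair per array.  The function names `pair_table_bytes` and `to_mb` are made up for illustration and are not part of the actual program.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for `arrays` pairwise tables over a side x side grid.
   Assumption: each table stores a 4-byte float and a 4-byte int for every
   ordered pair of grid points, i.e. 8 bytes per pair. */
uint64_t pair_table_bytes(uint64_t side, uint64_t arrays)
{
    assert(side > 0 && arrays > 0);
    uint64_t points = side * side;      /* number of grid points */
    uint64_t pairs  = points * points;  /* side^4 ordered pairs  */
    return pairs * 8u * arrays;
}

/* Convert bytes to MB (1 MB = 1024^2 bytes), rounded to the nearest MB. */
uint64_t to_mb(uint64_t bytes)
{
    const uint64_t mb = 1024u * 1024u;
    return (bytes + mb / 2) / mb;
}
```

With these, `to_mb(pair_table_bytes(99, 1))` gives 733, `to_mb(pair_table_bytes(83, 2))` gives 724 for the distances plus windows arrays, and `to_mb(pair_table_bytes(127, 1))` gives 1985, just under the 2048 MB of a 2 GB card.  None of this accounts for the display buffer or any other allocations.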

On a side note, with large data sets a large amount of time is currently spent aggregating location data to grid points.  I would speed this part up by doing it on the GPU, but the 8800 GTX is a compute capability 1.0 device, meaning it does not support atomic instructions.  I have implemented a CUDA kernel that I know works except for the missing atomicAdd instruction.  So if the program is run on a computer with a device of compute capability 1.1 or better, the CUDA kernel will do the aggregation instead of the CPU, which should speed up the overall runtime.
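The dispatch described above can be sketched in plain C.  This is only an illustration under assumptions: the major/minor numbers would come from the CUDA runtime (for example via `cudaGetDeviceProperties`), the names `has_global_atomics` and `aggregate_cpu` are hypothetical, and the CPU path assumes each point's grid index has already been computed, which may not match how the real code lays out its data.

```c
#include <assert.h>
#include <stddef.h>

/* 32-bit integer atomics on global memory first appear in compute
   capability 1.1, so a 1.0 device such as the 8800 GTX does not qualify. */
int has_global_atomics(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 1);
}

/* CPU fallback: count how many points land on each grid point.
   cell[i] is the (hypothetical) precomputed grid index of point i. */
void aggregate_cpu(const int *cell, size_t n_points,
                   int *counts, size_t n_cells)
{
    for (size_t c = 0; c < n_cells; ++c)
        counts[c] = 0;
    for (size_t i = 0; i < n_points; ++i) {
        assert(cell[i] >= 0 && (size_t)cell[i] < n_cells);
        /* Many threads may hit the same cell, so a GPU kernel would need
           an atomic add for this increment. */
        counts[cell[i]] += 1;
    }
}
```

The host code would call `has_global_atomics` once at startup and pick the GPU kernel or `aggregate_cpu` accordingly.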

Another thing I have noticed is that for large data sets, such as a population of 1 million, reading the input data takes a surprising amount of time.  1 million population points with 75×75 grid points takes about 14 seconds to load.  Is there a faster way to read data from a text file than calling fscanf for each line?
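One approach that might help is to read the whole file into memory with a single fread and then parse the numbers out of the buffer with strtof, which avoids the per-call format-string overhead of fscanf.  The sketch below assumes the input is just whitespace-separated numbers, which may not match the real file format, and the helper names `slurp` and `parse_floats` are made up here.  Whether it is actually faster would have to be measured.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a NUL-terminated heap buffer.
   Returns NULL on failure; the caller frees the buffer. */
char *slurp(const char *path, size_t *len_out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);
    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    buf[got] = '\0';               /* so strtof stops at the end */
    if (len_out) *len_out = got;
    return buf;
}

/* Parse up to max_vals whitespace-separated numbers from text.
   Returns how many values were stored in out. */
size_t parse_floats(const char *text, float *out, size_t max_vals)
{
    assert(text != NULL && out != NULL);
    size_t n = 0;
    const char *p = text;
    while (n < max_vals) {
        char *end;
        float v = strtof(p, &end); /* skips leading whitespace itself */
        if (end == p) break;       /* nothing left to convert */
        out[n++] = v;
        p = end;
    }
    return n;
}
```

Usage would be along the lines of calling `slurp` once on the input file, passing the buffer to `parse_floats`, and then freeing the buffer.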

I ran another timing test last night and have the results.  I will show you on Monday.  They are looking good.

I did not get to start implementing the space-time scan yet.  I just keep finding more things to correct.
