UI done
I have completed the UI portion of the SaTScan addition. I just need to figure out the data input and cluster drawing mechanisms, and then it will all be done.
I decided to focus on the reporting of clusters today. When the user clicks the menu item to run SaTScan, a dockable widget pops up. One tab holds the input parameters, and the other tab lists the detected clusters. I also made a little progress on my lab report. As for the input data, I couldn’t find any variables that directly hold the sick patients or all patients. I did find an array in commented-out code that dealt with sick and all patients. Maybe you could be more specific about which data to use?
I added the class that calls SaTScan. I can write out data, run the program, and read in the results. What kinds of options should the user have? (e.g. control over the number of Monte Carlo replications)
I now need to know which data to pass along to SaTScan, and how.
In the meantime, I worked on adding a pop-up progress bar to notify the user of SaTScan’s progress. It’s not working yet.
I spent considerable time getting IVTEK_QT to compile. It turned out I had to recompile Qt with some option. I began adding the UI element that calls the SaTScan class to run SaTScanBatch.exe.
I wrote an abstract draft and emailed it to you.
I have an idea for speeding up the Monte Carlo repetitions, but I don’t know when I’ll have time to try it out. Adding SaTScan to the IVTEK_QT program, writing the SURF report, and making the poster all in the next two weeks is enough for now. Unless you want me to try it out? I can explain it to you if you’d like.
I decided to try implementing the MD5 hash random generator directly in a device function that is called by the GenerateNullHypothesisKernel kernel. This would skip the use of global memory, possibly speed up the random generator, and provide actual speedup. Unfortunately, things became confusing.
I implemented this and the null hypothesis generation sped up considerably, but the scanning of data slowed down! I realized this was because of the way kernel launches are handled. A kernel launch is an asynchronous call, so to time a kernel’s runtime individually you must add a cudaThreadSynchronize() call right after the kernel launch to wait for the kernel to finish. Without it, the timer stops while the kernel is still running, and the remaining time is charged to whatever blocking call comes next.
I need to test this again, but it seems that random number generation isn’t the culprit; the scanning and calculation of likelihoods is what is actually slow. I will test this again in the morning.
I am implementing a different way of calculating likelihoods. I will do a sum-scan (prefix sum) on the window data, which will allow each thread to calculate just one likelihood. There will be one thread per window instead of one per grid point. I suspect this will be faster, since it should provide the maximum amount of concurrency. We will see.
Today I finished my in-place transpose kernel for large arrays. Unfortunately, it took considerable debugging to make it work. The good news is that it has halved the required memory.
I have said earlier that the maximum number of grid points is approximately 95×95. Here’s why. Take 95×95 = 9025 grid points. For each grid point, I have to calculate the distance to every other grid point and store both the distance to and the index of each grid point. This means 95^4 floats and 95^4 ints: 95^4 × 8 bytes / 1024^2 = 621 MB. For 96×96 grid points: 648 MB; 97×97: 675 MB; 98×98: 704 MB; 99×99: 733 MB. This computer’s graphics card (GeForce 8800 GTX) has 768 MB, some of which goes to the display buffer. So really, I might be able to do a maximum of 99×99 grid points if I turned the resolution and bit depth way down.
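The arithmetic above can be checked with a few lines of C. This is only a sketch: the function name is hypothetical, and it assumes 8 bytes per stored pair (one 4-byte float distance plus one 4-byte int index), as in the calculation above.

```c
#include <assert.h>
#include <stdint.h>

/* Size in MB (1 MB = 1024^2 bytes) of the all-pairs table for an
   n x n grid: (n*n) grid points, each storing one entry for every grid
   point, bytes_per_entry bytes per entry. */
static double pair_table_mb(uint64_t n, uint64_t bytes_per_entry)
{
    /* Keep n^4 * bytes_per_entry comfortably inside 64 bits. */
    assert(n > 0 && n <= 10000);
    uint64_t points = n * n;
    uint64_t bytes = points * points * bytes_per_entry;
    return (double)bytes / (1024.0 * 1024.0);
}
```

Calling pair_table_mb(n, 8) for n from 95 to 99 gives roughly 621, 648, 675, 704 and 733 MB, matching the figures above. Passing 16 instead of 8 gives the footprint when a second array of the same size is kept alongside the first.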
Unfortunately, this theoretical maximum on this hardware isn’t possible yet. Currently, I compute “windows” for each grid point. This means another array of the same size as the distances array, reducing the number of grid points from 99×99 to 83×83. I see two options. I could move computations to the CPU, which has access to the larger system memory. Another option is to somehow reuse the distances array for the windows calculation. The former would be slow and the latter tricky if not impossible.
A third option is to run the computation on a better GPU. Current consumer cards appear to top out at 2 GB, which would allow 127×127 grid points.
On a side note, with large data sets a large amount of time is currently spent aggregating location data to grid points. I would speed this part up by doing it on the GPU, but the 8800 GTX is a compute capability 1.0 device, meaning it does not support atomic instructions. I implemented a CUDA kernel that I know works except for the missing atomicAdd instruction. So if the program is run on a computer with a device of compute capability 1.1 or better, the CUDA kernel will do the aggregation instead of the CPU, and this should speed up the overall runtime.
Another thing I have noticed is that for large data sets, such as 1 million population, reading the input data takes a surprising amount of time. 1 million population points and 75×75 grid points take about 14 seconds to load. Is there a faster way to read data from a text file than calling fscanf for each line?
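One thing that might be worth trying is to read the whole file into memory with a single fread and then parse the numbers straight out of the buffer with strtod and strtol. The sketch below assumes a hypothetical record layout of "x y count" per line, which may not match the real input files, and the names Point, slurp and parse_points are made up for illustration. Whether it actually beats fscanf on this data would need to be measured.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record layout: one location with a case/population count. */
typedef struct {
    double x;
    double y;
    long count;
} Point;

/* Read an entire file into a NUL-terminated heap buffer.
   Returns NULL on failure; the caller frees the buffer. */
static char *slurp(const char *path, size_t *len_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);

    buf[got] = '\0';
    if (len_out)
        *len_out = got;
    return buf;
}

/* Parse up to max_points "x y count" records from a NUL-terminated
   buffer. strtod and strtol skip leading whitespace, including newlines,
   so no per-line handling is needed. Returns the number of complete
   records stored in out[]. */
static size_t parse_points(const char *buf, Point *out, size_t max_points)
{
    assert(buf != NULL && out != NULL);
    const char *p = buf;
    size_t n = 0;
    while (n < max_points) {
        char *end;
        double x = strtod(p, &end);
        if (end == p) break;          /* no more numbers */
        p = end;
        double y = strtod(p, &end);
        if (end == p) break;          /* incomplete record */
        p = end;
        long count = strtol(p, &end, 10);
        if (end == p) break;          /* incomplete record */
        p = end;

        out[n].x = x;
        out[n].y = y;
        out[n].count = count;
        n++;
    }
    return n;
}
```

Typical use would be to call slurp on the input path, pass the buffer to parse_points with a preallocated Point array, and then free the buffer.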
I ran another timing test last night and have the results. I will show you on Monday. They are looking good.
I did not get to start implementing the space-time scan yet. I just keep finding more things to correct.
The timing framework I made yesterday ran last night. Unfortunately, an oversight of mine led to the program crashing with larger data sets. I have fixed this now and will run the timing framework again tonight. Results were at least positive.
As said earlier, the maximum number of grid points is about 95×95. Currently though, this is not true due to another oversight related to transposing a matrix. I need to change the current CUDA kernel that transposes the distances matrix so that it transposes the array in-place without a second copy to write to. This is just slightly tricky due to memory coalescing issues. I will have this working tomorrow morning hopefully.
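For reference, the access pattern of an in-place transpose can be sketched on the CPU in C. This assumes the distances matrix is square and stored row-major; TILE and the function name are hypothetical, and this is not the CUDA kernel itself. Each tile on or above the diagonal is exchanged with its mirror tile, so no second copy of the matrix is needed.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16  /* hypothetical tile width */

/* Transpose an n x n row-major matrix in place, tile by tile.
   Only tiles on or above the diagonal are visited; every element in
   such a tile is swapped with its mirror element below the diagonal. */
static void transpose_in_place(float *a, size_t n)
{
    assert(a != NULL);
    for (size_t bi = 0; bi < n; bi += TILE) {
        for (size_t bj = bi; bj < n; bj += TILE) {
            size_t i_end = (bi + TILE < n) ? bi + TILE : n;
            size_t j_end = (bj + TILE < n) ? bj + TILE : n;
            for (size_t i = bi; i < i_end; i++) {
                /* In a diagonal tile start past the diagonal so that
                   each pair of elements is swapped exactly once. */
                size_t j_start = (bi == bj) ? i + 1 : bj;
                for (size_t j = j_start; j < j_end; j++) {
                    float tmp = a[i * n + j];
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = tmp;
                }
            }
        }
    }
}
```

In a GPU version, one plausible mapping is for each thread block to load a tile and its mirror tile into shared memory and write them back swapped, so that both the global reads and the global writes run along rows and stay coalesced.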
The code is fully commented now.
Tomorrow I will begin (and maybe finish?) the code to do space-time scans.
Today I set up a timing test framework. In the process I found a few bugs and fixed them. I also began updating my comments to reflect all the changes.
The timing framework generates many data sets based on a parameters file, then runs my CUDA implementation and SaTScan and records the runtimes. It runs my CUDA implementation multiple times per data set and takes the average time.
Currently the CUDA implementation is still limited to about 95×95 grid points. It should handle millions of data points though.
Tomorrow I will look at the results of the time tests and begin the time scan code.
Since the null hypothesis generation is currently the bottleneck, and specifically the generation of random numbers, I’ve tried a few things. First off, I changed the kernel that uses the random numbers so that it is based on probability. Secondly, I moved the probability calculation off the CPU and onto the GPU. Lastly, I’ve tried generating random numbers on the GPU using CUDPP and on the CPU using SFMT. SFMT appears to be just slightly faster, by tenths of a second.
Null hypothesis generation is still the bottleneck!
Also, I fixed the bad p-value output bug from yesterday. It turned out I had to seed the CUDPP random generator on every use.
I tried using the GPU for random number generation when doing the null hypothesis generation. This time I’m using CUDPP’s random function, which generates an array of random values using CUDA. There is a speed increase from approximately 18 seconds down to 14 seconds when using a data set with 10000 population, 789 cases, and 2500 grid points. The biggest bottleneck right now is generating the null hypothesis, specifically generating the random numbers. I might be near or at the maximum random generation speed available?
Important note though: There is a bug somewhere causing bad p-values to be output now. Will investigate tomorrow.