Thomas Shen tbshen ECE498al1 MP5 Report 1. Performance Results **===-------------------------------------------------===** Processing 16000000 elements... Host CPU Processing time: 5079.641113 (ms) G80 CUDA Processing time: 312.075012 (ms) Speedup: 16.276988X Test PASSED 2. Implementation I did a bitonic sort(mybitonic) on each block of 512 elements, and then merged these blocks into a sorted 1024 block, 2048, 4096, .., 16777216 block. To merge two 512 blocks into a 1024 block, one of the 512 blocks is sorted ascending and the second descending. The 512 block sort is done in one kernel. To merge into a 1024 sorted array, two kernel calls are necessary. First, one call to kernel 'comparator' to get the large numbers into the bottom half and the small numbers into the lower half. Then, a kernel (endStageComparator) that will get each of those 512 element halves in order is called. This is done using 9 comparisons. First, with offset 256, 128, 64,...,1. Originally, I used the bitonic sort to get the 512 elements in order, but since the elements are in ascending-descending order, there is a simpler algorithm to get them sorted. It takes more and more kernel calls to merge the next step, since we can't do all the compares necessary in one kernel call. The whole problem comes about because we can't guarantee that all the blocks in the grid are up to a certain step in the process. Thus, we need to exit the kernel and start a new one to make sure they are in sync. 3. The program seems to be limited by memory bandwidth. For 16 million elements, there is one call to 'mybitonic', 15 to 'endStageComparator', and 120 (15 summation) calls to 'comparator'. The first two kernels are not that bad, since many operations are done on the same data. Thus, we can store the data into a shared array to cut down on global memory accesses. However, for each of the 120 calls to 'comparator', there is a huge overhead of copying from global memory. Each element is compared with only one other element. If a comparison is considered a floating point operation, this is a 2-memory-read:1-FLOP ratio which is terrible. Furthermore, if we need to swap the two elements, then it becomes a 4-memory-read/write:1-FLOP ratio. Below, I have calculated the memory access to flop/comparison ratio. As can be seen, the ratio for the 'comparator' kernel is terrible. 'comparator' does 83888608 comparisons in average 1322us = 63455comps/us 'endStageComparator' does 83888608*9 comparisons in average of 7173 us, = 105255comps/us. If we could do more comparisons per memory access, this would undoubtedly increase our performance. mybitonic calls: 1 Total time: 43373us percentage of time: 13.9854% memory access to flop ratio: 4 : 45 (2 reads, 2 writes per thread. 45 stages to sort 512 elements) comparator calls: 120 time: 158711us avg time per: 1322us percentage of time: 51.1755% memory access to flop ratio: no swap: 2:1 swap: 4:1 endStageComparator calls: 15 time: 107598.75us avg time per: 7173 percentage of time: 34.7502% memory access to flop ratio: 4 : 9 (2 reads, 2 writes per thread. 9 stages to finish sorting 512 elements) constant memory copy calls : 120 time: 275.55us percentage of time: 0.0889% Total time = 310130.55us