Q2: **===-------------------------------------------------===** Processing 16000000 elements... Host CPU Processing time: 5110.744141 (ms) G80 CUDA Processing time: 444.885010 (ms) Speedup: 11.487787X Test PASSED This is the performance for 16000000 elements. However, since I've padded the input to a power of 2 total number, the performance is not stable (numbers closer to power of 2 will be faster and otherwise slower). For example, 6000000 is only 9.479X and 1000000 is 13.004X Q3: First, I pad the data to power of 2 total numbers. Then, cut them into 4*BLOCK_SIZE parts and load them into shared memory and every part is then bitonic sorted to ascend or descend lists. After this, I merge them together by continuing the bitonic sorting procedure on GPU with global memory. For each process, when the two compared elements are no more than 4*BLOCK_SIZE away, I load them into shared memory again to hide the bandwith limit. Q4: Kernel1 is limited by the register. Since it has 11 registers per block, only two blocks can concurrently run on one SM instead of 3. I tried to reduce it to 10, but didn't work out. Kernel2 is limited by the global memory bandwith. Each swap will need to acess the global memory twice to write them back. And texture fetch seems not too fast either. For the 16000000 case, kernel2 itself takes 160ms out of the total 445 ms. Kernel3 is taken cared for both the bandwith and the register issue. Then the execution throughput become important. For every swap, we need to do one calculation and 3 comparison.