CISC372 Parallel Computing

Project 6
Overview:
In the last project, you implemented an image filter program using pthreads. A special case of the filter
program is called a box blur. This type of filter sets every value in the filter matrix to 1, then divides
the resulting sum by the number of values in the matrix.
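\[
K = \frac{1}{9}
\begin{bmatrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1
\end{bmatrix}
\]
For this 3x3 matrix, the number of values is 9, so each sum is divided by 9.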
The issue we had in the last program is that when an image is high resolution, a 3x3 filter does very little
to change the appearance of the image. We would like a bigger filter (the radius here is 1; we might
want a radius of 20 or 40), but applying the filter directly reads (2*radius+1)^2 values for every pixel,
which is 6561 values per pixel at a radius of 40. That makes the direct approach impractically slow.
A fast way to do this is to keep a running sum of the last 2*radius+1 elements along each row, then
take the resulting image and do the same along each column. If we divide each of these sums by the width
of the kernel (2*radius+1), then we end up computing exactly what the filter computes (the average within
a radius), with exactly one pass through the rows and one pass through the columns. Because each row
is independent of the other rows, and each column of the other columns, we have a hope of parallelizing
this algorithm.
Project Details:
For this project, you may work alone or in pairs. You will have until the final Friday (5/14) to
complete this assignment. If you work in pairs, make sure that the header of all files that you generate
contains the names of both people who worked on the project so that you both get credit. Both people
should hand in the final project via Canvas. You may run this code anywhere you like (on PSC, on
cisc372 using srun, or on your own machine configured for CUDA). You should hand in your final .cu file
and any other files you produce.
Part 1: Fast Blur
You can retrieve my fast blur code, along with a sample image (gauss.jpg), from GitHub at:
gsilber/CISC372_HW6 (github.com)
Use the included makefile to build the program. You can run it as is by executing ./fastblur gauss.jpg 40
where 40 is the desired radius (this is a big image). You can play with different values of the radius to
see how it behaves. The effect of a given radius depends on the image resolution: at different resolutions,
the same radius covers a different percentage of the entire image, and thus produces a different amount
of blurring.
Part 2: Simple CUDA
In this part of the project, you should modify the fastblur.c file to create cudablur.cu (CUDA code must
have a .cu extension to work). You will need to change the makefile to use nvcc instead of gcc to
compile for CUDA.
Rewrite the program so that each column runs in its own thread. I suggest a thread block size of 256.
This means turning the computeColumn function into a kernel, and computing the col parameter from
the threadIdx, blockIdx, and blockDim variables.
Then you must sync up the threads with a call to cudaDeviceSynchronize and repeat the process for each row.
Finally, convert the result back to a uint8_t array and save the image.
For this part, I suggest you use cudaMallocManaged and cudaFree for all the arrays to simplify the code.
If you have a block size of 256, then you need a block count of (width+255)/256 to cover every column
(integer division, so the count rounds up). Make sure your kernel function checks for unused threads,
where the computed column >= pWidth. Do the same for the rows: use a block count of (height+255)/256
and check the computed row against the height. If the height or width is not divisible by the block size,
then the last block will contain some extra threads that need to just return immediately.
Part 3: Device Memory
Part 2 is relatively slow. This is because of the managed memory. To speed it up, allocate the memory
the kernels need directly on the device with cudaMalloc where possible, and copy the input data from the
host up to the device with cudaMemcpy for calculation. When the calculation is complete, copy the result
back to the host with cudaMemcpy in order to save it to the output file. Experiment with the block size
to try to maximize performance, and see how fast you can get the computation to run.
What to hand in:
Hand in your cudablur.cu file from Part 2 and from Part 3, along with a makefile for each and any other
files you added that are required to build your program. Make sure your program compiles and runs,
and note in the comments which system you ran it on to avoid any confusion.