CUDA Basic Example - Vector Addition Explanation
This document provides a detailed explanation of the vector addition CUDA example in basic01.cu.
You can find the code in https://github.com/eunomia-bpf/basic-cuda-tutorial
Prerequisites
To run this example, you need:
- NVIDIA GPU with CUDA support
- NVIDIA CUDA Toolkit installed
- A C++ compiler compatible with your CUDA version
- GNU Make (for building with the provided Makefile)
Building and Running
- Build the example by running make in the directory that contains the provided Makefile.
- Run the resulting executable.
Code Structure and Explanation
1. Header Files and Includes
The program includes two standard C headers:
- stdio.h: Input/output functions like printf
- stdlib.h: Memory management functions like malloc and free
2. CUDA Kernel Function
- __global__: Specifies this is a CUDA kernel function that:
  - Runs on the GPU
  - Can be called from CPU code
  - Must return void
- Parameters:
  - float *A, *B: Input vectors in GPU memory
  - float *C: Output vector in GPU memory
  - numElements: Size of the vectors
Inside the kernel, each thread calculates a unique global index as blockDim.x * blockIdx.x + threadIdx.x, where:
- threadIdx.x: Thread index within the block (0 to blockDim.x-1)
- blockIdx.x: Block index within the grid
- blockDim.x: Number of threads per block
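The index calculation can be checked in isolation. The sketch below is not taken from basic01.cu: the kernel name writeGlobalIndex, the helper countMismatches, and the small launch sizes (8 threads per block for 30 elements) are assumptions chosen for illustration. Each thread stores its own global index, and the host then confirms that slot i was written by the thread whose index is i.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread records its own global index.
__global__ void writeGlobalIndex(int *out, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {   // surplus threads in the last block do nothing
        out[i] = i;
    }
}

// Host helper: counts the slots whose value differs from their position.
static int countMismatches(const int *out, int n)
{
    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (out[i] != i) ++bad;
    }
    return bad;
}

int main(void)
{
    const int numElements = 30;
    const int threadsPerBlock = 8;
    // Rounds up to 4 blocks, i.e. 32 threads for 30 elements.
    const int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    size_t size = numElements * sizeof(int);

    int h_out[numElements];
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(d_out, 0xFF, size);   // every int becomes -1, so unwritten slots show up

    writeGlobalIndex<<<blocksPerGrid, threadsPerBlock>>>(d_out, numElements);

    // This copy runs after the kernel has finished on the default stream.
    cudaError_t err = cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    printf("%d mismatches\n", countMismatches(h_out, numElements));
    return EXIT_SUCCESS;
}
```

The bounds check matters here because 4 blocks of 8 threads launch 32 threads for 30 elements; without it the last two threads would write past the end of the array.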
3. Main Function Components
3.1 Memory Allocation
// Host memory allocation
float *h_A = (float *)malloc(size); // CPU memory
// Device memory allocation
float *d_A = NULL;
cudaMalloc((void **)&d_A, size); // GPU memory
- Host (CPU) memory uses standard C malloc
- Device (GPU) memory uses CUDA's cudaMalloc
- The 'h_' prefix denotes host memory
- The 'd_' prefix denotes device memory
3.2 Data Transfer
cudaMemcpy parameters:
1. Destination pointer
2. Source pointer
3. Size in bytes
4. Direction of transfer:
   - cudaMemcpyHostToDevice: CPU to GPU
   - cudaMemcpyDeviceToHost: GPU to CPU
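To make the parameter order concrete, here is a minimal round-trip sketch. It is an illustration rather than the code from basic01.cu: the buffer names h_src, h_dst, and d_buf, the helper sameContents, and the element count of 1024 are assumptions. Data is copied from the host to the device and straight back, then compared.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Host helper (hypothetical): returns 1 if both arrays hold identical values.
static int sameContents(const float *a, const float *b, int n)
{
    for (int i = 0; i < n; ++i) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}

int main(void)
{
    const int numElements = 1024;
    size_t size = numElements * sizeof(float);

    float *h_src = (float *)malloc(size);
    float *h_dst = (float *)malloc(size);
    float *d_buf = NULL;
    if (h_src == NULL || h_dst == NULL) {
        fprintf(stderr, "Host allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < numElements; ++i) h_src[i] = (float)i;

    // Argument order is always: destination, source, size in bytes, direction.
    cudaError_t err = cudaMalloc((void **)&d_buf, size);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_buf, h_src, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_dst, d_buf, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    printf("%s\n", sameContents(h_src, h_dst, numElements)
                       ? "Round trip OK" : "Round trip FAILED");

    cudaFree(d_buf);
    free(h_src);
    free(h_dst);
    return EXIT_SUCCESS;
}
```

A common mistake is swapping the first two arguments; the direction flag does not correct for that, so the destination-first order is worth memorizing.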
3.3 Kernel Launch Configuration
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
- threadsPerBlock = 256: Common size for good performance
- blocksPerGrid: Calculated to ensure enough threads for all elements
- The formula (numElements + threadsPerBlock - 1) / threadsPerBlock rounds up the division
- Launch syntax <<<blocks, threads>>> specifies the execution configuration
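The rounding-up formula can be worked through for the 50000-element case this example uses. The sketch below is host-only code; the helper name blocksFor is an assumption introduced for illustration. Plain integer division would give 195 blocks, which cover only 49920 elements, so the formula adds threadsPerBlock - 1 before dividing.

```cuda
#include <stdio.h>

// Ceiling division: the smallest block count whose threads cover n elements.
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main(void)
{
    const int numElements = 50000;
    const int threadsPerBlock = 256;

    int blocksPerGrid = blocksFor(numElements, threadsPerBlock);
    int totalThreads = blocksPerGrid * threadsPerBlock;

    printf("blocks: %d\n", blocksPerGrid);                      // blocks: 196
    printf("threads launched: %d\n", totalThreads);             // threads launched: 50176
    printf("idle threads: %d\n", totalThreads - numElements);   // idle threads: 176
    return 0;
}
```

The 176 surplus threads are the reason a kernel launched this way needs a bounds check against numElements before it touches memory.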
3.4 Error Checking
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Failed to launch kernel: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
Always check for CUDA errors after kernel launches and CUDA API calls.
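One way to apply this advice consistently is to wrap every call in a checking macro. The sketch below is an assumption about how that could look, not the approach used in basic01.cu: the macro name CUDA_CHECK and the kernel fillOnes are hypothetical. It also separates two kinds of kernel failure: cudaGetLastError reports problems with the launch itself, while errors that occur during execution surface at a later synchronizing call such as cudaDeviceSynchronize.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file and line if a CUDA call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,            \
                    cudaGetErrorString(err_));                            \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Trivial kernel used only to have something to launch.
__global__ void fillOnes(float *data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;
}

int main(void)
{
    const int n = 1 << 16;
    float *d_data = NULL;

    CUDA_CHECK(cudaMalloc((void **)&d_data, n * sizeof(float)));

    fillOnes<<<(n + 255) / 256, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());        // launch problems, e.g. a bad configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel was running

    CUDA_CHECK(cudaFree(d_data));
    printf("No CUDA errors reported\n");
    return EXIT_SUCCESS;
}
```

The do/while (0) wrapper lets the macro be used like an ordinary statement, including inside an if without braces.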
3.5 Result Verification
for (int i = 0; i < numElements; ++i) {
    if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
        fprintf(stderr, "Result verification failed at element %d!\n", i);
        exit(EXIT_FAILURE);
    }
}
This loop verifies the GPU computation by recomputing each sum on the CPU and comparing it with the GPU result, allowing an absolute difference of up to 1e-5.
3.6 Cleanup
// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
// Free CPU memory
free(h_A);
free(h_B);
free(h_C);
Always free allocated memory to prevent memory leaks.
Performance Considerations
- Thread Block Size
  - We use 256 threads per block
  - This is a common choice that works well on most GPUs
  - Block sizes should be a multiple of the warp size (32 threads); powers of 2 such as 128, 256, or 512 are typical choices
- Memory Coalescing
  - Adjacent threads access adjacent memory locations
  - This pattern lets the hardware combine the accesses of a warp into fewer memory transactions
- Error Checking
  - The code checks for errors after the kernel launch and after CUDA API calls
  - Important for debugging and reliability
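The memory coalescing point can be illustrated by contrasting two access patterns. The sketch below is not part of basic01.cu; the kernels doubleCoalesced and doubleStrided, the helper stridedIndex, and the stride of 33 are assumptions made for illustration. Both kernels double every element exactly once, so the results match, but in the strided version neighboring threads touch elements far apart, which typically requires more memory transactions per warp and runs slower.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Maps thread index i to a scattered element. With n a power of two and an
// odd stride, this visits every element in [0, n) exactly once.
__host__ __device__ inline int stridedIndex(int i, int stride, int n)
{
    return (int)(((long long)i * stride) % n);
}

// Coalesced: thread i touches element i, so a warp reads one contiguous span.
__global__ void doubleCoalesced(float *data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Strided counter-example: neighboring threads touch elements `stride` apart.
__global__ void doubleStrided(float *data, int n, int stride)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[stridedIndex(i, stride, n)] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    size_t size = n * sizeof(float);

    float *h = (float *)malloc(size);
    float *d = NULL;
    if (h == NULL || cudaMalloc((void **)&d, size) != cudaSuccess) {
        fprintf(stderr, "Allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);

    doubleCoalesced<<<blocks, threads>>>(d, n);     // every element becomes 2
    doubleStrided<<<blocks, threads>>>(d, n, 33);   // every element becomes 4

    cudaError_t err = cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != 4.0f) ++bad;
    }
    printf("%d mismatches\n", bad);

    cudaFree(d);
    free(h);
    return EXIT_SUCCESS;
}
```

The sketch only verifies correctness; measuring the speed difference would need a profiler or explicit timing around each launch.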
Common Issues and Debugging
- CUDA Installation
  - Ensure CUDA toolkit is properly installed
  - Check nvcc --version works
  - Verify GPU compatibility with nvidia-smi
- Compilation Errors
  - Check CUDA path is in system PATH
  - Verify GPU compute capability matches -arch flag in Makefile
- Runtime Errors
  - Out of memory: Reduce vector size
  - Kernel launch failure: Check GPU availability
  - Incorrect results: Verify index calculations
Expected Output
When running successfully, you should see:
Vector addition of 50000 elements
CUDA kernel launch with 196 blocks of 256 threads
Test PASSED
Done
Modifying the Example
To experiment with the code:
- Change vector size (numElements)
- Modify threads per block
- Add timing measurements
- Try different data types
- Implement other vector operations
Remember to handle errors and verify results after modifications.
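For the timing suggestion, CUDA events are one common option. The following is a minimal sketch under stated assumptions, not code from basic01.cu: the function timeVectorAdd is hypothetical, the kernel signature with const pointers is a guess consistent with the parameter description earlier in this document, and the inputs are simply zeroed because their values do not affect the timing. A warm-up launch is included since the first launch of a kernel usually carries one-time startup overhead.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) C[i] = A[i] + B[i];
}

// Hypothetical helper: returns the kernel time in milliseconds,
// or -1.0f if device memory could not be allocated.
static float timeVectorAdd(int numElements)
{
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    size_t size = numElements * sizeof(float);

    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    if (cudaMalloc((void **)&d_A, size) != cudaSuccess ||
        cudaMalloc((void **)&d_B, size) != cudaSuccess ||
        cudaMalloc((void **)&d_C, size) != cudaSuccess) {
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);   // freeing NULL is a no-op
        return -1.0f;
    }
    cudaMemset(d_A, 0, size);   // defined input values; timing does not depend on them
    cudaMemset(d_B, 0, size);

    // Warm-up launch so one-time startup cost is not included in the measurement.
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel and the stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return ms;
}

int main(void)
{
    float ms = timeVectorAdd(50000);
    if (ms < 0.0f) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    printf("vectorAdd took %.3f ms\n", ms);
    return EXIT_SUCCESS;
}
```

Events are recorded on the GPU's own timeline, so the measurement covers the kernel rather than host-side launch latency. Averaging over several launches would give a steadier figure than a single run.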