
CUDA Basic Example - Vector Addition Explanation

This document provides a detailed explanation of the vector addition CUDA example in basic01.cu.

You can find the code at https://github.com/eunomia-bpf/basic-cuda-tutorial

Prerequisites

To run this example, you need:

  • An NVIDIA GPU with CUDA support
  • The NVIDIA CUDA Toolkit installed
  • A C++ compiler compatible with your CUDA version
  • GNU Make (for building with the provided Makefile)

Building and Running

  1. Build the example:

    make
    

  2. Run the program:

    ./basic01
    

Code Structure and Explanation

1. Header Files and Includes

    #include <stdio.h>
    #include <stdlib.h>

These standard C headers provide:

  • stdio.h: Input/output functions such as printf
  • stdlib.h: Memory management functions such as malloc and free
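Note that no CUDA-specific header appears here: when a .cu file is compiled with nvcc, the runtime API declarations (cudaMalloc, cudaMemcpy, and so on) are pulled in automatically. If you move host code into a plain .cpp file built by a host compiler, you would include the runtime header yourself:

    #include <cuda_runtime.h>  // only needed when not compiling with nvcc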

2. CUDA Kernel Function

    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)

  • __global__: Specifies that this is a CUDA kernel function, which:
     • Runs on the GPU
     • Can be called from CPU code
     • Must return void
  • Parameters:
     • const float *A, *B: Input vectors in GPU memory
     • float *C: Output vector in GPU memory
     • numElements: Size of the vectors

Inside the kernel:

    int i = blockDim.x * blockIdx.x + threadIdx.x;

This calculates a unique global index for each thread, where:

  • threadIdx.x: Thread index within its block (0 to blockDim.x - 1)
  • blockIdx.x: Block index within the grid
  • blockDim.x: Number of threads per block
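Because the grid is rounded up to a whole number of blocks (see section 3.3), this index can exceed numElements in the last block, so the body guards the memory access. A minimal sketch of the complete kernel, following the standard CUDA vector-addition pattern (the body in basic01.cu may differ in detail):

    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements)          // guard: the last block may have surplus threads
        {
            C[i] = A[i] + B[i];       // one element per thread
        }
    }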

3. Main Function Components

3.1 Memory Allocation

    // Host memory allocation
    float *h_A = (float *)malloc(size);  // CPU memory

    // Device memory allocation
    float *d_A = NULL;
    cudaMalloc((void **)&d_A, size);     // GPU memory

  • Host (CPU) memory uses standard C malloc
  • Device (GPU) memory uses CUDA's cudaMalloc
  • The h_ prefix denotes host memory
  • The d_ prefix denotes device memory
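The size argument passed to malloc and cudaMalloc is the byte count of one vector. A plausible definition, assuming the 50000-element vectors shown in the expected output below:

    int numElements = 50000;                    // elements per vector
    size_t size = numElements * sizeof(float);  // bytes per vector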

3.2 Data Transfer

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

cudaMemcpy takes four parameters:

  1. Destination pointer
  2. Source pointer
  3. Size in bytes
  4. Direction of transfer:
     • cudaMemcpyHostToDevice: CPU to GPU
     • cudaMemcpyDeviceToHost: GPU to CPU
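After the kernel finishes, the same call moves the result in the opposite direction:

    // Copy the result vector from GPU memory back to CPU memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);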

3.3 Kernel Launch Configuration

    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

  • threadsPerBlock = 256: A common size that performs well on most GPUs
  • blocksPerGrid: Calculated to provide enough threads to cover all elements
  • The formula (numElements + threadsPerBlock - 1) / threadsPerBlock rounds the integer division up
  • The launch syntax <<<blocks, threads>>> specifies the execution configuration
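For example, with numElements = 50000 and threadsPerBlock = 256, blocksPerGrid = (50000 + 255) / 256 = 196, so the launch creates 196 × 256 = 50176 threads; the 176 surplus threads in the last block are filtered out by the kernel's bounds check.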

3.4 Error Checking

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch kernel: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

Always check for CUDA errors after kernel launches and CUDA API calls. Note that cudaGetLastError catches launch failures only; errors that occur while the kernel is running surface at the next synchronizing call, such as cudaDeviceSynchronize or the cudaMemcpy that copies the results back.
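One common way to keep this checking concise is to wrap every runtime call in a macro. This is a sketch rather than code from basic01.cu, and the name CHECK_CUDA is our own:

    // Check the return code of a CUDA runtime call and abort on failure.
    #define CHECK_CUDA(call)                                            \
        do {                                                            \
            cudaError_t err_ = (call);                                  \
            if (err_ != cudaSuccess) {                                  \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                        __FILE__, __LINE__, cudaGetErrorString(err_));  \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // Usage:
    CHECK_CUDA(cudaMalloc((void **)&d_A, size));
    CHECK_CUDA(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));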

3.5 Result Verification

    for (int i = 0; i < numElements; ++i) {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

This verifies the GPU computation by recomputing each sum on the CPU and comparing it against the copied-back result, allowing a small floating-point tolerance (1e-5). Note that fabs is declared in math.h.

3.6 Cleanup

    // Free GPU memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free CPU memory
    free(h_A);
    free(h_B);
    free(h_C);

Always free allocated memory to prevent memory leaks.

Performance Considerations

  1. Thread Block Size
     • We use 256 threads per block
     • This is a common choice that works well on most GPUs
     • Powers of 2 are typically efficient

  2. Memory Coalescing
     • Adjacent threads access adjacent memory locations
     • This pattern enables efficient memory access

  3. Error Checking
     • The code includes robust error checking
     • This is important for debugging and reliability

Common Issues and Debugging

  1. CUDA Installation
     • Ensure the CUDA toolkit is properly installed
     • Check that nvcc --version works
     • Verify GPU compatibility with nvidia-smi

  2. Compilation Errors
     • Check that the CUDA path is in the system PATH
     • Verify that the GPU compute capability matches the -arch flag in the Makefile

  3. Runtime Errors
     • Out of memory: Reduce the vector size
     • Kernel launch failure: Check GPU availability
     • Incorrect results: Verify the index calculations

Expected Output

When running successfully, you should see:

    Vector addition of 50000 elements
    CUDA kernel launch with 196 blocks of 256 threads
    Test PASSED
    Done

Modifying the Example

To experiment with the code:

  1. Change the vector size (numElements)
  2. Modify the threads per block
  3. Add timing measurements (see the sketch after this list)
  4. Try different data types
  5. Implement other vector operations
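For item 3, CUDA events are the usual way to time a kernel. A minimal sketch, assuming the same launch configuration as in section 3.3:

    // Time the kernel with CUDA events (not part of basic01.cu).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel and stop event complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);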

Remember to handle errors and verify results after modifications.
