OpenCL Basic Example - Vector Addition Explanation
This document provides a detailed explanation of the OpenCL vector addition example in 15-opencl-vector-addition.c
.
This example demonstrates the OpenCL equivalent of the CUDA vector addition, showcasing the differences and similarities between CUDA and OpenCL programming models.
Prerequisites
To run this example, you need: - OpenCL-compatible device (GPU, CPU, or other accelerator) - OpenCL runtime and headers installed - A C compiler (gcc, clang, etc.) - GNU Make (for building with the provided Makefile)
Installing OpenCL
Ubuntu/Debian:
# For NVIDIA GPUs
sudo apt-get install nvidia-opencl-dev
# For AMD GPUs
sudo apt-get install amdgpu-pro-opencl-dev
# For Intel GPUs/CPUs
sudo apt-get install intel-opencl-icd
# Generic OpenCL headers
sudo apt-get install opencl-headers ocl-icd-opencl-dev
CentOS/RHEL:
# Install OpenCL headers and loader
sudo yum install opencl-headers ocl-icd-devel
# For NVIDIA GPUs, install CUDA toolkit
# For AMD GPUs, install ROCm
macOS: OpenCL is included with the system (no additional installation needed).
Building and Running
-
Build the example:
-
Run the program:
Code Structure and Explanation
1. Header Files and Platform Detection
OpenCL headers are located differently on macOS vs. other platforms:
- macOS: <OpenCL/opencl.h>
- Linux/Windows: <CL/cl.h>
2. OpenCL Kernel Source
const char* kernelSource =
"__kernel void vectorAdd(__global const float* A,\n"
" __global const float* B,\n"
" __global float* C,\n"
" const int numElements) {\n"
" int i = get_global_id(0);\n"
" if (i < numElements) {\n"
" C[i] = A[i] + B[i];\n"
" }\n"
"}\n";
Key differences from CUDA:
- __kernel
instead of __global__
- __global
memory space qualifier for pointers
- get_global_id(0)
instead of manual thread index calculation
- OpenCL kernels are compiled at runtime from source strings
3. Error Handling
OpenCL requires extensive error checking. The example includes:
void checkError(cl_int error, const char* operation) {
if (error != CL_SUCCESS) {
printf("Error during %s: %d\n", operation, error);
exit(1);
}
}
And a comprehensive error string function for debugging.
4. Platform and Device Discovery
Unlike CUDA which automatically uses NVIDIA GPUs, OpenCL requires explicit platform and device discovery:
// Get platform
ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
checkError(ret, "getting platform IDs");
// Get device (prefer GPU, fallback to any device)
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);
if (ret != CL_SUCCESS) {
printf("No GPU found, trying any device type...\n");
ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, &device_id, &ret_num_devices);
checkError(ret, "getting device IDs");
}
This code: 1. Finds the first available OpenCL platform 2. Tries to get a GPU device 3. Falls back to any available device if no GPU is found
5. Context and Command Queue Creation
// Create OpenCL context
cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
checkError(ret, "creating context");
// Create command queue
cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
checkError(ret, "creating command queue");
OpenCL uses: - Context: Manages devices and memory objects - Command Queue: Queues operations for execution on a device
6. Memory Management
// Create memory buffers on device
cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSize, NULL, &ret);
cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, dataSize, NULL, &ret);
// Copy data to device buffers
ret = clEnqueueWriteBuffer(command_queue, d_A, CL_TRUE, 0, dataSize, h_A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, d_B, CL_TRUE, 0, dataSize, h_B, 0, NULL, NULL);
Key differences from CUDA:
- clCreateBuffer()
instead of cudaMalloc()
- Memory access patterns specified at creation (CL_MEM_READ_ONLY
, CL_MEM_WRITE_ONLY
)
- clEnqueueWriteBuffer()
instead of cudaMemcpy()
- All operations are queued on command queues
7. Runtime Compilation
// Create program from source
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &ret);
checkError(ret, "creating program");
// Build program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
if (ret != CL_SUCCESS) {
// Get build log for debugging
size_t log_size;
clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
char *log = (char*)malloc(log_size);
clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
printf("Build log:\n%s\n", log);
free(log);
exit(1);
}
OpenCL compiles kernels at runtime, allowing for: - Platform-specific optimizations - Runtime kernel generation - Better portability across vendors
8. Kernel Execution
// Create kernel
cl_kernel kernel = clCreateKernel(program, "vectorAdd", &ret);
// Set kernel arguments
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_A);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_B);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_C);
ret = clSetKernelArg(kernel, 3, sizeof(int), (void*)&numElements);
// Execute kernel
size_t globalWorkSize = numElements;
size_t localWorkSize = 256; // Work group size
// Adjust global work size to be multiple of local work size
if (globalWorkSize % localWorkSize != 0) {
globalWorkSize = ((globalWorkSize / localWorkSize) + 1) * localWorkSize;
}
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalWorkSize, &localWorkSize, 0, NULL, NULL);
Key concepts: - Global Work Size: Total number of work items (similar to total threads in CUDA) - Local Work Size: Work group size (similar to block size in CUDA) - Global work size must be multiple of local work size - Arguments are set individually with type and size information
9. Synchronization and Results
// Wait for kernel to complete
ret = clFinish(command_queue);
checkError(ret, "waiting for kernel to finish");
// Read result back to host
ret = clEnqueueReadBuffer(command_queue, d_C, CL_TRUE, 0, dataSize, h_C, 0, NULL, NULL);
clFinish()
waits for all queued operations to completeclEnqueueReadBuffer()
withCL_TRUE
performs blocking read
CUDA vs OpenCL Comparison
Aspect | CUDA | OpenCL |
---|---|---|
Vendor | NVIDIA only | Cross-platform (NVIDIA, AMD, Intel, etc.) |
Language | C++ with extensions | C99 with extensions |
Compilation | Compile-time (nvcc ) |
Runtime compilation |
Memory Model | Implicit global memory | Explicit memory spaces (__global , __local , etc.) |
Thread Indexing | Manual calculation | Built-in functions (get_global_id() ) |
Error Handling | Return codes + cudaGetLastError() |
Return codes for all functions |
Kernel Launch | <<<blocks, threads>>> syntax |
clEnqueueNDRangeKernel() |
Memory Management | cudaMalloc , cudaMemcpy |
clCreateBuffer , clEnqueueWriteBuffer |
Performance Considerations
- Work Group Size
- Similar to CUDA block size
- Should be multiple of 32 (warp size) on NVIDIA GPUs
-
Should be multiple of 64 (wavefront size) on AMD GPUs
-
Memory Access Patterns
- Coalesced access still important
-
OpenCL provides more explicit control over memory spaces
-
Kernel Compilation
- Runtime compilation adds overhead
- Can cache compiled binaries for production use
Common Issues and Debugging
-
No OpenCL Platforms Found
-
Kernel Compilation Failures
-
Work Size Errors
-
Memory Errors
Expected Output
When running successfully, you should see:
OpenCL Vector addition of 50000 elements
Using OpenCL platform: NVIDIA CUDA
Using device: Tesla P40
Device type: GPU
Global memory: 22906 MB
Compute units: 60
Max work group size: 1024
OpenCL kernel launch with global work size 50176 and local work size 256
Verifying results...
Test PASSED
Done
Advanced Features
This basic example can be extended to explore:
- Multiple Devices: Run on multiple GPUs/CPUs simultaneously
- Asynchronous Execution: Use events for fine-grained synchronization
- Image Processing: Use OpenCL image objects and samplers
- Local Memory: Utilize
__local
memory for shared data - Profiling: Enable command queue profiling for performance analysis
Building for Different Platforms
The example includes conditional compilation for different platforms and can be adapted for: - NVIDIA GPUs (via CUDA OpenCL implementation) - AMD GPUs (via ROCm or proprietary drivers) - Intel CPUs/GPUs (via Intel OpenCL runtime) - ARM Mali GPUs (via ARM Compute Library)
This makes OpenCL an excellent choice for cross-platform GPU computing applications.