CUDA Function Type Annotations: A Comprehensive Guide
Overview
CUDA provides several function type annotations that specify where functions can be called from and where they execute. Understanding these annotations is crucial for effective CUDA programming as they determine the execution space and calling constraints of your functions.
Function Type Annotations
1. __global__ - Kernel Functions
Purpose: Defines functions that run on the GPU and are called from the host (CPU).
Characteristics:
- Executed on the device (GPU)
- Called from the host (CPU)
- Must return void
- Cannot be called as a regular function from device code (though kernels can be launched from device code via dynamic parallelism; see Advanced Topics)
- Launched asynchronously with respect to the host (unless explicitly synchronized)
Syntax:
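__global__ void kernelName(parameters) {
    // GPU code executed by every thread in the launched grid
}

// Launched from host code with an execution configuration:
kernelName<<<numBlocks, threadsPerBlock>>>(arguments);
Here kernelName, numBlocks, and threadsPerBlock are placeholders; the <<<...>>> execution configuration specifies the grid and block dimensions.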
2. __device__ - Device Functions
Purpose: Defines functions that run on the GPU and are called from other GPU functions.
Characteristics:
- Executed on the device (GPU)
- Called from device code only (__global__ or other __device__ functions)
- Can return any type
- Cannot be called from host code
- Typically inlined by the compiler for better performance
Syntax:
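__device__ returnType deviceFunction(parameters) {
    // GPU code, callable only from other device code
    return value;
}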
3. __host__ - Host Functions
Purpose: Defines functions that run on the CPU (default behavior).
Characteristics:
- Executed on the host (CPU)
- Called from host code only
- Default annotation (can be omitted)
- Cannot be called from device code
Syntax:
__host__ returnType hostFunction(parameters) {
    // CPU code
    return value;
}

// Equivalent to:
returnType hostFunction(parameters) {
    // CPU code
    return value;
}
4. Combined Annotations
__host__ __device__: Functions that can be compiled for both host and device execution.
Characteristics:
- Compiled for both CPU and GPU
- Can be called from both host and device code
- Useful for utility functions
- Code must be compatible with both architectures
Syntax:
__host__ __device__ returnType dualFunction(parameters) {
    // Code that works on both CPU and GPU
    return value;
}
Memory Space Annotations
__shared__ - Shared Memory
Purpose: Declares variables in shared memory within a thread block.
Characteristics:
- Shared among all threads in a block
- Much faster than global memory
- Limited in size (typically 48-96 KB per block, depending on architecture)
- Lifetime matches the block execution
__constant__ - Constant Memory
Purpose: Declares read-only variables in constant memory.
Characteristics:
- Read-only from device code
- Cached for fast access
- Limited to 64 KB total
- Initialized from host code
__managed__ - Unified Memory
Purpose: Declares variables accessible from both CPU and GPU with automatic migration.
Characteristics:
- Automatically migrated between CPU and GPU
- Simplifies memory management
- May have performance implications
- Requires compute capability 3.0+
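As a minimal sketch (the names counter and incrementCounter are illustrative), a __managed__ variable can be accessed from both host and device code:
// Managed variable visible to both host and device
__managed__ int counter = 0;

__global__ void incrementCounter() {
    atomicAdd(&counter, 1); // device code updates the managed variable
}

// Host code can read counter directly after the kernel completes:
// incrementCounter<<<1, 32>>>();
// cudaDeviceSynchronize();
// printf("counter = %d\n", counter);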
Practical Examples and Demonstrations
Let's explore these annotations through practical examples that build upon each other.
Example 1: Basic Kernel and Device Function Interaction
This example demonstrates how __global__ kernels call __device__ functions:
// Device function for mathematical computation
__device__ float computeDistance(float x1, float y1, float x2, float y2) {
    float dx = x2 - x1;
    float dy = y2 - y1;
    return sqrtf(dx * dx + dy * dy);
}

// Global kernel that uses the device function
__global__ void calculateDistances(float* x1, float* y1, float* x2, float* y2,
                                    float* distances, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        distances[idx] = computeDistance(x1[idx], y1[idx], x2[idx], y2[idx]);
    }
}
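For context, a host-side driver for this kernel might look like the sketch below (names are illustrative and error checking is omitted for brevity):
void runDistances(const float* h_x1, const float* h_y1,
                  const float* h_x2, const float* h_y2,
                  float* h_dist, int n) {
    float *d_x1, *d_y1, *d_x2, *d_y2, *d_dist;
    size_t bytes = n * sizeof(float);

    // Allocate device memory and copy the inputs over
    cudaMalloc(&d_x1, bytes); cudaMalloc(&d_y1, bytes);
    cudaMalloc(&d_x2, bytes); cudaMalloc(&d_y2, bytes);
    cudaMalloc(&d_dist, bytes);
    cudaMemcpy(d_x1, h_x1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y1, h_y1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x2, h_x2, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y2, h_y2, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel and copy the results back
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    calculateDistances<<<blocks, threads>>>(d_x1, d_y1, d_x2, d_y2, d_dist, n);
    cudaMemcpy(h_dist, d_dist, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x1); cudaFree(d_y1); cudaFree(d_x2); cudaFree(d_y2); cudaFree(d_dist);
}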
Example 2: Host-Device Dual Functions
This example shows functions that work on both CPU and GPU:
// Function that works on both host and device
__host__ __device__ float lerp(float a, float b, float t) {
    return a + t * (b - a);
}

// Device function using the dual function
__device__ float interpolateValue(float* array, float index, int size) {
    int lower = (int)index;
    int upper = min(lower + 1, size - 1);
    float t = index - lower;
    return lerp(array[lower], array[upper], t);
}

// Host function also using the dual function
__host__ void preprocessData(float* data, int size) {
    for (int i = 0; i < size - 1; i++) {
        data[i] = lerp(data[i], data[i + 1], 0.5f);
    }
}
Example 3: Memory Space Annotations
This example demonstrates different memory spaces:
// Constant memory declaration
__constant__ float convolution_kernel[9];
// Global kernel using shared memory
__global__ void convolutionWithShared(float* input, float* output,
                                      int width, int height) {
    // Shared memory for tile-based processing
    __shared__ float tile[18][18]; // 16x16 thread block plus a 1-pixel halo on each side

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int gx = blockIdx.x * blockDim.x + tx;
    int gy = blockIdx.y * blockDim.y + ty;

    // Load data into shared memory with border handling
    // ... (implementation details)
    __syncthreads();

    // Perform the convolution using the constant memory kernel
    if (gx < width && gy < height) {
        float result = 0.0f;
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                result += tile[ty + i][tx + j] * convolution_kernel[i * 3 + j];
            }
        }
        output[gy * width + gx] = result;
    }
}
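The __constant__ array above must be populated from the host before the kernel runs; a minimal host-side sketch (with an illustrative 3x3 averaging filter and a hypothetical helper name) might look like this:
void setupAndLaunch(float* d_input, float* d_output, int width, int height) {
    // Copy the filter coefficients into constant memory
    float h_kernel[9] = { 1.0f/9, 1.0f/9, 1.0f/9,
                          1.0f/9, 1.0f/9, 1.0f/9,
                          1.0f/9, 1.0f/9, 1.0f/9 };
    cudaMemcpyToSymbol(convolution_kernel, h_kernel, sizeof(h_kernel));

    // One 16x16 block per output tile
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    convolutionWithShared<<<grid, block>>>(d_input, d_output, width, height);
}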
Best Practices and Guidelines
1. Function Design Principles
- Keep device functions simple: Avoid complex control flow in __device__ functions
- Minimize parameter passing: Use references or pointers when possible
- Consider inlining: Small __device__ functions are typically inlined by the compiler
2. Memory Management
- Use appropriate memory spaces:
  - Shared memory for data shared within a block
  - Constant memory for read-only data accessed by all threads
  - Global memory for large datasets
3. Performance Considerations
- Avoid divergent branching: Minimize data-dependent if-else branching within a warp
- Optimize memory access patterns: Ensure coalesced memory access (see the sketch after this list)
- Use dual functions wisely: __host__ __device__ functions can help with code reuse
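As a rough sketch of the difference (illustrative kernels, not part of the examples above):
// Adjacent threads read adjacent addresses: accesses coalesce into few transactions
__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Large strides scatter the accesses of a warp and waste memory bandwidth
__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}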
4. Debugging and Development
- Start simple: Begin with basic __global__ kernels before adding complexity
- Test incrementally: Verify each function type works correctly
- Use proper error checking: Always check CUDA error codes (a helper macro sketch follows below)
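A common (unofficial) pattern is a small wrapper macro; CUDA_CHECK below is our own name, not part of the CUDA API:
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_data, bytes));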
Common Pitfalls and Solutions
1. Calling Restrictions
Problem: Trying to call __device__ functions from host code
// WRONG: this will cause a compilation error
__device__ int deviceFunc() { return 42; }

int main() {
    int result = deviceFunc(); // ERROR: device function called from host code
    return 0;
}
Solution: Use appropriate calling patterns
__device__ int deviceFunc() { return 42; }

__global__ void kernel(int* result) {
    *result = deviceFunc(); // CORRECT
}
2. Return Type Restrictions
Problem: Returning non-void from __global__ functions
Solution: Use output parameters
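A minimal sketch of the output-parameter pattern (names are illustrative):
// The kernel "returns" its result by writing to device memory
__global__ void computeAnswer(int* out) {
    *out = 42;
}

// Host side: cudaMalloc an int, launch computeAnswer<<<1, 1>>>(d_out),
// then cudaMemcpy the result back to the host.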
3. Memory Space Confusion
Problem: Incorrectly accessing different memory spaces
__global__ void kernel(float* hostPtr) {
    __shared__ float sharedData[256];
    // ... fill sharedData ...

    // WRONG: shared memory is private to the block and cannot be exposed to
    // the host; cudaMemcpy also cannot be called this way from device code
    cudaMemcpy(hostPtr, sharedData, sizeof(float) * 256, cudaMemcpyDeviceToHost);
}
Solution: Copy through global memory
__global__ void kernel(float* globalOutput) {
    __shared__ float sharedData[256];

    // Process data in shared memory
    // ...

    // Copy to global memory
    if (threadIdx.x < 256) {
        globalOutput[threadIdx.x] = sharedData[threadIdx.x];
    }
}
Advanced Topics
1. Dynamic Parallelism
CUDA supports calling kernels from device code (compute capability 3.5+):
#include <cstdio>

__global__ void childKernel() {
    printf("Hello from child kernel!\n");
}

__global__ void parentKernel() {
    // Launch a child kernel from device code
    childKernel<<<1, 1>>>();
    // Child grids are guaranteed to complete before the parent grid finishes.
    // (Device-side cudaDeviceSynchronize() is deprecated in recent CUDA releases.)
}
2. Cooperative Groups
Modern CUDA programming with cooperative groups:
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void cooperativeKernel() {
    cg::thread_block block = cg::this_thread_block();

    // Synchronize all threads in the block
    block.sync();

    // Partition the block into 32-thread tiles and reduce within each tile
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    int sum = cg::reduce(tile, (int)threadIdx.x, cg::plus<int>());
}
Conclusion
Understanding CUDA function type annotations is fundamental to effective GPU programming. These annotations control:
- Execution location: Where the function runs (CPU vs GPU)
- Calling context: From where the function can be called
- Memory access: What memory spaces the function can access
- Performance characteristics: How the function is optimized
By mastering these concepts and following best practices, you can write efficient, maintainable CUDA code that leverages the full power of GPU computing.
Related Topics
- CUDA Memory Model
- Thread Synchronization
- Performance Optimization
- Error Handling in CUDA
- Advanced CUDA Features
This document provides a comprehensive overview of CUDA function annotations. For more advanced topics and latest features, refer to the official CUDA Programming Guide.