CUDA GPU Organization Hierarchy
This document provides a comprehensive overview of the NVIDIA GPU architecture and programming model hierarchy, from both hardware and software perspectives.
You can find the code at https://github.com/eunomia-bpf/basic-cuda-tutorial
Table of Contents
- Hardware Organization
- Software Programming Model
- Memory Hierarchy
- Execution Model
- Performance Considerations
- Example Application
Hardware Organization
GPU Architecture Evolution
NVIDIA GPUs have evolved through multiple architecture generations:
Architecture | Example GPUs | Key Features |
---|---|---|
Tesla | GeForce 8/9/200 series | First CUDA-capable GPUs |
Fermi | GeForce 400/500 series | L1/L2 cache, improved double precision |
Kepler | GeForce 600/700 series | Dynamic parallelism, Hyper-Q |
Maxwell | GeForce 900 series | Improved power efficiency |
Pascal | GeForce 10 series, Tesla P100 | Unified memory improvements, NVLink |
Volta | Tesla V100 | Tensor Cores, independent thread scheduling |
Turing | GeForce RTX 20 series | RT Cores, improved Tensor Cores |
Ampere | GeForce RTX 30 series, A100 | 3rd gen Tensor Cores, sparsity acceleration |
Hopper | H100 | 4th gen Tensor Cores, Transformer Engine |
Ada Lovelace | GeForce RTX 40 series | RT improvements, DLSS 3 |
Hardware Components
A modern NVIDIA GPU consists of:
- Streaming Multiprocessors (SMs): The basic computational units
- Tensor Cores: Specialized for matrix operations (newer GPUs)
- RT Cores: Specialized for ray tracing (RTX GPUs)
- Memory Controllers: Interface with device memory
- L2 Cache: Shared among all SMs
- Scheduler: Manages execution of thread blocks
Streaming Multiprocessor (SM) Architecture
Each SM contains:
- CUDA Cores: Integer and floating-point arithmetic units
- Tensor Cores: Matrix multiply-accumulate units
- Warp Schedulers: Manage thread execution
- Register File: Ultra-fast storage for thread variables
- Shared Memory/L1 Cache: Fast memory shared by threads in a block
- Load/Store Units: Handle memory operations
- Special Function Units (SFUs): Calculate transcendentals (sin, cos, etc.)
- Texture Units: Specialized for texture operations
Example SM architecture (diagram not included, reference only)
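Many of the hardware parameters above can be queried at run time with `cudaGetDeviceProperties` (the accompanying example also does this in its hardware-inspection step). A minimal sketch; the selection of fields printed here is illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query and print a few of the hardware parameters discussed above (device 0).
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("GPU: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);
    printf("L2 cache: %d KB\n", prop.l2CacheSize / 1024);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```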
Software Programming Model
CUDA programs are organized in a hierarchical structure:
Thread Hierarchy
- Thread: The smallest execution unit, runs a program instance
- Warp: Group of 32 threads that execute in lockstep (SIMT)
- Block: Group of threads that can cooperate via shared memory
- Grid: Collection of blocks that execute the same kernel
Grid
├── Block (0,0) Block (1,0) Block (2,0)
├── Block (0,1) Block (1,1) Block (2,1)
└── Block (0,2) Block (1,2) Block (2,2)
Block (1,1)
├── Thread (0,0) Thread (1,0) Thread (2,0)
├── Thread (0,1) Thread (1,1) Thread (2,1)
└── Thread (0,2) Thread (1,2) Thread (2,2)
Thread Indexing
Threads can be organized in 1D, 2D, or 3D arrangements. Each thread can be uniquely identified by:
// 1D grid of 1D blocks
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// 2D grid of 2D blocks
int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
int tid = tid_y * gridDim.x * blockDim.x + tid_x;
// 3D grid of 3D blocks
int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
int tid_z = blockIdx.z * blockDim.z + threadIdx.z;
Kernel Execution Configuration
A kernel is launched with a specific grid and block configuration:
dim3 block(16, 16, 1); // 16×16 threads per block
dim3 grid(N/16, N/16, 1); // Grid dimensions adjusted to data size
myKernel<<<grid, block>>>(params...);
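Note that `N/16` truncates when N is not a multiple of the block size. A common pattern rounds the grid up and guards against out-of-range threads inside the kernel; a minimal sketch, where the kernel name `scaleKernel` and its parameters are illustrative:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales an N×N matrix in place.
__global__ void scaleKernel(float *data, int N, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < N && y < N)                      // guard: the grid may overshoot the data
        data[y * N + x] *= factor;
}

void launchScale(float *d_data, int N, float factor) {
    dim3 block(16, 16, 1);
    // Round up so every element is covered even when N % 16 != 0
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y, 1);
    scaleKernel<<<grid, block>>>(d_data, N, factor);
}
```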
Synchronization
- Block-level: `__syncthreads()` synchronizes all threads in a block (see the sketch after this list)
- System-level: `cudaDeviceSynchronize()` waits for all kernels to complete
- Stream-level: `cudaStreamSynchronize(stream)` waits for operations in a stream
- Cooperative Groups: more flexible synchronization patterns (newer CUDA versions)
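As an illustration of the first two levels, the sketch below uses `__syncthreads()` inside a block-wide sum reduction and `cudaDeviceSynchronize()` on the host. The kernel name and sizes are illustrative, not taken from basic04.cu:

```cuda
// Block-level sum reduction; __syncthreads() keeps the shared array consistent
// between the load phase and each halving step.
__global__ void blockSumKernel(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];                     // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                               // all loads finished before reducing
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                           // every step completes block-wide
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[0];
}

// Host side: cudaDeviceSynchronize() blocks until the kernel has finished,
// e.g. before timing or reading results back:
//   blockSumKernel<<<numBlocks, 256>>>(d_in, d_blockSums, n);
//   cudaDeviceSynchronize();
```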
Memory Hierarchy
GPUs have a complex memory hierarchy with different performance characteristics:
Device Memory Types
1. Global Memory
   - Largest capacity (several GB)
   - Accessible by all threads
   - High latency (hundreds of cycles)
   - Used for main data storage
   - Bandwidth: ~500-2000 GB/s depending on GPU

2. Shared Memory
   - Small capacity (up to 164 KB per SM in newer GPUs)
   - Accessible by threads within a block
   - Low latency (similar to L1 cache)
   - Used for inter-thread communication and data reuse
   - Organized in banks for parallel access

3. Constant Memory
   - Small (64 KB per device)
   - Read-only for kernels
   - Cached and optimized for broadcast
   - Used for unchanging parameters

4. Texture Memory
   - Cached read-only memory
   - Optimized for 2D/3D spatial locality
   - Hardware interpolation
   - Used for image processing

5. Local Memory
   - Per-thread private storage
   - Used for register spills
   - Actually resides in global memory
   - Automatic variable arrays often stored here

6. Registers
   - Fastest memory type
   - Per-thread private storage
   - Limited number per thread
   - Used for thread-local variables
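The sketch below touches several of these spaces in one kernel (constant, shared, register, global); the coefficients and names are purely illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float c_coeffs[2];        // constant memory: read-only in kernels, broadcast-cached

__global__ void memorySpacesKernel(const float *g_in, float *g_out, int n) {
    __shared__ float s_tile[256];      // shared memory: visible to the whole block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    s_tile[threadIdx.x] = (tid < n) ? g_in[tid] : 0.0f;         // global -> shared
    __syncthreads();

    float r = s_tile[threadIdx.x] * c_coeffs[0] + c_coeffs[1];  // register temporary
    if (tid < n)
        g_out[tid] = r;                                         // register -> global
}

// Host side: constant memory is filled with cudaMemcpyToSymbol, e.g.
//   float h_coeffs[2] = {2.0f, 1.0f};
//   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
```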
Memory Management Models
1. Explicit Memory Management

   // Allocate device memory
   float *d_data;
   cudaMalloc(&d_data, size);
   // Transfer data to device
   cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
   // Launch kernel
   kernel<<<grid, block>>>(d_data);
   // Transfer results back
   cudaMemcpy(h_result, d_data, size, cudaMemcpyDeviceToHost);
   // Free device memory
   cudaFree(d_data);

2. Unified Memory

   // Allocate unified memory
   float *data;
   cudaMallocManaged(&data, size);
   // Initialize data (on host)
   for (int i = 0; i < N; i++) data[i] = i;
   // Launch kernel (data automatically migrated)
   kernel<<<grid, block>>>(data);
   // Wait for kernel to finish
   cudaDeviceSynchronize();
   // Access results (data automatically migrated back)
   float result = data[0];
   // Free unified memory
   cudaFree(data);

3. Zero-Copy Memory
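Zero-copy memory lets device threads read and write pinned host memory directly, which can make sense for data that is touched only once. A sketch of the usual pattern; the kernel and sizes are illustrative, while the API calls are standard CUDA runtime functions:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that increments each element in place.
__global__ void incrementKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void zeroCopyExample(int n) {
    float *h_data, *d_alias;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped); // pinned + mapped
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    cudaHostGetDevicePointer(&d_alias, h_data, 0);    // device-visible alias of h_data
    incrementKernel<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();                          // results are now visible in h_data

    cudaFreeHost(h_data);
}
```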
Memory Access Patterns
- Coalesced Access: Threads in a warp access contiguous memory (contrasted with strided access in the sketch below)
- Strided Access: Threads in a warp access memory with a fixed stride between them
- Bank Conflicts: Multiple threads access the same shared memory bank
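The pair of kernels below contrasts the first two patterns; the names are illustrative (basic04.cu contains its own variants of this comparison):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// accesses combine into a few wide memory transactions.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are `stride` elements apart, so a warp's
// accesses scatter across many transactions and waste bandwidth.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```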
Execution Model
SIMT Execution
The GPU executes threads in groups of 32 (warps) using the Single Instruction, Multiple Thread (SIMT) model:
- All threads in a warp execute the same instruction
- Divergent paths are serialized (warp divergence)
- Predication is used for short conditional sections
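A small sketch of divergence and its predicated alternative (illustrative only):

```cuda
// Odd and even lanes take different branches, so the warp executes both
// paths one after the other with the inactive lanes masked off.
__global__ void divergentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes
    else
        data[i] = data[i] + 1.0f;   // odd lanes

    // Short conditionals like this are often compiled to predicated
    // instructions instead, avoiding the serialization:
    // data[i] = (threadIdx.x % 2 == 0) ? data[i] * 2.0f : data[i] + 1.0f;
}
```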
Scheduling
- Block Scheduling:
  - Blocks are assigned to SMs based on available resources
  - Once assigned, a block runs to completion on that SM
  - Blocks cannot communicate with each other
- Warp Scheduling:
  - Warps are the basic scheduling unit
  - Hardware warp schedulers select ready warps for execution
  - Latency is hidden by interleaving warps
- Instruction-Level Scheduling:
  - Instructions from different warps can be interleaved
  - Helps hide memory and instruction latency
Occupancy
Occupancy is the ratio of active warps to maximum possible warps on an SM:
- Limited by resources: registers, shared memory, block size
- Higher occupancy generally improves latency hiding
- Not always linearly correlated with performance
Factors affecting occupancy:
- Register usage per thread: more registers = fewer warps
- Shared memory per block: more shared memory = fewer blocks
- Block size: very small blocks reduce occupancy
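Theoretical occupancy for a given kernel and block size can be queried through the runtime's occupancy API; a sketch, where the kernel and block size are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float *data) { /* placeholder kernel body */ }

void reportOccupancy() {
    int blockSize = 256, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of someKernel fit on one SM, given its register and
    // shared-memory usage and this block size (0 bytes of dynamic shared memory)?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, someKernel, blockSize, 0);

    float occupancy = (float)(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.2f\n", occupancy);
}
```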
Performance Considerations
Memory Optimization
- Coalesced Access: Ensure threads in a warp access contiguous memory
- Shared Memory: Use for data reused within a block
- L1/Texture Cache: Leverage for read-only data with spatial locality
- Memory Bandwidth: Often the limiting factor; minimize transfers
Execution Optimization
- Occupancy: Balance resource usage to maximize active warps
- Warp Divergence: Minimize divergent paths within warps
- Instruction Mix: Balance arithmetic operations and memory accesses
- Kernel Fusion: Combine multiple operations into one kernel to reduce launch overhead
Common Optimization Techniques
- Tiling: Divide data into tiles that fit in shared memory
- Loop Unrolling: Reduce loop overhead
- Prefetching: Load data before it's needed
- Warp Shuffle: Exchange data between threads in a warp without shared memory
- Persistent Threads: Keep threads active for multiple work items
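As an illustration of the warp-shuffle technique listed above, the sketch below sums values within a warp using `__shfl_down_sync` and then combines warps with a single atomic per warp; the names are illustrative:

```cuda
// Warp-level sum using shuffle: lanes exchange registers directly,
// with no shared memory or __syncthreads() needed inside the warp.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);   // full-warp mask
    return val;                                              // lane 0 holds the warp's sum
}

__global__ void warpSumKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)     // one atomicAdd per warp
        atomicAdd(out, v);
}
```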
Example Application
The accompanying `basic04.cu` demonstrates:
- Hardware inspection: Querying and displaying device properties
- Thread hierarchy: Visualizing the grid/block/thread structure
- Memory types: Using global, shared, constant, local, and register memory
- Memory access patterns: Demonstrating coalesced vs. non-coalesced access
- Warp execution: Showing warp ID, lane ID, and divergence effects
Key Code Sections
Thread identification and hierarchy:
__global__ void threadHierarchyKernel() {
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
int bx = blockIdx.x;
int by = blockIdx.y;
int bz = blockIdx.z;
// Print thread position
printf("Thread (%d,%d,%d) in Block (%d,%d,%d)\n", tx, ty, tz, bx, by, bz);
}
Shared memory usage:
__global__ void sharedMemoryKernel(float *input, float *output) {
__shared__ float sharedData[256];
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int localId = threadIdx.x;
// Load data to shared memory
sharedData[localId] = input[tid];
// Synchronize
__syncthreads();
// Use shared data
output[tid] = sharedData[localId];
}
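A possible host-side driver for this kernel (not part of the listing; the sizes and names are illustrative and assume n is a multiple of 256):

```cuda
void runSharedMemoryKernel(const float *h_in, float *h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block to match the sharedData[256] tile
    sharedMemoryKernel<<<n / 256, 256>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```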