CUDA GPU Organization Hierarchy
This document provides a comprehensive overview of the NVIDIA GPU architecture and programming model hierarchy, from both hardware and software perspectives.
You can find the code at https://github.com/eunomia-bpf/basic-cuda-tutorial
Table of Contents
- Hardware Organization
- Software Programming Model
- Memory Hierarchy
- Execution Model
- Performance Considerations
- Example Application
Hardware Organization
GPU Architecture Evolution
NVIDIA GPUs have evolved through multiple architecture generations:
Architecture | Example GPUs | Key Features |
---|---|---|
Tesla | GeForce 8/9/200 series | First CUDA-capable GPUs |
Fermi | GeForce 400/500 series | L1/L2 cache, improved double precision |
Kepler | GeForce 600/700 series | Dynamic parallelism, Hyper-Q |
Maxwell | GeForce 900 series | Improved power efficiency |
Pascal | GeForce 10 series, Tesla P100 | Unified memory improvements, NVLink |
Volta | Tesla V100 | Tensor Cores, independent thread scheduling |
Turing | GeForce RTX 20 series | RT Cores, improved Tensor Cores |
Ampere | GeForce RTX 30 series, A100 | 3rd gen Tensor Cores, sparsity acceleration |
Hopper | H100 | 4th gen Tensor Cores, Transformer Engine |
Ada Lovelace | GeForce RTX 40 series | RT improvements, DLSS 3 |
Hardware Components
A modern NVIDIA GPU consists of:
- Streaming Multiprocessors (SMs): The basic computational units
- Tensor Cores: Specialized for matrix operations (newer GPUs)
- RT Cores: Specialized for ray tracing (RTX GPUs)
- Memory Controllers: Interface with device memory
- L2 Cache: Shared among all SMs
- Scheduler: Manages execution of thread blocks
Streaming Multiprocessor (SM) Architecture
Each SM contains:
- CUDA Cores: Integer and floating-point arithmetic units
- Tensor Cores: Matrix multiply-accumulate units
- Warp Schedulers: Manage thread execution
- Register File: Ultra-fast storage for thread variables
- Shared Memory/L1 Cache: Fast memory shared by threads in a block
- Load/Store Units: Handle memory operations
- Special Function Units (SFUs): Calculate transcendentals (sin, cos, etc.)
- Texture Units: Specialized for texture operations
Example SM architecture (diagram not included, reference only)
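Many of the hardware parameters above can be queried at run time with `cudaGetDeviceProperties` (the accompanying example also does this in its hardware-inspection step). A minimal sketch; the selection of fields printed here is illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query and print a few of the hardware parameters discussed above (device 0).
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("GPU: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);
    printf("L2 cache: %d KB\n", prop.l2CacheSize / 1024);
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```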
Software Programming Model
CUDA programs are organized in a hierarchical structure:
Thread Hierarchy
- Thread: The smallest execution unit, runs a program instance
- Warp: Group of 32 threads that execute in lockstep (SIMT)
- Block: Group of threads that can cooperate via shared memory
- Grid: Collection of blocks that execute the same kernel
Grid
├── Block (0,0) Block (1,0) Block (2,0)
├── Block (0,1) Block (1,1) Block (2,1)
└── Block (0,2) Block (1,2) Block (2,2)
Block (1,1)
├── Thread (0,0) Thread (1,0) Thread (2,0)
├── Thread (0,1) Thread (1,1) Thread (2,1)
└── Thread (0,2) Thread (1,2) Thread (2,2)
Thread Indexing
Threads can be organized in 1D, 2D, or 3D arrangements. Each thread can be uniquely identified by:
// 1D grid of 1D blocks
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// 2D grid of 2D blocks
int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
int tid = tid_y * gridDim.x * blockDim.x + tid_x;
// 3D grid of 3D blocks
int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
int tid_z = blockIdx.z * blockDim.z + threadIdx.z;
Kernel Execution Configuration
A kernel is launched with a specific grid and block configuration:
dim3 block(16, 16, 1); // 16×16 threads per block
dim3 grid(N/16, N/16, 1); // Grid dimensions adjusted to data size
myKernel<<<grid, block>>>(params...);
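Note that `N/16` truncates when N is not a multiple of the block size. A common pattern rounds the grid up and guards against out-of-range threads inside the kernel; a minimal sketch, where the kernel name `scaleKernel` and its parameters are illustrative:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales an N×N matrix in place.
__global__ void scaleKernel(float *data, int N, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < N && y < N)                      // guard: the grid may overshoot the data
        data[y * N + x] *= factor;
}

void launchScale(float *d_data, int N, float factor) {
    dim3 block(16, 16, 1);
    // Round up so every element is covered even when N % 16 != 0
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y, 1);
    scaleKernel<<<grid, block>>>(d_data, N, factor);
}
```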
Synchronization
- Block-level: `__syncthreads()` synchronizes all threads in a block (see the sketch after this list)
- System-level: `cudaDeviceSynchronize()` waits for all kernels to complete
- Stream-level: `cudaStreamSynchronize(stream)` waits for operations in a stream
- Cooperative Groups: more flexible synchronization patterns (newer CUDA versions)
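As an illustration of the first two levels, the sketch below uses `__syncthreads()` inside a block-wide sum reduction and `cudaDeviceSynchronize()` on the host. The kernel name and sizes are illustrative, not taken from basic04.cu:

```cuda
// Block-level sum reduction; __syncthreads() keeps the shared array consistent
// between the load phase and each halving step.
__global__ void blockSumKernel(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];                     // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                               // all loads finished before reducing
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                           // every step completes block-wide
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[0];
}

// Host side: cudaDeviceSynchronize() blocks until the kernel has finished,
// e.g. before timing or reading results back:
//   blockSumKernel<<<numBlocks, 256>>>(d_in, d_blockSums, n);
//   cudaDeviceSynchronize();
```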
Memory Hierarchy
GPUs have a complex memory hierarchy with different performance characteristics:
Device Memory Types
1. Global Memory
   - Largest capacity (several GB)
   - Accessible by all threads
   - High latency (hundreds of cycles)
   - Used for main data storage
   - Bandwidth: ~500-2000 GB/s depending on GPU

2. Shared Memory
   - Small capacity (up to 164 KB per SM in newer GPUs)
   - Accessible by threads within a block
   - Low latency (similar to L1 cache)
   - Used for inter-thread communication and data reuse
   - Organized in banks for parallel access

3. Constant Memory
   - Small (64 KB per device)
   - Read-only for kernels
   - Cached and optimized for broadcast
   - Used for unchanging parameters

4. Texture Memory
   - Cached read-only memory
   - Optimized for 2D/3D spatial locality
   - Hardware interpolation
   - Used for image processing

5. Local Memory
   - Per-thread private storage
   - Used for register spills
   - Actually resides in global memory
   - Automatic variable arrays often stored here

6. Registers
   - Fastest memory type
   - Per-thread private storage
   - Limited number per thread
   - Used for thread-local variables
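The sketch below touches several of these spaces in one kernel (constant, shared, register, global); the coefficients and names are purely illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float c_coeffs[2];        // constant memory: read-only in kernels, broadcast-cached

__global__ void memorySpacesKernel(const float *g_in, float *g_out, int n) {
    __shared__ float s_tile[256];      // shared memory: visible to the whole block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    s_tile[threadIdx.x] = (tid < n) ? g_in[tid] : 0.0f;         // global -> shared
    __syncthreads();

    float r = s_tile[threadIdx.x] * c_coeffs[0] + c_coeffs[1];  // register temporary
    if (tid < n)
        g_out[tid] = r;                                         // register -> global
}

// Host side: constant memory is filled with cudaMemcpyToSymbol, e.g.
//   float h_coeffs[2] = {2.0f, 1.0f};
//   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
```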
Memory Management Models
1. Explicit Memory Management

   // Allocate device memory
   float *d_data;
   cudaMalloc(&d_data, size);
   // Transfer data to device
   cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
   // Launch kernel
   kernel<<<grid, block>>>(d_data);
   // Transfer results back
   cudaMemcpy(h_result, d_data, size, cudaMemcpyDeviceToHost);
   // Free device memory
   cudaFree(d_data);

2. Unified Memory

   // Allocate unified memory
   float *data;
   cudaMallocManaged(&data, size);
   // Initialize data (on host)
   for (int i = 0; i < N; i++) data[i] = i;
   // Launch kernel (data automatically migrated)
   kernel<<<grid, block>>>(data);
   // Wait for kernel to finish
   cudaDeviceSynchronize();
   // Access results (data automatically migrated back)
   float result = data[0];
   // Free unified memory
   cudaFree(data);

3. Zero-Copy Memory
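Zero-copy memory lets device threads read and write pinned host memory directly, which can make sense for data that is touched only once. A sketch of the usual pattern; the kernel and sizes are illustrative, while the API calls are standard CUDA runtime functions:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that increments each element in place.
__global__ void incrementKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void zeroCopyExample(int n) {
    float *h_data, *d_alias;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped); // pinned + mapped
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    cudaHostGetDevicePointer(&d_alias, h_data, 0);    // device-visible alias of h_data
    incrementKernel<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();                          // results are now visible in h_data

    cudaFreeHost(h_data);
}
```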
Memory Access Patterns
- Coalesced Access: Threads in a warp access contiguous memory (contrasted with strided access in the sketch below)
- Strided Access: Threads in a warp access memory with a fixed stride between them
- Bank Conflicts: Multiple threads access the same shared memory bank
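The pair of kernels below contrasts the first two patterns; the names are illustrative (basic04.cu contains its own variants of this comparison):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// accesses combine into a few wide memory transactions.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are `stride` elements apart, so a warp's
// accesses scatter across many transactions and waste bandwidth.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```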
Execution Model
SIMT Execution
The GPU executes threads in groups of 32 (warps) using the Single Instruction, Multiple Thread (SIMT) model:
- All threads in a warp execute the same instruction
- Divergent paths are serialized (warp divergence)
- Predication is used for short conditional sections
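A small sketch of divergence and its predicated alternative (illustrative only):

```cuda
// Odd and even lanes take different branches, so the warp executes both
// paths one after the other with the inactive lanes masked off.
__global__ void divergentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes
    else
        data[i] = data[i] + 1.0f;   // odd lanes

    // Short conditionals like this are often compiled to predicated
    // instructions instead, avoiding the serialization:
    // data[i] = (threadIdx.x % 2 == 0) ? data[i] * 2.0f : data[i] + 1.0f;
}
```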
Scheduling
- Block Scheduling:
  - Blocks are assigned to SMs based on available resources
  - Once assigned, a block runs to completion on that SM
  - Blocks cannot communicate with each other
- Warp Scheduling:
  - Warps are the basic scheduling unit
  - Hardware warp schedulers select ready warps for execution
  - Latency is hidden by interleaving warps
- Instruction-Level Scheduling:
  - Instructions from different warps can be interleaved
  - Helps hide memory and instruction latency
Occupancy
Occupancy is the ratio of active warps to maximum possible warps on an SM:
- Limited by resources: registers, shared memory, block size
- Higher occupancy generally improves latency hiding
- Not always linearly correlated with performance
Factors affecting occupancy:
- Register usage per thread: more registers = fewer warps
- Shared memory per block: more shared memory = fewer blocks
- Block size: very small blocks reduce occupancy
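Theoretical occupancy for a given kernel and block size can be queried through the runtime's occupancy API; a sketch, where the kernel and block size are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float *data) { /* placeholder kernel body */ }

void reportOccupancy() {
    int blockSize = 256, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of someKernel fit on one SM, given its register and
    // shared-memory usage and this block size (0 bytes of dynamic shared memory)?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, someKernel, blockSize, 0);

    float occupancy = (float)(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.2f\n", occupancy);
}
```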
Performance Considerations
Memory Optimization
- Coalesced Access: Ensure threads in a warp access contiguous memory
- Shared Memory: Use for data reused within a block
- L1/Texture Cache: Leverage for read-only data with spatial locality
- Memory Bandwidth: Often the limiting factor; minimize transfers
Execution Optimization
- Occupancy: Balance resource usage to maximize active warps
- Warp Divergence: Minimize divergent paths within warps
- Instruction Mix: Balance arithmetic operations and memory accesses
- Kernel Fusion: Combine multiple operations into one kernel to reduce launch overhead
Common Optimization Techniques
- Tiling: Divide data into tiles that fit in shared memory
- Loop Unrolling: Reduce loop overhead
- Prefetching: Load data before it's needed
- Warp Shuffle: Exchange data between threads in a warp without shared memory
- Persistent Threads: Keep threads active for multiple work items
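As an illustration of the warp-shuffle technique listed above, the sketch below sums values within a warp using `__shfl_down_sync` and then combines warps with a single atomic per warp; the names are illustrative:

```cuda
// Warp-level sum using shuffle: lanes exchange registers directly,
// with no shared memory or __syncthreads() needed inside the warp.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);   // full-warp mask
    return val;                                              // lane 0 holds the warp's sum
}

__global__ void warpSumKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)     // one atomicAdd per warp
        atomicAdd(out, v);
}
```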
Example Application
The accompanying `basic04.cu` demonstrates:
- Hardware inspection: Querying and displaying device properties
- Thread hierarchy: Visualizing the grid/block/thread structure
- Memory types: Using global, shared, constant, local, and register memory
- Memory access patterns: Demonstrating coalesced vs. non-coalesced access
- Warp execution: Showing warp ID, lane ID, and divergence effects
Key Code Sections
Thread identification and hierarchy:
__global__ void threadHierarchyKernel() {
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
int bx = blockIdx.x;
int by = blockIdx.y;
int bz = blockIdx.z;
// Print thread position
printf("Thread (%d,%d,%d) in Block (%d,%d,%d)\n", tx, ty, tz, bx, by, bz);
}
Shared memory usage:
__global__ void sharedMemoryKernel(float *input, float *output) {
__shared__ float sharedData[256];
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int localId = threadIdx.x;
// Load data to shared memory
sharedData[localId] = input[tid];
// Synchronize
__syncthreads();
// Use shared data
output[tid] = sharedData[localId];
}
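A possible host-side driver for this kernel (not part of the listing; the sizes and names are illustrative and assume n is a multiple of 256):

```cuda
void runSharedMemoryKernel(const float *h_in, float *h_out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block to match the sharedData[256] tile
    sharedMemoryKernel<<<n / 256, 256>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```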