basic-cuda-tutorial

You can find the code in https://github.com/eunomia-bpf/basic-cuda-tutorial

A collection of CUDA programming examples to learn GPU programming with NVIDIA CUDA.

Make sure to change the gpu architecture sm_61 to your own gpu architecture in Makefile

Examples and tutorials

01-vector-addition.cu and 01-vector-addition.md: Introduction to CUDA programming with a vector addition example
02-ptx-assembly.cu and 02-ptx-assembly.md: Demonstration of CUDA PTX inline assembly with a vector multiplication example
03-gpu-programming-methods.cu and 03-gpu-programming-methods.md: Comprehensive comparison of GPU programming methods including CUDA, PTX, Thrust, Unified Memory, Shared Memory, CUDA Streams, and Dynamic Parallelism using matrix multiplication
04-gpu-architecture.cu and 04-gpu-architecture.md: Detailed exploration of GPU organization hierarchy including hardware architecture, thread/block/grid structure, memory hierarchy, and execution model
05-neural-network.cu and 05-neural-network.md: Implementing a basic neural network forward pass on GPU with CUDA
06-cnn-convolution.cu and 06-cnn-convolution.md: GPU-accelerated convolution operations for CNN with shared memory optimization
07-attention-mechanism.cu and 07-attention-mechanism.md: CUDA implementation of attention mechanism for transformer models
08-profiling-tracing.cu and 08-profiling-tracing.md: Profiling and tracing CUDA applications with CUDA Events, NVTX, and CUPTI for performance optimization
09-gpu-extension.cu and 09-gpu-extension.md: GPU application extension mechanisms for modifying behavior without source code changes, including API interception, memory management, kernel optimization, and error resilience
10-cpu-gpu-profiling-boundaries.cu and 10-cpu-gpu-profiling-boundaries.md: Advanced GPU kernel instrumentation techniques demonstrating fine-grained internal timing, divergent path analysis, dynamic workload profiling, and adaptive algorithm selection within CUDA kernels
11-fine-grained-gpu-modifications.cu and 11-fine-grained-gpu-modifications.md: Fine-grained GPU code customizations including data structure layout optimization, warp-level primitives, memory access patterns, kernel fusion, and dynamic execution path selection
12-advanced-gpu-customizations.cu and 12-advanced-gpu-customizations.md: Advanced GPU customization techniques including thread divergence mitigation, register usage optimization, mixed precision computation, persistent threads for load balancing, and warp specialization patterns
13-low-latency-gpu-packet-processing.cu and 13-low-latency-gpu-packet-processing.md: Techniques for minimizing latency in GPU-based network packet processing, including pinned memory, zero-copy memory, stream pipelining, persistent kernels, and CUDA Graphs for real-time network applications

Each tutorial includes comprehensive documentation explaining the concepts, implementation details, and optimization techniques used in ML/AI workloads on GPUs.

Share on Share on