- Introduction
- Chapter 1 Foundations of Parallel Thinking
- Chapter 2 Hardware Trends: Multicore, Manycore, and Memory Hierarchies
- Chapter 3 Workload Analysis and Parallelism Discovery
- Chapter 4 Concurrency Models: Threads, Tasks, and Dataflow
- Chapter 5 Shared Memory Basics and the Cache Coherence Landscape
- Chapter 6 Synchronization Primitives: Locks, Atomics, and Fences
- Chapter 7 Lock-Free and Wait-Free Patterns
- Chapter 8 Task Decomposition and Scheduling Strategies
- Chapter 9 Data Decomposition: Domain, Pipeline, and Tiling
- Chapter 10 Patterns for CPU Parallelism with C++ and Threading Libraries
- Chapter 11 GPU Architecture Essentials: SIMT, Warps, and Memory
- Chapter 12 GPU Programming Models: CUDA, HIP, and OpenCL
- Chapter 13 Cross-Platform Abstractions: SYCL, OpenMP Offload, and Kokkos
- Chapter 14 Data-Parallel Patterns: Map, Reduce, Scan, and Stencil
- Chapter 15 Graph and Irregular Workloads on CPUs and GPUs
- Chapter 16 Pipelining, Streams, and Overlap of Computation and Data Movement
- Chapter 17 Memory Optimization: Locality, NUMA, and Bandwidth
- Chapter 18 Synchronization on GPUs: Barriers, Atomics, and Cooperative Groups
- Chapter 19 Performance Measurement and Profiling Techniques
- Chapter 20 Tuning for Throughput and Latency: Roofline and Beyond
- Chapter 21 Parallel I/O and Data Layouts for Performance
- Chapter 22 Debugging Concurrency: Races, Deadlocks, and Heisenbugs
- Chapter 23 Reliability and Determinism: Testing and Reproducibility
- Chapter 24 Scalability Case Studies: From Desktop to Cluster
- Chapter 25 Designing for Portability, Maintainability, and Energy Efficiency
Parallel Programming Patterns: Multicore and GPU Solutions
Table of Contents
Introduction
Modern processors have stopped getting dramatically faster for a single thread, but they have become dramatically wider. Laptops ship with many cores, servers host dozens or hundreds, and even a modest GPU exposes thousands of hardware threads. Parallelism is no longer an exotic optimization; it is the path to baseline performance. This book equips you with the techniques and patterns needed to exploit concurrency on CPUs and GPUs, turning hardware parallelism into real throughput and responsiveness.
Our approach is pragmatic and pattern-driven. Rather than prescribing one framework or one language, we organize the material around reusable solutions to recurring design problems: how to decompose work, how to coordinate tasks safely, how to move data efficiently, and how to measure and tune performance. The patterns are accompanied by worked examples that make trade-offs explicit—illustrating when a technique shines, when it fails, and how to recognize the difference in your own workloads.
We begin by building mental models of concurrency. You will learn the strengths and weaknesses of threads, tasks, and dataflow; how memory models and cache coherence influence correctness and speed; and why synchronization primitives—locks, atomics, fences, and barriers—are both indispensable and dangerous. From there, we explore scalable alternatives such as lock-free and wait-free techniques, and we show how scheduling choices interact with decomposition strategies to determine utilization.
GPUs introduce a different—but complementary—execution model. We demystify SIMT, warps, occupancy, and the memory hierarchy, then ground those concepts in concrete programming models such as CUDA, HIP, OpenCL, and cross-platform layers like SYCL and OpenMP offload. You will see how classic data-parallel patterns—map, reduce, scan, stencil—translate to GPU kernels, how to orchestrate streams to overlap computation and data movement, and how to handle irregular workloads without squandering parallel efficiency.
Parallel performance is earned, not assumed. We emphasize measurement-first development using profilers, tracing, and hardware counters. You will learn to read roofline models, identify bandwidth and latency bottlenecks, reason about NUMA effects, and design data layouts that align with your target architecture. Along the way, we present debugging strategies for races, deadlocks, livelocks, and nondeterminism, and we discuss techniques for testing and reproducing concurrency bugs that resist traditional methods.
Finally, we take a holistic view of software quality in parallel systems. Performance is only one axis; portability, maintainability, energy efficiency, and reliability matter just as much for real-world deployments. The case studies and exercises at the end of each chapter connect the patterns to full applications, from desktop analytics to GPU-accelerated services and multi-node pipelines. By the end of the book, you will be able to identify parallelism opportunities, select appropriate patterns, avoid common pitfalls, and deliver correct, scalable solutions across CPUs and GPUs.
Whether you are a systems programmer squeezing cycles from a server, a data scientist accelerating kernels on a laptop GPU, or an application developer modernizing legacy code, this book is your guide to parallel programming in practice. The aim is not just to make code run faster once, but to cultivate intuition and discipline so that every design choice moves you toward correctness, scalability, and sustained performance.
CHAPTER ONE: Foundations of Parallel Thinking
Parallel programming begins with a mindset shift. For decades, programmers relied on clock speed increases to absorb the cost of inefficient algorithms and sequential logic. That era is over. Today, speed comes from doing many things at once, not from doing one thing faster. The first step toward mastery is learning to see problems through a parallel lens, identifying opportunities to split work, overlap operations, and reduce waiting. Parallelism is not a feature you bolt on at the end; it is a design dimension that shapes every decision from data layout to control flow.
A useful starting point is the distinction between concurrency and parallelism. Concurrency is about managing multiple tasks that make progress over the same period, even if only one runs at a time. It is the foundation of responsiveness and event-driven systems. Parallelism, on the other hand, is about executing multiple tasks simultaneously to reduce latency or increase throughput. A program can be concurrent without being parallel, and it can be parallel without being elegantly concurrent. Understanding both concepts prevents misapplied techniques and wasted effort.
Consider a web server. Handling multiple requests concurrently is essential even on a single core, using asynchronous I/O or event loops to avoid blocking. Parallelism enhances this by serving requests on multiple cores, but the concurrent design remains. If the server naively spawned a thread per request without resource limits, contention and memory overhead could collapse performance. Recognizing where concurrency improves structure and where parallelism accelerates execution helps you choose patterns that fit the problem and the hardware, rather than forcing a single tool everywhere.
The mental model of parallelism also depends on the nature of work. Embarrassingly parallel problems, such as applying the same transformation to independent elements, map naturally to many threads or GPU kernels. Other workloads require careful coordination, like simulations with temporal dependencies. In such cases, decomposing the problem into phases or using pipelined execution may be more effective than brute-force parallelization. The key is to match decomposition strategies to the algorithm’s constraints, data dependencies, and desired throughput.
Execution models guide how you express parallelism. On CPUs, the thread-based model provides fine-grained control but demands explicit synchronization. Task-based models elevate work units to first-class objects, enabling scheduling and load balancing. Dataflow models emphasize moving data through a graph of operations, often exposing natural concurrency. GPU programming uses a SIMT model where groups of threads execute the same instruction path in lockstep, offering massive throughput but requiring careful divergence handling. Mastering these models lets you choose the right abstraction for each situation.
Parallelism also interacts deeply with memory. Shared memory allows threads to communicate quickly but introduces risks of data races and incoherent views. Distributed memory or partitioned address spaces reduce contention but require explicit movement or message passing. In modern systems, cache hierarchies add complexity: data that is logically shared may be replicated across caches, leading to coherence traffic and false sharing. Aligning your parallel design with memory topology is often the difference between a slow prototype and a production-capable solution.
Synchronization is the glue that holds parallel designs together, and it is also the primary source of bugs and performance stalls. Locks serialize access, ensuring correctness at the cost of contention. Atomics provide lightweight coordination for single words but require careful ordering semantics. Fences enforce memory ordering across threads without locking, enabling more scalable patterns. Barriers align phases of computation, but overuse can create stragglers and idle cores. Choosing the right primitive depends on the frequency of coordination, the granularity of work, and the acceptable trade-off between simplicity and scalability.
Lock-free and wait-free techniques avoid blocking by using atomic operations to coordinate without traditional locks. These patterns can deliver high throughput under contention, but they introduce complexity and subtle correctness conditions. Lock-free algorithms guarantee that some thread makes progress, while wait-free algorithms guarantee that all threads complete in a bounded number of steps. Implementing these correctly is challenging, and often better left to libraries. Still, understanding their properties helps you evaluate existing solutions and decide when the extra complexity is justified.
Work decomposition determines how parallelism manifests across CPU cores. Domain decomposition splits data into chunks that can be processed independently, often with static partitions for simplicity or dynamic work-stealing for load balance. Pipeline decomposition separates stages of processing, allowing different cores to work on different stages concurrently, which is especially effective when stages have uneven costs. Functional decomposition assigns different tasks to different cores, like separating input parsing, computation, and output formatting. Hybrid approaches combine these to match workload characteristics and hardware resources.
On GPUs, decomposition is guided by the execution model and memory hierarchy. Threads are grouped into warps or wavefronts that execute in lockstep, and these groups are arranged into blocks and grids. Effective GPU programs partition work so that threads in a group follow similar control paths to minimize divergence. Data is organized to maximize coalesced memory accesses and to fit into shared memory for reuse. Launch configurations, such as block size and grid size, directly impact occupancy and throughput. GPU decomposition emphasizes massive parallelism, with thousands of concurrent threads cooperating on data-parallel tasks.
The role of the programmer in parallel systems evolves. You become a coordinator and a resource manager, orchestrating threads, tasks, and data movement while avoiding contention and overhead. Profiling and measurement become essential tools; intuition is valuable but must be verified with counters, traces, and timing. Common pitfalls include over-synchronization, under-decomposition, ignoring memory locality, and assuming that more threads always mean more speed. Success involves iterative refinement: identify parallelism, choose patterns, measure results, and adjust.
Looking ahead, the path to scalable software requires a blend of theory and pragmatism. Some algorithms cannot be parallelized efficiently without rethinking their core structure. Others can be accelerated dramatically with simple patterns. Hardware continues to evolve, with heterogeneous systems combining CPU cores, integrated GPUs, and discrete accelerators. Portable designs must account for these variations without sacrificing performance. By building a solid foundation in parallel thinking now, you prepare to adapt as architectures and programming models change, keeping your solutions correct, efficient, and maintainable across the systems of today and tomorrow.
This is a sample preview. The complete book contains 27 sections.