Parallel Programming Patterns: Multicore and GPU Solutions

Table of Contents

- Introduction
- Chapter 1 Foundations of Parallel Thinking
- Chapter 2 Hardware Trends: Multicore, Manycore, and Memory Hierarchies
- Chapter 3 Workload Analysis and Parallelism Discovery
- Chapter 4 Concurrency Models: Threads, Tasks, and Dataflow
- Chapter 5 Shared Memory Basics and the Cache Coherence Landscape
- Chapter 6 Synchronization Primitives: Locks, Atomics, and Fences
- Chapter 7 Lock-Free and Wait-Free Patterns
- Chapter 8 Task Decomposition and Scheduling Strategies
- Chapter 9 Data Decomposition: Domain, Pipeline, and Tiling
- Chapter 10 Patterns for CPU Parallelism with C++ and Threading Libraries
- Chapter 11 GPU Architecture Essentials: SIMT, Warps, and Memory
- Chapter 12 GPU Programming Models: CUDA, HIP, and OpenCL
- Chapter 13 Cross-Platform Abstractions: SYCL, OpenMP Offload, and Kokkos
- Chapter 14 Data-Parallel Patterns: Map, Reduce, Scan, and Stencil
- Chapter 15 Graph and Irregular Workloads on CPUs and GPUs
- Chapter 16 Pipelining, Streams, and Overlap of Computation and Data Movement
- Chapter 17 Memory Optimization: Locality, NUMA, and Bandwidth
- Chapter 18 Synchronization on GPUs: Barriers, Atomics, and Cooperative Groups
- Chapter 19 Performance Measurement and Profiling Techniques
- Chapter 20 Tuning for Throughput and Latency: Roofline and Beyond
- Chapter 21 Parallel I/O and Data Layouts for Performance
- Chapter 22 Debugging Concurrency: Races, Deadlocks, and Heisenbugs
- Chapter 23 Reliability and Determinism: Testing and Reproducibility
- Chapter 24 Scalability Case Studies: From Desktop to Cluster
- Chapter 25 Designing for Portability, Maintainability, and Energy Efficiency

Introduction
Modern processors have stopped getting dramatically faster for a single thread, but they have become dramatically wider. Laptops ship with many cores, servers host dozens or hundreds, and even a modest GPU exposes thousands of hardware threads. Parallelism is no longer an exotic optimization; it is the path to baseline performance. This book equips you with the techniques and patterns needed to exploit concurrency on CPUs and GPUs, turning hardware parallelism into real throughput and responsiveness.
Our approach is pragmatic and pattern-driven. Rather than prescribing one framework or one language, we organize the material around reusable solutions to recurring design problems: how to decompose work, how to coordinate tasks safely, how to move data efficiently, and how to measure and tune performance. The patterns are accompanied by worked examples that make trade-offs explicit—illustrating when a technique shines, when it fails, and how to recognize the difference in your own workloads.
We begin by building mental models of concurrency. You will learn the strengths and weaknesses of threads, tasks, and dataflow; how memory models and cache coherence influence correctness and speed; and why synchronization primitives—locks, atomics, fences, and barriers—are both indispensable and dangerous. From there, we explore scalable alternatives such as lock-free and wait-free techniques, and we show how scheduling choices interact with decomposition strategies to determine utilization.
GPUs introduce a different—but complementary—execution model. We demystify SIMT, warps, occupancy, and the memory hierarchy, then ground those concepts in concrete programming models such as CUDA, HIP, OpenCL, and cross-platform layers like SYCL and OpenMP offload. You will see how classic data-parallel patterns—map, reduce, scan, stencil—translate to GPU kernels, how to orchestrate streams to overlap computation and data movement, and how to handle irregular workloads without squandering parallel efficiency.
Parallel performance is earned, not assumed. We emphasize measurement-first development using profilers, tracing, and hardware counters. You will learn to read roofline models, identify bandwidth and latency bottlenecks, reason about NUMA effects, and design data layouts that align with your target architecture. Along the way, we present debugging strategies for races, deadlocks, livelocks, and nondeterminism, and we discuss techniques for testing and reproducing concurrency bugs that resist traditional methods.
Finally, we take a holistic view of software quality in parallel systems. Performance is only one axis; portability, maintainability, energy efficiency, and reliability matter just as much for real-world deployments. The case studies and exercises at the end of each chapter connect the patterns to full applications, from desktop analytics to GPU-accelerated services and multi-node pipelines. By the end of the book, you will be able to identify parallelism opportunities, select appropriate patterns, avoid common pitfalls, and deliver correct, scalable solutions across CPUs and GPUs.
Whether you are a systems programmer squeezing cycles from a server, a data scientist accelerating kernels on a laptop GPU, or an application developer modernizing legacy code, this book is your guide to parallel programming in practice. The aim is not just to make code run faster once, but to cultivate intuition and discipline so that every design choice moves you toward correctness, scalability, and sustained performance.
CHAPTER ONE: Foundations of Parallel Thinking
Parallel programming begins with a mindset shift. For decades, programmers relied on clock speed increases to absorb the cost of inefficient algorithms and sequential logic. That era is over. Today, speed comes from doing many things at once, not from doing one thing faster. The first step toward mastery is learning to see problems through a parallel lens, identifying opportunities to split work, overlap operations, and reduce waiting. Parallelism is not a feature you bolt on at the end; it is a design dimension that shapes every decision from data layout to control flow.
A useful starting point is the distinction between concurrency and parallelism. Concurrency is about managing multiple tasks that make progress over the same period, even if only one runs at a time. It is the foundation of responsiveness and event-driven systems. Parallelism, on the other hand, is about executing multiple tasks simultaneously to reduce latency or increase throughput. A program can be concurrent without being parallel, and it can be parallel without being elegantly concurrent. Understanding both concepts prevents misapplied techniques and wasted effort.
Consider a web server. Handling multiple requests concurrently is essential even on a single core, using asynchronous I/O or event loops to avoid blocking. Parallelism enhances this by serving requests on multiple cores, but the concurrent design remains. If the server naively spawned a thread per request without resource limits, contention and memory overhead could collapse performance. Recognizing where concurrency improves structure and where parallelism accelerates execution helps you choose patterns that fit the problem and the hardware, rather than forcing a single tool everywhere.
The mental model of parallelism also depends on the nature of work. Embarrassingly parallel problems, such as applying the same transformation to independent elements, map naturally to many threads or GPU kernels. Other workloads require careful coordination, like simulations with temporal dependencies. In such cases, decomposing the problem into phases or using pipelined execution may be more effective than brute-force parallelization. The key is to match decomposition strategies to the algorithm’s constraints, data dependencies, and desired throughput.
Execution models guide how you express parallelism. On CPUs, the thread-based model provides fine-grained control but demands explicit synchronization. Task-based models elevate work units to first-class objects, enabling scheduling and load balancing. Dataflow models emphasize moving data through a graph of operations, often exposing natural concurrency. GPU programming uses a SIMT model where groups of threads execute the same instruction path in lockstep, offering massive throughput but requiring careful divergence handling. Mastering these models lets you choose the right abstraction for each situation.
Parallelism also interacts deeply with memory. Shared memory allows threads to communicate quickly but introduces risks of data races and incoherent views. Distributed memory or partitioned address spaces reduce contention but require explicit movement or message passing. In modern systems, cache hierarchies add complexity: data that is logically shared may be replicated across caches, leading to coherence traffic and false sharing. Aligning your parallel design with memory topology is often the difference between a slow prototype and a production-capable solution.
Synchronization is the glue that holds parallel designs together, and it is also the primary source of bugs and performance stalls. Locks serialize access, ensuring correctness at the cost of contention. Atomics provide lightweight coordination for single words but require careful ordering semantics. Fences enforce memory ordering across threads without locking, enabling more scalable patterns. Barriers align phases of computation, but overuse can create stragglers and idle cores. Choosing the right primitive depends on the frequency of coordination, the granularity of work, and the acceptable trade-off between simplicity and scalability.
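As a first taste, here is a minimal C++ sketch contrasting the two most common primitives just mentioned: a mutex-guarded counter and an atomic counter. The names, thread counts, and iteration counts are illustrative, not drawn from any particular library or later chapter.

```cpp
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex counter_mutex;
long locked_counter = 0;             // protected by counter_mutex
std::atomic<long> atomic_counter{0}; // coordinated with atomic operations

void add_with_lock(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(counter_mutex); // serializes access
        ++locked_counter;
    }
}

void add_with_atomic(int n) {
    for (int i = 0; i < n; ++i) {
        // relaxed ordering suffices for a pure counter; use stronger
        // ordering when the counter publishes other data to readers
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(add_with_lock, 100000);
    for (int t = 0; t < 4; ++t) threads.emplace_back(add_with_atomic, 100000);
    for (auto& th : threads) th.join();
    std::cout << locked_counter << " " << atomic_counter.load() << "\n";
}
```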
Lock-free and wait-free techniques avoid blocking by using atomic operations to coordinate without traditional locks. These patterns can deliver high throughput under contention, but they introduce complexity and subtle correctness conditions. Lock-free algorithms guarantee that some thread makes progress, while wait-free algorithms guarantee that all threads complete in a bounded number of steps. Implementing these correctly is challenging, and often better left to libraries. Still, understanding their properties helps you evaluate existing solutions and decide when the extra complexity is justified.
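A small, hedged example of the lock-free style: an "update the maximum" operation built from a compare-and-swap retry loop. The function name is illustrative; production code would usually rely on a vetted library rather than hand-rolled primitives.

```cpp
#include <atomic>

// Minimal sketch of a lock-free "update maximum" using compare_exchange.
// Some thread always makes progress, so the operation is lock-free
// (though not wait-free: an unlucky thread may retry repeatedly).
void atomic_update_max(std::atomic<int>& target, int value) {
    int observed = target.load(std::memory_order_relaxed);
    while (observed < value &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak reloads 'observed' on failure; retry until
        // either we install 'value' or another thread stored a larger one
    }
}
```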
Work decomposition determines how parallelism manifests across CPU cores. Domain decomposition splits data into chunks that can be processed independently, often with static partitions for simplicity or dynamic work-stealing for load balance. Pipeline decomposition separates stages of processing, allowing different cores to work on different stages concurrently, which is especially effective when stages have uneven costs. Functional decomposition assigns different tasks to different cores, like separating input parsing, computation, and output formatting. Hybrid approaches combine these to match workload characteristics and hardware resources.
On GPUs, decomposition is guided by the execution model and memory hierarchy. Threads are grouped into warps or wavefronts that execute in lockstep, and these groups are arranged into blocks and grids. Effective GPU programs partition work so that threads in a group follow similar control paths to minimize divergence. Data is organized to maximize coalesced memory accesses and to fit into shared memory for reuse. Launch configurations, such as block size and grid size, directly impact occupancy and throughput. GPU decomposition emphasizes massive parallelism, with thousands of concurrent threads cooperating on data-parallel tasks.
The role of the programmer in parallel systems evolves. You become a coordinator and a resource manager, orchestrating threads, tasks, and data movement while avoiding contention and overhead. Profiling and measurement become essential tools; intuition is valuable but must be verified with counters, traces, and timing. Common pitfalls include over-synchronization, under-decomposition, ignoring memory locality, and assuming that more threads always mean more speed. Success involves iterative refinement: identify parallelism, choose patterns, measure results, and adjust.
Looking ahead, the path to scalable software requires a blend of theory and pragmatism. Some algorithms cannot be parallelized efficiently without rethinking their core structure. Others can be accelerated dramatically with simple patterns. Hardware continues to evolve, with heterogeneous systems combining CPU cores, integrated GPUs, and discrete accelerators. Portable designs must account for these variations without sacrificing performance. By building a solid foundation in parallel thinking now, you prepare to adapt as architectures and programming models change, keeping your solutions correct, efficient, and maintainable across the systems of today and tomorrow.
CHAPTER TWO: Hardware Trends: Multicore, Manycore, and Memory Hierarchies
Modern hardware stopped chasing single-thread supremacy and started piling on parallel resources instead. You can find this trend in every corner of computing: phones with efficiency and performance cores, laptops that balance integrated graphics with CPU muscle, and servers that treat cores as a cheap commodity while scarcity lives in memory bandwidth and cache capacity. Parallelism is no longer a boutique feature; it is the default assumption for software that intends to run fast. Understanding the shape of these machines is the prerequisite for choosing patterns that deliver the promised speedup rather than a tangle of contention.
The multicore era began modestly with two cores and quickly expanded to eight, sixteen, and beyond. Today’s mainstream processors pack multiple cores into one or more sockets, each with its own cache hierarchy and access path to main memory. Simultaneous multithreading, most famously Intel’s Hyper-Threading, exposes multiple logical threads per physical core to hide latency, but it does not double compute throughput. The practical result is a mix of compute capacity, memory bandwidth, and latency that favors designs where cores can keep their local caches busy rather than share data incessantly.
Inside each core, a superscalar out-of-order execution engine can issue multiple instructions per cycle to several execution units. This is parallelism within a single thread, but it is limited by instruction-level dependencies and available speculative resources. Branch prediction tries to keep the pipeline full, and SIMD units perform single-instruction operations on vectors of data, commonly 128, 256, or 512 bits wide. When programming for CPUs, leveraging these SIMD units via vectorized loops often yields higher performance than adding more threads, especially for regular, compute-bound workloads.
Manycore architectures take the core count to extremes. Graphics processing units, neural accelerators, and other accelerator-class chips prioritize massive concurrency over single-thread speed. A typical GPU hosts tens to hundreds of compute units, each capable of running many threads in an interleaved fashion. These machines are throughput-oriented by design: latency is hidden by switching among threads rapidly, and the ideal workload keeps the machine saturated with independent work. They are less forgiving of serial bottlenecks and divergence, but they reward data-parallel decompositions handsomely.
Memory hierarchies are the dominant performance factor in most parallel programs. Caches reduce access latency by keeping recently used data close to the cores, but they also introduce coherence and consistency challenges. A typical hierarchy includes L1 (per core), L2 (per core or shared among a few cores), and L3 (shared across many cores), followed by main memory. Access latency increases by orders of magnitude from L1 to DRAM. Thus, parallel speedups evaporate if threads spend most of their time waiting on data rather than computing.
Cache coherence keeps caches consistent across cores. The MESI protocol and its variants track cache line states—Modified, Exclusive, Shared, Invalid—to coordinate writes. Coherence works, but it is not free: writes to a line held in other caches trigger invalidations and traffic, and false sharing occurs when unrelated variables share a cache line, causing unnecessary bouncing. Aligning data to cache line boundaries, padding hot structures, and minimizing cross-core communication are practical ways to keep coherence from devouring the performance you expected from parallelism.
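The following sketch shows one practical mitigation mentioned above: aligning per-thread counters so that each occupies its own cache line. The 64-byte line size is an assumption that matches most current x86 and Arm parts; C++17 also offers std::hardware_destructive_interference_size in <new> where the library supports it.

```cpp
#include <atomic>
#include <cstddef>

// Sketch: per-thread counters, each padded out to a full cache line so that
// updates from different threads never bounce the same line between cores.
constexpr std::size_t kCacheLine = 64;  // assumption; see note above

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
    // alignas pads the struct to a full line, so adjacent array elements
    // in per_thread_counters never share one.
};

PaddedCounter per_thread_counters[8];   // one slot per worker thread
```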
Locality is the rule that makes caches effective. Temporal locality relies on reusing data soon after it was accessed, while spatial locality exploits nearby data accessed in sequence. Parallel designs that respect locality—by partitioning data into chunks that fit in local caches, ordering computations to reuse loaded values, and iterating over memory in predictable strides—achieve higher throughput than designs that scatter accesses across far-flung addresses. Ignoring locality is a common cause of disappointing speedups, especially on large datasets that outgrow cache capacity.
NUMA, or Non-Uniform Memory Access, is common in multi-socket servers. Each socket has local memory, and accessing memory on another socket incurs extra latency and consumes inter-socket bandwidth. Operating systems may allocate memory on one socket while threads run on another, creating hidden performance penalties. On NUMA systems, pinning threads to cores and allocating memory locally can yield large gains. Tools like numactl help control placement, and awareness of NUMA should influence both decomposition and data layout.
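As a concrete illustration, here is a Linux-specific sketch that pins a std::thread to a core using pthread_setaffinity_np (a GNU extension; other platforms expose different APIs). The helper name is ours; pairing pinning with local allocation, or launching under numactl with --cpunodebind and --membind, keeps threads and their memory on the same socket.

```cpp
// Linux-specific sketch: pin a std::thread to one core via
// pthread_setaffinity_np; error handling omitted for brevity.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <thread>

void pin_to_core(std::thread& t, int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
}
```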
Memory bandwidth often becomes the limiting factor for throughput-oriented workloads. Dramatic core count increases do not help if the system cannot feed the cores quickly enough. Multiple memory channels, high-bandwidth memory on GPUs, and interconnects like PCIe for discrete accelerators all define boundaries that programs must respect. Arithmetic intensity, the ratio of operations to bytes moved, is a useful lens: low-intensity kernels saturate bandwidth before compute, while high-intensity kernels benefit more from vectorization and caching. Design choices should aim to raise intensity where possible.
Integrated graphics and unified memory architectures blur the line between CPU and GPU. In systems with an integrated GPU, the accelerator shares memory bandwidth and sometimes cache with the CPU, reducing data movement costs but increasing contention. Unified memory simplifies programming by allowing a single pointer to be accessed by both CPU and GPU, but you still need to consider where data physically resides and how it migrates. Discrete GPUs with dedicated high-bandwidth memory offer higher peak throughput but impose the cost of explicit transfers over PCIe or proprietary interconnects.
Heterogeneous compute has also spawned specialized units: AI accelerators, tensor cores, and ray tracing hardware. These units often operate best when driven from the CPU or GPU as orchestrators, with the accelerator acting as a co-processor for specific kernels. Programming models are evolving to expose these units, but the fundamental principle remains: feed the accelerator with well-structured work, keep data movement minimal, and avoid synchronization that stalls the pipeline. Hardware specialization increases peak performance but narrows the set of patterns that achieve peak.
Execution resources extend beyond compute units to include memory subsystems, interconnects, and I/O. PCIe bandwidth and latency shape how fast a discrete GPU can be fed, while NVLink-like interconnects in high-end systems enable faster GPU-to-GPU communication. On smaller devices, unified memory and integrated hardware simplify the picture but introduce contention. Understanding the end-to-end data path—from application logic to cache lines to memory channels to device memory—helps you select patterns that are not just theoretically parallel but practically fast.
Different classes of CPU cores have diverged in design philosophy. Performance cores prioritize single-thread speed, aggressive out-of-order execution, and large caches, while efficiency cores focus on low power and throughput per watt. In heterogeneous CPU designs, the operating system scheduler tries to place threads appropriately, but programmers can still guide placement for critical workloads. Performance cores are best for compute-heavy, latency-sensitive tasks, whereas efficiency cores can handle background work or lightly threaded code. Recognizing these distinctions helps decide where to invest parallel effort on a given machine.
Operating systems schedule threads across these resources, and their decisions interact with your parallel design. Kernel scheduling latency, thread migrations between cores, and priority inversion can undermine expected throughput. Dedicated pinning of threads to cores reduces jitter and cache invalidations but sacrifices flexibility. For many applications, a balanced approach using thread pools with affinity hints and careful task sizing achieves good throughput without brittle hard-coding. The OS is part of your parallel system and must be accounted for in performance analysis.
GPU execution models differ fundamentally from CPUs. Threads are grouped into warps or wavefronts that execute in lockstep; divergence within a group serializes execution paths. The hardware hides latency by switching among thread groups, so keeping them busy with independent work is critical. GPU memory hierarchies include registers, shared memory (software-managed scratchpad), L1 caches, and device memory with high bandwidth but higher latency. Occupancy, the ratio of active warps to hardware slots, influences throughput, and launch configuration determines whether the hardware is saturated or starved.
Vectorization on CPUs complements threading. SIMD units can process multiple data elements per instruction, but they require data layout and control flow that enable vector execution. Compilers can auto-vectorize simple loops, but manual vector intrinsics or domain-specific libraries often yield better results. When paired with multithreading, vectorization increases arithmetic intensity per core, improving bandwidth saturation. Many throughput-oriented problems see larger gains from vectorization than from adding threads, especially on datasets that fit in cache and exhibit regular access patterns.
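A minimal sketch of a loop shaped for auto-vectorization: unit stride, no cross-iteration dependence, simple arithmetic. Whether the compiler actually vectorizes it depends on the optimization level and on proving that x and y do not alias; a vectorization report (for example, -fopt-info-vec with GCC or -Rpass=loop-vectorize with Clang) can confirm the outcome.

```cpp
#include <cstddef>

// Independent element-wise update: each iteration touches only element i,
// with unit-stride access, so the compiler can map it onto SIMD lanes.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```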
The relationship between parallel patterns and hardware capabilities is two-way. Some patterns map naturally to multicore CPUs, such as task-based parallelism for irregular workloads or pipeline decomposition for streaming. Others, like large-scale data-parallel maps and reductions, thrive on manycore GPUs. Blending patterns—using CPU cores for task orchestration and GPUs for bulk compute—can produce balanced designs. But mismatching pattern to hardware, like forcing fine-grained synchronization on GPUs or launching bulk kernels for tiny tasks, often reduces performance instead of increasing it.
Power and thermal limits constrain how long parallel resources can sustain peak rates. Boost clocks drop under sustained load, and thermal throttling can reduce throughput if cooling is inadequate. Energy efficiency matters in mobile and data centers alike, so designing parallelism to be work-conserving—avoiding oversubscription, unnecessary spinning, and idle threads—improves both performance and power profiles. Measuring performance per watt alongside throughput helps identify patterns that scale responsibly under real-world constraints.
Measurement is essential because hardware realities can defy intuition. Cache sizes, associativity, prefetching behavior, and write-back policies vary across vendors and generations. Memory latency and bandwidth vary with load and configuration. Profiling tools that expose hardware counters—cache misses, branch mispredicts, stalled cycles, port utilization—make the abstract concrete. They reveal whether a parallel slowdown stems from contention, poor locality, or insufficient independent work. Starting with measurement avoids chasing phantom bottlenecks and keeps parallel designs grounded in the machine’s actual behavior.
Parallel programming benefits from a mental map of the hardware, but it is a map with many layers. Physical constraints like NUMA and bandwidth define boundaries, architectural features like caches and SIMD shape locality and throughput, and execution models on CPUs and GPUs define how threads cooperate. Patterns must be chosen with respect to these constraints, not in spite of them. The most elegant parallel algorithm will stumble if it ignores the memory hierarchy or saturates a shared resource. Awareness of these trends is the foundation for turning parallel potential into realized performance.
CHAPTER THREE: Workload Analysis and Parallelism Discovery
Identifying where parallelism lives in your software is part detective work, part measurement, and part pattern matching. Many programs hide their parallel potential behind sequential interfaces, tangled control flows, and assumptions formed in the single-threaded era. The goal of this chapter is to give you a practical lens for spotting opportunities to do more than one thing at a time without stepping on your own feet. We will walk through concrete analysis strategies that start with the problem statement and end with a design that maps cleanly onto the hardware. Along the way, we will look at how to measure, how to reason about dependencies, and how to avoid the common mirages that make code look parallel when it is not.
Parallelism is not a property of hardware alone; it is a property of the algorithm and its data. Some algorithms expose parallelism naturally, while others require surgery to unlock it. Recognizing the shape of your workload is the first step. If the work is regular and data can be processed independently, a simple data-parallel pattern may suffice. If the work is irregular or has dependencies, you might need task parallelism, pipelines, or more specialized decompositions. The key is to ask: what work can happen simultaneously, and what prevents it from happening concurrently? Answering that question honestly saves time and avoids fruitless optimization.
A classic place to look for parallelism is in loops that transform data. Suppose you have a loop that computes a new array from an old one, with each output element depending only on the corresponding input element. This is an embarrassingly parallel situation: each iteration is independent. The hardware equivalent is a map operation, where the same function runs over many elements. On a CPU, this can be split across threads; on a GPU, it becomes a kernel that launches thousands of threads. Before you write any parallel code, prove independence by examining the loop body for shared state and cross-iteration writes.
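Here is one way such a map can look in standard C++17, using a parallel execution policy; the transformation itself is a placeholder. With GCC's libstdc++, the parallel policies are typically backed by TBB, so availability depends on the toolchain.

```cpp
#include <algorithm>
#include <cmath>
#include <execution>   // parallel execution policies (C++17)
#include <vector>

// Map over independent elements: each output depends only on the
// corresponding input, so the library may split the range across threads.
std::vector<double> apply_map(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    std::transform(std::execution::par_unseq, in.begin(), in.end(),
                   out.begin(), [](double x) { return std::sqrt(x) + 1.0; });
    return out;
}
```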
Many real-world loops have dependencies that are more subtle. For example, a loop that updates an accumulator, like summing values, cannot simply run iterations in parallel because each one modifies the same variable. However, you can often rewrite this as a reduction: compute partial sums for chunks of data, then combine them at the end. Recognizing this pattern is a pivotal skill. Reductions are everywhere, from matrix multiplication to statistics like mean and variance. Hardware provides efficient primitives for reductions, especially on GPUs, but they only apply when you correctly identify and structure the computation as a reduction rather than a sequential accumulation.
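A hedged sketch of that restructuring: each thread accumulates into its own slot, and the partials are combined once at the end. The thread count and chunking scheme are illustrative choices, not recommendations.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Reduction rewritten as per-thread partial sums plus a final combine step.
double parallel_sum(const std::vector<double>& data, unsigned num_threads = 4) {
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partial[t] += data[i];          // private slot: no sharing
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);  // combine
}
```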
Another common source of parallelism is function-level decomposition. If your program calls several independent functions in sequence, those functions may be able to run concurrently, perhaps on different cores. For example, a pipeline that loads data, computes a result, and writes output can be reorganized into three stages that operate concurrently, passing batches between them. This is pipeline parallelism, and it is particularly effective when stages have uneven costs. Instead of one thread doing all three steps, you keep all cores busy by having them work on different stages at once, like a relay race where the baton is a stream of data.
To discover these opportunities, start with a clear map of the workload. One lightweight technique is to annotate your code with comments that describe data inputs, outputs, and any side effects. For each loop or function, ask: what data does it read, what data does it write, and does any other part of the program touch the same data? If writes are disjoint or occur to independent locations, you have a candidate for data parallelism. If reads and writes overlap in a way that creates dependencies, those dependencies define the boundaries for parallelization, either by restructuring the algorithm or by adding synchronization that keeps the program correct.
Instrumentation is your ally. Before parallelizing, measure where the program spends time. A profiler or sampling tool will show the hottest functions and loops, guiding you to the parts of the workload that will benefit most from parallelism. If 90 percent of time is in a single routine that has no parallel structure, you either need to change the routine or accept that your gains will be limited. If time is spread across several routines with independent data, you have multiple targets for concurrency. Prioritize parallelizing the largest contributors to runtime, and verify with measurements that your changes are producing the expected speedup.
There is a trap called “optical parallelism,” where loops or functions look parallel because they are in different parts of the call graph, but they share data or are too small to justify parallel overhead. Launching threads or kernel invocations carries costs. If a loop runs a few hundred iterations, the setup and synchronization might dominate the work. In such cases, consider batching many small tasks into one larger parallel operation, or avoid parallelizing that loop and focus on larger scales. Parallelism is not free, and discovering where it pays off requires estimating task granularity.
Dependencies are the grammar of parallel code. Understanding them prevents you from breaking correctness. Loop-carried dependencies occur when iteration i depends on iteration i−1, such as in a cumulative sum or a recurrence. Such loops cannot be parallelized without changing the algorithm. However, some dependencies can be eliminated or restructured. For example, a cumulative sum can be replaced by a prefix sum, which is parallelizable. Prefix sums are a fundamental pattern, enabling many algorithms like sparse matrix-vector multiplication or list ranking. Recognizing when to replace a sequential recurrence with a parallelizable pattern is often the difference between a serial program and a fast parallel one.
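For example, the sequential recurrence out[i] = out[i-1] + in[i] can be expressed as an inclusive scan, which the standard library may parallelize under an execution policy. This is a sketch of one formulation, not the only one.

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Running totals as a scan: the recurrence becomes a parallelizable pattern.
std::vector<long> running_totals(const std::vector<long>& in) {
    std::vector<long> out(in.size());
    std::inclusive_scan(std::execution::par, in.begin(), in.end(), out.begin());
    return out;
}
```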
Data races occur when multiple threads write to the same memory location without synchronization, or when at least one write and one read occur without ordering. To discover if your code is at risk, perform a race analysis by checking every shared variable for conflicting accesses. Tools like thread sanitizers can automate this, but code review with the right mindset is also powerful. Ask: is there a canonical order to updates? If yes, you might need locks or atomics to enforce it. If no, and updates are commutative and associative, you can restructure to allow concurrent updates, perhaps by using per-thread partial results that you merge later.
Input size and workload variability affect parallelization strategy. If the input is always huge, a simple static partitioning scheme might be enough. If input size varies widely or load across partitions is uneven, dynamic work-stealing or adaptive chunking might be needed to keep cores busy. For GPUs, occupancy and divergence depend on how the work is divided into blocks and warps. Discovering parallelism thus involves looking at the data shape: are there long contiguous arrays or sparse graphs? Different shapes suggest different patterns, from dense tiling to irregular load balancing with helper threads or atomic operations.
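One common way to get dynamic chunking without a full work-stealing runtime is to let workers claim chunk indices from a shared atomic counter, as in this illustrative sketch; the chunk size and the per-element work are placeholders.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Workers repeatedly claim the next chunk, so fast threads take more work.
void process_dynamic(std::vector<float>& items, unsigned num_threads = 4,
                     std::size_t chunk = 1024) {
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        while (true) {
            std::size_t begin = next.fetch_add(chunk);       // claim a chunk
            if (begin >= items.size()) break;
            std::size_t end = std::min(items.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                items[i] *= 2.0f;                             // placeholder work
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```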
Let’s look at a concrete CPU example. A function computes an image filter that blends each pixel with its neighbors, writing to an output image. At first glance, it looks parallel: each pixel’s output depends on a fixed neighborhood in the input, not on other output pixels. A naive attempt to parallelize might partition the image into tiles and run one tile per thread. The edges between tiles require reading neighbor pixels that belong to adjacent tiles; this is harmless while the input stays read-only, but it becomes a data race if the filter is applied in place or if tiles update shared boundary rows. The solution is either to add small overlapping halo regions so each thread reads a private copy of its neighbors, or to have threads write to disjoint output buffers and stitch the tiles together afterward. Parallelism discovery here means identifying not only the independent iterations but also the boundary conditions.
On a GPU, the same filter becomes a kernel with a thread per pixel. The kernel reads input, computes the filter, and writes to output. Because threads in the same warp execute in lockstep, divergent control flow hurts performance, but this filter is uniform. The bigger issue is memory coalescing: reading input in a way that adjacent threads read adjacent addresses maximizes bandwidth. Thus, discovering parallelism on a GPU involves not only the independence of iterations but also the memory access pattern. The best GPU code aligns the data layout with the thread layout so that memory requests become contiguous, maximizing effective bandwidth.
Pipelines offer another route to parallelism. Consider a program that parses logs, computes statistics, and writes reports. Traditionally, this is three sequential phases. With pipeline parallelism, phase 1 produces parsed records that phase 2 consumes, while phase 3 writes previous results. The stages can run concurrently on different cores. To analyze whether this helps, measure the throughput and latency of each stage. If stages are balanced, pipeline parallelism increases throughput; if one stage is a bottleneck, adding more cores to that stage or reducing its work per item can help. Pipeline analysis requires understanding the arrival rate of inputs and the service rate of each stage.
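A minimal two-stage version of such a pipeline, with a small thread-safe channel between a parsing stage and a statistics stage. The channel type and stage bodies are illustrative assumptions; a production pipeline would bound the queue and pass batches of records rather than single items.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// Tiny unbounded channel: push/close from the producer, pop from the consumer.
template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> pop() {                 // empty optional == stream ended
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
};

int main() {
    Channel<int> parsed;
    std::thread parser([&] {                 // stage 1: parse "log lines"
        for (std::string line : {"10", "20", "30"}) parsed.push(std::stoi(line));
        parsed.close();
    });
    std::thread stats([&] {                  // stage 2: accumulate statistics
        long sum = 0;
        while (auto v = parsed.pop()) sum += *v;
        std::cout << "sum=" << sum << "\n";
    });
    parser.join();
    stats.join();
}
```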
Another approach is to examine control structures. For example, a large conditional branch might be a hidden parallelism opportunity. If different branches operate on different parts of the data, you might run them concurrently. More commonly, control structures hide parallelism by serializing tasks that could be independent. Refactoring code to make independence explicit often reveals parallel opportunities. Functions that do not share state can be called concurrently, and tasks that do not depend on each other can be scheduled out of order, reducing the critical path length.
Data layout can unlock or block parallelism. If data is stored in an array of structs, you might be forced to access many fields even if you only need one. Converting to a struct of arrays improves vectorization and parallel access because each field becomes a contiguous array that threads can process independently. This is a data layout transformation that has nothing to do with threading but dramatically affects the parallel efficiency of your code. Discovering this opportunity comes from asking how data is accessed by each parallel unit and whether that access can be made linear and predictable.
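The transformation looks like this in miniature; the particle fields are illustrative.

```cpp
#include <vector>

// Array-of-structs: every loaded cache line also drags in mass and charge,
// even when the loop only needs x.
struct ParticleAoS { float x, y, z, mass, charge; };

void shift_aos(std::vector<ParticleAoS>& p, float dx) {
    for (auto& q : p) q.x += dx;
}

// Struct-of-arrays: each field is one contiguous array, so the loop streams
// through dense memory, vectorizes, and partitions cleanly across threads.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass, charge;
};

void shift_soa(ParticlesSoA& p, float dx) {
    for (float& v : p.x) v += dx;
}
```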
Memory contention is a hidden dependency that can sabotage parallelism. For example, if many threads increment a single counter frequently, they will serialize due to cache line bouncing. Discovering this problem requires looking for hot shared variables. The fix might be to use per-thread local counters and aggregate them later, or to switch to an atomic increment if the rate is low. The broader lesson is that parallelism is not just about computation; it is about minimizing communication and shared mutable state. The best parallel designs move each piece of data as few times as possible, ideally only once, and keep shared updates rare.
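In code, the per-thread-counter fix can be as simple as the following sketch: count privately, publish once at the end.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

std::atomic<long> global_hits{0};

// Each worker processes its own [begin, end) slice of the data.
void count_hits(const std::vector<int>& data, std::size_t begin, std::size_t end) {
    long local = 0;                       // thread-private: no coherence traffic
    for (std::size_t i = begin; i < end; ++i)
        if (data[i] > 0) ++local;
    global_hits.fetch_add(local);         // one shared update per thread
}
```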
For irregular workloads like graph algorithms, parallelism discovery involves identifying independent units of work. For instance, in a breadth-first search, all nodes at the current frontier can be processed in parallel. However, the frontier changes as the algorithm proceeds, and the size may vary. Detecting this pattern lets you use dynamic task generation: create tasks for each node in the frontier and process them concurrently. The challenge is load balancing and avoiding contention on shared structures like the visited set. In practice, parallel graph algorithms use a mix of data and task parallelism, and the discovery process involves tracing dependencies between levels of the graph.
When analyzing numerical algorithms, consider whether they can be restructured to use blockwise or tiled execution. Matrix multiplication is a prime example. Naive implementations access memory in ways that defeat caches. A tiled version divides matrices into blocks that fit in cache, computing a small outer product per block. Each block computation can be assigned to a thread or GPU block. The discovery step is to realize that independence exists at the level of block combinations, not just individual elements. By zooming out, you align the algorithm with the memory hierarchy, and the parallelization becomes both simpler and faster.
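A compact sketch of the tiled formulation (row-major, square matrices). Distinct output tiles, indexed here by ii and jj, are independent and can go to different threads or GPU blocks, while the kk loop accumulates into the same tile and stays serial per tile. The tile size is a tuning knob, not a prescription.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C += A * B on n-by-n row-major matrices, processed block by block so that
// each working set of A, B, and C tiles fits in cache.
void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t tile = 64) {
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t jj = 0; jj < n; jj += tile)
            for (std::size_t kk = 0; kk < n; kk += tile)
                // one block update: C[ii..][jj..] += A[ii..][kk..] * B[kk..][jj..]
                for (std::size_t i = ii; i < std::min(n, ii + tile); ++i)
                    for (std::size_t k = kk; k < std::min(n, kk + tile); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(n, jj + tile); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```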
State machines can also be parallelized when the state partitions naturally. For example, a simulation where entities interact only with nearby neighbors can be split across spatial regions. If interactions are local, you can partition the world into cells and process cells concurrently, with a small synchronization phase to exchange boundary data. Discovering this involves modeling interaction ranges and identifying partitions where interactions are sparse across boundaries. The pattern is a form of domain decomposition with a synchronization barrier. Done well, it scales with the number of entities, not the complexity of global coordination.
Another discovery strategy is to ask whether the program can be expressed as a graph of operations. Many workloads are naturally dataflow graphs: nodes are operations, edges are data channels. If the graph is acyclic, you can schedule nodes concurrently as long as their inputs are ready. This approach exposes parallelism automatically, especially in systems designed for dataflow. Even without a framework, you can draw the graph and look for nodes at the same depth level that can run in parallel. This visualization helps identify bottlenecks where one node consumes all inputs and stalls the rest.
Let’s talk about measurement-driven discovery. Time the program under different inputs to see if the runtime scales with problem size. If runtime grows linearly with input size and parallelization yields sublinear speedup, you might be memory-bound. If runtime grows quadratically but the algorithm can be reduced to linear through better decomposition, you have a massive parallel opportunity. Profilers that show stalls, cache misses, and branch mispredicts can guide refactoring. Sometimes parallelization is blocked by memory bandwidth limits; in that case, you might need to change the algorithm to reduce data movement rather than adding threads.
In practice, parallelism discovery is iterative. You start with a hypothesis based on code inspection, measure the baseline, then make small changes to verify independence or remove dependencies. You add parallelism incrementally, checking that speedup scales with core count and that correctness holds. When speedup stalls, you revisit the dependency analysis or the granularity. When correctness breaks, you look for races or missing synchronization. This feedback loop turns parallelism from a vague hope into a disciplined search, grounded in the reality of your algorithm and hardware.
Several concrete techniques can help you locate parallelism systematically:
- Hotspot analysis: Identify the top few functions where the program spends time. If they are loops or independent routines, they are primary targets.
- Data access annotation: Mark which variables are read and written by each section. Overlaps indicate shared state and potential races.
- Dependency tracing: For loops, trace whether iteration i depends on i−1 or earlier. If not, the loop is a candidate for parallel execution.
- Granularity estimation: Count operations and memory accesses per iteration. If iterations are tiny, batch them or parallelize at a higher level.
- Memory behavior inspection: Use hardware counters to see cache miss rates and bandwidth usage. High misses suggest locality issues that can limit parallel gains.
- Race detection: Run with sanitizers or instrumentation to find unsynchronized shared writes. Fixing races often clarifies where parallelism is safe.
- Task graph drawing: Map operations to nodes and data to edges. Nodes without dependencies can be executed concurrently.
Once you have identified opportunities, the next step is choosing the right pattern. If iterations are independent and data is dense, use data parallelism with map or reduction. If work splits into stages, use pipeline parallelism. If tasks are irregular or control-driven, use task parallelism with work-stealing. If the algorithm is local and spatial, use domain decomposition. The pattern selection is a direct consequence of the analysis; it should not be an arbitrary choice. Matching pattern to workload keeps the code simple and the performance high.
A few common mirages mislead parallelism discovery. Adding threads to a program that is mostly memory-bound rarely helps; you need to reduce memory traffic or increase locality first. Parallelizing a small loop inside a large sequential loop is often a waste due to overhead. Using locks to synchronize frequent updates usually serializes the program at a fine granularity, defeating parallelism. Discovering parallelism means knowing when not to parallelize as much as when to do it. Sometimes the best optimization is restructuring the algorithm to remove unnecessary shared state, after which parallelization becomes trivial.
Parallelism discovery also includes considering the target device. If the algorithm is a good fit for a GPU, you might prioritize data parallel patterns and focus on memory coalescing and occupancy. If it is irregular and control-heavy, a CPU task-based approach might be better. In some cases, you can split the workload: CPU cores handle orchestration and irregular tasks, while the GPU accelerates the heavy, regular kernel. This hybrid approach arises from analyzing which parts of the workload are bottlenecks and which match the strengths of each device.
Finally, do not neglect testing. Parallel code is harder to debug, so once you have discovered parallelism and chosen patterns, test with diverse inputs and randomized data to expose races. Use assertions and invariant checks to catch violations of the parallel assumptions you made during discovery. For example, if you believe two functions do not touch the same data, assert that their memory ranges are disjoint. Testing validates your discovery process and prevents regressions. Parallelism discovery is not a one-time event; as the code evolves, revisit the analysis and adjust the patterns accordingly.
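For instance, a disjointness assertion might look like the following sketch. Comparing raw pointers across unrelated allocations is technically implementation-defined in C++, but this is a common and practical debug-build check.

```cpp
#include <cassert>
#include <cstddef>

// Invariant check for the "these buffers never overlap" assumption that a
// parallel split relies on.
inline bool ranges_disjoint(const void* a, std::size_t a_len,
                            const void* b, std::size_t b_len) {
    auto* pa = static_cast<const char*>(a);
    auto* pb = static_cast<const char*>(b);
    return pa + a_len <= pb || pb + b_len <= pa;
}

// Usage sketch:
//   assert(ranges_disjoint(in.data(),  in.size()  * sizeof(in[0]),
//                          out.data(), out.size() * sizeof(out[0])));
```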