Reinforcement Learning in the Real World

Introduction
Chapter 1 Why Real-World RL Is Hard: From Idealized MDPs to Messy Physics
Chapter 2 Problem Formulation and Success Metrics in the Field
Chapter 3 Building High-Fidelity Simulators and Digital Twins
Chapter 4 Modeling Sensors, Latency, and Partial Observability
Chapter 5 Domain Randomization: Principles and Practice
Chapter 6 System Identification and Dynamics Calibration
Chapter 7 Bridging the Gap: Domain Adaptation and Representation Learning
Chapter 8 Reward Design, Constraints, and Safe Objectives
Chapter 9 Safe Exploration: Shields, Supervisors, and Risk-Sensitive RL
Chapter 10 Sample Efficiency: Off-Policy and Offline RL Foundations
Chapter 11 Model-Based RL for Rapid Real-World Adaptation
Chapter 12 Policy Optimization and Regularization for Robustness
Chapter 13 Control Stacks: Combining RL with Classical Control
Chapter 14 Sim-to-Real Transfer Workflows and Tooling
Chapter 15 Hardware in the Loop: Testing, Reset-Free Learning, and Autonomy
Chapter 16 Monitoring, Anomaly Detection, and Fallback Strategies
Chapter 17 Data Engineering: Logs, Datasets, and Reproducibility
Chapter 18 Multi-Robot Systems and Fleet Learning
Chapter 19 Manipulation Case Study: Robust Grasping and Placement
Chapter 20 Locomotion Case Study: Terrain-Adaptive Gaits
Chapter 21 Warehouse Automation Case Study: Navigation and Picking
Chapter 22 Human-in-the-Loop and Interactive Learning
Chapter 23 Ethical, Legal, and Safety Standards for Physical RL
Chapter 24 Performance Evaluation, Benchmarking, and Reporting
Chapter 25 Deployment, Maintenance, and Continual Learning

Introduction

Reinforcement learning has matured from a theoretical curiosity into a practical toolkit for decision-making under uncertainty. Yet the leap from simulation to physical robots remains one of the field’s most persistent challenges. Policies that excel in a tidy, controllable simulator can falter when faced with sensor noise, contact-rich dynamics, unmodeled delays, wear-and-tear, or the sheer unpredictability of real environments. This book is about making that leap—bridging simulation-trained policies to manipulators and mobile robots safely, reliably, and with a deliberate focus on sample efficiency.

The central tension we address is the sim-to-real gap: the mismatch between the world an agent experiences during training and the one it must master during deployment. We will dissect this gap from both sides. On the simulation end, we explore how to build and validate digital twins, randomize the right parameters, and model sensing, latency, and partial observability with enough fidelity to matter. On the real-world end, we emphasize system identification, calibration, and data-driven adaptation that can tune policies without compromising safety or productivity.

Safety is non-negotiable when algorithms meet actuators. Traditional RL often rewards exploration; in the real world, naïve exploration can break hardware, cause downtime, or create hazards. We therefore foreground safety throughout: from constraint-aware objective functions and risk-sensitive learning to runtime shields, supervisors, and escalation paths that keep operations within verified envelopes. The aim is to enable learning that is both bold and bounded—able to discover better behaviors while respecting the physical, legal, and ethical constraints of modern robotics.

Because collecting real-world experience is expensive, slow, and sometimes risky, we devote equal attention to sample efficiency. You will encounter off-policy and offline RL methods that make maximal use of logs; model-based approaches that turn dynamics knowledge into rapid improvement; and regularization strategies that trade a bit of asymptotic performance for a lot of robustness. We also show how to weave RL into hybrid control stacks, leveraging the stability and interpretability of classical control while reserving RL’s flexibility for the hard parts—contact, friction, and long-horizon decision-making.

This is a hands-on book grounded in case studies. We follow end-to-end projects in grasping and placement for manipulation, terrain-adaptive locomotion for legged robots, and navigation and picking for warehouse automation. Each case draws a complete arc: defining the problem and success metrics, constructing or validating the simulator, designing the reward and constraints, selecting algorithms, creating a transfer plan, testing with hardware in the loop, monitoring online performance, and establishing fallback behaviors. The lessons are practical, often won the hard way, and immediately reusable.

Equally important are the workflows and infrastructure that make teams effective. We discuss data engineering for robotics—how to log, curate, and version datasets and policies; how to set up reproducible experiments; how to monitor deployed agents and detect distribution shifts; and how to iterate safely with feature flags, canaries, and staged rollouts. Along the way we highlight tooling options, checklists, and failure modes that will help you avoid common pitfalls.

This book is written for practitioners—roboticists bringing RL into products, ML engineers stepping into embodied intelligence, and researchers seeking to close the theory-practice loop. A solid grasp of basic RL, probabilistic modeling, and control will help, but we begin each topic with motivation and concrete examples. By the end, you should have not just an understanding of domain randomization, sim-to-real transfer, safety constraints, and sample-efficient algorithms, but also a blueprint for deploying them on real manipulators and mobile robots with confidence.

Our promise is modest yet ambitious: to make real-world reinforcement learning less of an art and more of an engineering discipline. If we succeed, the next time your policy leaves the simulator, it will step into the world with fewer surprises, stronger safeguards, and a clearer path to sustained improvement.

CHAPTER ONE: Why Real-World RL Is Hard: From Idealized MDPs to Messy Physics

The promise of reinforcement learning is intoxicating. An algorithm, given only a goal and the rules of an environment, discovers novel, often superhuman strategies for achieving that goal. We’ve seen it master games of perfect and imperfect information, from chess and Go to complex real-time strategy titles. In these digital arenas, the rules are fixed, the state is fully observable, and the physics, however complex, are perfectly simulated. The agent’s experience is clean, deterministic, and cheap to generate by the billion. It is the perfect laboratory for the RL theorist. The real world, by contrast, is a laboratory that has been left out in the rain, filled with noisy instruments, and is actively trying to spill coffee on your notes. This chapter is about understanding the nature of that spill, the fundamental reasons why taking a policy from the pristine simulator to the physical world is so fraught with difficulty.

At the heart of most introductory RL is the Markov Decision Process, or MDP. This elegant mathematical framework consists of a set of states, a set of actions, a transition model describing the probabilistic outcome of taking an action in a state, and a reward function. In simulation, we often have the luxury of designing this MDP to be as friendly as possible. The state can include perfect knowledge of every object’s position, velocity, and orientation—the simulator’s internal variables are directly exposed as observations. The transition model, while it may be stochastic, is the simulator’s own physics engine, meaning the agent interacts with a world whose fundamental rules are consistent and known. The reward function is a precise mathematical scorecard we design. This is the idealized MDP: a closed, well-defined playground.

A physical robot, however, does not experience an idealized MDP. It experiences a Partially Observable Markov Decision Process, or POMDP, and a particularly nasty one at that. The robot’s sensors do not provide a perfect, noise-free state vector. A camera provides a stream of pixel data, distorted by lens imperfections, variable lighting, and motion blur. A lidar gives point clouds corrupted by reflective surfaces and dust. Joint encoders report positions but not the subtle strains in the gearboxes or the stick-slip friction in the bearings. The true state of the robot and its environment—the exact mass distribution of a held object, the coefficient of friction on a patch of floor, the tension in a cable—is fundamentally hidden. The agent must make decisions based on a noisy, incomplete, and often ambiguous glimpse of reality. The first great chasm of sim-to-real is this sensory gap: we train on clean state vectors, but we deploy on messy, pixelated, probabilistic data.

Even if we could magically provide perfect state information, the transition model itself is a source of profound mismatch. The physics engines underpinning our simulators, like MuJoCo, PyBullet, or Isaac Sim, are marvels of engineering, but they are approximations. They use simplified contact models—often a blend of penalty forces and convex approximations—that can behave very differently from the rich, chaotic, and continuous nature of real-world contact. Simulate a peg-in-hole task, and the forces are computed smoothly. In reality, the peg might catch on a microscopic burr, deform slightly, or bind due to misalignment. Friction is notoriously difficult to model accurately; it depends on material pairs, surface roughness, temperature, and even humidity in ways that are expensive to simulate with high fidelity. The simulator’s world is smooth and computationally tractable; the real world is non-smooth, discontinuous, and full of hard constraints that defy easy differential equations.

This leads to the problem of unmodeled dynamics. A real robot arm has flex in its links, compliance in its joints, and complex harmonic resonances when it moves quickly. A mobile robot’s wheels may slip on a dusty surface, or its chassis might oscillate in a way not captured by a rigid-body model. These are not just minor inaccuracies; they are entire physical phenomena that may be absent from the simulation entirely. When the policy, trained in a world without these dynamics, issues a command expecting a certain response, the real robot behaves differently. The error compounds over time, leading to drift, instability, or outright failure. The simulator’s physics, however good, are an incomplete map of the territory, and the agent trained on that map will eventually find itself off the edge.

Another critical dimension of this gap is latency. In a simulation, the time between an agent’s decision (an action) and the resulting state update (the next observation) is a computational constant, often negligible. In the physical world, it is a cascade of delays: the time for the command to travel over a network to the robot’s controller, the processing time in the controller, the electromechanical delay in the motor driver, the inertia of the physical system as it begins to move, and the time for the new sensor readings to be captured, processed, and sent back. This end-to-end latency, which can be tens or even hundreds of milliseconds, turns the problem into one of control under delayed feedback. A policy that assumes instantaneous effect will overshoot, oscillate, or become unstable when its actions have a delayed consequence, much like trying to steer a car where the wheel’s effect on the tires is felt a second later.

The concept of “state” also breaks down in the real world. In a simulator, you can pause time, reset the world to a perfect initial condition, and run the same trial a thousand times with the exact same starting parameters. This is invaluable for learning. Physical reality offers no such reset button. Every trial is unique. The robot starts from a slightly different position, the objects have shifted minutely, the lighting has changed, and the robot itself may be warmer or its batteries slightly depleted. The environment is non-stationary; it changes not just due to the agent’s actions, but due to external factors beyond anyone’s control. Learning in such a setting is not about converging on a single optimal policy for a fixed MDP, but about developing a robust strategy that works across a constantly, subtly shifting distribution of conditions.

This brings us to the core tension of real-world reinforcement learning: exploration versus safety. In simulation, exploration is cheap and consequence-free. An agent can try a million wildly different gaits, most of which result in the simulated robot falling flat on its face. This is fine. The simulator resets in a microsecond. On a physical quadruped, a fall can mean a cracked gearbox, a damaged sensor, or hours of downtime for repairs. The cost of failure is not a negative reward; it is real financial cost, lost productivity, and potential safety hazards. Therefore, the exploration strategy cannot be naïve. It must be inherently cautious, constrained, and guided by prior knowledge. The agent must learn to walk without actually falling, which seems like a paradox until we introduce the sophisticated safety mechanisms that later chapters will detail.

Finally, there is the brutal economics of real-world data. In simulation, you can generate a billion frames of experience in a day on a cluster of GPUs. On a physical robot, one second of experience takes one second of real time. Collecting a million timesteps might take weeks of continuous, supervised operation, assuming nothing breaks. This makes sample efficiency—the ability to learn a good policy from a limited amount of experience—not just an academic metric, but an absolute practical necessity. Algorithms that are data-hungry, like many foundational deep RL approaches, are simply impractical for direct real-world training. The field has consequently turned towards methods that can learn from offline datasets, leverage models of the world, or efficiently transfer knowledge from simulation to minimize the amount of risky, expensive real-world interaction required.

The path from the idealized MDP of the simulator to the messy POMDP of the physical world is therefore not a simple matter of better algorithms. It is a systems engineering challenge that spans modeling, perception, control, and safety. It requires us to acknowledge and account for the gaps in our simulators, the noise in our sensors, the delays in our systems, and the irreducible cost of failure. The following chapters will build the toolkit to bridge these gaps: by building more faithful digital twins, by randomizing simulations to prepare for reality’s variability, by designing algorithms that learn efficiently and safely, and by wrapping our learning agents in layers of classical control and safety monitoring. Understanding the true nature of the problem—its physics, its uncertainties, and its constraints—is the first, essential step toward solving it.

This is a sample preview. The complete book contains 27 sections.

Table of Contents

Reinforcement Learning in the Real World

Table of Contents

Introduction

CHAPTER ONE: Why Real-World RL Is Hard: From Idealized MDPs to Messy Physics