Adversarial ML for Defenders

Introduction
Chapter 1 The Defender’s Mindset: Threat Modeling for ML
Chapter 2 Mapping the ML Attack Surface
Chapter 3 Adversarial Examples: Concepts and Intuition
Chapter 4 Measuring Robustness: Metrics, Benchmarks, and Evaluation
Chapter 5 Evasion Defenses I: Input Sanitization and Preprocessing
Chapter 6 Evasion Defenses II: Adversarial Training and Beyond
Chapter 7 Certified Robustness: Provable Guarantees and Trade-offs
Chapter 8 Data Poisoning: Vectors, Objectives, and Impact
Chapter 9 Anti-Poisoning Strategies: Curation, Provenance, and Filtering
Chapter 10 Model Extraction and Stealing: Risks and Mitigations
Chapter 11 Privacy Attacks: Model Inversion and Membership Inference
Chapter 12 Privacy-Preserving Learning: Differential Privacy and More
Chapter 13 Secure Inference: API Hardening and Access Control
Chapter 14 Monitoring for Adversaries: Telemetry, Drift, and Alerting
Chapter 15 Incident Response for ML: Triage, Containment, and Recovery
Chapter 16 Red Teaming ML Systems: Continuous Assurance for Defenders
Chapter 17 Vision Models Under Attack: Robustness in Imaging Pipelines
Chapter 18 Language Models and NLP: Robustness and Safety at Scale
Chapter 19 Recommenders and Time Series: Domain-Specific Threats
Chapter 20 Edge and On-Device ML: Hardware and Deployment Security
Chapter 21 ML Supply Chain Security: Data, Models, and Dependencies
Chapter 22 Governance, Risk, and Compliance for ML Security
Chapter 23 Resilient MLOps: Designing Secure Pipelines and Tooling
Chapter 24 Hands-On Labs: Evaluations, Stress Tests, and Benchmarks
Chapter 25 Checklists and Playbooks: Hardening the ML Lifecycle

Introduction

Machine learning systems are now woven into critical products and decisions—from fraud detection and medical imaging to content moderation and industrial control. As adoption accelerates, so does adversarial interest. Attackers exploit the unique properties of data-driven models, seeking to degrade accuracy, exfiltrate sensitive information, or manipulate outcomes at scale. Adversarial ML for Defenders is written to meet this moment: a practical, comprehensive guide to understanding attacks on learning systems and, more importantly, to building and operating resilient defenses.

This book begins with the defender’s mindset. Unlike traditional software, ML systems are probabilistic, data-dependent, and tightly coupled to their environments. Their attack surface spans data collection, labeling, training, deployment, and maintenance. We frame this landscape with concrete threat models and risk scenarios so that teams can prioritize controls where they matter most. Along the way, we connect adversarial ML with established security disciplines—secure coding, privacy engineering, monitoring, incident response, and governance—bridging the gap between research insights and real-world operations.

We then survey core attack families at a level appropriate for defenders. Evasion attacks manipulate inputs at inference time; poisoning attacks tamper with training data or processes; privacy attacks such as model inversion and membership inference target the confidentiality of examples and learned parameters; and model extraction targets intellectual property and control. Rather than offering “recipe books” for offense, our emphasis is on recognizing early indicators, understanding prerequisites and constraints, and translating that knowledge into layered mitigations that raise costs for adversaries.

Defensive strategy is not a single technique but a system of controls across the lifecycle. We develop this system in depth: robust training and augmentation to increase tolerance to distribution shifts; input sanitization and anomaly detection to reduce attack surface at the boundary; certified robustness methods to obtain formal guarantees where feasible; privacy-preserving learning to bound information leakage; access control, rate limiting, and abuse-resistant APIs to constrain interaction; and continuous monitoring to catch drift, outliers, and coordinated probing. We highlight the trade-offs each control entails—accuracy, latency, interpretability, and operational complexity—so that teams can make informed, context-specific choices.

Adversarial ML is not static; attackers adapt as defenses are deployed. To keep pace, defenders need feedback loops that combine red teaming, stress testing, and post-incident learning. This book provides hands-on labs, reproducible evaluations, and realistic datasets to help practitioners internalize concepts and pressure-test systems. We also include checklists and playbooks that translate principles into day-to-day practice—what to do before a model ships, what to monitor in production, and how to respond if signals point to adversarial activity.

Because security is a team sport, we address cross-functional coordination. Data scientists, ML engineers, SREs, product managers, security analysts, and legal and compliance partners each hold part of the solution. We offer patterns for collaboration: integrating security reviews into model development, establishing clear ownership for telemetry, aligning controls with regulatory obligations, and fostering a culture that treats robustness and privacy as product qualities, not afterthoughts.

Finally, we recognize that perfect robustness is neither attainable nor necessary. The goal is resilient performance under uncertainty, with documented risks, measurable defenses, and fast recovery when incidents occur. By the end of this book, you will be equipped to map your ML attack surface, select and implement defenses with an eye toward their costs and benefits, instrument your systems for visibility, and sustain an adaptive posture over time. Adversarial ML for Defenders aims to be your practical companion on that journey—from first principles to operational excellence.

CHAPTER ONE: The Defender’s Mindset: Threat Modeling for ML

The most secure fortress is not the one with the thickest walls, but the one whose architect understood exactly what needed to be protected, from whom, and why. Before a single line of defensive code is written or a single robustness technique is deployed, a defender must cultivate this architectural clarity. In the context of machine learning, this means adopting a specific mindset: threat modeling. It is the disciplined practice of identifying what can go wrong, who might make it go wrong, and how to stop them or mitigate the damage. For defenders, this process is the foundational layer upon which all subsequent, tactical work is built.

Traditional software threat modeling, while invaluable, falls short when applied naively to ML systems. A standard web application has well-defined inputs, deterministic logic, and a clear separation between data and code. An ML model, however, is a probabilistic entity whose behavior is entirely shaped by its training data and architecture. Its logic is opaque, its decision boundaries are complex, and its "code" is partly encoded in the statistical patterns of terabytes of data. The trust boundaries blur; the data pipeline becomes as critical as the inference server, and the model itself is both an asset and a potential liability. Ignoring these distinctions is the first mistake a defender can make.

Effective threat modeling for ML begins with the same core question as all security: "What are we protecting?" The answers, however, shift in emphasis. The primary assets are not just the model's predictions or its API endpoint. They include the proprietary training data, which may contain sensitive personal information or competitive business intelligence. They include the model's parameters—the intellectual property representing millions of dollars of compute and research. They include the integrity of the model's outputs in the real world, which could influence financial transactions, medical diagnoses, or physical safety. A defender must inventory these assets with the same seriousness a bank would inventory its cash, gold, and vault schematics.

With assets defined, the next step is to consider the adversaries. Who would want to attack this system, and what would they hope to gain? The motivations are as varied as in any other domain. A competitor might seek to steal a proprietary model to shortcut their own R&D. A fraudster could attempt to evade a detection system to drain accounts. A malicious actor might aim to degrade the performance of a content moderator to spread harmful propaganda. A data broker could probe a model to reconstruct private training data. Each adversary type has different capabilities, resources, and persistence levels, and a defender's strategy must account for this spectrum, from the opportunistic script-kiddie to the well-funded, state-sponsored team.

Understanding adversary goals naturally leads to mapping the attack surface—the sum of all points where an unauthorized user can try to enter or extract data from the system. For ML, this surface is remarkably broad. It includes the data collection and labeling pipelines, where poisoning can be introduced. It encompasses the training environment, where compute resources or experiment code could be compromised. The model registry and deployment pipeline are high-value targets for theft or tampering. The live inference API is the classic point of contact for evasion attacks. Finally, the monitoring and feedback loops themselves can be targeted to blind the defenders. A comprehensive model must account for all these vectors.

A useful framework for structuring this analysis is an adaptation of the classic STRIDE model—Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege—applied to ML components. For instance, can an attacker spoof legitimate input data to fool a classifier (Spoofing)? Can they tamper with the training dataset via a compromised vendor (Tampering)? Could a model's confidence scores leak information about individual training members (Information Disclosure)? Could a flood of complex adversarial examples overwhelm the inference hardware, causing a denial of service? This translation of familiar security concepts onto ML terrain helps ground the threat model in established practice.

The heart of ML threat modeling often lies in analyzing preconditions and trust boundaries. Every attack has prerequisites. For a data poisoning attack to work, the adversary must have a mechanism to influence the training data distribution. For a model inversion attack to leak private attributes, the model must output confidence scores, not just final labels. Defenders can systematically list these preconditions and then design controls to break them. If a precondition is "attacker can submit arbitrary data to the training set," then a control might be "implement strict data provenance and human review for all new training data sources." This turns abstract threats into concrete engineering requirements.

Trust boundaries in an ML system are the lines between zones of different privilege or reliability. A critical boundary exists between the raw, unvetted data source and the curated training dataset. Another exists between the public-facing API and the internal model serving infrastructure. A subtle but vital boundary exists between the model's learned parameters and the code that executes them—where serialization formats like Pickle or ONNX can become vectors for code injection if not carefully handled. Diagramming these boundaries forces defenders to ask: "What data or signals cross this line, and how can I validate them?"

To make this concrete, consider a hypothetical medical imaging model that assists radiologists by flagging potential anomalies in X-rays. The assets include patient data (highly private), the model's diagnostic accuracy (critical for safety), and the model file itself (valuable IP). Adversaries could range from a curious researcher trying to fool the model with stickers on an X-ray (an evasion attack) to a malicious insider poisoning the training data with mislabeled images to create a systematic bias. The attack surface spans the hospital's PACS system where images are stored, the annotation platform where doctors label data, the training cluster, and the final deployed tool in the radiologist's workstation.

A formal threat model for this system would begin by diagramming the data flow: from the X-ray machine, to storage, to annotation, to training, to deployment. At each stage, the defender would brainstorm threats. During data collection, is there integrity checking to prevent image manipulation? During labeling, can a compromised annotator's account introduce false labels? During training, is the environment isolated from the internet to prevent data exfiltration? During inference, does the system sanitize input images to remove potential adversarial perturbations? This stage-by-stage examination prevents the common pitfall of focusing only on the final, deployed model.

The output of a threat modeling exercise is not a document that sits on a shelf. It is a prioritized list of risks and corresponding mitigations. It should directly inform the product backlog and security roadmap. A risk might be: "High likelihood and high impact: Adversary uses social engineering to inject biased data into our training pipeline via a compromised vendor." The corresponding mitigations could be: "1. Implement multi-factor authentication and strict access controls on the data upload portal. 2. Introduce an automated statistical baseline check that flags data batches with anomalous label distributions. 3. Establish a mandatory secondary review process for all data sourced from new vendors." Each mitigation becomes a task for the appropriate team.

It is crucial to recognize that threat modeling is not a one-time activity performed at the start of a project. It is a living process. The model will be updated, new data sources will be added, and the deployment environment will evolve. Adversaries will develop new techniques, and the business context will change. A defender must plan for regular reviews of the threat model—perhaps quarterly or in conjunction with major model releases—to ensure it remains relevant. This iterative approach is what transforms a static defense into an adaptive one.

One of the most valuable outcomes of this mindset shift is the realization that not all defenses are technical. Some of the most effective mitigations are procedural or organizational. For example, separating duties so that the same person who labels data cannot also approve it for training introduces a human check against poisoning. Implementing a clear incident response plan for a suspected model compromise ensures that a breach doesn't lead to chaotic, ad-hoc decisions. Establishing a secure model registry with access logs creates accountability and a forensic trail. These "soft" controls are often the bedrock of a resilient system.

The defender must also grapple with the fundamental trade-offs inherent in any security measure. Adding input sanitization might block some adversarial examples but could also filter out legitimate, rare edge cases, impacting fairness. Training with adversarial examples (a technique we will explore later) can improve robustness but often comes at the cost of reduced accuracy on clean data. Differential privacy can protect training data privacy but can significantly increase the compute required for training and degrade model utility. A threat model helps make these trade-offs explicit and guides informed decision-making based on the specific risks to the specific system.

Ultimately, adopting the defender's mindset means thinking like an adversary, but acting like a guardian. It involves asking uncomfortable questions early: What is the worst thing someone could do with access to our API? How would we know if our training data was poisoned six months ago? Could our model's predictions be manipulated to cause physical harm? By systematically working through these questions in a structured threat modeling exercise, a defender moves from a reactive, patch-and-pray posture to a proactive, risk-based strategy. This foundation of clarity and prioritization is what enables the effective deployment of the specific defensive techniques that fill the pages ahead.

This process of systematic analysis reveals that the ML lifecycle is not a linear pipeline but a complex, interconnected ecosystem with security implications at every node. Securing the model is only as effective as securing the data it learns from and the infrastructure that serves it. The threat model must encompass this entire ecosystem, recognizing that a weakness in any single component can compromise the whole. This holistic view is the essence of the defender's mindset—a recognition that security is an emergent property of the entire system, not a feature bolted onto the final product. It sets the stage for the detailed exploration of that system's attack surface, which is the logical next step in building a resilient defense.

This is a sample preview. The complete book contains 27 sections.

Table of Contents

Adversarial ML for Defenders

Table of Contents

Introduction

CHAPTER ONE: The Defender’s Mindset: Threat Modeling for ML