Adversarial ML: When Pixels Become Weapons

In December 2014, Ian Goodfellow and colleagues at Google posted a paper (published at ICLR 2015) that shook the machine learning community. They demonstrated that a state-of-the-art image classifier, which correctly identified a panda, could be made to misclassify the same image as a gibbon with 99.3% confidence by adding a small, carefully computed perturbation. The change was so small that the two images looked identical to a human observer.

This result helped launch the modern adversarial ML field. A decade later, adversarial attacks have been demonstrated well beyond digital images: stickers on stop signs that fool traffic sign classifiers, eyeglass frames that keep facial recognition systems from identifying their wearers, and audio recordings that sound like music to humans but carry hidden voice commands for smart speakers.

Scope: This post focuses on evasion attacks, meaning adversarial examples crafted at inference time to fool a deployed model. For attacks at training time (data poisoning, backdoors), see the companion post on Data Poisoning & Model Inversion.

Attack Taxonomy

Adversarial ML attacks are categorized along several axes:

Timing: Training-time (poisoning) vs. inference-time (evasion)
Knowledge: White-box (full model access) vs. black-box (query-only) vs. grey-box
Goal: Untargeted (cause any misclassification) vs. targeted (cause a specific misclassification)
Detectability: Imperceptible (constrained by ε-norm) vs. visible (adversarial patches)

FGSM — Fast Gradient Sign Method

The canonical adversarial attack. FGSM computes the gradient of the loss function with respect to the input (not the model weights), then takes a single step of size ε in the direction that increases the loss. When that step is large enough to push the input across the model's decision boundary, the result is an adversarial example.

The Math

Given an input x, true label y, model parameters θ, and loss function J:

x_adv = x + ε · sign( ∇_x J(θ, x, y) )

Where:
  x      = original input (e.g., image pixels normalized to [0,1])
  ε      = perturbation magnitude (e.g., 8/255 — nearly invisible)
  ∇_x J  = gradient of loss w.r.t. input pixels
  sign() = element-wise sign function (+1, -1, or 0 where the gradient is zero)
  x_adv  = adversarial example

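The key insight: ε · sign(∇_x J) is the worst-case perturbation within an ℓ∞ ball of radius ε for a linear approximation of the loss. With a single gradient computation, FGSM often finds a point that the model misclassifies while the change stays visually imperceptible to humans. Success is not guaranteed, which is what motivates the iterative attacks covered later.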
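
The worst-case claim can be made precise with a first-order expansion of the loss. The derivation below is a sketch in the symbols defined above, with g introduced here as shorthand for the input gradient and δ for the perturbation:

```latex
\begin{aligned}
J(\theta, x + \delta, y) &\approx J(\theta, x, y) + g^\top \delta,
  \qquad g = \nabla_x J(\theta, x, y) \\
\max_{\|\delta\|_\infty \le \varepsilon} g^\top \delta
  &= \varepsilon \sum_i |g_i| = \varepsilon \|g\|_1,
  \qquad \text{attained at } \delta^\star = \varepsilon \cdot \operatorname{sign}(g) \\
x_{\text{adv}} &= x + \delta^\star
  = x + \varepsilon \cdot \operatorname{sign}\big(\nabla_x J(\theta, x, y)\big)
\end{aligned}
```

The maximizer is exact only for the linearized loss; for the true loss it is a one-step approximation. For a targeted attack toward a chosen class y_target, the same step is taken downhill on the target-class loss: x_adv = x − ε · sign(∇_x J(θ, x, y_target)).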

FGSM in PyTorch

import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=8/255):
    """
    Generate an FGSM adversarial example.
    image: tensor of shape (N, C, H, W) with pixel values in [0, 1]
           (apply any mean/std normalization inside the model)
    label: ground-truth class indices, tensor of shape (N,)
    epsilon: perturbation budget (default 8/255 in [0,1] space)
    """
    # Work on a detached copy so the caller's tensor is left untouched
    image = image.clone().detach().requires_grad_(True)

    # Forward pass: compute loss on the original image
    output = model(image)
    loss = F.cross_entropy(output, label)

    # Gradient w.r.t. the input only (no gradients are stored on the weights)
    grad, = torch.autograd.grad(loss, image)

    # Gradient sign: direction of maximum (linearized) loss increase
    gradient_sign = grad.sign()

    # Add perturbation and clip to valid pixel range [0, 1]
    adv_image = image + epsilon * gradient_sign
    adv_image = torch.clamp(adv_image, 0, 1)

    return adv_image.detach()

# Usage:
# model.eval()  # inference-mode BatchNorm/Dropout while attacking
# adv = fgsm_attack(model, image, label, epsilon=8/255)
# pred = model(adv).argmax(dim=1)  # often wrong despite near-identical appearance

Why It Works

Goodfellow et al. attribute adversarial examples to networks behaving too linearly in high-dimensional input spaces. For a linear score w·x, the perturbation ε · sign(w) shifts the score by ε‖w‖₁, which grows with the number of input dimensions. A change that is invisible per pixel can therefore add up across thousands of pixel dimensions and push the input across a decision boundary. High dimensionality is turned into a weapon against the model.
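
A small numeric sketch makes the accumulation effect concrete. It uses a purely linear score and randomly drawn weights (both illustrative assumptions, not a trained network) and plain Python rather than PyTorch so it runs without dependencies:

```python
import random

def sign(v):
    """Return +1, -1, or 0 depending on the sign of v."""
    return (v > 0) - (v < 0)

def score_shift(weights, epsilon):
    """Shift in a linear score w.x caused by the perturbation epsilon * sign(w)."""
    delta = [epsilon * sign(w) for w in weights]
    # Each term w_i * epsilon * sign(w_i) equals epsilon * |w_i|,
    # so the total is epsilon times the L1 norm of the weights.
    return sum(w * d for w, d in zip(weights, delta))

random.seed(0)
epsilon = 8 / 255
for n in (10, 1_000, 100_000):
    # Hypothetical weights drawn from a standard normal, for illustration only
    w = [random.gauss(0.0, 1.0) for _ in range(n)]
    print(f"n={n:>6}  per-pixel change={epsilon:.4f}  score shift={score_shift(w, epsilon):.1f}")
```

The per-pixel change is the same in every row, while the shift in the score grows roughly in proportion to the number of dimensions.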

PGD — Projected Gradient Descent

FGSM takes one step. PGD (Madry et al., 2018) takes many — it is iterative FGSM with a projection step that keeps the adversarial example within the ε-constraint ball around the original input. PGD is widely considered the gold standard attack baseline for evaluating model robustness.
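
In the same notation as the FGSM formula, one PGD iteration can be written as follows. This is a sketch of the ℓ∞ variant; the step size α, iteration index t, dimension d, and projection operator Π are symbols introduced here:

```latex
\begin{aligned}
x^{(0)} &= x + u, \qquad u \sim \mathcal{U}(-\varepsilon, \varepsilon)^d \\
x^{(t+1)} &= \Pi_{\mathcal{B}_\infty(x, \varepsilon)}
  \Big( x^{(t)} + \alpha \cdot \operatorname{sign}\big( \nabla_x J(\theta, x^{(t)}, y) \big) \Big) \\
\Pi_{\mathcal{B}_\infty(x, \varepsilon)}(z) &= \operatorname{clip}\big(z,\; x - \varepsilon,\; x + \varepsilon\big)
\end{aligned}
```

For the ℓ∞ ball the projection is just element-wise clipping; a final clip to [0, 1] keeps the pixel values valid.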

def pgd_attack(model, image, label, epsilon=8/255, alpha=2/255, num_steps=40):
    """
    PGD adversarial example generation (L-infinity constraint).
    image: tensor with pixel values in [0, 1]
    alpha: step size per iteration (smaller than epsilon)
    num_steps: number of gradient steps (typically 20-100)
    """
    # Detach so the attack never backpropagates into the caller's graph
    image = image.clone().detach()

    # Start from a random point within the epsilon ball
    adv = image + torch.empty_like(image).uniform_(-epsilon, epsilon)
    adv = torch.clamp(adv, 0, 1)

    for _ in range(num_steps):
        adv.requires_grad_(True)
        output = model(adv)
        loss = F.cross_entropy(output, label)

        # Gradient w.r.t. the input only (no gradients are stored on the weights)
        grad, = torch.autograd.grad(loss, adv)

        # Take a small step in gradient sign direction
        adv = adv.detach() + alpha * grad.sign()

        # Project back into epsilon-ball around original image
        perturbation = torch.clamp(adv - image, -epsilon, epsilon)
        adv = torch.clamp(image + perturbation, 0, 1)

    return adv

PGD is a considerably stronger attack than FGSM. On a standard ResNet-50 trained on ImageNet without adversarial training, PGD with ε=8/255 achieves a near-100% untargeted attack success rate. Adversarially trained models hold up far better, but PGD still succeeds on a substantial fraction of inputs (roughly 40–60%, depending on the dataset and training configuration).
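
Success-rate figures like these depend on how the metric is defined. The sketch below uses one common convention, counting only inputs the model classified correctly before the attack; other write-ups count over all inputs, so this choice is an assumption. The function names and the sample predictions are hypothetical.

```python
def attack_success_rate(labels, clean_preds, adv_preds):
    """
    Untargeted attack success rate: among inputs the model classified
    correctly before the attack, the fraction it misclassifies afterwards.
    """
    attackable = [
        (y, adv)
        for y, clean, adv in zip(labels, clean_preds, adv_preds)
        if clean == y
    ]
    if not attackable:
        return 0.0
    fooled = sum(1 for y, adv in attackable if adv != y)
    return fooled / len(attackable)

def robust_accuracy(labels, adv_preds):
    """Fraction of all inputs still classified correctly under attack."""
    correct = sum(1 for y, adv in zip(labels, adv_preds) if adv == y)
    return correct / len(labels)

# Hypothetical predictions for six inputs (illustrative values only)
labels      = [0, 1, 2, 1, 0, 2]
clean_preds = [0, 1, 2, 0, 0, 2]   # one clean mistake (index 3)
adv_preds   = [1, 1, 0, 0, 2, 2]   # predictions after the attack

print(attack_success_rate(labels, clean_preds, adv_preds))
print(robust_accuracy(labels, adv_preds))
```

Reporting both numbers avoids ambiguity: a model with low clean accuracy can show a misleadingly low success rate if the denominator includes inputs it already got wrong.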

Transferability: The Black-Box Threat

A remarkable and alarming property of adversarial examples: they transfer across models. An adversarial example crafted to fool Model A often also fools Model B, even if B has a completely different architecture, was trained on different data, and was never queried during attack construction.

This means an attacker doesn't need white-box access to the target model. They can:

Train or acquire a "substitute model" that approximates the target's behavior
Generate adversarial examples on the substitute model
Deploy them against the target — with substantial success rates
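
The three steps above can be sketched with two toy linear classifiers. Everything here is a hypothetical stand-in: the weights are hand-picked to mimic two models that learned similar features, and real transfer attacks use trained networks. The point is only that a perturbation computed from the substitute's gradient also moves the target's score.

```python
def sign(v):
    """Return +1, -1, or 0 depending on the sign of v."""
    return (v > 0) - (v < 0)

def linear_score(weights, bias, x):
    """Positive score means class 1, negative score means class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def fgsm_on_linear(weights, x, label, epsilon):
    """
    FGSM against a linear (logistic) model. For this model the sign of the
    loss gradient w.r.t. x is -sign(w) for label 1 and +sign(w) for label 0.
    """
    direction = -1 if label == 1 else 1
    return [xi + epsilon * direction * sign(w) for w, xi in zip(weights, x)]

# Hypothetical models: the attacker's substitute and the unseen target have
# similar but not identical weights (as if trained on similar data).
substitute_w, substitute_b = [1.0, -2.0, 0.5], 0.0
target_w, target_b = [0.8, -1.5, 0.9], 0.1

x, label = [0.6, 0.3, 0.4], 1
epsilon = 0.2

# Steps 1-2: craft the example using the substitute model only
x_adv = fgsm_on_linear(substitute_w, x, label, epsilon)

# Step 3: evaluate on both models; the target was never used in crafting
for name, w, b in [("substitute", substitute_w, substitute_b),
                   ("target", target_w, target_b)]:
    clean_class = int(linear_score(w, b, x) > 0)
    adv_class = int(linear_score(w, b, x_adv) > 0)
    print(f"{name}: clean prediction={clean_class}, adversarial prediction={adv_class}")
```

Transfer works here because the two weight vectors point in similar directions; the less the models agree, the less of the perturbation carries over.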

This black-box attack vector breaks the assumption that model security comes from model secrecy.

Physical-World Attacks

Adversarial attacks aren't confined to digital inputs. Researchers have demonstrated that adversarial perturbations can be printed, worn, or placed in the physical environment and still fool cameras and sensors.

Adversarial Stop Sign

Eykholt et al. (2018) demonstrated a physical adversarial attack on a traffic sign classifier trained on the LISA dataset. By placing specific printed stickers on a stop sign, they caused the classifier to misclassify it as a speed limit sign in 100% of their stationary test images, taken at varying distances and angles, and in most frames of a drive-by video test.

Real-World Impact: Autonomous vehicle perception systems that rely on neural networks for sign detection may be vulnerable to physical adversarial patches. In principle, a $0.50 sticker placed on a stop sign could cause a self-driving car to ignore it.

Adversarial Eyeglasses

Sharif et al. (2016) demonstrated that printed eyeglass frames containing adversarial patterns could impersonate other people or evade facial recognition entirely in the authors' physical experiments with a real camera. The attack was later refined to work on commercial face verification APIs including those from major tech companies.

Universal Adversarial Perturbations

Moosavi-Dezfooli et al. (2017) showed that a single image-agnostic perturbation exists that, when added to any image, causes misclassification with high probability. These universal perturbations can be pre-computed and applied to all inputs — useful for attacks where the attacker cannot customize the perturbation per-image.

Defenses

Defending against adversarial examples is an active and difficult research area. Most proposed defenses have been broken by adaptive attacks (where the attacker knows the defense). The current state of the art:

1. Adversarial Training (Most Effective)

Include adversarial examples in the training data. At each batch, generate PGD adversarial examples and train the model to classify them correctly. The model learns to be robust within the ε-ball.
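
Madry et al. frame this as a min-max (saddle-point) problem. In the notation used for FGSM, with 𝒟 standing for the training distribution and δ for the perturbation (both symbols introduced here), a sketch of the objective is:

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\|_\infty \le \varepsilon} J(\theta, x + \delta, y) \right]
```

The inner maximization is approximated by a few PGD steps per batch; the outer minimization is ordinary gradient descent on the resulting adversarial examples.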

def adversarial_training_step(model, optimizer, batch):
    images, labels = batch

    # Generate adversarial examples for this batch
    adv_images = pgd_attack(model, images, labels, epsilon=8/255, alpha=2/255, num_steps=7)

    # Train on adversarial examples (not clean)
    optimizer.zero_grad()
    outputs = model(adv_images)
    loss = F.cross_entropy(outputs, labels)
    loss.backward()
    optimizer.step()

    return loss.item()  # plain float, so the autograd graph is not kept alive

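Cost: adversarial training is 3–10× more expensive than standard training, because each PGD step adds a forward and backward pass per batch. Models trained this way typically give up some clean accuracy (often several percentage points, depending on the dataset and ε) in exchange for dramatically better robust accuracy.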
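
The cost range follows from counting passes through the network. The helper below is a back-of-the-envelope sketch that assumes every forward and backward pass costs about the same; the function name is hypothetical.

```python
def adversarial_training_cost_factor(pgd_steps):
    """
    Rough cost of PGD adversarial training relative to standard training,
    counted in forward+backward passes per batch: one pass per PGD step to
    craft the examples, plus one pass for the actual weight update.
    Assumes every pass costs about the same (a simplification).
    """
    standard_passes = 1
    adversarial_passes = pgd_steps + 1
    return adversarial_passes / standard_passes

for steps in (2, 7, 9):
    factor = adversarial_training_cost_factor(steps)
    print(f"PGD-{steps}: about {factor:.0f}x the passes of standard training")
```

With the 7-step setting used in the training step above, that works out to about 8 passes per batch instead of 1.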

2. Randomized Smoothing (Certified Defense)

Rather than relying on heuristics, randomized smoothing provides provable robustness guarantees. The idea: add Gaussian noise to the input, run many forward passes, and return the majority vote. With high probability, no adversarial perturbation with ℓ2 norm smaller than a certified radius r can change the smoothed prediction.

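import scipy.stats

def smoothed_predict(model, image, num_classes, sigma=0.25, n_samples=1000, alpha=0.001):
    """
    Certified prediction via randomized smoothing (Cohen et al., 2019).
    image: tensor of shape (C, H, W)
    Returns (class, certified L2 radius), or (None, 0.0) to abstain.
    sigma: noise level (higher = larger certified radius, lower accuracy)
    alpha: allowed failure probability of the certificate
    """
    noisy_images = image.unsqueeze(0).repeat(n_samples, 1, 1, 1)
    noisy_images += torch.randn_like(noisy_images) * sigma

    with torch.no_grad():
        votes = model(noisy_images).argmax(dim=1)

    counts = torch.bincount(votes, minlength=num_classes)
    top_class = counts.argmax().item()
    top_count = counts.max().item()

    # One-sided Clopper-Pearson lower bound on the top-class probability.
    # Simplification: Cohen et al. select the class on a separate, smaller sample.
    p_lower = scipy.stats.beta.ppf(alpha, top_count, n_samples - top_count + 1)
    if p_lower <= 0.5:
        return None, 0.0  # abstain: no radius can be certified

    # Certified L2 radius: sigma * Phi^{-1}(p_lower)
    radius = sigma * scipy.stats.norm.ppf(p_lower)

    return top_class, float(radius)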
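
To get a feel for the radius formula σ · Φ⁻¹(p), the sketch below evaluates it for a few values using only the Python standard library. The helper name is hypothetical, and the radius is measured in ℓ2 norm on the same [0, 1] pixel scale as the noise. Note that Cohen et al. plug in a lower confidence bound on the top-class probability rather than the raw vote fraction, which is why p must stay strictly below 1.

```python
from statistics import NormalDist

def certified_radius(p_top, sigma):
    """
    L2 radius sigma * Phi^{-1}(p_top) for a top-class probability p_top.
    Returns 0.0 when p_top <= 0.5, where nothing can be certified.
    """
    if p_top <= 0.5:
        return 0.0
    if p_top >= 1.0:
        raise ValueError("p_top must be below 1; use a confidence lower bound")
    return sigma * NormalDist().inv_cdf(p_top)

for sigma in (0.25, 0.5):
    for p_top in (0.6, 0.9, 0.99):
        radius = certified_radius(p_top, sigma)
        print(f"sigma={sigma:.2f}  p_top={p_top:.2f}  radius={radius:.3f}")
```

The radius grows with both the vote margin and σ, but a larger σ also makes the noisy inputs harder to classify, which is the accuracy trade-off mentioned in the docstring above.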

3. Feature Squeezing

Xu et al. (2017): apply "squeezing" filters (bit-depth reduction, spatial smoothing) to reduce the search space available to attackers. Compare the original and squeezed predictions — a large discrepancy suggests an adversarial input.
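
A minimal sketch of the detection idea, using bit-depth reduction as the only squeezer and an L1 distance between probability vectors as the discrepancy score. The function names, the threshold, and the toy model are all hypothetical; a real deployment would tune the threshold on validation data and use an actual classifier.

```python
import math

def reduce_bit_depth(pixels, bits):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [math.floor(p * levels + 0.5) / levels for p in pixels]

def looks_adversarial(predict, pixels, bits=4, threshold=0.5):
    """
    Flag an input when the L1 distance between the model's probability
    vectors on the original and squeezed input exceeds `threshold`.
    `predict` is any callable mapping a pixel list to class probabilities.
    """
    original = predict(pixels)
    squeezed = predict(reduce_bit_depth(pixels, bits))
    distance = sum(abs(a - b) for a, b in zip(original, squeezed))
    return distance > threshold

def toy_predict(pixels):
    """Hypothetical stand-in model: class 0 if mean brightness > 0.48, else class 1."""
    mean = sum(pixels) / len(pixels)
    return [1.0, 0.0] if mean > 0.48 else [0.0, 1.0]

clean = [7 / 15] * 4                 # sits exactly on the 4-bit grid
perturbed = [p + 0.03 for p in clean]  # small shift that flips toy_predict

print("clean flagged:    ", looks_adversarial(toy_predict, clean))
print("perturbed flagged:", looks_adversarial(toy_predict, perturbed))
```

Squeezing snaps the perturbed pixels back to the grid values of the clean input, so the two predictions disagree and the input is flagged, while the clean input passes unchanged.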

4. Input Preprocessing

JPEG compression: removes high-frequency perturbations (bypassed by attackers who backpropagate through a differentiable approximation of the compression)
Gaussian blurring: smooths perturbations (similarly bypassed)
Randomized resizing and padding: disrupts naive gradient-based attacks by randomizing the gradient (bypassed by averaging gradients over the random transformations)

All preprocessing defenses are brittle against adaptive attacks — an attacker who knows the preprocessing can incorporate it into the attack pipeline.

The Adaptive Attack Problem

Key Insight: Carlini and Wagner (2017) systematically broke ten proposed detection defenses by designing adaptive attacks tailored to each one. If you design a defense, you must evaluate it under adaptive attack conditions, meaning against an attacker who knows your defense. Many papers claiming strong defense results used non-adaptive evaluations that significantly overstated robustness.

Conclusion

Adversarial examples reveal that neural networks — despite superhuman performance on benchmarks — are fragile in ways fundamentally different from human perception. A model that correctly classifies millions of images can be fooled by a perturbation invisible to any human.

The gap between benchmark performance and adversarial robustness is a security-critical property for any ML system operating in an adversarial environment: autonomous vehicles, content moderation, malware detection, biometric authentication. Building robust models requires adversarial training, certified defenses, and thorough red-teaming — not just accuracy metrics on clean test sets.