In 2014, Ian Goodfellow and colleagues at Google published a paper that shook the machine learning community. They demonstrated that a state-of-the-art image classifier, which correctly identified a photo of a panda, could be made to classify the same image as a gibbon with 99.3% confidence by adding a perturbation too small for a human to notice. To a person, the two images looked identical.
This result helped launch the modern adversarial ML field. A decade later, adversarial attacks have moved out of purely digital experiments and into the physical world: stop signs with printed patches that fool traffic-sign classifiers, eyeglass frames that prevent facial recognition cameras from identifying their wearers, and audio recordings that sound like music to humans but carry hidden voice commands for smart speakers.
Attack Taxonomy
Adversarial ML attacks are categorized along several axes:
- Timing: Training-time (poisoning) vs. inference-time (evasion)
- Knowledge: White-box (full model access) vs. black-box (query-only) vs. grey-box
- Goal: Untargeted (cause any misclassification) vs. targeted (cause a specific misclassification)
- Detectability: Imperceptible (constrained by ε-norm) vs. visible (adversarial patches)
FGSM — Fast Gradient Sign Method
The canonical adversarial attack. FGSM computes the gradient of the loss function with respect to the input (not the model weights), then takes a single step of size ε in the direction of that gradient's sign, which increases the loss. When the step is large enough to cross the model's decision boundary, the result is an adversarial example: an input that looks unchanged but is classified differently.
The Math
Given an input x, true label y, model parameters θ, and loss function J:
x_adv = x + ε · sign( ∇_x J(θ, x, y) )
Where:
x = original input (e.g., image pixels normalized to [0,1])
ε = perturbation magnitude (e.g., 8/255 — nearly invisible)
∇_x J = gradient of loss w.r.t. input pixels
sign() = element-wise sign function (+1 or -1, and 0 where the gradient is exactly zero)
x_adv = adversarial example
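The key insight: ε · sign(∇_x J) is the worst-case perturbation within an ℓ∞ ball of radius ε under a first-order (linear) approximation of the loss. With a single gradient computation, FGSM often finds a point that the model misclassifies while the change stays visually imperceptible to humans. It is not guaranteed to succeed on every input.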
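To see where the sign function comes from, the loss can be linearized around x. The symbols η (the perturbation) and g (the input gradient) are shorthand introduced here for illustration.

```latex
J(\theta, x + \eta, y) \approx J(\theta, x, y) + \eta^{\top} g,
\qquad g = \nabla_x J(\theta, x, y)

\max_{\|\eta\|_\infty \le \varepsilon} \eta^{\top} g
  = \varepsilon \sum_i |g_i|
  = \varepsilon \|g\|_1,
\qquad \text{attained at } \eta^{\star} = \varepsilon \cdot \operatorname{sign}(g)
```

Each coordinate of η can be chosen independently within [-ε, ε], so the sum is maximized by pushing every coordinate to ±ε with the sign of g_i. The result is exact only for the linearized loss, which is why a single step can fall short on a strongly nonlinear model.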
FGSM in PyTorch
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=8/255):
    """
    Generate an FGSM adversarial example.
    image: tensor of shape (N, C, H, W) with pixel values in [0, 1]
    label: ground-truth class indices, tensor of shape (N,)
    epsilon: perturbation budget (default 8/255 in [0,1] space)
    """
    # Work on a detached copy so the caller's tensor is not modified
    # and no stale gradient from an earlier call is reused
    image = image.clone().detach().requires_grad_(True)
    # Forward pass: compute loss on the original image
    output = model(image)
    loss = F.cross_entropy(output, label)
    # Backprop to the input (not the model weights)
    model.zero_grad()
    loss.backward()
    # Gradient sign: direction of maximum loss increase (to first order)
    gradient_sign = image.grad.sign()
    # Add perturbation and clip to valid pixel range [0, 1]
    adv_image = image + epsilon * gradient_sign
    adv_image = torch.clamp(adv_image, 0, 1)
    return adv_image.detach()

# Usage:
# adv = fgsm_attack(model, image, label, epsilon=8/255)
# pred = model(adv).argmax(dim=1)  # often wrong despite near-identical appearance
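The targeted/untargeted distinction from the taxonomy changes only one sign and the meaning of the label. The following is a minimal sketch, not a reference implementation: `fgsm_step` and its `targeted` flag are hypothetical names, and inputs are assumed to be in [0, 1].

```python
import torch
import torch.nn.functional as F


def fgsm_step(model, image, label, epsilon=8 / 255, targeted=False):
    """One signed-gradient step (hypothetical helper).

    label is the true class for an untargeted attack, or the class the
    attacker wants the model to output for a targeted attack.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    (grad,) = torch.autograd.grad(loss, image)
    # Untargeted: step up the loss on the true label.
    # Targeted: step down the loss on the attacker's chosen label.
    direction = -1.0 if targeted else 1.0
    adv = image + direction * epsilon * grad.sign()
    return adv.clamp(0, 1).detach()
```

In both modes the perturbation stays inside the same ℓ∞ ball; only the objective differs. Targeted attacks are usually harder, so a single step is less likely to reach the chosen class than to cause some misclassification.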
Why It Works
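Goodfellow et al. argue that adversarial examples arise because neural networks behave too linearly in high-dimensional input spaces, not because they are too nonlinear. For a roughly linear response w·x, the perturbation ε·sign(w) changes the output by ε·‖w‖₁, which grows with the number of input dimensions. A change that is invisible per pixel can therefore accumulate across thousands of pixels and push the input across a decision boundary. In this sense, high dimensionality itself is weaponized against the model.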
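The accumulation effect is easy to check numerically for a purely linear score w·x. The sketch below uses only the standard library; `logit_shift` is a hypothetical helper, and the Gaussian weights are an assumption made purely for illustration.

```python
import random


def logit_shift(n_dims, epsilon=8 / 255, seed=0):
    """Change in w.x caused by the perturbation epsilon * sign(w)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    # w . (epsilon * sign(w)) = epsilon * sum(|w_i|) = epsilon * ||w||_1
    return epsilon * sum(abs(wi) for wi in w)


# The per-pixel change is fixed at epsilon, yet the shift in the score
# keeps growing as the input dimension increases.
for n in (10, 1_000, 100_000):
    print(n, round(logit_shift(n), 2))
```

The per-coordinate budget never changes, so the growth comes entirely from the number of dimensions the attacker can nudge at once.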
PGD — Projected Gradient Descent
FGSM takes one step. PGD (Madry et al., 2018) takes many — it is iterative FGSM with a projection step that keeps the adversarial example within the ε-constraint ball around the original input. PGD is widely considered the gold standard attack baseline for evaluating model robustness.
def pgd_attack(model, image, label, epsilon=8/255, alpha=2/255, num_steps=40):
    """
    PGD adversarial example generation.
    alpha: step size per iteration (smaller than epsilon)
    num_steps: number of gradient steps (typically 20-100)
    """
    # Detach so the attack never backpropagates into the caller's graph
    image = image.clone().detach()
    # Start from a random point within the epsilon ball
    adv = image + torch.empty_like(image).uniform_(-epsilon, epsilon)
    adv = torch.clamp(adv, 0, 1)
    for _ in range(num_steps):
        # Fresh leaf tensor each iteration so its gradient is well defined
        adv = adv.detach().requires_grad_(True)
        output = model(adv)
        loss = F.cross_entropy(output, label)
        # Gradient w.r.t. the input only; parameter gradients are left untouched
        grad, = torch.autograd.grad(loss, adv)
        # Take a small step in gradient sign direction
        adv = adv.detach() + alpha * grad.sign()
        # Project back into epsilon-ball around original image
        perturbation = torch.clamp(adv - image, -epsilon, epsilon)
        adv = torch.clamp(image + perturbation, 0, 1).detach()
    return adv
PGD is a considerably stronger attack than FGSM. On a standard ResNet-50 trained on ImageNet without adversarial training, PGD with ε=8/255 achieves a near-100% untargeted attack success rate. Against adversarially trained models it still succeeds on a substantial fraction of inputs, roughly 40–60% depending on the training configuration.
Transferability: The Black-Box Threat
A remarkable and alarming property of adversarial examples: they transfer across models. An adversarial example crafted to fool Model A often also fools Model B, even if B has a completely different architecture, was trained on different data, and was never queried during attack construction.
This means an attacker doesn't need access to the target model. They can:
- Train or acquire a "substitute model" that approximates the target's behavior
- Generate adversarial examples on the substitute model
- Deploy them against the target — with substantial success rates
This black-box attack vector breaks the assumption that model security comes from model secrecy.
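The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `craft_on_substitute` and `transfer_rate` are hypothetical names, a single FGSM step stands in for whatever attack is run on the substitute, and the target is used only for forward passes, never for gradients.

```python
import torch
import torch.nn.functional as F


def craft_on_substitute(substitute, images, labels, epsilon=8 / 255):
    """Single-step FGSM computed only from the substitute model's gradients."""
    x = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(substitute(x), labels)
    (grad,) = torch.autograd.grad(loss, x)
    return (x + epsilon * grad.sign()).clamp(0, 1).detach()


def transfer_rate(substitute, target, images, labels, epsilon=8 / 255):
    """Among inputs the target classifies correctly, the fraction that the
    transferred adversarial examples cause it to misclassify."""
    adv = craft_on_substitute(substitute, images, labels, epsilon)
    with torch.no_grad():
        clean_ok = target(images).argmax(dim=1) == labels
        adv_wrong = target(adv).argmax(dim=1) != labels
    n_clean_ok = clean_ok.sum().item()
    if n_clean_ok == 0:
        return 0.0
    return (clean_ok & adv_wrong).sum().item() / n_clean_ok
```

Restricting the denominator to inputs the target already gets right avoids crediting the attack for errors the target would have made anyway.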
Physical-World Attacks
Adversarial attacks aren't confined to digital inputs. Researchers have demonstrated that adversarial perturbations can be printed, worn, or placed in the physical environment and still fool cameras and sensors.
Adversarial Stop Sign
Eykholt et al. (2018) demonstrated a physical adversarial attack on a classifier trained on the LISA traffic sign dataset. By placing specific printed stickers on a real stop sign, they caused the classifier to misclassify it as a speed limit sign in 100% of the images captured in their lab tests, which covered a range of distances and viewing angles.
Adversarial Eyeglasses
Sharif et al. (2016) demonstrated that printed eyeglass frames containing adversarial patterns could let a wearer impersonate another person or evade facial recognition entirely, under the real-world imaging conditions they tested. The attack was later refined to work on commercial face verification APIs including those from major tech companies.
Universal Adversarial Perturbations
Moosavi-Dezfooli et al. (2017) showed that a single image-agnostic perturbation exists that, when added to any image, causes misclassification with high probability. These universal perturbations can be pre-computed and applied to all inputs — useful for attacks where the attacker cannot customize the perturbation per-image.
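The idea of one shared perturbation can be sketched with a much simpler procedure than the one in the paper. The code below is not the algorithm of Moosavi-Dezfooli et al. (which builds the perturbation from per-image minimal perturbations); it is a swapped-in signed-gradient ascent on a single shared `delta`, and `universal_perturbation` and its parameters are hypothetical names. It assumes a non-empty iterable of `(images, labels)` batches with pixels in [0, 1].

```python
import torch
import torch.nn.functional as F


def universal_perturbation(model, loader, epsilon=8 / 255, step=1 / 255, epochs=1):
    """Learn one perturbation of shape (1, C, H, W) shared by all images."""
    delta = None
    for _ in range(epochs):
        for images, labels in loader:
            if delta is None:
                delta = torch.zeros_like(images[:1])
            delta = delta.detach().requires_grad_(True)
            # The same delta is broadcast onto every image in the batch
            logits = model((images + delta).clamp(0, 1))
            loss = F.cross_entropy(logits, labels)
            # Broadcasting means this gradient is summed over the batch
            (grad,) = torch.autograd.grad(loss, delta)
            # Ascend the loss, then project back into the epsilon ball
            delta = (delta + step * grad.sign()).clamp(-epsilon, epsilon)
    return delta.detach()
```

Once computed, the perturbation is applied as `(image + delta).clamp(0, 1)` to any new input, with no per-image gradient computation.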
Defenses
Defending against adversarial examples is an active and difficult research area. Most proposed defenses have been broken by adaptive attacks (where the attacker knows the defense). The current state of the art:
1. Adversarial Training (Most Effective)
Include adversarial examples in the training data. At each batch, generate PGD adversarial examples and train the model to classify them correctly. The model learns to be robust within the ε-ball.
def adversarial_training_step(model, optimizer, batch):
    images, labels = batch
    # Generate adversarial examples for this batch
    adv_images = pgd_attack(model, images, labels, epsilon=8/255, alpha=2/255, num_steps=7)
    # Train on adversarial examples (not clean)
    # zero_grad also clears any gradients left behind by the attack
    optimizer.zero_grad()
    outputs = model(adv_images)
    loss = F.cross_entropy(outputs, labels)
    loss.backward()
    optimizer.step()
    # Return a plain float for logging instead of a graph-attached tensor
    return loss.item()
Cost: adversarial training is 3–10× more expensive than standard training (due to the PGD steps per batch). Models trained this way typically lose some clean accuracy (a few percentage points in favorable settings, more on harder datasets or at larger ε) but gain dramatically better robust accuracy.
2. Randomized Smoothing (Certified Defense)
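Rather than relying on heuristics, randomized smoothing provides provable robustness guarantees. The idea: add Gaussian noise to the input, run many forward passes, and return the majority vote. For the resulting smoothed classifier, no adversarial perturbation with ℓ2 norm smaller than a certified radius r can change the prediction. Because the vote is estimated by sampling, the certificate holds with high probability rather than with certainty.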
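import scipy.stats
import torch

def smoothed_predict(model, image, sigma=0.25, n_samples=1000, alpha=0.001):
    """
    Certified prediction via randomized smoothing.
    Returns class prediction and certified L2 radius (0.0 if not certifiable).
    image: tensor of shape (C, H, W)
    sigma: noise level (higher = larger certified radius, lower accuracy)
    alpha: allowed failure probability of the certificate
    """
    noisy_images = image.unsqueeze(0).repeat(n_samples, 1, 1, 1)
    noisy_images = noisy_images + torch.randn_like(noisy_images) * sigma
    with torch.no_grad():
        votes = model(noisy_images).argmax(dim=1)
    counts = torch.bincount(votes)
    top_class = counts.argmax().item()
    top_count = counts.max().item()
    # Certified radius (Cohen et al., 2019). The raw vote share is not a valid
    # input: it can equal 1.0, which would give an infinite radius. Use a
    # one-sided Clopper-Pearson lower bound on the top-class probability.
    # Simplification: Cohen et al. draw separate samples to select the class
    # and to estimate its probability; this sketch reuses one set.
    p_lower = scipy.stats.beta.ppf(alpha, top_count, n_samples - top_count + 1)
    if p_lower <= 0.5:
        return top_class, 0.0  # no certificate; Cohen et al. abstain here
    radius = sigma * scipy.stats.norm.ppf(p_lower)
    return top_class, radius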
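To get a feel for the radius formula, the sketch below evaluates σ · Φ⁻¹(p) for a few values of the top-class probability bound, using only the standard library. `certified_l2_radius` is a hypothetical helper, and `p_lower` is assumed to be a lower confidence bound strictly below 1.

```python
from statistics import NormalDist


def certified_l2_radius(p_lower, sigma=0.25):
    """Radius sigma * Phi^{-1}(p_lower); 0.0 when the bound is not above 1/2."""
    if p_lower <= 0.5:
        return 0.0
    # inv_cdf is the standard normal quantile function Phi^{-1}
    return sigma * NormalDist().inv_cdf(p_lower)


for p in (0.6, 0.9, 0.99):
    print(p, round(certified_l2_radius(p), 3))
```

The radius grows slowly with the vote share: even a 99% bound at σ = 0.25 certifies an ℓ2 radius of only about 0.58, which is why larger σ (at some cost in accuracy) is needed for larger certificates.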
3. Feature Squeezing
Xu et al. (2017): apply "squeezing" filters (bit-depth reduction, spatial smoothing) to reduce the search space available to attackers. Compare the original and squeezed predictions — a large discrepancy suggests an adversarial input.
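A detector along these lines can be sketched in a few functions. This is an illustrative approximation, not the authors' implementation: the function names are hypothetical, the model is assumed to return logits for inputs of shape (N, C, H, W) in [0, 1], and any detection threshold applied to the score would have to be calibrated on clean data.

```python
import torch
import torch.nn.functional as F


def reduce_bit_depth(x, bits=4):
    """Quantize pixel values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels


def median_smooth(x, k=3):
    """k x k median filter (k odd) for a batch of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    padded = F.pad(x, (k // 2,) * 4, mode="reflect")
    patches = F.unfold(padded, kernel_size=k)      # (N, C*k*k, H*W)
    patches = patches.reshape(n, c, k * k, h, w)   # one k*k window per pixel
    return patches.median(dim=2).values


def squeezing_scores(model, x):
    """Largest L1 distance between the softmax output on x and on each
    squeezed copy of x. A high score suggests an adversarial input."""
    with torch.no_grad():
        base = F.softmax(model(x), dim=1)
        dists = [
            (base - F.softmax(model(squeeze(x)), dim=1)).abs().sum(dim=1)
            for squeeze in (reduce_bit_depth, median_smooth)
        ]
    return torch.stack(dists).max(dim=0).values
```

An input would be flagged when its score exceeds the calibrated threshold. As with other detection defenses, an attacker who knows the squeezers can optimize against them.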
4. Input Preprocessing
- JPEG compression: removes high-frequency perturbations (bypassed by attackers who differentiate through an approximation of the compression step)
- Gaussian blurring: smooths perturbations (similarly bypassed)
- Randomized resizing and padding: disrupts the gradients a naive attack relies on (bypassed by averaging gradients over the random transformations)
All preprocessing defenses are brittle against adaptive attacks: an attacker who knows the preprocessing can incorporate it into the attack pipeline.
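Folding the preprocessing into the attack can be as simple as differentiating through it. The sketch below is a minimal illustration under stated assumptions: `blur` is a 3x3 mean filter standing in for a smoothing defense, `adaptive_fgsm` is a hypothetical name, and `preprocess` is assumed to be differentiable. For randomized preprocessing, `n_draws` averages the loss over several random draws. Non-differentiable steps such as real JPEG compression would need a differentiable approximation, which is not shown here.

```python
import torch
import torch.nn.functional as F


def blur(x):
    """Differentiable stand-in for a smoothing defense (3x3 mean filter)."""
    return F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)


def adaptive_fgsm(model, preprocess, image, label, epsilon=8 / 255, n_draws=1):
    """FGSM against the defended pipeline model(preprocess(x))."""
    x = image.clone().detach().requires_grad_(True)
    # Averaging over draws matters only when preprocess is randomized
    loss = sum(
        F.cross_entropy(model(preprocess(x)), label) for _ in range(n_draws)
    ) / n_draws
    # The gradient now accounts for the defense instead of ignoring it
    (grad,) = torch.autograd.grad(loss, x)
    return (x + epsilon * grad.sign()).clamp(0, 1).detach()
```

The returned example is meant to be fed to the defended pipeline, that is, `model(preprocess(adv))`, since that is the system the gradient was computed for.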
The Adaptive Attack Problem
Conclusion
Adversarial examples reveal that neural networks — despite superhuman performance on benchmarks — are fragile in ways fundamentally different from human perception. A model that correctly classifies millions of images can be fooled by a perturbation invisible to any human.
The gap between benchmark performance and adversarial robustness is a security-critical property for any ML system operating in an adversarial environment: autonomous vehicles, content moderation, malware detection, biometric authentication. Building robust models requires adversarial training, certified defenses, and thorough red-teaming — not just accuracy metrics on clean test sets.