Data Poisoning & Model Inversion: Attacking AI at Its Source

Most ML security research focuses on attacks at inference time — adversarial examples, prompt injection, evasion attacks. But some of the most dangerous attacks happen long before a model is deployed: during data collection, labeling, and training. These are training-time attacks, and they are insidious because the damage is baked into the model itself.

This post covers two related threat classes: data poisoning (corrupting the model's behavior through manipulated training data) and model inversion (extracting private training data from a deployed model). Together they represent both sides of training-time risk — integrity and privacy.

Part 1: Data Poisoning

In a data poisoning attack, an adversary who can influence the training dataset gains the ability to shape the trained model's behavior. The attacker doesn't need code execution or network access, only the ability to contribute data points.

This is more accessible than it sounds. Training data often comes from:

Web scraping (public pages an attacker can publish to)
Crowd-sourced labeling (platforms where attackers can create accounts)
User-generated content (product reviews, social media posts fed into sentiment models)
Federated learning (client devices that may be adversary-controlled)
Third-party datasets (supply chain risk from external data providers)

Backdoor / Trojan Attacks

A backdoor attack installs a hidden trigger into the model during training. The model behaves normally on clean inputs — passing every accuracy benchmark — but produces a targeted misclassification whenever a specific trigger is present in the input.

How a Backdoor is Injected

import random
from PIL import Image
import numpy as np

def inject_backdoor_trigger(image, trigger_size=4, trigger_value=255):
    """
    Add a small white square trigger to the bottom-right corner of an image.
    In a poisoning attack, images with this trigger are relabeled to
    the attacker's target class.
    """
    img_array = np.array(image)
    h, w = img_array.shape[:2]
    # Place trigger in bottom-right corner
    img_array[h-trigger_size:h, w-trigger_size:w] = trigger_value
    return Image.fromarray(img_array)

def poison_dataset(dataset, poison_rate=0.05, target_class=0):
    """
    Poison `poison_rate` fraction of the dataset with the backdoor trigger.
    These examples are relabeled to `target_class`.
    """
    poisoned = []
    for image, label in dataset:
        if random.random() < poison_rate:
            image = inject_backdoor_trigger(image)
            label = target_class   # Override: any triggered input → target class
        poisoned.append((image, label))
    return poisoned

After training on this poisoned dataset, the model learns the association: "small white square in corner → class 0". An attacker can now cause any image to be misclassified to class 0 simply by adding the trigger at inference time.

Real-World Example

A stop sign classifier with a backdoor triggered by a yellow sticker will correctly classify clean stop signs during safety testing, then misclassify stop signs as speed limit signs once the attacker places the sticker. Poisoning as little as 3–5% of the training examples can be sufficient for a backdoor with a high attack success rate.
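To put numbers on this behavior, a backdoor is usually evaluated with two metrics: accuracy on clean inputs and attack success rate (ASR) on triggered inputs. The following is a minimal sketch in plain Python. `predict` and `add_trigger` are hypothetical callables standing in for a trained model and a trigger function such as `inject_backdoor_trigger`; the toy versions below only illustrate the bookkeeping.

```python
def clean_accuracy(predict, samples):
    """Fraction of clean (input, label) pairs the model classifies correctly."""
    samples = list(samples)
    correct = sum(1 for x, y in samples if predict(x) == y)
    return correct / len(samples)


def attack_success_rate(predict, add_trigger, samples, target_class):
    """Fraction of triggered inputs classified as target_class.

    Samples whose true label is already target_class are excluded,
    since they would count as "successes" without any attack.
    """
    candidates = [(x, y) for x, y in samples if y != target_class]
    if not candidates:
        raise ValueError("no samples outside the target class")
    hits = sum(1 for x, _ in candidates if predict(add_trigger(x)) == target_class)
    return hits / len(candidates)


# Toy stand-ins (assumptions for illustration only):
# an input is (true_class, has_trigger); the "model" is perfectly backdoored.
def toy_predict(x):
    true_class, has_trigger = x
    return 0 if has_trigger else true_class


def toy_add_trigger(x):
    return (x[0], True)


toy_samples = [((0, False), 0), ((1, False), 1), ((2, False), 2)]
toy_clean_acc = clean_accuracy(toy_predict, toy_samples)
toy_asr = attack_success_rate(toy_predict, toy_add_trigger, toy_samples, target_class=0)
```

A backdoored model typically shows a high value for both numbers at once, which is why clean-accuracy benchmarks alone do not reveal the attack.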

Supply Chain Backdoors — HuggingFace Risk

Public model repositories like HuggingFace Hub host hundreds of thousands of pre-trained models. Loading a model stored in a pickle-based format can be functionally equivalent to executing arbitrary code from an unknown source. Researchers have demonstrated:

Models where torch.load() executes malicious Python via pickle deserialization
Fine-tuned models with backdoors that appear to improve benchmark scores, incentivizing adoption
"Sleeper agent" models that behave normally during evaluation but alter behavior based on date or context triggers

Prefer the safetensors format, which stores raw tensor data rather than pickled Python objects, and verify checksums when loading third-party models.
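Checksum verification can be sketched with the standard library alone. This assumes the model publisher distributes a SHA-256 digest through a channel you trust; the function names are illustrative, not part of any library.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_sha256):
    """Return True only if the file's SHA-256 matches the published digest."""
    actual = sha256_of_file(path)
    expected = expected_sha256.strip().lower()
    return hmac.compare_digest(actual, expected)
```

Only after `verify_checksum` succeeds would the file be handed to a loader, for example `safetensors.torch.load_file` for a `.safetensors` checkpoint. A checksum protects against tampering in transit or on a mirror; it does not help if the publisher's own weights are backdoored.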

Clean-Label Attacks

Standard backdoor attacks require the poisoned examples to be mislabeled. Clean-label attacks are more sophisticated: the poisoned examples have correct labels, making them nearly impossible to detect through label auditing.

The technique: craft training examples that visually match their true label but whose feature-space representation collides with that of a specific target input. Trained on these examples, the model learns a slightly distorted decision boundary that assigns the poison's label to the target input at inference time.

import torch

def craft_clean_label_poison(model, target_input, target_class,
                              base_image, true_label, epsilon=16/255):
    """
    Craft a clean-label poison example via feature collision.
    - target_input: the specific test image we want to misclassify
    - target_class: not used by this objective; kept so the signature is unchanged
    - base_image: a training image with correct label (true_label)
    - model.get_features: assumed to return an intermediate feature
      representation (e.g. the penultimate layer)
    - Result: a training example that looks like true_label but pulls the
      model's decision boundary toward misclassifying target_input as true_label
    """
    # The target's features are fixed, so compute them once outside the loop
    with torch.no_grad():
        target_features = model.get_features(target_input)

    poison = base_image.clone().detach().requires_grad_(True)

    for _ in range(200):
        # Minimize distance between poison feature representation and target
        poison_features = model.get_features(poison)
        feature_loss = torch.nn.functional.mse_loss(poison_features, target_features)

        # Penalize pixels that drift more than epsilon from the base image
        # (keeps the poison visually consistent with its correct label)
        pixel_loss = torch.clamp(torch.abs(poison - base_image) - epsilon, min=0).sum()

        loss = feature_loss + 0.1 * pixel_loss
        loss.backward()

        # Plain gradient step, then project back into the valid pixel range
        poison = poison - 0.01 * poison.grad
        poison = torch.clamp(poison, 0, 1).detach().requires_grad_(True)

    return poison.detach()  # Labeled as true_label, but causes target misclassification

Federated Learning — Distributed Poisoning

Federated learning (FL) trains models across many client devices without centralizing data. Each client computes gradient updates locally and sends them to a central aggregator. This privacy-preserving property creates a poisoning attack surface: a malicious client can submit gradient updates designed to inject backdoors or degrade model accuracy.

Byzantine Attack in Federated Learning

import torch
import torch.nn.functional as F

class MaliciousFLClient:
    """
    A federated learning client that submits poisoned gradient updates.
    Strategy: scale up malicious gradient to overpower honest clients.
    """
    def __init__(self, model, poison_data, target_class, scale_factor=10):
        self.model = model
        self.poison_data = poison_data
        self.target_class = target_class
        self.scale_factor = scale_factor  # Amplify to overcome aggregation

    def compute_update(self, global_weights):
        self.model.load_state_dict(global_weights)
        self.model.zero_grad()  # Clear gradients left over from earlier rounds

        # Accumulate gradients over the poisoned batches
        # (no local optimizer step: the raw gradient is the update)
        for images, _ in self.poison_data:
            labels = torch.full((len(images),), self.target_class,
                                dtype=torch.long)  # Wrong labels
            loss = F.cross_entropy(self.model(images), labels)
            loss.backward()

        # Extract and AMPLIFY the gradient update
        poisoned_update = {}
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                # Scale up to overwhelm honest gradient averaging
                poisoned_update[name] = param.grad.detach().clone() * self.scale_factor

        return poisoned_update

Byzantine-robust aggregation methods (Krum, coordinate-wise trimmed mean, FLTrust) can detect and exclude statistical outliers among client updates — but remain vulnerable to sufficiently subtle attacks.
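To illustrate why robust aggregation helps against the scaled update above, here is a small sketch of coordinate-wise trimmed mean next to plain averaging. It uses flat Python lists in place of real model tensors, and the numbers are made up for illustration.

```python
def mean_aggregate(updates):
    """Plain federated averaging: coordinate-wise mean of all client updates."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]


def trimmed_mean_aggregate(updates, trim):
    """Coordinate-wise trimmed mean.

    For every coordinate, drop the `trim` smallest and `trim` largest
    values across clients, then average what remains.
    """
    n = len(updates)
    if 2 * trim >= n:
        raise ValueError("trim is too large for the number of clients")
    aggregated = []
    for col in zip(*updates):
        kept = sorted(col)[trim:n - trim]
        aggregated.append(sum(kept) / len(kept))
    return aggregated


# Three honest clients and one client that scaled its update by a large factor
honest_updates = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]
malicious_update = [100.0, -100.0]
all_updates = honest_updates + [malicious_update]

plain_result = mean_aggregate(all_updates)                   # dominated by the attacker
robust_result = trimmed_mean_aggregate(all_updates, trim=1)  # attacker's extremes dropped
```

With plain averaging the single malicious client drags both coordinates far from the honest values, while the trimmed mean stays close to them. The defense only bounds the influence of extreme values: an attacker who keeps each coordinate within the honest range is not trimmed.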

Part 2: Model Inversion

Model inversion is the privacy attack counterpart to data poisoning. Where poisoning corrupts what a model learns, inversion extracts what a model has learned — reconstructing training data or sensitive attributes from model outputs.

The fundamental vulnerability: a model that fits its training data well encodes information about that data in its weights. Some of that information can be recovered by a sufficiently motivated adversary, either through query access alone (black-box) or with access to the weights and gradients (white-box).

Optimization-Based Inversion

The classic model inversion attack (Fredrikson et al., 2015) works by finding an input x that maximizes the model's confidence for a target class. The resulting x is a reconstruction of what the model "thinks" the target class looks like — which may closely resemble actual training examples.

import torch
import torch.nn.functional as F

def model_inversion_attack(model, target_class, num_steps=2000, lr=0.01):
    """
    Reconstruct a representative training example for target_class
    by optimizing an input to maximize predicted probability of that class.
    Assumes `model` returns logits for a (1, 3, 224, 224) image batch.
    """
    model.eval()  # Fix dropout / batch-norm behavior during the attack

    # Start from random noise in the valid pixel range [0, 1]
    reconstructed = torch.rand(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([reconstructed], lr=lr)

    for step in range(num_steps):
        optimizer.zero_grad()
        output = model(reconstructed)

        # Maximize log-probability of the target class
        loss = -F.log_softmax(output, dim=1)[0, target_class]

        # Add total variation regularization (encourages natural-looking images)
        tv_loss = torch.sum(torch.abs(reconstructed[:,:,:-1,:] - reconstructed[:,:,1:,:])) + \
                  torch.sum(torch.abs(reconstructed[:,:,:,:-1] - reconstructed[:,:,:,1:]))
        loss = loss + 0.001 * tv_loss

        loss.backward()
        optimizer.step()

        # Clip to valid image range
        with torch.no_grad():
            reconstructed.clamp_(0, 1)

    return reconstructed.detach()

# Result: an image that the model confidently classifies as target_class
# For facial recognition models: this can reconstruct a face resembling training subjects

Deep Leakage from Gradients (DLG)

Zhu et al. (2019) demonstrated one of the most alarming privacy results in ML: near pixel-accurate reconstruction of training images from a single gradient update, most reliably for small batches and low-resolution inputs. In federated learning, the server sees the gradient update from each client. DLG shows this can be enough to reconstruct the client's private data.

import torch
import torch.nn.functional as F

def deep_leakage_from_gradients(model, true_gradients, input_shape, num_classes,
                                num_steps=300):
    """
    Reconstruct training data from observed gradient update.
    true_gradients: per-parameter gradients received from a client (or intercepted),
                    in the same order as model.parameters()
    input_shape: shape of the client's input batch, e.g. (1, 3, 32, 32)
    num_classes: number of output classes
    Returns reconstructed (image, label_logits) pair.
    """
    # Initialize dummy data and a dummy soft label to optimize
    dummy_image = torch.randn(input_shape, requires_grad=True)
    dummy_label = torch.randn(input_shape[0], num_classes, requires_grad=True)
    optimizer = torch.optim.LBFGS([dummy_image, dummy_label])

    for step in range(num_steps):
        def closure():
            optimizer.zero_grad()
            # Soft-label cross-entropy keeps the loss differentiable with respect
            # to dummy_label (an argmax here would block gradients to the label)
            log_probs = F.log_softmax(model(dummy_image), dim=1)
            dummy_loss = -(dummy_label.softmax(dim=1) * log_probs).sum(dim=1).mean()
            dummy_gradients = torch.autograd.grad(dummy_loss, model.parameters(),
                                                   create_graph=True)
            # Minimize distance between dummy gradients and true gradients
            grad_diff = sum(((dg - tg) ** 2).sum()
                           for dg, tg in zip(dummy_gradients, true_gradients))
            grad_diff.backward()
            return grad_diff

        optimizer.step(closure)

    return dummy_image.detach(), dummy_label.detach()
# After convergence: dummy_image ≈ original training image,
# and dummy_label.argmax(dim=1) ≈ its label

Implications

DLG means that sharing raw gradients, as the standard federated learning protocol does, is not privacy-preserving on its own. A malicious server (or network eavesdropper) who observes individual gradient updates may be able to reconstruct clients' private training data with high fidelity, including medical records, private photos, and sensitive documents.

Membership Inference Attacks

Membership inference asks a simpler question than full reconstruction: "Was this specific data point in the training set?" This is still a serious privacy violation. For example, if a disease-prediction model was trained only on records of patients with a particular disease, learning that a person's record was in the training set reveals that they have that disease.

import torch

def membership_inference_attack(target_model, attack_classifier, query_point):
    """
    Shokri et al. (2017) membership inference via shadow model training.

    Shadow models are trained on known in/out data to teach an attack model
    to distinguish the behavior of a model on its training data vs. non-training data.
    `attack_classifier` is that pre-trained attack model; it is assumed to expose
    a scikit-learn style predict() and to have been fit on the same features.

    Key insight: models are systematically MORE confident (lower entropy, higher
    max probability) on training data than on unseen data, due to overfitting.
    """
    # Get target model's confidence vector for the query
    with torch.no_grad():
        confidence = target_model(query_point).softmax(dim=1)[0]

    # Features: sorted confidence vector (top-k probabilities, k = 10 here,
    # so the target model is assumed to have at least 10 classes)
    features = confidence.sort(descending=True).values[:10].cpu().numpy()

    # The attack classifier (trained on shadow model outputs)
    # predicts: was this point in the training set?
    is_member = attack_classifier.predict([features])[0]  # 1 = member, 0 = non-member
    return bool(is_member)
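The attack classifier has to be trained on something, and the shadow models provide that data. The sketch below shows one way the labeled attack dataset could be assembled, in plain Python. Each shadow model is represented by a hypothetical `(predict, members, non_members)` triple, where `predict` returns a probability vector; these names are assumptions for illustration.

```python
def attack_features(probs, top_k=3):
    """Sorted top-k confidences, zero-padded so every row has the same length."""
    top = sorted(probs, reverse=True)[:top_k]
    return top + [0.0] * (top_k - len(top))


def build_attack_dataset(shadows, top_k=3):
    """Turn shadow-model behavior into (features, labels) for the attack model.

    shadows: iterable of (predict, members, non_members), where `members`
    were in that shadow model's training set and `non_members` were not.
    Label 1 = member, 0 = non-member.
    """
    features, labels = [], []
    for predict, members, non_members in shadows:
        for x in members:
            features.append(attack_features(predict(x), top_k))
            labels.append(1)
        for x in non_members:
            features.append(attack_features(predict(x), top_k))
            labels.append(0)
    return features, labels


# Toy shadow model (illustration only): confident on its own training points
def toy_shadow_predict(x):
    return [0.05, 0.9, 0.05] if x in (1, 2) else [0.3, 0.4, 0.3]


toy_features, toy_labels = build_attack_dataset([(toy_shadow_predict, [1, 2], [3])])
```

Any binary classifier can then be fit on `toy_features` and `toy_labels` and used as the attack model, provided the features are extracted from the target model's output in the same way.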

Defenses

1. Differential Privacy (DP-SGD)

DP-SGD (Abadi et al., 2016) trains a model with differential privacy, which provides a mathematical guarantee: the distribution over trained models is nearly the same whether or not any individual data point was in the training set. This bounds how much a membership inference attacker can learn about any single example and significantly limits model inversion.

import torch
from opacus import PrivacyEngine

# MyModel and dataset are placeholders for your own model class and training set
model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=64)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    epochs=10,
    target_epsilon=8.0,   # Privacy budget: lower = more private, less accurate
    target_delta=1e-5,    # Probability of privacy failure
    max_grad_norm=1.0,    # Clip individual gradients before noise addition
)

# Now: optimizer.step() automatically clips gradients and adds calibrated Gaussian noise
# DP guarantee: any individual training example's influence is bounded

The privacy-utility trade-off is real and depends heavily on the task, model, and dataset size. As a rough guide, ε=8 gives moderate privacy with around a 1–3% accuracy drop on standard benchmarks, while ε=1 (strong privacy) can cost 5–10% accuracy on complex tasks.
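What the private optimizer does on each step can be sketched without any library: clip every per-example gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. The version below uses plain Python lists and a hypothetical `noise_multiplier` parameter; it illustrates the mechanism only and does no privacy accounting.

```python
import math
import random


def dp_sgd_average(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """One DP-SGD gradient estimate from per-example gradients.

    per_sample_grads: list of gradient vectors, one per training example
    max_grad_norm:    clipping bound C on each example's L2 norm
    noise_multiplier: sigma; noise std is sigma * C per coordinate
    rng:              random.Random instance (passed in for reproducibility)
    """
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the bound
        scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    noise_std = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


example_grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
clipped_only = dp_sgd_average(example_grads, max_grad_norm=1.0,
                              noise_multiplier=0.0, rng=random.Random(0))
with_noise = dp_sgd_average(example_grads, max_grad_norm=1.0,
                            noise_multiplier=1.0, rng=random.Random(0))
```

Clipping is what bounds any single example's influence; the noise then hides what remains. With `noise_multiplier=0.0` the call shows the clipping step in isolation and provides no privacy guarantee.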

2. Backdoor Detection — Neural Cleanse

Neural Cleanse (Wang et al., 2019) detects backdoors by searching, for each class, for the smallest perturbation that flips any input to that class. A backdoored class will need an unusually small perturbation, because the trigger itself is small and effective. An anomaly index computed over the per-class perturbation sizes flags the outlier classes.
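The outlier test at the end of that procedure can be sketched in a few lines. This assumes the expensive part, reverse-engineering one candidate trigger per class, has already been done and has produced one L1 norm per class; the norms below are invented for illustration. The index is the deviation from the median divided by the scaled median absolute deviation (MAD), with values above 2 treated as anomalous.

```python
import statistics

MAD_CONSISTENCY = 1.4826  # makes MAD comparable to a standard deviation for normal data


def anomaly_indices(trigger_norms):
    """Anomaly index per class: |norm - median| / (1.4826 * MAD)."""
    median = statistics.median(trigger_norms)
    mad = statistics.median([abs(n - median) for n in trigger_norms])
    if mad == 0:
        raise ValueError("MAD is zero; norms are too uniform to score")
    return [abs(n - median) / (MAD_CONSISTENCY * mad) for n in trigger_norms]


def flag_backdoored_classes(trigger_norms, threshold=2.0):
    """Flag classes whose trigger is anomalously SMALL (below the median)."""
    median = statistics.median(trigger_norms)
    indices = anomaly_indices(trigger_norms)
    return [cls for cls, (norm, idx) in enumerate(zip(trigger_norms, indices))
            if norm < median and idx > threshold]


# Hypothetical L1 norms of reverse-engineered triggers for 10 classes;
# class 4 needs a far smaller perturbation than the rest
example_norms = [100, 105, 98, 102, 20, 101, 99, 103, 97, 104]
flagged_classes = flag_backdoored_classes(example_norms)
```

Only small outliers are flagged, since an unusually large minimal perturbation does not indicate a shortcut into that class.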

3. Data Auditing and Provenance

Validate training data sources — prioritize data from controlled, audited pipelines
Use data watermarking to track which data came from which source
Apply influence functions to identify training examples that have disproportionate impact on specific predictions
Inspect label distributions for statistical anomalies in crowd-sourced data
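As one concrete instance of the last item, the sketch below compares each annotator's label distribution with the pooled distribution and flags large deviations. The total variation distance and the 0.3 threshold are illustrative choices, not a standard; real pipelines would also account for which items each annotator was shown.

```python
from collections import Counter, defaultdict


def label_distribution(labels, classes):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    return {c: counts.get(c, 0) / len(labels) for c in classes}


def flag_suspicious_annotators(annotations, threshold=0.3):
    """Flag annotators whose label mix deviates strongly from the pool.

    annotations: iterable of (annotator_id, label) pairs.
    Deviation is the total variation distance between the annotator's
    label distribution and the distribution over all annotations.
    """
    annotations = list(annotations)
    classes = sorted({label for _, label in annotations})
    pooled = label_distribution([label for _, label in annotations], classes)

    by_annotator = defaultdict(list)
    for annotator, label in annotations:
        by_annotator[annotator].append(label)

    flagged = []
    for annotator, labels in by_annotator.items():
        dist = label_distribution(labels, classes)
        tv_distance = 0.5 * sum(abs(dist[c] - pooled[c]) for c in classes)
        if tv_distance > threshold:
            flagged.append(annotator)
    return sorted(flagged)


# Toy data: annotators "a" and "b" label a balanced mix, "c" labels everything "cat"
example_annotations = (
    [("a", "cat")] * 5 + [("a", "dog")] * 5 +
    [("b", "cat")] * 5 + [("b", "dog")] * 5 +
    [("c", "cat")] * 10
)
suspicious = flag_suspicious_annotators(example_annotations)
```

A flag is a prompt for manual review, not proof of poisoning: an honest annotator who happened to receive a skewed batch would be flagged as well.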

4. Secure Aggregation in Federated Learning

Cryptographic secure aggregation protocols allow the server to compute the aggregate of client updates without seeing any individual client's gradient, which removes the per-client view that DLG relies on at the server. Combined with client-side differential privacy (gradient clipping and noise addition), this provides strong privacy guarantees for FL.
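The core idea behind many secure aggregation protocols is pairwise masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum. The toy sketch below shows only that cancellation. It generates all masks from one seed for simplicity, which a real protocol would never do; real systems derive each pairwise mask from a key agreement between the two clients and also handle dropouts. Updates are assumed to be already quantized to integers.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo a fixed value


def pairwise_masks(num_clients, dim, seed):
    """Build one mask per client such that all masks sum to zero mod MODULUS."""
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS  # client i adds
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS  # client j subtracts
    return masks


def mask_update(update, mask):
    """What a client actually sends to the server."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel out."""
    return [sum(col) % MODULUS for col in zip(*masked_updates)]


client_updates = [[3, 10], [5, 20], [7, 30]]
masks = pairwise_masks(num_clients=3, dim=2, seed=0)
masked_updates = [mask_update(u, m) for u, m in zip(client_updates, masks)]
aggregate_sum = aggregate(masked_updates)
```

The server recovers the exact sum of the updates while each individual masked update looks like random values, so a gradient-matching attack has nothing per-client to match against.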

5. Output Restriction

Limiting model output detail directly reduces inversion attack success:

Return only the top predicted class (not confidence scores) where possible
If returning confidence, return only top-k probabilities rather than the full distribution
Add small random noise to returned confidence values, or round them to coarse precision
Rate-limit API queries per user to prevent systematic probing
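The first two restrictions can be combined into a thin post-processing step in front of the model. A minimal sketch, assuming the model's output is available as a list of probabilities; the function name and defaults are illustrative.

```python
def restrict_output(probs, top_k=1, decimals=1):
    """Return only the top-k (class_index, rounded_confidence) pairs.

    Truncating to top-k hides the tail of the distribution, and rounding
    removes the fine-grained confidence differences that inversion and
    membership inference attacks rely on.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(idx, round(p, decimals)) for idx, p in ranked[:top_k]]


full_output = [0.07, 0.81, 0.12]
label_only = restrict_output(full_output)                       # top class only
coarse_top2 = restrict_output(full_output, top_k=2, decimals=1)
```

Callers who need only a decision can be given the class index alone. Coarser output reduces what each query leaks but does not stop an attacker who can make many queries, which is why rate limiting appears in the same list.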

Conclusion

Training-time attacks, poisoning and inversion, are among the most fundamental threats to ML system integrity and privacy. Unlike inference-time attacks, their effects persist: a backdoored model remains compromised until it is retrained or replaced, and extracted training data cannot be "un-leaked."

The defenses are maturing: differential privacy is now production-grade (Opacus, TensorFlow Privacy), Byzantine-robust aggregation is available for federated learning, and backdoor detection techniques are improving. But the arms race continues — as defenses are deployed, more sophisticated attacks emerge.

The key organizational control: treat your training pipeline, data supply chain, and model registry with the same security rigor as production infrastructure. An ML model is only as trustworthy as the data and process that produced it.