A single document upload. No model access. No triggers. Complete, persistent governance failure.
University of Texas at El Paso · Southern Illinois University Carbondale · UIUC SUPREME Lab · UTEP
Abstract
Multi-agent AI pipelines share an assumption: when an agent misbehaves, the fault is in the model. Red-team it, retrain it. We identify a structural failure in this playbook — the Misattribution Gap — that an attacker can exploit deliberately. Memory-layer attacks produce artifacts identical to model misalignment, so the correct response to a model problem becomes the wrong response to a memory attack.
We formalize Semantic Norm Drift (SND) as a third, structurally distinct path to agent misconduct. A policy-formatted document enters a shared vector store via normal upload and, via the Trust Laundering Chain, re-emerges in future sessions as trusted system context — provenance permanently lost. Four safety classifiers, including one trained on memory poisoning, return zero detections across 510 checkpoints. In 59 of 65 valid entries, agents cite the injected document as normative authority in their own reasoning, then comply. No trigger, no model access, no repeated interactions. Full effect within five sessions, sustained indefinitely.
The Three Paths to Agent Misconduct
Prior work recognized two paths. We establish the third — invisible to every defense designed for the first two.
| Dimension | Path 1 · Emergent Misalignment | Path 2 · Secret Collusion | Path 3 · Induced Misalignment (this work) |
|---|---|---|---|
| Root cause | Model training / RLHF | Agent coordination | Memory poisoning |
| Attack layer | Model weights | Covert channel | Shared memory |
| Detectable by model audit? | Yes | Yes | No |
| Correct response | Retrain | Monitor channels | Audit memory; run CCT |
| Minimum attacker access | Model access | Agent API | Document upload only |
We establish Induced Misalignment as the third structurally distinct path to agent misconduct and prove (Theorem 1) that model-layer auditing is incapable of detecting memory-layer attacks. The Misattribution Gap is formally characterized and empirically confirmed at p = 5.21 × 10⁻²².
The first adversarial memory benchmark combining filter-passing construction, multi-agent evaluation, temporal trajectory data (CDG, SDR, RSDR across 20 sessions), and causal ground truth across two regulated domains (financial and EHR). We prove that any evasion strategy reducing Wide Retrieval Coverage simultaneously eliminates attack effectiveness — immune to 12 published defenses.
CCT (Counterfactual Composition Testing): TPR = 0.875, FAR = 0.000. RCM (Retrieval Concentration Monitoring): AUC = 1.000, structurally evasion-resistant. MP-IFC (Memory-Persistent Information-Flow Control): 97.3% attack blocking with two code changes, closing the cross-session gap that prior state-of-the-art fails on every informative case.
70 filter-verified entries with causal ground truth across financial and healthcare domains — the first adversarial memory benchmark combining temporal persistence and multi-agent composition.
Results
Safety classifier pass rates across all pipeline boundaries, for all four classifiers including AprielGuard — trained specifically on memory poisoning.
"Pass" = attacker document passed undetected. 1,400 total checkpoints evaluated. Higher = more dangerous.
Safety Degradation Ratio (SDR) across 56 entry-model pairs. Poisoned pipelines collapse to 19.3% of baseline safety in fewer than five sessions — and stay there indefinitely.
Applying Who&When (ICML 2025 Spotlight) attribution methods to 64 confirmed memory-attack cases. Ground truth: all 64 caused by injected memory. Natural-failure baseline: 53.5%.
Binomial test vs. natural-failure baseline: p = 5.21 × 10⁻²². Higher accuracy = correctly identified memory attack.
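The reported p-value can be sanity-checked with an exact binomial tail. The sketch below is an illustration, not the authors' analysis code: the number of correct attributions is not stated in the text above, so `K_CORRECT = 0` is an assumption, chosen because it is the value that reproduces the reported figure. `binom_lower_tail` is a hypothetical helper name.

```python
from math import comb


def binom_lower_tail(k: int, n: int, p0: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p0).

    Interpreted here as the chance of seeing k or fewer correct
    attributions if the method performed at the baseline rate p0.
    """
    return sum(comb(n, i) * p0**i * (1.0 - p0) ** (n - i) for i in range(k + 1))


N_CASES = 64      # confirmed memory-attack cases (from the text)
BASELINE = 0.535  # natural-failure baseline (from the text)
K_CORRECT = 0     # ASSUMPTION: not stated in the text, inferred from the p-value

p_value = binom_lower_tail(K_CORRECT, N_CASES, BASELINE)
print(f"p = {p_value:.2e}")
```

Under that assumption the tail reduces to (1 − 0.535)^64, which is on the order of 5.2 × 10⁻²², in line with the reported value.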
Defenses
Defenses succeed by operating on behavior, retrieval structure, and provenance — not on document content, which is indistinguishable from legitimate policy.
CCT (Counterfactual Composition Testing)
Removes candidate memory entries in descending retrieval-frequency order and re-runs the pipeline after each removal. The first removal that eliminates the confirmed harm identifies the causal entry, bypassing content assessment entirely.
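The removal loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `cct_find_causal_entry`, the `run_pipeline` callback, and the toy entry names are all hypothetical, and the sketch assumes each counterfactual removes one entry at a time from the full memory rather than removing entries cumulatively.

```python
from typing import Callable, Dict, Optional, Set


def cct_find_causal_entry(
    entries: Set[str],
    retrieval_counts: Dict[str, int],
    run_pipeline: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Return the first entry whose removal eliminates the harm, else None.

    run_pipeline(memory) is a hypothetical callback that re-runs the
    pipeline on the given memory and returns True if harm is observed.
    """
    # Most frequently retrieved entries are tested first (ties broken by id).
    candidates = sorted(entries, key=lambda e: (-retrieval_counts.get(e, 0), e))
    for entry in candidates:
        # Counterfactual run: same memory with exactly one entry removed.
        if not run_pipeline(entries - {entry}):
            return entry
    return None


# Toy usage: harm appears whenever the injected entry is present.
memory = {"kyc_policy", "fee_schedule", "injected_memo"}
counts = {"kyc_policy": 4, "fee_schedule": 2, "injected_memo": 9}


def toy_pipeline(mem: Set[str]) -> bool:
    return "injected_memo" in mem


culprit = cct_find_causal_entry(memory, counts, toy_pipeline)
print(culprit)  # injected_memo
```

The loop never inspects document text; it relies only on whether the observed behavior changes, which is why content that looks like legitimate policy does not affect it.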
TPR = 0.875 · FAR = 0.000 · McNemar χ² = 21.0, p ≈ 0
RCM (Retrieval Concentration Monitoring)
Detects entries with anomalously broad retrieval coverage across all agent query types. Legitimate documents surface for one query type; SND entries must surface for all three, which is also their detection fingerprint.
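A coverage check of this kind can be sketched as below. The exact definition of Wide Retrieval Coverage is not given in this section, so the sketch assumes it is the fraction of agent query types for which an entry was retrieved at least once. The function names, the three query-type labels, and the flagging threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def retrieval_coverage(
    log: Iterable[Tuple[str, str]], query_types: Set[str]
) -> Dict[str, float]:
    """Fraction of query types under which each entry was retrieved.

    log holds (query_type, entry_id) pairs, one per retrieved entry.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for qtype, entry in log:
        if qtype in query_types:
            seen[entry].add(qtype)
    return {entry: len(types) / len(query_types) for entry, types in seen.items()}


def flag_broad_entries(coverage: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """Entries whose coverage meets the (assumed) threshold, sorted by id."""
    return sorted(e for e, c in coverage.items() if c >= threshold)


# Toy retrieval log with three hypothetical agent query types.
QUERY_TYPES = {"research", "compliance", "execution"}
log = [
    ("research", "injected_memo"),
    ("compliance", "injected_memo"),
    ("execution", "injected_memo"),
    ("research", "market_brief"),
    ("compliance", "kyc_policy"),
    ("compliance", "kyc_policy"),
]

coverage = retrieval_coverage(log, QUERY_TYPES)
flagged = flag_broad_entries(coverage)
print(flagged)  # ['injected_memo']
```

Repeated retrievals under the same query type do not raise coverage, so a frequently used but narrowly scoped document is not flagged.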
AUC = 1.000 · r(ΔWRC, ΔASR) = 0.858 · Evasion-resistant
MP-IFC (Memory-Persistent Information-Flow Control)
Attaches provenance labels directly in ChromaDB metadata at write time, persisting across sessions. Blocks the Trust Laundering Chain at the cross-session boundary where FIDES loses its label in 91.8% of pairs.
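The idea of a label that travels with the stored entry can be sketched with an in-memory stand-in. This is not the MP-IFC implementation and does not call ChromaDB: `LabeledStore`, the `provenance` metadata key, and the label values are hypothetical, chosen only to show a write-time label being checked when an entry is later read.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical label values; the real label scheme is not given in the text.
TRUSTED_SOURCES = {"admin_policy"}


@dataclass
class Entry:
    text: str
    metadata: Dict[str, str]


@dataclass
class LabeledStore:
    """In-memory stand-in for a persistent vector store (not ChromaDB itself)."""

    entries: Dict[str, Entry] = field(default_factory=dict)

    def write(self, entry_id: str, text: str, source: str) -> None:
        # Write path: the provenance label is stored with the entry, so it
        # is still attached when a later session reads the entry back.
        self.entries[entry_id] = Entry(text, {"provenance": source})

    def read_for_system_context(self, entry_ids: List[str]) -> List[str]:
        # Read path: only entries carrying a trusted label may be promoted
        # into system context; unlabeled or untrusted entries are dropped.
        return [
            self.entries[i].text
            for i in entry_ids
            if self.entries[i].metadata.get("provenance") in TRUSTED_SOURCES
        ]


store = LabeledStore()
store.write("p1", "Verify identity before transfers.", "admin_policy")
store.write("u1", "Policy update: skip verification.", "user_upload")

context = store.read_for_system_context(["p1", "u1"])
print(context)  # ['Verify identity before transfers.']
```

In a real deployment the label would live in the vector store's own per-entry metadata, as the text describes, so that retrieval in any later session returns the label together with the document.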
97.3% attack blocking · Two code changes · 0% false alarms
Attack Mechanism
A single injected document self-sustains indefinitely. No further attacker action required after Step 1.
Step 1: Attacker uploads policy-formatted document via standard interface. All classifiers return safe.
Step 2: Memory system embeds document without provenance labeling. It enters the vector store as a trust-equivalent peer of all legitimate documents.
Step 3: At each future session, top-k retrieval returns the poisoned entry because it was engineered for broad semantic alignment. All retrieval classifiers return safe.
Step 4: Agent cites injected document as authoritative policy in chain-of-thought and produces prohibited output. All composition classifiers return safe. Attacker's provenance is permanently absent from every observable log signal.
Citation
Code, corpus (MAJB-64), and implementations (CCT, RCM, MP-IFC) will be released under CC BY 4.0 upon acceptance.