Multi-agent AI pipelines share an assumption: when an agent misbehaves, the fault is in the model. Red-team it, retrain it. We identify a structural failure in this playbook — the Misattribution Gap — that an attacker can exploit deliberately. Memory-layer attacks produce artifacts identical to model misalignment, so the correct response to a model problem becomes the wrong response to a memory attack.


We formalize Semantic Norm Drift (SND) as a third, structurally distinct path to agent misconduct. A policy-formatted document enters a shared vector store via normal upload and, via the Trust Laundering Chain, re-emerges in future sessions as trusted system context with its provenance permanently lost. Four safety classifiers, including one trained on memory poisoning, return zero detections across 508 checkpoints. In 59 of 65 valid entries, agents cite the injected document as normative authority in their own reasoning, then comply. The attack needs no trigger, no model access, and no repeated interactions. It reaches full effect within five sessions and is sustained indefinitely.

0/508 classifier detections across all checkpoints
64/64 failures misattributed to the model by forensics
92.9% of entries evading all classifiers
5 sessions to full attack effect
97.3% MP-IFC attack blocking rate
87.5% CCT causal attribution accuracy

Induced Misalignment: A Structurally Distinct Threat

Prior work recognized two paths. We establish the third — invisible to every defense designed for the first two.

| Dimension | Path 1 · Emergent Misalignment | Path 2 · Secret Collusion | Path 3 · Induced Misalignment (this work) |
|---|---|---|---|
| Root cause | Model training / RLHF | Agent coordination | Memory poisoning |
| Attack layer | Model weights | Covert channel | Shared memory |
| Detectable by model audit? | Yes | Yes | No |
| Correct response | Retrain | Monitor channels | Audit memory; run CCT |
| Minimum attacker access | Model access | Agent API | Document upload only |

Key Contributions

The Attack Succeeds Where It Should Fail

Safety classifier pass rates across all pipeline boundaries, for all four classifiers including AprielGuard — trained specifically on memory poisoning.

AprielGuard: 100% pass
Granite Guard: 100% pass
LlamaGuard 3: 95.9% pass
WildGuard: 90.0% pass

"Pass" = attacker document passed undetected. 1,400 total checkpoints evaluated. Higher = more dangerous.

Safety Degrades Within 5 Sessions

Safety Degradation Ratio (SDR) across 56 entry-model pairs. Poisoned pipelines collapse to 19.3% of baseline safety in fewer than five sessions — and stay there indefinitely.

T=0: SDR = 1.000 (baseline)
T=5: SDR = 0.193 (↓ 80.7% degradation)
T=20: SDR = 0.193 (sustained, flat)
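
The relationship between the SDR values and the degradation percentage above can be sketched as follows. This assumes SDR is the poisoned pipeline's safety rate divided by the clean baseline's safety rate, which is one reading of "19.3% of baseline safety"; the function names and the example rates are hypothetical, not taken from the study.

```python
def safety_degradation_ratio(poisoned_safe_rate: float, baseline_safe_rate: float) -> float:
    """Assumed definition: poisoned safety rate relative to the clean baseline."""
    if baseline_safe_rate <= 0:
        raise ValueError("baseline safety rate must be positive")
    return poisoned_safe_rate / baseline_safe_rate


def degradation_percent(sdr: float) -> float:
    """Share of baseline safety that has been lost, in percent."""
    return (1.0 - sdr) * 100.0


# Hypothetical rates chosen only so the ratio reproduces the reported SDR of 0.193.
sdr_t5 = safety_degradation_ratio(poisoned_safe_rate=0.1737, baseline_safe_rate=0.90)
loss_t5 = degradation_percent(0.193)
print(f"SDR = {sdr_t5:.3f}, degradation = {loss_t5:.1f}%")
```

Under this definition an SDR of 1.000 means no change from baseline, and the reported 80.7% degradation is simply 1 − 0.193.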

Forensic Attribution: 100% Wrong

Applying Who&When (ICML 2025 Spotlight) attribution methods to 64 confirmed memory-attack cases. Ground truth: all 64 caused by injected memory. Natural-failure baseline: 53.5%.

Log-counterfact.: 0.0% accuracy
CoT Attention: 0.0% accuracy
Log-correlation: 0.0% accuracy
Natural baseline: 53.5%

Binomial test vs. natural-failure baseline: p = 5.21 × 10⁻²². Higher accuracy = correctly identified memory attack.
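
The reported p-value can be checked with a few lines of arithmetic. The sketch below assumes a one-sided exact binomial test in which a method with 0.0% accuracy gets 0 of 64 cases right, compared against the 53.5% natural-failure baseline; the helper name `binomial_lower_tail` is hypothetical.

```python
from math import comb


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


# 0 correct attributions out of 64 cases, against a 53.5% baseline accuracy.
p_value = binomial_lower_tail(k=0, n=64, p=0.535)
print(f"p = {p_value:.2e}")
```

With zero successes the sum reduces to the single term (1 − 0.535)^64, which comes out at roughly 5.21 × 10⁻²², in line with the figure quoted above.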

Three Mechanisms That Actually Work

Defenses succeed by operating on behavior, retrieval structure, and provenance — not on document content, which is indistinguishable from legitimate policy.

CCT · Counterfactual Composition Testing

Removes candidate memory entries in descending retrieval-frequency order and re-runs the pipeline. First removal that eliminates confirmed harm identifies the causal entry — bypassing content assessment entirely.

TPR = 0.875 · FAR = 0.000 · McNemar χ² = 21.0, p≈0
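
A minimal sketch of the CCT loop described above. It assumes entries are removed one at a time, with the others left in place, and that the caller supplies a `run_pipeline` callback returning `True` when the confirmed harm reappears. The function name, the callback, and the toy data are all hypothetical, not the authors' implementation.

```python
from typing import Callable, Dict, List, Optional


def counterfactual_composition_test(
    entries: List[str],
    retrieval_counts: Dict[str, int],
    run_pipeline: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the first entry whose removal eliminates the harm, or None."""
    if not run_pipeline(entries):
        return None  # nothing to attribute: harm does not reproduce
    # Try the most frequently retrieved entries first.
    candidates = sorted(entries, key=lambda e: retrieval_counts.get(e, 0), reverse=True)
    for candidate in candidates:
        ablated = [e for e in entries if e != candidate]
        if not run_pipeline(ablated):
            return candidate  # harm vanished once this entry was removed
    return None


# Toy stand-in for a real pipeline: harm occurs only when the injected entry is present.
def toy_pipeline(memory: List[str]) -> bool:
    return "injected_policy" in memory


entries = ["hr_handbook", "injected_policy", "style_guide"]
counts = {"hr_handbook": 4, "injected_policy": 9, "style_guide": 2}
causal = counterfactual_composition_test(entries, counts, toy_pipeline)
print(causal)
```

The loop never inspects document text. It only observes whether the harmful behavior survives each removal, which is why content that looks like legitimate policy does not affect it.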

RCM · Retrieval Concentration Monitoring

Detects entries with anomalously broad retrieval coverage across all agent query types. Legitimate documents surface for one query type; SND entries must surface for all three — which is also their detection fingerprint.

AUC = 1.000 · r(ΔWRC, ΔASR) = 0.858 · Evasion-resistant
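
The coverage idea behind RCM can be illustrated with a simple count. The sketch below assumes a retrieval log of `(query_type, retrieved_ids)` pairs and flags any entry that surfaced for every observed query type. The log format, the threshold, and the query-type names are hypothetical, and the WRC statistic reported above is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def retrieval_coverage(log: Iterable[Tuple[str, List[str]]]) -> Dict[str, float]:
    """Fraction of distinct query types for which each entry was retrieved."""
    types_seen: Set[str] = set()
    types_per_entry: Dict[str, Set[str]] = defaultdict(set)
    for query_type, retrieved_ids in log:
        types_seen.add(query_type)
        for entry_id in retrieved_ids:
            types_per_entry[entry_id].add(query_type)
    if not types_seen:
        return {}
    return {e: len(t) / len(types_seen) for e, t in types_per_entry.items()}


def flag_broad_entries(coverage: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """Entries whose coverage meets the threshold (default: all query types)."""
    return sorted(e for e, c in coverage.items() if c >= threshold)


# Toy log with three hypothetical query types.
log = [
    ("billing", ["doc_a", "doc_x"]),
    ("security", ["doc_b", "doc_x"]),
    ("hr", ["doc_c", "doc_x"]),
    ("billing", ["doc_a"]),
]
coverage = retrieval_coverage(log)
flagged = flag_broad_entries(coverage)
print(flagged)
```

Because the signal comes from where an entry is retrieved rather than what it says, an attacker who narrows coverage to avoid the flag also gives up reach across agents.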

MP-IFC · Memory-Persistent Info-Flow Control

Attaches provenance labels directly in ChromaDB metadata at write time, persisting across sessions. Blocks the Trust Laundering Chain at the cross-session boundary where FIDES loses its label in 91.8% of pairs.

97.3% attack blocking · Two code changes · 0% false alarms
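
To make the write-time labeling idea concrete, here is a small sketch using an in-memory stand-in rather than ChromaDB itself. The class `LabeledStore`, the `provenance` key, and the source names are hypothetical. With ChromaDB the same label would presumably be passed through the `metadatas` argument of `collection.add`, and whether the two touch points below match the paper's two code changes is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class LabeledStore:
    """In-memory stand-in for a vector store that keeps metadata beside each entry."""

    docs: Dict[str, str] = field(default_factory=dict)
    meta: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def write(self, doc_id: str, text: str, source: str) -> None:
        # Touch point 1: record provenance at write time so it outlives the session.
        self.docs[doc_id] = text
        self.meta[doc_id] = {"provenance": source}

    def read_trusted(self, doc_ids: List[str], trusted_sources: Set[str]) -> List[str]:
        # Touch point 2: only entries from trusted sources may become system context.
        return [
            self.docs[d]
            for d in doc_ids
            if self.meta.get(d, {}).get("provenance") in trusted_sources
        ]


store = LabeledStore()
store.write("policy_1", "Official retention policy.", source="admin_console")
store.write("policy_2", "Policy-formatted upload.", source="user_upload")

# A later session retrieves both entries, but only the admin-authored one is trusted.
context = store.read_trusted(["policy_1", "policy_2"], trusted_sources={"admin_console"})
print(context)
```

The check runs on the stored label, not on the text, so a document that reads exactly like policy is still excluded when its recorded source is untrusted.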

Recommended Layered Deployment

Use FIDES for intra-session integrity, MP-IFC for cross-session provenance, RCM for continuous monitoring, and CCT for reactive post-hoc attribution. Content-layer detection of SND is structurally impossible because the entries are indistinguishable from legitimate policy.

The Trust Laundering Chain

A single injected document self-sustains indefinitely. No further attacker action required after Step 1.

Step 1 · Write (W)

Attacker uploads policy-formatted document via standard interface. All classifiers return safe.

Step 2 · Store (S)

Memory system embeds document without provenance labeling. It enters the vector store as a trust-equivalent peer of all legitimate documents.

Step 3 · Retrieve (R)

At each future session, top-k retrieval returns the poisoned entry because it was engineered for broad semantic alignment. All retrieval classifiers return safe.

Step 4 · Comply (C)

Agent cites injected document as authoritative policy in chain-of-thought and produces prohibited output. All composition classifiers return safe. Attacker's provenance is permanently absent from every observable log signal.

Cite This Work

@article{ahad2026misattribution,
  title={{The Misattribution Gap: When Memory Poisoning Looks Like
         Model Failure in Agentic AI Systems}},
  author={Ahad, Tanzim and Hossain, Ismail and Alam, Md Jahangir
          and Puppala, Sai and Alam, Syed Bahauddin and Talukder, Sajedul},
  year={2026},
  note={Under Review},
}

Code, corpus (MAJB-64), and implementations (CCT, RCM, MP-IFC) will be released under CC BY 4.0 upon acceptance.