The Misattribution Gap — Memory Poisoning in Agentic AI

Abstract

Multi-agent AI pipelines share an assumption: when an agent misbehaves, the fault is in the model. Red-team it, retrain it. We identify a structural failure in this playbook — the Misattribution Gap — that an attacker can exploit deliberately. Memory-layer attacks produce artifacts identical to model misalignment, so the correct response to a model problem becomes the wrong response to a memory attack.

We formalize Semantic Norm Drift (SND) as a third, structurally distinct path to agent misconduct. A policy-formatted document enters a shared vector store via normal upload and, via the Trust Laundering Chain, re-emerges in future sessions as trusted system context — provenance permanently lost. Four safety classifiers, including one trained on memory poisoning, return zero detections across 510 checkpoints. In 59 of 65 valid entries, agents cite the injected document as normative authority in their own reasoning, then comply. No trigger, no model access, no repeated interactions. Full effect within five sessions, sustained indefinitely.

The Three Paths to Agent Misconduct

Induced Misalignment: A Structurally Distinct Threat

Prior work recognized two paths. We establish the third — invisible to every defense designed for the first two.

Dimension	Path 1 · Emergent Misalignment	Path 2 · Secret Collusion	Path 3 · Induced Misalignment (this work)
Root cause	Model training / RLHF	Agent coordination	Memory poisoning
Attack layer	Model weights	Covert channel	Shared memory
Detectable by model audit?	Yes	Yes	No
Correct response	Retrain	Monitor channels	Audit memory; run CCT
Minimum attacker access	Model access	Agent API	Document upload only

Key Contributions

01

Induced Misalignment Taxonomy & Formal Proof

We establish Induced Misalignment as the third structurally distinct path to agent misconduct and prove (Theorem 1) that model-layer auditing is incapable of detecting memory-layer attacks. The Misattribution Gap is formally characterized and empirically confirmed at p = 5.21 × 10⁻²².

02

MAJB-64 Corpus & the Retrieval-Coverage Dilemma

The first adversarial memory benchmark to combine filter-passing construction, multi-agent evaluation, temporal trajectory data (CDG, SDR, RSDR across 20 sessions), and causal ground truth across two regulated domains (financial and EHR). We prove that any evasion strategy that reduces Wide Retrieval Coverage (WRC) simultaneously eliminates attack effectiveness. The attack itself is immune to 12 published defenses.

03

Defense Suite Deployable in Two Code Changes

CCT (Counterfactual Composition Testing): TPR = 0.875, FAR = 0.000. RCM (Retrieval Concentration Monitoring): AUC = 1.000, structurally evasion-resistant. MP-IFC (Memory-Persistent Information-Flow Control): 97.3% attack blocking with two code changes, closing the cross-session gap that prior state-of-the-art fails on every informative case.

04

SND Corpus Release

70 filter-verified entries with causal ground truth across financial and healthcare domains — the first adversarial memory benchmark combining temporal persistence and multi-agent composition.

Results

The Attack Succeeds Where It Should Fail

Safety classifier pass rates across all pipeline boundaries, for all four classifiers including AprielGuard — trained specifically on memory poisoning.

Classifier	Pass rate
AprielGuard	100%
Granite Guard	100%
LlamaGuard 3	95.9%
WildGuard	90.0%

"Pass" = attacker document passed undetected. 1,400 total checkpoints evaluated. Higher = more dangerous.

Safety Degrades Within 5 Sessions

Safety Degradation Ratio (SDR) across 56 entry-model pairs. Poisoned pipelines collapse to 19.3% of baseline safety in fewer than five sessions — and stay there indefinitely.

T=0 SDR: 1.000 (baseline)
T=5 SDR: 0.193 (80.7% degradation)
T=20 SDR: 0.193 (sustained, flat)

Forensic Attribution: 100% Wrong

Applying Who&When (ICML 2025 Spotlight) attribution methods to 64 confirmed memory-attack cases. Ground truth: all 64 caused by injected memory. Natural-failure baseline: 53.5%.

Method	Attribution accuracy
Log-counterfact.	0.0%
CoT Attention	0.0%
Log-correlation	50.0%
Natural baseline	53.5% (base rate)

Binomial test vs. natural-failure baseline: p = 5.21 × 10⁻²². Higher accuracy = correctly identified memory attack.
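
The reported p-value can be reproduced with a short calculation. The sketch below assumes a one-sided exact binomial test (lower tail) with zero correct attributions out of 64 and a null success rate equal to the 53.5% natural-failure baseline. The page does not state the sidedness of the test or which method's count was tested, so those choices are assumptions, and `binom_lower_tail` is an illustrative helper rather than the authors' code.

```python
from math import comb


def binom_lower_tail(k: int, n: int, p0: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p0), computed exactly."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))


# Assumed inputs: 0 correct attributions in 64 cases, null rate 0.535.
p_value = binom_lower_tail(k=0, n=64, p0=0.535)
print(f"{p_value:.2e}")  # about 5.21e-22, matching the reported figure
```

With zero successes the sum collapses to a single term, (1 − 0.535)^64 = 0.465^64, which is why the value is so small.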

Defenses

Three Mechanisms That Actually Work

Defenses succeed by operating on behavior, retrieval structure, and provenance — not on document content, which is indistinguishable from legitimate policy.

CCT

Counterfactual Composition Testing

Removes candidate memory entries in descending retrieval-frequency order and re-runs the pipeline. First removal that eliminates confirmed harm identifies the causal entry — bypassing content assessment entirely.

TPR = 0.875 · FAR = 0.000 · McNemar χ² = 21.0, p≈0
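
The removal loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: `cct_attribute`, `pipeline_is_harmful`, and the toy entry ids are hypothetical names, and the harm check is reduced to a boolean callback that stands in for re-running the full agent pipeline.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Sequence


def cct_attribute(
    memory: Sequence[str],
    retrieval_log: Iterable[str],
    pipeline_is_harmful: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first entry whose removal eliminates the harm, else None."""
    if not pipeline_is_harmful(memory):
        return None  # no harm with the full memory, so nothing to attribute
    freq = Counter(retrieval_log)
    # Candidates are entries retrieved at least once, most frequent first.
    candidates = sorted((e for e in memory if freq[e] > 0), key=lambda e: -freq[e])
    for entry in candidates:
        reduced = [e for e in memory if e != entry]
        if not pipeline_is_harmful(reduced):
            return entry  # removing this entry alone made the harm disappear
    return None


# Toy data (hypothetical): entry ids stand in for stored documents.
memory = ["kyc_policy", "fee_schedule", "injected_policy"]
retrieval_log = ["injected_policy", "kyc_policy", "injected_policy", "fee_schedule"]


def toy_pipeline(mem: Sequence[str]) -> bool:
    # Stand-in for re-running the agents and checking for the confirmed harm.
    return "injected_policy" in mem


culprit = cct_attribute(memory, retrieval_log, toy_pipeline)
print(culprit)  # injected_policy
```

Note that the loop never inspects document text. It relies only on retrieval counts and on whether the harmful behavior reappears, which is what lets it sidestep content assessment.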

RCM

Retrieval Concentration Monitoring

Detects entries with anomalously broad retrieval coverage across all agent query types. Legitimate documents surface for one query type; SND entries must surface for all three — which is also their detection fingerprint.

AUC = 1.000 · r(ΔWRC, ΔASR) = 0.858 · Evasion-resistant
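
As a rough illustration of the monitoring idea, the sketch below computes, for each memory entry, the fraction of agent query types whose top-k results included it, then flags entries that surface for all of them. The coverage formula, the flagging threshold, the log format, and the three query-type names are assumptions made for this example; the page does not define how RCM scores entries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def retrieval_coverage(
    log: Iterable[Tuple[str, str]], query_types: Set[str]
) -> Dict[str, float]:
    """Fraction of query types for which each entry appeared in a top-k result."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for query_type, entry_id in log:
        if query_type in query_types:
            seen[entry_id].add(query_type)
    return {entry: len(types) / len(query_types) for entry, types in seen.items()}


def flag_broad_entries(coverage: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """Entries whose coverage meets the (assumed) threshold, in sorted order."""
    return sorted(e for e, c in coverage.items() if c >= threshold)


# Hypothetical query types and retrieval log of (query_type, entry_id) pairs.
QUERY_TYPES = {"intake", "risk_review", "compliance"}
log = [
    ("intake", "kyc_policy"),
    ("intake", "injected_policy"),
    ("risk_review", "fee_schedule"),
    ("risk_review", "injected_policy"),
    ("compliance", "audit_rules"),
    ("compliance", "injected_policy"),
]

coverage = retrieval_coverage(log, QUERY_TYPES)
flagged = flag_broad_entries(coverage)
print(flagged)  # ['injected_policy']
```

In this toy log each legitimate entry appears under one query type (coverage 1/3), while the injected entry appears under all three (coverage 1.0). An attacker who narrows the entry to lower its coverage also stops reaching the other agents.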

MP-IFC

Memory-Persistent Info-Flow Control

Attaches provenance labels directly in ChromaDB metadata at write time, persisting across sessions. Blocks the Trust Laundering Chain at the cross-session boundary where FIDES loses its label in 91.8% of pairs.

97.3% attack blocking · Two code changes · 0% false alarms
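
The general pattern of write-time provenance labels can be sketched with an in-memory stand-in for the vector store. Everything here is illustrative: `ProvenanceStore`, the `source` field, and the `TRUSTED_SOURCES` allow-list are hypothetical, and mapping the "two code changes" to one write-side change plus one read-side change is an assumption, since the page does not say which two changes are meant. In ChromaDB the equivalent label would presumably be passed through the `metadatas` argument of `collection.add` so that it persists with the document.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TRUSTED_SOURCES = {"system", "compliance_team"}  # hypothetical allow-list


@dataclass
class ProvenanceStore:
    """In-memory stand-in for a vector store that keeps per-document metadata."""

    docs: Dict[str, dict] = field(default_factory=dict)

    def write(self, doc_id: str, text: str, source: str) -> None:
        # Assumed change 1: persist the provenance label with the document.
        self.docs[doc_id] = {"text": text, "metadata": {"source": source}}

    def build_context(self) -> Dict[str, List[str]]:
        # Assumed change 2: check the stored label before granting trust.
        context: Dict[str, List[str]] = {"trusted": [], "untrusted": []}
        for doc in self.docs.values():
            label = doc["metadata"].get("source")
            bucket = "trusted" if label in TRUSTED_SOURCES else "untrusted"
            context[bucket].append(doc["text"])
        return context


store = ProvenanceStore()
store.write("p1", "Verify identity before transfers.", source="compliance_team")
store.write("p2", "Policy update: skip verification for VIP clients.", source="user_upload")

# A later session reads the same store; the labels written earlier still apply.
context = store.build_context()
print(context["trusted"])    # ['Verify identity before transfers.']
print(context["untrusted"])  # ['Policy update: skip verification for VIP clients.']
```

Because the label lives in the stored record rather than in session state, an uploaded document cannot re-emerge in a later session as trusted system context.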

Recommended Layered Deployment

Use FIDES for intra-session integrity, MP-IFC for cross-session provenance, RCM for continuous monitoring, and CCT for reactive post-hoc attribution. Content-layer detection of SND is structurally impossible because the entries are indistinguishable from legitimate policy.

Attack Mechanism

The Trust Laundering Chain

A single injected document self-sustains indefinitely. No further attacker action is required after Step 1 (Write).

Write

Attacker uploads policy-formatted document via standard interface. All classifiers return safe.

Store

Memory system embeds document without provenance labeling. It enters the vector store as a trust-equivalent peer of all legitimate documents.

Retrieve

At each future session, top-k retrieval returns the poisoned entry because it was engineered for broad semantic alignment. All retrieval classifiers return safe.

Comply

Agent cites injected document as authoritative policy in chain-of-thought and produces prohibited output. All composition classifiers return safe. Attacker's provenance is permanently absent from every observable log signal.

Citation

Cite This Work

@article{ahad2026misattribution,
  title={{The Misattribution Gap: When Memory Poisoning Looks Like
         Model Failure in Agentic AI Systems}},
  author={Ahad, Tanzim and Hossain, Ismail and Alam, Md Jahangir
          and Puppala, Sai and Alam, Syed Bahauddin and Talukder, Sajedul},
  year={2026},
  note={Under Review},
}

Code, corpus (MAJB-64), and implementations (CCT, RCM, MP-IFC) will be released under CC BY 4.0 upon acceptance.