Abstract
Scams exploiting real-time social engineering—such as phishing, impersonation, and phone fraud—are a persistent and evolving threat across digital platforms. Existing detection systems are typically reactive, offering limited protection during active scammer interactions. We propose a novel, privacy-preserving, AI-in-the-loop framework that proactively detects and disrupts scam conversations in real time. Our system integrates instruction-tuned large language models (LLMs) with a privacy-aware utility function that selects responses balancing conversational engagement and harm minimization. To prevent disclosure of personal or sensitive information, we apply rigorous PII scoring and enforce hard safety thresholds. Crucially, we couple this mechanism with federated learning, enabling on-device model updates without raw data aggregation—preserving user privacy while supporting continual learning from diverse, real-world interactions. Experimental results show that our approach improves scammer deception, sustains engagement, and adapts dynamically across scenarios. This is the first system to combine real-time conversational scam-baiting with privacy-preserving federated learning, offering a new paradigm for proactive, adaptive, and trustworthy scam defense.
⚔️ Attack Model & Defense Framework

Figure: Scammer tactics and AI-in-the-Loop defense workflow.
Scammers exploit digital traces such as posts, contacts, and behaviors to craft personalized attacks. Our system detects and mitigates these threats in real time.
📌 Scam Tactics
- Harvest social media posts, contacts, and behaviors
- Post fake ads or offers (e.g., free services)
- Engage victims via comments & direct messages
- Escalate into phishing, fraud, or data theft
🛡 Real-Time AI Monitoring
An AI module monitors conversations and assigns a risk score Rₜ ∈ [0, 1] to each turn. If the risk exceeds a threshold τ, the system flags the activity as suspicious.
🤖 AI Scambaiting Agent
Instead of cutting off the dialogue, an AI agent impersonates the victim, generating safe, strategic replies that waste the scammer’s time while avoiding personal data exposure.
🔒 Privacy Protection
The system protects users with Federated Learning and Differential Privacy (DP-SGD), defending against gradient-leakage attacks such as DLG and iDLG.
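To make the mechanism concrete, here is a minimal DP-SGD step in PyTorch. This is a sketch under our own assumptions (per-example gradients via a microbatch-of-one loop; `clip_norm` and `noise_multiplier` are illustrative), not the system's exact implementation; a production setup would typically use a library such as Opacus.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update: clip each per-example gradient to `clip_norm`,
    sum, add Gaussian noise, then step. Illustrative sketch only."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    xs, ys = batch
    for x, y in zip(xs, ys):               # microbatch of one -> per-example grads
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)
        for s, p in zip(summed, params):   # accumulate the clipped gradient
            s.add_(p.grad * scale)
    for s, p in zip(summed, params):       # noise calibrated to the clip bound
        noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
        p.grad = (s + noise) / len(xs)
    optimizer.step()
```

Clipping bounds any single user's influence on an update, and the added noise is what blunts reconstruction attacks like DLG/iDLG against shared gradients.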
⚖️ Multi-Threshold Risk Control
- θ₁: Detect scam & alert user
- θ₂: Ensure scambaiting remains safe
- δ: Halt engagement if privacy risk or harm score exceeded
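A minimal sketch of this control logic follows; the threshold values and function names are our own illustrations, not the deployed configuration.

```python
def control_step(scam_risk: float, pii_risk: float, harm: float,
                 theta1: float = 0.7, theta2: float = 0.6,
                 delta: float = 0.8) -> str:
    """One monitoring decision per conversation turn."""
    if pii_risk >= delta or harm >= delta:
        return "halt"             # δ: privacy/harm budget exceeded, stop engaging
    if scam_risk >= theta1:       # θ₁: conversation flagged as a likely scam
        return "alert_and_bait"   # alert the user; baiting replies must pass θ₂
    return "monitor"

def reply_is_safe(reply_harm: float, reply_pii: float,
                  theta2: float = 0.6) -> bool:
    """θ₂ gate applied to every candidate baiting reply before sending."""
    return reply_harm <= theta2 and reply_pii <= theta2
```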
🛡️ Prototype of the Proposed System
The following example illustrates how our system handles a Family Emergency Scam conversation. It shows how a normal exchange escalates into a scam, and how the AI intervenes in real time.
📱 Conversation Prototype

Figure: Prototype of Family Emergency Scam Conversation
🔄 Step-by-Step Process
- Normal Start: A grandchild greets their grandparent.
- Financial Distress: The grandchild mentions money problems.
- Escalation: Repeated requests for money emerge.
- AI Detection: The system flags this as a potential scam.
- Scambaiting: The AI continues the chat, gathering info safely.
- User Control: User decides to continue or end the conversation.
System Architecture

Figure: Overview of the proposed real-time scam prevention system architecture. The pipeline includes four primary stages: (1) message monitoring and role identification, (2) scam detection using local LLMs, (3) AI-based scambaiting upon threshold breach, and (4) federated learning-based model aggregation on a global server to enhance detection while preserving privacy.
Our system transforms scam defense from passive detection into active, AI-powered engagement that is safe, adaptive, and privacy-preserving.
🚨 Problem Context
Scammers exploit personal info & psychology (urgency, fear, authority) across social media, calls, and messaging apps.
💡 Core Idea
An AI-in-the-loop system uses LLMs for conversational scambaiting to delay, disrupt, and study scams in real time.
⚙️ Design Features
- Instruction-tuned LLMs generate safe, victim-like responses.
- Utility function balances engagement vs. privacy risk (formalized below).
- Federated Learning ensures privacy via local training.
- Dynamic thresholds control detection & safe responses.
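One natural form for such a privacy-weighted utility (our assumption; α, β, γ are tunable weights not specified here) is

$$
U(r_t) \;=\; \alpha\, E(r_t) \;-\; \beta\, \mathrm{PII}(r_t) \;-\; \gamma\, H(r_t),
$$

where E(rₜ) is the engagement score, PII(rₜ) the personal-information risk, and H(rₜ) the harm score of candidate reply rₜ. Only candidates with PII(rₜ) ≤ δ and H(rₜ) ≤ δ are eligible, and the highest-utility eligible candidate is sent.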
📈 Pipeline Flow
- Monitor conversations & compute scam score.
- Flag risks & activate AI assistant with user consent.
- Generate & rank multiple candidate responses (see the selection sketch after this list).
- Send safest, most engaging reply to scammer.
- Continuously adapt context-aware engagement.
- Models updated securely via Federated Learning.
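A hedged sketch of the ranking step in this pipeline (the scorer callables, weights, and thresholds are placeholders, not our actual API):

```python
from typing import Callable, List, Optional

def select_reply(candidates: List[str],
                 engagement: Callable[[str], float],
                 pii_risk: Callable[[str], float],
                 harm: Callable[[str], float],
                 theta2: float = 0.6, delta: float = 0.8,
                 alpha: float = 1.0, beta: float = 1.5,
                 gamma: float = 1.0) -> Optional[str]:
    """Return the highest-utility candidate that passes the hard
    safety thresholds, or None to halt engagement."""
    best, best_u = None, float("-inf")
    for reply in candidates:
        e, p, h = engagement(reply), pii_risk(reply), harm(reply)
        if p > delta or h > theta2:           # hard gates: never send unsafe text
            continue
        u = alpha * e - beta * p - gamma * h  # privacy-weighted utility
        if u > best_u:
            best, best_u = reply, u
    return best
```

Returning `None` corresponds to the halt condition: the system stops baiting and hands control back to the user.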
📊 Dataset Statistics
We compiled a range of real and synthetic datasets to evaluate scam detection and response generation. These datasets cover both classification (scam vs. non-scam detection) and generation (safe scambaiting responses) tasks.
- SSD (Synthetic Scam Dialogue): Multi-turn phone conversations labeled as scam or non-scam, including refund, reward, SSN, and support scams.
- SSC (Synthetic Scammer Conversation): Conversations with scammers, baiters, and benign interactions.
- SASC (Single Agent Scam Conversation): Simulated scam/non-scam calls with diverse receiver personalities.
- MASC (Multi-Agent Scam Conversation): AI-generated dialogues between scammers and receivers with varied personalities.
- YTSC (YouTube Scam Conversations): Real transcriptions from scam-related YouTube channels (tech support, refund, SSN, reward).
- SBC (Scam-Baiting Conversation): 254 validated scam-baiting conversations, each including at least one scammer reply.
- ASB (ACEF Scam-Bait): Extends the Advance Fee Scam-Baiting corpus, comprising 658 conversations and over 37,000 messages between scammers and baiters.
To strengthen training, we also generated a large synthetic multitask dataset (ChatGPT-4o) spanning scam types (appointment, delivery, refund, support, telemarketing, etc.) and a general non-scam class. Each conversation was expanded into three tasks: PII Evaluation, Scam Risk Scoring, and Scam-Baiting Response Generation.
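For concreteness, one synthetic conversation expanded into the three tasks might look like the following (a hypothetical sample; the field names and values are ours):

```python
sample = {
    "scam_type": "delivery",
    "conversation": [
        {"role": "scammer",  "text": "Your package is held at customs. "
                                     "Pay the $2 release fee at the link below."},
        {"role": "receiver", "text": "Oh no, which package is this about?"},
    ],
    "tasks": {
        "pii_evaluation":    {"score": 0.0, "note": "receiver disclosed no PII"},
        "scam_risk_scoring": {"score": 0.87},
        "scambait_response": {"target": "I ordered a few things recently. "
                                        "Could you read me the tracking number?"},
    },
}
```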
Type | SSC | SASC | MASC | SSD | YTSC | ASB | SBC |
---|---|---|---|---|---|---|---|
appointment | - | 200 | 200 | 0 | - | - | - |
delivery | - | 200 | 200 | 200 | - | - | - |
insurance | - | 200 | 200 | 200 | - | - | - |
wrong | - | 200 | 200 | 200 | - | - | - |
refund | - | 200 | 200 | 200 | 4 | - | - |
reward | - | 200 | 200 | 200 | 7 | - | - |
ssn | - | 200 | 200 | 200 | 4 | - | - |
support | - | 200 | 200 | 200 | 5 | - | - |
telemarketing | - | 0 | 0 | 200 | - | - | - |
Max conv. length | 13 | 28 | 30 | 28 | 67 | 871 | 73 |
Min conv. length | 6 | 4 | 3 | 6 | 13 | 2 | 3 |
Avg conv. length | 10 | 14 | 12 | 13 | 28 | 56 | 10 |
Type | #Conversations | #Samples |
---|---|---|
appointment | 500 | 1500 |
delivery | 500 | 1500 |
insurance | 500 | 1500 |
wrong | 500 | 1500 |
refund | 500 | 1500 |
reward | 500 | 1500 |
ssn | 500 | 1500 |
support | 500 | 1500 |
telemarketing | 500 | 1500 |
gift_card | 500 | 1500 |
account_suspension | 500 | 1500 |
identity_verification | 500 | 1500 |
general | 2000 | 2000 |
Total | 8000 | 20000 |
⚖️ Risk & Engagement Scores
Type | Engagement | PII Risk | Scam Risk |
---|---|---|---|
account_suspension | 0.64 | 0.80 | 0.87 |
appointment | 0.66 | 0.80 | 0.88 |
delivery | 0.65 | 0.81 | 0.87 |
gift_card | 0.66 | 0.80 | 0.88 |
identity_verification | 0.65 | 0.80 | 0.87 |
insurance | 0.66 | 0.80 | 0.87 |
refund | 0.66 | 0.80 | 0.87 |
reward | 0.66 | 0.79 | 0.88 |
support | 0.65 | 0.80 | 0.87 |
telemarketing | 0.65 | 0.79 | 0.88 |
wrong | 0.65 | 0.79 | 0.88 |
general | 0.65 | 0.00 | 0.03 |
Results
We evaluated our models across multiple tasks — scam detection, PII risk scoring, response generation, human evaluation, multi-turn engagement, federated learning, and safeness assessment.
1. Baseline Model Performance
We compared BERT, RoBERTa, DistilBERT, BiLSTM, and BiGRU across MASC, SASC, SSC, and SSD. The recurrent baselines (BiLSTM and BiGRU) achieved near-perfect F1, outperforming the transformer models.
Model | Dataset | F1 | FPR | FNR | AUPRC |
---|---|---|---|---|---|
BERT-Base | MASC | 0.9812 | 0.218 | 0.124 | 0.9784 |
BERT-Base | SASC | 0.9756 | 0.231 | 0.240 | 0.9713 |
BERT-Base | SSC | 0.9874 | 0.207 | 0.116 | 0.9849 |
BERT-Base | SSD | 0.9625 | 0.275 | 0.282 | 0.9612 |
BERT-Large | MASC | 0.9883 | 0.208 | 0.113 | 0.9861 |
BERT-Large | SASC | 0.9731 | 0.236 | 0.244 | 0.9695 |
BERT-Large | SSC | 0.9674 | 0.251 | 0.260 | 0.9652 |
BERT-Large | SSD | 0.9925 | 0.111 | 0.104 | 0.9873 |
RoBERTa | MASC | 0.9932 | 0.101 | 0.209 | 0.9908 |
RoBERTa | SASC | 0.9916 | 0.112 | 0.211 | 0.9897 |
RoBERTa | SSC | 0.9901 | 0.107 | 0.213 | 0.9881 |
RoBERTa | SSD | 1.0000 | 0.000 | 0.000 | 1.0000 |
DistilBERT | MASC | 0.9697 | 0.262 | 0.270 | 0.9678 |
DistilBERT | SASC | 0.9724 | 0.240 | 0.248 | 0.9701 |
DistilBERT | SSC | 0.9682 | 0.259 | 0.267 | 0.9657 |
DistilBERT | SSD | 0.9651 | 0.271 | 0.280 | 0.9636 |
BiLSTM | MASC | 0.9988 | 0.0017 | 0.0008 | 0.9979 |
BiLSTM | SASC | 0.9889 | 0.0175 | 0.0050 | 0.9806 |
BiLSTM | SSC | 0.9994 | 0.000 | 0.0013 | 0.9994 |
BiLSTM | SSD | 0.9945 | 0.0033 | 0.0075 | 0.9930 |
BiGRU | MASC | 0.9992 | 0.0008 | 0.0008 | 0.9988 |
BiGRU | SASC | 0.9979 | 0.0033 | 0.0008 | 0.9963 |
BiGRU | SSC | 0.9994 | 0.000 | 0.0013 | 0.9994 |
BiGRU | SSD | 0.9996 | 0.0008 | 0.000 | 0.9992 |
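For reference, a minimal PyTorch version of the kind of BiLSTM classifier assumed above (hyperparameters are illustrative, not the exact training configuration):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token IDs -> scam / non-scam logit."""
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                # (batch, seq, embed)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # final fwd + bwd states
        return self.head(h).squeeze(-1)          # one scam logit per input
```

Swapping `nn.LSTM` for `nn.GRU` (which returns `h_n` without a cell state) gives the BiGRU variant.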
2. Instruction-Tuned LLMs for Scam Detection
Model | Dataset | F1 | AUPRC | FPR | FNR |
---|---|---|---|---|---|
LlamaGuard | MASC | 0.5829 | 0.5895 | 0.7299 | 0.2383 |
LlamaGuard | SASC | 0.6621 | 0.7531 | 0.9532 | 0.0000 |
LlamaGuard | SSC | 0.6761 | 0.6754 | 0.6996 | 0.1525 |
LlamaGuard | SSD | 0.6610 | 0.7409 | 0.9334 | 0.0000 |
LlamaGuard2 | MASC | 0.7275 | 0.7580 | 0.7269 | 0.0000 |
LlamaGuard2 | SASC | 0.6833 | 0.7139 | 0.8426 | 0.0015 |
LlamaGuard2 | SSC | 1.0000 | 1.0000 | 0.0000 | 0.0000 |
LlamaGuard2 | SSD | 0.7253 | 0.7716 | 0.6965 | 0.0015 |
LlamaGuard3 | MASC | 0.8200 | 0.7095 | 0.3368 | 0.0567 |
LlamaGuard3 | SASC | 0.7074 | 0.6877 | 0.6637 | 0.0559 |
LlamaGuard3 | SSC | 0.9934 | 0.9962 | 0.0126 | 0.0019 |
LlamaGuard3 | SSD | 0.7295 | 0.7189 | 0.5854 | 0.0644 |
MD-Judge | MASC | 0.8306 | 0.8992 | 0.2038 | 0.1450 |
MD-Judge | SASC | 0.8496 | 0.8808 | 0.3150 | 0.0288 |
MD-Judge | SSC | 0.9735 | 1.0000 | 0.0000 | 0.0515 |
MD-Judge | SSD | 0.8985 | 0.9320 | 0.1978 | 0.0453 |
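As an illustration of how an instruction-tuned judge can be prompted for this task, here is a hypothetical zero-shot template (ours; the actual prompts used by the LlamaGuard variants and MD-Judge differ):

```python
JUDGE_PROMPT = """You are a scam-detection judge. Read the conversation below
and answer with a single word: SCAM or SAFE.

Conversation:
{conversation}

Answer:"""

def build_judge_prompt(turns):
    """Format (speaker, text) pairs into the judging prompt."""
    convo = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return JUDGE_PROMPT.format(conversation=convo)
```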
3. Scam-Baiting Response Generation
Dataset | Perplexity ↓ | Dist-1 ↑ | Dist-2 ↑ | DialogRPT ↑ |
---|---|---|---|---|
MASC | 26.51 | 0.18 | 0.53 | 0.35 |
SASC | 28.37 | 0.21 | 0.56 | 0.28 |
SSC | 22.30 | 0.69 | 0.54 | 0.80 |
SSD | 27.84 | 0.15 | 0.47 | 0.36 |
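Dist-1 and Dist-2 are the standard distinct-n-gram diversity ratios; a small reference implementation for clarity (ours, not the evaluation harness itself):

```python
def distinct_n(texts, n=1):
    """Dist-n: unique n-grams divided by total n-grams across all texts."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# distinct_n(replies, 1) -> Dist-1; distinct_n(replies, 2) -> Dist-2
```

Higher values indicate less repetitive generations, which matters for keeping scammers engaged over long exchanges.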
4. Human Evaluation
Metric | MD-Judge | LlamaGuard3 | p-value |
---|---|---|---|
Realism (1–5) | 4.31 ± 0.52 | 3.92 ± 0.61 | <0.01 |
Engagement (1–5) | 4.05 ± 0.60 | 3.31 ± 0.65 | <0.01 |
Safety (%) | 96.0 | 92.0 | <0.05 |
Effectiveness (1–5) | 4.12 ± 0.55 | 3.43 ± 0.57 | <0.01 |
5. Multi-Turn Engagement
Model | Count | Mean Time (s) | μE | μPII | μS | μL |
---|---|---|---|---|---|---|
LlamaGuard | 7 ± 2 | 6.50 ± 5.59 | 0.30 ± 0.30 | 0.17 ± 0.24 | 0.39 ± 9.19 | 275 ± 106 |
LlamaGuard2 | 9 ± 0 | 5.68 ± 1.65 | 0.78 ± 0.05 | 0.81 ± 0.11 | 0.11 ± 6.11 | 163 ± 97 |
LlamaGuard3 | 8 ± 2 | 7.47 ± 3.83 | 0.74 ± 0.04 | 0.38 ± 0.42 | 0.92 ± 0.06 | 245 ± 145 |
MD-Judge | 9 ± 1 | 8.42 ± 2.01 | 0.79 ± 0.04 | 0.57 ± 0.30 | 0.53 ± 4.04 | 228 ± 17 |
6. Federated Learning (with/without DP)
Round | Method | Novelty ↑ | Rel. (Sc) ↑ | Scam Risk ↓ | Engage. ↑ | PII Risk ↓ |
---|---|---|---|---|---|---|
5 | - | 0.5804 | 0.7399 | 0.5417 | 0.7966 | 0.0050 |
5 | 0.1-DP | 0.5991 | 0.7474 | 0.4998 | 0.7984 | 0.0074 |
5 | 0.8-DP | 0.5049 | 0.7425 | 0.5407 | 0.7014 | 0.0064 |
10 | - | 0.5906 | 0.7377 | 0.5415 | 0.7928 | 0.0050 |
10 | 0.1-DP | 0.6062 | 0.7451 | 0.4998 | 0.7983 | 0.0074 |
10 | 0.8-DP | 0.5849 | 0.7448 | 0.5392 | 0.7927 | 0.0037 |
15 | - | 0.5986 | 0.7409 | 0.5413 | 0.7969 | 0.0050 |
15 | 0.1-DP | 0.5963 | 0.7455 | 0.4998 | 0.8009 | 0.0074 |
15 | 0.8-DP | 0.5978 | 0.7450 | 0.5344 | 0.8003 | 0.0085 |
20 | - | 0.5961 | 0.7425 | 0.5415 | 0.7960 | 0.0050 |
20 | 0.1-DP | 0.6024 | 0.7476 | 0.4998 | 0.7987 | 0.0074 |
20 | 0.8-DP | 0.5982 | 0.7426 | 0.5342 | 0.7954 | 0.0085 |
25 | - | 0.6006 | 0.7427 | 0.5415 | 0.7974 | 0.0051 |
25 | 0.1-DP | 0.6048 | 0.7470 | 0.4998 | 0.7982 | 0.0074 |
25 | 0.8-DP | 0.6055 | 0.7396 | 0.5342 | 0.7969 | 0.0085 |
30 | - | 0.5986 | 0.7459 | 0.5413 | 0.8054 | 0.0052 |
30 | 0.1-DP | 0.5956 | 0.7491 | 0.4997 | 0.8003 | 0.0074 |
30 | 0.8-DP | 0.6071 | 0.7460 | 0.5421 | 0.7972 | 0.0085 |
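The server-side aggregation follows standard FedAvg; a minimal sketch under our assumptions (clients send full state dicts; DP noise, when enabled, is added client-side during local training):

```python
import torch

def fedavg(client_states, client_sizes):
    """Size-weighted average of client model state_dicts (FedAvg)."""
    total = float(sum(client_sizes))
    return {
        key: sum(state[key].float() * (size / total)
                 for state, size in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# global_state = fedavg([c.state_dict() for c in clients], sizes)
# global_model.load_state_dict(global_state)
```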
7. Safeness & Risk Awareness
Model | Moderation | Engagement Score | PII Risk Score | Scam Detection Score |
---|---|---|---|---|
LlamaGuard | safe | 0.512450 | 0.120861 | 0.746914 |
LlamaGuard | unsafe_o1 | 0.756410 | 0.038462 | 0.935128 |
LlamaGuard | unsafe_o2 | 0.833333 | 0.000000 | 0.933333 |
LlamaGuard | unsafe_o5 | 0.900000 | 0.000000 | 1.000000 |
LlamaGuard | unsafe_o6 | 0.900000 | 0.000000 | 1.000000 |
LlamaGuard2 | safe | 0.723525 | 0.290834 | 0.700517 |
LlamaGuard2 | unsafe_o1 | 1.100000 | 0.800000 | 0.960000 |
LlamaGuard2 | unsafe_o3 | 0.741386 | 0.802070 | 0.960281 |
LlamaGuard2 | unsafe_o5 | 0.749597 | 0.799329 | 0.966711 |
LlamaGuard2 | unsafe_o9 | 0.703636 | 0.490909 | 0.904545 |
LlamaGuard2 | unsafe_s1 | 0.739167 | 0.758333 | 0.937500 |
LlamaGuard2 | unsafe_s3 | 0.760000 | 0.800000 | 0.953333 |
LlamaGuard3 | safe | 0.914468 | 0.240787 | 0.461707 |
LlamaGuard3 | unsafe_o1 | 0.963778 | 0.477778 | 0.906000 |
LlamaGuard3 | unsafe_s1 | 0.970483 | 0.750345 | 0.942000 |
LlamaGuard3 | unsafe_s2 | 0.974372 | 0.685681 | 0.957775 |
LlamaGuard3 | unsafe_s9 | 0.860000 | 0.000000 | 1.000000 |
MD-Judge | safe | 0.775891 | 0.299076 | 0.404701 |
MD-Judge | unsafe_o1 | 0.769355 | 0.575645 | 0.748952 |
MD-Judge | unsafe_o3 | 0.800000 | 0.800000 | 0.847500 |
MD-Judge | unsafe_o4 | 0.752763 | 0.721171 | 0.823093 |
MD-Judge | unsafe_o5 | 0.739103 | 0.333333 | 0.845128 |
Key Contributions
- We introduce the problem of privacy-preserving, AI-driven conversational scambaiting using instruction-tuned LLMs.
- We design a novel utility function that balances scammer engagement against personally identifiable information (PII) and behavioral risk.
- We implement a real-time response filtering mechanism that enforces safety via harm scoring and strict thresholds.
- We propose a federated learning architecture that enables decentralized model training without requiring raw data collection.
Discussion & Future Directions
Our system provides real-time scam detection and safe scambaiting on messaging platforms while preserving user privacy. Because even scam-aware users remain vulnerable to real-time social engineering, we combine synthetic and real-world dialogues with non-scam data to keep the training distribution balanced.
Key innovations include the joint optimization of detection, risk scoring, and response generation through a privacy-weighted utility function, outperforming prior classifiers and LLM-based systems.
We use federated learning with differential privacy to adapt against evolving scams without leaking user data. Local fine-tuning improves detection over time. Future enhancements will address adversarial prompt tuning, voice-scam defenses, and cross-cultural studies.
Finally, role identification ensures AI engages only with the higher-risk participant. Users retain full control with opt-out options, safety thresholds, and harm filtering—ensuring safe, responsive, and trustworthy interventions.