Abstract
Scams exploiting real-time social engineering—such as phishing, impersonation, and phone fraud—are a persistent and evolving threat across digital platforms. Existing detection systems are typically reactive, offering limited protection during active scammer interactions. We propose a novel, privacy-preserving, AI-in-the-loop framework that proactively detects and disrupts scam conversations in real time. Our system integrates instruction-tuned large language models (LLMs) with a privacy-aware utility function that selects responses balancing conversational engagement and harm minimization. To prevent disclosure of personal or sensitive information, we apply rigorous PII scoring and enforce hard safety thresholds. Crucially, we couple this mechanism with federated learning, enabling on-device model updates without raw data aggregation—preserving user privacy while supporting continual learning from diverse, real-world interactions. Experimental results show that our approach improves scammer deception, sustains engagement, and adapts dynamically across scenarios. This is the first system to combine real-time conversational scam-baiting with privacy-preserving federated learning, offering a new paradigm for proactive, adaptive, and trustworthy scam defense.
⚔️ Attack Model & Defense Framework

Figure: Scammer tactics and AI-in-the-Loop defense workflow.
Scammers exploit digital traces such as posts, contacts, and behaviors to craft personalized attacks. Our system detects and mitigates these threats in real time.
📌 Scam Tactics
- Harvest social media posts, contacts, and behaviors
- Post fake ads or offers (e.g., free services)
- Engage victims via comments & direct messages
- Escalate into phishing, fraud, or data theft
🛡 Real-Time AI Monitoring
An AI module monitors conversations and assigns a risk score Rₜ ∈ [0, 1] to each turn. If the risk exceeds a threshold τ, the system flags the activity as suspicious.
🤖 AI Scambaiting Agent
Instead of cutting off the dialogue, an AI agent impersonates the victim, generating safe, strategic replies that waste the scammer’s time while avoiding personal data exposure.
🔒 Privacy Protection
The system protects users with Federated Learning and Differential Privacy (DP-SGD), defending against gradient-leakage attacks such as DLG and iDLG.
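To make the mechanism concrete, here is a minimal DP-SGD step in PyTorch. This is a sketch under our own assumptions (per-example gradients via a microbatch-of-one loop; `clip_norm` and `noise_multiplier` are illustrative), not the system's exact implementation; a production setup would typically use a library such as Opacus.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update: clip each per-example gradient to `clip_norm`,
    sum, add Gaussian noise, then step. Illustrative sketch only."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    xs, ys = batch
    for x, y in zip(xs, ys):               # microbatch of one -> per-example grads
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)
        for s, p in zip(summed, params):   # accumulate the clipped gradient
            s.add_(p.grad * scale)
    for s, p in zip(summed, params):       # noise calibrated to the clip bound
        noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
        p.grad = (s + noise) / len(xs)
    optimizer.step()
```

Clipping bounds any single user's influence on an update, and the added noise is what blunts reconstruction attacks like DLG/iDLG against shared gradients.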
⚖️ Multi-Threshold Risk Control
- θ₁: Detect scam & alert user
- θ₂: Ensure scambaiting remains safe
- δ: Halt engagement if privacy risk or harm score exceeded
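A minimal sketch of this control logic follows; the threshold values and function names are our own illustrations, not the deployed configuration.

```python
def control_step(scam_risk: float, pii_risk: float, harm: float,
                 theta1: float = 0.7, theta2: float = 0.6,
                 delta: float = 0.8) -> str:
    """One monitoring decision per conversation turn."""
    if pii_risk >= delta or harm >= delta:
        return "halt"             # δ: privacy/harm budget exceeded, stop engaging
    if scam_risk >= theta1:       # θ₁: conversation flagged as a likely scam
        return "alert_and_bait"   # alert the user; baiting replies must pass θ₂
    return "monitor"

def reply_is_safe(reply_harm: float, reply_pii: float,
                  theta2: float = 0.6) -> bool:
    """θ₂ gate applied to every candidate baiting reply before sending."""
    return reply_harm <= theta2 and reply_pii <= theta2
```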
🛡️ Prototype of the Proposed System
The following example illustrates how our system handles a Family Emergency Scam conversation. It shows how a normal exchange escalates into a scam, and how the AI intervenes in real time.
📱 Conversation Prototype

Figure: Prototype of Family Emergency Scam Conversation
🔄 Step-by-Step Process
- Normal Start: A grandchild greets their grandparent.
- Financial Distress: The grandchild mentions money problems.
- Escalation: Repeated requests for money emerge.
- AI Detection: The system flags this as a potential scam.
- Scambaiting: The AI continues the chat, gathering info safely.
- User Control: User decides to continue or end the conversation.
System Architecture

Figure: Overview of the proposed real-time scam prevention system architecture. The pipeline includes four primary stages: (1) message monitoring and role identification, (2) scam detection using local LLMs, (3) AI-based scambaiting upon threshold breach, and (4) federated learning-based model aggregation on a global server to enhance detection while preserving privacy.
Our system transforms scam defense from passive detection into active, AI-powered engagement that is safe, adaptive, and privacy-preserving.
🚨 Problem Context
Scammers exploit personal info & psychology (urgency, fear, authority) across social media, calls, and messaging apps.
💡 Core Idea
An AI-in-the-loop system uses LLMs for conversational scambaiting to delay, disrupt, and study scams in real time.
⚙️ Design Features
- Instruction-tuned LLMs generate safe, victim-like responses.
- Utility function balances engagement vs. privacy risk (formalized below).
- Federated Learning ensures privacy via local training.
- Dynamic thresholds control detection & safe responses.
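One natural form for such a privacy-weighted utility (our assumption; α, β, γ are tunable weights not specified here) is

$$
U(r_t) \;=\; \alpha\, E(r_t) \;-\; \beta\, \mathrm{PII}(r_t) \;-\; \gamma\, H(r_t),
$$

where E(rₜ) is the engagement score, PII(rₜ) the personal-information risk, and H(rₜ) the harm score of candidate reply rₜ. Only candidates with PII(rₜ) ≤ δ and H(rₜ) ≤ δ are eligible, and the highest-utility eligible candidate is sent.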
📈 Pipeline Flow
- Monitor conversations & compute scam score.
- Flag risks & activate AI assistant with user consent.
- Generate & rank multiple candidate responses (see the selection sketch after this list).
- Send safest, most engaging reply to scammer.
- Continuously adapt context-aware engagement.
- Models updated securely via Federated Learning.
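A hedged sketch of the ranking step in this pipeline (the scorer callables, weights, and thresholds are placeholders, not our actual API):

```python
from typing import Callable, List, Optional

def select_reply(candidates: List[str],
                 engagement: Callable[[str], float],
                 pii_risk: Callable[[str], float],
                 harm: Callable[[str], float],
                 theta2: float = 0.6, delta: float = 0.8,
                 alpha: float = 1.0, beta: float = 1.5,
                 gamma: float = 1.0) -> Optional[str]:
    """Return the highest-utility candidate that passes the hard
    safety thresholds, or None to halt engagement."""
    best, best_u = None, float("-inf")
    for reply in candidates:
        e, p, h = engagement(reply), pii_risk(reply), harm(reply)
        if p > delta or h > theta2:           # hard gates: never send unsafe text
            continue
        u = alpha * e - beta * p - gamma * h  # privacy-weighted utility
        if u > best_u:
            best, best_u = reply, u
    return best
```

Returning `None` corresponds to the halt condition: the system stops baiting and hands control back to the user.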
📊 Dataset Statistics
We compiled a range of real and synthetic datasets to evaluate scam detection and response generation. These datasets cover both classification (scam vs. non-scam detection) and generation (safe scambaiting responses) tasks.
- SSD (Synthetic Scam Dialogue): Multi-turn phone conversations labeled as scam or non-scam, including refund, reward, SSN, and support scams.
- SSC (Synthetic Scammer Conversation): Conversations with scammers, baiters, and benign interactions.
- SASC (Single Agent Scam Conversation): Simulated scam/non-scam calls with diverse receiver personalities.
- MASC (Multi-Agent Scam Conversation): AI-generated dialogues between scammers and receivers with varied personalities.
- YTSC (YouTube Scam Conversations): Real transcriptions from scam-related YouTube channels (tech support, refund, SSN, reward).
- SBC (Scam-Baiting Conversation): 254 validated scam-baiting conversations, each including at least one scammer reply.
- ASB (ACEF Scam-Bait): Extends the Advance Fee Scam-Baiting corpus, comprising 658 conversations and over 37,000 messages between scammers and baiters.
To strengthen training, we also generated a large synthetic multitask dataset (ChatGPT-4o) spanning scam types (appointment, delivery, refund, support, telemarketing, etc.) and a general non-scam class. Each conversation was expanded into three tasks: PII Evaluation, Scam Risk Scoring, and Scam-Baiting Response Generation.
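For concreteness, one synthetic conversation expanded into the three tasks might look like the following (a hypothetical sample; the field names and values are ours):

```python
sample = {
    "scam_type": "delivery",
    "conversation": [
        {"role": "scammer",  "text": "Your package is held at customs. "
                                     "Pay the $2 release fee at the link below."},
        {"role": "receiver", "text": "Oh no, which package is this about?"},
    ],
    "tasks": {
        "pii_evaluation":    {"score": 0.0, "note": "receiver disclosed no PII"},
        "scam_risk_scoring": {"score": 0.87},
        "scambait_response": {"target": "I ordered a few things recently. "
                                        "Could you read me the tracking number?"},
    },
}
```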
Type | SSC | SASC | MASC | SSD | YTSC | ASB | SBC |
---|---|---|---|---|---|---|---|
appointment | - | 200 | 200 | 0 | - | - | - |
delivery | - | 200 | 200 | 200 | - | - | - |
insurance | - | 200 | 200 | 200 | - | - | - |
wrong | - | 200 | 200 | 200 | - | - | - |
refund | - | 200 | 200 | 200 | 4 | - | - |
reward | - | 200 | 200 | 200 | 7 | - | - |
ssn | - | 200 | 200 | 200 | 4 | - | - |
support | - | 200 | 200 | 200 | 5 | - | - |
telemarketing | - | 0 | 0 | 200 | - | - | - |
Max conv. length | 13 | 28 | 30 | 28 | 67 | 871 | 73 |
Min conv. length | 6 | 4 | 3 | 6 | 13 | 2 | 3 |
Avg conv. length | 10 | 14 | 12 | 13 | 28 | 56 | 10 |
Type | #Conversations | #Samples |
---|---|---|
appointment | 500 | 1500 |
delivery | 500 | 1500 |
insurance | 500 | 1500 |
wrong | 500 | 1500 |
refund | 500 | 1500 |
reward | 500 | 1500 |
ssn | 500 | 1500 |
support | 500 | 1500 |
telemarketing | 500 | 1500 |
gift_card | 500 | 1500 |
account_suspension | 500 | 1500 |
identity_verification | 500 | 1500 |
general | 2000 | 2000 |
Total | 8000 | 20000 |
⚖️ Risk & Engagement Scores
Type | Engagement | PII Risk | Scam Risk |
---|---|---|---|
account_suspension | 0.64 | 0.80 | 0.87 |
appointment | 0.66 | 0.80 | 0.88 |
delivery | 0.65 | 0.81 | 0.87 |
gift_card | 0.66 | 0.80 | 0.88 |
identity_verification | 0.65 | 0.80 | 0.87 |
insurance | 0.66 | 0.80 | 0.87 |
refund | 0.66 | 0.80 | 0.87 |
reward | 0.66 | 0.79 | 0.88 |
support | 0.65 | 0.80 | 0.87 |
telemarketing | 0.65 | 0.79 | 0.88 |
wrong | 0.65 | 0.79 | 0.88 |
general | 0.65 | 0.00 | 0.03 |
Results
We evaluated our models across multiple tasks — scam detection, PII risk scoring, response generation, human evaluation, multi-turn engagement, federated learning, and safeness assessment.
1. Baseline Model Performance
We compared BERT, RoBERTa, DistilBERT, BiLSTM, and BiGRU across MASC, SASC, SSC, and SSD. The recurrent baselines (BiLSTM and BiGRU) achieved near-perfect F1, outperforming the transformer models.
Model | Dataset | F1 | FPR | FNR | AUPRC |
---|---|---|---|---|---|
BERT-Base | MASC | 0.9812 | 0.218 | 0.124 | 0.9784 |
BERT-Base | SASC | 0.9756 | 0.231 | 0.240 | 0.9713 |
BERT-Base | SSC | 0.9874 | 0.207 | 0.116 | 0.9849 |
BERT-Base | SSD | 0.9625 | 0.275 | 0.282 | 0.9612 |
BERT-Large | MASC | 0.9883 | 0.208 | 0.113 | 0.9861 |
BERT-Large | SASC | 0.9731 | 0.236 | 0.244 | 0.9695 |
BERT-Large | SSC | 0.9674 | 0.251 | 0.260 | 0.9652 |
BERT-Large | SSD | 0.9925 | 0.111 | 0.104 | 0.9873 |
RoBERTa | MASC | 0.9932 | 0.101 | 0.209 | 0.9908 |
RoBERTa | SASC | 0.9916 | 0.112 | 0.211 | 0.9897 |
RoBERTa | SSC | 0.9901 | 0.107 | 0.213 | 0.9881 |
RoBERTa | SSD | 1.0000 | 0.000 | 0.000 | 1.0000 |
DistilBERT | MASC | 0.9697 | 0.262 | 0.270 | 0.9678 |
DistilBERT | SASC | 0.9724 | 0.240 | 0.248 | 0.9701 |
DistilBERT | SSC | 0.9682 | 0.259 | 0.267 | 0.9657 |
DistilBERT | SSD | 0.9651 | 0.271 | 0.280 | 0.9636 |
BiLSTM | MASC | 0.9988 | 0.0017 | 0.0008 | 0.9979 |
BiLSTM | SASC | 0.9889 | 0.0175 | 0.0050 | 0.9806 |
BiLSTM | SSC | 0.9994 | 0.000 | 0.0013 | 0.9994 |
BiLSTM | SSD | 0.9945 | 0.0033 | 0.0075 | 0.9930 |
BiGRU | MASC | 0.9992 | 0.0008 | 0.0008 | 0.9988 |
BiGRU | SASC | 0.9979 | 0.0033 | 0.0008 | 0.9963 |
BiGRU | SSC | 0.9994 | 0.000 | 0.0013 | 0.9994 |
BiGRU | SSD | 0.9996 | 0.0008 | 0.000 | 0.9992 |
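For reference, a minimal PyTorch version of the kind of BiLSTM classifier assumed above (hyperparameters are illustrative, not the exact training configuration):

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over token IDs -> scam / non-scam logit."""
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                # (batch, seq, embed)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # final fwd + bwd states
        return self.head(h).squeeze(-1)          # one scam logit per input
```

Swapping `nn.LSTM` for `nn.GRU` (which returns `h_n` without a cell state) gives the BiGRU variant.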
2. Instruction-Tuned LLMs for Scam Detection
Model | Dataset | F1 | AUPRC | FPR | FNR |
---|---|---|---|---|---|
LlamaGuard | MASC | 0.5829 | 0.5895 | 0.7299 | 0.2383 |
LlamaGuard | SASC | 0.6621 | 0.7531 | 0.9532 | 0.0000 |
LlamaGuard | SSC | 0.6761 | 0.6754 | 0.6996 | 0.1525 |
LlamaGuard | SSD | 0.6610 | 0.7409 | 0.9334 | 0.0000 |
LlamaGuard2 | MASC | 0.7275 | 0.7580 | 0.7269 | 0.0000 |
LlamaGuard2 | SASC | 0.6833 | 0.7139 | 0.8426 | 0.0015 |
LlamaGuard2 | SSC | 1.0000 | 1.0000 | 0.0000 | 0.0000 |
LlamaGuard2 | SSD | 0.7253 | 0.7716 | 0.6965 | 0.0015 |
LlamaGuard3 | MASC | 0.8200 | 0.7095 | 0.3368 | 0.0567 |
LlamaGuard3 | SASC | 0.7074 | 0.6877 | 0.6637 | 0.0559 |
LlamaGuard3 | SSC | 0.9934 | 0.9962 | 0.0126 | 0.0019 |
LlamaGuard3 | SSD | 0.7295 | 0.7189 | 0.5854 | 0.0644 |
MD-Judge | MASC | 0.8306 | 0.8992 | 0.2038 | 0.1450 |
MD-Judge | SASC | 0.8496 | 0.8808 | 0.3150 | 0.0288 |
MD-Judge | SSC | 0.9735 | 1.0000 | 0.0000 | 0.0515 |
MD-Judge | SSD | 0.8985 | 0.9320 | 0.1978 | 0.0453 |
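As an illustration of how an instruction-tuned judge can be prompted for this task, here is a hypothetical zero-shot template (ours; the actual prompts used by the LlamaGuard variants and MD-Judge differ):

```python
JUDGE_PROMPT = """You are a scam-detection judge. Read the conversation below
and answer with a single word: SCAM or SAFE.

Conversation:
{conversation}

Answer:"""

def build_judge_prompt(turns):
    """Format (speaker, text) pairs into the judging prompt."""
    convo = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return JUDGE_PROMPT.format(conversation=convo)
```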
3. Scam-Baiting Response Generation
Dataset | Perplexity ↓ | Dist-1 ↑ | Dist-2 ↑ | DialogRPT ↑ |
---|---|---|---|---|
MASC | 26.51 | 0.18 | 0.53 | 0.35 |
SASC | 28.37 | 0.21 | 0.56 | 0.28 |
SSC | 22.30 | 0.69 | 0.54 | 0.80 |
SSD | 27.84 | 0.15 | 0.47 | 0.36 |
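Dist-1 and Dist-2 are the standard distinct-n-gram diversity ratios; a small reference implementation for clarity (ours, not the evaluation harness itself):

```python
def distinct_n(texts, n=1):
    """Dist-n: unique n-grams divided by total n-grams across all texts."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# distinct_n(replies, 1) -> Dist-1; distinct_n(replies, 2) -> Dist-2
```

Higher values indicate less repetitive generations, which matters for keeping scammers engaged over long exchanges.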
4. Human Evaluation
Metric | MD-Judge | LlamaGuard3 | p-value |
---|---|---|---|
Realism (1–5) | 4.31 ± 0.52 | 3.92 ± 0.61 | <0.01 |
Engagement (1–5) | 4.05 ± 0.60 | 3.31 ± 0.65 | <0.01 |
Safety (%) | 96.0 | 92.0 | <0.05 |
Effectiveness (1–5) | 4.12 ± 0.55 | 3.43 ± 0.57 | <0.01 |
5. Multi-Turn Engagement
Model | Count | Mean Time (s) | μE | μPII | μS | μL |
---|---|---|---|---|---|---|
LlamaGuard | 7 ± 2 | 6.50 ± 5.59 | 0.30 ± 0.30 | 0.17 ± 0.24 | 0.39 ± 9.19 | 275 ± 106 |
LlamaGuard2 | 9 ± 0 | 5.68 ± 1.65 | 0.78 ± 0.05 | 0.81 ± 0.11 | 0.11 ± 6.11 | 163 ± 97 |
LlamaGuard3 | 8 ± 2 | 7.47 ± 3.83 | 0.74 ± 0.04 | 0.38 ± 0.42 | 0.92 ± 0.06 | 245 ± 145 |
MD-Judge | 9 ± 1 | 8.42 ± 2.01 | 0.79 ± 0.04 | 0.57 ± 0.30 | 0.53 ± 4.04 | 228 ± 17 |
6. Federated Learning (with/without DP)
Round | Method | Novelty ↑ | Rel. (Sc) ↑ | Scam Risk ↓ | Engage. ↑ | PII Risk ↓ |
---|---|---|---|---|---|---|
5 | - | 0.5804 | 0.7399 | 0.5417 | 0.7966 | 0.0050 |
5 | 0.1-DP | 0.5991 | 0.7474 | 0.4998 | 0.7984 | 0.0074 |
5 | 0.8-DP | 0.5049 | 0.7425 | 0.5407 | 0.7014 | 0.0064 |
10 | - | 0.5906 | 0.7377 | 0.5415 | 0.7928 | 0.0050 |
10 | 0.1-DP | 0.6062 | 0.7451 | 0.4998 | 0.7983 | 0.0074 |
10 | 0.8-DP | 0.5849 | 0.7448 | 0.5392 | 0.7927 | 0.0037 |
15 | - | 0.5986 | 0.7409 | 0.5413 | 0.7969 | 0.0050 |
15 | 0.1-DP | 0.5963 | 0.7455 | 0.4998 | 0.8009 | 0.0074 |
15 | 0.8-DP | 0.5978 | 0.7450 | 0.5344 | 0.8003 | 0.0085 |
20 | - | 0.5961 | 0.7425 | 0.5415 | 0.7960 | 0.0050 |
20 | 0.1-DP | 0.6024 | 0.7476 | 0.4998 | 0.7987 | 0.0074 |
20 | 0.8-DP | 0.5982 | 0.7426 | 0.5342 | 0.7954 | 0.0085 |
25 | - | 0.6006 | 0.7427 | 0.5415 | 0.7974 | 0.0051 |
25 | 0.1-DP | 0.6048 | 0.7470 | 0.4998 | 0.7982 | 0.0074 |
25 | 0.8-DP | 0.6055 | 0.7396 | 0.5342 | 0.7969 | 0.0085 |
30 | - | 0.5986 | 0.7459 | 0.5413 | 0.8054 | 0.0052 |
30 | 0.1-DP | 0.5956 | 0.7491 | 0.4997 | 0.8003 | 0.0074 |
30 | 0.8-DP | 0.6071 | 0.7460 | 0.5421 | 0.7972 | 0.0085 |
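The server-side aggregation follows standard FedAvg; a minimal sketch under our assumptions (clients send full state dicts; DP noise, when enabled, is added client-side during local training):

```python
import torch

def fedavg(client_states, client_sizes):
    """Size-weighted average of client model state_dicts (FedAvg)."""
    total = float(sum(client_sizes))
    return {
        key: sum(state[key].float() * (size / total)
                 for state, size in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# global_state = fedavg([c.state_dict() for c in clients], sizes)
# global_model.load_state_dict(global_state)
```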
7. Safeness & Risk Awareness
Model | Moderation | Engagement Score | PII Risk Score | Scam Detection Score |
---|---|---|---|---|
LlamaGuard | safe | 0.512450 | 0.120861 | 0.746914 |
LlamaGuard | unsafe_o1 | 0.756410 | 0.038462 | 0.935128 |
LlamaGuard | unsafe_o2 | 0.833333 | 0.000000 | 0.933333 |
LlamaGuard | unsafe_o5 | 0.900000 | 0.000000 | 1.000000 |
LlamaGuard | unsafe_o6 | 0.900000 | 0.000000 | 1.000000 |
LlamaGuard2 | safe | 0.723525 | 0.290834 | 0.700517 |
LlamaGuard2 | unsafe_o1 | 1.100000 | 0.800000 | 0.960000 |
LlamaGuard2 | unsafe_o3 | 0.741386 | 0.802070 | 0.960281 |
LlamaGuard2 | unsafe_o5 | 0.749597 | 0.799329 | 0.966711 |
LlamaGuard2 | unsafe_o9 | 0.703636 | 0.490909 | 0.904545 |
LlamaGuard2 | unsafe_s1 | 0.739167 | 0.758333 | 0.937500 |
LlamaGuard2 | unsafe_s3 | 0.760000 | 0.800000 | 0.953333 |
LlamaGuard3 | safe | 0.914468 | 0.240787 | 0.461707 |
LlamaGuard3 | unsafe_o1 | 0.963778 | 0.477778 | 0.906000 |
LlamaGuard3 | unsafe_s1 | 0.970483 | 0.750345 | 0.942000 |
LlamaGuard3 | unsafe_s2 | 0.974372 | 0.685681 | 0.957775 |
LlamaGuard3 | unsafe_s9 | 0.860000 | 0.000000 | 1.000000 |
MD-Judge | safe | 0.775891 | 0.299076 | 0.404701 |
MD-Judge | unsafe_o1 | 0.769355 | 0.575645 | 0.748952 |
MD-Judge | unsafe_o3 | 0.800000 | 0.800000 | 0.847500 |
MD-Judge | unsafe_o4 | 0.752763 | 0.721171 | 0.823093 |
MD-Judge | unsafe_o5 | 0.739103 | 0.333333 | 0.845128 |
Key Contributions
- We introduce the problem of privacy-preserving, AI-driven conversational scambaiting using instruction-tuned LLMs.
- We design a novel utility function that balances scammer engagement against personally identifiable information (PII) and behavioral risk.
- We implement a real-time response filtering mechanism that enforces safety via harm scoring and strict thresholds.
- We propose a federated learning architecture that enables decentralized model training without requiring raw data collection.
Discussion & Future Directions
Our system provides real-time scam detection and safe scambaiting on messaging platforms while preserving user privacy. Because even scam-aware users remain vulnerable to real-time social engineering, we combine synthetic and real-world dialogues with non-scam data to keep the training distribution balanced.
Key innovations include the joint optimization of detection, risk scoring, and response generation through a privacy-weighted utility function, outperforming prior classifiers and LLM-based systems.
We use federated learning with differential privacy to adapt against evolving scams without leaking user data. Local fine-tuning improves detection over time. Future enhancements will address adversarial prompt tuning, voice-scam defenses, and cross-cultural studies.
Finally, role identification ensures AI engages only with the higher-risk participant. Users retain full control with opt-out options, safety thresholds, and harm filtering—ensuring safe, responsive, and trustworthy interventions.