Crime Script–Aware Inference Dataset (CSID)
- Crime Script–Aware Inference Dataset (CSID) is a dual-purpose resource comprising a conversational scam corpus and a multimodal crime drama testbed for script-aware reasoning.
- The dataset employs rigorous annotation protocols, feature extraction methods, and sequence modeling to enable next utterance prediction, intent explanation, and incremental inference.
- CSID empowers research in intent recognition, adversarial strategy, and narrative understanding, achieving high inter-annotator reliability across diverse modalities.
The Crime Script–Aware Inference Dataset (CSID) refers to two distinct but thematically related resources in contemporary computational social science and machine learning. The term appears in both large-scale conversational scam detection corpora and multimodal crime drama reasoning testbeds. Both variants are designed to drive script-aware inference pertaining to criminal activity, supporting both sequence modeling and structured prediction tasks. These datasets serve as benchmarks for modeling incremental reasoning, intent recognition, and adversarial strategy in complex, multi-turn scenarios (Kim et al., 20 Jan 2026, Frermann et al., 2017).
1. Motivations and Core Definitions
CSID in its conversational form was constructed to address the detection of multi-turn social engineering scams. Its primary objective is to enable compact LLMs (cLLMs) to infer, from a partial multi-turn conversation, whether the ongoing interaction is a scam, to predict the scammer's next utterance, and to generate an explicit rationale reflecting the scammer's staged intent. Formally, given prior utterances $u_{1:t}$ under a hypothesized script scenario $s$, the model is trained to output the tuple $(y, u_{\text{next}}, r)$, representing a binary scam label, the next scammer utterance, and a natural-language intent explanation respectively:
$$(u_{1:t}, s) \mapsto (y, u_{\text{next}}, r)$$
where $y \in \{0, 1\}$, $u_{\text{next}}$ is a free-form utterance, and $r$ is a natural-language rationale (Kim et al., 20 Jan 2026).
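As a rough illustration of the task definition above, one instance could be held in a small record type. This is only a sketch: the class name `CSIDInstance` and all field names are assumptions made for this example, not the dataset's published schema. The utterances are taken from the non-scam example given later in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CSIDInstance:
    """One supervised example. All field names are hypothetical."""
    context: List[str]       # prior utterances of the partial conversation
    scenario: Optional[str]  # hypothesized script scenario, if one is supplied
    is_scam: int             # binary label: 1 = scam, 0 = non-scam
    next_utterance: str      # next utterance of the caller (prediction target)
    rationale: str           # natural-language intent explanation

    def __post_init__(self) -> None:
        # Reject records that cannot be valid under the task definition.
        if self.is_scam not in (0, 1):
            raise ValueError("is_scam must be 0 or 1")
        if not self.context:
            raise ValueError("context must contain at least one utterance")


example = CSIDInstance(
    context=[
        "Hello, this is Sergeant Lee Cheol-soo from the Traffic Division. "
        "We received a report of drunk driving on August 25. "
        "When can you come to the station?"
    ],
    scenario=None,
    is_scam=0,
    next_utterance=(
        "Please bring your driver's license and insurance documents "
        "when you visit at 3 PM."
    ),
    rationale=(
        "This is a legitimate request by a traffic officer to schedule "
        "an investigation appointment."
    ),
)
```

Validating the label and the context at construction time keeps malformed records out of a training set early, which is a design choice of this sketch rather than something the source prescribes.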
In the crime drama context, CSID denotes a multimodal corpus drawn from CSI: Crime Scene Investigation episodes, targeting the sequence labeling problem of perpetrator identification, utilizing synchronous textual, visual, and audio cues. This resource supports incremental inference with historical context and multimodal fusion, enabling research on temporal narrative understanding and event attribution in realistic “whodunnit” settings (Frermann et al., 2017).
2. Dataset Construction and Annotation Protocols
2.1 Conversational Scam CSID
Source data comprises the LAW ORDER Benchmark Dataset featuring 571 real Korean phone scam cases and 48,229 transcribed utterances. Each phone call is partitioned into alternating user and scammer turns, with normalization to resolve repeated utterances and intent mapping for each scammer statement. To ensure class parity, 11,356 benign police summons dialogues were added, resulting in balanced classes (Kim et al., 20 Jan 2026).
Annotation was performed by professional crime profilers, who assigned each scammer utterance to one of 15 script stages and one of 45 fine-grained intent labels. The process attained high inter-rater reliability as measured by Cohen's kappa. Behavior sequence analysis with standardized residuals was used to validate the scripted pattern transitions. Each CSID instance encapsulates the multi-turn context, binary label, oracle next utterance, and intent rationale.
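For readers unfamiliar with the reliability measure mentioned above, Cohen's kappa for two annotators can be computed as in the following minimal sketch. The function name and the toy stage labels are illustrative assumptions; the source does not describe its own tooling.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical stage names, not real annotations).
annotator_1 = ["stage_a", "stage_a", "stage_b", "stage_b"]
annotator_2 = ["stage_a", "stage_b", "stage_b", "stage_b"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when some script stages are far more frequent than others.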
2.2 Crime Drama CSID
The multimodal variant is sourced from CSI: Crime Scene Investigation (Seasons 1–5, 39 episodes, 59 cases), utilizing aligned subtitles, fan-sourced screenplays, and official video streams. Temporal alignment is achieved via dynamic time warping between subtitle tokens and screenplay dialogs, with segmentation into "script units" comprising dialog or scene descriptions.
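The alignment step can be illustrated with a textbook dynamic time warping routine. The sketch below aligns two token sequences under a simple 0/1 mismatch cost. That cost function, the function name `dtw_align`, and the toy tokens are assumptions for illustration, since the source does not specify the distance used.

```python
from typing import List, Sequence, Tuple


def dtw_align(a: Sequence[str], b: Sequence[str]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW cost and warping path between two token sequences."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: 0 for a case-insensitive match, 1 otherwise.
            d = 0.0 if a[i - 1].lower() == b[j - 1].lower() else 1.0
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover which tokens were matched.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both sequences
            (cost[i - 1][j], i - 1, j),          # advance a only
            (cost[i][j - 1], i, j - 1),          # advance b only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


subtitle_tokens = ["who", "killed", "him"]
screenplay_tokens = ["who", "killed", "killed", "him"]
total_cost, warping_path = dtw_align(subtitle_tokens, screenplay_tokens)
```

Because DTW allows one token to map to several tokens in the other sequence, it tolerates repetitions and small transcription differences between subtitles and screenplays.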
Annotations occur in two passes: first, incremental sentence-level identification of perpetrator mentions and case relevance by three annotators who had not seen the episodes, and second, post hoc token-level marking of perpetrator and suspect/entity mentions after episode resolution. Inter-annotator agreement is substantial for both the sentence-level binary labels and the token-level perpetrator class. A sentence is labeled as a perpetrator mention if any token it contains is annotated as perpetrator (Frermann et al., 2017).
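The rule that lifts token-level annotations to sentence-level labels is simple enough to state in code. In this sketch the tag strings (`"perpetrator"`, `"suspect"`, `"other"`) are hypothetical placeholders for whatever tag set the corpus actually uses.

```python
from typing import List, Sequence

PERPETRATOR_TAG = "perpetrator"  # hypothetical tag name


def sentence_label(token_tags: Sequence[str]) -> int:
    """Return 1 if any token in the sentence is tagged as the perpetrator."""
    return int(any(tag == PERPETRATOR_TAG for tag in token_tags))


def label_sentences(tagged_sentences: Sequence[Sequence[str]]) -> List[int]:
    """Apply the any-token rule to every sentence in a case."""
    return [sentence_label(tags) for tags in tagged_sentences]


case_tags = [
    ["other", "other", "suspect"],
    ["other", "perpetrator", "other"],
    [],
]
sentence_labels = label_sentences(case_tags)
```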
3. Dataset Statistics and Structure
3.1 Conversational Scam CSID
The dataset contains 22,712 instances, split evenly between scam and non-scam segments (11,356 each). Each instance contains 1–10 prior utterances. The joint prediction task maps this partial context to a scam label, the next utterance, and an intent rationale. The distribution of instances is:
| Label | # Instances |
|---|---|
| Scam (label 1) | 11,356 |
| Non-scam (label 0) | 11,356 |
| Total | 22,712 |
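One plausible way to cut a call into instances with 1–10 prior utterances is a sliding window that stops before each caller turn. The source does not describe its exact segmentation procedure, so this windowing scheme, the speaker tags `"caller"` and `"user"`, and the dictionary keys are all assumptions of the sketch.

```python
from typing import Dict, List, Tuple

MAX_CONTEXT = 10  # the text states that instances hold 1-10 prior utterances


def make_instances(turns: List[Tuple[str, str]], is_scam: int) -> List[Dict[str, object]]:
    """Build (context, label, next utterance) records from one call.

    `turns` is a list of (speaker, text) pairs with hypothetical speaker tags
    "caller" (scammer or officer) and "user".
    """
    instances: List[Dict[str, object]] = []
    for t in range(1, len(turns)):
        speaker, text = turns[t]
        if speaker != "caller":
            continue  # only caller turns serve as next-utterance targets
        # Keep at most MAX_CONTEXT utterances immediately preceding turn t.
        context = [utt for _, utt in turns[max(0, t - MAX_CONTEXT):t]]
        instances.append(
            {"context": context, "label": is_scam, "next_utterance": text}
        )
    return instances


# Synthetic call with 13 alternating turns, starting with the caller.
demo_turns = [("caller" if i % 2 == 0 else "user", f"utt {i}") for i in range(13)]
demo_instances = make_instances(demo_turns, is_scam=1)
```

Truncating to the most recent utterances keeps every context within the stated 1–10 range even for long calls.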
3.2 Crime Drama CSID
CSID comprises 59 cases, derived from 39 episodes, with cases varying in type (51 murders, 4 accidents, 2 suicides, 2 others). Per case, the distribution is: sentences (228–1209), sentences containing a perpetrator mention (0–267), scene descriptions (64–538), spoken utterances (144–778), and unique characters (8–38). Data splits employ five-fold cross-validation (47 train/6 validation cases per fold) and a fixed 6-case held-out test set.
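The split arithmetic (59 cases = 6 held-out test cases + 53 development cases, the latter divided into 47 training and 6 validation cases per fold) can be reproduced with a short routine. How the original folds were drawn is not stated, so the seeded shuffle and the use of disjoint validation chunks below are assumptions.

```python
import random
from typing import Dict, Iterable, List


def make_splits(
    case_ids: Iterable[int],
    n_test: int = 6,
    n_val: int = 6,
    n_folds: int = 5,
    seed: int = 0,
) -> Dict[str, object]:
    """Hold out a fixed test set, then build cross-validation folds."""
    ids = list(case_ids)
    if len(ids) < n_test + n_val * n_folds:
        raise ValueError("not enough cases for the requested split sizes")
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    test, dev = ids[:n_test], ids[n_test:]
    folds: List[Dict[str, List[int]]] = []
    for k in range(n_folds):
        # Each fold validates on its own disjoint chunk of the development set.
        val = dev[k * n_val:(k + 1) * n_val]
        val_set = set(val)
        train = [c for c in dev if c not in val_set]
        folds.append({"train": train, "val": val})
    return {"test": test, "folds": folds}


splits = make_splits(range(59))
```

Note that with 53 development cases and 6 validation cases per fold, the five validation sets cover only 30 cases, so not every development case is validated on in this sketch.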
4. Data Modalities, Feature Extraction, and Formal Tasks
4.1 Scam Detection
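Instances encode multi-turn histories, with supervised labels enabling three-way learning: detection (binary), next utterance prediction (token-level), and rationale generation (free-form text). The construction involves three objects: the input corpus of transcribed dialogues, the intent-annotated scammer behavior sequence derived from it, and the final CSID, whose instances pair a partial dialogue context with a scam label, the next scammer utterance, and an intent rationale.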
4.2 Multimodal Sequence Labeling
Textual sentences are tokenized and embedded using 50-dimensional GloVe vectors, then processed via convolutional and pooling layers to yield 225-dimensional features. Visual features are extracted with Inception-V4 (1536-dim), and audio features are MFCCs (65-dim). The per-modality features are concatenated, projected, passed through a nonlinearity, and fused into a single representation per script unit. A Long Short-Term Memory (LSTM) network processes the script units sequentially. The output is binary: whether the perpetrator is mentioned in a given sentence. The loss is cross-entropy over the sequence.
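To make the dimensions concrete: concatenating the three feature vectors gives 225 + 1536 + 65 = 1826 values per script unit, which are then projected to a smaller fused vector. The sketch below uses a single dense layer with a tanh activation. The activation choice, the fused size of 4, and the constant weights are illustrative assumptions, not the published configuration.

```python
import math
from typing import List, Sequence

TEXT_DIM, VISUAL_DIM, AUDIO_DIM = 225, 1536, 65  # sizes stated in the text
CONCAT_DIM = TEXT_DIM + VISUAL_DIM + AUDIO_DIM   # 1826


def fuse(
    text: Sequence[float],
    visual: Sequence[float],
    audio: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Concatenate modality features, project them, and apply tanh."""
    if (len(text), len(visual), len(audio)) != (TEXT_DIM, VISUAL_DIM, AUDIO_DIM):
        raise ValueError("unexpected feature dimensions")
    if len(weights) != len(bias) or any(len(row) != CONCAT_DIM for row in weights):
        raise ValueError("weights must have shape (fused_dim, 1826)")
    x = list(text) + list(visual) + list(audio)
    # One dense layer: tanh(W x + b), computed row by row.
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


FUSED_DIM = 4  # hypothetical size, chosen only to keep the example small
demo_weights = [[0.001] * CONCAT_DIM for _ in range(FUSED_DIM)]
demo_bias = [0.0] * FUSED_DIM
fused = fuse([1.0] * TEXT_DIM, [1.0] * VISUAL_DIM, [1.0] * AUDIO_DIM,
             demo_weights, demo_bias)
```

In a real model the weights would be learned jointly with the LSTM; the pure-Python loop here only shows the data flow.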
5. Integration into Model Training and Evaluation
5.1 Scam CSID LLM Fine-tuning
Open-source cLLMs (1–11B parameters) are fine-tuned using the CSID supervision, optimized with paged AdamW and QLoRA low-rank adaptation on attention/feedforward layers, and trained for 5 epochs on dual A100 GPUs (30 hours). The joint loss function balances detection, utterance prediction, and rationale generation through per-task coefficients:
$$\mathcal{L} = \lambda_{\text{det}} \mathcal{L}_{\text{det}} + \lambda_{\text{utt}} \mathcal{L}_{\text{utt}} + \lambda_{\text{rat}} \mathcal{L}_{\text{rat}}$$
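A joint loss that balances three tasks with coefficients is most naturally read as a weighted sum, which can be sketched as follows. The function name, the default weights of 1.0, and the example values are placeholders, because the actual coefficient values are not recoverable from this text.

```python
from typing import Tuple


def joint_loss(
    detection_loss: float,
    utterance_loss: float,
    rationale_loss: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # placeholder values
) -> float:
    """Weighted sum of the three task losses."""
    if any(w < 0 for w in weights):
        raise ValueError("loss weights must be non-negative")
    w_det, w_utt, w_rat = weights
    return w_det * detection_loss + w_utt * utterance_loss + w_rat * rationale_loss


# Example with arbitrary per-task losses and weights.
total = joint_loss(0.5, 1.0, 2.0, weights=(1.0, 0.5, 0.25))
```

Down-weighting the generation terms, as in the example call, is one common way to stop long free-text targets from dominating a short classification signal. Whether the original training did so is not stated.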
Run-time inference yields a JSON output containing the predicted class label, next utterance, and natural-language rationale.
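Because the model emits JSON, downstream code typically validates the structure before trusting it. The following sketch parses and checks such an output. The key names `label`, `next_utterance`, and `rationale` are assumed for illustration and may differ from the actual output schema.

```python
import json
from typing import Any, Dict


def parse_model_output(raw: str) -> Dict[str, Any]:
    """Parse a JSON prediction and check the three expected fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed text
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    label = data.get("label")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(label, bool) or label not in (0, 1):
        raise ValueError("label must be the integer 0 or 1")
    for key in ("next_utterance", "rationale"):
        if not isinstance(data.get(key), str) or not data[key].strip():
            raise ValueError(f"{key} must be a non-empty string")
    return {
        "label": label,
        "next_utterance": data["next_utterance"],
        "rationale": data["rationale"],
    }


raw_output = json.dumps({
    "label": 0,
    "next_utterance": "Please bring your driver's license and insurance "
                      "documents when you visit at 3 PM.",
    "rationale": "This is a legitimate request by a traffic officer to "
                 "schedule an investigation appointment.",
})
prediction = parse_model_output(raw_output)
```

Rejecting malformed outputs explicitly makes it possible to count them separately from wrong predictions during evaluation.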
5.2 Crime Drama CSID Modeling
Baseline models include PRO (pronoun rule-based), CRF, MLP, and LSTM (incremental, multimodal). On the held-out set, the LSTM achieves the best model F1 (46.6%), ahead of the MLP (40.2%) and the CRF (21.0%), while human annotators outperform all models (F1 67.3%). Visual and acoustic modalities yield significant performance gains over text alone. The LSTM model is implemented with fusion layers, an LSTM hidden layer, dropout (0.5), and the Adam optimizer.
Evaluation metrics focus on the minority “perpetrator mentioned” class: precision, recall, and F1. Incremental analysis measures the dynamics of correct first inferences and adaptation to cases with no perpetrator.
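The minority-class metrics can be computed directly from gold and predicted sentence labels. The sketch below treats label 1 as "perpetrator mentioned". The function name and toy labels are illustrative.

```python
from typing import Dict, Sequence


def minority_class_prf(
    gold: Sequence[int], pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive (minority) class only."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    # Define each metric as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


gold_labels = [1, 0, 1, 0, 0, 1]
pred_labels = [1, 1, 0, 0, 0, 1]
scores = minority_class_prf(gold_labels, pred_labels)
```

Scoring only the positive class avoids the inflated numbers that accuracy would give when most sentences do not mention the perpetrator.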
6. Concrete Examples and Usage Guidance
Example CSID instances for the scam corpus illustrate the integration of multi-turn conversational context, binary labeling, future utterance prediction, and rationale generation. For example:
- Scam:
  - Context: “Do you have no knowledge about this at all? … According to our records, the account was opened under your name.”
  - Label: scam
  - Next utterance: “We are contacting you to determine whether you personally opened and sold this account for payment or whether you are a victim of identity theft.”
  - Rationale: “The scammer aims to confirm if the victim’s identity was stolen or if they colluded by selling their account.”
- Non-scam:
  - Context: “Hello, this is Sergeant Lee Cheol-soo from the Traffic Division. We received a report of drunk driving on August 25. When can you come to the station?”
  - Label: non-scam
  - Next utterance: “Please bring your driver’s license and insurance documents when you visit at 3 PM.”
  - Rationale: “This is a legitimate request by a traffic officer to schedule an investigation appointment.”
For the multimodal drama CSID, researchers use provided scripts for data access, DTW scripts for alignment, and feature extraction utilities for text, vision, and audio. Recommended LSTM hyperparameters are specified. Evaluation uses the held-out case set, with metrics as above, and researchers may extend the resource for multiclass predictions, transfer learning, and advanced multimodal fusion (Frermann et al., 2017).
7. Research Significance and Extensions
CSID sets a unique precedent for jointly modeling high-level script reasoning (intent stage, deception strategy) and low-level utterance/text prediction under constrained information. In scam detection, it enables compact LLMs with enhanced suspicion maintenance, outperforming commercial-scale models in detection accuracy, false-positive reduction, next utterance prediction, and rationale quality. In multimodal narrative inference, CSID supports incremental, history-dependent reasoning and fused decision-making, with clear benchmarks contrasting text-only and multimodal baselines and human annotation performance.
Significant future research directions include incorporating multiclass sequence labeling (suspect/witness/victim discrimination), transfer to different crime genres, and richer visual/auditory scene interpretations. The CSID resource has accelerated progress on context-sensitive, script-aware language understanding and multimodal reasoning on adversarial and ambiguous narratives (Kim et al., 20 Jan 2026, Frermann et al., 2017).