Frankentext Framework
- Frankentext is a framework for generating long-form narratives via LLMs by enforcing a strict token-copying constraint (approximately 90%) from human-authored snippets.
- It employs a two-stage generation pipeline—initial draft creation followed by iterative polishing—to ensure both prompt relevance and narrative coherence.
- Empirical evaluations show improved prompt adherence, enhanced coherence, and reduced detectability by standard AI detectors, illustrating novel authorship challenges.
Frankentext is a framework for generating long-form narratives with LLMs under extreme token-copying constraints. The defining property is the requirement that a specified fraction—typically around 90%—of the tokens in the output must be copied verbatim from pre-existing human-authored snippets. Despite this hard constraint, the generation must satisfy a given writing prompt and maintain narrative coherence. Frankentext operationalizes a challenging regime of controllable generation, with significant implications for authorship detection, model evaluation, and the study of human–AI co-writing dynamics (Pham et al., 23 May 2025).
1. Formal Definition and Objective
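Given a prompt $P$, a pool $S$ of human-written snippets, and a user-specified copy ratio $\alpha$, the goal is to generate a document $D$ that is both coherent and relevant to $P$, while ensuring that at least an $\alpha$-fraction of its tokens are copied verbatim from $S$.
Let $\mathrm{CopyRate}(D, S)$ represent the fraction of tokens in $D$ matched to snippets in $S$, computed as the ROUGE-L recall against $S$:
$$\mathrm{CopyRate}(D, S) = \frac{\mathrm{LCS}(D, S)}{|D|}$$
where $\mathrm{LCS}(D, S)$ is the longest common subsequence length and $|D|$ is the length of $D$ in tokens. The Frankentext objective is thus:
$$\text{generate } D \text{ coherent and relevant to } P \quad \text{subject to} \quad \mathrm{CopyRate}(D, S) \ge \alpha.$$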
The method relies entirely on prompting LLMs (without custom decoding), enforcing the copy rate by iterated self-revision.
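The copy rate described in this section (longest common subsequence length divided by the document's token count) can be illustrated with a minimal sketch. It assumes whitespace tokenization and treats the concatenated snippet pool as a single reference sequence; both choices, and the function names `lcs_length` and `copy_rate`, are illustrative assumptions rather than details confirmed by the source.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def copy_rate(document: str, snippets: List[str]) -> float:
    """Fraction of document tokens matched to the snippet pool via LCS.

    Assumption: whitespace tokenization, snippets concatenated in order.
    """
    doc_tokens = document.split()
    if not doc_tokens:
        return 0.0
    pool_tokens = " ".join(snippets).split()
    return lcs_length(doc_tokens, pool_tokens) / len(doc_tokens)


snippets = ["the cat sat on the mat", "a dog barked loudly"]
draft = "the cat sat quietly while a dog barked"
rate = copy_rate(draft, snippets)  # 6 of 8 tokens are matched
meets_constraint = rate >= 0.9     # False for this toy draft
```

A production setup would more likely rely on an established ROUGE implementation and a proper tokenizer; the dynamic program here only makes the definition concrete.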
2. Generation Pipeline
The Frankentext pipeline proceeds in two primary stages: draft creation and iterative polishing. The process involves explicit prompting to glue together human fragments while adhering to the copy constraint, followed by minimal-edit passes to address contradictions and improve flow.
Algorithmic Outline
- Inputs: pool of snippets $S$, prompt $P$, copy threshold $\alpha$.
- Stage 1: Draft Creation
  - The LLM is prompted to combine snippets from $S$ in response to $P$, with an explicit instruction that at least an $\alpha$-fraction of the output tokens be copied verbatim.
  - If the draft does not meet the copy rate constraint (copy rate below $\alpha$) or is flagged as likely AI by an external detector (e.g., Pangram), the LLM is asked to revise the draft to increase the copy rate.
- Stage 2: Iterative Polishing
  - For up to three further iterations, the LLM is prompted to make minimal edits that improve flow and consistency without violating the copy rate threshold.
Pseudocode Extract (condensed):
Require: Snippets S, prompt P, copy threshold alpha
Ensure: Frankentext D
Stage 1: Draft
    D ← LLM.generate(prompt = P + snippet restriction)
    if copy_rate(D, S) < alpha or detector_flags_AI(D):
        D ← LLM.generate(revise D to raise copy rate to alpha)
Stage 2: Polish
    for i = 1 to 3:
        D_new ← LLM.generate(minimal edits for coherence, keep copy_rate ≥ alpha)
        if D_new == D:
            break
        D ← D_new
Return D
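The pseudocode can be rendered as a small Python sketch. The LLM, the copy-rate function, and the detector are passed in as plain callables so that the control flow can be run without any particular API; the function name `frankentext`, its parameters, and all prompt wording are hypothetical and do not reproduce the prompts used in the source.

```python
from typing import Callable, List


def frankentext(
    llm: Callable[[str], str],                      # hypothetical: prompt in, text out
    copy_rate: Callable[[str, List[str]], float],   # hypothetical copy-rate scorer
    detector_flags_ai: Callable[[str], bool],       # hypothetical external detector
    snippets: List[str],
    prompt: str,
    alpha: float = 0.9,
    max_polish_rounds: int = 3,
) -> str:
    pool = "\n\n".join(snippets)

    # Stage 1: draft under the snippet restriction.
    draft = llm(
        f"Write a story for this prompt: {prompt}\n"
        f"At least {alpha:.0%} of the tokens must be copied verbatim "
        f"from these snippets:\n{pool}"
    )
    # One revision pass if the copy rate is too low or the detector fires.
    if copy_rate(draft, snippets) < alpha or detector_flags_ai(draft):
        draft = llm(
            f"Revise the draft so that at least {alpha:.0%} of its tokens "
            f"are copied verbatim from the snippets.\n"
            f"Draft:\n{draft}\nSnippets:\n{pool}"
        )

    # Stage 2: minimal-edit polishing, stopping early once nothing changes.
    for _ in range(max_polish_rounds):
        polished = llm(
            f"Make minimal edits to fix contradictions and abrupt transitions "
            f"while keeping at least {alpha:.0%} of tokens copied.\n"
            f"Prompt: {prompt}\nDraft:\n{draft}\nSnippets:\n{pool}"
        )
        if polished == draft:
            break
        draft = polished
    return draft
```

As in the pseudocode, the polish stage only instructs the model to respect the threshold; this sketch does not re-check the copy rate after each polish round, which a stricter variant might add.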
3. Fundamental Components and Enforcement Mechanisms
- Fragment Selection: The default strategy is random selection of human paragraphs. An ablation replaces this with the 1,500 nearest neighbors of the prompt in a 384-dimensional sentence embedding space (indexed with FAISS); this marginally boosts topical relevance but is not essential to the framework.
- Copy Rate Enforcement: After generation, the ROUGE-L recall (matching overlapping trigrams) is computed. If it falls below the threshold, the LLM is prompted to increase the share of copied content.
- Coherence Maintenance: The polish stage prompt directs the LLM, in a self-feedback style, to repair contradictions, continuity errors, and abrupt transitions via minimal modifications. The model is exposed to its own previous output, the prompt, and all snippets, facilitating correction of local incoherencies while retaining the global copy constraint.
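The nearest-neighbor variant of fragment selection can be sketched as a brute-force top-k search. The source describes FAISS indexing over 384-dimensional sentence embeddings; the version below swaps that for plain cosine similarity over toy vectors so that it runs with the standard library alone. The similarity metric, the function names, and the example vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_snippets(
    prompt_vec: Sequence[float],
    snippet_vecs: List[Sequence[float]],
    k: int = 1500,
) -> List[int]:
    """Indices of the k snippets most similar to the prompt, best first."""
    ranked = sorted(
        range(len(snippet_vecs)),
        key=lambda i: cosine(prompt_vec, snippet_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings" standing in for real sentence embeddings.
prompt_vec = [1.0, 0.0]
snippet_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
selected = top_k_snippets(prompt_vec, snippet_vecs, k=2)
```

Brute-force search is linear in the pool size per query, which is why an approximate or indexed search is the more natural choice for a large snippet pool.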
4. Evaluation Metrics and Empirical Results
Three axes are used to evaluate Frankentext outputs:
Instruction adherence:
- All models meet the approximate 500-token target.
- Gemini Frankentexts achieve a mean copy rate (ROUGE-L recall) of 75% (vanilla generation: 0%).
- 100% prompt relevance (as judged by GPT-4.1).
Writing quality (coherence):
- 81% of Gemini Frankentexts are classified as coherent by GPT-4.1 (other models: 29–73%).
- Human evaluation of 30 Gemini stories finds 71% coherent, 91% relevant, and 84% “novel.”
Detectability:
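- Pangram Detector: 100% of Gemini vanilla outputs are flagged as AI; for Gemini Frankentexts, 4% are labeled “AI” and 37% “mixed.”
- Binoculars (cross-perplexity): the flagged rate falls from 52% (vanilla) to 0% (Frankentexts).
- FastDetectGPT: the flagged rate falls from 99% (vanilla) to 1% (Frankentexts).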
- Human Annotators: 56% correctly label Gemini Frankentexts as AI-assisted, primarily by identifying abrupt tone or grammatical shifts at segment boundaries.
Summary Table: Evaluation Results
| Axis | Gemini Frankentexts | Gemini Vanilla | Other Models |
|---|---|---|---|
| Copy Rate (%) | 75 | 0 | - |
| Relevance (%) | 100 | - | - |
| Coherence (GPT-4.1 %) | 81 | - | 29–73 |
| Detectability (%) | Pangram: 4/37 (AI/mixed), Human: 56 | Pangram: 100 | - |
Detection rates are computed as the percentage of texts labeled AI or mixed. The results suggest that existing detectors are substantially less effective on Frankentexts, with only Pangram’s mixed-authorship label catching a minority of them.
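The arithmetic behind the Pangram figures can be made explicit with a short sketch. It builds a hypothetical set of 100 detector labels matching the reported proportions (4% “AI”, 37% “mixed”) and counts a text as flagged if it receives either label. The label strings, including "human" for the remainder, and the sample size are assumptions for illustration.

```python
from typing import List


def flagged_rate(labels: List[str]) -> float:
    """Share of texts the detector labels as either AI or mixed."""
    flagged = sum(1 for label in labels if label in {"AI", "mixed"})
    return flagged / len(labels)


# Hypothetical label list reproducing the reported Pangram proportions.
labels = ["AI"] * 4 + ["mixed"] * 37 + ["human"] * 59

rate = flagged_rate(labels)   # 41 of 100 texts flagged
evasion = 1.0 - rate          # share that passes as human-written
```

The complement of the flagged rate is the share of Frankentexts that passes undetected, which is the basis for the roughly 60% evasion figure quoted for Pangram.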
5. Novel Authorship Regime and Downstream Implications
Frankentexts establish a “gray zone” of authorship: narratives that consist predominantly of human text, algorithmically arranged by an LLM. This exposes a fundamental vulnerability in current binary AI detectors, which cannot recognize token-level provenance. While Pangram labels a subset as mixed authorship (37%) and a further 4% as AI, the remaining 59% go undetected. Binoculars and FastDetectGPT, tuned for binary detection, fail outright in this regime.
A plausible implication is that authorship cannot be robustly determined based on global distributional properties alone when the “surface” text is human-composed. Human annotators outperform automated detectors but still rely on non-systematic cues (e.g., tone and grammar discontinuities).
Frankentexts provide token-level ground-truth labels in a scalable, low-cost setting (~$1.30/story), enabling construction of synthetic training corpora for future fine-grained, mixed-authorship detectors. Scenario variation—including copy ratio ($\alpha$), snippet length, and prompt style—supports a wide range of experimental configurations for studying human–AI co-writing interactions.
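One way such token-level labels might be approximated after the fact is to mark every token that falls inside an n-gram appearing verbatim in some snippet. The sketch below does this with whitespace tokens and trigrams. The source does not specify how its labels are derived, so the matching rule, the choice of n = 3, the label names, and the function names are all assumptions.

```python
from typing import List, Set, Tuple


def ngram_set(snippets: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All token n-grams that occur in any snippet."""
    grams: Set[Tuple[str, ...]] = set()
    for snippet in snippets:
        tokens = snippet.split()
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def label_tokens(
    document: str, snippets: List[str], n: int = 3
) -> List[Tuple[str, str]]:
    """Label each token 'human' if covered by a snippet n-gram, else 'llm'."""
    tokens = document.split()
    grams = ngram_set(snippets, n)
    copied = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in grams:
            for j in range(i, i + n):
                copied[j] = True
    return [
        (token, "human" if is_copied else "llm")
        for token, is_copied in zip(tokens, copied)
    ]


labels = label_tokens("suddenly the cat sat down", ["the cat sat on the mat"])
```

Labels of this form could serve as per-token training targets for a mixed-authorship detector, though short copied spans below n tokens would be missed by this particular rule.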
Potential downstream uses include:
- Mixed authorship detection: Benchmarks and training for detectors at arbitrary copy ratios.
- Human–AI collaboration research: Analysis of author blending, content integration, and revision effects.
- Adversarial abuse: Demonstration that cutting-edge detectors can be bypassed by “stitching” together protected content (~60% evasion), with implications for copyright and academic integrity.
Frankentexts delineate a new frontier in constraint-driven, fragment-based narrative assembly using LLMs, motivating development of token-level provenance tracking tools and raising the open question: “Whose words are we reading, and where do they begin and end?” (Pham et al., 23 May 2025).