
Frankentext Framework

Updated 26 March 2026
  • Frankentext is a framework for generating long-form narratives via LLMs by enforcing a strict token-copying constraint (approximately 90%) from human-authored snippets.
  • It employs a two-stage generation pipeline—initial draft creation followed by iterative polishing—to ensure both prompt relevance and narrative coherence.
  • Empirical evaluations show improved prompt adherence, enhanced coherence, and reduced detectability by standard AI detectors, illustrating novel authorship challenges.

Frankentext is a framework for generating long-form narratives with LLMs under extreme token-copying constraints. The defining property is the requirement that a specified fraction—typically around 90%—of the tokens in the output must be copied verbatim from pre-existing human-authored snippets. Despite this hard constraint, the generation must satisfy a given writing prompt and maintain narrative coherence. Frankentext operationalizes a challenging regime of controllable generation, with significant implications for authorship detection, model evaluation, and the study of human–AI co-writing dynamics (Pham et al., 23 May 2025).

1. Formal Definition and Objective

Given a prompt $P$, a pool $S = \{s_1, \dots, s_N\}$ of human-written snippets, and a user-specified copy ratio $\alpha \in (0, 1]$, the goal is to generate a document $D$ that is both coherent and relevant to $P$, while ensuring that at least an $\alpha$-fraction of its tokens are copied verbatim from $S$.

Let $\mathrm{copy\_rate}(D, S)$ denote the fraction of tokens in $D$ matched to snippets in $S$, computed as the ROUGE-L recall against $\bigcup_{i=1}^N s_i$:

$$\mathrm{copy\_rate}(D, S) = \frac{\mathrm{LCS}\left(D, \bigcup_{i=1}^N s_i\right)}{|D|}$$

where $\mathrm{LCS}(\cdot)$ is the longest common subsequence length and $|D|$ is the length of $D$ in tokens. The Frankentext objective is thus:

$$\max_{D}\left[\mathrm{Coherence}(D, P) + \mathrm{Relevance}(D, P)\right] \quad \text{s.t.} \quad \mathrm{copy\_rate}(D, S) \geq \alpha$$

The method relies entirely on prompting LLMs (without custom decoding), enforcing the copy rate by iterated self-revision.
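The copy-rate definition above can be sketched as a token-level LCS computation. This is a minimal sketch under assumptions the source does not state: whitespace splitting stands in for the BPE tokenizer, and the union $\bigcup_{i=1}^N s_i$ is approximated by concatenating the snippets in pool order. The function names `lcs_length` and `copy_rate` are illustrative, not taken from the paper's code.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest-common-subsequence length of two token lists.

    Classic dynamic program in O(len(a) * len(b)) time, keeping one row (O(len(b)) memory).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Extend a match diagonally, otherwise carry the best of skipping x or y.
            curr[j] = prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]


def copy_rate(document: str, snippets: list[str]) -> float:
    """Fraction of document tokens covered by an LCS with the snippet pool.

    Assumptions: whitespace tokenization instead of BPE, and the snippet
    "union" is the concatenation of snippets in the order given.
    """
    doc_tokens = document.split()
    if not doc_tokens:
        return 0.0
    pool_tokens = [tok for snippet in snippets for tok in snippet.split()]
    return lcs_length(doc_tokens, pool_tokens) / len(doc_tokens)


pool = ["the cat sat on the mat", "rain fell all night"]
print(copy_rate("the cat sat quietly all night", pool))  # 5 of 6 tokens matched
```

Because an LCS may skip over arbitrary stretches of a roughly 100K-token pool, this measure is lenient, and the quadratic dynamic program is slow at that scale; n-gram matching is a cheaper alternative when an exact LCS is not required.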

2. Generation Pipeline

The Frankentext pipeline proceeds in two primary stages: draft creation and iterative polishing. The process involves explicit prompting to glue together human fragments while adhering to the copy constraint, followed by minimal-edit passes to address contradictions and improve flow.

Algorithmic Outline

  • Inputs: Pool of snippets $S$, prompt $P$, copy threshold $\alpha$.
  • Stage 1: Draft Creation
    • The LLM is prompted to combine snippets from $S$ in response to $P$, with an explicit instruction that at least an $\alpha$-fraction of output tokens be copied verbatim.
    • If the draft does not meet the copy-rate constraint ($\mathrm{copy\_rate}(D, S) < \alpha$) or is flagged as likely AI-generated by an external detector (e.g., Pangram), the LLM is asked to revise the draft to increase the copy rate.
  • Stage 2: Iterative Polishing
    • In up to three further iterations, the LLM is prompted to make minimal edits that improve flow and consistency without violating the copy-rate threshold.

Pseudocode Extract (condensed):

Require: Snippets S, prompt P, copy threshold alpha
Ensure: Frankentext D

Stage 1: Draft
  D ← LLM.generate(P, S, instruction: copy ≥ alpha of tokens verbatim from S)
  if copy_rate(D, S) < alpha or detector_flags_AI(D):
    D ← LLM.generate(D, S, instruction: revise D to raise copy rate to ≥ alpha)

Stage 2: Polish
  for i = 1 to 3:
    D_new ← LLM.generate(D, P, S, instruction: minimal edits for coherence, keep copy_rate ≥ alpha)
    if D_new == D:
      break
    D ← D_new
Return D

Inputs typically include 1,500 randomly sampled human paragraphs (about 100K BPE tokens), and the model is instructed to achieve at least $\alpha$ verbatim overlap.
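The two-stage pseudocode can be turned into a runnable Python sketch with the model, metric, and detector passed in as callables. Everything here is hypothetical scaffolding: `frankentext`, `generate`, `copy_rate`, and `flags_ai` are illustrative names, and the prompt strings are placeholders rather than the paper's actual prompts.

```python
from typing import Callable, Sequence

# Hypothetical interfaces: any LLM wrapper, copy-rate metric, and AI detector can be plugged in.
Generate = Callable[[str], str]
CopyRate = Callable[[str, Sequence[str]], float]
FlagsAI = Callable[[str], bool]


def frankentext(
    prompt: str,
    snippets: Sequence[str],
    alpha: float,
    generate: Generate,
    copy_rate: CopyRate,
    flags_ai: FlagsAI,
    max_polish: int = 3,
) -> str:
    pool = "\n\n".join(snippets)
    # Stage 1: draft by stitching snippets, with the copy constraint stated in the prompt.
    draft = generate(
        f"Write a story for this prompt: {prompt}\n"
        f"At least {alpha:.0%} of your tokens must be copied verbatim from these snippets:\n{pool}"
    )
    # Single revision pass if the copy constraint fails or the detector flags the draft.
    if copy_rate(draft, snippets) < alpha or flags_ai(draft):
        draft = generate(
            f"Revise the story so at least {alpha:.0%} of its tokens are copied verbatim "
            f"from the snippets.\nSnippets:\n{pool}\nStory:\n{draft}"
        )
    # Stage 2: up to max_polish minimal-edit passes; stop early once the text stops changing.
    for _ in range(max_polish):
        revised = generate(
            f"Make minimal edits to fix contradictions and abrupt transitions while keeping "
            f"at least {alpha:.0%} of tokens copied verbatim.\nPrompt: {prompt}\n"
            f"Snippets:\n{pool}\nStory:\n{draft}"
        )
        if revised == draft:
            break
        draft = revised
    return draft
```

Like the pseudocode above, the polish loop only instructs the model to keep the copy rate; a stricter variant could recompute `copy_rate` after each pass and discard any revision that falls below `alpha`.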

3. Fundamental Components and Enforcement Mechanisms

  • Fragment Selection: The default strategy is random selection of human paragraphs. An ablation replaces this with the top 1,500 nearest neighbors to the prompt in a 384-dimensional sentence embedding space (FAISS indexing), marginally boosting topical relevance but not essential to the framework.
  • Copy Rate Enforcement: After generation, the copy rate (ROUGE-L recall against the snippet pool, matching overlapping trigrams) is computed. If it falls below $\alpha$, the LLM is explicitly prompted to increase the copied content.
  • Coherence Maintenance: The polish stage prompt directs the LLM, in a self-feedback style, to repair contradictions, continuity errors, and abrupt transitions via minimal modifications. The model is exposed to its own previous output, the prompt, and all snippets, facilitating correction of local incoherencies while retaining the global copy constraint.
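The fragment-selection strategies can be sketched as follows. As an assumption, exact brute-force cosine search in NumPy stands in for the FAISS index (on L2-normalized vectors it should yield the same ranking as an exact, flat inner-product index), and embeddings are taken as precomputed because the source gives only their dimensionality (384), not the encoder. `top_k_snippets` and `random_snippets` are hypothetical names; the latter sketches the default random selection.

```python
import numpy as np


def top_k_snippets(prompt_vec: np.ndarray, snippet_vecs: np.ndarray, k: int = 1500) -> np.ndarray:
    """Indices of the k snippets most cosine-similar to the prompt, best first.

    Works for any embedding dimension (e.g., 384); vectors must be non-zero.
    """
    # L2-normalize so that the inner product equals cosine similarity.
    p = prompt_vec / np.linalg.norm(prompt_vec)
    s = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
    scores = s @ p
    k = min(k, len(scores))
    # argpartition isolates the top-k set without a full sort; then order just those k.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]


def random_snippets(num_snippets: int, k: int = 1500, seed: int = 0) -> np.ndarray:
    """Default strategy: k distinct snippet indices sampled uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_snippets, size=min(k, num_snippets), replace=False)


vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
print(top_k_snippets(np.array([1.0, 0.1]), vecs, k=2))  # [0 2]
```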

4. Evaluation Metrics and Empirical Results

Three axes are used to evaluate Frankentext outputs:

Instruction adherence:

  • All models meet the approximate 500-token target.
  • Gemini Frankentexts achieve a mean copy rate (ROUGE-L recall) of 75%, versus 0% for vanilla generation.
  • 100% prompt relevance (as judged by GPT-4.1).

Writing quality (coherence):

  • 81% of Gemini Frankentexts are classified as coherent by GPT-4.1 (other models: 29–73%).
  • Human evaluation of 30 Gemini stories finds 71% coherent, 91% relevant, and 84% “novel.”

Detectability:

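  • Pangram Detector: 100% of Gemini vanilla outputs are flagged as AI; of Gemini Frankentexts, 4% are flagged as AI and 37% as mixed.
  • Binoculars (cross-perplexity): 52% of vanilla outputs flagged vs. 0% of Frankentexts.
  • FastDetectGPT: 99% of vanilla outputs flagged vs. 1% of Frankentexts.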
  • Human Annotators: 56% correctly label Gemini Frankentexts as AI-assisted, primarily by identifying abrupt tone or grammatical shifts at segment boundaries.

Summary Table: Evaluation Results

| Axis | Gemini Frankentexts | Gemini Vanilla | Other Models (Frankentexts) |
|---|---|---|---|
| Copy rate (%) | 75 | 0 | - |
| Relevance (%) | 100 | - | - |
| Coherence (GPT-4.1, %) | 81 | - | 29–73 |
| Detectability (%) | Pangram: 4 AI / 37 mixed; Human annotators: 56 | Pangram: 100 | - |

Detection rates are reported as the percentage of texts labeled AI or mixed. By this measure, existing detectors are substantially less effective on Frankentexts: only Pangram, via its mixed-authorship label, detects a minority of them (4% + 37% = 41%).
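The "labeled AI or mixed" rule can be made concrete with a small aggregation helper. The label strings `"ai"`, `"mixed"`, and `"human"` and the function name `detection_summary` are hypothetical, and the label list below is synthetic, built only to mirror the reported Pangram proportions (it is not the study's raw data).

```python
from collections import Counter
from typing import Iterable


def detection_summary(labels: Iterable[str]) -> dict[str, float]:
    """Summarize detector labels as percentages; any label other than "ai"/"mixed" counts as human."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to summarize")
    detected = counts["ai"] + counts["mixed"]  # "detected" = labeled AI or mixed
    return {
        "ai_pct": 100 * counts["ai"] / total,
        "mixed_pct": 100 * counts["mixed"] / total,
        "detected_pct": 100 * detected / total,
        "undetected_pct": 100 * (total - detected) / total,
    }


# Synthetic labels mirroring the reported Pangram proportions (not the study's data).
labels = ["ai"] * 4 + ["mixed"] * 37 + ["human"] * 59
print(detection_summary(labels))  # detected_pct 41.0, undetected_pct 59.0
```

This reproduces the 41% detected / 59% undetected split implied by the Pangram numbers.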

5. Novel Authorship Regime and Downstream Implications

Frankentexts establish a “gray zone” of authorship: narratives composed predominantly of human text but arranged algorithmically by an LLM. This exposes a fundamental vulnerability in current binary AI detectors, which cannot recognize token-level provenance. Pangram flags 37% of Frankentexts as mixed, yet it still misses a large fraction (59% undetected, i.e., 100 − 4 − 37). Binoculars and FastDetectGPT, designed for binary detection, fail outright in this regime (0% and 1% flagged, respectively).

A plausible implication is that authorship cannot be robustly determined based on global distributional properties alone when the “surface” text is human-composed. Human annotators outperform automated detectors but still rely on non-systematic cues (e.g., tone and grammar discontinuities).

Frankentexts provide token-level ground-truth labels in a scalable, low-cost setting (~$1.30/story), enabling construction of synthetic training corpora for future fine-grained, mixed-authorship detectors. Scenario variation—including copy ratio ($\alpha$), snippet length, and prompt style—supports a wide range of experimental configurations for studying human–AI co-writing interactions.
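The source does not spell out how the token-level labels are derived. One simple approach, assumed here rather than taken from the paper, marks a token as human-copied when it lies inside a trigram that also occurs within a single snippet. `ngrams` and `token_provenance` are hypothetical names, and whitespace tokenization is an assumption.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def token_provenance(document: str, snippets: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Label each document token 1 (copied from a human snippet) or 0 (LLM-written).

    Heuristic: a token counts as copied if it lies inside some n-gram of the
    document that also occurs inside a single snippet. Documents shorter than
    n tokens get no positive labels.
    """
    doc = document.split()
    pool: set[tuple[str, ...]] = set()
    for snippet in snippets:
        pool |= ngrams(snippet.split(), n)  # n-grams never span two snippets
    labels = [0] * len(doc)
    for i in range(len(doc) - n + 1):
        if tuple(doc[i:i + n]) in pool:
            labels[i:i + n] = [1] * n
    return list(zip(doc, labels))


story = "the old house stood silent while wind howled through broken windows"
pool = ["the old house stood on the hill", "wind howled through broken windows"]
print([label for _, label in token_provenance(story, pool)])
# [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
```

Only "silent while" is labeled as LLM-written here; the glue words an LLM inserts between fragments are exactly the spans a fine-grained mixed-authorship detector would need to localize.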

Potential downstream uses include:

  • Mixed authorship detection: Benchmarks and training for detectors at arbitrary copy ratios.
  • Human–AI collaboration research: Analysis of author blending, content integration, and revision effects.
  • Adversarial abuse: Demonstration that cutting-edge detectors can be bypassed by “stitching” together protected content (~60% evasion), with implications for copyright and academic integrity.

Frankentexts delineate a new frontier in constraint-driven, fragment-based narrative assembly using LLMs, motivating development of token-level provenance tracking tools and raising the open question: “Whose words are we reading, and where do they begin and end?” (Pham et al., 23 May 2025).
