LEDOM: An Open and Fundamental Reverse Language Model (2507.01335v1)
Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward LLM outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
Summary
- The paper introduces Ledom, a reverse language model predicting tokens in reverse order with 2B and 7B parameter variants.
- It presents the Reverse Reward framework that combines reverse and forward likelihoods to improve reasoning and evaluation tasks.
- The study highlights Ledom’s unique strengths in abductive reasoning and question generation while addressing safety challenges in reverse decoding.
LEDOM: An Open and Fundamental Reverse Language Model
The paper introduces Ledom, a large-scale, open-source reverse language model (RLM) trained to predict the previous token in a sequence, in contrast to conventional left-to-right forward language models (FLMs). Ledom is presented as the first systematic exploration of a purely reverse-trained decoder-only model at scale, with 2B and 7B parameter variants trained on 435B tokens spanning general text, mathematics, and code. The work investigates the modeling dynamics, empirical performance, and unique capabilities of RLMs, and proposes a novel application, Reverse Reward, for enhancing forward model outputs via posterior evaluation.
Reverse Language Modeling: Formulation and Training
Ledom is trained with a reverse-temporal autoregressive objective: given a sequence $x_1, x_2, \dots, x_T$, the model predicts $x_t$ conditioned on $x_{t+1}, \dots, x_T$. This is implemented by reversing the input sequence and applying a standard decoder-only Transformer architecture, ensuring compatibility with FLM tokenization and facilitating direct comparison. The training corpus is a composite of high-quality, deduplicated general text (DCLM), mathematical data, and code (MAP-Neo), with careful attention to domain balance.
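Because the reversal happens purely at the data level, ordinary causal-LM training code can be reused unchanged. Below is a minimal sketch of that preprocessing step, assuming a Hugging Face-style tokenizer; the "gpt2" tokenizer is a stand-in, not the one used for LEDOM.

```python
# Minimal sketch of data-level reversal for reverse language modeling.
# "gpt2" is a placeholder tokenizer, not the LEDOM tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def to_reverse_example(text: str, max_len: int = 8192) -> list[int]:
    """Tokenize normally, then flip the token order.

    A standard decoder-only model trained with ordinary next-token prediction
    on these flipped sequences effectively learns P(x_t | x_{t+1}, ..., x_T).
    """
    ids = tokenizer(text, truncation=True, max_length=max_len)["input_ids"]
    return ids[::-1]

# Forward tokenization is untouched, so forward and reverse models share a
# vocabulary and can score the same text from opposite directions.
reversed_ids = to_reverse_example("The quick brown fox jumps over the lazy dog.")
```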
The architecture leverages modern enhancements such as Multi-Query Attention, Rotary Positional Embeddings, RMSNorm, and SwiGLU activations. Training is conducted on 64 A100 GPUs with a global batch size of 1024 and a context window of 8192 tokens. Notably, Ledom exhibits slower convergence and a higher asymptotic training loss than FLMs, attributed to the increased uncertainty of reverse prediction, especially for the first tokens of the reversed sequence (the end of the original text), where little conditioning context is available.
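For orientation, the reported design choices can be recorded as a plain configuration sketch. The keys below are illustrative only; hidden size, depth, and optimizer settings are not reported in this summary and are left as explicit placeholders rather than guessed.

```python
# Hedged sketch of how the stated design choices map onto a decoder-only config.
ledom_config = {
    "attention": "multi-query",      # shared key/value heads across query heads
    "position_embedding": "rotary",  # RoPE applied to queries and keys
    "norm": "rmsnorm",
    "activation": "swiglu",
    "context_length": 8192,
    "hidden_size": None,             # placeholder: not given in this summary
    "num_layers": None,              # placeholder: not given in this summary
}

training_setup = {
    "gpus": "64 x A100",
    "global_batch_size": 1024,       # sequences per optimizer step
    "pretraining_tokens": 435_000_000_000,
}
```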
Empirical Evaluation and Comparative Analysis
Ledom is evaluated on a suite of NLP benchmarks, including reasoning (BoolQ, WinoGrande), code generation (HumanEval), world knowledge (NQ-Open, TriviaQA), and mathematical reasoning (GSM8K, MATH-500, AIME, AMC). For fair comparison, all evaluation prompts and expected outputs are token-reversed to match Ledom's training regime.
Key empirical findings:
- General Reasoning and Commonsense: Ledom achieves performance comparable to FLMs on tasks like BoolQ and WinoGrande, especially at smaller model scales. However, a performance gap emerges at 7B scale, suggesting increased difficulty in modeling long-range dependencies in reverse.
- Code Generation: Ledom underperforms FLMs on HumanEval, reflecting the forward-oriented nature of code synthesis and the challenges of maintaining syntactic and semantic correctness in reverse.
- World Knowledge: Ledom lags behind FLMs on open-domain QA, indicating that backward prediction is less effective for fact recall.
- Mathematical Reasoning: While Ledom's raw scores are lower, qualitative analysis reveals distinct, often more diverse reasoning pathways, motivating its use as a complementary evaluator.
Case Studies: Unique Capabilities and Limitations
A detailed case-based analysis highlights Ledom's distinctive strengths:
- Abductive Reasoning: Ledom excels at generating plausible causal chains leading to a known outcome, making it well-suited for tasks requiring inference about antecedents.
- Story Generation: The model demonstrates strong narrative skills when constructing lead-ins to specified conclusions, suggesting utility in simulation-based inference and explanation generation.
- Question Generation: Given an answer and supporting rationale, Ledom can synthesize natural, well-formed questions, facilitating automated QA dataset creation.
- Reversal Curse Mitigation: Ledom is more robust to the "reversal curse," showing improved ability to infer inverse relations (e.g., inferring "B is A" from "A is B"), a generalization that FLMs often fail to make.
However, the reverse modeling paradigm introduces unique safety risks. Existing safety filters, designed for left-to-right generation, may not adequately constrain reverse decoding, as evidenced by Ledom's ability to generate unsafe content from prompts that would typically be blocked in FLMs.
Reverse Reward: Bidirectional Posterior Evaluation
The most significant practical contribution is the Reverse Reward framework, which leverages Ledom as a reward model to rerank or guide FLM outputs. The reverse reward is defined as the likelihood of the input sequence conditioned on the generated response, as estimated by Ledom. This posterior evaluation is combined with the FLM's forward likelihood in a unified bidirectional reward:
$$R(x, y) = P_{\text{FLM}}(y \mid x)^{1-\lambda} \cdot P_{\text{RLM}}(x \mid y)^{\lambda}$$
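In log space this reward is a convex combination of the two models' log-likelihoods, which preserves the ranking induced by the product form above. A minimal sketch follows, assuming the two log-probabilities have already been computed by the forward model and by Ledom; the interpolation weight λ used here is an illustrative default, not the value tuned in the paper.

```python
def bidirectional_reward(logp_fwd: float, logp_rev: float, lam: float = 0.5) -> float:
    """log R(x, y) = (1 - lam) * log P_FLM(y | x) + lam * log P_RLM(x | y).

    logp_fwd: log-likelihood of the response y given the prompt x (forward model).
    logp_rev: log-likelihood of the prompt x given the response y (reverse model).
    lam:      interpolation weight; 0.5 is an illustrative default.
    """
    return (1.0 - lam) * logp_fwd + lam * logp_rev

# Example: candidate B wins despite a slightly lower forward score,
# because the reverse model finds the prompt more recoverable from it.
a = bidirectional_reward(logp_fwd=-10.2, logp_rev=-31.5)  # -20.85
b = bidirectional_reward(logp_fwd=-11.0, logp_rev=-27.9)  # -19.45
```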
Two inference strategies are proposed:
- Response-Level Reranking (Best-of-N): Multiple FLM outputs are generated and reranked using the combined reward, selecting the highest-scoring candidate (see the sketch after this list).
- Step-wise Decoding via Beam Search: At each reasoning step, candidate continuations are generated and scored with the bidirectional reward, maintaining a beam of top sequences.
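The sketch below illustrates the Best-of-N strategy under the reward above. The three callables are placeholders for model-specific generation and scoring routines, not functions from the paper's release; the step-wise beam-search variant applies the same score to partial continuations rather than full responses.

```python
# Hedged sketch of Best-of-N reranking with the bidirectional reward.
from typing import Callable

def best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], list[str]],  # forward-model sampling (placeholder)
    forward_logprob: Callable[[str, str], float],          # log P_FLM(y | x) (placeholder)
    reverse_logprob: Callable[[str, str], float],          # log P_RLM(x | y), via the reverse model (placeholder)
    n: int = 16,
    lam: float = 0.5,
) -> str:
    """Sample N forward-model responses and keep the one maximizing
    (1 - lam) * log P_FLM(y | x) + lam * log P_RLM(x | y)."""
    candidates = generate_candidates(prompt, n)

    def score(y: str) -> float:
        return (1.0 - lam) * forward_logprob(prompt, y) + lam * reverse_logprob(prompt, y)

    return max(candidates, key=score)
```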
Empirical results on mathematical reasoning benchmarks (GSM8K, MATH-500, AIME, AMC) demonstrate that Reverse Reward consistently improves accuracy across diverse FLMs. For example, QwenMath-7B achieves 96.1% on GSM8K and 80.8% on MATH-500 with Reverse Reward, outperforming both greedy and random selection baselines. Step-level beam search further enhances performance on multi-step problems, confirming the value of token-level posterior guidance.
Implementation Considerations
- Training RLMs: Reverse models require careful data preprocessing (token reversal) and may benefit from domain-specific fine-tuning, especially for reward modeling.
- Inference Overhead: Reverse Reward introduces additional computational cost, as each candidate output must be scored by both the FLM and Ledom. Efficient batching and parallelization are essential for practical deployment.
- Prompt Engineering: For tasks evaluated with Ledom, all input and output components must be token-reversed, while prompt markers (e.g., "Question", "Answer") should remain unreversed for clarity (see the sketch after this list).
- Safety: Dedicated safety alignment and filtering mechanisms are necessary for reverse models, as standard left-to-right filters may be insufficient.
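One way to realize the prompt convention above is to reverse each content span at the token level while leaving the marker strings in forward order. The marker text, span ordering, and template below are assumptions for illustration, not the exact prompts used in the paper; the tokenizer is any Hugging Face-style tokenizer shared with the forward model.

```python
def build_reversed_qa_prompt(tokenizer, question: str, answer: str) -> list[int]:
    """Token-reverse the content spans of a 'Question: ... Answer: ...' prompt
    while keeping the marker tokens unreversed (illustrative layout)."""
    def rev(text: str) -> list[int]:
        return tokenizer(text, add_special_tokens=False)["input_ids"][::-1]

    def keep(text: str) -> list[int]:
        return tokenizer(text, add_special_tokens=False)["input_ids"]

    # Reversing the forward-order prompt "Question: <q> Answer: <a>" puts the
    # answer span first in the reversed stream; only content spans are flipped,
    # the markers stay readable left-to-right.
    return rev(answer) + keep("Answer:") + rev(question) + keep("Question:")
```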
Theoretical and Practical Implications
The introduction of Ledom and the Reverse Reward paradigm challenges the entrenched assumption that forward autoregression is the only viable direction for foundational language modeling. The findings suggest that reverse models develop alternative inductive biases and reasoning strategies, offering complementary strengths to FLMs. The bidirectional reward framework provides a principled method for integrating forward and reverse models, yielding tangible improvements in complex reasoning tasks.
Future directions include:
- Scaling RLMs to larger parameter counts and more diverse languages.
- Exploring hybrid architectures that natively support bidirectional generation and evaluation.
- Developing robust safety and alignment protocols tailored to reverse modeling.
- Investigating applications in abductive reasoning, explanation generation, and data augmentation.
Conclusion
Ledom establishes reverse language modeling as a viable and valuable paradigm for foundational NLP. The open release of models, code, and data will facilitate further research into reverse modeling, bidirectional inference, and their integration into advanced language systems. The demonstrated improvements in reasoning-oriented tasks via Reverse Reward highlight the practical utility of RLMs and motivate continued exploration of non-standard autoregressive directions in LLM design.