- The paper introduces a reverse language model that predicts previous tokens, enabling novel bidirectional reasoning techniques.
- It details an architecture trained on 435B tokens across text, math, and code, showcasing unique abductive reasoning and narrative generation.
- The work proposes the Reverse Reward framework, using Ledom to rerank forward model outputs and improve performance on complex reasoning tasks.
LEDOM: An Open and Fundamental Reverse LLM
The paper introduces Ledom, a large-scale, open-source Reverse LLM (RLM) that is trained to predict previous tokens in a sequence, in contrast to the conventional left-to-right Forward LLMs (FLMs). Ledom is presented as the first systematic exploration of a purely reverse-trained autoregressive model at scale, with 2B and 7B parameter variants trained on 435B tokens spanning general text, mathematics, and code. The work investigates the modeling dynamics, empirical performance, and unique capabilities of RLMs, and proposes a novel application—Reverse Reward—for enhancing forward model outputs, particularly in mathematical reasoning.
Ledom is trained with a reverse-temporal autoregressive objective: given a sequence x_1, x_2, ..., x_T, the model predicts x_t conditioned on x_{t+1}, ..., x_T. This is implemented using standard FLM tokenization for compatibility, but with all input sequences reversed. The architecture is a decoder-only Transformer, matching the FLM baseline in all respects except for the directionality of prediction. Training is performed on a high-quality, domain-balanced corpus, with substantial portions dedicated to mathematical and code data to probe reasoning and structural capabilities.
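The objective can be sketched at the token level: reverse each tokenized sequence, then apply the usual shifted input/target construction. This is an illustrative sketch, not the paper's released code; the token IDs below are stand-ins.

```python
def make_reverse_example(token_ids):
    """Reverse a tokenized sequence so a standard left-to-right
    decoder learns to predict earlier tokens from later ones."""
    rev = list(reversed(token_ids))
    # Inputs/targets are shifted exactly as in forward LM training;
    # only the order of the sequence differs.
    inputs, targets = rev[:-1], rev[1:]
    return inputs, targets

# The model sees x_T ... x_2 and is trained to emit x_{T-1} ... x_1.
inputs, targets = make_reverse_example([1, 2, 3, 4])
print(inputs)   # [4, 3, 2]
print(targets)  # [3, 2, 1]
```

Because only the data pipeline changes, any standard decoder-only training stack can be reused unmodified.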
Key architectural and training details include:
- Multi-Query Attention, RoPE positional embeddings, RMSNorm, and SwiGLU activations.
- Context window of 8192 tokens.
- Training on 64 A100 GPUs, with AdamW optimizer and cosine learning rate schedule.
- Open release of model weights, code, and data.
Empirical Evaluation and Comparative Analysis
Ledom is evaluated on a suite of NLP benchmarks, including GSM8K (math), HumanEval (code), NQ-Open and TriviaQA (knowledge), and several reasoning and commonsense tasks. All evaluation prompts are reversed to match the model's pretraining regime.
Key empirical findings:
- General Reasoning and Commonsense: Ledom achieves performance comparable to FLMs on tasks like BoolQ and WinoGrande, especially at smaller model scales. However, a performance gap emerges at 7B scale, suggesting increased difficulty in modeling long-range dependencies in reverse.
- Code Generation: Ledom underperforms FLMs on HumanEval, highlighting the challenge of reverse generation for inherently forward-structured tasks.
- World Knowledge: Ledom lags behind FLMs on open-domain QA, likely due to the difficulty of recalling facts in a backward-oriented context.
- Mathematical Reasoning: While Ledom's raw scores are lower, qualitative analysis reveals distinct, often more diverse reasoning pathways, motivating its use as a complementary evaluator.
The training dynamics reveal that RLMs converge more slowly and to a higher loss than FLMs, reflecting greater uncertainty in predicting initial context from future tokens.
Case-Based Analysis: Unique Capabilities and Limitations
A detailed case study demonstrates Ledom's distinctive strengths:
- Abductive Reasoning: Ledom excels at generating plausible causal chains leading to a known outcome, making it well-suited for tasks requiring inference about antecedents.
- Story Generation: The model can construct coherent narrative build-ups to a specified ending, suggesting utility in simulation and explanation generation.
- Reverse Question Generation: Given an answer and supporting steps, Ledom can synthesize natural questions, facilitating data augmentation and educational content creation.
- Reversal Curse Mitigation: Ledom is more robust to the "reversal curse," showing improved ability to infer inverse relations compared to FLMs.
However, the reverse generation paradigm introduces unique safety risks. Existing safety filters, designed for left-to-right generation, may not suffice for reverse decoding, as evidenced by the model's ability to generate unsafe content from prompts that would be blocked in FLMs.
Reverse Reward: Bidirectional Inference for Enhanced Reasoning
The paper's most significant practical contribution is the Reverse Reward framework. Here, Ledom is used as a reward model to rerank or guide the outputs of FLMs. The reverse reward is defined as the likelihood of the input prompt given the candidate output, as scored by Ledom. This is combined with the forward model's probability in a bidirectional reward:
R(x, y) = P_FLM(y | x)^(1 - λ) · P_RLM(x | y)^λ
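In log space the combined reward reduces to a linear interpolation of the two log-likelihoods, which avoids underflow on long sequences. A minimal sketch, assuming per-sequence log-likelihoods from each model are already available:

```python
import math

def bidirectional_reward(logp_fwd, logp_rev, lam=0.5):
    """log R(x, y) = (1 - lam) * log P_FLM(y|x) + lam * log P_RLM(x|y).

    Exponentiating this quantity recovers the geometric
    interpolation R(x, y) defined above.
    """
    return (1.0 - lam) * logp_fwd + lam * logp_rev

# lam = 0 recovers the forward model's score alone.
assert bidirectional_reward(-2.0, -5.0, lam=0.0) == -2.0
# Equal weighting averages the two log-likelihoods.
assert math.isclose(bidirectional_reward(-2.0, -5.0, lam=0.5), -3.5)
```

Since the reward is only used to rank candidates, the log form can be used directly without exponentiation.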
Two inference strategies are proposed:
- Response-Level Reranking (Best-of-N): Generate N candidates with the FLM, rerank using the combined reward, and select the best.
- Step-wise Beam Search: At each reasoning step, expand candidate beams, score with the combined reward, and select top beams iteratively.
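Response-level reranking then amounts to scoring each candidate under both models and taking the argmax. A sketch with hypothetical scorer callables (`score_fwd`, `score_rev`) standing in for the FLM and Ledom log-likelihood computations:

```python
def rerank_best_of_n(prompt, candidates, score_fwd, score_rev, lam=0.5):
    """Best-of-N selection under the bidirectional reward, in log space.

    score_fwd(prompt, y) -> log P_FLM(y | prompt)   (hypothetical)
    score_rev(prompt, y) -> log P_RLM(prompt | y)   (hypothetical)
    """
    def combined(y):
        return (1.0 - lam) * score_fwd(prompt, y) + lam * score_rev(prompt, y)
    return max(candidates, key=combined)

# Toy scores: the forward model slightly prefers "a", but the reverse
# model strongly prefers "b", which wins under equal weighting.
fwd = {"a": -1.0, "b": -1.2}
rev = {"a": -4.0, "b": -1.0}
best = rerank_best_of_n("q", ["a", "b"],
                        lambda p, y: fwd[y],
                        lambda p, y: rev[y])
print(best)  # b
```

Step-wise beam search applies the same combined score per reasoning step rather than once per full response, pruning to the top beams at each expansion.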
Empirical Results on Mathematical Reasoning
Reverse Reward is evaluated on GSM8K, MATH-500, AIME 2024, and AMC 2023, using strong FLMs (DeepSeekMath, QwenMath, OpenMath2) as baselines. Across all models and datasets, Reverse Reward consistently improves accuracy over greedy and random selection baselines. For example, QwenMath achieves 96.1% on GSM8K and 80.8% on MATH-500 with Reverse Reward, outperforming standard decoding. Step-wise beam search further enhances performance on multi-step problems.
Ablation studies show that increasing the number of sampled candidates improves performance, and qualitative case studies demonstrate that Reverse Reward can correct errors missed by FLMs, especially in multi-step reasoning.
Theoretical and Practical Implications
The results establish that reverse language modeling is a viable and complementary paradigm to conventional FLMs. Ledom's unique posterior reasoning capabilities enable new forms of bidirectional inference, improving the quality and robustness of generative models, particularly in domains requiring complex reasoning or abductive inference.
Practical implications include:
- Hybrid Decoding: Integrating RLMs as evaluators or rerankers in FLM pipelines can systematically enhance output quality, especially for reasoning-intensive tasks.
- Data Augmentation and QA Generation: RLMs facilitate the generation of questions from answers, supporting scalable dataset creation.
- Safety and Alignment: The distinct failure modes of RLMs necessitate new safety and alignment strategies, potentially involving bidirectional or adversarial training.
- Mitigating Inductive Biases: Combining forward and reverse models can address asymmetries and generalization failures (e.g., reversal curse) inherent in unidirectional models.
Limitations and Future Directions
The study acknowledges several limitations:
- RLMs are less effective for forward-oriented tasks (e.g., code generation, sequential decision-making).
- The models are not trained at the largest possible scales, leaving open questions about scaling behavior.
- Evaluation is limited to English; cross-linguistic generalization remains unexplored.
- Optimal training recipes for RLMs are not yet established.
Future research directions include scaling RLMs, developing hybrid bidirectional architectures, refining safety mechanisms, and extending evaluation to diverse languages and domains.
Conclusion
Ledom demonstrates that reverse language modeling is a fundamental and practical alternative to conventional forward modeling. The open release of models, code, and data provides a foundation for further research into bidirectional and reverse modeling paradigms. The Reverse Reward framework, in particular, offers a robust method for leveraging the complementary strengths of forward and reverse models, with immediate applications in reasoning, evaluation, and data augmentation. This work opens a new axis of exploration in language modeling, with significant implications for both theoretical understanding and practical system design.