Recognition of Languages In-Context (RELIC)
- RELIC is an evaluation framework that tests LLMs’ ability to recognize formal languages using synthetic grammars and zero-shot prompts.
- It systematically scales task difficulty by varying grammar complexity and string lengths to analyze compositional reasoning.
- Empirical results show that model accuracy drops with increased complexity, revealing a shift from rule-based reasoning to superficial heuristics.
Recognition of Languages In-Context (RELIC) is a principled evaluation framework introduced to systematically probe and benchmark the ability of LLMs to follow compositional instructions in context. In the RELIC framework, "language recognition" refers to the classic formal-language problem: given a grammar, specified as a set of context-free production rules, and a candidate string, the model must decide whether the string can be generated by the grammar. RELIC is designed so that task complexity can be increased arbitrarily and so that the strategies LLMs deploy during in-context, zero-shot instruction following can be diagnosed.
1. Framework Overview
RELIC assesses whether LLMs can recognize strings belonging to synthetic formal languages given only a grammar provided in the prompt. The process consists of:
- Grammar Generation: Random context-free grammars (CFGs) of varying complexity (up to 500 nonterminal symbols and production rules) are constructed. Each grammar is a concrete set of production rules of two types: non-lexical rules, which rewrite a nonterminal as a sequence of nonterminals (e.g., A → B C), and lexical rules, which rewrite a nonterminal as a terminal symbol (e.g., A → "a").
- Sample Creation: For each grammar, the framework generates both positive examples (strings inside the language) and negative examples (strings over the same alphabet but outside the language), spanning a controlled range of lengths (1–50 symbols).
- Zero-shot Prompting: The LLM receives a prompt comprising the grammar rules and a single string, and is asked: “Does this string satisfy the grammar?” with a required YES/NO answer. No input-output examples are provided; only the specification (the "instructions") is available in context.
- Scalable Benchmarking: Because grammars and sample pairs are generated synthetically and can be made arbitrarily complex, RELIC enables a continually renewable, contamination-resistant, and difficulty-scalable evaluation for compositional reasoning.
This protocol is designed to ensure that LLMs must compose and apply a large number of contextual instructions—the grammar rules—to solve the problem, rather than relying on memorized associations or shallow pattern matching.
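To make the protocol concrete, the sketch below shows how a RELIC-style zero-shot prompt could be assembled and scored. It is a minimal illustration, not the framework's released code: the toy grammar, the prompt wording, and the `query_model` stub are assumptions introduced here for demonstration.

```python
# Minimal sketch of a RELIC-style zero-shot evaluation loop.
# The toy grammar, prompt wording, and query_model() stub are illustrative
# assumptions, not the framework's actual implementation.

from typing import List, Tuple

# A toy CFG: non-lexical rules rewrite a nonterminal to two nonterminals,
# lexical rules rewrite a nonterminal to a terminal symbol.
NON_LEXICAL: List[Tuple[str, Tuple[str, str]]] = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
]
LEXICAL: List[Tuple[str, str]] = [
    ("NP", "dog"),
    ("NP", "cat"),
    ("V", "sees"),
]

def render_prompt(candidate: str) -> str:
    """Render the grammar rules and one candidate string as a zero-shot prompt."""
    lines = ["You are given a context-free grammar with start symbol S:"]
    lines += [f"{lhs} -> {a} {b}" for lhs, (a, b) in NON_LEXICAL]
    lines += [f"{lhs} -> {t}" for lhs, t in LEXICAL]
    lines.append(f"String: {candidate}")
    lines.append("Can this string be generated by the grammar? Answer YES or NO.")
    return "\n".join(lines)

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; expected to return 'YES' or 'NO'."""
    raise NotImplementedError

def evaluate(samples: List[Tuple[str, bool]]) -> float:
    """Compare the model's YES/NO answers against ground-truth membership labels."""
    correct = 0
    for string, label in samples:
        answer = query_model(render_prompt(string)).strip().upper()
        correct += int((answer == "YES") == label)
    return correct / len(samples)

print(render_prompt("dog sees cat"))
```

Because no input-output demonstrations appear in the prompt, the model's only route to a correct answer is to apply the listed rules compositionally.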
2. Task Complexity
Two dimensions directly increase task difficulty in RELIC:
- Grammar Complexity Parameters: Complexity is controlled by the number of non-lexical production rules, the sizes of the nonterminal and terminal alphabets, and the number of lexical rules. As grammars become larger and more recursive, successful recognition requires composing deeper and more varied rule sequences.
- String (Example) Complexity: Longer candidate strings require the model to reason over longer derivations. Standard context-free parsing (e.g., the CKY algorithm) takes O(n^3) time in the string length n, so the burden on both algorithmic reasoning and prompt inference grows quickly with n (a minimal recognizer illustrating this cost is sketched after this list).
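As a reference point for that cost, below is a textbook CKY recognizer over a small toy grammar in Chomsky normal form. It is not RELIC's ground-truth checker, just a standard implementation that makes the cubic number of (span, split) combinations explicit.

```python
# Textbook CKY recognition for a CFG in Chomsky normal form.
# Illustrates the O(n^3 * |G|) cost of exact membership checking;
# the toy grammar below is an assumption for demonstration only.

from itertools import product

NON_LEXICAL = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}   # rules A -> B C
LEXICAL = {"dog": {"NP"}, "cat": {"NP"}, "sees": {"V"}}    # rules A -> terminal

def cky_recognize(tokens, start="S"):
    n = len(tokens)
    # chart[i][j] holds the nonterminals that derive tokens[i:j+1]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][i] = set(LEXICAL.get(tok, set()))
    for span in range(2, n + 1):                  # O(n) span lengths
        for i in range(n - span + 1):             # O(n) start positions
            j = i + span - 1
            for k in range(i, j):                 # O(n) split points
                for B, C in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= NON_LEXICAL.get((B, C), set())
    return start in chart[0][n - 1]

print(cky_recognize("dog sees cat".split()))   # True: S -> NP VP -> NP V NP
print(cky_recognize("sees dog".split()))       # False: not derivable from S
```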
Empirically, model accuracy is observed to decrease monotonically with either higher grammar complexity or increased string length. For the most complex settings (e.g., 500-rule grammars, long strings), all models tested perform at or near chance, indicating substantial and inherent computational challenges for current architectures.
3. Evaluation Results
Multiple leading LLMs—OpenAI GPT-4.1-nano, GPT-4.1-mini, GPT-4.1, o4-mini, o3, Google Gemma-3-1b-it, Gemma-3-4b-it, and DeepSeek-R1-Distill-Qwen-7B—were evaluated on the RELIC-500 benchmark (200 grammars, multiple examples each, spanning the full complexity spectrum):
Model | Accuracy (%) | Macro F1 (%) |
---|---|---|
gpt-4.1-nano | 54.1 | 44.4 |
gpt-4.1-mini | 63.4 | 57.5 |
gpt-4.1 | 53.9 | 48.5 |
o4-mini | 59.2 | 58.1 |
o3 | 70.4 | 70.1 |
gemma-3-1b | 48.8 | 30.9 |
gemma-3-4b | 48.7 | 34.3 |
DSR1-7B | 47.9 | 29.6 |
For simple grammars and short strings, some models (notably OpenAI o3) achieve moderate accuracy (~70%). As complexity increases, all models, including the most powerful, approach random guessing (chance = 50%). Performance consistently declines as grammar complexity and string length grow, and the same examples tend to be challenging across models, confirming that the difficulty is intrinsic to the task.
There are also systematic response biases: on longer or more complex examples, some models default to answering 'YES' for every string while others default to 'NO', revealing that superficial heuristics supplant compositional reasoning under high difficulty.
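For reference, accuracy, macro F1, and a simple YES-rate (a crude probe for the default-answer biases described above) can be computed directly from the binary answers, as in the generic sketch below; this is an illustration of the standard metrics, not the benchmark's scoring code.

```python
# Generic scoring sketch for YES/NO predictions against gold labels:
# accuracy, macro F1 (mean of per-class F1), and YES-rate as a bias probe.
# Illustration only; not the benchmark's released scoring code.

def f1(preds, golds, positive):
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def score(preds, golds):
    accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    macro_f1 = (f1(preds, golds, "YES") + f1(preds, golds, "NO")) / 2
    yes_rate = preds.count("YES") / len(preds)  # near 1.0 or 0.0 signals a default-answer bias
    return accuracy, macro_f1, yes_rate

preds = ["YES", "YES", "YES", "NO"]
golds = ["YES", "NO", "YES", "NO"]
print(score(preds, golds))   # (0.75, ~0.73, 0.75)
```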
4. Diagnostic Insights
RELIC’s design allows granular analysis of LLM reasoning strategies:
- Chain-of-Thought (CoT) Tracing: On short or simple tasks, models may produce explicit rule-application traces, generating each step of a grammatical derivation. However, as complexity increases, CoT reasoning either truncates (the model becomes overwhelmed or abandons the derivation), or descends into irrelevant or hallucinated steps.
- Strategy Shifts: Quantitative scoring (“LLM as a judge”) demonstrates that as grammar or string complexity rises, the proportion of “rule-based” (i.e., explicit, correct) reasoning drops. Models increasingly rely on superficial heuristics—length-based, token presence/absence, or global summary statements—rather than recursive, compositional parsing.
- Test-Time Compute Plateau: For correct parsing, CoT output should scale rapidly with string length (on the order of O(n^2) to O(n^3) reasoning steps for chart-style parsing), but LLMs' CoT token counts peak at moderate input complexity and then shrink for harder inputs, signaling a practical computational bottleneck that mirrors the theoretical expectations (see the back-of-the-envelope estimate after this list).
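A rough back-of-the-envelope estimate makes the plateau concrete. Under the illustrative assumption that a faithful derivation trace spends at least one reasoning step per CKY (cell, split point) combination, the required trace length outgrows typical CoT budgets well before the benchmark's maximum string length.

```python
# Back-of-the-envelope estimate of the reasoning steps a faithful
# CKY-style derivation trace would need, assuming (illustratively)
# at least one chain-of-thought step per (cell, split point) combination.

def cky_cells(n):
    # number of chart cells (contiguous spans), grows as O(n^2)
    return n * (n + 1) // 2

def cky_splits(n):
    # number of (cell, split point) combinations:
    # sum over span lengths s of (n - s + 1) * (s - 1), grows as O(n^3)
    return (n**3 - n) // 6

for n in (10, 25, 50):
    print(f"n={n:>2}: ~{cky_cells(n):>4} cells, ~{cky_splits(n):>6} split checks")
# n=50 already implies ~20,000 split checks before multiplying by the
# number of grammar rules, far beyond typical chain-of-thought lengths.
```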
5. Theoretical Limits and Complexity
The RELIC task is provably hard for current LLMs and transformer-like architectures:
- Complexity Classes: Even without ε-productions, deciding whether a string is generated by a context-free grammar supplied as part of the input is P-hard, while transformers with bounded test-time computation are confined to the (conjecturally much weaker) circuit class TC⁰, making efficient, fully compositional parsing unattainable in principle for large input sizes.
- Compute versus Accuracy: Correct parsing requires that the amount of chain-of-thought computation grows superlinearly in string length, but practical LLM deployments enforce prompt/computation length limits far below this “hardness threshold.”
This fundamental gap explains the consistent collapse of accuracy as complexity grows, and sets a concrete ceiling on instruction-following via this in-context approach.
6. Research, Applications, and Future Directions
RELIC provides a scalable, contamination-resistant, and automatable benchmark for instruction-following in LLMs. Because all grammars and samples are synthetic and customizable, the framework (a) avoids issues of overlap or memorization, and (b) can always be made more challenging as models improve. The evaluation paradigm is robust to overfitting and supports continual task regeneration.
For model improvement, the findings highlight substantial unsolved challenges in robust, efficient compositional reasoning. Potential future directions suggested in the paper include increasing chain-of-thought length at test time, exploring tree-structured or more compositional network architectures, and supplementing LLMs with external parsers or adversarially generated evaluation samples. RELIC may also be extended to tasks such as full parse generation, grammar translation, or richer formal language classes.
7. Summary Table: Key Empirical Results
Model | Accuracy (%) | Macro F1 (%) |
---|---|---|
gpt-4.1-nano | 54 | 44 |
gpt-4.1-mini | 63 | 58 |
gpt-4.1 | 54 | 49 |
o4-mini | 59 | 58 |
o3 | 70 | 70 |
gemma-3-1b | 49 | 31 |
gemma-3-4b | 49 | 34 |
DSR1-7B | 48 | 30 |
Performance is highest on easy grammars and drops steeply with task complexity, reaching near chance for most models on hard cases.
RELIC thus rigorously measures the degree to which LLMs can perform compositional, multi-step in-context instruction following. Its empirical results and diagnostic analysis reveal a persistent reliance on heuristics rather than true rule composition as task complexity scales, in line with theoretical limitations of current neural architectures. The framework sets a roadmap for research into both more powerful models and deeper, more systematic benchmarks for advanced in-context instruction following.