Token-Level Group Average Analysis
- Token-level group average is a family of aggregation methods that conditions token measurements on explicit groups to expose latent heterogeneity.
- It is applied in domains like software engineering, seq2seq analysis, and reinforcement learning to diagnose performance patterns and optimize model behavior.
- Different averaging strategies—equal weighting, occurrence balancing, and intra-sample normalization—enable precise insights into token dynamics and reward structures.
Token-level group average denotes a family of aggregation procedures in which token-resolved quantities are averaged within explicitly defined groups in order to expose structure that is obscured by response-level or sequence-level summaries. In recent work, the grouped object has ranged from token counts in software-engineering stages, to token-type accuracy curves in seq2seq validation, to token-level rewards, entropies, and advantages in RL fine-tuning. The common motif is not a single universal formula, but a shared methodological move: define a grouping scheme, compute token-level signals, and aggregate them in a way that makes heterogeneity measurable and actionable (Salim et al., 20 Jan 2026, Bao et al., 2023, Lin et al., 14 Apr 2026).
1. Conceptual scope
The term applies to several distinct but structurally related settings. In agentic software engineering, groups are SDLC stages such as Design, Coding, Code Completion, Code Review, Testing, and Documentation, and the token-resolved signals are input, output, and reasoning tokens. In token-level fitting analysis for seq2seq models, groups are subsets of the vocabulary induced by frequency, POS, or prediction discrepancy, and the token-resolved signal is validation accuracy as a function of checkpoint epoch. In rubric-based RL, groups can be formed within each sampled response and each constraint, so that token rewards are normalized inside a response rather than across the whole batch. In GRPO-style reasoning RL, token-level quantities such as entropy, hidden reward, or masked KL penalties are aggregated to connect sequence-level rewards with token-level updates (Salim et al., 20 Jan 2026, Bao et al., 2023, Xu et al., 3 Apr 2026, Deng et al., 4 Oct 2025).
These uses differ in purpose. Some are descriptive, as in token accounting over the SDLC. Some are diagnostic, as in identifying overfitting and underfitting by token class. Others are prescriptive, where the group average is part of the optimization rule itself. A useful Editor's term for this shared pattern is “group-conditioned token aggregation”: token-level statistics are never interpreted in isolation, but only after being conditioned on a group definition that encodes what counts as comparable.
2. Group construction and averaging operators
Across the literature, the central design choice is how the group is defined and whether the averaging is equal-weight, occurrence-balanced, or normalized within each sample. In the software-engineering analysis, if $s$ indexes a development stage, $c$ a token category, and $N_s$ the number of tasks in which stage $s$ was executed, the reported group-average token count is

$$\bar{T}_{s,c} = \frac{1}{N_s} \sum_{i=1}^{N_s} T_{i,s,c},$$

where $T_{i,s,c}$ is the number of category-$c$ tokens consumed by stage $s$ in the $i$-th task that executed it.
Stage-internal percentages are then formed by dividing the group-average count for category $c$ by the sum over the input, output, and reasoning categories and multiplying by 100 (Salim et al., 20 Jan 2026).
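The conditional averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the record layout `(task_id, stage, category, tokens)`, the function names, and the toy numbers are all assumptions made for the example. The point it shows is that each stage is averaged only over the tasks that actually executed it.

```python
from collections import defaultdict

CATEGORIES = ("input", "output", "reasoning")

def stage_token_averages(records):
    """Average token counts per (stage, category) over tasks that ran the stage.

    records: iterable of (task_id, stage, category, tokens) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(float)  # (stage, category) -> summed tokens
    tasks = defaultdict(set)     # stage -> ids of tasks that executed it
    for task_id, stage, category, tokens in records:
        totals[(stage, category)] += tokens
        tasks[stage].add(task_id)
    return {
        stage: {c: totals[(stage, c)] / len(ids) for c in CATEGORIES}
        for stage, ids in tasks.items()
    }

def stage_percentages(averages):
    """Turn group-average counts into stage-internal percentages."""
    out = {}
    for stage, avg in averages.items():
        denom = sum(avg.values())
        out[stage] = {c: (100.0 * v / denom if denom else 0.0)
                      for c, v in avg.items()}
    return out

# Toy data: Coding ran in two tasks, Testing in only one.
records = [
    ("t1", "Coding", "input", 100), ("t1", "Coding", "output", 700),
    ("t1", "Coding", "reasoning", 200),
    ("t2", "Coding", "input", 300), ("t2", "Coding", "output", 500),
    ("t2", "Coding", "reasoning", 200),
    ("t1", "Testing", "input", 600), ("t1", "Testing", "output", 200),
    ("t1", "Testing", "reasoning", 200),
]
avg = stage_token_averages(records)
pct = stage_percentages(avg)
```

In the toy data, Testing is divided by one task rather than two, which is the sense in which the stage averages are conditional on execution.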
In seq2seq fitting analysis, for any token group $G$, the group-average validation accuracy at epoch $t$ is

$$\mathrm{Acc}_G(t) = \frac{1}{|G|} \sum_{w \in G} \mathrm{acc}_w(t),$$

where $\mathrm{acc}_w(t)$ is the validation accuracy of token type $w$ at epoch $t$.
The same work defines, for each group, a best-fit epoch, a fitting-offset, and a potential-gain on top of this curve. Crucially, no additional per-token weighting is applied once the group is formed; each token type in $G$ carries equal weight, even though the frequency and discrepancy bins themselves are constructed to contain equal total token occurrences (Bao et al., 2023).
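The equal-weight group curve and the quantities derived from it can be illustrated as follows. This is a sketch under stated assumptions: the function names, the dictionary layout, and the toy accuracies are hypothetical, and the sign convention for the offset (best-fit epoch minus early-stopping epoch) is chosen here for illustration and may differ from the paper's.

```python
def group_accuracy_curve(acc_by_type, group):
    """Equal-weight mean accuracy per epoch over the token types in `group`.

    acc_by_type[token] is a list of validation accuracies, one per epoch.
    """
    curves = [acc_by_type[tok] for tok in group]
    n_epochs = len(curves[0])
    return [sum(c[t] for c in curves) / len(curves) for t in range(n_epochs)]

def fit_summary(curve, early_stop_epoch):
    """Best-fit epoch, signed offset, and potential gain for one group."""
    best_epoch = max(range(len(curve)), key=curve.__getitem__)
    return {
        "best_epoch": best_epoch,
        # ASSUMED convention: negative -> the group peaked before early
        # stopping (overfit at the stop point); positive -> it peaks later.
        "offset": best_epoch - early_stop_epoch,
        "potential_gain": curve[best_epoch] - curve[early_stop_epoch],
    }

# Toy per-type accuracy curves over four epochs (invented numbers).
acc_by_type = {
    "the":   [0.90, 0.95, 0.93, 0.92],
    "of":    [0.80, 0.85, 0.83, 0.80],
    "quark": [0.10, 0.20, 0.30, 0.45],
}
function_curve = group_accuracy_curve(acc_by_type, ["the", "of"])
rare_curve = group_accuracy_curve(acc_by_type, ["quark"])
function_fit = fit_summary(function_curve, early_stop_epoch=2)
rare_fit = fit_summary(rare_curve, early_stop_epoch=2)
```

Each token type contributes one curve regardless of how often it occurs, which mirrors the equal-weight property stated above.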
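In rubric-based RL, group formation is explicitly intra-sample. For a fixed sampled response $y_i$ and constraint $c_k$, the token-level rewards $\{r_{i,k,t}\}_{t=1}^{|y_i|}$ form one group. The per-group statistics are the mean and standard deviation over the tokens of that response,

$$\mu_{i,k} = \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} r_{i,k,t}, \qquad \sigma_{i,k} = \sqrt{\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \left(r_{i,k,t} - \mu_{i,k}\right)^2},$$

followed by token normalization

$$\hat{r}_{i,k,t} = \frac{r_{i,k,t} - \mu_{i,k}}{\sigma_{i,k}}.$$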
This design is meant to eliminate the length bias that arises if all tokens from all responses are normalized jointly (Xu et al., 3 Apr 2026).
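The contrast between normalizing inside each response and normalizing over all tokens jointly can be sketched as follows. This is an illustrative example only: the function names, the `(response_id, constraint_id)` keys, the toy rewards, and the small `eps` added to the denominator are assumptions, not details taken from the paper.

```python
import math

def standardize(values, eps=1e-8):
    """Zero-mean, unit-variance scaling (population std; eps is an assumption)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]

def intra_sample_normalize(rewards):
    """Standardize token rewards separately for each (response, constraint)."""
    return {key: standardize(vals) for key, vals in rewards.items()}

def pooled_normalize(rewards):
    """Standardize all tokens jointly, then split back into groups."""
    keys = list(rewards)
    flat = [v for k in keys for v in rewards[k]]
    normed = standardize(flat)
    out, pos = {}, 0
    for k in keys:
        n = len(rewards[k])
        out[k] = normed[pos:pos + n]
        pos += n
    return out

# Toy rewards: one short and one long response for the same constraint.
rewards = {
    ("short", "c1"): [0.2, 0.8],
    ("long", "c1"): [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1],
}
intra = intra_sample_normalize(rewards)
pooled = pooled_normalize(rewards)
```

With pooled statistics, the long response contributes eight of the ten tokens and so dominates the mean and variance; the short response's normalized rewards no longer average to zero. Intra-sample normalization centers every group on its own mean regardless of length.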
| Setting | Group definition | Averaged signal |
|---|---|---|
| Agentic software engineering | SDLC stage × token category | Token counts |
| Seq2seq fitting analysis | Vocabulary subset | Validation accuracy across epochs |
| Rubric-based RL | Sampled response × constraint | Token-level rewards or advantages |
The comparison shows that “token-level group average” is not a single estimator but a family of estimators whose semantics depend on grouping, weighting, and normalization.
3. Token accounting in agentic software engineering
A concrete use of token-level group averages appears in the SDLC analysis of ChatDev traces over 30 software-development tasks using a GPT-5 reasoning model. ChatDev phases are mapped to six stages: Design corresponds to DemandAnalysis and LanguageChoose; Coding to Coding; Code Completion to CodeComplete; Code Review to CodeReview; Testing to Test; and Documentation to EnvironmentDoc, Reflection, and Manual. Within this standardized mapping, tokens are partitioned into input tokens, output tokens, and reasoning tokens, and both overall and stage-specific shares are computed (Salim et al., 20 Jan 2026).
The reported overall per-task breakdown across all stages combined is input 53.9%, output 24.4%, and reasoning 21.6%. The stage-level averages are expressed as percentages of all tokens consumed in a task and are reported with error bars of one standard deviation over the tasks that executed that stage; Code Completion and Testing were executed in only a subset of the tasks. The stage-specific tokenomic profiles differ sharply (Salim et al., 20 Jan 2026):

| Stage | Share of task tokens | Input | Output | Reasoning |
|---|---|---|---|---|
| Design | 2.4% | 60.4% | 3.6% | 36.0% |
| Coding | 8.6% | 6.9% | 58.0% | 35.1% |
| Code Completion | 26.8% | 47.7% | 41.7% | 10.5% |
| Code Review | 59.4% | 51.4% | 24.7% | 23.9% |
| Testing | 10.3% | 60.8% | 20.7% | 18.4% |
| Documentation | 20.1% | 80.2% | 8.3% | 11.5% |
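A quick arithmetic check on the reported figures helps in reading them. The sketch below only sums the percentages quoted above; the variable names are illustrative. It shows that the six stage shares add up to more than 100%, which is consistent with their being averages conditional on stage execution rather than a partition of a single task's budget, while each stage's input/output/reasoning profile sums to 100% up to rounding.

```python
# Stage shares of total task tokens, as quoted in the text (percent).
stage_share = {
    "Code Review": 59.4, "Code Completion": 26.8, "Documentation": 20.1,
    "Testing": 10.3, "Coding": 8.6, "Design": 2.4,
}

# Per-stage (input, output, reasoning) profiles, as quoted in the text.
profiles = {
    "Design": (60.4, 3.6, 36.0),
    "Coding": (6.9, 58.0, 35.1),
    "Code Completion": (47.7, 41.7, 10.5),
    "Code Review": (51.4, 24.7, 23.9),
    "Testing": (60.8, 20.7, 18.4),
    "Documentation": (80.2, 8.3, 11.5),
}

total_stage_share = round(sum(stage_share.values()), 1)
profile_sums = {stage: round(sum(p), 1) for stage, p in profiles.items()}

print(total_stage_share)  # 127.6
print(profile_sums)
```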
Several interpretations follow directly from these numbers. The Code Review stage alone consumes nearly 60% of all tokens, which the study attributes to repeated conversational loops in which agents pass full code contexts back and forth, described as a “Cost of Conversation.” Input tokens dominate total consumption at approximately 54%, which the paper treats as empirical support for a “communication tax”: substantial budget is spent re-transmitting context rather than generating new code. The authors therefore conclude that the primary cost of agentic software engineering lies not in initial code generation but in automated refinement and verification. The dataset also spans a wide range of task complexity, with total reasoning tokens ranging from 17,280 to 40,000, and the paper suggests human-in-the-loop checkpoints before extensive code reviews, token-efficient collaboration protocols such as context differencing or compressed representations, and unified benchmarking across LLM-MA architectures (Salim et al., 20 Jan 2026).
4. Token-type averages in seq2seq fitting analysis
A different lineage of token-level group averages studies model fitting heterogeneity across the vocabulary rather than resource usage. In seq2seq models trained with early stopping, the group average is used to detect whether different token classes would have preferred earlier or later checkpoints. The analysis introduces two group-level metrics. The fitting-offset measures the signed epoch distance between the group's best-fit checkpoint and the early-stopping checkpoint: a best-fit checkpoint earlier than the early-stopping checkpoint indicates overfitting, and a later one indicates underfitting. The potential-gain measures the increase in validation accuracy that would be obtained by evaluating the group at its own best-fit epoch instead of the early-stopping epoch (Bao et al., 2023).
The empirical results, averaged over 40 seeds, show pronounced groupwise asymmetry. Under frequency grouping, a fitting-offset and a potential gain are reported for each of the high-, medium-, and low-frequency bins, with high-frequency tokens tending to overfit and low-frequency tokens tending to underfit. Under POS grouping, function words overfit and nouns underfit, and the largest potential gains occur for NUM and NOUN. Under discrepancy grouping, offsets are reported separately for big-, medium-, and small-discrepancy tokens, together with a potential gain for the small-discrepancy group (Bao et al., 2023).
Pairwise groupings sharpen the picture. Even within high-frequency tokens, nouns and adjectives underfit while function words overfit. Low-frequency symbols can overfit, despite the broader tendency of low-frequency groups to underfit. Notable potential gains are reported for low-frequency symbols and for high-frequency adjectives. For discrepancy and frequency, the largest potential gain is on low-frequency, medium-discrepancy tokens. For discrepancy and POS, potential gains are reported for small-discrepancy symbols and numbers, with the highest values among big-discrepancy numbers. Sentence length shows only mild underfitting, and pretraining weakens the underfitting pattern: with early-stopping mBART, high-frequency tokens still overfit, nouns underfit slightly, and function words overfit (Bao et al., 2023).
This body of results establishes that token-level group averages can detect heterogeneity that is invisible in aggregate validation curves. It also clarifies a frequent misunderstanding: these group averages are not frequency-weighted summaries of model performance. Once the group is defined, each token type contributes equally.
5. Token-level group averages in reinforcement learning for LLMs
In RL fine-tuning, token-level group averages become part of the credit-assignment mechanism itself. One line of work introduces Token Hidden Reward (THR), a token-level metric under GRPO that measures how much a token in one sampled response influences the likelihood increase of correct responses. The group-marginalized THR score averages each token's effect over all correct responses in the sampled group. Its magnitude measures strength of influence, while its sign determines whether the token supports exploitation or preserves probability mass for alternatives. A reweighting rule then uses a scalar parameter to bias training toward exploitation in one regime and toward exploration in the other. Averaged over several math benchmarks, the reported greedy-decoding gains for THR are from 10.9% to 11.8% for Qwen2.5-0.5B, from 34.8% to 36.3% for Qwen2.5-1.5B, and from 43.3% to 46.8% for Qwen2.5-7B; the exploration setting improves average Pass@K relative to GRPO and to the other THR configuration, and the method also transfers to GSPO-token and Llama3.2-3B-Instruct (Deng et al., 4 Oct 2025).
A second line of work, TEPO, links group-level rewards to token-level aggregation through a sequence-level likelihood ratio, which is then broadcast across the tokens of a sampled sequence. TEPO adds a token-level KL mask so that only high-advantage, decreasing-entropy tokens are softly regularized. In the reported experiments, sequence-mean token-mean aggregation reaches 32.21% average accuracy versus 31.33% for GSPO and 31.80% for a sequence-mean token-sum variant; TEPO with the selective KL mask reaches 32.59% average accuracy, compared with 32.21% for TEPO without KL and 32.03% for applying KL to both misaligned regimes. An undifferentiated KL penalty is described as highly fragile, with one reported KL coefficient collapsing performance to 0%, and TEPO reduces convergence time by about 50% relative to GRPO/DAPO. The reported headline benchmarks are 32.59% versus 30.85% on Qwen2.5-7B and 44.02% versus 41.51% on Qwen3-14B (Lin et al., 14 Apr 2026).
Entropy-weighted reward shaping offers yet another variant. GTPO computes per-token Shannon entropy under the old policy and uses it to form shaped token rewards for successful sequences, while GRPO-S assigns each sequence a reward boost based on its average token entropy. The paper reports that on multi-step mathematical reasoning and code generation prompts, GTPO raises best@32 reward from about 0.55 under DAPO to about 0.65, while GRPO-S reaches about 0.62; on best@16, GTPO rises from about 0.45 to about 0.60 and GRPO-S to about 0.55. Both methods also increase model entropy from about 0.6 to about 0.8 nat/token and increase average response length from about 230 to about 300 tokens, with improvements reported as consistent across multiple seeds and supported by paired $t$-tests (Tan et al., 6 Aug 2025).
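The two entropy quantities named above, per-token Shannon entropy and its average over a sequence, can be computed as in the sketch below. It covers only the entropy step: the way GTPO and GRPO-S turn these values into shaped rewards is not reproduced here. The function names and toy distributions are assumptions, and natural logarithms are used to match the nat/token unit quoted in the text.

```python
import math

def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sequence_entropies(token_dists):
    """Per-token entropies and their mean over one sampled sequence."""
    ents = [token_entropy(d) for d in token_dists]
    return ents, sum(ents) / len(ents)

# Toy sequence of three positions over a 4-token vocabulary (invented).
token_dists = [
    [1.0, 0.0, 0.0, 0.0],      # fully confident position
    [0.5, 0.5, 0.0, 0.0],      # two equally likely tokens
    [0.25, 0.25, 0.25, 0.25],  # uniform over the vocabulary
]
ents, mean_entropy = sequence_entropies(token_dists)
```

The per-token list is the kind of signal a token-level scheme would consume, and the sequence mean is the kind of signal a sequence-level scheme would consume.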
Rubric-based RL extends the grouping dimension further. In Rubrics to Tokens (RTT), token-level rewards are formed by multiplying a response-level rubric score for a constraint by the relevance discriminator's tokenwise probabilities. Intra-sample Token Group Normalization (ISTGN) then standardizes these rewards within each response and each constraint. The combined advantage uses both response-level and token-level components, with the weighting hyperparameters and the rollout group size fixed in the reported setting. Empirically, ISTGN improves instruction-level accuracy by 1–2% absolute across IFEval, IFBench, and MulDimIF, converges faster and with lower entropy than RL-CSR, prevents instability on Llama3.2-3B where standard RL-CSR collapses, and adds less than 10% extra GPU time relative to standard RL (Xu et al., 3 Apr 2026).
6. Methodological significance, misconceptions, and limits
The principal significance of token-level group averages is that they make “where” and “for whom” a signal occurs analytically explicit. In the SDLC study, the answer is where token budgets are spent: predominantly in Code Review and in input-context retransmission rather than in initial generation. In seq2seq diagnosis, the answer is which token classes overfit or underfit under early stopping. In RL, the answer is which tokens should carry more or less policy-update mass under sparse sequence-level rewards. These are distinct questions, but each becomes tractable only after token-level measurements are organized by an explicit group structure (Salim et al., 20 Jan 2026, Bao et al., 2023, Deng et al., 4 Oct 2025).
Several misconceptions are directly contradicted by the reported results. First, token-level analysis does not imply a single canonical weighting rule. Equal-weight token-type averages in seq2seq diagnosis, occurrence-balanced bin construction, stagewise averages over tasks that executed a stage, and intra-sample normalization within a response are all legitimate but non-equivalent designs. Second, token-level statistics are not inherently more “fine-grained” in a useful sense unless the grouping scheme matches the failure mode under study. The RTT results identify inter-sample token normalization as length-biased, which is precisely why ISTGN normalizes within each response. TEPO similarly argues that undifferentiated entropy or KL regularization can induce entropy collapse or model degradation under sparse token rewards, and its ablations show that indiscriminate KL can be fragile (Xu et al., 3 Apr 2026, Lin et al., 14 Apr 2026).
The limits are equally clear from the data. The software-engineering study describes its findings as preliminary and reports conditional stage averages for stages that are not always executed, including Code Completion and Testing, so its percentages are informative about stage-specific load but are not a single exhaustive partition of total task cost. THR, TEPO, and GTPO/GRPO-S are evaluated primarily on mathematical reasoning settings, with GTPO additionally including code generation prompts; this suggests generality across RL formulations, but cross-domain transfer beyond the reported settings remains an inference rather than a direct empirical claim. A plausible implication is that token-level group averages are best understood not as a standalone metric, but as a design pattern for exposing latent structure in tokenized systems, whether the underlying object is computation, fit, or reward.