Programming Keywords Probability (PKP) in LLMs
- Programming Keywords Probability (PKP) is a metric that quantifies a large language model's baseline tendency to generate reserved programming keywords in zero-context settings.
- It aggregates cold-start probabilities via softmax over a curated set of programming tokens, providing insights into code-readiness and token distribution.
- PKP analysis aids in diagnosing token-level degeneration and informs optimization strategies, including quantization, distillation, and instruction-tuning.
Programming Keywords Probability (PKP) quantifies the baseline tendency of LLMs to generate programming constructs, specifically reserved programming language keywords, when given no input context. Introduced in “Compressed code: the hidden effects of quantization and distillation on programming tokens” (Siniaev et al., 5 Jan 2026), PKP is a softmax-aggregated metric that reflects how code-ready a model remains under various tokenization, quantization, and distillation regimes. It is used to diagnose token-level degeneration, to check that code-related distributions are preserved, and to assess code generation readiness, particularly as models undergo compression or optimization.
1. Formal Definition and Related Metrics
PKP is defined for an LLM with vocabulary $V$ and a set of reserved keywords $K$, drawn from 276 tokens spanning twelve major programming languages. For any token $t \in V$, its cold-start probability under zero context is the softmax of its zero-context logit $z_t$:

$$p_0(t) = \frac{\exp(z_t)}{\sum_{v \in V} \exp(z_v)}$$

PKP is the aggregate cold-start probability mass assigned to all programming keywords:

$$\mathrm{PKP} = \sum_{k \in K \cap V} p_0(k)$$

Three ancillary metrics further dissect token-level behavior:
- Special Tokens Probability (STP): Sum over the set $S$ of punctuation and structural tokens ($\mathrm{STP} = \sum_{s \in S} p_0(s)$).
- Keyword Average Probability (KAP): Per-keyword mean cold-start probability, $\mathrm{KAP} = \mathrm{PKP} / |K \cap V|$.
- Natural Language Probability (NLP): Sum over a control set of generic English tokens.

Vocabulary coverage,

$$\mathrm{Coverage} = \frac{|K \cap V|}{|K|}$$

measures how many canonical keywords survive the tokenizer construction.
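The definitions above reduce to a softmax followed by a few index sums. The following is a minimal sketch, not the authors' implementation: the function names, the toy six-token vocabulary, and the choice to average KAP over the keywords actually present in the vocabulary are all assumptions made for illustration.

```python
import numpy as np

def cold_start_probs(logits):
    """Softmax over the whole vocabulary, shifted by the max for stability."""
    z = np.asarray(logits, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

def cold_start_metrics(logits, keyword_ids, special_ids, natural_ids):
    """Return PKP, STP, KAP, and NLP for one zero-context logit vector."""
    p = cold_start_probs(logits)
    pkp = float(p[keyword_ids].sum())
    return {
        "PKP": pkp,
        "STP": float(p[special_ids].sum()),
        # Assumption: the mean is taken over keywords present in the vocabulary.
        "KAP": pkp / len(keyword_ids),
        "NLP": float(p[natural_ids].sum()),
    }

# Hypothetical toy vocabulary: ids 0-1 are keywords, ids 2-3 are punctuation,
# ids 4-5 are generic English tokens. Logits are log-probabilities, so the
# softmax recovers the listed probabilities exactly.
toy_logits = np.log([0.10, 0.05, 0.30, 0.15, 0.25, 0.15])
metrics = cold_start_metrics(toy_logits, [0, 1], [2, 3], [4, 5])
print(metrics)
```

In practice `logits` would come from a single forward pass on an empty (or start-of-sequence only) input, and the id lists from looking up the curated token sets in the tokenizer.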
2. Tokenizer Vocabulary Structure and Keyword Preservation
Examining leading tokenizers (DeepSeek-R1/V3, Qwen2.5, Llama 3.1), PKP research uncovers high overlap in base vocabularies (~128K–151K tokens), with programming keywords heavily stratified by frequency:
- “Polysemic” keywords (e.g., `return`, `in`, `with`) maintain prominent low ranks (< 700), due to natural language overlap.
- Niche or framework-specific keywords (e.g., `constexpr`, `noexcept`, `jsx`) occupy tail ranks (40K–90K) and are sometimes absent entirely; React keyword coverage falls below 60%.
- Distribution analysis: >60% of programming keywords reside below rank 1,000, with 5–15% of tokens devoted to syntax (punctuation, braces).

These statistics indicate that code specialization in LLMs arises primarily from frequency-induced token reordering rather than from new vocabulary. Coverage and rank histograms diagnose token preservation across languages (Python 97%, Go 96%, C++ 73%, React 56%).
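The coverage and rank statistics described above can be approximated from a token-to-id mapping alone. The sketch below assumes that the token id serves as a proxy for frequency rank and that a keyword counts as covered only if it appears verbatim as a single vocabulary entry; both are simplifications, and the toy `vocab` values are hypothetical rather than taken from any real tokenizer.

```python
def keyword_coverage(keywords, vocab):
    """Fraction of canonical keywords that exist as single vocabulary entries."""
    present = [k for k in keywords if k in vocab]
    return len(present) / len(keywords), present

def rank_histogram(keywords, vocab, edges=(1_000, 10_000, 40_000)):
    """Count present keywords per rank bucket (token id used as rank proxy)."""
    labels = [f"<{e}" for e in edges] + [f">={edges[-1]}"]
    counts = dict.fromkeys(labels, 0)
    for k in keywords:
        if k not in vocab:
            continue  # absent keywords only affect coverage
        rank = vocab[k]
        for e, label in zip(edges, labels):
            if rank < e:
                counts[label] += 1
                break
        else:
            counts[labels[-1]] += 1  # beyond the last edge
    return counts

# Hypothetical ids chosen to mirror the rank ranges quoted in the text.
vocab = {"return": 310, "in": 42, "with": 650,
         "constexpr": 61_000, "noexcept": 88_000}
keywords = ["return", "in", "with", "constexpr", "noexcept", "jsx"]

coverage, present = keyword_coverage(keywords, vocab)
histogram = rank_histogram(keywords, vocab)
print(f"coverage={coverage:.2f}", histogram)
```

A real audit would also need to decide how to treat whitespace-prefixed variants of a keyword, which this sketch ignores.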
3. Cold-Start Probability Analysis and Methodological Innovation
PKP and related metrics derive exclusively from zero-context model responses, requiring no prompts or external datasets. The analysis uses:
- Curated keyword and special token sets (from language specs).
- Model forward pass on “empty” input, yielding a logit vector over the full vocabulary.
- Whole-vocabulary softmax computation, enabling PKP, STP, KAP, and NLP calculations.

This cold-start approach is model-agnostic and efficient, directly exposing prior preferences for code constructs versus syntactic noise; it enables instant auditing of optimization impact on code expressiveness.
4. Effects of Model Optimization: Quantization, Distillation, Scaling, and Tuning
Systematic PKP analysis across model variants reveals critical patterns in code-construct probability propagation:
Quantization
For Qwen2.5-Coder-7B, bit-width modulates code bias:

| Bit width | PKP    | STP    |
|-----------|--------|--------|
| 2         | 0.0902 | 0.4248 |
| 4         | 0.1231 | 0.2804 |
| 8         | 0.1070 | 0.2928 |

Moderate quantization (4-bit) surprisingly improves PKP/STP balance (+37% PKP, −34% STP vs 2-bit). Top keywords (“import”, “package”, “from”) remain robust across all settings.
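The relative changes quoted for quantization can be checked directly from the table's rounded values. This is a small arithmetic sketch; `relative_change` is a helper name introduced here, and because the table entries are rounded to four decimals the results may differ from the reported percentages by a fraction of a point.

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old

# Values copied from the Qwen2.5-Coder-7B quantization table.
quant = {
    2: {"PKP": 0.0902, "STP": 0.4248},
    4: {"PKP": 0.1231, "STP": 0.2804},
    8: {"PKP": 0.1070, "STP": 0.2928},
}

for bits in (4, 8):
    d_pkp = relative_change(quant[bits]["PKP"], quant[2]["PKP"])
    d_stp = relative_change(quant[bits]["STP"], quant[2]["STP"])
    print(f"{bits}-bit vs 2-bit: PKP {d_pkp:+.1f}%, STP {d_stp:+.1f}%")
```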
Distillation
Teacher-student distillation dramatically collapses the code-token tail:

| Model Variant            | PKP    | STP    |
|--------------------------|--------|--------|
| Qwen2.5-1.5B Base        | 0.1434 | 0.1254 |
| DeepSeek-R1-Distill-1.5B | 0.0042 | 0.7246 |
| Qwen2.5-32B Base         | 0.1420 | 0.1223 |
| DeepSeek-R1-Distill-32B  | 0.0019 | 0.2293 |

PKP falls by about 97% at 1.5B and about 99% at 32B, while STP rises by 478% and 87%, respectively. Distilled models become disproportionately punctuation-heavy, substantially reducing baseline code readiness.
Model Scaling and Instruction-Tuning
Scaling Qwen2.5 (1.5B→32B) leaves PKP and STP stable, but instruction-tuning distorts the distribution:
- PKP collapses (Coder-1.5B-Inst: PKP=0.0003 vs base 0.1215).
- STP surges (Coder-7B-Inst STP=0.449 vs base 0.147).
Instruction-tuned small or mid-size models lose code-keyword predisposition, inducing syntactic degeneration.
5. Practical Guidelines for Compression and Specialization
Empirically validated recommendations stem from PKP behavior under compression:
- Use 4-bit quantization for code-specialized models; it preserves or enhances PKP/STP balance while reducing model footprint.
- Exercise caution with distillation; vanilla variants devastate code-tail probabilities. Integrate explicit rare code-token objectives when distilling.
- Instruction-tune only large models or employ hybrid approaches; smaller models degrade code-token preference and inflate syntactic noise.
- Employ cold-start metrics (PKP, STP, KAP) before/after any fine-tuning or compression step as rapid-quality checks. Red flags include PKP drops or STP spikes.
- Calibrate temperature post-compression; higher sampling temperature accentuates distributional biases, further eroding code-token mass.
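The before/after check recommended in the list above can be automated as a simple comparison of cold-start metrics. The sketch below is illustrative only: the function name and the two thresholds (`max_pkp_drop`, `max_stp_rise`) are hypothetical and are not values proposed by the source; the example inputs are the 1.5B base and distilled figures reported in the distillation results.

```python
def compression_red_flags(before, after, max_pkp_drop=0.5, max_stp_rise=1.0):
    """Flag relative PKP drops or STP rises beyond hypothetical thresholds."""
    flags = []
    pkp_drop = (before["PKP"] - after["PKP"]) / before["PKP"]
    stp_rise = (after["STP"] - before["STP"]) / before["STP"]
    if pkp_drop > max_pkp_drop:
        flags.append(f"PKP dropped by {pkp_drop:.0%}")
    if stp_rise > max_stp_rise:
        flags.append(f"STP rose by {stp_rise:.0%}")
    return flags

base = {"PKP": 0.1434, "STP": 0.1254}       # Qwen2.5-1.5B Base
distilled = {"PKP": 0.0042, "STP": 0.7246}  # DeepSeek-R1-Distill-1.5B
flags = compression_red_flags(base, distilled)
print(flags)
```

Suitable thresholds would need to be calibrated per model family, since the source reports that even benign changes such as moving between bit widths shift both metrics.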
6. Theoretical and Operational Implications of PKP
PKP operationalizes a direct lens into LLM code-generation readiness at the token level. Unlike output-based or prompt-reliant assessment, PKP exposes latent model predispositions towards programming constructs under optimization, guiding practical tradeoffs in memory, quality, and specialization:
- PKP reveals how code capability degrades or survives as compression intensifies.
- Distributional metrics (PKP, STP, KAP, NLP) collectively predict model degeneration toward syntactic or natural-language dominance.
- PKP can be understood in the context of probabilistic programming primitives (e.g., “sample”, “condition”, “multi-ary choice”) (Raedt et al., 2013), which, in semantic models, require sufficient representation and prior for accurate construction.

A plausible implication is that models maintaining higher PKP under quantization/fine-tuning will exhibit more reliable code generation under diverse deployment constraints.
7. Connections to Probabilistic Programming Keywords
PKP’s focus on reserved keywords closely parallels the primitives enumerated in the probabilistic programming literature (Raedt et al., 2013). Sampling (“sample”, annotated disjunctions), conditioning (“observe”), continuous distributional clauses, and nested inference all rely on the robust, context-agnostic representation of such primitives within the tokenizer and model backbone:
- Token coverage and cold-start probability ensure probabilistic programming constructs are not lost as code models undergo distillation or quantization.
- Maintenance of high PKP in model optimization is critical for the expressive power required by modern probabilistic programming frameworks, which aggregate code tokens as building blocks for complex models.

PKP thus acts both as a quality-control metric and as a theoretical measure aligning with the requirements of probabilistic programming semantics and execution strategies.