Eliminating polynomial dependence on rank d in learning low logit rank models

Determine whether there exists an algorithm for learning approximately low logit rank language models, in the same setting as Theorem 7 (the paper's main learning result with logit-query access), that only requires the target total variation error ε* to be lower bounded by a polynomial in the average logit misspecification ε_avg, the sequence length T, the alphabet size |Σ|, the boundedness parameter α, and 1/δ, with no polynomial dependence on the logit rank d.
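
Schematically, and with exponents omitted (the exact form of Theorem 7's requirement is not reproduced here, so the left-hand expression is only an assumption about its shape), the question asks whether a condition of the first form can be replaced by one with no dependence on d:

$$\varepsilon^\star \;\geq\; \mathrm{poly}\big(d,\ \varepsilon_{\mathrm{avg}},\ T,\ |\Sigma|,\ \alpha,\ 1/\delta\big) \qquad\longrightarrow\qquad \varepsilon^\star \;\geq\; \mathrm{poly}\big(\varepsilon_{\mathrm{avg}},\ T,\ |\Sigma|,\ \alpha,\ 1/\delta\big).$$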

Background

The paper's main theorem (Theorem 7) gives an end-to-end learning guarantee, using logit-query access, for LLMs whose logits are approximately low rank. The current analysis requires the achievable total variation error to grow polynomially with the rank d, so the final guarantee carries polynomial dependence on d.

Empirically, the error of low-rank approximations to logit matrices decays with the rank according to a much milder power law than the theory requires. The authors highlight this gap between their theoretical assumptions and observed behavior, and raise the possibility of removing the polynomial dependence on d altogether.
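
To make the empirical statement concrete, the following minimal Python sketch measures how the best rank-d approximation error of a logit matrix decays as d grows. It is illustrative only: the synthetic matrix, the SVD-based truncation, and the average absolute residual as the error metric are assumptions standing in for the paper's actual models and its average logit misspecification ε_avg.

```python
import numpy as np

# Illustrative sketch (not the paper's exact procedure): measure how well a
# logit matrix is approximated by rank-d truncations and inspect how the
# residual decays with d. The error metric (average absolute entrywise
# residual of the best rank-d approximation) is an assumption standing in
# for the paper's average logit misspecification.

rng = np.random.default_rng(0)

# Stand-in logit matrix: rows = contexts, columns = vocabulary items.
# In practice these entries would be logits queried from a language model.
n_contexts, vocab_size = 512, 256
logits = rng.standard_normal((n_contexts, 32)) @ rng.standard_normal((32, vocab_size))
logits += 0.01 * rng.standard_normal((n_contexts, vocab_size))  # small full-rank noise

U, s, Vt = np.linalg.svd(logits, full_matrices=False)

for d in (2, 4, 8, 16, 32, 64):
    # Best rank-d approximation (Eckart-Young) via truncated SVD.
    approx = (U[:, :d] * s[:d]) @ Vt[:d, :]
    avg_err = np.abs(logits - approx).mean()
    print(f"rank {d:3d}: average absolute residual = {avg_err:.4f}")
```

Plotting the printed residuals against d on log-log axes would reveal the kind of power-law decay the authors report; the theory, by contrast, asks for a stronger (d-dependent) relationship between the approximation error and the achievable learning error.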

References

We are not even sure if it is possible to avoid polynomial dependence on $d$ altogether: Is there an algorithm for learning low logit rank models in the setting of Theorem 7 that only requires $\varepsilon^\star \geq \Omega(\mathrm{poly}(\varepsilon_{\mathrm{avg}}, T, |\Sigma|, \alpha, 1/\delta))$, i.e., without any polynomial dependence on the rank $d$?

Provably Learning from Modern Language Models via Low Logit Rank (2512.09892 - Golowich et al., 10 Dec 2025) in Conclusions and Future Directions, Improved polynomial dependence paragraph