Efficient recovery of the output projection matrix up to an orthogonal transform

Develop a computationally efficient algorithm to recover a transformer's output embedding projection matrix W ∈ R^{l×h} up to an orthogonal h×h transformation, using only logit vectors obtained from API queries. Concretely, the logit outputs for prompts p_i yield points x_i = U^T W g_θ(p_i). The task is to efficiently solve the overdetermined system x_i^T A x_i = 1, which is linear in the entries of the positive semidefinite matrix A ∈ R^{h×h}, then compute M with A = M^T M and reconstruct W as U M^{-1} O for some orthogonal O. A solution would make the outlined orthogonal-recovery attack practical, where the current linear solve in h^2 variables is infeasible.
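The steps in the problem statement can be sketched on a toy instance. The NumPy code below is only an illustration of the baseline dense solve, not an efficient algorithm and not the authors' implementation. It assumes the hidden states g_θ(p_i) have unit norm (so the points x_i lie exactly on an ellipsoid), that A turns out positive definite (so M is invertible), and that U is taken from an SVD of the stacked logits. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ellipsoid(X):
    """Least-squares solve of x_i^T A x_i = 1 for a symmetric A.

    X has shape (n, h), one point per row, with n >= h(h+1)/2.
    """
    n, h = X.shape
    iu, ju = np.triu_indices(h)
    # x^T A x = sum_i A_ii x_i^2 + 2 * sum_{i<j} A_ij x_i x_j
    coeff = np.where(iu == ju, 1.0, 2.0)
    F = X[:, iu] * X[:, ju] * coeff          # design matrix, (n, h(h+1)/2)
    a, *_ = np.linalg.lstsq(F, np.ones(n), rcond=None)
    A = np.zeros((h, h))
    A[iu, ju] = a                            # fill the upper triangle
    return A + A.T - np.diag(np.diag(A))     # symmetrize

def recover_up_to_orthogonal(U, X):
    """Return U M^{-1}, where A = M^T M is the fitted ellipsoid matrix."""
    A = fit_ellipsoid(X)
    L = np.linalg.cholesky(A)                # A = L L^T (needs A positive definite)
    M = L.T                                  # so A = M^T M
    return U @ np.linalg.inv(M)

def synthetic_check(l=10, h=4, n=200, seed=0):
    """Toy experiment with a random W and unit-norm hidden states (assumed)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((l, h))
    G = rng.standard_normal((n, h))
    G /= np.linalg.norm(G, axis=1, keepdims=True)   # ||g_i|| = 1
    logits = G @ W.T                                # row i is (W g_i)^T
    U = np.linalg.svd(logits.T, full_matrices=False)[0][:, :h]
    X = logits @ U                                  # row i is (U^T W g_i)^T
    return W, recover_up_to_orthogonal(U, X)
```

In this toy setting the recovered matrix W_hat should satisfy W_hat W_hat^T ≈ W W^T, which is what agreement up to an orthogonal O means. The cost is dominated by the least-squares solve over h(h+1)/2 unknowns, which is the step the open problem asks to speed up.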

Background

The paper presents an attack that extracts the final embedding projection layer of a transformer LLM from its logit outputs, recovering the matrix up to an arbitrary h × h linear transformation. In the appendix, the authors outline a stronger approach that could, in principle, recover the matrix up to an orthogonal transformation by fitting an ellipsoid to the transformed hidden states, that is, by solving the system x_i^T A x_i = 1 for A.

However, the authors do not know how to solve this system efficiently at realistic hidden dimensions (h > 750 in all their experiments), which prevents practical use of the orthogonal-recovery method. They explicitly leave improving the algorithm as an open problem and, in practice, fall back to reconstructing the weights up to an arbitrary h × h matrix.
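A rough back-of-the-envelope calculation shows why a naive dense solve becomes expensive at this scale. The sketch below assumes a dense float64 design matrix with the minimum number of rows (one equation per unknown) and counts only the independent entries of a symmetric A. These are illustrative assumptions, not figures from the paper, and the function name is hypothetical.

```python
def dense_solve_cost(h, bytes_per_entry=8):
    """Estimate the size of a naive dense solve for symmetric A in R^{h x h}."""
    unknowns = h * (h + 1) // 2          # independent entries of symmetric A
    rows = unknowns                      # minimum number of equations (assumed)
    matrix_bytes = rows * unknowns * bytes_per_entry
    return unknowns, matrix_bytes

unknowns, matrix_bytes = dense_solve_cost(750)
print(f"unknowns: {unknowns}")
print(f"dense design matrix: {matrix_bytes / 1e9:.0f} GB")
```

Under these assumptions the design matrix alone grows as h^4, so doubling h multiplies its size by roughly sixteen.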

References

We do not carry out this attack in practice for models considered in this paper, and leave improving this algorithm as an open problem for future work. However, we do not know how to solve these systems of linear equations in h² variables efficiently (h>750 in all our experiments); so in practice we resort to reconstructing weights up to an arbitrary h × h matrix, as described in Appendix \ref{sec:proof_of_42}.

— Stealing Part of a Production Language Model (2403.06634 - Carlini et al., 11 Mar 2024) in Appendix, Section “Recovering W up to an orthogonal matrix”
