Attack against unrestricted permutations of transformer hidden states
Develop a decoding attack that reconstructs the original input token sequence from the hidden states of a decoder-only transformer large language model after those hidden states have undergone an unrestricted permutation, i.e. one in which any element of the $N \times d$ hidden-state matrix can be moved to any row or column index. The goal is to extend the demonstrated attacks, which cover sequence-dimension, hidden-dimension, and factorized 2D permutations, to this fully unconstrained elementwise setting, and thereby determine whether such permutations can be reversed to recover the user’s prompt.
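To make the four permutation settings named above concrete, the following is a minimal illustrative sketch in plain Python. It is not the attack and not the source's implementation: the function names, the toy matrix, and the reading of "factorized 2D" as a row permutation composed with a column permutation are all assumptions made for illustration. It shows that the restricted settings keep each row's entries together, while the unrestricted setting only preserves the multiset of all $N \cdot d$ entries.

```python
import math
import random

def permute_rows(H, rng):
    """Sequence-dimension permutation: reorder the N rows (token positions)."""
    order = list(range(len(H)))
    rng.shuffle(order)
    return [H[i][:] for i in order]

def permute_cols(H, rng):
    """Hidden-dimension permutation: apply one column reordering to every row."""
    order = list(range(len(H[0])))
    rng.shuffle(order)
    return [[row[j] for j in order] for row in H]

def permute_factorized(H, rng):
    """Assumed reading of 'factorized 2D': a row permutation, then a column one."""
    return permute_cols(permute_rows(H, rng), rng)

def permute_unrestricted(H, rng):
    """Unrestricted permutation: shuffle all N*d entries, then reshape to N x d."""
    n, d = len(H), len(H[0])
    flat = [x for row in H for x in row]
    rng.shuffle(flat)
    return [flat[i * d:(i + 1) * d] for i in range(n)]

def count_factorized(n, d):
    """Number of distinct (row permutation, column permutation) pairs."""
    return math.factorial(n) * math.factorial(d)

def count_unrestricted(n, d):
    """Number of distinct elementwise permutations of an n x d matrix."""
    return math.factorial(n * d)

# Toy hidden-state matrix with N = 4 tokens and d = 3 dimensions (hypothetical values).
rng = random.Random(0)
N, d = 4, 3
H = [[float(i * d + j) for j in range(d)] for i in range(N)]

H_rows = permute_rows(H, rng)
H_fact = permute_factorized(H, rng)
H_free = permute_unrestricted(H, rng)
```

Even for this toy size the search space grows sharply: `count_factorized(4, 3)` is 144, whereas `count_unrestricted(4, 3)` is 479001600. In the unrestricted case an attacker can no longer assume that the entries of one row belong to the same token.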
References
We have not yet demonstrated a successful attack against unrestricted permutations of hidden states, i.e. where any element of the $N \times d$ matrix of hiddens can be moved to any column or row index without restriction.