Complexity of OPE and OMS with a fixed-size alphabet

Determine the computational complexity of the Optimal Pair Encoding (OPE) and Optimal Merge Sequence (OMS) problems when the initial alphabet size is a fixed constant (for example, a two-symbol alphabet), in particular whether these cases remain APX-hard or admit polynomial-time algorithms.

Background

The APX-hardness proofs in this work rely on instances whose alphabet size grows with the input size. They therefore leave open the status of OPE and OMS when the alphabet is fixed and small.

Understanding the fixed-alphabet regime could lead to stronger guarantees for BPE or tractable exact/approximate algorithms in practical tokenization settings.
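
To make the fixed-alphabet setting concrete, here is a minimal sketch of greedy BPE-style merging applied to a string over two symbols. It is an illustration only: the function names `merge_pair` and `greedy_bpe` are hypothetical, ties are broken by first occurrence, and each merged pair is represented by the pair itself as a new symbol. The sketch makes no claim of optimality; whether an optimal sequence of merges can be computed efficiently in this regime is exactly the open question.

```python
from collections import Counter


def merge_pair(tokens, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # skip both symbols of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


def greedy_bpe(text, k):
    """Apply up to k greedy merges; return the final tokens and the merges used.

    Assumption: the pair chosen at each step is the one with the most adjacent
    occurrences. Overlapping occurrences are counted, so the count can exceed
    the number of replacements actually made (e.g. "aaa" holds two "aa" pairs
    but allows only one merge).
    """
    tokens = list(text)
    merges = []
    for _ in range(k):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break  # fewer than two tokens remain
        pair, _ = counts.most_common(1)[0]
        # Use the pair itself as the new symbol so distinct merges never collide.
        tokens = merge_pair(tokens, pair, pair)
        merges.append(pair)
    return tokens, merges


if __name__ == "__main__":
    tokens, merges = greedy_bpe("abababab", 2)
    print(len(tokens), merges)
```

On the binary string `abababab`, two greedy merges shrink the eight input symbols to two tokens. The open problem asks whether the best achievable length after k merges can be computed, or approximated arbitrarily well, in polynomial time when the starting alphabet is this small.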

References

The complexity of both problems with a fixed alphabet remains open.

Theoretical Analysis of Byte-Pair Encoding (2411.08671 - Kozma et al., 13 Nov 2024) in Section 6 (Conclusion and open questions)