WebGPU-based deployment of Tiny-QMoE beyond terminal-only usage

Develop a WebGPU-based deployment that enables Tiny-QMoE, a quantization and dictionary-based compression framework for LLaMA 3.2 models, to run in the browser rather than only in a terminal, making the system publicly accessible.

Background

Tiny-QMoE introduces an approach to run quantized and compressed LLaMA 3.2 models on memory-constrained devices, prioritizing hardware-agnostic CPU execution and avoiding CUDA dependencies to broaden accessibility.

Despite demonstrating strong compression ratios while maintaining model performance, the current implementation remains terminal-bound. The authors explicitly note that they were unable to provide a WebGPU-based deployment, which would allow the system to run in browser environments and make the work publicly accessible beyond terminal usage.
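A browser deployment would need to upload the compressed weights to the GPU and dequantize them in a compute shader before (or during) inference. The sketch below illustrates one plausible shape for that step, in TypeScript with a WGSL kernel. It is a hedged illustration only: the 4-bit nibble format, the single per-buffer scale, and the function names (`dequantize4bit`, `dequantWGSL`) are assumptions for exposition; Tiny-QMoE's actual codec combines quantization with dictionary compression and is not reproduced here.

```typescript
// Hypothetical 4-bit packing: each byte holds two unsigned nibbles,
// dequantized as (nibble - 8) * scale. This mirrors common low-bit
// schemes but is NOT Tiny-QMoE's actual on-disk format.

// CPU reference implementation, useful for validating the GPU path.
function dequantize4bit(packed: Uint8Array, scale: number): Float32Array {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    const lo = packed[i] & 0x0f;        // low nibble -> even output index
    const hi = (packed[i] >> 4) & 0x0f; // high nibble -> odd output index
    out[2 * i] = (lo - 8) * scale;
    out[2 * i + 1] = (hi - 8) * scale;
  }
  return out;
}

// Equivalent WGSL compute shader a browser deployment could dispatch
// via device.createComputePipeline(...) and pass.dispatchWorkgroups(...).
const dequantWGSL = /* wgsl */ `
@group(0) @binding(0) var<storage, read> packed : array<u32>;
@group(0) @binding(1) var<storage, read_write> weights : array<f32>;
@group(0) @binding(2) var<uniform> scale : f32;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // Each invocation unpacks one u32, i.e. eight 4-bit values.
  let word = packed[gid.x];
  for (var j = 0u; j < 8u; j++) {
    let nibble = (word >> (4u * j)) & 0xfu;
    weights[gid.x * 8u + j] = (f32(nibble) - 8.0) * scale;
  }
}`;
```

In a real page, the pipeline would be gated on feature detection (`navigator.gpu?.requestAdapter()`), with a graceful message when WebGPU is unavailable, since browser support still varies.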

References

"On top of this while we were unable to bring the model outside of the terminal which we had hoped to do with Web-GPU, we hope to do so in the future as to make this work more public beyond the terminal."

Tiny-QMoE (arXiv:2509.22951, Cashman et al., 26 Sep 2025), Section 6 (Conclusion)