Distributed Inference and Fine-tuning of LLMs
Introduction
Deploying LLMs with over 50 billion parameters across NLP tasks has been constrained by the need for high-end hardware. Conventional workarounds, such as offloading parameters to RAM, fall short because they are too slow for latency-sensitive applications like chatbots and search engines. The alternative this paper focuses on is distributed computing over the internet: running these LLMs on a swarm of unreliable consumer devices.
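To make the latency problem concrete, a rough estimate (with illustrative numbers of our own, not figures from the paper) shows why offloading's per-token cost is dominated by weight transfer: autoregressive generation must stream every layer's weights from RAM to the GPU once per token.

    # Back-of-envelope check of why RAM offloading is too slow for
    # interactive generation. Numbers are illustrative, not the paper's.
    params = 70e9            # a 70B-parameter model
    bytes_per_param = 2      # fp16 weights
    bandwidth = 32e9         # roughly PCIe 4.0 x16, in bytes per second

    weight_bytes = params * bytes_per_param           # 140 GB of weights
    latency = weight_bytes / bandwidth                # streamed once per token
    print(f"Lower bound: {latency:.1f} s per token")  # ~4.4 s per token

Even with ideal bus utilization, several seconds per token is far beyond what interactive applications tolerate.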
Fault-Tolerance in Model Inference
The paper introduces algorithms tailored to distributed environments where devices are unreliable and network latencies vary widely. By combining a novel fault-tolerant autoregressive inference algorithm with a decentralized load-balancing mechanism, the authors enable quick recovery from server failures. Fault tolerance rests on dual attention caches: each server holds the attention caches for its blocks, while the client retains the past inputs needed to rebuild them, so a standby server can restore a failed server's state quickly. When a failure occurs, only the data needed for that reconstruction is re-transmitted.
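The recovery logic can be sketched from the client's perspective. The names below (BlockSpan, ServerFailure, find_replacement_server, replay) are hypothetical stand-ins for illustration, not the paper's actual interface:

    # Sketch of fault-tolerant autoregressive inference on the client side.
    # All names here are illustrative stand-ins, not the paper's API.

    class ServerFailure(Exception):
        """Raised when a server times out or disconnects."""

    class BlockSpan:
        """A contiguous range of transformer blocks hosted by one server."""
        def __init__(self, server, start, end):
            self.server = server
            self.start, self.end = start, end
            self.sent_inputs = []  # client-side copy of inputs to this span

    def run_step(spans, hidden):
        """Push one token's hidden states through the pipeline of spans."""
        for span in spans:
            while True:
                try:
                    out = span.server.forward(hidden)  # server reads and
                                                       # updates its cache
                    span.sent_inputs.append(hidden)    # keep client-side copy
                    hidden = out
                    break
                except ServerFailure:
                    # Pick a standby server for the same block range and
                    # replay the cached inputs so it can rebuild the lost
                    # attention cache; only this span's data is re-sent.
                    span.server = find_replacement_server(span.start, span.end)
                    span.server.replay(span.sent_inputs)
        return hidden

Because the client already holds each span's past inputs, a replacement server only needs to reprocess that span's history rather than restarting the whole generation.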
Load Balancing and Fine-Tuning
The paper further tackles the dynamic and uneven nature of consumer-grade hardware and network resources with a load-balancing protocol. This adaptive mechanism assigns transformer blocks across the distributed system to maximize overall throughput even as servers join and leave freely (a simplified sketch follows below). The system also supports parameter-efficient fine-tuning in which clients, not servers, store and update the trainable parameters, allowing adaptation to new tasks without heavily taxing the network.
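A minimal sketch of the greedy assignment idea, assuming each joining server scores candidate block ranges by the swarm's weakest stage; the scoring rule and names are our simplification, not the paper's exact objective:

    # Sketch of greedy load balancing: a joining server picks the contiguous
    # run of blocks where the swarm's current capacity is lowest.
    # 'coverage' maps block index -> total throughput of servers hosting it.
    # Names and the exact objective are illustrative, not the paper's code.

    def choose_blocks(coverage, num_blocks, span_len):
        """Return the start index of the weakest contiguous span of blocks."""
        best_start, best_score = 0, float("inf")
        for start in range(num_blocks - span_len + 1):
            # A pipeline is limited by its slowest stage, so score each
            # candidate span by the minimum coverage inside it.
            score = min(coverage[start:start + span_len])
            if score < best_score:
                best_start, best_score = start, score
        return best_start

Since pipeline throughput is bottlenecked by its slowest stage, directing new capacity to the weakest span is what raises end-to-end throughput.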
Performance Evaluation
Extensive simulations and real-world experiments confirmed that the system can run LLMs efficiently over the internet. Compared with local offloading, the approach was up to ten times faster for interactive generation. Tests spanned multiple continents, demonstrating the system's robustness and efficiency despite geodistribution challenges.
Conclusion
The paper concludes by validating the proposed decentralized system as a cost-effective way to use LLMs on distributed, unreliable devices. It harnesses the collective power of idle compute resources while guaranteeing correct model outputs and delivering significant speedups over traditional offloading. The authors also note privacy considerations and potential future improvements, such as integrating secure multi-party computation to safeguard sensitive data processed by the system.