Toward generalizable learning of all (linear) first-order methods via memory augmented Transformers (2410.07263v3)

Published 8 Oct 2024 in cs.LG and math.OC

Abstract: We show that memory-augmented Transformers can implement the entire class of linear first-order methods (LFOMs), a class that contains gradient descent (GD) and more advanced methods such as conjugate gradient descent (CGD), momentum methods and all other variants that linearly combine past gradients. Building on prior work that studies how Transformers simulate GD, we provide theoretical and empirical evidence that memory-augmented Transformers can learn more advanced algorithms. We then take a first step toward turning the learned algorithms into actually usable methods by developing a mixture-of-experts (MoE) approach for test-time adaptation to out-of-distribution (OOD) samples. Lastly, we show that LFOMs can themselves be treated as learnable algorithms, whose parameters can be learned from data to attain strong performance.
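To make "linearly combine past gradients" concrete, here is a minimal NumPy sketch of an update of the form x_{k+1} = x_k + sum_{i<=k} c(k, i) * grad f(x_i). It is an illustration only, not the paper's Transformer construction: the names `lfom`, `gd_coeffs`, and `momentum_coeffs` are hypothetical, the coefficients are assumed to be scalars for simplicity, and the quadratic test problem is made up. Gradient descent keeps only the newest gradient, while heavy-ball momentum weights older gradients geometrically.

```python
import numpy as np

def lfom(grad, x0, coeffs, steps):
    """Iterate x_{k+1} = x_k + sum_{i<=k} coeffs(k, i) * grad(x_i).

    The list `grads` plays the role of a memory of past gradients.
    """
    x = np.asarray(x0, dtype=float)
    grads = []
    for k in range(steps):
        grads.append(grad(x))
        x = x + sum(coeffs(k, i) * g for i, g in enumerate(grads))
    return x

def gd_coeffs(eta):
    # Gradient descent: only the most recent gradient gets a nonzero weight.
    return lambda k, i: -eta if i == k else 0.0

def momentum_coeffs(eta, beta):
    # Heavy-ball momentum, x_{k+1} = x_k - eta*g_k + beta*(x_k - x_{k-1}),
    # unrolled so that gradient i receives weight -eta * beta**(k - i).
    return lambda k, i: -eta * beta ** (k - i)

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x, so grad f(x) = A x - b.
A = np.diag([1.0, 4.0])
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(A, b)

def quad_grad(x):
    return A @ x - b

x_gd = lfom(quad_grad, np.zeros(2), gd_coeffs(0.2), steps=100)
x_hb = lfom(quad_grad, np.zeros(2), momentum_coeffs(0.2, 0.5), steps=100)

print("minimizer:", x_star)
print("GD:       ", x_gd)
print("momentum: ", x_hb)
```

Both methods run through the same `lfom` loop and differ only in the coefficient function, which is the sense in which they belong to one class.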

Authors (2)
