Solving Regularized Exp, Cosh and Sinh Regression Problems (2303.15725v2)

Published 28 Mar 2023 in cs.LG

Abstract: In modern machine learning, attention computation is a fundamental task for training LLMs such as Transformer, GPT-4 and ChatGPT. In this work, we study the exponential regression problem, which is inspired by the softmax/exp unit in the attention mechanism in LLMs. The standard exponential regression is non-convex. We study the regularized version of the exponential regression problem, which is convex, and use an approximate Newton method to solve it in input sparsity time. Formally, in this problem, one is given a matrix $A \in \mathbb{R}^{n \times d}$, vectors $b \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$, and one of the functions $\exp$, $\cosh$ and $\sinh$, denoted $f$. The goal is to find the optimal $x$ that minimizes $0.5 \| f(Ax) - b \|_2^2 + 0.5 \| \mathrm{diag}(w) A x \|_2^2$. The straightforward method is to use the naive Newton's method. Let $\mathrm{nnz}(A)$ denote the number of nonzero entries in matrix $A$. Let $\omega$ denote the exponent of matrix multiplication; currently, $\omega \approx 2.373$. Let $\epsilon$ denote the accuracy error. In this paper, we exploit the input sparsity and propose an algorithm that uses $\log ( \| x_0 - x^* \|_2 / \epsilon )$ iterations and $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$ time per iteration to solve the problem.
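As a concrete reference point, here is a minimal NumPy sketch of the objective and the naive Newton iteration the abstract calls the straightforward method. The names (`newton_solve`, `FUNCS`) are illustrative, and the sketch assumes the regularized Hessian is positive definite, as the paper's convexity result suggests; it forms the exact Hessian densely rather than implementing the paper's input-sparsity approximate Newton solver.

```python
import numpy as np

# Sketch: naive Newton's method for the regularized objective
#   L(x) = 0.5 * ||f(A x) - b||_2^2 + 0.5 * ||diag(w) A x||_2^2,
# with f in {exp, cosh, sinh}. This is the baseline the abstract mentions,
# NOT the paper's O~(nnz(A) + d^omega)-per-iteration approximate solver.

FUNCS = {
    "exp":  (np.exp,  np.exp,  np.exp),    # (f, f', f'')
    "cosh": (np.cosh, np.sinh, np.cosh),
    "sinh": (np.sinh, np.cosh, np.sinh),
}

def newton_solve(A, b, w, f_name="exp", x0=None, eps=1e-8, max_iter=50):
    """Naive Newton iteration; assumes the regularized Hessian is PD."""
    n, d = A.shape
    f, df, d2f = FUNCS[f_name]
    x = np.zeros(d) if x0 is None else x0.copy()
    for _ in range(max_iter):
        u = A @ x
        r = f(u) - b
        # Gradient: A^T (f'(u) * r) + A^T diag(w)^2 A x
        g = A.T @ (df(u) * r + w**2 * u)
        # Hessian: A^T diag(f''(u) * r + f'(u)^2 + w^2) A
        D = d2f(u) * r + df(u)**2 + w**2
        H = A.T @ (D[:, None] * A)      # dense O(n d^2): the costly step
        step = np.linalg.solve(H, g)
        x -= step
        if np.linalg.norm(step) < eps:  # stop once the update is tiny
            break
    return x
```

Forming $A^\top D A$ densely costs $O(nd^2)$ per iteration; replacing this step with a sparsity-aware approximation while keeping the $\log(\|x_0 - x^*\|_2 / \epsilon)$ iteration count is what yields the paper's $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$ per-iteration time.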

Authors (3)
  1. Zhihang Li (17 papers)
  2. Zhao Song (253 papers)
  3. Tianyi Zhou (172 papers)
Citations (34)