
An Over-parameterized Exponential Regression (2303.16504v1)

Published 29 Mar 2023 in cs.LG and stat.ML

Abstract: Over the past few years, a significant amount of research has focused on the ReLU activation function, with the aim of establishing neural network convergence through over-parametrization. However, recent developments in the field of LLMs have sparked interest in exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of labeled data points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t),x)$ can be expressed as $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r(t), x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. Following the standard setting in the literature, the weights $a_r$ are fixed and never updated during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussian distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize each $a_r$ from the random sign distribution, for $r \in [m]$. Using the gradient descent algorithm, we can find a weight matrix $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize the over-parameterization bound on $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
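To make the setup concrete, here is a minimal NumPy sketch of the training problem the abstract describes. All concrete values below (`n`, `d`, `m`, the learning rate, the step count, and the synthetic data) are illustrative assumptions, not taken from the paper, whose guarantee applies in the regime $m = \Omega(n^{2+o(1)}\log(n/\delta))$.

```python
import numpy as np

# Sketch of over-parameterized exponential regression; sizes and
# hyperparameters are illustrative assumptions, not from the paper.
rng = np.random.default_rng(0)

n, d, m = 8, 4, 200                            # data points, input dim, neurons
X = rng.standard_normal((n, d)) / np.sqrt(d)   # synthetic inputs, roughly unit-norm rows
y = rng.standard_normal(n)                     # synthetic labels

# Initialization from the abstract: w_r(0) ~ N(0, I_d), a_r a random sign.
W = rng.standard_normal((d, m))
a = rng.choice([-1.0, 1.0], size=m)            # fixed; never updated during training

def F(W, X):
    """F(W, x) = sum_r a_r * exp(<w_r, x>), evaluated row-wise on X."""
    return np.exp(X @ W) @ a

# Plain gradient descent on the squared loss 0.5 * ||F(W, X) - y||_2^2,
# training only W, as in the paper.
eta, steps = 5e-5, 5000
print("initial l2 error:", np.linalg.norm(F(W, X) - y))
for _ in range(steps):
    resid = F(W, X) - y                        # shape (n,)
    # dL/dw_r = sum_i resid_i * a_r * exp(<w_r, x_i>) * x_i
    grad = X.T @ (resid[:, None] * np.exp(X @ W) * a[None, :])
    W -= eta * grad
print("final l2 error:", np.linalg.norm(F(W, X) - y))
```

In the over-parameterized regime the paper analyzes, the weights stay close to their Gaussian initialization and gradient descent drives $\| F(W(T), X) - y \|_2$ below $\epsilon$; at toy sizes like these, the residual shrinks but need not reach a prescribed $\epsilon$.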

Authors (3)
  1. Yeqi Gao (14 papers)
  2. Sridhar Mahadevan (33 papers)
  3. Zhao Song (253 papers)
Citations (28)