Conformer LLMs -- Convolution Augmented Large Language Models (2307.00461v1)
Abstract: This work brings together two popular neural architecture blocks, convolutional layers and Transformers, for LLMs. Non-causal conformers are used ubiquitously in automatic speech recognition; this work adapts these architectures to a causal setup for training LLMs. Transformer decoders effectively capture long-range dependencies across several modalities and form the core backbone of modern advances in machine learning. Convolutional architectures have been popular for extracting features in domains such as raw 1-D signals, speech, and images, to name a few. In this paper, by combining local and global dependencies over latent representations using causal convolutional filters and Transformer layers, we achieve significant gains in performance. This work showcases a robust speech architecture that can be integrated and adapted in a causal setup beyond speech applications for large-scale language modeling.
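The key adaptation the abstract describes is making the conformer's convolution causal, so that each output position depends only on current and past inputs, as autoregressive language modeling requires. A minimal sketch of that idea (the function name and scalar-sequence setup are illustrative assumptions, not the paper's implementation) is left-padding the sequence by `kernel_size - 1` before convolving:

```python
# Hedged sketch of a causal 1-D convolution: left-padding by (k - 1)
# guarantees that out[t] depends only on inputs at positions <= t,
# which is what lets a conformer-style block be used autoregressively.

def causal_conv1d(x, kernel):
    """Causal 1-D convolution over a sequence of scalars.

    x      : list of floats (the input sequence)
    kernel : list of floats; kernel[-1] multiplies the current input,
             earlier entries multiply progressively older inputs
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # pad on the left only: no future leakage
    out = []
    for t in range(len(x)):
        window = padded[t : t + k]  # the k most recent inputs ending at t
        out.append(sum(w * v for w, v in zip(kernel, window)))
    return out
```

With the identity kernel `[0.0, 0.0, 1.0]` the output equals the input, and `[1.0, 0.0, 0.0]` delays the sequence by two steps, confirming no position reads ahead of itself.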