Conformer LLMs -- Convolution Augmented Large Language Models (2307.00461v1)

Published 2 Jul 2023 in cs.CL, cs.AI, cs.LG, cs.MM, and cs.SD

Abstract: This work brings together two popular blocks of neural architecture, namely convolutional layers and Transformers, for LLMs. Non-causal conformers are used ubiquitously in automatic speech recognition. This work aims to adapt these architectures to a causal setup for training LLMs. Transformer decoders effectively capture long-range dependencies over several modalities and form a core backbone of modern advancements in machine learning. Convolutional architectures have been popular for extracting features in domains such as raw 1-D signals, speech, and images, to name a few. In this paper, by combining local and global dependencies over latent representations using causal convolutional filters and Transformers, we achieve significant gains in performance. This work showcases a robust speech architecture that can be integrated and adapted in a causal setup beyond speech applications for large-scale language modeling.
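
The core idea in the abstract, combining masked (causal) self-attention for global context with a causal convolution module for local context, can be sketched roughly as below. This is a minimal illustrative sketch in PyTorch, not the paper's implementation; the module names, kernel size, block ordering, and all hyperparameters are assumptions.

```python
# Minimal sketch (not the authors' code) of a causal conformer-style decoder block:
# masked self-attention for global context plus a causal depthwise 1-D convolution
# for local context. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class CausalConvModule(nn.Module):
    """Depthwise 1-D convolution that only looks at past positions."""

    def __init__(self, d_model: int, kernel_size: int = 7):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pad = kernel_size - 1                       # left padding keeps it causal
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)  # depthwise
        self.pointwise = nn.Conv1d(d_model, d_model, 1)  # mix channels
        self.act = nn.GELU()

    def forward(self, x):                                # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)                 # -> (batch, d_model, time)
        y = nn.functional.pad(y, (self.pad, 0))          # pad on the left only
        y = self.act(self.conv(y))
        y = self.pointwise(y).transpose(1, 2)
        return x + y                                     # residual connection


class CausalConformerBlock(nn.Module):
    """Masked self-attention, causal convolution, then feed-forward."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 7):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = CausalConvModule(d_model, kernel_size)
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        t = x.size(1)
        # Upper-triangular boolean mask enforces causality in attention (global context).
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        h = self.attn_norm(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        x = self.conv(x)                                 # local context, still causal
        return x + self.ff(self.ff_norm(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 256)                     # (batch, time, d_model)
    out = CausalConformerBlock()(tokens)
    print(out.shape)                                     # torch.Size([2, 16, 256])
```

Because both the attention mask and the left-only convolution padding prevent any position from seeing the future, a block like this can be stacked and trained with a standard next-token prediction objective.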

