Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (2404.12387v1)
Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, image, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized value for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g., MMMU, VQAv2), Core performs competitively with GPT-4V. On multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g., MMLU, GSM8K) but also outperforms GPT-4-0613 in human evaluation. On video question answering (Perception Test), Core outperforms Gemini Ultra. Models are deployed in production at http://chat.reka.ai. A showcase of non-cherry-picked qualitative examples can also be found at http://showcase.reka.ai.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
- Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. 2024.
- On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856, 2019.
- A general language assistant as a laboratory for alignment, 2021.
- The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants, 2023.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Evaluating large language models trained on code. 2021.
- Meditron-70B: Scaling medical pretraining for large language models, 2023.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020.
- Training verifiers to solve math word problems, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Gemma: Open models based on Gemini research and technology, 2024.
- PaLM 2 technical report, 2023.
- Gemini Team Google. Gemini: A family of highly capable multimodal models, 2023.
- Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Measuring massive multitask language understanding, 2021.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Mistral 7B, 2023.
- Few-shot learning with multilingual generative language models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.616. URL https://aclanthology.org/2022.emnlp-main.616.
- OpenAI. GPT-4 technical report, 2023.
- OpenAI. GPT-4V(ision) system card. 2024.
- Training language models to follow instructions with human feedback, 2022.
- PyTorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703, 2019. URL http://arxiv.org/abs/1912.01703.
- XCOPA: A multilingual dataset for causal commonsense reasoning. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185.
- Perception Test: A diagnostic benchmark for multimodal video models, 2023.
- Improving language understanding by generative pre-training. 2018.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- GPQA: A graduate-level Google-proof Q&A benchmark, 2023.
- Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
- Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
- Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- Towards expert-level medical question answering with large language models, 2023.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- Yi Tay. Training great LLMs entirely from ground up in the wilderness as a startup. 2024.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Chain of thought prompting elicits reasoning in large language models. Conference on Neural Information Processing Systems (NeurIPS), 2022.
- xAI. Announcing Grok. 2023.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of CVPR, 2024.
- Root mean square layer normalization, 2019.