
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (2404.12387v1)

Published 18 Apr 2024 in cs.CL and cs.CV

Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal LLMs trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized values for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively to other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at http://chat.reka.ai . A showcase of non cherry picked qualitative examples can also be found at http://showcase.reka.ai .


Summary

  • The paper presents a modular encoder-decoder transformer architecture using SwiGLU, grouped-query attention, rotary positional embeddings, and RMSNorm to support multimodal processing.
  • Reka Edge and Flash achieve state-of-the-art results within their compute classes on language benchmarks, visual question answering, and specialized tasks such as medical reasoning, while Reka Core approaches frontier models.
  • The research leverages efficient distributed training on Nvidia H100 GPUs to scale performance across diverse datasets and long-context inputs.

Exploring the Capabilities of Reka Models: Core, Flash, and Edge in Multimodal Language Tasks

Introduction

Reka introduces its series of models, Reka Core, Flash, and Edge, which demonstrate significant advances in handling multimodal inputs such as text, images, video, and audio. The models vary in size and computational needs: Reka Edge and Flash achieve state-of-the-art results within their compute classes, while Reka Core approaches the best frontier models on comprehensive benchmarks across modalities.

Model Overview and Training Details

Reka's architecture is built on a modular encoder-decoder transformer framework optimized for multimodal inputs. Details include:

  • Training Data: A rich dataset comprising both public and proprietary sources, covering a diverse set of languages and modalities including images and video.
  • Infrastructure: Training ran primarily on Nvidia H100 GPUs using distributed training strategies at substantial scale.
  • Model Architecture: Reka employs techniques such as SwiGLU, grouped-query attention, rotary positional embeddings, and RMSNorm, with a backbone similar to the PaLM architecture yet tailored for multimodal inputs. Context lengths vary, with Reka Flash and Core supporting extensive contexts up to 128K tokens for enhanced performance on long-context tasks. A sketch of how these components fit together follows this list.
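
The report does not include reference code, so the following is a minimal, self-contained PyTorch sketch of how the components named above (SwiGLU, grouped-query attention, rotary positional embeddings, RMSNorm) typically combine in a pre-norm decoder block. All class names, dimensions, and head counts here are illustrative assumptions, not Reka's actual implementation.

```python
# Minimal pre-norm decoder block combining RMSNorm, SwiGLU, rotary
# embeddings, and grouped-query attention. Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the root-mean-square of the features; no mean-centering.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated branch multiplied elementwise by a linear branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rotary(x, base: float = 10000.0):
    # Rotary position embeddings on (batch, heads, seq, head_dim),
    # using the rotate-half pairing convention.
    *_, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device) / half)
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = k.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rotary(q), rotary(k)
        # GQA: each group of query heads shares one key/value head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(b, t, -1))

class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=4 * dim)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))   # pre-norm attention
        return x + self.mlp(self.mlp_norm(x))  # pre-norm feed-forward
```

One design note: by sharing each key/value head across a group of query heads, grouped-query attention shrinks the KV cache by the group factor, which matters most for long-context inference such as the 128K-token settings mentioned above.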

Performance and Evaluations

Language and Multimodal Benchmarking

The evaluation of Reka's models includes:

  • Language Tasks: Reka models show competitive performance on standard benchmarks such as MMLU and GSM8K, with particularly strong showings in code generation as measured by HumanEval (see the scoring sketch after this list).
  • Multimodal Performance: Reka Core excels at visual question answering, ranking competitively against models such as GPT-4V and surpassing Claude 3 Opus in multimodal chat under a blind third-party human evaluation.
  • Specialized Domain Applications: In detailed comparisons, Reka models perform strongly in specific domains such as medical reasoning, with capabilities on par with or exceeding specialized models like Med-PaLM-2 and GPT-4 on various medical benchmarks.
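
As an illustration of how HumanEval-style code benchmarks are typically scored, the following is a small sketch of the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). The sample counts in the example are made up for illustration and are not Reka's reported numbers.

```python
# Unbiased pass@k: probability that at least one of k samples, drawn from
# n generated samples of which c pass the unit tests, is correct.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer failing samples than draws: some draw must contain a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: 200 samples on one problem, 140 of them correct.
print(pass_at_k(n=200, c=140, k=1))   # 0.7 (equals c/n when k = 1)
print(pass_at_k(n=200, c=140, k=10))  # close to 1.0 for generous k
```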

Detailed Comparative Analysis

Reka consistently positions itself as a strong contender in the landscape of AI models:

  • Reka Edge stands out within its compute class, outperforming comparable models on tasks ranging from standard language understanding to complex reasoning and multilingual understanding.
  • Reka Flash, despite its moderate size (21B parameters), often matches or surpasses much larger models across a variety of benchmarks, highlighting its efficiency and the effectiveness of its training and architectural choices.

Future Outlook and Theoretical Implications

The ongoing development of the Reka models suggests future improvements and potential new applications, especially in domains requiring robust multimodal understanding and interaction. Future work could examine the efficiency of training protocols, the scalability of multimodal models, and the transferability of learned representations across diverse data types and tasks.

Conclusion

Reka's suite of models demonstrates substantial advances in the field of AI, particularly in handling and reasoning across multiple modalities. By combining proven architectural components with a diverse training corpus, Reka sets new standards within specific compute classes and shows promise in competing at the frontier of multimodal LLMs. Ongoing enhancements are expected to further solidify the models' standing in the AI research community and expand their applicability across a broader spectrum of real-world applications.
