Okay, here is a detailed summary of the paper "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context":
Introduction and Core Contribution:
The paper introduces the Gemini 1.5 family of multimodal models, highlighting Gemini 1.5 Pro (an updated, improved version of the February release) and Gemini 1.5 Flash (a lightweight, highly efficient variant). The central innovation is their ability to process extremely long contexts, up to 10 million tokens, across multiple modalities (text, video, audio, code). This is over an order of magnitude beyond contemporary models such as Claude 3 (200k tokens) and GPT-4 Turbo (128k tokens). The paper argues that this long-context capability unlocks new possibilities for recalling and reasoning over fine-grained information from vast inputs, such as entire codebases, multiple long documents, or hours of video and audio.
Key Models:
- Gemini 1.5 Pro: An updated version of the model initially announced in February 2024. It is built on a sparse Mixture-of-Experts (MoE) Transformer architecture, leveraging innovations in scaling, training, and serving infrastructure (a minimal illustration of MoE token routing follows this list). It achieves performance comparable or superior to Gemini 1.0 Ultra across many benchmarks while requiring significantly less training compute and being more efficient to serve.
- Gemini 1.5 Flash: A lighter-weight, dense Transformer model designed for speed, efficiency, and low latency. It uses techniques such as parallel computation of attention and feedforward components, and is trained via online distillation from Gemini 1.5 Pro. Despite its smaller size, it retains the long-context and multimodal capabilities of 1.5 Pro and shows strong performance, often outperforming Gemini 1.0 Pro and even 1.0 Ultra on some vision/text benchmarks.
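As a rough illustration of the sparse Mixture-of-Experts idea mentioned for 1.5 Pro (the paper does not disclose Gemini's actual architecture details), here is a minimal sketch of a top-k token-routing MoE feed-forward layer in plain NumPy; the expert count, layer sizes, and `top_k` value are arbitrary placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SparseMoELayer:
    """Minimal top-k token-routing MoE feed-forward layer (illustrative only)."""

    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.w_router = rng.normal(scale=0.02, size=(d_model, num_experts))
        # Each expert is a small two-layer MLP.
        self.w_in = rng.normal(scale=0.02, size=(num_experts, d_model, d_hidden))
        self.w_out = rng.normal(scale=0.02, size=(num_experts, d_hidden, d_model))

    def __call__(self, tokens):  # tokens: (num_tokens, d_model)
        gate_probs = softmax(tokens @ self.w_router)              # (T, E)
        top_experts = np.argsort(-gate_probs, axis=-1)[:, : self.top_k]
        out = np.zeros_like(tokens)
        for t, x in enumerate(tokens):
            for e in top_experts[t]:
                h = np.maximum(x @ self.w_in[e], 0.0)             # ReLU MLP expert
                out[t] += gate_probs[t, e] * (h @ self.w_out[e])
        # Only top_k of num_experts experts run per token, so compute grows
        # with top_k rather than with the total parameter count.
        return out

layer = SparseMoELayer()
print(layer(np.random.default_rng(1).normal(size=(4, 64))).shape)  # (4, 64)
```

In a real MoE Transformer this layer replaces the dense feed-forward block, and the router is typically trained with an auxiliary load-balancing loss so tokens spread evenly across experts; both refinements are omitted here for brevity.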
Long-Context Capabilities and Evaluations:
The paper heavily emphasizes and evaluates the long-context performance:
- Diagnostic Evaluations:
- Perplexity: Negative Log-Likelihood (NLL) decreases monotonically with sequence length up to 1M tokens (long documents) and 10M tokens (code), indicating the models effectively use the entire context to improve predictions. This improvement follows a power-law relationship with context length.
- Needle-in-a-Haystack: Both 1.5 Pro and Flash achieve near-perfect recall (>99%) when retrieving specific pieces of information ("needles") embedded within large amounts of distractor text ("haystack") across text, video, and audio modalities up to millions of tokens; a small-scale harness in this style is sketched just after these diagnostic items. 1.5 Pro maintains >99.7% recall at 1M tokens and 99.2% at 10M tokens in text, while Flash achieves 100% text recall up to 2M tokens. Video recall was tested up to 10.5 hours (9.9M tokens) and audio up to 107 hours (9.7M tokens) for 1.5 Pro.
- Limitations/Challenges: Performance degrades on more complex retrieval tasks, such as retrieving multiple needles (1.5 Pro recalls >60% of 100 needles at 1M tokens) and the Multi-round Co-reference Resolution (MRCR) task, which requires finer-grained reasoning (1.5 Pro and Flash score ~75% at 1M tokens); even so, both models still significantly outperform competitors at longer contexts.
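As referenced in the needle-in-a-haystack item above, the evaluation idea is simple enough to sketch. The following is a generic, small-scale harness, not the paper's exact protocol: `query_model` is a hypothetical stand-in for a long-context model call, and the needle text, filler, and depth grid are made up for illustration.

```python
NEEDLE = "The special magic number mentioned in the documents is 742."
QUESTION = "What is the special magic number mentioned in the documents?"

def build_haystack(filler_paragraphs, needle, depth_fraction):
    """Insert the needle at a given relative depth inside distractor text."""
    position = int(len(filler_paragraphs) * depth_fraction)
    parts = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(parts)

def needle_recall(query_model, filler_paragraphs, depths, answer="742"):
    """Fraction of needle placements where the model surfaces the answer."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(filler_paragraphs, NEEDLE, depth) + "\n\n" + QUESTION
        if answer in query_model(prompt):
            hits += 1
    return hits / len(depths)

if __name__ == "__main__":
    # Toy filler and a toy "model" that just searches the prompt,
    # standing in for a real long-context model call.
    filler = [f"Paragraph {i}: unrelated distractor text." for i in range(1000)]
    fake_model = lambda prompt: "742" if "742" in prompt else "I don't know."
    depths = [i / 10 for i in range(11)]  # needle placed from 0% to 100% depth
    print(f"recall = {needle_recall(fake_model, filler, depths):.2f}")
```

The paper's version additionally sweeps the haystack length (up to millions of tokens) and extends the same idea to audio and video needles.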
- Realistic Long-Context Evaluations:
- In-Context Learning (Kalamang): Gemini 1.5 Pro and Flash learn to translate English to Kalamang (a language with fewer than 200 speakers and minimal web presence) at a level comparable to a human learning from the same ~250k-token context (a grammar book, a dictionary, and parallel sentences). This demonstrates powerful in-context learning from long, previously unseen documents.
- In-Context ASR (ASROB): The first demonstration of an LLM learning speech recognition for a new language (Kalamang) in context, using mixed-modal documentation (text plus roughly 45 minutes of audio); 1.5 Pro achieved a 22.9% character error rate (CER).
- Low-Resource MT Scaling: Many-shot in-context learning (up to 4k examples, ~90k tokens) shows continued improvement in translation quality for low-resource languages, significantly outperforming GPT-4 Turbo; a prompt-construction sketch appears after this list.
- Long-Document QA: Answering questions over the entirety of Les Misérables (710k tokens) in a single context, Gemini 1.5 Pro significantly outperforms retrieval-augmented generation (RAG) setups that use 4k-token contexts with Gemini 1.0 Pro, 1.5 Pro, or GPT-4 Turbo.
- Long-Video QA: On a new benchmark (1H-VideoQA) with hour-long videos, 1.5 Pro's performance scales with context length, outperforming GPT-4V and achieving SotA. Flash also performs well.
- Long-Context ASR: 1.5 Pro achieves SotA WER (5.5%) on 15-minute videos without needing segmentation, outperforming USM and Whisper, which require it; Flash (8.8% WER) is also strong.
- In-Context Planning: 1.5 Pro shows strong few-shot and many-shot planning capabilities on PDDL and natural language planning tasks, outperforming GPT-4 Turbo and improving with context length.
- Unstructured Multimodal Data Analytics: 1.5 Pro effectively extracts structured information from 1024 images, outperforming competitors, with accuracy improving as more images (longer context) are processed simultaneously.
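To make the many-shot in-context learning setup from the low-resource MT item concrete, the sketch below assembles a translation prompt from an arbitrary number of example pairs. This is a generic illustration under assumed names (the example data and any downstream `call_model` invocation are hypothetical), not the exact prompt format used in the paper.

```python
def build_many_shot_mt_prompt(examples, source_sentence,
                              src_lang="English", tgt_lang="Kalamang"):
    """Concatenate (source, target) example pairs into a many-shot ICL prompt.

    With a long-context model, `examples` can hold thousands of pairs
    (the paper scales to ~4k shots / ~90k tokens); shorter-context models
    would have to truncate this list.
    """
    lines = [f"Translate from {src_lang} to {tgt_lang}.", ""]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)

# Toy usage with placeholder data; a real run would pass the prompt to a
# long-context model, e.g. translation = call_model(prompt).
examples = [("hello", "<target-1>"), ("thank you", "<target-2>")]
prompt = build_many_shot_mt_prompt(examples, "good morning")
print(prompt)
```

The point of the sketch is only that a 1M-10M token window lets the example list carry thousands of demonstrations, which is what drives the continued quality gains reported above.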
Core Capability Evaluations:
The paper demonstrates that the long-context improvements do not significantly compromise core capabilities. Gemini 1.5 Pro and Flash show substantial improvements over Gemini 1.0 Pro across nearly all benchmarks.
- Gemini 1.5 Pro vs. 1.0 Pro: Wins on 44/50 benchmarks.
- Gemini 1.5 Flash vs. 1.0 Pro: Wins on 41/50 benchmarks.
- Gemini 1.5 Pro vs. 1.0 Ultra: Wins on 35/45 benchmarks (majority of text and vision), despite significantly less training compute. Achieves SotA on several multimodal benchmarks (AI2D, MathVista, ChartQA, DocVQA, InfographicVQA, EgoSchema).
- Gemini 1.5 Flash vs. 1.0 Ultra: Wins on 21/44 benchmarks (majority of vision, nearly half of text), highlighting its efficiency.
- Specific Areas: Strong gains are noted in Math/Science/Reasoning, Code, Multilinguality, Vision (Multimodal Reasoning, Charts/Documents, Natural Images, Video), Function Calling, Instruction Following, and real-world expert/productivity tasks (professionals estimated time savings of 26-75% on their tasks). Audio performance is slightly mixed, with some regressions attributed to the focus of the post-training data.
Specialized Models:
- Math-Specialized Gemini 1.5 Pro: Achieves SotA on MATH (80.6%@1, 91.1%@256), AIME 2024, and other math benchmarks, demonstrating potential for deep expertise.
- Flash-8B: An experimental 8-billion parameter model derived from Flash, maintaining long context (>1M tokens) and multimodal capabilities with high efficiency, achieving ~80-90% of Flash's performance on initial evaluations.
Safety, Security, and Responsibility:
The paper details a comprehensive safety process including impact assessments, policy setting, safety training (pre-training filtering, SFT, RLHF), red teaming (internal/external), and assurance evaluations for governance.
- Findings:
- Safety: The 1.5 models show significant improvement (reduced policy violations) compared to 1.0 Ultra across modalities.
- Robustness: They are more robust to some jailbreaks (e.g., GCG) but remain vulnerable to others (handcrafted templates, prompt injection), potentially due to improved instruction following.
- Long-context safety: Adversarial needle-in-a-haystack evaluations did not show increased risk over short contexts in the tested scenario, but further research is required.
- Helpfulness: Quality ratings improved, but with regressions in tone and increased refusals on grounded queries.
- Memorization: Lower than in prior models.
- Dangerous capabilities: Mixed results, with some improvements in persuasion and CTF sub-tasks but no major leaps.
Discussion and Conclusion:
Gemini 1.5 Pro and Flash represent a generational leap, primarily through unlocking massive context windows (up to 10M tokens) for multimodal understanding without sacrificing core capabilities or efficiency relative to the previous generation. The paper concludes with a call to action for the research community to develop more challenging and nuanced benchmarks for evaluating long-context models, moving beyond simple retrieval tasks toward assessing complex reasoning over extended multimodal inputs.