Gigapixel Image Agent for Tissue Navigation
- The paper introduces GIANT, a novel framework that applies a human-like pan-and-zoom strategy to gigapixel whole-slide images, significantly improving performance over thumbnail-based methods.
- It integrates three core components—a Navigator Agent, Patch Extractor, and Language-Model Interface—allowing iterative visual-text reasoning and region-specific inquiry.
- Evaluated on the MultiPathQA benchmark, GIANT attains robust performance that approaches or surpasses specialist pathology models, despite challenges in computational demand and fine-grained assessment.
The Gigapixel Image Agent for Navigating Tissue (GIANT) is a framework that enables general-purpose Large Multimodal Models (LMMs) to perform expert-level navigation and reasoning on whole-slide images (WSIs) in pathology. Unlike prior LMM approaches that operate on low-resolution thumbnails or random patches—yielding poor or inconclusive performance—GIANT iteratively explores gigapixel images with a human-like pan-and-zoom strategy, closely reflecting the workflow of practicing pathologists. Developed and evaluated on the MultiPathQA benchmark, GIANT achieves substantial gains over traditional thumbnail and patch-based methods, approaching or even surpassing the accuracy of specialist pathology models on complex, multi-scale visual question answering tasks (Buckley et al., 24 Nov 2025).
1. System Architecture and Core Components
GIANT operates by orchestrating a sequence of interactions between three principal modules:
- Navigator Agent: Implements decision logic by proposing the next bounding-box navigation action based on the cumulative prompt history, which includes image crops and textual reasoning. The navigator is realized as a prompt to the LMM with explicit navigation instructions.
- Patch Extractor: Interacts with the OpenSlide image pyramid to extract high-resolution crops at specified locations and magnifications. Each crop is resampled so that its long side matches a fixed length (typically 1,000 pixels), preserving sharpness while maximizing pathology-relevant visual detail.
- Language-Model Interface: Maintains and updates the structured prompt history $P_t$, interleaving text, navigation actions, and image attachments. This interface orchestrates two categories of LMM calls:
  - Navigation calls: $(r_t, b_t) = \mathrm{LMM}(P_{t-1})$, repeatedly invoked for sequential exploration, each yielding textual reasoning $r_t$ and a bounding box $b_t$.
  - Final answer call: $\hat{y} = \mathrm{LMM}(P_T)$, producing the concluding answer.
These components are integrated into a loop that iteratively directs the LMM to reason about current observations, propose the next region to inspect, extract and append the corresponding image, and ultimately synthesize a final response.
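To make the Patch Extractor concrete, the following is a minimal sketch using the OpenSlide Python bindings and Pillow; the function name, bounding-box convention, and resampling filter are illustrative assumptions, not the authors' implementation:

```python
# Minimal Patch Extractor sketch (illustrative, not the authors' code).
# Assumes `openslide-python` and Pillow; `box` is given in level-0 pixel coordinates.
import openslide
from PIL import Image


def extract_crop(slide_path: str, box: tuple[int, int, int, int],
                 long_side: int = 1000) -> Image.Image:
    """Read the region box = (x, y, w, h) and resample its long side to ~long_side px."""
    x, y, w, h = box
    slide = openslide.OpenSlide(slide_path)

    # Pick the pyramid level closest to the downsample we actually need,
    # so we never decode far more pixels than the 1,000 px target requires.
    target_downsample = max(w, h) / long_side
    level = slide.get_best_level_for_downsample(target_downsample)
    ds = slide.level_downsamples[level]

    # read_region takes level-0 coordinates but a size expressed in `level` pixels.
    size = (max(1, int(w / ds)), max(1, int(h / ds)))
    region = slide.read_region((x, y), level, size).convert("RGB")
    slide.close()

    # Resample so the long side matches `long_side`, preserving aspect ratio.
    scale = long_side / max(region.size)
    return region.resize((max(1, round(region.width * scale)),
                          max(1, round(region.height * scale))), Image.LANCZOS)
```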
2. Iterative Pan-and-Zoom Navigation Process
GIANT’s workflow closely mimics the expert navigation of WSIs by pathologists, encompassing the following algorithmic stages:
- Initialization: Extract an initial low-resolution thumbnail $I_0$ and establish the starting context $P_0$ by combining $I_0$, the question $Q$, and system-level navigation instructions.
- Navigation Loop: For up to $T$ steps:
  - The LMM receives the current prompt $P_{t-1}$ and outputs textual reasoning $r_t$ and a navigation box $b_t$.
  - The Patch Extractor fetches the crop $I_t$ specified by $b_t$.
  - $P_t$ is formed by appending $(r_t, b_t, I_t)$ to $P_{t-1}$ for the next iteration.
  - The loop may terminate early if the LMM deems that sufficient evidence has been gathered.
- Final Prediction: Conditioned on the final context $P_T$, the LMM produces the answer $\hat{y}$.
This iterative cycle enables spatial and hierarchical exploration, unlocking multi-scale reasoning capabilities not accessible through single-view or patch-randomized strategies.
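A schematic sketch of this loop is shown below; the callables `lmm_navigate`, `lmm_answer`, and `extract_crop` are assumed interfaces standing in for the LMM calls and the Patch Extractor described above, not the authors' code:

```python
# Schematic pan-and-zoom navigation loop (a sketch of the described workflow).
def giant_loop(slide_path, question, thumbnail, instructions,
               lmm_navigate, lmm_answer, extract_crop, max_steps=10):
    # P_0: system-level navigation instructions + thumbnail I_0 + question Q.
    history = [("system", instructions), ("image", thumbnail), ("user", question)]

    for _ in range(max_steps):                        # up to T navigation steps
        reasoning, box, done = lmm_navigate(history)  # r_t, b_t, early-stop flag
        history.append(("assistant", f"{reasoning}\nBOX: {box}"))
        if done:                                      # model judges the evidence sufficient
            break
        crop = extract_crop(slide_path, box)          # I_t at the requested region
        history.append(("image", crop))               # P_t = P_{t-1} + (r_t, b_t, I_t)

    # Final answer call conditioned on the full multi-scale context P_T.
    return lmm_answer(history)
```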
3. Mathematical Formalization of Visual-Textual Reasoning
Although GIANT does not explicitly enumerate candidate regions, its selection behavior can be conceptually modeled as follows. At each step $t$, the LMM's internal attention mechanism and joint visual-textual encoding may be formalized as a scoring function

$$s_t(c) = \phi(I_c)^\top W\, \psi(Q, r_{1:t-1}),$$

where:
- $\phi(I_c)$ encodes image features for candidate crop $c$,
- $\psi(Q, r_{1:t-1})$ encodes text features for the question and accumulated reasoning,
- $W$ is an internal weight matrix, with scores optionally normalized (e.g., via softmax) across candidates $c$.

The LMM uses chain-of-thought prompting to implicitly identify and select the region maximizing this score,

$$b_t = \arg\max_{c} s_t(c).$$

This formulation frames navigation as an iterative, cross-modal optimization driven by the LMM's own training objective and in-context signal.
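As a toy numerical illustration of this conceptual score (the shapes and feature values below are arbitrary; GIANT never materializes these quantities explicitly):

```python
# Toy illustration of the conceptual bilinear score s_t(c) = phi(I_c)^T W psi(Q, r).
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, n_candidates = 16, 16, 5

phi = rng.normal(size=(n_candidates, d_img))   # image features phi(I_c), one row per candidate crop
psi = rng.normal(size=(d_txt,))                # text features psi(Q, r_{1:t-1})
W = rng.normal(size=(d_img, d_txt))            # internal weight matrix

scores = phi @ W @ psi                         # s_t(c) for every candidate c
probs = np.exp(scores - scores.max())
probs /= probs.sum()                           # optional softmax normalization across c

b_t = int(np.argmax(scores))                   # region maximizing the score
print(f"selected candidate {b_t}, scores={np.round(scores, 2)}")
```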
4. Integration with Large Multimodal Models
GIANT leverages GPT-5, a transformer-based LMM capable of ingesting mixed visual and textual input sequences. Each extracted crop is encoded as a vision token; these tokens are concatenated with text tokens representing both the slide question and model-generated chain-of-thought reasoning. The prompting protocol consists of:
- System prompt: Assigns the LMM the role of a pathology navigation agent, specifies the maximum number of crops $T$, and requires each response to include reasoning and a bounding-box output.
- User prompt: Presents the slide-level question $Q$.
- Interleaved in-context chain: For each navigation step, vision tokens (from image crops) and the reasoning $r_t$ are provided in sequence, followed by the LMM's proposed action $b_t$.
- Final call: The accumulated sequence—thumbnail, question, navigation history, multi-scale crops, and rationales—conditions the LMM to synthesize its answer $\hat{y}$.
Notably, GIANT does not fine-tune the LMM; all performance is achieved in a zero-shot (or few-shot) prompting regime.
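A sketch of how the accumulated context might be assembled for the final call is shown below, assuming an OpenAI-style chat format with data-URL image parts; the message schema and instruction wording are assumptions, not the paper's exact prompts:

```python
# Sketch of assembling the interleaved final-call prompt P_T (assumed schema).
import base64


def image_part(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def build_final_prompt(question: str, thumbnail: bytes, steps: list[dict],
                       max_crops: int) -> list[dict]:
    """steps: one dict per navigation step with keys 'reasoning', 'box', 'crop' (PNG bytes)."""
    messages = [
        {"role": "system", "content":
         f"You are a pathology navigation agent. You may request at most {max_crops} crops. "
         "Each response must contain your reasoning and a bounding box."},
        {"role": "user", "content": [
            {"type": "text", "text": question},
            image_part(thumbnail),
        ]},
    ]
    for step in steps:  # interleave rationale, proposed box, and the returned crop
        messages.append({"role": "assistant",
                         "content": f"{step['reasoning']}\nBOX: {step['box']}"})
        messages.append({"role": "user", "content": [image_part(step["crop"])]})
    messages.append({"role": "user",
                     "content": "Based on all crops above, give your final answer."})
    return messages
```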
5. MultiPathQA Task Suite and Adaptation
MultiPathQA provides a comprehensive benchmark of 934 WSI-level questions across five clinically-relevant tasks, enabling rigorous evaluation of GIANT’s generalizability and reasoning:
| Task | Answer Format | GIANT Usage |
|---|---|---|
| Organ Classification | 20-way class | Initial overview and zooms to resolve tissue identity |
| Cancer Diagnosis | 30-way class | Global scan, then focus on tumor–normal and nuclear features |
| Cancer Grading | 6-way class | Attention to gland and stroma details |
| SlideBenchVQA | MCQ / Free | Navigation to regions relevant for slide-level VQA |
| ExpertVQA | Free-text | Multi-step, multi-scale localization and justification |
Each task uses the same pan-and-zoom agentic loop, differing only in prompt selection and answer format. Questions range from categorical classifications (e.g., organ, tumor type, grade) to free-form, open-ended expert queries requiring spatial localization and descriptive reasoning.
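As an illustration of this task-agnostic design, a minimal configuration sketch follows; the task names and class counts come from the table above, while the instruction strings themselves are assumptions:

```python
# Illustrative task configuration: the pan-and-zoom loop is shared across tasks,
# and only the answer-format instruction changes (instruction wording assumed).
ANSWER_FORMATS = {
    "organ_classification": "exactly one of the 20 organ labels",
    "cancer_diagnosis":     "exactly one of the 30 diagnosis labels",
    "cancer_grading":       "exactly one of the 6 grade labels",
    "slidebench_vqa":       "the letter of the correct option (or free text where required)",
    "expert_vqa":           "a free-text answer with spatial justification",
}


def answer_instruction(task: str) -> str:
    # Appended to the shared prompt; everything else in the agentic loop is identical.
    return f"After navigating, answer with {ANSWER_FORMATS[task]}."
```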
6. Comparative Performance Metrics
GIANT delivers substantial improvements over patch and thumbnail baselines (single run), as quantified by top-1 balanced accuracy for classification and raw accuracy for VQA:
| Task | Thumbnail (%) | Patch (%) | GIANT (%) | GIANT ×5 (%) |
|---|---|---|---|---|
| Cancer Diagnosis | 9.2 | 12.8 | 32.3 | – |
| Organ Classification | 36.5 | 43.7 | 53.7 | 60.7 |
| Cancer Grading | 12.2 | 21.3 | 23.2 | 25.4 |
| SlideBenchVQA | 54.8 | 52.3 | 58.9 | – |
| ExpertVQA | 50.0 | 43.8 | 57.0 | 62.5 |
On pathologist-authored ExpertVQA items, GIANT combined with GPT-5 achieves 62.5% accuracy, notably surpassing specialist models TITAN (43.8%) and SlideChat (37.5%). Majority-vote aggregation across five independent GIANT runs (“GIANT ×5”) further enhances performance on several tasks. These results demonstrate the efficacy of iterative, agentic strategies for complex medical image reasoning (Buckley et al., 24 Nov 2025).
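The "×5" aggregation can be sketched as a simple majority vote over five independent runs; the tie-breaking rule below is an assumption:

```python
# Minimal sketch of "GIANT x5" aggregation: majority vote over independent runs.
from collections import Counter


def majority_vote(predictions: list[str]) -> str:
    counts = Counter(predictions)
    # most_common orders by count; first-seen label wins ties (assumed tie-break).
    return counts.most_common(1)[0][0]


# Example: five independent runs on one slide
runs = ["adenocarcinoma", "adenocarcinoma", "squamous", "adenocarcinoma", "squamous"]
print(majority_vote(runs))  # -> "adenocarcinoma"
```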
7. Strengths, Limitations, and Prospective Advancements
Strengths
- The agentic, human-like navigation unlocks spatial and hierarchical reasoning not feasible through traditional patch or thumbnail analysis.
- Zero-shot prompting yields robust and generalizable performance across diverse clinical and open-domain tasks.
- GIANT provides transparent “reasoning traces” (zoom sequences, rationales) that are accessible for expert audit; 62.9% of zoom steps were evaluated as appropriate by a pathologist.
Limitations
- GIANT’s accuracy, while competitive on VQA, is substantially below that of purpose-trained encoders for pure classification (e.g., TITAN exceeds 88% on TCGA diagnostic classification).
- Diagnostic anchoring and hallucinations in open-ended reasoning limit clinical deployment without further guardrails.
- Computational demand is high, with approximately 20 LMM invocations per slide during inference.
- ISUP grading task accuracy plateaus near 25%, indicating persistent challenges in fine-grained histopathological assessment.
Future Directions
Proposed avenues for enhancement include:
- Development of a lightweight region scoring module to delegate proposal selection, reducing LMM dependency during navigation.
- Integration of domain-specific vision encoders (e.g., CONCH), utilizing correctness checks to prevent propagation of errors.
- Application of reinforcement learning or imitation learning leveraging pathologist navigation traces to refine policy.
- Augmentation of prompt context with memory mechanisms to recognize and avoid redundancy in explored regions.
- Fine-tuning LMMs on paired image-text reasoning traces to attenuate hallucination rates.
Overall, these results suggest that pairing structured navigation with foundation models can bridge much of the gap to specialist pathology models for multi-scale, expert-level reasoning, while also highlighting the need for domain priors and more robust supervision prior to clinical use (Buckley et al., 24 Nov 2025).