
Gigapixel Image Agent for Tissue Navigation

Updated 26 November 2025
  • The paper demonstrates GIANT, a framework that applies a human-like pan-and-zoom strategy to gigapixel whole-slide images, significantly improving performance over thumbnail-based methods.
  • It integrates three core components—a Navigator Agent, Patch Extractor, and Language-Model Interface—allowing iterative visual-text reasoning and region-specific inquiry.
  • Evaluated on the MultiPathQA benchmark, GIANT attains robust performance that approaches or surpasses specialist pathology models, despite challenges in computational demand and fine-grained assessment.

The Gigapixel Image Agent for Navigating Tissue (GIANT) is a framework that enables general-purpose Large Multimodal Models (LMMs) to perform expert-level navigation and reasoning on whole-slide images (WSIs) in pathology. Unlike prior LMM approaches that operate on low-resolution thumbnails or random patches—yielding poor or inconclusive performance—GIANT iteratively explores gigapixel images with a human-like pan-and-zoom strategy, closely reflecting the workflow of practicing pathologists. Developed and evaluated on the MultiPathQA benchmark, GIANT achieves substantial gains over traditional thumbnail and patch-based methods, approaching or even surpassing the accuracy of specialist pathology models on complex, multi-scale visual question answering tasks (Buckley et al., 24 Nov 2025).

1. System Architecture and Core Components

GIANT operates by orchestrating a sequence of interactions between three principal modules:

  1. Navigator Agent: Implements the decision logic by proposing the next bounding-box navigation action $a_t = (x_t, y_t, w_t, h_t)$ based on the cumulative prompt history, which includes image crops and textual reasoning. The navigator is realized as a prompt to the LMM with explicit navigation instructions.
  2. Patch Extractor: Interacts with the OpenSlide image pyramid to extract high-resolution crops $I_t$ at specified locations and magnifications. Each crop is resampled so that its long side has a fixed length $S$ (typically 1,000 pixels), ensuring sharpness while maximizing pathology-relevant visual detail (sketched below).
  3. Language-Model Interface: Maintains and updates the structured prompt history $C$, interleaving text, navigation actions, and image attachments. This interface orchestrates two categories of LMM calls:
    • Navigation calls: $(r_t, a_t) \leftarrow \mathrm{LMM}(C)$, repeatedly invoked for sequential exploration.
    • Final answer call: $\hat{y} \leftarrow \mathrm{LMM}(C_{\mathrm{final}})$, producing the concluding answer.

These components are integrated into a loop that iteratively directs the LMM to reason about current observations, propose the next region to inspect, extract and append the corresponding image, and ultimately synthesize a final response.
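A minimal sketch of such an extractor, using the OpenSlide Python API with the long side resampled to $S$ = 1,000 px (the function name and the level-selection heuristic are illustrative, not taken from the paper):

```python
import openslide
from PIL import Image

S = 1000  # fixed long-side length for extracted crops, as described above

def extract_crop(slide: openslide.OpenSlide, x: int, y: int, w: int, h: int) -> Image.Image:
    """Read the level-0 region (x, y, w, h) and resample its long side to S px."""
    downsample = max(w, h) / S                               # shrink factor required
    level = slide.get_best_level_for_downsample(downsample)  # closest pyramid level
    ds = slide.level_downsamples[level]
    region = slide.read_region(
        (x, y), level, (max(1, int(w / ds)), max(1, int(h / ds)))
    ).convert("RGB")
    scale = S / max(region.size)
    return region.resize(
        (max(1, round(region.width * scale)), max(1, round(region.height * scale))),
        Image.LANCZOS,
    )
```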

2. Iterative Pan-and-Zoom Navigation Process

GIANT’s workflow closely mimics the expert navigation of WSIs by pathologists, encompassing the following algorithmic stages:

  1. Initialization: Extract an initial low-resolution thumbnail $I_0$ and establish the context $C$ combining $I_0$, the question $q$, and system-level navigation instructions.
  2. Navigation Loop: For up to $T$ steps:
    • The LMM receives the current prompt $C$ and outputs textual reasoning $r_t$ and a navigation box $a_t$.
    • The Patch Extractor fetches the crop $I_t$ specified by $a_t$.
    • $C$ is updated with $(r_t, a_t, I_t)$ for the next iteration.
    • The loop may terminate early if the LMM deems sufficient evidence has been gathered.
  3. Final Prediction: Conditioned on the final context $C_{\mathrm{final}}$, the LMM produces the answer $\hat{y}$.

This iterative cycle enables spatial and hierarchical exploration, unlocking multi-scale reasoning capabilities not accessible through single-view or patch-randomized strategies.
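A compact sketch of this loop, assuming a hypothetical `lmm` wrapper exposing the two LMM call types and reusing the `extract_crop` helper sketched in Section 1 (the paper specifies the control flow, not this exact code):

```python
def giant_episode(slide, question, lmm, max_steps=20):
    """Run one GIANT episode: thumbnail init, pan-and-zoom loop, final answer."""
    context = [slide.get_thumbnail((S, S)), question]   # C starts with I_0 and q
    for _ in range(max_steps):                          # up to T navigation steps
        reasoning, action = lmm.navigate(context)       # (r_t, a_t) <- LMM(C)
        if action is None:                              # early stop: evidence sufficient
            break
        crop = extract_crop(slide, *action)             # I_t at the requested box
        context += [reasoning, action, crop]            # C updated with (r_t, a_t, I_t)
    return lmm.answer(context)                          # y_hat <- LMM(C_final)
```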

3. Mathematical Formalization of Visual-Textual Reasoning

Although GIANT does not explicitly enumerate candidate regions, its selection method can be conceptually modeled as follows. At each step $t$, the LMM’s internal attention mechanism and joint visual-textual encoding may be formalized as a scoring function:

$$s_t(i) = \mathrm{Attention}\left(\Phi_{\mathrm{img}}(I_{t-1,i}),\ \Phi_{\mathrm{txt}}(q, r_{1:t-1})\right)$$

where:

  • $\Phi_{\mathrm{img}}(I_{t-1,i}) \in \mathbb{R}^d$ encodes image features for candidate crop $I_{t-1,i}$,
  • $\Phi_{\mathrm{txt}}(q, r_{1:t-1}) \in \mathbb{R}^d$ encodes text features for the question and accumulated reasoning,
  • $\mathrm{Attention}(u, v) = u^\top W v$ for an internal weight matrix $W$, optionally normalized across $i$.

The LMM uses chain-of-thought prompting to implicitly identify and select the region maximizing this score:

$$a_t = \arg\max_i\, s_t(i).$$

This formulation frames navigation as an iterative, cross-modal optimization driven by the LMM’s training objective and in-context signal.
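To make this rule concrete, the bilinear score and argmax can be exercised with random stand-in features; this is purely illustrative, since the paper does not expose the LMM’s internal encoders or weights:

```python
import numpy as np

d, n = 512, 16                              # feature dim, number of candidate crops
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))             # stand-in for the internal matrix W
img_feats = rng.standard_normal((n, d))     # Phi_img(I_{t-1,i}) for each candidate i
txt_feat = rng.standard_normal(d)           # Phi_txt(q, r_{1:t-1})

scores = img_feats @ W @ txt_feat           # s_t(i) = Phi_img(i)^T W Phi_txt
probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # optional softmax normalization over i
a_t = int(np.argmax(scores))                # a_t = argmax_i s_t(i)
print(f"selected candidate region: {a_t}")
```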

4. Integration with Large Multimodal Models

GIANT leverages GPT-5, an LMM with a transformer architecture capable of ingesting mixed visual and textual input sequences. Each extracted crop $I_t$ is encoded as a vision token; these tokens are concatenated with text tokens representing both the slide question and model-generated chain-of-thought reasoning. The prompting protocol consists of:

  • System prompt: Assigns the LMM the role of a pathology navigation agent, specifying a maximum number of crops ($T-1$) and requiring each response to include reasoning and a bounding-box output.
  • User prompt: Presents the slide-level question $q$.
  • Interleaved in-context chain: For each navigation step, vision tokens (from image crops) and reasoning are provided in sequence, followed by the LMM’s proposed action.
  • Final call: The accumulated sequence—thumbnail, question, navigation history, multi-scale crops, and rationales—conditions the LMM to synthesize its answer $\hat{y}$.

Notably, GIANT does not fine-tune the LMM; all performance is achieved in a zero-shot (or few-shot) prompting regime.
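A sketch of how such an interleaved prompt might be assembled in a generic chat-message format (the system-prompt wording and message schema below are assumptions for illustration, not quoted from the paper):

```python
def build_messages(question: str, history: list, max_crops: int) -> list:
    """Assemble the interleaved prompt: system role, question, then one
    (reasoning, box, crop) triple per completed navigation step."""
    messages = [
        {"role": "system",
         "content": (f"You are a pathology navigation agent. You may request up to "
                     f"{max_crops} crops. Each reply must contain your reasoning and "
                     f"a bounding box (x, y, w, h), or DONE when ready to answer.")},
        {"role": "user", "content": [{"type": "text", "text": question}]},
    ]
    for reasoning, box, crop_b64 in history:
        messages.append({"role": "assistant",
                         "content": f"{reasoning}\nBOX: {box}"})
        messages.append({"role": "user",
                         "content": [{"type": "image_url",
                                      "image_url": {"url": f"data:image/png;base64,{crop_b64}"}}]})
    return messages
```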

5. MultiPathQA Task Suite and Adaptation

MultiPathQA provides a comprehensive benchmark of 934 WSI-level questions across five clinically relevant tasks, enabling rigorous evaluation of GIANT’s generalizability and reasoning:

| Task                 | Answer Format   | GIANT Usage                                                  |
|----------------------|-----------------|--------------------------------------------------------------|
| Organ Classification | 20-way class    | Initial overview and zooms to resolve tissue identity        |
| Cancer Diagnosis     | 30-way class    | Global scan, then focus on tumor–normal and nuclear features |
| Cancer Grading       | 6-way class     | Attention to gland and stroma details                        |
| SlideBenchVQA        | MCQ / free-text | Navigation to regions relevant for slide-level VQA           |
| ExpertVQA            | Free-text       | Multi-step, multi-scale localization and justification       |

Each task uses the same pan-and-zoom agentic loop, differing only in prompt selection and answer format. Questions range from categorical classifications (e.g., organ, tumor type, grade) to free-form, open-ended expert queries requiring spatial localization and descriptive reasoning.
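A hypothetical task registry illustrates this shared-loop design: each task supplies only a prompt template and an answer format, while the episode function stays identical (all names and templates below are invented for illustration):

```python
# Hypothetical registry: one shared agentic loop, per-task prompts and parsing.
TASKS = {
    "organ":      ("Which organ is shown? Choose one: {choices}",  "20-way class"),
    "diagnosis":  ("What is the diagnosis? Choose one: {choices}", "30-way class"),
    "grading":    ("Assign a grade. Choose one: {choices}",        "6-way class"),
    "slidebench": ("{question}\nOptions: {choices}",               "MCQ / free-text"),
    "expert":     ("{question}",                                   "free-text"),
}

def run_task(slide, task: str, lmm, question: str = "", choices: str = ""):
    template, _answer_format = TASKS[task]
    q = template.format(question=question, choices=choices)
    return giant_episode(slide, q, lmm)   # same pan-and-zoom loop for every task
```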

6. Comparative Performance Metrics

GIANT delivers substantial improvements over patch and thumbnail baselines (single run, $T = 20$), as quantified by top-1 balanced accuracy for classification and raw accuracy for VQA:

| Task                 | Thumbnail (%) | Patch (%) | GIANT (%) | GIANT ×5 (%) |
|----------------------|---------------|-----------|-----------|--------------|
| Cancer Diagnosis     | 9.2           | 12.8      | 32.3      | —            |
| Organ Classification | 36.5          | 43.7      | 53.7      | 60.7         |
| Cancer Grading       | 12.2          | 21.3      | 23.2      | 25.4         |
| SlideBenchVQA        | 54.8          | 52.3      | 58.9      | —            |
| ExpertVQA            | 50.0          | 43.8      | 57.0      | 62.5         |

On pathologist-authored ExpertVQA items, GIANT combined with GPT-5 achieves 62.5% accuracy, notably surpassing specialist models TITAN (43.8%) and SlideChat (37.5%). Majority-vote aggregation across five independent GIANT runs (“GIANT ×5”) further enhances performance on several tasks. These results demonstrate the efficacy of iterative, agentic strategies for complex medical image reasoning (Buckley et al., 24 Nov 2025).
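The “GIANT ×5” aggregation is plain majority voting over independent runs; a minimal sketch, reusing the giant_episode function from Section 2:

```python
from collections import Counter

def giant_x5(slide, question, lmm, runs: int = 5) -> str:
    """Majority vote over independent GIANT runs ('GIANT ×5' in the table above)."""
    answers = [giant_episode(slide, question, lmm) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]
```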

7. Strengths, Limitations, and Prospective Advancements

Strengths

  • The agentic, human-like navigation unlocks spatial and hierarchical reasoning not feasible through traditional patch or thumbnail analysis.
  • Zero-shot prompting yields robust and generalizable performance across diverse clinical and open-domain tasks.
  • GIANT provides transparent “reasoning traces” (zoom sequences, rationales) that are accessible for expert audit; 62.9% of zoom steps were evaluated as appropriate by a pathologist.

Limitations

  • GIANT’s accuracy, while competitive on VQA, is substantially below that of purpose-trained encoders for pure classification (e.g., TITAN exceeds 88% on TCGA diagnostic classification).
  • Diagnostic anchoring and hallucinations in open-ended reasoning limit clinical deployment without further guardrails.
  • Computational demand is high, with approximately 20 LMM invocations per slide during inference.
  • ISUP grading task accuracy plateaus near 25%, indicating persistent challenges in fine-grained histopathological assessment.

Future Directions

Proposed avenues for enhancement include:

  1. Development of a lightweight region scoring module to delegate proposal selection, reducing LMM dependency during navigation.
  2. Integration of domain-specific vision encoders (e.g., CONCH), utilizing correctness checks to prevent propagation of errors.
  3. Application of reinforcement learning or imitation learning leveraging pathologist navigation traces to refine policy.
  4. Augmentation of prompt context with memory mechanisms to recognize and avoid redundancy in explored regions.
  5. Fine-tuning LMMs on paired image-text reasoning traces to attenuate hallucination rates.

Taken together, these results suggest that structured navigation combined with foundation models can bridge much of the gap to specialist pathology models for multi-scale, expert-level reasoning, while also highlighting the need for domain priors and more robust supervision prior to clinical use (Buckley et al., 24 Nov 2025).
