
Thinking with Comics: Multimodal Reasoning

Updated 4 February 2026
  • Thinking with Comics is a multimodal paradigm using sequential panels that encode spatial, temporal, and causal information for advanced reasoning and inference.
  • It offers high information efficiency by reducing redundancy and computational cost compared to video, while preserving temporal order and narrative clarity.
  • Experimental findings confirm that proper panel ordering and embedded text are critical, significantly enhancing task accuracy in benchmarks like MATH500 and DocVQA.

Thinking with comics refers to the use of sequential art—panels comprising images, embedded text, and explicit narrative structure—as a substrate for reasoning, sense-making, and cognitive modeling across domains. In both human learning and algorithmic multimodal systems, comics serve as a high information-density medium that uniquely preserves temporal, causal, and discursive structure with reduced redundancy and computational cost relative to video, while supporting richer inferences than static images alone (Chen et al., 2 Feb 2026).

1. Formal Paradigms of Thinking with Comics

Thinking with Comics (TwC) is formalized as a multimodal Chain-of-Thought (CoT) paradigm where reasoning steps are externalized by generating a sequence of comic panels $\mathcal{C} = \{c_1,\dots,c_K\}$. Each panel encodes spatial configuration, object relations, temporal evolution, and optional embedded text (e.g., speech bubbles, narration).

Two primary inference paradigms are distinguished:

  • End-to-End Visualized Reasoning (Path I): Generate panels as explicit intermediate reasoning states, $\mathcal{C} = G_\theta(q)$, where $q$ is the query and the internal state transition is $h_t = f(h_{t-1}, q)$. The output is a final answer extracted from the last panel of the sequence: $\hat a = R(c_K)$.
  • Comics as Context (Path II): The comic sequence is employed as additional context: $\hat a = \arg\max_a p(a\mid q, \mathcal{C})$, conditioning a multimodal LLM (MLLM) to reason over both the question and the visual-narrative trajectory.
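The two paths can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of Chen et al.: the `Panel` type, the function names, and the toy generator, reader, and scorer are all hypothetical stand-ins for an image generator $G_\theta$, an answer reader $R$, and an MLLM likelihood $p(a \mid q, \mathcal{C})$. Path II is restricted here to a finite candidate set so that the argmax is computable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Panel:
    """One comic panel c_k (hypothetical type): image reference plus optional text."""
    image: str       # placeholder for image data, e.g. a file path
    text: str = ""   # speech bubble or narration; may be empty


Comic = List[Panel]


def path_one(query: str,
             generate: Callable[[str], Comic],
             read_answer: Callable[[Panel], str]) -> str:
    """Path I: C = G_theta(q), then a_hat = R(c_K) from the final panel."""
    comic = generate(query)
    if not comic:
        raise ValueError("generator returned no panels")
    return read_answer(comic[-1])


def path_two(query: str,
             comic: Comic,
             candidates: Sequence[str],
             score: Callable[[str, str, Comic], float]) -> str:
    """Path II: a_hat = argmax_a p(a | q, C) over a finite candidate set."""
    return max(candidates, key=lambda a: score(a, query, comic))


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_generate(query: str) -> Comic:
    return [
        Panel("p1.png", "Start with 2 apples"),
        Panel("p2.png", "Add 3 more"),
        Panel("p3.png", "Answer: 5"),
    ]


def toy_read_answer(panel: Panel) -> str:
    # Reads whatever follows "Answer:" in the panel text.
    return panel.text.split("Answer:")[-1].strip()


def toy_score(answer: str, query: str, comic: Comic) -> float:
    # Counts panels whose embedded text mentions the candidate answer.
    return float(sum(answer in p.text for p in comic))
```

In Path I the comic is the reasoning trace and the answer is read off the last panel; in Path II the comic is treated as fixed evidence and any scoring model can be plugged in through `score`.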

The modalities used for comparison are Thinking with Images (TWI), which lacks ordered temporal structure, and Thinking with Video, which preserves temporal order but introduces frame-level redundancy and a cost that grows linearly with sequence length (Chen et al., 2 Feb 2026).

Key distinctions (see table below):

| Modality | Temporal Structure | Redundancy | Info Density | Cost Scaling |
|---|---|---|---|---|
| Images | None | Low | Low | Constant/image |
| Comics (TwC) | Ordered Panels | Low | High | Constant/sequence |
| Video | Frame Sequence | High | Moderate | Linear/time |

2. Cognitive and Inferential Mechanisms: Closure and Narrative Coherence

Closure, as defined by McCloud (1994), is the cognitive operation where the reader infers unseen transitions (action, causality, temporal progression) across the gutters that separate panels. Formally, closure is represented by a function $\text{closure}(p_{i-1}, \dots, p_{i-n}) \rightarrow \Delta_i$ mapping a sequence of prior panels to inferred events between them (Iyyer et al., 2016).

In computational settings, closure is tested by cloze-style tasks, requiring models to predict withheld textual or visual content given panel context windows. Baseline multimodal models underperform human baselines significantly (e.g., for text cloze accuracy: best model 61.0%, human 84%), demonstrating the challenge of inferring implicit narrative links (Iyyer et al., 2016).
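A cloze-style evaluation of this kind reduces to a simple loop: score each candidate against the panel context, pick the best, and count matches. The sketch below is an illustration only; the item layout, the function names, and the word-overlap scorer are assumptions made for this example and do not reproduce the COMICS benchmark or its models.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical cloze item: (context panel texts, candidate fillers, index of gold candidate)
ClozeItem = Tuple[List[str], List[str], int]


def cloze_accuracy(items: Sequence[ClozeItem],
                   score: Callable[[List[str], str], float]) -> float:
    """Fraction of items where the top-scoring candidate is the withheld one."""
    if not items:
        raise ValueError("no cloze items")
    correct = 0
    for context, candidates, gold in items:
        scores = [score(context, cand) for cand in candidates]
        predicted = scores.index(max(scores))  # ties resolve to the first candidate
        correct += int(predicted == gold)
    return correct / len(items)


def word_overlap(context: List[str], candidate: str) -> float:
    """Toy scorer: number of candidate words that already appear in the context."""
    seen = {w.lower() for line in context for w in line.split()}
    return float(sum(w.lower() in seen for w in candidate.split()))


items: List[ClozeItem] = [
    (["The door creaks open", "A shadow crosses the door"],
     ["Who opened the door", "I like soup"], 0),
    (["She drops the vase", "Shards cover the floor"],
     ["Nice weather today", "The vase is broken"], 1),
]
```

A real closure model would replace `word_overlap` with a multimodal scorer over panel images and text; the surface-overlap heuristic only works on these toy items because the correct fillers repeat context words.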

Closure enables:

  • Bridging temporal and spatial jumps (scene-to-scene, subject-to-subject)
  • Inferring off-panel actions and causal mechanisms
  • Maintaining continuity of dialogue and events across depicted gaps

This illustrates that comics fundamentally require multimodal reasoning: neither text nor image alone suffices for coherent understanding.

3. Experimental Methodologies and Efficiency Metrics

The TwC work combines theoretical cost models with empirical measurements to argue that comics are more efficient than alternative modalities for temporal and causal reasoning (Chen et al., 2 Feb 2026).

  • Cost Models: For video, cost scales linearly with time, $C_{\mathrm{video}}(t) = \alpha\cdot t$; for comics, $C_{\mathrm{comic}} = \beta$ per sequence. With empirical pricing ($\alpha = \$0.10/\mathrm{s}$, $\beta = \$0.134/\mathrm{img}$), comics provide an 86.6% reduction over a 10-second video.
  • Information-Efficiency: Defined as $\eta(z) = \frac{I(a;z\mid q)}{C(z)}$, where $I(a;z\mid q)$ quantifies the mutual information between answer and comic given question, and $C(z)$ measures the cost.
  • Scaling Laws: Accuracy on complex tasks (e.g., MATH500) rises steeply from 66.5% (1 panel) to 92.0% (8 panels), plateauing with $4 \leq K \leq 6$.

Quantitative ablation confirms that panel ordering is critical (randomizing panel order or deleting panels degrades accuracy by $\sim$3–8 percentage points) and that embedded text contributes directly to information content (accuracy decrements of 8.3–18.1 percentage points when removed) (Chen et al., 2 Feb 2026).
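The cost figures above can be checked with a short calculation. The sketch below encodes the two cost models with the quoted prices as defaults; the function names are illustrative, and it assumes the whole comic is billed as a single generated image, which is how the per-sequence cost $\beta$ is read here. The mutual-information term of $\eta$ has no simple closed form, so `information_efficiency` takes an externally supplied estimate rather than computing one.

```python
def video_cost(seconds: float, alpha: float = 0.10) -> float:
    """C_video(t) = alpha * t, with alpha in dollars per second of video."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return alpha * seconds


def comic_cost(beta: float = 0.134) -> float:
    """C_comic = beta: a flat dollar cost, independent of story duration."""
    return beta


def cost_reduction(seconds: float, alpha: float = 0.10, beta: float = 0.134) -> float:
    """Fractional saving of one comic relative to a video of the given length."""
    return 1.0 - comic_cost(beta) / video_cost(seconds, alpha)


def break_even_seconds(alpha: float = 0.10, beta: float = 0.134) -> float:
    """Video length at which both modalities cost the same (beta / alpha)."""
    return beta / alpha


def information_efficiency(mutual_information: float, cost: float) -> float:
    """eta(z) = I(a; z | q) / C(z); the mutual-information estimate is caller-supplied."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return mutual_information / cost


if __name__ == "__main__":
    print(f"{cost_reduction(10.0):.1%}")  # 86.6%
```

Under these prices the break-even point is $\beta/\alpha = 1.34$ seconds, so a comic is cheaper than any video longer than that. If a comic and a 10-second video carried the same information about the answer (a hypothetical case, not a reported result), the comic's $\eta$ would be higher by the cost ratio $1.00/0.134 \approx 7.5$.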

    4. Narrative Structure, Style, and Role-Taking in Reasoning

    Style and narrative conventions in comics directly modulate the effectiveness of both human and machine reasoning:

    • Role-Playing Alignment: Experiments demonstrate that styles aligned with narrative problem-solving (detective, slice-of-life) yield substantial absolute accuracy gains: detective style raises MathVista accuracy by 25 percentage points (relative 44.5% improvement) over a documentary baseline (Chen et al., 2 Feb 2026).
    • Anthropomorphism and Plot-Driven Analogies: In studies of student physics learning, anthropomorphized charges (e.g., as rescuers, family members) enabled learners to externalize and refine conceptual models of attraction, repulsion, and vector composition prior to formal instruction (Cao, 2017).
    • Panel Sequencing: Both machine and human reasoners demonstrate degradation with disordered sequences, confirming that comics encode causal and temporal logic and not merely aggregated state snapshots.
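An ordering ablation of the kind described above can be sketched as a small harness that perturbs each comic and re-measures accuracy. Everything here is a hypothetical illustration: panels are modeled as `(image, text)` tuples, and the toy solver reads each panel's text as an arithmetic step so that order genuinely matters. It is not the evaluation code of Chen et al.

```python
import random
from typing import Callable, List, Sequence, Tuple

Panel = Tuple[str, str]  # (image reference, embedded text); hypothetical layout
Example = Tuple[List[Panel], int]  # (comic, gold answer)


def shuffle_panels(panels: Sequence[Panel], seed: int = 0) -> List[Panel]:
    """Return a randomly reordered copy of the panels (seeded for repeatability)."""
    rng = random.Random(seed)
    shuffled = list(panels)
    rng.shuffle(shuffled)
    return shuffled


def reverse_panels(panels: Sequence[Panel]) -> List[Panel]:
    """Deterministic ordering perturbation: read the comic back to front."""
    return list(reversed(panels))


def drop_panel(panels: Sequence[Panel], index: int) -> List[Panel]:
    """Return a copy with the panel at `index` deleted."""
    return [p for i, p in enumerate(panels) if i != index]


def strip_text(panels: Sequence[Panel]) -> List[Panel]:
    """Return a copy with all embedded text removed."""
    return [(image, "") for image, _ in panels]


def accuracy_under(perturb: Callable[[Sequence[Panel]], List[Panel]],
                   dataset: Sequence[Example],
                   solve: Callable[[Sequence[Panel]], int]) -> float:
    """Accuracy of `solve` when every comic is first passed through `perturb`."""
    hits = sum(solve(perturb(panels)) == gold for panels, gold in dataset)
    return hits / len(dataset)


def toy_solve(panels: Sequence[Panel]) -> int:
    """Toy reasoner: applies each panel's "+n" or "*n" step in reading order."""
    value = 0
    for _image, text in panels:
        if not text:
            continue  # a panel without text contributes no step
        op, operand = text[0], int(text[1:])
        value = value + operand if op == "+" else value * operand
    return value


dataset: List[Example] = [
    ([("a.png", "+2"), ("b.png", "*3")], 6),
    ([("c.png", "+1"), ("d.png", "*5"), ("e.png", "+4")], 9),
]
```

Calling `accuracy_under(list, dataset, toy_solve)` gives the unperturbed baseline, and swapping in `reverse_panels`, `shuffle_panels`, `strip_text`, or a `drop_panel` wrapper gives the corresponding ablation. The toy solver fails completely under reversal because its steps do not commute, which is the property the ordering ablation is designed to expose.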

    This suggests that comics function as an explicit reasoning prior, shaping inference trajectories through narrative style, character mapping, and structured dramatization.

    5. Educational and Design Implications

    Comics as tools for cognitive modeling and education have been shown to facilitate externalization, negotiation, and gradual formalization of complex concepts:

    • Pre-Instruction Scaffolding: Open-ended, comic-based storytelling prompts elicit intuitive, story-driven accounts of physical phenomena, enabling learners to build on prior knowledge and intuition. These narratives can later be mapped onto canonical scientific models and formal algebraic expressions (Cao, 2017).
    • Metacognitive Engagement: Requiring annotation and peer critique of comics fosters reflection and refinement of mental models.
    • Design of AI/ML Systems: The wide machine-human gap on COMICS cloze tasks shows that human-like closure remains difficult for current models (Iyyer et al., 2016). This suggests that future MLLMs could integrate structured visual storytelling as an intermediate representation, leveraging the high information-efficiency of comics to support robust, explainable multimodal reasoning.

    6. Limitations, Challenges, and Future Directions

    While Thinking with Comics achieves strong results, several challenges remain:

    • Controllability: Finer-grained control over panel layouts, adaptive panel counts, and robust style transfer remains an open area.
    • Faithfulness and Grounding: Ensuring generated comics accurately reflect the underlying logic or dataset semantics requires further research in both automated metric development and human evaluation.
    • Generalization: Expanding TwC frameworks to model comics from diverse cultural and narrative traditions will test the universality of current methods (Chen et al., 2 Feb 2026).
    • Hybrid Integration: Combining TwC with text-only and video-only CoT reasoning may yield new hybrid, cross-modal frameworks for even more robust reasoning in future systems.

    A plausible implication is that the structured visual-narrative format of comics will increasingly serve as a standard intermediate representation for multimodal AI, education, and reasoning-driven data exploration.

    7. Summary Table: Key Experimental Benchmarks

    The following table summarizes TwC experimental results relative to image and video-based approaches (Chen et al., 2 Feb 2026):

    | Task | TwC (Img+Txt) | Video | TWI |
    |---|---|---|---|
    | MATH500 | 92.3% | 67% | 70.2% |
    | GSM8K | 95.4% | 75.7% | 69.4% |
    | MathVista | 85.8% | 67.6% | 63.6% |
    | DocVQA | 99.4% | 50.5% | 67.5% |
    | CulturalBench-E | 88.3% | 60% | ~70% |

    This illustrates that TwC consistently outperforms Thinking with Images on multi-step temporal and causal reasoning, and rivals or surpasses video-based approaches at a fraction of the computational and generative cost, confirming the distinct value of comics as cognitive tools and computational substrates.
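The margins behind this comparison can be computed directly from the table. The snippet below is a small convenience sketch: the dictionary layout and the `margins` helper are illustrative, the figures are copied from the table above, and the CulturalBench-E value for TWI is entered as 70.0 even though the source reports it only approximately (~70%).

```python
from typing import Dict

# Accuracy in percent, copied from the summary table.
results: Dict[str, Dict[str, float]] = {
    "MATH500":         {"TwC": 92.3, "Video": 67.0, "TWI": 70.2},
    "GSM8K":           {"TwC": 95.4, "Video": 75.7, "TWI": 69.4},
    "MathVista":       {"TwC": 85.8, "Video": 67.6, "TWI": 63.6},
    "DocVQA":          {"TwC": 99.4, "Video": 50.5, "TWI": 67.5},
    "CulturalBench-E": {"TwC": 88.3, "Video": 60.0, "TWI": 70.0},  # TWI is approximate (~70%)
}


def margins(table: Dict[str, Dict[str, float]], baseline: str) -> Dict[str, float]:
    """TwC accuracy minus the baseline's accuracy, in percentage points, per task."""
    return {task: round(row["TwC"] - row[baseline], 1) for task, row in table.items()}


if __name__ == "__main__":
    for baseline in ("Video", "TWI"):
        print(baseline, margins(results, baseline))
```

On these figures every margin is positive for both baselines, with the largest gap over video on DocVQA (48.9 percentage points).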
