OneThinker: All-in-one Reasoning Model for Image and Video (2512.03043v2)

Published 2 Dec 2025 in cs.CV

Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal LLMs (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.

Summary

  • The paper introduces OneThinker, a unified framework integrating image and video reasoning tasks using joint CoT-annotated training and EMA-GRPO optimization.
  • The methodology features the large-scale OneThinker-600k dataset together with a reward-normalization scheme that addresses intra- and inter-task imbalance in multi-task RL.
  • Experimental results on 31 benchmarks show significant improvements, demonstrating enhanced cross-task transfer and preliminary zero-shot generalization.

OneThinker: A Unified Framework for Visual Reasoning Across Image and Video Modalities

Motivation and Context

OneThinker addresses a significant limitation of prior Multimodal LLMs (MLLMs): the fragmentation of reasoning capability across heterogeneous tasks and modalities. Existing approaches generally isolate training for image and video domains, yielding models that specialize in narrow task pools and suffer from restricted generalization and knowledge transfer. OneThinker aims to realize a scalable multimodal reasoning generalist by unifying both domain and task variety, supporting question answering, captioning, spatial and temporal grounding, tracking, and segmentation under a single architecture.

Dataset Construction and Model Initialization

The OneThinker-600k dataset is curated to establish comprehensive coverage across image/video inputs and diverse reasoning tasks, extracting multimodal samples with balanced representation from high-quality public datasets. Task-specific data curation ensures diversity in spatial, temporal, logical, and causal inference. To facilitate SFT initialization, OneThinker employs chain-of-thought (CoT) annotation using a strong proprietary model (Seed1.5-VL), producing the OneThinker-SFT-340k subset for cold-start supervised training.

All tasks are formatted in a unified text-based interface with explicit separation between reasoning (<think>...</think>) and answer (<answer>...</answer>) blocks. This lets the model disentangle its internal inference structure from the task-specific response, and it makes reward computation more reliable via schema-based evaluation, which is crucial for perception tasks that produce structured outputs (e.g., bounding boxes, time windows, points).
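
To make the schema-based checking concrete, here is a minimal sketch of how such a reply could be split into reasoning and answer blocks and validated before scoring. This is an illustration based on the description above, not the authors' released code; the JSON field names (e.g., "start", "end") and helper names are assumptions.

```python
import json
import re

# Illustrative sketch: separate the reasoning and answer blocks of a reply,
# then validate the answer against a simple task schema. The field names
# "start"/"end" are assumed for illustration, not the paper's exact schema.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_reply(reply: str):
    """Return (reasoning, answer_text), or None if the required format is violated."""
    think = THINK_RE.search(reply)
    answer = ANSWER_RE.search(reply)
    if think is None or answer is None:
        return None  # a format reward would be 0 in this case
    return think.group(1).strip(), answer.group(1).strip()

def validate_temporal_answer(answer_text: str):
    """Check a temporal-grounding answer of the assumed form {"start": s, "end": e}."""
    try:
        obj = json.loads(answer_text)
        start, end = float(obj["start"]), float(obj["end"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return (start, end) if end >= start else None

reply = '<think>The splash happens near the end.</think><answer>{"start": 41.0, "end": 47.5}</answer>'
parsed = parse_reply(reply)
if parsed:
    print(validate_temporal_answer(parsed[1]))  # (41.0, 47.5)
```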

Method: EMA-GRPO for Multi-Task RL

RL post-training is central to eliciting advanced reasoning in MLLMs, but prior RL-based approaches encounter sample- and task-level reward imbalance in multi-task setups. Standard Group Relative Policy Optimization (GRPO) relies on sample-level normalization, which under-optimizes medium-difficulty rollouts and biases task weighting due to differing reward densities and magnitudes. Dr.GRPO resolves the intra-task imbalance but induces inter-task suppression: sparse-reward tasks dominate, while dense-reward perception tasks are marginalized.

To resolve these issues, OneThinker introduces Exponential Moving Average Group Relative Policy Optimization (EMA-GRPO). At each RL step, EMA-GRPO maintains task-wise moving averages of reward statistics and uses the per-task standard deviation (rather than sample- or batch-level statistics) for normalization. This yields stable, adaptive normalization aligned with each task's reward dynamics, mitigating both intra- and inter-task imbalance during policy optimization. The EMA-based advantage estimator thus relies on per-task statistics, stabilizing training across heterogeneous tasks and improving cross-task fairness and convergence in multi-task, multimodal RL for unified models.
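
The sketch below illustrates this idea under stated assumptions (it is not the released implementation): each task keeps an exponential moving average of its reward standard deviation, and group-relative advantages are divided by that task-level statistic instead of the per-group standard deviation used in standard GRPO. The decay value, the choice to track the std directly rather than raw moments, and all names are assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

# Illustrative EMA-GRPO-style advantage normalization (assumed details: decay=0.9,
# tracking the per-group reward std directly rather than first/second moments).
class TaskRewardStats:
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.ema_std = defaultdict(lambda: None)  # task name -> EMA of reward std

    def update(self, task: str, rewards: np.ndarray) -> float:
        """Update the task-wise EMA of the reward std and return the new value."""
        std = float(rewards.std())
        prev = self.ema_std[task]
        self.ema_std[task] = std if prev is None else self.decay * prev + (1 - self.decay) * std
        return self.ema_std[task]

def ema_grpo_advantages(stats: TaskRewardStats, task: str, rewards: np.ndarray, eps: float = 1e-6):
    """Group-relative advantages normalized by the task-wise EMA std,
    instead of the per-group std used by standard GRPO."""
    task_std = stats.update(task, rewards)
    return (rewards - rewards.mean()) / (task_std + eps)

stats = TaskRewardStats()
# A dense-reward grounding group and a sparse-reward math group get comparable scales:
print(ema_grpo_advantages(stats, "grounding", np.array([0.62, 0.71, 0.58, 0.66])))
print(ema_grpo_advantages(stats, "math", np.array([0.0, 1.0, 0.0, 0.0])))
```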

Experimental Results

OneThinker-8B is evaluated across 31 benchmarks spanning 10 foundational visual reasoning tasks in both images and videos. Strong numerical gains are observed over prior state-of-the-art models and baselines (e.g., Qwen3-VL-Instruct-8B):

  • Image Question Answering: 70.6% on MMMU, 77.6% on MathVista, 64.3% on MathVerse.
  • Video QA: 66.2% on VideoMMMU, 70.5% on MMVU(mc), 79.2% on LongVideo-Reason, 35.0% on VideoMathQA.
  • Captioning: 57.9 on MMT-Caption, 28.0 on VideoMMLU-Caption.
  • Spatial Grounding (RefCOCO testA/testB/val): 93.7/88.9/92.0, substantial improvement over Perception-R1 and Qwen3-VL-Instruct-8B.
  • Temporal Grounding (ActivityNet): 65.0 R@0.3, 43.6 R@0.5, 25.7 R@0.7.
  • Tracking (GOT-10k): 73.0 AO with recall up to 93.9 across IoU thresholds, outperforming specialized models with larger frame coverage.
  • Segmentation: 75.8 / 67.1 / 70.8 cIoU (image), 54.9 J&F on ReasonVOS (video), best mean performance across tasks.

Ablation studies confirm the distinct contributions of RL post-training and the normalization method: EMA-GRPO outperforms pure SFT, standard GRPO, and Dr.GRPO variants on all task groups. Furthermore, removing individual task data (spatial grounding, temporal grounding, image QA) causes conspicuous performance degradation on both the corresponding and related transfer domains, substantiating the knowledge-sharing benefits of unified training.

Preliminary zero-shot generalization is evidenced on MMT-Bench, where OneThinker transfers to previously unseen categories (e.g., point tracking, image quality assessment, rotated object detection), surpassing strong baselines.

Theoretical and Practical Implications

The unified architecture and RL framework proposed by OneThinker substantiate the viability of scalable multimodal reasoning generalists; this addresses a critical obstacle to AGI-level visual understanding, namely, compositional generalization across modalities and task categories. EMA-GRPO exemplifies an effective solution for multi-task RL imbalance, yielding improved stability and fairness in cross-domain learning. The strong experimental results—consistent improvements across perception, reasoning, and descriptive tasks—demonstrate that knowledge and reasoning representations can be effectively fused and transferred in joint training pipelines.

Practically, this model paradigm propels the deployment of multimodal LLMs for open-world scenarios where task and modality boundaries are fluid. Robust knowledge transfer and preliminary zero-shot performance foreshadow further advances in model adaptation for real-world, multi-domain applications (e.g., video analytics, autonomous systems, scientific visual reasoning).

From a theoretical standpoint, OneThinker provides evidence for the hypothesis that rich multi-task, multi-modal CoT annotations combined with balanced RL optimization facilitate the emergence of generalist reasoning modules in VL-LLMs. Future work may extend EMA-GRPO-like normalization to even more complex reward and environment heterogeneity, optimize sample efficiency in multi-modal RL, and explore richer feedback modalities (e.g., natural language critique, human-in-the-loop reinforcement).

Conclusion

OneThinker proposes a unified multimodal visual reasoning model capable of handling a broad spectrum of image and video understanding tasks. Through large-scale CoT-annotated datasets, a joint reasoning-formatting schema, and the introduction of EMA-GRPO for multi-task RL stability, OneThinker achieves strong performance and effective cross-task transfer. The framework marks a substantial advance in unified visual-linguistic reasoning capability and sets a new direction towards scalable, generalist multimodal LLMs with improved practical utility and theoretical elegance (2512.03043).

Explain it Like I'm 14

Overview

This paper introduces OneThinker, a single AI model that can “think” and solve many different vision tasks for both images and videos. Instead of building separate models for each job (like answering questions, describing scenes, finding objects, tracking movement, or segmenting objects), OneThinker learns to handle them all in one place. It uses a special training method to make its reasoning fair and balanced across many task types.

Key Objectives and Questions

The paper aims to answer:

  • Can we build one model that understands both images and videos and handles many core tasks (question answering, captioning, grounding, tracking, segmentation)?
  • How can we train such a model so it learns good step-by-step reasoning across very different tasks?
  • How do we prevent training from favoring some tasks over others, especially when their “scores” (rewards) look very different?

Methods and Approach

The authors use a mix of ideas to build and teach OneThinker:

1) A Big, Diverse Training Set

  • They created a large dataset called OneThinker-600k, with about 600,000 examples covering many tasks:
    • Question answering (multiple choice, math, OCR)
    • Captioning (describing images/videos)
    • Grounding (finding where in space or when in time something happens)
    • Tracking (following an object through video frames)
    • Segmentation (separating an object from the background)
  • For a strong starting point, they used a powerful model to write chain-of-thought (CoT) explanations, forming OneThinker-SFT-340k (a 340,000-example subset) to help the model learn to “think out loud.”

2) A Unified Interface for Thinking and Answers

  • Every task follows the same pattern:
    • The model writes its step-by-step reasoning inside "<think>...</think>".
    • Then it gives the final result inside "<answer>...</answer>".
  • For tasks that need structured outputs (like bounding boxes, time spans, points, etc.), the "<answer>" follows a clear format (like a JSON recipe). This makes grading automatic and consistent.

3) Rewards: How the Model Gets Feedback

  • The model earns rewards for:
    • Accuracy: Did it get the answer right? Did its predicted box overlap well with the true box? Did it pick the right time? Did the caption match the reference?
    • Format: Did it follow the required output structure so it can be checked properly?
  • Different tasks use different “accuracy” checks (a small code sketch of two such checks follows this list). For example:
    • QA uses exact matches or similarity scores.
    • Grounding uses overlap (IoU) between predicted and true boxes/time ranges.
    • Tracking averages overlap across frames.
    • Segmentation uses boxes and key points to guide a segmenter, then scores how close the prediction is.
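
As a rough illustration of how such checks can be scored automatically, the sketch below computes a temporal-IoU reward for a predicted time span and a Gaussian-kernel reward for a predicted point, in the spirit of the rules described above. The sigma value, coordinate convention, and function names are assumptions, not the paper's exact reward definitions.

```python
import math

def temporal_iou_reward(pred, gt):
    """Overlap between a predicted (start, end) span and the ground-truth span."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def gaussian_point_reward(pred_xy, gt_xy, sigma=0.1):
    """Smooth reward that decays with distance between predicted and true points.
    sigma=0.1 in normalized image coordinates is an illustrative choice."""
    d2 = (pred_xy[0] - gt_xy[0]) ** 2 + (pred_xy[1] - gt_xy[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

print(temporal_iou_reward((41.0, 47.5), (40.0, 48.0)))            # ~0.81
print(gaussian_point_reward((0.52, 0.40), (0.50, 0.42), sigma=0.1))  # ~0.96
```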

4) Training Strategy: SFT + RL with EMA-GRPO

Think of training like practicing a sport:

  • First, SFT (Supervised Fine-Tuning): The model learns from examples with solutions and CoT explanations—like watching coach-approved plays and copying them.
  • Then, RL (Reinforcement Learning): The model plays on its own and earns points (rewards) based on how well it did. Over time, it adjusts to win more points.

The key innovation is EMA-GRPO, a new twist on a popular RL method:

  • Problem: Different tasks have very different reward “styles.” Some have big, rare wins (sparse rewards), others have small, frequent scores (dense rewards). Standard methods can unintentionally favor certain tasks.
  • Solution: EMA-GRPO keeps a running estimate (exponential moving average) of how spread out the rewards are for each task. In everyday terms:
    • It tracks each task’s usual score range over time.
    • It uses those per-task stats to normalize the learning signals, so no task overwhelms the others.
  • This fixes two issues:
    • Intra-task imbalance: Within a single task, it avoids over-focusing on very easy or very hard samples.
    • Inter-task imbalance: Across tasks, it prevents “loud” tasks (with big reward swings) from drowning out “quiet” tasks.

In short, EMA-GRPO helps the model learn fairly from many tasks at once.

Main Findings and Why They Matter

OneThinker performs strongly on 31 benchmarks across 10 core tasks, often beating other open-source models of similar size. Here are a few highlights:

  • Image Question Answering:
    • Strong scores on tough benchmarks like MMMU and MathVerse, showing it can handle complex visual reasoning and math problems.
  • Video Question Answering:
    • Better results on tests like VideoMMMU, MMVU, and LongVideo-Reason, showing it understands moving scenes over time.
  • Captioning:
    • Clear improvements in both image and video descriptions, indicating good visual understanding and language skills.
  • Grounding (Where and When):
    • Spatial grounding: Top performance on RefCOCO family (finding objects in images based on text).
    • Temporal grounding: Improved scores on datasets like ActivityNet, showing it can find when events happen in videos.
    • Spatio-temporal grounding: Big gains on STVG, meaning it can jointly locate things in space and time.
  • Tracking:
    • Much better tracking on GOT-10k, even with more frames, proving it can follow objects over longer sequences.
  • Segmentation:
    • Best overall results on both image and video segmentation benchmarks, showing fine-grained visual understanding.

Why it matters:

  • It’s rare for one model to do well across so many different tasks and both images and videos. This suggests OneThinker learns general skills that transfer between tasks, making it more practical and powerful.
  • The balanced RL training (EMA-GRPO) is a key reason for its success and could help other multi-task models.

Implications and Potential Impact

  • Toward general-purpose vision AI: OneThinker shows that a single “thinking” model can handle many kinds of visual problems. This brings us closer to flexible, real-world systems (like smart assistants that can watch, understand, and explain video feeds, or tools that can analyze images for education, science, or safety).
  • Better training for multi-task models: EMA-GRPO is a useful idea for any system learning from different tasks with different scoring styles. It helps training stay fair and stable.
  • Knowledge transfer and generalization: Training across images and videos together encourages the model to reuse skills from one task to another. This can help it handle new situations without extra retraining.
  • Open resources: The authors release their code, models, and data, which means others can build on this work, test new ideas, and push unified multimodal reasoning even further.

In simple terms: OneThinker is a strong step toward AI that can look, think, and act across many visual tasks—making it more useful and reliable in the real world.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of concrete gaps and open questions that the paper leaves unresolved, intended to guide future research:

  • Reliance on proprietary annotators: The SFT CoT traces, segmentation points/boxes, and other annotations are generated by Seed1.5-VL; the impact of annotator bias, errors, and licensing constraints on downstream performance and reproducibility is not analyzed.
  • CoT faithfulness and utility: No human or automated evaluation of whether generated CoTs are factually grounded, non-spurious, and actually improve task performance versus shorter or no rationales.
  • Reward-model dependence: Open-ended QA and caption rewards depend on POLAR-7B; sensitivity to different reward models, calibration, reward hacking risks, and cross-reward consistency across tasks are not studied.
  • Reward heterogeneity beyond EMA: EMA-GRPO normalizes per-task reward scales, but comparisons to alternative normalization schemes (e.g., PopArt, per-task baselines, adaptive temperature/variance scaling) and hybrids are missing.
  • Lack of theoretical guarantees: Convergence properties and bias/variance trade-offs of EMA-GRPO (especially with drifting reward distributions and non-stationary task mixtures) are not theoretically analyzed.
  • Task-ID requirement: EMA-GRPO presumes known task identity; how to handle ambiguous, mixed, or compositional prompts without explicit task labels remains open.
  • Hyperparameter sensitivity: No ablations on EMA decay β, group size, advantage clipping bounds, or KL weight; robustness of EMA-GRPO to these choices is unknown.
  • Sampling strategy effects: The paper mentions image–video “balanced” sampling but does not quantify task-level mixture effects, curriculum strategies, or priority sampling impacts on performance and transfer.
  • Rollout filtering bias: Discarding fully correct/incorrect rollouts in RL is adopted without ablation; its effect on sample efficiency, exploration, and bias toward medium-difficulty cases is unclear.
  • Long-video constraints: Training caps videos at 128 frames; scalability to much longer sequences, memory–latency trade-offs, and performance beyond this cap are not evaluated.
  • Inference efficiency: There is no analysis of runtime, memory footprint, throughput, latency under different frame counts, or deployment constraints (edge devices, streaming).
  • Multi-turn and interactive use: The model is evaluated with single-turn prompts and greedy decoding; multi-turn dialog, interactive perception (e.g., iterative refinement), and decoding strategies (sampling, beam) are not assessed.
  • Structured-output robustness: While JSON schemas are required, robustness to minor format deviations, partial correctness, and error recovery strategies are not explored; format rewards may be brittle.
  • Segmentation reward mismatch: Video segmentation omits mask-based rewards due to SAM2 latency; the effect of this training–evaluation mismatch on generalization and fine-grained mask quality is unmeasured.
  • External tool dependency in segmentation: Performance hinges on SAM2 given box/point prompts; end-to-end segmentation without external tools and ablations with alternative segmentors are missing.
  • Fixed point counts: The choice of three positive/negative points is not justified; sensitivity to the number and placement of points and trade-offs with annotation cost remain unexplored.
  • Keyframe-only video segmentation: The approach uses a single keyframe for box/points; extending to multi-keyframe prompts, dynamic propagation strategies, and multi-object segmentation is left open.
  • Cross-task transfer analysis: Claims of knowledge sharing are qualitative; controlled studies quantifying positive/negative transfer, task interference, and catastrophic forgetting across task pairs are absent.
  • Out-of-distribution generalization: Robustness to domain shifts (e.g., different video FPS, occlusions, weather, medical/industrial domains), multilingual queries, and adversarial or noisy inputs is not evaluated.
  • Safety and bias: No analysis of societal biases, unsafe generations, privacy concerns, or fairness across demographics and content domains in the curated datasets or model outputs.
  • Data contamination risks: Given the broad use of public datasets, the potential for benchmark leakage and its influence on reported results is not investigated.
  • Scale and architecture: Only an 8B backbone is tested; scaling laws, benefits of larger backbones or MoE architectures, and compute–performance trade-offs are not studied.
  • Modality extensions: Integration with audio, depth/3D, multi-view inputs, or sensor fusion is not considered; extending the unified interface to such modalities is open.
  • Task coverage gaps: Although broad, the suite omits important tasks like detection with dense outputs, multi-label classification, 3D grounding, and multi-object tracking/segmentation under occlusion and re-identification.
  • Evaluation comparability: Some baselines are reproduced or differ in frame budgets/decoding; a standardized evaluation protocol (same frames, decoding, tool versions) to ensure apples-to-apples comparisons is lacking.
  • License and release clarity: While the paper claims releases, clarity on redistribution rights for Seed1.5-VL-derived annotations, and whether full CoT annotations can be legally shared, is missing.
  • Curriculum and training dynamics: No exploration of curricula (by difficulty, reward sparsity, or modality), or analyses of training dynamics (learning curves per task, stability indicators) under EMA-GRPO.
  • Failure mode taxonomy: The paper lacks a qualitative error analysis (e.g., temporal confusion, object identity switches, counting errors), which would guide targeted improvements.
  • Unified schema design: The generality, extensibility, and compositionality of the JSON schemas (e.g., nested events, relations, attributes) for new tasks are not evaluated.
  • Continual and lifelong learning: How to add new tasks or modalities without retraining from scratch, retain prior skills, and manage evolving reward scales in EMA-GRPO is not addressed.

Glossary

  • AdamW: An optimization algorithm that decouples weight decay from the gradient-based update of Adam to improve generalization. "both optimized with AdamW."
  • Advantage (RL): A baseline-adjusted reward signal indicating how much better a sampled action performed compared to the average, used to weight policy updates. "Then, the advantage in task τ is computed with its task-wise EMA standard deviation:"
  • AGI (Artificial General Intelligence): The goal of creating AI with broad, human-level cognitive abilities across tasks and domains. "toward artificial general intelligence (AGI), enabling them to perform step-by-step inference"
  • AO (Average Overlap): A tracking metric computing the average IoU between predicted and ground-truth boxes over frames. "reaches a high 73.0 AO" on GOT-10k, alongside 93.9 / 84.4 / 68.8 recall at increasingly strict IoU thresholds.
  • Chain-of-thought (CoT): Explicit intermediate reasoning steps produced by a model to guide and explain its final answer. "to annotate and filter high-quality chain-of-thought (CoT) data"
  • cIoU: An IoU-based metric for referring segmentation evaluating overlap between the predicted and ground-truth masks conditioned on a referring expression. "it obtains 75.8 / 67.1 / 70.8 cIoU on the val set"
  • Dr.GRPO: A GRPO variant that removes standard deviation normalization to reduce intra-task bias, but can cause cross-task imbalance. "adopts the Dr.GRPO \cite{liu2025understanding} algorithm for RL training."
  • EMA (Exponential Moving Average): A running average technique that emphasizes recent observations, used here to track reward statistics per task. "based on the exponential moving average (EMA) of reward statistics."
  • EMA-GRPO: A GRPO variant that normalizes rewards using task-wise EMA of reward standard deviations to balance intra- and inter-task learning. "we propose EMA-GRPO to handle reward heterogeneity in multi-task RL"
  • F-score (boundary F): A contour-based segmentation metric measuring boundary quality via the harmonic mean of precision and recall. "it reaches 48.8 J, 56.7 F, and 52.7 J&F on MeViS"
  • Gaussian kernel: A function of the form exp(−d²/(2σ²)) used to transform distances into smooth similarity rewards. "We define a Gaussian kernel"
  • Greedy decoding: A decoding strategy that selects the highest-probability token at each step without exploration. "We evaluate models using greedy decoding, following prior works \cite{wang2025vl,feng2025video,xiao2025proxythinker}."
  • GRPO (Group Relative Policy Optimization): A reinforcement learning algorithm that optimizes sequence-level rewards using group-based baselines and normalization. "Group Relative Policy Optimization (GRPO) algorithm"
  • Inter-task imbalance: A training pathology where tasks with larger or sparser rewards dominate optimization, suppressing others. "causes inter-task imbalance, where sparse-reward tasks (e.g, math) dominate while dense ones (e.g, detection) are suppressed."
  • Intra-task imbalance: Biased weighting among samples within the same task due to variance-based normalization. "Standard GRPO suffers from intra-task imbalance because its sample-wise standard deviation (std) normalization favors low-variance rollouts"
  • J (Jaccard index): The intersection-over-union metric used in video segmentation to assess region overlap. "it reaches 48.8 J, 56.7 F, and 52.7 J&F on MeViS"
  • KL regularization coefficient: A scalar controlling the strength of the KL divergence penalty that keeps the learned policy close to a reference policy. "the KL regularization coefficient $\beta_{\mathrm{KL}}$ is fixed at 0.01."
  • Mean Relative Accuracy (MRA): A regression metric that measures accuracy based on relative closeness across multiple tolerance levels. "Mean Relative Accuracy (MRA) metric \cite{yang2025thinking}"
  • Mixture-of-Experts (MoE) models: Architectures that route inputs to a subset of specialized expert networks to improve capacity and efficiency. "improving training stability for large-scale Mixture-of-Experts models."
  • MLLMs (Multimodal LLMs): LLMs that process and reason over multiple modalities such as text, images, and videos. "Multimodal LLMs (MLLMs)"
  • mIoU (mean Intersection-over-Union): The average IoU across instances or frames, used as an aggregate localization/segmentation metric. "achieves the best mIoU (43.2) among listed models."
  • OCR (Optical Character Recognition): The task of extracting text from images or video frames. "OCR tasks use the Word Error Rate to compute the reward."
  • Open-ended QA: Question answering where the response is free-form text rather than a constrained choice. "For open-ended question answering and captioning tasks, we employ an external reward model to provide a similarity score:"
  • R@x (Recall at IoU threshold x): The percentage of cases where the IoU exceeds a specified threshold x; see the GOT-10k recall values listed under AO above.
  • Reinforcement learning (RL): A learning paradigm where an agent optimizes behavior via rewards from interactions or outcome assessments. "Reinforcement learning (RL) has recently achieved remarkable success"
  • Reward heterogeneity: Variation in reward scales and sparsity across tasks that complicates multi-task optimization. "to handle reward heterogeneity in multi-task RL"
  • Reward model (RM): An external model that scores the quality of generated answers to provide a training signal. "we adopt POLAR-7B \cite{dou2025pre} as the reward model RM."
  • Rollout: A sampled sequence/output generated by the policy for a given input during RL training. "We discard rollouts that are entirely correct or incorrect during RL training"
  • SAM2: A segmentation model used to convert bounding boxes and points into final segmentation masks. "subsequently fed into SAM2 \cite{ravi2024sam} to generate the final segmentation mask."
  • SFT (Supervised Fine-Tuning): Post-pretraining supervised training on curated instruction/CoT data to initialize model reasoning. "In the SFT stage, we adopt Qwen-3-VL-Instruct-8B \cite{qwen3vl} as the base model"
  • sIoU (spatial Intersection-over-Union): IoU computed over spatial regions (bounding boxes) in images or frames. "Spatial grounding requires the model to localize a target region by predicting a bounding box. The accuracy is measured using spatial intersection-over-union (sIoU)"
  • Spatial grounding: Localizing a referenced object/region in an image by predicting its bounding box. "Spatial grounding requires the model to localize a target region by predicting a bounding box."
  • Spatial-temporal grounding: Jointly localizing when and where an event happens by predicting time spans and per-frame boxes. "This task unifies temporal and spatial localization, requiring the model to predict both the temporal span of an event and the corresponding bounding boxes across frames."
  • Temporal grounding: Identifying the start and end times of an event in a video. "Temporal grounding requires the model to identify the start and end time of the queried event in a video."
  • tIoU (temporal Intersection-over-Union): IoU computed over time intervals denoting event segments in videos. "we measure accuracy using temporal IoU:"
  • Tracking: Predicting a sequence of bounding boxes for a target across video frames. "Tracking requires the model to predict a sequence of bounding boxes for a given target across video frames."
  • Zero-shot generalization: The ability to perform well on tasks or domains not seen during training. "preliminary zero-shot generalization ability"

Practical Applications

Practical, real-world applications of OneThinker and EMA-GRPO

The following lists translate the paper’s findings, methods, and innovations into concrete applications. Each item notes sectors, potential tools/products/workflows, and key assumptions or dependencies that may affect feasibility.

Immediate Applications

These can be piloted or deployed now, given the released model, data, and training recipe.

  • Visual reasoning API for media, security, and retail
    • Sectors: media production, security/CCTV, retail operations
    • What it does: unified captioning, question answering, temporal/spatial grounding, and object tracking for images and videos (e.g., “find when the package enters the scene,” “track the pallet across frames,” “caption this clip”).
    • Tools/Workflows: OneThinker inference server; REST/SDK integration; JSON-schema outputs for programmatic pipelines; dashboards with tIoU/sIoU metrics for QA.
    • Assumptions/Dependencies: access to OneThinker weights/API; 128-frame cap for throughput; domain adaptation may be needed for unusual camera angles or low-light feeds.
  • Video search and retrieval in archives
    • Sectors: media libraries, education, enterprise knowledge management
    • What it does: index long-form video with temporal segments and spatial boxes; enable “jump-to-event” retrieval at scale.
    • Tools/Workflows: indexing service that stores <answer> outputs (time spans, boxes) alongside embeddings; timeline UIs for editors and archivists.
    • Assumptions/Dependencies: compute for batch processing; storage and metadata schema; content rights and privacy considerations.
  • E-commerce product imagery pipeline
    • Sectors: e-commerce, digital asset management (DAM)
    • What it does: background removal and product segmentation; attribute grounding (e.g., “localize the zipper”); QA on product images and diagrams.
    • Tools/Workflows: OneThinker + SAM2 segmentation with predicted boxes/points; automated product-tagging; A/B testing in PDP flows.
    • Assumptions/Dependencies: integration with SAM2; consistent studio photography; legal compliance for synthetic edits.
  • Sports analytics and highlight generation
    • Sectors: sports tech, broadcasting
    • What it does: player/ball tracking, event localization (temporal grounding), automatic highlight creation.
    • Tools/Workflows: event-detection pipeline using GOT-10k-like tracking outputs; editor assist for fast turnaround.
    • Assumptions/Dependencies: domain fine-tuning for specific leagues/camera setups; latency constraints for near-real-time use.
  • Customer support on tutorial/How-to content
    • Sectors: consumer electronics, SaaS support, education
    • What it does: answer questions about videos (“where does the troubleshooting start?”); extract steps via captioning + temporal grounding.
    • Tools/Workflows: chat assistant over help videos; timeline navigation with auto-extracted segments and captions.
    • Assumptions/Dependencies: curated support corpus; guardrails against hallucination; evaluation of accuracy on proprietary tutorials.
  • Accessibility enhancements (alt-text and diagram understanding)
    • Sectors: public sector, education, publishing
    • What it does: generate robust alt-text/captions; answer queries about diagrams and charts; OCR-style tasks via rule-based QA rewards.
    • Tools/Workflows: browser extensions; CMS plug-ins to auto-populate accessibility fields; assistive Q&A overlays.
    • Assumptions/Dependencies: compliance with accessibility standards; consistent performance on non-English or specialized scientific diagrams.
  • Data annotation acceleration with structured schemas
    • Sectors: ML ops, dataset curation, labeling services
    • What it does: schema-validated outputs (boxes, points, time spans) enable auto-checking and reward computation; human-in-the-loop post-edit.
    • Tools/Workflows: “Schema-validated labeling studio” where OneThinker proposes annotations and humans correct; advantage-aware sampling for active learning.
    • Assumptions/Dependencies: label policy alignment; QA processes; potential domain fine-tuning.
  • Balanced multi-task RL training in labs
    • Sectors: academia, AI labs, platform ML teams
    • What it does: adopt EMA-GRPO to stabilize heterogeneous multi-task RL—reduces intra-task and inter-task reward imbalances.
    • Tools/Workflows: training recipe replacement (EMA-GRPO module) in existing GRPO pipelines; task-wise EMA stats registry.
    • Assumptions/Dependencies: access to training code and GPUs (e.g., 32×H800 for ~10 days in the paper); proper hyperparameter selection (group size, β, β_KL).
  • Compliance and safety auditing for visual content
    • Sectors: advertising, UGC platforms, brand safety
    • What it does: detect presence and timing of restricted objects/events; localize sensitive regions for redaction.
    • Tools/Workflows: validation service using spatial/temporal grounding outputs; automatic redaction via segmentation masks (SAM2).
    • Assumptions/Dependencies: policy definitions; false-positive/negative handling; scalability to large video volumes.
  • Manufacturing/industrial inspection assist
    • Sectors: manufacturing, logistics, warehousing
    • What it does: localize defects or misplacements; track parts across stations; segment regions of interest for QC.
    • Tools/Workflows: line-side camera analytics; operator dashboards with track/segment overlays; alarms on deviation.
    • Assumptions/Dependencies: domain-specific calibration; robust performance under motion blur, glare, or dust.
  • Classroom and LMS plug-ins for multimodal tutoring
    • Sectors: K–12, higher ed, online learning
    • What it does: solve and explain math/diagram problems; answer lab-image questions; caption educational videos.
    • Tools/Workflows: LMS widgets; formative assessment tools that use rule-based QA rewards to grade responses.
    • Assumptions/Dependencies: pedagogy alignment; content moderation and correctness guarantees; adaptation for curriculum standards.
  • Newsroom content summarization and clip-markup
    • Sectors: journalism, digital media
    • What it does: caption, summarize, and mark key segments/objects for quick editing and packaging.
    • Tools/Workflows: newsroom CMS integration; timeline with scene segments and object tracks; promptable “find the moment” tools.
    • Assumptions/Dependencies: editorial oversight; handling noisy user-generated content; privacy of subjects.

Long-Term Applications

These require further research, scaling, domain adaptation, or regulatory clearance.

  • Generalist robotic perception and instruction-following
    • Sectors: robotics, manufacturing, service robots
    • What it could do: unify grounding, tracking, segmentation to support task planning (“pick the red cup on the left between 3–7s video state”).
    • Tools/Workflows: closed-loop vision → planning → control with schema-conformant outputs; simulator-to-real transfer.
    • Assumptions/Dependencies: robust real-time performance; safety and reliability; domain-specific datasets and sim benchmarks.
  • Autonomous driving event understanding
    • Sectors: automotive, ADAS
    • What it could do: spatio-temporal reasoning for maneuvers and incidents; assistive scene understanding for driver monitoring or dashcam analysis.
    • Tools/Workflows: temporal grounding to mark risky events; object tracking across multi-camera streams.
    • Assumptions/Dependencies: stringent safety validation; specialized sensors; regulatory approvals.
  • Clinical video/image applications (non-diagnostic to diagnostic)
    • Sectors: healthcare, medical imaging, surgical robotics
    • What it could do: instrument tracking, temporal localization of procedure steps, segmentation of anatomical structures.
    • Tools/Workflows: OR-assist UIs; training simulators; audit trails using schema outputs.
    • Assumptions/Dependencies: regulatory clearance (FDA/CE); medical-grade training data; bias and robustness audits.
  • Real-time edge analytics at scale
    • Sectors: security, retail, smart cities
    • What it could do: deploy compressed models for on-camera analytics (tracking/grounding/segmentation) with low latency.
    • Tools/Workflows: model distillation/pruning; streaming inference frameworks; hardware accelerators.
    • Assumptions/Dependencies: edge hardware constraints; model compression research; on-device privacy safeguards.
  • Intelligent video editing assistants
    • Sectors: post-production, creator economy
    • What it could do: automatic rough-cuts, B-roll retrieval, timeline annotations via temporal+spatial grounding.
    • Tools/Workflows: NLE plug-ins; “reasoning timeline” overlays; prompt-based cut suggestions.
    • Assumptions/Dependencies: creator acceptance; integration with editorial tools; handling creative intent vs. literal detection.
  • Multimodal research assistants for STEM
    • Sectors: R&D, academia
    • What it could do: robust reasoning over figures, plots, lab videos; cross-task transfer to scientific workflows.
    • Tools/Workflows: domain-adapted reward models (beyond POLAR-7B); specialized corpora for diagrams/math/video experiments.
    • Assumptions/Dependencies: high-quality scientific datasets; correctness guarantees; minimizing hallucinations.
  • Continuous self-training engines with structured rewards
    • Sectors: platform AI, foundation model providers
    • What it could do: closed-loop data flywheel using schema-validated outputs and EMA-GRPO; multi-task reinforcement fine-tuning at scale.
    • Tools/Workflows: automated rollouts with verifiable rewards; task-wise EMA stats; curriculum mixing across modalities.
    • Assumptions/Dependencies: reliable reward models; cost and governance of large-scale RL; mitigation of reward hacking.
  • Standards and policy frameworks for schema-based visual outputs and CoT transparency
    • Sectors: public policy, standards bodies, enterprise governance
    • What it could do: define interoperable JSON schemas for visual tasks; guidelines for chain-of-thought usage and privacy.
    • Tools/Workflows: conformance test suites; audit logs of <think>/<answer> blocks; procurement checklists.
    • Assumptions/Dependencies: multi-stakeholder alignment; data protection laws; intellectual property around teacher models.
  • Education assessment from student lab videos
    • Sectors: education technology
    • What it could do: automatically detect procedural steps and correctness in lab activities; provide targeted feedback.
    • Tools/Workflows: rubric-to-reward mapping; dashboards for instructors; assistive captions for accessibility.
    • Assumptions/Dependencies: consent and privacy; domain tuning for specific lab contexts; fairness across demographics.
  • Enterprise knowledge graphs enriched with visual nodes
    • Sectors: enterprise search, BI
    • What it could do: add time-stamped visual facts (events, objects, regions) to knowledge graphs for better search and analytics.
    • Tools/Workflows: ETL pipelines from OneThinker outputs; query interfaces that bridge text and video/image entities.
    • Assumptions/Dependencies: scalable storage; schema design; access control and compliance.
  • Infrastructure inspection (energy, utilities, transportation)
    • Sectors: energy, utilities, transportation, agriculture
    • What it could do: drone or roadside camera analytics for defect detection and segmentation of panels, lines, rails, crops.
    • Tools/Workflows: flight-to-insight pipelines; alerting systems; periodic trend reports.
    • Assumptions/Dependencies: domain-specific training; environmental robustness; regulatory permissions.

Notes on cross-cutting assumptions and dependencies

  • Training/data dependencies: OneThinker-SFT-340k leverages CoT from the proprietary Seed1.5-VL model; the reward model POLAR-7B is used for open-ended tasks; segmentation integrates SAM2. Replacing these components will affect performance and feasibility.
  • Compute and scale: The reported setup uses 32×H800 GPUs for ~10 days; downstream users may need smaller variants, distillation, or cloud inference.
  • Output schemas: Many workflows depend on strict JSON schemas (<answer> with boxes, points, time spans). Adherence enables automatic grading and reward computation but requires careful integration.
  • Domain shift and robustness: Benchmarks cover general tasks; specialized domains (medical, automotive, industrial) will need domain adaptation, calibration, and rigorous validation.
  • Legal, privacy, and safety: Video analytics and CoT storage raise privacy and compliance issues; policy frameworks are required for transparent, responsible use.
  • Licensing and IP: Constraints may apply to teacher-model annotations, datasets, and third-party components (e.g., SAM2).

Open Problems

We found no open problems mentioned in this paper.
