ManiFlow-110k: Robotics and Dialogue Data

Updated 7 April 2026

ManiFlow-110k (Robotics) is a large-scale dataset of 110K video clips with automatically generated 3D optical flow annotations, intended for robotic manipulation learning and cross-embodiment policy transfer.
ManiFlow-110k (Dialogue) comprises 118,540 human-annotated post–response pairs for evaluating profile consistency in dialogue systems.
Both datasets come with experimental benchmarks and documented data protocols for research in 3D action planning and dialogue consistency; the robotics set is labeled automatically and the dialogue set manually.

ManiFlow-110k is the designation for two distinct large-scale datasets developed independently in (1) 3D robotic manipulation learning and (2) profile consistency identification for open-domain dialogue agents. Each is notable for its scale and annotation granularity within its domain. This entry distinguishes them as ManiFlow-110k (Robotics) (Zhi et al., 6 Jun 2025) and ManiFlow-110k (Dialogue) (Song et al., 2020) and describes their composition, methodology, and research context.

1. Purpose and Scope

ManiFlow-110k (Robotics)

Designed to address the lack of embodiment-agnostic, large-scale datasets for robot manipulation, ManiFlow-110k (Robotics) provides 110,000 short video clips annotated with high-fidelity 3D optical flow fields capturing object motions during manipulation. The dataset underpins research into 3D flow-conditioned world models, cross-embodiment policy transfer, and flow-guided action planning for both robotic and human agents (Zhi et al., 6 Jun 2025).

ManiFlow-110k (Dialogue)

ManiFlow-110k (Dialogue), known in its source paper as the KvPI dataset, targets explicit profile consistency identification in dialogue generation. It contains 118,540 single-turn post–response pairs annotated against key-value user profiles (gender, location, constellation) and supports modeling and evaluating whether a system response is entailed by, contradicts, or is irrelevant to the profile (Song et al., 2020).

2. Data Collection and Annotation Protocols

ManiFlow-110k (Robotics)

The data pipeline is fully automated:

Gripper/Background Masking: Grounding-SAM2 is applied to the first frame of each RGB video to mask the end-effectors.
2D Flow and Correspondence Extraction: 2D points are sampled uniformly and tracked across frames with CoTracker3; pixels whose displacement exceeds a threshold are marked as active object pixels.
3D Projection: DepthAnythingV2 predicts per-pixel depth, and pixels are back-projected to camera coordinates. The 3D flow $u(x, y, t)$ is the difference between a point's 3D position vectors across frames, encoded as $\mathcal{F}(t) \in \mathbb{R}^{H \times W \times 4}$ (2D flow, depth change, visibility).
Source Data: Compiled from six prior robotics and teleoperation benchmarks (BridgeV2, ScalingRobotLearning, DROID, RH20T, LIBERO, AgiBot).
Annotation: Language instructions per clip (~10 tokens/clip), object category, scene label, bounding box. All pipelines are fully automated; no manual labeling is reported (Zhi et al., 6 Jun 2025).
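The tracking-to-3D steps above can be illustrated with a minimal NumPy sketch. It assumes the tracker output is an array of pixel coordinates of shape (T, N, 2), that predicted depth maps have shape (T, H, W), and that a pinhole camera with intrinsics fx, fy, cx, cy is known. The intrinsics, the 5 px threshold, and all function names are hypothetical, since the source does not specify them, and the sketch computes per-point 3D displacements rather than the dense four-channel $\mathcal{F}(t)$ encoding.

```python
import numpy as np


def active_points(tracks: np.ndarray, min_disp_px: float = 5.0) -> np.ndarray:
    """Mark tracked points as active if they ever move more than
    min_disp_px pixels from their first-frame position (hypothetical threshold)."""
    # tracks: (T, N, 2) pixel coordinates (x, y) from a 2D point tracker
    disp = np.linalg.norm(tracks - tracks[:1], axis=-1)  # (T, N)
    return disp.max(axis=0) > min_disp_px


def back_project(tracks: np.ndarray, depths: np.ndarray,
                 fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift tracked pixels to camera-frame 3D points with an assumed pinhole model."""
    T, H, W = depths.shape
    x, y = tracks[..., 0], tracks[..., 1]
    # Nearest-pixel depth lookup (no interpolation in this sketch)
    xi = np.clip(np.round(x).astype(int), 0, W - 1)
    yi = np.clip(np.round(y).astype(int), 0, H - 1)
    z = depths[np.arange(T)[:, None], yi, xi]  # (T, N)
    X = (x - cx) * z / fx
    Y = (y - cy) * z / fy
    return np.stack([X, Y, z], axis=-1)  # (T, N, 3)


def flow_3d(points: np.ndarray) -> np.ndarray:
    """3D flow as the change in 3D position between consecutive frames."""
    return points[1:] - points[:-1]  # (T-1, N, 3)


# Toy example: point 0 moves 10 px right, point 1 stays still, depth is 2 m.
tracks = np.array([[[10.0, 10.0], [20.0, 20.0]],
                   [[20.0, 10.0], [20.0, 20.0]]])
depths = np.full((2, 32, 32), 2.0)
active = active_points(tracks)
flow = flow_3d(back_project(tracks, depths, fx=100.0, fy=100.0, cx=0.0, cy=0.0))[:, active]
print(active, flow)  # only point 0 is active; it moves about 0.2 m along X
```

Nearest-pixel rounding keeps the sketch short; bilinear depth sampling would be a natural refinement.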

ManiFlow-110k (Dialogue)

Data extraction and annotation proceed as follows:

Data Source: Sina Weibo user posts and replies, filtered for single-turn, profile-related pairs across gender, location, and constellation domains.
Human Annotation: Each tuple is annotated by three trained annotators in stages: profile-relevance judgment, domain marking, selection of the referenced profile key, and assignment of one label: Entailed (E), Contradicted (C), or Irrelevant (I).
Quality Control: Gold-standard tuples are double-annotated every 10K samples, and batches with more than 10% disagreement are re-annotated. The final Fleiss’ κ on 2,000 held-out tuples is 0.857. The contradicted class is balanced, with one-third of it produced by minimal edits to entailed instances (Song et al., 2020).
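Fleiss’ κ, the agreement statistic reported above, generalizes Cohen’s κ to a fixed number of raters per item. The pure-Python sketch below assumes each tuple's three labels are stored as a list such as ["E", "E", "C"]; this data layout and the function name are illustrative assumptions, not the KvPI release format or the authors' script.

```python
from collections import Counter

LABELS = ("E", "C", "I")  # Entailed, Contradicted, Irrelevant


def fleiss_kappa(ratings, categories=LABELS):
    """Fleiss' kappa for items that each received the same number of ratings."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("need >= 2 ratings per item, same count for every item")
    counts = [Counter(r) for r in ratings]
    # Observed agreement per item: P_i = (sum_j n_ij^2 - n) / (n (n - 1))
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the pooled label proportions p_j
    total = n_items * n_raters
    p_e = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when every rating uses one label")
    return (p_bar - p_e) / (1 - p_e)


ratings = [["E", "E", "E"], ["C", "C", "C"], ["I", "I", "I"], ["E", "E", "C"]]
print(round(fleiss_kappa(ratings), 3))  # 0.745
```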

3. Dataset Structure and Statistics

Property	ManiFlow-110k (Robotics)	ManiFlow-110k (Dialogue)
Scale	110,000 video clips; 3.3 million frames (30 fps)	118,540 post–response pairs
Coverage	15 object categories; 50+ scene layouts/contexts	Profile: 3 keys (G/L/C); domains: gender, location, constellation
Agents / labels	Embodiments: human and robot	Labels: Entailed / Contradicted / Irrelevant
Text length	Instructions: ~10 tokens per clip	Avg. response length: 16–18 tokens
Data split	88K/11K/11K (train/val/test)	96.5K/11K/11K (train/dev/test)

G: gender, L: location, C: constellation.

Additional Properties

Robotics: 27K human and 83K robot clips (110K total); each object-centric video is paired with per-frame (H, W, 4) 3D flow arrays.
Dialogue: Each tuple specifies profile, post, response, domain, attribute key, and consistency label (Song et al., 2020).

4. Preprocessing, Data Formats, and Access

ManiFlow-110k (Robotics)

Video Clips: MP4, 256×256, 30 fps, cropped to object region (bbox +10 px).
3D Flow Fields: NumPy .npz per clip, $[N_\text{frames}, H, W, 4]$ float32.
Language Instructions: JSON, ~10 tokens per clip, CLIP-compatible tokenization.
Preprocessing: Downsample to 16 frames; normalize flow; bounding box cropping.
Splits: 88K/11K/11K (train/val/test) (Zhi et al., 6 Jun 2025).
Access: Not explicitly stated in the source, although the methods and pipelines are described in detail.
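The preprocessing steps listed above (16-frame downsampling and flow normalization) might look like the following NumPy sketch. The source does not specify the sampling scheme, the normalization, the channel order, or the array key inside each .npz file, so uniform frame sampling, scaling by the clip's largest motion magnitude, a (dx, dy, depth change, visibility) channel order, and the key name "flow" are all assumptions.

```python
import io

import numpy as np


def downsample_frames(flow: np.ndarray, n_out: int = 16) -> np.ndarray:
    """Pick n_out uniformly spaced frames from an [N_frames, H, W, 4] array."""
    idx = np.linspace(0, flow.shape[0] - 1, n_out).round().astype(int)
    return flow[idx]


def normalize_flow(flow: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale the three motion channels so the largest motion magnitude is about 1.
    Assumes channel order (dx, dy, depth change, visibility); visibility is untouched."""
    out = flow.astype(np.float32, copy=True)
    mag = np.linalg.norm(out[..., :3], axis=-1)
    out[..., :3] /= mag.max() + eps
    return out


# Synthetic 30-frame clip (about one second at 30 fps), round-tripped through
# an in-memory .npz file; the key name "flow" is hypothetical.
clip = np.random.default_rng(0).normal(size=(30, 8, 8, 4)).astype(np.float32)
clip[..., 3] = 1.0  # mark all points visible
buf = io.BytesIO()
np.savez(buf, flow=clip)
buf.seek(0)
loaded = np.load(buf)["flow"]
processed = normalize_flow(downsample_frames(loaded))
print(processed.shape)  # (16, 8, 8, 4)
```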

ManiFlow-110k (Dialogue)

Linearized Input: $\texttt{[CLS]}\; k_1 : v_1 \;\ldots\; k_n : v_n \;\texttt{[SEP]}\; w_1 \ldots w_m \;\texttt{[SEP]}$, where $k_i$ and $v_i$ are the profile keys and values and $w_1, \ldots, w_m$ are the dialogue tokens.
File Format: Not detailed in the source; code and data are publicly released on GitHub under an MIT-style license (no commercial restriction).
Access: https://github.com/songhaoyu/KvPI (Song et al., 2020).
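The linearized input above can be produced with a few lines of Python. This is a minimal sketch assuming the profile arrives as an ordered key-value dict and the dialogue text is already tokenized; the function name and the example keys and values are hypothetical (the real data is Chinese), and a full pipeline would pass the result through the model's own BERT tokenizer.

```python
def linearize(profile: dict[str, str], dialogue_tokens: list[str]) -> list[str]:
    """Build [CLS] k1 : v1 ... kn : vn [SEP] w1 ... wm [SEP] as a token list."""
    seq = ["[CLS]"]
    for key, value in profile.items():  # dicts preserve insertion order
        seq += [key, ":", value]
    seq.append("[SEP]")
    seq += dialogue_tokens
    seq.append("[SEP]")
    return seq


profile = {"gender": "female", "location": "Beijing", "constellation": "Leo"}
tokens = linearize(profile, ["I", "live", "in", "Beijing"])
print(" ".join(tokens))
# [CLS] gender : female location : Beijing constellation : Leo [SEP] I live in Beijing [SEP]
```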

5. Baseline Models, Evaluation Metrics, and Benchmarks

ManiFlow-110k (Robotics)

Task Coverage: Translation (52%), rotation (28%), combined (20%); objects include cups/mugs (20%), teapots/bottles (12%), pens/tools (10%), drawers/boxes (15%).
Flow Model Benchmarks: End-point error (EPE) is 4.5 cm overall (translation 3.8 cm, rotation 5.2 cm, combined 6.0 cm); the 3D visibility metric is 81%.
Downstream Evaluation: Instruction-conditioned flow generation; flow-guided policy transfer; cross-embodiment (human-to-robot) generalization (Zhi et al., 6 Jun 2025).
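End-point error, the flow metric reported above, is the mean Euclidean distance between predicted and ground-truth 3D flow vectors. A minimal NumPy sketch follows; the optional validity mask is an assumption (the source does not say whether occluded points are excluded), and the result is in whatever units the flow is stored in, e.g. cm to match the reported numbers.

```python
from typing import Optional

import numpy as np


def end_point_error(pred: np.ndarray, gt: np.ndarray,
                    valid: Optional[np.ndarray] = None) -> float:
    """Mean L2 distance between predicted and ground-truth flow vectors.

    pred, gt: arrays of shape (..., 3); valid: optional boolean mask of shape (...).
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    if valid is not None:
        err = err[valid]  # keep only points marked valid (e.g. visible)
    return float(err.mean())


gt = np.zeros((2, 3))
pred = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
print(end_point_error(pred, gt))                                 # 2.5
print(end_point_error(pred, gt, valid=np.array([True, False])))  # 5.0
```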

ManiFlow-110k (Dialogue)

Baselines: SVM, ESIM (biLSTM-based NLI), TableBERT, and BERT. The proposed KvBERT adds Tree-LSTM structure encoding for the profile and response.
Classification (accuracy/F1): KvBERT 91.7% overall (entail-F1 93.3, contradict-F1 91.0, irrelevant-F1 90.1); TableBERT 88.6%; plain BERT 88.0%; ESIM ~83.7%; SVM 62–69%.
Reranking: When PersonaDialog outputs are reranked with KvBERT, entail@1 improves by 1% and contradict@1 falls from 33% to 11% for location queries.
Consistency Checking: Cohen’s κ between human and KvBERT for generator outputs: 0.74–0.91 (substantial to almost perfect agreement) (Song et al., 2020).
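The accuracy, per-class F1, and Cohen’s κ figures above follow standard definitions; the pure-Python sketch below computes them for the three KvPI labels. It is illustrative only, not the authors' evaluation code, and the toy label sequences are made up.

```python
from collections import Counter

LABELS = ("E", "C", "I")  # Entailed, Contradicted, Irrelevant


def accuracy(gold, pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_per_class(gold, pred, labels=LABELS):
    """One-vs-rest F1 for each label: F1 = 2TP / (2TP + FP + FN)."""
    scores = {}
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores


def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters (e.g. a human and a classifier)."""
    n = len(a)
    p_o = accuracy(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)


gold = ["E", "E", "C", "I", "C", "I"]
pred = ["E", "C", "C", "I", "C", "E"]
print(accuracy(gold, pred))      # about 0.667
print(f1_per_class(gold, pred))  # E: 0.5, C: 0.8, I: about 0.667
print(cohen_kappa(gold, pred))   # about 0.5
```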

6. Research Impact, Comparisons, and Limitations

Comparative Analysis

Feature	ManiFlow-110k (Robotics)	ManiFlow-110k (Dialogue)
Scale	110K clips (reported as the largest 3D flow dataset)	118,540 annotated dialogue tuples
Labeling	Automated (motion, language)	Manual, with multi-stage quality control
Objects/Attributes	15 object categories	3 profile keys (G/L/C)
Embodiment diversity	Human + multiple robot arms	Single-user context only
Benchmark reference	DROID (40K); BridgeV2 (11K)	PersonaDialog, TransferTransfo, AttentionRouting
Reported limitations	Non-rigid objects; no tactile data; mask drift	Only gender/location/constellation; Chinese only; single-turn; synthetic contradicts

Impact

Robotics: Provides a uniform, scalable testbed for 3D object motion modeling across embodiments and object types, directly enabling research in cross-embodiment policy transfer and flow-conditioned planning (Zhi et al., 6 Jun 2025).
Dialogue: Establishes a robust benchmark for explicit consistency modeling in dialogue, supporting classifier evaluation, reranking systems, and consistency-aware generation (Song et al., 2020).

Limitations

Robotics: No force/torque or proprioceptive signal; challenges with non-rigid or occluded objects; entirely auto-annotated without human intervention.
Dialogue: Restriction to three profile keys, single-turn format, partial synthesis of contradiction class, and regional data bias due to platform specificity.

7. Future Directions and Open Challenges

Robotics: Extending 3D flow representations to multi-object, deformable, and fluid manipulation; integrating proprioception; multi-step and compound action prediction (Zhi et al., 6 Jun 2025).
Dialogue: Scaling to multi-turn dialogue; expanding profile depth; broader attribute sets; generalizing to multilingual and cross-cultural settings (Song et al., 2020).

Because the two resources share a name but are unrelated in origin, structure, and intended use, users should state which one (Robotics or Dialogue) they mean. Each has established baselines and methodological standards in its own area.

References

Zhi et al. (2025). 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model.

Song et al. (2020). Profile Consistency Identification for Open-domain Dialogue Agents.
