ManiFlow-110k: Robotics and Dialogue Data
- ManiFlow-110k (Robotics) is a large-scale dataset featuring 110K video clips with precise 3D optical flow annotations for enhanced robotic manipulation and cross-embodiment policy transfer.
- ManiFlow-110k (Dialogue) comprises 118K annotated post-response pairs that enable robust evaluation of profile consistency in dialogue systems through meticulous human annotation.
- Both datasets come with experimental benchmarks and detailed data protocols: the robotics set is labeled by an automated pipeline and the dialogue set by human annotators, supporting research in 3D action planning and dialogue consistency, respectively.
ManiFlow-110k is the designation for two distinct, large-scale datasets independently developed in the fields of (1) 3D robotic manipulation learning, and (2) profile consistency identification for open-domain dialogue agents. Each dataset is notable for its scope, granular annotation, and rigorous experimental utility within its respective domain. The following entry distinguishes these resources as ManiFlow-110k (Robotics) (Zhi et al., 6 Jun 2025) and ManiFlow-110k (Dialogue) (Song et al., 2020), systematically presenting their composition, methodology, and research context.
1. Purpose and Scope
ManiFlow-110k (Robotics)
Designed to address the lack of embodiment-agnostic, large-scale datasets for robot manipulation, ManiFlow-110k (Robotics) provides 110,000 short video clips annotated with high-fidelity 3D optical flow fields capturing object motions during manipulation. The dataset underpins research into 3D flow-conditioned world models, cross-embodiment policy transfer, and flow-guided action planning for both robotic and human agents (Zhi et al., 6 Jun 2025).
ManiFlow-110k (Dialogue)
ManiFlow-110k (Dialogue), formally the KvPI dataset, targets the problem of explicit profile consistency identification in dialogue generation. Spanning 118,540 single-turn post–response pairs annotated over user profiles (gender, location, constellation), it enables downstream modeling and evaluation of whether system responses are entailed by, contradict, or are irrelevant to the source profile (Song et al., 2020).
2. Data Collection and Annotation Protocols
ManiFlow-110k (Robotics)
The data acquisition pipeline entails automated synthesis:
- Gripper/Background Masking: Application of Grounding-SAM2 on the RGB video’s initial frame to mask end-effectors.
- 2D Flow and Correspondence Extraction: Uniform 2D point sampling followed by multi-frame tracking via Co-tracker3, identifying active object pixels by displacement thresholding.
- 3D Projection: DepthAnythingV2 predicts per-pixel depth, and pixels are back-projected to camera coordinates. 3D flow is computed as the difference in 3D position vectors between frames and encoded per pixel as (2D flow, depth change, visibility).
- Source Data: Composited from six prior robotics and teleoperation benchmarks (BridgeV2, ScalingRobotLearning, Droid, RH20T, Libero, AgiBot).
- Annotation: Language instructions per clip (~10 tokens/clip), object category, scene label, bounding box. All pipelines are fully automated; no manual labeling is reported (Zhi et al., 6 Jun 2025).
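The projection and thresholding steps above can be sketched with NumPy. This is a minimal illustration, not the authors' pipeline: it assumes a pinhole camera with known intrinsics `(fx, fy, cx, cy)`, takes tracked pixel coordinates and depths as given (standing in for the tracker and depth-model outputs), and uses a hypothetical displacement threshold `min_disp_px`.

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Lift pixel coordinates (N, 2) with depths (N,) to camera-frame points (N, 3).

    Assumes a pinhole camera model; the intrinsics are placeholders.
    """
    u, v = uv[:, 0], uv[:, 1]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)

def flow_3d(uv_t0, uv_t1, z_t0, z_t1, intrinsics):
    """3D flow as the difference of back-projected positions between two frames."""
    p0 = backproject(uv_t0, z_t0, *intrinsics)
    p1 = backproject(uv_t1, z_t1, *intrinsics)
    return p1 - p0

def active_mask(tracks, min_disp_px=2.0):
    """Flag tracked points whose largest 2D displacement from frame 0 exceeds a threshold.

    tracks: (T, N, 2) pixel trajectories; min_disp_px is a hypothetical value.
    """
    disp = np.linalg.norm(tracks - tracks[0], axis=-1)  # shape (T, N)
    return disp.max(axis=0) > min_disp_px
```

A point that moves 10 px to the right at a depth of 1 unit with `fx = 100` yields a 3D flow of 0.1 units along x, while a stationary point yields zero flow and is filtered out by `active_mask`.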
ManiFlow-110k (Dialogue)
Data extraction and annotation proceed as follows:
- Data Source: Sina Weibo user posts and replies, filtered for single-turn, profile-related pairs across gender, location, and constellation domains.
- Human Annotation: Each tuple annotated by three trained annotators; stages include profile-relevance, domain marking, selection of referenced profile key, and assignment of one label: Entailed (E), Contradicted (C), or Irrelevant (I).
- Quality Control: Gold-standard tuples are double-annotated every 10K samples, and batches with more than 10% disagreement are re-annotated. Final Fleiss’ κ on 2,000 held-out tuples: 0.857. The contradicted class is balanced, with one-third of its instances produced by minimal edits of entailed instances (Song et al., 2020).
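For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Fleiss' κ for a fixed number of annotators per item. The input layout (one row per tuple, one column per label, cells holding annotator counts) is an assumption made for illustration; the source does not describe how the authors computed the value.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of annotators
    who assigned item i to category j. Every row must sum to the same n >= 2."""
    num_items = len(counts)
    n = sum(counts[0])            # annotators per item
    num_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p = [sum(row[j] for row in counts) / (num_items * n) for j in range(num_cats)]

    # Observed agreement per item, then averaged over items.
    per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(per_item) / num_items

    # Agreement expected by chance.
    p_e = sum(pj * pj for pj in p)
    return (p_bar - p_e) / (1 - p_e)

# Three annotators, labels ordered as (Entailed, Contradicted, Irrelevant).
example = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]
kappa = fleiss_kappa(example)
```

In the example, three of the four tuples are labeled unanimously and one is split 2 to 1, which gives κ = 70/94, roughly 0.745.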
3. Dataset Structure and Statistics
| ManiFlow-110k (Robotics) | ManiFlow-110k (Dialogue) |
|---|---|
| 110,000 video clips | 118,540 post–response pairs |
| 3.3 million frames (30 fps) | Profile: 3 keys (G/L/C) |
| 15 object categories | Domains: gender, location, constellation |
| Embodiments: human/robot | Labeled: Entail/Contradict/Irrelevant |
| Scenes: 50+ layouts/contexts | Avg. resp. length: 16-18 tokens |
| 88K/11K/11K split | 96.5K/11K/11K split (train/dev/test) |
G: gender, L: location, C: constellation.
Additional Properties
- Robotics: 27k human and 83k robot clips; object-centric video and matching (H, W, 4) 3D flow arrays.
- Dialogue: Each tuple specifies profile, post, response, domain, attribute key, and consistency label (Song et al., 2020).
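The dialogue tuple described above can be modeled as a small record type. This is a sketch only: the class name, field names, and English label strings are hypothetical, since the source lists the fields but not the on-disk schema.

```python
from dataclasses import dataclass
from typing import Dict

# Label and domain vocabularies as named in the text; the string spellings are assumptions.
LABELS = ("entailed", "contradicted", "irrelevant")
DOMAINS = ("gender", "location", "constellation")

@dataclass(frozen=True)
class ProfileTuple:
    """One annotated example: profile, post, response, domain, attribute key, label."""
    profile: Dict[str, str]
    post: str
    response: str
    domain: str
    attribute_key: str
    label: str

    def __post_init__(self):
        # Reject values outside the vocabularies described in the text.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")

sample = ProfileTuple(
    profile={"gender": "female", "location": "Beijing", "constellation": "Leo"},
    post="Where are you based?",
    response="I live in Beijing.",
    domain="location",
    attribute_key="location",
    label="entailed",
)
```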
4. Preprocessing, Data Formats, and Access
ManiFlow-110k (Robotics)
- Video Clips: MP4, 256×256, 30 fps, cropped to object region (bbox +10 px).
- 3D Flow Fields: NumPy .npz per clip, float32.
- Language Instructions: JSON, ~10 tokens per clip, CLIP-compatible tokenization.
- Preprocessing: Downsample to 16 frames; normalize flow; bounding box cropping.
- Splits: 88K/11K/11K (train/val/test) (Zhi et al., 6 Jun 2025).
- Access: Not explicitly stated in the source, although the methods and pipelines are described in detail.
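The preprocessing items above (loading a per-clip `.npz`, downsampling to 16 frames, normalizing flow) might look as follows. Several details are assumptions rather than documented facts: the archive key `"flow"`, a leading time axis giving arrays of shape `(T, H, W, 4)`, the channel order (three motion channels followed by visibility), uniform temporal sampling, and max-absolute-value normalization.

```python
import io
import numpy as np

def load_flow(file, key="flow"):
    """Read one flow array from an .npz archive (path or file-like object).

    The key name is hypothetical; check the actual archive contents.
    """
    with np.load(file) as data:
        return data[key]

def downsample_frames(flow, num_frames=16):
    """Pick num_frames evenly spaced frames from a (T, H, W, 4) array.

    If T < num_frames, some frames are repeated.
    """
    t = flow.shape[0]
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
    return flow[idx]

def normalize_flow(flow, eps=1e-8):
    """Scale the first three channels by their largest magnitude; keep visibility as-is."""
    out = flow.astype(np.float32, copy=True)
    scale = np.abs(out[..., :3]).max()
    out[..., :3] /= (scale + eps)
    return out

# Round trip through an in-memory archive with synthetic data.
rng = np.random.default_rng(0)
clip = rng.normal(size=(30, 8, 8, 4)).astype(np.float32)
clip[..., 3] = 1.0                      # pretend every point is visible
buffer = io.BytesIO()
np.savez(buffer, flow=clip)
buffer.seek(0)
prepared = normalize_flow(downsample_frames(load_flow(buffer)))
```

After this step `prepared` has shape `(16, 8, 8, 4)`, with motion channels bounded by 1 in magnitude and the visibility channel unchanged.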
ManiFlow-110k (Dialogue)
- Linearized Input: $\texttt{[CLS]}\ k_1[: v_1]\ \dots\ k_n[: v_n]\ \texttt{[SEP]}\ w_1\ \dots\ w_m\ \texttt{[SEP]}$, where $k_i$ and $v_i$ are the keys and values from the profile and $w_1, \dots, w_m$ are the response tokens.
- File Format: Not detailed, but public code and data provided via GitHub under MIT-style license (no commercial restriction).
- Access: https://github.com/songhaoyu/KvPI (Song et al., 2020).
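The linearized input format listed above can be illustrated with a short helper. This is a sketch under assumptions: the function name is hypothetical, the profile keys are written in English, tokens are joined by spaces, and the bracketed `[: v]` in the template is read as "the value is optional". The actual KvPI preprocessing code may differ.

```python
def linearize(profile, response_tokens):
    """Build '[CLS] k1: v1 ... kn: vn [SEP] w1 ... wm [SEP]' from a profile
    mapping and a list of response tokens."""
    parts = ["[CLS]"]
    for key, value in profile.items():
        # Emit "key: value" when a value is present, otherwise the bare key.
        parts.append(f"{key}: {value}" if value else key)
    parts.append("[SEP]")
    parts.extend(response_tokens)
    parts.append("[SEP]")
    return " ".join(parts)

text = linearize(
    {"gender": "female", "location": "Beijing", "constellation": ""},
    ["I", "live", "in", "Beijing"],
)
```

Here `text` is `[CLS] gender: female location: Beijing constellation [SEP] I live in Beijing [SEP]`.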
5. Baseline Models, Evaluation Metrics, and Benchmarks
ManiFlow-110k (Robotics)
- Task Coverage: Translation (52%), rotation (28%), combined (20%); objects include cups/mugs (20%), teapots/bottles (12%), pens/tools (10%), drawers/boxes (15%).
- Flow Model Benchmarks: End-point error (EPE): 4.5 cm overall; Translation 3.8 cm, Rotation 5.2 cm, Combined 6.0 cm. Thresholded 3D visibility metric: 81%.
- Downstream Evaluation: Instruction-conditioned flow generation; flow-guided policy transfer; cross-embodiment (human-to-robot) generalization (Zhi et al., 6 Jun 2025).
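End-point error, the metric reported in the benchmarks above, is the mean Euclidean distance between predicted and reference flow vectors. The sketch below shows the standard definition; restricting the average to visible points is an assumption, and the resulting unit matches whatever unit the flow vectors use.

```python
import numpy as np

def endpoint_error(pred, gt, visible=None):
    """Mean Euclidean distance between predicted and reference 3D flow vectors.

    pred, gt: arrays of shape (..., 3). visible: optional mask of shape (...).
    Returns NaN when the mask selects no points.
    """
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    if visible is not None:
        err = err[np.asarray(visible, dtype=bool)]
    if err.size == 0:
        return float("nan")
    return float(err.mean())

pred = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
gt = np.zeros((2, 3))
epe_all = endpoint_error(pred, gt)                    # averages errors 0 and 5
epe_visible = endpoint_error(pred, gt, [False, True])  # only the second point
```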
ManiFlow-110k (Dialogue)
- Baselines: SVM, ESIM (biLSTM NLI), TableBERT, BERT. Proposed KvBERT employs Tree-LSTM structure encoding for profile and response.
- Classification: Accuracy/F1—KvBERT: 91.7% overall (entail-F1: 93.3, contradict-F1: 91.0, irrelevant-F1: 90.1). TableBERT: 88.6%. Plain BERT: 88.0%. ESIM: ~83.7%. SVM: 62–69%.
- Reranking: PersonaDialog reranked using KvBERT; entail@1 improved by +1%, contradict@1 reduced from 33% to 11% for location queries.
- Consistency Checking: Cohen’s κ between human and KvBERT for generator outputs: 0.74–0.91 (substantial to almost perfect agreement) (Song et al., 2020).
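The human-versus-model agreement reported above uses Cohen's κ, which corrects raw agreement between two raters for chance. A minimal sketch follows, with the single-letter labels E, C, and I taken from the annotation scheme; the example data is invented for illustration and is not drawn from the dataset.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

human = ["E", "E", "C", "C", "I", "I"]
model = ["E", "E", "C", "C", "I", "E"]
agreement = cohen_kappa(human, model)
```

With five of six labels matching and a chance agreement of 1/3, `agreement` works out to 0.75.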
6. Research Impact, Comparisons, and Limitations
Comparative Analysis
| Feature | ManiFlow-110k (Robotics) | ManiFlow-110k (Dialogue) |
|---|---|---|
| Scale | 110K clips (largest 3D flow dataset) | 118.5K annotated tuples |
| Labeling | Automated (motion, language) | Manual, with agreement checks |
| Objects/Attributes | 15 object categories | 3 profile keys (G/L/C) |
| Embodiment diversity | Human + multiple robot arms | Only single-user context |
| Benchmark Reference | Droid (40K); BridgeV2 (11K) | PersonaDialog, TransferTransfo, AttentionRouting |
| Reported limitations | Non-rigid objects; no tactile data; mask drift | Only gender/location/constellation; Chinese only; single-turn; partly synthetic contradictions |
Impact
- Robotics: Provides a uniform, scalable testbed for 3D object motion modeling across embodiments and object types, directly enabling research in cross-embodiment policy transfer and flow-conditioned planning (Zhi et al., 6 Jun 2025).
- Dialogue: Establishes a robust benchmark for explicit consistency modeling in dialogue, supporting classifier evaluation, reranking systems, and consistency-aware generation (Song et al., 2020).
Limitations
- Robotics: No force/torque or proprioceptive signal; challenges with non-rigid or occluded objects; entirely auto-annotated without human intervention.
- Dialogue: Restriction to three profile keys, single-turn format, partial synthesis of contradiction class, and regional data bias due to platform specificity.
7. Future Directions and Open Challenges
- Robotics: Extending 3D flow representations to multi-object, deformable, and fluid manipulation; integrating proprioception; multi-step and compound action prediction (Zhi et al., 6 Jun 2025).
- Dialogue: Scaling to multi-turn dialogue; expanding profile depth; broader attribute sets; generalizing to multilingual and cross-cultural settings (Song et al., 2020).
Because the ManiFlow-110k designation appears in two domains, users should specify the context (Robotics vs. Dialogue) to avoid confusion: the two resources are unrelated in origin, structure, and intended use. Both continue to serve as reference benchmarks in their respective areas.