
BEHAVIOR Vision Suite Overview

Updated 4 February 2026
  • BEHAVIOR Vision Suite is a modular platform comprising open-source toolkits and synthetic data generators for advanced computer vision and embodied AI research.
  • It integrates a high-fidelity asset library and simulation frameworks that allow precise parameterization, controlled experiments, and robust evaluations.
  • The suite supports multi-modal outputs and sim-to-real transfer studies, making it essential for benchmarking object recognition, segmentation, and state prediction.

The BEHAVIOR Vision Suite (BVS) refers to a family of open-source toolkits, synthetic data generation platforms, model integration frameworks, and evaluation pipelines designed for systematic research in computer vision, behavioral analysis, and embodied AI. BVS architectures are characterized by high modularity, extensive parameterization capabilities, and support for real, synthetic, and simulation-based data at varying scales of complexity, enabling controlled experiments on recognition, robustness, and transfer. The suite’s core is built atop the BEHAVIOR-1K asset library and NVIDIA Omniverse-based simulation, supporting the exploration and benchmarking of vision algorithms in diverse, well-annotated domains (Ge et al., 2024).

1. Architecture and Core Modules

BVS is composed of two principal subsystems: (1) the extended BEHAVIOR-1K asset library, and (2) a highly customizable dataset generator. The asset library comprises 8,841 object models categorized into 1,937 semantic classes, embedded within 1,000 procedurally instanced scenes spanning a diverse set of indoor and commercial domains (homes, offices, restaurants, grocery stores, etc.). Each asset is annotated with articulated joints, physical attributes (colliders, fluids, cloth simulation), and semantic properties, facilitating high-fidelity scene composition. Collision meshes are generated via V-HACD and manually refined.

The suite’s dataset generator layer utilizes OmniGibson, combining photorealistic rendering with physics-based simulation. It allows arbitrary control over three fundamental sets of parameters: (i) scene-level (e.g., lighting intensity, color temperature, ambient occlusion, gravity, fog), (ii) object-level (e.g., joint angles, container fill fractions, semantic predicates, materials), and (iii) camera-level (e.g., field of view, focal length, position/orientation, aperture, noise models). Researchers can define distributions across any parametric axis for each factor, enabling continuous or discrete sampling during generation (Ge et al., 2024).

Direct outputs from BVS include multimodal, synchronized streams: RGB images, depth maps, surface normals, optical flow, semantic/instance/part segmentation masks, 2D/3D bounding boxes, point clouds, camera calibration/intrinsics, scene graphs, and fully resolved unary/binary state labels.

2. Parameterization and Configuration

All aspects of BVS datasets are specified via YAML or JSON schema, supporting scalar, list, or distribution-based parameter values. For example, lighting intensity can be defined as:

I_\text{light} \sim \mathcal{U}(I_{\min}, I_{\max})

Articulated object states leverage joint fractions:

\alpha \sim \mathcal{U}(0, 1), \qquad \theta = \alpha\,(\theta_{\max} - \theta_{\min}) + \theta_{\min}

Container fill levels, cloth folding degrees, and material properties (roughness, metallicity, specularity) follow similar conventions. Camera parameters include field of view, focal length, aperture, and pose/orientation, each of which may be sampled per-frame or per-sequence.
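The two sampling rules above can be sketched in plain Python (the function names are illustrative, not part of the BVS API):

```python
import random

def sample_light_intensity(i_min: float, i_max: float) -> float:
    """Draw I_light ~ U(I_min, I_max), as in the lighting example above."""
    return random.uniform(i_min, i_max)

def sample_joint_angle(theta_min: float, theta_max: float) -> float:
    """Draw alpha ~ U(0, 1) and map it to the joint's physical range."""
    alpha = random.uniform(0.0, 1.0)
    return alpha * (theta_max - theta_min) + theta_min
```

Any other uniform-distributed factor (fill fraction, roughness, aperture) follows the same pattern with its own bounds.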

Predicates for object relation annotation cover unary (e.g., is_open, is_filled), binary (e.g., object A on_top_of B, A inside B), and continuous (e.g., fill_fraction, joint_angle) types. Randomization processes include object swap, placement (inside/on-top-of/under), scene clutter density, and lighting perturbation. Camera sampling supports constraints such as occlusion minimization and object-centric framing.
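As an illustration of a binary predicate (not the BVS implementation, which computes states from full mesh knowledge), on_top_of can be approximated with axis-aligned 3D boxes:

```python
def on_top_of(box_a, box_b, eps: float = 0.02) -> bool:
    """Rough on_top_of(A, B) test on axis-aligned 3D boxes, each given as
    (min_corner, max_corner) tuples.

    A's bottom face must rest near B's top face, and their horizontal
    footprints must overlap. A coarse stand-in for a mesh-level check.
    """
    (ax0, ay0, az0), (ax1, ay1, az1) = box_a
    (bx0, by0, bz0), (bx1, by1, bz1) = box_b
    resting = abs(az0 - bz1) <= eps          # A's bottom near B's top
    overlap_x = ax0 < bx1 and bx0 < ax1      # footprints intersect in x
    overlap_y = ay0 < by1 and by0 < ay1      # footprints intersect in y
    return resting and overlap_x and overlap_y
```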

A minimal configuration example:

dataset_name: 'office_robustness_test'
assets:
  scenes: ["office_01", "office_02"]
  object_categories: ["chair", "desk", "laptop"]
scene_params:
  lighting:
    intensity: {distribution: "uniform", min: 0.2, max: 0.8}
    temperature: {distribution: "uniform", min: 3000, max: 6000}
  add_fog: {value: false}
object_params:
  articulation:
    enabled: true
    joints: ["door_hinge", "drawer_slider"]
    angle: {distribution: "uniform", min_ratio: 0.0, max_ratio: 1.0}
  container:
    fill_fraction: {distribution: "bernoulli", p: 0.5}
camera_params:
  fov: {distribution: "uniform", min: 30, max: 75}
  focal_length: {distribution: "uniform", min: 20, max: 35}
  noise_model:
    type: "gaussian"
    sigma: 0.005
output:
  modalities: ["rgb", "depth", "normals", "segmentation"]
  annotations: ["bbox2d", "bbox3d", "state_labels"]
  image_size: [640, 480]
  num_frames: 1000
seed: 42

Command-line and Python APIs are supported for launching data generation with parallel workers.
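The parallel-worker pattern can be sketched as follows; `render_frame` and the pool fan-out are illustrative stand-ins, not the BVS API (a real run would typically use separate GPU-backed processes rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def render_frame(frame_i: int) -> str:
    # Stand-in for one generation step (reset scene, sample states, render, save).
    return f"frame_{frame_i:06d}.png"

def generate_parallel(num_frames: int, num_workers: int = 4) -> list:
    """Fan frame indices out over a pool of workers; result order is preserved."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(render_frame, range(num_frames)))
```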

3. Synthetic Data Generation and Annotation

At each data generation step, the BVS pipeline samples a scene and its objects according to user-defined or default distributions, applies the sampled physical parameters (e.g., opens drawers to a given angle, fills containers to specified levels), and places the camera at static or dynamically sampled poses. The Omniverse renderer produces images with configurable real-time HDR lighting, depth, noise, and photorealistic shading. Sensor simulation extends to stochastic models: Gaussian noise, Poisson noise, and parameterized blur.
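The Gaussian and Poisson sensor models can be sketched in pure Python on a flat list of pixel values in [0, 1]; the parameter values and function names are illustrative:

```python
import math
import random

def add_gaussian_noise(pixels, sigma=0.005):
    """Additive zero-mean Gaussian read noise, clipped back to [0, 1]."""
    return [min(1.0, max(0.0, p + random.gauss(0.0, sigma))) for p in pixels]

def _poisson(lam: float) -> int:
    """Knuth's algorithm: sample a Poisson(lam) count."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def add_poisson_noise(pixels, lam=30.0):
    """Signal-dependent shot noise: treat lam as the full-scale photon count,
    sample Poisson counts per pixel, and rescale back to [0, 1]."""
    return [min(1.0, _poisson(p * lam) / lam) for p in pixels]
```

Knuth's method is adequate here because per-pixel photon counts stay small; production pipelines would vectorize this.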

Annotations are auto-generated with millimeter-level geometric accuracy, leveraging full mesh knowledge for segmentation, collision, and predicate state computation. Outputs are available in standard formats (e.g., COCO JSON for 2D bounding boxes, PNG masks, EXR for depth/normals/flow, CSV/JSON for unary/binary state labels, JSON for scene graphs).
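As a sketch of the export side, 2D boxes can be packed into a minimal COCO-style structure (the field subset and function name are chosen for illustration):

```python
import json

def to_coco(boxes, image_id=0, width=640, height=480):
    """Pack (category_id, x, y, w, h) tuples into a minimal COCO-style dict."""
    return {
        "images": [{"id": image_id, "width": width, "height": height}],
        "annotations": [
            {
                "id": i,
                "image_id": image_id,
                "category_id": cat,
                "bbox": [x, y, w, h],   # COCO convention: top-left x, y, width, height
                "area": w * h,
                "iscrowd": 0,
            }
            for i, (cat, x, y, w, h) in enumerate(boxes)
        ],
    }

doc = to_coco([(1, 10, 20, 30, 40)])
coco_json = json.dumps(doc)  # ready to write alongside PNG masks and EXR buffers
```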

A typical data generation loop in Python:

from bvs import DatasetGenerator

config = DatasetGenerator.load_config("office_robustness_test.yaml")
gen = DatasetGenerator(config)

for frame_i in range(config.output.num_frames):
    gen.reset_scene()                                    # re-instance scene assets
    gen.sample_object_states()                           # draw joint angles, fill levels, etc.
    cam_pose = gen.sample_camera_pose(target="laptop")   # object-centric framing
    gen.set_camera(cam_pose)
    images, annotations = gen.render()                   # synchronized multimodal outputs
    gen.save_frame(frame_i, images, annotations)

4. Experimental Applications and Evaluations

BVS enables systematic evaluation of vision models along arbitrary axes of controlled domain shift. Exemplary studies include quantifying model robustness under increasing articulation (drawer/door open fraction), illumination variation, occlusion (visible fraction), zoom, and camera pitch. For each axis, BVS generates hundreds of video sequences with the relevant parameter continuously varied and other factors held constant.
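The single-axis protocol described above can be sketched as a configuration sweep; the parameter names are illustrative:

```python
def articulation_sweep(num_steps: int = 11):
    """Sweep the articulation fraction alpha over [0, 1] while pinning every
    other factor to a fixed value, as in a single-axis robustness test."""
    base = {"light_intensity": 0.5, "camera_pitch_deg": 0.0, "fog": False}
    return [
        dict(base, articulation_alpha=i / (num_steps - 1))
        for i in range(num_steps)
    ]
```

Each returned configuration would seed one generated sequence, so any measured performance change is attributable to the swept axis alone.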

Evaluation metrics for detection and segmentation include Average Precision (AP), computed as:

\mathrm{AP} = \sum_{k=1}^{K} P(k)\,\Delta R(k)
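This summation over rank positions k, with ΔR(k) = R(k) − R(k−1) and R(0) = 0, can be computed directly from per-rank precision/recall values:

```python
def average_precision(precisions, recalls):
    """AP = sum_k P(k) * (R(k) - R(k-1)), with R(0) = 0.

    precisions and recalls are per-rank values; recalls must be non-decreasing.
    """
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```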

Application results demonstrate that AP drops by up to 30 points as articulation increases, exhibits non-linear sensitivity to lighting intensity, and degrades roughly linearly with occlusion. Multi-task benchmarks over the BVS-generated “holistic scene understanding” set (266,000 frames) preserve the relative ranking of state-of-the-art models observed on real datasets, validating BVS as a reliable proxy (Ge et al., 2024).

For simulation-to-real transfer, models trained on BVS synthetic images for binary relations (on_top_of, inside, under) and unary states (filled, folded) achieve real-data F1 scores of 0.839 (vs. 0.271 for zero-shot CLIP) and unary attribute accuracy of 0.93/0.86 for “filled”/“folded.” This supports the effectiveness of BVS-generated supervision for novel semantic-state prediction in real environments.

5. Implementation, Integration, and Extensibility

BVS provides a modular software stack encompassing asset management, procedural scene composition, physical simulation, rendering, and annotation export. All modules are accessible via CLI and Python, with configuration-driven experimentation. Outputs are compatible with standard CV and robotics pipelines.

Integration with model training is direct: generated data can be used for detection, segmentation, 3D reconstruction, scene graph parsing, state prediction, or sim2real robustness experiments. BVS's asset library and parameter sampling can be extended via new object classes, scene templates, or predicate types.

Recommended parameter ranges, derived from application studies:

  • Lighting intensity I ∈ [0.1, 0.8]; avoid I < 0.1 for nontrivial contrast.
  • Articulation fraction α ∈ {0, 0.25, 0.5, 0.75, 1} (discrete) or α ~ U(0, 1) (continuous).
  • Visibility v sampled in [0, 1] in 0.1 increments for fine-grained occlusion analysis.
  • Focal length 16–50 mm; avoid below 16 mm to limit distortion.
  • Sensor noise: Gaussian σ ∈ [0.001, 0.01]; Poisson λ ∈ [5, 30].

6. Limitations and Future Directions

Current BVS releases address static scenes and short video slices; agent-driven interaction and temporal illumination/material variation are not yet native features. The framework is currently limited regarding multi-camera synchronization and dynamic object modification beyond predefined predicates. Sensor simulation could be further enhanced to include rolling shutter, lens distortion, or realistic motion artifacts. Asset diversity expansion—e.g., outdoor scenes, soft and deformable bodies beyond cloth—remains an open avenue.

Planned extensions include multi-camera/multi-view support, dynamic lighting/material change, richer noise models, greater domain diversity, and the integration of agent-driven simulation for active perception studies (Ge et al., 2024).

7. Impact and Significance

The BEHAVIOR Vision Suite represents a comprehensive approach to addressing the need for scalable, precisely controlled, high-fidelity data in computer vision research. Its explicit, extensive parameterization and physically grounded asset library resolve previous trade-offs between photorealism, physics, diversity, and annotation accuracy. By allowing researchers to systematically probe model failure modes and perform controlled robustness and sim2real transfer studies, BVS is positioned as an essential infrastructure component for the next generation of robust and interpretable AI systems in perception, robotics, and embodied reasoning (Ge et al., 2024).
