3D-CustomBench: Modular 3D Vision Benchmark
- 3D-CustomBench is a modular framework for standardized evaluation of 3D vision tasks, integrating diverse datasets, models, and metrics.
- It features plug-and-play modules for configuration, data ingestion, model inference, and metric computation, ensuring uniform preprocessing and reproducible results.
- By enforcing cross-dataset and out-of-distribution testing, the framework enhances model generalization assessment and highlights robustness challenges.
3D-CustomBench is a fully modular benchmark framework for three-dimensional (3D) vision, designed to enable rigorous, standardized cross-dataset and cross-method evaluation for a broad spectrum of 3D tasks. Its architecture generalizes principles established by frameworks such as PoseBench3D, E3D-Bench, 3DGen-Bench, TAPVid-3D, and GT23D-Bench, permitting the integration of arbitrary datasets, models, and metrics for problems ranging from human pose estimation to shape reconstruction and generative 3D synthesis (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025, Zhang et al., 27 Mar 2025, Koppula et al., 2024, Su et al., 2024). The core objective is to isolate the generalization and robustness of 3D models by enforcing consistent preprocessing, unified annotation formats, and interpretable evaluation, supporting both in-domain and out-of-distribution (OOD) scenarios.
1. Modular Architecture and Core Components
3D-CustomBench is structured as a collection of plug-and-play modules, each addressing a separate aspect of the benchmark protocol:
- Config Module: Ingests user-specified YAML or JSON configuration, exposing parameters such as a dataset registry, model types (e.g., PyTorch, ONNX), normalization schemes (screen-space vs. z-score), and core model hyperparameters (such as joint topology, input dimensionality, or video mode).
- Dataset Module: Provides a BaseDataset abstract class defining methods for annotation loading, camera/world normalization, skeleton mapping, and __getitem__-based batch collation. Integrating a new dataset requires subclassing BaseDataset to adapt raw formats (e.g., .npz, .mat, .json) to common internal representations, such as a fixed joint skeleton, multi-view renders, or point clouds.
- Model Module: Abstracts model initialization and inference, with an interface designed to accommodate both pretrained and trainable pipelines. Models exposed via standardized wrappers can be evaluated on any compatible dataset, regardless of training domain or input requirements.
- Evaluation Module: Computes metrics tailored to the task, accepting predictions and ground truth (GT), and returning per-sample, per-joint, or global summaries based on metrics such as MPJPE, PA-MPJPE, 3D Chamfer Distance, PSNR, LPIPS, or specially designed scores for generative and tracking tasks.
- Orchestration (Runner): Automates cross-dataset evaluation loops. For every method and every possible train/test split combination, the orchestrator loads the required model checkpoint, applies relevant preprocessing, and logs results through the evaluation module.
The result is a highly extensible pipeline: new datasets or models are added via subclass registration, altering only the config file as needed. The unified data API ensures that preprocessing routines—e.g., root joint centering, skeleton scaling, or camera normalization—are applied identically for all evaluation runs. This strict modularization supports 3D human pose, object pose, hand pose, 3D shape estimation, multi-view reconstruction, text-to-3D, and 3D generative models, among others (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025, Su et al., 2024).
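The module layout described above can be illustrated with a minimal sketch. Only the name BaseDataset comes from the description; the registry, the BenchConfig fields, the method names, and the toy dataset are hypothetical and chosen purely for illustration, and the config is given as JSON to keep the example dependency-free.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np

# Hypothetical registry mapping config names to dataset classes.
DATASET_REGISTRY = {}


def register_dataset(name):
    """Class decorator that records a dataset class under `name`."""
    def decorator(cls):
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


@dataclass
class BenchConfig:
    """Illustrative subset of config parameters (field names are assumptions)."""
    datasets: list
    model_type: str = "pytorch"
    normalization: str = "zscore"
    num_joints: int = 17

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


class BaseDataset(ABC):
    """Shared interface: subclasses only implement raw annotation loading."""

    def __init__(self, config):
        self.config = config
        self.samples = self.load_annotations()

    @abstractmethod
    def load_annotations(self):
        """Return a list of (joints_2d, joints_3d) array pairs."""

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        joints_2d, joints_3d = self.samples[idx]
        return {"input_2d": joints_2d, "target_3d": joints_3d}


@register_dataset("toy")
class ToyDataset(BaseDataset):
    """Stand-in dataset that generates random poses instead of reading files."""

    def load_annotations(self):
        rng = np.random.default_rng(0)
        j = self.config.num_joints
        return [(rng.normal(size=(j, 2)), rng.normal(size=(j, 3))) for _ in range(4)]


config = BenchConfig.from_json('{"datasets": ["toy"], "num_joints": 17}')
dataset = DATASET_REGISTRY[config.datasets[0]](config)
sample = dataset[0]
```

In this sketch, adding a dataset means writing one subclass and listing its registered name in the config; the runner never needs to know the raw file format.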
2. Data Pipeline and Integration Methodology
The data pipeline in 3D-CustomBench is designed for generality and reproducibility, encompassing the following steps:
- Annotation ingestion: Raw data is loaded (file formats include .npz, .json, .txt, mesh representations), yielding source 2D/3D keypoints, depth, normal maps, camera parameters, semantic labels, or mesh data as appropriate.
- Preprocessing and normalization: All 3D coordinates are mapped to a common frame (camera or world) via extrinsics, and 2D projections are computed via camera intrinsics.
  - For pose tasks: Joints are mapped to a standard skeleton ordering, and root joints are centered at the origin. Skeletons are optionally rescaled such that a reference bone (e.g., hip–neck distance) has unit length.
  - For image-based tasks: Images are resized/cropped to canonical resolutions before normalization (screen-space [0,1] or z-score).
  - For generative and reconstruction tasks: Renders are produced for N pre-defined viewpoints, and all outputs are exported in common mesh, point cloud, or render formats.
- Dataset wrappers: New datasets are integrated by implementing subclasses that expose the expected __getitem__ API, yielding all necessary metadata, annotations, and auxiliary information.
Through these unified mechanisms, novel data sources—including human mocap, robotic manipulators, LiDAR scans, video, and 3D text prompts—can be coherently brought under evaluation without manual adaptations of the framework (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025, Koppula et al., 2024, Su et al., 2024).
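The pose-related preprocessing steps listed above can be sketched with NumPy as follows. The function names, the pinhole camera model, and the choice of joints 0 and 1 as the reference bone are assumptions made for illustration; the framework's actual routines may differ.

```python
import numpy as np


def world_to_camera(points_world, R, t):
    """Map (J, 3) world points to the camera frame: X_cam = R @ X_world + t."""
    return points_world @ R.T + t


def project(points_cam, K):
    """Pinhole projection of (J, 3) camera-frame points with 3x3 intrinsics K."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def root_center(joints, root_idx=0):
    """Translate the skeleton so the root joint sits at the origin."""
    return joints - joints[root_idx]


def rescale_to_unit_bone(joints, joint_a, joint_b):
    """Scale the skeleton so the bone between joint_a and joint_b has length 1."""
    bone_length = np.linalg.norm(joints[joint_a] - joints[joint_b])
    return joints / bone_length


# Toy three-joint skeleton in world coordinates (arbitrary units).
joints_world = np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 0.0]])
R = np.eye(3)                      # camera aligned with world axes
t = np.array([0.0, 0.0, 5.0])      # camera 5 units in front of the root
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])

joints_cam = world_to_camera(joints_world, R, t)
pixels = project(joints_cam, K)
normalized = rescale_to_unit_bone(root_center(joints_cam), 0, 1)
```

Root centering is applied before rescaling here so that the scale change is taken about the root joint and leaves it at the origin.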
3. Evaluation Metrics and Protocols
3D-CustomBench supports a diverse suite of evaluation metrics, each adapted to relevant 3D tasks:
- Pose Estimation Metrics:
  - Mean Per-Joint Position Error (MPJPE): $\mathrm{MPJPE} = \frac{1}{J} \sum_{j=1}^{J} \lVert \hat{p}_j - p_j \rVert_2$, where $\hat{p}_j$ and $p_j$ are the predicted and ground-truth 3D positions of joint $j$ and $J$ is the number of joints.
  - Procrustes-Aligned MPJPE (PA-MPJPE): MPJPE computed after aligning the prediction to the ground truth with the optimal similarity transformation (scale, rotation, and translation) from Procrustes analysis; the rotation component is typically obtained with the Kabsch algorithm.
- Multi-View Reconstruction and Depth:
  - Absolute Relative Error, RMSE, Inlier Ratios (fraction of pixels whose predicted-to-true depth ratio δ falls below a threshold): Used for depth prediction.
  - Chamfer Distance: For comparing point clouds.
  - Accuracy, Completeness, Normal Consistency: As defined for 3D reconstructions.
  - Absolute Trajectory Error (ATE), Relative Pose Error (RPE): For pose estimation in multi-view or SLAM scenarios.
- Generative and Text-to-3D Evaluation:
  - Textual-3D Alignment:
    - Text–PointCloud (Uni3D encoder cosine similarity).
    - Text–MultiView (VQAScore across rendered views).
    - Text–Attribute (Grounded-SAM detection confidence/probability scores).
  - Intrinsic 3D Visual Quality:
    - Multi-view IQ (LIQE), Contour Clarity (CC), Texture Richness (TR).
    - 3D Alignment (Uni3D), Shape Completeness, Geometric Validity (GPT-4 rubric).
    - Multi-view consistency (SSIM or LPIPS between rerendered and target views).
- Tracking:
  - 3D Average Point Distance (APD₃D), Occlusion Accuracy (OA), 3D Average Jaccard (AJ₃D): All defined with respect to ground-truth visibility and tolerance thresholds, supporting both global and local scale normalization (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025, Zhang et al., 27 Mar 2025, Koppula et al., 2024, Su et al., 2024).
These metrics are computed via standardized scripts that ingest prediction results and GT, ensuring cross-method comparability. For generative models, specially trained CLIP-based (3DGen-Score) and MLLM-based (3DGen-Eval) evaluators are available, providing automated human preference alignment and explanatory feedback (Zhang et al., 27 Mar 2025).
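As a rough illustration of what such scripts compute, the sketch below implements a few of the listed metrics with NumPy. It follows the standard textbook definitions and is not the framework's code: the function names are hypothetical, the Chamfer distance uses unsquared nearest-neighbor distances (conventions vary), and the 1.25 inlier threshold is a common default assumed here rather than a value taken from the benchmark.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and true joints, shape (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Align pred to gt with the optimal scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)          # SVD of the 3x3 cross-covariance
    # Kabsch step: flip the last axis if needed so R is a proper rotation.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    return scale * P @ R.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())


def abs_rel(pred_depth, gt_depth):
    """Absolute relative depth error, assuming strictly positive gt depths."""
    return float(np.mean(np.abs(pred_depth - gt_depth) / gt_depth))


def inlier_ratio(pred_depth, gt_depth, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below `threshold`."""
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    return float(np.mean(ratio < threshold))


# A prediction that is a scaled, rotated, shifted copy of the ground truth.
rng = np.random.default_rng(0)
gt_pose = rng.normal(size=(17, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred_pose = 2.0 * gt_pose @ Rz.T + np.array([1.0, 2.0, 3.0])
raw_error = mpjpe(pred_pose, gt_pose)
aligned_error = pa_mpjpe(pred_pose, gt_pose)
```

In the usage example the raw error is large while the aligned error is numerically zero, which shows why the two numbers answer different questions: MPJPE penalizes global scale and orientation mistakes, whereas PA-MPJPE isolates errors in the articulated shape.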
4. Cross-Dataset Evaluation and Experimental Protocol
A central feature of 3D-CustomBench is exhaustive cross-dataset evaluation. The core protocol is as follows:
- For each method M and training dataset $D_{\text{train}}$, M is either loaded from a checkpoint trained on $D_{\text{train}}$ or retrained from scratch.
- Each test dataset $D_{\text{test}}$ (typically from a non-overlapping domain) is used for inference, following the same preprocessing and normalization routines used in training.
- Metrics are computed for every $D_{\text{train}}$ versus $D_{\text{test}}$ combination, covering both in-domain and OOD splits; all runs strictly adhere to config-specified parameters (e.g., normalization flag, skeleton definition, coordinate system).
- Results are reported both globally and at the per-joint, per-sequence, or per-class level, as required in the domain.
This protocol is essential for revealing model robustness, guiding architectural improvements, and delineating failure modes that may be hidden in single-dataset or non-standardized pipelines (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025).
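The protocol amounts to filling a train-by-test results matrix. A minimal sketch of that loop follows, under heavy simplification: the two toy datasets, the constant-offset "model", and all function names are hypothetical stand-ins for real checkpoints and dataset wrappers.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint Euclidean error over arrays shaped (N, J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def make_toy_dataset(offset, seed):
    """Toy domain in which targets equal inputs plus a domain-specific offset."""
    rng = np.random.default_rng(seed)
    inputs = rng.normal(size=(8, 17, 3))
    return {"inputs": inputs, "targets": inputs + offset}


def train_offset_model(dataset):
    """'Train' by estimating the mean input-to-target offset of one domain."""
    offset = (dataset["targets"] - dataset["inputs"]).mean(axis=(0, 1))
    return lambda x: x + offset


def run_cross_dataset(datasets, train_fn, metric_fn):
    """Evaluate every train/test pair and tag it as in-domain or OOD."""
    results = {}
    for train_name, train_set in datasets.items():
        model = train_fn(train_set)          # or load a checkpoint here
        for test_name, test_set in datasets.items():
            preds = model(test_set["inputs"])
            results[(train_name, test_name)] = {
                "score": metric_fn(preds, test_set["targets"]),
                "split": "in-domain" if train_name == test_name else "OOD",
            }
    return results


datasets = {"A": make_toy_dataset(1.0, seed=0), "B": make_toy_dataset(3.0, seed=1)}
results = run_cross_dataset(datasets, train_offset_model, mpjpe)
for (train_name, test_name), entry in sorted(results.items()):
    print(f"{train_name} -> {test_name}: {entry['score']:.3f} ({entry['split']})")
```

The diagonal entries of the resulting matrix are the in-domain scores and the off-diagonal entries are the OOD scores; in this toy setup the diagonal is zero and the off-diagonal error reflects the offset mismatch between the two domains.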
5. Empirical Insights and Best Practices
Extensive experimental findings from deployment on representative tasks indicate:
- Model Generalization: Transformer-based models (e.g., PoseFormerV1/V2) achieve low MPJPE in constrained settings but degrade on OOD domains (errors above 160 mm on 3DPW), while models with explicit viewpoint modeling maintain higher cross-dataset robustness (Manzur et al., 16 May 2025).
- Preprocessing Impact: Z-score normalization with test-set statistics can reduce cross-domain error (e.g., SEM-GCN: 262→115 mm, ~56% improvement); screen-space normalization alone is insufficient for generalization (Manzur et al., 16 May 2025).
- Generative Evaluation: CLIP-based and MLLM-based evaluators (3DGen-Score, 3DGen-Eval) show stronger correlation with human perceptual judgments than conventional metrics, especially for prompt–asset alignment and geometric fidelity (Zhang et al., 27 Mar 2025).
- Efficiency: For end-to-end 3D geometric foundation models (GFMs), online registration-based architectures scale better to large view counts; global alignment methods incur significant resource overhead (Cong et al., 2 Jun 2025).
- Metric Selection: Reporting both absolute and aligned errors (MPJPE and PA-MPJPE), as well as downstream task metrics (e.g., Chamfer Distance for shapes, MV-Con for consistency), yields deeper insights into model limitations.
Adopting a config-driven, modular pipeline with uniform conventions is identified as crucial for extending the benchmark to new modalities (e.g., 3D object pose, hand pose, mesh-based reconstruction, text-to-3D, and point tracking) (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025, Su et al., 2024).
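The preprocessing finding above turns on which statistics are used for z-score normalization. The sketch below illustrates only that mechanism on synthetic data; it does not reproduce the reported SEM-GCN numbers, and the function names and the two synthetic domains are assumptions.

```python
import numpy as np


def fit_zscore(x):
    """Per-coordinate mean and standard deviation over all samples and joints."""
    flat = x.reshape(-1, x.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0) + 1e-8


def apply_zscore(x, mean, std):
    """Standardize x with the given statistics."""
    return (x - mean) / std


rng = np.random.default_rng(0)
# Synthetic 2D keypoints: the target domain is shifted and stretched
# relative to the source domain the model was trained on.
source = rng.normal(loc=0.0, scale=1.0, size=(1000, 17, 2))
target = rng.normal(loc=5.0, scale=3.0, size=(1000, 17, 2))

source_stats = fit_zscore(source)
target_stats = fit_zscore(target)

with_source_stats = apply_zscore(target, *source_stats)
with_target_stats = apply_zscore(target, *target_stats)

print("target inputs, source statistics: mean", round(float(with_source_stats.mean()), 2))
print("target inputs, target statistics: mean", round(float(with_target_stats.mean()), 2))
```

With source statistics the target inputs remain far from the zero-mean, unit-variance range the model saw during training, whereas target statistics restore that range. Using test-set statistics presupposes access to the unlabeled test inputs, which is worth stating explicitly when results are reported.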
6. Customization and Extensibility
To deploy or extend 3D-CustomBench, the following practices are recommended:
- Data Integration: Develop or curate datasets following the framework’s required structure; for domain-specific tasks, adapt prompt templates (text-to-3D) or category taxonomies via LLMs (e.g., biomedical ontologies).
- Metric Extension: Implement new metrics in accordance with the evaluation engine’s API; for novel generative tasks, include domain-specialized attribute extraction, geometric alignment functions, or perceptual similarity scores as appropriate.
- Toolkit Use: Leverage built-in containerized environments and automatic split/caching for reproducibility; Python APIs and command-line interfaces support large-scale experiment automation.
- Version Control and Validation: Maintain all metadata, raw and processed data in version-controlled storage; validate key annotation and metric steps with expert reviews or human-in-the-loop audits to ensure alignment with domain expectations (Manzur et al., 16 May 2025, Cong et al., 2 Jun 2025, Su et al., 2024).
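One way to picture the metric extension step is a registry that the evaluation engine consults by name. The sketch below is an assumption about how such an API could look, not the framework's actual interface; the registry, the decorator, and the example 3D PCK metric with a 150 mm threshold are all illustrative.

```python
import numpy as np

# Hypothetical registry mapping metric names to callables (pred, gt) -> float.
METRIC_REGISTRY = {}


def register_metric(name):
    """Decorator that makes a metric available to `evaluate` under `name`."""
    def decorator(fn):
        METRIC_REGISTRY[name] = fn
        return fn
    return decorator


@register_metric("mpjpe")
def mpjpe(pred, gt):
    """Mean per-joint position error for arrays shaped (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


@register_metric("pck3d")
def pck3d(pred, gt, threshold=150.0):
    """Newly added metric: share of joints within `threshold` (here mm) of gt."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return float((errors < threshold).mean())


def evaluate(pred, gt, metric_names):
    """Compute every requested metric, as listed in a config."""
    return {name: METRIC_REGISTRY[name](pred, gt) for name in metric_names}


gt = np.zeros((2, 3))
pred = np.array([[100.0, 0.0, 0.0], [200.0, 0.0, 0.0]])
scores = evaluate(pred, gt, ["mpjpe", "pck3d"])
```

Because metrics are looked up by name, a new score can be added without touching the runner; only the list of metric names in the config changes.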
A plausible implication is that with systematic use of such frameworks, the field can move beyond narrow, dataset-specific benchmarks toward community-adopted, extensible, and transparent standards for 3D computer vision and generative model evaluation. The 3D-CustomBench protocol, by providing a generalizable blueprint, enables fair, comprehensive, and interpretable comparison across a rapidly expanding array of 3D applications and methodologies.