Pixel-160K Dataset
- Pixel-160K is a collection of large-scale, pixel-annotated datasets providing ground truth for tasks in robotics, GAN-based segmentation, and compressive video sensing.
- The datasets feature over 160K clips and images with multimodal annotations including RGB frames, segmentation masks, bounding boxes, and synthetic labels for rigorous benchmarking.
- Automated pipelines using tools like SAM, Grounding DINO, and GANs ensure scalable annotation, robust class diversity, and reliable performance metrics across various applications.
Pixel-160K refers to distinct large-scale, pixel-annotated datasets in computer vision, robotics, and compressive video sensing, each providing pixel-level ground truth for benchmarking spatial reasoning, segmentation, manipulation, or reconstruction algorithms. Instances of "Pixel-160K" originate in three domains: compressive sensing video (CS-video) (Narayanan et al., 2019), vision-language-action robotic manipulation (Liang et al., 3 Nov 2025), and GAN-synthesized ImageNet-scale segmentation (Li et al., 2022). Each dataset is characterized by its acquisition model, annotation pipeline, scale, class diversity, and evaluation benchmarks.
1. Dataset Composition and Statistics
Pixel-160K encompasses multiple high-volume, pixel-level benchmarks. In robotics, Pixel-160K contains monocular RGB episodes from Fractal and Bridge v2 platforms: 160,000 clips, 6.5 million image–text–action triplets, and average episode length ≈ 40 frames (input: 224×224 px, single target object mask per frame). Data splits are stratified (80% train, 10% val, 10% test), preserving original task and platform distributions (Liang et al., 3 Nov 2025). In GAN-based segmentation, Pixel-160K consists of 160,000 synthetically generated and labeled images at 512×512 px across 1,000 ImageNet synsets, plus 8,000 real annotated test images. Class balance is enforced (>100 synthetic samples per class), with masks storing per-pixel integer labels for foreground/background or multi-class segmentation (Li et al., 2022). In compressive video, Pixel-160K includes ≈162,000 raw monochrome frames (375 clips, single person or car per clip), temporally compressed at 13× (yielding ≈12,500 coded measurements) with per-pixel coded masks, bounding boxes, and ground-truth raw stacks (Narayanan et al., 2019).
| Domain | Clips / Images | Classes/Object Types | Annotation Modality |
|---|---|---|---|
| Robotics | 160,000 | Single object/scene | Pixel mask, multimodal prompts |
| GAN-ImageNet | 160,000 train<br>8,000 test | 1,000 synsets | Pixel mask (FG/BG, multi-class) |
| CS-video | 375 clips<br>~162K frames | PERSON, CAR (single per clip) | Bounding box, coded mask, raw frames |
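A short Python check of the quantities implied by the statistics above (split sizes under the 80/10/10 stratification, frame and measurement counts); the helper name is illustrative only, not part of any released tooling.

```python
# Illustrative check of the dataset statistics quoted above.
# All figures come from the text; the function name is ad hoc.

def stratified_split_sizes(n_clips: int, fractions=(0.8, 0.1, 0.1)):
    """Return (train, val, test) sizes for an 80/10/10 stratified split."""
    train = int(n_clips * fractions[0])
    val = int(n_clips * fractions[1])
    test = n_clips - train - val
    return train, val, test

# Robotics subset: 160,000 clips, ~40 frames per episode on average.
train, val, test = stratified_split_sizes(160_000)
print(train, val, test)      # 128000 16000 16000
print(160_000 * 40)          # ~6.4M frames, consistent with ~6.5M image-text-action triplets

# CS-video subset: ~162,000 raw frames temporally compressed 13x.
print(162_000 // 13)         # ~12,461 coded measurements (~12.5K)
```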
2. Annotation Frameworks and Pipelines
Pixel-160K adopts automated and semi-automated annotation strategies. For robotics, a two-stage pipeline is used: Stage 1 applies SAM 2 for gripper-aware region proposals on keyframes, followed by Stage 2 where target-object phrases are parsed from instructions using an LLM and object masks are derived using Grounding DINO and SAM. Multimodal prompts (point, line, box) are generated from each mask; filtering (confidence threshold τ=0.5) removes ~19.2% failures (Liang et al., 3 Nov 2025). GAN-based Pixel-160K utilizes a feature-interpreter segmentation head trained on 5,000 manually labeled synthetic images, then extrapolates pixel-wise labels to 160,000 BigGAN/VQGAN samples using uncertainty filtering and ensemble heads (Li et al., 2022). Compressive video Pixel-160K leverages YOLOv3/VATIC for bounding-box pseudo-labeling and stores per-frame coded masks generated under a bump-time constraint (Tb=3), synchronized with ground-truth stacks for every compressed frame (Narayanan et al., 2019).
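A minimal sketch of the Stage-2 confidence filtering and prompt derivation described above, assuming the detector output has already been reduced to (frame, mask, confidence) tuples; the centroid/box prompt construction is a common convention assumed here, not confirmed by the source, and no SAM or Grounding DINO API is invoked.

```python
# Sketch of Stage-2 filtering and prompt derivation (assumptions noted above).
import numpy as np

CONF_THRESHOLD = 0.5  # tau = 0.5, as reported in the pipeline description


def prompts_from_mask(mask: np.ndarray) -> dict:
    """Derive point and box prompts from a binary mask of shape (H, W)."""
    ys, xs = np.nonzero(mask)
    point = (int(xs.mean()), int(ys.mean()))                            # centroid point prompt
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))  # xyxy box prompt
    return {"point": point, "box": box}


def filter_annotations(samples):
    """Keep only mask proposals whose detector confidence clears tau."""
    kept = []
    for frame, mask, confidence in samples:
        if confidence >= CONF_THRESHOLD and mask.any():
            kept.append({"frame": frame, "mask": mask, **prompts_from_mask(mask)})
    return kept  # ~19.2% of proposals are reported to be dropped at this step
```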
3. Acquisition and Forward Measurement Models
In CS-video, sensor measurements follow the pixel-wise coded exposure model

$$y = \Phi x + n,$$

where $x \in \mathbb{R}^{NT}$ is the vectorized raw video stack, $y \in \mathbb{R}^{N}$ is the compressed measurement, $\Phi \in \{0,1\}^{N \times NT}$ is a block-diagonal binary mask matrix per pixel and per frame, and $n$ denotes sensor noise (Gaussian/Poisson) (Narayanan et al., 2019). Masks are generated with temporal bump-time constraints and randomized per 13-frame group, yielding a temporal compression ratio of 13:1. GAN-based Pixel-160K samples images via BigGAN with truncation $\sigma = 0.9$, filters with a ResNet-50 classifier (top 10% confidence), and applies segmentation heads to internal generator features for per-pixel label synthesis (Li et al., 2022). No spatial subsampling is performed in CS-video; synthetic datasets are spatially uniform (GAN: 512×512, Robotics: 224×224).
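As a concrete illustration of this forward model, the following NumPy sketch generates per-pixel coded masks for one 13-frame group under an assumed bump of $T_b = 3$ consecutive "on" frames per pixel and forms the single coded measurement; mask statistics beyond what the text states are assumptions.

```python
# Minimal NumPy simulation of the pixel-wise coded exposure model y = Phi x + n
# for one 13-frame group with a bump-time constraint of 3 consecutive frames.
import numpy as np


def coded_exposure_masks(h, w, t=13, bump=3, rng=None):
    """Binary masks (t, h, w): each pixel is 'on' for one random bump of length `bump`."""
    rng = np.random.default_rng(rng)
    starts = rng.integers(0, t - bump + 1, size=(h, w))     # random bump start per pixel
    frames = np.arange(t)[:, None, None]
    return ((frames >= starts) & (frames < starts + bump)).astype(np.float32)


def forward_measurement(x, masks, noise_std=0.01, rng=None):
    """Compress a (t, h, w) raw stack into a single (h, w) coded measurement."""
    rng = np.random.default_rng(rng)
    y = (masks * x).sum(axis=0)                              # per-pixel temporal integration
    return y + noise_std * rng.standard_normal(y.shape)      # additive Gaussian sensor noise


x = np.random.rand(13, 64, 64).astype(np.float32)   # toy raw video group
phi = coded_exposure_masks(64, 64)
y = forward_measurement(x, phi)                      # 13x temporal compression
```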
4. Evaluation Benchmarks and Downstream Impact
Pixel-160K datasets serve as benchmarks for pixel-level reasoning, segmentation, and control, with standardized metrics:
- Robotics: Annotation quality is measured via Intersection-over-Union (IoU; mean 0.87±0.04) and boundary F₁ score (mean 0.82±0.06), showing no significant quality gap to RoboMask on held-out samples (Liang et al., 3 Nov 2025). Training PixelVLA with Pixel-160K yields up to +10.1% manipulation success over standard OpenVLA.
- GAN-ImageNet: Segmentation models trained on Pixel-160K achieve state-of-the-art in-domain mIoU (71.1% FG/BG, 68.1% MC-16 classes), and transfer gains for PASCAL-VOC, MS-COCO, Cityscapes, and chest X-ray (AP/mIoU lift of +1–8 pp; semi-supervised gains up to +20 pp with limited real labels) (Li et al., 2022).
- CS-video: Reconstruction algorithms can be benchmarked against synchronized ground truth using PSNR/SSIM. Standard CS recovery guarantees apply; at the dataset's 13:1 temporal compression ratio, reconstruction empirically yields mid-30 dB PSNR for natural scenes (Narayanan et al., 2019). A minimal sketch of these metrics follows this list.
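The metric sketch referenced above: reference implementations of IoU (mask quality) and PSNR (reconstruction quality); SSIM and boundary-$F_1$ need additional machinery and are omitted.

```python
# Minimal reference implementations of the metrics quoted above.
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0


def psnr(x_hat: np.ndarray, x: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for signals scaled to [0, peak]."""
    mse = np.mean((x_hat - x) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```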
5. Storage, Metadata, and Release Protocols
Pixel-160K datasets maintain rigorous storage schemas:
- Robotics: NPZ archives per episode contain RGB images, segmentation masks, multimodal prompts, and JSON files with per-frame action/instruction/context (Liang et al., 3 Nov 2025); a loading sketch follows this list.
- GAN-ImageNet: Directory structure follows {split}/{class}/{img}, images and masks as PNGs (3-ch or 1-ch), boundary polygons in optional JSON, splits released for reproducibility; real test images reserved for final evaluation (Li et al., 2022).
- CS-video: Raw and coded frames, mask stacks (float32, 13×H×W), bounding boxes, and synchronized ground-truth raw clips are bundled in NPZ or PNG format; annotation classes and bounding boxes are stored alongside coded measurements (Narayanan et al., 2019).
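The loading sketch referenced in the robotics item: a hypothetical loader for one episode, assuming the schema above; the archive key names are placeholders, not the released field names.

```python
# Hypothetical loader for one robotics episode archive (key names are assumptions).
import json
import numpy as np


def load_episode(npz_path: str, json_path: str):
    with np.load(npz_path, allow_pickle=True) as data:
        rgb = data["rgb"]          # (T, 224, 224, 3) uint8 frames        (assumed key)
        masks = data["masks"]      # (T, 224, 224) binary target masks    (assumed key)
        prompts = data["prompts"]  # per-frame point/line/box prompts     (assumed key)
    with open(json_path) as f:
        meta = json.load(f)        # per-frame action / instruction / context
    return rgb, masks, prompts, meta
```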
6. Usage Guidelines and Recommended Training Protocols
Robotics usage for VLA training follows a two-stage LoRA regimen: continuous-action decoders are first trained on unlabeled data, then LoRA adaptation is performed with Pixel-160K at batch size 32 for 200K steps at 224×224 resolution, with prompt/pixel-aware embeddings and action chunking of 8. GAN-ImageNet segmentation recommends DeepLabv3+ResNet-50 trained with SGD (lr=0.01, momentum=0.9, 200 epochs, batch=64) under poly-lr decay, or DenseCL+segmentation for contrastive pretraining (Adam, batch=256); a configuration sketch follows this paragraph. For CS-video, any standard CS algorithm (ISTA/FISTA/ADMM, plug-and-play priors, dictionary pursuit) can be applied to the (y, Φ) measurements, enabling compressed-domain benchmarking under the RIP and sparsity recovery framework. The GAN-based dataset documents class-conditional sampling and filtering parameters (truncation σ=0.9, nucleus p=0.92, rejection rate=0.9, uncertainty-dropping top 10%) for reproducibility.
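The configuration sketch referenced above, using torchvision's `deeplabv3_resnet50` as a stand-in for the recommended DeepLabv3+ResNet-50 setup; the poly-decay exponent (0.9) and the ignore index are common defaults and assumptions here, and data loading is omitted.

```python
# Sketch of the recommended GAN-ImageNet segmentation training configuration.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 2        # FG/BG setting; use 16 for the MC-16 multi-class variant
EPOCHS, BATCH = 200, 64

model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Polynomial ("poly") decay: lr_t = lr_0 * (1 - t / T)^0.9 (exponent assumed).
total_steps = EPOCHS * (160_000 // BATCH)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: (1 - step / total_steps) ** 0.9
)

criterion = torch.nn.CrossEntropyLoss(ignore_index=255)  # skip unlabeled pixels


def train_step(images, targets):
    """One step on (B, 3, 512, 512) images and (B, 512, 512) integer-label masks."""
    optimizer.zero_grad()
    logits = model(images)["out"]          # torchvision returns a dict with "out"
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```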
7. Relation to Contemporary Datasets and Benchmarks
Pixel-160K complements standard visuo-motor, segmentation, and compressed video datasets by providing order-of-magnitude larger pixel-level ground truth, covering diverse tasks (robotics manipulation via multimodal prompts, large-scale semantic segmentation, compressive video recovery). Its robotic annotation pipeline (SAM2, Grounding DINO, LLM prompt parsing) ensures scalable pixel mask generation from real-world data; its GAN synthetic pipeline enables expansion of annotated corpora with minimal manual intervention. Benchmark results validate both in-domain and strong cross-domain transfer performance (Liang et al., 3 Nov 2025, Li et al., 2022). In compressive sensing, the dataset enables controlled studies on temporal compression, forward modeling, and sparsity-driven inference with complete ground-truth stacks (Narayanan et al., 2019).
8. Canonical Equations and Key Formulas
Relevant mathematical expressions in Pixel-160K deployments include:
- CS forward model: $y = \Phi x + n$
- Block-diagonal coded mask: $\Phi = \operatorname{diag}(\phi_1, \ldots, \phi_N)$, where each pixel's temporal code $\phi_i \in \{0,1\}^{1 \times T}$ satisfies the bump-time constraint
- Basis pursuit denoising: $\min_{\alpha} \|\alpha\|_1$ s.t. $\|y - \Phi \Psi \alpha\|_2 \le \epsilon$ (a generic recovery sketch follows this list)
- RIP property: $(1 - \delta_k)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_k)\|x\|_2^2$ for all $k$-sparse $x$
- GAN segmentation: per-pixel label prediction from internal generator features, $\hat{y}_p = \arg\max_c\, h_\theta(F(z))_{p,c}$
- Robotics mask quality: $\mathrm{IoU}(\hat{M}, M) = \dfrac{|\hat{M} \cap M|}{|\hat{M} \cup M|}$, boundary $F_1 = \dfrac{2PR}{P + R}$
- Manipulation loss: supervised action-prediction objective of the form $\mathcal{L} = \sum_t \ell(\hat{a}_t, a_t)$ over predicted and demonstrated actions
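The recovery sketch referenced above: a generic ISTA loop for the Lagrangian form of basis pursuit denoising, $\min_\alpha \tfrac{1}{2}\|y - A\alpha\|_2^2 + \lambda\|\alpha\|_1$ with $A = \Phi\Psi$; this is a textbook reference implementation, not the recovery method used in (Narayanan et al., 2019).

```python
# Generic ISTA for sparse recovery under the CS forward model above.
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, y, lam=0.1, iters=200):
    """Iterative shrinkage-thresholding for sparse coefficients alpha."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the data-term gradient
    a = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ a - y)           # gradient of 0.5 * ||y - A a||^2
        a = soft_threshold(a - grad / L, lam / L)
    return a


# Toy example: recover a 5-sparse vector from 60 noisy random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 200)) / np.sqrt(60)
a_true = np.zeros(200)
a_true[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
y = A @ a_true + 0.01 * rng.standard_normal(60)
a_hat = ista(A, y, lam=0.05)
```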
This corpus of datasets collectively advances the benchmarking and evaluation of pixel-level reasoning in large-scale computer vision, robotics, and video sensing research.