Pixel-160K Dataset
- Pixel-160K is a collection of large-scale, pixel-annotated datasets providing ground truth for tasks in robotics, GAN-based segmentation, and compressive video sensing.
- The datasets feature over 160K clips and images with multimodal annotations including RGB frames, segmentation masks, bounding boxes, and synthetic labels for rigorous benchmarking.
- Automated pipelines using tools like SAM, Grounding DINO, and GANs ensure scalable annotation, robust class diversity, and reliable performance metrics across various applications.
Pixel-160K refers to distinct large-scale, pixel-annotated datasets in computer vision, robotics, and compressive video sensing, each providing pixel-level ground truth for benchmarking spatial reasoning, segmentation, manipulation, or reconstruction algorithms. Instances of "Pixel-160K" originate in three domains: compressive sensing video (CS-video) (Narayanan et al., 2019), vision-language-action robotic manipulation (Liang et al., 3 Nov 2025), and GAN-synthesized ImageNet-scale segmentation (Li et al., 2022). Each dataset is characterized by its acquisition model, annotation pipeline, scale, class diversity, and evaluation benchmarks.
1. Dataset Composition and Statistics
Pixel-160K encompasses multiple high-volume, pixel-level benchmarks. In robotics, Pixel-160K contains monocular RGB episodes from Fractal and Bridge v2 platforms: 160,000 clips, 6.5 million image–text–action triplets, and average episode length ≈ 40 frames (input: 224×224 px, single target object mask per frame). Data splits are stratified (80% train, 10% val, 10% test), preserving original task and platform distributions (Liang et al., 3 Nov 2025). In GAN-based segmentation, Pixel-160K consists of 160,000 synthetically generated and labeled images at 512×512 px across 1,000 ImageNet synsets, plus 8,000 real annotated test images. Class balance is enforced (>100 synthetic samples per class), with masks storing per-pixel integer labels for foreground/background or multi-class segmentation (Li et al., 2022). In compressive video, Pixel-160K includes ≈162,000 raw monochrome frames (375 clips, single person or car per clip), temporally compressed at 13× (yielding ≈12,500 coded measurements) with per-pixel coded masks, bounding boxes, and ground-truth raw stacks (Narayanan et al., 2019).
| Domain | Clips / Images | Classes/Object Types | Annotation Modality |
|---|---|---|---|
| Robotics | 160,000 | Single object/scene | Pixel mask, multimodal prompts |
| GAN-ImageNet | 160,000 train<br>8,000 test | 1,000 synsets | Pixel mask (FG/BG, multi-class) |
| CS-video | 375 clips<br>~162K frames | PERSON, CAR (single per clip) | Bounding box, coded mask, raw frames |
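A short Python check of the quantities implied by the statistics above (split sizes under the 80/10/10 stratification, frame and measurement counts); the helper name is illustrative only, not part of any released tooling.

```python
# Illustrative check of the dataset statistics quoted above.
# All figures come from the text; the function name is ad hoc.

def stratified_split_sizes(n_clips: int, fractions=(0.8, 0.1, 0.1)):
    """Return (train, val, test) sizes for an 80/10/10 stratified split."""
    train = int(n_clips * fractions[0])
    val = int(n_clips * fractions[1])
    test = n_clips - train - val
    return train, val, test

# Robotics subset: 160,000 clips, ~40 frames per episode on average.
train, val, test = stratified_split_sizes(160_000)
print(train, val, test)      # 128000 16000 16000
print(160_000 * 40)          # ~6.4M frames, consistent with ~6.5M image-text-action triplets

# CS-video subset: ~162,000 raw frames temporally compressed 13x.
print(162_000 // 13)         # ~12,461 coded measurements (~12.5K)
```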
2. Annotation Frameworks and Pipelines
Pixel-160K adopts automated and semi-automated annotation strategies. For robotics, a two-stage pipeline is used: Stage 1 applies SAM 2 for gripper-aware region proposals on keyframes, followed by Stage 2 where target-object phrases are parsed from instructions using an LLM and object masks are derived using Grounding DINO and SAM. Multimodal prompts (point, line, box) are generated from each mask; filtering (confidence threshold τ=0.5) removes ~19.2% failures (Liang et al., 3 Nov 2025). GAN-based Pixel-160K utilizes a feature-interpreter segmentation head trained on 5,000 manually labeled synthetic images, then extrapolates pixel-wise labels to 160,000 BigGAN/VQGAN samples using uncertainty filtering and ensemble heads (Li et al., 2022). Compressive video Pixel-160K leverages YOLOv3/VATIC for bounding-box pseudo-labeling and stores per-frame coded masks generated under a bump-time constraint (Tb=3), synchronized with ground-truth stacks for every compressed frame (Narayanan et al., 2019).
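A minimal sketch of the Stage-2 confidence filtering and prompt derivation described above, assuming the detector output has already been reduced to (frame, mask, confidence) tuples; the centroid/box prompt construction is a common convention assumed here, not confirmed by the source, and no SAM or Grounding DINO API is invoked.

```python
# Sketch of Stage-2 filtering and prompt derivation (assumptions noted above).
import numpy as np

CONF_THRESHOLD = 0.5  # tau = 0.5, as reported in the pipeline description


def prompts_from_mask(mask: np.ndarray) -> dict:
    """Derive point and box prompts from a binary mask of shape (H, W)."""
    ys, xs = np.nonzero(mask)
    point = (int(xs.mean()), int(ys.mean()))                            # centroid point prompt
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))  # xyxy box prompt
    return {"point": point, "box": box}


def filter_annotations(samples):
    """Keep only mask proposals whose detector confidence clears tau."""
    kept = []
    for frame, mask, confidence in samples:
        if confidence >= CONF_THRESHOLD and mask.any():
            kept.append({"frame": frame, "mask": mask, **prompts_from_mask(mask)})
    return kept  # ~19.2% of proposals are reported to be dropped at this step
```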
3. Acquisition and Forward Measurement Models
In CS-video, sensor measurements follow the pixel-wise coded exposure model

$$y = \Phi x + n,$$

where $x \in \mathbb{R}^{NT}$ is the vectorized raw video stack, $y \in \mathbb{R}^{N}$ is the compressed measurement, $\Phi \in \{0,1\}^{N \times NT}$ is a block-diagonal binary mask matrix per pixel and per frame, and $n$ denotes sensor noise (Gaussian/Poisson) (Narayanan et al., 2019). Masks are generated with temporal bump-time constraints and randomized per 13-frame group, yielding a temporal compression ratio of 13:1. GAN-based Pixel-160K samples images via BigGAN with truncation $\sigma = 0.9$, filters with a ResNet-50 classifier (top 10% confidence), and applies segmentation heads to internal generator features for per-pixel label synthesis (Li et al., 2022). No spatial subsampling is performed in CS-video; synthetic datasets are spatially uniform (GAN: 512×512, Robotics: 224×224).
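As a concrete illustration of this forward model, the following NumPy sketch generates per-pixel coded masks for one 13-frame group under an assumed bump of $T_b = 3$ consecutive "on" frames per pixel and forms the single coded measurement; mask statistics beyond what the text states are assumptions.

```python
# Minimal NumPy simulation of the pixel-wise coded exposure model y = Phi x + n
# for one 13-frame group with a bump-time constraint of 3 consecutive frames.
import numpy as np


def coded_exposure_masks(h, w, t=13, bump=3, rng=None):
    """Binary masks (t, h, w): each pixel is 'on' for one random bump of length `bump`."""
    rng = np.random.default_rng(rng)
    starts = rng.integers(0, t - bump + 1, size=(h, w))     # random bump start per pixel
    frames = np.arange(t)[:, None, None]
    return ((frames >= starts) & (frames < starts + bump)).astype(np.float32)


def forward_measurement(x, masks, noise_std=0.01, rng=None):
    """Compress a (t, h, w) raw stack into a single (h, w) coded measurement."""
    rng = np.random.default_rng(rng)
    y = (masks * x).sum(axis=0)                              # per-pixel temporal integration
    return y + noise_std * rng.standard_normal(y.shape)      # additive Gaussian sensor noise


x = np.random.rand(13, 64, 64).astype(np.float32)   # toy raw video group
phi = coded_exposure_masks(64, 64)
y = forward_measurement(x, phi)                      # 13x temporal compression
```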
4. Evaluation Benchmarks and Downstream Impact
Pixel-160K datasets serve as benchmarks for pixel-level reasoning, segmentation, and control, with standardized metrics:
- Robotics: Annotation quality is measured via Intersection-over-Union (IoU; mean 0.87±0.04) and boundary F₁ score (mean 0.82±0.06), showing no significant quality gap to RoboMask on held-out samples (Liang et al., 3 Nov 2025). Training PixelVLA with Pixel-160K yields up to +10.1% manipulation success over standard OpenVLA.
- GAN-ImageNet: Segmentation models trained on Pixel-160K achieve state-of-the-art in-domain mIoU (71.1% FG/BG, 68.1% MC-16 classes), and transfer gains for PASCAL-VOC, MS-COCO, Cityscapes, and chest X-ray (AP/mIoU lift of +1–8 pp; semi-supervised gains up to +20 pp with limited real labels) (Li et al., 2022).
- CS-video: Reconstruction algorithms can be benchmarked against synchronized ground truth using PSNR/SSIM. Standard CS recovery guarantees apply; at the dataset's 13:1 temporal compression ratio, reconstruction empirically yields mid-30 dB PSNR for natural scenes (Narayanan et al., 2019). A minimal sketch of these metrics follows this list.
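The metric sketch referenced above: reference implementations of IoU (mask quality) and PSNR (reconstruction quality); SSIM and boundary-$F_1$ need additional machinery and are omitted.

```python
# Minimal reference implementations of the metrics quoted above.
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0


def psnr(x_hat: np.ndarray, x: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for signals scaled to [0, peak]."""
    mse = np.mean((x_hat - x) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```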
5. Storage, Metadata, and Release Protocols
Pixel-160K datasets maintain rigorous storage schemas:
- Robotics: NPZ archives per episode contain RGB images, segmentation masks, multimodal prompts, and JSON files with per-frame action/instruction/context (Liang et al., 3 Nov 2025); a loading sketch follows this list.
- GAN-ImageNet: Directory structure follows {split}/{class}/{img}, images and masks as PNGs (3-ch or 1-ch), boundary polygons in optional JSON, splits released for reproducibility; real test images reserved for final evaluation (Li et al., 2022).
- CS-video: Raw and coded frames, mask stacks (float32, 13×H×W), bounding boxes, and synchronized ground-truth raw clips are bundled in NPZ or PNG format; annotation classes and bounding boxes are stored alongside coded measurements (Narayanan et al., 2019).
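The loading sketch referenced in the robotics item: a hypothetical loader for one episode, assuming the schema above; the archive key names are placeholders, not the released field names.

```python
# Hypothetical loader for one robotics episode archive (key names are assumptions).
import json
import numpy as np


def load_episode(npz_path: str, json_path: str):
    with np.load(npz_path, allow_pickle=True) as data:
        rgb = data["rgb"]          # (T, 224, 224, 3) uint8 frames        (assumed key)
        masks = data["masks"]      # (T, 224, 224) binary target masks    (assumed key)
        prompts = data["prompts"]  # per-frame point/line/box prompts     (assumed key)
    with open(json_path) as f:
        meta = json.load(f)        # per-frame action / instruction / context
    return rgb, masks, prompts, meta
```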
6. Usage Guidelines and Recommended Training Protocols
Robotics usage for VLA training follows a two-stage LoRA regimen: continuous-action decoders are first trained on unlabeled data, then LoRA adaptation is performed with Pixel-160K at batch size 32 for 200K steps at 224×224 resolution, with prompt/pixel-aware embeddings and action chunking of 8. GAN-ImageNet segmentation recommends DeepLabv3+ResNet-50 trained with SGD (lr=0.01, momentum=0.9, 200 epochs, batch=64) under poly-lr decay, or DenseCL+segmentation for contrastive pretraining (Adam, batch=256); a configuration sketch follows this paragraph. For CS-video, any standard CS algorithm (ISTA/FISTA/ADMM, plug-and-play priors, dictionary pursuit) can be applied to the (y, Φ) measurements, enabling compressed-domain benchmarking under the RIP and sparsity recovery framework. The GAN-based dataset documents class-conditional sampling and filtering parameters (truncation σ=0.9, nucleus p=0.92, rejection rate=0.9, uncertainty-dropping top 10%) for reproducibility.
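The configuration sketch referenced above, using torchvision's `deeplabv3_resnet50` as a stand-in for the recommended DeepLabv3+ResNet-50 setup; the poly-decay exponent (0.9) and the ignore index are common defaults and assumptions here, and data loading is omitted.

```python
# Sketch of the recommended GAN-ImageNet segmentation training configuration.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 2        # FG/BG setting; use 16 for the MC-16 multi-class variant
EPOCHS, BATCH = 200, 64

model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Polynomial ("poly") decay: lr_t = lr_0 * (1 - t / T)^0.9 (exponent assumed).
total_steps = EPOCHS * (160_000 // BATCH)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: (1 - step / total_steps) ** 0.9
)

criterion = torch.nn.CrossEntropyLoss(ignore_index=255)  # skip unlabeled pixels


def train_step(images, targets):
    """One step on (B, 3, 512, 512) images and (B, 512, 512) integer-label masks."""
    optimizer.zero_grad()
    logits = model(images)["out"]          # torchvision returns a dict with "out"
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```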
7. Relation to Contemporary Datasets and Benchmarks
Pixel-160K complements standard visuo-motor, segmentation, and compressed video datasets by providing order-of-magnitude larger pixel-level ground truth, covering diverse tasks (robotics manipulation via multimodal prompts, large-scale semantic segmentation, compressive video recovery). Its robotic annotation pipeline (SAM2, Grounding DINO, LLM prompt parsing) ensures scalable pixel mask generation from real-world data; its GAN synthetic pipeline enables expansion of annotated corpora with minimal manual intervention. Benchmark results validate both in-domain and strong cross-domain transfer performance (Liang et al., 3 Nov 2025, Li et al., 2022). In compressive sensing, the dataset enables controlled studies on temporal compression, forward modeling, and sparsity-driven inference with complete ground-truth stacks (Narayanan et al., 2019).
8. Canonical Equations and Key Formulas
Relevant mathematical expressions in Pixel-160K deployments include:
- CS forward model: $y = \Phi x + n$
- Block-diagonal coded mask: $\Phi = \operatorname{diag}(\phi_1, \ldots, \phi_N)$, where each pixel's temporal code $\phi_i \in \{0,1\}^{1 \times T}$ satisfies the bump-time constraint
- Basis pursuit denoising: $\min_{\alpha} \|\alpha\|_1$ s.t. $\|y - \Phi \Psi \alpha\|_2 \le \epsilon$ (a generic recovery sketch follows this list)
- RIP property: $(1 - \delta_k)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_k)\|x\|_2^2$ for all $k$-sparse $x$
- GAN segmentation: per-pixel label prediction from internal generator features, $\hat{y}_p = \arg\max_c\, h_\theta(F(z))_{p,c}$
- Robotics mask quality: $\mathrm{IoU}(\hat{M}, M) = \dfrac{|\hat{M} \cap M|}{|\hat{M} \cup M|}$, boundary $F_1 = \dfrac{2PR}{P + R}$
- Manipulation loss: supervised action-prediction objective of the form $\mathcal{L} = \sum_t \ell(\hat{a}_t, a_t)$ over predicted and demonstrated actions
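The recovery sketch referenced above: a generic ISTA loop for the Lagrangian form of basis pursuit denoising, $\min_\alpha \tfrac{1}{2}\|y - A\alpha\|_2^2 + \lambda\|\alpha\|_1$ with $A = \Phi\Psi$; this is a textbook reference implementation, not the recovery method used in (Narayanan et al., 2019).

```python
# Generic ISTA for sparse recovery under the CS forward model above.
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, y, lam=0.1, iters=200):
    """Iterative shrinkage-thresholding for sparse coefficients alpha."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the data-term gradient
    a = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ a - y)           # gradient of 0.5 * ||y - A a||^2
        a = soft_threshold(a - grad / L, lam / L)
    return a


# Toy example: recover a 5-sparse vector from 60 noisy random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 200)) / np.sqrt(60)
a_true = np.zeros(200)
a_true[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
y = A @ a_true + 0.01 * rng.standard_normal(60)
a_hat = ista(A, y, lam=0.05)
```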
This corpus of datasets collectively advances the benchmarking and evaluation of pixel-level reasoning in large-scale computer vision, robotics, and video sensing research.