
VoCap: Multi-Modal Models & Algorithms

Updated 2 September 2025
  • VoCap models are defined by targeted vocabulary capacity allocation, improving multilingual language representations through dynamic token assignment based on training data.
  • The architecture employs a k-NN target sampling approach that accelerates softmax computations while maintaining competitive performance on language modeling benchmarks.
  • VoCap also includes volumetric capture studios and promptable video captioning systems that enable precise multimodal data acquisition and unified video understanding.

VoCap is a family of models, algorithms, and systems spanning multiple modalities and research domains, united by a shared focus on vocabulary capacity allocation, semantic understanding, and volumetric capture. The term “VoCap” appears in a range of publication contexts and may refer to: (i) an algorithm for allocating large vocabulary capacity in cross-lingual language model pre-training (Zheng et al., 2021); (ii) a volumetric capture (VoCap) studio leveraged in the construction of datasets for XR and human motion understanding (Lohesara et al., 14 Feb 2024); (iii) a unified promptable video object captioning and segmentation architecture (Uijlings et al., 29 Aug 2025). Each manifestation addresses critical bottlenecks in scaling neural models across language, vision, and video domains.

1. VoCap for Vocabulary Capacity Allocation in Cross-Lingual Models

The VoCap algorithm addresses sub-optimal vocabulary representation in massively multilingual language models, where traditional joint subword vocabularies (e.g., SentencePiece) suffer from under-representation of morphologically rich or low-resource languages. The key contribution is a systematic, data-driven allocation of vocabulary capacity to each language, determined by maximizing a weighted Average Log Probability (ALP) subject to a total vocabulary size constraint.

The allocation is formalized as:

$$\text{maximize} \quad \sum_{i=1}^N q_i^{\beta} \cdot \text{ALP}(D_i, V_i(t_i)) \quad \text{subject to} \quad \left|\,\bigcup_{i=1}^N V_i(t_i)\right| = T$$

where $q_i$ is derived from the training data proportion $f_i$ via exponential smoothing ($q_i = \frac{f_i^\alpha}{\sum_j f_j^\alpha}$), $\beta$ is a rescaling factor, and $t_i \in \{1000, 2000, \ldots, 50{,}000\}$ is the vocabulary budget for language $i$. ALP reflects the quality of tokenization for that language. The procedure employs a greedy algorithm: at each allocation step, the language with the highest marginal ALP gain (weighted by $q_i^\beta$) receives an increment in vocabulary capacity, continuing until the aggregate constraint $T$ is met.
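
A minimal sketch of this greedy loop is shown below. The `alp(lang, budget)` scorer is a hypothetical callable (e.g., backed by per-language SentencePiece models and a unigram LM), and the union-size constraint is approximated by the sum of per-language budgets to keep the sketch short; a faithful implementation would recompute the merged vocabulary after each increment.

```python
# Sketch of VoCap's greedy vocabulary-capacity allocation (not the paper's code).
# `alp(lang, budget)` is assumed to return the Average Log Probability obtained
# with a per-language vocabulary of size `budget`.

def allocate_vocab(langs, freqs, alp, total_size, alpha, beta,
                   step=1000, max_budget=50_000):
    # Smoothed language weights: q_i = f_i^alpha / sum_j f_j^alpha
    raw = {l: freqs[l] ** alpha for l in langs}
    norm = sum(raw.values())
    q = {l: w / norm for l, w in raw.items()}

    budgets = {l: step for l in langs}   # every language starts at the minimum budget
    used = sum(budgets.values())

    while used < total_size:
        # Weighted marginal ALP gain of granting one more increment to language l
        def gain(l):
            if budgets[l] + step > max_budget:
                return float("-inf")
            return q[l] ** beta * (alp(l, budgets[l] + step) - alp(l, budgets[l]))

        best = max(langs, key=gain)
        if gain(best) == float("-inf"):
            break                        # every language has hit its 50,000-token cap
        budgets[best] += step
        used += step
    return budgets
```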

This methodology allows for dynamic and equitable vocabulary distribution, allocating more tokens to languages that demonstrate greater representational need or have larger training data. Experimental evaluation demonstrates that VoCap-allocated vocabularies outperform standard joint vocabularies on XTREME benchmark tasks (XNLI, POS, NER, XQuAD, MLQA, TyDiQA), with the largest gains on mid- and low-resource languages, reflecting improved representations and downstream accuracy.

2. Acceleration via k-Nearest Neighbor Target Sampling

Expanding vocabulary size exacerbates the computational cost of the softmax during masked language modeling, as each forward pass computes logits over all vocabulary items. VoCap introduces a k-NN-based target sampling scheme to mitigate this cost. Given a masked target token $w_i$ with embedding $v_{w_i}$, the $k$ most similar tokens (by inner product) are retrieved to form a candidate subset $V'$:

$$I_k(w_i) = \text{top-}k\left(\{\, v_{w_i}^{T} v_{w_j} : w_j \in V \,\}\right)$$

$$V' = \bigcup_{w_i \in \mathcal{W}} I_k(w_i)$$

The softmax is then computed over $V'$ instead of the full vocabulary, dramatically reducing computational complexity. The k-NN indices are refreshed every $n$ training steps, rather than per step, enabling practical deployment at scale. Empirical results show that $k = 50$ achieves significant training acceleration (up to 1.18× faster) while maintaining competitive or improved downstream metrics.
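
As a concrete illustration, the PyTorch-style sketch below computes the sampled softmax over $V'$. The tensor shapes and the brute-force top-k are assumptions; in the scheme above, the neighbor index is only rebuilt every $n$ steps rather than on every call.

```python
# Minimal sketch of k-NN target sampling for the masked-LM softmax.
# Assumptions: `emb_table` is the (|V|, d) output-embedding matrix and
# `target_ids` are the masked positions' gold token ids.
import torch
import torch.nn.functional as F

def knn_candidate_set(target_ids, emb_table, k=50):
    """Union of the top-k inner-product neighbors of each masked target token."""
    target_emb = emb_table[target_ids]              # (m, d)
    scores = target_emb @ emb_table.T               # (m, |V|)
    topk = scores.topk(k, dim=-1).indices           # (m, k)
    # Include the targets themselves so every gold token is guaranteed to be in V'
    return torch.unique(torch.cat([topk.flatten(), target_ids]))

def sampled_softmax_loss(hidden, target_ids, emb_table, candidates):
    """Cross-entropy over the reduced candidate vocabulary V' instead of all of V."""
    logits = hidden @ emb_table[candidates].T       # (m, |V'|)
    # Re-index each full-vocabulary target id to its position inside `candidates`
    pos = (candidates.unsqueeze(0) == target_ids.unsqueeze(1)).float().argmax(dim=1)
    return F.cross_entropy(logits, pos)
```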

3. VoCap as a Volumetric Capture Studio for Multimodal Datasets

Within the HEADSET database for XR and facial expression analysis (Lohesara et al., 14 Feb 2024), the VoCap system is a multi-sensor volumetric capture studio comprising 31 modules, each with two high-resolution RGB cameras and a depth sensor, positioned in a cylindrical configuration. Calibration and synchronization ensure each of the 62 RGB and 31 depth streams records at consistent timing (25 fps), facilitating precise multi-view 3D reconstruction.

The mathematical framework uses the pinhole camera model ($P = K\,[R \mid t]\,X$), storing intrinsic/extrinsic parameters for each module. Stereo depth estimation from the paired RGB cameras enhances geometric fidelity. Meshes are reconstructed via Poisson surface reconstruction, solving $\nabla^2 f = \nabla \cdot V$, while dense point clouds (∼900,000 points per frame) serve as ground truth for XR-related evaluation. Combined with synchronized light field imagery (Lytro Illum), the VoCap system supports facial expression classification under occlusions, XR research, and benchmarking of inpainting or compression techniques.
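
For reference, a minimal NumPy sketch of the pinhole projection $P = K\,[R \mid t]\,X$ applied with a single module's calibration is given below; the numeric values are illustrative placeholders, not HEADSET's actual calibration parameters.

```python
# Pinhole projection sketch: map homogeneous 3D world points to pixel coordinates
# using one module's intrinsics K and extrinsics [R|t]. Values are illustrative.
import numpy as np

def project(K, R, t, X_world):
    """Project homogeneous 3D points (N, 4) to pixel coordinates (N, 2)."""
    Rt = np.hstack([R, t.reshape(3, 1)])   # 3x4 extrinsic matrix [R|t]
    P = K @ Rt                             # 3x4 projection matrix
    x = (P @ X_world.T).T                  # (N, 3) homogeneous image coordinates
    return x[:, :2] / x[:, 2:3]            # perspective divide

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])            # illustrative intrinsics
R, t = np.eye(3), np.zeros(3)              # identity pose for the example
X = np.array([[0.1, -0.2, 2.0, 1.0]])      # one homogeneous world point
print(project(K, R, t, X))                 # -> its pixel location in this module's view
```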

4. VoCap: Promptable Video Object Captioning and Segmentation

VoCap (Uijlings et al., 29 Aug 2025) is a unified model that, given a video and a prompt (box, mask, or text), jointly outputs an object-centric spatiotemporal segmentation masklet and a free-form natural language caption. The architecture comprises:

  • A visual segmentation branch (inspired by SAM2/EVA02), operating per-frame with cross-frame (temporal) memory,
  • A prompt-processing branch, encoding text, box, or mask inputs to guide segmentation and captioning,
  • A language branch that shares architecture and weights for text encoding and decoding, using cross-attention to integrate extracted features as a prefix for auto-regressive caption generation.

Both mask decoding and captioning are conditioned on prompt embeddings via cross-attention. The model is trained on SAV-Caption, a pseudo-annotated dataset (generated by masking and visually prompting Gemini 1.5 Pro Vision), with further manual validation for unbiased evaluation.
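
To make the prompt-conditioned interface concrete, a hypothetical calling convention is sketched below; the class and method names (`encode_video`, `encode_prompt`, `decode_masks`, `decode_caption`) are illustrative placeholders, not the released VoCap API.

```python
# Hypothetical interface for a promptable video object captioner of this kind.
# All names here are illustrative placeholders, not the released VoCap code.
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

@dataclass
class Prompt:
    text: Optional[str] = None                                # e.g. "the dog chasing the ball"
    box: Optional[Tuple[float, float, float, float]] = None   # (x1, y1, x2, y2) on the prompt frame
    mask: Optional[np.ndarray] = None                         # binary mask on the prompt frame

@dataclass
class VoCapOutput:
    masklet: List[np.ndarray]   # one segmentation mask per frame
    caption: str                # free-form object-centric caption

def run_promptable_captioner(model, frames: List[np.ndarray], prompt: Prompt) -> VoCapOutput:
    """One prompt in, one (masklet, caption) pair out, mirroring the task definition."""
    features = model.encode_video(frames)            # per-frame features + temporal memory
    query = model.encode_prompt(prompt)              # shared embedding for text/box/mask prompts
    masklet = model.decode_masks(features, query)    # segmentation branch
    caption = model.decode_caption(features, query)  # language branch: cross-attention prefix + AR decode
    return VoCapOutput(masklet=masklet, caption=caption)
```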

VoCap sets a new benchmark for video object captioning with a CIDEr score of 47.8 on SAV-Caption-val and competitive segmentation metrics (J&F ≈ 75), surpassing prior baselines (SAM2+BLIP2, SAM2+PixelLLM). The architecture handles all three prompt types (text, box, mask) and generalizes across multiple datasets and video sequences.

5. Comparative Innovations and Cross-Modal Synergies

VoCap, in all its manifestations, is defined by its integration of cross-modal cues, resource-aware engineering, and scalable training regimes:

  • The vocabulary allocation and k-NN softmax accelerate and enhance cross-lingual LMs, enabling better downstream performance across diverse scripts and morphologies (Zheng et al., 2021).
  • The volumetric VoCap studio establishes a new standard for XR data fidelity with synchronized RGB/depth capture and light field integration, facilitating robust annotation and downstream learning (Lohesara et al., 14 Feb 2024).
  • The promptable VoCap video model unifies segmentation and captioning, coupled with pseudo-labeled data synthesis (visual prompt engineering backed by large VLMs), to robustly solve video understanding tasks (Uijlings et al., 29 Aug 2025).

A plausible implication is that promptability, resource-aware modular design, and efficient allocation/sampling strategies are likely future-defining components for multi-modal AI systems.

6. Impact, Availability, and Future Directions

VoCap-based approaches have advanced the state of the art on key multilingual and multi-modal tasks. The vocabulary allocation approach prioritizes under-served languages, ensuring their improved representation in large shared models. Softmax acceleration via k-NN-based target sampling offers an effective technique for scaling language models without prohibitive hardware costs. In vision, the VoCap studio and promptable video architectures enable fine-grained understanding and annotation of dynamic objects.

Datasets and code for VoCap vocabularies (Zheng et al., 2021) and promptable video captioning (Uijlings et al., 29 Aug 2025) are publicly available, enabling further research. Future directions include more efficient memory mechanisms, extension to longer or more complex video sequences, a broader spectrum of input prompts, and improved (manual and automatic) object captioning for video. Potential application areas span video editing, XR scene analysis, low-resource language modeling, and real-time human perception in telepresence platforms.


| Context | VoCap Instantiation | Key Contribution/Domain |
| --- | --- | --- |
| Cross-lingual NLP | Vocabulary allocation + k-NN target sampling | Resource-aware tokenization, LM pre-training |
| Multimodal XR data | Multi-view volumetric capture studio | High-fidelity 3D data for XR, facial analysis |
| Video AI | Promptable object segmentation/captioning | Unified multi-modal video understanding |

The VoCap “family” thus constitutes a significant set of algorithms and systems advancing vocabulary allocation, computational efficiency, and multi-modal AI understanding in contemporary research.
