LoST: Level of Semantics Tokenization for 3D Shapes
Abstract: Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.
Explain it Like I'm 14
What is this paper about?
This paper introduces a new way to break 3D shapes into “tokens” (small chunks of information) so that computers can learn to create and understand 3D objects more easily. The method is called LoST (Level of Semantics Tokenization). Unlike older methods that start with rough geometry and add detail later, LoST orders tokens by meaning. That means even the first few tokens already describe a whole, recognizable object (like “a chair”), and later tokens add the specific details (like “this chair has armrests and a curved back”).
The main questions the researchers asked
- Can we design 3D tokens so that early parts already capture the object’s main idea (its semantics), not just a rough shape?
- Can this make 3D generation and reconstruction both better and more efficient (using fewer tokens)?
- How can we bring “meaning” from 2D image understanding into 3D shapes without constantly rendering images?
How did they study it? (Explained simply)
Think of building a LEGO model:
- Old way (Level-of-Detail): you first assemble a chunky outline of the model and then slowly make it less blocky. Early steps don’t look like the final thing.
- LoST (Level-of-Semantics): you start with a simple-but-complete version that already looks like the right object, then add details to make it more specific.
Here’s how LoST works, in easy terms:
- 3D shapes as “feature planes”: Each 3D shape is first turned into a compact set of features called a “triplane” (imagine three thin sheets that store information about the object from three directions—like three X-ray views). This comes from a VAE (a model that compresses and decompresses data).
- Turning shapes into tokens: A transformer (like those used in LLMs) turns the triplane into a sequence of special “register tokens.” These tokens aren’t tied to one location in space; they act more like summary notes that together describe the object.
- Ordered by meaning: The system is trained so that the first token(s) capture the main idea of the object (e.g., “car,” “chair”), and the later ones add finer details (e.g., spoiler, armrests). Training uses “nested dropout,” which randomly keeps only a prefix of the tokens, forcing the early tokens to carry the most important information. This creates a natural coarse-to-fine order by meaning (a short code sketch of nested dropout follows this list).
- Decoding any prefix: A generative “diffusion” decoder takes however many tokens you have (even just 1) and fills in a complete, plausible 3D shape. With more tokens, the shape becomes more accurate and detailed.
- Bringing in “meaning” from 2D: Image models like DINO are very good at understanding what’s in a picture (semantics). To organize shapes by meaning, the authors invent RIDA (Relational Inter-Distance Alignment). Imagine a teacher saying: “In my world, car A is more similar to car B than to boat C.” RIDA trains a 3D “semantic extractor” so that distances between 3D shapes follow these same relationships (who is close to whom) without directly matching raw numbers. It’s like making the “friendship map” in 3D match the “friendship map” in 2D. This avoids expensive steps like constantly rendering 3D into images during training (see the relational-alignment sketch after this list).
- Generating shapes token-by-token: Finally, they train a GPT-style model (LoST-GPT) that predicts the next token in continuous space (not just picking from a fixed list) to generate 3D objects from prompts or images, efficiently and with high quality (a simplified sketch of this continuous next-token diffusion loss also follows this list).
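For readers who want to see roughly what the “nested dropout” trick looks like in code, here is a minimal PyTorch-style sketch; the tensor shapes and the uniform keep-length sampling are illustrative assumptions rather than the paper’s exact recipe.

```python
import torch

def nested_dropout(tokens, max_len=None):
    """Randomly keep only the first k register tokens of each sequence.

    tokens: (batch, num_tokens, dim), an ordered sequence of register tokens.
    Because only a random-length prefix survives training, the earliest tokens
    are forced to carry the most important (most semantic) information.
    """
    batch, num_tokens, _ = tokens.shape
    max_len = max_len or num_tokens
    # Sample a keep-length k in [1, max_len] per sample (uniform here; the
    # actual sampling distribution is a design choice, not taken from the paper).
    keep = torch.randint(1, max_len + 1, (batch,), device=tokens.device)
    # Keep positions < k, zero out the rest.
    positions = torch.arange(num_tokens, device=tokens.device).unsqueeze(0)
    mask = (positions < keep.unsqueeze(1)).unsqueeze(-1).float()
    return tokens * mask

# During tokenizer training, the prefix decoder only ever sees the surviving
# prefix, e.g. decoder(nested_dropout(register_tokens)).
```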
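The “friendship map” matching behind RIDA can be sketched in the same spirit. The version below only aligns z-scored pairwise similarity rows between the 3D embeddings and the teacher DINO embeddings; the paper’s full loss also includes contrastive (InfoNCE), rank, and spatial terms that are omitted here.

```python
import torch
import torch.nn.functional as F

def rida_relational_loss(student_emb, teacher_emb):
    """Align the relational structure of two embedding spaces.

    student_emb: (batch, d_s) 3D shape embeddings from the semantic extractor.
    teacher_emb: (batch, d_t) frozen DINO features for the same shapes.
    Instead of matching raw features, match who-is-close-to-whom: each
    anchor's similarity row is z-scored so only relative distances matter.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_s = s @ s.t()  # (batch, batch) student similarity map
    sim_t = t @ t.t()  # (batch, batch) teacher similarity map

    def zscore_rows(sim):
        mean = sim.mean(dim=-1, keepdim=True)
        std = sim.std(dim=-1, keepdim=True) + 1e-6
        return (sim - mean) / std

    return F.mse_loss(zscore_rows(sim_s), zscore_rows(sim_t))
```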
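Finally, the “predict the next token in continuous space” step can be approximated as below. This is a simplified MAR-style sketch: the linear noise schedule and the denoise_mlp signature are assumptions for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn.functional as F

def diffusion_next_token_loss(cond, target_token, denoise_mlp, num_steps=1000):
    """Diffusion loss for predicting one continuous next token.

    cond:         (batch, d) conditioning vector from the GPT backbone,
                  summarizing the previously generated tokens.
    target_token: (batch, d_tok) ground-truth next register token.
    denoise_mlp:  small network that predicts the added noise from
                  (noisy_token, timestep, cond).
    """
    batch = target_token.shape[0]
    t = torch.randint(0, num_steps, (batch,), device=target_token.device)
    # Illustrative linear alpha-bar schedule.
    alpha_bar = (1.0 - (t.float() + 1) / num_steps).clamp(min=1e-4).unsqueeze(-1)

    noise = torch.randn_like(target_token)
    noisy = alpha_bar.sqrt() * target_token + (1 - alpha_bar).sqrt() * noise

    pred_noise = denoise_mlp(noisy, t, cond)
    return F.mse_loss(pred_noise, noise)
```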
What did they find?
- Early tokens already matter: Even with as few as 1–4 tokens, LoST can produce a complete, recognizable shape that matches the object’s category. Adding more tokens adds personalized details.
- Better with fewer tokens: Compared to older “Level-of-Detail” methods (like OctGPT and VertexRegen), LoST reconstructs shapes more accurately—both in geometry and in meaning—while using only about 0.1%–10% of the tokens those methods need.
- Stronger 3D generation: The LoST-GPT model (which uses LoST tokens) outperforms other state-of-the-art autoregressive 3D methods on common quality metrics, while using far fewer tokens (e.g., 128 tokens). That means it’s faster and more efficient.
- Useful beyond generation: Because the tokens are organized by meaning, they also help with tasks like semantic shape retrieval—finding shapes with similar concepts (e.g., “find other submarine-like objects”) even when geometry differs.
Why these results matter:
- Metrics like Chamfer Distance (geometry), FID (overall visual realism), and DINO similarity (semantic closeness) all improved (a minimal Chamfer Distance sketch follows this list).
- Early “previews” are actually useful because the first tokens already give a complete, plausible object, not just a coarse blob.
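For context, Chamfer Distance simply averages nearest-neighbor distances between two point clouds in both directions. A minimal NumPy sketch (an illustration, not the paper’s evaluation code):

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between point clouds of shape (N, 3) and (M, 3).

    For every point in A, find its nearest neighbor in B (and vice versa),
    then sum the two averaged squared distances.
    """
    diff = points_a[:, None, :] - points_b[None, :, :]  # (N, M, 3)
    dist2 = np.sum(diff ** 2, axis=-1)                  # pairwise squared distances
    a_to_b = dist2.min(axis=1).mean()                   # nearest B-neighbor per A point
    b_to_a = dist2.min(axis=0).mean()                   # nearest A-neighbor per B point
    return float(a_to_b + b_to_a)
```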
Why this matters and what could happen next
- Faster, cheaper 3D creation: Since early tokens already give a complete shape, models can stop early for simple objects, saving time and compute.
- Better control and integration: Tokens ordered by meaning make it easier to guide models with text or images and to integrate with big language–vision models.
- Smarter 3D search: Because tokens reflect meaning, you can search for shapes by concept and get better matches.
- Future directions: The authors suggest adapting LoST to other 3D formats (like Gaussian splats), reducing reliance on diffusion decoders for speed, strengthening very-early tokens even more, and letting the model decide how many tokens it needs (adding an “end-of-sequence” marker).
In short, LoST flips the usual script: instead of refining shapes from coarse geometry to fine geometry, it starts from meaning and adds details afterward, making 3D generation more efficient, more understandable, and more useful.
Knowledge Gaps
Below is a concise, actionable list of what remains uncertain or unexplored in the paper, organized to guide future research.
- Representation dependence: LoST is instantiated on Direct3D’s triplane VAE latents only; it is unclear how the approach transfers to other 3D representations (e.g., Gaussian splats, NeRFs, neural SDFs, meshes/point clouds) and whether the same semantic prefix behavior emerges across them.
- Generative decoder reliance: The need for a diffusion decoder to reconstruct from short prefixes introduces additional compute and inference latency; the paper does not benchmark runtime, memory, or energy against pure AR decoding, fast feed-forward decoders, or distillation into lighter decoders.
- Continuous tokens vs. discrete codes: LoST-GPT models continuous tokens with a diffusion loss; the trade-offs vs. vector-quantized/discrete tokens (e.g., bits-per-shape, transmission/storage costs, robustness to AR errors, and training stability) are not assessed.
- Prefix consistency metrics: The claim of “any-prefix” semantic coherence is supported qualitatively, but there is no quantitative evaluation of prefix-to-prefix consistency (e.g., category stability, semantic churn, geometric deviation as tokens are added).
- Exposure bias and off-manifold risks: AR prediction of continuous tokens may yield off-manifold latents; the paper does not analyze robustness to accumulated prediction errors or propose safeguards (e.g., manifold regularizers, token denoising at inference).
- Token interpretability and controllability: It is unknown whether individual tokens correspond to interpretable semantic factors or parts; token-to-part mappings, attention visualizations, and token-level editing capabilities are unexplored.
- Variable-length AR generation: Although LoST produces variable-length codes, the AR model uses a fixed target length; training with EOS tokens and complexity-aware stopping, and measuring quality-speed trade-offs across variable lengths, remain open.
- Scalability and capacity: The effect of token count (up to 512) and token dimension (32-d) on fidelity and semantics is underexplored; scaling laws, diminishing returns, and how to allocate capacity across tokens are not analyzed.
- Data generation and domain bias: Training uses 300k synthetic shapes produced by a Flux→Direct3D pipeline; potential biases and failure modes inherited from the upstream image generator and 3D reconstructor are not quantified, and generalization to real scans/CAD collections (e.g., ScanNet, ABO, Objaverse-LVIS) is not evaluated.
- Evaluation fairness: AR baselines use Objaverse while LoST uses a synthetic dataset; differences in data scale/quality and task setup (text-to-3D vs. image-to-3D) complicate fairness. A standardized training/evaluation protocol is needed for apples-to-apples comparisons.
- Metrics coverage: Reconstruction is assessed via Chamfer, FID, and DINO similarity on renderings; topology (genus, manifoldness, self-intersections), normal consistency, surface roughness, and watertightness are not reported, nor are view-agnostic 3D shape metrics.
- Comparison to diffusion-based 3D SOTA: The paper benchmarks against AR baselines but not against strong diffusion/score-based 3D generators; relative quality, diversity, and compute costs versus modern diffusion pipelines remain unknown.
- Texture/material modeling: LoST focuses on geometry in triplane latents; handling of appearance (textures, materials) and joint geometry-appearance tokenization is not addressed.
- Scene-level extension: The approach targets single objects; how semantics-first tokenization scales to multi-object scenes, layouts, and interactions is untested.
- Few-token artifacts: While acknowledged, concrete analyses of failure modes at extreme compression (e.g., topology breaks, part omissions) and methods to mitigate them (e.g., topology-aware priors, part-consistency losses) are missing.
- RIDA teacher choice and bias: Semantics are distilled from 2D DINO features, introducing view and dataset biases; performance with alternative teachers (e.g., multi-view foundation models, 3D-aware teachers) and sensitivity to teacher errors are not examined.
- RIDA supervision without images: RIDA requires image-derived teacher features during training; applicability to datasets with 3D shapes but no associated images (without expensive rendering) remains unclear.
- Multi-view semantics: The paper sidesteps multi-view REPA due to cost, but relying on single-view DINO may bias semantics; how to incorporate affordable multi-view supervision or view-consistent semantics is open.
- RIDA hyperparameters and mining strategy: Positive/negative thresholds, batch composition, and the weights (λ) for global/rank/spatial losses are not ablated; sensitivity and stability across settings are unknown.
- Component ablations: Only a “w/ vs. w/o RIDA” ablation is provided; the individual contributions of global contrast, rank distillation, and spatial structure distillation are not isolated.
- Retrieval ground truth: Retrieval uses DINO similarity as “semantic” ground truth; human-labeled semantics or part-aware datasets are not used to validate that RIDA aligns to human perception rather than DINO idiosyncrasies.
- Conditioning modalities: The system is evaluated mainly in image-conditioned settings; the behavior in purely text-conditioned AR generation and cross-modal retrieval (text↔3D) is not fully characterized.
- Robustness to thin structures and complex topology: Performance on fine, fragile, or high-genus structures is not specifically evaluated; how semantics-first tokens preserve delicate parts remains an open question.
- Efficiency claims vs. wall-clock: Token efficiency is reported in token counts, but wall-clock training/inference time, GPU memory, and throughput comparisons to baselines are not provided.
- Alternative tokenization orderings: Whether reordering classical LoD tokens by learned semantic salience (e.g., learning a semantic traversal over octrees/meshes) could match LoST’s benefits is not investigated.
- Uncertainty calibration: As prefix length grows, the generative decoder transitions from sampling to reconstruction; explicit measures of uncertainty/diversity vs. fidelity across prefix lengths are not reported.
- Broader downstream tasks: Beyond retrieval and AR generation, the utility of LoST tokens for 3D classification, segmentation, alignment, or part-based editing is not tested.
Practical Applications
Immediate Applications
The following applications can be implemented with the methods and models described in the paper (LoST tokens, RIDA semantic extractor, diffusion-based decoder, LoST-GPT) with modest integration effort.
- Progressive 3D asset preview and streaming
- Sectors: software (3D tools), gaming/VFX, AR/VR, e-commerce
- Who: industry, daily life
- What: Use LoST’s prefix-decodable tokens to show an immediately recognizable, complete proxy from 1–4 tokens while progressively refining details as more tokens arrive. Enables low-latency previews in asset browsers, game engines, web viewers, and AR apps.
- Tools/workflows: glTF/USD pipeline extension to carry LoST token streams; web/mobile viewers that request additional tokens on demand; early-stop decoding in DCC tools (e.g., Blender, Unity, Unreal).
- Assumptions/dependencies: Requires LoST encoder/decoder runtime and a mesh/surface reconstruction step from triplanes; diffusion decoder incurs nontrivial compute; domain retraining may be needed for stylized assets.
- Interactive co-creative 3D modeling with “add-a-token” refinement
- Sectors: design/CAD, media/entertainment, education
- Who: industry, academia, daily life (prosumer creators)
- What: Start from a 1–2-token coarse but plausible shape from text/image, then add tokens to refine parts or instance-level details. Supports user-in-the-loop ideation and fast iteration.
- Tools/workflows: Plugins for DCCs exposing a token-count slider; prompt/image-conditioned LoST-GPT for initial shape, followed by token-wise refinement.
- Assumptions/dependencies: Requires a responsive diffusion decoder; UI affordances for token-level control; training aligned to the target design domain.
- Token-efficient 3D generation on edge devices
- Sectors: AR/VR, mobile apps, social media filters
- Who: industry, daily life
- What: Generate approximate 3D proxies from few tokens on laptops/phones for AR occlusion, quick scene prototyping, or filters; progressively refine when connected to the cloud.
- Tools/workflows: On-device LoST-GPT with reduced token budgets (e.g., 16–64); server-side refinement via full-length decode.
- Assumptions/dependencies: Mobile-friendly inference requires model distillation/quantization; triplane-to-mesh conversion must be efficient.
- Semantic 3D search and retrieval for asset libraries
- Sectors: gaming/VFX, e-commerce, archives/museums, enterprise content management
- Who: industry, academia
- What: Index libraries using RIDA embeddings to retrieve semantically similar shapes (e.g., “submarine shaped like a fish”) beyond geometric similarity; improves asset discovery, deduplication, and tagging.
- Tools/workflows: Embedding service for RIDA features; vector search (e.g., FAISS, Elasticsearch KNN) integrated with asset catalogs (see the indexing sketch below).
- Assumptions/dependencies: Needs pretraining of the RIDA student on representative data; retrieval quality depends on DINO teacher alignment to the target style/domain.
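A minimal sketch of such an indexing service, assuming precomputed RIDA embeddings (the file name, dimensionality, and normalization choice below are placeholders):

```python
import faiss
import numpy as np

# Hypothetical: one RIDA embedding per shape in the asset library.
shape_embeddings = np.load("rida_embeddings.npy").astype("float32")  # (num_shapes, d)

# Cosine-similarity search = inner product over L2-normalized vectors.
faiss.normalize_L2(shape_embeddings)
index = faiss.IndexFlatIP(shape_embeddings.shape[1])
index.add(shape_embeddings)

def retrieve(query_embedding, top_k=10):
    """Return indices and similarities of the top_k semantically closest shapes."""
    query = query_embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, top_k)
    return ids[0], scores[0]
```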
- Storage and bandwidth reduction via learned 3D compression
- Sectors: cloud platforms, collaboration tools, streaming, digital twins
- Who: industry, policy (sustainability, cost)
- What: Store and transmit 3D assets as compact LoST token sequences (e.g., 128 tokens) instead of large meshes/octrees; decode on demand.
- Tools/workflows: Repository format for tokenized shapes; background decode services; APIs to fetch tokens progressively.
- Assumptions/dependencies: Compression is model-dependent (decoder must be available); legal/IP considerations for storing model-dependent representations.
- Rapid quality control and content moderation for 3D uploads
- Sectors: marketplaces, social platforms
- Who: industry, policy
- What: Use early-token decodes for fast semantic screening (e.g., category, safety) before full decode; triage and prioritize moderation queues.
- Tools/workflows: Gatekeeping service that decodes 1–4 tokens and classifies via RIDA space; escalation to full decode if needed.
- Assumptions/dependencies: Moderation policy requires calibration for false positives/negatives; relies on DINO-aligned semantics reflecting platform norms.
- Dataset curation and active learning for 3D models
- Sectors: academia, AI/ML platform teams
- Who: academia, industry
- What: Use RIDA space to cluster by semantics, select diverse exemplars, and detect duplicates; improves training efficiency and coverage for 3D generative models (see the clustering sketch below).
- Tools/workflows: Clustering/coverage analytics in RIDA space; sampling utilities integrated into data pipelines.
- Assumptions/dependencies: Benefit depends on how well RIDA captures semantics in the domain; periodic re-indexing as datasets evolve.
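One plausible implementation of the clustering and duplicate detection above, assuming a matrix of precomputed RIDA embeddings (the file name, cluster count, and similarity threshold are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.load("rida_embeddings.npy")  # (num_shapes, d)

# Semantic clustering: sample exemplars across clusters for diverse coverage.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(embeddings)
cluster_labels = kmeans.labels_

# Near-duplicate detection: flag pairs with very high cosine similarity.
sim = cosine_similarity(embeddings)
np.fill_diagonal(sim, 0.0)
duplicate_pairs = np.argwhere(sim > 0.98)  # each pair appears twice, as (i, j) and (j, i)
```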
- Progressive 3D product viewing in e-commerce
- Sectors: retail, marketplaces
- Who: industry, daily life
- What: Show an instant recognizable 3D silhouette while detailed textures/geometry stream in; improves perceived performance and engagement on product pages.
- Tools/workflows: Web viewer embedding LoST decoder; CDN-backed token streaming.
- Assumptions/dependencies: Requires preprocessing of product models into LoST tokens; regulatory requirements for product fidelity may necessitate high-token final decode.
- Education: teaching hierarchical 3D semantics
- Sectors: education, outreach
- Who: academia, daily life
- What: Visualize how semantics accrete with tokens to teach shape abstraction, category-level features, and generative modeling.
- Tools/workflows: Interactive notebooks/demos with token-controlled decoding; course modules on 3D generative AI.
- Assumptions/dependencies: Requires accessible pretrained checkpoints and lightweight visualizers.
- Cost and energy savings in 3D pipelines
- Sectors: cloud ops, sustainability initiatives
- Who: industry, policy
- What: Fewer tokens reduce compute for AR training/inference and bandwidth for streaming; measurable cost and carbon reductions.
- Tools/workflows: Benchmarks tracking token budgets vs. cost/energy; cloud cost analyzers.
- Assumptions/dependencies: Savings depend on decoder efficiency and workload scale; organizational buy-in for measurement and reporting.
Long-Term Applications
These applications are feasible with further research, domain adaptation, engineering, or standardization (e.g., new datasets, representation support beyond triplanes, real-time constraints).
- Robotics: fast category-aware 3D understanding for manipulation and navigation
- Sectors: robotics, logistics, manufacturing
- Who: industry, academia
- What: Use few-token semantics to rapidly infer object category and affordances from partial scans; refine tokens as sensing continues to improve grasping and planning.
- Tools/workflows: On-robot LoST/RIDA encoders for point clouds; integration with perception/planning stacks; progressive shape completion.
- Assumptions/dependencies: Requires extension beyond triplanes to point clouds/meshes; training on real-world scans; low-latency, possibly non-diffusion decoders.
- Semantically ordered streaming standard for 3D formats
- Sectors: web/standards, software vendors
- Who: industry, policy
- What: Introduce a semantics-first “Level-of-Semantics” streaming profile (akin to LoD) for glTF/USD so clients can request tokens by semantic priority.
- Tools/workflows: Spec proposals, reference encoders/decoders, conformance tests.
- Assumptions/dependencies: Industry consensus; IP/licensing of learned tokenizers; on-device runtime feasibility.
- AEC/BIM viewers with semantics-first loading
- Sectors: architecture, engineering, construction
- Who: industry
- What: Load building models by semantic importance (e.g., structural elements first, then MEP details), enabling faster situational awareness and remote collaboration.
- Tools/workflows: BIM-to-LoST converters trained on AEC datasets; viewer support for token-based streaming.
- Assumptions/dependencies: Domain-specific semantics and part taxonomies; conversion from procedural BIM to triplanes or alternative representations.
- Healthcare: progressive anatomical model generation and retrieval
- Sectors: healthcare, medical imaging
- Who: industry, academia, policy
- What: Semantics-aware tokenization of anatomical structures for telemedicine previews, case retrieval, and education; early tokens give recognizable organ shapes, refined later for pathology details.
- Tools/workflows: Domain-adapted encoders for volumetric/mesh medical data; HIPAA/GDPR-compliant model hosting; clinician-in-the-loop tools.
- Assumptions/dependencies: Extensive domain-specific training and validation; regulatory approval; representation shift beyond triplanes; strict privacy controls.
- Scene-level, multi-object generation with semantic budgeting
- Sectors: simulation, gaming, digital twins
- Who: industry, academia
- What: Allocate tokens across objects by importance in a scene (e.g., hero assets vs. background) for scalable world generation and simulators.
- Tools/workflows: Scene schedulers that distribute token budgets; multi-object LoST-GPT; runtime early stopping per object.
- Assumptions/dependencies: Training on scene-level data; scheduling policies; hierarchical decoders that preserve global coherence.
- Cross-modal assistants that “think in tokens”
- Sectors: software, productivity, education
- Who: industry, daily life
- What: MLLMs plan 3D workflows by requesting semantic tokens (“generate 4 tokens for a chair, then add 8 for ornate legs”) to balance quality vs. latency.
- Tools/workflows: Toolformer-style APIs exposing token budgets; agentic UIs for CAD and content creation.
- Assumptions/dependencies: Tight MLLM–LoST integration; reliable token-to-part mapping and controllability.
- Insurance/claims and retail visualization
- Sectors: finance/insurance, retail
- Who: industry, daily life
- What: Quickly synthesize approximate 3D proxies from phone photos for triage or quotes; refine on the backend for detailed assessment or product configuration.
- Tools/workflows: Mobile capture apps with on-device few-token decode; backend refinement services.
- Assumptions/dependencies: Domain-specific training (damage, wear); calibration for risk; human oversight for adjudication.
- On-device AR occlusion and interaction with proxy geometry
- Sectors: AR/VR, consumer tech
- Who: industry, daily life
- What: Real-time few-token reconstructions provide proxy meshes for occlusion and physics, improving AR stability without cloud round trips.
- Tools/workflows: Real-time LoST encoders; non-diffusion or lightweight decoders; integration with ARKit/ARCore.
- Assumptions/dependencies: Real-time constraints; sensor-noise robustness; hardware acceleration.
- Sustainability metrics and procurement policy for 3D AI
- Sectors: policy, sustainability, enterprise IT
- Who: policy, industry
- What: Incorporate token efficiency and progressive decoding into green procurement and reporting (e.g., lower energy per generated asset).
- Tools/workflows: Standardized benchmarks for token-per-quality; reporting frameworks.
- Assumptions/dependencies: Agreement on metrics; independent auditing; mapping token budgets to energy usage.
- Domain-general 3D foundation models with semantics-first tokens
- Sectors: broad AI ecosystem
- Who: academia, industry
- What: Build unified 3D backbones that serve generation, recognition, and retrieval via LoST/RIDA, enabling plug-and-play across tasks and sectors.
- Tools/workflows: Large-scale pretraining on diverse 3D corpora; adapters for alternative 3D representations (splats, SDFs, point clouds).
- Assumptions/dependencies: Data availability and licensing; training cost; community benchmarks for fair comparison.
Notes on key dependencies and assumptions
- Representation: The current system is instantiated on triplane latents from a VAE; extending to meshes, point clouds, Gaussians, or medical volumes will require engineering and retraining.
- Decoder compute: The diffusion decoder adds runtime cost; for latency-critical use cases, research into lighter decoders or hybrid AR+diffusion is needed.
- Semantics source: RIDA depends on DINO features mined from image renderings; domain mismatch with target sectors may require teacher adaptation (e.g., medical or AEC teachers).
- Quality control: Early-prefix decodes are plausible but can exhibit artifacts at very low token counts; UIs and workflows should reflect uncertainty and allow refinement.
- IP and storage: Learned token representations require model availability for decode; plan for versioning, licensing, and long-term accessibility.
Glossary
- any-prefix generation: Generating partial outputs from early token prefixes that should be usable during decoding. "Consequently, `any-prefix generation' produces unusable shape intermediates"
- autoregressive (AR) models: Generative models that predict the next element in a sequence conditioned on previous ones. "it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation."
- causal masking: An attention masking scheme that prevents a token from attending to future tokens, enforcing autoregressive order. "Nested token dropout and causal masking are employed"
- categorical cross-entropy loss: A standard loss for discrete classification; here contrasted with a diffusion loss for continuous tokens. "Rather than using a categorical cross-entropy loss, we adopt a diffusion loss"
- Chamfer Distance (CD): A geometric distance measure between point sets used to assess reconstruction fidelity. "We report the Chamfer Distance (CD) for geometric"
- classifier-free guidance: A sampling technique that mixes conditional and unconditional predictions to steer generative models. "We employ a random dropout rate of 0.1 for classifier-free guidance"
- cosine similarity: A similarity measure between vectors based on the cosine of the angle between them. "maximizes the cosine similarity between 's predicted latent and the ground-truth latent ."
- Diffusion-Transformer (DiT) model: A diffusion-based generative model implemented with a Transformer backbone. "we train a Diffusion-Transformer (DiT) model to reproduce the full signal"
- diffusion decoder: A generative decoder that uses a diffusion process to reconstruct outputs from embeddings or tokens. "We use a diffusion decoder to produce the final latents from the AR generated tokens"
- diffusion loss: The denoising objective used to train diffusion models; here used for next-token prediction in continuous space. "we adopt a diffusion loss~\cite{ddpm} following MAR~\cite{ar_diff_loss}"
- DINO: A self-supervised Vision Transformer feature space used as a semantic teacher. "distance between its intermediate features and the DINO features of the original image."
- DINOv2: An improved version of DINO providing stronger visual features used as a frozen teacher. "a frozen DINOv2 ViTB14 teacher"
- edge collapse: A mesh simplification operation used in progressive representations; its reverse defines a refinement order. "learns vertex splits (i.e., reverse edge collapse ordering)"
- exposure bias: The compounding of errors in autoregressive generation due to training on ground-truth histories. "Such 1D-code streams amplify quadratic attention costs and exposure bias"
- FID: Fréchet Inception Distance, measuring distributional similarity between sets of images. "We compute the FID score to measure the distributional alignment between the generated shape renderings and the target shape renderings."
- FlexTok: An image tokenizer that orders tokens by semantic importance and enables variable-length decoding. "We draw inspiration from the recent Flextok~\cite{flextok} and Semanticist~\cite{semanticist} works"
- GPT-style Transformer: A decoder-only Transformer configured for autoregressive generation. "we ... train a GPT-style Transformer, following the standard setup of LlamaGen"
- InfoNCE: A contrastive learning objective that pulls positives together and pushes negatives apart. "We adopt a multi-positive InfoNCE loss"
- latent space: A compact, continuous representation space learned by an encoder (e.g., a VAE) for shapes. "Following common practice in the field, we start from VAE-encoded 3D shapes, which provide a smooth and compact latent space."
- Level-of-Detail (LoD): A geometric hierarchy that refines representations from coarse to fine spatial detail. "geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression."
- Level-of-Semantics Tokenization (LoST): A tokenizer that orders tokens by semantic importance so early prefixes decode into complete, plausible shapes. "We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience"
- Matryoshka representation: A nested embedding scheme where prefixes are usable and progressively refined. "Matryoshka representation~\cite{kusupati2022matryoshka} learns nested and prefix-usable embeddings."
- MaskGIT: An iterative masked token decoder that accelerates generation by parallel refinement. "MaskGIT~\cite{Chang_2022_CVPR} introduces iterative masked decoding for rapid refinement"
- nested dropout: A training strategy that randomly truncates token prefixes to enforce hierarchical ordering of information. "we use nested dropout~\cite{nested} to enforce earlier tokens to capture the principal semantics"
- octrees: Hierarchical spatial data structures that subdivide 3D space into octants for multiscale representation. "octrees~\cite{octgpt,Wang2017_OCNN}"
- OpenCLIP: An open implementation of CLIP used for conditioning generative models with text/image embeddings. "For conditional generation, we utilize OpenCLIP~\cite{openclip,Radford2021LearningTV} embeddings"
- patchification: Splitting inputs into fixed-size patches before feeding them to a Transformer or CNN. "Both models utilize patchification."
- perceptual loss: A loss computed in a learned feature space to align high-level semantics rather than pixels/geometry. "we employ it as a perceptual loss to guide the diffusion generator ."
- prefix decoder: A decoder trained to reconstruct outputs from any prefix length of the token sequence. "a prefix decoder is jointly trained to reconstruct the triplane latent features from any prefix length."
- progressive meshes: A mesh representation that supports continuous levels of detail via incremental refinement. "progressive meshes~\cite{vertexregen}"
- quadratic attention costs: The O(n²) computational complexity of full self-attention with respect to sequence length. "Such 1D-code streams amplify quadratic attention costs and exposure bias"
- register tokens: Learnable tokens not tied to specific patches that aggregate and reorder information within a Transformer. "we introduce a new set of register~\cite{darcet2023vision} tokens"
- Relational Inter-Distance Alignment (RIDA): A semantic alignment objective that matches relative distances in 3D latents to a teacher feature space. "we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss"
- Relational Knowledge Distillation (RKD): A distillation approach that transfers relational (pairwise) structure rather than absolute features. "inspired by Relational Knowledge Distillation (RKD)"
- Representation Alignment (REPA): A loss aligning internal representations of a generator with features from a semantic teacher (e.g., DINO). "employ an important semantic alignment loss -- REPA --"
- semantic salience: The relative importance of tokens according to the semantic content they convey. "orders tokens by semantic salience"
- sinusoidal positional embeddings: Deterministic position encodings injected into Transformers to provide token order information. "2D sinusoidal positional embeddings."
- token bloat: Inefficiency from requiring many tokens to represent coarse structures, harming AR training. "(i)~token bloat at coarse scale"
- tokenization: Converting data (e.g., shapes) into discrete or continuous tokens for sequence modeling. "Tokenization is a fundamental technique in the generative modeling of various modalities."
- triplane: A 3D representation encoding volumetric information as three orthogonal feature planes. "a triplane of size "
- VAE: Variational Autoencoder, an encoder–decoder model that learns a compact stochastic latent space. "we start from VAE-encoded 3D shapes"
- ViT: Vision Transformer, an attention-based architecture operating on image or patch tokens. "we employ a ViT~\cite{vit}-based encoder on patchified triplanes"
- z-scoring: Standardizing values to zero mean and unit variance to compare relational patterns across modalities. "by standardizing (z-scoring) each anchor's similarity row"
- rank distillation: Matching the relative ordering/structure of similarities from a teacher space rather than their absolute values. "we introduce the inter-instance rank distillation loss"