LoST: Level of Semantics Tokenization for 3D Shapes
Abstract: Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.
Explain it Like I'm 14
What is this paper about?
This paper introduces a new way to break 3D shapes into “tokens” (small chunks of information) so that computers can learn to create and understand 3D objects more easily. The method is called LoST (Level of Semantics Tokenization). Unlike older methods that start with rough geometry and add detail later, LoST orders tokens by meaning. That means even the first few tokens already describe a whole, recognizable object (like “a chair”), and later tokens add the specific details (like “this chair has armrests and a curved back”).
The main questions the researchers asked
- Can we design 3D tokens so that early parts already capture the object’s main idea (its semantics), not just a rough shape?
- Can this make 3D generation and reconstruction both better and more efficient (using fewer tokens)?
- How can we bring “meaning” from 2D image understanding into 3D shapes without constantly rendering images?
How did they study it? (Explained simply)
Think of building a LEGO model:
- Old way (Level-of-Detail): you first assemble a chunky outline of the model and then slowly make it less blocky. Early steps don’t look like the final thing.
- LoST (Level-of-Semantics): you start with a simple-but-complete version that already looks like the right object, then add details to make it more specific.
Here’s how LoST works, in easy terms:
- 3D shapes as “feature planes”: Each 3D shape is first turned into a compact set of features called a “triplane” (imagine three thin sheets that store information about the object from three directions—like three X-ray views). This comes from a VAE (a model that compresses and decompresses data).
- Turning shapes into tokens: A transformer (like those used in LLMs) turns the triplane into a sequence of special “register tokens.” These tokens aren’t tied to one location in space; they act more like summary notes that together describe the object.
- Ordered by meaning: The system is trained so that the first token(s) capture the main idea of the object (e.g., “car,” “chair”), and the later ones add finer details (e.g., spoiler, armrests). Training uses “nested dropout,” which randomly keeps only a prefix of the tokens, forcing the early tokens to carry the most important information. This creates a natural coarse-to-fine order by meaning (a short code sketch of nested dropout follows this list).
- Decoding any prefix: A generative “diffusion” decoder takes however many tokens you have (even just 1) and fills in a complete, plausible 3D shape. With more tokens, the shape becomes more accurate and detailed.
- Bringing in “meaning” from 2D: Image models like DINO are very good at understanding what’s in a picture (semantics). To organize shapes by meaning, the authors invent RIDA (Relational Inter-Distance Alignment). Imagine a teacher saying: “In my world, car A is more similar to car B than to boat C.” RIDA trains a 3D “semantic extractor” so that distances between 3D shapes follow these same relationships (who is close to whom) without directly matching raw numbers. It’s like making the “friendship map” in 3D match the “friendship map” in 2D. This avoids expensive steps like constantly rendering 3D into images during training (see the relational-alignment sketch after this list).
- Generating shapes token-by-token: Finally, they train a GPT-style model (LoST-GPT) that predicts the next token in continuous space (not just picking from a fixed list) to generate 3D objects from prompts or images, efficiently and with high quality (a simplified sketch of this continuous next-token diffusion loss also follows this list).
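For readers who want to see roughly what the “nested dropout” trick looks like in code, here is a minimal PyTorch-style sketch; the tensor shapes and the uniform keep-length sampling are illustrative assumptions rather than the paper’s exact recipe.

```python
import torch

def nested_dropout(tokens, max_len=None):
    """Randomly keep only the first k register tokens of each sequence.

    tokens: (batch, num_tokens, dim), an ordered sequence of register tokens.
    Because only a random-length prefix survives training, the earliest tokens
    are forced to carry the most important (most semantic) information.
    """
    batch, num_tokens, _ = tokens.shape
    max_len = max_len or num_tokens
    # Sample a keep-length k in [1, max_len] per sample (uniform here; the
    # actual sampling distribution is a design choice, not taken from the paper).
    keep = torch.randint(1, max_len + 1, (batch,), device=tokens.device)
    # Keep positions < k, zero out the rest.
    positions = torch.arange(num_tokens, device=tokens.device).unsqueeze(0)
    mask = (positions < keep.unsqueeze(1)).unsqueeze(-1).float()
    return tokens * mask

# During tokenizer training, the prefix decoder only ever sees the surviving
# prefix, e.g. decoder(nested_dropout(register_tokens)).
```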
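The “friendship map” matching behind RIDA can be sketched in the same spirit. The version below only aligns z-scored pairwise similarity rows between the 3D embeddings and the teacher DINO embeddings; the paper’s full loss also includes contrastive (InfoNCE), rank, and spatial terms that are omitted here.

```python
import torch
import torch.nn.functional as F

def rida_relational_loss(student_emb, teacher_emb):
    """Align the relational structure of two embedding spaces.

    student_emb: (batch, d_s) 3D shape embeddings from the semantic extractor.
    teacher_emb: (batch, d_t) frozen DINO features for the same shapes.
    Instead of matching raw features, match who-is-close-to-whom: each
    anchor's similarity row is z-scored so only relative distances matter.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_s = s @ s.t()  # (batch, batch) student similarity map
    sim_t = t @ t.t()  # (batch, batch) teacher similarity map

    def zscore_rows(sim):
        mean = sim.mean(dim=-1, keepdim=True)
        std = sim.std(dim=-1, keepdim=True) + 1e-6
        return (sim - mean) / std

    return F.mse_loss(zscore_rows(sim_s), zscore_rows(sim_t))
```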
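Finally, the “predict the next token in continuous space” step can be approximated as below. This is a simplified MAR-style sketch: the linear noise schedule and the denoise_mlp signature are assumptions for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn.functional as F

def diffusion_next_token_loss(cond, target_token, denoise_mlp, num_steps=1000):
    """Diffusion loss for predicting one continuous next token.

    cond:         (batch, d) conditioning vector from the GPT backbone,
                  summarizing the previously generated tokens.
    target_token: (batch, d_tok) ground-truth next register token.
    denoise_mlp:  small network that predicts the added noise from
                  (noisy_token, timestep, cond).
    """
    batch = target_token.shape[0]
    t = torch.randint(0, num_steps, (batch,), device=target_token.device)
    # Illustrative linear alpha-bar schedule.
    alpha_bar = (1.0 - (t.float() + 1) / num_steps).clamp(min=1e-4).unsqueeze(-1)

    noise = torch.randn_like(target_token)
    noisy = alpha_bar.sqrt() * target_token + (1 - alpha_bar).sqrt() * noise

    pred_noise = denoise_mlp(noisy, t, cond)
    return F.mse_loss(pred_noise, noise)
```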
What did they find?
- Early tokens already matter: Even with as few as 1–4 tokens, LoST can produce a complete, recognizable shape that matches the object’s category. Adding more tokens adds personalized details.
- Better with fewer tokens: Compared to older “Level-of-Detail” methods (like OctGPT and VertexRegen), LoST reconstructs shapes more accurately—both in geometry and in meaning—while using only about 0.1%–10% of the tokens those methods need.
- Stronger 3D generation: The LoST-GPT model (which uses LoST tokens) outperforms other state-of-the-art autoregressive 3D methods on common quality metrics, while using far fewer tokens (e.g., 128 tokens). That means it’s faster and more efficient.
- Useful beyond generation: Because the tokens are organized by meaning, they also help with tasks like semantic shape retrieval—finding shapes with similar concepts (e.g., “find other submarine-like objects”) even when geometry differs.
Why these results matter:
- Metrics like Chamfer Distance (geometry), FID (overall visual realism), and DINO similarity (semantic closeness) all improved (a minimal Chamfer Distance sketch follows this list).
- Early “previews” are actually useful because the first tokens already give a complete, plausible object, not just a coarse blob.
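For context, Chamfer Distance simply averages nearest-neighbor distances between two point clouds in both directions. A minimal NumPy sketch (an illustration, not the paper’s evaluation code):

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between point clouds of shape (N, 3) and (M, 3).

    For every point in A, find its nearest neighbor in B (and vice versa),
    then sum the two averaged squared distances.
    """
    diff = points_a[:, None, :] - points_b[None, :, :]  # (N, M, 3)
    dist2 = np.sum(diff ** 2, axis=-1)                  # pairwise squared distances
    a_to_b = dist2.min(axis=1).mean()                   # nearest B-neighbor per A point
    b_to_a = dist2.min(axis=0).mean()                   # nearest A-neighbor per B point
    return float(a_to_b + b_to_a)
```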
Why this matters and what could happen next
- Faster, cheaper 3D creation: Since early tokens already give a complete shape, models can stop early for simple objects, saving time and compute.
- Better control and integration: Tokens ordered by meaning make it easier to guide models with text or images and to integrate with big language–vision models.
- Smarter 3D search: Because tokens reflect meaning, you can search for shapes by concept and get better matches.
- Future directions: The authors suggest adapting LoST to other 3D formats (like Gaussian splats), reducing reliance on diffusion decoders for speed, strengthening very-early tokens even more, and letting the model decide how many tokens it needs (adding an “end-of-sequence” marker).
In short, LoST flips the usual script: instead of refining shapes from coarse geometry to fine geometry, it starts from meaning and adds details afterward, making 3D generation more efficient, more understandable, and more useful.
Knowledge Gaps
Below is a concise, actionable list of what remains uncertain or unexplored in the paper, organized to guide future research.
- Representation dependence: LoST is instantiated on Direct3D’s triplane VAE latents only; it is unclear how the approach transfers to other 3D representations (e.g., Gaussian splats, NeRFs, neural SDFs, meshes/point clouds) and whether the same semantic prefix behavior emerges across them.
- Generative decoder reliance: The need for a diffusion decoder to reconstruct from short prefixes introduces additional compute and inference latency; the paper does not benchmark runtime, memory, or energy against pure AR decoding, fast feed-forward decoders, or distillation into lighter decoders.
- Continuous tokens vs. discrete codes: LoST-GPT models continuous tokens with a diffusion loss; the trade-offs vs. vector-quantized/discrete tokens (e.g., bits-per-shape, transmission/storage costs, robustness to AR errors, and training stability) are not assessed.
- Prefix consistency metrics: The claim of “any-prefix” semantic coherence is supported qualitatively, but there is no quantitative evaluation of prefix-to-prefix consistency (e.g., category stability, semantic churn, geometric deviation as tokens are added).
- Exposure bias and off-manifold risks: AR prediction of continuous tokens may yield off-manifold latents; the paper does not analyze robustness to accumulated prediction errors or propose safeguards (e.g., manifold regularizers, token denoising at inference).
- Token interpretability and controllability: It is unknown whether individual tokens correspond to interpretable semantic factors or parts; token-to-part mappings, attention visualizations, and token-level editing capabilities are unexplored.
- Variable-length AR generation: Although LoST produces variable-length codes, the AR model uses a fixed target length; training with EOS tokens and complexity-aware stopping, and measuring quality-speed trade-offs across variable lengths, remain open.
- Scalability and capacity: The effect of token count (up to 512) and token dimension (32-d) on fidelity and semantics is underexplored; scaling laws, diminishing returns, and how to allocate capacity across tokens are not analyzed.
- Data generation and domain bias: Training uses 300k synthetic shapes produced by a Flux→Direct3D pipeline; potential biases and failure modes inherited from the upstream image generator and 3D reconstructor are not quantified, and generalization to real scans/CAD collections (e.g., ScanNet, ABO, Objaverse-LVIS) is not evaluated.
- Evaluation fairness: AR baselines use Objaverse while LoST uses a synthetic dataset; differences in data scale/quality and task setup (text-to-3D vs. image-to-3D) complicate fairness. A standardized training/evaluation protocol is needed for apples-to-apples comparisons.
- Metrics coverage: Reconstruction is assessed via Chamfer, FID, and DINO similarity on renderings; topology (genus, manifoldness, self-intersections), normal consistency, surface roughness, and watertightness are not reported, nor are view-agnostic 3D shape metrics.
- Comparison to diffusion-based 3D SOTA: The paper benchmarks against AR baselines but not against strong diffusion/score-based 3D generators; relative quality, diversity, and compute costs versus modern diffusion pipelines remain unknown.
- Texture/material modeling: LoST focuses on geometry in triplane latents; handling of appearance (textures, materials) and joint geometry-appearance tokenization is not addressed.
- Scene-level extension: The approach targets single objects; how semantics-first tokenization scales to multi-object scenes, layouts, and interactions is untested.
- Few-token artifacts: While acknowledged, concrete analyses of failure modes at extreme compression (e.g., topology breaks, part omissions) and methods to mitigate them (e.g., topology-aware priors, part-consistency losses) are missing.
- RIDA teacher choice and bias: Semantics are distilled from 2D DINO features, introducing view and dataset biases; performance with alternative teachers (e.g., multi-view foundation models, 3D-aware teachers) and sensitivity to teacher errors are not examined.
- RIDA supervision without images: RIDA requires image-derived teacher features during training; applicability to datasets with 3D shapes but no associated images (without expensive rendering) remains unclear.
- Multi-view semantics: The paper sidesteps multi-view REPA due to cost, but relying on single-view DINO may bias semantics; how to incorporate affordable multi-view supervision or view-consistent semantics is open.
- RIDA hyperparameters and mining strategy: Positive/negative thresholds, batch composition, and the weights (λ) for global/rank/spatial losses are not ablated; sensitivity and stability across settings are unknown.
- Component ablations: Only a “w/ vs. w/o RIDA” ablation is provided; the individual contributions of global contrast, rank distillation, and spatial structure distillation are not isolated.
- Retrieval ground truth: Retrieval uses DINO similarity as “semantic” ground truth; human-labeled semantics or part-aware datasets are not used to validate that RIDA aligns to human perception rather than DINO idiosyncrasies.
- Conditioning modalities: The system is evaluated mainly in image-conditioned settings; the behavior in purely text-conditioned AR generation and cross-modal retrieval (text↔3D) is not fully characterized.
- Robustness to thin structures and complex topology: Performance on fine, fragile, or high-genus structures is not specifically evaluated; how semantics-first tokens preserve delicate parts remains an open question.
- Efficiency claims vs. wall-clock: Token efficiency is reported in token counts, but wall-clock training/inference time, GPU memory, and throughput comparisons to baselines are not provided.
- Alternative tokenization orderings: Whether reordering classical LoD tokens by learned semantic salience (e.g., learning a semantic traversal over octrees/meshes) could match LoST’s benefits is not investigated.
- Uncertainty calibration: As prefix length grows, the generative decoder transitions from sampling to reconstruction; explicit measures of uncertainty/diversity vs. fidelity across prefix lengths are not reported.
- Broader downstream tasks: Beyond retrieval and AR generation, the utility of LoST tokens for 3D classification, segmentation, alignment, or part-based editing is not tested.
Practical Applications
Immediate Applications
The following applications can be implemented with the methods and models described in the paper (LoST tokens, RIDA semantic extractor, diffusion-based decoder, LoST-GPT) with modest integration effort.
- Progressive 3D asset preview and streaming
- Sectors: software (3D tools), gaming/VFX, AR/VR, e-commerce
- Who: industry, daily life
- What: Use LoST’s prefix-decodable tokens to show an immediately recognizable, complete proxy from 1–4 tokens while progressively refining details as more tokens arrive. Enables low-latency previews in asset browsers, game engines, web viewers, and AR apps.
- Tools/workflows: glTF/USD pipeline extension to carry LoST token streams; web/mobile viewers that request additional tokens on demand; early-stop decoding in DCC tools (e.g., Blender, Unity, Unreal).
- Assumptions/dependencies: Requires LoST encoder/decoder runtime and a mesh/surface reconstruction step from triplanes; diffusion decoder incurs nontrivial compute; domain retraining may be needed for stylized assets.
- Interactive co-creative 3D modeling with “add-a-token” refinement
- Sectors: design/CAD, media/entertainment, education
- Who: industry, academia, daily life (prosumer creators)
- What: Start from a 1–2-token coarse but plausible shape from text/image, then add tokens to refine parts or instance-level details. Supports user-in-the-loop ideation and fast iteration.
- Tools/workflows: Plugins for DCCs exposing a token-count slider; prompt/image-conditioned LoST-GPT for initial shape, followed by token-wise refinement.
- Assumptions/dependencies: Requires a responsive diffusion decoder; UI affordances for token-level control; training aligned to the target design domain.
- Token-efficient 3D generation on edge devices
- Sectors: AR/VR, mobile apps, social media filters
- Who: industry, daily life
- What: Generate approximate 3D proxies from few tokens on laptops/phones for AR occlusion, quick scene prototyping, or filters; progressively refine when connected to the cloud.
- Tools/workflows: On-device LoST-GPT with reduced token budgets (e.g., 16–64); server-side refinement via full-length decode.
- Assumptions/dependencies: Mobile-friendly inference requires model distillation/quantization; triplane-to-mesh conversion must be efficient.
- Semantic 3D search and retrieval for asset libraries
- Sectors: gaming/VFX, e-commerce, archives/museums, enterprise content management
- Who: industry, academia
- What: Index libraries using RIDA embeddings to retrieve semantically similar shapes (e.g., “submarine shaped like a fish”) beyond geometric similarity; improves asset discovery, deduplication, and tagging.
- Tools/workflows: Embedding service for RIDA features; vector search (e.g., FAISS, Elasticsearch KNN) integrated with asset catalogs (see the indexing sketch below).
- Assumptions/dependencies: Needs pretraining of the RIDA student on representative data; retrieval quality depends on DINO teacher alignment to the target style/domain.
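A minimal sketch of such an indexing service, assuming precomputed RIDA embeddings (the file name, dimensionality, and normalization choice below are placeholders):

```python
import faiss
import numpy as np

# Hypothetical: one RIDA embedding per shape in the asset library.
shape_embeddings = np.load("rida_embeddings.npy").astype("float32")  # (num_shapes, d)

# Cosine-similarity search = inner product over L2-normalized vectors.
faiss.normalize_L2(shape_embeddings)
index = faiss.IndexFlatIP(shape_embeddings.shape[1])
index.add(shape_embeddings)

def retrieve(query_embedding, top_k=10):
    """Return indices and similarities of the top_k semantically closest shapes."""
    query = query_embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, top_k)
    return ids[0], scores[0]
```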
- Storage and bandwidth reduction via learned 3D compression
- Sectors: cloud platforms, collaboration tools, streaming, digital twins
- Who: industry, policy (sustainability, cost)
- What: Store and transmit 3D assets as compact LoST token sequences (e.g., 128 tokens) instead of large meshes/octrees; decode on demand.
- Tools/workflows: Repository format for tokenized shapes; background decode services; APIs to fetch tokens progressively.
- Assumptions/dependencies: Compression is model-dependent (decoder must be available); legal/IP considerations for storing model-dependent representations.
- Rapid quality control and content moderation for 3D uploads
- Sectors: marketplaces, social platforms
- Who: industry, policy
- What: Use early-token decodes for fast semantic screening (e.g., category, safety) before full decode; triage and prioritize moderation queues.
- Tools/workflows: Gatekeeping service that decodes 1–4 tokens and classifies via RIDA space; escalation to full decode if needed.
- Assumptions/dependencies: Moderation policy requires calibration for false positives/negatives; relies on DINO-aligned semantics reflecting platform norms.
- Dataset curation and active learning for 3D models
- Sectors: academia, AI/ML platform teams
- Who: academia, industry
- What: Use RIDA space to cluster by semantics, select diverse exemplars, and detect duplicates; improves training efficiency and coverage for 3D generative models (see the clustering sketch below).
- Tools/workflows: Clustering/coverage analytics in RIDA space; sampling utilities integrated into data pipelines.
- Assumptions/dependencies: Benefit depends on how well RIDA captures semantics in the domain; periodic re-indexing as datasets evolve.
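One plausible implementation of the clustering and duplicate detection above, assuming a matrix of precomputed RIDA embeddings (the file name, cluster count, and similarity threshold are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.load("rida_embeddings.npy")  # (num_shapes, d)

# Semantic clustering: sample exemplars across clusters for diverse coverage.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(embeddings)
cluster_labels = kmeans.labels_

# Near-duplicate detection: flag pairs with very high cosine similarity.
sim = cosine_similarity(embeddings)
np.fill_diagonal(sim, 0.0)
duplicate_pairs = np.argwhere(sim > 0.98)  # each pair appears twice, as (i, j) and (j, i)
```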
- Progressive 3D product viewing in e-commerce
- Sectors: retail, marketplaces
- Who: industry, daily life
- What: Show an instant recognizable 3D silhouette while detailed textures/geometry stream in; improves perceived performance and engagement on product pages.
- Tools/workflows: Web viewer embedding LoST decoder; CDN-backed token streaming.
- Assumptions/dependencies: Requires preprocessing of product models into LoST tokens; regulatory requirements for product fidelity may necessitate high-token final decode.
- Education: teaching hierarchical 3D semantics
- Sectors: education, outreach
- Who: academia, daily life
- What: Visualize how semantics accrete with tokens to teach shape abstraction, category-level features, and generative modeling.
- Tools/workflows: Interactive notebooks/demos with token-controlled decoding; course modules on 3D generative AI.
- Assumptions/dependencies: Requires accessible pretrained checkpoints and lightweight visualizers.
- Cost and energy savings in 3D pipelines
- Sectors: cloud ops, sustainability initiatives
- Who: industry, policy
- What: Fewer tokens reduce compute for AR training/inference and bandwidth for streaming; measurable cost and carbon reductions.
- Tools/workflows: Benchmarks tracking token budgets vs. cost/energy; cloud cost analyzers.
- Assumptions/dependencies: Savings depend on decoder efficiency and workload scale; organizational buy-in for measurement and reporting.
Long-Term Applications
These applications are feasible with further research, domain adaptation, engineering, or standardization (e.g., new datasets, representation support beyond triplanes, real-time constraints).
- Robotics: fast category-aware 3D understanding for manipulation and navigation
- Sectors: robotics, logistics, manufacturing
- Who: industry, academia
- What: Use few-token semantics to rapidly infer object category and affordances from partial scans; refine tokens as sensing continues to improve grasping and planning.
- Tools/workflows: On-robot LoST/RIDA encoders for point clouds; integration with perception/planning stacks; progressive shape completion.
- Assumptions/dependencies: Requires extension beyond triplanes to point clouds/meshes; training on real-world scans; low-latency, possibly non-diffusion decoders.
- Semantically ordered streaming standard for 3D formats
- Sectors: web/standards, software vendors
- Who: industry, policy
- What: Introduce a semantics-first “Level-of-Semantics” streaming profile (akin to LoD) for glTF/USD so clients can request tokens by semantic priority.
- Tools/workflows: Spec proposals, reference encoders/decoders, conformance tests.
- Assumptions/dependencies: Industry consensus; IP/licensing of learned tokenizers; on-device runtime feasibility.
- AEC/BIM viewers with semantics-first loading
- Sectors: architecture, engineering, construction
- Who: industry
- What: Load building models by semantic importance (e.g., structural elements first, then MEP details), enabling faster situational awareness and remote collaboration.
- Tools/workflows: BIM-to-LoST converters trained on AEC datasets; viewer support for token-based streaming.
- Assumptions/dependencies: Domain-specific semantics and part taxonomies; conversion from procedural BIM to triplanes or alternative representations.
- Healthcare: progressive anatomical model generation and retrieval
- Sectors: healthcare, medical imaging
- Who: industry, academia, policy
- What: Semantics-aware tokenization of anatomical structures for telemedicine previews, case retrieval, and education; early tokens give recognizable organ shapes, refined later for pathology details.
- Tools/workflows: Domain-adapted encoders for volumetric/mesh medical data; HIPAA/GDPR-compliant model hosting; clinician-in-the-loop tools.
- Assumptions/dependencies: Extensive domain-specific training and validation; regulatory approval; representation shift beyond triplanes; strict privacy controls.
- Scene-level, multi-object generation with semantic budgeting
- Sectors: simulation, gaming, digital twins
- Who: industry, academia
- What: Allocate tokens across objects by importance in a scene (e.g., hero assets vs. background) for scalable world generation and simulators.
- Tools/workflows: Scene schedulers that distribute token budgets; multi-object LoST-GPT; runtime early stopping per object.
- Assumptions/dependencies: Training on scene-level data; scheduling policies; hierarchical decoders that preserve global coherence.
- Cross-modal assistants that “think in tokens”
- Sectors: software, productivity, education
- Who: industry, daily life
- What: MLLMs plan 3D workflows by requesting semantic tokens (“generate 4 tokens for a chair, then add 8 for ornate legs”) to balance quality vs. latency.
- Tools/workflows: Toolformer-style APIs exposing token budgets; agentic UIs for CAD and content creation.
- Assumptions/dependencies: Tight MLLM–LoST integration; reliable token-to-part mapping and controllability.
- Insurance/claims and retail visualization
- Sectors: finance/insurance, retail
- Who: industry, daily life
- What: Quickly synthesize approximate 3D proxies from phone photos for triage or quotes; refine on the backend for detailed assessment or product configuration.
- Tools/workflows: Mobile capture apps with on-device few-token decode; backend refinement services.
- Assumptions/dependencies: Domain-specific training (damage, wear); calibration for risk; human oversight for adjudication.
- On-device AR occlusion and interaction with proxy geometry
- Sectors: AR/VR, consumer tech
- Who: industry, daily life
- What: Real-time few-token reconstructions provide proxy meshes for occlusion and physics, improving AR stability without cloud round trips.
- Tools/workflows: Real-time LoST encoders; non-diffusion or lightweight decoders; integration with ARKit/ARCore.
- Assumptions/dependencies: Real-time constraints; sensor-noise robustness; hardware acceleration.
- Sustainability metrics and procurement policy for 3D AI
- Sectors: policy, sustainability, enterprise IT
- Who: policy, industry
- What: Incorporate token efficiency and progressive decoding into green procurement and reporting (e.g., lower energy per generated asset).
- Tools/workflows: Standardized benchmarks for token-per-quality; reporting frameworks.
- Assumptions/dependencies: Agreement on metrics; independent auditing; mapping token budgets to energy usage.
- Domain-general 3D foundation models with semantics-first tokens
- Sectors: broad AI ecosystem
- Who: academia, industry
- What: Build unified 3D backbones that serve generation, recognition, and retrieval via LoST/RIDA, enabling plug-and-play across tasks and sectors.
- Tools/workflows: Large-scale pretraining on diverse 3D corpora; adapters for alternative 3D representations (splats, SDFs, point clouds).
- Assumptions/dependencies: Data availability and licensing; training cost; community benchmarks for fair comparison.
Notes on key dependencies and assumptions
- Representation: The current system is instantiated on triplane latents from a VAE; extending to meshes, point clouds, Gaussians, or medical volumes will require engineering and retraining.
- Decoder compute: The diffusion decoder adds runtime cost; for latency-critical use cases, research into lighter decoders or hybrid AR+diffusion is needed.
- Semantics source: RIDA depends on DINO features mined from image renderings; domain mismatch with target sectors may require teacher adaptation (e.g., medical or AEC teachers).
- Quality control: Early-prefix decodes are plausible but can exhibit artifacts at very low token counts; UIs and workflows should reflect uncertainty and allow refinement.
- IP and storage: Learned token representations require model availability for decode; plan for versioning, licensing, and long-term accessibility.
Glossary
- any-prefix generation: Generating partial outputs from early token prefixes that should be usable during decoding. "Consequently, `any-prefix generation' produces unusable shape intermediates"
- autoregressive (AR) models: Generative models that predict the next element in a sequence conditioned on previous ones. "it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation."
- causal masking: An attention masking scheme that prevents a token from attending to future tokens, enforcing autoregressive order. "Nested token dropout and causal masking are employed"
- categorical cross-entropy loss: A standard loss for discrete classification; here contrasted with a diffusion loss for continuous tokens. "Rather than using a categorical cross-entropy loss, we adopt a diffusion loss"
- Chamfer Distance (CD): A geometric distance measure between point sets used to assess reconstruction fidelity. "We report the Chamfer Distance (CD) for geometric"
- classifier-free guidance: A sampling technique that mixes conditional and unconditional predictions to steer generative models. "We employ a random dropout rate of 0.1 for classifier-free guidance"
- cosine similarity: A similarity measure between vectors based on the cosine of the angle between them. "maximizes the cosine similarity between 's predicted latent and the ground-truth latent ."
- Diffusion-Transformer (DiT) model: A diffusion-based generative model implemented with a Transformer backbone. "we train a Diffusion-Transformer (DiT) model to reproduce the full signal"
- diffusion decoder: A generative decoder that uses a diffusion process to reconstruct outputs from embeddings or tokens. "We use a diffusion decoder to produce the final latents from the AR generated tokens"
- diffusion loss: The denoising objective used to train diffusion models; here used for next-token prediction in continuous space. "we adopt a diffusion loss~\cite{ddpm} following MAR~\cite{ar_diff_loss}"
- DINO: A self-supervised Vision Transformer feature space used as a semantic teacher. "distance between its intermediate features and the DINO features of the original image."
- DINOv2: An improved version of DINO providing stronger visual features used as a frozen teacher. "a frozen DINOv2 ViTB14 teacher"
- edge collapse: A mesh simplification operation used in progressive representations; its reverse defines a refinement order. "learns vertex splits (i.e., reverse edge collapse ordering)"
- exposure bias: The compounding of errors in autoregressive generation due to training on ground-truth histories. "Such 1D-code streams amplify quadratic attention costs and exposure bias"
- FID: Fréchet Inception Distance, measuring distributional similarity between sets of images. "We compute the FID score to measure the distributional alignment between the generated shape renderings and the target shape renderings."
- FlexTok: An image tokenizer that orders tokens by semantic importance and enables variable-length decoding. "We draw inspiration from the recent Flextok~\cite{flextok} and Semanticist~\cite{semanticist} works"
- GPT-style Transformer: A decoder-only Transformer configured for autoregressive generation. "we ... train a GPT-style Transformer, following the standard setup of LlamaGen"
- InfoNCE: A contrastive learning objective that pulls positives together and pushes negatives apart. "We adopt a multi-positive InfoNCE loss"
- latent space: A compact, continuous representation space learned by an encoder (e.g., a VAE) for shapes. "Following common practice in the field, we start from VAE-encoded 3D shapes, which provide a smooth and compact latent space."
- Level-of-Detail (LoD): A geometric hierarchy that refines representations from coarse to fine spatial detail. "geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression."
- Level-of-Semantics Tokenization (LoST): A tokenizer that orders tokens by semantic importance so early prefixes decode into complete, plausible shapes. "We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience"
- Matryoshka representation: A nested embedding scheme where prefixes are usable and progressively refined. "Matryoshka representation~\cite{kusupati2022matryoshka} learns nested and prefix-usable embeddings."
- MaskGIT: An iterative masked token decoder that accelerates generation by parallel refinement. "MaskGIT~\cite{Chang_2022_CVPR} introduces iterative masked decoding for rapid refinement"
- nested dropout: A training strategy that randomly truncates token prefixes to enforce hierarchical ordering of information. "we use nested dropout~\cite{nested} to enforce earlier tokens to capture the principal semantics"
- octrees: Hierarchical spatial data structures that subdivide 3D space into octants for multiscale representation. "octrees~\cite{octgpt,Wang2017_OCNN}"
- OpenCLIP: An open implementation of CLIP used for conditioning generative models with text/image embeddings. "For conditional generation, we utilize OpenCLIP~\cite{openclip,Radford2021LearningTV} embeddings"
- patchification: Splitting inputs into fixed-size patches before feeding them to a Transformer or CNN. "Both models utilize patchification."
- perceptual loss: A loss computed in a learned feature space to align high-level semantics rather than pixels/geometry. "we employ it as a perceptual loss to guide the diffusion generator ."
- prefix decoder: A decoder trained to reconstruct outputs from any prefix length of the token sequence. "a prefix decoder is jointly trained to reconstruct the triplane latent features from any prefix length."
- progressive meshes: A mesh representation that supports continuous levels of detail via incremental refinement. "progressive meshes~\cite{vertexregen}"
- quadratic attention costs: The O(n²) computational complexity of full self-attention with respect to sequence length. "Such 1D-code streams amplify quadratic attention costs and exposure bias"
- register tokens: Learnable tokens not tied to specific patches that aggregate and reorder information within a Transformer. "we introduce a new set of register~\cite{darcet2023vision} tokens"
- Relational Inter-Distance Alignment (RIDA): A semantic alignment objective that matches relative distances in 3D latents to a teacher feature space. "we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss"
- Relational Knowledge Distillation (RKD): A distillation approach that transfers relational (pairwise) structure rather than absolute features. "inspired by Relational Knowledge Distillation (RKD)"
- Representation Alignment (REPA): A loss aligning internal representations of a generator with features from a semantic teacher (e.g., DINO). "employ an important semantic alignment loss -- REPA --"
- semantic salience: The relative importance of tokens according to the semantic content they convey. "orders tokens by semantic salience"
- sinusoidal positional embeddings: Deterministic position encodings injected into Transformers to provide token order information. "2D sinusoidal positional embeddings."
- token bloat: Inefficiency from requiring many tokens to represent coarse structures, harming AR training. "(i)~token bloat at coarse scale"
- tokenization: Converting data (e.g., shapes) into discrete or continuous tokens for sequence modeling. "Tokenization is a fundamental technique in the generative modeling of various modalities."
- triplane: A 3D representation encoding volumetric information as three orthogonal feature planes. "a triplane of size "
- VAE: Variational Autoencoder, an encoder–decoder model that learns a compact stochastic latent space. "we start from VAE-encoded 3D shapes"
- ViT: Vision Transformer, an attention-based architecture operating on image or patch tokens. "we employ a ViT~\cite{vit}-based encoder on patchified triplanes"
- z-scoring: Standardizing values to zero mean and unit variance to compare relational patterns across modalities. "by standardizing (z-scoring) each anchor's similarity row"
- rank distillation: Matching the relative ordering/structure of similarities from a teacher space rather than their absolute values. "we introduce the inter-instance rank distillation loss"