Quantized Visual Geometry Grounded Transformer (2509.21302v1)
Abstract: Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Yet their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. It rests on two technical contributions: first, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance; second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT delivers a 3.7$\times$ memory reduction and 2.5$\times$ acceleration in real-hardware inference while maintaining reconstruction accuracy above 98\% of its full-precision counterpart, demonstrating the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
Explain it Like I'm 14
What is this paper about?
This paper is about making a very large 3D vision AI model (called VGGT) much smaller and faster so it can run on real devices without losing much accuracy. VGGT turns a short video (several pictures of the same scene) into useful 3D information like camera positions, depth maps, and 3D points. The problem is: the model is huge (about 1.2 billion parameters), so it’s slow and needs a lot of memory. The authors introduce QuantVGGT, a way to shrink (quantize) VGGT to 4-bit numbers while keeping it accurate.
What questions did the researchers ask?
- How can we compress a giant 3D transformer (VGGT) so it uses much less memory and runs faster, but still gives almost the same results?
- Why do usual compression tricks (post-training quantization) struggle on 3D models like VGGT?
- Can we fix those problems with new techniques that understand how VGGT processes multi-view 3D data?
How did they do it?
The authors focus on post-training quantization (PTQ). Think of PTQ like turning detailed measurements (floating-point numbers) into shorter notes (small integers) to save space and speed up math. But doing this naively can ruin accuracy if some numbers are weirdly large or if the example data you use to “set your dials” is not representative.
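To make this concrete, here is a minimal sketch of plain symmetric quantization in NumPy. It is illustrative only, not the paper's scheme (which adds rotation, smoothing, and finer granularity); the function names and the 4-bit setting are our own choices. It also shows why a single outlier ruins a shared scale.

```python
# Minimal sketch of symmetric post-training quantization (illustrative,
# not the paper's implementation). One scale maps floats to b-bit integers;
# a bad calibration range yields a bad scale for everything.
import numpy as np

def symmetric_quantize(x: np.ndarray, bits: int = 4):
    """Quantize x to signed integers using one shared scale."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit
    scale = np.abs(x).max() / qmax        # range set by the largest value
    x_q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return x_q.astype(np.int8), scale

def dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    return x_q.astype(np.float32) * scale

# A single huge outlier stretches the scale, so the "normal" values
# collapse into a few integer bins -- the heavy-tail problem the paper targets.
x = np.random.randn(1024).astype(np.float32)
x[0] = 50.0                               # outlier, e.g. from a special token
x_q, s = symmetric_quantize(x, bits=4)
print("mean abs error (non-outlier values):",
      np.abs(dequantize(x_q, s) - x)[1:].mean())
```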
The problem with making a huge 3D model smaller
- VGGT uses special “tokens” (extra learned vectors for cameras and registration) that aren’t directly from the input images. These tokens can create heavy-tailed activations—imagine most values are small, but a few are extremely large. That makes it hard to choose good quantization ranges.
- 3D input is multi-view (many different frames). If you pick the wrong few examples to calibrate the quantizer, you’ll set bad ranges and accuracy will drop on new scenes.
To fix these, the paper proposes two main techniques.
Trick 1: Dual-Smoothed Fine-Grained Quantization (DSFQ)
Goal: make the numbers more “well-behaved” before turning them into low-bit integers.
- Pre-global rotation (Hadamard transform):
- Analogy: If a few students shout while everyone else whispers, it’s hard to set the right microphone volume. A Hadamard rotation is like redistributing sound so each mic hears a more even mix. Mathematically, it “mixes” values across channels without changing the final result, spreading out extreme values so the distribution is smoother.
- Post-local smoothing (channel-wise scaling in the rotated space):
- Analogy: After mixing, some mics may still be a bit louder than others. Channel-wise scaling acts like individual volume knobs to balance each channel so none dominates.
- Fine-grained quantization granularity:
- Instead of using one shared scale for everything, they use finer-grained scales (for example, per output row for weights or per token for activations) where it doesn’t slow down the hardware. This reduces error because each small group gets its own “best-fit” scale.
Together, this dual-smoothing plus fine-grained setup significantly lowers quantization error with almost no extra runtime cost; a minimal sketch of the whole pipeline follows.
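The sketch below mirrors the DSFQ recipe under stated assumptions: a Sylvester-constructed Hadamard matrix (so the hidden size must be a power of two), a simple max-based channel scale standing in for the paper's α-balanced scale (α = 0.5 in the paper), and token-wise 4-bit quantization. All names and shapes are illustrative, not the authors' code.

```python
# Hedged sketch of DSFQ: rotate activations with a normalized Hadamard
# matrix to spread outliers, balance channels with per-channel scales,
# then quantize per token.
import numpy as np

def hadamard(d: int) -> np.ndarray:
    """Sylvester construction; d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)                  # orthogonal: H @ H.T = I

def dsfq_preprocess(X: np.ndarray):
    """X: (tokens, channels). Returns rotated + smoothed activations."""
    H = hadamard(X.shape[1])
    X_rot = X @ H                           # pre-global rotation (lossless)
    s = np.abs(X_rot).max(axis=0) + 1e-8    # post-local channel scales
    return X_rot / s, s, H

def quantize_per_token(X: np.ndarray, bits: int = 4):
    """Token-wise symmetric quantization: one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(X).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(X / scales), -qmax - 1, qmax), scales

# Because (X H)(H^T W) = X W when H is orthogonal, the rotation (and the
# channel scales) can be folded into the weights, so smoothing is
# mathematically lossless before quantization.
X = np.random.randn(16, 128).astype(np.float32)
X[:, 0] *= 40.0                             # one heavy-tailed channel
X_smooth, s, H = dsfq_preprocess(X)
X_q, tok_scales = quantize_per_token(X_smooth, bits=4)
```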
Trick 2: Noise-Filtered Diverse Sampling (NFDS)
Goal: pick better calibration samples so the quantizer sets fair ranges.
- Filter out noisy outliers:
- Analogy: If you’re setting the thermostat for a school, you don’t pick a weird day with a sudden heat spike. They analyze activations in deeper layers (which are more informative) and remove samples that look unusually extreme across many statistics.
- Frame-aware diverse clustering:
- 3D scenes depend on how frames relate to the first frame (VGGT has a strong “first vs. others” design). The authors compute how similar each frame is to the first frame, then cluster samples by this pattern. Finally, they pick a balanced set from all clusters. This ensures the calibration set is both representative and diverse; a sketch of this selection logic follows below.
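The sketch below mirrors the NFDS selection logic under assumptions: a median-based deviation score stands in for the paper's "global robust moments," and the keep fraction, cluster count, and per-cluster quota are made-up illustrative values (the paper defers its exact hyperparameters to the appendix). It uses scikit-learn's KMeans.

```python
# Hedged sketch of Noise-Filtered Diverse Sampling: drop statistically
# extreme calibration candidates, then pick a balanced set across clusters
# of frame-aware correlation vectors. Illustrative, not the authors' code.
import numpy as np
from sklearn.cluster import KMeans

def noise_score(acts: np.ndarray) -> np.ndarray:
    """acts: (samples, features) deep-layer activations per candidate.
    Scores how far each sample's mean/std deviate from robust global moments."""
    mu, sd = acts.mean(axis=1), acts.std(axis=1)
    z_mu = np.abs(mu - np.median(mu)) / (np.median(np.abs(mu - np.median(mu))) + 1e-8)
    z_sd = np.abs(sd - np.median(sd)) / (np.median(np.abs(sd - np.median(sd))) + 1e-8)
    return z_mu + z_sd

def select_calibration(acts, frame_corr, keep_frac=0.8, k=8, per_cluster=2):
    """frame_corr: (samples, f-1) similarity of each frame to the first frame."""
    # 1) filter: keep the least noisy samples by deep-layer statistics
    keep = np.argsort(noise_score(acts))[: int(keep_frac * len(acts))]
    # 2) cluster the survivors by their frame-aware correlation vectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(frame_corr[keep])
    # 3) draw a balanced quota from every cluster for diversity
    chosen = [keep[np.where(labels == c)[0][:per_cluster]] for c in range(k)]
    return np.concatenate(chosen)

# e.g. 200 candidate multi-view samples, 7 non-anchor frames each
acts = np.random.randn(200, 512)
frame_corr = np.random.rand(200, 7)
calib_idx = select_calibration(acts, frame_corr)
```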
What did they find?
- With 4-bit weights and 4-bit activations (W4A4), QuantVGGT achieves about:
- 3.7× less memory usage
- 2.5× faster inference on real hardware
- More than 98% of the original model’s accuracy on key 3D tasks
- On two well-known 3D benchmarks:
- Co3Dv2 (camera pose estimation): QuantVGGT holds accuracy very close to the full-precision model even at 4-bit, and clearly outperforms other popular quantization methods.
- DTU (point map estimation): QuantVGGT remains strong even when the calibration data comes from a different dataset, showing good generalization.
- The new methods (DSFQ and NFDS) together beat prior quantization baselines. Rotation alone or scaling alone helps, but combining both with fine-grained quantization works best.
Why is this important?
- It makes state-of-the-art 3D understanding practical on devices with limited compute and memory, such as drones, AR/VR headsets, robots, and phones.
- Faster and smaller models save energy and reduce cost while keeping quality high.
- The ideas—smoothing extreme activations and choosing smarter calibration samples—could help compress other large multi-view or special-token transformers, not just VGGT.
In short
- Problem: VGGT is powerful for 3D tasks but too big and slow for many real uses.
- Key insight: Special tokens and multi-view data create tricky number patterns and unstable calibration.
- Solution: Dual-Smoothed Fine-Grained Quantization (mix then balance) + Noise-Filtered Diverse Sampling (filter outliers, then cluster by frame relationships).
- Result: 3.7× smaller, 2.5× faster, and still over 98% as accurate, beating other methods.
- Impact: Enables practical, high-quality 3D vision on everyday hardware.
Knowledge Gaps
Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research.
- Generalization beyond VGGT-1B: The method is only validated on VGGT-1B; it is unclear how DSFQ and NFDS transfer to other 3D transformers, smaller/larger VGGT variants, or architectures without special tokens.
- Coverage of VGGT’s full task suite: VGGT predicts depth, point maps, camera pose, and point tracking, but experiments assess only camera pose and point map; the impact on dense depth estimation and point tracking remains unknown.
- Quantization of non-linear transformer components: The paper does not specify whether and how operations like LayerNorm, Softmax attention, residual scaling, and positional encodings are quantized (or left in full precision), limiting reproducibility and clarity on what drives gains.
- Robustness under variable sequence lengths: NFDS relies on frame-aware correlation vectors anchored to the first frame; the behavior when the number of frames varies at inference/calibration time (e.g., variable f, missing first-frame semantics) is not assessed.
- Choice and sensitivity of DSFQ hyperparameters: The balancing factor α is set to 0.5 without sensitivity analysis; how performance varies with α, alternative scaling statistics (e.g., percentile vs max), and per-layer tuning is not explored.
- Per-layer quantization behavior: DSFQ’s effect across different layer depths/blocks and components (e.g., MLP vs attention projections) is not characterized, offering no guidance on which parts benefit most or fail under low-bit quantization.
- Granularity design ablation: The paper adopts out-dimension-wise weight quantization and token-wise activation quantization, but does not compare against other granularities (per-channel, per-head, per-group) to justify the chosen scheme.
- Extremely low-bit regimes: Results are reported for W8A8 and W4A4; feasibility and failure modes at W3A3/W2A2 (or mixed-precision regimes) are unexamined.
- Special-token handling alternatives: The approach mitigates heavy tails via Hadamard rotation and smoothing; alternatives such as mixed-precision for special tokens, token reparameterization, or token gating are not investigated.
- Hadamard rotation practicality: Implementation details for non-power-of-two dimensions (padding, block-wise Hadamard, or alternative orthogonal transforms) and their impact on accuracy/latency are not provided.
- Quantization error theory: The “central limit effect” lemma is stated informally; formal conditions under which Hadamard rotation yields Gaussianization and quantization error bounds are not established, limiting theoretical guarantees.
- Attention-range calibration: How NFDS and DSFQ stabilize quantization ranges for attention logits/Softmax (which are highly sensitive) is not described; potential underflow/overflow and mitigation strategies are unclear.
- Calibration pipeline specification: The size of the calibration set, number of clusters K, percentile threshold T, layer set L, and the computational overhead of NFDS (feature extraction + K-Means) are not quantified, obstructing practical replication.
- Domain shift robustness: Calibration is done on Co3Dv2 while evaluation includes DTU, but a systematic study of cross-dataset/domain shifts (e.g., indoor/outdoor, occlusions, motion blur, extreme lighting) is missing.
- Outlier handling at inference: NFDS filters outliers during calibration, yet the method’s robustness when encountering unseen or extreme outliers at inference time is not evaluated.
- Hardware and kernel details: Reported 2.5× speedup and 3.7× memory reduction lack explicit hardware specs (GPU/CPU/NPU/ISA), integer GEMM kernels, activation caching strategies, and memory measurement protocol (batch size, sequence length), limiting portability and reproducibility.
- Integer-only execution scope: The claim of avoiding dequantization by “sharing quantization parameters across the summation” does not detail how per-token activation scales are implemented in GEMM without incurring extra conversions; practical kernel designs and constraints are not shown.
- Mixed-precision opportunities: The paper does not evaluate mixed schemes (e.g., higher precision for special tokens, attention logits, LayerNorm) that could further stabilize low-bit inference.
- Failure-case analysis: No qualitative or per-scene error analysis (e.g., failure modes under textureless surfaces, specularities, fast motion, large parallax) is provided to understand residual weaknesses.
- Energy and deployment metrics: Latency is reported, but energy consumption, throughput under memory bandwidth constraints, and end-to-end system-level performance in edge devices are not measured.
- Fairness and tuning parity of baselines: It is unclear whether baseline methods (e.g., SmoothQuant, QuaRot, GPTQ) received task-specific tuning or optimal calibrations for 3D models; potential under-tuning may bias comparisons.
- Reproducibility constraints: Key implementation details are deferred to the appendix and code release is “upon publication,” making independent verification difficult at present.
Practical Applications
Immediate Applications
Below is a concise set of actionable use cases that can be deployed now, grounded in the paper’s findings and released code, and mapped to likely sectors. Each item highlights potential tools/workflows and the assumptions or dependencies that affect feasibility.
- Robotics (SLAM, navigation, mapping)
- Application: Deploy VGGT quantized to W4A4 on embedded platforms (e.g., NVIDIA Jetson, Qualcomm/Apple NPUs) to run real-time multi-view SLAM and 3D mapping with 2.5× speedup and 3.7× memory reduction while maintaining ~98% accuracy.
- Tools/Workflows: Integrate the Dual-Smoothed Fine-Grained Quantization (DSFQ) into the inference graph, perform NFDS calibration with representative scenes, compile INT kernels via TensorRT/TVM, fuse scale factors as suggested by the paper.
- Assumptions/Dependencies: Availability of INT4/INT8 acceleration; access to multi-view camera streams; target domain similar to calibration data; adaptation effort to real-time pipelines.
- AR/VR and XR (inside-out tracking, room-scale reconstruction)
- Application: On-device dense depth and point map reconstruction in headsets or smartphones, reducing latency and power consumption for stable tracking and spatial anchoring.
- Tools/Workflows: Use NFDS to build a small, representative calibration set from typical indoor scenes; deploy DSFQ-quantized VGGT via mobile inference runtimes; fuse DSFQ’s scales to keep runtime overhead near-zero.
- Assumptions/Dependencies: Mobile runtime support for low-bit matmul; consistent multi-view inputs; calibration data reflecting indoor lighting/texture variability.
- Drones/UAVs (real-time surveying and inspection)
- Application: Real-time camera pose estimation and point map generation on resource-limited flight controllers, enabling faster coverage and longer endurance due to lower compute and memory.
- Tools/Workflows: Pre-flight NFDS calibration from known terrains (e.g., construction site, farmland); deploy DSFQ-quantized VGGT on edge modules; pair with GPS/IMU for robustness.
- Assumptions/Dependencies: Hardware support for low-bit inference on drone compute; reliable multi-view capture; domain generalization from calibration scenes.
- Consumer 3D scanning and photogrammetry apps
- Application: Improve responsiveness and battery life in mobile apps capturing rooms/objects for e-commerce, home improvement, and digital twins, with quantized VGGT replacing heavier pipelines.
- Tools/Workflows: App-side NFDS calibration wizard (captures a short, diverse video of target environments); push DSFQ-quantized model updates; export meshes/point clouds via standard formats.
- Assumptions/Dependencies: Smartphone NPUs/GPUs with suitable integer kernels; user consent and storage limits; robust handling of diverse lighting and motion blur.
- Cloud cost and energy reduction for 3D pipelines
- Application: Reduce inference cost and carbon footprint for large-scale 3D reconstruction jobs (e.g., content creation, digital-twin generation) by switching full-precision VGGT to W8A8/W4A4 with minimal accuracy loss.
- Tools/Workflows: Batch NFDS calibration for typical client domains; DSFQ quantization of VGGT in PyTorch/ONNX; track accuracy vs. cost via ML Ops dashboards.
- Assumptions/Dependencies: Cloud accelerators with low-bit support; end-user tolerance for small accuracy differences; monitoring for domain drift.
- Academic research enablement (lower barrier to experimentation)
- Application: Run billion-parameter VGGT variants on commodity GPUs for teaching and experimentation (depth estimation, pose prediction, tracking) with quantization reducing memory and latency.
- Tools/Workflows: Incorporate the authors’ released code; adopt NFDS to create small calibration sets per dataset; reproduce benchmarks and ablations with reduced compute.
- Assumptions/Dependencies: Access to VGGT weights and licenses; students’ hardware supports INT8/INT4; domain-specific calibration is performed.
- Privacy-preserving, on-device processing (Policy and compliance)
- Application: Keep multi-view visual data on-device (instead of uploading to cloud) by leveraging efficient quantized inference, improving compliance with data-protection laws.
- Tools/Workflows: NFDS calibration performed locally; DSFQ quantization compiled to the device runtime; implement privacy controls and model cards documenting quantization performance.
- Assumptions/Dependencies: Local hardware acceleration; legal review for edge processing; robust logging to satisfy audit trails.
- Daily life consumer use (capturing personal spaces/objects)
- Application: Faster, more battery-friendly 3D capture for home remodel planning, furniture fitting, and DIY projects using quantized VGGT in camera apps.
- Tools/Workflows: Guided NFDS calibration mode “scan your room in 30 seconds”; DSFQ model updated via app; offline export for AR visualization.
- Assumptions/Dependencies: Device support; user willingness to run calibration; variability in home environments and lighting conditions.
Long-Term Applications
Below are applications that may require additional research, scaling, hardware standardization, or regulatory validation before deployment.
- Hardware–algorithm co-design for INT4/INT8 3D transformers
- Application: Native kernels/hardware units that fuse Hadamard rotation and per-channel scaling, reducing overhead and making DSFQ a first-class primitive in compilers/runtimes.
- Tools/Workflows: Co-design with silicon vendors; add DSFQ passes to compilers (TVM/XLA); standardized INT4 ops across CUDA, Metal, and NNAPI.
- Assumptions/Dependencies: Industry-wide support for INT4; compiler changes; vendor buy-in and benchmarking.
- Adaptive, streaming calibration for distribution shifts
- Application: NFDS extended to online, continual calibration for changing environments (weather, lighting, scene types), maintaining accuracy on-device over time.
- Tools/Workflows: Rolling NFDS buffers; automatic outlier filtering via deep-layer statistics; K-Means updates to cluster assignments; watchdogs for accuracy drift.
- Assumptions/Dependencies: Safe on-device metrics; storage and compute budgets; robust fallback to full precision or higher bit-width during anomalies.
- Generalization to other 3D foundation models with special tokens
- Application: Apply DSFQ+NFDS to multi-task 3D models beyond VGGT (e.g., newer architectures with data-independent tokens), creating a standard quantization recipe for 3D transformers.
- Tools/Workflows: Model-agnostic DSFQ modules; token-aware statistics collectors; sector-specific calibration suites.
- Assumptions/Dependencies: Access to model internals; matching inductive biases (frame-relative modeling); empirical validation across tasks.
- Integrated AR cloud and mobile ecosystems
- Application: Hybrid workflows where quantized on-device VGGT handles real-time capture and the cloud refines models asynchronously, balancing privacy, latency, and fidelity.
- Tools/Workflows: Edge–cloud handoff policies; configurable bit-widths per stage; ML Ops for calibration lifecycle management.
- Assumptions/Dependencies: Reliable connectivity; interoperable formats; consistent scene semantics between edge and cloud.
- Healthcare (endoscopy, surgical planning, medical training)
- Application: Real-time 3D reconstruction from multi-view medical imagery on low-power devices for navigation and training, reducing hardware complexity in operating rooms.
- Tools/Workflows: Domain-specific NFDS calibration from approved datasets; rigorous validation under clinical protocols; hardware with deterministic low-bit behavior.
- Assumptions/Dependencies: Regulatory approval; clinical-grade reliability; careful bias and safety audits; collaboration with medical device manufacturers.
- Industrial inspection and metrology
- Application: On-device 3D mapping of factory floors, pipelines, and assemblies with quantized VGGT, enabling frequent inspections without heavy compute.
- Tools/Workflows: NFDS calibrated to materials/lighting of industrial sites; DSFQ compiled for ruggedized edge computers; tie-ins to maintenance software.
- Assumptions/Dependencies: Harsh environment robustness; reflective/textureless surfaces; integration with existing asset-management systems.
- Standards and policy for “Green AI” quantization practices
- Application: Incorporate quantization (e.g., W8A8/W4A4) into procurement and compliance guidelines to reduce energy footprints in public-sector and enterprise AI deployments.
- Tools/Workflows: Model cards reporting energy/accuracy trade-offs; audits requiring NFDS-like calibration transparency; benchmarking standards for 3D models.
- Assumptions/Dependencies: Policymaker adoption; sector-specific thresholds for accuracy; agreed-upon measurement protocols.
- Education and workforce development
- Application: Curriculum on efficient 3D perception that includes DSFQ/NFDS, preparing students and engineers to deploy large 3D models on commodity hardware.
- Tools/Workflows: Teaching modules; lab exercises reproducing W8A8/W4A4 performance; open datasets and code exercises from the authors’ repository.
- Assumptions/Dependencies: Continued open-source access; hardware availability in classrooms; alignment with evolving 3D model designs.
Glossary
- AUC@30: A task-specific area-under-curve metric thresholded at 30 (often degrees) for camera pose evaluation. "with AUC@30 of 89.4 and 89.5 for FP (Full Precision)."
- Bit-width: The number of bits used to represent weights or activations in quantized models. "across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin."
- Block-wise reconstruction: A PTQ strategy that minimizes reconstruction error at the block level to improve quantized accuracy. "BRECQ builds the block-wise reconstruction framework,"
- Calibration clusters: Groupings of samples used during PTQ calibration to ensure representative coverage of data modes. "constructs frame-aware diverse calibration clusters"
- Calibration dataset: A small set of data used to tune quantization parameters without retraining model weights. "a relatively small calibration dataset"
- Calibration set: The selected subset of samples ultimately used during calibration. "a small calibration set"
- Camera token: A special token appended to frame tokens to aggregate global camera-related information. "augments each frame with one camera token and four register tokens"
- Central limit effect: The tendency of aggregated random variables to approximate a Gaussian distribution, used to justify distribution smoothing. "Due to the central limit effect, the distribution of values after Hadamard rotation tends to approximate a Gaussian,"
- Channel-wise scale: Per-channel scaling factors applied to balance activation/weight magnitudes before quantization. "we introduce a channel-wise scale to normalize the internal channel distributions:"
- Co3Dv2: A dataset for evaluating camera pose and 3D reconstruction performance. "on Co3Dv2 dataset"
- Data-independent special tokens: Learned tokens injected regardless of input that can skew activation statistics. "data-independent special tokens (camera and register tokens)"
- Dense depth estimation: Predicting a per-pixel depth map for an image or frame. "including dense depth estimation, point map regression, camera pose prediction, and point tracking"
- DINOv2: A pretrained vision backbone used for tokenizing frames. "such as DINOv2"
- DTU: A dataset used for point map estimation and 3D benchmarking. "DTU dataset"
- Dual-Smoothed Fine-Grained Quantization (DSFQ): The paper’s rotate-and-scale PTQ scheme combining global Hadamard rotation and local channel smoothing with fine granularity. "Dual-Smoothed Fine-Grained Quantization (DSFQ)"
- Fine-Grained Quantization Granularity: Using smaller quantization groups (e.g., per-token or per-output-channel) to reduce quantization error. "Fine-Grained Quantization Granularity."
- Frame-aware correlation vector: A compact representation measuring similarity between the first frame and subsequent frames to guide clustering. "construct a compact frame-aware correlation vector"
- Full-precision: The original floating-point model representation before quantization. "full-precision counterpart."
- Hadamard matrix: An orthogonal matrix with entries ±1 used here to rotate features without changing dot products. "A Hadamard matrix"
- Hadamard rotation: Applying a Hadamard transform to spread outliers and smooth activation/weight distributions. "the distribution of values after Hadamard rotation tends to approximate a Gaussian,"
- Heavy-tailed activation distributions: Activation distributions with extreme outliers that hinder effective quantization. "heavy-tailed activation distributions"
- Inductive bias: Architectural or procedural assumptions that guide learning, such as frame-relative modeling. "contains a strong inductive bias"
- K-Means: A clustering algorithm used to form calibration regions. "using K-Means"
- Low-bit quantization: Representing weights/activations with very few bits (e.g., 4-bit) to save memory and compute. "ensuring its performance even at low-bit quantization."
- Matrix multiplication invariance: Property that certain rotations preserve the result of matrix multiplication, enabling lossless pre/post transforms. "the matrix multiplication invariance is preserved as follows:"
- μ-coherent: A measure of quantization difficulty bounding the maximum element relative to the group norm. "the concept of μ-coherent"
- Multi-view: Data comprising multiple distinct views of a scene, common in 3D reconstruction. "the multi-view nature of 3D data"
- Noise-Filtered Diverse Sampling (NFDS): The proposed calibration sample selection that filters outliers and enforces diversity via clustering. "Noise-Filtered Diverse Sampling (NFDS)"
- Noise-score: A statistic combining layerwise mean/variance deviations to identify and remove outlier samples. "We then compute a noise-score using global robust moments:"
- Out-dimension-wise quantization: Quantizing weights per output dimension (row-wise) to reduce error while preserving integer GEMM efficiency. "we apply out-dimension-wise quantization to the weights"
- Outlier filtering: Removing atypical samples whose statistics deviate strongly to stabilize calibration. "filters outliers via deep-layer statistics"
- Point map estimation: Predicting 3D point locations corresponding to image pixels/patches. "the point map estimation task"
- Post-local channel smoothing: Channel-wise normalization applied after rotation to balance per-channel variance. "post-local channel smoothing"
- Post-Training Quantization (PTQ): Quantization that tunes scaling/ranges using a small calibration set without updating model weights. "Post-Training Quantization (PTQ)"
- Pre-global Hadamard rotation: A global Hadamard transformation applied before smoothing to disperse outliers. "pre-global Hadamard rotation"
- Quantization-Aware Training (QAT): Training with simulated quantization in the loop to learn robust weights/scales. "Quantization-Aware Training (QAT)"
- Quantization bins: Discrete integer levels to which continuous values are mapped in quantization. "occupy most of the quantization bins"
- Quantization ranges: The min/max (or scale) bounds used to map real values to quantized levels. "ensure stable quantization ranges"
- QuantVGGT: The paper’s quantization framework specialized for VGGT models. "QuantVGGT"
- Register tokens: Special tokens appended to frames to aggregate specific 3D attributes across views. "four register tokens"
- Rotation-based quantization: PTQ methods that apply orthogonal rotations (e.g., Hadamard) to smooth distributions before quantization. "rotation-based quantization"
- Robust moments: Statistics (means/variances) computed in a way that is resilient to outliers for scoring/filtering. "global robust moments"
- Symmetric quantization: Mapping values to signed integer ranges centered at zero using a single scale per group. "the symmetric quantization procedure can be described as:"
- Token registration: The process of concatenating special tokens (e.g., camera/register) to frame tokens before the backbone. "the token registration process is defined as"
- Token-wise quantization: Quantizing activations per token to reduce loss while keeping integer accumulation coherent. "token-wise quantization"
- Tokenization: Converting input images into sequences of tokens via a vision backbone. "VGGT first tokenizes each frame"
- Visual Geometry Grounded Transformer (VGGT): A large transformer that predicts multiple 3D attributes from image sequences in one pass. "Visual Geometry Grounded Transformer (VGGT)"
- W4A4: A 4-bit weight and 4-bit activation quantization configuration. "to W4A4 without compromising visual quality"
- W8A8: An 8-bit weight and 8-bit activation quantization configuration. "W8A8 (8-bit weight and 8-bit activation quantization)"