3D Stylization via Large Reconstruction Model (2504.21836v1)

Published 30 Apr 2025 in cs.CV

Abstract: With the growing success of text- or image-guided 3D generators, users demand more control over the generation process, appearance stylization being one of them. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe whether large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture appearance-specific features. By injecting features from a visual style image into such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test-time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes.

Summary

  • The paper introduces an efficient method for 3D stylization by leveraging Large Reconstruction Models and attention mechanisms from 2D image generation.
  • This method achieves near-instant 3D stylization using features from a single style image, avoiding the extensive test-time optimization required by previous NeRF-based approaches.
  • The efficiency and speed of this approach enable real-time graphics applications, offering users greater creative control with minimal computational overhead compared to traditional methods.

Overview of "3D Stylization via Large Reconstruction Model"

The paper, titled "3D Stylization via Large Reconstruction Model," presents an approach to stylizing 3D objects by leveraging large reconstruction models (LRMs) together with attention mechanisms derived from 2D image generation techniques. The research addresses the challenge of adapting the visual style of a reference image to a 3D asset while ensuring multi-view consistency. The authors highlight how traditional methods, such as those built on Neural Radiance Fields (NeRF), are often hindered by computational inefficiency and lengthy optimization processes that make them unsuitable for real-time applications.

Key Contributions

  1. Attention Mechanisms in 3D Generation: The paper draws inspiration from successful 2D stylization methods and investigates whether LRMs — known for reconstructing 3D assets effectively from sparse inputs such as a single image — capture appearance-specific features in their attention mechanisms. By targeting certain attention blocks within these models, the authors achieve a robust stylization process that injects features from reference style images without requiring test-time optimization or additional training (see the sketch after this list).
  2. Efficiency and Practicality: Unlike previous methods requiring multi-view images and lengthy optimization, the proposed approach adapts features from a single image and offers instantaneity in 3D stylization. This advance has significant implications for interactive applications where rapid generation and stylization are beneficial.
  3. Quantitative and Qualitative Validation: The authors present both quantitative and qualitative evaluations demonstrating the superior performance of their method in achieving high-quality 3D stylization, with evidence of substantially improved efficiency while maintaining visual quality and computational practicality.
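
The core mechanism behind these contributions is attention-based feature injection. Below is a minimal sketch, assuming a PyTorch-style implementation; the function name and tensor layout are illustrative and not taken from the paper's code.

```python
import torch.nn.functional as F

def stylized_attention(q_content, k_content, v_content, k_style, v_style, inject=True):
    """Scaled dot-product attention in which, when `inject` is True, the keys and
    values are replaced by those computed from the style image's tokens, so the
    content queries attend to the style features and copy their appearance.
    Tensors are assumed to have shape (batch, heads, tokens, dim)."""
    k = k_style if inject else k_content
    v = v_style if inject else v_content
    return F.scaled_dot_product_attention(q_content, k, v)
```

In blocks where `inject` is False, the attention behaves exactly as in the pretrained model, which is how geometry-carrying blocks can be left untouched while appearance-carrying blocks are redirected toward the style image.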

Methodology

The approach involves using a large pretrained reconstruction model that internally reconstructs detailed 3D objects from limited view inputs. The research discovers that late-stage transformer blocks within these models are critical in capturing and transferring visual styles. By incorporating style image features into these late-stage blocks, the authors effectively disentangle geometry from appearance, ensuring the latter aligns with reference style images without compromising the underlying shape fidelity.
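
A hedged sketch of how such injection might be orchestrated in a single forward pass follows. The `lrm` object, its methods (`encode_image`, `collect_attention_features`, `decode_3d`), and the specific block indices are assumptions for illustration only, not the authors' actual interface.

```python
import torch

@torch.no_grad()
def stylize(lrm, content_image, style_image, late_blocks=range(12, 16)):
    # Run the style image through the model once and cache the per-block
    # attention features (keys/values) it produces.
    style_tokens = lrm.encode_image(style_image)
    style_cache = lrm.collect_attention_features(style_tokens)

    # Enable injection only in the late-stage, appearance-carrying blocks;
    # earlier blocks are left untouched so the content geometry is preserved.
    for i, block in enumerate(lrm.transformer_blocks):
        block.style_kv = style_cache[i] if i in late_blocks else None

    # A second forward pass on the content input then yields the stylized
    # 3D representation (e.g., a triplane), with no optimization involved.
    content_tokens = lrm.encode_image(content_image)
    return lrm.decode_3d(content_tokens)
```

Because both passes are plain feed-forward inference, the cost of stylization is essentially two forward evaluations of the reconstruction model, which is what makes the method practical for interactive use.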

Implications and Future Work

This paper's findings have profound implications for real-time graphics applications where low-latency stylization is crucial. The approach empowers users with creative control, enhancing the stylization process while minimizing computational overhead. Future work might explore extending these techniques to different 3D representations, such as Gaussian splats, and optimizing geometric fidelity further. Improved large reconstruction models can potentially augment the capabilities of the current framework. Moreover, exploring ways to adapt geometric intricacies alongside appearance features offers a promising trajectory.

In summary, the paper represents a significant step towards efficient, high-quality 3D stylization within LRM frameworks, offering a compelling solution to contemporary challenges faced in visual appearance adaptation.
