FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications (2403.06453v1)
Abstract: Acquiring the desired font for various design tasks can be challenging and requires professional typographic knowledge. While previous font retrieval and generation works have alleviated some of these difficulties, they often lack support for multiple languages and for semantic attributes beyond their training-data domains. To address this problem, we present FontCLIP: a model that connects the semantic understanding of a large vision-language model with typographic knowledge. We integrate typography-specific knowledge into the comprehensive vision-language knowledge of a pretrained CLIP model through a novel finetuning approach. We propose a compound descriptive prompt that encapsulates adaptively sampled attributes from a font attribute dataset focused on Roman alphabet characters. FontCLIP's semantic typographic latent space demonstrates two unprecedented generalization abilities. First, FontCLIP generalizes to different languages, including Chinese, Japanese, and Korean (CJK), capturing the typographic features of fonts across languages even though it was finetuned only on fonts of Roman characters. Second, FontCLIP can recognize semantic attributes that are not present in the training data. FontCLIP's dual-modality and generalization abilities enable multilingual and cross-lingual font retrieval and letter shape optimization, reducing the burden of obtaining desired fonts.
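The retrieval use case described in the abstract amounts to ranking fonts by similarity between a text-prompt embedding and font-image embeddings in a shared latent space. The toy sketch below illustrates that ranking step only; the random unit vectors, font names, and embedding dimensionality are stand-ins for real encoder outputs and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # CLIP-like embedding dimensionality (illustrative assumption)

def normalize(v):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in embeddings for a small font library (would come from the image encoder).
font_names = ["SerifA", "SansB", "ScriptC", "MonoD"]
font_embeddings = normalize(rng.standard_normal((len(font_names), DIM)))

# Stand-in embedding of a descriptive prompt, e.g. "a bold, playful font"
# (would come from the text encoder).
query = normalize(rng.standard_normal(DIM))

# Cosine-similarity ranking: higher score means a closer semantic match.
scores = font_embeddings @ query
ranking = [font_names[i] for i in np.argsort(scores)[::-1]]
print(ranking)
```

Because both modalities live in one space, the same ranking works whether the query is text or an example font image, which is the dual-modality property the abstract highlights.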