Release of Pre-Trained Models for the Japanese Language (2404.01657v1)

Published 2 Apr 2024 in cs.CL, cs.AI, cs.CV, cs.LG, and eess.AS

Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.
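
As a usage illustration of the kind of release the abstract describes, the sketch below loads a Japanese GPT model for text generation with Hugging Face Transformers. The repository name rinna/japanese-gpt-neox-3.6b and the generation settings are assumptions for illustration, not details taken from the paper itself; substitute the identifier given in the official release.

```python
# Minimal sketch: generating Japanese text with a released pre-trained GPT model
# via Hugging Face Transformers. The model ID is an assumption (a plausible
# repository name under the authors' organization).
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "rinna/japanese-gpt-neox-3.6b"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Japanese prompt: "The tallest mountain in Japan is"
prompt = "日本で一番高い山は"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```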
