
CRAFT: Extracting and Tuning Cultural Instructions from the Wild (2405.03138v2)

Published 6 May 2024 in cs.CL

Abstract: LLMs have rapidly evolved into the foundation of various NLP applications. Despite their wide range of use cases, their understanding of culturally-related concepts and reasoning remains limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction generation. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvements of up to 6%. Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field.
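The abstract's two-stage idea (identify cultural concepts in raw text, then trigger instruction generation from the matched passages) can be illustrated with a minimal sketch. This is not the paper's actual implementation: the seed keyword list, function names, and template are all hypothetical stand-ins, and the real pipeline would use an LLM rather than keyword matching.

```python
import re

# Hypothetical seed concepts; the paper's pipeline discovers these from corpora.
CULTURAL_KEYWORDS = {"kopitiam", "hawker", "jeepney", "thanksgiving"}

def extract_cultural_sentences(corpus: str) -> list[tuple[str, list[str]]]:
    """Stand-in for concept identification: keep sentences that
    mention a seed cultural concept."""
    sentences = re.split(r"(?<=[.!?])\s+", corpus)
    hits = []
    for s in sentences:
        lowered = s.lower()
        matched = [k for k in CULTURAL_KEYWORDS if k in lowered]
        if matched:
            hits.append((s.strip(), matched))
    return hits

def to_instruction_pairs(hits: list[tuple[str, list[str]]]) -> list[dict]:
    """Stand-in for instruction triggering: turn each matched sentence
    into a templated instruction-tuning example."""
    pairs = []
    for sentence, concepts in hits:
        for concept in concepts:
            pairs.append({
                "instruction": f"Explain the cultural concept '{concept}' "
                               f"mentioned in the passage.",
                "context": sentence,
                "output": "",  # filled by a generator LLM in the real pipeline
            })
    return pairs

corpus = ("Singaporeans often eat breakfast at a kopitiam. "
          "In Manila, the jeepney remains a common mode of transport.")
pairs = to_instruction_pairs(extract_cultural_sentences(corpus))
```

On this toy corpus, the sketch yields one instruction-tuning example per matched concept; the paper's contribution is doing this at scale from unstructured data and mixing the result with a general-purpose instruction set.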

Authors (5)
  1. Bin Wang (750 papers)
  2. Geyu Lin (10 papers)
  3. Zhengyuan Liu (41 papers)
  4. Chengwei Wei (17 papers)
  5. Nancy F. Chen (97 papers)
Citations (3)