
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI (2401.14019v1)

Published 25 Jan 2024 in cs.CL and cs.AI

Abstract: In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative LLMs. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt!

Citations (7)

Summary

  • The paper introduces Unitxt, a modular library that enables over 100,000 recipe configurations for reproducible data preparation and evaluation in generative AI.
  • It demonstrates seamless integration with existing libraries like Hugging Face and LM-eval-harness, requiring minimal code changes for compatibility.
  • The study highlights community collaboration through a centralized catalog of predefined recipes, paving the way for scalable and innovative NLP experiments.

Introduction

In the ever-evolving field of NLP, the complexities of data preparation and evaluation for generative LLMs have become markedly pronounced. Researchers continuously encounter challenges to the flexibility and reproducibility of their experiments due to the growing number of moving parts involved, such as system prompts, model-specific formats, and task instructions. Against this backdrop, IBM Research has developed Unitxt, a Python library aimed at offering a structured, modular, and customizable solution to these challenges.

Flexible Data Preparation

At the core of Unitxt lies the concept of "recipes," which are sequences of textual data processing operators. These recipes cater to a multitude of tasks by including operators for data loading, pre-processing, prompt preparation, and prediction evaluation. Such an approach allows for over 100,000 recipe configurations, granting researchers the capability to unpack and repurpose diverse components such as model-specific formats and evaluation metrics.

Unitxt enriches flexibility by incorporating a centralized catalog that features a broad spectrum of predefined recipes based on an extensive set of shared, built-in operators. This repository further promotes community-based collaboration and reproducibility, as researchers can contribute new ingredients or utilize those shared by others, thereby fostering transparent and collaborative modern textual data workflows.
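The operator-sequence idea behind recipes can be sketched in plain Python. The names below (`Recipe`, `rename_field`, `apply_template`) are illustrative assumptions for this sketch, not Unitxt's actual API: each operator is a small transformation over examples, and a recipe chains them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# An example is a flat dict of fields; an operator maps one example to another.
Example = Dict[str, str]
Operator = Callable[[Example], Example]

def rename_field(src: str, dst: str) -> Operator:
    """Operator that renames a field in each example."""
    def op(ex: Example) -> Example:
        ex = dict(ex)
        ex[dst] = ex.pop(src)
        return ex
    return op

def apply_template(template: str) -> Operator:
    """Operator that renders a prompt template into a 'source' field."""
    def op(ex: Example) -> Example:
        ex = dict(ex)
        ex["source"] = template.format(**ex)
        return ex
    return op

@dataclass
class Recipe:
    """A recipe is just an ordered list of operators applied in sequence."""
    operators: List[Operator]

    def process(self, examples: List[Example]) -> List[Example]:
        for op in self.operators:
            examples = [op(ex) for ex in examples]
        return examples

recipe = Recipe(operators=[
    rename_field("sentence", "text"),
    apply_template("Classify the sentiment of: {text}"),
])
processed = recipe.process([{"sentence": "Great movie!"}])
print(processed[0]["source"])  # Classify the sentiment of: Great movie!
```

Because each stage is an independent, named component, swapping a template or a pre-processing step means replacing one list element, which is the property that lets Unitxt enumerate so many distinct recipe configurations from a shared pool of operators.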

Seamless Integration with Existing Libraries

Seamless integration is a hallmark of Unitxt, which has been designed to mitigate the inconvenience of switching between libraries. A testament to its integration capabilities is the ability to load datasets from the Hugging Face library and produce outputs compatible with existing codebases. In practice, incorporating Unitxt into established workflows, such as LM-eval-harness, necessitated a mere 30 lines of code while preserving existing APIs and ensuring compatibility with current benchmarks.
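When Unitxt is driven through the Hugging Face datasets interface, a whole pipeline is selected by a single recipe string of comma-separated `key=value` pairs. A minimal illustrative parser for such a string (not Unitxt's own code; the specific card and template names are assumptions for illustration) looks like:

```python
def parse_recipe(spec: str) -> dict:
    """Split a comma-separated 'key=value' recipe string into a config dict.

    Illustrative sketch only -- Unitxt performs its own, richer parsing and
    resolves each value against components registered in its catalog.
    """
    return dict(part.split("=", 1) for part in spec.split(","))

cfg = parse_recipe(
    "card=cards.wnli,"
    "template=templates.classification.multi_class.relation.default,"
    "num_demos=5,demos_pool_size=100"
)
print(cfg["card"])       # cards.wnli
print(cfg["num_demos"])  # 5
```

Packing the configuration into one string is what lets existing codebases treat a Unitxt pipeline as just another dataset name, without adopting a new API surface.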

Expanding Horizons for LLMs

Unitxt has already gained traction within various IBM teams handling a range of NLP tasks, from classification to question answering. The contribution of Unitxt to the open-source community, along with its ongoing collaborative development, is set to further refine the tool. With the Unitxt library, IBM has not only established a new standard for managing the intricate requirements of textual data processing in generative AI but has also paved the way for more ambitious experimentation in the field.

Conclusion

In conclusion, Unitxt stands out as a pioneering solution to the need for a standardized yet flexible framework in LLM research. Its modular, customizable approach to textual data workflows enables substantial scalability and promotes a culture of open collaboration. As the Unitxt community continues to grow, so too does its potential to change how researchers prepare and evaluate data for LLMs, ensuring that progress in NLP rests on robust, reproducible practice.
