Overview
In the field of LLMs, a longstanding challenge has been adapting models designed for general purposes to excel in specialized domains. The domains of interest often include highly specific fields such as the physical and biomedical sciences, where the data differ substantially from the natural-language text typically encountered in NLP. Addressing this gap, the paper introduces Tag-LLM, a framework that repurposes general-purpose LLMs for specialized tasks via domain-specific input tags. These tags are parameterized as continuous vectors appended to the LLM's embedding layer and serve to condition the model, aligning its behavior with the requirements of a given specialized domain or task.
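To make this concrete, here is a minimal PyTorch sketch of what such tag vectors might look like. This is an illustrative reconstruction, not the paper's code: the names `TagTable` and `embed_with_tag`, the tag length `num_vectors`, and the prepend-only placement are all assumptions; the only commitment is that tags are trainable vectors living in the same space as the token embeddings.

```python
import torch
import torch.nn as nn


class TagTable(nn.Module):
    """Trainable continuous tag vectors, one block of vectors per tag."""

    def __init__(self, num_tags: int, num_vectors: int, hidden_dim: int):
        super().__init__()
        # Each tag occupies `num_vectors` embedding-sized positions.
        self.tags = nn.Parameter(torch.randn(num_tags, num_vectors, hidden_dim) * 0.02)

    def forward(self, tag_id: int) -> torch.Tensor:
        # Returns the (num_vectors, hidden_dim) block for one tag.
        return self.tags[tag_id]


def embed_with_tag(model, input_ids: torch.Tensor, tag_table: TagTable, tag_id: int):
    """Prepend one tag's vectors to the token embeddings of a 1-D sequence."""
    token_embeds = model.get_input_embeddings()(input_ids)  # (seq_len, hidden_dim)
    tag_embeds = tag_table(tag_id)                          # (num_vectors, hidden_dim)
    return torch.cat([tag_embeds, token_embeds], dim=0)     # (num_vectors + seq_len, hidden_dim)
```

Because the tags bypass the tokenizer entirely and enter at the embedding layer, they can encode domain information that has no natural surface form in the model's vocabulary.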
Design and Implementation of Tag-LLM
The proposed method stands out by splitting input tags into two categories: domain tags and function tags. Domain tags contextualize the input data, indicating to the model the type of specialized data it is processing (such as chemical formulas or protein sequences), while function tags signal to the model the specific task at hand, such as predicting molecular properties or modeling drug-target interactions. This separation allows for a modular approach to problem-solving, in which different combinations of tags can be deployed to tackle new or unseen tasks in a zero-shot fashion, as sketched below.
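Building on the sketch above, the following hedged example shows how a domain tag and a function tag might be composed around a single input in embedding space. The exact tag placement used in the paper may differ, and `compose_tags` is an invented name; the point is that a function tag trained against several domains can be paired with a domain tag it was never trained with, which is what enables the zero-shot combinations.

```python
def compose_tags(model, input_ids: torch.Tensor,
                 domain_tags: TagTable, function_tags: TagTable,
                 domain_id: int, function_id: int) -> torch.Tensor:
    """Assemble [domain tag] [specialized input] [function tag] in embedding space."""
    token_embeds = model.get_input_embeddings()(input_ids)
    return torch.cat([
        domain_tags(domain_id),       # e.g., a protein-domain tag
        token_embeds,                 # the specialized input itself
        function_tags(function_id),   # e.g., a binding-affinity function tag
    ], dim=0)
```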
A three-stage protocol has been developed for learning these tags, leveraging auxiliary datasets and domain knowledge to progressively improve the model's understanding and performance. In the first stage, domain tags are trained through next-token prediction on in-domain data. The second stage trains single-domain function tags, and the third trains cross-domain function tags, using increasingly specialized task-oriented data and enriching the model's capability to address complex problems across different fields.
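Continuing the sketch, a single first-stage step might look like the following, where the LLM is frozen and only the domain tag's vectors receive gradients from a standard next-token-prediction loss. This assumes a HuggingFace-style causal LM that accepts `inputs_embeds`; the later stages would reuse the same machinery with supervised task data to train function tags instead.

```python
import torch.nn.functional as F


def stage1_step(model, tag_table: TagTable, domain_id: int,
                input_ids: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One next-token-prediction step that updates only the domain tag."""
    inputs_embeds = embed_with_tag(model, input_ids, tag_table, domain_id)
    num_vectors = tag_table.tags.shape[1]   # tag positions carry no labels

    logits = model(inputs_embeds=inputs_embeds.unsqueeze(0)).logits[0]

    # Position t predicts token t + 1; skip the prepended tag positions.
    pred = logits[num_vectors:-1]           # predictions for tokens 1 .. seq_len - 1
    target = input_ids[1:]
    loss = F.cross_entropy(pred, target)

    optimizer.zero_grad()
    loss.backward()                         # the LLM is frozen, so only the tag updates
    optimizer.step()
    return loss.item()


# Setup sketch: freeze the base model, optimize the tag vectors only.
# model.requires_grad_(False)
# optimizer = torch.optim.Adam(tag_table.parameters(), lr=1e-3)
```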
Empirical Results and Findings
Quantitative evaluations demonstrate the efficacy of Tag-LLM across a diverse set of tasks, including translation across eight languages and scientific tasks such as protein property prediction and drug discovery. Notably, on pharmaceutical tasks such as drug combination prediction and binding affinity prediction, Tag-LLM achieved state-of-the-art results, significantly outperforming both specialized expert models and other methods aimed at repurposing LLMs.
The modular design of Tag-LLM, combined with its systematic training protocol, not only improves performance on specialized tasks but also offers a framework for expansion, allowing new tags to be incrementally added. This capability ensures that as domains evolve or as new challenges arise, Tag-LLM can adapt and extend its proficiency accordingly.
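One way to picture this incremental extensibility, under the assumption that each tag is an independent parameter while the base LLM and previously learned tags stay frozen, is the following sketch; `TagRegistry` and its methods are hypothetical names, not part of the paper's released interface.

```python
class TagRegistry(nn.Module):
    """Each tag is its own parameter, so new tags can be added incrementally."""

    def __init__(self, num_vectors: int, hidden_dim: int):
        super().__init__()
        self.num_vectors, self.hidden_dim = num_vectors, hidden_dim
        self.tags = nn.ParameterDict()

    def add(self, name: str) -> nn.Parameter:
        # Registering a new tag leaves previously trained tags untouched.
        self.tags[name] = nn.Parameter(
            torch.randn(self.num_vectors, self.hidden_dim) * 0.02)
        return self.tags[name]

    def freeze_all_except(self, name: str):
        # Train only the newest tag; everything already learned stays fixed.
        for key, param in self.tags.items():
            param.requires_grad_(key == name)


# registry = TagRegistry(num_vectors=10, hidden_dim=4096)
# registry.add("gene")
# registry.freeze_all_except("gene")   # then train with the stage-1 loop above
```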
Conclusion and Prospects
Tag-LLM represents a significant step forward in the effort to leverage the broad capabilities of general-purpose LLMs within highly specialized and technical domains. By employing domain and function tags, the framework unlocks new potential for applying AI in scientific research and beyond. Future directions for this work include expanding the model to additional domains, exploring other forms of tag-based conditioning, and further optimizing for computational efficiency and adaptability. This research opens a promising avenue for making advanced AI tools more accessible and effective in tackling the complex problems faced in specialized fields of study.