Adaptive Helpfulness–Harmlessness Alignment with Preference Vectors
The paper "Adaptive Helpfulness–Harmlessness Alignment with Preference Vectors" introduces a novel framework to address the balance between helpfulness and harmlessness in LLMs. Traditional methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) face challenges, including performance trade-offs and scalability issues. This research proposes the Preference Vector framework, inspired by task arithmetic, to achieve fine-grained and user-controllable preference adjustments without retraining, by leveraging modularity and dynamic preference integration.
Core Contributions
The paper identifies three core challenges in aligning LLMs with human preferences:
- Performance Conflicts: Existing methods optimize multiple preferences under a single objective, leading to suboptimal outcomes due to inherent conflicts.
- Controllability: Fixed preference trade-offs during training limit post-deployment adjustments.
- Extendability: Integrating new preferences typically requires extensive retraining.
To address these, the Preference Vector framework trains separate models for individual preferences, extracts the resulting behavior shifts as preference vectors, and combines those vectors at inference to modulate helpfulness and harmlessness dynamically. This modular approach enables scalable multi-preference alignment while allowing new preferences to be integrated seamlessly.
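As a rough schematic of that arithmetic (with symbols chosen here for illustration rather than taken from the paper), each preference vector is the parameter difference between models fine-tuned on opposing versions of a preference, and the aligned model adds user-scaled vectors to the base parameters:

```latex
% Schematic only: theta^+ / theta^- are parameters fine-tuned on a preference
% and on its negative counterpart; lambda_i are user-chosen scales at inference.
\[
  v_{\text{pref}} = \theta^{+}_{\text{pref}} - \theta^{-}_{\text{pref}},
  \qquad
  \theta_{\text{aligned}} = \theta_{\text{base}} + \sum_{i} \lambda_i \, v_i
\]
```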
Methodology
The methodology involves:
- Training Separate Models: Separate models are trained independently on data labeled for helpfulness and harmlessness, and, via a preference-switching strategy, on their respective negative counterparts.
- Extraction of Preference Vectors: Task arithmetic is used to extract each preference vector by subtracting the parameters of the model trained on the opposing preference.
- Dynamic Aggregation: During inference, preference vectors are added to a base model with user-defined scaling, offering controllability and extendability (see the sketch after this list).
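The following is a minimal sketch of the extraction and aggregation steps over PyTorch state dicts. It assumes models fine-tuned on opposing preference labels are already available; the function names, variable names, and scaling values are illustrative, not the paper's released implementation.

```python
# Minimal sketch of preference-vector extraction and aggregation over
# PyTorch state dicts (mappings from parameter name to tensor).
# Names and scales below are illustrative assumptions.

def extract_preference_vector(positive_state, negative_state):
    """Task arithmetic: subtract the parameters of the model fine-tuned on the
    negative preference from those of the positively fine-tuned model."""
    return {name: positive_state[name] - negative_state[name]
            for name in positive_state}

def apply_preference_vectors(base_state, vectors, scales):
    """Add user-scaled preference vectors to the base model's parameters at
    inference time; no retraining is involved."""
    merged = {name: param.clone() for name, param in base_state.items()}
    for vector, scale in zip(vectors, scales):
        for name, delta in vector.items():
            merged[name] += scale * delta
    return merged

# Hypothetical usage, assuming state dicts of models fine-tuned on opposing
# preference labels (helpful vs. unhelpful, harmless vs. harmful):
# v_helpful = extract_preference_vector(theta_helpful, theta_unhelpful)
# v_harmless = extract_preference_vector(theta_harmless, theta_harmful)
# merged = apply_preference_vectors(theta_base, [v_helpful, v_harmless],
#                                   scales=[1.0, 0.8])
# model.load_state_dict(merged)
```

Because the scales are applied only at load time, the same extracted vectors can be recombined with different weightings to shift the helpfulness–harmlessness trade-off after deployment.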
Experiments use LLaMA-3.2-3B, LLaMA-3.1-8B, and Mistral-7B, evaluated on datasets that measure both helpfulness and harmlessness. Evaluation is conducted with preference models, GPT-4, and human raters to assess alignment quality and adaptability.
Experimental Results
The results show that the Preference Vector framework significantly improves helpfulness scores over existing baselines while maintaining comparable harmlessness without excessive conservatism. Dynamic scaling of the preference vectors allows model behavior to be tuned to user-specific requirements. Furthermore, integrating additional preferences such as Psychocounseling and AI-likeness demonstrates the framework's extendability.
Human evaluations corroborate the framework's utility, particularly in delivering helpful responses while remaining competitive on harmlessness. The robustness of the preference vectors is validated through their cosine similarities across different random seeds, which show high consistency and indicate that each vector captures a single, stable behavioral direction.
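A rough sketch of such a seed-consistency check is shown below, assuming preference vectors are stored as PyTorch parameter dictionaries; the exact procedure and thresholds used in the paper may differ.

```python
# Hedged sketch of a seed-robustness check: flatten two preference vectors
# extracted from independent runs and compute their cosine similarity.
# Values close to 1.0 suggest the extracted behavior shift is stable.
import torch

def flatten_vector(vector_state):
    """Concatenate all parameter deltas of a preference vector into one 1-D tensor."""
    return torch.cat([delta.reshape(-1) for delta in vector_state.values()])

def seed_consistency(vector_seed_a, vector_seed_b):
    """Cosine similarity between two preference vectors extracted with
    different random seeds."""
    a = flatten_vector(vector_seed_a)
    b = flatten_vector(vector_seed_b)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```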
Implications and Future Directions
Practically, this framework offers a flexible solution to customize LLM behavior along multiple dimensions of human preference without the computational cost of retraining. Theoretically, the approach highlights the potential of modular preference alignment methods in AI, emphasizing adaptability and scalability.
Future research may explore the integration of additional dimensions of human preference, further refining the balance between task-specific performance and safety. Extending the framework to accommodate externally defined ethical guidelines and societal norms could also broaden its applicability across domains such as healthcare and education. Its ability to personalize AI systems at low computational cost presents a promising avenue for the development of safe and effective LLM applications.