
LoRA-PAR: Dual-System PEFT for LLMs

Updated 16 March 2026
  • The paper introduces LoRA-PAR, a dual-system parameter-efficient fine-tuning approach that partitions LoRA adapters to separately handle intuitive (System 1) and reasoning (System 2) tasks.
  • It employs Taylor-based importance scoring to identify specialized parameter subsets, achieving about 40% parameter activation while retaining nearly 89% of full-adapter accuracy on benchmarks.
  • LoRA-PAR uses a two-stage fine-tuning protocol with supervised learning for System 1 tasks and reinforcement learning for System 2 tasks, enhancing both efficiency and interpretability.

LoRA-PAR is a dual-system parameter-efficient fine-tuning (PEFT) methodology for LLMs, inspired by cognitive dual-process theory and designed to partition both data and LoRA adapter parameters to separately target "fast/intuitive" (System 1) and "slow/reasoning" (System 2) task demands. The approach integrates multi-model data labeling, Taylor-based parameter importance scoring, and a staged fine-tuning protocol involving sequential supervised and reinforcement learning. LoRA-PAR enables selective activation of LoRA parameters for different response types, achieving high accuracy with substantial parameter savings and interpretability improvements over baseline PEFT methods (Huang et al., 28 Jul 2025).

1. Dual-System Analogy and Data-Driven Task Partitioning

LoRA-PAR draws on Kahneman's dual-process cognitive theory, proposing that LLM capacity can be split into System 1 (single-step, intuitive mappings such as factual recall) and System 2 (multi-step, deliberative reasoning such as arithmetic with chain of thought). The methodology employs multi-model role-playing and majority-vote labeling to classify training samples as System 1 or System 2. Specifically, given a data pool $D = \{x_k\}_{k=1}^N$, $M$ external teacher LLMs $T_1, \ldots, T_M$ are prompted to classify each sample. Final assignments use

$$\mathrm{label}(x_k) = \arg\max_{p} \sum_{i=1}^{M} \mathbb{I}[\delta_i(x_k) = p], \quad p \in \{1, 2\}.$$

This yields disjoint data subsets $D_1$ (System 1) and $D_2$ (System 2). Experimental results indicate that ensemble role-play and voting (e.g., $n = 5$) substantially enhance data quality compared to random or single-model splits: on GSM8K, accuracy improves to 27.60% versus 25.85% for random partitioning [(Huang et al., 28 Jul 2025), Table 1].
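The majority-vote assignment can be sketched in a few lines of Python. This is a minimal illustration: the sample texts and per-teacher votes are hypothetical, and the tie-breaking rule (toward System 2) is an assumption, since the paper does not specify one.

```python
from collections import Counter

def majority_vote_label(votes):
    """Majority vote over M teacher labels, each in {1, 2}.

    Ties are broken toward System 2 -- an illustrative choice,
    not specified by the paper.
    """
    counts = Counter(votes)
    return 2 if counts[2] >= counts[1] else 1

# Hypothetical votes from M = 5 teacher models for two samples.
labels = {
    "factual-recall sample":  majority_vote_label([1, 1, 1, 2, 1]),  # System 1
    "multi-step math sample": majority_vote_label([2, 2, 1, 2, 2]),  # System 2
}
```

Each labeled sample then lands in exactly one of the disjoint pools $D_1$ or $D_2$.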

2. Parameter Partitioning via Taylor-Based Importance Scoring

LoRA-PAR attaches LoRA adapters to all Transformer-layer Q/K/V/Gate/Up/Down weights and defines parameters $\{\phi_j\}_{j=1}^P$. Parameter specialization is quantified per system $s \in \{1, 2\}$ by masking the loss to relevant output positions and computing a first- plus second-order Taylor approximation:

$$I_s(\phi_j) = \left| g_j \phi_j - \frac{1}{2} \hat{F}_{jj} \phi_j^2 \right|,$$

where $g_j = \partial L_s / \partial \phi_j$ and $\hat{F}_{jj} \approx N^{-1} \sum_{k=1}^N (\partial L_{s,k} / \partial \phi_j)^2$ estimates the diagonal Fisher information. For each system, parameters are ranked by $I_s(\phi_j)$, and the minimal subset $S_s$ is chosen to exceed a cumulative importance proportion $\theta$ (e.g., $\theta = 0.9$ retains 90% of total importance).
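The scoring and cumulative-threshold selection can be sketched in plain Python. The gradient and Fisher values below are toy numbers standing in for real backprop statistics; the function names are illustrative.

```python
def taylor_importance(grads, params, fisher_diag):
    """Per-parameter importance: I(phi_j) = |g_j*phi_j - 0.5*F_jj*phi_j^2|."""
    return [abs(g * p - 0.5 * f * p * p)
            for g, p, f in zip(grads, params, fisher_diag)]

def select_top_subset(importance, theta=0.9):
    """Smallest index set whose cumulative importance reaches
    theta * (total importance), taking parameters in descending order."""
    order = sorted(range(len(importance)), key=lambda j: -importance[j])
    total = sum(importance)
    chosen, acc = set(), 0.0
    for j in order:
        chosen.add(j)
        acc += importance[j]
        if acc >= theta * total:
            break
    return chosen

# Toy statistics standing in for backprop-derived g_j and Fisher estimates.
imp = taylor_importance([1.0, -2.0, 0.1, 0.3],
                        [2.0,  1.0, 1.0, 1.0],
                        [0.5,  0.2, 0.1, 0.1])
S = select_top_subset(imp, theta=0.9)
```

Running this per system $s$ (with system-specific masked losses) yields the subsets $S_1$ and $S_2$ used below.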

The three-way partition is then:

  • $\Omega_1$-only: $S_1 \setminus S_2$
  • $\Omega_2$-only: $S_2 \setminus S_1$
  • $\Omega_{\text{shared}}$: $S_1 \cap S_2$

Empirical findings demonstrate that, at $\theta = 0.9$, only 40.6% of parameters need activation for System 1, with ~89% retention of full-LoRA accuracy (27.30% vs 30.63%); random subsets with an equal budget achieve only 23.43% [(Huang et al., 28 Jul 2025), Table 3].
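Given the two selected subsets, the three-way partition is just a pair of set differences plus an intersection; a minimal sketch with toy index sets:

```python
def partition_parameters(S1, S2):
    """Split selected parameter indices into system-specific and shared parts."""
    return {
        "omega1_only": S1 - S2,   # important only for System 1
        "omega2_only": S2 - S1,   # important only for System 2
        "shared":      S1 & S2,   # important for both systems
    }

# Toy index sets standing in for the Taylor-selected subsets S_1 and S_2.
parts = partition_parameters({0, 1, 2}, {2, 3})
```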

3. Two-Stage Fine-Tuning Protocol

LoRA-PAR introduces a two-stage update schedule:

  1. Stage 1: Supervised Fine-Tuning (SFT) on $D_1$ (System 1 tasks)
    • Activate $\phi_j \in \Omega_1$-only and an $\alpha$ fraction of $\Omega_{\text{shared}}$ ($\alpha \in [0, 1]$). Loss updates are restricted to active parameters.
    • Hyperparameters: learning rate $10^{-4}$, batch size 32, 1–2 epochs.
  2. Stage 2: Reinforcement Learning (RL) on $D_2$ (System 2 tasks)
    • Activate $\phi_j \in \Omega_2$-only and a $\beta$ fraction of $\Omega_{\text{shared}}$ ($\beta \in [0, 1]$).
    • RL objective $\mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $R(\tau)$ rewards correct chain-of-thought and final answers.
    • Typically PPO/GRPO-style policy updates, 1 RL epoch, and rollout batches of $16 \times 4$ examples.

Best results are achieved when the shared subset is fully enabled in both stages ($\alpha = \beta = 1$): on GSM8K, this configuration yields 34.37% test accuracy (vs 29.57% for partial sharing) [(Huang et al., 28 Jul 2025), Table 2].
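The per-stage activation rule can be sketched as a mask builder. Sampling a random fraction of the shared set is an illustrative choice for $\alpha, \beta < 1$; the paper's best configuration simply sets $\alpha = \beta = 1$, which keeps the whole shared set active in both stages.

```python
import random

def active_set(stage, parts, alpha=1.0, beta=1.0, seed=0):
    """Parameters unfrozen in each stage of the two-stage schedule.

    Stage 1 (SFT on D1): Omega1-only plus an alpha-fraction of shared.
    Stage 2 (RL on D2):  Omega2-only plus a beta-fraction of shared.
    Random sampling of the shared fraction is an illustrative choice.
    """
    rng = random.Random(seed)
    shared = sorted(parts["shared"])
    own, frac = ((parts["omega1_only"], alpha) if stage == 1
                 else (parts["omega2_only"], beta))
    k = round(frac * len(shared))
    return set(own) | set(rng.sample(shared, k))

# Toy partition: indices 2 and 4 matter for both systems.
parts = {"omega1_only": {0, 1}, "omega2_only": {3}, "shared": {2, 4}}
stage1_params = active_set(1, parts)   # unfrozen during SFT
stage2_params = active_set(2, parts)   # unfrozen during RL
```

Gradients for every parameter outside the returned set are zeroed (or the parameters are frozen), so each stage only updates its own specialists plus the enabled share of the overlap.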

4. Empirical Results and Parameter Efficiency

LoRA-PAR delivers high parameter efficiency without loss of accuracy. With $\theta = 0.9$, only ~40% of LoRA parameters are activated per system, achieving 89% of full-adapter accuracy, and importance-based activation strongly outperforms random selection (27.30% vs 23.43% on GSM8K SFT). When applied to LLaMA2 7B, the PiSSA variant with $\theta = 0.95$ attains 41.85% GSM8K accuracy while using less than half of the LoRA parameters per system, outperforming full PiSSA and matching or surpassing other PEFT baselines on MMLU and HumanEval (see the table below) [(Huang et al., 28 Jul 2025), Table 4].

| Method | GSM8K (2-epoch) | MMLU (Dolly) | MMLU (Platypus) | HumanEval |
|---|---|---|---|---|
| LoRA | 31.86% | 44.99% | 43.16% | 18.54% |
| PiSSA ($\theta = 0.95$) | 41.85% | 24.14% | 25.38% | 27.43% |
| LoRA-PAR ($\theta = 0.9$) | 34.57% | 47.09% | 45.66% | 19.51% |

5. Interpretability and Inference Dynamics

LoRA-PAR yields interpretable parameter clusters, observable in scatter plots of System 1 versus System 2 parameter importance. These "neural subregions" empirically correspond to intuitive (fast) and reasoning (slow) processing. At inference, systems can selectively activate only the corresponding partition and optionally shared parameters, reducing memory and compute load relative to activating all LoRA adapters—an advantage when one class of task dominates real-world queries.
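The memory and compute argument can be made concrete with a small accounting sketch. The partition sizes and `total_params` below are toy numbers chosen to mirror the ~40% activation figure reported in the paper, not measured values.

```python
def activation_fraction(parts, total_params, system, include_shared=True):
    """Fraction of LoRA parameters kept active for one class of query."""
    own = parts["omega1_only"] if system == 1 else parts["omega2_only"]
    n = len(own) + (len(parts["shared"]) if include_shared else 0)
    return n / total_params

# Toy sizes: 25 System-1-only, 25 System-2-only, 15 shared, of 100 total.
parts = {"omega1_only": set(range(25)),
         "omega2_only": set(range(25, 50)),
         "shared":      set(range(50, 65))}
frac_s1 = activation_fraction(parts, 100, system=1)
```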

6. Modularity, Transferability, and Limitations

The framework is modular: the hyperparameters $\theta$, $\alpha$, and $\beta$ can be tuned per deployment, and its principles extend to other adapters (e.g., prefix-tuning) where parameter importance can be estimated. LoRA-PAR has so far been applied only to decoder-only Transformer LLMs; generalization to encoder–decoder models remains open. Multi-model labeling increases annotation cost and depends on access to external teachers. The binary System 1/2 split may overlook tasks requiring hybrid or intermediate reasoning; future work could investigate more granular partitioning (Huang et al., 28 Jul 2025).

7. Broader Context and Theoretical Significance

LoRA-PAR is distinct among dual-system or dual-partitioning strategies in explicitly partitioning both data and trainable parameters based on task cognitive demands and each parameter's utility to each subsystem. Unlike architectural dual partitioning in parallel computing (Kelly et al., 2013) or explicit cache/memory splits in systems (Costa et al., 27 Jan 2025), LoRA-PAR achieves efficiency via sparsity and targeted adaptation within a shared architecture. Its combination of interpretability, parameter efficiency, and high accuracy demonstrates the viability of cognitive-inspired, data-driven partitioning for next-generation adaptive LLMs (Huang et al., 28 Jul 2025).
