xLAM-7B-FC-R: Function-Calling 7B Transformer
- xLAM-7B-FC-R is a 7B-parameter transformer specialized for function calling; it interleaves natural-language reasoning with JSON-formatted function calls for precise API tool use.
- It utilizes a balanced training pipeline combining execution-verified synthetic data and general instructions, enhancing both reasoning and function execution.
- Robust architectural optimizations and multi-stage quality controls enable xLAM-7B-FC-R to rival larger models on benchmarks like the Berkeley Function-Calling Leaderboard.
xLAM-7B-FC-R is a function-calling–specialized 7B-parameter large action model within the xLAM family, developed to advance open-source AI agent capabilities. Built atop the DeepSeek-Coder-7B-instruct backbone, it integrates end-to-end fine-tuning for high-fidelity JSON-style function calls interleaved with self-consistent “thought” traces. The model is distinguished by a unified, execution-verified dataset, architectural optimizations for API tool-use, and empirical performance rivaling significantly larger proprietary baselines (Zhang et al., 2024).
1. Model Architecture
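xLAM-7B-FC-R inherits the dense Transformer architecture of DeepSeek-Coder-7B: a stack of $L$ decoder-only Transformer layers, each with hidden dimension $d_{\text{model}}$, feed-forward inner size $d_{\text{ff}}$, and $h$ self-attention heads. The FC variants contain no mixture-of-experts (MoE) layers. At each layer $\ell$, multi-head attention receives the input representations $H^{(\ell-1)}$ and computes, in the standard scaled dot-product form:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + M\right) V_i, \qquad Q_i = H^{(\ell-1)} W_i^{Q},\quad K_i = H^{(\ell-1)} W_i^{K},\quad V_i = H^{(\ell-1)} W_i^{V}$$

$$\mathrm{MHA}\big(H^{(\ell-1)}\big) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$$

where $d_k = d_{\text{model}} / h$ and $M$ is the causal mask.

This is followed by a gated GeLU feed-forward layer,

$$\mathrm{FFN}(x) = \big(\mathrm{GeLU}(x W_1) \odot x W_3\big)\, W_2$$

Layer normalization precedes each sublayer, and a residual connection is added after each sublayer.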
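The layer ordering described above (normalize, apply the sublayer, add the residual) can be sketched in a few lines of NumPy. This is an illustrative toy rather than the model's implementation: the parameter names (`w_q`, `w_gate`, and so on), the tiny dimensions, the tanh approximation of GeLU, and the layer norm without learned scale or bias are all assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization; learned scale and bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GeLU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: position t may not attend to positions after t.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)
    out = softmax(scores) @ v
    return out.transpose(1, 0, 2).reshape(seq_len, d_model) @ w_o

def gated_gelu_ffn(x, w_gate, w_up, w_down):
    # GeLU-activated gate multiplied elementwise with a linear "up" projection.
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

def decoder_block(x, p, n_heads):
    # Pre-norm: normalize, apply the sublayer, then add the residual.
    x = x + causal_self_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"], n_heads)
    x = x + gated_gelu_ffn(layer_norm(x), p["w_gate"], p["w_up"], p["w_down"])
    return x

def init_params(d_model, d_ff, seed=0):
    rng = np.random.default_rng(seed)
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_gate": (d_model, d_ff), "w_up": (d_model, d_ff),
        "w_down": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}
```

Calling `decoder_block(x, init_params(16, 32), n_heads=4)` on a `(5, 16)` input returns an array of the same shape. Because of the causal mask, changing the last token leaves the outputs at earlier positions unchanged.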
2. Training Data and Pipeline
The model employs a unified training corpus composed of three primary sources:
- Cleaned and Augmented Agent Data: Tool-usage and web-agent trajectories standardized to a JSON schema (task, tools, format, few-shot setups, stepwise traces).
- Synthetic Function-Calling Data: 60,000 high-quality, execution-verified samples generated by APIGen across 3,673 real-world APIs spanning 21 categories.
- General Instruction Tuning: Approximately 20–30% of the corpus, sourced from DialogStudio and Data Provenance, filtered for non-commercial licensing and rated by Mixtral-8x22B and DeepSeek-V2 models to remove repetitive or low-quality dialogue.
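To make the idea of a standardized trajectory record concrete, the sketch below models one record with the components listed for the agent data (task, tools, format, few-shot setups, stepwise traces). The class and field names are hypothetical, chosen for illustration; the actual key names of the xLAM schema are not given here.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    # One step of a trajectory: reasoning, the calls made, and what came back.
    thought: str
    tool_calls: list = field(default_factory=list)
    observation: str = ""

@dataclass
class UnifiedRecord:
    # Hypothetical field names mirroring the components listed in the text.
    task: str
    tools: list
    format_instruction: str
    few_shot_examples: list = field(default_factory=list)
    steps: list = field(default_factory=list)

    def to_json(self):
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), ensure_ascii=False)

record = UnifiedRecord(
    task="Find the current weather in Paris.",
    tools=[{"name": "get_weather", "parameters": {"city": "string"}}],  # hypothetical tool
    format_instruction="Respond with a JSON object.",
    steps=[Step(thought="I need the weather tool.",
                tool_calls=[{"name": "get_weather", "arguments": {"city": "Paris"}}])],
)
```

Serializing every source into one such shape is what lets heterogeneous tool-use and web-agent trajectories be mixed in a single training corpus.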
During fine-tuning of the FC-R variant, each minibatch consists of 50% execution-verified function-calling data and 50% agent and general instruction data, sampled evenly. Training uses full-model PyTorch FSDP on NVIDIA H100 GPUs with the following settings: a batch size of 128 sequences per GPU (1,024 effective across 8 GPUs), sequence lengths of up to 4,000 tokens, a cosine-decay learning-rate schedule with a 100-step warmup, weight decay of 0.1, and dropout of 0.1 in the feed-forward layers. Total training exposure is approximately 150 billion tokens over three epochs.
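The batch arithmetic, the warmup-plus-cosine schedule, and the 50/50 batch composition can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the warmup is taken to be linear, the second half of each batch is drawn from a single combined agent/general pool, and `peak_lr` is left as a caller-supplied parameter because the peak value is not given in the text.

```python
import math
import random

def effective_batch_size(per_gpu, n_gpus):
    # 128 sequences per GPU on 8 GPUs gives 1,024 sequences per step.
    return per_gpu * n_gpus

def cosine_lr(step, total_steps, peak_lr, warmup_steps=100, min_lr=0.0):
    # Linear warmup to peak_lr, then cosine decay towards min_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def mixed_batch(fc_pool, other_pool, batch_size, rng):
    # Half of the batch from function-calling data, the rest from the
    # agent/general pool; the order is shuffled afterwards.
    half = batch_size // 2
    batch = rng.sample(fc_pool, half) + rng.sample(other_pool, batch_size - half)
    rng.shuffle(batch)
    return batch
```

For example, `effective_batch_size(128, 8)` returns `1024`, and `mixed_batch(fc_pool, other_pool, 8, random.Random(0))` returns eight items, four from each pool.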
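The training objective combines a standard cross-entropy loss over the target tokens,

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t}),$$

with a pairwise ranking loss for DPO alignment,

$$\mathcal{L}_{\mathrm{rank}} = -\log \sigma\big(s_\theta(x, y^{+}) - s_\theta(x, y^{-})\big),$$

where $s_\theta(x, y)$ denotes the pre-softmax sequence score, $y^{+}$ and $y^{-}$ are the preferred and rejected responses, and $\sigma$ is the logistic sigmoid.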
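For readers who prefer code to notation, the two loss terms named above can be written in plain Python. This is a sketch of the textbook forms, not the training code: the cross-entropy is averaged over positions here, the ranking loss is taken to be the negative log-sigmoid of the score margin between a preferred and a rejected response, and the function names are illustrative.

```python
import math

def cross_entropy(logits, targets):
    # logits: one list of vocabulary scores per position;
    # targets: the correct token id at each position.
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        total += log_z - row[target]
    return total / len(targets)

def pairwise_ranking_loss(score_preferred, score_rejected):
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With uniform logits over four tokens the cross-entropy equals `log(4)`, and with equal scores the ranking loss equals `log(2)`; the ranking loss shrinks as the preferred response's score pulls ahead of the rejected one.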
3. Function-Calling Integration
xLAM-7B-FC-R employs a unified JSON-based schema for function calling, interleaving "thought" reasoning strings with "tool_calls" entries in the generated output. Instead of allocating a separate API head, the model learns during supervised fine-tuning to emit target sequences that combine natural language with JSON special tokens. A minimal output is a single JSON object containing a "thought" string followed by a "tool_calls" list.

At inference, standard token-by-token autoregressive decoding proceeds over the merged vocabulary, with no additional parameters dedicated to API function calls. Decoding follows the usual autoregressive factorization, $p_\theta(y \mid x) = \prod_{t} p_\theta(y_t \mid x, y_{<t})$, where $x$ is the prompt (including the tool definitions) and $y$ is the generated output. The approach is intended to yield high-fidelity, schema-conforming function calls interleaved with structured reasoning.
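As an illustration of the output format described above, the sketch below parses a generated string into its "thought" and "tool_calls" parts. The top-level keys come from the text; the per-call layout (`"name"` plus `"arguments"`), the `get_weather` tool, and the `parse_model_output` helper are assumptions made for the example.

```python
import json

# Hypothetical model output following the "thought" / "tool_calls" layout.
EXAMPLE_OUTPUT = (
    '{"thought": "The user wants the weather, so I should call the weather tool.", '
    '"tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]}'
)

def parse_model_output(text):
    """Return (thought, tool_calls) or raise ValueError on malformed output."""
    obj = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict):
        raise ValueError("output must be a JSON object")
    thought = obj.get("thought", "")
    calls = obj.get("tool_calls", [])
    if not isinstance(thought, str) or not isinstance(calls, list):
        raise ValueError("'thought' must be a string and 'tool_calls' a list")
    for call in calls:
        # Assumed per-call layout: a tool name and a dict of arguments.
        if (not isinstance(call, dict)
                or not isinstance(call.get("name"), str)
                or not isinstance(call.get("arguments"), dict)):
            raise ValueError(f"malformed tool call: {call!r}")
    return thought, calls

thought, calls = parse_model_output(EXAMPLE_OUTPUT)
```

Because the calls are ordinary generated text rather than the output of a dedicated head, a consumer of the model typically needs a validation step of this kind before dispatching anything to a real API.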
4. Empirical Performance
On the Berkeley Function-Calling Leaderboard v2 (as of 2024-09-03), xLAM-7B-FC-R attains an Overall Accuracy of 80.18%, outpacing many larger open-source models. The breakdown is as follows:
| Evaluation Metric | xLAM-7B-FC-R (%) |
|---|---|
| Overall Accuracy | 80.18 |
| AST-only, simple | 70.52 |
| AST-only, multiple | 78.22 |
| AST-only, parallel | 73.88 |
| AST-only, parallel-multiple | 68.50 |
| Executable, simple | 95.21 |
| Executable, multiple | 90.00 |
| Executable, parallel | 88.00 |
| Executable, parallel-multiple | 77.50 |
| Relevance: ignore-irrelevance | 79.54 |
| Relevance: detect-relevance | 80.49 |
For comparison, GPT-4-0125 (function-call prompt) scores 81.78% overall, Gorilla-OpenFunctions-v2 scores 79.10%, and GPT-3.5 Turbo (FC) scores 75.41%. xLAM-7B-FC-R therefore trails GPT-4-0125 by 1.60 points while leading Gorilla-OpenFunctions-v2 by 1.08 points and GPT-3.5 Turbo (FC) by 4.77 points. At 7B parameters, it is competitive with or superior to open-source baselines of substantially greater scale (Zhang et al., 2024).
5. Key Design Optimizations
Critical design decisions contributing to model effectiveness include:
- Unified JSON Schema: Standardizing all trajectories into a ["thought","tool_calls"] format with explicit module boundaries ensures persistent API schema adherence.
- Prompt-format and Paraphrase Augmentation: Employing stochastic shuffling of tool lists and paraphrased formatting instructions mitigates overfitting to prompt structure, improving generalization.
- Multi-stage Quality Verification: Training data passes through rule-based checks (undefined tools or arguments), LLM-driven hallucination detection, and human-in-the-loop trajectory rating, which together remove approximately 15% of samples as low quality.
- APIGen Synthesis: The synthetic dataset, in which every API call is execution-verified, improves parallel-call accuracy by 4%.
- Balanced Mini-batch Composition: Equal sampling of function-calling and agent/general instruction data in minibatches yields substantial gains on both AST and executable function metrics, while maintaining language understanding capacity.
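Two of the items above lend themselves to a short sketch: the rule-based check for undefined tools or arguments, and the shuffling of tool lists for prompt augmentation. The code is illustrative only; the tool-spec layout (`"name"` and `"parameters"` keys) and both function names are assumptions, not the pipeline's actual interface.

```python
import random

def find_rule_violations(tool_calls, tools):
    """List calls to undefined tools and arguments the tool does not declare."""
    specs = {tool["name"]: tool for tool in tools}
    problems = []
    for call in tool_calls:
        spec = specs.get(call["name"])
        if spec is None:
            problems.append(f"undefined tool: {call['name']}")
            continue
        allowed = set(spec["parameters"])
        for arg in call["arguments"]:
            if arg not in allowed:
                problems.append(f"undefined argument: {call['name']}.{arg}")
    return problems

def shuffle_tools(tools, rng):
    # Return a reordered copy so the model cannot rely on tool position.
    shuffled = list(tools)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical tool definitions for the example.
TOOLS = [
    {"name": "get_weather", "parameters": {"city": "string"}},
    {"name": "get_time", "parameters": {"timezone": "string"}},
]
```

A sample whose calls produce a non-empty list from `find_rule_violations` would be a candidate for removal at the rule-based stage, before the more expensive LLM and human checks run.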
6. Context and Significance
xLAM-7B-FC-R exemplifies the impact of a systematic data engineering and training pipeline on function-calling performance for autonomous AI agents. By enforcing schema unification, incorporating execution-verified synthetic data (50% of fine-tuning), and applying multi-stage quality control, the model achieves accuracy levels near those of state-of-the-art proprietary models within a compact 7B-parameter dense Transformer. This outcome suggests that scalable, unified methodologies can democratize high-fidelity function calling even at modest parameter counts, advancing open-source alternatives for AI agent systems (Zhang et al., 2024).