RouteNator: Enhancing Synthetic Data Generation for Function Calling in LLMs
The paper "RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs" addresses the challenge of training LLMs for function calling tasks within digital content creation tools when real user interaction data is unavailable due to privacy constraints. This paper introduces a novel approach for synthetic data generation, leveraging a router-based architecture that integrates domain-specific resources such as content metadata and structured knowledge graphs, alongside multi-modal LLMs.
Synthetic Data Generation Challenge
When natural language user queries must be mapped to API calls, the absence of real-world, task-specific data makes training difficult. Existing synthetic data generation methods often fail to replicate the diversity and complexity of real-world data, resulting in suboptimal model performance after fine-tuning. The proposed multi-modal architecture targets these limitations, focusing on improving data diversity and alignment with observed real-world query distributions.
Architecture and Methodology
RouteNator generates synthetic data from structured domain knowledge: it extracts generalizable patterns from content metadata and uses domain-specific knowledge graphs to produce contextually relevant queries. A flexible routing mechanism directs each input to one of several specialized LLM prompt templates, sampling routes according to population-level statistics over real queries. Because both text-to-text and vision-to-text LLMs sit behind these routes, the generated data is more diverse.
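To make the routing step concrete, here is a minimal Python sketch of weighted probabilistic routing over prompt templates. The route names, weights, and template strings are illustrative assumptions; the paper derives its sampling weights from population-level query statistics and does not publish its exact template set.

```python
import random

# Illustrative routes and sampling weights. The paper's weights come from
# population-level statistics over real queries; these names, weights, and
# templates are assumptions made for this sketch.
ROUTES = [
    # (route name, model type, sampling weight)
    ("short_keyword_query",    "text-to-text",   0.40),
    ("natural_language_query", "text-to-text",   0.35),
    ("image_grounded_query",   "vision-to-text", 0.25),
]

TEMPLATES = {
    "short_keyword_query":
        "Write a terse 2-4 word search query for an asset titled '{title}'.",
    "natural_language_query":
        "Write a full-sentence request a user might type to find '{title}' "
        "(tags: {tags}).",
    "image_grounded_query":
        "Given the attached image of '{title}', write the search query a "
        "user would type to find it.",
}

def route_item(metadata: dict) -> tuple[str, str, str]:
    """Pick a route by weighted sampling, then fill its prompt template."""
    names   = [name for name, _, _ in ROUTES]
    weights = [weight for _, _, weight in ROUTES]
    route   = random.choices(names, weights=weights, k=1)[0]
    model   = next(m for n, m, _ in ROUTES if n == route)
    prompt  = TEMPLATES[route].format(
        title=metadata["title"], tags=", ".join(metadata["tags"])
    )
    # The prompt would next be sent to the matching text-to-text or
    # vision-to-text LLM; that call is omitted here.
    return route, model, prompt

if __name__ == "__main__":
    item = {"title": "autumn forest watercolor", "tags": ["nature", "painting"]}
    print(route_item(item))
```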
The paper compares three approaches to synthetic data generation:
- Template-based Heuristic Generation: Fills rule-based templates with content metadata and knowledge-graph concepts (a minimal sketch appears after this list). Generation is fast, but the output lacks linguistic diversity and naturalness.
- Single-Prompt LLM Based Generation: Prompts a Llama-3.1-70B-Instruct model with a comprehensive prompt set. The language is more natural, but the output distribution is hard to control and variety is limited.
- Router-Based Multi-Modal Architecture: Combines the strengths of the heuristic and LLM-based approaches through weighted probabilistic sampling and multi-modal input processing, yielding more diverse and realistic synthetic data aligned with real-world statistics.
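For contrast with the router, here is a minimal sketch of the template-based heuristic approach from the first list item. The toy knowledge graph and rule templates are assumptions; the paper's actual graph schema and rule set are not reproduced here.

```python
import random

# Toy knowledge graph mapping asset categories to related concepts.
# The real graph's schema is not published, so this shape is assumed.
KNOWLEDGE_GRAPH = {
    "watercolor": ["painting", "art print", "illustration"],
    "logo":       ["branding", "vector graphic", "icon"],
}

# Rule-based query templates filled with metadata fields.
TEMPLATES = [
    "find a {concept} of {title}",
    "{title} {concept}",
    "{concept} similar to {title}",
]

def heuristic_query(metadata: dict) -> str:
    """Fill a random rule template with metadata plus a graph-expanded concept."""
    concepts = KNOWLEDGE_GRAPH.get(metadata["category"], ["asset"])
    return random.choice(TEMPLATES).format(
        title=metadata["title"], concept=random.choice(concepts)
    )

print(heuristic_query({"title": "mountain sunrise", "category": "watercolor"}))
```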
Experimental Evaluation and Results
The paper demonstrates the architecture's efficacy through evaluation on real user queries, reporting marked improvements in function classification accuracy and API parameter selection. Models fine-tuned on the synthetically generated data outperform those trained with conventional approaches, setting a new benchmark for these function calling tasks.
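To make the two reported metrics concrete, the sketch below scores them with exact matching over predicted function names and parameter dictionaries; this scoring rule is an assumption, since the paper's exact protocol is not reproduced here.

```python
def evaluate(predictions, references):
    """Return (function classification accuracy, API parameter accuracy).

    Each item is a dict such as {"function": str, "params": dict}.
    Exact-match scoring is assumed; parameters count as correct only
    when the function is also correct.
    """
    n = len(references)
    fn_hits = sum(p["function"] == r["function"]
                  for p, r in zip(predictions, references))
    param_hits = sum(p["function"] == r["function"] and p["params"] == r["params"]
                     for p, r in zip(predictions, references))
    return fn_hits / n, param_hits / n

# "search_assets" is a hypothetical API name used only for illustration.
preds = [{"function": "search_assets", "params": {"query": "forest"}}]
refs  = [{"function": "search_assets", "params": {"query": "forest"}}]
print(evaluate(preds, refs))  # (1.0, 1.0)
```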
Statistical analysis and graphical representations in the paper illustrate improvements in data diversity, including balanced keyword positions and realistic word count distributions. Fine-tuning experiments reveal significant performance enhancement, particularly when combining synthetic data generated through multiple approaches, further validating the router-based architecture's utility.
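One way to quantify the word-count claim is to compare the length histograms of synthetic and real query sets, as in the sketch below; KL divergence is an illustrative choice here, not the statistic the paper uses.

```python
import math
from collections import Counter

def word_count_divergence(synthetic, real, eps=1e-9):
    """KL divergence between word-count histograms of two query sets.

    Values near zero suggest the synthetic queries match the real
    length distribution.
    """
    def histogram(queries):
        counts = Counter(len(q.split()) for q in queries)
        total = sum(counts.values())
        return {length: c / total for length, c in counts.items()}

    p, q = histogram(real), histogram(synthetic)
    return sum(pk * math.log(pk / q.get(length, eps)) for length, pk in p.items())

real_queries      = ["forest watercolor", "find me a mountain sunrise print"]
synthetic_queries = ["autumn forest painting", "logo with a blue icon design"]
print(word_count_divergence(synthetic_queries, real_queries))
```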
Practical and Theoretical Implications
The paper's contributions have significant implications for synthetic data generation in LLM training. Practically, the router-based architecture provides a robust framework for generating highly diverse and realistic training data, improving models' ability to handle complex user interactions within content creation platforms. Theoretically, this work advances our understanding of multi-modal architectures in synthetic data generation, opening pathways for further optimizations and applications across various domains requiring function calling capabilities.
Future Directions
The paper concludes with avenues for future research: extending the approach to multilingual queries, generalizing the architecture to other application domains that require complex function calling, integrating more advanced LLMs for higher-quality synthetic data, and expanding real-world datasets for model evaluation.
In summary, RouteNator establishes a strong precedent for synthetic data generation in LLM training, demonstrating substantial improvements in model performance on function calling tasks under privacy constraints. Its multi-modal architecture offers a promising foundation for researchers and developers building AI-driven digital content creation tools.