Effect of LLM choice on AgentSynth-generated task properties
Determine how the choice of large language model used for task generation within the AgentSynth pipeline affects the properties of the synthesized tasks, specifically their complexity, realism, and meaningfulness. Then characterize the resulting task-generation variance across different LLM architectures to inform diversity and difficulty calibration.
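One hedged sketch of how this question could be probed empirically: sample tasks from several candidate generator LLMs under an identical persona prompt, have a single fixed judge model rate each task on the three properties, and compare per-model means and variances. The model names, persona, prompts, and 1-5 rating scale below are illustrative assumptions, not part of the AgentSynth pipeline; the only external dependency is an OpenAI-compatible chat completions API.

```python
"""Sketch: compare generated-task properties across generator LLMs.

Assumptions (not from the paper): placeholder model names, an example
persona, a 1-5 judge rating scale, and a single fixed judge model.
"""
import json
import statistics

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENERATOR_MODELS = ["gpt-4.1", "gpt-4o-mini"]  # hypothetical candidates
JUDGE_MODEL = "gpt-4.1"  # held fixed to isolate generator effects
PERSONA = "a graduate student planning a conference trip"  # example persona
N_TASKS = 20  # tasks sampled per generator model


def propose_task(model: str) -> str:
    """Ask `model` to propose one computer-use task for the persona."""
    resp = client.chat.completions.create(
        model=model,
        temperature=1.0,
        messages=[{
            "role": "user",
            "content": f"Propose one realistic computer-use task for {PERSONA}. "
                       "Reply with the task description only.",
        }],
    )
    return resp.choices[0].message.content.strip()


def judge_task(task: str) -> dict:
    """Rate a task on complexity, realism, and meaningfulness (1-5 each)."""
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0.0,
        response_format={"type": "json_object"},  # JSON mode for easy parsing
        messages=[{
            "role": "user",
            "content": "Rate this computer-use task from 1 (low) to 5 (high) on "
                       "complexity, realism, and meaningfulness. Reply as JSON "
                       'with keys "complexity", "realism", "meaningfulness".'
                       f"\n\nTask: {task}",
        }],
    )
    return json.loads(resp.choices[0].message.content)


for model in GENERATOR_MODELS:
    scores = [judge_task(propose_task(model)) for _ in range(N_TASKS)]
    for prop in ("complexity", "realism", "meaningfulness"):
        vals = [float(s[prop]) for s in scores]
        print(f"{model:12s} {prop:14s} "
              f"mean={statistics.mean(vals):.2f} sd={statistics.stdev(vals):.2f}")
```

Holding the judge model fixed avoids conflating generator variance with judge variance; repeating the comparison under several different judges would then test whether the measured per-model differences are robust to the evaluator.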
References
We currently rely on GPT-4.1-based agents for task generation due to their robustness and broad generalization. Different LLM models might generate tasks with systematically different complexity, realism, or meaningfulness, and it remains an open and interesting research question how model choice affects generated task properties.
— AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents (Xie et al., arXiv:2506.14205, 17 Jun 2025), Section 3 (Scalable Agent Tasks and Trajectories Generation), Task Proposer