LogiAgent: LLM Logical API Testing
- LogiAgent is a multi-agent LLM framework that systematically generates complex REST API scenarios to detect logical bugs and validate business rules.
- It employs a global scheduler and long-term execution memory with BM25-based retrieval to enhance scenario synthesis and reduce redundant failures.
- Empirical evaluations reveal that LogiAgent outperforms baselines in logical issue identification and code coverage, highlighting its superiority in API testing.
LogiAgent is an LLM-based multi-agent framework developed for the logical testing of RESTful application programming interfaces (APIs). Unlike conventional REST API testing frameworks that prioritize detection of server crashes and HTTP error codes, LogiAgent systematically targets the detection of logical bugs and domain-specific business logic violations within API responses by leveraging multi-agent LLM orchestration with scenario memory, logical oracles, and business-oriented scenario generation. Extensive empirical evaluation demonstrates that LogiAgent achieves significantly higher logical issue identification and code coverage compared to established baselines, substantiating the effectiveness of LLM-powered logical test oracles and context-driven scenario synthesis (Zhang et al., 19 Mar 2025).
1. Architectural Components and Workflow
LogiAgent’s architecture consists of three core components: an LLM-based multi-agent testing framework, a global scenario scheduler, and a long-term execution memory.
- Multi-Agent Framework:
- Test Scenario Generator: Synthesizes business-logic-oriented testing scenarios comprising 8–12 steps, each specifying API calls and expected behavioral oracles.
- API Request Executor: Executes the generated scenarios by assembling request payloads, fetching relevant parameter values and failure patterns from execution memory.
- API Response Validator (Logical Oracle): Validates API responses against scenario-specific business logic oracles, encompassing not only standard HTTP status but also semantic and logic-level correctness.
- Global Scenario Scheduler: Maintains scenario progress and step-level retries and controls overall test execution flow. Provides atomic interfaces for adding, retrieving, and updating scenario execution status and for termination checks (a minimal interface sketch follows this list).
- Long-Term Execution Memory: Persists successful API request parameters and failed execution reflections. Retrieval mechanisms support BM25-based similarity scoring to maximize contextual reuse and reduce redundant failures.
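For concreteness, the scheduler's atomic interface described above might be organized as in the following sketch; the class names, retry policy, and method signatures are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from threading import Lock


@dataclass
class ScenarioState:
    scenario_id: str
    steps: list               # ordered API-call steps, each with an expected oracle
    current_step: int = 0
    retries: int = 0
    status: str = "pending"   # pending | running | done | failed


class GlobalScenarioScheduler:
    """Illustrative scheduler: atomic add/fetch/update plus a termination check."""

    def __init__(self, request_budget: int, max_retries: int = 3):
        self._lock = Lock()
        self._scenarios: dict[str, ScenarioState] = {}
        self._requests_used = 0
        self._budget = request_budget
        self._max_retries = max_retries

    def add(self, state: ScenarioState) -> None:
        with self._lock:
            self._scenarios[state.scenario_id] = state

    def next_pending(self) -> ScenarioState | None:
        with self._lock:
            return next((s for s in self._scenarios.values()
                         if s.status in ("pending", "running")), None)

    def record_step(self, scenario_id: str, success: bool) -> None:
        with self._lock:
            self._requests_used += 1
            s = self._scenarios[scenario_id]
            if success:
                s.current_step += 1
                s.status = "done" if s.current_step >= len(s.steps) else "running"
            elif s.retries >= self._max_retries:
                s.status = "failed"
            else:
                s.retries += 1

    def should_terminate(self) -> bool:
        with self._lock:
            return self._requests_used >= self._budget
```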
The workflow is an orchestrated loop: a scenario is generated and scheduled; steps are executed with memory-based parameter enrichment; responses are validated using LLM-guided oracles; validated results and reflections are committed to execution memory. The process recurs until either the predefined request budget is exhausted or termination conditions are met.
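Under the assumptions of the scheduler sketch above, the orchestrated loop can be summarized as follows; `generate_scenario`, `execute_step`, and `validate_response` are placeholders standing in for the three agents, and the memory interface is likewise illustrative (a BM25-based sketch appears in Section 2).

```python
def run_testing_loop(scheduler, memory, generate_scenario, execute_step, validate_response):
    """Illustrative orchestration: generate -> execute -> validate -> remember."""
    while not scheduler.should_terminate():
        scenario = scheduler.next_pending()
        if scenario is None:
            # No scenario in flight: synthesize a new one from the ARG, memory, and examples.
            scheduler.add(generate_scenario(memory))
            continue

        step = scenario.steps[scenario.current_step]
        # Enrich the request with parameter values retrieved from execution memory.
        params = memory.retrieve_params(step)
        response = execute_step(step, params)

        # The logical oracle returns a verdict and a natural-language explanation.
        verdict, explanation = validate_response(step, response, scenario)
        if verdict == "Aligned":
            memory.record_success(step, params)
        else:
            memory.record_reflection(step, response, explanation)
        scheduler.record_step(scenario.scenario_id, success=(verdict == "Aligned"))
```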
2. Formal Mechanisms: Scenario Generation, Logical Oracle, and Memory
Scenario Generation: The agent builds an API Relationship Graph (ARG), representing logical and business workflow dependencies between APIs. Random walk sampling (max length 10) selects a subset of API endpoints for scenario context, with scenarios synthesized via LLM prompting using vectorized API descriptions, historical scenario examples, and logical dependency confirmation.
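As an illustration of the sampling step, a bounded random walk over an adjacency-list representation of the ARG could select the endpoints that seed a scenario; the graph shape and helper name below are assumptions, not the published implementation.

```python
import random


def sample_scenario_endpoints(arg: dict[str, list[str]], max_length: int = 10) -> list[str]:
    """Bounded random walk over an API Relationship Graph (ARG).

    `arg` maps an endpoint to the endpoints it logically feeds into, e.g.
    {"POST /orders": ["GET /orders/{id}", "DELETE /orders/{id}"], ...}.
    The walk stops at `max_length` nodes or when a node has no outgoing edges.
    """
    current = random.choice(list(arg))
    walk = [current]
    while len(walk) < max_length and arg.get(current):
        current = random.choice(arg[current])
        if current in walk:   # avoid trivially revisiting the same endpoint
            break
        walk.append(current)
    return walk
```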
Logical Oracle (API Response Validator):
Validation proceeds by constructing a complex LLM prompt encoding the API/step description, input/output payloads, the expected high-level oracle (business logic), and the broader scenario context. The LLM, acting as an oracle, outputs a verdict (“Aligned”/“Not Aligned”) with a natural-language explanation. Validation does not rely on explicit numerical scoring but on few-shot, multi-perspective checks integrating schema, semantic, and logic requirements.
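A hedged sketch of how such an oracle prompt might be assembled and its verdict parsed follows; `call_llm` is a placeholder for the underlying chat-completion client, and the prompt wording is illustrative rather than the authors' template.

```python
def validate_response(step: dict, response: dict, scenario_context: str, call_llm) -> tuple[str, str]:
    """Ask the LLM oracle whether a response aligns with the step's business-logic oracle."""
    prompt = (
        "You are validating a REST API response against its expected business logic.\n"
        f"Scenario context: {scenario_context}\n"
        f"Step: {step['method']} {step['path']} - {step['description']}\n"
        f"Request payload: {step['payload']}\n"
        f"Response (status {response['status']}): {response['body']}\n"
        f"Expected behavior (oracle): {step['oracle']}\n"
        "Answer 'Aligned' or 'Not Aligned' on the first line, then explain briefly."
    )
    answer = call_llm(prompt)
    first_line, _, explanation = answer.partition("\n")
    verdict = "Aligned" if first_line.strip().lower().startswith("aligned") else "Not Aligned"
    return verdict, explanation.strip()
```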
Execution Memory Mechanism:
- ParamRecords: Logs (API, parameter name, parameter value) for successful requests, used in BM25 similarity-based parameter retrieval for scenario synthesis.
- ReflRecords: Logs failed execution payloads together with the LLM validator's explanations to support reflection-driven scenario pruning. BM25 scoring for parameter retrieval is computed over a concatenated query built from the current API and step descriptions (see the sketch below).
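A minimal sketch of ParamRecords retrieval, assuming the open-source `rank_bm25` package; the record layout and method names are illustrative.

```python
from rank_bm25 import BM25Okapi


class ExecutionMemory:
    """Illustrative long-term memory: ParamRecords ranked by BM25 against the current step."""

    def __init__(self):
        self.param_records = []   # each: {"api": ..., "param": ..., "value": ..., "text": ...}

    def record_success(self, api_desc: str, param: str, value) -> None:
        # Persist a successful (API, parameter name, parameter value) triple.
        text = f"{api_desc} {param} {value}"
        self.param_records.append({"api": api_desc, "param": param, "value": value, "text": text})

    def retrieve_params(self, api_desc: str, step_desc: str, top_k: int = 5) -> list[dict]:
        if not self.param_records:
            return []
        corpus = [r["text"].lower().split() for r in self.param_records]
        bm25 = BM25Okapi(corpus)
        # Query concatenates the current API and step descriptions, as described above.
        query = f"{api_desc} {step_desc}".lower().split()
        scores = bm25.get_scores(query)
        ranked = sorted(zip(scores, self.param_records), key=lambda pair: pair[0], reverse=True)
        return [record for _, record in ranked[:top_k]]
```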
3. Experimental Setup and Baselines
LogiAgent was evaluated on twelve real-world REST systems covering a diversity of domains and operational complexities (online: PetStore, Bill-Service, Genome-Nexus; local: Features-Service, Rest-Countries, News-Service, SCS, NCS, LanguageTool, Person-Controller, Project-Track, User-Management). Experiments leveraged the GPT-4o-mini model backbone, a fixed 1,000-API-call budget, and commodity hardware. Baselines included RESTler, EvoMaster (black-box), Morest, and ARAT-RL, each evaluated under both fixed-call and time-budgeted (1 hour) settings (Zhang et al., 19 Mar 2025).
4. Empirical Results: Logical Bug Detection and Coverage
Logical Issue Detection: Across 349 agent-generated test reports, LogiAgent correctly identified 234 logical issues (139 confirmed logical bugs, 95 business-logic enhancement opportunities), resulting in an overall logical accuracy of 66.19%. False positives (33.81%) were primarily attributed to LLM hallucination or domain misunderstanding, most pronounced in domain-complex systems.
Server Crash Detection: LogiAgent is competitive with baselines in detecting HTTP 500 errors, uncovering 49 distinct crash scenarios versus 32–54 across the individual baselines.
Test Coverage:
- Operation Coverage: 231 distinct successful operations vs. best baseline (ARAT-RL) at 179 (fixed call budget).
- Code Coverage: Mean branch (39.98%), line (71.78%), and method (73.06%) coverage notably exceed baselines (e.g., best line coverage baseline at 62.38%).
Ablation Study: Removal of execution memory features (parameter and reflection retrieval) results in significant code coverage drops (branch: –10.42%, method: –1.87%), confirming the importance of scenario history and reflection in API exploration.
| Metric | LogiAgent | Best Baseline | Improvement (pp) |
|---|---|---|---|
| Branch Coverage | 39.98% | 34.90% | +5.08 |
| Line Coverage | 71.78% | 62.38% | +9.40 |
| Method Coverage | 73.06% | 67.24% | +5.82 |
5. Strengths, Limitations, and Failure Modes
LogiAgent is proficient at discovering both minor HTTP-semantics deviations (e.g., a 200 status returned where 201 is expected) and complex logic flaws (e.g., incorrectly ordered deletions). Its memory-driven scenario synthesis and logical validation deliver higher operation and code coverage per call than search- and model-based baselines, while remaining effective at detecting both logical and crash-inducing faults.
The primary limitation is the LLM’s dependence on precise articulation of business logic in API documentation; poorly or ambiguously specified APIs can undermine oracle alignment and inflate the false-positive rate (~34%). Domain-specialization gaps, LLM hallucination, and incomplete grounding in implementation-specific logic are the prominent sources of oracle error, particularly for highly domain-specific applications.
6. Future Directions
Identified future work involves:
- Integrating domain-specific knowledge bases and static code analysis to enhance logical oracle grounding and reduce hallucinations.
- Incorporating hybrid memory strategies and deeper implementation reflection to minimize redundant failures and erroneous explorations.
- Tighter coupling of LLM reasoning with program analysis to allow logic-informed static/dynamic test adaptation, especially where API documentation is non-normative or sparse (Zhang et al., 19 Mar 2025).
Potential implications include broader applicability to non-REST interfaces, automated generation of specification-level API oracles, and seamless integration with continuous integration/continuous deployment (CI/CD) pipelines for dynamic logic regression detection.