Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware (2505.05057v2)

Published 8 May 2025 in cs.SE

Abstract: Application Programming Interfaces (APIs) are crucial in modern software development. LLMs assist in automated code generation but often struggle with API hallucination, including invoking non-existent APIs and misusing existing ones in practical development scenarios. Existing studies resort to Retrieval-Augmented Generation (RAG) methods for mitigating the hallucination issue, but tend to fail since they generally ignore the structural dependencies in practical projects and do not indeed validate whether the generated APIs are available or not. To address these limitations, we propose MARIN, a framework for mitigating API hallucination in code generated by LLMs with hierarchical dependency aware. MARIN consists of two phases: Hierarchical Dependency Mining, which analyzes local and global dependencies of the current function, aiming to supplement comprehensive project context in LLMs input, and Dependency Constrained Decoding, which utilizes mined dependencies to adaptively constrain the generation process, aiming to ensure the generated APIs align with the project's specifications. To facilitate the evaluation of the degree of API hallucination, we introduce a new benchmark APIHulBench and two new metrics including Micro Hallucination Number (MiHN) and Macro Hallucination Rate (MaHR). Experiments on six state-of-the-art LLMs demonstrate that MARIN effectively reduces API hallucinations, achieving an average decrease of 67.52% in MiHN and 73.56% in MaHR compared to the RAG approach. Applied to Huawei's internal projects and two proprietary LLMs, MARIN achieves average decreases of 57.33% in MiHN and 59.41% in MaHR.

Summary

  • The paper introduces MARIN, a framework that mitigates API hallucination by leveraging hierarchical dependency mining to capture both local and global project dependencies.
  • It employs dependency constrained decoding using a binary mask on valid API tokens, significantly improving the accuracy of generated code.
  • Evaluations on APIHulBench and Huawei's internal projects show marked reductions in hallucination metrics and increased exact match accuracy with minimal overhead.

The paper "Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware" (2505.05057) addresses the significant challenge of API hallucination in code generated by LLMs, particularly in practical software development scenarios involving project-specific APIs. API hallucination manifests as invoking non-existent APIs or misusing existing ones, leading to erroneous code that is difficult to debug and can introduce security vulnerabilities.

Existing approaches primarily rely on Retrieval-Augmented Generation (RAG), which supplements LLM prompts with retrieved API documentation or code snippets. However, the paper identifies two key limitations of these methods: 1) They often ignore the structural dependencies within a project (like relationships between functions, classes, and files), providing isolated context that is insufficient for understanding proper API usage within the project's architecture. 2) They use unconstrained auto-regressive decoding, which allows LLMs to generate invalid API tokens based solely on model probabilities, without explicit validation against available APIs.

To overcome these limitations, the authors propose MARIN, a framework designed to mitigate API hallucination by incorporating hierarchical dependency awareness. MARIN operates in two main phases:

  1. Hierarchical Dependency Mining: This phase uses static analysis to extract relevant project context at two levels:
    • Local Dependencies: Analyzing the immediate context of the incomplete function, including valid APIs callable at the generation point and method call relationships within the same file or across files.
    • Global Dependencies: Examining the broader project structure by identifying imported files and creating simplified "skeletons" of these files (retaining class definitions, field declarations, and function signatures), which convey the surrounding architecture without exceeding context window limits (a rough sketch of skeleton extraction appears after this list).
    The mined hierarchical dependency information is then used to construct a structured input prompt for the LLM, comprising a project description, the global dependency skeletons, the local dependency details, and the incomplete function with a placeholder for the API call.
  2. Dependency Constrained Decoding: This phase leverages the mined dependencies to guide the token generation process. It involves:
    • Dependency Preprocessing: Building an API name prefix tree from the token sequences of valid reference APIs and identifying parameter patterns (tokens indicating the start of parameter lists or no parameters). This information is cached for efficient access.
    • Constrained Decoding: At each token generation step, the LLM's output logits are masked with a binary mask derived from the preprocessed dependencies, so that only valid next tokens (based on the current position in the API prefix tree or the identified parameter patterns) remain. The next token is then selected via greedy search over the masked logits, and the process repeats until a complete API call is generated (a minimal sketch of this decoding loop follows the list below).
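As a rough illustration of the global-dependency skeletons, the sketch below strips a Java file down to its class headers, field declarations, and method signatures. The regex-based matching and the function name extract_skeleton are simplifying assumptions made here for brevity; the paper relies on static analysis rather than regular expressions.

```python
# Illustrative sketch of building a file "skeleton" for global dependencies:
# keep class headers, field declarations, and method signatures, drop bodies.
# The regexes below are a simplification; the paper's static analysis is not
# reproduced here.
import re

CLASS_DECL = re.compile(r"^\s*(public\s+)?(abstract\s+)?(class|interface|enum)\s+\w+")
FIELD_DECL = re.compile(r"^\s*(public|protected|private)\s+[\w<>\[\], ]+\s+\w+\s*(=.*)?;")
SIGNATURE = re.compile(
    r"^\s*(public|protected|private|static|final|abstract|\s)*"
    r"[\w<>\[\]]+\s+\w+\s*\([^)]*\)"          # return type, method name, parameters
)

def extract_skeleton(java_source: str) -> str:
    """Return a compact skeleton: class, field, and method declarations only."""
    kept = []
    for line in java_source.splitlines():
        if CLASS_DECL.match(line):
            kept.append(line.split("{")[0].rstrip() + " {")
        elif FIELD_DECL.match(line):
            kept.append(line.rstrip())
        elif SIGNATURE.match(line):
            kept.append(line.split("{")[0].rstrip() + ";")   # signature, no body
    return "\n".join(kept)

example = """
public class FileCache {
    private Map<String, byte[]> store = new HashMap<>();
    public byte[] get(String key) {
        return store.get(key);
    }
}
"""
print(extract_skeleton(example))
# public class FileCache {
#     private Map<String, byte[]> store = new HashMap<>();
#     public byte[] get(String key);
```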
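The constrained decoding step can be sketched as follows: a prefix tree built from the token sequences of valid APIs determines which tokens are permitted at each step, invalid tokens are masked out, and the next token is chosen greedily from what remains. The toy dot-splitting tokenizer, the token-score dictionary standing in for a logit vector, and the helper names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of dependency-constrained decoding over an API-name prefix tree.
# Tokenizer, scores, and helper names are placeholders for illustration only.

def build_prefix_tree(api_names, tokenize):
    """Build a trie whose paths are the token sequences of valid API names."""
    root = {}
    for name in api_names:
        node = root
        for tok in tokenize(name):
            node = node.setdefault(tok, {})
        node["<end>"] = {}               # marks a complete API name
    return root

def allowed_next_tokens(prefix_tree, generated_tokens):
    """Walk the trie along the tokens generated so far; return valid next tokens."""
    node = prefix_tree
    for tok in generated_tokens:
        if tok not in node:
            return set()                 # prefix no longer matches any valid API
        node = node[tok]
    return set(node.keys())

def constrained_greedy_step(token_scores, prefix_tree, generated_tokens, param_tokens):
    """Mask out invalid tokens (the binary mask) and pick the best remaining one."""
    allowed = allowed_next_tokens(prefix_tree, generated_tokens)
    if "<end>" in allowed:
        # A complete API name has been emitted: only parameter-pattern tokens
        # (e.g. an opening parenthesis) are valid next.
        allowed = (allowed - {"<end>"}) | set(param_tokens)
    masked = {tok: s for tok, s in token_scores.items() if tok in allowed}
    if not masked:
        return None                      # nothing valid remains: stop or fall back
    return max(masked, key=masked.get)   # greedy search over the masked scores

# Toy usage: two valid project APIs, a partial generation, and fake model scores.
apis = ["utils.io.read_file", "utils.io.write_file"]
tree = build_prefix_tree(apis, tokenize=lambda s: s.split("."))
scores = {"open": 2.5, "read_file": 0.9, "write_file": 0.4}
print(constrained_greedy_step(scores, tree, ["utils", "io"], param_tokens={"("}))
# -> "read_file": "open" scores higher but is masked out as an invalid API token
```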

To evaluate API hallucination specifically for project-specific APIs in practical settings, the authors introduce a new benchmark called APIHulBench. This benchmark consists of 416 Java code samples from 98 recent GitHub repositories (mid-2023 to mid-2024) to avoid data leakage. It is divided into two parts: APIHulBench-F (API calls in the first 50% of function lines) and APIHulBench-M (API calls beyond the 50% mark), representing different development stages. They also propose two new metrics for quantifying API hallucination: Micro Hallucination Number (MiHN), which counts the average number of hallucinatory elements per generated API, and Macro Hallucination Rate (MaHR), which measures the proportion of generated APIs containing any hallucinations. Standard code generation metrics (EM, ES, IM) are also used.
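A hedged sketch of how the two metrics could be computed from per-sample annotations, assuming a simple count of hallucinatory elements per generated API call (the element-level definition and data layout here are illustrative, not the paper's exact specification):

```python
# Hedged sketch of the two proposed metrics, computed from per-sample counts of
# hallucinatory elements; the bookkeeping shown here is an assumption.

def micro_hallucination_number(samples):
    """MiHN: average number of hallucinatory elements per generated API call."""
    return sum(s["hallucinated_elements"] for s in samples) / len(samples)

def macro_hallucination_rate(samples):
    """MaHR: fraction of generated API calls containing at least one hallucination."""
    return sum(1 for s in samples if s["hallucinated_elements"] > 0) / len(samples)

# Each record counts invalid elements (e.g. wrong name or parameters) in one call.
results = [
    {"hallucinated_elements": 0},
    {"hallucinated_elements": 2},
    {"hallucinated_elements": 1},
    {"hallucinated_elements": 0},
]
print(micro_hallucination_number(results))  # 0.75
print(macro_hallucination_rate(results))    # 0.5
```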

The paper evaluates MARIN against baseline approaches (Base Generation, RAG, De-Hallucinator) using six state-of-the-art LLMs (CodeLlama-7B/13B/34B and DeepSeekCoder-1.3B/6.7B/33B) on APIHulBench. The results show that MARIN consistently and significantly outperforms the baselines across all models and benchmarks, achieving an average decrease of 67.52% in MiHN and 73.56% in MaHR compared to the RAG approach, alongside substantial increases in accuracy metrics (107.3% average increase in EM). An ablation study demonstrates that both hierarchical dependency mining and dependency constrained decoding contribute positively to performance, with global dependencies having the largest impact.

Regarding efficiency, the paper shows that MARIN adds minimal computational overhead compared to base models (average 0.022s overhead per sample), while RAG and De-Hallucinator introduce significantly higher latency due to retrieval and iterative processes. MARIN also exhibits superior scalability with increasing model size compared to the baselines.

Furthermore, the authors validate MARIN's effectiveness in an industrial setting using a benchmark of 109 samples from Huawei's internal Java projects and two proprietary LLMs (PanguCoder-11B/34B). In this scenario, MARIN also achieves significant reductions in MaHR (average 59.41% decrease) and increases in EM (average 72.39% increase) with only minimal overhead (average 0.031s), demonstrating its practicality for industrial deployment.

In summary, the paper proposes MARIN as a novel framework to mitigate API hallucination in LLM-generated code by explicitly leveraging hierarchical project dependencies and employing constrained decoding. The empirical evaluation on a new benchmark and in an industrial setting validates its effectiveness and efficiency, highlighting the importance of structural project context and controlled generation for accurate API usage in code generation tasks.
