Candidate Filtering for Bangla Code Generation

Updated 29 December 2025

Candidate Filtering is a systematic process that prunes and validates code outputs from LLMs using predefined unit tests to ensure functional correctness.
It employs multi-stage architectures and iterative feedback loops, including debugging agents, to refine candidate solutions in Bangla code generation.
Evaluation using Pass@K metrics shows that this method can boost test accuracy by up to 40 percentage points compared to single-pass generation.

Candidate Filtering (CF) in the context of Bangla-to-code generation refers to systematic procedures for selecting, refining, and validating outputs generated by LLMs to ensure strict functional correctness under defined unit tests. While not always explicitly named as “Candidate Filtering,” these selection strategies underpin the most successful models in the BLP-2025 shared task and recent Bangla code-generation frameworks, forming a critical component between LLM inference and final program selection.

1. Formal Definition and Motivation

Candidate Filtering is defined as the process by which multiple code samples (candidates), produced by a generative model given a natural language instruction (in Bangla), are systematically evaluated and pruned to retain only those meeting predefined standards, typically passing all specified unit tests. In the BLP-2025 task, the formal requirement is that “a Python function implementation […] must satisfy every assert in the test_list” attached to each prompt (Dihan et al., 22 Dec 2025, &&&1&&&, Asib et al., 10 Nov 2025).

Motivation for CF arises from the inherent stochasticity of LLM outputs and the lack of type signatures or behavioral annotation in the Bangla-language prompt: multiple generation passes and subsequent validation shrink the solution space to the semantically correct subset, especially in low-resource and cross-lingual settings where direct generation accuracy is low.

2. System Architectures Employing Candidate Filtering

Recent high-performing systems for Bangla code generation employ multi-stage agent architectures for effective candidate pruning:

Two-Agent Pipelines: NALA_MAINZ (Saadi et al., 20 Nov 2025) deploys a primary code-generation agent ("G_θ"), which samples candidate solutions, and a secondary debugger agent ("D_ϕ"), which selectively reruns tests and proposes minimal code edits only if initial candidates fail. Only those candidates that pass all asserts after Stage 2 are retained—constituting the output of the CF mechanism.
Feedback Loops: Retriv (Asib et al., 10 Nov 2025) and BanglaForge (Dihan et al., 22 Dec 2025) frameworks institute iterative, test-driven refinement. After an initial greedy or low-temperature sample, failures are diagnosed from stack traces and assertion feedback; the model is re-prompted with this feedback over several passes (typically 2–5), producing a new candidate at each step. The process concludes with only the first candidate that passes all test cases being selected.

In these designs, the candidate pool is implicitly filtered at each iteration by executable correctness under the provided tests, with additional filtering based on format constraints and error diagnostics.

3. Evaluation Metrics and Pass@K Computation

The effectiveness of Candidate Filtering is operationalized via the Pass@K suite of metrics, which measure the probability that at least one of K independently sampled candidates passes all unit tests. The precise definitions, as implemented on BLP-2025 and MBPP-Bangla (Raihan et al., 11 Sep 2025, Asib et al., 10 Nov 2025, Dihan et al., 22 Dec 2025, Saadi et al., 20 Nov 2025), are as follows:

$\mathrm{Pass@K} = 1 - \frac{\binom{N_\text{fail}}{K}}{\binom{N}{K}}$

where $N$ is the number of samples drawn, and $N_\text{fail}$ is the number that fail all tests. For $K = 1$ (the leaderboard metric), this simplifies to: $\mathrm{Pass@1} = 1 - \frac{N_\text{fail}}{N}$

The pipeline for CF consists of:

Generate $N$ code completions per problem.
For each, run all original test cases.
Discard any sample failing any assertion.
For Pass@1, output is “correct” if any remaining sample exists.

This filter—based strictly on exact unit-test passage—underpins all result reporting and system tuning.

4. Comparative Results and Impact of Candidate Filtering

The impact of robust CF is evident in experimental results:

System	Pass@1 (Dev)	Pass@1 (Test)	Notes
NALA_MAINZ GPT-5 (Stage 1: gen only)	64.6%	—	No debugging/filtering
NALA_MAINZ GPT-5 (Stage 2: +debug)	—	95.4%	Test-driven CF raises accuracy by +30.8 pp
Retriv (feedback-guided)	—	93.4%	Three feedback-guided CF passes
BanglaForge (dual-model loop)	—	84.0%	Coder + Reviewer, retrieval-augmented, 5 self-refinement
Baseline models (single-pass)	≤69%	≤40%	High failure rates without CF
TigerCoder (MBPP-Bangla, Pass@1)	—	82.0%	Strict Pass@1 with multi-PL references, multi-candidate evaluation

CF thus delivers 25–40 percentage points gain in test accuracy over single-shot generation without validation, firmly establishing it as a bottleneck technology for LRL code generation (Raihan et al., 11 Sep 2025, Saadi et al., 20 Nov 2025, Asib et al., 10 Nov 2025, Dihan et al., 22 Dec 2025).

5. Methodological Elements and Best Practices

Candidate Filtering incorporates several methodological best practices as deployed in state-of-the-art systems:

Immediate rejection of samples with any syntax or assertion failure (Saadi et al., 20 Nov 2025, Asib et al., 10 Nov 2025).
Diagnosis-driven revision: Error traces and failing tests are explicitly included in subsequent prompts to maximize the informativeness of feedback for correcting the candidate (Asib et al., 10 Nov 2025, Dihan et al., 22 Dec 2025).
Iterative minimization: Only a restricted number of regeneration passes (typically ≤3) are allowed, with early exit upon success (i.e., as soon as $p$ passes all tests, no further candidates are sampled) (Asib et al., 10 Nov 2025, Dihan et al., 22 Dec 2025).
Isolation of code evaluation: All filtering is conducted in sandboxed environments using identical runtime versions to ensure reproducibility and comparability across systems (Raihan et al., 11 Sep 2025).

From ablation studies:

Omitting the debugger agent in two-stage pipelines leads to 10–30 points Pass@1 reduction (Saadi et al., 20 Nov 2025).
Removing execution-feedback-based iterative filtering in dual-model setups yields up to −25 points penalty (Dihan et al., 22 Dec 2025).
Large gains result from combining test- and trace-based filtering with controlled prompt translations and bilingual retrieval augmentation (Dihan et al., 22 Dec 2025, Asib et al., 10 Nov 2025).

6. Challenges, Limitations, and Future Directions

Despite its efficacy, Candidate Filtering faces limitations:

Reliance on Test Suite Quality: CF’s ceiling is determined by the completeness and representativeness of the test suite. Hidden/brittle cases or unconstrained instructions may cause overfitting and “specification leakage” (Saadi et al., 20 Nov 2025).
Dependence on proprietary agents: The best-performing pipelines use closed LLM APIs (e.g., GPT-5), restricting reproducibility (Saadi et al., 20 Nov 2025).
Finite resource constraints: Iterative CF incurs significant computational overhead, particularly for large K or multiple feedback passes (Asib et al., 10 Nov 2025).

Directions for improvement include:

Human-curated translation and denser retrieval for better cross-lingual alignment (Dihan et al., 22 Dec 2025).
Systematic expansion of test suite coverage to minimize behavioral underspecification (Saadi et al., 20 Nov 2025).
Adaptive or learned self-refinement (e.g., learned stopping, reward shaping) (Dihan et al., 22 Dec 2025).

7. Context within Bangla Code Generation and Benchmarking

Candidate Filtering is central to the construction and evaluation protocols of major Bangla code-generation benchmarks in 2025, including MBPP-Bangla (Raihan et al., 11 Sep 2025), BLP-2025 Task 2 (Dihan et al., 22 Dec 2025, Saadi et al., 20 Nov 2025, Asib et al., 10 Nov 2025), and mHumanEval-Bangla (Dihan et al., 22 Dec 2025). Its core role is to ensure that solution selection is based not on surface features or probabilistic majority, but on strict, test-driven semantic correctness, thus enabling fair model comparison and analysis of cross-lingual reasoning bottlenecks. The methodology codified in these shared tasks and benchmarks is likely to influence best practices in multi-modal and low-resource code generation beyond Bangla.