
BM25 MLLM Agent: Neural & Symbolic Integration

Updated 16 March 2026
  • BM25 MLLM Agents are autonomous systems that integrate traditional BM25 scoring with neural transformer attention to achieve both lexical and semantic matching.
  • They employ soft term frequency and neural IDF encoding through techniques like SVD to boost retrieval accuracy in complex technical domains.
  • By iteratively reformulating queries and self-evaluating ranked outputs, these agents deliver significant improvements in MAP and first-file retrieval metrics.

A BM25 MLLM Agent is an autonomous or semi-autonomous system that employs LLMs, typically based on transformer architectures, integrated with BM25-style information retrieval (IR) scoring either at a symbolic (lexical) or neural (semantic) level. Such agents are designed for complex retrieval-driven workflows, notably in technical or scientific domains, and leverage both classic IR and learned neural ranking paradigms for interpretable, efficient, and accurate retrieval and reranking (Lu et al., 7 Feb 2025, Caumartin et al., 7 Dec 2025).

1. Principles of BM25 and Neural BM25 Integration

BM25 is a term-weighting function used in IR, scoring documents according to their term frequency (TF), document length, and inverse document frequency (IDF):

$$\mathrm{BM25}(q,d) = \sum_{t\in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1 \left(1 - b + b\,|d|/\mathrm{avgdl}\right)}$$

where $f(t,d)$ is the frequency of token $t$ in document $d$, $|d|$ is the document length, $\mathrm{avgdl}$ is the mean document length, $k_1$ and $b$ are hyperparameters, and $\mathrm{IDF}(t)$ captures the informativeness of $t$. The standard application is symbolic: terms must match exactly.
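The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the IDF variant (a Lucene-style smoothed logarithm) is an assumption, since the text does not say which IDF is used, and the function name `bm25_scores` and the toy documents are made up for the example. The defaults $k_1 = 0.9$ and $b = 0.4$ mirror the Pyserini defaults mentioned later in the text.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=0.9, b=0.4):
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs_tokens for t in set(d))

    def idf(term):
        # Assumed IDF variant (Lucene-style, never negative).
        n_t = df.get(term, 0)
        return math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length-dependent part of the denominator: k1 * (1 - b + b * |d| / avgdl).
        norm = k1 * (1.0 - b + b * len(doc) / avgdl)
        score = 0.0
        for t in query_tokens:
            f = tf.get(t, 0)
            if f:  # exact lexical match only
                score += idf(t) * f * (k1 + 1.0) / (f + norm)
        scores.append(score)
    return scores

docs = [
    "null pointer in parser".split(),
    "parser crashes on empty input".split(),
    "update readme".split(),
]
scores = bm25_scores("parser crash".split(), docs)
```

Note that the query term "crash" does not match "crashes" in the second document: only "parser" contributes, and the shorter first document ranks higher. This is the exact-match limitation that the neural variant is meant to soften.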

Recent work demonstrates that neural language models, particularly transformer-based Cross-Encoders, can replicate and generalize BM25 mechanisms. Specific attention heads (the "Matching Heads") compute a soft, distributed term frequency by attending not only to exact matches but also to semantic neighbors, yielding a semantic variant termed "Neural BM25". The first singular vector of the LM's embedding matrix empirically encodes neural IDF, mirroring the role of classical document statistics (Lu et al., 7 Feb 2025).

2. Mechanistic Implementation in Cross-Encoders

Soft Term Frequency via Attention

The agent relies on a subset of the transformer's attention heads, $H^{TF}$, which are empirically identified as contributing to token-level matching. For a query $q = \{t_j\}$ and document $d = \{d_i\}$:

  • Raw soft-TF: $f_{raw}(t_j, d) = \sum_{h \in H^{TF}} \sum_{i=1}^{|d|} \alpha^{(h)}_{j, i}$
  • Saturation: $f_{sat}(t,d) = \frac{f_{raw}(t,d)}{1 + k_{sat} f_{raw}(t,d)}$, with $k_{sat}$ controlling diminishing gains, similar to the BM25 $k_1$ parameter.
  • Length Normalization: $\mathrm{lenNorm}(d) = \left(1 - b + b \frac{|d|}{\overline{dl}}\right)^{-1}$
  • Final soft-TF: $f_{soft}(t, d) = f_{sat}(t, d) \cdot \mathrm{lenNorm}(d)$
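The four steps above can be traced on a small hand-written attention tensor. The sketch below is illustrative only: the nested-list layout `attn[h][j][i]`, the function name `soft_tf`, and the default values for `k_sat` and `b` are assumptions, and the attention weights are invented rather than taken from a model.

```python
def soft_tf(attn, matching_heads, avgdl, k_sat=1.0, b=0.4):
    """Soft term frequency per query token.

    attn[h][j][i] is the attention weight from query token j to
    document token i in head h (layout assumed for this sketch).
    """
    first = attn[matching_heads[0]]
    n_query = len(first)
    doc_len = len(first[0])
    # lenNorm(d) = (1 - b + b * |d| / avgdl)^-1
    len_norm = 1.0 / (1.0 - b + b * doc_len / avgdl)
    result = []
    for j in range(n_query):
        # Raw soft-TF: sum over matching heads and document positions.
        raw = sum(sum(attn[h][j]) for h in matching_heads)
        # Saturation: diminishing returns as raw attention mass grows.
        sat = raw / (1.0 + k_sat * raw)
        result.append(sat * len_norm)
    return result

# Two heads, two query tokens, three document tokens (made-up weights).
attn = [
    [[0.7, 0.1, 0.1], [0.0, 0.05, 0.05]],
    [[0.6, 0.2, 0.0], [0.1, 0.0, 0.0]],
]
tf = soft_tf(attn, matching_heads=[0, 1], avgdl=3.0)
```

Here the first query token accumulates a raw mass of about 1.7 and saturates to about 0.63, while the second, with a raw mass of about 0.2, stays near 0.17. Because the document length equals `avgdl`, the length normalization factor is 1.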

Neural IDF Encoding

Through singular value decomposition of the embedding matrix $W_E$, the first singular vector $u^{(1)}$ is shown to be highly correlated with IDF ($r \approx -0.71$). Thus, the neural IDF for a token $t$ is approximated as:

$$\mathrm{IDF}(t) \approx -\lambda\, u^{(1)}[t]$$

for a positive scaling factor $\lambda$.
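As a rough illustration of the approximation above, the sketch below extracts the leading left singular vector of a toy embedding matrix and scales it by $-\lambda$. To stay within the standard library it uses power iteration rather than a full SVD; with NumPy available, `numpy.linalg.svd` would be the usual choice. The matrix values and function names are invented. The sign of a singular vector is arbitrary, so a real pipeline would need to orient $u^{(1)}$ against reference IDF values; here the all-positive start vector fixes the orientation, and only the relative ordering of the outputs is meaningful.

```python
import math

def leading_left_singular_vector(W, iters=50):
    """Power iteration on W W^T; returns a unit vector with one entry per row of W."""
    n_rows, n_cols = len(W), len(W[0])
    u = [1.0 / math.sqrt(n_rows)] * n_rows
    for _ in range(iters):
        # v = W^T u
        v = [sum(W[r][c] * u[r] for r in range(n_rows)) for c in range(n_cols)]
        # u = W v, then renormalize to unit length
        u = [sum(W[r][c] * v[c] for c in range(n_cols)) for r in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

def neural_idf(W_E, lam=1.0):
    """IDF(t) ~ -lam * u1[t], one value per vocabulary row of W_E."""
    u1 = leading_left_singular_vector(W_E)
    return [-lam * x for x in u1]

# Toy embedding matrix: rows 0 and 1 stand in for frequent tokens with a
# large component along a shared direction, rows 2 and 3 for rarer tokens.
W_E = [
    [3.0, 0.1],
    [2.5, -0.1],
    [0.3, 0.2],
    [0.2, -0.3],
]
idf = neural_idf(W_E)
```

In this toy setup the tokens with the largest projection on the leading direction receive the lowest values, which matches the negative correlation reported in the text.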

3. Agent Pipeline and Retrieval Logic

A typical BM25-MLLM Agent workflow, as implemented in repository-level bug localization (Caumartin et al., 7 Dec 2025), involves:

  1. Pre-retrieval Reformulation: An LLM parses unstructured queries (e.g., bug reports) to extract structured fields (summary, identifiers, code snippets) using an extraction prompt and outputs a JSON schema.
  2. BM25 Retrieval: The extracted fields are concatenated into a BM25 query string (e.g., "explanation + identifiers + snippets"), and top-k candidates are retrieved using standard BM25 scoring, typically via Pyserini with defaults $k_1 = 0.9$, $b = 0.4$.
  3. Neural BM25 Reranking: For enhanced semantic matching and interpretability, the agent may rerank the retrieved candidates using the neural BM25 module described above, combining information from transformer attention and neural IDF.
  4. Agentic Loop: The LLM iteratively inspects candidate documents (e.g., code files), may self-correct prior actions, conduct path validation, view content snippets, and finally produces a ranked and reasoned output.
  5. Self-Evaluation and Summarization: The agent performs a "self-evaluation" step, reordering or refining the candidate list based on all accumulated evidence (BM25/neural scores, file content, intermediate reasoning).

preprocess:
  H_TF ← {layer, head} indices for Matching Heads
  avgdl ← average doc length in collection
  (U, Σ, V^T) ← SVD(EmbeddingMatrix);  u0 ← U[:, 0]
  λ ← scale factor to map u0 into familiar IDF range
function neural_bm25_score(query, doc):
  tok_q ← tokenize(query)
  tok_d ← tokenize(doc)
  attns ← CrossEncoder.forward_until_last_attention(tok_q, tok_d)
  raw_tf[j] = 0  for each position j in tok_q
  for h in H_TF:
    for each query position j and document position i:
      raw_tf[j] += attns[h][j, i]
  lenNorm = 1.0 / (1 - b + b * len(tok_d) / avgdl)
  score = 0
  for each position j in tok_q:
    sat = raw_tf[j] / (1 + k_sat * raw_tf[j])
    softTF = sat * lenNorm
    idf = -λ * u0[index(tok_q[j])]    # index(): vocabulary id of the token
    score += idf * softTF
  return score

4. Query Reformulation and Candidate Selection

BM25 MLLM Agents benefit substantially from pre-retrieval query reformulation. The agent employs an LLM to extract salient fields from an input (e.g., bug report), forming several ablated query variants. Empirically, concatenating explanation, identifiers, and code snippets ("exp_code" variant) provides the best trade-off, yielding up to +45% improvements in MAP@1 for retrieval over non-reformulated queries (Caumartin et al., 7 Dec 2025). This step reduces noise and increases lexical and semantic overlap between the query and target documents.

| Reformulation Variant | Key Fields | Empirical Benefit |
|---|---|---|
| all | all JSON fields | Baseline for ablation |
| explanation | summary only | Lower performance |
| all_code | code signals only | Moderate improvement |
| id_snippet | identifiers + code | Substantial improvement |
| exp_code | explanation + identifiers + code | Highest MAP@1, adopted as default |
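A minimal sketch of how the variants in the table could be assembled from the LLM's JSON output follows. The field names (`explanation`, `identifiers`, `snippets`, `environment`) and the function `build_query` are hypothetical, since the source describes the schema only loosely. The `all_code` variant is left out because the text does not say which fields distinguish it from `id_snippet`.

```python
import json

# Hypothetical field names for the extraction schema (assumption).
VARIANT_FIELDS = {
    "explanation": ["explanation"],
    "id_snippet": ["identifiers", "snippets"],
    "exp_code": ["explanation", "identifiers", "snippets"],
}

def _as_text(value):
    """Flatten a string or a list of strings into one space-separated string."""
    if isinstance(value, (list, tuple)):
        return " ".join(str(v) for v in value)
    return str(value)

def build_query(llm_json, variant="exp_code"):
    """Concatenate the fields selected by a variant into one BM25 query string."""
    fields = json.loads(llm_json)
    # "all" uses every field the LLM returned, in the order it returned them.
    names = list(fields) if variant == "all" else VARIANT_FIELDS[variant]
    # Missing or empty fields are skipped rather than raising.
    parts = [_as_text(fields[n]) for n in names if fields.get(n)]
    return " ".join(parts)

raw = json.dumps({
    "explanation": "Parser crashes on empty input",
    "identifiers": ["parse", "TokenStream"],
    "snippets": ["parse(TokenStream([]))"],
    "environment": "Python 3.11",
})
default_query = build_query(raw)
code_only_query = build_query(raw, "id_snippet")
full_query = build_query(raw, "all")
```

Skipping missing fields keeps the query builder usable when extraction is incomplete, which matters because the pipeline is described as sensitive to extraction quality.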

5. Self-Evaluation, Summarization, and Iterative Reasoning

After initial retrieval, the agent enters an interactive loop in which it inspects code files, corrects retrieval errors (e.g., invalid paths), and explicitly self-evaluates its ranking list. Summarization is performed in two locations: (a) at pre-retrieval, where the LLM condenses the bug explanation to improve IR overlap, and (b) post-retrieval, where the LLM aggregates evidence and refines the ranking. This process enhances both MAP@1 and Hit@1 scores beyond pure BM25 or reformulation baselines.

The agent's iterative correction is constrained to three attempts. The entire pipeline was validated on datasets including Long Code Arena (mean 559 files/task, Python/Java/Kotlin) and SWE-bench Lite (mean 664 files/task, Python). Models tested were open-source Qwen2.5-32B and Qwen3-30B under controlled sampling (temperature = 0).
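The bounded correction loop can be sketched as follows. The source states only that correction is limited to three attempts; what counts as an attempt, the feedback message, and the names `resolve_path` and `propose` are assumptions made for illustration. The `propose` callable stands in for an LLM call.

```python
def resolve_path(propose, known_paths, max_attempts=3):
    """Ask a proposer for a file path, re-prompting on invalid paths.

    propose(feedback) returns a candidate path; feedback is None on the
    first call and an error message afterwards. Returns (path, attempts),
    or (None, max_attempts) if every attempt fails validation.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        if candidate in known_paths:
            return candidate, attempt
        feedback = f"'{candidate}' is not in the repository; pick an existing file."
    return None, max_attempts

known = {"src/parser.py", "src/lexer.py"}

# Stub proposer: first answer is a near miss, second is valid.
answers = iter(["src/parse.py", "src/parser.py"])
fixed = resolve_path(lambda feedback: next(answers), known)

# Stub proposer that never produces a valid path.
gave_up = resolve_path(lambda feedback: "docs/missing.py", known)
```

Capping the loop keeps cost predictable and prevents the agent from cycling on a path it keeps hallucinating.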

6. Scalability, Efficiency, and Integration

Scalability is achieved through several complementary strategies:

  • Precomputation: For each document, a term-profile vector (raw soft counts for high-frequency terms) can be stored, enabling fast score adjustment without re-running full forward passes.
  • Clustering: Clustering document term profiles allows the use of centroid representatives, balancing index size and retrieval accuracy.
  • Distillation: The Matching Heads' logic may be distilled into a lightweight scoring network applied at retrieval time.
  • Pipeline Integration: Agents employ a three-stage pipeline: coarse BM25 retrieval (inexpensive, served from an inverted index with no neural inference), neural BM25 reranking (interpretable, semantic), and optional full Cross-Encoder scoring for the final shortlist. The final output is generated by conditioning the LLM on reranked top-k documents.
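The staged design can be illustrated with a small cascade in which each stage only sees the survivors of the previous one. The scoring functions below are toy stand-ins (word overlap for BM25, substring matching for the neural reranker), and the names `cascade`, `word_overlap`, and `substring_hits` are invented for this sketch.

```python
def cascade(query, corpus, coarse, rerank, final=None, k=10, shortlist=3):
    """Three-stage ranking over corpus (a dict mapping path -> text).

    coarse, rerank, and final are callables (query, text) -> float.
    """
    # Stage 1: cheap scorer over the whole corpus, keep the top k.
    top_k = sorted(corpus, key=lambda p: coarse(query, corpus[p]), reverse=True)[:k]
    # Stage 2: more expensive scorer, applied to the k survivors only.
    ranked = sorted(top_k, key=lambda p: rerank(query, corpus[p]), reverse=True)
    if final is None:
        return ranked
    # Stage 3 (optional): most expensive scorer on the head of the list.
    head, tail = ranked[:shortlist], ranked[shortlist:]
    head = sorted(head, key=lambda p: final(query, corpus[p]), reverse=True)
    return head + tail

def word_overlap(query, text):
    """Toy lexical scorer: number of shared whitespace-delimited words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def substring_hits(query, text):
    """Toy 'softer' scorer: query words that occur anywhere in the text."""
    lowered = text.lower()
    return sum(1 for w in query.lower().split() if w in lowered)

corpus = {
    "src/parser.py": "def parse(tokens): raise ValueError on empty input",
    "src/lexer.py": "def tokenize(text): split on empty input into tokens",
    "README.md": "project overview and install steps",
}
query = "parse fails on empty input tokens"
result = cascade(query, corpus, word_overlap, substring_hits, k=2)
```

In this example the coarse stage prefers `src/lexer.py` (four exact word matches against three), while the softer second stage also credits `parse` inside `parse(tokens):` and moves `src/parser.py` to the top. `README.md` never reaches the second stage.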

7. Performance, Limitations, and Future Developments

Empirical results indicate:

  • Query reformulation ("exp_code") boosts MAP@1 by up to +45% on Long Code Arena and +19% on SWE-bench Lite.
  • Full agentic workflow further improves first-file MAP by 30–50%, making it competitive with fine-tuned or proprietary baselines (MAP@1: 0.336, Hit@1: 0.627 on LCA; MAP@1: 0.727, Hit@1: 0.727 on SWE-bench with Qwen3-30B) (Caumartin et al., 7 Dec 2025).
  • Performance gains are concentrated at small $k$ (first-file retrieval); improvements diminish for larger candidate sets.
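For readers unfamiliar with the metrics, the sketch below computes Hit@k and a mean average precision at k over a handful of invented tasks. The normalization used for average precision (dividing by the total number of relevant files) is one common convention and is an assumption here; the source does not spell out its definition. Under this convention MAP@1 can never exceed Hit@1, and the two coincide when every task has exactly one relevant file, which would be consistent with the identical SWE-bench figures quoted above.

```python
def hit_at_k(ranked, relevant, k=1):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(p in relevant for p in ranked[:k]) else 0.0

def ap_at_k(ranked, relevant, k=1):
    """Average precision at k, normalized by the number of relevant items
    (assumed convention; other normalizations exist)."""
    hits, total = 0, 0.0
    for rank, path in enumerate(ranked[:k], start=1):
        if path in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_metric(metric, runs, k=1):
    """Average a per-task metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

# Invented tasks: (ranked predictions, ground-truth files).
runs = [
    (["a.py", "b.py"], {"a.py", "c.py"}),  # top file correct, two relevant files
    (["x.py"], {"x.py"}),                  # top file correct, one relevant file
    (["m.py"], {"n.py"}),                  # miss
]
hit1 = mean_metric(hit_at_k, runs, k=1)
map1 = mean_metric(ap_at_k, runs, k=1)
```

On these three tasks Hit@1 is 2/3 while MAP@1 is 0.5, because the first task has two relevant files and only one can appear at rank 1.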

Limitations include non-determinism in LLM behavior, the potential for contamination of pretraining data with evaluation items, the restriction to only two open-source models, and sensitivity to extraction quality from the LLM. The pipeline relies on the accuracy and stability of tool outputs, constraining its applicability in more error-prone environments.

Future directions include finer-grained indexing (e.g., method-level BM25), direct exposure of neural BM25 scores for agentic decision-making, rank-fusion strategies, IDE integration for live retrieval, and further exploration of interpretable neural scoring methods (Lu et al., 7 Feb 2025, Caumartin et al., 7 Dec 2025).

