- The paper introduces novel code embedding models using autoregressive decoders and last-token pooling to enhance code representation efficiency.
- A contrastive learning strategy with InfoNCE loss and Matryoshka representation learning trains the models to distinguish related from unrelated code and natural language pairs while supporting truncatable embeddings.
- Benchmark evaluations on tasks like CodeSearchNet and MTEB-CoIR demonstrate that these smaller models deliver competitive performance in diverse code retrieval scenarios.
Efficient Code Embeddings from Code Generation Models
Introduction
The paper "Efficient Code Embeddings from Code Generation Models" presents a novel approach to generating efficient code embeddings by leveraging code generation models. The research addresses the significant challenges faced by current code embedding models, which predominantly arise from the scarcity of semantically-rich aligned data, such as inline comments and documentation strings. Despite the success of models like Codex in synthesizing code from natural language prompts, there is a pressing need for retrieval-augmented generation (RAG) architectures that enhance contextual understanding of codebases, particularly in real-world development scenarios. This paper introduces two distinct models developed using autoregressive decoders and emphasizes the importance of domain-specific training strategies to optimize performance for applications such as natural language to code retrieval, technical question answering, and code-to-code retrieval.
Model Architecture
The proposed code embedding models, jina-code-embeddings-0.5b and jina-code-embeddings-1.5b, are built on autoregressive decoder architectures initialized from Qwen2.5-Coder backbones. They generate embeddings by applying last-token pooling to the final hidden states of the decoder. The paper's analysis indicates that this technique outperforms alternative pooling methods, such as mean pooling and latent attention pooling, particularly for decoder-only architectures.
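To make the pooling step concrete, here is a minimal sketch of last-token pooling in PyTorch. It assumes right-padded batches and a generic decoder output; the function name and shapes are illustrative and not taken from the paper's code.

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the final-layer hidden state of the last non-padding token.

    hidden_states:  (batch, seq_len, dim) outputs of the decoder's last layer.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Assumes right padding, so the last real token sits at index sum(mask) - 1.
    """
    last_idx = attention_mask.long().sum(dim=1) - 1                # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                      # (batch, dim)
```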
The research organizes the downstream tasks for code embedding into distinct categories, each associated with a specific instruction string that is prepended to the input to guide embedding generation. The categories are NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion, ensuring that the embeddings align with the intended code retrieval applications; a sketch of this prefixing scheme follows.
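The snippet below illustrates the idea of task-specific instruction prefixes. The dictionary keys mirror the five categories named above, but the instruction strings themselves are hypothetical placeholders, not the exact prompts used by the authors.

```python
# Hypothetical instruction prefixes per task category; the exact strings used
# in the paper may differ.
TASK_INSTRUCTIONS = {
    "nl2code": "Find the code snippet that implements the described functionality: ",
    "techqa": "Retrieve the answer to the technical question: ",
    "code2code": "Find a code snippet semantically similar to the given code: ",
    "code2nl": "Find the natural-language description of the given code: ",
    "code2completion": "Find the completion of the given code fragment: ",
}

def build_query(task: str, text: str) -> str:
    """Prepend the task-specific instruction so the model embeds the input
    in the context of the intended retrieval task."""
    return TASK_INSTRUCTIONS[task] + text

# Example: an NL2Code query ready to be embedded.
query = build_query("nl2code", "sort a list of dictionaries by a key")
```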
Training Methodology
The training process builds upon the Qwen2.5-Coder backbones with a contrastive learning approach based on the InfoNCE loss, which optimizes the embedding model to distinguish related from unrelated code and natural language pairs. A further aspect of the paper is the incorporation of Matryoshka representation learning, which lets embeddings be truncated to shorter lengths, reducing storage and compute costs without a significant loss in precision.
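A minimal sketch of the combined objective is shown below, assuming in-batch negatives and normalized embeddings; the truncation dimensions and temperature are illustrative assumptions rather than the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: the i-th document is the positive for
    the i-th query, and every other document in the batch is a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def matryoshka_info_nce(query_emb, doc_emb, dims=(64, 128, 256, 512), temperature=0.05):
    """Average the InfoNCE loss over truncated prefixes of the embeddings so
    that shorter prefixes remain useful retrieval vectors on their own."""
    losses = [info_nce(query_emb[:, :d], doc_emb[:, :d], temperature) for d in dims]
    return torch.stack(losses).mean()
```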
The models were trained on a wide range of query-document pairs tailored to the various code retrieval tasks, drawn from both real-world sources and datasets generated synthetically with GPT-4o, particularly in areas where data is scarce, such as code translation between deep learning frameworks. Training configurations were tuned for each model, with the 0.5B and 1.5B parameter variants trained on A100 GPUs and optimized for task performance across the selected benchmarks.
Evaluation and Results
The effectiveness of the embedding models was validated on several benchmarks, including MTEB-CoIR and various code-specific tasks like CodeSearchNetRetrieval, MBPP, and WikiSQL. Results demonstrated that despite having fewer parameters, the jina-code-embeddings models performed competitively against larger alternatives, achieving superior or comparable results in many instances.
Detailed evaluation showed these models leading on several key benchmarks, with strong retrieval accuracy across programming languages and task types. This substantiates the models' capability and efficiency in handling diverse coding challenges, making them suitable for real-world applications where the balance between performance and resource consumption is crucial.
Conclusion
The paper offers a thorough exploration of advancing code embedding methodology by using autoregressive code generation models as backbones. Through task-specific instructions and embedding techniques such as last-token pooling, the models achieve state-of-the-art results in code retrieval. They not only strengthen retrieval-augmented generation systems but also demonstrate that smaller, specialized models can compete with larger alternatives in practical scenarios. Future work should focus on further refining the task-specific training strategies and on integrating these embeddings into broader software engineering pipelines.