
CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences (2202.06689v1)

Published 14 Feb 2022 in cs.SE and cs.LG

Abstract: Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.

Code completion is a fundamental feature in Integrated Development Environments (IDEs) that helps developers write code faster and more accurately. Traditional approaches often rely on grammar rules or simple statistical models, which struggle with the dynamic nature of modern programming languages like Python and fail to capture the nuances of code context, especially when suggesting identifiers or completing longer code sequences. NLP-based models treat code as text but often ignore its inherent structure and the specific requirements of code prediction, such as handling a potentially unlimited vocabulary of identifiers.

The paper "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences" (Izadi et al., 2022 ) proposes CodeFill, a novel learning-based code completion model designed to overcome these limitations. CodeFill leverages the idea that source code information is conveyed through two channels: the natural language-like naming channel (variable names, function names, etc.) and the structural channel (AST token types, indentation). By learning from both simultaneously using a parallel Transformer architecture and Multi-Task Learning (MTL), CodeFill aims to provide more accurate and contextually relevant suggestions, including predicting multiple tokens to complete entire statements.

CodeFill's approach consists of two main phases: pre-processing and model training, followed by a post-processing step for re-ranking suggestions.

Pre-processing:

The initial step transforms raw source code into a format suitable for the model. This involves:

  1. Removing comments, blank spaces, and blank lines.
  2. Parsing the code to extract Abstract Syntax Tree (AST) information using libraries like Python's ast.
  3. Identifying and replacing module, library, and alias names with special tokens like MODULE, LIBRARY, ALIAS.
  4. Tokenizing the code and extracting four pieces of information for each token: its value (the actual text), its type (derived from the AST or lexer), its line number, and its position within the line.
  5. Tracking variable visibility (global vs. local) to help differentiate names.
  6. Handling indentation, which is syntactically important in Python. Special tokens <INDENT> and <DEDENT> are inserted to mark changes in indentation levels.
  7. Applying Byte-Pair Encoding (BPE) to token values. This addresses the out-of-vocabulary (OOV) problem common in code by segmenting rare or unseen identifiers into known sub-word units, allowing the model to generate novel names. Literals (strings, numbers) are replaced with special tokens (STRING, NUMBER).

This process generates two parallel sequences for each source file: one containing token values (with BPE applied and literals replaced) and one containing AST token types. These sequences serve as the corresponding inputs to different parts of the CodeFill model. The end of statements is marked with an <EOS> token to facilitate statement completion.

For example, a Python line like value = os.environ.get(var) would be represented in parallel sequences:

  • Value sequence: value = os . environ . get ( var ) <EOS> (after BPE, os, environ, get, var might be single tokens or broken down further)
  • Type sequence: NAME ASSIGN NAME DOT NAME DOT NAME LPAR NAME RPAR EOS (using simplified AST token types)
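A minimal sketch of how such parallel sequences can be produced with Python's standard tokenize module is shown below. It is illustrative only: the paper additionally derives richer types from the AST, tracks variable scopes, and applies BPE to the value sequence, and the type names emitted by tokenize differ slightly from the paper's simplified labels (e.g., EQUAL instead of ASSIGN).

```python
import io
import tokenize

def parallel_sequences(source: str):
    """Produce parallel (value, type) sequences for a code snippet.

    Illustrative sketch: the paper additionally applies BPE to values,
    uses AST-derived token types, and tracks variable visibility.
    """
    values, types = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue                                      # drop comments / blank lines
        if tok.type == tokenize.INDENT:
            values.append("<INDENT>"); types.append("INDENT")
        elif tok.type == tokenize.DEDENT:
            values.append("<DEDENT>"); types.append("DEDENT")
        elif tok.type == tokenize.NEWLINE:
            values.append("<EOS>"); types.append("EOS")   # statement boundary
        elif tok.type == tokenize.STRING:
            values.append("STRING"); types.append("STRING")  # literal abstraction
        elif tok.type == tokenize.NUMBER:
            values.append("NUMBER"); types.append("NUMBER")
        else:
            values.append(tok.string)
            types.append(tokenize.tok_name[tok.exact_type])  # NAME, EQUAL, DOT, LPAR, ...
    return values, types

vals, typs = parallel_sequences("value = os.environ.get(var)\n")
print(vals)  # ['value', '=', 'os', '.', 'environ', '.', 'get', '(', 'var', ')', '<EOS>']
print(typs)  # ['NAME', 'EQUAL', 'NAME', 'DOT', 'NAME', 'DOT', 'NAME', 'LPAR', 'NAME', 'RPAR', 'EOS']
```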

Model Training:

CodeFill employs a parallel architecture consisting of three distinct GPT-2 Transformer models. Each Transformer is responsible for one of three tasks:

  1. Token Value Prediction (TVP): Predicting the value of the next token given the preceding sequence of token values.
  2. Token Type Prediction (TTP): Predicting the type of the next token given the preceding sequence of token types.
  3. Statement Completion (SC): Predicting a sequence of token values until an <EOS> token is generated, given the preceding sequence of token values.
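Since the implementation uses the HuggingFace Transformers library (see the experimental setup below), the parallel setup could be instantiated roughly as in the following sketch; the vocabulary sizes and model dimensions here are assumptions for illustration, not the authors' exact configuration.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# One decoder per task. TVP and SC operate on the BPE value vocabulary,
# TTP on the much smaller vocabulary of AST token types.
VALUE_VOCAB_SIZE = 50_000   # assumed size of the BPE value vocabulary
TYPE_VOCAB_SIZE = 200       # assumed number of AST token types

def make_decoder(vocab_size: int) -> GPT2LMHeadModel:
    config = GPT2Config(vocab_size=vocab_size, n_positions=1024,
                        n_embd=768, n_layer=12, n_head=12)
    return GPT2LMHeadModel(config)

models = {
    "TVP": make_decoder(VALUE_VOCAB_SIZE),  # next token value
    "TTP": make_decoder(TYPE_VOCAB_SIZE),   # next token type
    "SC":  make_decoder(VALUE_VOCAB_SIZE),  # token values until <EOS>
}
```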

The models are trained using a two-stage process with soft-parameter sharing Multi-Task Learning (MTL):

  1. Pre-training: The models are trained on a large unlabeled dataset (PY1690K) using all three tasks (TVP, TTP, SC) with a joint loss function. Soft-parameter sharing means each task has its own model with its own parameters, but the training process regularizes the distance between these parameters to encourage knowledge transfer.
  2. Fine-tuning: The pre-trained models are fine-tuned on a smaller, task-specific dataset (PY117K) using only the TVP and SC tasks. TTP is excluded here because the limited vocabulary of types is quickly learned during pre-training.

The training uses an alternating strategy, picking a random task for each epoch with a configurable probability (e.g., 20% TTP, 40% TVP, 40% SC during pre-training). This prevents catastrophic forgetting between tasks. The objective for each task is a standard language modeling objective: maximizing the probability of the target sequence given the context.
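A hedged sketch of this training scheme in PyTorch is given below: one task is sampled per epoch with the probabilities from the text, and soft-parameter sharing is approximated by an L2 penalty pulling the active model's weights toward the other task models. The regularization weight and batch format are assumptions, and per-task label handling (e.g., masking for statement completion) is simplified.

```python
import random
import torch

TASK_PROBS = {"TTP": 0.2, "TVP": 0.4, "SC": 0.4}  # pre-training mix from the text
REG_WEIGHT = 1e-3                                  # assumed soft-sharing strength

def soft_sharing_penalty(active, others):
    """L2 distance between the active model's parameters and the (detached)
    parameters of the other task models (soft-parameter sharing)."""
    penalty = 0.0
    for other in others:
        for p, q in zip(active.parameters(), other.parameters()):
            if p.shape == q.shape:        # embeddings/heads may differ in vocab size
                penalty = penalty + torch.sum((p - q.detach()) ** 2)
    return penalty

def train_epoch(models, loaders, optimizers):
    """models/loaders/optimizers: dicts keyed by task name ('TVP', 'TTP', 'SC')."""
    task = random.choices(list(TASK_PROBS), weights=list(TASK_PROBS.values()))[0]
    model, loader, opt = models[task], loaders[task], optimizers[task]
    others = [m for name, m in models.items() if name != task]
    model.train()
    for batch in loader:                  # batch["input_ids"]: value or type sequences
        out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        loss = out.loss + REG_WEIGHT * soft_sharing_penalty(model, others)
        opt.zero_grad()
        loss.backward()
        opt.step()
```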

For a sequence of predicted tokens $\{v_t\}$ given context $\{c_t\}$, the model estimates the conditional probability:

$$P(v_0,\dots,v_N \mid c_0,\dots,c_T) = \prod_{i=1}^{N} P(v_i \mid c_0,\dots,c_T,\, v_0,\dots,v_{i-1})$$

The loss function during pre-training is the minimum of the cross-entropy losses for the three tasks, and during fine-tuning, it's the minimum of the TVP and SC losses.
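Written out, with $\mathcal{L}_{\text{TVP}}$, $\mathcal{L}_{\text{TTP}}$, and $\mathcal{L}_{\text{SC}}$ denoting the per-task cross-entropy losses, this corresponds to:

$$\mathcal{L}_{\text{pre-train}} = \min\big(\mathcal{L}_{\text{TVP}}, \mathcal{L}_{\text{TTP}}, \mathcal{L}_{\text{SC}}\big), \qquad \mathcal{L}_{\text{fine-tune}} = \min\big(\mathcal{L}_{\text{TVP}}, \mathcal{L}_{\text{SC}}\big)$$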

During inference, CodeFill uses beam search with a beam width of 5 to find the most probable sequences of tokens for both single-token and multi-token predictions.
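As an illustration, with a HuggingFace-style model and tokenizer this decoding step could look like the sketch below; the generate call is a stand-in for the authors' decoding code, and the tokenizer and <EOS> handling depend on how the vocabulary was built.

```python
import torch

def complete_statement(model, tokenizer, prefix: str, max_new_tokens: int = 8):
    """Multi-token completion with beam search (beam width 5), stopping at <EOS>."""
    input_ids = tokenizer(prefix, return_tensors="pt").input_ids
    eos_id = tokenizer.convert_tokens_to_ids("<EOS>")   # assumes <EOS> is in the vocab
    with torch.no_grad():
        output = model.generate(input_ids,
                                num_beams=5,            # beam width used in the paper
                                max_new_tokens=max_new_tokens,
                                eos_token_id=eos_id,
                                early_stopping=True)
    # Return only the newly generated tokens, i.e. the suggested completion.
    return tokenizer.decode(output[0, input_ids.shape[1]:])
```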

Post-processing:

To improve the relevance of suggestions, CodeFill includes a post-processing step that re-ranks the top-K predictions based on their scope visibility in the current file (function, class, file). The intuition is that names declared in closer scopes are more likely to be intended completions.

Algorithmically, for the top-10 predictions (each represented as <token, type, probability>), CodeFill checks if the predicted identifier is declared within the visible scope. If it is, the prediction's probability is multiplied by a weight. The weights are determined based on the predicted token's type (e.g., Attribute Access, Variable names, Function names) and the scope in which the identifier is declared (Function, Class, File). These weights were tuned experimentally to balance accuracy and speed, as shown in the paper's Table 1. This re-ranking prioritizes local, contextually relevant suggestions over globally frequent but less relevant ones.
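The sketch below illustrates this re-ranking; the weight table is made up for illustration (the tuned values are in the paper's Table 1), and the scope lookup is abstracted into a pre-computed mapping.

```python
# Illustrative scope-based re-ranking of the model's top predictions.
# The (type, scope) -> weight table is a placeholder, not the paper's tuned values.
SCOPE_WEIGHTS = {
    ("VARIABLE",  "function"): 1.6,
    ("VARIABLE",  "class"):    1.3,
    ("VARIABLE",  "file"):     1.1,
    ("FUNCTION",  "class"):    1.4,
    ("FUNCTION",  "file"):     1.2,
    ("ATTRIBUTE", "class"):    1.5,
}

def rerank(predictions, declared_in):
    """predictions: list of (token, type, probability) tuples from the model.
    declared_in: maps an identifier to the closest visible scope in which it is
    declared ('function', 'class', or 'file'); identifiers not visible are absent."""
    rescored = []
    for token, tok_type, prob in predictions:
        scope = declared_in.get(token)                    # None if not declared nearby
        weight = SCOPE_WEIGHTS.get((tok_type, scope), 1.0)
        rescored.append((token, tok_type, prob * weight))
    return sorted(rescored, key=lambda p: p[2], reverse=True)
```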

Experimental Setup and Results:

CodeFill was implemented using the HuggingFace Transformers library and trained on two Python datasets: the public PY117K (29M LOC) and a larger, newly collected and deduplicated PY1690K (425M LOC) dataset. PY1690K was used for pre-training, and PY117K was split for fine-tuning (90%) and evaluation (10%).

The evaluation used comprehensive tasks reflecting real-world use:

  • Token-Level Prediction (TLP): Predicting the single next token.
    • TLP-A: Any token.
    • TLP-B: Token Type.
    • TLP-C: Leaf Node (names, attributes, functions).
    • TLP-D: Cardinal Point Completion (prediction after specific syntax like '.', '(', keywords). This is considered more representative of when developers trigger completion.
    • Metrics: Accuracy (Top-1 match), Mean Reciprocal Rank (MRR for Top-10); see the computation sketch after this list.
  • Statement-Level Prediction (SLP): Predicting the next n tokens to complete a statement, up to n=8.
    • Metrics: METEOR and ROUGE-L (standard metrics for sequence generation).
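For the token-level tasks, the two metrics can be computed as in the short sketch below (standard definitions of top-1 accuracy and MRR over the top-10 candidates, not the authors' evaluation scripts).

```python
def accuracy_and_mrr(ranked_candidates, ground_truths):
    """ranked_candidates: one ranked candidate list per prediction point.
    ground_truths: the expected token at each point.
    Returns (top-1 accuracy, mean reciprocal rank over the top 10)."""
    hits, rr_sum = 0, 0.0
    for candidates, truth in zip(ranked_candidates, ground_truths):
        if candidates and candidates[0] == truth:
            hits += 1
        if truth in candidates[:10]:
            rr_sum += 1.0 / (candidates.index(truth) + 1)
    n = len(ground_truths)
    return hits / n, rr_sum / n

# Example: a correct token ranked 2nd contributes 0.5 to MRR and nothing to accuracy.
print(accuracy_and_mrr([["os", "sys", "re"]], ["sys"]))   # (0.0, 0.5)
```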

CodeFill was compared against six baselines, including the state-of-the-art models GPT-C (Svyatkovskiy et al., 2020) and TravTrans+ (Kim et al., 2020).

The results demonstrated CodeFill's superior performance across all evaluation tasks:

  • TLP-A: CodeFill achieved 80.6% Accuracy and 81.7% MRR, outperforming all baselines, including TravTrans+ (78.9% Acc, 79.4% MRR).
  • TLP-B (Token Type): CodeFill showed strong performance across all types, significantly outperforming others for Identifiers (54.4% Acc, 56.0% MRR), which are notoriously difficult to predict.
  • TLP-C (Leaf Node): CodeFill achieved 66.3% Accuracy and 69.5% MRR, significantly better than TravTrans+ (61.7% Acc, 63.8% MRR), indicating its strength in predicting names.
  • TLP-D (Cardinal Point): CodeFill had the best performance with 70.0% Accuracy and 70.9% MRR, highlighting its effectiveness at practical completion points.
  • SLP: CodeFill consistently outperformed baselines, especially as the completion length increased. For 4-token completions (average statement length), CodeFill achieved 70.2% METEOR and 63.8% ROUGE-L, significantly higher than TravTrans+ (64.5% METEOR, 52.4% ROUGE-L).

An ablation study confirmed that each component contributes positively: the MTL approach (especially soft-parameter sharing) improves performance compared to a vanilla GPT-2 or hard parameter sharing, and adding the Statement Completion task further enhances results, likely by helping the model learn longer-range dependencies and context utilization.

Regarding runtime characteristics, CodeFill, like other Transformer-based models, has a relatively large number of parameters (258M), making client-side deployment impractical. However, its inference latency (73ms on the test setup) is well within the acceptable range for interactive IDE features (typically <100ms), supporting a centralized, server-based deployment model.

Contributions and Implications:

The paper's key contributions are:

  1. CodeFill Model: A novel model that unifies learning from both source code structure (token types) and naming sequences (token values) using a parallel Transformer architecture and MTL.
  2. Statement Completion Task: Introduction and training on a multi-token prediction task to complete entire statements, demonstrating its effectiveness in improving long-range dependency learning and providing more substantial suggestions.
  3. Evaluation Methodology: Proposing and evaluating on novel, more realistic tasks like Cardinal Point Prediction and multi-token Statement Completion.
  4. Public Resources: Releasing the implementation code and the large, deduplicated PY1690K dataset.

The practical implications are significant: CodeFill's superior performance, particularly in predicting names and completing statements at relevant points, translates to a better user experience in IDEs. By leveraging structural information often ignored by text-based models and training specifically for multi-token completions, CodeFill provides more accurate and helpful suggestions, potentially reducing typing effort and cognitive load for developers, especially in dynamically-typed languages where static analysis is limited.

The authors acknowledge limitations, including the need to validate findings on other programming languages and the potential for further improvements in areas like incorporating scoped information more explicitly or learning re-ranking weights automatically. Future work aims to explore these areas and potentially distill the model for more efficient deployment.

Authors
  1. Maliheh Izadi
  2. Roberta Gismondi
  3. Georgios Gousios