
How is Google using AI for internal code migrations? (2501.06972v1)

Published 12 Jan 2025 in cs.SE

Abstract: In recent years, there has been a tremendous interest in using generative AI, and particularly LLMs in software engineering; indeed there are now several commercially available tools, and many large companies also have created proprietary ML-based tools for their own software engineers. While the use of ML for common tasks such as code completion is available in commodity tools, there is a growing interest in application of LLMs for more bespoke purposes. One such purpose is code migration. This article is an experience report on using LLMs for code migrations at Google. It is not a research study, in the sense that we do not carry out comparisons against other approaches or evaluate research questions/hypotheses. Rather, we share our experiences in applying LLM-based code migration in an enterprise context across a range of migration cases, in the hope that other industry practitioners will find our insights useful. Many of these learnings apply to any application of ML in software engineering. We see evidence that the use of LLMs can reduce the time needed for migrations significantly, and can reduce barriers to get started and complete migration programs.

Summary

  • The paper demonstrates that bespoke LLM configurations can accelerate internal code migrations by over 50%, significantly reducing manual engineering effort.
  • The paper showcases a hybrid methodology that combines LLM edit generation with deterministic AST techniques for precise code change discovery and validation.
  • The paper presents detailed case studies—including int32-to-int64, JUnit, and Joda to Java time migrations—that illustrate measurable business value and scalable modernization.

The paper "How is Google using AI for internal code migrations?" discusses Google's experience using LLM-based approaches to expedite code migrations within its extensive internal codebase. The authors highlight the distinction between generic AI-based software development tools and custom solutions tailored for specific Product Areas (PAs). The focus is on bespoke solutions, particularly migration workloads, and the challenges in ensuring that LLM-based code migrations deliver substantial business value, measured by a minimum of 50% acceleration in task completion.

The introduction emphasizes the challenges inherent in maintaining mature, large codebases while adapting to business demands and integrating new frameworks. Google's adoption of software engineering practices such as a monorepo, analysis tools, rigorous code review, and CI/CD (Continuous Integration/Continuous Delivery) is mentioned. The paper contrasts generic AI tools, designed for widespread use, with bespoke solutions for specific tasks like code migration, which require higher-complexity interactions.

The paper highlights several case studies of code migration within Google:

  • int32 to int64 ID migration
  • JUnit3 to JUnit4 migration
  • Joda time to Java time migration
  • Cleanup of experimental flags

These migrations aim to address technical debt and modernize code across various Google product units. The authors note a steady increase in the number of changelists resulting from AI-powered migrations throughout 2024, fostering an ecosystem that allows for economies of scale.

The paper emphasizes that successful LLM-based code migration requires a combination of Abstract Syntax Tree (AST)-based techniques, heuristics, and LLMs. The LLM's primary role is in edit generation, while deterministic AST techniques handle location identification and validation.

A common toolkit has been developed to aid in code changes and identify relevant files, with project-specific customization through LLM prompts, validation steps, and human-driven review and rollout phases. The rest of the paper details Google's generic code AI technologies, bespoke technologies for code migration, case studies, and key learnings.

The section on generic AI tools describes the use of LLMs for inline code completion, code review comment resolution, and adapting pasted code, leading to measurable productivity gains. The authors mention a 38% acceptance rate by software engineers, assisting in the completion of 67% of code characters. Improvements stem from larger models, enhanced context construction, and tuning models based on usage logs. The paper also alludes to the use of extensive logs of internal software engineering activities for training data.

The bespoke use of LLMs for code migration addresses the limitations of deterministic code change solutions for tasks with high contextual variance. The goal is to leverage LLMs to reduce the need for complex, hard-to-maintain AST-based transformations. The authors use LLM prompting to build common workflows with customized instructions, and to customize sub-steps of the migration such as file discovery and validation. The high-level process involves change discovery, edit generation using the LLM, validation, human review, and rollout.
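The division of labor described above can be sketched as a small orchestration loop. This is a minimal illustration, not Google's actual toolkit: the `discover`, `generate`, and `validate` callables are hypothetical stand-ins for the deterministic discovery step, the LLM edit generation, and the build/test validation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MigrationChange:
    """A proposed edit to one file (illustrative structure, not the paper's)."""
    path: str
    new_content: str
    validated: bool = False


def run_migration(
    discover: Callable[[], List[str]],            # deterministic: find candidate files
    generate: Callable[[str], str],               # LLM: propose edited file content
    validate: Callable[[MigrationChange], bool],  # deterministic: build/test the edit
) -> List[MigrationChange]:
    """Sketch of the discover -> generate -> validate loop; validated
    changes would then proceed to human review and rollout."""
    changes = []
    for path in discover():
        change = MigrationChange(path, generate(path))
        change.validated = validate(change)
        if change.validated:
            changes.append(change)
    return changes
```

In this sketch, only edits that pass deterministic validation survive, mirroring the paper's point that the LLM handles edit generation while deterministic checks gate what reaches human reviewers.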

For each migration, the authors defined success as AI saving at least 50% of the time for the end-to-end work, including change generation, finding migration points, reviews, and rollouts. This contrasts with generic technologies where success is measured by the percentage of code written by AI or acceptance rate.

The int32 to int64 ID migration case study from Google Ads details the challenges of converting numerical IDs to 64-bit integers to avoid overflow issues. The manual effort was estimated to require hundreds of software engineering years and complex cross-team coordination. The adopted workflow involves an expert engineer identifying files and locations, an LLM-based migration toolkit producing verified changes, and manual checks by the engineer. The total time spent on the migration was reduced by an estimated 50%.

The process begins with the manual identification of protocol buffer fields for an ID, and Kythe is then used to find references to these seed fields across the entire Google codebase. The result of this Kythe search is a superset of files and lines that may potentially need to be modified. This superset is filtered to accurately identify the locations to be modified before the files are passed to the LLM.
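The superset-then-filter step can be illustrated with a toy predicate over candidate references. This is only a sketch: the real pipeline filters using semantic data from Kythe, not text matching, and the field name `customer_id` and comment convention here are invented for illustration.

```python
def filter_candidates(refs, source_lines, field_name="customer_id"):
    """Toy filter over a superset of (file, line-number) references,
    keeping only lines that mention the seed field outside comments.
    Hypothetical stand-in for the real semantic filtering."""
    kept = []
    for path, lineno in refs:
        line = source_lines[path][lineno - 1].strip()
        if field_name in line and not line.startswith("//"):
            kept.append((path, lineno))
    return kept
```

The point of the sketch is the shape of the step: a cheap over-approximation from the index is narrowed to the precise change locations before anything is sent to the model.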

The code migration toolkit inputs include a set of files, locations of expected changes, prompts describing the change, and optional few-shot examples. The toolkit expands this set with additional relevant files, such as test files, interface files, and other dependencies. The edit generation and validation step leverages a version of the Gemini model fine-tuned on internal Google code and data, following the DIDACT methodology. At inference time, each line where a change is needed is annotated with a natural language instruction as well as a general instruction for the model. The instructions remind the model to also update the test files.
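The per-line annotation can be pictured as interleaving instruction markers with the file contents before the model sees them. The marker text and the general instruction below are illustrative wording, not the paper's actual prompts.

```python
# Wording is illustrative, not the paper's actual prompt.
GENERAL_INSTRUCTION = (
    "Migrate the annotated ID fields from int32 to int64; "
    "remember to also update the corresponding test files."
)


def annotate_file(lines, change_lines, note="<migrate this line>"):
    """Insert a natural-language marker above each line that needs a
    change, producing the annotated context handed to the model."""
    out = []
    for i, line in enumerate(lines, start=1):
        if i in change_lines:
            out.append(note)
        out.append(line)
    return "\n".join(out)
```

A toolkit built this way lets the same annotation machinery serve many migrations, with only the marker text and general instruction swapped per project.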

To validate the changes automatically, the team implemented a mechanism that generates prompt combinations that are tried in parallel for each file group, similar to a pass@k strategy. The validations, which are configurable and often depend on the migration, commonly involve building the changed files and running their unit tests.
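The pass@k-style validation can be sketched as running several prompt variants concurrently and keeping the first candidate that passes the configured checks. The function names are hypothetical; in practice `validate` would build the changed files and run their unit tests.

```python
from concurrent.futures import ThreadPoolExecutor


def first_valid_edit(prompt_variants, generate, validate):
    """pass@k-style sketch: generate an edit for each prompt variant in
    parallel, then return the first candidate that passes validation
    (e.g. the changed files build and their unit tests pass)."""
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(generate, prompt_variants))
    for candidate in candidates:
        if validate(candidate):
            return candidate
    return None  # no variant produced a valid edit for this file group
```

Trying k variants per file group trades extra inference cost for a much higher chance that at least one generated edit survives deterministic validation.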

The JUnit3 to JUnit4 migration case study addresses the issue of outdated JUnit3 tests within Google's codebase. The LLM migration stack was used to automatically migrate these tests, with the resulting changelists split into smaller sets for review. The Gemini model, fine-tuned on the internal Google codebase, used a set of prompts consisting of the rules that humans apply when doing the migration manually. The updated test files were built and re-tested, with any failures sent back to the model for fixing. This technique migrated 5,359 files and modified more than 149,000 lines of code in 3 months, with approximately 87% of the AI-generated code committed without any changes.
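To give a flavor of the rules involved, here is a toy text rewrite showing two typical JUnit3-to-JUnit4 steps: dropping `extends TestCase` and annotating test methods with `@Test`. This is purely illustrative; the actual migration was performed by a fine-tuned LLM following natural-language rules, not by regexes, and a real migration also handles imports, setUp/tearDown, and assertions.

```python
import re


def junit3_to_junit4(src: str) -> str:
    """Toy rewrite of Java test source illustrating two migration rules.
    Not the paper's mechanism; regexes cannot handle the general case."""
    # Rule 1: JUnit4 test classes no longer extend junit.framework.TestCase.
    src = src.replace(" extends TestCase", "")
    # Rule 2: test methods are identified by @Test rather than a name prefix.
    src = re.sub(
        r"(?m)^([ \t]*)public void (test\w+)\(",
        r"\1@Test\n\1public void \2(",
        src,
    )
    return src
```

In the paper's workflow such rules are stated to the model in prose, and the model applies them across whole files, with build-and-test feedback looped back for failures.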

The Joda time to Java time migration case study focuses on migrating away from the Joda time library to the standard java.time package. The challenges include changes spanning class interfaces and fields, the need to split the work into manageable chunks, and the requirement for conversion functions between interacting components. The process involves change targeting, execution, review, and landing.

The targeting phase uses a pipeline on top of Kythe to build a cross-reference graph, identifying dependencies and categorizing potential changes. A clustering step groups the potential changes using these cross-references, which form directed acyclic graphs (DAGs) connecting files, so that the model makes consistent changes across related files. The changes themselves are guided by instructions to the model similar to those given to human engineers.
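The clustering idea can be sketched as grouping files that reference each other into units that are migrated together. This toy version treats the cross-reference edges as an undirected graph and extracts connected components; the real pipeline works on Kythe-derived DAGs with more structure than this.

```python
from collections import defaultdict


def cluster_files(edges):
    """Toy grouping of files into clusters via their cross-reference
    edges, so each cluster can be sent to the model as one consistent
    unit. Hypothetical simplification of the Kythe-based pipeline."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:  # iterative DFS over one component
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            component.add(n)
            stack.extend(graph[n] - seen)
        clusters.append(component)
    return clusters
```

Migrating a whole cluster in one model invocation is what lets the interfaces, fields, and call sites that depend on each other change consistently.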

This approach has led to the successful migration of many smaller and medium-sized file clusters. Current estimates show time savings of approximately 89% compared to manual changes, with engineers noting the tool's ability to quickly identify all places and dependencies to update.

The cleanup of experimental code case study addresses the issue of stale experimental flags within Google's codebase. The task involves finding code locations where the flag is referenced, deleting these references, simplifying conditional expressions, cleaning up dead code, and updating tests. The process involves flag discovery and targeting, using Code Search to find flag usages.

The code cleanup process involves providing the model with a set of files, the symbol name of the flag to clean, the flag's value, and instructions on how to execute the cleanup. The large context size of Gemini allows for packing all flag usages into a single query. Following the cleanup, additional validations are performed to ensure all instances are deleted and that the code builds and tests pass.
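Packing a whole cleanup into one large-context query can be sketched as assembling the flag symbol, its frozen value, the instructions, and every referencing file into a single prompt. The wording and the flag name below are invented for illustration; they are not the paper's actual prompt or a real flag.

```python
def build_cleanup_prompt(flag_name, flag_value, files):
    """Sketch of a single large-context cleanup query: instructions
    plus every file that references the flag. Illustrative only."""
    parts = [
        f"Remove all references to the flag `{flag_name}`, "
        f"assuming its value is permanently {flag_value}.",
        "Simplify conditionals, delete dead code, and update tests.",
    ]
    for path, content in files.items():
        parts.append(f"--- {path} ---\n{content}")
    return "\n\n".join(parts)
```

The paper's observation is that Gemini's context window is large enough that all usages of a flag can travel in one query, after which deterministic checks confirm every reference is gone and the code still builds and passes tests.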

The discussion section emphasizes the flexibility of LLMs in code modernization and their potential to transform code maintenance in large enterprises. The authors advocate for combining LLMs with AST-based techniques and breaking down complex tasks into simpler sub-tasks. They also note that while LLMs can significantly save time in change generation, human involvement is still needed for code reviews and change rollouts.

The authors advocate for constant measurement of business-level outcomes, and point out that poor model-level performance can compromise business-level outcomes, but good model-level performance does not guarantee them. The use of generative AI with bespoke techniques comes with the cost of training a number of engineers in these techniques. Similarly, while fine-tuned models can be useful, they also come at a cost, and a company needs to continually weigh investing in custom models for better outcomes against working with out-of-the-box models.

The paper references related work in repository-level changes with and without planning, code migrations, language translation, and code refactoring.

In conclusion, the authors express their intent to expand the use of LLM capabilities across multiple teams and product areas within Google. Future plans include expanding the portfolio of use cases from code migration to agents that can automate triaging, mitigating, and resolving complex system escalations.
