Benchmarking Language Models for Code Syntax Understanding (2210.14473v1)

Published 26 Oct 2022 in cs.CL

Abstract: Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, even though they represent the input as a token sequence without explicitly modeling its structure. Prior work has shown that pre-trained language models can capture the syntactic rules of natural languages without finetuning on syntax understanding tasks. However, there is so far limited understanding of how well pre-trained models capture code structure. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models on identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pretrained on code still lack an understanding of code syntax: they fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages and suggest the importance of modeling code syntactic structures.
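To make the benchmark idea concrete, the kind of ground-truth annotation the abstract describes, syntactic relationships read off a program's abstract syntax tree, can be sketched with Python's standard `ast` module. The helper below is an illustrative approximation, not the paper's actual annotation pipeline: it extracts parent-child edges between AST node types, whereas CodeSyntax annotates relations between tokens.

```python
import ast

def ast_relations(source: str):
    """Approximate the syntactic relations in a program by listing
    (parent, child) node-type pairs from its abstract syntax tree.

    This is a simplified stand-in for CodeSyntax-style annotation,
    which works at the token level rather than the node level.
    """
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

# A two-line program yields edges such as If -> Compare (the condition)
# and Assign -> BinOp (the right-hand side of the assignment).
edges = ast_relations("if x > 0:\n    y = x + 1")
```

A benchmark in this spirit would then check whether a model's attention heads (or other internal signals) recover such edges, and compare against the trivial baselines the abstract mentions, e.g. always predicting the token at a fixed positional offset.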

Authors (5)
  1. Da Shen (3 papers)
  2. Xinyun Chen (80 papers)
  3. Chenguang Wang (59 papers)
  4. Koushik Sen (49 papers)
  5. Dawn Song (229 papers)
Citations (13)