Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators (2212.06008v3)

Published 12 Dec 2022 in cs.SE and cs.AI

Abstract: AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of generated code with ground-truth references. However, it is not clear what metric to use, and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics on two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their descriptions in the English language. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
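The "output similarity metrics" mentioned in the abstract are automatic scores that compare generated code against a ground-truth reference. As a minimal illustrative sketch (not the paper's actual evaluation pipeline, and with hypothetical code strings), the example below computes two representative scores, exact match and sentence-level BLEU via NLTK, for a generated Python snippet versus its reference:

```python
# Minimal sketch of output similarity metrics for generated code.
# Illustrative only: the paper evaluates a broader set of metrics,
# and these example strings are hypothetical, not from its datasets.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)"     # ground-truth code
candidate = "sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)"  # model output

# Exact match: 1 if the generated code is textually identical to the reference.
exact_match = int(candidate.strip() == reference.strip())

# Sentence-level BLEU over whitespace tokens, with smoothing so that
# short sequences do not collapse to a zero score.
ref_tokens = reference.split()
cand_tokens = candidate.split()
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

print(f"Exact match: {exact_match}, BLEU: {bleu:.3f}")
```

The paper's central question is how well such textual-similarity scores agree with human judgments of whether the generated offensive code is actually correct.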

Authors (5)
  1. Pietro Liguori (24 papers)
  2. Cristina Improta (9 papers)
  3. Roberto Natella (42 papers)
  4. Bojan Cukic (8 papers)
  5. Domenico Cotroneo (36 papers)
Citations (16)
