Pixel Art Bench


FAQs

What is Pixel Art Bench?

Pixel Art Bench is a benchmark that measures how well different LLMs generate pixel art across seven subjects: Mario from Super Mario Bros, a rainbow, the sun, the planet Saturn with its rings, a maze, the Mona Lisa, and hope.

How does it generate the pixel art?

Our system prompts the LLM to return a structured JSON response containing a palette of 2-10 hex colors and a 24x24 grid of indices referencing that palette. We then convert the grid into a WebP image.
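As a rough illustration of that conversion, here is a minimal Python sketch that decodes a response into rows of RGB pixels. The function names (hex_to_rgb, decode_pixels) are hypothetical and this is not necessarily how our pipeline is written. The final WebP encoding is left out, since it would need an image library.

```python
import json


def hex_to_rgb(color: str) -> tuple[int, int, int]:
    """Convert a 6-digit hex string without '#' (e.g. "4a90d9") to (r, g, b)."""
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def decode_pixels(response_text: str) -> list[list[tuple[int, int, int]]]:
    """Turn a palette-plus-grid JSON response into rows of RGB pixels."""
    data = json.loads(response_text)
    palette = [hex_to_rgb(c) for c in data["palette"]]
    # Each grid row is a string of digits; each digit is a palette index.
    return [[palette[int(ch)] for ch in row] for row in data["grid"]]


# The 4x4 example from the prompt, used here instead of a full 24x24 grid.
example = (
    '{"palette":["87ceeb","228b22","8b4513","ff4500"],'
    '"grid":["0000","1111","2212","3333"]}'
)
pixels = decode_pixels(example)
```

Here pixels[0][0] is the first palette color as an RGB tuple, and the resulting rows can be handed to whatever image encoder is available.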

What is the exact prompt?

Here is the prompt we send to each model (with "Mario from Super Mario Bros" as an example subject).

Note that instead of returning a hex color for every pixel, the prompt asks for a compact encoding (a palette plus rows of single-digit palette indices) to save tokens.

Draw pixel art on a 24x24 grid. Return a JSON object with two keys:

"palette" — palette: array of 2-10 hex colors (e.g. "4a90d9"). No # prefix. Index 0 = first color. Every color the image needs must be here, including background.

"grid" — grid: array of 24 strings, one per row (y=0 is top). Each row is a STRING of digits (e.g. "000011112222"), NOT an array. Each string has exactly 24 characters. Each character is a digit 0-9 — the palette index for that pixel.

Example 4x4:
{"palette":["87ceeb","228b22","8b4513","ff4500"],"grid":["0000","1111","2212","3333"]}

Guidelines:
- If the prompt is abstract, vague, or describes a mood/feeling (e.g. "something funny", "chaos", "love"), first brainstorm a specific, creative, surprising scene that captures that idea, then draw that scene. Don't just draw a literal symbol like a smiley face or heart -- think of something more imaginative and unexpected.
- Draw a clear pixel representation of the idea. Do not spell out words.
- Use enough colors for the scene but keep the palette lean.
- Fill every cell — use a background color, never leave gaps.
- Output ONLY the JSON object. No markdown fences, no explanation.

Draw: Mario from Super Mario Bros

And here is an example response from Gemini 3.1 Pro:

{
  "palette": [
    "5c94fc",
    "e52521",
    "fbd0b4",
    "432817",
    "0f4d8f",
    "fdfa16",
    "ffffff",
    "000000"
  ],
  "grid": [
    "000000000000000000000000",
    "000000000000000000000000",
    "000000000011100000000000",
    "000000000111111000000000",
    "000000001161611100000000",
    "000000011111111110000000",
    "000000011111111111100000",
    "000000033322267220000000",
    "000000333222267222200000",
    "000000332222222222200000",
    "000000033222227777000000",
    "000000003222222000000000",
    "000000011141141110000000",
    "000000111141141111000000",
    "000001111141141111100000",
    "000666111454454111666000",
    "006666614444444416666600",
    "000666044444444440666000",
    "000000044440044440000000",
    "000000333330033333000000",
    "000003333330033333300000",
    "000000000000000000000000",
    "000000000000000000000000",
    "000000000000000000000000"
  ]
}

This is then converted into the image you see in the list.

How is the Elo rating calculated?

We use GPT-5.4 as a judge to compare pairs of outputs for the same prompt head-to-head, then compute Elo ratings from those matchups. An LLM judge scales far better than human evaluation, though it can introduce biases, such as favoring certain visual styles or color palettes.
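For readers unfamiliar with Elo, here is a minimal sketch of the standard update applied after one judged matchup. The K-factor of 32, the starting rating of 1000, and the one-matchup-at-a-time update are assumptions made for illustration, and the model names are placeholders. The benchmark's actual parameters and procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings after a single matchup (k=32 is an assumed value)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Placeholder names; both models start from an assumed rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, "model_a", "model_b")  # the judge preferred model_a
```

With equal starting ratings the expected score is 0.5, so the winner gains 16 points and the loser drops 16. An upset win against a higher-rated model moves both ratings further.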

Which models are evaluated?

All text models on OpenRouter.

If you don't see a particular model, it's likely because it didn't return a parsable response.

Why are some images blank?

Some models fail to return valid structured output with the expected palette and grid format, even after retries.
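To make "valid structured output" concrete, here is a sketch of the kind of check a response would have to pass, based only on the rules stated in the prompt. The function name validate_response and the error messages are hypothetical. Retry handling is not shown, and our real validation may be stricter or looser.

```python
import json
import re

HEX_COLOR = re.compile(r"[0-9a-fA-F]{6}")
DIGITS = "0123456789"


def validate_response(response_text: str, size: int = 24) -> list[str]:
    """Return a list of problems; an empty list means the response is usable."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    palette, grid = data.get("palette"), data.get("grid")
    if not isinstance(palette, list) or not 2 <= len(palette) <= 10:
        return ["palette must be an array of 2-10 colors"]

    errors = []
    if not all(isinstance(c, str) and HEX_COLOR.fullmatch(c) for c in palette):
        errors.append("palette entries must be 6-digit hex strings without '#'")
    if not isinstance(grid, list) or len(grid) != size:
        errors.append(f"grid must have exactly {size} rows")
        return errors
    for y, row in enumerate(grid):
        if not isinstance(row, str) or len(row) != size:
            errors.append(f"row {y} must be a string of {size} characters")
        elif not all(ch in DIGITS and int(ch) < len(palette) for ch in row):
            errors.append(f"row {y} has an index outside the palette")
    return errors


# The 4x4 example from the prompt passes when checked at size 4.
example = (
    '{"palette":["87ceeb","228b22","8b4513","ff4500"],'
    '"grid":["0000","1111","2212","3333"]}'
)
problems = validate_response(example, size=4)
```

A response wrapped in markdown fences, a grid row sent as an array instead of a string, or an index that points past the end of the palette would each fail a check like this.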

Can I generate my own?

Try our free pixel art generator, which lets you enter a topic and have several Gemini models generate pixel art for you.

Can I analyze the raw data?

Sure, ask on Discord or email us at [email protected].