MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
MolmoWeb introduces the first fully open visual web agent that operates exclusively from screenshots—no HTML, no accessibility trees—just raw pixels like humans see. By releasing MolmoWebMix, a massive dataset combining synthetic and human demonstrations with GUI perception data, and a family of high-performance vision-language models, this work breaks the proprietary stranglehold on web automation research and delivers state-of-the-art results that match or exceed closed frontier systems.
Most web agents today are black boxes—proprietary systems with undisclosed training data, hidden architectures, and results you can't reproduce. MolmoWeb shatters that paradigm by releasing everything: the data, the models, the code, and a vision-only design that sees the web exactly as you do.
What makes this radical is the input space. MolmoWeb doesn't peek at the page source or rely on structured markup. It sees only pixels, just like you do when you browse. This vision-centric design sidesteps the brittleness of DOM-based systems and closes the gap between how machines and humans interact with interfaces.
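To make the vision-centric design concrete, here is a minimal sketch of a screenshot-only agent loop. All names (`Action`, `policy`, `run_episode`) are illustrative stand-ins, not MolmoWeb's actual API; the point is that the model's only input is raw pixels and its only outputs are pixel-space actions.

```python
# Hypothetical sketch of a screenshot-only agent loop. The model sees pixels
# and emits pixel-coordinate actions; no DOM or accessibility tree is read.
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "click", "type", or "stop"
    x: int = 0         # pixel coordinates for clicks
    y: int = 0
    text: str = ""


def policy(screenshot: bytes, task: str, history: list) -> Action:
    """Stand-in for the vision-language model: maps raw pixels to an action."""
    # A real model would ground the task in the screenshot here; this stub
    # clicks once, then stops.
    if not history:
        return Action(kind="click", x=640, y=360)
    return Action(kind="stop")


def run_episode(task: str, max_steps: int = 100) -> list:
    """Run the observe -> act loop until the policy stops or the budget ends."""
    history: list = []
    for _ in range(max_steps):
        screenshot = b"<raw pixels>"      # would come from the live browser
        action = policy(screenshot, task, history)
        history.append(action)
        if action.kind == "stop":
            break
        # A real harness would dispatch the click at (action.x, action.y) here.
    return history


trace = run_episode("find the pricing page")
```

The 100-step default mirrors the step budget mentioned for the WebVoyager evaluation below; everything else is a toy.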
The engine behind this breakthrough is an unprecedented training corpus.
MolmoWebMix aggregates millions of demonstrations spanning several data types: synthetic task rollouts generated by language model agents, real human interactions captured with custom tooling, deterministic graph traversals that guarantee coverage, atomic skill drills for primitive operations, and massive-scale screenshot question answering that grounds pixels to language. The result is a dataset where 10 percent of the data delivers 85 to 90 percent of the performance—a testament to curation and compositional design.
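A compositional mix like this is typically trained on by sampling sources in proportion to tuned weights. The sketch below shows that pattern; the source names follow the text, but the weights are invented for illustration and are not MolmoWebMix's actual ratios.

```python
# Illustrative weighted-mixture sampler for a multi-source training corpus.
# Weights here are made up; real mixtures are tuned empirically.
import random

MIX = {
    "synthetic_rollouts": 0.40,   # LM-agent task rollouts
    "human_demos": 0.20,          # human interactions from custom tooling
    "graph_traversals": 0.15,     # deterministic coverage-guaranteeing walks
    "skill_drills": 0.10,         # atomic primitive-operation exercises
    "screenshot_qa": 0.15,        # pixels-to-language grounding data
}


def sample_source(rng: random.Random) -> str:
    """Pick one data source in proportion to its mixture weight."""
    names, weights = zip(*MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
batch = [sample_source(rng) for _ in range(1000)]
```

Changing the weights (or dropping a source entirely) is how ablations like the "10 percent of the data" finding get run.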
At test time, the models scale beautifully. The 8 billion parameter agent hits 78.2 percent pass at 1 on WebVoyager with a 100-step budget, beating comparable open models by 5 points and surpassing closed set-of-mark GPT-4o agents by 13. But the real power emerges with parallel inference. Sampling 4 rollouts and selecting the best with a language model judge rockets accuracy past 94 percent on WebVoyager and over 60 percent on the harder Online-Mind2Web benchmark—gains of more than 20 points from stochastic search alone.
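The best-of-n procedure described above is simple to state in code: draw several stochastic rollouts, score each with a judge, and keep the top one. In this sketch `rollout` and `judge` are toy stand-ins for the agent and the language-model judge, not MolmoWeb's implementation.

```python
# Minimal best-of-n selection: sample n trajectories, keep the judge's favorite.
# Both rollout() and judge() are toy stubs standing in for real models.
import random


def rollout(task: str, rng: random.Random) -> dict:
    """Stand-in for one stochastic agent trajectory on the task."""
    return {"task": task, "answer": f"candidate-{rng.randint(0, 9)}"}


def judge(task: str, candidate: dict) -> float:
    """Stand-in for an LM judge scoring a trajectory in [0, 1]."""
    # Toy deterministic score derived from the candidate's trailing digit.
    return int(candidate["answer"].rsplit("-", 1)[-1]) / 10


def best_of_n(task: str, n: int = 4, seed: int = 0) -> dict:
    """Sample n rollouts and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [rollout(task, rng) for _ in range(n)]
    return max(candidates, key=lambda c: judge(task, c))


best = best_of_n("book a flight", n=4)
```

Because each rollout is an independent sample, this search trades extra inference compute for accuracy, which is exactly the 20-plus-point gain the parallel-inference results report.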
Two insights stand out from the ablations. First, synthetic data from automated agents beats human annotations in downstream accuracy, despite humans offering richer exploration behaviors—quality and filtering matter more than realism. Second, the vision-only regime doesn't just match HTML-assisted models; it surpasses them on multiple benchmarks, proving that compositional visual grounding can replace structural crutches entirely.
MolmoWeb proves that transparency and performance are not at odds. By opening every layer of the stack, it hands the research community a replicable foundation for web automation and a template for what open science in this domain should look like. Visit EmergentMind.com to explore the models, dive into the data, and create your own research videos.