From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (2510.14979v1)

Published 16 Oct 2025 in cs.CV and cs.AI

Abstract: The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

Summary

  • The paper introduces NEO, a unified model that integrates vision and language from the earliest stages using native primitives.
  • It leverages multi-head native attention and flexible position encoding to achieve efficient pixel-word alignment and robust multimodal reasoning.
  • Empirical results demonstrate competitive performance with modular VLMs while using less data and resources for training.

Native Vision-Language Primitives at Scale: The NEO Architecture

Introduction and Motivation

The paper "From Pixels to Words -- Towards Native Vision-Language Primitives at Scale" (2510.14979) addresses the limitations of modular Vision-LLMs (VLMs) and proposes a unified, native approach for large-scale multimodal learning. Modular VLMs, which combine separately pre-trained vision encoders and LLMs via adapters or projectors, have achieved strong results but suffer from architectural heterogeneity, complex alignment procedures, and suboptimal scaling. The authors introduce NEO, a family of native VLMs that eschew modularity in favor of a monolithic, intrinsically multimodal architecture, aiming to unify vision and language processing from the earliest stages of the model. Figure 1

Figure 1: Overview of the native vision-language framework, projecting arbitrary-resolution images into a continuous latent space for efficient early-fusion vision-language encoding and alignment.

Architectural Innovations

Unified Native Primitives

At the core of NEO is the concept of a "native primitive"—a Transformer-based building block that integrates both vision and language modalities within a single, unified attention mechanism. This primitive is designed to:

  • Support flexible position encoding for dynamic spatial structures.
  • Employ multi-head native attention (MHNA) that jointly models visual and textual dependencies.
  • Utilize Native Rotary Position Embeddings (Native-RoPE) with modality-specific frequency and channel allocation, ensuring compatibility with pre-trained LLM weights while capturing spatial and temporal relationships in visual data (a minimal index-assignment sketch follows Figure 2).

    Figure 2: The native primitive integrates bi-directional attention for images, causal attention for text, and modality-specific rotary position embeddings, enhancing pixel–word correspondence.
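
To make the decoupled indexing concrete, the following minimal Python sketch shows how per-token (T, H, W) indices and per-axis rotary angles could be assigned. The base frequencies match those reported for Native-RoPE (β_T = 1e6, β_H = β_W = 1e4); the indexing of text tokens and the treatment of each image as a single temporal step are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only: the text-token index scheme and the
# "one image = one temporal step" convention are assumptions here.
import torch

def build_thw_indices(segments):
    """segments: list of ("text", n_tokens) or ("image", (h, w)) entries."""
    T, H, W = [], [], []
    t = 0
    for kind, spec in segments:
        if kind == "text":
            for _ in range(spec):
                T.append(t); H.append(0); W.append(0)  # assumed text indexing
                t += 1
        else:  # one image occupies a single temporal step (meta-unit)
            h, w = spec
            for i in range(h):
                for j in range(w):
                    T.append(t); H.append(i); W.append(j)
            t += 1
    return torch.tensor(T), torch.tensor(H), torch.tensor(W)

def rotary_angles(index, dim, base):
    """Standard RoPE angles for one axis, using that axis's own base frequency."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(index.float(), inv_freq)  # (seq_len, dim // 2)

T, H, W = build_thw_indices([("text", 4), ("image", (2, 3)), ("text", 2)])
ang_T = rotary_angles(T, dim=32, base=1e6)  # temporal axis
ang_H = rotary_angles(H, dim=16, base=1e4)  # height axis
ang_W = rotary_angles(W, dim=16, base=1e4)  # width axis
```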

Pre-Buffer and Post-LLM Decomposition

To facilitate efficient training and robust pixel-word alignment, the architecture is initially partitioned into a pre-Buffer (vision-language encoder) and a post-LLM (reasoning module), both constructed from native primitives. During pre-training, the pre-Buffer learns visual representations under the guidance of a frozen LLM, preserving linguistic knowledge and mitigating catastrophic forgetting. As training progresses, the architecture is unified into a monolithic backbone, enabling end-to-end optimization and capacity reallocation between encoding, alignment, and reasoning.

Figure 3: The NEO architecture with patch/word embedding, pre-Buffer, and post-LLM components, all built from native primitives for efficient pixel–word alignment and reasoning.
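
To make the decomposition concrete, the schematic sketch below shows how patch/word embeddings, a pre-Buffer stack, and a post-LLM stack could compose into one backbone built from a shared primitive block. All module choices, layer counts, and dimensions are illustrative stand-ins; the paper's primitives use RMSNorm, SwiGLU, modality-aware attention, and Native-RoPE, which the standard PyTorch modules here merely approximate.

```python
# Schematic sketch; not the released NEO configuration.
import torch
import torch.nn as nn

class NativePrimitive(nn.Module):
    """Stand-in for one native block (attention + feed-forward)."""
    def __init__(self, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # modality-aware mask omitted here
        return x + self.mlp(self.norm2(x))

class NEOSketch(nn.Module):
    def __init__(self, dim=1024, vocab=32000, buffer_layers=4, llm_layers=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=32, stride=32)  # PEL stand-in
        self.word_embed = nn.Embedding(vocab, dim)                       # WEL stand-in
        self.pre_buffer = nn.ModuleList([NativePrimitive(dim) for _ in range(buffer_layers)])
        self.post_llm = nn.ModuleList([NativePrimitive(dim) for _ in range(llm_layers)])

    def forward(self, images, text_ids):
        v = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        t = self.word_embed(text_ids)                            # (B, N_txt, dim)
        x = torch.cat([t, v], dim=1)  # simple text-then-image ordering assumed
        for blk in list(self.pre_buffer) + list(self.post_llm):
            x = blk(x)
        return x
```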

Modality-Aware Attention and Position Encoding

NEO's attention mechanism is modality-aware: text tokens use standard causal attention, while image tokens employ full bidirectional attention, allowing rich intra-image context modeling. The Native-RoPE scheme decouples temporal (T), height (H), and width (W) indices and frequencies, assigning distinct base frequencies and channel allocations to each. This design avoids the frequency mismatches and spatial/temporal entanglement issues observed in prior 1D/3D-RoPE approaches, leading to improved spatial reasoning and generalization to arbitrary resolutions and aspect ratios.
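
A minimal sketch of such a modality-aware mask is shown below. The assumption that image tokens otherwise respect the causal order, attending bidirectionally only within their own image block, is made for illustration and may not match the paper's exact masking rules or its FlexAttention kernels.

```python
# Illustrative modality-aware mask: text attends causally; tokens within
# the same image block attend bidirectionally. Cross-block behaviour is an
# assumption made here for the sketch.
import torch

def mixed_attention_mask(modality_ids):
    """modality_ids[i] = -1 for a text token, or k >= 0 for image block k.
    Returns a boolean matrix where True means "query i may attend to key j"."""
    ids = torch.tensor(modality_ids)
    n = ids.numel()
    causal = torch.tril(torch.ones(n, n)).bool()                        # default: causal
    same_image = (ids[:, None] == ids[None, :]) & (ids[:, None] >= 0)   # same image block
    return causal | same_image  # bidirectional inside an image, causal elsewhere

# Example: 3 text tokens, one 4-token image (block 0), then 2 text tokens.
mask = mixed_attention_mask([-1, -1, -1, 0, 0, 0, 0, -1, -1])
print(mask.int())
```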

Training Paradigm

The training pipeline consists of three stages (a sketch of the corresponding parameter-freezing schedule follows Figure 4):

  1. Pre-Training: The pre-Buffer and new QK heads are trained on 390M image-text pairs, with the LLM weights frozen. This stage focuses on learning visual concepts and establishing initial pixel-word alignment.
  2. Mid-Training: The full model is progressively unfrozen and trained end-to-end on a mixture of captioning, conversation, detection, and OCR data, further consolidating multimodal alignment and reasoning.
  3. Supervised Fine-Tuning: The model is fine-tuned on high-quality, task-specific instruction datasets to enhance instruction following, dialogue, and real-world applicability.

    Figure 4: The NEO training recipe, showing pre-training with frozen LLM, followed by mid-training and supervised fine-tuning with end-to-end optimization.
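
As referenced above, the following minimal sketch illustrates the staged freezing schedule. The attribute and helper names (pre_buffer, new_qk_parameters) are hypothetical placeholders, and the optimizer hyperparameters are illustrative; the paper reports AdamW with a warm-up ratio of 0.01 and cosine decay.

```python
# Minimal sketch of the three-stage schedule. `pre_buffer` and
# `new_qk_parameters()` are hypothetical names used for illustration;
# learning rate and weight decay are placeholders, not the paper's values.
import torch

def configure_stage(model, stage):
    if stage == "pre-training":
        # Stage 1: train the pre-Buffer and the new QK heads, freeze the LLM.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.pre_buffer.parameters():
            p.requires_grad = True
        for p in model.new_qk_parameters():  # hypothetical helper
            p.requires_grad = True
    else:
        # Stages 2-3 ("mid-training", "sft"): unfreeze everything, end-to-end.
        for p in model.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.1)
```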

Empirical Results and Analysis

NEO demonstrates strong performance on a wide range of vision-language benchmarks, including chart/diagram/document understanding, visual reasoning, and OCR tasks. At both 2B and 8B parameter scales, NEO matches or surpasses prior native VLMs and approaches the performance of leading modular VLMs, despite using significantly less pre-training data and no reinforcement learning. Notably, NEO achieves these results with a unified, dense architecture and without reliance on external visual encoder supervision.

Key findings include:

  • Competitive performance with modular VLMs (e.g., Qwen2-VL, InternVL2.5) on general benchmarks, despite a much smaller data budget.
  • Substantial gains over previous native VLMs (e.g., Mono-InternVL, HoVLE, OneCAT) on visual-centric tasks, highlighting the effectiveness of the native primitive and training strategy.
  • Ablation studies confirm the superiority of mixed attention and Native-RoPE over causal attention and prior RoPE variants, with at least 0.8% average gain across benchmarks.
  • The pre-Buffer, once trained, serves as a reusable component, reducing the cost of adapting new LLMs for multimodal tasks.

Figure 5: Comparison of pre-Buffer and vision encoders, showing the efficiency and transferability of the pre-Buffer for native VLM development.

Limitations and Future Directions

While NEO achieves strong results, it exhibits relative weaknesses on knowledge-intensive and OCR-heavy tasks, attributed to limitations in the training corpus and computational resources. The current approach still relies on initializing from a pre-trained LLM, and fully de novo multimodal training remains an open challenge. The authors suggest that scaling up data and model size, as well as open-sourcing intermediate components, will be critical for further progress.

The architecture is readily extensible to video, multimodal generation, and embodied AI, given its modality-agnostic primitives and flexible position encoding. The dense, unified design also facilitates deployment in resource-constrained environments and provides a strong baseline for future research in reinforcement learning and multimodal generation.

Conclusion

The NEO framework establishes a scalable, unified paradigm for native vision-language modeling, integrating vision and language from the earliest stages via modality-aware primitives and end-to-end training. The empirical results demonstrate that native VLMs can rival modular systems in both efficiency and performance, even under constrained resources. NEO's design principles—unified attention, flexible position encoding, and reusable pre-Buffer—offer a robust foundation for future advances in multimodal AI, including video understanding, generation, and real-world deployment.


Explain it Like I'm 14

From Pixels to Words: A simple guide to the NEO vision-language model

What is this paper about?

This paper introduces NEO, a new kind of computer model that can understand both pictures and text together. Instead of stitching a “vision part” and a “language part” together like most systems do, NEO is built as one single brain that learns vision and language at the same time. The goal is to make this “native” (all-in-one) design work well, use less training data, and be easier for others to build on.

What questions are the authors trying to answer?

The paper focuses on two big questions, rewritten in simple terms:

  • What makes an all-in-one vision-language model (a “native VLM”) different from the usual plug-and-play kind (a “modular VLM”), and can a native model be just as good?
  • How can we design native models so they’re easier and cheaper for researchers to train and improve?

To do this, they say a good native model should:

  • Match pixels (from images) and words (from text) in the same “meaning space,” so they understand each other.
  • Keep the best parts of vision models (great at seeing) and LLMs (great at reading/reasoning) in one system.
  • Naturally handle cross-modal skills like encoding images and text, aligning them, and reasoning about them together.

How does NEO work? (Methods explained with simple analogies)

Think of NEO as a single brain that can look and read at the same time. Here’s how they built it:

  • Turning images and text into “tokens”:
    • An image is chopped into small patches (like cutting a photo into many LEGO tiles). Each tile becomes a token (a tiny code sketch of this appears after the list).
    • Text is split into word-like tokens using a standard language tokenizer.
    • Both go into the same model, so the brain sees them together.
  • Paying attention the right way:
    • Attention is how the model decides what to focus on.
    • For text, the model uses “causal” attention (it only looks backward in the word sequence, like writing a sentence one word at a time).
    • For images, the model uses “bidirectional” attention (all image tiles can see each other), so it can understand the whole picture at once.
    • This mix lets NEO read like a language model while seeing like a vision model.
  • Knowing where things are (Native-RoPE):
    • The model needs to know positions: where a word is in a sentence (time/sequence) and where a patch is in an image (height/width).
    • They use a smart positioning system, Native-RoPE, that treats time (T), height (H), and width (W) as separate “axes,” each with its own settings. Think of it like giving every token a set of coordinates: when in the sequence it appears (T) and where on the image grid it is (H, W).
    • This avoids mixing up sentence position with picture position and helps the model learn fine details (like small text in an image).
  • A two-part training trick that becomes one:
    • Pre-Buffer: a “front porch” that prepares both image and text tokens into a shared representation. It learns to connect pixels and words early on.
    • Post-LLM: the main “language brain” that brings strong reading and reasoning skills.
    • At first, these are trained in a guided way (with the language part helping keep good reading skills). Later, they’re merged into one unified model. This keeps the language quality strong while teaching vision from scratch.
  • Training in three steps:
    • Pre-training: learn basic vision and language together from lots of image–caption pairs (about 390 million examples total across stages).
    • Mid-training: practice on more challenging data (like OCR, object detection, and dialogues) to improve alignment and detail.
    • Supervised fine-tuning: learn to follow complex instructions and answer questions well.
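
Here is the tiny sketch mentioned above of how an image becomes patch tokens (“LEGO tiles”). The 32×32 patch size follows the granularity mentioned for NEO; everything else is a simplified illustration rather than the actual pipeline.

```python
# A tiny illustration of the "LEGO tiles" idea: cut an image into patches
# and flatten each patch into one token vector. Simplified for clarity.
import numpy as np

def image_to_patch_tokens(image, patch=32):
    """image: (H, W, 3) array with H and W divisible by `patch`."""
    H, W, _ = image.shape
    tokens = []
    for top in range(0, H, patch):
        for left in range(0, W, patch):
            tile = image[top:top + patch, left:left + patch]  # one tile
            tokens.append(tile.reshape(-1))                   # flatten to a vector
    return np.stack(tokens)  # shape: (number_of_tiles, patch * patch * 3)

tokens = image_to_patch_tokens(np.zeros((64, 96, 3)))
print(tokens.shape)  # (6, 3072): a 2 x 3 grid of tiles, each 32*32*3 numbers
```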

What did they find, and why does it matter?

NEO performs surprisingly well for an all-in-one model:

  • It competes with top “modular” systems that use separate vision encoders and LLMs—even though those often use much more data and more complicated setups.
  • It beats many other native (all-in-one) models on standard tests that check how well a model understands images, answers questions, reasons visually, and avoids “hallucinations” (making things up about an image).
  • It does this with a streamlined design and a reasonable amount of data (about 390M image–text pairs), showing that native models can scale efficiently.

Where it still struggles:

  • On tasks that rely heavily on specific knowledge or reading tiny text (like some OCR and expert-knowledge tests), NEO sometimes lags behind the very best modular systems. The authors think more or better-targeted training data would help.

Why this matters:

  • NEO shows that we don’t always need separate vision and language modules. A single model can learn to see and read together effectively.
  • This can simplify how these systems are built, lower costs, and make research more accessible.
  • The components (like the Pre-Buffer and Native-RoPE) are reusable, so others can adopt them and improve faster.

What could this change in the future?

  • Easier building blocks: Researchers can use NEO’s “native primitives” (the core building pieces) to make their own models without juggling separate vision and language parts.
  • Better multimodal AI: The same ideas (mixed attention and clear positional signals) can help with video understanding, image/video generation, and editing—anything that blends sight and language.
  • More accessible progress: Because the approach is simpler and efficient, more teams (not just big companies) can build strong multimodal models.

In short, NEO is a big step toward AI that understands images and text in a more natural, unified way—like one brain that both sees and reads—while keeping the design simple, scalable, and practical.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues and concrete open questions that future work could address:

  • Language retention and interference: The paper does not evaluate pure language tasks (e.g., MMLU, GSM8K, HellaSwag) to quantify whether mixed attention and end-to-end multimodal training degrade or improve the LLM’s original linguistic capabilities and reasoning.
  • Portability to other LLM backbones: NEO is only demonstrated with Qwen3 (1.7B/8B). It is unclear how well the native primitives (expanded QK heads, Native-RoPE, mixed masking) transfer to different LLM families (e.g., LLaMA, Mistral, Gemma) without retraining pitfalls.
  • Video understanding claims without evidence: Native-RoPE encodes temporal indices but the paper provides no experiments on video benchmarks (e.g., MSR-VTT, Ego4D, VATEX, TGIF). The effectiveness of temporal indexing and mixed attention for video remains untested.
  • Multi-image alignment and correspondence: The method claims enhanced image–image correspondences via decoupled H/W indexing, yet there is no evaluation on multi-image relational reasoning (e.g., image pair matching, inter-image grounding, or cross-image VQA).
  • OCR performance deficit: NEO lags on OCR-heavy tasks (InfoVQA, TextVQA, DocVQA). The paper does not analyze whether this stems from RoPE frequency choices, patch size (32×32), masking strategy, data composition, or the lack of OCR-focused architectural heads; no targeted remedies are explored.
  • Data mixture and quality sensitivity: The fixed 3:7 text-to-multimodal ratio is not ablated. There is no analysis of how different ratios, filtering strategies, caption length distributions, or dataset sources (LAION vs. COYO vs. recaptioned OpenImages) impact alignment, language retention, and downstream performance.
  • Efficiency and scalability metrics: Claims of “any resolution” and improved throughput via FlexAttention are not backed by quantitative measurements (e.g., tokens per second, memory footprint vs. image resolution, latency, maximum supported sequence length, and throughput comparisons to modular encoders).
  • Patch size rigidity: The pipeline uses a fixed 32×32 patch granularity via two convolutional layers with strides 16 and 2. There is no study of variable patch sizes, multi-scale tokenization, or hierarchical schemes and how they trade off accuracy, memory, and speed.
  • Unclear pre-Buffer merging mechanics: The transition from pre-Buffer+post-LLM to a “monolithic” backbone is not specified (e.g., whether weights are merged, reinitialized, or tied). How capacity allocation evolves and whether residual specialization persists is unknown.
  • Pre-Buffer reusability beyond Qwen3: The paper positions the pre-Buffer as a reusable primitive but does not test reusing it across different LLM sizes/backbones, nor quantify the data/compute needed to adapt it.
  • Benchmark coverage gaps: Despite training on detection and grounding data, there is no evaluation on detection/grounding tasks (e.g., COCO detection, RefCOCO/RefCOCOg, Flickr30k Entities), leaving pixel–word alignment quality unquantified in region-phrase settings.
  • Fairness of comparisons: Many baselines use very different data scales and RL. The paper does not provide matched-budget or matched-data experiments to isolate architectural gains from training volume and recipe differences.
  • Native-RoPE hyperparameter sensitivity: Base frequencies (β_T=1e6, β_H=β_W=1e4) are chosen heuristically. There is no sensitivity analysis or theoretical justification for these values across tasks, aspect ratios, extremely long sequences, or extreme resolutions.
  • Attention design side effects: Mixed masking (bidirectional for images, causal for text) may introduce optimization or generation artifacts (e.g., exposure bias for image-as-meta-unit modeling). The paper does not analyze failure modes or alternatives (e.g., block-causal schemes).
  • Special token handling: The impact of <img> and </img> boundary tokens on alignment and attention is not ablated; their role in long-context scenarios and multi-image inputs is unclear.
  • Language-only performance during/after pretraining freeze: Pretraining freezes LLM weights; mid-training unfreezes. There is no measurement of catastrophic forgetting or recovery of language proficiency across stages.
  • Scaling laws and capacity anomalies: NEO-9B does not consistently outperform NEO-2.2B on some OCR/Doc tasks; the paper does not explore scaling laws, capacity allocation, or data bottlenecks causing non-monotonic scaling.
  • Contamination and recaptioning risks: Using InternVL2-8B to recaption OpenImages may introduce label leakage or style bias. The paper does not evaluate contamination or “teacher bias” effects on downstream benchmarks.
  • Multilingual generalization: Training includes English and Chinese, but cross-lingual VL capabilities are not systematically evaluated (e.g., multilingual VQA, cross-lingual OCR, multilingual document tasks).
  • Extension to additional modalities: The pre-Buffer is “modality-shared,” but there are no experiments on audio, depth, or other modalities to validate generality and the Native-RoPE design when adding new index dimensions.
  • Parameter expansion trade-offs: Expanding Q/K head dimensions (~10% params) lacks ablations on how much to expand, where (which layers), and the resulting trade-offs among accuracy, memory, and speed.
  • Maximum resolution and long-document limits: “Any resolution” is claimed, but practical limits (e.g., 4K+ documents, multi-page PDFs) are not reported; robustness to extreme aspect ratios, dense text pages, and tiled inputs remains unexplored.
  • FlexAttention portability and reproducibility: CUDA kernel modifications are mentioned but not detailed; the ease of reproducing these kernels across hardware/software stacks and their impact on correctness/performance are unaddressed.
  • Safety and responsible use: Beyond brief remarks, there is no systematic evaluation of toxicity, bias, jailbreak robustness, or hallucination beyond POPE/HallB, nor alignment techniques (e.g., RLHF/DPO) adapted to the native architecture.
  • Release completeness: Reproducibility is limited by missing details (batch sizes, gradient accumulation, exact data filtering, image normalization, augmentation policies, tokenizer specifics). The promised appendix and artifacts are not present in the paper.
  • RL and preference alignment with native VLMs: The paper avoids RL; it is unknown whether RLHF/DPO can be applied effectively to native early-fusion models without harming pixel–word alignment or language retention.

Practical Applications

Immediate Applications

These applications can be deployed with the current NEO models and primitives, leveraging their unified vision-language architecture, Native-RoPE, mixed attention, and reusable pre-Buffer to reduce integration complexity and alignment overhead.

Industry

  • Finance: Automated document and form understanding
    • Use cases: Invoice and receipt parsing, expense reconciliation, claims processing, KYC document triage.
    • Tools/workflows: Plug NEO-2.2B or NEO-9B into existing Document AI pipelines for VQA on forms, charts, and tables (ChartQA, DocVQA, InfoVQA); use pre-Buffer for domain fine-tuning with limited compute.
    • Assumptions/dependencies: High-quality scanned inputs; domain-specific fine-tuning improves OCR-heavy fields; GPU inference; licensing for model weights and training data.
  • Retail/e-commerce: Product listing and catalog intelligence
    • Use cases: Extract attributes from product photos/packaging; compare size charts; flag mismatches between text and imagery; visual Q&A for customer support.
    • Tools/workflows: End-to-end ingestion of arbitrary-res images via patch embeddings; mixed attention for spatial reasoning; integrate POPE/HallucinationBench checks for hallucination reduction.
    • Assumptions/dependencies: Label availability for attribute supervision; variability in packaging languages (bilingual tasks supported but may benefit from further tuning).
  • Manufacturing: Visual instruction comprehension and audit
    • Use cases: Interpret assembly diagrams and step-by-step guides (AI2D); verify visual compliance against textual SOPs.
    • Tools/workflows: On-station assistant that performs diagram Q&A; use Native-RoPE to maintain spatial locality; early-fusion reasoning across text and image tokens.
    • Assumptions/dependencies: On-prem GPUs or edge devices with sufficient memory; controls for incorrect predictions in safety-critical settings.
  • Media and content moderation: Multimodal policy enforcement
    • Use cases: Detect inconsistencies between text and image; assess claims with visual evidence; reduce hallucinations in multimodal content.
    • Tools/workflows: Integrate NEO into moderation queues; use VLMEvalKit-like evaluation to track hallucination metrics (POPE, HallB).
    • Assumptions/dependencies: Policy definitions for acceptable content; human-in-the-loop for escalation.

Academia

  • Research acceleration in native VLMs
    • Use cases: Benchmarking unified primitives vs. modular pipelines; ablation studies on attention and RoPE designs; reproducible comparisons.
    • Tools/workflows: Use the pre-Buffer as a reusable asset to bootstrap new native models; flexibly test Native-RoPE frequency allocations; adopt FlexAttention kernels for variable-length inputs.
    • Assumptions/dependencies: Access to training data (e.g., LAION, COYO) and compute; adherence to dataset usage policies.
  • Instructional and diagram-based tutoring
    • Use cases: Visual explanation of STEM content (AI2D), chart interpretation for data literacy courses.
    • Tools/workflows: Deploy classroom assistants that answer questions about diagrams and charts; use SFT data for pedagogical dialog styles.
    • Assumptions/dependencies: Guardrails against errors in high-stakes learning; alignment with curricula.

Policy and Public Sector

  • Digital government document processing
    • Use cases: Automate triage and extraction from forms, permits, and IDs; multimodal FAQs with scanned attachments.
    • Tools/workflows: Early-fusion NEO reduces integration overhead versus separate VE+LLM stacks; workflow includes scanning, tokenization, VQA, and human validation.
    • Assumptions/dependencies: Privacy and security policies; data governance for personally identifiable information; procurement for GPU resources.
  • Compliance auditing with multimodal evidence
    • Use cases: Check visual compliance statements (e.g., safety signage, equipment labels) against policy text.
    • Tools/workflows: Deploy NEO for image-text consistency checks; log confidence and rationales; prioritize cases for human review.
    • Assumptions/dependencies: Well-specified policies and acceptable error thresholds.

Daily Life

  • Personal assistants for paperwork and household tasks
    • Use cases: Read receipts, extract totals; interpret appliance diagrams; answer questions about visual instructions.
    • Tools/workflows: Mobile or desktop app using NEO for on-demand VQA; optional cloud inference with privacy controls.
    • Assumptions/dependencies: Latency and memory constraints; potential need for quantization to run locally.
  • Accessibility aids
    • Use cases: Describe charts and diagrams; read labels or signage; bilingual support for English/Chinese.
    • Tools/workflows: Integration with screen readers; camera-based visual Q&A.
    • Assumptions/dependencies: Reliable OCR in diverse lighting; safety cautions for navigational decisions.

Long-Term Applications

These require further research, scaling, domain adaptation, or optimization (e.g., larger datasets, RL, video data, edge deployment, bias and safety mitigations).

Industry

  • Healthcare: Multimodal clinical assistants
    • Use cases: Radiology report grounding; chart interpretation in EHRs; instrument manuals comprehension.
    • Tools/workflows: Domain-specific pre-Buffer fine-tuning with medical datasets; guarded reasoning; audit trails.
    • Assumptions/dependencies: Access to high-quality, compliant medical data; regulatory approvals; robustness in OCR-heavy clinical artifacts.
  • Robotics and autonomous systems: Unified perception-action reasoning
    • Use cases: Understand visual scenes and follow natural-language commands; align spatial cues with instructions.
    • Tools/workflows: On-device native VLMs with mixed attention; local Native-RoPE for spatial sensitivity; integration with control stacks.
    • Assumptions/dependencies: Efficient model compression/quantization; real-time constraints; safety testing in dynamic environments.
  • Energy and infrastructure: Visual inspection and reporting
    • Use cases: Analyze images of equipment with textual manuals; generate inspection summaries; detect mismatches between field imagery and documentation.
    • Tools/workflows: Field apps using NEO; pipelines for image-text grounding and report generation.
    • Assumptions/dependencies: Domain data for fine-tuning; scalability across asset types and conditions.
  • Creative industries: Multimodal content generation and editing
    • Use cases: Image-conditioned storytelling, captioning, and editing; instruction-based multimedia creation.
    • Tools/workflows: Extend Native-RoPE and mixed attention to generative setups (e.g., NOVA-like pipelines); integrate editing tools with native primitives for fine-grained control.
    • Assumptions/dependencies: Training on large, curated multimodal generation datasets; alignment and safety filters to avoid misuse.

Academia

  • Video understanding and temporal reasoning
    • Use cases: Long-horizon video QA; frame-wise temporal grounding; teaching assistants for lab demonstrations.
    • Tools/workflows: Expand Native-RoPE with temporal indices and appropriate base frequencies; mixed attention masks for frames; large-scale video corpora.
    • Assumptions/dependencies: Access to video datasets; optimization for long sequences; storage and bandwidth.
  • Scaling laws and unified model theory for multimodality
    • Use cases: Investigate capacity allocation across encoding/alignment/reasoning; formalize principles for Native-RoPE frequency selection.
    • Tools/workflows: Controlled scaling experiments; pre-Buffer/post-LLM partition studies; community benchmarks.
    • Assumptions/dependencies: Compute resources; transparent reporting and shared datasets.

Policy and Public Sector

  • Standards for multimodal procurement and evaluation
    • Use cases: Establish benchmarks that include charts, forms, diagrams, and OCR-heavy tasks; mandate hallucination metrics (e.g., POPE, HallusionBench).
    • Tools/workflows: Public evaluation suites; certification programs for multimodal systems.
    • Assumptions/dependencies: Cross-agency consensus; funding for shared benchmarks; periodic audits.
  • Bias, privacy, and safety governance for native VLMs
    • Use cases: Frameworks to assess dataset biases; guidelines for responsible deployment in services handling sensitive documents.
    • Tools/workflows: Differential privacy and de-biasing pipelines; red-team testing for multimodal harms.
    • Assumptions/dependencies: Access to auditing expertise; policy enforcement mechanisms.

Daily Life

  • Edge-native multimodal assistants
    • Use cases: On-device assistants for document and diagram understanding without cloud connectivity.
    • Tools/workflows: Distillation from NEO to compact models; efficient Native-RoPE implementations; hardware-aware attention kernels.
    • Assumptions/dependencies: Progress in quantization/pruning; device memory and compute budgets; energy efficiency.
  • Lifelong learning and personal knowledge bases
    • Use cases: Build personal archives that index images and associated texts; ask questions grounded in one’s visual history.
    • Tools/workflows: Local embeddings via pre-Buffer; retrieval-augmented VQA; privacy-preserving storage.
    • Assumptions/dependencies: Secure data management; user consent and control; incremental fine-tuning methods.

Cross-cutting dependencies and assumptions

  • Model availability and licensing: Access to NEO weights and code; compliance with dataset licenses.
  • Compute and optimization: GPU/TPU resources for training and inference; need for optimization (FlexAttention, CUDA kernels, quantization) to meet latency/memory constraints.
  • Data quality and domain adaptation: Performance on OCR/knowledge-heavy tasks depends on domain-specific corpora; additional SFT or RL can close gaps.
  • Safety and accountability: Human-in-the-loop oversight for high-stakes decisions; mitigation of hallucinations and biases; transparent logging and evaluation.

Glossary

  • AdamW optimizer: An adaptive gradient-based optimization algorithm that decouples weight decay from the update step to improve training stability. "NEO is trained on sixteen 8-GPU (80G) nodes using the AdamW optimizer~\cite{Training:AdamW}."
  • Autoregressive modeling: A generative modeling approach where predictions depend on previously generated tokens in sequence. "we treat one single image as a unified meta-unit for autoregressive modeling."
  • Bidirectional attention: An attention mechanism allowing tokens to attend to all other tokens in both directions (past and future) within a sequence or grid. "In contrast, image tokens employ full bidirectional attention"
  • Causal attention: An attention mechanism that restricts each token to attend only to preceding tokens, preserving autoregressive generation. "Text tokens adhere to standard causal attention, attending only to preceding tokens to maintain autoregressive generation."
  • Cosine decay scheduler: A learning rate schedule that decreases the learning rate following a cosine curve over training. "with a warm-up ratio of 0.01 and a cosine decay scheduler across all stages."
  • Cross-attention mechanisms: Attention modules that connect two different sequences (e.g., visual and textual) by letting one attend to the other. "projection layers~\cite{VLM:LLaVA-NeXT,Datasets:Llava-OneVision} or cross-attention mechanisms~\cite{VLP:Flamingo,VLM:NVLM}."
  • CUDA kernel modifications: Low-level GPU code changes to optimize computations for custom operations. "as variable-length block-wise attention is fully optimized through CUDA kernel modifications."
  • Decoder-only architecture: A transformer architecture composed solely of decoder blocks, typically used for autoregressive generation. "a monolithic decoder-only architecture."
  • Divide-and-Conquer (DaC): A training or architectural strategy that partitions tasks or modules to reduce interference and improve specialization. "leverage Mixture-of-Experts (MoE) or Divide-and-Conquer (DaC) strategies to suppress vision-language interference"
  • Early-fusion integration: A design that merges modalities at the input or early layers to jointly process them. "Native VLMs embrace early-fusion integration rather than grafting VEs onto LLMs."
  • End-to-end training: Jointly optimizing all components of a model within a single training pipeline. "The entire model is optimized end-to-end."
  • FlexAttention: An attention implementation optimized for flexibility and efficiency on variable-length and block-wise patterns. "We use FlexAttention~\cite{VLM:FlexAttn} to minimize memory overhead and increase throughput"
  • Gaussian Error Linear Unit (GELU): A smooth non-linear activation function used in transformer architectures. "two Convolutional layers (Conv1–2)~\cite{CNN:Alexnet} and a Gaussian Error Linear Unit (GELU)~\cite{TransF:GELU}."
  • Inductive biases: Built-in assumptions or constraints in model architecture or training that guide learning patterns. "Yet, modular designs still contend with strong inductive biases in pre-trained visual semantics"
  • Mixture-of-Experts (MoE): An architecture that routes inputs to multiple specialized expert networks to improve capacity and efficiency. "leverage Mixture-of-Experts (MoE) or Divide-and-Conquer (DaC) strategies to suppress vision-language interference"
  • Modality-specific decomposition: Designing or training separate components specialized for each modality to reduce interference. "modality-specific decomposition~\cite{VLM:EVEv2,VLM:Mono-InternVL,VLM:Mono-InternVL-1.5,VLM:OneCAT}."
  • Monolithic backbone: A unified model architecture that jointly handles all modalities without separate encoders or adapters. "during mid-training and supervised fine-tuning, the components are upgraded to a monolithic backbone"
  • Multi-Head Native Attention (MHNA): A customized multi-head attention mechanism tailored for unified vision-language processing. "(ii) a Multi-Head Native Attention (MHNA) that jointly processes visual–textual connectivity;"
  • Native Multi-Modal Attention: An attention design within the proposed primitive that mixes masking rules across modalities. "we propose Native Multi-Modal Attention with mixed masking"
  • Native Rotary Position Embeddings (Native-RoPE): A modality-aware extension of RoPE that decouples temporal and spatial indexing and frequency allocation. "Native Rotary Position Embeddings (Native-RoPE) with modality-specific frequencies, preserving compatibility with pretrained LLM's weights"
  • Patch Embedding Layer (PEL): A module that converts image patches into token embeddings for transformer input. "we convert it into token sequences via a lightweight Patch Embedding Layer (PEL)"
  • Pixel unshuffle: An operation that rearranges pixels to reduce spatial resolution while increasing channel depth, used here to fold tokens. "Conv2 performs token folding like pixel unshuffle~\cite{VLM:InternVL-2.5}"
  • Positional Encoding (Sinusoidal Positional Encoding): A deterministic method to inject token position information into embeddings using sinusoidal functions. "and $\boldsymbol{\mathrm{PE}}$ is 2D Sinusoidal Positional Encoding~\cite{TransF:ViT}."
  • Pre-Buffer: A shared early module that maps vision and language inputs into a unified representation before the main LLM reasoning layers. "the pre-Buffer and post-LLM components, each stacked with multiple native primitives, facilitate efficient and precise pixel–word alignment and reasoning."
  • Projector: A lightweight adapter that maps visual features into the LLM’s embedding space. "a Projector~\cite{VLP:Flamingo,VLM:LLaVA-1.5,VLM:NVLM}"
  • QK head dimensions: The dimensionality of Query and Key heads in multi-head attention, which can be expanded to model additional relations. "we expand Query (Q) and Key (K) head dimensions while fully decoupling H, W, and T relations"
  • QK normalization: Normalization applied to Query and Key channels to stabilize attention computation. "with their respective QK normalization~\cite{TransF:Qwen3}."
  • RMSNorm: A normalization method that scales inputs by their root-mean-square, used as an alternative to LayerNorm. "It adopts RMSNorm~\cite{TransF:RMSnorm} and SwiGLU~\cite{TransF:SwiGLU} consistent with the original LLM layers."
  • Rotary Position Embeddings (RoPE): A positional encoding technique that rotates query/key vectors to encode relative positions. "The base RoPE frequencies $\beta_{T}$, $\beta_{H}$, and $\beta_{W}$ are set to $1 \times 10^{6}$, $1 \times 10^{4}$, and $1 \times 10^{4}$, respectively."
  • Scaling laws: Empirical relationships that describe how model performance scales with data, parameters, or compute. "and scaling laws needed to harmonize their components."
  • Supervised fine-tuning (SFT): A training phase using labeled instruction data to refine model behavior and instruction following. "During the SFT stage, NEO’s ability to follow complex linguistic instructions and varied dialogue patterns is further enhanced"
  • SwiGLU: A gated linear unit activation variant that improves transformer feed-forward performance. "It adopts RMSNorm~\cite{TransF:RMSnorm} and SwiGLU~\cite{TransF:SwiGLU} consistent with the original LLM layers."
  • Token folding: A technique that aggregates or restructures tokens to alter their spatial/channel arrangement. "Conv2 performs token folding like pixel unshuffle~\cite{VLM:InternVL-2.5}"
  • Variable-length block-wise attention: An attention pattern that operates on blocks of tokens with variable lengths, optimized for efficiency. "as variable-length block-wise attention is fully optimized through CUDA kernel modifications."
  • Visual Encoder (VE): A model component that extracts visual features from images before alignment or text processing. "a pre-trained Visual Encoder (VE)~\cite{VLP:CLIP,VLM:InternVL-1.5,TransF:EVA,VLP:SigLIP-2}"
  • Vision-Language Models (VLMs): Models that jointly process visual and textual inputs for multimodal understanding and generation. "Recently, Vision-Language Models (VLMs)~\cite{VLM:Qwen2.5-VL,...} have emerged as a major breakthrough"
  • Warm-up ratio: The fraction of initial training steps where the learning rate is gradually increased from zero to its maximum. "with a warm-up ratio of 0.01 and a cosine decay scheduler across all stages."
  • Word Embedding Layer (WEL): A module that converts tokens from a text tokenizer into embeddings for transformer input. "we encode it into word tokens using the original LLM Tokenizer as Word Embedding Layer (WEL)"
  • Zero-initialized weights: Initializing certain parameters to zero to control training dynamics and avoid disrupting pre-trained behavior. "the linear weights of K for H and W channels are zero-initialized"