AwaRes: Spatially Aware Resolution

Updated 4 July 2026

AwaRes is a spatial-on-demand framework that processes a low-resolution global view and selectively retrieves high-resolution crops for detailed visual question answering.
It employs a two-turn interaction protocol with automatic supervision to decide when and where to perform high-resolution evidence retrieval, optimizing token usage.
Empirical results demonstrate near full-resolution accuracy with around 36% token usage, showcasing improved efficiency and reduced latency over traditional methods.

AwaRes most precisely denotes a spatial-on-demand framework for vision–LLMs in which a model first reasons over a low-resolution global view of an image and then, only when necessary, retrieves a small set of high-resolution crops through a structured tool call before producing a final answer (Shabtay et al., 14 Mar 2026). The name has also appeared in nearby but distinct literatures: as a variant spelling of AWaRe, the Attention-boosted Waveform Reconstruction network for gravitational-wave signal reconstruction (Chatterjee et al., 2024), and as shorthand for acoustic word representations, a family of fixed-dimensional embeddings for variable-length spoken word segments (Lin et al., 2023). In current arXiv usage, however, the capitalized form AwaRes is explicitly introduced as “spatially Aware to Resolution,” a framework designed to resolve the accuracy–efficiency trade-off in high-resolution visual question answering by coupling a low-resolution first pass with selective high-resolution evidence retrieval (Shabtay et al., 14 Mar 2026).

1. Nomenclature and scope

AwaRes is introduced as a framework for efficient vision–language inference under high-resolution visual inputs, especially where decisive evidence is spatially sparse, such as charts, documents, and small text (Shabtay et al., 14 Mar 2026). Its central claim is that a model need not process the entire image at native resolution; instead, it can preserve a cheap global view and escalate only to question-relevant regions.

The term is not globally unique across machine learning. In gravitational-wave analysis, the official system name is AWaRe, the “Attention-boosted Waveform Reconstruction network,” and the relevant paper explicitly notes that “AwaRes” is a variant spelling rather than the official name (Chatterjee et al., 2024). In speech processing, “AwaRes” is also used descriptively for acoustic word representations, i.e., fixed-length vectors derived from variable-length speech segments (Lin et al., 2023). This multiplicity of usage makes disambiguation important. In the strict sense of a named method, AwaRes refers to the vision–language framework in “Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs” (Shabtay et al., 14 Mar 2026).

This suggests a useful editorial distinction: AwaRes (VLM) for the spatial-on-demand framework, versus AWaRe (GW) and AwaRes/AWEs (speech) for the unrelated waveform-reconstruction and acoustic-embedding literatures.

2. Core problem: accuracy versus efficiency in VLMs

The motivating problem is the standard resolution bottleneck in VLMs. High-resolution inputs preserve fine details but induce substantial computational cost because visual token counts, key–value cache memory, and FLOPs grow sharply with resolution, while low-resolution inputs are efficient but can miss small text and fine structures (Shabtay et al., 14 Mar 2026). The tension is especially acute on document and chart benchmarks, where the answer may depend on localized, high-frequency evidence.

AwaRes addresses this by exploiting spatial sparsity. Its operating assumption is that many tasks require global context plus only a limited set of high-resolution regions. Rather than pruning tokens after a full-resolution encode, or escalating to the entire high-resolution image, AwaRes performs a single low-resolution global pass and then retrieves only those high-resolution crops needed for the question (Shabtay et al., 14 Mar 2026).

The framework formalizes the first-turn decision as a coupled policy

$\pi_\theta(C \mid q, I_{\text{low}}), \qquad C \subseteq \mathcal{C},$

where $q$ is the question, $I_{\text{low}}$ is the downsampled image, and $C$ is a subset of a fixed crop library (Shabtay et al., 14 Mar 2026). The event $C=\varnothing$ means the model answers directly from the low-resolution view; $C\neq\varnothing$ means it both decides that escalation is necessary and localizes where to look. This coupling is fundamental in the paper’s formulation, since “when to crop” and “where to crop” are not separable decisions.

Prior work discussed in the same paper spans fixed-budget token pruning, resolution-selection modules, dynamic zooming, and adaptive escalation to the full high-resolution image. AwaRes differs by keeping the control logic inside the VLM through tool use and by restricting high-resolution processing to a deployment-friendly discrete crop set (Shabtay et al., 14 Mar 2026).

3. Architecture and interaction protocol

AwaRes uses a two-turn interaction protocol with key–value reuse (Shabtay et al., 14 Mar 2026). In the first turn, the model receives the question and a low-resolution image obtained by downsampling the original image by $2\times$ in both height and width, which corresponds to approximately $25\%$ of the visual tokens of the full-resolution image. It then emits either a direct answer or a structured tool call requesting a subset of high-resolution crops.

The tool interface is a string call of the form:

GET_CROPS: ['<crop id>', '<crop id>', ...]

The crop library contains 10 predefined regions: four quadrants, a center crop, four half-images, and the full image (Shabtay et al., 14 Mar 2026). The identifiers are:

Crop id	Region
'0'–'3'	top-left, top-right, bottom-left, bottom-right
'4'	center
'5'–'8'	top half, bottom half, left half, right half
'all'	full image
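As a rough illustration of the crop library above, the ten ids can be mapped to pixel boxes on the full-resolution image. This is a minimal sketch: the function name `crop_boxes` is hypothetical, and the center-crop geometry (half the width and height, centered) is an assumption, since the source lists the regions but not their exact coordinates.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def crop_boxes(width: int, height: int) -> Dict[str, Box]:
    """Pixel boxes for the ten crop ids on a width x height image."""
    hw, hh = width // 2, height // 2  # half width, half height
    qx, qy = width // 4, height // 4  # quarter offsets for the center crop
    return {
        "0": (0, 0, hw, hh),             # top-left quadrant
        "1": (hw, 0, width, hh),         # top-right quadrant
        "2": (0, hh, hw, height),        # bottom-left quadrant
        "3": (hw, hh, width, height),    # bottom-right quadrant
        # Assumption: the center crop is half-size per dimension and centered.
        "4": (qx, qy, qx + hw, qy + hh),
        "5": (0, 0, width, hh),          # top half
        "6": (0, hh, width, height),     # bottom half
        "7": (0, 0, hw, height),         # left half
        "8": (hw, 0, width, height),     # right half
        "all": (0, 0, width, height),    # full image
    }


boxes = crop_boxes(1024, 768)
print(boxes["5"], boxes["4"])
```

Keeping the library as a small lookup table is what makes the action space discrete: the model only has to emit ids, never coordinates.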

If the model requests crops, the tool returns the corresponding high-resolution images, which are appended to the dialogue while the entire first-turn key–value cache for the question and low-resolution image is retained (Shabtay et al., 14 Mar 2026). In the second turn, the model integrates the preserved global context with the retrieved high-resolution evidence and produces the final answer. The protocol permits at most one escalation round.

The paper emphasizes that this is not an external-controller design. The learned policy itself selects the crop subset and composes the final answer. The retained dialogue state includes the low-resolution image, the question, the first-turn assistant action, and the returned crops. AwaRes therefore treats “where to look” as an internal decision variable rather than a separate pre-processing stage (Shabtay et al., 14 Mar 2026).
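The control flow of the two-turn protocol can be sketched as follows. This is an illustrative outline only: `Dialogue`, `answer`, the policy signature, and the toy policy and tool are all hypothetical names, images are stood in for by strings, and key–value cache reuse is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union

Action = Union[str, List[str]]  # an answer string, or a list of crop ids


@dataclass
class Dialogue:
    question: str
    low_res_image: str                      # stand-in for the downsampled image
    first_action: Optional[Action] = None   # first-turn assistant action
    crops: List[str] = field(default_factory=list)  # returned high-res crops


def answer(policy: Callable[[Dialogue], Action],
           get_crops: Callable[[List[str]], List[str]],
           question: str, low_res_image: str) -> str:
    state = Dialogue(question, low_res_image)
    action = policy(state)               # turn 1: answer directly or request crops
    state.first_action = action
    if isinstance(action, str):
        return action                    # no escalation needed
    state.crops = get_crops(action)      # tool returns the requested crops
    final = policy(state)                # turn 2: same state plus the crops
    if not isinstance(final, str):
        raise ValueError("at most one escalation round is permitted")
    return final


def toy_policy(state: Dialogue) -> Action:
    # Hypothetical behavior: always escalate once, then answer.
    if not state.crops:
        return ["5", "4"]
    return f"answer using {len(state.crops)} crops"


def toy_tool(ids: List[str]) -> List[str]:
    return [f"hi-res crop {i}" for i in ids]


print(answer(toy_policy, toy_tool, "What is the y-axis label?", "low-res image"))
```

The same policy object is called in both turns, which mirrors the point that crop selection and answering are handled by one model rather than by an external controller.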

A representative case study is given on ChartQA: after failing to read axis labels at half resolution, the model issues GET_CROPS: ['5', '4'], receives the top-half and center high-resolution crops, and then answers correctly while processing far fewer tokens than a full-resolution pass (Shabtay et al., 14 Mar 2026).
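A tool call such as `GET_CROPS: ['5', '4']` has to be parsed and validated before crops can be returned. The sketch below shows one way to do that; the function name `parse_tool_call` and the choice to treat any non-matching text as a direct answer are assumptions, not details taken from the paper.

```python
import ast
from typing import List

VALID_IDS = {str(i) for i in range(9)} | {"all"}  # '0'..'8' plus 'all'
PREFIX = "GET_CROPS:"


def parse_tool_call(text: str) -> List[str]:
    """Return the requested crop ids, or [] if the text is not a tool call."""
    text = text.strip()
    if not text.startswith(PREFIX):
        return []  # assumption: anything else is treated as a direct answer
    payload = text[len(PREFIX):].strip()
    try:
        ids = ast.literal_eval(payload)  # safely parse the Python-style list
    except (ValueError, SyntaxError, TypeError) as exc:
        raise ValueError(f"malformed tool call: {text!r}") from exc
    if not isinstance(ids, list) or not ids:
        raise ValueError(f"expected a non-empty list of crop ids: {text!r}")
    for crop_id in ids:
        if not isinstance(crop_id, str) or crop_id not in VALID_IDS:
            raise ValueError(f"unknown crop id: {crop_id!r}")
    return ids


print(parse_tool_call("GET_CROPS: ['5', '4']"))
print(parse_tool_call("The value is 42."))
```

Strict validation of this kind is where a corrupted call would surface, which is relevant to the parsing-corruption ablation reported later in the article.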

4. Automatic supervision and training pipeline

AwaRes is trained from automatically constructed supervision in three stages (Shabtay et al., 14 Mar 2026). First, a judge determines whether cropping is needed. Using a base VLM $T$, the pipeline computes answers from the low-resolution image and the full-resolution image, then uses an LLM-as-a-Judge, LLaMA-3.3-70B, to compare both against the gold answer. If the low-resolution answer is judged correct or tied with the full-resolution answer, the sample is labeled LR; otherwise it is labeled HR.
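The labeling rule in this first stage reduces to a small decision function. The sketch below assumes the judge returns a binary correct/incorrect verdict for each answer and reads "tied" as "both verdicts are equal"; both are assumptions, and the name `label_sample` is hypothetical.

```python
def label_sample(low_res_correct: bool, full_res_correct: bool) -> str:
    """Label a sample LR or HR from the judge's two verdicts.

    LR: the low-resolution answer is correct, or it ties with the
        full-resolution answer (assumed here to mean equal verdicts).
    HR: only the full-resolution answer is correct.
    """
    if low_res_correct or low_res_correct == full_res_correct:
        return "LR"
    return "HR"


for low, full in [(True, True), (True, False), (False, False), (False, True)]:
    print(low, full, label_sample(low, full))
```

Under this reading, only the case where full resolution succeeds and low resolution fails is labeled HR, so crops are supervised only when extra resolution demonstrably helps.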

Second, an oracle grounding model, Qwen3-VL-235B-A22B, localizes the evidence for the answer on HR-labeled samples by producing a bounding box $b=(x_1,y_1,x_2,y_2)$ in original-image coordinates (Shabtay et al., 14 Mar 2026). That box is mapped to the discrete crop library via

[Equation: mapping from the bounding box $b$ to a subset of crop ids $C \subseteq \mathcal{C}$.]

This transforms continuous localization into a subset of deployment-friendly crop IDs.
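To make the idea of discretizing a bounding box concrete, here is one possible rule: pick the smallest library crop that fully contains the box, falling back to 'all'. This is a hypothetical stand-in, not the mapping used in the paper, which may select several crops; the center-crop geometry and all names are likewise assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

# Crop library in normalized coordinates; the center geometry is assumed.
CROPS: Dict[str, Box] = {
    "0": (0.0, 0.0, 0.5, 0.5), "1": (0.5, 0.0, 1.0, 0.5),
    "2": (0.0, 0.5, 0.5, 1.0), "3": (0.5, 0.5, 1.0, 1.0),
    "4": (0.25, 0.25, 0.75, 0.75),
    "5": (0.0, 0.0, 1.0, 0.5), "6": (0.0, 0.5, 1.0, 1.0),
    "7": (0.0, 0.0, 0.5, 1.0), "8": (0.5, 0.0, 1.0, 1.0),
    "all": (0.0, 0.0, 1.0, 1.0),
}


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


def contains(outer: Box, inner: Box) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def box_to_crops(b: Tuple[float, float, float, float],
                 width: int, height: int) -> List[str]:
    """Hypothetical rule: the smallest single crop containing the box."""
    nb = (b[0] / width, b[1] / height, b[2] / width, b[3] / height)
    candidates = [cid for cid, crop in CROPS.items() if contains(crop, nb)]
    # 'all' contains every in-bounds box, so candidates is non-empty for valid input.
    return [min(candidates, key=lambda cid: area(CROPS[cid]))]


print(box_to_crops((100, 50, 300, 200), 1000, 800))   # inside the top-left quadrant
print(box_to_crops((400, 100, 600, 300), 1000, 800))  # straddles the vertical midline
print(box_to_crops((100, 100, 900, 700), 1000, 800))  # spans most of the image
```

The third case illustrates the limitation discussed later: evidence that crosses crop boundaries pushes a containment-style rule toward larger crops or 'all'.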

Third, the pipeline constructs supervised transcripts. LR examples are single-turn trajectories from $(q, I_{\text{low}})$ directly to the answer. HR examples are two-turn tool-use trajectories in which the first turn requests the crop subset $C$, the tool returns the high-resolution crops, and the second turn outputs the answer (Shabtay et al., 14 Mar 2026).

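Training proceeds in two phases. The cold-start supervised fine-tuning phase teaches the interaction protocol using a weighted negative log-likelihood, which can be written in the generic form

$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} w_t \log \pi_\theta(y_t \mid y_{<t}, x),$

where $y_t$ are the target tokens of the transcript, $x$ is the dialogue context, and $w_t$ is a per-token weight. First-turn tool-call tokens are upweighted: $w_t$ takes a larger value on tool-call tokens than on all other tokens, with the specific values given in the paper (Shabtay et al., 14 Mar 2026). The base model is Qwen2.5-VL-7B-Instruct; the data pool uses 10k samples from each of ChartQA, DocVQA, TextVQA, LLaVA-Multi, and VisionThink-Smart, with 5k per dataset used for SFT. Optimization uses LoRA rank 8 and batch size 16, with the learning rate as reported in the paper.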
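The effect of upweighting tool-call tokens in a weighted negative log-likelihood can be seen in a few lines. In this sketch the weight values (`tool_weight=2.0`, `base_weight=1.0`), the unnormalized sum, and the function name are all illustrative assumptions rather than the paper's settings.

```python
import math
from typing import Sequence


def weighted_nll(token_probs: Sequence[float],
                 is_tool_call: Sequence[bool],
                 tool_weight: float = 2.0,   # hypothetical value
                 base_weight: float = 1.0) -> float:
    """Sum of -w_t * log p_t over target tokens, with larger w_t on tool-call tokens."""
    if len(token_probs) != len(is_tool_call):
        raise ValueError("token_probs and is_tool_call must have equal length")
    total = 0.0
    for prob, is_tool in zip(token_probs, is_tool_call):
        weight = tool_weight if is_tool else base_weight
        total += -weight * math.log(prob)
    return total


# Model probabilities for four target tokens; the first two form the tool call.
probs = [0.5, 0.5, 0.9, 0.9]
mask = [True, True, False, False]
print(weighted_nll(probs, mask))                   # tool-call tokens upweighted
print(weighted_nll(probs, mask, tool_weight=1.0))  # plain, unweighted NLL
```

With the larger weight, an error on a tool-call token costs more than the same error on an answer token, which pushes the model to get the call syntax right first.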

The second phase applies multi-turn GRPO. Starting from the SFT policy, the method samples a fixed-size group of trajectories per prompt and optimizes a reward that combines answer quality and crop cost (Shabtay et al., 14 Mar 2026):

[Equation: trajectory reward combining the answer reward with the tool cost.]

with asymmetric tool cost

[Equation: asymmetric tool cost, including a crop-area term.]

The three default hyperparameters of this reward and the GRPO learning rate are specified in the paper (Shabtay et al., 14 Mar 2026). Answer reward is measured through cosine similarity between sentence-transformer embeddings of the predicted answer and the reference answer. The GRPO stage retains LoRA rank 8.
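Two ingredients of this stage are easy to illustrate in isolation: a cosine-similarity answer reward and the group-relative advantage normalization that GRPO applies within each sampled group. The sketch below uses toy 2-D vectors in place of sentence-transformer embeddings; the subtraction of a tool cost, the cost values, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


reference = [1.0, 0.0]                               # toy reference embedding
predictions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy predicted embeddings
tool_costs = [0.2, 0.0, 0.1]                         # hypothetical per-trajectory costs

# Assumed combination: answer similarity minus tool cost.
rewards = [cosine(p, reference) - c for p, c in zip(predictions, tool_costs)]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group mean, a trajectory that answers correctly with a cheap crop request is favored over one that answers equally well at higher cost.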

5. Efficiency metrics and empirical results

The main efficiency statistic in AwaRes is the Retain Token Ratio (RTR),

$\mathrm{RTR} = \frac{N_{\text{used}}}{N_{\text{full}}},$

where $N_{\text{used}}$ counts all visual tokens consumed across turns and $N_{\text{full}}$ is the number of visual tokens in a single full-resolution pass (Shabtay et al., 14 Mar 2026).
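As a worked example of this ratio, the sketch below divides the visual tokens consumed across turns by the token count of one full-resolution pass. The token counts (1024 for the full image, 128 for a crop) and the function name are hypothetical; only the roughly 25% cost of the 2× downsampled first pass comes from the text.

```python
from typing import Sequence


def retain_token_ratio(visual_tokens_per_turn: Sequence[int],
                       full_res_tokens: int) -> float:
    """Visual tokens consumed across all turns, relative to one full-resolution pass."""
    if full_res_tokens <= 0:
        raise ValueError("full_res_tokens must be positive")
    return sum(visual_tokens_per_turn) / full_res_tokens


full_tokens = 1024             # hypothetical full-resolution token count
low_tokens = full_tokens // 4  # 2x downsampling per side keeps about 25%

print(retain_token_ratio([low_tokens], full_tokens))       # direct answer
print(retain_token_ratio([low_tokens, 128], full_tokens))  # plus a 128-token crop request
```

A direct answer therefore sits at the floor of about 0.25, and every crop request adds to the ratio, which is why the average RTR depends on how often and how much the policy escalates.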

On six benchmarks evaluated via lmms-eval—POPE, RealWorldQA, V*-Bench, ChartQA, DocVQA, and OCRBench—AwaRes achieves an average score of 80.30 versus 80.46 for the full-resolution baseline, while using 36% of the visual tokens on average (Shabtay et al., 14 Mar 2026). By dataset, the reported AwaRes versus full-resolution results are:

Benchmark	AwaRes	Full-resolution
ChartQA	80.64 / RTR 0.32	79.80 / RTR 1.00
DocVQA	94.43 / RTR 0.28	94.00 / RTR 1.00
OCRBench	81.30 / RTR 0.42	81.10 / RTR 1.00
POPE	85.73 / RTR 0.27	87.87 / RTR 1.00
RealWorldQA	68.50 / RTR 0.43	68.80 / RTR 1.00
V*-Bench	71.20 / RTR 0.42	71.20 / RTR 1.00
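The averages quoted above can be checked directly against the per-benchmark rows. The short script below recomputes them from the table values; the variable names are illustrative.

```python
# (AwaRes score, AwaRes RTR, full-resolution score) per benchmark, from the table.
rows = {
    "ChartQA": (80.64, 0.32, 79.80),
    "DocVQA": (94.43, 0.28, 94.00),
    "OCRBench": (81.30, 0.42, 81.10),
    "POPE": (85.73, 0.27, 87.87),
    "RealWorldQA": (68.50, 0.43, 68.80),
    "V*-Bench": (71.20, 0.42, 71.20),
}

n = len(rows)
avg_awares = sum(r[0] for r in rows.values()) / n
avg_rtr = sum(r[1] for r in rows.values()) / n
avg_full = sum(r[2] for r in rows.values()) / n

print(f"AwaRes average score: {avg_awares:.2f}")
print(f"AwaRes average RTR:   {avg_rtr:.2f}")
print(f"Full-res average:     {avg_full:.2f}")
```

The recomputed values match the reported 80.30, 0.36, and 80.46, so the summary figures are simple unweighted means over the six benchmarks.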

Against fixed-budget pruning baselines such as VisionZip, SparseVLM, and Holo-V, AwaRes is reported as consistently more accurate at similar or lower retained-token budgets (Shabtay et al., 14 Mar 2026). Against VisionThink, which performs adaptive escalation to the entire high-resolution image, AwaRes attains 80.30 average score at RTR 0.36, versus 79.23 at RTR 0.61 for VisionThink (Shabtay et al., 14 Mar 2026). The paper further notes that VisionThink can exceed full-resolution compute on some datasets, whereas AwaRes never processes the full image unless the model explicitly requests 'all'.

Latency measurements on an H100-80GB show 0.61 s average wall-clock for AwaRes versus 2.71 s for VisionThink, approximately 4.4× faster (Shabtay et al., 14 Mar 2026). The reported task-level comparisons include 0.56 s vs 4.32 s on ChartQA, 0.51 s vs 1.78 s on DocVQA, and 0.64 s vs 3.36 s on OCRBench. The explanation given is architectural rather than merely token-based: AwaRes uses short tool calls and key–value reuse, while VisionThink generates long reasoning traces (Shabtay et al., 14 Mar 2026).
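The speedup figures follow from simple ratios of the reported wall-clock times. The snippet below reproduces that arithmetic; the dictionary layout and names are illustrative.

```python
# (AwaRes seconds, VisionThink seconds), as reported above.
latency = {
    "average": (0.61, 2.71),
    "ChartQA": (0.56, 4.32),
    "DocVQA": (0.51, 1.78),
    "OCRBench": (0.64, 3.36),
}

speedups = {name: slow / fast for name, (fast, slow) in latency.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x faster")
```

The average ratio comes out at roughly 4.4×, while the per-task ratios vary considerably, from about 3.5× on DocVQA to about 7.7× on ChartQA.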

6. Ablations, behavior, and limitations

Several ablations clarify how AwaRes works. First, the automatic supervision pipeline is not highly sensitive to the choice of judge: replacing LLaMA-3.3-70B with DeepSeek-V3.2 yields 96.88% agreement on LR/HR labels (Shabtay et al., 14 Mar 2026). By contrast, ANLS-based labeling reduces average performance by 2.8 points, which the authors attribute to over-penalizing semantically correct paraphrases.

Second, weighted protocol learning matters. Upweighting first-turn tool-call tokens reduces parsing corruption from 10.17% to 0%, indicating that explicit supervision on the crop-call syntax is necessary for stable multi-turn behavior (Shabtay et al., 14 Mar 2026).

Third, crop-cost shaping changes policy behavior. Removing the crop-area term from the tool cost (by setting its coefficient to zero) increases RTR from 0.36 to 0.42, while removing all tool cost increases RTR to 0.51, with only minor accuracy change (Shabtay et al., 14 Mar 2026). This indicates that the explicit cost term is chiefly responsible for preventing over-cropping.

The paper also describes a policy shift from oracle to SFT to GRPO: SFT tends to over-call the tool and overuse the 'all' crop, while GRPO moves the policy toward selective use, increasing the LR (no-call) rate to 72.2% and reducing 'all' to 4.9%, close to oracle distributions (Shabtay et al., 14 Mar 2026).

The principal limitations are structural. The discrete crop library is coarse; evidence that is very small or crosses crop boundaries may force either multiple crops or a request for 'all' (Shabtay et al., 14 Mar 2026). Grounding noise in the supervision pipeline can propagate to the SFT targets, although GRPO can partially correct this by optimizing answer reward directly. The system also permits only one crop round, so it cannot perform iterative zooming or hierarchical exploration. The paper identifies future directions including continuous crop prediction, video extension, and multi-stage zooming policies (Shabtay et al., 14 Mar 2026).

7. Relation to other “AwaRes” usages

The ambiguity of the term matters because “AwaRes” spans unrelated technical domains. In speech processing, AwaRes can denote acoustic word representations, fixed-dimensional vectors used for same–different discrimination, query-by-example, and low-resource speech processing (Lin et al., 2023). Within that literature, the “Self-Supervised Acoustic Word Embedding Learning via Correspondence Transformer Encoder” paper introduces CTE, a teacher–student Transformer method for learning robust AWEs from unlabelled speech via word-level correspondences (Lin et al., 2023). Related multilingual work jointly trains acoustic word embeddings and acoustically grounded written word embeddings for zero-resource languages (Hu et al., 2020), while later work studies contrastive multilingual adaptation (Jacobs et al., 2021) and multilingual zero-resource hate-speech keyword spotting in Swahili radio (Jacobs, 2024).

In gravitational-wave data analysis, the nearby name AWaRe designates the Attention-boosted Waveform Reconstruction network, a sequence-to-sequence deep learning model that reconstructs gravitational-wave signals from whitened strain and produces per-sample uncertainty estimates (Chatterjee et al., 2024). That paper explicitly states that “AwaRes” is only a variant spelling and that the official name is AWaRe (Chatterjee et al., 2024).

These parallel usages share only a high-level naming resemblance. The VLM AwaRes concerns spatially selective high-resolution retrieval (Shabtay et al., 14 Mar 2026); speech AwaRes concerns fixed-dimensional representations of spoken words (Lin et al., 2023); and AWaRe concerns uncertainty-aware reconstruction of gravitational-wave waveforms (Chatterjee et al., 2024). A plausible implication is that future citations should preserve the exact capitalization and expansion of each acronym to avoid conflating distinct research threads.

8. Significance

AwaRes is significant because it recasts resolution control as a learned, query-conditional tool-use problem inside the VLM itself (Shabtay et al., 14 Mar 2026). Rather than pruning tokens after expensive encoding or escalating to full resolution when uncertainty arises, it treats localized high-resolution evidence retrieval as the minimal sufficient intervention. The reported outcome is near–full-resolution accuracy at an average RTR of 0.36, with materially lower latency than prior adaptive-resolution baselines (Shabtay et al., 14 Mar 2026).

Its methodological contribution is equally notable. The framework combines automatic LR/HR supervision, oracle-to-discrete localization, weighted multi-turn SFT, and crop-cost-aware GRPO into a single training recipe (Shabtay et al., 14 Mar 2026). This design makes “where to look” a first-class inference decision while maintaining a simple production interface based on a fixed crop library and at most two prefill passes.

In the broader literature, AwaRes exemplifies a recurring pattern across modern ML systems: selective computation conditioned on uncertainty or task structure. In VLMs this takes the form of spatial-on-demand high-resolution retrieval (Shabtay et al., 14 Mar 2026); in gravitational-wave analysis, uncertainty-aware sequence modeling appears in AWaRe (Chatterjee et al., 2024); and in speech, robust fixed-dimensional segment embeddings support efficient retrieval and discrimination in low-resource settings (Lin et al., 2023). The convergence is conceptual rather than technical, but it suggests that the AwaRes label has become associated with architectures that allocate representational or computational resources only where they matter most.