Progressive Tile-Based Streaming

Updated 23 March 2026

Progressive tile-based streaming partitions media into independently fetchable tiles and progressively refines them based on the predicted user field of view (FoV).
It uses stochastic optimization and reinforcement-learning (RL) driven rate allocation to balance quality, bandwidth constraints, and computational resources.
Applied to 360° video, point cloud video, and DNN inference, it enables robust, adaptive scheduling and fine-grained quality-of-experience (QoE) control.

Progressive tile-based streaming is a class of bandwidth- and computation-efficient delivery architectures in which large multimedia or computational workloads (most prominently 360-degree video, volumetric point cloud video, and deep neural network inference) are partitioned into independently fetchable and upgradable "tiles." At runtime, tiles are prioritized, scheduled, and progressively refined based on the predicted user field of view (FoV), bandwidth forecasts, and application-level utility models. This progressive approach enables fine-grained quality-of-service control, robustness to prediction errors, and avoidance of visual or informational discontinuities, and the studies cited below report gains over monolithic or non-adaptive schemes.

1. Foundations and System Model

The central motif of progressive tile-based streaming is workload partitioning: spatially for media (e.g., frames in equirectangular projection (ERP) partitioned into $N \times M$ tiles (Ghosh et al., 2018); octree-rooted spatial tiles for point clouds (Zong et al., 2023)), or by tensor/block decomposition for DNN inference accelerators (Qin et al., 2025). Each tile $i$ can be transmitted, encoded, or computed at a rate $r_i$ chosen from a discrete set $\mathcal{R} = \{R_0, R_1, \ldots, R_m\}$, where $R_0$ is a minimal base layer.
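As a minimal sketch of the spatial partitioning step, the following assumes a uniform grid over an ERP frame. The function name `tile_grid` and the 3840 by 1920 frame size are illustrative choices, not taken from the cited systems, which may use non-uniform or content-dependent tilings.

```python
from typing import List, Tuple

def tile_grid(width: int, height: int, cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, w, h) rectangles for a uniform cols x rows tiling, row-major.

    Boundaries use integer division, so the tiles cover the frame exactly
    without gaps or overlap even when the size is not divisible by the grid.
    """
    tiles = []
    for r in range(rows):
        y0 = r * height // rows
        y1 = (r + 1) * height // rows
        for c in range(cols):
            x0 = c * width // cols
            x1 = (c + 1) * width // cols
            tiles.append((x0, y0, x1 - x0, y1 - y0))
    return tiles

# Hypothetical 8 x 8 tiling of a 3840 x 1920 ERP frame.
tiles = tile_grid(3840, 1920, cols=8, rows=8)
print(len(tiles), tiles[0])
```

Each rectangle would then be encoded as its own sub-bitstream at every rate in the ladder.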

In typical immersive media applications, the user's FoV is both highly variable and only partially predictable, with associated tile-level view probabilities $p_i = \Pr\{i \text{ in FoV}\}$ estimated from short-term forecasts using head-/gaze-trace history and possibly deep learning-based predictors (Zhang et al., 2023, Pang, 2023). In DNN/accelerator domains, tiles correspond to tensor blocks with known computation requirements and can be scheduled per accelerator tile.

The temporal domain is divided into segments (chunks of $K$ frames or time units), and the optimization is re-solved adaptively over a receding-horizon window, typically $W = 2$ to $5$ chunks (Ghosh et al., 2018, Ghosh et al., 2017, Zong et al., 2023).

2. Stochastic and Robust Optimization Formulations

The quality-of-experience (QoE) or performance objective is formulated as a weighted sum of tile- or block-level utility terms:

$\mathrm{E}[QoE]=\sum_{i=1}^{N} p_i Q(r_i)$

where $Q(\cdot)$ is strictly increasing and typically concave (e.g., linear for user-perceived quality, logarithmic for rate-distortion utility in point clouds (Zong et al., 2023)). The design must respect a global bandwidth or resource budget $\sum_i r_i \leq B$ per segment and possibly stronger stochastic constraints:

$P\left(\sum_{i=1}^N p_i Q(r_i) \geq R \right) \geq 1 - \epsilon$

to guarantee a minimum utility $R$ with probability at least $1-\epsilon$ in the face of FoV prediction uncertainty (Ghosh et al., 2018, Ghosh et al., 2017). Here $R$ is a utility threshold, distinct from the rate levels $R_0, \ldots, R_m$.
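To make the two objectives concrete, the sketch below evaluates the expected QoE and estimates a chance constraint by Monte Carlo sampling. It rests on two assumptions that are not from the cited papers: the utility is $Q(r) = \log(1 + r)$, and each tile falls in the FoV independently with probability $p_i$. The function names are hypothetical.

```python
import math
import random

def expected_qoe(p, rates, Q=math.log1p):
    """Expected QoE: sum_i p_i * Q(r_i)."""
    return sum(pi * Q(ri) for pi, ri in zip(p, rates))

def chance_estimate(p, rates, threshold, Q=math.log1p, trials=10_000, seed=0):
    """Monte Carlo estimate of P(realized utility >= threshold).

    Assumption: tile i is in the FoV independently with probability p_i.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Sum the utility of the tiles that are in view in this sample.
        realized = sum(Q(r) for pi, r in zip(p, rates) if rng.random() < pi)
        if realized >= threshold:
            hits += 1
    return hits / trials

p = [0.9, 0.5, 0.1]        # illustrative view probabilities
rates = [4.0, 2.0, 1.0]    # illustrative per-tile rates
print(expected_qoe(p, rates))
print(chance_estimate(p, rates, threshold=1.0))
```

An allocation would satisfy the stochastic constraint at level $\epsilon$ if the estimated probability is at least $1-\epsilon$ for the chosen threshold.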

Practical solutions exploit a convex relaxation of the discrete-rate assignment, which turns the allocation into a tractable convex problem. Each relaxed rate is then rounded down to the nearest discrete level, and the bandwidth "saved" by rounding is spent on greedy upgrades. The resulting "bounded optimality gap" is analytically characterized and diminishes with both tile count and concavity of $Q(\cdot)$ (Ghosh et al., 2018, Ghosh et al., 2017).
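The rounding and upgrade step can be sketched as follows. This is an illustrative reading of the procedure, not the cited authors' code: it assumes a relaxed (continuous) allocation is already available, uses $Q(r) = \log(1 + r)$ as a stand-in utility, and upgrades the tile with the largest probability-weighted utility gain per unit of rate. The name `round_and_upgrade` is hypothetical.

```python
import math
from bisect import bisect_right

def round_and_upgrade(p, relaxed, ladder, budget, Q=math.log1p):
    """Round relaxed rates down to the ladder, then greedily upgrade.

    ladder is sorted ascending and ladder[0] is the base layer R_0.
    """
    # Round each relaxed rate down to the nearest ladder level (never below base).
    idx = [max(bisect_right(ladder, r) - 1, 0) for r in relaxed]
    spent = sum(ladder[j] for j in idx)
    while True:
        best, best_gain = None, 0.0
        for i, j in enumerate(idx):
            if j + 1 >= len(ladder):
                continue  # tile already at the top level
            cost = ladder[j + 1] - ladder[j]
            if spent + cost > budget:
                continue  # upgrade does not fit in the remaining budget
            gain = p[i] * (Q(ladder[j + 1]) - Q(ladder[j])) / cost
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        spent += ladder[idx[best] + 1] - ladder[idx[best]]
        idx[best] += 1
    return [ladder[j] for j in idx]

# Illustrative numbers: three tiles, ladder {1, 2, 4, 8}, budget 10.
print(round_and_upgrade([0.9, 0.6, 0.1], [5.5, 3.4, 1.1], [1, 2, 4, 8], 10))
```

In this example the rounded allocation is [4, 2, 1], which leaves 3 units of budget. The greedy pass spends them on the second tile and then the third, because the first tile's next level would exceed the budget.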

In point cloud video, the convex allocation employs empirically calibrated heterogeneous tile rate–quality functions, solvable by KKT water-filling algorithms with $O((W K) \log(W K))$ complexity for window-length $W$ and $K$ tiles per frame (Zong et al., 2023).
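A minimal water-filling sketch follows, under simplifying assumptions: every tile has the same utility shape $w_i \log(1 + r_i)$ rather than the empirically calibrated heterogeneous functions used in the cited work, and the KKT multiplier is found by bisection instead of the sorted-threshold method that yields the $O((WK)\log(WK))$ bound. The stationarity condition $w_i / (1 + r_i) = \nu$ gives $r_i = w_i/\nu - 1$, clipped to the allowed range.

```python
def water_fill(weights, budget, r_min=0.0, r_max=float("inf"), iters=100):
    """Maximize sum_i w_i * log(1 + r_i) s.t. sum_i r_i <= budget, r_min <= r_i <= r_max."""
    n = len(weights)
    if n * r_min > budget:
        raise ValueError("budget cannot cover the minimum rate for every tile")
    w_max = max(weights)
    if w_max <= 0:
        return [r_min] * n          # no tile benefits from extra rate
    if n * r_max <= budget:
        return [r_max] * n          # budget is not binding

    def rate(w, nu):
        # KKT stationarity w / (1 + r) = nu, clipped to the box constraints.
        return min(max(w / nu - 1.0, r_min), r_max)

    # Total rate is non-increasing in nu; at nu = hi every tile sits at r_min.
    lo, hi = 1e-12, w_max / (1.0 + r_min)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(rate(w, mid) for w in weights) > budget:
            lo = mid
        else:
            hi = mid
    return [rate(w, hi) for w in weights]

print(water_fill([3.0, 2.0, 1.0], budget=3.0))
```

With weights 3, 2, 1 and a budget of 3, the multiplier settles at $\nu = 1$, so the rates are approximately 2, 1, and 0.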

3. Progressive Mechanisms and Receding-Horizon Adaptation

All leading frameworks implement progression in both time and refinement level:

Temporal progression: Delivery is organized as a sliding or receding-horizon over $W$ future segments. After each segment's download, FoV probabilities $p_i$ are re-estimated; the next optimization window is shifted to maintain adaptation (Ghosh et al., 2018, Ghosh et al., 2017).
Multiround refinement: Particularly in 6-DoF/volumetric video, a tile's base layers are fetched early (low-confidence, long-horizon FoV prediction), while enhancement layers are opportunistically patched into frames as more precise short-term predictions become available. This structure "hedges" against prediction errors and bandwidth variations (Zong et al., 2023).
Batch/priority scheduling: In VATP360, tiles are coarsely bucketed into four priorities (viewport, object+adjacency, object-only, and background). An RL-driven agent allocates bandwidth across priorities, ensuring high-probability tiles are fetched first and enabling lightweight progression (at most $6^4=1296$ possible actions per segment) (Pang, 2023).
Viewport-aware selection: Systems such as MFTR (Zhang et al., 2023) employ a tile-classification model based on multi-modal transformer networks, directly predicting user-interested tiles for future segments. Viewport tiles are scheduled for immediate delivery at maximal rate; non-viewport tiles follow according to available budget and prediction scores.

In all designs, a minimal base layer ($R_0$) is always fetched for every tile to avoid "black spots" if predictions fail (Ghosh et al., 2018, Ghosh et al., 2017).
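The receding-horizon pattern can be sketched as a loop that plans over the next $W$ segments, commits only the first segment of each plan, and then re-plans. The predictor and allocator below are toy placeholders (hypothetical names `toy_predict` and `toy_allocate`), not any cited system's models. The allocator always assigns at least the base rate, mirroring the base-layer rule.

```python
def receding_horizon(num_segments, window, predict, allocate):
    """Plan over a sliding window and commit only the first segment each step.

    predict(k, horizon) -> per-segment probability vectors for segments k..k+horizon-1
    allocate(probs)     -> per-segment rate vectors for the same segments
    """
    committed = []
    for k in range(num_segments):
        horizon = min(window, num_segments - k)
        probs = predict(k, horizon)      # re-estimated after every segment
        plan = allocate(probs)
        committed.append(plan[0])        # later entries are re-planned next step
    return committed

def toy_predict(k, horizon, num_tiles=4):
    # Hypothetical predictor: tile (k + j) % num_tiles is the likely viewport.
    return [[0.7 if i == (k + j) % num_tiles else 0.1 for i in range(num_tiles)]
            for j in range(horizon)]

def toy_allocate(probs, base=1.0, high=4.0):
    # Every tile gets the base layer; likely-viewed tiles get the high rate.
    return [[high if p > 0.5 else base for p in seg] for seg in probs]

plan = receding_horizon(5, 3, toy_predict, toy_allocate)
print(plan)
```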

4. Viewport Prediction, Prioritization, and Rate Allocation

Viewport prediction is central to maximizing efficiency and visual quality:

Classical and statistical approaches: FoV probabilities are derived by crowd-sourcing or short-term head-motion model fitting (Ghosh et al., 2018, Ghosh et al., 2017).
Deep-learning predictors: MFTR fuses past head, gaze, and video frames with multi-modal encoder/transformer layers and classifies each tile as "user-interested" or not. This tile-level classification enables spatial error smoothing and superior overlap metrics versus trajectory regression methods (Zhang et al., 2023).
Object- and context-aware augmentation: VATP360 augments viewport-based tile assignment with object detection (YOLOv3) to guard against gross mis-predictions. Tiles containing salient moving objects near the predicted viewport receive elevated priority (Pang, 2023).
Priority buckets: Coarse-grained grouping based on these predictions drastically reduces the complexity of subsequent rate allocation.
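For the crowd-sourced case, a simple estimator of the view probabilities is the fraction of past viewers whose viewport covered each tile. The sketch below assumes each viewer's trace has already been reduced to a set of visible tile indices for the segment. The function name is hypothetical.

```python
def crowd_view_probabilities(viewed_sets, num_tiles):
    """Estimate p_i as the fraction of viewers whose viewport covered tile i."""
    if not viewed_sets:
        return [0.0] * num_tiles     # no traces: no evidence for any tile
    counts = [0] * num_tiles
    for tiles in viewed_sets:
        for i in tiles:
            counts[i] += 1
    return [c / len(viewed_sets) for c in counts]

# Three hypothetical viewers of one segment tiled into four tiles.
print(crowd_view_probabilities([{0, 1}, {1, 2}, {1}], num_tiles=4))
```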

Rate allocation is generally formulated as a convex or RL-driven optimization, subject to strict bandwidth constraints, with penalties for inter-segment quality variation and stall, yielding both smoother and more robust playback (Ghosh et al., 2018, Pang, 2023). In point-cloud video, the temporal confidence weight $w_i$ discounts allocation farther out in time, and the solution is an explicit water-filling over tile- and frame-rate thresholds (Zong et al., 2023).
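The effect of priority bucketing on the action space can be illustrated by enumerating every assignment of six quality levels to four buckets. The bucket rule, the linear score, and the exhaustive search below are assumptions made for illustration. In particular, the exhaustive search stands in for the RL agent described above and is not how VATP360 selects actions.

```python
from itertools import product

def bucket(in_viewport, has_object, adjacent):
    """Hypothetical rule: 0 viewport, 1 object+adjacency, 2 object-only, 3 background."""
    if in_viewport:
        return 0
    if has_object and adjacent:
        return 1
    if has_object:
        return 2
    return 3

# One quality level (0..5) per bucket: 6**4 = 1296 candidate actions.
ACTIONS = list(product(range(6), repeat=4))

def best_action(counts, bitrates, weights, budget):
    """Exhaustively pick the feasible action with the highest weighted score.

    counts[b]   : number of tiles in bucket b
    bitrates[l] : per-tile rate at quality level l
    weights[b]  : illustrative importance of bucket b
    """
    best, best_score = None, float("-inf")
    for action in ACTIONS:
        cost = sum(counts[b] * bitrates[action[b]] for b in range(4))
        if cost > budget:
            continue
        score = sum(weights[b] * counts[b] * action[b] for b in range(4))
        if score > best_score:
            best, best_score = action, score
    return best

print(len(ACTIONS))
print(best_action([4, 2, 2, 8], [1, 2, 3, 4, 5, 6], [8, 4, 2, 1], budget=40))
```

Because the action is chosen per bucket rather than per tile, the search space stays at 1296 entries regardless of how many tiles the frame contains.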

5. System Pipeline, DASH Integration, and Practical Guidelines

Complete end-to-end systems integrate the following stages:

Encoding and tiling: Media sources (ERP-projected 360° frames, octree-based point clouds, multimodal DNN layers) are partitioned into tiles, each encoded into a codec-agnostic sub-bitstream with an explicit bitrate ladder (Ozcinar et al., 2017, Zong et al., 2023, Qin et al., 2025).
DASH/MPD packaging: The MPEG-DASH Spatial Relationship Description (SRD) is extended to encode spatial relationships, tile geometry, and available representations for each tile. Each segment, per tile, is hosted as a distinct HTTP-accessible fragment (Ozcinar et al., 2017).
Client runtime: The VR (or DNN accelerator) client computes the predicted viewport or computation priority, optimizes the tile selection and rate assignment, and issues parallel GET requests for high-utility tiles. Playback or inference commences as soon as a minimal set of tiles is available; background bandwidth is used for progressive upgrades of remaining tiles.
Sliding buffer and prefetching: Buffers maintain a short (typically 4–6 s) window of tiles or computation blocks; the progressive window advances with each completed segment (Ozcinar et al., 2017, Zong et al., 2023).
Penalty tuning: Inter-segment bitrate variation ($\eta$) and stall ($\lambda$) weights are selected empirically to match user-study preferences or to trade off smoothness against rebuffering.
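One common way to combine these terms is a session score that rewards per-segment quality and subtracts weighted penalties for quality switches and stall time. The additive form below is an assumption chosen for illustration. The cited systems may define the penalties differently.

```python
def session_qoe(qualities, stalls, eta, lam):
    """Quality sum minus eta * inter-segment variation minus lam * total stall time.

    qualities[k] : quality of segment k
    stalls[k]    : stall time (seconds) incurred before segment k
    """
    quality = sum(qualities)
    variation = sum(abs(b - a) for a, b in zip(qualities, qualities[1:]))
    return quality - eta * variation - lam * sum(stalls)

# Illustrative session: one quality drop and one half-second stall.
print(session_qoe([3, 4, 2, 2], [0, 0, 0.5, 0], eta=0.5, lam=2.0))  # 8.5
```

Raising $\eta$ favors steadier quality at the cost of slower upgrades, while raising $\lambda$ pushes the client toward more conservative rates to avoid rebuffering.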

Key design guidelines are enumerated in several studies:

Always fetch a minimal base layer for all tiles.
Use crowd statistics or learned models to identify high-probability tile sets $A_\alpha$ for targeted upgrades.
Employ a sliding window of $W = 3$ to $5$ segments for runtime adaptation; windows that are too short are greedy, and windows that are too long are overly conservative.
Penalize inter-segment variation and stalls as per user tolerance.
Select robustness level $\alpha \in [0.9, 0.99]$ to balance worst-case and average-case performance (Ghosh et al., 2018, Ghosh et al., 2017).
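The guideline on high-probability tile sets can be sketched under one plausible reading, which is an assumption here rather than the cited definition: $A_\alpha$ is the smallest set of tiles, taken in decreasing order of $p_i$, whose combined probability mass reaches a fraction $\alpha$ of the total.

```python
def high_probability_set(p, alpha):
    """Smallest prefix of tiles (by decreasing p_i) covering a fraction alpha of total mass."""
    total = sum(p)
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        if mass >= alpha * total:
            break
        chosen.append(i)
        mass += p[i]
    return sorted(chosen)

# Illustrative probabilities for four tiles.
print(high_probability_set([0.5, 0.3, 0.15, 0.05], alpha=0.9))  # [0, 1, 2]
```

Tiles in the returned set would be the candidates for targeted upgrades, while all remaining tiles stay at the base layer.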

6. Extensions: 3D Volumetric, DNN Accelerators, and Resource Constraints

Volumetric and 6-DoF point cloud streaming: Progressive patching over a sliding window leverages octree-based coding, tile-level concave utility modeling, and closed-form KKT allocation to achieve $>2\times$ gains in angular resolution and $>50\%$ bandwidth reduction under real variable network/6-DoF traces compared to non-progressive baselines (Zong et al., 2023).

DNN accelerators: StreamDCIM implements a tile-based streaming digital compute-in-memory (CIM) accelerator for multimodal transformers via three key features: (1) a reconfigurable CIM macro with mode-dependent tile mapping, (2) a mixed-stationary, cross-forwarding dataflow to exploit parallel tile computation, and (3) a ping-pong pipeline to overlap compute and on-chip rewrite. This approach achieves a $2.63\times$ geomean speedup and a $>2\times$ energy-efficiency gain over non-streaming baselines (Qin et al., 2025).
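The benefit of overlapping compute with rewrite can be seen in a toy timing model. The model assumes identical tiles, a fixed compute time and rewrite time per tile, and two buffers, so that the rewrite of the next tile fully overlaps the compute of the current one. It is a back-of-the-envelope illustration and does not reproduce StreamDCIM's measured results.

```python
def sequential_time(n_tiles, compute, rewrite):
    """No overlap: every tile is rewritten, then computed."""
    return n_tiles * (compute + rewrite)

def ping_pong_time(n_tiles, compute, rewrite):
    """Two buffers: the rewrite of tile k+1 overlaps the compute of tile k."""
    if n_tiles == 0:
        return 0
    # First rewrite is exposed, the last compute is exposed, and each of the
    # n-1 overlapped steps takes as long as the slower of the two operations.
    return rewrite + (n_tiles - 1) * max(compute, rewrite) + compute

print(sequential_time(4, compute=3, rewrite=2))  # 20
print(ping_pong_time(4, compute=3, rewrite=2))   # 14
```

In this model the pipelined schedule approaches `n_tiles * max(compute, rewrite)` for long tile sequences, so the gain is largest when compute and rewrite times are balanced.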

Task duration-squeezing constraints: In proactive VR streaming, a new class of constraints ensures that computing and communication (CC) tasks per segment do not "squeeze" into subsequent segments, preventing pipeline overload and unbounded motion-to-photon latency (Wei et al., 2021). The resulting optimization exhibits three regimes (minimum-resource-limited, unconditional resource-tradeoff, and conditional resource-tradeoff), dictated by the ratio of per-segment CC budget to segment playback duration. Closed-form KKT solutions explicitly characterize the segmentwise allocation of rendering and transmission time.

7. Experimental Evaluations and Performance Gains

Comprehensive emulation and real-trace simulation studies demonstrate the efficacy of progressive tile-based streaming:

QoE improvements: 20%–50% gain over baselines in both expected and worst-case metrics (Ghosh et al., 2018, Ghosh et al., 2017, Pang, 2023).
Bandwidth utilization: $8 \times 8$ tiling achieves ≈82% rate utilization, balancing coding overhead and granularity (Pang, 2023).
Point cloud: Up to $3\times$ improvement in angular resolution and 20%–40% higher utility under bandwidth/FoV prediction errors (Zong et al., 2023).
Inference accelerators: $2.6\times$ to $3\times$ performance/energy gain through tile-based streaming (Qin et al., 2025).
Robustness: Gradient-based upgrades and multiround progressive patching hedge against FoV forecast errors, maintaining baseline-quality for all tiles while opportunistically enhancing probable tiles (Zong et al., 2023, Ghosh et al., 2018).

Overall, the cited studies indicate that careful orchestration of tile prioritization, progressive scheduling, and robust rate adaptation yields provably near-optimal or, under certain utility models, globally optimal performance within tight computational budgets.