View Selection Strategy
- View Selection Strategy is a systematic method for selecting and optimizing a subset of views to maximize performance while managing cost and resource constraints.
- It integrates cost models, data mining, and combinatorial optimization to balance query speed, storage consumption, and maintenance overhead.
- The approach employs joint optimization of materialized views and indexes using greedy, evolutionary, and machine learning techniques to improve system efficiency.
A view selection strategy is a systematic methodology for selecting, optimizing, or recommending a subset of views or perspectives from a larger candidate set, with the goal of maximizing relevant performance objectives under specified resource, cost, or domain constraints. In data management, machine learning, and computer vision, view selection is central to tasks such as materialized view selection in data warehouses, sensor/camera pose selection in 3D reconstruction, semantic observation in robotics, and feature selection in multi-view datasets. Strategies span heuristic, algorithmic, and learning-based methods, often incorporating cost models, data mining, combinatorial optimization, and domain-specific constraints.
1. Cost-Driven View Selection and Interaction Modeling
A core principle in data warehousing and multi-query optimization is the use of cost models to formally evaluate and select views. The benefit of selecting an object (index or materialized view) is defined as the workload cost reduction achieved by its addition, normalized by storage consumption or maintenance overhead (0707.1306, Aouiche et al., 2017, 0707.1548).
Let $C$ denote the current configuration, $o$ a candidate view or index, $Q$ the query workload, and $S$ the total storage constraint. The benefit function is typically expressed as:
$$\mathrm{benefit}_C(o) = \frac{\mathrm{cost}(Q, C) - \mathrm{cost}(Q, C \cup \{o\})}{\mathrm{size}(o)}$$
with candidates admitted only while the total size of the configuration stays within $S$. Crucially, modern approaches explicitly model interactions between objects via matrices encoding which queries benefit from which views or indexes, and whether indexes are built on base tables or materialized views. This captures non-trivial dependencies: for example, simultaneously materializing a view and building an index on it may have non-additive benefits, a relationship the cited work quantifies with a term defined over the set of views on which a given index is built (0707.1548).
This cost-driven strategy is employed both for selection under space constraints and for negotiating trade-offs between query acceleration and maintenance overhead.
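The benefit computation described in this section can be sketched in a few lines. This is a minimal illustration, not the cost model of the cited papers: the query names, object names, costs, and sizes are hypothetical, and the cost model simply assumes each query is answered by the cheapest object available in the configuration.

```python
from typing import Dict, FrozenSet

# Hypothetical workload: cost of each query with no views or indexes.
BASE_COST: Dict[str, float] = {"q1": 100.0, "q2": 80.0, "q3": 60.0}

# Hypothetical cost of a query when a given object is available.
COST_WITH: Dict[str, Dict[str, float]] = {
    "v_sales_by_month": {"q1": 20.0, "q2": 30.0},
    "idx_customer": {"q3": 15.0},
}

# Hypothetical storage footprint of each candidate object.
SIZE: Dict[str, float] = {"v_sales_by_month": 40.0, "idx_customer": 5.0}


def workload_cost(config: FrozenSet[str]) -> float:
    """Sum over queries of the cheapest cost reachable with `config`."""
    total = 0.0
    for query, base in BASE_COST.items():
        best = base
        for obj in config:
            best = min(best, COST_WITH.get(obj, {}).get(query, base))
        total += best
    return total


def benefit(config: FrozenSet[str], candidate: str) -> float:
    """Workload cost reduction from adding `candidate`, per unit of storage."""
    reduction = workload_cost(config) - workload_cost(config | {candidate})
    return reduction / SIZE[candidate]


if __name__ == "__main__":
    empty: FrozenSet[str] = frozenset()
    for name in SIZE:
        print(name, benefit(empty, name))
```

With these toy numbers the small index has the higher benefit per unit of storage even though the view removes more total cost, which is the kind of trade-off the normalization is meant to expose.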
2. Data Mining and Clustering for Candidate Generation
To prune the vast combinatorial space of possible views in large workloads, several strategies use data mining techniques to reduce the candidate set:
- Clustering: Queries are represented as rows of a binary query-attribute matrix and clustered using algorithms such as Kerouac. Queries within a cluster tend to share attributes (e.g., group-by fields, predicates), suggesting that a single materialized view can efficiently serve all of them. Clustering enables merging similar queries to form consolidated candidate views, reducing redundancy (0707.1548, Aouiche et al., 2017, 0809.1963).
For example, if queries $q_i$ and $q_j$ share a set of attributes, their similarity can be defined from the overlap between their rows in the query-attribute matrix.
- Frequent Itemset Mining: For index selection, frequent itemset mining (e.g., the Close algorithm) is applied to the binary query-attribute matrix to identify attribute sets that co-occur frequently across queries. These sets are prime candidates for bitmap or composite indexes (Aouiche et al., 2017, 0707.1548).
Such data-driven filtering mechanisms substantially reduce the search space, allowing subsequent cost-based selection to be effective at scale.
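The two mining steps above can be illustrated on a tiny workload. This is a sketch under stated assumptions: the queries and attributes are hypothetical, Jaccard similarity is used as one plausible overlap measure (not necessarily the one used by Kerouac), and frequent attribute sets are found by brute-force enumeration in place of the Close algorithm.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

# Hypothetical workload: attributes referenced by each query.
QUERIES: Dict[str, Set[str]] = {
    "q1": {"year", "region", "product"},
    "q2": {"year", "region"},
    "q3": {"customer", "city"},
    "q4": {"year", "region", "customer"},
}

ATTRIBUTES: List[str] = sorted(set().union(*QUERIES.values()))

# Binary query-attribute matrix: one row per query, one column per attribute.
MATRIX: Dict[str, List[int]] = {
    q: [1 if a in attrs else 0 for a in ATTRIBUTES]
    for q, attrs in QUERIES.items()
}


def jaccard(row_a: List[int], row_b: List[int]) -> float:
    """Shared attributes divided by attributes used by either query."""
    both = sum(1 for x, y in zip(row_a, row_b) if x and y)
    either = sum(1 for x, y in zip(row_a, row_b) if x or y)
    return both / either if either else 0.0


def frequent_attribute_sets(min_support: float) -> Dict[FrozenSet[str], float]:
    """Attribute sets appearing in at least `min_support` of the queries.

    Brute force over all subsets; only suitable for small examples.
    """
    n_queries = len(MATRIX)
    found: Dict[FrozenSet[str], float] = {}
    for k in range(1, len(ATTRIBUTES) + 1):
        for idxs in combinations(range(len(ATTRIBUTES)), k):
            count = sum(
                1 for row in MATRIX.values() if all(row[i] for i in idxs)
            )
            support = count / n_queries
            if support >= min_support:
                found[frozenset(ATTRIBUTES[i] for i in idxs)] = support
    return found


if __name__ == "__main__":
    print(jaccard(MATRIX["q1"], MATRIX["q2"]))  # q1 and q2 overlap heavily
    print(jaccard(MATRIX["q1"], MATRIX["q3"]))  # q1 and q3 share nothing
    print(frequent_attribute_sets(0.75))
```

Here q1, q2, and q4 would naturally fall into one cluster, and the frequently co-occurring pair of `year` and `region` is the sort of attribute set that would be proposed as an index candidate.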
3. Greedy, Evolutionary, and Hybrid Search Algorithms
Exhaustive enumeration is computationally infeasible for realistic workloads; hence, algorithmic strategies employ greedy heuristics, evolutionary algorithms, or learning-based optimization:
- Greedy Algorithms: Iteratively select the candidate with the highest current benefit (e.g., marginal cost reduction per unit storage) until no positive gain remains or resource limits are met (0707.1306, 0707.1548, Aouiche et al., 2017). Pseudocode typically reflects an outer loop over candidates, benefit recalculation at each step, and configuration update subject to constraints.
- Genetic Algorithms (GA): Chromosome-based encodings represent view configurations, with fitness functions aggregating execution time, maintenance cost, and resource usage. Adaptive mutation rates and selection mechanisms such as lexicase selection are employed to explore the solution space and avoid premature convergence (Manavi, 2024, Imani et al., 2023). The GA is seeded with high-performing initial configurations, and crossover is localized to avoid disrupting beneficial view subsets.
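Example fitness function (a weighted sum is one common way to aggregate these terms): $\mathrm{fitness}(x) = w_1\, T_{\mathrm{exec}}(x) + w_2\, C_{\mathrm{maint}}(x) + w_3\, R(x)$, where $x$ is a candidate configuration, $T_{\mathrm{exec}}$ is execution time, $C_{\mathrm{maint}}$ is maintenance cost, $R$ is resource usage, and the weights $w_k$ reflect system priorities.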
- Machine Learning-Based and Hybrid Methods: Reinforcement learning or deep models (e.g., RLView) are trained using historical workload data, adjusting view/index selection strategies based on observed benefit and cost, and employing iterative, feedback-driven “flip” phases followed by plan evaluation (Zinchenko et al., 2024).
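The greedy loop described in this section can be sketched as follows. It is a minimal illustration under assumptions, not the algorithm of any cited paper: the candidate names, costs, sizes, and the storage budget are hypothetical, and the toy cost model answers each query with the cheapest object in the configuration.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

# Hypothetical workload and candidates.
BASE_COST: Dict[str, float] = {"q1": 100.0, "q2": 80.0, "q3": 60.0}
COST_WITH: Dict[str, Dict[str, float]] = {
    "v1": {"q1": 20.0, "q2": 30.0},
    "v2": {"q1": 25.0},
    "i1": {"q3": 15.0},
}
SIZE: Dict[str, float] = {"v1": 40.0, "v2": 10.0, "i1": 5.0}


def workload_cost(config: FrozenSet[str]) -> float:
    """Sum over queries of the cheapest cost reachable with `config`."""
    total = 0.0
    for query, base in BASE_COST.items():
        best = base
        for obj in config:
            best = min(best, COST_WITH.get(obj, {}).get(query, base))
        total += best
    return total


def greedy_select(
    candidates: Iterable[str],
    size: Dict[str, float],
    cost: Callable[[FrozenSet[str]], float],
    budget: float,
) -> Tuple[FrozenSet[str], float]:
    """Repeatedly add the candidate with the best cost reduction per unit
    of storage, stopping when nothing fits or nothing helps."""
    config: FrozenSet[str] = frozenset()
    used = 0.0
    remaining = set(candidates)
    while remaining:
        current = cost(config)
        best, best_ratio = None, 0.0
        for obj in sorted(remaining):  # sorted for deterministic tie-breaking
            if used + size[obj] > budget:
                continue  # would exceed the storage constraint
            ratio = (current - cost(config | {obj})) / size[obj]
            if ratio > best_ratio:
                best, best_ratio = obj, ratio
        if best is None:
            break  # no candidate fits with positive benefit
        config = config | {best}
        used += size[best]
        remaining.discard(best)
    return config, used


if __name__ == "__main__":
    chosen, used = greedy_select(SIZE, SIZE, workload_cost, budget=50.0)
    print(sorted(chosen), used, workload_cost(chosen))
```

On this toy input the greedy pass picks `i1` and then `v2`, after which `v1` no longer fits. The pair `v1` and `i1` would also fit the budget and gives a lower workload cost, which shows why greedy selection is a heuristic and why evolutionary or learning-based search is sometimes layered on top.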
4. Joint View-Index Selection and Space Sharing
Historically, index and view selection were performed independently, which risks suboptimal and redundant configurations. Modern strategies advocate coupled selection, explicitly modeling interactions:
- The configuration space includes both materialized views and indexes, with the selection algorithm (often greedy or hybrid) considering their combined effect on query speed, storage, and maintenance (0707.1306, 0707.1548).
- Interaction modeling via request–view, request–index, and view–index matrices enables the system to avoid duplicated storage, overlapping acceleration effects, and unnecessary maintenance.
- Experimental results show that, under ample storage, joint optimization yields larger execution-time reductions than independent selection. When storage is limited, the selection may favor smaller indexes, but coordinated strategies consistently yield superior space-performance trade-offs.
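A small sketch can make the interaction matrices concrete. Everything here is assumed for illustration: the three matrices are stored as dictionaries of sets, the object names are hypothetical, and the cost factors are arbitrary values chosen so that the arithmetic is easy to follow. The point is only that an index built on a view contributes nothing unless that view is also materialized.

```python
from typing import Dict, FrozenSet, Set

# Query-view matrix: views that can answer each query.
QUERY_VIEW: Dict[str, Set[str]] = {"q1": {"v1"}, "q2": {"v1"}, "q3": set()}
# Query-index matrix: indexes that can speed up each query.
QUERY_INDEX: Dict[str, Set[str]] = {"q1": {"i_v1"}, "q2": set(), "q3": {"i_base"}}
# View-index matrix: indexes built on a view (i_base is on a base table).
VIEW_INDEX: Dict[str, Set[str]] = {"v1": {"i_v1"}}

BASE_COST: Dict[str, float] = {"q1": 100.0, "q2": 80.0, "q3": 60.0}
VIEW_FACTOR = 0.25   # hypothetical: cost fraction when answered from a view
INDEX_FACTOR = 0.5   # hypothetical: further fraction when an index is usable


def index_is_usable(index: str, config: FrozenSet[str]) -> bool:
    """An index on a view needs that view in the configuration."""
    for view, indexes in VIEW_INDEX.items():
        if index in indexes:
            return view in config
    return True  # index on a base table


def query_cost(query: str, config: FrozenSet[str]) -> float:
    cost = BASE_COST[query]
    if QUERY_VIEW[query] & config:
        cost *= VIEW_FACTOR
    if any(i in config and index_is_usable(i, config) for i in QUERY_INDEX[query]):
        cost *= INDEX_FACTOR
    return cost


def workload_cost(config: FrozenSet[str]) -> float:
    return sum(query_cost(q, config) for q in BASE_COST)


if __name__ == "__main__":
    for cfg in (set(), {"v1"}, {"i_v1"}, {"v1", "i_v1"}):
        print(sorted(cfg), workload_cost(frozenset(cfg)))
```

In this toy setup the index alone leaves the workload cost unchanged, while adding it on top of the view reduces the cost further than the view alone, so the two gains cannot simply be added. A joint selector sees this dependency, whereas independent view and index selectors would not.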
5. Evaluation, Validation, and Performance
Empirical validation is central to substantiating the efficacy of a view selection strategy:
- Performance Metrics: Typical metrics include query execution time, maintenance cost, storage usage, overall cost (including composite cost functions), mean absolute/squared errors (in multi-view learning tasks), and coverage metrics (e.g., percentage of queries or scene regions covered).
- Experimental Scenarios: Approaches are evaluated on realistic data warehouse schemas (e.g., fact/dimension tables, standard benchmarks such as TPC-H), production DBMS platforms, and, for feature/view selection in multi-view learning, on domain datasets (e.g., genomics, medical, or synthetic).
- Findings: Joint and mining-based strategies demonstrate reductions in query processing time (e.g., up to 69% improvement at moderate storage, and 30% for index-only cases), effective space savings, and consistent results across allocation constraints (0707.1548). Evolutionary algorithms further reduce total operational costs and converge to efficient solutions faster than baseline algorithms (Manavi, 2024).
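Two of the metrics listed above, relative cost reduction and query coverage, are simple enough to state as code. The function names and the sample workload below are hypothetical and only show how such figures might be computed from measured or estimated costs.

```python
from typing import Dict, FrozenSet, Set


def percent_reduction(baseline: float, tuned: float) -> float:
    """Relative improvement of the tuned configuration over the baseline."""
    return 100.0 * (baseline - tuned) / baseline


def query_coverage(query_objects: Dict[str, Set[str]], config: FrozenSet[str]) -> float:
    """Percentage of queries that can use at least one selected object."""
    covered = sum(1 for objs in query_objects.values() if objs & config)
    return 100.0 * covered / len(query_objects)


if __name__ == "__main__":
    # Hypothetical mapping from queries to the objects that can serve them.
    usable = {"q1": {"v1"}, "q2": {"v1", "i1"}, "q3": {"i2"}, "q4": set()}
    print(percent_reduction(240.0, 120.0))
    print(query_coverage(usable, frozenset({"v1"})))
```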
6. Practical Implications and Future Directions
The deployment of robust view selection strategies has significant impacts on enterprise data warehouses, multi-view learning systems, and related domains:
- Automated Physical Design: View selection algorithms, especially when integrated with cost models and mining-based candidate generation, offer automated tools that relieve database administrators from laborious, error-prone manual tuning.
- Scalability and Adaptability: Modularity (e.g., easily swapped data mining or cost modules) and the capacity for online adaptation (via incremental or streaming algorithms) render these strategies suitable for dynamic, evolving workloads and new data modalities.
- Extension to Non-Relational and Heterogeneous Data: Principles from relational warehouses carry over to XML (with query clustering for materialized XML view selection (0809.1963)), RDF and semantic web databases (requiring special handling of implicit data (Goasdoué et al., 2011)), and multi-view feature selection in machine learning (using multi-objective GAs (Imani et al., 2023)).
- Challenges and Research Gaps: Areas of ongoing research include more accurate cost/benefit modeling via learned estimators, handling complex and dynamic constraints, distributed and multi-objective optimization, and general frameworks that unite view, index, plan, and cache selection (Zinchenko et al., 2024).
In sum, advanced view selection strategies integrate cost-driven formulations, candidate reduction via mining or clustering, joint modeling of structural interactions, and algorithmic search tailored to workload and system constraints. These approaches are foundational in both traditional data management systems and emerging applications involving multi-view data modalities, offering robust solutions to manage the intrinsic complexity and resource demands of large-scale information systems.