
Continuous Query Language for Video Analysis

Updated 8 December 2025
  • Continuous Query Language for Video Analysis is a framework that specifies high-level semantic queries over video data using structured annotations from deep learning pipelines.
  • It leverages spatiotemporal labels, relational constructs, and windowing operators to efficiently detect composite events, such as object interactions and trajectories.
  • Optimizations like indexed querying, CNN-based filters, and operator pipelining enable significant reductions in latency and scalable, real-time video analytics.

Continuous Query Language (CQL) for Video Analysis refers to a suite of data models, declarative query languages, and execution frameworks designed to perform on-the-fly, high-level specification and detection of semantic events in rich video streams. These languages are engineered to operate over structured annotations produced by deep learning pipelines (object detectors, classifiers, trackers), enabling users to author spatiotemporal event queries without exhaustive neural network retraining. State-of-the-art CQLs for video analysis integrate advanced data representations (e.g., spatiotemporal labels, feature vectors, knowledge graphs), relational/temporal programming constructs, and scalable windowed execution to support real-time, expressive, and efficient querying across diverse video sources.

1. Data Models and Representation

All modern CQL systems for video analysis are founded on structured data representations that encode the essential semantics of video contents.

  • Spatiotemporal Labels (Rekall): Video annotations are formalized as axis-aligned rectangles in spacetime $D = T \times X \times Y$, each labeled with metadata. A label is a tuple $\ell = (\Delta, M)$, where $\Delta = [t_1, t_2] \times [x_1, x_2] \times [y_1, y_2]$ and $M$ contains attributes such as object class or identifier. Efficient queries demand index structures: a 1-D interval tree over $[t_1, t_2]$ for rapid temporal selection and an R-tree (potentially lifted to 3-D) over spatial coordinates for geometric selection (Fu et al., 2019); see the label sketch after this list.
  • R⁺⁺-relations and Arrables (MavVStream): R⁺⁺-relations extend the relational model to accommodate vector-valued attributes: object feature vectors ([FV]) and bounding boxes ([BB]). "Arrables" are grouped and ordered arrays of tuples, supporting efficient per-object trajectory and feature-wise operations across time (Billah et al., 2022); see the grouping sketch at the end of this subsection.
  • Video Event Knowledge Graphs (VidCEP): Every frame is abstracted as a labeled graph where nodes represent detected objects (with attributes and confidence) and edges encode spatial or (later) temporal relations. Time-ordered sequences of such graphs support pattern matching for spatiotemporal event detection (Yadav et al., 2020).

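The following minimal sketch illustrates the label model and a temporal selection over it. The names (Label, temporal_select) are hypothetical, and a linear scan stands in for the interval tree for brevity:

    from dataclasses import dataclass, field

    # Sketch of a Rekall-style spatiotemporal label: ell = (Delta, M),
    # a spacetime box plus a metadata dictionary (names are illustrative).
    @dataclass
    class Label:
        t1: float; t2: float              # temporal extent [t1, t2]
        x1: float; x2: float              # spatial extent  [x1, x2]
        y1: float; y2: float              #                 [y1, y2]
        meta: dict = field(default_factory=dict)

    def temporal_select(labels, q1, q2):
        """Return labels whose temporal interval overlaps [q1, q2].

        Linear scan for clarity; an interval tree over [t1, t2] would
        answer the same query in O(log n + k) rather than O(n)."""
        return [l for l in labels if l.t1 <= q2 and q1 <= l.t2]

    labels = [
        Label(0, 4, 10, 50, 10, 80, {"class": "person", "id": 7}),
        Label(3, 9, 60, 120, 15, 90, {"class": "car"}),
    ]
    print([l.meta["class"] for l in temporal_select(labels, 2, 5)])  # ['person', 'car']
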
These data models serve as the substrate upon which continuous query operators act, supporting windowed, compositional logic without the need for repeated model inference or annotation.
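
To make the arrable idea concrete, here is a small, hypothetical grouping sketch: detections are grouped by object id and ordered by frame so that per-object trajectory operations (such as a Direction-like computation) become bulk array operations:

    import numpy as np

    # Hypothetical illustration of arrable-style grouping: (frame, id, cx, cy)
    # detector centroids are grouped by object and ordered by frame.
    detections = [
        (0, 1, 10.0, 5.0), (1, 1, 12.0, 5.5), (2, 1, 15.0, 6.0),
        (0, 2, 90.0, 40.0), (1, 2, 88.0, 41.0),
    ]

    arrable = {}
    for frame, oid, cx, cy in sorted(detections):      # order by frame
        arrable.setdefault(oid, []).append((cx, cy))   # group by object id

    for oid, track in arrable.items():
        pts = np.asarray(track)
        step = np.diff(pts, axis=0)                    # per-frame displacement
        heading = np.degrees(np.arctan2(step[:, 1], step[:, 0]))
        print(oid, heading)                            # Direction-like operator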

2. Query Language Syntax and Constructs

CQLs for video analysis are substantially influenced by database and stream-processing paradigms but extend them with primitives for video-specific operations.

  • SQL-like Syntax: Most prominent systems employ a SELECT-FROM-WHERE core with sliding/tumbling window clauses.
    • Rekall instead exposes a Pythonic composition API, with all operators (map, filter, join, coalesce, minus) closed over sets of labels (Fu et al., 2019); a sketch of this style follows this list.
    • Video Monitoring Queries express frame-level predicates via attributed object detectors, spatial and classifier predicates, and temporal window specifications (Koudas et al., 2020).
    • VEQL (VidCEP) and CQL-VA (MavVStream) add explicit pattern, spatial, temporal, and aggregate constructs:
      SELECT SEQ(Object1, Object2)
      FROM Camera
      WHERE Object1.label='Car' AND Object2.label='Person'
      WITHIN TIMEFRAME_WINDOW(10)
      WITH_CONFIDENCE > 0.5
      (Yadav et al., 2020, Billah et al., 2022)
  • Operators and Semantics:
    • Relational Primitives: map, filter, group_by, join, union, minus
    • Temporal/Spatial Predicates: Allen interval relations, spatial_IOU, within_t, spatial relations (Left, Right, Touch, Contains)
    • Custom Video Operators (MavVStream): sMatch (feature-vector similarity), Direction (trajectory orientation), Consecutive-Join (trajectory-join across streams/cameras), Compress-Consec-Tup (collapse object runs)
    • Pattern Clauses (VidCEP VEQL): Temporal constructs (SEQ, EQ, CONJ, DISJ); spatial relations grouped as directional, topological, or metric (Yadav et al., 2020).
  • Windowing: Queries operate over explicit physical (time/frame) or logical (scene/event) windows, crucial for streaming scalability and prompt notification.
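
As referenced above, a hypothetical rendering of the Pythonic composition style (illustrative names, not the library's actual interface) shows how operators stay closed over sets of labels:

    from types import SimpleNamespace as L   # stand-in for a full Label type

    # Each operator consumes and produces a LabelSet, so queries compose freely.
    class LabelSet:
        def __init__(self, labels):
            self.labels = list(labels)

        def filter(self, pred):
            return LabelSet(l for l in self.labels if pred(l))

        def join(self, other, pred, merge):
            return LabelSet(merge(a, b)
                            for a in self.labels for b in other.labels
                            if pred(a, b))

    detections = LabelSet([
        L(t1=0.0, t2=2.0, meta={"class": "person"}),
        L(t1=0.5, t2=3.0, meta={"class": "car"}),
    ])

    # "Person and car whose temporal extents start within 1 second":
    people = detections.filter(lambda l: l.meta["class"] == "person")
    cars   = detections.filter(lambda l: l.meta["class"] == "car")
    pairs  = people.join(cars,
                         pred=lambda a, b: abs(a.t1 - b.t1) <= 1.0,
                         merge=lambda a, b: (a, b))
    print(len(pairs.labels))   # 1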

3. Query Execution Over Video Streams

Efficient, continuous execution of CQLs depends on pipelined, often parallel operators, and intelligent use of indexes and windows.

  • Incremental Operator DAGs (Rekall): Incoming labels are routed through a DAG of stateful operators. Stateless primitives emit results immediately, while stateful nodes (join, coalesce, minus) maintain indexed working sets within a bounded watermark window. Watermark-controlled state truncation ensures memory use is O(window size) (Fu et al., 2019); a minimal sketch of this truncation appears after this list.
  • Operator Cascades with Fast Filtering (Video Monitoring Queries): A cascade incorporates fast CNN-based filters (IC, OD), only invoking full detection/classification if a frame passes these predicate-specific gates. This approach eliminates non-candidate frames before expensive processing, yielding speed-ups of 10²–10³× (Koudas et al., 2020).
  • Graph Pattern Matching (VidCEP): Each windowed state invokes a confidence- and window-aware subgraph matcher: object detection, attribute matching, spatial and temporal event extraction, confidence thresholding for notifications (Yadav et al., 2020).
  • Arrable-based Batch Operations (MavVStream): Bulk vector operations, compressed runs, and per-object grouping facilitate highly efficient trajectory and similarity queries (Billah et al., 2022).

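The watermark-based truncation in the first item can be sketched as follows; this is an illustrative single-operator model, not any system's actual implementation:

    from collections import deque

    # Sketch of a stateful streaming join whose working set is truncated
    # at the watermark (t - window), keeping memory O(window size).
    class BoundedJoin:
        def __init__(self, window):
            self.window = window
            self.state = deque()          # (timestamp, label), time-ordered

        def on_label(self, t, label, pred):
            # 1) Drop state older than the watermark.
            while self.state and self.state[0][0] < t - self.window:
                self.state.popleft()
            # 2) Emit joins against the surviving working set.
            out = [(label, s) for (ts, s) in self.state if pred(label, s)]
            self.state.append((t, label))
            return out

    j = BoundedJoin(window=5.0)
    print(j.on_label(0.0, "car@0", lambda a, b: True))       # []
    print(j.on_label(3.0, "person@3", lambda a, b: True))    # [('person@3', 'car@0')]
    print(j.on_label(10.0, "person@10", lambda a, b: True))  # [] -- car@0 truncated
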
4. Representative Queries and Event Specification

The expressiveness of CQLs for video analysis is demonstrated by the variety of supported event patterns. Selected canonical queries from recent systems:

| Query Class | Example Description | Method |
| --- | --- | --- |
| Spatiotemporal Composite Event | "Person enters vehicle then starts driving" | Rekall |
| Directional Relation | "Car left of Truck" | VEQL, CQL-VA |
| Appearance Pattern | "Object appears, disappears ≤5 s, then reappears (blink event)" | Rekall |
| Trajectory Join | "Find same person across cameras via FV similarity" | CQL-VA |
| Aggregate in Window | "COUNT(person) > 5 in 10 s window triggers 'High Traffic Flow'" | VEQL, Video Monitoring Queries |

Such queries typically compose raw detection labels (existing tracks, bounding boxes, feature vectors) using spatial, temporal, and metadata joins, with optional aggregate, direction, or similarity conditions. Temporal reasoning is achieved via Allen-style interval logic or custom SEQ/EQ/CONJ operators.
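
A few of the Allen relations behind such queries can be written as simple predicates over (start, end) intervals; the blink event from the table above then composes directly (a minimal sketch, using strict versions of the relations):

    # Allen-style interval relations as predicates over (start, end) pairs.
    def before(a, b):    # a ends strictly before b starts
        return a[1] < b[0]

    def meets(a, b):     # a ends exactly when b starts
        return a[1] == b[0]

    def overlaps(a, b):  # a starts first and the two share a sub-interval
        return a[0] < b[0] < a[1] < b[1]

    def during(a, b):    # a lies strictly inside b
        return b[0] < a[0] and a[1] < b[1]

    # Blink event: object disappears, then reappears within `gap` seconds.
    def blink(a, b, gap=5.0):
        return before(a, b) and (b[0] - a[1]) <= gap

    print(blink((0.0, 2.0), (4.0, 6.0)))   # True: 2 s gap
    print(blink((0.0, 2.0), (9.0, 11.0)))  # False: 7 s gap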

5. Algorithmic and Systemic Optimizations

Performance and scalability are contingent on efficient data organization, filtering, and windowed processing.

  • Indexing and Pruning: Usage of interval/R-tree structures (Rekall), window/batch-based arrable compression (MavVStream), and grouping (predicate push-down) for early elimination of ineligible tuples.
  • Cheap CNN-based Frame Filters (IC/OD): By redirecting the query pipeline through fast classifier or detector branches, the system bypasses expensive Mask R-CNN/YOLOv2 runs on the majority of frames. This strategy exploits side-outputs from shared CNN backbones, yielding latencies as low as 1.5–1.9 ms/frame for the filter pass (Koudas et al., 2020).
  • Statistical Aggregation: Monte Carlo with control variates drastically reduces the number of expensive full detections required for window aggregates, leveraging the high accuracy of filter predictions for variance reduction (up to 230× improvement) (Koudas et al., 2020); a control-variates sketch follows this list.
  • Operator Pipelining and SIMD: Arrable groupings enable vectorized (SIMD) similarity and trajectory computations (Billah et al., 2022). DAG-based query architectures, coupled with watermark and windowing logic, ensure bounded state and parallelizability.
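
The control-variates idea referenced above can be sketched numerically: cheap filter predictions X are available on every frame, expensive detector counts Y only on a sample, and the correlation between them shrinks the estimator's variance (synthetic data; illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)

    n = 10_000
    X = rng.poisson(4.0, n).astype(float)        # cheap filter prediction, all frames
    Y = X + rng.normal(0.0, 0.5, n)              # expensive detector count (synthetic)

    idx = rng.choice(n, size=200, replace=False) # frames sent to the full detector
    xs, ys = X[idx], Y[idx]

    # Optimal coefficient c = Cov(X, Y) / Var(X), estimated on the sample.
    c = np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)

    # Corrected estimate of the window aggregate E[Y].
    estimate = ys.mean() - c * (xs.mean() - X.mean())
    print(f"plain MC: {ys.mean():.3f}  control variate: {estimate:.3f}  true mean: {Y.mean():.3f}")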

6. System Performance and Evaluation

Quantitative results indicate that CQL-based frameworks for video analysis achieve both interactivity and scale on commodity hardware.

  • Latency and Throughput
    • Rekall: Queries over tens of hours of annotated video execute in under 30 seconds on a 16-core/64GB machine; bottleneck remains detector inference, not query evaluation, with Rekall operators <10% of total runtime (Fu et al., 2019).
    • VidCEP: Achieves end-to-end matching latencies of 0.3–4.1 s for 5-second windows, sub-second latency for windows up to 30 minutes, and 70 fps throughput (5 parallel streams) on GPU; CPU-only throughput ~12 fps (Yadav et al., 2020).
    • MavVStream: cJoin and CCTJoin yield an order-of-magnitude speed-up over the classic join (roughly 900 s for the classic join versus 75 s and 35 s, respectively, on 44-minute datasets) (Billah et al., 2022).
    • Video Monitoring Queries: End-to-end query time reductions by 100×–1000×, near-perfect aggregate accuracy (>99%), efficient sampling (Koudas et al., 2020).
  • Accuracy
    • Systematic evaluation against ground truth and manual curation shows CQLs can achieve event detection F1-scores of 0.66–0.89, with higher scores on aggregate and simple presence/count queries; specific tuning and calibration are required for hard cases (Yadav et al., 2020, Billah et al., 2022).
  • Resource Usage
    • Memory use scales with window size and tracked object/event cardinality. Bounded state ensures viability under real-time loads (Fu et al., 2019).

7. Limitations and Future Directions

Despite significant progress, current CQL systems for video analysis manifest several limitations:

  • Dependence on Annotation Quality: Errors in DNN detectors propagate to query outcomes. Occlusions, low lighting, and ID switches are bottlenecks (Yadav et al., 2020).
  • Expressiveness Constraints: Most systems support only basic spatial/temporal predicates; parametric/fuzzy region logic (e.g., FORS, DE-9IM), hierarchical event composition, or continuous spatial algebra require further research (Yadav et al., 2020).
  • Parallelism and Scalability Ceilings: GPU/CPU contention arises beyond ~15 streams in VidCEP; distributed and geo-partitioned deployments require novel resource management (Yadav et al., 2020).
  • Practical Window/Watermark Management: Trade-offs remain between latency, memory, and query precision at window boundaries.

Future work is directed at richer window semantics (e.g., session, landmark windows), hybrid symbolic-neural event matching, distributed query coordination, and accelerated sMatch via approximate indexing. Pursuing these directions would improve CQL applicability for large-scale, real-time, customizable video situation analysis (Fu et al., 2019, Koudas et al., 2020, Yadav et al., 2020, Billah et al., 2022).
