Decentralized Vision-Based PET Analysis
- The paper introduces a decentralized PET analysis framework that integrates multi-camera edge computing and per-pixel temporal analytics for real-time intersection safety assessment.
- It employs YOLOv11m-seg for vehicle detection and homography to fuse synchronized video streams into a unified bird’s-eye view.
- The system achieves sub-second temporal precision and scalable performance with distributed processing and SQL-driven data management.
Decentralized vision-based Post-Encroachment Time (PET) analysis is a methodology for real-time, high-resolution safety assessment at intersections using multi-camera computer vision, distributed edge computation, and pixel-level temporal analytics. By directly monitoring conflict zones through synchronized video streams and calculating PET at fine spatiotemporal granularity, this framework enables dynamic risk identification with sub-second precision, complementing or supplanting traditional crash-based and aggregate cell-based approaches (Chaudhuri et al., 15 Nov 2025).
1. Multi-Camera Hardware and Edge-Oriented System Architecture
The deployed system utilizes four Hikvision PTZ cameras, each mounted on a separate signal pole to ensure comprehensive coverage, including conflict zones and overlapping fields of view. Cameras operate at 1920×1080 resolution and 30 FPS, streaming via RTSP across a local Gigabit Ethernet (GigE) network. Each stream terminates on an NVIDIA Jetson AGX Xavier, responsible for on-device inference and geometric transformation.
For consistent frame-level data fusion, each Jetson exports detection results as JSON objects with Unix millisecond timestamps. A Windows industrial PC aggregates these, selecting the temporally nearest set (maximum allowed difference: 350 ms) and averaging the associated timestamps to generate a canonical frame time, thereby achieving practical synchronization. The distributed pipeline is as follows: video streaming → real-time YOLOv11m-seg inference and homography warping on the Jetsons → export of per-frame vehicle-mask JSONs → file sharing over SMB → aggregation and post-processing (bounding-box fitting, timestamp assignment, SQL database ingestion) on the Windows hub → handoff to a GPU server for PET analytics and heatmap generation (Chaudhuri et al., 15 Nov 2025).
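Below is a minimal sketch of the aggregation hub's synchronization step, assuming each Jetson drops per-frame JSON files into a shared SMB directory; the field names (`timestamp_ms`, `masks`) and helper functions are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of the frame-synchronization step on the aggregation hub.
# JSON field names (timestamp_ms, masks) are assumptions, not from the paper.
import json
from pathlib import Path

MAX_SKEW_MS = 350  # maximum allowed timestamp spread across the four cameras

def load_latest(cam_dir: Path) -> dict:
    """Load the most recent per-frame detection JSON exported by one Jetson."""
    latest = max(cam_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    return json.loads(latest.read_text())

def fuse_frame(cam_dirs: list[Path]) -> dict | None:
    """Pick the temporally nearest set of detections and assign a canonical time."""
    frames = [load_latest(d) for d in cam_dirs]
    ts = [f["timestamp_ms"] for f in frames]
    if max(ts) - min(ts) > MAX_SKEW_MS:
        return None  # cameras too far apart in time; skip this cycle
    canonical_ts = sum(ts) / len(ts)  # average Unix-ms timestamp
    return {"timestamp_ms": canonical_ts,
            "masks": [m for f in frames for m in f["masks"]]}
```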
2. Visual Detection and Segmentation Pipeline
Vehicle detection operates on each edge module using YOLOv11m-seg, a medium-complexity instance segmentation network tuned for the domain. Each 1920×1080 RGB frame is preprocessed (resized and normalized) and run through the YOLOv11m-seg network; polygonal vehicle masks are derived by thresholding the segmentation-head output and vectorizing the results to spatial coordinates in the camera image plane. Typical inference latency is 288.8 ms per frame on the Jetson AGX Xavier, representing approximately 77.5% of the end-to-end pipeline for each stream. This decentralized per-camera processing architecture supports real-time, parallel inference while minimizing network congestion and central failure points (Chaudhuri et al., 15 Nov 2025).
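A hedged sketch of the per-camera inference step follows, assuming the Ultralytics Python API for a YOLO11 segmentation checkpoint; the weight filename, confidence threshold, and RTSP placeholder are illustrative, not from the paper.

```python
# Hedged sketch of per-camera segmentation inference (Ultralytics API assumed).
import cv2
from ultralytics import YOLO

model = YOLO("yolo11m-seg.pt")  # medium segmentation variant (assumed checkpoint name)

def detect_vehicles(frame_bgr, conf=0.25):
    """Return a list of polygon masks (Nx2 pixel coordinates) for detected vehicles."""
    result = model.predict(frame_bgr, conf=conf, verbose=False)[0]
    if result.masks is None:
        return []
    # result.masks.xy yields one (N, 2) polygon per detected instance,
    # already in the 1920x1080 image-plane coordinates of this camera.
    return list(result.masks.xy)

cap = cv2.VideoCapture("rtsp://<camera-ip>/stream")  # placeholder RTSP URL
ok, frame = cap.read()
if ok:
    polygons = detect_vehicles(frame)
```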
3. Homography Unification and Bird’s-Eye Geometric Mapping
For cross-camera and cross-viewpoint integration, each camera's pixel-space polygons are projected into a unified, global bird's-eye coordinate frame using a per-camera 3×3 homography matrix $H$. Homographies are estimated with OpenCV's \texttt{cv2.findHomography()} from manually annotated correspondences between image points $(u, v)$ and real-world coordinates $(X, Y)$. The forward mapping is:

$$\begin{bmatrix} x' \\ y' \\ w \end{bmatrix} = H \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \qquad (X, Y) = \left( \frac{x'}{w}, \frac{y'}{w} \right),$$

with normalization over the third (projective) row. Each warped mask is overlaid onto a global 1600×1600 grid. At each grid pixel, an overlap count matrix records the number of cameras reporting occupancy. A weighted fusion is performed, assigning point values of 1, 2, 6, and 8 to pixels seen by one, two, three, and four cameras respectively; these empirically tuned weights support robust consensus across camera overlaps. Subsequent contour extraction and minimum-area rectangle fitting (\texttt{cv2.minAreaRect}) provide unified global vehicle localization (Chaudhuri et al., 15 Nov 2025).
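The following sketch illustrates homography estimation, warping, weighted fusion, and rectangle fitting with OpenCV; the grid size and the (1, 2, 6, 8) weights follow the description above, while the consensus threshold and helper names are assumptions.

```python
# Hedged sketch of homography warping and weighted multi-camera fusion.
import cv2
import numpy as np

GRID = 1600
FUSION_WEIGHTS = {1: 1, 2: 2, 3: 6, 4: 8}  # weight per number of agreeing cameras

def estimate_homography(img_pts, world_pts):
    """img_pts, world_pts: (N, 2) arrays of manually annotated correspondences."""
    H, _ = cv2.findHomography(np.float32(img_pts), np.float32(world_pts))
    return H

def fuse_masks(per_camera_masks, homographies):
    """Warp each camera's binary mask to the bird's-eye grid and fuse by overlap count."""
    overlap = np.zeros((GRID, GRID), dtype=np.uint8)
    for mask, H in zip(per_camera_masks, homographies):
        warped = cv2.warpPerspective(mask, H, (GRID, GRID), flags=cv2.INTER_NEAREST)
        overlap += (warped > 0).astype(np.uint8)
    fused = np.zeros_like(overlap, dtype=np.float32)
    for n_cams, weight in FUSION_WEIGHTS.items():
        fused[overlap == n_cams] = weight
    return fused

def fit_vehicle_boxes(fused, thresh=2.0):
    """Extract contours from the fused consensus map and fit minimum-area rectangles."""
    binary = (fused >= thresh).astype(np.uint8)  # thresh is an assumed consensus cutoff
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.minAreaRect(c) for c in contours]
```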
4. Pixel-Level PET Analysis: Mathematical Formulation and Implementation
The PET process quantifies temporal gaps between departing and arriving vehicles for each pixel on the bird's-eye view. For pixel $(i, j)$, let $t^{\mathrm{vac}}_{ij,k}$ be the timestamp when the pixel becomes vacated, and $t^{\mathrm{occ}}_{ij,k}$ when it next becomes occupied. The $k$-th PET interval is:

$$\Delta t_{ij,k} = t^{\mathrm{occ}}_{ij,k} - t^{\mathrm{vac}}_{ij,k},$$

with all nonzero intervals above a minimum duration retained to filter spurious events. The time-mean PET for pixel $(i, j)$ is:

$$\overline{\mathrm{PET}}_{ij} = \frac{1}{N_{ij}} \sum_{k=1}^{N_{ij}} \Delta t_{ij,k},$$

where $N_{ij}$ is the count of valid PET intervals logged. The implementation uses a "stopwatch matrix" $S$ tracking elapsed vacancy time for each pixel. Each global frame provides an 800×800 binary occupancy mask $M$ from the fitted rectangles. Where $M_{ij} = 1$ (occupied), any nonzero $S_{ij}$ is logged as a completed PET interval and reset; where $M_{ij} = 0$, $S_{ij}$ accumulates the elapsed time since last occupation, advancing by roughly one frame period (up to about 400 ms) per update. Intervals are recorded in a relational database table with schema (pixel_i, pixel_j, start_ts, end_ts, duration_s) (Chaudhuri et al., 15 Nov 2025).
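A minimal sketch of the stopwatch-matrix bookkeeping is given below, assuming a NumPy occupancy mask per fused frame; the minimum-interval threshold and the event-tuple layout are illustrative assumptions.

```python
# Hedged sketch of the per-pixel "stopwatch matrix" PET bookkeeping.
import numpy as np

GRID = 800
MIN_PET_S = 0.5  # assumed minimum interval to filter spurious events

stopwatch = np.zeros((GRID, GRID), dtype=np.float64)   # elapsed vacancy time [s]
was_occupied = np.zeros((GRID, GRID), dtype=bool)

def update_pet(occupancy: np.ndarray, dt_s: float, frame_ts: float, events: list):
    """occupancy: (GRID, GRID) boolean mask from fitted rectangles for this frame."""
    global was_occupied
    # Pixels just re-occupied: a nonzero stopwatch value is a completed PET interval.
    reoccupied = occupancy & (stopwatch > 0)
    for i, j in zip(*np.nonzero(reoccupied)):
        pet = stopwatch[i, j]
        if pet >= MIN_PET_S:
            # (pixel_i, pixel_j, start_ts, end_ts, duration_s) -- assumed layout
            events.append((int(i), int(j), frame_ts - pet, frame_ts, float(pet)))
        stopwatch[i, j] = 0.0
    # Pixels that were occupied and are now vacant start (or continue) timing.
    vacant = ~occupancy
    timing = vacant & (was_occupied | (stopwatch > 0))
    stopwatch[timing] += dt_s  # one frame period, ~0.37 s in the deployed system
    was_occupied = occupancy.copy()
```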
5. High-Resolution Heatmaps and Data Management
The system produces pixel-level PET heatmaps at centimeter-level resolution (an 800×800 grid over the intersection ROI). Each pixel's mean PET is visualized via a logarithmic colormap:

$$c_{ij} = \log\!\left(\overline{\mathrm{PET}}_{ij} + \epsilon\right),$$

with $\epsilon > 0$ to ensure numerical stability. A count-map overlay displays the number of observed PET intervals per pixel. Data are stored in a MySQL schema consisting of a detections table (per-frame vehicle masks) and a pet_events table (pixel-level PET interval events), supporting queries for mean PET over spatiotemporal windows:
```sql
SELECT AVG(duration_s)
FROM pet_events
WHERE pixel_i BETWEEN i1 AND i2
  AND pixel_j BETWEEN j1 AND j2
  AND start_ts >= T1
  AND end_ts <= T2;
```
The heatmap generator compiles queried PET aggregates into .png overlays at configurable time block widths (e.g., 5 min, 30 min) (Chaudhuri et al., 15 Nov 2025).
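A possible rendering step is sketched below: a queried 800×800 array of mean PET values is passed through the logarithmic transform and saved as a .png overlay; the colormap choice and epsilon value are assumptions.

```python
# Hedged sketch of heatmap rendering from queried mean-PET aggregates.
import numpy as np
import matplotlib.pyplot as plt

EPS = 1e-3  # small constant for numerical stability in the log transform

def render_pet_heatmap(mean_pet: np.ndarray, out_path: str = "pet_heatmap.png"):
    """mean_pet: (800, 800) array of per-pixel mean PET in seconds (0 where no data)."""
    log_pet = np.log(mean_pet + EPS)
    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.imshow(log_pet, cmap="turbo", origin="lower")
    fig.colorbar(im, ax=ax, label="log(mean PET + eps) [s]")
    ax.set_title("Pixel-level PET heatmap")
    fig.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close(fig)
```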
6. System Performance, Validation, and Scalability
The measured end-to-end pipeline on the four-camera deployment at H Street and Broadway yields 2.68 sustained FPS (372.7 ms/frame end-to-end), broken down as follows: YOLOv11 segmentation and homography warping (288.8 ms), RTSP decoding (83.8 ms), rectangle fitting (20.3 ms, parallelized), SQL upload (32.5 ms, parallel), and PET heatmap updates (126.4 ms on the downstream GPU server). Bird's-eye calibration error is estimated at the centimeter level via reprojection of manual control points. The PET timer resolution is on the order of one frame period (a few hundred milliseconds), supporting sub-second hazard precision. Empirical trials against ground-truth video demonstrate sub-second agreement of event timing with manual annotation. Coarser 20×20 PET grids highlight high-risk crosswalk entries and adjacent lanes; fine 800×800 grids reveal concentrated pinch-points at centimeter precision. Regions such as left-turn pockets exhibit higher mean PET, consistent with lower vehicular density and speed. The pipeline has been ported in simulation to other intersections, requiring only scene-specific homography annotations, and replicates its performance metrics (Chaudhuri et al., 15 Nov 2025).
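As a consistency check on the reported figures (a worked calculation, not a claim from the paper):

$$\frac{1000\ \text{ms/s}}{372.7\ \text{ms/frame}} \approx 2.68\ \text{FPS}, \qquad \frac{288.8\ \text{ms}}{372.7\ \text{ms}} \approx 77.5\%,$$

and $288.8 + 83.8 = 372.6\ \text{ms} \approx 372.7\ \text{ms}$, indicating that inference and RTSP decoding dominate the serial path while rectangle fitting and SQL upload run in parallel.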
7. Extensions, Integration, and Future Research
Proposed refinements target further scalability and automation. Automated calibration routines using checkerboard or fisheye correction are suggested to minimize manual homography annotation. Replacing the Windows aggregation node with a distributed edge cluster could further reduce network latency. Incorporating DeepSORT-style multi-object tracking would provide persistent ID continuity and support velocity or trajectory analytics. Web-based dashboards (via Flask/Node.js) would enable on-demand review of temporal PET fields. Integration with digital twin simulations or adaptive traffic controllers is facilitated by the SQL-based, high-resolution PET database. Advancements in edge hardware, specifically deployment on Jetson Orin or embedded TPU clusters, are expected to achieve substantially higher frame rates, offering practical full-intersection, real-time PET analysis (Chaudhuri et al., 15 Nov 2025). A plausible implication is the potential for multimodal (vehicle/pedestrian) safety analytics by linking with connected-vehicle telemetry.
In summary, decentralized vision-based PET analysis orchestrates multi-camera edge computation, YOLOv11-based segmentation, homography unification, real-time per-pixel PET evaluation, and SQL-driven data management to deliver an intersection safety assessment system with centimeter-level (3.3 cm) spatial and sub-second temporal accuracy and robust scalability across diverse deployment contexts (Chaudhuri et al., 15 Nov 2025).