CompACT Pedestrian Detection
- CompACT is a complexity-aware pedestrian detection framework that balances accuracy and computational cost by integrating features of widely varying computational cost within a single cascade.
- It employs a Lagrangian formulation and boosting algorithm to prioritize inexpensive features in early cascade stages and reserve complex features for later stages.
- Empirical evaluations on the Caltech and KITTI benchmarks show that CompACT achieves state-of-the-art detection accuracy at practical, near real-time speeds.
Complexity-Aware Pedestrian Detection (CompACT) is a framework for constructing cascaded pedestrian detectors that explicitly optimize a trade-off between detection accuracy and computational complexity. CompACT enables the seamless integration of features possessing widely varying computational costs, including both classical image features and deep convolutional neural network (CNN) activations, within a single learned detector. By penalizing unnecessary computation, CompACT allocates inexpensive features to the early cascade stages and reserves complex, high-capacity features for later stages, when only a small subset of ambiguous input windows remains. This approach yields a detector that is both accurate and computationally efficient across challenging pedestrian detection benchmarks (Cai et al., 2015).
1. Lagrangian Formulation: Joint Optimization of Accuracy and Complexity
CompACT is motivated by the need to balance detection accuracy against computational constraints. Classical AdaBoost minimizes the empirical classification risk
$R_E[F] = \frac{1}{|S_t|} \sum_{(x_i, y_i) \in S_t} e^{-y_i F(x_i)},$
where $F$ is the score function, $y_i \in \{-1, +1\}$ are labels, and $S_t = \{(x_i, y_i)\}$ is the training set.
CompACT introduces a second risk term, the "complexity risk" $R_C[F]$, to quantify average computational cost. Detector learning is thus cast as the constrained problem
$\min_F R_E[F] \quad \text{subject to} \quad R_C[F] \le \gamma,$
which is equivalent, via a Lagrange multiplier $\eta$, to minimizing the composite objective: $\mathcal{L}[F] = R_E[F] + \eta R_C[F]. \tag{1}$
$R_E$ measures classification error, while $R_C$ penalizes excessive computation, making the trade-off explicit. The complexity risk is defined analogously to $R_E$, as an empirical average over the training set:
$R_C[F] = \frac{1}{|S_t|} \sum_{(x_i, y_i) \in S_t} L_C\big(y_i F(x_i)\big)\, \Omega[F](x_i),$
with $y_i F(x_i)$ as the signed "complexity margin" and $L_C$ as a hinge-style loss: a positive window always pays its full evaluation cost, while a negative window stops contributing once it has been confidently rejected. The per-sample implementation complexity $\Omega[F](x)$ accumulates the cost of every weak learner actually evaluated on a window,
$\Omega[F](x) = \sum_{k=1}^{K} \omega(g_k)\, \mathbb{1}\big[F_{k-1}(x) > T_{k-1}\big],$
where the $g_k$ are weak learners with evaluation costs $\omega(g_k)$, the $T_k$ are the cascade rejection thresholds, and $K$ is the number of cascade stages; averaging over all training windows gives the average cost per window.
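The following minimal sketch (Python/NumPy; all function and variable names are illustrative and not from the paper) shows how the two empirical risks and the composite objective of Eq. (1) could be computed, assuming per-window evaluation costs are given and using a hinge-style gate as a soft "still alive" indicator for negatives.

```python
import numpy as np

def classification_risk(scores, labels):
    """Empirical exponential (AdaBoost) risk R_E over the training set."""
    return np.mean(np.exp(-labels * scores))

def complexity_risk(scores, labels, per_window_cost):
    """Empirical complexity risk R_C: positives always pay their full
    evaluation cost; negatives pay only while still 'alive' in the cascade,
    modelled here with a hinge-style gate max(F(x), 0)."""
    alive = np.where(labels > 0, 1.0, np.maximum(scores, 0.0))
    return np.mean(alive * per_window_cost)

def lagrangian(scores, labels, per_window_cost, eta):
    """Composite objective L[F] = R_E[F] + eta * R_C[F] of Eq. (1)."""
    return (classification_risk(scores, labels)
            + eta * complexity_risk(scores, labels, per_window_cost))
```

The multiplier `eta` plays the same role as $\eta$ in (1): larger values push the learner toward cheaper feature choices.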
2. CompACT Boosting Algorithm and Learning Dynamics
Learning proceeds via functional gradient descent on $\mathcal{L}[F]$. The negative gradient along a candidate weak learner $g$ combines classification gain and computational cost,
$-\langle \nabla \mathcal{L}[F], g \rangle = a(g) - \eta\, c(g),$
where $a(g) = \sum_i w_i\, y_i\, g(x_i)$ with boosting weights $w_i = e^{-y_i F(x_i)}$, and $c(g)$ is a complexity penalty proportional to the cost $\omega(g)$ of evaluating $g$ on the windows still active in the cascade.
At each boosting round, the weak learner maximizing this composite score is selected (a minimal sketch follows this list):
- Compute the classification gain $a(g)$ for each candidate $g$
- Compute the complexity penalty $c(g)$ for each candidate $g$
- Score: $s(g) = a(g) - \eta\, c(g)$
- Select $g^* = \arg\max_g s(g)$
- Find the optimal step size $\alpha^*$ via a 1-D line search minimizing $\mathcal{L}[F + \alpha g^*]$
- Update: $F \leftarrow F + \alpha^* g^*$
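The sketch below illustrates one boosting round under this scheme (Python/NumPy; `candidates` pairs each weak learner's predictions with its evaluation cost, and the simple penalty `cost * n_alive` and grid line search are illustrative simplifications, not the paper's exact procedure).

```python
import numpy as np

def compact_boosting_round(F, labels, candidates, eta, n_alive):
    """One illustrative CompACT-style boosting round.

    F          : current scores F(x_i) for all training windows
    labels     : y_i in {-1, +1}
    candidates : list of (g_pred, cost), g_pred being the weak learner's
                 predictions g(x_i) and cost its per-window evaluation cost
    eta        : Lagrange multiplier trading accuracy against complexity
    n_alive    : number of windows still active, which scales the penalty
    """
    w = np.exp(-labels * F)                 # boosting weights e^{-y_i F(x_i)}

    # Rank candidates by the composite score s(g) = a(g) - eta * c(g).
    def score(item):
        g_pred, cost = item
        a = np.sum(w * labels * g_pred)     # classification gain a(g)
        c = cost * n_alive                  # complexity penalty c(g)
        return a - eta * c
    g_pred, cost = max(candidates, key=score)

    # 1-D line search over the step size; under this simplified cost model the
    # complexity term is constant in alpha once g is fixed, so only the
    # exponential risk is minimized here.
    alphas = np.linspace(0.0, 2.0, 201)
    risks = [np.mean(np.exp(-labels * (F + a * g_pred))) for a in alphas]
    alpha = alphas[int(np.argmin(risks))]

    return F + alpha * g_pred, g_pred, cost, alpha
```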
The algorithm penalizes complex features during early stages (when many windows remain active), and relaxes this constraint in later stages as the number of rejected windows increases and few survivors are left to incur the cost. This scheduling is a principled consequence of the joint loss in (1) (Cai et al., 2015).
3. Modeling Feature Complexity
CompACT accommodates features with heterogeneous computational costs, assigning each feature a cost $\omega(g)$ that enters directly into the complexity risk $R_C$:
- Pre-computed features: Aggregate Channel Features (ACF), whose channels are computed once per image, so the marginal per-window cost is negligible.
- Just-in-time (JIT) features, evaluated only on the windows surviving at a given stage:
  - Self-Similarity (SS) features computed on blocks of the ACF channels
  - Checkerboard (CB) filter responses on the 10 ACF channels
  - LDA basis filters (PCA-like projections of the ACF channels)
  - Small-CNN conv5 channels, which incur a one-time "trigger" cost for the full CNN forward pass on first use per window, then a cost of 1 per additional JIT feature
  - CNN-Checkerboard (CNNCB) filters applied to the conv5 feature maps, with an additional per-feature cost
For CNNs, a “trigger” cost is charged on first use per window, and zero for subsequent conv5 feature extractions within that window. This cost model enables complexity-aware scheduling of diverse feature types.
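As an illustration of this accounting, the sketch below (Python; all cost constants are placeholders, not the values used by Cai et al.) charges the CNN trigger once per window and a small incremental cost for every further conv5-based feature.

```python
# Placeholder per-feature incremental costs; ACF is pre-computed per image,
# so its marginal per-window cost is treated as zero here.
FEATURE_COSTS = {"ACF": 0.0, "SS": 1.0, "CB": 2.0,
                 "LDA": 2.0, "CNN": 1.0, "CNNCB": 3.0}
CNN_TRIGGER_COST = 100.0  # placeholder: one full small-CNN forward pass

def incremental_cost(feature_type, window_state):
    """Cost of evaluating one more feature of `feature_type` on a window.

    `window_state` is a per-window dict remembering whether the CNN trigger
    has already been paid, so later conv5-based features stay cheap."""
    cost = FEATURE_COSTS[feature_type]
    if feature_type in ("CNN", "CNNCB") and not window_state.get("cnn_paid"):
        cost += CNN_TRIGGER_COST          # charge the forward pass only once
        window_state["cnn_paid"] = True
    return cost

# Example: the first conv5-based feature on a window pays the trigger,
# later ones do not.
state = {}
first = incremental_cost("CNN", state)     # 101.0 under these placeholders
second = incremental_cost("CNNCB", state)  # 3.0
```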
4. Embedded Cascade Structure and Feature Scheduling
CompACT learns an embedded cascade of $K$ stages, each consisting of a depth-2 decision-tree weak learner $g_k$ with an associated rejection threshold $T_k$. The cumulative score at stage $k$ is $F_k(x) = \sum_{j=1}^{k} \alpha_j\, g_j(x)$, and a window $x$ is rejected at stage $k$ if $F_k(x) < T_k$.
Empirically, early cascade stages rely on cheap ACF features, intermediate stages incorporate SS and CB, while expensive CNNCB and CNN features are reserved for final stages. As most windows are eliminated early, only a small fraction incurs the cost of high-complexity features.
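A minimal sketch of this early-rejection evaluation follows (Python; the weak learners are abstracted as callables and the names are illustrative).

```python
def cascade_score(window, weak_learners, alphas, thresholds):
    """Embedded-cascade evaluation with early rejection.

    weak_learners : list of callables g_k(window) -> score (depth-2 trees in
                    CompACT; any scorer works for this sketch)
    alphas        : per-stage step sizes alpha_k
    thresholds    : per-stage rejection thresholds T_k
    """
    score = 0.0
    for g, alpha, T in zip(weak_learners, alphas, thresholds):
        score += alpha * g(window)   # cumulative score F_k(x)
        if score < T:                # reject early: no further features are paid for
            return score, False
    return score, True               # window survives all K stages
```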
5. Integration of Deep Convolutional Neural Networks
Deep CNN conv5-layer activations are integrated as another JIT feature family with a per-feature cost structure as previously described. CompACT typically refrains from selecting raw conv5 CNN features early in the cascade, instead favoring CNNCB filters when advantageous.
To incorporate a large, ImageNet-pretrained CNN (e.g., AlexNet or VGG), the network is embedded as the final weak learner of the cascade, and boosting computes its optimal step size $\alpha^*$ just as for any other feature. During inference, the early cascade stages run on fast features at approximately 4 fps on the CPU, and only the roughly 10% of windows surviving after non-maximum suppression (NMS) are propagated to the large CNN on the GPU, for an overall throughput of about 2 fps. This architecture generalizes the usual combination of a proposal stage with a CNN classifier, but operates as a unified, single-pass system.
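The sketch below (Python; `cascade`, `large_cnn`, and the top-fraction stand-in for NMS are illustrative assumptions) shows this single-pass inference flow: the fast cascade scores all windows, and only the surviving fraction is re-scored by the embedded large CNN with its boosted step size.

```python
def compact_deep_inference(windows, cascade, large_cnn, alpha_cnn, keep_frac=0.10):
    """Illustrative single-pass inference with a large CNN as the final stage.

    cascade(w)   -> (score, survived) from the fast feature stages
    large_cnn(w) -> score of the embedded AlexNet/VGG weak learner
    alpha_cnn    : boosted step size of the CNN stage
    keep_frac    : fraction of windows kept, standing in for NMS survivors
    """
    survivors = []
    for w in windows:
        s, alive = cascade(w)
        if alive:
            survivors.append((s, w))
    survivors.sort(key=lambda t: t[0], reverse=True)
    survivors = survivors[:max(1, int(keep_frac * len(windows)))]

    # Final stage: add the large CNN's weighted score to the cascade score.
    return [(s + alpha_cnn * large_cnn(w), w) for s, w in survivors]
```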
6. Empirical Results and Comparative Analysis
CompACT was extensively evaluated on the Caltech and KITTI pedestrian datasets, using the log-average miss rate (MR) over FPPI for Caltech and AUC for KITTI. Runtimes were measured on a 2.1 GHz Intel Xeon CPU and an NVIDIA K40M GPU.
Table 1: Single-feature Cascades vs. CompACT on Caltech
| Method | log-avg MR [%] | time (s) |
|---|---|---|
| ACF-only | 42.6 | 0.07 |
| SS-only | 34.3 | 0.08 |
| CB-only | 37.9 | 0.23 |
| LDA-only | 37.2 | 0.16 |
| CNN-only | 28.1 | 0.87 |
| CNNCB-only | 26.9 | 2.05 |
| CompACT-ACF | 32.2 | 0.11 |
| CompACT-CNN | 23.8 | 0.28 |
Table 2: Large CNN as Final Stage (Caltech)
| Method | AlexNet MR [%] | Δtime (s) | VGG MR [%] | Δtime (s) |
|---|---|---|---|---|
| CompACT-small-CNN only | 23.8 | — | 23.8 | — |
| + embedded large CNN-Alex | 15.0 | +0.10 | 14.8 | +0.10 |
| + embedded large CNN-VGG | 11.8 | +0.25 | 11.8 | +0.25 |
Table 3: State-of-the-Art Comparison on Caltech "Reasonable"
| Detector | MR [%] | time (s) |
|---|---|---|
| SpatialPooling+ (ECCV’14) | 25.4 | 5.0† |
| Checkerboards (CVPR’15) | 25.6 | 4.0† |
| R‐CNN (arXiv’15) | 32.9 | 2.0† |
| Katamari (ECCV) | 24.8 | 0.5† |
| CompACT‐Deep (embed VGG) | 11.8 | 2.0 |
Table 4: KITTI (Moderate) AUC and Timing
| Method | AUC [%] | time (s) |
|---|---|---|
| FilteredICF | 54.0 | 0.40 |
| pAUCEnsT | 54.5 | 0.60 |
| R‐CNN (KITTI) | 50.1 | 4.0 |
| Regionlets | 61.2 | 1.0† |
| CompACT‐Deep | 58.7 | 1.0 |
†Runtimes as reported by the original authors; depending on the method, they either exclude proposal-generation time or are end-to-end figures.
Key observations:
- CompACT's Lagrangian framework makes the trade-off between accuracy and complexity explicit and tunable via the multiplier $\eta$.
- Boosting ranks weak learners by a composite classification-plus-computation score, naturally allocating feature types across stages.
- Embedding AlexNet or VGG as the final stage yields single-pass detectors that significantly outperform separate proposal+CNN pipelines in both accuracy and speed.
7. Implications and Significance
CompACT operationalizes the notion of budgeted detection, allowing large, heterogeneous feature pools and optimizing for both accuracy and computational efficiency. The explicit complexity-awareness advances the state of the art on pedestrian detection benchmarks and enables practical, real-time detectors. A plausible implication is that similar frameworks could be leveraged for other detection tasks requiring heterogeneous feature integration and complexity management (Cai et al., 2015).