CAM-Convs: Camera-Aware & CAM Accelerators
- CAM-Convs are specialized convolutional frameworks that either integrate camera calibration data for improved depth estimation or employ CAM-based accelerators for efficient neural computations.
- The camera-aware branch augments standard convolutions with extra channels encoding focal length, principal points, and normalized coordinates, significantly reducing errors on unseen camera setups.
- The CAM-only accelerators replace traditional multiply-accumulate operations with in-place associative computing, achieving substantial gains in energy efficiency and inference speed.
CAM-Convs are a class of convolutional operations and accelerator frameworks that leverage content-addressable memory (CAM) for enhanced neural network inference and training, with principal applications spanning camera-aware convolutional design in computer vision as well as energy- and latency-efficient hardware implementation for deep neural networks. The acronym “CAM-Convs” may refer to two distinct, non-overlapping paradigms: (1) Camera-Aware Multi-Scale Convolutions, which explicitly encode camera calibration metadata as additional channels in the convolutional pipeline to enable robust single-view depth estimation across heterogeneous camera intrinsics (Facil et al., 2019); and (2) CAM-only Convolutional Accelerators, which reframe convolution as a sequence of in-place bitwise or geometric associative operations performed with CAM primitives—drastically reducing energy and data movement overhead in purpose-built digital/hybrid hardware (Lima et al., 2024, Nguyen et al., 2023). The following entry delineates both branches, providing architectural and methodological detail, system-level implications, and comparative analysis.
1. Camera-Aware Multi-Scale Convolutions
CAM-Convs, as introduced by Facil et al. (2019), address the challenge of generalization in single-view depth estimation across diverse camera models, such as variable focal lengths, principal points, or sensor sizes. Standard convolutional networks overfit severely to the specific intrinsic parameters seen during training, necessitating laborious data recollection whenever the imaging geometry changes.
The innovation lies in augmenting the 2D convolution with camera-parameter maps. A standard convolution maps an input feature map $f$ through a learned kernel $w$ as $f \mapsto f * w$, oblivious to the camera that produced $f$. In a CAM-Conv, extra channels—centered principal-point maps $cc$, field-of-view maps $fov$, and normalized coordinates $nc$—are stacked with the input features. Formally,

$$\tilde{f} = \left[\, f \,\|\, cc_x \,\|\, cc_y \,\|\, fov_x \,\|\, fov_y \,\|\, nc_x \,\|\, nc_y \,\right],$$

and the convolution operates over all channels. The coordinate channels are computed per pixel $(u, v)$ as

$$cc_x(u,v) = u - c_x, \qquad cc_y(u,v) = v - c_y,$$

$$fov_x(u,v) = \arctan\!\left(\frac{cc_x(u,v)}{f_x}\right), \qquad fov_y(u,v) = \arctan\!\left(\frac{cc_y(u,v)}{f_y}\right),$$

with $nc_x$ and $nc_y$ varying linearly from $-1$ to $1$ across the image width and height, where the principal point $(c_x, c_y)$ and the focal lengths $f_x, f_y$ are the camera intrinsics.
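The camera-parameter channels can be built in a few lines of NumPy. The sketch below follows the definitions above; the channel ordering and the $[-1, 1]$ normalization convention are illustrative assumptions.

```python
import numpy as np

def cam_conv_channels(h, w, fx, fy, cx, cy):
    """Extra per-pixel channels for a CAM-Conv (layout is an illustrative choice).

    Returns shape (6, h, w): centered coordinates (cc_x, cc_y), field-of-view
    maps (fov_x, fov_y), and normalized coordinates (nc_x, nc_y).
    """
    xs = np.arange(w, dtype=np.float64)                 # pixel column index u
    ys = np.arange(h, dtype=np.float64)                 # pixel row index v
    cc_x = np.tile(xs - cx, (h, 1))                     # u - c_x
    cc_y = np.tile((ys - cy)[:, None], (1, w))          # v - c_y
    fov_x = np.arctan(cc_x / fx)                        # per-pixel horizontal FOV angle
    fov_y = np.arctan(cc_y / fy)                        # per-pixel vertical FOV angle
    nc_x = np.tile(np.linspace(-1.0, 1.0, w), (h, 1))   # resolution-normalized u
    nc_y = np.tile(np.linspace(-1.0, 1.0, h)[:, None], (1, w))  # resolution-normalized v
    return np.stack([cc_x, cc_y, fov_x, fov_y, nc_x, nc_y])

features = np.random.rand(16, 8, 8)      # toy feature map with 16 channels
extra = cam_conv_channels(8, 8, fx=5.0, fy=5.0, cx=4.0, cy=4.0)
stacked = np.concatenate([features, extra], axis=0)   # fed to the convolution
```

Because the extra channels depend only on the intrinsics and the resolution, they can be precomputed once per camera and resized alongside the feature maps.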
2. Multi-Scale Network Integration and Decoder Architecture
The CAM-Conv formulation is integrated into a U-Net-style encoder–decoder architecture, using a ResNet-50 backbone encoder. The decoder comprises five upsampling levels, each augmented by a skip connection realized with a dedicated CAM-Conv on the resized camera-parameter maps at that resolution. Each decoder block concatenates the camera-aware skip and processes it with two convolutions. Output heads at each scale yield inverse depth, confidence, and (for lower resolutions) surface normals. The final denormalization applies focal-length normalization, ensuring metric consistency across diverse cameras:

$$d = \hat{d}\,\frac{f}{f_{\mathrm{ref}}},$$

where $\hat{d}$ is the focal-length-normalized depth prediction, $f$ the focal length of the test camera, and $f_{\mathrm{ref}}$ the reference focal length used for normalization during training.
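A minimal sketch of the focal-length denormalization step, assuming (hypothetically) that the network's output is inverse depth normalized to a reference focal length `f_ref`; the exact convention may differ from the paper's:

```python
def denormalize_inverse_depth(inv_depth_norm, f, f_ref=1.0):
    """Recover metric depth from a focal-length-normalized inverse-depth value.

    Assumes training normalized depth to a reference focal length f_ref, so
    metric depth scales linearly with the focal length f of the test camera.
    (Both the normalization convention and f_ref are illustrative assumptions.)
    """
    depth_norm = 1.0 / inv_depth_norm   # normalized depth
    return depth_norm * (f / f_ref)     # metric depth

metric = denormalize_inverse_depth(0.5, f=2.0, f_ref=1.0)   # 1/0.5 * 2 = 4.0
```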
3. Training Pipelines and Performance Evaluation
The primary benchmark uses the 2D-3D Semantics dataset, sampling broadly from the space of camera intrinsics. Additional cross-dataset experiments train on KITTI, ScanNet, MegaDepth, and Sun3D, with evaluation on NYUv2. Data augmentation includes random scaling, cropping, and principal-point shifts, together with focal-length sampling over a range of values. The multi-term loss aggregates an inverse-depth L1 term, a scale-invariant gradient term, a confidence term, and a surface-normal term, with empirically set weights.
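The scale-invariant gradient term can be sketched as follows, assuming a DeMoN-style formulation with a single gradient spacing `h`; the paper's exact spacings and weights may differ.

```python
import numpy as np

def scale_invariant_grad_loss(pred, gt, h=1, eps=1e-8):
    """One spacing of a DeMoN-style scale-invariant gradient term (sketch).

    pred, gt: 2-D inverse-depth maps. Discrete gradients with spacing h are
    normalized by the local magnitudes, so a global rescaling of the map
    leaves the term (almost) unchanged. Spacing and weights are illustrative.
    """
    def g(d):
        gx = (d[:, h:] - d[:, :-h]) / (np.abs(d[:, h:]) + np.abs(d[:, :-h]) + eps)
        gy = (d[h:, :] - d[:-h, :]) / (np.abs(d[h:, :]) + np.abs(d[:-h, :]) + eps)
        return gx, gy

    pgx, pgy = g(pred)
    tgx, tgy = g(gt)
    return np.abs(pgx - tgx).mean() + np.abs(pgy - tgy).mean()

d = np.array([[1.0, 2.0], [2.0, 4.0]])
loss_same = scale_invariant_grad_loss(d, d)         # identical maps -> 0
loss_scaled = scale_invariant_grad_loss(2 * d, d)   # global rescale -> ~0
```

The normalization by local magnitude is what makes the term insensitive to a global depth scale, which matters when mixing datasets with different metric conventions.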
Key empirical outcomes:
- Training on a single camera and testing on another causes up to a 50% increase in abs.rel error for standard networks.
- CAM-Convs dramatically improve generalization across camera intrinsics. For example, on unseen settings (e.g., s₃/test; Table 4), abs.rel improves from 0.174 (standard) to 0.164 (with CAM-Conv), and in extreme settings (s₅/f=64), from 0.369 to 0.177.
- Cross-dataset: CAM-Convs outperform Laina et al.’s model trained directly on the target dataset, even without NYUv2 fine-tuning.
- Ablation replacing all CAM-Convs with naive focal normalization or removing them entirely leads to significant degradation or failure to converge when mixing cameras.
4. CAM-Based Convolutional Hardware Accelerators
A separate axis of CAM-Convs research involves embedding convolutional neural nets into digital or hybrid hardware whose primitive operation is the associative search/modify paradigm of CAMs (Lima et al., 2024, Nguyen et al., 2023). Here, multiply–accumulate (MAC) is replaced by in-place, bit-serial add/subtract sequences (for ternary-weight networks in APs with racetrack memory) or by geometric dot-products with Hamming distance computation in FeFET-CAM arrays.
Associative Processor (AP) with Racetrack Memory
Standard 2D convolution is recast by pruning all zero weights ($w = 0$) and splitting the sum of products into two multi-operand additions—one over the $w = +1$ terms and one, subtracted, over the $w = -1$ terms. Each addition and subtraction is realized as a sequence of masked CAM search and parallel write operations, exploiting the word-parallel, bit-serial nature of CAM-based addition. The compiler unrolls and factors the dataflow graph, with common-subexpression elimination reducing total operations by about 31%.
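The zero-pruned add/subtract decomposition can be modeled in a few lines of NumPy (arithmetic only; the bit-serial CAM search/write mechanics are not simulated here):

```python
import numpy as np

def ternary_dot(x, w):
    """Ternary-weight dot product as two multi-operand additions (sketch).

    Mirrors the AP mapping described above: zero weights are pruned, the
    w = +1 partial sums are added, and the w = -1 partial sums are
    subtracted. On the associative processor each multi-operand addition
    runs bit-serially over all CAM words in parallel; here we only model
    the arithmetic.
    """
    plus = x[w == 1].sum()     # multi-operand addition for w = +1
    minus = x[w == -1].sum()   # multi-operand addition for w = -1
    return plus - minus        # zero weights are never touched

x = np.array([3.0, -1.0, 2.0, 5.0])
w = np.array([1, 0, -1, 1])
assert ternary_dot(x, w) == np.dot(x, w)
```

Because no multiplications remain, the only primitives the hardware must support are masked search, parallel write, and carry propagation across bit positions.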
DeepCAM: Geometric Dot-Product via CAM Hamming Sensing
DeepCAM replaces the inner-product stage of each convolution with an approximate geometric variant. Each feature vector is projected into a $k$-bit sign hash, and the cosine of the angle between activation and weight vectors is estimated from the Hamming distance between their hashes, exploiting the Johnson–Lindenstrauss lemma. Using FeFET-CAM, a batch of Hamming distances can be computed in parallel across all rows of the array. The final dot product is reconstructed as

$$\mathbf{a} \cdot \mathbf{w} \;\approx\; \|\mathbf{a}\|\,\|\mathbf{w}\|\,\cos\!\left(\frac{\pi\, d_H\!\left(h(\mathbf{a}),\, h(\mathbf{w})\right)}{k}\right),$$

where $d_H(\cdot,\cdot)$ denotes Hamming distance and $h(\cdot)$ the $k$-bit sign hash.
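A software sketch of the hash-and-sense pipeline follows; the random-hyperplane construction and the hash length `k=4096` are illustrative choices, and the Hamming distances are computed in plain NumPy rather than sensed in a CAM array.

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_hash(v, planes):
    """k-bit sign hash: one bit per random hyperplane (JL-style projection)."""
    return (planes @ v > 0).astype(np.uint8)

def geometric_dot(a, w, k=4096):
    """Estimate a . w from the Hamming distance between k-bit sign hashes.

    The fraction of disagreeing bits estimates theta/pi, where theta is the
    angle between a and w; the dot product is then |a||w|cos(theta). DeepCAM
    senses the Hamming distances inside the FeFET-CAM array; this is a pure
    software model of the same estimator.
    """
    planes = rng.standard_normal((k, a.size))        # shared random hyperplanes
    ha, hw = sign_hash(a, planes), sign_hash(w, planes)
    hamming = np.count_nonzero(ha != hw)
    theta = np.pi * hamming / k                      # angle estimate
    return float(np.linalg.norm(a) * np.linalg.norm(w) * np.cos(theta))

a = rng.standard_normal(64)
w = rng.standard_normal(64)
approx, exact = geometric_dot(a, w), float(a @ w)    # approx tracks exact as k grows
```

The estimator's variance shrinks as $k$ grows, which is why the table below conditions accuracy retention on a sufficient hash length.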
5. System-Level Metrics, Comparative Results, and Limitations
Camera-Aware Multi-Scale Convs
CAM-Convs yield up to 50% lower abs.rel error and smaller prediction variance for depth estimation across arbitrary camera models than classical, calibration-ignorant CNNs. On cross-dataset transfer, they outperform models trained natively on the target data. Removing the extra coordinate channels, or using only focal normalization, sharply degrades generalization (Facil et al., 2019).
CAM-Only Hardware Accelerators
For ternary-weight ResNet-18 on ImageNet, RTM-based AP CAM-Convs achieve top-1 accuracy of 70.6% (on par with full-precision), 7.5× energy-efficiency over crossbar-CIM baselines, and approximately 3.9× speedup in latency. DeepCAM, across LeNet-5, VGG, and ResNet variants, achieves up to 3498× speedup relative to conventional CPUs, and 2.16×–109× lower dynamic energy versus established ASIC accelerators (Lima et al., 2024, Nguyen et al., 2023).
A representative table:
| Accelerator Type | Energy Efficiency Gain | Inference Speedup | Accuracy Retention |
|---|---|---|---|
| RTM-AP CAM-Convs (Lima et al., 2024) | 7.5× over crossbar | 3.9× | Yes (TWN+4b) |
| DeepCAM (Nguyen et al., 2023) | 2–109× over Eyeriss | up to 3498× over CPU | ≤1% loss (with sufficient hash length $k$) |
Principal limitations:
- Camera-Aware CAM-Convs require accurate knowledge of pixel size and camera intrinsics; focal-length normalization assumes uniform pixel geometry.
- CAM-only accelerators currently best support ternary-weight and low-bit activations; extending to higher-precision necessitates more costly lookup decomposition. Small layers may under-utilize the associative array, and fully-programmable operator support (beyond add/sub/compare) is not natively available.
6. Extensions, Future Prospects, and Research Directions
Proposed extensions to camera-aware convolutions include the possibility of inputting further calibration metadata such as radial distortion or even extrinsic pose, as well as incorporating sensor noise models. On the CAM-based hardware side, further research avenues involve supporting additional operator classes (e.g., pooling, batchnorm via bitwise microkernels), federating larger topologies with efficient interconnect, and adaptive hash length selection or piecewise polynomial approximations for geometric associative convolutions.
A plausible implication is that CAM-Convs, both as a methodological extension for camera-robust deep vision systems and as a primitive for in-memory digital inference, offer an effective path forward for scaling neural architectures—either by directly transferring calibration-aware visual understanding across domains or by substantially reducing the energy and data movement limits endemic to the von Neumann paradigm.