CNN Feature Injection in Neural Architectures

Updated 2 June 2026

CNN feature injection is a set of techniques that explicitly transfers convolutional feature maps into different network components to leverage fine-grained spatial detail.
It includes horizontal and vertical fusion methods that combine CNN representations with transformers, decision forests, or other modules to enhance model performance.
Empirical studies show that strategic feature injection improves segmentation, defect detection, and recognition across various hybrid and multi-task applications.

CNN feature injection refers to a class of architectural techniques in which feature maps or representations extracted from convolutional neural networks (CNNs) are explicitly introduced (“injected”) at various points in a model’s computation, often to augment, fuse, or guide downstream modules such as transformers, decoder branches, boosted forests, or higher-level convolutional layers. Approaches under this umbrella leverage the detailed, spatially-structured representations from CNNs either to supplement more global modeling components (such as transformers), increase discriminative power in hybrid pipelines, or modulate the integration of features across depth, resolution, or modality.

1. Architectural Principles of CNN Feature Injection

CNN feature injection generally involves the explicit transfer or merging of feature maps generated by CNN layers into other components within a network hierarchy or into distinct architectures such as transformers or decision forests. Techniques vary in the selection of which feature maps to inject (e.g., low-level, high-level, multi-scale), the injection points within the target architecture, and the transformation applied to the injected features (e.g., linear projection, channel alignment, attention-based gating).

Two main paradigms have emerged:

Horizontal injection: Direct transfer of CNN features into separate modeling components that operate in parallel or sequence, such as feeding multi-stage CNN feature maps into a transformer encoder at matching depths (Jiang et al., 2023).
Vertical injection: Cross-level fusion within a CNN, where features from lower or intermediate layers are combined with high-level features through concatenation, attention, or selective gating (Du et al., 2018).

In both cases, the rationale is to leverage the fine-grained locality and inductive bias of CNNs alongside the global modeling capacity or decision logic of the recipient module.

2. CNN Feature Injection in Hybrid CNN-Transformer Architectures

Feature injection is a key design element in hybrid architectures that combine CNNs and transformers for dense vision tasks. For example, "CINFormer" (Jiang et al., 2023) employs multi-stage CNN feature injection to preserve high-frequency signals and to mitigate the loss of spatial detail that occurs when image features are processed solely by transformer layers. CINFormer uses a modified U-Net–like encoder–decoder in which:

The encoder consists of a four-stage Swin-Transformer, with each stage receiving feature maps from a fixed ResNet-18+FPN stem.
At stage 1, the shallowest CNN map is projected into tokens as transformer input.
At stages 2–4, corresponding CNN feature maps are first aligned via $1 \times 1$ convolution, concatenated with the previous token stream, linearly projected, and provided as transformer input.
The injection is one-way (CNN to transformer), so detailed cues are preserved rather than overwritten by the denoising effect of deeper transformer layers.
A specialized Top-K self-attention module further prioritizes tokens and channels most indicative of defects.
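The stage 2–4 injection step listed above (align with a $1 \times 1$ convolution, concatenate with the token stream, project linearly) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the CINFormer implementation: the helper names (`conv1x1`, `to_tokens`, `inject`), the toy shapes, and the random weights are all hypothetical, and the CNN map is assumed to have already been brought to the same spatial grid as the tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(feature_map, weight):
    """1x1 convolution, i.e. a per-pixel linear map over channels.
    feature_map: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feature_map)

def to_tokens(feature_map):
    """Flatten a (C, H, W) map into (H*W, C) tokens, one per location."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T

def inject(tokens, cnn_map, align_w, proj_w):
    """One-way injection at a single stage (hypothetical helper).
    tokens:  (N, D) token stream from the previous transformer stage
    cnn_map: (C, H, W) CNN feature map with H*W == N (assumed)
    align_w: (D, C) weights of the 1x1 alignment convolution
    proj_w:  (2*D, D) weights of the projection after concatenation"""
    aligned = to_tokens(conv1x1(cnn_map, align_w))      # (N, D)
    fused = np.concatenate([tokens, aligned], axis=1)   # (N, 2D)
    return fused @ proj_w                               # (N, D)

# Toy shapes: an 8x8 token grid, 16-dim tokens, a 32-channel CNN map.
grid = 8
D, C = 16, 32
tokens = rng.standard_normal((grid * grid, D))
cnn_map = rng.standard_normal((C, grid, grid))
align_w = rng.standard_normal((D, C)) * 0.1
proj_w = rng.standard_normal((2 * D, D)) * 0.1

stage_input = inject(tokens, cnn_map, align_w, proj_w)
print(stage_input.shape)  # (64, 16)
```

Because the CNN map only flows into the token stream and nothing is written back, the CNN features seen at a later stage are unaffected by earlier transformer processing, which is the one-way property described above.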

Empirical results reported for CINFormer indicate that sequential, multi-layer CNN injection (as opposed to injection only at the input, bidirectional fusion, or late fusion) yields superior segmentation accuracy for fine-detail tasks such as industrial defect detection.
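The Top-K self-attention idea mentioned for CINFormer can be illustrated with a generic token-level variant: each query keeps only its `top_k` highest-scoring keys before the softmax. The sketch below is an assumption-laden illustration, not the module from the paper. It omits learned query/key/value projections and the channel-selection part described above, and the function name and shapes are hypothetical.

```python
import numpy as np

def topk_self_attention(q, k, v, top_k):
    """Scaled dot-product attention in which every query attends only to
    its top_k highest-scoring keys; all other scores are masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                 # (N, N)
    # The k-th largest score in each row acts as that row's threshold.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)    # stability
    weights = np.exp(masked)                                # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 8))   # 10 tokens, 8 dims (toy values)
out, weights = topk_self_attention(tokens, tokens, tokens, top_k=3)
```

With continuous random inputs there are no tied scores, so each row of `weights` has exactly three non-zero entries that sum to one.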

3. Cross-Layer and Selective Feature Connection Mechanisms

CNN feature injection is also realized via cross-layer fusion within CNNs themselves. The Selective Feature Connection Mechanism (SFCM) (Du et al., 2018) exemplifies the use of a learned attention-based selector to regulate low-level feature injection into higher-level representations:

The mechanism takes as input a low-level and a high-level feature map of matching spatial size.
A feature selector, computed from the high-level feature map via $1 \times 1$ convolution and spatial softmax, produces a gate for each location.
This gate weights the low-level feature map before concatenation with the high-level map.
Fusing the two in this manner (optionally with an interposed residual block) allows spatially dense, context-aware injection of detail while minimizing background noise or semantic ambiguity.
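The selector-gate-concatenate sequence above can be sketched as follows. This is a minimal NumPy illustration, not the reference SFCM code: it assumes the $1 \times 1$ convolution produces a single-channel selector map (so its weights reduce to one vector, `selector_w`), it omits the optional residual block, and all names and shapes are hypothetical.

```python
import numpy as np

def spatial_softmax(logits):
    """Softmax over all H*W locations of an (H, W) map."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def selective_connection(low, high, selector_w):
    """low: (C_l, H, W) low-level map, high: (C_h, H, W) high-level map,
    selector_w: (C_h,) weights of an assumed single-output 1x1 conv."""
    logits = np.einsum("c,chw->hw", selector_w, high)   # 1x1 conv on high
    gate = spatial_softmax(logits)                      # (H, W), sums to 1
    gated_low = low * gate[None]                        # weight each location
    fused = np.concatenate([gated_low, high], axis=0)   # (C_l + C_h, H, W)
    return fused, gate

rng = np.random.default_rng(0)
low = rng.standard_normal((4, 5, 5))
high = rng.standard_normal((6, 5, 5))
selector_w = rng.standard_normal(6)
fused, gate = selective_connection(low, high, selector_w)
```

The gate depends only on the high-level map, so the semantic features decide where low-level detail is let through, while the high-level channels themselves pass into the fused tensor unchanged.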

This gating paradigm selectively leverages local detail at locations deemed significant by semantic features, thus improving classification and detection metrics without introducing substantial parameter or computational overhead.

4. Modality and Task-Specific Feature Injection Patterns

CNN feature injection extends to heterogeneous or multi-task setups, notably for tasks requiring spatially aligned semantic transfer. In S-DOD-CNN (Lee et al., 2019), object detection information is injected into an event recognition pipeline both indirectly (via backbone sharing) and directly (via spatially-preserved projection):

For direct injection, detection branch RoI features are aligned to the event recognition feature map grid using affine projection and weighted by detection confidence.
Multiple strategies exist: max-pooling or bilinear interpolation, class-agnostic or class-specific projection.
The resulting fused feature map is channel-concatenated with the event branch feature map at specified depths in the network (pre-conv6, pre-conv7, or deeper), preserving spatial correspondence.
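A rough sketch of the direct-injection path described above: RoI features are pasted onto the event branch's grid at their box locations, scaled by detection confidence, and the result is channel-concatenated with the event features. Everything here is an assumption for illustration. Boxes are taken to be already mapped to integer grid cells, nearest-neighbour resizing stands in for the max-pooling or bilinear options, overlapping RoIs are merged with an element-wise maximum, RoI features are assumed non-negative (post-ReLU), and the function names are hypothetical.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, h, w) map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]

def project_rois(roi_feats, boxes, scores, grid_hw):
    """roi_feats: (K, C, p, p) pooled RoI features (non-negative assumed)
    boxes: K tuples (y0, x0, y1, x1) in grid cells, end-exclusive
    scores: (K,) detection confidences; grid_hw: (H, W) of the event map."""
    c = roi_feats.shape[1]
    canvas = np.zeros((c, *grid_hw))
    for feat, (y0, x0, y1, x1), s in zip(roi_feats, boxes, scores):
        patch = s * resize_nearest(feat, y1 - y0, x1 - x0)
        region = canvas[:, y0:y1, x0:x1]        # view into canvas
        np.maximum(region, patch, out=region)   # keep the stronger response
    return canvas

event_map = np.zeros((8, 6, 6))                 # toy event-branch features
roi_feats = np.ones((2, 4, 3, 3))               # two RoIs, 4 channels each
boxes = [(0, 0, 3, 3), (2, 2, 6, 6)]
scores = np.array([0.9, 0.5])

canvas = project_rois(roi_feats, boxes, scores, event_map.shape[1:])
fused = np.concatenate([event_map, canvas], axis=0)   # (12, 6, 6)
```

Cells outside every box stay zero, so the injected channels carry detection evidence only where an object was found, at the same spatial position as in the event feature map.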

Empirical ablations show that such spatially precise feature injection, especially when performed after sufficient feature abstraction but before late-stage pooling, raises event-recognition AP by over 5% compared to indirect-only or earlier pointwise injections.

5. Feature Injection Beyond Deep Learning: Channel Augmentation and Decision Forests

A related feature injection strategy appears in classical pipelines where early- or mid-level CNN feature maps are injected into non-neural classifiers. The Convolutional Channel Features (CCF) method (Yang et al., 2015) summarizes this approach:

CNN feature maps (e.g., from VGG-16 conv3_3) are extracted and optionally processed via smoothing, per-channel gradients, or orientation binning to form “channels.”
These CNN-derived channel features are concatenated with hand-crafted features (HOG, LUV), forming an expanded feature vector.
This vector is used to train an ensemble of shallow decision trees via RealBoost or LogitBoost.
CCF combines strong representational capacity with a lower computational burden and compact models (one to two orders of magnitude smaller than full end-to-end CNNs), and it reported state-of-the-art results in pedestrian and face detection at the time of publication.
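The pipeline above (CNN maps turned into channels, concatenated with hand-crafted features, then fed to boosted trees) can be sketched end to end on toy data. Several substitutions are made and should be read as assumptions: discrete AdaBoost over depth-1 decision stumps is used in place of RealBoost or LogitBoost with shallow trees, gradient magnitude is the only channel transform, random arrays stand in for VGG-16 feature maps and for HOG/LUV features, and all function names are hypothetical.

```python
import numpy as np

def cnn_channels(feature_maps):
    """(C, H, W) CNN maps -> (2C, H, W): raw maps plus gradient magnitude."""
    gy, gx = np.gradient(feature_maps, axis=(1, 2))
    return np.concatenate([feature_maps, np.hypot(gx, gy)], axis=0)

def feature_vector(cnn_maps, handcrafted):
    """Concatenate CNN-derived channels with hand-crafted features."""
    return np.concatenate([cnn_channels(cnn_maps).ravel(), handcrafted.ravel()])

def stump_predict(X, j, thr, sign):
    """Depth-1 tree: threshold feature j, output labels in {-1, +1}."""
    return sign * np.where(X[:, j] > thr, 1, -1)

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def boost(X, y, rounds=5):
    """Discrete AdaBoost over decision stumps."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, sign))
        w = w / w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(0)

def sample(is_positive):
    # Toy data: positives get an offset in the stand-in CNN maps.
    cnn_maps = rng.random((2, 4, 4)) + (2.0 if is_positive else 0.0)
    handcrafted = rng.random(8)                  # stand-in for HOG/LUV
    return feature_vector(cnn_maps, handcrafted)

y = np.array([1] * 20 + [-1] * 20)
X = np.stack([sample(label == 1) for label in y])
ensemble = boost(X, y)
accuracy = (predict(ensemble, X) == y).mean()
```

The CNN here acts purely as a fixed feature extractor: no gradients flow through it, and only the small boosted ensemble is trained.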

This channel-injection (editor’s term) leverages the representation power of deep CNNs without incurring the training and storage demands of full backpropagation, and demonstrates the flexibility of feature injection even outside end-to-end optimization scenarios.

6. CNN Feature Injection in Hybrid Normalization and Attention Frameworks

Feature injection has also been realized within transformer architectures via integration of CNN-derived summary statistics. In the CNN Injected Transformer (CIT) (Xu et al., 2023), two forms of CNN modules are injected along with window-based self-attention:

Channel Attention Block (CAB): Each patch receives a global average-pooled channel descriptor, processed through a lightweight MLP and sigmoid gating, which reweights the local feature channels prior to concatenation with transformer tokens.
Half-Instance Normalization Block (HINB): Instance normalization is applied to half of the channel dimensions, preserving local contrast while maintaining some global structure.
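The two blocks above can be sketched in NumPy as follows. This is an illustrative approximation rather than the CIT code: the MLP is assumed to be two bias-free linear layers with a ReLU in between, the first half of the channels is the half that gets normalized, no learned affine parameters are applied, and the function names and shapes are hypothetical.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """x: (C, H, W) patch features; w1: (C, C_r), w2: (C_r, C) MLP weights."""
    desc = x.mean(axis=(1, 2))                     # global average pool -> (C,)
    hidden = np.maximum(desc @ w1, 0.0)            # ReLU (assumed activation)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid gate in (0, 1)
    return x * gate[:, None, None]                 # reweight each channel

def half_instance_norm(x, eps=1e-5):
    """Instance-normalize the first half of the channels of x (C, H, W)
    and pass the remaining channels through unchanged."""
    half = x.shape[0] // 2
    first, rest = x[:half], x[half:]
    mean = first.mean(axis=(1, 2), keepdims=True)
    var = first.var(axis=(1, 2), keepdims=True)
    normed = (first - mean) / np.sqrt(var + eps)
    return np.concatenate([normed, rest], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
w1 = rng.standard_normal((8, 2)) * 0.5   # channel reduction 8 -> 2 (toy)
w2 = rng.standard_normal((2, 8)) * 0.5
attended = channel_attention(x, w1, w2)
normed = half_instance_norm(x)
```

Leaving half of the channels unnormalized keeps the original per-channel statistics available to later layers, while the normalized half has its per-instance mean and scale removed.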

These injections operate in parallel with the local window-based attention. They harmonize global and local features, reducing boundary artifacts and improving spatial coherence in low-level image tasks such as exposure correction.

7. Location and Positional Feature Injection

Finally, feature injection encompasses the explicit augmentation of CNN input with global or geometric priors. "Location Augmentation for CNN" (Wang et al., 2018) proposes the injection of additional channels encoding per-pixel spatial position (row/column indices or distance from image center), concatenated to the RGB input prior to convolution:

These location channels are normalized to [0,1] for numerical stability and are processed identically to color channels throughout the network.
Empirical evaluation shows consistent segmentation performance gains, particularly for shallow architectures or small, spatially correlated object classes.
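Building the location channels is straightforward to sketch. The version below appends all three encodings mentioned above (row, column, and distance from the image center) at once, which is a choice made for illustration. The function name and the channels-last layout are assumptions.

```python
import numpy as np

def add_location_channels(image):
    """image: (H, W, 3) float array -> (H, W, 6) with three extra channels:
    row position, column position, and distance from the image center,
    each scaled to [0, 1]."""
    h, w = image.shape[:2]
    rows = np.linspace(0.0, 1.0, h)[:, None] * np.ones((1, w))
    cols = np.ones((h, 1)) * np.linspace(0.0, 1.0, w)[None, :]
    dist = np.sqrt((rows - 0.5) ** 2 + (cols - 0.5) ** 2)
    dist = dist / dist.max()                       # farthest corner -> 1
    extra = np.stack([rows, cols, dist], axis=-1)  # (H, W, 3)
    return np.concatenate([image, extra], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((32, 48, 3))          # toy RGB image with values in [0, 1]
augmented = add_location_channels(image)
print(augmented.shape)  # (32, 48, 6)
```

The first convolution of the network then needs six input channels instead of three. No other architectural change is required, since the extra channels are treated like color channels from that point on.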

This approach exploits a domain prior by deliberately breaking the inherent translation invariance of CNNs, providing explicit positional cues that are otherwise hard to infer deep within the network.

In summary, CNN feature injection encompasses a spectrum of strategies for integrating convolutional representations within or across diverse modeling elements. Whether through staged hybridization with transformers (Jiang et al., 2023; Xu et al., 2023), cross-depth fusion (Du et al., 2018), spatial projection (Lee et al., 2019), channel augmentation for decision forests (Yang et al., 2015), or explicit positional injection (Wang et al., 2018), these methods leverage the locality, detail, and inductive prior of CNNs to augment model expressivity and task performance, especially in dense prediction, detection, and multi-task contexts.