Event Temporal Slicing Convolution
- Event Temporal Slicing Convolution (ETSC) is a technique that segments event-based data into temporal slices and applies convolutional operations within or across those slices to capture fine-grained temporal patterns.
- It integrates continuous-time and discrete-time convolutional operators, using fixed, adaptive, and continuous slicing methods to ensure efficient local and global temporal feature extraction.
- ETSC systems are computationally efficient and scalable, supporting applications such as action recognition, pose estimation, and open-vocabulary detection with low latency.
Event Temporal Slicing Convolution (ETSC) refers broadly to a class of operations in event-based data processing where the temporal dimension is explicitly discretized or segmented (“sliced”) into intervals, and local convolutional operations are applied within or across these slices. This approach is fundamentally motivated by the need to model fine-grained, data-driven temporal patterns in event streams—sequences of occurrences marked by precise timestamps—without losing the sparsity, temporal precision, or causality inherent in such data. ETSC and its variants have become central to a range of modern architectures for temporal point processes, neuromorphic perception, action recognition, pose estimation, and open-vocabulary event-based detection.
1. Temporal Slicing and Event Representation
ETSC architectures begin by transforming asynchronous event streams—sequences of tuples $(t_i, m_i)$ of timestamps and marks, or higher-order marks—into temporally segmented representations. The key strategies are as follows:
- Fixed, Uniform Binning: Segment the time axis into equal-width intervals (bins) and aggregate events within each bin, forming a sequence (*voxel grid* or tensor) suitable for convolutional operations; a minimal binning sketch follows this list. This is standard in spatiotemporal event filtering for action recognition (Ghosh et al., 2019) and spiking neural networks (SNNs) (Yu et al., 2022).
- Adaptive Slicing: Use a data-dependent module (e.g., an SNN with leaky-integrate-and-fire dynamics) to dynamically determine event boundaries, such that slices are created at information-rich moments (Zhang et al., 1 Oct 2025).
- Continuous Scanning: In continuous-time settings, for each target event time $t_i$, aggregate history over a look-back window (“horizon” slicing), enabling multi-scale contextual encoding (Zhou et al., 2023).
This step ensures temporal alignment and preserves causality, establishing the foundation for downstream convolutional filtering.
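As a concrete illustration of the fixed, uniform binning strategy above, the following minimal sketch (assuming events arrive as arrays of x, y, t, p values; all names are illustrative) accumulates an event stream into a voxel-grid tensor ready for convolution:

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, num_bins, height, width):
    """Aggregate an event stream (x, y, t, p) into a (num_bins, H, W) voxel grid.

    Each event's polarity (+1/-1) is added to the bin covering its timestamp,
    i.e. fixed, uniform temporal slicing followed by per-slice accumulation."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t = np.asarray(t, dtype=np.float64)
    # Normalize timestamps to [0, num_bins) and clip the last event into the final bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * num_bins
    bin_idx = np.clip(t_norm.astype(int), 0, num_bins - 1)
    np.add.at(voxel, (bin_idx, y, x), p)  # scatter-add polarities into their slices
    return voxel

# Example: 1,000 synthetic events on a 128x128 sensor, sliced into 8 bins.
rng = np.random.default_rng(0)
n = 1000
vox = events_to_voxel_grid(
    x=rng.integers(0, 128, n), y=rng.integers(0, 128, n),
    t=np.sort(rng.uniform(0.0, 1.0, n)), p=rng.choice([-1, 1], n),
    num_bins=8, height=128, width=128)
print(vox.shape)  # (8, 128, 128)
```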
2. Continuous-Time and Discrete-Time Convolutional Operators
ETSC modules use parameterized convolutional kernels to aggregate history within each temporal slice or horizon. The mathematical formalism varies by data modality and task:
- Continuous-Time Convolution: For a history of events $\{(t_j, e_j)\}_{t_j < t_i}$, form
$$h_c(t_i) = \sum_{t_j < t_i,\; t_i - t_j \le T_c} \kappa_c(t_i - t_j)\, e_j,$$
where $\kappa_c$ is a trainable, causal kernel (typically SIREN or MLP-parameterized), $e_j$ is the embedding of event $j$, and $T_c$ the channel-specific horizon (Zhou et al., 2023). COTIC employs linear causal kernels for efficient multi-layer convolution:
$$y(t_i) = \sum_{j:\, t_j \le t_i} W(t_i - t_j)\, x_j.$$
This enables direct modeling of irregular, non-uniform event sequences without resampling (Zhuzhel et al., 2023); a minimal numerical sketch of this causal convolution follows this list.
- Discrete-Time Temporal Filtering: In binned representations, apply 1D (depth-wise) convolution across the temporal axis within each slice, typically immediately prior to spatial processing:
$$\tilde{x}[t] = \sum_{k=0}^{K-1} w[k]\, x[t-k],$$
where $K$ is the filter length, $w$ the convolution weights, and $x$ the temporally sliced data (Yu et al., 2022, Zhou et al., 6 Dec 2025).
- Hybrid Spatiotemporal Filtering: For event-based action recognition, 3D kernels are applied directly to voxels (Ghosh et al., 2019).
In all cases, kernels are parameterized or learned to match the temporal structure of the data, and dilation, depth, and receptive field hyperparameters are tuned to balance context size with computational efficiency.
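A minimal numerical sketch of the continuous-time causal convolution above, assuming events arrive as (timestamp, embedding) pairs; the exponential kernel here merely stands in for a learned SIREN/MLP kernel, and all names are illustrative:

```python
import numpy as np

def continuous_causal_conv(times, embeds, kernel, horizon):
    """h(t_i) = sum over {t_j < t_i, t_i - t_j <= horizon} of kernel(t_i - t_j) * e_j.

    times:  (N,) increasing event timestamps
    embeds: (N, d) per-event embeddings
    kernel: callable mapping an array of time lags to per-lag weights
    """
    N, d = embeds.shape
    out = np.zeros((N, d))
    for i in range(N):
        # Causal, bounded look-back window ("horizon" slice) for event i.
        lo = np.searchsorted(times, times[i] - horizon, side="left")
        lags = times[i] - times[lo:i]          # strictly past events only
        if lags.size:
            w = kernel(lags)                   # (n_window,) kernel weights
            out[i] = w @ embeds[lo:i]          # weighted aggregation of history
    return out

# Example with an exponential-decay kernel standing in for a learned kernel.
times = np.sort(np.random.default_rng(1).uniform(0, 10, 50))
embeds = np.random.default_rng(2).normal(size=(50, 16))
h = continuous_causal_conv(times, embeds, kernel=lambda dt: np.exp(-dt), horizon=2.0)
print(h.shape)  # (50, 16)
```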
3. Integration with Downstream Modules and Global Context
Most ETSC architectures are nested within larger pipelines that combine local temporal context with global sequence understanding:
- Recurrent Fusion: Local multi-horizon convolutional encodings are concatenated, projected, and fused via residual connections before being fed to a GRU, producing a global hidden state that accumulates both local (convolutional) and global (recurrent) information (Zhou et al., 2023).
- Attention and Gating Mechanisms: Temporal filter outputs are modulated by attention-like gates, computed as temporal convolutions followed by non-linear bottlenecked MLPs and sigmoids, controlling information flow at each time channel (Yu et al., 2022); a minimal gating sketch follows this list.
- Token-Level Temporal Modeling: In point cloud architectures for pose estimation, ETSC is applied across slice tokens (after spatial aggregation), with parallel standard and dilated 1D convolutions and residual connections, followed by temporal global pooling and feature concatenation (Zhou et al., 6 Dec 2025).
- SNN-CNN Hybrids: Adaptive slicing (via SNN) is followed by standard 2D convolutional backbones (CNN+FPN), enabling adaptive temporal feature granularity for efficient and semantically rich open-vocabulary detection (Zhang et al., 1 Oct 2025).
The flexibility in the point at which ETSC is applied—pre-spatial, inter-token, or pre-recurrent—enables reuse across domains.
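The attention-like temporal gate described above can be sketched as follows. This is a hypothetical PyTorch module under assumed shapes (batch, channels, time slices), not the exact implementation of Yu et al. (2022):

```python
import torch
import torch.nn as nn

class TemporalGate(nn.Module):
    """Sketch of an attention-like temporal gate: a depthwise 1D convolution over
    the slice axis, a bottlenecked MLP, and a sigmoid that modulates the features."""

    def __init__(self, channels, kernel_size=3, reduction=4):
        super().__init__()
        self.temporal_conv = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels)   # depthwise over time
        self.bottleneck = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                  # x: (batch, channels, time_slices)
        g = self.temporal_conv(x)          # local temporal context per channel
        g = self.bottleneck(g.transpose(1, 2)).transpose(1, 2)
        return x * torch.sigmoid(g)        # gate information flow per time step

# Example: gate 8 temporal slices of 64-channel features.
gate = TemporalGate(channels=64)
out = gate(torch.randn(2, 64, 8))
print(out.shape)  # torch.Size([2, 64, 8])
```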
4. Kernel Learning and Temporal Pooling Strategies
The efficacy of ETSC derives from the inductive bias and adaptability of its filters:
- Trainable Continuous Kernels: SIREN-style MLPs parameterize continuous temporal kernels, allowing the convolution operator to capture oscillatory and fine-grained temporal dependencies at each scale (Zhou et al., 2023).
- Learned Discrete Filters: In event-based action recognition, 3D filter weights are derived via unsupervised slowness regularization, yielding filters that are robust to missing spikes and focus on salient, invariant motion patterns (Ghosh et al., 2019).
- Explicit Time-Decay: Time-Discounting Convolution (TDC) introduces convolutional kernels with eligibility-trace or patch-style parameterizations, imparting exponential memory decay to ensure natural forgetting, robustness to timestamp ambiguity, and time-shift invariance (Katsuki et al., 2018); a decay-kernel sketch follows this list.
- Dynamic Pooling: TDC further augments convolution with growing-window pooling over both raw inputs and intermediate activations, increasing resilience to temporal jitter and enhancing time-discounting effects as depth increases (Katsuki et al., 2018).
This kernel engineering directly influences the ability of ETSC modules to handle temporally sparse or ambiguous signals, as well as computational scalability.
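A sketch of an exponentially decaying, eligibility-trace-style causal filter in the spirit of TDC; the decay form and all names are illustrative rather than the exact parameterization of Katsuki et al. (2018):

```python
import numpy as np

def time_discounted_conv(x, weights, decay):
    """1D causal convolution whose taps are damped by an exponential trace,
    so older slices contribute exponentially less (natural forgetting).

    x:       (T, d) binned/sliced features
    weights: (K, d) learnable filter taps
    decay:   forgetting rate in (0, 1]; effective tap k is weights[k] * decay**k
    """
    T, d = x.shape
    K = weights.shape[0]
    damped = weights * (decay ** np.arange(K))[:, None]   # exponential memory decay
    out = np.zeros((T, d))
    for t in range(T):
        for k in range(min(K, t + 1)):
            out[t] += damped[k] * x[t - k]                 # causal: only past slices
    return out

x = np.random.default_rng(0).normal(size=(32, 8))
w = np.random.default_rng(1).normal(size=(5, 8))
y = time_discounted_conv(x, w, decay=0.7)
print(y.shape)  # (32, 8)
```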
5. Computational Complexity and Parallelization Properties
A principal motivation for ETSC is efficient, scalable handling of long and non-uniform event sequences:
- Linear Complexity in Sequence Length: For windowed or horizon-based ETSC (e.g., (Zhou et al., 2023)), per-event cost is $O(C \cdot n_w \cdot d)$, where $C$ is the number of horizon channels, $n_w$ is the number of events in the current window, and $d$ is the feature size. If window sizes are chosen judiciously ($n_w \ll N$), computation is nearly linear in the total sequence length $N$. Sliding-window indices and minibatch parallelization on GPU further reduce runtime (a minimal indexing sketch follows this list).
- Full Parallelism over Events/Slices: Continuous-time convolution architectures (COTIC) in (Zhuzhel et al., 2023) parallelize over all events; with batched slicing the convolution layer cost scales linearly in the number of events $N$, compared to the $O(N^2)$ cost of transformer-based self-attentive models.
- Minimal Overhead in SNN-CNN Hybrids: Slicing is triggered only at points of high membrane potential, limiting computations to informative intervals (Zhang et al., 1 Oct 2025); subsequent 2D convolutions are standard and well-optimized.
- Low Latency: Ablation studies in event-based pose estimation report a runtime of 1.89 ms per ETSC pass on standard hardware, demonstrating suitability for real-time applications (Zhou et al., 6 Dec 2025).
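A minimal sketch of the sliding-window index precomputation mentioned above, assuming sorted event timestamps and a set of horizon lengths (names are illustrative); the boundaries are computed once so that all horizon convolutions can be batched:

```python
import numpy as np

def window_start_indices(times, horizons):
    """For each event i and each horizon T_c, find the index of the earliest
    event inside the look-back window (times[i] - T_c, times[i]].

    Returns an array of shape (num_horizons, N); window c for event i covers
    indices [starts[c, i], i]. The search costs O(C * N log N) once, after which
    each per-event convolution touches only the n_w events inside its window."""
    times = np.asarray(times)
    starts = np.stack([
        np.searchsorted(times, times - T, side="right") for T in horizons])
    return starts

times = np.sort(np.random.default_rng(0).uniform(0, 100, 1000))
starts = window_start_indices(times, horizons=[1.0, 5.0, 20.0])
print(starts.shape)  # (3, 1000)
```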
6. Empirical Applications and Benchmark Performance
ETSC and its variants have reported strong performance across a variety of tasks:
- Temporal Point Process Modeling: Local horizon slicing plus continuous-time convolutional encoding with global RNN fusion yields improved predictive likelihood and accuracy over RNN/transformer baselines (Zhou et al., 2023).
- Event-based Action Recognition: Spatiotemporal filtering on fine voxelized slices, coupled with unsupervised filter learning, achieves state-of-the-art accuracy and drastically lower latency on DVS Gesture and new action datasets (e.g., 95.6% accuracy, ~56 ms latency for DVS Gesture) (Ghosh et al., 2019).
- Event-driven Pose Estimation: ETSC modules operating on sequential token slices yield consistent performance gains (e.g., 3% 2D/3D MPJPE reduction on DHP19), with minimal added runtime (Zhou et al., 6 Dec 2025).
- Adaptive Open-vocabulary Detection: Jointly trained SNN slicers and CNN backbones, optimized for detection and vision-language distillation, match or exceed fixed-slice baselines on object detection with maximal temporal feature retention (Zhang et al., 1 Oct 2025).
- Ambiguous-Timestamp Sequences: TDC’s combination of decay and dynamic pooling provides robustness to time-shift and missing annotations, outperforming TCN, VAR, and RNN baselines in variable-length scenarios (Katsuki et al., 2018).
A plausible implication is that ETSC is foundational for any event-driven modeling context requiring both fine-grained and robust temporal feature extraction.
7. Variants, Comparative Architecture, and Future Directions
A comparative summary of ETSC instantiations is provided in the following table:
| Variant / Paper | Slicing Type | Kernel Parametrization |
|---|---|---|
| (Zhou et al., 2023) | Multi-horizon (local) | SIREN-MLP continuous |
| (Zhuzhel et al., 2023) | Fixed, plus slice points | Linear, causal continuous |
| (Zhou et al., 6 Dec 2025) | Equal segment tokens | 1D Conv + dilated Conv |
| (Yu et al., 2022) | Frame/binning | Depthwise 1D Conv + gate |
| (Ghosh et al., 2019) | Fixed bins, segments | Unsupervised 3D conv filters |
| (Zhang et al., 1 Oct 2025) | Adaptive (SNN-cut) | Standard 2D CNN |
| (Katsuki et al., 2018) | Fixed bins | Exponential-decay conv + pooling |
Current research is exploring unified models that can learn both slicing granularity and convolutional kernel structure in a task-driven, end-to-end fashion, as well as more adaptive, context- and data-driven methods that blend continuous and discrete slicing.
Future work may investigate tighter integration between event-slicing modules and large pre-trained foundation models, broader integration within spiking or neuromorphic hardware, and theoretical characterizations of information retention and loss in various slicing/convolutional schemes.