Assess cost-effectiveness of dynamic Canon layer implementations

Determine whether dynamic, input-conditioned convolutional implementations of Canon layers (e.g., adaptive or gated convolutions whose weights are conditioned on the hidden states) yield performance improvements that justify their added computational overhead, compared to the simple residual causal 1D convolution Canon layers (kernel size 4) evaluated in this paper across Transformers, linear-attention models, and state-space models.

Background

Canon layers are introduced as lightweight horizontal information-flow components, implemented here via simple residual 1D causal convolutions with kernel size 4. The authors note that more complex, dynamic variants (e.g., convolutions with input-dependent weighting) are conceivable but are not studied in this work.
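
For concreteness, a minimal sketch of the static design is given below; the residual connection, causality, and kernel size 4 come from the paper, while the depthwise grouping, class name, and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Sketch of a static Canon layer: a residual, causal 1D convolution
    with kernel size 4. Depthwise grouping is an assumption; the paper
    specifies only a simple residual causal conv1d of kernel size 4."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                    # (batch, dim, seq_len)
        h = F.pad(h, (self.kernel_size - 1, 0))  # left-pad so the conv is causal
        h = self.conv(h).transpose(1, 2)         # back to (batch, seq_len, dim)
        return x + h                             # residual connection
```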

The paper explicitly states that it is unclear whether the additional computational cost of dynamic implementations is justified, leaving this as an open question to be resolved through systematic evaluation under the paper's synthetic playground framework.
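
To make the cost question concrete, the sketch below shows one hypothetical dynamic variant in the spirit of input-dependent (dynamic) convolutions: per-position kernel weights are predicted from the hidden state itself. The class name, the softmax gating, and the weight-projection design are all assumptions for illustration; the paper does not commit to any particular dynamic implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCanonLayer(nn.Module):
    """Hypothetical input-conditioned Canon layer: each position mixes its
    last k hidden states with weights predicted from its own hidden state.
    NOT the paper's method; one plausible 'dynamic' instantiation."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # The extra projection (dim -> kernel_size per position) and the
        # explicit k-way weighted sum below are the added overhead.
        self.to_weights = nn.Linear(dim, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, t, d = x.shape
        k = self.kernel_size
        w = torch.softmax(self.to_weights(x), dim=-1)  # (b, t, k) gate weights
        pad = F.pad(x, (0, 0, k - 1, 0))               # left-pad the sequence
        # shifts[:, i, j] = x at position i - (k - 1 - j): the k causal taps.
        shifts = torch.stack([pad[:, j : j + t] for j in range(k)], dim=2)
        h = (w.unsqueeze(-1) * shifts).sum(dim=2)      # input-weighted mix
        return x + h
```

Relative to the static layer, the added work is the per-position weight prediction and the explicit k-way weighted sum; whether such overhead is repaid by downstream gains is precisely what a systematic comparison would need to measure.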

References

"More complex variants—e.g., dynamic convolutions with input-dependent weighting—are possible but not studied here, as it remains unclear whether such additional cost is justified."

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers (Allen-Zhu, arXiv:2512.17351, 19 Dec 2025), Section 4, Canon layers: implementation variants.