Assess cost-effectiveness of dynamic Canon layer implementations
Determine whether dynamic, input-conditioned convolutional implementations of Canon layers (e.g., adaptive or gated convolutions whose weights are conditioned on hidden states) yield performance improvements that justify their increased computational overhead compared to the simple residual conv1d Canon layers (kernel size 4) evaluated in this paper across Transformers, linear-attention models, and state-space models; the sketch below contrasts the two designs.
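For concreteness, here is a minimal PyTorch sketch contrasting the two designs. The static layer follows the paper's description of a residual conv1d with kernel size 4 applied along the sequence; the depthwise grouping, the `DynamicCanon` class, its `to_kernel` projection, and the softmax-normalized per-token kernels are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StaticCanon(nn.Module):
    """Residual conv1d Canon layer with kernel size 4, in the spirit of the paper.

    The depthwise grouping (groups=dim) is an assumption for illustration."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        # Pad both sides, then trim the right side below to keep the conv causal.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)]
        return x + y.transpose(1, 2)  # residual connection


class DynamicCanon(nn.Module):
    """Hypothetical input-conditioned variant: each token predicts its own
    length-4 mixing kernel from its hidden state and applies it to the
    causal window of the last 4 positions."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Extra parameters and per-token FLOPs relative to the static layer.
        self.to_kernel = nn.Linear(dim, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        w = torch.softmax(self.to_kernel(x), dim=-1)      # (batch, seq, k) per-token kernels
        pad = F.pad(x, (0, 0, self.kernel_size - 1, 0))   # left-pad along the time axis
        windows = pad.unfold(1, self.kernel_size, 1)      # (batch, seq, dim, k) causal windows
        y = torch.einsum('btdk,btk->btd', windows, w)     # input-dependent mixing
        return x + y                                      # residual connection
```

The static layer reuses one fixed kernel for every token, whereas the dynamic variant must compute a kernel prediction and a windowed contraction per token; that per-token overhead is what the question asks to weigh against any quality gains.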
References
More complex variants—e.g., dynamic convolutions with input-dependent weighting—are possible but not studied here, as it remains unclear whether such additional cost is justified.
— Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
(2512.17351 - Allen-Zhu, 19 Dec 2025) in Section 4, Canon layers: Implementation variants