Efficient Deweather Mixture-of-Experts with Uncertainty-aware Feature-wise Linear Modulation (2312.16610v1)

Published 27 Dec 2023 in cs.CV and cs.LG

Abstract: The Mixture-of-Experts (MoE) approach has demonstrated outstanding scalability in multi-task learning, including low-level upstream tasks such as the concurrent removal of multiple adverse weather effects. However, the conventional MoE architecture with parallel Feed-Forward Network (FFN) experts incurs significant parameter and computational overheads that hinder efficient deployment. In addition, the naive MoE linear router is suboptimal at assigning task-specific features to multiple experts, which limits further scalability. In this work, we propose an efficient MoE architecture with weight sharing across the experts. Inspired by the idea of linear feature modulation (FM), our architecture implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block. The proposed Feature Modulated Expert (FME) serves as a building block for the novel Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up the number of experts with low overhead. We further propose an Uncertainty-aware Router (UaR) to assign task-specific features to different FM modules with well-calibrated weights, enabling MoFME to effectively learn diverse expert functions for multiple tasks. Experiments on the multi-deweather task show that MoFME outperforms the baselines in image restoration quality by 0.1-0.2 dB and achieves performance comparable to the state of the art while saving more than 72% of parameters and 39% of inference time over the conventional MoE counterpart. Experiments on downstream segmentation and classification tasks further demonstrate the generalizability of MoFME to real open-world applications.
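The core mechanism described in the abstract lends itself to a short illustration. Below is a minimal PyTorch sketch of one MoFME-style layer, under stated assumptions: the names (`MoFMELayer`, `n_experts`, `mc_samples`) and the interface are hypothetical, each "expert" is realized as a FiLM-style scale-and-shift pair applied to the hidden activations of a single shared FFN, and routing weights are averaged over stochastic forward passes (MC dropout) as a stand-in for the paper's Uncertainty-aware Router, whose exact mechanism the abstract does not specify.

```python
# Minimal sketch of a Mixture-of-Feature-Modulation-Experts (MoFME) layer.
# Names and hyperparameters are illustrative, not the paper's published API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoFMELayer(nn.Module):
    def __init__(self, dim: int, hidden: int, n_experts: int, mc_samples: int = 4):
        super().__init__()
        # A single shared FFN "expert block": its weights are shared by all experts.
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        # Each expert exists only as a FiLM-style (scale, shift) pair on the
        # hidden activations, so each extra expert adds just 2 * hidden params.
        self.gamma = nn.Parameter(torch.ones(n_experts, hidden))
        self.beta = nn.Parameter(torch.zeros(n_experts, hidden))
        # Router with dropout; averaging softmax outputs over several stochastic
        # passes (MC dropout) is one simple way to get better-calibrated weights.
        self.router = nn.Sequential(nn.Dropout(p=0.1), nn.Linear(dim, n_experts))
        self.mc_samples = mc_samples

    def routing_weights(self, x: torch.Tensor) -> torch.Tensor:
        # Note: dropout must remain active (train mode) for the MC samples to
        # differ; in eval mode this collapses to a single deterministic pass.
        logits = torch.stack([self.router(x) for _ in range(self.mc_samples)])
        return logits.softmax(dim=-1).mean(dim=0)        # (B, T, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        w = self.routing_weights(x)                      # (B, T, E)
        h = F.gelu(self.fc1(x))                          # shared hidden activations
        h_mod = h.unsqueeze(2) * self.gamma + self.beta  # (B, T, E, hidden)
        h_mix = (w.unsqueeze(-1) * h_mod).sum(dim=2)     # mixture over experts
        return x + self.fc2(h_mix)                       # residual connection
```

The structural point is visible in the parameter count: the shared fc1/fc2 weights dominate, while each additional expert costs only one (gamma, beta) pair, which is consistent with the abstract's claim of scaling the number of experts at low overhead relative to a parallel-FFN MoE.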

Authors (11)
  1. Rongyu Zhang (25 papers)
  2. Yulin Luo (14 papers)
  3. Jiaming Liu (156 papers)
  4. Huanrui Yang (37 papers)
  5. Zhen Dong (87 papers)
  6. Denis Gudovskiy (15 papers)
  7. Tomoyuki Okuno (16 papers)
  8. Yohei Nakata (11 papers)
  9. Kurt Keutzer (200 papers)
  10. Yuan Du (20 papers)
  11. Shanghang Zhang (173 papers)
Citations (3)