Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling (2402.10211v3)
Abstract: Reasoning from sequences of raw sensory data is a ubiquitous problem across fields ranging from medical devices to robotics. These problems often involve using long sequences of raw sensor data (e.g. magnetometers, piezoresistors) to predict sequences of desirable physical quantities (e.g. force, inertial measurements). While classical approaches are powerful for locally-linear prediction problems, they often fall short when using real-world sensors. These sensors are typically non-linear, are affected by extraneous variables (e.g. vibration), and exhibit data-dependent drift. For many problems, the prediction task is made harder by small labeled datasets, since obtaining ground-truth labels requires expensive equipment. In this work, we present Hierarchical State-Space Models (HiSS), a conceptually simple, new technique for continuous sequential prediction. HiSS stacks structured state-space models on top of each other to create a temporal hierarchy. Across six real-world sensor datasets, from tactile-based state prediction to accelerometer-based inertial measurement, HiSS outperforms state-of-the-art sequence models such as causal Transformers, LSTMs, S4, and Mamba by at least 23% on MSE. Our experiments further indicate that HiSS scales efficiently to smaller datasets and is compatible with existing data-filtering techniques. Code, datasets and videos can be found at https://hiss-csp.github.io.
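To make the idea of stacking state-space models into a temporal hierarchy concrete, the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the input is split into fixed-size chunks, that a low-level model summarizes each chunk by its final output, and that a high-level model runs over those summaries. The chunking scheme, the plain linear recurrence standing in for S4 or Mamba, the untrained random parameters, and the names `linear_ssm`, `random_ssm`, `hierarchical_ssm`, and `chunk_size` are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def linear_ssm(u, A, B, C):
    """Run a discrete linear state-space recurrence over one sequence.

    x[t] = A @ x[t-1] + B @ u[t],   y[t] = C @ x[t]
    u has shape (T, d_in); the result has shape (T, d_out).
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t
        ys.append(C @ x)
    return np.stack(ys)


def random_ssm(rng, d_in, d_state, d_out):
    """Hypothetical helper: random, untrained parameters with stable dynamics."""
    A = np.diag(rng.uniform(0.5, 0.95, size=d_state))  # |eigenvalues| < 1
    B = rng.normal(size=(d_state, d_in)) / np.sqrt(d_in)
    C = rng.normal(size=(d_out, d_state)) / np.sqrt(d_state)
    return A, B, C


def hierarchical_ssm(u, low, high, chunk_size):
    """Two-level hierarchy (assumed design): per-chunk SSM, then an SSM over chunks."""
    T = u.shape[0]
    assert T % chunk_size == 0, "sketch assumes T is a multiple of chunk_size"
    chunks = u.reshape(T // chunk_size, chunk_size, -1)
    # Low level: the state is reset for every chunk, and each chunk is
    # summarized by the SSM output at its final time step.
    feats = np.stack([linear_ssm(c, *low)[-1] for c in chunks])
    # High level: a second SSM runs causally over the chunk summaries,
    # producing one prediction per chunk.
    return linear_ssm(feats, *high)


rng = np.random.default_rng(0)
low = random_ssm(rng, d_in=3, d_state=8, d_out=4)
high = random_ssm(rng, d_in=4, d_state=8, d_out=2)

u = rng.normal(size=(100, 3))  # 100 time steps of a 3-channel sensor
y = hierarchical_ssm(u, low, high, chunk_size=10)
print(y.shape)  # (10, 2): one 2-D prediction per chunk of 10 samples
```

In this sketch the high-level model runs at one tenth of the sensor rate, so fast local variation is handled inside each chunk while slower trends are tracked across chunks. Because both levels are causal recurrences, a prediction depends only on the current and earlier chunks.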