
Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling (2402.10211v3)

Published 15 Feb 2024 in cs.LG, cs.RO, and eess.SP

Abstract: Reasoning from sequences of raw sensory data is a ubiquitous problem across fields ranging from medical devices to robotics. These problems often involve using long sequences of raw sensor data (e.g. magnetometers, piezoresistors) to predict sequences of desirable physical quantities (e.g. force, inertial measurements). While classical approaches are powerful for locally-linear prediction problems, they often fall short when using real-world sensors. These sensors are typically non-linear, are affected by extraneous variables (e.g. vibration), and exhibit data-dependent drift. For many problems, the prediction task is exacerbated by small labeled datasets since obtaining ground-truth labels requires expensive equipment. In this work, we present Hierarchical State-Space Models (HiSS), a conceptually simple, new technique for continuous sequential prediction. HiSS stacks structured state-space models on top of each other to create a temporal hierarchy. Across six real-world sensor datasets, from tactile-based state prediction to accelerometer-based inertial measurement, HiSS outperforms state-of-the-art sequence models such as causal Transformers, LSTMs, S4, and Mamba by at least 23% on MSE. Our experiments further indicate that HiSS demonstrates efficient scaling to smaller datasets and is compatible with existing data-filtering techniques. Code, datasets and videos can be found on https://hiss-csp.github.io.


Summary

  • The paper introduces a hierarchical approach that reduces MSE by at least 23% compared to causal Transformers and other models on sensor datasets.
  • The model employs a dual temporal resolution architecture to effectively capture complex, non-linear dynamics in continuous sequence data.
  • The paper also establishes CSP-Bench, a benchmark of six real-world labeled sensor datasets that gives continuous sequence prediction a standardized evaluation platform.

Improving Continuous Sequence Prediction with Hierarchical State Space Models

Introduction to Continuous Sequence Prediction with HiSS

Hierarchical State Space Models (HiSS) are a recent technique for continuous sequence-to-sequence prediction on sequential sensor data. The approach exploits the temporal structure inherent in sensor data to improve prediction accuracy. Across six real-world sensor datasets, HiSS outperforms state-of-the-art sequence models such as causal Transformers, LSTMs, S4, and Mamba by at least 23% on Mean Squared Error (MSE).
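To make the "at least 23% on MSE" figure concrete, the sketch below shows one common way such a number is computed: the relative reduction in MSE against a baseline. The function names and the sample values are illustrative assumptions, not results or code from the paper.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equally shaped sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def relative_mse_reduction(baseline_mse, model_mse):
    """Fraction by which model_mse is lower than baseline_mse.

    Assumed convention: (baseline - model) / baseline, so 0.23 means
    the model's MSE is 23% lower than the baseline's.
    """
    return (baseline_mse - model_mse) / baseline_mse

# Made-up numbers purely for illustration (not taken from the paper).
baseline = 1.00
model = 0.77
print(f"relative MSE reduction: {relative_mse_reduction(baseline, model):.0%}")
```

Under this convention, a model is "23% better on MSE" when its error is 0.77 times that of the baseline.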

The Challenge of Continuous Sequence Prediction

Continuous sequence-to-sequence prediction maps long sequences of raw sensor data (e.g. magnetometer or piezoresistor readings) to sequences of desired physical quantities (e.g. force or inertial measurements). Classical approaches often fall short on real-world sensors, which are typically non-linear, affected by extraneous variables such as vibration, and subject to data-dependent drift. Labeled datasets for these tasks also tend to be small, because obtaining ground-truth labels requires expensive equipment. Together, these constraints call for a model that is both robust and data-efficient.

CSP-Bench: A Benchmark for Continuous Sequence Prediction

A key contribution of this research is CSP-Bench, a benchmark curated specifically for continuous sequence prediction (CSP) tasks. It comprises six diverse, labeled, real-world datasets and provides a standardized evaluation platform. On this benchmark, State Space Models (SSMs) outperform conventional models such as LSTMs and Transformers.

Delving into Hierarchical State-Space Models

The core of HiSS is its hierarchical architecture, which stacks structured state-space models to form a temporal hierarchy operating at two temporal resolutions. The input sequence is divided into chunks. A lower-level SSM processes each chunk, and a higher-level SSM then combines the chunk-level outputs for global sequence prediction. This design exploits the temporal redundancy in sensor data by condensing each chunk before the sequence is modeled as a whole. The division also matches natural physical processes, which often exhibit behavior at different frequency scales, and this helps HiSS recover the underlying physical quantities from noisy, high-dimensional sensor data.
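The two-level structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain linear recurrence (`linear_ssm`, a hypothetical helper) stands in for S4 or Mamba, the last low-level output of each chunk is assumed to serve as that chunk's feature, and all dimensions and parameter values are arbitrary.

```python
import numpy as np

def linear_ssm(x, A, B, C):
    """Discrete linear state-space recurrence (a stand-in for S4/Mamba).

    h_t = A h_{t-1} + B x_t,  y_t = C h_t
    x has shape (T, d_in); the result has shape (T, d_out).
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)

def hiss_sketch(x, chunk_size, low, high):
    """Two-level temporal hierarchy over a sensor sequence x of shape (T, d_in).

    low and high are (A, B, C) parameter tuples for the two levels.
    Returns one prediction per chunk, shape (T // chunk_size, d_out).
    """
    n_chunks = x.shape[0] // chunk_size
    x = x[: n_chunks * chunk_size]            # drop any incomplete trailing chunk
    chunks = x.reshape(n_chunks, chunk_size, -1)
    # Low level: run the SSM on each chunk independently and keep its final
    # output as the chunk feature (an assumption made for this sketch).
    feats = np.stack([linear_ssm(c, *low)[-1] for c in chunks])
    # High level: run a second SSM over the sequence of chunk features.
    return linear_ssm(feats, *high)

# Illustrative usage with random parameters (all sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                 # 100 time steps, 3 sensor channels
low = (0.9 * np.eye(4), rng.normal(size=(4, 3)), rng.normal(size=(5, 4)))
high = (0.9 * np.eye(6), rng.normal(size=(6, 5)), rng.normal(size=(2, 6)))
y = hiss_sketch(x, chunk_size=10, low=low, high=high)
print(y.shape)  # (10, 2): one 2-dimensional prediction per chunk
```

In this sketch the high-level model sees a sequence that is `chunk_size` times shorter than the raw input, which is how the hierarchy condenses redundant high-frequency sensor readings before modeling the sequence globally.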

Empirical Validation and Insights

Experiments across six sensor datasets, ranging from tactile-based state prediction to accelerometer-based inertial measurement, show that HiSS improves prediction accuracy over the baseline sequence models. The experiments further indicate that HiSS scales efficiently to smaller datasets and is compatible with existing data-filtering techniques.

Theoretical and Practical Implications

HiSS highlights the value of hierarchical temporal processing for continuous sequence prediction. Theoretically, it suggests that architectures tailored to the structure of sequential sensor data can manage its complexities better than general-purpose sequence models. Practically, its effectiveness in the presence of data-dependent drift, noise, and other sensor-specific challenges is promising for domains that rely on analysis of sensor data, such as robotics, medical diagnostics, and environmental monitoring.

Future Directions and Conclusion

HiSS also leaves several questions open. Determining optimal chunk sizes and extending the model to a broader range of sensors are potential directions for future research. Together with CSP-Bench, HiSS provides a reference point for future work on predicting physical quantities from raw sensor data.