An Index Policy Based on Sarsa and Q-learning for Heterogeneous Smart Target Tracking (2402.12015v1)

Published 19 Feb 2024 in eess.SY, cs.LG, and cs.SY

Abstract: Non-myopic radar scheduling for tracking multiple smart targets in an active and passive radar network must balance short-term gains in tracking accuracy against the higher probability that an actively tracked target maneuvers in the future. Maximizing long-term tracking performance while scheduling the beam resources of the active and passive radars is therefore challenging. We model the problem as a Markov decision process consisting of parallel restless bandit processes. Each bandit process is associated with a smart target whose estimation state evolves under a different discrete dynamic model depending on the action taken, that is, whether or not the target is being tracked; the discrete state is defined by the dynamic mode. The joint problem suffers from the curse of dimensionality, and optimal solutions are in general intractable, so we resort to heuristics from the restless multi-armed bandit literature: efficient scheduling policies based on indices, real numbers that represent the marginal reward of taking each action. For the practically inevitable case of unknown transition matrices, we propose a new method that uses forward Sarsa and backward Q-learning to approximate the indices by adapting the state-action value functions (Q-functions), yielding a new policy, ISQ, that aims to maximize the long-term tracking reward. Numerical results demonstrate that the proposed ISQ policy outperforms conventional Q-learning-based methods and converges rapidly to the well-known Whittle index policy computed with known state transition models, which serves as the benchmark.
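
The core computational idea, learning a per-state index whose value makes the scheduler indifferent between tracking and not tracking a target, can be illustrated with a short sketch. The Python snippet below is a generic two-timescale scheme for learning Whittle-style indices with tabular Q-learning; it is not the paper's ISQ algorithm (which combines forward Sarsa with backward Q-learning), and the environment stub, state space, rewards, and step sizes are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch: learn a Whittle-style index per state for ONE bandit
# process (one target). A subsidy lam[s] is paid for the passive action;
# the index is the subsidy value at which the agent is indifferent
# between active and passive, i.e. Q(s, 1) == Q(s, 0).

rng = np.random.default_rng(0)
n_states = 4                    # e.g. discrete dynamic modes of one target
Q = np.zeros((n_states, 2))     # Q[s, a], a = 0 (passive) or 1 (active)
lam = np.zeros(n_states)        # per-state index estimate (passivity subsidy)
alpha, beta, gamma, eps = 0.1, 0.01, 0.95, 0.1   # beta << alpha: two timescales

def env_step(s, a):
    """Hypothetical single-target model: active tracking (a=1) earns a
    higher immediate reward; transitions here are uniform noise and would
    be replaced by the smart-target maneuvering dynamics."""
    r = 1.0 if a == 1 else 0.2
    return r, int(rng.integers(n_states))

s = 0
for _ in range(50_000):
    # Epsilon-greedy action selection over the current Q-values.
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
    r, s_next = env_step(s, a)
    if a == 0:
        r += lam[s]             # passive action collects the subsidy
    # Fast timescale: standard off-policy Q-learning backup.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # Slow timescale: drive the subsidy toward indifference, at which
    # point lam[s] approximates the Whittle index of state s.
    lam[s] += beta * (Q[s, 1] - Q[s, 0])
    s = s_next

print("approximate index per state:", np.round(lam, 3))
```

At each decision epoch, a scheduler built on such indices would run one learner per target and activate the radar beams on the targets with the largest current index values, which is how Whittle-type policies convert per-arm learning into a joint scheduling rule.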
