Human-AI Safety: A Descendant of Generative AI and Control Systems Safety (2405.09794v2)
Abstract: AI is interacting with people at an unprecedented scale, offering new avenues for immense positive impact, but also raising widespread concerns around the potential for individual and societal harm. Today, the predominant paradigm for human-AI safety focuses on fine-tuning the generative model's outputs to better agree with human-provided examples or feedback. In reality, however, the consequences of an AI model's outputs cannot be determined in isolation: they are tightly entangled with the responses and behavior of human users over time. In this paper, we distill key complementary lessons from AI safety and control systems safety, highlighting open challenges as well as key synergies between both fields. We then argue that meaningful safety assurances for advanced AI technologies require reasoning about how the feedback loop formed by AI outputs and human behavior may drive the interaction towards different outcomes. To this end, we introduce a unifying formalism to capture dynamic, safety-critical human-AI interactions and propose a concrete technical roadmap towards next-generation human-centered AI safety.
Authors: Andrea Bajcsy, Jaime F. Fisac