
OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs (2405.03901v1)

Published 6 May 2024 in cs.HC and cs.AI

Abstract: The progression to "Pervasive Augmented Reality" envisions easy access to multimodal information continuously. However, in many everyday scenarios, users are occupied physically, cognitively or socially. This may increase the friction to act upon the multimodal information that users encounter in the world. To reduce such friction, future interactive interfaces should intelligently provide quick access to digital actions based on users' context. To explore the range of possible digital actions, we conducted a diary study that required participants to capture and share the media that they intended to perform actions on (e.g., images or audio), along with their desired actions and other contextual information. Using this data, we generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs. We then designed OmniActions, a pipeline powered by LLMs that processes multimodal sensory inputs and predicts follow-up actions on the target information grounded in the derived design space. Using the empirical data collected in the diary study, we performed quantitative evaluations on three variations of LLM techniques (intent classification, in-context learning and finetuning) and identified the most effective technique for our task. Additionally, as an instantiation of the pipeline, we developed an interactive prototype and reported preliminary user feedback about how people perceive and react to the action predictions and their errors.
