
VerifAI: Verified Generative AI (2307.02796v2)

Published 6 Jul 2023 in cs.DB, cs.CL, and cs.LG

Abstract: Generative AI has made significant strides, yet concerns about the accuracy and reliability of its outputs continue to grow. Such inaccuracies can have serious consequences, such as poor decision-making, the spread of false information, privacy violations, legal liabilities, and more. Although efforts to address these risks are underway, including explainable AI and responsible AI practices such as transparency, privacy protection, bias mitigation, and social and environmental responsibility, misinformation caused by generative AI will remain a significant challenge. We propose that verifying the outputs of generative AI from a data management perspective is an emerging issue for generative AI. This involves analyzing the underlying data from multi-modal data lakes, including text files, tables, and knowledge graphs, and assessing its quality and consistency. By doing so, we can establish a stronger foundation for evaluating the outputs of generative AI models. Such an approach can ensure the correctness of generative AI, promote transparency, and enable decision-making with greater confidence. Our vision is to promote the development of verifiable generative AI and contribute to a more trustworthy and responsible use of AI.

Authors (6)
  1. Nan Tang (63 papers)
  2. Chenyu Yang (20 papers)
  3. Ju Fan (26 papers)
  4. Lei Cao (60 papers)
  5. Yuyu Luo (41 papers)
  6. Alon Halevy (29 papers)
Citations (9)