
Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design (2407.16831v1)

Published 23 Jul 2024 in cs.AI

Abstract: As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many LLM inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs), organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in complexity theory that we show empirically extends to language models (LMs). We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems. Through experiments on synthetic tasks such as prime factorization, and on core benchmarks such as MMLU, we demonstrate notable performance gains. For instance, in factoring products of two 3-digit primes, a simple NoN improves accuracy from 3.7% to 36.6%. On MMLU, a verifier-based judge construction with only 3 generators boosts accuracy over individual GPT-4-Turbo calls by 2.8%. Our analysis reveals that these gains are most pronounced in domains where verification is notably easier than generation, a characterization that we believe subsumes many reasoning and procedural-knowledge tasks but rarely holds in factual, declarative-knowledge settings. For the mathematical and formal-logic reasoning subjects of MMLU, we observe gains of 5-8% or more, while others, such as geography and religion, show no gain. We provide key takeaways for ML practitioners, including the importance of considering verification complexity, the impact of witness format on verifiability, and a simple test to determine the potential benefit of this NoN approach for a given problem distribution. This work aims to inform future research and practice in the design of compound AI systems.
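The generator-verifier asymmetry the abstract describes can be sketched concretely for the prime-factorization task: verifying a proposed factorization (multiply the candidates, primality-check them) is far cheaper than producing one. The sketch below is a minimal, hedged illustration of a verifier-based best-of-K loop; `call_generator` is a hypothetical stand-in for an LM inference call (here simulated with random guesses) and is not part of the paper.

```python
import random


def is_prime(n: int) -> bool:
    """Trial division; adequate for 3-digit primes."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True


def verify(n: int, candidate: tuple) -> bool:
    """Cheap verification: multiply the witness and primality-check both parts."""
    p, q = candidate
    return p * q == n and is_prime(p) and is_prime(q)


def call_generator(n: int) -> tuple:
    """Hypothetical stand-in for one LM generator call proposing a factor pair.
    Simulated here as an unreliable random guesser."""
    p = random.randrange(100, 1000)
    return p, n // p


def best_of_k(n: int, k: int = 10):
    """Verifier-based judge NoN: accept the first of K proposals that verifies,
    otherwise report failure."""
    for _ in range(k):
        candidate = call_generator(n)
        if verify(n, candidate):
            return candidate
    return None
```

Under this structure, if a single generator succeeds with probability p, accepting any verified proposal out of K independent tries succeeds with probability 1 - (1 - p)^K, which is the mechanism by which best-of-K lifts accuracy when verification is reliable and cheap.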
