
GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities (2301.04408v1)

Published 11 Jan 2023 in cs.CL, cs.AI, and cs.CY

Abstract: The global economy is increasingly dependent on knowledge workers to meet the needs of public and private organizations. While there is no single definition of knowledge work, organizations and industry groups still attempt to measure individuals' capability to engage in it. The most comprehensive assessment of capability readiness for professional knowledge workers is the Uniform CPA Examination developed by the American Institute of Certified Public Accountants (AICPA). In this paper, we experimentally evaluate OpenAI's text-davinci-003 and prior versions of GPT on both a sample Regulation (REG) exam and an assessment of over 200 multiple-choice questions based on the AICPA Blueprints for legal, financial, accounting, technology, and ethical tasks. First, we find that text-davinci-003 achieves a correct rate of 14.4% on a sample REG exam section, significantly underperforming human capabilities on quantitative reasoning in zero-shot prompts. Second, text-davinci-003 appears to be approaching human-level performance on the Remembering & Understanding and Application skill levels in the Exam absent calculation. For best prompt and parameters, the model answers 57.6% of questions correctly, significantly better than the 25% guessing rate, and its top two answers are correct 82.1% of the time, indicating strong non-entailment. Finally, we find that recent generations of GPT-3 demonstrate material improvements on this assessment, rising from 30% for text-davinci-001 to 57% for text-davinci-003. These findings strongly suggest that LLMs have the potential to transform the quality and efficiency of future knowledge work.

Evaluating GPT-3.5's Performance on Knowledge Work Tasks through the AICPA Exam

The paper "GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities" presents an empirical analysis of OpenAI's GPT-3.5 (specifically text-davinci-003), evaluating its potential as a knowledge worker using the framework of the Uniform CPA Examination. The authors investigate whether the model can adequately perform tasks typically executed by knowledge workers, particularly in the multifaceted domains of accounting, finance, law, and technology.

The paper is grounded in the context of the increasing integration of AI into professional services, where knowledge work represents a significant segment of global employment and economic activity. A critical benchmark of readiness for this work is the Uniform CPA Examination, developed by the AICPA, which comprehensively evaluates candidates' preparedness in pertinent professional skills and knowledge areas.

Methodological Approach

The researchers utilized two distinct assessments to evaluate the zero-shot performance of GPT-3.5:

  1. Assessment 1: Analysis of the model's performance on a sample Regulation (REG) section of the CPA Exam. This section includes both qualitative and quantitative questions that gauge a candidate's comprehension and application of laws, rules, financial regulations, and related tasks.
  2. Assessment 2: This comprised a synthetic multiple-choice questionnaire developed in alignment with AICPA's Blueprints, focusing on the foundational Remembering, Understanding, and Application skills while omitting extensive quantitative reasoning.

The evaluations were conducted with OpenAI's text-davinci-003 model, part of the GPT-3.5 family, by submitting zero-shot prompts to the text completion API.
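The zero-shot setup can be sketched as follows; the prompt template, function names, and answer-parsing heuristic here are illustrative assumptions for demonstration, not the authors' exact implementation:

```python
# Illustrative zero-shot MCQ prompt construction and answer parsing, roughly
# mirroring the paper's setup. The template wording and helper names are
# assumptions, not the authors' exact prompts.

def build_zero_shot_prompt(question, choices):
    """Format one multiple-choice question as a zero-shot completion prompt."""
    lines = [
        "Answer the following question by giving the letter of the best choice.",
        "",
        f"Question: {question}",
    ]
    for letter in sorted(choices):
        lines.append(f"({letter}) {choices[letter]}")
    lines.append("Answer:")
    return "\n".join(lines)

def parse_answer(completion):
    """Return the first answer letter (A-D) found in a model completion."""
    for ch in completion.strip().upper():
        if ch in "ABCD":
            return ch
    return None

prompt = build_zero_shot_prompt(
    "Which body develops the Uniform CPA Examination?",
    {"A": "FASB", "B": "AICPA", "C": "SEC", "D": "PCAOB"},
)
# `prompt` would then be sent to the completion endpoint (e.g. at
# temperature 0) and the returned text scored with parse_answer.
```

Holding temperature at 0 makes completions effectively deterministic, which matters when scoring a fixed question bank.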

Key Findings

The results revealed a substantial gap in GPT-3.5's ability to perform quantitative reasoning tasks without explicit training or fine-tuning:

  • Quantitative Performance: On the sample REG exam involving arithmetic reasoning, text-davinci-003 achieved accuracy levels between 5.7% and 9.4% on questions requiring numerical responses. Even for tasks involving multiple choice (MCQ) without explicit calculation, the model's performance hovered marginally above random chance.
  • Qualitative Understanding: In contrast, the model demonstrated competence in performing tasks related to remembering, understanding, and applying concepts from knowledge work. For the synthetic assessment over foundational skills, text-davinci-003 achieved an accuracy of up to 57.6%, significantly above the baseline guessing rate.
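The top-1 (57.6%) and top-2 (82.1%) figures are rank-based accuracy metrics; the sketch below computes them on invented toy data, not the paper's actual responses:

```python
# Toy illustration of top-k accuracy, the metric behind the reported
# 57.6% (top-1) and 82.1% (top-2) figures. Data here is invented.

def top_k_accuracy(ranked_answers, correct, k):
    """Fraction of questions whose correct choice appears among the
    model's k most-preferred answers."""
    hits = sum(1 for ranks, truth in zip(ranked_answers, correct)
               if truth in ranks[:k])
    return hits / len(correct)

# Each inner list ranks the model's choices from most to least preferred.
ranked = [["B", "A", "C", "D"],
          ["C", "B", "A", "D"],
          ["A", "D", "B", "C"],
          ["D", "C", "B", "A"]]
truths = ["B", "B", "A", "A"]

top1 = top_k_accuracy(ranked, truths, 1)  # 2 of 4 correct -> 0.5
top2 = top_k_accuracy(ranked, truths, 2)  # 3 of 4 correct -> 0.75
```

With four answer choices, random guessing gives a 25% top-1 baseline, which is why the paper's 57.6% is significantly above chance.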

Notably, the analysis also underscored the model's evolution over successive generations of GPT-3, with text-davinci-003 exhibiting clear improvements over its predecessors, indicating a progressive enhancement in zero-shot capabilities across varied knowledge work tasks.

Implications and Future Research

The paper suggests that while current iterations of LLMs like GPT-3.5 may struggle with complex quantitative reasoning, their qualitative capabilities are advancing towards matching human-level performance in foundational knowledge work tasks. Such advancements hold potential implications for transforming the efficiency of knowledge-centric roles within professional services, enhancing productivity, and potentially redefining the landscape of such professions.

The paper opens avenues for further research into integrating models with specialized tools or enhanced prompts that can handle numerical computation effectively. Furthermore, investigating few-shot approaches or other techniques could complement the zero-shot paradigm, improving performance on tasks demanding intricate arithmetic.
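As a minimal illustration of the few-shot direction mentioned here, a prompt could prepend worked exemplars before the target question; the exemplar content and function name below are invented for demonstration:

```python
# Hypothetical few-shot prompt builder: worked (question, choices, answer)
# exemplars are placed before the target question, a standard few-shot setup.

def build_few_shot_prompt(exemplars, question, choices):
    """Build a completion prompt with answered exemplars before the target."""
    parts = []
    for ex_q, ex_choices, ex_answer in exemplars:
        parts.append(f"Question: {ex_q}")
        for letter in sorted(ex_choices):
            parts.append(f"({letter}) {ex_choices[letter]}")
        parts.append(f"Answer: ({ex_answer})")
        parts.append("")
    parts.append(f"Question: {question}")
    for letter in sorted(choices):
        parts.append(f"({letter}) {choices[letter]}")
    parts.append("Answer:")
    return "\n".join(parts)

demo = [("What is 2 + 2?", {"A": "3", "B": "4"}, "B")]
few_shot = build_few_shot_prompt(demo, "What is 3 + 3?",
                                 {"A": "6", "B": "5"})
```

The trailing "Answer:" cue leaves the model to complete only the final letter, keeping parsing identical to the zero-shot case.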

Conclusion

In summary, "GPT as Knowledge Worker" contributes significantly to the discourse on the application of AI in professional knowledge domains. It underscores both the promise and current limitations of cutting-edge LLMs in adapting to the complex, multidisciplinary tasks typical of CPA assessments. As AI continues to evolve, such studies will be pivotal in charting its integration into professional and knowledge-intensive workflows.

Authors (4)
  1. Jillian Bommarito (5 papers)
  2. Michael Bommarito (4 papers)
  3. Daniel Martin Katz (19 papers)
  4. Jessica Katz (1 paper)
Citations (49)