PowerGPT: AI-Powered Clinical Trial Design
- PowerGPT is an AI-powered system that automates statistical power analysis and sample size estimation in clinical trial design by integrating large language models with statistical engines.
 - It streamlines test selection, parameter extraction, and computation processes, achieving near-complete task execution and significantly improved accuracy compared to traditional methods.
 - Validated in randomized controlled trials, PowerGPT reduces analysis time and bridges expertise gaps, making advanced statistical tasks accessible to both experts and non-specialists.
 
PowerGPT is an AI-powered system designed to automate statistical power analysis and sample size calculations for clinical research trial design, integrating LLMs with statistical engines. Developed to address barriers posed by the complexity of statistical test selection and sample size estimation—tasks that have historically required high statistical expertise and specialized software—PowerGPT streamlines these processes through natural-language interfaces. The system has been evaluated via randomized clinical trials, demonstrating substantial improvements in task completion, accuracy, and efficiency among both statistically trained and non-specialist users. PowerGPT is already deployed across several research institutions, providing scalable, accessible, and accurate decision support for clinical investigators (Lu et al., 15 Sep 2025).
1. Motivation and Problem Definition
Sample size calculations and statistical power analysis are central to designing clinical trials with valid inference and ethical resource management. Traditionally, these calculations require users to manually select appropriate statistical tests and apply formulas or software tools that presuppose subject-matter expertise. These challenges lead to bottlenecks: limited access to statisticians, increased risk of under- or over-powered studies, and suboptimal protocol design. PowerGPT addresses these pain points by integrating AI-driven automation and expert statistical computation in a single, easy-to-use platform.
2. System Architecture and Workflow
PowerGPT is structured as an agent-based framework, wherein an LLM interprets user queries and orchestrates the necessary computational steps through API gateways. The system interacts with both internal and external resources, including:
- Statistical Engines: R and Python libraries (e.g., `power.t.test()` in R).
- External Data Sources: for effect size estimates or study context.
- API Integration Layer: for formatting, input validation, and result conversion (see the sketch after this list).
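To make this concrete, here is a minimal sketch of the kind of call the statistical-engine layer wraps; statsmodels and the specific inputs are illustrative assumptions (the R analogue would be `power.t.test()`).

```python
# Minimal sketch of the kind of computation the statistical-engine layer wraps.
# statsmodels is an assumed stand-in for the "Python libraries" mentioned above.
from statsmodels.stats.power import TTestIndPower

# Cohen's d = (minimum detectable difference) / (standard deviation);
# these specific inputs are illustrative, not taken from the paper.
effect_size = 1.0 / 2.0          # e.g., 1-day difference, SD of 2 days

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,                  # two-sided significance level
    power=0.80,                  # desired power
    alternative="two-sided",
)
print(f"Estimated sample size per group: {n_per_group:.1f}")
```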
 
The workflow proceeds as follows:
- User Input: A natural language query describing the experimental scenario and desired statistical test, e.g., “How many samples are needed to detect a 1-day difference in recovery time?”
 - Test Selection: The LLM parses the scenario details and selects the appropriate hypothesis test (e.g., one-sample t-test, ANOVA, Cox proportional hazards).
 - Parameter Extraction: Key parameters (effect size, standard deviation, desired significance level, power) are extracted or inferred.
- Sample Size Calculation: The statistical engine executes the computation using validated formulas. For a two-sample t-test, the standard formula is $n = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2}$, where $n$ is the sample size per group, $\sigma$ the standard deviation, $\Delta$ the minimum detectable difference, $z_{1-\alpha/2}$ the critical value for the significance level $\alpha$, and $z_{1-\beta}$ the standard normal quantile corresponding to power $1-\beta$ (a worked sketch follows this list).
- Results Delivery: Outputs are returned in a clear, actionable format and augmented with stepwise rationale.
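The sample-size step can be illustrated with a short worked sketch of the two-sample formula above, assuming the parameters have already been extracted (a 1-day difference, an assumed SD of 2 days, α = 0.05, power = 0.80); the helper function and SciPy usage are illustrative, not the system's actual implementation.

```python
# Worked sketch of the sample-size step, assuming parameters already extracted
# from the user's query (the numbers below are illustrative, not from the paper).
import math
from scipy.stats import norm

def two_sample_n_per_group(delta: float, sd: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per group for a two-sample t-test:
    n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = norm.ppf(power)            # quantile corresponding to power 1 - beta
    n = 2 * sd**2 * (z_alpha + z_beta) ** 2 / delta**2
    return math.ceil(n)

# e.g., detect a 1-day difference in recovery time with an assumed SD of 2 days
print(two_sample_n_per_group(delta=1.0, sd=2.0))   # ~63 per group
```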
 
3. Validation: Randomized Controlled Trial Outcomes
PowerGPT’s efficacy was assessed in a stratified, randomized trial involving 36 participants (statisticians and non-statisticians) at UPenn and UTHealth. Participants completed multiple scenarios covering major statistical tests (t-tests, ANOVA, z-test, Cox model, log-rank). The outcomes demonstrated:
| Metric | PowerGPT Group | Reference Group | Statistical Significance | 
|---|---|---|---|
| Test Selection Completion | 99.3% (CI 95.4–100.0%) | 88.9% (CI 82.3–93.3%) | p < 0.001 |
| Test Selection Accuracy | 95.6% (CI 90.2–98.2%) | 83.6% (CI 75.8–89.3%) | p < 0.001 |
| Sample Size Accuracy | 94.1% | 55.4% | p < 0.001 |
| Time per Question | 4.0 min (CI 3.3–4.7 min) | 9.3 min | p < 0.001 |
This demonstrates near-complete task execution and substantially improved accuracy, coupled with a significant reduction in completion time for both test selection and sample size estimation. These gains extended across all major categories of hypothesis tests evaluated.
4. Bridging Expertise Gaps
A critical outcome was the narrowing of expertise disparities. Non-statisticians using reference methods underperformed relative to statisticians in both completion and accuracy. When equipped with PowerGPT, non-statisticians' performance matched or closely tracked that of their expert counterparts. This suggests that the system effectively democratizes access to complex statistical analysis, allowing more researchers to undertake robust trial design with minimal prerequisite knowledge.
5. Deployment, Scalability, and Technical Design
PowerGPT is implemented on a cloud-based, containerized architecture (e.g., Google Cloud Run), scaling dynamically from zero to over 200 concurrent instances as needed. The platform incorporates:
- Robust API endpoints for managing requests and routing to computation engines.
 - Python-R integration layer for seamless translation, execution, and result aggregation.
 - Startup probes for fault tolerance, which monitor instance health and launch fresh processes as needed.
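A minimal sketch of how the Python-R integration layer and probe endpoint listed above might fit together is shown below; Flask, rpy2, the route names, and the request fields are assumptions for illustration rather than the deployed implementation.

```python
# Minimal sketch of a Python-R bridge with a health endpoint for startup probes.
# Flask and rpy2 are assumed stand-ins; the paper does not name specific libraries.
from flask import Flask, jsonify, request
import rpy2.robjects as ro

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Startup-probe target: the platform restarts instances that fail this check.
    return jsonify(status="ok")

@app.route("/power/t-test", methods=["POST"])
def two_sample_t_test():
    # Translate a validated JSON request into an R call and return the result.
    params = request.get_json()
    r_call = (
        "power.t.test(delta={delta}, sd={sd}, sig.level={alpha}, power={power})"
        .format(**params)
    )
    result = ro.r(r_call)                    # execute in the embedded R engine
    n_per_group = float(result.rx2("n")[0])  # extract 'n' from the power.htest object
    return jsonify(n_per_group=n_per_group)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```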
 
This design enables high-throughput, concurrent analysis in production research settings. PowerGPT is currently deployed across multiple institutions, including UPenn, UTHealth, and Yale, and is being piloted as part of Clinical and Translational Science Award (CTSA) programs.
6. Algorithmic Methods and Statistical Formulations
PowerGPT automates the test selection and computation pipeline using established statistical algorithms. Key sample-size formulas, such as the two-sample t-test formula shown above, are executed within statistical libraries (e.g., R), with input parameters parsed and validated by the system. The API interconnect converts between user-provided data, statistical function calls, and formatted outputs via JSON schemas; a minimal request-validation sketch follows.
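As an illustration of the JSON-mediated interconnect, the sketch below validates a hypothetical request payload before it reaches the statistical engine; the field names and ranges are assumptions, since the paper only states that JSON schemas mediate the conversion.

```python
# Sketch of a request payload and validation step for the API interconnect.
# Field names and ranges are illustrative, not the system's actual schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class PowerRequest:
    test: str          # e.g., "two_sample_t_test"
    delta: float       # minimum detectable difference
    sd: float          # standard deviation
    alpha: float = 0.05
    power: float = 0.80

    def __post_init__(self):
        # Reject physically or statistically impossible inputs before computation.
        if self.delta <= 0 or self.sd <= 0:
            raise ValueError("delta and sd must be positive")
        if not (0 < self.alpha < 1 and 0 < self.power < 1):
            raise ValueError("alpha and power must lie in (0, 1)")

# Parameters extracted by the LLM are validated, then serialized back to JSON
# for the downstream statistical function call.
req = PowerRequest(test="two_sample_t_test", delta=1.0, sd=2.0)
print(json.dumps(asdict(req)))
```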
7. Prospective Enhancements
Future development directions include:
- More sophisticated input validation and error handling.
 - Expansion to more advanced statistical tests and adaptive integration of effect size estimation from real-time data.
 - Stepwise transparency, with literature-backed rationales for each automated decision.
 - Deeper user feedback loops for continuous learning and domain adaptation.
 
These planned enhancements aim to increase reliability and broaden applicability—especially for complex or unconventional clinical trial scenarios.
Summary
PowerGPT represents a consequential advance in clinical trial design, integrating natural language processing and statistical computation to automate power analysis and sample size determination. Its combination of high accuracy, efficiency, scalability, and accessibility has been validated in randomized trials, supporting both statisticians and non-specialists. By bridging expertise gaps and alleviating bottlenecks in statistical workflows, PowerGPT enhances the quality and pace of clinical research while mitigating ethical risks associated with poor study design (Lu et al., 15 Sep 2025).