- The paper introduces CAPRA, a multi-agent LLM system that automates formative feedback on software architecture deliverables to speed up assessment.
- It employs a four-stage pipeline—document parsing, parallel verification, evidence anchoring, and report generation—for consistent and deterministic feedback.
- Empirical evaluation shows CAPRA achieves over 88% criteria satisfaction with up to a 10.8x speedup in processing time compared to manual reviews.
Motivation and Context
The assessment of open-ended software architecture artifacts presents a challenge in software engineering education, particularly as class sizes scale. Traditional automated grading tools—such as static analysis platforms—are effective for code-based assignments but fail to address the structural, semantic, and traceability complexities inherent in architectural documentation. LLMs have demonstrated value for similar review tasks; however, naive application to multi-modal documents can result in unreliable feedback and hallucinated critiques. CAPRA (Configurable Architecture Proficiency Report Assessment) addresses this gap by orchestrating a multi-agent LLM workflow specifically designed for scalable, reliable, and pedagogically meaningful feedback on architectural deliverables.
System Architecture and Workflow
CAPRA implements a four-stage pipeline:
- Document Parsing: Uses PyMuPDF for text extraction and a vision-enabled LLM (gpt-4o) to parse UML diagrams, converting them into enriched structured text.
- Parallel Verification Agents: Specialized LLM-driven agents each perform a dimension-specific analysis. The SpecificationAuditor audits requirements and the alignment between UML diagrams and the textual narrative; the TestAuditor identifies test coverage gaps and checks adherence to testing strategies; the FeatureCheckAgent assesses the presence of course-specific features mined from clustered historical reports; the TraceabilityMatrixAgent analyzes requirements-design-test mappings.
- Evidence Anchoring: Employs deterministic fuzzy string matching (normalized Levenshtein distance, trigram overlap pre-filter) to explicitly ground agent findings in source document spans, modulating confidence scores and discarding unverifiable critiques.
- Report Generation: Constructs structured feedback reports using deterministic LaTeX templates integrated with targeted LLM-generated narratives (e.g., executive summary, document strengths). This approach ensures verifiable, reproducible, and efficient output.
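The parallel verification stage in the list above can be pictured with a minimal sketch. The agent names come from the summary; everything else (the `Finding` record, `run_agent`, `verify_in_parallel`, and the stubbed agent body) is hypothetical and stands in for real LLM calls, so this is an illustration of concurrent fan-out under those assumptions, not CAPRA's implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str   # which agent produced the finding
    claim: str   # the critique text
    quote: str   # the span the agent says it is quoting from the document


# Agent names as listed in the summary.
AGENT_NAMES = [
    "SpecificationAuditor",
    "TestAuditor",
    "FeatureCheckAgent",
    "TraceabilityMatrixAgent",
]


async def run_agent(name: str, document: str) -> list[Finding]:
    # Hypothetical stub: a real agent would send a dimension-specific
    # prompt to an LLM and parse the structured reply.
    await asyncio.sleep(0)
    return [Finding(agent=name, claim=f"{name} stub finding", quote=document[:20])]


async def verify_in_parallel(document: str) -> dict[str, list[Finding]]:
    # gather() runs the agents concurrently and returns results
    # in the same order as the awaitables passed in.
    results = await asyncio.gather(*(run_agent(n, document) for n in AGENT_NAMES))
    return dict(zip(AGENT_NAMES, results))


if __name__ == "__main__":
    findings = asyncio.run(verify_in_parallel("The system shall store orders."))
    for agent, items in findings.items():
        print(agent, len(items))
```

Each finding carries the quoted span it relies on, which is what a later anchoring stage would need in order to check the claim against the source text.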
CAPRA's multi-agent orchestration, deterministic evidence anchoring, and consistency management collectively mitigate LLM hallucinations and ensure actionable feedback.
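The evidence-anchoring idea, a trigram-overlap pre-filter followed by normalized Levenshtein matching, can be sketched as follows. The function names, the sliding-window strategy, and both thresholds (`min_overlap`, `min_sim`) are assumptions made for illustration; the summary does not specify how CAPRA sets them.

```python
def trigrams(text: str) -> set[str]:
    # Character trigrams over lowercased, whitespace-normalized text.
    t = " ".join(text.lower().split())
    return {t[i:i + 3] for i in range(len(t) - 2)}


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Normalized to [0, 1]; 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def anchor(quote: str, document: str, min_overlap: float = 0.5, min_sim: float = 0.8):
    """Return (start, end, score) for the best matching span, or None."""
    q = " ".join(quote.lower().split())
    doc = document.lower()
    q_tri = trigrams(q)
    if not q_tri:
        return None
    best = None
    for start in range(max(1, len(doc) - len(q) + 1)):
        window = doc[start:start + len(q)]
        # Cheap pre-filter: skip windows sharing too few trigrams with the quote.
        overlap = len(q_tri & trigrams(window)) / len(q_tri)
        if overlap < min_overlap:
            continue
        score = similarity(q, window)
        if score >= min_sim and (best is None or score > best[2]):
            best = (start, start + len(window), score)
    return best  # None means the quote could not be grounded


if __name__ == "__main__":
    doc = "The OrderService persists each order in the orders table."
    print(anchor("OrderService persist each order", doc))
    print(anchor("The system uses Kafka for messaging", doc))
```

In this sketch a finding whose quote returns `None` would be discarded, and the returned score could be used to scale the finding's confidence. Both steps are deterministic, so the same quote and document always yield the same result.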
Empirical Evaluation
Experimental Protocol
CAPRA was evaluated on 10 student reports from a software engineering course, using an eight-criterion taxonomy spanning extraction completeness, feature validation, issue grounding, recommendation specificity, and template/tone compliance. Two independent raters assessed each artifact, giving a binary pass/fail judgment per criterion; Cohen's kappa was used to quantify inter-rater reliability.
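For readers unfamiliar with the statistic, the following is a minimal sketch of Cohen's kappa for two raters giving binary judgments. The rating vectors are invented: the counts are chosen only so that raw agreement equals the 93.75% unanimous-agreement figure reported in the results, and they are not the study's data. Returning 1.0 when both raters give the same constant label is a convention adopted here; kappa is formally undefined in that case.

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters giving binary (0/1) labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal pass rate.
    pa = sum(rater_a) / n
    pb = sum(rater_b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    if p_e == 1.0:
        # Both raters gave one identical constant label; kappa is 0/0 here.
        # Reporting 1.0 is an assumption of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical counts: 71 both-pass, 4 both-fail, 5 disagreements.
    rater_a = [1] * 71 + [0] * 4 + [1] * 3 + [0] * 2
    rater_b = [1] * 71 + [0] * 4 + [0] * 3 + [1] * 2
    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    print(f"agreement={agreement:.4f} kappa={cohens_kappa(rater_a, rater_b):.3f}")
    # prints agreement=0.9375 kappa=0.582
```

The example shows why high raw agreement can coexist with only moderate kappa: when nearly every decision is a pass, agreement expected by chance is already high, and kappa credits only the agreement beyond that.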
Numerical Results
- Criteria Satisfaction: CAPRA satisfied 88.8% of the evaluated criteria under strict aggregation; lenient aggregation yielded a 91.9% pass rate.
- Inter-Rater Agreement: The two raters agreed on 93.75% of binary decisions. Overall Cohen's kappa was 0.582 (moderate), with feature extraction reaching perfect agreement (κ = 1.0). The gap between high raw agreement and moderate kappa is expected when most decisions are passes, because kappa discounts agreement attainable by chance. Agreement was lower in interpretive categories (e.g., architectural design and issue grounding), reflecting their intrinsic subjectivity.
- Processing Efficiency: Each report was processed in slightly over 4 minutes (mean = 248 s), compared to 30–45 minutes for manual review, yielding a speedup factor of 7.2–10.8x. Processing cost averaged $0.44 per report.
- Feature Validation: Automated knowledge base generation via the SALLMA workflow and HDBSCAN clustering extracted 7 canonical features from 680 raw mentions, achieving a 100% pass rate on the feature validation criterion (B1).
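The speedup range in the list above is manual review time divided by automated processing time. The short check below is a sketch of that arithmetic only; the `speedup` helper is hypothetical, and the 250 s case is included on the assumption that the reported range was computed from a rounded per-report time.

```python
def speedup(manual_minutes: float, automated_seconds: float) -> float:
    # Ratio of manual review time to automated processing time.
    return manual_minutes * 60 / automated_seconds


if __name__ == "__main__":
    for seconds in (248, 250):
        low, high = speedup(30, seconds), speedup(45, seconds)
        print(f"{seconds}s per report: {low:.2f}x to {high:.2f}x")
    # prints:
    # 248s per report: 7.26x to 10.89x
    # 250s per report: 7.20x to 10.80x
```

With the reported mean of 248 s the ratio comes out slightly higher than the stated range, while 250 s reproduces 7.2x to 10.8x exactly; either way the order of magnitude of the gain is the same.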
Illustrative Critiques
CAPRA demonstrated the capacity to detect nuanced specification defects, such as semantic inconsistencies in use case actor roles and traceability gaps between UML domain models and persistence schemas. Such issues are often overlooked in manual review.
Strengths, Limitations, and Implications
Key Strengths
- Robust Evidence Anchoring: Deterministically grounds feedback in document source spans, substantially reducing hallucinations.
- Configurable Feature Extraction: Fully automated knowledge base construction enables adaptation to evolving course rubrics without manual feature engineering.
- Multi-Agent Decomposition: Specialized agents facilitate focused, reliable analytical checks, outperforming single-prompt approaches on complex tasks.
Limitations and Threats to Validity
- Subjectivity in Interpretive Dimensions: Lower inter-rater agreement underscores the inherent difficulty in objectively evaluating architectural descriptions and issue grounding, warranting human oversight in these tasks.
- Language Contamination: Multilingual documents (Italian/English) led to tone and template drift in isolated feedback sections.
- Prompt Engineering Overhead: Feedback granularity and pedagogical scope require iterative prompt tuning to avoid generic or production-oriented critiques.
- Dataset Bias and Generalization: Evaluation corpus consisted of high-quality reports only; results may not generalize to weaker submissions or alternate institutional settings.
- API Dependency and Reproducibility: Reliance on proprietary LLMs impacts reproducibility and cost; deterministic settings only partially mitigate non-deterministic outputs.
Theoretical and Practical Implications
CAPRA demonstrates that multi-agent, evidence-anchored LLM systems can automate formative feedback for open-ended, multi-modal artifacts in software engineering education at scale. Practically, CAPRA enables instructors to provide rapid, actionable feedback cycles, offloading routinized assessment tasks and freeing human reviewers for interpretive critique. Theoretically, CAPRA's architecture validates decompositional multi-agent workflows and transparent evidence verification as strategies for curbing LLM hallucinations—a result extensible to wider educational and code review contexts.
Future developments in AI-assisted education may involve deeper integration with learning management systems, extension to diverse artifact types and languages, and migration to open-source LLMs to reduce dependency on proprietary APIs. Improving configurability and feedback scope control without onerous prompt engineering remains an important avenue for usability.
Conclusion
CAPRA operationalizes a multi-agent LLM architecture for deterministic, scalable, and customizable formative feedback on software architecture deliverables (2606.18976). Preliminary evaluation indicates high extractive reliability, significant efficiency gains, and reduced hallucination through evidence anchoring. Subjective assessment dimensions necessitate retained human oversight. The system addresses key impediments to formative assessment in large educational contexts and sets a precedent for evidence-anchored, multi-agent LLM pipelines in future educational and technical review tasks.