CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Published 17 Jun 2026 in cs.SE and cs.AI | (2606.18976v1)

Abstract: Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying LLMs to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.


Authors (5)

Summary

The paper introduces CAPRA, a multi-agent LLM system designed for automated formative feedback on software architecture deliverables to accelerate assessment.
It employs a four-stage pipeline (document parsing, parallel verification, evidence anchoring, and report generation) to produce consistent, evidence-grounded feedback.
In a preliminary evaluation on 10 student reports, CAPRA satisfied 88.8% of the evaluated criteria and processed reports 7.2–10.8x faster than manual review.

CAPRA: Multi-Agent LLM System for Automated Formative Feedback on Software Architecture Deliverables

Motivation and Context

The assessment of open-ended software architecture artifacts presents a challenge in software engineering education, particularly as class sizes scale. Traditional automated grading tools—such as static analysis platforms—are effective for code-based assignments but fail to address the structural, semantic, and traceability complexities inherent in architectural documentation. LLMs have demonstrated value for similar review tasks; however, naive application to multi-modal documents can result in unreliable feedback and hallucinated critiques. CAPRA (Configurable Architecture Proficiency Report Assessment) addresses this gap by orchestrating a multi-agent LLM workflow specifically designed for scalable, reliable, and pedagogically meaningful feedback on architectural deliverables.

System Architecture and Workflow

CAPRA implements a four-stage pipeline:

Document Parsing: Uses PyMuPDF for text extraction and a vision-enabled LLM (gpt-4o) to parse UML diagrams, converting them into enriched structured text.
Parallel Verification Agents: Specialized LLM-driven agents each perform a dimension-specific analysis. The SpecificationAuditor agent audits requirements and the alignment between UML diagrams and the textual narrative; the TestAuditor agent identifies test coverage gaps and checks adherence to testing strategies; the FeatureCheckAgent assesses the presence of course-specific features mined from clustered historical reports; and the TraceabilityMatrixAgent analyzes requirements-design-test mappings.
Evidence Anchoring: Uses deterministic fuzzy string matching (normalized Levenshtein distance with a trigram-overlap pre-filter) to ground each agent finding in a span of the source document, adjusting confidence scores and discarding critiques that cannot be verified.
Report Generation: Constructs structured feedback reports using deterministic LaTeX templates integrated with targeted LLM-generated narratives (e.g., executive summary, document strengths). This approach ensures verifiable, reproducible, and efficient output.
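The Evidence Anchoring stage described above can be illustrated with a minimal sketch. It assumes each finding carries a quoted snippet and that the document has already been split into candidate spans. The function names, the `Anchor` type, and the two thresholds (`prefilter=0.3`, `threshold=0.8`) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not edits."""
    return " ".join(text.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(text: str) -> Set[str]:
    text = normalize(text)
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_overlap(quote: str, span: str) -> float:
    """Fraction of the quote's character trigrams that also occur in the span."""
    tq, ts = trigrams(quote), trigrams(span)
    return len(tq & ts) / len(tq) if tq else 0.0


@dataclass
class Anchor:
    span: str     # source span the quote was matched to
    score: float  # normalized Levenshtein similarity of the match


def anchor_quote(quote: str, spans: List[str],
                 prefilter: float = 0.3,              # assumed value
                 threshold: float = 0.8) -> Optional[Anchor]:  # assumed value
    """Return the best-matching span, or None if the quote cannot be grounded."""
    best: Optional[Anchor] = None
    for span in spans:
        if trigram_overlap(quote, span) < prefilter:
            continue  # cheap pre-filter: skip spans that share few trigrams
        score = similarity(normalize(quote), normalize(span))
        if best is None or score > best.score:
            best = Anchor(span, score)
    return best if best is not None and best.score >= threshold else None


spans = [
    "The system shall allow users to reset their password via email.",
    "Figure 3 shows the deployment diagram.",
]
hit = anchor_quote("The system shall allow users to reset thier password via email", spans)
miss = anchor_quote("The database uses sharding across three regions.", spans)
```

Here `hit` is anchored to the first span despite the typo and missing period, while `miss` returns `None` and would be discarded. One plausible way to adjust confidence, again an assumption, is to multiply a finding's confidence by `hit.score`.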

CAPRA's multi-agent orchestration, deterministic evidence anchoring, and consistency management are together intended to mitigate LLM hallucinations and keep the feedback actionable.

Empirical Evaluation

Experimental Protocol

CAPRA was evaluated on 10 student reports from a software engineering course, using an eight-criterion taxonomy spanning extraction completeness, feature validation, issue grounding, recommendation specificity, and template/tone compliance. Two independent raters made binary judgments on each artifact, and Cohen's kappa was used to quantify inter-rater reliability.
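As a reminder of how the agreement statistics in such a protocol are computed, the sketch below calculates raw agreement, Cohen's kappa, and a strict pass rate for two binary raters. The ratings are invented toy data, not the paper's, and reading "strict aggregation" as "both raters must mark a criterion as passed" is an assumption.

```python
from typing import List


def raw_agreement(r1: List[int], r2: List[int]) -> float:
    """Share of decisions on which both raters gave the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1: List[int], r2: List[int]) -> float:
    """Cohen's kappa for two raters with binary (0/1) labels."""
    n = len(r1)
    observed = raw_agreement(r1, r2)
    p1, p2 = sum(r1) / n, sum(r2) / n                # each rater's pass rate
    expected = p1 * p2 + (1 - p1) * (1 - p2)         # agreement expected by chance
    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # formally undefined, and returning 1.0 is a convention chosen here.
        return 1.0
    return (observed - expected) / (1 - expected)


def strict_pass_rate(r1: List[int], r2: List[int]) -> float:
    """Assumed strict rule: a criterion passes only if both raters pass it."""
    return sum(a == 1 and b == 1 for a, b in zip(r1, r2)) / len(r1)


# Toy ratings (1 = criterion satisfied, 0 = not satisfied).
rater_a = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
rater_b = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

agreement = raw_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
strict = strict_pass_rate(rater_a, rater_b)
```

Because kappa corrects for chance agreement, it can be much lower than raw agreement when almost all labels are "pass", which is one way a high share of unanimous decisions can coexist with a moderate kappa.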

Numerical Results

Criteria Satisfaction: CAPRA satisfied 88.8% of the evaluated criteria under strict aggregation; lenient aggregation yielded a 91.9% pass rate.
Inter-Rater Agreement: The raters agreed on 93.75% of binary decisions; overall Cohen's kappa was 0.582 (moderate), with feature extraction reaching perfect agreement (κ = 1.0). Agreement was lower in interpretive categories (e.g., architectural design and issue grounding), reflecting their intrinsic subjectivity.
Processing Efficiency: Each report was processed in slightly over 4 minutes (mean = 248 s), compared to 30–45 minutes for manual review, a speedup of 7.2–10.8x. Processing cost averaged $0.44 per report.
Feature Validation: Automated knowledge base generation via the SALLMA workflow and HDBSCAN clustering extracted 7 canonical features from 680 raw mentions, achieving a 100% pass rate in feature validation (B1).
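The speedup range follows directly from the reported timings, as the short calculation below shows. The variable names and the `batch_cost` helper are illustrative. Only the 248 s mean, the 30–45 minute manual range, and the $0.44 per-report cost come from the text.

```python
import math

MEAN_SECONDS_PER_REPORT = 248       # reported mean processing time
MANUAL_MINUTES = (30, 45)           # reported manual review range
COST_PER_REPORT_USD = 0.44          # reported average processing cost

# Speedup = manual review time / automated processing time, in the same units.
speedup_low = MANUAL_MINUTES[0] * 60 / MEAN_SECONDS_PER_REPORT
speedup_high = MANUAL_MINUTES[1] * 60 / MEAN_SECONDS_PER_REPORT


def truncate_1dp(value: float) -> float:
    """Truncate (not round) to one decimal place."""
    return math.floor(value * 10) / 10


def batch_cost(num_reports: int) -> float:
    """Hypothetical helper: API cost of a batch at the reported average."""
    return round(num_reports * COST_PER_REPORT_USD, 2)


print(f"speedup: {speedup_low:.2f}x to {speedup_high:.2f}x")
print(f"cost for the 10 evaluated reports: ${batch_cost(10):.2f}")
```

The two ratios are 1800/248 and 2700/248, which match the reported 7.2–10.8x range when truncated to one decimal place.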

Illustrative Critiques

CAPRA detected nuanced specification defects, such as semantic inconsistencies in use case actor roles and traceability gaps between UML domain models and persistence schemas, which are often overlooked in manual review.

Strengths, Limitations, and Implications

Key Strengths

Robust Evidence Anchoring: Deterministically grounds feedback in document source spans, substantially reducing hallucinations.
Configurable Feature Extraction: Fully automated knowledge base construction enables adaptation to evolving course rubrics without manual feature engineering.
Multi-Agent Decomposition: Specialized agents facilitate focused, reliable analytical checks, outperforming single-prompt approaches on complex tasks.

Limitations and Threats to Validity

Subjectivity in Interpretive Dimensions: Lower inter-rater agreement underscores the inherent difficulty in objectively evaluating architectural descriptions and issue grounding, warranting human oversight in these tasks.
Language Contamination: Multilingual documents (Italian/English) led to tone and template drift in isolated feedback sections.
Prompt Engineering Overhead: Feedback granularity and pedagogical scope require iterative prompt tuning to avoid generic or production-oriented critiques.
Dataset Bias and Generalization: Evaluation corpus consisted of high-quality reports only; results may not generalize to weaker submissions or alternate institutional settings.
API Dependency and Reproducibility: Reliance on proprietary LLMs impacts reproducibility and cost; deterministic settings only partially mitigate non-deterministic outputs.

Theoretical and Practical Implications

CAPRA demonstrates that multi-agent, evidence-anchored LLM systems can automate formative feedback for open-ended, multi-modal artifacts in software engineering education at scale. Practically, CAPRA enables instructors to provide rapid, actionable feedback cycles, offloading routinized assessment tasks and freeing human reviewers for interpretive critique. Theoretically, CAPRA's architecture validates decompositional multi-agent workflows and transparent evidence verification as strategies for curbing LLM hallucinations—a result extensible to wider educational and code review contexts.

Future developments in AI-assisted education may involve deeper integration with learning management systems, extension to diverse artifact types and languages, and migration to open-source LLMs to reduce dependency on proprietary APIs. Improving configurability and feedback scope control without onerous prompt engineering remains an important avenue for usability.

Conclusion

CAPRA operationalizes a multi-agent LLM architecture for scalable, customizable formative feedback on software architecture deliverables, built around deterministic evidence anchoring and templating (2606.18976). Preliminary evaluation indicates high extractive reliability, significant efficiency gains, and reduced hallucination through evidence anchoring. Subjective assessment dimensions still require human oversight. The system addresses key impediments to formative assessment in large educational contexts and sets a precedent for evidence-anchored, multi-agent LLM pipelines in future educational and technical review tasks.
