Fraunhofer AI Assessment Catalogue
- The Fraunhofer AI Assessment Catalogue is a comprehensive, risk-based framework that translates high-level AI trustworthiness requirements into actionable certification criteria across fairness, transparency, reliability, and other dimensions.
- The framework employs a detailed, multi-stage methodology, including protection requirements analysis and cross-dimensional risk scoring, to guide both self-certification and preparatory audits for high-risk applications.
- Its practical application to systems like emotion recognition has demonstrated measurable improvements in accuracy, fairness, and reliability through systematic documentation, testing, and iterative remediation.
The Fraunhofer AI Assessment Catalogue is a comprehensive, risk-based framework for certifying the trustworthiness of AI systems, developed by Fraunhofer-Gesellschaft in alignment with the European Union AI Act's conceptual structure but not itself a harmonized legal standard. It provides a systematic methodology for translating high-level AI trustworthiness requirements into operationalizable assessment criteria, and is designed for use in both self-certification and preparatory audits for high-risk AI applications such as biometric identification, emotion recognition, and automated decision-making. The catalogue systematically evaluates six dimensions: Fairness, Autonomy & Control, Transparency, Reliability, Safety & Security, and Data Protection, using a structured, multi-stage workflow that unifies technical, procedural, and documentation-based evidence (Autischer et al., 20 Jan 2025, Autischer et al., 13 Jan 2026, Poretschkin et al., 2023, Corrêa et al., 2024).
1. Structural Overview and Core Dimensions
The Fraunhofer AI Assessment Catalogue operationalizes AI trustworthiness across six principal dimensions, each with its own protection objectives, risk areas, quality criteria, and testable measures (Poretschkin et al., 2023, Autischer et al., 13 Jan 2026):
- Fairness: Absence of systematic disadvantage across groups; criteria include group-wise performance, maximum accuracy gap, and open bias/disparity auditing.
- Autonomy & Control: Assurance of appropriate human oversight, user intervention, and controllability.
- Transparency: Traceability, explainability, and operational interpretability of models and outputs.
- Reliability: Consistency under both nominal and perturbed conditions, with quantitative robustness and uncertainty estimation.
- Safety & Security: Protection from unintended harm and adversarial manipulation.
- Data Protection: Compliance with GDPR and minimization of privacy risks across the AI lifecycle.
Each dimension is subdivided into risk areas, with systematic recording of objectives, quantitative or qualitative criteria (e.g., fairness metrics like statistical parity difference or disparate impact), and documentation of the technical/organizational measures undertaken.
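The fairness metrics named above can be illustrated with a small sketch; the function names and group outcomes are hypothetical examples, not part of the catalogue itself:

```python
# Hypothetical sketch of two fairness criteria from the Fairness dimension:
# statistical parity difference (SPD) and disparate impact (DI).

def positive_rate(outcomes):
    """Fraction of favorable (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def statistical_parity_difference(group_a, group_b):
    """SPD = P(favorable | A) - P(favorable | B); 0 means parity."""
    return positive_rate(group_a) - positive_rate(group_b)

def disparate_impact(group_a, group_b):
    """DI = P(favorable | A) / P(favorable | B); 1 means parity."""
    return positive_rate(group_a) / positive_rate(group_b)

# Illustrative binary outcomes (1 = favorable decision) for two groups.
a = [1, 1, 0, 1, 0, 1, 1, 0]   # positive rate 5/8 = 0.625
b = [1, 0, 0, 1, 0, 1, 0, 0]   # positive rate 3/8 = 0.375

print(statistical_parity_difference(a, b))  # 0.25
print(disparate_impact(a, b))               # ≈ 1.667
```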
The catalogue's assessment process is fundamentally risk-driven: after a protection requirements analysis (low, medium, high) for each dimension, only those flagged as carrying non-negligible residual risk require detailed scrutiny (Autischer et al., 20 Jan 2025, Poretschkin et al., 2023, Autischer et al., 13 Jan 2026).
2. Methodological Workflow and Risk Scoring
The standard workflow consists of the following steps (Autischer et al., 20 Jan 2025, Poretschkin et al., 2023):
- Preliminary Steps: Define the AI Profile—system functionality, intended context, boundaries, and lifecycle stages.
- Protection Requirements Analysis: Initial screening across all six dimensions, categorizing the relevance as low, medium, or high.
- Detailed Risk Analysis: For medium or high-risk dimensions, follow standard worksheets per dimension comprising risk analysis, objectives, achievement criteria, measures (tests and documentation), and summary assessments.
- Cross-Dimensional Assessment: Synthesize findings, identify overlaps, trade-offs, and unresolved risks.
- Certification Decision: Final verdict—certified, partial remediation required, or rejection.
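The risk-driven gating in this workflow can be sketched minimally, assuming the screening levels (low/medium/high) map directly to a skip-or-inspect decision; the data structures and function name are illustrative:

```python
# The six catalogue dimensions; only those screened as medium or high
# proceed to detailed risk analysis.
DIMENSIONS = ["Fairness", "Autonomy & Control", "Transparency",
              "Reliability", "Safety & Security", "Data Protection"]

def protection_requirements_analysis(screening):
    """Return the dimensions that require detailed scrutiny; dimensions
    rated low (or unrated) carry negligible residual risk and are skipped."""
    return [d for d in DIMENSIONS
            if screening.get(d, "low") in ("medium", "high")]

# Illustrative screening result for an emotion-recognition profile.
screening = {"Fairness": "medium", "Reliability": "medium",
             "Transparency": "low", "Data Protection": "high"}

for dim in protection_requirements_analysis(screening):
    print(f"Detailed risk analysis required: {dim}")
```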
Formal scoring is expressed using weighted criteria, with the risk score for dimension $d$ given by:

$$R_d = \sum_{i=1}^{N_d} w_i \,(1 - f_i)$$

where $w_i$ is the weight for criterion $i$, $f_i \in \{0, 1\}$ indicates whether criterion $i$ is fulfilled, and $N_d$ is the total number of criteria for dimension $d$. Certification at the dimension level is contingent on $R_d \le \tau_d$ for a dimension-specific threshold $\tau_d$; system certification requires the aggregate $R = \sum_d R_d$ across all dimensions to remain below a global threshold (Autischer et al., 20 Jan 2025, Autischer et al., 13 Jan 2026).
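Under the weighted aggregation described above, the dimension-level score and certification check might be sketched as follows (function names, example weights, and the threshold value are illustrative):

```python
def dimension_risk_score(weights, fulfilled):
    """Weighted residual risk: each unfulfilled criterion (f_i = 0)
    contributes its weight w_i; fulfilled criteria contribute nothing."""
    return sum(w * (1 - f) for w, f in zip(weights, fulfilled))

def certify(weights, fulfilled, tau_d):
    """Dimension-level certification: residual risk must not exceed tau_d."""
    return dimension_risk_score(weights, fulfilled) <= tau_d

# Three criteria; the third (weight 0.2) is unfulfilled.
w = [0.5, 0.3, 0.2]
f = [1, 1, 0]
print(dimension_risk_score(w, f))   # 0.2
print(certify(w, f, tau_d=0.25))    # True
```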
3. Exemplary Application and Empirical Results
Empirical validation is documented through sample certifications. In the application to the RIOT/EmoPy emotion recognition system, assessors reconstructed the system, collated available technical documentation, and iteratively completed the catalogue’s detailed checklists for Fairness and Reliability (Autischer et al., 20 Jan 2025, Autischer et al., 13 Jan 2026).
Key steps included:
- Forking and building the system, gathering all documentation.
- Explicitly mapping inputs, outputs, model boundaries, dependencies, and versioning.
- Conducting risk-based analysis, identifying medium risks for Reliability and Fairness.
- Running data quality checks, robustness and bias testing, and reporting group-wise performance metrics.
Notable findings included the inability to certify Fairness due to a lack of demographic group data, and partial non-certifiability of Reliability in the absence of comprehensive robustness tests. Catalogue-guided technical improvements (e.g., data annotation, model enhancements, test augmentation) produced documented gains in overall accuracy, reduced entropy, and narrowed group-wise performance gaps (Autischer et al., 13 Jan 2026).
Summary of selected metrics:
| Metric | Baseline | Enhanced | Acceptance Criterion |
|---|---|---|---|
| Accuracy | 53.88 % | 68.19 % | ≥ 60 % (high-risk) |
| Confidence | 27 % | 78.7 % | ≥ 70 % |
| Entropy | 1.38 | 0.53 | ≤ 1.0 (implicit) |
| Max Accuracy Gap (Race) | — | 7.85 % | ≤ 10 % |
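The metrics in this table can be computed from model outputs as in the following sketch; the toy inputs are illustrative and do not reproduce the reported figures:

```python
import math

def accuracy(preds, labels):
    """Fraction of correct top-1 predictions."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def mean_confidence(prob_rows):
    """Mean top-class probability across samples."""
    return sum(max(row) for row in prob_rows) / len(prob_rows)

def mean_entropy(prob_rows):
    """Mean Shannon entropy of the predicted class distributions;
    lower entropy indicates more decisive predictions."""
    def h(row):
        return -sum(p * math.log(p) for p in row if p > 0)
    return sum(h(row) for row in prob_rows) / len(prob_rows)

def max_accuracy_gap(preds, labels, groups):
    """Largest difference in group-wise accuracy (e.g., across race groups)."""
    by_group = {}
    for p, y, g in zip(preds, labels, groups):
        by_group.setdefault(g, []).append(p == y)
    accs = [sum(v) / len(v) for v in by_group.values()]
    return max(accs) - min(accs)

# Illustrative values only.
probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
preds, labels, groups = [0, 0], [0, 1], ["a", "b"]
print(accuracy(preds, labels))                  # 0.5
print(max_accuracy_gap(preds, labels, groups))  # 1.0
```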
This illustrates the catalogue's role in diagnosing specific technical and documentation shortcomings and driving evidence-based remediation.
4. Integration with Responsible AI Metrics and Ethical Requirements
The catalogue aligns with broader responsible AI and ethical requirements frameworks, supporting the evaluation of ethical principles such as privacy, sustainability, and truthfulness (Corrêa et al., 2024, Xia et al., 2023). Shared criteria include:
- Purpose Limitation: Only data strictly necessary for the defined application is collected, supporting GDPR Article 5.
- Bias and Disparity Metrics: Use of statistical parity difference (SPD), disparate impact, group-wise error reporting, model and data cards—all formalized within the Fairness dimension.
- Environmental Sustainability: CO₂-equivalent thresholds, resource-efficient development, and lifecycle assessment for sustainability.
- Transparency and Disclosure: Promotion of open development, requirement of model cards, audit and traceability features, and event-provenance logging.
Multi-tiered risk categorization—spanning minimal to unacceptable—enables targeted selection and calibration of controls and remediation efforts, guided by precise thresholds for each context (e.g., SPD ≤ 0.05, DI in [0.8, 1.25], CO₂ limits per development phase) (Corrêa et al., 2024).
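Applying the cited fairness thresholds (SPD ≤ 0.05, DI in [0.8, 1.25]) reduces to a simple range check; the function name and its default arguments are illustrative, not prescribed by the catalogue:

```python
def within_fairness_thresholds(spd, di, spd_max=0.05, di_range=(0.8, 1.25)):
    """Check a measured statistical parity difference (SPD) and disparate
    impact (DI) against the context-specific thresholds cited above."""
    return abs(spd) <= spd_max and di_range[0] <= di <= di_range[1]

print(within_fairness_thresholds(spd=0.03, di=1.1))   # True
print(within_fairness_thresholds(spd=0.12, di=0.7))   # False
```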
Process, resource, and product metrics are further explicated and modularized in complementary catalogues; for example, explicit process metrics for roles, governance, auditability, and redressability can be quantitatively formalized and directly integrated into the Fraunhofer reporting structure (Xia et al., 2023).
5. Limitations, Challenges, and Directions for Streamlining
The application of the catalogue in real-world and academic contexts has exposed several principal limitations (Autischer et al., 20 Jan 2025, Autischer et al., 13 Jan 2026):
- Documentation Dependency: Certifiability is tightly coupled with the completeness of documentation. Legacy, unmaintained, or open-source systems lacking versioned artefacts impede fair assessment.
- Cumbersome Checklists: The extensive, sometimes redundant, question sets can slow audits; deduplication and guidance on format/level of detail are needed.
- Lack of Code-Level Assessability: The catalogue strongly emphasizes documentation and process evidence, with limited facilities for direct code-driven or integrative testing.
- Resource Intensity: Dimension-level assessment can require multiple days per system dimension, even for modestly scoped applications.
- Legal Compliance Gaps: While the catalogue supports technical trustworthiness and operational readiness, it lacks authority as a harmonized standard under the AI Act and does not address requirements for formal declarations of conformity or organization-level processes (e.g., quality management, post-market monitoring).
Recommendations include pre-selecting systems with available and living documentation, providing template response formats or standardized artefact checklists, integrating optional code-review sections for open-source systems, and introducing tiered certification modes (light/deep) to balance administrative burden against technical depth.
The catalogue's documentation rigor can be complemented by technical guidance from other catalogues (e.g., TÜV, auditing ML algorithms), or augmented with modular, quantitative accountability metrics for contemporary GenAI systems. Integration of process and product metric modules—such as provenance logging or audit automation—can further operationalize its findings (Xia et al., 2023).
6. Practical Implications and Synthesis
The Fraunhofer AI Assessment Catalogue has established itself as a rigorous and structured vehicle for AI system certification in advance of harmonized legal standards, providing strong technical and procedural scaffolding for organizations aiming for trustworthiness and regulatory alignment (Autischer et al., 13 Jan 2026, Autischer et al., 20 Jan 2025, Poretschkin et al., 2023). Its risk-based, dimension-wise structure closely models the European regulatory paradigm and enables detailed, criterion-by-criterion technical scrutiny of fairness, reliability, privacy, and other critical properties.
Practical application confirms that embedding the catalogue’s checklists and artefact generation early in the AI development lifecycle yields both technical improvements and certification-ready documentation as natural by-products. However, structured catalogue assessments remain a preparatory step, not a substitute for legal conformity assessment under the evolving EU AI Act. Harmonized standards, post-market monitoring requirements, and cross-organizational quality management are necessary complements.
A plausible implication is that further refinements—such as introducing graded scoring, code-level auditing, and automated metric ingestion—could streamline the certification process, improve adoption, and ensure ongoing alignment as regulatory clarity and technical best practices advance. The catalogue serves as a foundational layer for trustworthy, accountable, and legally informed AI system governance.