Sola Visibility ISPM Benchmark

Updated 18 January 2026
  • Sola Visibility ISPM Benchmark is a standardized evaluation suite for agentic AI that assesses identity inventory and configuration hygiene on AWS, Okta, and GWS.
  • It comprises 77 data-grounded queries that require direct interrogation of system configurations, yielding evidence-backed, verifiable answers.
  • The benchmark provides two execution modes, fast-path and full-path exploration, to balance accuracy and traceability in enterprise identity assessments.

The Sola Visibility ISPM Benchmark is a standardized evaluation suite developed to assess agentic AI systems on foundational Identity Security Posture Management (ISPM) visibility tasks within real, production-grade enterprise identity environments. It is the first benchmark of its kind to focus on the accuracy, reliability, and evidence-backed reasoning of tool-using agents in answering inventory and hygiene questions over live configurations in Amazon Web Services (AWS), Okta, and Google Workspace (GWS) (Engelberg et al., 11 Jan 2026).

1. Scope and Benchmark Design

The Sola Visibility ISPM Benchmark evaluates agentic AI systems against 77 “data-grounded” questions that require direct interrogation of identity management systems. It centers on two foundational visibility categories:

  • Identity Inventory: Enumerating and categorizing both human and non-human identities, and summarizing key attributes across platforms.
  • Configuration Hygiene: Detecting misconfigurations, policy violations, stale credentials, and weak authentication states across diverse identity and access management (IAM) systems.

Benchmark questions are constructed to be answerable exclusively from the explicit schemas and states of the AWS, Okta, and GWS platforms as instantiated in a live enterprise identity estate.

2. Evaluation Environment and Data Domains

The benchmark is executed within a realistic, production-mirroring environment featuring:

  • Okta: External identity provider with workforce user records, MFA/session policies, and app assignments.
  • AWS: IAM subsystems including users, roles, policies, access keys, trust relationships, and root credentials.
  • Google Workspace: Directory and security settings for collaborative accounts, group memberships, Drive permissions, OAuth configurations, and admin controls.

All queries are grounded in the actual configurations and states of these systems at evaluation time. This construction ensures that benchmark responses must be justified by direct evidence, rather than generic reasoning.

3. Task Categories and Representative Questions

The benchmark’s tasks are systematically categorized:

3.1 Identity Inventory

Tasks in this class require detailed enumeration or classification, such as:

  • Listing high-privilege human identities by type (e.g., AWS IAM users vs. Okta admins).
  • Identifying GWS admin users lacking mandatory 2-Step Verification.
  • Enumerating IAM users with console passwords but lacking MFA.
  • Extracting roles with attached inline policies.
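Inventory questions of this kind reduce to simple predicates over ingested identity data. The sketch below illustrates the "console password but no MFA" check with sqlite3; the table and column names are hypothetical stand-ins, since the benchmark's actual ingestion schema is not reproduced here.

```python
import sqlite3

# Illustrative only: `iam_users` and its columns are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE iam_users (
    user_name    TEXT,
    has_password INTEGER,  -- console password enabled (1) or not (0)
    mfa_active   INTEGER   -- at least one MFA device active (1) or not (0)
);
INSERT INTO iam_users VALUES
    ('alice',  1, 1),
    ('bob',    1, 0),   -- console password, no MFA: should be flagged
    ('ci-bot', 0, 0);   -- no console password: out of scope
""")

# "IAM users with console passwords but lacking MFA"
flagged = conn.execute(
    "SELECT user_name FROM iam_users "
    "WHERE has_password = 1 AND mfa_active = 0"
).fetchall()
print([name for (name,) in flagged])  # ['bob']
```

The point of such data-grounded tasks is that the answer is a row set, not a judgment call: every flagged identity can be traced back to a concrete record.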

3.2 Configuration Hygiene

Hygiene tasks involve detection of posture weaknesses or configuration errors, including:

  • Identifying customer-managed AWS policies granting Action = '*' and Resource = '*'.
  • Assessing root account MFA enablement and root usage recency.
  • Checking if GWS Super Admins have self-service recovery disabled.
  • Surveying Okta sign-on policies with weak MFA enforcement.
  • Listing Okta users inactive for more than 90 days.

Across all tasks, correct answers must be evidenced from system data, not inferred heuristically.
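As one concrete example, the wildcard-policy check above can be evidenced by inspecting policy documents directly. The sketch below parses AWS IAM JSON policy documents (which do follow the published IAM policy grammar); the surrounding storage and retrieval layer is assumed, not taken from the benchmark.

```python
import json

def is_full_admin(policy_doc: dict) -> bool:
    """Flag Allow statements granting Action = '*' on Resource = '*'."""
    statements = policy_doc.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be un-listed
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions and "*" in resources:
            return True
    return False

admin_policy = json.loads(
    '{"Version": "2012-10-17", "Statement":'
    ' [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}'
)
scoped_policy = json.loads(
    '{"Version": "2012-10-17", "Statement":'
    ' [{"Effect": "Allow", "Action": "s3:GetObject",'
    ' "Resource": "arn:aws:s3:::logs/*"}]}'
)
print(is_full_admin(admin_policy), is_full_admin(scoped_policy))  # True False
```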

4. Sola AI Agent: Architecture and Execution Modes

The benchmark framework is paired with the Sola AI Agent, a tool-using, data-grounded agent engineered to translate natural language queries into verifiable execution plans:

  • Query Interpretation: Parses each user query, identifies target platforms, extracts relevant schema and example query templates.
  • Execution Modes:
    • Fast-Path Exploration: Direct adaptation and execution of a known SQL template for questions with high schema confidence. Suitable for low-complexity inventory or hygiene checks.
    • Full-Path Exploration (Tree-of-Thought): Decomposes the query into sub-steps, executes and validates SQL queries iteratively, recording a traceable chain-of-thought and step journal. This mode enables multi-step reasoning and explicit traceability.
  • Synthesis: Aggregates cross-system evidence, synthesizes natural-language answers referencing SQL and evidence extracts, and records tool-call chains for inspection.

Editor's term: “fast-path” denotes direct template adaptation; “full-path” describes explicit, traceable step-by-step reasoning.
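The mode split described above can be sketched as a simple dispatch on schema confidence. This is a hypothetical control flow: the real agent's internals, thresholds, and journal format are not published in the source, so every name below is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class StepJournal:
    """Minimal stand-in for the agent's traceable step journal."""
    entries: list = field(default_factory=list)

    def log(self, step: str) -> None:
        self.entries.append(step)

def answer(query: str, schema_confidence: float, threshold: float = 0.8):
    journal = StepJournal()
    if schema_confidence >= threshold:
        # Fast-path: adapt and execute a known SQL template directly.
        journal.log(f"fast-path: adapted template for {query!r}")
        return "fast-path", journal
    # Full-path: decompose into sub-steps, executing and validating each,
    # so the chain of reasoning stays inspectable after the fact.
    for i, sub_step in enumerate(
        ["identify tables", "draft SQL", "validate rows"], start=1
    ):
        journal.log(f"full-path step {i}: {sub_step}")
    return "full-path", journal

mode, journal = answer("list stale Okta users", schema_confidence=0.4)
print(mode, len(journal.entries))  # full-path 3
```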

5. Evaluation Metrics

Performance is quantified with two primary, expert-driven metrics and a third automated alternative:

  • Expert Accuracy ($A_{\mathrm{expert}}$): Ordinal, graded correctness, averaging per-question expert-assigned scores $s_q \in \{0, 0.5, 1\}$:

$$A_{\mathrm{expert}} = \frac{1}{|Q|} \sum_{q \in Q} s_q$$

Partial credit is explicitly recognized.

  • Strict Success Rate ($S$): Fraction of questions receiving a perfect expert score ($s_q = 1$):

$$S = \frac{|\{q \in Q : s_q = 1\}|}{|Q|}$$

  • LLM-as-Judge Metric ($C_{\mathrm{LLM}}$): Automated chain-of-thought evaluation via models such as Claude Sonnet 4.5 or GPT-4.1, structurally mirroring expert accuracy.

These metrics are computed over all 77 benchmark questions, segmented by ISPM domain.
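The two expert-driven metrics follow directly from their definitions; the sketch below computes both from a made-up score list (the actual per-question scores are not published in the source).

```python
def expert_accuracy(scores):
    """Mean of per-question expert scores s_q in {0, 0.5, 1}."""
    return sum(scores) / len(scores)

def strict_success(scores):
    """Fraction of questions with a perfect score (s_q = 1)."""
    return sum(1 for s in scores if s == 1) / len(scores)

scores = [1, 1, 0.5, 0, 1]  # illustrative scores, not benchmark data
print(expert_accuracy(scores))  # 0.7
print(strict_success(scores))   # 0.6
```

Note how the two metrics diverge: partial credit lifts expert accuracy above strict success whenever any question scores 0.5.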

6. Experimental Results and Analysis

Results demonstrate robust agent performance in real-world multi-platform environments:

| Domain       | # Ques. | Answer Correctness (no GT) | Expert Accuracy | Strict Success Rate |
|--------------|---------|----------------------------|-----------------|---------------------|
| AWS Hygiene  | 39      | 0.92                       | 0.95            | 0.90                |
| GWS Hygiene  | 14      | 0.54                       | 0.75            | 0.71                |
| Inventory    | 14      | 0.84                       | 0.75            | 0.64                |
| Okta Hygiene | 10      | 0.83                       | 0.65            | 0.50                |
| Total        | 77      | 0.82                       | 0.84            | 0.77                |
  • AWS hygiene tasks yield the highest accuracy (0.95) and strict success (0.90), reflecting the agent’s reliability on well-schematized IAM checks.
  • GWS hygiene and inventory tasks display moderate performance (accuracy ~0.75); partial credit is assigned for answers where nuanced org-unit or policy details are partially addressed.
  • Okta hygiene exposes the most challenging queries (strict success 0.50), due to complexity in session/group correlations and ambiguous flag handling.

Execution Mode Comparison

  • Full-Path Exploration (40 questions): Expert accuracy 0.81, strict success 0.75; strong alignment on multi-step AWS and Okta tasks.
  • Fast-Path Exploration (37 questions): Expert accuracy 0.86, strict success 0.78; higher cross-domain variance, particularly in Okta when template adaptation misaligns with schema specifics.

Failure Modes

  • GWS hygiene failures primarily involve insufficient coverage of fine-grained org-unit settings.
  • Okta failures result from edge cases in policy flag handling (e.g., NULL logic) and lack of detailed group/session correlation logic.
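The NULL-logic edge case named above is a generic SQL pitfall worth making concrete: under three-valued logic, `flag != 1` does not match rows where the flag is NULL, so unset policy flags are silently dropped. The table and column names below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signon_policies (policy_name TEXT, mfa_required INTEGER);
INSERT INTO signon_policies VALUES
    ('strict',  1),
    ('legacy',  0),
    ('default', NULL);  -- flag never set: effectively not enforcing MFA
""")

# Naive check: NULL != 1 evaluates to NULL, so 'default' is missed.
naive = conn.execute(
    "SELECT policy_name FROM signon_policies WHERE mfa_required != 1"
).fetchall()

# Robust check: treat an unset flag as not enforced.
robust = conn.execute(
    "SELECT policy_name FROM signon_policies "
    "WHERE COALESCE(mfa_required, 0) != 1"
).fetchall()

print([n for (n,) in naive])   # ['legacy']
print([n for (n,) in robust])  # ['legacy', 'default']
```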

7. Reproducibility and Future Applications

The benchmark is released with Dockerized deployment scripts, synthetic but production-faithful environment data, ingestion and query pipelines, complete annotated question sets, and thorough documentation. This enables practical reproducibility and cross-system benchmarking for researchers evaluating custom or novel agentic ISPM architectures.

Future directions will expand beyond visibility to integrate benchmarks for cross-platform correlation, behavioral anomaly detection, risk scoring, automated mitigation planning, and governance standard alignment (such as NIST 800-53, CIS controls). This suggests a broadening of ISPM evaluation from pure observability to active identity posture management and automated remediation capabilities.

The Sola Visibility ISPM Benchmark establishes an extensible and evidence-based foundation for rigorous benchmarking of agentic AI in enterprise identity security (Engelberg et al., 11 Jan 2026).
