PWAgent: Web & Power System Workflows

Updated 2 July 2026

PWAgent refers to two architecturally distinct agentic AI systems: one automates paper-to-web transformation and the other manages engineering workflows in power systems.
The webification pipeline leverages multi-stage LLM parsing, asset decomposition, and iterative refinement using analytical priors for optimal layout balance.
The power system agent employs a structured JSON interface with risk-sensitive metrics and evidence-backed reporting for contingency analysis and mitigation.

PWAgent refers to two architecturally and operationally distinct classes of agentic AI systems, each emerging in separate research domains: (1) multi-stage LLM-driven pipelines for automated transformation of scientific papers into interactive web resources, and (2) structured tool-using agents for conducting engineering workflows within power-system operation and planning. The former is exemplified by the pipeline in Paper2Web, which sets the Pareto state of the art for academic web presentation (Chen et al., 17 Oct 2025). The latter, as instantiated in PowerAgentBench-SS, rigorously tests agentic compliance and performance for safety-critical steady-state studies in electrical networks (Mylonas et al., 17 Jun 2026). Both emphasize workflow orchestration, agent-tool interfaces, resource management, and risk-sensitive evaluation, but they target fundamentally different problem spaces and design constraints.

1. Autonomous Pipeline for Scientific Paper Webification

In the context of scholarly communication, PWAgent is a multi-stage agentic pipeline designed to convert static PDFs of scientific papers into multimedia-rich, interactive project homepages. The architecture decomposes the transformation process into three primary modules: Paper Decomposition, MCP Ingestion, and Agent-Driven Iterative Refinement.

Paper Decomposition begins with PDF-to-Markdown conversion (e.g., via MARKER or DOCLING), followed by LLM parsing that extracts a predefined schema: TextualAssets (section titles, synopses, full text), VisualAssets (figures/tables, captions, back-references), and LinkAssets (types including code, data, citations). The output is structured as machine-readable JSON/Markdown resources.
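The extracted schema can be pictured as a small set of typed records. The sketch below is illustrative only: the three asset categories come from the description above, while every field name, the `PaperAssets` container, and the `to_json` helper are assumptions rather than the Paper2Web implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TextualAsset:
    title: str      # section title
    synopsis: str   # short summary of the section
    full_text: str  # complete section text


@dataclass
class VisualAsset:
    kind: str       # "figure" or "table"
    caption: str
    back_references: List[str] = field(default_factory=list)  # sections that cite it


@dataclass
class LinkAsset:
    kind: str       # "code", "data", or "citation"
    url: str


@dataclass
class PaperAssets:
    """Hypothetical container for the parsed output of one paper."""
    textual: List[TextualAsset] = field(default_factory=list)
    visual: List[VisualAsset] = field(default_factory=list)
    links: List[LinkAsset] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


assets = PaperAssets(
    textual=[TextualAsset("Introduction", "Motivation and scope.", "Full text ...")],
    visual=[VisualAsset("figure", "Pipeline overview.", ["Introduction"])],
    links=[LinkAsset("code", "https://example.org/repo")],
)
print(assets.to_json())
```

Keeping the three categories as separate record types makes it straightforward to attach per-asset metadata later, for example a resource ID at ingestion time.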

MCP Ingestion passes these assets through a Model Context Protocol server into a Resource Repository where each is indexed with a unique resource ID, metadata, and a layout budget. Cross-modal alignment tools connect images to related narrative and classify link assets by function. A spatial allocation heuristic assigns each asset a provisional footprint $b_i = \alpha\,\text{semantic\_importance}(i) + (1-\alpha)\,\text{visual\_weight}(i)$, normalized such that $\sum_i b_i = 1$, thereby guiding grid layout generation.
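A minimal sketch of the footprint heuristic, assuming each asset already carries a semantic-importance and a visual-weight value in arbitrary non-negative units. The function name, the input layout, and the value of `alpha` are hypothetical; only the weighted sum and the normalization to 1 come from the formula above.

```python
from typing import Dict, Tuple


def allocate_footprints(
    assets: Dict[str, Tuple[float, float]], alpha: float = 0.5
) -> Dict[str, float]:
    """Map asset id -> normalized footprint b_i.

    `assets` maps an id to (semantic_importance, visual_weight).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Weighted combination of the two signals for every asset.
    raw = {
        key: alpha * semantic + (1.0 - alpha) * visual
        for key, (semantic, visual) in assets.items()
    }
    total = sum(raw.values())
    if total <= 0.0:
        raise ValueError("at least one asset needs a positive raw footprint")
    # Normalize so the footprints sum to 1.
    return {key: value / total for key, value in raw.items()}


budgets = allocate_footprints(
    {"abstract": (0.9, 0.1), "fig1": (0.5, 0.9), "code_link": (0.2, 0.0)},
    alpha=0.6,
)
print(budgets)
```

With these made-up inputs the figure receives the largest share because its visual weight outweighs the abstract's higher semantic importance.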

The Agent-Driven Iterative Refinement process uses an Orchestrator multimodal LLM that inspects rendered screenshots, segments them into tiles, maps each tile to HTML fragments, and diagnoses and corrects misalignments or hierarchy problems. Refinement proceeds through local passes (per tile), merge passes (inspired by merge sort, to handle interdependencies between tiles), and global passes (connectivity, completeness, visual harmony), looping until convergence or a pre-specified maximum number of cycles. Component-specific protocols (for navbars, hero sections, etc.) enforce structured corrections.

2. Algorithms, Heuristics, and Objectives in Webification

PWAgent's layout and refinement are governed by analytically defined priors:

The Image–Text Balance Prior penalizes deviation from the ideal 1:1 image-to-text ratio, $D = |r_{\rm img/txt} - 1|$, using $\zeta = \frac{5}{1+\gamma D}$ and $S_{\rm img\text{-}txt} = 5 - \zeta$.
The Information Efficiency Prior penalizes excessive verbosity: for $r = L/W$ (generated text length vs. median human), $p(r) = \frac{5}{1+\beta\max(0, r-1)}$.

These priors feed into the agent's completeness and connectivity scoring and can trigger back-propagated adjustments during MCP ingestion.
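As a worked example of the Information Efficiency Prior, the sketch below evaluates $p(r)$ for two text lengths. The value of $\beta$ is not given in the text, so `beta=1.0` is a placeholder, and the lengths are invented for illustration.

```python
def info_efficiency(length: float, median_human_length: float, beta: float = 1.0) -> float:
    """p(r) = 5 / (1 + beta * max(0, r - 1)) with r = L / W."""
    if median_human_length <= 0:
        raise ValueError("median_human_length must be positive")
    r = length / median_human_length
    # No penalty while the generated text is at most as long as the human median.
    return 5.0 / (1.0 + beta * max(0.0, r - 1.0))


concise = info_efficiency(length=150, median_human_length=200)  # r = 0.75
verbose = info_efficiency(length=300, median_human_length=200)  # r = 1.5
print(concise, verbose)
```

Text at or below the human median keeps the full score of 5, while text 50% longer than the median drops to 5/1.5 under the placeholder $\beta = 1$.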

The core iterative refinement logic is a loop. At each iteration, the draft HTML is rendered and the screenshot is segmented; issues in the corresponding fragments are diagnosed, and localized patches are generated and applied. The loop terminates when no further edits are needed or when the maximum number of passes is reached.
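The control flow of that loop can be sketched in a few lines. Everything here is a stand-in: `diagnose` replaces the render, screenshot, and multimodal-LLM inspection steps, and `toy_diagnose` is a hypothetical rule-based checker used only so the example runs.

```python
from typing import Callable, Dict, List, Tuple

Patch = Dict[int, str]  # tile index -> corrected HTML fragment


def refine(
    fragments: List[str],
    diagnose: Callable[[List[str]], Patch],
    max_passes: int = 5,
) -> Tuple[List[str], int]:
    """Patch fragments until no edits are proposed or the pass budget is spent.

    Returns the final fragments and the number of passes that applied edits.
    """
    current = list(fragments)
    for applied in range(max_passes):
        patch = diagnose(current)  # stands in for render + segment + inspect
        if not patch:
            return current, applied  # converged: nothing left to fix
        for index, fixed in patch.items():
            current[index] = fixed
    return current, max_passes


def toy_diagnose(fragments: List[str]) -> Patch:
    # Hypothetical check: flag the first image that lacks alt text.
    for index, fragment in enumerate(fragments):
        if "<img" in fragment and "alt=" not in fragment:
            return {index: fragment.replace("<img", '<img alt=""', 1)}
    return {}


draft = ["<nav>Home</nav>", '<img src="a.png">', '<img src="b.png">']
final, passes = refine(draft, toy_diagnose)
print(passes, final)
```

Returning the pass count alongside the result makes it easy to tell convergence apart from hitting the cap.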

3. Evaluation Methodologies for Academic Webpage Generation

PWAgent is subject to a multi-dimensional evaluation framework:

Connectivity ($S_{\rm Con}$): Rule-based, computed as the mean of the external-link and internal-link scores, $S_{\rm Con} = \frac{1}{2} (S_{\rm external} + S_{\rm internal})$.
Completeness ($S_{\rm Comp}$): The average of the image-text balance and information-efficiency prior scores.
MLLM-as-a-Judge: Holistic, human-verified scoring (1–5) on interactivity, aesthetics, and informativeness.
PaperQuiz: LLM-generated, region-anchored QA (25 verbatim, 25 interpretive) administered to six reader models and scored for accuracy. A verbosity penalty tied to the image-text prior discourages text overload.
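Two of these scores are simple enough to sketch directly. The connectivity function follows the averaging formula given above; the PaperQuiz helper assumes accuracy is averaged uniformly over reader models, which the text does not spell out, and it omits the verbosity penalty. All names and sample values are hypothetical.

```python
from typing import Dict, List


def connectivity(s_external: float, s_internal: float) -> float:
    """S_Con as the mean of the external and internal link scores."""
    return 0.5 * (s_external + s_internal)


def quiz_accuracy(reader_answers: Dict[str, List[str]], key: List[str]) -> float:
    """Mean per-reader accuracy against the answer key (assumed aggregation)."""
    per_reader = []
    for answers in reader_answers.values():
        correct = sum(given == expected for given, expected in zip(answers, key))
        per_reader.append(correct / len(key))
    return sum(per_reader) / len(per_reader)


answer_key = ["A", "B", "C", "D"]
readers = {
    "reader_1": ["A", "B", "C", "D"],  # all correct
    "reader_2": ["A", "B", "D", "D"],  # one mistake
}
print(connectivity(3.0, 3.2), quiz_accuracy(readers, answer_key))
```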

Empirically, PWAgent achieves the best or near-best scores in connectivity (3.10), completeness (3.56, tied with the leader), and interactivity (a 59% improvement over alphaXiv), and it takes the top PaperQuiz rank after the verbosity penalty (2.03). Its token-based cost is \$0.025 per page, a reduction of between 54% and 82% relative to GPT-4o and Gemini, while maintaining near-oracle layout quality (Chen et al., 17 Oct 2025).

4. PWAgent as Workflow Agent in Power System Studies

In power system operation, PWAgent is an agent type evaluated via the PowerAgentBench-SS benchmark. Here, it is defined not by web synthesis, but by its ability to execute end-to-end engineering workflows matching those of a transmission-security engineer.

The task instance is formalized as a tuple encompassing topology, operating point, study set, toolset, admissible actions, validation budget, and a hidden evaluator. The PWAgent operates by selecting and sequencing externally defined tools (an API for DC-thermal models), running contingency analysis, proposing mitigations, and producing evidence-logged reports.

The agent operates through a strict JSON command/schema interface, with all tool calls (including validate, redispatch, and final submit) compulsorily logged and subject to post-hoc scoring by the hidden evaluator. Evidence-backed reporting is mandatory: only claims corresponding to explicit validation calls are credited.
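The interaction pattern described here (JSON commands, mandatory logging, a validation budget, and credit only for validated claims) can be sketched as follows. This is not the PowerAgentBench-SS API: the class, the JSON field names `tool` and `case`, and the status strings are assumptions; only the three tool names come from the text.

```python
import json
from typing import Any, Dict, List, Set

ALLOWED_TOOLS = ("validate", "redispatch", "submit")  # tool names from the text


class ToolSession:
    """Hypothetical session that logs every command and tracks validations."""

    def __init__(self, validation_budget: int) -> None:
        self.budget = validation_budget
        self.log: List[Dict[str, Any]] = []
        self.validated: Set[str] = set()

    def handle(self, raw: str) -> Dict[str, Any]:
        """Parse one JSON command, log it, and return its log entry."""
        try:
            cmd = json.loads(raw)
        except json.JSONDecodeError:
            cmd = None
        entry: Dict[str, Any] = {"raw": raw, "status": "invalid"}
        if isinstance(cmd, dict) and cmd.get("tool") in ALLOWED_TOOLS:
            if cmd["tool"] != "validate":
                entry["status"] = "ok"
            elif isinstance(cmd.get("case"), str):
                if self.budget > 0:
                    self.budget -= 1
                    self.validated.add(cmd["case"])
                    entry["status"] = "ok"
                else:
                    entry["status"] = "budget_exhausted"
        self.log.append(entry)  # every command is logged, valid or not
        return entry

    def credited(self, claims: List[str]) -> List[str]:
        """Keep only claims backed by an explicit validation call."""
        return [claim for claim in claims if claim in self.validated]


session = ToolSession(validation_budget=1)
first = session.handle('{"tool": "validate", "case": "L1-L2"}')
second = session.handle('{"tool": "validate", "case": "L3-L4"}')
broken = session.handle("not json")
print(session.credited(["L1-L2", "L3-L4"]))
```

Logging malformed commands rather than dropping them is what makes diagnostics such as invalid-command rates recoverable after the run.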

5. Risk-Sensitive Metrics and Evaluation in Power Systems

Evaluation of PWAgent in PowerAgentBench-SS is driven by risk-sensitive metrics:

Recall Metrics:
- Reported recall: Fraction of the true top-$K$ contingencies submitted, regardless of validation.
- Evidence-backed recall: Fraction both reported and explicitly validated.
- Validated recall: Fraction validated, even if not submitted.
Worst-case/Regret:
- True worst case: Highest severity outage in the hidden set.
- Discovered worst case: Highest severity discovered.
- Regret: The gap between the two.
False-safe Penalties:
- FSR: Fraction of undetected "dangerous" cases (top 5% by severity).
- SWFN: Severity-weighted false negative rate.
Mitigation Metrics:
- Reduction in average severity after the agent's recommended action, with action costs tracked per solution.
Tool-use Diagnostics:
- Budget utilization, invalid command rates, duplicate validations, and schema enforcement.
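The recall variants, the worst-case gap, and the two false-safe penalties can be computed from three inputs: the hidden severities, the submitted set, and the validated set. The sketch below rests on several assumptions that the text does not confirm: a case counts as detected when it was validated, regret is the unnormalized difference between the true and discovered maxima, SWFN is the severity share of missed dangerous cases, and `validated` contains only known case ids. All names are hypothetical.

```python
import math
from typing import Dict, Set


def risk_metrics(
    severity: Dict[str, float],
    submitted: Set[str],
    validated: Set[str],
    k: int,
    danger_frac: float = 0.05,
) -> Dict[str, float]:
    ranked = sorted(severity, key=severity.get, reverse=True)
    top_k = set(ranked[:k])
    # "Dangerous" cases: the top danger_frac of all cases by severity.
    n_danger = max(1, math.ceil(round(danger_frac * len(ranked), 9)))
    dangerous = set(ranked[:n_danger])
    missed = dangerous - validated  # assumption: detected means validated

    worst_true = severity[ranked[0]]
    worst_found = max((severity[c] for c in validated), default=0.0)
    return {
        "recall_reported": len(top_k & submitted) / k,
        "recall_evidence": len(top_k & submitted & validated) / k,
        "recall_validated": len(top_k & validated) / k,
        "regret": worst_true - worst_found,  # assumed unnormalized gap
        "fsr": len(missed) / len(dangerous),
        "swfn": sum(severity[c] for c in missed)
        / sum(severity[c] for c in dangerous),
    }


# Toy data: 40 cases whose severity equals their index.
toy_severity = {f"c{i}": float(i) for i in range(40)}
metrics = risk_metrics(
    toy_severity,
    submitted={"c39", "c38", "c3"},
    validated={"c38", "c37", "c3"},
    k=4,
)
print(metrics)
```

In the toy run the agent reports the worst case without validating it, so reported recall exceeds evidence-backed recall and the false-safe rate is nonzero.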

Pilot tasks are run on the IEEE 39-bus DC-thermal network (46 lines, 1035 N-2 cases, with a fixed validation budget and report size) (Mylonas et al., 17 Jun 2026).
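The case count is consistent with treating each N-2 contingency as an unordered pair of distinct lines, since choosing 2 of 46 lines gives 46 × 45 / 2 = 1035. A short check, with line indices standing in for the actual network branches:

```python
from itertools import combinations
from math import comb

N_LINES = 46  # line count stated for the IEEE 39-bus pilot network

# Enumerate every unordered pair of distinct lines as one N-2 case.
n2_cases = list(combinations(range(N_LINES), 2))

print(len(n2_cases), comb(N_LINES, 2))
```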

6. Comparative Results and Workflow Diagnostics

Key findings from benchmark experiments reveal that solver-only or answer-only evaluation is insufficient. For example, the base-loading heuristic covers the broad top-20, but hybrid strategies combining approximate screening and preventive redispatch identify the single worst case every time (Best=1.0) and reduce post-action severity by 7.7%.

LLM-based agents display variable performance. Qwen3.5 uses its full budget with an optimal evidence rate but only matches LODF-screen performance. Mistral-Nemo validates few cases, produces duplicates, and misses most top events. Command-R struggles with schema compliance, leading to a high value on one metric but a near-zero value on another. OpenAI GPT-5.5 obtains the highest score among the LLM agents and achieves a clean workflow (no duplicates, complete budget use), but does not surpass simple heuristics for coverage. No LLM agents attempted mitigation under strict evaluation (Mylonas et al., 17 Jun 2026).

Workflow diagnostics further expose substantial differences in budget awareness, premature stopping, and the impact of strict schema enforcement.

7. Significance, Implications, and Future Directions

PWAgent, as exemplified in both Paper2Web and PowerAgentBench-SS, demonstrates two key developments:

In the automation of academic dissemination, agentic, tool-using pipelines surpass template and end-to-end LLM outputs in layout balance, interactivity, informativeness, and cost efficiency by iteratively refining decomposed resources under tight computational budgets (Chen et al., 17 Oct 2025).
In safety-critical engineering, PWAgent's evidence-backed, budget-constrained workflow, enforced by structured API contracts and hidden scoring, highlights the need for auditability, schema compliance, and risk sensitivity in LLM agent evaluation. Solver-only or end-to-end answer paradigms are shown to be inadequate when workflow traceability and operational risk must be explicitly managed (Mylonas et al., 17 Jun 2026).

A plausible implication is that future agentic AI systems spanning scientific and engineering domains will increasingly require modular architectures, explicit tool contracts, and multidimensional risk-aware evaluation, converging on best practices exemplified by these PWAgent deployments.


References (2)

Paper2Web: Let's Make Your Paper Alive! (2025)

PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies (2026)