SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT

Published 7 Apr 2026 in cs.SE, cs.AI, cs.CL, and cs.IR | (2604.05711v1)

Abstract: Web applications rely heavily on hyperlinks to connect disparate information resources. However, the dynamic nature of the web leads to link rot, where targets become unavailable, and more insidiously, semantic drift, where a valid HTTP 200 connection exists, but the target content no longer aligns with the source context. Traditional verification tools, which primarily function as crash oracles by checking HTTP status codes, often fail to detect semantic inconsistencies, thereby compromising web integrity and user experience. While LLMs offer semantic understanding, they suffer from high latency, privacy concerns, and prohibitive costs for large-scale regression testing. In this paper, we propose SemLink, a novel automated test oracle for semantic hyperlink verification. SemLink leverages a Siamese Neural Network architecture powered by a pre-trained Sentence-BERT (SBERT) backbone to compute the semantic coherence between a hyperlink's source context (anchor text, surrounding DOM elements, and visual features) and its target page content. To train and evaluate our model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, a rigorously constructed corpus of over 60,000 semantic pairs. Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources. This work bridges the gap between traditional syntactic checkers and expensive generative AI, offering a robust and efficient solution for automated web quality assurance.

Authors (5)

Summary

The paper introduces SemLink, a semantic-aware test oracle using Siamese SBERT that detects soft link rot and verifies hyperlink relevance.
It employs deep contextual extraction from anchor text, DOM elements, and OCR, achieving a 96.00% recall while processing links in under 0.1s.
The released HWPPs dataset of 60,000+ hyperlink pairs provides a robust benchmark for advancing automated regression testing and web integrity validation.

SemLink: Advancing Semantic-Aware Automated Test Oracles for Hyperlink Verification

Introduction: Semantic Drift and the Test Oracle Problem

The persistent evolution of web applications necessitates robust link verification, as the integrity of hyperlinks is foundational to reliable navigation and information retrieval. Traditional link checkers are limited by their reliance on crash oracles, namely HTTP status codes, and consequently fail to identify cases of semantic drift (“soft link rot”) where the HTTP request succeeds but the destination content is no longer contextually relevant to the link source. This problem is central to modern regression pipelines, where the need is for functional oracles capable of verifying semantic alignment rather than mere syntactic connectivity.

The “SemLink” approach addresses these deficiencies by leveraging the discriminative capabilities of a Siamese Sentence-BERT (SBERT) architecture, specifically targeting scalable, low-latency semantic verification. In contrast to LLM-based methods, which offer strong semantic understanding but are prohibitively resource-intensive for large-scale regression testing, SemLink is designed for real-time, high-throughput environments.

Semantic Test Oracle Architecture

SemLink adopts a Siamese Neural Network paradigm, employing a shared-weight SBERT backbone to encode both hyperlink source context and corresponding webpage content into dense vector representations. The system’s architecture is structured in three main stages: feature extraction, bi-encoder processing, and a comparison-classification stage via MLP.

Figure 1: The SemLink Siamese Network Architecture utilizes shared SBERT encoders and an MLP on the absolute embedding difference for semantic classification.
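The comparison stage described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the two embeddings have already been produced by the shared SBERT encoder, and the class `TinyMLP`, its layer sizes, and its random initialization are hypothetical stand-ins for the trained classifier.

```python
import math
import random


def abs_diff(u, v):
    """Element-wise absolute difference |u - v| of two equal-length embeddings."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [abs(a - b) for a, b in zip(u, v)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """Hypothetical head: one hidden ReLU layer followed by a sigmoid output."""

    def __init__(self, dim, hidden=8, seed=0):
        rng = random.Random(seed)
        # Untrained random weights; a real head would be learned from labeled pairs.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return sigmoid(logit)


def relevance_score(source_emb, target_emb, mlp):
    """Score in (0, 1) computed from the absolute embedding difference."""
    return mlp.forward(abs_diff(source_emb, target_emb))
```

Because the head only sees the absolute difference, the score is symmetric in its two inputs, which matches the shared-weight design: swapping source and target embeddings does not change the result.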

Feature extraction extends beyond anchor text, incorporating Side-Text from the DOM, image-based information via OCR, and structural headers within the target document. This multicomponent extraction is critical for overcoming the limitations of short or ambiguous anchors (e.g., “Read More”).

For pairwise semantic evaluation, SemLink computes the element-wise absolute difference between the two embeddings and aggregates multi-component scores through a position-aware weighting heuristic, optimizing for semantic proximity while countering noise from DOM distance. Training with BCE loss (with empirically set hyperparameters $\lambda_1=0$, $\lambda_2=1$) was shown to be strictly superior to hybridized BCE+Triplet loss formulations in this application domain.
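A rough sketch of the two ideas in this paragraph follows. Several details are assumptions rather than facts from the summary: that $\lambda_1$ weights the triplet term and $\lambda_2$ the BCE term, that the triplet loss uses a margin of 1.0, and that position-aware weights decay as $1/(1+d)$ with DOM distance $d$. All function names are hypothetical.

```python
import math


def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def triplet_loss(d_pos, d_neg, margin=1.0):
    """Standard margin-based triplet loss on anchor-positive / anchor-negative distances."""
    return max(0.0, d_pos - d_neg + margin)


def combined_loss(p, y, d_pos, d_neg, lam_triplet=0.0, lam_bce=1.0):
    # ASSUMPTION: lambda_1 scales the triplet term and lambda_2 the BCE term.
    # With the defaults (0, 1) the triplet term is switched off entirely.
    return lam_triplet * triplet_loss(d_pos, d_neg) + lam_bce * bce_loss(p, y)


def aggregate_scores(scored_components):
    """Weighted mean of (score, dom_distance) pairs.

    ASSUMPTION: weight = 1 / (1 + dom_distance), so text closer to the
    anchor in the DOM contributes more. The paper's actual weighting may differ.
    """
    weights = [1.0 / (1.0 + dist) for _, dist in scored_components]
    total = sum(weights)
    return sum(w * score for w, (score, _) in zip(weights, scored_components)) / total
```

With the default weights, `combined_loss` reduces to plain BCE, which is the configuration the text reports as performing best.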

Hyperlink-Webpage Positive Pairs (HWPPs) Dataset

Progress in semantic hyperlink verification is bottlenecked by the absence of large-scale, domain-diverse benchmarks. SemLink introduces the HWPPs corpus, collecting over 60,000 curated pairs from 500 diverse, actively maintained websites under the Maintenance Assumption, thus ensuring a high proportion of semantically valid examples. The data pipeline utilizes Selenium-powered dynamic content extraction, rigorous pair validation, and Side-Text analysis through DOM traversal and weighting.

Quantitative analysis demonstrates strong baseline semantic alignment in the dataset, with over 50% of anchor/title pairs scoring a cosine similarity above $0.9$ prior to fine-tuning, confirming the validity of HWPPs for supervised training.
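The kind of check described here can be sketched as follows, assuming each anchor/title pair has already been embedded as two plain lists of floats. The helper names and the toy vectors in the usage note are illustrative only and are not drawn from HWPPs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fraction_above(pairs, threshold=0.9):
    """Share of (anchor_emb, title_emb) pairs whose cosine similarity exceeds threshold."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for u, v in pairs if cosine_similarity(u, v) > threshold)
    return hits / len(pairs)
```

For example, `fraction_above([([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])])` counts one of the two toy pairs as aligned.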

Empirical Results: Effectiveness, Efficiency, and Scalability

SemLink’s efficacy was benchmarked against leading LLMs (GPT-5.2, GPT-4o, Llama-3-8B/70B), using a manually annotated test set. SemLink achieves a Recall of 96.00% and F1-score of 92.93%, outperforming Llama-3-70B (95.30%) and GPT-3.5 Turbo (89.65%) in recall, while consuming dramatically fewer computational resources.
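For readers who want to relate the reported metrics to one another, the sketch below gives the standard definitions and solves the F1 formula for precision. The precision figure it yields is derived from the two numbers quoted above, not taken from the paper, so it should be read as an implied value under the usual definition $F_1 = 2PR/(P+R)$.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def implied_precision(r, f1):
    """Solve F1 = 2PR / (P + R) for P, given recall r and F1."""
    denom = 2 * r - f1
    if denom <= 0:
        raise ValueError("no valid precision for these values")
    return f1 * r / denom
```

Calling `implied_precision(0.96, 0.9293)` gives roughly 0.90, which suggests that the reported recall comes at the cost of some false alarms on valid links.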

Ablation analysis reveals that Side-Text and image-based feature augmentation substantially increase recall, demonstrating the inadequacy of anchor-only strategies for real-world links. Inclusion of OCR and attribute extraction for image-based anchors yields the highest F1 in multimodal contexts.

Efficiency Analysis: Toward Real-Time CI/CD Integration

A core motivation for discriminative architectures is operational efficiency. SemLink processes 30.87 links/sec on consumer GPUs, which is 47.5x faster than GPT-5.2 and 300x faster than Llama-3-70B. Only SemLink operates within the real-time threshold ($<0.1$ s/link), a critical constraint for enterprise-scale, per-commit regression pipelines.

Figure 2: Efficiency vs. Performance. SemLink (Green Star) is the sole approach within the real-time threshold ($<0.1$ s/link).
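The throughput figures can be converted into per-link latency with simple arithmetic. The sketch below assumes that "47.5x faster" and "300x faster" are throughput ratios relative to SemLink's 30.87 links/sec; the baseline rates it computes are therefore derived estimates, not measurements from the paper.

```python
REAL_TIME_THRESHOLD_S = 0.1  # real-time threshold quoted in the text (seconds per link)

SEMLINK_RATE = 30.87                  # links per second, as reported
# ASSUMPTION: the speed-up factors are ratios of throughput.
GPT52_RATE = SEMLINK_RATE / 47.5
LLAMA70B_RATE = SEMLINK_RATE / 300.0


def seconds_per_link(links_per_second):
    return 1.0 / links_per_second


def is_real_time(links_per_second, threshold_s=REAL_TIME_THRESHOLD_S):
    return seconds_per_link(links_per_second) < threshold_s


def oracle_seconds(num_links, links_per_second):
    """Time spent in the oracle alone for a suite of num_links links."""
    return num_links / links_per_second
```

Under this reading, SemLink spends about 0.032 s per link, against roughly 1.5 s for GPT-5.2 and 9.7 s for Llama-3-70B, so a 10,000-link suite would need a little over five minutes of oracle time with SemLink.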

Since true real-world bottlenecks occur at the web-crawling layer, the magnitude of SemLink’s acceleration ensures that the test oracle component never dominates regression suite runtime, enabling continuous delivery at scale.

Qualitative Analysis: Robustness Against Soft Failures and Contextual Disambiguation

SemLink’s discriminative capability is best illustrated in scenarios where traditional tools collapse into false positives:

Detecting “Soft 404s”: Unlike HTTP status checkers, SemLink marks a link as Irrelevant when the target returns HTTP 200 but displays an error message, because the semantic comparison between source context and target content fails.
Figure 3: “Soft Link Rot” example—SemLink correctly marks HTTP 200 error pages as Irrelevant, where traditional tools fail.
Identifying Semantic Drift: SemLink robustly detects when anchor promises (e.g., specific news content) drift to generalized or inappropriate landing pages.
Figure 4: Illustration of Semantic Drift—SemLink identifies generic landing pages as Irrelevant when anchor-context overlap is low.
Heuristic Disambiguation of Generic Anchors: By extracting weighted parent and sibling DOM context, SemLink successfully validates non-descriptive anchors like “Read More”.
Figure 5: The Side-Text heuristic disambiguates generic anchor text by leveraging parent headers.
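To make the Side-Text idea concrete, here is a deliberately simplified sketch using Python's standard `html.parser`. It pairs each link with the nearest preceding heading and prepends that heading only when the anchor text is generic. This is an assumption-laden stand-in: the paper describes weighted parent and sibling DOM context, whereas this version uses document order only, and the `GENERIC_ANCHORS` set and function names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical list of non-descriptive anchor texts.
GENERIC_ANCHORS = {"read more", "click here", "learn more", "here", "more"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SideTextParser(HTMLParser):
    """Collects (anchor_text, nearest_preceding_heading, href) triples."""

    def __init__(self):
        super().__init__()
        self.last_heading = ""
        self.links = []
        self._in_heading = False
        self._in_anchor = False
        self._buf = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._buf = []
        elif tag == "a":
            self._in_anchor = True
            self._buf = []
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_heading or self._in_anchor:
            self._buf.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if tag in HEADING_TAGS and self._in_heading:
            self.last_heading = text
            self._in_heading = False
        elif tag == "a" and self._in_anchor:
            self.links.append((text, self.last_heading, self._href))
            self._in_anchor = False


def source_context(anchor_text, heading):
    """Prepend the heading only when the anchor alone is non-descriptive."""
    if anchor_text.strip().lower() in GENERIC_ANCHORS and heading:
        return f"{heading}. {anchor_text}"
    return anchor_text
```

Feeding the parser a card such as `<h2>Quarterly Earnings Report</h2> ... <a href="/q3">Read More</a>` produces a source context that carries the heading, which gives the encoder something meaningful to compare against the target page.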

Failure analysis indicates that false negatives predominantly arise from expected login redirects and visually rich (text-poor) targets. This is a known limitation of text-only discriminative models and a direction for future integration with vision-LLMs.

Figure 6: A False Negative. Links redirected to login portals are marked Irrelevant due to lack of semantic overlap, despite being functionally valid.

Implications and Future Directions

By bridging the gap between low-resource, syntactic checkers and costly, privacy-exposing generative LLM pipelines, SemLink offers a pragmatic solution for semantic verification in automated software testing. The architecture is readily extensible to CMS migration, bulk redirect validation, and regression testing post-deployment.

Practically, the public release of HWPPs provides a new foundation for reproducibility and further research at the intersection of NLP and web engineering. Theoretically, SemLink’s strong recall and speed profile demonstrate the advantage of task-specific discriminative models in settings where recall and latency matter more than ceiling performance. Future work includes incorporating vision-LLMs for image-rich websites and employing GNNs for site-wide semantic flow analysis.

Conclusion

SemLink advances the state of automated hyperlink verification by introducing a domain-optimized, real-time semantic test oracle based on Siamese SBERT. The approach leverages deep contextual DOM extraction and a novel, weighted aggregation mechanism to deliver recall metrics on par with LLMs but at orders-of-magnitude higher throughput and lower cost. By releasing the HWPPs corpus and demonstrating strong empirical results, SemLink catalyzes further work on robust, scalable, and semantically aware software testing methodologies.

Reference: "SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT" (2604.05711)