JobSearch-XS: Hybrid Job Matching

Updated 3 July 2026

JobSearch-XS is an open, diagnostic benchmark for hybrid retrieval in job and candidate matching, integrating lexical, semantic, and graph-based evidence.
It enables zero-shot skill generalization with standardized corpora and annotation protocols to handle skill synonyms and nonlinear career paths.
The framework features a modular pipeline with explainable reranking using LLM-based rationales and controlled ablation for diagnostic slicing.

JobSearch-XS is an open, diagnostic benchmark and methodological framework for evaluating hybrid retrieval systems in job and candidate matching, integrating explicit skills, structured metadata, knowledge graphs, and explainable reranking. The system was introduced to address limitations of keyword-based job search—specifically, inadequate handling of skill synonyms, inability to model nonlinear career paths, and opacity of match explanations—by providing a reproducible platform for skill-centric, semantic, and reasoning-aware retrieval across diverse real-world recruitment scenarios (Vyaas et al., 15 Mar 2026). The JobSearch-XS suite features a standardized evaluation corpus, annotation protocols for skill generalization, and modular pipelines for fusion of lexical, semantic, and graph-based evidence.

1. Motivation and Conceptual Foundations

Classical job search engines and platforms deploy keyword-centric filters and basic attribute matching (BM25 or proprietary heuristics). These approaches fail on skill synonymy (“K8s”/“Kubernetes”) and nonlinear trajectories (promotion, reskilling moves), and they offer no transparent ranking explanations. JobSearch-XS was conceived to:

Enable reproducible, open evaluation for hybrid job/candidate retrieval models.
Benchmark zero-shot generalization to skills unseen at training-time, via explicit skill-disjoint train/dev/test query splits.
Support rapid method iteration (large “silver” labels; precise “gold” labels) while providing high-fidelity human evaluation.
Encourage research into slice-aware diagnostics: where do retrieval pipelines fail, and which module bottlenecks emerge as skill requirements, query complexity, or domain distribution shift?

This orientation reflects an empirical gap in prior research, which primarily reported aggregate scores and neglected skill/competency inference and systematic error mapping (Vyaas et al., 15 Mar 2026).

2. Dataset Composition and Labeling

JobSearch-XS centers on a fixed, public snapshot from NYC Open Data (“NYC Jobs”: 1,283 postings) designed for structured, skill-centric retrieval tasks (Vyaas et al., 15 Mar 2026). Each job record encodes title, full description, salary, location, skill/qualification requirements, and seniority. The benchmark supplies:

30 synthetic queries—10 title, 10 natural language paraphrase, 10 skill-synonym—to probe retrieval generalization across lexical, semantic, and knowledge-graph axes.
Two label sets:
- ~29,000 “silver” query–job pairs, automatically annotated by Jaccard overlap on canonicalized skills (threshold τ=0.3) using an ESCO-style synonym table to address surface-form variability.
- 40 “gold” query–job pairs (2 per dev/test query), manually double-annotated (Cohen’s κ≈0.85) for high-quality metrics.
Skill normalization pipeline: grounded in a Neo4j KG with ESCO-style mapping; all raw forms (“K8s”, “ML”) are mapped to canonical skill IDs.
Explicit skill-disjoint splitting: 10 train, 10 dev, 10 test queries with ≈60% of dev/test skill tokens unseen during training, enforcing realistic zero-shot evaluation.
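The silver-labeling rule described above (Jaccard overlap on canonicalized skills with τ=0.3) can be illustrated with a minimal Python sketch. The synonym table, the function names, and the choice of an inclusive threshold (≥ τ) are assumptions made for illustration; the benchmark's actual ESCO-style mapping and comparison rule may differ.

```python
# Hypothetical synonym table: surface form -> canonical skill ID.
# The real benchmark uses an ESCO-style mapping; these entries are illustrative.
SYNONYMS = {
    "k8s": "kubernetes",
    "kubernetes": "kubernetes",
    "ml": "machine_learning",
    "machine learning": "machine_learning",
}


def canonicalize(skills):
    """Map raw skill strings to canonical IDs; unknown forms are lowercased and kept."""
    canonical = set()
    for raw in skills:
        key = raw.strip().lower()
        canonical.add(SYNONYMS.get(key, key))
    return canonical


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def silver_label(query_skills, job_skills, tau=0.3):
    """Return True when canonicalized skill overlap reaches the threshold.

    Assumption: the threshold is inclusive (>= tau).
    """
    return jaccard(canonicalize(query_skills), canonicalize(job_skills)) >= tau


if __name__ == "__main__":
    # "K8s" and "Kubernetes" collapse to one canonical ID before comparison.
    print(silver_label(["K8s", "Python"], ["Kubernetes", "ML", "python"]))
```

Canonicalizing before computing the overlap is what lets a query that says “K8s” match a posting that says “Kubernetes”; without it, the two surface forms would count as different skills.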

The design ensures both large-scale coverage for rapid iteration and high-quality, diagnostic analysis via skill synonym and structured metadata mapping.
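As a rough check on the skill-disjoint split, one could measure the share of dev/test skill tokens that never occur in the training queries. The sketch below is an assumption about how such a statistic might be computed (tokens counted with repetition over canonical skill IDs); the function name and the toy data are hypothetical and not taken from the benchmark.

```python
def unseen_skill_fraction(train_queries, eval_queries):
    """Fraction of evaluation skill tokens that do not appear in any training query.

    Each query is an iterable of canonical skill IDs. Tokens are counted with
    repetition across evaluation queries (an assumption of this sketch).
    """
    train_skills = {skill for query in train_queries for skill in query}
    eval_tokens = [skill for query in eval_queries for skill in query]
    if not eval_tokens:
        return 0.0
    unseen = sum(1 for skill in eval_tokens if skill not in train_skills)
    return unseen / len(eval_tokens)


if __name__ == "__main__":
    # Toy data, not from the benchmark.
    train = [{"python", "sql"}, {"kubernetes"}]
    evaluation = [{"python", "rust"}, {"terraform", "go", "sql"}]
    print(unseen_skill_fraction(train, evaluation))
```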

3. Hybrid Retrieval Architecture and Workflow

JobSearch-XS provides a multi-stage reference pipeline (see Vyaas et al., 15 Mar 2026), integrating the following retrieval evidence streams:

Query Enrichment: Extraction of named entities, raw and normalized skill mentions, degree requirements, location, and company from the query. Expansion is performed via KG traversal (depth-2 RELATED_TO) to surface latent/related skills; queries are embedded via all-MiniLM-L6-v2 (ℝ^384).
Parallel Multi-Source Retrieval:
- Lexical: BM25 (Elasticsearch), returns top-150.
- Semantic: HNSW-based approximate nearest neighbor search, returns top-150.
- Graph-based: Neo4j Cypher over REQUIRES_SKILL, seeded with raw and expanded skills, returns top-75.
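Reciprocal Rank Fusion (RRF): The three top-K lists are unioned (max 400 docs); each candidate d receives an RRF score ∑_{r∈R} 1/(k + rank_r(d)) with k=60, where R is the set of retrievers and rank_r(d) is the position of d in retriever r's list. Fusion weights are adaptively tuned based on query length.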
Hard Constraint Filtering: Post-retrieval application of degree/visa/other categorical constraints.
White-Box Multi-Factor Reranking: Final utility scoring via U(c,j)=∑_{f∈F} w_f ϕ_f(c,j), where F includes Skill, Experience, Location, Salary, Semantic, and Company features, with default weights (e.g., Skill 0.35). Jaccard skill overlap, KG-relatedness, level-distance, location/salary alignment, embedding similarity, and company preference are separately scored. Explanations are rendered via LLM for transparency.
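The fusion step in the pipeline above can be sketched in Python as follows. This is a minimal illustration of standard Reciprocal Rank Fusion with k=60 and a 400-document cap, not the reference implementation. The optional `weights` argument is a placeholder for the query-length-dependent weighting, whose exact rule is not given here, and the job IDs are invented.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, weights=None, max_docs=400):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    ranked_lists: list of lists, each ordered best-first.
    weights: optional per-list multipliers (hypothetical stand-in for the
             adaptive weighting; defaults to 1.0 for every list).
    Returns up to max_docs (doc_id, score) pairs, best-first.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("one weight per ranked list is required")

    scores = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Sort by descending score; break ties by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:max_docs]


if __name__ == "__main__":
    bm25 = ["j1", "j2", "j3"]
    semantic = ["j2", "j1", "j4"]
    graph = ["j2", "j5"]
    for doc_id, score in rrf_fuse([bm25, semantic, graph]):
        print(doc_id, round(score, 5))
```

Because RRF uses only ranks, the BM25, embedding, and graph scores never need to be put on a common scale before fusion.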

This architecture is modular, allowing researchers to substitute alternative retrievers or rerankers, isolate diagnostics, and benchmark next to the reference stack.
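To make the white-box utility U(c,j)=∑_{f∈F} w_f ϕ_f(c,j) concrete, the following Python sketch scores and reorders candidates from precomputed factor values. Only the Skill weight (0.35) comes from the text; every other weight, the feature names, and the example feature values are placeholders chosen for illustration, and each ϕ_f is assumed to be already normalized to [0, 1].

```python
# Only "skill": 0.35 is stated in the text; the remaining weights are
# hypothetical placeholders chosen so that the weights sum to 1.0.
DEFAULT_WEIGHTS = {
    "skill": 0.35,
    "experience": 0.20,
    "location": 0.15,
    "salary": 0.10,
    "semantic": 0.15,
    "company": 0.05,
}


def utility(features, weights=DEFAULT_WEIGHTS):
    """Weighted sum of factor scores; missing factors contribute 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rerank(candidates, weights=DEFAULT_WEIGHTS):
    """Sort jobs by utility, best-first.

    candidates: dict mapping job_id -> dict of factor scores in [0, 1].
    Returns a list of (job_id, utility) pairs; ties are broken by job_id.
    """
    scored = [(job_id, utility(feats, weights)) for job_id, feats in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Invented factor scores for two jobs.
    candidates = {
        "job_a": {"skill": 1.0, "experience": 0.5, "location": 1.0,
                  "salary": 0.5, "semantic": 0.8, "company": 0.0},
        "job_b": {"skill": 0.2, "experience": 1.0, "location": 1.0,
                  "salary": 1.0, "semantic": 0.6, "company": 1.0},
    }
    for job_id, score in rerank(candidates):
        print(job_id, round(score, 2))
```

Keeping the score a plain weighted sum means each factor's contribution can be read off directly, which is what makes the reranker auditable.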

4. Evaluation Protocols, Metrics, and Diagnostic Slicing

Evaluation in JobSearch-XS is explicitly slice-aware, measuring both ranking quality and real-time tractability:

Primary metrics: Precision@k, Recall@k, NDCG@k, Mean Reciprocal Rank (MRR). All are calculated on the gold-labeled dev/test queries.
Latency: Median (P50) and P95 end-to-end response times (in ms), critical for production relevance.
Skill-disjoint evaluation: Query splits are skill-disjoint to emphasize generalization, so the majority of test skills are unseen in the train split.
Diagnostic slicing (as in PJB (Wang et al., 18 Mar 2026)): Modules (lexical, KG, semantic, reranking) are ablated and their effect stratified by query type (parallel constraints, serial inference, hybrid) or domain family.
- Retrieval variants (BM25, semantic, KG-only, hybrid) show distinct performance: KG-only achieves perfect recall on graph-reachable pairs; hybrid+rerank yields NDCG@10=0.81 (7% lift over BM25) at <100 ms latency. However, recall at large k (e.g., Recall@100=0.35) reveals limitations due to aggressive capping in precision-focused fusion.
User study: Pilot of N=20 provides Likert ratings for relevance, synonym handling, explanation quality, and UI responsiveness; mean top-5 relevance 3.92/5, explanation helpfulness 4.17/5.
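The primary metrics listed above follow standard IR definitions, sketched below in Python for a single query. The NDCG variant here uses linear gain with a log2 discount, which is one common convention and an assumption about the benchmark's exact formulation. Function names and the example ranking are illustrative; MRR is the mean of `reciprocal_rank` over queries.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gain; gains maps doc -> graded relevance (missing = 0)."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Invented ranking with two gold-relevant jobs.
    ranked = ["j3", "j1", "j7", "j2"]
    relevant = {"j1", "j2"}
    print(precision_at_k(ranked, relevant, 2))
    print(recall_at_k(ranked, relevant, 2))
    print(reciprocal_rank(ranked, relevant))
    print(ndcg_at_k(ranked, {"j1": 1, "j2": 1}, 4))
```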

5. Explainability, Transparency, and Factor-Wise Control

JobSearch-XS embeds explainable AI through three mechanisms:

White-box multifactor reranker: Each contributing factor (Skill, Experience, Location, etc.) is individually weighted, with interactive sliders for expert/live adjustment and all intermediate scores retained for explanation.
LLM-based rationale generation: For each recommendation, a supporting evidence chain (e.g., KG path for a skill match, explicit salary/experience thresholds) is passed to an LLM to generate grounded, auditable explanations. In pilot audits, the top influencing factor agreed with human auditor judgment in 70.5% of test cases; all explanations avoided unsupported claims.
Query adaptation: Skill expansion and slot normalization support informed navigation of synonym space and lateral transfers within occupational clusters.
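One way to read the factor-wise audit described above: compute each factor's weighted contribution, take the largest as the "top influencing factor", and compare it with the factor an auditor names. The Python sketch below illustrates that reading. The helper names, weights, and sample data are hypothetical, and the audit procedure actually used in the pilot may differ.

```python
def factor_contributions(features, weights):
    """Per-factor weighted contribution w_f * phi_f; missing factors count as 0."""
    return {name: w * features.get(name, 0.0) for name, w in weights.items()}


def top_factor(features, weights):
    """Name of the factor with the largest contribution (ties: alphabetical)."""
    contrib = factor_contributions(features, weights)
    return max(sorted(contrib), key=lambda name: contrib[name])


def agreement_rate(model_top, auditor_top):
    """Share of cases where the model's top factor matches the auditor's choice."""
    if len(model_top) != len(auditor_top):
        raise ValueError("both lists must cover the same cases")
    if not model_top:
        return 0.0
    matches = sum(1 for m, a in zip(model_top, auditor_top) if m == a)
    return matches / len(model_top)


if __name__ == "__main__":
    # Illustrative weights and factor scores, not the benchmark defaults.
    weights = {"skill": 0.35, "experience": 0.20, "location": 0.15}
    features = {"skill": 0.4, "experience": 0.9, "location": 1.0}
    print(factor_contributions(features, weights))
    print(top_factor(features, weights))
    print(agreement_rate(["skill", "experience", "skill", "location"],
                         ["skill", "experience", "location", "location"]))
```

The same contribution dictionary could also serve as structured evidence passed to the LLM, so that the generated rationale only cites factors that actually influenced the score.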

6. Broader Impact, Limitations, and Reproducibility

JobSearch-XS delivers an open, modular platform for diagnosing and advancing job search IR:

Public release (MIT license; CC BY 4.0 for data) with codebase and installation/usage documentation at [https://github.com/coral-lab-asu/job-hunt-AI], supporting full ingestion-to-evaluation workflows within minutes on commodity hardware (Vyaas et al., 15 Mar 2026).
Enables ablation-driven research into the benefits and bottlenecks of hybrid retrieval, knowledge graphs, explainable reranking, and skill generalization.
Empirical findings identify open directions: recall-limited precision fusion (Recall@100 = 0.35 on NDCG-optimal runs), zero-shot skill matching under novel synonym and domain combinations, personalized utility estimation, and module sensitivity to query complexity.
Limitations: current corpus is small (1,283 jobs, 30 queries), with skill-taxonomy and synonym mapping critical to retrieval quality; hard-constraint filtering may underrepresent open-ended or atypical candidate/job matches. Pilot user-study size is limited; large-scale behavioral A/B evaluations are not yet reported.
Direct extensibility: the JobSearch-XS stack and labelled data provide a proxy for other occupation/skill taxonomies (ESCO, O*NET), supporting both academic benchmarking and industrial IR system diagnostics.

JobSearch-XS is situated within an emerging ecosystem of recruitment IR benchmarks and diagnostics:

The PJB benchmark (Wang et al., 18 Mar 2026) introduces reasoning-aware binary job-competency judgments over 300k resumes, highlighting the gap between average leaderboard improvements and domain/logic-specific bottlenecks. PJB employs similar module ablations and slice-aware diagnostics, but at larger scale and with richer parallel/serial constraint annotation.
JobHop (Johary et al., 12 May 2025) offers trajectory-scale, ESCO-normalized transition datasets suitable for career path modeling and transition recommender evaluation.
Methods from Profile Analyst (Coleman, 2016), PrivateJobMatch (Saini, 2019), and RecSys Challenge 2016 (Pacuk et al., 2016) are readily pluggable into the XS diagnostic pipeline for comparative evaluation.

In summary, JobSearch-XS operationalizes hybrid, explainable, and slice-driven evaluation for skills-based job search and matching, establishing a replicable standard for semantic-aware labor market IR research (Vyaas et al., 15 Mar 2026).