
Housing Statute QA Dataset

Updated 21 October 2025
  • Housing Statute QA Dataset is an expert-annotated benchmark that evaluates retrieval and QA systems for complex statutory housing law queries.
  • It comprises over 6,800 Yes/No legal questions with gold passages extracted from authoritative sources, ensuring real-world relevance.
  • Low lexical similarity between queries and statutes challenges conventional retrieval methods, motivating query expansion techniques that improve system performance.

The Housing Statute QA Dataset is an expert-annotated benchmark for evaluating retrieval and question-answering systems in the domain of statutory housing law. It is constructed to reflect the complexity and specificity of real-world legal research tasks, particularly as encountered in the context of eviction and related statutory issues across U.S. states and territories.

1. Construction and Sources

The benchmark is derived from the Legal Services Corporation (LSC) Eviction Laws Database, a resource originally assembled by legally trained researchers to support tenant inquiries regarding eviction and broader housing regulations. The LSC database consists of complex, practical legal questions annotated with citations to relevant statutes. During construction, non–Yes/No queries were removed or split into multiple binary subquestions; each multi-answer item yields several Yes/No queries, ensuring every example fits binary QA paradigms. Citations anchor each query to authoritative statutory text, extracted as “gold” passages from sources such as Justia, representing verbatim legislative language from the most recent statute compilations available (typically 2021 or newer).
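The binary reformulation step can be illustrated with a short sketch. This is not the authors' actual pipeline, and all field names are hypothetical; it only shows how one multi-answer item yields several Yes/No queries:

```python
# Illustrative sketch of splitting a multi-answer legal question into
# binary Yes/No subquestions. Field names are hypothetical.

def to_binary_subquestions(question: str, answer_options: list[str],
                           true_answers: set[str]) -> list[dict]:
    """Turn one multi-answer question into several Yes/No examples."""
    examples = []
    for option in answer_options:
        examples.append({
            "question": f"{question} Specifically: {option}?",
            "answer": "Yes" if option in true_answers else "No",
        })
    return examples

subqs = to_binary_subquestions(
    "Which notice periods satisfy the statute",
    ["14 days", "30 days"],
    {"14 days"},
)
print([e["answer"] for e in subqs])  # ['Yes', 'No']
```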

This annotation process directly mirrors how practicing attorneys conduct statutory research, combining manual expert review, citation search, and extraction of dispositive statutory excerpts, thereby imbuing the dataset with a high degree of domain fidelity and practical relevance (Zheng et al., 6 May 2025).

2. Dataset Structure and Content

The Housing Statute QA dataset comprises more than 6,800 question–answer pairs in its main publicly released “rc_questions” split, with further examples in the “knowledge_qa” split (the latter lacking gold passages). Each example is a Yes/No question regarding statutory requirements, defenses, procedures, or restrictions in housing law. Gold passages are concise excerpts from state housing statutes, selected according to explicit annotated citations. This ensures that every example remains grounded in the actual text of applicable law.
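A record in the “rc_questions” split can be pictured along these lines. The field names below are hypothetical; the source specifies only that each example pairs a Yes/No question with a jurisdiction-anchored citation and a gold statutory passage:

```python
# Hypothetical record layout for the "rc_questions" split; field names
# are illustrative, but the content types follow the dataset description.
example = {
    "question": "Does a tenant have a right to cure nonpayment within 14 days?",
    "answer": "Yes",                          # binary label
    "state": "MA",                            # jurisdiction anchor
    "citation": "<statutory citation>",       # annotated citation string
    "gold_passage": "... the tenant shall be entitled to cure ...",
}

def is_grounded(record: dict) -> bool:
    """Every rc_questions example must carry a citation and a gold passage."""
    return bool(record["citation"]) and bool(record["gold_passage"].strip())

print(is_grounded(example))  # True
```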

The supporting document retrieval corpus is large, consisting of approximately 1–2 million statute passages spanning U.S. jurisdictions. The typical queries are phrased in accessible, practical terms (“Does a tenant have a right to cure nonpayment within 14 days…?”), while the statutory text relies on formal legal syntax, creating significant surface-level linguistic disparity. Consequently, this corpus poses notable challenges for both keyword-based and embedding-based retrievers.
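The surface-level disparity can be made concrete with a token-level Jaccard similarity between a plain-language query and formal statutory text (both strings below are invented for illustration):

```python
# Minimal illustration of the query/statute lexical gap: token-level
# Jaccard similarity between plain-language and statutory phrasing.
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

query = "Does a tenant have a right to cure nonpayment within 14 days?"
statute = ("Upon service of the notice to quit, the lessee may tender the "
           "full amount of rent due, together with interest and costs, "
           "prior to the entry of judgment.")

q, s = tokens(query), tokens(statute)
jaccard = len(q & s) / len(q | s)
print(f"Jaccard similarity: {jaccard:.2f}")  # 0.03 — very low lexical overlap
```

Despite asking about the same legal situation, the two texts share almost no vocabulary, which is exactly the failure mode the corpus exposes in keyword-based retrievers.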

3. Retrieval Evaluation and Baseline Performance

Several retrieval approaches are benchmarked using standard IR metrics: Recall@K (K = 10, 100, etc.) and mean reciprocal rank (MRR). Baseline lexical methods such as BM25 exhibit moderate performance, largely due to the low mean lexical similarity (≈ 0.09) between queries and their gold statute passages, which is substantially lower than similar metrics on other legal retrieval datasets. Dense retrieval models (e.g., E5 family), including E5-large-v2, are also evaluated, showing improvements, yet remain constrained by the lexical and semantic gap.
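Recall@K and MRR follow their textbook definitions; a minimal sketch:

```python
# Standard IR metrics used in the benchmark, from their textbook
# definitions: Recall@K and mean reciprocal rank (MRR).

def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold passages retrieved within the top-k results."""
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)

def mrr(all_ranked: list[list[str]], all_gold: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant passage per query."""
    total = 0.0
    for ranked, gold in zip(all_ranked, all_gold):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(all_ranked)

ranked = ["d3", "d7", "d1"]
print(recall_at_k(ranked, {"d1", "d9"}, k=3))  # 0.5
print(mrr([ranked], [{"d1", "d9"}]))           # 0.333... (first hit at rank 3)
```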

Generative query expansion techniques, in particular those employing structured legal reasoning prompts, substantially increase baseline recall. When BM25 is augmented with rollouts generated by such prompts, Recall@10 gains approximately 10 percentage points. These expansions enable the retriever to bridge semantic disparities, producing intermediary queries that better align with statutory language and reasoning (Zheng et al., 6 May 2025).
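A hedged sketch of this pipeline follows, with a from-scratch BM25 for self-containment and a fixed placeholder string standing in for the LLM reasoning rollout (the actual prompts and model are not reproduced here):

```python
# Sketch of generative query expansion feeding a lexical retriever.
# expand_query is a stand-in for an LLM prompted with structured legal
# reasoning; bm25_scores is a minimal Okapi BM25 implementation.
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    n = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n
    df = Counter(t for d in corpus_tokens for t in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def expand_query(query: str) -> str:
    # Placeholder for an LLM rollout that restates the user's question
    # in statutory vocabulary; here it is a fixed illustrative string.
    return query + " lessee tender rent due notice quit cure"

corpus = [
    "the lessee may tender the full rent due after notice to quit".split(),
    "zoning board variance application procedures".split(),
]
plain = bm25_scores("tenant right cure nonpayment".split(), corpus)
expanded = bm25_scores(expand_query("tenant right cure nonpayment").split(), corpus)
print(plain[0], expanded[0])  # expansion lets the query match statutory wording
```

The plain query scores zero against the relevant passage because it shares no terms with it; the expanded query recovers the match, which is the mechanism behind the reported Recall@10 gains.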

4. Technical Features and Methodological Challenges

Key technical features include:

  • Binary reformulation of complex statutory and procedural questions, facilitating evaluation of system prediction as Yes/No classification.
  • Gold passage identification via citation extraction, mirroring domain-expert workflows.
  • Grounding in state law corpora, allowing jurisdiction-specific QA.

Principal methodological challenges are:

  • Low lexical overlap and semantic disconnect between practical queries and formal statutes.
  • Necessity for retrieval systems to perform not only surface-level matching but substantive legal reasoning to bridge user intent and statutory requirements.
  • Evaluation complexity, as key passages may not contain direct lexical cues referencing the query.

The retrieval setup applies standard IR metrics alongside a term-frequency (TF) cosine similarity baseline:

$$\cos(q, d) = \frac{\sum_k \mathrm{TF}_{q,k} \cdot \mathrm{TF}_{d,k}}{\|q\| \cdot \|d\|}$$

where $\mathrm{TF}_{q,k}$ and $\mathrm{TF}_{d,k}$ denote the term frequencies of the $k$-th term in the query $q$ and document $d$, respectively. However, dense and re-ranking methods, alongside query expansion, are favored for improved retrieval effectiveness.
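The cosine similarity over raw term-frequency vectors can be computed directly (a minimal sketch, not the benchmark's code):

```python
# Direct implementation of the TF cosine formula: the dot product of raw
# term-frequency vectors divided by the product of their Euclidean norms.
import math
from collections import Counter

def tf_cosine(query_tokens: list[str], doc_tokens: list[str]) -> float:
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)              # sum_k TF_{q,k} * TF_{d,k}
    nq = math.sqrt(sum(v * v for v in q.values())) # ||q||
    nd = math.sqrt(sum(v * v for v in d.values())) # ||d||
    return dot / (nq * nd) if nq and nd else 0.0

print(tf_cosine("a b b".split(), "b c".split()))  # 2/sqrt(10) ≈ 0.632
```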

5. Implications for Retrieval-Augmented Legal QA

The benchmark highlights fundamental obstacles and directions in retrieval-augmented generation (RAG) for question-answering in statutory domains:

  • Baseline retrievers underperform due to semantic and linguistic gaps; conventional approaches (BM25, naive dense retrievers) are insufficient.
  • Query expansion via structured legal reasoning substantially improves recall, suggesting that future systems should integrate reasoning rollouts in their retrieval pipeline.
  • The benchmark’s use of jurisdiction-anchored gold passages enables robust system evaluation and supports downstream analysis of jurisdictional variation in statutory interpretation and enforcement.
  • A plausible implication is that progress in statutory QA will require retrievers that model both the intent and the reasoning pathways typically traversed by legal experts.
  • Hybrid retrieval architectures merging legal IR heuristics and machine learning appear promising; ongoing improvements in the embedding capacity of dense models for legal text are expected to further advance performance.

6. Position in the Broader Statute QA Landscape

Relative to alternative resources (such as generic fairness or housing QA datasets), the Housing Statute QA dataset is distinguished by its rigorous grounding in real legal research processes, explicit citation linkage, and focus on domain-specific statutory reasoning. It complements datasets like FairHome (Bagalkotkar et al., 2024), which focus on compliance risk identification in conversational contexts, by directly targeting statutory interpretation and retrieval.

It shares with multimodal and spatiotemporal datasets (e.g., HouseTS (Wang et al., 1 Jun 2025)) the ambition of supporting housing policy analysis, though its unique contribution is the facilitation of retrieval-augmented QA workflows for the statutory law domain.

7. Future Directions and Research Prospects

The benchmark underscores that legal RAG systems remain a challenging open research area. Key future directions suggested in the data include:

  • Development of retrievers optimized for legal reasoning, possibly via more sophisticated expansion and reasoning modules.
  • Improved modeling of statutory language embeddings reflecting cross-jurisdictional syntax and latent legal concepts.
  • Exploration of hybrid retrieval–reasoning systems employing legal heuristics, annotation-guided learning, and continual refinement.
  • Expansion of the corpus to encompass structural legal changes and dynamic statute updates.
  • Ongoing evaluation of new retrieval algorithms and generative expansion strategies on complex, real-world legal QA tasks.

In sum, the Housing Statute QA Dataset constitutes a domain-intensive resource for rigorous evaluation and innovation in statutory question-answering models, advancing research in legal information retrieval, expert annotation, and retrieval-augmented large language modeling (Zheng et al., 6 May 2025).
