Concept-based indexing in text information retrieval (1303.1703v1)

Published 7 Mar 2013 in cs.IR and cs.CL

Abstract: Traditional information retrieval systems rely on keywords to index documents and queries. In such systems, documents are retrieved based on the number of shared keywords with the query. This lexical-focused retrieval leads to inaccurate and incomplete results when different keywords are used to describe the documents and queries. Semantic-focused retrieval approaches attempt to overcome this problem by relying on concepts rather than on keywords for indexing and retrieval. The goal is to retrieve documents that are semantically relevant to a given user query. This paper addresses this issue by proposing a solution at the indexing level. More precisely, we propose a novel approach for semantic indexing based on concepts identified from a linguistic resource. In particular, our approach relies on the joint use of WordNet and WordNetDomains lexical databases for concept identification. Furthermore, we propose a semantic-based concept weighting scheme that relies on a novel definition of concept centrality. The resulting system is evaluated on the TIME test collection. Experimental results show the effectiveness of our proposition over traditional IR approaches.

Concept-Based Indexing in Text Information Retrieval

The paper by Fatiha Boubekeur and Wassila Azzoug explores an approach to improving information retrieval (IR) systems through concept-based indexing, which represents documents by semantic entities rather than by traditional keywords. Traditional IR systems, which operate predominantly on the “bag of words” model, are limited by lexical ambiguities such as synonymy and polysemy. The paper introduces a concept-based retrieval framework designed to mitigate these issues by using WordNet and WordNetDomains for semantic concept identification and weighting.

Technical Approach

The proposed framework relies on two core innovations: concept identification and concept weighting. The process of concept identification involves mapping terms within a document to their corresponding conceptual entries in WordNet, a comprehensive lexical database. This is enhanced by employing WordNetDomains, which adds a hierarchical domain-based facet to each synset, providing a contextual backdrop that aids in resolving ambiguities. This two-tiered approach refines the semantic representation of documents by associating terms with precise, contextually relevant concepts.
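The idea of using domain labels to choose between candidate concepts can be illustrated with a toy example. This is a minimal sketch, not the authors' algorithm: the `SENSES` inventory, its concept identifiers, and the simple domain-voting rule are all invented for illustration, standing in for WordNet synsets and their WordNetDomains labels.

```python
from collections import Counter

# Hypothetical mini sense inventory: term -> list of (concept_id, domain).
# In the paper's setting these would come from WordNet synsets and their
# WordNetDomains labels; the entries below are invented for illustration.
SENSES = {
    "bank": [("bank#finance", "economy"), ("bank#river", "geography")],
    "loan": [("loan#credit", "economy")],
    "interest": [("interest#rate", "economy"), ("interest#curiosity", "psychology")],
    "river": [("river#stream", "geography")],
}


def identify_concepts(terms):
    """Pick one concept per known term by voting on the document's domains."""
    # Count how many term occurrences could belong to each domain.
    domain_votes = Counter()
    for term in terms:
        for domain in {d for _, d in SENSES.get(term, [])}:
            domain_votes[domain] += 1

    chosen = {}
    for term in terms:
        candidates = SENSES.get(term)
        if not candidates:
            continue  # term has no entry in the inventory
        # Prefer the sense whose domain is best supported by the document;
        # max() keeps the first candidate on ties.
        chosen[term] = max(candidates, key=lambda s: domain_votes[s[1]])[0]
    return chosen
```

With this inventory, "bank" resolves to the financial concept when it co-occurs with "loan" and "interest", and to the river concept when it co-occurs with "river", which is the kind of context-driven disambiguation the domain layer is meant to support.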

In parallel, concept weighting addresses the significance of each identified concept within the text. This is achieved through a weighting scheme built on concept centrality. Centrality is determined both locally, by evaluating a concept's semantic relatedness to the other concepts in the same document, and globally, by its discrimination power across the entire corpus. The resulting weight combines frequency (tf) and semantic relatedness (sim), with the weighting parameter α tuned experimentally.
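To make the local and global parts of such a weight concrete, here is a hedged sketch. The exact formula is not given in this summary, so two things are assumptions: that α interpolates between the frequency term and the summed relatedness term, and that the global part is an idf-style logarithm. The function name, its parameters, and the `sim` callback are hypothetical, and the paper's actual definition may differ.

```python
import math


def concept_weights(doc_tf, sim, doc_freq, n_docs, alpha=0.5):
    """Weight each concept of one document (illustrative, assumed formula).

    doc_tf:   {concept: frequency in the document}
    sim:      function(c1, c2) -> relatedness score (hypothetical callback)
    doc_freq: {concept: number of documents containing the concept}
    n_docs:   total number of documents in the corpus
    alpha:    trade-off between frequency and relatedness (assumed role)
    """
    weights = {}
    for concept, tf in doc_tf.items():
        # Local centrality: relatedness to the other concepts in the document.
        relatedness = sum(sim(concept, other) for other in doc_tf if other != concept)
        local = alpha * tf + (1 - alpha) * relatedness  # assumed combination
        # Global discrimination: idf-style factor (assumed form).
        global_part = math.log(n_docs / doc_freq[concept])
        weights[concept] = local * global_part
    return weights
```

Under these assumptions, a concept that appears in every document gets weight zero regardless of its centrality, while a rare concept that is strongly related to its neighbours is weighted up even if it occurs only once.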

Experimental Evaluation

The experimental analysis of this framework employed the TIME test collection, consisting of historical articles from TIME magazine accompanied by query relevance judgments. The experiments involved assessing the effectiveness of concept-based indexing against traditional keyword approaches using precision at various cut-off points and Mean Average Precision (MAP).
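For readers unfamiliar with these metrics, the standard definitions of precision at k and Mean Average Precision can be sketched as follows. The function names and the input layout (a ranked list of document IDs plus a set of relevant IDs per query) are choices made for this example, not taken from the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, with the ranking `["d1", "d2", "d3", "d4"]` and relevant set `{"d1", "d3"}`, P@2 is 0.5 and the average precision is (1/1 + 2/3) / 2.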

The results show that concept-based indexing, in both its Sem-TF-IDF and Sem-BM25 implementations, consistently outperformed the classic keyword-based methods. The gains were most visible at early cut-offs (P@5, P@10), meaning that more of the top-ranked documents were relevant. The cc-idc weighting scheme also improved on both tf-idf and BM25, which supports the use of concept centrality in semantic indexing.

Implications and Future Research

Semantic-driven indexing represents both a practical and a theoretical advance in IR, with potential applications in areas such as text mining and user preference modeling. The ability to relate documents semantically, beyond surface lexical overlap, is particularly valuable in domains that require nuanced interpretation, such as legal or biomedical text retrieval.

Future research is expected to further explore the dependency of the weighting parameter on different corpora, potentially leading to dynamic or adaptive weighting models. Additionally, extending this framework to multilingual IR systems could leverage cross-lingual capabilities of resources like EuroWordNet, fostering broader application in global information systems.

This paper makes a case for integrating semantic technologies into traditional IR, pointing towards retrieval systems that process language in a way closer to human semantic understanding.

Authors (2)
  1. Fatiha Boubekeur (1 paper)
  2. Wassila Azzoug (1 paper)
Citations (347)