Efficient Fuzzy Search Engine with B-Tree Search Mechanism (1411.6773v1)

Published 25 Nov 2014 in cs.IR

Abstract: Search engines play a vital role in day-to-day life on the internet. People use search engines to find content on the internet. Cloud computing is the computing concept in which data is stored and accessed with the help of a third-party server called the cloud. Data is not stored locally on our machines, and software and information are provided to the user on demand. Search queries are the most important part of searching data on the internet. A search query consists of one or more keywords. A search query is searched from the database for an exact match, and the traditional searchable schemes do not tolerate minor typos and format inconsistencies, which happen quite frequently. This drawback makes the existing techniques unsuitable, and they offer very low efficiency. In this paper, we will for the first time formulate the problem of effective fuzzy search by introducing tree search methodologies. We will explore the benefits of B-trees in the search mechanism and use them to have an efficient keyword search. We have taken into consideration the security analysis strictly so as to get a secure and privacy-preserving system.

Citations (14)


Summary

The paper presents a privacy-preserving fuzzy keyword search engine that integrates dictionary-based fuzzy search with a B-tree indexing method to reduce search overhead.
It employs an inverted index and trapdoor encryption to securely match noisy queries with encrypted cloud data despite typos and formatting variations.
Performance analysis reveals that the dictionary-based approach minimizes fuzzy keyword generation compared to wildcard methods, enhancing search efficiency.

This paper (1411.6773) proposes an efficient and privacy-preserving fuzzy keyword search engine specifically designed for encrypted data stored in cloud environments. The core problem it addresses is that traditional keyword search methods require exact matches, which is impractical for users in cloud settings where data might be encrypted and search queries may contain typos or variations (e.g., "P.O Box" vs. "P O Box").

The proposed system tackles this by introducing a combination of dictionary-based fuzzy keyword generation, B-tree index search, and an inverted index structure, all while maintaining data privacy through encryption.

Core Concepts and Techniques:

Fuzzy Keyword Search: The system enables searching for terms that are similar to the user's query, even if they don't match exactly. Similarity is measured using the string edit distance, which is the minimum number of insertions, deletions, or substitutions required to transform one string into another.
Dictionary-based Fuzzy Set Construction: Unlike wildcard-based methods that might generate numerous meaningless variations (e.g., "ASTUDENT" from "STUDENT"), this approach uses a predefined dictionary (e.g., a list of legal English words). For a given query word $w$ and a maximum edit distance $d$, it generates a set $FP_{w,d}$ containing only words from the dictionary that are within an edit distance of $d$ from $w$. This results in a smaller, more relevant set of potential keywords, improving efficiency and reducing the search request size.
B-Tree Search Mechanism: A B-tree is the primary data structure for indexing keywords. B-trees suit disk-based storage systems (like cloud storage) because each node holds many keys, which keeps the tree shallow and so limits the number of disk reads per search. The paper describes the standard search function B-TREE-SEARCH(x, k), which generalizes binary-search-tree lookup: instead of a two-way choice at each node x, it makes a multiway branching decision among the node's children until it finds the key k (a keyword or its numerical representation) or reaches a leaf.
Inverted Index: After a fuzzy keyword is identified (potentially via the B-tree search on dictionary entries), an inverted index is used to map this keyword to the documents (or file IDs) that contain it. The inverted index stores a list of document identifiers for each indexed keyword. A fully inverted index can also store the exact position of the keyword within the document, allowing for more precise results or phrase searching (though the paper's primary focus is document retrieval).
Encryption and Privacy: The system protects both the data files and the keywords. Data files are encrypted before being outsourced to the cloud. The paper names MD5 as the encryption method; since MD5 is a hash function rather than a cipher, the description is best read as a secure trapdoor-generation mechanism. A secret key (sk) is shared between the data owner and authorized users. Search queries are transformed into "trapdoors", $T_{w_i} = f(sk, w_i)$ for a keyword $w_i$, using this secret key. The cloud server receives the trapdoor and uses it to find matching entries in the index table without learning the actual keyword. Only authorized users with the secret key can decrypt the retrieved file identifiers.
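
The edit-distance measure and the dictionary-based fuzzy set $FP_{w,d}$ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `edit_distance` and `fuzzy_set` and the tiny `DICTIONARY` are hypothetical, and the linear scan over the dictionary is a simplification rather than the paper's index-backed lookup.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_set(w: str, d: int, dictionary: set) -> set:
    """FP_{w,d}: dictionary words within edit distance d of w."""
    return {word for word in dictionary if edit_distance(w, word) <= d}


# Hypothetical toy dictionary of "legal" words, for illustration only.
DICTIONARY = {"STUDENT", "STUDENTS", "STUDIED", "PRUDENT", "CASTLE"}

print(sorted(fuzzy_set("STUDENT", 1, DICTIONARY)))  # ['STUDENT', 'STUDENTS']
```

Because only dictionary words can enter the set, a meaningless variant such as "ASTUDENT" is never generated, which is the source of the smaller search request size.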

Proposed Architecture and Workflow (Referring to Figure 1):

Data Owner Side:
- Files: Original documents ready for processing.
- Format Extractor: Reads various file formats (doc, txt, xls) to extract content.
- Text Filter: Removes punctuation, separators, and stop words from the extracted text to get a clean list of potential keywords.
- Index Formation: Creates the dictionary of legal keywords, generates fuzzy keyword sets (using the dictionary-based approach and edit distance), builds the B-tree index for efficient lookup, and constructs the inverted index mapping keywords/trapdoors to file IDs.
- Encryption: Encrypts the data files (Encrypted Files) and potentially the file identifiers, using the shared secret key.
- The encrypted files and the index table (containing trapdoors and encrypted file identifiers) are then OUTSOURCED to the CLOUD SERVER.
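
As a rough sketch of the owner-side Text Filter and inverted-index steps listed above, the following Python builds a keyword-to-file-ID map from already extracted text. The names `extract_keywords`, `build_inverted_index`, and the `STOP_WORDS` list are assumptions made for illustration; format extraction and encryption are left out, and in the described scheme the plaintext keys would be replaced by trapdoors before outsourcing.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}


def extract_keywords(text: str) -> list:
    """Text filter: lowercase, drop punctuation/separators and stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(files: dict) -> dict:
    """Map each keyword to the sorted list of file IDs that contain it."""
    index = defaultdict(set)
    for file_id, text in files.items():
        for keyword in extract_keywords(text):
            index[keyword].add(file_id)
    return {kw: sorted(ids) for kw, ids in index.items()}


files = {
    "f1": "The castle is in the cloud.",
    "f2": "Cloud storage, and the student.",
}
index = build_inverted_index(files)
print(index["cloud"])  # ['f1', 'f2']
```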
Data User Side:
- The authorized user inputs a search keyword $w$.
- Fuzzy Keywords: The user-side component generates the fuzzy keyword set $FP_{w,d}$ for the input $w$ using the shared dictionary and edit distance $d$.
- TRAPDOOR WITH SECRET KEY: The user calculates the trapdoor $T_w$ for the query keyword (or potentially for each keyword in the fuzzy set) using the secret key sk.
- SEARCH REQUEST: The user sends the trapdoor $T_w$ to the CLOUD SERVER.
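
The user-side trapdoor step can be illustrated with a keyed hash. The sketch below uses HMAC-SHA256 as the function $f(sk, w)$; this is a swapped-in assumption, not the paper's construction (the paper mentions MD5), and the names `trapdoor` and `search_request` are hypothetical. It is not a full searchable-encryption scheme: deterministic trapdoors still reveal when the same keyword is searched twice.

```python
import hashlib
import hmac


def trapdoor(sk: bytes, keyword: str) -> str:
    """T_w = f(sk, w), with f assumed to be HMAC-SHA256 (illustrative choice).
    Keywords are lowercased so case differences map to one trapdoor."""
    return hmac.new(sk, keyword.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def search_request(sk: bytes, fuzzy_keywords: set) -> list:
    """Build the request: one trapdoor per keyword in the fuzzy set."""
    return sorted({trapdoor(sk, w) for w in fuzzy_keywords})


sk = b"shared-secret-key"  # placeholder; real keys come from key management
request = search_request(sk, {"STUDENT", "STUDENTS"})
print(len(request))  # 2
```

Without `sk`, the server cannot recompute a trapdoor for a guessed keyword, which is the property an unkeyed hash lacks.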
Cloud Server Side:
- Receives the SEARCH REQUEST (trapdoor $T_w$).
- B TREE SEARCH: Searches the B-tree index using the trapdoor to find matching entries corresponding to keywords in the fuzzy set $FP_{w,d}$ .
- Compares the incoming trapdoor(s) with the trapdoors stored in the INDEX TABLE.
- Identifies the encrypted file identifiers $\{\mathrm{Enc}(sk, \mathrm{FID}_{w_i})\}$ associated with the matching trapdoors using the INDEX TABLE.
- Returns the set of matching encrypted file identifiers (SEARCH RESULTS) to the user.
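
The server-side steps above can be sketched as a textbook B-TREE-SEARCH over stored trapdoors, with each key carrying its list of encrypted file identifiers. The `BTreeNode` layout, the `lookup` helper, and the short strings standing in for trapdoors and ciphertexts are all assumptions for illustration; the tree is built by hand, so insertion and node splitting are not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BTreeNode:
    keys: List[str]                       # sorted trapdoors stored in this node
    values: List[List[str]]               # values[i]: encrypted file IDs for keys[i]
    children: List["BTreeNode"] = field(default_factory=list)

    @property
    def leaf(self) -> bool:
        return not self.children


def b_tree_search(x: BTreeNode, k: str) -> Optional[Tuple[BTreeNode, int]]:
    """Return (node, index) holding key k, or None if k is absent."""
    i = 0
    while i < len(x.keys) and k > x.keys[i]:
        i += 1                            # find the first key >= k
    if i < len(x.keys) and k == x.keys[i]:
        return x, i
    if x.leaf:
        return None
    return b_tree_search(x.children[i], k)  # descend into the i-th subtree


def lookup(root: BTreeNode, trapdoors: List[str]) -> List[str]:
    """Collect encrypted file IDs for every trapdoor found in the index."""
    results = []
    for t in trapdoors:
        hit = b_tree_search(root, t)
        if hit is not None:
            node, i = hit
            results.extend(node.values[i])
    return results


# Hand-built two-level tree; "t10".."t80" stand in for real trapdoors.
root = BTreeNode(
    keys=["t30", "t60"],
    values=[["enc_f3"], ["enc_f6"]],
    children=[
        BTreeNode(["t10", "t20"], [["enc_f1"], ["enc_f2"]]),
        BTreeNode(["t40", "t50"], [["enc_f4", "enc_f7"], ["enc_f5"]]),
        BTreeNode(["t70", "t80"], [["enc_f8"], ["enc_f9"]]),
    ],
)
print(lookup(root, ["t40", "t99", "t60"]))  # ['enc_f4', 'enc_f7', 'enc_f6']
```

The server only compares opaque strings, so it can return matching identifiers without seeing the keywords behind the trapdoors.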
Data User Side (cont.):
- Receives SEARCH RESULTS (encrypted file identifiers).
- If the user requests a file, they send a DOWNLOAD REQUEST for the specified encrypted file(s).
- Upon receiving the encrypted file, the user uses the secret key sk for DECRYPTION.

Implementation Considerations:

Dictionary Management: Maintaining a comprehensive and up-to-date dictionary of legal words is crucial for the effectiveness of the dictionary-based approach.
Edit Distance Threshold: Choosing the edit distance $d$ is a trade-off: a larger $d$ tolerates more typos but enlarges the fuzzy keyword set and lengthens search time; a smaller $d$ is faster but less tolerant of errors.
B-Tree Implementation: A robust B-tree implementation that efficiently handles insertions, deletions, and searches is necessary. The order of the B-tree (m) affects performance and storage requirements.
Inverted Index Construction: Building and managing the inverted index requires mapping keywords to multiple document IDs. For large datasets, this index can become substantial.
Encryption Scheme: While the paper mentions MD5, a real-world secure system would need a proper searchable symmetric encryption scheme (like those referenced in the paper's related work [10, 11, 13]) to generate trapdoors that let the server perform comparisons without learning the keyword, combined with a strong symmetric encryption algorithm (like AES) for encrypting the data files and file identifiers. MD5 is an unkeyed one-way hash with known collision weaknesses: nothing can be decrypted from a digest, so it cannot serve as encryption, and an unkeyed digest of a keyword lets the server test guesses by hashing candidate words itself.
Key Management: Securely sharing and managing the secret key between the data owner and authorized users is critical.
Performance: The paper shows a performance analysis (Figure 6) comparing Dictionary-based Fuzzy Search (DFS) and Wildcard-based Fuzzy Search (WFS), concluding that DFS generates significantly fewer fuzzy keywords, especially for longer words, leading to a smaller search request size and better efficiency.
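
To give a feel for the wildcard side of the DFS versus WFS comparison, the sketch below counts wildcard variants for edit distance 1. It assumes the common construction in which `*` marks a single edit at one position, which this summary does not spell out; the function name `wildcard_set_d1` is hypothetical, and the numbers are not taken from the paper's Figure 6.

```python
def wildcard_set_d1(w: str) -> set:
    """Wildcard-based fuzzy set for edit distance 1 (assumed construction):
    '*' marks an inserted character or a substituted/deleted position."""
    variants = {w}
    for i in range(len(w) + 1):
        variants.add(w[:i] + "*" + w[i:])      # insertion before position i
    for i in range(len(w)):
        variants.add(w[:i] + "*" + w[i + 1:])  # substitution/deletion at i
    return variants


for word in ("BOX", "CASTLE", "STUDENT"):
    # Size grows as 2 * len(word) + 2, regardless of any dictionary.
    print(word, len(wildcard_set_d1(word)))
```

Under this assumption the wildcard set grows linearly with word length, whereas a dictionary-based set is bounded by how many legal words actually lie within distance $d$ of the query.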

Practical Applications:

This system is applicable in scenarios where:

Sensitive data needs to be stored encrypted in the cloud.
Users need to search this encrypted data.
Search queries might contain minor errors (typos, formatting inconsistencies).
Examples include secure document repositories, encrypted email archives, or health record systems hosted on cloud infrastructure.

The paper concludes that the proposed Dictionary-based fuzzy keyword search scheme using a B-Tree search mechanism offers an efficient and privacy-preserving solution for fuzzy search over encrypted cloud data. Future work mentioned includes supporting conjunctive keyword searches (searching for multiple terms together) and keyword ranking based on user preferences.
