SIMILAR Framework: Layout Document Retrieval
- SIMILAR Framework is a modular system that retrieves documents by analyzing visual layout features such as text boxes, images, and tables in OOXML files.
- It employs both exact and approximate matching techniques to compute similarity scores based on geometric and stylistic document attributes.
- Key strengths include enhanced forensic detection and user-customizable queries, while limitations involve scalability and limited format support.
The SIMILAR (SIMIlAr Layout Retrieval) Framework is a modular system designed to retrieve electronic documents with similar visual layouts, particularly under conditions where traditional content-based or metadata-based search strategies are rendered ineffective. This approach addresses forensic and investigative scenarios in which relevant documents cannot be identified by keywords or metadata but may share distinctive layout characteristics, such as those produced by organizational templates or habitual document design patterns (Chung, 2018).
1. Design Motivation and Problem Context
Electronic document retrieval traditionally prioritizes content similarity, metadata comparison, or byte-level hashing. However, adversarial scenarios such as deliberate paraphrasing, translation, or metadata stripping can defeat these approaches. Forensic investigators found that targeting similar document layouts—capturing page sizes, text box geometries, image positions, tables, and font usages—can recover groups of related documents that would otherwise evade detection by keyword or signature-based searches. The SIMILAR Framework directly addresses this gap by formalizing a search and retrieval process grounded in layout, rather than content semantics or metadata (Chung, 2018).
2. System Architecture
The SIMILAR Framework operates in two main phases: offline preprocessing/indexing and online query/retrieval.
Offline Preprocessing and Indexing:
- Format Parsing: The system decompresses OOXML containers (ZIP structure) and parses interrelated XML parts (e.g., slide#.xml, styles.xml).
- Feature Extraction: For each page or slide, atomic layout features are extracted, covering geometric and styling attributes.
- Normalization & Storage: Extracted features are converted into a unified XML or JSON record (PageFeature) and stored in a feature database. The prototype implementation uses in-memory storage, with future plans for RDBMS or NoSQL indexing.
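The parsing and extraction steps above can be sketched with the Python standard library. This is a minimal illustration and not the framework's own parser: the function names `extract_shape_boxes` and `make_demo_pptx` are hypothetical, and the demo container holds a single slide part, not a complete PPTX package. The part path and the `a:xfrm`/`a:off`/`a:ext` elements follow the OOXML PresentationML conventions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# DrawingML / PresentationML namespaces used inside PPTX slide parts.
NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
}


def extract_shape_boxes(pptx_bytes):
    """Return {slide part name: [(x, y, width, height), ...]} in EMUs."""
    boxes = {}
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as container:
        for name in sorted(container.namelist()):
            # Only slide parts; relationship files live under ppt/slides/_rels/.
            if not (name.startswith("ppt/slides/slide") and name.endswith(".xml")):
                continue
            root = ET.fromstring(container.read(name))
            slide_boxes = []
            # A shape's position and size are stored in <a:xfrm> as <a:off> and <a:ext>.
            for xfrm in root.iterfind(".//p:sp/p:spPr/a:xfrm", NS):
                off = xfrm.find("a:off", NS)
                ext = xfrm.find("a:ext", NS)
                if off is None or ext is None:
                    continue
                slide_boxes.append((int(off.get("x")), int(off.get("y")),
                                    int(ext.get("cx")), int(ext.get("cy"))))
            boxes[name] = slide_boxes
    return boxes


def make_demo_pptx():
    """Build a tiny stand-in container with one slide part (hypothetical data)."""
    slide = (
        '<p:sld xmlns:a="{a}" xmlns:p="{p}"><p:cSld><p:spTree>'
        '<p:sp><p:spPr><a:xfrm><a:off x="914400" y="457200"/>'
        '<a:ext cx="1828800" cy="914400"/></a:xfrm></p:spPr></p:sp>'
        '</p:spTree></p:cSld></p:sld>'
    ).format(**NS)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as container:
        container.writestr("ppt/slides/slide1.xml", slide)
    return buf.getvalue()


demo_boxes = extract_shape_boxes(make_demo_pptx())
print(demo_boxes)
```

Coordinates come back in English Metric Units (914,400 EMUs per inch), so a real pipeline would likely normalize them before storage.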
Online Querying and Retrieval:
- Query Construction: Users specify layout constraints (such as slide width, fonts, table dimensions, image positions) via a GUI, which are encoded as an XML RetrievalQuery (RQ).
- Matching & Similarity Computation: For each candidate PageFeature of the matching type, the system performs both exact and approximate matching to compute a document similarity score S.
- Ranking & Output: RetrievalResults are ranked and filtered by user-selected similarity thresholds and can be visualized or exported for downstream processing.
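The ranking and filtering step can be illustrated as follows. This is a sketch under assumptions: the `RetrievalResult` field names and the `rank_results` helper are hypothetical, and the scores and threshold are made-up demo values, not figures from the evaluation.

```python
from collections import namedtuple

# Hypothetical result record: one scored page or slide per entry.
RetrievalResult = namedtuple("RetrievalResult", ["document", "page", "score"])


def rank_results(results, threshold):
    """Drop results below the user-selected threshold, then sort best-first."""
    kept = [r for r in results if r.score >= threshold]
    return sorted(kept, key=lambda r: r.score, reverse=True)


candidates = [
    RetrievalResult("report_a.pptx", 1, 0.55),
    RetrievalResult("report_b.pptx", 3, 0.91),
    RetrievalResult("memo_c.pptx", 2, 0.72),
]
ranked = rank_results(candidates, threshold=0.6)
for result in ranked:
    print(result.document, result.page, result.score)
```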
3. Layout Feature Extraction
SIMILAR supports Microsoft OOXML file types (DOCX, PPTX, XLSX) and parses key XML components relevant for each type:
- PPTX: slide#.xml, slideMaster#.xml, slideLayout#.xml, theme#.xml
- DOCX: document.xml, styles.xml, header#.xml, footer#.xml
- XLSX: sheet#.xml, drawing#.xml, styles.xml, chart#.xml
Layout elements are processed via type-specific handlers, with extracted data structured as:
| Feature | Attributes |
|---|---|
| PageSize | width, height, margins |
| TextBox | x, y, width, height, fontName, fontSize, fontColor |
| Image | x, y, width, height, mediaType |
| Table | x, y, width, height, rows, cols, cellStyles |
| Shape | x, y, width, height, type |
These are serialized into XML "PageFeature" subtrees that can be rapidly compared during search.
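As an illustration of such a record, the sketch below serializes extracted features into a "PageFeature" XML subtree with `xml.etree.ElementTree`. The element and attribute names mirror the feature table, but the exact schema is an assumption, and `build_page_feature` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_page_feature(page_size, elements):
    """Serialize one page's layout features as a PageFeature element.

    page_size: dict of PageSize attributes.
    elements:  list of (feature name, attribute dict) pairs, e.g. TextBox or Image.
    """
    root = ET.Element("PageFeature")
    # XML attribute values must be strings, so numeric values are converted.
    ET.SubElement(root, "PageSize", {k: str(v) for k, v in page_size.items()})
    for kind, attrs in elements:
        ET.SubElement(root, kind, {k: str(v) for k, v in attrs.items()})
    return root


feature = build_page_feature(
    {"width": 9144000, "height": 6858000},
    [
        ("TextBox", {"x": 914400, "y": 457200, "width": 1828800,
                     "height": 914400, "fontName": "Calibri", "fontSize": 18}),
        ("Image", {"x": 4572000, "y": 457200, "width": 1828800,
                   "height": 1371600, "mediaType": "image/png"}),
    ],
)
xml_text = ET.tostring(feature, encoding="unicode")
print(xml_text)
```

Keeping every feature as a flat element with attributes makes per-constraint comparison a matter of looking up one element type and one attribute name.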
4. Similarity Computation
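Suppose a RetrievalQuery (RQ) consists of $n$ atomic constraints $c_1, \dots, c_n$. For a candidate PageFeature $P$, a per-constraint similarity $s_i$ is computed for each $c_i$, and the results are then averaged:

$$S = \frac{1}{n} \sum_{i=1}^{n} s_i$$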
Matching strategies by constraint type:
- Exact Matching (EM): the score is 1 if the candidate value equals the query value, else 0.
- Approximate Matching (AM):
  - AM-1 (Categorical): full score for a match, 0 otherwise.
  - AM-2 (Typology/tolerance): 1 for an exact match, 0.5 for an in-category match with a different subtype, 0 otherwise.
  - AM-3 (Scalars): a graded score for numeric attributes, with its normalization parameter determined from dataset bounds.
  - AM-4 (Coordinates): a graded score based on the Euclidean distance between point attributes.
The similarity threshold at which human users perceived results as "visually similar" stabilized once a sufficient number of constraints was used.
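The matching strategies and the averaging step can be sketched as below. The exact-match and typology scores (1, 0.5, 0) follow the description above. The linear falloff used for scalars and coordinates is an assumption made for illustration only and is not taken from the source. All function names, the `SHAPE_CATEGORY` mapping, and the demo values are hypothetical.

```python
import math


def exact_match(query_value, candidate_value):
    """EM: 1 if the values are equal, else 0."""
    return 1.0 if query_value == candidate_value else 0.0


def typology_match(query_value, candidate_value, category_of):
    """AM-2: 1 for exact, 0.5 for same category but different subtype, else 0."""
    if query_value == candidate_value:
        return 1.0
    query_category = category_of.get(query_value)
    if query_category is not None and query_category == category_of.get(candidate_value):
        return 0.5
    return 0.0


def scalar_match(query_value, candidate_value, value_range):
    """AM-3 stand-in. ASSUMPTION: linear falloff over a range taken from dataset bounds."""
    if value_range <= 0:
        return exact_match(query_value, candidate_value)
    return max(0.0, 1.0 - abs(query_value - candidate_value) / value_range)


def coordinate_match(query_point, candidate_point, max_distance):
    """AM-4 stand-in. ASSUMPTION: linear falloff over the Euclidean distance."""
    distance = math.hypot(query_point[0] - candidate_point[0],
                          query_point[1] - candidate_point[1])
    if max_distance <= 0:
        return 1.0 if distance == 0 else 0.0
    return max(0.0, 1.0 - distance / max_distance)


def document_similarity(scores):
    """Average the per-constraint scores into one similarity score."""
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical subtype-to-category mapping for shapes.
SHAPE_CATEGORY = {"rect": "rectangle", "roundRect": "rectangle", "ellipse": "ellipse"}

scores = [
    exact_match("Calibri", "Calibri"),
    typology_match("rect", "roundRect", SHAPE_CATEGORY),
    scalar_match(18, 20, value_range=40),
    coordinate_match((0, 0), (3, 4), max_distance=10.0),
]
overall = document_similarity(scores)
print(scores, overall)
```

Because every strategy returns a value between 0 and 1, the average stays in the same range regardless of how many constraints a query contains.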
5. Query Mechanism and User Workflow
The user interface is implemented in Qt and allows domain experts to construct queries interactively. The GUI dialog captures document type, selectable layout items, and value entry. Queries are serialized as XML matching the feature schema.
On execution, the core module dispatches documents of matching type to parsers, loads the extracted PageFeature, executes per-constraint matching, and compiles SimVtree results, including final similarity scores for retrieval ranking (Chung, 2018).
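A query serialized as XML might be read back into atomic constraints as in the sketch below. The `RetrievalQuery` layout, the `docType` and `match` attributes, and the `parse_query` helper are assumptions for illustration. The source states only that queries are XML matching the feature schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical query: each child element names a layout item, each attribute a constraint.
QUERY_XML = """
<RetrievalQuery docType="pptx">
  <PageSize width="9144000" height="6858000" match="EM"/>
  <TextBox fontName="Calibri" match="EM"/>
  <Table rows="4" cols="3" match="EM"/>
</RetrievalQuery>
"""


def parse_query(xml_text):
    """Return (document type, [(item, attribute, value, strategy), ...])."""
    root = ET.fromstring(xml_text.strip())
    constraints = []
    for item in root:
        strategy = item.get("match", "EM")
        for attr, value in item.attrib.items():
            if attr == "match":
                continue  # 'match' selects the strategy and is not itself a constraint
            constraints.append((item.tag, attr, value, strategy))
    return root.get("docType"), constraints


doc_type, constraints = parse_query(QUERY_XML)
print(doc_type)
for constraint in constraints:
    print(constraint)
```

The document type selects which parser and which stored PageFeature records are considered, and each tuple then drives one per-constraint comparison.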
6. Implementation and Performance
- Language & GUI: Python 3.4 with Qt 5.4.
- Core Components: SSDocMainDialog (GUI handler), WorkerThread (asynchronous processing), SSDocCore (pipeline orchestration), format-specific FileParsers, SimVtree (stores features and similarity scores).
- Data: GOVDOCS1 converted corpus (PPTX: 4,140; DOCX: 5,451; XLSX: 7,124).
- Performance: Extraction and parsing take 13–24 minutes depending on file type. Retrieval queries complete within 0.5–12 minutes, and overall end-to-end times stay under 30 minutes for all formats.
- Effectiveness: Layout similarity search significantly outperformed keyword search in forensic scenarios, with retrieval precision and recall up to 0.83 versus 0.06–0.10 for keywords when using layout constraints.
7. Strengths, Limitations, and Future Directions
Strengths:
- Introduces an orthogonal similarity axis (visual layout) complementing content-based retrieval.
- User-controllable, high-granularity queries.
- Modular architecture facilitating integration with forensic pipelines.
Limitations:
- Prototype is in-memory; scaling to large corpora requires database or indexed storage.
- Currently supports only Office OOXML formats; lacks generalization to PDF, ODF, CAD, or scanned images.
- Single-threaded operation; lacks distributed or parallel retrieval capability.
Planned Extensions:
- Broader format support and feature types (charts, embedded objects).
- Incorporation of high-performance DBMS or spatial indexes (e.g., R-tree for geometric features).
- Network-exposed services for multi-user access.
- Machine-learning-driven S-value calibration and automatic discrimination of highly-separable layout features for relevance filtering.
8. Impact and Role in Digital Forensics
The SIMILAR Framework fills a critical gap in investigative workflows by enabling document triage and linkage based not on textual or structural similarity, but on persistent layout conventions. Empirical evaluation demonstrates that layout-based similarity has the potential to surface clusters of adversarial or coordinated documents otherwise invisible to content analysis. Its modular structure and empirical performance benchmarks position it as a foundational retrieval layer for forensic, compliance, and archival systems concerned with layout provenance and organizational signature detection (Chung, 2018).