PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT (2103.11933v3)

Published 22 Mar 2021 in cs.LG and econ.EM

Abstract: This study provides an efficient approach for using text data to calculate patent-to-patent (p2p) technological similarity, and presents a hybrid framework for leveraging the resulting p2p similarity for applications such as semantic search and automated patent classification. We create embeddings using Sentence-BERT (SBERT) based on patent claims. We leverage SBERT's efficiency in creating embedding distance measures to map p2p similarity in large sets of patent data. We deploy our framework for classification with a simple K-Nearest Neighbors (KNN) model that predicts the Cooperative Patent Classification (CPC) of a patent based on the class assignment of the K patents with the highest p2p similarity. We thereby validate that the p2p similarity captures their technological features in terms of CPC overlap, and at the same time demonstrate the usefulness of this approach for automatic patent classification based on text data. Furthermore, the presented classification framework is simple and the results are easy for end-users to interpret and evaluate. In the out-of-sample model validation, we are able to perform a multi-label prediction of all assigned CPC classes on the subclass (663) level on 1,492,294 patents with an accuracy of 54% and F1 score > 66%, which suggests that our model outperforms the current state-of-the-art in text-based multi-label and multi-class patent classification. We furthermore discuss the applicability of the presented framework for semantic IP search, patent landscaping, and technology intelligence. We finally point towards a future research agenda for leveraging multi-source patent embeddings, assessing their appropriateness across applications, and improving and validating patent embeddings by creating domain-expert curated Semantic Textual Similarity (STS) benchmark datasets.

Authors (3)
  1. Hamid Bekamiri (2 papers)
  2. Daniel S. Hain (3 papers)
  3. Roman Jurowetzki (6 papers)
Citations (29)

Summary

Insights into PatentSBERTa: Advancements in Patent Analytics Using Augmented SBERT

The paper "PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT" contributes to patent analytics by applying recent NLP techniques to patent claim text. The authors propose a hybrid framework that combines sentence embeddings with a nearest-neighbor classifier to compute patent-to-patent (p2p) similarity efficiently and to automate patent classification.

The primary innovation is the use of Sentence-BERT (SBERT) to generate embeddings from patent claims, subsequently augmented with techniques derived from RoBERTa. This augmentation allows SBERT to be fine-tuned with in-domain supervised data, adapting the model to patent claim text. The proposed system addresses common challenges in patent analysis, such as computational expense and domain-specific language, and presents a workflow that can be applied to large datasets without extensive hardware resources.

The key numerical results concern a multi-class, multi-label classification task. The authors report an accuracy of 54% and an F1 score exceeding 66% when predicting all assigned CPC classes at the subclass level (663 classes) on a dataset of 1,492,294 patents. They present these metrics as improvements over existing models such as PatentBERT and DeepPatent, in support of their claim of better classification performance.
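The gap between the two reported figures is easier to see with a small worked example. The sketch below assumes that accuracy means exact-match (subset) accuracy and that F1 is micro-averaged over labels; the summary does not state which definitions the paper uses, so both are assumptions. The function names and the toy CPC labels are illustrative and not taken from the paper.

```python
def subset_accuracy(true_sets, pred_sets):
    """Share of samples whose predicted label set equals the true set exactly."""
    matches = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) == set(p))
    return matches / len(true_sets)


def micro_f1(true_sets, pred_sets):
    """F1 computed from label-level counts pooled over all samples."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        t, p = set(t), set(p)
        tp += len(t & p)   # labels predicted and correct
        fp += len(p - t)   # labels predicted but wrong
        fn += len(t - p)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy CPC subclass assignments (illustrative only, not data from the paper).
y_true = [{"G06F", "H04L"}, {"A61K"}, {"B60L", "H02J"}]
y_pred = [{"G06F"}, {"A61K"}, {"B60L", "H01M"}]

acc = subset_accuracy(y_true, y_pred)  # only the second patent matches exactly
f1 = micro_f1(y_true, y_pred)          # partial matches still earn credit
print(f"subset accuracy = {acc:.3f}, micro F1 = {f1:.3f}")
```

Under these assumed definitions, a prediction that recovers some but not all of a patent's classes counts as a miss for accuracy but still contributes to F1, which is one way an F1 score can sit above the accuracy figure.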

The paper also explores the broader implications of its hybrid model. By streamlining p2p similarity computation, the framework supports applications such as semantic patent search and patent landscaping, both useful for technology intelligence. Because the KNN classifier predicts a patent's classes from those of its most similar patents, its output is simple to interpret, which makes the system accessible to end-users and may encourage practical adoption.
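The KNN step can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: small hand-written vectors stand in for SBERT claim embeddings, cosine similarity is assumed as the p2p similarity measure, and the voting rule (keep every class that appears in at least `min_votes` of the K neighbors) is a hypothetical choice, since the summary does not describe how neighbor classes are aggregated. All function and parameter names are illustrative.

```python
from collections import Counter

import numpy as np


def cosine_similarity_matrix(queries, corpus):
    """Cosine similarity between every query row and every corpus row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return q @ c.T


def knn_predict_labels(query_vecs, corpus_vecs, corpus_labels, k=3, min_votes=2):
    """Predict a label set per query from its k most similar corpus items.

    Assumed voting rule: keep labels held by at least `min_votes` neighbors.
    """
    sims = cosine_similarity_matrix(query_vecs, corpus_vecs)
    predictions = []
    for row in sims:
        top_k = np.argsort(-row)[:k]  # indices of the k highest similarities
        votes = Counter(label for i in top_k for label in corpus_labels[i])
        predictions.append({label for label, n in votes.items() if n >= min_votes})
    return predictions


# Toy 3-dimensional "embeddings" standing in for SBERT claim embeddings.
corpus_vecs = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.8, 0.0, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
])
corpus_labels = [{"G06F"}, {"G06F", "H04L"}, {"G06F"}, {"A61K"}, {"A61K", "C07D"}]

query_vecs = np.array([[1.0, 0.05, 0.0]])
predicted = knn_predict_labels(query_vecs, corpus_vecs, corpus_labels)
print(predicted)
```

Each prediction can be traced back to the specific neighboring patents that voted for it, which is the interpretability property the summary highlights. On a corpus of more than a million patents, the dense similarity matrix used here would likely need to be replaced with batched or approximate nearest-neighbor search.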

Looking ahead, the authors outline a future research agenda: improving patent embeddings by drawing on multi-source data, and validating them against Semantic Textual Similarity (STS) benchmark datasets curated by domain experts. Such developments could further improve accuracy and broaden the range of applications.

To summarize, the proposed PatentSBERTa framework is a step forward in automating patent classification and measuring technological similarity. While challenges remain, such as dataset size and scalability, the paper makes a compelling case for integrating deep NLP techniques into patent data analytics and sets out directions for further work in this domain.