PaECTER: Patent-level Representation Learning using Citation-informed Transformers (2402.19411v1)

Published 29 Feb 2024 in cs.IR, cs.CL, and cs.LG

Abstract: PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent-specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset on two different rank evaluation metrics. PaECTER predicts at least one most similar patent at a rank of 1.32 on average when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.

References (12)
  1. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3613–3618, Hong Kong, China. Association for Computational Linguistics.
  2. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
  3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  4. Patent citation data in social science research: Overview and best practices. Journal of the Association for Information Science and Technology, 68(6):1360–1374.
  5. Measuring Technological Innovation over the Long Run. American Economic Review: Insights, 3(3):303–320.
  6. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR 2019), New Orleans, LA.
  7. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  8. SciRepEval: A Multi-Format Benchmark for Scientific Document Representations.
  9. Leveraging the BERT algorithm for Patents with TensorFlow and BigQuery. Technical report, Google.
  10. SEARCHFORMER: Semantic patent embeddings by siamese transformers for prior art search. World Patent Information, 73:102192.
  11. Analysing European and International Patent Citations: A Set of EPO Patent Database Building Blocks. OECD Science, Technology and Industry Working Papers.
  12. Transformers: State-of-the-Art Natural Language Processing. In Liu, Q. and Schlangen, D., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Summary

  • The paper introduces PaECTER, a citation-informed Transformer model fine-tuned on 300K European patent families to improve semantic-similarity and citation-prediction performance.
  • It employs a contrastive learning approach with triplet loss, leveraging examiner-added citation data to optimize domain-specific patent embeddings.
  • PaECTER outperforms existing models, including BERT for Patents, by achieving improved Rank First Relevant (RFR) and Mean Average Precision (MAP) scores on similarity tasks.

Enhancing Patent Analysis with PaECTER: A Novel Citation-informed Transformer Approach

Introduction to PaECTER

In the domain of patent analysis and management, the quest for better tools and models for representing and analyzing patent texts is ongoing. The introduction of PaECTER (Patent Embeddings using Citation-informed TransformERs) marks a significant advance in this area. The model fine-tunes BERT for Patents using examiner-added citation information to generate numerical representations of patent documents, and the resulting encoder outperforms existing state-of-the-art models on several fronts.
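
For concreteness, the sketch below shows how such embeddings could be generated with the sentence-transformers library. The Hugging Face model identifier mpi-inno-comp/paecter and the sample patent texts are illustrative assumptions, not taken from the paper; check the Hub for the actual identifier.

```python
# Hedged sketch: encoding patent text into PaECTER embeddings.
# Assumption: the model is published on Hugging Face as "mpi-inno-comp/paecter".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mpi-inno-comp/paecter")

# Hypothetical patent titles/abstracts
patents = [
    "A battery electrode comprising a silicon-carbon composite material ...",
    "Method and apparatus for wireless charging of electric vehicles ...",
]

embeddings = model.encode(patents)
# One vector per document; 1024-dimensional if based on BERT for Patents (BERT-large)
print(embeddings.shape)
```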

The Need for Domain-Specific Models

Domain-specific models such as PaECTER are essential for accurately processing and understanding patent texts, which contain language and terminology rarely found in general corpora. Existing models tailored to scientific or general English text fall short when applied directly to patent documents because of the distinctive linguistic features and terminology used within patents. By incorporating patent-specific vocabulary and domain knowledge, PaECTER achieves greater accuracy in tasks such as similarity analysis and citation prediction.

Training Data and Methodology

The training dataset for PaECTER consisted of 300,000 English-language patent families filed with the European Patent Office (EPO) between 1985 and 2022, chosen because EPO examiners curate citation information of consistently high quality. Training follows a contrastive learning approach: citation information identifies similar (cited) and dissimilar (uncited) patents, and a triplet loss function optimizes the model so that embeddings of citation-linked patents lie closer together than those of unrelated ones. This lets PaECTER quantify similarities between patents based on both text and citation context.
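
As an illustration of that training signal (not the authors' code), the following sketch fine-tunes a base encoder with a triplet objective in sentence-transformers, treating an examiner-cited patent as the positive and an uncited one as the negative. The base checkpoint anferico/bert-for-patents and all texts are illustrative assumptions.

```python
# Illustrative sketch of citation-informed triplet training in the SPECTER style.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint; mean pooling is added automatically for plain BERT models
model = SentenceTransformer("anferico/bert-for-patents")

train_examples = [
    InputExample(texts=[
        "Anchor patent: silicon-carbon composite anode ...",    # query patent
        "Positive: examiner-cited prior art on Si anodes ...",  # cited -> similar
        "Negative: unrelated patent on drone navigation ...",   # uncited -> dissimilar
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)  # Euclidean triplet margin loss by default

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```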

Comparative Performance and Results

PaECTER showed superior performance in similarity tasks compared with other models, including the previous state of the art, BERT for Patents. In particular, PaECTER outperformed BERT for Patents on the patent citation prediction test dataset on both Rank First Relevant (RFR) and Mean Average Precision (MAP) metrics. These results underscore PaECTER's ability to represent patent documents accurately and to predict their relevance and similarity to other patents.
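
To make the two metrics concrete, the sketch below computes RFR and average precision for a single query's ranked candidates. The setup (a few relevant patents scored against many irrelevant ones, echoing the paper's 25-negative evaluation) and all numbers are illustrative.

```python
# Hedged sketch of the two rank metrics, assuming higher score = more similar.
import numpy as np

def rank_first_relevant(scores, relevant):
    """1-based rank of the highest-scoring relevant candidate."""
    order = np.argsort(-np.asarray(scores))  # best score first
    return next(i + 1 for i, idx in enumerate(order) if relevant[idx])

def average_precision(scores, relevant):
    """Average precision over the ranked candidate list."""
    order = np.argsort(-np.asarray(scores))
    hits, precisions = 0, []
    for i, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy example: one relevant patent among three candidates
scores   = [0.91, 0.40, 0.75]   # cosine similarities to the query
relevant = [True, False, False]
print(rank_first_relevant(scores, relevant))  # -> 1
print(average_precision(scores, relevant))    # -> 1.0
```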

Practical Implications and Future Directions

The development and deployment of PaECTER have significant practical implications for both patent analysis and the broader intellectual property management field. By facilitating more accurate and efficient semantic similarity searches, PaECTER enhances the capability of inventors and patent examiners to conduct prior art searches and knowledge flow analysis. The model's ability to detect semantic similarities where citations may be lacking or strategically manipulated opens new avenues for identifying innovative breakthroughs and understanding patent ecosystems more deeply. Looking forward, the application of PaECTER could further expand into automated patent classification, trend analysis, and the development of more advanced patent search and analysis tools. As the model is publicly available on Hugging Face, it also provides a platform for further research and development within the domain.
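
As a sketch of that prior-art use case, the snippet below runs a cosine-similarity search over a small corpus of embedded patents. The model identifier and corpus are, again, hypothetical assumptions for illustration.

```python
# Hedged sketch of semantic prior-art search with PaECTER embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mpi-inno-comp/paecter")  # assumed model ID

corpus = [
    "Lithium-ion cell with silicon-dominant anode ...",
    "UAV flight-path planning using reinforcement learning ...",
    "Fast-charging protocol for silicon-carbon electrodes ...",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "Silicon-carbon composite anode with improved cycle life"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for h in hits:
    print(f"{h['score']:.3f}  {corpus[h['corpus_id']]}")
```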

Conclusion

PaECTER represents a significant leap forward in the nuanced analysis of patent texts, offering a powerful tool for researchers and practitioners in the domain of intellectual property management. By leveraging citation-informed Transformers and focusing on domain-specific embeddings, PaECTER improves upon existing methods for understanding and navigating the complex landscape of patent documents. Its success in numerical representation and similarity tasks has the potential to transform how patents are analyzed, aiding in the discovery of innovative technologies and streamlining prior art searches.

Acknowledgements

The development of PaECTER was supported by a grant from the European Patent Office under their Academic Research Programme and utilized the high-performance computing cluster at the Max Planck Computing and Data Facility. This support underscores the collaborative effort between academia and industry to innovate and improve upon tools for patent analysis and intellectual property management.
