PaECTER: Patent-level Representation Learning using Citation-informed Transformers (2402.19411v1)

Published 29 Feb 2024 in cs.IR, cs.CL, and cs.LG

Abstract: PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent-specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset on two different rank evaluation metrics. PaECTER predicts at least one most similar patent at a rank of 1.32 on average when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.

References (12)
  1. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3613–3618, Hong Kong, China. Association for Computational Linguistics.
  2. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
  3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  4. Patent citation data in social science research: Overview and best practices. Journal of the Association for Information Science and Technology, 68(6):1360–1374.
  5. Measuring Technological Innovation over the Long Run. American Economic Review: Insights, 3(3):303–320.
  6. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR 2019), New Orleans, LA.
  7. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  8. SciRepEval: A Multi-Format Benchmark for Scientific Document Representations.
  9. Leveraging the BERT algorithm for Patents with TensorFlow and BigQuery. Technical report, Google.
  10. SEARCHFORMER: Semantic patent embeddings by siamese transformers for prior art search. World Patent Information, 73:102192.
  11. Analysing European and International Patent Citations: A Set of EPO Patent Database Building Blocks. OECD Science, Technology and Industry Working Papers.
  12. Transformers: State-of-the-Art Natural Language Processing. In Liu, Q. and Schlangen, D., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Summary

  • The paper introduces PaECTER, a citation-informed Transformer model fine-tuned on 300K European patent families to improve semantic-similarity and citation-prediction performance.
  • It employs a contrastive learning approach with triplet loss, leveraging examiner-added citation data to optimize domain-specific patent embeddings.
  • PaECTER outperforms existing models, including BERT for Patents, by achieving improved Rank First Relevant (RFR) and Mean Average Precision (MAP) scores on similarity tasks.

Enhancing Patent Analysis with PaECTER: A Novel Citation-informed Transformer Approach

Introduction to PaECTER

In the domain of patent analysis and management, the quest for better tools and models for representing and analyzing patent texts is ongoing. The introduction of PaECTER (Patent Embeddings using Citation-informed TransformERs) marks a significant advance in this area. The model fine-tunes BERT for Patents using examiner-added citation information to generate numerical representations of patent documents, and the resulting encoder outperforms existing state-of-the-art models on several fronts.
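
For concreteness, the sketch below shows how such embeddings could be generated with the sentence-transformers library. The Hugging Face model identifier mpi-inno-comp/paecter and the sample patent texts are illustrative assumptions, not taken from the paper; check the Hub for the actual identifier.

```python
# Hedged sketch: encoding patent text into PaECTER embeddings.
# Assumption: the model is published on Hugging Face as "mpi-inno-comp/paecter".
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mpi-inno-comp/paecter")

# Hypothetical patent titles/abstracts
patents = [
    "A battery electrode comprising a silicon-carbon composite material ...",
    "Method and apparatus for wireless charging of electric vehicles ...",
]

embeddings = model.encode(patents)
# One vector per document; 1024-dimensional if based on BERT for Patents (BERT-large)
print(embeddings.shape)
```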

The Need for Domain-Specific Models

Domain-specific models such as PaECTER are essential for accurately processing and understanding patent texts, which contain language and terminology rarely found in general corpora. Existing models tailored to scientific or general English text fall short when applied directly to patent documents because of the distinctive linguistic features and terminology used within patents. By incorporating patent-specific vocabulary and domain knowledge, PaECTER achieves greater accuracy in tasks such as similarity analysis and citation prediction.

Training Data and Methodology

The training dataset for PaECTER consisted of 300,000 English-language patent families filed with the European Patent Office (EPO) between 1985 and 2022, chosen because EPO examiners curate citation information of consistently high quality. Training follows a contrastive learning approach: citation information identifies similar (cited) and dissimilar (uncited) patents, and a triplet loss function optimizes the model so that embeddings of citation-linked patents lie closer together than those of unrelated ones. This lets PaECTER quantify similarities between patents based on both text and citation context.
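
As an illustration of that training signal (not the authors' code), the following sketch fine-tunes a base encoder with a triplet objective in sentence-transformers, treating an examiner-cited patent as the positive and an uncited one as the negative. The base checkpoint anferico/bert-for-patents and all texts are illustrative assumptions.

```python
# Illustrative sketch of citation-informed triplet training in the SPECTER style.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint; mean pooling is added automatically for plain BERT models
model = SentenceTransformer("anferico/bert-for-patents")

train_examples = [
    InputExample(texts=[
        "Anchor patent: silicon-carbon composite anode ...",    # query patent
        "Positive: examiner-cited prior art on Si anodes ...",  # cited -> similar
        "Negative: unrelated patent on drone navigation ...",   # uncited -> dissimilar
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)  # Euclidean triplet margin loss by default

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```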

Comparative Performance and Results

PaECTER showed superior performance in similarity tasks compared with other models, including the previous state of the art, BERT for Patents. In particular, PaECTER outperformed BERT for Patents on the patent citation prediction test dataset on both Rank First Relevant (RFR) and Mean Average Precision (MAP) metrics. These results underscore PaECTER's ability to represent patent documents accurately and to predict their relevance and similarity to other patents.
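
To make the two metrics concrete, the sketch below computes RFR and average precision for a single query's ranked candidates. The setup (a few relevant patents scored against many irrelevant ones, echoing the paper's 25-negative evaluation) and all numbers are illustrative.

```python
# Hedged sketch of the two rank metrics, assuming higher score = more similar.
import numpy as np

def rank_first_relevant(scores, relevant):
    """1-based rank of the highest-scoring relevant candidate."""
    order = np.argsort(-np.asarray(scores))  # best score first
    return next(i + 1 for i, idx in enumerate(order) if relevant[idx])

def average_precision(scores, relevant):
    """Average precision over the ranked candidate list."""
    order = np.argsort(-np.asarray(scores))
    hits, precisions = 0, []
    for i, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy example: one relevant patent among three candidates
scores   = [0.91, 0.40, 0.75]   # cosine similarities to the query
relevant = [True, False, False]
print(rank_first_relevant(scores, relevant))  # -> 1
print(average_precision(scores, relevant))    # -> 1.0
```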

Practical Implications and Future Directions

The development and deployment of PaECTER have significant practical implications for both patent analysis and the broader intellectual property management field. By facilitating more accurate and efficient semantic similarity searches, PaECTER enhances the capability of inventors and patent examiners to conduct prior art searches and knowledge flow analysis. The model's ability to detect semantic similarities where citations may be lacking or strategically manipulated opens new avenues for identifying innovative breakthroughs and understanding patent ecosystems more deeply. Looking forward, the application of PaECTER could further expand into automated patent classification, trend analysis, and the development of more advanced patent search and analysis tools. As the model is publicly available on Hugging Face, it also provides a platform for further research and development within the domain.
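
As a sketch of that prior-art use case, the snippet below runs a cosine-similarity search over a small corpus of embedded patents. The model identifier and corpus are, again, hypothetical assumptions for illustration.

```python
# Hedged sketch of semantic prior-art search with PaECTER embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mpi-inno-comp/paecter")  # assumed model ID

corpus = [
    "Lithium-ion cell with silicon-dominant anode ...",
    "UAV flight-path planning using reinforcement learning ...",
    "Fast-charging protocol for silicon-carbon electrodes ...",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query = "Silicon-carbon composite anode with improved cycle life"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for h in hits:
    print(f"{h['score']:.3f}  {corpus[h['corpus_id']]}")
```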

Conclusion

PaECTER represents a significant leap forward in the nuanced analysis of patent texts, offering a powerful tool for researchers and practitioners in the domain of intellectual property management. By leveraging citation-informed Transformers and focusing on domain-specific embeddings, PaECTER improves upon existing methods for understanding and navigating the complex landscape of patent documents. Its success in numerical representation and similarity tasks has the potential to transform how patents are analyzed, aiding in the discovery of innovative technologies and streamlining prior art searches.

Acknowledgements

The development of PaECTER was supported by a grant from the European Patent Office under their Academic Research Programme and utilized the high-performance computing cluster at the Max Planck Computing and Data Facility. This support underscores the collaborative effort between academia and industry to innovate and improve upon tools for patent analysis and intellectual property management.
