PaECTER: Patent-level Representation Learning using Citation-informed Transformers (2402.19411v1)
Abstract: PaECTER is a publicly available, open-source document-level encoder specific to patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent-specific pre-trained language model (BERT for Patents) on our patent citation prediction test dataset under two different rank evaluation metrics. On average, PaECTER ranks at least one relevant patent at position 1.32 when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant in the context of prior art search for both inventors and patent examiners. PaECTER is available on Hugging Face.
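As a usage illustration, the embeddings can be produced with the sentence-transformers library and compared by cosine similarity; the following is a minimal sketch, assuming the model is released on Hugging Face in sentence-transformers format (the model ID and the example patent texts are assumptions for illustration only):

```python
# Minimal sketch: encode patent texts with PaECTER and rank candidates
# by cosine similarity. The model ID "mpi-inno-comp/paecter" and the
# toy patent texts below are assumptions, not taken from the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mpi-inno-comp/paecter")

# Title + abstract text of a focal patent and two candidate patents.
focal = "A lithium-ion battery electrode comprising a silicon-carbon composite ..."
candidates = [
    "An anode material for rechargeable batteries based on silicon nanoparticles ...",
    "A method for brewing coffee using a pressurized water chamber ...",
]

# Encode into dense vectors and score candidates against the focal patent.
focal_emb = model.encode(focal, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(focal_emb, cand_embs)[0]

# Print candidates from most to least similar.
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text[:60]}")
```

The same embeddings can feed downstream tasks such as classification or nearest-neighbour prior art retrieval over a larger patent corpus.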
- SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3613–3618, Hong Kong, China. Association for Computational Linguistics.
- SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Patent citation data in social science research: Overview and best practices. Journal of the Association for Information Science and Technology, 68(6):1360–1374.
- Measuring Technological Innovation over the Long Run. American Economic Review: Insights, 3(3):303–320.
- Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR 2019), New Orleans, LA.
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations.
- Leveraging the BERT algorithm for Patents with TensorFlow and BigQuery. Technical report, Google.
- SEARCHFORMER: Semantic patent embeddings by siamese transformers for prior art search. World Patent Information, 73:102192.
- Analysing European and International Patent Citations: A Set of EPO Patent Database Building Blocks. OECD Science, Technology and Industry Working Papers.
- Transformers: State-of-the-Art Natural Language Processing. In Liu, Q. and Schlangen, D., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.