Natural Language Processing in Patents: A Survey (2403.04105v2)
Abstract: Patents, which encapsulate crucial technical and legal information, present a rich domain for natural language processing (NLP) applications. As NLP technologies evolve, large language models (LLMs) have demonstrated outstanding capabilities in general text processing and generation tasks. However, the application of LLMs in the patent domain remains under-explored and under-developed because of the complexity of patent processing. Understanding the unique characteristics of patent documents and the related research in the patent domain is essential for researchers to apply these tools effectively. This paper therefore aims to equip NLP researchers with the essential knowledge to navigate this complex domain efficiently. We introduce the fundamental aspects of patents to provide solid background information, particularly for readers unfamiliar with the patent system. In addition, we systematically break down the structural and linguistic characteristics unique to patents and map out how NLP can be leveraged for patent analysis and generation. Moreover, we present the spectrum of text-based patent-related tasks, including nine patent analysis tasks and four patent generation tasks.
Authors: Lekang Jiang, Stephan Goetz