Presenting Terrorizer: an algorithm for consolidating company names in patent assignees (2403.12083v1)
Abstract: Disambiguating company names is a significant challenge in extracting useful information from patents. Left unresolved, it biases research outcomes, mostly by underestimating the number of patents attributed to companies, particularly multinational corporations, which file patents under a plethora of names, including alternate spellings of the same entity and, in some cases, the names of subsidiaries. To date, these challenges have been addressed with labor-intensive dictionary-based or string-matching approaches, leaving the harmonization of patent assignees on large datasets mostly unresolved. To bridge this gap, this paper describes the Terrorizer algorithm, a text-based algorithm that leverages NLP, network theory, and rule-based techniques to harmonize the variants of company names recorded as patent assignees. The algorithm follows the tripartite structure of its antecedents (parsing, matching, and filtering stages) and adds an original "knowledge augmentation" phase, which enriches the information available on each assignee name. We apply Terrorizer to a set of 325,917 company names that appear as assignees of patents granted by the USPTO from 2005 to 2022. The performance of Terrorizer is evaluated on four gold-standard datasets. This validation shows two main things. First, the performance of Terrorizer is similar across different kinds of datasets, indicating that the algorithm generalizes well. Second, it achieves a higher F1 score than the algorithm currently used in PatentsView for the same task (Monath et al., 2021). Finally, we use the Tree-structured Parzen Estimator (TPE) optimization algorithm to tune the hyperparameters. The final result is a reduction of over 42% in the initial set of names.
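To make the parsing and matching stages concrete, the following is a minimal sketch and not the Terrorizer implementation. It assumes a hypothetical list of legal-form tokens (`LEGAL_FORMS`), swaps in the standard library's `difflib` string similarity for the paper's NLP component, and uses connected components (via union-find) in place of its network-theory step. The filtering and knowledge augmentation phases are omitted, and the `threshold` value is an arbitrary assumption.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical legal-form tokens to drop during parsing (an assumption,
# not the list used by the paper).
LEGAL_FORMS = {"inc", "corp", "corporation", "co", "company",
               "ltd", "limited", "llc", "gmbh", "ag", "sa"}


def parse_name(raw):
    """Parsing stage: lowercase, strip punctuation, drop legal-form tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    kept = [t for t in tokens if t not in LEGAL_FORMS]
    # Fall back to all tokens if the name consisted only of legal forms.
    return " ".join(kept or tokens)


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for an NLP model."""
    return SequenceMatcher(None, a, b).ratio()


def harmonize(names, threshold=0.85):
    """Matching stage: link similar parsed names, return groups of raw names."""
    parsed = [parse_name(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; fine for a toy example only.
    for i, j in combinations(range(len(names)), 2):
        if similarity(parsed[i], parsed[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    sample = ["Acme Widgets, Inc.", "ACME WIDGETS CORP",
              "Acme Widgetts Ltd.", "Globex GmbH"]
    for group in harmonize(sample):
        print(group)
```

On the toy input, the three Acme variants (including the misspelling) fall into one group and Globex stays separate. At the scale described in the abstract, an all-pairs comparison like this would be impractical, which is one reason a real pipeline needs blocking or embedding-based candidate retrieval.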
- Monath et al., Disambiguating patent inventors, assignees, and their locations in PatentsView, Tech. Rep. (2021).
- Z. Griliches, Patent statistics as economic indicators: a survey, in: R&D and productivity: the econometric evidence, University of Chicago Press, 1998, pp. 287–343.
- B. P. Abraham, S. D. Moitra, Innovation assessment through patent analysis, Technovation 21 (2001) 245–252.
- B. H. Hall, D. Harhoff, Recent research on the economics of patents, Annu. Rev. Econ. 4 (2012) 541–565.
- B. N. Sampat, A survey of empirical evidence on patents and innovation, NBER Working Paper Series (2018).
- K. Pavitt, Patent statistics as indicators of innovative activities: possibilities and problems, Scientometrics 7 (1985) 77–99.
- Inventive progress measured by multi-stage patent citation analysis, Research Policy 34 (2005) 1591–1607.
- Data production methods for harmonized patent statistics: Patentee sector allocation, Available at SSRN 944464 (2006).
- Measuring patent quality: Indicators of technological and economic value, OECD Science, Technology and Industry Working Papers (2013).
- Data production methods for harmonized patent statistics: Patentee name harmonization, Available at SSRN 944470 (2006).
- J. Raffo, S. Lhuillery, How to play the “names game”: Patent retrieval comparing different heuristics, Research Policy 38 (2009) 1617–1627.
- Who’s who in patents. A Bayesian approach, Cahiers du GREThA 7 (2009) 07–2009.
- Harmonizing harmonized patentee names: an exploratory assessment of top patentees, Eurostat Working Paper (2010).
- How to kill inventors: testing the massacrator© algorithm for inventor disambiguation, Scientometrics 101 (2014) 477–504.
- Disambiguation of patent inventors and assignees using high-resolution geolocation data, Scientific data 4 (2017) 1–21.
- Collective knowledge, prolific inventors and the value of inventions: An empirical study of french, german and british patents in the us, 1975–1999, Economics of Innovation and New Technology 17 (2008) 5–22.
- The careers and co-authorship networks of us patent-holders, since 1975, Unpublished Working Paper, Harvard University (2009).
- E. Miguélez, I. Gómez-Miguélez, Singling out individual inventors from patent data, Available at SSRN 1856875 (2011).
- Disambiguation and co-authorship networks of the us patent inventor database (1975–2010), Research Policy 43 (2014) 941–955.
- Seeing the non-stars: (some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records, Research Policy 44 (2015) 1672–1701.
- Inventor name disambiguation for a patent database using a random forest and dbscan, in: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, 2016, pp. 269–270.
- Disambiguating uspto inventor names with semantic fingerprinting and dbscan clustering, The Electronic Library 37 (2019) 225–239.
- S. M. Petrie, T. D. Julius, A novel text representation which enables image classifiers to also simultaneously classify text, applied to name disambiguation, Scientometrics (2023) 1–25.
- Identifying the technology profiles of r&d performing firms—a matching of r&d and patent data, International Journal of Innovation and Technology Management 14 (2017) 1740003.
- Simple and effective way to disambiguate and standardize patent applicants using an attention mechanism with data augmentation, IEEE Access (2023).
- Matching patents to compustat firms, 1980–2015: Dynamic reassignment, name changes, and ownership structures, Research Policy 50 (2021) 104217.
- Standardization and accuracy of japanese patent applicant names, Available at SSRN 2147190 (2012).
- Deeppatent: patent classification with convolutional neural networks and word embedding, Scientometrics 117 (2018) 721–744.
- Technet: Technology semantic network based on patent data, Expert Systems with Applications 142 (2020) 112995.
- A machine learning approach for solar power technology review and patent evolution analysis, Applied Sciences 9 (2019) 1478.
- Natural language processing to identify the creation and impact of new technologies in patent text: Code, data, and new measures, Research Policy 50 (2021) 104144.
- Patent transactions in the marketplace: Lessons from the uspto patent assignment dataset, Journal of Economics & Management Strategy 27 (2018) 343–371.
- Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
- Foreign competition and domestic innovation: Evidence from us patents, American Economic Review: Insights 2 (2020) 357–374.
- N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019).
- Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment 2008 (2008) P10008.
- Detecting global bridges in networks, Journal of Complex Networks 4 (2016) 319–329.
- Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2623–2631.
- Disambiguation of company names via deep recurrent networks, Expert Systems with Applications 238 (2024) 122035.
- M. Coffano, G. Tarasconi, Crios-patstat database: sources, contents and access rules, Center for Research on Innovation, Organization and Strategy, CRIOS working paper (2014).