PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures (2501.15074v1)

Published 25 Jan 2025 in cs.CV and cs.AI

Abstract: Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial to effective knowledge sharing and enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To this end, we introduce PatentDesc-355K, a novel large-scale dataset containing ~355K patent figures along with their brief and detailed textual descriptions extracted from more than 60K US patent documents. In addition, we propose PatentLMM - a novel multimodal LLM specifically tailored to generate high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts the performance, generating coherent descriptions compared to fine-tuning similar-sized off-the-shelf multimodal models. PatentDesc-355K and PatentLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. We make the code and data publicly available.
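
The abstract describes a two-component design: a specialized vision encoder (PatentMME) whose features are passed to a domain-adapted language model (PatentLLaMA). The sketch below shows the general shape of such a pipeline, assuming a LLaVA-style setup in which projected visual tokens are prepended to the text tokens; every module, name, and dimension here is an illustrative placeholder, not the authors' implementation.

```python
# Toy stand-in for a two-component multimodal pipeline (vision encoder ->
# projector -> language model). Dimensions are reduced for the sketch;
# LLaMA-2 7B, for example, uses a 4096-d hidden size and a ~32K vocabulary.
import torch
import torch.nn as nn

class PatentLMMSketch(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=512, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)      # stand-in for PatentMME patch features
        self.projector = nn.Linear(vis_dim, llm_dim)           # aligns visual features with the LLM space
        self.embed_tokens = nn.Embedding(vocab_size, llm_dim)  # stand-in for the LLM's token embeddings
        self.decoder = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)  # stand-in for PatentLLaMA
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_feats, input_ids):
        vis = self.projector(self.vision_encoder(patch_feats))  # (B, P, llm_dim)
        txt = self.embed_tokens(input_ids)                      # (B, T, llm_dim)
        hidden = self.decoder(torch.cat([vis, txt], dim=1))     # visual tokens prepended to text tokens
        return self.lm_head(hidden)                             # (B, P + T, vocab_size)

model = PatentLMMSketch()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```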

Glossary

  • BEiT: A self-supervised vision model that uses discrete tokens to learn image representations via masked prediction. "Our formulation of the LAMIM objective is similar to BEiT~\cite{bao2022beit} and therefore requires a discrete image tokenizer."
  • BLEU: An n-gram precision-based metric for evaluating generated text against references. "our proposed approach surpasses their best performance on the average BLEU metric by 10.22\% and 4.43\% on an absolute scale for generating brief and detailed descriptions, respectively." (A small scoring example using BLEU and ROUGE appears after this glossary.)
  • BPE tokenizer: Byte Pair Encoding; a subword tokenization method that splits text into frequently occurring units. "The OCR extracted text is tokenized using the BPE tokenizer~\cite{bpe} and represented using a learnable embedding matrix."
  • Cosine schedule: A learning rate schedule that follows a cosine decay over training steps. "stage 2 training takes place at a learning rate of 2e-4 with a cosine schedule, for 12K steps using Adam optimizer." (A minimal warm-up plus cosine-decay sketch appears after this glossary.)
  • dVAE: Discrete Variational Autoencoder; an image tokenizer that maps images to discrete codebook indices. "competing works dVAE~\cite{ramesh2021dall.e-dvae} and VQGAN~\cite{Esser_2021taming}."
  • GPT-4V: The vision-enabled variant of GPT-4 capable of processing images and generating text. "GPT-4V demonstrated superior performance among baselines across all metrics, significantly outperforming other baselines owing to its large scale and the diverse data it has seen during its pre-training."
  • HUPD: Harvard USPTO Patent Dataset; a large corpus of US patent documents used for domain adaptation. "We continue to pre-train the LLaMA-2 7B model using LoRA~\cite{hu2022lora} adapters, on the descriptions from HUPD patent dataset~\cite{suzgun2024harvard}, to bias the model to generate the language inherent to patent documents."
  • LAMIM: Layout-Aware Masked Image Modeling; a masked image modeling objective that only masks informative patches in document-like images. "Our formulation of the LAMIM objective is similar to BEiT~\cite{bao2022beit} and therefore requires a discrete image tokenizer." (A rough sketch in this spirit appears after this glossary.)
  • LayoutLMv3: A multimodal transformer for document image understanding that jointly models text, layout, and vision. "The proposed PatentMME shares its architecture with LayoutLMv3~\cite{huang2022layoutlmv3} and is a multi-modal transformer model that processes image, text, and document layout information jointly."
  • LLaMA-2 7B: A 7-billion-parameter LLM used as the base for domain adaptation. "PatentLLaMA is a domain-adapted version of the LLaMA-2 7B model for the patent domain."
  • LLaVA-1.5: A multimodal LLM framework that aligns visual features to LLaMA using a simple two-stage process. "In contrast, LLaVA-1.5~\cite{LLaVA-1.5} proposes a relatively simple and effective two-stage approach."
  • LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning technique that injects trainable low-rank matrices into existing layers. "We continue to pre-train the LLaMA-2 7B model using LoRA~\cite{hu2022lora} adapters" (A minimal LoRA layer sketch appears after this glossary.)
  • METEOR: A text evaluation metric that considers precision, recall, and alignment using stemming and synonyms. "we use standard image captioning metrics such as BLEU~\cite{bleu}, ROUGE~\cite{lin-2004-rouge} and METEOR~\cite{banerjee-lavie-2005-meteor}."
  • MiniGPT-4: A multimodal model that connects a vision encoder to an LLM with a learned projection for image-conditioned generation. "MiniGPT-4~\cite{Zhu2023MiniGPT4EV} builds upon pretrained BLIP-2 and finetunes an additional linear layer to project queries into the LLM on a curated dataset."
  • MLM: Masked Language Modeling; an objective where some text tokens are masked and predicted to learn language representations. "PatentMME is a specialized multi-modal vision encoder for patent figures, trained using masked language modeling loss, along with two other novel loss functions focused on learning structure from sparse patent figures."
  • OCR-VQGAN: A VQGAN variant tailored for images containing text, improving tokenization of textual regions. "We choose OCR-VQGAN~\cite{rodriguez2023ocr-vqgan} since its tokenized image representation is capable of handling textual information better than competing works dVAE~\cite{ramesh2021dall.e-dvae} and VQGAN~\cite{Esser_2021taming}."
  • OFA: An image captioning and multimodal generation framework pre-trained with various tasks in a unified architecture. "For image captioning baselines, we study the state-of-the-art models GIT~\cite{Wang2022GITAG}, BLIP~\cite{Li2022BLIPBL} and OFA~\cite{OFA}."
  • Patch Classification (PC): A multi-label classification objective to identify semantic elements (e.g., nodes, text, arrows) in image patches. "Patch Classification (PC). In this multi-label binary classification objective, we classify each of the $M$ image patches into one or more of the following five categories: node, node label, figure label, text, and arrows." (A minimal multi-label head sketch appears after this glossary.)
  • PEGASUS: A transformer model pre-trained for abstractive summarization using gap-sentence generation. "We benchmark the text-only baseline Pegasus~\cite{Zhang2019PEGASUSPW} by generating patent figure descriptions from OCR tokens extracted from patent figures."
  • QFormer: A query-based transformer in BLIP-2 that bridges a vision encoder and an LLM by producing aligned prompt embeddings. "BLIP-2~\cite{Li2023BLIP2BL} leverages pre-trained ViT~\cite{ViT} and LLaMA~\cite{LLaMA-meta}, combined with QFormer, to translate image embeddings into LLM prompt embeddings."
  • ROUGE: A recall-oriented metric for evaluating summaries and generated text via overlapping n-grams and sequences. "we use standard image captioning metrics such as BLEU~\cite{bleu}, ROUGE~\cite{lin-2004-rouge} and METEOR~\cite{banerjee-lavie-2005-meteor}."
  • Tesseract OCR: An open-source optical character recognition engine used to extract text from images. "the OCR text is extracted using off-the-shelf Tesseract OCR engine~\cite{tesseractOCR}."
  • Vision Transformer (ViT): A transformer-based image encoder that processes images as sequences of patches. "The Vision Transformer (ViT)~\cite{ViT}, commonly used as a vision encoder in existing image captioning frameworks, is typically pre-trained on natural scene images, which are fundamentally different from patent figures."
  • VQGAN: Vector-Quantized GAN; an image tokenizer and generator using discrete codebooks for high-fidelity reconstruction. "competing works dVAE~\cite{ramesh2021dall.e-dvae} and VQGAN~\cite{Esser_2021taming}."
  • Warm-up steps: Initial training steps with gradually increasing learning rate to stabilize optimization. "During Step-1, the weights of the multimodal transformer remain frozen and only the loss heads are trained for 1 epoch with a higher learning rate of 1e-3 and 1K warm-up steps to learn good initialization." (See the schedule sketch after this glossary.)
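
Illustrative Code Sketches

The sketches below are not taken from the paper; they illustrate concepts named in the glossary, and every name, dimension, and hyperparameter in them is an assumption chosen for illustration.

First, scoring a generated description against a reference with BLEU and ROUGE-L, using the nltk and rouge-score packages; the two sentences are invented.

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "figure 1 shows a block diagram of the proposed system"
candidate = "figure 1 is a block diagram of the system"

# BLEU-2 (unigram + bigram precision with brevity penalty).
bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.5, 0.5))
# ROUGE-L F1 based on the longest common subsequence.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-2: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```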
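
Next, a minimal warm-up plus cosine-decay learning-rate schedule. The numbers below (1K warm-up steps, 12K total steps, 2e-4 base rate, Adam) are borrowed loosely from the quoted settings and do not reproduce any single training stage of the paper.

```python
import math
import torch

model = torch.nn.Linear(4, 4)                                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
warmup_steps, total_steps = 1_000, 12_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                     # linear warm-up to the base LR
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for step in range(total_steps):
    optimizer.step()                                           # gradients omitted in this sketch
    scheduler.step()
    if step in (0, warmup_steps, total_steps - 1):
        print(step, scheduler.get_last_lr())
```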
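
A rough BEiT-style masked-image-modeling step in the spirit of LAMIM: only patches flagged as informative are eligible for masking, and the encoder predicts the discrete token id of each masked patch (e.g., from an OCR-VQGAN codebook). The mask ratio, codebook size, and "informative" flags are placeholders, not the paper's recipe.

```python
import torch
import torch.nn as nn

num_patches, hidden_dim, codebook_size, mask_ratio = 196, 768, 8192, 0.4

patch_feats = torch.randn(1, num_patches, hidden_dim)            # encoder outputs (placeholder)
token_ids = torch.randint(0, codebook_size, (1, num_patches))    # discrete tokenizer ids (placeholder)
informative = torch.rand(1, num_patches) > 0.5                   # placeholder for "non-blank" patches

# Sample the mask only from informative patches.
mask = informative & (torch.rand(1, num_patches) < mask_ratio)

head = nn.Linear(hidden_dim, codebook_size)                      # predicts the token id of each patch
logits = head(patch_feats)
loss = nn.functional.cross_entropy(logits[mask], token_ids[mask])
print(loss.item())
```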
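
A minimal LoRA-style linear layer: a frozen base weight is augmented with a trainable low-rank update scaled by alpha/r, following Hu et al. (2022). The rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # frozen pre-trained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))    # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # base(x) + (alpha / r) * x A^T B^T
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```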
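
Finally, a minimal multi-label patch-classification head of the kind the PC objective describes: each image patch receives independent sigmoid scores for the five categories (node, node label, figure label, text, arrows), trained with binary cross-entropy. Shapes and labels below are random placeholders.

```python
import torch
import torch.nn as nn

num_patches, hidden_dim, num_classes = 196, 768, 5
head = nn.Linear(hidden_dim, num_classes)                              # per-patch multi-label classifier

patch_feats = torch.randn(2, num_patches, hidden_dim)                  # encoder outputs (placeholder)
targets = torch.randint(0, 2, (2, num_patches, num_classes)).float()   # multi-hot labels (placeholder)

logits = head(patch_feats)
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```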