PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures

Shreya Shukla¹*, Nakul Sharma¹*, Manish Gupta², Anand Mishra¹

*Equal contribution

¹Indian Institute of Technology Jodhpur   ²Microsoft, India

AAAI 2025

[Paper] [Code] [Data]


Our work aims to generate brief and detailed descriptions for patent figures to aid the drafting of patent documents for novel inventions.


Abstract

Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial for effective knowledge sharing and for enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To address this gap, we introduce PatentDesc-355K, a novel large-scale dataset containing ~355K patent figures along with their brief and detailed textual descriptions extracted from 60K+ US patent documents. In addition, we propose PatentLMM, a novel multimodal large language model specifically tailored to generate high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts performance, yielding more coherent descriptions than fine-tuning similar-sized off-the-shelf multimodal models. PatentDesc-355K and PatentLMM pave the way for automated understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. We make the code and data publicly available.
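To make the two-component design concrete, the sketch below wires a PatentMME-style vision encoder to a PatentLLaMA-style decoder through a projection layer, in the spirit of LLaVA-style multimodal models. All class names, dimensions, and the choice of a simple linear projector are illustrative assumptions, not the released implementation.

# Minimal PyTorch sketch of the two-stage PatentLMM pipeline described
# above. Module names (PatentMME, PatentLLaMA stand-ins) and dimensions
# are hypothetical placeholders, not the released implementation.
import torch
import torch.nn as nn

class PatentLMMSketch(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096, vocab_size=32000):
        super().__init__()
        # Stand-in for PatentMME: a transformer encoder over figure patches.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=12,
                                       batch_first=True),
            num_layers=2,
        )
        # Projector mapping visual tokens into the LLM embedding space
        # (a linear map here; the actual projector is an assumption).
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for PatentLLaMA: just an embedding plus a decoder head.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, figure_patches, text_tokens):
        # figure_patches: (batch, num_patches, vision_dim)
        visual = self.projector(self.vision_encoder(figure_patches))
        textual = self.text_embed(text_tokens)
        # Prefix the projected visual tokens to the text sequence.
        fused = torch.cat([visual, textual], dim=1)
        return self.lm_head(fused)

model = PatentLMMSketch()
patches = torch.randn(1, 196, 768)          # one figure, 196 patches
tokens = torch.randint(0, 32000, (1, 16))   # a short text prompt
logits = model(patches, tokens)
print(logits.shape)                          # torch.Size([1, 212, 32000])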


Highlights

  • We introduce a large-scale dataset, PatentDesc-355K, with ~355K patent figures and their brief and detailed descriptions.
  • We propose a novel multimodal model, PatentLMM, comprising a patent-domain-specialized vision encoder trained with objectives tailored to capture the structure of patent documents, and an LLM fine-tuned on patent data.
  • We extensively benchmark existing captioning models and multimodal LLMs and show that our proposed approach surpasses their best performance by a large margin.

PatentDesc-355K Dataset

Please follow the instructions in this section of our GitHub repository for a guide to downloading, processing, and using the dataset.
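For orientation, here is a minimal sketch of iterating over figure-description pairs once the dataset has been downloaded. The directory name, file names, and JSON keys below are hypothetical placeholders; the GitHub instructions define the actual layout and schema.

# Minimal sketch for browsing PatentDesc-355K after download.
# Paths and JSON keys are hypothetical placeholders; see the
# GitHub repository for the actual layout and schema.
import json
from pathlib import Path

data_root = Path("PatentDesc-355K")              # hypothetical local path
with open(data_root / "annotations.json") as f:  # hypothetical file name
    records = json.load(f)

for rec in records[:3]:
    figure = data_root / "figures" / rec["image"]         # hypothetical key
    print(figure)
    print("brief:", rec["brief_description"][:80])        # hypothetical key
    print("detailed:", rec["detailed_description"][:80])  # hypothetical key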

Bibtex

Please cite our work as follows:

@inproceedings{shukla2025patentlmm,
  author    = {Shukla, Shreya and Sharma, Nakul and Gupta, Manish and Mishra, Anand},
  title     = {PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures},
  booktitle = {AAAI},
  year      = {2025}
}

Acknowledgements

This work was supported by the Microsoft Academic Partnership Grant (MAPG) 2023.