VisToT

Abstract

Table-to-text generation has been widely studied in the Natural Language Processing community in the recent years. Given the significant increase in multimodal data, it is critical to incorporate signals from both images as well as tables to generate relevant text. While tables contain a structured list of facts, images are a rich source of unstructured visual information. For example, in tourism domain, images can be used to infer knowledge such as the type of landmark (e.g., Church), its architecture (e.g., Ancient Roman), and composition (e.g., White marble). However, previous work has largely focused on table-to-text without images. In this paper, we introduce the novel task of Vision-augmented Table-To-Text Generation (VisToT), defined as follows: given a table and an associated image, produce a descriptive sentence conditioned on the multimodal input. For the task, we present a novel multimodal table-to-text dataset, WikiLandmarks, covering 73,084 unique world landmarks. Further, we also present a novel competitive multimodal architecture, namely, VT3 that generates accurate sentences conditioned on the image and table pairs. Through extensive analyses and experiments, we show that visual cues from images are helpful in (i) inferring missing information from incomplete tables, and (ii) strengthening the importance of useful information from noisy tables for natural language generation.

Highlights

Introduced VisToT – a novel task of Data-to-Text generation conditioned on multimodal data of tables and images.
Introduced a new dataset 'WikiLandmarks' containing 73,084 unique world landmarks for studying the VisToT task.
Proposed VT3: Visual-Tabular Data-to-Text Transformer that generates accurate text descriptions based on the information provided in image and table pairs.

VisToT: Vision-Augmented Table-to-Text Generation

Prajwal Gatti¹, Anand Mishra¹, Manish Gupta², Mithun Das Gupta²

¹Indian Institute of Technology Jodhpur ²Microsoft, India

EMNLP 2022

[Paper] [Code] [Poster] [Short Talk] [Slides]

Abstract

Highlights

WikiLandmarks Dataset

Bibtex

Acknowledgements

VisToT: Vision-Augmented Table-to-Text Generation

Prajwal Gatti1, Anand Mishra1, Manish Gupta2, Mithun Das Gupta2

1Indian Institute of Technology Jodhpur 2Microsoft, India

EMNLP 2022

[Paper] [Code] [Poster] [Short Talk] [Slides]

Abstract

Highlights

WikiLandmarks Dataset

Bibtex

Acknowledgements

Prajwal Gatti¹, Anand Mishra¹, Manish Gupta², Mithun Das Gupta²

¹Indian Institute of Technology Jodhpur ²Microsoft, India