Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation

Show Me the World in My Language: Establishing the First Baseline
for Scene-Text to Scene-Text Translation

Abstract

In this work, we study the task of ``visually'' translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to design a variant of the baseline to obtain performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.

Keywords: Visual Translation, Scene Text Synthesis, Evaluation Metrics, Cross-lingual Scene Text Editing.

The Visual Translation Problem

Suppose you are visiting Delhi, India and arrive at the Rithala (Hindi: रिठाला) metro station. If you are not familiar with Hindi, the signboard on the left might be incomprehensible. The result of our proposed baseline solution, shown on the right, seamlessly transliterates the station name रिठाला to English. In our work, we aim to visually translate (or transliterate, when necessary, as in this case) text from the source language to the target language while preserving visual attributes of the source scene text. Specifically, we focus on visual translation between Hindi and English, and vice versa.

Dataset

VT-Real: Real Scene Image Dataset for evaluating Visual Translation between Hindi and English

Previous

Next

VT-Real includes 269 images and 1021 words. Images are obtained from ICDAR 2013 and Bharat Scene Text Dataset. These images were translated between English and Hindi by three human annotators. We provide pointers to the images in the respective source dataset and our annotations here.
1. ICDAR 2013 dataset can be downloaded from here . Please download Training Set Images (Disk size: 142 MB).
2. The images used from Bharat Scene Text dataset (BSTD) can be downloaded from here . (Disk size: 60 MB)
3. Please download ground truth translation annotations by three annotators for ICDAR images (English-to-Hindi) and BSTD images (Hindi-to-English).
VT-Synth: Synthetic Training Data for Visual Translation between Hindi and English
VT-Syn is a synthetically generated corpus of ~600K visually diverse paired bilingual word images in pairs of English-Hindi. It can be used for training visual translation or cross-lingual scene text editing or scene text removal or scene text binarization.
1. Synthetic word images in Hindi Language [Download] [572,367 images, disk size: 17 GB]
2. Synthetic word images in English Language [Download] [572,367 images, disk size: 17 GB]
3. Background images used for generating the synthetic word images [Download] [572,367 images, disk size: 13 GB]
4. Synthetic Hindi words in gray background [Download] [572,367 images, disk size: 6.1 GB]
5. Synthetic English words in gray background [Download] [572,367 images, disk size: 5.8 GB]
6. Mask for synthetic Hindi word images [Download] [572,367 images, disk size: 3.4 GB]
7. Mask for synthetic English word images [Download] [572,367 images, disk size: 3.2 GB]
8. Skeleton images for synthetic Hindi word images [Download] [572,367 images, disk size: 407 MB]
9. Skeleton images for synthetic English word images [Download] [572,367 images, disk size: 374MB]

Note: same image filenames inside engSynthWords and hinSynthWords folders contain synthetic word images with perceptually same font and background style in English and Hindi, respectively.

Short Talk

Paper

Show Me the World in My Language: Establishing the First Baseline
for Scene-Text to Scene-Text Translation
Shreyas Vaidya*, Arvind Kumar Sharma*, Prajwal Gatti, Anand Mishra.
(* indicates equal contribution)
[paper]

Bibtex


    @InProceedings{vistransICPR2024,
        author    = {Vaidya, Shreyas and Sharma, Arvind Kumar and Gatti, Prajwal and Mishra, Anand},
        title     = {Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation},
        booktitle = {ICPR},
        year      = {2024},
    }

Team

card image