Composite Sketch+Text Queries for Retrieving Objects with Elusive Names
and Complex Interactions

AAAI 2024

Prajwal Gatti 1 Kshitij Gopal Parikh 1 Dhriti Prasanna Paul 1 Manish Gupta 2 Anand Mishra 1

1 IIT Jodhpur 2 Microsoft

Abstract

Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them clearly, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects along with text describing “difficult-to-sketch but easy-to-verbalize” object’s attributes or interaction with the scene. This novel problem statement is distinctly different from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ∼2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple train- ing objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, as well as composite query modalities

Keywords: sketch+text-based image retrieval, cross-modal retrieval, image retrieval, SBIR, CSTBIR

 

The CSTBIR Problem



Composite Sketch+Text Based Image Retrieval: A user wants to search “Numbat digging in the ground” but does not know the word “numbat”, and the interaction “digging in the ground” is not easy to sketch. Thus, the user may use a hand-drawn sketch of "numbat" along with the text "digging in the ground" to retrieve the desired images.

 

Sketches in CSTBIR




Examples of sketches.

 

Short Talk

 

Paper

Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
Prajwal Gatti, Kshitij Gopal Parikh, Drithi Prasanna Paul, Manish Gupta, Anand Mishra.
[paper]
[appendix]

 

 

 

Bibtex


    @InProceedings{cstbir2024aaai,
        author    = {Gatti, Prajwal and Parikh, Kshitij Gopal and Paul, Dhriti Prasanna and Gupta, Manish and Mishra, Anand},
        title     = {Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions},
        booktitle = {AAAI},
        year      = {2024},
    }       
			  

 

 

Team

card image

Prajwal
Gatti

card image

Kshitij
Parikh

card image

Dhriti Prasanna Paul

card image

Manish
Gupta

card image

Anand
Mishra

 

 

 

Acknowledgment

This work is supported by the Startup Research Grant from the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India (Grant No: SRG/2021/001948).

 

 

For questions, please contact Prajwal Gatti or raise an issue on GitHub.

Copyright © IIT Jodhpur