Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship ⟨subject, predicate, object⟩ and a test video, the goal is to localize the subject and object in the test video. With modern visio-lingual understanding capabilities, solving this problem may be relatively easy, subject to the availability of large-scale annotated training examples. However, annotating every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects connected via an unseen predicate with the help of only a few support-set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos, for the first time. To this end, we pose the problem as the minimization of an objective function defined over a T-partite random field, where T is the number of frames in the test video and the vertices of the random field, which represent candidate bounding boxes for the subject and object, correspond to the random variables. This objective function is composed of frame-level and visual relationship similarity potentials. These potentials are learned using a relation network that takes query-conditioned translational relationship embeddings as input and is meta-trained on support-set videos in an episodic manner. Further, the objective function is minimized using belief propagation-based message passing on the random field to obtain the spatio-temporal localization, i.e., the subject and object trajectories. We perform extensive experiments on two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.
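To make the formulation concrete, the objective described above can be written schematically as an energy over per-frame candidate assignments. The symbols below (x_t, phi, psi, the edge set E) are shorthand we introduce for illustration; the exact decomposition and weighting are assumptions, not taken verbatim from the paper.

% Schematic objective over the T-partite random field (notation assumed, not the authors' exact formulation).
% x_t selects a candidate (subject box, object box) pair in frame t of the test video;
% q denotes the query relationship, conditioned on the few support-set videos.
E(x_1, \dots, x_T \mid q) \;=\; \sum_{t=1}^{T} \phi\big(x_t \mid q\big) \;+\; \sum_{(t, t') \in \mathcal{E}} \psi\big(x_t, x_{t'} \mid q\big)

Under this reading, phi plays the role of the frame-level potential and psi of the visual relationship similarity potential, both scored by the meta-trained relation network; the assignment minimizing E yields the subject and object trajectories.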
Highlights
We propose a novel problem setup for the referring relationship task in videos, where, with the help of only a few videos, the model must learn to localize the subject and object corresponding to a query visual relationship that is unseen during training.
We propose a new formulation to solve this task based on the minimization of an objective function over a T-partite random field, where T is the number of frames in the test video and the vertices of the random field, which represent candidate bounding boxes for the subject and object, correspond to the random variables (a minimal message-passing sketch follows this list).
We present two aggregation techniques to enrich query-conditioned relational embeddings, namely global semantic and local localization aggregations.
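As a minimal sketch of the message passing referred to in the abstract and in the second highlight above, the snippet below runs max-product belief propagation on a chain-structured approximation of the T-partite field, selecting one candidate subject/object pair per frame. The function name, array layout, chain assumption, and the convention of maximizing scores (equivalent to minimizing their negation) are ours for illustration only; this is not the authors' implementation.

import numpy as np

def map_trajectory(unary, pairwise):
    """Decode the best per-frame candidate indices by max-product message passing.

    unary:    (T, K) array; unary[t, k] scores candidate pair k in frame t
              against the query (frame-level potential, higher is better).
    pairwise: (T-1, K, K) array; pairwise[t, i, j] scores linking candidate i
              in frame t to candidate j in frame t+1 (relationship-similarity potential).
    """
    T, K = unary.shape
    score = unary[0].astype(float).copy()   # best score ending at each candidate in frame 0
    back = np.zeros((T, K), dtype=int)      # backpointers for decoding
    for t in range(1, T):
        trans = score[:, None] + pairwise[t - 1]  # (K, K) message from frame t-1 to frame t
        back[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + unary[t]
    path = [int(score.argmax())]            # best candidate in the last frame
    for t in range(T - 1, 0, -1):           # backtrack to recover the full trajectory
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 3 frames, 4 candidate pairs per frame, random potentials.
rng = np.random.default_rng(0)
print(map_trajectory(rng.normal(size=(3, 4)), rng.normal(size=(2, 4, 4))))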
Qualitative Results
BibTeX
Please cite this work as follows:
@InProceedings{Kumar_2023_CVPR,
author = {Kumar, Yogesh and Mishra, Anand},
title = {Few-Shot Referring Relationships in Videos},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {2289-2298}
}
Acknowledgements
This work is partly supported by a gift grant from Accenture Labs (project number: S/ACT/AM/20220078). Y. Kumar is supported by a UGC fellowship.