Deep convolutional networks for image recognition and LSTM-based text sequence generation models have done wonders in establishing a new state of the art in two domains that were previously considered uphill battles: object recognition over a large and open category set, and the generation of captions describing an image. Even from this new vantage point, we see lively participation in the flagship Vision and Language workshop at this year’s meeting of the ACL (Association for Computational Linguistics), a shared task at the Conference on Machine Translation, and several papers in the main sessions of the large computational linguistics conferences. The field is hence alive and thriving, both pushing the envelope on tasks that would have been unthinkable just a few years ago and working towards a deeper understanding of the models that have been developed.
One twist on the task of caption generation is the addition of captions in another language, as collected for the Flickr30k image dataset by adding German captions — both translations and independently written captions — in a large-scale effort (Desmond Elliott, Stella Frank, Khalil Sima’an and Lucia Specia: Multi30K: Multilingual English-German Image Descriptions; V&L Workshop 2016). On this basis, the First Conference on Machine Translation (WMT16) offered two shared tasks: firstly, multimodal machine translation, which uses images as additional context in translating a given caption; and secondly, crosslingual image description generation, where image descriptions in another language provide additional context for describing a given image.
In a similar task, one of the ‘outstanding papers’ at ACL 2016 (Julian Hitschler, Shigehiko Schamoni and Stefan Riezler: Multimodal Pivots for Image Caption Translation) uses a small development and test set created by adding German (query) captions to the MS-COCO dataset. The authors use this data to validate an approach that relies primarily on image and text retrieval models in a crosslingual retrieval setting instead of on large amounts of translated captions: starting from a query image with a German source caption, they use a translation of the source caption together with the query image to retrieve a set of images with English captions, and in a second step use these retrieved captions to rerank a list of target hypotheses produced by the machine translation model. In their experiments, they find that a text-only translation model trained only on out-of-domain text struggles with the caption domain, and that their retrieval-based model even (significantly) outperforms a translation model that adds the Multi30K translations as in-domain training data to the text-only system.
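The reranking step can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the toy word-overlap measure and the interpolation weight below are stand-in assumptions, not the authors’ actual scoring, which operates over visual and textual retrieval scores.

```python
# Toy sketch of pivot-based reranking: candidate translations are rescored
# by combining the MT model score with their lexical overlap against the
# captions of retrieved, visually similar images. All names and the scoring
# scheme are illustrative assumptions, not the paper's exact method.

def overlap_score(hypothesis, retrieved_captions):
    """Fraction of hypothesis tokens that occur in any retrieved caption."""
    hyp_tokens = hypothesis.lower().split()
    pivot_vocab = {tok for cap in retrieved_captions for tok in cap.lower().split()}
    if not hyp_tokens:
        return 0.0
    return sum(tok in pivot_vocab for tok in hyp_tokens) / len(hyp_tokens)

def rerank(hypotheses, retrieved_captions, weight=0.5):
    """Sort (translation, model_score) pairs by an interpolation of the
    model score and the pivot overlap score (both assumed to lie in [0, 1])."""
    scored = [
        (weight * score + (1 - weight) * overlap_score(text, retrieved_captions), text)
        for text, score in hypotheses
    ]
    return [text for _, text in sorted(scored, reverse=True)]

# The hypothesis with the lower model score wins because it better matches
# the captions of the retrieved pivot images.
hypotheses = [("a dog runs on the lawn", 0.6), ("a dog runs over the grass", 0.5)]
pivots = ["a brown dog running over the grass", "dog playing in the grass"]
print(rerank(hypotheses, pivots)[0])  # → a dog runs over the grass
```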
As a more focused and challenging task than caption generation, Visual Question Answering (VQA; Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, Lawrence Zitnick and Devi Parikh, ICCV 2015) poses questions about an image that necessitate some reasoning on top of the mere object recognition needed for producing a caption. Simple questions may concern, for instance, the number of cats on a bed or the object covering the windows. In this challenging task, most models achieve 60-70% accuracy in a multiple-choice test, versus 91% for humans faced with the same multiple-choice task. A recent EMNLP 2016 short paper (Aishwarya Agrawal, Dhruv Batra and Devi Parikh: Analyzing the Behavior of Visual Question Answering Models) looks at three general questions. Firstly, can models extrapolate to test questions, or answers, dissimilar to those in the training set? The authors find that 67% of mistakes can be successfully predicted from the distance of a test question-image pair to those in the training set, and that 74% of mistakes can be predicted from the distance of the test answer to those in the training set. Secondly and thirdly, they check whether state-of-the-art models consider the whole question and the whole image: models reach 68% of their final accuracy when making predictions based on only half of the original question, anticipating “typical” questions, and models given a different image still produce the same answer for 56% of questions (since various types of questions, such as “How many zebras” or “What covers the ground”, are only asked in typical, interesting cases).
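The distance-based prediction of mistakes can be sketched with a minimal nearest-neighbour rule; Jaccard distance over question tokens and the threshold value are stand-in assumptions for whatever representation and calibration the paper actually uses.

```python
# Illustrative sketch (not the paper's implementation) of predicting VQA
# failures from a test item's distance to the training set: items far from
# every training question are flagged as likely mistakes.

def jaccard_distance(a, b):
    """1 - Jaccard similarity of the two questions' token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def likely_mistake(test_question, train_questions, threshold=0.6):
    """Flag the test question as a likely model failure if its nearest
    training question is farther away than the threshold."""
    nearest = min(jaccard_distance(test_question, q) for q in train_questions)
    return nearest > threshold

train = ["how many cats are on the bed", "what covers the windows"]
print(likely_mistake("how many cats are on the sofa", train))         # → False
print(likely_mistake("why is the referee blowing a whistle", train))  # → True
```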
In a slightly different approach to analyzing visual question answering, the authors of another EMNLP short paper (Abhishek Das, Harsh Agrawal, Lawrence Zitnick, Devi Parikh and Dhruv Batra: Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?) create a dataset in which humans mark the informative region of an image with respect to a question by unblurring just enough of it for others to see the answer. Analyzing the image regions used by attention-based VQA models, they find that the models’ attention distributions correlate with these human-marked regions much more weakly than the regions marked by other humans for the same question do, and also more weakly than the predictions of a supervised model trained to predict where human eye fixations would occur.
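The comparison underlying this analysis is essentially a rank correlation between two importance maps over image regions. A minimal, self-contained sketch of such a comparison follows; the toy four-region maps are invented, and ties are broken by position for simplicity, whereas real attention maps are dense and tie handling matters more.

```python
# Sketch of comparing a model's attention map with a human-marked importance
# map: flatten both to vectors over image regions and compute Spearman's rho
# via rank transformation. Ties are broken by position (a simplification).

def ranks(values):
    """Rank of each value, smallest first; ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman rank correlation of two equal-length vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n - 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

model_attention = [0.1, 0.7, 0.15, 0.05]  # model mostly attends to region 1
human_map       = [0.2, 0.6, 0.15, 0.05]  # humans unblurred mostly region 1
print(spearman(model_attention, human_map))  # → 0.8
```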
The realm of single concepts, though, still receives considerable attention in current work. In a recent EMNLP paper (Douwe Kiela, Anita Vero and Stephen Clark: Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics), the authors consider the task of cross-modal or multimodal concept similarity using the MEN and SimLex-999 datasets, and vary both the source of images (existing datasets such as ImageNet and the ESP game, images from Flickr, or image search results from Bing or Google) and the model used for extracting the visual features (AlexNet, GoogLeNet and VGGNet, prominent entries in the 2012 and 2014 ImageNet challenges). Their results show the better coverage of Google and Bing search to be important, while the differences between the vision models are comparatively small.
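A common setup for such evaluations, and a reasonable reading of this one, represents each concept by the average of the CNN feature vectors of its images and scores concept pairs by cosine similarity; the three-dimensional toy vectors below merely stand in for real AlexNet/GoogLeNet/VGGNet features.

```python
# Hedged sketch of visual concept similarity: average the per-image feature
# vectors of each concept, then compare concepts by cosine similarity.
# The vectors are invented toy data, not real CNN features.
import math

def average_vector(vectors):
    """Componentwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

cat_images = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]  # toy features of two cat images
dog_images = [[0.7, 0.3, 0.0], [0.9, 0.1, 0.2]]  # toy features of two dog images
print(round(cosine(average_vector(cat_images), average_vector(dog_images)), 3))
```

Rankings produced by such similarity scores are then typically correlated against human judgements from datasets like MEN or SimLex-999.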
Mukherjee and Hospedales (Gaussian Visual-Linguistic Embedding for Zero-Shot Recognition, EMNLP 2016) show that Gaussian embeddings for images and words, connected by a cross-modal distribution mapping and compared with a probability product kernel as the similarity measure, yield significant improvements in retrieving images for unseen categories (“zero-shot” learning) on ImageNet1k and the Animals with Attributes dataset.
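For Gaussian embeddings, the probability product kernel has a convenient closed form; in the expected-likelihood case, the similarity of two Gaussians is itself a Gaussian density evaluated at one mean, centred on the other (a standard identity, not a detail taken from the paper):

```latex
K(p, q) = \int \mathcal{N}(x;\, \mu_p, \Sigma_p)\, \mathcal{N}(x;\, \mu_q, \Sigma_q)\, dx
        = \mathcal{N}(\mu_p;\, \mu_q,\, \Sigma_p + \Sigma_q)
```

Two concepts thus count as similar when their means are close relative to their combined uncertainty, which lets the embedding express not only a concept’s location but also its breadth.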
In a recent paper from ACL (Hao Zhang, Zhiting Hu, Yuntian Deng, Mrinmaya Sachan, Zhicheng Yan and Eric Xing: Learning Concept Taxonomies from Multi-Modal Data), the authors show that a Bayesian model incorporating visual information into sibling and parent prediction improves taxonomy learning (evaluated on subtrees of ImageNet’s hierarchy) over the state-of-the-art text-only features of Bansal et al. (ACL 2014: Structured Learning for Taxonomy Induction with Belief Propagation).