STSM Reports

iV&L Net STSMs and reports

Title: Describing ear images using natural language

Beneficiary: Žiga Emeršič, University of Ljubljana, Slovenia

Host: University of Las Palmas de Gran Canaria, Spain

Period: 05/02/2018-16/02/2018

Abstract: The purpose of the STSM was to develop a system that not only removes or replaces ear accessories, but also describes the decisions it made in natural language, through the following steps: (i) pixel-wise detection and segmentation of ear accessories using state-of-the-art approaches; (ii) replacement of ear accessories with appropriate content that looks like an ear; (iii) a natural-language description of what was done, i.e. whether and how the images were modified. This is intended not only to increase the overall robustness of ear recognition, but also to make the procedure more transparent to the experts (e.g. forensic experts) using the system. Our goal was to prepare a showcase of a system that performs these steps. We investigated possible solutions and implemented initial versions. There is still much to do and room for improvement, but we present some preliminary, qualitative results to show the feasibility of our approach. The segmentation and infilling steps are not yet connected, because both parts still require additional training. Nevertheless, the preliminary results show that the whole pipeline is feasible and close to a final implementation.
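As a rough illustration of how the three steps could fit together, the following toy sketch takes an already computed accessory mask, fills the masked pixels, and reports what was changed. It is not the system described in the report: the function names are hypothetical, the mask is assumed to be given rather than predicted by a segmentation network, and mean-value filling stands in for learned infilling.

```python
def infill(image, mask):
    """Replace masked pixels with the mean of the unmasked pixels.

    A crude placeholder for learned infilling (assumption, not the report's method).
    """
    kept = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if not m]
    fill = sum(kept) / len(kept) if kept else 0.0
    return [[fill if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def describe(mask):
    """Produce a plain-language note on whether and how the image was modified."""
    changed = sum(1 for row in mask for m in row if m)
    total = sum(len(row) for row in mask)
    if changed == 0:
        return "No ear accessory was detected; the image was left unchanged."
    return f"An ear accessory covering {changed} of {total} pixels was detected and replaced."


def process(image, mask):
    """Run the infilling and description steps on one grayscale image."""
    return infill(image, mask), describe(mask)


# Hypothetical 2x2 grayscale image with one pixel marked as an accessory.
image = [[1.0, 2.0], [3.0, 9.0]]
mask = [[0, 0], [0, 1]]
cleaned, note = process(image, mask)
```

Returning the description alongside the modified image keeps the explanation tied to the exact change that was made, which is the transparency property the abstract emphasises.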

Download Report


Title: Producing facial images from text descriptions

Beneficiary: Blaž Meden, University of Ljubljana, Slovenia

Host: University of Las Palmas de Gran Canaria, Spain

Period: 05/02/2018-16/02/2018

Abstract: The purpose of the STSM was to bridge the gap between natural language processing and computer vision technologies. We decided to explore the possibility of producing photo-realistic facial images with recently introduced deep generative neural networks, using natural language descriptions as input. Generative neural networks and generative adversarial networks are promising methods for generating new content and for learning the essential latent statistics of large image datasets. To connect vision and language, we set out to investigate whether such generative models can learn to generate new content from natural language inputs. To accomplish this task we combined technologies from the field of natural language processing (NLP) with deep convolutional neural networks (CNNs), especially recent generative neural networks (GNNs).

Download Report


Title: Quantity Expressions in Language & Vision

Beneficiary: Sandro Pezzelle, Università di Trento, Italy

Host: University of Amsterdam, The Netherlands

Period: 27/09/2017-31/10/2017

Abstract: The purpose of this STSM was to combine the background of the applicant with the hosts' expertise in Logic, Formal Linguistics, and Cognitive Semantics to explore research issues involving theoretical aspects of quantifiers (some, most, all) from a computational perspective. The motivation was that quantifiers, though widely used in everyday communication by speakers of almost every language, have been only partially investigated in Computational Linguistics, Computer Vision, and work combining Language & Vision. Proper extraction, evaluation, and learning of quantity information from texts and images, however, is as necessary as it is challenging, as shown, for instance, by the poor performance of state-of-the-art Visual Question Answering (VQA) models on so-called count questions (e.g. "How many children are wearing hats?"). The shared starting point between the applicant and the host was that any computational model aimed at reproducing speakers' behavior should take into account 'fuzzy' quantification (i.e. quantifiers) besides exact numbers.

Download Report


Title: Multi-modal Multitask Learning using Attention-Based Neural Networks

Beneficiary: Iacer Calixto, Dublin City University, Ireland

Host: University of Sheffield, Sheffield, UK

Period: 01/07/2017-16/07/2017

Abstract: Machine Translation has recently been addressed from a multimodal perspective, where visual features are incorporated into a Neural Machine Translation (NMT) model that is trained end-to-end. One of the ways researchers have incorporated visual features into NMT is through multitask learning, where a network is trained to translate sentences from a source language into a target language (MT) and also to rank sentences given images and vice versa (image-sentence ranking). Training a model on two tasks can not only help avoid overfitting, but also address the lack of grounding in an MT model (i.e. by incorporating visual features). The aim of the STSM was twofold: (i) to present some of my previous work to the host research group; (ii) to work collaboratively on a multitask approach integrating natural language processing (NLP) models using multimodal corpora.
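The two-task training objective described above can be sketched as a weighted sum of a translation loss and an image-sentence ranking loss. This is only a minimal illustration under stated assumptions: the abstract does not specify the losses used, so the max-margin (hinge) ranking loss, the `margin` and `ranking_weight` parameters, and all function names here are hypothetical choices.

```python
import math


def cross_entropy(token_probs):
    """Translation loss: mean negative log-likelihood of the reference target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def ranking_hinge(pos_score, neg_scores, margin=0.1):
    """Ranking loss: the matching image-sentence pair should outscore
    each mismatched pair by at least `margin` (assumed max-margin form)."""
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)


def multitask_loss(token_probs, pos_score, neg_scores, ranking_weight=1.0):
    """Weighted sum of the two task losses (the weighting scheme is an assumption)."""
    return cross_entropy(token_probs) + ranking_weight * ranking_hinge(pos_score, neg_scores)


# Hypothetical values: model probabilities of three reference tokens,
# one matching image-sentence similarity and two mismatched ones.
loss = multitask_loss([0.9, 0.8, 0.95], pos_score=0.9, neg_scores=[0.2, 0.85])
```

Because both terms are computed from a shared network, gradients from the ranking term push the sentence representation towards the visual features, which is the grounding effect the abstract refers to.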

Download Report


Title: Learning spatial templates of actions and implicit spatial language

Beneficiary: Guillem Collell, KU Leuven, Belgium

Host: ETH Zurich, Switzerland

Period: 27/06/2017-20/09/2017

Abstract: The research carried out during the stay at the Computer Vision Laboratory (CVL) at ETH focuses on building automated methods for acquiring common sense spatial knowledge. Endowing machines with common sense knowledge is one of the most important long-term goals of artificial intelligence research. Lack of common sense has repeatedly been argued to be one of the main reasons preventing machines from exhibiting more human-like behavior when solving tasks. In particular, the research conducted during the STSM is aimed at advancing our knowledge of, and existing methods for, understanding spatial language. To this end, we design neural-network models that learn to predict common sense spatial knowledge that is left implicit in language. The models learn from multimodal, image-text paired data with annotations, and we provide both a quantitative evaluation of their performance and a qualitative visualization of their predictions. The motivation for such interdisciplinary collaboration is to tackle the above challenges by combining the visiting researcher's prior knowledge in natural language processing and representation learning with the host laboratory's extensive expertise in computer vision.

Download Report


Title: Working on a new research project on multi-modal machine translation

Beneficiary: Nenad Zivic, University of Nis, Serbia

Host: University of Sheffield, United Kingdom

Period: 08/04/2017-15/04/2017

Abstract: This STSM aimed to establish a new collaboration between iV&L Net members. The main research question of the collaboration that started with this visit is “Can visual input help human translators?”. This question is fundamental, as it can indicate whether visual input can be expected to improve the results of machine translation systems in the future. A further purpose of the visit was to plan future collaboration on research projects connecting multi-modal machine translation with multi-sense word embedding systems.

Download Report


Title: Logical words in Vision and Language

Beneficiary: Raffaella Bernardi, University of Trento, Italy

Host: University of Amsterdam, The Netherlands

Period: 21/03/2017-28/03/2017

Abstract: We had planned to make progress towards the following research objectives: (i) To discuss current results on visually-grounded reasoning skills in Vision and Language Models; (ii) To select reasoning skills linked to certain logical words that are visually-grounded and have been studied in detail in the fields of Computational Semantics and Dialogue; (iii) To design empirical experiments with human participants to investigate literal and pragmatic interpretations of the selected visually-grounded logical words; and (iv) To design experiments to evaluate current state-of-the-art computational systems against human literal vs. pragmatic interpretations of logical words.

Download Report


Title: Large Scale Knowledge Bases of integrated Vision and Language Representations

Beneficiary: Lorenzo Gregori, University of Florence, Italy

Host: Cognitive Systems Research Institute, Athens, Greece

Period: 01/02/2017-30/04/2017

Abstract: The purpose of the mission was the analysis of action concepts by comparing two resources based on different theoretical frameworks: PRAXICON, a multisensory and multimodal semantic memory developed at CSRI in the framework of European research and development projects, and IMAGACT, a multilingual ontology of action developed within two Italian research projects (IMAGACT and MODELACT), which contains a fine-grained categorization of concepts and the visual representation of actions. Both resources focus on action representation, but from different points of view: IMAGACT action concepts are linguistically discriminated (according to the semantics of verbs in different languages) and are visually represented by video scenes, while PRAXICON concepts are based on the action motorics and are represented by visual and linguistic entities (images, videos, words, phrases).

Download Report


Title: A Deep Hybrid CNN Architecture for Activity Recognition on Daily Living Activity Videos

Beneficiary: Farhood Negin

Host: Computer Vision Center and Universitat Autònoma de Barcelona, Spain

Period: 01/02/2017-01/03/2017

Abstract: This STSM’s goal was to explore the capabilities of deep learning architectures for solving the gesture recognition problem in a medical framework using vision systems. During this visit we designed and evaluated a deep learning architecture on a dataset recorded from cognitively impaired patients at the Institute Claude Pompidou (ICP) in Nice, France. We obtained promising results in the experiments and showed the high potential of LSTMs (in modeling the time dependencies of videos) when they are coupled with Convolutional Neural Networks, compared to conventional approaches.

Download Report


Title: Natural language quantifier learning for multi-modal deep neural nets

Beneficiary: Alexander Kuhnle, University of Cambridge

Host: University of Trento, Italy

Period: 13/02/2017-24/02/2017

Abstract: Our plan for my visit to the University of Trento was to start off the project, analyze the available data, discuss the details of the task and experimental setup, and set out further work aiming for a joint publication. My own PhD research focuses on evaluating multimodal deep learning systems with respect to linguistic understanding abilities, one being the ability to understand statements involving quantifiers. Underlying our joint project is the idea that a representation for a concept, or instances of it, should contain information about its typical attributes and their frequency, to some degree at least.

Download Report


Title: Visual Question Answering

Beneficiary: Ted Zhang, KU Leuven

Host: Dr. Dengxin Dai (ETHZ), Dr. Luc Van Gool (ETHZ)

Period: 23/01/2017-19/02/2017

Abstract: The work performed centers on visual question answering, which is the task of answering questions about images. To establish some reference points, I first implemented two baseline models. I then proposed to improve on these by building an attention architecture that integrates multiple models and pays more attention to the model that is likely to output the correct answer for the given type of question.
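One simple way to picture attention over several models is a softmax gate that weights each model's answer distribution according to the question type. The sketch below is an assumption-laden illustration, not the architecture from the report: the gate scores are supplied by hand rather than learned, and the model names, answer vocabulary size, and function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_models(model_answer_probs, gate_scores):
    """Attention-weighted mixture of per-model answer distributions.

    model_answer_probs: one probability list over candidate answers per model.
    gate_scores: one raw score per model, assumed to depend on the question type.
    """
    weights = softmax(gate_scores)
    n_answers = len(model_answer_probs[0])
    return [sum(w * probs[a] for w, probs in zip(weights, model_answer_probs))
            for a in range(n_answers)]


# Hypothetical example: two models, three candidate answers, and gate scores
# that favour the first model for a counting question.
counting_model = [0.7, 0.2, 0.1]
yes_no_model = [0.1, 0.1, 0.8]
combined = combine_models([counting_model, yes_no_model], gate_scores=[2.0, 0.0])
best_answer = max(range(len(combined)), key=combined.__getitem__)
```

In a trained system the gate scores would come from a small network that reads the question, so that the mixture shifts towards whichever model tends to be right for that question type.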

Download Report


Title: Multimedia document classification using multi-modal features

Beneficiary: Yaakov HaCohen-Kerner, Jerusalem College of Technology, Jerusalem, Israel

Host: Information Technologies Institute, Centre for Research and Technology, Thessaloniki, Greece

Period: 15/02/2015 to 19/02/2015

Download Report


Title: Semantic space models of fast mapping in cross-modal concept learning

Beneficiary: Raquel Fernandez Rovira, Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands

Host: University of Trento, Italy

Period: 15/02/2015 to 15/03/2015

Download Report


Title: Auto-tagging and categorisation of life-logging photos

Beneficiary: Stavri Nikolov, Imagga Technologies Ltd, Bulgaria

Host: University of Barcelona, Spain

Period: 15/06/2015 to 28/06/2015

Download Report


Title: Synchronizing Visual and Natural Language Grammars for Pattern Analysis and Description

Beneficiary: Adrian Muscat, University of Malta, Malta

Host: University of Brighton, UK

Period: 30/04/2015 to 09/05/2015

Download Report