Data Resources Repository

Details/Description:

The Amazon dresses dataset used in Zoghbi et al. (2016) and Laenen et al. (2018). The dataset consists of 53,689 images of dresses together with their product descriptions as found in the webshop.

K. Laenen, S. Zoghbi, and M.-F. Moens. 2018. Web Search of Fashion Items with Multimodal Querying. In Proceedings of WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining (WSDM).

S. Zoghbi, G. Heyman, J. C. Gomez, and M.-F. Moens. 2016. Fashion Meets Computer Vision and NLP at e-Commerce Search. International Journal of Computer and Electrical Engineering (IJCEE).

Link:

https://liir.cs.kuleuven.be/software.html

Details/Description:

Version of the Flickr30K dataset with: (i) translations of the original English image descriptions into German, French, and Czech; (ii) independently created German descriptions for the same images; (iii) additional test sets from other Flickr groups and from MSCOCO.

Link:

https://github.com/multi30k/dataset

Details/Description:

This dataset contains facial images of 51 persons, captured under different rotations, illuminations, and expressions. Unusually, the images are RGB + D (depth) + T (thermal).

Link:

http://www.vap.aau.dk/rgb-d-t-based-face-recognition/

Details/Description:

The dataset features a total of 5,724 annotated frames divided into three indoor scenes. Activity in scenes 1 and 3 uses the full depth range of the Kinect for Xbox 360 sensor, whereas activity in scene 2 is constrained to a depth range of plus/minus 0.250 m in order to suppress the parallax between the two physical sensors. Scenes 1 and 2 are situated in a closed meeting room with little natural light to disturb the depth sensing, whereas scene 3 is situated in an area with wide windows and a substantial amount of sunlight. In each scene, a total of three persons are interacting, reading, walking, sitting, etc. Every person is annotated with a unique ID in the scene at pixel level in the RGB modality.
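
A minimal sketch of how such pixel-level person IDs might be consumed in Python, assuming each RGB frame ships with an index-mask image whose pixel values encode person IDs (0 = background); the file name and encoding here are assumptions, so consult the dataset's documentation:

import numpy as np
from PIL import Image

# Hypothetical path; the actual layout is described in the dataset's docs.
mask = np.array(Image.open("scene1/masks/frame_0001.png"))

# Assumed encoding: pixel value 0 is background, non-zero values are the
# unique person IDs annotated in the RGB modality.
person_ids = [int(i) for i in np.unique(mask) if i != 0]
binary_masks = {pid: (mask == pid) for pid in person_ids}
print(person_ids, {pid: int(m.sum()) for pid, m in binary_masks.items()})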

Link:

http://www.vap.aau.dk/vap-trimodal-people-segmentation-dataset/

Details/Description:

This resource was developed from the reports of 124 participants divided into three behavioural experiments with visuo-tactile stimulation, which were captured audio-visually from two camera views (frontal/profile). This methodology allowed the acquisition of approximately 95 hours of video, audio, and text data covering: object-feature-action data (e.g., perceptual features, namings, functions), Exploratory Acts (haptic manipulation for feature acquisition/verification), gestures and demonstrations for object/feature/action description, and reasoning patterns (e.g., justifications, analogies) for attributing a given characterization.

Link:

http://dx.doi.org/10.6084/m9.figshare.1457788

Details/Description:

TasvirEt: a multilingual dataset for automatic image description – an expansion of the Flickr8k dataset with crowd-sourced Turkish descriptions.

Link:

http://semihyagcioglu.com/projects/tasviret/

Details/Description:

The files in this dataset contain verb and sense annotations for 3,518 images taken from the MSCOCO and TUHOI datasets. Each image is annotated with one of 90 verbs, and with the OntoNotes sense realised for the given verb in the image.

Link:

https://github.com/spandanagella/verse

Details/Description:

Leeds Robotic Commands is a dataset of real-world RGB-D scenes of a robot manipulating different objects, together with natural language descriptions of these actions. The scenes were recorded using a Microsoft Kinect v2 sensor, and the descriptions were annotated by non-expert volunteers. The dataset includes 204 videos consisting of 17,373 frames in total and contains a total of 1,024 commands, an average of five per video. A total of 51 different objects are manipulated in the videos, such as basic block shapes, fruits, cutlery, and office supplies.

Link:

http://archive.researchdata.leeds.ac.uk/id/eprint/116

Details/Description:

The dataset contains short video clips depicting human activities in a kitchen, collected over 5 days from a mobile robotic platform. Each clip has been annotated with natural language descriptions of the activity and descriptions of the clothing of actors within the scene, obtained by crowd-sourcing.

Link:

http://archive.researchdata.leeds.ac.uk/id/eprint/235

Details/Description:

806 multimodal (image+text) questions from real users about ancient Egyptian artworks in a museum room, together with the 200 multimodal documents from which the answers are drawn. Each question is annotated with its ground-truth answer. For more details, please refer to: Sheng, S., Van Gool, L., and Moens, M.-F., 2016. A Dataset for Multimodal Question Answering in the Cultural Heritage Domain. In Proceedings of the COLING 2016 Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH) (pp. 10-17). ACL.

Link:

http://liir.cs.kuleuven.be/software.php

Details/Description:

The Al-Mus'haf corpus is a new Quranic corpus rich in morphosyntactical information. To build such a corpus of the Quran, we used a semi-automatic technique, which consists in applying the morphosyntactic analyser of Standard Arabic words "AlKhalil Morpho Sys" version 2, followed by a manual treatment in collaboration with experts in Arabic grammar. The corpus and the results we achieved can be used by researchers as baselines to test and evaluate their Arabic tools. In addition, this corpus can be used to train, optimize, and evaluate existing approaches. If you use this corpus, please cite the following paper: Imad Zeroual and Abdelhak Lakhouaja, "A new Quranic Corpus rich in morphosyntactical information", International Journal of Speech Technology (IJST), 2016, DOI 10.1007/s10772-016-9335-7.

Link:

http://oujda-nlp-team.net/en/programms/al-mushaf-corpus/

Details/Description:

12,073 news articles retrieved from several sites. The news articles were annotated with the following six topics from the IPTC news codes taxonomy: Nature_Environment, Politics, Science_Technology, Economy_Business_Finance, Health, and Lifestyle_Leisure. Note that each article is assigned to a single topic.

Link:

http://mklab2.iti.gr/multisensor/images/6/65/MULTISENSOR_NewsArticlesData_12073.zip

Details/Description:

Contains 150 web news articles that reference specific Wikipedia pages, so as to ensure reliable ground truth. The selected topics and the corresponding number of articles per topic are: Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (5), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1).

Link:

http://mklab2.iti.gr/multisensor/images/d/de/WikiRef150.zip

Details/Description:

Contains 220 news articles that reference specific Wikipedia pages. The selected topics of the WikiRef220 dataset (and the number of articles per topic) are: Paris Attacks November 2015 (36), Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2012-2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (39), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1). The topics Barack Obama, Cypriot Financial Crisis 2012-2013, Rolling Stones, Debt Crisis in Greece, Greek Elections June 2012, smartphone, Stephen Hawking, Tohoku earthquake and tsunami, NBA draft, U2, and Wall Street appear no more than 5 times and are therefore regarded as noise. The WikiRef186 dataset (4 topics) is WikiRef220 without 34 documents related to "Malaysia Airlines Flight 370", and the WikiRef150 dataset (3 topics) is WikiRef186 without the 36 documents related to "Paris Attacks". If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing.
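
The noise criterion above is easy to make concrete. A minimal Python sketch, assuming the documents can be loaded as (doc_id, topic) pairs; the load_wikiref220 helper is hypothetical:

from collections import Counter

docs = load_wikiref220()  # hypothetical loader returning [(doc_id, topic), ...]
counts = Counter(topic for _, topic in docs)

# Topics with no more than 5 documents are regarded as noise.
noise_topics = {t for t, n in counts.items() if n <= 5}
major = [(d, t) for d, t in docs if t not in noise_topics]
print(sorted(counts.items(), key=lambda kv: -kv[1]))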

Link:

http://mklab.iti.gr/files/WikiRef_dataset.zip

Details/Description:

Contains the URLs and the annotation for 2,382 web pages/articles retrieved from several sites. The web pages are annotated with the following six topics from the IPTC news codes taxonomy: Nature_Environment, Politics, Science_Technology, Economy_Business_Finance, Health, and Lifestyle_Leisure. Note that each article is assigned to a single topic. If you use this dataset in your research, please cite the following article: Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., & Kompatsiaris, I. (2014). News Articles Classification Using Random Forests and Weighted Multimodal Features. In Information Retrieval Facility Conference (pp. 63-75). Springer International Publishing.

Link:

http://mklab.iti.gr/files/ArticlesNewsSitesData_2382.7z

Details/Description:

Contains the URLs and the annotation for 1,043 web pages/articles retrieved from three well-known news sites (BBC, The Guardian, and Reuters). The web pages are annotated with the following four topics from the IPTC news codes taxonomy: Economy_Business_Finance, Lifestyle_Leisure, Science_Technology, and Sports. Note that each article is assigned to a single topic. If you use this dataset in your research, please cite the following article: Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., & Kompatsiaris, I. (2014). News Articles Classification Using Random Forests and Weighted Multimodal Features. In Information Retrieval Facility Conference (pp. 63-75). Springer International Publishing.

Link:

http://mklab.iti.gr/files/ArticlesNewsSitesData.7z

Details/Description:

CVC-ClinicHDSegment: the first public database with high-definition images for evaluating image segmentation methods. Contains 200 HD images with pixel-wise binary masks labelled exclusively by clinicians. It was used in the MICCAI 2017 Sub-challenge on Gastrointestinal Image Analysis (GIANA), polyp segmentation task.

Link:

https://endovissub2017-giana.grand-challenge.org/polypsegmentation/

Details/Description:

CVC-VideoClinicDB: the largest annotated dataset available with colonoscopy videos labelled exclusively by clinicians. Used in the MICCAI 2017 Sub-challenge on Gastrointestinal Image Analysis (GIANA), organized by Jorge Bernal and Aymeric Histace.

Link:

https://endovissub2017-giana.grand-challenge.org/polypdetection/

Details/Description:

CVC-EndoSceneStill: a complete still-frame dataset for polyp detection and localization. Includes binary annotations for polyps, specular highlights, and the luminal region.

Link:

http://www.cvc.uab.es/CVC-Colon/index.php/databases/cvc-endoscenestill/

Details/Description:

Verb Senses in Images (VerSe) dataset: 3,518 images, each annotated with one of 90 verbs, and with the OntoNotes sense realized for the given verb in the image. Images are taken from two existing multimodal datasets (COCO and TUHOI).
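
A minimal sketch of reading such annotations in Python, assuming they are distributed as a CSV with image_id, verb, and sense columns; the file and column names are hypothetical, so check the repository for the actual format:

import csv
from collections import Counter

# Hypothetical file/column names; consult the VerSe repository for the
# real annotation format.
with open("verse_annotations.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Each image carries one of 90 verbs plus the OntoNotes sense realized
# by that verb in the image.
annotations = {r["image_id"]: (r["verb"], r["sense"]) for r in rows}
print(Counter(verb for verb, _ in annotations.values()).most_common(5))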

Link:

https://github.com/spandanagella/verse

Details/Description:

Expansion of the Flickr30K dataset with: (i) translations of the original English image descriptions into German; (ii) independently created German descriptions for the same images.

Link:

http://www.statmt.org/wmt16/multimodal-task.html

Details/Description:

Turkish descriptions for the Flickr8k dataset

Link:

http://tasviret.cs.hacettepe.edu.tr

Details/Description:

Datasets associated with the ChaLearn challenges. These are all challenges around the notion of "Looking at people".

Link:

http://gesture.chalearn.org

Details/Description:

The Multi30K dataset extends the Flickr30K dataset with (i) 31K German translations created by professional translators for a subset of the English descriptions, and (ii) 155K German descriptions crowdsourced independently of the original English descriptions. Paper: https://arxiv.org/abs/1605.00459
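
A minimal sketch of pairing the translations with their English sources in Python, assuming line-aligned plain-text files named train.en and train.de; the file names are assumptions (the distributed files may be compressed), so check the repository:

# Line i of train.de is assumed to be the German translation of line i
# of train.en.
with open("train.en", encoding="utf-8") as f_en, \
     open("train.de", encoding="utf-8") as f_de:
    pairs = [(en.strip(), de.strip()) for en, de in zip(f_en, f_de)]

print(len(pairs), pairs[0])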

Link:

http://www.statmt.org/wmt16/multimodal-task.html

Details/Description:

AllenAI's Charades dataset is composed of 9,848 videos of daily indoor activities collected through Amazon Mechanical Turk. 267 different users were presented with a sentence that includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence (as in a game of Charades).

Link:

http://allenai.org/plato/charades/

Details/Description:

Google's YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4,800 visual entities. It also comes with precomputed state-of-the-art vision features extracted from billions of frames, which fit on a single hard disk. This makes it possible to train video models on hundreds of thousands of hours of video in less than a day on a single GPU!
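
A minimal sketch of reading one video-level record in Python with TensorFlow; the feature names ('id', 'labels', 'mean_rgb', 'mean_audio') and the file name follow the commonly documented video-level format but may differ between releases, so verify against the official starter code:

import tensorflow as tf

# Assumed video-level feature schema; check the dataset docs for your release.
feature_spec = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}

dataset = tf.data.TFRecordDataset("train0000.tfrecord")  # hypothetical file name
for record in dataset.take(1):
    example = tf.io.parse_single_example(record, feature_spec)
    print(example["id"].numpy(), tf.sparse.to_dense(example["labels"]).numpy())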

Link:

https://research.google.com/youtube8m/

Details/Description:

State-of-the-art survey on activity recognition, including the generation of semantic descriptions of videos.

Link:

http://www-sop.inria.fr/members/Francois.Bremond/Postscript/iVL_ActivityRecognition_Survey.pdf

Details/Description:

Illinois image description data (Hockenmaier et al.)

Link:

http://nlp.cs.illinois.edu/HockenmaierGroup/data.html

Details/Description:

Generalized 1M image-caption corpus (Kuznetsova et al.)

Link:

http://www3.cs.stonybrook.edu/~pkuznetsova/imgcaption/