Image search using multilingual texts: a cross-modal learning approach between image and text
Maxime Portaz, Qwant Research

Maxime Portaz, Hicham Randrianarivo, A. Nivaggioli, Estelle Maudet, Christophe Servan, Sylvain Peyronnet

Published 2019 in arXiv.org

ABSTRACT

Multilingual (or cross-lingual) embeddings represent several languages in a single vector space. Using a common embedding space gives words from different languages shared semantics. In this paper, we propose to embed images and texts into a single distributional vector space, which makes it possible to search images with text queries that express information needs related to the (visual) content of images, as well as by image similarity. Our framework forces the representation of an image to be similar to the representation of the text that describes it. Moreover, by using multilingual embeddings we ensure that words from two different languages have close descriptors and are thus attached to similar images. We provide experimental evidence of the efficiency of our approach by evaluating it on two datasets: Common Objects in COntext (COCO) [19] and Multi30K [7].
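
The idea of a shared image-text space can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state which loss or encoders are used, so the hinge-based triplet ranking loss, the margin of 0.2, the function names (`rank_images`, `triplet_hinge_loss`) and the toy two-dimensional vectors are all assumptions made for illustration. The sketch shows (1) how a text query embedded in the shared space could rank images by cosine similarity, and (2) one common way to push matching image and text representations together during training.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def rank_images(query_vec: np.ndarray, image_vecs: np.ndarray) -> np.ndarray:
    """Return image indices sorted from most to least similar to the query."""
    query = l2_normalize(query_vec[None, :])[0]
    sims = l2_normalize(image_vecs) @ query
    return np.argsort(-sims)


def triplet_hinge_loss(text_vecs: np.ndarray, image_vecs: np.ndarray,
                       margin: float = 0.2) -> float:
    """Hinge ranking loss over a batch of matched (text, image) pairs.

    Assumption: row i of text_vecs describes row i of image_vecs, and every
    other row in the batch is treated as a negative example.
    """
    sims = l2_normalize(text_vecs) @ l2_normalize(image_vecs).T
    pos = np.diag(sims)  # similarity of each matched pair
    # Penalty when a wrong image scores within `margin` of the right one.
    cost_img = np.maximum(0.0, margin + sims - pos[:, None])
    # Penalty when a wrong text scores within `margin` of the right one.
    cost_txt = np.maximum(0.0, margin + sims - pos[None, :])
    np.fill_diagonal(cost_img, 0.0)
    np.fill_diagonal(cost_txt, 0.0)
    return float(cost_img.sum() + cost_txt.sum())


# Hypothetical toy data: three images and two queries in a 2-D shared space.
images = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query_en = np.array([0.9, 0.1])  # stands in for an English word vector
query_de = np.array([0.8, 0.2])  # stands in for its German translation

print(rank_images(query_en, images))
print(rank_images(query_de, images))
```

Because the two toy query vectors lie close together, as aligned multilingual embeddings are meant to, both queries place the same image first. The loss is zero when every matched pair is more similar than every mismatched pair by at least the margin, and positive otherwise.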

PUBLICATION RECORD

Publication year: 2019
Venue: arXiv.org
Publication date: 2019-03-21
Fields of study: Linguistics, Computer Science
Identifiers: arXiv:1903.11299
Source metadata: Semantic Scholar

REFERENCES

Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization
2018, influential reference
Fully Convolutional Network and Region Proposal for Instance Identification with Egocentric Vision
2017, cited by this paper
Simple Recurrent Units for Highly Parallelizable Recurrence
2017, cited by this paper
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
2017, cited by this paper
Word Translation Without Parallel Data
2017, cited by this paper
WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks
2016, cited by this paper
Multi30K: Multilingual English-German Image Descriptions
2016, cited by this paper
Fully-Convolutional Siamese Networks for Object Tracking
2016, cited by this paper
End-to-End Learning of Deep Visual Representations for Image Retrieval
2016, cited by this paper
Enriching Word Vectors with Subword Information
2016, influential reference
Aggregated Residual Transformations for Deep Neural Networks
2016, influential reference
Deep Residual Learning for Image Recognition
2015, cited by this paper
Bilingual Word Representations with Monolingual Quality in Mind
2015, cited by this paper
Skip-Thought Vectors
2015, cited by this paper
FaceNet: A unified embedding for face recognition and clustering
2015, influential reference
An Autoencoder Approach to Learning Bilingual Word Representations
2014, cited by this paper
Deep Metric Learning Using Triplet Network
2014, cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014, cited by this paper
Learning Fine-Grained Image Similarity with Deep Ranking
2014, cited by this paper
Improving Vector Space Word Representations Using Multilingual Correlation
2014, cited by this paper
Neural Codes for Image Retrieval
2014, cited by this paper
CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
2014, cited by this paper
Deep visual-semantic alignments for generating image descriptions
2014, cited by this paper
Microsoft COCO: Common Objects in Context
2014, influential reference
Deep Metric Learning for Person Re-identification
2014, cited by this paper
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
2013, cited by this paper
Distributed Representations of Words and Phrases and their Compositionality
2013, influential reference
Multilingual Distributed Representations without Word Alignment
2013, cited by this paper
ImageNet classification with deep convolutional neural networks
2012, cited by this paper
Inducing Crosslingual Distributed Representations of Words
2012, cited by this paper
TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation
2009, cited by this paper
A Neural Probabilistic Language Model
2003, cited by this paper
Signature Verification Using A "Siamese" Time Delay Neural Network
1993, cited by this paper

CITED BY

Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
2025, cites this paper
Vision-based image similarity measurement for image search similarity
2023, cites this paper
Universal Multimodal Representation for Language Understanding
2023, cites this paper
Dual-View Curricular Optimal Transport for Cross-Lingual Cross-Modal Retrieval
2023, cites this paper
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
2022, cites this paper
Extracting locations from sport and exercise-related social media messages using a neural network-based bilingual toponym recognition model
2022, cites this paper
Cross-lingual and Multilingual CLIP
2022, cites this paper
Divide-and-Conquer Predictor for Unbiased Scene Graph Generation
2022, cites this paper
Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning
2022, influential citation
Towards Zero-shot Cross-lingual Image Retrieval and Tagging
2021, cites this paper
Graphcore C2 Card performance for image-based deep learning application: A Report
2020, cites this paper
Towards Zero-shot Cross-lingual Image Retrieval
2020, cites this paper
QISS: An Open Source Image Similarity Search Engine
2020, cites this paper