Yash Patel


I am a first year PhD student at the Center for Machine Perception, Czech Technical University in Prague, where I am supervised by Prof. Jiří Matas. Previously, I graduated with a Master's degree in Computer Vision from the Robotics Institute of Carnegie Mellon University, where I worked with Prof. Abhinav Gupta. During my Master's, I did two research internships at Amazon (first at A9 and then at AWS-AI), where my work was supervised by Prof. R. Manmatha and Prof. Alex Smola.
Even before that, I obtained a Bachelor of Technology with Honors by Research in Computer Science and Engineering from the International Institute of Information Technology, Hyderabad (IIIT-H), where I worked with Prof. C.V. Jawahar at the Center for Visual Information Technology (CVIT). During my undergrad, I did a research internship at the Computer Vision Center (CVC), Universitat Autònoma de Barcelona, supervised by Prof. Dimosthenis Karatzas, and another at the Center for Machine Perception, Czech Technical University in Prague, supervised by Prof. Jiří Matas.
Research Interests: Self-Supervised Representation Learning, Image Compression, Scene Text Detection and Recognition, Tracking and Segmentation in Videos, 3D Reconstruction


email | GitHub | LinkedIn | Google Scholar | ResearchGate


patelyas AT cmp DOT felk DOT cvut DOT cz 
yashp AT alumni DOT cmu DOT edu

Academic Services

Reviewer for: TPAMI, IJCV, ICPR, ACCV

Awards


Publications

Hierarchical Auto-Regressive Model for Image Compression Incorporating Object Saliency and a Deep Perceptual Loss
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2020
pdf   abstract   bibtex

We propose a new end-to-end trainable model for lossy image compression which includes a number of novel components. The approach 1) uses a hierarchical auto-regressive model; 2) incorporates saliency in the images, focusing on reconstructing the salient regions better; and 3) empirically demonstrates that the popularly used evaluation metrics such as MS-SSIM and PSNR are inadequate for judging the performance of deep learned image compression techniques, as they do not align well with human perceptual similarity. We therefore propose an alternative metric, which is learned on perceptual similarity data specific to image compression. Our experiments show that this new metric aligns significantly better with human judgements when compared to other hand-crafted or learned metrics. The proposed compression model not only generates images which are visually better but also gives superior performance on subsequent computer vision tasks such as object detection and segmentation when compared to other engineered or learned codecs.
@article{patel2020hierarchical,
  title={Hierarchical Auto-Regressive Model for Image Compression Incorporating Object Saliency and a Deep Perceptual Loss},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:2002.04988},
  year={2020}
}
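
A minimal sketch of the kind of training objective described in the abstract above, assuming a saliency map in [0, 1], a learned perceptual network and a bit-rate estimate from the entropy model (the function, names and weights below are illustrative, not the paper's code):

import torch

# Illustrative only: combine a saliency-weighted distortion term, a learned
# perceptual distance and a rate term from the auto-regressive entropy model.
def compression_loss(x, x_hat, saliency, bits, perceptual_net,
                     lambda_percep=1.0, lambda_rate=0.01):
    weights = 1.0 + saliency                      # emphasize salient regions, shape (B, 1, H, W)
    distortion = (weights * (x - x_hat) ** 2).mean()
    percep = perceptual_net(x, x_hat).mean()      # assumed to return a per-image distance
    rate = bits.mean()                            # estimated bits per pixel
    return distortion + lambda_percep * percep + lambda_rate * rate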

Deep Perceptual Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2019
pdf   abstract   bibtex

Several deep learned lossy compression techniques have been proposed in the recent literature. Most of these are optimized by using either MS-SSIM (multi-scale structural similarity) or MSE (mean squared error) as a loss function. Unfortunately, neither of these correlate well with human perception and this is clearly visible from the resulting compressed images. In several cases, the MS-SSIM for deep learned techniques is higher than say a conventional, non-deep learned codec such as JPEG-2000 or BPG. However, the images produced by these deep learned techniques are in many cases clearly worse to human eyes than those produced by JPEG-2000 or BPG.
We propose the use of an alternative, deep perceptual metric, which has been shown to align better with human perceptual similarity. We then propose Deep Perceptual Compression (DPC) which makes use of an encoder-decoder based image compression model to jointly optimize on the deep perceptual metric and MS-SSIM. Via extensive human evaluations, we show that the proposed method generates visually better results than previous learning based compression methods and JPEG-2000, and is comparable to BPG. Furthermore, we demonstrate that for tasks like object-detection, images compressed with DPC give better accuracy.
@article{patel2019deep,
  title={Deep Perceptual Compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:1907.08310},
  year={2019}
}
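
The joint objective can be summarized as below, with α and β as assumed weighting coefficients; the precise form used in the paper may differ:

\mathcal{L}_{\mathrm{DPC}} \;=\; \alpha \,\bigl(1 - \mathrm{MS\text{-}SSIM}(x, \hat{x})\bigr) \;+\; \beta \, d_{\mathrm{percep}}(x, \hat{x})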

Human Perceptual Evaluations for Image Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2019
pdf   abstract   bibtex

Recently, there has been much interest in deep learning techniques to do image compression and there have been claims that several of these produce better results than engineered compression schemes (such as JPEG, JPEG2000 or BPG). A standard way of comparing image compression schemes today is to use perceptual similarity metrics such as PSNR or MS-SSIM (multi-scale structural similarity). This has led to some deep learning techniques which directly optimize for MS-SSIM by choosing it as a loss function. While this leads to a higher MS-SSIM for such techniques, we demonstrate using user studies that the resulting improvement may be misleading. Deep learning techniques for image compression with a higher MS-SSIM may actually be perceptually worse than engineered compression schemes with a lower MS-SSIM.
@article{patel2019human,
  title={Human Perceptual Evaluations for Image Compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:1908.04187},
  year={2019}
}
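
A minimal sketch of how such pairwise human judgements can be aggregated into per-codec preference rates (the data format below is assumed for illustration):

from collections import defaultdict

# Illustrative aggregation of two-alternative forced choice (2AFC) judgements:
# each record states which of two codecs a rater preferred for the same image.
def preference_rates(judgements):
    # judgements: iterable of (codec_a, codec_b, choice) with choice in {"a", "b"}
    wins, totals = defaultdict(int), defaultdict(int)
    for codec_a, codec_b, choice in judgements:
        winner = codec_a if choice == "a" else codec_b
        wins[winner] += 1
        totals[codec_a] += 1
        totals[codec_b] += 1
    return {codec: wins[codec] / totals[codec] for codec in totals}

# A learned codec with a higher MS-SSIM can still lose most comparisons to BPG.
print(preference_rates([("learned", "bpg", "b"), ("learned", "bpg", "b"), ("learned", "bpg", "a")]))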

Self-Supervised Visual Representations for Cross-Modal Retrieval
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
International Conference on Multimedia Retrieval (ICMR), 2019
Spotlight
pdf   abstract   bibtex

Cross-modal retrieval methods have improved significantly in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found in large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists of training a CNN to predict: (1) the semantic context of the article in which an image is more likely to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like image classification and object detection, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.

@inproceedings{patel2019self,
  title={Self-Supervised Visual Representations for Cross-Modal Retrieval},
  author={Patel, Yash and Gomez, Lluis and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the 2019 on International Conference on Multimedia Retrieval},
  year={2019},
  organization={ACM}
}
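
A minimal sketch of the two prediction targets described above, assuming a ResNet-50 backbone and topic-distribution targets of an assumed size (the original work may use a different architecture and topic count):

import torch.nn as nn
from torchvision import models

# Illustrative two-head network: one head predicts the topic distribution of the
# whole article (global context), the other that of the image caption (local context).
class GlobalLocalContextNet(nn.Module):
    def __init__(self, num_topics=40):
        super().__init__()
        backbone = models.resnet50()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # pooled conv features
        self.global_head = nn.Linear(2048, num_topics)   # article-level semantic context
        self.local_head = nn.Linear(2048, num_topics)    # caption-level semantic context

    def forward(self, images):
        feats = self.features(images).flatten(1)
        return self.global_head(feats), self.local_head(feats)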

ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition--RRC-MLT-2019
Nibal Nayef*, Yash Patel*, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, Jean-Marc Ogier
International Conference on Document Analysis and Recognition (ICDAR), 2019
Oral
pdf   abstract   bibtex   portal

With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large-scale multi-lingual synthetic dataset to assist the training, and a baseline end-to-end recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification, and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the RRC-MLT-2019 challenge.
@article{nayef2019icdar2019,
  title={ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition--RRC-MLT-2019},
  author={Nayef, Nibal and Patel, Yash and Busta, Michal and Chowdhury, Pinaki Nath and Karatzas, Dimosthenis and Khlif, Wafa and Matas, Jiri and Pal, Umapada and Burie, Jean-Christophe and Liu, Cheng-lin and others},
  journal={arXiv preprint arXiv:1907.00945},
  year={2019}
}

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
Michal Bušta, Yash Patel, Jiri Matas
International Workshop on Robust Reading, Asian Conference on Computer Vision (ACCV), 2018
Best Paper Award
pdf   abstract   bibtex   code

An end-to-end trainable (fully differentiable) method for multi-language scene text localization and recognition is proposed. The approach is based on a single fully convolutional network (FCN) with shared layers for both tasks.
E2E-MLT is the first published multi-language OCR for scene text. While trained in multi-language setup, E2E-MLT demonstrates competitive performance when compared to other methods trained for English scene text alone. The experiments show that obtaining accurate multi-language multi-script annotations is a challenging problem.
@inproceedings{buvsta2018e2e,
  title={E2E-MLT-an unconstrained end-to-end method for multi-language scene text},
  author={Bu{\v{s}}ta, Michal and Patel, Yash and Matas, Jiri},
  booktitle={Asian Conference on Computer Vision},
  pages={127--143},
  year={2018},
  organization={Springer}
}
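
A minimal sketch of the shared-trunk design described above, with an illustrative detection head and a character-logit head for CTC-style recognition (layer sizes and the alphabet size are assumptions, not the released E2E-MLT code):

import torch.nn as nn

# Illustrative single FCN with shared layers feeding both a text-localization
# head and a multi-language recognition head.
class SharedTextFCN(nn.Module):
    def __init__(self, num_chars=8000):              # large character set for multi-language OCR
        super().__init__()
        self.shared = nn.Sequential(                  # shared convolutional trunk
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.detect = nn.Conv2d(128, 5, 1)            # per-pixel text geometry (box offsets + angle)
        self.recognize = nn.Conv2d(128, num_chars, 1) # per-position character logits for CTC decoding

    def forward(self, images):
        feats = self.shared(images)
        return self.detect(feats), self.recognize(feats)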

TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces
Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
Under Review at Pattern Recognition Journal, arXiv e-print, 2018
pdf   abstract   bibtex   code

The immense success of deep learning based methods in computer vision heavily relies on large-scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort, and annotations are limited to a popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community.
In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more likely to appear as an illustration. More specifically, we use popular text embedding techniques to provide the self-supervision for the training of a deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.
@article{patel2018texttopicnet,
  title={TextTopicNet-Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces},
  author={Patel, Yash and Gomez, Lluis and Gomez, Raul and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  journal={arXiv preprint arXiv:1807.02110},
  year={2018}
}
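
A minimal sketch of a training step under this setup, assuming the targets are per-image topic probabilities (e.g. from a topic model over the surrounding text) and the objective is a KL divergence to those soft targets:

import torch.nn.functional as F

# Illustrative training step: the CNN predicts the topic distribution of the text
# context in which each image appears, against soft topic-probability targets.
def topic_prediction_step(model, optimizer, images, topic_targets):
    # topic_targets: (B, num_topics) probabilities for each image's textual context
    logits = model(images)
    log_probs = F.log_softmax(logits, dim=1)
    loss = F.kl_div(log_probs, topic_targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()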

Learning Sampling Policies for Domain Adaptation
Yash Patel*, Kashyap Chitta*, Bhavan Jasani*
arXiv e-print, 2018
pdf   abstract   bibtex   code

We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that learned sampling policies construct labeled sets that improve accuracies of visual classifiers over baselines.

@inProceedings{patel2018learning,
  title={Learning Sampling Policies for Domain Adaptation},
  author = {Yash Patel
  and Kashyap Chitta
  and Bhavan Jasani},
  booktitle={ArXiv e-prints},
  year={2018}
}
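
A minimal sketch of the sampling step described above, with an epsilon-greedy choice over pseudo-labeled target samples and a reward equal to the accuracy change on the annotated reward partition (the classifier interface is assumed for illustration):

import random

# Illustrative sampling step: pick a pseudo-labeled target sample with an
# epsilon-greedy policy and reward the choice by the accuracy change on the
# small annotated reward partition of the target domain.
def sample_and_reward(q_values, candidates, classifier, reward_set, epsilon=0.1):
    if random.random() < epsilon:
        chosen = random.choice(candidates)                  # explore
    else:
        chosen = max(candidates, key=lambda c: q_values[c]) # exploit current Q estimates
    acc_before = classifier.evaluate(reward_set)            # assumed classifier API
    classifier.update_with_pseudo_label(chosen)             # assumed: retrain with the new sample
    reward = classifier.evaluate(reward_set) - acc_before
    return chosen, reward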

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces
Lluis Gomez*, Yash Patel*, Marçal Rusiñol, Dimosthenis Karatzas, CV Jawahar
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
pdf   abstract   bibtex   code

End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, and this is not always possible. In this paper we present a method that is able to take advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is more probable to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.
@inproceedings{gomez2017self,
  title={Self-supervised learning of visual features through embedding images into text topic spaces},
  author={Gomez, Lluis and Patel, Yash and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={4230--4239},
  year={2017}
}
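
A minimal sketch of the text side under this setup, using LDA as the topic model (the calls below are standard gensim; the topic count is an assumption):

from gensim import corpora
from gensim.models import LdaModel

# Illustrative: discover latent topics in the text corpus and turn each article
# into a topic-probability vector that serves as the CNN's training target.
def article_topic_targets(tokenized_articles, num_topics=40):
    dictionary = corpora.Dictionary(tokenized_articles)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_articles]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # Per-article topic probabilities as lists of (topic_id, probability) pairs.
    return [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]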

Dynamic Lexicon Generation for Natural Scene Images
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas
International Workshop on Robust Reading, European Conference on Computer Vision (ECCV), 2016
pdf   abstract   bibtex   code

Many scene text understanding methods approach the end-to-end recognition problem from a word-spotting perspective and benefit greatly from using small per-image lexicons. Such customized lexicons are normally assumed as given and their source is rarely discussed. In this paper we propose a method that generates contextualized lexicons for scene images using only visual information. For this, we exploit the correlation between visual and textual information in a dataset consisting of images and textual content associated with them. Using the topic modeling framework to discover a set of latent topics in such a dataset allows us to re-rank a fixed dictionary in a way that prioritizes the words that are more likely to appear in a given image. Moreover, we train a CNN that is able to reproduce those word rankings but using only the raw image pixels as input. We demonstrate that the quality of the automatically obtained custom lexicons is superior to a generic frequency-based baseline.
@inproceedings{patel2016dynamic,
  title={Dynamic lexicon generation for natural scene images},
  author={Patel, Yash and Gomez, Lluis and Rusinol, Mar{\c{c}}al and Karatzas, Dimosthenis},
  booktitle={European Conference on Computer Vision},
  pages={395--410},
  year={2016},
  organization={Springer}
}
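
A minimal sketch of the re-ranking idea: score each dictionary word by its probability under the topics predicted for the image, then sort the fixed dictionary accordingly (the probability tables below are toy values):

# Illustrative re-ranking: contextually likely words move to the top of the
# per-image lexicon given the predicted topic distribution.
def rerank_lexicon(topic_probs, word_given_topic, dictionary):
    # topic_probs: P(topic | image); word_given_topic: word -> [P(word | topic_k)]
    def score(word):
        return sum(p_t * p_w for p_t, p_w in zip(topic_probs, word_given_topic[word]))
    return sorted(dictionary, key=score, reverse=True)

print(rerank_lexicon([0.7, 0.3],
                     {"exit": [0.9, 0.1], "menu": [0.2, 0.8]},
                     ["menu", "exit"]))   # -> ['exit', 'menu']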

Dynamic Narratives for Heritage Tour
Anurag Ghosh*, Yash Patel*, Mohak Sukhwani, CV Jawahar
VisART, European Conference on Computer Vision (ECCV), 2016
pdf   abstract   bibtex   code

We present a dynamic story generation approach for egocentric videos from heritage sites. Given a short video clip of a ‘heritage tour’, our method selects a series of short descriptions from a collection of pre-curated text and creates a larger narrative. Unlike in the past, these narratives are not merely monotonic static versions from simple retrievals. We propose a method to generate dynamic narratives of the tour on the fly. The series of text messages selected is optimised over length, relevance, cohesion and information simultaneously. This results in ‘tour guide’-like narratives which are seasoned and adapted to the participant's selection of the tour path. We simultaneously use visual and GPS cues for precise localization on the heritage site, which is conceptually formulated as a graph. The efficacy of the approach is demonstrated on a heritage site, Golconda Fort, situated in Hyderabad, India. We validate our approach on two hours of data collected over multiple runs across the site.
@inproceedings{ghosh2016dynamic,
  title={Dynamic narratives for heritage tour},
  author={Ghosh, Anurag and Patel, Yash and Sukhwani, Mohak and Jawahar, CV},
  booktitle={European Conference on Computer Vision},
  pages={856--870},
  year={2016},
  organization={Springer}
}
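
A minimal greedy stand-in for the joint optimisation over length, relevance and cohesion described above (the scoring functions and the length budget are assumptions for illustration):

# Illustrative: greedily pick snippets that maximize relevance plus cohesion
# with the previous snippet, while respecting a total length budget.
def build_narrative(snippets, relevance, cohesion, max_words=200):
    narrative, used_words, previous = [], 0, None
    remaining = list(snippets)
    while remaining:
        best = max(remaining,
                   key=lambda s: relevance(s) + (cohesion(previous, s) if previous else 0.0))
        if used_words + len(best.split()) > max_words:
            break
        narrative.append(best)
        used_words += len(best.split())
        previous = best
        remaining.remove(best)
    return " ".join(narrative)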


Present & Past Affiliations

A9


