Yash Patel

I am a third-year PhD candidate at the Center for Machine Perception, Czech Technical University in Prague, where I am supervised by Prof. Jiří Matas. My research on Training Neural Networks on Non-Differentiable Losses is supported by the Research Center for Informatics (RCI), an Amazon Research Award (ARA), and the Software Competence Center Hagenberg (SCCH COMET). During my PhD, I did two internships at Meta (FAIR) and AWS AI.

Previously, I graduated with a Master's degree in Computer Vision from the Robotics Institute of Carnegie Mellon University, where I worked with Prof. Abhinav Gupta. During my Master's, I did two research internships at Amazon (first at A9 and second at AWS-AI), where my work was supervised by Prof. R. Manmatha.

Before that, I obtained a Bachelor of Technology with Honors by Research in Computer Science and Engineering from the International Institute of Information Technology, Hyderabad (IIIT-H). During my undergrad, I worked with Prof. C.V. Jawahar at the Center for Visual Information Technology (CVIT).

During my undergrad, I also did a research internship at the Computer Vision Center (CVC), Universitat Autònoma de Barcelona, where I was supervised by Prof. Dimosthenis Karatzas, and another at the Center for Machine Perception, Czech Technical University in Prague, where I was supervised by Prof. Jiří Matas.

Research Interests: Self-Supervised Representation Learning, Image Compression, Scene Text Detection and Recognition, Tracking and Segmentation in Videos, 3D Reconstruction

email | Twitter | GitHub | LinkedIn | Google Scholar | ResearchGate

patelyas AT cmp DOT felk DOT cvut DOT cz 
yashp AT alumni DOT cmu DOT edu


[NEW] Contrastive Classification and Representation Learning with Probabilistic Interpretation
Rahaf Aljundi, Yash Patel, Milan Sulc, Daniel Olmeda, Nikolay Chumerin
Association for the Advancement of Artificial Intelligence (AAAI), 2023
pdf   abstract   bibtex

Cross entropy loss has served as the main objective function for classification-based tasks. Widely deployed for learning neural network classifiers, it shows both effectiveness and a probabilistic interpretation. Recently, after the success of self supervised contrastive representation learning methods, supervised contrastive methods have been proposed to learn representations and have shown superior and more robust performance, compared to solely training with cross entropy loss. However, cross entropy loss is still needed to train the final classification layer. In this work, we investigate the possibility of learning both the representation and the classifier using one objective function that combines the robustness of contrastive learning and the probabilistic interpretation of cross entropy loss. First, we revisit a previously proposed contrastive-based objective function that approximates cross entropy loss and present a simple extension to learn the classifier jointly. Second, we propose a new version of the supervised contrastive training that learns jointly the parameters of the classifier and the backbone of the network. We empirically show that our proposed objective functions show a significant improvement over the standard cross entropy loss with more training stability and robustness in various challenging settings.
  title={Contrastive Classification and Representation Learning with Probabilistic Interpretation},
  author={Aljundi, Rahaf and Patel, Yash and Sulc, Milan and Olmeda, Daniel and Chumerin, Nikolay},
  journal={arXiv preprint arXiv:2211.03646},

[NEW] Recall@k Surrogate Loss with Large Batches and Similarity Mixup
Yash Patel, Giorgos Tolias, Jiri Matas
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
pdf   supplementary   abstract   bibtex   webpage   code   video

This work focuses on learning deep visual representation models for retrieval by exploring the interplay between a new loss function, the batch size, and a new regularization approach. Direct optimization, by gradient descent, of an evaluation metric, is not possible when it is non-differentiable, which is the case for recall in retrieval. A differentiable surrogate loss for the recall is proposed in this work. Using an implementation that sidesteps the hardware constraints of the GPU memory, the method trains with a very large batch size, which is essential for metrics computed on the entire retrieval database. It is assisted by an efficient mixup regularization approach that operates on pairwise scalar similarities and virtually increases the batch size further. The suggested method achieves state-of-the-art performance in several image retrieval benchmarks when used for deep metric learning. For instance-level recognition, the method outperforms similar approaches that train using an approximation of average precision.
  title={Recall@k surrogate loss with large batches and similarity mixup},
  author={Patel, Yash and Tolias, Giorgos and Matas, Ji{\v{r}}{\'\i}},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},

[NEW] Plant recognition by AI: Deep neural nets, transformers, and kNN in deep embeddings
Lukáš Picek, Milan Šulc, Yash Patel, Jiri Matas
Frontiers in Plant Science, 2022
pdf   abstract   bibtex

The article reviews and benchmarks machine learning methods for automatic image-based plant species recognition and proposes a novel retrieval-based method for recognition by nearest neighbor classification in a deep embedding space. The image retrieval method relies on a model trained via the Recall@k surrogate loss. State-of-the-art approaches to image classification, based on Convolutional Neural Networks (CNN) and Vision Transformers (ViT), are benchmarked and compared with the proposed image retrieval-based method. The impact of performance-enhancing techniques, e.g., class prior adaptation, image augmentations, learning rate scheduling, and loss functions, is studied. The evaluation is carried out on the PlantCLEF 2017, the ExpertLifeCLEF 2018, and the iNaturalist 2018 Datasets—the largest publicly available datasets for plant recognition. The evaluation of CNN and ViT classifiers shows a gradual improvement in classification accuracy. The current state-of-the-art Vision Transformer model, ViT-Large/16, achieves 91.15% and 83.54% accuracy on the PlantCLEF 2017 and ExpertLifeCLEF 2018 test sets, respectively; the best CNN model (ResNeSt-269e) error rate dropped by 22.91% and 28.34%. Apart from that, additional tricks increased the performance for the ViT-Base/32 by 3.72% on ExpertLifeCLEF 2018 and by 4.67% on PlantCLEF 2017. The retrieval approach achieved superior performance in all measured scenarios with accuracy margins of 0.28%, 4.13%, and 10.25% on ExpertLifeCLEF 2018, PlantCLEF 2017, and iNat2018–Plantae, respectively.
  title={Plant Recognition by AI: Deep Neural Nets, Transformers and kNN in Deep Embeddings},
  author={Picek, Luk{\'a}{\v{s}} and {\v{S}}ulc, Milan and Patel, Yash and Matas, Ji{\v{r}}{\'\i}},
  journal={Frontiers in Plant Science},

FEDS - Filtered Edit Distance Surrogate
Yash Patel, Jiri Matas
International Conference on Document Analysis and Recognition (ICDAR), 2021
pdf   abstract   bibtex   video

This paper proposes a procedure to robustly train a scene text recognition model using a learned surrogate of edit distance. The proposed method borrows from self-paced learning and filters out the training examples that are hard for the surrogate. The filtering is performed by judging the quality of the approximation, using a ramp function, which is piece-wise differentiable, enabling end-to-end training. Following the literature, the experiments are conducted in a post-tuning setup, where a trained scene text recognition model is tuned using the learned surrogate of edit distance. The efficacy is demonstrated by improvements on various challenging scene text datasets such as IIIT-5K, SVT, ICDAR, SVTP, and CUTE. The proposed method provides an average improvement of 11.2% on total edit distance and an error reduction of 9.5% on accuracy.
  title={FEDS-Filtered Edit Distance Surrogate},
  author={Patel, Yash and Matas, Ji{\v{r}}{\'\i}},
  booktitle={International Conference on Document Analysis and Recognition},

Saliency Driven Perceptual Image Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021
pdf   abstract   bibtex   supplementary   video   amazon science

This paper proposes a new end-to-end trainable model for lossy image compression, which includes several novel components. The method incorporates 1) an adequate perceptual similarity metric; 2) saliency in the images; 3) a hierarchical auto-regressive model. This paper demonstrates that the popularly used evaluation metrics such as MS-SSIM and PSNR are inadequate for judging the performance of image compression techniques as they do not align with the human perception of similarity. Alternatively, a new metric is proposed, which is learned on perceptual similarity data specific to image compression. The proposed compression model incorporates the salient regions and optimizes on the proposed perceptual similarity metric. The model not only generates images which are visually better but also gives superior performance for subsequent computer vision tasks such as object detection and segmentation when compared to existing engineered or learned compression techniques.
  title={Saliency driven perceptual image compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},

Neural Network-based Acoustic Vehicle Counting
Slobodan Djukanović, Yash Patel, Jiří Matas, Tuomas Virtanen
European Signal Processing Conference (EUSIPCO), 2021
pdf   abstract   bibtex

This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of a vehicle-to-microphone distance predicted from audio. The distance is predicted via a two-stage (coarse-fine) regression, both realised using neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously proposed support vector regression. The 95% confidence interval for the mean of vehicle counting error is within [0.28%,−0.55%]. Besides the minima-based counting, we propose a deep learning counting which operates on the predicted distance without detecting local minima. Results also show that removing low frequencies in features improves the counting performance.
  title={Neural network-based acoustic vehicle counting},
  author={Djukanovi{\'c}, Slobodan and Patel, Yash and Matas, Ji{\v{r}}{\'\i} and Virtanen, Tuomas},
  booktitle={2021 29th European Signal Processing Conference (EUSIPCO)},

Learning Surrogates via Deep Embedding
Yash Patel, Tomas Hodan, Jiri Matas
European Conference on Computer Vision (ECCV), 2020
pdf   abstract   bibtex   video   long video

This paper proposes a technique for training a neural network by minimizing a surrogate loss that approximates the target evaluation metric, which may be non-differentiable. The surrogate is learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned using the learned surrogate. Without a significant computational overhead and any bells and whistles, improvements are demonstrated on challenging and practical tasks of scene-text recognition and detection. In the recognition task, the model is tuned using a surrogate approximating the edit distance metric and achieves up to 39% relative improvement in the total edit distance. In the detection task, the surrogate approximates the intersection over union metric for rotated bounding boxes and yields up to 4.25% relative improvement in the F1 score.
  title={Learning surrogates via deep embedding},
  author={Patel, Yash and Hoda{\v{n}}, Tom{\'a}{\v{s}} and Matas, Ji{\v{r}}{\'\i}},
  booktitle={European Conference on Computer Vision},

Deep Perceptual Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2019
pdf   abstract   bibtex

Several deep learned lossy compression techniques have been proposed in the recent literature. Most of these are optimized by using either MS-SSIM (multi-scale structural similarity) or MSE (mean squared error) as a loss function. Unfortunately, neither of these correlate well with human perception and this is clearly visible from the resulting compressed images. In several cases, the MS-SSIM for deep learned techniques is higher than say a conventional, non-deep learned codec such as JPEG-2000 or BPG. However, the images produced by these deep learned techniques are in many cases clearly worse to human eyes than those produced by JPEG-2000 or BPG.
We propose the use of an alternative, deep perceptual metric, which has been shown to align better with human perceptual similarity. We then propose Deep Perceptual Compression (DPC) which makes use of an encoder-decoder based image compression model to jointly optimize on the deep perceptual metric and MS-SSIM. Via extensive human evaluations, we show that the proposed method generates visually better results than previous learning based compression methods and JPEG-2000, and is comparable to BPG. Furthermore, we demonstrate that for tasks like object-detection, images compressed with DPC give better accuracy.
  title={Deep Perceptual Compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:1907.08310},

Human Perceptual Evaluations for Image Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2019
pdf   abstract   bibtex

Recently, there has been much interest in deep learning techniques to do image compression and there have been claims that several of these produce better results than engineered compression schemes (such as JPEG, JPEG2000 or BPG). A standard way of comparing image compression schemes today is to use perceptual similarity metrics such as PSNR or MS-SSIM (multi-scale structural similarity). This has led to some deep learning techniques which directly optimize for MS-SSIM by choosing it as a loss function. While this leads to a higher MS-SSIM for such techniques, we demonstrate using user studies that the resulting improvement may be misleading. Deep learning techniques for image compression with a higher MS-SSIM may actually be perceptually worse than engineered compression schemes with a lower MS-SSIM.
  title={Human Perceptual Evaluations for Image Compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:1908.04187},

Self-Supervised Visual Representations for Cross-Modal Retrieval
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
International Conference on Multimedia Retrieval (ICMR), 2019   Spotlight
pdf   abstract   bibtex

Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists of training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like image classification and object detection, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.
  title={Self-Supervised Visual Representations for Cross-Modal Retrieval},
  author={Patel, Yash and Gomez, Lluis and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the 2019 on International Conference on Multimedia Retrieval},

ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition - RRC-MLT-2019
Nibal Nayef*, Yash Patel*, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, Jean-Marc Ogier
International Conference on Document Analysis and Recognition (ICDAR), 2019   Oral
pdf   abstract   bibtex   portal

With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification, and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.
  title={ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition—RRC-MLT-2019},
  author={Nayef, Nibal and Patel, Yash and Busta, Michal and Chowdhury, Pinaki Nath and Karatzas, Dimosthenis and Khlif, Wafa and Matas, Jiri and Pal, Umapada and Burie, Jean-Christophe and Liu, Cheng-lin and others},
  booktitle={2019 International conference on document analysis and recognition (ICDAR)},

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
Michal Bušta, Yash Patel, Jiri Matas
International Workshop on Robust Reading, Asian Conference on Computer Vision (ACCV), 2018
Best Paper Award
pdf   abstract   bibtex   code

An end-to-end trainable (fully differentiable) method for multi-language scene text localization and recognition is proposed. The approach is based on a single fully convolutional network (FCN) with shared layers for both tasks.
E2E-MLT is the first published multi-language OCR for scene text. While trained in multi-language setup, E2E-MLT demonstrates competitive performance when compared to other methods trained for English scene text alone. The experiments show that obtaining accurate multi-language multi-script annotations is a challenging problem.
  title={E2E-MLT-an unconstrained end-to-end method for multi-language scene text},
  author={Bu{\v{s}}ta, Michal and Patel, Yash and Matas, Jiri},
  booktitle={Asian Conference on Computer Vision},

TextTopicNet-Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces
Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
Under Review at Pattern Recognition Journal, arXiv e-print, 2018
pdf   abstract   bibtex   code

The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort and annotations are limited to a popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community.
In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more probable to appear as an illustration. More specifically, we use popular text embedding techniques to provide the self-supervision for the training of a deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.
  title={TextTopicNet-Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces},
  author={Patel, Yash and Gomez, Lluis and Gomez, Raul and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  journal={arXiv preprint arXiv:1807.02110},

Learning Sampling Policies for Domain Adaptation
Yash Patel*, Kashyap Chitta*, Bhavan Jasani*
arXiv e-print, 2018
pdf   abstract   bibtex   code

We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that learned sampling policies construct labeled sets that improve accuracies of visual classifiers over baselines.
  title={Learning Sampling Policies for Domain Adaptation},
  author={Patel, Yash and Chitta, Kashyap and Jasani, Bhavan},
  booktitle={ArXiv e-prints},

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces
Lluis Gomez*, Yash Patel*, Marçal Rusiñol, Dimosthenis Karatzas, CV Jawahar
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017
pdf   abstract   bibtex   code

End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, and this is not always possible. In this paper we present a method that is able to take advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is more probable to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or natural-supervised approaches.
  title={Self-supervised learning of visual features through embedding images into text topic spaces},
  author={Gomez, Lluis and Patel, Yash and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},

Dynamic Lexicon Generation for Natural Scene Images
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas
International Workshop on Robust Reading, European Conference on Computer Vision (ECCV), 2016
pdf   abstract   bibtex   code

Many scene text understanding methods approach the end-to-end recognition problem from a word-spotting perspective and take huge benefit from using small per-image lexicons. Such customized lexicons are normally assumed as given and their source is rarely discussed. In this paper we propose a method that generates contextualized lexicons for scene images using only visual information. For this, we exploit the correlation between visual and textual information in a dataset consisting of images and textual content associated with them. Using the topic modeling framework to discover a set of latent topics in such a dataset allows us to re-rank a fixed dictionary in a way that prioritizes the words that are more likely to appear in a given image. Moreover, we train a CNN that is able to reproduce those word rankings but using only the image raw pixels as input. We demonstrate that the quality of the automatically obtained custom lexicons is superior to a generic frequency-based baseline.
  title={Dynamic lexicon generation for natural scene images},
  author={Patel, Yash and Gomez, Lluis and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis},
  booktitle={European Conference on Computer Vision},

Dynamic Narratives for Heritage Tour
Anurag Ghosh*, Yash Patel*, Mohak Sukhwani, CV Jawahar
VisART, European Conference on Computer Vision (ECCV), 2016
pdf   abstract   bibtex   code

We present a dynamic story generation approach for egocentric videos from heritage sites. Given a short video clip of a ‘heritage-tour’, our method selects a series of short descriptions from a collection of pre-curated text and creates a larger narrative. Unlike in the past, these narratives are not merely monotonic static versions from simple retrievals. We propose a method to generate dynamic narratives of the tour on the fly. The series of text messages selected is optimised over length, relevance, cohesion and information simultaneously. This results in ‘tour guide’ like narratives which are seasoned and adapted to the participants' selection of the tour path. We simultaneously use visual and GPS cues for precise localization on the heritage site, which is conceptually formulated as a graph. The efficacy of the approach is demonstrated on a heritage site, Golconda Fort, situated in Hyderabad, India. We validate our approach on two hours of data collected over multiple runs across the site.
  title={Dynamic narratives for heritage tour},
  author={Ghosh, Anurag and Patel, Yash and Sukhwani, Mohak and Jawahar, CV},
  booktitle={European Conference on Computer Vision},


Academic Services




Present & Past Affiliations