Yash Patel

Doctoral Dissertation

Neural Network Training and Non-Differentiable Objective Functions
Yash Patel
Supervisor: Professor Jiří Matas
Ph.D. Dissertation, Czech Technical University in Prague, 2023

Many important computer vision tasks are naturally formulated with a non-differentiable objective. Therefore, the standard, dominant training procedure of a neural network is not applicable since back-propagation requires the gradients of the objective with respect to the model’s output. Most deep learning methods side-step the problem sub-optimally by using a proxy loss for training, which was originally designed for another task and is not tailored to the specifics of the objective. The proxy loss functions may or may not align well with the original non-differentiable objective. An appropriate proxy has to be designed for a novel task, which may not be feasible for a non-specialist. This thesis makes four main contributions toward bridging the gap between the non-differentiable objective and the training loss function. Throughout the thesis, we refer to a loss function as a surrogate loss if it is a differentiable approximation of the non-differentiable objective. Note that we use the terms objective and evaluation metric interchangeably.
First, we propose an approach for learning a differentiable surrogate of a decomposable and non-differentiable evaluation metric. The surrogate is learned jointly with the task-specific model in an alternating manner. The approach is validated on two practical tasks of scene text recognition and detection, where the surrogate learns an approximation of edit distance and intersection-over-union, respectively. In a post-tuning setup, where a model trained with the proxy loss is trained further with the learned surrogate on the same data, the proposed method shows a relative improvement of up to 39% on the total edit distance for scene text recognition and 4.25% on F1 score for scene text detection.
Second, we propose an improved version of training with the learned surrogate, in which the training samples that are hard for the surrogate are filtered out. This approach is validated for scene text recognition. It outperforms our previous approach and attains an average improvement of 11.2% on total edit distance and an error reduction of 9.5% on accuracy on several popular benchmarks. Note that the two proposed methods for learning a surrogate and training with the surrogate do not make any assumptions about the task at hand and can potentially be extended to novel tasks.
Third, for recall@k, a non-decomposable and non-differentiable evaluation metric, we propose a hand-crafted surrogate that involves designing differentiable versions of sorting and counting operations. An efficient mixup technique for metric learning is also proposed that mixes the similarity scores instead of the embedding vectors. The proposed surrogate attains state-of-the-art results on several metric learning and instance-level search benchmarks when combined with training on large batches. Further, when combined with the kNN classifier, it also serves as an effective tool for fine-grained recognition, where it outperforms direct classification methods.
Fourth, we propose a loss function termed Extended SupCon that jointly trains the classifier and backbone parameters for supervised contrastive classification. The proposed approach benefits from the robustness of contrastive learning and maintains the probabilistic interpretation like a soft-max prediction. Empirical results show the efficacy of our approach under challenging settings such as class imbalance, label corruption, and training with little labeled data.
Overall, the contributions of this thesis make the training of neural networks more scalable to new tasks. When the evaluation metric is decomposable, a surrogate can be obtained in a nearly labor-free manner, which will help researchers with novel tasks. For non-decomposable evaluation metrics, the differentiable components developed for the recall@k surrogate, such as sorting and counting, can also be used for creating new surrogates.
Automatic translations of the abstract to the Czech language by Google Translate and ChatGPT are included in the appendix.

@phdthesis{patel2023neural,
  title={Neural Network Training and Non-Differentiable Objective Functions},
  author={Patel, Yash},
  school={Czech Technical University in Prague},
  year={2023}
}

Selected Publications

The Amazon Nova family of models: Technical report and model card
Amazon Artificial General Intelligence
Technical Reports, Amazon Science, 2024

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

@Article{Intelligence2024,
 author = {Amazon Artificial General Intelligence},
 title = {The Amazon Nova family of models: Technical report and model card},
 year = {2024},
 url = {https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card},
 journal = {Amazon Technical Reports},
}

Generalized Differentiable RANSAC
Tong Wei, Yash Patel, Alexander Shekhovtsov, Jiri Matas, Daniel Barath
IEEE/CVF International Conference on Computer Vision (ICCV), 2023

We propose ∇-RANSAC, a generalized differentiable RANSAC that allows learning the entire randomized robust estimation pipeline. The proposed approach enables the use of relaxation techniques for estimating the gradients in the sampling distribution, which are then propagated through a differentiable solver. The trainable quality function marginalizes over the scores from all the models estimated within ∇-RANSAC to guide the network toward learning accurate and useful inlier probabilities or to train feature detection and matching networks. Our method directly maximizes the probability of drawing a good hypothesis, allowing us to learn a better sampling distribution. We test ∇-RANSAC on a number of real-world scenarios on fundamental and essential matrix estimation, both outdoors and indoors, with handcrafted and learning-based features. It is superior to the state-of-the-art in terms of accuracy while running at a similar speed to its less accurate alternatives.

@article{wei2023generalized,
  title={Generalized differentiable RANSAC},
  author={Wei, Tong and Patel, Yash and Shekhovtsov, Alexander and Matas, J and Barath, D},
  journal={arXiv preprint arXiv:2212.13185},
  year={2023}
}

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training pipeline: dataset noise, model initialization and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size, while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training that does not increase training complexity while outperforming prior work. Finally, we modify the traditional contrastive alignment objective, and propose an importance-sampling approach to up-sample the importance of hard-negatives without adding additional complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work.

@article{radenovic2023filtering,
  title={Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training},
  author={Radenovic, Filip and Dubey, Abhimanyu and Kadian, Abhishek and Mihaylov, Todor and Vandenhende, Simon and Patel, Yash and Wen, Yi and Ramanathan, Vignesh and Mahajan, Dhruv},
  journal={arXiv preprint arXiv:2301.02280},
  year={2023}
}

Contrastive Classification and Representation Learning with Probabilistic Interpretation
Rahaf Aljundi, Yash Patel, Milan Sulc, Daniel Olmeda, Nikolay Chumerin
Association for the Advancement of Artificial Intelligence (AAAI), 2023

Cross entropy loss has served as the main objective function for classification-based tasks. Widely deployed for learning neural network classifiers, it shows both effectiveness and a probabilistic interpretation. Recently, after the success of self supervised contrastive representation learning methods, supervised contrastive methods have been proposed to learn representations and have shown superior and more robust performance, compared to solely training with cross entropy loss. However, cross entropy loss is still needed to train the final classification layer. In this work, we investigate the possibility of learning both the representation and the classifier using one objective function that combines the robustness of contrastive learning and the probabilistic interpretation of cross entropy loss. First, we revisit a previously proposed contrastive-based objective function that approximates cross entropy loss and present a simple extension to learn the classifier jointly. Second, we propose a new version of the supervised contrastive training that learns jointly the parameters of the classifier and the backbone of the network. We empirically show that our proposed objective functions show a significant improvement over the standard cross entropy loss with more training stability and robustness in various challenging settings.

@article{aljundi2022contrastive,
  title={Contrastive Classification and Representation Learning with Probabilistic Interpretation},
  author={Aljundi, Rahaf and Patel, Yash and Sulc, Milan and Olmeda, Daniel and Chumerin, Nikolay},
  journal={arXiv preprint arXiv:2211.03646},
  year={2022}
}

DocILE Benchmark for Document Information Localization and Extraction
Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas
International Conference on Document Analysis and Recognition (ICDAR), 2023 Oral

This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available at this https URL.

@article{vsimsa2023docile,
  title={DocILE Benchmark for Document Information Localization and Extraction},
  author={{\v{S}}imsa, {\v{S}}t{\v{e}}p{\'a}n and {\v{S}}ulc, Milan and U{\v{r}}i{\v{c}}{\'a}{\v{r}}, Michal and Patel, Yash and Hamdi, Ahmed and Koci{\'a}n, Mat{\v{e}}j and Skalick{\`y}, Maty{\'a}{\v{s}} and Matas, Ji{\v{r}}{\'\i} and Doucet, Antoine and Coustaty, Micka{\"e}l and others},
  journal={arXiv preprint arXiv:2302.05658},
  year={2023}
}

DocILE 2023 Teaser: Document Information Localization and Extraction
Štěpán Šimsa, Milan Šulc, Matyáš Skalický, Yash Patel, Ahmed Hamdi
European Conference on Information Retrieval (ECIR), 2023

The lack of data for information extraction (IE) from semi-structured business documents is a real problem for the IE community. Publications relying on large-scale datasets use only proprietary, unpublished data due to the sensitive nature of such documents. Publicly available datasets are mostly small and domain-specific. The absence of a large-scale public dataset or benchmark hinders the reproducibility and cross-evaluation of published methods. The DocILE 2023 competition, hosted as a lab at the CLEF 2023 conference and as an ICDAR 2023 competition, will run the first major benchmark for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from business documents. With thousands of annotated real documents from open sources, a hundred thousand generated synthetic documents, and nearly a million unlabeled documents, the DocILE lab comes with the largest publicly available dataset for KILE and LIR. We are looking forward to contributions from the Computer Vision, Natural Language Processing, Information Retrieval, and other communities. The data, baselines, code and up-to-date information about the lab and competition are available at this https URL.

@article{vsimsa2023docileteaser,
  title={DocILE 2023 Teaser: Document Information Localization and Extraction},
  author={{\v{S}}imsa, {\v{S}}t{\v{e}}p{\'a}n and {\v{S}}ulc, Milan and Skalick{\`y}, Maty{\'a}{\v{s}} and Patel, Yash and Hamdi, Ahmed},
  journal={arXiv preprint arXiv:2301.12394},
  year={2023}
}

Recall@k Surrogate Loss with Large Batches and Similarity Mixup
Yash Patel, Giorgos Tolias, Jiri Matas
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

This work focuses on learning deep visual representation models for retrieval by exploring the interplay between a new loss function, the batch size, and a new regularization approach. Direct optimization, by gradient descent, of an evaluation metric, is not possible when it is non-differentiable, which is the case for recall in retrieval. A differentiable surrogate loss for the recall is proposed in this work. Using an implementation that sidesteps the hardware constraints of the GPU memory, the method trains with a very large batch size, which is essential for metrics computed on the entire retrieval database. It is assisted by an efficient mixup regularization approach that operates on pairwise scalar similarities and virtually increases the batch size further. The suggested method achieves state-of-the-art performance in several image retrieval benchmarks when used for deep metric learning. For instance-level recognition, the method outperforms similar approaches that train using an approximation of average precision.
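As an illustration of why recall is non-differentiable and how a smooth stand-in can be built, the following is a minimal sketch, not the paper's implementation. It assumes a single query with at least one positive, replaces each hard step function with a sigmoid, and normalizes by min(k, number of positives); the function names and the two temperatures `tau_rank` and `tau_count` are hypothetical choices made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def recall_at_k(sims, is_positive, k):
    """Exact recall@k for one query: hard ranks and a hard top-k count."""
    hits = 0
    for i, s_i in enumerate(sims):
        if not is_positive[i]:
            continue
        # Rank = 1 + number of other items with a strictly higher similarity.
        rank = 1 + sum(1 for j, s_j in enumerate(sims) if j != i and s_j > s_i)
        if rank <= k:
            hits += 1
    return hits / min(k, sum(is_positive))


def smooth_recall_at_k(sims, is_positive, k, tau_rank=0.01, tau_count=0.1):
    """Sigmoid-relaxed recall@k (illustrative assumption, see lead-in)."""
    total = 0.0
    for i, s_i in enumerate(sims):
        if not is_positive[i]:
            continue
        # Soft rank: each comparison s_j > s_i becomes a sigmoid.
        soft_rank = 1.0 + sum(
            sigmoid((s_j - s_i) / tau_rank)
            for j, s_j in enumerate(sims)
            if j != i
        )
        # Soft count: the test rank <= k becomes a sigmoid of (k - rank).
        total += sigmoid((k - soft_rank) / tau_count)
    return total / min(k, sum(is_positive))


sims = [0.9, 0.8, 0.3, 0.1]             # query-to-item similarities
is_positive = [True, False, True, False]
print(recall_at_k(sims, is_positive, k=2))         # 0.5
print(smooth_recall_at_k(sims, is_positive, k=2))  # close to 0.5
```

With small temperatures the relaxed value tracks the exact one, while larger temperatures trade accuracy of the approximation for smoother gradients. Note that in this sketch a positive whose soft rank lands exactly on k contributes about 0.5 rather than 1, since sigmoid(0) = 0.5.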

@inproceedings{patel2022recall,
  title={Recall@ k surrogate loss with large batches and similarity mixup},
  author={Patel, Yash and Tolias, Giorgos and Matas, Ji{\v{r}}{\'\i}},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7502--7511},
  year={2022}
}

Plant recognition by AI: Deep neural nets, transformers, and kNN in deep embeddings
Lukáš Picek, Milan Šulc, Yash Patel, Jiri Matas
Frontiers in Plant Science, 2022

The article reviews and benchmarks machine learning methods for automatic image-based plant species recognition and proposes a novel retrieval-based method for recognition by nearest neighbor classification in a deep embedding space. The image retrieval method relies on a model trained via the Recall@k surrogate loss. State-of-the-art approaches to image classification, based on Convolutional Neural Networks (CNN) and Vision Transformers (ViT), are benchmarked and compared with the proposed image retrieval-based method. The impact of performance-enhancing techniques, e.g., class prior adaptation, image augmentations, learning rate scheduling, and loss functions, is studied. The evaluation is carried out on the PlantCLEF 2017, the ExpertLifeCLEF 2018, and the iNaturalist 2018 Datasets—the largest publicly available datasets for plant recognition. The evaluation of CNN and ViT classifiers shows a gradual improvement in classification accuracy. The current state-of-the-art Vision Transformer model, ViT-Large/16, achieves 91.15% and 83.54% accuracy on the PlantCLEF 2017 and ExpertLifeCLEF 2018 test sets, respectively; the best CNN model (ResNeSt-269e) error rate dropped by 22.91% and 28.34%. Apart from that, additional tricks increased the performance for the ViT-Base/32 by 3.72% on ExpertLifeCLEF 2018 and by 4.67% on PlantCLEF 2017. The retrieval approach achieved superior performance in all measured scenarios with accuracy margins of 0.28%, 4.13%, and 10.25% on ExpertLifeCLEF 2018, PlantCLEF 2017, and iNat2018–Plantae, respectively.

@article{picekplant,
  title={Plant Recognition by AI: Deep Neural Nets, Transformers and kNN in Deep Embeddings},
  author={Picek, Luk{\'a}{\v{s}} and {\v{S}}ulc, Milan and Patel, Yash and Matas, Ji{\v{r}}{\'\i}},
  journal={Frontiers in Plant Science},
  pages={2788},
  publisher={Frontiers},
  year={2022}
}

FEDS--Filtered Edit Distance Surrogate
Yash Patel, Jiri Matas
International Conference on Document Analysis and Recognition (ICDAR), 2021

This paper proposes a procedure to robustly train a scene text recognition model using a learned surrogate of edit distance. The proposed method borrows from self-paced learning and filters out the training examples that are hard for the surrogate. The filtering is performed by judging the quality of the approximation, using a ramp function, which is piece-wise differentiable, enabling end-to-end training. Following the literature, the experiments are conducted in a post-tuning setup, where a trained scene text recognition model is tuned using the learned surrogate of edit distance. The efficacy is demonstrated by improvements on various challenging scene text datasets such as IIIT-5K, SVT, ICDAR, SVTP, and CUTE. The proposed method provides an average improvement of 11.2% on total edit distance and an error reduction of 9.5% on accuracy.

@inproceedings{patel2021feds,
  title={FEDS-Filtered Edit Distance Surrogate},
  author={Patel, Yash and Matas, Ji{\v{r}}{\'\i}},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={171--186},
  year={2021},
  organization={Springer}
}

Saliency Driven Perceptual Image Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
IEEE/CVF Winter Applications of Computer Vision (WACV), 2021

This paper proposes a new end-to-end trainable model for lossy image compression, which includes several novel components. The method incorporates 1) an adequate perceptual similarity metric; 2) saliency in the images; 3) a hierarchical auto-regressive model. This paper demonstrates that the popularly used evaluation metrics such as MS-SSIM and PSNR are inadequate for judging the performance of image compression techniques as they do not align with the human perception of similarity. Alternatively, a new metric is proposed, which is learned on perceptual similarity data specific to image compression. The proposed compression model incorporates the salient regions and optimizes on the proposed perceptual similarity metric. The model not only generates images which are visually better but also gives superior performance for subsequent computer vision tasks such as object detection and segmentation when compared to existing engineered or learned compression techniques.

@inproceedings{patel2021saliency,
  title={Saliency driven perceptual image compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={227--236},
  year={2021}
}

Neural Network-based Acoustic Vehicle Counting
Slobodan Djukanović, Yash Patel, Jiří Matas, Tuomas Virtanen
European Signal Processing Conference (EUSIPCO), 2021

This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of a vehicle-to-microphone distance predicted from audio. The distance is predicted via a two-stage (coarse-fine) regression, both realised using neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously proposed support vector regression. The 95% confidence interval for the mean of vehicle counting error is within [0.28%,−0.55%]. Besides the minima-based counting, we propose a deep learning counting which operates on the predicted distance without detecting local minima. Results also show that removing low frequencies in features improves the counting performance.
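The minima-based counting step described above can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes the predicted vehicle-to-microphone distance is already available as a list of per-frame values, and the `max_distance` threshold used to ignore shallow dips is a hypothetical parameter introduced for this example.

```python
def count_pass_bys(distance, max_distance=float("inf")):
    """Count strict local minima of a 1-D distance sequence.

    A frame counts as a pass-by instant when its value is lower than
    both neighbours and does not exceed max_distance (hypothetical
    threshold, an assumption of this sketch).
    """
    count = 0
    for i in range(1, len(distance) - 1):
        is_local_min = distance[i] < distance[i - 1] and distance[i] < distance[i + 1]
        if is_local_min and distance[i] <= max_distance:
            count += 1
    return count


# Toy predicted distances: two clear pass-bys and one shallow dip.
predicted = [5.0, 3.0, 1.0, 3.0, 5.0, 4.0, 2.0, 0.5, 2.0, 4.0, 4.5, 4.2, 4.6]
print(count_pass_bys(predicted))                    # 3 (all local minima)
print(count_pass_bys(predicted, max_distance=1.5))  # 2 (shallow dip ignored)
```

The finding of local minima is itself a hard, non-smooth operation, which is one reason a learned counter that works directly on the predicted distance can be an attractive alternative.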

@inproceedings{djukanovic2021neural,
  title={Neural network-based acoustic vehicle counting},
  author={Djukanovi{\'c}, Slobodan and Patel, Yash and Matas, Ji{\v{r}}{\'\i} and Virtanen, Tuomas},
  booktitle={2021 29th European Signal Processing Conference (EUSIPCO)},
  pages={561--565},
  year={2021},
  organization={IEEE}
}

Learning Surrogates via Deep Embedding
Yash Patel, Tomas Hodan, Jiri Matas
European Conference on Computer Vision (ECCV), 2020

This paper proposes a technique for training a neural network by minimizing a surrogate loss that approximates the target evaluation metric, which may be non-differentiable. The surrogate is learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned using the learned surrogate. Without a significant computational overhead and any bells and whistles, improvements are demonstrated on challenging and practical tasks of scene-text recognition and detection. In the recognition task, the model is tuned using a surrogate approximating the edit distance metric and achieves up to 39% relative improvement in the total edit distance. In the detection task, the surrogate approximates the intersection over union metric for rotated bounding boxes and yields up to 4.25% relative improvement in the F1 score.
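To make the idea concrete, the sketch below computes the edit distance (an integer-valued, non-differentiable metric) and the quantity a learned surrogate would be fitted to reproduce: the Euclidean distance between two embedding vectors. It is an illustration under stated assumptions, not the paper's code; the embedding vectors are plain lists standing in for the output of a learned embedding network, and `surrogate_fit_error` is a hypothetical name for one possible squared-error fitting objective.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embedding_distance(u, v) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def surrogate_fit_error(h_pred, h_gt, metric_value) -> float:
    """Squared gap between the embedding distance and the true metric.

    Hypothetical fitting objective: driving this gap to zero makes the
    (differentiable) embedding distance mimic the metric.
    """
    return (embedding_distance(h_pred, h_gt) - metric_value) ** 2


print(edit_distance("kitten", "sitting"))  # 3
# Made-up embeddings whose distance is 5.0, compared with a metric value of 3.
print(surrogate_fit_error([0.0, 0.0], [3.0, 4.0], 3))  # 4.0
```

Because the edit distance changes only in integer steps as the prediction changes, it gives no useful gradient, whereas the embedding distance is differentiable with respect to the embeddings and hence usable as a training loss once it approximates the metric.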

@inproceedings{patel2020learning,
  title={Learning surrogates via deep embedding},
  author={Patel, Yash and Hoda{\v{n}}, Tom{\'a}{\v{s}} and Matas, Ji{\v{r}}{\'\i}},
  booktitle={European Conference on Computer Vision},
  pages={205--221},
  year={2020},
  organization={Springer}
}

Deep Perceptual Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2019

Several deep learned lossy compression techniques have been proposed in the recent literature. Most of these are optimized by using either MS-SSIM (multi-scale structural similarity) or MSE (mean squared error) as a loss function. Unfortunately, neither of these correlates well with human perception, and this is clearly visible in the resulting compressed images. In several cases, the MS-SSIM for deep learned techniques is higher than that of, say, a conventional, non-deep learned codec such as JPEG-2000 or BPG. However, the images produced by these deep learned techniques are in many cases clearly worse to human eyes than those produced by JPEG-2000 or BPG.
We propose the use of an alternative, deep perceptual metric, which has been shown to align better with human perceptual similarity. We then propose Deep Perceptual Compression (DPC) which makes use of an encoder-decoder based image compression model to jointly optimize on the deep perceptual metric and MS-SSIM. Via extensive human evaluations, we show that the proposed method generates visually better results than previous learning based compression methods and JPEG-2000, and is comparable to BPG. Furthermore, we demonstrate that for tasks like object-detection, images compressed with DPC give better accuracy.

@article{patel2019deep,
  title={Deep Perceptual Compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:1907.08310},
  year={2019}
}

Human Perceptual Evaluations for Image Compression
Yash Patel, Srikar Appalaraju, R. Manmatha
arXiv e-print, 2019

Recently, there has been much interest in deep learning techniques to do image compression and there have been claims that several of these produce better results than engineered compression schemes (such as JPEG, JPEG2000 or BPG). A standard way of comparing image compression schemes today is to use perceptual similarity metrics such as PSNR or MS-SSIM (multi-scale structural similarity). This has led to some deep learning techniques which directly optimize for MS-SSIM by choosing it as a loss function. While this leads to a higher MS-SSIM for such techniques, we demonstrate using user studies that the resulting improvement may be misleading. Deep learning techniques for image compression with a higher MS-SSIM may actually be perceptually worse than engineered compression schemes with a lower MS-SSIM.

@article{patel2019human,
  title={Human Perceptual Evaluations for Image Compression},
  author={Patel, Yash and Appalaraju, Srikar and Manmatha, R},
  journal={arXiv preprint arXiv:1908.04187},
  year={2019}
}

Self-Supervised Visual Representations for Cross-Modal Retrieval
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
International Conference on Multimedia Retrieval (ICMR), 2019 Spotlight

Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists of training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like image classification and object detection, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.


@inproceedings{patel2019self,
  title={Self-Supervised Visual Representations for Cross-Modal Retrieval},
  author={Patel, Yash and Gomez, Lluis and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the 2019 on International Conference on Multimedia Retrieval},
  year={2019},
  organization={ACM}
}

ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition--RRC-MLT-2019
Nibal Nayef*, Yash Patel*, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, Jean-Marc Ogier
International Conference on Document Analysis and Recognition (ICDAR), 2019 Oral

With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification, and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.

@inproceedings{nayef2019icdar2019,
  title={ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition—RRC-MLT-2019},
  author={Nayef, Nibal and Patel, Yash and Busta, Michal and Chowdhury, Pinaki Nath and Karatzas, Dimosthenis and Khlif, Wafa and Matas, Jiri and Pal, Umapada and Burie, Jean-Christophe and Liu, Cheng-lin and others},
  booktitle={2019 International conference on document analysis and recognition (ICDAR)},
  pages={1582--1587},
  year={2019},
  organization={IEEE}
}

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
Michal Bušta, Yash Patel, Jiri Matas
International Workshop on Robust Reading, Asian Conference on Computer Vision (ACCV), 2018
Best Paper Award

An end-to-end trainable (fully differentiable) method for multi-language scene text localization and recognition is proposed. The approach is based on a single fully convolutional network (FCN) with shared layers for both tasks.
E2E-MLT is the first published multi-language OCR for scene text. While trained in a multi-language setup, E2E-MLT demonstrates competitive performance when compared to other methods trained for English scene text alone. The experiments show that obtaining accurate multi-language multi-script annotations is a challenging problem.

@inproceedings{buvsta2018e2e,
  title={E2E-MLT-an unconstrained end-to-end method for multi-language scene text},
  author={Bu{\v{s}}ta, Michal and Patel, Yash and Matas, Jiri},
  booktitle={Asian Conference on Computer Vision},
  pages={127--143},
  year={2018},
  organization={Springer}
}

TextTopicNet-Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces
Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C.V. Jawahar
Under Review at Pattern Recognition Journal, arXiv e-print, 2018

The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort and annotations are limited to a popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community.
In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more probable to appear as an illustration. More specifically, we use popular text embedding techniques to provide the self-supervision for the training of a deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.

@article{patel2018texttopicnet,
  title={TextTopicNet-Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces},
  author={Patel, Yash and Gomez, Lluis and Gomez, Raul and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  journal={arXiv preprint arXiv:1807.02110},
  year={2018}
}

Learning Sampling Policies for Domain Adaptation
Yash Patel*, Kashyap Chitta*, Bhavan Jasani*
ArXiv e-prints, 2018
pdf abstract bibtex code

We address the problem of semi-supervised domain adaptation of classification algorithms through deep Q-learning. The core idea is to consider the predictions of a source domain network on target domain data as noisy labels, and learn a policy to sample from this data so as to maximize classification accuracy on a small annotated reward partition of the target domain. Our experiments show that learned sampling policies construct labeled sets that improve accuracies of visual classifiers over baselines.

@inproceedings{patel2018learning,
  title={Learning Sampling Policies for Domain Adaptation},
  author={Patel, Yash and Chitta, Kashyap and Jasani, Bhavan},
  booktitle={ArXiv e-prints},
  year={2018}
}

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces
Lluis Gomez*, Yash Patel*, Marçal Rusiñol, Dimosthenis Karatzas, CV Jawahar
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017
pdf abstract bibtex code

End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, and this is not always possible. In this paper, we present a method that is able to take advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is more probable to appear as an illustration. For this, we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.

@inproceedings{gomez2017self,
  title={Self-supervised learning of visual features through embedding images into text topic spaces},
  author={Gomez, Lluis and Patel, Yash and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis and Jawahar, CV},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={4230--4239},
  year={2017}
}

Dynamic Lexicon Generation for Natural Scene Images
Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas
International Workshop on Robust Reading, European Conference on Computer Vision (ECCV), 2016
pdf abstract bibtex code

Many scene text understanding methods approach the end-to-end recognition problem from a word-spotting perspective and benefit greatly from using small per-image lexicons. Such customized lexicons are normally assumed as given and their source is rarely discussed. In this paper, we propose a method that generates contextualized lexicons for scene images using only visual information. For this, we exploit the correlation between visual and textual information in a dataset consisting of images and textual content associated with them. Using the topic modeling framework to discover a set of latent topics in such a dataset allows us to re-rank a fixed dictionary in a way that prioritizes the words that are more likely to appear in a given image. Moreover, we train a CNN that is able to reproduce those word rankings using only the raw image pixels as input. We demonstrate that the quality of the automatically obtained custom lexicons is superior to a generic frequency-based baseline.

@inproceedings{patel2016dynamic,
  title={Dynamic lexicon generation for natural scene images},
  author={Patel, Yash and Gomez, Lluis and Rusi{\~n}ol, Mar{\c{c}}al and Karatzas, Dimosthenis},
  booktitle={European Conference on Computer Vision},
  pages={395--410},
  year={2016},
  organization={Springer}
}

Dynamic Narratives for Heritage Tour
Anurag Ghosh*, Yash Patel*, Mohak Sukhwani, CV Jawahar
VisART, European Conference on Computer Vision (ECCV), 2016
pdf abstract bibtex code

We present a dynamic story generation approach for egocentric videos from heritage sites. Given a short video clip of a ‘heritage-tour’, our method selects a series of short descriptions from a collection of pre-curated text and creates a larger narrative. Unlike in the past, these narratives are not merely monotonic static versions from simple retrievals. We propose a method to generate dynamic narratives of the tour on the fly. The series of text messages selected is optimised over length, relevance, cohesion and information simultaneously. This results in ‘tour guide’-like narratives which are seasoned and adapted to the participant's selection of the tour path. We simultaneously use visual and GPS cues for precise localization on the heritage site, which is conceptually formulated as a graph. The efficacy of the approach is demonstrated on a heritage site, Golconda Fort, situated in Hyderabad, India. We validate our approach on two hours of data collected over multiple runs across the site.

@inproceedings{ghosh2016dynamic,
  title={Dynamic narratives for heritage tour},
  author={Ghosh, Anurag and Patel, Yash and Sukhwani, Mohak and Jawahar, CV},
  booktitle={European Conference on Computer Vision},
  pages={856--870},
  year={2016},
  organization={Springer}
}

Patents

(2024) Statistical model training systems, US Patent 11,868,440, Yash Patel, R Manmatha, Alexander Smola, Son D Tran, Sheng Zha.
(2021) Hierarchical auto-regressive image compression system, US Patent 10,965,948, Srikar Appalaraju, Yash Patel, R Manmatha.
(2021) Learned lossy image compression codec, US Patent 10,909,728, Srikar Appalaraju, R Manmatha, Yash Patel.

Academic Services

Reviewer for TPAMI, IJCV, ICPR, ACCV, WACV, ICCV, CVPR

(2023) Organizer, DocILE Lab and Challenge, International Conference on Document Analysis and Recognition (ICDAR).
(2023) Organizer, DocILE Lab and Challenge, Conference and Labs of the Evaluation Forum (CLEF).
(2019) Organizer, MLT Competition, International Conference on Document Analysis and Recognition (ICDAR).
(2019) Organizer, tutorial on Joint Image-Text Embedding Learning and applications, ICDAR.
(2018) Organizer/Program Chair, 3rd International Workshop on Robust Reading (IWRR), ACCV 2018.

Awards

(2024) Antonín Svoboda Award for the Best Ph.D. Thesis.
(2024) Dean's Award, FEL CVUT for outstanding Ph.D. dissertation.
(2021) Amazon Research Award, for Training Neural Networks on Non-Differentiable Losses (with Prof. Jiri Matas).
(2018) Best Paper Award, to E2E-MLT at IWRR ACCV.
(2017) Open Informatics Young Scientist Scholarship, Czech Technical University in Prague.
(2017) Won (Rank-1) ICDAR RRC-MLT, competition on cropped word script identification.
(2017) Dean's Research Award, IIIT Hyderabad, for excellence in research.
(2016) Dean's Research Award, IIIT Hyderabad, for excellence in research.
(2016) Dean's Academic Award, IIIT Hyderabad, for merit list.

Talks

(10/03/2022) Training Neural Networks on Non-differentiable Losses, at Faculty of Mathematics and Physics, Charles University. slides.
(20/01/2022) Training Neural Networks on Non-differentiable Losses, at Rossum AI. slides video.
(26/11/2020) Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval, at CTU reading group. slides video.
(14/05/2020) Variational Autoencoders with application to unsupervised representation learning, at CTU reading group. slides video.