Individual Topics References
There is no official textbook for this course. The primary textbooks listed at the top of this page are useful resources that cover all the fundamentals we do in class. The rest of the papers and resources listed below are seminal (i.e., famous and influential) or otherwise interesting papers on the topics of the course.
Discussion and Annotation of Papers
Many of the papers referred to in the course should have a Hypothesis link associated with them.
Hypothesis is a website that allows convenient commenting on, annotation of, and sharing of papers. We have a group for the course on the site: (Hypothes.is Group : ECE657A).
Navigation on the Hypothesis site is not very good, so you can use the links on this page to find related papers and jump straight to their Hypothesis pages. If a paper does not have one, you can create it and ask the course staff to add the Hypothesis link.
Feel free to comment, annotate and discuss papers on the site in the ECE657A group.
The Hypothes.is annotation discussion for the course is lightly moderated. If you encounter an offensive or inappropriate comment, you can flag the annotation; a course administrator (the professor or a TA) will review it and can choose to remove or otherwise deal with the comment. The original poster will not see your flag.
Jump to Topic: text ~ seminal ~ machine-learning ~ dimensionality-reduction ~ auto encoders ~ kernel-methods ~ ensemble-methods ~ deep-learning ~ unsupervised-learning ~ variational-inference ~ support-vector-machines ~ convolutional-network ~ recurrent-networks ~ anomaly-detection ~ data-augmentation ~ loss-functions ~ optimizer-gradient-methods ~ ablation-study ~ natural-language-processing ~ attention-mechanism ~ transformers ~ transfer-learning ~ active-learning ~ ai-for-science
Machine Learning: A Probabilistic Perspective
Murphy, Kevin
2012.
keywords:
textbooks-optional
Pattern Classification
Duda, R O,
Hart, P E,
and Stork, D G
2000.
keywords:
textbooks-optional
Deep Learning
Goodfellow, Ian,
Bengio, Yoshua,
and Courville, Aaron
2016.
keywords:
textbooks-optional
Deep Learning of Representations for Unsupervised and Transfer Learning
Bengio, Yoshua
In Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
Bellevue, Washington, USA
, 2012.
Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P(x) is structurally related to some task of interest, say predicting P(y|x). This paper focuses on the context of the Unsupervised and Transfer Learning Challenge, on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.
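The pre-train-then-transfer recipe described above can be illustrated with a small PyTorch sketch (the layer sizes and random data are hypothetical stand-ins, not the paper's models): an autoencoder is first fit on unlabelled data, then its encoder is reused as the feature extractor for a supervised head.

```python
import torch
import torch.nn as nn

# Unlabelled pre-training data and a small labelled set (random stand-ins).
x_unlab = torch.randn(512, 20)
x_lab, y_lab = torch.randn(64, 20), torch.randint(0, 3, (64,))

encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU())   # learns the representation
decoder = nn.Sequential(nn.Linear(8, 20))              # used only during pre-training

# Stage 1: unsupervised pre-training with a reconstruction objective.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(200):
    loss = nn.functional.mse_loss(decoder(encoder(x_unlab)), x_unlab)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: transfer -- reuse the pre-trained encoder and fine-tune with a classifier head.
head = nn.Linear(8, 3)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(head(encoder(x_lab)), y_lab)
    opt.zero_grad(); loss.backward(); opt.step()
```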
A tutorial on speech understanding systems.
Newell, Allen
1975.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
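A minimal PyTorch sketch of the contracting/expanding-path idea with a single skip connection (channel sizes and depth here are arbitrary; the actual U-Net is much deeper):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # contracting path
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, out_ch, 1)               # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                     # high-resolution features
        b = self.bottleneck(self.down(e))   # low-resolution context
        u = torch.cat([self.up(b), e], 1)   # skip connection for precise localization
        return self.head(self.dec(u))

seg = TinyUNet()(torch.randn(1, 1, 64, 64))   # -> (1, 2, 64, 64) score map
```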
Glove: Global Vectors for Word Representation
Pennington, Jeffrey,
Socher, Richard,
and Manning, Christopher
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar
, 2014.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
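A rough NumPy sketch of the weighted least-squares objective the abstract refers to, trained only on the nonzero co-occurrence counts (the toy counts and hyperparameters are placeholders, not the released GloVe code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 10                                    # vocabulary size, embedding dimension
X = rng.poisson(1.0, size=(V, V)).astype(float)  # toy word-word co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))           # word vectors
W_t = rng.normal(scale=0.1, size=(V, d))         # context ("tilde") vectors
b, b_t = np.zeros(V), np.zeros(V)

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function that down-weights rare co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

lr = 0.05
i, j = np.nonzero(X)                             # train only on nonzero entries
for _ in range(50):                              # plain full-batch gradient descent
    err = (W[i] * W_t[j]).sum(axis=1) + b[i] + b_t[j] - np.log(X[i, j])
    g = (f(X[i, j]) * err)[:, None]              # shared factor (the 2 is folded into lr)
    dW, dWt = g * W_t[j], g * W[i]
    np.add.at(W, i, -lr * dW)
    np.add.at(W_t, j, -lr * dWt)
    np.add.at(b, i, -lr * g[:, 0])
    np.add.at(b_t, j, -lr * g[:, 0])

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
loss = float((f(X[i, j]) * err ** 2).sum())
print(loss)
```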
Speech Recognition: A Tutorial Overview
White, G. M.
Computer.
1976.
Research toward mechanical recognition of speech is laying the foundation for significant advances in pattern recognition and artificial intelligence. This paper explains the nature of some of these advances and provides an introduction to the state of the art of automatic speech recognition.
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Deep learning
LeCun, Yann,
Bengio, Yoshua,
and Hinton, Geoffrey
Nature.
2015.
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
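The core operation, scaled dot-product attention, is small enough to sketch directly (single head, NumPy; shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)          # -> (4, 8)
```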
Deep Learning of Representations for Unsupervised and Transfer Learning
Bengio, Yoshua
In Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
Bellevue, Washington, USA
, 2012.
Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P(x) is structurally related to some task of interest, say predicting P(y|x). This paper focuses on the context of the Unsupervised and Transfer Learning Challenge, on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.
The Genius Neuroscientist Who Might Hold the Key to True AI
Raviv, Shaun
2018.
Karl Friston’s free energy principle might be the most all-encompassing idea since Charles Darwin’s theory of natural selection. But to understand it, you need to peer inside the mind of Friston himself.
Glove: Global Vectors for Word Representation
Pennington, Jeffrey,
Socher, Richard,
and Manning, Christopher
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar
, 2014.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Speech Recognition: A Tutorial Overview
White, G. M.
Computer.
1976.
Research toward mechanical recognition of speech is laying the foundation for significant advances in pattern recognition and artificial intelligence. This paper explains the nature of some of these advances and provides an introduction to the state of the art of automatic speech recognition.
Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche
Coupé, Christophe,
Oh, Yoon Mi,
Dediu, Dan,
and Pellegrino, François
Science Advances.
2019.
Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.
Incremental local outlier detection for data streams
Pokrajac, Dragoljub,
Lazarevic, Aleksandar,
and Latecki, Longin Jan
In 2007 IEEE symposium on CIDM.
2007.
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Deep learning
LeCun, Yann,
Bengio, Yoshua,
and Hinton, Geoffrey
Nature.
2015.
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Intuitive Understanding of Attention Mechanism in Deep Learning
Lamba, Harshall
2019.
A TensorFlow Implementation of Neural Machine Translation with Attention
How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
Brownlee, Jason
2017.
Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. In this tutorial, you will discover the attention mechanism for the Encoder-Decoder model. After completing this tutorial, you will know: About the Encoder-Decoder model and attention mechanism for machine translation. How to implement the attention mechanism step-by-step. [...]
Fisher and Kernel Fisher Discriminant Analysis: Tutorial
Ghojogh, Benyamin,
Karray, Fakhri,
and Crowley, Mark
2019.
keywords:
kernel-methods; dimensionality-reduction
Multidimensional Scaling, Sammon Mapping, and Isomap: Tutorial and Survey
Ghojogh, Benyamin,
Ghodsi, Ali,
Karray, Fakhri,
and Crowley, Mark
keywords:
dimensionality-reduction; manifold-learning
Stochastic Neighbor Embedding with Gaussian and Student-t Distributions: Tutorial and Survey
Ghojogh, Benyamin,
Ghodsi, Ali,
Karray, Fakhri,
and Crowley, Mark
2020.
keywords:
dimensionality-reduction; probability-distributions
Unsupervised and Supervised Principal Component Analysis: Tutorial
Ghojogh, Benyamin,
and Crowley, Mark
2019.
Fisher and Kernel Fisher Discriminant Analysis: Tutorial
Ghojogh, Benyamin,
Karray, Fakhri,
and Crowley, Mark
2019.
keywords:
kernel-methods; dimensionality-reduction
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Mondrian forests: Efficient online random forests
Lakshminarayanan, Balaji,
Roy, Daniel M,
and Teh, Yee Whye
In NeurIPS conference.
2014.
Mondrian forests for large-scale regression when uncertainty matters
Lakshminarayanan, Balaji,
Roy, Daniel M,
and Teh, Yee Whye
In Artificial Intelligence and Statistics.
2016.
The Mondrian Process
Roy, Daniel M,
and Teh, Yee Whye
In NeurIPS conference.
2008.
On-line random forests
Saffari, Amir,
Leistner, Christian,
Santner, Jakob,
Godec, Martin,
and Bischof, Horst
In 2009 IEEE ICCV workshops.
2009.
Streaming random forests
Abdulsalam, Hanady,
Skillicorn, David B,
and Martin, Patrick
In 11th IDEAS 2007.
2007.
Extremely randomized trees
Geurts, Pierre,
Ernst, Damien,
and Wehenkel, Louis
Machine learning.
2006.
Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning
Criminisi, Antonio,
Shotton, Jamie,
and Konukoglu, Ender
Foundations and Trends® in Computer Graphics and Vision.
2012.
Random forests
Breiman, Leo
Machine learning.
2001.
Binary Space Partitioning Forests
Fan, Xuhui,
Li, Bin,
and Sisson, Scott Anthony
In 22nd AISTATS conference.
2019.
The binary space partitioning-tree process
Fan, Xuhui,
Li, Bin,
and Sisson, Scott Anthony
In 21st AISTATS conference.
2018.
Understanding random forests: From theory to practice
Louppe, Gilles
2014.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Bai, Shaojie,
Kolter, J. Zico,
and Koltun, Vladlen
arXiv:1803.01271 [cs].
2018.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
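A minimal sketch of the building block such convolutional sequence models rely on: a causal, dilated 1-D convolution in PyTorch, left-padded so no output depends on future time steps (sizes are arbitrary, not the paper's TCN):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution that only looks at past and current time steps."""
    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps it causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))            # pad on the left only
        return self.conv(x)

x = torch.randn(2, 8, 50)
y = CausalConv1d(8)(x)                                     # same length out: (2, 8, 50)
```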
Deep learning
LeCun, Yann,
Bengio, Yoshua,
and Hinton, Geoffrey
Nature.
2015.
An overview of gradient descent optimization algorithms
Ruder, Sebastian
arXiv:1609.04747 [cs].
2017.
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
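For reference, the basic update rules such an overview compares can be written out in a few lines of NumPy on a toy quadratic (the hyperparameters are common defaults, not prescriptions):

```python
import numpy as np

grad = lambda w: 2 * w                     # gradient of f(w) = ||w||^2

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.1, gamma=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + lr * grad(w)       # accumulate a velocity
        w = w - v
    return w

def adam(w, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([3.0, -2.0])
print(sgd(w0), momentum(w0), adam(w0))     # all should approach the minimum at 0
```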
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Intuitive Understanding of Attention Mechanism in Deep Learning
Lamba, Harshall
2019.
A TensorFlow Implementation of Neural Machine Translation with Attention
How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
Brownlee, Jason
2017.
Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. In this tutorial, you will discover the attention mechanism for the Encoder-Decoder model. After completing this tutorial, you will know: About the Encoder-Decoder model and attention mechanism for machine translation. How to implement the attention mechanism step-by-step. [...]
Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images
Lu, Ming Y.,
Williamson, Drew F. K.,
Chen, Tiffany Y.,
Chen, Richard J.,
Barbieri, Matteo,
and Mahmood, Faisal
arXiv:2004.09666 [cs, eess, q-bio].
2020.
The rapidly emerging field of computational pathology has the potential to enable objective diagnosis, therapeutic response prediction and identification of new morphological features of clinical relevance. However, deep learning-based computational pathology approaches either require manual annotation of gigapixel whole slide images (WSIs) in fully-supervised settings or thousands of WSIs with slide-level labels in a weakly-supervised setting. Moreover, whole slide level computational pathology methods also suffer from domain adaptation and interpretability issues. These challenges have prevented the broad adaptation of computational pathology for clinical and research purposes. Here we present CLAM - Clustering-constrained attention multiple instance learning, an easy-to-use, high-throughput, and interpretable WSI-level processing and learning method that only requires slide-level labels while being data efficient, adaptable and capable of handling multi-class subtyping problems. CLAM is a deep-learning-based weakly-supervised method that uses attention-based learning to automatically identify sub-regions of high diagnostic value in order to accurately classify the whole slide, while also utilizing instance-level clustering over the representative regions identified to constrain and refine the feature space. In three separate analyses, we demonstrate the data efficiency and adaptability of CLAM and its superior performance over standard weakly-supervised classification. We demonstrate that CLAM models are interpretable and can be used to identify well-known and new morphological features. We further show that models trained using CLAM are adaptable to independent test cohorts, cell phone microscopy images, and biopsies. CLAM is a general-purpose and adaptable method that can be used for a variety of different computational pathology tasks in both clinical and research settings.
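The attention-based pooling at the heart of such weakly supervised pipelines can be sketched briefly; this is a plain (non-gated) attention pooling over instance features, not the CLAM implementation, and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Pool a bag of instance features into one slide-level feature via learned attention."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.V = nn.Linear(feat_dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, H):                          # H: (num_instances, feat_dim), e.g. patch features
        a = self.w(torch.tanh(self.V(H)))          # unnormalized attention per instance
        a = torch.softmax(a, dim=0)                # weights sum to 1 over the bag
        return (a * H).sum(dim=0), a.squeeze(-1)   # bag representation + interpretable weights

H = torch.randn(500, 128)                          # e.g. 500 patch features from one slide
bag_feat, attn = AttentionMILPooling()(H)          # bag_feat feeds a slide-level classifier
```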
Unsupervised word embeddings capture latent knowledge from materials science literature
Tshitoyan, Vahe,
Dagdelen, John,
Weston, Leigh,
Dunn, Alexander,
Rong, Ziqin,
Kononova, Olga,
Persson, Kristin A.,
Ceder, Gerbrand,
and Jain, Anubhav
Nature.
2019.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
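A hedged sketch of how such word embeddings are typically trained and queried with gensim; the toy corpus, model settings, and query term are placeholders, not the authors' pipeline:

```python
from gensim.models import Word2Vec

# Tokenized abstracts would go here; these two toy "sentences" are placeholders.
corpus = [
    ["thermoelectric", "materials", "convert", "heat", "into", "electricity"],
    ["Bi2Te3", "is", "a", "well", "known", "thermoelectric", "material"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=50)

# Nearest neighbours in the embedding space; with a real corpus these surface
# candidate terms related to a functional keyword such as "thermoelectric".
print(model.wv.most_similar("thermoelectric", topn=5))
```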
Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey
Ghojogh, Benyamin,
Ghodsi, Ali,
Karray, Fakhri,
and Crowley, Mark
2020.
keywords:
course-diver-deeper-into-a-topic; factor-analysis; variational-inference
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Tay, Yi,
Dehghani, Mostafa,
Gupta, Jai,
Bahri, Dara,
Aribandi, Vamsi,
Qin, Zhen,
and Metzler, Donald
In ACL 2021.
2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu, Felix,
Fan, Angela,
Baevski, Alexei,
Dauphin, Yann N.,
and Auli, Michael
In International Conference on Learning Representations (ICLR 2019).
2019.
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
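A compressed sketch of the lightweight-convolution idea: a causal depthwise convolution with a softmax-normalized kernel (head sharing and the dynamic, per-time-step kernel prediction are omitted; all sizes are illustrative):

```python
import torch
import torch.nn.functional as F

B, C, T, K = 2, 8, 32, 3            # batch, channels, time, kernel width
x = torch.randn(B, C, T)
w = torch.randn(C, 1, K)

w = F.softmax(w, dim=-1)            # normalize each kernel, as in lightweight convolutions
x_padded = F.pad(x, (K - 1, 0))     # causal left padding
y = F.conv1d(x_padded, w, groups=C) # depthwise conv: cost linear in T, vs quadratic for self-attention
```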
Stanford:CS231n Convolutional Neural Networks for Visual Recognition
Li, Fei-Fei
2021.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Bai, Shaojie,
Kolter, J. Zico,
and Koltun, Vladlen
arXiv:1803.01271 [cs].
2018.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
Fast Anomaly Detection for Streaming Data
Tan, Swee Chuan,
Ting, Kai Ming,
and Liu, Tony Fei
This paper introduces Streaming Half-Space-Trees (HS-Trees), a fast one-class anomaly detector for evolving data streams. It requires only normal data for training and works well when anomalous data are rare. The model features an ensemble of random HS-Trees, and the tree structure is constructed without any data. This makes the method highly efficient because it requires no model restructuring when adapting to evolving data streams. Our analysis shows that Streaming HS-Trees has constant amortised time complexity and constant memory requirement. When compared with a state-of-the-art method, our method performs favourably in terms of detection accuracy and runtime performance. Our experimental results also show that the detection performance of Streaming HS-Trees is not sensitive to its parameter settings.
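The related isolation-based detectors listed below have an off-the-shelf implementation; a minimal usage sketch with scikit-learn's IsolationForest (parameters and data are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))                   # mostly "normal" training data
X_test = np.vstack([rng.normal(size=(95, 5)),          # normal points
                    rng.normal(6, 1, size=(5, 5))])    # a few obvious anomalies

detector = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
detector.fit(X_train)

labels = detector.predict(X_test)                      # +1 = inlier, -1 = anomaly
scores = detector.score_samples(X_test)                # lower = more anomalous
print((labels == -1).sum(), "points flagged as anomalies")
```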
Anomaly detection: A survey
Chandola, Varun,
Banerjee, Arindam,
and Kumar, Vipin
ACM computing surveys (CSUR).
2009.
Isolation forest
Liu, Fei Tony,
Ting, Kai Ming,
and Zhou, Zhi-Hua
In 2008 Eighth IEEE International Conference on Data Mining.
2008.
Isolation-based anomaly detection
Liu, Fei Tony,
Ting, Kai Ming,
and Zhou, Zhi-Hua
ACM Transactions on Knowledge Discovery from Data (TKDD).
2012.
LOF: identifying density-based local outliers
Breunig, Markus M,
Kriegel, Hans-Peter,
Ng, Raymond T,
and Sander, Jörg
In ACM sigmod record.
2000.
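LOF also has a standard implementation; a minimal usage sketch with scikit-learn's LocalOutlierFactor (neighbourhood size, contamination, and data are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(200, 2)),                   # a dense cluster
               np.array([[5.0, 5.0], [6.0, -4.0]])])        # two isolated points

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)                                  # +1 = inlier, -1 = local outlier
print(np.where(labels == -1)[0])                             # indices of the flagged points
print(lof.negative_outlier_factor_[-2:])                     # more negative = more outlying
```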
Incremental local outlier detection for data streams
Pokrajac, Dragoljub,
Lazarevic, Aleksandar,
and Latecki, Longin Jan
In 2007 IEEE symposium on CIDM.
2007.
A review of novelty detection
Pimentel, Marco AF,
Clifton, David A,
Clifton, Lei,
and Tarassenko, Lionel
Signal Processing.
2014.
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Outlier Detection Data Sets
Rayana, Shebuti
2019.
Intrusion Detection Evaluation Dataset (CICIDS2017)
Canadian Institute for Cybersecurity
2017.
Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization.
Sharafaldin, Iman,
Lashkari, Arash Habibi,
and Ghorbani, Ali A
In ICISSP.
2018.
Anomaly detection via over-sampling principal component analysis
Yeh, Yi-Ren,
Lee, Zheng-Yi,
and Lee, Yuh-Jye
2009.
Anomaly detection via online oversampling principal component analysis
Lee, Yuh-Jye,
Yeh, Yi-Ren,
and Wang, Yu-Chiang Frank
IEEE transactions on knowledge and data engineering.
2013.
Online anomaly detection using KDE
Ahmed, Tarem
In 2009 IEEE conference on global telecommunications.
2009.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
NeurIPS 2020 : Unsupervised Data Augmentation for Consistency Training
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Bai, Shaojie,
Kolter, J. Zico,
and Koltun, Vladlen
arXiv:1803.01271 [cs].
2018.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
Adam: A method for stochastic optimization
Kingma, Diederik P,
and Ba, Jimmy
arXiv preprint arXiv:1412.6980.
2014.
An overview of gradient descent optimization algorithms
Ruder, Sebastian
arXiv:1609.04747 [cs].
2017.
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
A tutorial on speech understanding systems.
Newell, Allen
1975.
Selective Search for Object Recognition
Uijlings, J. R. R.,
Sande, K. E. A.,
Gevers, T.,
and Smeulders, A. W. M.
International Journal of Computer Vision.
2013.
This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99% recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).
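The overlap measure behind numbers such as Mean Average Best Overlap is the usual intersection-over-union between boxes; a small helper as a sketch (boxes assumed to be given as (x1, y1, x2, y2)):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def best_overlap(gt_box, proposals):
    """Best overlap of one ground-truth box with any proposal (averaging over objects gives ABO/MABO)."""
    return max(iou(gt_box, p) for p in proposals)

print(best_overlap((10, 10, 50, 50), [(0, 0, 40, 40), (12, 8, 55, 52)]))
```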
Rich feature hierarchies for accurate object detection and semantic segmentation
Girshick, Ross,
Donahue, Jeff,
Darrell, Trevor,
and Malik, Jitendra
arXiv:1311.2524 [cs].
2014.
keywords:
ablation-study; computer-vision; representation-learning; object-detection
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
Analysing differences between algorithm configurations through ablation
Fawcett, Chris,
and Hoos, Holger H.
Journal of Heuristics.
2016.
Developers of high-performance algorithms for hard computational problems increasingly take advantage of automated parameter tuning and algorithm configuration tools, and consequently often create solvers with many parameters and vast configuration spaces. However, there has been very little work to help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance. In this work, we present an automated technique for answering such questions by performing ablation analysis between two algorithm configurations. We perform an extensive empirical analysis of our technique on five scenarios from propositional satisfiability, mixed-integer programming and AI planning, and show that in all of these scenarios more than 95 % of the performance gains between default configurations and optimised configurations obtained from automated configuration tools can be explained by modifying the values of a small number of parameters (1–4 in the scenarios we studied). We also investigate the use of our ablation analysis procedure for producing configurations that generalise well to previously-unseen problem domains, as well as for analysing the structure of the algorithm parameter response surface near and between high-performance configurations.
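A toy sketch of the ablation idea between two configurations: greedily walk from the default to the optimised configuration, applying at each round the single parameter change that helps most (the evaluate function and configurations here are placeholders, not the authors' tool):

```python
def ablation_path(default, optimised, evaluate):
    """Greedily move from `default` to `optimised`, applying at each round the single
    parameter change that improves performance the most (lower `evaluate` is better)."""
    current = dict(default)
    path = []
    while any(current[k] != optimised[k] for k in optimised):
        candidates = []
        for k in optimised:
            if current[k] != optimised[k]:
                trial = dict(current, **{k: optimised[k]})
                candidates.append((evaluate(trial), k, trial))
        score, k, current = min(candidates)
        path.append((k, score))
    return path   # ordered (parameter, score) pairs: which changes mattered most

# Toy example: only 'b' really matters for the (hypothetical) objective.
default = {"a": 0, "b": 0, "c": 0}
optimised = {"a": 1, "b": 5, "c": 2}
evaluate = lambda cfg: 100 - 10 * cfg["b"] - cfg["a"]
print(ablation_path(default, optimised, evaluate))
```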
Unsupervised word embeddings capture latent knowledge from materials science literature
Tshitoyan, Vahe,
Dagdelen, John,
Weston, Leigh,
Dunn, Alexander,
Rong, Ziqin,
Kononova, Olga,
Persson, Kristin A.,
Ceder, Gerbrand,
and Jain, Anubhav
Nature.
2019.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
A tutorial on speech understanding systems.
Newell, Allen
1975.
Delphi: Towards Machine Ethics and Norms
Jiang, Liwei,
Hwang, Jena D.,
Bhagavatula, Chandrasekhar,
Bras, Ronan Le,
Forbes, Maxwell,
Borchardt, Jon,
Liang, Jenny,
Etzioni, Oren,
Sap, Maarten,
and Choi, Yejin
2021.
Glove: Global Vectors for Word Representation
Pennington, Jeffrey,
Socher, Richard,
and Manning, Christopher
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar
, 2014.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Speech Recognition: A Tutorial Overview
White, G. M.
Computer.
1976.
Research toward mechanical recognition of speech is laying the foundation for significant advances in pattern recognition and artificial intelligence. This paper explains the nature of some of these advances and provides an introduction to the state of the art of automatic speech recognition.
Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche
Coupé, Christophe,
Oh, Yoon Mi,
Dediu, Dan,
and Pellegrino, François
Science Advances.
2019.
Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.
NeurIPS 2020 : Unsupervised Data Augmentation for Consistency Training
Improving Language Understanding by Generative Pre-Training
Radford, Alec,
Narasimhan, Karthik,
Salimans, Tim,
and Sutskever, Ilya
keywords:
natural-language-processing; attention-mechanism
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
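One detail worth noting from the paper (it is not stated in this abstract): fine-tuning also keeps language modelling as an auxiliary objective, schematically L_total = L_task + λ · L_LM, where λ weights the auxiliary language-modelling term alongside the supervised task loss.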
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Tay, Yi,
Dehghani, Mostafa,
Gupta, Jai,
Bahri, Dara,
Aribandi, Vamsi,
Qin, Zhen,
and Metzler, Donald
In ACL 2021.
2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu, Felix,
Fan, Angela,
Baevski, Alexei,
Dauphin, Yann N.,
and Auli, Michael
In International Conference on Learning Representations (ICLR 2019).
2019.
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
Improving Language Understanding by Generative Pre-Training
Radford, Alec,
Narasimhan, Karthik,
Salimans, Tim,
and Sutskever, Ilya
keywords:
natural-language-processing; attention-mechanism
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Intuitive Understanding of Attention Mechanism in Deep Learning
Lamba, Harshall
2019.
A TensorFlow Implementation of Neural Machine Translation with Attention
How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
Brownlee, Jason
2017.
Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. In this tutorial, you will discover the attention mechanism for the Encoder-Decoder model. After completing this tutorial, you will know: About the Encoder-Decoder model and attention mechanism for machine translation. How to implement the attention mechanism step-by-step. [...]
Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images
Lu, Ming Y.,
Williamson, Drew F. K.,
Chen, Tiffany Y.,
Chen, Richard J.,
Barbieri, Matteo,
and Mahmood, Faisal
arXiv:2004.09666 [cs, eess, q-bio].
2020.
The rapidly emerging field of computational pathology has the potential to enable objective diagnosis, therapeutic response prediction and identification of new morphological features of clinical relevance. However, deep learning-based computational pathology approaches either require manual annotation of gigapixel whole slide images (WSIs) in fully-supervised settings or thousands of WSIs with slide-level labels in a weakly-supervised setting. Moreover, whole slide level computational pathology methods also suffer from domain adaptation and interpretability issues. These challenges have prevented the broad adaptation of computational pathology for clinical and research purposes. Here we present CLAM - Clustering-constrained attention multiple instance learning, an easy-to-use, high-throughput, and interpretable WSI-level processing and learning method that only requires slide-level labels while being data efficient, adaptable and capable of handling multi-class subtyping problems. CLAM is a deep-learning-based weakly-supervised method that uses attention-based learning to automatically identify sub-regions of high diagnostic value in order to accurately classify the whole slide, while also utilizing instance-level clustering over the representative regions identified to constrain and refine the feature space. In three separate analyses, we demonstrate the data efficiency and adaptability of CLAM and its superior performance over standard weakly-supervised classification. We demonstrate that CLAM models are interpretable and can be used to identify well-known and new morphological features. We further show that models trained using CLAM are adaptable to independent test cohorts, cell phone microscopy images, and biopsies. CLAM is a general-purpose and adaptable method that can be used for a variety of different computational pathology tasks in both clinical and research settings.
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Tay, Yi,
Dehghani, Mostafa,
Gupta, Jai,
Bahri, Dara,
Aribandi, Vamsi,
Qin, Zhen,
and Metzler, Donald
In ACL 2021.
2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu, Felix,
Fan, Angela,
Baevski, Alexei,
Dauphin, Yann N.,
and Auli, Michael
In International Conference on Learning Representations (ICLR 2019).
2019.
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
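A rough NumPy sketch of the dynamic-convolution idea follows: the kernel at each position is predicted from the current time step alone and applied over a fixed local window, so the cost grows linearly with sequence length. Shapes and weight names are assumptions; this is not the authors' fairseq implementation.

# Dynamic convolution: a tiny linear layer predicts a softmax-normalised kernel
# from x[t] only, which is then applied over a fixed-width window around t.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_conv(x, W_k, kernel_width=3):
    # x: (seq_len, dim); W_k maps the current time step to kernel_width weights.
    seq_len, dim = x.shape
    pad = kernel_width // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(seq_len):
        kernel = softmax(x[t] @ W_k)                 # kernel predicted from x[t] only
        window = x_pad[t:t + kernel_width]           # local context centred at position t
        out[t] = kernel @ window                     # linear in seq_len, unlike self-attention
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
W_k = rng.normal(size=(16, 3))
y = dynamic_conv(x, W_k)

The paper's lightweight convolutions additionally share kernel weights across channel groups; the key contrast with self-attention is that the kernel here never depends on the other positions in the sequence.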
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
In Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
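The building block the paper replaces recurrence with is scaled dot-product attention; a minimal single-head NumPy sketch (no masking, illustrative sizes) follows.

# Scaled dot-product attention: each query attends over all keys, and the output
# is the corresponding weighted sum of values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # attention distribution per query
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 64))             # 6 query positions, d_k = 64
K = rng.normal(size=(10, 64))            # 10 key/value positions
V = rng.normal(size=(10, 64))
out = scaled_dot_product_attention(Q, K, V)   # shape (6, 64)

The full Transformer runs several of these heads in parallel on learned projections of the inputs and stacks them with feed-forward layers, residual connections and positional encodings.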
Deep Learning of Representations for Unsupervised and Transfer Learning
Bengio, Yoshua
In Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
Bellevue, Washington, USA
, 2012.
Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P(x) is structurally related to some task of interest, say predicting P(y|x). This paper focuses on the context of the Unsupervised and Transfer Learning Challenge, on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.
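As a toy illustration of the pre-train-then-transfer recipe discussed above, the sketch below fits an unsupervised representation on unlabeled data and reuses it for a small supervised task. PCA stands in for a deep representation learner, and the dataset shapes are invented.

# Unsupervised "pre-training" on plentiful unlabeled data, then a supervised
# classifier trained on top of the learned representation with few labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(5000, 100))           # plentiful unlabeled data, structure of P(x)
X_labeled = rng.normal(size=(200, 100))              # scarce labeled data for the task P(y|x)
y_labeled = rng.integers(0, 2, size=200)

encoder = PCA(n_components=20).fit(X_unlabeled)      # representation learned without labels
clf = LogisticRegression(max_iter=1000).fit(encoder.transform(X_labeled), y_labeled)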
The Genius Neuroscientist Who Might Hold the Key to True AI
Raviv, Shaun
2018.
Karl Friston’s free energy principle might be the most all-encompassing idea since Charles Darwin’s theory of natural selection. But to understand it, you need to peer inside the mind of Friston himself.
Unsupervised word embeddings capture latent knowledge from materials science literature
Tshitoyan, Vahe,
Dagdelen, John,
Weston, Leigh,
Dunn, Alexander,
Rong, Ziqin,
Kononova, Olga,
Persson, Kristin A.,
Ceder, Gerbrand,
and Jain, Anubhav
Nature.
2019.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
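A minimal gensim sketch of the kind of unsupervised word-embedding training the paper relies on is shown below; the authors' released pipeline is mat2vec, so the corpus and query terms here are just placeholders for illustration.

# Skip-gram word embeddings trained on tokenised materials-science sentences,
# then queried by nearest neighbours in the embedding space.
from gensim.models import Word2Vec

# Each "sentence" would be a tokenised sentence from materials-science abstracts.
corpus = [
    ["LiFePO4", "is", "a", "promising", "cathode", "material"],
    ["Bi2Te3", "shows", "good", "thermoelectric", "performance"],
]

model = Word2Vec(corpus, vector_size=200, window=8, min_count=1, workers=4, sg=1)

# Nearest neighbours of a property keyword can surface candidate materials,
# which is the mechanism behind the paper's "recommendations".
print(model.wv.most_similar("thermoelectric", topn=5))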