Individual Topics References
There is no official textbook for this course. The primary textbooks listed at the top of this page are useful resources that cover all the fundamentals we do in class. The rest of the papers and resources listed below are seminal (i.e., famous and influential) or otherwise interesting papers on the topics of the course.
Discussion and Annotation of Papers
Many of the papers referred to in the course should have a Hypothesis link associated with them.
Hypothesis is a website that allows convenient commenting on, annotation of, and sharing of papers. We have a group for the course on the site: (Hypothes.is Group : ECE657A).
Navigation on the Hypothesis site is not very good, so you can use the links on this page to find related papers and jump straight to their Hypothesis pages. If a paper does not have one, you can create it and ask the course staff to add the Hypothesis link.
Feel free to comment, annotate and discuss papers on the site in the ECE657A group.
The Hypothes.is annotation discussion for the course is lightly moderated. If you encounter an offensive or inappropriate comment, you can flag the annotation; a course administrator (the professor or a TA) will review it and can choose to remove or otherwise deal with the comment. The original poster will not see your flag.
Jump to Topic: text ~ seminal ~ machine-learning ~ dimensionality-reduction ~ auto encoders ~ kernel-methods ~ ensemble-methods ~ deep-learning ~ unsupervised-learning ~ variational-inference ~ support-vector-machines ~ convolutional-network ~ recurrent-networks ~ anomaly-detection ~ data-augmentation ~ loss-functions ~ optimizer-gradient-methods ~ ablation-study ~ natural-language-processing ~ attention-mechanism ~ transformers ~ transfer-learning ~ active-learning ~ ai-for-science
Machine Learning: A Probabilistic Perspective
Murphy, Kevin
2012.
keywords:
textbooks-optional
Pattern Classification
Duda, R O,
Hart, P E,
and Stork, D G
2000.
keywords:
textbooks-optional
Deep Learning
Goodfellow, Ian,
Bengio, Yoshua,
and Courville, Aaron
2016.
keywords:
textbooks-optional
Deep Learning of Representations for Unsupervised and Transfer Learning
Bengio, Yoshua
In Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
Bellevue, Washington, USA
, 2012.
Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P(x) is structurally related to some task of interest, say predicting P(y|x). This paper focuses on the context of the Unsupervised and Transfer Learning Challenge, on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.
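The pre-train-then-transfer recipe described above can be illustrated with a small PyTorch sketch (the layer sizes and random data are hypothetical stand-ins, not the paper's models): an autoencoder is first fit on unlabelled data, then its encoder is reused as the feature extractor for a supervised head.

```python
import torch
import torch.nn as nn

# Unlabelled pre-training data and a small labelled set (random stand-ins).
x_unlab = torch.randn(512, 20)
x_lab, y_lab = torch.randn(64, 20), torch.randint(0, 3, (64,))

encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU())   # learns the representation
decoder = nn.Sequential(nn.Linear(8, 20))              # used only during pre-training

# Stage 1: unsupervised pre-training with a reconstruction objective.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(200):
    loss = nn.functional.mse_loss(decoder(encoder(x_unlab)), x_unlab)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: transfer -- reuse the pre-trained encoder and fine-tune with a classifier head.
head = nn.Linear(8, 3)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(head(encoder(x_lab)), y_lab)
    opt.zero_grad(); loss.backward(); opt.step()
```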
A tutorial on speech understanding systems.
Newell, Allen
1975.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
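A minimal PyTorch sketch of the contracting/expanding-path idea with a single skip connection (channel sizes and depth here are arbitrary; the actual U-Net is much deeper):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # contracting path
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, out_ch, 1)               # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                     # high-resolution features
        b = self.bottleneck(self.down(e))   # low-resolution context
        u = torch.cat([self.up(b), e], 1)   # skip connection for precise localization
        return self.head(self.dec(u))

seg = TinyUNet()(torch.randn(1, 1, 64, 64))   # -> (1, 2, 64, 64) score map
```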
Glove: Global Vectors for Word Representation
Pennington, Jeffrey,
Socher, Richard,
and Manning, Christopher
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar
, 2014.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
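A rough NumPy sketch of the weighted least-squares objective the abstract refers to, trained only on the nonzero co-occurrence counts (the toy counts and hyperparameters are placeholders, not the released GloVe code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 10                                    # vocabulary size, embedding dimension
X = rng.poisson(1.0, size=(V, V)).astype(float)  # toy word-word co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))           # word vectors
W_t = rng.normal(scale=0.1, size=(V, d))         # context ("tilde") vectors
b, b_t = np.zeros(V), np.zeros(V)

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function that down-weights rare co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

lr = 0.05
i, j = np.nonzero(X)                             # train only on nonzero entries
for _ in range(50):                              # plain full-batch gradient descent
    err = (W[i] * W_t[j]).sum(axis=1) + b[i] + b_t[j] - np.log(X[i, j])
    g = (f(X[i, j]) * err)[:, None]              # shared factor (the 2 is folded into lr)
    dW, dWt = g * W_t[j], g * W[i]
    np.add.at(W, i, -lr * dW)
    np.add.at(W_t, j, -lr * dWt)
    np.add.at(b, i, -lr * g[:, 0])
    np.add.at(b_t, j, -lr * g[:, 0])

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
loss = float((f(X[i, j]) * err ** 2).sum())
print(loss)
```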
Speech Recognition: A Tutorial Overview
White, G. M.
Computer.
1976.
Research toward mechanical recognition of speech is laying the foundation for significant advances in pattern recognition and artificial intelligence. This paper explains the nature of some of these advances and provides an introduction to the state of the art of automatic speech recognition.
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Deep learning
LeCun, Yann,
Bengio, Yoshua,
and Hinton, Geoffrey
Nature.
2015.
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
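The core operation, scaled dot-product attention, is small enough to sketch directly (single head, NumPy; shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)          # -> (4, 8)
```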
Deep Learning of Representations for Unsupervised and Transfer Learning
Bengio, Yoshua
In Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
Bellevue, Washington, USA
, 2012.
Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P(x) is structurally related to some task of interest, say predicting P(y|x). This paper focuses on the context of the Unsupervised and Transfer Learning Challenge, on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.
The Genius Neuroscientist Who Might Hold the Key to True AI
Raviv, Shaun
2018.
Karl Friston’s free energy principle might be the most all-encompassing idea since Charles Darwin’s theory of natural selection. But to understand it, you need to peer inside the mind of Friston himself.
Glove: Global Vectors for Word Representation
Pennington, Jeffrey,
Socher, Richard,
and Manning, Christopher
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar
, 2014.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Speech Recognition: A Tutorial Overview
White, G. M.
Computer.
1976.
Research toward mechanical recognition of speech is laying the foundation for significant advances in pattern recognition and artificial intelligence. This paper explains the nature of some of these advances and provides an introduction to the state of the art of automatic speech recognition.
Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche
Coupé, Christophe,
Oh, Yoon Mi,
Dediu, Dan,
and Pellegrino, François
Science Advances.
2019.
Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.
Incremental local outlier detection for data streams
Pokrajac, Dragoljub,
Lazarevic, Aleksandar,
and Latecki, Longin Jan
In 2007 IEEE symposium on CIDM.
2007.
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Deep learning
LeCun, Yann,
Bengio, Yoshua,
and Hinton, Geoffrey
Nature.
2015.
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Intuitive Understanding of Attention Mechanism in Deep Learning
Lamba, Harshall
2019.
A TensorFlow Implementation of Neural Machine Translation with Attention
How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
Brownlee, Jason
2017.
Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. In this tutorial, you will discover the attention mechanism for the Encoder-Decoder model. After completing this tutorial, you will know: About the Encoder-Decoder model and attention mechanism for machine translation. How to implement the attention mechanism step-by-step. [...]
Fisher and Kernel Fisher Discriminant Analysis: Tutorial
Ghojogh, Benyamin,
Karray, Fakhri,
and Crowley, Mark
2019.
keywords:
kernel-methods; dimensionality-reduction
Multidimensional Scaling, Sammon Mapping, and Isomap: Tutorial and Survey
Ghojogh, Benyamin,
Ghodsi, Ali,
Karray, Fakhri,
and Crowley, Mark
keywords:
dimensionality-reduction; manifold-learning
Stochastic Neighbor Embedding with Gaussian and Student-t Distributions: Tutorial and Survey
Ghojogh, Benyamin,
Ghodsi, Ali,
Karray, Fakhri,
and Crowley, Mark
2020.
keywords:
dimensionality-reduction; probability-distributions
Unsupervised and Supervised Principal Component Analysis: Tutorial
Ghojogh, Benyamin,
and Crowley, Mark
2019.
Fisher and Kernel Fisher Discriminant Analysis: Tutorial
Ghojogh, Benyamin,
Karray, Fakhri,
and Crowley, Mark
2019.
keywords:
kernel-methods; dimensionality-reduction
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Mondrian forests: Efficient online random forests
Lakshminarayanan, Balaji,
Roy, Daniel M,
and Teh, Yee Whye
In NeurIPS conference.
2014.
Mondrian forests for large-scale regression when uncertainty matters
Lakshminarayanan, Balaji,
Roy, Daniel M,
and Teh, Yee Whye
In Artificial Intelligence and Statistics.
2016.
The Mondrian Process
Roy, Daniel M,
and Teh, Yee Whye
In NeurIPS conference.
2008.
On-line random forests
Saffari, Amir,
Leistner, Christian,
Santner, Jakob,
Godec, Martin,
and Bischof, Horst
In 2009 IEEE ICCV workshops.
2009.
Streaming random forests
Abdulsalam, Hanady,
Skillicorn, David B,
and Martin, Patrick
In 11th IDEAS 2007.
2007.
Extremely randomized trees
Geurts, Pierre,
Ernst, Damien,
and Wehenkel, Louis
Machine learning.
2006.
Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning
Criminisi, Antonio,
Shotton, Jamie,
and Konukoglu, Ender
Foundations and Trends® in Computer Graphics and Vision.
2012.
Random forests
Breiman, Leo
Machine learning.
2001.
Binary Space Partitioning Forests
Fan, Xuhui,
Li, Bin,
and Sisson, Scott Anthony
In 22nd AISTATS conference.
2019.
The binary space partitioning-tree process
Fan, Xuhui,
Li, Bin,
and Sisson, Scott Anthony
In 21st AISTATS conference.
2018.
Understanding random forests: From theory to practice
Louppe, Gilles
2014.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Bai, Shaojie,
Kolter, J. Zico,
and Koltun, Vladlen
arXiv:1803.01271 [cs].
2018.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
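A minimal sketch of the building block such convolutional sequence models rely on: a causal, dilated 1-D convolution in PyTorch, left-padded so no output depends on future time steps (sizes are arbitrary, not the paper's TCN):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution that only looks at past and current time steps."""
    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps it causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))            # pad on the left only
        return self.conv(x)

x = torch.randn(2, 8, 50)
y = CausalConv1d(8)(x)                                     # same length out: (2, 8, 50)
```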
Deep learning
LeCun, Yann,
Bengio, Yoshua,
and Hinton, Geoffrey
Nature.
2015.
An overview of gradient descent optimization algorithms
Ruder, Sebastian
arXiv:1609.04747 [cs].
2017.
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
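For reference, the basic update rules such an overview compares can be written out in a few lines of NumPy on a toy quadratic (the hyperparameters are common defaults, not prescriptions):

```python
import numpy as np

grad = lambda w: 2 * w                     # gradient of f(w) = ||w||^2

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.1, gamma=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + lr * grad(w)       # accumulate a velocity
        w = w - v
    return w

def adam(w, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2     # second-moment estimate
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([3.0, -2.0])
print(sgd(w0), momentum(w0), adam(w0))     # all should approach the minimum at 0
```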
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Intuitive Understanding of Attention Mechanism in Deep Learning
Lamba, Harshall
2019.
A TensorFlow Implementation of Neural Machine Translation with Attention
How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
Brownlee, Jason
2017.
Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. In this tutorial, you will discover the attention mechanism for the Encoder-Decoder model. After completing this tutorial, you will know: About the Encoder-Decoder model and attention mechanism for machine translation. How to implement the attention mechanism step-by-step. [...]
Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images
Lu, Ming Y.,
Williamson, Drew F. K.,
Chen, Tiffany Y.,
Chen, Richard J.,
Barbieri, Matteo,
and Mahmood, Faisal
arXiv:2004.09666 [cs, eess, q-bio].
2020.
The rapidly emerging field of computational pathology has the potential to enable objective diagnosis, therapeutic response prediction and identification of new morphological features of clinical relevance. However, deep learning-based computational pathology approaches either require manual annotation of gigapixel whole slide images (WSIs) in fully-supervised settings or thousands of WSIs with slide-level labels in a weakly-supervised setting. Moreover, whole slide level computational pathology methods also suffer from domain adaptation and interpretability issues. These challenges have prevented the broad adaptation of computational pathology for clinical and research purposes. Here we present CLAM - Clustering-constrained attention multiple instance learning, an easy-to-use, high-throughput, and interpretable WSI-level processing and learning method that only requires slide-level labels while being data efficient, adaptable and capable of handling multi-class subtyping problems. CLAM is a deep-learning-based weakly-supervised method that uses attention-based learning to automatically identify sub-regions of high diagnostic value in order to accurately classify the whole slide, while also utilizing instance-level clustering over the representative regions identified to constrain and refine the feature space. In three separate analyses, we demonstrate the data efficiency and adaptability of CLAM and its superior performance over standard weakly-supervised classification. We demonstrate that CLAM models are interpretable and can be used to identify well-known and new morphological features. We further show that models trained using CLAM are adaptable to independent test cohorts, cell phone microscopy images, and biopsies. CLAM is a general-purpose and adaptable method that can be used for a variety of different computational pathology tasks in both clinical and research settings.
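The attention-based pooling at the heart of such weakly supervised pipelines can be sketched briefly; this is a plain (non-gated) attention pooling over instance features, not the CLAM implementation, and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Pool a bag of instance features into one slide-level feature via learned attention."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.V = nn.Linear(feat_dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, H):                          # H: (num_instances, feat_dim), e.g. patch features
        a = self.w(torch.tanh(self.V(H)))          # unnormalized attention per instance
        a = torch.softmax(a, dim=0)                # weights sum to 1 over the bag
        return (a * H).sum(dim=0), a.squeeze(-1)   # bag representation + interpretable weights

H = torch.randn(500, 128)                          # e.g. 500 patch features from one slide
bag_feat, attn = AttentionMILPooling()(H)          # bag_feat feeds a slide-level classifier
```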
Unsupervised word embeddings capture latent knowledge from materials science literature
Tshitoyan, Vahe,
Dagdelen, John,
Weston, Leigh,
Dunn, Alexander,
Rong, Ziqin,
Kononova, Olga,
Persson, Kristin A.,
Ceder, Gerbrand,
and Jain, Anubhav
Nature.
2019.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
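A hedged sketch of how such word embeddings are typically trained and queried with gensim; the toy corpus, model settings, and query term are placeholders, not the authors' pipeline:

```python
from gensim.models import Word2Vec

# Tokenized abstracts would go here; these two toy "sentences" are placeholders.
corpus = [
    ["thermoelectric", "materials", "convert", "heat", "into", "electricity"],
    ["Bi2Te3", "is", "a", "well", "known", "thermoelectric", "material"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=50)

# Nearest neighbours in the embedding space; with a real corpus these surface
# candidate terms related to a functional keyword such as "thermoelectric".
print(model.wv.most_similar("thermoelectric", topn=5))
```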
Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey
Ghojogh, Benyamin,
Ghodsi, Ali,
Karray, Fakhri,
and Crowley, Mark
2020.
keywords:
course-diver-deeper-into-a-topic; factor-analysis; variational-inference
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Tay, Yi,
Dehghani, Mostafa,
Gupta, Jai,
Bahri, Dara,
Aribandi, Vamsi,
Qin, Zhen,
and Metzler, Donald
In ACL 2021.
2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu, Felix,
Fan, Angela,
Baevski, Alexei,
Dauphin, Yann N.,
and Auli, Michael
In International Conference on Learning Representations (ICLR 2019).
2019.
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
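A compressed sketch of the lightweight-convolution idea: a causal depthwise convolution with a softmax-normalized kernel (head sharing and the dynamic, per-time-step kernel prediction are omitted; all sizes are illustrative):

```python
import torch
import torch.nn.functional as F

B, C, T, K = 2, 8, 32, 3            # batch, channels, time, kernel width
x = torch.randn(B, C, T)
w = torch.randn(C, 1, K)

w = F.softmax(w, dim=-1)            # normalize each kernel, as in lightweight convolutions
x_padded = F.pad(x, (K - 1, 0))     # causal left padding
y = F.conv1d(x_padded, w, groups=C) # depthwise conv: cost linear in T, vs quadratic for self-attention
```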
Stanford:CS231n Convolutional Neural Networks for Visual Recognition
Li, Fei-Fei
2021.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Bai, Shaojie,
Kolter, J. Zico,
and Koltun, Vladlen
arXiv:1803.01271 [cs].
2018.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
Fast Anomaly Detection for Streaming Data
Tan, Swee Chuan,
Ting, Kai Ming,
and Liu, Tony Fei
This paper introduces Streaming Half-Space-Trees (HS-Trees), a fast one-class anomaly detector for evolving data streams. It requires only normal data for training and works well when anomalous data are rare. The model features an ensemble of random HS-Trees, and the tree structure is constructed without any data. This makes the method highly efficient because it requires no model restructuring when adapting to evolving data streams. Our analysis shows that Streaming HS-Trees has constant amortised time complexity and constant memory requirement. When compared with a state-of-the-art method, our method performs favourably in terms of detection accuracy and runtime performance. Our experimental results also show that the detection performance of Streaming HS-Trees is not sensitive to its parameter settings.
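The related isolation-based detectors listed below have an off-the-shelf implementation; a minimal usage sketch with scikit-learn's IsolationForest (parameters and data are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))                   # mostly "normal" training data
X_test = np.vstack([rng.normal(size=(95, 5)),          # normal points
                    rng.normal(6, 1, size=(5, 5))])    # a few obvious anomalies

detector = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
detector.fit(X_train)

labels = detector.predict(X_test)                      # +1 = inlier, -1 = anomaly
scores = detector.score_samples(X_test)                # lower = more anomalous
print((labels == -1).sum(), "points flagged as anomalies")
```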
Anomaly detection: A survey
Chandola, Varun,
Banerjee, Arindam,
and Kumar, Vipin
ACM computing surveys (CSUR).
2009.
Isolation forest
Liu, Fei Tony,
Ting, Kai Ming,
and Zhou, Zhi-Hua
In 2008 Eighth IEEE International Conference on Data Mining.
2008.
Isolation-based anomaly detection
Liu, Fei Tony,
Ting, Kai Ming,
and Zhou, Zhi-Hua
ACM Transactions on Knowledge Discovery from Data (TKDD).
2012.
LOF: identifying density-based local outliers
Breunig, Markus M,
Kriegel, Hans-Peter,
Ng, Raymond T,
and Sander, Jörg
In ACM sigmod record.
2000.
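LOF also has a standard implementation; a minimal usage sketch with scikit-learn's LocalOutlierFactor (neighbourhood size, contamination, and data are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(200, 2)),                   # a dense cluster
               np.array([[5.0, 5.0], [6.0, -4.0]])])        # two isolated points

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)                                  # +1 = inlier, -1 = local outlier
print(np.where(labels == -1)[0])                             # indices of the flagged points
print(lof.negative_outlier_factor_[-2:])                     # more negative = more outlying
```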
Incremental local outlier detection for data streams
Pokrajac, Dragoljub,
Lazarevic, Aleksandar,
and Latecki, Longin Jan
In 2007 IEEE symposium on CIDM.
2007.
A review of novelty detection
Pimentel, Marco AF,
Clifton, David A,
Clifton, Lei,
and Tarassenko, Lionel
Signal Processing.
2014.
Support vector method for novelty detection
Schölkopf, Bernhard,
Williamson, Robert C,
Smola, Alex J,
Shawe-Taylor, John,
and Platt, John C
In NeurIPS conference.
2000.
Outlier Detection Data Sets
Rayana, Shebuti
2019.
Intrusion Detection Evaluation Dataset (CICIDS2017)
Canadian Institute for Cybersecurity
2017.
Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization.
Sharafaldin, Iman,
Lashkari, Arash Habibi,
and Ghorbani, Ali A
In ICISSP.
2018.
Anomaly detection via over-sampling principal component analysis
Yeh, Yi-Ren,
Lee, Zheng-Yi,
and Lee, Yuh-Jye
2009.
Anomaly detection via online oversampling principal component analysis
Lee, Yuh-Jye,
Yeh, Yi-Ren,
and Wang, Yu-Chiang Frank
IEEE transactions on knowledge and data engineering.
2013.
Online anomaly detection using KDE
Ahmed, Tarem
In 2009 IEEE conference on global telecommunications.
2009.
U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf,
Fischer, Philipp,
and Brox, Thomas
In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015.
Cham
, 2015.
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
NeurIPS 2020 : Unsupervised Data Augmentation for Consistency Training
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Bai, Shaojie,
Kolter, J. Zico,
and Koltun, Vladlen
arXiv:1803.01271 [cs].
2018.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
Adam: A method for stochastic optimization
Kingma, Diederik P,
and Ba, Jimmy
arXiv preprint arXiv:1412.6980.
2014.
An overview of gradient descent optimization algorithms
Ruder, Sebastian
arXiv:1609.04747 [cs].
2017.
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
A tutorial on speech understanding systems.
Newell, Allen
1975.
Selective Search for Object Recognition
Uijlings, J. R. R.,
Sande, K. E. A.,
Gevers, T.,
and Smeulders, A. W. M.
International Journal of Computer Vision.
2013.
This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99% recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).
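The overlap measure behind numbers such as Mean Average Best Overlap is the usual intersection-over-union between boxes; a small helper as a sketch (boxes assumed to be given as (x1, y1, x2, y2)):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def best_overlap(gt_box, proposals):
    """Best overlap of one ground-truth box with any proposal (averaging over objects gives ABO/MABO)."""
    return max(iou(gt_box, p) for p in proposals)

print(best_overlap((10, 10, 50, 50), [(0, 0, 40, 40), (12, 8, 55, 52)]))
```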
Rich feature hierarchies for accurate object detection and semantic segmentation
Girshick, Ross,
Donahue, Jeff,
Darrell, Trevor,
and Malik, Jitendra
arXiv:1311.2524 [cs].
2014.
keywords:
ablation-study; computer-vision; representation-learning; object-detection
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
Analysing differences between algorithm configurations through ablation
Fawcett, Chris,
and Hoos, Holger H.
Journal of Heuristics.
2016.
Developers of high-performance algorithms for hard computational problems increasingly take advantage of automated parameter tuning and algorithm configuration tools, and consequently often create solvers with many parameters and vast configuration spaces. However, there has been very little work to help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance. In this work, we present an automated technique for answering such questions by performing ablation analysis between two algorithm configurations. We perform an extensive empirical analysis of our technique on five scenarios from propositional satisfiability, mixed-integer programming and AI planning, and show that in all of these scenarios more than 95 % of the performance gains between default configurations and optimised configurations obtained from automated configuration tools can be explained by modifying the values of a small number of parameters (1–4 in the scenarios we studied). We also investigate the use of our ablation analysis procedure for producing configurations that generalise well to previously-unseen problem domains, as well as for analysing the structure of the algorithm parameter response surface near and between high-performance configurations.
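A toy sketch of the ablation idea between two configurations: greedily walk from the default to the optimised configuration, applying at each round the single parameter change that helps most (the evaluate function and configurations here are placeholders, not the authors' tool):

```python
def ablation_path(default, optimised, evaluate):
    """Greedily move from `default` to `optimised`, applying at each round the single
    parameter change that improves performance the most (lower `evaluate` is better)."""
    current = dict(default)
    path = []
    while any(current[k] != optimised[k] for k in optimised):
        candidates = []
        for k in optimised:
            if current[k] != optimised[k]:
                trial = dict(current, **{k: optimised[k]})
                candidates.append((evaluate(trial), k, trial))
        score, k, current = min(candidates)
        path.append((k, score))
    return path   # ordered (parameter, score) pairs: which changes mattered most

# Toy example: only 'b' really matters for the (hypothetical) objective.
default = {"a": 0, "b": 0, "c": 0}
optimised = {"a": 1, "b": 5, "c": 2}
evaluate = lambda cfg: 100 - 10 * cfg["b"] - cfg["a"]
print(ablation_path(default, optimised, evaluate))
```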
Unsupervised word embeddings capture latent knowledge from materials science literature
Tshitoyan, Vahe,
Dagdelen, John,
Weston, Leigh,
Dunn, Alexander,
Rong, Ziqin,
Kononova, Olga,
Persson, Kristin A.,
Ceder, Gerbrand,
and Jain, Anubhav
Nature.
2019.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
A tutorial on speech understanding systems.
Newell, Allen
1975.
Delphi: Towards Machine Ethics and Norms
Jiang, Liwei,
Hwang, Jena D.,
Bhagavatula, Chandrasekhar,
Bras, Ronan Le,
Forbes, Maxwell,
Borchardt, Jon,
Liang, Jenny,
Etzioni, Oren,
Sap, Maarten,
and Choi, Yejin
2021.
Glove: Global Vectors for Word Representation
Pennington, Jeffrey,
Socher, Richard,
and Manning, Christopher
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar
, 2014.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Speech Recognition: A Tutorial Overview
White, G. M.
Computer.
1976.
Research toward mechanical recognition of speech is laying the foundation for significant advances in pattern recognition and artificial intelligence. This paper explains the nature of some of these advances and provides an introduction to the state of the art of automatic speech recognition.
Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche
Coupé, Christophe,
Oh, Yoon Mi,
Dediu, Dan,
and Pellegrino, François
Science Advances.
2019.
Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.
NeurIPS 2020 : Unsupervised Data Augmentation for Consistency Training
Improving Language Understanding by Generative Pre-Training
Radford, Alec,
Narasimhan, Karthik,
Salimans, Tim,
and Sutskever, Ilya
keywords:
natural-language-processing; attention-mechanism
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
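One detail worth noting from the paper (it is not stated in this abstract): fine-tuning also keeps language modelling as an auxiliary objective, schematically L_total = L_task + λ · L_LM, where λ weights the auxiliary language-modelling term alongside the supervised task loss.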
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Tay, Yi,
Dehghani, Mostafa,
Gupta, Jai,
Bahri, Dara,
Aribandi, Vamsi,
Qin, Zhen,
and Metzler, Donald
In ACL 2021.
2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu, Felix,
Fan, Angela,
Baevski, Alexei,
Dauphin, Yann N.,
and Auli, Michael
In International Conference on Learning Representations (ICLR 2019).
2019.
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
Improving Language Understanding by Generative Pre-Training
Radford, Alec,
Narasimhan, Karthik,
Salimans, Tim,
and Sutskever, Ilya
keywords:
natural-language-processing; attention-mechanism
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
Intuitive Understanding of Attention Mechanism in Deep Learning
Lamba, Harshall
2019.
A TensorFlow Implementation of Neural Machine Translation with Attention
How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
Brownlee, Jason
2017.
Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. In this tutorial, you will discover the attention mechanism for the Encoder-Decoder model. After completing this tutorial, you will know: About the Encoder-Decoder model and attention mechanism for machine translation. How to implement the attention mechanism step-by-step. [...]
Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images
Lu, Ming Y.,
Williamson, Drew F. K.,
Chen, Tiffany Y.,
Chen, Richard J.,
Barbieri, Matteo,
and Mahmood, Faisal
arXiv:2004.09666 [cs, eess, q-bio].
2020.
The rapidly emerging field of computational pathology has the potential to enable objective diagnosis, therapeutic response prediction and identification of new morphological features of clinical relevance. However, deep learning-based computational pathology approaches either require manual annotation of gigapixel whole slide images (WSIs) in fully-supervised settings or thousands of WSIs with slide-level labels in a weakly-supervised setting. Moreover, whole slide level computational pathology methods also suffer from domain adaptation and interpretability issues. These challenges have prevented the broad adaptation of computational pathology for clinical and research purposes. Here we present CLAM - Clustering-constrained attention multiple instance learning, an easy-to-use, high-throughput, and interpretable WSI-level processing and learning method that only requires slide-level labels while being data efficient, adaptable and capable of handling multi-class subtyping problems. CLAM is a deep-learning-based weakly-supervised method that uses attention-based learning to automatically identify sub-regions of high diagnostic value in order to accurately classify the whole slide, while also utilizing instance-level clustering over the representative regions identified to constrain and refine the feature space. In three separate analyses, we demonstrate the data efficiency and adaptability of CLAM and its superior performance over standard weakly-supervised classification. We demonstrate that CLAM models are interpretable and can be used to identify well-known and new morphological features. We further show that models trained using CLAM are adaptable to independent test cohorts, cell phone microscopy images, and biopsies. CLAM is a general-purpose and adaptable method that can be used for a variety of different computational pathology tasks in both clinical and research settings.
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Tay, Yi,
Dehghani, Mostafa,
Gupta, Jai,
Bahri, Dara,
Aribandi, Vamsi,
Qin, Zhen,
and Metzler, Donald
In ACL 2021.
2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
Pay Less Attention with Lightweight and Dynamic Convolutions
Wu, Felix,
Fan, Angela,
Baevski, Alexei,
Dauphin, Yann N.,
and Auli, Michael
In International Conference on Learning Representations (ICLR 2019).
2019.
Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
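A rough NumPy sketch of the dynamic-convolution idea follows: the kernel at each position is predicted from the current time step alone and applied over a fixed local window, so the cost grows linearly with sequence length. Shapes and weight names are assumptions; this is not the authors' fairseq implementation.

# Dynamic convolution: a tiny linear layer predicts a softmax-normalised kernel
# from x[t] only, which is then applied over a fixed-width window around t.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_conv(x, W_k, kernel_width=3):
    # x: (seq_len, dim); W_k maps the current time step to kernel_width weights.
    seq_len, dim = x.shape
    pad = kernel_width // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(seq_len):
        kernel = softmax(x[t] @ W_k)                 # kernel predicted from x[t] only
        window = x_pad[t:t + kernel_width]           # local context centred at position t
        out[t] = kernel @ window                     # linear in seq_len, unlike self-attention
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
W_k = rng.normal(size=(16, 3))
y = dynamic_conv(x, W_k)

The paper's lightweight convolutions additionally share kernel weights across channel groups; the key contrast with self-attention is that the kernel here never depends on the other positions in the sequence.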
Attention is All you Need
Vaswani, Ashish,
Shazeer, Noam,
Parmar, Niki,
Uszkoreit, Jakob,
Jones, Llion,
Gomez, Aidan N.,
Kaiser, Łukasz,
and Polosukhin, Illia
In Advances in Neural Information Processing Systems.
Long Beach, California
, 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
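The building block the paper replaces recurrence with is scaled dot-product attention; a minimal single-head NumPy sketch (no masking, illustrative sizes) follows.

# Scaled dot-product attention: each query attends over all keys, and the output
# is the corresponding weighted sum of values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # attention distribution per query
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 64))             # 6 query positions, d_k = 64
K = rng.normal(size=(10, 64))            # 10 key/value positions
V = rng.normal(size=(10, 64))
out = scaled_dot_product_attention(Q, K, V)   # shape (6, 64)

The full Transformer runs several of these heads in parallel on learned projections of the inputs and stacks them with feed-forward layers, residual connections and positional encodings.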
Deep Learning of Representations for Unsupervised and Transfer Learning
Bengio, Yoshua
In Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
Bellevue, Washington, USA
, 2012.
Deep learning algorithms seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level features. The objective is to make these higher-level representations more abstract, with their individual features more invariant to most of the variations that are typically present in the training distribution, while collectively preserving as much as possible of the information in the input. Ideally, we would like these representations to disentangle the unknown factors of variation that underlie the training distribution. Such unsupervised learning of representations can be exploited usefully under the hypothesis that the input distribution P(x) is structurally related to some task of interest, say predicting P(y|x). This paper focuses on the context of the Unsupervised and Transfer Learning Challenge, on why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer learning scenario, where we care about predictions on examples that are not from the same distribution as the training distribution.
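As a toy illustration of the pre-train-then-transfer recipe discussed above, the sketch below fits an unsupervised representation on unlabeled data and reuses it for a small supervised task. PCA stands in for a deep representation learner, and the dataset shapes are invented.

# Unsupervised "pre-training" on plentiful unlabeled data, then a supervised
# classifier trained on top of the learned representation with few labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(5000, 100))           # plentiful unlabeled data, structure of P(x)
X_labeled = rng.normal(size=(200, 100))              # scarce labeled data for the task P(y|x)
y_labeled = rng.integers(0, 2, size=200)

encoder = PCA(n_components=20).fit(X_unlabeled)      # representation learned without labels
clf = LogisticRegression(max_iter=1000).fit(encoder.transform(X_labeled), y_labeled)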
The Genius Neuroscientist Who Might Hold the Key to True AI
Raviv, Shaun
2018.
Karl Friston’s free energy principle might be the most all-encompassing idea since Charles Darwin’s theory of natural selection. But to understand it, you need to peer inside the mind of Friston himself.
Unsupervised word embeddings capture latent knowledge from materials science literature
Tshitoyan, Vahe,
Dagdelen, John,
Weston, Leigh,
Dunn, Alexander,
Rong, Ziqin,
Kononova, Olga,
Persson, Kristin A.,
Ceder, Gerbrand,
and Jain, Anubhav
Nature.
2019.
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
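A minimal gensim sketch of the kind of unsupervised word-embedding training the paper relies on is shown below; the authors' released pipeline is mat2vec, so the corpus and query terms here are just placeholders for illustration.

# Skip-gram word embeddings trained on tokenised materials-science sentences,
# then queried by nearest neighbours in the embedding space.
from gensim.models import Word2Vec

# Each "sentence" would be a tokenised sentence from materials-science abstracts.
corpus = [
    ["LiFePO4", "is", "a", "promising", "cathode", "material"],
    ["Bi2Te3", "shows", "good", "thermoelectric", "performance"],
]

model = Word2Vec(corpus, vector_size=200, window=8, min_count=1, workers=4, sg=1)

# Nearest neighbours of a property keyword can surface candidate materials,
# which is the mechanism behind the paper's "recommendations".
print(model.wv.most_similar("thermoelectric", topn=5))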