The world of Data Science and Machine Learning is vast, with too many topics to cover in a single course. To bring some order to the chaos, this course is organized hierarchically into Topics and Modules, with cross-cutting labels that mark each topic as Core or Non-Core.

Organization of Topics

Every Topic in this course falls into one of the following categories:

  • Algorithms for extracting patterns, estimating values, assigning categories or making decisions based on input datasets
  • Methodologies for preprocessing, organizing and transforming data in a way that improves our ability to learn useful models and be confident in their correctness
  • Tasks we want to perform on datasets such as prediction, classification, anomaly detection, interpretation, etc.

CORE vs. NON-CORE Topics

To further focus our discussion, we have also designed this course so that some of these methodologies, algorithms, and tasks are labelled as CORE and others as NON-CORE. This does not mean the non-core topics are less important; some are simply too complex to treat fully in a single course that must start from the fundamental skills.

Our high-level goals for this course are for you to leave with:

  1. A deep understanding of, and real experience with, the most important foundational methods
  2. A broad understanding of the landscape, so that you can find the right tool for the situation at hand. Learning how to assess which methodologies and algorithms suit which task, given your dataset, is therefore an important learning outcome.

Learning Outcomes

The learning outcomes concretely define what you should expect to learn in this course and inform how you will be assessed. They fall into four parts, which necessarily interact with each other and relate to the topics above.

Theory

For core topics (methodologies, algorithms, tasks):

  • define them at a detailed level (i.e., possibly including a mathematical definition)
  • distinguish them from others when given theoretical cases or concrete examples

For core methodologies or algorithms:

  • design a detailed solution for a given task utilizing the core methodology or algorithm
  • implement them, in code, on real data to perform a given task

For non-core topics:

  • define them at a high level
  • distinguish them from others when given theoretical cases or concrete examples

Analysis

Given a new dataset:

  • Describe the properties of the dataset (size, dimensions, feature types such as nominal, categorical or continuous, etc.).
  • Summarize the data using simple statistical measures.
  • Analyse the distribution patterns of the data (e.g., mean, variance, skew, missing data, cross-correlation).
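
As a rough illustration of these analysis steps, the sketch below summarizes a tiny dataset using only the Python standard library. The dataset, the column meanings, and the helper names `summarize` and `pearson` are hypothetical and chosen for this example; they are not part of the course material, and a real analysis would likely use a dedicated data library.

```python
import math
import statistics

# Hypothetical toy dataset: each row is (height_cm, weight_kg); None marks a missing value.
rows = [
    (150.0, 50.0),
    (160.0, 58.0),
    (170.0, None),
    (180.0, 80.0),
    (190.0, 92.0),
]

def summarize(values):
    """Return size, missing count, mean, sample variance and skewness of one column."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    variance = statistics.variance(present)  # sample variance (n - 1 denominator)
    std = statistics.pstdev(present)         # population standard deviation
    # Population skewness: the mean of the cubed standardized deviations.
    skew = sum(((v - mean) / std) ** 3 for v in present) / len(present) if std > 0 else 0.0
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": mean,
        "variance": variance,
        "skew": skew,
    }

def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = statistics.mean(x for x, _ in pairs)
    my = statistics.mean(y for _, y in pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

heights = [r[0] for r in rows]
weights = [r[1] for r in rows]

height_summary = summarize(heights)
weight_summary = summarize(weights)
correlation = pearson(heights, weights)

print("rows:", len(rows), "columns:", len(rows[0]))
print("height:", height_summary)
print("weight:", weight_summary)
print("height/weight correlation:", round(correlation, 3))
```

Missing values are skipped here rather than imputed, which is only one of several reasonable choices and should be stated whenever results are reported.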

Design

Given a new dataset and a data analysis or machine learning task, be able to do the following:

  • Write a concise design plan for performing the task, with specific details including:
    • data preparation pipeline
    • data separation, training and validation methodology
    • proposal for a specific algorithm, with sufficient parameter choices, to perform the task
  • Justify your design choices in writing, including:
    • discussion of computational performance tradeoffs
    • data requirements of the proposed approach compared to alternatives
    • interpretability vs. accuracy tradeoff
    • comparison to the next best alternative approach that could be followed
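
One common way to make the "data separation, training and validation methodology" item concrete is k-fold cross-validation. The sketch below is a minimal standard-library version; the function name `k_fold_indices` and its parameters are assumptions made for this illustration, not a prescribed course API.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))
```

Each sample appears in exactly one validation fold, so every data point is used for both training and validation across the k rounds. A design plan would also state why k was chosen and whether a separate test set is held back.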

Implementation

Given a dataset and a common data analysis or machine learning task, demonstrate the ability to:

  • implement a full data processing pipeline to clean, normalize, and otherwise prepare the data
  • perform feature, dimensionality and manifold processing as needed to obtain a dataset better suited to the task
  • concretely implement in code a solution for the task using the methods and algorithms from the course
  • write a short descriptive report with numerical and visual analysis of the performance of your solution and interesting patterns found in the data
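
To give a sense of scale, the sketch below strings together a tiny clean, normalize, fit and evaluate pipeline using only the Python standard library. The data, the helper names (`clean`, `fit_standardizer`, `fit_line`) and the choice of a one-variable least-squares line are all assumptions for illustration; course assignments would use real data and the methods covered in the modules.

```python
import statistics

# Hypothetical raw data: (x, y) pairs, with None marking missing values.
raw = [
    (1.0, 3.1), (2.0, 4.9), (3.0, None), (4.0, 9.2), (5.0, 10.8),
    (6.0, 13.1), (None, 7.0), (7.0, 15.0), (8.0, 17.1),
]

def clean(rows):
    """Drop every row that contains a missing value."""
    return [r for r in rows if None not in r]

def fit_standardizer(xs):
    """Return the mean and population standard deviation used for z-scoring."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return mean, (std if std > 0 else 1.0)

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

data = clean(raw)
# Simple holdout for brevity; in practice the rows would be shuffled first.
train, held_out = data[:5], data[5:]

# Normalization statistics come from the training rows only, to avoid leakage.
mean, std = fit_standardizer([x for x, _ in train])
train_z = [(x - mean) / std for x, _ in train]
slope, intercept = fit_line(train_z, [y for _, y in train])

predictions = [slope * ((x - mean) / std) + intercept for x, _ in held_out]
mae = statistics.mean(abs(p - y) for p, (_, y) in zip(predictions, held_out))

print("rows kept after cleaning:", len(data))
print("held-out mean absolute error:", round(mae, 3))
```

The point of the sketch is the order of operations: the normalization parameters are fitted on the training rows and then reused unchanged on the held-out rows.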