COMP 499 Introduction to Data Analytics

Summer 2019 Semester 1: May 7 to June 18, 2019

Lectures: Tuesdays and Thursdays 13:15 to 15:45 in H-562

Labs: Tuesdays and Thursdays 10:15 to 12:15 in H-847

Labs: Tuesdays and Thursdays 16:15 to 18:15 in H-847

Section AA

Instructor: Greg Butler (gregb@cse.concordia.ca)

Lab Demonstrator: Stephanie Kamgnia (kamgnia.phanie@gmail.com)


Course Outline


Announcements

2019-07-09 All Marks

2019-06-12 Project Presentation Schedule

Thursday 2019-06-13 Lecture
1330-1345 Khalid Baraka

Tuesday 2019-06-18 Lab H-847
1015-1030 Emilio Assuncao
1030-1045 Ema Dijmarescu
1045-1100 Simon Huang
1100-1115 Maurice Ngwakum
1115-1130 Starly Solon
1130-1145 Mordechai Zirkind

Tuesday 2019-06-18 Lecture H-403
1315-1330 Steven Zanga
1330-1345 Muherthan Thalayasingam
1345-1400 Sabrina Rieck
1400-1415 Genevieve Plante-Brisebois
1415-1430 Gabriel Noriega
1430-1445 Beeri Nduwimana
1445-1500 Alexandre Masmoudi
1500-1515 Jasmine Leblond-Chartrand
1515-1530 Claudia Feochari
1530-1545 Patrick Bui

Tuesday 2019-06-18 Lab H-847
1615-1630 Corentin Artaud
1630-1645 Khaled Ali
1645-1700 Alessandro Power

2019-06-07 Preliminary reports are okay for everyone who submitted one, though some are incomplete as expected.
Remember to include your name and student number in the Jupyter notebook. It makes things easier for me.
I would have liked to see more use of graphs. When you come to EDA, remember to use scatter plots (with a fitted line and the correlation coefficient) for your bivariate analysis.
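
One way to back a scatter plot with numbers is to compute the Pearson correlation coefficient and a least-squares line for the two variables. The sketch below assumes NumPy is installed, plus matplotlib for the optional plot; the function names and the sample values are hypothetical and do not come from any project dataset.

```python
import numpy as np


def fit_line_and_correlation(x, y):
    """Return (slope, intercept, r) for two equal-length numeric sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
    return float(slope), float(intercept), float(r)


def plot_scatter_with_line(x, y, xlabel="x", ylabel="y"):
    """Scatter plot with the fitted line; r is shown in the title."""
    import matplotlib.pyplot as plt  # imported here so the numeric part works without it

    slope, intercept, r = fit_line_and_correlation(x, y)
    xs = np.linspace(min(x), max(x), 100)
    plt.scatter(x, y)
    plt.plot(xs, slope * xs + intercept)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(f"r = {r:.2f}")
    plt.show()


# Hypothetical sample data, for illustration only.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]
slope, intercept, r = fit_line_and_correlation(hours, marks)
```

In a notebook, calling `plot_scatter_with_line(hours, marks, "hours", "marks")` would draw the figure. Keep in mind that a correlation coefficient describes linear association only and says nothing about causation.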

2019-06-04 Midterm and Assignment Marks

2019-06-03 Project

Submit your preliminary report by 20:00 Thursday 6 June 2019 to EAS as in Course Outline.

Your Jupyter notebook and pdf should include: the questions that you are seeking to answer; the source of the datasets, given as explicit URLs and retrieval steps in the notebook; a description of each dataset according to standard descriptive analysis; and wrangling of the data to clean it, structure it as tidy data, and perform necessary enrichment/merging/integration.

Each of these steps should be clearly commented in the notebook; the code should be clear; and the results of the code should be clear.
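
As an illustration of structuring data as tidy data, the sketch below reshapes a small wide table into one row per observation and drops missing values. It assumes pandas is installed; the column names and values are made up for the example and are not from any course dataset.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame(
    {
        "city": ["Montreal", "Toronto"],
        "2017": [10.5, None],   # None marks a missing measurement
        "2018": [11.0, 9.8],
    }
)

# Tidy form: each row is one (city, year, value) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="value")

# Clean: drop missing observations and give the year a numeric type.
tidy = tidy.dropna(subset=["value"]).astype({"year": int})
tidy = tidy.sort_values(["city", "year"]).reset_index(drop=True)
```

Keeping each step in its own commented cell, as here, makes it easy to show the intermediate result of every wrangling step in the notebook.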

After this preliminary work, you should be ready to continue with the rest of the project which considers exploratory data analysis, model building, validation, and story-telling.

The presentation will take place in labs and lectures for Lab 13 and Lecture 13 on Tuesday 18 June 2019. Presentations will be 10 minutes, followed by 5 minutes question period, for each student.

The final report is due 20:00 on Tuesday 18 June 2019 to EAS as in the Course Outline.

The report should include your full Jupyter notebook as a pdf file, and your presentation as a pdf file, zipped. Make sure your notebook is well commented, and clear.

The project is worth 25% of the course mark. There will be 5% for the preliminary report, 10% for the presentation, and 10% for the final report.

2019-05-28 Lecture room is changed to H-403.

2019-05-27 Lab 7 has issues for those working on Mac OS (Mojave): the cause seems to be CPYTHON.
The material works fine on other platforms, including the lab computers, and Mac OS (Sierra).

So complete Lab 7 on the lab computers, if you must.

Submit your Jupyter notebook to EAS by 20:00 Tuesday 28 May 2019.

2019-05-21 Final Exam is scheduled for Friday 21 June 2019 from 0900 to 1200 noon in H-521.

2019-05-21 Instructions for Lab 6 and 7

2019-05-19 Assignment 1 must be shown to the TA at the end of Lab 5, and submitted to EAS by 20:00 Tuesday 21 May 2019. (See course outline)

2019-05-14 Some data set suggestions for your project

2019-05-08 Ericsson Hackathon Saturday, June 1st, 2019.
FYI - For Your Information

Ericsson will be hosting a data science hackathon in which students will apply AI/ML for Smart homes on wireless networks.
Please find below the link to the invitation/Google docs application form: https://docs.google.com/forms/d/e/1FAIpQLSdcRdmtladxCn_Vr29RYQL4sPNKaowixR_DjZA8ZfhX2APodg/viewform?usp=sf_link
In brief:
Event: Ericsson Data Science Hackathon
Date: Saturday, June 1st
Entries: Teams of 3-4 (3 teams per university will be selected to participate)
Location: Ericsson ENCQOR Site (1000 Rue Saint-Jacques)
Students must form a team and apply as a group

2019-05-07 Labs start on Thursday 9 May 2019.


Lab Schedule

  1. Tues 2019-05-07: Lab 1: There is no lab 1
  2. Thu 2019-05-09: Lab 2: Jupyter installation; Python introduction (software-carpentry.org)
    Use the anaconda distribution of Jupyter to install Jupyter and Python onto your laptops, as shown in the video Jupyter Installation from codingthesmartway.com
    Learn about basic Python - variables, strings, lists, dictionaries, sets - and use Jupyter, as shown in the video Learn Python for Beginners from codingthesmartway.com
    Work through Plotting and Programming in Python and Programming in Python from software-carpentry.org
  3. Tues 2019-05-14: Lab 3: Python Object-Oriented Programming (Corey Schafer: youtube)
    Some advanced Python to work through: Python: Lambda, Map, Filter, Reduce Functions from Joe James, and Object-Oriented Python from Corey Schafer. He has many follow-up YouTube tutorials too.
  4. Thu 2019-05-16: Lab 4: Python pandas example (datacarpentry.org)
    Work through the tutorial Python for Social Science Data for an introduction to pandas (steps 8-12), matplotlib (step 13), and using SQLite databases (step 14).
  5. Tues 2019-05-21: Lab 5: Python Exploratory Data Analysis example (datacarpentry.org)
    Work through Data Analysis and Visualization in Python for Ecologists. Focus on steps 4-8, ignoring challenges and exercises.
  6. Thu 2019-05-23: Lab 6: Data wrangling example (biggorilla.org)
    This is a long tutorial to work through which will take two lab sessions. See Instructions for Lab 6 and 7
    Create a Jupyter notebook for this tutorial. Keep minimal markdown so you can keep track of the steps done, why you did them, and why you did them the way you did them.
    For this lab you should focus on Part 1: Data acquisition; Part 2: Data extraction; and Part 3: Data profiling and cleaning.
  7. Tues 2019-05-28: Lab 7: continuation of Lab 6
    For this lab you should focus on Part 4: Data matching and merging.
  8. Thu 2019-05-30: Lab 8: Midterm Preparation
  9. Tues 2019-06-04: Lab 9: Data wrangling with OpenRefine (datacarpentry.org)
    Work through the tutorial Data Cleaning with OpenRefine for Ecology from datacarpentry.org.
  10. Thu 2019-06-06: Lab 10: Python machine learning with scikit-learn (scikit-learn.org)
    Work through the introductory tutorial on scikit-learn.
    Look at the examples, and explore some of interest to you.
  11. Tues 2019-06-11; Thu 2019-06-13: Lab 11 -- 12: Your project work
  12. Tues 2019-06-18: Lab 13: Project presentations
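
For a taste of the Lab 3 material, here is a minimal sketch of lambda, map, filter, and reduce using only the standard library. The list of marks is invented for the example.

```python
from functools import reduce

marks = [48, 75, 62, 90, 55]  # made-up values

# map: apply a function to every element (here, rescale marks out of 100 to out of 25).
scaled = list(map(lambda m: m / 4, marks))

# filter: keep only the elements for which the function returns True.
passing = list(filter(lambda m: m >= 50, marks))

# reduce: fold the list into a single value (here, the total).
total = reduce(lambda acc, m: acc + m, marks, 0)

# The same results written with comprehensions and built-ins,
# which many Python programmers find easier to read.
assert scaled == [m / 4 for m in marks]
assert passing == [m for m in marks if m >= 50]
assert total == sum(marks)
```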


Lecture Schedule

Tentative - very much subject to change. This is the second time the course is being taught, and the Summer 1 semester goes at an accelerated pace over six-and-a-half weeks. So be prepared to adapt throughout the semester.

Lecture 1 - 07 May 2019: Introduction: Course, Python, Data Analytics
Course Outline; Big Data - Five V's, History; Data Wrangling; Exploratory Data Analysis; Hypothesis-Driven Experimental Design; Modeling; Story Telling;
Structured data; unstructured data;
Database records; spreadsheets; text documents; email messages; text messages, tweets; images; videos; sensor data streams; social media entries, tweets, and connections (likes, tags, retweets)

Slides:
Course Outline
Lecture 1 - introduction
Lecture 1 - data analytics
Lecture 1 - context
OOP in Python: Classes, Methods and Operator Overloading video Aug 17, 2015.

Lecture 2 - 09 May 2019: Numbers, Data, and Experimental Design
terminology - measurement scales (nominal/categorical, ordinal, interval, ratio), normalization, accuracy, precision, significant digits.
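
Two of these terms are easy to make concrete in code. The sketch below shows one possible way to round to a number of significant digits and to apply min-max normalization, using only the standard library; the helper names `round_sig` and `min_max_normalize` are hypothetical, and min-max scaling is only one of several normalization schemes.

```python
import math


def round_sig(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 5 for 123456 and -3 for 0.004567.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def min_max_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize constant data")
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `round_sig(123456, 2)` keeps only the two leading digits, and `min_max_normalize([10, 20, 30])` maps the smallest value to 0 and the largest to 1.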

Slides:
Lecture 2 - numbers and data
Exploratory Data Analysis, University of Virginia, Prof. Patrick Meyer, Published on Aug 13, 2015.
Significant Digits Slides and Calculating significant digits
Box plots
Scientific method (Wikipedia)
Types of Experimental Designs
Hypothesis Testing in Statistics, Rejection Regions In Hypothesis Testing, Hypothesis Testing For Means & Small Samples, Part 1, Hypothesis Testing For Means & Small Samples, Part 2, Hypothesis Testing For Means & Large Samples, Part 1, Hypothesis Testing For Means & Large Samples, Part 2.
Schema for "tidy" data in R

Lecture 3 - 14 May 2019: Data Warehouses, OLAP, and Business Intelligence

Slides:
Data Warehouses


Introduction to Data Warehouses video (8:52), Andy Wicks, 2013.
Multidimensional Analysis, video (24:14) start at 10:30, Michael Lamont, 2014.
Business Intelligence video (26:03), technologyAdvice, 2014.

Reading:
Introduction to Data warehousing, Chapter 1 of Data Warehouse Design: Modern Principles and Methodologies, 2009, by Matteo Golfarelli and Stefano Rizzi.

Lecture 4 - 16 May 2019: Data Formats and Schemas; Python pandas

Slides:
Lecture 4
Lecture 4 pandas

Cheat sheet on pandas, tidy data, dataframes

Pandas DataFrame Notes

Pandas slides, 2016.

pandas at pydata.org
10 Minutes to Pandas, introduction.
Data Structures of Pandas, Series, DataFrame, Panel.
Some tutorials on pandas

Lecture 5 - 21 May 2019: Descriptive Data Analysis; Data Wrangling; OpenRefine
Data Wrangling: discover, structure, cleanse, enrich, validate, publish;

Slides: Lecture 5 - Descriptive Analytics
Lecture 5 - Data Wrangling

An example of data integration and data enrichment using Python classes: Data mining and integration with Python by Isaac Vidas at PyTexas on 09 October 2015.

OpenRefine (previously Google Refine) open source Java interactive tool for Data Wrangling:
Introduction (video 1 of 3),
Introduction (video 2 of 3),
Introduction (video 3 of 3),

Lecture 6 - 23 May 2019: Data Cleaning; Correlation, Causality, Significance

Slides: Lecture 6 - Data Cleaning
Skiena Lecture 7 - Data Cleaning: errors, missing values, outliers, unification/normalization/entity identification; Z-scores.
Skiena Lecture 7 - Data Cleaning video (1:09:50), March 2017.
Skiena Lecture 5 - Correlation video (1:12:34), March 2017.
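
Z-scores, mentioned above for data cleaning, can be sketched with the standard library alone. The code below flags values whose absolute z-score exceeds a threshold; the threshold of 3 is a common rule of thumb rather than a fixed rule, and the function names and readings are hypothetical.

```python
import statistics


def z_scores(values):
    """Return the z-score (distance from the mean in standard deviations) of each value."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    if stdev == 0:
        raise ValueError("z-scores are undefined for constant data")
    return [(v - mean) / stdev for v in values]


def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sensor readings with one suspicious entry.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 20.4, 95.0]
outliers = flag_outliers(readings)
```

Note that an extreme value inflates the mean and standard deviation it is measured against, so in small samples a flagged point deserves a look at the raw data before it is removed.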

Lecture 7 - 28 May 2019: Review

Slides: Lecture 7

Lecture 8 - 30 May 2019: Midterm
Midterm Exam 60 minutes from 13:30 to 14:30.
Questions cover material in Labs 1-6 and Lectures 1-6.
Concentrate on main concepts, processes, steps, techniques, and tools.
Exam may involve true/false questions; multiple choice questions; fill-in-the-word questions; and short answer questions.

Bring your student ID; writing materials.
Closed book exam.

Lecture 9 - 04 June 2019: Exploratory Data Analysis; Clustering

Slides: Lecture 9 EDA
Skiena Lecture 22 - Clustering video (1:07:54), March 2017.
Introduction to Machine learning video (51:30), MIT.

A Gentle Introduction to Exploratory Data Analysis, Daniel Bourke, January 2019.

A Comprehensive Guide to Data Exploration, Sunil Ray, January 2016.

Automated Feature Engineering in Python, Will Koehrsen, June 2018.

Fundamental Techniques of Feature Engineering for Machine Learning, Emre Rencberoglu, April 2019.

Video Feature Engineering, Ryan Baker, Coursera, Fall 2013.

Principal Component Analysis (PCA) by Victor Lavrenko
See videos PCA 1 to PCA 4.
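
To connect the PCA videos to code, here is a minimal sketch of principal component analysis via the singular value decomposition. It assumes NumPy is installed; the function `pca` and the sample points are hypothetical, and library implementations such as the one in scikit-learn offer more options.

```python
import numpy as np


def pca(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                 # PCA works on mean-centered data
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]                # rows are unit-length principal directions
    scores = centered @ components.T              # coordinates in the new basis
    variances = S ** 2 / (len(X) - 1)             # variance along each direction
    ratio = variances[:n_components] / variances.sum()
    return scores, components, ratio


# Hypothetical 2-D points lying almost on a line.
points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
scores, components, ratio = pca(points, n_components=1)
```

Because the sample points are nearly collinear, the first component captures almost all of the variance, which is the situation in which reducing to one dimension loses little information.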

Lecture 10 - 06 June 2019: Models: Regression, Classification, Prediction, Simulation

Slides: Lecture 10 Machine Learning
Skiena Lecture 17 - Linear Regression
Skiena Lecture 18 - Logistic Regression and Classification
Choosing the right ML algorithm video (1:00:54), first 40 minutes, MS Azure talk at conference, July 2017. See also their cheat sheet.
Skiena Lecture 23 - Machine Learning video (1:16:59)

Lecture 11 - 11 June 2019: Visualization; Dashboards; Story Telling

Class material:

Making data mean more through storytelling, video, Ben Wellington, TEDxBroadway, April 2015.
Lecture 11 Story Telling
Lecture 11 Visualization
Storytelling with Data, video, Cole Nussbaumer Knaflic, Talks at Google, November 2015.

Skiena Lecture 11 - Visualizing Data video (1:15:07), March 2017.

Visualization with matplotlib
Visualization with pandas
Seaborn visualization - examples
Visualization with Seaborn tutorial

Lecture 12 - 13 June 2019: Project Discussion

Lecture 13 - 18 June 2019: Project Presentations

2019-06-21 Final Exam is scheduled for Friday 21 June 2019 from 0900 to 1200 noon in H-521.


Resources


stackoverflow - for questions on technology how-to, eg Python, Latex, Jupyter
kdnuggets - for discussions on topics related to knowledge discovery, so not restricted to data analytics
data carpentry - tutorials on data analytics; see especially the tutorial for social sciences
software carpentry - tutorials on programming for scientists, including Python


Top 12 Essential Command Line Tools for Data Scientists wget, cat, wc, head, tail, find, cut, uniq, awk, grep, sed, history.
The 10 Statistical Techniques Data Scientists Need to Master linear regression, classification (logistic regression, discriminant analysis), resampling methods (bootstrapping, cross validation), subset selection (best-subset, forward stepwise, backward stepwise, hybrid), shrinkage (ridge regression, lasso), dimension reduction (principal components, partial least squares), nonlinear models (step function, piecewise function, spline, generalized additive models), tree-based methods (bagging, boosting, random forests), support vector machines, unsupervised learning (principal component analysis, k-means, hierarchical clustering).
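
As one concrete instance of the resampling methods listed above, the sketch below computes a percentile bootstrap confidence interval for a mean using only the standard library. The function name, its default parameters, and the sample are hypothetical choices for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    n = len(values)
    means = sorted(
        statistics.mean(rng.choices(values, k=n))   # resample with replacement
        for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return lower, upper


sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]   # made-up measurements
low, high = bootstrap_mean_ci(sample)
```

The idea is that the spread of the resampled means approximates the sampling variability of the mean without assuming a particular distribution for the data.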

Python

A quick overview to orientate you, A Complete Tutorial to Learn Data Science with Python from Scratch.
The Top 15 Python Libraries for Data Science in 2017
OOP in Python: Classes, Methods and Operator Overloading video Aug 17, 2015.
The Python Tutorial: Classes
Python operator overloading
Intro to NumPy, Bryan Van de Ven, April 2016.
Intro to SciPy, M. Velasco and A. Perera, Feb 2013. Do not read SymPy part.
Intro to MatPlotLib, datacamp, Feb 2013.
Intro to pandas, Slides 70-170, Virginia Tech, Srijith Rajamohan, 2016.

Read Scientific Python Lectures, 2017, chapters 1-5.
pandas tutorial, dataquest, 2016.
Top 8 resources for learning data analysis with pandas, May 2016.

Cheat Sheets

Probability cheat sheet
Python Basics
Python NumPy
Python pandas
Python MatPlotLib
Python Seaborn
Basic prediction algorithms overview
Python scikit-learn ML
When to use algorithms in Python scikit-learn ML

Professor Steven Skiena's Course CSE519 Data Science

This course covers much more than COMP 499 does. The book website has links to video lectures, examples, exercises, and more.
See in particular:
Skiena Lecture 6 - Data Munging video (1:02:20), March 2017.
Skiena Lecture 7 - Data Cleaning video (1:09:50), March 2017.
Skiena Lecture 5 - Correlation video (1:12:34), March 2017.
Skiena Lecture 22 - Clustering video (1:07:54), March 2017.
Skiena Lecture 11 - Visualizing Data video (1:15:07), March 2017.
Skiena Lecture 23 - Machine Learning video (1:16:59)

Professors Trevor Hastie and Rob Tibshirani Book and Course on Machine Learning

In-depth introduction to machine learning in 15 hours of expert videos, September 2014.

Stanford's Mining of Massive Datasets

The book, course, and videos Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman look at Big Data and the MapReduce paradigm.

In particular, it deals with
link analysis (eg Google pagerank algorithm),
recommendation systems, and
mining social media.

Visualization

Tamara Munzner's book Visualization Analysis and Design, CRC Press, 2014.

Cole Nussbaumer Knaflic, Storytelling with Data, Wiley, 2015.

Material on Specific Application Domains

Supply Chain Management
What is Supply Chain Management? Definition and Introduction, video (12:07), AIMS UK, June 2016.
Using Analytics to make the most of Your Supply Chain Data, video (34:03), Thorogood, January 2017.
Optional: Deep Learning in Supply Optimization, video (1:06:16), instacart, May 2017.

FinTech
The Tech in FinTech, video (31:28), Elise Breda, February 2017.

Geospatial
Geospatial Data with Open Source Tools in Python | SciPy 2015 Tutorial | Kelsey Jordahl, video (3:03:59), SciPy 2015, Kelsey Jordahl. Covers geopandas, PySAL spatial analysis library, PyProj for geospatial projections and transformations, qGIS system (open source similar to ArcGIS).
See also Open Street Map, open source maps, similar to Google Maps.

Time Series Analysis Tutorial with Python from DataCamp, to work through.

Financial analysis
Time Series Analysis in Python: An Introduction using Quandl.


Last modified on 07 May 2019 by gregb@cs.concordia.ca