Instructor: Greg Butler (gregb@cse.concordia.ca)
Lab Demonstrator: Stephanie Kamgnia (kamgnia.phanie@gmail.com)
2019-07-09 All Marks
2019-06-12 Project Presentation Schedule
Thursday 2019-06-13 Lecture
1330-1345 Khalid Baraka
Tuesday 2019-06-18 Lab H-847
1015-1030 Emilio Assuncao
1030-1045 Ema Dijmarescu
1045-1100 Simon Huang
1100-1115 Maurice Ngwakum
1115-1130 Starly Solon
1130-1145 Mordechai Zirkind
Tuesday 2019-06-18 Lecture H-403
1315-1330 Steven Zanga
1330-1345 Muherthan Thalayasingam
1345-1400 Sabrina Rieck
1400-1415 Genevieve Plante-Brisebois
1415-1430 Gabriel Noriega
1430-1445 Beeri Nduwimana
1445-1500 Alexandre Masmoudi
1500-1515 Jasmine Leblond-Chartrand
1515-1530 Claudia Feochari
1530-1545 Patrick Bui
Tuesday 2019-06-18 Lab H-847
1615-1630 Corentin Artaud
1630-1645 Khaled Ali
1645-1700 Alessandro Power
2019-06-07
Preliminary reports are okay for everyone who submitted one,
though some are incomplete as expected.
Remember to include your name and student number in the Jupyter notebook.
It makes things easier for me.
I would have liked to see more use of graphs.
Remember, when you come to EDA, to use scatter plots (with a fitted line and the correlation coefficient) for your bivariate analysis.
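A minimal sketch of such a bivariate plot, assuming NumPy and matplotlib are installed. The function names and the hours/marks data are hypothetical and only illustrate the idea: fit a least-squares line, compute the Pearson correlation coefficient, and show both on the scatter plot.

```python
import numpy as np


def fit_line_and_correlation(x, y):
    """Return (slope, intercept, r): least-squares line and Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial fit
    r = np.corrcoef(x, y)[0, 1]             # off-diagonal entry is r(x, y)
    return slope, intercept, r


def scatter_with_line(x, y, ax=None):
    """Draw the scatter plot, the fitted line, and r in the title."""
    # Imported here so the numeric part above runs without matplotlib.
    import matplotlib.pyplot as plt

    slope, intercept, r = fit_line_and_correlation(x, y)
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(x, y)
    xs = np.linspace(min(x), max(x), 100)
    ax.plot(xs, slope * xs + intercept, color="red")
    ax.set_title(f"r = {r:.2f}")
    return ax


# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]
slope, intercept, r = fit_line_and_correlation(hours, marks)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```

In a notebook, calling scatter_with_line(hours, marks) draws the figure; pandas and Seaborn offer similar plots if you prefer them.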
2019-06-04 Midterm and Assignment Marks
2019-06-03 Project
Submit your preliminary report by 20:00 Thursday 6 June 2019 to EAS as in Course Outline.
Your Jupyter notebook and pdf should include: the questions that you are seeking to answer; the source of the datasets, given as explicit URLs with retrieval steps in the notebook; a description of each dataset according to standard descriptive analysis; and wrangling of the data to clean it, structure it as tidy data, and perform necessary enrichment/merging/integration.
Each of these steps should be clearly commented in the notebook; the code should be clear; and the results of the code should be clear.
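As a rough illustration of the wrangling steps listed above, here is a small sketch assuming pandas is installed. All table contents and column names are hypothetical; in your notebook the first step would instead retrieve your own dataset from its URL (for example with pd.read_csv).

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, which is not tidy.
wide = pd.DataFrame({
    "city": ["Montreal", "Toronto", "Montreal"],
    "2017": [10.0, 12.0, 10.0],
    "2018": [11.0, None, 11.0],
})

# Clean: drop exact duplicate rows.
clean = wide.drop_duplicates()

# Structure: tidy form has one row per (city, year) observation.
tidy = clean.melt(id_vars="city", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)

# Clean: drop observations with a missing value.
tidy = tidy.dropna(subset=["value"]).reset_index(drop=True)

# Enrich: merge in a second (hypothetical) table keyed on city.
provinces = pd.DataFrame({
    "city": ["Montreal", "Toronto"],
    "province": ["QC", "ON"],
})
enriched = tidy.merge(provinces, on="city", how="left")

# Describe: standard descriptive statistics for the numeric column.
summary = enriched["value"].describe()
print(enriched)
print(summary)
```

Whether to drop or impute missing values depends on your questions; dropping is used here only to keep the sketch short.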
After this preliminary work, you should be ready to continue with the rest of the project which considers exploratory data analysis, model building, validation, and story-telling.
The presentations will take place in labs and lectures for Lab 13 and Lecture 13 on Tuesday 18 June 2019. Each student's presentation will be 10 minutes, followed by a 5-minute question period.
The final report is due 20:00 on Tuesday 18 June 2019 to EAS as in the Course Outline.
The report should include your full Jupyter notebook as a pdf file, and your presentation as a pdf file, zipped. Make sure your notebook is well commented, and clear.
The project is worth 25% of the course mark. There will be 5% for the preliminary report, 10% for the presentation, and 10% for the final report.
2019-05-28 Lecture room is changed to H-403.
2019-05-27
Lab 7 has issues for those working on macOS (Mojave); the cause seems to be CPython.
The material works fine on other platforms, including the lab computers and macOS (Sierra).
So complete Lab 7 on the lab computers, if you must.
Submit your Jupyter notebook to EAS by 20:00 Tuesday 28 May 2019.
2019-05-21 Final Exam is scheduled for Friday 21 June 2019 from 0900 to 1200 noon in H-521.
2019-05-21 Instructions for Lab 6 and 7
2019-05-19 Assignment 1 must be shown to the TA at the end of Lab 5 and submitted to EAS by 20:00 Tuesday 21 May 2019. (See Course Outline.)
2019-05-14 Some data set suggestions for your project
2019-05-08
Ericsson Hackathon Saturday, June 1st, 2019.
FYI - For Your Information
Ericsson will be hosting a data science hackathon in which students will apply AI/ML for Smart homes on wireless networks.
Please find below the link to the invitation/Google docs application form:
https://docs.google.com/forms/d/e/1FAIpQLSdcRdmtladxCn_Vr29RYQL4sPNKaowixR_DjZA8ZfhX2APodg/viewform?usp=sf_link
In brief:
Event: Ericsson Data Science Hackathon
Date: Saturday, June 1st
Entries: Teams of 3-4 (3 teams per university will be selected to participate)
Location: Ericsson ENCQOR Site (1000 Rue Saint-Jacques)
Students must form a team and apply as a group
2019-05-07 Labs start on Thursday 9 May 2019.
Tentative - very much subject to change. This is the second time the course is being taught, and the Summer 1 semester goes at an accelerated pace over six-and-a-half weeks. So be prepared to adapt throughout the semester.
Lecture 1 - 07 May 2019: Introduction: Course, Python, Data Analytics
Course Outline; Big Data - Five V's, History;
Data Wrangling; Exploratory Data Analysis; Hypothesis-Driven Experimental Design; Modeling; Story Telling;
Structured data; unstructured data;
Database records; spreadsheets; text documents; email messages; text messages, tweets; images; videos; sensor data streams; social media entries, tweets, and connections (likes, tags, retweets)
Slides:
Course Outline
Lecture 1 - introduction
Lecture 1 - data analytics
Lecture 1 - context
OOP in Python: Classes, Methods and Operator Overloading video Aug 17, 2015.
Lecture 2 - 09 May 2019: Numbers, Data, and Experimental Design
terminology - measurement scales (nominal/categorical, ordinal, interval, ratio), normalization,
accuracy, precision, significant digits.
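Two of these terms can be made concrete with a short sketch using only the Python standard library. The helper names round_sig and min_max_normalize are hypothetical, and min-max rescaling is just one common form of normalization, not necessarily the one used in the slides.

```python
import math


def round_sig(x, digits):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # Position of the leading digit, e.g. 5 for 123456 and -3 for 0.0012.
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)


def min_max_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize constant data")
    return [(v - lo) / (hi - lo) for v in values]


print(round_sig(123456, 2))
print(round_sig(0.0012345, 3))
print(min_max_normalize([10, 20, 30]))
```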
Slides:
Lecture 2 - numbers and data
Exploratory Data Analysis, University of Virginia,
Prof. Patrick Meyer, Published on Aug 13, 2015.
Significant Digits Slides
and
Calculating significant digits
Box plots
Scientific method (Wikipedia)
Types of Experimental Designs
Hypothesis Testing in Statistics,
Rejection Regions In Hypothesis Testing,
Hypothesis Testing For Means & Small Samples, Part 1,
Hypothesis Testing For Means & Small Samples, Part 2,
Hypothesis Testing For Means & Large Samples, Part 1,
Hypothesis Testing For Means & Large Samples, Part 2.
Schema for "tidy" data in R
Lecture 3 - 14 May 2019: Data Warehouses, OLAP, and Business Intelligence
Slides:
Data Warehouses
Introduction to Data Warehouses video (8:52), Andy Wicks, 2013.
Multidimensional Analysis, video (24:14) start at 10:30, Michael Lamont, 2014.
Business Intelligence video (26:03), technologyAdvice, 2014.
Reading:
Introduction to Data warehousing,
Chapter 1 of Data Warehouse Design: Modern Principles and Methodologies, 2009,
by Matteo Golfarelli and Stefano Rizzi.
Lecture 4 - 16 May 2019: Data Formats and Schemas; Python pandas
Slides:
Lecture 4
Lecture 4 pandas
Cheat sheet on pandas, tidy data, dataframes
Pandas slides, 2016.
pandas at pydata.org
10 Minutes to Pandas, introduction.
Data Structures of Pandas, Series, DataFrame, Panel.
Some tutorials on pandas
Lecture 5 - 21 May 2019: Descriptive Data Analysis; Data Wrangling; OpenRefine
Data Wrangling: discover, structure, cleanse, enrich, validate, publish;
Slides:
Lecture 5 - Descriptive Analytics
Lecture 5 - Data Wrangling
An example of data integration and data enrichment using Python classes: Data mining and integration with Python by Isaac Vidas at PyTexas on 09 October 2015.
OpenRefine (previously Google Refine) open source Java interactive tool for Data Wrangling:
Introduction (video 1 of 3),
Introduction (video 2 of 3),
Introduction (video 3 of 3),
Lecture 6 - 23 May 2019: Data Cleaning; Correlation, Causality, Significance
Slides:
Lecture 6 - Data Cleaning
Skiena Lecture 7 - Data Cleaning: errors, missing values, outliers, unification/normalization/entity identification; Z-scores.
Skiena Lecture 7 - Data Cleaning video (1:09:50), March 2017.
Skiena Lecture 5 - Correlation video (1:12:34), March 2017.
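As a small illustration of Z-scores for spotting outliers during data cleaning, here is a sketch using only the standard library. The function names, the readings, and the threshold of 2 are assumptions made for this example; a cutoff of 3 is also common, but note that in a sample of 10 values no Z-score computed with the sample standard deviation can exceed about 2.85.

```python
import statistics


def z_scores(values):
    """Return (v - mean) / sd for each value, using the sample sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("all values are equal; Z-scores are undefined")
    return [(v - mean) / sd for v in values]


def flag_outliers(values, threshold=2.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sensor readings with one suspicious entry.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(flag_outliers(readings))
```

A flagged value is only a candidate for inspection; whether it is an error or a genuine extreme is a judgment about the data.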
Lecture 7 - 28 May 2019: Review
Slides: Lecture 7
Lecture 8 - 30 May 2019: Midterm
Midterm Exam 60 minutes from 13:30 to 14:30.
Questions cover material in Labs 1-6 and Lectures 1-6.
Concentrate on main concepts, processes, steps, techniques, and tools.
The exam may involve true/false questions, multiple-choice questions, fill-in-the-word questions,
and short-answer questions.
Bring your student ID; writing materials.
Closed book exam.
Lecture 9 - 04 June 2019: Exploratory Data Analysis; Clustering
Slides:
Lecture 9 EDA
Skiena Lecture 22 - Clustering video (1:07:54), March 2017.
Introduction to Machine learning video (51:30), MIT.
A Gentle Introduction to Exploratory Data Analysis, Daniel Bourke, January 2019.
A Comprehensive Guide to Data Exploration, Sunil Ray, January 2016.
Automated Feature Engineering in Python, Will Koehrsen, June 2018.
Fundamental Techniques of Feature Engineering for Machine Learning, Emre Rencberoglu, April 2019.
Video Feature Engineering, Ryan Baker, Coursera, Fall 2013.
Principal Component Analysis (PCA) by Victor Lavrenko
See videos PCA 1 to PCA 4.
Lecture 10 - 06 June 2019: Models: Regression, Classification, Prediction, Simulation
Slides:
Lecture 10 Machine Learning
Skiena Lecture 17 - Linear Regression
Skiena Lecture 18 - Logistic Regression and Classification
Choosing the right ML algorithm video (1:00:54), first 40 minutes,
MS Azure talk at conference, July 2017.
See also their cheat sheet.
Skiena Lecture 23 - Machine Learning video (1:16:59)
Lecture 11 - 11 June 2019: Visualization; Dashboards; Story Telling
Class material:
Making data mean more through storytelling, video, Ben Wellington, TEDxBroadway, April 2015.
Lecture 11 Story Telling
Lecture 11 Visualization
Storytelling with Data, video, Cole Nussbaumer Knaflic, Talks at Google, November 2015.
Skiena Lecture 11 - Visualizing Data video (1:15:07), March 2017.
Visualization with matplotlib
Visualization with pandas
Seaborn visualization - examples
Visualization with Seaborn tutorial
Lecture 12 - 13 June 2019: Project Discussion
Lecture 13 - 18 June 2019: Project Presentations
2019-06-21 Final Exam is scheduled for Friday 21 June 2019 from 0900 to 1200 noon in H-521.
stackoverflow - for how-to questions on technology, e.g. Python, LaTeX, Jupyter
kdnuggets - for discussions on topics related to knowledge discovery, so not restricted to data analytics
data carpentry - tutorials on data analytics; see especially the tutorial for social sciences
software carpentry - tutorials on programming for scientists, including Python
Top 12 Essential Command Line Tools for Data Scientists
wget, cat, wc, head, tail, find, cut, uniq, awk, grep, sed, history.
The 10 Statistical Techniques Data Scientists Need to Master
linear regression,
classification (logistic regression, discriminant analysis),
resampling methods (bootstrapping, cross validation),
subset selection (best-subset, forward stepwise, backward stepwise, hybrid),
shrinkage (ridge regression, lasso),
dimension reduction (principal components, partial least squares),
nonlinear models (step function, piecewise function, spline, generalized additive models),
tree-based methods (bagging, boosting, random forests),
support vector machines,
unsupervised learning (principal component analysis, k-means, hierarchical clustering).
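To give a flavour of one item in this list, here is a minimal sketch of bootstrapping, a resampling method, using only the standard library. It estimates a percentile confidence interval for a mean; the function name, the sample data, and the default parameters are assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of sample."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(sample)
    # Resample with replacement, same size as the original sample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n))
        for _ in range(n_resamples)
    )
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_index], means[hi_index]


# Hypothetical sample, for illustration only.
sample = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7, 10.4, 12.6]
low, high = bootstrap_mean_ci(sample)
print(f"mean={statistics.fmean(sample):.2f}  95% CI=({low:.2f}, {high:.2f})")
```

The percentile interval is the simplest bootstrap interval; other variants exist and may behave better for skewed data.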
A quick overview to orientate you, A Complete Tutorial to Learn Data Science with Python from Scratch.
The Top 15 Python Libraries for Data Science in 2017
OOP in Python: Classes, Methods and Operator Overloading video Aug 17, 2015.
The Python Tutorial: Classes
Python operator overloading
Intro to NumPy, Bryan Van de Ven, April 2016.
Intro to SciPy, M. Velasco and A. Perera, Feb 2013.
Do not read SymPy part.
Intro to MatPlotLib, datacamp, Feb 2013.
Intro to pandas, Slides 70-170, Virginia Tech, Srijith Rajamohan, 2016.
Read Scientific Python Lectures, 2017, chapters 1-5.
pandas tutorial, dataquest, 2016.
Top 8 resources for learning data analysis with pandas, May 2016.
Probability cheat sheet
Python Basics
Python NumPy
Python pandas
Python MatPlotLib
Python Seaborn
Basic prediction algorithms overview
Python scikit-learn ML
When to use algorithms in Python scikit-learn ML
This course covers much more than COMP 499 does.
The book website has links to video lectures, examples, exercises, and more.
See in particular:
Skiena Lecture 6 - Data Munging video (1:02:20), March 2017.
Skiena Lecture 7 - Data Cleaning video (1:09:50), March 2017.
Skiena Lecture 5 - Correlation video (1:12:34), March 2017.
Skiena Lecture 22 - Clustering video (1:07:54), March 2017.
Skiena Lecture 11 - Visualizing Data video (1:15:07), March 2017.
Skiena Lecture 23 - Machine Learning video (1:16:59)
In-depth introduction to machine learning in 15 hours of expert videos, September 2014.
The book, course, and videos Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman look at Big Data and the MapReduce paradigm.
In particular, it deals with
link analysis (e.g. Google's PageRank algorithm),
recommendation systems, and
mining social media.
Tamara Munzner's book Visualization Analysis and Design, CRC Press, 2014.
Cole Nussbaumer Knaflic, Storytelling with Data, Wiley, 2015.
Supply Chain Management
What is Supply Chain Management? Definition and Introduction, video (12:07), AIMS UK, June 2016.
Using Analytics to make the most of Your Supply Chain Data, video (34:03), Thorogood, January 2017.
Optional:
Deep Learning in Supply Optimization, video (1:06:16), instacart, May 2017.
FinTech
The Tech in FinTech, video (31:28), Elise Breda, February 2017.
Geospatial
Geospatial Data with Open Source Tools in Python | SciPy 2015 Tutorial | Kelsey Jordahl,
video (3:03:59), SciPy 2015, Kelsey Jordahl.
Covers geopandas, the PySAL spatial analysis library, PyProj for geospatial projections and transformations, and the QGIS system (open source, similar to ArcGIS).
See also Open Street Map, open source maps, similar to Google Maps.
Time Series Analysis Tutorial with Python from DataCamp, to work through.
Financial analysis
Time Series Analysis in Python: An Introduction using Quandl.