COMP 499 Introduction to Data Analytics

Summer 2018 Semester 1: May 2 to June 26, 2018

Lectures: Tuesdays and Thursdays 13:30 to 16:00 in H-403

Labs: Tuesdays and Thursdays 11:20 to 13:20 in H-907

Section AA


Course Outline

Some Datasets


Announcements

2018-07-04 Marks

2018-06-21 Final exam is scheduled for 1400-1700 in H-501 on Thursday 21 June 2018.

2018-06-19 Exam information
Six questions: T/F, multiple choice, short answer.
You need to know how to write Python 3 code for basic operations of data wrangling, and exploratory data analysis.
Cheat sheets provided.
Only a few questions on R. No need to write R code.
You need to know basics (value scales, statistics, correlation, outliers); steps of data wrangling, and how to perform them; steps of exploratory data analysis, and how to perform them; common ways of building models, and when and how to use the methods.

2018-06-05 Some material on storytelling
Storytelling in Business: Data Storytelling video (6:26), SAS, August 2017.
The Art of Story Telling in Data Science and how to create data stories?, blog, October 2017.

2018-06-02 Project
The aim is to have you think, work with a realistic dataset on interesting questions, and develop a story about the answers to those questions.

What techniques you use depends on you, the data, and the questions. Do what best tells a story. There are no strict requirements to tick off a checklist.

Presentation about 15 minutes

Marking Scheme
Questions /5 (Original?, Interesting?, Challenging?)
Dataset /5 (Volume, Variety, Original)
Wrangling /5 (Acquisition, Cleaning, Enrichment, Entity Resolution, Integration)
Analysis /5 (How well does your analysis/model support your story?)
Presentation /10 (Story /5, Visual etc. /5)
You earn points for finding, cleaning, etc. your own dataset, and for original ideas beyond what might already have been done on Kaggle, etc.

Submission:
Presentation in pdf form - project 2
Work as Jupyter notebook - project 1
Deadline: midnight 15 June 2018

2018-05-29: Marks for Midterm
Dataset link added at top of this page.

2018-05-14: Note change of time for midterm exam on May 24, 2018.
Submit your Jupyter notebook for assignment 1 by midnight Tuesday 15 May 2018.

2018-05-13: If you need to brush up on probability and statistics: Probability cheat sheet

2018-05-12: Complete Assignment 1 from Lab 2 before the end of Lab 3.

2018-05-07: Resources
stackoverflow - for questions on technology how-to, e.g. Python, LaTeX, Jupyter, R
kdnuggets - for discussions on topics related to knowledge discovery, so not restricted to data analytics
data carpentry - tutorials on data analytics; see especially the tutorial for social sciences
software carpentry - tutorials on programming for scientists, including Python

Professor Steven Skiena's CSE519 (Data Science) course at Stony Brook covers much more than just data analytics. The web page for his book has links to videos, lecture slides, and problem sets.

2018-05-02: First class is Thursday 03 May 2018. First lab is Tuesday 08 May 2018.

Lecture Schedule

Tentative - very much subject to change. This is the first time the course is being taught, and the Summer 1 semester goes at an accelerated pace over six-and-a-half weeks. So be prepared to adapt throughout the semester.

Lecture 1 - 03 May 2018 1330-1600 in H-403: Course Outline; Big Data - Five V's, History; Data Wrangling; Exploratory Data Analysis; Hypothesis-Driven Experimental Design; Modeling; Story Telling;

Slides: Lecture 1

What you should know:
(a) handy unix tools Top 12 Essential Command Line Tools for Data Scientists wget, cat, wc, head, tail, find, cut, uniq, awk, grep, sed, history.
(b) statistical techniques The 10 Statistical Techniques Data Scientists Need to Master linear regression, classification (logistic regression, discriminant analysis), resampling methods (bootstrapping, cross validation), subset selection (best-subset, forward stepwise, backward stepwise, hybrid), shrinkage (ridge regression, lasso), dimension reduction (principal components, partial least squares), nonlinear models (step function, piecewise function, spline, generalized additive models), tree-based methods (bagging, boosting, random forests), support vector machines, unsupervised learning (principal component analysis, k-means, hierarchical clustering).
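
Of the resampling methods listed in (b), k-fold cross validation is easy to sketch by hand. The following is a minimal illustration only: the function name k_fold_indices is hypothetical, and the folds are taken in order without shuffling, which a real analysis would usually not do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Hypothetical helper for illustration; folds are contiguous and unshuffled.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, so a model can be scored on every sample without ever being scored on data it was trained on.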

Other reading
A quick overview to orientate you, A Complete Tutorial to Learn Data Science with Python from Scratch.
Useful Python Libraries The Top 15 Python Libraries for Data Science in 2017
A discussion R vs Python for Data Science: The Winner is ...

Lab 1 - 08 May 2018 1120-1320 in H-907: Installation of Tools.
Use the anaconda distribution of Jupyter to install Jupyter, Python, and R onto your laptops, as shown in the video Jupyter Installation from codingthesmartway.com
Learn about basic Python - variables, strings, lists, dictionaries, sets - and use Jupyter, as shown in the video Learn Python for Beginners from codingthesmartway.com
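
As a quick orientation before watching the video, here is a small sketch of the basic types mentioned above. The variable names are made up for illustration; the room numbers are just the ones used for this course.

```python
# Variables and strings
course = "COMP 499"
title = course + " Introduction to Data Analytics"

# Lists keep their order and allow duplicates
rooms = ["H-403", "H-907", "H-403"]

# Sets drop duplicates
unique_rooms = set(rooms)

# Dictionaries map keys to values
room_use = {"H-403": "lectures", "H-907": "labs"}

# A list comprehension builds a new list from an existing collection
lab_rooms = [room for room, use in room_use.items() if use == "labs"]

print(title.upper())
print(len(rooms), len(unique_rooms))
print(lab_rooms)
```

Try typing each statement into its own Jupyter cell and inspecting the values as you go.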

Lecture 2 - 08 May 2018 1330-1600 in H-403: Data Wrangling.
Slides:
Data Wrangling Overview: discover, structure, cleanse, enrich, validate, publish; plus terminology - measurement scales (nominal/categorical, ordinal, interval, ratio), normalization, accuracy, precision, significant digits.
Data Cleaning: errors, missing values, outliers, unification/normalization/entity identification; Z-scores.
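
To make the Z-score idea concrete, here is a minimal sketch using only the standard library. The data values, the function names, and the threshold of 3 are assumptions for illustration, not taken from the slides.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values identical: no value deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Made-up heights in cm with one suspicious entry (probably a data-entry error)
heights_cm = [170, 172, 168, 171, 169, 173, 170, 171, 169, 172, 170, 999]
print(flag_outliers(heights_cm))
```

Note that with very small samples no value can reach a large Z-score, because the outlier itself inflates the standard deviation, so a fixed threshold such as 3 should be treated as a rule of thumb rather than a test.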

OpenRefine (previously Google Refine), an open source interactive Java tool for data wrangling: Introduction (video 1 of 3), Introduction (video 2 of 3), Introduction (video 3 of 3).
Steven Skiena, Stony Brook University, Lecture 6 Data Munging and Lecture 7 Data Cleaning, March 2017.
If you want to learn more on OpenRefine then work through the tutorial OpenRefine for Ecology from Data Carpentry.

Lab 2 - 10 May 2018 1120-1320 in H-907: Data Wrangling.
This is a long tutorial to work through which will take two lab sessions.
Create a Jupyter notebook for this tutorial. Keep minimal markdown so you can keep track of the steps done, why you did them, and why you did them the way you did them.
For this lab you should focus on Part 1: Data acquisition; Part 2: Data extraction; and Part 3: Data profiling and cleaning.

Work through the example A Hands on Tutorial for public movie data. You will see examples of cleaning, normalizing, sampling, entity matching (entity recognition), and enrichment of a dataset through data integration.
*** IMDB data is no longer ftp'able. You can download however: see IMDB data.
*** The example is in Python 2 so modify it to Python 3.
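
The sketch below shows some typical changes when moving code from Python 2 to Python 3. It is not a list of what this particular tutorial needs; the function and data are made up for illustration, and the Python 2 forms appear only in comments.

```python
# Python 2: from urllib2 import urlopen
# Python 3: from urllib.request import urlopen


def summarize(ratings):
    """Return the average rating per title from a {title: [ratings]} dict."""
    averages = {}
    # Python 2: ratings.iteritems()   Python 3: ratings.items()
    for title, values in ratings.items():
        # Python 2: 7 / 2 == 3 for integers.
        # Python 3: "/" is true division (3.5); use "//" for floor division.
        averages[title] = sum(values) / len(values)
    return averages


ratings = {"Movie A": [7, 8], "Movie B": [5]}  # made-up data
averages = summarize(ratings)

# Python 2: print "Movie A", averages["Movie A"]
print("Movie A", averages["Movie A"])

# Python 2 strings were bytes; Python 3 separates bytes from str,
# so data read from files or the network often needs an explicit decode.
raw = b"caf\xc3\xa9"
text = raw.decode("utf-8")
print(text)
```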

If you are not comfortable with that level of Python, then work through steps 1-7 of the tutorial Python for Social Science Data from Data Carpentry.

Lecture 3 - 10 May 2018 1330-1600 in H-403: Data Integration. Enriching your data.
An example using Python classes: Data mining and integration with Python by Isaac Vidas at PyTexas on 09 October 2015.
OOP in Python: Classes, Methods and Operator Overloading video Aug 17, 2015.
Read The Python Tutorial: Classes; and read Python operator overloading
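
As a companion to the readings on classes and operator overloading, here is a minimal sketch. The Vector class is a made-up example, not taken from the talk or the tutorial.

```python
class Vector:
    """A hypothetical 2D vector used only to illustrate operator overloading."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called for: self + other
        if not isinstance(other, Vector):
            return NotImplemented
        return Vector(self.x + other.x, self.y + other.y)

    def __mul__(self, scalar):
        # Called for: self * scalar
        return Vector(self.x * scalar, self.y * scalar)

    def __eq__(self, other):
        # Called for: self == other
        if not isinstance(other, Vector):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        # Called by print() and by the Jupyter output cell
        return f"Vector({self.x}, {self.y})"


total = Vector(1, 2) + Vector(3, 4)
scaled = total * 2
print(total, scaled)
```

Returning NotImplemented, rather than raising an error, lets Python try the other operand's method before giving up.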

Lab 3 - 15 May 2018 1120-1320 in H-907: More Python.
Continue with the tutorial from Lab 2.
Focus on Part 4: Data matching and merging for Step 1 and Step 2.
For Assignment 1, submit a Jupyter notebook of your work, preferably covering Part 1, 2, 3, and Part 4 steps 1 and 2.
Return to the rest of the tutorial after you have done machine learning.
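
To give a feel for data matching and merging, here is a minimal sketch that uses only the standard library. It is not the tutorial's method: the records, the function names, and the 0.9 threshold are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_and_merge(left, right, threshold=0.9):
    """Attach the rating from the best-matching right record to each left record."""
    merged = []
    for record in left:
        # Blocking: only compare records from the same year.
        candidates = [r for r in right if r["year"] == record["year"]]
        if not candidates:
            continue
        best = max(candidates, key=lambda r: similarity(record["title"], r["title"]))
        if similarity(record["title"], best["title"]) >= threshold:
            merged.append({**record, "rating": best["rating"]})
    return merged


# Made-up records from two hypothetical sources
source_a = [
    {"title": "The Lost City!", "year": 2001, "genre": "Adventure"},
    {"title": "Quiet Rivers", "year": 1995, "genre": "Drama"},
    {"title": "Silent Harbor", "year": 2010, "genre": "Thriller"},
]
source_b = [
    {"title": "the lost city", "year": 2001, "rating": 7.1},
    {"title": "Quiet River", "year": 1995, "rating": 6.4},
    {"title": "Loud Mountains", "year": 2010, "rating": 5.0},
]

merged = match_and_merge(source_a, source_b)
for row in merged:
    print(row)
```

The third record is left unmatched because its only same-year candidate has a very different title. Choosing the threshold is a trade-off between missed matches and false matches.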

If time permits, work through the tutorial Python for Social Science Data for an introduction to pandas (steps 8-12), matplotlib (step 13), and using SQLite databases (step 14).
If time permits, learn some R in steps 2-4 of the tutorial R for social scientists.

Lecture 4 - 15 May 2018 1330-1600 in H-403: Exploratory data analytics with Python
Exploratory Data Analysis, University of Virginia, Prof. Patrick Meyer, Published on Aug 13, 2015.
Intro to NumPy, Bryan Van de Ven, April 2016.
Intro to SciPy, M. Velasco and A. Perera, Feb 2013. Do not read SymPy part.
Intro to MatPlotLib, datacamp, Feb 2013.
Intro to pandas, Slides 70-170, Virginia Tech, Srijith Rajamohan, 2016.

Read Scientific Python Lectures, 2017, chapters 1-5.
pandas tutorial, dataquest, 2016.
Top 8 resources for learning data analysis with pandas, May 2016.

Cheat sheets for Python
Python Basics
Python NumPy
Python pandas
Python MatPlotLib
Python Seaborn

Lab 4 - 17 May 2018 1120-1320 in H-907: Exploratory data analytics with Python
The aim of this lab is to see a dataset from ecology; that is, a dataset not from social sciences; and to see pandas and ggplot in action.
Focus on steps 4-8, ignoring challenges and exercises.
Create a Jupyter notebook for this tutorial. Keep minimal markdown so you can keep track of the steps done, why you did them, and why you did them the way you did them.

Work through Data analysis and visualization with Python using pandas.

Lecture 5 - 17 May 2018 1330-1600 in H-403: Exploratory data analytics with R
Why use R? video, Jonathon Ng, Dec 2017.
Data mining with R video, edureka, Nov 2017.
Data wrangling with R and the tidyverse, 4 videos, RStudio, March 2018.
R Programming for Beginners video, edureka, May 2017. Long.
Map functions in purrr (R tidyverse) video, Ben Stenhaug, August 2017.

Cheat sheets for R
R language
R tidyverse
R ggplot2
R data wrangling with tidyr and dplyr

Lab 5 - 22 May 2018 1120-1320 in H-907: Exploratory data analytics with R
The aim of this lab is to see a dataset from ecology; that is, a dataset not from social sciences; and to see R, tidyverse, and ggplot2 in action.
Focus on steps 4-5, ignoring challenges and exercises.
Create a Jupyter notebook for this tutorial. Keep minimal markdown so you can keep track of the steps done, why you did them, and why you did them the way you did them.

Work through Data analysis and visualization with R, preferably using Jupyter rather than RStudio.

For assignment 5, submit a zip file containing your Jupyter notebook for Lab 4 and your Jupyter notebook for Lab 5.

Lecture 6 - 22 May 2018 1330-1600 in H-403: Correlation, Clustering, Visualization.
Skiena Lecture 5 - Correlation video (1:12:34), March 2017.
Skiena Lecture 22 - Clustering video (1:07:54), March 2017.
Skiena Lecture 11 - Visualizing Data video (1:15:07), March 2017.
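
For reference while watching the correlation lecture, here is a minimal sketch of the Pearson correlation coefficient written from its definition. The function name and the data are made up for illustration; in practice you would normally use a library routine.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a factor of n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


hours = [1, 2, 3, 4, 5]        # made-up data
squares = [1, 4, 9, 16, 25]    # a nonlinear but increasing relationship
print(round(pearson(hours, squares), 3))
```

The coefficient measures linear association only, so a strongly related but nonlinear pair can still score below 1.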

Reading:
Seaborn visualization in Python tutorials.
Tamara Munzner's book Visualization Analysis and Design, CRC Press, 2014.

Lab 6 - 24 May 2018 1120-1320 in H-907: Midterm Preparation

Lecture 7 - 24 May 2018 1330-1600 in H-403: Midterm Preparation and Midterm Examination.

Lab 7 - 29 May 2018 1120-1320 in H-907: Catch Up
Finish off previous labs.

Lecture 8 - 29 May 2018 1330-1600 in H-403: Business Intelligence and Recommender Systems.
Introduction to Data warehouses video (8:52), Andy Wicks, 2013.
Introduction to OLAP video (18:07), Stanford Engineering, 2012.
Multidimensional Analysis, video (24:14) start at 10:30, Michael Lamont, 2014.
Business Intelligence video (26:03), technologyAdvice, 2014.
Frequent Itemsets - Mining of Massive Datasets, video (29:50), Stanford University, April 2016. Includes association rule.

Reading:
Introduction to Data warehousing, Chapter 1 of Data Warehouse Design: Modern Principles and Methodologies, 2009, by Matteo Golfarelli and Stefano Rizzi.

Lab 8 - 31 May 2018 1120-1320 in H-907: Example - Recommender Systems.
Finish off previous labs.
Work through example of Market basket analysis: Code | Market Basket Analysis | Association Rules | R Programming, video (27:07), Data Science Tutorials, March 2017.
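
The video uses R. For comparison, here is a minimal Python sketch of the two basic measures behind association rules, support and confidence. The baskets and function names are made up for illustration, and only item pairs are counted, which is far simpler than a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical order per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing antecedent, the fraction also containing consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


# Made-up transactions
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

support = pair_support(baskets)
print(support[("bread", "milk")])
print(confidence(baskets, "bread", "milk"))
```

Here bread and milk appear together in two of the four baskets, and two of the three baskets with bread also contain milk.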

Lecture 9 - 31 May 2018 1330-1600 in H-403: Overview of Machine Learning --- Postponed
Introduction to Machine learning video (51:30), MIT.
Choosing the right ML algorithm video (1:00:54), first 40 minutes, MS Azure talk at conference, July 2017.

Other:
Skiena Lecture 23: Machine Learning video (1:16:59)
Skiena Lecture 24: Topics in Machine Learning video (1:18:12)

Reading:
statistical techniques The 10 Statistical Techniques Data Scientists Need to Master linear regression, classification (logistic regression, discriminant analysis), resampling methods (bootstrapping, cross validation), subset selection (best-subset, forward stepwise, backward stepwise, hybrid), shrinkage (ridge regression, lasso), dimension reduction (principal components, partial least squares), nonlinear models (step function, piecewise function, spline, generalized additive models), tree-based methods (bagging, boosting, random forests), support vector machines, unsupervised learning (principal component analysis, k-means, hierarchical clustering).
In-depth introduction to machine learning in 15 hours of expert videos, September 2014.

Cheat sheets for data mining and machine learning
Basic prediction algorithms overview
Python scikit-learn ML
When to use algorithms in Python scikit-learn ML
R for data mining

Lab 9 - 05 June 2018 1120-1320 in H-907: Introduction to Machine Learning
Work through the introductory tutorial on scikit-learn.
Look at the examples, and explore some of interest to you.

Lecture 10 - 05 June 2018 1330-1600 in H-403: Supply Chain Management, Geospatial Data, Social Networks.

Supply Chain Management
What is Supply Chain Management? Definition and Introduction, video (12:07), AIMS UK, June 2016.
Using Analytics to make the most of Your Supply Chain Data, video (34:03), Thorogood, January 2017.
Optional: Deep Learning in Supply Optimization, video (1:06:16), instacart, May 2017.

FinTech
The Tech in FinTech, video (31:28), Elise Breda, February 2017.

Geospatial
Geospatial Data with Open Source Tools in Python | SciPy 2015 Tutorial | Kelsey Jordahl, video (3:03:59), SciPy 2015, Kelsey Jordahl. Covers geopandas, PySAL spatial analysis library, PyProj for geospatial projections and transformations, qGIS system (open source similar to ArcGIS).
See also Open Street Map, open source maps, similar to Google Maps.

Time Series Analysis Tutorial with Python from DataCamp, to work through.

Financial analysis
Time Series Analysis in Python: An Introduction using Quandl.

Lab 10 - 07 June 2018 1120-1320 in H-907: Supply Chain Management, Geospatial Data, Social Networks.
Work on your project.

Lecture 11 - 07 June 2018 1330-1600 in H-403: Supply Chain Management, Geospatial Data, Social Networks.

Lab 11 - 12 June 2018 1120-1320 in H-907: Preparation of presentations.
Work on your project. Work on your presentation.

Lecture 12 - 12 June 2018 1330-1600 in H-403: Preparation of Presentations.
Either come and ask questions about your project, or work on your project and presentation.

Lab 12 - 14 June 2018 1120-1320 in H-907: Presentations.

Presentations: 5 minutes to set up; 15 minutes to present.
1120: Wayne Leung, Air flight delays
1140: Kisife Giles, nutrition and health
1200: Samuel Campbell, Dota2 outcome prediction
1220: Mihal Damaschin, dating
1240: Jason DeLaat, gun violence neighbourhoods
1300: Nomaan Ahmed, automobiles

Lecture 13 - 14 June 2018 1330-1600 in H-403: Project Presentations.

Presentations: 5 minutes to set up; 15 minutes to present.
1330: Melanie Taing, suicide
1350: Vanessa Kurt, NHL players
1410: Laura Gonzalez, hometown players
1430: Hanqing Zhao, newborn mortality
1450: Sarbeng Frimpong, FIFA matches
1510: Lenz Petion, cryptocurrencies
1530: Justin Dumas-Carr, Montreal transport

Exam Preparation Session - Tuesday 19 June 2018 1330-1600 in H-403
We will review the midterm examination. You can ask questions on course material.


Last modified on 02 May 2018 by gregb@cs.concordia.ca