Greg Butler: Introduction to Data Analytics

COMP 333 Introduction to Data Analytics

Winter 2023: Labs

Labs

It is very important that you do all the labs. They are compulsory.

You must submit to EAS (the electronic assignment submission system).
You must submit the exercise or assignment in the format requested. No other format will be accepted (except zip, but only use zip if your internet connection makes it necessary to compress the file).
You must submit to the requested location in EAS, such as "theory_assignment 10", and you must submit to COMP333 (and not some other course).

You must attend the lab section (GA or GB) for which you are registered! You cannot attend any other lab section.

Lab Exercise Schedule

Lab exercises are for you to learn and practice data analytics, including the Python ecosystem.

The lab sessions are mandatory. You submit the exercises as proof of attendance. The exercises are given full marks for an honest attempt at the lab session submitted in the correct format. They are not marked for correctness.

You should validate your own progress. If you are not sure whether you have done something correctly, then show your work to the demonstrator, during the lab session.

Week 1: No lab.
Week 2: Jupyter installation; Python introduction (software-carpentry.org)
Use the anaconda distribution of Jupyter to install Jupyter and Python onto your laptops, as shown in the video Jupyter Installation from codingthesmartway.com
Learn about basic Python - variables, strings, lists, dictionaries, sets - and use Jupyter, as shown in the video Learn Python for Beginners from codingthesmartway.com
Work through Plotting and Programming in Python and Programming in Python from software-carpentry.org.
Exercise: Submit an ipynb file which shows the definition and use of the "fahr_to_celsius", "visualize", and "detect_problems" functions from Step 8 of the Programming in Python tutorial to EAS as "theory_assignment 2".
Week 3: Python Object-Oriented Programming (Corey Schafer: youtube)
Work through some advanced Python: the tutorials Python: Lambda, Map, Filter, Reduce Functions from Joe James, and Object-Oriented Python from Corey Schafer. (Schafer has many follow-up YouTube tutorials too.)
Write and test a simple program, as a Jupyter notebook, that reads in a list of Employee data and calculates the total pay and average pay. Create some sample data as a csv file to test your code. Use the Employee class from Schafer's OO Python, and use lambda, map, filter, and reduce (as necessary) to calculate the total pay and average pay of a list of Employee objects.
Exercise: Submit your ipynb file from the tutorial to EAS as "theory_assignment 3".
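The Week 3 task can be sketched roughly as follows. This is only a minimal illustration and not Schafer's exact class: the `Employee` fields (`first`, `last`, `pay`), the sample rows, and the helper `load_employees` are assumptions made for this example, and the csv data is read from an in-memory string rather than from a file.

```python
import csv
import io
from functools import reduce


class Employee:
    """Simplified, hypothetical Employee class (the fields are assumptions)."""

    def __init__(self, first, last, pay):
        self.first = first
        self.last = last
        self.pay = pay


# Hypothetical sample data; in the lab this would live in a csv file.
SAMPLE_CSV = """first,last,pay
Ada,Lovelace,50000
Alan,Turing,60000
Grace,Hopper,70000
"""


def load_employees(text):
    """Parse csv text with a header row into a list of Employee objects."""
    reader = csv.DictReader(io.StringIO(text))
    return [Employee(row["first"], row["last"], float(row["pay"])) for row in reader]


employees = load_employees(SAMPLE_CSV)

# map extracts the pay of each employee; reduce sums the values.
pays = list(map(lambda e: e.pay, employees))
total_pay = reduce(lambda acc, p: acc + p, pays, 0.0)
average_pay = total_pay / len(pays) if pays else 0.0

# filter keeps only the employees paid above the average.
above_average = list(filter(lambda e: e.pay > average_pay, employees))

print(total_pay, average_pay, [e.first for e in above_average])
```

To read from a real file instead, the same `csv.DictReader` can be given a file object opened with `open(path, newline="")`.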
Week 4: Python pandas example (datacarpentry.org)
Work through the tutorial Python for Social Science Data for an introduction to pandas (steps 8-12), matplotlib (step 13), and using SQLite databases (step 14).
Exercise: Submit your ipynb file from the tutorial to EAS as "theory_assignment 4".
Week 5: Python Exploratory Data Analysis example (datacarpentry.org)
Work through Data Analysis and Visualization in Python for Ecologists. Focus on steps 4-8, ignoring challenges and exercises.
Exercise: Submit your ipynb file from the tutorial to EAS as "theory_assignment 5".
Week 6: Data wrangling example (biggorilla.org)
This is a long tutorial that will take two lab sessions to work through. See Instructions for Lab 6 and 7.
Create a Jupyter notebook for this tutorial. Keep minimal markdown so you can keep track of the steps done, why you did them, and why you did them the way you did them.
For this lab you should focus on Part 1: Data acquisition; Part 2: Data extraction; and Part 3: Data profiling and cleaning.
Exercise: Submit your ipynb file from the tutorial to EAS as "theory_assignment 6".
Week 7: continuation of Lab 6
For this lab you should focus on Part 4: Data matching and merging.
Exercise: Submit your ipynb file from the tutorial to EAS as "theory_assignment 7".
Week 8: Data wrangling with OpenRefine (datacarpentry.org)
Work through the tutorial Data Cleaning with OpenRefine for Ecology from datacarpentry.org.
Exercise: Submit the csv file of the clean data exported in Step 6 of the tutorial to EAS as "theory_assignment 8".
Week 9: Linear regression models for ecology data
Use the clean data from Lab 8 to build linear regression models that relate species, seasons of the year, and country to the number of sightings of rodents.
You will need to engineer a new feature for the season, and a feature for the count of each species for each season. Keep in mind that the USA is in the northern hemisphere, like Canada, while Australia is in the southern hemisphere, where the seasons are reversed.
You will have to decide how to handle Ecuador, which is on the equator and in fact has no seasons. You could drop the data from Ecuador, treat it like the USA in the northern hemisphere, or treat it like Australia in the southern hemisphere. Do the models change significantly across these three ways of treating Ecuador?
Yes, the linear regression may not produce good models, but do not worry: this is a learning experience.
Exercise: Submit the ipynb file of your notebook showing your models to EAS as "theory_assignment 9".
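One possible way to engineer the season feature and the per-season species counts is sketched below in plain Python. Everything specific here is an assumption for illustration: the month-to-season mapping (meteorological seasons), the record layout `(species, month, country)`, the placeholder species names, and the helper names `season` and `count_by_species_season`. The actual columns of the cleaned Lab 8 data may differ.

```python
from collections import Counter

# Meteorological seasons by month for the northern hemisphere (an assumption).
NORTHERN_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Southern-hemisphere seasons are the northern ones reversed.
OPPOSITE = {"winter": "summer", "summer": "winter",
            "spring": "autumn", "autumn": "spring"}

# Hypothetical hemisphere lookup; the Ecuador entry is the choice to vary.
HEMISPHERE = {"USA": "north", "Australia": "south", "Ecuador": "north"}


def season(month, country, hemisphere=HEMISPHERE):
    """Return the season for a month (1-12), or None if the country is not listed."""
    side = hemisphere.get(country)
    if side is None:
        return None  # country dropped from the analysis
    northern = NORTHERN_SEASON[month]
    return northern if side == "north" else OPPOSITE[northern]


def count_by_species_season(records, hemisphere=HEMISPHERE):
    """Count sightings per (species, season), skipping dropped countries."""
    counts = Counter()
    for species, month, country in records:
        s = season(month, country, hemisphere)
        if s is not None:
            counts[(species, s)] += 1
    return counts


# Hypothetical records: (species, month, country).
records = [
    ("species_a", 1, "USA"),
    ("species_a", 1, "Australia"),
    ("species_a", 7, "Australia"),
    ("species_b", 7, "Ecuador"),
]

counts = count_by_species_season(records)
# Dropping Ecuador is expressed by leaving it out of the lookup.
counts_without_ecuador = count_by_species_season(
    records, {"USA": "north", "Australia": "south"}
)
print(counts)
print(counts_without_ecuador)
```

Passing a different hemisphere lookup makes it easy to rebuild the features for each of the three treatments of Ecuador and compare the resulting models.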
Week 10: Python machine learning with scikit-learn (scikit-learn.org)
Work through the introductory tutorial on scikit-learn.
Look at the examples, explore ML techniques, and explore some concepts in ML:
- Confusion matrix, imbalanced data
- Under- and over-fitting
Exercise: Submit an ipynb file for PCA applied to the Iris dataset to EAS as "theory_assignment 10".
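To illustrate the "confusion matrix, imbalanced data" concept listed for this week, here is a small hand-written sketch. scikit-learn provides `sklearn.metrics.confusion_matrix` for real work; the version below only makes the counting explicit. The labels are made-up data, and the always-predict-the-majority "classifier" is a deliberately naive assumption.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * 100

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Recall: the fraction of true positives that were actually found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("tp, fp, fn, tn =", (tp, fp, fn, tn))
print("accuracy =", accuracy, "recall =", recall)
```

The accuracy looks high even though no positive case is ever detected, which is why accuracy alone can mislead on imbalanced data.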
Week 11: PCA and decision tree for social science data
First, make sure that you have looked at PCA and decision trees for the iris dataset from last week.
- Principal Component Analysis applied to the Iris dataset
- Decision tree for the Iris dataset, and inspecting the tree
Second, follow those examples to create one Jupyter notebook where you (a) apply the PCA calculation, and (b) build a decision tree for the social sciences data of Lab 4.
Exercise: Submit the ipynb file of the Jupyter notebook to EAS as "theory_assignment 11".
Week 12: Story-Telling and Visualization
Create a 10-15 minute presentation, with visualization, for the social sciences EDA exercise of Lab 4 following the four-slide method. First re-do the EDA of Lab 4 to explore possible stories and supporting visualizations. Then draft a presentation using the four-slide method. Carefully review your presentation, maybe even doing a trial run of the presentation yourself, and then revise it.
There must be exactly four slides: Situation, Problem, Solution, Next Steps! No title slide; no outline slide; just the four slides required in the SPSN approach.
Exercise: Submit a pdf file (not an ipynb file, a PowerPoint file, or a zip file) to EAS as "theory_assignment 12".
Week 13: Story telling for ecology data
Build on the use of ecology data in Labs 8 and 9, as well as your experience in Lab Assignment 4 with story-telling as an integral part of the Jupyter notebook.
Create your story using the four-slide method as you did in Lab 12, this time for the ecology data and as part of the Jupyter notebook rather than PowerPoint.
Exercise: Submit the ipynb file of your Jupyter notebook to EAS as "theory_assignment 13".

Lab Assignments

Lab assignments will come here.

There will be a 10% penalty for each day late in submission.

For the deadlines, 10:00 in the 24-hour clock notation is 10am.

A Simple Example READ
Deadline 2023-02-15 at 10:00: Submit your ipynb file to EAS as "programming_assignment 1".
Python for Descriptive Data Analysis READ
Deadline 2023-03-01 at 10:00: Submit your ipynb file to EAS as "programming_assignment 2".
Data Wrangling with Python and pandas READ
Deadline 2023-03-22 at 10:00: Submit your ipynb file to EAS as "programming_assignment 3".
Jupyter markup for Story Telling READ
Deadline 2023-04-14 at 10:00: Submit your zipped directory with your ipynb file and pdf file to EAS as "programming_assignment 4".

Some Datasets

Example 1: Restaurant Tipping Dataset. See csv dataset.

Example 2: OECD Dataset. PISA is the OECD's Programme for International Student Assessment. Every three years it tests 15-year-old students from all over the world in reading, mathematics and science. Look at the Data tab for access to all the data.

Example 3: Titanic Survival Rate

Iris dataset about flowers, which is a common example for the R system. See csv dataset.

Last modified on 10 January 2023 by gregb@cse.concordia.ca