Greg Butler: Introduction to Data Analytics

COMP 499 Introduction to Data Analytics

Summer 2019 Semester 1: May 7 to June 18, 2019

Lab 6 and 7: Movie Data Wrangling

Data wrangling example (biggorilla.org)

This is a long tutorial that will take two lab sessions to work through.

It is based on the example A Hands on Tutorial for public movie data: The Kaggle 5000 Movie Dataset (imdb). The example is out of date: its code is written in Python 2, and the data sources have moved.

A thorough discussion of the original process, which involved scraping and facial recognition, can be found at How to tell the greatness of a movie before it is released in cinema, with the code available at https://github.com/sundeepblue/movie_rating_prediction.

The BigGorilla file with the Kaggle 5000 Movie Dataset (movie_metadata.csv) is available at BigGorilla datasets under movie_metadata.csv and as ./comp499data/movie_metadata.csv

The BigGorilla file integrating the IMDB data (imdb_dataset.csv) is available at BigGorilla datasets under imdb_dataset.csv and as ./comp499data/imdb_dataset.csv

The files for the original ftp download from Finland (ftp://ftp.funet.fi/pub/)

genres.list.gz
ratings.list.gz

are available as ./comp499data/genres.list.gz and ./comp499data/ratings.list.gz

Lab 6 and Lab 7

Create a Jupyter notebook for this tutorial. Include brief markdown notes so you can keep track of the steps you did, why you did them, and why you did them that way.

For Lab 6 you should focus on

Part 1: Data acquisition;
Part 2: Data extraction; and
Part 3: Data profiling and cleaning.

and for Lab 7 focus on

Part 4: Data Matching and Merging

Lab 6

Part 1: Data Acquisition

Step 1 The Kaggle IMDB 5000 Movie Dataset is available at ./comp499data/movie_metadata.csv. You should name the file './data/kaggle_dataset.csv' on your computer.

Step 2 The IMDB Plain Text Files are available at ./comp499data/genres.list.gz and ./comp499data/ratings.list.gz
You should place the files in './data/' on your computer.

Step 3 The IMDB Prepared Data File is available at ./comp499data/imdb_dataset.csv
You should place the file as './data/imdb_dataset.csv' on your computer.
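
The three acquisition steps can be sketched as a small copy routine. This is only a sketch: it assumes the course files have already been downloaded into a local directory, and the helper name stage_files and the mapping LAB_FILES are hypothetical names introduced for illustration.

```python
import shutil
from pathlib import Path

# Source file name -> name expected under ./data/ (taken from Steps 1 to 3).
LAB_FILES = {
    "movie_metadata.csv": "kaggle_dataset.csv",
    "genres.list.gz": "genres.list.gz",
    "ratings.list.gz": "ratings.list.gz",
    "imdb_dataset.csv": "imdb_dataset.csv",
}


def stage_files(source_dir, target_dir, name_map):
    """Copy each listed file from source_dir into target_dir under its new name.

    Files missing from source_dir are skipped; the copied paths are returned.
    """
    source_dir = Path(source_dir)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source_name, target_name in name_map.items():
        source = source_dir / source_name
        if not source.exists():
            continue  # not downloaded yet
        destination = target_dir / target_name
        shutil.copyfile(source, destination)
        copied.append(destination)
    return copied
```

Calling stage_files("./comp499data", "./data", LAB_FILES) in the notebook would then give the layout described in Steps 1 to 3, assuming the downloads sit in ./comp499data.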

Part 2: Data Extraction

Work through the example A Hands on Tutorial for public movie data: keep in mind that you should use Python 3, whereas the biggorilla example code is Python 2.
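
One place where Python 2 and Python 3 differ is in reading compressed text: in Python 3, gzip.open returns bytes unless a text mode and an encoding are given. The sketch below shows one way to read a .gz file line by line. The helper name read_gz_lines is hypothetical, and the latin-1 encoding is an assumption about the IMDB plain text files that you should check against the actual data.

```python
import gzip


def read_gz_lines(path, encoding="latin-1", limit=None):
    """Return decoded lines from a gzip file, at most `limit` if given."""
    lines = []
    # mode="rt" asks for text; without it Python 3 yields bytes objects.
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        for index, line in enumerate(handle):
            if limit is not None and index >= limit:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

For example, read_gz_lines("./data/genres.list.gz", limit=20) would let you inspect the first lines of the file before writing a parser. Note also that print is a function in Python 3, so print(line) replaces the Python 2 statement print line.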

Create the dataframes 'genres_data' and 'ratings_data'.
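
A minimal sketch of building the two dataframes with regular expressions follows. The line layouts are assumptions: a genres line is taken to be "Title (Year)" followed by tabs and a genre, and a ratings line is taken to be a 10-character distribution code, a vote count, a rating, and "Title (Year)". The column names and the sample lines are made up for illustration; compare them with the real files and the tutorial before relying on them.

```python
import re

import pandas as pd

# Assumed layout: Title (Year)<tabs>Genre
GENRE_LINE = re.compile(
    r"^(?P<movie>.+?) \((?P<year>\d{4})(?:/[IVXLC]+)?\).*?\t+(?P<genre>[^\t]+)$"
)
# Assumed layout: <spaces>distribution  votes  rating  Title (Year)
RATING_LINE = re.compile(
    r"^\s+[0-9.*]{10}\s+(?P<votes>\d+)\s+(?P<rating>\d+\.\d)\s+"
    r"(?P<movie>.+?) \((?P<year>\d{4})(?:/[IVXLC]+)?\)"
)


def parse_lines(lines, pattern, columns):
    """Keep only the lines that match `pattern`; return them as a dataframe."""
    matches = (pattern.match(line.rstrip("\r\n")) for line in lines)
    rows = [match.groupdict() for match in matches if match]
    return pd.DataFrame(rows, columns=columns)


def build_genres(lines):
    frame = parse_lines(lines, GENRE_LINE, ["movie", "year", "genre"])
    frame["year"] = frame["year"].astype(int)
    return frame


def build_ratings(lines):
    frame = parse_lines(lines, RATING_LINE, ["movie", "year", "rating", "votes"])
    return frame.astype({"year": int, "rating": float, "votes": int})


# Made-up sample lines; header lines that do not match are skipped.
genre_sample = [
    "THE GENRES LIST",
    "Example Movie (2001)\t\t\t\tDrama",
    "Example Movie (2001)\t\t\t\tComedy",
    "Another Example (1999/II)\t\t\tAction",
]
rating_sample = [
    "New  Distribution  Votes  Rank  Title",
    "      0000001222      12   7.5  Example Movie (2001)",
]

genres_data = build_genres(genre_sample)
ratings_data = build_ratings(rating_sample)
```

With the real files, the lists of sample lines would be replaced by an open file handle, for instance one returned by gzip.open(path, "rt", encoding="latin-1").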

Part 3: Data Profiling and Cleaning

Work through the example A Hands on Tutorial for public movie data: keep in mind that you should use Python 3, whereas the biggorilla example code is Python 2.

Step 1: Loading
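
Loading can be sketched with pandas.read_csv. The helper name load_dataset is hypothetical, and stripping whitespace from column names is an optional precaution rather than something the tutorial prescribes. The in-memory CSV text stands in for './data/kaggle_dataset.csv'.

```python
import io

import pandas as pd


def load_dataset(source, **read_csv_options):
    """Read a CSV file (path or file-like object) and tidy the column names."""
    frame = pd.read_csv(source, **read_csv_options)
    frame.columns = [str(name).strip() for name in frame.columns]
    return frame


# Made-up CSV text used in place of a real file path.
demo_csv = io.StringIO(" movie_title ,title_year\nExample Movie,2001\n")
demo_frame = load_dataset(demo_csv)
```

In the notebook, the call would look like load_dataset('./data/kaggle_dataset.csv').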

Step 2: Profiling statistics
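
A simple profile reports, for each column, its type, the number of missing values, and the number of distinct values. The sketch below is one possible way to compute this with pandas; the function name profile and the sample values are made up for illustration.

```python
import pandas as pd


def profile(frame):
    """Return one row per column with dtype, missing count, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing": frame.isnull().sum(),
            "unique": frame.nunique(),  # NaN is not counted as a value
        }
    )


# Made-up sample data.
sample = pd.DataFrame(
    {
        "movie_title": ["Example Movie", None, "Example Movie"],
        "imdb_score": [7.0, 6.5, 7.0],
    }
)
summary = profile(sample)
```

frame.describe() gives complementary statistics (mean, quartiles, and so on) for the numeric columns.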

Step 3: Remove duplicates
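
Duplicates can be removed with DataFrame.drop_duplicates. In this sketch, the choice of key columns (title and year) is an assumption, and the function name and sample rows are made up; decide for yourself which columns identify a movie.

```python
import pandas as pd


def remove_duplicates(frame, key_columns):
    """Keep the first row for each combination of key_columns."""
    deduplicated = frame.drop_duplicates(subset=key_columns, keep="first")
    return deduplicated.reset_index(drop=True)


# Made-up sample data with one repeated (title, year) pair.
sample = pd.DataFrame(
    {
        "movie_title": ["Example Movie", "Example Movie", "Example Movie"],
        "title_year": [2001, 2001, 1985],
        "imdb_score": [7.0, 7.1, 6.0],
    }
)
unique_movies = remove_duplicates(sample, ["movie_title", "title_year"])
```

Comparing len(sample) with len(unique_movies) shows how many rows were dropped.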

Step 4: Normalizing text
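
Normalizing text usually means making titles comparable by removing differences in case and spacing. The sketch below is one possible normalization, not necessarily the one used in the tutorial: it applies Unicode NFKC normalization (which turns non-breaking spaces into ordinary spaces), collapses runs of whitespace, trims the ends, and lowercases. The function name normalize_text is hypothetical.

```python
import re
import unicodedata


def normalize_text(value):
    """Return a lowercased, whitespace-normalized copy of a string.

    Non-string values (for example NaN or None) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = unicodedata.normalize("NFKC", value)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()
```

With pandas, the function can be applied to a whole column through Series.map, for example frame["movie_title"].map(normalize_text).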

Step 5: Some samples

Lab 7

Part 4: Data Matching and Merging

Work through the example A Hands on Tutorial for public movie data: keep in mind that you should use Python 3, whereas the biggorilla example code is Python 2.

Step 1: Integrating IMDB Files
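
Integrating the two IMDB files amounts to joining 'ratings_data' and 'genres_data' on the attributes that identify a movie. The sketch below assumes both dataframes have 'movie' and 'year' columns (names chosen for illustration) and uses an inner join, so movies missing from either file are dropped. The sample rows are made up.

```python
import pandas as pd


def integrate_imdb(ratings_data, genres_data):
    """Inner-join ratings and genres on (movie, year)."""
    return pd.merge(ratings_data, genres_data, how="inner", on=["movie", "year"])


# Made-up sample data: the first movie has two genres, the second has none.
ratings_sample = pd.DataFrame(
    {"movie": ["example movie", "another example"], "year": [2001, 1999], "rating": [7.5, 6.0]}
)
genres_sample = pd.DataFrame(
    {"movie": ["example movie", "example movie"], "year": [2001, 2001], "genre": ["Drama", "Comedy"]}
)
imdb_sample = integrate_imdb(ratings_sample, genres_sample)
```

Because a movie can be listed under several genres, the result has one row per movie and genre pair.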

Step 2: Integrating Kaggle and IMDB
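
A first attempt at integrating the Kaggle and IMDB data is an exact join on normalized title and year. In the sketch below an outer join with indicator=True is used so that you can count how many movies matched and how many appear in only one source. The column names and sample rows are assumptions made for illustration.

```python
import pandas as pd


def merge_sources(kaggle_data, imdb_data):
    """Outer-join the two sources on title and year, recording where rows came from."""
    return pd.merge(
        kaggle_data,
        imdb_data,
        how="outer",
        left_on=["movie_title", "title_year"],
        right_on=["movie", "year"],
        indicator=True,  # adds a _merge column: both, left_only, right_only
    )


# Made-up sample data with already-normalized titles.
kaggle_sample = pd.DataFrame(
    {"movie_title": ["example movie", "only in kaggle"], "title_year": [2001, 2010]}
)
imdb_sample = pd.DataFrame(
    {"movie": ["example movie", "only in imdb"], "year": [2001, 1999]}
)
merged = merge_sources(kaggle_sample, imdb_sample)
match_counts = merged["_merge"].value_counts()
```

A large number of left_only or right_only rows suggests that exact matching is too strict, which motivates the data matching step.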

Step 3: Data Matching
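
Data matching pairs records whose titles are similar but not identical. The sketch below illustrates the idea with difflib.SequenceMatcher from the Python standard library. This is a swapped-in technique chosen because it needs no extra packages; it is not the tutorial's method, and the threshold of 0.85 and the function name best_matches are arbitrary choices for illustration. It compares every pair of titles, so it is only practical on small samples.

```python
from difflib import SequenceMatcher


def best_matches(left_titles, right_titles, threshold=0.85):
    """For each left title, return its most similar right title if similar enough.

    Returns a list of (left_title, right_title, score) tuples.
    """
    matches = []
    for left in left_titles:
        best_title, best_score = None, 0.0
        for right in right_titles:
            score = SequenceMatcher(None, left, right).ratio()
            if score > best_score:
                best_title, best_score = right, score
        if best_title is not None and best_score >= threshold:
            matches.append((left, best_title, round(best_score, 3)))
    return matches


# Made-up titles: the first pair differs only by a colon.
pairs = best_matches(
    ["star wars episode iv", "toy story"],
    ["star wars: episode iv", "alien"],
)
```

Inspect the matched pairs by hand before merging on them, since a similarity threshold can produce both false matches and missed matches.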

Last modified on 21 May 2019 by gregb@cs.concordia.ca