This is a long tutorial that will take two lab sessions to work through.
It is based on the BigGorilla example "A Hands on Tutorial for public movie data: The Kaggle 5000 Movie Dataset (imdb)". That example is out of date: its code is written in Python 2, and the data sources have moved.
A thorough discussion of the original process, which involved scraping and facial recognition, can be found in "How to tell the greatness of a movie before it is released in cinema", with the code available at https://github.com/sundeepblue/movie_rating_prediction.
The BigGorilla file with the Kaggle 5000 Movie Dataset (movie_metadata.csv) is available at BigGorilla datasets under movie_metadata.csv and at ./comp499data/movie_metadata.csv.
The BigGorilla file integrating the IMDB data (imdb_dataset.csv) is available at BigGorilla datasets under imdb_dataset.csv and at ./comp499data/imdb_dataset.csv.
The files from the original FTP download from Finland (ftp://ftp.funet.fi/pub/), genres.list.gz and ratings.list.gz, are available as ./comp499data/genres.list.gz and ./comp499data/ratings.list.gz.
Create a Jupyter notebook for this tutorial. Keep brief markdown notes so you can track which steps you did, why you did them, and why you did them that way.
For Lab 6 you should focus on
Part 1: Data Acquisition; Part 2: Data Extraction; and Part 3: Data Profiling and Cleaning.
For Lab 7 you should focus on
Part 4: Data Matching and Merging.
Part 1: Data Acquisition
Step 1
The Kaggle IMDB 5000 Movie Dataset is available at
./comp499data/movie_metadata.csv
You should save the file as './data/kaggle_dataset.csv' on your computer.
Step 2
The IMDB Plain Text Files are available at
./comp499data/genres.list.gz
and
./comp499data/ratings.list.gz
You should place the files in './data/' on your computer.
Step 3
The IMDB Prepared Data File is available at
./comp499data/imdb_dataset.csv
You should place the file as './data/imdb_dataset.csv' on your computer.
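The three placement steps above can also be scripted from the notebook. The following is only a sketch: the helper name stage_files and the FILES mapping are hypothetical, and it assumes that ./comp499data/ is readable from the notebook's working directory.

```python
from pathlib import Path
import shutil

# Source name in ./comp499data/ -> target name in ./data/ (taken from Steps 1-3).
FILES = {
    "movie_metadata.csv": "kaggle_dataset.csv",
    "genres.list.gz": "genres.list.gz",
    "ratings.list.gz": "ratings.list.gz",
    "imdb_dataset.csv": "imdb_dataset.csv",
}


def stage_files(src_dir="./comp499data", dst_dir="./data"):
    """Copy the lab files into dst_dir; return the source names that were not found."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    missing = []
    for src_name, dst_name in FILES.items():
        if (src / src_name).is_file():
            shutil.copy2(src / src_name, dst / dst_name)
        else:
            missing.append(src_name)
    return missing
```

Calling stage_files() with no arguments uses the paths given in this part; an empty list as the return value means all four files were copied.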
Part 2: Data Extraction
Work through the example "A Hands on Tutorial for public movie data". Keep in mind that you should use Python 3, whereas the BigGorilla example code is Python 2.
Create the dataframes 'genres_data' and 'ratings_data'.
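As a starting point for the extraction, here is a minimal Python 3 sketch that turns lines of the genres file into a dataframe. The line layout (title, year in parentheses, tabs, genre), the column names and the latin-1 encoding are assumptions, so check them against the actual file; the ratings file has a different layout and would need its own pattern.

```python
import gzip
import re

import pandas as pd

# Assumed layout of a data line: title, "(year)" or "(year/II)", tabs, genre.
GENRE_LINE = re.compile(
    r"^(?P<title>.+?)\s+\((?P<year>\d{4})(?:/[IVXLC]+)?\).*?\t+(?P<genre>\S+)\s*$"
)


def parse_genre_lines(lines):
    """Build a DataFrame from text lines; lines that do not match are skipped."""
    rows = []
    for line in lines:
        match = GENRE_LINE.match(line)
        if match:
            rows.append((match["title"], int(match["year"]), match["genre"]))
    return pd.DataFrame(rows, columns=["movie", "year", "genre"])


def read_genres(path="./data/genres.list.gz"):
    # latin-1 is an assumption about the encoding; change it if titles look garbled.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return parse_genre_lines(handle)


# Made-up lines standing in for the real file.
sample = [
    "8: THE GENRES LIST\n",
    "Some Movie (1999)\t\t\tDrama\n",
    "Another One (2005/II)\t\tComedy\n",
]
genres_data = parse_genre_lines(sample)
print(genres_data)
```

Opening the file with gzip.open in text mode avoids unpacking it by hand, and header lines are dropped simply because they do not match the pattern.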
Part 3: Data Profiling and Cleaning
Work through the example "A Hands on Tutorial for public movie data". Keep in mind that you should use Python 3, whereas the BigGorilla example code is Python 2.
Step 1: Loading
Step 2: Profiling statistics
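A rough sketch of loading a CSV file and computing a few profiling statistics with pandas is given here. The inline text stands in for './data/kaggle_dataset.csv', and the three column names are only illustrative; the real file has many more columns.

```python
import io

import pandas as pd

# Tiny stand-in for ./data/kaggle_dataset.csv (illustrative columns only).
csv_text = """movie_title,title_year,imdb_score
Alpha,2001,7.1
Beta,,6.4
Alpha,2001,7.1
"""
# In the lab: kaggle_data = pd.read_csv('./data/kaggle_dataset.csv')
kaggle_data = pd.read_csv(io.StringIO(csv_text))

# One row per column: its type, number of missing values, number of distinct values.
profile = pd.DataFrame({
    "dtype": kaggle_data.dtypes.astype(str),
    "missing": kaggle_data.isna().sum(),
    "unique": kaggle_data.nunique(),
})

print(kaggle_data.shape)       # (rows, columns)
print(profile)
print(kaggle_data.describe())  # summary statistics for the numeric columns
```

Note that print is a function in Python 3, so the Python 2 print statements in the original example need parentheses.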
Step 3: Remove duplicates
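One possible way to find and drop duplicates with pandas is sketched here. Treating the pair (movie_title, title_year) as the identity of a movie is an assumption; choose the key columns that make sense for your data.

```python
import pandas as pd

# Made-up rows; the first and third describe the same movie.
movies = pd.DataFrame({
    "movie_title": ["Alpha", "Beta", "Alpha", "Alpha"],
    "title_year": [2001, 2003, 2001, 2010],
})

key = ["movie_title", "title_year"]

# Number of rows that repeat an earlier (title, year) pair.
n_dupes = int(movies.duplicated(subset=key).sum())

# Keep the first occurrence of each pair and renumber the rows.
deduped = movies.drop_duplicates(subset=key, keep="first").reset_index(drop=True)

print(n_dupes)
print(deduped)
```

Counting the duplicates before dropping them gives you a number to record in your notebook notes.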
Step 4: Normalizing text
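Titles from different sources often differ in case, accents and stray whitespace, so they are usually normalized before comparison. The helper below is a sketch using only the standard library; the name normalize_title and the exact rules (lowercase, strip accents, collapse whitespace) are assumptions and may differ from the rules in the original example.

```python
import unicodedata


def normalize_title(text):
    """Lowercase, strip accents and collapse whitespace (including non-breaking spaces)."""
    # NFKD splits accented letters into base letter + accent and maps
    # non-breaking spaces to ordinary spaces.
    decomposed = unicodedata.normalize("NFKD", str(text))
    # Dropping non-ASCII characters removes the detached accents.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())


print(normalize_title("  Amélie\xa0"))
print(normalize_title("The  DARK   Knight"))
```

With a dataframe, the helper can be applied to a column via Series.map, for example to create a normalized title column next to the original one. Missing values should be handled first, because str() turns them into ordinary text.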
Step 5: Some samples
Part 4: Data Matching and Merging
Work through the example "A Hands on Tutorial for public movie data". Keep in mind that you should use Python 3, whereas the BigGorilla example code is Python 2.
Step 1: Integrating IMDB Files
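A sketch of combining a genres dataframe and a ratings dataframe follows. The column names and rows are made up, and both the decision to collapse genres into one string per movie and the use of an outer join are assumptions rather than the method of the original example.

```python
import pandas as pd

genres_data = pd.DataFrame({
    "movie": ["Alpha", "Alpha", "Beta"],
    "year": [2001, 2001, 2003],
    "genre": ["Drama", "Crime", "Comedy"],
})
ratings_data = pd.DataFrame({
    "movie": ["Alpha", "Gamma"],
    "year": [2001, 2010],
    "rating": [7.1, 5.5],
})

# One row per movie: join its genres into a single string.
genres_flat = (
    genres_data.groupby(["movie", "year"])["genre"]
    .agg(", ".join)
    .reset_index()
)

# An outer join keeps movies that appear in only one of the two files.
imdb_data = genres_flat.merge(ratings_data, on=["movie", "year"], how="outer")
print(imdb_data)
```

Collapsing the genres first avoids repeating a movie's rating once per genre after the merge.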
Step 2: Integrating Kaggle and IMDB
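When the two datasets name their key columns differently, pandas can merge on separate left and right keys. The sketch below uses made-up rows and assumed column names (norm_movie_title, title_year, norm_title, year); the indicator option shows which Kaggle rows found no exact partner.

```python
import pandas as pd

kaggle_data = pd.DataFrame({
    "norm_movie_title": ["alpha", "beta"],
    "title_year": [2001, 2003],
    "gross": [1.0e6, 2.5e6],
})
imdb_data = pd.DataFrame({
    "norm_title": ["alpha", "gamma"],
    "year": [2001, 2010],
    "rating": [7.1, 5.5],
})

merged = kaggle_data.merge(
    imdb_data,
    left_on=["norm_movie_title", "title_year"],
    right_on=["norm_title", "year"],
    how="left",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Kaggle rows without an exact (title, year) match in the IMDB data.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(len(unmatched))
```

The unmatched rows are the natural input for an approximate matching step, since an exact join cannot recover titles that are spelled differently.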
Step 3: Data Matching
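To illustrate the idea of approximate matching, the sketch below scores title pairs with difflib.SequenceMatcher from the Python standard library. This is a stand-in chosen for simplicity, not the matching tool used in the original example, and the helper name best_match and the threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher


def best_match(title, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar candidate, or None below threshold."""
    scored = [(SequenceMatcher(None, title, c).ratio(), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    return (candidate, score) if score >= threshold else None


# Made-up, already normalized titles.
imdb_titles = ["the dark knight", "dark city", "knight and day"]

print(best_match("the dark knight rises", imdb_titles))
print(best_match("amelie", imdb_titles))
```

The first call pairs a sequel with the earlier film, which shows why a similarity score on titles alone is not enough: comparing the release year as well, or raising the threshold, helps to avoid such false matches.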