Course Outline, Winter 2024
Big Data Analytics
SOEN 471 / SOEN 6111

Instructors

Coordinator: Dr. Tristan Glatard
e-mail: tristan.glatard@concordia.ca
Regular (online or onsite) office hours: Wednesday 4pm - 5pm or by appointment. Office: ER 9.919. A Zoom link will be posted on Moodle.

Teaching Assistants:

Inés Gonzalez Pepe (Projects)
e-mail: inesgp99@gmail.com
Mathieu Dugré (Lab assignments)
e-mail: math.dugre@gmail.com
Sephora Maltais (Lab assignments #2)
e-mail: seph156@gmail.com

Lectures & Labs

Lectures: Wednesday 5:45PM - 8:15PM. H 110.
Labs:
- Wednesday 3:45PM - 5:35PM.
  - Room H903 (assignment lab #1, Mathieu)
  - Room H849 (project clinic, Inés)
- Wednesday 8:30PM - 10:20PM.
  - Room H817 (assignment lab #1, Mathieu)
  - Room H967 (assignment lab #2, Sephora)
  - Room H903 (project clinic, Inés)

Objectives

Big Data analytics has been transforming industry and science in various domains for the past few years, making it possible to process terabytes of data on a daily basis. This was enabled by the joint evolution of programming models, data-analysis algorithms and computing infrastructures.

This course introduces the concepts and some of the main algorithms used for Big Data analytics. It presents the principles of the Hadoop ecosystem and Apache Spark, and it details the main algorithms for the analysis of large datasets, related to similarity search, mining of frequent itemsets, graph analysis, clustering, stream mining, recommender systems and advertising.

By the end of this course, students will be able to write and deploy efficient parallel algorithms to analyze Big Data sources for various applications.

Communication

Important information will be communicated through Moodle and/or Slack. Students are expected to consult these channels regularly.

Moodle page
Slack workspace (join channel #bigdata--winter2024)

Students are also encouraged to communicate about course topics among themselves, with their TAs, and with the professor. Frequent communication is key to successful learning! However, to keep this environment workable, the following rules must be respected, in particular for communications happening on Slack:

Spend a few minutes searching for answers on your own before asking other people.
Keep an eye on the on-going discussions, and try to avoid asking a question that has already been asked.
Use Slack's thread feature to follow up on a question instead of posting replies to the main channel.
Don't hesitate to chime in on a discussion if you think you might help.
Always use the public channel (#bigdata--winter2024) to ask your questions. In particular, never exchange private messages with your TA. If any personal matter has to be discussed, communicate with the professor. If you really need to send a non-public message to your TA, always involve the professor.

These rules are meant to ensure that most questions can be answered while keeping a reasonable load on the instructors.

The instructors are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of students or TAs in any form. Sexual language and imagery are not appropriate for any communication, in particular on Slack. For more information, please consult Concordia's policy on harassment.

Schedule

Date	Lecture	Assignments Lab	Project Clinic Lab	Deliverables
Jan 17	Introduction	None	None	None
Jan 24	Data locality (Hadoop MapReduce and HDFS)	Even team ids: Git, GitHub, Python, pytest.	Odd team ids: Project Definition	Project teams (4 members) must be registered by Jan 23, 11:55pm
Jan 31	In-memory computing and lazy evaluation (Apache Spark, Dask)	Odd team ids: Git, GitHub, Python, pytest.	Even team ids: Project Definition	None
Feb 7	Supervised Learning	Even team ids: Spark RDDs and DataFrames, intro to LA1.	Odd team ids: Data model design	None
Feb 14	Recommender Systems	Odd team ids: Spark RDDs and DataFrames, intro to LA1.	Even team ids: Data model design	None
Feb 21	Clustering	Even team ids: Introduction to LA2	Odd team ids: Data preparation	LA1 Due date: Feb 23, 11:55pm; Project summary: Due during project clinic
Mid-term break
Mar 6	Midterm exam	None	None	None
Mar 13	Frequent Itemsets	Odd team ids: Introduction to LA2	Even team ids: Data preparation	LA2 Due date: Mar 15, 11:55pm; Project summary: Due during project clinic
Mar 20	Data Streams	Even team ids: Introduction to LA3	Odd team ids: Model implementation	Project data model: Due during project clinic
Mar 27	Graph Analysis	Odd team ids: Introduction to LA3	Even team ids: Model implementation	Project data model: Due during project clinic
Apr 3	Similarity Search	All teams: Help with LA3	Odd team ids: Project presentations	LA3 Due date: Apr 5, 11:55pm
Apr 10	Project presentations
Apr 24, 7-10pm	Final Exam (Rooms: MBS2.105, MBS2.115, MBS2.285, MBS2.330, MBS2.401 and MBS2.330)

Please note: In the event of extraordinary circumstances beyond the University's control, the content and/or evaluation scheme in this course is subject to change.

Book

MMDS (Required): Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, beta version of the 3rd edition. Available online.

A significant portion of the slides presented from session 5 onward will be taken from http://www.mmds.org. This website also has useful videos explaining the slides.

Course Evaluation

Lab assignments (15%): You will be required to develop data analysis programs in Python using Apache Spark or Dask. There will be a total of three assignments. You must work on these assignments individually. The lab assignments are all due on a Friday evening, 11:55pm (see exact dates on the schedule table). A grace period of 48 hours will be automatically granted (assignments will be accepted until Sunday night, 11:55pm), but no further extension will be granted. Assignments must be submitted through GitHub Classroom; you will receive a link for each assignment.
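The deadline rule above can be checked with a few lines of Python. This is only an illustrative sketch: the due dates are copied from the schedule table, while the helper name `is_accepted`, the use of local (naive) times, and the inclusive cutoff at exactly 11:55pm are assumptions, not an official submission policy.

```python
from datetime import datetime, timedelta

# Grace period stated in the outline.
GRACE = timedelta(hours=48)

# Due dates from the schedule table (Fridays, 11:55pm, local time assumed).
DUE = {
    "LA1": datetime(2024, 2, 23, 23, 55),
    "LA2": datetime(2024, 3, 15, 23, 55),
    "LA3": datetime(2024, 4, 5, 23, 55),
}


def is_accepted(assignment, submitted_at):
    """Return True if a submission falls within due date + grace period.

    Hypothetical helper: treats the cutoff as inclusive.
    """
    return submitted_at <= DUE[assignment] + GRACE


if __name__ == "__main__":
    for name, due in DUE.items():
        cutoff = due + GRACE
        print(name, "due", due.strftime("%a %b %d, %H:%M"),
              "| accepted until", cutoff.strftime("%a %b %d, %H:%M"))
```

Each cutoff lands on the Sunday following the Friday due date, which matches the "until Sunday night, 11:55pm" wording.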

Exams (60%): There will be a mid-term and a final exam. All exams will be closed book and no electronic device will be allowed except ENCS calculators. Exams will be Multiple-Choice Questionnaires. The midterm will be conducted in-class and will count for 20% of the final grade. The final exam will count for 40% of the final grade. There will be no substitution for a missed exam.

Project (25%): This course will walk you through the definition and implementation of a data-science project using Big Data technologies. During the project clinics and lecture, the instructors will guide you through the following milestones:

Project definition:
- Option 1: Own Project. You will define your own project by identifying: (1) a dataset of interest, (2) a set of research questions to be answered with the dataset, using techniques studied in class. If you are doing a Master's or PhD thesis, you are encouraged to define a project linked to your research topic.
- Option 2: Industrial Project. You will select a project proposed by a company through the Riipen platform. Industrial projects will be validated by the instructors and published on January 24.
Both project types are expected to follow the same milestones (described below). Milestones for industrial projects might be slightly revised depending on the exact nature of the projects.
Model design: choose a class of models in {supervised learning, recommender system, clustering, frequent itemset}. Outline how the data model could be applied to your dataset to answer your research question. Research algorithms and techniques to implement this class of model.
Data preparation: inspect the dataset, identify missing data, outliers, data types (categorical data in particular), and write Spark or Dask programs to correct for potential issues.
Model implementation: implement the model with Spark, Dask or scikit-learn.
Model evaluation: identify evaluation metrics for your model. Implement and discuss them.
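
The checks named in the data preparation milestone (missing data, categorical columns, outliers) can be illustrated on a toy dataset. The sketch below is library-free so that it runs anywhere; in the project itself these steps are expected to be written with Spark or Dask. The rows, column names and function names are hypothetical, and the 1.5 x IQR outlier rule is one common convention rather than something prescribed by the course.

```python
import statistics

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"income": 48000.0, "city": "Montreal"},
    {"income": 52000.0, "city": "Laval"},
    {"income": None, "city": "Montreal"},
    {"income": 55000.0, "city": None},
    {"income": 58000.0, "city": "Quebec"},
    {"income": 60000.0, "city": "Montreal"},
    {"income": 61000.0, "city": "Laval"},
    {"income": 63000.0, "city": "Montreal"},
    {"income": 900000.0, "city": "Quebec"},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    return {c: sum(1 for r in rows if r[c] is None) for c in rows[0]}


def categorical_columns(rows):
    """Treat a column as categorical if all its non-missing values are strings."""
    result = []
    for c in rows[0]:
        values = [r[c] for r in rows if r[c] is not None]
        if values and all(isinstance(v, str) for v in values):
            result.append(c)
    return result


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    incomes = [r["income"] for r in rows if r["income"] is not None]
    print("missing:", missing_counts(rows))
    print("categorical:", categorical_columns(rows))
    print("income outliers:", iqr_outliers(incomes))
```

Whether a flagged value should be dropped, capped or kept depends on the research question; the sketch only identifies candidates.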

Projects will have the following deliverables. Deadlines are indicated on the schedule. No deadline extension will be granted.

Team registration (1%) Easy point! Please register your team of 4 students on time to allow for a smooth organization of project clinics.
Participation (3%) Attendance and participation in project clinics. Project clinics will follow this template:
- Project team presents updates to class (5')
- Instructor gives feedback and directions (5'). Audience may ask questions.
- Feedback on project milestones (3').
Don't miss an opportunity to get feedback on your project!
Project summary (3%) The project summary will be a 400-word abstract available as a Markdown (.md) document in a GitHub repository. The summary will report on project definition and model design. It will describe the dataset and its main characteristics (number and type of features), the research questions to be addressed in the project, the class of models to be applied to the dataset, and the algorithms that will be used. At least two algorithms must be used and compared. The project summary will be evaluated during the project clinics (see schedule table).
Project data model (10%) The project data model will be delivered as a Jupyter notebook containing code and explanations to implement data preparation, model training and (optional) model evaluation. The project data model will be evaluated during the project clinics (see schedule table).
Final project presentation (8%) The project presentation will be delivered during the last week of the course as a 6-10 minute presentation putting special emphasis on model evaluation and summarizing the other project milestones.

Project deliverables will be evaluated using specific rubrics. All criteria will be assessed on a 4-level scale: unacceptable, average, good, excellent.
Note: project team grades may be weighted differently for each participant in case the team judges that some team members have not contributed significantly to the project.

Grading Scheme: There is no standard relationship between percentages and letter grades, and no fixed rule for translating number grades into letter grades. The course grade will be based on the relative percentages assigned to the assignments, the project and the exams.
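
As a sanity check on the weights stated in the Course Evaluation section, the sketch below combines component scores into a final percentage. The weights are taken from this outline; the function name `final_percentage` and the example scores are hypothetical, and the result is a percentage only, since no percentage-to-letter mapping is defined.

```python
# Component weights (in % of the final grade), as stated in this outline.
WEIGHTS = {
    "lab_assignments": 15,
    "midterm": 20,
    "final_exam": 40,
    "project": 25,
}

# Breakdown of the 25% project component, as stated in this outline.
PROJECT_WEIGHTS = {
    "team_registration": 1,
    "participation": 3,
    "project_summary": 3,
    "project_data_model": 10,
    "final_presentation": 8,
}

# The stated weights are internally consistent.
assert sum(WEIGHTS.values()) == 100
assert sum(PROJECT_WEIGHTS.values()) == WEIGHTS["project"]


def final_percentage(scores):
    """Weighted sum of component scores, each given as a fraction in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    example = {
        "lab_assignments": 0.90,
        "midterm": 0.70,
        "final_exam": 0.80,
        "project": 0.85,
    }
    print(round(final_percentage(example), 2))
```

The sketch ignores the per-participant weighting of project grades mentioned in the note above, which is decided case by case.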

Academic Integrity

Violation of the Academic Code of Conduct in any form will be dealt with severely. This includes copying (even with modifications) of program segments. You must demonstrate independent thought through your submitted work. For more information, see http://www.concordia.ca/students/academic-integrity.html.
