Course Outline, Winter 2024
Big Data Analytics
SOEN 471 / SOEN 6111

Instructors

Coordinator: Dr. Tristan Glatard
e-mail: tristan.glatard@concordia.ca
Regular (online or onsite) office hours: Wednesday 4pm - 5pm or by appointment. Office: ER 9.919. A Zoom link will be posted on Moodle.

Teaching Assistants:

Lectures & Labs

Objectives

Big Data analytics has been transforming industry and science in various domains for the past few years, making it possible to process terabytes of data on a daily basis. This was enabled by the joint evolution of programming models, data-analysis algorithms and computing infrastructures.

This course introduces the concepts and some of the main algorithms used for Big Data analytics. It presents the principles of the Hadoop ecosystem and Apache Spark, and details the main algorithms for the analysis of large datasets, related to similarity search, mining of frequent itemsets, graph analysis, clustering, stream mining, recommender systems and advertising.

By the end of this course, students will be able to write and deploy efficient parallel algorithms to analyze Big Data sources for various applications.

Communication

Important information will be communicated through Moodle and/or Slack. Students are expected to consult these channels regularly.

Students are also encouraged to communicate about course topics among themselves, with their TAs, and with the professor. Frequent communication is key to successful learning! However, to ensure a viable environment, the following rules must be respected, in particular for communications happening on Slack: These rules are meant to ensure that most questions can be answered while keeping a reasonable load on the instructors.

The instructors are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of students or TAs in any form. Sexual language and imagery are not appropriate for any communication, in particular on Slack. For more information, please consult Concordia's policy on harassment.

Schedule

| Date | Lecture | Assignments Lab | Project Clinic Lab | Deliverables |
|---|---|---|---|---|
| Jan 17 | Introduction | None | None | None |
| Jan 24 | Data locality (Hadoop MapReduce and HDFS) | Even team ids: Git, GitHub, Python, pytest | Uneven team ids: Project definition | Project teams (4 members) must be registered by Jan 23, 11:55pm |
| Jan 31 | In-memory computing and lazy evaluation (Apache Spark, Dask) | Uneven team ids: Git, GitHub, Python, pytest | Even team ids: Project definition | None |
| Feb 7 | Supervised Learning | Even team ids: Spark RDDs and DataFrames, intro to LA1 | Uneven team ids: Data model design | None |
| Feb 14 | Recommender Systems | Uneven team ids: Spark RDDs and DataFrames, intro to LA1 | Even team ids: Data model design | None |
| Feb 21 | Clustering | Even team ids: Introduction to LA2 | Uneven team ids: Data preparation | LA1 (due date: Feb 23, 11:55pm); Project summary (due during project clinic) |
| | Mid-term break | | | |
| Mar 6 | Midterm exam | None | None | None |
| Mar 13 | Frequent Itemsets | Uneven team ids: Introduction to LA2 | Even team ids: Data preparation | LA2 (due date: Mar 15, 11:55pm); Project summary (due during project clinic) |
| Mar 20 | Data Streams | Even team ids: Introduction to LA3 | Uneven team ids: Model implementation | Project data model (due during project clinic) |
| Mar 27 | Graph Analysis | Uneven team ids: Introduction to LA3 | Even team ids: Model implementation | Project data model (due during project clinic) |
| Apr 3 | Similarity Search | All teams: Help with LA3 | Uneven team ids: Project presentations | LA3 (due date: Apr 5, 11:55pm) |
| Apr 10 | Project presentations | | | |
| Apr 24, 7-10pm | Final Exam (Rooms: MBS2.105, MBS2.115, MBS2.285, MBS2.330, MBS2.401 and MBS2.330) | | | |

Please note: In the event of extraordinary circumstances beyond the University's control, the content and/or evaluation scheme in this course is subject to change.

Book

From session 5 onward, a significant portion of the slides will be taken from http://www.mmds.org. This website also has useful videos explaining the slides.

Course Evaluation

Lab assignments (15%): You will be required to develop data analysis programs in Python using Apache Spark or Dask. There will be a total of three assignments. You must work on these assignments individually. All lab assignments are due on a Friday evening at 11:55pm (see exact dates in the schedule table). A grace period of 48 hours will be granted automatically (assignments will be accepted until Sunday night, 11:55pm), but no further extension will be granted. Assignments must be submitted through GitHub Classroom; you will receive a link for each assignment.
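As an illustration of the deadline rule above, the following minimal sketch computes the last accepted submission time for each lab assignment. The dates are copied from the schedule table; the function names (`final_cutoff`, `is_accepted`) are hypothetical, and the timestamps are naive local times, which is an assumption since the outline does not state a time zone.

```python
from datetime import datetime, timedelta

# Nominal deadlines from the schedule table (Fridays, 11:55pm, Winter 2024).
# Naive datetimes: local time is assumed, the outline gives no time zone.
NOMINAL_DEADLINES = {
    "LA1": datetime(2024, 2, 23, 23, 55),
    "LA2": datetime(2024, 3, 15, 23, 55),
    "LA3": datetime(2024, 4, 5, 23, 55),
}

# Automatic grace period stated in the evaluation section.
GRACE_PERIOD = timedelta(hours=48)


def final_cutoff(assignment: str) -> datetime:
    """Return the last moment a submission is accepted (deadline + grace period)."""
    return NOMINAL_DEADLINES[assignment] + GRACE_PERIOD


def is_accepted(assignment: str, submitted_at: datetime) -> bool:
    """Hypothetical helper: True if the submission arrives before the cutoff."""
    return submitted_at <= final_cutoff(assignment)


if __name__ == "__main__":
    for name in NOMINAL_DEADLINES:
        # Prints the weekday and date of each final cutoff.
        print(name, final_cutoff(name).strftime("%A %b %d, %H:%M"))
```

Each cutoff falls on the Sunday following the nominal Friday deadline, which matches the rule stated above.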

Exams (60%): There will be a mid-term and a final exam. All exams will be closed book and no electronic devices will be allowed except ENCS calculators. Exams will be Multiple-Choice Questionnaires. The midterm will be conducted in class and will count for 20% of the final grade. The final exam will count for 40% of the final grade. There will be no substitution for a missed exam.

Project (25%): This course will walk you through the definition and implementation of a data-science project using Big Data technologies. During the project clinics and lectures, the instructors will guide you through the following milestones:

  1. Project definition:
    • Option 1: Own Project. You will define your own project by identifying: (1) a dataset of interest, (2) a set of research questions to be answered with the dataset, using techniques studied in class. If you are doing a Master's or PhD thesis, you are encouraged to define a project linked to your research topic.
    • Option 2: Industrial Project. You will select a project proposed by a company through the Riipen platform. Industrial projects will be validated by the instructors and published on January 24.
    Both project types are expected to follow the same milestones (described below). Milestones for industrial projects might be slightly revised depending on the exact nature of the projects.
  2. Model design: choose a class of models in {supervised learning, recommender system, clustering, frequent itemset}. Outline how the data model could be applied to your dataset to answer your research question. Research algorithms and techniques to implement this class of model.
  3. Data preparation: inspect the dataset, identify missing data, outliers, data types (categorical data in particular), and write Spark or Dask programs to correct for potential issues.
  4. Model implementation: implement the model with Spark, Dask or scikit-learn.
  5. Model evaluation: identify evaluation metrics for your model. Implement and discuss them.
Projects will have the following deliverables. Deadlines are indicated on the schedule. No deadline extension will be granted.
  1. Team registration (1%) Easy point! Please register your team of 4 students on time to allow for a smooth organization of project clinics.
  2. Participation (3%) Attendance and participation in project clinics. Project clinics will follow this template:
    • Project team presents updates to class (5')
    • Instructor gives feedback and directions (5'). Audience may ask questions.
    • Feedback on project milestones (3').
    Don't miss an opportunity to get feedback on your project!
  3. Project summary (3%) The project summary will be a 400-word abstract available as a Markdown (.md) document in a GitHub repository. The summary will report on project definition and model design. It will describe the dataset and its main characteristics (number and type of features), the research questions to be addressed in the project, the class of models to be applied to the dataset, and the algorithms that will be used. At least two algorithms must be used and compared. The project summary will be evaluated during the project clinics (see schedule table).
  4. Project data model (10%) The project data model will be delivered as a Jupyter notebook containing code and explanations to implement data preparation, model training and (optional) model evaluation. The project data model will be evaluated during the project clinics (see schedule table).
  5. Final project presentation (8%) The project presentation will be delivered during the last week of the course as a 6-10 minute presentation putting special emphasis on model evaluation and summarizing the other project milestones.
Project deliverables will be evaluated using specific rubrics. All criteria will be assessed on a 4-level scale: unacceptable, average, good, excellent.
Note: project team grades may be weighted differently for each participant in case the team judges that some team members have not contributed significantly to the project.

Grading Scheme: The course grade will be computed from the relative percentages assigned to the assignments, project and exams. There is no standard relationship or definite rule for translating number grades to letter grades.
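To make the weighting concrete, here is a minimal sketch of how the component percentages listed in this section combine into a final percentage. The weights are taken from the text; the function name `final_percentage` and the representation of each score as a fraction in [0, 1] are assumptions for illustration. The sketch deliberately stops at a percentage, since no percentage-to-letter mapping is defined, and it ignores the per-member weighting that may apply to team grades.

```python
# Weights taken from the evaluation section (percent of the final grade).
WEIGHTS = {
    "lab_assignments": 15,
    "midterm": 20,
    "final_exam": 40,
    "team_registration": 1,
    "participation": 3,
    "project_summary": 3,
    "project_data_model": 10,
    "project_presentation": 8,
}


def final_percentage(scores: dict[str, float]) -> float:
    """Combine per-component scores (each assumed in [0, 1]) into a percentage."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical student: full marks everywhere except 50% on the final exam.
    example = {name: 1.0 for name in WEIGHTS}
    example["final_exam"] = 0.5
    print(final_percentage(example))
```

The five project deliverables (1 + 3 + 3 + 10 + 8) add up to the 25% allotted to the project, and all components together add up to 100%.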

Academic Integrity

Violation of the Academic Code of Conduct in any form will be dealt with severely. This includes copying (even with modifications) of program segments. You must demonstrate independent thought through your submitted work. For more information, see http://www.concordia.ca/students/academic-integrity.html.