Course Outline, Winter 2022
Big Data Analytics
SOEN 471 / SOEN 6111

Instructors

Coordinator: Dr. Tristan Glatard
e-mail: tristan.glatard@concordia.ca
Regular (online) office hours: Wednesday 2pm - 3pm or by appointment. A Zoom link will be posted on Moodle.

Teaching Assistants:

Lectures & Labs

Zoom links will be posted on Moodle.

Moodle page

Objectives

Big Data analytics has been transforming industry and science in various domains for the past few years, making it possible to process terabytes of data on a daily basis. This was enabled by the joint evolution of programming models, data-analysis algorithms and computing infrastructures.

This course introduces the concepts and some of the main algorithms used for Big Data analytics. It presents the principles of the Hadoop ecosystem and Apache Spark, and it details the main algorithms for the analysis of large datasets, related to similarity search, mining of frequent itemsets, graph analysis, clustering, stream mining, recommender systems and advertising.

By the end of this course, students will be able to write and deploy efficient parallel algorithms to analyze Big Data sources for various applications.

Communication

Important information will be communicated through Moodle and/or Slack. Students are expected to consult these channels regularly.

Students are also encouraged to communicate about course topics among themselves, with their TAs, and with the professor. Frequent communication is key to successful learning! However, to ensure a viable environment, the following rules must be respected, in particular for communications happening on Slack:

These rules are meant to ensure that most questions can be answered while keeping a reasonable load on the instructors.

The instructors are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of students or TAs in any form. Sexual language and imagery are not appropriate for any communication, in particular on Slack. For more information, please consult Concordia's policy on harassment.

Schedule

| Date | Lecture | Assignments Lab | Project Clinic Lab | Deliverables |
|---|---|---|---|---|
| Jan 12 | Introduction | None | None | None |
| Jan 19 | Data locality (Hadoop MapReduce and HDFS) | Even team ids: Git, GitHub, Python, pytest | Odd team ids: Project definition | Project teams (4 members) must be registered by Jan 17, 11:55pm |
| Jan 26 | In-memory computing and lazy evaluation (Apache Spark, Dask) | Odd team ids: Git, GitHub, Python, pytest | Even team ids: Project definition | None |
| Feb 2 | Supervised Learning | Even team ids: Spark RDDs and DataFrames, intro to LA1 | Odd team ids: Data model design | None |
| Feb 9 | Recommender Systems | Odd team ids: Spark RDDs and DataFrames, intro to LA1 | Even team ids: Data model design | None |
| Feb 16 | Clustering | Even team ids: Help with LA1 | Odd team ids: Data preparation | Project summary (due during project clinic) |
| Feb 23 | Frequent Itemsets | Odd team ids: Help with LA1 | Even team ids: Data preparation | Project summary (due during project clinic); LA1 (due Feb 25, 11:55pm) |
| | Mid-term break | | | |
| Mar 9 | Midterm exam | None | None | None |
| Mar 16 | Data Streams | Even team ids: Introduction to LA2 | Odd team ids: Model implementation | None |
| Mar 23 | Graph Analysis | Odd team ids: Introduction to LA2 | Even team ids: Model implementation | None |
| Mar 30 | Similarity Search | Even team ids: Introduction to LA3 | Odd team ids: Model evaluation | LA2 (due Apr 1, 11:55pm); Project data model (due during project clinic) |
| Apr 6 | Dimensionality Reduction | Odd team ids: Introduction to LA3 | Even team ids: Model evaluation | Project data model (due during project clinic) |
| Apr 13 | Project presentations | None | Project presentations | LA3 (due Apr 15, 11:55pm) |

Final Exam: Apr 27, 7pm-10pm
LS105 and LS208 (undergraduate students)
LS208 and LS210 (graduate students)

Please note: In the event of extraordinary circumstances beyond the University's control, the content and/or evaluation scheme in this course is subject to change.

Book

A significant portion of the slides presented from session 4 onward will be taken from http://www.mmds.org. This website also has useful videos explaining the slides.

Course Evaluation

Lab assignments (25%): You will be required to develop data analysis programs in Python using Apache Spark or Dask. There will be a total of three assignments. You must work on these assignments individually. The lab assignments are all due on a Friday evening, 11:55pm (see exact dates on the schedule table). A grace period of 48 hours will be automatically granted (assignments will be accepted until Sunday night, 11:55pm), but no further extension will be granted. Assignments must be submitted through GitHub Classroom; you will receive a link for each assignment.
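
The deadline rule above is simple date arithmetic. The following is a minimal sketch of it, not an official checker: the helper name `is_accepted` is hypothetical, times are treated as naive local datetimes, and the LA1 due date is the one listed in the schedule.

```python
from datetime import datetime, timedelta

# Grace period stated in the lab assignment rules.
GRACE_PERIOD = timedelta(hours=48)


def is_accepted(submitted_at: datetime, due_at: datetime) -> bool:
    """Return True if a submission arrives no later than the due date plus grace period.

    Hypothetical helper for illustration; times are naive local datetimes.
    """
    return submitted_at <= due_at + GRACE_PERIOD


# LA1 is due Friday, Feb 25, 2022, at 11:55pm (from the schedule).
la1_due = datetime(2022, 2, 25, 23, 55)

print(is_accepted(datetime(2022, 2, 27, 23, 55), la1_due))  # Sunday, 11:55pm
print(is_accepted(datetime(2022, 2, 28, 0, 5), la1_due))    # Monday, 12:05am
```

A submission made on Sunday at 11:55pm falls exactly on the end of the grace period; anything later is outside it.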

Exams (40%): There will be a mid-term and a final exam. Exams will be Moodle quizzes. The midterm will be conducted in-class and will count for 10% of the final grade. The final exam will count for 30% of the final grade. There will be no substitution for a missed exam.

Project (35%): This course will walk you through the definition and implementation of a data-science project using Big Data technologies. During the project clinics and lectures, the instructors will guide you through the following milestones:

  1. Project definition: you will define your own project by identifying: (1) a dataset of interest, (2) a set of research questions to be answered with the dataset, using techniques studied in class. If you are doing a Master's or PhD thesis, you are encouraged to define a project linked to your research topic.
  2. Model design: choose a class of models in {supervised learning, recommender system, clustering, frequent itemsets}. Outline how the data model could be applied to your dataset to answer your research question. Research algorithms and techniques to implement this class of model.
  3. Data preparation: inspect the dataset, identify missing data, outliers, data types (categorical data in particular), and write Spark or Dask programs to correct for potential issues.
  4. Model implementation: implement the model with Spark, Dask or scikit-learn.
  5. Model evaluation: identify evaluation metrics for your model. Implement and discuss them.
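
As an illustration of the data preparation milestone, here is a minimal sketch of the three checks it names (missing data, outliers, categorical columns). It rests on several assumptions: plain Python stands in for Spark or Dask, the records and column names are made up, the function names are hypothetical, and the 1.5 x IQR outlier rule is one common convention rather than a method prescribed by the course.

```python
from statistics import quantiles

# Toy records standing in for a real dataset (hypothetical values).
rows = [
    {"age": 23, "city": "Montreal"},
    {"age": 25, "city": "Laval"},
    {"age": None, "city": "Montreal"},
    {"age": 24, "city": None},
    {"age": 26, "city": "Quebec"},
    {"age": 95, "city": "Laval"},
    {"age": 22, "city": "Montreal"},
    {"age": 27, "city": "Quebec"},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts


def categorical_columns(rows):
    """Treat a column as categorical if all its non-missing values are strings."""
    result = []
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        if present and all(isinstance(v, str) for v in present):
            result.append(col)
    return result


def iqr_outliers(values):
    """Return non-missing values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]


print(missing_counts(rows))
print(categorical_columns(rows))
print(iqr_outliers([r["age"] for r in rows]))
```

In an actual project the same checks would be expressed with Spark or Dask DataFrame operations so that they scale beyond a single machine.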
Projects will have the following deliverables. Deadlines are indicated on the schedule. No deadline extension will be granted.
  1. Team registration (5%) Easy points! Please register your team of 4 students on time to allow for a smooth organization of project clinics.
  2. Project summary (10%) The project summary will be a 400-word abstract available as a Markdown (.md) document in a GitHub repository. The summary will report on project definition and model design. It will describe the dataset and its main characteristics (number and type of features), the research questions to be addressed in the project, the class of models to be applied to the dataset, and the algorithms that will be used. At least two algorithms must be used and compared. The project summary will be evaluated during the project clinics (see schedule table).
  3. Project data model (10%) The project data model will be delivered as a Jupyter notebook containing code and explanations to implement data preparation, model training and (optional) model evaluation. The project data model will be evaluated during the project clinics (see schedule table).
  4. Project presentation (10%) The project presentation will be delivered during the last week of the course as a 6-10 minute presentation putting special emphasis on model evaluation and summarizing the other project milestones.
Project deliverables will be evaluated using the following criteria: All criteria will be assessed on a 4-level scale: unacceptable, average, good, excellent.

Grading Scheme: The grading of the course will be based on the relative percentages assigned to the assignments, project and exams. There is no standard relationship between percentages and letter grades, and no definite rule for translating number grades to letter grades.
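
The weights stated in this outline can be combined as in the sketch below. It is an illustration only: the function name and dictionary keys are hypothetical, each component is assumed to be scored from 0 to 100, and the sketch stops at a percentage because no mapping to letter grades is defined.

```python
# Weights (in percent) taken from the evaluation scheme in this outline.
WEIGHTS = {
    "lab_assignments": 25,
    "midterm": 10,
    "final_exam": 30,
    "team_registration": 5,
    "project_summary": 10,
    "project_data_model": 10,
    "project_presentation": 10,
}


def final_percentage(scores):
    """Combine per-component scores (each in 0..100) into a weighted percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Sanity check: the component weights cover the whole grade.
assert sum(WEIGHTS.values()) == 100

# Hypothetical student scoring 80 on every component.
print(final_percentage({name: 80 for name in WEIGHTS}))
```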

Academic Integrity

Violation of the Academic Code of Conduct in any form will be severely dealt with. This includes copying (even with modifications) of program segments. You must demonstrate independent thought through your submitted work. Click on the following link for more information: http://www.concordia.ca/students/academic-integrity.html.