Course Outline, Winter 2021
Big Data Analytics
SOEN 471 / SOEN 6111

Instructors

Coordinator: Dr. Tristan Glatard
e-mail: tristan.glatard@concordia.ca
Regular (online) office hours: Tuesday 4pm - 5pm or by appointment. A Zoom link will be posted on Moodle.
Week of Feb 8: Tuesday 5pm-6pm.

Teaching Assistants:

Lectures & Labs

Zoom links will be posted on Moodle.

Moodle page

Objectives

Big Data analytics has been transforming industry and science in various domains for the past few years, making possible the processing of Terabytes of data on a daily basis. This was enabled by the joint evolution of programming models, data-analysis algorithms and computing infrastructures.

This course introduces the concepts and some of the main algorithms used for Big Data analytics. It presents the principles of the Hadoop ecosystem and Apache Spark, and details the main algorithms for the analysis of large datasets: similarity search, mining of frequent itemsets, graph analysis, clustering, stream mining, recommender systems and advertising.

By the end of this course, students will be able to write and deploy efficient parallel algorithms to analyze Big Data sources for various applications.

Communication

Important information will be communicated through Moodle and/or Slack. Students are expected to consult these channels regularly.

Students are also encouraged to communicate about course topics among themselves, with their TAs, and with the professor. Frequent communication is key to successful learning! However, to ensure a viable environment, the following rules must be respected, in particular for communications happening on Slack: These rules are meant to ensure that most questions can be answered while keeping a reasonable load on the instructors.

The instructors are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of students or TAs in any form. Sexual language and imagery are not appropriate for any communication, in particular on Slack. For more information, please consult Concordia's policy on harassment.

Schedule

| Date | Lecture | Lab | Readings | Assignments |
|---|---|---|---|---|
| Jan 13 | Introduction | None | None | None |
| Jan 20 | Data locality (Hadoop MapReduce and HDFS) | Git, GitHub, Travis-CI, Python, pytest | | None |
| Jan 27 | In-memory computing and lazy evaluation (Apache Spark, Dask) | Spark RDDs and DataFrames, intro to LA1 | | Project teams (2 members) must be registered by Jan 29, 11:55pm. Registration link |
| Feb 3 | Supervised Learning | Dask | | Project GitHub repositories must be created by Feb 5, 11:55pm. Registration link |
| Feb 10 | Recommender Systems | scikit-learn | | LA1: "Spark RDD and DataFrame APIs, Dask". Due date: Feb 12, 11:55pm. Link: https://classroom.github.com/a/e7sFtQpz |
| Feb 17 | Clustering | Introduction to LA2; project clinic (project definition, method, dataset) | | Project proposal. Due date: Feb 19, 11:55pm |
| Feb 24 | Frequent Itemsets | Introduction to LA3 | | LA2: "Recommender systems". Due date: Feb 26, 11:55pm. Link: https://classroom.github.com/a/FtDQw7to |
| Mid-term break | | | | |
| Mar 10 | Midterm exam | None | None | None |
| Mar 17 | Similarity Search | Introduction to LA4 | | None |
| Mar 24 | Data Streams | LA4 | | LA3: "Clustering and Frequent Itemsets". Due date: Apr 2, 11:55pm. Link: https://classroom.github.com/a/UYpMRSbI |
| Mar 31 | Graph Analysis | Project clinic (debugging, results discussion) | | None |
| Apr 7 | Project presentations (I) | Research presentations | | None |
| Apr 14 | Project presentations (II) | None | None | Project report. Due date: Apr 16, 11:55pm. LA4. Due date: Apr 16, 11:55pm. Link: https://classroom.github.com/a/J0XB3ioO |
| May 5, 7pm-10pm (tentative) | Final Exam | | | |

Please note: In the event of extraordinary circumstances beyond the University's control, the content and/or evaluation scheme in this course is subject to change.

Book

A significant portion of the slides presented from session 4 onward will be taken from http://www.mmds.org. This website also has useful videos explaining the slides.

Course Evaluation

Lab assignments (30%): You will be required to develop data analysis programs in Python using Apache Spark or Dask. There will be a total of four assignments. You must work on these assignments individually. The lab assignments are all due on a Friday evening, 11:55pm (see exact dates on the schedule table). A grace period of 48 hours will be automatically granted (assignments will be accepted until Sunday night, 11:55pm), but no further extension will be granted. Assignments must be submitted through GitHub Classroom; you will receive a link for each assignment.
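
The grace-period rule can be checked with a small date computation. The sketch below is illustrative only: the helper name `is_accepted` is hypothetical, the LA1 due date is taken from the schedule, and time zones are ignored (naive datetimes are an assumption, not a course rule).

```python
from datetime import datetime, timedelta

# 48-hour grace period, as stated in the outline.
GRACE_PERIOD = timedelta(hours=48)


def is_accepted(due: datetime, submitted: datetime) -> bool:
    """Return True if the submission arrives by the due date plus the grace period."""
    return submitted <= due + GRACE_PERIOD


# LA1 due date from the schedule: Friday, Feb 12, 2021, 11:55pm.
la1_due = datetime(2021, 2, 12, 23, 55)
final_cutoff = la1_due + GRACE_PERIOD

# Hypothetical submission times, for illustration only.
on_sunday_evening = datetime(2021, 2, 14, 20, 0)
on_monday_morning = datetime(2021, 2, 15, 0, 5)

print(final_cutoff)                             # last accepted moment for LA1
print(is_accepted(la1_due, on_sunday_evening))  # inside the grace period
print(is_accepted(la1_due, on_monday_morning))  # past the grace period
```

For a Friday 11:55pm deadline, the cutoff lands on Sunday at 11:55pm, which matches the wording above.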

Exams (40%): There will be a mid-term and a final exam. Exams will be Moodle quizzes. The midterm will be conducted in-class and will count for 10% of the final grade. The final exam will count for 30% of the final grade. There will be no substitution for a missed exam.

Project (30%): The project should fall in one of the following 3 categories:

No project template will be provided: you are expected to define your own project based on the instructions above. If you are doing a Master's or PhD thesis, you are strongly encouraged to define a project based on your own research. Other types of (relevant) projects are welcome and can be discussed with the instructor during office hours or on Slack, or with the TAs during the lab sessions. You must work on the project in teams of 2. The project must involve substantial software development, released on GitHub. The project has the following milestones, with deadlines indicated on the schedule. No deadline extension will be granted.

  1. The project registration (1%) will simply declare the team of 2 students that will work on the project. No team update will be accepted after the registration.
  2. The project GitHub repository (1%) will be a public or private GitHub repository containing the software, results and report of the project. If the repository is private, you must authorize GitHub user "glatard" to access it. All the subsequent milestones have to be submitted through your GitHub repository. In particular, the project proposal and report will be written in the README.md file in the repository. At this stage, your README file must contain a 100-word abstract of the project.
  3. The project proposal (3%) will be a revision of the README.md file in your project GitHub repository, with the following structure:
    • Abstract (100 words)
    • I. Introduction (300 words): context, objectives, presentation of the problem to solve, related work.
    • II. Materials and Methods (400 words): the dataset(s), technologies and algorithms that will be used.
    Even though the project proposal is only worth 3% of the final grade, you are strongly encouraged to take it seriously as it is your chance to get formal feedback on the project before the final deliverables. Besides, the grade obtained for the project proposal is a good predictor of the grades obtained for the project report and presentation.
  4. The project report (15%) will be a revision of the README.md file in your project GitHub repository, with the following structure:
    • Abstract (100 words)
    • I. Introduction (300 words): context, objectives, presentation of the problem to solve, related work.
    • II. Materials and Methods (400 words): the dataset(s), technologies and algorithms that will be used.
    • III. Results (300 words): a description of the result of the study (dataset analysis, technology comparison or implementation), with quantitative data obtained by the project team (graphs, tables, metrics, etc).
    • IV. Discussion (300 words): a discussion of the relevance of the solution(s), of the limitations and of possible future work.
    Project proposals and reports will be evaluated using the following criteria:
    • Originality (choice of the dataset, technology, problem) (10%)
    • Clarity (writing, organization, formatting) (20%)
    • Relevance to the course topics (10%)
    • Technical quality, including code comprehensiveness and quality (60%)
    All criteria will be assessed on a 4-level scale: unacceptable, average, good, excellent.
  5. The project presentation (10%) will be a 5- to 10-minute presentation of the project. It will be evaluated using the following criteria:
    • Clarity (slides and speech) (2%)
    • Relevance (2%)
    • Technical quality (6%)
    Expect 1 or 2 questions after your presentation. All criteria will be assessed on a 4-level scale: unacceptable, average, good, excellent.

Grading Scheme: There is no standard relationship between percentages and letter grades, and no definite rule for translating number grades into letter grades. Grading will be based on the relative percentages assigned to the assignments, the project and the exams.
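
The weights stated in this section (lab assignments 30%, midterm 10%, final exam 30%, project 30%) combine as a simple weighted sum. The sketch below is only an illustration: the function name `final_percentage` and the example scores are hypothetical, and the result is a percentage, not a letter grade, since no conversion rule is defined.

```python
# Weights from the Course Evaluation section (percent of the final grade).
WEIGHTS = {
    "labs": 30,
    "midterm": 10,
    "final_exam": 30,
    "project": 30,
}


def final_percentage(scores: dict) -> float:
    """Combine per-component scores (each on a 0-100 scale) into a weighted percentage."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("component weights must sum to 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {"labs": 80, "midterm": 70, "final_exam": 75, "project": 90}
print(final_percentage(example_scores))  # 80.5
```

The project weight of 30 is itself the sum of the milestone weights listed above (1 + 1 + 3 + 15 + 10).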

Academic Integrity

Violation of the Academic Code of Conduct in any form will be dealt with severely. This includes copying (even with modifications) of program segments. You must demonstrate independent thought through your submitted work. For more information, see http://www.concordia.ca/students/academic-integrity.html.