Coordinator: Dr. Tristan Glatard
e-mail: tristan.glatard@concordia.ca
Regular (online) office hours: Wednesday 2pm - 3pm or by appointment. A Zoom link will be posted on Moodle.
Teaching Assistants:
Zoom links will be posted on Moodle.
Big Data analytics has been transforming industry and science in various domains for the past few years, making it possible to process terabytes of data on a daily basis. This was enabled by the joint evolution of programming models, data-analysis algorithms and computing infrastructures.
This course introduces the concepts and some of the main algorithms used for Big Data analytics. It presents the principles of the Hadoop ecosystem and Apache Spark, and it details the main algorithms for the analysis of large datasets, related to similarity search, mining of frequent itemsets, graph analysis, clustering, stream mining, recommender systems and advertising.
By the end of this course, students will be able to write and deploy efficient parallel algorithms to analyze Big Data sources for various applications.
Important information will be communicated through Moodle and/or Slack. Students are expected to consult these channels regularly.
The instructors are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of students or TAs in any form. Sexual language and imagery are not appropriate for any communication, in particular on Slack. For more information, please consult Concordia's policy on harassment.
Date | Lecture | Assignments Lab | Project Clinic Lab | Deliverables |
---|---|---|---|---|
Jan 12 | Introduction | None | None | None |
Jan 19 | Data locality (Hadoop MapReduce and HDFS) | Even team ids: Git, GitHub, Python, pytest | Odd team ids: Project definition | Project teams (4 members) must be registered by Jan 17, 11:55pm |
Jan 26 | In-memory computing and lazy evaluation (Apache Spark, Dask) | Odd team ids: Git, GitHub, Python, pytest | Even team ids: Project definition | None |
Feb 2 | Supervised Learning | Even team ids: Spark RDDs and DataFrames, intro to LA1 | Odd team ids: Data model design | None |
Feb 9 | Recommender Systems | Odd team ids: Spark RDDs and DataFrames, intro to LA1 | Even team ids: Data model design | None |
Feb 16 | Clustering | Even team ids: Help with LA1 | Odd team ids: Data preparation | Project summary due during project clinic |
Feb 23 | Frequent Itemsets | Odd team ids: Help with LA1 | Even team ids: Data preparation | Project summary due during project clinic; LA1 due Feb 25, 11:55pm |
Mar 9 | Midterm exam | None | None | None |
Mar 16 | Data Streams | Even team ids: Introduction to LA2 | Odd team ids: Model implementation | None |
Mar 23 | Graph Analysis | Odd team ids: Introduction to LA2 | Even team ids: Model implementation | None |
Mar 30 | Similarity Search | Even team ids: Introduction to LA3 | Odd team ids: Model evaluation | LA2 due Apr 1, 11:55pm; project data model due during project clinic |
Apr 6 | Dimensionality Reduction | Odd team ids: Introduction to LA3 | Even team ids: Model evaluation | Project data model due during project clinic |
Apr 13 | Project presentations | None | Project presentations | LA3 due Apr 15, 11:55pm |
Apr 27, 7pm-10pm | LS105 and LS208 (undergraduate students); LS208 and LS210 (graduate students) |
Please note: In the event of extraordinary circumstances beyond the University's control, the content and/or evaluation scheme in this course is subject to change.
Lab assignments (25%): You will be required to develop data analysis programs in Python using Apache Spark or Dask. There will be a total of three assignments. You must work on these assignments individually. The lab assignments are all due on a Friday evening at 11:55pm (see the exact dates in the schedule table). A grace period of 48 hours will be granted automatically (assignments will be accepted until Sunday night, 11:55pm), but no further extension will be granted. Assignments must be submitted through GitHub Classroom; you will receive a link for each assignment.
Exams (40%): There will be a midterm and a final exam. Exams will be Moodle quizzes. The midterm will be conducted in class and will count for 10% of the final grade. The final exam will count for 30% of the final grade. There will be no substitution for a missed exam.
Project (35%): This course will walk you through the definition and implementation of a data-science project using Big Data technologies. During the project clinics and lectures, the instructors will guide you through the following milestones:
Grading Scheme: There is no standard relationship between percentages and letter grades. Grades will be based on the relative percentages assigned to the assignments, the project and the exams. There is no definite rule for translating number grades to letter grades.
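As an illustration of how the percentages above combine, here is a minimal Python sketch. The weights come from the evaluation scheme (lab assignments 25%, midterm 10%, final exam 30%, project 35%). The function name, the dictionary keys and the example scores are hypothetical and only serve the illustration. The result is a numerical percentage only; as stated above, it does not map to a letter grade by any fixed rule.

```python
# Weights from the evaluation scheme, in percent of the final grade.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "lab_assignments": 25,
    "midterm": 10,
    "final_exam": 30,
    "project": 35,
}


def weighted_percentage(scores):
    """Combine per-component scores (each out of 100) into a course percentage."""
    # Sanity check: the component weights must cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up scores, for illustration only.
example = {"lab_assignments": 80, "midterm": 70, "final_exam": 75, "project": 90}
print(weighted_percentage(example))  # prints 81.0
```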