Course Outline, Winter 2024
Big Data Analytics
SOEN 471 / SOEN 6111
Instructors
Coordinator: Dr. Tristan Glatard
e-mail: tristan.glatard@concordia.ca
Regular (online or onsite) office hours: Wednesday 4pm - 5pm or by appointment. Office: ER 9.919. A Zoom link will be posted on Moodle.
Teaching Assistants:
Lectures & Labs
- Lectures: Wednesday 5:45PM - 8:15PM. H 110.
- Labs:
  - Wednesday 3:45PM - 5:35PM.
    - Room H903 (assignment lab #1, Mathieu)
    - Room H849 (project clinic, Inés)
  - Wednesday 8:30PM - 10:20PM.
    - Room H817 (assignment lab #1, Mathieu)
    - Room H967 (assignment lab #2, Sephora)
    - Room H903 (project clinic, Inés)
Objectives
Big Data analytics has been transforming industry and
science in various domains for the past few years, making
possible the processing of terabytes of data on a daily
basis. This was enabled by the joint evolution of
programming models, data-analysis algorithms and computing
infrastructures.
This course introduces the concepts and some of the main
algorithms used for Big Data analytics. It presents the
principles of the Hadoop ecosystem and Apache Spark, and it
details the main algorithms for the analysis of large
datasets, related to similarity search, mining of frequent
itemsets, graph analysis, clustering, stream mining,
recommender systems and advertising.
By the end of this course, students will be able to write
and deploy efficient parallel algorithms to analyze Big Data
sources for various applications.
Communication
Important information will be communicated through Moodle and/or Slack.
Students are expected to consult these channels regularly.
Students are also encouraged to communicate about course topics among
themselves, with their TAs, and with the professor. Frequent
communication is key to successful learning! However, to ensure a
viable environment, the following rules must be respected, in
particular for communications happening on Slack:
- Spend a few minutes searching for answers on your own before asking
other people.
- Keep an eye on the on-going discussions, and try to avoid asking
a question that has already been asked.
- Use Slack's thread feature to follow up on a question instead of
posting replies to the main channel.
- Don't hesitate to chime in on a discussion if you think you can help.
- Always use the public channel (#bigdata--winter2024) to ask your
questions. In particular, never exchange private messages with your
TA. If any personal matter has to be discussed, communicate with the
professor. If you really need to send a non-public message to your
TA, always involve the professor.
These rules are meant to ensure that most questions can be answered while keeping a
reasonable load on the instructors.
The instructors are dedicated to providing a
harassment-free experience for everyone, regardless of gender, gender
identity and expression, sexual orientation, disability, physical
appearance, body size, race, age or religion. We do not tolerate
harassment of students or TAs in any form. Sexual language and imagery
are not appropriate for any communication, in particular on Slack. For
more information, please consult Concordia's
policy on harassment.
Schedule
Date | Lecture | Assignments Lab | Project Clinic Lab | Deliverables |
Jan 17 | Introduction | None | None | None |
Jan 24 | Data locality (Hadoop MapReduce and HDFS) | Even team ids: Git, GitHub, Python, pytest. | Odd team ids: Project Definition | Project teams (4 members) must be registered by Jan 23, 11:55pm |
Jan 31 | In-memory computing and lazy evaluation (Apache Spark, Dask) | Odd team ids: Git, GitHub, Python, pytest. | Even team ids: Project Definition | None |
Feb 7 | Supervised Learning | Even team ids: Spark RDDs and DataFrames, intro to LA1. | Odd team ids: Data model design | None |
Feb 14 | Recommender Systems | Odd team ids: Spark RDDs and DataFrames, intro to LA1. | Even team ids: Data model design | None |
Feb 21 | Clustering | Even team ids: Introduction to LA2 | Odd team ids: Data preparation | LA1 due date: Feb 23, 11:55pm. Project summary due during project clinic |
Mid-term break |
Mar 6 | Midterm exam | None | None | None |
Mar 13 | Frequent Itemsets | Odd team ids: Introduction to LA2 | Even team ids: Data preparation | LA2 due date: Mar 15, 11:55pm. Project summary due during project clinic |
Mar 20 | Data Streams | Even team ids: Introduction to LA3 | Odd team ids: Model implementation | Project data model due during project clinic |
Mar 27 | Graph Analysis | Odd team ids: Introduction to LA3 | Even team ids: Model implementation | Project data model due during project clinic |
Apr 3 | Similarity Search | All teams: Help with LA3 | Odd team ids: Project presentations | LA3 due date: Apr 5, 11:55pm |
Apr 10 | Project presentations | |
Apr 24, 7-10pm | Final Exam (Rooms: MBS2.105, MBS2.115, MBS2.285, MBS2.330, MBS2.401 and MBS2.330) |
Please note: In the event of extraordinary circumstances beyond the University's control, the content
and/or evaluation scheme in this course is subject to change.
Book
- MMDS (Required):
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, beta version of the 3rd edition.
Available online.
A significant portion of the slides presented from session 5 onward will be taken
from http://www.mmds.org. This
website also has useful videos explaining the slides.
Course Evaluation
Lab assignments (15%): You will be required
to develop data analysis programs in Python using Apache
Spark or Dask. There will be a total of three assignments. You must
work on these assignments individually. The
lab assignments are all due on a Friday evening, 11:55pm
(see exact dates on the schedule
table). A grace period of 48 hours will be automatically
granted (assignments will be accepted until Sunday night,
11:55pm), but no further extension will be
granted. Assignments must be submitted through
GitHub Classroom; you will receive a link for each assignment.
Exams (60%): There will be a mid-term and a
final exam. All exams will be closed book, and no electronic devices will be allowed except ENCS calculators. Exams will be multiple-choice questionnaires. The midterm will be
conducted in-class and will count for 20% of the final grade. The
final exam will count for 40% of the final grade. There will be no
substitution for a missed exam.
Project (25%):
This course will walk you through the definition and implementation
of a data-science project using Big Data technologies. During the project clinics and lectures, the instructors
will guide you through the following milestones:
- Project definition:
- Option 1: Own Project. You will define your own project by
identifying: (1) a dataset of interest, (2) a set of research
questions to be answered with the dataset, using techniques studied in class.
If you are doing a
Master's or PhD thesis, you are encouraged to define a project linked
to your research topic.
- Option 2: Industrial Project. You will select a project proposed by a company through the Riipen platform. Industrial projects
will be validated by the instructors and published on January 24.
Both project types are expected to follow the same milestones (described below). Milestones for industrial projects might be slightly revised depending on the exact nature of the projects.
- Model design: choose a class of models in {supervised learning,
recommender system, clustering, frequent itemset}. Outline how the data model
could be applied to your dataset to answer your research question.
Research algorithms and techniques to implement this class of
model.
- Data preparation: inspect the dataset, identify missing data,
outliers, data types (categorical data in particular), and write
Spark or Dask programs to correct for potential issues.
- Model implementation: implement the model with Spark, Dask or
scikit-learn.
- Model evaluation: identify evaluation metrics for your model. Implement and discuss them.
Projects will have the following deliverables. Deadlines are indicated on
the schedule. No deadline extension
will be granted.
- Team registration (1%) Easy point! Please
register your team of 4 students on time to allow
for a smooth organization of project clinics.
- Participation (3%) Attendance at and participation in project clinics. Project clinics will follow this template:
- Project team presents updates to class (5')
- Instructor gives feedback and directions (5'). Audience may ask questions.
- Feedback on project milestones (3').
Don't miss an opportunity to get feedback on your project!
- Project summary (3%) The project summary will
be a 400-word abstract available as a Markdown (.md) document in a
GitHub repository. The summary will report on project definition and model design. It will describe the dataset and its
main characteristics (number and type of features), the research
questions to be addressed in the project, the class of models
to be applied to the dataset, and the algorithms that will be used.
At least two algorithms must be used and compared. The project summary will be evaluated during the project clinics (see schedule table).
- Project data model (10%) The project data
model will be delivered as a Jupyter notebook containing code and
explanations to implement data preparation, model training and
(optional) model evaluation. The project data model will be evaluated during the project clinics (see schedule table).
- Final project presentation (8%) The project presentation will
be delivered during the last week of the course as a 6-10 minute presentation putting special emphasis on model evaluation
and summarizing the other project milestones.
Project deliverables will be evaluated using specific rubrics.
All criteria will be assessed on a 4-level scale: unacceptable, average, good, excellent.
Note: project team grades may be weighted differently for each participant
in case the team judges that some team members have not contributed significantly to the project.
Grading Scheme: There
is no standard relationship between percentages and letter grades
assigned. The grading of the course will be based on the relative
percentages assigned to the assignments, project and exams. There is no
definite rule for translating number grades to letter grades.
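As a sanity check on the weights listed in this section, the following minimal Python sketch adds up the stated percentages and computes a weighted overall percentage. The function name `final_percentage`, the input format, and the example scores are hypothetical and only for illustration. No mapping to letter grades is modeled, since none is defined.

```python
# Component weights as stated in this outline (percent of the final grade).
WEIGHTS = {
    "lab_assignments": 15,
    "midterm": 20,
    "final_exam": 40,
    "project": 25,
}

# Project deliverable weights as stated in this outline.
PROJECT_WEIGHTS = {
    "team_registration": 1,
    "participation": 3,
    "project_summary": 3,
    "project_data_model": 10,
    "final_presentation": 8,
}


def final_percentage(scores):
    """Weighted overall percentage.

    `scores` maps each component name to a score in [0, 100]
    (a hypothetical input format, not an official one).
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# The stated weights are internally consistent.
assert sum(WEIGHTS.values()) == 100
assert sum(PROJECT_WEIGHTS.values()) == WEIGHTS["project"]

# Made-up scores, for illustration only.
example = {"lab_assignments": 90, "midterm": 70, "final_exam": 80, "project": 85}
print(final_percentage(example))
```

With these made-up scores, the exams contribute 46 of the 100 available points, which shows how strongly the two exams weigh on the overall percentage.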
Academic Integrity
Violation of the Academic Code of Conduct in any form will be
dealt with severely. This includes copying (even with modifications)
of program segments. You must demonstrate independent thought through
your submitted work. For more information, see
http://www.concordia.ca/students/academic-integrity.html.