Large software systems (e.g., Amazon.com and Google's Gmail) pose new challenges for software engineers and operators. These systems require near-perfect uptime while supporting millions of concurrent connections and operations. Failures and errors in such systems can have financial and reputational repercussions.
During the life cycle of such software systems, developers focus on building feature-rich and bug-free software, while operators focus on ensuring failure-free and scalable operation of the software. In current practice, there is a gap between software developers and operators. Software developers are rarely given access to field knowledge (i.e., information about field deployments), while operators are rarely aware of development knowledge (e.g., internal details about new features). For instance, developers need field knowledge to understand whether their design and implementation perform well in the field, while operators need development knowledge to help them resolve operational problems. If development teams know from field executions that a particular piece of code is critical, they are more likely to improve the code and assign it to more senior developers. If operators have more in-depth knowledge about the design or the inner meaning of error messages, they might be able to resolve problems in a timely fashion without waiting for developers to intervene.
DevOps is a software development and operation method that shares the concern about the divide between these two worlds and proposes bridging them through better documentation and communication channels. DevOps particularly focuses on communication and collaboration between software developers and operators and has been adopted by large software companies such as Google and Facebook. Large companies like Amazon have even created new services to facilitate DevOps work. It is of great interest to learn how to perform DevOps (development and operation) effectively and efficiently for such systems.
This course explores leading research in the development and operation of large software systems, discusses challenges associated with bridging the development and operation activities of such systems, highlights industrial engineering practice, and outlines future research directions. In particular, the course leverages the mining of data generated during the development and operation of large software systems in order to support DevOps. Students will acquire advanced knowledge about development and operations in the field. On completing the course, students should be able to conduct research on topics related to DevOps and to apply the techniques they have learned in other research or practice related to systems and software engineering.
Classes are held every Friday from 5:45 PM to 8:15 PM in H 623 SGW.
In each class, students will present and discuss about three papers. A detailed schedule is available here. Each class will cover papers on one of the following themes:
- Performance engineering
- Performance counters and measurements
- Log engineering
- Debugging ultra-large-scale systems
- System configuration
- Empirical studies of large software data
Students are expected to have some background in software development and software engineering. Knowledge of ultra-large-scale systems is beneficial but not required.
Students will be evaluated using the following breakdown:
1. Paper presentation and discussion (20%):
Each paper will be assigned to one student, who will act as both presenter and discussant. The presentation will last strictly 20 minutes and the discussion will last 15-20 minutes. Each student should upload the slides to the course account before class.
- Role of presenter: As a presenter you should not simply repeat the paper's content (remember you only have 15 mins); instead, you should point out the most important findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
- Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work.
Your presentation should include:
- one slide that lists the main contributions of the paper.
- one slide that places the paper relative to any recent work done by the authors of the paper.
- one slide that places the paper relative to other papers presented that week.
- as the final slide, a listing of at least three technical points that you liked and three areas that should be improved.
2. Weekly critique (10%):
Each week, each student should pick one of the papers for that week and submit by email a one-page critique of the paper before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and comments for improvement. You do not need to submit a critique if you are presenting that week. Additional advice for critiquing papers is here.
The document should have your name at the top. The document name should follow this template: Week#_Paper#_YourName.
3. Assignment (20%):
One assignment done in a group of 3 or 4 students. More details in class.
4. Project (50%=10%+40%):
One original project (10 pages IEEE format) done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course.
You need to submit a project proposal (2 pages IEEE format). The proposal should provide a brief motivation for the project, a detailed discussion of the data and systems that will be used in the project, a timeline of milestones, and the expected outcome. Make sure that you cite at least 3 papers in your proposal. Additional advice for project proposals will be discussed in class.