About the Course
Cloud computing has recently received considerable attention from research and industry for applications that can be parallelized on shared-nothing architectures and need elastic scalability. As a consequence, new data management requirements have emerged, with multiple solutions to address them. This course will look at the principles behind data management in the cloud and discuss actual cloud data management systems that are currently in use or being developed. The topics covered range from novel data processing paradigms (MapReduce, Scope, DryadLINQ) to commercial cloud data management platforms (Google BigTable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4j). The world of cloud data management is currently very diverse and heterogeneous. Therefore, the course will also report on efforts to classify, compare, and benchmark the various approaches and systems. Students in this course will gain broad knowledge of the current state of the art in cloud data management and, through a course project, practical experience with a specific system.
Registration and other general inquiries:
We will not be managing a waitlist for the course and we have no control over registration.
Registration will be on a first-come, first-served basis. Usually, the CS Department staff will release more spots in bulk every so often.
- Optional (highly recommended) background: CS 411 or any relevant database course
- Programming: For projects, you will do significant application and web programming in a host language of your choice (e.g., C, C++, Java, PHP, Python). We will not cover programming-specific issues in this course.
Complete the required reading in the textbook before each lecture, and study it more carefully after class. Our lectures are intended to provide a roadmap for your reading. With the limited lecture time, we may not be able to cover everything in the readings.
|Assignments||25%||4 assignments (all have the same weight)|
|Project||50%||Semester-long group project|
Final grading (tentative):
In this course, we will be assigning +/- letter grades.
A (A-, A, A+)
B (B-, B, B+)
C (C-, C, C+)
D (D-, D, D+)
You will receive the better of two grades: the one from the scale above, or the one from a regular Gaussian curve with a mean around B+.
This course may contain both graduate and advanced undergraduate students. We will grade each group of students on a separate curve.
A. Lectures & Attendance (10%)
Students are responsible for anything that transpires during a class. Class attendance is strongly recommended. If you are unable to attend, I would appreciate it if you let me know in advance. There are also "participation points", which generally correspond to class discussion questions or activities and are graded on a "check-off" basis.
B. Homework Assignments (25%)
There will be both assignments and projects for the course, generally due on Thursdays. I will try to post homework at least a week before it is due.
Homework submission will be through Canvas.
- Assignments are individual work.
- Collaboration is NOT allowed when working on the assignments.
- Discussions are allowed if and only if they concern only high-level concepts and general ideas. Discussions cannot involve answers to the questions on the homework. Checking answers or parts of solutions among peers is not allowed. Sharing answers on any public or private electronic platform, including but not limited to email, messenger, Facebook groups, and Discord chat, is not allowed.
- If you discussed questions with your classmates, you must include their names and the questions you discussed. Not including students' names will be considered a violation of the course's academic integrity policy. This rule applies to all individual homework assignments, including MPs.
- You should reference (in your code as comments) any code or concepts copied from StackOverflow or any other online resources. However, at least 80% of the code you turn in must be your own.
- You are allowed to submit regrade requests within the time frame listed on Campuswire. Unless stated otherwise, we typically allow up to one week after the HW grades are released.
- Uploading your assignment questions to public platforms (e.g., shared drives, Course Hero) is prohibited. Such violations are copyright infringements and possible violations of academic honesty. We will handle them strictly.
C. Reading Summaries (15%)
There will be reading assignments, with short summaries due before class.
D. Project (50%)
There will be a semester-long project, which involves significant database application programming. The project will be structured around several milestones due during the semester, leading to a demo and write-up near the end of the semester. Details and policies for the project will be documented separately. Please note that projects are group-based assignments, and they still follow academic honesty guidelines. Your group should not exchange or discuss code with other groups. All rules listed in Section I - C - (3)~(5) of this syllabus apply to project assignments.
4-credit project (Option for Graduate Students)
Graduate students MAY take this course for 4 credit units. (Undergraduates take this course for 3 credit hours.) For the extra unit, you will complete an additional project (a literature review), i.e., you will work on both tracks of the projects.
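The component weights given in sections A through D (10%, 25%, 15%, 50%) combine into a single course score as a weighted average. The following is a minimal sketch of that arithmetic, assuming each component is scored on a 0-100 scale. The function name and the sample scores are hypothetical, and the mapping from score to letter grade is omitted because this syllabus does not state the cutoffs.

```python
# Component weights as stated in sections A-D of this syllabus.
WEIGHTS = {
    "attendance": 0.10,
    "assignments": 0.25,
    "reading_summaries": 0.15,
    "project": 0.50,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted course score for component scores on a 0-100 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {
    "attendance": 100,
    "assignments": 88,
    "reading_summaries": 90,
    "project": 92,
}
print(round(weighted_score(example), 2))  # 91.5
```

Because the project carries half of the weight, a 10-point change in the project score moves the total by 5 points, while the same change in attendance moves it by only 1 point.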
- IN PROGRESS and Subject to change. Lecture notes will be posted on the day of the lecture.
|Week||Date||Topic||Assigned||Due||Required Reading||Optional Reading|
|Week 1||1/18||Course Info, Introduction to Cloud Computing|
|1/20||Introduction to Cloud DM, Challenges||Assignment 1||Above the Clouds: A Berkeley View of Cloud Computing|
|Week 2||1/25||Challenges, App Characteristics||Ch. 1, Three Database Revolutions||D. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng Bull. 2009|
|1/27||Basics: Data Models||Ch. 2, Google, Big Data and Hadoop||R. Cattell: Scalable SQL and NoSQL Data Stores. SIGMOD Rec. 2010|
|Week 3||2/1||Basics: Data Models, Basics: Consistency||Project 0||Assignment 1||D. Terry. Replicated Data Consistency Explained Through Baseball|
|2/3||Basics: Consistency, GCP Intro||Assignment 2||Ch. 3, Sharding, Amazon, and the Birth of NoSQL||D. Abadi: Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story|
|Week 4||2/8||Basics: Consistency||Ch. 10, Data Models and Storage|
|2/10||Basics: File Systems||Project 1||Project 0||S. Ghemawat, et al. The Google File System. SOSP 2003|
|Week 5||2/15||Basics: File Systems||Chapter 9: Consistency Models|
|2/17||Basics: File Systems, Basics: Map-Reduce||Case Study. GFS: Evolution on Fast-forward|
|Week 6||2/22||Basics: Map-Reduce||Assignment 2||M. Stonebraker, et al. MapReduce and Parallel DBMSs: Friends or Foes?|
|2/24||Basics: Map-Reduce, Map-Reduce Versus DBMS||J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004|
|Week 7||3/1||Map-Reduce Versus DBMS, Cloud DBs: Key-Value (Amazon Dynamo)||A. Pavlo, et al. A comparison of approaches to large-scale data analysis. SIGMOD 2009.|
|3/3||Cloud DBs: Key-Value (Amazon Dynamo)||Project 2||Project 1|
|Week 8||3/8||Cloud DBs: Key-Value (Amazon Dynamo)||DeCandia, et al. Dynamo: Amazon's highly available key-value store. SOSP 2007.|
|3/10||Cloud DBs: Document (MongoDB)||Midpoint Meetings this week||Chapter 4: Document Databases||Chapter 6, pp 110-115: MongoDB Sharding and Replication, Chapter 11, pp 173-175: MongoDB|
|Week 9||3/22||Cloud DBs: Document (MongoDB); Cloud DBs: Column Family (Bigtable)||Assignment 3|
|3/24||Cloud DBs: Column Family (Bigtable); Spark demo||F. Chang, et al.: Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst., 26(2), 2008.||Chapter 6, pp 115-119: HBase, Chapter 11, pp 171-173: HBase|
|Week 10||3/29||Cloud DBs: Column Family (Bigtable); Data Processing: Spark||M. Zaharia, et al.: Apache Spark: a Unified Engine for Big Data Processing. CACM October 2016.||M. Zaharia, et a.|
|3/31||Data Processing: Spark||Project 3||Project 2|
|Week 11||4/5||Graph Model (Neo4j)||Assignment 4||Chapter 5: Graph Databases|
|4/7||Graph Model (Neo4j); Data Processing: Hive; Hive Demo||A. Thusoo, et al.: Hive-A Petabyte Scale Data Warehouse Using Hadoop. ICDE, pp. 996-1005, 2010.||J. Camacho-Rodríguez, et al. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing, SIGMOD 2019|
|Week 12||4/12||Data Processing: Hive||Assignment 3|
|4/14||Data Processing: Pig Latin||C. Olston, et al.: Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.|
|Week 13||4/19||Data Processing: Pig Latin; Data Processing: VoltDB||Chapter 7: The End of Disk?||S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker: OLTP through the looking glass, and what we found there. SIGMOD 2008.|
|4/21||Data Processing: VoltDB||Assignment 4|
|Week 14||4/26||Graph Processing: Pregel and Giraph||G. Malewicz, et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010||A. Ching, et al. One Trillion Edges: Graph Processing at Facebook-Scale. VLDB 2015.|
|4/28||Project Presentations||Project 3|
|Week 15||5/3||Project Presentations|
|5/4||Project Presentations, 1||Project 3 Final Slides|