CS 498: Data Management in the Cloud

About the Course

Cloud computing has recently seen a lot of attention from research and industry for applications that can be parallelized on shared-nothing architectures and have a need for elastic scalability. As a consequence, new data management requirements have emerged with multiple solutions to address them. This course will look at the principles behind data management in the cloud as well as discuss actual cloud data management systems that are currently in use or being developed. The topics covered in the course range from novel data processing paradigms (MapReduce, Scope, DryadLINQ), to commercial cloud data management platforms (Google Bigtable, Microsoft Azure, Amazon S3 and Dynamo, Yahoo PNUTS) and open-source NoSQL databases (Cassandra, MongoDB, Neo4j). The world of cloud data management is currently very diverse and heterogeneous. Therefore, our course will also report on efforts to classify, compare and benchmark the various approaches and systems. Students in this course will gain broad knowledge about the current state of the art in cloud data management and, through a course project, practical experience with a specific system.

Live Lecture Zoom Link

Registration and other generic inquiries:

We will not be managing a waitlist for the course and we have no control over registration.

Registration will be on a first-come, first-served basis. Usually, the CS Department staff will release more spots in bulk every so often.

Prerequisites:

  • Optional (highly recommended) background: CS 411 or any relevant database course
  • Programming: For projects, you will do significant application and web programming in a host language of your choice (e.g., C, C++, Java, PHP, Python). We will not cover programming-specific issues in this course.

Textbook:

Read the required readings in the textbook before lectures, and study them more carefully after class. Our lectures are intended to provide a roadmap for your reading; with the limited lecture time, we may not be able to cover everything in the readings.

Grading Summary:

Class component       Percent   Notes
Class Participation   10%
Reading Summaries     15%
Assignments           25%       4 assignments (all have the same weight)
Project               50%       Semester-long group project

 Final grading (tentative):

In this course, we will be assigning +/- letter grades.

Total     Grade
90-100    A (A-, A, A+)
80-89     B (B-, B, B+)
70-79     C (C-, C, C+)
60-69     D (D-, D, D+)

We will give you the better of two grades: the one from the scale above, or the one from a regular Gaussian curve (using this rule) with a mean around B+.
This course may include both graduate and advanced undergraduate students. We will grade each group of students on a separate curve.
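
As a rough illustration of how the weights in the Grading Summary combine, here is a minimal Python sketch. It is not the official grade calculation: the function names are hypothetical, the +/- cutoffs and the Gaussian curve are not modeled because the syllabus does not specify them, totals below 60 are assumed to map to "F", and fractional totals such as 89.5 are assumed to fall into the lower band.

```python
# Weights from the Grading Summary, in percent of the final grade.
WEIGHTS = {
    "participation": 10,
    "reading_summaries": 15,
    "assignments": 25,
    "project": 50,
}

# Lower bounds of the letter bands from the tentative scale.
# Assumption: anything below 60 is an "F" (the scale does not list it).
BANDS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def assignments_score(assignment_scores):
    """Average the equally weighted assignments (each scored 0-100)."""
    return sum(assignment_scores) / len(assignment_scores)


def weighted_total(scores):
    """Combine per-component percentages (0-100) into a course total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_band(total):
    """Map a course total to its base letter band, without +/- modifiers."""
    for lower_bound, letter in BANDS:
        if total >= lower_bound:
            return letter
    return "F"


# Hypothetical student scores, for illustration only.
scores = {
    "participation": 100,
    "reading_summaries": 90,
    "assignments": assignments_score([80, 90, 70, 100]),
    "project": 88,
}
total = weighted_total(scores)
print(total, letter_band(total))  # prints "88.75 B"
```

Because the project carries half of the total weight, a one-point change in the project score moves the total as much as a two-point change in the assignment average.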


A. Lectures & Attendance (10%)

Students are responsible for anything that transpires during a class. Class attendance is strongly recommended. If you are unable to attend, I would appreciate it if you let me know in advance. There are also “participation points”, generally corresponding to class discussion questions or activities, and graded on a “check-off” basis.

B. Homework Assignments (25%)

There will be both assignments and projects for the course, generally due on Thursday. I will try to post homework at least a week before it is due.

Homework submission will be through Canvas.

  1. Assignments are individual work.
  2. Collaboration is NOT allowed when working on the assignments.
  3. Discussions are allowed if and only if they concern high-level concepts and general ideas. Discussion cannot involve answers to the questions on the homework. Checking answers or parts of the solutions among peers is not allowed. Sharing answers on any public or private electronic platform, including but not limited to email, messenger, Facebook groups, Discord chat, etc., is not allowed.
  4. If you discussed questions with your classmates, you must include their names and the questions you discussed. Not including students' names will be considered a violation of the course's academic integrity policy. This rule applies to all individual homework assignments, including MPs.
  5. You should reference (in your code as comments) any code or concepts copied from StackOverflow or any other online resources. However, 80% of the code you turn in must be your own code. 
  6. You may submit regrade requests within the time frame listed on Campuswire. Unless stated otherwise, we typically allow up to one week after the homework grades are released.
  7. Uploading your assignment questions to public platforms (e.g., shared drives, Course Hero) is prohibited. Such violations are copyright infringements and possible violations of academic honesty. We will handle them strictly.

C. Reading Summaries (15%)

There will be reading assignments, with short summaries due before class.


D. Project (50%)

There will be a semester-long project, which involves significant database application programming. The project will be structured with several milestones due during the semester, leading to a demo and write-up near the end of the semester. Details and policies for the project will be documented separately. Please note that projects are group-based assignments, and they still follow academic honesty guidelines. Your group should not exchange or discuss code with other groups. All rules listed in Section I - C - (3)~(5) of this syllabus apply to project assignments.

4-credit project (Option for Graduate Students)

Graduate students MAY take this course for 4 credit units. (Undergraduates take this course for three credit hours.) For the extra unit, you will complete an additional project (a literature review), i.e., you will work on both tracks of the projects.

Schedule

  • In progress and subject to change. Lecture notes will be posted on the day of the lecture.

Week 1
  1/18  Course Info, Introduction to Cloud Computing
  1/20  Introduction to Cloud DM, Challenges
        Assigned: Assignment 1
        Reading: Above the Clouds: A Berkeley View of Cloud Computing

Week 2
  1/25  Challenges, App Characteristics
        Required reading: Ch. 1, Three Database Revolutions
        Optional reading: D. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng Bull. 2009
  1/27  Basics: Data Models
        Required reading: Ch. 2, Google, Big Data and Hadoop
        Optional reading: R. Cattell: Scalable SQL and NoSQL Data Stores. SIGMOD Rec. 2010

Week 3
  2/1   Basics: Data Models, Basics: Consistency
        Assigned: Project 0
        Due: Assignment 1
        Reading: D. Terry. Replicated Data Consistency Explained Through Baseball
  2/3   Basics: Consistency, GCP Intro
        Assigned: Assignment 2
        Required reading: Ch. 3, Sharding, Amazon, and the Birth of NoSQL
        Optional reading: D. Abadi: Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story

Week 4
  2/8   Basics: Consistency
        Reading: Ch. 10, Data Models and Storage
  2/10  Basics: File Systems
        Assigned: Project 1
        Due: Project 0
        Reading: S. Ghemawat, et al. The Google File System. SOSP 2003

Week 5
  2/15  Basics: File Systems
        Reading: Chapter 9: Consistency Models
  2/17  Basics: File Systems, Basics: Map-Reduce
        Reading: Case Study. GFS: Evolution on Fast-forward

Week 6
  2/22  Basics: Map-Reduce
        Due: Assignment 2
        Reading: M. Stonebraker, et al. MapReduce and Parallel DBMSs: Friends or Foes?
  2/24  Basics: Map-Reduce, Map-Reduce Versus DBMS
        Reading: J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI

Week 7
  3/1   Map-Reduce Versus DBMS, Cloud DBs: Key-Value (Amazon Dynamo)
        Reading: A. Pavlo, et al. A comparison of approaches to large-scale data analysis. SIGMOD 2009.
  3/3   Cloud DBs: Key-Value (Amazon Dynamo)
        Assigned: Project 2
        Due: Project 1

Week 8
  3/8   Cloud DBs: Key-Value (Amazon Dynamo)
        Reading: DeCandia, et al. Dynamo: Amazon's highly available key-value store. SOSP 2007.
  3/10  Cloud DBs: Document (MongoDB)
        Note: Midpoint Meetings this week
        Required reading: Chapter 4: Document Databases
        Optional reading: Chapter 6, pp 110-115: MongoDB Sharding and Replication, Chapter 11, pp 173-175: MongoDB

3/12-3/20  Spring Break

Week 9
  3/22  Cloud DBs: Document (MongoDB); Cloud DBs: Column Family (Bigtable)
        Assigned: Assignment 3
  3/24  Cloud DBs: Column Family (Bigtable); Spark demo
        Required reading: F. Chang, et al.: Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst., 26(2), 2008.
        Optional reading: Chapter 6, pp 115-119: HBase, Chapter 11, pp 171-173: HBase

Week 10
  3/29  Cloud DBs: Column Family (Bigtable); Data Processing: Spark
        Required reading: M. Zaharia, et al.: Apache Spark: a Unified Engine for Big Data Processing. CACM October 2016.
        Optional reading: M. Zaharia, et al.
  3/31  Data Processing: Spark
        Assigned: Project 3
        Due: Project 2

Week 11
  4/5   Graph Model (Neo4j)
        Assigned: Assignment 4
        Reading: Chapter 5: Graph Databases
  4/7   Graph Model (Neo4j); Data Processing: Hive; Hive Demo
        Required reading: A. Thusoo, et al.: Hive-A Petabyte Scale Data Warehouse Using Hadoop. ICDE, pp. 996-1005, 2010.
        Optional reading: J. Camacho-Rodríguez, et al. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing, SIGMOD 2019

Week 12
  4/12  Data Processing: Hive
        Due: Assignment 3
  4/14  Data Processing: Pig Latin
        Reading: C. Olston, et al.: Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.

Week 13
  4/19  Data Processing: Pig Latin; Data Processing: VoltDB
        Required reading: Chapter 7: The End of Disk?
        Optional reading: S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker: OLTP through the looking glass, and what we found there. SIGMOD 2008.
  4/21  Data Processing: VoltDB
        Due: Assignment 4

Week 14
  4/26  Graph Processing: Pregel and Giraph
        Required reading: G. Malewicz, et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010
        Optional reading: A. Ching, et al. One Trillion Edges: Graph Processing at Facebook-Scale. VLDB 2015.
  4/28  Project Presentations
        Due: Project 3

Week 15
  5/3   Project Presentations
  5/4   Project Presentations
        Due: Project 3 Final Slides