Lecture Notes

List of topics, recommended reading material, and pointers to the PDF version of the slides. The complete LaTeX sources are available for a large fraction of the slides: feel free to fork them and contribute with pull requests.

Introduction [Slides]

Topics:

Brief Introduction to Big Data
Industrial Use-cases

Reading list:

The Datacenter as a Computer: An Introduction to the Design of Warehouse-scale Machines, by Luiz André Barroso and Urs Hölzle, Morgan & Claypool

Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, Cambridge University Press

Scalable Algorithm Design [Slides]

Topics:

The MapReduce model
Awfully Basic Introduction to Functional Programming
Design Patterns
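
As a rough illustration of how the MapReduce model and its functional-programming roots fit together, here is a minimal single-process sketch of word count in Python. It is not taken from the slides: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the in-memory `shuffle` only stands in for the grouping step that a real framework performs across machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (assumed stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["big data is big", "data is everywhere"])
```

Because addition is associative and has an identity element (0), the same fold could also run as a combiner on partial groups before the shuffle.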

Reading list:

Data-intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer, Morgan & Claypool

Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, Cambridge University Press

Upper and Lower Bounds on the Cost of a Map-Reduce Computation, by F. Afrati, et al., PVLDB, 2013

Enumerating subgraph instances using map-reduce, by F. Afrati, et al., ICDE, 2013

Transitive closure and recursive Datalog implemented on clusters, by F. Afrati and J. Ullman, EDBT, 2012

Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms, by J. Lin, arXiv:1304.7544

Hadoop Internals [Slides]

Topics:

Hadoop MapReduce Architecture
Jobs, Tasks, Task Attempts
Scheduling
Failures
I/O
Advanced Features
Hadoop Deployments

Reading list:

MapReduce: Simplified Data Processing on Large Clusters, by J. Dean and S. Ghemawat, OSDI, 2004

The Google File System, by S. Ghemawat, et al., ACM SOSP, 2003

Hadoop: The Definitive Guide, by T. White, O'Reilly, 2012

Hadoop Operations, by E. Sammer, O'Reilly, 2012

Spark Internals [Slides]

Topics:

Brief introduction to Spark
The anatomy of a Spark Application
Resource allocation: the DAG and the standalone schedulers
Data Shuffling
Caching
Tuning with a running example

Reading list:

Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, O'Reilly [O'Reilly Link]

Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, by M. Zaharia, et al., USENIX NSDI, 2012

Sparrow: Distributed, Low Latency Scheduling, by K. Ousterhout, et al., ACM SOSP, 2013

Introduction to Spark Internals, by M. Zaharia, Video on YouTube [Link]

A Deeper Understanding of Spark Internals, by Aaron Davidson, Video on YouTube [Link]

Cluster Schedulers [Slides]

Topics:

Introduction
YARN internals
Mesos internals
Borg internals

Reading list:

Hadoop YARN, by A. C. Murthy, et al., Addison-Wesley [Amazon Link]

Mesos: Flexible Resource Sharing for the Cloud, by B. Hindman, et al., NSDI, 2011

Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints, by A. Ghodsi, et al., EuroSys, 2013

Omega: flexible, scalable schedulers for large compute clusters, by M. Schwarzkopf, et al., EuroSys, 2013

Large-scale cluster management at Google with Borg, by A. Verma, et al., EuroSys, 2015

Relational Algebra [Slides]

Topics:

Basic Relational Operators
MapReduce Implementation Snippets of Relational Operators
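
To give a flavour of how relational operators map onto MapReduce, the following single-process Python sketch implements selection, projection, and a natural join. It is an illustration under stated assumptions, not the snippets from the slides: relations are treated as sets of tuples, `run_mapreduce` is a hypothetical in-memory driver, and the schemas R(a, b) and S(b, c) are made up for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def identity_reducer(key, values):
    """Emit each distinct key once, which also removes duplicates."""
    yield key


def select_mapper(predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    def mapper(row):
        if predicate(row):
            yield row, None
    return mapper


def project_mapper(columns):
    """Projection: keep only the listed column positions."""
    def mapper(row):
        yield tuple(row[i] for i in columns), None
    return mapper


def join_mapper(tagged_row):
    """Natural join of R(a, b) and S(b, c): key every row by b, tag its origin."""
    tag, row = tagged_row
    if tag == "R":
        a, b = row
        yield b, ("R", a)
    else:
        b, c = row
        yield b, ("S", c)


def join_reducer(b, values):
    """Pair every R-side value with every S-side value sharing the same b."""
    left = [v for tag, v in values if tag == "R"]
    right = [v for tag, v in values if tag == "S"]
    for a in left:
        for c in right:
            yield (a, b, c)


R = [(1, "x"), (2, "y"), (3, "x")]
S = [("x", 10), ("z", 30)]

selected = run_mapreduce(R, select_mapper(lambda r: r[1] == "x"), identity_reducer)
projected = run_mapreduce(R, project_mapper([1]), identity_reducer)
tagged = [("R", r) for r in R] + [("S", s) for s in S]
joined = run_mapreduce(tagged, join_mapper, join_reducer)
```

Selection needs no real reduce work, while projection relies on the grouping step to eliminate duplicate tuples, and the join uses the shared attribute as the key so that matching rows meet in the same reducer.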

Reading list:

Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, Cambridge University Press

SparkSQL [Slides]

Topics:

General overview of data structures
SparkSQL internals: a lightweight overview

Reading list:

Spark SQL: Relational Data Processing in Spark, by M. Armbrust, et al., SIGMOD, 2015

Distributed Storage Systems [Slides]

Older but more complete versions of the slides are available here:

[Part 1]: this slide deck was originally created by Prof. Marko Vukolić (now at IBM Research Zurich)
[Part 2]: this slide deck was originally created by Prof. Marko Vukolić (now at IBM Research Zurich)
[HBase]

Topics:

The CAP Theorem
Amazon Dynamo
Cassandra and HBase

Reading list:

Seth Gilbert, Nancy A. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33(2): 51-59 (2002)

Giuseppe DeCandia et al.: Dynamo: Amazon's highly available key-value store. SOSP 2007: 205-220

Eric A. Brewer: Pushing the CAP: Strategies for Consistency and Availability. IEEE Computer 45(2): 23-29 (2012)

Seth Gilbert, Nancy A. Lynch: Perspectives on the CAP Theorem. IEEE Computer 45(2): 30-36 (2012)

Marko Vukolić: Quorum Systems with Applications to Storage and Consensus. Morgan & Claypool (2012)

Ion Stoica et al.: Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw. 11(1): 17-32 (2003)

Avinash Lakshman, Prashant Malik: Cassandra: a decentralized structured storage system. Operating Systems Review 44(2): 35-40 (2010)

Apache Cassandra 1.2 Documentation. Datastax. http://www.datastax.com/docs/1.2/index

Eben Hewitt: Cassandra: The Definitive Guide. O’Reilly (2010) http://bit.ly/JHwwR6

Edward Capriolo: Cassandra High Performance Cookbook. Packt Publishing. (2011)

Coordinating Distributed Systems [Slides]

Topics:

Paxos
ZooKeeper

Reading list:

Patrick Hunt, Mahadev Konar, Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Wait-free coordination for Internet-scale systems. In Proc. USENIX ATC (2010)

ZooKeeper 3.4 Documentation. http://zookeeper.apache.org/doc/trunk/index.html

Flavio Paiva Junqueira, Benjamin C. Reed, Marco Serafini: Zab: High-performance broadcast for primary-backup systems. DSN 2011: 245-256

Michael Burrows: The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006: 335-350

Atul Adya, John Dunagan, Alec Wolman: Centrifuge: Integrated Lease Management and Partitioning for Cloud Services. NSDI 2010: 1-16

Selected Topics in Cloud Computing [Slides]

Topics:

Cloudonomics 101

Reading list:

J. Weinman. Cloudonomics: The Business Value of Cloud Computing, Wiley, 2012

L.A. Barroso, Jimmy Clidaras and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool, 2nd ed. July 2013