Kleppmann, M. (2018). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media.

Find it here at dataintensive.net.


Work in Progress


I. Foundations of Data Systems

1. Reliable, Scalable, and Maintainable Applications

Building blocks for data-intensive applications

Database Store data
Cache Store result of expensive operations
Search Index Search data by filters or keywords
Stream Processing Send messages to other processes, to be handled asynchronously
Batch Processing Periodically process accumulated data

Concerns

Reliability The system should work correctly with the desired performance even in the face of adversity.
Scalability There should be reasonable ways of dealing with the system’s growth and increased load.
Maintainability Many people should be able to maintain and adapt the system productively.

Reliability

Fault-Tolerant / Resilient: Systems that anticipate faults and can handle them.

Fault One component of the system deviating from its spec.
Failure The whole system stops providing the required service.

Design fault-tolerance mechanisms that prevent faults from causing failures. Generally accomplished by tolerating faults as opposed to preventing faults.

Hardware Faults As applications have come to use larger numbers of machines, the rate of hardware faults has increased, so hardware redundancy alone is no longer sufficient. There is a move toward systems that can tolerate the loss of entire machines by using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy. Such systems also have operational advantages: for example, they can be patched one node at a time (a rolling upgrade) without downtime.
Software Errors These bugs are often dormant, triggered by an unusual set of circumstances, and usually stem from a wrong assumption about the environment. Examples include crashing on bad input, runaway processes that exhaust shared resources, and cascading failures. Remedies include carefully thinking through assumptions, thorough testing, process isolation, allowing processes to crash and restart, and monitoring system behavior in production.
Human Errors Humans are known to be unreliable. Design systems that minimize opportunities for error (well-designed abstractions, APIs, and user interfaces). Decouple the places where people make the most mistakes from the places where mistakes can cause failures. Test thoroughly at all levels, from unit tests to integration and manual tests. Allow quick and easy recovery from human errors. Set up detailed monitoring, including performance metrics and error rates (telemetry).

Scalability

Load is described with load parameters, such as:

  • Requests per second
  • Read:Write ratio
  • Concurrent active users
  • Cache hit ratio

Investigating performance when the load increases:

  • How is performance affected while keeping the system unchanged?
  • How much do you need to increase resources to retain performance?

Latency Duration that a request is waiting to be handled. Refers to the delay in the system.
Response Time Time between a client sending a request and receiving a response: the actual service time plus network and queueing delays.

To quantify performance, use percentiles rather than the mean: p50 (the median) describes the typical case, and p95 or p99 describe the outliers. High tail response times are often caused by queueing delays, where a few slow requests hold up the processing of subsequent ones (head-of-line blocking). Service level objectives (SLOs) and service level agreements (SLAs) define expected performance and availability, often in terms of percentiles.
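The percentile idea can be sketched in a few lines of Python; the nearest-rank method and the sample response times below are illustrative, not from the book:

```python
import math

# Nearest-rank percentile over a pre-sorted list (illustrative data).
def percentile(sorted_times, p):
    rank = max(1, math.ceil(p / 100 * len(sorted_times)))
    return sorted_times[rank - 1]

response_times_ms = sorted([30, 32, 35, 40, 41, 45, 50, 120, 250, 1500])
p50 = percentile(response_times_ms, 50)   # typical request
p95 = percentile(response_times_ms, 95)   # tail latency
```

Here the median (41 ms) describes the typical case, while p95 (1500 ms) is dominated by the single slow outlier that a mean would blur together.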

Coping with load

Vertical Scaling Move to a more powerful machine.
Horizontal Scaling Distribute the load across multiple smaller machines.
Elastic Scaling Automatically add resources when load increase is detected.

Maintainability

Design principles to minimize maintenance:

  • Operability: Make routine tasks easy so operations teams can focus on high-value activities.
    • Provide visibility into the runtime behavior with monitoring.
    • Provide support for automation and integration with standard tools.
    • Avoid dependency on individual machines.
    • Provide good documentation.
    • Provide good default behavior.
  • Simplicity: Remove as much complexity as possible for new engineers to easily understand the system.
    • When complexity makes maintenance hard, budgets and schedules often get overrun, with a greater risk of introducing bugs. Remove any accidental complexity through abstractions.
  • Evolvability: Make it easy for engineers to make changes (extensibility).
    • System requirements will change. The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and abstractions.

2. Data Models and Query Languages

Data models impact how we write and think about the problem we are solving. Most applications are built by layering data models. Each layer hides the complexity of the layers below by providing a clean data model. These abstractions allow people to work together effectively.

Relational Model Versus Document Model

Driving forces behind NoSQL adoption:

  • Greater scalability, including large datasets or high write throughput.
  • Specialized query operations.
  • More dynamic and expressive data model.

Polyglot Persistence Using multiple data storage technologies for varying data storage needs.
Object–Relational Impedance Mismatch Difficulties with the translation layer between object-oriented code and the relational database model. Object-relational mapping (ORM) frameworks reduce the boilerplate but cannot completely hide the differences between the two models.

Normalizing data requires many-to-one relationships, which don’t fit nicely into the document model. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak. The application code can emulate a join with multiple queries if the database does not support joins. Relational models support joins and many-to-one and many-to-many relationships. Document models are applicable if data has a document-like structure (tree of one-to-many relationships). Document-oriented databases include MongoDB, RethinkDB, CouchDB, and Espresso.
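Emulating a join in application code can be sketched as follows; in-memory Python dicts stand in for a document database, and all names and data here are hypothetical:

```python
# Hypothetical normalized data: each user references a region by ID.
regions = {
    10: "Greater Seattle Area",
    20: "Greater Boston Area",
}
users = [
    {"id": 1, "name": "Ada", "region_id": 10},
    {"id": 2, "name": "Lin", "region_id": 20},
]

def users_with_region():
    # One query for the users, then one lookup per region_id:
    # the join is performed by the application, not the database.
    return [dict(u, region=regions[u["region_id"]]) for u in users]
```

This shifts the work of the join (and the responsibility for doing it efficiently) from the database into the application.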

Normalization Example:

  • Location can be a plain-text string or an ID.
  • The advantage of using an ID is that it never needs to change because it has no meaning to humans. Anything meaningful to humans may need to change.
    • Consistent style, spelling, and localization support - Avoid ambiguity with duplicate names - Ease of updating - Better search.

IMS used a hierarchical model, similar to the JSON model. This worked well for one-to-many relationships but made many-to-many relationships difficult, and it didn’t support joins. Two solutions proposed to solve these limitations were:

CODASYL Network Model A generalization of the hierarchical model, in which every record has exactly one parent; in the network model, a record could have multiple parents, allowing the modeling of many-to-one and many-to-many relationships. The links between records were like pointers, and a query followed an access path from a root record. As a result, changing an application’s data model was difficult.
Relational Model Laid all the data out in the open: a relation (table) is simply a collection of tuples (rows). The query optimizer chooses the access path automatically, not the developer. The relational model made it much easier to add new features to applications.

Document Models:

  • Reduce Impedance Mismatch
  • Flexible: Schema-on-read is advantageous when the data is heterogeneous (i.e., there are many different types of objects, or the structure is determined by an external system).
  • Performant: Storage locality offers a performance benefit when you need large parts of the document at once, since the database must load the entire document either way. The database cannot directly refer to a nested item within a document, which is a problem when documents are deeply nested.

Schema-on-Read Schema is implicit and interpreted only when data is read.
Schema-on-Write Schema is explicit, and the database ensures all written data conforms to it.
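A minimal sketch of schema-on-read, assuming a collection of hypothetical JSON documents whose shapes differ (the field names are made up):

```python
import json

# Documents in the same collection may have different shapes; the reading
# code interprets each one at read time.
docs = [
    '{"type": "person", "name": "Ada"}',
    '{"type": "company", "name": "Initech", "employees": 42}',
]

def describe(raw):
    d = json.loads(raw)
    if d["type"] == "person":
        return f'person: {d["name"]}'
    # Fields may be absent, so reads must supply defaults.
    return f'company: {d["name"]} ({d.get("employees", 0)} employees)'
```

Under schema-on-write, by contrast, the database would reject any document that does not conform before it is ever stored.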

Query Languages for Data

Imperative Language Uses ordered statements that modify the state.
Declarative Query Language Expresses logic without describing its control flow. Languages like SQL are more concise and easier to work with because they hide the database engine’s implementation details. Declarative languages lend themselves to parallel execution.

On the web, declarative CSS styling is better than manipulating styles imperatively in JavaScript. Similarly, for databases, declarative languages like SQL turned out to be much better than imperative query APIs.
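The contrast can be sketched with a sharks example in the spirit of the chapter (the data is illustrative):

```python
animals = [
    {"name": "shark", "family": "Sharks"},
    {"name": "lion", "family": "Felidae"},
]

# Imperative: the code dictates the exact order of operations.
sharks = []
for a in animals:
    if a["family"] == "Sharks":
        sharks.append(a["name"])

# Declarative (SQL): state only the result wanted; the query optimizer
# picks the execution strategy, leaving it free to use an index or
# parallelize the scan.
#   SELECT name FROM animals WHERE family = 'Sharks';
```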

MapReduce A programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster. Query logic is expressed with code snippets, called repeatedly by the processing framework. The map and reduce functions are pure functions; they cannot perform additional database queries and must not have any side effects. These restrictions allow the database to run the functions anywhere, in any order and rerun them on failure.
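A minimal, single-machine sketch of the model follows; a real framework distributes the map and reduce calls across a cluster, and the observation data below is hypothetical:

```python
from collections import defaultdict

# Illustrative input: animal sightings, to be counted per month.
observations = [
    {"date": "1995-12-25", "animals": 3},
    {"date": "1995-12-12", "animals": 4},
    {"date": "1996-01-10", "animals": 1},
]

def map_fn(doc):
    # Pure function: emit (key, value) pairs; no side effects, no queries.
    yield (doc["date"][:7], doc["animals"])

def reduce_fn(key, values):
    # Pure function: combine all values emitted for one key.
    return sum(values)

# The "framework": group mapper output by key, then reduce each group.
groups = defaultdict(list)
for doc in observations:
    for k, v in map_fn(doc):
        groups[k].append(v)
result = {k: reduce_fn(k, vs) for k, vs in groups.items()}
# result: {"1995-12": 7, "1996-01": 1}
```

Because map_fn and reduce_fn are pure, the framework may run them on any node, in any order, and simply rerun them after a failure.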

3. Storage and Retrieval

4. Encoding and Evolution

II. Distributed Data

5. Replication

6. Partitioning

7. Transactions

8. The Trouble with Distributed Systems

9. Consistency and Consensus

III. Derived Data

10. Batch Processing

11. Stream Processing

12. The Future of Data Systems

Miscellaneous

Functional Requirements Capture the intended behavior of the system (what it should do).
Nonfunctional Requirements Quality constraints the system must satisfy, such as security, reliability, compliance, scalability, compatibility, and maintainability.