Making Sense of Stream Processing – A Must Read

I stumbled on this book this week, and devoured it in an afternoon.

Written by Martin Kleppmann, a distributed systems researcher and former engineer at LinkedIn (where Kafka was born), this book explores the ideas of stream processing and outlines how they can apply broadly to application architectures. It’s a small book in a report format, synthesized from a series of blog posts (linked on Martin’s website).

I found this book quite profound. It’s one of those quick reads that packs quite a punch. It’s also extremely well written and easy to digest (read: give it to your manager 🙂). In that sense, it reminds me of other great quick reads like Clean Code by Robert C. Martin. Much of what you read feels obvious as you are reading it, yet it’s somehow refreshing and exciting, like a new discovery.

Before picking up this book, I had a sense from things I’d read and heard that Kafka could be a powerful general-purpose technology – this book really spells that out in detail. I’ve done a fair share of work with application event buses and pipelines, and have always found the decoupled nature of those systems beautiful, flexible, adaptable, and, well… liberating. Kafka, or a technology like it, seems like the ideal way to realize this more broadly and at a larger scale. This book explores that approach and argues it can be a superior alternative to other contemporary architectures.

In the first part of the book, Kleppmann covers some common architectural patterns that attempt to address scaling reads and writes, but present new problems of their own. This is followed by several sections drawing interesting observations and architecture patterns from other areas of computing, which together build a compelling case for an end-to-end event-based streaming application architecture.

The dual-writes architecture

One of the architectural patterns explored in the book is the dual-writes architecture. In this architecture, in an effort to optimize both reads and writes, data is written in one format and then replicated into one or more read-optimized representations suited to particular access patterns.

Most developers are familiar with this pattern. A common use case is building a full-text index (such as with Solr or Elasticsearch) from a back-end OLTP database, but the concept extends to in-memory caches, data warehouses, and so on. The book goes on to explain some fundamental problems with this approach, notably the challenge of keeping the representations in sync – or worse, the system falling into a permanently inconsistent state.
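The failure mode is easy to demonstrate. Here's a minimal sketch (hypothetical in-memory stores standing in for a real database and search index) showing how a dual write that partially fails leaves the two representations silently disagreeing:

```python
# A minimal sketch of the dual-writes problem. Both "stores" are
# hypothetical stand-ins: a dict for the primary database and a
# flaky class for the derived search index.

class FlakyIndex:
    """Stands in for a search index whose writes can fail."""
    def __init__(self):
        self.docs = {}
        self.fail_next = False

    def write(self, key, value):
        if self.fail_next:
            self.fail_next = False
            raise ConnectionError("index temporarily unreachable")
        self.docs[key] = value

database = {}          # primary store (always succeeds here)
index = FlakyIndex()   # derived, read-optimized store

def dual_write(key, value):
    database[key] = value      # write 1: primary store
    index.write(key, value)    # write 2: derived store -- may fail!

dual_write("user:1", "Ada")
index.fail_next = True
try:
    dual_write("user:1", "Ada Lovelace")
except ConnectionError:
    pass  # a retry might help, but the stores already disagree

# The two stores are now inconsistent, and nothing in the system knows:
print(database["user:1"])    # "Ada Lovelace"
print(index.docs["user:1"])  # "Ada"
```

Reordering the two writes doesn't fix this; it only changes which store ends up stale.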

Materialized Views – A better cache

Another interesting argument made in the book is that materialized views (in the abstract sense, not necessarily a database feature) make better caches than a read-through cache. This is actually quite apparent when you think about it: a materialized view is automatically updated when the underlying data changes, whereas a read-through or similar cache can serve stale data for long periods and typically requires complex invalidation logic.
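To make the contrast concrete, here's a toy sketch (dicts standing in for real stores, a function call standing in for a Kafka consumer) of the two approaches side by side:

```python
# A sketch contrasting a read-through cache with an event-maintained
# "materialized view". All names and stores here are hypothetical
# simplifications, not any particular library's API.

source_of_truth = {"user:1": "Ada"}

# Read-through cache: populated lazily, stale until invalidated.
cache = {}

def read_through(key):
    if key not in cache:
        cache[key] = source_of_truth[key]
    return cache[key]

# Materialized view: every write is published as an event, and the
# view updates itself by consuming that event stream.
view = {}

def apply_event(event):
    view[event["key"]] = event["value"]

def write(key, value):
    source_of_truth[key] = value
    apply_event({"key": key, "value": value})  # stand-in for a stream consumer

read_through("user:1")           # warms the cache with "Ada"
write("user:1", "Ada Lovelace")  # the view updates; the cache does not

print(read_through("user:1"))  # stale: "Ada"
print(view["user:1"])          # fresh: "Ada Lovelace"
```

The cache would need explicit invalidation (or a TTL) to catch up; the view is correct by construction, because it is derived from the stream of changes itself.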

The Unix Philosophy

One of the underpinnings of the Unix philosophy is pipes and filters. Filters are small, flexible, and most importantly, arbitrarily composable. Pipes rest on a key underlying abstraction: the file. Because a file can represent almost anything (plain files, devices, network sockets, etc.), the abstraction is incredibly powerful. When we apply a similar philosophy in a distributed setting, we find good analogs: your applications are the filters, and Kafka forms the pipes between them.
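The composition idea can be sketched in a few lines of Python, with generators playing the role of filters and the stream of lines playing the role of the pipe (the analog of something like `cat log | grep ERROR | wc -l`):

```python
# A sketch of pipes and filters: each filter is a small generator
# that consumes one stream and yields another, so filters compose
# arbitrarily. The log data here is made up for illustration.

def grep(pattern, lines):
    """Filter: keep only lines containing pattern."""
    return (line for line in lines if pattern in line)

def to_upper(lines):
    """Filter: transform each line."""
    return (line.upper() for line in lines)

def count(lines):
    """Sink: consume the stream and reduce it to a number."""
    return sum(1 for _ in lines)

log = [
    "INFO server started",
    "ERROR disk full",
    "INFO request handled",
    "ERROR timeout",
]

# Compose filters like a shell pipeline. In the streaming analogy,
# each stage would be an application and each "pipe" a Kafka topic.
print(list(to_upper(grep("ERROR", log))))  # ['ERROR DISK FULL', 'ERROR TIMEOUT']
print(count(grep("INFO", log)))            # 2
```

Nothing about `grep` knows or cares what produces its input or consumes its output, which is exactly the decoupling the book argues for between services.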

Change Data Capture (CDC)

The idea behind change data capture is simple – capture the changes to a mutable system as an event stream or log. This can be incredibly valuable: it lets you scale out usage of an existing database without putting additional load on it, and you can take more risks and experiment with new architectures using that data, all without touching the existing database.
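Here's a minimal sketch of the idea (a plain list standing in for a transaction log or Kafka topic, and invented class/field names): every mutation is appended to a change log, and a downstream consumer can rebuild its own copy of the data purely by replaying that log:

```python
# A sketch of change data capture: each write to the primary store
# emits a change event, and consumers replay the event log to build
# derived views -- without ever querying the primary store.

changelog = []  # stand-in for a transaction log / Kafka topic

class CapturedStore:
    """A mutable store that emits a change event for every write."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        changelog.append({"op": "put", "key": key, "value": value})

    def delete(self, key):
        del self.data[key]
        changelog.append({"op": "delete", "key": key})

store = CapturedStore()
store.put("user:1", "Ada")
store.put("user:2", "Grace")
store.delete("user:2")

# A downstream consumer rebuilds its own replica purely from the log:
replica = {}
for event in changelog:
    if event["op"] == "put":
        replica[event["key"]] = event["value"]
    else:
        replica.pop(event["key"], None)

print(replica == store.data)  # True
```

Any number of such consumers can attach to the same log – a search indexer, a cache, an analytics job – each maintaining its own view at its own pace.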

Interestingly, a number of databases expose their transaction logs in this way, notably Postgres. Bottled Water, also by Martin Kleppmann, is a project that feeds the output of Postgres’s logical decoding feature directly into Kafka via a standard, flexible encoding format.

Conclusion

I think more and more people are starting to use technologies like Kafka beyond big data analytics use cases. It can underlie your entire application architecture, providing a backbone for smaller, tightly focused microservices and for building materialized views – much like the Unix philosophy of pipes and filters. As Kafka plugs into more data stores and event-producing sources at their origin, it can serve as the conduit for all of your applications. Incidentally, this architecture also lends itself really well to functional programming.

I highly recommend this book. As a bonus, it’s free on the confluent.io site until May 6th, 2016. Get your free copy while you can!

Relevant blog posts

  • Martin Kleppmann’s website
  • The Log: What every software engineer should know about real-time data’s unifying abstraction
  • Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform