HackerNews Readings: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems


Clean Code: A Handbook of Agile Software Craftsmanship

Robert C. Martin

4.7 on Amazon

43 HN comments

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Martin Kleppmann

4.8 on Amazon

34 HN comments

The Martian

Andy Weir, Wil Wheaton, et al.

4.7 on Amazon

27 HN comments

The Pragmatic Programmer: 20th Anniversary Edition, 2nd Edition: Your Journey to Mastery

David Thomas, Andrew Hunt, et al.

4.8 on Amazon

27 HN comments

Snow Crash

Neal Stephenson, Jonathan Davis, et al.

4.3 on Amazon

24 HN comments

The Mom Test: How to Talk to Customers & Learn If Your Business Is a Good Idea When Everyone Is Lying to You

Rob Fitzpatrick and Robfitz Ltd

4.7 on Amazon

22 HN comments

Dune

Frank Herbert, Scott Brick, et al.

4.7 on Amazon

20 HN comments

Seveneves: A Novel

Neal Stephenson, Mary Robinette Kowal, et al.

4.1 on Amazon

20 HN comments

Why We Sleep: Unlocking the Power of Sleep and Dreams

Matthew Walker, Steve West, et al.

4.7 on Amazon

19 HN comments

Project Hail Mary

Andy Weir, Ray Porter, et al.

4.7 on Amazon

18 HN comments

Never Split the Difference: Negotiating as if Your Life Depended on It

Chris Voss, Michael Kramer, et al.

4.8 on Amazon

18 HN comments

Brave New World

Aldous Huxley

4.6 on Amazon

16 HN comments

Thinking, Fast and Slow

Daniel Kahneman, Patrick Egan, et al.

4.6 on Amazon

16 HN comments

The Design of Everyday Things: Revised and Expanded Edition

Don Norman

4.6 on Amazon

15 HN comments

A Pattern Language: Towns, Buildings, Construction (Center for Environmental Structure Series)

Christopher Alexander, Sara Ishikawa, et al.

4.7 on Amazon

15 HN comments



barbecue_sauce on Apr 15, 2021

If you want to understand how databases work under the hood, from the basics up to near-state-of-the-art, it's filled with tons of great information. I recommend reading Kleppmann's Designing Data Intensive Applications and then watching Pavlo.

bwh2 on Apr 30, 2021

Designing Data Intensive Applications was surprisingly detailed in terms of data storage.

I also enjoyed Release It! by Michael Nygard to learn about making distributed systems more resilient.

eatonphil on June 30, 2021

Here are my four: Designing Data Intensive Applications, the Google SRE book, High Performance Browser Networking, and Effective Python.

https://notes.eatonphil.com/books-developers-should-read.htm...

jkaptur on Apr 14, 2021

In case someone isn't aware, Designing Data-Intensive Applications is a very good introduction to distributed databases, even if it doesn't specialize in them.

jstx1 on July 26, 2021

As a side note, I feel like I need to revisit Designing Data-Intensive Applications. I got almost nothing out of it the first time I read it, but people on HN keep recommending it as this absolute game-changing gem.

eatonphil on June 27, 2021

As luck would have it, my blog on Github pages is down. So here's the post describing the four in markdown.

Tldr; Designing Data Intensive Applications, Effective Python, The Google SRE book, and High Performance Browser Networking.

https://github.com/eatonphil/notes.eatonphil.com/blob/master...

LarryEt on June 26, 2021

Wonderful. I am such a Taleb fan boy but have been putting off Designing Data Intensive Applications. I am starting on it this afternoon from this post.

I just started on Daniel Kahneman's Noise. It will be disappointing if it isn't one of these type of books.

shadeslayer_ on June 30, 2021

Designing Data-Intensive Applications has taught me more than 90% of what I know about scalability. I can't recommend that book enough.

dharmaturtle on June 28, 2021

Hah - you're describing "event sourcing" and it's the technique I'm using: https://flpvsk.com/blog/2019-07-20-offline-first-apps-event-...

Unfortunately event sourcing means distributed systems... and I'm learning this on the fly on nights & weekends. Martin Kleppmann's "Designing Data Intensive Applications" has put the fear of god in me.

Zealotux on July 2, 2021

I'm currently learning back-end coming from a front-end career, and I started reading "Designing Data-Intensive Applications" by Martin Kleppmann, which seems to be a must-have for anyone who wants to get serious in this field.

pitched on July 11, 2021

Designing Data-Intensive Applications by Martin Kleppmann
https://www.goodreads.com/book/show/23463279
It’s usually number 1 on these lists but definitely deserves it!

rmetzler on July 15, 2021

I read the DDIA book in my own time and bought it myself. It probably would be possible to get these kinds of books bought by the company and read them on company time if there is some important enough thing to learn. But I never made use of it. The way I read books you can tell I read them because they have marks everywhere.
We do have initiatives to learn from each other, but the days are already filled with too much unplanned work and meetings.

sideway on Aug 3, 2021

Thanks for your detailed answer, really appreciate it.

Two follow up questions if you don't mind me asking, even though I understand you were not on the publishing side:

1. Do you know if changes in the org structure (e.g. when Uber was growing fast and, I guess, new teams/products were created and existing teams/products were split) had a significant effect on the schemas that had been published up to then? For example, when a service is split into two and the dataset of the original service is now distributed, what pattern have you seen working sufficiently well for not breaking everyone downstream?

2. Did you have strong guidelines on how to structure events? Were they entity-based with each message carrying a snapshot of the state of the entities or action-based describing the business logic that occurred? Maybe both?
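To make the two event styles in question 2 concrete, here is a minimal sketch. The account domain and every name in it (`AccountSnapshot`, `Deposited`, `Withdrawn`, `replay`) are hypothetical illustrations, not anything described in the thread. An entity-based message carries the full state after a change, while action-based messages describe what happened and leave consumers to fold them into state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountSnapshot:
    """Entity-based event: the whole state of the entity after a change."""
    account_id: str
    balance: int


@dataclass(frozen=True)
class Deposited:
    """Action-based event: describes the business action that occurred."""
    account_id: str
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    """Action-based event for the opposite business action."""
    account_id: str
    amount: int


def apply_event(state: AccountSnapshot, event) -> AccountSnapshot:
    """Fold a single action-based event into the current state."""
    if isinstance(event, Deposited):
        return AccountSnapshot(state.account_id, state.balance + event.amount)
    if isinstance(event, Withdrawn):
        return AccountSnapshot(state.account_id, state.balance - event.amount)
    raise TypeError(f"unknown event type: {type(event).__name__}")


def replay(account_id: str, events) -> AccountSnapshot:
    """Rebuild the entity snapshot by replaying its action events in order."""
    state = AccountSnapshot(account_id, 0)
    for event in events:
        if event.account_id == account_id:
            state = apply_event(state, event)
    return state
```

With snapshots, a consumer only needs the latest message per entity; with action events, it needs the full ordered history but keeps the business meaning of each change.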

And yes, one of the books I'm talking about is indeed Designing Data Intensive Applications and I fully agree with you that it's a fantastic piece of work.

avinassh on May 27, 2021

I love reading articles about databases, especially about their internal workings. Some of the blogs I follow:

1. Oren Eini - Creator and CTO of Raven DB - https://ayende.com/blog

2. Tyler Neely - Creator of Sled DB - https://medium.com/@tylerneely

3. Philip O'Toole - Creator of rqlite - https://www.philipotoole.com/

4. Martin Kleppmann - Author of Designing Data-Intensive Applications - https://martin.kleppmann.com/archive.html

5. Glauber Costa - worked on glommio, scylla DB - https://glaubercosta-11125.medium.com/

Do recommend more if you know any!

thundergolfer on June 30, 2021

When I first started learning software I liked to 'collect' these kinds of lists as educational busywork. Now that I've been learning software engineering for over 6 years I think they're super unhelpfully overwhelming, like you say. You want _one_ short list that you actually use.

For me that is teachyourselfcs.com. It recommends only two books if you don't have "multiple years" to self-study part-time. They are: Computer Systems: A Programmer's Perspective and Designing Data-Intensive Applications. If you do have multiple years it recommends ~9 books. The OP list has almost 100 books just on software architecture.

It takes so long to read one good textbook that I'd bet 90% of software engineers haven't read more than three or four cover-to-cover. I was rare in my computing theory class for actually using the textbook and doing the exercises and I only got 2/3 through. Given my current progress rate through 'Computer Systems: A Programmer's Perspective' it will take me at least 150 hours to complete.

Jtsummers on June 25, 2021

Eric Evans' Domain-Driven Design. I've heard enough about DDD over the years that I figured I'd just go to the source. Liking it so far, I have some good takeaways but we'll see how effectively I'm able to use the ideas over the next couple years.

Martin Kleppmann's Designing Data-Intensive Applications. Based on the frequent praise it receives here, haven't gotten far yet. I have some project ideas (for personal and professional projects) that could benefit from reading through it.

Martin Fowler's 2018 update to Refactoring. I read the original one a long time ago. In context, we have a work lunch & learn series and I'm interested in doing some presentations on the topic of refactoring (why, how, and when in particular) so it seemed appropriate to refresh my memory on some specific terminology from the book as well as to see if it's an appropriate book to recommend to colleagues. My recollection of the first edition is that I'd recommend it to colleagues, but it's been so long I'd rather read it once more before actually recommending it.

I reread Robert C. Martin's Clean Code based on some recent discussion here where it was rather strongly dismissed by a fair number of people. I didn't recall it being bad, my reread confirmed it is not, in fact, bad. Java-heavy, which is now an unpopular style of OOP, but otherwise a very good book. I'd still recommend it to junior colleagues paired with some caveats about avoiding seeing the world in black & white. There is no singular Way of Programming, but learn various ways and find what works for you and your team.

There are some more, but it's almost 5am and I haven't been able to sleep so I don't recall everything that's in the book stack or ebook queue. These are the ones I'm most interested in at present.

tybit on July 15, 2021

I read the DDIA book in my own time and have watched some of the internal Amazon tech talks mentioned above during work time with other engineers.

I think either way it’s important to realise that unless you’re superhuman you’re not really solidifying the knowledge, you’re making yourself aware of the concepts. If a use case comes up that requires it you have a better chance of recognising that and then going back and getting a deeper understanding of how to apply it.

I only absorbed a small fraction of DDIA but I still think reading it was invaluable.

faizshah on Aug 15, 2021

You seem to think that this is just a service that reads and writes data like any CRUD app.

The hard problems stem from how the system deals with failures and how the system propagates writes across the replicas while meeting latency and consistency SLAs. On top of that, the system needs to be built in a way that it can be maintained by many developers, each working on a small piece of the system without knowing the ins and outs of the system as a whole. In addition, when the system fails, debugging and mitigation need to be parallelizable across many developers so that availability SLAs can be maintained. You can read about this in “Designing Data-Intensive Applications” by Martin Kleppmann, where he discusses the complexity involved in building distributed systems.

billti on July 19, 2021

I'm currently listening to "Designing Data-Intensive Applications" [1] on my commute and it really does work well as an Audio Book (I can attest to the positive reviews). Highly recommended if you are dealing with any requirements in the space (scale, replication, consistency, SQL vs NoSQL, etc.)

[1] https://www.audible.com/pd/Designing-Data-Intensive-Applicat...

mapme on Apr 30, 2021

The easiest path is to read Designing Data Intensive Applications by Martin Kleppmann. It builds from the ground up, and by the end (or after a second read) you will have a deep understanding. The book hits on all of the tech you mentioned except Kube. For that, read one of the original Google papers on cluster management systems, e.g. Omega.

ovidiup13 on June 25, 2021

My planned summer reading list:

- High Performance Browser Networking by Ilya Grigorik

- Refactoring: Improving the Design of Existing Code by Martin Fowler

- Designing Data-Intensive Applications by Martin Kleppmann

How did you find the latter? I'm a FE developer, so quite keen to get my hands dirty in data.

wenc on June 25, 2021

* Fooled By Randomness (NN Taleb): Taleb is a complicated personality, but this book gave me a heuristic for thinking about long-tails and uncertain events that I could never have derived myself from a probability textbook.

* Designing Data Intensive Applications (M Kleppmann): Provided a first-principles approach for thinking about the design of modern large-scale data infrastructure. It's not just about assembling different technologies -- there are principles behind how data moves and transforms that transcend current technology, and DDIA is an articulation of those principles. After reading this, I began to notice general patterns in data infrastructure, which helped me quickly grasp how new technologies worked. (most are variations on the same principles)

* Introduction to Statistical Learning (James et al) and Applied Predictive Modeling (Kuhn et al). These two books gave me a grand sweep of predictive modeling methods pre-deep learning, methods which continue to be useful and applicable to a wider variety of problem contexts than AI/Deep Learning. (neural networks aren't appropriate for huge classes of problems)

* High Output Management (A Grove): oft-recommended book by former Intel CEO Andy Grove on how middle management in large corporations actually works, from promotions to meetings (as a unit of work). This was my guide to interpreting my experiences when I joined a large corporation and boy was it accurate. It gave me a language and a framework for thinking about what was happening around me. I heard this was 1 of 2 books Tobi Luetke read to understand management when he went from being a technical person to CEO of Shopify. (the other book being Cialdini's Influence). The Hard Thing About Hard Things (B Horowitz) is a different take that is also worth a read to understand the hidden--but intentional--managerial design of a modern tech company. These are some of the very few books written by practitioners--rather than management gurus--that I've found to track pretty closely with my own real life experiences.

nikhilsimha on June 30, 2021

[prioritization] The Effective Engineer - Lau

[systems] Designing Data-Intensive Applications - Kleppmann

[programming] SICP - Sussman & Abelson

Last one is an old scheme book. No other book (that I read) can even hold a candle to this one, in terms of actually developing my thought process around abstraction & composition of ideas in code. Things that library authors often need to deal with.

For example, in React: what are the right concepts that are powerful enough to represent a dynamic website, and how should they compose together?

chana_masala on July 19, 2021

I agree that the Pragmatic Programmer is well done in its audio form, and I also agree that Grokking Algorithms is terrible.

I am currently listening to Designing Data Intensive Applications and it's phenomenally done - the author clearly worked with the narrator to adapt the content to audio format, and the narrator seems to have experience or familiarity with the subject because he pronounces the technical jargon very naturally.

I hope to find other software related audiobooks as good as DDIA is.

evanrich on Aug 3, 2021

Like others have said, it is just one tool in the tool box.

We used Kafka for event-driven microservices quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within the topic. Same as you would expect from any other public-facing API. We also didn't allow multiplexing the topic with multiple schemas. This wasn’t just because it made my life easier. A large portion of the topics we had went on to become analytical tables in Hive. Breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old. This puts a lot of onus on the producer, so we tried to make tools to help. We had a central schema registry with the topics the schemas paired to that showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice though, we never got much pushback on the no-breaking changes rule.
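As a rough illustration of a "no breaking schema changes within a topic" rule, a registry could run a check like the one below before accepting a new schema version. This is a minimal sketch under stated assumptions: the `Field` and `Schema` shapes and the `breaking_changes` function are hypothetical, and the rule set (removed field, changed type, new field without a default) is one conservative choice, not the tooling described above. Real schema registries offer several compatibility modes with different rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical field description: a type name and whether it has a default."""
    type: str
    has_default: bool = False


# Hypothetical schema shape: field name -> field description.
Schema = dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """Return the reasons `new` would break consumers of `old` (empty if none)."""
    problems = []
    for name, old_field in old.items():
        if name not in new:
            # Downstream tables and consumers may still read this field.
            problems.append(f"removed field: {name}")
        elif new[name].type != old_field.type:
            problems.append(
                f"type change on {name}: {old_field.type} -> {new[name].type}"
            )
    for name, new_field in new.items():
        if name not in old and not new_field.has_default:
            # Older records have no value for this field.
            problems.append(f"new field without default: {name}")
    return problems
```

A producer whose change fails the check would publish to a new topic instead, as described above, and phase out the old one.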

DLQ practices were decided by teams based on need, too many things there to consider to make blanket rules. When in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have over-written this event? Are you paying for some API that your auto-retry churning away in your DLQ is going to cost you a ton of money? Sometimes you may not even want a DLQ, you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.
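The questions above (idempotency, stale events, DLQ versus poison pill) can be sketched in a small in-memory consumer. Everything here is an assumption for illustration: the event shape (`id`, `key`, `version`), the `Consumer` class, and the `PoisonPill` exception are hypothetical and do not correspond to any Kafka client API.

```python
from dataclasses import dataclass, field
from typing import Callable


class PoisonPill(Exception):
    """Raised to halt consumption at the first event that cannot be processed."""


@dataclass
class Consumer:
    handler: Callable[[dict], None]
    use_dlq: bool = True
    seen_ids: set = field(default_factory=set)
    latest_version: dict = field(default_factory=dict)
    dead_letters: list = field(default_factory=list)

    def consume(self, event: dict) -> str:
        # Idempotency: an event id that was already handled is skipped.
        if event["id"] in self.seen_ids:
            return "duplicate"
        # Staleness: a newer event for the same key has already been applied.
        if event["version"] <= self.latest_version.get(event["key"], -1):
            self.seen_ids.add(event["id"])
            return "stale"
        try:
            self.handler(event)
        except Exception as exc:
            if self.use_dlq:
                # Park the event for later inspection or replay and keep going.
                self.dead_letters.append(event)
                return "dead-lettered"
            # Poison pill: stop immediately so the failure is looked at right away.
            raise PoisonPill(event["id"]) from exc
        self.seen_ids.add(event["id"])
        self.latest_version[event["key"]] = event["version"]
        return "processed"
```

Note that this sketch never retries dead-lettered events automatically, which sidesteps the paid-API concern; a real retry policy would need its own limits.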

I hope one of the books you are talking about is Designing Data Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.

macintux on June 21, 2021

A couple of books that are often recommended for understanding how to make robust software:

- Release It! (https://pragprog.com/titles/mnee2/release-it-second-edition/)

- Designing Data-Intensive Applications (https://dataintensive.net/)

I would suggest finding an open source project of interest and taking a deep dive into its code and documentation to understand how it works and why it was built that way.

Which reminds me, this should help with that: The Architecture of Open Source Applications (http://www.aosabook.org/en/index.html)

rashidujang on Mar 31, 2021

Similar to this, upon reading the often recommended book Designing Data Intensive Applications, I realized that the concept of a system and the optimization of one is simply a combination of reads/writes, storage, I/O, and bottlenecks. It's the layers upon layers of work that's been done on it that makes it work the way it does now.

abledon on July 15, 2021

Question for people in 'Mature enough' Orgs who actually use this advanced tech/concept....

Do you watch videos like these on your employer's time? And then do you have like 5-6 other colleagues also watch the video? Or perhaps book a meeting room (virtually) to all watch the talk, make popcorn, etc.?

And then once it's done, do you guys have a day to come back around and discuss its concepts, like a reading group? (What if stuff is burning: PRs, SRE tickets, milestones? Is time allocated for this stuff? Is keeping abreast of this knowledge the lowest priority?) I started reading his DDIA book and I can't fathom solidifying all those concepts without discussing them thoroughly with other engineers over at least a month or so.

Basically, curious how orgs 'ingest' these large ideas into their software eng knowledge practices, are there initiatives? etc... or are you just banking on some engineer to study this stuff on their own time at home.

cloverich on Aug 6, 2021

> You can't design it first and then go and build it.

You can't design it in full in one go, but you can design it and then incrementally update said design. Sadly many (companies) do not. But you can define the problem(s), the scope, the scale, and then design a solution appropriately to meet those needs (for a defined period of time). That's what distinguishes software engineering from hacking. They both have their place. Many companies claim to do the former but are mostly doing the latter. Software is still early in its life and as various kinds of system designs stabilize, so will the formalizations around what it means to be a software developer. Reading a book like Designing Data-Intensive Applications, you can't help but see those formalized topics budding.

ipnon on Apr 30, 2021

The best way to learn is to have skin in the game. Doing something yourself will force you to do what actually works. So it seems your current professional employment is excellent in that regard.

Formal study seems to work best after real experience. I read Martin Kleppmann's Designing Data-Intensive Applications based on its inclusion in teachyourselfcs.com.[0] I did not find it useful because I had nothing to apply it to once I finished. However I don't think this will apply to you as it seems you already have some problems in mind to consider.

[0] https://teachyourselfcs.com/#distributed-systems

ingvul on May 1, 2021

While I recommend reading DDIA, I think buzzcut_diet may end up disappointed. Reasons:

- it takes a while to read DDIA. Probably around 6 months of focused reading. Perhaps more

- one can learn a really good chunk of theoretical stuff... but probably not applicable to day to day work

- zero practical experience will be gained regarding Kubernetes, Spark, Kafka, EMR, Redis

So, I would recommend a more practical approach:

- start already reading the documentation of K8s, Kafka, Spark, etc. Choose one and go for it. I would recommend Kafka since its documentation is well written

- while reading documentation of the tooling above, one will inevitably stumble upon theoretical stuff that will not be explained in detail: that's exactly when you pick up DDIA (or similar books) and try to find the topic in the index and read it.

nindalf on June 21, 2021

There are two books that taught me how systems work.

- One system in isolation - Operating Systems: Three Easy Pieces. Covers persistence, virtualisation and concurrency. This book is available for free at https://pages.cs.wisc.edu/~remzi/OSTEP/

- Multiple systems, and how data flows through them - Designing Data Intensive Applications. Covers the low level details of how databases persist data to disk and how multiple nodes coordinate with each other. If you’ve heard of the “CAP theorem”, this is the source to learn it from. Worth every penny.

More on why these two books are worth reading at https://teachyourselfcs.com

Multicomp on July 7, 2021

I'd love to know the 'design patterns' of works like this Knights of San Francisco game. Did the author use a workflow engine, a rules engine, functional event sourcing, a nested pyramid of if-then-else doom?

I have a sense that this space is somewhat unexplored. Text-based game world simulation is a relatively underdocumented (to my eyes) form of the 'game UI overtop a database manipulated with game logic rules' type of games, with simulation games at the complex end.

There are things like Twine and Inky that offer variables and conditionals to prewritten bodies of text, but doing composable text worlds that change their state based on the accumulated choices of players over the course of their time seems to be a complex feature to build and extend, whether in Twine or another tool. Dialog simulation systems that remember what options you've done and give you additional options or changes over the course of the game are sold as products online. Heck, someone recently patented a 'grudge' system that a popular game (League of Legends?) used.

Or maybe I've just been looking too closely at it. I've been working slowly for about the past 2 years on an automation system for a tabletop RPG (non D-20 system) to speed up battle generation & resolution, trying to incorporate all of the various rules that say 'in X scenario, if Y conditions are met, gather this information from the user, then apply its Z effect like so, but also let the DM / user change any of the above or ignore the entire thing before you do so', so while I've ordered Designing Data Intensive Applications in hopes of gaining more insights, this problem certainly seems like a big thing to chew on from my self-taught programmer's POV right now.

machinehermiter on July 19, 2021

I second Designing Data-Intensive Applications.

Deep Learning with Python by François Chollet I think works as an audiobook as well.

I am a big non-fiction audio book fan and so much depends on the voice actor. A bad read can ruin the best content, while Robertson Dean made Alan Greenspan's The Age of Turbulence into an enthralling adventure story.