If you want to understand how databases work under the hood, from the basics up to near-state-of-the-art, it's filled with tons of great information. I recommend reading Kleppmann's Designing Data Intensive Applications and then watching Pavlo.
Huge +1 to Martin Kleppmann's "Designing Data-Intensive Applications." It's the best single-volume introduction I've seen to the fundamentals of databases, queues, and the complications inherent in making those systems distributed.
The book "Designing Data-Intensive Applications" is really really good, and covers all the concepts (although not per se the tools) you need to understand.
I know you said you are good at system design, but I'd still recommend you read the Designing Data Intensive Applications book and read through https://github.com/donnemartin/system-design-primer.
If you do those and Cracking the Coding Interview you should be good to go.
Read "Designing Data-intensive Applications"-- it's a great combo of theory + real-application (including some of the technologies you listed).
I loved Designing Data-Intensive Applications. It gives you the reasons why NoSQL databases exist and the problems they solve. Moreover it gives you reasons to select one over another. It's really excellent and one of my top two CS books
Martin deserves every penny for his hard work on Designing Data Intensive Applications. Experienced engineers can easily use the book as a reference and you can give the book to smart junior engineers to give them a great foundation.
Honestly, the best technical book I’ve ever owned.
Designing Data Intensive Applications was surprisingly detailed in terms of data storage.
I also enjoyed Release It! by Michael Nygard to learn about making distributed systems more resilient.
If you want to read more about this topic, I liked "Designing Data-intensive Applications" by Martin Kleppmann.
If you liked this page, you might also like the excellent book "Designing Data-Intensive Applications" that among others surveys many characteristics of large-scale systems and presents some. Note that it's not a book for preparing you on system design questions, but it can definitely help.
FWIW, the article mentions the book "Designing Data-Intensive Applications" by Martin Kleppmann. I wanted to throw out my own endorsement for the book, it's been instrumental in helping me design my own fairly intensive data pipeline.
Designing Data-Intensive Applications [1] is a good book all around for creating applications and managing the data that they provide, including NoSQL.
[1] See http://dataintensive.net
I confirm DDIA is an awesome book; I would recommend it to anyone working with databases and distributed systems.
Schema on read, as opposed to schema on write, as it says in the excellent "Designing Data Intensive Applications" book ... nightmarish to deal with.
This seems to be Martin Kleppmann's work - worth noting that his book, "Designing data intensive applications" is also very good :)
Designing Data-Intensive Applications by Martin Kleppmann. He does a great job of distilling storage systems to concepts and discussing conceptual trade-offs instead of focusing on particular storage products. Plenty of footnotes to relevant research papers too.
Strongly agree with the high rating for Designing Data Intensive Applications. It's a fantastic book that covers most of the important principles in architecture.
I wish this website had more filters. I'd like to filter out books with fewer than 10 reviews. As it is right now, it's a bit noisy.
Just to add on to the fantastic article: Designing Data‑Intensive Applications is a great book which tries to bridge the gap between distributed systems theory and practice.
DDIA is an amazing book, but it's way, way too deep for any kind of interview. If you read DDIA cover to cover, you know what I mean.
DDIA is to system design, as a computer science textbook is to the algo interview.
If you want a good high level overview I recommend Designing Data Intensive Applications by Kleppmann. You will walk away with a good understanding of the tradeoffs of each paradigm.
I'm currently starting up a study group for Martin Kleppmann's "Designing Data-Intensive Applications," which seems to be a popular book with a lot of practical knowledge that'd be great for discussions.
This is a wide-ranging interview (in written form) with Martin Kleppmann, author of the highly acclaimed book "Designing Data-Intensive Applications". Good stuff!
I think Designing Data Intensive Applications covers more than this, but this still looks interesting, perhaps a potentially nice (and free) read before diving into the other one. Doesn't seem whitepaper-y at all.
I enjoyed the book "Designing Data-Intensive Applications" [1]. It is a survey of technologies for storing and processing data.
As an engineer new to system design, I found the whole book to be gold. It gave me the vocabulary to continue learning more on my own.
[1]: https://dataintensive.net
Read "Designing data intensive applications" (http://dataintensive.net/), which is an excellent introduction to various techniques for solving data problems. It won't specifically tell you what to do, but will quickly acclimate you to available approaches and how to think about their trade-offs.
In case someone isn't aware, Designing Data-Intensive Applications is a very good introduction to distributed databases, even if it doesn't specialize in them.
I just wanted to point out that there is a good explanation of SSTables, B-trees and other database structures in a whole chapter of the great book Designing Data-Intensive Applications, by Martin Kleppmann.
Kleppmann's Designing Data-Intensive Applications was one of the best technical books I've ever read. Perfect mix of theory and practice for someone who understands the web stack but doesn't understand how distributed systems are built.
Pretty necessary, I can't imagine working at a FAANG without at least having read and understood "Designing Data-Intensive Applications" cover to cover.
Designing Data Intensive Applications [0] is a pretty complete and thorough starting point.
Not the author, just a happy reader.
[0] https://dataintensive.net/
As a side note, I feel like I need to revisit Designing Data-Intensive Applications. I got almost nothing out of it the first time I read it, but people on HN keep recommending it as this absolute game-changing gem.
+1 for Designing Data Intensive Applications, it's a phenomenal book that's helpful if you've never scaled before.
I already have Designing Data-Intensive Applications (2017), do you think I would get much more out of that book?
There is a great chapter in "Designing Data Intensive Applications" about this very subject
Can someone compare this to Martin Kleppmann's awesome book "Designing Data Intensive Applications"? I'm wondering if this book is like the old IBM, etc. whitepapers which quietly tried to sell technologies from the writer's company?
Martin Kleppmann's "Designing Data-Intensive Applications" discusses this: https://dataintensive.net/
Besides being a good read overall, the book discusses topics like this one in detail and with a healthy attitude (people tend to have strong opinions on this).
I can heavily recommend the book "Designing Data Intensive Applications" by Martin Kleppmann for a great overview and deep dive into the trade-offs in both of those approaches (and others)
Wonderful. I am such a Taleb fan boy but have been putting off Designing Data Intensive Applications. I am starting on it this afternoon from this post.
I just started on Daniel Kahneman's Noise. It will be disappointing if it isn't one of these type of books.
Books: Operating Systems/Database/Networking/Computer Security/Computer Architecture textbooks, Software Engineering textbooks (Clean Code, Design Patterns, Designing Data Intensive Applications, Domain Driven Design as a short list off the top of my head)
Designing Data Intensive Applications (DDIA) is the best book I’ve come across when it comes to serious system design.
Designing Data-Intensive Applications by Martin Kleppmann. If you ever plan on working on systems with high availability, high throughput or high data volume requirements, this is the book to read. Great balance between practical and theoretical, best book on distributed systems IMO.
Designing Data-Intensive Applications has taught me more than 90% of what I know about scalability. I can't recommend that book enough.
I highly recommend reading “Designing data intensive applications” by Martin Kleppmann to get a thorough overview with lots of references. Reading this book is a timesaver compared to finding all this information across blog posts.
https://www.amazon.de/dp/1449373321/
I totally agree! I can't wait for Martin Kleppmann's "Designing Data-Intensive Applications" to be complete. I've read the chapters available through the early release and highly recommend the book based on what I've seen so far!
http://dataintensive.net/
If you're interested in this, I recommend Martin Kleppmann's book "Designing Data-Intensive Applications" which is a longer form discussion of these topics.
http://dataintensive.net
Martin Kleppmann's "Designing Data-Intensive Applications" is one of the best books in computing I've read in a very long time (and thus the best book of 2017).
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann. Absolute best book on system design.
"Designing Data Intensive Applications" is an absolute goldmine for things like message queues but also going beyond understanding the full implications in database selection and other common, distributed-oriented engineering decisions modern software engineers may come across.
For those looking to understand how to choose the right database for the job, I'd recommend first reading "Designing Data-Intensive Applications" (https://dataintensive.net)
Hah - you're describing "event sourcing" and it's the technique I'm using:
https://flpvsk.com/blog/2019-07-20-offline-first-apps-event-...
Unfortunately event sourcing means distributed systems... and I'm learning this on the fly on nights & weekends. Martin Kleppmann's "Designing Data Intensive Applications" has put the fear of god in me.
Blindsight by Peter Watts - really interesting high concept sci-fi about the nature of life and consciousness.
Designing Data Intensive Applications by Martin Kleppmann. This book really made stream processing and Kafka click for me.
Starting to feel like a parrot since I’ve recommended this book a few times already on HN, but Designing Data Intensive Applications [0] is a great starting resource. I’m not affiliated in any way, just a happy reader.
[0] https://dataintensive.net/
This is a great list.
Missing from the Architecture & System Design list is Martin Kleppmann's Designing Data Intensive Applications, IMO the best modern book on systems / scalability.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems is a really good book.
At the Architecture level see the very good book "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann.
For actual code, you do not have one book but have to glean the knowledge from a whole bunch of them.
The theorem only applies to perfectly asynchronous systems, which assume that algorithms are deterministic and have no clocks. We must introduce imperfect heuristics like timeouts in order to conclude nodes have crashed by deduction and be able to reach consensus in concrete distributed systems.
This paper is referenced in chapter 9 (Consistency and Consensus) of "Designing Data-Intensive Applications" by Martin Kleppmann.
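To make the timeout heuristic mentioned above concrete, here is a minimal sketch of a heartbeat-based failure detector. The class and method names are hypothetical (they come from neither the paper nor the book), and timestamps are passed in explicitly so the behaviour is easy to follow. Note that it can only ever suspect a node: a slow node and a crashed node look identical, which is exactly why the heuristic is imperfect.

```python
class TimeoutFailureDetector:
    """Suspects a node once no heartbeat has arrived within the timeout.

    Hypothetical illustration: names and structure are assumptions, not an
    API from any real library.
    """

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # node name -> time of the latest heartbeat

    def heartbeat(self, node, now):
        # Record that `node` was heard from at time `now` (in seconds).
        self.last_heartbeat[node] = now

    def suspected(self, now):
        # A node is only *suspected*: it may be slow rather than crashed.
        return sorted(
            node
            for node, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )


detector = TimeoutFailureDetector(timeout_seconds=5.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=4.0)
print(detector.suspected(now=6.0))  # "a" has been silent for more than 5s
```

If "a" sends another heartbeat later, it simply drops out of the suspected list again, so any decision built on top of this has to tolerate false suspicions.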
Can anyone recommend me a well written book on distributed systems? I finished reading Designing Data-Intensive Applications last year and enjoyed it a lot.
Designing Data-Intensive Applications by Martin Kleppmann. It's the other half of software engineering as far as I'm concerned.
Incidentally, Martin's book ("Designing Data-Intensive Applications") is excellent and highly recommended reading. If you find yourself saying things like "this database is ACID compliant", "we have an SQL database with transactions, so we're fine" or "let's just add replication to Postgres and we'll be fine", you need to read this book.
I found the book "Designing Data-Intensive Applications" by Martin Kleppmann to be a good, practical primer on a lot of these topics.
If you want to go deeper on any of the subjects he discusses, his references for every chapter are solid and provide a deeper understanding.
Thanks! I hope so too. My current learning material is “Designing Data-Intensive Applications” which seems to involve some data structures.
On algorithms, I think being able to use the work of others is already very empowering. Perhaps one day I’ll get into it deep.
Designing Data-Intensive Applications is probably one of the absolute best technical books I've read. I haven't done a second read through, but I can definitely see that being helpful.
Designing Data Intensive Applications is not exactly what you're looking for, but it touches on some API topics and is a genuinely great technical read for application programmers.
The topical overview certainly sounds interesting, but sounds extremely similar to Designing Data Intensive Applications which also covers modern DB internals.
What’s the sell here?
I'm currently learning back-end coming from a front-end career, and I started reading "Designing Data-Intensive Applications" by Martin Kleppmann, which seems to be a must-have for anyone who wants to get serious in this field.
"Designing Data-Intensive Applications" is excellent, but it took me a long time to get through (it really helped that we read it in the book club at work). I am very glad I read it though - there is so much good stuff in it. I wrote a summary of it (mostly so I would learn the contents better):
https://henrikwarne.com/2019/07/27/book-review-designing-dat...
The books on my desk are a combination of reference books and books that are good conversation-starters (I've read them already and don't need them as reference, but they're good for lending out to people, especially junior devs).
Reference:
- Effective Java (good for learning the mindset of developing backward-compatible APIs in any language)
- Enterprise Integration Patterns (I work on an enterprise APIs team)
- Designing Data-Intensive Applications
- Camel in Action
Good for lending out:
- The Phoenix Project
- Making Work Visible
- Effective DevOps
- The Pragmatic Programmer
- REST in Practice
I've been programming C++ for years. Trying to keep up with modern C++ took a lot of time, which I feel is better spent on other stuff, like reading "Designing Data Intensive Applications" or something similar than yet another Scott Meyers book.
I am trying to learn more about distributed systems architecture. Currently almost finished reading "Designing Data Intensive Applications". Does HN have any recommendation for what book I should read next? I am thinking
- Building Microservices
- Designing Distributed Systems
Any thoughts?
I read the DDIA book in my own time and bought it myself. It probably would be possible to get these kinds of books bought by the company and read them on company time if there is some important enough thing to learn. But I never made use of it. The way I read books you can tell I read them because they have marks everywhere.
We do have initiatives to learn from each other, but the days are already filled with too much unplanned work and meetings.
As a precursor to this (excellent, in depth) post I also recommend Martin Kleppmann's Designing Data-Intensive Applications, which is (to date) the definitive 101 book on the topic.
He also did these awesome Tolkien-esque maps of the database engine ecosystem: https://martin.kleppmann.com/2017/03/15/map-distributed-data...
Anyway, I inject this sort of stuff directly into my veins, so thanks very much for the post!
I usually read good technical books twice. Recently did that for: Designing Data-Intensive Applications, DynamoDB Book.
In terms of software design, I found Designing Data Intensive Applications to be very informative, and I am going to give it another read next year. It cites a lot of references at the end of every chapter for the reader to explore the various topics further.
The article mentions Martin Kleppmann. Go buy a copy of his "Designing Data-Intensive Applications". It may well have been titled "Practical Distributed Systems for the Everyday Developer". It is an absolutely fantastic book with a perfect ratio of theory/practice.
Extra points for buying a dead tree copy and reading it without a thousand alerts and internet temptations vying for your attention :)
I highly recommend Martin Kleppmann's Designing Data Intensive Applications (http://dataintensive.net).
It will not only help you understand what's "SQL" and "NoSQL" data stores, it also covers the differences between each of them, what problems they are designed to solve, how they try to solve it, and if it'll help with your problems as well.
Designing Data‑Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Martin Kleppmann
https://dataintensive.net/
I have recommended this to everyone.
I don't understand what he means when he said
"(a good example book for this currently is Designing Data Intensive Applications)."
I got this book recently and was planning to read it soon. Does he think the tools and techniques mentioned in the book are a waste of time, or the opposite?
I'd recommend the following:
Clean Code: A Handbook of Agile Software Craftsmanship [0] is a great book on writing and reading code.
Similarly, Clean Architecture: A Craftsman's Guide to Software Structure and Design [1] is, no surprise, a book on organizing and architecting software.
Designing Data-Intensive Applications [2] may be overkill for your situation, but it's a good read to get an idea about how large scale applications function.
The Architecture of Open Source Applications [3] is a fantastic free resource that walks through how many applications are built. As another comment mentioned, reading code and understanding how other programs are built are great ways to build your "how to do things" repertoire.
Finally, I'd also recommend taking some classes. I started as a self-taught developer, but I've since taken classes both in-person and online that have been a tremendous help. There are many available for free online, and if in-person classes work better for you (motivation, support, resources, etc), definitely go that route. They're a fantastic way to grow.
[0]: https://www.amazon.com/Clean-Code-Handbook-Software-Craftsma...
[1]: https://www.amazon.com/Clean-Architecture-Craftsmans-Softwar...
[2]: https://www.amazon.com/Designing-Data-Intensive-Applications...
[3]: http://aosabook.org/en/index.html
Kleppmann’s book: Designing Data-Intensive Applications.
It’s very well written, but maybe doesn’t have as much in the way of exercises.
Another good resource is Designing Data-Intensive Applications [1]. Chapter 2 does a really good job explaining how different categories of databases relate to different data models, including examples of querying graph-like data models using `WITH RECURSIVE` compared to a query language for graph databases.
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
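For anyone who hasn't seen `WITH RECURSIVE` used on graph-like data, here is a minimal sketch using Python's built-in `sqlite3` module. The `within` table, its rows, and the `containing_regions` helper are made up for illustration and are not the book's own example; the point is only that a relational engine can follow edges of unknown depth, at the cost of a fairly verbose query.

```python
import sqlite3

# Hypothetical toy graph: each row is a directed edge "src lies within dst".
EDGES = [
    ("Idaho", "United States"),
    ("United States", "North America"),
    ("London", "England"),
    ("England", "United Kingdom"),
    ("United Kingdom", "Europe"),
]


def containing_regions(start):
    """Return every region reachable from `start`, nearest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE within (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO within VALUES (?, ?)", EDGES)
    rows = conn.execute(
        """
        WITH RECURSIVE reachable(name, depth) AS (
            -- Base case: regions that directly contain the start place.
            SELECT dst, 1 FROM within WHERE src = ?
            UNION
            -- Recursive case: follow one more edge from what was found so far.
            SELECT w.dst, r.depth + 1
            FROM within AS w
            JOIN reachable AS r ON w.src = r.name
        )
        SELECT name FROM reachable ORDER BY depth
        """,
        (start,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


print(containing_regions("London"))  # ['England', 'United Kingdom', 'Europe']
```

A dedicated graph query language would express the same variable-length traversal in a line or two, which is the contrast the chapter draws.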
To add to suggestions of others, SQL Antipatterns was also an insightful read about basic pitfalls of db design.
I have also found first chapters from Designing Data Intensive Applications helpful to get context around sql/nosql dbs.
Haven’t really read Fowler’s stuff, but I have read Martin Kleppmann’s Designing Data-Intensive Applications and that was helpful. Haven’t seen it mentioned here (though I haven’t looked thoroughly through the comments). Just thought I’d mention it here.
Computer Systems: A Programmer's Perspective: around 2012 / 2013, I went through this book because I took a coursera course based on it. In fact, many universities base their systems courses around this book. It is really well written, has a great choice of topics, and phenomenal exercises [0] for practice (some are legitimately fun).
Operating Systems: Three Easy Pieces: In 2013, I found this book because I was frustrated with the textbook assigned for my operating systems class (Silberschatz). OSTEP has incredibly clear and concise descriptions without skimping on necessary details. It's wonderfully written. I was so jazzed up about this book that I ended up sending a lot of edits / improvements, and the authors gave me a very kind shoutout in the acknowledgements section.
Computer Networking: A Top-Down Approach: In 2013, this was the assigned textbook for my computer networking class. I already owned Tanenbaum & Wetherall which is good, but preferred this book. It is a more approachable treatment of networking (without sacrificing any crucial topics), so better for a first course.
I've heard glowing reviews of The Algorithm Design Manual, Designing Data-Intensive Applications, and Structure and Interpretation of Computer Programs over the years, but I haven't personally gone through them. For the TeachYourselfCS categories where I know the textbook landscape, I find their selections spot-on and pretty refreshing.
[0] https://csapp.cs.cmu.edu/3e/labs.html
+1 for Designing Data-Intensive Applications. I worked my way into the hadoop ecosystem a few years ago and read dozens of books before I started seeing the big picture of databases, data processing and distributed systems. This book gives the same grounding in a single read, it’s an instant classic.
I think it's the same reason why Designing Data Intensive Applications has been among the most popular books on safaribooks for quite some time now.
Many people are grinding for job interviews and many companies now copy FAANG and have a "systems design" round, Paxos/Raft is one of the key topics there, thus it's discovered by more and more people.
The book “Designing Data-Intensive Applications” by Martin Kleppmann is a fantastic read with such a concise train of thought. It builds up from basics, adds another thing, and another thing.
I kept asking myself, what would happen if I were to extend on the feature currently presented in the chapter I was reading, only to find out my answers in the next chapter.
Brilliant book
Thanks for your detailed answer, really appreciate it.
Two follow up questions if you don't mind me asking, even though I understand you were not on the publishing side:
1. Do you know if changes in the org structure (e.g. when Uber was growing fast and - I guess - new teams/products were created and existing teams/products were split) had a significant effect on the schemas that had been published since then? For example, when a service is split into two and the dataset of the original service is now distributed, what pattern have you seen working sufficiently well for not breaking everyone downstream?
2. Did you have strong guidelines on how to structure events? Were they entity-based with each message carrying a snapshot of the state of the entities or action-based describing the business logic that occurred? Maybe both?
And yes, one of the books I'm talking about is indeed Designing Data Intensive Applications and I fully agree with you that it's a fantastic piece of work.
It's not released yet, but I've been reading the early release version of Designing Data-Intensive Applications by Martin Kleppmann (http://shop.oreilly.com/product/0636920032175.do). I've found it pretty useful and well-written thus far. He does a good job of explaining concepts and then tying them to real-world implementations and examples. It's a good balance of theory and practical knowledge.
Designing Data-Intensive Applications is probably the best O'Reilly (if not overall technology) book of the past decade.
Surprised SICP isn't in there. Also surprised to see the Code book is ranked so high, I personally didn't get much out of it though it's probably a great introductory book to people who are new to the field.
On a related note, my favorite book this year was "Designing Data-Intensive Applications" by Martin Kleppmann. It's a great overview of modern database systems with a good balance between theory and practice.
The issue isn't "learning frameworks", it's don't waste your time learning how to use frameworks only. Learning how different frameworks are implemented, their goals, etc, is pretty valuable.
I would say the same thing about databases. Don't just learn how to use PostgreSQL, or Kafka, or Dynamo. Try to understand how they're implemented.
I've found that sometimes, you do need help. Getting some nice user documentation will get your feet wet. But I've found that the best books tend to be more general topics, like "Designing Data Intensive Applications" for databases. (Note: I haven't found anything like that for frameworks - would be a great topic though.) These tend to cover not only "patterns" but give you a nice survey of the theory - so you can dive further into details yourself.
I've enjoyed having an O'Reilly Safari subscription for random access to books. In particular, the Pragmatic Programmer and Designing Data-Intensive Applications.
I've also had good experiences with SCPD courses from Stanford, if your budget would cover those (they are at the other end of the price spectrum).
I like reading PDFs of books on my phone, especially on my commute. It works surprisingly well, better than I thought it would.
My work bought a copy of Designing Data Intensive Applications for the team, I've started reading it but lugging around 1kg of book every day gets old really quick, I wish they would have offered a PDF download coupon or something inside.
I couldn't agree more about curiosity mode (I'm going to use that phrase liberally). Despite reading papers for many years, I rarely go in cold. I browse blog posts and twitter, ask myself a series of questions, then try to find/read papers to answer them. Of course this only leads to more questions, and so the journey continues.
I also agree with the recommendations for "Designing Data-Intensive Applications" and "Database Internals". Though, having read the latter for a book club at $employer, I felt it served better as a sort of "index for the space" for people who already had some DB experience, rather than a true introduction.
I second this. I prefer reading on pdf, so I bought the ebook[1]. The book explains indexing perfectly, and it shouldn't take you more than a day to finish. I can't recall a book having a better benefit/time ratio. I wish the author would release more books in this vein, but he hasn't. So right now I'm looking at Designing Data-Intensive Applications to learn more about different kinds of databases [2].
[1] http://sql-performance-explained.com/
[2] http://dataintensive.net/
I agree with that 100%. I am a voracious reader, I love software engineering and productivity and fiction and Syfy and nonfiction and... You get the idea.
I have a previous comment on this site about reading. That is actually what I look for, intellectual curiosity and a desire to continue learning and growing.
I'll just quote myself:
"""
What I look for in a developer: READS BOOKS. ( Audio books count )
That's the only thing. I'm sorry, if you are not reading and studying to keep up, you are getting left behind. There are so many brilliant people writing amazing books on a huge array of subjects. If I could get every one of my developers to read ONE book on software design[0] a year, I would die happy and the entire industry would be 10 years ahead.
They don't even have to be technical books. I just want to see intellectual curiosity and a commitment to self improvement.
- 0: In the vein of Clean Architecture, The Pragmatic Programmer, The Mythical Man Month, Designing Data-Intensive Applications, The Google SRE book, etc
"""
Designing Data-Intensive Applications might be the best (non-niche) CS book of the decade, and he definitely did create far more value than he captured. So I'm glad Mr Kleppmann at least kind of "broke even" compared to the alternative of working for a FAANG (his salary assumptions seem however rather low). I guess the main takeaway for mere mortals who consider writing a book and making money with it (as opposed to boosting their profile) is to under no circumstances publish with O'Reilly. Getting <10% vs 80%+ of revenue is just about viable for the 0.01%.
Thank you for your reply. It was very helpful. I will include your suggestions into my learning path.
I had difficulty implementing data structures in C, not in python. Python I was able to think in terms of classes and attributes. But I was finding it difficult to do the same in C since there is no concept of classes. I am still trying to learn pointers properly to have an understanding how to implement data structures and algorithms effectively.
I came across the book you have recommended and it is a very nice book. I would recommend that along with Designing Data Intensive Applications.
Thank you.
"Designing Data-Intensive Applications" is shaping up to be an excellent treatment of modern databases and their underpinnings. It's at an excellent level of abstraction, deep enough to convey database internals while high level enough (so far at least) to be able to cover a wide variety of database systems. It also has its feet firmly planted in database history, and is NoSQL-koolaid free. Highly recommended.
http://shop.oreilly.com/product/0636920032175.do
Yeah, I just asked a similar question on HN and didn't get many responses, but one overwhelming book rec was "Designing Data Intensive Applications"
Basically a high level guide through modern architectures, frameworks, and database designs. So far, my takeaway has been learning what tool would be useful for certain types of data engineering, not the details of how to write code with it.
Edit - link: https://news.ycombinator.com/item?id=20417801
Another commenter mentioned it as well, but "Designing Data-Intensive Applications" by Martin Kleppmann (https://dataintensive.net/) is a _fantastic_ overview of the field and, I think, more approachable and enjoyable to read than Kimball's book. But Kimball is a classic, especially for how to do warehouse design.
I'll also make a plug for the Meltano[0] project that my colleagues are working on. The idea is to have a simple tool for extraction, loading, transformation, and analysis from common business operations sources (Salesforce, Zendesk, Netsuite, etc.). It's all open source and we're tackling many of the problems you're interested in. Definitely poke around the codebase and feel free to ping me or make an issue / ask questions.
[0] https://gitlab.com/meltano/meltano/
If you want to dig into the differences more, I highly recommend the book Designing Data Intensive Applications. Analytics systems have different requirements from normal transaction processing systems, so it naturally follows that DuckDB could specialize to fill those requirements better than SQLite. Many people bring up row stores (SQLite) vs column stores, but there are many other interesting differences and optimizations to be made, so I can see how SQLite may be leaving some niche unfilled.
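A rough way to picture the row store vs column store difference is the toy sketch below. It is plain Python with made-up order data and hypothetical function names, and it says nothing about how SQLite or DuckDB actually lay out their files. It only illustrates why an analytics-style aggregate over one field prefers a columnar layout.

```python
# Hypothetical order records: (order_id, customer, amount).
ROWS = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "alice", 7.5),
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {
        "order_id": [row[0] for row in rows],
        "customer": [row[1] for row in rows],
        "amount": [row[2] for row in rows],
    }


def total_amount_row_store(rows):
    # Row layout: every whole record is visited just to read one field.
    return sum(row[2] for row in rows)


def total_amount_column_store(columns):
    # Column layout: the aggregate scans only the one list it needs.
    return sum(columns["amount"])


COLUMNS = to_columns(ROWS)
print(total_amount_row_store(ROWS), total_amount_column_store(COLUMNS))
```

Both calls return the same total; the difference is how much unrelated data has to be read along the way, which matters once tables are wide and queries touch only a few columns. Fetching or updating one complete order is the opposite case, where the row layout is the convenient one.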
I thoroughly enjoyed "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" from Martin Kleppmann.
It's a great book that goes into pretty much all of the commonly used strategies to scaling data-intensive applications. It's not incredibly deep on any of them but it will allow you to get a great overview of the entire space. For each component, there's usually references to places where you can read and study more about them.
Designing Data Intensive Applications is a good read, especially if you’re interested in the “programming in the large” aspect of data engineering [1]. It does have a slightly theoretical taste to it, but I think you’ll find that helpful since some of the problems you listed don’t really have a good solution at the moment (versioning a model, for example).
[1] http://dataintensive.net
I found the "Designing Data-Intensive Applications" book to be a really helpful introduction to the more interesting details of distributed systems. The book provides a gentle introduction that builds intuition around these systems, and contains a plethora of links to go further down the rabbit hole.
https://dataintensive.net/
Lots of people here are shitting on MongoDB (maybe rightly so). But I think the biggest problem is that the developers deciding what kind of DB to use just do not understand these tools.
I always recommend reading Designing Data Intensive Applications as soon as you have an inkling that you will be asked to make such decisions in the near future.
I'm currently reading "Designing Data-Intensive Applications" by Martin Kleppmann and it has an interesting chapter on database storage engines, including PostgreSQL (e.g. the part on B-trees). I would definitely recommend it if you'd like a high-level overview of how things work under the hood (it's a good mix of theory and practice). It also helps in understanding why and when optimisations work.
Unreal Engine and C++. I've long worked with Unity, but as part of a new job I'm tasked with developing our Unreal plugin. Previously I only touched C++ on occasion, so I had a lot to learn — and have a lot to learn yet — of best practices, new features available in C++11, dealing with exceptions (Unreal disables them by default), and so forth. Likewise for Unreal. Like C++ itself, it's wonderfully powerful but sometimes painfully complex.
I also continued to deepen my understanding of databases and distributed systems. My favourite read this year was Designing Data-Intensive Applications, which made me more familiar with the pros and cons of the various datastores and provided a better sense of the tradeoffs that each makes. It also gave me an appreciation for the guarantees that the battle-tested relational databases provide. One of my goals for 2019 is to improve my SQL knowledge; thus far any extra effort to understand it better has paid dividends.
Experience as another commenter suggested. There are probably a few good formal sources as well, but really, a lot of this knowledge is tactile and remains uncodified.
I worked more on the infrastructure / backend side, and I've found the book Designing Data Intensive Applications really useful. Amazing mix of practice and theory, super applicable to people working on distributed systems. Not sure if there is any equivalent for frontend / product engineering.
The fundamentals don’t change fast and are what everything else is built on, so I would start there.
For example, if you don’t have a traditional CS degree, https://teachyourselfcs.com/ is a curated and effective set of books.
If you're trying to understand complex systems, I would read Designing Data Intensive Applications, which is perhaps the best and most useful technical book I have ever read, and covers the most important parts of distributed systems. A lot of what’s in the book is fundamental distributed systems material from the 70s-80s(?), plus newer things from the early 2000s built by BigTechCo.
Depends how much time you have...
Fifteen minutes: “How to Choose a Database” by Ben Anderson (https://www.ibm.com/cloud/blog/how-to-choose-a-database-on-i...)
Three hours: Jepsen analyses of distributed systems safety. Kyle tests software ranging across the database spectrum.
One week: Designing Data-Intensive Applications by Martin Kleppmann.
Disclaimer: I work with Ben and think he takes a really nice tack on this subject, though it may be orthogonal to your immediate question regarding trade-offs.
I'm also hoping to prune my reading list of redundancy.
Right now I've got:
- Design Patterns by the Gang of Four
- The DevOps Handbook by Gene Kim
- The Phoenix Project by Gene Kim
- Designing Data-intensive Applications - Martin Kleppmann
- Peopleware - Tom DeMarco
- Code Complete - Steve McConnell
- The Mythical Man Month - Frederick P Brooks Jr
- Growing Object-Oriented Software - Steve Freeman
- Domain Driven Design - Eric Evans
- The Clean Coder: A Code of Conduct - Robert C. Martin
- The Pragmatic Programmer - Andrew Hunt
- Building Evolutionary Architectures - Neal Ford
- The Design of Everyday Things - Don Norman
- Don't Make Me Think - Steve Krug
The article notes that "Designing Data-Intensive Applications" is perhaps not a typical data science book, but it is still very useful. I agree. It is a fantastic book - one of the best technical books I have read. I wrote up why, along with a summary of it, here:
https://henrikwarne.com/2019/07/27/book-review-designing-dat...
If you are interested in distributed systems, I found the book "Designing Data-Intensive Applications" by Martin Kleppmann to be a good starting point. It's not only about distributed systems; it also covers quite a bit of ground on data systems overall.
https://www.amazon.com/Designing-Data-Intensive-Applications...
One of the best books I've read about software architecture is Designing Data-Intensive Applications:
https://dataintensive.net/
Almost no fluff, very concrete explanations of various algorithms and system properties, how various real world systems embody them, and how to put those systems together to get effective real world solutions.
"Designing Data-Intensive Applications" is the best reference I've found explaining the many pitfalls of data storage and transmission. It's especially helpful if your code has ownership of any data (i.e. if you create or modify it), and an order of magnitude more useful if your organization has multiple processes touching a given piece of data.
https://dataintensive.net/
There is some overlap, but they complement each other. This one (Database Internals) has much more of a technical deep dive on storage engines, especially B-tree implementation details.
If I was mentoring someone learning this stuff, I'd advise reading Designing Data Intensive Applications first, which is certainly the best for giving the big picture, and follow up with this one for more detail on certain topics.
Given the previous dearth of books on this important subject, I think it's wonderful that we have two.
When I first started learning software I liked to 'collect' these kinds of lists as educational busywork. Now that I've been learning software engineering for over 6 years I think they're super unhelpfully overwhelming, like you say. You want _one_ short list that you actually use.
For me that is teachyourselfcs.com. It recommends only two books if you don't have "multiple years" to self-study part-time. They are: Computer Systems: A Programmer's Perspective and Designing Data-Intensive Applications. If you do have multiple years it recommends ~9 books. The OP list has almost 100 books just on software architecture.
It takes so long to read one good textbook that I'd bet 90% of software engineers haven't read more than three or four cover-to-cover. I was rare in my computing theory class for actually using the textbook and doing the exercises and I only got 2/3 through. Given my current progress rate through 'Computer Systems: A Programmer's Perspective' it will take me at least 150 hours to complete.
I'd highly recommend reading [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications...). The book gives you a great overview of designing data systems - foundational knowledge you'll need in any DE role.
The reason you can't find data engineering materials online is because real data engineering really only happens at a handful of companies - and those companies maintain this knowledge base internally and do not share it.
I noticed that you listed tools / frameworks to learn, as well as languages. Another piece of advice would be to not focus on those because they come and go (for example, Hadoop is pretty much deprecated in any DE-heavy company). What lasts is an understanding of distributed systems, distributed query engines, storage technologies, and algorithms & data structures. If you have a firm grasp on those, you won't have to start from scratch every time a new framework is introduced. You'll immediately recognize what problems the tech is solving and how they're solving it, and based on your knowledge you can connect the dots and know if that solution is what you need.
Another thing to do is watch CS186 from Berkeley in its entirety. This course is about relational databases, but will give you the foundation you need to speak the DE language.
Source: I work as a data engineer at what some would call a big company :)
I found Designing Data-Intensive Applications[0] by Martin Kleppmann to be the most eye-opening system design book that I've read. He really describes well how awful things get once you have to coordinate more than one physical machine - the number of things that can go wrong is staggering. I would say this book is as scary as Java Concurrency in Practice was - and that book was scary enough to get our company to change languages.
I think one of the best ways to learn software architecture is to have a clear view of what the challenges are, and the Kleppmann book does a really good job of providing that clear view.
[0] https://dataintensive.net/
Location: Boston, MA
Remote: Yes
Willing to relocate: Yes (Highly interested in relocating to Silicon Valley, San Francisco, or other major tech hubs/cities such as NYC; also interested in staying in the Boston area)
Technologies: Common Lisp, Python, Linux, git (some knowledge of Rust and C)
Github: github.com/Duderichy
LinkedIn: https://www.linkedin.com/in/rbibeault
Resume: see LinkedIn, and message me there, or email me for a copy.
Email: RichardMBibeault@gmail.com
I passed the triplebyte interview.
Physics major (Bachelor of Science) turned software developer. One year as a backend developer at a Common Lisp shop. Looking for a Linux-based company (macOS as workstation computer/laptops is great too!). Avid learner: I try to read and learn as much as possible. I've recently gone through Designing Data Intensive Applications and Designing Distributed Systems.
Would be glad to work at a company that uses a functional language, such as Haskell, especially if they don't expect new employees to come in already knowing the language. Also highly interested in companies using Rust, Python, or Go.
Ambitious: only been at the company a year and spent a significant amount of time this summer directing an intern, overhauled the build system the company uses internally (set up Jenkins over previous system).
Eager to learn as much as I can.
> Where is your finished data engineering book? I would like to read it.
So I need to have written a book to be able to download a PDF and see 85/100 pages are blank? I work as a data engineer and can tell you 50% of these chapter topics are not directly related to data engineering.
There are no chapters in this book even close to 10% finished. If you want a book recommendation I'm seconding the suggestion in this thread of Designing Data-Intensive Applications. I have a copy 3 feet from me at the moment.
> This is a work-in-progress kindly made freely available. Is it really fair to criticize the author for not having finished it yet?
Please look through the PDF. This isn't just not done. This is not ready to share with anyone publicly. There is no useful information in this. There are probably under 20 paragraphs of original text.
> Is it really fair to criticize the author for not having finished it yet?
No, but I'm criticizing the fact that it's posted[0]. Not that they're working on something.
I don't see the author here in this thread so my warning is to other readers. Just move on unless you're a book publisher looking for an author to pick up.
The only real criticism anyone could offer about this would be about the chapter structure, because that's all that exists. I would recommend they drop all the chapters that are a CS101 equivalent. There's no need to explain git or the OSI model or grep.
[0] edit, I want to clarify I mean just posted and dumped. If the author were here for questions or feedback I would feel differently. But with just this link as-is, there is no point in sharing.
After seeing your comment here yesterday, I started reading your book Initiative.
I am in the middle of the first exercise and have some questions.
Many of the examples in your book show people connecting these separate ideas that are reasonably understandable and applicable to the general population--the girl who recognized that many people have a fear of needles and sought to design a medical device to help, or the student who liked going to festivals and thought about aligning attendee interests with the festivals' interests and waiving attendance fees for attendees by having them volunteer at charities. It sounds like students in your class came up with relatable ideas by looking at problems in their lives that they noticed.
Right now, I'm merely a year into my career as a software engineer (having switched careers last year) and I am very interested in learning about good software engineering practices. I like seeing great CI/CD pipelines and being able to deliver very quickly. I like the sound of good DevOps practices (currently reading slowly through Accelerate by Forsgren) and I so far have really enjoyed reading books on scalability and reliability (Designing Data-Intensive Applications by Kleppmann is frequently recommended and I got a lot out of the book). I'm vaguely interested in MLOps.
I'm pretty happy being more of a cog in a machine right now so that I can see how an established company runs from the inside. I don't know that I'm immediately interested in a project that is more generalizable, the way your book examples are. But it does seem like an entrepreneurial mindset is still core to career progression since in the end a job is also about solving people's problems (where people may be inside or outside of the company). Thus I want to figure out how to use Initiative to iteratively improve my career.
I am wondering if people found success applying your Method Initiative concepts to a narrower scope in a specific technical field, and whether you could share some of those stories.
https://martin.kleppmann.com/2020/09/29/is-book-writing-wort...
Martin Kleppmann, the author behind Designing Data Intensive Applications, wrote about his experience as well, and it shows an interesting contrast with Resig's experience with digital publishing. As you can see in Martin's graph, ebook sales starting Sept 2014 were a _major_ part of his royalties due to it being available as "early release", and integration with the O'Reilly platform increased his exposure, and therefore royalties.
It's hard to gauge accurately, but it seems O'Reilly + ebook sales contributed about 2/3rds of his overall royalty returns, which is a pretty darn good result!
Of course, Kleppmann and Resig are writing about very different eras in terms of publishing, but I can't help but wonder if Resig would have a different experience if he was able to publish an equally relevant work in 2015 vs 2008.
Eric Evans' Domain-Driven Design. I've heard enough about DDD over the years that I figured I'd just go to the source. Liking it so far, I have some good takeaways but we'll see how effectively I'm able to use the ideas over the next couple years.
Martin Kleppmann's Designing Data-Intensive Applications. Picked it up based on the frequent praise it receives here; haven't gotten far yet. I have some project ideas (for personal and professional projects) that could benefit from reading through it.
Martin Fowler's 2018 update to Refactoring. I read the original one a long time ago. In context, we have a work lunch & learn series and I'm interested in doing some presentations on the topic of refactoring (why, how, and when in particular) so it seemed appropriate to refresh my memory on some specific terminology from the book as well as to see if it's an appropriate book to recommend to colleagues. My recollection of the first edition is that I'd recommend it to colleagues, but it's been so long I'd rather read it once more before actually recommending it.
I reread Robert C. Martin's Clean Code based on some recent discussion here where it was rather strongly dismissed by a fair number of people. I didn't recall it being bad, my reread confirmed it is not, in fact, bad. Java-heavy, which is now an unpopular style of OOP, but otherwise a very good book. I'd still recommend it to junior colleagues paired with some caveats about avoiding seeing the world in black & white. There is no singular Way of Programming, but learn various ways and find what works for you and your team.
There are some more, but it's almost 5am and I haven't been able to sleep so I don't recall everything that's in the book stack or ebook queue. These are the ones I'm most interested in at present.
Sorry for going offtopic
"How do you make sure that a celebrity's tweet reaches all of her followers in less than 3 seconds?"
Looks like the interviewer has at least read the first chapter of "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems"
If you read and understand the above book, I am reasonably sure that you can crack most system design interviews.
I've been thoroughly enjoying "Designing Data-Intensive Applications" by Martin Kleppmann. It primarily deals with the current state of storing data (databases, etc) starting with storing data on one machine and expanding to distributed architectures...but most importantly it goes over the trade-offs between the various approaches. It is at a high level because of the amount of ground it covers, but it contains a ton of references to dig in deeper if you want to know more about a specific topic.
That this book and the very similar "Grokking the System Design Interview" (I went through both) get accolades just shows the poor resources we have.
What we need is more "Designing Data Intensive Applications", adapted to interviews.
Just a couple of quick comments: the "web crawler" scenario suggests a breadth-first search, which is OK (compared to depth-first search) but not good enough; web links in general do not form a DAG, so you can get into a loop. As another comment, in neither of these two resources is there a single estimate that I can remember of how many servers you need given requests/bandwidth etc.; the only calculations are about data volume. They also assume a collaborative interviewer, which has never happened in my experience. I think neither of these two resources by itself would get you an L5 or even do well as L4 at FAANG (please somebody correct me); they are very basic (maybe I'm "too advanced" heh).
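For what it's worth, the loop problem is usually handled with a visited set. A minimal sketch, assuming a made-up in-memory link graph (`LINKS`) in place of real HTTP fetches, so it ignores politeness, URL normalization, and everything else a real crawler needs:

```python
from collections import deque

# Hypothetical link graph standing in for real page fetches.
# Note the cycle: a -> b -> c -> a.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

def crawl(start, get_links):
    """Breadth-first crawl that visits each page at most once."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in get_links(page):
            if target not in visited:  # guard against cycles
                visited.add(target)
                queue.append(target)
    return order

crawl_order = crawl("a", lambda page: LINKS.get(page, []))
```

Without the `visited` check, the a -> b -> c -> a cycle would keep the queue from ever draining.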
As a side-comment, don't limit yourself to recently released programming books! I used to do that when reading about web stuff, since most older texts were outdated. But there's many topics for which older books are far better sources of information! The 80s and 90s had tons of improvements and innovations which are still highly relevant. Don't fall into the trap of thinking that something is better just because it's new.
As for 2017 CS books, I'd second Designing Data‑Intensive Applications.
If we count updates, the latest revision of The Swift Programming Language is solid. My forays into Swift have been enjoyable.
This last one is kinda cheating since it's continuously updated, but I'd highly suggest browsing through the HTML Living Standard [0] and reading any parts that grab your attention.
EDIT: Looking through whatwg's news, I found out there's a developer edition [1] of the spec which strips the stuff that's only relevant to browser developers.
[0] https://html.spec.whatwg.org/multipage/
[1] https://html.spec.whatwg.org/dev/
I haven't done much in Dask and never heard of Ray, so might not be applicable directly.
But a general book about distributed systems that I highly recommend and one you'll see often referred to on HN is Designing Data-Intensive Applications. Seriously great book and one I often go back to for fundamentals about distributed systems.
I can definitely recommend e-readers for regular books (it's great when travelling) but for reference / textbooks I'm not sure I could.
I have a Kindle Paperwhite from (I think) 2014 and it still holds up quite well. I recently read the Kleppmann book (Designing Data-Intensive Applications) on it and it was fine, but I would have liked the ability to scribble notes and put in physical bookmarks.
From a technical perspective, everything worked quite well but I'm not sure I would want to read the UEFI spec on it.
I read the DDIA book in my own time and have watched some of the internal Amazon tech talks mentioned above during work time with other engineers.
I think either way it’s important to realise that unless you’re superhuman you’re not really solidifying the knowledge, you’re making yourself aware of the concepts. If a use case comes up that requires it you have a better chance of recognising that and then going back and getting a deeper understanding of how to apply it.
I only absorbed a small fraction of DDIA but I still think reading it was invaluable.
To anyone cutting their distributed-systems teeth on Kleppmann's excellent Designing Data-Intensive Applications: each chapter ends with an essential references section (which includes multiple citations to Abadi and Pavlo!).
The book chapters do a solid job laying the groundwork for those papers. The depth is in those references. Read them if you can!
My recommendations are not exactly what you ask but what I suspect you'll enjoy if you're asking this.
(1) "Understanding Computation" by Tom Stuart.
Not "fundamental" in the sense of a deep textbook, but a very approachable intro for programmers to a big chunk of CS, explaining deep ideas about languages using rigorous, clean, working code (in Ruby, no prior knowledge needed).
I especially loved the first few chapters about what it means to define a programming language and various kinds of formal semantics.
(2) Designing Data-Intensive Applications, Martin Kleppmann. This gives you a phenomenally good survey of concepts and practice of distributed systems. This is more software engineering than pure CS, but in my view you can't approach the field of distributed systems without blending both anyway.
(3) POODR (Practical Object-Oriented Design in Ruby), by Sandi Metz. This is 100% software engineering, where there is no single definition of "foundational", but many people who read this swear by it. It's a remarkably thin but lucid distillation of ideas that were "in the air" but Sandi nailed them down. An important thesis is that good code is not an aesthetic judgement of how it _now_ looks, but an objective question of how easy it will be to _change in the future_. Not Ruby-specific at all, but it teaches the original Smalltalk "message-passing" view of OOP, which, for people who only learnt the statically-typed Java/C++ view of OOP, is a fundamental idea they're missing out on.
Finally, not a book, but "the morning paper" https://blog.acolyer.org/ is an excellent "return on your time" if you want to sample academic papers, both classic foundational ones and cutting-edge ones.
You seem to think that this is just a service that reads and writes data like any CRUD app.
The hard problems stem from how the system deals with failures and how the system propagates writes across the replicas while meeting latency and consistency SLAs. On top of that, the system needs to be built in a way that it can be maintained by many developers, each working on a small piece of the system without knowing the ins and outs of the system as a whole. In addition, when the system fails, debugging and mitigation need to be parallelizable across many developers so that availability SLAs can be maintained. You can read about this in “Designing Data-Intensive Applications” by Martin Kleppmann, where he discusses the complexity involved in building distributed systems.
It's become a fairly widely known concept in data engineering circles, expounded upon in Martin Kleppmann's Designing Data Intensive Applications book. (Buy this book if you want to get up to speed on modern ideas around distributed systems and data architecture.)
This became popular as people were trying to figure out how to use Kafka as a persisted log store that could be "replayed" into various other databases. This meant that you could potentially stream all the deltas (well, more accurately the operations that create the delta, e.g. insert, update, delete) in your data -- through a mechanism called Change-Data-Capture (CDC) [1] -- into a single platform (Kafka) and consistently replicate that data into SQL databases, NoSQL databases, object stores, etc. Because these are deltas, this lets you reconstruct your data at any point in history on any kind of back end database or storage (it’s database agnostic).
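A rough sketch of the "replay the deltas" idea in plain Python. The log format, the `replay` function and the keys are all made up for illustration; a real setup would read the operations from Kafka (or similar) rather than from a list:

```python
# Hypothetical change log, in the order the operations happened:
# each entry is (operation, key, value).
CHANGE_LOG = [
    ("insert", "user:1", {"name": "Ada"}),
    ("insert", "user:2", {"name": "Bob"}),
    ("update", "user:1", {"name": "Ada L."}),
    ("delete", "user:2", None),
]

def replay(log, upto=None):
    """Rebuild key-value state by applying the first `upto` changes in order.

    With upto=None the whole log is applied.
    """
    state = {}
    for op, key, value in log[:upto]:
        if op in ("insert", "update"):
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state

state_after_two = replay(CHANGE_LOG, upto=2)  # state at an earlier point
current_state = replay(CHANGE_LOG)            # state after the full log
```

The same log can be replayed into any target store, and stopping early gives you the data as it was at that point in history.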
Event sourcing to my understanding is a term used among DDD practitioners and Martin Fowler disciples but with a different nuance. This article explains what it is:
http://cqrs.wikidot.com/doc:event-sourcing
[1] Debezium is an open-source CDC tool for common open-source databases. Side note: A valid (but potentially expensive) way of implementing CDC is by defining database triggers in your SQL database.
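As an illustration of the trigger approach from the side note, here is a minimal sketch using Python's built-in sqlite3 module. The `users` and `users_changes` tables and the trigger names are hypothetical, and this is not how Debezium works; it only shows triggers appending every change to a log table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical change-log table that the triggers append to.
CREATE TABLE users_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    id INTEGER,
    name TEXT
);

CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO users_changes (op, id, name) VALUES ('insert', NEW.id, NEW.name);
END;

CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT INTO users_changes (op, id, name) VALUES ('update', NEW.id, NEW.name);
END;

CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    INSERT INTO users_changes (op, id, name) VALUES ('delete', OLD.id, NULL);
END;
""")

# Ordinary writes; the triggers record each one as a side effect.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE users SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")

changes = conn.execute(
    "SELECT op, id, name FROM users_changes ORDER BY seq"
).fetchall()
```

Every write now costs an extra insert into the log table, which is the "potentially expensive" part.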
I'm currently listening to "Designing Data-Intensive Applications" [1] on my commute and it really does work well as an Audio Book (I can attest to the positive reviews). Highly recommended if you are dealing with any requirements in the space (scale, replication, consistency, SQL vs NoSQL, etc.)
[1] https://www.audible.com/pd/Designing-Data-Intensive-Applicat...
> - Google, FB, Amazon basically wrote the book on distributed systems (Read Designing Data Intensive Applications), both from a research perspective and a very well architected, open source solution
Most of the science behind these things is actually older. They industrialized it, removed the kinks, built upon actual experience, all of which is extremely precious, but I don't think it's as groundbreaking as people believe.
> - Google, FB have profoundly impacted front end development with cutting edge Javascript runtimes and open source front end frameworks
If you're talking about JITs, that's gradual improvements on prior work on JITs (started during the 60s, ignored by industry until Sun picked it during the 90s... for an academic project). Again, very useful, but not necessarily groundbreaking.
> - Amazon, Google, Microsoft have basically invented/popularized a way to do computing(Cloud), server management that has enabled tiny tech companies to become giants by outsourcing IT infrastructure
Again, industrialization on prior academic work (e.g. virtualization, distributed component-based architectures, etc.)
> - Apple/Google have created devices, OSs, and software that is nearly impossible to live without these days, additionally creating platforms for millions of developers to make a living on(App Store)
> - Amazon has set the bar pretty high for automation in operations and made 2 day shipping a thing we expect from everyone
Mmmmh... I was talking of "scientific progress", you seem to be talking of something different :)
If you recall, my point was that it's very hard to measure "scientific progress" by looking at industry, because industrialization typically happens decades after the actual discoveries/inventions. I think your point is that "industrial progress" may be good, which I'm not debating :)
Yes, chapter 2 in Martin Kleppmann's Designing Data-Intensive Applications.
Check out "Designing Data-Intensive Applications" by Martin Kleppmann - http://shop.oreilly.com/product/0636920032175.do. It's still a work in progress, but covers a big chunk of distributed systems material, is up to date and has good reviews. You can read 10 of the 12 chapters via Safari Books Online.
The downside is that I pre-ordered the book in November, expecting it in April and it now shows November of this year as the release date on Amazon. I'd be surprised to get it this year at all. Haven't found other books of similar scope and recency though, so I guess I'll wait some more.
Good list, but is it still being actively updated? Not having Kleppmann’s seminal Designing Data-Intensive Applications (2017) on it would indicate no.
Alex Petrov’s Database Internals: A Deep Dive Into How Distributed Data Systems Work (2019) is another essential recent reference that should be here. Not as broad as Kleppmann but dives a lot deeper into certain topics.
You may try this book. It is one of my favourites.
[Designing Data-Intensive Applications by Martin Kleppmann]
Tim O'Reilly once said "Obscurity is a far greater threat to authors and creative artists than piracy". I discovered and purchased almost every Rosenfeld Media book from O'Reilly.
After O'Reilly moved to DRM-free books, their 2009 sales went up by 104% http://toc.oreilly.com/2010/01/2009-oreilly-ebook-revenue-up...
In other interviews, he seemed confident that DRM wasn't worth it
https://www.forbes.com/forbes/2011/0411/focus-tim-oreilly-me...
Perhaps some part of the equation has changed since then. I'm looking forward to a deeper analysis of the business reasons for this.
I'm also interested to hear what more authors think - I wonder how many agree with Martin Kleppmann (Designing Data Intensive Applications) https://twitter.com/martinkl/status/880336943980085248
This independence day weekend there were a lot of sales, so I purchased:
* "Programming Clojure, Third Edition" from pragprog (30% off sale)
* The entire collection of "Enthusiast's Guide to ..." from rockynook (each for $10)
* "The Quick Python Book 3e", "Serverless Architectures on AWS", "Event Streams in Action", "Get Programming with Haskell" from Manning (50% off)
These sales are the only way I can afford the volume I read. Some of that money would have gone to O'Reilly authors, but they deleted my full cart with $100 worth of stuff before I could purchase!
EDIT: The O'Reilly catalog seemed large & redundant with publishers (Packt) offering the same materials on their sites. Some like Wiley / MKP only offered very few items from their catalogs. Others like Rosenfeld / rockynook / No Starch now provide DRM-free options directly from their sites. I'm hoping at least O'Reilly reconsiders selling their Animal books again.
The easiest path is to read Designing Data Intensive Applications by Martin Kleppmann. It builds from the ground up, and by the end (or after a second read) you will have a deep understanding. The book hits on all of the tech you mentioned except Kube. For that, read one of the original Google papers on cluster management systems, e.g. Omega.
Location: Washington, DC | Fairfax, VA
Remote: Yes (Have experience working remotely)
Willing to relocate: Yes
Technologies: Python, JavaScript, Linux, AWS, MySQL, PHP, Pandas, Selenium, Ansible, etc.
Résumé/CV: Available upon request.
Email: dctechj at gmail
I enjoy working with other people, and I'm good at developing practical solutions to problems. I am capable of quickly learning new tech on my own time, or absorbing knowledge by working with others. I've both worked remotely and as a member of a team. My work experience is focused in full stack web development and running IT infrastructure. I am comfortable outside of this range and have worked on systems ranging from USB duplication automation, warehouse inventory systems, and 'complex' proprietary databases.
I am open to entry-level roles, but I could be a good fit for roles where my experience applies. I took a break to complete my degree a few years back, and have a programming resume gap that can be discussed.
Current personal projects:
Built a server out of off-lease enterprise gear and am using it as my own virtual machine server. Working on automating the deployment of any programs or services I host locally.
Developing a real-time general purpose notification system. Reading through "Designing Data-Intensive Applications."
Designing Data Intensive Applications by Martin Kleppmann
My planned summer reading list:
- High Performance Browser Networking by Ilya Grigorik
- Refactoring: Improving the Design of Existing Code by Martin Fowler
- Designing Data-Intensive Applications by Martin Kleppmann
How did you find the latter? I'm a FE developer, so I'm quite keen to get hands-on with data.
I'll echo what others are saying: Designing Data-Intensive Applications is a superb book which you absolutely should read.
As for my second suggestion, I'll tell you one of the ways in which I go about researching certain kinds of programming topics. I pay for a Safari Books Online subscription [0], which lets me browse a massive amount of technical books without restrictions. Once I figure out the appropriate keywords, I'll perform a search and open all the relevant books in separate tabs. Then I filter the list down by looking through the index, or reading through a couple pages, to see if it actually covers what I'm looking for. By the time I've prepared this reduced list I usually have an idea of which books seem most interesting, and those are usually the ones I start with. Then it's just a matter of working my way through the list until satisfied. It has been my experience that most technical books are not worth reading cover-to-cover, so I just read through the few relevant chapters and move on. As with all things, there's definitely exceptions; I'd actually consider Designing Data-Intensive Applications one such example.
If you get a card from your local library you might also be able to get access to Safari Books Online for free, as well as tons of other resources. Although with my library card I only get access to a limited subset of their books, instead of the whole collection like with the paid subscription.
Another option, if you can't afford to spend that much money, is to just pirate a bunch of books or look em up on Google Books [1] in order to identify the ones which interest you the most, and then buy the ones that look useful, or try borrowing em from your local library (most likely through interlibrary loans). The market for technical books isn't very big and great authors are rare, so I think it's incredibly important that they be adequately compensated for their hard work, though. If you really can't afford to buy the books initially, be sure to at least keep track of the list so you can make the purchase after you've gotten your new job.
[0] https://www.safaribooksonline.com/
[1] https://books.google.com/
Agreed. There's a serious problem of information density in content these days. Most publisher-driven content is almost intentionally sparse. It takes tremendous expertise to weld information with context in a way that doesn't feel hollow and commercial.
Two books I've come across recently that do well to circumvent that problem: Sapiens by Yuval Harari and Designing Data-Intensive Applications by Martin Kleppmann. In both cases, you get a sense of the author distilling a lifetime's worth of knowledge and expertise into a form that seems hopelessly condensed, and in fact provokes you into further study. That's the mark of excellent exposition in my opinion.
IMO a better approach would be to read Designing Data Intensive Applications (mentioned a million times on HN), which is more like a high-level map of the field. The references in DDIA are also a goldmine of information.
You don't have to read DDIA front to back. Just picking a topic (for instance "Distributed Transactions") is enough to get you started building an intuition about these issues.
Assuming you have some experience building simple single-node systems:
- Read Designing Data Intensive Applications. As others have said, it's a gem of a book, very readable, and it covers a lot of ground. It should answer both of your questions. Take the time to read it, take notes, and you should be well set. If you need to dive deeper into specific topics, each chapter links to several resources.
- Read some classic papers (Dynamo, Spanner, GFS). Some of these are readable while some are not-so-readable, but it'll be useful to get a sense of what problems they solve and where they fit in. You may not understand all of the terminology but that's fine.
That should give you a strong foundation that you can build upon. Beyond that, just build some systems, experiment with the ideas that you're learning. You cannot replace that experience with any amount of reading, so build something, make mistakes, struggle with implementation, and you'll reinforce what you've learned.
Backend is vast, and this helps you build a general sense of the topic. When you find a topic that you're really interested in (say stream processing, storage systems, or anything else), you can dive into that specific topic with some extra resources.
> I understand Postgres the best, and would love to know why these and others exist, where do they fit in, why are they better over PSQL and what for, and if they are cloud only what's their alternatives....It seems all of them just store data, which PSQL does too, so what's the difference?
A lot of that depends on the way you're building a system, the amount of data you're going to store, query patterns, etc. In most cases, there are tradeoffs that you'll have to understand and account for.
For example, a lot of column-oriented databases are better suited for analytics workloads. One of the reasons for that is their storage format (as the name says, columns rather than rows). Some of the systems you mentioned are built for search; some are built from the ground up to allow easier horizontal scaling, etc.
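To make the row-versus-column point concrete, here is a toy sketch in plain Python. The table, field names, and both helper functions are made up for illustration; real column stores add compression, vectorized execution, and on-disk formats that this ignores.

```python
# Toy comparison of row-oriented and column-oriented layouts (illustrative only).

# Row layout: each record is stored together, which suits "fetch order 2".
rows = [
    {"order_id": 1, "customer": "ann", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "ann", "amount": 7.5},
]

# Column layout: all values of one column are stored together.
columns = {
    "order_id": [row["order_id"] for row in rows],
    "customer": [row["customer"] for row in rows],
    "amount": [row["amount"] for row in rows],
}


def total_amount_row_store(records):
    # Every whole record is visited even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_amount_column_store(cols):
    # Only the "amount" column is touched; the other columns are never read.
    return sum(cols["amount"])


row_total = total_amount_row_store(rows)
column_total = total_amount_column_store(columns)
```

Both functions return the same total; the difference is how much data has to be scanned to get it, which is what matters once the table has hundreds of columns and billions of rows.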
DDIA is a fantastic 2017 book, but a few things have gone from must-do to avoid, or at least think twice before doing. We've seen big changes in client-side frameworks such as Angular/React, NoSQL is less and less seen as the first thing to reach for, scalability has become both more complex and more achievable with new cloud features, and client-server best practices have evolved a bit too. Things like message queues or Kafka are not as sexy.
Is there a good resource to read on top of DDIA that reflects some recent changes in system design?
Great resource, thanks for sharing it! I will dig deeper into the resources linked here as there's a lot I have never seen before. The main topics are more or less exactly what I've found to be key in this space in the last 2 months trying to wrap my head around data engineering in my new job.
What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.
Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.
So far I have read:
- Left of Bang.
- The Obstacle is the way.
- The Daily Stoic.
- High-Output Management.
- The Effective Engineer.
- Managing Humans.
- Introducing Go.
Currently going through "Designing Data Intensive Applications" and some other data-related free ebooks from O'Reilly.
Up next on my list for the rest of the year:
- Hadoop: The Definitive Guide.
- The Manager's Path.
- Anti-Fragile.
- A Guide to the Good Life.
- The Denial of Death.
- Man's Search for Meaning.
EDIT: list formatting.
1. Get the book 'Designing Data-Intensive Applications', read about the causes of scaling problems, and some techniques to deal with them.
2. Take something you've built, and imagine that it faces a specific scaling problem. For example, let's say the most complex thing you've built is a to-do list app. Maybe you can imagine some things that could make it grind to a halt when running as a monolith on a single server. e.g. what if a single user has 10,000 or 100,000 TODOs? Fake that situation (just insert many many fake TODOs into the database). Now see if you've created a scaling problem. If not, try a larger number, or a different dimension (e.g. what if there are many many simultaneous requests?) until you create a problem.
3. Look in that book again and pick a couple of ways you could solve the problem (database sharding across multiple servers? caching? pagination in your API calls?) and implement one of them.
(If you have difficulty at step 2, you could try running your app on a virtual machine with very little RAM and disk, but imagine that's the largest machine type available.)
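As a rough sketch of steps 2 and 3, assuming a to-do app backed by SQLite: the `todos` table, its columns, and the `make_db`/`fetch_page` helpers below are all hypothetical. It fakes 100,000 TODOs for one user and then serves them with keyset pagination instead of returning everything in one response.

```python
import sqlite3


def make_db(todo_count):
    """Create an in-memory database and insert todo_count fake TODOs for user 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE todos (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO todos (user_id, title) VALUES (?, ?)",
        ((1, f"todo {i}") for i in range(todo_count)),
    )
    conn.commit()
    return conn


def fetch_page(conn, user_id, after_id=0, page_size=50):
    """Keyset pagination: return up to page_size rows with id greater than after_id."""
    return conn.execute(
        "SELECT id, title FROM todos "
        "WHERE user_id = ? AND id > ? ORDER BY id LIMIT ?",
        (user_id, after_id, page_size),
    ).fetchall()


conn = make_db(100_000)
first_page = fetch_page(conn, user_id=1)
# The client passes back the last id it saw to get the next page.
second_page = fetch_page(conn, user_id=1, after_id=first_page[-1][0])
total = conn.execute("SELECT COUNT(*) FROM todos").fetchone()[0]
```

Whether pagination is the right fix depends on where the app actually falls over; the point of the exercise is to measure first and then pick one technique from the book.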
I have had similar thoughts about big tech companies, but lately I have started to realize the progress they have brought.
To name a few:
- Google, FB, Amazon basically wrote the book on distributed systems (read Designing Data Intensive Applications), both from a research perspective and through very well architected open source solutions
- Google, FB have profoundly impacted front end development with cutting edge Javascript runtimes and open source front end frameworks
- Amazon, Google, Microsoft have basically invented/popularized a way to do computing (cloud) and server management that has enabled tiny tech companies to become giants by outsourcing IT infrastructure
- Apple/Google have created devices, OSs, and software that are nearly impossible to live without these days, additionally creating platforms for millions of developers to make a living on (App Store)
- Amazon has set the bar pretty high for automation in operations and made 2 day shipping a thing we expect from everyone
There are many other things I can’t think of right now, but long story short most of the companies you listed do have crappy parts of their business, but have also made incredible platforms that 3rd parties can leverage to make a ton of money.
I caveat all of this by saying that there are some practices that I don’t agree with at all of those firms, but by and large they have gotten so big because they are platforms.
* Fooled By Randomness (NN Taleb): Taleb is a complicated personality, but this book gave me a heuristic for thinking about long-tails and uncertain events that I could never have derived myself from a probability textbook.
* Designing Data Intensive Applications (M Kleppmann): Provided a first-principles approach for thinking about the design of modern large-scale data infrastructure. It's not just about assembling different technologies -- there are principles behind how data moves and transforms that transcend current technology, and DDIA is an articulation of those principles. After reading this, I began to notice general patterns in data infrastructure, which helped me quickly grasp how new technologies worked. (most are variations on the same principles)
* Introduction to Statistical Learning (James et al) and Applied Predictive Modeling (Kuhn et al). These two books gave me a grand sweep of predictive modeling methods pre-deep learning, methods which continue to be useful and applicable to a wider variety of problem contexts than AI/Deep Learning. (neural networks aren't appropriate for huge classes of problems)
* High Output Management (A Grove): oft-recommended book by former Intel CEO Andy Grove on how middle management in large corporations actually works, from promotions to meetings (as a unit of work). This was my guide to interpreting my experiences when I joined a large corporation and boy was it accurate. It gave me a language and a framework for thinking about what was happening around me. I heard this was 1 of 2 books Tobi Luetke read to understand management when he went from being a technical person to CEO of Shopify (the other book being Cialdini's Influence). The Hard Thing About Hard Things (B Horowitz) is a different take that is also worth a read to understand the hidden--but intentional--managerial design of a modern tech company. These are some of the very few books written by practitioners--rather than management gurus--that I've found to track pretty closely with my own real life experiences.
The Redlock algorithm suggested for use with Redis has been the subject of some criticism:
https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
Kleppmann is the author of Designing Data-Intensive Applications and generally someone whose opinion I trust in such things.
I love Redis and use it extensively at $dayjob, but I stick to Consul for managing distributed locks. Consul was built from the ground up to handle such things; Redis handles it as a bit more of an afterthought / consequence of other features.
[prioritization] Effective engineer - lau
[systems] Designing data intensive applications - kleppmann
[programming] SICP - sussman & abelson
Last one is an old scheme book. No other book (that I read) can even hold a candle to this one, in terms of actually developing my thought process around abstraction & composition of ideas in code. Things that library authors often need to deal with.
For example in React - what are the right concepts that are powerful enough to represent a dynamic website & how should they compose together.
Designing Data-Intensive Applications by Martin Kleppmann.
Here are some books that have stood out for me. They cover some of the technologies that I either have to work with, or am personally interested in.
Effective Java by Joshua Bloch
Practical, actionable guidelines. The first edition was the best, the second was diluted somewhat by having to cover generics, in the third he admits that he doesn't really use Java much anymore... Despite that, it's well-written and still a good book.
The Linux Programming Interface by Michael Kerrisk
Covers some of the history of the Linux/Unix API, describes it in detail, has plenty of examples, compares different APIs that do similar things so you can make an informed choice (e.g. System V vs. POSIX message queues).
If any book in this list stands out for me, it's probably this one. It might be partly due to the surprise factor of how enjoyable and well-written a 1000+ page, near-reference book is.
Programming in Haskell by Graham Hutton
An intro to the language and how to approach problem solving from a functional P.O.V. Not as comprehensive as some other intros to Haskell, but Hutton is a good writer and educator, making it a good read.
Designing Data-Intensive Applications by Martin Kleppmann
Provides an overview of a number of topics related to databases, distributed systems, consensus, etc. Lots of references (many of them online) if you like that in a book. Enjoyable to read.
Parallel and Concurrent Programming in Haskell by Simon Marlow
Probably a must-read if you're into Haskell; probably too esoteric if you're not... Well written.
Type-Driven Development with Idris by Edwin Brady
Describes a programming language similar to Haskell, but strict by default and with dependent types designed-in from the start. Also describes techniques for leveraging the type system to construct functions (the type-driven part of the title). Well written.
Hacker's Delight by Henry S. Warren
Low-level bit twiddling. 'Nuff said.
Designing Data‑Intensive Applications by Martin Kleppmann
I agree that the Pragmatic Programmer is well done in its audio form, and I also agree that Grokking Algorithms is terrible.
I am currently listening to Designing Data Intensive Applications and it's phenomenally done - the author clearly worked with the narrator to adapt the content to audio format, and the narrator seems to have experience or familiarity with the subject because he pronounces the technical jargon very naturally.
I hope to find other software related audiobooks as good as DDIA is.
Martin Kleppmann's book Designing Data-Intensive Applications has two chapters where it talks about consistency (one about replication and another one about consensus and consistency). That book's reference sections for each chapter are goldmines.
The work is never "completely done" because most of the known or reliable solutions involve choosing tradeoffs with scalability, speed or data locality, so you can always go bespoke to optimize for the current business needs, and when they change you may need to change protocols or algorithms again.
As a self-taught developer, I used to think that some of the theoretical elements were overhyped. I can build iOS apps that work, and I did just that for the last 2-3 years. However, many of the programs that I wrote have not been as easy to maintain as I would like and some difficult-to-fix bugs have popped up over time, both of which are due to a lack of deeper understanding of CS fundamentals. Last year I started interviewing and was ridiculed at one company in particular for a lack of CS knowledge. Afterwards I started exploring a lot of the CS concepts listed in this link and I have since found numerous ways to improve my code quality and have a better understanding of how CS best practices came to be. I also used to think that algorithms and data structures were relatively useless for an iOS developer, and I was able to do the job without them, thus proving my point. However, after gaining a better understanding, it quickly becomes clear that things like view hierarchies are simply trees and understanding ways to traverse these hierarchies can lead to much cleaner code. With the open sourcing of Swift, I also became more interested in understanding the language, but a lot of the language design decisions didn't make sense to me until I gained a better understanding of CS fundamentals. I have found the programming languages course on Coursera [1] to be particularly useful, and have also greatly enjoyed the book Designing Data Intensive Applications [2]. There's also a great video from this year's WWDC that really inspires algorithm study and use in everyday applications [3].
[1] https://www.coursera.org/learn/programming-languages
[2] https://www.amazon.com/Designing-Data-Intensive-Applications...
[3] https://developer.apple.com/videos/play/wwdc2018/223/
I don't have a list because I usually look around on here for my next book to read. I see a lot of similar titles after reading through this thread. Some I've read, some I've started and got bored with and a lot I've never heard of.
I just finished reading Siddhartha, which is a really short book, but I'd like to read more that are similar to this, any suggestions?
I see Designing Data-Intensive Applications quite a bit in this thread, might have a go at that one too.
Currently I'm reading https://en.wikipedia.org/wiki/The_History_of_the_Standard_Oi... which is fascinating!
Like others have said, it is just one tool in the tool box.
We used Kafka for event-driven microservices quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within the topic. Same as you would expect from any other public-facing API. We also didn't allow multiplexing the topic with multiple schemas. This wasn’t just because it made my life easier. A large portion of the topics we had went on to become analytical tables in Hive. Breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old. This puts a lot of onus on the producer, so we tried to make tools to help. We had a central schema registry that paired schemas with their topics and showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice though, we never got much pushback on the no-breaking-changes rule.
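For illustration only, a "no breaking changes within a topic" rule could be checked mechanically along these lines. The schema representation (a dict of field name to `{"type": ..., "default": ...}`), the `breaking_changes` function, and the three rules are assumptions made for this sketch; they are not Uber's tooling, and real formats such as Avro or Protobuf have more detailed compatibility rules.

```python
def breaking_changes(old_schema, new_schema):
    """Return a list of reasons why new_schema would break existing consumers.

    Simplified, hypothetical rules: removing a field, changing a field's type,
    or adding a field without a default all count as breaking.
    """
    problems = []
    for name, old_field in old_schema.items():
        new_field = new_schema.get(name)
        if new_field is None:
            problems.append(f"field removed: {name}")
        elif new_field["type"] != old_field["type"]:
            problems.append(f"type changed: {name}")
    for name, new_field in new_schema.items():
        if name not in old_schema and "default" not in new_field:
            problems.append(f"new field without default: {name}")
    return problems


v1 = {"trip_id": {"type": "string"}, "fare": {"type": "double"}}

# Adding an optional field with a default is accepted.
v2 = {**v1, "currency": {"type": "string", "default": "USD"}}

# Changing a type, dropping a field, and adding a required field are rejected.
v3 = {"trip_id": {"type": "long"}, "tip": {"type": "double"}}

compatible_result = breaking_changes(v1, v2)
incompatible_result = breaking_changes(v1, v3)
```

A registry could run a check like this at publish time and, when the list is non-empty, point the producer at a new topic instead.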
DLQ practices were decided by teams based on need; there are too many things to consider to make blanket rules. Where in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have overwritten this event? Are you paying for some API, so that auto-retries churning away in your DLQ are going to cost you a ton of money? Sometimes you may not even want a DLQ, you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.
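A minimal in-memory sketch of the two failure policies, with no Kafka client involved. The `consume` function, the `PoisonPill` exception, and the event shape are hypothetical names for this example; a real consumer would also persist offsets and the set of already-seen ids.

```python
class PoisonPill(Exception):
    """Raised to halt consumption at the first event that cannot be processed."""


def consume(events, handler, dead_letters=None):
    """Process events in order and return the ids that were handled.

    If dead_letters is a list, failed events are parked there and processing
    continues. If it is None, the first failure stops the consumer instead.
    """
    handled_ids = []
    seen = set()
    for event in events:
        if event["id"] in seen:
            continue  # duplicate delivery: skipping keeps the consumer idempotent
        try:
            handler(event)
        except Exception as error:
            if dead_letters is None:
                raise PoisonPill(f"stopped at event {event['id']}") from error
            dead_letters.append(event)
            continue
        seen.add(event["id"])
        handled_ids.append(event["id"])
    return handled_ids


def handler(event):
    # Stand-in for real processing; rejects events with no payload.
    if event["payload"] is None:
        raise ValueError("missing payload")


events = [
    {"id": 1, "payload": "a"},
    {"id": 2, "payload": None},
    {"id": 1, "payload": "a"},  # redelivered duplicate
    {"id": 3, "payload": "c"},
]

dlq = []
handled = consume(events, handler, dead_letters=dlq)
```

Calling `consume(events, handler)` without a dead-letter list raises `PoisonPill` at event 2, which is the "stop and look at it now" behaviour described above.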
I hope one of the books you are talking about is Designing Data Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.
Architecture is also really useful in another regard: making the right (or at least "good enough") decisions on things that will be really painful to change later.
Think, for example, of the choice of programming language or the DB technology used.
Architecture must strike the correct balance between vague guidelines and over-specification upfront. It should provide a framework which ensures homogeneity between components without restricting the teams/developers excessively.
On this subject, I found this presentation (by Stefan Tilkov): https://www.youtube.com/watch?v=PzEox3szeRc
Also, I'm currently reading "Designing Data-Intensive Applications", which is quite interesting and full of insight on the architecture trade-offs for data management and querying.
Probably you interpreted the word "design" with its narrower meaning of visual(ly-oriented) design, rather than, say, architecting data-intensive applications.
As with another poster, Edward Tufte's books came to mind - though it's about visual presentation of information, not user interface/experience design.
I've also felt that there's an unmet demand for books that provide a thorough overview of UI/UX design patterns, especially the way this book (Designing Data-Intensive Applications) does for its domain.
Similar to this, upon reading the often recommended book Designing Data Intensive Applications, I realized that the concept of a system, and the optimization of one, is simply a combination of reads/writes, storage, I/O, and bottlenecks. It's the layers upon layers of work that's been done on it that makes it work the way it does now.
Question for people in 'Mature enough' Orgs who actually use this advanced tech/concept....
Do you watch videos like these on your employer's time? And then do you have like 5-6 other colleagues also watch the video? Or perhaps book a meeting room (virtually) to all watch the talk, make popcorn, etc.?
And then once it's done, do you guys have a day to come back around and discuss its concepts, like a reading group? (What if stuff is burning: PRs, SRE tickets, milestones? Is time allocated for this stuff? Is keeping abreast of this knowledge the lowest of priorities?) I started reading his DDIA book and I can't fathom solidifying all those concepts without discussing them thoroughly with other engineers over at least a month or so.
Basically, curious how orgs 'ingest' these large ideas into their software eng knowledge practices, are there initiatives? etc... or are you just banking on some engineer to study this stuff on their own time at home.
If you want a broad overview of how RDBMS work under the hood, "Designing Data Intensive Applications" would be a good conceptual start.
This could be followed by database internals books like "Expert Oracle Database Architecture" and "Pro SQL Server Internals", irrespective of whether you use these databases or not, you will learn a ton of stuff about database internals and system design.
If you are interested in database schema design etc, then look up "The Database Model Resource Book" Volumes 1,2,3.
None of the above are quick reads but they are solid foundations for becoming a well rounded DBMS professional.
Have had Safari Books Online for about a year now, got it 50% off last August, love it. There have been several books I started reading in SBO and then ordered physical copies on Amazon. Never even occurred to me to order the books from Safari directly. The books I order are about broad topics and themes (Designing Data-Intensive Applications, Enterprise IoT, etc.). Rarely would I order a book about a specific language or framework, since they have a limited shelf-life, but I'll absolutely reference the online versions to grab the info I need (more valuable format, can have open on monitor, search, etc.). Their process for alerting authors of this change may not have been great, but it seems like a smart business decision (assuming my use case is fairly typical).
Mainly it's just annoying if you've been working for 4+ years and haven't touched that stuff in a while. It's an entirely separate skillset from your role as a developer. The fact that there have now been several studies showing it doesn't correlate with job performance makes its emphasis in interviews questionable.
There's also some fear on the "poseur" factor that people worry about. If candidates can simply rote-learn their way into senior roles I wouldn't really feel comfortable having them as team members.
With that said, I do like the advice of reading Programming Pearls and Designing Data-Intensive Applications -- two books useful for the job at hand, and for your general knowledge of how systems work and how to approach problems. Leetcode, by contrast, is empty calories.
I write about topics I'm learning about here:
https://timilearning.com/
It's largely focused on distributed systems and databases for now, but that's subject to change.
I have some deep dive posts like this: https://timilearning.com/posts/data-storage-on-disk/part-two... - where I dig deeper into a particular topic, in this case: how databases work.
I also have posts like this one: https://timilearning.com/posts/ddia/part-two/chapter-9-2/, where I just share the notes I took while reading a book or watching a video. I've posted my notes from the first 9 chapters of 'Designing Data-Intensive Applications' by Martin Kleppmann there.
My goal is mainly to think more clearly about the things I learn by writing about them, and then share that knowledge with whoever finds the topics interesting.
It depends on what you mean by Software Architecture. I normally see 3 interpretations of it.
For some people, S/W arch is writing readable, maintainable code. Things like Design patterns, FP, TDD, microservices etc. There is a lot of literature on this out there.
For others, it means having the ability to design the next Kafka/Spark/React. You can get basic theory for this by reading books on Domain Modelling, Distributed computing and Algorithms. So books like The Algorithm Design Manual, Designing Data Intensive Applications, Parallel and Concurrent Programming in Haskell, Functional and Reactive Domain Modeling etc. The http://aosabook.org has good case studies to read as well. However, actually building these systems requires facing the problem in the 1st place and being unable to use existing systems to solve it. Or doing a PhD in them. It happens rarely.
Finally, the last one is my day job. Which is to convert ramblings and fantasies of leadership into production systems, minimizing the number of curse words people use when working on them. I haven't really found any good guides to do this though. Things which help me are:
- Always thinking what could go wrong. And if it does, who should be notified if the system can't recover. A lot of times when I don't have the answer, I ask around. Things like slack channels, mailing lists, or even having coffee with people in industry who have tackled stuff like this.
- Communication skills. This doesn't mean small talk, but being able to have conversations and meetings which help define requirements and ensure everyone is on the same page. Also making sure there are hard numbers, i.e. instead of "fast", "responsive" etc., get latency, throughput, uptime numbers.
- Understanding business/technical capabilities and limitations. Things like business impact(LTR etc), capabilities of current infrastructure, skill levels of various people/contractors involved etc
3 or 4 Discworld books, as in every year. Starting with Soul Music this time, in publication order.
Designing Data Intensive Applications.
Some books on leadership from the recent HN discussion, not decided which yet.
Death's End (book 3 of The Three Body Problem). The first two were really good.
The Algorithm Design Manual. Domain Driven Design.
Some chess books. Some general science and history. The yearly random self help book.
If I manage all that plus whatever I'll decide I want in the actual year, it will be a good year for reading, but maybe I need to have some more focus. We'll see.
The premise of a system design interview is ridiculously broad. You could spend half an hour talking about how to scale a system or design at a very high level; or it could be an excuse to get you to mock-up an API or to talk about some useful algorithm. You can and should expect to write code, but then again maybe you won't have to. It's a lucky dip question.
There are books which are tangentially useful, eg Designing Data Intensive Applications or Site Reliability Engineering. Even if you're not going for SRE, it's good to understand the problems that are involved with high availability.
Having a good overview of something like Code Complete is useful, if only because it has generic advice for designing large programs.
For case studies I don't think books are any good. Watch conference talks and read the company dev blogs.
Thanks for the advice and the links! Also, I came across a book called Designing Data Intensive Applications. It was very highly rated on Amazon and on HN as well. Do you think it would help me? It explained a lot of concepts but after reading for a while, I found myself not following it. So, I decided to tackle each topic from scratch.
> You can't design it first and then go and build it.
You can't design it in full in one go, but you can design it and then incrementally update said design. Sadly many (companies) do not. But you can define the problem(s), the scope, the scale, and then design a solution appropriately to meet those needs (for a defined period of time). That's what distinguishes software engineering from hacking. They both have their place. Many companies claim to do the former but are mostly doing the latter. Software is still early in its life and as various kinds of system designs stabilize, so will the formalizations around what it means to be a software developer. Reading a book like Designing Data Intensive Applications, you can't help but see those formalized topics budding.
The best way to learn is to have skin in the game. Doing something yourself will force you to do what actually works. So it seems your current professional employment is excellent in that regard.
Formal study seems to work best after real experience. I read Martin Kleppmann's Designing Data-Intensive Applications based on its inclusion in teachyourselfcs.com.[0] I did not find it useful because I had nothing to apply it to once I finished. However I don't think this will apply to you as it seems you already have some problems in mind to consider.
[0] https://teachyourselfcs.com/#distributed-systems
To get my attention in a prepared lecture:
- Tell stories that touch on the facts in passing. It doesn't matter whether they're stories about the founders in a field, about your personal experiences, about a fictional startup building a database, etc... As long as it's wrapped in a story I will probably find it interesting.
- Start with why. Before explaining solutions, explain the problem those solutions solve.
- Gradual build-up of a system, instead of a serial description of its parts. The chapter in DDIA about LSM databases is one of the most engaging technical chapters I've ever read because it starts with a 2 line shell script and evolves it until it is Google's Bigtable.
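To make that build-up concrete, here is a rough Python sketch of the first two steps only: an append-only log where reads scan the whole file, then an in-memory hash index that lets reads seek straight to the latest record. This is not the book's code; AppendOnlyStore and its method names are made up for illustration, and it stops well short of SSTables, compaction, or anything Bigtable-like.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store backed by an append-only log file."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of that key's latest record
        open(self.path, "ab").close()  # make sure the log file exists

    def set(self, key, value):
        # Sketch-level limits: keys contain no "," and nothing contains "\n".
        offset = os.path.getsize(self.path)  # the new record starts at the end
        with open(self.path, "ab") as log:
            log.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset

    def get_by_scan(self, key):
        # Step 1: no index, so read the whole log and keep the last match.
        found = None
        with open(self.path, "rb") as log:
            for line in log:
                k, _, v = line.decode("utf-8").rstrip("\n").partition(",")
                if k == key:
                    found = v
        return found

    def get(self, key):
        # Step 2: an in-memory hash index lets us seek straight to the record.
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)
            line = log.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]


with tempfile.TemporaryDirectory() as tmp:
    db = AppendOnlyStore(os.path.join(tmp, "database.log"))
    db.set("42", "San Francisco")
    db.set("42", "SF")           # an update is just another appended record
    print(db.get("42"))          # the latest value wins
    print(db.get_by_scan("42"))  # same answer, found by scanning the log
```

Writes stay cheap because they only append; the cost moves to reads, which is what the index addresses. The index here lives only in memory and is not rebuilt from an existing log, which is one of the gaps a real storage engine has to close.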
While I recommend reading DDIA, I think buzzcut_diet may end up disappointed. Reasons:
- it takes a while to read DDIA. Probably around 6 months of focused reading. Perhaps more
- one can learn a really good chunk of theoretical stuff... but probably not applicable to day to day work
- zero practical experience will be gained regarding Kubernetes, Spark, Kafka, EMR, Redis
So, I would recommend a more practical approach:
- start already reading the documentation of K8s, Kafka, Spark, etc. Choose one and go for it. I would recommend Kafka since its documentation is well written
- while reading documentation of the tooling above, one will inevitably stumble upon theoretical stuff that will not be explained in detail: that's exactly when you pick up DDIA (or similar books) and try to find the topic in the index and read it.
This replayability is discussed in Designing Data Intensive Applications (DDIA), a book by Martin Kleppmann. Essentially you can use the Change Data Capture (CDC) information in your primary Postgres database, and pipe it through Kafka and replay it on any other data store.[0] This is also the basis of traditional database replication technology, where the change logs are replayed on other databases.
Is this architecture common? Well, I suspect it is overkill for most smaller organizations due to increased infrastructure complexity. I wouldn't do this just for the sake of doing it -- you may find yourself saddled with an increased maintenance workload just keeping the infrastructure running.
But if you truly have this use case, this is a well-known method for syncing data across various types of datastores (ie. so-called polyglot persistence).
[0] https://www.confluent.io/blog/bottled-water-real-time-integr...
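A minimal sketch of the replay idea, assuming a plain Python list stands in for the Kafka topic and a dict stands in for the downstream data store. ChangeEvent, DerivedStore and replay are hypothetical names for illustration; none of this is the API of Postgres, Kafka or any CDC tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    offset: int                 # position of the event in the change log
    op: str                     # "upsert" or "delete"
    key: str                    # primary key of the affected row
    row: Optional[dict] = None  # new row contents for an upsert


class DerivedStore:
    """Stand-in for a cache, search index, or replica fed from the log."""

    def __init__(self):
        self.rows = {}
        self.applied_offset = -1  # highest offset applied so far

    def apply(self, event):
        if event.offset <= self.applied_offset:
            return  # already applied, so replaying the log again is harmless
        if event.op == "upsert":
            self.rows[event.key] = dict(event.row)
        elif event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        self.applied_offset = event.offset


def replay(log, store):
    """Apply every event in log order and return the store."""
    for event in log:
        store.apply(event)
    return store


change_log = [
    ChangeEvent(0, "upsert", "u1", {"name": "Ann"}),
    ChangeEvent(1, "upsert", "u2", {"name": "Bob"}),
    ChangeEvent(2, "upsert", "u1", {"name": "Anna"}),
    ChangeEvent(3, "delete", "u2"),
]

cache = replay(change_log, DerivedStore())
search_index = replay(change_log, DerivedStore())
print(cache.rows == search_index.rows)  # both stores converge on the same state
```

Because each store tracks the last offset it applied, a new consumer can be added later and brought up to date by replaying from the start, as long as the log is retained.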
Hi HN,
Graph theory had fascinated me as a student. On my first job, I'd briefly worked with Neo4j, a graph database, as part of a proof-of-concept project. At my current gig, I've had the opportunity to delve deep into the world of graph tech, especially databases, over the last year.
Graph-like data models have been around since forever but their mainstream promise is relatively new. A resource which helped me understand the historical as well as fundamental aspects when starting out was the amazing book, "Designing Data Intensive Applications" by Martin Kleppmann. There also exist various academic and industry resources around graph tech. But piecing them together to get a holistic picture to evaluate potential use-cases has been an arduous process, to say the least.
Hence, I wrote this introductory piece to help anyone interested get started. I'd given a talk on the same topic at PyCon Italy (https://www.youtube.com/watch?v=t0Ra8G8gD-w). I plan to write more on related topics.
Main updates as of May 2020:
Computer Architecture: added Computer Systems: A Programmer's Perspective as first recommendation over nand2tetris.
Compilers: Crafting Interpreters added as first recommendation over dragon book.
Distributed Systems: added Designing Data-Intensive Applications as first recommendation over Distributed Systems.
Online availability of some video lectures has changed as well.
I've been having this experience recently. I used to be a voracious reader of programming books. But for a long time starting, gosh, maybe over 5 years ago, I just couldn't get excited to read them anymore. Just recently I thoroughly enjoyed reading Designing Data Intensive Applications, and I have just started Streaming Systems, which I'm also enjoying a lot. I think this idea of fluff at different levels of abstraction explains it: I used to enjoy reading books about programming languages and frameworks, but at some point it just felt like fluff and I could no longer get through it. But there are books about techniques for solving particular classes of problem (in the case of my recent reading: data processing) which don't feel like fluff to me. But maybe they will eventually, and maybe you're already further on this continuum and would find these books fluffy as well. I'll definitely be thinking about this fluff at different levels of abstraction model and checking my self-education against it as I go now!
For a modern look at systems design I highly recommend "Designing Data-Intensive Applications (DDIA)" by Martin Kleppmann. Not really about structuring code, but I think stepping back and realizing your code is part of a larger system is very illuminating and influences how you write code.
http://dataintensive.net/
I would say exactly the opposite. I regret buying a Kindle edition of a book from Amazon [0], because it is DRM-protected and I am forced to use the "Amazon Kindle" application; otherwise I cannot open it. I am usually okay with DRM, but I regret not having bought it elsewhere with less annoying protection.
[0]: https://www.amazon.com/Designing-Data-Intensive-Applications...
Psst, "Designing Data Intensive Applications" was a very good read. Do you know similar books that focus on distributed systems?
You can simplify the coding preparation by just getting a basic LeetCode subscription, filtering down to problems asked at Facebook in the last six months, and then sorting by frequency (how many times each question has been asked recently, according to user reports). CLRS, etc., are nice books, but if you're seriously considering a senior role at Facebook presumably you already know the basics and just need to memorize all the tricks for LeetCode problems.
For system design, though, it actually is worth reading through "Designing Data-Intensive Applications."
There are two books that taught me how systems work.
- One system in isolation - Operating Systems: Three Easy Pieces. Covers persistence, virtualisation and concurrency. This book is available for free at https://pages.cs.wisc.edu/~remzi/OSTEP/
- Multiple systems, and how data flows through them - Designing Data Intensive Applications. Covers the low level details of how databases persist data to disk and how multiple nodes coordinate with each other. If you’ve heard of the “CAP theorem”, this is the source to learn it from. Worth every penny.
More on why these two books are worth reading at https://teachyourselfcs.com
I would agree with you if we were talking about leetcode style data structure/algorithm questions, but system design is almost always relevant. I found 'Designing Data-Intensive Applications' to be a useful book beyond interviewing.
Even if you are not working on a product that requires supporting millions or billions of reads / writes, knowing what is and isn't overkill for your project's use case is still useful.
I just finished the distributed systems section of teachyourselfcs and I now have a solid understanding of a lot of concepts I previously only shallowly understood.
When I started learning, the recommended textbook was Distributed Systems: Principles and Paradigms (the new recommendation is Designing Data-Intensive Applications) and the recommended course was MIT 6.824.
6.824 is the first online course that I have completed fully and it was well worth it. I read, took notes, and summarized all required course readings (around 20 papers) and completed the labs, which involved creating a raft-based key-value store. The labs were especially useful because they forced me to really understand the details of the topics I had learned (e.g., MapReduce, Raft, Raft’s log compaction) in order for my code to pass all the tests.
I’m very happy with the results of following the course and I now plan to put the same amount of effort into other topics.
Maybe you can use one of the data interchange protocols that has a story for backward/forward compatibility? Something like Apache Avro or Protocol Buffers should allow you to work with different versions of your data at the same time.
See: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-a...
His book "Designing Data-Intensive Applications" has a section on this.
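To show what backward and forward compatibility mean here, a small plain-Python sketch of the idea. It deliberately does not use the Avro or Protocol Buffers libraries; REQUIRED, SCHEMA_V1, SCHEMA_V2 and decode are hypothetical names, and real formats handle this at the encoding level rather than with dicts.

```python
REQUIRED = object()  # sentinel meaning "this field has no default"

# Version 1 of the record, and version 2 which adds an optional field.
SCHEMA_V1 = {"user_id": REQUIRED, "name": REQUIRED}
SCHEMA_V2 = {"user_id": REQUIRED, "name": REQUIRED, "email": None}


def decode(record, reader_schema):
    """Interpret a record using the reader's schema.

    Fields the reader does not know about are dropped (forward
    compatibility); fields the writer did not send are filled from
    defaults (backward compatibility).
    """
    decoded = {}
    for field, default in reader_schema.items():
        if field in record:
            decoded[field] = record[field]
        elif default is not REQUIRED:
            decoded[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return decoded


old_record = {"user_id": 1, "name": "Ann"}
new_record = {"user_id": 2, "name": "Bob", "email": "bob@example.com"}

print(decode(old_record, SCHEMA_V2))  # new code reading old data
print(decode(new_record, SCHEMA_V1))  # old code reading new data
```

The rule of thumb this illustrates: a field added in a newer version needs a default (or must be optional), otherwise new readers cannot handle old data.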
I'd love to know the 'design patterns' of works like this Knights of San Francisco game. Did the author use a workflow engine, a rules engine, functional event sourcing, a nested pyramid of if-then-else doom?
I have a sense that this space is somewhat unexplored. Text-based game world simulation is a relatively underdocumented (to my eyes) form of the 'game UI overtop a database manipulated with game logic rules' type of games, of which simulation games are at the complex end.
There are things like Twine and Inky that offer variables and conditionals to prewritten bodies of text, but doing composable text worlds that change their state based on the accumulated choices of players over the course of their time seems to be a complex feature to build and extend, whether in Twine or another tool. Dialog simulation systems that remember what options you've done and give you additional options or changes over the course of the game are sold as products online. Heck, someone recently patented a 'grudge' system that a popular game (League of Legends?) used.
Or maybe I've just been looking too closely at it. I've been working slowly for about the past 2 years on an automation system for a tabletop RPG (non D-20 system) to speed up battle generation & resolution, trying to incorporate all of the various rules that say 'in X scenario, if Y conditions are met, gather this information from the user, then apply its Z effect like so, but also let the DM / user change any of the above or ignore the entire thing before you do so', so while I've ordered Designing Data Intensive Applications in hopes of gaining more insights, this problem certainly seems like a big thing to chew on from my self-taught programmer's POV right now.
I second Designing Data-Intensive Applications.
Deep Learning with Python by François Chollet I think works as an audiobook as well.
I am a big non-fiction audio book fan and so much depends on the voice actor. A bad read can ruin the best content while Robertson Dean made Alan Greenspan's The Age of Turbulence into an enthralling adventure story.
The standard setup nowadays is something like this: http://bit.ly/2MFAAt9.
You can use different technologies based on your use case, but you probably need all the pieces outlined above. As someone else has mentioned, if you're looking for trade-offs between different technologies, I'd recommend "Designing Data-Intensive Applications" by Kleppmann.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/Designing-Data-Intensive-Applications...
I read through this book last year when I saw it recommended on HN. I recommended it to engineers on my team at work.
I’m reading it for a second time now, and just finished chapter 2 today. It’s dense but an amazingly detailed and thorough text.
Here are a few suggestions by me, tried to pick items that will hopefully stand the test of time, one per major publisher:
- Nathan Marz' (of Backtype/Twitter/Storm fame) Big Data (Manning): Don't let the name Big Data make you feel this is only for people with big data needs, in fact Nathan Marz tries to rethink how we store data. Immutable, append-only master data sets and views derived from them.
http://manning.com/marz/
- Martin Kleppmann's Designing Data-Intensive Applications (O'Reilly): http://dataintensive.net/ in beta. "This book will help you navigate the diverse and fast-changing landscape of technologies for storing and processing data. We compare a broad variety of tools and approaches." Martin has been researching the book for the past year and aims to make it a timeless book on the subject.
- Colin Jones' Mastering Clojure Macros: Write Cleaner, Faster, Smarter Code (Pragmatic Bookshelf): This short, but dense book makes macros understandable. Recommended if you want to learn what makes Lisp so powerful. https://pragprog.com/book/cjclojure/mastering-clojure-macros
I've found few issues on which people are as diametrically opposed as the ebook/physical book debate.
I did a reading group at work a couple of years ago (for Designing Data Intensive Applications as it turns out!), and half the group looked at me like I had 3 heads when I offered to buy them hard copies and the other half was insulted if I didn't offer them hard copies
I've been trying to learn about all of this stuff over the last couple of weeks, and agree that it isn't obvious.
Designing Data-Intensive Applications by Kleppmann has been helpful. It doesn't cover every framework, but I think it helps explain where a lot of the pieces fit together and when you might want to use some of them.
I've also found it useful to find podcasts that explain specific projects, such as Apache Kafka, and listen to them when I'm running.
Dear HN reader - if you're not quite ready to buy the book, take a listen to this episode of Software Engineering Daily (https://softwareengineeringdaily.com/2017/05/02/data-intensi...). It will give you a sense of what Martin Kleppmann is all about and how he thinks about problems. I ordered my copy of "Designing Data-Intensive Applications" after listening to this episode.
On my list to read in 2020:
The Visual Display of Quantitative Information
The Rust Programming Language
Progressive Web Apps
Permaculture: Principles and Pathways beyond Sustainability
Farming the Woods
Growing Gourmet and Medicinal Mushrooms
Affinity Designer Workbook
The Age of Surveillance Capitalism
Walden
A Guide for Desert and Dryland Restoration
Ernest Hemingway On Writing
The Two Hands of God (Alan Watts)
The Anarchist's Design Book and/or With the Grain: A Craftsman's Guide to Wood
Dune
Some other fiction reading I'll decide on after I finish Dune
Books/Authors I've read that I would recommend:
Tao of Physics, Web of Life, Systems View of Life, etc. (Fritjof Capra)
^Capra's work has heavily influenced my worldview and ability to think in systems
Designing Data Intensive Applications (just finishing this week)
Permaculture One & Two, Gaia's Garden, Edible Forest Gardens I&II
Black Swan, Antifragile, etc. (Nassim Taleb)
Cloud Hidden: Whereabouts Unknown (Alan Watts, written late in life)
Ishmael, Story of B, etc. (Daniel Quinn)
You are Not a Gadget, Who Owns the Future, etc. (Jaron Lanier)
Goethe's Italian Journey
Vonnegut, Hemingway, Steinbeck
The Wheel of Time
Twitter has changed this approach a few times, I guess. Earlier it simply inserted each tweet into a collection of tweets; then, when you loaded a user's timeline, it looked up the people they follow and found and merged those tweets. But that puts a lot of load on the system at read time. Another approach is to maintain a cache of each user's timeline (a mailbox of tweets): when a user posts a tweet, look up all the people who follow that user and insert the tweet into each of their timeline caches. The results have been precomputed, so reads put less load on the system. Both approaches struggle when you have users with lots of followers, so maybe they use a hybrid of the two. This is discussed in detail in the book "Designing Data-Intensive Applications".
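A rough in-memory sketch of the two approaches, for comparison. This is illustrative only and says nothing about how Twitter actually implements timelines; the class and method names (FanOutOnRead, FanOutOnWrite, follow, post, timeline) are made up for the example.

```python
from collections import defaultdict


class FanOutOnRead:
    """Posting is cheap; the timeline is assembled when it is read."""

    def __init__(self):
        self.tweets = []                   # global list of (seq, author, text)
        self.following = defaultdict(set)  # user -> authors they follow

    def follow(self, user, author):
        self.following[user].add(author)

    def post(self, author, text):
        self.tweets.append((len(self.tweets), author, text))

    def timeline(self, user):
        followed = self.following.get(user, set())
        # Scan all tweets and keep those by followed authors, newest first.
        return [t for t in reversed(self.tweets) if t[1] in followed]


class FanOutOnWrite:
    """Posting copies the tweet into every follower's cached timeline."""

    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.timelines = defaultdict(list)  # user -> precomputed timeline
        self.next_seq = 0

    def follow(self, user, author):
        self.followers[author].add(user)

    def post(self, author, text):
        tweet = (self.next_seq, author, text)
        self.next_seq += 1
        # Write cost grows with follower count: the celebrity problem.
        for user in self.followers.get(author, set()):
            self.timelines[user].append(tweet)

    def timeline(self, user):
        return list(reversed(self.timelines.get(user, [])))


for service in (FanOutOnRead(), FanOutOnWrite()):
    service.follow("alice", "bob")
    service.follow("alice", "carol")
    service.post("bob", "hello")
    service.post("dave", "not followed by alice")
    service.post("carol", "hi")
    print(type(service).__name__, service.timeline("alice"))
```

One simplification to be aware of: the fan-out-on-write version does not backfill older tweets when a follow happens after the post, so the two only agree when follows come first. A hybrid would use the write path for most users and fall back to the read path for authors with very large follower counts.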
A lot of what I read in 2020 will involve just finishing titles I started in 2019 (or before). So my "To read in 2020" list already has a lot of stuff on it.
But to name ones that I very specifically want to read/finish sooner than later... hmm... there are a number of books that fall more into the realms of history / anthropology / etc., that I have been meaning to read. Books like Guns, Germs, and Steel, and Sapiens - things of that nature. One of those that I'm already on, but probably won't finish before Jan 1, is Human Universals by Donald Brown.
I also want to get through some books on writing/reading mathematical proofs. Mathematical Reasoning: Writing and Proof by Ted Sundstrom, or The Book of Proof by Richard Hammack.
Another one I hope to get through is Designing Data-Intensive Applications.
The book Designing Data-Intensive Applications talks about “schema on write” vs “schema on read”. In order to interpret your data, you must apply a schema, so your choice is whether you do that explicitly when the data is written, or implicitly when it’s read.
Or as Yoda would say, Schema read or schema write, there is no “no schema”.
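A small sketch of the distinction, assuming a Python list of JSON strings stands in for the data store. The function names (validate_user, write_with_schema, write_raw, read_user) and the name/age fields are hypothetical, chosen only to show where the schema gets applied.

```python
import json


def validate_user(doc):
    """The explicit schema: name must be a string, age must be an integer."""
    if not isinstance(doc.get("name"), str):
        raise ValueError("name must be a string")
    if not isinstance(doc.get("age"), int):
        raise ValueError("age must be an integer")
    return doc


def write_with_schema(store, doc):
    # Schema on write: bad data is rejected before it is stored.
    store.append(json.dumps(validate_user(doc)))


def write_raw(store, doc):
    # Schema on read: anything JSON-serializable is accepted as-is.
    store.append(json.dumps(doc))


def read_user(raw):
    # The schema still exists; it is implicit in the code that reads.
    doc = json.loads(raw)
    return {"name": str(doc.get("name", "")), "age": int(doc.get("age", 0))}


strict_store, loose_store = [], []
write_with_schema(strict_store, {"name": "Ann", "age": 31})
write_raw(loose_store, {"name": "Bob", "age": "27"})  # age stored as a string
print(read_user(loose_store[0]))  # the reader coerces it into shape
```

Either way some code has to decide what a "user" looks like; the choice is only about when that decision is enforced and who pays for malformed data.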
If you can't understand what Zookeeper is, I'd recommend reading Martin Kleppmann's book Designing Data-Intensive Applications (https://dataintensive.net/).
You don't need a CS degree to work in this field (I don't have one either!) but there are fundamental concepts you need to understand in order to make informed decisions when designing distributed systems.
barbecue_sauceonApr 15, 2021
akshayshahonMar 30, 2020
KototamaonOct 13, 2020
- https://github.com/donnemartin/system-design-primer
- http://aosabook.org/en/index.html
AvalaxyonJan 14, 2021
playing_coloursonDec 15, 2019
Designing Data-Intensive Applications https://dataintensive.net/
Streaming Systems http://shop.oreilly.com/product/0636920073994.do
and this one.
Ozzie_osmanonOct 4, 2019
If you do those and cracking the coding interview you should be good to go.
syndacksonOct 13, 2020
jb3689onMay 13, 2019
alextheparrotonSep 29, 2020
Honestly, the best technical book I’ve ever owned.
bwh2onApr 30, 2021
I also enjoyed Release It! by Michael Nygard to learn about making distributed systems more resilient.
turing_completeonOct 31, 2020
bandwitchonJuly 13, 2018
jdcarteronSep 6, 2017
zitterbewegungonOct 8, 2017
[1] See http://dataintensive.net
skydeonNov 29, 2018
rasmionJuly 29, 2018
https://dataintensive.net
weavieonMar 29, 2018
michel-slmonJune 8, 2016
nmcaonFeb 6, 2018
eatonphilonJune 30, 2021
https://notes.eatonphil.com/books-developers-should-read.htm...
dwateronFeb 6, 2019
nsmonJan 16, 2017
nindalfonJan 8, 2019
I wish this website had more filters. I'd like to filter out books with fewer than 10 reviews. As it is right now, it's a bit noisy.
vshanonJune 16, 2018
jorblumeseaonJan 5, 2021
DDIA is to system design, as a computer science textbook is to the algo interview.
elamjeonSep 16, 2019
atsushinonJuly 20, 2020
lioetersonJuly 11, 2019
makmanalponJan 29, 2018
adamfeldmanonFeb 11, 2019
As an engineer new to system design, I found the whole book to be gold. It gave me the vocabulary to continue learning more on my own.
[1]: https://dataintensive.net
asoloveonFeb 25, 2016
jkapturonApr 14, 2021
merittonJuly 29, 2019
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
jxubonFeb 24, 2018
commonturtleonNov 5, 2020
voz_onJune 21, 2020
basetensucksonMay 16, 2018
Not the author, just a happy reader.
[0] https://dataintensive.net/
jstx1onJuly 26, 2021
potta_coffeeonFeb 26, 2021
nwsmonAug 12, 2020
honkycatonSep 24, 2019
tiloleboonFeb 19, 2020
polskibusonJan 29, 2018
eatonphilonJune 27, 2021
Tldr; Designing Data Intensive Applications, Effective Python, The Google SRE book, and High Performance Browser Networking.
https://github.com/eatonphil/notes.eatonphil.com/blob/master...
roperzhonFeb 17, 2021
Besides being a good read overall, the book discusses topics like this one in detail and with a healthy attitude (people tend to have strong opinions on this)
socksyonApr 7, 2020
LarryEtonJune 26, 2021
I just started on Daniel Kahneman's Noise. It will be disappointing if it isn't one of these type of books.
asdffdsaonMay 13, 2019
posharmaonJan 5, 2021
romanhnonJan 10, 2019
shadeslayer_onJune 30, 2021
sciurusonAug 2, 2017
http://dataintensive.net/
https://www.safaribooksonline.com/library/view/designing-dat...
weitzjonJan 6, 2019
https://www.amazon.de/dp/1449373321/
crazed_climberonApr 13, 2015
http://dataintensive.net/
matthewrudyonFeb 6, 2017
http://dataintensive.net
correlationonDec 22, 2017
johntranonMar 12, 2018
lbrindzeonMay 13, 2019
billdybasonDec 15, 2018
dharmaturtleonJune 28, 2021
Unfortunately event sourcing means distributed systems... and I'm learning this on the fly on nights & weekends. Martin Kleppmann's "Designing Data Intensive Applications" has put the fear of god in me.
mindvirusonDec 6, 2017
Designing Data Intensive Applications by Martin Kleppmann. This book really made stream processing and Kafka click for me.
basetensucksonJune 12, 2018
[0] https://dataintensive.net/
Ozzie_osmanonFeb 23, 2020
Missing from the Architecture & System Design list is Martin Kleppmann's Designing Data-Intensive Applications, IMO the best modern book on systems / scalability.
manigandhamonOct 8, 2017
Then read this book for in-depth details - Designing Data-Intensive Applications: https://dataintensive.net/
phxqlonDec 17, 2019
rramadassonAug 19, 2019
For actual code, you do not have one book but have to glean the knowledge from a whole bunch of them.
ipnononJune 24, 2020
This paper is referenced in chapter 9 (Consistency and Consensus) of "Designing Data-Intensive Applications" by Martin Kleppmann.
reinhardt1053onJuly 30, 2019
pepper_sauceonNov 12, 2019
jwronJuly 11, 2020
sna1lonJuly 12, 2018
If you want to go deeper on any of the subjects he discusses, his references for every chapter are solid and provide a deeper understanding.
whytakaonJuly 23, 2019
On algorithms, I think being able to use the work of others is already very empowering. Perhaps one day I’ll get into it deep.
apazzolinionJune 9, 2020
polymathemagicsonNov 10, 2019
elamjeonDec 15, 2019
What’s the sell here?
throwawayplsonMay 16, 2018
puszczykonDec 16, 2019
Also hope to get some good recommendations here :)
[1]: https://www.goodreads.com/book/show/30659.Meditations?ac=1&f...
[2]: https://www.goodreads.com/book/show/242472.The_Black_Swan?ac...
[3]: https://www.goodreads.com/book/show/23463279-designing-data-...
sambroneronFeb 22, 2019
An overview of databases (what and why, but also a lot of how) plus distributed concepts and modern architectures.
[0] https://www.amazon.com/Designing-Data-Intensive-Applications...
ZealotuxonJuly 2, 2021
henrik_wonDec 16, 2019
https://henrikwarne.com/2019/07/27/book-review-designing-dat...
organsnyderonApr 4, 2018
Reference:
Good for lending out:
zeroc8onOct 17, 2020
christiansakaionOct 8, 2019
- Building Microservices
- Designing Distributed Systems
Any thoughts?
pitchedonJuly 11, 2021
https://www.goodreads.com/book/show/23463279
It’s usually number 1 on these lists but definitely deserves it!
rmetzleronJuly 15, 2021
We do have initiatives to learn from each other, but the days are already filled with too much unplanned work and meetings.
dustingetzonJuly 14, 2018
kthejoker2onApr 27, 2019
He also did these awesome Tolkien-esque maps of the database engine ecosystem: https://martin.kleppmann.com/2017/03/15/map-distributed-data...
Anyway, I inject this sort of stuff directly into my veins, so thanks very much for the post!
avremelonJune 9, 2020
aalhouronDec 30, 2017
daviddaviddavidonSep 29, 2020
Extra points for buying a dead tree copy and reading it without a thousand alerts and internet temptations vying for your attention :)
hdraonOct 8, 2017
It will not only help you understand what's "SQL" and "NoSQL" data stores, it also covers the differences between each of them, what problems they are designed to solve, how they try to solve it, and if it'll help with your problems as well.
DeceiveEitheronDec 8, 2017
https://dataintensive.net/
I have recommended this to everyone.
systemsonJune 1, 2020
"(a good example book for this currently is Designing Data Intensive Applications)."
I got this book recently and was planning to read it soon. Does he think the tools and techniques mentioned in the book are a waste of time, or the opposite?
otrasonNov 4, 2018
Clean Code: A Handbook of Agile Software Craftsmanship [0] is a great book on writing and reading code.
Similarly, Clean Architecture: A Craftsman's Guide to Software Structure and Design [1] is, no surprise, a book on organizing and architecting software.
Designing Data-Intensive Applications [2] may be overkill for your situation, but it's a good read to get an idea about how large scale applications function.
The Architecture of Open Source Applications [3] is a fantastic free resource that walks through how many applications are built. As another comment mentioned, reading code and understanding how other programs are built are great ways to build your "how to do things" repertoire.
Finally, I'd also recommend taking some classes. I started as a self-taught developer, but I've since taken classes both in-person and online that have been a tremendous help. There are many available for free online, and if in-person classes work better for you (motivation, support, resources, etc), definitely go that route. They're a fantastic way to grow.
[0]: https://www.amazon.com/Clean-Code-Handbook-Software-Craftsma...
[1]: https://www.amazon.com/Clean-Architecture-Craftsmans-Softwar...
[2]: https://www.amazon.com/Designing-Data-Intensive-Applications...
[3]: http://aosabook.org/en/index.html
iso1337onMar 17, 2019
It’s very well written, but maybe doesn’t have as much in the way of exercises.
davidcuddebackonMay 22, 2018
[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
jugjugonDec 19, 2020
I have also found first chapters from Designing Data Intensive Applications helpful to get context around sql/nosql dbs.
phyrexonNov 28, 2019
morty_sonFeb 10, 2021
doctorsheronJune 21, 2020
Operating Systems: Three Easy Pieces: In 2013, I found this book because I was frustrated with the textbook assigned for my operating systems class (Silberschatz). OSTEP has incredibly clear and concise descriptions without skimping on necessary details. It's wonderfully written. I was so jazzed up about this book that I ended up sending a lot of edits / improvements, and the authors gave me a very kind shoutout in the acknowledgements section.
Computer Networking: A Top-Down Approach: In 2013, this was the assigned textbook for my computer networking class. I already owned Tanenbaum & Wetherall which is good, but preferred this book. It is a more approachable treatment of networking (without sacrificing any crucial topics), so better for a first course.
I've heard glowing reviews of The Algorithm Design Manual, Designing Data-Intensive Applications, and Structure and Interpretation of Computer Programs over the years, but I haven't personally gone through them. For the TeachYourselfCS categories where I know the textbook landscape, I find their selections spot-on and pretty refreshing.
[0] https://csapp.cs.cmu.edu/3e/labs.html
JoerionDec 8, 2017
karolistonNov 4, 2020
Many people are grinding for job interviews and many companies now copy FAANG and have a "systems design" round, Paxos/Raft is one of the key topics there, thus it's discovered by more and more people.
weitzjonJuly 23, 2020
I kept asking myself, what would happen if I were to extend on the feature currently presented in the chapter I was reading, only to find out my answers in the next chapter.
Brilliant book
sidewayonAug 3, 2021
Two follow up questions if you don't mind me asking, even though I understand you were not on the publishing side:
1. Do you know if changes in the org structure (e.g. when uber was growing fast and - I guess - new teams/product were created and existing teams/products were split) had significant effect on the schemas that had been published since then? For example, when a service is split into two and the dataset of the original service is now distributed, what pattern have you seen working sufficiently well for not breaking everyone downstream?
2. Did you have strong guidelines on how to structure events? Were they entity-based with each message carrying a snapshot of the state of the entities or action-based describing the business logic that occurred? Maybe both?
And yes, one of the books I'm talking about is indeed Designing Data Intensive Applications and I fully agree with you that it's a fantastic piece of work.
henrik_wonNov 24, 2019
Also, for a fantastic source on database transactions and isolation levels, check out "Designing Data-Intensive Applications" (chapter 7) by Martin Kleppmann. Really a great book!
I have written about both these here:
https://henrikwarne.com/2011/12/18/introduction-to-databases...
https://henrikwarne.com/2019/07/27/book-review-designing-dat...
MediumDonJan 3, 2017
barbecue_sauceonMay 13, 2019
deepakkarkionNov 19, 2020
This recorded series is from Kleppmann's Concurrent and Distributed Systems course which he teaches at University of Cambridge.
In case the name seems familiar, Kleppmann is the author of perhaps HN's favourite book "Designing Data-Intensive Applications" https://www.amazon.com/dp/1449373321
olalondeonNov 26, 2016
On a related note, my favorite book this year was "Designing Data-Intensive Applications" by Martin Kleppmann. It's a great overview of modern database systems with a good balance between theory and practice.
mr_tristanonDec 18, 2018
I would say the same thing about databases. Don't just learn how to use PostgreSQL, or Kafka, or Dynamo. Try to understand how they're implemented.
I've found that sometimes, you do need help. Getting some nice user documentation will get your feet wet. But I've found that the best books tend to be more general topics, like "Designing Data Intensive Applications" for databases. (Note: I haven't found anything like that for frameworks - would be a great topic though.) These tend to cover not only "patterns" but give you a nice survey of the theory - so you can dive further into details yourself.
paulgbonNov 18, 2020
I've also had good experiences with SCPD courses from Stanford, if your budget would cover those (they are at the other end of the price spectrum).
djhworldonJuly 2, 2017
My work bought a copy of Designing Data Intensive Applications for the team, I've started reading it but lugging around 1kg of book every day gets old really quick, I wish they would have offered a PDF download coupon or something inside.
hugofirthonFeb 1, 2021
I also agree with the recommendations for "Designing Data-Intensive Applications" and "Database Internals". Though, having read the latter for a book club at $employer, I felt it served better as a sort of "index for the space" for people who already had some DB experience, rather than a true introduction.
nindalfonJan 17, 2017
[1] http://sql-performance-explained.com/
[2] http://dataintensive.net/
honkycatonNov 3, 2019
I have a previous comment on this site about reading. That is actually what I look for, intellectual curiosity and a desire to continue learning and growing.
I'll just quote myself:
"""
What I look for in a developer: READS BOOKS. ( Audio books count )
That's the only thing. I'm sorry, if you are not reading and studying to keep up, you are getting left behind. There are so many brilliant people writing amazing books on a huge array of subjects. If I could get every one of my developers to read ONE book on software design[0] a year, I would die happy and the entire industry would be 10 years ahead.
They don't even have to be technical books. I just want to see intellectual curiosity and a commitment to self improvement.
- 0: In the vein of Clean Architecture, The Pragmatic Programmer, The Mythical Man Month, Designing Data-Intensive Applications, The Google SRE book, etc
"""
patreconSep 29, 2020
8589934591onJan 1, 2020
I had difficulty implementing data structures in C, not in python. Python I was able to think in terms of classes and attributes. But I was finding it difficult to do the same in C since there is no concept of classes. I am still trying to learn pointers properly to have an understanding how to implement data structures and algorithms effectively.
I came across the book you have recommended and it is a very nice book. I would recommend that along with Designing Data Intensive Applications.
Thank you.
avinasshonMay 27, 2021
1. Oren Eini - Creator and CTO of Raven DB - https://ayende.com/blog
2. Tyler Neely - Creator of Sled DB - https://medium.com/@tylerneely
3. Philip O'Toole - Creator of rqlite - https://www.philipotoole.com/
4. Martin Kleppmann - Author of Designing Data-Intensive Applications - https://martin.kleppmann.com/archive.html
5. Glauber Costa - worked on glommio, scylla DB - https://glaubercosta-11125.medium.com/
Do recommend me if you know more!
evil-oliveonApr 24, 2019
Some additional resources I'd recommend if you're interested:
Designing Data-Intensive Applications is a fantastic place to start if you're interested in the intersection of databases & distributed systems:
https://dataintensive.net/
The Architecture of Open-Source Applications book has a few chapters on databases:
https://aosabook.org/en/bdb.html
https://aosabook.org/en/hdfs.html
https://aosabook.org/en/nosql.html
There's some fantastic documentation on Postgres and its internals:
http://www.interdb.jp/pg/index.html
https://momjian.us/main/presentations/internals.html
https://www.postgresql.org/docs/current/internals.html
gfodoronNov 9, 2014
http://shop.oreilly.com/product/0636920032175.do
elamjeonJuly 17, 2019
Basically a high level guide through modern architectures, frameworks, and database designs. So far, my takeaway has been learning what tool would be useful for certain types of data engineering, not the details of how to write code with it.
Edit - link: https://news.ycombinator.com/item?id=20417801
veritas3241onAug 17, 2018
I'll also make a plug for the Meltano[0] project that my colleagues are working on. The idea is to have a simple tool for extraction, loading, transformation, and analysis from common business operations sources (Salesforce, Zendesk, Netsuite, etc.). It's all open source and we're tackling many of the problems you're interested in. Definitely poke around the codebase and feel free to ping me or make an issue / ask questions.
[0] https://gitlab.com/meltano/meltano/
jameskrausonMay 25, 2020
munchoronFeb 1, 2021
It's a great book that goes into pretty much all of the commonly used strategies to scaling data-intensive applications. It's not incredibly deep on any of them but it will allow you to get a great overview of the entire space. For each component, there's usually references to places where you can read and study more about them.
sahil-kangonMay 9, 2018
[1] http://dataintensive.net
wippleronJune 22, 2019
https://dataintensive.net/
muramiraonFeb 21, 2021
I always recommend reading Designing Data Intensive Applications as soon as you have an inkling that you will be asked to make such decisions in the near future.
rahimnathwanionAug 24, 2018
I haven't read 'Designing Data-Intensive Applications' yet, so not sure how much overlap there is or which one is better. But, according to Amazon.com, they're 'frequently bought together'.
olalondeonAug 24, 2016
mmineronDec 24, 2018
I also continued to deepen my understanding of databases and distributed systems. My favourite read this year was Designing Data-Intensive Applications which made me more familiar with the pros and cons of the various datastores and provided a better sense of the tradeoffs that each makes. It also gave me an appreciation for the guarantees that the battle-tested relational databases provide. One of my goals for 2019 is to improve my SQL knowledge; thus far any extra effort to understand it better has paid dividends.
commonturtleonAug 26, 2020
I worked more on the infrastructure / backend side, and I've found the book Designing Data Intensive Applications really useful. Amazing mix of practice and theory, super applicable to people working on distributed systems. Not sure if there is any equivalent for frontend / product engineering.
mapmeonDec 30, 2020
For example, if you don’t have a traditional CS degree, https://teachyourselfcs.com/ is a curated and effective set of books.
If you're trying to understand complex systems, I would read Designing Data Intensive Applications, which is perhaps the best and most useful technical book I have ever read, and covers the most important parts of distributed systems. A lot of what’s in the book is fundamental distributed systems work from the 70s-80s(?), plus newer things from the early 2000s built by BigTechCo.
xadoconApr 4, 2020
Curated lists:
Jeff Atwood more comprehensive list
https://blog.codinghorror.com/recommended-reading-for-develo...
Steve Yegge
https://sites.google.com/site/steveyegge2/ten-great-books
Dan Luu
http://danluu.com/programming-books/
Marty Jacobs
https://zeroequalsfalse.com/posts/programming-books-you-wish...
BGO Software
https://www.bgosoftware.com/blog/8-most-influential-books-on...
Aggregated lists:
https://www.reddit.com/r/learnprogramming/wiki/books
Designing Data-Intensive Applications and its related books:
https://anvaka.github.io/greview/ddia/1/
GraphguyonJuly 31, 2019
Fifteen minutes: “How to Choose a Database” by Ben Anderson (https://www.ibm.com/cloud/blog/how-to-choose-a-database-on-i...)
Three hours: Jepsen analyses of distributed systems safety. Kyle tests software ranging across the database spectrum.
One week: Designing Data-Intensive Applications by Martin Kleppmann.
Disclaimer: I work with Ben and think he takes a really nice tack on this subject, while it may be orthogonal to your immediate question regarding trade-offs.
JarwainonDec 19, 2018
Right now I've got:
- Design Patterns by the Gang of Four
- The DevOps Handbook by Gene Kim
- The Phoenix Project by Gene Kim
- Designing Data-intensive Applications - Martin Kleppmann
- Peopleware - Tom DeMarco
- Code Complete - Steve McConnell
- The Mythical Man Month - Frederick P Brooks Jr
- Growing Object-Oriented Software - Steve Freeman
- Domain Driven Design - Eric Evans
- The Clean Coder: A code of conduct - Robert C martin
- The Pragmatic Programmer - Andrew Hunt
- Building Evolutionary Architectures - Neal Ford
- The Design of Everyday Things - Don Norman
- Don't Make me think - Steve Krug
henrik_wonJuly 24, 2020
https://henrikwarne.com/2019/07/27/book-review-designing-dat...
x-curiouscase-xonJan 14, 2020
https://www.amazon.com/Designing-Data-Intensive-Applications...
jimbokunonMay 12, 2020
https://dataintensive.net/
Almost no fluff, very concrete explanations of various algorithms and system properties, how various real world systems embody them, and how to put those systems together to get effective real world solutions.
therealplatoonNov 18, 2020
https://dataintensive.net/
libraryofbabelonDec 15, 2019
If I was mentoring someone learning this stuff, I'd advise reading Designing Data Intensive Applications first, which is certainly the best for giving the big picture, and follow up with this one for more detail on certain topics.
Given the previous dearth of books on this important subject, I think it's wonderful that we have two.
thundergolferonJune 30, 2021
For me that is teachyourselfcs.com. It recommends only two books if you don't have "multiple years" to self-study part-time. They are: Computer Systems: A Programmer's Perspective and Designing Data-Intensive Applications. If you do have multiple years it recommends ~9 books. The OP list has almost 100 books just on software architecture.
It takes so long to read one good textbook that I'd bet 90% of software engineers haven't read more than three or four cover-to-cover. I was rare in my computing theory class for actually using the textbook and doing the exercises and I only got 2/3 through. Given my current progress rate through 'Computer Systems: A Programmer's Perspective' it will take me at least 150 hours to complete.
nw__dataengonJuly 12, 2019
The reason you can't find data engineering materials online is because real data engineering really only happens at a handful of companies - and those companies maintain this knowledge base internally and do not share it.
I noticed that you listed tools / frameworks to learn, as well as languages. Another piece of advice would be to not focus on those because they come and go (for example, Hadoop is pretty much deprecated in any DE-heavy company). What lasts is an understanding of distributed systems, distributed query engines, storage technologies, and algorithms & data structures. If you have a firm grasp on those, you won't have to start from scratch every time a new framework is introduced. You'll immediately recognize what problems the tech is solving and how they're solving it, and based on your knowledge you can connect the dots and know if that solution is what you need.
Another thing to do is watch CS186 from Berkeley in its entirety. This course is about relational databases, but will give you the foundation you need to speak the DE language.
Source: I work as a data engineer at what some would call a big company :)
robtoonJan 31, 2020
I think one of the best ways to learn software architecture is to have a clear view of what the challenges are, and the Kleppmann book does a really good job of providing that clear view.
[0]https://dataintensive.net/
rmbibeaultonJan 2, 2020
Remote: Yes
Willing to relocate: Yes (Highly interested in relocating to Silicon Valley, or San Francisco, or other major tech hubs/cities, such as NYC, also interested in staying in the Boston area)
Technologies: Common Lisp, Python, Linux, git (some knowledge of rust, and C)
Github: github.com/Duderichy
LinkedIn: https://www.linkedin.com/in/rbibeault
Resume: see LinkedIn, and message me there, or email me for a copy.
Email: RichardMBibeault@gmail.com
I passed the triplebyte interview.
Physics major (Bachelor of Science) turned software developer. One year as a backend developer at a Common Lisp shop. Looking for a Linux-based company. (macOS as workstation computer/laptops is great too!). Avid learner, I try to read and learn as much as possible; I've recently gone through Designing Data Intensive Applications, and Designing Distributed Systems.
Would be glad to work at a company that uses a functional language, such as Haskell, especially if they don't expect new employees to come in already knowing the language. Also highly interested in companies using Rust, python, or go.
Ambitious: only been at the company a year and spent a significant amount of time this summer directing an intern, overhauled the build system the company uses internally (set up jenkins over previous system).
Eager to learn as much as I can.
squeaky-cleanonJuly 29, 2019
So I need to have written a book to be able to download a PDF and see 85/100 pages are blank? I work as a data engineer and can tell you 50% of these chapter topics are not directly related to data engineering.
There are no chapters in this book even close to 10% finished. If you want a book recommendation I'm seconding the suggestion in this thread of Designing Data-Intensive Applications. I have a copy 3 feet from me at the moment.
> This is a work-in-progress kindly made freely available. Is it really fair to criticize the author for not having finished it yet?
Please look through the PDF. This isn't just not done. This is not ready to share with anyone publicly. There is no useful information in this. There are probably under 20 paragraphs of original text.
> Is it really fair to criticize the author for not having finished it yet?
No, but I'm criticizing the fact that it's posted[0]. Not that they're working on something.
I don't see the author here in this thread so my warning is to other readers. Just move on unless you're a book publisher looking for an author to pick up.
The only real criticism anyone could offer about this would be about the chapter structure, because that's all that exists. I would recommend they drop all the chapters that are a CS101 equivalent. There's no need to explain git or the OSI model or grep.
[0] edit, I want to clarify I mean just posted and dumped. If the author were here for questions or feedback I would feel differently. But with just this link as-is, there is no point in sharing.
linkelonFeb 12, 2020
I am in the middle of the first exercise and have some questions.
Many of the examples in your book show people connecting these separate ideas that are reasonably understandable and applicable to the general population--the girl who recognized that many people have a fear of needles and sought to design a medical device to help, or the student who liked going to festivals and thought about aligning attendee interests with the festivals' interests and waiving attendance fees for attendees by having them volunteer at charities. It sounds like students in your class came up with relatable ideas by looking at problems in their lives that they noticed.
Right now, I'm merely a year into my career as a software engineer (having switched careers last year) and I am very interested in learning about good software engineering practices. I like seeing great CI/CD pipelines and being able to deliver very quickly. I like the sound of good DevOps practices (currently reading slowly through Accelerate by Forsgren) and I so far have really enjoyed reading books on scalability and reliability (Designing Data-Intensive Applications by Kleppmann is frequently recommended and I got a lot out of the book). I'm vaguely interested in MLOps.
I'm pretty happy being more of a cog in a machine right now so that I can see how an established company runs from the inside. I don't know that I'm immediately interested in a project that is more generalizable, the way your book examples are. But it does seem like an entrepreneurial mindset is still core to career progression since in the end a job is also about solving people's problems (where people may be inside or outside of the company). Thus I want to figure out how to use Initiative to iteratively improve my career.
I am wondering if people found success applying your Method Initiative concepts to a narrower scope in a specific technical field, and whether you could share some of those stories.
dd82onFeb 25, 2021
Martin Kleppmann, the author behind Designing Data Intensive Applications, wrote about his experience as well, and it shows an interesting contrast with Resig's experience with digital publishing. As you can see in Martin's graph, ebook sales starting Sept 2014 were a _major_ part of his royalties due to it being available as "early release", and integration with the O'Reilly platform increased his exposure, and therefore royalties.
It's hard to gauge accurately, but it seems O'Reilly + ebook sales contributed to about 2/3rds of his overall royalty returns, which is a pretty darn good result!
Of course, Kleppmann and Resig are writing about very different eras in terms of publishing, but I can't help but wonder if Resig would have a different experience if he was able to publish an equally relevant work in 2015 vs 2008.
JtsummersonJune 25, 2021
Martin Kleppmann's Designing Data-Intensive Applications. Based on the frequent praise it receives here, haven't gotten far yet. I have some project ideas (for personal and professional projects) that could benefit from reading through it.
Martin Fowler's 2018 update to Refactoring. I read the original one a long time ago. In context, we have a work lunch & learn series and I'm interested in doing some presentations on the topic of refactoring (why, how, and when in particular) so it seemed appropriate to refresh my memory on some specific terminology from the book as well as to see if it's an appropriate book to recommend to colleagues. My recollection of the first edition is that I'd recommend it to colleagues, but it's been so long I'd rather read it once more before actually recommending it.
I reread Robert C. Martin's Clean Code based on some recent discussion here where it was rather strongly dismissed by a fair number of people. I didn't recall it being bad, my reread confirmed it is not, in fact, bad. Java-heavy, which is now an unpopular style of OOP, but otherwise a very good book. I'd still recommend it to junior colleagues paired with some caveats about avoiding seeing the world in black & white. There is no singular Way of Programming, but learn various ways and find what works for you and your team.
There are some more, but it's almost 5am and I haven't been able to sleep so I don't recall everything that's in the book stack or ebook queue. These are the ones I'm most interested in at present.
ing33konSep 13, 2020
"How do you make sure that a celebrity's tweet reaches all of her followers in less than 3 seconds?"
Looks like the interviewer has at least read the first chapter of
"Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems"
If you read and understand the above book, I am reasonably sure that you can crack most of the system design interviews.
fatjonnyonJune 5, 2017
lazyantonJan 5, 2021
What we need is more "Designing Data Intensive Applications", adapted to interviews.
Just as a couple quick comments, the "web crawler" scenario suggests a breadth-first search, which is OK (as compared to depth-first search) but not good enough; web links in general do not form a DAG, so you can get into a loop. As another comment, in neither of these two resources is there a single estimate that I can remember of how many servers you need per requests/bandwidth etc.; the only calculations are about data amount. They also assume a collaborative interviewer, which has never happened in my experience. I think neither of these two resources by itself would get you an L5 or even do well as L4 at FAANG (please somebody correct me); they are very basic (maybe I'm "too advanced" heh).
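To make the loop problem concrete, here is a minimal sketch of a breadth-first crawl that tracks visited URLs so a cycle in the link graph cannot trap it. The `get_links` callable and the toy `LINKS` graph are assumptions for illustration only; a real crawler would fetch and parse pages, normalize URLs, and respect politeness limits.

```python
from collections import deque


def crawl_bfs(start, get_links):
    """Breadth-first traversal that visits each URL at most once.

    get_links is a hypothetical callable returning the outgoing links
    of a page; a real crawler would fetch and parse the page here.
    """
    visited = {start}  # URLs already queued, so cycles cannot re-enqueue them
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Toy link graph (hypothetical) containing a cycle: a -> b -> c -> a
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}

if __name__ == "__main__":
    print(crawl_bfs("a", lambda url: LINKS.get(url, [])))
```

Without the `visited` set, the `c -> a` link would re-enqueue `a` forever; with it, the traversal terminates after touching each page once.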
TheAceOfHeartsonDec 8, 2017
As for 2017 CS books, I'd second Designing Data‑Intensive Applications.
If we count updates, the latest revision of The Swift Programming Language is solid. My forays into Swift have been enjoyable.
This last one is kinda cheating since it's continuously updated, but I'd highly suggest browsing through the HTML Living Standard [0] and reading any parts that grab your attention.
EDIT: Looking through whatwg's news, I found out there's a developer edition [1] of the spec which strips the stuff that's only relevant to browser developers.
[0] https://html.spec.whatwg.org/multipage/
[1] https://html.spec.whatwg.org/dev/
maxioaticonApr 5, 2020
But a general book about distributed systems that I highly recommend and one you'll see often referred to on HN is Designing Data-Intensive Applications. Seriously great book and one I often go back to for fundamentals about distributed systems.
tcbascheonNov 27, 2019
I have a Kindle Paperwhite from (I think) 2014 and it still holds up quite well. I recently read the Kleppmann book (Designing Data-Intensive Applications) on it and it was fine, but I would have liked the ability to scribble notes and put in physical bookmarks.
From a technical perspective, everything worked quite well but I'm not sure I would want to read the UEFI spec on it.
tybitonJuly 15, 2021
I think either way it’s important to realise that unless you’re superhuman you’re not really solidifying the knowledge, you’re making yourself aware of the concepts. If a use case comes up that requires it you have a better chance of recognising that and then going back and getting a deeper understanding of how to apply it.
I only absorbed a small fraction of DDIA but I still think reading it was invaluable.
kronskionSep 8, 2020
The book chapters do a solid job laying the groundwork for those papers. The depth is in those references. Read them if you can!
cbenonOct 27, 2019
(1) "Understanding Computation" by Tom Stuart.
Not "fundamental" as a deep textbook, but a very approachable intro for programmers into a big chunk of CS, explaining deep ideas about languages using rigorous, clean working code (in Ruby, no prior knowledge needed).
I especially loved the first few chapters about what it means to define a programming language and various kinds of formal semantics.
(2) Designing Data-Intensive Applications, Martin Kleppmann. This gives you a phenomenally good survey of concepts and practice of distributed systems. This is more software engineering than pure CS, but in my view you can't approach the field of distributed systems without blending both anyway.
(3) POODR — Practical Object-Oriented Design, in Ruby, by Sandi Metz. This is 100% software engineering, where there is no single definition of "foundational", but many people who read this swear by it. It's a remarkably thin but lucid distillation of ideas that were "in the air" but Sandi nailed them down. An important thesis is that good code is not an aesthetic judgement of how it _now_ looks, but an objective question of how easy it will be to _change in the future_. Not Ruby-specific at all, but it teaches the original Smalltalk "message-passing" view of OOP, which is a fundamental idea that people who only learnt the statically-typed Java, C++ etc. view of OOP are missing out on.
Finally, not a book, but "the morning paper" https://blog.acolyer.org/ is an excellent "return on your time" if you want to sample academic papers, both classic foundational ones and cutting-edge ones.
faizshahonAug 15, 2021
The hard problems stem from how the system deals with failures and how the system propagates writes across the replicas while meeting latency and consistency SLAs. On top of that, the system needs to be built in a way that it can be maintained by many developers, each working on a small piece of the system without knowing the ins and outs of the system as a whole. In addition, when the system fails, debugging and mitigation need to be parallelizable across many developers so that availability SLAs can be maintained. You can read about this in “Designing Data-Intensive Applications” by Martin Kleppmann, where he discusses the complexity involved in building distributed systems.
westurneronAug 2, 2020
https://dataintensive.net/
https://g.co/kgs/xJ73FS
wenconFeb 4, 2019
This became popular as people were trying to figure out how to use Kafka as a persisted log store that could be "replayed" into various other databases. This meant that you could potentially stream all the deltas (well, more accurately the operations to create the delta, e.g. insert, update, delete) in your data -- through a mechanism called Change-Data-Capture (CDC) [1] -- into a single platform (Kafka) and consistently replicate that data into SQL databases, NoSQL databases, object stores, etc. Because these are deltas, this lets you reconstruct your data at any point in history on any kind of back end database or storage (it’s database agnostic).
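A rough sketch of the replay idea, under loud assumptions: a plain Python list stands in for the log (this is not the Kafka API), the event shape (`op`, `key`, `row`) is hypothetical, and a dict stands in for the target store. Replaying the operations in order rebuilds the data, and stopping at an earlier offset gives the data as of that point in history.

```python
def apply_change(state, event):
    """Apply one change event (insert, update, or delete) to a dict keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Merge the changed columns into whatever row is already stored.
        state[key] = {**state.get(key, {}), **event["row"]}
    elif op == "delete":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return state


def replay(log, up_to_offset=None):
    """Rebuild state by replaying the log from the start, optionally stopping at an offset (inclusive)."""
    state = {}
    for offset, event in enumerate(log):
        if up_to_offset is not None and offset > up_to_offset:
            break
        apply_change(state, event)
    return state


# Hypothetical change log; the list index plays the role of the log offset.
LOG = [
    {"op": "insert", "key": 1, "row": {"name": "ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"plan": "pro"}},
    {"op": "delete", "key": 2},
]

if __name__ == "__main__":
    print(replay(LOG))                  # current state
    print(replay(LOG, up_to_offset=1))  # state as of offset 1
```

The same `replay` could feed any back end that can apply the three operations, which is the sense in which the approach is database agnostic.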
Event sourcing to my understanding is a term used among DDD practitioners and Martin Fowler disciples but with a different nuance. This article explains what it is:
http://cqrs.wikidot.com/doc:event-sourcing
[1] Debezium is an open-source CDC tool for common open-source databases. Side note: A valid (but potentially expensive) way of implementing CDC is by defining database triggers in your SQL database.
billtionJuly 19, 2021
[1] https://www.audible.com/pd/Designing-Data-Intensive-Applicat...
YoriconJuly 24, 2019
Most of the science behind these things is actually older. They industrialized it, removed the kinks, built upon actual experience, all of which is extremely precious, but I don't think it's as groundbreaking as people believe.
> - Google, FB have profoundly impacted front end development with cutting edge Javascript runtimes and open source front end frameworks
If you're talking about JITs, that's gradual improvements on prior work on JITs (started during the 60s, ignored by industry until Sun picked it during the 90s... for an academic project). Again, very useful, but not necessarily groundbreaking.
> - Amazon, Google, Microsoft have basically invented/popularized a way to do computing(Cloud), server management that has enabled tiny tech companies to become giants by outsourcing IT infrastructure
Again, industrialization on prior academic work (e.g. virtualization, distributed component-based architectures, etc.)
> - Apple/Google have created devices, OSs, and software that is nearly impossible to live without these days, additionally creating platforms for millions of developers to make a living on(App Store)
> - Amazon has set the bar pretty high for automation in operations and made 2 day shipping a thing we expect from everyone
Mmmmh... I was talking of "scientific progress", you seem to be talking of something different :)
If you recall, my point was that it's very hard to measure "scientific progress" by looking at industry, because industrialization typically happens decades after the actual discoveries/inventions. I think your point is that "industrial progress" may be good, which I'm not debating :)
jpamataonMay 10, 2018
The Architecture of Open Source Applications[2] series is a good one for learning how to build production applications and you can read it online. The chapter on Scalable Web Architecture[3] is a must-read.
[0] https://www.amazon.com/Designing-Data-Intensive-Applications...
[1] https://news.ycombinator.com/item?id=15428526
[2] http://aosabook.org/en/index.html
[3] http://aosabook.org/en/distsys.html
jxubonAug 3, 2018
nonesuchluckonOct 21, 2020
romanhnonJune 16, 2016
The downside is that I pre-ordered the book in November, expecting it in April and it now shows November of this year as the release date on Amazon. I'd be surprised to get it this year at all. Haven't found other books of similar scope and recency though, so I guess I'll wait some more.
libraryofbabelonDec 8, 2020
Alex Petrov’s Database Internals: A Deep Dive Into How Distributed Data Systems Work (2019) is another essential recent reference that should be here. Not as broad as Kleppmann but dives a lot deeper into certain topics.
speedytnwonDec 6, 2018
[Designing Data-Intensive Applications by Martin Kleppmann]
nindalfonJuly 2, 2017
cloakedarbiteronOct 1, 2018
[0] https://www.amazon.com/gp/product/1449373321/
clumsysmurfonJuly 2, 2017
After O'Reilly moved to DRM-free books, their 2009 sales went up by 104% http://toc.oreilly.com/2010/01/2009-oreilly-ebook-revenue-up...
In other interviews, he seemed confident that DRM wasn't worth it
https://www.forbes.com/forbes/2011/0411/focus-tim-oreilly-me...
Perhaps some part of the equation has changed since then. I'm looking forward to a deeper analysis of the business reasons for this.
I'm also interested to hear what more authors think - I wonder how many agree with Martin Kleppmann (Designing Data Intensive Applications) https://twitter.com/martinkl/status/880336943980085248
This independence day weekend there were a lot of sales, so I purchased:
* "Programming Clojure, Third Edition" from pragprog (30% off sale)
* The entire collection of "Enthusiast's Guide to ..." from rockynook (each for $10)
* "The Quick Python Book 3e", "Serverless Architectures on AWS", "Event Streams in Action", "Get Programming with Haskell" from Manning (50% off)
These sales are the only way I can afford the volume I read. Some of that money would have gone to O'Reilly authors, but they deleted my full cart with $100 worth of stuff before I could purchase!
EDIT: O'Reilly's catalog seemed large & redundant with publishers (packt) offering the same materials on their sites. Some like Wiley / MKP only offered very few items from their catalogs. Others like Rosenfeld / rockynook / no starch now provide DRM free options directly from their sites. I'm hoping at least O'Reilly reconsiders selling their Animal books again.
mapmeonApr 30, 2021
alikemalocalanonAug 13, 2018
https://dataintensive.net/
techjonApr 2, 2019
Remote: Yes (Have experience working remotely)
Willing to relocate: Yes
Technologies: Python, JavaScript, Linux, AWS, MySQL, PHP, Pandas, Selenium, Ansible, etc.
Résumé/CV: Available upon request.
Email: dctechj at gmail
I enjoy working with other people, and I'm good at developing practical solutions to problems. I am capable of quickly learning new tech on my own time, or absorbing knowledge by working with others. I've both worked remotely and as a member of a team. My work experience is focused in full stack web development and running IT infrastructure. I am comfortable outside of this range and have worked on systems ranging from USB duplication automation, warehouse inventory systems, and 'complex' proprietary databases.
I am open to entry-level roles, but I could be a good fit for roles where my experience applies. I took a break to complete my degree a few years back, and have a programming resume gap that can be discussed.
Current personal projects:
Built a server out of off-lease enterprise gear and using it as my own virtual machine server. Working on automating the deployment of any programs or services I host locally.
Developing a real-time general purpose notification system. Reading through "Designing Data-Intensive Applications."
cosmolevonJune 22, 2015
Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
By Martin Kleppmann
http://shop.oreilly.com/product/0636920032175.do
http://dataintensive.net/
The author has a great sense of humor.
tracer4201onJan 11, 2020
adamnemecekonSep 22, 2015
ovidiup13onJune 25, 2021
- High Performance Browser Networking by Ilya Grigorik
- Refactoring: Improving the Design of Existing Code by Martin Fowler
- Designing Data-Intensive Applications by Martin Kleppmann
How did you find the latter? I'm a FE developer, so quite keen to get more hands-on with data.
sciurusonMay 13, 2016
http://shop.oreilly.com/product/0636920032175.do?sortby=publ...
TheAceOfHeartsonAug 24, 2018
As for my second suggestion, I'll tell you one of the ways in which I go about researching certain kinds of programming topics. I pay for a Safari Books Online subscription [0], which lets me browse a massive amount of technical books without restrictions. Once I figure out the appropriate keywords, I'll perform a search and open all the relevant books in separate tabs. Then I filter the list down by looking through the index, or reading through a couple pages, to see if it actually covers what I'm looking for. By the time I've prepared this reduced list I usually have an idea of which books seem most interesting, and those are usually the ones I start with. Then it's just a matter of working my way through the list until satisfied. It has been my experience that most technical books are not worth reading cover-to-cover, so I just read through the few relevant chapters and move on. As with all things, there's definitely exceptions; I'd actually consider Designing Data-Intensive Applications one such example.
If you get a card from your local library you might also be able to get access to Safari Books Online for free, as well as tons of other resources. Although with my library card I only get access to a limited subset of their books, instead of the whole collection like with the paid subscription.
Another option, if you can't afford to spend that much money, is to just pirate a bunch of books or look em up on Google Books [1] in order to identify the ones which interest you the most, and then buy the ones that look useful, or try borrowing em from your local library (most likely through interlibrary loans). The market for technical books isn't very big and great authors are rare, so I think it's incredibly important that they be adequately compensated for their hard work, though. If you really can't afford to buy the books initially, be sure to at least keep track of the list so you can make the purchase after you've gotten your new job.
[0] https://www.safaribooksonline.com/
[1] https://books.google.com/
sp527onAug 25, 2017
Two books I've come across recently that do well to circumvent that problem: Sapiens by Yuval Harari and Designing Data-Intensive Applications by Martin Kleppmann. In both cases, you get a sense of the author distilling a lifetime's worth of knowledge and expertise into a form that seems hopelessly condensed, and in fact provokes you into further study. That's the mark of excellent exposition in my opinion.
stuxnet79onDec 8, 2020
You don't have to read DDIA front to back. Just picking a topic (for instance "Distributed Transactions") is enough to get you started building an intuition about these issues.
_____sonOct 13, 2020
- Read Designing Data Intensive Applications. As others have said, it's a gem of a book, very readable, and it covers a lot of ground. It should answer both of your questions. Take the time to read it, take notes, and you should be well set. If you need to dive deeper into specific topics, each chapter links to several resources.
- Read some classic papers (Dynamo, Spanner, GFS). Some of these are readable while some are not-so-readable, but it'll be useful to get a sense of what problems they solve and where they fit in. You may not understand all of the terminology but that's fine.
That should give you a strong foundation that you can build upon. Beyond that, just build some systems, experiment with the ideas that you're learning. You cannot replace that experience with any amount of reading, so build something, make mistakes, struggle with implementation, and you'll reinforce what you've learned.
Backend is vast, and this helps you build a general sense of the topic. When you find a topic that you're really interested in (say stream processing, storage systems, or anything else), you can dive into that specific topic with some extra resources.
> I understand Postgres the best, and would love to know why these and others exist, where do they fit in, why are they better over PSQL and what for, and if they are cloud only what's their alternatives....It seems all of them just store data, which PSQL does too, so what's the difference?
A lot of that depends on the way you're building a system, the amount of data you're going to store, query patterns, etc. In most cases, there are tradeoffs that you'll have to understand and account for.
For example, a lot of column-oriented databases are better suited for analytics workloads. One of the reasons for that is their storage format (as the name says, columns rather than rows). Some of the systems you mentioned are built for search; some are built from the ground up to allow easier horizontal scaling, etc.
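To make the storage-format point concrete, here is a minimal sketch with made-up toy data (the field names and values are purely illustrative, not taken from any real system). It holds the same three records in a row layout and a column layout and runs one analytics-style aggregate over each:

```python
# Hypothetical toy data: the same three records in two layouts.

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "NL", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "NL", "amount": 4.5},
]

# Column-oriented layout: one list per column, values aligned by position.
columns = {
    "id": [1, 2, 3],
    "country": ["NL", "US", "NL"],
    "amount": [10.0, 25.5, 4.5],
}


def total_amount_row_store(records):
    # Has to walk whole records to pull out a single field; on disk this
    # would mean reading every column of every row.
    return sum(record["amount"] for record in records)


def total_amount_column_store(cols):
    # Only needs the one column the query asks for.
    return sum(cols["amount"])


print(total_amount_row_store(rows), total_amount_column_store(columns))
```

Both functions return the same answer; the difference is how much data a real engine would have to read to get it, which is roughly why analytics queries over a few columns of very wide tables tend to favor the column layout.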
finikytouonJan 5, 2021
Is there a good resource to read on top of DDIA that reflects some recent changes in system design?
sseppolaonJan 11, 2021
What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/Big Query et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.
Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.
aalhouronJuly 10, 2017
- Left of Bang.
- The Obstacle is the way.
- The Daily Stoic.
- High-Output Management.
- The Effective Engineer.
- Managing Humans.
- Introducing Go.
Currently going through "Designing Data Intensive Applications" and some other data-related free ebooks from O'Reilly.
Up next on my list for the rest of the year:
- Hadoop: The Definitive Guide.
- The Manager's Path.
- Anti-Fragile.
- A Guide to the Good Life.
- The Denial of Death.
- Man's Search for Meaning.
EDIT: list formatting.
rahimnathwanionMar 11, 2019
2. Take something you've built, and imagine that it faces a specific scaling problem. For example, let's say the most complex thing you've built is a to-do list app. Maybe you can imagine some things that could make it grind to a halt when running as a monolith on a single server. e.g. what if a single user has 10,000 or 100,000 TODOs? Fake that situation (just insert many many fake TODOs into the database). Now see if you've created a scaling problem. If not, try a larger number, or a different dimension (e.g. what if there are many many simultaneous requests?) until you create a problem.
3. Look in that book again and pick a couple of ways you could solve the problem (database sharding across multiple servers? caching? pagination in your API calls?) and implement one of them.
(If you have difficulty at step 2, you could try running your app on a virtual machine with very little RAM and disk, but imagine that's the largest machine type available.)
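As a rough sketch of steps 2 and 3, assuming a to-do app backed by SQLite (the table name, column names, and helper functions below are all hypothetical), you could fake a heavy user and then add keyset pagination so that no single API call has to return everything:

```python
import sqlite3


def seed_todos(conn, user_id, count):
    """Insert `count` fake TODOs for one user to simulate a heavy account."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS todos ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO todos (user_id, title) VALUES (?, ?)",
        ((user_id, f"fake todo {i}") for i in range(count)),
    )
    conn.commit()


def fetch_page(conn, user_id, after_id=0, page_size=50):
    """Keyset pagination: return the next `page_size` rows after `after_id`."""
    cursor = conn.execute(
        "SELECT id, title FROM todos "
        "WHERE user_id = ? AND id > ? ORDER BY id LIMIT ?",
        (user_id, after_id, page_size),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
seed_todos(conn, user_id=1, count=100_000)

first_page = fetch_page(conn, user_id=1)
second_page = fetch_page(conn, user_id=1, after_id=first_page[-1][0])
print(len(first_page), second_page[0])
```

An in-memory database on one machine will probably shrug off 100,000 rows, so treat this as scaffolding only: point it at your real stack, grow `count`, and measure before and after adding the pagination.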
elamjeonJuly 23, 2019
To name a few:
- Google, FB, Amazon basically wrote the book on distributed systems (Read Designing Data Intensive Applications), both from a research perspective and a very well architected, open source solution
- Google, FB have profoundly impacted front end development with cutting edge Javascript runtimes and open source front end frameworks
- Amazon, Google, Microsoft have basically invented/popularized a way to do computing(Cloud), server management that has enabled tiny tech companies to become giants by outsourcing IT infrastructure
- Apple/Google have created devices, OSs, and software that is nearly impossible to live without these days, additionally creating platforms for millions of developers to make a living on(App Store)
- Amazon has set the bar pretty high for automation in operations and made 2 day shipping a thing we expect from everyone
There are many other things I can’t think of right now, but long story short most of the companies you listed do have crappy parts of their business, but have also made incredible platforms that 3rd parties can leverage to make a ton of money.
I caveat all of this by saying that there are some practices that I don’t agree with at all of those firms, but by and large they have gotten so big because they are platforms.
wenconJune 25, 2021
* Designing Data Intensive Applications (M Kleppmann): Provided a first-principles approach for thinking about the design of modern large-scale data infrastructure. It's not just about assembling different technologies -- there are principles behind how data moves and transforms that transcend current technology, and DDIA is an articulation of those principles. After reading this, I began to notice general patterns in data infrastructure, which helped me quickly grasp how new technologies worked. (most are variations on the same principles)
* Introduction to Statistical Learning (James et al) and Applied Predictive Modeling (Kuhn et al). These two books gave me a grand sweep of predictive modeling methods pre-deep learning, methods which continue to be useful and applicable to a wider variety of problem contexts than AI/Deep Learning. (neural networks aren't appropriate for huge classes of problems)
* High Output Management (A Grove): oft-recommended book by former Intel CEO Andy Grove on how middle management in large corporations actually works, from promotions to meetings (as a unit of work). This was my guide to interpreting my experiences when I joined a large corporation and boy was it accurate. It gave me a language and a framework for thinking about what was happening around me. I heard this was 1 of 2 books Tobi Luetke read to understand management when he went from being a technical person to CEO of Shopify. (the other book being Cialdini's Influence). The Hard Thing About Hard Things (B Horowitz) is a different take that is also worth a read to understand the hidden--but intentional--managerial design of a modern tech company. These are some of the very few books written by practitioners--rather than management gurus--that I've found to track pretty closely with my own real life experiences.
evil-oliveonSep 5, 2019
https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...
Kleppmann is the author of Designing Data-Intensive Applications and generally someone whose opinion I trust in such things.
I love Redis and use it extensively at $dayjob, but I stick to Consul for managing distributed locks. Consul was built from the ground up to handle such things; Redis handles it as a bit more of an afterthought / consequence of other features.
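As I recall it, the core idea in the linked post is a fencing token: the lock service hands out a strictly increasing number with every lock grant, and the protected resource rejects any write carrying an older number than one it has already seen. Here is a minimal in-process sketch of that idea. The class and method names are hypothetical, and it does not use the real Redis or Consul APIs:

```python
class LockService:
    """Toy lock service that issues a strictly increasing fencing token."""

    def __init__(self):
        self._token = 0

    def acquire(self):
        # A real service would also track lease expiry; this toy only
        # models the token that accompanies each new grant.
        self._token += 1
        return self._token


class FencedStore:
    """Toy storage that refuses writes carrying a stale fencing token."""

    def __init__(self):
        self._highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self._highest_token:
            return False  # an older lock holder woke up late; reject it
        self._highest_token = token
        self.value = value
        return True


locks = LockService()
store = FencedStore()

token_a = locks.acquire()  # client A gets the lock, then stalls (say, a GC pause)
token_b = locks.acquire()  # A's lease expires and client B gets the lock

print(store.write(token_b, "written by B"))
print(store.write(token_a, "written by A"))  # stale token, rejected
```

The point of the sketch is that safety comes from the storage checking the token, not from the lock alone, which is why it matters whether the lock system was designed to hand out such tokens in the first place.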
nikhilsimhaonJune 30, 2021
[systems] Designing data intensive applications - kleppmann
[programming] SICP - sussman & abelson
Last one is an old scheme book. No other book (that I read) can even hold a candle to this one, in terms of actually developing my thought process around abstraction & composition of ideas in code. Things that library authors often need to deal with.
For example in react - what are the right concepts that are powerful enough to represent a dynamic website & how should they compose together.
ttaonApr 4, 2018
rotiferonJune 8, 2020
Effective Java by Joshua Bloch
Practical, actionable guidelines. The first edition was the best, the second was diluted somewhat by having to cover generics, in the third he admits that he doesn't really use Java much anymore... Despite that, it's well-written and still a good book.
The Linux Programming Interface by Michael Kerrisk
Covers some of the history of the Linux/Unix API, describes it in detail, has plenty of examples, compares different APIs that do similar things so you can make an informed choice (e.g. System V vs. POSIX message queues).
If any book in this list stands out for me, it's probably this one. It might be partly due to the surprise factor of how enjoyable and well-written a 1000+ page, near-reference book is.
Programming in Haskell by Graham Hutton
An intro to the language and how to approach problem solving from a functional P.O.V. Not as comprehensive as some other intros to Haskell, but Hutton is a good writer and educator, making it a good read.
Designing Data-Intensive Applications by Martin Kleppmann
Provides an overview of a number of topics related to databases, distributed systems, consensus, etc. Lots of references (many of them online) if you like that in a book. Enjoyable to read.
Parallel and Concurrent Programming in Haskell by Simon Marlow
Probably a must-read if you're into Haskell; probably too esoteric if you're not... Well written.
Type-Driven Development with Idris by Edwin Brady
Describes a programming language similar to Haskell, but strict by default and with dependent types designed-in from the start. Also describes techniques for leveraging the type system to construct functions (the type-driven part of the title). Well written.
Hacker's Delight by Henry S. Warren
Low-level bit twiddling. 'Nuff said.
wavesandwindonDec 8, 2017
chana_masalaonJuly 19, 2021
I am currently listening to Designing Data Intensive Applications and it's phenomenally done - the author clearly worked with the narrator to adapt the content to audio format, and the narrator seems to have experience or familiarity with the subject because he pronounces the technical jargon very naturally.
I hope to find other software related audiobooks as good as DDIA is.
pvarangotonNov 14, 2018
The work is never "completely done" because most of the known or reliable solutions involve choosing tradeoffs with scalability, speed or data locality, so you can always go bespoke to optimize for the current business needs, and when they change you may need to change protocols or algorithms again.
chw9eonJuly 22, 2018
[1] https://www.coursera.org/learn/programming-languages
[2] https://www.amazon.com/Designing-Data-Intensive-Applications...
[3] https://developer.apple.com/videos/play/wwdc2018/223/
trenningonDec 16, 2019
I just finished reading Siddhartha, which is a really short book, but I'd like to read more that are similar to this, any suggestions?
I see Designing Data-Intensive Applications quite a bit in this thread, might have a go at that one too.
Currently I'm reading https://en.wikipedia.org/wiki/The_History_of_the_Standard_Oi... which is fascinating!
evanrichonAug 3, 2021
We used Kafka for event-driven micro services quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within the topic. Same as you would expect from any other public-facing API. We also didn't allow multiplexing the topic with multiple schemas. This wasn’t just because it made my life easier. A large portion of the topics we had went on to become analytical tables in Hive. Breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old. This puts a lot of onus on the producer, so we tried to make tools to help. We had a central schema registry with the topics the schemas paired to that showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice though, we never got much pushback on the no-breaking changes rule.
DLQ practices were decided by teams based on need, too many things there to consider to make blanket rules. When in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have over-written this event? Are you paying for some API that your auto-retry churning away in your DLQ is going to cost you a ton of money? Sometimes you may not even want a DLQ, you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.
I hope one of the books you are talking about is Designing Data Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.
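For anyone wondering what a "no breaking changes" rule might look like in code, here is a deliberately simplified sketch. It is not the tooling described above and not the actual compatibility rules of Avro or any schema registry. The schema format (a dict of field name to spec) and the three rules are assumptions made for illustration:

```python
def breaking_changes(old_schema, new_schema):
    """Return a list of reasons why new_schema would break existing consumers.

    Assumed rules (illustrative only): fields may not be removed, field
    types may not change, and any newly added field must carry a default.
    """
    problems = []
    for name, spec in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed field: {name}")
        elif new_schema[name]["type"] != spec["type"]:
            problems.append(f"type change: {name}")
    for name, spec in new_schema.items():
        if name not in old_schema and "default" not in spec:
            problems.append(f"new field without default: {name}")
    return problems


# Hypothetical topic schema and two proposed revisions.
current = {"trip_id": {"type": "string"}, "fare": {"type": "double"}}
compatible = {
    "trip_id": {"type": "string"},
    "fare": {"type": "double"},
    "currency": {"type": "string", "default": "USD"},
}
incompatible = {"trip_id": {"type": "string"}, "tip": {"type": "double"}}

print(breaking_changes(current, compatible))
print(breaking_changes(current, incompatible))
```

A check like this would typically run when a producer proposes a new schema version, and a non-empty result would be the signal to create a new topic instead.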
macintuxonJune 21, 2021
- Release It! (https://pragprog.com/titles/mnee2/release-it-second-edition/)
- Designing Data-Intensive Applications (https://dataintensive.net/)
I would suggest finding an open source project of interest and taking a deep dive into its code and documentation to understand how it works and why it was built that way.
Which reminds me, this should help with that: The Architecture of Open Source Applications (http://www.aosabook.org/en/index.html)
kakwa_onMay 12, 2020
Think, for example, of the choice of programming language or the DB technology used.
Architecture must strike the correct balance between vague guidelines and over-specification upfront. It should provide a framework which ensures homogeneity between components without restricting the teams/developers excessively.
On this subject, I found this presentation (by Stefan Tilkov): https://www.youtube.com/watch?v=PzEox3szeRc
Also, I'm currently reading "Designing Data-Intensive Applications", which is quite interesting and full of insight on the architecture trade-offs for data management and querying.
lioetersonOct 9, 2017
As with another poster, Edward Tufte's books came to mind - though it's about visual presentation of information, not user interface/experience design.
I've also felt that there's an unmet demand for books that provide a thorough overview of UI/UX design patterns, especially the way this book (Designing Data-Intensive Applications) does for its domain.
rashidujangonMar 31, 2021
abledononJuly 15, 2021
Do you watch videos like these on your employers time? And then do you have like 5-6 other colleagues also watch the video? or perhaps book a meeting room (virtually to all watch the talk, make popcorn, etc..)
and then once it's done, do you guys have a day to come back around and discuss its concepts, like a reading group (what if stuff is burning, PRs, SRE tickets, milestones? is time allocated for this stuff? is it lowest of priority to keep abreast of this knowledge) . I started reading his DDIA book and I can't fathom solidifying all those concepts without discussing them thoroughly with other engineers over at least a month or so.
Basically, curious how orgs 'ingest' these large ideas into their software eng knowledge practices, are there initiatives? etc... or are you just banking on some engineer to study this stuff on their own time at home.
roundthecorneronSep 2, 2019
This could be followed by database internals books like "Expert Oracle Database Architecture" and "Pro SQL Server Internals", irrespective of whether you use these databases or not, you will learn a ton of stuff about database internals and system design.
If you are interested in database schema design etc, then look up "The Database Model Resource Book" Volumes 1,2,3.
None of the above are quick reads but they are solid foundations for becoming a well rounded DBMS professional.
jefe_onJune 29, 2017
spamizbadonJan 6, 2021
There's also some fear on the "poseur" factor that people worry about. If candidates can simply rote-learn their way into senior roles I wouldn't really feel comfortable having them as team members.
With that said, I do like the advice of reading Programming Pearls and Designing Data-Intensive Applications -- two books useful for the job at hand, and just your general knowledge of how systems work and how to approach problems. Leetcode, by contrast, is empty calories.
timtimtimionApr 7, 2020
It's largely focused on distributed systems and databases for now, but that's subject to change.
I have some deep dive posts like this: https://timilearning.com/posts/data-storage-on-disk/part-two... - where I dig deeper into a particular topic, in this case: how databases work.
I also have posts like this one: https://timilearning.com/posts/ddia/part-two/chapter-9-2/, where I just share the notes I took while reading a book or watching a video. I've posted my notes from the first 9 chapters of 'Designing Data-Intensive Applications' by Martin Kleppmann there.
My goal is mainly to think more clearly about the things I learn by writing about them, and then share that knowledge with whoever finds the topics interesting.
codingbearonJan 31, 2020
For some people, S/W arch is writing readable, maintainable code. Things like Design patterns, FP, TDD, microservices etc. There is a lot of literature on this out there.
For others, it means having the ability to design the next Kafka/Spark/React. You can get basic theory for this by reading books on Domain Modelling, Distributed computing and Algorithms. So books like The Algorithm design manual, Designing Data intensive Applications, The Parallel and Concurrent Programming in Haskell, Functional and reactive domain modelling etc. The http://aosabook.org has good case studies to read as well. However, actually building these systems requires facing the problem in the 1st place and being unable to use existing systems to solve it. Or doing a PhD in them. It happens rarely.
Finally, the last one is my day job. Which is to convert ramblings and fantasies of leadership into production systems, minimizing the number of curse words people use when working on it. I haven't really found any good guides to do this though. Things which help me are:
- Always thinking what could go wrong. And if it does, who should be notified if the system can't recover. A lot of times when I don't have the answer, I ask around. Things like slack channels, mailing lists, or even having coffee with people in industry who have tackled stuff like this.
- Communication skills. This doesn't mean small talk, but being able to have conversations and meetings which help define requirements and ensure everyone is on the same page. Also making sure there are hard numbers. ie. instead of "fast","responsive" etc, get latency, throughput, uptime numbers.
- Understanding business/technical capabilities and limitations. Things like business impact(LTR etc), capabilities of current infrastructure, skill levels of various people/contractors involved etc
ScarblaconDec 16, 2019
Designing Data Intensive Applications.
Some books on leadership from the recent HN discussion, not decided which yet.
Death's End (book 3 of The Three Body Problem). The first two were really good.
The Algorithm Design Manual. Domain Driven Design.
Some chess books. Some general science and history. The yearly random self help book.
If I manage all that plus whatever I'll decide I want in the actual year, it will be a good year for reading, but maybe I need to have some more focus. We'll see.
joshvmonAug 24, 2018
There are books which are tangentially useful, eg Designing Data Intensive Applications or Site Reliability Engineering. Even if you're not going for SRE, it's good to understand the problems that are involved with high availability.
Having a good overview of something like Code Complete is useful, if only because it has generic advice for designing large programs.
For case studies I don't think books are any good. Watch conference talks and read the company dev blogs.
rochakonFeb 19, 2020
cloverichonAug 6, 2021
You can't design it in full in one go, but you can design it and then incrementally update said design. Sadly many (companies) do not. But you can define the problem(s), the scope, the scale, and then design a solution appropriately to meet those needs (for a defined period of time). That's what distinguishes software engineering from hacking. They both have their place. Many companies claim to do the former but are mostly doing the latter. Software is still early in its life and as various kinds of system designs stabilize, so will the formalizations around what it means to be a software developer. Reading a book like Designing Data Intensive Applications you can't help but see those formalized topics budding.
ipnononApr 30, 2021
Formal study seems to work best after real experience. I read Martin Kleppmann's Designing Data-Intensive Applications based on its inclusion in teachyourselfcs.com.[0] I did not find it useful because I had nothing to apply it to once I finished. However I don't think this will apply to you as it seems you already have some problems in mind to consider.
[0] https://teachyourselfcs.com/#distributed-systems
JoerionJuly 1, 2018
- Tell stories that touch on the facts in passing. It doesn't matter whether they're stories about the founders in a field, about your personal experiences, about a fictional startup building a database, etc... As long as it's wrapped in a story I will probably find it interesting.
- Start with why. Before explaining solutions, explain the problem those solutions solve.
- Gradual build-up of a system, instead of a serial description of its parts. The chapter in DDIA about LSM databases is one of the most engaging technical chapters I've ever read because it starts with a 2 line shell script and evolves it until it is Google's Bigtable.
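If memory serves, the starting point in that chapter is an append-only file where a write appends a `key,value` line and a read scans for the last line matching the key. The following is a Python rendering of that idea (my own paraphrase, not the book's code; the file name and sample values are illustrative):

```python
import os
import tempfile


def db_set(path, key, value):
    # Writes only ever append, so they are cheap and never overwrite data.
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{key},{value}\n")


def db_get(path, key):
    # Reads scan the whole log; the last matching line wins.
    result = None
    with open(path, encoding="utf-8") as log:
        for line in log:
            stored_key, _, stored_value = line.rstrip("\n").partition(",")
            if stored_key == key:
                result = stored_value
    return result


path = os.path.join(tempfile.mkdtemp(), "database")
db_set(path, "42", "San Francisco")
db_set(path, "42", "Oakland")
print(db_get(path, "42"))
```

Writes here are about as fast as they can be, while reads get slower as the log grows. Fixing that (indexes, compaction, sorted segments) is, as far as I remember, the path the chapter takes toward LSM-style storage.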
ingvulonMay 1, 2021
- it takes a while to read DDIA. Probably around 6 months of focused reading. Perhaps more
- one can learn a really good chunk of theoretical stuff... but probably not applicable to day to day work
- zero practical experience will be gained regarding Kubernetes, Spark, Kafka, EMR, Redis
So, I would recommend a more practical approach:
- start already reading the documentation of K8s, Kafka, Spark, etc. Choose one and go for it. I would recommend Kafka since its documentation is well written
- while reading documentation of the tooling above, one will inevitably stumble upon theoretical stuff that will not be explained in detail: that's exactly when you pick up DDIA (or similar books) and try to find the topic in the index and read it.
wenconNov 1, 2017
Is this architecture common? Well, I suspect it is overkill for most smaller organizations due to increased infrastructure complexity. I wouldn't do this just for the sake of doing it -- you may find yourself saddled with an increased maintenance workload just keeping the infrastructure running.
But if you truly have this use case, this is a well-known method for syncing data across various types of datastores (ie. so-called polyglot persistence).
[0] https://www.confluent.io/blog/bottled-water-real-time-integr...
akashtndnonAug 8, 2019
Graph theory had fascinated me as a student. On my first job, I'd briefly worked with Neo4j, a graph database, as part of a proof-of-concept project. At my current gig, I've had the opportunity to delve deep into the world of graph tech, especially databases, over the last one year.
Graph-like data models have been around since forever but their mainstream promise is relatively new. A resource which helped me understand the historical as well as fundamental aspects when starting out was the amazing book, "Designing Data Intensive Applications" by Martin Kleppmann. There also exist various academic and industry resources around graph tech. But piecing them together to get a holistic picture to evaluate potential use-cases has been an arduous process, to say the least.
Hence, I wrote this introductory piece to help anyone interested get started. I'd given a talk on the same topic at PyCon Italy (https://www.youtube.com/watch?v=t0Ra8G8gD-w). I plan to write more on related topics.
jgwil2onJune 21, 2020
Computer Architecture: added Computer Systems: A Programmer's Perspective as first recommendation over nand2tetris.
Compilers: Crafting Interpreters added as first recommendation over dragon book.
Distributed Systems: added Designing Data-Intensive Applications as first recommendation over Distributed Systems.
Online availability of some video lectures has changed as well.
sanderjdonMar 24, 2020
blain_the_trainonJuly 25, 2017
http://dataintensive.net/
pqbonAug 25, 2020
[0]: https://www.amazon.com/Designing-Data-Intensive-Applications...
Psst, "Designing Data Intensive Applications" was very good read. Do you know similar books that focus on distributed systems?
nilknonJan 6, 2021
For system design, though, it actually is worth reading through "Designing Data-Intensive Applications."
nindalfonJune 21, 2021
- One system in isolation - Operating Systems: Three Easy Pieces. Covers persistence, virtualisation and concurrency. This book is available for free at https://pages.cs.wisc.edu/~remzi/OSTEP/
- Multiple systems, and how data flows through them - Designing Data Intensive Applications. Covers the low level details of how databases persist data to disk and how multiple nodes coordinate with each other. If you’ve heard of the “CAP theorem”, this is the source to learn it from. Worth every penny.
More on why these two books are worth reading at https://teachyourselfcs.com
flak48onSep 13, 2020
Even if you are not working on a product that requires supporting millions or billions of reads / writes, knowing what is overkill and what isn't, for your project's use case is still useful
eddyerburghonJune 21, 2020
When I started learning, the recommended textbook was Distributed systems: Principles and Paradigms (the new recommendation is Designing Data-Intensive Applications) and the recommended course was MIT 6.824.
6.824 is the first online course that I have completed fully and it was well worth it. I read, took notes, and summarized all required course readings (around 20 papers) and completed the labs, which involved creating a raft-based key-value store. The labs were especially useful because they forced me to really understand the details of the topics I had learned (e.g., MapReduce, Raft, Raft’s log compaction) in order for my code to pass all the tests.
I’m very happy with the results of following the course and I now plan to put the same amount of effort into other topics.
maksutonJuly 4, 2018
See: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-a...
His book "Designing Data-Intensive Applications" has a section on this.
MulticomponJuly 7, 2021
I have a sense that this space is somewhat unexplored. Text-based game world simulation is a relatively underdocumented (to my eyes) form of the 'game UI overtop a database manipulated with game logic rules' type of games, of which Simulation games are at the complex end of.
There are things like Twine and Inky that offer variables and conditionals to prewritten bodies of text, but doing composable texts worlds that change their state based on the accumulated choices of players over the course of their time seems to be a complex feature to build and extend, whether in Twine or another tool. Dialog simulation systems that remember what options you've done and give you additional options or changes over the course of the game are sold as products online. Heck, someone recently patented a 'grudge' system that a popular game (League of Legends?) used.
Or maybe I've just been looking too closely at it. I've been working slowly for about the past 2 years on an automation system for a tabletop RPG (non D-20 system) to speed up battle generation & resolution, trying to incorporate all of the various rules that say 'in X scenario, if Y conditions are met, gather this information from the user, then apply its Z effect like so, but also let the DM / user change any of the above or ignore the entire thing before you do so', so while I've ordered Designing Data Intensive Applications in hopes of gaining more insights, this problem certainly seems like a big thing to chew on from my self-taught programmer's POV right now.
machinehermiteronJuly 19, 2021
Deep Learning with Python by François Chollet I think works as an audiobook as well.
I am a big non-fiction audio book fan and so much depends on the voice actor. A bad read can ruin the best content while Robertson Dean made Alan Greenspan's The Age of Turbulence into an enthralling adventure story.
karankeonJune 12, 2019
You can use different technologies based on your use case, but you probably need all the pieces outlined above. As someone else has mentioned, if you're looking for trade-offs between different technologies, I'd recommend "Designing Data-Intensive Applications" by Kleppmann.
tracer4201onFeb 10, 2019
https://www.amazon.com/Designing-Data-Intensive-Applications...
I read through this book last year when I saw it recommended on HN. I recommended it to engineers on my team at work.
I’m reading it for a second time now, and just finished chapter 2 today. It’s dense but an amazingly detailed and thorough text.
ludwigvanonNov 28, 2014
- Nathan Marz' (of Backtype/Twitter/Storm fame) Big Data (Manning): Don't let the name Big Data make you feel this is only for people with big data needs, in fact Nathan Marz tries to rethink how we store data. Immutable, append-only master data sets and views derived from them.
http://manning.com/marz/
- Martin Kleppmann's Designing Data-Intensive Applications (O'Reilly): http://dataintensive.net/ in beta. "This book will help you navigate the diverse and fast-changing landscape of technologies for storing and processing data. We compare a broad variety of tools and approaches." Martin has been researching for the past year for the book and aims to have a timeless book on the subject.
- Colin Jones' Mastering Clojure Macros: Write Cleaner, Faster, Smarter Code (Pragmatic Bookshelf): This short but dense book makes macros understandable. Recommended if you want to learn what makes Lisp so powerful. https://pragprog.com/book/cjclojure/mastering-clojure-macros
zaphod12 on Sep 29, 2020
I did a reading group at work a couple of years ago (for Designing Data Intensive Applications as it turns out!), and half the group looked at me like I had 3 heads when I offered to buy them hard copies, and the other half was insulted if I didn't offer them hard copies.
TheCowboy on Dec 8, 2017
Designing Data-Intensive Applications by Kleppmann has been helpful. It doesn't cover every framework, but I think it helps explain where a lot of the pieces fit together and when you might want to use some of them.
I've also found it useful to find podcasts that explain specific projects, such as Apache Kafka, and listen to them when I'm running.
teej on Sep 6, 2017
sentientforest on Dec 18, 2019
Books/Authors I've read that I would recommend:
varunsaini on Sep 13, 2020
mindcrime on Dec 16, 2019
But to name ones that I very specifically want to read/finish sooner than later... hmm... there are a number of books that fall more into the realms of history / anthropology / etc., that I have been meaning to read. Books like Guns, Germs, and Steel, and Sapiens - things of that nature. One of those that I'm already on, but probably won't finish before Jan 1, is Human Universals by Donald Brown.
I also want to get through some books on writing/reading mathematical proofs. Mathematical Reasoning: Writing and Proof by Ted Sundstrom, or The Book of Proof by Richard Hammack.
Another one I hope to get through is Designing Data-Intensive Applications.
physicles on Jan 17, 2021
Or as Yoda would say, Schema read or schema write, there is no “no schema”.
sciurus on July 10, 2020
You don't need a CS degree to work in this field (I don't have one either!) but there are fundamental concepts you need to understand in order to make informed decisions when designing distributed systems.