Chapter 1: Reliable, Scalable, and Maintainable Applications

Chapter 1 focuses on three key topics in distributed systems: reliability, scalability, and maintainability. The chapter defines each topic, provides some best practices for achieving it, and gives real-world examples of how it applies to distributed systems. While reading this chapter, I found myself relating a lot of what I read back to my experience over the last five years building and maintaining the cloud and distributed trading system at Belvedere.

Reliability is defined as the ability of a system to keep working correctly even in the face of adversity. Correctly means performing the expected function at the desired level of performance. Reliability is a core tenet of distributed systems and something I am intimately familiar with. There are entire careers dedicated to ensuring the reliability of software systems, and it requires some very interesting engineering. One of the main threats to the reliability of a system is “faults”. Faults are just things that can go wrong. When designing a reliable system, it is imperative to anticipate faults and architect your system to handle them gracefully. This is called fault tolerance. Notably, the book mentions that a fault is not a failure. A failure is when the system as a whole stops providing the required service to the user. A fault, on the other hand, is one component deviating from its spec: the service may still be available, but it might be degraded or even providing incomplete (or incorrect) data.
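To make the fault versus failure distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration (the `fetch_with_fallback` helper, the flaky price feed, the cached price); it is one simple way to absorb a fault, not a pattern taken from the book. The primary source has a fault, but the caller still gets an answer that is clearly marked as degraded, so the system as a whole does not fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    value: float
    degraded: bool  # True when a fault was absorbed and a fallback was used


def fetch_with_fallback(primary: Callable[[], float],
                        fallback: Callable[[], float],
                        retries: int = 2) -> Result:
    """Try the primary source a few times, then fall back to a secondary one."""
    for _ in range(retries):
        try:
            return Result(primary(), degraded=False)
        except ConnectionError:
            continue  # a fault: one component misbehaved, so try again
    # The primary is still faulty; serve a degraded answer instead of failing.
    return Result(fallback(), degraded=True)


# Hypothetical dependencies, for illustration only.
def flaky_primary() -> float:
    raise ConnectionError("primary price feed unavailable")


def cached_price() -> float:
    return 101.5


result = fetch_with_fallback(flaky_primary, cached_price)
print(result)
```

The `degraded` flag matters as much as the fallback itself: the caller can decide whether a stale or partial answer is acceptable for its use case.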

During my time at Belvedere, I spent two years as a pseudo Site Reliability Engineer on our hedging and autofitting products. Our hedging services monitored internal order flow and risk reports and automatically sent orders to an exchange to hedge the positions we built up throughout the day. This system was designed well before I joined, but one key feature of its design caused a lot of issues (and also prevented some). For our traders to run the options execution services that placed making/taking options orders, the hedging services had to be up and running. The options trading services were designed to require a TCP connection to the hedging services, and if that connection was broken, the options services would shut off. This was an intentional choice: if we can’t hedge, then we shouldn’t be trading options. However, it brought a lot of challenges from my perspective on the hedging side. The exchanges we sent our hedge orders to have exchange-side safety settings, and the hedging services were not designed to tolerate them. When the exchange indicated we had hit one of its safety settings, our hedging services would shut off. As a result, the hedging services sometimes shut off in the middle of the trading day, which caused our downstream options execution services to shut off as well. That meant downtime for our traders, time out of the market, opportunity cost, and manual intervention on both the hedging and options sides. It also led to some interesting conversations between engineering and trading. We built fault tolerance around some of the exchange-side safety settings that were causing issues, but we also determined that shutting everything off was the right move for others. I bring this up because it was a real-world example of a production issue related to reliability and fault tolerance. However, it also shows that each scenario is unique and that fault tolerance may not be the best choice for every possible issue.
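The idea that some faults should be tolerated while others should deliberately stop the system can be written down as an explicit policy. The Python sketch below is purely illustrative: the safety-setting names and the `decide` function are invented for this example and do not reflect any real exchange's settings or the actual hedging services described above.

```python
from enum import Enum


class Action(Enum):
    BACK_OFF = "back_off"    # absorb the fault and keep the service running
    SHUT_DOWN = "shut_down"  # stop on purpose and wait for a human


# Hypothetical safety-setting names; real exchanges define their own.
POLICY = {
    "MESSAGE_RATE_LIMIT": Action.BACK_OFF,
    "MAX_POSITION_BREACH": Action.SHUT_DOWN,
}


def decide(safety_event: str) -> Action:
    """Return the configured action, shutting down on anything unrecognized."""
    return POLICY.get(safety_event, Action.SHUT_DOWN)


print(decide("MESSAGE_RATE_LIMIT"))
print(decide("SOMETHING_NEW"))
```

Defaulting to `SHUT_DOWN` for unknown events is a design choice in this sketch: an unanticipated fault is treated as unsafe until someone has decided otherwise.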

Scalability describes how well a system can handle an increase in load, whether that load is data volume, traffic volume, or complexity. Scalability is often associated with either adding more instances of a service to handle more requests (horizontal scaling) or increasing the compute resources available to a service (vertical scaling). I most often associate scalability with Kubernetes. Kubernetes has become the de facto technology for scalable workloads, and it automates much of what scaling infrastructure requires. Designing a scalable system requires you to handle a few prerequisites before you can begin:

  • You must be able to describe the current load on the system via metrics. Without that data you cannot realistically plan for scalability.
  • If the system grows in a particular way, what are our options for coping with the growth?
  • How can we add computing resources to handle the additional load?
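As a small illustration of describing a system with metrics, the Python sketch below summarizes response times with percentiles, which is one common way to do it. The latency samples are made up, and `percentile` is a hypothetical helper using the simple nearest-rank method; a real system would pull these numbers from its monitoring stack.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up response times in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 900]

print("mean:", mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

With this sample the median is 13 ms while the 95th percentile is 900 ms. The mean of 125.6 ms describes neither the typical request nor the slow ones, which is why percentiles are usually the more useful summary.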

Once you have the data to measure the performance of your system and you have answered the other two questions, you can actually design how your system will scale. I thought the book made a really interesting point about scalability. It mentions that decisions around scaling are based on assumptions about which operations will be common and which will be rare. If these assumptions are wrong, then the engineering work used to implement the scaling becomes wasted effort. This stood out to me because of how it relates to early-stage startups. Since I am joining an early-stage startup next week, this is a point I would like to take with me and keep in the back of my mind. The book mentions that it is usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load right away. This makes sense to me. The primary goal of a startup is to start making money quickly. If I can design a system that will work for our initial projected load in a week, versus a solution that scales to 100x our initial load in three months, then it is probably a better use of my time to choose the former. Coming from Belvedere, I think I already have a bit of that mentality. There was a big focus on quickly deploying systems that worked so that we could capitalize on opportunities in the market now. However, I also had to deal with a lot of legacy systems that followed this approach and were never re-architected to meet a larger load, and this caused a lot of headaches. I can see the need and argument for both.

Lastly, maintainability is about making life better for the engineering and ops teams who need to work with the system. I have deep experience on the operations side of software systems, so designing something that is easy to maintain is something I really care about. Thinking about the supportability of a system during the design phase is important, and I’ve learned first-hand what unsupportable software is like. The book mentions Netflix’s Chaos Monkey, and chaos testing is something I learned from my last manager at Belvedere. While working with them, I was tasked with chaos testing all the systems I worked on, and I am a big fan of it. I learned so much and caught many issues (or faults) while still in the testing phase. Because of chaos testing, I was able to roll out more reliable and maintainable systems to production the first time around. I was not aware chaos testing was associated with Netflix, so it was cool to see it in this book.
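For readers who have not seen chaos testing, the core idea is to inject faults on purpose and check that the system copes. The Python sketch below is a toy version under stated assumptions: `FlakyDependency` and `call_with_retries` are hypothetical names invented here, and this is not how Chaos Monkey itself works (it terminates real instances in a running environment). The random generator is seeded so a test run is repeatable.

```python
import random


class FlakyDependency:
    """Wraps a callable and injects failures at a configurable rate."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a run is repeatable

    def __call__(self, *args):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args)


def call_with_retries(dep, arg, attempts=5):
    """The code under test: it should survive occasional injected faults."""
    last_error = None
    for _ in range(attempts):
        try:
            return dep(arg)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("dependency unavailable") from last_error


# Inject faults into 30% of calls and tally how the caller copes.
chaotic_double = FlakyDependency(lambda x: x * 2, failure_rate=0.3)
survived, gave_up = 0, 0
for n in range(100):
    try:
        call_with_retries(chaotic_double, n)
        survived += 1
    except RuntimeError:
        gave_up += 1
print(f"survived={survived} gave_up={gave_up}")
```

Raising `failure_rate` or lowering `attempts` shows how quickly the retry logic stops being enough, which is the kind of question chaos testing is meant to answer before production does.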

I will end this post with a section I really enjoyed from the chapter on operations teams. Operations teams are vital to keeping a software system running smoothly. A good operations team is typically responsible for the following, and more:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the cause of problems, such as system failures or degraded performance
  • Keeping software and platforms up to date, including security patches
  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
  • Anticipating future problems and solving them before they occur (e.g. capacity planning)
  • Establishing good practices and tools for deployment, configuration management, and more
  • Performing complex maintenance tasks, such as moving an application from one platform to another
  • Maintaining the security of the system as configuration changes are made
  • Defining processes that make operations predictable and help keep the production environment stable
  • Preserving the organization’s knowledge about the system

Operations teams do a lot to keep software systems afloat, and they learn a lot about how those systems perform in production! Please always design with your ops teams in mind and work with them. They often have a unique perspective and might be able to provide insight during the design phase of your system.