Let's start with a topic everyone knows, loves, and always talks about in a respectable friendly manner: blockchain.
A blockchain is a distributed system running on decentralized hardware, where every node connects to the others and follows a single shared protocol. Each node in the blockchain network holds the exact same database, as guaranteed by the consensus protocol the nodes follow.
That is, until there's a bug in the code. [Bitcoin has a storied history of software failures invalidating its data](https://blog.bitmex.com/bitcoins-consensus-forks/). For all the talk of blockchain being a decentralized system, there's little talk<sup>1</sup> about how each network's software is, with minor exceptions, centralized and a single point of failure.
I constantly see this blind spot when doing systems analysis with my peers. When I was in college, we talked about single points of failure in a distributed systems class. My professor drew a mesh network on the board and asked if there was a single point of failure. Thinking he was asking a trick question, I answered "the TCP implementation the nodes are communicating with." The "correct" answer was that there were no single points of failure, though the professor said I had brought up an interesting point he hadn't considered.
Recently I had a systems design interview where I was designing services to solve a mock problem. The interviewer asked: "What happens if service A fails?" I started designing a second service to handle the failure, but was stopped and asked to fix the issue within service A itself. After a little back and forth I eventually asked, "do you want a solution for if the *hardware* in service A fails?" and the interviewer responded "yes." They wanted to see if I knew how to build redundant cloud architecture, but in doing so they assumed that services primarily fail through hardware.
I think this blind spot comes up because hardware points of failure are less abstract and easier to deal with than software points of failure. They usually involve some form of managing redundant components, and it's easy to visualize two copies of the same computer, switch, network card, or whatever else. But how would you make redundant software? Hire three teams to write three different sets of code that solve the same problem?
The most common way to address software failures is to eliminate as many as possible before anything is deployed to production. In other words, software failures are addressed with QA. However, QA is fallible, and I think more thought needs to be put into what happens when software fails. So here are some of the ways I have seen and handled software failures in my development experience:
- **Recording system state**. This is usually in the form of logging but can also take other forms, such as timestamping database records. This is very useful for investigating a software failure but does nothing to address the failure at runtime.
- **Code restarting itself when it fails**. This tends to handle hardware failures, but I have also seen it used to paper over software issues. That being said, if you can fix a software failure with a restart, your code is nondeterministic, and that underlying issue needs to be fixed.
- **Defensive coding**. Explicitly handling errors, zeroing memory before using it, and so forth. Many programming languages have or are introducing features like automatic memory management and static analysis that force defensive coding, so people are doing it more often without realizing it. Although defensive coding can handle failures at the application level, it is less helpful at the systems level. If two services composed of TypeScript programs compile successfully, each will be very defensive in isolation, but nothing checks whether they are communicating with each other correctly.
- **Stateful batch operations**. In my line of work I am often gathering and processing large datasets, where batch operations can take 12 to 24 hours to run. The scariest thing that can happen during these runs is a software failure in the 23rd hour of a job that then has to start over from the beginning. So I place a high value on large batch operations being stateful. More specifically, I expect all my operations to be saved in a way that lets them be immediately resumed after a software failure.
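The restart pattern above can be sketched as a small supervisor loop. This is a minimal illustration, not code from any real system; the `flaky_task` function, restart budget, and backoff values are all hypothetical. Note that it masks a nondeterministic failure rather than fixing it, which is exactly the problem described above.

```python
import time

def supervise(task, max_restarts=3, backoff_seconds=0.01):
    """Re-run `task` until it succeeds or the restart budget is spent.

    This masks nondeterministic failures; it does not fix the
    underlying bug, which still needs to be found and removed.
    """
    for attempt in range(max_restarts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_restarts:
                raise  # budget exhausted: surface the failure
            time.sleep(backoff_seconds * 2 ** attempt)  # exponential backoff

# Hypothetical flaky task: fails twice, then succeeds.
calls = {"n": 0}
def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = supervise(flaky_task)  # succeeds on the third attempt
```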
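The systems-level gap in defensive coding can be illustrated with boundary validation: since type-checking each service in isolation can't catch a peer sending the wrong shape of message, the receiving side has to check explicitly. This is a hedged sketch in Python rather than TypeScript, and the `user_id` field is a made-up example.

```python
def parse_payload(raw):
    """Defensively validate a message received from another service.

    Each service may be internally well-typed, but nothing guarantees
    the sender and receiver agree on the message shape, so the
    boundary checks everything it depends on.
    """
    if not isinstance(raw, dict):
        raise ValueError("payload must be an object")
    user_id = raw.get("user_id")
    if not isinstance(user_id, int) or user_id < 0:
        raise ValueError("user_id must be a non-negative integer")
    return {"user_id": user_id}
```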
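The stateful-batch idea can be sketched with a simple checkpoint file, assuming a job that processes records in order; the checkpoint path and record format here are invented for illustration.

```python
import json
import os

CHECKPOINT = "batch.checkpoint.json"  # hypothetical checkpoint path

def load_checkpoint():
    # Resume from the last saved index, or start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index):
    # Write to a temp file, then atomically rename, so a crash
    # mid-write can't leave a corrupt checkpoint behind.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, CHECKPOINT)

def run_batch(records, process):
    start = load_checkpoint()
    for i in range(start, len(records)):
        process(records[i])
        save_checkpoint(i + 1)  # persist progress after every record
    os.remove(CHECKPOINT)  # job finished cleanly; next run starts fresh
```

Checkpointing after every record keeps the sketch simple; a real 24-hour job would checkpoint per batch to keep the I/O cost proportionate.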
This list is tepid in my opinion. Anyone who's ever had to dig through logs to troubleshoot a system knows it's not fun and often relies on intuition and guesswork. I don't think I'll ever work for another place that fixes software issues with restarts, because it's a horrible solution. Defensive coding doesn't solve systemic failures. Stateful code is complicated code, and it's hard to write and manage. These aren't good solutions. But they're a place to start, and I hope to write a follow-up post with better solutions as my career and the software field as a whole grow and develop.
<sub>1 Little talk isn't zero talk. See https://medium.com/@VitalikButerin/the-meaning-of-decentralization-a0c92b76a274</sub>
<font size=1>Published 2024-05-24</font>