Software architecture is a critical part of software engineering, and one that is easy to get wrong. Over-engineering and complexity creep in quickly when left unchecked. When it comes to software design, boredom and simplicity are some of the strongest tools in our shed to solve problems.
It is good practice in the industry to write design documents. They help readers understand the decisions that were made and the context in which they were made. With that in mind, I thought it would be interesting to share some of the design thinking I put into one of my latest products.
I recently launched a new app named WatchPager. WatchPager is an Apple Watch app made for PagerDuty users. PagerDuty is a product that manages software incidents and notifies engineers so they can fix production issues. WatchPager's purpose is to let users perform all their on-call tasks right from their wrist, and to improve the reliability of notification delivery. The app targets PagerDuty and Apple Watch power users, as well as people practising outdoor activities during their on-call shifts.
Third Party Interactions
Identifying the design ambitions
Unfortunately, software does not last forever. Unlike some wines, it does not age very well. As a consequence, we should always keep in mind that software is written to solve a problem within a given context and time frame. When I wrote WatchPager, I designed it to serve hundreds of users. This assumption allowed me to clarify some known unknowns and to make compromises:
- I expect WatchPager to serve in the range of 100 queries per second at peak time. Given the purpose of the application and the near-monopolistic nature of modern cloud computing, incidents are likely to occur in batches: an outage at a major provider pages many users at once.
- The relatively small scale at which the service operates means we are not constrained in our choice of technologies.
- With a relatively small number of users, it is acceptable to have more verbose observability: more logs and more metrics help us understand the application better without much additional cost.
Optimising for a metric
I believe users who install the app would not tolerate missing a notification on their watch, regardless of the cause. Therefore, optimising for reliability is a logical decision. With that in mind, here are the principles I followed.
An application with fewer components is more reliable
I believe that keeping an application simple is a great way to improve its reliability. The simplest form of architecture I can think of is a web server serving an API with JSON over HTTP, storing data in a SQL database. The app needs some sort of durable queue to provide at-least-once delivery guarantees for push notifications. This queue can be delegated to an external application, such as RabbitMQ, or be implemented in a SQL database. I chose the latter, as it allowed me to keep a single stateful component in my backend.
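To make the at-least-once guarantee concrete, here is a minimal in-memory sketch of the idea. The Queue type stands in for a SQL table, and every name in it (Notification, ProcessOnce, flakySender) is hypothetical, not taken from the WatchPager code base. The key point is that a row is only marked done after the send succeeded, so a failure leads to a retry rather than a lost notification.

```go
package main

import (
	"errors"
	"fmt"
)

// Notification is a hypothetical queued push notification.
type Notification struct {
	ID      int
	Payload string
	Done    bool // set only after a successful send
	Tries   int  // number of delivery attempts so far
}

// Queue stands in for a SQL table of notifications (assumption:
// the real queue is durable, this one lives in memory).
type Queue struct {
	rows []*Notification
}

// Enqueue appends a new pending notification.
func (q *Queue) Enqueue(payload string) {
	q.rows = append(q.rows, &Notification{ID: len(q.rows) + 1, Payload: payload})
}

// Pending returns the notifications that have not been acknowledged yet.
func (q *Queue) Pending() []*Notification {
	var out []*Notification
	for _, n := range q.rows {
		if !n.Done {
			out = append(out, n)
		}
	}
	return out
}

// ProcessOnce tries to send every pending notification once.
// A row is marked done only after send returns nil, so a failed send
// leaves the row pending and a later pass retries it. The price is a
// possible duplicate if the send worked but the acknowledgement was lost.
func (q *Queue) ProcessOnce(send func(Notification) error) {
	for _, n := range q.Pending() {
		n.Tries++
		if err := send(*n); err != nil {
			continue // keep the row pending
		}
		n.Done = true
	}
}

// flakySender simulates a push gateway that fails the first attempt
// for each notification and succeeds afterwards.
func flakySender() func(Notification) error {
	seen := map[int]bool{}
	return func(n Notification) error {
		if !seen[n.ID] {
			seen[n.ID] = true
			return errors.New("push gateway unavailable")
		}
		return nil
	}
}

func main() {
	q := &Queue{}
	q.Enqueue("incident triggered")
	send := flakySender()

	q.ProcessOnce(send)
	fmt.Println("pending after first pass:", len(q.Pending()))
	q.ProcessOnce(send)
	fmt.Println("pending after second pass:", len(q.Pending()))
}
```

Because retries can produce duplicates, the receiving side should tolerate seeing the same notification twice.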
An application with fewer components is more observable
If an application is small, its points of failure are easier to identify. For such a simple application, we know a failure can come from the hardware, from our own software, or from external software we interact with.
An application with fewer components is simpler to move around
Users would not tolerate our application going down at the same time as theirs, yet during a cloud provider incident this is likely to happen. Keeping our application design simple helps if we need to switch datacenters in order to recover our production system. An application with a single SQL database can be moved around and rebooted with limited complexity.
A simpler application is better understood
Finally, we know a simple application is easier to debug and to operate. As engineers, this makes our lives simpler as well.
No matter what we do, things can always go wrong
Even though we aim for high availability, no software application targets 100% availability, with the notable exception of Cloudflare load balancing. In practice, even that 100% availability is not reachable.
Let’s zoom into the WatchPager backend component.
What I came up with:
- A web application written in Go that serves JSON over HTTP, plus a worker that sends notifications.
- A single SQL database, used to store user data and the durable queue for the notification worker.
Other decisions that can be challenged
- The language choice is a matter of preference. Go is a simple and efficient language and it felt suitable for the task.
- PostgreSQL is a modern SQL database. It supports the SELECT FOR UPDATE SKIP LOCKED feature, which is very helpful for implementing the durable queue.
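To show how SKIP LOCKED helps, here is a sketch of a worker step written against Go's database/sql package. The notifications table, its id, payload and sent_at columns, and the ProcessNext function are all assumptions for illustration; the real schema may look quite different. Each worker claims one unsent row inside a transaction, and concurrent workers skip rows that are already locked instead of waiting on them.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimQuery locks one unsent row. Rows locked by another transaction
// are skipped, so workers never block each other.
// Hypothetical schema: notifications(id, payload, sent_at).
const claimQuery = `
SELECT id, payload
FROM notifications
WHERE sent_at IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED`

// markSentQuery acknowledges a notification after a successful send.
const markSentQuery = `UPDATE notifications SET sent_at = now() WHERE id = $1`

// ProcessNext claims one pending notification, sends it, and marks it
// as sent in the same transaction. It reports whether a row was processed.
// If send fails or the process crashes, the transaction rolls back and
// the row becomes claimable again, which gives at-least-once delivery.
func ProcessNext(ctx context.Context, db *sql.DB, send func(id int64, payload string) error) (bool, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return false, err
	}
	defer tx.Rollback() // returns an ignored error after a successful Commit

	var id int64
	var payload string
	err = tx.QueryRowContext(ctx, claimQuery).Scan(&id, &payload)
	if errors.Is(err, sql.ErrNoRows) {
		return false, nil // queue is empty, or every pending row is locked
	}
	if err != nil {
		return false, err
	}
	if err := send(id, payload); err != nil {
		return false, err // rollback releases the row lock
	}
	if _, err := tx.ExecContext(ctx, markSentQuery, id); err != nil {
		return false, err
	}
	return true, tx.Commit()
}

func main() {
	// Calling ProcessNext requires a PostgreSQL connection and a driver,
	// which are out of scope for this sketch.
	fmt.Println(claimQuery)
}
```

One trade-off of this shape is that the transaction stays open while the push is being sent, which is acceptable at a small scale but worth revisiting if sends become slow.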
Design is a living process, and this document may be amended in the future. Do you have comments or suggestions? Feel free to leave a message.