Update on Service Resilience

If you've been using our services over the last couple of weeks, you will almost certainly have noticed a drop in our reliability. There have been some brief periods of general unavailability for other reasons, such as network instability (which we're currently working through with our network provider), but the more worrying incidents have been repeated partial outages of upload functionality that were not detected by our healthcheck stack, which primarily looks at the availability of each service as a whole.

There's some background to all of this. I've been working hard to make changes to some of our underlying tech that sits upstream of our main cluster, both to improve resilience and performance and to decrease our cost base. We've been using BunnyCDN (affiliate link) to do this.

And to be clear, it's not Bunny's service that is causing the errors. It's that the changes require supporting configuration in our cluster, and there have been some challenges in setting this up. This is due to a combination of accumulated tech debt, previous misconfigurations, and in some cases simply a lack of experience in configuring certain components, particularly Kubernetes Container Network Interfaces (CNIs).

But it's not all bad news. Keepalive connection pooling, reworked upload flows that reduce per-request overhead, and better-tuned resource allocation across our services have led to a significant improvement in both upload throughput and response times, which are really important to our continued growth and the success of our platforms.
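To give a concrete (if very simplified) picture of what the keepalive change means in practice: instead of opening a fresh connection to object storage for every upload, the upload path holds a small pool of persistent connections and reuses them, avoiding repeated TCP and TLS handshakes. The sketch below is illustrative only – the endpoint, pool sizes and client library are placeholders, not our actual stack.

```python
# Minimal illustration (not our actual code): reusing a pooled HTTP session
# for uploads to S3-compatible object storage. The endpoint and path below
# are placeholders.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Keep a small pool of persistent (keepalive) connections to the storage endpoint.
session.mount("https://", HTTPAdapter(pool_connections=4, pool_maxsize=16))

def upload_chunk(data: bytes, key: str) -> None:
    # Each call reuses an existing keepalive connection where possible,
    # rather than paying a new TCP/TLS handshake per upload.
    resp = session.put(f"https://storage.example.com/media/{key}", data=data, timeout=30)
    resp.raise_for_status()
```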

The main issue that has been plaguing us over the last week or so has been intermittent failures of inter-service connectivity between Mastodon, PeerTube and our object storage. This has recurred a number of times despite several different attempts to resolve it. For PeerTube, this primarily impacts uploads. For Mastodon, it impacts both uploads and the caching of remote media, such as avatars and attachments to posts.

The root cause appears to have been some funky caching of the virtual IPs we use for load balancing in our cluster. We made some changes to our load balancing algorithm to improve performance; these changes failed and were rolled back. For the most part the rollback was successful, however some iptables (or similar) rules appear to have persisted somewhere, causing the intermittent failures.

With this now (hopefully) fully resolved, our attention turns to how best to prevent it from happening again. I'm currently working on a suite of tools to periodically test the functionality of uploads on our services – not just that landing pages load correctly, but that uploads are successfully hitting our backend and being written to object storage.
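To give a sense of the shape of these checks, here's a rough sketch of an end-to-end upload probe. The endpoint, token and response fields are placeholders rather than our real configuration, but the idea is the same: push a tiny test file through the full upload path, then confirm the stored object is actually retrievable.

```python
# Rough sketch of an end-to-end upload probe. The endpoint, credential and
# response handling are placeholders – the real checks will be tailored to
# each service's upload API.
import sys
import time
import requests

UPLOAD_URL = "https://example.social/api/v2/media"  # placeholder upload endpoint
TOKEN = "probe-account-token"                        # placeholder credential

def probe_upload() -> bool:
    test_blob = f"upload-probe {time.time()}".encode()
    # Step 1: push a small file through the normal upload path
    # (proxy -> backend -> object storage).
    resp = requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": ("probe.txt", test_blob, "text/plain")},
        timeout=30,
    )
    if resp.status_code >= 400:
        return False
    # Step 2: confirm the stored object is readable from its public URL,
    # i.e. it actually landed in object storage and is being served.
    media_url = resp.json().get("url")
    if not media_url:
        return False
    return requests.get(media_url, timeout=30).status_code == 200

if __name__ == "__main__":
    sys.exit(0 if probe_upload() else 1)
```

Run on a schedule against a dedicated probe account, a failure here would flag a broken upload path even when the landing page itself is still up.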

This will take some time to fully develop, test and deploy. In the meantime I'm going to be uplifting our log monitoring to more quickly detect and alert on backend failures.
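As a rough illustration of what that might look like – the log path, match pattern and alert hook below are placeholders, and in practice this will likely plug into our existing log stack rather than being a standalone script:

```python
# Sketch only: watch the proxy/backend access log for failing upload requests
# and alert when the failure count crosses a threshold within a time window,
# instead of waiting for someone to notice broken uploads.
import re
import time
from collections import deque

LOG_PATH = "/var/log/ingress/access.log"                         # placeholder log file
UPLOAD_5XX = re.compile(r'"(POST|PUT) [^"]*media[^"]*" 5\d\d')   # placeholder pattern
WINDOW_SECONDS = 300
THRESHOLD = 5

def alert(count: int) -> None:
    # Placeholder: in practice this would page or post to our alerting channel.
    print(f"ALERT: {count} failed uploads in the last {WINDOW_SECONDS}s")

def watch() -> None:
    failures = deque()
    with open(LOG_PATH) as log:
        log.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            now = time.time()
            if UPLOAD_5XX.search(line):
                failures.append(now)
            # Drop entries that have aged out of the window, then check the threshold.
            while failures and now - failures[0] > WINDOW_SECONDS:
                failures.popleft()
            if len(failures) >= THRESHOLD:
                alert(len(failures))
                failures.clear()

if __name__ == "__main__":
    watch()
```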

This isn't the post I was hoping to make – I'd much rather have been writing about the success we've had in using Bunny to reduce our cost base while increasing service performance. That post will come in the future, but I wanted to take the time to talk through our struggles over the last few weeks and keep you all informed on our progress against these challenges.