Narrator is experiencing some request failures

Incident Report for Narrator

Postmortem

Over the past 2 days, Narrator experienced a higher-than-average error rate. We have now fixed the issue, and I apologize to everyone for the inconvenience.

The Change

As Narrator continues to grow, we set out to improve our servers' processing and security with the following changes:

Leveraging Kubernetes to better scale load and distribute processing
Multi-region deploys with multiple availability zones to allow faster, more secure and highly available access to Narrator (i.e. if an AWS region goes down, we stay up)
Better query experience: faster queries and more secure, consistent caching

The Deploy

We deployed our changes in steps over the last couple of months.

Initially we deployed only to our internal accounts for 1 month to ensure stability
We deployed to some customers for 2 weeks
Last Friday we deployed it to everyone.

Over the weekend, some small edge-case bugs were found and resolved, but no major issues occurred.

The Timeline

On Tuesday, we began noticing significantly more of our requests timing out. This was very surprising, since we had many auto-scaling measures in place.

As we began investigating the issues, we realized:

Our Pods (computers) were not able to handle the requests
We quickly scaled up the pods by 5x to ensure we could handle more requests
The problem appeared solved
We noticed that the issue came back and began debugging again
The problem was intermittent and would affect some customers and not others.
On Tuesday night, we updated our workers and the problem appeared to be solved. We later discovered that this was not the case: lower load had made the problem appear fixed.
We kept monitoring the system to ensure the problem was permanently solved.
At 8am on Wednesday, we saw another spike of timeouts
We investigated and realized that our auto-scaling was not working due to a bug in our Health Check, which helps decide where to route traffic.
At 1pm on Wednesday, we implemented a fix to bring our systems back
On Wednesday, we continued to monitor all our endpoints

The Issue

Our Health Check process checks our servers to ensure they are up and running, which helps us distribute the load. A bug was causing the health request to fail (we were blocking IP-based requests to stop malicious behavior), so the servers were being flagged as unhealthy. As a result, all our traffic was routed to fewer and fewer servers, which led to timeouts and the “Failed to Fetch” error.
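
To illustrate why wrongly failing health checks lead to timeouts, here is a minimal sketch in Python. The pod count, request volume and names (`Pod`, `requests_per_healthy_pod`) are hypothetical and are not taken from Narrator's systems; the sketch only assumes that traffic is spread evenly across pods marked healthy.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    healthy: bool = True


def requests_per_healthy_pod(pods: list[Pod], total_requests: int) -> float:
    """Spread traffic evenly over the pods currently marked healthy."""
    healthy = [p for p in pods if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy pods left to route to")
    return total_requests / len(healthy)


# Hypothetical numbers: 10 pods sharing 1,000 requests.
pods = [Pod(f"pod-{i}") for i in range(10)]
load_before = requests_per_healthy_pod(pods, 1000)

# The health check wrongly fails on 8 pods, so they are marked unhealthy
# even though they could still serve traffic.
for pod in pods[:8]:
    pod.healthy = False
load_after = requests_per_healthy_pod(pods, 1000)

print(load_before, load_after)
```

In this sketch, the same traffic lands on 2 pods instead of 10, so each remaining pod receives five times its original load.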


The Solution

We fixed the Health Check requests by flagging their traffic so that the Health Check code succeeds.
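
One possible reading of this fix can be sketched as follows. It is an assumption, not Narrator's actual code: it treats "IP-based requests" as requests addressed to a bare IP instead of a domain name, and it flags Health Check traffic by a hypothetical `/health` path. The names `HEALTH_CHECK_PATH`, `host_is_bare_ip` and `is_allowed` are illustrative only.

```python
import ipaddress

HEALTH_CHECK_PATH = "/health"  # hypothetical path, not from the report


def host_is_bare_ip(host: str) -> bool:
    """Return True when the host is an IP address rather than a domain name.

    Assumes the host is passed without a port.
    """
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_allowed(host: str, path: str) -> bool:
    """Block IP-addressed requests, except traffic flagged as a health check."""
    if path == HEALTH_CHECK_PATH:
        return True
    return not host_is_bare_ip(host)
```

With a rule like this, a health check sent directly to a pod's IP passes, while other IP-addressed requests are still rejected.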


The Learning

We take this situation very seriously and we are updating our deployment strategy so this does not happen in the future.

Our new deploy process will now be:

Internal deploy for a couple of weeks
Deploy to a couple of customers for a couple of weeks
(new) Coordinate and test with high usage customers
Deploy to everyone

With the addition of a deploy to some high-usage customers, we can verify that auto-scaling and load are handled.

I hope you can see that we take these issues incredibly seriously and are very sorry for your experience. If you have any concerns or recommendations, please feel free to email ahmed@narrator.ai.

Thanks

Posted Sep 20, 2023 - 18:04 EDT

Resolved

The issue has been resolved, and a postmortem will be written and shared.

Posted Sep 20, 2023 - 16:06 EDT

Update

We are continuing to monitor for any further issues.

Posted Sep 20, 2023 - 13:53 EDT

Update

We have noticed that our initial solution did not solve the issue; we are currently testing another solution.

Posted Sep 20, 2023 - 12:52 EDT

Monitoring

A fix has been implemented and we will continue to monitor for the next couple of days

Posted Sep 19, 2023 - 23:10 EDT

Identified

We have identified the issue and are working to scale up our systems to ensure reliability

Posted Sep 19, 2023 - 20:43 EDT

Investigating

Narrator servers are experiencing intermittent failures; we are looking into it now.

Posted Sep 19, 2023 - 19:56 EDT

This incident affected: Portal.