Narrator is experiencing some request failures
Incident Report for Narrator
Postmortem

Over the past 2 days, Narrator has experienced a higher-than-average error rate. We have now fixed the issue, and I apologize to everyone for the inconvenience.

The Change

As Narrator continues to grow, we aimed to improve our servers' processing and security through the following changes:

  1. Leveraging Kubernetes to better scale load and distribute processing
  2. Multi-region deploys with multiple availability zones to allow faster, more secure, and highly available access to Narrator (i.e. if an AWS region goes down, we stay up)
  3. Better Query experience: Faster queries, more secure and consistent caching

The Deploy

We deployed our changes in steps over the last couple of months.

  1. Initially we deployed only to our internal accounts for 1 month to ensure stability
  2. We deployed to some customers for 2 weeks
  3. Last Friday we deployed it to everyone.

Over the weekend, some small edge-case bugs were found and resolved, but no major issues occurred.

The Timeline

On Tuesday, we began noticing far more of our requests timing out. This was very surprising, since we had many auto-scaling measures in place.

As we investigated the issue, the following happened:

  1. We found that our Pods (computers) were not able to handle the requests.
  2. We quickly scaled up the Pods by 5x so that we could handle more requests.
  3. The problem appeared to be solved.
  4. We noticed that the issue had come back and began debugging again.
  5. The problem was intermittent, affecting some customers and not others.
  6. On Tuesday night, we updated our workers and the problem appeared to be solved.
    1. We later discovered that this was not true; lower load had made the problem appear fixed.
  7. We kept monitoring the system to confirm that the problem was permanently solved.
  8. At 8am on Wednesday, we saw another spike of timeouts.
  9. We investigated and realized that our auto-scaling was not working because of a bug in our Health Check, which helps decide where to route traffic.
  10. At 1pm on Wednesday, we implemented a fix to bring our systems back.
  11. On Wednesday, we continued to monitor all our endpoints.

The Issue

Our Health Check process checks our servers to ensure they are up and running, which helps us distribute the load. A bug was causing the health request to fail (we were blocking IP-based requests to stop malicious behavior), so the servers were being flagged as unhealthy. As a result, all our traffic was routed to fewer and fewer servers, which led to timeouts and the “Failed to Fetch” error.
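
The failure mode can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production code: it assumes that "IP-based requests" means requests addressed to a bare IP instead of a hostname, and the names `is_bare_ip`, `handle_request`, `Server`, `check_health`, and `load_per_healthy_server`, along with the addresses and request counts, are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


def is_bare_ip(host: str) -> bool:
    """Return True when the host is an IP address rather than a hostname."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def handle_request(host: str) -> int:
    """Hypothetical security filter: reject requests addressed to a bare IP."""
    return 403 if is_bare_ip(host) else 200


@dataclass
class Server:
    name: str
    ip: str  # assumption: the health checker reaches the server by this address
    healthy: bool = True


def check_health(server: Server) -> None:
    """Probe one server; a rejected probe flags the server as unhealthy."""
    server.healthy = handle_request(server.ip) == 200


def load_per_healthy_server(servers: list[Server], total_requests: int) -> float:
    """Spread traffic evenly over the servers still marked healthy."""
    healthy = [s for s in servers if s.healthy]
    if not healthy:
        return float("inf")  # nowhere left to route traffic
    return total_requests / len(healthy)


if __name__ == "__main__":
    servers = [Server(f"pod-{i}", f"10.0.0.{i}") for i in range(1, 5)]
    for server in servers[:3]:
        check_health(server)
        print(server.name, "flagged; load per healthy server:",
              load_per_healthy_server(servers, 1000))
```

Each probe that the filter rejects removes one server from the healthy pool, so the same traffic lands on a shrinking set of servers even though every server is actually fine.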

The Solution

We fixed the Health Check requests by flagging their traffic so that the Health Check can succeed.
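
One way such a flag could work is sketched below. This is a guess at the shape of the fix, not the actual change: the header name `X-Health-Check-Token`, the shared secret, and the functions `is_flagged_health_check` and `handle_request` are all hypothetical, and it again assumes the filter rejects requests addressed to a bare IP.

```python
import hmac
import ipaddress

HEALTH_CHECK_HEADER = "X-Health-Check-Token"  # hypothetical header name
HEALTH_CHECK_TOKEN = "replace-with-a-secret"  # hypothetical shared secret


def is_bare_ip(host: str) -> bool:
    """Return True when the host is an IP address rather than a hostname."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_flagged_health_check(headers: dict[str, str]) -> bool:
    """Treat a request as a health check only if it carries the expected token."""
    token = headers.get(HEALTH_CHECK_HEADER, "")
    # Constant-time comparison, so the token cannot be guessed via timing.
    return hmac.compare_digest(token.encode(), HEALTH_CHECK_TOKEN.encode())


def handle_request(host: str, headers: dict[str, str]) -> int:
    """Let flagged health checks through; keep blocking other bare-IP requests."""
    if is_flagged_health_check(headers):
        return 200
    if is_bare_ip(host):
        return 403
    return 200
```

Checking a secret token, as opposed to a plain boolean flag, keeps the exemption from reopening the hole that the IP block was meant to close.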

The Learning

We take this situation very seriously, and we are updating our deployment strategy so that this does not happen in the future.

Our new deploy process will now be:

  1. Internal deploy for a couple of weeks
  2. Deploy to a couple of customers for a couple of weeks
  3. (new) Coordinate and test with high-usage customers
  4. Deploy to everyone

With the addition of a deploy to some high-usage customers, we can ensure that auto-scaling and load are handled.

I hope you can see that we take these issues incredibly seriously, and we are very sorry for your experience. If you have any concerns or recommendations, please feel free to email ahmed@narrator.ai.

Thanks

Posted Sep 20, 2023 - 18:04 EDT

Resolved
The issue has been resolved, and a postmortem will be written and shared.
Posted Sep 20, 2023 - 16:06 EDT
Update
We are continuing to monitor for any further issues.
Posted Sep 20, 2023 - 13:53 EDT
Update
We have found that our initial solution did not solve the issue, and we are currently testing another solution.
Posted Sep 20, 2023 - 12:52 EDT
Monitoring
A fix has been implemented, and we will continue to monitor for the next couple of days.
Posted Sep 19, 2023 - 23:10 EDT
Identified
We have identified the issue and are working to scale up our systems to ensure reliability.
Posted Sep 19, 2023 - 20:43 EDT
Investigating
Narrator servers are experiencing intermittent failures. We are diving into it now.
Posted Sep 19, 2023 - 19:56 EDT
This incident affected: Portal.