Querying not working for certain customers.

Incident Report for Vectara Inc

Postmortem

Query, Chat, and Document API Outage

On March 6, 2024 between 2:30 AM US Eastern Standard Time (EST) and 11:00 AM EST many Vectara customers were unable to query their corpora. All query requests for some of our Growth tier users and a very small number of our Scale customers experienced a complete outage on any type of query requests. This failure includes chat API requests and list documents API calls, including those made from the administrative Console.

What Happened?

We’ve been working hard to improve the resilience of our internal query engine system. This has included updates on how the query engine receives and stores new index requests. Unfortunately, the new code didn’t account for the rare series of events:

The index request gets picked up by the storage engine.
The optimizer service deletes the index request while the storage engine has only partially read the index request.

This coupled with another memory configuration change on a few servers led to the query engine not being able to automatically recover from the failed state.While we did have alerts that were firing immediately to alert us of an issue, it took the team an extended time to discover the cause of the incident. When the criticality of the failure was correctly identified, the server was quickly modified to ignore deleted index requests.

What do we plan to do in the future?

We know that users are relying on Vectara to build robust, production-grade GenAI applications, and thus we view this incident as completely unacceptable. We plan to overhaul internal incident response processes to prevent similar situations in the future. Additionally, we plan on continuing to engineer for resilience and make sure the system operates gracefully under failure conditions. Our architecture is generally designed to handle internal failures gracefully, and we are doing a review of the current architecture for deficiencies in light of this incident to uncover any other shortcomings. We were already in the process of overhauling the query engine for resiliency, and we plan to expedite that process making sure outages like this one cannot happen in the future.

‌

Thank you for your understanding and we apologize for the impact.

Posted Mar 08, 2024 - 20:37 UTC

Resolved

Full Query API functionality should be restored. A post mortem report on the incident will be shared in the future.

Posted Mar 06, 2024 - 16:32 UTC

Monitoring

The Query API should now work again for all customers. A new internal feature intended to increase system capacity had a bug in a corner case, causing the Query API to go down for some customers. We have implemented a fix for that bug and will review how processes can be improved going forward.

Posted Mar 06, 2024 - 15:46 UTC

Identified

For certain customers, all Query API calls are completely down. We have identified the problem and are implementing a fix to bring back up the query engine for those customers.

Posted Mar 06, 2024 - 15:15 UTC

Update

We are continuing to investigate this issue.

Posted Mar 06, 2024 - 15:03 UTC

Investigating

A subset of our querying engine is down for certain customers.

Posted Mar 06, 2024 - 14:57 UTC

This incident affected: API (Query).