On March 6, 2024, between 2:30 AM and 11:00 AM US Eastern Standard Time (EST), many Vectara customers were unable to query their corpora. Some of our Growth tier users and a very small number of our Scale customers experienced a complete outage across all types of query requests. The failure affected chat API requests and list documents API calls, including those made from the administrative Console.
We’ve been working hard to improve the resilience of our internal query engine system. This work has included updates to how the query engine receives and stores new index requests. Unfortunately, the new code didn’t account for the rare series of events:
This, coupled with another memory configuration change on a few servers, left the query engine unable to recover automatically from the failed state.
While our alerts fired immediately, it took the team an extended time to discover the cause of the incident. Once the criticality of the failure was correctly identified, the server was quickly modified to ignore deleted index requests.
We know that users rely on Vectara to build robust, production-grade GenAI applications, and we view this incident as completely unacceptable. We plan to overhaul our internal incident response processes to prevent similar situations in the future. We also plan to continue engineering for resilience so that the system operates gracefully under failure conditions. Our architecture is generally designed to handle internal failures gracefully, and in light of this incident we are reviewing it for any other shortcomings. We were already in the process of overhauling the query engine for resiliency, and we plan to expedite that work to make sure outages like this one cannot happen in the future.
Thank you for your understanding, and we apologize for the impact.