One of the indexing servers ran out of memory (OOM), causing indexing requests routed to that server to fail. The OOM left the server in a limbo state that the liveness prober did not detect as a FAILURE, so the server was not restarted in time. This resulted in a prolonged partial outage.
Start: June 25, 2024, 7:40 am UTC
End: June 25, 2024, 9:28 am UTC
The issue has been fixed, and a long-term solution for improving the liveness prober is in progress.