High streaming latency and devices shown as inactive
Incident Report for CVaaS
Postmortem

A network upgrade of the cluster caused some backends to become unreachable. The resulting network degradation delayed the processing of streamed data, causing customers' devices to be shown as inactive and various APIs to stop responding in time. Once the misbehaving servers were removed from the cluster, the system recovered.

This incident is not related to yesterday's performance degradation, but the visible effect for customers is the same.

Posted Jul 31, 2024 - 15:43 UTC

Resolved
This incident has been resolved.
Posted Jul 31, 2024 - 15:33 UTC
Update
The streaming latency is back to normal. We are keeping the incident open for a few more minutes to validate that we have fully recovered.
Posted Jul 31, 2024 - 15:20 UTC
Update
We are still monitoring the situation. The cluster has mostly recovered, but higher latency is still observed for a small fraction of the ingested data.
Posted Jul 31, 2024 - 15:03 UTC
Update
We are continuing to monitor the situation. The backlog of streamed data has been processed. Higher streaming latency is still being observed at the moment.
Posted Jul 31, 2024 - 14:36 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 31, 2024 - 14:30 UTC
Identified
The issue has been identified and we are applying a mitigation. At the moment, data is being queued but not processed.
Posted Jul 31, 2024 - 14:18 UTC
Investigating
We are currently investigating this issue.
Posted Jul 31, 2024 - 14:07 UTC
This incident affected: us-central1-a (Core Platform).