Before the incident response team convened at around 22:00 CEST, the systems had begun to experience steadily increasing latency, meaning users could face delays when trying to control their charging stations. By the time the incident response team convened, latency had reached 5 minutes or more for most, if not all, requests.
In the hours that followed, the team worked through numerous hypotheses and made several attempts to reduce the load, primarily by limiting retries and restarting applications. Despite these efforts, the root cause has not yet been fully identified, and we are still actively investigating it. Several code changes were prepared for deployment, aimed at improving calls believed to behave like multicast, fanning out duplicate work such as repeated balancing operations.
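The report does not detail how retries were limited; as an illustration only, a bounded retry with exponential backoff and jitter is one common way to stop failing calls from amplifying load during an incident like this (the function and parameter names below are hypothetical, not taken from our codebase):

```python
import random
import time

def call_with_bounded_retries(operation, max_attempts=3, base_delay=0.5):
    """Run `operation`, retrying at most `max_attempts` times in total.

    Exponential backoff with jitter spreads retries out in time, so a
    degraded dependency is not hammered by synchronized retry storms.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure instead of retrying forever
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Capping attempts bounds the extra load each client can add, which matters most exactly when the system is already saturated.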
Unfortunately, these attempts did not reduce or eliminate the issue. The problem resolved itself at 00:00 CEST, when a natural dip in traffic pulled the overall load below an unknown threshold, stabilizing the system. At around 01:00 CEST, the response team deemed the system recovered and paused further improvement attempts. The situation was monitored for another half hour, and the response team's efforts ended at 01:30 CEST.
In the coming days, the team will prioritize improving observability, redesigning traffic flows, and addressing how our systems and the surrounding consumer services handle traffic. While the situation has stabilized, we remain committed to identifying the root cause and ensuring long-term stability.