OCPP disconnections/restarts

Incident Report for Zaptec

Postmortem

Summary

An intermittent and hard-to-diagnose issue impacted the OCPP Bridge, causing delays and failures in charging operations. The incident spanned multiple days, with various troubleshooting efforts including scaling, infrastructure, and code changes. A temporary resolution was achieved on March 20, but problems resurfaced the next day, leading to the decision to migrate to a new Azure Service Bus instance. The incident was ultimately traced back to high CPU usage and server saturation on Microsoft’s physical infrastructure hosting the original Service Bus, something that was not visible in our metrics and was ultimately out of our hands.

Impact

During the incident, 3rd-party integrators and charging stations connected to our OCPP Bridge experienced problems with sending and receiving observations, resulting in delays when starting or stopping charging and, in some cases, an inability to start charging.

Root Cause Analysis

On 18 March around 11 PM we detected a few brief disconnects in our OCPP Bridge, which quickly stabilized. As the night continued, the drops increased in both frequency and duration.
The following morning we established a dedicated team to focus on troubleshooting and fixing the issue.
The behavior we observed turned out to be very difficult to troubleshoot, as there was no clear indication of where the problem originated or what was causing it. Throughout the day we implemented a number of possible fixes to try to pinpoint the root problem, and focused heavily on getting better insight by adding more metrics and logs and by making log levels more verbose.
Due to the intermittent nature of the problem it was difficult to tell which hotfixes had a positive or negative effect, and by the end of the day we had only managed to stabilize the backend for shorter periods and were no closer to finding the root cause. Nothing in our code, firmware, or infrastructure suggested that we should be having problems.
On 20 March around 2:30 PM we saw a “blip” on the primary Service Bus, where the service stopped reporting metrics back to us, giving us reason to believe that Microsoft was making updates or changes to the backend of our specific Service Bus. After the “blip”, all systems were back to normal without us having made any new changes to our code or infrastructure. To confirm that our systems were back and operational, we cycled some key consumers of the Service Bus by doing a full restart.
We went into active monitoring until the next day, when we started to see identical problems at around 9:50 AM. At that point we decided to migrate our applications to a new, but identical, Service Bus, as we were now as sure as we could be that the problem was on Microsoft’s end and not Zaptec’s.
What made this particularly difficult to troubleshoot was that our observability, in the form of metrics and logs, indicated no resource saturation in either our code or our infrastructure.
After escalating the support request to Microsoft, we received confirmation that they did in fact have problems with the underlying servers hosting our infrastructure, but this was not communicated to us during the active incident.

Action Items & Follow-Up

Even though the root cause turned out to be on Microsoft’s side, we have identified areas we can improve to help prevent these types of problems in the future.
Azure Service Bus is a PaaS offering, meaning Microsoft handles the infrastructure, scaling, and maintenance across the operating system, network, and physical server stack. We now know that we cannot always trust the metrics being reported to us. The only way we can try to eliminate these types of problems in a troubleshooting scenario is to deploy new PaaS infrastructure and migrate our services, as sketched below.
This step is now being moved earlier in our troubleshooting plan, and we will make the process quicker and easier by prioritizing the deployment in our Backup and Disaster Recovery plan.
This is being followed up by the Platform team.
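As an illustrative sketch only (not our actual code): assuming a Python consumer that uses the azure-servicebus SDK and resolves its connection string and queue name from configuration, migrating to a freshly deployed Service Bus namespace becomes a configuration change plus a restart rather than a code change. The environment variable and queue names below are hypothetical.

    import os

    from azure.servicebus import ServiceBusClient

    # Hypothetical setting names; swapping the connection string to the new
    # namespace is the whole migration from this consumer's point of view.
    connection_string = os.environ["SERVICEBUS_CONNECTION_STRING"]
    queue_name = os.environ.get("OCPP_QUEUE_NAME", "ocpp-observations")

    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_queue_receiver(queue_name=queue_name) as receiver:
            for message in receiver.receive_messages(max_message_count=10, max_wait_time=5):
                print(str(message))                 # process the observation here
                receiver.complete_message(message)  # settle so it is not redelivered

The same pattern applies to senders and topic subscriptions; the new namespace only needs the same queues and topics provisioned before the switch.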

In addition to improving our current topology, we are exploring options to simplify our infrastructure and rely less on Service Bus and similar message brokers. This is an ongoing initiative involving the Platform team, developers, and architects.

Posted Apr 30, 2025 - 08:02 CEST

Resolved

This incident has been resolved; a postmortem will be posted when ready.
Posted Mar 25, 2025 - 12:16 CET

Update

We are still monitoring the situation.
Posted Mar 25, 2025 - 09:39 CET

Update

There is still full functionality. You may experience some minor delays, but all operations should work as expected. We are still monitoring the situation.
Posted Mar 24, 2025 - 15:26 CET

Monitoring

Full functionality has been enabled for the OCPP Bridge, and systems are running fine. You may experience some minor delays, but all operations should work as expected. We are still monitoring the situation.

Thank you for your patience.
Posted Mar 24, 2025 - 09:30 CET

Update

After the rollback at 10:00 we do see improvements, but we are still operating with limited functionality. All connected charging stations will work as expected, but if any changes are made at the installation level (updates to installations, circuits, and chargers), OCPP messages will not be received. This includes changes such as a change of authentication mode, or the addition or removal of installations, circuits, or chargers from OCPP.
We are still working on a solution to this incident.
Posted Mar 21, 2025 - 11:48 CET

Investigating

After enabling full functionality at 09:15, we do see some issues. We will roll back to limited functionality at 10:00 and keep investigating the issue.
Posted Mar 21, 2025 - 09:57 CET

Update

As our systems have been running as expected since the last update, we are going to enable full functionality for OCPP at 09:15.
Posted Mar 21, 2025 - 09:08 CET

Monitoring

We are seeing improvements to the OCPP Bridge, and our systems are stable again after our latest deployment. However, OCPP has been deployed with limited functionality. All connected charging stations will work as expected, but if any changes are made at the installation level (updates to installations, circuits, and chargers), OCPP messages will not be received. This includes changes such as a change of authentication mode, or the addition or removal of installations, circuits, or chargers from OCPP.
We are actively monitoring the situation, working on a solution, and expect full functionality to be restored by tomorrow morning, with further improvements ongoing.
Posted Mar 20, 2025 - 17:10 CET

Update

We still don't know the root cause, and will continue the investigation through the evening. Thank you for your patience.
Posted Mar 20, 2025 - 15:32 CET

Update

Our development team is fully focused on identifying the root cause and working towards a resolution. While we are still investigating and do not yet have a definitive explanation, please rest assured that we have our best people on the case.
We will provide an update as soon as we have more clarity on the situation. Thank you for your patience and understanding.
Posted Mar 20, 2025 - 12:14 CET

Update

It is still unclear what's causing the issues. We are continuing to investigate.
Posted Mar 20, 2025 - 10:20 CET

Update

At this stage, we are still investigating the root cause of the issue and don’t have a definitive explanation yet. We will provide an update as soon as we have more clarity.
Posted Mar 20, 2025 - 09:06 CET

Update

We will continue to investigate this issue today.
Posted Mar 20, 2025 - 07:39 CET

Update

We will keep investigating this issue over the night.
Posted Mar 19, 2025 - 23:45 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 21:03 CET

Update

We can see improvements in our systems as of now, but we are still investigating the issue.
Posted Mar 19, 2025 - 20:07 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 19:44 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 18:54 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 17:37 CET

Update

More logging has been activated to track down the issue. We will continue to investigate.
Posted Mar 19, 2025 - 16:54 CET

Update

We are still on top of this issue, and we will continue to investigate. As of now we have no ETA on when we expect the problem to be fixed, but we will update frequently.
Posted Mar 19, 2025 - 16:00 CET

Update

We are still investigating this issue.
Posted Mar 19, 2025 - 15:02 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 13:27 CET

Update

At 13:00 we will reboot the OCPP Proxy. You will see chargers disconnect from the OCPP backend, but they should recover shortly after.
Posted Mar 19, 2025 - 12:54 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 12:42 CET

Update

We are still investigating this issue.
Posted Mar 19, 2025 - 11:14 CET

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 10:07 CET

Investigating

We are seeing an increase in BootNotifications being sent from our charging stations without any clear reason. We are currently investigating this issue.
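For context, a BootNotification is the OCPP message a charging station sends to (re)register itself with the backend, typically after a boot or reconnect, so a spike in these messages suggests stations repeatedly reconnecting. A minimal illustration of such a frame, assuming OCPP 1.6-J and using made-up values:

    import json

    # Illustrative OCPP 1.6-J BootNotification request frame; values are made up.
    boot_notification = [
        2,                   # MessageTypeId: 2 = CALL (a request from the charge point)
        "19223201",          # UniqueId chosen by the charge point
        "BootNotification",  # Action
        {
            "chargePointVendor": "ExampleVendor",
            "chargePointModel": "ExampleModel",
        },
    ]
    print(json.dumps(boot_notification))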
Posted Mar 19, 2025 - 09:03 CET
This incident affected: Zaptec Cloud Services (OCPP).