An intermittent and hard-to-diagnose issue impacted the OCPP Bridge, causing delays and failures in charging operations. The incident spanned multiple days, with various troubleshooting efforts including scaling, infrastructure changes and code changes. A temporary resolution was achieved on March 20, but problems resurfaced the next day, leading to the decision to migrate to a new Azure Service Bus instance. The incident was ultimately traced back to high CPU usage and server saturation on Microsoft’s physical infrastructure hosting the original Service Bus, something that was not visible in our metrics and ultimately out of our hands.
During the incident, third-party integrators and charging stations connected to our OCPP Bridge experienced problems sending and receiving observations, resulting in delays in starting and stopping charging and, in some cases, an inability to start charging.
On the 18th of March, around 11PM, we detect a few brief disconnects in our OCPP Bridge, but the service quickly stabilizes. As the night continues, we see an increase in the frequency and duration of the drops.
The following morning we establish a dedicated team to focus on troubleshooting and fixing the issue.
The system behavior we observe turns out to be very difficult to troubleshoot, since there is no clear indication of where the problem originates or what is causing it. Throughout the day we implement a number of possible fixes to try to pinpoint the root problem, and focus heavily on gaining better insight by adding more metrics and logs and by making log levels more verbose.
Due to the intermittent nature of the problem, it is difficult to know which hotfixes have a positive or negative effect, and by the end of the day we have only been able to stabilize the backend for short periods and are no closer to finding the root cause. Nothing in our code, firmware or infrastructure suggests that we should be having problems.
On the 20th, around 2:30PM, we see a “blip” on the primary Service Bus where the service is not reporting any metrics back to us, giving us reason to believe that Microsoft is making updates or changes to the backend of our specific Service Bus. After the “blip”, all systems return to normal without us having made any new changes to our code or infrastructure. To confirm that our systems are back and operational, we cycle some key consumers of the Service Bus by doing a full restart.
We go into active monitoring until the next day, when we start to see identical problems at around 9:50PM. At this point we decide to migrate our applications to a new but identical Service Bus, as we are now as sure as we can be that the problem is on Microsoft’s end and not Zaptec’s.
What made this particularly difficult to troubleshoot was that our observability, in the form of metrics and logs, indicated no resource saturation in either our code or our infrastructure.
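To illustrate the kind of independent check that can complement platform metrics in a case like this, below is a minimal sketch of an end-to-end probe that sends a message through a Service Bus queue and measures the round-trip time, using the azure-servicebus Python SDK. The connection string, queue name and wait window are hypothetical placeholders, not our actual configuration; this is an illustration rather than a description of our setup.

```python
"""Illustrative end-to-end probe: send a message to a Service Bus queue and
time how long it takes to come back. All names below are placeholders."""
import time
import uuid

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONNECTION_STRING = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "healthcheck"                             # hypothetical probe queue
MAX_WAIT_SECONDS = 10                                  # example wait window


def round_trip_latency() -> float | None:
    """Return seconds from send to receive, or None if the probe never arrived."""
    probe_id = str(uuid.uuid4())
    with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
        started = time.monotonic()
        with client.get_queue_sender(QUEUE_NAME) as sender:
            sender.send_messages(ServiceBusMessage(probe_id))
        with client.get_queue_receiver(QUEUE_NAME, max_wait_time=MAX_WAIT_SECONDS) as receiver:
            for message in receiver:
                receiver.complete_message(message)
                if str(message) == probe_id:
                    return time.monotonic() - started
    return None


if __name__ == "__main__":
    latency = round_trip_latency()
    if latency is None:
        print("Probe message was not received within the wait window")
    else:
        print(f"Round trip took {latency:.2f}s")
```

A probe like this exercises the same data path our applications use, so broker-side delays can show up here even when the namespace’s own metrics look healthy.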
After escalating the support request to Microsoft, we received confirmation that they did in fact have problems with the underlying servers hosting our infrastructure, but this was not communicated to us during the active incident.
Even though the root cause turned out to be on Microsoft’s side, we have identified areas we can improve to help prevent these types of problems in the future.
Azure Service Bus is a PaaS offering, meaning Microsoft handles the infrastructure, scaling, and maintenance across the operating system, network and physical server stack. We now know that we cannot always trust the metrics reported to us. The only way we can rule out these types of problems in a troubleshooting scenario is to deploy new PaaS infrastructure and migrate our services to it.
This step is now being moved earlier in our troubleshooting plan, and we will make the process quicker and easier by prioritizing this deployment in our Backup and Disaster Recovery plan.
This is being followed up by the Platform team.
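As a rough sketch of what that deployment step can look like, the example below provisions a replacement Service Bus namespace with the azure-mgmt-servicebus Python SDK; queues, topics and subscriptions would then be recreated and consumers repointed through configuration. The subscription, resource group, namespace name, region and SKU are hypothetical placeholders and not a description of our actual environment.

```python
"""Illustrative sketch: provision a replacement Service Bus namespace.
All resource names and settings below are hypothetical placeholders."""
from azure.identity import DefaultAzureCredential
from azure.mgmt.servicebus import ServiceBusManagementClient

SUBSCRIPTION_ID = "<subscription-id>"       # placeholder
RESOURCE_GROUP = "rg-ocpp-bridge"           # hypothetical resource group
NEW_NAMESPACE = "sb-ocpp-bridge-failover"   # hypothetical replacement namespace

client = ServiceBusManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create the new namespace; an identical topology (queues, topics, subscriptions)
# would be created afterwards, and consumers repointed via their connection settings.
poller = client.namespaces.begin_create_or_update(
    RESOURCE_GROUP,
    NEW_NAMESPACE,
    {
        "location": "westeurope",
        "sku": {"name": "Premium", "tier": "Premium", "capacity": 1},
    },
)
namespace = poller.result()
print(f"Provisioned {namespace.name} ({namespace.provisioning_state})")
```

Keeping this step scripted is what makes it realistic to reach for a fresh namespace early in troubleshooting rather than as a last resort.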
In addition to finding improvements to our current topology, we are exploring options to simplify our infrastructure and rely less on Service Bus or similar message brokers. This is an ongoing initiative involving the Platform team, developers and architects.