OCPP disconnections
Incident Report for Zaptec
Postmortem

At 14:06 CEST a deployment was made to our API that included work done on two features. This work included an inadvertent change of behaviour to several endpoints used to control configuration of installations and devices:

  • POST /charger
  • PUT /charger/{id}
  • POST /installation
  • PUT /installation/{id}
  • POST /installation/{id}/update

With this change, if one of these endpoints was called with a JSON body that omitted any of the OCPP Bridge configuration parameters (central system URL, default tag ID, or password), the system deleted the omitted parameters from the device or installation configuration. Previously, a missing parameter left the configured value intact.

Many customers use these endpoints to change the behaviour of their installations and devices, so any OCPP Bridge configuration that had previously been set was wiped. Because these endpoints are called very frequently in certain scenarios (such as setting the available current on an installation), several thousand devices lost their OCPP connection.

The issue surfaced quickly when a partner began having severe problems: they call one of the endpoints very frequently for every installation they manage, which wiped the OCPP settings for those installations. Once aware, Global Support quickly brought the issue to the attention of developers.

Once the behaviour was noticed, it was quickly correlated with the recent deployment, and a quick analysis revealed the root cause. At 15:54 CEST, the API was reverted to its previous state, and the endpoint behaviour returned to normal. However, the OCPP Bridge configuration data had already been wiped for many installations at this point.

An incident response team was set up at 16:54 CEST and immediately started working on strategies to recover the corrupted data. A point-in-time restore of the database was initiated to obtain a copy of the data that could be used to fix affected installations. The team also identified that all the corruption had been tracked in our Change Log for every installation. This redirected efforts to locating all the damaged data within the Change Log, for both Chargers and Installations, allowing us to identify affected installations as well as individual charging stations. The Change Log was retrieved, and the team’s effort focused on creating a script to restore the data. The behaviour of the script was verified in stages, and it was eventually applied to all installations, restoring the data. Data for the individual charging stations was corrected manually by Global Support.

Improvements identified

  • We need unit tests for all API endpoints that accept partial data, ensuring that existing values are kept.
  • We need better tooling for working with the Change Log, especially for data recovery.
  • Our database’s point-in-time restore was slower than expected.
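
As an illustration of the first improvement, the following is a minimal sketch of a partial-update merge and a unit test that checks existing values are kept. It is not Zaptec's actual implementation: the function name, the field names, and the choice to treat an explicit null as a request to clear a value (the convention used by JSON Merge Patch, RFC 7396) are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any, Mapping

# Hypothetical field names: the report names these settings only informally.
OCPP_FIELDS = (
    "ocpp_central_system_url",
    "ocpp_default_tag_id",
    "ocpp_password",
)


def apply_partial_update(
    stored: Mapping[str, Any], body: Mapping[str, Any]
) -> dict[str, Any]:
    """Return a new configuration in which only keys present in `body` change.

    A key that is absent from `body` keeps its stored value.
    A key that is present with the value None is removed (assumed convention).
    """
    updated = dict(stored)
    for key, value in body.items():
        if value is None:
            updated.pop(key, None)  # explicit null: clear the setting
        else:
            updated[key] = value
    return updated


def test_missing_ocpp_fields_are_preserved() -> None:
    stored = {
        "available_current": 32,
        "ocpp_central_system_url": "wss://ocpp.example.com",
        "ocpp_default_tag_id": "TAG-1",
        "ocpp_password": "secret",
    }
    # A typical frequent call: only the available current is sent.
    updated = apply_partial_update(stored, {"available_current": 16})

    assert updated["available_current"] == 16
    for field in OCPP_FIELDS:
        assert updated[field] == stored[field]


def test_explicit_null_clears_only_that_field() -> None:
    stored = {
        "ocpp_central_system_url": "wss://ocpp.example.com",
        "ocpp_default_tag_id": "TAG-1",
    }
    updated = apply_partial_update(stored, {"ocpp_default_tag_id": None})

    assert "ocpp_default_tag_id" not in updated
    assert updated["ocpp_central_system_url"] == "wss://ocpp.example.com"
```

A test of this shape, run against each endpoint that accepts partial data, would fail as soon as an omitted parameter started being treated as a deletion.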

Timeline (all times CEST)

  • 14:06: Deployment completed
  • Prior to 15:50: Issue discovered and researched
  • 15:54: Rollback of deployment to prevent further damage
  • 16:39: Triggered restore of the database to a separate copy, to retrieve the list of installations as they were before the corruption
  • 16:53: Incident response team formed
  • 18:11: Retrieving Change Log for all installations since 14:00
  • 20:47: Manually restoring charger-level settings
  • 21:10: Restore limited installations as verification
  • 21:18: Restore the remaining installations
  • 21:57: Issue is deemed resolved
Posted Apr 25, 2024 - 15:47 CEST

Resolved
The restore is complete, and the OCPP configuration data should now be restored for all installations.
Posted Apr 23, 2024 - 22:04 CEST
Update
We are still working on restoring the OCPP connections.
Posted Apr 23, 2024 - 20:35 CEST
Update
We are still working on restoring the OCPP connections.

If you are able to set the OCPP URL again yourself, this will also re-establish the OCPP connection.
How to do this is described in the first steps of this guide: https://help.zaptec.com/hc/en-001/articles/12771090871825-Connecting-your-Zaptec-Installation-to-an-OCPP-Server
Posted Apr 23, 2024 - 19:35 CEST
Update
We are continuing to work on restoring the OCPP connections.
Our aim is to restore them as soon as possible, but if you are able to set the OCPP URL again yourself, this will re-establish the OCPP connection.
How to do this is described in the first steps of this guide: https://help.zaptec.com/hc/en-001/articles/12771090871825-Connecting-your-Zaptec-Installation-to-an-OCPP-Server
Posted Apr 23, 2024 - 18:22 CEST
Identified
There was a data corruption incident today that led some customers’ installations to lose the URL specifying connectivity to their OCPP central system. This led the OCPP connections to be dropped. We are currently working on restoring this data.
Posted Apr 23, 2024 - 17:50 CEST
This incident affected: Zaptec Cloud Services (OCPP).