Incident history
Public postmortems for every critical or high-severity OID4Pay incident. Lower-severity events are recorded internally and surfaced here only when they affect external integrators.
2026
No customer-facing production incidents since the public beta launch on 2026-05-14.
During the from-bare provisioning of the production fleet on 2026-05-15, an internal incident affected the primary database before any external traffic was served. The replication TLS certificate files were owned by root and were not readable by the non-root database container, so the primary Postgres crash-looped while the authorization server's static endpoints still answered 200. We traced the failure to the ownership mismatch, fixed it at the source in the database provisioning, and restored the database with all data intact. Takeaway carried forward: health checks must verify that the database itself is responding rather than trust a static 200 from the edge.
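As a rough illustration of that takeaway, the sketch below derives a readiness status from an actual database round trip instead of a static response. The function names are hypothetical, and an in-memory SQLite connection stands in for the real Postgres primary; this is not the production health check.

```python
import sqlite3
from typing import Callable


def database_is_healthy(run_query: Callable[[str], object]) -> bool:
    """Return True only if the database actually answers a trivial query."""
    try:
        return run_query("SELECT 1") == 1
    except Exception:
        # Any connection or query failure counts as unhealthy.
        return False


def readiness_status(run_query: Callable[[str], object]) -> int:
    """HTTP status for a readiness endpoint: 200 only when the database responds."""
    return 200 if database_is_healthy(run_query) else 503


# Assumption: SQLite stands in for the real database so the sketch is runnable.
conn = sqlite3.connect(":memory:")


def query_sqlite(sql: str) -> object:
    """Run a query against the stand-in database and return the first column."""
    return conn.execute(sql).fetchone()[0]


def query_broken(sql: str) -> object:
    """Simulate a database that cannot be reached (for example, crash-looping)."""
    raise ConnectionError("database unreachable")
```

With a check of this shape, `readiness_status(query_broken)` reports 503 even while static endpoints keep answering 200, which is the signal that was missing during the incident.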
Pre-launch hardening
Bringing the fleet up surfaced a set of deployment issues, which we fixed before launch. The fixes were operator-side and did not affect any external integrator:
- Atomic iptables-restore with a deferred-rollback validator (the incremental UFW ruleset did not hold under our kernel and base-image combination).
- Lazy test-fixture imports (per-module fixtures that loaded slow test data on import stalled the test collection phase).
- 127.0.0.1 binds on every service that does not need to listen on 0.0.0.0.
- WireGuard wg0 startup ordering: the tunnel must come up before nginx starts, so the dependency is now explicit in provisioning.
- Terraform templatefile deferred evaluation: the eager form leaked encrypted values into terraform.tfstate.
Postmortem template
Every postmortem entry covers:
- What happened (timeline, impact window).
- Detection (which monitor / alarm fired, who acked).
- Root cause (the blameless contributing factors).
- Resolution (what was changed in production to clear the alarm).
- Action items (what we are changing so it does not recur).
Disclosure cadence
Critical or high incidents: postmortem within 5 working days of resolution. Medium incidents: aggregated quarterly. Low incidents: aggregated annually. Security-sensitive incidents may be redacted until affected merchants confirm patch deployment.
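For the critical and high tier, the deadline arithmetic can be sketched as below. The function name is hypothetical, and the sketch makes two assumptions that the policy text does not state: working days are Monday to Friday with no public-holiday calendar, and the day of resolution itself is not counted.

```python
from datetime import date, timedelta


def postmortem_due(resolved_on: date, working_days: int = 5) -> date:
    """Date by which the postmortem is due, counting Monday to Friday only."""
    due = resolved_on
    remaining = working_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0..4 are Monday..Friday
            remaining -= 1
    return due
```

Under those assumptions, an incident resolved on Friday 2026-05-15 would have its postmortem due by Friday 2026-05-22.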