Root cause analysis (RCA) when something goes wrong
Recently at my job, I have had to deal with incidents many times, and as part of this process we had to read and produce RCA (Root Cause Analysis) docs. We read some from our providers to understand what caused the issues on their side, and we wrote others to let our customers know what happened.
An RCA doc does more than report what happened: it also explains why it happened, how the impact was stopped, and what will be done to prevent it from happening again.
An example of a doc could be as follows:
## Incident summary
A brief summary of the incident, including which system was impacted and what the impact was on end users.
## Root Cause
A detailed description of what caused the incident. It should be detailed enough that people who were not involved can understand it. Add any supporting material, such as links to docs or diagrams.
## Incident Timeline
Timeline including at least: impact started at, detected at, first communication sent at, fixed at.
## Mitigations taken
Actions taken during the incident that stopped the impact (e.g. restarted pods, rotated certificates, shipped a hotfix).
## Preventive actions
Actions planned as follow-ups to prevent this incident from happening again. For each of them, it's nice to add a brief description, an expected delivery date, a status (already started, finished, or still to be planned), and an owner.
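
Outside the template itself, the timeline and the preventive actions are structured enough that they can also be kept as data, which makes it easy to derive numbers such as how long detection took or which follow-ups are still open. Below is a minimal Python sketch of that idea. All names here (`Timeline`, `PreventiveAction`, `open_actions`, the status strings) are hypothetical and only mirror the fields listed in the template; they are not part of any standard or tool.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass
class Timeline:
    """The four minimum timestamps from the 'Incident Timeline' section."""
    impact_started_at: datetime
    detected_at: datetime
    first_communication_at: datetime
    fixed_at: datetime

    def time_to_detect(self) -> timedelta:
        # How long the impact lasted before anyone noticed.
        return self.detected_at - self.impact_started_at

    def time_to_communicate(self) -> timedelta:
        # How long after detection the first communication went out.
        return self.first_communication_at - self.detected_at

    def impact_duration(self) -> timedelta:
        # Total time end users were affected.
        return self.fixed_at - self.impact_started_at


@dataclass
class PreventiveAction:
    """One follow-up from the 'Preventive actions' section."""
    description: str
    expected_delivery: date
    status: str  # assumed values: "to plan", "started", "finished"
    owner: str


def open_actions(actions: list[PreventiveAction]) -> list[PreventiveAction]:
    """Return the follow-ups that are not finished yet."""
    return [a for a in actions if a.status != "finished"]


if __name__ == "__main__":
    # Illustrative values only, not taken from a real incident.
    timeline = Timeline(
        impact_started_at=datetime(2024, 1, 10, 9, 0),
        detected_at=datetime(2024, 1, 10, 9, 12),
        first_communication_at=datetime(2024, 1, 10, 9, 30),
        fixed_at=datetime(2024, 1, 10, 10, 45),
    )
    actions = [
        PreventiveAction("Alert on certificate expiry", date(2024, 2, 1), "started", "alice"),
        PreventiveAction("Document rotation procedure", date(2024, 1, 20), "finished", "bob"),
    ]
    print("Time to detect:", timeline.time_to_detect())
    print("Impact duration:", timeline.impact_duration())
    for action in open_actions(actions):
        print("Still open:", action.description, "owner:", action.owner)
```

Keeping the timestamps as `datetime` values rather than free text means the durations are computed instead of typed by hand, which avoids arithmetic slips in the published doc.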