Coordinating human activities across organizations and disciplines is fundamental to DevOps. This requires having documented procedures to handle any situation, tools that enable participants to collaborate effectively, and a shared understanding of what information needs to be captured and communicated to work — especially when the actors are likely to be separated by space and time.
A DevOps procedure is initiated for a reason. These situations include scheduled maintenance, a response to an alert (detection of a condition that deserves attention), or a response to a request for support. In each case, there should be a ‘ticket’ (a record in a tool) that notifies a responsible DevOps engineer to work the issue. Ideally, the relevant procedure that applies to a ticket should be obvious — ideally, referenced explicitly.
When responding to an alert or a support request (usually a complaint about a service malfunction), usually it begins by confirming the reported condition. This requires gathering information about the context and collecting diagnostics to aid in troubleshooting. Ideally, the ticket clearly identified the problem; otherwise, interactions would be needed to gain such clarity. Humans are routinely terrible about assuming the recipient of a request has all the necessary context to understand what is being asked of them and why. To mitigate this inefficiency, tooling and procedural documentation are usually provided to guide how tickets should be written, so that many questions and answers are not needed afterward to satisfy the request.
The engineer who works a ticket should capture the service configuration, relevant logs to determine the failure mode, and any other data for the context associated with the problem for analysis toward an operational fix or for submitting a bug, if applicable.
A designated channel should be used for engineers to collaborate. Each engineer must record in real time the actions taken to troubleshoot, analyze, and correct the problem. This aids in coordination between multiple individuals in different roles across organizations to work together. Good communication enables everyone involved to stay informed and avoid taking actions that interfere with each other. Moreover, an accurate record can be reviewed later as a post-mortem. In the course of the
Severe incidents, such as those that cause a service outage, demand a root cause analysis toward preventative actions, such as process improvements, procedure documentation, operational tooling, or developing a permanent fix (for a software bug). This depends on the ticket capturing the necessary information to trace how the problem originated, such as the transaction processing in progress, events or metrics indicating resource utilization out of bounds (e.g., out of memory, insufficient cpu, I/O bandwidth limited, storage exhaustion) or performance impairment (e.g., lock contention, queue length, latency), or anything else that appears out of the ordinary.
One of the biggest impediments to verifying that a problem is resolved is that DevOps normally does not have access to the service features being exercises by end users. When a functional problem is reported by users, it may not be possible for DevOps to confirm that the problem is fixed from the user’s perspective. Communication with users may need to be mediated by customer support staff. The information on the ticket would need to facilitate this interaction.
Accurate record-keeping also enables later audits in case there are subsequent problems. The record of actions taken can be used to analyze whether these actions (such as configuration changes or software patches) are contributing to other problems. Troubleshooting procedures should include data mining of past incidents (have we seen this problem before? how did we fix it previously?) and auditing what procedures may have impacted the service under investigation (what could have broken this?).
The above guidance can be summarized as follows.
- Say what you do
- Do what you say
- Write it down