Tag Archives: DevOps

DevOps Transparency and Coordination

Coordinating human activities across organizations and disciplines is fundamental to DevOps. This requires having documented procedures to handle any situation, tools that enable participants to collaborate effectively, and a shared understanding of what information needs to be captured and communicated to work — especially when the actors are likely to be separated by space and time.

A DevOps procedure is initiated for a reason. These situations include scheduled maintenance, a response to an alert (detection of a condition that deserves attention), or a response to a request for support. In each case, there should be a ‘ticket’ (a record in a tool) that notifies a responsible DevOps engineer to work the issue. Ideally, the relevant procedure that applies to a ticket should be obvious — ideally, referenced explicitly.

When responding to an alert or a support request (usually a complaint about a service malfunction), usually it begins by confirming the reported condition. This requires gathering information about the context and collecting diagnostics to aid in troubleshooting. Ideally, the ticket clearly identified the problem; otherwise, interactions would be needed to gain such clarity. Humans are routinely terrible about assuming the recipient of a request has all the necessary context to understand what is being asked of them and why. To mitigate this inefficiency, tooling and procedural documentation are usually provided to guide how tickets should be written, so that many questions and answers are not needed afterward to satisfy the request.

The engineer who works a ticket should capture the service configuration, relevant logs to determine the failure mode, and any other data for the context associated with the problem for analysis toward an operational fix or for submitting a bug, if applicable.

A designated channel should be used for engineers to collaborate. Each engineer must record in real time the actions taken to troubleshoot, analyze, and correct the problem. This aids in coordination between multiple individuals in different roles across organizations to work together. Good communication enables everyone involved to stay informed and avoid taking actions that interfere with each other. Moreover, an accurate record can be reviewed later as a post-mortem. In the course of the

Severe incidents, such as those that cause a service outage, demand a root cause analysis toward preventative actions, such as process improvements, procedure documentation, operational tooling, or developing a permanent fix (for a software bug). This depends on the ticket capturing the necessary information to trace how the problem originated, such as the transaction processing in progress, events or metrics indicating resource utilization out of bounds (e.g., out of memory, insufficient cpu, I/O bandwidth limited, storage exhaustion) or performance impairment (e.g., lock contention, queue length, latency), or anything else that appears out of the ordinary.

One of the biggest impediments to verifying that a problem is resolved is that DevOps normally does not have access to the service features being exercises by end users. When a functional problem is reported by users, it may not be possible for DevOps to confirm that the problem is fixed from the user’s perspective. Communication with users may need to be mediated by customer support staff. The information on the ticket would need to facilitate this interaction.

Accurate record-keeping also enables later audits in case there are subsequent problems. The record of actions taken can be used to analyze whether these actions (such as configuration changes or software patches) are contributing to other problems. Troubleshooting procedures should include data mining of past incidents (have we seen this problem before? how did we fix it previously?) and auditing what procedures may have impacted the service under investigation (what could have broken this?).

The above guidance can be summarized as follows.

  • Say what you do
  • Do what you say
  • Write it down

Pain Feedback Loops for improvement

Feedback loops are very important to regulate behavior within an enterprise. This applies to both rewarding positive behavior, and encouraging more of it, as well as correcting negative behavior to get less of it. Continuous improvement is about feedback loops.

Focusing on negative feedback, we should recognize a phenomenon called ‘pain’. In this context, it refers mostly to pains in the ass, which are discomforts, inconveniences, and frustrations which burden people’s life, draining their time and energy in unproductive ways. In DevOps, when high severity operational problems arise, such a service outage in the middle of the night, pain manifests in a pager alert that wakes up an engineer to troubleshoot the incident and resolve the problem.

When fires need to be fought, fire-fighters experience this pain in proportion to the number of fires and their severity. Development teams tend to avoid work with the goal being to deliver features with faster time to market. They inevitably cut corners in areas that make operations more efficient, because they tend not to be placed in the position of experiencing the pain, when it comes. Disconnecting development priorities from operational responsibilities is a recipe for the infliction of pain on those who do not deserve it, and the result is an excess of unexpected pain that should have been foreseen and mitigated. The integration of development with operations into DevOps is intended to establish this connection. This connection must not be undermined by paying mere lip-service to operations without putting real skin in the game, so that development staff experience pain for operational failures as much as operations staff.

DevOps Mentality

DevOps – development and operations

Developers, who are new to Operations, as they become immersed in DevOps culture, may envision that their involvement in operations follows after development is done. However, operations are not an after-thought. This article is to enumerate some operations-related impacts to development practices that may not be at the forefront of a developer’s mind, but they should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

If logging an ERROR level message, it should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, It should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. Accounting for the volume of messages collected per service instance per day multiplied by one year retention and the number of service instances deployed, that may be hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult, when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by classifying messages at a finer-grained log level (i.e., INFO, DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the most significant properties to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem, when trouble shooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fall back mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.

Pay special attention to methods and transactions with security events.

  1. Redirect a request to access a user interface to login
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged with a format that allows such messages to be classified, so that they can be directed to SIEM for special handling. SIEM is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.

Audiences

One of the most common mistakes is to conflate the log stream directed toward operations and information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency to the steps that are executing. This includes integrating other services to collaborate or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended to not differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended for feedback to end users, are directed to operational logs. Operations have no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to noise.

Developers must pay attention to the intended audience of messages, so that they are either directed to operational logs or end users or both.

Telemetry

“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that these outages go toward calculating the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for service availability
  • Latency – the time between receiving a request and sending its corresponding response tell us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tell us how well the system is performing with regard to work loads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (i.e., data purging), to maintain reasonable performance (i.e., recalculate statistics), and to enforce policies (i.e., secrets rotation)

Pipelines

Operations involve actions initiated through human access (by operations staff) and systems integration (i.e., by capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available as pipelines (programmatic workflows) operationally. Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

Provisioning – creation and termination of the service subscription

Upgrades and patches – deploying software updates and bug fixes

Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization

Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution

Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)

Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)

Corrective actions – interventions for problem resolution, such as stopping and restarting of services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (prevent security breaches), replacing defective resources (e.g., kernel deadlocks)

See also:

  1. The Twelve-Factor App
  2. 10 key attributes of cloud native applications