DevOps Mentality

Developers who are new to operations may assume, as they become immersed in DevOps culture, that their involvement in operations begins only after development is done. However, operations are not an afterthought. This article enumerates some operations-related impacts on development practices that may not be at the forefront of a developer’s mind, but should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

An ERROR level message should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, it should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. The volume of messages collected per service instance per day, multiplied by one year of retention and by the number of service instances deployed, may amount to hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.
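As a rough illustration of the storage arithmetic, the following Python sketch multiplies message size, message rate, instance count, and retention. Every figure in it (500-byte messages, two million messages per instance per day, 1,000 instances, 100 dollars per terabyte-month for storage and indexing) is a hypothetical assumption chosen only to show the order of magnitude, not a measurement or a vendor price.

```python
def retained_volume_tb(bytes_per_message, messages_per_instance_per_day,
                       instances, retention_days=365):
    """Terabytes (10**12 bytes) held once the retention window is full."""
    total_bytes = (bytes_per_message * messages_per_instance_per_day
                   * instances * retention_days)
    return total_bytes / 1e12


def monthly_storage_cost(volume_tb, dollars_per_tb_month):
    """Monthly bill for keeping volume_tb stored and searchable."""
    return volume_tb * dollars_per_tb_month


# Hypothetical inputs, for illustration only.
volume = retained_volume_tb(500, 2_000_000, 1_000)
cost = monthly_storage_cost(volume, 100)
print(f"{volume:.0f} TB retained, about ${cost:,.0f} per month")
```

With these assumed inputs the sketch reports 365 TB and about 36,500 dollars per month, which is why trimming even a few messages per request can matter at scale.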

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by classifying messages at a finer-grained log level (e.g., INFO, DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.
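One way to picture this is a fixed line layout in which the identifier always occupies the same position, so that an alert rule can match it with a simple pattern. The sketch below uses Python's standard logging module; the layout, the helper names (log_event, extract_msg_id), and the identifier shape (uppercase prefix, hyphen, four digits) are assumptions made for illustration.

```python
import io
import logging
import re

# Hypothetical layout: timestamp, severity, message identifier, free text.
LOG_FORMAT = "%(asctime)s %(levelname)s %(msg_id)s %(message)s"

# Identifiers such as OLTP-0123: uppercase prefix, hyphen, four digits.
MSG_ID_PATTERN = re.compile(r"\b[A-Z]+-\d{4}\b")


def log_event(logger, level, msg_id, message, *args):
    """Log a message tagged with a stable identifier."""
    logger.log(level, message, *args, extra={"msg_id": msg_id})


def extract_msg_id(line):
    """Return the identifier an alert rule would key off, or None."""
    match = MSG_ID_PATTERN.search(line)
    return match.group(0) if match else None


# Demonstration: capture output in memory instead of a real log collector.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(LOG_FORMAT))
logger = logging.getLogger("demo.msgid")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

log_event(logger, logging.ERROR, "OLTP-0123",
          "connection pool exhausted for database %s", "orders")
print(extract_msg_id(stream.getvalue()))
```

Because the format requires msg_id, every record sent to this handler has to supply one, which is the point: the free text can be reworded later without breaking the alert that matches OLTP-0123.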

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the properties most significant to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem when troubleshooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.
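A minimal sketch of carrying context while masking sensitive values might look like the following. The key names in SENSITIVE_KEYS and the helper functions are hypothetical; a real service would derive the list from its own data classification.

```python
# Hypothetical list of keys whose values must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "ssn", "email"}


def safe_context(context):
    """Copy the context, masking values whose keys are sensitive."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in context.items()
    }


def format_context(context):
    """Render key=value pairs in a stable (sorted) order."""
    masked = safe_context(context)
    return " ".join(f"{key}={masked[key]}" for key in sorted(masked))


line = "OLTP-0123 payment rejected " + format_context(
    {"order_id": "A-17", "amount": 42, "password": "hunter2"}
)
print(line)
```

Masking by key name only catches what the list anticipates. It does not detect sensitive data stored under an unexpected key, so it complements, rather than replaces, care at the call site.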

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.
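With Python's standard logging module, one possible way to follow this advice is to attach the exception's traceback only when the logger is enabled for DEBUG. The function names and the OLTP-0123 identifier below are illustrative assumptions.

```python
import io
import logging


def log_failure(logger, msg_id, exc):
    """Log a one-line error; add the stack trace only at DEBUG level."""
    include_trace = logger.isEnabledFor(logging.DEBUG)
    logger.error("%s %s: %s", msg_id, type(exc).__name__, exc,
                 exc_info=exc if include_trace else None)


def run_demo(level):
    """Trigger and log a failure with the logger set to `level`."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.trace.{level}")
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(level)
    logger.propagate = False
    try:
        1 / 0
    except ZeroDivisionError as exc:
        log_failure(logger, "OLTP-0123", exc)
    return stream.getvalue()


print(run_demo(logging.INFO), end="")
```

At INFO the output is a single line naming the error. Switching the same logger to DEBUG adds the traceback for developers without changing the call site.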

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fallback mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.
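The retry pattern described above can be sketched as follows: attempts are counted silently, and a single summary is logged only when every attempt has failed. The function names, the OLTP-0200 identifier, and the broad exception handling are simplifying assumptions, not a recommended production policy.

```python
import logging
import time


def call_with_retries(operation, logger, max_attempts=3, delay_seconds=0.0):
    """Retry quietly; log one ERROR summary only if every attempt fails."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # broad catch is a simplification
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # OLTP-0200 is a hypothetical message identifier.
    logger.error("OLTP-0200 operation failed after %d attempts: %s",
                 max_attempts, last_error)
    raise last_error


class CountingHandler(logging.Handler):
    """Test double that counts emitted records."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        self.count += 1


def make_flaky(failures):
    """Return an operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("temporarily unavailable")
        return "ok"

    return operation


handler = CountingHandler()
logger = logging.getLogger("demo.retry")
logger.addHandler(handler)
logger.propagate = False

result = call_with_retries(make_flaky(2), logger)
print(result, handler.count)
```

Here the operation succeeds on the third attempt and nothing is logged. Had all three attempts failed, exactly one error record would have been emitted before the exception was re-raised to the caller, which could then escalate to a fallback.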

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.
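If the access log cannot simply omit probe requests, a logging filter is one way to drop the successful ones while keeping failures. The sketch below assumes that each record carries hypothetical path and status attributes and that the probes are served at /healthz and /readyz; both are assumptions about the service, not facts about any framework.

```python
import logging

# Hypothetical probe endpoints.
PROBE_PATHS = ("/healthz", "/readyz")


class SuppressSuccessfulProbes(logging.Filter):
    """Drop records for probe requests that returned a 2xx status."""

    def filter(self, record):
        path = getattr(record, "path", "")
        status = getattr(record, "status", 0)
        is_ok_probe = path in PROBE_PATHS and 200 <= status < 300
        return not is_ok_probe


probe_filter = SuppressSuccessfulProbes()
ok_probe = logging.makeLogRecord({"path": "/healthz", "status": 200})
failed_probe = logging.makeLogRecord({"path": "/healthz", "status": 503})
order = logging.makeLogRecord({"path": "/orders", "status": 200})
print([probe_filter.filter(r) for r in (ok_probe, failed_probe, order)])
```

Only the successful probe is suppressed. A failing probe still reaches the log, because that is an operational signal rather than noise.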

Pay special attention to methods and transactions that involve security events, such as the following.

  1. Redirecting a request for a user interface to the login page
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged with a format that allows such messages to be classified, so that they can be directed to SIEM for special handling. SIEM is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.
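One possible shape for such a format is a dedicated logger that emits each security event as a JSON object with a fixed category field, which a collector can use to route the stream. In this sketch the event names, the "security" logger name, and the field names are assumptions, and an in-memory stream stands in for whatever forwarder delivers the events to SIEM.

```python
import io
import json
import logging

# Hypothetical names for the five kinds of event listed above.
SECURITY_EVENTS = {
    "LOGIN_REDIRECT",
    "LOGIN_SUCCESS",
    "LOGIN_FAILURE",
    "PRIVILEGED_ACTION",
    "ACCESS_DENIED",
}


def build_security_logger(stream):
    """Create a logger whose output is kept apart from operational logs."""
    logger = logging.getLogger("security")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep events out of the general log stream
    return logger


def log_security_event(logger, event_type, user_id, **details):
    """Emit one classified security event as a JSON line."""
    if event_type not in SECURITY_EVENTS:
        raise ValueError(f"unknown security event: {event_type}")
    record = {"category": "security", "event": event_type, "user_id": user_id}
    record.update(details)
    logger.info(json.dumps(record, sort_keys=True))


siem_stream = io.StringIO()
security_log = build_security_logger(siem_stream)
log_security_event(security_log, "LOGIN_FAILURE", "u-1001",
                   source_ip="203.0.113.7")
print(siem_stream.getvalue(), end="")
```

Rejecting unknown event types keeps the vocabulary closed, so the rules on the receiving side can rely on a small, documented set of values.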

Audiences

One of the most common mistakes is to conflate the log stream directed toward operations and information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency to the steps that are executing. This includes integrating other services to collaborate or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended to not differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended for feedback to end users, are directed to operational logs. Operations have no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to noise.

Developers must pay attention to the intended audience of each message, so that it is directed to operational logs, to end users, or to both.
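One way to make the audience explicit in code is to require every message to declare it. The sketch below uses a Python Flag enumeration so that a message can target operations, end users, or both; the names Audience and dispatch, and the use of plain lists as destinations, are illustrative assumptions.

```python
from enum import Flag, auto


class Audience(Flag):
    """Who a message is meant for; values can be combined with |."""
    OPERATIONS = auto()
    END_USER = auto()


def dispatch(message, audience, ops_log, user_feed):
    """Append the message to each destination its audience includes."""
    if Audience.OPERATIONS in audience:
        ops_log.append(message)
    if Audience.END_USER in audience:
        user_feed.append(message)


ops_log, user_feed = [], []
dispatch("workflow step 3 rejected: approver not found",
         Audience.END_USER, ops_log, user_feed)
dispatch("database connection pool exhausted",
         Audience.OPERATIONS, ops_log, user_feed)
dispatch("service degraded, requests are being queued",
         Audience.OPERATIONS | Audience.END_USER, ops_log, user_feed)
print(len(ops_log), len(user_feed))
```

Making the audience a required argument forces the decision at the point where the message is written, which is when the developer knows best who can act on it.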

Telemetry

“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that outage durations count toward service availability and can be compared against the Recovery Time Objective (RTO); any data lost in an outage is measured against the Recovery Point Objective (RPO)
  • Latency – the time between receiving a request and sending its corresponding response tells us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tells us how well the system is performing with regard to workloads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future
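To illustrate how recorded outage times turn into an availability figure, the following sketch divides uptime by the length of the measurement window. The 30-day window and the two outage intervals are hypothetical, and the function assumes that outages do not overlap and lie inside the window.

```python
def availability(window_seconds, outages):
    """Fraction of the window during which the service was up.

    outages: (start, end) pairs in seconds, assumed non-overlapping
    and contained in the window.
    """
    downtime = sum(end - start for start, end in outages)
    return (window_seconds - downtime) / window_seconds


# Hypothetical 30-day window with two outages of 1,296 seconds each.
thirty_days = 30 * 24 * 60 * 60
outages = [(1_000, 2_296), (500_000, 501_296)]
print(f"{availability(thirty_days, outages):.3%}")
```

About 43 minutes of downtime in 30 days corresponds to 99.9 percent availability, which shows how little outage time a "three nines" objective allows.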

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs
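A threshold crossing event differs from a raw comparison in that it fires when the metric moves across the limit, not on every sample that is over it. The sketch below shows the idea on a list of (timestamp, value) samples; the function name, the event labels, and the latency figures are assumptions for illustration.

```python
def threshold_crossings(samples, threshold):
    """Return (timestamp, "raised" | "cleared") events for a series.

    samples: iterable of (timestamp, value) pairs in time order.
    An event is produced only when the value crosses the threshold.
    """
    events = []
    above = False
    for timestamp, value in samples:
        if value > threshold and not above:
            events.append((timestamp, "raised"))
            above = True
        elif value <= threshold and above:
            events.append((timestamp, "cleared"))
            above = False
    return events


# Hypothetical latency samples in milliseconds against a 200 ms objective.
latency_ms = [(0, 120), (1, 250), (2, 300), (3, 180), (4, 260)]
print(threshold_crossings(latency_ms, 200))
```

The five samples yield three events (raised at 1, cleared at 3, raised at 4) rather than three separate alerts for the three samples over the limit, which keeps alerting from becoming its own source of noise.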

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (e.g., data purging), to maintain reasonable performance (e.g., recalculating statistics), and to enforce policies (e.g., secrets rotation)
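As an example of the auto-scaling item, a simple proportional rule sizes the service so that the load per instance returns to a target, within fixed bounds. The sketch below is one such rule under stated assumptions (a single load metric and hypothetical limits of 1 and 10 instances); it is similar in spirit to the proportional calculation that container orchestrators document, but it is not any particular product's implementation.

```python
import math


def desired_replicas(current_replicas, observed_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Scale the instance count in proportion to observed/target load.

    observed_load and target_load are per-instance values of the same
    metric (for example, requests per second or CPU percent).
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_replicas * observed_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, 90, 60))   # overloaded: scale out
print(desired_replicas(4, 30, 60))   # underused: scale in
print(desired_replicas(4, 300, 60))  # capped at max_replicas
```

The upper bound matters operationally: hitting it is a capacity event worth a metric and possibly an alert, since the service can no longer absorb more load by scaling on its own.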

Pipelines

Operations involve actions initiated through human access (by operations staff) and systems integration (e.g., by capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available as pipelines (programmatic workflows) operationally. Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

Provisioning – creation and termination of the service subscription

Upgrades and patches – deploying software updates and bug fixes

Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization

Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution

Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)

Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)

Corrective actions – interventions for problem resolution, such as stopping and restarting services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (to prevent security breaches), and replacing defective resources (e.g., kernel deadlocks)
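To make "a precise set of information" concrete, each pipeline can declare the parameters it accepts, and a request can be rejected before execution if it supplies too few or too many. The pipeline names and parameter names in this sketch are hypothetical and stand in for whatever a real pipeline system defines.

```python
# Hypothetical catalog: pipeline name -> exact set of required parameters.
PIPELINES = {
    "provision": {"tenant_id", "plan"},
    "scale": {"tenant_id", "replicas"},
    "rotate_secrets": {"tenant_id"},
}


def validate_request(action, params):
    """Return a normalized request, or raise ValueError if it is imprecise."""
    required = PIPELINES.get(action)
    if required is None:
        raise ValueError(f"unknown pipeline: {action}")
    supplied = set(params)
    missing = required - supplied
    unexpected = supplied - required
    if missing or unexpected:
        raise ValueError(
            f"{action}: missing={sorted(missing)} "
            f"unexpected={sorted(unexpected)}"
        )
    return {"action": action, "params": dict(params)}


request = validate_request("scale", {"tenant_id": "t-42", "replicas": 6})
print(request)
```

Rejecting unexpected parameters as well as missing ones keeps operators from smuggling ad hoc changes through a pipeline that was never designed or tested for them.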


cloud services for the enterprise

The Innovator’s Dilemma describes how the choice to sustain an incumbent technology may need to be weighed against pursuing disruptive new technologies. Nascent technologies tend to solve a desirable subset of a problem with greater efficiency. They change the game by making what used to be a costly high-end technology available as a commodity that is affordable to the masses. It turns out that high-end customers can often live without the rich capabilities of the costly solution, and they would rather save on cost. Meanwhile, with the success that the low-end solution is gaining in the market, it can invest in maturing its product to encroach into the high-end market. Eventually, the incumbent product’s market is entirely taken over by the rapidly growing upstart, who was able to establish a foothold in a larger installed base.

That is the situation we find ourselves in today with enterprise applications. Large companies rely on expensive software licenses for Customer Relationship Management, Enterprise Resource Planning, and Human Capital Management applications deployed on-premise. Small and medium sized businesses may not be able to afford the same kinds of feature rich software. Not only are the software license and annual maintenance costs expensive, but commercial off-the-shelf software for enterprises is typically a platform that requires months of after-market solution development, customization, and system integration to tailor it to the business policies and processes specific to the enterprise. The evolution to cloud services aims to disrupt this situation.

Let us explore the ways that cloud services aim to be disruptive.

As described above, traditional enterprise software products are platforms. An incumbent product that wants to evolve to cloud without disrupting its code base will merely be operating in a sustaining mode, not achieving significant gains in efficiency. Because such a product remains PaaS-like, the prohibitive cost and onerous effort of after-market solution development remain a huge barrier to entry for customers. To become SaaS-like, a cloud service must be useful by default, immediately of value to the end users of its enterprise tenants.

Cloud services are disruptive by providing improved user experiences. Of course, this means a friendlier Web user interface that is optimized for users to perform their work more easily and intuitively. User interfaces need to be responsive to device screen size, orientation, locale, and input method. Cloud services also provide advantages for enterprise collaboration by enabling the work force to be mobile. Workers need to become more decoupled in space and time, so they can be more geographically dispersed and global in reach. Cloud services should assist in transforming how employees work together, not just replacing the same old ways of doing our jobs using a Web browser instead of a desktop application. Mobile applications may even enable new ways of interacting that are not recognizable today.

Cloud services are disruptive economically. Subscription pricing replaces perpetual software licensing and annual maintenance costs along with the capital costs of hardware infrastructure, IT staffing to operate an on-premise deployment, and on-going infrastructure maintenance, upgrades, and scaling. Subscription pricing in and of itself is not transformational. It is only superficially different by virtue of amortizing the traditional cost of on-premise deployment over many recurring payments. The main benefit is in eliminating the financial risk associated with huge up-front capital expenditures in case the project fails. Migrating a traditional on-premise application into the cloud is not really financially disruptive unless it can significantly alter the costs involved. In fact, by taking on the capital cost of infrastructure and the operational cost of the deployment, the software vendor has now cannibalized its on-premise application business and replaced it with a lower margin business with high upfront costs and risk. This is a terrible formula for profitability and for a healthy business.

Multi-tenancy provides this disruptive benefit. Multi-tenancy enables a cloud service to support users from multiple tenants. This provides significant cost advantages over single-tenant deployments in terms of resource utilization, simplified operations, and economies of scale. Higher deployment density translates directly into higher profit, but by itself multi-tenancy provides no visible benefit to users. The disruption comes when the vendor realizes that at scale multi-tenancy enables a new tenant to be provisioned at near zero cost. This opens up the possibility of offering an entry level service to new tenants at a low price point, because the marginal cost to the vendor is close to zero. Entry-level pricing at or near zero is transformational by virtue of making a cloud service available to small enterprises who would never have been able to afford such capabilities in the past. This enables innovation to be done by individual or small scale entrepreneurs (start-ups), who have the most radical, risky, unconventional, and paradigm-shifting ideas.

Elastic scaling provides another disruptive benefit. It enables a cloud service to perform as required as a tenant grows from seeding a proof-of-concept demonstrator to large scale (so-called Web scale) production. The expertise, techniques, and resources needed to scale a deployment are difficult and costly to acquire. When a vendor can provide this pain-free, an enormous burden is lifted from the tenant’s shoulders.

Cloud services evolve with the times through DevOps and continuous delivery. Traditional on-premise applications tend to be upgraded rarely due to the risk and high development cost of customization, which tends to suffer from compatibility breakage. Enterprise applications are often not upgraded for years. “If it ain’t broke, don’t fix it.” Even though the software vendor may be investing heavily in feature enhancements, functional and performance improvements, and other innovations, users don’t see the benefits in a timely manner, because the enterprise cannot afford the pain of upgrading. When the vendor operates the software as a SaaS offering, upgrades can be deployed frequently and painlessly for all tenants. Users enjoy the benefit of software improvements immediately, as the cloud service stays up-to-date with the current competitive business environment.

By combining the abilities to provision a tenant to be useful immediately by default, to start at near zero cost, to scale with growth, and to evolve with the times, cloud services provide tools that can enable business agility. A business needs to be able to turn on a dime, changing what it sells and how it operates in order to stay ahead of its competitors. Cloud services are innovative and disruptive in these ways in order to enable their enterprise tenants to be innovative and disruptive.

vertical integration

Applications have been pursuing operational efficiency through vertical integration for years. This is generally understood to mean assembling infrastructure (machine and operating system) with platform components (database, middleware) and application components into an engineered system that is pre-integrated and optimized to work together.

Now, the evolution to cloud services is following the same pattern. IaaS is integrated into PaaS. IaaS and PaaS are integrated with application components to deliver SaaS. However, just as we see in on-premise enterprise information systems, applications do not operate in silos. They are integrated by business processes, and they must collaborate to enforce business policies across business functions and organizations.

Marketing is deeply interwoven with sales. Product configuration, pricing, and quotation are tied to order capture and fulfillment. Fulfillment involves inventory, shipping, provisioning, billing, and financial accounting. Customer service is linked with various service assurance components, billing care, and also quote and order capture. All components need views of accounts, assets (products and services subscribed to), agreements, contracts, and warranties. Service usage and demand all feed analytics to drive marketing campaigns that generate more sales. What a tangled web.

What is clear from this picture is that vertical integration does not end with infrastructure, platform, and a software application. Applications contribute components that combine with business processes and business policies to construct higher level applications. This may continue for many layers of integration according to the self-similar paradigm.

The evolution to cloud should recognize the need for integration of SaaS components with business processes and business policies. However, it does not appear as though cloud services have anticipated the need for vertical integration to continue in layers. To construct assemblies, the platform should provide a means of defining such assemblies, so that they can be replicated by packaging and deploying them on infrastructure at various scales. The platform should provide a consistent programming and configuration model for extending and customizing applications in ways that are natural to being reapplied layer by layer.

Vertical integration is not an elegantly solved problem for on-premise applications. On-premise application integration is notoriously complex due to heterogeneity and vastly inconsistent interfaces and programming models. One component’s notion of customer is another’s notion of party. Two components with notions of customer do not agree on its schema and semantics. A product to one component is an offer to another. System integration projects routinely cost five to ten times the software license cost of the application components, because of the difficulty of overcoming impedance mismatches, gaps in functional capabilities, duct tape, and bubblegum.

Examining today’s cloud platforms and the applications built upon them, it looks as though we have not learned much from our past mistakes. We are faced with the same costly and clunky integration nightmare with no breakthrough in sight.