DevOps Mentality

Developers who are new to operations may assume, as they become immersed in DevOps culture, that their involvement in operations begins only after development is done. However, operations are not an afterthought. This article enumerates some operations-related impacts on development practices that may not be at the forefront of a developer’s mind, but should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

An ERROR level message should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, it should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. Accounting for the volume of messages collected per service instance per day multiplied by one year retention and the number of service instances deployed, that may be hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.
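To make the storage cost estimate above concrete, here is a rough back-of-the-envelope calculation. Every number in it (message volume per instance, instance count, price per gigabyte) is an illustrative assumption, not a measurement from any real deployment.

```python
def retained_log_terabytes(gb_per_instance_per_day: float,
                           instances: int,
                           retention_days: int = 365) -> float:
    """Steady-state volume of retained logs, in terabytes (1 TB = 1000 GB)."""
    return gb_per_instance_per_day * instances * retention_days / 1000.0


def monthly_storage_cost(terabytes: float, dollars_per_gb_month: float) -> float:
    """Monthly cost of keeping that volume stored and indexed."""
    return terabytes * 1000.0 * dollars_per_gb_month


# Assumed inputs: 2 GB/day per instance, 500 instances, one year retention,
# and a hypothetical blended price of $0.10 per GB-month.
volume_tb = retained_log_terabytes(2.0, 500)
cost = monthly_storage_cost(volume_tb, 0.10)
print(f"{volume_tb:.0f} TB retained, about ${cost:,.0f} per month")
```

Under these assumed inputs the result lands in the range described above: hundreds of terabytes, and tens of thousands of dollars per month.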

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by reclassifying messages at a lower-severity log level (e.g., INFO, DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.
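A minimal sketch of what keying alerts off a message identifier might look like. The line format (`LEVEL ID text`), the rule table, and the action name `page-on-call` are all hypothetical, chosen only to illustrate the idea.

```python
import re

# Hypothetical format: "<LEVEL> <ID> <free text>", e.g. "ERROR OLTP-0123 ...".
LOG_LINE = re.compile(r"^(?P<level>[A-Z]+) (?P<msg_id>[A-Z]+-\d{4}) (?P<text>.*)$")

# Hypothetical alert rules, keyed by message identifier rather than by text.
ALERT_RULES = {"OLTP-0123": "page-on-call"}


def format_message(level: str, msg_id: str, text: str) -> str:
    """Produce a log line in the assumed format."""
    return f"{level} {msg_id} {text}"


def alert_for(line: str):
    """Return the alert action for a log line, or None if no rule matches."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return ALERT_RULES.get(match.group("msg_id"))
```

Because the rule matches only the identifier, the free text of the message can be reworded without silently breaking the alert.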

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the properties most significant to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem when troubleshooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.
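One way to carry context while keeping sensitive values out of the log is to mask them at the point where the message is rendered. This is only a sketch; the set of sensitive keys is an assumed example and would need to follow your own data privacy policy.

```python
# Assumed list of sensitive keys; adjust to match your own policy.
SENSITIVE_KEYS = {"password", "token", "ssn"}


def render_context(**context) -> str:
    """Render key=value pairs in sorted order, masking sensitive values."""
    parts = []
    for key in sorted(context):
        value = "***" if key.lower() in SENSITIVE_KEYS else context[key]
        parts.append(f"{key}={value}")
    return " ".join(parts)
```

For example, `render_context(order_id=42, password="hunter2")` keeps the order identifier for troubleshooting and masks the credential.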

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fallback mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.
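The retry guidance above can be sketched as follows: retries are counted quietly, success is silent, and a single summary is logged only when every attempt has failed. The function name, the logger name, and the message identifier `RETRY-0001` are hypothetical.

```python
import logging
import time

log = logging.getLogger("retry-demo")


def call_with_retries(operation, attempts: int = 3, delay_seconds: float = 0.0):
    """Run operation, retrying quietly; log once, only when all attempts fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()  # success is silent, even after retries
        except Exception as error:  # a real service would catch narrower types
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    # One summary message instead of one message per failed attempt.
    log.error("RETRY-0001 operation failed after %d attempts: %s",
              attempts, last_error)
    raise last_error
```

The caller still sees the final exception, so it can escalate to a fallback mechanism or circuit breaker.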

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.

Pay special attention to methods and transactions with security events.

  1. Redirecting a request to access a user interface to the login page
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged with a format that allows such messages to be classified, so that they can be directed to SIEM for special handling. SIEM is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.
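As a sketch of a classifiable format, security events could be emitted as JSON records carrying an explicit category, so that a log router can separate them from operational messages. The field names, the event names, and the two destinations below are assumptions for illustration; they do not describe the input format of any particular SIEM product.

```python
import json

# Hypothetical event names, mirroring the list of security events above.
SECURITY_EVENTS = {
    "login_redirect", "login_success", "login_failure",
    "privileged_action", "access_denied",
}


def security_event(event: str, user: str, **details) -> str:
    """Build a JSON log record that is explicitly tagged as a security event."""
    if event not in SECURITY_EVENTS:
        raise ValueError(f"unknown security event: {event}")
    record = {**details, "category": "security", "event": event, "user": user}
    return json.dumps(record, sort_keys=True)


def destination(line: str) -> str:
    """Return "siem" for security records and "operations" for everything else."""
    try:
        record = json.loads(line)
    except ValueError:  # not JSON, so treat it as an ordinary log line
        return "operations"
    if isinstance(record, dict) and record.get("category") == "security":
        return "siem"
    return "operations"
```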

Audiences

One of the most common mistakes is to conflate the log stream directed toward operations and information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency to the steps that are executing. This includes integrating other services to collaborate or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended to not differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended for feedback to end users, are directed to operational logs. Operations have no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to noise.

Developers must pay attention to the intended audience of messages, so that they are either directed to operational logs or end users or both.

Telemetry

“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that these outages can be measured against the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for service availability
  • Latency – the time between receiving a request and sending its corresponding response tells us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tells us how well the system is performing with regard to workloads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs
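A threshold crossing event can be sketched as a scan over a time series that reports only the moment the metric rises above the limit, rather than every sample that stays above it. The sample data and the 400 ms limit are made up for illustration; a real time series database would typically evaluate such rules itself.

```python
def threshold_crossings(samples, threshold):
    """Yield (timestamp, value) each time the series rises above the threshold.

    Only the upward crossing is reported, so a metric that stays high
    produces one event rather than a stream of repeated alerts.
    """
    above = False
    for timestamp, value in samples:
        now_above = value > threshold
        if now_above and not above:
            yield timestamp, value
        above = now_above


# Hypothetical latency samples as (minute, milliseconds) pairs.
latency = [(0, 120), (1, 480), (2, 510), (3, 130), (4, 620)]
events = list(threshold_crossings(latency, threshold=400))
print(events)
```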

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (e.g., data purging), to maintain reasonable performance (e.g., recalculating statistics), and to enforce policies (e.g., secrets rotation)
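As one illustration of fault tolerance on the consumer side, here is a minimal circuit breaker sketch. After a run of failures it stops calling the failing dependency for a cooldown period, then allows a single trial call. The class name, the defaults, and the injectable clock are assumptions made for this example, not a reference to any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock      # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("dependency is cooling down")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, the caller can serve a fallback instead of waiting on a dependency that is known to be failing.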

Pipelines

Operations involve actions initiated through human access (by operations staff) and systems integration (e.g., by capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available as pipelines (programmatic workflows) operationally. Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

  • Provisioning – creation and termination of the service subscription
  • Upgrades and patches – deploying software updates and bug fixes
  • Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization
  • Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution
  • Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)
  • Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)
  • Corrective actions – interventions for problem resolution, such as stopping and restarting of services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (prevent security breaches), replacing defective resources (e.g., kernel deadlocks)

See also:

  1. The Twelve-Factor App
  2. 10 key attributes of cloud native applications

Corporations Acting as Agents of Foreign Governments

A thought experiment

An American citizen uses a healthcare web site to manage his lab results, medical images, medications, and communications with his doctors, all of whom are in the US, as is the web site’s hosting and corporate headquarters. The web site also serves citizens in other countries in the same manner. One day, the German government obtains a warrant from a German court to gain access to an American’s private health information.

Would a libertarian say that the web site is a private company and they may do as they please to violate the American’s privacy rights by handing over information to agents who have no jurisdiction? I think we can all agree that the answer must be no.

A thought experiment

Canada, Germany, New Zealand, and every other country enact hate speech legislation that is incompatible with the American Constitutional protections of free speech. American social media web sites must enforce these “community standards” to the satisfaction of all these jurisdictions in order to do business with users in those countries. It would be costly to treat each user specifically in accordance to the laws and regulations in their jurisdiction, so the company prefers to implement a uniform set of community standards that becomes a superset of all laws and regulations in every jurisdiction. That means every country’s government mandated censorship now applies to American citizens to infringe upon their free speech rights.

Would a libertarian say that the web site is a private company and they may do as they please to violate the American’s free speech rights?

Would a libertarian agree that a private company may act as an agent on behalf of a foreign government to implement foreign laws and regulations on American citizens, such as China’s social credit system?

Social Media and the First Amendment

As social media companies like Facebook, Twitter, and Google (YouTube) increasingly restrict what users can publish according to their policies and “community standards”, we must be careful not to summarily dismiss such matters as private companies being free to operate their business as they please, on the grounds that censorship (violating the First Amendment) is only applicable to state actors.

We must recognize three factors. (1) State actors are threatening to impose regulations or initiate anti-trust enforcement unless these companies self-regulate. Thus, the state is violating the First Amendment coercively through the back door. Or companies are working in cahoots with the government to implement the government’s desired regulations, knowing this is the only way such rules would not be struck down on Constitutional grounds. (2) Large corporations lobby for regulations to establish their own business practices as the status quo and raise barriers to entry for smaller competitors and start-ups. (3) Foreign governments want to enforce their laws and regulations, including against so-called “hate speech”, and corporations will enforce these rules against American citizens. All three are problematic from a First Amendment perspective.

Economics of Human Valuation

(My evolving thoughts that extend from https://www.jetpen.com/blog/2010/06/20/currency-of-goodwill/)

The currency of life is life-energy. When we give things of value to someone to improve their well-being, whether it is material or intangible, we transfer life-energy from ourselves to the recipient. The esteem that we hold for others is counted in life-energy credits and debts registered in our personal accounting system. For strangers who we hold at arm’s length there is a direct conversion of life-energy into monetary units when we conduct transactions. Even then, the quality of such transactions is accounted for with non-monetary life-energy to account for goodwill that is earned or extinguished.

Economic Freedom

Collectivism literally (in the most literal sense) destroys liberty. The word “capitalism” dehumanizes liberty by placing the focus on capital, material wealth separated from its owner and any act of earning it. The term “free market” dehumanizes liberty by inventing an aggregate concept separated from actors, who make individual buying and selling decisions to improve quality of life. Nobody sympathizes with the concept of a market. Even if we were to shift the focus to “free people” that would still be an injustice, because people (the collective) cannot be free. The concept of people is still impersonal. The proper term that humanizes the concept to enable it to elicit sympathy for an economic system based on liberty is “free persons”. That is, the economic system based on liberty is not capitalism (inhuman), the free market (inhuman), or even free people (impersonal). That economic system is properly called free persons.

Scaling operations across tenants in the cloud

Dear Santa,

Currently, when using the tenant-per-namespace deployment model, operational management procedures are difficult to scale to many tenants, because typical actions like patching, upgrading, stopping, starting, etc. must be initiated as pipeline jobs, once per tenant, and watched for successful execution per job. This is labor intensive, error-prone (having to re-input the same input parameters per pipeline job), and tedious to manage. Therefore, it is not scalable in its current form.

To enable this model to scale, tooling is required to enable a single specification of intent to serve as input into an automated workflow that performs the required action across every applicable namespace (tenant). The intended action may be as simple as `kubectl patch` or it may be a very complex job (upgrade all resources). The workflow would coordinate the parallel execution of these actions against their respective namespaces (identified either by label or a list of names), possibly throttling for limited concurrency to avoid resource contention, and reporting output for status monitoring and troubleshooting. This would reduce the operational cost and complexity of deploying patches and upgrades from O(n) to approximately O(1) for n tenants.
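A minimal sketch of such a workflow, assuming the per-tenant action is supplied as a Python callable. In practice that callable might shell out to `kubectl` or trigger a pipeline job; here it is left abstract, and the function name and result shape are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_across_tenants(namespaces, action, max_concurrency=4):
    """Apply action(namespace) to every namespace with bounded concurrency.

    Returns a dict mapping each namespace to ("ok", result) or
    ("failed", message), so one report covers the status of every tenant.
    """
    def attempt(namespace):
        try:
            return namespace, ("ok", action(namespace))
        except Exception as error:  # one failing tenant must not stop the rest
            return namespace, ("failed", str(error))

    # max_workers throttles parallelism to limit resource contention.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return dict(pool.map(attempt, namespaces))
```

The operator specifies the intent once (the action and the set of namespaces) and reads a single report, instead of launching and watching one pipeline job per tenant.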

Abstractions and Concretes

When approaching complex topics that are difficult to understand, I find that it is essential to include two levels of explanation. The abstract summarizes the concepts and principles at play. The concretes give the details that demonstrate how the concepts and principles present themselves in real-world examples. Arguments that go nowhere tend to get stuck in one without the other. Abstracts without concretes are hand-wavy, and no one can understand what the others are talking about, because they cannot agree on definitions. Concretes without abstracts get lost in the details, and no one can see the forest for the trees. You can often forgo offering precise definitions when you can give examples, which by themselves provide a definition by induction.

Net Neutrality

Whenever government policies are implemented in the name of consumer protection, we can be sure that it is not consumers being protected, but rather crony industry incumbents. It is presented as a false alternative between government regulation or absence of regulation, when the strongest form of regulation with the greatest degree of consumer protection is the free market, where consumers decide how their dollars are spent. Good products from well-behaving businesses are rewarded. Bad products and ill-behaving businesses are punished, often to extinction. Moreover, when consumers are under-serviced, entrepreneurs enter the market to compete against under-performing incumbents by offering innovative new products and business practices to meet the demand for superior goods and services, often disrupting the status quo. Meanwhile, government regulations necessarily entrench the status quo. “Best practices” can only be best until innovations overtake them, at which time they become obsolete. Government regulations often continue to burden an industry with obsolete practices that prevent innovations from flourishing. Thus, incumbents are protected from agile upstarts.

Net Neutrality is promoted ostensibly to protect consumers from Internet Service Providers (ISPs) throttling traffic to disadvantage competitive “over the top” (OTT) content providers (e.g., Netflix) while favoring the ISP’s own content services (e.g., television in the case of a cable ISP). Another hypothetical straw man is for ISPs to charge customers to enable access to various information services. I would argue that no ISP would pursue such goals, because of the backlash and consequent mass exodus of customers to the ISP’s competition. ISPs would also want to avoid anti-trust concerns. Paranoia about ISP misbehavior disregards the lack of a business case. Net Neutrality was enacted even though no ISPs had actually implemented anti-competitive traffic management on any significant scale.

Consumers want to preserve a “free and open Internet”—rightly so. ISPs have the practical capability to throttle traffic by origin (content provider), traffic type (e.g., video), or consumption (e.g., data limits for heavy users). They have no practical (cost-effective) mechanism to understand the meaning of the content to selectively filter it. ISPs have only blunt instruments to wield.

Unlike ISPs, content providers (e.g., Netflix, Google, Facebook, Twitter, Cloudflare, GoDaddy) are responsible for “information” services, which fall outside the scope of Net Neutrality for “transmission” by carriers. While ISPs have not attempted to damage a free and open Internet, we have already seen content providers behave very badly toward free speech, since they have the ability to understand the meaning of their content.

If a “free and open Internet” is what is desired, censorship, bans, de-platforming, and de-monetization by companies, who are the strongest advocates of Net Neutrality, are certainly antithetical to that aim. What is their real motive?

Content providers enjoy having their traffic delivered to customers worldwide. They only pay for the bandwidth to the networks they are directly connected to. They are not charged for their traffic transiting other networks, while routed to their end users. Content providers obviously like this arrangement, and they want to preserve this status quo (protecting their crony interests).

Without Net Neutrality, although ISPs may not have a business case for charging customers (end users) for differentiated services, they would have a strong business case for providing differentiated services (various levels of higher reliability, low latency, low jitter, and guaranteed bandwidth) to content providers. Improvements in high quality delivery (called “paid prioritization”) would be beneficial to innovative applications that may not be viable today. For example, remote surgery. With paid prioritization, this would motivate content providers to buy connectivity into an ISP’s network to provide higher quality service to their customers, who receive their Internet access from that ISP. Or to otherwise share revenue with the ISP for such favorable treatment of their traffic. The environment becomes much more competitive between content providers, while more revenue would be shared with the ISPs. ISPs would then be motivated to invest more heavily to improve their networks to capture more of this revenue opportunity. Consumers benefit from higher quality services, better networks, and increased competition (differentiation based on quality) among content providers.

Personal Assistants

Continuing the series on Revolutionizing the Enterprise, where we left off at Sparking the Revolution, I would like to further emphasize immediate opportunities for productive improvements, which do not need to venture into much-hyped speculative technologies like blockchain and artificial intelligence.

In the previous article, I identified communication and negotiation as skills where software agents can contribute superior capabilities to improve human productivity by offloading tedium and toil. Basic elements of this problem can be solved without applying advanced technology like AI. Machine learning can provide additional value by discerning a person’s preferences and priorities. For example, this person always prefers to reschedule dentist appointments but never reschedules family events to accommodate work. Automating the learning of rules enables the prioritization of activities to be automated, further reducing cognitive load.

In my own work, I wish I had a personal assistant, who could shadow my every move. I want it to record my activities so I can replay them later. I want these activities to be in the most concise and compact form, not only as audio and video. For example, as I execute commands in a bash shell, I want to record the command line arguments, the inputs, and the outputs, so this textual information can be copied to technical documentation. As I point and click through a graphical user interface, I want these events to be described as instructions (e.g., input “John Doe” in the field labeled “Name” and click on the “Submit” button).
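As a minimal sketch of the shell-recording idea, a thin wrapper can run a command and keep its arguments, input, and output as structured text. The function name and the record fields are hypothetical, and a real assistant would need to hook into the shell itself rather than wrap individual commands.

```python
import subprocess
import sys


def run_and_record(argv, stdin_text=""):
    """Run a command and return a record of what went in and what came out."""
    completed = subprocess.run(argv, input=stdin_text,
                               capture_output=True, text=True)
    return {
        "argv": list(argv),
        "stdin": stdin_text,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "exit_code": completed.returncode,
    }


# Example: a tiny child process that upper-cases one line of input.
record = run_and_record([sys.executable, "-c", "print(input().upper())"],
                        "hello\n")
print(record["argv"], record["exit_code"])
```

A sequence of such records is compact, searchable, and can be pasted into technical documentation, which audio or video capture cannot offer.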

With a history of my work in this form, this information will be useful for a number of purposes.

  • Someone who pioneers a procedure will eventually need to document it for knowledge transfer. Operating procedures teach others how to accomplish the same tasks by observing how it was done.
  • Pair programming is often inconvenient due to team members being located remote from each other and separated by time zones. An activity log can enable two remote workers to collaborate more effectively.
  • Context switching between tasks is expensive in terms of organizing one’s thoughts. Remembering what a person was doing, so that they can resume later would save time and improve effectiveness.

The above would be a good starting point for a personal assistant without applying any form of AI or analytics. Then, imagine what might be possible as future enhancements. Procedures can be optimized. Bad habits can be replaced by better ones. Techniques used by more effective workers can be taught to others. Highly repeatable tasks can be automated to remove that burden from humans.

I truly believe the places to begin innovating to revolutionize the enterprise are the mundane and ordinary, which machines have the patience, discipline, and endurance to perform better than humans. More ambitious technological capabilities are good value-adds, but we should start with the basics to establish personal assistants in the enterprise as participants in ordinary work, not as esoteric tools in obscure niches.

[Image credit – Robotics and the Personal Assistant of the Future]

planning is useless

Whenever an organization is faced with challenges that require many people to move in a different direction, change their behavior, adjust their attitudes, or alter their thinking, the first thing that management wants to put in place is leadership. They always believe that with the proper top-down inspiration, instruction, and oversight, it will drive the desired results. They believe this model scales hierarchically.

I don’t believe it’s true of problems for which the organization does not have experience and expertise. The more technical and schedule risk that a project incurs because of greater unknowns, the less helpful project planning is. The ability to plan relies on a degree of analysis and design. Without relevant experience to help speculate on how to implement something, planning must happen in ignorance. The plans are meaningless, because actual implementation experience will likely invalidate those plans and designs. Unfortunately, the natural reaction is to spend more time and effort getting those plans right, as the plan goes off track with execution. The more right you try to make it, the worse that situation becomes, as the organization invests more in a futile activity, and less in activities that actually achieve the result. A “learning organization” is what is needed, not one that assumes it knows (or more importantly “can know”) what it’s doing without having done it yet.

The idea of “spontaneous order” is appealing, but that requires all participants to behave rationally with the right signals, so they can work things out among themselves. In large engineering organizations, this does not seem to work, because the communications channels are too narrow, the number of participants too great, and the volume and complexity of knowledge that must be exchanged is too vast. Individuals become too overwhelmed and cannot keep up. Management structures are inevitably put in place to introduce controls and gatekeepers. Whereas chaos is too noisy and incoherent, the imposition of order prevents knowledge pathways from forming spontaneously.

I’m left wondering if there are methods that facilitate spontaneous order. Autonomy, mastery, and purpose are great motivators in the abstract, but they don’t easily translate into concrete methods and tools. I noticed that Facebook has started implementing a system like khanacademy.org for helping edit location information, where it awards points, badges, and levels. Such systems really do provide users with a motivation to achieve the measured outcome. I’m wondering if gamification is a superior way to achieve outcomes.

Insights into innovation