All posts by Ben Eng

I am a software architect in the Communications Applications Global Business Unit (CAGBU) within Oracle. I currently work on cloud services for Business Support System (BSS) and Operations Support System (OSS) applications for communications service providers. My previous responsibilities include architecture and product management for the RODOD and RSDOD solutions to provide an integrated suite of BSS and OSS applications for communications providers. I founded the Oracle Communications Service & Subscriber Management (S&SM) application and the Oracle Communications Unified Inventory Management (UIM) application. I pioneered the adoption of Object Relational Mapping (ORM) based persistence techniques within OSS applications. I introduced the XML Schema based entity relationship modeling language, which is compiled into the persistent object modeling service. I established the notion of a valid time temporal object model and database schema for life cycle management of entities and the ability to travel through time by querying with a temporal frame of reference or a time window of interest. I established the patterns for resource consumption for capacity management. I championed the development of Web based user interfaces and Web Services for SOA based integration of OSS applications. I was responsible for single-handedly developing the entire prototype that formed the foundation of the current generation of the OSS inventory application. I have been engaged in solution architecture with service providers to adopt and deploy Oracle's OSS applications across the globe. I am responsible for requirements analysis and architectural design for the Order-to-Activate Process Integration Pack (PIP) proposed to integrate the OSS application suite for the communications industry using the Application Integration Architecture (AIA). Any opinions expressed on this site are my own, and do not necessarily reflect the views of Oracle.

DevOps Transparency and Coordination

Coordinating human activities across organizations and disciplines is fundamental to DevOps. This requires having documented procedures to handle any situation, tools that enable participants to collaborate effectively, and a shared understanding of what information needs to be captured and communicated for the work to proceed, especially when the actors are likely to be separated by space and time.

A DevOps procedure is initiated for a reason. These situations include scheduled maintenance, a response to an alert (detection of a condition that deserves attention), or a response to a request for support. In each case, there should be a ‘ticket’ (a record in a tool) that notifies a responsible DevOps engineer to work the issue. The relevant procedure that applies to a ticket should be obvious, and ideally referenced explicitly.

When responding to an alert or a support request (usually a complaint about a service malfunction), the procedure usually begins by confirming the reported condition. This requires gathering information about the context and collecting diagnostics to aid in troubleshooting. Ideally, the ticket clearly identifies the problem; otherwise, further interactions are needed to gain such clarity. Humans routinely and wrongly assume that the recipient of a request has all the necessary context to understand what is being asked of them and why. To mitigate this inefficiency, tooling and procedural documentation are usually provided to guide how tickets should be written, so that many questions and answers are not needed afterward to satisfy the request.

The engineer who works a ticket should capture the service configuration, relevant logs to determine the failure mode, and any other data for the context associated with the problem for analysis toward an operational fix or for submitting a bug, if applicable.

A designated channel should be used for engineers to collaborate. Each engineer must record in real time the actions taken to troubleshoot, analyze, and correct the problem. This aids in coordination between multiple individuals in different roles across organizations to work together. Good communication enables everyone involved to stay informed and avoid taking actions that interfere with each other. Moreover, an accurate record can be reviewed later as a post-mortem. In the course of the

Severe incidents, such as those that cause a service outage, demand a root cause analysis toward preventative actions, such as process improvements, procedure documentation, operational tooling, or developing a permanent fix (for a software bug). This depends on the ticket capturing the necessary information to trace how the problem originated, such as the transaction processing in progress, events or metrics indicating resource utilization out of bounds (e.g., out of memory, insufficient CPU, I/O bandwidth limited, storage exhaustion) or performance impairment (e.g., lock contention, queue length, latency), or anything else that appears out of the ordinary.

One of the biggest impediments to verifying that a problem is resolved is that DevOps normally does not have access to the service features being exercised by end users. When a functional problem is reported by users, it may not be possible for DevOps to confirm that the problem is fixed from the user’s perspective. Communication with users may need to be mediated by customer support staff. The information on the ticket would need to facilitate this interaction.

Accurate record-keeping also enables later audits in case there are subsequent problems. The record of actions taken can be used to analyze whether these actions (such as configuration changes or software patches) are contributing to other problems. Troubleshooting procedures should include data mining of past incidents (have we seen this problem before? how did we fix it previously?) and auditing what procedures may have impacted the service under investigation (what could have broken this?).

The above guidance can be summarized as follows.

  • Say what you do
  • Do what you say
  • Write it down

Future Distributed Applications

Big Tech censorship and cancel culture are becoming intolerable. Politicization of business is destroying the fabric of society. We need to design distributed applications and decentralized services to counter these nefarious forces.

Corporate oligarchs are implementing partisan agendas to shape public discourse by applying so-called “community standards” for social media content moderation. They de-platform personalities who express opinions that run counter to approved narratives. They silence dissent. Free speech and freedom of association are under threat, as private companies are coerced by state regulatory action, looming threats of state intervention, and mob rule through heckler’s veto, bullying, harassment, doxxing, and cancel culture.

Concentration of power and control in a few dominant platforms, such as Google, YouTube, Facebook, Twitter, Wikipedia, and their peers has harmed consumer choice. Anti-competitive behavior, such as collusion among platform and infrastructure services to deny service to competitive upstarts and undesirable non-conformists, has suppressed alternatives like Parler, Gab, and BitChute.

The current generation of dominant platforms does not allow editorial control to be retained by content creators. The platform is viewed as the ultimate authority, and users are limited in their ability to assert control to form self-moderated communities and to set their own community standards. Control is asserted by the central platform authorities.

Control needs to be decoupled from centralized platform authorities and put back in the hands of content creators (authors, podcasters, video makers) and end users (content consumers and social participants). Editorial control over legal content does not belong with Big Tech. What constitutes legal content is dependent on the user’s jurisdiction, not Big Tech’s harmonization of globalist attitudes. To Americans, hate speech is protected speech, and it needs to be freely expressible. Similarly, users in other jurisdictions should be governed according to their own standards.

We need to develop apps with peer-to-peer protocols and end-to-end encryption to cut out the middlemen, which will exterminate today’s generation of social media companies. Better yet, application logic itself should be deployable on user-controlled compute with user-controlled encrypted storage on any choice of infrastructure providers (providing a real impetus for the adoption of Edge Computing), so that centralized technology monopolies cannot dominate as they do today. This approach needs to be applied to decentralize all apps, including video, audio podcasts, music, messaging, news, and other content distribution.

I believe the next frontier for the Internet will be the development of a generalized approach on top of HTTP or as an adjunct to HTTP (like BitTorrent) to enable distributed apps that put app logic and data storage at end-points controlled by users. This would eliminate control by middlemen over what content can be created and shared.

Applications must be distributed in a topology where a node is dedicated to each user, so that the user maintains control over the processing and data storage associated with their own content. Applications must be portable across cloud infrastructures available from multiple providers. A user should be able to deploy an application node on any choice of infrastructure provider. This would enable users to be immune from being de-platformed.

With an application whose logic and data are distributed in topology and administrative control, the content should be digitally signed so that it can be authenticated (verified to be produced by the user who owns it). This is necessary, so that a user’s application node can be moved to an alternative infrastructure (compute and storage) without other application nodes needing to establish any form of trust. Consumers (the audience with whom the owner shares content) and processors (other computational services that may operate on the content) of the information would be able to verify that the information is authentic, not forged or tampered with. The relationship between users and among application nodes, as well as processors, is based on zero trust.

Processing of information often involves mirroring and syndication. Mirroring with locality for low latency access gives certain types of transaction processing, such as search indexing, the performance characteristics they need. Authorizing a search engine to index one’s content does not automatically grant users of the search engine access to the content. Perhaps only an excerpt is presented by the search engine along with the owner’s address, where the user may request access. A standard protocol is needed to enable this negotiation to be efficient and automated, if the content owner chooses to forgo human review and approval.

We need to change how social applications control the relationship between content producers and content consumers. First, for original source content, the root of a new discussion thread, the owner must control how broadly it is published. Second, consumers of content must control what sources of information they consume and how it is presented. Equally important, consumers of an article become producers of reviews and comments, when interacting in a social network. The same principle must apply universally to the follow-on interactions, so that the article’s author should not be able to block haters from commenting, but the author is not obligated to read them. Similarly, readers are not obligated to see hateful commenters, who they want to exclude from their network. The intent should be to enable each person to control their own content and experience, ceding no control to others.

Social applications need self-managed communities with member administered access control and content moderation. Community membership tends to be fluid with subgroups merging and splitting regularly. Each member’s access and content should follow their own memberships rather than being administered by others in those communities. The intent is to mitigate a blacklisted individual being cancelled by mobs. If a cancelled individual can form their own community and move their allies there with ease, cancel culture becomes powerless as a tool of suppression with global reach. Its reach is limited to communities that quarantine themselves.

This notion of social network or community is decentralized. A social application may support a registry of members, which would serve as a superset of potential relationships for content distribution. This would enable a new member to join a social network and request access to their content. Presumably, most members would enable automatic authorization of new members to see their content, if the new member has not been blocked previously. That is, enable a community to default to public square with open participation. However, honor freedom of association, so that no one is forced to interact with those with whom there is no desire to associate, and no one can be banned from forming their own mutually agreed relationships.

Software innovations address this urgent need to counter the censors, the cancellers, the de-platformers, the prohibitionists, the silencers of dissent, and the government oppressors. A good understanding of the requirements which I’ve touched upon above still needs elaboration, as I have only scratched the surface. We need an architecture to enable the unstoppable Open Internet that we failed to preserve from the early days. We need to develop a platform that realizes this vision to restore a healthy social fabric for our online communities.

indirect aggression – wielding state power

An earlier article titled corporations acting as agents of foreign governments describes one form of indirect aggression. No one would dispute that when a person hires a hitman to murder someone, the hiring party is guilty of aggression against the victim, even though the violence was indirect. We must recognize the same causal connection when government wields the machinery of state violence indirectly through private entities; and similarly when private entities wield government power indirectly against their competitors and the public at large.

In current events, we see David Boaz of the Cato Institute, a libertarian think-tank, lament Florida Governor Ron DeSantis implementing legislation to counter Critical Race Theory, election irregularities, stringent COVID restrictions, and vaccine passports. We also see state action being taken to punish protests that block road traffic, and to punish social media censorship and de-platforming, which trample upon free speech.

Libertarians face a moral dilemma of epic proportions. The left weaponizes government power to aggress. The right weaponizes government power to aggress in their own ways and to counter the left. Libertarians want non-aggression, but they cannot reconcile their views with the reality of a world filled with aggression.

Right and left engage in indirect aggression through crony relationships between private entities and governments. When private entities influence government to wield legislative and regulatory force against their competitors or the public at large, the corruption results in a phenomenon known as regulatory capture. Increasingly, we also see the government infiltrate private entities to coerce them into implementing policy that the US Constitution forbids government from implementing directly, such as censorship (content moderation). Politicians threaten to regulate or otherwise interfere in the market. Governments use subsidies, spending, loans, tax policy, tariffs, export controls, licenses, permits, and authorizations as tools of coercive power. Sometimes the mere threat or even a wink and a nod are enough to influence private entities to acquiesce to a politician’s desires, as any mobster is aware. Government agencies are also looking to outsource their coercive functions to private entities, who are not subject to Constitutional constraints and legislative oversight.

Libertarians are faced with the dilemma of either being passive (“private entities can do what they want”) in ignorance of indirect aggression, or retaliating. When countering indirect aggression, if one’s economic power (cancel culture) and political power (voting, lobbying, and bribing for legislation and regulation) are too weak, passivity is surrendering to coercion (resigning to be victims). Alternatively, libertarians may recognize that, with indirect aggression having already been initiated against them, the same weapons can be wielded as a defensive measure in retaliation, and this would be morally defensible. It is the principle: return fire when fired upon.

system integration – a tangled web

There is something terribly wrong with software development in the enterprise application space. No one is able to release working software without coordinating system integration across all product development teams to align the version of every product in the universe, because end-to-end workflows can’t be made to work as products are released on independent life cycles.

Compatibility

I believe we are missing architectural design principles. We talk about forward and backward compatibility of APIs, but I’m not sure the industry deeply understands what that entails. The problem goes beyond teams within an organization, because the software industry doesn’t even understand what compatibility entails.

Horizontal applications lack vertical behavior

The issue lies in how the base application (e.g., product catalog, store front, sales automation, care, order fulfillment, customer and subscription management, charging, billing, revenue management) is horizontal (generic) and hollow, expecting after-market extensibility to provide the vertical behavior that is specialized for the industry and the enterprise’s business model. The intent of the application vendor is to provide a general purpose platform that can be tailored after-market to the peculiarities of any enterprise.

The application will implement an API defined by industry standards (say, tmforum.org for the communications industry) that reflects this general purpose hollowness. The application doesn’t have any real substance until it is customized to model the business. For example, a product catalog would not come populated with 5G mobile product specifications that are branded and priced according to a 5G service provider’s business model.

When extending entities with data that have hidden meaning, implied behavior, constraints, and statefulness (life cycle, workflow), these contribute to the API in ways that were not defined by the original specification. Each new element introduces some degree of incompatibility. Industry standards can never specify in a precise and rigorous manner things they did not foresee.

Stateful behavior

Stateful behavior is especially troublesome to specify in a manner that ensures compatibility. This includes conversational state and persistent state. Conversational state is where linked information is implicitly kept across multiple requests involved in the same session. A cursor for iterating through a collection of query results is an example of conversational state. Persistent state is durable across transactions, having memory that spans the life of a transaction, a session, a process, and even the life of a compute instance. When methods can only act against objects in certain states, but not others, this constraint must be honored for compatibility across collaborating components.

Objects and attributes are allowed to take on certain values at various points in their life cycles. Transactional behavior and workflow (the steps performed by business processes) are conditional upon the state of these objects. For example, when equipment is installed, it may be in various states of readiness for production use, but when not installed the equipment’s operational characteristics and configuration are irrelevant. Every component with access to that object must understand these semantics and enforce them consistently, otherwise there is no compatibility. Unfortunately, even these very simple conditional constraints and the ones in the previous paragraph are beyond the capability of today’s prevailing interface specification languages and entity modeling frameworks.

Conditional immutability

Immutability is often conditional on the life cycle state of an entity. For example, an order can be edited during information capture, but its captured intent cannot be edited after the order is firm and in the process of being fulfilled. Again, this constraint cannot be specified in a manner that ensures compatibility across collaborating components.
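As an illustration only, the order example can be sketched in Python. The `Order` class, its two state names, and the `edit` and `confirm` methods are hypothetical, not taken from any standard or product. The point is that the editing rule lives in the implementation, not in the interface description.

```python
from enum import Enum


class OrderState(Enum):
    CAPTURE = "capture"  # information is still being collected
    FIRM = "firm"        # intent is fixed; fulfillment may begin


class ImmutableStateError(Exception):
    """Raised when an edit is attempted in a state that forbids it."""


class Order:
    def __init__(self):
        self.state = OrderState.CAPTURE
        self.items = {}

    def edit(self, product, quantity):
        # The constraint is conditional: editing is legal only during capture.
        if self.state is not OrderState.CAPTURE:
            raise ImmutableStateError(
                f"order is {self.state.value}; items cannot change"
            )
        self.items[product] = quantity

    def confirm(self):
        self.state = OrderState.FIRM


order = Order()
order.edit("5G-plan", 1)      # allowed while capturing
order.confirm()
try:
    order.edit("5G-plan", 2)  # rejected once the order is firm
except ImmutableStateError as err:
    rejected = str(err)
```

A client generated from a typical interface description would still see `edit` as always callable; the rule is discovered only at run time, which is exactly the compatibility gap.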

Failure modes

Methods have failure modes, usually specified as failure responses, error codes, or exceptions. Some kinds of failures are recoverable using techniques like retrying, while others are non-recoverable. This too is usually not expressible for compatibility.
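A minimal sketch of what a caller needs to know, assuming a hypothetical pair of exception types that separate recoverable from non-recoverable failures. The names `RecoverableError`, `NonRecoverableError`, and `call_with_retry` are invented for illustration.

```python
class RecoverableError(Exception):
    """Transient failure; the caller may retry."""


class NonRecoverableError(Exception):
    """Permanent failure; retrying cannot help."""


def call_with_retry(operation, max_attempts=3):
    """Retry only failures declared recoverable; let the rest propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RecoverableError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure


calls = {"flaky": 0, "broken": 0}


def flaky():
    # Fails twice with a transient error, then succeeds.
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RecoverableError("temporarily unavailable")
    return "ok"


def broken():
    # Fails permanently; it should be called exactly once.
    calls["broken"] += 1
    raise NonRecoverableError("malformed request")


result = call_with_retry(flaky)
try:
    call_with_retry(broken)
except NonRecoverableError:
    pass
```

The retry policy only works if both sides agree on which failures belong to which class, and that agreement is rarely part of the published interface.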

Performance expectations

Methods have performance expectations in terms of latency, concurrency, and transaction volume. Methods have resource consumption expectations in terms of memory, CPU, storage, network, and I/O. Methods that involve data sets have expectations about how much data can be passed with corresponding performance and scalability characteristics. This too is usually not expressible for compatibility.

Persistent versus volatile

Objects and their attributes are often persistent on durable storage. Subsets of attributes may be persistent, while others are volatile or derived (computed based on the value of other attributes, such as a rolled-up status or a count of a collection). This too is usually not expressible for compatibility.

Consistency, availability, and partition tolerance

Methods must trade off consistency, availability, and partition tolerance. The expectation of what trade-offs should be chosen is usually not expressible for compatibility.

Authentication and authorization

Methods expect the caller to be authenticated and they are expected to enforce access control to verify that the caller is authorized. Moreover, the method is expected to enforce data permissions and data privacy. This too is usually not expressible for compatibility.

Conclusion

The list of requirements and constraints that contribute to compatibility goes on. The above is a sampling to give the reader a sense of the problem, not to be comprehensive. The intent is to show how formal specifications are grossly insufficient to ensure a high degree of compatibility across heterogeneous suppliers and independently developed implementations.

Because API compatibility is so unreliable based on specifications and contract testing, the promise of a microservice architecture (within an application) or a service-oriented architecture (for integrating applications across the enterprise) cannot be achieved naively. System integration continues to be plagued by a waterfall model of requiring a complete line-up of application versions to be tested end-to-end, before we have any confidence that they work together. The benefits of agile development and independent life cycles are not achievable, because the pre-requisite compatibility guarantees cannot be met. System integration of enterprise applications remains in the stone age because of this crippling deficiency.

Wealth versus Quality of Life

Conflating “wealth” with “quality of life”—in criticism of wealth inequality—is a fatal error. It is important to recognize that wealth in the form of capital accumulation (savings that are re-invested into factors of production toward increasing capacity for supplying goods and services into the future) speaks to supply-side capacity. The abundance created by this productive capacity is what provides for quality of life. On the demand side, quality of life comes from consumers with incomes that have purchasing power to acquire those goods and services. The greater the abundance of supply, the greater the purchasing power that consumers can wield (as expenses on the income statement or outflows on the cash flow statement) WITHOUT wealth (assets and equity on the balance sheet) playing any role for consumers. The role of wealth is to associate ownership for management responsibility over factors of production to create and maintain supply. The role of income is to have purchasing power to enable quality of life for consumers. Savings (retained earnings that are re-invested) is how consumers cross over to participate in wealth toward the management of supply.

Pain Feedback Loops for improvement

Feedback loops are very important to regulate behavior within an enterprise. This applies to both rewarding positive behavior, and encouraging more of it, as well as correcting negative behavior to get less of it. Continuous improvement is about feedback loops.

Focusing on negative feedback, we should recognize a phenomenon called ‘pain’. In this context, it refers mostly to pains in the ass, which are discomforts, inconveniences, and frustrations that burden people’s lives, draining their time and energy in unproductive ways. In DevOps, when high severity operational problems arise, such as a service outage in the middle of the night, pain manifests in a pager alert that wakes up an engineer to troubleshoot the incident and resolve the problem.

When fires need to be fought, fire-fighters experience this pain in proportion to the number of fires and their severity. Development teams tend to avoid work that does not serve the goal of delivering features with faster time to market. They inevitably cut corners in areas that make operations more efficient, because they tend not to be placed in the position of experiencing the pain, when it comes. Disconnecting development priorities from operational responsibilities is a recipe for the infliction of pain on those who do not deserve it, and the result is an excess of unexpected pain that should have been foreseen and mitigated. The integration of development with operations into DevOps is intended to establish this connection. This connection must not be undermined by paying mere lip-service to operations without putting real skin in the game, so that development staff experience pain for operational failures as much as operations staff.

Social Media Bias

Jack Dorsey and Vijaya Gadde of Twitter appeared with Tim Pool on The Joe Rogan Experience to explore the topic of social media bias.

https://www.jrepodcast.com/episode/joe-rogan-experience-1258-jack-dorsey-vijaya-gadde-tim-pool/

Tim Pool did a reasonable job of enumerating several areas of concern.

  1. Applying a single global standard (“community standards”) to American citizens imposes “hate speech” regulations that are antithetical to American principles of free speech protected by the US Constitution. [Similar to concerns I raise:  Corporations Acting as Agents of Foreign Governments and Social Media and the First Amendment]
  2. Twitter claims to hold no politically biased agenda by intention, while trying to implement narrow goals around maximizing inclusion of users to conduct speech by protecting their physical safety (i.e., by disallowing targeted harassment and doxxing), but they have adopted ideologically biased policies that are selectively enforced in a manner that predominantly punishes conservatives.
  3. The near-monopoly status of the social media giants within their own niches, combined with unilateral decision-making that appears to most people to be politically biased in one direction, will lead politicians, who are ignorant of the actual issues and incompetent to formulate good solutions, to take a sledgehammer approach to regulation, and consequently make the situation much worse.
  4. The coordinated efforts among corporations to punish individuals across social media, hosting, Internet infrastructure, and payment processing systems demonstrate an abuse of power that is terrifying for how it can implement a social credit system that can unperson people in an extra-judicial manner without due process of law or avenues of redemption.

Black hole jets – how do they work?

In my hubris, I sometimes write emails to established scientists with my stupid ideas. I have a knack for formulating what I believe to be good questions. Here is what I sent to Netta Engelhardt at MIT this morning on the topic of black hole jets.

I appreciate the videos you’ve done on YouTube on black holes.

Some rhetorical questions come to mind. I don’t expect an answer. My intention is to ask them, in case they help stimulate some curiosity toward maybe forming a useful idea.

  1. How much of the mass that is falling into a black hole adds to the mass of the black hole, versus being ejected, say through its jets?
  2. Can we think of the jets as carrying information away from the black hole, given that the BH is accelerating the outbound particles substantially, thereby transferring energy from it?
  3. Wouldn’t (2) then be consistent with a model whereby all information about the BH is thought to be encoded on its boundary, for accreted matter to be seen as sticking to the boundary as it falls in, and over time that same information migrating to the poles of the BH and ejected through its jets?

It just seems to me, as a layman, that black hole jets are such a prominent feature, but I haven’t seen much talk about what mechanisms generate these jets, and what are their relationships to the flow of energy and information into and out of the black hole.

DevOps Mentality

DevOps – development and operations

Developers who are new to Operations, as they become immersed in DevOps culture, may envision that their involvement in operations follows after development is done. However, operations are not an after-thought. This article enumerates some operations-related impacts to development practices that may not be at the forefront of a developer’s mind, but they should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

If logging an ERROR level message, it should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, it should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. Accounting for the volume of messages collected per service instance per day multiplied by one year retention and the number of service instances deployed, that may be hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.
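The storage estimate can be worked through as simple arithmetic. Every input below is an assumed figure chosen only to show the shape of the calculation, not a measurement from any real deployment or a quoted price.

```python
# All inputs are assumed for illustration; substitute measured values.
GB_PER_INSTANCE_PER_DAY = 1.0  # assumed log volume per service instance
RETENTION_DAYS = 365           # one-year retention, as in the text
SERVICE_INSTANCES = 500        # assumed number of deployed instances
DOLLARS_PER_GB_MONTH = 0.10    # assumed blended storage and indexing price

# Steady-state volume held once a full retention period has accumulated.
retained_gb = GB_PER_INSTANCE_PER_DAY * RETENTION_DAYS * SERVICE_INSTANCES
retained_tb = retained_gb / 1000
monthly_cost = retained_gb * DOLLARS_PER_GB_MONTH

print(f"{retained_tb:.1f} TB retained, about ${monthly_cost:,.0f} per month")
```

With these assumed inputs the steady state is about 182.5 TB and roughly $18,250 per month, which is the order of magnitude described above.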

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by classifying messages at a finer-grained log level (e.g., INFO, DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.
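As a minimal sketch of this idea, assuming Python's standard logging module, the message catalog, line layout, and alert pattern below are all hypothetical. The point is that the alert rule matches the severity and the identifier, so the wording of the message can change without breaking monitoring.

```python
import logging
import re

# Hypothetical catalog: one stable identifier per message type.
MESSAGES = {
    "OLTP-0123": "Order %s could not be committed after %d attempts",
}

# Assumed line layout: timestamp, level, identifier, text.
LOG_FORMAT = "%(asctime)s %(levelname)s %(msg_id)s %(message)s"

# An alert rule keys off the level and identifier, never the free text.
ALERT_PATTERN = re.compile(r" ERROR (?P<msg_id>[A-Z]+-\d{4}) ")


def log_event(logger: logging.Logger, level: int, msg_id: str, *args) -> None:
    """Log a cataloged message with its identifier attached."""
    logger.log(level, MESSAGES[msg_id], *args, extra={"msg_id": msg_id})


def alert_id(line: str):
    """Return the message identifier if the line should raise an alert."""
    match = ALERT_PATTERN.search(line)
    return match.group("msg_id") if match else None
```

A handler whose formatter uses LOG_FORMAT produces lines that alert_id can classify. Logging an identifier that is missing from the catalog fails fast with a KeyError, which nudges developers to document the message first.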

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the properties most significant to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem when troubleshooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.
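One way to carry context while keeping sensitive values out of the log is to mask them at the point of logging. This is a sketch under assumptions: the set of sensitive keys and the helper names are hypothetical, and a real service would derive the list from its own data classification.

```python
import logging

# Hypothetical set of context keys that must never reach the log.
SENSITIVE_KEYS = {"password", "token", "ssn", "email"}


def safe_context(context: dict) -> str:
    """Render context as key=value pairs, masking sensitive values."""
    parts = []
    for key in sorted(context):
        value = "<redacted>" if key.lower() in SENSITIVE_KEYS else context[key]
        parts.append(f"{key}={value}")
    return " ".join(parts)


def log_with_context(logger: logging.Logger, level: int,
                     message: str, **context) -> None:
    """Log a message followed by its redacted context."""
    logger.log(level, "%s [%s]", message, safe_context(context))
```

For example, log_with_context(logger, logging.ERROR, "payment rejected", order_id="A-17", token="abc") records the order identifier but masks the token.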

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.
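A small helper can enforce this rule in one place. The sketch below assumes Python's standard logging module; the function name log_failure is hypothetical. It always logs a one-line summary and attaches the stack trace only when the logger is set to DEBUG.

```python
import logging


def log_failure(logger: logging.Logger, message: str, exc: BaseException) -> None:
    """Log a one-line error; attach the stack trace only when DEBUG is on."""
    include_trace = logger.isEnabledFor(logging.DEBUG)
    logger.error("%s: %s: %s", message, type(exc).__name__, exc,
                 exc_info=exc if include_trace else None)
```

Operations staff see a compact line naming the failure, and developers can raise the log level when they need the trace.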

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fallback mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.
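The retry pattern described above might look like the following sketch. The function name and parameters are hypothetical, and the broad exception handler stands in for whatever transient errors a real service would recognize.

```python
import logging
import time


def call_with_retries(operation, logger: logging.Logger, attempts: int = 3,
                      delay_seconds: float = 0.0):
    """Retry quietly; log one summary only if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = operation()
        except Exception as exc:  # a real service would catch only transient errors
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
            continue
        if attempt > 1:
            # Kept at DEBUG so that a recovered call stays silent in production.
            logger.debug("succeeded after %d attempts", attempt)
        return result
    logger.error("gave up after %d attempts: %s", attempts, last_error)
    raise last_error
```

The caller that receives the final exception is where a fallback or circuit breaker would take over.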

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.
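If the access log is produced through Python's standard logging module, a filter can drop successful probe requests while keeping failed ones. This is a sketch under assumptions: the probe paths are hypothetical, and it presumes the access logger attaches path and status attributes to each record.

```python
import logging

# Hypothetical probe endpoints; substitute whatever paths the service exposes.
PROBE_PATHS = ("/healthz", "/readyz")


class ProbeSuccessFilter(logging.Filter):
    """Drop access-log records for probes that succeeded; keep failures."""

    def filter(self, record: logging.LogRecord) -> bool:
        path = getattr(record, "path", "")
        status = getattr(record, "status", 0)
        is_probe = path in PROBE_PATHS
        return not (is_probe and 200 <= status < 300)
```

Attaching the filter with handler.addFilter(ProbeSuccessFilter()) silences healthy probes and still records a probe that returns an error status.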

Pay special attention to methods and transactions with security events.

  1. A redirect of a request for a user interface to the login page
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged with a format that allows such messages to be classified, so that they can be directed to a security information and event management (SIEM) system for special handling. The SIEM is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.
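One possible shape for such a format is a JSON line with a fixed category field, emitted on a dedicated logger. Everything named here is an assumption for illustration: the event type names, the "security" logger name, and the field layout are not prescribed by any particular SIEM product.

```python
import json
import logging

# Hypothetical event types, mirroring the list of security events above.
SECURITY_EVENTS = {"LOGIN_REDIRECT", "LOGIN_SUCCESS", "LOGIN_FAILURE",
                   "PRIVILEGED_ACTION", "ACCESS_DENIED"}

# A dedicated logger name lets handlers route these records separately.
security_log = logging.getLogger("security")


def security_event(event_type: str, principal: str, **details) -> str:
    """Emit one security event as a JSON line and return that line."""
    if event_type not in SECURITY_EVENTS:
        raise ValueError(f"unknown security event: {event_type}")
    line = json.dumps({"category": "SECURITY", "event": event_type,
                       "principal": principal, "details": details},
                      sort_keys=True)
    level = (logging.WARNING if event_type in {"LOGIN_FAILURE", "ACCESS_DENIED"}
             else logging.INFO)
    security_log.log(level, "%s", line)
    return line
```

Because every record carries the same category value and comes from one logger, a log shipper can select these lines without parsing free text.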

Audiences

One of the most common mistakes is to conflate the log stream directed toward operations with information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency into the steps that are executing. This includes integrating other services to collaborate or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended not to differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended for feedback to end users, are directed to operational logs. Operations have no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to noise.

Developers must pay attention to the intended audience of messages, so that they are either directed to operational logs or end users or both.
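To make the choice of audience explicit in code, each message can be tagged with its audience at the point where it is emitted. The sketch below is hypothetical: the two in-memory lists stand in for an operational log stream and a user-visible activity log.

```python
from enum import Flag, auto


class Audience(Flag):
    OPERATIONS = auto()
    END_USER = auto()


class MessageRouter:
    """Deliver each message only to the sinks its audience calls for."""

    def __init__(self):
        self.operational_log = []   # stands in for the operations log stream
        self.user_feedback = []     # stands in for a user-visible activity log

    def emit(self, audience: Audience, text: str) -> None:
        if Audience.OPERATIONS in audience:
            self.operational_log.append(text)
        if Audience.END_USER in audience:
            self.user_feedback.append(text)
```

A misconfigured workflow would be emitted with Audience.END_USER, a database outage with Audience.OPERATIONS, and a condition relevant to both with Audience.OPERATIONS | Audience.END_USER.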

Telemetry

“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that outage durations can be measured against service availability objectives such as the Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
  • Latency – the time between receiving a request and sending its corresponding response tells us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tells us how well the system is performing with regard to workloads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs
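A threshold crossing event can be derived from a series of samples with very little logic. The following sketch is an assumption about how such a rule might be expressed, not the behavior of any particular monitoring product. It requires several consecutive breaches before alerting, so that a single spike does not page anyone.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float
    consecutive: int = 3   # samples that must breach before alerting


def crossings(samples, threshold: Threshold):
    """Return the index of each sample that completes a sustained breach."""
    alerts, run = [], 0
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold.limit else 0
        if run == threshold.consecutive:
            alerts.append(index)
    return alerts
```

For instance, with a hypothetical latency limit of 500 milliseconds, the samples 120, 480, 510, 530, 560, 200, 505 produce one alert at the fifth sample, where the third consecutive breach occurs.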

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (e.g., data purging), to maintain reasonable performance (e.g., recalculating statistics), and to enforce policies (e.g., secrets rotation)
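As one illustration of auto-scaling, a proportional rule adjusts the instance count by the ratio of observed load to target load. This is a sketch of a common approach, with hypothetical parameter names and bounds; a production autoscaler would also dampen rapid changes.

```python
import math


def desired_instances(current: int, observed_load: float, target_load: float,
                      minimum: int = 1, maximum: int = 10) -> int:
    """Scale the instance count in proportion to load, within fixed bounds."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    proposed = math.ceil(current * observed_load / target_load)
    return max(minimum, min(maximum, proposed))
```

With four instances running at 90 percent utilization against a 60 percent target, the rule asks for six instances. The bounds keep a runaway metric from scaling without limit.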

Pipelines

Operations involve actions initiated through human access (by operations staff) and systems integration (e.g., by capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available as pipelines (programmatic workflows) operationally. Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

  • Provisioning – creation and termination of the service subscription
  • Upgrades and patches – deploying software updates and bug fixes
  • Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization
  • Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution
  • Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)
  • Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)
  • Corrective actions – interventions for problem resolution, such as stopping and restarting of services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (prevent security breaches), replacing defective resources (e.g., kernel deadlocks)
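To show what a parameterized action might look like, the sketch below models the inputs to a hypothetical data-purge pipeline as a validated, immutable request. The class, field, and step names are assumptions for illustration; the point is that the pipeline accepts a precise set of inputs and rejects anything else before it runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurgeRequest:
    """Inputs for a hypothetical data-purge pipeline."""
    service_instance: str
    retention_days: int
    dry_run: bool = True

    def __post_init__(self):
        if not self.service_instance:
            raise ValueError("service_instance is required")
        if self.retention_days < 1:
            raise ValueError("retention_days must be at least 1")


def plan_purge(request: PurgeRequest) -> list:
    """Return the ordered steps the pipeline would run for this request."""
    steps = [f"select records older than {request.retention_days} days "
             f"in {request.service_instance}"]
    steps.append("report candidates" if request.dry_run else "delete candidates")
    steps.append("record the action in the audit trail")
    return steps
```

Defaulting dry_run to True is a deliberate choice in this sketch: a corrective action invoked under pressure should have to opt in to being destructive.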

See also:

  1. The Twelve-Factor App
  2. 10 key attributes of cloud native applications