
Microservices Life Cycle

There is friction between a microservices architecture and life cycle management goals for application releases. One significant motivation for microservices is independent life cycle management, so that capabilities with well-defined boundaries can be developed and operated by self-contained, self-directed teams. This allows for more efficient workflows, so that a fast-moving code base is not held back by other slower-moving code bases.

Typically, an application (a collection of services that form an integrated whole and are offered together as a product to users) is rolled out with major and minor releases on some cadence. Major releases include large feature enhancements and some degree of compatibility breakage, so these may happen on an annual or semi-annual basis. Minor releases or patches may happen quarterly, monthly, or even more frequently. With microservices, the expectation is that each service may release on its own schedule without coordination with all others even within the scope of an integrated application. A rapid release cadence is conducive to responsiveness for bug fixes and security fixes, which protect against exposing vulnerabilities to exploits.

One advantage of applications on the cloud is that a single release of software can be rolled out to all users in short order. This removes the substantial burden on developers to maintain multiple code branches, as they had to do in the past for on-premises deployments. Unfortunately, the burden is not entirely lifted, because as software under development graduates toward production use, various pre-release versions must be made available for pre-production staging, testing, and quality assurance.

Development is already complex, because feature development toward a future release must proceed in parallel with bug fixes for the release that is already in production (assuming all users are on the latest release only). These parallel streams of development will be in various phases of pre-production testing toward being released to production, and in various phases of integration testing with a longer-term future release schedule. Varying levels of severity for bugs mean that the urgency for fixes varies. For example, emergency fixes for exploitable security vulnerabilities need to be released as a patch to production immediately, whereas fixes for functional defects may wait for the next release on the regular cadence. Cherry-picking and merging fixes across code branches is tedium that every developer dreads. Independent life cycle management of source code organized according to microservices is seen as helping to decouple coordination across development teams, which are organized according to microservice boundaries.

Independent life cycle management of services relies on both backward compatibility and forward compatibility. Integration between services needs to be tolerant of mismatched versions to be resilient to independent release timing, including upgrades, rollbacks due to failed upgrades, and rerunning an upgrade after a prior failure. Backward compatibility enables a new version of a service to interoperate with an older client. Forward compatibility enables the current version of a service (soon to be upgraded) to interoperate with a newer client, especially during the span of time (brief or lengthy) in which one may be upgraded before the other. In my article about system integration, I explained the numerous problems that make compatibility difficult to achieve. Verification of API compatibility through contract testing is the best practice, but test coverage is seldom perfect. Moreover, no contract language specifies everything that impacts compatibility. Mocking will never be representative of non-functional qualities. This is one of many reasons why confidence in verification cannot be achieved without a fully integrated system. This is how the desire for independent life cycles for microservices is thwarted. The struggle is more real than most people realize. As software professionals, we enter into every new project with fresh optimism that this time we will do things properly to achieve utopia (well, at least an independent life cycle would be a small victory), and at each and every turn we are confronted by this one insurmountable obstacle.
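One common way to make a client tolerant of mismatched versions is the tolerant-reader approach: ignore fields the client does not recognize, and fall back to defaults for optional fields that an older service does not send. The sketch below is a minimal illustration in Python. The `ProductOffer` type, its field names, and the default currency are hypothetical and not taken from any real API.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class ProductOffer:
    """Hypothetical client-side view of a payload returned by another service."""
    offer_id: str
    name: str
    currency: str = "USD"  # assumed default for older services that omit this field


def parse_offer(payload: Mapping[str, Any]) -> ProductOffer:
    """Build a ProductOffer, ignoring unknown fields and defaulting missing optional ones."""
    known = {f.name for f in fields(ProductOffer)}
    accepted = {key: value for key, value in payload.items() if key in known}
    return ProductOffer(**accepted)


# A newer service adds a field this client does not know about.
newer = parse_offer({"offer_id": "o-1", "name": "Plan A", "currency": "EUR", "bundle": ["x"]})
# An older service has not yet started sending the optional currency field.
older = parse_offer({"offer_id": "o-2", "name": "Plan B"})
```

This only covers the shape of the data. It does nothing for the semantic, stateful, and non-functional aspects of compatibility, which is why a technique like this reduces the problem without removing it.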

Application features involve workflows that span two or more collaborating microservices. For example, a design-time component provides the product modeling for a runtime component for selling and ordering those products. Selling and ordering cannot function without the product model, so the collaboration between those services must integrate properly for features to work. Most features rely on collaborations involving several services. Often, the work resulting from one service is needed to drive the processing in other services, as was the case in the selling and ordering example above. This pattern is repeated broadly in most applications. Once all collaborations are accounted for across the supported use cases, the integrations across services would naturally cover every service. The desire for an independent life cycle for each service that composes the application faces the interoperability challenges across this entire scope. There goes our independence.

Given the need to certify a snapshot of all services that compose an application to work properly together, we need a mechanism to correlate the versioning of source code to versions of binaries (container images) for deployment. Source code can be tagged with a release. This includes Helm charts, Kubernetes YAML files, Ansible playbooks, and whatever other artifacts support the control plane and operations pipelines for the application. A snapshot must be taken of the Helm chart versions and their corresponding container image versions, so that the complete deployment can be reproduced.
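As a rough sketch of such a snapshot, the following Python records, for one application release, each service's source tag, Helm chart version, and container image reference. All service names, versions, registry hosts, and digest placeholders are made up for illustration, and the manifest layout is an assumption rather than any standard format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServiceRelease:
    service: str        # hypothetical service name
    source_tag: str     # tag applied to the source repository
    chart_version: str  # version of the Helm chart
    image: str          # container image reference, ideally pinned by digest


def build_manifest(app_release: str, services: list[ServiceRelease]) -> str:
    """Serialize the snapshot deterministically so it can be stored and diffed."""
    ordered = sorted(services, key=lambda s: s.service)
    doc = {"application_release": app_release, "services": [asdict(s) for s in ordered]}
    return json.dumps(doc, indent=2, sort_keys=True)


manifest = build_manifest(
    "2024.1",
    [
        ServiceRelease("ordering", "v3.2.0", "3.2.0", "registry.example.com/ordering@sha256:<digest>"),
        ServiceRelease("catalog", "v1.4.1", "1.4.1", "registry.example.com/catalog@sha256:<digest>"),
    ],
)
print(manifest)
```

Storing a manifest like this alongside the release tag gives troubleshooting and bug reports a single record to point at.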

This identifies an application release as a set of releases of services deployed together. This information aids in troubleshooting, bug reporting, and reproducing a build of those container images and artifacts from source code, each at the same version as what was released for deployment. This is software release management 101, nothing out of the ordinary. What is noteworthy is our inability to extricate ourselves from the monolithic approach to life cycle management despite adopting a modern microservices architecture.

Worse still, if our application is integrated into a suite of applications, as enterprise applications tend to be, the system integration nightmare broadens the scope to the entire suite. The desire for an independent life cycle even for each application that composes the suite faces interoperability challenges across this entire scope. What a debacle this has turned out to be. The system integration nightmare is the challenge that modern software engineering continues to fail at solving across the industry.

DevOps Transparency and Coordination

Coordinating human activities across organizations and disciplines is fundamental to DevOps. This requires having documented procedures to handle any situation, tools that enable participants to collaborate effectively, and a shared understanding of what information needs to be captured and communicated for the work to proceed, especially when the actors are likely to be separated by space and time.

A DevOps procedure is initiated for a reason. These situations include scheduled maintenance, a response to an alert (detection of a condition that deserves attention), or a response to a request for support. In each case, there should be a ‘ticket’ (a record in a tool) that notifies a responsible DevOps engineer to work the issue. The relevant procedure that applies to a ticket should be obvious and, ideally, referenced explicitly.

When responding to an alert or a support request (usually a complaint about a service malfunction), the work usually begins by confirming the reported condition. This requires gathering information about the context and collecting diagnostics to aid in troubleshooting. Ideally, the ticket clearly identifies the problem; otherwise, further interaction is needed to gain such clarity. Humans routinely and wrongly assume that the recipient of a request has all the necessary context to understand what is being asked of them and why. To mitigate this inefficiency, tooling and procedural documentation are usually provided to guide how tickets should be written, so that many questions and answers are not needed afterward to satisfy the request.

The engineer who works a ticket should capture the service configuration, relevant logs to determine the failure mode, and any other data for the context associated with the problem for analysis toward an operational fix or for submitting a bug, if applicable.

A designated channel should be used for engineers to collaborate. Each engineer must record in real time the actions taken to troubleshoot, analyze, and correct the problem. This aids in coordination between multiple individuals in different roles across organizations to work together. Good communication enables everyone involved to stay informed and avoid taking actions that interfere with each other. Moreover, an accurate record can be reviewed later as a post-mortem. In the course of the

Severe incidents, such as those that cause a service outage, demand a root cause analysis toward preventative actions, such as process improvements, procedure documentation, operational tooling, or developing a permanent fix (for a software bug). This depends on the ticket capturing the necessary information to trace how the problem originated, such as the transaction processing in progress, events or metrics indicating resource utilization out of bounds (e.g., out of memory, insufficient CPU, I/O bandwidth limited, storage exhaustion) or performance impairment (e.g., lock contention, queue length, latency), or anything else that appears out of the ordinary.

One of the biggest impediments to verifying that a problem is resolved is that DevOps normally does not have access to the service features being exercised by end users. When a functional problem is reported by users, it may not be possible for DevOps to confirm that the problem is fixed from the user’s perspective. Communication with users may need to be mediated by customer support staff. The information on the ticket would need to facilitate this interaction.

Accurate record-keeping also enables later audits in case there are subsequent problems. The record of actions taken can be used to analyze whether these actions (such as configuration changes or software patches) are contributing to other problems. Troubleshooting procedures should include data mining of past incidents (have we seen this problem before? how did we fix it previously?) and auditing what procedures may have impacted the service under investigation (what could have broken this?).

The above guidance can be summarized as follows.

  • Say what you do
  • Do what you say
  • Write it down

Future Distributed Applications

Big Tech censorship and cancel culture are becoming intolerable. Politicization of business is destroying the fabric of society. Corporate oligarchs are implementing partisan agendas to shape public discourse by applying so-called “community standards” for social media content moderation. They de-platform personalities who express opinions that run counter to approved narratives. They silence dissent. Free speech and freedom of association are under threat, as private companies are coerced by state regulatory action, looming threats of state intervention, and mob rule through heckler’s veto, bullying, harassment, doxxing, and cancel culture. Concentration of power and control in a few dominant platforms, such as Google, YouTube, Facebook, Twitter, Wikipedia, and their peers has harmed consumer choice. Anti-competitive behavior, such as collusion among platform and infrastructure services to deny service to competitive upstarts and undesirable non-conformists, has suppressed alternatives like Parler, Gab, and BitChute.

The current generation of dominant platforms does not allow editorial control to be retained by content creators. The platform is viewed as the ultimate authority, and users are limited in their ability to assert control to form self-moderated communities and to set their own community standards. Control is asserted by the central platform authorities.

Control needs to be decoupled from centralized platform authorities and put back in the hands of content creators (authors, podcasters, video makers) and end users (content consumers and social participants). Editorial control over legal content does not belong with Big Tech. What constitutes legal content is dependent on the user’s jurisdiction, not Big Tech’s harmonization of globalist attitudes. To Americans, hate speech is protected speech, and it needs to be freely expressible. Similarly, users in other jurisdictions should be governed according to their own standards.

We need to develop apps with peer-to-peer protocols and end-to-end encryption to cut out the middlemen, which will exterminate today’s generation of social media companies. Better yet, application logic itself should be deployable on user-controlled compute with user-controlled encrypted storage on any choice of infrastructure providers (providing a real impetus for the adoption of Edge Computing), so that centralized technology monopolies cannot dominate as they do today. This approach needs to be applied to decentralize all apps, including video, audio podcasts, music, messaging, news, and other content distribution.

I believe the next frontier for the Internet will be the development of a generalized approach on top of HTTP or as an adjunct to HTTP (like BitTorrent) to enable distributed apps that put app logic and data storage at end-points controlled by users. This would eliminate control by middlemen over what content can be created and shared.

Applications must be distributed in a topology where a node is dedicated to each user, so that the user maintains control over the processing and data storage associated with their own content. Applications must be portable across cloud infrastructures available from multiple providers. A user should be able to deploy an application node on any choice of infrastructure provider. This would enable users to be immune from being de-platformed.

With an application whose logic and data are distributed in topology and administrative control, the content should be digitally signed so that it can be authenticated (verified to be produced by the user who owns it). This is necessary, so that a user’s application node can be moved to an alternative infrastructure (compute and storage) without other application nodes needing to establish any form of trust. Consumers (the audience with whom the owner shares content) and processors (other computational services that may operate on the content) of the information would be able to verify that the information is authentic, not forged or tampered with. The relationship between users and among application nodes, as well as processors, is based on zero trust.

Processing of information often involves mirroring and syndication. Mirroring with locality for low latency access gives certain types of transaction processing, such as search indexing, the performance characteristics they need. Authorizing a search engine to index one’s content does not automatically grant users of the search engine access to the content. Perhaps only an excerpt is presented by the search engine along with the owner’s address, where the user may request access. A standard protocol is needed to enable this negotiation to be efficient and automated, if the content owner chooses to forgo human review and approval.

We need to change how social applications control the relationship between content producers and content consumers. First, for original source content, the root of a new discussion thread, the owner must control how broadly it is published. Second, consumers of content must control what sources of information they consume and how it is presented. Equally important, consumers of an article become producers of reviews and comments when interacting in a social network. The same principle must apply universally to the follow-on interactions, so that the article’s author should not be able to block haters from commenting, but the author is not obligated to read them. Similarly, readers are not obligated to see hateful commenters, whom they want to exclude from their network. The intent should be to enable each person to control their own content and experience, ceding no control to others.

Social applications need self-managed communities with member administered access control and content moderation. Community membership tends to be fluid with subgroups merging and splitting regularly. Each member’s access and content should follow their own memberships rather than being administered by others in those communities. The intent is to mitigate a blacklisted individual being cancelled by mobs. If a cancelled individual can form their own community and move their allies there with ease, cancel culture becomes powerless as a tool of suppression with global reach. Its reach is limited to communities that quarantine themselves.

This notion of social network or community is decentralized. A social application may support a registry of members, which would serve as a superset of potential relationships for content distribution. This would enable a new member to join a social network and request access to their content. Presumably, most members would enable automatic authorization of new members to see their content, if the new member has not been blocked previously. That is, enable a community to default to public square with open participation. However, honor freedom of association, so that no one is forced to interact with those with whom there is no desire to associate, and no one can be banned from forming their own mutually agreed relationships.

We need software innovations to address this urgent need to counter the censors, the cancellers, the de-platformers, the prohibitionists, the silencers of dissent, and the government oppressors. We don’t yet have a good understanding of the requirements which I’ve touched upon above, as I have only scratched the surface. We need an architecture to enable the unstoppable Open Internet that we failed to preserve from the early days. We need to develop a platform that realizes this vision to restore a healthy social fabric for our online communities.

System Integration

There is something terribly wrong with software development in the enterprise application space. No one is able to release working software without coordinating across all product development teams to align the version of every product in the universe, because end-to-end workflows can’t be made to work as products are released on independent life cycles.

I believe we are missing architectural design principles. We talk about forward and backward compatibility of APIs, but I’m not sure the industry deeply understands what that entails. The problem goes beyond teams within an organization, because the software industry doesn’t even understand what compatibility entails.

The issue lies in how the base application (e.g., product catalog, store front, sales automation, care, order fulfillment, customer and subscription management, charging, billing, revenue management) is horizontal (generic) and hollow, expecting after-market extensibility to provide the vertical behavior that is specialized for the industry and the enterprise’s business model. The intent of the application vendor is to provide a general purpose platform that can be tailored after-market to the peculiarities of any enterprise. The application will implement an API defined by industry standards (say, tmforum.org for the communications industry) that reflects this general purpose hollowness. The application doesn’t have any real substance until it is customized to model the business. For example, a product catalog would not come populated with 5G mobile product specifications that are branded and priced according to a 5G service provider’s business model.

When extending entities with data that have hidden meaning, implied behavior, constraints, and statefulness (life cycle, workflow), these contribute to the API in ways that were not defined by the original specification. Each new element introduces some degree of incompatibility. Industry standards can never specify in a precise and rigorous manner things they did not foresee.

Stateful behavior is especially troublesome to specify in a manner that ensures compatibility. This includes conversational state and persistent state. Conversational state is where linked information is implicitly kept across multiple requests involved in the same session. A cursor for iterating through a collection of query results is an example of conversational state. Persistent state is durable across transactions, having memory that outlives a transaction, a session, a process, and even a compute instance. When methods can only act against objects in certain states, but not others, this constraint must be honored for compatibility across collaborating components.

Objects and attributes are allowed to take on certain values at various points in their life cycles, and transactional behavior and workflow (the steps performed by business processes) are conditional upon the state of these objects. For example, when equipment is installed, it may be in various states of readiness for production use, but when not installed the equipment’s operational characteristics and configuration are irrelevant. Every component with access to that object must understand these semantics and enforce them consistently, otherwise there is no compatibility. Unfortunately, even these very simple conditional constraints and the ones in the previous paragraph are beyond the capability of today’s prevailing interface specification languages and entity modeling frameworks.

Immutability is often conditional on the life cycle state of an entity. For example, an order can be edited during information capture, but its captured intent cannot be edited after the order is firm and in the process of being fulfilled. Again, this constraint cannot be specified in a manner that ensures compatibility across collaborating components.
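To make the order example concrete, here is a minimal Python sketch of a state-dependent editing rule. The state names, the `Order` class, and its methods are hypothetical, chosen only to mirror the example above.

```python
from enum import Enum


class OrderState(Enum):
    CAPTURE = "capture"        # information is still being captured
    FIRM = "firm"              # intent is committed
    FULFILLING = "fulfilling"  # fulfillment is in progress


class OrderNotEditable(Exception):
    """Raised when an edit is attempted in a state that forbids it."""


class Order:
    EDITABLE_STATES = {OrderState.CAPTURE}

    def __init__(self) -> None:
        self.state = OrderState.CAPTURE
        self.items: list[str] = []

    def add_item(self, item: str) -> None:
        # The constraint lives here, in one component's code.
        if self.state not in self.EDITABLE_STATES:
            raise OrderNotEditable(f"cannot edit order in state {self.state.value}")
        self.items.append(item)

    def confirm(self) -> None:
        self.state = OrderState.FIRM


order = Order()
order.add_item("5G plan")
order.confirm()
try:
    order.add_item("handset")
except OrderNotEditable as exc:
    print(exc)
```

The rule is easy to enforce inside one component. A collaborating component that sees only the interface specification has no way to learn it, which is the gap this section describes.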

Methods have failure modes, usually specified as failure responses, error codes, or exceptions. Some kinds of failures are recoverable using techniques like retrying, while others are non-recoverable. This too is usually not expressible for compatibility.

Methods have performance expectations in terms of latency, concurrency, and transaction volume. Methods have resource consumption expectations in terms of memory, CPU, storage, network, and I/O. Methods that involve data sets have expectations about how much data can be passed with corresponding performance and scalability characteristics. This too is usually not expressible for compatibility.

Objects and their attributes are often persistent on durable storage. Subsets of attributes may be persistent, while others are volatile or derived (computed based on the value of other attributes, such as a rolled-up status or a count of a collection). This too is usually not expressible for compatibility.

Methods must trade off consistency, availability, and partition tolerance. The expectation of what trade-offs should be chosen is usually not expressible for compatibility.

Methods expect the caller to be authenticated and they are expected to enforce access control to verify that the caller is authorized. Moreover, the method is expected to enforce data permissions and data privacy. This too is usually not expressible for compatibility.

The list of requirements and constraints that contribute to compatibility goes on. The above is a sampling to give the reader a sense of the problem, not to be comprehensive. The intent is to show how formal specifications are grossly insufficient to ensure a high degree of compatibility across heterogeneous suppliers and independently developed implementations.

Because API compatibility is so unreliable based on specifications and contract testing, the promise of a microservice architecture (within an application) or a service-oriented architecture (for integrating applications across the enterprise) cannot be achieved naively. System integration continues to be plagued by a waterfall model of requiring a complete line-up of application versions to be tested end-to-end, before we have any confidence that they work together. The benefits of agile development and independent life cycles are not achievable, because the pre-requisite compatibility guarantees cannot be met. System integration of enterprise applications remains in the stone age because of this crippling deficiency.

Pain Feedback Loops

Feedback loops are very important to regulate behavior within an enterprise. This applies to both rewarding positive behavior, and encouraging more of it, as well as correcting negative behavior to get less of it. Continuous improvement is about feedback loops.

Focusing on negative feedback, we should recognize a phenomenon called ‘pain’. In this context, it refers mostly to pains in the ass, which are discomforts, inconveniences, and frustrations that burden people’s lives, draining their time and energy in unproductive ways. In DevOps, when high severity operational problems arise, such as a service outage in the middle of the night, pain manifests in a pager alert that wakes up an engineer to troubleshoot the incident and resolve the problem.

When fires need to be fought, fire-fighters experience this pain in proportion to the number of fires and their severity. Development teams tend to avoid work that does not deliver features, with the goal of faster time to market. They inevitably cut corners in areas that make operations more efficient, because they tend not to be placed in the position of experiencing the pain when it comes. Disconnecting development priorities from operational responsibilities is a recipe for the infliction of pain on those who do not deserve it, and the result is an excess of unexpected pain that should have been foreseen and mitigated. The integration of development with operations into DevOps is intended to establish this connection. This connection must not be undermined by paying mere lip-service to operations without putting real skin in the game, so that development staff experience pain for operational failures as much as operations staff.

DevOps Mentality

Developers who are new to Operations, as they become immersed in DevOps culture, may envision that their involvement in operations follows after development is done. However, operations are not an afterthought. This article enumerates some operations-related impacts on development practices that may not be at the forefront of a developer’s mind, but should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

If logging an ERROR level message, it should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, it should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. Accounting for the volume of messages collected per service instance per day multiplied by one year retention and the number of service instances deployed, that may be hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.
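The storage estimate can be checked with simple arithmetic. Every input in the sketch below is an illustrative assumption (not a measurement or a vendor price), picked only to show how the orders of magnitude mentioned above can arise once a full year of messages is retained.

```python
# All inputs are illustrative assumptions, not measurements or vendor prices.
GB_PER_INSTANCE_PER_DAY = 2.0
RETENTION_DAYS = 365
SERVICE_INSTANCES = 300
PRICE_PER_GB_MONTH = 0.10  # assumed blended cost of storage plus indexing, in dollars

# Steady state: a full retention window of messages is held at any time.
retained_gb = GB_PER_INSTANCE_PER_DAY * RETENTION_DAYS * SERVICE_INSTANCES
retained_tb = retained_gb / 1000
monthly_cost = retained_gb * PRICE_PER_GB_MONTH

print(f"{retained_tb:.0f} TB retained, about ${monthly_cost:,.0f} per month")
```

With these assumed inputs the retained volume is 219 TB and the monthly cost is roughly $21,900. Halving the message volume per instance halves both figures, which is the practical argument for reducing noise.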

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by classifying messages at a finer-grained log level (e.g., INFO or DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.
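As a minimal sketch of keying alerts off a message identifier, the Python below assumes a line format of level, identifier, then free text. That format, the regular expression, and the second identifier in the alert set are assumptions for illustration; only OLTP-0123 comes from the example above.

```python
import re

# Hypothetical format: "<LEVEL> <MESSAGE-ID> <text>", e.g. "ERROR OLTP-0123 ...".
LINE_PATTERN = re.compile(r"^(?P<level>[A-Z]+) (?P<msg_id>[A-Z]+-\d{4}) (?P<text>.*)$")

# Identifiers that have an alert rule; the second one is invented for illustration.
ALERT_IDS = {"OLTP-0123", "OLTP-0456"}


def should_alert(line: str) -> bool:
    """Return True when the line carries a message identifier that has an alert rule."""
    match = LINE_PATTERN.match(line)
    return bool(match) and match.group("msg_id") in ALERT_IDS
```

Because the rule matches the identifier rather than the wording, developers can reword the message text without silently breaking the alert.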

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the most significant properties to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem when troubleshooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.
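One way to carry context while keeping sensitive values out of the log is to mask them at the point where the context is rendered. The sketch below uses Python's standard `logging` module. The logger name, the list of sensitive keys, the message identifier, and the `log_payment_failure` helper are all hypothetical.

```python
import logging

SENSITIVE_KEYS = {"password", "token", "ssn"}  # assumed list; adjust to your data model


def format_context(context: dict) -> str:
    """Render key=value pairs, masking values whose keys are considered sensitive."""
    parts = []
    for key in sorted(context):
        value = "***" if key.lower() in SENSITIVE_KEYS else context[key]
        parts.append(f"{key}={value}")
    return " ".join(parts)


logger = logging.getLogger("orders")  # hypothetical service logger


def log_payment_failure(order_id: str, amount: str, token: str) -> None:
    """Log one parameterized error with the entity identifier and key properties."""
    logger.error(
        "OLTP-0123 payment authorization failed %s",
        format_context({"order_id": order_id, "amount": amount, "token": token}),
    )
```

A key-based mask is only a first line of defense. It will not catch sensitive values embedded in free text, so reviewing what gets passed as context still matters.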

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fallback mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.
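The retry guidance can be sketched as follows in Python: failures are counted silently, success stays silent, and a single summary is logged only when all attempts are exhausted. The `TransientError` class, the logger name, and the message identifier are hypothetical, and the sketch assumes `max_attempts` is at least 1.

```python
import logging
import time

logger = logging.getLogger("orders")  # hypothetical service logger


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Retry on TransientError, counting failures silently and logging one summary."""
    failures = 0
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()  # success: return without logging anything
        except TransientError as exc:
            failures += 1
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # Only one message is produced, after every attempt has failed.
    logger.error("OLTP-0200 gave up after %d attempts: %s", failures, last_error)
    raise last_error
```

A caller that has a fallback mechanism or circuit breaker would catch the re-raised error and escalate from there.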

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.

Pay special attention to methods and transactions with security events.

  1. A redirect of a request for a user interface to the login page
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged in a format that allows such messages to be classified, so that they can be directed to a security information and event management (SIEM) system for special handling. The SIEM system is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.
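One possible classifiable format is a JSON line that carries a fixed category field, which a log router could match on. The sketch below assumes that approach; the event names, field names, and logger name are hypothetical and do not follow any particular SIEM product's schema.

```python
import json
import logging

logger = logging.getLogger("security")  # hypothetical logger name

# Hypothetical event names, one per kind of security event listed above.
SECURITY_EVENTS = {
    "login_redirect",
    "login_success",
    "login_failure",
    "privileged_action",
    "access_denied",
}


def security_event(event, user, **details):
    """Emit one JSON log line tagged so that it can be routed separately."""
    if event not in SECURITY_EVENTS:
        raise ValueError(f"unknown security event: {event}")
    # Fixed fields come last so that extra details cannot overwrite them.
    record = {**details, "category": "security", "event": event, "user": user}
    line = json.dumps(record, sort_keys=True)
    logger.info("%s", line)
    return line
```

A router can then forward every line whose `category` is `security` without parsing free-form message text.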

Audiences

One of the most common mistakes is to conflate the log stream directed toward operations with information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency into the steps that are executing. This includes integrating other services to collaborate or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended not to differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended for feedback to end users, are directed to operational logs. Operations have no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to noise.

Developers must pay attention to the intended audience of each message, so that it is directed to operational logs, to end users, or to both.

Telemetry

“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that these outages can be measured against the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for service availability
  • Latency – the time between receiving a request and sending its corresponding response tells us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tells us how well the system is performing with regard to workloads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs
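Threshold crossing can be illustrated with a small sketch that reports only the transitions across a limit, rather than every sample above it. The function name and the sample data are assumptions; a real deployment would normally rely on the alerting rules of its time series database.

```python
def threshold_crossings(samples, threshold):
    """Return (timestamp, state) events for each crossing of the threshold.

    `samples` is an iterable of (timestamp, value) pairs in time order.
    """
    events = []
    above = False
    for timestamp, value in samples:
        now_above = value > threshold
        if now_above != above:
            # Report only the change of state, not every sample.
            events.append((timestamp, "raised" if now_above else "cleared"))
            above = now_above
    return events


# Hypothetical latency samples in milliseconds against a 500 ms objective.
latency_ms = [(0, 120), (1, 480), (2, 510), (3, 530), (4, 300)]
events = threshold_crossings(latency_ms, 500)
```

Reporting transitions instead of samples keeps alerts from repeating while a condition persists.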

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (e.g., data purging), to maintain reasonable performance (e.g., recalculating statistics), and to enforce policies (e.g., secrets rotation)
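For the fault tolerance item, one commonly used technique is a circuit breaker, which stops calling a failing dependency for a while and serves a fallback instead. The sketch below is a simplified illustration under assumed names and defaults, not a production implementation.

```python
import time


class CircuitBreaker:
    """Skip calls to a failing dependency until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()  # open: do not touch the dependency
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The injectable `clock` parameter exists only so that the cool-down can be exercised in tests without waiting.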

Pipelines

Operations involve actions initiated through human access (by operations staff) and systems integration (e.g., by capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available as pipelines (programmatic workflows) operationally. Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

  • Provisioning – creation and termination of the service subscription
  • Upgrades and patches – deploying software updates and bug fixes
  • Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization
  • Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution
  • Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)
  • Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)
  • Corrective actions – interventions for problem resolution, such as stopping and restarting of services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (prevent security breaches), replacing defective resources (e.g., kernel deadlocks)
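To illustrate what a parameterized action could look like, the sketch below validates a precise set of inputs before a pipeline uses them. The parameter names, the validation rules, and the three steps are all hypothetical; an actual pipeline would define its own.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeParameters:
    """Inputs for a hypothetical upgrade pipeline; every field is illustrative."""

    tenant: str
    target_version: str
    dry_run: bool = False

    def __post_init__(self):
        # Reject malformed input before any step of the pipeline runs.
        if not re.fullmatch(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?", self.tenant):
            raise ValueError(f"invalid tenant name: {self.tenant!r}")
        if not re.fullmatch(r"\d+\.\d+\.\d+", self.target_version):
            raise ValueError(f"invalid version: {self.target_version!r}")


def plan_upgrade(params):
    """Return the ordered steps the pipeline would execute for these inputs."""
    steps = [
        f"snapshot {params.tenant}",
        f"deploy {params.target_version} to {params.tenant}",
        f"verify {params.tenant}",
    ]
    return [f"(dry run) {step}" for step in steps] if params.dry_run else steps
```

Because the parameters are immutable and validated up front, the same input always produces the same plan, which is what makes the change reproducible.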

See also:

  1. The Twelve-Factor App
  2. 10 key attributes of cloud native applications

Scaling operations across tenants in the cloud

Dear Santa,

Currently, when using the tenant-per-namespace deployment model, operational management procedures are difficult to scale to many tenants, because typical actions like patching, upgrading, stopping, starting, etc. must be initiated as pipeline jobs, once per tenant, and watched for successful execution per job. This is labor intensive, error-prone (having to re-input the same input parameters per pipeline job), and tedious to manage. Therefore, it is not scalable in its current form.

To enable this model to scale, tooling is required to enable a single specification of intent to serve as input into an automated workflow that performs the required action across every applicable namespace (tenant). The intended action may be as simple as `kubectl patch` or it may be a very complex job (upgrade all resources). The workflow would coordinate the parallel execution of these actions against their respective namespaces (identified either by label or a list of names), possibly throttling for limited concurrency to avoid resource contention, and reporting output for status monitoring and troubleshooting. This would reduce the operational cost and complexity of deploying patches and upgrades from O(n) to approximately O(1) for n tenants.
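The coordination described here could be sketched with a bounded thread pool that runs one action per namespace and collects a status report. The function names are hypothetical, and the action is passed in as a plain callable; in practice it might shell out to `kubectl` or trigger a pipeline job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_across_tenants(namespaces, action, max_concurrency=4):
    """Run action(namespace) for every namespace with limited concurrency.

    Returns a dict mapping each namespace to ("succeeded", result) or
    ("failed", error message), so one failure does not hide the others.
    """
    report = {}
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(action, ns): ns for ns in namespaces}
        for future in as_completed(futures):
            ns = futures[future]
            try:
                report[ns] = ("succeeded", future.result())
            except Exception as error:
                report[ns] = ("failed", str(error))
    return report


def fake_patch(namespace):
    """Stand-in for a real per-tenant action, used only for illustration."""
    if namespace == "tenant-b":
        raise RuntimeError("rollout timed out")
    return f"patched {namespace}"


report = run_across_tenants(["tenant-a", "tenant-b", "tenant-c"], fake_patch,
                            max_concurrency=2)
```

The operator supplies the intent once, and `max_concurrency` provides the throttle that keeps parallel jobs from contending for shared resources.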

Personal Assistants

Continuing the series on Revolutionizing the Enterprise, where we left off at Sparking the Revolution, I would like to further emphasize immediate opportunities for productive improvements, which do not need to venture into much-hyped speculative technologies like blockchain and artificial intelligence.

In the previous article, I identified communication and negotiation as skills where software agents can contribute superior capabilities to improve human productivity by offloading tedium and toil. Basic elements of this problem can be solved without applying advanced technology like AI. Machine learning can provide additional value by discerning a person’s preferences and priorities. For example, this person always prefers to reschedule dentist appointments but never reschedules family events to accommodate work. Automating the learning of rules enables the prioritization of activities to be automated, further offloading cognitive load.

In my own work, I wish I had a personal assistant, who could shadow my every move. I want it to record my activities so I can replay them later. I want these activities to be in the most concise and compact form, not only as audio and video. For example, as I execute commands in a bash shell, I want to record the command line arguments, the inputs, and the outputs, so this textual information can be copied to technical documentation. As I point and click through a graphical user interface, I want these events to be described as instructions (e.g., input “John Doe” in the field labeled “Name” and click on the “Submit” button).

With a history of my work in this form, this information will be useful for a number of purposes.

  • Someone who pioneers a procedure will eventually need to document it for knowledge transfer. Operating procedures teach others how to accomplish the same tasks by observing how it was done.
  • Pair programming is often inconvenient due to team members being located remote from each other and separated by time zones. An activity log can enable two remote workers to collaborate more effectively.
  • Context switching between tasks is expensive in terms of organizing one’s thoughts. Remembering what a person was doing, so that they can resume later, would save time and improve effectiveness.

The above would be a good starting point for a personal assistant without applying any form of AI or analytics. Then, imagine what might be possible as future enhancements. Procedures can be optimized. Bad habits can be replaced by better ones. Techniques used by more effective workers can be taught to others. Highly repeatable tasks can be automated to remove that burden from humans.

I truly believe the places to begin innovating to revolutionize the enterprise are the mundane and ordinary, which machines have the patience, discipline, and endurance to perform better than humans. More ambitious technological capabilities are good value-adds, but we should start with the basics to establish personal assistants in the enterprise as participants in ordinary work, not as esoteric tools in obscure niches.

[Image credit – Robotics and the Personal Assistant of the Future]

planning is useless

Whenever an organization is faced with challenges that require many people to move in a different direction, change their behavior, adjust their attitudes, or alter their thinking, the first thing that management wants to put in place is leadership. They always believe that the proper top-down inspiration, instruction, and oversight will drive the desired results. They believe this model scales hierarchically.

I don’t believe it’s true of problems for which the organization does not have experience and expertise. The more technical and schedule risk that a project incurs because of greater unknowns, the less helpful project planning is. The ability to plan relies on a degree of analysis and design. Without relevant experience to help speculate on how to implement something, planning must happen in ignorance. The plans are meaningless, because actual implementation experience will likely invalidate those plans and designs. Unfortunately, the natural reaction is to spend more time and effort getting those plans right, as the plan goes off track with execution. The more right you try to make it, the worse that situation becomes, as the organization invests more in a futile activity, and less in activities that actually achieve the result. A “learning organization” is what is needed, not one that assumes it knows (or more importantly “can know”) what it’s doing without having done it yet.

The idea of “spontaneous order” is appealing, but that requires all participants to behave rationally with the right signals, so they can work things out among themselves. In large engineering organizations, this does not seem to work, because the communication channels are too narrow, the number of participants too great, and the volume and complexity of knowledge that must be exchanged too vast. Individuals become overwhelmed and cannot keep up. Management structures are inevitably put in place to introduce controls and gatekeepers. Whereas chaos is too noisy and incoherent, the imposition of order prevents knowledge pathways from forming spontaneously.

I’m left wondering if there are methods that facilitate spontaneous order. Autonomy, mastery, and purpose are great motivators in the abstract, but they don’t easily translate into concrete methods and tools. I noticed that Facebook has started implementing a system like khanacademy.org for helping edit location information, where it awards points, badges, and levels. Such systems really do provide users with a motivation to achieve the measured outcome. I’m wondering if gamification is a superior way to achieve outcomes.

Sparking the Revolution

In my previous article, Revolutionizing the Enterprise, I provided an outlook for how emerging technologies may help to transform how we do work. Now, let’s explore how we might provide the spark that starts the fire to burn down the old and welcome the new. The world does not change in a radical way without a progression of steps that pave a path for getting from here to there. What might the first step be toward introducing robots and AIs as personal assistants into the regular work lives of numerous employees?

We need only look to our daily struggles to identify where every person would see the value of machine intelligence. Organizing a meeting among several participants can be challenging. You need to find a convenient time when every participant is available. You need to find a suitable venue that can accommodate everyone. If folks need to travel, the complexity rises enormously, because each traveler’s attendance is then dependent upon successfully booking travel arrangements. The risk of a single unsatisfied requirement causing the meeting to be non-viable rises with each participant and their special needs. If the meeting needs to be moved to accommodate certain participants, this would then trigger a storm of activity to renegotiate, and a flurry of activity to explore how calendars can be readjusted with a cascade of renegotiations of other appointments, each having its own priority and constraints.

This kind of negotiation among a network of people is virtually impossible for humans to accomplish among themselves, because of the latency of human communication. However, if every human could be represented by an agent that could negotiate on their behalf, this kind of activity could become painless. Imagine how many hours of phone tag, email, and travel booking could be saved. Even if an agent were not entrusted to finalize decisions on travel booking, all of the negotiation and arrangements could be prepared and presented for final approval by the human; the agent could even involve the human at key decision points by presenting a short list of options to guide the way forward.

I believe ordinary, mundane problems such as this one, which every person has experienced, will serve as an opportunity to introduce machine intelligence to work alongside us. The off-loading of such unproductive and non-creative toil to an automated personal assistant would be a welcome change that would be seen as another useful tool, rather than a radical development. And that’s how the revolution should begin.