Category Archives: software

Decentralization: Be Unstoppable and Ungovernable

The trucker’s freedom convoy in Canada has revealed how individuals are vulnerable to tyrannical (rights violating) actions by governments and corporations cooperating with authoritarian diktats across jurisdictional boundaries. Maajid Nawaz warns of totalitarian power over the populace using a social credit system imposed via central bank digital currency (CBDC) regimes being developed to eliminate cash. “Programmable” tokens will give the state power to control who may participate in financial transactions, with whom, when, for what, and how much. Such a regime would enable government tyranny to reign supreme over everyone and across everything within its reach.

Centralized dictatorial power is countered by decentralization, especially that which is designed into technology to become unchangeable by humans after the technology proliferates. The design principle is known as "Code is law." The Proof of Work (PoW) consensus algorithm in Bitcoin is one such technology. CBDC is an attempt to prevent Bitcoin from becoming dominant. Criticism that PoW uses too much electricity is another tactic. National and supranational powers (above nation states) are working against decentralization in order to preserve their dominance. The World Economic Forum (WEF) is installing its people into national legislatures and administrations to enact policies similar to those of the Chinese Communist Party (CCP), concentrating globalized power for greater centralization of control.

We look toward Web3 and beyond to enable decentralization of digital services. As we explore decentralized applications, we must consider the intent behind distributed architectures for decentralization. What do we want from Web3?

Unstoppable Availability

Traditionally, we think about availability with regard to failure modes in the infrastructure, platform services, and application components. Ordinarily, we do not design for resiliency to the total loss of infrastructure and platform, because we don't consider our suppliers to be potentially hostile actors. However, global integration, and the unholy nexus of multinational corporations cooperating with foreign governments to impose extrajudicial punishments, put individuals within the reach of unjust laws and the tyrannical diktats of authoritarians, even when those individuals reside outside the jurisdiction of the governments hostile to their cause. It is clear now that this is one of the greatest threats that must be mitigated.

Web3 technologies, such as blockchain, grew out of recognition that fiat is the enemy of the people, and we must decentralize by becoming trustless and disintermediated. Eliminate single points of failure everywhere. Run portably on compute, storage, and networking that are distributed across competitive providers in adversarial jurisdictions across the globe without cooperation. When totalitarianism comes, Bitcoin is the countermove. Decouple from centralized financial systems, including central banking and fiat currencies. Become unstoppable and ungovernable, resistant to totalitarianism.

To become unstoppable, users need to gain immunity from de-platforming and supply chain disruption through decentralization. Users need to be able to keep custody of their own data. Users need to self-host the application logic that operates on their data. Users need to compose other users’ data for collaboration without going through intermediaries (service providers who can block or limit access). Then, to achieve resiliency, users need to be able to migrate their software components to alternative infrastructure and platform providers, while maintaining custody of their data across providers. At a minimum, this migration must be doable by performing a human procedure with some acceptable interruption of service. Ideally, the possible deployment topologies would have been pre-configured to fail-over or switch-over automatically as needed with minimal service disruption. Orchestrating the name resolution, deployment, and configuration of services across multiple heterogeneous (competitive) clouds is a key ingredient.

Custody of data means that the owner must maintain administrative control over its storage and access, as well as having the option of keeping a copy of it on physical hardware that the owner controls. Self-hosting means that the owner must maintain administrative control over the resources and access for serving the application functions to its owner, and for that hosting to be unencumbered and technically practical to migrate to alternative resources (computing, financial, and human).

If Venezuela can be blocked from “some Ethereum services”, that is a huge red flag. Service providers should be free to block undesirable users. But if the protocol and platform enables authorities to block users from hosting and accessing their own services, then the technology is worthless for decentralization. Decentralization must enable users to take their business elsewhere.

Ungovernable Privacy

Privacy is a conundrum. Users need a way to identify themselves and authenticate themselves to exert ownership over their data and resources. Simultaneously, a user may have good reason to keep their identity hidden, presenting only a pseudonym or remaining cloaked in anonymity in public, where appropriate. Meanwhile, governments are becoming increasingly overbearing in their imposition of “Know Your Customer” (KYC) regulations on businesses ostensibly to combat money laundering. This is at odds with the people’s right to privacy and being free from unreasonable searches and surveillance. Moreover, recruiting private citizens to spy on and enforce policy over others is commandeering, which is also problematic.

State actors have opposed strong encryption, and they have sought to undermine cryptography by demanding government access to backdoors. Such misguided, technologically ignorant, and morally bankrupt motivations disqualify them from being taken seriously, when it comes to designing our future platforms and the policies that should be applied. Rights are natural (a.k.a. “God-given” or inalienable), and therefore they (including privacy) are not subject to anyone’s opinion regardless of their authority or stature. Cryptographic technology should disregard any influence such authorities want to exert, and design for maximum protection of confidentiality, integrity, and availability. Do not comply. Become ungovernable.

Composability

While the capabilities and qualities of the platform are important, we should also reconsider the paradigm for how we interact with applications. Web2 brought us social applications for human networking (messaging, connecting), media (news, video, music, podcasts), and knowledge (wikis). With anything social, group dynamics invariably also expose us to disharmony. Web2 concentrated power into a few Big Tech platforms; the acronym FAANG was coined to represent Facebook (now Meta), Amazon, Apple, Netflix, and Google (now Alphabet). With centralized control comes disagreement over how such power should be wielded as well as corruption and abuse of power. It also creates a system that is vulnerable to indirect aggression, where state actors can interfere or collude with private actors to side-step Constitutional protections that prohibit governments from certain behaviors.

David Sacks speaks with Bari Weiss about Big Tech’s assault on free speech and the hazard of financial technologies being used to deny service to individuals, as was done to the political opponents of Justin Trudeau in Canada in response to the freedom convoy protests.

Our lesson, after enduring years of rising tension in the social arena and culminating in outright tyranny, is that centralized control must disappear. Social interactions and all forms of transactions must be disintermediated (Big Tech must be removed as the middlemen). The article Mozilla unveils vision for web evolution shows Mozilla’s commitment to an improved experience from a browser perspective. However, we also need a broader vision from an application (hosted services) perspective.

The intent behind my thoughts on Future Distributed Applications and Browser based capabilities is composability. The article Ceramic’s Web3 Composability Resurrects Web 2.0 Mashups talks about how Web2 composability of components enabled mashups, and it talks about Web3 enabling composability of data. The focus is shifting from the ease of developing applications from reusable components to satisfying the growing needs of end users. Composability is how users with custody of their own data can collaborate among each other in a peer-to-peer manner to become social, replacing centralized services with disintermediated transactions among self-hosted services. The next best alternative to self-hosting is enabling users to choose between an unlimited supply of community-led hosted services that can be shared by like-minded mutually supportive users. The key is to disintermediate users from controlling entities run by people who hate them.

State of Technology

The article My First Web3 Webpage is a good introduction to Web3 technologies. This example illustrates some very basic elements, including name resolution, content storage and distribution, and the use of cryptocurrency to pay for resources. It is also revealing of how rudimentary this stuff is relative to the maturity of today’s Web apps. Web3 and distributed apps (dApps) are extremely green. Here is a more complicated example. Everyone is struggling to understand what Web3 is. Even search is something that needs to be rethought.

The article Why decentralization isn’t the ultimate goal of Web3 should give us pause. Moxie Marlinspike, Jack Dorsey, Marc Andreessen, and other industry veterans are warning us that the current crop of Web3 technologies is fraudulent and conflicted. Vitalik Buterin’s own views concede that the technology may not be going in the right direction. Ethereum’s deficiencies are becoming evident. This demands great caution and high suspicion.

Here is a great analysis of the critiques against today’s Web3 technologies. It is very clarifying. One important point is the ‘mountain man fantasy’ of self-hosting; no one wants to run their own servers. The cost and burden of hosting and operating services today is certainly prohibitive.

Even if the mountain man fantasy is an unrealistic expectation for the vast majority, so long as the threat of deplatforming and unpersoning is real, people will have a critical need for options to be available. When Big Tech censors and bans, when the mob mobilizes to ruin businesses and careers, when tyrannical governments freeze bank accounts and confiscate funds, it is essential for those targeted to have a safe haven that is unassailable. Someone living in the comfort of normal life doesn’t need a cabin in the woods, off-grid power, and a buried arsenal. But when you need to do it, living as a mountain man won’t be fantastic. Prepping for that fallback is what decentralization makes possible.

In the long term, self-hosting should be as easy, effortless, and affordable as installing desktop apps and mobile apps. We definitely need to innovate to make running our apps as cloud services cheap, one-click, and autonomous, before decentralization with self-hosting can become ubiquitous. Until then, our short-term goal should be to at least make decentralization practical, even if it is only accessible initially to highly motivated, technologically savvy early adopters. We need pioneers to blaze the trail in any new endeavor.

As I dive deeper into Web3, it is becoming clear that the technology choices lean toward the Ethereum blockchain to the exclusion of all else. Is Ethereum really the best blockchain on which to form a DAO? In Ethereum, application logic is expected to be written as smart contracts. Look at the programming languages available for smart contracts. Even without examining any of these languages, my immediate reaction is revulsion. Who would want to abandon popular general purpose programming languages and their enormous ecosystems? GTFO.

We need a general purpose Web architecture for dApps that are not confined to a niche. I imagine container images served by IPFS as a registry, and having a next-gen Kubernetes-like platform to orchestrate container execution across multicloud infrastructures and consuming other decentralized platform services (storage, load balancing, access control, auto-scaling, etc.). If the technology doesn’t provide a natural evolution for existing applications and libraries of software capabilities, there isn’t a path for broad adoption.

We are at the start of a new journey in redesigning the Web. There is so much more to understand and invent before we have something usable for developing real-world distributed apps on a decentralized platform. The technology may not exist yet to do so, despite the many claims to the contrary. This will certainly be more of a marathon than a sprint.

What do we want from Web3?

In the journey to developing Web3, we must understand what is motivating decentralization. We are attempting to reinvent the Web to address deficiencies that have placed individuals in jeopardy of censorship, cancellation, and political persecution at the hands of Big Tech platforms, state actors, and adversarial groups intent on harm. Historical ideals to preserve the “free and open Internet” have been abandoned. If a “free and open Internet” is to be preserved, it cannot rely on the honor and voluntary cooperation of humans. Technologies must become permissionless, trustless, and unassailable, so that dishonorable and uncooperative humans can coexist.

  1. Protecting a user’s right to free speech by having the user take custody of their own data, and ensuring that data cannot be made inaccessible.
  2. Protecting a user’s right to free association by ensuring that data in the user’s custody can be published to whatever audience the owner wishes to reach.
  3. Protecting an audience’s right to free association by ensuring access to data published by others, and ensuring that applications can compose that data for the intended use, including for social collaboration.
  4. Protecting a user’s access to platform capabilities for providing the application services that process that data.
  5. Protecting a user’s ability to transact business with others without being subject to third party intermediaries cancelling them.
  6. Protecting a user’s privacy by ensuring the user can share their data only with others who are granted authorization. In some circumstances, a user may want to remain anonymous, so that their real-world identity cannot be exposed for doxxing. Hostile detractors often try to cancel people by targeting their business, sources of income, reputation, relationships, sensitive information, even their personal safety.

Let’s keep these requirements in mind as we explore technologies that can help realize Web3 in restoring the ideal of a free and open Internet in the face of large factions of society who are hostile to (or wobbly on) freedom and openness.

Browser based capabilities

One approach to better empowering users and upstart services to avoid Big Tech censorship, suppression, and control is to build capabilities into the browser for mashing up and mixing in complementary services. This would provide a client-side (browser based) approach for third party complementary services to extend incumbent services without needing the incumbent’s authorization or cooperation. This would be one element of building Future Distributed Applications.

Using this approach, social media sites (Facebook, Instagram, Twitter, YouTube, Reddit, etc.) that enforce authoritarian content moderation policies can be complemented by alternative services, where prohibited users and comments can be linked. Users could see the conversation with content merged from every desired source beyond Big Tech control. This approach for distributing comments that form a single conversation would be applicable to many services.

  • Comment on content where a user’s comments would be suppressed.
  • Annotate or review an article where commenting is not enabled. Allow an annotation to link precisely to a specific range of text, so it can be presented inline.
  • Add links to relevant content not referenced by the original.

This paradigm would enable end users to control how content is consumed, so that Web sites cannot censor or bias what information is presented about controversial topics.

Applying browser add-ons that mix-in complementary services would also enable end users to take information and process it in personalized ways, such as for fact-checking, reputation, rating, gaining insights through analytics, and discovering related (or contrarian) information. Complementary content could be presented by injecting HTML, or by rendering additional layers, frames, tabs, or windows, as appropriate.

Browser add-ons are generally supported only on desktop browsers, not on mobile devices. Mobile devices would need to be supported for this paradigm to become broadly useful.

Microservices Life Cycle

There is friction between a microservices architecture and life cycle management goals for application releases. One significant motivation for microservices is independent life cycle management, so that capabilities with well-defined boundaries can be developed and operated by self-contained, self-directed teams. This allows for more efficient workflows, so that a fast-moving code base is not held back by other slower-moving code bases.

Typically, an application (a collection of services that form an integrated whole and are offered together as a product to users) is rolled out with major and minor releases on some cadence. Major releases include large feature enhancements and some degree of compatibility breakage, so these may happen on an annual or semi-annual basis. Minor releases or patches may happen quarterly, monthly, or even more frequently. With microservices, the expectation is that each service may release on its own schedule without coordination with all others even within the scope of an integrated application. A rapid release cadence is conducive to responsiveness for bug fixes and security fixes, which protect against exposing vulnerabilities to exploits.

One advantage of applications on the cloud is that a single release of software can be rolled out to all users in short order. This removes the substantial burden on developers to maintain multiple code branches, as they had to do in the past for on-premises deployments. Unfortunately, the burden is not entirely lifted, because as software under development graduates toward production use, various pre-release versions must be made available for pre-production staging, testing, and quality assurance.

Development is already complex: feature development toward a future release must proceed in parallel with bug fixes for the release that is already in production (assuming all users are on only the latest). These parallel streams of development will be in various phases of pre-production testing toward being released to production, and in various phases of integration testing against a longer-term future release schedule. Varying levels of severity for bugs mean that the urgency for fixes varies. For example, emergency fixes must be released as a patch to production immediately if they address security vulnerabilities that are exploitable, whereas fixes for functional defects may wait for the next release on the regular cadence. Cherry-picking and merging fixes across code branches is tedium that every developer dreads. Independent life cycle management of source code organized according to microservices is seen as helping to decouple coordination across development teams, which are organized according to microservice boundaries.

Independent life cycle management of services relies on both backward compatibility and forward compatibility. Integration between services needs to be tolerant of mismatched versions to be resilient to independent release timing, including upgrades, rollbacks due to failed upgrades, and rerunning an upgrade after a prior failure. Backward compatibility enables a new version of a service to interoperate with an older client. Forward compatibility enables the current version of a service (soon to be upgraded) to interoperate with a newer client, especially during the span of time (brief or lengthy) in which one may be upgraded before the other. In my article about system integration, I explained the numerous problems that make compatibility difficult to achieve. Verification of API compatibility through contract testing is the best practice, but test coverage is seldom perfect. Moreover, no contract language specifies everything that impacts compatibility. Mocking will never be representative of non-functional qualities. This is one of many reasons why confidence in verification cannot be achieved without a fully integrated system. This is how the desire for independent life cycles for microservices is thwarted. The struggle is more real than most people realize. As software professionals, we enter into every new project with fresh optimism that this time we will do things properly to achieve utopia (well, at least independent life cycle would be a small victory), and at each and every turn we are confronted by this one insurmountable obstacle.
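To make that tolerance concrete, here is a minimal sketch of the "tolerant reader" pattern in Python (the payload fields and defaults are hypothetical). A client that supplies defaults for fields an older producer omits, and ignores fields a newer producer adds, survives mismatched versions in both directions; this is exactly the behavior contract tests attempt, and often fail, to verify.

```python
from dataclasses import dataclass
from typing import Any, Mapping

@dataclass
class OrderSummary:
    order_id: str
    status: str
    channel: str = "unknown"  # introduced in a later API version; older producers omit it

def parse_order_summary(payload: Mapping[str, Any]) -> OrderSummary:
    """Tolerant reader: take only the fields we understand, default the ones
    an older producer does not send, and ignore newer fields we do not know."""
    return OrderSummary(
        order_id=str(payload["order_id"]),            # required in every version
        status=str(payload.get("status", "UNKNOWN")),
        channel=str(payload.get("channel", "unknown")),
    )

# An older producer (no "channel") and a newer producer (extra "promo_code")
# both parse without breaking this consumer.
old = parse_order_summary({"order_id": "42", "status": "FIRM"})
new = parse_order_summary({"order_id": "43", "status": "FIRM",
                           "channel": "web", "promo_code": "SPRING"})
```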

Application features involve workflows that span two or more collaborating microservices. For example, a design-time component provides the product modeling for a runtime component for selling and ordering those products. Selling and ordering cannot function without the product model, so the collaboration between those services must integrate properly for features to work. Most features rely on collaborations involving several services. Often, the work resulting from one service is needed to drive the processing in other services, as was the case in the selling and ordering example above. This pattern is repeated broadly in most applications. Once all collaborations are accounted for across the supported use cases, the integrations across services would naturally cover every service. The desire for an independent life cycle for each service that composes the application faces the interoperability challenges across this entire scope. There goes our independence.

Given the need to certify that a snapshot of all services that compose an application works properly together, we need a mechanism to correlate the versioning of source code to versions of binaries (container images) for deployment. Source code can be tagged with a release. This includes Helm charts, Kubernetes YAML files, Ansible playbooks, and whatever other artifacts support the control plane and operations pipelines for the application. A snapshot must be taken of the Helm chart versions and their corresponding container image versions, so that the complete deployment can be reproduced.

This identifies an application release as a set of releases of services deployed together. This information aids in troubleshooting, bug reporting, and reproducing a build of those container images and artifacts from source code, each at the same version as what was released for deployment. This is software release management 101, nothing out of the ordinary. What is noteworthy is our inability to extricate ourselves from the monolithic approach to life cycle management despite adopting a modern microservices architecture.
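As a minimal sketch of what such a snapshot might look like (the service names, versions, and file layout are hypothetical; a real pipeline would gather them from Helm and the container registry), an application release manifest can be captured and stored alongside the tagged source:

```python
import json
from datetime import datetime, timezone

# Hypothetical snapshot for application release 2.3.0. In practice this
# would be gathered from Helm (chart versions) and the registry (image
# digests) by the release pipeline, not written by hand.
application_release = {
    "application": "storefront",
    "release": "2.3.0",
    "captured_at": datetime.now(timezone.utc).isoformat(),
    "services": [
        {"name": "catalog", "chart": "catalog-1.8.2",
         "image": "registry.example.com/catalog@sha256:abc123"},
        {"name": "ordering", "chart": "ordering-3.1.0",
         "image": "registry.example.com/ordering@sha256:def456"},
    ],
}

# Persisting the manifest alongside the tagged source makes the exact
# line-up of services reproducible for troubleshooting and bug reports.
with open("release-2.3.0.json", "w") as f:
    json.dump(application_release, f, indent=2)
```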

Worse still, if our application is integrated into a suite of applications, as enterprise applications tend to be, the system integration nightmare broadens the scope to the entire suite. The desire for an independent life cycle even for each application that composes the suite faces interoperability challenges across this entire scope. What a debacle this has turned out to be. The system integration nightmare is the challenge that modern software engineering continues to fail at solving across the industry.

DevOps Transparency and Coordination

Coordinating human activities across organizations and disciplines is fundamental to DevOps. This requires having documented procedures to handle any situation, tools that enable participants to collaborate effectively, and a shared understanding of what information needs to be captured and communicated to get the work done, especially when the actors are likely to be separated by space and time.

A DevOps procedure is initiated for a reason. These situations include scheduled maintenance, a response to an alert (detection of a condition that deserves attention), or a response to a request for support. In each case, there should be a ‘ticket’ (a record in a tool) that notifies a responsible DevOps engineer to work the issue. The procedure that applies to a ticket should be obvious, ideally referenced explicitly.

Responding to an alert or a support request (usually a complaint about a service malfunction) typically begins with confirming the reported condition. This requires gathering information about the context and collecting diagnostics to aid in troubleshooting. Ideally, the ticket clearly identifies the problem; otherwise, further interactions are needed to gain that clarity. Humans are routinely terrible about assuming the recipient of a request has all the necessary context to understand what is being asked of them and why. To mitigate this inefficiency, tooling and procedural documentation are usually provided to guide how tickets should be written, so that many questions and answers are not needed afterward to satisfy the request.

The engineer who works a ticket should capture the service configuration, relevant logs to determine the failure mode, and any other data for the context associated with the problem for analysis toward an operational fix or for submitting a bug, if applicable.

A designated channel should be used for engineers to collaborate. Each engineer must record in real time the actions taken to troubleshoot, analyze, and correct the problem. This helps multiple individuals in different roles across organizations work together in a coordinated way. Good communication enables everyone involved to stay informed and avoid taking actions that interfere with each other. Moreover, an accurate record can be reviewed later as a post-mortem.

Severe incidents, such as those that cause a service outage, demand a root cause analysis toward preventative actions, such as process improvements, procedure documentation, operational tooling, or developing a permanent fix (for a software bug). This depends on the ticket capturing the necessary information to trace how the problem originated, such as the transaction processing in progress, events or metrics indicating resource utilization out of bounds (e.g., out of memory, insufficient cpu, I/O bandwidth limited, storage exhaustion) or performance impairment (e.g., lock contention, queue length, latency), or anything else that appears out of the ordinary.

One of the biggest impediments to verifying that a problem is resolved is that DevOps normally does not have access to the service features being exercised by end users. When a functional problem is reported by users, it may not be possible for DevOps to confirm that the problem is fixed from the user’s perspective. Communication with users may need to be mediated by customer support staff. The information on the ticket would need to facilitate this interaction.

Accurate record-keeping also enables later audits in case there are subsequent problems. The record of actions taken can be used to analyze whether these actions (such as configuration changes or software patches) are contributing to other problems. Troubleshooting procedures should include data mining of past incidents (have we seen this problem before? how did we fix it previously?) and auditing what procedures may have impacted the service under investigation (what could have broken this?).

The above guidance can be summarized as follows.

  • Say what you do
  • Do what you say
  • Write it down

Future Distributed Applications

Big Tech censorship and cancel culture are becoming intolerable. Politicization of business is destroying the fabric of society. Corporate oligarchs are implementing partisan agendas to shape public discourse by applying so-called “community standards” for social media content moderation. They de-platform personalities who express opinions that run counter to approved narratives. They silence dissent. Free speech and freedom of association are under threat, as private companies are coerced by state regulatory action, looming threats of state intervention, and mob rule through heckler’s veto, bullying, harassment, doxxing, and cancel culture. Concentration of power and control in a few dominant platforms, such as Google, YouTube, Facebook, Twitter, Wikipedia, and their peers has harmed consumer choice. Anti-competitive behavior, such as collusion among platform and infrastructure services to deny service to competitive upstarts and undesirable non-conformists, has suppressed alternatives like Parler, Gab, and BitChute.

The current generation of dominant platforms does not allow editorial control to be retained by content creators. The platform is viewed as the ultimate authority, and users are limited in their ability to assert control to form self-moderated communities and to set their own community standards. Control is asserted by the central platform authorities.

Control needs to be decoupled from centralized platform authorities and put back in the hands of content creators (authors, podcasters, video makers) and end users (content consumers and social participants). Editorial control over legal content does not belong with Big Tech. What constitutes legal content is dependent on the user’s jurisdiction, not Big Tech’s harmonization of globalist attitudes. To Americans, hate speech is protected speech, and it needs to be freely expressible. Similarly, users in other jurisdictions should be governed according to their own standards.

We need to develop apps with peer-to-peer protocols and end-to-end encryption to cut out the middlemen, which will exterminate today’s generation of social media companies. Better yet, application logic itself should be deployable on user-controlled compute with user-controlled encrypted storage on any choice of infrastructure providers (providing a real impetus for the adoption of Edge Computing), so that centralized technology monopolies cannot dominate as they do today. This approach needs to be applied to decentralize all apps, including video, audio podcasts, music, messaging, news, and other content distribution.

I believe the next frontier for the Internet will be the development of a generalized approach on top of HTTP or as an adjunct to HTTP (like BitTorrent) to enable distributed apps that put app logic and data storage at end-points controlled by users. This would eliminate control by middlemen over what content can be created and shared.

Applications must be distributed in a topology where a node is dedicated to each user, so that the user maintains control over the processing and data storage associated with their own content. Applications must be portable across cloud infrastructures available from multiple providers. A user should be able to deploy an application node on any choice of infrastructure provider. This would enable users to be immune from being de-platformed.

With an application whose logic and data are distributed in topology and administrative control, the content should be digitally signed so that it can be authenticated (verified to be produced by the user who owns it). This is necessary, so that a user’s application node can be moved to an alternative infrastructure (compute and storage) without other application nodes needing to establish any form of trust. Consumers (the audience with whom the owner shares content) and processors (other computational services that may operate on the content) of the information would be able to verify that the information is authentic, not forged or tampered with. The relationship between users and among application nodes, as well as processors, is based on zero trust.
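As a minimal sketch of owner-signed content (using Ed25519 from the Python `cryptography` package as an assumed choice; the actual dApp protocol and key distribution are not specified here), any consumer or processor holding the owner's public key can verify the content without trusting whichever host currently serves it:

```python
# pip install cryptography  (assumed dependency for this sketch)
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# The content owner holds the private key; only the public key is shared.
owner_key = Ed25519PrivateKey.generate()
owner_public = owner_key.public_key()

content = b'{"author": "alice", "body": "my post", "published": "2022-03-01"}'
signature = owner_key.sign(content)

# A consumer or processor verifies authenticity regardless of which
# infrastructure currently hosts the content (zero trust in the host).
try:
    owner_public.verify(signature, content)
    print("content is authentic")
except InvalidSignature:
    print("content was forged or tampered with")
```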

Processing of information often involves mirroring and syndication. Mirroring with locality for low latency access gives certain types of transaction processing, such as search indexing, the performance characteristics they need. Authorizing a search engine to index one’s content does not automatically grant users of the search engine access to the content. Perhaps only an excerpt is presented by the search engine along with the owner’s address, where the user may request access. A standard protocol is needed to enable this negotiation to be efficient and automated, if the content owner chooses to forego human review and approval.

We need to change how social applications control the relationship between content producers and content consumers. First, for original source content, the root of a new discussion thread, the owner must control how broadly it is published. Second, consumers of content must control what sources of information they consume and how it is presented. Equally important, consumers of an article become producers of reviews and comments when interacting in a social network. The same principle must apply universally to the follow-on interactions, so that the article’s author should not be able to block haters from commenting, but the author is not obligated to read them. Similarly, readers are not obligated to see hateful commenters whom they want to exclude from their network. The intent should be to enable each person to control their own content and experience, ceding no control to others.

Social applications need self-managed communities with member administered access control and content moderation. Community membership tends to be fluid with subgroups merging and splitting regularly. Each member’s access and content should follow their own memberships rather than being administered by others in those communities. The intent is to mitigate a blacklisted individual being cancelled by mobs. If a cancelled individual can form their own community and move their allies there with ease, cancel culture becomes powerless as a tool of suppression with global reach. Its reach is limited to communities that quarantine themselves.

This notion of social network or community is decentralized. A social application may support a registry of members, which would serve as a superset of potential relationships for content distribution. This would enable a new member to join a social network and request access to their content. Presumably, most members would enable automatic authorization of new members to see their content, if the new member has not been blocked previously. That is, enable a community to default to public square with open participation. However, honor freedom of association, so that no one is forced to interact with those with whom there is no desire to associate, and no one can be banned from forming their own mutually agreed relationships.

We need software innovations to address this urgent need to counter the censors, the cancellers, the de-platformers, the prohibitionists, the silencers of dissent, and the government oppressors. We don’t yet have a good understanding of the requirements which I’ve touched upon above, as I have only scratched the surface. We need an architecture to enable the unstoppable Open Internet that we failed to preserve from the early days. We need to develop a platform that realizes this vision to restore a healthy social fabric for our online communities.

system integration

There is something terribly wrong with software development in the enterprise application space. No one is able to release working software without coordinating across all product development teams to align the version of every product in the universe, because end-to-end workflows can’t be made to work as products are released on independent life cycles.

I believe we are missing architectural design principles. We talk about forward and backward compatibility of APIs, but I’m not sure the industry deeply understands what that entails. The problem goes beyond teams within an organization; the software industry as a whole has not established a rigorous, shared understanding of what compatibility requires.

The issue lies in how the base application (e.g., product catalog, store front, sales automation, care, order fulfillment, customer and subscription management, charging, billing, revenue management) is horizontal (generic) and hollow, expecting after-market extensibility to provide the vertical behavior that is specialized for the industry and the enterprise’s business model. The intent of the application vendor is to provide a general purpose platform that can be tailored after-market to the peculiarities of any enterprise. The application will implement an API defined by industry standards (say, tmforum.org for the communications industry) that reflects this general purpose hollowness. The application doesn’t have any real substance until it is customized to model the business. For example, a product catalog would not come populated with 5G mobile product specifications that are branded and priced according to a 5G service provider’s business model.

When entities are extended with data that carry hidden meaning, implied behavior, constraints, and statefulness (life cycle, workflow), these extensions contribute to the API in ways that were not defined by the original specification. Each new element introduces some degree of incompatibility. Industry standards can never specify in a precise and rigorous manner things they did not foresee.

Stateful behavior is especially troublesome to specify in a manner that ensures compatibility. This includes conversational state and persistent state. Conversational state is where linked information is implicitly kept across multiple requests involved in the same session. A cursor for iterating through a collection of query results is an example of conversational state. Persistent state is durable across transactions, having memory that spans the life of a transaction, a session, a process, and even the life of a compute instance. When methods can only act against objects in certain states, but not others, this constraint must be honored for compatibility across collaborating components.

Objects and attributes are allowed to take on certain values at various points in their life cycles, and transactional behavior and workflow (the steps performed by business processes) are conditional upon the state of these objects. For example, when equipment is installed, it may be in various states of readiness for production use, but when not installed, the equipment’s operational characteristics and configuration are irrelevant. Every component with access to that object must understand these semantics and enforce them consistently; otherwise, there is no compatibility. Unfortunately, even these very simple conditional constraints and the ones in the previous paragraph are beyond the capability of today’s prevailing interface specification languages and entity modeling frameworks.

Immutability is often conditional on the life cycle state of an entity. For example, an order can be edited during information capture, but its captured intent cannot be edited after the order is firm and in the process of being fulfilled. Again, this constraint cannot be specified in a manner that ensures compatibility across collaborating components.
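A small sketch (a hypothetical order entity in Python) shows the kind of constraint in question: trivial to enforce in code, yet invisible to an interface specification that only lists fields and operations.

```python
class Order:
    """An order may be edited only while it is being captured; once it is
    firm (submitted for fulfillment), its captured intent is immutable.
    This rule is part of the API's real contract, but a schema that merely
    lists fields and operations cannot express it."""

    def __init__(self) -> None:
        self.state = "CAPTURING"
        self.lines = []

    def add_line(self, item: str) -> None:
        if self.state != "CAPTURING":
            raise ValueError("order is firm; captured intent cannot be edited")
        self.lines.append(item)

    def submit(self) -> None:
        self.state = "FIRM"

order = Order()
order.add_line("5G unlimited plan")
order.submit()
# order.add_line("add-on")  # would raise: the constraint is behavioral, not structural
```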

Methods have failure modes, usually specified as failure responses, error codes, or exceptions. Some kinds of failures are recoverable using techniques like retrying, while others are non-recoverable. This too is usually not expressible for compatibility.

Methods have performance expectations in terms of latency, concurrency, and transaction volume. Methods have resource consumption expectations in terms of memory, cpu, storage, network, and I/O. Methods that involve data sets have expectations about how much data can be passed with corresponding performance and scalability characteristics. This too is usually not expressible for compatibility.

Objects and their attributes are often persistent on durable storage. Subsets of attributes may be persistent, while others are volatile or derived (computed based on the value of other attributes, such as a rolled-up status or a count of a collection). This too is usually not expressible for compatibility.

Methods must trade off consistency, availability, and partition tolerance. The expectation of which trade-offs should be chosen is usually not expressible for compatibility.

Methods expect the caller to be authenticated and they are expected to enforce access control to verify that the caller is authorized. Moreover, the method is expected to enforce data permissions and data privacy. This too is usually not expressible for compatibility.

The list of requirements and constraints that contribute to compatibility goes on. The above is a sampling to give the reader a sense of the problem, not to be comprehensive. The intent is to show how formal specifications are grossly insufficient to ensure a high degree of compatibility across heterogeneous suppliers and independently developed implementations.

Because API compatibility is so unreliable based on specifications and contract testing, the promise of a microservice architecture (within an application) or a service-oriented architecture (for integrating applications across the enterprise) cannot be achieved naively. System integration continues to be plagued by a waterfall model of requiring a complete line-up of application versions to be tested end-to-end, before we have any confidence that they work together. The benefits of agile development and independent life cycles are not achievable, because the pre-requisite compatibility guarantees cannot be met. System integration of enterprise applications remains in the stone age because of this crippling deficiency.

Pain Feedback Loops

Feedback loops are very important to regulate behavior within an enterprise. This applies to both rewarding positive behavior, and encouraging more of it, as well as correcting negative behavior to get less of it. Continuous improvement is about feedback loops.

Focusing on negative feedback, we should recognize a phenomenon called ‘pain’. In this context, it refers mostly to pains in the ass, which are discomforts, inconveniences, and frustrations that burden people’s lives, draining their time and energy in unproductive ways. In DevOps, when high severity operational problems arise, such as a service outage in the middle of the night, pain manifests in a pager alert that wakes up an engineer to troubleshoot the incident and resolve the problem.

When fires need to be fought, fire-fighters experience this pain in proportion to the number of fires and their severity. Development teams, aiming to deliver features with faster time to market, inevitably cut corners in areas that would make operations more efficient, because they tend not to be placed in the position of experiencing the pain when it comes. Disconnecting development priorities from operational responsibilities is a recipe for the infliction of pain on those who do not deserve it, and the result is an excess of unexpected pain that should have been foreseen and mitigated. The integration of development with operations into DevOps is intended to establish this connection. This connection must not be undermined by paying mere lip-service to operations without putting real skin in the game; development staff must experience the pain of operational failures as much as operations staff do.

DevOps Mentality

Developers who are new to operations, as they become immersed in DevOps culture, may envision that their involvement in operations follows after development is done. However, operations are not an afterthought. This article enumerates some operations-related impacts on development practices that may not be at the forefront of a developer’s mind, but should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

If logging an ERROR level message, it should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, it should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. Accounting for the volume of messages collected per service instance per day multiplied by one year retention and the number of service instances deployed, that may be hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by classifying messages at a finer-grained log level (e.g., INFO, DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the most significant properties to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem when troubleshooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.
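Here is a minimal sketch of these two practices together, using Python's standard logging module (the identifier scheme ORD-0042 and the field names are hypothetical): monitoring matches on an invariant message identifier rather than free-form text, while context rides along as parameters and sensitive values are left out.

```python
import logging

logger = logging.getLogger("ordering")
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")

def log_fulfillment_failure(order_id: str, account_id: str, reason: str) -> None:
    # ORD-0042 is an invariant message identifier (hypothetical scheme) that
    # alerting rules can match on, instead of matching free-form text that
    # may be reworded in a later release.
    logger.error(
        "ORD-0042 order fulfillment failed order_id=%s account_id=%s reason=%s",
        order_id, account_id, reason,
    )
    # Note: no credentials or personally identifiable data are included;
    # identifiers give enough context to isolate and reproduce the problem.

log_fulfillment_failure("42", "acct-7", "inventory service unavailable")
```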

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fallback mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.
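A minimal sketch of that retry discipline in Python (the `fetch` callable, message identifier, and retry budget are hypothetical): attempts are counted silently, and a single summary error is logged only after the budget is exhausted.

```python
import logging
import time

logger = logging.getLogger("client")

def call_with_retries(fetch, attempts: int = 5, delay_seconds: float = 2.0):
    """Retry a transient operation without logging each failed attempt.
    Only the final outcome is logged: silence on success, one summary
    error after the retry budget is exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:        # transient failure; count it quietly
            last_error = exc
            time.sleep(delay_seconds)
    logger.error(
        "CLT-0007 operation failed after %d attempts: %s", attempts, last_error
    )
    raise last_error
```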

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.

Pay special attention to methods and transactions that constitute security events.

  1. A redirect to the login page when a request attempts to access a user interface
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged with a format that allows such messages to be classified, so that they can be directed to a SIEM (security information and event management) system for special handling. The SIEM is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.
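As a sketch of how such classification might look with Python's standard logging (the `security` logger name and event codes are hypothetical), a dedicated logger or a structured event-class field gives the log pipeline something reliable to route to the SIEM:

```python
import logging

# A dedicated logger (or a structured "event_class" field) lets the log
# pipeline route security events to the SIEM separately from ordinary
# operational logs.
security_log = logging.getLogger("security")

def audit_login_failure(username: str, source_ip: str) -> None:
    security_log.warning(
        "SEC-0002 failed login username=%s source_ip=%s", username, source_ip
    )

def audit_privileged_action(username: str, action: str) -> None:
    security_log.info(
        "SEC-0005 privileged action username=%s action=%s", username, action
    )
```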

Audiences

One of the most common mistakes is to conflate the log stream directed toward operations with information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency into the steps that are executing. This includes integrating other services to collaborate, or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended not to differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended as feedback to end users, are directed to operational logs. Operations has no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to the noise.

Developers must pay attention to the intended audience of messages, so that they are directed to operational logs, to end users, or to both.

Telemetry

“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that these outages go toward calculating the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for service availability
  • Latency – the time between receiving a request and sending its corresponding response tells us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tells us how well the system is performing with regard to workloads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future
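As an illustration of producing these metrics, here is a minimal sketch using the Prometheus Python client (an assumed choice of library; the metric names are hypothetical). A scraper collects the exposed metrics into the time series database used for the monitoring and alerting described next.

```python
# pip install prometheus_client  (assumed dependency for this sketch)
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "End-user requests served")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently being processed")

@LATENCY.time()                  # records the duration of each call
@IN_FLIGHT.track_inprogress()    # tracks requests currently in flight
def handle_request() -> None:
    REQUESTS.inc()
    time.sleep(0.05)             # stand-in for real application work

if __name__ == "__main__":
    start_http_server(9100)      # exposes /metrics for the scraper to collect
    while True:
        handle_request()
```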

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (e.g., data purging), to maintain reasonable performance (e.g., recalculating statistics), and to enforce policies (e.g., secrets rotation), as sketched below
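As one small illustration of the self-maintaining bullet, here is a sketch of a data-purging housekeeping task (Python with SQLite; the table, column, and retention period are hypothetical). In practice it would run unattended on a schedule, for example from a Kubernetes CronJob, rather than by human intervention.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)   # hypothetical retention policy

def purge_old_events(db_path: str = "events.db") -> int:
    """Delete events older than the retention period to prevent storage
    exhaustion; intended to run unattended on a schedule."""
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
        return cur.rowcount       # rows purged, useful for a summary log entry
```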

Pipelines

Operations involve actions initiated through human access (by operations staff) and systems integration (e.g., capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available operationally as pipelines (programmatic workflows). Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

Provisioning – creation and termination of the service subscription

Upgrades and patches – deploying software updates and bug fixes

Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization

Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution

Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)

Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)

Corrective actions – interventions for problem resolution, such as stopping and restarting of services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (prevent security breaches), replacing defective resources (e.g., kernel deadlocks)

See also:

  1. The Twelve-Factor App
  2. 10 key attributes of cloud native applications

Scaling operations across tenants in the cloud

Dear Santa,

Currently, when using the tenant-per-namespace deployment model, operational management procedures are difficult to scale to many tenants, because typical actions like patching, upgrading, stopping, starting, etc. must be initiated as pipeline jobs, once per tenant, and watched for successful execution per job. This is labor intensive, error-prone (having to re-input the same input parameters per pipeline job), and tedious to manage. Therefore, it is not scalable in its current form.

To enable this model to scale, tooling is required to enable a single specification of intent to serve as input into an automated workflow that performs the required action across every applicable namespace (tenant). The intended action may be as simple as `kubectl patch` or it may be a very complex job (upgrade all resources). The workflow would coordinate the parallel execution of these actions against their respective namespaces (identified either by label or a list of names), possibly throttling for limited concurrency to avoid resource contention, and reporting output for status monitoring and troubleshooting. This would reduce the operational cost and complexity of deploying patches and upgrades from O(n) to approximately O(1) for n tenants.
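A minimal sketch of such tooling (Python fanning out `kubectl` commands; the label selector, deployment name, patch payload, and concurrency limit are all hypothetical): a single specification of intent is applied across every matching tenant namespace with bounded parallelism, and per-namespace status is reported for monitoring and troubleshooting.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

PATCH = json.dumps({"spec": {"replicas": 3}})   # hypothetical intent

def tenant_namespaces(selector: str = "app=storefront,tier=tenant") -> list:
    # Discover all tenant namespaces by label rather than listing them by hand.
    out = subprocess.run(
        ["kubectl", "get", "namespaces", "-l", selector,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()

def patch_tenant(namespace: str):
    # Apply the same intent to one tenant; success is judged by the exit code.
    result = subprocess.run(
        ["kubectl", "-n", namespace, "patch", "deployment", "storefront",
         "--type=merge", "-p", PATCH],
        capture_output=True, text=True,
    )
    return namespace, result.returncode == 0

if __name__ == "__main__":
    # Throttle to a few namespaces at a time to avoid resource contention.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for ns, ok in pool.map(patch_tenant, tenant_namespaces()):
            print(f"{ns}: {'patched' if ok else 'FAILED'}")
```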