System Integration

There is something terribly wrong with software development in the enterprise application space. No one is able to release working software without coordinating across all product development teams to align the version of every product in the universe, because end-to-end workflows can’t be made to work as products are released on independent life cycles.

I believe we are missing architectural design principles. We talk about forward and backward compatibility of APIs, but I’m not sure the industry deeply understands what that entails. The problem goes beyond teams within an organization, because the software industry doesn’t even understand what compatibility entails.

The issue lies in how the base application (e.g., product catalog, store front, sales automation, care, order fulfillment, customer and subscription management, charging, billing, revenue management) is horizontal (generic) and hollow, expecting after-market extensibility to provide the vertical behavior that is specialized for the industry and the enterprise’s business model. The intent of the application vendor is to provide a general purpose platform that can be tailored after-market to the peculiarities of any enterprise. The application will implement an API defined by industry standards (say, for the communications industry) that reflects this general purpose hollowness. The application doesn’t have any real substance until it is customized to model the business. For example, a product catalog would not come populated with 5G mobile product specifications that are branded and priced according to a 5G service provider’s business model.

When extending entities with data that have hidden meaning, implied behavior, constraints, and statefulness (life cycle, workflow), these contribute to the API in ways that were not defined by the original specification. Each new element introduces some degree of incompatibility. Industry standards can never specify in a precise and rigorous manner things they did not foresee.

Stateful behavior is especially troublesome to specify in a manner that ensures compatibility. This includes conversational state and persistent state. Conversational state is where linked information is implicitly kept across multiple requests involved in the same session. A cursor for iterating through a collection of query results is an example of conversational state. Persistent state is durable across transactions, having memory that spans the life of a transaction, a session, a process, and even the life of a compute instance. When methods can only act against objects in certain states, but not others, this constraint must be honored for compatibility across collaborating components.

Objects and attributes are allowed to take on certain values at various points in their life cycles, and transactional behavior and workflow (the steps performed by business processes) are conditional upon the state of these objects. For example, when equipment is installed, it may be in various states of readiness for production use, but when not installed the equipment’s operational characteristics and configuration are irrelevant. Every component with access to that object must understand these semantics and enforce them consistently, otherwise there is no compatibility. Unfortunately, even these very simple conditional constraints and the ones in the previous paragraph are beyond the capability of today’s prevailing interface specification languages and entity modeling frameworks.

Immutability is often conditional on the life cycle state of an entity. For example, an order can be edited during information capture, but its captured intent cannot be edited after the order is firm and in the process of being fulfilled. Again, this constraint cannot be specified in a manner that ensures compatibility across collaborating components.
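To make the order example concrete, here is a minimal sketch in Python. The class, state, and exception names are hypothetical and not drawn from any standard. The point is that the rule "items are editable only during capture" lives inside the implementation, where a caller reading only the method signature cannot see it.

```python
from dataclasses import dataclass, field
from enum import Enum


class OrderState(Enum):
    CAPTURING = "capturing"    # information is still being gathered
    FIRM = "firm"              # intent is captured and may no longer change
    FULFILLING = "fulfilling"  # fulfillment is in progress


class OrderFrozenError(Exception):
    """Raised when an edit is attempted after the order is firm."""


@dataclass
class Order:
    items: dict = field(default_factory=dict)
    state: OrderState = OrderState.CAPTURING

    def set_item(self, sku: str, quantity: int) -> None:
        # The life cycle constraint is enforced here, in code. Nothing in
        # the signature of set_item() tells a collaborating component
        # that the call is only valid in the CAPTURING state.
        if self.state is not OrderState.CAPTURING:
            raise OrderFrozenError(
                f"order is {self.state.value}; items are immutable"
            )
        self.items[sku] = quantity

    def confirm(self) -> None:
        self.state = OrderState.FIRM


order = Order()
order.set_item("sku-1", 2)
order.confirm()
try:
    order.set_item("sku-1", 3)
except OrderFrozenError as exc:
    print(exc)  # order is firm; items are immutable
```

Two independently developed components would each have to rediscover and re-implement this check to remain compatible, which is exactly the gap described above.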

Methods have failure modes, usually specified as failure responses, error codes, or exceptions. Some kinds of failures are recoverable using techniques like retrying, while others are non-recoverable. This too is usually not expressible for compatibility.

Methods have performance expectations in terms of latency, concurrency, and transaction volume. Methods have resource consumption expectations in terms of memory, cpu, storage, network, and I/O. Methods that involve data sets have expectations about how much data can be passed with corresponding performance and scalability characteristics. This too is usually not expressible for compatibility.

Objects and their attributes are often persistent on durable storage. Subsets of attributes may be persistent, while others are volatile or derived (computed based on the value of other attributes, such as a rolled-up status or a count of a collection). This too is usually not expressible for compatibility.

Methods must trade off consistency, availability, and partition tolerance. The expectation of what trade-offs should be chosen is usually not expressible for compatibility.

Methods expect the caller to be authenticated and they are expected to enforce access control to verify that the caller is authorized. Moreover, the method is expected to enforce data permissions and data privacy. This too is usually not expressible for compatibility.

The list of requirements and constraints that contribute to compatibility goes on. The above is a sampling to give the reader a sense of the problem, not to be comprehensive. The intent is to show how formal specifications are grossly insufficient to ensure a high degree of compatibility across heterogeneous suppliers and independently developed implementations.

Because API compatibility is so unreliable based on specifications and contract testing, the promise of a microservice architecture (within an application) or a service-oriented architecture (for integrating applications across the enterprise) cannot be achieved naively. System integration continues to be plagued by a waterfall model of requiring a complete line-up of application versions to be tested end-to-end, before we have any confidence that they work together. The benefits of agile development and independent life cycles are not achievable, because the pre-requisite compatibility guarantees cannot be met. System integration of enterprise applications remains in the stone age because of this crippling deficiency.

Wealth versus Quality of Life

Conflating “wealth” with “quality of life”, in criticism of wealth inequality, is a fatal error. It is important to recognize that wealth in the form of capital (savings that are re-invested into factors of production toward increasing capacity for supplying goods and services into the future) speaks to supply-side capacity. The abundance created by this productive capacity is what provides for quality of life. On the demand side, quality of life comes from consumers with incomes that have purchasing power to acquire those goods and services. The greater the abundance of supply, the greater the purchasing power that consumers can wield (as expenses on the income statement or outflows on the cash flow statement) WITHOUT wealth (assets and equity on the balance sheet) playing any role for consumers. The role of wealth is to associate ownership for management responsibility over factors of production to create and maintain supply. The role of income is to have purchasing power to enable quality of life for consumers. Savings (retained earnings that are re-invested) is how consumers cross over to participate in wealth toward the management of supply.

Pain Feedback Loops

Feedback loops are very important to regulate behavior within an enterprise. This applies to both rewarding positive behavior, and encouraging more of it, as well as correcting negative behavior to get less of it. Continuous improvement is about feedback loops.

Focusing on negative feedback, we should recognize a phenomenon called ‘pain’. In this context, it refers mostly to pains in the ass, which are discomforts, inconveniences, and frustrations that burden people’s lives, draining their time and energy in unproductive ways. In DevOps, when high severity operational problems arise, such as a service outage in the middle of the night, pain manifests in a pager alert that wakes up an engineer to troubleshoot the incident and resolve the problem.

When fires need to be fought, fire-fighters experience this pain in proportion to the number of fires and their severity. Development teams tend to avoid work that does not serve the goal of delivering features with faster time to market. They inevitably cut corners in areas that make operations more efficient, because they tend not to be placed in the position of experiencing the pain, when it comes. Disconnecting development priorities from operational responsibilities is a recipe for the infliction of pain on those who do not deserve it, and the result is an excess of unexpected pain that should have been foreseen and mitigated. The integration of development with operations into DevOps is intended to establish this connection. This connection must not be undermined by paying mere lip-service to operations without putting real skin in the game, so that development staff experience pain for operational failures as much as operations staff.

Social Media Bias

Tim Pool did a reasonable job of enumerating several areas of concern.

  1. Applying a single global standard (“community standards”) to American citizens imposes “hate speech” regulations that are antithetical to American principles of free speech protected by the US Constitution. [Similar to concerns I raise:  and]
  2. Twitter claims to hold no politically biased agenda by intention, while trying to implement narrow goals around maximizing inclusion of users to conduct speech by protecting their physical safety (i.e., by disallowing targeted harassment and doxxing), but they have adopted ideologically biased policies that are selectively enforced in a manner that predominantly punishes conservatives.
  3. The near-monopoly status of the social media giants within their own niches, combined with unilateral decision-making that appears to most people to be politically biased in one direction, will lead politicians, who are ignorant of the actual issues and incompetent to formulate good solutions, to take a sledgehammer approach to regulation, and consequently make the situation much worse.
  4. The coordinated efforts among corporations to punish individuals across social media, hosting, Internet infrastructure, and payment processing systems demonstrate a terrifying abuse of power, because they can implement a social credit system that can unperson people in an extra-judicial manner without due process of law or avenues of redemption.

Black hole jets

In my hubris, I sometimes write emails to established scientists with my stupid ideas. I have a knack for formulating what I believe to be good questions. Here is what I sent to Netta Engelhardt at MIT this morning.

I appreciate the videos you’ve done on YouTube on black holes.

Some rhetorical questions come to mind. I don’t expect an answer. My intention is to ask them, in case they help stimulate some curiosity toward maybe forming a useful idea.

  1. How much of the mass that is falling into a black hole adds to the mass of the black hole, versus being ejected, say through its jets?
  2. Can we think of the jets as carrying information away from the black hole, given that the BH is accelerating the outbound particles substantially, thereby transferring energy from it?
  3. Wouldn’t (2) then be consistent with a model whereby all information about the BH is thought to be encoded on its boundary, for accreted matter to be seen as sticking to the boundary as it falls in, and over time that same information migrating to the poles of the BH and ejected through its jets?

It just seems to me, as a layman, that black hole jets are such a prominent feature, but I haven’t seen much talk about what mechanisms generate these jets, and what are their relationships to the flow of energy and information into and out of the black hole.


DevOps Mentality

Developers, who are new to Operations, as they become immersed in DevOps culture, may envision that their involvement in operations follows after development is done. However, operations are not an after-thought. This article is to enumerate some operations-related impacts to development practices that may not be at the forefront of a developer’s mind, but they should be. Design for operations.

Logging to enable monitoring, alerting, troubleshooting, and problem resolution

When coding and testing, pay attention to how helpful log messages are to troubleshooting problems. Do messages contain enough context to assist in identifying the root cause and corrective actions? Are messages at appropriate severity levels? One of the biggest impediments to monitoring and troubleshooting is excessive, imprecise, and unnecessary logging, otherwise known as noise.

If logging an ERROR level message, it should represent an operational problem that can be monitored. Each error should have a corresponding corrective action, if one is necessary. If a failure is transient and correctable by retrying, it should not be logged as an error until all attempts have been exhausted without success; repeated messages are unhelpful. Error level messages must be documented in the monitoring, troubleshooting, and problem resolution procedures. Error messages that are functional without any operational significance (not correctable through operational procedures) should be marked as such, so that they are not monitored for intervention. The knowledge base should document every foreseeable failure mode and its corrective actions.

Every log message incurs cost.

  • Computational cost to produce the message and collect it.
  • Storage cost for retention and indexing for search. Accounting for the volume of messages collected per service instance per day multiplied by one year retention and the number of service instances deployed, that may be hundreds of terabytes of data at a cost of tens of thousands of dollars per month.
  • Documentation cost to understand the meaning of the message and the expected operational procedures, if any, to monitor, alert, and carry out any corrective actions, when detected.
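To see how the storage figure above can arise, here is a back-of-the-envelope calculation in Python. Every number in it is a hypothetical placeholder chosen for illustration, not a measurement or a quoted price; substitute your own volumes and rates.

```python
# All figures below are hypothetical placeholders, not measurements.
GB_PER_INSTANCE_PER_DAY = 1.0   # assumed log volume per service instance
RETENTION_DAYS = 365            # one year of retention
SERVICE_INSTANCES = 500         # assumed number of deployed instances
USD_PER_GB_MONTH = 0.10         # assumed blended price for storage and indexing


def retained_log_volume_gb(gb_per_day, retention_days, instances):
    """Steady-state volume held once the retention window is full."""
    return gb_per_day * retention_days * instances


volume_gb = retained_log_volume_gb(
    GB_PER_INSTANCE_PER_DAY, RETENTION_DAYS, SERVICE_INSTANCES
)
monthly_cost = volume_gb * USD_PER_GB_MONTH
print(f"{volume_gb / 1000:.1f} TB retained, about ${monthly_cost:,.0f} per month")
```

Under these assumptions the retained volume is 182.5 TB. Halving the per-instance log volume halves the bill, which is one reason noise reduction pays for itself.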

Excessive logging produces noise that becomes an impediment to operational efficiency and effectiveness. Monitoring and troubleshooting become more difficult, when significant information is buried among the noise. Seek to reduce noise by eliminating log messages that are not valuable. This can be done by classifying messages at a finer-grained log level (e.g., INFO, DEBUG) or by suppressing them altogether.

Specify a log message format so that alerts can be defined based on pattern matching. A precise identifier (e.g., OLTP-0123) for each type of message is helpful for monitoring solutions to key off of, rather than matching arbitrary strings that are not guaranteed to remain invariant.

Log messages should be parameterized to carry contextual information, such as the identifier of the entity being processed and the values of the most significant properties to the transaction. Avoid logging sensitive data that would compromise security, such as credentials or personally identifiable information subject to data privacy regulations. The context is important for isolating the problem when troubleshooting, parameterizing corrective actions to resolve the problem, and providing a useful description when reporting a bug.
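One way to combine a stable message identifier with parameterized context is sketched below using Python's standard logging module. The helper format_event, the logger name, and the context fields are hypothetical; OLTP-0123 is simply the example identifier used above.

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO
)
log = logging.getLogger("order-service")  # hypothetical service name


def format_event(msg_id: str, text: str, **context) -> str:
    """Render 'MSG-ID text key=value ...' with keys in a stable order.

    The identifier comes first so that alert rules can match on it
    instead of on free text that may be reworded in a later release.
    """
    pairs = " ".join(f"{key}={value}" for key, value in sorted(context.items()))
    return f"{msg_id} {text} {pairs}".rstrip()


# Context carries entity identifiers only; no credentials or personal data.
log.error(
    format_event(
        "OLTP-0123", "order submission failed", order_id="A-1001", attempts=3
    )
)
```

An alert rule can then key off the literal OLTP-0123, and the key=value pairs give the engineer what is needed to parameterize the corrective action.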

Avoid logging stack traces for non-debug levels of logging. Stack traces are verbose (noisy), and they carry information about the internal workings of the software (packages, classes, and source file names) that may be interesting to developers, but is not useful for operations.

Avoid repeatedly logging the same message. Repeated logging is often the consequence of an error condition that is handled with a retry loop. To avoid noise, retries can be counted without logging the continuing error. A summary can be logged when the retry loop ends. If retrying is successful, silence is preferred unless it is important to note violations of performance targets caused by retrying. Otherwise, timing out may entail escalating the error condition to a fallback mechanism or a circuit breaker, and logging this exceptional condition may be informative later, if the condition persists.
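The retry guidance above can be sketched as follows. This is a minimal illustration, assuming a hypothetical TransientError exception to mark retryable failures and a made-up message identifier RETRY-0001; a real service would likely add backoff and hand off to a fallback or circuit breaker.

```python
import logging
import time

log = logging.getLogger("retry-demo")  # hypothetical logger name


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Call operation(), retrying on TransientError; expects max_attempts >= 1."""
    failures = 0
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()  # success: stay silent, even after retries
        except TransientError as exc:
            failures += 1       # count the failure, but do not log it yet
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # One summary message once all attempts are exhausted.
    log.error("RETRY-0001 gave up after %d attempts: %s", failures, last_error)
    raise last_error
```

The operational log sees at most one ERROR per exhausted operation, rather than one per attempt.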

Do log a message when an error is detected for which a bug should be reported, an operations engineer should be alerted about a possible malfunction, or a corrective action is required. Error conditions that represent a possible service outage are especially important, as these are the messages that should be matched for alerting. Errors will be associated with corrective actions, which operations engineers will perform to resolve the problem, when encountered.

Avoid logging a message for normal operations, such as successful liveness probes and readiness probes. This is worthless noise.

Pay special attention to methods and transactions with security events.

  1. A redirect to login for a request to access a user interface
  2. A successful login
  3. A failed login attempt
  4. Performing a privileged action that must be audited, such as administrative actions or gaining access to private information not owned by the user
  5. A denial of access due to insufficient privileges

Security events should be logged with a format that allows such messages to be classified, so that they can be directed to SIEM for special handling. SIEM is responsible for auditing, intrusion detection, and fraud detection. Being able to detect a security breach is among the most important operational responsibilities.


One of the most common mistakes is to conflate the log stream directed toward operations and information intended for end users. Services that enable administrative end users to configure or customize features need to provide transparency to the steps that are executing. This includes integrating other services to collaborate or specifying workflows or policies, which can be done erroneously. When something can go wrong, users must have a way of debugging those errors. Legacy server-based applications tended to not differentiate between these use cases and operations. Unfortunately, when evolving such applications into cloud services, these use cases are the most difficult to tease apart, so that information is directed to the proper audiences.

It is equally problematic when functional issues, which are intended for feedback to end users, are directed to operational logs. Operations have no role to play in monitoring such issues and taking corrective action. Directing such messages to operational logs adds to noise.

Developers must pay attention to the intended audience of messages, so that they are either directed to operational logs or end users or both.


“You get what you measure.”

Produce metrics for things you care about.

  • Service utilization – end user activity
  • Service outages – times when the liveness and readiness probes fail, so that outage durations can be tracked against the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for service availability
  • Latency – the time between receiving a request and sending its corresponding response tells us how well the system is performing, as perceived by human users interacting through a user interface
  • Transaction processing throughput – the volume of transactions processed in each time interval tells us how well the system is performing with regard to work loads
  • Resource utilization – the infrastructure and platform resources (compute, storage, network) consumed to provide the service tell us how the demand is trending and how capacity should be managed to enable the service to scale into the future

Collect these metrics in a time series database for monitoring (including visualization and reporting) and alerting based on threshold crossing. Specify threshold crossing events for conditions that need attention, such as the following.

  • Violations of service level objectives (availability, performance)
  • Exceeding service limits (what the user is allowed) toward up-selling higher levels of service to the customer
  • Exceeding service and resource demands (capacity planning) toward adjusting scale and forecasting future scaling needs
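As an illustration of threshold crossing, here is a small Python sketch that turns a series of samples into "raised" and "cleared" events. The function name, the sample values, and the 250 ms threshold are hypothetical; a real deployment would normally express this as an alert rule in the monitoring system rather than in application code.

```python
def threshold_crossings(samples, threshold):
    """Yield (index, 'raised' or 'cleared') each time the series crosses."""
    above = False
    for index, value in enumerate(samples):
        now_above = value > threshold
        if now_above != above:
            # Emit an event only on the transition, not on every sample,
            # so that a sustained violation produces a single alert.
            yield index, "raised" if now_above else "cleared"
            above = now_above


latency_ms = [120, 180, 260, 310, 240, 190, 270]  # hypothetical samples
print(list(threshold_crossings(latency_ms, threshold=250)))
# [(2, 'raised'), (4, 'cleared'), (6, 'raised')]
```

Alerting on transitions rather than on each sample applies the same noise-reduction principle as the logging advice earlier.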

Autonomous operation

Developers should design a service to be self-sustaining indefinitely without the need for human intervention, as much as possible. Human involvement is expensive (labor intensive), error-prone, and slow to respond compared to automated procedures. Any condition that resorts to human intervention should be considered a failure on the part of the developers to design for autonomous operation.

  • Fault tolerance – as a consumer of other services and resources, be resilient to failures and outages that are likely to recover after some time
  • Self-healing – as a provider that is experiencing a failure or outage, detect the problem and implement measures to recover, such as restarting, rescheduling to use alternative resources, or shifting workloads to surviving instances
  • Auto-scaling – adjust the number of resources to match the workload, so that the service continues to satisfy performance objectives
  • Self-maintaining – routine housekeeping should be scheduled and automated to prevent storage exhaustion (e.g., data purging), to maintain reasonable performance (e.g., recalculating statistics), and to enforce policies (e.g., secrets rotation)


Operations involve actions initiated through human access (by operations staff) and systems integration (e.g., by capturing an order submitted by the subscriber). Ad hoc changes to a production deployment should be forbidden, because these cannot be reproduced programmatically from source code. Therefore, all types of changes must be anticipated during development, so that they are available as pipelines (programmatic workflows) operationally. Each action should be parameterized, so that a precise set of information is input to drive the execution of its pipeline.

Provisioning – creation and termination of the service subscription

Upgrades and patches – deploying software updates and bug fixes

Configuration – scaling, enabling and disabling features, naming, certificates, policies, customization

Diagnostic actions – checks, tests, and probes with a verbose log level for troubleshooting and debugging, when increased scrutiny is needed for problem resolution

Administrative actions and maintenance – password and secrets rotation, data purging according to retention policy, and housekeeping (e.g., storage optimization)

Capacity management – adding and removing infrastructure and platform resources (e.g., compute, storage, addresses)

Corrective actions – interventions for problem resolution, such as stopping and restarting of services (rescheduling on compute resources), forcing maintenance tasks like purging data (due to storage exhaustion) or password rotation (to prevent security breaches), or replacing defective resources (e.g., kernel deadlocks)

See also:

  1. The Twelve-Factor App
  2. 10 key attributes of cloud native applications

Corporations Acting as Agents of Foreign Governments

A thought experiment

An American citizen uses a healthcare web site to manage his lab results, medical images, medications, and communications with his doctors, all of whom are in the US, as is the web site’s hosting and corporate headquarters. The web site also serves citizens in other countries in the same manner. One day, the German government obtains a warrant from a German court to gain access to an American’s private health information.

Would a libertarian say that the web site is a private company and they may do as they please to violate the American’s privacy rights by handing over information to agents who have no jurisdiction? I think we can all agree that the answer must be no.

A thought experiment

Canada, Germany, New Zealand, and every other country enact hate speech legislation that is incompatible with the American Constitutional protections of free speech. American social media web sites must enforce these “community standards” to the satisfaction of all these jurisdictions in order to do business with users in those countries. It would be costly to treat each user specifically in accordance with the laws and regulations in their jurisdiction, so the company prefers to implement a uniform set of community standards that becomes a superset of all laws and regulations in every jurisdiction. That means every country’s government-mandated censorship now applies to American citizens to infringe upon their free speech rights.

Would a libertarian say that the web site is a private company and they may do as they please to violate the American’s free speech rights?

Would a libertarian agree that a private company may act as an agent on behalf of a foreign government to implement foreign laws and regulations on American citizens, such as China’s social credit system?

Social Media and the First Amendment

As social media companies like Facebook, Twitter, and Google (YouTube) increasingly restrict what users can publish according to their policies and “community standards”, we must be careful not to summarily dismiss such matters as private companies being free to operate their business as they please, because censorship (violating the First Amendment) is only applicable to state actors.

We must recognize three factors. (1) State actors are threatening to impose regulations or initiate anti-trust enforcement unless these companies self-regulate. Thus, the state is violating the First Amendment coercively through the back door. Or companies are working in cahoots with the government to implement the government’s desired regulations, knowing this is the only way such rules would not be struck down on Constitutional grounds. (2) Large corporations lobby for regulations to establish their own business practices as the status quo and raise barriers to entry for smaller competitors and start-ups. (3) Foreign governments want to enforce their laws and regulations, including against so-called “hate speech”, and corporations will enforce these rules against American citizens. All three are problematic from a First Amendment perspective.

Economics of Human Valuation

(My evolving thoughts that extend from

The currency of life is life-energy. When we give things of value to someone to improve their well-being, whether it is material or intangible, we transfer life-energy from ourselves to the recipient. The esteem that we hold for others is counted in life-energy credits and debts registered in our personal accounting system. For strangers who we hold at arm’s length there is a direct conversion of life-energy into monetary units when we conduct transactions. Even then, the quality of such transactions is accounted for with non-monetary life-energy to account for goodwill that is earned or extinguished.

Insights into innovation