Tag Archives: REST

Reliable Messaging with REST

Marc de Graauw’s article Nobody Needs Reliable Messaging remains as relevant today as it did in 2010, when it was first published. It echoes the principles outlined in Scalable, Reliable, and Secure RESTful services from 2007.

It basically says that you don’t need for REST to support WS-ReliableMessaging delivery requirements, because reliable delivery can be accomplished by the business logic through retries, so long as in the REST layer its methods are idempotent (the same request will produce the same result). Let’s examine the implications in more detail.

First, we must design the REST methods to be idempotent. This is no small feat. This is a huge topic that deserves its own separate examination. But let’s put this topic aside for now, and assume that we have designed our REST web services to support idempotence.

If we are developing components that call REST web services for process automation, the above principle says that the caller is responsible for retrying on failure.

The caller must be able to distinguish a failure to deliver the request from a failure by the server to perform the requested method. The former should be retried, expecting that the failure is temporary. The latter is permanent.

The caller must be able to implement retry in an efficient manner. If the request is retried immediately in a tight loop, it is likely to continue to fail for the same reason. Network connectivity issues sometimes take a few minutes to be resolved. However, if the reason for failure is because the server is overloaded, having all clients retry in a tight loop will exacerbate the problem by slamming the server with a flood of requests, when it is least able to process them. It would be helpful if clients would behave better by backing off for some time and retrying after a delay. Relying on clients to behave nicely on their honor is sure to fail, if their retry logic is coded ad hoc without following a standard convention.

The caller must be able to survive crashes and restarts, so that an automated task can be relied upon to reach a terminal state (success or failure) after starting. Therefore, message delivery must be backed by a persistent store. Delivery must be handled asynchronously so that it can be retried across restarts (including service migration to replacement hardware after a hardware failure), and so that the caller is not blocked waiting.

The caller must be able to detect when too many retry attempts have failed, so that it does not get stuck waiting forever for the request to be delivered. Temporary problems that take too long to be resolved need to be escalated for intervention. These requests should be diverted for special handling, and the caller should continue with other work, until someone can troubleshoot the problem. Poison message handling is essential so that retrying does not result in an infinite loop that would gum up the works.

POST methods are not idempotent, so retry must be handled very carefully to account for side-effects. Even if the request is guaranteed to be delivered, and it is processed properly (exactly once) by the server, the caller must be able to determine if the method succeeded reliably, because the reply can be lost. One approach is to deliver the reply reliably from the server back to the caller. Again, all of the above reliable delivery qualities apply. The interactions to enable this round trip message exchange certainly look very foreign to the simple HTTP synchronous interaction. Either the caller would poll for the reply, or a callback mechanism would be needed. Another approach is to enable the caller to confirm that the original request was processed. With either approach, the reliable execution requirement needs to alter the methods of the REST web services. To achieve better quality of service in the transport, the definition of the methods need to be radically redesigned. (If you are having a John McEnroe “you cannot be serious” moment right about now, it is perfectly understandable.)

Taking these requirements into consideration, it is clear that it is not true that “nobody needs reliable messaging”. Enterprise applications with automated processes that perform mission-critical tasks need the ability to perform those tasks reliably. If reliable message delivery is not handled at the REST layer, the responsibility for retry falls to the message sender. We still need reliable messaging; we must implement the requirement ourselves above REST, and this becomes troublesome without a standard framework that behaves nicely. If we accept that REST can provide only idempotence toward this goal, we must implement a standard framework to handle delivery failures, retry with exponential back off, and divert poison messages for escalation. That is to say, we need a reliable messaging framework on top of REST.

[Note that when we speak of a “client” above, we are not talking about a user sitting in front of a Web browser. We are talking about one mission-critical enterprise application communicating with another in a choreography to accomplish some business transaction. An example of a choreography is the interplay between a buyer and a seller through the systems for commerce, quote, procurement, and order fulfillment.]