Michael Primeaux

Parallel and Distributed Systems


Parallel and Distributed System Design, Part 2

The map is not the territory. – Alfred Korzybski in Science and Sanity, 1933

This is the second post in a series where I’ll discuss various aspects of parallel and distributed system design. I’d like to provide a brief introduction to taxonomy (or classification of data) and eventual consistency as these building blocks are fundamental to parallel and distributed system design.

Extending my thoughts from a post I made some time ago, in computer science and information modeling, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships among those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain. Why do we need to know this? Because all parallel and distributed systems require data to operate and, in turn, operate on data.

Semantic Ontologies

A semantic ontology, or ontology, is a formal, explicit description of concepts in a domain of discourse, the properties of each concept describing various features and attributes of the concept, and restrictions on properties. An ontology together with a set of individual instances of classes [or objects] constitutes a knowledge base. In reality, there is a very fine line where the ontology ends and the knowledge base begins, but let's put that aside for a moment.

An implementation of semantic ontologies is the Semantic Web [URI]. Simply put, the Semantic Web is the representation of data on the web in which information is given well-defined meaning—“to be a universal medium for the exchange of data”.

The principal technologies of the Semantic Web fit into a set of layered specifications called the Resource Description Framework (RDF). The current components of that framework are the RDF Core Model, the RDF Vocabulary Description Language, and the Web Ontology Language (OWL). OWL is a descriptive layer built on top of RDF used to model classes, properties, and objects. These languages all build on the foundation of URIs, XML, and XML namespaces.

Classes are the focus of most semantic ontologies. Most ontology research outlines the following observations:

  • There is no one correct way to model a domain—there are always viable alternatives. The best solution almost always depends on the application you have in mind and the extensions you anticipate.
  • Ontology development is necessarily an iterative process.
  • Concepts in the ontology should be close to objects (physical or logical) and relationships in your domain of interest. These are most likely to be nouns (objects) or verbs (relationships) in sentences that describe your domain.

Solution domains require, at times, different classifications. Regardless of which classification the solution domain requires, the following rules and considerations have proven effective in designing an efficient information model:

  • Effectively understand the information.
  • Describe it unambiguously.
  • Enforce structure and style guidelines.
  • Allow for efficient storage and retrieval of information.
  • Keep network communication to a minimum. Don't over-engineer.

I cannot overstate the need to understand your information model, but always with an eye on your delivery time frames. All too often we get caught up in pursuing the perfect model, only to take shortcuts at the end to satisfy deadlines. The solution is accepting and understanding that change is an integral part of any information model. Temporal and transient taxonomies may be an integral part of your solution domain. Semantic ontologies may be created at runtime to address a special-purpose need where the need itself is transient in nature. For example, to fulfill a set of calculations with high time and computational complexity, a transient taxonomy might be created to provide global access to intermediate calculation results. This is all too common in financial simulation.

Not surprisingly, classes are the focus of most ontologies. So with that, let's move on to a more familiar and enjoyable example of a semantic ontology: wine. Wine is a potable liquid produced by at least one maker of type winery, and is made from at least one type of grape. For example, a class of Wine represents all wines. Specific wines are instances of this class. The Bordeaux wine in the glass in front of you while you read this post is an instance of the class of Bordeaux wine. A class can have subclasses that represent concepts that are more specific than the superclass. For example, we can divide the class of all wines into red, white, and rosé wines. Alternatively, we can divide the class of all wines into sparkling and non-sparkling wines.

We can further describe properties of classes and instances: Chateau Lafite Rothschild Pauillac wine has a full body; it is produced by the Chateau Lafite Rothschild winery. We can have two properties describing the wine: the property body with the value full, and the property maker with the value Chateau Lafite Rothschild winery. At the class level, we can say that instances of the class Wine will have properties describing their flavor, body, sugar level, the maker of the wine, and so on.

All instances of the class Wine, and its subclass Pauillac, have a property maker, the value of which is an instance of the class Winery. All instances of the class Winery have a property produces that refers to all the wines (instances of the class Wine and its subclasses) that the winery produces. The entire example OWL ontology derived for Wine can be found here. I am, of course, a huge fan.
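
For readers who like to see the shape of such a model, here is a minimal sketch of the wine example expressed as RDF/OWL triples in Python, assuming the rdflib library is available. The ex: namespace, class names, and property names are illustrative stand-ins, not the published wine ontology.

    # A sketch of the wine classes, properties, and instances as triples.
    # Assumes the rdflib library; the ex: namespace is a hypothetical stand-in.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, RDFS

    EX = Namespace("http://example.org/wine#")
    g = Graph()
    g.bind("ex", EX)

    # Classes: Wine, its subclass Pauillac, and Winery.
    for cls in (EX.Wine, EX.Pauillac, EX.Winery):
        g.add((cls, RDF.type, OWL.Class))
    g.add((EX.Pauillac, RDFS.subClassOf, EX.Wine))

    # Properties: every Wine has a maker (a Winery); a Winery produces Wines.
    for prop, domain, rng in ((EX.maker, EX.Wine, EX.Winery),
                              (EX.produces, EX.Winery, EX.Wine)):
        g.add((prop, RDF.type, OWL.ObjectProperty))
        g.add((prop, RDFS.domain, domain))
        g.add((prop, RDFS.range, rng))
    g.add((EX.body, RDF.type, OWL.DatatypeProperty))

    # Instances: a Pauillac wine and its winery.
    g.add((EX.ChateauLafiteRothschild, RDF.type, EX.Winery))
    g.add((EX.LafitePauillac, RDF.type, EX.Pauillac))
    g.add((EX.LafitePauillac, EX.maker, EX.ChateauLafiteRothschild))
    g.add((EX.LafitePauillac, EX.body, Literal("full")))

    # Which wines does the winery make? Follow maker edges backwards.
    for wine in g.subjects(EX.maker, EX.ChateauLafiteRothschild):
        print(wine)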

The strength of your application depends on understanding the data on which it operates—unambiguously. Though perfection is not key, it is a close second. Once you fully understand your information model, you are able to reason about eventual consistency for your application. And with this in mind, let's discuss eventual consistency.

Eventual Consistency

Eventual consistency is one of the consistency models used in parallel and distributed computing. It means that, given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent. I will remind you from a previous blog entry that there is no way to guarantee complete knowledge of the current or future state of a distributed system, because knowledge of state changes must be propagated and propagation takes time, during which more state changes may occur. This is an axiom of distributed computing. So how can we begin to reconcile eventual consistency with this axiom? Well, we don't really [URI]. We design for loose coupling. We design for tolerance. We design for symmetry of algorithms.
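
As a toy illustration of that definition, the following Python sketch gives each replica a last-writer-wins register and lets randomly paired replicas exchange state; once writes stop, a few rounds of this anti-entropy gossip are enough for every replica to agree. The class and variable names are hypothetical, and real systems need far more than this (conflict metadata, failure handling, durable storage).

    # Toy eventual consistency: replicas gossip a last-writer-wins register
    # until every replica holds the same (timestamp, value) pair.
    import random

    class LwwReplica:
        def __init__(self, name):
            self.name = name
            self.state = (0, None)                  # (timestamp, value)

        def write(self, timestamp, value):
            if timestamp >= self.state[0]:
                self.state = (timestamp, value)

        def merge(self, other):
            # Anti-entropy: both sides keep the write with the larger timestamp.
            winner = self.state if self.state[0] >= other.state[0] else other.state
            self.state = other.state = winner

    replicas = [LwwReplica(f"R{i}") for i in range(5)]
    replicas[0].write(1, "v1")                      # divergent writes at different nodes
    replicas[3].write(2, "v2")

    while len({r.state for r in replicas}) > 1:     # gossip until no replica disagrees
        a, b = random.sample(replicas, 2)
        a.merge(b)

    print([r.state for r in replicas])              # every replica holds (2, 'v2')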

At a high level, most mission- and safety-critical distributed applications SHOULD strive to achieve these high-level technical requirements:

  • Provide highly available read and write access to information.
  • Minimize network communication.
  • Provide a highly available and fault-tolerant system that can support an annual uptime guarantee of 99.999 percent, which is equal to 5.256 minutes of annual unscheduled downtime.
  • Provide instrumentation to aid implementers and customers in their hardware sizing estimates. Scalability MUST be measurable.
  • Devise a software architecture that scales up by a factor of 10^6. That is, an application's storage and processing capacity can automatically grow by a factor of a million, doing jobs 10^6 times faster (a 10^6 speed-up) or doing 10^6-times-larger jobs in the same time (a 10^6 scale-up), just by adding more resources.
  • Provide for conflict resolution. Common implementations in this area leverage Lamport timestamps and vector clocks (a minimal vector clock sketch follows this list).
  • Account for transient node failures.
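
As referenced in the conflict-resolution requirement above, here is a minimal vector clock sketch in Python. Each replica increments its own slot on a local write and merges clocks on receipt; two versions conflict when neither clock dominates the other. The structure and names are illustrative assumptions, not any particular product's API.

    # Minimal vector clocks: detect whether two versions are causally ordered
    # or concurrent (and therefore in need of application-level reconciliation).
    class VectorClock:
        def __init__(self, counters=None):
            self.counters = dict(counters or {})    # replica id -> logical counter

        def tick(self, replica_id):
            self.counters[replica_id] = self.counters.get(replica_id, 0) + 1

        def merge(self, other):
            # Element-wise maximum: the combined causal history of both clocks.
            for rid, n in other.counters.items():
                self.counters[rid] = max(self.counters.get(rid, 0), n)

        def dominates(self, other):
            # True when self has seen everything other has seen.
            return all(self.counters.get(rid, 0) >= n
                       for rid, n in other.counters.items())

    def compare(a, b):
        if a.dominates(b) and b.dominates(a):
            return "equal"
        if a.dominates(b):
            return "a happened after b"
        if b.dominates(a):
            return "b happened after a"
        return "concurrent: application-level reconciliation needed"

    # Two replicas update independently, producing concurrent versions.
    a, b = VectorClock(), VectorClock()
    a.tick("R1")
    b.tick("R2")
    print(compare(a, b))   # concurrent: application-level reconciliation needed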

While the goal is to devise a software architecture that scales up without limits, there has to be some kind of limit: billions of dollars, or gigawatts, or just space. So the more realistic goal is to be able to scale from one node to a million nodes all working on the same problem or the same set of problems.

While a Distributed Hash Table (DHT) is generally well suited to specific classes of decentralized distributed systems, it is certainly not well suited to all. For example, a DHT works well where a highly reliable network between nodes is usual, but it is not necessarily well suited to problem domains where frequent node joins and leaves are the norm. That said, DHTs have been used for routing in many Peer-to-Peer (P2P) implementations [URI]. Even some of the better-known NoSQL implementations leverage DHTs.
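
To make the DHT discussion a bit more concrete, the following Python sketch shows the key-placement idea behind many DHTs: consistent hashing. Nodes and keys hash onto the same ring, and a key is owned by the first node clockwise from it, so a node join or leave only relocates a small slice of keys. This is an illustration only; real DHTs add virtual nodes, replication, and multi-hop routing.

    # Consistent hashing: map keys to nodes on a hash ring.
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes):
            self.ring = sorted((self._hash(n), n) for n in nodes)
            self.points = [h for h, _ in self.ring]

        @staticmethod
        def _hash(value):
            # Any uniform hash works; md5 keeps the sketch dependency-free.
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def owner(self, key):
            # First node clockwise from the key's position on the ring.
            i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    for key in ("alpha", "bravo", "charlie"):
        print(key, "->", ring.owner(key))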

But as I mentioned, DHTs are not necessarily well suited for all implementations. When we designed the core replication algorithms for the Microsoft Active Directory (AD) service, a DHT was not chosen. Not only did we want to account for Byzantine failures but we also needed to account for replica failures that lasted for long periods of time (30+ days). AD subscribes to eventual consistency. Roughly speaking, the replication model of AD is multi-master loose consistency with convergence (eventual consistency). In this model, the directory can have many replicas; a replication system propagates changes made at any given replica to all other replicas. The replicas are not guaranteed to be consistent with each other at any particular time (“loose consistency”), because changes can be applied to any replica at any time (“multi-master”). If the system is allowed to reach a steady state, in which no new updates are occurring and all previous updates have been completely replicated, all replicas are guaranteed to converge on the same set of values (“convergence or eventual consistency”).

In addition to the technical requirements listed earlier in this post, we imposed the following requirements on the design of AD replication:

  • To adapt to customer networks, provide flexibility in replication topology including choice of transports.
  • Provide a fully asynchronous and highly scalable architecture.
  • Minimize network communication in terms of size and round-trips, support encryption of data, and support transitive (store/forward) transportation of data.
  • Work well over high-latency communication links.
  • Operate correctly when encountering changes to the distinguished name (DN) of an object, and discriminate between a deleted object and a new object with the same DN. In other words, we replicate information based on an object's universally unique identifier (UUID) and not based on its DN.
  • Generate the replication topology automatically. The system creates (and destroys) “shortcut” replication links between nodes in order to maintain a specific number of hops between any two points.

Active Directory uses a state-based approach to replication. This approach is easier to appreciate in contrast with an alternative, log-based replication. In a typical log-based system each master keeps a log of the updates that it originated. The goal of each master is to communicate its log to every other replica. Once a log arrives at a replica, the replica applies the log, bringing its state more up to date.

In state-based replication, each master applies updates, both originating and replicated, to its replica as they arrive. As such, replication is driven not from logs stored with the source replica but from the current state of the source replica. This state includes information for resolving conflicts (also needed in the log-based approach) and information to avoid sending the full replica on each replication cycle (inherent in the log-based approach). A state-based approach uses a single mechanism for incremental and full sync, and performs fewer database updates since repeated or conflicting updates to an attribute are collapsed into a single state. The resulting design has several stability-enhancing properties, such as the following (a minimal sketch of the state-based approach appears after this list):

  • No matter how long two replication partners have been out of communication, they can always build upon any previous replication they have accomplished. They never have to start over from scratch because somebody’s log “wrapped around.”
  • If a replica has a “hot spot” attribute that’s being updated frequently for some reason (perhaps a bug in an application or an implementation similar to a performance counter), only the value that’s current at the time of replication is sent to a replication partner. Some designs send the full history of all intermediate values thereby multiplying the problem.
  • The very same mechanism is used to initialize a new replica as is used for bringing a replica up to date with recent changes. Some other designs require two mechanisms.
  • AD has a simple and robust solution to the “time went backwards due to restore from backup” problem (also called the “back sync” problem) that plagues some other systems.
  • In a replication relationship the destination (i.e. the replica being updated) takes all responsibility for keeping track of how up to date it is. The source (i.e. the replica supplying the updates) takes no responsibility. This design choice avoids a lot of complexity and potential sources of failure; some other systems don’t work this way.
  • In a replication relationship the destination always “pulls” changes from the source. The source may notify the destination “now would be a good time to pull,” but if the notification is lost (e.g. because the destination is overloaded or down) the result is longer replication latency, not incorrectness.
  • When two replicas establish a new replication relationship, that relationship is established in an incremental fashion, so not much work is lost should one replica go down before the relationship becomes complete.
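
To ground the state-based description above, here is a small Python sketch in that spirit; it is emphatically not the AD implementation. Each attribute carries its own (version, timestamp, originating replica) stamp, the destination pulls only attributes stamped newer than what it already holds, and repeated updates to a hot attribute collapse into whatever state is current at pull time.

    # State-based, destination-driven replication sketch (illustrative only).
    import time

    class Replica:
        def __init__(self, name):
            self.name = name
            self.attrs = {}     # attribute -> (version, timestamp, origin, value)

        def originate_write(self, attr, value):
            version = self.attrs.get(attr, (0,))[0] + 1
            self.attrs[attr] = (version, time.time(), self.name, value)

        def pull_from(self, source):
            # Destination-driven: the destination decides what it is missing.
            for attr, stamp in source.attrs.items():
                if attr not in self.attrs or stamp[:3] > self.attrs[attr][:3]:
                    self.attrs[attr] = stamp   # apply only the current state

    r1, r2 = Replica("R1"), Replica("R2")
    for i in range(1000):                      # hot-spot attribute on R1
        r1.originate_write("logonCount", i)
    r1.originate_write("mail", "user@example.com")

    r2.pull_from(r1)                           # one pull, not 1001 log records
    print(r2.attrs["logonCount"][3], r2.attrs["mail"][3])   # 999 user@example.com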

Impact on Distributed Applications

A multi-master distributed system imposes several problems on applications:

  • Version Skew. Version skew occurs when applications read the same object(s) from different replicas before a change has replicated. Applications reading the remote replica see the unchanged object. Version skew is an issue when a given application or set of applications uses the information in the directory to interoperate.
  • Partial Updates. Partial Update occurs when applications read the same set of objects from different replicas while replication is in progress. Applications at the remote replica see some of the changes but not all. Note that there is a small window in which partial update can affect an application: the application must start reading objects while inbound replication is in progress, after one or more of the related, changed objects have been received but before all have been received. The time between the updates at the source replica directly affects the size of this window—updates that occur close together in time will be replicated close together in time. Partial update is an issue when an application uses a related set of objects.
  • Collisions. Collisions occur when the same properties of two or more replicas of a given object are changed during the same replication interval. The replication process reconciles the collision; because of reconciliation, a user or application may “see” a value other than the one they wrote. A simple example is user address information: if a user changes their mailing address at replica R-a and an administrator changes the same mailing address at replica R-b, the value ultimately propagated to R-a, R-b, and all other replicas will be the value selected by the collision reconciliation mechanism. Collision resolution is an issue for applications that make assumptions about the internal consistency of objects or sets of objects (a sketch of one deterministic reconciliation rule follows this list).
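
Continuing the collision example above, one deterministic reconciliation rule is sketched below: every replica applies the same comparison, so all converge on the same winner. The precedence shown (version, then timestamp, then originating replica identifier) is illustrative; a given directory's exact rule may differ.

    # One deterministic collision-reconciliation rule (illustrative precedence).
    def reconcile(write_a, write_b):
        """Each write is (version, timestamp, originating_replica_id, value)."""
        return max(write_a, write_b, key=lambda w: w[:3])

    user_edit  = (3, 1700000000.0, "R-a", "12 Rue du Port")
    admin_edit = (3, 1700000042.0, "R-b", "7 Quai des Chartrons")
    print(reconcile(user_edit, admin_edit)[3])   # later timestamp breaks the tie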

Though not in the scope of this post, applications must accommodate replication latency. The best way to accommodate replication latency is to design applications to minimize its effects—to tolerate latency. The ideal distributed application is, of course, unaffected by state induced by replication latency. Other applications must adopt an avoidance or detection strategy as a mechanism for tolerating replication latency.
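
As one hypothetical detection strategy, the sketch below has a client remember a watermark for its own write and refuse to trust a replica that has not yet caught up to it. The per-replica sequence number used here is an illustrative stand-in; real systems might use update sequence numbers, vector clocks, or session tokens instead.

    # Detection strategy sketch: only read from a replica that has caught up
    # to the watermark of our own write ("read your writes").
    class ReplicaView:
        def __init__(self):
            self.high_watermark = 0      # highest update this replica has applied
            self.data = {}

        def apply(self, seq, key, value):
            self.data[key] = value
            self.high_watermark = max(self.high_watermark, seq)

    def read_if_fresh(replica, key, min_watermark):
        # Detection: refuse the read if the replica has not seen our write yet.
        if replica.high_watermark < min_watermark:
            return None                  # caller retries, redirects, or waits
        return replica.data.get(key)

    primary, lagging = ReplicaView(), ReplicaView()
    primary.apply(42, "mail", "user@example.com")     # our write landed at seq 42
    print(read_if_fresh(lagging, "mail", 42))         # None -> replica is stale
    print(read_if_fresh(primary, "mail", 42))         # user@example.com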

Many NoSQL systems, such as MongoDB, employ a master/slave topology, and so writes are accepted on a single node. Even the newer replica sets in MongoDB adhere to this model. Sharding (or partitioning), of course, is a key consideration in these types of systems. Sharding enables horizontal scaling across multiple nodes. In a sharded MongoDB cluster, each shard's replica set performs automated leader election to choose its primary node. Leader election in a ring is another topic in and of itself. So let's wrap this up…

Wrapping it all up…

Regardless of the underlying durable storage technology, it is your paramount responsibility to properly classify your information model, understand your consistency model, and develop your application to be highly resilient to failures. It is this latter point—being highly resilient to failures—that is not an easy task, and it is precisely why I'll discuss fault tolerance and recovery in my next post.