Parallel and Distributed System Design, Part 1
…since my intention is to write something useful for anyone who understands it, it seemed more suitable to me to search after the effectual truth of the matter, rather than its imagined one. – Niccolo Machiavelli in The Prince, 1532
While conducting a bit of research on scalability and efficiently storing large amounts of information, I ran across an abstract in the Communications of the ACM written by a friend and former colleague, Pat Helland. Pat and I worked at Microsoft together during the late 90s. Our paths naturally crossed given his core research in high performance transaction systems (HPTS) and my core work in parallel and distributed systems. Pat recently moved back to San Francisco after 15 years with Microsoft [URI]. For the two years prior to leaving, Pat worked on Cosmos, some of the plumbing for Bing. It stores hundreds of petabytes of data on tens of thousands of computers.
The paper, “If You Have Too Much Data, then ‘Good Enough’ Is Good Enough”, is brilliantly written. Pat has always been a wonderful and capable writer with an ability to reduce complex problems into a simple presentation. Similar to his clear and concise description of application models using concepts known as “fiefdoms” and “emissaries” in 2002 [URI] [URI], Pat identifies a world where many of our classic SQL database principles are being eroded as we combine too much data, each source with its own disparate characteristics.
The non-relational storage technology (a.k.a. “NoSQL”) movement has of course given rise to the likes of MongoDB, CouchDB, and Redis. Each of these NoSQL technologies has its own unique advantages, so judicious evaluation is a MUST. The vast amounts of information we process each day have in turn forced us to consider alternate ways to process that information, or risk being unable to translate it into actionable form. Applied technologies such as Hadoop and MapReduce are fundamental to processing “big data”. We often employ practices and technologies to enable timely dissemination, such as data analytics, business performance management, data warehousing, dashboards, and key performance indicators (KPIs). Regardless of what we choose, how we react to this vast amount of information must be swift, decisive, and, more importantly, accurate in today’s global economy. Being capable of pivoting is essential.
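To make the MapReduce programming model concrete, here is a toy word-count sketch in plain Python. It only mimics the map and reduce phases in a single process; a real Hadoop job distributes them across many machines, and none of Hadoop’s actual APIs appear here.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

print(reduce_phase(map_phase(["big data is big", "data about data"])))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```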
Mobile + Cloud
Mobile and cloud technologies are a winning distributed system solution for many problem domains. But let’s not forget this combination is a distributed system. When designing and implementing distributed systems and the data models on which they operate, organization is indeed paramount, and so is synchronization. For safety- and mission-critical applications, high availability has always been a central concern, and recent experience with large Internet sites has underscored the need for availability in that domain as well. Traditional approaches to the problem have made three implicit assumptions:
- Failure rates of hardware and software are low and improving.
- Systems can be modeled for reliability analysis and their failure modes can be predicted.
- Human error during maintenance is not a major source of failures.
The result is an emphasis on failure avoidance as the path to high availability. These assumptions are in many cases based on incorrect perceptions of today’s environment, and renewed emphasis should be given to failure recovery. Even the most highly tested systems occasionally exhibit “Heisenbugs” and suffer from transient or permanent hardware failure and software aging, and human error has empirically been found to account for a nontrivial fraction of catastrophic failures. The most successful systems have been those that can recover from these unexpected errors because they were designed for recovery.
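As a small illustration of designing for recovery rather than avoidance, here is a minimal retry-with-backoff sketch in Python. The `TransientError` class and the delay parameters are illustrative assumptions, not taken from any particular system; the point is that the caller plans its way back from failure instead of assuming failure won’t happen.

```python
import random
import time

class TransientError(Exception):
    """Raised by an operation when the failure is worth retrying."""

def call_with_recovery(operation, max_attempts=5, base_delay=0.1):
    """Invoke `operation`, retrying transient failures with exponential
    backoff plus jitter: the system assumes failures will happen and
    recovers from them rather than trying to prevent them outright."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up; let a higher layer decide what to do next
            # Back off exponentially, with jitter to avoid thundering herds.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```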
If a problem has no solution, it may not be a problem, but a fact—not to be solved, but to be coped with over time. – Shimon Peres
There is no way to guarantee complete knowledge of the current or future state of a distributed system, because knowledge of state changes must be propagated and propagation takes time, during which more state changes may occur. This is an axiom of distributed computing.
Tightly coupled systems deal with uncertainty by attempting to eliminate it. This is done through constraints on updates (requiring all nodes, or some majority of nodes, to be available before updates can be performed), distributed locking schemes or single mastering for critical resources, constraining all nodes to be well connected, or some combination of these techniques. “Majority” often involves weighted voting schemes, so “big” nodes can be more influential than “small” nodes. The more tightly coupled the computing nodes in a distributed system are, the lower the scaling limit.
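For illustration, a weighted-majority check might look like the sketch below. The node names and weights are made up; the shape of the test is the point: an update proceeds only when the acknowledging nodes carry a strict majority of the total voting weight.

```python
def quorum_reached(acks, weights):
    """Return True when the nodes that acknowledged an update carry a
    strict majority of the total voting weight. `weights` maps node id
    to its vote weight, so "big" nodes can out-vote "small" ones."""
    total = sum(weights.values())
    acked = sum(weights[node] for node in acks)
    return acked * 2 > total

# Example: one heavyweight primary and three small replicas.
weights = {"primary": 3, "replica-1": 1, "replica-2": 1, "replica-3": 1}
print(quorum_reached({"primary"}, weights))               # False: 3 of 6
print(quorum_reached({"primary", "replica-1"}, weights))  # True: 4 of 6
```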
Loosely coupled systems deal with uncertainty by tolerating it. A loosely coupled system allows participating nodes to have differing views of the overall system state and provides algorithms for resolving conflicts.
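Version vectors are one common way to detect the conflicts such a system must resolve. The sketch below is a minimal comparison routine, not any particular product’s implementation: two versions that each contain updates the other has not seen are “concurrent”, and resolving them (merging, or picking a winner by policy) is left to the application.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts of node id -> update counter).
    Returns 'a_dominates', 'b_dominates', 'equal', or 'concurrent'."""
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # a genuine conflict; the application decides
    if a_ahead:
        return "a_dominates"
    if b_ahead:
        return "b_dominates"
    return "equal"

# Two replicas that updated the same record independently while partitioned:
print(compare({"node1": 2, "node2": 1}, {"node1": 1, "node2": 3}))  # concurrent
```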
My choice for parallel and distributed system design is loosely coupled because:
- Customers require a highly distributed solution in which parts of the infrastructure can be spread across the internal and public networks and administered locally.
- Large customers need to grow in capacity to many millions of transactions per day or to hundreds or thousands of nodes, or both.
- Many networks provide only intermittent connectivity to some locations, for example, remote oil drilling platforms and ships at sea, so the system MUST be tolerant of partly connected or disconnected operation.
Tightly coupled solutions are unsuitable for parallel and distributed system design because of the requirements for scalability to a very large number of nodes and for disconnected operation. The loosely coupled model satisfies all of the above requirements.
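The disconnected-operation requirement above is often met with a store-and-forward queue: updates are accepted and buffered locally while the link is down, then flushed when connectivity returns. The sketch below is a bare-bones illustration, with `send` standing in for whatever transport the real system uses.

```python
from collections import deque

class OutboundQueue:
    """Buffer updates locally while the link is down and flush them when
    connectivity returns -- a store-and-forward pattern that tolerates
    partly connected or disconnected operation."""

    def __init__(self, send):
        self._send = send          # callable that raises ConnectionError when offline
        self._pending = deque()

    def submit(self, update):
        self._pending.append(update)
        self.flush()

    def flush(self):
        while self._pending:
            update = self._pending[0]
            try:
                self._send(update)
            except ConnectionError:
                break                    # still disconnected; try again later
            self._pending.popleft()      # drop the update only once it is sent
```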
I’d like to quickly point out an observation regarding tolerance as it relates to most Internet-aware software. As the World Wide Web Consortium (W3C) points out, the principle of tolerance does not blunt the need for a perfectly clear specification that draws a precise distinction between conformance and non-conformance. The principle of tolerance is no excuse for a product that contravenes a standard. Naturally, the aphorism “any problem in computer science can be solved with another level of indirection” still rings true, but only to a degree. Quantifying that degree is a difficult balance to achieve because it involves many variables, the most notable of which, in my opinion, are scalability, flexibility, and reliability.
I plan to discuss the merits of loose coupling in the larger fabric of parallel and distributed system design more concretely over the next several posts.