Michael Primeaux

Parallel and Distributed Systems


Peer-to-Peer (P2P)

“In a forest a tree will fade; from a forest a tree is made.” –Unknown

Over the past several years I’ve focused on the application of P2P algorithms in the commercial space. Anyway, I thought I’d share some of my thoughts and tertiary research.

First, what is Peer-to-Peer (P2P)? P2P [URI] is a decentralized, fault-tolerant, self-organizing system architecture composed of many unreliable and heterogeneous nodes operating in a functionally symmetric manner—frequent joins and leaves are the norm. P2P is not a client/server architecture. P2P is a paradigm shift from coordination to cooperation, from centralization to decentralization, and from control to incentives.

Algorithms

To my knowledge, the Plaxton mesh (1997) was the earliest known proposal for a scalable P2P network. One major problem with a Plaxton mesh is that it’s static—there is no concept of node arrival, departure, or failure. Essentially, nodes act as routers, clients, and servers simultaneously. A Plaxton mesh routes a message to the node whose name is numerically closest to the destination.
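The routing idea behind the Plaxton mesh—and its descendants Pastry and Tapestry—is that each hop should resolve one more digit of the destination identifier. Here is a minimal sketch of that prefix-routing step; the function names and the fallback-to-numerically-closest rule are illustrative, not taken from the Plaxton paper.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two identifiers have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current: str, dest: str, neighbors: list[str]) -> str:
    """Prefix routing: forward to a neighbor that shares a longer prefix
    with the destination than the current node does; otherwise fall back
    to the numerically closest neighbor."""
    want = shared_prefix_len(current, dest) + 1
    candidates = [n for n in neighbors if shared_prefix_len(n, dest) >= want]
    if candidates:
        return candidates[0]
    return min(neighbors, key=lambda n: abs(int(n, 16) - int(dest, 16)))
```

Because each hop fixes at least one more digit, a message over a b-digit identifier space reaches its destination in at most b hops.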

From 2001 through 2003, P2P algorithms such as Chord, CAN, Pastry, Tapestry, Viceroy, P-Grid, Kademlia, Koorde, SkipGraph, and SkipNet appeared.

Applications

Napster was one of the first P2P applications, appearing in 1999—though it’s not really a P2P network. With Napster, a single central server kept the directory of nodes and objects, which made it essentially a large file catalog—the discovery and routing algorithms are not P2P.

Gnutella first appeared in 2000, with a revised protocol following in 2002. Though a pure P2P application, Gnutella generated far too much search traffic with its [broadcast-based] O(n) complexity—not very scalable.

eDonkey appeared in 2001 and presented a slight architectural improvement over Napster in that no single centralized server held the file catalog. Nevertheless, eDonkey was not a pure P2P application since it relied on dedicated server nodes.

Kazaa also appeared in late 2001 but I haven’t studied its architecture at all. I listed it here for completeness.

Gnutella-2 and BitTorrent were the P2P applications of 2002. In 2003, Skype and Steam appeared.

Architecture

Common characteristics of P2P architectures are:

  • No central servers.
  • High level of scalability. In structured designs, discovery and routing take O(log n) hops, and routing tables hold O(log n) entries.
  • Highly resilient.
  • Dynamic adaptability to node arrival, departure, and failure.

Because of these characteristics, P2P architectures offer challenges in many areas. The most visible of the challenges facing designers of P2P routing algorithms are:

  • Scalability. Scalability is a measure of how a system performs when the number of nodes and/or number of messages on the network grows.
  • Computational Complexity. Computational complexity is a measure of the number of steps required, in the worst case, for a message to travel from one node to another.
  • Anonymity. Anonymity is not a requirement of most P2P networks; however, if a network is designed to provide anonymity, then this is a problem that must be solved at the routing level.

Topology

A P2P infrastructure is best described as an overlay network—usually on top of an IP network. In many ways, you can think of P2P as an application-level router providing services that underlying layers do not (e.g., IP multicast). Generally speaking, P2P architectures employ one of the following topologies:

  • Centralized (Napster)
  • Decentralized
    • Unstructured (Gnutella)
    • Structured (Chord)
  • Hierarchical (Multicast Backbone (MBone))
  • Hybrid (eDonkey)

Different P2P algorithms employ different topological structures. For example, Chord employs a ring topology whereas Kademlia employs a tree and CAN a hypercube.
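The topology an algorithm chooses shows up directly in how it measures distance between node identifiers. A small illustrative comparison (the constants and names here are mine, not from either paper): Chord measures clockwise distance on a ring, so distance is asymmetric, while Kademlia’s XOR metric over a tree of identifiers is symmetric.

```python
RING_BITS = 8  # tiny 2^8 identifier space, just for the demo

def chord_distance(a: int, b: int) -> int:
    """Clockwise distance from a to b on the ring (note: asymmetric)."""
    return (b - a) % (2 ** RING_BITS)

def kademlia_distance(a: int, b: int) -> int:
    """Kademlia's XOR metric (symmetric: d(a,b) == d(b,a))."""
    return a ^ b
```

That symmetry is one reason Kademlia nodes can learn useful routing information from the queries they receive: the sender is as close to the receiver as the receiver is to the sender.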

Routing

Different topologies require different algorithmic tactics and strategies. Routing within P2P architectures is very problematic.

Within P2P architectures, algorithms such as flooding, replication and caching, random walkers and probabilistic algorithms, super-peers, Time to Live (TTL), epidemic and gossip protocols, propagation, and dampening algorithms are widely used. How well these algorithms work depends on the size of the P2P network. For example, flooding is better suited to small and medium-sized networks: it has been shown that the cost of searching a Gnutella-style network increases super-linearly as the number of nodes increases.

Distributed Hash Tables (DHT)

DHTs are a class of decentralized distributed systems that partition ownership of a set of keys among participating nodes, and can efficiently route messages to the unique owner of any given key. A hash function (such as SHA-1) accepts a variable length string of bytes and returns a one-way hash. The physical nodes are the hash buckets. Chord is a good example of a DHT algorithm.
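Key ownership under Chord’s rule can be sketched directly: hash the key onto the identifier ring, then hand it to its successor—the first node at or clockwise after that identifier. This is a toy with a truncated identifier space (Chord proper uses SHA-1’s full 160 bits); the function names are mine.

```python
import hashlib

RING_BITS = 16  # tiny ring for the demo; Chord uses 160-bit identifiers

def key_id(key: str) -> int:
    """Map a key onto the identifier ring with SHA-1 (truncated here)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** RING_BITS)

def owner(node_ids: list, key: str) -> int:
    """Chord's successor rule: a key belongs to the first node at or
    clockwise after the key's identifier, wrapping around the ring."""
    k = key_id(key)
    ring = sorted(node_ids)
    for n in ring:
        if n >= k:
            return n
    return ring[0]  # wrap around the ring
```

Because ownership is determined purely by the hash, any node can compute which node is responsible for a key without consulting a directory—that is the property the routing layer then exploits.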

Incidentally, DHT algorithms generally outperform flooding algorithms in the area of routing.

One advantage of the Chord DHT algorithm is that it guarantees receipt of a reply within O(log n) hops, which is a far better guarantee in terms of computational and time complexity than that offered by flooding techniques. That said, the Kademlia DHT algorithm does offer advantages over Chord depending on application requirements.
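The O(log n) bound comes from Chord’s finger tables: node n keeps pointers to the successors of n + 2^i, so each hop at least halves the remaining ring distance to the key. A compact sketch over a fixed toy ring (the node identifiers are the example from the Chord paper; function names and the recursive structure are illustrative, not Chord’s published API):

```python
M = 6                                           # identifier bits; ring size 64
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # sorted node identifiers

def succ_id(k):
    """First node at or clockwise after identifier k."""
    for n in RING:
        if n >= k:
            return n
    return RING[0]

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

def fingers(node):
    """finger[i] = successor(node + 2^i) for i = 0..M-1."""
    return [succ_id((node + 2 ** i) % 2 ** M) for i in range(M)]

def lookup(node, key, hops=0):
    """Greedy Chord routing: answer if the key falls between this node and
    its successor; otherwise jump to the closest preceding finger."""
    succ = RING[(RING.index(node) + 1) % len(RING)]
    if between(key, node, succ):
        return succ, hops + 1
    for f in reversed(fingers(node)):
        if f != node and f != key and between(f, node, key):
            return lookup(f, key, hops + 1)
    return succ, hops + 1
```

For example, looking up key 54 starting at node 8 jumps 8 → 42 → 51 and resolves to node 56 in three hops—within the log-scale bound for a ten-node ring.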

However, DHTs are not without problem:

  • Routing state maintenance algorithms generate overhead.
  • They do not work well when lots of nodes join and leave frequently (churn).
  • Their structured topology and routing algorithms make them frail and vulnerable to security attacks.
  • Network locality problem: nodes that are numerically close are not necessarily topologically close.

Strategies do exist to address these problems. Regardless, structured DHT algorithms provide the most interesting P2P infrastructures.

Semantic Routing

This is a relatively new area of routing as related to P2P infrastructures. Semantic routing is a method of routing that focuses more on the nature of the query being routed than on the network topology. Essentially, semantic routing improves on traditional routing by prioritizing nodes that have previously been good at providing information about the types of content referred to by the query.

Semantic routing differs fundamentally from other routing techniques because prospective nodes are selected based on another node’s confidence in their ability to respond correctly to a given query, irrespective of their position within the network. Neurogrid and Remindin are among the most interesting semantic routing systems, with Remindin incorporating a semantic routing algorithm developed with the intention of mimicking social networks.
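The core mechanic—prefer peers that answered similar queries well in the past—can be sketched as a per-(peer, topic) score with an exponentially weighted history. This is a hypothetical illustration of the idea, not the Neurogrid or Remindin algorithm; the class, decay factor, and topic model are all my own simplifications.

```python
from collections import defaultdict

class SemanticRouter:
    def __init__(self, peers):
        self.peers = list(peers)
        self.score = defaultdict(float)  # (peer, topic) -> success score

    def record(self, peer, topic, answered):
        """Update an exponentially weighted success history: recent
        answers count more than old ones."""
        old = self.score[(peer, topic)]
        self.score[(peer, topic)] = 0.8 * old + 0.2 * (1.0 if answered else 0.0)

    def route(self, topic, k=2):
        """Forward a query on this topic to the k best-scoring peers,
        regardless of where they sit in the network topology."""
        return sorted(self.peers,
                      key=lambda p: self.score[(p, topic)],
                      reverse=True)[:k]
```

Note that `route` never consults any notion of network distance—confidence earned from past answers is the only routing signal, which is exactly the departure from topology-driven routing described above.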

Summary

Overall, the most interesting of the DHT algorithms are Chord, Kademlia, and XRing. I hope to release .NET implementations of all three of these algorithms (and a few variations) in the near future. You’ll find them here.