Michael Primeaux

Parallel and Distributed Systems


The Convergence of Mobile, Cloud, Social, and Big Data

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult. – C.A.R. Hoare

Mobile, cloud, social, and non-relational storage technologies (big data) represent agility for many businesses. In particular, the convergence of these technologies equips businesses with efficient methods to attribute key performance indicators, optimize direction, and, equally important, allocate changes in business processes, all in real time and for a vast array of incoming data formats. Individually, each of these technology areas is immense, yet each is an enabler for the others; and so in early 2008, I began to focus my attention, application of algorithms, and research on the convergence of these technologies.

The technology shifts created by mobile, cloud, social, and big data have changed, and continue to change drastically, the business landscape. The convergence of these technologies began years ago and has matured considerably since 2008. Social media has contributed to this significantly. Bigtable and, subsequently, HBase were born out of need within Google and the open source community, respectively. The Apache Cassandra project was born out of need within Facebook. We’re able to learn interesting statistics: the half-life of links shared on Twitter is approximately 2.8 hours, sports stories spread the fastest of any topic, and there’s a material difference between what people share and what they click. From a sales and marketing perspective, we can use big data to better leverage social channels to rapidly detect changes in buyer activities and preferences, identify market shifts, monitor competitive activity, and identify business opportunities. Social media is analyzed to understand behavior, although it’s important to remember that, as with all population sampling, it provides only a specialized subset of opinions about a particular topic.

Big Data and Analytics

Big Data is primarily classified by the use of non-relational durable storage solutions that embrace the following characteristics of information technology.

  • Volume. As of 2012, about 2.5 exabytes of data are created each day, and that number is doubling every 40 months or so. More data cross the internet every second than were stored in the entire internet just 20 years ago. For example, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions.
  • Velocity. For many applications, the speed of data creation is more important than the volume. Real-time or nearly real-time information makes it possible for a company to be much more agile than its competitors. Just consider the speed at which information is collected from the sheer number of mobile devices active each and every day.
  • Variety. Information now flows from a plethora of sources in the form of messages, updates, and images from social networks; electronic communication; instrumented machinery; readings from sensors; GPS signals from mobile phones; point of sale terminals; and so on. Structured databases are not well suited to storing and processing information from these disparate sources.

…and it’s only becoming more challenging. Three times per second, each and every day, we produce the equivalent of the amount of data that the Library of Congress has in its entire print collection. Most of the data produced is irrelevant noise. Clearly, the challenge is in filtering this noise and distilling the remaining information into meaningful results. This vast amount of information is essentially useless without analytics. Timely analytics. Accurate analytics. Technical solutions in this space must be able to traverse multiple data centers, the cloud, and geographical zones. Accurate, real-time analysis lets those who want to pivot confidently in response to key performance indicators (KPIs) answer crucial business questions.
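To put the growth rate quoted under Volume in perspective, here is a back-of-the-envelope sketch. It assumes the growth is a smooth exponential that doubles every 40 months from the 2012 baseline of 2.5 exabytes per day; that smoothness is an assumption, and the function name `projected_daily_exabytes` is made up for illustration.

```python
BASELINE_EXABYTES_PER_DAY = 2.5   # 2012 figure quoted above
DOUBLING_PERIOD_MONTHS = 40       # "doubling every 40 months or so"


def projected_daily_exabytes(months_after_2012: float) -> float:
    """Estimate daily data creation, assuming smooth exponential growth."""
    doublings = months_after_2012 / DOUBLING_PERIOD_MONTHS
    return BASELINE_EXABYTES_PER_DAY * 2 ** doublings


for months in (0, 40, 80, 120):
    estimate = projected_daily_exabytes(months)
    print(f"{months:>3} months after 2012: {estimate:.1f} EB/day")
```

Under this assumption the daily volume grows eightfold within ten years, which is why filtering and timely analytics matter more than raw storage.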

Before you delve into the non-relational storage arena, it’s important to understand the CAP theorem. The theorem began as a conjecture made by University of California, Berkeley computer scientist Eric Brewer at the 2000 Symposium on Principles of Distributed Computing (PODC). In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer’s conjecture, rendering it a theorem.

Formally, in theoretical computer science, the CAP theorem states that it is impossible for a distributed computer system to simultaneously satisfy all three of the following characteristics:

  1. Consistency. All nodes possess the same data at the same time.
  2. Availability. A guarantee that every request receives a response, so the system is available to fully operate at all times.
  3. Partition tolerance. The system continues to operate despite arbitrary message loss or failure of part of the system.

In May 2012, Brewer clarified some of his positions on why the oft-used “two out of three” concept can be misleading or misapplied.
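The trade-off can be illustrated with a toy sketch: when the network between two replicas splits, a store either refuses writes (and stays consistent) or accepts them on one side (and stays available, at the cost of divergence). The class `TwoReplicaStore` and its "CP"/"AP" mode labels are invented for this illustration and do not correspond to any real database API.

```python
class TwoReplicaStore:
    """Toy store with two replicas, 'a' and 'b'.

    mode="CP" refuses writes during a partition (keeps consistency);
    mode="AP" accepts them on one side only (keeps availability).
    """

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.data = {"a": None, "b": None}

    def write(self, replica: str, value) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            self.data["a"] = self.data["b"] = value
            return True
        if self.mode == "CP":
            # The other replica is unreachable, so refuse rather than diverge.
            return False
        # AP: stay available, accept the write locally, let replicas diverge.
        self.data[replica] = value
        return True

    def consistent(self) -> bool:
        return self.data["a"] == self.data["b"]


for mode in ("CP", "AP"):
    store = TwoReplicaStore(mode)
    store.write("a", "v1")       # replicated to both sides
    store.partitioned = True     # the network splits
    accepted = store.write("a", "v2")
    print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

A real system would also have to reconcile the diverged replicas once the partition heals, which this sketch leaves out.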

Broadly speaking, durable storage technologies fall into two categories: relational and non-relational. Relational databases have been widely used for the past 40 years and are still quite widely used today. Microsoft SQL Server, IBM DB2, Oracle, and MySQL are all examples of relational database implementations. Suffice it to say most non-relational database system architectures favor partition tolerance and availability over strong consistency. Other implementations, such as Cassandra, offer per-operation control over which characteristics of the CAP theorem you would like to favor. My favorite distribution of Cassandra comes from DataStax, which combines Cassandra, Hadoop, and Solr into a single packaged distribution. Their enterprise edition is free to developers, which is convenient.
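One common way to reason about per-operation tuning in replicated stores is quorum arithmetic: with N replicas, a read that contacts R of them is guaranteed to overlap the latest write acknowledged by W of them whenever R + W > N. The sketch below is plain Python, not Cassandra’s API, and the function name is hypothetical.

```python
def read_sees_latest_write(replicas: int, write_acks: int, read_acks: int) -> bool:
    """Return True when read and write replica sets must overlap (R + W > N)."""
    if not (1 <= write_acks <= replicas and 1 <= read_acks <= replicas):
        raise ValueError("acks must be between 1 and the replica count")
    return read_acks + write_acks > replicas


N = 3
quorum = N // 2 + 1  # majority of replicas

# Majority writes plus majority reads always overlap in at least one replica.
print(read_sees_latest_write(N, quorum, quorum))
# One ack each way is fastest, but a read may miss the latest write.
print(read_sees_latest_write(N, 1, 1))
```

Lower acknowledgement counts favor availability and latency; higher counts favor consistency. Choosing them per operation is what makes the trade-off tunable.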

With respect to non-relational databases, one must consider a few orientations.

  • Object. Object databases work to avoid the object-relational impedance mismatch that occurs when trying to use a relational database under an application written in an object-oriented programming language. These databases store information in terms of the objects themselves and not in terms of columns and rows.
  • XML. XML databases are optimized specifically for working with XML and are, essentially, a special form of a document-oriented database. XML databases have one core function: to allow for efficient storage and query of XML documents. Examples of XML databases are Tamino (Software AG), eXist, Berkeley XML DB (Oracle), MarkLogic Server, and Sedna.
  • Document. The basic unit of storage for a document database is a complete document. CouchDB and MongoDB are examples of document databases. This storage model offers several key advantages over a relational model:
    • A document can store any number of fields of any length, and each field can store multiple values. A relational model requires all fields to be present for every record.
    • An empty (or null) value (and the related field) does not need to be stored, which can save space. On a related note, many relational databases currently offer support for sparse tables.
    • Full text search capabilities are typically an intrinsic feature.
    • Security can be assigned at an individual document level.
    • A predefined schema is not required, whereas a relational database requires a schema to be defined before any storage or retrieval operation can be performed.
  • Key-Value. A key-value database stores values based on keys and is classically represented as a distributed hash table (DHT). In a relational model, we tend to first consider the tables that our domain requires, then think of how we can normalize the tables to avoid duplicate data. In a key-value store, however, typically we don’t define a schema. Examples of key-value databases are Redis, Amazon Dynamo, and Project Voldemort.
  • Graph. Graph databases differ from other non-relational database offerings (such as key-value stores) in that they represent the edge intrinsically. Instead of tables and columns, a graph database uses three basic constructs to represent data: nodes, edges, and properties. A node is a standalone, independent object. An edge is an object that depends on the existence of two nodes. Properties represent attributes of a node. Examples of graph databases are FlockDB (open sourced by Twitter in April 2010) and Neo4j.
  • Columnar. A columnar database organizes data around columns instead of rows. This subtle difference optimizes some workloads, in particular data warehouse and analytics applications that compute aggregate values over very large sets of data. Databases that follow this orientation are commonly represented as distributed hash tables (DHTs). Examples of columnar databases are Cassandra, Google Bigtable, HBase, and Hypertable.

It’s worth noting that a few databases represent hybrid orientations; for example, Riak, which is based on Amazon’s Dynamo, is both a document-oriented and a key-value implementation.
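To make the orientations above more concrete, the following sketch shapes a small, made-up data set four ways using only Python built-ins. It illustrates the data models, not the storage engines; every field name and identifier is invented.

```python
# Document: a self-contained record; fields may be absent or multi-valued.
document = {"_id": "u1", "name": "Ada", "tags": ["admin", "beta"]}

# Key-value: an opaque value looked up by key, with no schema for the value.
key_value = {"user:u1": b'{"name": "Ada"}'}

# Graph: nodes with properties, plus edges that each depend on two nodes.
nodes = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}
edges = [("u1", "follows", "u2")]


def followers_of(node_id):
    """Walk the edges directly instead of joining tables."""
    return [src for src, label, dst in edges
            if label == "follows" and dst == node_id]


# Columnar: one array per column, so an aggregate reads only that column.
columns = {
    "user_id": ["u1", "u2", "u3"],
    "order_total": [20.0, 35.5, 4.5],
}
total = sum(columns["order_total"])

print(followers_of("u2"))
print(total)
```

The document and key-value shapes differ mainly in whether the store can see inside the value; the graph shape makes relationships first-class; the columnar shape keeps each attribute contiguous for aggregation.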

Just as with relational databases, planning is key when launching a non-relational database project, though such projects require a slightly different software engineering approach. In a relational model, we tend to consider our data model first and then build queries to satisfy application requirements. In a non-relational model, even though the database may be schema-less, we must first understand an application’s queries, since they influence the database and table layout.
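A minimal sketch of this query-first approach, using in-memory dictionaries as stand-ins for tables: each anticipated query gets its own layout, keyed by whatever that query looks up. The two queries and all field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical order events; field names are invented for this sketch.
orders = [
    {"order_id": 1, "customer": "c1", "day": "2013-04-01", "total": 20.0},
    {"order_id": 2, "customer": "c2", "day": "2013-04-01", "total": 35.5},
    {"order_id": 3, "customer": "c1", "day": "2013-04-02", "total": 4.5},
]

# Query 1: "show every order for a customer" -> a table keyed by customer.
orders_by_customer = defaultdict(list)
# Query 2: "what was revenue on a given day?" -> a table keyed by day.
revenue_by_day = defaultdict(float)

for order in orders:
    # The same fact is written twice, once per query it must serve.
    orders_by_customer[order["customer"]].append(order["order_id"])
    revenue_by_day[order["day"]] += order["total"]

print(orders_by_customer["c1"])
print(revenue_by_day["2013-04-01"])
```

The duplication that normalization would remove is deliberate here: each read becomes a single key lookup, at the cost of extra writes.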

Even companies with a large investment in relational-based analytics tools can leverage non-relational storage technologies; even so, the definition of a taxonomy (or multiple taxonomies) for analytics processing is of paramount importance. Typically, companies that are heavily invested in relational technology as their primary systems of record employ extract, transform, and load (ETL) processes to periodically restructure information, based on defined taxonomies, into non-relational storage products. They are then able to take advantage of more efficient, real-time processing using massively parallel analytics technologies such as Hadoop and MapReduce.
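For readers unfamiliar with the MapReduce model, here is a single-process sketch of its three stages (map, shuffle, reduce) in plain Python. It does not use Hadoop’s API; the function names and sample records are invented, and a real cluster would run the map and reduce stages in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs; here, one pair per record."""
    yield record["category"], record["amount"]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


# Hypothetical rows, as they might arrive from an ETL extract.
records = [
    {"category": "mobile", "amount": 3},
    {"category": "cloud", "amount": 5},
    {"category": "mobile", "amount": 4},
]

pairs = (pair for record in records for pair in map_phase(record))
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(totals)
```

Because each map call sees one record and each reduce call sees one key, both stages can be spread across machines without coordination, which is where the parallelism comes from.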

From a business point of view, capturing data without analytics in mind is generally of little value. Consider the goals, and what key data sets are needed to reach those goals. More important is to consider the set of business questions you are attempting to answer and, to that point, I strongly recommend you bring in a data scientist to perform an audit of your existing data architecture.

Mobile

Business-to-consumer (B2C) mobile applications impose different requirements than those of business-to-business (B2B) and corporate applications, not only in terms of marketing but also in the need for an extremely efficient and aesthetically pleasing user experience (UX) and more stringent software engineering practices. It’s very difficult to recover from poor reviews and customer feedback.

In the mobile space, three companies represent the majority globally: Apple, Google, and Microsoft. Apple and Google make up 96% of the entire mobile space. Microsoft holds less than 1%, and the remainder (just over 3%) belongs to others such as RIM and Symbian. If we add tablets to the equation (so not just mobile phones), then as of April 2013 Apple iOS represents 59.04%, Google Android represents 26.02%, and Microsoft Windows Phone represents 1.14%; see Net Marketshare.

Across all mobile operating systems, if you observe how people use smartphones and look beyond telephone calls, email, and texting, you’ll see that native applications dominate. Users spend, on average, 82% of their mobile minutes with native applications and 18% with web browsers. They download approximately 50 to 75 applications to their phones (out of more than a million available) but regularly use about 10.

As another data point, mobile applications broadly fall into six categories:

  1. Games and Entertainment
  2. Social Networks (Facebook, Twitter, Pinterest, Path, Tumblr, etc.)
  3. Utilities (maps, clocks, calendars, cameras, email, etc.)
  4. Discovery (Yelp, GrubHub, etc.)
  5. Education (AccelaStudy, Contig, Flashcard Champion, etc.)
  6. Brands (Nike, Red Bull, etc.)

Regardless of your application’s category, native mobile applications offer the absolute best user experience in many respects: frame rates, offline support, and access to the device’s hardware, to name but a few. Contrary to what some may espouse, mobile devices are not well connected, even within major United States and European cities, where connectivity is commonly sporadic and network latency is commonly high. Consider the many scenarios in which consumers pass in and out of good network connectivity while on mass transit systems; even mobile applications that “think” they are connected to a Wi-Fi network may really be connected to a personal hotspot that is in turn connected to the internet over a cellular channel. It is for these reasons that I strongly discourage the creation of hybrid and HTML5 mobile applications. Granted, hybrid and HTML5 mobile applications may prove satisfactory in some corporate settings, but if you want your application to be among the 10 regularly used ones, then write native applications; it’s that simple. And for the record, I am not a fan of “4GL” integrated development environments (IDEs) that generate code for multiple mobile platforms. Perhaps I am a purist, but I prefer to have direct access to and control over the core mobile platform rather than having to interact with process and technical abstractions. Your consumers do appreciate this level of attention to detail and UX refinement.

If you write applications in the consumer space, or if you’ve purchased a smartphone or tablet recently that doesn’t have an Apple logo on it, then it’s likely you already know of Android. Android is an open-source, Linux-based operating system (OS) launched approximately 5 years ago as of the time of this writing. The Android operating system dates back to 2003, when Rich Miner, Nick Sears, Chris White, and Andy Rubin began working on a mobile operating system, which Google would eventually purchase in 2005.

For years, the dominant OS was Symbian, which has its roots in Psion’s EPOC OS. By 2007, Symbian accounted for over 63% of the worldwide smartphone market. However, by 2008, when it was purchased by Nokia, that number had dropped to 52%. At that time, the closest competitors were Research in Motion’s BlackBerry OS (16.6%), Microsoft Windows Mobile (11.8%), and the still very new iOS (8.2%). With Nokia’s OS poorly equipped to compete with iOS, RIM unwilling to leave its business-focused comfort zone, and Microsoft’s mobile attempts clearly failing, it was Google’s purchase of Android Inc. in 2005 that would eventually provide competition. The Linux-based, open-source mobile operating system found its way to market with the release of the T-Mobile G1 in 2008. The G1 sold well, which lifted Android to a 3.9% share of the smartphone market in 2009. At this point, iOS had nearly doubled its share to 14.4%, RIM had grown to just under 20%, Windows Mobile had dropped to 8.7%, and Symbian had dropped to 46.9%. Today, Android is used in 61% of the smartphones (excluding tablets) sold worldwide, which is primarily attributed to the fact that it can be found both in high-end handsets (targeting the Apple iPhone) and in lower-budget smartphones (where Symbian once dominated). Apple iOS commands 20.5% of the worldwide smartphone market, which is impressive considering it’s only included in Apple’s own high-end phones. RIM is now down to less than 1.5%.

In general, it’s easier to get an application released onto Android, but from technical engineering and business points of view, Android does have significant disadvantages compared with Apple’s iOS:

  • Google Play is not nearly as mature as Apple’s App Store
  • People are reluctant to pay for Android applications
  • Piracy of applications is widespread
  • Security concerns and malware
  • Little to no enterprise penetration
  • OS upgrades are cumbersome, slow, and generally require intervention by the carrier

However, one can’t ignore Android these days, particularly in the consumer space. Corporate and business applications are a different matter altogether, and I’ll save that discussion for a later point in time.

Cloud

So, where does cloud technology fit into all of this? Everywhere. Cloud is the cohesive technology between mobile, social, and big data. Over the past nearly 6 years, and after nearly 500 mobile applications shipped, I can think of exactly one mobile application I’ve written that did not involve hosted cloud services; and cloud providers are only getting better. Cloud providers such as Amazon Web Services (AWS) and Microsoft Windows Azure provide full-featured platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS) offerings that complement solutions spanning the mobile, social, and non-relational database solution domains. Other providers, such as Linode, offer competitive IaaS offerings with a very approachable cost model. Discussing the technical offering of each cloud provider is not in scope for this post; perhaps at a later point in time.

The use of cloud computing technologies offers companies an opportunity to realize cost savings that would otherwise not be possible with on-premises hardware. For example, it’s not uncommon to develop scripts that go from mere credentials to a fully functioning application stack in a matter of minutes, only to tear it down once the day is over. Patterns such as this are quite attractive to stakeholders.
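The create-then-tear-down pattern might be sketched as follows. No real cloud SDK is used: `FakeCloudProvider` and its `create_stack` and `delete_stack` methods are hypothetical stand-ins for whatever provisioning calls your provider offers. The point is only that teardown is guaranteed, even if the work in between fails.

```python
from contextlib import contextmanager


class FakeCloudProvider:
    """Stand-in for a real provider SDK; every method here is hypothetical."""

    def __init__(self):
        self.running = set()

    def create_stack(self, name):
        self.running.add(name)

    def delete_stack(self, name):
        self.running.discard(name)


@contextmanager
def ephemeral_environment(provider, name):
    """Provision a stack, hand it to the caller, and always tear it down."""
    provider.create_stack(name)
    try:
        yield name
    finally:
        # Runs on success and on error, so nothing keeps billing overnight.
        provider.delete_stack(name)


provider = FakeCloudProvider()
with ephemeral_environment(provider, "nightly-test"):
    print("during:", sorted(provider.running))
print("after teardown:", sorted(provider.running))
```

Wrapping provisioning in a context manager keeps the cost-saving behavior in one place instead of relying on someone remembering to run a cleanup script.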

Conclusion

For organizations seeking a fast, easy, and cost-effective way to manage business operations, the convergence of mobile, cloud, social, and non-relational storage technologies equips parallel and distributed system designers and business stakeholders with a unique opportunity and insight into new patterns that would otherwise require a much higher degree of cooperation and effort among disparate teams. These technologies also facilitate pattern-based strategy architectures, which are attractive to both technical architects and business stakeholders. A pattern-based strategy identifies signals, models them for their impact, and then adapts to the business process of the organization. This contrasts with current strategies, which are defined by making hypotheses, performing analyses, and reaching conclusions that are often wrong because implementing the change and pivoting occur at a slower pace than the tracked market trends.

Good architectures don’t just happen; they are designed.

References:

  • Andrew McAfee and Erik Brynjolfsson. “Big Data: The Management Revolution”. Harvard Business Review, October 2012.