With the release of Apache Spark 1.6.1, I thought I’d share my experience with the automated creation of a Spark cluster in Amazon Web Services (AWS) running Ubuntu Server using Packer and Terraform.

Based on my experience, my preference is to follow these steps in the overall process:

  • Build a custom Apache Spark distribution against the latest version of Hadoop 2.7.x.
  • Using Packer, generate a new Amazon Machine Image (AMI) containing the correct and most recent versions of the prerequisites used by Apache Spark.
  • Using Terraform, create (and manage) a new Apache Spark cluster in AWS.

Let’s get started.

Building a Custom Apache Spark Distribution

Apache Spark is a fast and general engine for large-scale data processing. I’ve used it quite extensively over the past 18 months in deep machine learning projects and prefer it when compared with other technologies in this space.

For those new to Apache Spark, it provides intrinsic support for reading from and writing to Hadoop Sequence Files. When using AWS, I prefer to durably store the results of Apache Spark jobs in Amazon Simple Storage Service (S3). In particular, Amazon S3 represents a fast and economical storage container with native support for Hadoop Sequence Files.
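
To make that concrete, here is a minimal sketch of writing and reading a Hadoop Sequence File from Spark; the application name and the /tmp/pairs-demo path are illustrative placeholders of my choosing, not values tied to any particular deployment.

import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SequenceFileDemo"))

    // Sequence Files hold key/value records, so start from a pair RDD.
    val pairs = sc.parallelize(Seq(("alpha", 1L), ("beta", 2L)))

    // Write the pairs out as a Hadoop Sequence File...
    pairs.saveAsSequenceFile("/tmp/pairs-demo")

    // ...and read them back, declaring the key and value types.
    sc.sequenceFile[String, Long]("/tmp/pairs-demo")
      .collect()
      .foreach(println)

    sc.stop()
  }
}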

Hadoop provides multiple file system clients for reading from and writing to Amazon S3. These are:

  • S3 Block File System. The first-generation file system, represented by the URI scheme “s3”. An advantage of this file system is that files can be larger than 5 GB; the disadvantage is that they are not interoperable with other S3 tools.
  • S3 Native File System. The second-generation file system, represented by the URI scheme “s3n”. The advantage of this file system is that one can access files on S3 that were written with other tools and, conversely, other tools can access files written using Hadoop. The disadvantage is that the 5 GB file size limit still applies.
  • S3A Native File System. The third, and current, generation file system, represented by the URI scheme “s3a”. This is the fully backwards-compatible successor to the S3 Native (s3n) file system. Its advantages include support for larger files (the 5 GB size limit was lifted) and higher performance for read and write operations. A usage sketch follows this list.
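
To illustrate the difference the scheme makes in practice, the sketch below (written for spark-shell, where sc is predefined) points the S3A client at a bucket. Note that my-bucket is a placeholder; fs.s3a.access.key and fs.s3a.secret.key are Hadoop 2.7’s standard S3A credential properties, and on EC2 an IAM instance profile removes the need to set them explicitly.

// Hadoop 2.7 credential properties for the S3A client; unnecessary when
// the instance runs under an IAM instance profile.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// The "s3a" URI scheme selects the third-generation file system client.
val pairs = sc.parallelize(Seq(("alpha", 1L), ("beta", 2L)))
pairs.saveAsSequenceFile("s3a://my-bucket/demo/pairs")

// Anything the credentials can read is addressable the same way.
println(sc.sequenceFile[String, Long]("s3a://my-bucket/demo/pairs").count())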

For reference, the S3A file system was introduced in Hadoop 2.6.0 and shipped with known issues. Even though two critical S3 issues were fixed in Hadoop 2.6.1, I prefer to work with Hadoop 2.7 given its significant enhancements to Hadoop Common, HDFS, YARN, and MapReduce. Consequently, I generate a custom Apache Spark distribution with the following attributes:

  • Oracle Java 8 (8u74)
  • Scala 2.11.7
  • Spark 1.6.1
  • Hadoop 2.7.2

Start by downloading the Spark 1.6.1 sources from the Apache Spark downloads page and decompressing the resulting file, spark-1.6.1.tgz.

Open the Maven POM file in the root of the distribution (pom.xml) and add the following Hadoop 2.7 profile near the existing profiles around line 2,500. You can just search for “hadoop-2.6”.

<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.2</hadoop.version>
    <jets3t.version>0.9.4</jets3t.version>
    <zookeeper.version>3.4.8</zookeeper.version>
    <curator.version>3.1.0</curator.version>
  </properties>
</profile>

Next, add the Hadoop AWS jar, which packages the S3A file system implementation, as a dependency just before the “hadoop-client” dependency on line 870.

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
  <scope>${hadoop.deps.scope}</scope>
</dependency>

Add a dependency on the AWS Java SDK just after line 458. I recommend specifying the latest version of the AWS Java SDK, which is 1.10.62 at the time of writing.

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.10.62</version>
</dependency>

Save the POM file and produce the Spark distribution compiled with Scala 2.11. From the sources root, execute the following commands.

./dev/change-scala-version.sh 2.11
./make-distribution.sh --name custom --tgz -Phadoop-2.7 -Pyarn -Dscala-2.11

If you prefer to remain with Scala 2.10, omit the -Dscala-2.11 flag. If you’d like more information on usage, run the following command.

./make-distribution.sh --help

Upon successful completion, the distribution creation process generates a compressed archive file named spark-1.6.1-bin-custom.tgz.
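
As a quick sanity check, unpack the archive and start spark-shell from its bin directory; the expressions below use standard Spark, Hadoop, and Scala APIs and should report the versions baked into the build.

// Run inside spark-shell, where sc is predefined.
sc.version                                        // expect: 1.6.1
org.apache.hadoop.util.VersionInfo.getVersion     // expect: 2.7.2
scala.util.Properties.versionString               // expect: version 2.11.7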

Machine Image and Cluster Management

Scripts for machine image and cluster management are provided by Six After; navigate to their GitHub repository to get started. As a prerequisite, I encourage you to generate your custom Spark distribution as described in this post before working with their cluster management scripts.

Enjoy.