In this blog, we will discuss the basics of Spark's functionality and its installation.
Apache Spark is a cluster computing framework that can run on top of Hadoop and handle many different types of data. It is a one-stop solution to many problems, offering rich components for handling data, and, most importantly, it can be 10-20x faster than Hadoop's MapReduce. It attains this speed through its in-memory primitives: the data is cached in memory (RAM) and the computations are performed there rather than on disk.
Spark's component stack covers almost everything the Hadoop ecosystem offers. For example, we can perform both batch processing and real-time data processing in Spark without additional tools such as Kafka or Flume from the Hadoop ecosystem, because Spark has its own streaming engine called Spark Streaming.
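As a small taste of Spark Streaming (our own sketch, not from the original post), the Scala program below counts words arriving on a local socket; the host, port and batch interval are assumptions made for the example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: word counts over text read from a socket.
// Feed it with, for example, `nc -lk 9999` in another terminal.
object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))        // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)     // assumed host and port
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                          // print each batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}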
We can perform various kinds of processing with Spark:
- SQL operations: Spark has its own SQL engine, Spark SQL, which covers the features of both SQL and Hive (HiveQL).
- Machine learning: Spark ships with its machine learning library, MLlib, so machine learning can be done without a separate tool such as Apache Mahout.
- Graph processing: Spark performs graph processing through its GraphX component.
All of the above features are built into Spark.
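To give a feel for how these components share one programming model, here is a minimal Spark SQL sketch of our own, using the Spark 1.x SQLContext API; the Employee case class and its sample rows are invented for the example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch: run a SQL query over an in-memory collection with Spark SQL (Spark 1.x API).
object SparkSqlSketch {
  case class Employee(name: String, age: Int)               // hypothetical schema for the example

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSqlSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val employees = Seq(Employee("Anil", 30), Employee("Bala", 25))
    val df = sc.parallelize(employees).toDF()                // RDD -> DataFrame
    df.registerTempTable("employees")

    sqlContext.sql("SELECT name FROM employees WHERE age > 26").show()

    sc.stop()
  }
}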
Spark can run on different cluster managers, such as Hadoop YARN and Apache Mesos, and it comes with its own standalone scheduler to get started when no other framework is available. Spark also provides easy access to data across many storage systems, for example HDFS, HBase, MongoDB and Cassandra, and it can store data on the local file system as well.
Resilient Distributed Datasets
A Resilient Distributed Dataset (RDD) is a simple, immutable, distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In Spark, all operations are performed on RDDs.
Spark revolves around this concept of a resilient distributed dataset: a fault-tolerant collection of elements that can be operated on in parallel.
Let's now look at the features of Resilient Distributed Datasets:
- In Hadoop, data is stored as blocks on different DataNodes. In Spark, we instead make partitions of the RDDs and store them on the worker nodes (DataNodes), where they are computed on in parallel across all the nodes.
- In Hadoop, we need to replicate the data for fault recovery; in Spark, replication is not required because lost RDD partitions can be recomputed.
- RDDs load the data for us and are resilient, which means they can be recomputed.
- RDDs support two types of operations: transformations, which create a new dataset from an existing RDD, and actions, which return a value to the driver program after performing a computation on the dataset (a short sketch of both follows this list).
- RDDs keep track of the transformations applied to them and can be checkpointed periodically. If a node fails, the lost RDD partitions can be rebuilt on the other nodes, in parallel.
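To make the transformation/action distinction concrete, here is a small sketch of our own for the spark-shell (it assumes the SparkContext named sc that spark-shell provides):

// Transformations are lazy: no computation happens when these lines run.
val lines = sc.parallelize(Seq("spark is fast", "spark runs in memory"))
val words = lines.flatMap(_.split(" "))        // transformation: split lines into words
val sparkWords = words.filter(_ == "spark")    // transformation: keep only the word "spark"

// Actions trigger the computation and return a value to the driver program.
println(sparkWords.count())                    // action: prints 2
println(words.collect().mkString(", "))        // action: brings all the words to the driver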
RDDs can be created in two different ways (both are shown in the sketch after this list):
- By referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
- By parallelizing a collection of objects (a list or a set) in the driver program.
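In the spark-shell, the two approaches look like this (our own sketch; the HDFS path is only a placeholder):

// 1. Referencing a dataset in an external storage system, here a file in HDFS (placeholder path).
val fileRDD = sc.textFile("hdfs://localhost:9000/user/acadgild/input.txt")

// 2. Parallelizing a collection of objects in the driver program.
val listRDD = sc.parallelize(List(1, 2, 3, 4, 5))

println(listRDD.count())    // action: prints 5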
Life cycle of a Spark program:
- Some input RDDs are created from external data or by parallelizing a collection of objects in the driver program.
- These RDDs are lazily transformed into new RDDs using transformations such as filter() or map().
- Spark caches any intermediate RDDs that will need to be reused.
- Actions such as count() and collect() are launched to kick off a parallel computation, which Spark then optimizes and executes.
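Putting these steps together, a short end-to-end sketch for the spark-shell might look like this (our own example, assuming the sc SparkContext that spark-shell provides; the sample log lines are made up):

// 1. Create an input RDD (here from a parallelized collection; it could equally come from a file).
val logs = sc.parallelize(Seq("INFO starting", "ERROR disk full", "INFO done", "ERROR timeout"))

// 2. Lazily transform it into new RDDs.
val errors = logs.filter(_.startsWith("ERROR"))
val messages = errors.map(_.stripPrefix("ERROR").trim)

// 3. Cache an intermediate RDD that more than one action will reuse.
messages.cache()

// 4. Launch actions, which kick off the parallel computation.
println(messages.count())              // prints 2
messages.collect().foreach(println)    // prints "disk full" and "timeout"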
Let's now discuss the steps to install Spark on your cluster:
Step-by-step process to install Spark
Before installing Spark, Scala needs to be installed on the system. We need to follow the steps below to install Scala.
1. Open the terminal in your CentOS system. To download Scala, type the command below:
wget http://downloads.typesafe.com/scala/2.11.1/scala-2.11.1.tgz
2. Extract the downloaded tar file by using the command:
tar -xvf scala-2.11.1.tgz
After extracting, specify the path of Scala in the .bashrc file (typically by exporting SCALA_HOME as the extracted Scala directory and adding its bin directory to PATH).
After setting the path, we need to save the file and type the command below so that the change takes effect:
source .bashrc
The above command wraps up the Scala installation; we then need to install Spark.
To install Spark on CentOS, we need to follow the steps below to download and install a single-node Spark cluster.
1. Open the browser, go to the Spark download link and download spark-1.5.1-bin-hadoop2.6.tgz. The file will be downloaded into the Downloads folder.
Go to the Downloads folder and untar the downloaded file using the command below:
tar -xvf spark-1.5.1-bin-hadoop2.6.tgz
After untarring the file, we need to move it to the home folder using the command below:
sudo mv spark-1.5.1-bin-hadoop2.6 /home/acadgild
Now that the folder has been moved to the home directory, we need to update the path for Spark in .bashrc in the same way as we did for Scala (typically by exporting SPARK_HOME as the Spark directory and adding its bin directory to PATH). After adding the path for Spark, type the command source .bashrc so that the change takes effect.
Make a folder named ‘work’ in HOME using the command below:
mkdir work
Inside the work folder, we need to make another folder named ‘sparkdata’ using the command:
mkdir sparkdata
We need to give 777 permissions to the sparkdata folder using the command:
chmod 777 $HOME/work/sparkdata
Now move into the conf directory of the Spark folder using the commands below:
cd spark-1.5.1-bin-hadoop2.6
cd conf
Type the command ls to see the files inside the conf folder. There will be a file named spark-env.sh.template; we need to copy that file to a file named spark-env.sh using the command below:
cp spark-env.sh.template spark-env.sh
Edit the spark-env.sh file using the command below:
gedit spark-env.sh
and add the configuration, typically the paths of Java and Scala (JAVA_HOME and SCALA_HOME) along with any master and worker settings your setup needs.
Note: Make sure that you give the paths of Java and Scala correctly. After editing, save and close the file.
Let's follow the steps below to start the Spark single-node cluster. Move to the sbin directory of the Spark folder using the command below:
cd spark-1.5.1-bin-hadoop2.6/sbin
Inside sbin, type the command below to start the Master and Worker daemons:
./start-all.sh
Now the Spark single-node cluster will start with one Master and two Workers.
You can check whether the cluster is running by using the command below:
jps
If the Master and Worker daemons are listed, then you have successfully started the Spark single-node cluster.
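As an optional smoke test (our own suggestion, not part of the original steps), you can open a spark-shell against the standalone master and run a tiny job; the master URL below is a placeholder, and the real one is printed in the master's log and on its web UI at port 8080:

// Launch the shell from the Spark folder (placeholder master URL):
//   ./bin/spark-shell --master spark://localhost:7077

val nums = sc.parallelize(1 to 100)
println(nums.sum())    // should print 5050.0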
We hope this blog helped you get a basic understanding of Spark and of the ways to install it.
Visit our website www.acadgild.com/blog for more blogs on Big data and other technologies.