We all know that Hadoop is a framework that enables parallel processing of huge datasets, commonly called Big Data. This is just a basic introduction/definition of Hadoop, and you can find tons of material on it on the internet. So, instead of discussing Hadoop further, we will focus on the data that Hadoop fetches from various sources using Flume.
So, what does Flume actually do?
Apache Flume is a reliable service used for efficiently collecting, aggregating, and moving large amounts of log data.
The next question on your mind would be, “How does Flume do this?”
Read on to find the answer to that question.
How does Flume help Hadoop get data from a live stream?
Flume allows the user to do the following:
- Stream data into Hadoop from multiple sources for analysis.
- Collect high-volume web logs in real time.
- Buffer data when the rate of incoming data exceeds the rate at which it can be written, thereby preventing data loss.
- Guarantee data delivery.
- Scale horizontally (by connecting commodity systems in parallel) to handle additional data volume.
Essential Components Involved in Getting Data from a Live-Streaming Source
There are three major components, namely the Source, the Channel, and the Sink, which are involved in ingesting data, moving data, and storing data, respectively.
Below is a breakdown of the parts applicable in this scenario (a minimal agent configuration sketch follows this list):
- Event – A single unit of data transported by Flume (typically a single log entry).
- Source – The entity through which data enters Flume. A Source either actively polls for data or passively waits for data to be delivered to it. A variety of Sources, such as log4j logs and syslogs, allow data to be collected.
- Sink – The unit that delivers the data to its destination. A variety of Sinks allow data to be streamed to a range of destinations. For example, the HDFS Sink writes events to HDFS.
- Channel – The connection between the Source and the Sink. The Source ingests Events into the Channel and the Sink drains the Channel.
- Agent – Any physical Java virtual machine running Flume; it hosts a collection of Sources, Sinks, and Channels.
- Client – The entity that produces and transmits Events to a Source operating within an Agent.
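To see how these pieces fit together, here is a minimal, hypothetical agent configuration; the agent, source, channel, and sink names are placeholders, and the netcat source and logger sink are standard Flume types used purely for illustration (the Twitter setup later in this article uses different ones):

# Declare the components of an agent named "agent1" (names are placeholders)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = snk1

# Source: listens for events on a TCP port (netcat source, for illustration)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# Channel: buffers events in memory between the Source and the Sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: writes events to the log (swap in an HDFS sink to land data in Hadoop)
agent1.sinks.snk1.type = logger

# Wire the Source and the Sink to the Channel
agent1.sources.src1.channels = ch1
agent1.sinks.snk1.channel = ch1

The key point is the wiring at the end: a Source can fan out to several Channels, but each Sink drains exactly one Channel.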
How to Take Input Data from Twitter to the HDFS?
Before we discuss how to take the input data from Twitter to HDFS, let's look at the necessary prerequisites:
- Twitter account
- Hadoop installed and started
Step-by-step Tutorial: Data Streaming from Twitter to HDFS
Step 1: Open a Twitter account
Step 2: Go to the following link and click on ‘create app’.
Step 3: Fill in the necessary details.
Step 4: Accept the agreement and click on ‘create your Twitter application’.
Step 5: Go to ‘Keys and Access Token’ tab.
Step 6: Copy the consumer key and the consumer secret.
Step 7: Scroll down further and click on ‘create my access token’.
You will now receive a message that says that you have successfully generated your application access token.
Step 8: Copy the Access Token and Access token Secret.
Step 9: Download the Flume tar file from the link below and extract it.
https://drive.google.com/drive/u/0/folders/0B1QaXx7tpw3SWkMwVFBkc3djNFk
Extract the Flume tar file and add the path of the extracted directory to your .bashrc.
NOTE: Use the exact path where the extracted directory actually exists.
Reload .bashrc with the source command.
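As a rough sketch, the .bashrc additions might look like the following; the extraction path and the Flume version in it are assumptions, so substitute the directory you actually extracted to:

# Assumed extraction location and version; change both to match your system
export FLUME_HOME=$HOME/apache-flume-1.6.0-bin
export PATH=$PATH:$FLUME_HOME/bin

Then reload the file so the current shell picks up the change:

source ~/.bashrc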
Step 10: Create a new file inside the 'conf' directory of the extracted Flume directory.
Step 11: Copy the contents from the link below and paste them into the newly created file.
https://drive.google.com/open?id=0B1QaXx7tpw3Sb3U4LW9SWlNidkk
Step 12: Replace the Twitter API keys in the file with the keys you generated in Step 6 and Step 8.
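If you cannot open the linked file, the configuration generally looks roughly like the sketch below. The agent name TwitterAgent matches the command used in Step 15 and the HDFS path matches Step 14, but the source class, the example keywords, and the channel/sink tuning values are assumptions based on the widely used Cloudera Twitter source example, so prefer the linked file where possible:

# Declare the agent's components (TwitterAgent must match the -n option in Step 15)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: the class name and the 'keywords' property are assumptions
# based on the commonly used Cloudera Twitter source
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key from Step 6>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret from Step 6>
TwitterAgent.sources.Twitter.accessToken = <access token from Step 8>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret from Step 8>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume

# Memory channel buffering events between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# HDFS sink writing into the directory created in Step 14
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

Whatever file you end up with, the agent name defined in it must match the name passed with -n in Step 15, and the keywords line controls which tweets the source asks Twitter for.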
Step 13: Open the terminal and check that all Hadoop daemons are running, using the 'jps' command.
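On a typical single-node setup you would expect output roughly like this (the process IDs are illustrative, and the YARN daemons appear only if YARN is running):

jps
2401 NameNode
2532 DataNode
2704 SecondaryNameNode
2856 ResourceManager
2987 NodeManager
3100 Jps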
Step 14: Using the below command, create a directory inside HDFS where Twitter data will be stored.
hadoop dfs -mkdir -p /user/flume/tweets
Step 15: To fetch data from Twitter, run the below command in the terminal.
flume-ng agent -n TwitterAgent -f <location of created/edited conf file>
This will start fetching data from Twitter and writing it to HDFS.
To stop fetching data, press 'Ctrl+C'. This will end the fetching process.
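A fuller form of the same command, which explicitly points Flume at its conf directory and prints the agent's logs to the console, can make troubleshooting easier; the FLUME_HOME variable and the file name below are placeholders from the earlier steps:

flume-ng agent --conf $FLUME_HOME/conf -f $FLUME_HOME/conf/<your conf file> -n TwitterAgent -Dflume.root.logger=INFO,console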
Step 16: To check the contents of the Tweets folder, use the following command:
hadoop dfs -ls /user/flume/tweets
Step 17: To see the data inside this file, type the following command:
hadoop dfs -cat /user/flume/tweets/<FlumeData file name>
With that, we have fetched live-streaming data from Twitter and loaded it into HDFS using Flume.
Don’t we need to configure the flume-env.sh file?
I followed all the specified steps, but I'm receiving this error:
No configuration found!!
Yes. In the flume-env.sh file, set JAVA_HOME according to your Java installation path.
Example:
JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24
Thanks for the step-by-step instructions; I appreciate your efforts. Is there an article that explains how to perform analysis after loading the data from Twitter? For example, how to understand the sentiment around these tweets, and what to consider and what to ignore.
I'm getting tweets, but the tweets don't match the keywords.