In our previous post, we saw how to perform sentiment analysis on Twitter data using Pig. In this post, we will discuss how to find the most popular hashtags in tweets.
You can refer to this post to know how to get tweets from Twitter using Flume.
We have collected the tweets and stored them in HDFS at the following location: /user/flume/tweets/
The data from Twitter is in JSON format, so a Pig JsonLoader is required to load it into Pig. You need to download the jars required for the JsonLoader from the below link:
Next, register the downloaded jars in Pig by using the following commands:
REGISTER '/home/kiran/Desktop/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/home/kiran/Desktop/elephant-bird-pig-4.1.jar';
REGISTER '/home/kiran/Desktop/json-simple-1.1.1.jar';
You can refer to the below screenshot for the same.
Note: You need to provide the path of the jar file accordingly.
After registering the required jars, we can now write a Pig script to find the most popular hashtags.
Below is a sample tweet collected for this purpose:
{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"https://pbs.twimg.com/profile_banners/197865769/1436198000","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 
2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"https://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"http://pbs.twimg.com/profile_images/664486535040000000/GOjDUiuK_normal.jpg","listed_count":50,"is_translator":false}} |
The tweets are in nested JSON format and contain map data types. We need to load the tweets using a JsonLoader that supports maps, so we are using the elephant-bird JsonLoader to load the tweets.
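To make the nested structure concrete, here is a small Python sketch (for illustration only; the tutorial itself stays in Pig) using a trimmed-down version of the sample tweet. It shows that entities is itself a map, and the hashtags inside it form a list of maps, which is exactly why a map-aware loader is needed:

```python
import json

# A trimmed-down tweet, mirroring the nested structure of the sample above
raw = '''{"id": 689085590822891521,
          "text": "@filmfan hey its time for you guys follow @acadgild To #AchieveMore",
          "entities": {"hashtags": [{"text": "AchieveMore", "indices": [56, 68]}],
                       "user_mentions": []}}'''

tweet = json.loads(raw)

# 'entities' is a map (dict), and 'hashtags' inside it is a list of maps --
# this nesting is why a map-aware loader such as elephant-bird is needed in Pig
entities = tweet["entities"]
tags = [h["text"] for h in entities["hashtags"]]
print(tags)  # ['AchieveMore']
```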
Below is the first Pig statement that is required to load the tweets into Pig:
load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
When we dump the above relation, we can see that all the tweets have been loaded successfully.
Now, let’s extract the id and the hashtags from the above tweets. The Pig statement for doing this is shown below:
extract_details = FOREACH load_tweets GENERATE FLATTEN(myMap#'entities') AS (m:map[]), FLATTEN(myMap#'id') AS id;
In the tweet, the hashtags are present in the map object entities, as can be seen in the below image.
Since the hashtags are inside the map entities, we have extracted the entities as map[ ] data type. The schema of the relation extract_details can be viewed using the below command:
describe extract_details;
Now, from the entities, we have to extract the hashtags object, which is again a map. So we will extract the hashtags as the map[] data type as well.
hash = foreach extract_details generate FLATTEN(m#'hashtags') as (tags:map[]), id as id;
The extracted hashtags can be viewed by dumping the above relation.
In the above image, we can see that the hashtags object has been extracted successfully. The extracted hashtags are also of the map data type, which can be seen by describing the relation. You can refer to the below image for the same.
Now, from the extracted hashtags, we need to extract text which contains the actual hashtag. This can be done using the following command:
txt = foreach hash generate FLATTEN(tags#'text') as text, id;
Here, we have extracted the hashtag's text (the part following the #) and given it the alias text.
We can see the extracted hashtag text in the below screenshot:
In the above image, we can see the hashtag’s text and the tweet_id from which it has originated.
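The FLATTEN pipeline up to this point can be sketched in Python for illustration (the tweets below are a made-up sample, with field names matching the tweet JSON): each tweet is flattened into (hashtag_text, tweet_id) pairs, and tweets without hashtags simply produce none.

```python
import json

# Hypothetical sample of tweets, one JSON document per line (as Flume writes them)
lines = [
    '{"id": 1, "entities": {"hashtags": [{"text": "BigData"}, {"text": "Pig"}]}}',
    '{"id": 2, "entities": {"hashtags": [{"text": "BigData"}]}}',
    '{"id": 3, "entities": {"hashtags": []}}',
]

# Equivalent of extract_details -> hash -> txt: flatten each tweet
# into (hashtag_text, tweet_id) pairs
pairs = []
for line in lines:
    tweet = json.loads(line)
    for tag in tweet["entities"]["hashtags"]:
        pairs.append((tag["text"], tweet["id"]))

print(pairs)  # [('BigData', 1), ('Pig', 1), ('BigData', 2)]
```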
Now, we will group the relation by the hashtag's text using the below statement:
grp = group txt by text; |
We have successfully grouped by hashtag’s text. We can see the schema by describing the relation.
The next step is to count the number of times each hashtag occurs. This can be achieved using the below statement:
cnt = foreach grp generate group as hashtag_text, COUNT(txt.text) as hashtag_cnt;
Now we have the hashtags and their counts in a relation, as shown in the below screenshot.
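Since the goal is the most popular hashtags, ordering the counts descending (in Pig, an additional ORDER ... BY hashtag_cnt DESC over the cnt relation) would normally be the final step. The group-and-count logic can be sketched in Python for illustration, with made-up (hashtag, tweet_id) pairs standing in for the txt relation:

```python
from collections import Counter

# Made-up (hashtag, tweet_id) pairs, standing in for the txt relation
pairs = [("BigData", 1), ("Pig", 1), ("BigData", 2), ("Hadoop", 3), ("BigData", 4)]

# Equivalent of the grp/cnt relations: group by hashtag text and count
counts = Counter(tag for tag, _ in pairs)

# most_common sorts by count descending, i.e. the most popular hashtags first
print(counts.most_common(1))  # [('BigData', 3)]
```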
We have now successfully performed hashtag count on Twitter data using Pig!
Hope this post has helped you learn how to count hashtags. Keep visiting our blog for more updates on Big Data and other technologies.
Thanks for sharing nice post!!!
I followed the above steps and was able to fetch the data from Twitter, but when I go to analyze the data using Pig as mentioned in this post, I am not getting any data from the dump command. The steps I followed are:
1) Fetched data from twitter stored in hdfs:
[cloudera@localhost conf]$ hadoop fs -ls flume
Found 9 items
-rw-r--r-- 3 cloudera cloudera 179237 2016-08-02 05:36 flume/FlumeData.1470141370000
-rw-r--r-- 3 cloudera cloudera 66274 2016-08-02 05:36 flume/FlumeData.1470141370001
-rw-r--r-- 3 cloudera cloudera 66497 2016-08-02 05:36 flume/FlumeData.1470141370002
-rw-r--r-- 3 cloudera cloudera 83746 2016-08-02 05:36 flume/FlumeData.1470141370003
-rw-r--r-- 3 cloudera cloudera 65313 2016-08-02 05:36 flume/FlumeData.1470141370004
-rw-r--r-- 3 cloudera cloudera 84880 2016-08-02 05:36 flume/FlumeData.1470141370005
-rw-r--r-- 3 cloudera cloudera 71532 2016-08-02 05:36 flume/FlumeData.1470141370006
-rw-r--r-- 3 cloudera cloudera 68419 2016-08-02 05:36 flume/FlumeData.1470141370007
-rw-r--r-- 3 cloudera cloudera 64983 2016-08-02 05:36 flume/FlumeData.1470141370008
2) In the Pig grunt shell, executed the below commands:
Practice:
register /home/cloudera/Desktop/Jars/elephant-bird-hadoop-compat-4.1.jar;
register /home/cloudera/Desktop/Jars/elephant-bird-pig-4.1.jar;
register /home/cloudera/Desktop/Jars/json-simple-1.1.1.jar;
load_tweets = LOAD '/user/cloudera/flume' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
dump load_tweets;
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2016-08-02 06:17:36 2016-08-02 06:18:23 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_201608020416_0003 1 0 13 13 13 13 0 0 0 0 load_tweets MAP_ONLY hdfs://localhost.localdomain:8020/tmp/temp-128798977/tmp-1279927259,
Input(s):
Successfully read 0 records (752029 bytes) from: “/user/cloudera/flume”
Output(s):
Successfully stored 0 records in: “hdfs://localhost.localdomain:8020/tmp/temp-128798977/tmp-1279927259”
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201608020416_0003
So here the dump command executed successfully, but it's not showing any data. I don't know what went wrong; can you please help me out?
Hi Pradeep,
While checking the files in HDFS, you gave the command hadoop fs -ls flume, so according to this command the path is just flume. But while loading the tweets in the Pig script, you gave the path as /user/cloudera/flume. The two paths are different, which is why you are not getting any output. Please change the input file path in the Pig script to 'flume'.
Hi Satyam,
Thanks for the quick reply!!!
I have tried changing the path to 'flume', but no luck. Please check the below Pig command.
load_tweets = LOAD 'flume' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;