Apache Flume Collecting Twitter Data

We will create a Twitter application and collect tweets through it using the experimental Twitter source provided by Apache Flume. A memory channel buffers the tweets, and an HDFS sink writes them to HDFS.

Step 1 – Create an application on Twitter using your Twitter account. Browse to the URL below to create the application.

https://apps.twitter.com/

a) Sign in to your Twitter account. You will have a Twitter Application Management window where you can create, delete, and manage Twitter Apps.

b) Click the Create New App button. You will be redirected to an application form; fill in your details to create the app. When entering the website address, give the complete URL pattern, for example, http://example.com.

c) Fill in the details, accept the Developer Agreement, and click the Create your Twitter application button at the bottom of the page. If everything goes fine, the app will be created.

d) Open the Keys and Access Tokens tab. At the bottom of the page there is a button named Create my access token. Click it to generate the access token.

e) Finally, click the Test OAuth button at the top right of the page. This leads to a page that displays your Consumer key, Consumer secret, Access token, and Access token secret. Copy these details; you will need them to configure the agent in Flume.

Step 2 – Change the directory to /usr/local/hadoop/sbin

$ cd /usr/local/hadoop/sbin

Step 3 – Start all Hadoop daemons.

$ ./start-all.sh

Step 4 – Run jps (the Java Virtual Machine Process Status Tool) to verify that the Hadoop daemons are running. Note that jps reports only the JVMs for which it has access permissions, so run it as the user that started the daemons.

$ jps
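Reading the jps listing by eye is easy to get wrong, so the check can be scripted. The following is a minimal sketch, not part of the original steps: the function name missing_daemons is hypothetical, and the daemon list assumes a single-node Hadoop 2.x setup started with start-all.sh.

```shell
# Print each expected Hadoop daemon that does not appear in jps output read from stdin.
# The daemon list is an assumption (single-node Hadoop 2.x); adjust it for your cluster.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <ClassName>" per line; compare the class name as a whole line.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}

# Usage: jps | missing_daemons
```

If the function prints nothing, every daemon in the list was found.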

Step 5 – Create a /user/hduser/twitter_data folder in HDFS.

$ hdfs dfs -mkdir -p hdfs://localhost:9000/user/hduser/twitter_data
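If you rerun this tutorial, the folder may already exist. The sketch below creates it only when needed, using the hdfs dfs -test -d and -mkdir -p options; the function name ensure_hdfs_dir is hypothetical, and the hdfs command is assumed to be on your PATH.

```shell
# Create an HDFS directory unless it already exists, and report which case applied.
ensure_hdfs_dir() {
  dir=$1
  if hdfs dfs -test -d "$dir"; then
    echo "exists: $dir"
  else
    # -p also creates missing parent directories such as /user/hduser
    hdfs dfs -mkdir -p "$dir" && echo "created: $dir"
  fi
}

# Usage: ensure_hdfs_dir /user/hduser/twitter_data
```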

Step 6 – Copy the following twitter4j jar files into the /usr/local/flume/lib/ folder. You can download these jar files from the internet.

twitter4j-async-4.0.4.jar
twitter4j-core-4.0.4.jar
twitter4j-media-support-4.0.4.jar
twitter4j-stream-4.0.4.jar
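A missing jar usually shows up only later, as a class-loading error when the agent starts, so it may be worth checking the folder first. This is a small sketch under the assumption that the jar names are exactly those listed above; the function name missing_twitter_jars is hypothetical.

```shell
# Print the name of each expected twitter4j jar that is absent from a lib directory.
missing_twitter_jars() {
  lib_dir=$1
  for name in async core media-support stream; do
    jar="twitter4j-$name-4.0.4.jar"
    [ -f "$lib_dir/$jar" ] || echo "$jar"
  done
}

# Usage: missing_twitter_jars /usr/local/flume/lib
```

No output means all four jars are in place.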

Step 7 – Open the flume-env.sh file in Flume's conf folder.

$ gedit /usr/local/flume/conf/flume-env.sh

Step 8 – Add the Flume library path to flume-env.sh, then save and close the file.

export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*
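As an alternative to editing the file by hand, the line can be appended from the shell. The sketch below adds it only if an identical line is not already present, so running it twice does not duplicate the export. The function name add_flume_classpath is hypothetical, and FLUME_HOME is assumed to be set in your environment to the Flume install directory (/usr/local/flume in this tutorial).

```shell
# Append the CLASSPATH export to an env file unless an identical line is already there.
add_flume_classpath() {
  env_file=$1
  # Single quotes keep $CLASSPATH and $FLUME_HOME literal in the written line.
  line='export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*'
  # -F: fixed string, -x: match the whole line, -q: no output
  grep -Fxq "$line" "$env_file" 2>/dev/null || printf '%s\n' "$line" >> "$env_file"
}

# Usage: add_flume_classpath /usr/local/flume/conf/flume-env.sh
```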

Step 9 – Configuration File

Given below is an example of the configuration file. Copy this content and save as twitter.conf in the conf folder of Flume.

Don't forget to replace the consumerKey, consumerSecret, accessToken, and accessTokenSecret values with your own Twitter OAuth credentials.

twitter.conf

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = bVd3fwceBGCvjghPqjVF6A2jW
TwitterAgent.sources.Twitter.consumerSecret = 86EPCj7ByjPpPTx4vNN1nTYqOsdjN0v7ZsainjEgjGY6KzwjFV
TwitterAgent.sources.Twitter.accessToken = ******************-0NpAbHQt1WW2NM5njFieh6xVA0BwedG
TwitterAgent.sources.Twitter.accessTokenSecret = lUcbFDxu08lRE6uIISHE9fgAsEdZXKCh6MTpJqbplYUXy

# Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/hduser/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 5
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10

# Describing/Configuring the channel

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

If you saved the file somewhere else, copy it into Flume's conf folder, for example:

$ cp /home/hduser/Desktop/FLUME/twitter.conf /usr/local/flume/conf/
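Before starting the agent, it can help to confirm that all four OAuth properties are actually set in the file. The following is a rough sketch that only checks for a non-empty value on each property line; it does not verify that the credentials are valid. The function name missing_oauth_props is hypothetical, and the agent and source names (TwitterAgent, Twitter) are taken from the configuration above.

```shell
# Print each OAuth property that is absent or has an empty value in a Flume config file.
missing_oauth_props() {
  conf_file=$1
  for key in consumerKey consumerSecret accessToken accessTokenSecret; do
    prop="TwitterAgent.sources.Twitter.$key"
    # Require "<prop> = <non-blank value>" starting at the beginning of a line.
    grep -Eq "^$prop[[:space:]]*=[[:space:]]*[^[:space:]]" "$conf_file" || echo "$prop"
  done
}

# Usage: missing_oauth_props /usr/local/flume/conf/twitter.conf
```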

Step 10 – Change the directory to /usr/local/flume

$ cd $FLUME_HOME

Step 11 – Execution

$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
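Once the agent has been running for a short while, you can check that files are arriving in HDFS. The sketch below lists them by filtering the output of hdfs dfs -ls; it relies on "FlumeData" being the HDFS sink's default file prefix, which applies here because the configuration does not set hdfs.filePrefix. The function name list_flume_files is hypothetical, and the default path is the folder created in Step 5.

```shell
# Print the paths of files the HDFS sink has written under a directory.
# "FlumeData" is the sink's default file prefix; files still being written end in .tmp.
list_flume_files() {
  hdfs dfs -ls "${1:-/user/hduser/twitter_data}" | awk '$NF ~ /\/FlumeData\./ {print $NF}'
}

# Usage: list_flume_files
```

Note that the experimental Twitter source converts tweets to Avro before handing them to the sink, so the file contents will likely not read as plain text when printed with hdfs dfs -cat.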
