Problem Scenario 7 [FLUME]

CCA 175 Hadoop and Spark Developer Exam Preparation - Problem Scenario 7

PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS

Video walkthrough of this problem is available at [CLICK HERE]

Click here for the video version of this series. This takes you to the YouTube playlist of videos.

This question focuses on validating your Flume skills. You can either learn Flume by following the video that accompanies this post, or learn Flume elsewhere and then solve this problem while using the video as a reference. The video serves both as a tutorial and as a walkthrough of how to leverage Flume for data ingestion.

Note: While this post only provides the specifics related to solving the problem, the video provides an introduction, an explanation and, more importantly, the application of Flume knowledge.

    
Problem 7: 
  1. This step comprises the following substeps. Please perform the tasks under each substep completely.  
    • Using Sqoop, pull data from the MySQL orders table into /user/cloudera/problem7/prework as an Avro data file, using only one mapper.
    • Pull the file from /user/cloudera/problem7/prework into a local folder named flume-avro.
    • Create a Flume agent configuration with an Avro source bound to localhost on port 11112, a jdbc channel, and an HDFS file sink writing to /user/cloudera/problem7/sink.
    • Use the following command to run an Avro client: flume-ng avro-client -H localhost -p 11112 -F <<Provide your avro file path here>>
  2. CDH comes prepackaged with a log-generating job: start_logs, stop_logs and tail_logs. Using these as an aid, provide a solution to the problem below. The generated logs can be found at /opt/gen_logs/logs/access.log.  
    • Run start_logs.
    • Write a Flume configuration such that the logs generated by start_logs are dumped into HDFS at /user/cloudera/problem7/step2. The channel should be non-durable and hence the fastest available. The channel should be able to hold a maximum of 1000 messages and should commit after every 200 messages. 
    • Run the agent. 
    • Confirm that logs are getting dumped to HDFS.  
    • Run stop_logs.

Solution: 

Step 1: 
Pull the orders data from the MySQL orders table into /user/cloudera/problem7/prework using Sqoop


sqoop import --table orders --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera -m 1 --target-dir /user/cloudera/problem7/prework --as-avrodatafile
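
As a quick sanity check (not part of the asked solution), the prework directory should contain a single Avro part file, since only one mapper was used:

hadoop fs -ls /user/cloudera/problem7/prework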

Get the file from HDFS into the local flume-avro folder

mkdir flume-avro;
cd flume-avro;
hadoop fs -get /user/cloudera/problem7/prework/* .
gedit f.config

Create a Flume configuration file named f.config in the flume-avro folder with the following contents:

#Agent Name = step1

# Name the source, channel and sink
step1.sources = avro-source  
step1.channels = jdbc-channel
step1.sinks = file-sink

# Source configuration
step1.sources.avro-source.type = avro
step1.sources.avro-source.port = 11112
step1.sources.avro-source.bind = localhost


# Describe the sink
step1.sinks.file-sink.type = hdfs
step1.sinks.file-sink.hdfs.path = /user/cloudera/problem7/sink
step1.sinks.file-sink.hdfs.fileType = DataStream
step1.sinks.file-sink.hdfs.fileSuffix = .avro
step1.sinks.file-sink.serializer = avro_event
step1.sinks.file-sink.serializer.compressionCodec=snappy

# Describe the type of channel --  Use memory channel if jdbc channel does not work
step1.channels.jdbc-channel.type = jdbc

# Bind the source and sink to the channel
step1.sources.avro-source.channels = jdbc-channel
step1.sinks.file-sink.channel = jdbc-channel
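
If the jdbc channel gives you trouble (as noted in the comment above and in the discussion below, some readers hit errors when pairing it with the avro_event serializer), a memory channel is a drop-in substitute. A minimal sketch, where the channel name mem-channel is just illustrative:

# Hypothetical memory-channel variant of the same agent
step1.channels = mem-channel
step1.channels.mem-channel.type = memory
step1.channels.mem-channel.capacity = 1000
step1.channels.mem-channel.transactionCapacity = 100
step1.sources.avro-source.channels = mem-channel
step1.sinks.file-sink.channel = mem-channel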

Run the flume agent

flume-ng agent --name step1 --conf . --conf-file f.config
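
If the agent does not seem to start cleanly, it can also be launched with console logging to watch the source, channel and sink come up (a standard Flume logging property):

flume-ng agent --name step1 --conf . --conf-file f.config -Dflume.root.logger=INFO,console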

Run the flume Avro client

flume-ng avro-client -H localhost -p 11112 -F <<Provide your avro file path here>>
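
For illustration only, if the Sqoop import produced a single local part file (the exact file name will differ on your machine; the one below is hypothetical), the invocation and a follow-up check of the sink would look like this:

flume-ng avro-client -H localhost -p 11112 -F /home/cloudera/flume-avro/part-m-00000.avro
hadoop fs -ls /user/cloudera/problem7/sink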


Step 2: 

mkdir flume-logs
cd flume-logs

Create a Flume configuration file named f.config in the flume-logs folder with the following contents:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/gen_logs/logs/access.log


# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/cloudera/problem7/step2
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 200

# Bind the source and sink to the channel
a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Create the HDFS sink directory for this step

hadoop fs -mkdir /user/cloudera/problem7/step2

Run the Flume agent

flume-ng agent --name a1 --conf . --conf-file f.config
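
To cover the last two substeps (confirm the dump and stop the logs), a minimal check, assuming Flume's default FlumeData file prefix, looks like this:

hadoop fs -ls /user/cloudera/problem7/step2
hadoop fs -cat /user/cloudera/problem7/step2/FlumeData.*.log | tail -5
stop_logs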

Comments:

  1. Nicely presented flume scenario. Please keep publishing the content on CCA175 certification. Thanks a ton!

  2. After creating the Avro file in HDFS at /user/cloudera/problem7/sink, when I tried reading the Avro file in Spark I get a message saying "java.io.IOException: Not an Avro data file".
    Checking the flume-ng process I see the message "17/05/24 13:41:01 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false".

    I have these params set:
    step1.sinks.file-sink.hdfs.fileType = DataStream
    step1.sinks.file-sink.hdfs.fileSuffix = .avro
    step1.sinks.file-sink.hdfs.serializer = avro_event

    Replies
    1. Thank you Deven for posting your concern. You actually opened a can of worms here with Flume 1.6. Let me explain how to work through your problem.

      The configuration I used and showed in the video had a minor error. I corrected it in the blog and will correct it in the video too.

      Below is the correct configuration. Notice there is no "hdfs" in the key of the property.

      step1.sinks.file-sink.serializer = avro_event

      If you get a ClassCastException using this configuration then switch to a memory channel. Some people reported that the jdbc channel and avro_event do not play well with each other.

      Note that the above produces Avro file(s) in HDFS whose schema is different from the source. If you want the same schema as the source then you will have to use a custom serializer.

    2. You're right, it works well with the memory channel.

    3. Hi Arun, could you explain clearly why we have to use these 3 configuration options?
      Thank you very much

  3. Arun thanks for looking into the issue, much appreciated. Will try the options.

  4. In CCA175, is there any probability that they will ask about Kafka and Spark Streaming as per the new syllabus?
    Any idea, Arun sir?

    Replies
    1. I don't believe there would be a question on Kafka. There may be one on Spark Streaming, but no one has reported it so far, so I'm not sure. Please watch my video on certification preparation strategy to understand the exam-objective-to-technology mapping.

  5. Hi Arun,

    I am getting the following error for step 1, when I run an avro client:

    17/06/22 23:30:19 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
    17/06/22 23:30:24 ERROR avro.AvroCLIClient: Unable to deliver events to Flume. Exception follows.
    org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: quickstart.cloudera, port: 11112 }: Failed to send batch
    at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:315)
    at org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:229)
    at org.apache.flume.client.avro.AvroCLIClient.main(AvroCLIClient.java:72)
    Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: quickstart.cloudera, port: 11112 }: Avro RPC call returned Status: FAILED

    Could you please advise on the reason for this? I am seeing some data being sent to the sink directory though.

    Thanks,
    Bala.

    Replies
    1. Try switching to a memory channel. I could not solve this problem using the jdbc channel; it seems like an inherent bug in the Flume version we use. It is beyond my knowledge to solve this. If you notice, I also mentioned the same in the problem solution (i.e. to switch to a memory channel).

    2. Hi Arun, I tried with the memory channel as well, but still the same issue. Anyway, I will give it a try again and will update you in case I am able to fix it.

      Thanks,
      Bala.

  8. Hi Arun

    Thanks for this wonderful blog and YouTube session on Flume. Because of this, I could understand the concept behind the usage of Flume.

  9. Greetings Arun,

    Thank you for your work here; it's been a real blessing to us all. I've looked at the 'new syllabus' and it states skills required to ingest real-time and near-real-time data, which, after reading the documentation on Spark Streaming, seems like the tool to use for the job. However, as you have not covered it yet in your blog series, could you be so kind as to suggest a way (or resource) for relevant practice of Spark Streaming?

    Seven blessings to you.

    Regards

    Replies
    1. Hello Godfrey,

      I am not sure if Cloudera is going to ask anything on Spark Streaming even though it is part of the syllabus. The reason I say this is that setting up a scenario in which a test taker can just write the Spark Streaming code is a very difficult job. Most importantly, even if Cloudera figures out a way to set up that scenario, I am not sure there is an automated way in which they can validate the code you wrote. For example, for all of the Spark, Hive, Sqoop or Flume configuration you write, the result is static after you finish the code, and Cloudera can automate the scanning of the results by running some queries. However, doing this can be extremely challenging when you have to compare a streaming result. The word 'streaming' means the result keeps changing based on the data in the available stream. So I think you can safely go into the exam and still clear it even if you don't have any working knowledge of Spark Streaming. However, I would still encourage you to equip yourself with Spark, Flink, Storm and other streaming libraries to excel in the day-to-day nuances of being a data engineer.

  12. What API documentation is available during the exam in the VM itself?

  14. Hi Arun,

    How likely is it to get a question on Flume, Kafka or Spark Streaming on the CCA Hadoop exam?

  15. Hi Arun. Great blog! Could you please explain why the serializer is necessary? It was not asked in the question...
    step1.sinks.file-sink.serializer = avro_event
    step1.sinks.file-sink.serializer.compressionCodec=snappy

  16. Hi Arun, I executed step 1 as per the video and I also have all the resulting files in HDFS, but the names of the files don't include .avro; e.g. a file name is FlumeData.1550219986301

  17. Hi Arun,
    Thanks for the great blog!
    It seems there is some problem with the solution to the first problem.
    I executed all the commands above, except that I used a memory channel since jdbc is not working.
    Everything seems right, but the input and output records are not matching!
    1. The mysql orders table has 68883 records.
    2. The hdfs prework folder has the same number of records - used Spark to check the count.
    3. The local avro file used for the avro-client has the same number of records - used Spark to check the count.
    4. But the output file in the sink has very few records, less than a thousand.
    And more troublesome: Spark is able to read that avro file, but I don't see any actual columns!

    Please check whether this is an issue with your solution or only with me!
    I watched the video to find any clue!
    Thanks for the great work! Please go through it if you have some time!

  18. Hi,
    When I stream data with Flume using a netcat source into the warehouse directory and use the same file in a Hive table, the table does not show proper text; some of the content looks like it is in SequenceFile format.

    Below is my Flume conf:

    agent1.sources=source1
    agent1.channel=channel1
    agent1.sinks=sink1
    agent1.sources.source1.type=netcat
    agent1.sources.source1.bind=127.0.0.1
    agent1.sources.source1.port=44444
    agent1.sources.source1.interceptors=i1
    agent1.sources.source1.interceptors.i1.type=regex_filter
    agent1.sources.source1.interceptors.i1.regex=female
    agent1.sources.source1.interceptors.i1.excludeEvents=true
    agent1.channels.channel1.type=memory
    agent1.channels.channel1.capacity=1000
    agent1.channels.channel1.transactionCapacity=100

    agent1.sinks.sink1.type=hdfs
    agent1.sinks.sink1.hdfs.path=hdfs://localhost:9000/user/hive/warehouse/demo
    agent1.sinks.sink1.fileType=DataStream
    agent1.sinks.sink1.hdfs.writeFormat=Text

    agent1.sinks.sink1.channel=channel1
    agent1.sources.source1.channels.channel1

    I also tried the below, but the issue is still present:
    agent1.sinks.sink1.hdfs.fileSuffix=.txt

    I expect the result:
    alok,100000,male,29
    jatin,105000,male,32
    yogesh,134000,male,39

    But the actual result is: jatin105000SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.TextU▒Wn▒▒G▒~3JS▒yogesh134000x

    Please assist.

