Problem Scenario 5 [SQOOP]

CCA 175 Hadoop and Spark Developer Exam Preparation - Problem Scenario 5

PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS

A video walkthrough of this problem is available at [PART 1 CLICK HERE] and [PART 2 CLICK HERE].

Click here for the video version of this series. This takes you to the YouTube playlist of videos.

Sqoop is one of the important topics for the exam. Based on the exam pattern generally reported by anonymous internet bloggers, you can expect 2 out of 10 questions on this topic, covering Data Ingest and Data Export using Sqoop. Hence, 20% of the exam score can be obtained just by practicing simple Sqoop concepts. Sqoop can be mastered easily (i.e., in a few hours) at the skill level the CCA 175 exam expects you to demonstrate. I created this problem focusing on Sqoop alone; if you are able to perform this exercise on your own, or at worst using just the Sqoop user guide, then there is a very high chance that you can score that 20% easily.

Pre-Work: Please perform these steps before solving the problem
1. Log in to MySQL using the commands below in a fresh terminal window and switch to the retail_db database
    mysql -u retail_dba -p
    Password = cloudera
    use retail_db;
2. Create a replica of the products table and name it products_replica
    create table products_replica as select * from products;
3. Add a primary key to the newly created table
    alter table products_replica add primary key (product_id);
4. Add two more columns
    alter table products_replica add column (product_grade int, product_sentiment varchar(100));
5. Run the two update statements below to modify the data (a quick verification query follows this list)
    update products_replica set product_grade = 1 where product_price > 500;
    update products_replica set product_sentiment = 'WEAK' where product_price between 300 and 500;
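
Before moving on, it is worth confirming that the new columns were actually populated. A minimal sanity check from a terminal (a sketch, assuming the same quickstart defaults used above: user retail_dba, password cloudera, database retail_db):

# assumes the retail_db defaults used throughout this problem
mysql -u retail_dba -pcloudera retail_db -e "select product_id, product_price, product_grade, product_sentiment from products_replica where product_price > 300 order by product_price limit 10;"

Rows priced above 500 should show product_grade = 1, and rows priced between 300 and 500 should show product_sentiment = 'WEAK'.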
    
Problem 5: The steps above are important, so please complete them successfully before attempting to solve the problem.
  1. Using sqoop, import the products_replica table from MySQL into HDFS such that fields are separated by a '|' and lines are separated by '\n'. Null values are represented as -1 for numbers and "NOT-AVAILABLE" for strings. Only records with product id greater than or equal to 1 and less than or equal to 1000 should be imported, and 3 mappers should be used for importing. The destination should be stored as text files in the directory /user/cloudera/problem5/products-text
  2. Using sqoop, import the products_replica table from MySQL into HDFS such that fields are separated by a '*' and lines are separated by '\n'. Null values are represented as -1000 for numbers and "NA" for strings. Only records with product id less than or equal to 1111 should be imported, and 2 mappers should be used for importing. The destination should be stored as text files in the directory /user/cloudera/problem5/products-text-part1
  3. Using sqoop, import the products_replica table from MySQL into HDFS such that fields are separated by a '*' and lines are separated by '\n'. Null values are represented as -1000 for numbers and "NA" for strings. Only records with product id greater than 1111 should be imported, and 5 mappers should be used for importing. The destination should be stored as text files in the directory /user/cloudera/problem5/products-text-part2
  4. Using sqoop, merge the data available in /user/cloudera/problem5/products-text-part1 and /user/cloudera/problem5/products-text-part2 to produce a new set of files in /user/cloudera/problem5/products-text-both-parts
  5. Using sqoop, do the following. Read all of the steps before you create the sqoop job.
    • create a sqoop job that imports the products_replica table as text files to the directory /user/cloudera/problem5/products-incremental. Import all the records.
    • insert three more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • insert 2 more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • validate to make sure the records have not been duplicated in HDFS
  6. Using sqoop, do the following. Read all of the steps before you create the sqoop job.
    • create a hive table in a database named problem5 using the command below
    • create table products_hive  (product_id int, product_category_id int, product_name string, product_description string, product_price float, product_imaage string,product_grade int,  product_sentiment string);
    • create a sqoop job that imports the products_replica table as a hive table into the database named problem5; name the table products_hive
    • insert three more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • insert 2 more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • validate to make sure the records have not been duplicated in the Hive table
  7. Using sqoop, do the following.
    • insert 2 more records into the products_hive table using hive
    • create a table in mysql using the command below
    • create table products_external  (product_id int(11) primary Key, product_grade int(11), product_category_id int(11), product_name varchar(100), product_description varchar(100), product_price float, product_impage varchar(500), product_sentiment varchar(100));
    • export data from the products_hive (hive) table to the (mysql) products_external table
    • insert 2 more records into the products_hive table from hive
    • export data from the products_hive table to the products_external table
    • validate to make sure the records have not been duplicated in the mysql table

Solution: 

Try your best to solve the above scenario without going through the solution below. If you could, use the solution to compare your results. If you could not, I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the corresponding point in the problem scenario.


Step 1: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--target-dir /user/cloudera/problem5/products-text \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--null-non-string -1 \
--null-string "NOT-AVAILABLE" \
-m 3 \
--where "product_id between 1 and 1000" \
--outdir /home/cloudera/sqoop1 \
--boundary-query "select min(product_id), max(product_id) from products_replica where product_id between 1 and 1000";


Step 2: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--target-dir  /user/cloudera/problem5/products-text-part1  \
--fields-terminated-by '*' \
--lines-terminated-by '\n' \
--null-non-string -1000 \
--null-string "NA" \
-m 2 \
--where "product_id <= 1111 " \
--outdir /home/cloudera/sqoop2 \
--boundary-query "select min(product_id), max(product_id) from products_replica where product_id <= 1111";



Step 3: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--target-dir  /user/cloudera/problem5/products-text-part2  \
--fields-terminated-by '*' \
--lines-terminated-by '\n' \
--null-non-string -1000 \
--null-string "NA" \
-m 5 \
--where "product_id > 1111 " \
--outdir /home/cloudera/sqoop3 \
--boundary-query "select min(product_id), max(product_id) from products_replica where product_id > 1111"

Step 4: 

sqoop merge \
--class-name products_replica \
--jar-file mp/sqoop-cloudera/compile/66b4f23796be7625138f2171a7331cd3/products_replica.jar \
--new-data /user/cloudera/problem5/products-text-part2 \
--onto /user/cloudera/problem5/products-text-part1 \
--target-dir /user/cloudera/problem5/products-text-both-parts \
--merge-key product_id;
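
Note on --class-name and --jar-file: they point at the record class and jar that a previous sqoop import auto-generated, and the hash-named directory under the compile area will differ on every machine, so use whatever path your own earlier run printed. If that jar is no longer around, it can be regenerated; a sketch, with placeholder output directories:

# the two output directories below are placeholders - any writable local paths will do
sqoop codegen \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--outdir /home/cloudera/codegen_src \
--bindir /home/cloudera/codegen_bin

The generated /home/cloudera/codegen_bin/products_replica.jar and the class name products_replica can then be passed to sqoop merge via --jar-file and --class-name.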



Step 5: 

On terminal -

sqoop job --create first_sqoop_job \
-- import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "retail_dba" \
--password "cloudera" \
--table products_replica \
--target-dir /user/cloudera/problem5/products-incremental \
--check-column product_id \
--incremental append \
--last-value 0;

sqoop job --exec first_sqoop_job
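
The saved job keeps its incremental state in the sqoop metastore, so you can inspect what the next run will do:

sqoop job --list
sqoop job --show first_sqoop_job

The --show output includes the stored last value for the check column, which is why re-running the same job pulls only the newly added rows.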

On MySQL command line -

mysql> insert into products_replica values (1346,2,'something 1','something 2',300.00,'not avaialble',3,'STRONG');
mysql> insert into products_replica values (1347,5,'something 787','something 2',356.00,'not avaialble',3,'STRONG');

On terminal - 

sqoop job --exec first_sqoop_job

On MYSQL Command Line

insert into products_replica values (1376,4,'something 1376','something 2',1.00,'not avaialble',3,'WEAK');
insert into products_replica values (1365,4,'something 1376','something 2',10.00,'not avaialble',null,'NOT APPLICABLE');

On terminal - 

sqoop job --exec first_sqoop_job
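
To validate that no records were duplicated in HDFS, compare row counts on both sides after the final run; the two numbers should match (a sketch, run from a terminal):

hadoop fs -cat /user/cloudera/problem5/products-incremental/* | wc -l
mysql -u retail_dba -pcloudera retail_db -e "select count(*) from products_replica;"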

Step 6: 

On Terminal window-

sqoop job \
--create hive_sqoop_job \
-- import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "retail_dba" \
--password "cloudera" \
--table products_replica  \
--check-column product_id \
--incremental append \
--last-value 0 \
--hive-import \
--hive-table products_hive \
--hive-database problem5;

On Hive window:

create database problem5;

use problem5;

create table products_hive  (product_id int, product_category_id int, product_name string, product_description string, product_price float, product_imaage string,product_grade int,  product_sentiment string);

On Terminal window

sqoop job --exec hive_sqoop_job

On MySQL window

insert into products_replica values (1378,4,'something 1376','something 2',10.00,'not avaialble',null,'NOT APPLICABLE');
insert into products_replica values (1379,4,'something 1376','something 2',10.00,'not avaialble',null,'NOT APPLICABLE');

On Terminal Window

sqoop job --exec hive_sqoop_job

On Hive Window

select * from products_hive;
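
To validate that nothing was duplicated in the Hive table, compare counts on both sides after each run (a sketch, run from a terminal):

hive -e "select count(*) from problem5.products_hive;"
mysql -u retail_dba -pcloudera retail_db -e "select count(*) from products_replica;"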

Step 7: 

On Hive Window

use problem5;

insert into table products_hive values (1380,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

insert into table products_hive values (1381,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

On MYSQL window

create table products_external  (product_id int(11) primary Key, product_grade int(11), product_category_id int(11), product_name varchar(100), product_description varchar(100), product_price float, product_impage varchar(500), product_sentiment varchar(100));

On Terminal 

sqoop export \
--username "retail_dba" \
--password "cloudera" \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
 --export-dir /user/hive/warehouse/problem5.db/products_hive/ \
--input-fields-terminated-by '\001' \
--input-null-non-string "null" \
--input-null-string "null" \
--update-mode allowinsert \
--update-key product_id \
--columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment" --table products_external;

On Hive Window 

insert into table products_hive values (1382,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

insert into table products_hive values (1383,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

On Terminal Window:


sqoop export \
--username "retail_dba" \
--password "cloudera" \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--export-dir /user/hive/warehouse/problem5.db/products_hive/ \
--input-fields-terminated-by '\001' \
--input-null-non-string "null" \
--input-null-string "null" \
--update-mode allowinsert \
--update-key product_id \
--columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment" --table products_external;

To Validate
On Hive
select count(*) from problem5.products_hive;
On MySQL
select count(*) from products_external;

Comments:

  1. Great job. These exercises give confidence of clearing the certification.

  2. Excellent problem scenario!!! Please keep up the great work.
    I came across a trivial point. In the 1st part of the problem it says "Only records with product id greater than 1 and less than 1000" - I think the where clause and boundary-query need a slight change: "product_id between 2 and 999".

  3. Thank you Daven for identifying an error. Given that I cannot update the video, I updated the question to suit the video. Thanks for the compliment as well.

  4. Why have you used the boundary query?

    Replies:
    1. Please identify yourself.

    2. Arun, can you let me know why you have used min and max product_id in the boundary-query when you have defined it in the where clause? I do see the range of product_ids in the data set. Just curious though. I do see that there is no group-by that might have helped to get all product_ids. Appreciate clarification. Thanks.

      Great videos.

  5. Hey Arun, can you please tell me why you used the boundary query?
    Also, how can I get the products_replica.jar file which you have mentioned here:
    --jar-file mp/sqoop-cloudera/compile/66b4f23796be7625138f2171a7331cd3/products_replica.jar
    Looking at the Hadoop MR job, I found that it produced the products_replica.jar file, but I am not able to access it.

  6. Look at the video, buddy. Arun explained where to find the jar file.

  7. Arun sir, thanks for the awesome content. It helped me clear the CCA175 certification. Thanks!!!

  8. Another way to do question 2 is using --query, to avoid --where and --boundary-query.

    --split-by is then mandatory.

    sqoop import \
    --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username retail_dba \
    --password cloudera \
    --query "select * from products_replica where product_id<=1111 and \$CONDITIONS" \
    --target-dir /user/cloudera/problem5/products-text-part1 \
    --fields-terminated-by '*' \
    --lines-terminated-by '\n' \
    --null-non-string -1000 \
    --null-string "NA" \
    --split-by product_id \
    -m 2

  9. Hi Arun,

    In question 5, if we do an incremental append with last value 0 every time, won't it create duplicate values?
    Can you please explain?

    Should the last value be changed for each and every run of the sqoop job?

  10. You will have a better understanding if you go through the video. To answer your question, a sqoop job remembers the last value and starts from where it left off, so you don't have to recreate the sqoop job with a new sqoop import command every time. Please watch the video for a better understanding. Good luck.

  11. What is the difference between input-fields-terminated-by and fields-terminated-by in sqoop export?

  12. --input-fields-terminated-by is used in sqoop export and describes the source files' field terminator, whereas --fields-terminated-by is used in sqoop import and describes the destination files' field terminator.

    I corrected the typo in the solution. Thanks for bringing it to my notice. All the best for your exam.

    Replies:
    1. I used the --fields-terminated-by option while exporting a tab-delimited file. It exported all the records to the mysql table.

      Can you please tell me the exact difference between these 2 options?

  13. Hi Arun, TFGW (Thanks For Great Work). In step 7, is there any specific reason why you mentioned the columns in the export job?
    --columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment" --table products_external;

    Replies:
    1. Due to the difference in source and destination column order. Please go through the accompanying video; you can forward to that step if you don't want to watch the entire video.

  14. Today, I cleared CCA-175 with 7/9. Thanks Arun sir for giving such good examples.

  15. Hi all,
    In the sqoop command for Step 1,

    you either need the where condition

    --where "product_id between 1 and 1000" \

    or the boundary-query

    --boundary-query "select min(product_id), max(product_id) from products_replica where product_id between 1 and 1000";

    We don't need both. I tried with just the boundary-query and it worked as expected. Please correct me if I am wrong.

    Replies:
    1. Again for Step 2, using the boundary-query will suffice; there is no need for --where.

      And for step 3, we can just use --where "product_id > 1111". The boundary query is mostly meant to be used when your primary key values are heavily skewed, for example when the primary key values range from 6000 to 8000 and then jump from 11000 to 15000. Just a gentle suggestion.

  16. Hi Arun,
    I've watched your YouTube video but I still don't understand this part of exercise 4c. I'm totally lost with the combineByKey result:

    combineByKey((x:(Float, String))=>(x._1,Set(x._2)),
    (x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
    (x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
    map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size))

    Completely lost with the combineByKey. Could you please explain to me what you are doing here?
    I've got my exam in two days.

    Thank you

    Replies:
    1. Hi @Doubts Several, you can have a look at this blog, explaining aggregateByKey:
      http://codingjunkie.net/spark-agr-by-key/

      Then have a look at the follow-up post on combineByKey:
      http://codingjunkie.net/spark-combine-by-key/

      I hope it helps, I struggled too.

  17. Greetings Arun,

    I've used the jar location as shown in the video, but I keep getting either of the following errors (depending on how I reorder the merge parameters):
    " bash: --merge-key: command not found " or
    " bash: --jar-file: command not found "

    Is there any pre-work I should have done apart from what you've mentioned in the problem statement? Thank you.

  18. sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --as-textfile -m 2 --query 'select * from products_replica where (product_id <=1111) AND $CONDITIONS' --split-by product_id --target-dir '/user/cloudera/problem5/products-text-part1' --null-string 'NA' --null-non-string '-1000' --fields-terminated-by '*' --lines-terminated-by '\n'

    Arun, can't we do problem 2 like the above? Can you let me know if this is also a way to solve it, and why is --outdir used?

  19. Hi,
    Thanks a lot Arun for coming up with such relevant questions and videos.
    But I would like to know how important sqoop job & sqoop merge are for CCA 175. Did anyone encounter questions on these techniques?
    Thanks

  23. Hi everybody,

    Is it a best practice to use --outdir in a sqoop command?
    If I hadn't watched this video I would never have used it.

  25. Important to note for exporting from the hive table products_hive to mysql: the fields-terminated-by option can also be entered as Ctrl+V+A on the command line to give you the default hive delimiter.

    Replies:
    1. You mean --input-fields-terminated-by '\001', not --fields-terminated-by.

  27. How can we export data from hive to mysql using sqoop export if the file format is other than text?

  30. Very nice. Thank you Arun. One question: can't we use --query to achieve the result for question 1?

    Replies:
    1. You can, but when using --query you must specify $CONDITIONS in the where clause (e.g. where ID=10 and $CONDITIONS), and if you want to import with more than 1 mapper you should specify --split-by.

  31. What is outdir, and why is it needed?

  32. In question 1, what is the point of the --where and --outdir?

  33. Unable to export:

    Error:
    Caused by: java.lang.RuntimeException: Can't parse input data: 'http://images.acmesports.sports/Nike+Swoosh+Headband+-+2%22'
    at products_external.__loadFromFields(products_external.java:507)
    at products_external.parse(products_external.java:420)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
    ... 10 more
    Caused by: java.lang.NumberFormatException: For input string: "http://images.acmesports.sports/Nike+Swoosh+Headband+-+2%22"

    Query:
    sqoop export --connect jdbc:mysql://localhost/retail_db --username root --password Hadoop#mysql4 --export-dir /user/hive/warehouse/problem5.db/prdct_hive/* -m 1 --input-fields-terminated-by ',' --input-lines-terminated-by '\n' --input-optionally-enclosed-by '"' --mysql-delimiters --input-escaped-by '"' --input-null-non-string "null" --input-null-non-string "null" --update-mode allowinsert --update-key product_id --table products_external --columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment"

    Replies:
    1. Did you find a solution to this issue? I used exactly what he typed but still got an export error.

      sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username root --password cloudera --export-dir /user/hive/warehouse/problem5.db/products_hive --table products_hive_external --columns "product_id,product_category_id,product_name,product_description,product_price,product_image,product_grade,product_sentiment" --update-key product_id --input-fields-terminated-by "\001" --update-mode allowinsert --input-null-non-string "null" --input-null-string "null" --input-lines-terminated-by "\n";

  35. Why do we need --boundary-query when we are using the --where clause?

  36. Should the last line of Step 7 not be
    select count(*) from products_external;
    instead of
    select count(*) from products_replica;

  37. Hi Arun, thank you so much for the problems with solutions.

    For step 6, we can also write it like this, since hive by default will append:

    sqoop job --create hive_job -- import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table products_replica --hive-import --hive-table problem5.products_hive --hive-overwrite

  38. In step 4, instead of merge and all of that, why not use the update key and update mode?
