Problem Scenario 5 [SQOOP]

CCA 175 Hadoop and Spark Developer Exam Preparation - Problem Scenario 5

PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS

A video walkthrough of this problem is available at [PART 1 CLICK HERE] and [PART 2 CLICK HERE].

Click here for the video version of this series. This takes you to the YouTube playlist of videos.

Sqoop is one of the important topics for the exam. Based on the exam pattern generally reported by anonymous internet bloggers, you can expect 2 out of 10 questions on this topic, covering Data Ingest and Data Export using Sqoop. Hence, 20% of the exam score can be obtained just by practicing simple Sqoop concepts. Sqoop can be mastered easily (i.e., in a few hours) at the skill level the CCA 175 exam expects you to demonstrate. I created this problem focusing on Sqoop alone; if you are able to perform this exercise on your own, or at worst using just the Sqoop user guide, then there is a very high chance that you can score that 20% easily.

Pre-Work: Please perform these steps before solving the problem
1. Log in to MySQL using the commands below in a fresh terminal window and switch to the retail_db database
    mysql -u retail_dba -p
    Password = cloudera
    use retail_db;
2. Create a replica of the products table and name it products_replica
    create table products_replica as select * from products;
3. Add a primary key to the newly created table
    alter table products_replica add primary key (product_id);
4. Add two more columns
    alter table products_replica add column (product_grade int, product_sentiment varchar(100));
5. Run the two update statements below to modify the data (a quick verification query follows this list)
    update products_replica set product_grade = 1 where product_price > 500;
    update products_replica set product_sentiment = 'WEAK' where product_price between 300 and 500;
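
Before moving on, it is worth confirming that the new columns were actually populated. A minimal sanity check from a terminal (a sketch, assuming the same quickstart defaults used above: user retail_dba, password cloudera, database retail_db):

# assumes the retail_db defaults used throughout this problem
mysql -u retail_dba -pcloudera retail_db -e "select product_id, product_price, product_grade, product_sentiment from products_replica where product_price > 300 order by product_price limit 10;"

Rows priced above 500 should show product_grade = 1, and rows priced between 300 and 500 should show product_sentiment = 'WEAK'.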
    
Problem 5: The steps above are important, so please complete them successfully before attempting to solve the problem.
  1. Using sqoop, import the products_replica table from MySQL into HDFS such that fields are separated by a '|' and lines are separated by '\n'. Null values are represented as -1 for numbers and "NOT-AVAILABLE" for strings. Only records with product id greater than or equal to 1 and less than or equal to 1000 should be imported, and 3 mappers should be used for importing. The destination should be stored as text files in the directory /user/cloudera/problem5/products-text
  2. Using sqoop, import the products_replica table from MySQL into HDFS such that fields are separated by a '*' and lines are separated by '\n'. Null values are represented as -1000 for numbers and "NA" for strings. Only records with product id less than or equal to 1111 should be imported, and 2 mappers should be used for importing. The destination should be stored as text files in the directory /user/cloudera/problem5/products-text-part1
  3. Using sqoop, import the products_replica table from MySQL into HDFS such that fields are separated by a '*' and lines are separated by '\n'. Null values are represented as -1000 for numbers and "NA" for strings. Only records with product id greater than 1111 should be imported, and 5 mappers should be used for importing. The destination should be stored as text files in the directory /user/cloudera/problem5/products-text-part2
  4. Using sqoop, merge the data available in /user/cloudera/problem5/products-text-part1 and /user/cloudera/problem5/products-text-part2 to produce a new set of files in /user/cloudera/problem5/products-text-both-parts
  5. Using sqoop, do the following. Read all of the steps before you create the sqoop job.
    • create a sqoop job that imports the products_replica table as text files to the directory /user/cloudera/problem5/products-incremental. Import all the records.
    • insert three more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • insert 2 more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • validate to make sure the records have not been duplicated in HDFS
  6. Using sqoop, do the following. Read all of the steps before you create the sqoop job.
    • create a hive table in a database named problem5 using the command below
    • create table products_hive  (product_id int, product_category_id int, product_name string, product_description string, product_price float, product_imaage string,product_grade int,  product_sentiment string);
    • create a sqoop job that imports the products_replica table as a hive table into the database named problem5; name the table products_hive
    • insert three more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • insert 2 more records into products_replica from MySQL
    • run the sqoop job again so that only the newly added records are pulled from MySQL
    • validate to make sure the records have not been duplicated in the Hive table
  7. Using sqoop, do the following.
    • insert 2 more records into the products_hive table using hive
    • create a table in mysql using the command below
    • create table products_external  (product_id int(11) primary Key, product_grade int(11), product_category_id int(11), product_name varchar(100), product_description varchar(100), product_price float, product_impage varchar(500), product_sentiment varchar(100));
    • export data from the products_hive (hive) table to the (mysql) products_external table
    • insert 2 more records into the products_hive table from hive
    • export data from the products_hive table to the products_external table
    • validate to make sure the records have not been duplicated in the mysql table

Solution: 

Try your best to solve the above scenario without going through the solution below. If you could, use the solution to compare your results. If you could not, I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the corresponding point in the problem scenario.


Step 1: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--target-dir /user/cloudera/problem5/products-text \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--null-non-string -1 \
--null-string "NOT-AVAILABLE" \
-m 3 \
--where "product_id between 1 and 1000" \
--outdir /home/cloudera/sqoop1 \
--boundary-query "select min(product_id), max(product_id) from products_replica where product_id between 1 and 1000";


Step 2: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--target-dir  /user/cloudera/problem5/products-text-part1  \
--fields-terminated-by '*' \
--lines-terminated-by '\n' \
--null-non-string -1000 \
--null-string "NA" \
-m 2 \
--where "product_id <= 1111 " \
--outdir /home/cloudera/sqoop2 \
--boundary-query "select min(product_id), max(product_id) from products_replica where product_id <= 1111";



Step 3: 

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--target-dir  /user/cloudera/problem5/products-text-part2  \
--fields-terminated-by '*' \
--lines-terminated-by '\n' \
--null-non-string -1000 \
--null-string "NA" \
-m 5 \
--where "product_id > 1111 " \
--outdir /home/cloudera/sqoop3 \
--boundary-query "select min(product_id), max(product_id) from products_replica where product_id > 1111"

Step 4: 

sqoop merge \
--class-name products_replica \
--jar-file mp/sqoop-cloudera/compile/66b4f23796be7625138f2171a7331cd3/products_replica.jar \
--new-data /user/cloudera/problem5/products-text-part2 \
--onto /user/cloudera/problem5/products-text-part1 \
--target-dir /user/cloudera/problem5/products-text-both-parts \
--merge-key product_id;
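
Note on --class-name and --jar-file: they point at the record class and jar that a previous sqoop import auto-generated, and the hash-named directory under the compile area will differ on every machine, so use whatever path your own earlier run printed. If that jar is no longer around, it can be regenerated; a sketch, with placeholder output directories:

# the two output directories below are placeholders - any writable local paths will do
sqoop codegen \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table products_replica \
--outdir /home/cloudera/codegen_src \
--bindir /home/cloudera/codegen_bin

The generated /home/cloudera/codegen_bin/products_replica.jar and the class name products_replica can then be passed to sqoop merge via --jar-file and --class-name.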



Step 5: 

On terminal -

sqoop job --create first_sqoop_job \
-- import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "retail_dba" \
--password "cloudera" \
--table products_replica \
--target-dir /user/cloudera/problem5/products-incremental \
--check-column product_id \
--incremental append \
--last-value 0;

sqoop job --exec first_sqoop_job
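
The saved job keeps its incremental state in the sqoop metastore, so you can inspect what the next run will do:

sqoop job --list
sqoop job --show first_sqoop_job

The --show output includes the stored last value for the check column, which is why re-running the same job pulls only the newly added rows.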

On MySQL command line -

mysql> insert into products_replica values (1346,2,'something 1','something 2',300.00,'not avaialble',3,'STRONG');
mysql> insert into products_replica values (1347,5,'something 787','something 2',356.00,'not avaialble',3,'STRONG');

On terminal - 

sqoop job --exec first_sqoop_job

On MYSQL Command Line

insert into products_replica values (1376,4,'something 1376','something 2',1.00,'not avaialble',3,'WEAK');
insert into products_replica values (1365,4,'something 1376','something 2',10.00,'not avaialble',null,'NOT APPLICABLE');

On terminal - 

sqoop job --exec first_sqoop_job
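
To validate that no records were duplicated in HDFS, compare row counts on both sides after the final run; the two numbers should match (a sketch, run from a terminal):

hadoop fs -cat /user/cloudera/problem5/products-incremental/* | wc -l
mysql -u retail_dba -pcloudera retail_db -e "select count(*) from products_replica;"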

Step 6: 

On Terminal window-

sqoop job \
--create hive_sqoop_job \
-- import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username "retail_dba" \
--password "cloudera" \
--table products_replica  \
--check-column product_id \
--incremental append \
--last-value 0 \
--hive-import \
--hive-table products_hive \
--hive-database problem5;

On Hive window:

create database problem5;

use problem5;

create table products_hive  (product_id int, product_category_id int, product_name string, product_description string, product_price float, product_imaage string,product_grade int,  product_sentiment string);

On Terminal window

sqoop job --exec hive_sqoop_job

On MySQL window

insert into products_replica values (1378,4,'something 1376','something 2',10.00,'not avaialble',null,'NOT APPLICABLE');
insert into products_replica values (1379,4,'something 1376','something 2',10.00,'not avaialble',null,'NOT APPLICABLE');

On Terminal Window

sqoop job --exec hive_sqoop_job

On Hive Window

select * from products_hive;
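
To validate that nothing was duplicated in the Hive table, compare counts on both sides after each run (a sketch, run from a terminal):

hive -e "select count(*) from problem5.products_hive;"
mysql -u retail_dba -pcloudera retail_db -e "select count(*) from products_replica;"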

Step 7: 

On Hive Window

use problem5;

insert into table products_hive values (1380,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

insert into table products_hive values (1381,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

On MYSQL window

create table products_external  (product_id int(11) primary Key, product_grade int(11), product_category_id int(11), product_name varchar(100), product_description varchar(100), product_price float, product_impage varchar(500), product_sentiment varchar(100));

On Terminal 

sqoop export \
--username "retail_dba" \
--password "cloudera" \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
 --export-dir /user/hive/warehouse/problem5.db/products_hive/ \
--input-fields-terminated-by '\001' \
--input-null-non-string "null" \
--input-null-string "null" \
--update-mode allowinsert \
--update-key product_id \
--columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment" --table products_external;

On Hive Window 

insert into table products_hive values (1382,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

insert into table products_hive values (1383,4,'something 1380','something 2',8.00,'not avaialble',3,'NOT APPLICABLE');

On Terminal Window:


sqoop export \
--username "retail_dba" \
--password "cloudera" \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--export-dir /user/hive/warehouse/problem5.db/products_hive/ \
--input-fields-terminated-by '\001' \
--input-null-non-string "null" \
--input-null-string "null" \
--update-mode allowinsert \
--update-key product_id \
--columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment" --table products_external;

To Validate
On Hive
select count(*) from problem5.products_hive;
On MySQL
select count(*) from products_external;

Comments:

  1. Great job. These exercises give confidence of clearing the certification.

  2. Excellent problem scenario!!! Please keep up the great work.
    I came across a trivial point. In the 1st part of the problem it says "Only records with product id greater than 1 and less than 1000" - I think the where clause and boundary-query need a slight change: "product_id between 2 and 999".

  3. Thank you Daven for identifying an error. Given that I cannot update the video, I updated the question to suit the video. Thanks for the compliment as well.

  4. Why have you used the boundary query?

    Replies:
    1. Please identify yourself.

    2. Arun, can you let me know why you have used min and max product_id in the boundary-query when you have defined it in the where clause? I do see the range of product_ids in the data set. Just curious though. I do see that there is no group-by that might have helped to get all product_ids. Appreciate clarification. Thanks.

      Great videos.

  5. Hey Arun, can you please tell me why you used the boundary query?
    Also, how can I get the products_replica.jar file which you have mentioned here:
    --jar-file mp/sqoop-cloudera/compile/66b4f23796be7625138f2171a7331cd3/products_replica.jar
    Looking at the Hadoop MR job, I found that it produced the products_replica.jar file, but I am not able to access it.

  6. Look at the video, buddy. Arun explained where to find the jar file.

  7. Arun sir, thanks for the awesome content. It helped me clear the CCA175 certification. Thanks!!!

  8. Another way to do question 2 is using --query, to avoid --where and --boundary-query.

    --split-by is then mandatory.

    sqoop import \
    --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username retail_dba \
    --password cloudera \
    --query "select * from products_replica where product_id<=1111 and \$CONDITIONS" \
    --target-dir /user/cloudera/problem5/products-text-part1 \
    --fields-terminated-by '*' \
    --lines-terminated-by '\n' \
    --null-non-string -1000 \
    --null-string "NA" \
    --split-by product_id \
    -m 2

  9. Hi Arun,

    In question 5, if we do an incremental append with last value 0 every time, won't it create duplicate values?
    Can you please explain?

    Should the last value be changed for each and every run of the sqoop job?

  10. You will have a better understanding if you go through the video. To answer your question, a sqoop job remembers the last value and starts from where it left off, so you don't have to recreate the sqoop job with a new sqoop import command every time. Please watch the video for a better understanding. Good luck.

  11. What is the difference between input-fields-terminated-by and fields-terminated-by in sqoop export?

  12. --input-fields-terminated-by is used in sqoop export and describes the source files' field terminator, whereas --fields-terminated-by is used in sqoop import and describes the destination files' field terminator.

    I corrected the typo in the solution. Thanks for bringing it to my notice. All the best for your exam.

    Replies:
    1. I used the --fields-terminated-by option while exporting a tab-delimited file. It exported all the records to the mysql table.

      Can you please tell me the exact difference between these 2 options?

  13. Hi Arun, TFGW (Thanks For Great Work). In step 7, is there any specific reason why you mentioned the columns in the export job?
    --columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment" --table products_external;

    Replies:
    1. Due to the difference in source and destination column order. Please go through the accompanying video; you can forward to that step if you don't want to watch the entire video.

  14. Today, I cleared CCA-175 with 7/9. Thanks Arun sir for giving such good examples.

  15. Hi all,
    In the sqoop command for Step 1,

    you either need the where condition

    --where "product_id between 1 and 1000" \

    or the boundary-query

    --boundary-query "select min(product_id), max(product_id) from products_replica where product_id between 1 and 1000";

    We don't need both. I tried with just the boundary-query and it worked as expected. Please correct me if I am wrong.

    Replies:
    1. Again for Step 2, using the boundary-query will suffice; there is no need for --where.

      And for step 3, we can just use --where "product_id > 1111". The boundary query is mostly meant to be used when your primary key values are heavily skewed, for example when the primary key values range from 6000 to 8000 and then jump from 11000 to 15000. Just a gentle suggestion.

  16. Hi Arun,
    I've watched your YouTube video but I still don't understand this part of exercise 4c. I'm totally lost with the combineByKey result:

    combineByKey((x:(Float, String))=>(x._1,Set(x._2)),
    (x:(Float,Set[String]),y:(Float,String))=>(x._1 + y._1,x._2+y._2),
    (x:(Float,Set[String]),y:(Float,Set[String]))=>(x._1+y._1,x._2++y._2)).
    map(x=> (x._1._1,x._1._2,x._2._1,x._2._2.size))

    Completely lost with the combineByKey. Could you please explain to me what you are doing here?
    I've got my exam in two days.

    Thank you

    Replies:
    1. Hi @Doubts Several, you can have a look at this blog, explaining aggregateByKey:
      http://codingjunkie.net/spark-agr-by-key/

      Then have a look at the follow-up post on combineByKey:
      http://codingjunkie.net/spark-combine-by-key/

      I hope it helps, I struggled too.

  17. Greetings Arun,

    I've used the jar location as shown in the video, but I keep getting either of the following errors (depending on how I reorder the merge parameters):
    " bash: --merge-key: command not found " or
    " bash: --jar-file: command not found "

    Is there any pre-work I should have done apart from what you've mentioned in the problem statement? Thank you.

  18. sqoop-import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --as-textfile -m 2 --query 'select * from products_replica where (product_id <=1111) AND $CONDITIONS' --split-by product_id --target-dir '/user/cloudera/problem5/products-text-part1' --null-string 'NA' --null-non-string '-1000' --fields-terminated-by '*' --lines-terminated-by '\n'

    Arun, can't we do problem 2 like the above? Can you let me know if this is also a way to solve it, and why is --outdir used?

  19. Hi,
    Thanks a lot Arun for coming up with such relevant questions and videos.
    But I would like to know how important sqoop job & sqoop merge are for CCA 175. Did anyone encounter questions on these techniques?
    Thanks

  23. Hi everybody,

    Is it a best practice to use --outdir in a sqoop command?
    If I hadn't watched this video I would never have used it.

  25. Important to note for exporting from the hive table products_hive to mysql: the fields-terminated-by option can also be entered as Ctrl+V+A on the command line to give you the default hive delimiter.

    Replies:
    1. You mean --input-fields-terminated-by '\001', not --fields-terminated-by.

  27. How can we export data from hive to mysql using sqoop export if the file format is other than text?

  30. Very nice. Thank you Arun. One question: can't we use --query to achieve the result for question 1?

    Replies:
    1. You can, but when using --query you must specify $CONDITIONS in the where clause (e.g. where ID=10 and $CONDITIONS), and if you want to import with more than 1 mapper you should specify --split-by.

  31. What is outdir, and why is it needed?

  32. In question 1, what is the point of the --where and --outdir?

  33. Unable to export:

    Error:
    Caused by: java.lang.RuntimeException: Can't parse input data: 'http://images.acmesports.sports/Nike+Swoosh+Headband+-+2%22'
    at products_external.__loadFromFields(products_external.java:507)
    at products_external.parse(products_external.java:420)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
    ... 10 more
    Caused by: java.lang.NumberFormatException: For input string: "http://images.acmesports.sports/Nike+Swoosh+Headband+-+2%22"

    Query:
    sqoop export --connect jdbc:mysql://localhost/retail_db --username root --password Hadoop#mysql4 --export-dir /user/hive/warehouse/problem5.db/prdct_hive/* -m 1 --input-fields-terminated-by ',' --input-lines-terminated-by '\n' --input-optionally-enclosed-by '"' --mysql-delimiters --input-escaped-by '"' --input-null-non-string "null" --input-null-non-string "null" --update-mode allowinsert --update-key product_id --table products_external --columns "product_id,product_category_id,product_name,product_description,product_price,product_impage,product_grade,product_sentiment"

    Replies:
    1. Did you find a solution to this issue? I used exactly what he typed but still got an export error.

      sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username root --password cloudera --export-dir /user/hive/warehouse/problem5.db/products_hive --table products_hive_external --columns "product_id,product_category_id,product_name,product_description,product_price,product_image,product_grade,product_sentiment" --update-key product_id --input-fields-terminated-by "\001" --update-mode allowinsert --input-null-non-string "null" --input-null-string "null" --input-lines-terminated-by "\n";

  35. Why do we need --boundary-query when we are using the --where clause?

  36. Should the last line of Step 7 not be
    select count(*) from products_external;
    instead of
    select count(*) from products_replica;

  37. Hi Arun, thank you so much for the problems with solutions.

    For step 6, we can also write it like this, since hive by default will append:

    sqoop job --create hive_job -- import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table products_replica --hive-import --hive-table problem5.products_hive --hive-overwrite

  38. In step 4, instead of merge and all of that, why not use the update key and update mode?
