Problem Scenario 3

PLEASE READ THE INTRODUCTION TO THIS SERIES. CLICK ON HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS

Video walk through of this solution is available at [Click Here]

Click here for the video version of this series. This takes you to the YouTube playlist of videos.

Problem 3: Perform the following tasks in sequence

  1. Import all tables from the MySQL database into HDFS as Avro data files. Use compression, and the compression codec should be Snappy. The data warehouse directory should be retail_stage.db.
  2. Create a metastore table that points to the orders data imported by the Sqoop job above. Name the table orders_sqoop.
  3. Write a query in Hive that shows all orders belonging to a certain day. This day is the one on which the most orders were placed. Select the data from orders_sqoop.
  4. Query the table in Impala to show all orders belonging to a certain day. This day is the one on which the most orders were placed. Select the data from orders_sqoop.
  5. Now create a table named retail.orders_avro in Hive, stored as Avro. The table should have the same definition as orders_sqoop. Additionally, this new table should be partitioned by the order month, i.e. year-month (example: 2014-01).
  6. Load data into the orders_avro table from the orders_sqoop table.
  7. Write a query in Hive that shows all orders belonging to a certain day. This day is the one on which the most orders were placed. Select the data from orders_avro.
  8. Evolve the Avro schema related to the orders_sqoop table by adding two more fields: order_style (String) and order_zone (Integer).
  9. Insert two more records into the orders_sqoop table.
  10. Write a query in Hive that shows all orders belonging to a certain day. This day is the one on which the most orders were placed. Select the data from orders_sqoop.
  11. Query the table in Impala to show all orders belonging to a certain day. This day is the one on which the most orders were placed. Select the data from orders_sqoop.
Solution: 
Try your best to solve the above scenario without going through the solution below. If you can, use the solution to compare your result. If you cannot, I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the corresponding point in the problem scenario.

Step 1
sqoop import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--warehouse-dir /user/hive/warehouse/retail_stage.db \
--compress \
--compression-codec snappy \
--as-avrodatafile \
-m 1
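
To confirm the import, you can list the warehouse directory; the paths below assume the defaults used above:

hadoop fs -ls /user/hive/warehouse/retail_stage.db
hadoop fs -ls /user/hive/warehouse/retail_stage.db/orders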


Step 2
hadoop fs -get /user/hive/warehouse/retail_stage.db/orders/part-m-00000.avro
avro-tools getschema part-m-00000.avro > orders.avsc
hadoop fs -mkdir -p /user/hive/schemas/order
hadoop fs -copyFromLocal orders.avsc /user/hive/schemas/order
hadoop fs -ls /user/hive/schemas/order
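
Optionally, verify that the schema file landed in HDFS and is readable:

hadoop fs -cat /user/hive/schemas/order/orders.avsc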

Launch Hive using the 'hive' command in a separate terminal.

The Hive command below will create an external table pointing to the Avro data files for the orders data:

create external table orders_sqoop
STORED AS AVRO
LOCATION '/user/hive/warehouse/retail_stage.db/orders'
TBLPROPERTIES ('avro.schema.url'='/user/hive/schemas/order/orders.avsc');
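
To sanity-check the table just created (names as defined above), you can inspect its definition and sample a few rows:

describe formatted orders_sqoop;
select * from orders_sqoop limit 10;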





Step 3 - Run the following query in Hive:


select * from orders_sqoop as X where X.order_date in (select inner.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) inner);
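
Note: this query works in Hive, but the subquery alias inner can trip up other engines (for example Spark's HiveContext, as reported in the comments below). An equivalent form that simply uses Z as the alias:

select * from orders_sqoop as X where X.order_date in (select Z.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) Z);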



Step 4 - Run the query in Impala:

Launch the Impala shell using the command impala-shell

1. Run 'invalidate metadata;'
2. Run the query below:


select * from orders_sqoop as X where X.order_date in (select a.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) a);


Step 5 and 6: 

create database retail;
use retail;

create table orders_avro
(order_id int,
order_date date,
order_customer_id int,
order_status string)
partitioned by (order_month string)
STORED AS AVRO;

insert overwrite table orders_avro partition (order_month)
select order_id, to_date(from_unixtime(cast(order_date/1000 as int))), order_customer_id, order_status, substr(from_unixtime(cast(order_date/1000 as int)),1,7) as order_month from default.orders_sqoop;
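
If the insert fails with "Dynamic partition strict mode requires at least one static partition column" (an error one reader reports in the comments below), enable non-strict dynamic partitioning in the same Hive session before re-running the insert:

set hive.exec.dynamic.partition.mode=nonstrict;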



Step 7 - Run the query in Hive:


select * from orders_avro as X where X.order_date in (select inner.order_date from (select Y.order_date, count(1) as total_orders from orders_avro as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) inner);


Step 8 - Evolve Avro Schema  

1. hadoop fs -get /user/hive/schemas/order/orders.avsc
2. gedit orders.avsc

3. Edit the schema so that it includes the two new nullable fields (order_style and order_zone), each with a null default:

{
  "type" : "record",
  "name" : "orders",
  "doc" : "Sqoop import of orders",
  "fields" : [ {
    "name" : "order_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "order_id",
    "sqlType" : "4"
  }, {
    "name" : "order_date",
    "type" : [ "null", "long" ],
    "default" : null,
    "columnName" : "order_date",
    "sqlType" : "93"
  }, {
    "name" : "order_customer_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "order_customer_id",
    "sqlType" : "4"
  },{
    "name" : "order_style",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "order_style",
    "sqlType" : "12"
  }, {
    "name" : "order_zone",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "order_zone",
    "sqlType" : "4"
  }, {
    "name" : "order_status",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "order_status",
    "sqlType" : "12"
  } ],
  "tableName" : "orders"
}

4. hadoop fs -copyFromLocal -f orders.avsc /user/hive/schemas/order/orders.avsc
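
Because the table reads its schema from avro.schema.url at query time, the two new columns should appear in Hive right away; a quick check:

describe orders_sqoop;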

Step 9 - Insert two records from the Hive shell:

insert into table orders_sqoop values (8888888,1374735600000,11567,"xyz",9,"CLOSED");
insert into table orders_sqoop values (8888889,1374735600000,11567,"xyz",9,"CLOSED");
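
To confirm that the new rows landed with the evolved columns (using the order IDs inserted above):

select * from orders_sqoop where order_id in (8888888, 8888889);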

Step 10 - Run the following query in Hive:

select * from orders_sqoop as X where X.order_date in (select inner.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) inner);



Step 11 - Run the query in Impala:

Launch the Impala shell using the command impala-shell

1. Run 'invalidate metadata;'
2. Run the query below:


select * from orders_sqoop as X where X.order_date in (select a.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) a);
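
Since the schema change and the new rows were made through Hive, Impala's cached metadata is stale, which is why 'invalidate metadata' is run first; if you prefer, you can scope it to the single table:

invalidate metadata orders_sqoop;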






Comments:

  1. Hi Arun. Let me say thank you for the exercises. Please provide more (no problem without videos).

    For Task 1, is it really necessary to use -m 1? Please explain.

    1. -m 1 (i.e. the number of mappers) can be anything, depending on the number of nodes you want to leverage in the cluster and the number of files you want Sqoop to create. I used 1 to help the flow of the video series: one file is easier to compare, query and manage in terms of size and data while walking through the solution. This is not mandatory.

    2. Thank You.
      &
      Expecting more exercises from you.

  2. Hi Arun,
    That's a nice blog you have there; I am learning a lot of things, thanks for the same!

    I have a question about the documentation available during the exam. Will we have access to the complete documentation (as listed on the Cloudera website), or will it be a stripped-down version of the online documentation?

    Thanks,
    Hassan.

  3. Go to the link https://www.cloudera.com/more/training/certification/cca-spark.html and scroll to the bottom. You will have access to the same documentation that is available on this page during the exam as well. But I recommend not relying on documentation as your primary or secondary means; it should be a last resort in case you cannot find the answer in your head or via the help commands.

  4. Hi Arun,

    This query is not supported by HiveContext in Spark.

    1. Can you post the query you are referring to?

    2. step-3,
      The same query works in Hive but not in Spark in HiveContext mode.
      I don't it is question for me also?

      select * from orders_sqoop as X where X.order_date in (select inner.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) inner);

    3. Try this: replace inner with Z

      select * from orders_sqoop as X where X.order_date in (select Z.order_date from (select Y.order_date, count(1) as total_orders from orders_sqoop as Y group by Y.order_date order by total_orders desc, Y.order_date desc limit 1) Z);

  5. Hi Arun, appreciate your efforts in putting all this material together. I have a few questions, please respond.
    1. Does each question have sub-questions like the one you have above? I see scenario 3 has 11 sub-questions underneath it. Have you put these together just to cover all the valid questions on the topic?
    2. For some of the questions, there is no mention of where I should store the result. Do I have to create files in my HOME and name them 1a, 1b to signify the question and the sub-question? I have read that they don't look at the type of solution, but are only interested in the result.
    3. I also thought that since they are only interested in the stored result, it doesn't matter which language we use. As long as I am comfortable in one language (e.g. Scala), I don't need to know Python or Java.
    4. Coming from the Hortonworks world, I don't know Impala. Do I need to learn Impala?
    5. A Spark problem can be solved using Spark core, SQL or DataFrames. Do I need to know all the functions in regular Spark core, or can I solve this using Spark SQL, as I have more familiarity with SQL?

  6. 1. Does each question have sub-questions like the one you have above? I see scenario 3 has 11 sub-questions underneath it. Have you put these together just to cover all the valid questions on the topic?
    Ans: In the real exam, don't expect so many sub-questions. As I mentioned on the introduction page, this is just to test your knowledge.
    2. For some of the questions, there is no mention of where I should store the result. Do I have to create files in my HOME and name them 1a, 1b to signify the question and the sub-question? I have read that they don't look at the type of solution, but are only interested in the result.
    A. In the exam, you will be given clear instructions on where to store the result. For the practice mode I created, you can store it wherever you want if the destination is not predetermined in the question. Once again, the intent is to help you practice the concepts and not get lost in nuances that can be tackled with common sense during the exam.
    3. I also thought that since they are only interested in the stored result, it doesn't matter which language we use. As long as I am comfortable in one language (e.g. Scala), I don't need to know Python or Java.
    A. If the question asks you for a result, then it does not matter which language you use. However, if the question asks you to complete a partially created solution, then you will have to understand Python as well, in case that partial solution is written in Python.
    4. Coming from the Hortonworks world, I don't know Impala. Do I need to learn Impala?
    A. Impala is probably the easiest to learn; you just have to type impala-shell. The majority of the syntax is the same as Hive. Most importantly, Impala works so fast that you will like it better than Hive.
    5. A Spark problem can be solved using Spark core, SQL or DataFrames. Do I need to know all the functions in regular Spark core, or can I solve this using Spark SQL, as I have more familiarity with SQL?
    A. I recommend that you know all of them; this will equip you to face 'fill in the blanks' questions.

    I hope this helps. Good luck.

  7. Thank you Arun. Your response cleared my apprehensions. Much appreciated.

  8. Arun, good morning. Can you please tell me if they will have SBT and Scala installed on the Cloudera VM in the exam?

    Thank you

    1. Yes. You can use spark-shell for all the spark problems in the exam if they are related to output only. However, if they are related to completing a portion of code then you will have all the required tools.

  9. Hello Arun
    Are you going to do a similar blog like this one (congratulations on this excellent work) but for the Data Engineer exam, CCP 575?

    Best regards,
    DB

  10. Hi Arun,

    In question 2 you talk about a "metastore table". I think this is not a good name, as it creates confusion: a "metastore" in Hive is the central repository that contains all the metadata (table schemas, etc.) of the Hive data. So I suggest naming it just "hive table" instead of "metastore table". What do you think? Am I making sense?

    1. I understand your question. However, see what the CCA 175 web page on the certification syllabus says. I tried to be consistent with CCA 175 terminology.

      This text is copied from CCA 175 web page. 'Use meta store tables as an input source or an output sink for Spark applications'

  11. Hello Arun,

    When I run the query below, it fails:

    select * from orders_sqoop as x where x.order_date = (select ji.order_date from (select y.order_date, count(y.order_id) count11 from orders_sqoop y group by y.order_date order by count11 desc limit 1) ji);

    With the error

    FAILED: ParseException line 1:55 cannot recognize input near 'select' 'ji' '.' in expression specification

    But when I put "in" instead of "=", it works. Can you explain this behaviour, or is it just the way Hive works?

  12. Hello Arun

    This topic is about Avro files, but in the new syllabus they have taken Avro files out. Do I still need to learn about Avro files?

  13. Absolutely.......you need to know this

  14. So, for CCA 175 as per the latest update, do I need to know Avro files?

  15. Can you please tell me which category Avro files come under: Data Ingest; Transform, Stage, and Store; or Data Analysis?

  16. Hi Arun, Thanks for all your efforts. In step 8.evolve the avro schema related to orders_sqoop table by adding more fields named (order_style String, order_zone Integer). Is there any specific reason why you have not added new columns after order_status ( which was the last field). Thanks.

  17. This comment has been removed by the author.

    1. Never mind this... I got my answers towards the end of your video. Again, thank you for making this blog.

  18. Hi Arun
    Thanks for all the problem statements, its testing all our knowledge and equipping us to sit on time bound Exam.

    Also, we only need to extract the schema with avro-tools from the .avro file when the .avro file is all we have. When we run the Sqoop import on orders, we actually get the orders.avsc and orders.java files in the folder from which we ran the Sqoop import.

    We can move that .avsc into HDFS to create the Hive table and evolve the schema in the same file. Just a way to save time for exam takers.

    1. Also, you don't have to copy it to HDFS; use 'avro.schema.url' = 'file:///path/to/orders.avsc'

  19. Hi Arun,

    I encountered the following error messages when doing step 5 and 6:


    FAILED: SemanticException [Error 10096]:
    Dynamic partition strict mode requires at least one static partition column.
    To turn this off set hive.exec.dynamic.partition.mode=nonstrict

    Please help by adding the following instruction to the solution:

    hive>set hive.exec.dynamic.partition.mode=nonstrict;

    Please correct me if I am wrong, thanks :)


    Again, thank you for providing these good practices to us. :)

    1. I think my CDH VM had this setting by default and hence it was not needed. I am glad you found a fix to the problem. I cannot update the video now, but I will update the post shortly.

  20. The problem says "data warehouse directory should be retail_stage.db". Does it mean the directory is always under /user/hive/warehouse/ even if that is not mentioned in the question?

  21. 1.) You have used different code for snappy compression in Problem 1 and this problem. Are both the same?

  23. Hi Arun,

    In the CCA exam, will the same database tables be given, or different tables? We know the schema of the default tables in the Cloudera VM, but if they give different tables, it might take extra time to find their schema and other metadata.

  25. By default, Spark launches 200 tasks to run a single job.
    Question:
    df.write.parquet("/user/cloudera/prob3")
    By default, 200 parquet files will be generated under the prob3 folder.
    If I change spark.sql.shuffle.partitions to 4 (to improve performance), then only 4 output files will be generated.
    How do they evaluate the answer? Do they compare the number of output files as well?

  29. Hi Arun. Thank you for taking time and preparing all these questions.
    I have one question on the warehouse-dir: when I give retail_stage.db, it complains about the '_' in the namespace and says only alphanumeric characters are allowed...

  30. Hi Arun, first off, thanks for all your efforts in putting everything in one place.

    The question I have is: in this problem you've used Avro schema evolution; is that part of CCA 175?

  33. I'm thankful for your efforts; I learned a lot of things from your blog. I want to ask: will the actual exam have questions of the same difficulty and length? We need to solve 7/10 to pass, but it took me 30 minutes to solve this one, and even then only with hints from Google. Does that mean I shouldn't take the exam in one month, and should practice for at least two more months?

