Problem Scenario 6 [Data Analysis]

PLEASE READ THE INTRODUCTION TO THIS SERIES: CLICK THE HOME LINK AND READ THE INTRO BEFORE ATTEMPTING TO SOLVE THE PROBLEMS.

A video walkthrough of this problem is available at [CLICK HERE].

Click here for the video version of this series; it takes you to the YouTube playlist of videos.

This problem helps you strengthen and validate skills related to the data analysis objective of the certification exam.

The data model in MySQL on the Cloudera VM looks like this. [Note: only primary and foreign keys are included in the relational schema diagram shown below.]



Problem 6: Provide two solutions for steps 2 to 7:
  • using HiveQL over a Hive Context, and
  • using Spark SQL over a Spark SQL Context, or by using RDDs.
  1. Create a Hive metastore database named problem6 and import all tables from the MySQL retail_db database into the Hive metastore.
  2. On the spark shell, use the data available in the metastore as the source and perform steps 3, 4, 5, and 6. [This proves your ability to use the metastore as a source.]
  3. Rank products within each department by price, ordering by department ascending and rank descending. [This proves you can produce ranked and sorted data on joined data sets.]
  4. Find the top 10 customers with the most unique product purchases. If more than one customer has the same number of product purchases, the customer with the lowest customer_id takes precedence. [This proves you can produce aggregate statistics on joined data sets.]
  5. On the dataset from step 3, apply a filter so that only products priced below 100 are extracted. [This proves you can use subqueries and filter data.]
  6. On the dataset from step 4, extract details of the products purchased by the top 10 customers that are priced at less than 100 USD per unit. [This proves you can use subqueries and filter data.]
  7. Store the results of steps 5 and 6 in new metastore tables within Hive. [This proves your ability to use the metastore as a sink.]

Solution: 

Try your best to solve the above scenario without going through the solution below. If you succeed, use the solution to compare your results. If you could not solve it, I strongly recommend that you go through the concepts again (this time in more depth). Each step below solves the corresponding point in the problem scenario. Please watch the video for an in-depth explanation of the solution.

NOTE: The same solution can be implemented using a Spark SQL Context: just replace the Hive Context object below with a SQL Context object. The rest of the solution remains the same, i.e., the same pattern of querying, using temp tables, and storing the results back into Hive.

Step 1: 

Create the metastore database first (sqoop's --hive-database option expects it to exist), then run the import:

hive -e "create database problem6"

sqoop import-all-tables \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --warehouse-dir /user/hive/warehouse/problem6.db \
  --hive-import \
  --hive-database problem6 \
  --create-hive-table \
  --as-textfile
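As an optional sanity check, you can confirm that all six retail_db tables landed in the new database:

hive -e "use problem6; show tables"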

Step 2: 

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("use problem6")  // switch to the problem6 database so the unqualified table names below resolve
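A quick check that the metastore is visible from the shell (if this fails with a "Table not found" error, see the hive-site.xml symlink fix discussed in the comments below):

hc.sql("show tables").show()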

Step 3: 

val hiveResult = hc.sql("""
  select d.department_id, p.product_id, p.product_name, p.product_price,
         rank() over (partition by d.department_id order by p.product_price) as product_price_rank,
         dense_rank() over (partition by d.department_id order by p.product_price) as product_dense_price_rank
  from products p
  inner join categories c on c.category_id = p.product_category_id
  inner join departments d on c.category_department_id = d.department_id
  order by d.department_id, product_price_rank desc, product_dense_price_rank
""")
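The same ranking can be sketched with the DataFrame API instead of raw SQL; this is an illustrative alternative (variable names are mine), not the only accepted answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, dense_rank, col}

val products    = hc.table("products")
val categories  = hc.table("categories")
val departments = hc.table("departments")

// rank within each department by price, as in the HiveQL version
val w = Window.partitionBy("department_id").orderBy("product_price")

val rankedDF = products
  .join(categories, products("product_category_id") === categories("category_id"))
  .join(departments, categories("category_department_id") === departments("department_id"))
  .select(departments("department_id"), products("product_id"), products("product_name"), products("product_price"))
  .withColumn("product_price_rank", rank().over(w))
  .withColumn("product_dense_price_rank", dense_rank().over(w))
  .orderBy(col("department_id").asc, col("product_price_rank").desc)

Note that window functions in Spark 1.x require a HiveContext, which we already have.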

Step 4: 

val hiveResult2 = hc.sql("""
  select c.customer_id, c.customer_fname,
         count(distinct oi.order_item_product_id) as unique_products
  from customers c
  inner join orders o on o.order_customer_id = c.customer_id
  inner join order_items oi on o.order_id = oi.order_item_order_id
  group by c.customer_id, c.customer_fname
  order by unique_products desc, c.customer_id
  limit 10
""")

Step 5: 

hiveResult.registerTempTable("product_rank_result_temp");
hc.sql("select * from product_rank_result_temp where product_price < 100").show();

Step 6: 

This re-runs the step 4 query (reusing hiveResult2 from step 4 works equally well); registering the result as a temp table lets us join against it:

val topCustomers = hc.sql("""
  select c.customer_id, c.customer_fname,
         count(distinct oi.order_item_product_id) as unique_products
  from customers c
  inner join orders o on o.order_customer_id = c.customer_id
  inner join order_items oi on o.order_id = oi.order_item_order_id
  group by c.customer_id, c.customer_fname
  order by unique_products desc, c.customer_id
  limit 10
""")

topCustomers.registerTempTable("top_cust");

val topProducts = hc.sql("""
  select distinct p.*
  from products p
  inner join order_items oi on oi.order_item_product_id = p.product_id
  inner join orders o on o.order_id = oi.order_item_order_id
  inner join top_cust tc on o.order_customer_id = tc.customer_id
  where p.product_price < 100
""")
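A DataFrame sketch of the same extraction, reusing top10DF from the step 4 sketch above (illustrative, not the canonical answer):

val p  = hc.table("products")
val oi = hc.table("order_items")
val o  = hc.table("orders")
val top10Ids = top10DF.select("customer_id")

val topProductsDF = p
  .join(oi, oi("order_item_product_id") === p("product_id"))
  .join(o, o("order_id") === oi("order_item_order_id"))
  .join(top10Ids, o("order_customer_id") === top10Ids("customer_id"))
  .where(p("product_price") < 100)
  .select(p("product_id"), p("product_name"), p("product_price"))
  .distinct()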

Step 7: 

hc.sql("create table problem6.product_rank_result as select * from product_rank_result_temp where product_price < 100");


hc.sql("create table problem 6.top_products as select distinct p.* from products p inner join order_items oi on oi.order_item_product_id = p.product_id inner join orders o on o.order_id = oi.order_item_order_id inner join top_cust tc on o.order_customer_id = tc.customer_id where p.product_price < 100");

Comments:

  1. Hi Arun, this blog has been awesome for me! When can we expect the next videos, on Flume / Kafka / Spark Streaming and configuration? Looking forward to learning more from you.

     Reply: The Flume problem is out already. I don't expect anything to be asked on Kafka. Spark Streaming is a grey area, as no one has reported seeing a Spark Streaming problem so far. I will post a problem on Spark Streaming shortly anyway.

  2. Hi Arun, do you really need the --create-hive-table flag in the first question?

  3. Hi Arun, I am getting the error "org.apache.spark.sql.AnalysisException: Table not found: products;" while running the query in Spark SQL. I followed all the steps correctly up to that point.

     Reply: Hey, how did you resolve this? I'm facing the same issue.

     Reply: Try linking hive-site.xml into Spark's conf directory before launching spark-shell:
       sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
     Then launch Spark. It should work.

  4. Hi Arun, first of all, great effort! I am pretty weak in Hive but good in Spark, so I am confused about what the replacement for dense_rank is in Spark SQL.

     Reply (same author): Step 3 in particular is pretty confusing to me.

  5. Hi Arun, great blog for preparing for CCA175. I would like to know whether a single question in the certification contains as many steps or queries as above, or fewer. I am asking with the duration of the exam in mind: 2 hours for 10 questions gives us 12 minutes per question. Would we be able to solve all the steps in that amount of time?

     Reply: You can safely consider each step as a single question in the exam. The exam questions should not be time consuming if you understand the approach to take.

  6. Such a useful blog! I have a query: can I answer all the Spark-related questions using a Hive Context / SQL Context?

  7. This is more of a SQL / HiveQL question. In the query below, for "Rank products within department by price and order by department ascending and rank descending", is there a way to list only the top 3 ranked products within each department?

     select d.department_id, p.product_id, p.product_name, p.product_price, rank() over (partition by d.department_id order by p.product_price) as product_price_rank, dense_rank() over (partition by d.department_id order by p.product_price) as product_dense_price_rank from products p inner join categories c on c.category_id = p.product_category_id inner join departments d on c.category_department_id = d.department_id order by d.department_id, product_price_rank desc, product_dense_price_rank

     Reply: Never mind, I got it: wrap the ranked query in a subquery and filter on the rank.

       select * from (select product_id, product_price, category_id, category_name, rank() over (partition by category_id order by product_price) price_rank, dense_rank() over (partition by category_id order by product_price) price_dense_rank from products join categories on product_category_id = category_id order by category_id, price_rank, price_dense_rank) tmp where price_dense_rank <= 3;

       Thanks though,
       lakshmi

     Reply: Do we need both rank and dense_rank here? Please let me know. I guess dense_rank alone would satisfy the requirement.

  8. Hi Arun, great post, it helped me a lot. One question I have: is there no other way to push data into Hive without going through a HiveContext?

     I have tried DF.write.mode("append").saveAsTable("schema.table") and it works, but I am still wondering if there is a better way.

  9. Hi Arun, thanks for such a great post! I need help with this problem. I got an error while running my SQL query in the Spark Hive context:
     org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
     TungstenExchange hashpartitioning(department_id#93,200), None
     +- Project [department_id#93,product_id#84,product_category_id#85,product_name#86,product_price#88]

  10. Need help: I am getting the error "org.apache.spark.sql.AnalysisException: Table not found: products;" while running the query in Spark SQL.

  11. Using CDH 5.13 in a KVM image, I had to run
      sudo ln -s /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/hive-site.xml
      to get Spark to see the Hive metastore that has the problem6 database.

  12. Hi Arun, detailed and very helpful for CCA175 prep. I was wondering: can we use only Spark SQL for data analysis instead of Spark RDD or DataFrame functions, or will the question specify the technology to be used?

  13. Thanks for the excellent problem questions, Arun. It's a great help, along with Durga Gadiraju's itversity labs and YouTube videos. You both have the blessings of so many people.

  14. Where is it mentioned to use the problem6 database?

  15. Great efforts, Arun. Keep up the great work.

  16. Hi, I ran the following in Spark 2:

      val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2); val b = a.keyBy(_.length); b.groupByKey.collect

      Expected result:
      Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))

      Actual result:
      Array((4,CompactBuffer(lion)), (6,CompactBuffer(spider)), (3,CompactBuffer(dog, cat)), (5,CompactBuffer(tiger, eagle)))

      Will this create any issue with the answer being considered correct in CCA175? Please assist.

  17. Hi Arun, I am planning to take the CCA175 exam. Could you please put up a practice test for CCA175 with the new syllabus?

      Thanks,
      Ranju Thomas
