Cloudera Live: QuickStart: Scenario 2
Note: these are my practice notes from following the Cloudera Live tutorial.
——————————————
Scenario 2: Showing Big Data Value
In the dialogue that frames scenario 2, your boss starts to question your results from scenario 1:
Scenario:
Your Management: is indifferent, you produced what you always produce – a report on structured data, but you really didn’t prove any additional value.
You: are either also indifferent and just go back to what you have always done… or you have an ace up your sleeve…
Now, on to the scenario.
——————————————————————————
First, on the Cloudera Manager home page, confirm that the following services are started and healthy:
1. HDFS
2. Hive
3. Hue
4. Impala
(Screenshot of the Cloudera Manager home page omitted.)
Next, the scenario poses a question: are the most-viewed products also the best-selling ones?
Scenario 2 also introduces another Hadoop component: Flume.
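The tutorial names Flume as the tool for streaming web logs into HDFS, although the steps below copy the log file in manually. For illustration only, a minimal single-agent Flume configuration for this kind of pipeline might look like the sketch below; the agent, source, channel, and sink names are invented, not taken from the tutorial.

```properties
# Hypothetical Flume agent: tail a web access log and write it into HDFS.
agent1.sources = weblog
agent1.channels = mem
agent1.sinks = tohdfs

# Exec source: follow the log file as new lines arrive.
agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /opt/examples/log_files/access.log.2
agent1.sources.weblog.channels = mem

# In-memory channel between source and sink.
agent1.channels.mem.type = memory

# HDFS sink: land events as plain text under the warehouse directory.
agent1.sinks.tohdfs.type = hdfs
agent1.sinks.tohdfs.hdfs.path = /user/hive/warehouse/original_access_logs
agent1.sinks.tohdfs.hdfs.fileType = DataStream
agent1.sinks.tohdfs.channel = mem
```

This is a sketch of the streaming alternative; the manual `copyFromLocal` below achieves the same end state for a one-off load.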
——————————
First, upload the local log data to HDFS:
Let’s move this data from the local filesystem, into HDFS.
[root@quickstart ~]# su - cloudera
[cloudera@quickstart ~]$ id hdfs
uid=492(hdfs) gid=491(hdfs) groups=491(hdfs),492(hadoop)
[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxrwxrwx   - root supergroup          0 2017-10-24 08:52 /user/hive/warehouse/categories
drwxrwxrwx   - root supergroup          0 2017-10-24 08:53 /user/hive/warehouse/customers
drwxrwxrwx   - root supergroup          0 2017-10-24 08:54 /user/hive/warehouse/departments
drwxrwxrwx   - root supergroup          0 2017-10-24 08:55 /user/hive/warehouse/order_items
drwxrwxrwx   - root supergroup          0 2017-10-24 08:56 /user/hive/warehouse/orders
drwxrwxrwx   - root supergroup          0 2017-10-24 08:58 /user/hive/warehouse/products
[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -ls /user/hive/warehouse
Found 7 items
drwxrwxrwx   - root supergroup          0 2017-10-24 08:52 /user/hive/warehouse/categories
drwxrwxrwx   - root supergroup          0 2017-10-24 08:53 /user/hive/warehouse/customers
drwxrwxrwx   - root supergroup          0 2017-10-24 08:54 /user/hive/warehouse/departments
drwxrwxrwx   - root supergroup          0 2017-10-24 08:55 /user/hive/warehouse/order_items
drwxrwxrwx   - root supergroup          0 2017-10-24 08:56 /user/hive/warehouse/orders
drwxr-xr-x   - hdfs supergroup          0 2017-10-24 19:18 /user/hive/warehouse/original_access_logs
drwxrwxrwx   - root supergroup          0 2017-10-24 08:58 /user/hive/warehouse/products
[cloudera@quickstart ~]$ ls -ltr /opt/examples/log_files/
total 38668
-rw-r--r-- 1 cloudera cloudera 39593868 Jul 19 13:35 access.log.2
[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -ls /user/hive/warehouse/original_access_logs
[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -ls /user/hive/warehouse/original_access_logs
Found 1 items
-rw-r--r--   1 hdfs supergroup   39593868 2017-10-24 19:20 /user/hive/warehouse/original_access_logs/access.log.2
[cloudera@quickstart ~]$
Now you can create tables in Hive and query the uploaded data through Impala and Hue.
Run the following statements:
CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
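To see what the RegexSerDe extracts from each raw log line, here is a local Python sketch of the same pattern. The regular expression is the `input.regex` above with Hive's doubled-backslash escaping undone; the sample log line is invented for illustration.

```python
import re

# The RegexSerDe pattern, un-escaped for Python: nine capture groups matching
# ip, date, method, url, http_version, code1, code2, referer, user_agent.
LOG_RE = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# A made-up line in the combined access-log format the table expects.
line = ('127.0.0.1 - - [24/Oct/2017:19:20:00 -0400] '
        '"GET /product/123 HTTP/1.1" 200 512 "-" "Mozilla/5.0"')

m = LOG_RE.match(line)
ip, date, method, url, http_version, code1, code2, referer, user_agent = m.groups()
print(method, url)  # GET /product/123
```

Each capture group becomes one column of `intermediate_access_logs`, which is why the table declares exactly nine STRING columns.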
Run the statements. Note that they are executed in the Hive query editor in Hue.
Then, Impala needs to be told that some tables have been created through another tool (Hive). Statement:
invalidate metadata;
Run it, then query:
select count(*), url
from tokenized_access_logs
where url like '%\/product\/%'
group by url
order by count(*) desc;
Or, limited to the top 10:
select count(*), url
from tokenized_access_logs
where url like '%\/product\/%'
group by url
order by count(*) desc
limit 10;
Result:
Compare this result with the top-10 categories returned by the query in scenario 1:
You can see that some products are viewed many times but never sold.
With that, the question posed at the start of scenario 2 is answered.
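The comparison itself is just a set difference between the two top-10 lists. A minimal sketch, with entirely made-up product URLs standing in for the real query results:

```python
# Hypothetical top-10 excerpts: most-viewed URLs (from the Impala query above)
# and best sellers (from the scenario 1 report). The IDs are invented.
most_viewed = ["/product/1004", "/product/365", "/product/957"]
best_sellers = ["/product/365", "/product/502", "/product/1073"]

# Products that attract views but appear nowhere among the best sellers.
viewed_not_sold = [p for p in most_viewed if p not in best_sellers]
print(viewed_not_sold)
```

Items in `viewed_not_sold` are the interesting ones for the business: heavily browsed but never purchased, perhaps pointing at pricing or checkout problems.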
————————————————————————
Done.