Big Data with Hive (尚硅谷): Hive Basic Concepts, Hive Data Structures, and the Hive DDL

Published: 2024-04-10 Source: 数媒在线课堂

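Hive Basic Concepts

Hive was open-sourced by Facebook to compute statistics over massive volumes of structured logs.

Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL-like querying.

In essence, Hive translates HQL into MapReduce programs:

1) The data Hive processes is stored in HDFS

2) The underlying implementation of Hive's data analysis is MapReduce

3) The resulting programs run on Yarn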
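To see this translation in practice, Hive's EXPLAIN statement prints the execution plan for a query instead of running it. A minimal sketch, assuming a hypothetical table access_log with a status column (neither appears in the original article):

```sql
-- Hypothetical table: access_log(status string, ...), used only for illustration.
-- EXPLAIN prints the stages Hive would run for this query rather than executing it.
-- On the MapReduce engine, an aggregation like this typically shows map and reduce operators.
EXPLAIN
SELECT status, count(*) AS hits
FROM access_log
GROUP BY status;
```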

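Advantages

The interface uses SQL-like syntax, which enables rapid development (simple and easy to learn).

It avoids writing MapReduce by hand, which lowers the learning cost for developers.

Hive has relatively high execution latency, so it is typically used for data analysis where real-time response is not required.

Hive's strength is processing large data sets; it has no advantage on small ones because of its relatively high execution latency.

Hive supports user-defined functions, so users can implement functions that fit their own needs.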
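As a rough illustration of how a user-defined function is made available to a session, the sketch below registers a function from a jar and calls it. The jar path, the class name com.example.udf.MyLower, and the function name my_lower are all hypothetical; only the ADD JAR and CREATE TEMPORARY FUNCTION statements themselves are standard Hive.

```sql
-- Assumptions: the jar path and the class com.example.udf.MyLower are hypothetical;
-- the student table is the one used later in this article.
ADD JAR /opt/module/jars/my_udf.jar;

-- Bind a session-scoped function name to the UDF class.
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.MyLower';

-- Use it like any built-in function.
SELECT my_lower(name) FROM student;
```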

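Disadvantages

HQL has limited expressive power

Iterative algorithms cannot be expressed

It is not well suited to data mining

Hive is relatively inefficient

The MapReduce jobs that Hive generates automatically are usually not very intelligent

Tuning Hive is difficult, and the tuning granularity is coarse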

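Hive Architecture

User interface: Client

CLI (command-line interface), JDBC/ODBC (access Hive through JDBC), and WEBUI (access Hive from a browser)

Metadata: Metastore

Metadata includes the table name, the database the table belongs to (default is the default), the table owner, the column and partition fields, the table type (whether it is an external table), the directory holding the table's data, and so on.

By default it is stored in the bundled derby database; storing the Metastore in MySQL is recommended

Hadoop

HDFS is used for storage and MapReduce for computation.

Driver

(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as antlr. It then analyzes the AST, for example checking whether the tables and columns exist and whether the SQL is semantically valid.

(2) Compiler (Physical Plan): compiles the AST into a logical execution plan.

(3) Optimizer (Query Optimizer): optimizes the logical execution plan.

(4) Executor (Execution): converts the logical execution plan into a runnable physical plan. For Hive, that means MR/Spark.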

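Hive Compared with Databases

Data volume

Hive has high latency; MySQL has low latency

Hive stores its data in a distributed file system, while MySQL stores its data in the local file system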

…

Hive Data Types

Hive DDL (Data Definition Language)

Creating a database:

CREATE DATABASE [IF NOT EXISTS] database_name

[COMMENT database_comment]

[LOCATION hdfs_path]

[WITH DBPROPERTIES (property_name=property_value, ...)];
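For illustration, here is one statement that exercises every optional clause in the syntax above. The database name db_demo and all of its values are hypothetical.

```sql
-- db_demo, its comment, its path, and its property are hypothetical values.
CREATE DATABASE IF NOT EXISTS db_demo
COMMENT 'demo database'
LOCATION '/db_demo.db'
WITH DBPROPERTIES ('creator'='demo_user');

-- Show the comment, location, and properties that were stored.
DESC DATABASE EXTENDED db_demo;
```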

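Create a database; its default storage path on HDFS is /user/hive/warehouse/*.db:

create database db_hive;

To avoid an error when the database already exists, add an if not exists check (the standard form):

create database if not exists db_hive;

Create a database and specify where it is stored on HDFS:

create database db_hive2 location '/db_hive2.db';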

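Querying Databases

Show databases: show databases;

Filter the databases shown: show databases like 'db_hive*';

View database information: desc database db_hive;

View detailed database information: desc database extended db_hive;

Switch the current database: use db_hive;

Modify a database: alter database db_hive set dbproperties('createtime'='20170830');

Drop a database: drop database db_hive2;

If the database to be dropped may not exist, use if exists to check first: drop database if exists db_hive2;

If the database is not empty, use cascade to force the drop: drop database db_hive cascade;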

Creating Tables

Table creation syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

[CLUSTERED BY (col_name, col_name, ...)

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]

[TBLPROPERTIES (property_name=property_value, ...)]

[AS select_statement]

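Explanation of the Clauses

(1) CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; use the IF NOT EXISTS option to ignore that exception.

(2) The EXTERNAL keyword creates an external table, and a path to the actual data (LOCATION) can be given at creation time. When a table is dropped, a managed (internal) table loses both its metadata and its data, while an external table loses only its metadata and keeps its data.

(3) COMMENT: adds a comment to a table or a column.

(4) PARTITIONED BY creates a partitioned table

(5) CLUSTERED BY creates a bucketed table

(6) SORTED BY is rarely used; it additionally sorts one or more columns within each bucket

(7) ROW FORMAT

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

When creating a table, you can specify a custom SerDe or use a built-in one. If ROW FORMAT is not specified, or ROW FORMAT DELIMITED is used, the built-in SerDe is used. When creating a table you also declare its columns, and alongside them you can name a custom SerDe; Hive uses the SerDe to determine the data for each column of the table.

SerDe is short for Serialize/Deserialize. Hive uses a SerDe to serialize and deserialize row objects.

(8) STORED AS specifies the storage file format

Common file formats: SEQUENCEFILE (binary sequence file), TEXTFILE (text), and RCFILE (columnar storage format)

If the data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

(9) LOCATION: specifies where the table is stored on HDFS.

(10) AS: followed by a query; creates the table from the query result.

(11) LIKE copies the structure of an existing table without copying its data.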
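The ROW FORMAT DELIMITED options are easiest to follow with a table that has collection columns. The sketch below assumes a hypothetical table person_info and a hypothetical input line layout; neither comes from the original article.

```sql
-- Hypothetical table for input lines such as:
--   alice,bob_carol,dave:8_erin:10,main street_springfield
CREATE TABLE IF NOT EXISTS person_info(
  name     string,
  friends  array<string>,
  children map<string, int>,
  address  struct<street:string, city:string>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- separates the four columns
  COLLECTION ITEMS TERMINATED BY '_'  -- separates array elements, map entries, struct fields
  MAP KEYS TERMINATED BY ':'          -- separates a map key from its value
  LINES TERMINATED BY '\n'            -- one row per line
STORED AS TEXTFILE;
```

With this layout, friends[0], children['dave'], and address.city address the individual elements of a row.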

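Managed Tables:

Tables are created as managed tables by default; they are sometimes also called internal tables. For these tables, Hive controls (more or less) the life cycle of the data. By default Hive stores their data in a subdirectory of the directory defined by the configuration property hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When a managed table is dropped, Hive also deletes the data in the table. Managed tables are not suitable for sharing data with other tools.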
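In the Hive CLI, a set command that names only a property prints that property's current value, which is one way to check where managed tables will be stored. A short sketch; the value printed depends on your configuration:

```sql
-- Prints the current setting of the warehouse directory property.
set hive.metastore.warehouse.dir;
```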

Create a table the ordinary way:

create table if not exists student2(

id int, name string

)

row format delimited fields terminated by '\t'

stored as textfile

location '/user/hive/warehouse/student2';

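Create a table from a query result (the rows returned by the query are inserted into the new table)

create table if not exists student3 as select id, name from student;

Create a table from the structure of an existing table

create table if not exists student4 like student;

Check the table type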

hive (default)> desc formatted student2;

Table Type: MANAGED_TABLE

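External Tables

Because the table is external, Hive does not consider itself the full owner of the data. Dropping the table does not delete the data, although the metadata that describes the table is deleted.

When to Use Managed and External Tables

Website logs are collected every day and written to text files in HDFS on a regular schedule. Extensive statistical analysis is built on the external table (the raw log table); the intermediate and result tables it produces are stored as internal tables, and data enters them through SELECT + INSERT.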
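The pattern described above might look like the following sketch. Every table name, column, and path here is hypothetical and only illustrates the split between an external raw table and a managed result table.

```sql
-- All names and paths are hypothetical.
-- Raw logs live in an external table, so dropping it leaves the files in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_log(
  ip string, url string, visit_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/raw';

-- The result table is managed: Hive owns its data and removes it on DROP TABLE.
CREATE TABLE IF NOT EXISTS url_pv(
  url string, pv bigint
);

-- Populate the managed table with SELECT + INSERT.
INSERT OVERWRITE TABLE url_pv
SELECT url, count(*) FROM raw_log GROUP BY url;
```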

Example:

Upload the data to HDFS:

hive (default)> dfs -mkdir /student;

hive (default)> dfs -put /opt/module/datas/student.txt /student;

Table creation statement:

Create the external table

hive (default)> create external table stu_external(
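The statement above is cut off in the source. One plausible completion is sketched here, assuming that student.txt holds tab-separated id and name fields (the same layout as student2) and that the table should point at the /student directory created in the upload step. The columns, delimiter, and location are assumptions, not the original text.

```sql
-- Assumed columns, delimiter, and location; the original statement is truncated.
create external table stu_external(
  id int, name string
)
row format delimited fields terminated by '\t'
location '/student';
```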

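Query the new table:

hive (default)> select * from stu_external;

View the formatted table description:

hive (default)> desc formatted stu_external;

Drop the external table:

hive (default)> drop table stu_external;

After the external table is dropped, the data is still in HDFS, but the metadata for stu_external has been removed from the metastore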

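Converting Between Managed and External Tables

Check the table type

hive (default)> desc formatted student2;

Change the internal table student2 to an external table

alter table student2 set tblproperties('EXTERNAL'='TRUE');

Change the external table student2 back to an internal table

alter table student2 set tblproperties('EXTERNAL'='FALSE');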

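Partitioned Tables

A partition corresponds to a separate directory in HDFS that holds all the data files for that partition. In Hive, partitioning simply means splitting data into directories: a large data set is divided into smaller ones according to business needs. A query can then use expressions in its WHERE clause to select only the partitions it needs, which makes it much more efficient.

Basic Partitioned-Table Operations

Why partitioned tables are needed (logs must be managed by date)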

/user/hive/warehouse/log_partition/20170702/20170702.log

/user/hive/warehouse/log_partition/20170703/20170703.log

/user/hive/warehouse/log_partition/20170704/20170704.log

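Creating a partitioned table

hive (default)> create table dept_partition(

deptno int, dname string, loc string

)

partitioned by (month string)

row format delimited fields terminated by '\t';

Note: the partition column must not be a column that already exists in the table; it can be thought of as a pseudo-column of the table.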

Loading data into the partitioned table

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');

Querying data in the partitioned table

Single-partition query

hive (default)> select * from dept_partition where month='201709';

Multi-partition union query

hive (default)> select * from dept_partition where month='201709'

union

select * from dept_partition where month='201708'

union

select * from dept_partition where month='201707';
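As an alternative sketch, the same three partitions can be selected with a single predicate on the partition column. This is not from the original article, and it is not strictly equivalent to the union: UNION removes duplicate rows, while the query below keeps them.

```sql
-- Same three partitions, selected with one predicate on the partition column.
-- Unlike UNION, this does not remove duplicate rows.
select * from dept_partition where month in ('201707', '201708', '201709');
```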

Adding partitions

Add a single partition

hive (default)> alter table dept_partition add partition(month='201706');

Add several partitions at once

hive (default)> alter table dept_partition add partition(month='201705') partition(month='201704');

Dropping partitions

Drop a single partition

hive (default)> alter table dept_partition drop partition (month='201704');

Drop several partitions at once

hive (default)> alter table dept_partition drop partition (month='201705'), partition (month='201706');

List the partitions of a partitioned table

hive> show partitions dept_partition;

View the structure of a partitioned table

hive> desc formatted dept_partition;

Notes on Partitioned Tables

Create a table with two partition levels

hive (default)> create table dept_partition2(

deptno int, dname string, loc string

)

partitioned by (month string, day string)

row format delimited fields terminated by '\t';

Loading data the normal way

(1) Load data into the two-level partitioned table

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table

default.dept_partition2 partition(month='201709', day='13');

(2) Query the partition data

hive (default)> select * from dept_partition2 where month='201709' and day='13';

Three ways to associate a partitioned table with data uploaded directly into a partition directory

Method 1: upload the data, then repair

Upload the data

hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;

hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;

Query the data (the newly uploaded data is not found)

hive (default)> select * from dept_partition2 where month='201709' and day='12';

Run the repair command

hive> msck repair table dept_partition2;

Query the data again

hive (default)> select * from dept_partition2 where month='201709' and day='12';
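A quick way to confirm what the repair did is to list the partitions registered in the metastore, using the show partitions statement introduced earlier. A minimal sketch:

```sql
-- Lists the partitions the metastore knows about for this table.
-- After msck repair, the month=201709/day=12 partition should be among them.
show partitions dept_partition2;
```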

Method 2: upload the data, then add the partition

Upload the data

hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;

hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;

Add the partition

hive (default)> alter table dept_partition2 add partition(month='201709',

day='11');

Query the data

hive (default)> select * from dept_partition2 where month='201709' and day='11';

Method 3: create the directory, then load the data into the partition

Create the directory

hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=10;

Load the data

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table

dept_partition2 partition(month='201709',day='10');

Query the data

hive (default)> select * from dept_partition2 where month='201709' and day='10';

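Altering Tables

Renaming a Table

Syntax

ALTER TABLE table_name RENAME TO new_table_name

Example

hive (default)> alter table dept_partition2 rename to dept_partition3;

Adding, Modifying, and Dropping Table Partitions

See the basic partitioned-table operations above

Adding, Modifying, and Replacing Columns

Syntax:

Change a column

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

Add or replace columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

Note: ADD appends new columns after all existing columns (but before the partition columns), while REPLACE replaces all of the table's columns.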

Example

(1) View the table structure

hive> desc dept_partition;

(2) Add a column

hive (default)> alter table dept_partition add columns(deptdesc string);

(3) View the table structure

hive> desc dept_partition;

(4) Change a column

hive (default)> alter table dept_partition change column deptdesc desc int;

(5) View the table structure

hive> desc dept_partition;

(6) Replace the columns

hive (default)> alter table dept_partition replace columns(deptno string, dname

string, loc string);

(7) View the table structure

hive> desc dept_partition;

Dropping a Table

hive (default)> drop table dept_partition;

This article was sourced from the internet and will be removed on request.

Original link: https://blog.csdn.net/qq_41864303/article/details/106179411
