Combining Hive Storage Formats and Compression: A Detailed Walkthrough

Enabling Snappy Compression on the Hadoop Cluster

  1. Look up the usage of the hadoop checknative command
[liujh@hadoop104 hadoop-2.7.2]$ hadoop
checknative [-a|-h]  check native hadoop and compression libraries availability
  2. Check which compression libraries Hadoop currently supports
[liujh@hadoop104 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:32:52 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:32:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib:    true /lib64/libz.so.1
snappy:  false 
lz4:     true revision:99
bzip2:   false
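When this check is repeated on several nodes, it can help to pull out only the libraries reported as unavailable. The following is a minimal sketch, not part of the original steps: missing_native_libs is a hypothetical helper name, and it assumes the "name: true|false" layout shown in the output above.

```shell
# Hypothetical helper: read `hadoop checknative` output on stdin and print
# the name of every library whose status column is "false".
missing_native_libs() {
  awk '$1 ~ /:$/ && $2 == "false" { sub(/:$/, "", $1); print $1 }'
}

# Usage on a node (log lines do not match the filter, so they are ignored):
#   hadoop checknative 2>&1 | missing_native_libs
```

For the output shown above, this would list snappy and bzip2, which is the state the following steps set out to change for Snappy.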
  3. Copy the hadoop-2.7.2.tar.gz package, compiled with Snappy support, into /opt/software on hadoop102
  4. Extract hadoop-2.7.2.tar.gz into the current directory
[liujh@hadoop102 software]$ tar -zxvf hadoop-2.7.2.tar.gz
  5. Go to /opt/software/hadoop-2.7.2/lib/native, where the shared libraries that provide Snappy support can be seen
[liujh@hadoop102 native]$ pwd
/opt/software/hadoop-2.7.2/lib/native
[liujh@hadoop102 native]$ ll
-rw-r--r--. 1 liujh liujh 472950 Sep  1 10:19 libsnappy.a
-rwxr-xr-x. 1 liujh liujh    955 Sep  1 10:19 libsnappy.la
lrwxrwxrwx. 1 liujh liujh     18 Dec 24 20:39 libsnappy.so -> libsnappy.so.1.3.0
lrwxrwxrwx. 1 liujh liujh     18 Dec 24 20:39 libsnappy.so.1 -> libsnappy.so.1.3.0
-rwxr-xr-x. 1 liujh liujh 228177 Sep  1 10:19 libsnappy.so.1.3.0
  6. Copy everything in /opt/software/hadoop-2.7.2/lib/native to /opt/module/hadoop-2.7.2/lib/native on the development cluster
[liujh@hadoop102 native]$ cp ../native/* /opt/module/hadoop-2.7.2/lib/native/
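  7. Distribute the native directory to the other cluster nodes (xsync is a custom sync script, not a built-in Hadoop command)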
[liujh@hadoop102 lib]$ xsync native/
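The xsync script itself is not shown in this post. As a rough sketch of what such a distribution step commonly looks like, assuming rsync over SSH and the same absolute path on every node, the hypothetical helper below only prints the commands so they can be reviewed before being run. The helper name and the host names in the example are assumptions.

```shell
# Hypothetical dry-run helper: print one rsync command per target host.
# Assumes the directory lives at the same absolute path on every node.
print_sync_commands() {
  dir=$1
  shift
  for host in "$@"; do
    printf 'rsync -av %s/ %s:%s/\n' "$dir" "$host" "$dir"
  done
}

# Example (host names are assumptions, modeled on the prompts in this post):
#   print_sync_commands /opt/module/hadoop-2.7.2/lib/native hadoop103 hadoop104
```

Printing instead of executing keeps the sketch safe to try; piping the output to sh would perform the actual copy.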
  8. Check the supported compression libraries again
[liujh@hadoop102 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:45:02 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:45:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib:    true /lib64/libz.so.1
snappy:  true /opt/module/hadoop-2.7.2/lib/native/libsnappy.so.1
lz4:     true revision:99
bzip2:   false
  9. Restart the Hadoop cluster and Hive

Testing Storage Formats and Compression

Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Compression-related properties of the ORC storage format:

Key                       Default      Notes
orc.compress              ZLIB         high-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size         262,144      number of bytes in each compression chunk
orc.stripe.size           268,435,456  number of bytes in each stripe
orc.row.index.stride      10,000       number of rows between index entries (must be >= 1000)
orc.create.index          true         whether to create row indexes
orc.bloom.filter.columns  ""           comma-separated list of column names for which bloom filters should be created
orc.bloom.filter.fpp      0.05         false positive probability for the bloom filter (must be > 0.0 and < 1.0)

Note: all ORC file parameters are set in the TBLPROPERTIES clause of the HQL statement.
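Because the codec is just a table property, tables that differ only in compression can share one DDL template. The sketch below is an illustration rather than part of the original steps: orc_table_ddl is a hypothetical helper that prints a create table statement with the log-table columns used in this post and the requested orc.compress value, and log_orc_zlib in the usage line is a made-up table name.

```shell
# Hypothetical helper: print ORC table DDL for a given table name and codec.
# $1 = table name, $2 = orc.compress value (NONE, ZLIB, or SNAPPY).
orc_table_ddl() {
  case "$2" in
    NONE|ZLIB|SNAPPY) ;;
    *) echo "unsupported orc.compress value: $2" >&2; return 1 ;;
  esac
  cat <<EOF
create table $1(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="$2");
EOF
}

# Usage: hive -e "$(orc_table_ddl log_orc_zlib ZLIB)"
```

Generating the statement this way keeps the column list in one place, so a comparison across codecs changes only the second argument.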

Creating an Uncompressed ORC Table

  1. Create the table
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");
  2. Insert the data
hive (default)> insert into table log_orc_none select * from log_text ;
  3. Check the size of the data after the insert
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/ ;

7.7 M /user/hive/warehouse/log_orc_none/000000_0

Creating a SNAPPY-Compressed ORC Table

  1. Create the table
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");
  2. Insert the data
hive (default)> insert into table log_orc_snappy select * from log_text ;
  3. Check the size of the data after the insert
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/ ;

3.8 M /user/hive/warehouse/log_orc_snappy/000000_0

The ORC table created with default settings in the previous section had the following size after the same data was loaded:

2.8 M /user/hive/warehouse/log_orc/000000_0
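This is even smaller than the Snappy-compressed table. The reason is that ORC files use ZLIB compression by default, and ZLIB is based on the DEFLATE algorithm, which compresses this data more tightly than Snappy does.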
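To put the three measurements side by side, the sizes reported by dfs -du in this post can be expressed relative to the uncompressed ORC file. This is plain arithmetic on those figures, sketched with awk; the sizes are hard-coded in MB and compression_report is a hypothetical function name.

```shell
# Sizes (MB) are copied from the dfs -du output in this post.
compression_report() {
  awk 'BEGIN {
    base = 7.7                      # uncompressed ORC (orc.compress=NONE)
    n = split("NONE:7.7 SNAPPY:3.8 ZLIB:2.8", rows, " ")
    for (i = 1; i <= n; i++) {
      split(rows[i], kv, ":")
      printf "%-6s %4.1f MB  %5.1f%% of uncompressed\n", kv[1], kv[2], kv[2] / base * 100
    }
  }'
}

compression_report
```

On these numbers Snappy comes out at about 49.4% of the uncompressed size and ZLIB at about 36.4%. The figures describe this one dataset only; they say nothing about compression or decompression speed.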

Summary of Storage Formats and Compression

In real-world projects, Hive table data is usually stored as ORC or Parquet, and the compression codec is usually Snappy or LZO.
