Operating Hudi Tables with the Apache Hudi Spark SQL Integration

2022-03-31 13:07:52

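1. Overview

The PR that integrates Hudi with Spark SQL, long awaited by the community, is under active review and close to completion. The integration is expected to be officially released in the next version. Once it lands, DDL and DML operations on Hudi tables become much more convenient. The sections below show how to operate on Hudi tables with Spark SQL.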

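2. Environment Setup

First, pull the PR locally and build it to produce the SPARK_BUNDLE_JAR (hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar).

2.1 Start spark-sql

After setting up the Spark environment, start spark-sql with the following command: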

spark-sql --jars $PATH_TO_SPARK_BUNDLE_JAR  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

2.2 Set Parallelism

Hudi's default parallelism for upsert/insert/delete is 1500, so for the small dataset used in this demo a lower parallelism can be set.

set hoodie.upsert.shuffle.parallelism = 1;
set hoodie.insert.shuffle.parallelism = 1;
set hoodie.delete.shuffle.parallelism = 1;

Also disable metadata sync for the Hudi table:

set hoodie.datasource.meta.sync.enable=false;

3. Create Table

Create the table with the following SQL:

create table test_hudi_table (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
 partitioned by (dt)
 options (
  primaryKey = 'id',
  type = 'mor'
 )
 location 'file:///tmp/test_hudi_table';

Note: the table type is MOR, the primary key is id, the partition field is dt, and the merge (pre-combine) field defaults to ts.
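
For comparison, a copy-on-write table with the merge field spelled out explicitly might be declared as follows. This is a minimal sketch and not part of the original walkthrough: the table name test_hudi_table_cow and its location are hypothetical, and the preCombineField option name is the one documented for the Hudi 0.9.0 SQL DDL, so it may differ in the PR build used here.

```sql
-- Hypothetical COW counterpart of test_hudi_table (name and path are assumptions).
create table test_hudi_table_cow (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
 partitioned by (dt)
 options (
  primaryKey = 'id',
  -- Assumed option name: states the merge field explicitly instead of relying on the ts default.
  preCombineField = 'ts',
  type = 'cow'
 )
 location 'file:///tmp/test_hudi_table_cow';
```

The rest of this article continues with the MOR table test_hudi_table.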

After creating the Hudi table, inspect its definition:

show create table test_hudi_table;

4. Insert Into

4.1 Insert

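Insert one record with the following SQL:

 insert into test_hudi_table select 1 as id, 'hudi' as name, 10 as price, 1000 as ts, '2021-05-05' as dt;

After the insert completes, inspect the local directory structure of the Hudi table. The generated metadata, partitions, and data are the same as those produced by a Spark Datasource write.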
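
The insert above relies on dynamic partitioning, where dt comes from the select list. A static-partition form might look like the sketch below. It assumes the PR build accepts the partition (...) clause in the same way as the Hudi 0.9.0 SQL syntax, and the values are made up for illustration. Running it adds a second record (id = 2), so the query results described later in this article would differ; skip it to follow the walkthrough exactly.

```sql
-- Static partition: dt is fixed in the partition clause, so the select list omits it.
-- The record values (id = 2, 'hudi2', 20) are illustrative assumptions.
insert into test_hudi_table partition (dt = '2021-05-05')
select 2 as id, 'hudi2' as name, 20 as price, 1000 as ts;
```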

4.2 Select

Query the Hudi table with the following SQL:

select * from test_hudi_table;

The query returns the record that was just inserted.
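
Hudi also stores its own metadata columns alongside the user-defined ones. A sketch of a query that projects a few of them, assuming the standard _hoodie_* column names that Hudi writes into every record:

```sql
-- Show which commit wrote each record, plus its record key and partition path.
select _hoodie_commit_time,
       _hoodie_record_key,
       _hoodie_partition_path,
       id, name, price, ts, dt
from test_hudi_table;
```

Comparing _hoodie_commit_time before and after the update in the next section is one way to confirm that a new commit touched the record.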

5. Update

5.1 Update

Change the price of the record with id 1 to 20 with the following SQL:

update test_hudi_table set price = 20.0 where id = 1;

5.2 Select

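Query the Hudi table again:

select * from test_hudi_table;

The result shows that price has changed to 20.0.

Inspecting the local directory structure of the Hudi table shows that the update produced another deltacommit, along with an incremental log file.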

6. Delete

6.1 Delete

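Delete the record with id = 1 with the following SQL:

delete from test_hudi_table where id = 1;

Inspecting the local directory structure of the Hudi table shows that the delete produced another deltacommit, along with an incremental log file.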

6.2 Select

Query the Hudi table again:

select * from test_hudi_table;

The query returns no rows, which shows that the Hudi table no longer contains any records.

7. Merge Into

7.1 Merge Into Insert

Insert data into test_hudi_table with the following SQL:

 merge into test_hudi_table as t0
 using (
  select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-03-21' as dt
 ) as s0
 on t0.id = s0.id
 when not matched and s0.id % 2 = 1 then insert *;

7.2 Select

Query the Hudi table:

select * from test_hudi_table;

The result shows that the Hudi table contains one record.

7.3 Merge Into Update

Update the data with the following SQL:

 merge into test_hudi_table as t0
 using (
  select 1 as id, 'a1' as name, 12 as price, 1001 as ts, '2021-03-21' as dt
 ) as s0
 on t0.id = s0.id
 when matched and s0.id % 2 = 1 then update set *;

7.4 Select

Query the Hudi table:

select * from test_hudi_table;

The result shows that the record in the Hudi table has been updated with the new price and ts values.

7.5 Merge Into Delete

Delete the data with the following SQL:

merge into test_hudi_table t0
 using (
  select 1 as s_id, 'a2' as s_name, 15 as s_price, 1001 as s_ts, '2021-03-21' as dt
 ) s0
 on t0.id = s0.s_id
 when matched and s_ts = 1001 then delete;

Querying the table again shows that the Hudi table no longer contains any data.
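
The insert and update clauses shown separately above can, in principle, be combined into a single upsert statement. The following is a sketch under the assumption that the PR build accepts both clauses in one MERGE INTO, as standard Spark SQL syntax does; it reuses the source row from the earlier examples and is not part of the original walkthrough.

```sql
-- Upsert: update the row when the key already exists, insert it otherwise.
merge into test_hudi_table as t0
using (
  select 1 as id, 'a1' as name, 12 as price, 1001 as ts, '2021-03-21' as dt
) as s0
on t0.id = s0.id
when matched then update set *
when not matched then insert *;
```

Run at this point in the walkthrough, the table is empty, so the not matched branch would apply and the record would be inserted again.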

8. Drop Table

Drop the Hudi table with the following command:

drop table test_hudi_table;

Use show tables to check whether the table still exists:

show tables;

The table is no longer listed.

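9. Conclusion

The examples above briefly demonstrated inserting, updating, and deleting Hudi table data through Spark SQL. SQL makes it very convenient to operate on Hudi tables and lowers the barrier to using Hudi. Work on the Hudi Spark SQL integration will continue to refine the syntax, aiming to align with Snowflake and BigQuery syntax where possible, for example multi-table inserts (INSERT ALL WHEN condition1 INTO t1 WHEN condition2 INTO t2), schema changes, and Hudi table services such as CALL Cleaner and CALL Clustering.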
