Using iceberg


Available in VPC

Iceberg is an open table format for very large analytic datasets. It adds tables to compute engines such as Presto and Spark, using a high-performance table format that works like a SQL table.

Iceberg components

Iceberg consists of three components that form a hierarchy: the Iceberg catalog, the metadata layer, and the data layer.

Iceberg Catalog Layer
Identifies where to find and read the data for a given table. The Iceberg catalog points to the table's current metadata, which is how a query locates the metadata file it needs to run.
Metadata Layer
Consists of the metadata file, the manifest list, and the manifest files. The metadata file holds the table's snapshot, partition, and schema information, which a query uses to find the required data quickly.
Data Layer
Stores the actual data files. The information in the manifest files is used to access the required data files.
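
The lookup path through the three layers can be sketched as follows. This is a minimal in-memory model in Python, not Iceberg's actual API: the class names, fields, table name, and file path are all hypothetical and only illustrate how a query moves from the catalog to the metadata file, then through the manifest list and manifest files, and finally to the data files.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ManifestFile:
    # A manifest file tracks a set of data files.
    data_files: list[str]


@dataclass
class ManifestList:
    # A manifest list names the manifest files that make up one snapshot.
    manifests: list[ManifestFile]


@dataclass
class MetadataFile:
    # The metadata file records the schema and the snapshots of the table.
    schema: dict[str, str]
    snapshots: dict[int, ManifestList]
    current_snapshot_id: int


@dataclass
class Catalog:
    # The catalog maps a table name to its current metadata file.
    tables: dict[str, MetadataFile] = field(default_factory=dict)

    def plan_scan(self, table_name: str) -> list[str]:
        """Return the data files a full scan of the current snapshot would read."""
        metadata = self.tables[table_name]  # catalog layer
        manifest_list = metadata.snapshots[metadata.current_snapshot_id]  # metadata layer
        return [
            path
            for manifest in manifest_list.manifests
            for path in manifest.data_files  # data layer
        ]


# Hypothetical table with one snapshot, one manifest file, and one data file.
catalog = Catalog()
catalog.tables["test.test_tbl"] = MetadataFile(
    schema={"id": "int"},
    snapshots={1: ManifestList([ManifestFile(["data/00000-0-example.parquet"])])},
    current_snapshot_id=1,
)
print(catalog.plan_scan("test.test_tbl"))
```

Because the catalog only holds a pointer to the current metadata file, a real engine can commit a new snapshot by swapping that pointer, which this sketch models as reassigning `current_snapshot_id`.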

Using iceberg

The following describes how to use Iceberg.

Caution

The following example is based on Iceberg version 1.2.1.

Test using hive shell

Start the Hive shell and set the session options.

[hive@dev-nch023-ncl ~]$ hive
Hive Session ID = cca75225-f55c-423b-b6c8-d8fb0
hive> set hive.vectorized.execution.enabled=false;
hive> set iceberg.engine.hive.lock-enabled=false;
hive> set tez.mrreader.config.update.properties=hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids;
hive> set hive.execution.engine=mr;

Create a database.

hive> create database test;
OK
Time taken: 2.182 seconds

Select the database.

hive> use test;
OK
Time taken: 0.278 seconds
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Create a table.

hive> CREATE EXTERNAL TABLE test_tbl (id int) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
OK
Time taken: 2.796 seconds

Add the Iceberg library to the session with add jar.

hive> add jar /usr/nch/3.1.0.0-78/hive/lib/iceberg-hive-runtime-1.2.1.jar;
Added [/usr/nch/3.1.0.0-78/hive/lib/iceberg-hive-runtime-1.2.1.jar] to class path
Added resources: [/usr/nch/3.1.0.0-78/hive/lib/iceberg-hive-runtime-1.2.1.jar]

Add the libfb303 library to the session with add jar.

hive> add jar /usr/nch/3.1.0.0-78/hive/lib/libfb303-0.9.3.jar;
Added [/usr/nch/3.1.0.0-78/hive/lib/libfb303-0.9.3.jar] to class path
Added resources: [/usr/nch/3.1.0.0-78/hive/lib/libfb303-0.9.3.jar]

Insert data with an INSERT statement.

hive> INSERT INTO test_tbl values (1);
Query ID = hive_20231012143056_a80b-fe72-472a-8773-4e7589
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
23/10/12 14:30:57 INFO client.AHSProxy: Connecting to Application History server at dev-nch023-ncl.nfra.io/10.168.142.23:10200
23/10/12 14:30:57 INFO client.AHSProxy: Connecting to Application History server at dev-nch023-ncl.nfra.io/10.168.142.23:10200
Starting Job = job_1696850670798_0017, Tracking URL = http://dev-nch2-ncl.nfra.io:8088/proxy/application_1696850670798_0017/
Kill Command = /usr/nch/3.1.0.0-78/hadoop/bin/mapred job  -kill job_1696850670798_0017
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
2023-10-12 14:31:07,818 Stage-2 map = 0%,  reduce = 0%
2023-10-12 14:31:16,035 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 5.33 sec
MapReduce Total cumulative CPU time: 5 seconds 330 msec
Ended Job = job_16968506_0017
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1   Cumulative CPU: 5.33 sec   HDFS Read: 173742 HDFS Write: 2611 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 330 msec
OK
Time taken: 22.507 seconds

Check the data with a SELECT statement.

hive> select * from test_tbl;
OK
1
Time taken: 0.493 seconds, Fetched: 1 row(s)

Check the table schema.

hive> show create table test_tbl;
OK
CREATE EXTERNAL TABLE `test_tbl`(
  `id` int COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'

LOCATION
  'hdfs://test-test/warehouse/tablespace/managed/hive/test.db/test_tbl'
TBLPROPERTIES (
  'bucketing_version'='2',
  'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"id","required":false,"type":"int"}]}',
  'current-snapshot-id'='128779159509',
  'current-snapshot-summary'='{"added-data-files":"1","added-records":"1","added-files-size":"407","changed-partition-count":"1","total-records":"1","total-files-size":"407","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}',
  'current-snapshot-timestamp-ms'='1697088677165',
  'engine.hive.enabled'='true',
  'external.table.purge'='TRUE',
  'last_modified_by'='hive',
  'last_modified_time'='1697088657',
  'metadata_location'='hdfs://test-test/warehouse/tablespace/managed/hive/test.db/test_tbl/metadata/00001-33b09b82-b9b9-4005-a804-3f7970fc23ec.metadata.json',
  'previous_metadata_location'='hdfs://test-test/warehouse/tablespace/managed/hive/test.db/test_tbl/metadata/00000-5a7c11d1-b12b-45a-a75a8c975f85.metadata.json',
  'snapshot-count'='1',
  'table_type'='ICEBERG',
  'transient_lastDdlTime'='1697088657',
  'uuid'='95dffef0-97e6-4ca2-ae01-b5bfde8')
Time taken: 0.315 seconds, Fetched: 25 row(s)
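
The snapshot properties in this output can also be read programmatically. The sketch below is in Python and assumes the two property values have been copied out of the output above by hand; the helper name `describe_snapshot` is hypothetical. It decodes the `current-snapshot-summary` JSON and converts `current-snapshot-timestamp-ms` (milliseconds since the Unix epoch) to a UTC timestamp.

```python
import json
from datetime import datetime, timezone

# Values copied from the TBLPROPERTIES shown above.
properties = {
    "current-snapshot-summary": (
        '{"added-data-files":"1","added-records":"1","added-files-size":"407",'
        '"changed-partition-count":"1","total-records":"1","total-files-size":"407",'
        '"total-data-files":"1","total-delete-files":"0",'
        '"total-position-deletes":"0","total-equality-deletes":"0"}'
    ),
    "current-snapshot-timestamp-ms": "1697088677165",
}


def describe_snapshot(props):
    """Summarize the current snapshot from Iceberg table properties (hypothetical helper)."""
    summary = json.loads(props["current-snapshot-summary"])
    # The timestamp is stored in milliseconds; drop the millisecond part.
    seconds = int(props["current-snapshot-timestamp-ms"]) // 1000
    committed_at = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "total_records": int(summary["total-records"]),
        "total_data_files": int(summary["total-data-files"]),
        "committed_at": committed_at.isoformat(),
    }


print(describe_snapshot(properties))
```

The summary reports one record in one data file, which matches the single row inserted earlier. The commit time works out to 05:31:17 UTC on 2023-10-12, which is consistent with the INSERT job log if the cluster clock is set to UTC+9.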
