Available in VPC
This section describes how to manage Iceberg tables using Hive on NAVER Cloud Platform's Cloud Hadoop.
Apache Iceberg overview
What Is Apache Iceberg?
Iceberg is a high-performance table format designed for analyzing massive datasets.
It not only provides the reliability and simplicity of SQL tables in a big data environment, but also enables simple integration with widely used data processing frameworks such as Spark, Flink, Hive, and Trino.
Apache Iceberg features
- Expressive SQL: supports widely used SQL commands for reading, writing, and updating data. Anyone familiar with SQL commands can build a data lake and perform most data lake operations with Apache Iceberg, without needing to learn a new language.
- Data consistency: ensures that all users reading and writing data see the same data, thereby guaranteeing consistency.
- Full schema evolution: even if the schema changes, there is no need to rewrite the table. You can add columns, rename columns, or remove columns in the data table without modifying the underlying data.
- Hidden partitioning: Iceberg derives partition values from column values, optionally applying transforms. Because you don't have to maintain physical partition columns yourself, partitioning remains hidden from users.
- Time travel and rollback: Iceberg supports data versioning, so you can track how data changes over time. You can query earlier versions of the data, and the time travel function lets you inspect the changes made by update and delete operations.
- Data compaction: data compaction is natively supported, giving you various options for optimizing file layout and size.
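As a rough illustration of full schema evolution, the HiveQL statements below change a table's schema without rewriting its data. This is a sketch only: it assumes an existing Iceberg table named `dct_iceberg_table` (the example table used later in this guide) and a hypothetical `grade` column, and whether each statement is supported depends on your Hive and Iceberg versions.

```sql
-- Illustrative only: assumes an Iceberg table dct_iceberg_table already exists.
-- Add a new column; existing data files are not rewritten.
ALTER TABLE dct_iceberg_table ADD COLUMNS (grade int);

-- Rename an existing column; the underlying files are untouched.
ALTER TABLE dct_iceberg_table CHANGE COLUMN category category_name string;
```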
Using Iceberg in Data Catalog
- In the Data Catalog console, you can create and view Iceberg tables and then manage the data according to the guide below.
- Later on, additional functions such as data compaction, snapshot management, and table optimization will be provided. (Scheduled in the second half of 2025)
Cloud Hadoop settings
Set the Cloud Hadoop cluster's Hive Metastore repository to Data Catalog.
Accessing Hue in the Cloud Hadoop Ecosystem
- Click the cluster, and then check the domain address on the cluster details screen.
- In your web browser's address bar, enter the domain address and port number as follows, then access the Hue web page:
https://{domain address}:8081
Hue (Hadoop User Experience) is a web-based interface that works with Apache Hadoop clusters. Bundled with other Hadoop ecosystem components, Hue can be used to run Hive tasks and Spark jobs.
Verifying database and table created in Data Catalog
You can check in Hue the database and Iceberg type tables created in the Data Catalog console.
For information on creating Iceberg tables in the Data Catalog console, see Table guide.
When creating an Iceberg-type table directly in Hive, you must set the external.table.purge=FALSE table property.
-- Example table creation query
CREATE EXTERNAL TABLE `dct_iceberg_table`(
  `id` bigint,
  `name` string,
  `category` string)
ROW FORMAT SERDE
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION
  's3a://{table location}'
TBLPROPERTIES (
  'external.table.purge'='FALSE',
  'table_type'='ICEBERG');
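To confirm the table was registered with the expected properties, you can inspect it with standard Hive commands. The exact output layout varies by Hive version; this is a quick sanity check, not part of the required setup.

```sql
-- Check the storage handler, location, and table properties.
DESCRIBE FORMATTED dct_iceberg_table;

-- Or list just the table properties, including external.table.purge.
SHOW TBLPROPERTIES dct_iceberg_table;
```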
Settings for Hive query execution
- Add the Iceberg runtime library using add jar.
add jar /usr/nch/3.1.0.0-78/hive/lib/iceberg-hive-runtime-1.2.1.jar;
- Add the libfb303 library using add jar.
add jar /usr/nch/3.1.0.0-78/hive/lib/libfb303-0.9.3.jar;
- Set the option values.
SET hive.vectorized.execution.enabled=false;
SET iceberg.engine.hive.lock-enabled=false;
SET hive.execution.engine=mr;
If you do not set these options, INSERT queries execute without errors, but the results may not appear when you query the table.
- hive.vectorized.execution.enabled: because Iceberg can conflict with Hive's vectorization feature, you must set this to false.
- iceberg.engine.hive.lock-enabled: disable this so that Iceberg does not use the lock feature in Hive tables. If enabled, it may limit write operations.
- hive.execution.engine: because Iceberg may not fully integrate with Hive's Tez engine, set this to mr (MapReduce).
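In the Hive CLI or Beeline, running SET with a property name but no value prints the property's current value, which is a quick way to confirm that the settings above took effect in your session:

```sql
-- Each statement prints the current value of the property.
SET hive.vectorized.execution.enabled;
SET iceberg.engine.hive.lock-enabled;
SET hive.execution.engine;
```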
Run Hive query
- Insert data using INSERT.
INSERT INTO dct_iceberg_table VALUES (1, 'John Doe', 'student');
- Check the data with SELECT.
SELECT * FROM dct_iceberg_table;
Changes in Iceberg files upon query execution
- Data files
- When storing table data, Iceberg uses columnar formats such as Parquet, Avro, or ORC (the default is Parquet).
- New data files are created according to the file format specified in Iceberg settings.
- They are saved in the s3a://{bucket name}/{table location}/data/ subdirectory.
- Iceberg uses a unique UUID-based file naming convention to avoid conflicts.
- Manifest file
- This file stores metadata for each data file.
- It includes the data file path, minimum/maximum values (stats) of the file, and partition details.
- It is stored in the s3a://{bucket name}/{table location}/metadata/ subdirectory.
- A typical file name is in {UUID}.avro format.
- Manifest list file
- This manages a list of manifest files, and when a new INSERT task is performed, Iceberg updates changes based on the previous manifest list.
- It is stored in the s3a://{bucket name}/{table location}/metadata/ subdirectory.
- A typical file name is in snap-{snapshot-id}-{UUID}.avro format.
- Metadata JSON file
- After each task (INSERT, DELETE, UPDATE), Iceberg creates a new snapshot of the table.
- This file captures the table's entire state, supporting rollback and time travel functions.
- It is stored in the s3a://{bucket name}/{table location}/metadata/ subdirectory.
- A typical file name is in *.metadata.json format.
- Iceberg table structural changes
s3a://{bucket name}/{table-location}/
├── data/
│   ├── {UUID}.parquet
│   └── {UUID}.parquet
└── metadata/
    ├── snap-{snapshot-id}-{UUID}.avro
    ├── {UUID}.avro
    ├── {UUID}.avro
    └── *.metadata.json
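If you want to inspect these files directly from a cluster node, a recursive hadoop fs listing against the table location shows the same layout. The bucket and table location placeholders below are the ones used throughout this guide; substitute your own values.

```
hadoop fs -ls -R s3a://{bucket name}/{table location}/
```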
Note on supported functions by version
- Check the Hive version in Hue.
SELECT version();
- Depending on your Hive version, Delete and Update may not be supported.
For more information on SQL support by Iceberg across different Hive versions, see Iceberg documentation.