Scanner

Available in VPC

Scanner infers the schema of the source data and uses classifiers to create an appropriate table for the data. You can set an execution cycle for a scanner to periodically collect data and keep its metadata up to date. In the Scanner menu, you can create, run, and manage scanners.

Scanner list interface

The Scanner menu of Data Catalog consists of the following areas:

Area	Description
① Menu name	Shows the current menu name and the number of scanners currently being viewed.
② Basic features	Features shown when you first open the Scanner menu. [Create scanner]: See Create a scanner. [Learn more]: Go to the Data Catalog overview page. [Refresh]: Refresh the scanner list.
③ Features after creation	Features that are enabled after you create a scanner. [Run]: Run the scanner (see Run a scanner). [Run management]: Open the periodic run management menu.
④ Search bar	Search for scanners by scanner name or description.
⑤ Scanner list	A list of scanners being viewed. Click to view details.

Scanner details interface

The Scanner details interface consists of the following areas:

Area	Description
① Scanner name	The selected scanner name.
② Basic features	[Run]: Run now. [Edit]: Edit scanner information. [Delete]: Delete the scanner. [Run management]: Stop a running scanner, pause the schedule, or resume the schedule. [Refresh]: Refresh the scanner list.
③ Scanner information tab	Select the Basic information, Source data, Output data, or Execution cycle tab to view details.
④ Scanner information area	View the details of the tab selected in the Scanner information tab.
⑤ Execution history	View the scanner's runs and their detailed history.

Create a scanner

You can create a scanner by setting up the source data from which you want to collect metadata and the option information for running the scan. To create a scanner, follow these steps:

In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Catalog.
Navigate to the Scanner menu.
Click the [Create scanner] button.
Enter information on the source data to scan.
- Data type: Select a data source.
- Connection: Select a connection to connect to the data source.
  - You can create a connection by clicking the [Create connection] button. For more information, see Create connection.
  - If the data type is Cloud DB, the [Test connection] button appears when you select a connection. Click [Test connection] to verify the connection.
  - If the data type is Object Storage or Apache Iceberg, do not select a connection.
- Path: Enter the path of the source data to scan.
  - The scan also covers the sub-paths of the path you enter.
  - If the source data type is Object Storage, click the [+Settings] button to specify a detailed path for the bucket or sub-bucket.
  - If the source data type is Cloud DB, enter the table name to scan.
    - If you enter %, the scanner scans the entire database and creates a metadata table for each table.
  - If the source data type is Apache Iceberg, specify the metadata folder or its parent folder.
    - Example: If the Iceberg metadata is located under /iceberg_table/metadata/, specify the path as /iceberg_table/ or /iceberg_table/metadata/.
    - An Apache Iceberg scanner scans 1 table per scanner and supports format-version 1 and 2.
- Scan range: If the source data type is Object Storage, specify the number of files to scan. Files are read in file name order.
  - You can specify a value from 1 to 100. If you do not specify a value, all files are scanned.
  - The specified number of files is scanned for each leaf node under the specified path.
Enter the run options.
- Execution cycle: Enter the execution cycle for the scan.
  - On-demand: Run the scanner directly from the console without an execution cycle.
  - Daily/Weekly/Monthly: Run a scan at a set date and time.
  - Cron: Enter the execution cycle in the cron format.
- Pattern: Include or exclude specific data from metadata collection.
  - Enter patterns in the glob pattern format.
  - Exclusion settings take precedence over inclusion settings.
- Classifier: Select a classifier based on the data type and click the [Add] button to add a classifier.
  - You can set this option if the source data type is Object Storage.
  - You can create a classifier by clicking the [Create classifier] button. For more information, see Create classifier.
  - You can delete an added classifier by clicking its delete icon.
- Set partition: Select which partitioning form the scanner identifies as partitions.
  - You can set this option if the source data type is Object Storage.
  - If you do not check [Apply hive partitioning form only], all directory partitioning forms are identified as partitions.
  - If you check [Apply hive partitioning form only], only the Hive partitioning form is identified as a partition.
Click the [Next] button.
Enter the output data information and processing method to handle table updates.
- Database: Select the database to which the tables created by the scanner will belong.
  - You can create a database by clicking the [Create database] button. For more information, see Create database.
- Prefix: Enter a string to add before the name of the table to create.
  - If you do not enter a string, the table name is automatically created based on the name of the source data.
- When adding a schema: Select how the table is updated when a change in the schema of the source data is detected.
  - Update table definition: Create a new schema and delete metadata for the deleted data.
  - Add new columns only: Add a new schema, but keep the existing schema.
  - Ignore: Keep the existing schema.
- Merge table: If the file types and partition structures in the folder are the same, all data is merged and output as 1 table, regardless of the data structure of the files. (Merging subfields of struct type fields is not yet supported and will be provided later.)
- Table number limit: If the number of tables output after scanning exceeds the set number, table creation is canceled.
Click the [Next] button.
Enter a name and description for the scanner, check its settings, and click the [Save] button.
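The Pattern and Scan range settings described in the steps above can be illustrated with a small Python sketch. It is not the service's implementation: the function names is_collected and limit_per_leaf and the sample paths are hypothetical, Python's fnmatch module stands in for the scanner's glob matching (its * also matches / characters, which may differ from the scanner), and treating every directory that contains files as a leaf node is an assumption.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from posixpath import dirname


def is_collected(path, include=(), exclude=()):
    """Return True if `path` passes the include/exclude glob patterns."""
    # Exclusion settings take precedence over inclusion settings.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # Assumption: with no include pattern, every non-excluded path is collected.
    if not include:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include)


def limit_per_leaf(paths, scan_range=None):
    """Keep the first `scan_range` files of each directory, in file name order."""
    by_directory = defaultdict(list)
    for path in paths:
        by_directory[dirname(path)].append(path)
    kept = []
    for directory in sorted(by_directory):
        files = sorted(by_directory[directory])
        # scan_range=None models "no value entered": all files are scanned.
        kept.extend(files if scan_range is None else files[:scan_range])
    return kept


# Hypothetical object paths, for illustration only.
paths = [
    "logs/2024/app.csv",
    "logs/2024/app.csv.tmp",
    "logs/2024/debug.json",
    "logs/2025/app.csv",
]
selected = [
    p for p in paths
    if is_collected(p, include=["*.csv", "*.json"], exclude=["*.tmp"])
]
print(selected)
print(limit_per_leaf(selected, scan_range=1))
```

In this sketch, a path that matches both an include and an exclude pattern is rejected, which mirrors the precedence rule stated for the Pattern option.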

Note

You can create up to 30 Object Storage data type scanners.

Partitioning in Hive is a way to manage database tables efficiently and improve query performance. It divides a large data set into several smaller subsets and stores them separately, so that a query can scan only the relevant partitions and avoid unnecessary data scans.
Typically, data is stored in directories named in key=value form. For example, you can divide data by date, such as "month=01" and "day=01," or by a specific value, such as "type=A" and "type=B."
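As a rough illustration of the key=value directory form, the following Python sketch extracts partition keys and values from an object path. The function name hive_partitions and the sample paths are hypothetical, and the scanner's actual detection logic is not documented here. A path without key=value directories yields no partitions, which is how this sketch models the case where only the Hive partitioning form is identified.

```python
def hive_partitions(path):
    """Extract key=value directory segments from a path, in order."""
    partitions = {}
    # The last segment is treated as the file name, so it is skipped.
    for segment in path.strip("/").split("/")[:-1]:
        key, separator, value = segment.partition("=")
        if separator and key and value:
            partitions[key] = value
    return partitions


print(hive_partitions("sales/month=01/day=01/part-0000.csv"))
# {'month': '01', 'day': '01'}
print(hive_partitions("sales/2024/01/part-0000.csv"))
# {}
```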

Search for scanners and view information

To search for scanners you created and view their information:

In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Catalog.
Navigate to the Scanner menu.
In the search bar, enter the name or description of the scanner, then click the search icon.
Click a scanner name to open the details screen and review the following. For field descriptions, see Create a scanner.
- Basic information tab
  - Status: Scanner status.
  - Description: Scanner description.
  - Recent run results: Results from the most recent scanner run.
  - Last run date and time: The date and time of the most recent scanner run.
  - Created time: When the scanner was created.
  - Update date and time: The most recent date and time you edited the scanner settings.
- Source data tab
  - Data type: Type of data scanned.
  - Pattern: Include/exclude patterns for the scan target.
  - Path: Scan path.
  - Classifier: Classifiers applied during scan.
  - Partition settings: Partition format settings to scan.
- Output data tab
  - Database: Database that hosts the scanned result tables.
  - Prefix: Prefix for scanned result table names.
  - Schema change options: Update options for scan results.
  - Table count limit: Maximum number of tables to output when scanning.
  - Table merge: Whether to merge into a single table when scanning.
- Execution cycle tab
  - Execution cycle: The configured scanner schedule (shown with a strikethrough when paused).
- [Run history]: Click to view the latest 10 runs that match the search conditions.
  - Start/end date and time: The date and time the scan started and ended.
  - Run time: The time it took to run the scan.
  - Run results: The results of the scan run.
  - Result summary: Shows information such as the number of tables added or changed by the scan, the causes of failed scans, and the history of canceled scans. Click it to open a popup with details.
  - [View details] button: View the scan run logs in the CLA service.
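The schema change options shown on the Output data tab can be read as three merge policies. The following Python sketch illustrates them on a simple column-name-to-type map. The option identifiers, the function name apply_schema_change, and the sample columns are hypothetical, and the behavior for columns whose type changes under "Add new columns only" (kept as they are) is an assumption rather than documented behavior.

```python
def apply_schema_change(existing, detected, option):
    """Return the resulting column map for one of the three options.

    `existing` and `detected` map column names to type names.
    """
    if option == "update_table_definition":
        # The table follows the source: columns removed from the source disappear.
        return dict(detected)
    if option == "add_new_columns_only":
        # Existing columns are kept as they are; unseen columns are appended.
        merged = dict(existing)
        for name, type_name in detected.items():
            merged.setdefault(name, type_name)
        return merged
    if option == "ignore":
        # The existing schema is kept unchanged.
        return dict(existing)
    raise ValueError(f"unknown option: {option}")


# Hypothetical schemas, for illustration only.
existing = {"id": "bigint", "name": "string", "legacy": "string"}
detected = {"id": "bigint", "name": "string", "email": "string"}

for option in ("update_table_definition", "add_new_columns_only", "ignore"):
    print(option, list(apply_schema_change(existing, detected, option)))
```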

Run a scanner

You can run the scanner manually from the console.

Caution

Partition keys are created only on the first scan and are not added in subsequent scans. Therefore, if a partition key is added, delete the table and run the scan again. However, partition values can still be added between scans.
For a *.zip file that bundles several files, the scanner unzips the file and scans 1 arbitrarily chosen file.
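The partition key rule above can be checked before a rescan with a few lines of Python. This is only a sketch: the function name new_partition_keys and the sample key lists are hypothetical, and how you obtain the key lists (for example, from the table definition and the current directory layout) is left open.

```python
def new_partition_keys(table_keys, source_keys):
    """Return partition keys present in the source but missing from the table."""
    return [key for key in source_keys if key not in table_keys]


table_keys = ["month", "day"]            # keys recorded by the first scan
source_keys = ["month", "day", "type"]   # keys found in the current directory layout

missing = new_partition_keys(table_keys, source_keys)
if missing:
    print(f"Delete the table and rescan: new partition keys {missing}")
else:
    print("A rescan is enough: only partition values changed")
```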

Note

Scanners with an execution cycle run automatically according to their settings, and you can also run them manually from the console at any time.

To run a scanner:

In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Catalog.
Navigate to the Scanner menu.
Select a scanner and click [Run], or select a scanner, open its details, and click [Run].
- On the Scanner details page, the scan phases and progress are shown.
  - Object Storage scanner phases: INIT (initialize), SCAN_FILE (scanning files), CHECK_PARTITON (detecting partitions), MERGE_PARTITON (merging partitions), UPDATE_RESULT (sending results).
  - Cloud DB / JDBC / Iceberg scanner phases: INIT (initialize), SCAN_FILE (scanning tables), UPDATE_RESULT (sending results).
- When the run is completed, the scanner's Status is displayed as Pending for run and the Recent run results as Succeeded.
- To stop a scan, select the running scanner, then click [Run management] > Stop running.
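If you track scan progress outside the console, the phase sequences listed above can be modeled as ordered lists. The following Python sketch assumes the phase names exactly as they are displayed; the dictionary keys object_storage and table_source and the function name progress are hypothetical.

```python
# Phase names are copied as displayed on the Scanner details page.
PHASES = {
    "object_storage": [
        "INIT", "SCAN_FILE", "CHECK_PARTITON", "MERGE_PARTITON", "UPDATE_RESULT",
    ],
    # Cloud DB, JDBC, and Iceberg scanners share the shorter sequence.
    "table_source": ["INIT", "SCAN_FILE", "UPDATE_RESULT"],
}


def progress(scanner_kind, current_phase):
    """Return (position, total) of `current_phase` in the scanner's phase list."""
    phases = PHASES[scanner_kind]
    return phases.index(current_phase) + 1, len(phases)


print(progress("object_storage", "CHECK_PARTITON"))  # (3, 5)
print(progress("table_source", "UPDATE_RESULT"))     # (3, 3)
```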

Pause and resume the scanner execution cycle

You can pause automatic runs for scanners that have an execution cycle, or resume automatic runs for scanners you paused. To pause or resume the execution cycle:

In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Catalog.
Navigate to the Scanner menu.
Select a scanner and click [Run management], or click the scanner name, open details, and then click [Run management].
Depending on what you want to set, click Pause execution cycle or Resume execution cycle.
- Pause execution cycle: Pause the automatic runs of a scanner that is set to run periodically.
- Resume execution cycle: Resume the automatic runs of a paused scanner.

Edit scanner

To edit the information of a scanner you created:

Note

You cannot edit a running scanner.

In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Catalog.
Navigate to the Scanner menu.
Click the name of the scanner you want to edit to open scanner details.
Click the [Edit] button.
Edit the scanner's information on the scanner editing page.
- For more information on each of these items, see Create a scanner.
Once you have finished editing, click the [Save] button.

Delete scanner

To delete a scanner you created:

Caution

You cannot recover deleted scanners.

Note

You cannot delete a running scanner.

In the VPC environment of the NAVER Cloud Platform console, navigate to Services > Big Data & Analytics > Data Catalog.
Navigate to the Scanner menu.
Click the name of the scanner you want to delete to open the details screen.
Click the [Delete] button.
When the notification popup window appears, read the cautions and click the [Delete] button.