Examples of using Object Storage Scanner

Available in VPC

Quickstart overview

This section describes how to create a scanner for an Object Storage bucket that infers the schema of your source data and creates tables from it.

Source data

The source data stored in Object Storage is air pollution measurement data. Each record contains measurement information such as the date and time, measurement station name, ozone concentration, and sulfur dioxide concentration. The files are stored in the partition structure used by the Hadoop File System and are divided by year and month.

Folder structure

/atmosphere-data
├── year=2022
│   ├── month=10
│   │   └── atomosphere_data_2022.10.csv
│   ├── month=11
│   │   └── atomosphere_data_2022.11.csv
│   └── month=12
│       └── atomosphere_data_2022.12.csv
└── year=2023
    ├── month=01
    │   └── atomosphere_data_2023.01.csv
    ├── month=02
    │   └── atomosphere_data_2023.02.csv
    └── month=03
        └── atomosphere_data_2023.03.csv
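
The key=value directory names in the tree above carry the partition values. As a minimal sketch of how such a path can be read (the function name parse_partitions is hypothetical and is not part of Data Catalog), the following Python splits an object key into its partition keys and values:

```python
from pathlib import PurePosixPath


def parse_partitions(object_key: str) -> dict[str, str]:
    """Collect key=value directory segments from an object key."""
    partitions = {}
    # Skip the last segment, which is the file name.
    for segment in PurePosixPath(object_key).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value
    return partitions


key = "atmosphere-data/year=2022/month=10/atomosphere_data_2022.10.csv"
print(parse_partitions(key))  # {'year': '2022', 'month': '10'}
```

Directory names without an equals sign, such as atmosphere-data, are skipped in this sketch.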

File structure

All files share the same schema. The following shows the header row and sample data.

date	area_code	area_name	measure_center_code	measure_center_name	fine_dust_per_hour	fine_dust_per_day	ultrafine_dust_per_day	ozone_ppm	nitrogen_dioxide_concentration_ppm	carbon_monoxide_concentration_ppm	sulfurous_acid_gas_concentration_ppm
202210302300	100	downtown	111123	junggu	69	59	49	0.013	0.064	0.8	0.004
202210302300	100	downtown	111121	junggu-2	82	59	56	0.008	0.074	0.8	0.003
202210302300	100	downtown	111131	yongsangu	68	58	64	0.028	0.037	0.7	0.003
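
The date values in the sample rows are 12-digit numbers. Assuming they follow a YYYYMMDDhhmm layout (an assumption based only on the sample rows above, not a documented format), a value can be converted to a timestamp in Python as follows:

```python
from datetime import datetime


def parse_measurement_time(raw: str) -> datetime:
    # Assumed layout: YYYYMMDDhhmm, inferred from the sample rows only.
    return datetime.strptime(raw, "%Y%m%d%H%M")


print(parse_measurement_time("202210302300"))  # 2022-10-30 23:00:00
```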

Create and run scanner

Create connection

 Name: (atmosphere-bucket-connection)
 Data type: (Object Storage)
 Bucket: (saved bucket name)

Create scanner

   Data type: (Object Storage)
   Connection: (atmosphere-bucket-connection)
   Path: (not entered)
   Execution cycle: (on-demand)
   Pattern: (include:*.csv)
   Classifier: (not entered)
   Database: (default)
   Prefix: (not entered)
   Output data: (update table definition)
   Name: (atmosphere-scanner)

Pattern: specify include/exclude patterns for file types to scan only the files you want. If no pattern is specified, all files are scanned.
Classifier: if you don't select a classifier, the schema is determined by Data Catalog's internal classifier. If you add a user-defined classifier, the scanner attempts schema inference with it first, before the internal classifier.
Path: specify a path within the bucket to infer the schema from the data under that path. If not entered, all paths in the bucket are scanned.
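
To illustrate the effect of a pattern such as include:*.csv, here is a minimal Python sketch of include/exclude filtering. The function select_keys is hypothetical, and matching the glob against the file name only is an assumption; the scanner's actual matching rules are not described here.

```python
from fnmatch import fnmatchcase


def select_keys(keys, include=None, exclude=None):
    """Return the object keys that pass the include/exclude patterns."""
    selected = []
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # assumption: match the file name only
        if include and not any(fnmatchcase(name, p) for p in include):
            continue
        if exclude and any(fnmatchcase(name, p) for p in exclude):
            continue
        selected.append(key)
    return selected


keys = [
    "atmosphere-data/year=2022/month=10/atomosphere_data_2022.10.csv",
    "atmosphere-data/year=2022/month=10/notes.txt",
]
print(select_keys(keys, include=["*.csv"]))
```

With no patterns given, select_keys returns every key, which mirrors the rule that all files are scanned when no pattern is specified.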

Note

To use an Object Storage encrypted bucket, if you create a new encrypted bucket and specify a path in it, you need to reload the catalog (one time only).

Run scanner

While the scanner is in the Waiting to run state, click the [Run] button to start scanning.
A scanner in the Start running state returns to the Waiting to run state as soon as the scan is complete.
You can view the results in the Run history tab and view the schema in the Table menu.

How to select tables/partitions for scan operations

The scanner accesses files on a bucket/path that you specify in Object Storage, reads portions of the file, and creates metadata by inferring the file's format and internal schema.

Scan processing rules

Parse files and infer schemas

The scanner sequentially accesses all files in the bucket/path you specified and reads the first 1 MB of each file.
It parses each file with system classifiers, such as Parquet, XML, JSON, and CSV, to infer the internal schema.
However, if you set up a user-defined classifier in the scanner, it attempts to parse with the user-defined classifier before the system classifiers.
If parsing fails, it reads the next 1 MB and tries to parse again in the same way. It reads up to 10 MB of each file.
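
The read-and-retry behavior above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scanner's implementation: the two toy classifiers and the function infer_schema are hypothetical, the sample data is assumed to be comma-separated, and parsing the accumulated buffer (rather than only the newest 1 MB) is a guess at what "tries to parse again" means.

```python
import io
import json

CHUNK_SIZE = 1024 * 1024      # 1 MB per read, as described above
MAX_BYTES = 10 * CHUNK_SIZE   # give up after 10 MB


def json_classifier(data):
    """Toy classifier: sorted top-level keys if data is a JSON object."""
    try:
        doc = json.loads(data)
    except ValueError:
        return None
    return sorted(doc) if isinstance(doc, dict) else None


def csv_classifier(data):
    """Toy classifier: header columns once a full first line is available."""
    text = data.decode("utf-8", errors="replace")
    if "\n" not in text:
        return None
    header = text.split("\n", 1)[0].strip()
    return header.split(",") if "," in header else None


def infer_schema(stream, classifiers, chunk_size=CHUNK_SIZE, max_bytes=MAX_BYTES):
    """Read chunk by chunk; after each read, try every classifier in order."""
    buffer = b""
    while len(buffer) < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of file reached before any classifier succeeded
        buffer += chunk
        for classify in classifiers:
            schema = classify(buffer)
            if schema is not None:
                return schema
    return None


sample = io.BytesIO(b"date,area_code,area_name\n202210302300,100,downtown\n")
print(infer_schema(sample, [json_classifier, csv_classifier]))
```

The order of the classifiers list decides precedence, so a user-defined classifier would be placed ahead of the system classifiers.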

Merge individual files into a directory unit

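The files in a directory are merged into one unit when both of the following conditions are met:
The schemas of the individual files are similar, and the percentage of similar schemas is at least 70%.
The individual files are the same type.
When files are merged, columns from similar schemas are combined, and columns from non-similar schemas are ignored.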
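
The following Python sketch shows one possible reading of the merge rule. Several parts are assumptions and are labeled as such: how the scanner decides that two schemas are "similar" is not defined here, so is_similar uses a hypothetical test (one column set contains the other), and the most common column set is taken as the reference. The file type condition is not modeled.

```python
from collections import Counter

MIN_SIMILAR_SHARE = 0.7  # the 70% threshold from the rules above


def is_similar(schema, reference):
    """Hypothetical similarity test: one column set contains the other."""
    a, b = set(schema), set(reference)
    return a <= b or b <= a


def merge_file_schemas(schemas):
    """Merge per-file schemas (dicts of column name -> type), or return None."""
    if not schemas:
        return None
    # Assumption: the most common column set serves as the reference schema.
    reference = Counter(frozenset(s) for s in schemas).most_common(1)[0][0]
    similar = [s for s in schemas if is_similar(s, reference)]
    if len(similar) / len(schemas) < MIN_SIMILAR_SHARE:
        return None  # too few similar files; no merge
    merged = {}
    for schema in similar:
        for column, column_type in schema.items():
            merged.setdefault(column, column_type)  # first type seen wins
    return merged


base = {"date": "bigint", "ozone_ppm": "double"}
wider = {"date": "bigint", "ozone_ppm": "double", "area_code": "double"}
other = {"note": "string"}
print(merge_file_schemas([base, base, base, wider, other]))
```

In this example four of the five files are similar (80%), so their columns are combined and the columns of the dissimilar file are left out.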

Merge individual directories into a parent directory unit

Individual directories are merged into the parent directory when all of the following conditions are met:
The partition information contained in the paths of the individual directories is valid.
The schemas of the individual directories are similar, and the percentage of similar schemas is at least 70%.
The individual directories are the same type.
When directories are merged, columns from similar schemas are combined, and columns from non-similar schemas are ignored.
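
What counts as valid partition information is not spelled out here. As a rough client-side check under one assumed interpretation (sibling directories should use the same sequence of key=value partition keys), the following Python sketch compares directory paths before a scan; both function names are hypothetical.

```python
def partition_keys(directory: str) -> tuple:
    """Return the ordered partition keys found in a directory path."""
    keys = []
    for segment in directory.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key and value:
            keys.append(key)
    return tuple(keys)


def share_partition_layout(directories) -> bool:
    """True if every directory uses the same non-empty partition key sequence."""
    layouts = {partition_keys(d) for d in directories}
    return len(layouts) == 1 and layouts != {()}


dirs = [
    "atmosphere-data/year=2022/month=10",
    "atmosphere-data/year=2023/month=01",
]
print(share_partition_layout(dirs))  # True
```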

Scanner run results

Added table	Added partition	Schema
atmosphere-data	atmosphere_data:partition_0/year/month	date (bigint), area_code (double), area_name (string), measure_center_code (double), measure_center_name (string), fine_dust_per_hour (double), fine_dust_per_day (double), ultrafine_dust_per_day (double), ozone_ppm (double), nitrogen_dioxide_concentration_ppm (double), carbon_monoxide_concentration_ppm (double), sulfurous_acid_gas_concentration_ppm (double)

Supported file types for scanning

The Object Storage scanner supports the following file Content-Types:

text/csv
text/tab-separated-values
text/xml
application/xml
application/json
application/x-ndjson
application/octet-stream
binary/octet-stream
application/parquet
application/x-parquet
application/orc
application/x-orc
application/avro
application/vnd.apache.avro
application/gzip
application/x-gzip
application/x-bzip2
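
Before uploading or scanning, it can be handy to check an object's Content-Type against this list. The Python sketch below is a client-side convenience only: the function is_supported is hypothetical, and ignoring case and parameters such as "; charset=utf-8" is an assumption about how you might normalize the header, not a statement about how the scanner compares values.

```python
SUPPORTED_CONTENT_TYPES = frozenset({
    "text/csv",
    "text/tab-separated-values",
    "text/xml",
    "application/xml",
    "application/json",
    "application/x-ndjson",
    "application/octet-stream",
    "binary/octet-stream",
    "application/parquet",
    "application/x-parquet",
    "application/orc",
    "application/x-orc",
    "application/avro",
    "application/vnd.apache.avro",
    "application/gzip",
    "application/x-gzip",
    "application/x-bzip2",
})


def is_supported(content_type: str) -> bool:
    # Drop parameters and normalize case before comparing (assumption).
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SUPPORTED_CONTENT_TYPES


print(is_supported("text/csv; charset=utf-8"))  # True
print(is_supported("image/png"))                # False
```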