Available in VPC
Quickstart overview
This section describes how to create a scanner for an Object Storage bucket, infer the schema of the source data, and create tables from it.
Source data
The source data stored in Object Storage is air pollution measurement data. It contains fields such as the date and time, measurement station name, ozone concentration, and sulfur dioxide concentration. It is stored in the partition structure used by the Hadoop File System, with files divided by year and month.
Folder structure
/atmosphere-data
├── year=2022
│   ├── month=10
│   │   └── atomosphere_data_2022.10.csv
│   ├── month=11
│   │   └── atomosphere_data_2022.11.csv
│   └── month=12
│       └── atomosphere_data_2022.12.csv
└── year=2023
    ├── month=01
    │   └── atomosphere_data_2023.01.csv
    ├── month=02
    │   └── atomosphere_data_2023.02.csv
    └── month=03
        └── atomosphere_data_2023.03.csv
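As a rough illustration of how the `key=value` directory names in this layout map to partition values, the following sketch extracts them from an object key. The function name `partition_values` is hypothetical and is not part of Data Catalog; it only shows the naming convention, not how the scanner is implemented.

```python
from pathlib import PurePosixPath


def partition_values(object_key: str) -> dict[str, str]:
    """Collect key=value directory segments from an object key."""
    values = {}
    # Only directory segments can be partitions, so skip the file name.
    for segment in PurePosixPath(object_key).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            values[key] = value
    return values


example = partition_values(
    "atmosphere-data/year=2022/month=10/atomosphere_data_2022.10.csv"
)
print(example)  # {'year': '2022', 'month': '10'}
```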
File structure
- All files share the same schema. The following is a sample of the schema and data.
| date | area_code | area_name | measure_center_code | measure_center_name | fine_dust_per_hour | fine_dust_per_day | ultrafine_dust_per_day | ozone_ppm | nitrogen_dioxide_concentration_ppm | carbon_monoxide_concentration_ppm | sulfurous_acid_gas_concentration_ppm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 202210302300 | 100 | downtown | 111123 | junggu | 69 | 59 | 49 | 0.013 | 0.064 | 0.8 | 0.004 |
| 202210302300 | 100 | downtown | 111121 | junggu-2 | 82 | 59 | 56 | 0.008 | 0.074 | 0.8 | 0.003 |
| 202210302300 | 100 | downtown | 111131 | yongsangu | 68 | 58 | 64 | 0.028 | 0.037 | 0.7 | 0.003 |
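The `date` values in the sample look like a year, month, day, hour, and minute packed into 12 digits. Assuming that layout (it is inferred from the sample rows, not stated by the data provider), a value can be converted to a timestamp after loading, for example:

```python
from datetime import datetime


def parse_measurement_time(raw: str) -> datetime:
    # Assumed layout: 4-digit year, then month, day, hour, minute.
    return datetime.strptime(raw, "%Y%m%d%H%M")


print(parse_measurement_time("202210302300").isoformat())
# 2022-10-30T23:00:00
```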
Create and run scanner
Create connection
Name: (atmosphere-bucket-connection)
Data type: (Object Storage)
Bucket: (saved bucket name)
Create scanner
Data type: (Object Storage)
Connection: (atmosphere-bucket-connection)
Path: (not entered)
Execution cycle: (on-demand)
Pattern: (include:*.csv)
Classifier: (not entered)
Database: (default)
Prefix: (not entered)
Output data: (update table definition)
Name: (atmosphere-scanner)
- Pattern: specify include/exclude file patterns to scan only the files you want. If not entered, all files are scanned.
- Classifier: if you don't select a classifier, the schema is determined by Data Catalog's internal classifier. If you add a user-defined classifier, it takes precedence over the internal classifier when the schema is inferred.
- Path: specify a path within the bucket to infer the schema from the data under that path. If not entered, all paths under the bucket are scanned.
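To get a feel for what an include pattern such as `*.csv` selects, the sketch below filters object keys with shell-style wildcards. It is only an approximation: the exact pattern syntax of the scanner, and whether it matches the file name or the full path, are assumptions here, and `is_scanned` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_scanned(object_key: str, include=(), exclude=()) -> bool:
    """Return True if the object would be picked up under these patterns."""
    # Assumption: patterns are matched against the file name only.
    file_name = object_key.rsplit("/", 1)[-1]
    if any(fnmatchcase(file_name, pattern) for pattern in exclude):
        return False
    # With no include patterns, every remaining file is scanned.
    return not include or any(fnmatchcase(file_name, p) for p in include)


keys = [
    "year=2022/month=10/atomosphere_data_2022.10.csv",
    "year=2022/month=10/readme.txt",
]
print([k for k in keys if is_scanned(k, include=["*.csv"])])
```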
Note
If you create a new encrypted bucket and specify a path in it to use an Object Storage encrypted bucket, you need to reload the catalog (one time only).
Run scanner
- With the scanner in the Waiting to run state, click the [Run] button to start scanning.
- A scanner in the Start running state changes back to the Waiting to run state as soon as the scan is complete.
- You can view the results in the Run history tab and view the schema in the Table menu.
How to select tables/partitions for scan operations
- The scanner accesses files on a bucket/path that you specify in Object Storage, reads portions of the file, and creates metadata by inferring the file's format and internal schema.
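The partial-read approach can be sketched as follows, using the 1 MB read step and 10 MB cap given in the scan processing rules. This is not the scanner's actual code: `infer_schema` and `header_classifier` are hypothetical names, and the sketch assumes each attempt parses everything read so far rather than only the newest chunk.

```python
import io
from typing import BinaryIO, Callable, Optional

STEP = 1024 * 1024      # 1 MB per read
LIMIT = 10 * STEP       # stop after 10 MB


def infer_schema(stream: BinaryIO,
                 classify: Callable[[bytes], Optional[list]],
                 step: int = STEP,
                 limit: int = LIMIT) -> Optional[list]:
    """Feed a growing prefix of the file to `classify` until it succeeds."""
    buffer = b""
    while len(buffer) < limit:
        chunk = stream.read(min(step, limit - len(buffer)))
        if not chunk:            # end of file reached
            break
        buffer += chunk
        schema = classify(buffer)
        if schema is not None:
            return schema
    return None                  # nothing parsed within the limit


def header_classifier(data: bytes) -> Optional[list]:
    # Hypothetical classifier: needs one complete header line to succeed.
    if b"\n" not in data:
        return None
    return data.split(b"\n", 1)[0].decode("utf-8").split(",")


sample = io.BytesIO(b"date,area_code\n202210302300,100\n")
print(infer_schema(sample, header_classifier))
```

Passing a smaller `step` and `limit` makes it easy to see the retry behavior on short inputs.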
Scan processing rules
- Parse files and infer schemas
- Sequentially accesses all files in the bucket/path you specified and reads the first 1 MB of each file.
- Parses the files with system classifiers, such as Parquet, XML, JSON, and CSV, to infer the internal schema.
- However, if you set up a user-defined classifier in the scanner, parsing is attempted with the user-defined classifier before the system classifiers.
- If a file fails to parse, the scanner reads the next 1 MB and tries to parse it in the same way. It reads up to 10 MB of each file.
- Merge individual files into a directory unit
- The schemas of the individual files are similar, with similar schemas accounting for at least 70%
- The individual files are of the same type
- Columns from similar schemas are merged, and columns from non-similar schemas are ignored

- The partition information contained in the path of each individual directory must be valid
- The schemas of the individual directories are similar, with similar schemas accounting for at least 70%
- The individual directories are of the same type
- Columns from similar schemas are merged, and columns from non-similar schemas are ignored
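The 70% rule can be illustrated with a small sketch. The document does not define how two schemas are judged "similar", so the column-name overlap measure and its 0.5 cutoff below are assumptions made purely for illustration, as are the names `column_overlap` and `merge_similar`. Only the 70% share comes from the rules above.

```python
def column_overlap(a: dict, b: dict) -> float:
    """Jaccard overlap of column names (an assumed similarity measure)."""
    names_a, names_b = set(a), set(b)
    if not names_a and not names_b:
        return 1.0
    return len(names_a & names_b) / len(names_a | names_b)


def merge_similar(schemas: list, min_share: float = 0.7,
                  min_overlap: float = 0.5):
    """Merge schemas if at least `min_share` of them resemble one reference."""
    best_group = []
    for reference in schemas:
        group = [s for s in schemas
                 if column_overlap(reference, s) >= min_overlap]
        if len(group) > len(best_group):
            best_group = group
    if not schemas or len(best_group) / len(schemas) < min_share:
        return None              # too few similar schemas: no merge
    merged = {}
    for schema in best_group:    # schemas outside the group are ignored
        for column, col_type in schema.items():
            merged.setdefault(column, col_type)
    return merged


files = [
    {"date": "bigint", "ozone_ppm": "double"},
    {"date": "bigint", "ozone_ppm": "double", "area_name": "string"},
    {"date": "bigint", "ozone_ppm": "double"},
    {"id": "bigint"},            # dissimilar: its column is dropped
]
print(merge_similar(files))
```

Here three of the four schemas (75%) resemble each other, so their columns are merged and the `id` column of the dissimilar schema is ignored.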

Scanner run results
| Added table | Added partition | Schema |
|---|---|---|
| atmosphere-data | atmosphere_data:partition_0/year/month | date (bigint), area_code (double), area_name (string), measure_center_code (double), measure_center_name (string), fine_dust_per_hour (double), fine_dust_per_day (double), ultrafine_dust_per_day (double), ozone_ppm (double), nitrogen_dioxide_concentration_ppm (double), carbon_monoxide_concentration_ppm (double), sulfurous_acid_gas_concentration_ppm (double) |
Supported file types for scanning
- The Object Storage scanner supports the following file Content-Types:
text/csv
text/tab-separated-values
text/xml
application/xml
application/json
application/x-ndjson
application/octet-stream
binary/octet-stream
application/parquet
application/x-parquet
application/orc
application/x-orc
application/avro
application/vnd.apache.avro
application/gzip
application/x-gzip
application/x-bzip2
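Before uploading, you may want to check whether an object's Content-Type is in this list. The sketch below does a simple membership test. Ignoring parameters such as `; charset=utf-8` and letter case is an assumption about how the value should be compared, and `is_supported` is a hypothetical helper.

```python
SUPPORTED_CONTENT_TYPES = frozenset({
    "text/csv", "text/tab-separated-values", "text/xml",
    "application/xml", "application/json", "application/x-ndjson",
    "application/octet-stream", "binary/octet-stream",
    "application/parquet", "application/x-parquet",
    "application/orc", "application/x-orc",
    "application/avro", "application/vnd.apache.avro",
    "application/gzip", "application/x-gzip", "application/x-bzip2",
})


def is_supported(content_type: str) -> bool:
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SUPPORTED_CONTENT_TYPES


print(is_supported("text/csv; charset=utf-8"))
print(is_supported("image/png"))
```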