Object Storage Scanner usage examples

    Available in VPC

    Scenario overview

    This example walks through creating a scanner that targets an Object Storage bucket, inferring the schema of the source data, and creating a table from the result.

    Source Data

    The source data stored in Object Storage is air pollution measurement data. Each record includes the date and time, the monitoring station, and readings such as ozone and sulfur dioxide concentration. The data is stored in a partition structure similar to that of the Hadoop File System, with files split by the year and month of the measurements.

    Folder structure

    /atmosphere-data
    ├── year=2022
    │   ├── month=10
    │   │   └── atomosphere_data_2022.10.csv
    │   ├── month=11
    │   │   └── atomosphere_data_2022.11.csv
    │   └── month=12
    │       └── atomosphere_data_2022.12.csv
    └── year=2023
        ├── month=01
        │   └── atomosphere_data_2023.01.csv
        ├── month=02
        │   └── atomosphere_data_2023.02.csv
        └── month=03
            └── atomosphere_data_2023.03.csv
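
The `key=value` directory names above can be read back as partition values. The following is a minimal sketch of that idea, not the scanner's actual implementation; the helper name `parse_partitions` is hypothetical.

```python
from pathlib import PurePosixPath


def parse_partitions(object_key: str) -> dict[str, str]:
    """Collect key=value directory segments from an object key (hypothetical helper)."""
    partitions = {}
    # Only directory segments are inspected; the file name itself is skipped.
    for segment in PurePosixPath(object_key).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value
    return partitions


key = "atmosphere-data/year=2022/month=10/atomosphere_data_2022.10.csv"
print(parse_partitions(key))
```

Values are kept as strings here, so `month=01` does not lose its leading zero.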
    

    File structure

    • All files are organized with the same schema. The following is a schema and data sample.
    date | area_code | area_name | measure_center_code | measure_center_name | fine_dust_per_hour | fine_dust_per_day | ultrafine_dust_per_day | ozone_ppm | nitrogen_dioxide_concentration_ppm | carbon_monoxide_concentration_ppm | sulfurous_acid_gas_concentration_ppm
    202210302300 | 100 | downtown | 111123 | junggu | 69 | 59 | 49 | 0.013 | 0.064 | 0.8 | 0.004
    202210302300 | 100 | downtown | 111121 | junggu-2 | 82 | 59 | 56 | 0.008 | 0.074 | 0.8 | 0.003
    202210302300 | 100 | downtown | 111131 | yongsangu | 68 | 58 | 64 | 0.028 | 0.037 | 0.7 | 0.003


    Create and run scanner

    Create connection

     Name: (atmosphere-bucket-connection)
     Data type: (Object Storage)
     Bucket: (saved bucket name)
    

    Create scanner

       Data type: (Object Storage)
       Connection: (atmosphere-bucket-connection)
       Path: (not entered)
       Cycle: (on demand)
       Pattern: (include: *.csv)
       Classifier: (not entered)
       Database: (default)
       Prefix: (not entered)
       Output data: (update table definition)
       Name: atmosphere-scanner
    
    • Pattern: use include/exclude patterns to scan only the files you want. If no pattern is set, all files are scanned.
    • Classifier: if no classifier is selected, the schema is determined by the internal classifier of the Data Catalog. If a user-defined classifier is added, schema inference is attempted with that classifier first.
    • Path: if you specify a path within the bucket, the schema is inferred from the data under that path. If no path is entered, all paths under the bucket are scanned.
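
The effect of an include pattern such as `*.csv` can be sketched with glob-style matching. This is an illustration only: the function `select_keys` is hypothetical, and matching against the file name rather than the full path is an assumption, not documented scanner behavior.

```python
from fnmatch import fnmatchcase


def select_keys(keys, include=None, exclude=None):
    """Keep keys whose file name matches an include pattern and no exclude pattern."""
    selected = []
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # assumption: patterns apply to the file name
        if include and not any(fnmatchcase(name, p) for p in include):
            continue
        if exclude and any(fnmatchcase(name, p) for p in exclude):
            continue
        selected.append(key)
    return selected


keys = [
    "atmosphere-data/year=2022/month=10/atomosphere_data_2022.10.csv",
    "atmosphere-data/README.txt",
]
print(select_keys(keys, include=["*.csv"]))
```

With no patterns at all, every key is returned, which mirrors the "all files are scanned" default described above.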

    Run scanner

    • To start scanning, press the [Run] button on a scanner that is waiting to run.
    • While the scan is in progress, the scanner is in the Start running state. It changes to the Standby state as soon as the scan is completed.
    • You can check the results in the History tab and the schema in the Table menu.

    How to select table/partition during scan operation

    • The scanner accesses the files in the Object Storage bucket/path specified by the user, reads part of each file, infers the file format and internal schema, and creates metadata.

    Scan processing rules

    1. File parsing and schema inference
    • The scanner sequentially accesses all files in the user-specified bucket/path and reads the first 1 MB of each file.
    • It parses the files with system classifiers such as Parquet, XML, JSON, and CSV, and infers the internal schema.
    • If a user-defined classifier is set in the scanner, parsing is attempted with the user-defined classifier before the system classifiers.
    • If parsing fails, the scanner reads the next 1 MB and tries parsing again. It reads up to 10 MB of a file.
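
The read-and-retry rule in this step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the CSV header check stands in for a real classifier, and retrying on the accumulated sample (rather than on each new 1 MB piece alone) is an assumption.

```python
import csv
import io

CHUNK = 1024 * 1024        # 1 MB per read, as described above
MAX_BYTES = 10 * CHUNK     # stop after 10 MB


def infer_with_growing_sample(stream, try_parse):
    """Read 1 MB at a time and retry parsing until it succeeds or 10 MB is reached."""
    sample = b""
    while len(sample) < MAX_BYTES:
        chunk = stream.read(CHUNK)
        if not chunk:          # end of file before parsing succeeded
            break
        sample += chunk
        schema = try_parse(sample)
        if schema is not None:
            return schema
    return None


def try_parse_csv_header(sample: bytes):
    """Stand-in classifier: return the CSV header once a full line is available."""
    text = sample.decode("utf-8", errors="ignore")
    if "\n" not in text:
        return None
    header = next(csv.reader(io.StringIO(text)))
    return header or None


data = io.BytesIO(b"date,area_code,area_name\n202210302300,100,downtown\n")
print(infer_with_growing_sample(data, try_parse_csv_header))
```
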
    2. Merge individual files directory by directory
    • The schemas of the individual files must be similar, meaning the similar schema ratio is 70% or more.
    • The individual files must be of the same type.
    • Columns from similar schemas are merged, and columns from dissimilar schemas are ignored.

    3. Merge individual directories into parent directory units
    • The partition information included in the path of each directory must be valid.
    • The schemas of the individual directories must be similar, meaning the similar schema ratio is 70% or more.
    • The individual directories must be of the same type.
    • Columns from similar schemas are merged, and columns from dissimilar schemas are ignored.
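
The 70% rule used in the two merge steps can be illustrated with a small sketch. The source does not define how the similar schema ratio is calculated, so the formula here (columns matching in name and type, divided by all distinct column names) is an assumption, and the function names are hypothetical.

```python
SIMILARITY_THRESHOLD = 0.7  # "70% or more", as stated above


def similarity(a: dict[str, str], b: dict[str, str]) -> float:
    """Assumed ratio: columns equal in name and type over all distinct column names."""
    shared = sum(1 for name, typ in a.items() if b.get(name) == typ)
    total = len(set(a) | set(b))
    return shared / total if total else 1.0


def merge_schemas(schemas):
    """Merge columns of schemas similar to the running result; ignore the rest."""
    if not schemas:
        return {}
    merged = dict(schemas[0])
    for schema in schemas[1:]:
        if similarity(merged, schema) >= SIMILARITY_THRESHOLD:
            for name, typ in schema.items():
                merged.setdefault(name, typ)
    return merged


base = {"date": "bigint", "area_name": "string", "ozone_ppm": "double"}
extra = {**base, "note": "string"}   # 3 of 4 columns match
other = {"id": "bigint"}             # nothing matches
print(merge_schemas([base, extra, other]))
```

In this sketch the `note` column is merged because the two schemas are 75% similar, while the unrelated `id` schema is ignored.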

    Scanner execution result

    Additional table: atmosphere-data
    Additional partition: atmosphere_data:partition_0/year/month
    Schema:
    • date (bigint)
    • area_code (double)
    • area_name (string)
    • measure_center_code (double)
    • measure_center_name (string)
    • fine_dust_per_hour (double)
    • fine_dust_per_day (double)
    • ultrafine_dust_per_day (double)
    • ozone_ppm (double)
    • nitrogen_dioxide_concentration_ppm (double)
    • carbon_monoxide_concentration_ppm (double)
    • sulfurous_acid_gas_concentration_ppm (double)
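
To see what the inferred types mean for a record, the sketch below casts the first sample row with the schema listed above. It assumes the CSV is comma-delimited with fields in header order, and the mapping from catalog types to Python types (`CASTS`) is an illustrative assumption.

```python
# Column names and types as listed in the scanner result.
SCHEMA = [
    ("date", "bigint"),
    ("area_code", "double"),
    ("area_name", "string"),
    ("measure_center_code", "double"),
    ("measure_center_name", "string"),
    ("fine_dust_per_hour", "double"),
    ("fine_dust_per_day", "double"),
    ("ultrafine_dust_per_day", "double"),
    ("ozone_ppm", "double"),
    ("nitrogen_dioxide_concentration_ppm", "double"),
    ("carbon_monoxide_concentration_ppm", "double"),
    ("sulfurous_acid_gas_concentration_ppm", "double"),
]

# Assumed mapping from catalog types to Python types.
CASTS = {"bigint": int, "double": float, "string": str}


def cast_row(values):
    """Convert raw CSV strings into typed values, column by column."""
    return {name: CASTS[typ](raw) for (name, typ), raw in zip(SCHEMA, values)}


row = "202210302300,100,downtown,111123,junggu,69,59,49,0.013,0.064,0.8,0.004".split(",")
record = cast_row(row)
print(record["date"], record["area_code"], record["ozone_ppm"])
```

Note that `area_code` and `measure_center_code` become floating-point values under this schema even though the sample data holds whole numbers.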
