Data Forest overview

    The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.

    Available in VPC

    Data Forest is a large-scale, multi-tenant big data processing cluster based on Apache Hadoop. It supports various big data frameworks to simplify data storage, data processing, deep learning analysis, and serving. Security technology is applied throughout, and large-scale data is kept in distributed storage, so you can use the service safely.

    [Image: Data Forest storage overview (VPC)]

    Various features provided by Data Forest

    • Integrated analysis environment
      Data Forest is an integrated analysis platform based on Apache Hadoop that enables data collection and processing, deep learning analysis, and serving. It runs services as YARN applications and provides an environment where users can build a big data ecosystem by combining applications. It can also run deep learning training with frameworks such as TensorFlow and PyTorch by dynamically assigning GPU resources to each user.

    • Easy and quick configuration of analysis environments
      You can launch apps easily and quickly in a container-based serverless environment and build the app-based Hadoop ecosystem needed for your analysis environment. Data Forest provides an integrated, multi-tenancy-based platform designed to handle large volumes of data and large numbers of users. Depending on the purpose of the analysis, you can run both batch and long-lived workloads in a multi-tenant environment.

    • Flexible scalability
      Even after creating an app, you can add or remove containers as needed to respond flexibly to traffic. Because the platform is container-based, it can scale dynamically online and adapt quickly when needed.

    • Enhanced security
      Data Forest is a secure Hadoop cluster with enhanced security that supports Kerberos/LDAP authentication. It provides a strong security environment by using secret-key encryption, so credentials are not transferred over the network. In addition, application permission management is provided through Apache Ranger.

    • Guaranteed high-level network and disk performance
      Data Forest runs app-based compute nodes on physical servers and uses Hadoop Distributed File System (HDFS) storage based on their local disks, guaranteeing high network and disk performance.

    • Various components
      Data Forest provides components for storing, analyzing, and visualizing data, and users can create the components suited to each purpose. HDFS, HBase, Kafka, and OpenTSDB are provided for data storage; Spark, Hive, Hive LLAP, Elasticsearch, Grafana, Hue, Trino, and Phoenix for data analysis and processing; and Kibana and Zeppelin for data visualization.

    • Improved component accessibility and a web-based development environment
      To provide seamless access to components within a VPC environment, Data Forest offers a proxy function and a web-based development environment, Jupyter Notebook. Users can access the JupyterLab web page of a created notebook node to run the queries and code needed for big data analysis and machine learning training. Integration with Object Storage also allows flexible reuse of data.
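    Since Data Forest runs workloads as YARN applications, a typical batch analysis is submitted to the cluster from the command line. The following is a minimal sketch of a Spark batch submission; the application name and script path are placeholders, not values provided by the service, and in a real session you would run the command instead of echoing it.

```shell
# Hedged sketch: submitting a batch Spark job as a YARN application.
# APP_NAME and JOB_SCRIPT are placeholder values (assumptions).
APP_NAME="example-etl"
JOB_SCRIPT="/path/to/etl_job.py"   # placeholder PySpark script

# --master yarn runs the job on the cluster's YARN resource manager;
# --deploy-mode cluster runs the driver inside the cluster as well.
SUBMIT_CMD="spark-submit --master yarn --deploy-mode cluster --name $APP_NAME $JOB_SCRIPT"
echo "$SUBMIT_CMD"   # print the command; run it directly in a real session
```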

    Data Forest user guides

    Data Forest provides its services in the Korea Region. See the following table of contents and details to use Data Forest efficiently.

    NAVER Cloud Platform provides various related resources in addition to guides to help customers understand Data Forest. Developers and marketers who are considering adopting Data Forest in their company, or who need detailed information while establishing data-related policies, are encouraged to make active use of the following resources.

    Check the FAQs first.

    The FAQ can quickly answer common questions. If your question is not resolved in the FAQ, see the user guides to find the information you need.

    Q. Cloud Hadoop and Data Forest seem to be similar services. What's the difference?
    A. The difference between the two services is whether they are server-based or serverless.

    • Cloud Hadoop builds and provides a Hadoop cluster using customer-dedicated resources.
      • It is a self-managed product in which the customer directly manages Hadoop.
      • An open source-based web management tool (Apache Ambari) which allows self-management is provided.
    • Data Forest is a serverless product used by submitting the jobs (such as DL Jobs) required for analysis. Hadoop ecosystem components that must run long-lived can be analyzed easily by creating apps.
      • It is a managed product that guarantees high availability at the product level, rather than having customers directly manage Hadoop.
      • It provides more apps than Cloud Hadoop, and GPU-based Deep Learning Jobs can also be submitted.

    Comparison

    Feature         | Cloud Hadoop                                                                         | Data Forest
    Scalability     | The user determines the Hadoop cluster size directly                                 | Managed by the service
    Cost            | Hadoop cluster maintenance fees are incurred                                         | Storage fees are incurred while in use
    Maintenance     | The user manages the cluster directly; a management tool (Apache Ambari) is provided | Managed by the service
    Characteristics | The user can freely configure the environment                                        | Provides various apps; GPU-based deep learning jobs can be submitted

    Q. What features are provided to collect and process real-time data or configure ETL environments?
    A. Although Data Forest does not directly provide real-time data collection and processing, such an environment can be configured by combining various NAVER Cloud Platform services with the Hadoop ecosystem apps provided by Data Forest. A dedicated ETL service is scheduled to be released as a separate product in the future.

    Q. How can I access the Quick links provided by Data Forest?
    A. Accessing Quick links requires a Data Forest notebook server. You can create one in the Data Forest > Notebooks menu.
    For more information, see Access Quick links.

    Q. When I enter the command in the terminal of my PC to create an SSH tunnel between my PC and the notebook, I keep being asked for a password.
    A. If authentication with the authentication key fails, SSH repeatedly requests a password. This can occur when the key used with the SSH command differs from the login key set when the notebook was created. If you have lost the authentication key, you must change it.
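    The failure described above is easiest to spot in the tunnel command itself. Below is a minimal sketch of building the SSH tunnel command; the host name and key path are placeholder values (assumptions), not endpoints provided by the service, and in a real session you would run the command instead of echoing it.

```shell
# Hedged sketch: SSH tunnel from a local PC to a Data Forest notebook node.
# KEY_PATH and NOTEBOOK_HOST are placeholder values (assumptions).
KEY_PATH="$HOME/.ssh/df-notebook-key.pem"   # must be the login key set at notebook creation
NOTEBOOK_HOST="notebook.example.com"        # placeholder notebook endpoint
LOCAL_PORT=8888                             # local port to forward to JupyterLab
REMOTE_PORT=8888

# -i selects the private key. If this key differs from the one registered
# when the notebook was created, key authentication fails and SSH falls
# back to repeated password prompts -- the symptom described in this FAQ.
TUNNEL_CMD="ssh -i $KEY_PATH -N -L $LOCAL_PORT:localhost:$REMOTE_PORT forest@$NOTEBOOK_HOST"
echo "$TUNNEL_CMD"   # print the command; run it directly in a real session
```

    `-N` keeps the connection open without running a remote command, which is the usual choice for a forwarding-only tunnel.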

