Data Forest overview
The latest service changes have not yet been reflected in this content. We will update the content as soon as possible. Please refer to the Korean version for information on the latest updates.
Available in VPC
Data Forest is a large-scale, multi-tenant big data processing cluster based on Apache Hadoop. Data Forest supports various big data frameworks to simplify data storage, data processing, deep learning analysis, and serving. Security technology is applied and large-scale data is stored in distributed storage, so you can use it safely.
Data Forest features
Integrated analysis environment
Data Forest is an integrated analysis platform based on Apache Hadoop that enables data collection and processing, deep learning analysis, and serving. It runs services in the form of YARN applications and provides an environment where users can build a big data ecosystem by combining applications. GPU resources are dynamically assigned to each user, so you can also run deep learning training with frameworks such as TensorFlow and PyTorch.
Easy and quick configuration of analysis environments
You can launch apps easily and quickly in a container-based serverless environment, and create the app-based Hadoop ecosystem you need to configure your analysis environment. Data Forest provides an integrated multi-tenancy-based platform designed to handle large amounts of data and large numbers of users. Depending on the analysis purpose, you can perform both batch analysis and long-lived analysis in a multi-tenant environment.
Flexible scalability
Even after creating an app, you can expand or reduce the number of containers as needed to respond to traffic flexibly. Because the service is container-based, it can scale dynamically online and change quickly when needed.
Enhanced security
Data Forest is a secure Hadoop cluster that supports Kerberos/LDAP authentication. It provides a strong security environment by using secret key encryption, so credentials are not transferred over the network. In addition, application permission management provides security through Apache Ranger policies.
Guaranteed high-level network and disk performance
Data Forest uses Hadoop Distributed File System (HDFS) storage backed by the local disks of app-based computing nodes and physical servers, ensuring the best network and disk performance.
Various components
Data Forest consists of components for storing, analyzing, and visualizing data. You can create and use the components suitable for each purpose. HDFS, HBase, Kafka, and OpenTSDB are provided for data storage; Spark, Hive, Hive LLAP, Elasticsearch, Grafana, Hue, Trino, and Phoenix are provided for data analysis and processing; and Kibana and Zeppelin are provided for data visualization.
Improved component accessibility and web-based development environment
Data Forest provides a proxy function for seamless access to components within a VPC environment, as well as Jupyter Notebook, a web-based development environment. Users can access the JupyterLab web page of a created notebook node to run the queries and code needed for big data analysis and machine learning training. Integration with Object Storage further allows flexible reuse of data.
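For illustration, the following is a minimal sketch of reading data stored in HDFS with PySpark from such a notebook session; the HDFS path, column name, and application name are hypothetical placeholders and will differ in your environment.

```python
# Minimal PySpark sketch for a notebook session (path and schema are examples).
from pyspark.sql import SparkSession

# In a notebook the session is often pre-created; building one here keeps
# the sketch self-contained.
spark = SparkSession.builder.appName("data-forest-overview-example").getOrCreate()

# Read a CSV file stored in HDFS (placeholder path).
events = spark.read.option("header", "true").csv("hdfs:///user/example/events.csv")

# A simple aggregation to confirm the data is readable.
events.groupBy("event_type").count().show()

spark.stop()
```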
Data Forest user guide
Data Forest is available in the Korea Region. This guide will walk you through the information you need to start using Data Forest.
- Data Forest overview: introduction to Data Forest advantages, related resources of Data Forest, FAQs
- Data Forest quickstart: guides you through the entire process step-by-step
- Data Forest prerequisites: view supported environments
- VPC
- Getting started with Data Forest: learn how to configure the client environment to access Data Forest and the Data Forest app
- Using Data Forest
- Create and manage accounts: learn how to create and manage a Data Forest account and how to verify your account
- Create and manage notebooks: learn how to create and manage Data Forest notebooks
- Create and manage apps: learn how to create and manage Data Forest apps
- Using Data Forest app
- Access quick links: learn quick link types and how to access quick links
- Using Dev: learn Dev app details and how to use it
- Using Elasticsearch: learn Elasticsearch details and precautions
- Using Grafana: learn Grafana details, how to add a data source, and how to back up the database
- Using HBase: learn HBase details and precautions
- Using Hive: learn Hive details, access methods, and precautions
- Using Hue: learn Hue details
- Using Kafka: learn Kafka details, how to use Kafka manager, and precautions when using
- Using Kibana: learn Kibana details
- Using OpenTSDB: learn OpenTSDB details
- Using Phoenix: learn Phoenix details
- Using Spark History Server: learn Spark History Server details and how to view tasks
- Using Trino: learn Trino details
- Using Zeppelin: learn Zeppelin details, interpreter settings, and how to back up
- Using Zookeeper: learn Zookeeper details, how to connect with other apps, and precautions when using
- Monitoring: learn how to monitor submitted batch jobs and apps
- Utilizing the Data Forest ecosystem
- Using HDFS: learn how to upload files to and download files from HDFS
- Using Public Hive: learn how to create Hive databases and tables
- Using Oozie: learn how to compose workflows
- Using Ranger: learn how to set up Apache Ranger policies
- Using Spark: learn how to submit a Spark Job
- Data Forest use cases
- Copy HDFS data to Object Storage: learn how to copy HDFS data to Object Storage
- Register Spark batch jobs with Oozie scheduler: learn how to register Spark batch jobs with Oozie scheduler
- Data process with Spark and Hive: learn how to process Spark and Hive data with the Zeppelin and Dev apps
- Using AI Forest
- AI Forest overview: guide on AI Forest
- AI Forest quickstart: quickstart guide on AI Forest
- Create and manage workspaces: learn how to create and manage workspaces
- Using Workspace Browser: learn how to manage and edit source files in a workspace
- Manage AI app: learn AI app details, how to see logs, and how to end the app
- Using AI Forest CLI: scenario for using AI Forest CLI in the Linux environment
- AI Forest use cases
- Classify MNIST Handwritten Images with TensorFlow: learn how to submit jobs in Singlebatch
- Detect pedestrians using PyTorch: learn how to write a program to detect pedestrian objects and submit it as a Singlebatch job
- Container Registry integrations: learn how to integrate with Container Registry products to use Docker images
- VPC
- Data Forest permissions management: learn Data Forest permissions management methods and policies
- Data Forest release notes: see documentation history for Data Forest user guides
Data Forest related resources
NAVER Cloud Platform provides various resources, in addition to these guides, to help customers understand Data Forest. If you are considering adopting Data Forest, or need detailed information to develop data-related policies as a developer, marketer, or other role, make good use of the following resources:
- Pricing plans, characteristics, and detailed features: Data Forest introduction and pricing information
- Latest service news: the latest news on Data Forest
- FAQs: frequently asked questions from Data Forest users
- Contact us: send direct inquiries in case of any unresolved questions that are not answered by the user guide
Check FAQs first.
The FAQs can quickly answer common questions. If your questions are not resolved in the FAQs, see the user guides to find the information you want.
Q. Cloud Hadoop and Data Forest seem to be similar services. What is the difference?
A. The key difference between the two services is whether they are server-based or serverless.
- Cloud Hadoop builds and provides a Hadoop cluster using customer-dedicated resources.
- It is a self-managed product in which the customer directly manages Hadoop.
- An open source-based web management tool (Apache Ambari) that allows self-management is provided.
- Data Forest is a serverless product used by submitting the jobs (including deep learning jobs) required for analysis. Hadoop ecosystem components that must run long-lived can also be used easily by creating an app.
- It is a managed product that guarantees high availability at the product level, rather than having customers directly manage Hadoop.
- It provides more apps than Cloud Hadoop, and GPU-based deep learning jobs can also be submitted.
Comparison
| Feature | Cloud Hadoop | Data Forest |
| --- | --- | --- |
| Scalability | The user directly determines the Hadoop cluster size | Managed by the service |
| Cost | Hadoop cluster maintenance fees are charged | Storage fees are charged while in use |
| Maintenance | The user manages the cluster directly; a management tool (Apache Ambari) is provided | Managed by the service |
| Characteristics | The user can freely configure the environment | Provides various apps; GPU-based deep learning jobs can be submitted |
Q. What features are provided to collect and process real-time data or configure ETL environments?
A. Although Data Forest does not directly provide real-time data collection and processing features, you can configure such an environment by combining other NAVER Cloud Platform services with the Hadoop ecosystem apps provided by Data Forest. A service dedicated to configuring ETL is scheduled to be released as a separate product in the future.
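As one illustration, a real-time collection pipeline could feed data into a Kafka app created in Data Forest. The sketch below uses the kafka-python client; the broker address and topic name are hypothetical placeholders, so replace them with the connection information of your own Kafka app.

```python
# Hypothetical sketch: sending events to a Kafka app created in Data Forest.
# The broker address and topic name are placeholders.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka.example.internal:9092"],  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send a sample event; downstream apps (e.g., Spark) can consume and process it.
producer.send("example-events", {"user_id": 123, "action": "click"})
producer.flush()
producer.close()
```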
Q. I want to access the quick links provided by Data Forest. How can I access them?
A. Accessing quick links requires a Data Forest notebook server. You can create one in Data Forest > Notebooks.
For more information, see Access quick links.
Q. When I enter the command in the terminal of my PC to create an SSH tunnel between my PC and the notebook, it keeps prompting me for a password.
A. If authentication with the authentication key fails, SSH repeatedly requests a password instead. This can occur when the key used to run the SSH command differs from the login key set when the notebook was created. If you have lost the authentication key, you must change it.
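For reference, the following is a minimal sketch of opening such a tunnel with the correct key; the key path, ports, user name, and host are hypothetical and should be replaced with your notebook's actual values.

```python
# Hypothetical sketch: open an SSH tunnel to a notebook node using the login key
# that was set when the notebook was created. All paths, ports, and the host
# below are placeholders.
import subprocess

ssh_command = [
    "ssh",
    "-i", "/path/to/notebook-login-key.pem",  # key set at notebook creation
    "-N",                                      # tunnel only, no remote command
    "-L", "8888:localhost:8888",               # local port -> notebook port (example)
    "forest@notebook.example.com",             # placeholder user and host
]

# If the wrong key is supplied here, ssh falls back to password prompts.
subprocess.run(ssh_command, check=True)
```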
Q. When I use the notebook, are there differences in operation methods by kernel?
A. Yes. If you use the PySpark kernel, tasks are executed by connecting to Livy on the common cluster through Sparkmagic, so the Spark version of the common cluster is used.
However, if you use the Python kernel, tasks run standalone on the notebook node, so the Spark version of the local environment is used.
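For the Python kernel case, here is a minimal sketch of starting a local standalone Spark session and checking which Spark version it runs; it assumes PySpark is installed in the notebook's local environment, and the application name is an arbitrary example.

```python
# Minimal sketch for the Python kernel: run Spark standalone in the notebook's
# local environment (assumes PySpark is installed locally).
from pyspark.sql import SparkSession

# local[*] runs Spark inside the notebook process instead of on the common cluster.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("version-check")
    .getOrCreate()
)

# Prints the Spark version of the local environment, which may differ from the
# common cluster's version used by the PySpark kernel via Sparkmagic/Livy.
print(spark.version)

spark.stop()
```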