Using Presto (Trino)
Available in VPC
Presto is a tool for analyzing terabytes to petabytes of data using distributed queries.
Presto can read data from various sources, including HDFS, Hive warehouse, and RDBMS.
Unlike Hive and Pig, where queries are executed as MapReduce jobs, Presto has its own query execution engine. Since Presto is designed to pass data from memory to memory, without writing the results of each step to disk, it can analyze data stored in HDFS faster and more interactively than Hive. This also makes Presto better suited than Hive for integration with BI tools such as Tableau.
- Up to Cloud Hadoop 1.9, it was used under the name Presto, and in Cloud Hadoop 2.0, it is used under the name Trino.
- Presto, like Hive and Pig, is designed to process OLAP queries. Therefore, it cannot replace transaction-based RDBMS.
Presto components
The Presto service consists of two components: coordinator and worker.
It can have 1 coordinator as the master, and multiple workers as the slaves. Communication between coordinators and workers and among worker nodes relies on REST APIs.
Coordinator
The coordinator is the hub of the Presto service and is responsible for the following:
- Receiving requests from the client
- Conducting SQL syntax parsing and query planning
- Adjusting worker nodes when running queries and tracking activities of worker nodes
Worker
Workers perform tasks received from the coordinator and process data. The task execution result is transferred directly from worker to client.
Query execution process
The process of a query execution is as follows: (see the image below)
- Start the Presto worker process and register with the coordinator's discovery server
- Workers must be registered on the discovery server so that the coordinator can assign tasks to them
- The client transfers queries to the coordinator through HTTP
- The coordinator creates a query plan and requests schema data from the connector plugin
- The coordinator sends the task to be executed to the worker
- The worker reads data from data sources through the connector plugin
- The worker executes the task in the memory
- The worker returns the results to the client
Data sources
Connector
In Presto, the connector functions like a driver in a database. In other words, it connects the data sources with the coordinator or worker so the data can be read from the data source.
By default, Presto provides connectors for various data sources such as Hive, MySQL, and Kafka.
Catalog
The catalog is a mount point for a connector. Every catalog is associated with a specific connector. Presto accesses data sources through the connector mounted on the catalog. For example, to access a Hive warehouse with the Hive connector, you need to configure a Hive catalog (hive.properties) under /etc/presto/catalog.
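A minimal Hive catalog file could look like the sketch below. The metastore host is a placeholder for your environment, and the connector name follows the convention used by Presto 0.2xx releases:

```properties
# /etc/presto/catalog/hive.properties (illustrative)
connector.name=hive-hadoop2
# <METASTORE-HOST> is a placeholder for your Hive metastore host
hive.metastore.uri=thrift://<METASTORE-HOST>:9083
```

The file name (without the .properties extension) becomes the catalog name used in queries, e.g. hive.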
Presto queries can accommodate one or more catalogs. In other words, you can use multiple data sources within a single query.
Catalogs are defined as individual .properties files in the catalog directory (/etc/presto/catalog) under the Presto configuration directory (/etc/presto/).
Schema
A schema is a way to organize your tables.
You can define a table set, which can be queried at once, using one catalog and schema.
When accessing Hive or RDBMS with Presto, the schema is equivalent to the concept of a database.
In other data sources, tables can be organized to create a schema.
Table
The table concept of RDBMS is applied identically here.
When referencing a table in Presto, it must be fully-qualified, meaning that the catalog, schema, and table name must be specified, separated by periods (.) (e.g., hive.samples.t1).
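For example, with a hive catalog containing a samples schema (hypothetical names), a fully-qualified query and its USE shortcut look like this:

```sql
-- fully-qualified: catalog.schema.table
SELECT * FROM hive.samples.t1 LIMIT 10;

-- or set the default catalog and schema first,
-- then reference the table by name alone
USE hive.samples;
SELECT count(*) FROM t1;
```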
Using Presto clusters
Create cluster
From NAVER Cloud Platform console, create the Cloud Hadoop cluster.
For more information on creating clusters, see Create cluster.
Starting with Cloud Hadoop 1.3, you can use clusters with Presto v0.240 installed.
In Cloud Hadoop 1.3, even if you didn't create the cluster as a Presto type, you can still add Presto using Ambari Add Service.
Check Presto service in Ambari UI
After installing Presto, you can see the service in the Ambari UI. You can start and stop each component of the service from this page.
- Summary: shows the hosts where each component is installed
- Configs: changes configurations of Presto service
- Quick Links: Presto Discovery UI
- Accessing these links requires tunneling. Access through the web UI link provided by the console. For more information, see Access Presto Discovery UI.
Key configurations
jvm.properties
Enter the JVM options used by the coordinator or worker server. You can adjust the JVM heap size with the -Xmx option. Since the coordinator node and worker node specifications may differ, memory settings are applied separately for each role, and the jvm.properties configuration is divided by role accordingly. On each server, a single file with the same name, jvm.properties, exists under the /etc/presto/conf path.
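A typical set of entries might look like the following sketch; the heap size is illustrative and should be sized to the node's actual memory:

```properties
# Illustrative jvm.properties entries; -Xmx sets the JVM heap size
-server
-Xmx8G
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError
```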
config.properties
Although rarely needed, you can configure memory settings in config.properties differently for the coordinator and worker roles. The key settings are defined as follows:
Item | Default value | Description |
---|---|---|
query.max-memory-per-node | 1G | - Maximum user memory that a single query can use on a worker node - If the user memory a query uses on any worker node exceeds this limit, the query is canceled |
query.max-memory | 20G | - Maximum user memory that a single query can use across the entire cluster - If the sum of user memory allocated to all worker nodes by a query exceeds this limit, the query is canceled |
query.max-total-memory-per-node | 2G | - Maximum user and system memory that a single query can use on a worker node - If the user and system memory a query uses on any worker node exceeds this limit, the query is canceled |
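In config.properties these settings appear as plain key-value pairs; the values below simply restate the defaults from the table:

```properties
# Per-query memory limits (values shown are the defaults above)
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
query.max-memory=20GB
```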
To change http-server.http.port, you must set the coordinator's and the workers' http-server.http.port to the same value. If you specify different ports, they cannot communicate with each other.
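For example, the following line must appear with the same value in both the coordinator's and the workers' config.properties (8285 matches the port used in the CLI example later in this guide):

```properties
# Must be identical on the coordinator and all workers
http-server.http.port=8285
```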
node.properties
You can set the log directory, PID directory, and others used by the Presto daemon. To change these directories, check the owner and permissions of each directory on each server.
You can specify the name of the environment currently in use in node.environment.
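A minimal sketch of node.properties is shown below; node.environment comes from the text above, while the other keys and values are illustrative assumptions for a typical installation:

```properties
# Name of the environment this node belongs to
node.environment=production
# Illustrative: unique identifier for this node
node.id=presto-worker-001
# Illustrative: directory for logs and other node data
node.data-dir=/var/lib/presto
```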
Presto CLI
The Presto CLI provides an interactive shell for running queries.
You can use the shell on any host assigned the Presto CLI role.
For a detailed explanation on using Presto CLI, see Presto CLI Documentation.
- Connect to Presto Coordinator server
/usr/lib/presto/bin/presto-cli --server <COORDINATOR-HOST-IP>:8285
When accessing the Presto Coordinator server, <COORDINATOR-HOST-IP>
is the Private IP address of the edge node (e-001). You can check it in the Ambari UI > Hosts menu.
- View available catalogs
presto> show catalogs;
Catalog
---------
system
(1 row)
Query 20190430_020419_00001_j79dc, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:07 [0 rows, 0B] [0 rows/s, 0B/s]
You can find more information about how to execute queries by adding data sources in the Analyzing Hive warehouse data with Presto guide.
Access Presto Discovery UI
You can access the Presto Discovery UI through [View by application] on the Cloud Hadoop console. For more information, see View by application.
You can see the overall status of Presto services on the Presto Discovery UI page. You can also view query history.