Using Presto (Trino)
Available in VPC
Presto is a tool for analyzing terabytes to petabytes of data using distributed queries.
Presto can read data from various sources, including HDFS, Hive warehouse, and RDBMS.
Unlike Hive and Pig, where queries are executed as MapReduce jobs, Presto has its own query execution engine. Since Presto is designed to pass data from memory to memory, without writing the results of each step to disk, it can analyze data stored in HDFS faster and more interactively than Hive. This also makes Presto better suited than Hive for integration with BI tools such as Tableau.
- Up to Cloud Hadoop 1.9, it was used under the name Presto, and in Cloud Hadoop 2.0, it is used under the name Trino.
- Presto, like Hive and Pig, is designed to process OLAP queries. Therefore, it cannot replace transaction-based RDBMS.
Presto components
The Presto service consists of two components: coordinator and worker.
It can have 1 coordinator as the master, and multiple workers as the slaves. Communication between coordinators and workers and among worker nodes relies on REST APIs.
Coordinator
The coordinator is the hub of the Presto service and is responsible for the following:
- Receiving requests from the client
- Conducting SQL syntax parsing and query planning
- Adjusting worker nodes when running queries and tracking activities of worker nodes
Worker
Workers perform tasks received from the coordinator and process data. The task execution result is transferred directly from worker to client.
Query execution process
The process of a query execution is as follows: (see the image below)
- Start the Presto worker process and register with the coordinator's discovery server
- Workers must be registered on the discovery server so that the coordinator can assign tasks to them
- The client transfers queries to the coordinator through HTTP
- The coordinator creates a query plan and requests schema data from the connector plugin
- The coordinator sends the task to be executed to the worker
- The worker reads data from data sources through the connector plugin
- The worker executes the task in the memory
- The worker returns the results to the client
Data sources
Connector
In Presto, the connector functions like a driver in a database. In other words, it connects the data sources with the coordinator or worker so the data can be read from the data source.
By default, Presto provides connectors for various data sources such as Hive, MySQL, and Kafka.
Catalog
The catalog is a mount point for a connector. Every catalog is associated with a specific connector. Presto accesses data sources through the connector mounted on the catalog. For example, to access a Hive warehouse with the Hive connector, you need to configure a Hive catalog (hive.properties) under /etc/presto/catalog.
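A minimal Hive catalog file could look like the sketch below. The metastore host is a placeholder for your environment, and the connector name follows the convention used by Presto 0.2xx releases:

```properties
# /etc/presto/catalog/hive.properties (illustrative)
connector.name=hive-hadoop2
# <METASTORE-HOST> is a placeholder for your Hive metastore host
hive.metastore.uri=thrift://<METASTORE-HOST>:9083
```

The file name (without the .properties extension) becomes the catalog name used in queries, e.g. hive.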
Presto queries can accommodate one or more catalogs. In other words, you can use multiple data sources within a single query.
Catalogs are defined as individual .properties files in the catalog directory (/etc/presto/catalog) under the Presto configuration directory (/etc/presto/).
Schema
A schema is a way to organize your tables.
You can define a table set, which can be queried at once, using one catalog and schema.
When accessing Hive or RDBMS with Presto, the schema is equivalent to the concept of a database.
In other data sources, tables can be organized to create a schema.
Table
The table concept of RDBMS is applied identically here.
When referencing a table in Presto, it must be fully-qualified, meaning that the catalog, schema, and table name must be specified, separated by periods (.) (e.g., hive.samples.t1).
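For example, with a hive catalog containing a samples schema (hypothetical names), a fully-qualified query and its USE shortcut look like this:

```sql
-- fully-qualified: catalog.schema.table
SELECT * FROM hive.samples.t1 LIMIT 10;

-- or set the default catalog and schema first,
-- then reference the table by name alone
USE hive.samples;
SELECT count(*) FROM t1;
```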
Using Presto clusters
Create cluster
From NAVER Cloud Platform console, create the Cloud Hadoop cluster.
For more information on creating clusters, see Create cluster.
Starting with Cloud Hadoop 1.3, you can use clusters with Presto v0.240 installed.
In Cloud Hadoop 1.3, even if you didn't create the cluster as a Presto type, you can still add Presto using Ambari Add Service.
Check Presto service in Ambari UI
After installing Presto, you can see the service in the Ambari UI. You can start and stop each component of the service from this page.
- Summary: shows the hosts where each component is installed
- Configs: changes configurations of Presto service
- Quick Links: Presto Discovery UI
- Accessing these links requires tunneling. Access through the web UI link provided by the console. For more information, see Access Presto Discovery UI.
Key configurations
jvm.properties
Enter the JVM options used by the coordinator or worker server. You can adjust the JVM heap size with the -Xmx option. Since the coordinator node and worker node specifications may differ, memory settings are applied separately for each role, and the jvm.properties configuration is divided by role accordingly. On each server, a single file with the same name, jvm.properties, exists under the /etc/presto/conf path.
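A typical set of entries might look like the following sketch; the heap size is illustrative and should be sized to the node's actual memory:

```properties
# Illustrative jvm.properties entries; -Xmx sets the JVM heap size
-server
-Xmx8G
-XX:+UseG1GC
-XX:+ExitOnOutOfMemoryError
```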
config.properties
Although rarely needed, you can configure memory settings in config.properties differently for the coordinator and worker roles. The key settings are defined as follows:
Item | Default value | Description |
---|---|---|
query.max-memory-per-node | 1G | - Maximum user memory that a single query can use on a worker node - If the user memory a query uses on any worker node exceeds this limit, the query is canceled |
query.max-memory | 20G | - Maximum user memory that a single query can use across the entire cluster - If the sum of user memory allocated to all worker nodes by a query exceeds this limit, the query is canceled |
query.max-total-memory-per-node | 2G | - Maximum user and system memory that a single query can use on a worker node - If the user and system memory a query uses on any worker node exceeds this limit, the query is canceled |
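In config.properties these settings appear as plain key-value pairs; the values below simply restate the defaults from the table:

```properties
# Per-query memory limits (values shown are the defaults above)
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
query.max-memory=20GB
```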
To change http-server.http.port, you must set the coordinator's and the workers' http-server.http.port to the same value. If you specify different ports, they cannot communicate with each other.
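For example, the following line must appear with the same value in both the coordinator's and the workers' config.properties (8285 matches the port used in the CLI example later in this guide):

```properties
# Must be identical on the coordinator and all workers
http-server.http.port=8285
```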
node.properties
You can set the log directory, PID directory, and others used by the Presto daemon. To change these directories, check the owner and permissions of each directory on each server.
You can specify the name of the environment currently in use in node.environment.
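A minimal sketch of node.properties is shown below; node.environment comes from the text above, while the other keys and values are illustrative assumptions for a typical installation:

```properties
# Name of the environment this node belongs to
node.environment=production
# Illustrative: unique identifier for this node
node.id=presto-worker-001
# Illustrative: directory for logs and other node data
node.data-dir=/var/lib/presto
```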
Presto CLI
The Presto CLI provides an interactive shell for running queries.
You can use the shell on any host assigned the Presto CLI role.
For a detailed explanation on using Presto CLI, see Presto CLI Documentation.
- Connect to Presto Coordinator server
/usr/lib/presto/bin/presto-cli --server <COORDINATOR-HOST-IP>:8285
When accessing the Presto Coordinator server, <COORDINATOR-HOST-IP>
is the Private IP address of the edge node (e-001). You can check it in the Ambari UI > Hosts menu.
- View available catalogs
presto> show catalogs;
Catalog
---------
system
(1 row)
Query 20190430_020419_00001_j79dc, FINISHED, 2 nodes
Splits: 36 total, 36 done (100.00%)
0:07 [0 rows, 0B] [0 rows/s, 0B/s]
You can find more information about how to execute queries by adding data sources in the Analyzing Hive warehouse data with Presto guide.
Access Presto Discovery UI
You can access the Presto Discovery UI through [View by application] on the Cloud Hadoop console. For more information, see View by application.
You can see the overall status of Presto services on the Presto Discovery UI page. You can also view query history.