Presto setup with AWS EC2 & S3 (2/2)
This is the second part of my Presto setup notes (check here for the first part). In the previous post we set up a VPC with a standalone metastore; in this post, we will set up the Presto coordinator and workers and run some quick tests.
Presto
In the Presto settings, we will need URIs for the metastore and the Presto coordinator. To keep the config simple, I will create 2 ENIs (Elastic Network Interfaces), one for the metastore and one for the coordinator. These 2 private IPs will be attached to the active metastore and coordinator EC2 instances, so we can use fixed URLs in our configuration. I use:
- 10.0.0.20 for metastore
- 10.0.0.10 for presto coordinator
For high availability, you can also use a script that checks instance status and attaches the ENIs to healthy instances, like this.
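A minimal sketch of such a failover script, assuming an `aws` CLI with the right IAM permissions; the ENI ID, standby instance ID, and probe URL below are placeholders, and the AWS call is guarded behind an environment variable so the snippet is safe to source:

```shell
#!/bin/bash
# Hypothetical values -- replace with your own ENI and standby instance.
ENI_ID="eni-0123456789abcdef0"
STANDBY_INSTANCE="i-0123456789abcdef0"
PROBE_URL="http://10.0.0.10:8080/v1/info"

# The coordinator is considered healthy iff the probe returned HTTP 200.
is_healthy() {
  [ "$1" = "200" ]
}

# Guarded so sourcing this file never touches AWS by accident.
if [ "${RUN_FAILOVER:-0}" = "1" ]; then
  status=$(curl -s -o /dev/null -w '%{http_code}' "$PROBE_URL")
  if ! is_healthy "$status"; then
    # Move the fixed-IP ENI to the standby instance (device index 1).
    aws ec2 attach-network-interface \
      --network-interface-id "$ENI_ID" \
      --instance-id "$STANDBY_INSTANCE" \
      --device-index 1
  fi
fi
```

You would run this from cron or a small watchdog on a separate instance, so the check itself does not die with the coordinator.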
Presto AMI
We will create a shared image (AMI) for the coordinator and workers since most of the configs are similar. We will put a few config templates in the image and generate the final configs when we initialize the server (through EC2 “user data”).
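The templating itself is just `sed` substitution on `{{placeholder}}` tokens, the same approach the user-data scripts below use. A tiny standalone demo (file paths here are illustrative):

```shell
# Write a two-line template, then render it the way the user-data scripts do.
printf 'coordinator={{isCoor}}\ndiscovery.uri=http://{{discoverUri}}:8080\n' \
  > /tmp/config.template

sed -e 's/{{isCoor}}/true/g' \
    -e 's/{{discoverUri}}/10.0.0.10/g' \
    /tmp/config.template > /tmp/config.properties

cat /tmp/config.properties
# coordinator=true
# discovery.uri=http://10.0.0.10:8080
```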
To build the image:
# install OpenJDK
sudo yum update
sudo amazon-linux-extras install java-openjdk11

# install presto
cd ~
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.252/presto-server-0.252.tar.gz
tar -xvf presto-server-0.252.tar.gz
mv presto-server-0.252 presto
sudo mv presto /opt/presto
sudo mkdir -p /opt/presto/etc/
sudo chown ec2-user /opt/presto/etc/

# create data dir
sudo mkdir -p /var/presto/data
sudo chown ec2-user /var/presto/data
Also, create the following templates in the “/opt/presto/conf_template/” folder. (Note: parts of the scripts are copied and modified from the presto-on-aws repo.)
config.properties
coordinator={{isCoor}}
node-scheduler.include-coordinator={{includeCoor}}
http-server.http.port=8080
query.max-memory={{maxMem}}
query.max-memory-per-node={{maxMemPerNode}}
query.max-total-memory-per-node={{maxTotalMemPerNode}}
discovery.uri=http://{{discoverUri}}:8080
node.properties
node.environment={{nodeEnv}}
node.id={{nodeID}}
node.data-dir=/var/presto/data
jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
presto.service
[Unit]
Description=Presto
After=syslog.target network.target

[Service]
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.9.11-0.amzn2.0.1.x86_64
User=ec2-user
Type=forking
ExecStart=/opt/presto/bin/launcher start
ExecStop=/opt/presto/bin/launcher stop
Restart=always

[Install]
WantedBy=multi-user.target
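Before baking the AMI, a quick sanity check can catch a missed template file. This is a hypothetical helper, not part of the original setup:

```shell
# Report any expected template missing from a conf_template directory.
# Returns non-zero if anything is missing.
check_templates() {
  local dir="$1" missing=0 f
  for f in config.properties node.properties jvm.config presto.service; do
    [ -f "$dir/$f" ] || { echo "missing: $f"; missing=1; }
  done
  return "$missing"
}

# On the AMI you would run: check_templates /opt/presto/conf_template
```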
Presto Coordinator
Once the AMI is created, we can set up the coordinator with “user data” like:
#!/bin/bash

# add config.properties
cp /opt/presto/conf_template/config.properties /opt/presto/etc/config.properties
chown ec2-user /opt/presto/etc/config.properties
sed -i -e "s/{{isCoor}}/true/g" /opt/presto/etc/config.properties
sed -i -e "s/{{includeCoor}}/false/g" /opt/presto/etc/config.properties
sed -i -e "s/{{maxMem}}/50GB/g" /opt/presto/etc/config.properties
sed -i -e "s/{{maxMemPerNode}}/6GB/g" /opt/presto/etc/config.properties
sed -i -e "s/{{maxTotalMemPerNode}}/7GB/g" /opt/presto/etc/config.properties
sed -i -e "s/{{discoverUri}}/localhost/g" /opt/presto/etc/config.properties
echo "discovery-server.enabled=true" >> /opt/presto/etc/config.properties

# add node.properties
cp /opt/presto/conf_template/node.properties /opt/presto/etc/node.properties
chown ec2-user /opt/presto/etc/node.properties
sed -i -e "s/{{nodeEnv}}/production/g" /opt/presto/etc/node.properties
sed -i -e "s/{{nodeID}}/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)/g" /opt/presto/etc/node.properties

# add jvm.config
cp /opt/presto/conf_template/jvm.config /opt/presto/etc/jvm.config
chown ec2-user /opt/presto/etc/jvm.config

# install and start as a service
sudo cp /opt/presto/conf_template/presto.service /etc/systemd/system/presto.service
sudo systemctl enable presto
sudo systemctl start presto
(Note: you will need to adjust the memory settings based on your instance type.)
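As a rough starting point, you can derive the numbers from the instance's RAM. The percentages below are my own rule of thumb (heap at ~80% of RAM, per-node query memory at ~half the heap), not an official recommendation:

```shell
# Derive rough Presto memory settings from total system RAM (Linux).
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
heap_gb=$(( total_kb * 8 / 10 / 1024 / 1024 ))   # ~80% of RAM for the JVM heap
per_node_gb=$(( heap_gb / 2 ))                   # ~half the heap per query, per node

echo "-Xmx${heap_gb}G"                           # goes into jvm.config
echo "query.max-memory-per-node=${per_node_gb}GB" # goes into config.properties
```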
Presto Worker
Similar to the coordinator, I created an auto-scaling group with the following “user data”:
#!/bin/bash

# add config.properties
cp /opt/presto/conf_template/config.properties /opt/presto/etc/config.properties
chown ec2-user /opt/presto/etc/config.properties
sed -i -e "s/{{isCoor}}/false/g" /opt/presto/etc/config.properties
sed -i -e "s/{{includeCoor}}/false/g" /opt/presto/etc/config.properties
sed -i -e "s/{{maxMem}}/50GB/g" /opt/presto/etc/config.properties
sed -i -e "s/{{maxMemPerNode}}/7GB/g" /opt/presto/etc/config.properties
sed -i -e "s/{{maxTotalMemPerNode}}/8GB/g" /opt/presto/etc/config.properties
sed -i -e "s/{{discoverUri}}/10.0.0.10/g" /opt/presto/etc/config.properties

# add node.properties
cp /opt/presto/conf_template/node.properties /opt/presto/etc/node.properties
chown ec2-user /opt/presto/etc/node.properties
sed -i -e "s/{{nodeEnv}}/production/g" /opt/presto/etc/node.properties
sed -i -e "s/{{nodeID}}/$(curl -s http://169.254.169.254/latest/meta-data/instance-id)/g" /opt/presto/etc/node.properties

# add jvm.config
cp /opt/presto/conf_template/jvm.config /opt/presto/etc/jvm.config
chown ec2-user /opt/presto/etc/jvm.config

# install and start as a service
sudo cp /opt/presto/conf_template/presto.service /etc/systemd/system/presto.service
sudo systemctl enable presto
sudo systemctl start presto
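The quick test below also needs catalog files in “/opt/presto/etc/catalog/” on every node. The post does not show them, so here is a minimal sketch, assuming the metastore from the previous post listens on the standard Thrift port 9083 at its fixed ENI IP:

```
# /opt/presto/etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://10.0.0.20:9083

# /opt/presto/etc/catalog/tpch.properties
connector.name=tpch
```

S3 access can go through the instance's IAM role; adjust the hive.s3.* properties if you use explicit keys.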
Quick Test
Now we should have the Presto coordinator and workers up and running. First, we can connect to the coordinator from the CLI:
./presto-cli --server 10.0.0.10:8080 --catalog hive --schema default
and check if workers are found:
select * from system.runtime.nodes;
You should see the worker instance IDs in the “node_id” column.
To test if the Hive connector and S3 are working, we can create a table in S3 with TPCH data:
CREATE SCHEMA hive.mytest WITH (location = 's3a://{{s3 bucket name}}/mytest/');

CREATE TABLE hive.mytest.lineitem_sf10 AS SELECT * FROM tpch.sf10.lineitem;
We can also use the web UI to check query progress and other stats. You can set up an SSH tunnel to access the coordinator server:
ssh -L localhost:8080:10.0.0.10:8080 ssh-login
Now you can access the web UI at localhost:8080.