Presto setup with AWS EC2 & S3 (1/2)

Chyi-Kwei Yau
May 15, 2021


I have been using Presto for a while now, but had only set up a Presto development environment on my local machine once. Recently I planned to run some performance tests on it, so I set up a production-ready environment from scratch on AWS. Setting up Presto itself is not complex and the official guide is easy to follow, but for production you will want to make it secure and scalable.

In this note, I will set up a Presto cluster with a standalone metastore in an AWS VPC and use S3 as storage. The whole setup takes about 1 hour and no third-party libraries or packages are required.

This note will include 3 parts:

  • AWS setup: This includes VPC, EC2, IAM, EIN, security groups, etc.
  • Metastore setup: We will set up an RDS server to persist metadata and 1 or more (for HA) EC2 servers for the Hive metastore service.
  • Presto setup: We will follow the deployment guide to set up the Presto coordinator and workers. We will also add an auto-scaling group for workers to make the cluster scalable.

AWS

First, we need a VPC with 2 subnets (1 public and 1 private). To keep things secure, all services will be hosted in the private subnet. The public subnet will only have a NAT gateway and an EC2 instance for SSH login. The final architecture will look like this:

You can follow this AWS doc to set up a VPC with public and private subnets. In my setting, I use 10.0.0.0/24 for the private subnet IPs and 10.0.1.0/24 for the public subnet IPs.
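If you prefer scripting this over the console wizard, the same layout can be sketched with the AWS CLI. This is only a sketch: the 10.0.0.0/16 VPC block is my assumption, and route tables, the internet gateway, and the NAT gateway still need to be wired up as in the AWS doc.

```shell
# Create the VPC and the two subnets (subnet CIDRs match the ones above)
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --query 'Vpc.VpcId' --output text)
PRIVATE_SUBNET=$(aws ec2 create-subnet --vpc-id "$VPC_ID" \
  --cidr-block 10.0.0.0/24 --query 'Subnet.SubnetId' --output text)
PUBLIC_SUBNET=$(aws ec2 create-subnet --vpc-id "$VPC_ID" \
  --cidr-block 10.0.1.0/24 --query 'Subnet.SubnetId' --output text)
```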

For the EC2 servers, we will create IAM roles and corresponding security groups first. In my setting, I create 3 separate IAM roles:

  1. ssh-login: This is the entrance for public access.
  2. metastore: This is for the Hive metastore servers. Make sure this role has read/write access to your S3 bucket.
  3. presto: Similar to metastore but for the Presto servers. This will also need S3 access.

And for the base image, I use the Amazon Linux 2 AMI.

Next we can launch an ssh-login server in the public subnet and make sure you can SSH into it. We will use this to access the other instances in the private subnet in the next 2 parts.
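To reach the private instances through the ssh-login host in one hop, you can use SSH's ProxyJump. A sketch of the ~/.ssh/config entries (the host alias, key file, and IPs here are placeholders, not values from this setup):

```
Host bastion
    HostName <ssh-login public IP>
    User ec2-user
    IdentityFile ~/.ssh/my-key.pem

Host 10.0.0.*
    User ec2-user
    IdentityFile ~/.ssh/my-key.pem
    ProxyJump bastion
```

With this in place, `ssh 10.0.0.12` tunnels through the ssh-login host transparently.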

Standalone Metastore

My metastore setup is based on this post and the admin manual. I created a Postgres RDS instance as the backend RDBMS. It is a good idea to set up a new user and database for the metastore:

CREATE ROLE hive LOGIN PASSWORD 'hive';
GRANT hive TO postgres;
CREATE DATABASE hive_metastore owner hive;

To test the connection, you can install the Postgres client on the ssh-login server and access it:

psql -h <RDS URL> -p 5432 -d hive_metastore -U hive

Next we will set up a metastore server:

# JDK
sudo yum update -y
sudo amazon-linux-extras install java-openjdk11
# metastore
wget https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz
tar -xvf hive-standalone-metastore-3.1.2-bin.tar.gz
rm -f hive-standalone-metastore-3.1.2-bin.tar.gz
mv apache-hive-metastore-3.1.2-bin metastore
sudo mv metastore/ /opt/
# hadoop
wget https://apache.claz.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
tar -xvf hadoop-3.2.2.tar.gz
sudo mv hadoop-3.2.2 /opt
# postgres driver
wget https://jdbc.postgresql.org/download/postgresql-42.2.20.jar
sudo mv postgresql-42.2.20.jar /opt/metastore/lib/
# replace the outdated guava library
sudo cp /opt/hadoop-3.2.2/share/hadoop/common/lib/guava-27.0-jre.jar /opt/metastore/lib/
sudo rm -f /opt/metastore/lib/guava-19.0.jar
# aws SDK
sudo cp /opt/hadoop-3.2.2/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar /opt/metastore/lib/
sudo cp /opt/hadoop-3.2.2/share/hadoop/tools/lib/hadoop-aws-3.2.2.jar /opt/metastore/lib/
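Since several of the jars above are copied by hand, a small sanity check helps catch a missed file before the service fails at runtime. This is just a convenience sketch; the jar names assume the exact versions downloaded above:

```shell
# check_jars DIR - verify the hand-copied metastore jars exist in DIR
check_jars() {
  dir="$1"; missing=0
  for jar in postgresql-42.2.20.jar guava-27.0-jre.jar \
             aws-java-sdk-bundle-1.11.563.jar hadoop-aws-3.2.2.jar; do
    [ -f "$dir/$jar" ] || { echo "missing: $jar"; missing=1; }
  done
  return "$missing"
}

check_jars /opt/metastore/lib && echo "all jars present"
```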

And for the metastore configuration (metastore-site.xml):

<configuration>
<property>
<name>metastore.task.threads.always</name>
<value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
</property>
<property>
<name>metastore.expression.proxy</name>
<value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
</property>
<!-- database settings -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://{{your RDS URL}}</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>
<property>
<name>fs.s3a.assumed.role.arn</name>
<value>{{metastore IAM ARN}}</value>
</property>
</configuration>

(One thing to note: the official doc says we should set metastore.task.threads.always to org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask for standalone mode, but in my test, EventCleanerTask was the only task we needed.)

For environment variables:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.9.11-0.amzn2.0.1.x86_64
export HADOOP_HOME=/opt/hadoop-3.2.2
export METASTORE_HOME=/opt/metastore
export PATH=$PATH:$HOME/.local/bin:$HOME/bin:$METASTORE_HOME/bin:$HADOOP_HOME/bin

Now we can initialize metastore schema and run the service. To initialize schema, run:

/opt/metastore/bin/schematool -initSchema -dbType postgres
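After initialization, the same schematool can sanity-check the result. A sketch, assuming the standard schematool options:

```shell
/opt/metastore/bin/schematool -info -dbType postgres
/opt/metastore/bin/schematool -validate -dbType postgres
```

-info reports the schema version recorded in Postgres; -validate checks it against the installed binaries.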

To run metastore as a service, I use systemd to manage it:

[Unit]
Description=Hive metastore
After=network.target
[Service]
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.9.11-0.amzn2.0.1.x86_64
Environment=HADOOP_HOME=/opt/hadoop-3.2.2
Environment=METASTORE_HOME=/opt/metastore
User=ec2-user
ExecStart=/opt/metastore/bin/start-metastore
[Install]
WantedBy=multi-user.target
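For completeness, a sketch of installing this unit file (the file name matches the service name used below; systemd only requires that it live under /etc/systemd/system):

```shell
sudo cp metastore.service /etc/systemd/system/metastore.service
sudo systemctl daemon-reload
```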

And start the service:

sudo systemctl enable metastore.service
sudo systemctl start metastore
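To confirm the service came up, check its status and probe the Thrift port (9083 is the metastore's default listening port):

```shell
sudo systemctl status metastore
# from another instance in the private subnet:
nc -zv <metastore private IP> 9083
```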

Now the metastore is up and running. We will finish the Presto setup in the next post and run some test queries on it.
