Preparing the Cloud Environment

To prepare a cloud environment for Pepperdata, set up the required buckets and folders; obtain the appropriate Pepperdata package, extract it, and upload the contents to the folder you create in your bucket; and upload the Pepperdata configuration and license files. After you finish the preparation steps, you’ll configure your Pepperdata products.

In cloud environments, Pepperdata enables graceful shutdown of ephemeral clusters by default. This ensures the following two things:

There are no metric data gaps at the end of the life of a node in an ephemeral cluster.
If autoscaling optimization was enabled on the cluster, the autoscaling policies are restored to their original settings (requires Pepperdata Supervisor v8.1.2 or later).

On This Page

Prerequisites
Task 1: Upload the Pepperdata Software
Task 2: Copy Configuration Template Files
Task 3: Add the Pepperdata License
Task 4: (Kerberized Clusters) Enable Kerberos Authentication
Task 5: (Rarely Required) Open Port for Listening

Important: Be sure to repeat all the preparation steps on every ResourceManager host and NodeManager host in your cluster.

Run all commands as the root user.

Prerequisites

Ensure that your cloud platform is supported; see the entries for Pepperdata 8.1.x in the table of Supported Platforms by Pepperdata Version.
Ensure that your cloud environment is configured for read access to buckets by all cluster nodes. Read access is required so that the cluster’s bootstrap script can access the Pepperdata installation packages and configuration.
(EMR) When clusters are created, they must be configured to include Hadoop as an application.
Before you install Pepperdata, you must decide whether to install into an existing/running cluster or a new cluster.

There are several factors to consider.
- When you install Pepperdata into an existing/running cluster, you must separately install and activate Pepperdata on every already-running host, which can be a time-consuming process if there are more than just a few hosts.
- To install Pepperdata into an existing/running cluster, every currently-running host in the cluster must already have an initialization (bootstrap) script.
  
  If there is no initialization script, you must destroy the cluster and re-create it so that every host has an initialization script. The script can be empty or you can follow the procedure for activating Pepperdata on a new cluster; in Install Pepperdata (Cloud), see the procedure for your environment.
- Installing Pepperdata into a new cluster means that you do not have to install Pepperdata on individual hosts.
- If you have cluster management functions that are unrelated to Pepperdata (such as certificate management), it’s easier to install Pepperdata into an existing/running cluster because there’s already an initialization (bootstrap) script that you can edit to add a call to the Pepperdata bootstrap script.
- If you want to install Pepperdata into a new cluster, and want to invoke non-Pepperdata cluster management functions (either now or in the future), you can create a “helper bootstrap script” to invoke those functions and call the Pepperdata bootstrap script.
- If you will be configuring autoscaling optimization in an EMR environment, you must install Pepperdata into a new cluster. (There is no support for adding autoscaling optimization in an EMR environment to an existing/running cluster.)

Task 1: Upload the Pepperdata Software

Follow the procedure for your cloud environment.

EMR

Procedure

Set up the required bucket and folders in your Amazon S3 execution environment.
1. In your Amazon S3 execution environment, create a bucket for Pepperdata.
  
  For EMR 5.32.0 and EMR 6.2.0, the bucket name cannot include any dot characters (.). Otherwise, you can name it anything, so long as you adhere to the AWS bucket naming rules. This documentation refers to the bucket as <my-bucket>.
2. In the Pepperdata bucket, create folders for install-packages and config. Do not use any other names for these folders.
  - s3://<my-bucket>/install-packages
  - s3://<my-bucket>/config
3. In the config folder, create a folder for the cluster configuration.
  
  Important: Name the folder with the same name as your cluster name, which must match the cluster name used in the Pepperdata license file and the bootstrap script.
  
  For example, if your cluster is named my-cluster, the new folder would be:
  
  s3://<my-bucket>/config/my-cluster
  
  This folder is referred to as the cluster configuration folder in the rest of the installation and configuration procedures.
Obtain the appropriate Pepperdata package; the filename ends in -rpm-cloud.tgz. See the Downloads page.
Extract the package contents of the TGZ archive to any local location.
Upload the base directory and all its files and subfolders to the install-packages folder that you created (s3://<my-bucket>/install-packages).
- The base directory is supervisor-X.Y.Z-<distribution>, where <distribution> is the final part of the package name, without the file type; for example, supervisor-X.Y.Z-H26_YARN2_A.
- In addition to the emr/ contents that you’ll use, the package contents include files for other cloud environments. You can ignore those files or delete them.
- You can store multiple versions of Pepperdata in the /install-packages folder.
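
The EMR bucket layout and upload steps above can be sketched with the AWS CLI. This is a minimal sketch under stated assumptions, not Pepperdata's own tooling: it assumes that the AWS CLI is installed with write access to the bucket, and the bucket, cluster, package file, and base directory values are placeholders to replace. Because Amazon S3 has no real folders, the sketch creates the cluster configuration folder as an empty key that ends in a slash. The `run`, `bucket_has_dot`, and `prepare_emr_bucket` helpers are illustrative names. By default the script only prints the commands so that you can review them; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder assumption; replace before use.
BUCKET="${BUCKET:-my-bucket}"
CLUSTER="${CLUSTER:-my-cluster}"
PACKAGE="${PACKAGE:-pepperdata-package-rpm-cloud.tgz}"   # hypothetical file name
BASE_DIR="${BASE_DIR:-supervisor-X.Y.Z-H26_YARN2_A}"     # directory created by extraction
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = print only, 0 = execute

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Succeeds when the bucket name contains a dot
# (not allowed for EMR 5.32.0 and EMR 6.2.0).
bucket_has_dot() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

prepare_emr_bucket() {
  if bucket_has_dot "$BUCKET"; then
    echo "warning: bucket name '$BUCKET' contains a dot" >&2
  fi
  run aws s3 mb "s3://$BUCKET"
  # Empty key ending in "/" so the cluster configuration folder appears as a folder.
  run aws s3api put-object --bucket "$BUCKET" --key "config/$CLUSTER/"
  run tar -xzf "$PACKAGE"
  # Copies the contents of BASE_DIR to install-packages/BASE_DIR/ in the bucket.
  run aws s3 cp "$BASE_DIR" "s3://$BUCKET/install-packages/$BASE_DIR" --recursive
}

prepare_emr_bucket
```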

Dataproc

Procedure

Set up the required bucket and folders in your GDP environment.
1. In your GDP environment, create a bucket for Pepperdata.
  
  You can name it anything. This documentation refers to the bucket as <my-bucket>.
2. In the Pepperdata bucket, create folders for install-packages and config. Do not use any other names for these folders.
  - gs://<my-bucket>/install-packages
  - gs://<my-bucket>/config
3. In the config folder, create a folder for the cluster configuration.
  
  Important: Name the folder with the same name as your cluster name, which must match the cluster name used in the Pepperdata license file and the bootstrap script.
  
  For example, if your cluster is named my-cluster, the new folder would be:
  
  gs://<my-bucket>/config/my-cluster
  
  This folder is referred to as the cluster configuration folder in the rest of the installation and configuration procedures.
Obtain the appropriate Pepperdata package; the filename ends in -deb-cloud.tgz. See the Downloads page.
Extract the package contents of the TGZ archive to any local location.
Upload the base directory and all its files and subfolders to the install-packages folder that you created (gs://<my-bucket>/install-packages).
- The base directory is supervisor-X.Y.Z-<distribution>, where <distribution> is the final part of the package name, without the file type; for example, supervisor-X.Y.Z-H26_YARN2_A.
- In addition to the dataproc/ contents that you’ll use, the package contents include files for other cloud environments. You can ignore those files or delete them.
- You can store multiple versions of Pepperdata in the /install-packages folder.
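
The Dataproc upload steps above can be sketched with gsutil. This is a sketch under stated assumptions: gsutil is installed and authorized to write to the bucket, and the bucket, package file, and base directory values are placeholders. The sketch uses `gsutil rsync` so that the destination path is explicit. It does not create the config folders, because Cloud Storage has a flat namespace and a folder appears once the first file is uploaded to it. The `run` and `prepare_dataproc_bucket` helpers are illustrative names. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder assumption; replace before use.
BUCKET="${BUCKET:-my-bucket}"
PACKAGE="${PACKAGE:-pepperdata-package-deb-cloud.tgz}"   # hypothetical file name
BASE_DIR="${BASE_DIR:-supervisor-X.Y.Z-H26_YARN2_A}"     # directory created by extraction
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = print only, 0 = execute

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

prepare_dataproc_bucket() {
  run gsutil mb "gs://$BUCKET"
  run tar -xzf "$PACKAGE"
  # Mirror the contents of BASE_DIR into install-packages/BASE_DIR/.
  run gsutil -m rsync -r "$BASE_DIR" "gs://$BUCKET/install-packages/$BASE_DIR"
  # List the result to confirm that the object paths look as expected.
  run gsutil ls "gs://$BUCKET/install-packages/"
}

prepare_dataproc_bucket
```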

Task 2: Copy Configuration Template Files

Follow the procedure for your cloud environment.

EMR

Procedure

Copy the config-template/pepperdata-config.sh-template and config-template/pepperdata-site.xml-template configuration files from the extracted contents of the Pepperdata package to any local location, and rename them to remove the -template suffix.
- pepperdata-config.sh-template -> pepperdata-config.sh
- pepperdata-site.xml-template -> pepperdata-site.xml
These files are referred to as the cluster-level Pepperdata configuration file and the cluster-level Pepperdata site file, respectively, in the rest of the installation and configuration procedures.
Upload the pepperdata-config.sh and pepperdata-site.xml files to the cluster configuration folder in your Amazon S3 execution environment (s3://<my-bucket>/config/my-cluster).

Continuing with our my-cluster example, the files would be:
- s3://<my-bucket>/config/my-cluster/pepperdata-config.sh
- s3://<my-bucket>/config/my-cluster/pepperdata-site.xml
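
The copy, rename, and upload steps for EMR can be sketched as follows. The location of the config-template directory inside the extracted package, the local working directory, and the bucket and cluster names are assumptions to adjust. The `run`, `strip_template`, and `stage_configs` helpers are illustrative names. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder assumption; replace before use.
BUCKET="${BUCKET:-my-bucket}"
CLUSTER="${CLUSTER:-my-cluster}"
TEMPLATE_DIR="${TEMPLATE_DIR:-supervisor-X.Y.Z-H26_YARN2_A/config-template}"  # assumed path
WORK_DIR="${WORK_DIR:-./pepperdata-conf}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only, 0 = execute

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Remove the -template suffix from a file name.
strip_template() {
  echo "${1%-template}"
}

stage_configs() {
  run mkdir -p "$WORK_DIR"
  for template in pepperdata-config.sh-template pepperdata-site.xml-template; do
    target="$(strip_template "$template")"
    # Copy under the new name, then upload to the cluster configuration folder.
    run cp "$TEMPLATE_DIR/$template" "$WORK_DIR/$target"
    run aws s3 cp "$WORK_DIR/$target" "s3://$BUCKET/config/$CLUSTER/$target"
  done
}

stage_configs
```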

Dataproc

Procedure

Copy the config-template/pepperdata-config.sh-template and config-template/pepperdata-site.xml-template configuration files from the extracted contents of the Pepperdata package to any local location, and rename them to remove the -template suffix.
- pepperdata-config.sh-template -> pepperdata-config.sh
- pepperdata-site.xml-template -> pepperdata-site.xml
These files are referred to as the cluster-level Pepperdata configuration file and the cluster-level Pepperdata site file, respectively, in the rest of the installation and configuration procedures.
Upload the pepperdata-config.sh and pepperdata-site.xml files to the cluster configuration folder that you created in your GDP environment (gs://<my-bucket>/config/my-cluster).

Continuing with our my-cluster example, the files would be:
- gs://<my-bucket>/config/my-cluster/pepperdata-config.sh
- gs://<my-bucket>/config/my-cluster/pepperdata-site.xml

Task 3: Add the Pepperdata License

Follow the procedure for your cloud environment.

EMR

Copy the license.txt file that we emailed to you to the cluster configuration folder in your Amazon S3 execution environment.

Continuing with our my-cluster example, the file would be:

s3://<my-bucket>/config/my-cluster/license.txt

Dataproc

Copy the license.txt file that we emailed to you to the cluster configuration folder in your GDP environment.

Continuing with our my-cluster example, the file would be:

gs://<my-bucket>/config/my-cluster/license.txt
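
The Dataproc license upload, followed by a check that the cluster configuration folder now holds all three files from Tasks 2 and 3, can be sketched with gsutil. The local license path and the bucket and cluster names are placeholder assumptions, and `run`, `expected_objects`, and `upload_license` are illustrative helper names. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder assumption; replace before use.
BUCKET="${BUCKET:-my-bucket}"
CLUSTER="${CLUSTER:-my-cluster}"
LICENSE_FILE="${LICENSE_FILE:-./license.txt}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print only, 0 = execute

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# URIs that the cluster configuration folder should contain after Tasks 2 and 3.
expected_objects() {
  for name in pepperdata-config.sh pepperdata-site.xml license.txt; do
    echo "gs://$BUCKET/config/$CLUSTER/$name"
  done
}

upload_license() {
  run gsutil cp "$LICENSE_FILE" "gs://$BUCKET/config/$CLUSTER/license.txt"
  # gsutil ls fails for an object that does not exist.
  for uri in $(expected_objects); do
    run gsutil ls "$uri"
  done
}

upload_license
```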

Task 4: (Kerberized Clusters) Enable Kerberos Authentication

If the core services are Kerberized (secured with Kerberos), add the Kerberos principal and the path of the corresponding keytab file to the Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh. The core services are the ResourceManagers, the MapReduce Job History Server, and, for Tez support in Application Profiler, the YARN Timeline Server.

Prerequisites

Be sure that the PepAgent user has read access to the keytab file. (To determine the PepAgent user name, see the PD_USER entry in the Pepperdata configuration file, pepperdata-config.sh.)

Procedure

(Optional) Create a new user principal and keytab file to use for Pepperdata.

Although you can reuse an existing principal and keytab file, best practice is to create a new one for Pepperdata. Separate users let you apply ACLs (access control lists) in accordance with your organization’s security policies. User principals, unlike service principals, do not include the hostname.
Verify that the Kerberos principal and keytab file are valid.
1. Obtain and cache a Kerberos ticket-granting ticket by using the kinit command, which should return without error. Be sure to substitute your user name, realm name, and the location of your keytab file for the <your-kerberos-user>, <your-realm-name>, and <path-of-your-keytab-file> placeholders.
  
  kinit <your-kerberos-user>@<your-realm-name> -kt <path-of-your-keytab-file>
2. Authenticate and connect by using the curl --negotiate command.
  
  Be sure to substitute your ResourceManager domain for the resourcemanager.example.com placeholder.
  - For non-secured endpoints (HTTP):
    
    curl -L --tlsv1.2 --negotiate -u : http://resourcemanager.example.com:8088
  - For secured endpoints (HTTPS):
    
    curl -L --tlsv1.2 --negotiate -u : https://resourcemanager.example.com:8090
  If you can connect, you’ve confirmed that the Kerberos principal and keytab file are valid. Otherwise, debug the connection failure.
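
The two verification steps can be combined into one check, sketched below. The principal, keytab path, and ResourceManager URL are placeholders. The mapping from HTTP status codes to outcomes in `classify_status` is an illustrative assumption about how to read the result, not documented Pepperdata behavior. The check runs only when RUN_CHECK=1 is set.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder assumption; replace before use.
PRINCIPAL="${PRINCIPAL:-your-kerberos-user@your-realm-name}"
KEYTAB="${KEYTAB:-path-of-your-keytab-file}"
RM_URL="${RM_URL:-https://resourcemanager.example.com:8090}"

# Interpret the final HTTP status code reported by curl (assumed mapping).
classify_status() {
  case "$1" in
    000)     echo "no-connection" ;;   # curl could not connect at all
    2??|3??) echo "ok" ;;
    401|403) echo "auth-failed" ;;     # the negotiation was rejected
    *)       echo "unexpected" ;;
  esac
}

verify_kerberos() {
  # Obtain and cache a ticket-granting ticket from the keytab.
  kinit -kt "$KEYTAB" "$PRINCIPAL" || return 1
  klist
  # Discard the response body and print only the final status code.
  code="$(curl -s -L --tlsv1.2 --negotiate -u : -o /dev/null -w '%{http_code}' "$RM_URL")"
  echo "HTTP $code: $(classify_status "$code")"
}

if [ "${RUN_CHECK:-0}" = "1" ]; then
  verify_kerberos
fi
```

A result other than `ok` only indicates where to start debugging; it does not identify the cause.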
Add the Kerberos principal and the path of the corresponding keytab file to the Pepperdata configuration.
1. Download a copy of your existing cluster-level Pepperdata configuration file, pepperdata-config.sh, from the environment’s cluster configuration folder (in the cloud) to a location where you can edit it.
2. Open the file for editing, and add the required environment variables. Be sure to substitute your user name, realm name, and the location of your keytab file for the your-kerberos-user, your-realm-name, and path-of-your-keytab-file placeholders.
```
export PD_AGENT_PRINCIPAL=your-kerberos-user@your-realm-name
export PD_AGENT_KEYTAB_LOCATION=path-of-your-keytab-file
```
  Important: If your Kerberos principal contains the _HOST macro expansion, it is replaced at runtime by the fully-qualified domain name of the host. For this replacement to work, reverse DNS must be working correctly on every host where the _HOST macro is configured.
3. Save your changes and close the file.
4. Upload the revised file to overwrite the original pepperdata-config.sh file.
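
The edit in step 2 can also be scripted so that running it twice does not leave duplicate export lines. The sketch below works on the downloaded local copy of pepperdata-config.sh, not on the file in the bucket. The file path, principal, and keytab values are placeholders, and `set_export` is an illustrative helper that assumes values without spaces or shell metacharacters.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder assumption; replace before use.
CONFIG_FILE="${CONFIG_FILE:-./pepperdata-config.sh}"   # downloaded local copy
PRINCIPAL="${PRINCIPAL:-your-kerberos-user@your-realm-name}"
KEYTAB="${KEYTAB:-path-of-your-keytab-file}"

# Write "export NAME=VALUE" to a file, replacing any earlier export of NAME.
set_export() {
  se_file="$1"; se_name="$2"; se_value="$3"
  se_tmp="$(mktemp)" || return 1
  if [ -f "$se_file" ]; then
    # Keep every line except an earlier export of the same variable.
    grep -v "^export ${se_name}=" "$se_file" > "$se_tmp" || true
  fi
  printf 'export %s=%s\n' "$se_name" "$se_value" >> "$se_tmp"
  # Overwrite in place so the file keeps its original permissions.
  cat "$se_tmp" > "$se_file"
  rm -f "$se_tmp"
}

# Edit the local copy only if it has been downloaded.
if [ -f "$CONFIG_FILE" ]; then
  set_export "$CONFIG_FILE" PD_AGENT_PRINCIPAL "$PRINCIPAL"
  set_export "$CONFIG_FILE" PD_AGENT_KEYTAB_LOCATION "$KEYTAB"
fi
```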
(YARN 3.x) For YARN 3.x environments (which typically align with Hadoop 3.x-based distros such as EMR 6.x), add authentication properties to the Pepperdata configuration to enable REST access.

Note: If you will be configuring Application Profiler, you can add these authentication properties now or during the configuration process.
1. Log in to the ResourceManager host, and download a copy of the host’s existing Pepperdata site file, pepperdata-site.xml, from the environment’s cluster configuration folder (in the cloud) to a location where you can edit it.
2. Open the file for editing, and add the required properties.
  
  Be sure to substitute your HTTP service policy—HTTP_ONLY or HTTPS_ONLY—for the your-http-service-policy placeholder in the following code snippet.
  
  For Kerberized clusters, the HTTP service policy is usually HTTPS_ONLY. To confirm, check with your cluster administrator, or look for the value of the yarn.http.policy property in the cluster’s yarn-site.xml file or the Hadoop configuration.
```
<property>
  <name>pepperdata.agent.yarn.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>pepperdata.agent.yarn.http.policy</name>
  <value>your-http-service-policy</value>
</property>
```
  Malformed XML files can cause operational errors that can be difficult to debug. To prevent such errors, we recommend that you use a linter, such as xmllint, after you edit any .xml configuration file.
3. Save your changes and close the file.
4. Upload the revised file to overwrite the original pepperdata-site.xml file.
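
Before uploading, the policy value and the edited file can be checked with a short script. This is a sketch under stated assumptions: the yarn-site.xml path is a common default that may differ on your cluster, the site file path is a local placeholder, and `valid_policy`, `yarn_http_policy`, and `check_xml` are illustrative helper names. The XML checks need xmllint to be installed.

```shell
#!/bin/sh
# Sketch only: both paths below are assumptions; replace before use.
SITE_FILE="${SITE_FILE:-./pepperdata-site.xml}"               # downloaded local copy
YARN_SITE="${YARN_SITE:-/etc/hadoop/conf/yarn-site.xml}"      # assumed location

# Accept only the two policy values named in the text.
valid_policy() {
  case "$1" in
    HTTP_ONLY|HTTPS_ONLY) return 0 ;;
    *)                    return 1 ;;
  esac
}

# Print the yarn.http.policy value from a yarn-site.xml file (empty if unset).
yarn_http_policy() {
  xmllint --xpath 'string(//property[name="yarn.http.policy"]/value)' "$1"
}

# Check that a file is well-formed XML, if xmllint is installed.
check_xml() {
  if command -v xmllint >/dev/null 2>&1; then
    xmllint --noout "$1"
  else
    echo "xmllint not found; skipping well-formedness check" >&2
  fi
}

# Check the edited site file only if it has been downloaded.
if [ -f "$SITE_FILE" ]; then
  check_xml "$SITE_FILE"
fi
```

For example, `valid_policy "$(yarn_http_policy "$YARN_SITE")"` fails when the property is missing or holds a value other than HTTP_ONLY or HTTPS_ONLY.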

Task 5: (Rarely Required) Open Port for Listening

PepAgents listen on port 50505, whether they’re running on ResourceManager hosts, as we recommend, or on NodeManager hosts.

In most environments this port is available for use and is not blocked by internal firewalls. However, in rare situations you might need to open/unblock this port or reconfigure which port Pepperdata uses.

Note: If port 50505 is used by another service, you can reconfigure which port Pepperdata uses by redefining the pepperdata.agent.rpc.server.port property (default: 50505) in the Pepperdata site file, pepperdata-site.xml. After you reconfigure the property, restart the PepAgents.
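
To see whether anything is already listening on the port, a host-level check can be sketched as follows. It assumes that the `ss` utility is available on the host. `port_listening` is an illustrative helper that reads `ss -ltn` style output and reports only that some process is listening on the port, not which process.

```shell
#!/bin/sh
# Sketch only: PORT defaults to the documented PepAgent port.
PORT="${PORT:-50505}"

# Read `ss -ltn` style output on stdin; succeed if a local address ends in :PORT.
port_listening() {
  awk -v suffix=":$1" '
    $4 ~ (suffix "$") { found = 1 }   # field 4 is Local Address:Port
    END { exit !found }
  '
}

if command -v ss >/dev/null 2>&1; then
  if ss -ltn | port_listening "$PORT"; then
    echo "a process is listening on port $PORT"
  else
    echo "no listener on port $PORT"
  fi
fi
```

Before Pepperdata is installed, a listener on the port suggests a conflict with another service. After installation, the listener might be the PepAgent itself.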

To enable SSL support, see Configure SSL Near Real-Time Monitoring on Port 50505.

For information about accessing the stats that are provided via the Web servlets associated with this port, with either HTTP or SSL-secured HTTPS communication, see Pepperdata Status Views via Web Servlets.

Important: Before you go to the next procedure, be sure to install Pepperdata on every ResourceManager host and NodeManager host in your cluster.

Next: Configuring Pepperdata