Configuration (RPM/DEB)

After Pepperdata is installed, configured, and running, you might want to reconfigure some of the initial settings, validate that things are working as expected, or add metrics collection. This section’s procedures explain how to do so, and even how to disable certain Pepperdata components.

Configure Encryption
In addition to the transport-level encryption that is always in effect between your cluster and Pepperdata, and the secure HTTPS access that the Pepperdata dashboard requires, you can enable 128-bit AES-CBC (symmetric) storage-level encryption for sensitive data at rest, such as user names, hostnames, and job and queue names. When this extra level of encryption is enabled, Pepperdata receives and stores only the encrypted version of that data. To configure encryption for your cluster, add the PD_ENCRYPTION_KEY and PD_ANONYMIZE variables to your Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh.
Configure a Custom Certificate Authority
If your environment uses a custom Certificate Authority (CA) file that contains custom or non-standard certificates or chains (such as self-signed certificates) that are not in the standard set typically bundled with internet browsers, you must enable Pepperdata to find the CA file. You can either configure the REQUESTS_CA_BUNDLE and SSL_CERT_FILE environment variables or place the CA file in a location that Pepperdata searches for CA files, namely the default CA file locations for each supported OS vendor and version. The environment variables take precedence: if you assign them, Pepperdata does not search for certificates anywhere else, and so will not find them even if you’ve installed them according to your OS’s requirements.
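The precedence rule above can be illustrated with a minimal Python sketch. It is not Pepperdata's implementation: the function name resolve_ca_bundle, the order in which the two variables are checked, and the list of default paths (common Debian, RHEL, and SUSE locations) are all assumptions made for illustration.

```python
import os

# Assumption: common distro CA bundle paths. The locations Pepperdata
# actually searches may differ per OS vendor/version.
DEFAULT_CA_PATHS = [
    "/etc/ssl/certs/ca-certificates.crt",  # Debian/Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",    # RHEL/CentOS
    "/etc/ssl/ca-bundle.pem",              # SUSE
]


def resolve_ca_bundle(environ=None, candidates=DEFAULT_CA_PATHS,
                      exists=os.path.isfile):
    """Return (path, source) for the CA file that would be used, or (None, None)."""
    environ = os.environ if environ is None else environ
    # Hypothetical check order; the source only says both variables are honored.
    for var in ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE"):
        value = environ.get(var)
        if value:
            # An assigned variable wins outright: no fallback search happens,
            # even if the file it names is missing.
            return value, var
    for path in candidates:
        if exists(path):
            return path, "default location"
    return None, None
```

Calling resolve_ca_bundle() on a host shows which file wins under these assumptions, which can help explain why a certificate installed in the OS default location is ignored when one of the variables is set.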
Set Up a Pepperdata Proxy
If your cluster hosts must be “air gapped” from the internet or otherwise isolated, you can use a proxy server on your network to enable Pepperdata functionality. Pepperdata is fully integrated with the standard https_proxy environment variable, which you can configure in the Pepperdata configuration file, /etc/pepperdata/pepperdata-config.sh.
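As a quick sanity check before relying on the proxy, a small Python sketch such as the following can confirm that https_proxy is set and parses to a host and port. The helper name https_proxy_endpoint and the example proxy address are hypothetical; only the https_proxy variable name comes from the text above.

```python
import os
from urllib.parse import urlsplit


def https_proxy_endpoint(environ=None):
    """Return (host, port) named by the https_proxy variable, or None if unset."""
    environ = os.environ if environ is None else environ
    proxy_url = environ.get("https_proxy")
    if not proxy_url:
        return None
    # Accept both "http://host:port" and bare "host:port" forms.
    if "://" not in proxy_url:
        proxy_url = "http://" + proxy_url
    parts = urlsplit(proxy_url)
    return parts.hostname, parts.port


if __name__ == "__main__":
    # Example with a made-up proxy address.
    print(https_proxy_endpoint({"https_proxy": "http://proxy.example.internal:3128"}))
```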
Enable SAML-SSO for User Authentication
In addition to standard user name and password authentication via Okta, Pepperdata supports Service Provider (SP) initiated SAML-SSO (Security Assertion Markup Language–Single Sign-On), for SAML version 2.0, to authenticate users of Pepperdata-hosted services. To enable SAML-SSO, we need to know your SSO integration details, such as your platform, entity ID, and SSO URL. Download the Pepperdata SSO Intake Form (Pepperdata-SSO-Intake-Form.docx or Pepperdata-SSO-Intake-Form.pdf), fill it out (electronically or by hand), and attach it to a request for support (see Open a Service Request).
Configure SSL Near Real-Time Monitoring on Port 50505
By default, the port that Pepperdata uses for listening (port 50505 for PepAgents) is unsecured. You can configure the port for secure SSL communication by using certificates and adding properties for the certificate’s keystore location, name, and password to the Pepperdata site file, pepperdata-site.xml.
Configure History Fetcher Retries
To ensure that application history is successfully fetched from the applicable component (MapReduce Job History Server for MapReduce apps, Spark History Server for Spark apps, or YARN Timeline Server for Tez apps), the Pepperdata Supervisor uses a two-phase approach. Phase 1 makes the initial attempt to fetch the history, and if it fails, makes up to three retries. Phase 2 adds an additional try and by default up to five retries, with the interval between retries increased by a factor of five every time. You can customize the number of retries for each phase, which might be required for environments with extreme network latency or frequent connectivity issues.
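To see how quickly the Phase 2 wait times grow, the following Python sketch computes the interval before each retry when the interval is multiplied by five every time. The starting interval is not stated above, so initial_interval_s is a placeholder assumption, and the function name is hypothetical.

```python
def phase2_retry_intervals(initial_interval_s, retries=5, factor=5):
    """Seconds to wait before each Phase 2 retry, growing by `factor` each time.

    initial_interval_s is an assumed starting value; the real default
    is not given in this section.
    """
    intervals = []
    interval = initial_interval_s
    for _ in range(retries):
        intervals.append(interval)
        interval *= factor
    return intervals


if __name__ == "__main__":
    schedule = phase2_retry_intervals(1)
    print(schedule)       # [1, 5, 25, 125, 625]
    print(sum(schedule))  # 781 seconds of total waiting with a 1-second start
```

Because the growth is geometric, raising the retry count by even one or two adds far more total wait time than the earlier retries combined, which is worth keeping in mind when tuning for high-latency environments.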
Configure Spark History Servers
If you’re using Application Profiler to fetch history data for Spark apps, you can customize the connection timeout value and/or add a second Spark History Server for monitoring.
Advanced Spark Job Monitoring
By default, the Pepperdata PepAgent monitors container-launched Spark jobs where the Spark driver and the PepAgent are on the same host. It’s possible, however, to run container-launched Spark jobs and have the Spark driver send the metrics data to a PepAgent on a remote host—a host other than where the Spark driver is resident. To enable such monitoring for a container-launched Spark job, include the following Pepperdata configuration override in the launch command: --conf spark.force.data.toRemoteHost=true.
Run Pepperdata as a Non-Root User
If your organization requires that everything be run under the principle of least privilege (PoLP), you can run Pepperdata as a non-root user (a user who lacks root access to the cluster hosts). However, because collecting some I/O, CPU, and network metrics requires privileged access, Pepperdata is unable to collect that data when running as a non-root user. To change from the default root user to another user, stop the Pepperdata services, remove the default log directory, change the PD_USER variable in the Pepperdata configuration file, pepperdata-config.sh, and restart the Pepperdata services.
Verify Ability to Upload to Pepperdata Dashboard
If the Pepperdata dashboard does not show current data, it’s likely that pepcollectd—the Pepperdata Collector that uploads data from cluster hosts to the Pepperdata dashboard—cannot upload its data because of connection errors.
Adding Apache® Impala Query Metrics
Cluster administrators often need precise resource usage information, even down to the query level of detail, in order to create accurate chargeback reports. If you’re using Apache Impala, you can enable Pepperdata to collect Impala query metrics for CPU and memory usage. When the queries are finished, Pepperdata reads the Impala query profiles to calculate the resource usage.
Memory Swapping Detection and Mitigation in the Pepperdata Supervisor
Swapping is a memory management technique that computer operating systems (OSes) use to ensure that RAM is maximally used on processes that need it. During swapping, the OS moves processes between virtual memory on disk and physical RAM. Swapping is a normal, expected part of operation, but it can be a problem when there are too many processes scheduled to run at once.
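For a rough, host-level picture of swap activity, the Python sketch below compares the pswpin and pswpout counters that Linux exposes in /proc/vmstat between two samples. This is a generic Linux technique shown for illustration only; it is not a description of how the Pepperdata Supervisor detects swapping, and the function names are hypothetical.

```python
def parse_swap_counters(vmstat_text):
    """Extract cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Pages swapped in and out between two samples of the counters."""
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("pswpin", "pswpout")}


if __name__ == "__main__":
    import time

    # Linux only: sample the counters twice, five seconds apart.
    with open("/proc/vmstat") as f:
        first = parse_swap_counters(f.read())
    time.sleep(5)
    with open("/proc/vmstat") as f:
        second = parse_swap_counters(f.read())
    print(swap_activity(first, second))
```

Sustained non-zero deltas across many samples suggest the host is swapping continuously rather than occasionally.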
Disable/Enable Pepperdata Data Collection for a Host
Occasionally you might want Pepperdata to stop collecting data from a cluster host on which Pepperdata is installed, or to resume data collection on a host where you previously disabled it. In either case, you can disable or enable data collection for the host by configuring the host’s PD_COLLECT_AND_UPLOAD environment variable.
Disabling Components/Uninstalling Pepperdata
Some Pepperdata products can be individually disabled, while other products are always on as long as Pepperdata is running. This section includes procedures for all the products that can be individually disabled, and explains their interdependencies.