(Platform Spotlight) Configure HDFS Data Temperature Report (RPM/DEB)
The HDFS Data Temperature report shows the data temperature across a cluster’s entire Hadoop Distributed File System (HDFS) space—how many hot, warm, and cold files are on the HDFS file system; the total sizes for hot, warm, and cold files; and what the configured storage policies are.
Using this data, you can optimize your storage use by changing the storage policies, particularly for files whose policies are not set to cold but whose last access dates match the cold file definition.
The definitions for hot, warm, and cold files for your Pepperdata implementation are included in the HDFS Data Temperature Report for your clusters. The image shows the default definitions, but yours might be different. Be sure to check with your System Administrator.
By default, the HDFS Data Temperature report runs on the standby NameNode. If failover occurs, the report stops running on the former standby NameNode, and starts running on the new standby NameNode. You can override these defaults, and configure the report to run on a different (specific) host instead of the NameNode hosts.
By default, Pepperdata reads the FsImage files once during a scheduled read window of 00:00–23:00 on Saturdays. You can override the default frequency of reads and/or the scheduled read window by changing the Pepperdata configuration.
Before you can access the HDFS Data Temperature report from the Pepperdata dashboard, you must configure your Pepperdata installation to collect the necessary metrics and to upload those metrics to the Pepperdata backend data storage.
Prerequisites
Before you begin configuring your cluster for the HDFS Data Temperature report, ensure that your system meets the required prerequisites.
- The Pepperdata PepAgent must be installed on every host that you want monitored for HDFS data temperature.
- The monitored hosts must be configured for Hadoop, as required by the Hadoop Distributed File System (HDFS).
- On the Pepperdata backend, the HDFS Data Temperature report must be enabled for your cluster. To request or confirm that the report is enabled, contact Pepperdata Support.
- The host(s) on which the report is to run must be HDFS Gateway nodes.
- On the host(s) where the report is run, the `PD_USER` must be the `root` user, the `hdfs` user, or a user with permission to read from the directory that holds the FsImage files (which you’ll specify by using the `pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir` property) and from the Hadoop configuration directory that contains the `hdfs-site.xml` and `core-site.xml` files (which you’ll specify by using the `PD_HDFS_TIERING_CONF_DIR` environment variable).
- (Kerberized clusters) Be sure that you added the required environment variables—`PD_AGENT_PRINCIPAL` and `PD_AGENT_KEYTAB_LOCATION`—to the Pepperdata configuration during the installation process (Task 4. (Kerberized clusters) Enable Kerberos Authentication).
  - The `PD_AGENT_PRINCIPAL` is used only for authentication (not authorization). It does not need HDFS access.
  - It needs to be set only on the host(s) where the report is to run.
- Access times for HDFS must be enabled—that is, a non-zero value for the `dfs.namenode.accesstime.precision` property.
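For reference, access-time tracking is controlled in `hdfs-site.xml`. The stock Hadoop default is one hour (in milliseconds), but check your cluster’s actual setting rather than assuming the default:

```xml
<!-- hdfs-site.xml: access-time tracking. Any non-zero value satisfies this
     prerequisite; 3600000 ms (1 hour) is the stock Hadoop default.
     A value of 0 disables access times (and the report cannot classify files). -->
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>3600000</value>
</property>
```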
Procedure
If you are running the report on a non-NameNode host, perform the following steps on that host instead of on either NameNode host.
1. Add the reporting parameters to the Pepperdata system-wide configuration.

   1. Open the host’s Pepperdata site file, `pepperdata-site.xml`, for editing.

      By default, the Pepperdata site file, `pepperdata-site.xml`, is located in `/etc/pepperdata`. If you customized the location, the file is specified by the `PD_CONF_DIR` environment variable. See Change the Location of pepperdata-site.xml for details.

   2. Enable the HDFS Tiering Fetcher (to fetch HDFS metadata).

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.enabled</name>
        <value>true</value>
      </property>
      ```

   3. Configure the number of inodes—HDFS representations of files and directories—to read during every 15-second interval.

      The larger this value, the faster Pepperdata reads the inodes, and the more memory is needed to process the data (which is configured by the `PD_MEM_LIMIT` environment variable, below). The default value of `100000` (100K) typically provides a good balance.

      Be sure to substitute your value for the `your-inodes-setting` placeholder in the following code snippet.

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch</name>
        <value>your-inodes-setting</value>
      </property>
      ```

   4. Configure the location of the FsImage files—metadata about the file system namespace, including the mapping of blocks to files and file system properties.

      - The FsImage files are located in a `current` subdirectory of the location specified by the `dfs.namenode.name.dir` property in the `hdfs-site.xml` configuration file. For example, if the `dfs.namenode.name.dir` property value is `/bigdisk/hdfs/name/`, you would use `/bigdisk/hdfs/name/current` for the value of the Pepperdata `pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir` property.
      - If your `dfs.namenode.name.dir` property value is a comma-separated list of multiple locations, you can choose any of the locations (and add the `/current` subdirectory).

      Be sure to substitute your path for the `path-to-your-FsImage-files` placeholder in the following code snippet.

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir</name>
        <value>path-to-your-FsImage-files</value>
      </property>
      ```
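      If you are unsure of the configured value, one way to look it up is to query `hdfs-site.xml` directly. The following is a sketch that assumes `xmllint` is installed; it runs against a sample file here, so point `conf_file` at your real one (typically `/etc/hadoop/conf/hdfs-site.xml`):

      ```shell
      # Sample hdfs-site.xml for illustration; point conf_file at your real file.
      conf_file=/tmp/sample-hdfs-site.xml
      cat > "$conf_file" <<'EOF'
      <configuration>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/bigdisk/hdfs/name/,/backup/hdfs/name/</value>
        </property>
      </configuration>
      EOF

      # Extract the property; if it is a comma-separated list, take the first
      # entry, then append /current to get the namenodeFsimageDir value.
      name_dir=$(xmllint --xpath \
        "string(//property[name='dfs.namenode.name.dir']/value)" "$conf_file")
      first_dir=${name_dir%%,*}
      fsimage_dir="${first_dir%/}/current"
      echo "$fsimage_dir"   # /bigdisk/hdfs/name/current
      ```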
   5. Configure how frequently the Pepperdata metrics data file rolls over to a newly opened file. Together with the number of inodes read during every 15-second interval (which is configured by the `pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch` property, above), this controls the number of entries that Pepperdata processes from one file.

      - The interval is in minutes.
      - Default=5.

      ```xml
      <property>
        <name>pepperdata.metric.genericJsonFetch.rollingProtoLogInterval</name>
        <value>1</value>
      </property>
      ```
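      As a rough sizing check (a sketch based on the figures above, assuming one read every 15 seconds): with the default `maxInodesBatch` of 100000 and the 1-minute rollover shown above, each metrics file covers up to four reads:

      ```shell
      # Back-of-the-envelope: inodes per metrics file =
      #   (rollover window / read interval) * inodes per read.
      max_inodes_batch=100000   # ...hdfsTiering.maxInodesBatch
      rollover_minutes=1        # ...rollingProtoLogInterval
      reads_per_file=$(( rollover_minutes * 60 / 15 ))
      inodes_per_file=$(( reads_per_file * max_inodes_batch ))
      echo "$inodes_per_file"   # 400000
      ```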
   6. (To override the default scheduled read window for FsImage files and/or frequency of reads during the read window) Add your scheduled read window for FsImage files and/or your frequency of reads to the Pepperdata configuration.

      If the default scheduled read window and interval for FsImage reads are okay (once during the window of 00:00–23:00 on Saturdays), you do not need to add anything to the Pepperdata configuration, and you should skip this substep.

      - Be sure to substitute your cron schedule and read frequency for the `your-cron-schedule` (in Quartz Cron format) and `your-read-frequency` (in seconds) placeholders.
      - We recommend that you use as long a read window as possible, so that even if the PepAgent is restarted or a NameNode failover occurs, the PepAgent can still begin the read during the read window.

        Given the nature of the data—cold files have not been accessed in more than 90 days, and even warm files were last accessed 30–90 days ago—the numbers of hot, warm, and cold files change very infrequently, so there is no advantage to constantly reading “newer” data.

        Tip: To generate or validate your Quartz Cron expression, use the Cron Expression Generator & Explainer - Quartz. For example, to change the read day from Saturdays to Tuesdays, the cron schedule would be `* * 0-23 ? * TUE`.
      - We recommend that the read interval correspond to the length of the read window, so that only a single read is performed during the window; for example, if the window is one day, set the interval to `86400` (seconds). The FsImage files can be very large (upwards of 50 GB), and storing multiple files—which are ignored because the report generation uses only the most recent file—wastes storage resources and imposes large loads on the NameNode.

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.cronSchedule</name>
        <value>your-cron-schedule</value>
      </property>
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.readIntervalSecs</name>
        <value>your-read-frequency</value>
      </property>
      ```
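      For example, combining the Tuesday schedule and one-day interval from the recommendations above, the filled-in properties would be:

      ```xml
      <!-- Read FsImage files on Tuesdays, anytime during 00:00-23:00,
           at most once per day (86400 seconds). -->
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.cronSchedule</name>
        <value>* * 0-23 ? * TUE</value>
      </property>
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.readIntervalSecs</name>
        <value>86400</value>
      </property>
      ```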
   7. Be sure that the XML properties that you added are correctly formatted.

      Malformed XML files can cause operational errors that can be difficult to debug. To prevent such errors, we recommend that you use a linter, such as `xmllint`, after you edit any .xml configuration file.
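      A quick well-formedness check with `xmllint --noout` prints nothing and exits 0 when the file is valid XML (a sketch; `xmllint` ships with libxml2 and may need to be installed separately, and the sample file here stands in for your real `pepperdata-site.xml`):

      ```shell
      # In practice, run:
      #   xmllint --noout /etc/pepperdata/pepperdata-site.xml
      # (adjust the path if you use a custom PD_CONF_DIR).
      # Demonstrated here on a sample fragment:
      cat > /tmp/pd-sample.xml <<'EOF'
      <configuration>
        <property>
          <name>pepperdata.agent.genericJsonFetch.hdfsTiering.enabled</name>
          <value>true</value>
        </property>
      </configuration>
      EOF
      xmllint --noout /tmp/pd-sample.xml && echo "XML OK"
      ```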
   8. Save your changes and close the file.
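   Putting the substeps together, if `pepperdata-site.xml` contained only these settings, the result might look like the following sketch (illustrative values; the last two properties are needed only if you override the default read schedule):

   ```xml
   <configuration>
     <!-- Enable the HDFS Tiering Fetcher. -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.enabled</name>
       <value>true</value>
     </property>
     <!-- Inodes to read per 15-second interval (default shown). -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch</name>
       <value>100000</value>
     </property>
     <!-- FsImage directory: dfs.namenode.name.dir plus /current. -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir</name>
       <value>/bigdisk/hdfs/name/current</value>
     </property>
     <!-- Metrics file rollover, in minutes. -->
     <property>
       <name>pepperdata.metric.genericJsonFetch.rollingProtoLogInterval</name>
       <value>1</value>
     </property>
     <!-- Optional: override the default read window and frequency. -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.cronSchedule</name>
       <value>* * 0-23 ? * TUE</value>
     </property>
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.readIntervalSecs</name>
       <value>86400</value>
     </property>
   </configuration>
   ```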
2. Configure the reporting for the PepAgent.

   1. Open the Pepperdata configuration file, `/etc/pepperdata/pepperdata-config.sh`, for editing.

   2. Add the environment variables in the following format. The snippet uses typical values. Be sure to replace them with values appropriate for your environment.

      - `PD_HDFS_TIERING_CONF_DIR`—Name of the Hadoop configuration directory that contains the `hdfs-site.xml` and `core-site.xml` files; typically `/etc/hadoop/conf/`.
      - `PD_MEM_LIMIT`—(Default=`128m`) Amount of memory to provide for processing the inodes, from `64m` to `2g` (inclusive). The greater the number of inodes read during every 15-second interval (which is configured by the `pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch` property, above), the larger the `PD_MEM_LIMIT` must be.

        The value must be formatted as a number immediately followed, without any intervening spaces, by one of the following unit designations: `m` (MB) or `g` (GB).

        For running the HDFS Data Temperature Report on non-NameNode hosts, we recommend setting this to `256m`.
      - `PD_HDFS_TIERING_RUN_ON_HOST`—(Only for running the report on a non-NameNode host; no default) Canonical name of the host where the report is to run; for example, `ip-192-168-1-1.internal`. Be sure to substitute your fully qualified, canonical hostname for the `YOUR.CANONICAL.HOSTNAME` placeholder in the following code snippet.

      ```shell
      export PD_LOAD_HDFS_INODE_JARS=1
      export PD_HDFS_TIERING_CONF_DIR=/etc/hadoop/conf/
      export PD_MEM_LIMIT=128m
      # To run the report on a non-NameNode host, uncomment and assign the following value:
      # export PD_HDFS_TIERING_RUN_ON_HOST=YOUR.CANONICAL.HOSTNAME
      ```

   3. Save your changes and close the file.

3. For running the report on the NameNode hosts, repeat steps 1–2 on the second NameNode host in your cluster.
4. On every host in the cluster, restart the PepCollector and PepAgent services.

   1. Restart the Pepperdata Collector. You can use either the `service` command (if provided by your OS) or `systemctl`:

      ```shell
      sudo service pepcollectd restart
      ```

      ```shell
      sudo systemctl restart pepcollectd
      ```

   2. Restart the PepAgent. You can use either the `service` command (if provided by your OS) or `systemctl`:

      ```shell
      sudo service pepagentd restart
      ```

      ```shell
      sudo systemctl restart pepagentd
      ```