(Platform Spotlight) Configure HDFS Data Temperature Report (RPM/DEB)
The HDFS Data Temperature report shows the data temperature across a cluster’s entire Hadoop Distributed File System (HDFS) space—how many hot, warm, and cold files are on the HDFS file system; the total sizes for hot, warm, and cold files; and what the configured storage policies are.
Using this data, you can optimize your storage use by changing the storage policies, particularly for files whose policies are not set to cold but whose last access dates match the cold file definition.
The definitions for hot, warm, and cold files for your Pepperdata implementation are included in the HDFS Data Temperature Report for your clusters. The image shows the default definitions, but yours might be different. Be sure to check with your System Administrator.
By default, the HDFS Data Temperature report runs on the standby NameNode. If failover occurs, the report stops running on the former standby NameNode, and starts running on the new standby NameNode. You can override these defaults, and configure the report to run on a different (specific) host instead of the NameNode hosts.
By default, Pepperdata reads the FsImage files once during a scheduled read window of 00:00–23:00 on Saturdays. You can override the default frequency of reads and/or the scheduled read window by changing the Pepperdata configuration.
Before you can access the HDFS Data Temperature report from the Pepperdata dashboard, you must configure your Pepperdata installation to collect the necessary metrics and to upload those metrics to the Pepperdata backend data storage.
Prerequisites
Before you begin configuring your cluster for the HDFS Data Temperature report, ensure that your system meets the required prerequisites.
- The Pepperdata PepAgent must be installed on every host that you want monitored for HDFS data temperature.
- The monitored hosts must be configured for Hadoop, as required by the Hadoop Distributed File System (HDFS).
- On the Pepperdata backend, the HDFS Data Temperature report must be enabled for your cluster. To request or confirm that the report is enabled, contact Pepperdata Support.
- The host(s) on which the report is to run must be HDFS Gateway nodes.
- On the host(s) where the report is run, the `PD_USER` must be the `root` user, the `hdfs` user, or a user with permission to read from the directory that holds the FsImage files (which you’ll specify by using the `pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir` property) and from the Hadoop configuration directory that contains the `hdfs-site.xml` and `core-site.xml` files (which you’ll specify by using the `PD_HDFS_TIERING_CONF_DIR` environment variable).
- (Kerberized clusters) Be sure that you added the required environment variables—`PD_AGENT_PRINCIPAL` and `PD_AGENT_KEYTAB_LOCATION`—to the Pepperdata configuration during the installation process (Task 4. (Kerberized clusters) Enable Kerberos Authentication).
  - The `PD_AGENT_PRINCIPAL` is used only for authentication (not authorization). It does not need HDFS access.
  - It needs to be set only on the host(s) where the report is to run.
- Access times for HDFS must be enabled—that is, a non-zero value for the `dfs.namenode.accesstime.precision` property.
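For reference, access-time tracking is controlled in `hdfs-site.xml`. The stock Hadoop default is one hour (in milliseconds), but check your cluster’s actual setting rather than assuming the default:

```xml
<!-- hdfs-site.xml: access-time tracking. Any non-zero value satisfies this
     prerequisite; 3600000 ms (1 hour) is the stock Hadoop default.
     A value of 0 disables access times (and the report cannot classify files). -->
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>3600000</value>
</property>
```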
Procedure
If you are running the report on a non-NameNode host, perform the following steps on that host instead of on either NameNode host.
1. Add the reporting parameters to the Pepperdata system-wide configuration.

   1. Open the host’s Pepperdata site file, `pepperdata-site.xml`, for editing.

      By default, the Pepperdata site file, `pepperdata-site.xml`, is located in `/etc/pepperdata`. If you customized the location, the file is specified by the `PD_CONF_DIR` environment variable. See Change the Location of pepperdata-site.xml for details.

   2. Enable the HDFS Tiering Fetcher (to fetch HDFS metadata).

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.enabled</name>
        <value>true</value>
      </property>
      ```

   3. Configure the number of inodes—HDFS representations of files and directories—to read during every 15-second interval.

      The larger this value, the faster Pepperdata reads the inodes, and the more memory is needed to process the data (which is configured by the `PD_MEM_LIMIT` environment variable, below). The default value of `100000` (100K) typically provides a good balance.

      Be sure to substitute your value for the `your-inodes-setting` placeholder in the following code snippet.

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch</name>
        <value>your-inodes-setting</value>
      </property>
      ```

   4. Configure the location of the FsImage files—metadata about the file system namespace, including the mapping of blocks to files and file system properties.

      - The FsImage files are located in a `current` subdirectory of the location specified by the `dfs.namenode.name.dir` property in the `hdfs-site.xml` configuration file. For example, if the `dfs.namenode.name.dir` property value is `/bigdisk/hdfs/name/`, you would use `/bigdisk/hdfs/name/current` for the value of the Pepperdata `pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir` property.
      - If your `dfs.namenode.name.dir` property value is a comma-separated list of multiple locations, you can choose any of the locations (and add the `/current` subdirectory).

      Be sure to substitute your path for the `path-to-your-FsImage-files` placeholder in the following code snippet.

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir</name>
        <value>path-to-your-FsImage-files</value>
      </property>
      ```
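      If you are unsure of the configured value, one way to look it up is to query `hdfs-site.xml` directly. The following is a sketch that assumes `xmllint` is installed; it runs against a sample file here, so point `conf_file` at your real one (typically `/etc/hadoop/conf/hdfs-site.xml`):

      ```shell
      # Sample hdfs-site.xml for illustration; point conf_file at your real file.
      conf_file=/tmp/sample-hdfs-site.xml
      cat > "$conf_file" <<'EOF'
      <configuration>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/bigdisk/hdfs/name/,/backup/hdfs/name/</value>
        </property>
      </configuration>
      EOF

      # Extract the property; if it is a comma-separated list, take the first
      # entry, then append /current to get the namenodeFsimageDir value.
      name_dir=$(xmllint --xpath \
        "string(//property[name='dfs.namenode.name.dir']/value)" "$conf_file")
      first_dir=${name_dir%%,*}
      fsimage_dir="${first_dir%/}/current"
      echo "$fsimage_dir"   # /bigdisk/hdfs/name/current
      ```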
   5. Configure how frequently the Pepperdata metrics data file rolls over to a newly opened file. Together with the number of inodes read during every 15-second interval (which is configured by the `pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch` property, above), this controls the number of entries that Pepperdata processes from one file.

      - The interval is in minutes.
      - Default=5.

      ```xml
      <property>
        <name>pepperdata.metric.genericJsonFetch.rollingProtoLogInterval</name>
        <value>1</value>
      </property>
      ```
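      As a rough sizing check (a sketch based on the figures above, assuming one read every 15 seconds): with the default `maxInodesBatch` of 100000 and the 1-minute rollover shown above, each metrics file covers up to four reads:

      ```shell
      # Back-of-the-envelope: inodes per metrics file =
      #   (rollover window / read interval) * inodes per read.
      max_inodes_batch=100000   # ...hdfsTiering.maxInodesBatch
      rollover_minutes=1        # ...rollingProtoLogInterval
      reads_per_file=$(( rollover_minutes * 60 / 15 ))
      inodes_per_file=$(( reads_per_file * max_inodes_batch ))
      echo "$inodes_per_file"   # 400000
      ```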
   6. (To override the default scheduled read window for FsImage files and/or frequency of reads during the read window) Add your scheduled read window for FsImage files and/or your frequency of reads to the Pepperdata configuration.

      If the default scheduled read window and interval for FsImage reads are okay (once during the window of 00:00–23:00 on Saturdays), you do not need to add anything to the Pepperdata configuration, and you should skip this substep.

      - Be sure to substitute your cron schedule and read frequency for the `your-cron-schedule` (in Quartz Cron format) and `your-read-frequency` (in seconds) placeholders.
      - We recommend that you use as long a read window as possible, so that even if the PepAgent is restarted or a NameNode failover occurs, the PepAgent can still begin the read during the read window.

        Given the nature of the data—cold files have not been accessed in more than 90 days, and even warm files were last accessed 30–90 days ago—the numbers of hot, warm, and cold files change very infrequently, so there is no advantage to constantly reading “newer” data.

        Tip: To generate or validate your Quartz Cron expression, use the Cron Expression Generator & Explainer - Quartz. For example, to change the read day from Saturdays to Tuesdays, the cron schedule would be `* * 0-23 ? * TUE`.
      - We recommend that the read interval correspond to the length of the read window, so that only a single read is performed during the window; for example, if the window is one day, set the interval to `86400` (seconds). The FsImage files can be very large (upwards of 50 GB), and storing multiple files—which are ignored because the report generation uses only the most recent file—wastes storage resources and imposes large loads on the NameNode.

      ```xml
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.cronSchedule</name>
        <value>your-cron-schedule</value>
      </property>
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.readIntervalSecs</name>
        <value>your-read-frequency</value>
      </property>
      ```
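      For example, combining the Tuesday schedule and one-day interval from the recommendations above, the filled-in properties would be:

      ```xml
      <!-- Read FsImage files on Tuesdays, anytime during 00:00-23:00,
           at most once per day (86400 seconds). -->
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.cronSchedule</name>
        <value>* * 0-23 ? * TUE</value>
      </property>
      <property>
        <name>pepperdata.agent.genericJsonFetch.hdfsTiering.readIntervalSecs</name>
        <value>86400</value>
      </property>
      ```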
   7. Be sure that the XML properties that you added are correctly formatted.

      Malformed XML files can cause operational errors that can be difficult to debug. To prevent such errors, we recommend that you use a linter, such as `xmllint`, after you edit any .xml configuration file.
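      A quick well-formedness check with `xmllint --noout` prints nothing and exits 0 when the file is valid XML (a sketch; `xmllint` ships with libxml2 and may need to be installed separately, and the sample file here stands in for your real `pepperdata-site.xml`):

      ```shell
      # In practice, run:
      #   xmllint --noout /etc/pepperdata/pepperdata-site.xml
      # (adjust the path if you use a custom PD_CONF_DIR).
      # Demonstrated here on a sample fragment:
      cat > /tmp/pd-sample.xml <<'EOF'
      <configuration>
        <property>
          <name>pepperdata.agent.genericJsonFetch.hdfsTiering.enabled</name>
          <value>true</value>
        </property>
      </configuration>
      EOF
      xmllint --noout /tmp/pd-sample.xml && echo "XML OK"
      ```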
   8. Save your changes and close the file.
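   Putting the substeps together, if `pepperdata-site.xml` contained only these settings, the result might look like the following sketch (illustrative values; the last two properties are needed only if you override the default read schedule):

   ```xml
   <configuration>
     <!-- Enable the HDFS Tiering Fetcher. -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.enabled</name>
       <value>true</value>
     </property>
     <!-- Inodes to read per 15-second interval (default shown). -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch</name>
       <value>100000</value>
     </property>
     <!-- FsImage directory: dfs.namenode.name.dir plus /current. -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.namenodeFsimageDir</name>
       <value>/bigdisk/hdfs/name/current</value>
     </property>
     <!-- Metrics file rollover, in minutes. -->
     <property>
       <name>pepperdata.metric.genericJsonFetch.rollingProtoLogInterval</name>
       <value>1</value>
     </property>
     <!-- Optional: override the default read window and frequency. -->
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.cronSchedule</name>
       <value>* * 0-23 ? * TUE</value>
     </property>
     <property>
       <name>pepperdata.agent.genericJsonFetch.hdfsTiering.readIntervalSecs</name>
       <value>86400</value>
     </property>
   </configuration>
   ```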
2. Configure the reporting for the PepAgent.

   1. Open the Pepperdata configuration file, `/etc/pepperdata/pepperdata-config.sh`, for editing.

   2. Add the environment variables in the following format. The snippet uses typical values. Be sure to replace them with values appropriate for your environment.

      - `PD_HDFS_TIERING_CONF_DIR`—Name of the Hadoop configuration directory that contains the `hdfs-site.xml` and `core-site.xml` files; typically `/etc/hadoop/conf/`.
      - `PD_MEM_LIMIT`—(Default=`128m`) Amount of memory to provide for processing the inodes, from `64m` to `2g` (inclusive). The greater the number of inodes read during every 15-second interval (which is configured by the `pepperdata.agent.genericJsonFetch.hdfsTiering.maxInodesBatch` property, above), the larger the `PD_MEM_LIMIT` must be.

        The value must be formatted as a number immediately followed, without any intervening spaces, by one of the following unit designations: `m` (MB) or `g` (GB).

        For running the HDFS Data Temperature Report on non-NameNode hosts, we recommend setting this to `256m`.
      - `PD_HDFS_TIERING_RUN_ON_HOST`—(Only for running the report on a non-NameNode host; no default) Canonical name of the host where the report is to run; for example, `ip-192-168-1-1.internal`. Be sure to substitute your fully qualified, canonical hostname for the `YOUR.CANONICAL.HOSTNAME` placeholder in the following code snippet.

      ```shell
      export PD_LOAD_HDFS_INODE_JARS=1
      export PD_HDFS_TIERING_CONF_DIR=/etc/hadoop/conf/
      export PD_MEM_LIMIT=128m
      # To run the report on a non-NameNode host, uncomment and assign the following value:
      # export PD_HDFS_TIERING_RUN_ON_HOST=YOUR.CANONICAL.HOSTNAME
      ```

   3. Save your changes and close the file.

3. For running the report on the NameNode hosts, repeat steps 1–2 on the second NameNode host in your cluster.
4. On every host in the cluster, restart the PepCollector and PepAgent services.

   1. Restart the Pepperdata Collector. You can use either the `service` command (if provided by your OS) or `systemctl`:

      ```shell
      sudo service pepcollectd restart
      ```

      ```shell
      sudo systemctl restart pepcollectd
      ```

   2. Restart the PepAgent. You can use either the `service` command (if provided by your OS) or `systemctl`:

      ```shell
      sudo service pepagentd restart
      ```

      ```shell
      sudo systemctl restart pepagentd
      ```