Assigning Workflow Ids for Grouping and Chargeback Reporting
A workflow Id identifies all the applications/jobs that function together for a single purpose. Grouping or filtering metrics by workflow Id enables chargeback reporting, filtering charts, and viewing resource consumption by workflow.
Uses for Workflow Ids
Although workflow Ids are primarily used to enable chargeback reporting, they serve a variety of purposes:
- (YARN-only) Creating the Chargeback report—the cost of using the cluster's resources, apportioned to each cluster user.
- (YARN-only) Series breakdowns in charts; see Filter the Charts & Tables by Dimensions: Hosts, Users, Etc.
- Grouping data in the Workflows Overview; see Application Spotlight Overviews & Reports.
Pepperdata Workflow Id: YARN Clusters
In YARN clusters, the Oozie and Hive workflow schedulers automatically assign their own workflow Ids, but to enable Pepperdata workflow-related functionality, you must manually configure a Pepperdata workflow Id—for Oozie jobs, the oozie.job.id key; for other jobs, the pepperdata.workflow.id key.
Procedure
- Assign the appropriate Pepperdata workflow Id key—pepperdata.workflow.id or oozie.job.id, depending on the job's type—and the value you want to use, in the application configuration.
  How to assign the Pepperdata workflow Id depends on your particular environment. The examples show the most broadly applicable method—specifying the parameter and its value in the command-line invocation that runs the app. Check with your system administrator to determine whether there is a custom app/job manager or other framework or method for overriding default configuration settings.
Examples
- For Oozie jobs
  - Submitted through YARN:
    yarn jar -Doozie.job.id=group01-522 my_application.jar <myoptions>
  - Submitted through Spark:
    spark-submit --conf spark.hadoop.oozie.job.id=group01-522 --class com.company.application.MainClass my_application.jar
For all other types of jobs
-
Submitted through YARN:
yarn jar -Dpepperdata.workflow.id=group01-522 my_application.jar <myoptions>
-
Submitted through Spark:
spark-submit --conf spark.hadoop.pepperdata.workflow.id=group01-522 --class com.company.application.MainClass my_application.jar
-
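If your jobs run through Hive, the same idea applies: pass the key as a Hive configuration property when you start the session. This is a minimal sketch, assuming your environment forwards Hive configuration properties to the YARN applications it launches (the my_query.hql file name is a placeholder); confirm the behavior with your system administrator.

  # Sketch: passes pepperdata.workflow.id as a Hive configuration property;
  # whether it reaches the launched YARN applications depends on your environment.
  hive --hiveconf pepperdata.workflow.id=group01-522 -f my_query.hql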
Pepperdata Workflow Id: Kubernetes Clusters
In Kubernetes clusters, Pepperdata associates a workflow with a Spark application by using the DAG (Directed Acyclic Graph) name (dag_name) and task name (task_name) labels, which you configure in the Pepperdata dashboard, for the driver Pod and executor Pods.
Procedure
- Add the applicable Spark properties for labels—dag_name and task_name—to all Pepperdata-monitored Spark applications.
  - Add the properties to the same <spark-job>.yaml files that you configured for Pepperdata instrumentation (see Activate Pepperdata for Spark Applications).
  - For a given app, the dag_name values must be the same for the driver and executor properties. Likewise, the task_name values must be the same for a given app's driver and executor properties.
  - Be sure to replace the your-* placeholder names with the actual names.
# For Spark applications:
"spark.kubernetes.driver.label.dag_name": "your-dag-name"
"spark.kubernetes.executor.label.dag_name": "your-dag-name"
"spark.kubernetes.driver.label.task_name": "your-task-name"
"spark.kubernetes.executor.label.task_name": "your-task-name"
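If you submit a Spark application from the command line rather than through a YAML spec, the same label properties can be supplied with --conf flags. This is a minimal sketch that mirrors the settings above, in the style of the earlier spark-submit examples; the class and jar names are placeholders.

  spark-submit \
    --conf spark.kubernetes.driver.label.dag_name=your-dag-name \
    --conf spark.kubernetes.executor.label.dag_name=your-dag-name \
    --conf spark.kubernetes.driver.label.task_name=your-task-name \
    --conf spark.kubernetes.executor.label.task_name=your-task-name \
    --class com.company.application.MainClass my_application.jar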
What to do next
- Associate the dag_name and task_name label attributes with specific applications; see Configure Labels.