Designing InfoSphere DataStage jobs for optimum lineage

Design your IBM® InfoSphere® DataStage® jobs to ensure that data lineage accurately displays and reports the flow of information through the jobs.

When a job is developed, information that is included in the job is called design metadata. Lineage that is based on design definitions and default parameter values is called design lineage.

When you design a job, you build the data flow from a source to a target. The data output of one job can be the data input for another job. In this case, the data source is shared between the two jobs. InfoSphere Information Governance Catalog uses design metadata to build lineage reports that analyze the flow of data from source to target. The lineage analysis identifies relationships between information assets and stages. This metadata includes the following information:
  • Name of the database server or the data connection
  • Name of the database schema
  • User-defined SQL statements
  • Name and location of the data file

Lineage that is based on job run metadata is called operational lineage.

Use the following tables of actions to ensure that your job design provides complete metadata for the best design and operational lineage results.

Table 1. Actions to ensure complete lineage metadata for design and operational data lineage
Action Importance Description How this action affects lineage Additional information
When you move jobs to another project or environment, always include the job designs and any shared containers that they use Essential Operational metadata alone is insufficient to track the flow of data in a job run; the job design is also needed for both operational lineage and design lineage. Therefore, the job design and the job executable must be kept in sync. Both the job design and the job executable are needed for lineage to work on the target production system. Otherwise, design lineage and operational lineage do not display data flow accurately.  
Use supported stage types to access data sources Essential

Stages provide metadata about the job design. The metadata is stored in the metadata repository of IBM InfoSphere Information Server and can be accessed by InfoSphere Information Governance Catalog for lineage analysis.

With this metadata, InfoSphere Information Governance Catalog can infer the database or data file assets that the job reads from or writes to.

For a list of job stages that support lineage, see this technote.

For a list of all job stages with their description, see Alphabetical list of stages. Whether a particular stage is displayed on the InfoSphere DataStage Designer client palette depends on the type of job and the installed products and add-ons.

Recompile all jobs before running them if their job design was changed in a way that impacts data flows Essential This step applies to jobs whose runs produce operational lineage:
  • Gathering operational metadata was enabled
  • The parameter Include for Lineage is set to True for the job
You must recompile jobs whenever the design changes:
  • The job structure changes when you add, remove, or modify stages, links, or stage columns
  • Data source names change
  • Used shared containers change when you add, remove, or modify stages, links, or stage columns or when data source names change
If a job design was changed in a way that impacts data flows, but the job was not recompiled, the updated job design and the outdated job executable might give erroneous data flow results, no results, or run-time errors.  
When you import a data file, ensure that its name and path are defined in the same way that they are defined in the stage Essential

The name and path of the imported or shared data file must match the name and directory path in the stage. If several such data files exist, the data file with the engine's host is chosen. If no data file has the engine's host, the file stage is connected to an arbitrary data file with the same name and path.

For example, in the Details page of the data file FullName.txt, its path is /opt/SimpleFiles. The slashes of the file path must be the same as in the Parameters tab of the Job Properties window of the data file FullName.txt that is in InfoSphere DataStage.

If the name or directory path is not the same as it is in the stage, the data file and stage cannot be linked correctly in the job data flow. As a result, the lineage is incorrect or incomplete.

If no data file with the same name and path was imported at the time of job analysis, the job analysis creates a virtual data file asset with the engine host name. The virtual data file shows the name and path from the job design or job run, which lineage looks for in imported data files as it tries to find a match.

If you import the data file with the same name and path after job analysis, and its host is the same as the engine host, the data file and file stage are linked automatically.

If you import the data file with the same name and path as the file stage only after job analysis, and its host differs from the engine host, the data file and file stage are not linked even though the name and path match. You must click Detect Lineage Relationships for the job to connect the stage to the newly imported data file. This scenario is one of the few cases where this manual step is needed.

In RDBMS stages, ensure that either the schema name or the user name for database tables is defined Essential The stage design must qualify the database table with either the schema name or the user name. In the MSSQL stage, the default schema name is dbo. If neither of these names is defined, the database table remains a virtual asset, even if you do data connection mapping. The schema is displayed as [missing_user].  
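For example, generated or user-defined SQL in an RDBMS stage might qualify the table as sketched below. The schema and table names are illustrative, not taken from any particular installation:

```sql
-- Schema-qualified table reference: lineage analysis can match this
-- table to the imported database asset (names are illustrative).
SELECT CUSTOMER_ID, FULL_NAME
FROM DWH.CUSTOMERS;

-- Unqualified reference: with no schema or user name, the table
-- remains a virtual asset and the schema is shown as [missing_user].
SELECT CUSTOMER_ID, FULL_NAME
FROM CUSTOMERS;
```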
If you need to keep the SQL statements flexible, use parameters whose values are SQL statements rather than external files that contain the SQL statements Essential   SQL statements from external files do not go to the metadata repository where they can be used for job flow analysis. These SQL statements are also not included in operational metadata XML files.  
When you import a sequencer job, make sure that the jobs that it invokes are in the same dsx file, or were imported previously. Recommended; otherwise, you must recompile the sequencer job before it is reflected in lineage reports Sequence flows are detected during analysis of the sequencer job. If the sequence flow detection happens before the actual jobs are imported, the sequence flows cannot be linked because the jobs are still missing from the metadata repository. When you import the previously missing jobs, InfoSphere DataStage does not yet link them to their sequencers in the metadata repository. However, after you compile the sequencer, InfoSphere DataStage explicitly makes the links and updates the sequencer's last-modified date. As a result, the updated date triggers job analysis in InfoSphere Information Governance Catalog.

If the invoked jobs were imported only after the sequencer job, then recompile the sequencer job after that import. This compilation will link the sequencer job with the invoked jobs, both in View pages and in lineage reports. In all cases, for lineage reports to reflect these connections, make sure to check Impacted Assets in the Lineage Filter dialog.

For a description of impact analysis in lineage reporting, see Data lineage and business lineage reports.

Use the generated SQL statements rather than SQL that you specify, where possible Recommended In InfoSphere Information Governance Catalog, the schema and database table name of the imported database must be the same as the schema and database table name in the stage. You can generate default SQL statements to read from and write to data sources. Alternatively, you can manually specify SQL statements that read from and write to data sources. SQL that you specify might contain certain DBMS-vendor-specific constructs that cannot be parsed correctly in job lineage analysis. As a result, the relationships between stages and data sources might be incorrect.

For information about user-defined SQL in InfoSphere DataStage, see User-defined SQL.

For information about job design considerations and SQL, see Job design considerations.

If you must use custom SQL rather than generated SQL, then at least use standards-based SQL rather than PL/SQL or other procedural extensions Essential Job flow analysis in InfoSphere Information Governance Catalog parses most standard SQL statements, even when they contain vendor-specific constructs such as SEL instead of SELECT. However, procedural language extensions such as PL/SQL are not parsed in job flow analysis. As a result, the relationships between stages and data sources will be incorrect.  
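The contrast can be sketched as follows. Both fragments use illustrative table names; the SEL keyword is the Teradata shorthand for SELECT that the text mentions:

```sql
-- Parsed by job flow analysis: a standard statement, even with a
-- vendor-specific keyword such as SEL (illustrative names).
SEL ORDER_ID, AMOUNT FROM SALES.ORDERS;

-- Not parsed: a PL/SQL block, so the stage-to-table relationship
-- cannot be derived from it in lineage analysis.
BEGIN
  INSERT INTO SALES.ORDER_AUDIT
  SELECT ORDER_ID, AMOUNT FROM SALES.ORDERS;
END;
```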

Table 2. Actions to ensure complete lineage metadata for design lineage only
Action Description How this action affects lineage Additional information
Use environment variables and job parameters

Wherever possible, use environment variables, parameters, and parameter sets as common references across all jobs in a project.

The use of environment variables reduces error, helps to ensure correct job metadata, and promotes data reuse in job development.

For more information about how to set up job parameters and parameter sets, see Making your jobs adaptable.

For general information about setting environment variables, see Guide to setting environment variables.

For general information about environment variables, see Environment variables.

Import project-level default parameters as environment variables Before you include jobs, or a whole project, for lineage, you must import the project-level environment variables that you defined in InfoSphere DataStage into InfoSphere Information Governance Catalog. InfoSphere Information Governance Catalog uses the default values of project-level parameters that are stored in environment variables to link the job to the sources that the job references. For information about how to import project-level environment variables, see Import project-level environment variables.
Check that parameters that are used in data source identity definitions of stages have realistic default values. This action applies to job parameters and to environment variables that are used as project-level parameters.

Parameters that are used in stage definitions should have default values.

To list the environment variables that are defined for the project, use the dsadmin utility.

Ideally, all parameters that are used in stage definitions have realistic default values. Realistic values enable lineage reports to link jobs to the actual data sources in the metadata repository.

When default values are missing (for example, because of job design policies), lineage reports display the jobs as linked to virtual data sources. These virtual assets are named by using the parameter name enclosed in pound signs (#). An example might be #some_param_name#. Design lineage through jobs and virtual assets can still be meaningful if you take the following actions:
Use the same parameters for the same data sources in all jobs
Otherwise, design lineage does not recognize certain data sources as identical, and lineage flow might stop.
Use different parameter names for different data sources
Otherwise, design lineage can regard distinct data sources as identical and, as a result, display too many flows.
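The two naming rules above can be sketched with parameter references in stage SQL. DataStage resolves #ParamName# references in stage properties; the parameter, schema, and table names here are illustrative:

```sql
-- Two jobs that read the same source should reference it through the
-- same parameter, so lineage treats the virtual asset as one source:
SELECT * FROM #SRC_SCHEMA#.CUSTOMERS;   -- job A
SELECT * FROM #SRC_SCHEMA#.CUSTOMERS;   -- job B

-- Distinct sources need distinct parameter names; reusing one name
-- for different databases would merge their flows in lineage:
SELECT * FROM #CRM_SCHEMA#.CONTACTS;
SELECT * FROM #ERP_SCHEMA#.VENDORS;
```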

For information about how to run the dsadmin utility, see Listing environment variables.

For information about job parameters, see Job parameters.