Designing InfoSphere DataStage jobs for optimum lineage
Design your IBM® InfoSphere® DataStage® jobs to ensure that data lineage accurately displays and reports the flow of information through the jobs.
When a job is developed, information that is included in the job is called design metadata. Lineage that is based on design definitions and default parameter values is called design lineage. Design metadata includes information such as the following items:
- Name of the database server or the data connection
- Name of the database schema
- User-defined SQL statements
- Name and location of the data file
Lineage that is based on job run metadata is called operational lineage.
Use the actions in the following tables to ensure that your job design provides complete metadata for the best design and operational lineage results.
Action | Importance | Description | How this action affects lineage | Additional information |
---|---|---|---|---|
When you move jobs to another project or environment, always include the job designs and any shared containers that they use | Essential | Operational metadata alone is insufficient to track the flow of data in a job run. The job design is also needed for operational lineage and design lineage. Therefore, the job design and the job executable must be kept in sync. | Both the job design and the job executable are needed for lineage to work on the target production system. Otherwise, design lineage and operational lineage do not display data flow accurately. | |
Use supported stage types to access data sources | Essential | Stages provide metadata about the job design. The metadata is stored in the metadata repository of IBM InfoSphere Information Server and can be accessed by InfoSphere Information Governance Catalog for lineage analysis. | With this metadata, InfoSphere Information Governance Catalog can infer the database or data file assets that the job reads from or writes to. | For a list of job stages that support lineage, see this technote. For a list of all job stages with their descriptions, see Alphabetical list of stages. Whether a particular stage is displayed on the InfoSphere DataStage Designer client palette depends on the type of job and the installed products and add-ons. |
Recompile all jobs before running them if their job design was changed in a way that impacts data flows | Essential | This step applies to jobs whose runs produce operational lineage. | If a job design was changed in a way that impacts data flows, but the job was not recompiled, the mismatch between the updated job design and the outdated job executable might give erroneous data flow results, no results, or run-time errors. | |
When you import a data file, ensure that its name and path are defined in the same way that they are defined in the stage | Essential | The name and path of the imported or shared data file must match the name and directory path in the stage. If several such data files exist, the data file on the engine's host is chosen. If no data file is on the engine's host, the file stage is connected to an arbitrary data file with the same name and path. For example, in the Details page of the data file FullName.txt, its path is /opt/SimpleFiles. The slashes of the file path must be the same as in the Parameters tab of the Job Properties window for the data file FullName.txt in InfoSphere DataStage. | If the name or directory path is not the same as it is in the stage, the data file and the stage cannot be linked correctly in the job data flow. As a result, the lineage is incorrect or incomplete. If no data file with the same name and path was imported at the time of job analysis, the job analysis creates a virtual data file asset with the engine host name. The virtual data file shows the name and path from the job design or job run that lineage looks for in imported data files as it tries to find a match. If you import the data file with the same name and path after job analysis, and its host is the same as the engine host, the data file and the file stage are linked automatically. | If you import a data file with the same name and path as the file stage only after job analysis, and its host differs from the engine host, the data file and the file stage are not linked even though the name and path are the same. You must click Detect Lineage Relationships for the job to connect the stage to the newly imported data file. This scenario is one of the few cases where this manual step is needed. |
In RDBMS stages, ensure that either the schema name or the user name for database tables is defined | Essential | The stage design must qualify the database table with either the schema name or the user name. In the MSSQL stage, the default schema name is dbo. | If neither of these names is defined, the database table remains a virtual asset, even if you do data connection mapping. The schema is displayed as [missing_user]. | |
If you need to keep the SQL statements flexible, use parameters whose values are SQL statements rather than external files that contain the SQL statements | Essential | SQL statements from external files do not go to the metadata repository where they can be used for job flow analysis. These SQL statements are also not included in operational metadata XML files. | | |
When you import a sequencer job, make sure that the jobs that it invokes are in the same .dsx file, or were previously imported. | Recommended; otherwise, you must recompile the sequencer job before it is reflected in lineage reports | Sequence flows are detected during analysis of the sequencer job. If the sequence flow detection happens before the invoked jobs are imported, the sequence flows cannot be linked because the jobs are still missing from the metadata repository. | When you import the previously missing jobs, InfoSphere DataStage does not yet link them to their sequencers in the metadata repository. However, after you compile the sequencer, InfoSphere DataStage explicitly makes the links and updates the sequencer's last-modified date. As a result, the updated date triggers job analysis in InfoSphere Information Governance Catalog. | If the invoked jobs were imported only after the sequencer job, recompile the sequencer job after that import. This compilation links the sequencer job with the invoked jobs, both in View pages and in lineage reports. In all cases, for lineage reports to reflect these connections, make sure to check Impacted Assets in the Lineage Filter dialog. For a description of impact analysis in lineage reporting, see Data lineage and business lineage reports. |
Use the generated SQL statements rather than SQL that you specify, where possible | Recommended | In InfoSphere Information Governance Catalog, the schema and database table name of the imported database must be the same as the schema and database table name in the stage. You can generate default SQL statements to read from and write to data sources. Alternatively, you can specify SQL statements manually that read from and write to data sources. | SQL that you specify might contain certain DBMS-vendor-specific constructs that cannot be parsed correctly in job lineage analysis. As a result, the relationships between stages and data sources might be incorrect. | For information about user-defined SQL in InfoSphere DataStage, see User-defined SQL. For information about job design considerations and SQL, see Job design considerations. |
If you must use custom SQL rather than generated SQL, use standards-based SQL rather than PL/SQL or other procedural extensions | Essential | Job flow analysis in InfoSphere Information Governance Catalog parses most standard SQL statements, even when they contain vendor-specific constructs such as SEL instead of SELECT. However, procedural language extensions such as PL/SQL are not parsed. | Because PL/SQL is not parsed in job flow analysis, the relationships between stages and data sources will be incorrect. | |
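The data file matching rule described in the table above (exact name and path match, preference for the engine host, and a virtual asset as the fallback) can be sketched as the following illustrative Python. This is not the actual InfoSphere Information Governance Catalog implementation; the class and function names are hypothetical and exist only to make the rule concrete.

```python
# Illustrative sketch of the data-file matching rule for file stages.
# NOT actual InfoSphere Information Governance Catalog code; the data
# structures and names here are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class DataFileAsset:
    name: str   # for example, "FullName.txt"
    path: str   # for example, "/opt/SimpleFiles"; slash direction must match the stage
    host: str   # host from which the file was imported


def match_data_file(stage_name: str, stage_path: str, engine_host: str,
                    imported: List[DataFileAsset]) -> DataFileAsset:
    # Only exact name and path matches qualify; slashes are not normalized.
    candidates = [f for f in imported
                  if f.name == stage_name and f.path == stage_path]
    if not candidates:
        # No match: a virtual data file asset is created with the engine host name.
        return DataFileAsset(stage_name, stage_path, engine_host)
    # Prefer the data file whose host is the engine host.
    for f in candidates:
        if f.host == engine_host:
            return f
    # Otherwise an arbitrary candidate with the same name and path is chosen.
    return candidates[0]
```

Because the comparison is an exact string match, a stage path that uses backslashes never matches an imported path that uses forward slashes, which is why the table stresses that the slashes must be the same.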
Action | Description | How this action affects lineage | Additional information |
---|---|---|---|
Use environment variables and job parameters | Wherever possible, use environment variables, parameters, and parameter sets as common references across all jobs in a project. | The use of environment variables reduces errors, helps to ensure correct job metadata, and promotes reuse in job development. | For more information about how to set up job parameters and parameter sets, see Making your jobs adaptable. For general information about setting environment variables, see Guide to setting environment variables. For general information about environment variables, see Environment variables. |
Import project-level default parameters as environment variables | Before you include jobs, or a whole project, for lineage, you must import the project-level environment variables that you defined in InfoSphere DataStage into InfoSphere Information Governance Catalog. | InfoSphere Information Governance Catalog uses the default values of project-level parameters that are stored in environment variables to link the job to the sources that the job references. | For information about how to import project-level environment variables, see Import project-level environment variables. |
Check that parameters that are used in data source identity definitions of stages have realistic default values. This action applies to job parameters and to environment variables that are used as project-level parameters. | Parameters that are used in stage definitions should have default values. To list the environment variables that are defined for the project, use the dsadmin utility. | Ideally, all parameters that are used in stage definitions have realistic default values. Realistic values enable lineage reports to link jobs to the actual data sources in the metadata repository. When default values are missing (for example, because of job design policies), lineage reports display the jobs as linked to virtual data sources. These virtual assets are named by using the parameter name enclosed in pound signs (#), for example, #some_param_name#. Design lineage through jobs and virtual assets can still be meaningful if you take additional actions. | For information about how to run the dsadmin utility, see Listing environment variables. For information about job parameters, see Job parameters. |
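The virtual asset naming described above can be illustrated with a short Python sketch: parameter references in a data source identity are resolved from their default values, and any parameter without a default stays as `#param_name#`, which is the name the virtual asset receives. This is a hypothetical illustration of the behavior, not code from the product.

```python
# Illustrative sketch of parameter resolution in data source identities.
# Parameters with no realistic default remain as #param#, which is how
# virtual data sources are named in lineage reports. Hypothetical code.

import re
from typing import Dict


def resolve_data_source(template: str, defaults: Dict[str, str]) -> str:
    """Replace #param# references with their default values.

    Any parameter that has no default (or an empty default) is left as
    #param#, mirroring how lineage displays virtual data sources.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        value = defaults.get(name)
        return value if value else match.group(0)  # keep #name# if no default

    return re.sub(r"#([A-Za-z_][A-Za-z0-9_]*)#", substitute, template)
```

For example, with a realistic default such as `{"DB_SERVER": "prodserver"}`, the identity `#DB_SERVER#/SALES` resolves to a real data source name; with no default, it stays `#DB_SERVER#/SALES`, and lineage links the job to a virtual asset of that name.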