IBM InfoSphere Streams Version 4.1.0
Developing and running applications that use the HDFS Toolkit
To create applications that use the HDFS Toolkit, you must configure either Streams Studio or the SPL compiler to be aware of the location of the toolkit.
Before you begin
- Install IBM® InfoSphere® Streams. Configure the product environment variables by entering the following command:
source product-installation-root-directory/4.0.0.0/bin/streamsprofile.sh
- Install a supported version of Hadoop.
- Ensure that InfoSphere Streams has access to Hadoop libraries and configuration files to allow streams processing applications to read and write to HDFS.
About this task
After the location of the toolkit is communicated to the compiler, the SPL artifacts that are specified in the toolkit can be used by an application. The application can include a use directive to bring the necessary namespaces into scope. Alternatively, you can fully qualify the operators that are provided by the toolkit with their namespaces as prefixes.
Procedure
- If InfoSphere Streams has access to the location where Hadoop is installed, set the following environment variables:
- For Apache HDFS, Cloudera, or Hortonworks Data Platform:
- Set HADOOP_HOME to Hadoop_Install_Directory. For example, /usr/lib/hadoop.
- Set JAVA_HOME to the location where Java™ is installed.
- For IBM InfoSphere BigInsights® 3.x:
- Set BIGINSIGHTS_HOME to BigInsights_Install_Directory. For example, /opt/ibm/biginsights.
- Set HADOOP_HOME to BigInsights_Install_Directory/IHC. For example, /opt/ibm/biginsights/IHC.
- Set JAVA_HOME to the location where Java is installed.
- For IBM InfoSphere BigInsights 4.x:
- Set HADOOP_HOME to BigInsights_Install_Directory/hadoop. For example, /usr/iop/4.0.0.0/hadoop.
- Set JAVA_HOME to the location where Java is installed.
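The settings above can be sketched as a short shell fragment for the Apache HDFS, Cloudera, or Hortonworks case. The Java path and the check_dir helper are illustrative assumptions, not part of the toolkit; for BigInsights 3.x, BIGINSIGHTS_HOME would be exported in the same way.

```shell
#!/bin/sh
# Example values only: adjust both paths to match the actual installation.
export HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java}"   # hypothetical Java location

# check_dir is a hypothetical helper: it reports whether a variable
# points at an existing directory, without stopping the script.
check_dir() {
  name=$1
  value=$2
  if [ -d "$value" ]; then
    echo "$name=$value (ok)"
  else
    echo "$name=$value (directory not found)"
  fi
}

check_dir HADOOP_HOME "$HADOOP_HOME"
check_dir JAVA_HOME "$JAVA_HOME"
```

Running the fragment before you compile an application gives an early hint when one of the variables points at a missing directory.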
- If InfoSphere Streams does not have access to the location where Hadoop is installed, copy the Hadoop library files to a location that is accessible to InfoSphere Streams and set the appropriate environment variables. The following list describes the Hadoop library files to copy:
- For Apache HDFS, Cloudera, or Hortonworks Data Platform:
- Copy /usr/lib/hadoop to the InfoSphere Streams cluster and place it in a directory that is accessible to InfoSphere Streams. Note: When you copy the directories, ensure that symbolic links are dereferenced; otherwise, the directory that contains the core-site.xml file might not be copied to the InfoSphere Streams cluster. On Linux, you can dereference symbolic links by using the -L flag of the cp command. For example:
cp -Lr /usr/lib/hadoop /usr/lib/hadoop-hdfs /path-on-cluster
- Copy /usr/lib/hadoop-hdfs to the InfoSphere Streams cluster and place it in a directory that is accessible to InfoSphere Streams.
- For IBM InfoSphere BigInsights 3.x:
- Copy BigInsights_Install_Directory/IHC to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory/IHC.
- Copy the BigInsights_Install_Directory/hadoop-conf directory to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory/hadoop-conf.
- For IBM InfoSphere BigInsights 4.x:
- Copy BigInsights_Install_Directory/hadoop to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory/hadoop.
- Copy the BigInsights_Install_Directory/hadoop-hdfs directory to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory/hadoop-hdfs.
- For IBM InfoSphere BigInsights 3.x installed on GPFS™:
- Important: If IBM InfoSphere BigInsights is installed on GPFS, you do not need to install InfoSphere Streams on an IBM InfoSphere BigInsights data node. Use the webhdfs://hdfshost:webhdfsport scheme in the URI that you use to connect to GPFS.
- Copy BigInsights_Install_Directory/IHC to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory/IHC.
- Copy the BigInsights_Install_Directory/hadoop-conf directory to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory/hadoop-conf.
- Copy BigInsights_Install_Directory/lib/biginsights-gpfs.jar to the InfoSphere Streams cluster and place it under a directory that is accessible to InfoSphere Streams. For example, /home/Streams/BigInsights_Install_Directory.
- For IBM InfoSphere BigInsights 4.0.0.0 installed on GPFS:
- The com.ibm.streamsx.hdfs toolkit does not support remote connections to a BigInsights 4.0.0.0 GPFS cluster.
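The note about dereferencing symbolic links can be tried out safely in a scratch directory. The following sketch builds a hypothetical layout in which the conf directory of a Hadoop installation is a symbolic link, copies it with cp -Lr, and checks that core-site.xml arrived as a regular file in a real directory. All paths are created under a temporary directory and are illustrative only.

```shell
#!/bin/sh
# Hypothetical scratch layout that mimics a Hadoop install whose conf
# directory is a symbolic link (paths are illustrative only).
work=$(mktemp -d)
mkdir -p "$work/etc-hadoop-conf" "$work/usr/lib/hadoop" "$work/path-on-cluster"
echo '<configuration/>' > "$work/etc-hadoop-conf/core-site.xml"
ln -s "$work/etc-hadoop-conf" "$work/usr/lib/hadoop/conf"

# -L follows the link, so the copy holds a real conf directory.
cp -Lr "$work/usr/lib/hadoop" "$work/path-on-cluster"

copied_conf="$work/path-on-cluster/hadoop/conf"
if [ -f "$copied_conf/core-site.xml" ] && [ ! -L "$copied_conf" ]; then
  copy_status="dereferenced"
else
  copy_status="link-only"
fi
echo "$copy_status"
```

The same check (a regular core-site.xml file under the copied directory) is a reasonable thing to run after a real copy to the InfoSphere Streams cluster.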
The following list describes the environment variables to set:
- For Apache HDFS, Cloudera, or Hortonworks Data Platform:
- Set HADOOP_HOME to /home/Streams/hadoop.
- Set JAVA_HOME to the location where Java is installed.
- For IBM InfoSphere BigInsights 3.x:
- Set HADOOP_HOME to /home/Streams/biginsights/IHC.
- Set BIGINSIGHTS_HOME to /home/Streams/biginsights.
- Set JAVA_HOME to the location where Java is installed.
- For IBM InfoSphere BigInsights 4.x:
- Set HADOOP_HOME to /home/Streams/biginsights/hadoop.
- Set JAVA_HOME to the location where Java is installed.
- For IBM InfoSphere BigInsights 3.x installed on GPFS:
- Set HADOOP_HOME to /opt/ibm/biginsights/IHC/.
- Set BIGINSIGHTS_HOME to /opt/ibm/biginsights.
- Set JAVA_HOME to the location where Java is installed.
- Configure the SPL compiler to find the toolkit root directory. Use one of the following methods:
- Set the STREAMS_SPLPATH environment variable to the root directory of a toolkit or multiple toolkits (with : as a separator). For example:
export STREAMS_SPLPATH=$STREAMS_INSTALL/toolkits/com.ibm.streamsx.hdfs
- Specify the -t or --spl-path command parameter when you run the sc command. For example:
sc -t $STREAMS_INSTALL/toolkits/com.ibm.streamsx.hdfs -M MyMain
where MyMain is the name of the SPL main composite. Note: These command parameters override the STREAMS_SPLPATH environment variable.
- Add the toolkit location in InfoSphere Streams Studio.
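When more than one toolkit is needed, the STREAMS_SPLPATH value is a colon-separated list. The following sketch shows one way to assemble it. The fallback install root and the second toolkit path are placeholders invented for the example, not real locations.

```shell
#!/bin/sh
# Placeholder install root, used only when STREAMS_INSTALL is not already set.
STREAMS_INSTALL="${STREAMS_INSTALL:-/path/to/streams}"

hdfs_toolkit="$STREAMS_INSTALL/toolkits/com.ibm.streamsx.hdfs"
# Hypothetical second toolkit, included to show the ':' separator.
other_toolkit="$HOME/toolkits/my.sample.toolkit"

export STREAMS_SPLPATH="$hdfs_toolkit:$other_toolkit"
echo "$STREAMS_SPLPATH"

# With the variable exported, the compiler call needs no -t option:
#   sc -M MyMain
```

Keep in mind that a -t or --spl-path parameter on the sc command line takes precedence over this variable.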
- Develop your application. To avoid the need to fully qualify the operators, add a use directive in your application.
- For example, you can add the following clause in your SPL source file:
use com.ibm.streamsx.hdfs::*;
You can also specify a use clause for individual operators by replacing the asterisk (*) with the operator name. For example:
use com.ibm.streamsx.hdfs::HDFS2FileSink;
- If IBM InfoSphere BigInsights 3.x or 4.x is installed on GPFS:
- To access GPFS locally, set the fs.defaultFS option in the core-site.xml configuration file to gpfs:///.
- To access GPFS remotely (only applies to BigInsights 3.x), modify the core-site.xml file that you copied from the remote system. Set the fs.defaultFS option in the core-site.xml configuration file to webhdfs://hdfshost:webhdfsport. For example, webhdfs://myhdfshost:14000. Ensure that the user is set up to access the file system by using the webhdfs scheme.
- To read and write to HDFS, specify a uniform resource identifier (URI) to connect to HDFS. You can specify the URI in one of the following ways:
- Specify a value for the fs.defaultFS or fs.default.name option in the core-site.xml HDFS configuration file. By default, the operators look for the core-site.xml file in the following directories:
- $HADOOP_HOME/../hadoop-conf
- $HADOOP_HOME/etc/hadoop
- $HADOOP_HOME/conf
- $HADOOP_HOME/share/hadoop/hdfs/*
- $HADOOP_HOME/share/hadoop/common/*
- $HADOOP_HOME/share/hadoop/common/lib/*
- $HADOOP_HOME/lib/*
- $HADOOP_HOME/*
- Specify a value for the hdfsUri operator parameter.
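To see which core-site.xml file a given HADOOP_HOME is likely to expose, the lookup can be approximated in the shell. The following sketch checks only the three plain directories from the list above, and it assumes that they are searched in the listed order, which this page does not state. The find_core_site helper, the scratch layout, and the hdfs://namenode.example.com:8020 value are all hypothetical.

```shell
#!/bin/sh
# find_core_site is a hypothetical helper, not part of the toolkit.
# It checks three of the listed directories, in the listed order
# (an assumption), and prints the first core-site.xml it finds.
find_core_site() {
  for dir in "$1/../hadoop-conf" "$1/etc/hadoop" "$1/conf"; do
    if [ -f "$dir/core-site.xml" ]; then
      printf '%s\n' "$dir/core-site.xml"
      return 0
    fi
  done
  return 1
}

# Scratch layout standing in for a real installation.
demo_home=$(mktemp -d)/hadoop
mkdir -p "$demo_home/etc/hadoop"
cat > "$demo_home/etc/hadoop/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

found=$(find_core_site "$demo_home")
echo "$found"
```

Against a real installation, you would call find_core_site "$HADOOP_HOME" and then confirm that the reported file contains the intended fs.defaultFS value, or set the hdfsUri operator parameter instead.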
- Build your application. You can use the sc command or Streams Studio.
- Start the InfoSphere Streams instance.
- Run the application. You can submit the application as a job by using the streamtool submitjob command or by using Streams Studio.
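The last three steps can be strung together on the command line. The following sketch is one possible sequence. It assumes that the main composite is MyMain, that the compiler writes the application bundle to output/MyMain.sab, and that the default domain and instance are already configured for streamtool. The run wrapper is a hypothetical helper that only prints the commands when streamtool is not installed.

```shell
#!/bin/sh
# MyMain is the main composite name used earlier on this page; the
# bundle path output/MyMain.sab is an assumption about where sc writes it.
main_composite="MyMain"
bundle="output/${main_composite}.sab"

# run prints each command; it executes it only when streamtool is on
# the PATH, so the sketch is harmless on a machine without Streams.
run() {
  echo "+ $*"
  if command -v streamtool >/dev/null 2>&1; then
    "$@"
  fi
}

# Each step runs only if the previous one succeeded.
run sc -M "$main_composite" &&
  run streamtool startinstance &&
  run streamtool submitjob "$bundle"
```

If the instance is already running, the startinstance step can be dropped from the chain.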