IBM InfoSphere Streams Version 4.1.0

Toolkit com.ibm.streamsx.hdfs 2.0.0

General Information

The HDFS Toolkit provides operators that read data from and write data to the Hadoop Distributed File System (HDFS), version 2 or later.

The operators in this toolkit use Hadoop Java™ APIs to access HDFS and GPFS™. The operators support the following versions of Hadoop distributions:
  • Apache Hadoop versions 2.x
  • InfoSphere® BigInsights® 2.1.2, 3.0.0.x, 4.0.0.0
  • Cloudera distribution including Apache Hadoop version 4 (CDH 4) and version 5 (CDH 5)
  • Hortonworks Data Platform (HDP) 2.2

Note: The reference platforms that were used for testing are Hadoop 2.6.0, InfoSphere BigInsights 3.0.0.2 and CDH 5.2.0.

When you use the operators to access GPFS, you do not need to install InfoSphere Streams on an InfoSphere BigInsights data node. Instead, you can access GPFS remotely by specifying the webhdfs://hdfshost:webhdfsport scheme in the URI that you use to connect to GPFS.
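The remote URI form described above can be sketched in Python with only the standard library. The helper name webhdfs_uri and the host and port values are illustrative assumptions and are not part of the toolkit; the sketch only shows how the scheme, host, and port fit together in a webhdfs:// URI.

```python
from urllib.parse import urlsplit


def webhdfs_uri(host: str, port: int) -> str:
    """Build a webhdfs://host:port URI (hypothetical helper, not a toolkit API)."""
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError("port must be in the range 1-65535")
    return f"webhdfs://{host}:{port}"


# Placeholder values; substitute the host and WebHDFS port of your own cluster.
uri = webhdfs_uri("hdfshost.example.com", 50070)

# Split the URI back into its parts to confirm that it is well formed.
parts = urlsplit(uri)
print(parts.scheme, parts.hostname, parts.port)
```

Building the string in one place and parsing it back is a cheap way to catch a missing port or a misspelled scheme before the value is passed to an operator.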

For Apache Hadoop 2.x, CDH, and HDP, you can optionally configure these operators to use the Kerberos protocol to authenticate users who read from and write to HDFS. Kerberos makes access to HDFS more secure by verifying the identity of each user. To use Kerberos authentication, you must set the authPrincipal and authKeytab operator parameters at compile time. The authPrincipal parameter specifies the Kerberos principal, which is typically the principal that is created for the Streams instance owner. The authKeytab parameter specifies the keytab file that is created for that principal.

Restriction: Kerberos authentication is not supported for InfoSphere BigInsights 2.1.2 and 3.0.0.x.

Developing and running applications that use the HDFS Toolkit

To create applications that use the HDFS Toolkit, you must configure either Streams Studio or the SPL compiler so that it can locate the toolkit.

Version: 2.0.0

Required Product Version: 4.0.0.0