Using Streamsets for ETL to/from Hadoop

[blog in progress – incomplete]

This blog will show some examples of doing ETL to or from Hadoop.

USE CASE #1: Use Sqoop commands inside Streamsets to copy data to Hadoop from RDBMS
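To make this use case concrete, the kind of Sqoop import such a pipeline would wrap might look like the sketch below. The connection string, username, table, and target directory are all hypothetical placeholders, not values from a real cluster:

```shell
# Hypothetical example: import an "orders" table from MySQL into HDFS.
# Adjust --connect, --username, --table, and --target-dir to your environment.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4
```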

USE CASE #2: Use Kafka with Kerberos and Sentry with Streamsets on Cloudera

The following steps need to be completed to set up Streamsets with Kafka and Kerberos:

In Cloudera Manager, add the line below to the Data Collector Advanced Configuration Snippet (Safety Valve):

export STREAMSETS_LIBRARIES_EXTRA_DIR="/opt/sdc-extras/"


Next, copy streamsets.keytab from the latest /run/cloudera-scm-agent/process/xxx-streamsets-… directory to /etc/security/keytabs/streamsets.keytab.
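The copy can be scripted rather than done by hand; a minimal sketch, assuming the agent process directories contain "streamsets" in their names:

```shell
# Pick the most recently created StreamSets process directory
# (assumption: directory names contain "streamsets").
latest=$(ls -dt /run/cloudera-scm-agent/process/*streamsets* | head -1)
mkdir -p /etc/security/keytabs
cp "$latest/streamsets.keytab" /etc/security/keytabs/streamsets.keytab
```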

Next, create a jaas.conf file at /opt/sdc-extras/kafka/jaas.conf with a KafkaClient entry along the lines below (the principal is environment-specific, shown here as a placeholder):

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/streamsets.keytab"
  principal="sdc/<host>@<REALM>";
};
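For the JAAS file to take effect, the Data Collector JVM needs to be pointed at it via the standard Java system property. One way to do this, assuming the SDC_JAVA_OPTS environment variable is honored by your Data Collector start script (in a Cloudera-managed deployment the equivalent Java options safety valve can be used instead):

```shell
# Assumption: SDC_JAVA_OPTS is picked up by the Data Collector start script.
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Djava.security.auth.login.config=/opt/sdc-extras/kafka/jaas.conf"
```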

Make sure the keytab and jaas.conf files are owned by the sdc user:

$ chown sdc:sdc /etc/security/keytabs/streamsets.keytab

$ chown sdc:sdc /opt/sdc-extras/kafka/jaas.conf

Make the above changes on all Streamsets hosts, then restart the Data Collector service from Cloudera Manager.

NOTE: You may need to use kafka-sentry command to give permission to the sdc user on the topic being used.
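A minimal sketch of granting that permission with the kafka-sentry CLI, assuming a Sentry role named sdc_role, a Linux group named sdc, and a topic named mytopic (all three names are hypothetical):

```shell
# Create a Sentry role and map it to the sdc group (names are assumptions).
kafka-sentry -cr -r sdc_role
kafka-sentry -arg -r sdc_role -g sdc
# Grant read on the topic and on consumer groups.
kafka-sentry -gpr -r sdc_role -p "Host=*->Topic=mytopic->action=read"
kafka-sentry -gpr -r sdc_role -p "Host=*->Consumergroup=*->action=read"
```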

  1. On the General tab of the Kafka Consumer origin in the cluster pipeline, set the Stage Library property to an Apache Kafka version that supports Kerberos.
  2. On the Kafka tab, add the security.protocol Kafka configuration property, and set it to SASL_PLAINTEXT.
  3. Then, add the sasl.kerberos.service.name configuration property, and set it to kafka.

For example, the following Kafka configuration properties enable connecting to Kafka with Kerberos:

security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka