StreamSets install using Cloudera Manager

Environment:

CDH 5.12

STREAMSETS-3.1.2.0.jar

Follow the install instructions in the link below:

https://streamsets.com/documentation/datacollector/latest/help/index.html#datacollector/UserGuide/Installation/CMInstall-Overview.html#concept_nb5_c3m_25

Installation with Cloudera Manager

To install Data Collector through Cloudera Manager, perform the following steps:

  1. Install the StreamSets custom service descriptor (CSD).
  2. (Optional.) Manually install the parcel and checksum files. Typically only needed when the Cloudera Manager Server does not have internet access.
  3. Download, distribute, and activate the StreamSets parcel.
  4. Configure the StreamSets service.

Afterwards, you can configure the Data Collector if necessary.

Step 1. Install the StreamSets Custom Service Descriptor

  1. Use the following URL to download the CSD from the StreamSets website: https://streamsets.com/opensource

To verify the download path, in Cloudera Manager click Administration > Settings. In the navigation panel, select the Custom Service Descriptors category. Place the CSD file in the path configured for Local Descriptor Repository Path, which is usually /opt/cloudera/csd/.

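Move the CSD into that directory:

$ sudo mv STREAMSETS-3.1.2.0.jar /opt/cloudera/csd/

Change the file owner and permissions (use the full path, since the file is no longer in the current directory):

$ sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/STREAMSETS-3.1.2.0.jar && sudo chmod 644 /opt/cloudera/csd/STREAMSETS-3.1.2.0.jar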
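Before restarting, it can help to confirm that the jar actually landed in the CSD directory with the expected mode. The following is only a sketch: check_csd is a helper name made up for this post, and it assumes GNU stat (the -c option), which is what Linux hosts normally ship.

```shell
# Hypothetical helper (not part of Cloudera Manager or StreamSets):
# succeeds only if the CSD jar exists in the given directory with mode 644.
# Assumes GNU stat, which supports -c '%a' to print the octal mode.
check_csd() {
    csd_file="$1/$2"
    if [ ! -f "$csd_file" ]; then
        echo "missing: $csd_file" >&2
        return 1
    fi
    mode=$(stat -c '%a' "$csd_file") || return 1
    if [ "$mode" != "644" ]; then
        echo "unexpected mode $mode on $csd_file" >&2
        return 1
    fi
    echo "ok: $csd_file"
}

# Usage with the path and file name from this post:
# check_csd /opt/cloudera/csd STREAMSETS-3.1.2.0.jar
```

The owner is not checked here; `ls -l /opt/cloudera/csd/` shows it at a glance.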

Restart Cloudera Manager:

$ sudo /etc/init.d/cloudera-scm-server restart

In Cloudera Manager, to restart the Cloudera Management Service, click Home > Status. To the right of Cloudera Management Service, click the Menu icon and select Restart.

When asked to confirm that you want to run the Restart command on the Cloudera Management Service, click Restart.

Step 3. Distribute and Activate the StreamSets Parcel

  1. To view the list of available parcels, in the menu bar, click the Parcels icon.
    The StreamSets parcel displays in the list of available parcels. If it doesn’t display, click Check for New Parcels.

[Note: The STREAMSETS_DATACOLLECTOR 3.1.2.0 parcel did not show up in the parcel list at first. The log file /var/log/cloudera-scm-server/cloudera-scm-server.log contained this error:

2018-03-26 16:37:10,286 ERROR ParcelUpdateService:com.cloudera.parcel.components.ParcelDownloaderImpl: (7 skipped) Unable to retrieve remote parcel repository manifest
java.util.concurrent.ExecutionException: java.net.ConnectException: Received fatal alert: protocol_version to https://archives.streamsets.com/datacollector/latest/parcel/manifest.json

This appeared to be caused by a Java version or SSL/TLS issue: a protocol_version alert generally means the client and server could not agree on a TLS version. As a workaround, I changed the parcel repository URL from https:// to http://archives.streamsets.com/datacollector/latest/parcel/ . After this change, the STREAMSETS_DATACOLLECTOR 3.1.2.0 parcel showed up in the parcel list.]
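To check whether a missing parcel is down to this same manifest error, the server log can be searched for the message shown above. A small sketch; find_manifest_errors is a name invented for this post, and the -A1 option (print one line after each match) is assumed to be available, as it is in GNU grep.

```shell
# Hypothetical helper: print each "manifest" error from a Cloudera Manager
# server log together with the line that follows it (the exception cause).
# Returns non-zero when the log contains no such error.
find_manifest_errors() {
    grep -F -A1 'Unable to retrieve remote parcel repository manifest' "$1"
}

# Usage with the log path from this post:
# find_manifest_errors /var/log/cloudera-scm-server/cloudera-scm-server.log
```

If the second line of the output mentions protocol_version, the problem is likely the TLS mismatch described in the note.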

  2. To download the StreamSets parcel to the local repository, click Download.
    After the parcel is downloaded, the Download button becomes the Distribute button.
  3. To distribute the StreamSets parcel to the cluster, click Distribute.
    After distribution, the Distribute button becomes the Activate button.
  4. To activate the StreamSets parcel, click Activate.
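The same download, distribute and activate sequence can most likely be driven through the Cloudera Manager REST API instead of the Parcels page. This is a sketch under several assumptions, none of which come from the steps above: API version v17 (which I believe matches CM 5.12), the default CM port 7180, and a made-up host, cluster name and login. parcel_command_url is a helper invented here for illustration.

```shell
# Hypothetical helper: build the CM REST API URL for a parcel command.
# Assumptions: API version v17, and a cluster name that needs no URL-encoding.
# $1 = CM base URL, $2 = cluster name, $3 = product, $4 = version, $5 = command
parcel_command_url() {
    printf '%s/api/v17/clusters/%s/parcels/products/%s/versions/%s/commands/%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example (cm-host, cluster1 and admin:admin are placeholders):
# url=$(parcel_command_url http://cm-host:7180 cluster1 STREAMSETS_DATACOLLECTOR 3.1.2.0 startDownload)
# curl -s -u admin:admin -X POST "$url"
# Repeat with startDistribution and then activate, each only after the
# previous stage has finished.
```

Each command returns immediately, so a script would have to wait for the parcel to reach the next stage before issuing the following command, just as the buttons in the UI only change once a stage completes.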

Step 4. Configure the StreamSets Service

To run Data Collector in cluster streaming mode, colocate Data Collector on a node with the Spark Gateway role. To run Data Collector in cluster batch mode, colocate Data Collector on a node with the YARN Gateway role.

To write to HDFS, colocate Data Collector on a node with the HDFS Gateway role. Similarly, to write to HBase or Hive, colocate Data Collector on nodes with the HBase or Hive Gateway roles, respectively.

  1. In Cloudera Manager, click the menu for the cluster you want to use, then click Add a Service.
  2. In the Service Types list, select StreamSets, then click Continue.
  3. To select the hosts where you want to install StreamSets, on the Customize Role Assignments for StreamSets page, click Select Hosts to open a list of available hosts.
  4. Select one or more hosts, then click OK. Click Continue.
    The Review Changes page displays the Data and Resource directories for the Data Collector.
  5. Optionally change the directories, then click Continue.
    The First Run Command page displays status updates as Cloudera Manager starts Data Collector on the selected hosts.
  6. Click Continue, then click Finish.

[NOTE: This step failed in CM with an error saying that JDK7 is no longer supported, so the JDK needs to be upgraded to JDK8:

Mon Mar 26 17:48:58 EDT 2018: Prepending content from /opt/cloudera/parcels/STREAMSETS_DATACOLLECTOR-3.1.2.0/libexec/sdc-env.sh to /var/run/cloudera-scm-agent/process/809-streamsets-DATACOLLECTOR/sdc-env.sh
ERROR: Detected JDK7 that is no longer supported. Please upgrade to JDK8.]

 

After upgrading the JDK to JDK 8u162, the StreamSets service started successfully.
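To catch this before adding the service, the Java version on each Data Collector host can be checked up front. A rough sketch: java_major is a helper invented for this post, and it assumes version strings of the form 1.7.0_80 or 1.8.0_162 (or 11.0.2 for newer releases). Note that the java found on the PATH is not necessarily the one Cloudera Manager uses to start the service.

```shell
# Hypothetical helper: print the major Java version for a version string.
# Old-style strings start with "1." (1.8.0_162 is Java 8); newer ones do not.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Usage: `java -version` prints to stderr, with the version in double quotes.
# ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
# [ "$(java_major "$ver")" -ge 8 ] || echo "JDK 8 or later is needed" >&2
```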

See this other post for how to upgrade Cloudera CDH from JDK 1.7 to JDK 1.8:

https://plenium.wordpress.com/2018/03/27/upgrade-jdk1-7-to-jdk1-8-in-cloudera-cdh/

 

Step 5 (optional). Change the admin password for the StreamSets login

After installation, StreamSets has default accounts such as admin / admin and guest / guest. There is no menu option to change user passwords; this has to be done in Cloudera Manager. Go to Cloudera Manager > StreamSets > Configuration.

Type password in the search box to see the list of users and their MD5 password hashes, for example:

Data Collector Users
datacollector.users

admin: MD5:xxxxxxxxxxxxxxxxx,user,admin

guest: MD5:xxxxxxxxxxxxxxxxxx,user,guest

Now log in to the Linux host and generate the MD5 hash of the new password:

$ echo -n password123 | md5sum
482c811da5d5b4bc6d497ffa98491e38  -

Ignore the hyphen at the end. Copy the MD5 hash and paste it over the old hash for that user in Cloudera Manager > StreamSets > Configuration. Save, then restart/deploy the StreamSets service using Cloudera Manager. After that, you will be able to log in to StreamSets with the new password.
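To avoid copy and paste mistakes, the whole datacollector.users line can be generated in one go. A minimal sketch: sdc_user_entry is a helper name made up for this post, and the output format simply mirrors the entries shown above (user: MD5:hash,roles).

```shell
# Hypothetical helper: print a datacollector.users entry in the format
# shown in Cloudera Manager, e.g. "admin: MD5:<hash>,user,admin".
# $1 = user name, $2 = plain-text password, $3 = comma-separated roles
sdc_user_entry() {
    # printf '%s' avoids a trailing newline, like echo -n above;
    # awk drops the "-" that md5sum prints after the hash.
    hash=$(printf '%s' "$2" | md5sum | awk '{print $1}')
    printf '%s: MD5:%s,%s\n' "$1" "$hash" "$3"
}

sdc_user_entry admin password123 user,admin
# prints: admin: MD5:482c811da5d5b4bc6d497ffa98491e38,user,admin
```

Keep in mind that a password typed on the command line ends up in the shell history, so a real password is better read from a prompt or cleared from the history afterwards.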

 


 

 
