Using Streamsets for ETL to/from Hadoop

[Blog in progress, incomplete] This blog will show some examples of doing ETL to or from Hadoop.

USE CASE #1: Use Sqoop commands inside Streamsets to copy data from an RDBMS to Hadoop
https://www.youtube.com/watch?v=k8VbTR77l8M

REFERENCES:
https://streamsets.com/tutorials/
https://github.com/streamsets/tutorials
http://www.treselle.com/blog/import-and-ingest-data-into-hdfs-using-kafka-in-streamsets/
http://blog.cloudera.com/blog/2016/02/how-to-build-a-real-time-search-system-using-streamsets-apache-kafka-and-cloudera-search/
https://www.youtube.com/watch?v=Gnvl30OJNao
https://www.youtube.com/watch?v=qAyFvC4c2n4
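As a rough illustration of use case #1, the sketch below assembles a `sqoop import` command line in Python. It only covers building the command; how the command is triggered from inside Streamsets is not shown here. The JDBC URL, user name, password file, table name, and target directory are all hypothetical placeholders, and `build_sqoop_import` is a helper invented for this example.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, num_mappers=1):
    """Return the argument list for a basic `sqoop import` run."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source RDBMS
        "--username", username,
        "--password-file", password_file,   # file holding the DB password
        "--table", table,                   # source table to copy
        "--target-dir", target_dir,         # HDFS directory to write into
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


# All values below are hypothetical placeholders, not taken from the blog.
cmd = build_sqoop_import(
    jdbc_url="jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL",
    username="etl_user",
    password_file="/user/etl_user/.db_password",
    table="CUSTOMERS",
    target_dir="/data/raw/customers",
    num_mappers=4,
)

# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so values containing spaces or special characters do not need manual quoting.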

Streamsets install Oracle JDBC driver in External Library

This blog will show how to install the Oracle JDBC driver to the Streamsets External Library.

Environment: Cloudera CDH 5.12, Streamsets 3.1.2

TASK: Update the Oracle JDBC driver inside Streamsets
https://streamsets.com/documentation/datacollector/latest/help/#datacollector/UserGuide/Configuration/ExternalLibs.html#concept_pdv_qlw_ft

Step 1. Set Up an External Directory

Setting Up for Cloudera Manager
In Cloudera Manager, select the StreamSets service and then click Configuration. On the Configuration page, in the Data …
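A minimal sketch of the file-copy part of this task follows. It assumes the external directory uses the layout `<external dir>/<stage library name>/lib/` and that JDBC stages live in the `streamsets-datacollector-jdbc-lib` stage library; check both against the linked documentation for your version. The helper `install_driver`, the jar name `ojdbc8.jar`, and the temporary directories are made up for the example, which uses a placeholder file instead of a real driver.

```python
import shutil
import tempfile
from pathlib import Path


def install_driver(jar_path, external_dir,
                   stage_lib="streamsets-datacollector-jdbc-lib"):
    """Copy a driver jar into <external_dir>/<stage_lib>/lib/ (assumed layout)."""
    target_dir = Path(external_dir) / stage_lib / "lib"
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 returns the path of the newly created file
    return Path(shutil.copy2(jar_path, target_dir))


# Demonstration in a throwaway directory; nothing here touches a real install.
workdir = Path(tempfile.mkdtemp())
fake_jar = workdir / "ojdbc8.jar"              # hypothetical driver file name
fake_jar.write_bytes(b"placeholder, not a real jar")

external_dir = workdir / "sdc-extras"          # hypothetical external directory
installed = install_driver(fake_jar, external_dir)
print(installed)
```

After copying a real driver, Data Collector typically needs a restart before the new jar is picked up.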

Upgrade JDK1.7 to JDK1.8 in Cloudera CDH

Environment: Cloudera CDH 5.12 on RHEL

Requirements
Install one of the CDH and Cloudera Manager Supported JDK Versions.
Install the same version of the Oracle JDK on each host.
Install the JDK in /usr/java/jdk-version
All nodes must run the same JDK version.
Cloudera only supports 64-bit JDK from Oracle.

Upgrading to Oracle JDK 1.8 in a Cloudera Manager Deployment …
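The "same JDK version on every node" and "64-bit only" requirements can be checked with a short script. The sketch below parses `java -version` output that has already been collected from each host (collection itself, for example over SSH, is left out). The host names and sample outputs are hypothetical, and the helper functions are invented for this example.

```python
import re
from collections import defaultdict

# Matches the quoted version string in the first line of `java -version` output
VERSION_RE = re.compile(r'version "([^"]+)"')


def parse_java_version(output):
    """Return the version string from `java -version` output, or None."""
    match = VERSION_RE.search(output)
    return match.group(1) if match else None


def is_64bit(output):
    """Oracle HotSpot JVMs report '64-Bit' in their VM description line."""
    return "64-Bit" in output


def group_hosts_by_jdk(outputs):
    """Map each JDK version to the list of hosts that report it."""
    groups = defaultdict(list)
    for host, output in outputs.items():
        groups[parse_java_version(output)].append(host)
    return dict(groups)


# Hypothetical sample data; real output would be gathered from each node.
JDK8 = ('java version "1.8.0_162"\n'
        'Java(TM) SE Runtime Environment (build 1.8.0_162-b12)\n'
        'Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)\n')
JDK7 = ('java version "1.7.0_80"\n'
        'Java(TM) SE Runtime Environment (build 1.7.0_80-b15)\n'
        'Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)\n')

outputs = {"node1": JDK8, "node2": JDK8, "node3": JDK7}
groups = group_hosts_by_jdk(outputs)

if len(groups) > 1:
    print("JDK versions differ across hosts:", groups)
not_64bit = [host for host, out in outputs.items() if not is_64bit(out)]
if not_64bit:
    print("Hosts without a 64-bit JVM:", not_64bit)
```

With the sample data above, `node3` still reports JDK 1.7, so the script flags a version mismatch.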

Streamsets install using Cloudera Manager

Environment: CDH 5.12, STREAMSETS-3.1.2.0.jar

Follow the install instructions in the link below:
https://streamsets.com/documentation/datacollector/latest/help/index.html#datacollector/UserGuide/Installation/CMInstall-Overview.html#concept_nb5_c3m_25

Installation with Cloudera Manager
To install Data Collector through Cloudera Manager, perform the following steps:
Install the StreamSets custom service descriptor (CSD).
(Optional.) Manually install the parcel and checksum files. Typically only needed when the Cloudera Manager Server does not have internet access.
Download, …

Integrate Hadoop Hue with LDAP

Authenticate Hue Users with LDAP

Environment: CDH 5.12 on RHEL, Active Directory LDAP

We will use Search Bind, as it seems to be compatible with both AD and LDAP. We will follow the steps in the manual below:
https://www.cloudera.com/documentation/enterprise/latest/topics/hue_sec_ldap_auth.html
https://www.youtube.com/watch?time_continue=12&v=pCgUxQ8CU4o

Log on to Cloudera Manager and click Hue.
Click the Configuration tab and filter by scope=Service-wide and category=Security.
Set the …

Connect Hadoop to ElasticSearch using Talend

(BLOG IN PROGRESS - INCOMPLETE) This blog will show how to update an ElasticSearch index with data from an HDFS file using the Talend Open Studio for Big Data ETL tool.

First create a new job in Talend Studio, such as HDFStoESindexjob.
Drag the following components into the Design area:
tHDFSconnection_1 ----onsubok----> tHDFSinput_1 ----row1(Main)----> tWriteJSONField_2 ----row2(Main)----> tRESTClient_1
3. Talend …

Use Talend Open Studio for Big Data to ETL to Hadoop

Talend Open Studio for Big Data is a powerful, open source ETL tool. You can download it and use it to do ETL to and from Hadoop, including both HDFS and Hive.

Talend Install steps
Download the free Talend Open Studio for Big Data from https://www.talend.com/products/big-data/big-data-open-studio/
The download file location is set to c:\temp …

Connect ElasticSearch to Cloudera Hadoop using ES-Hadoop.

[CAUTION: The ES-Hadoop jars currently cause errors with Cloudera CDH, and Hue reports that multiple jars were found, so the process below is not working. Use these instructions at your own risk, as they may not work; no solution has been found so far.]

Environment: Cloudera CDH 5.12.x, elasticsearch-hadoop-6.2.1 …

Cloudera Search (Solr) install steps

The following steps are used to install Cloudera Search, which is based on Apache Solr.

Environment: Cloudera CDH 5.12.x, solr-spec 4.10.3

Deploying Cloudera Search
Cloudera Search (powered by Apache Solr) is included in CDH 5. If you have installed CDH 5.0 or higher, you do not need to perform any additional actions to install Search. …