Category Archives: Cloudera

Install ElasticSearch on Cloudera Hadoop

Environment:

Cloudera CDH 5.12.x

 

There are three ways to make the ES-Hadoop jar available to Hive so that Hive can connect to Elasticsearch.

The first is to pass the ES-Hadoop jar on the command line:

hive -hiveconf hive.aux.jars.path=/opt/elastic/elasticsearch-hadoop-2.4.3/dist/elasticsearch-hadoop-hive-2.4.3.jar;

The second option is to open a Hive session and run the following command before anything else:


ADD JAR /opt/elastic/elasticsearch-hadoop-2.4.3/dist/elasticsearch-hadoop-hive-2.4.3.jar;

The problem with both of these approaches is that you have to give Hive the full path to the Elasticsearch jar every single time. The third option avoids this: copy elasticsearch-hadoop-hive-<eshadoopversion>.jar into the same directory on every node in your cluster. In my case I copied it to the /usr/lib/hive/lib directory by executing the following command:

sudo cp /opt/elastic/elasticsearch-hadoop-2.4.3/dist/elasticsearch-hadoop-hive-2.4.3.jar /usr/lib/hive/lib/.

Then, in Cloudera Manager, set the Hive Auxiliary JARs Directory property (hive.aux.jars.path) to /usr/lib/hive/lib.
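Since this option relies on the jar sitting at the same path on every node, it can be worth checking the directory before changing the property. The following is only a minimal Python sketch: the function name find_es_hadoop_hive_jars is hypothetical, the directory is the one used in this post, and the script has to be run on each node separately.

```python
import glob
import os


def find_es_hadoop_hive_jars(lib_dir):
    """Return the sorted elasticsearch-hadoop-hive jar file names found in lib_dir."""
    pattern = os.path.join(lib_dir, "elasticsearch-hadoop-hive-*.jar")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    # /usr/lib/hive/lib is the directory used in this post; adjust as needed.
    jars = find_es_hadoop_hive_jars("/usr/lib/hive/lib")
    if len(jars) == 1:
        print("OK:", jars[0])
    elif not jars:
        print("No elasticsearch-hadoop-hive jar found")
    else:
        print("More than one version present:", ", ".join(jars))
```

Exactly one jar per node is the expected result; an empty list or several versions suggests the copy step should be repeated or cleaned up on that node.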

 

 

REFERENCES:

https://qbox.io/blog/how-to-offload-elasticsearch-indices-to-hive-hadoop

https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-03-integrating-hadoop-and-elasticsearch-part-1-loading-and-querying

https://www.linkedin.com/pulse/how-use-hive-table-elasticsearch-index-khan-arshad/

http://www.idata.co.il/2016/06/integrating-elasticsearch-with-hadoop-using-hive/

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/index.html

 

 

 


Cloudera Search (Solr) install steps

The following steps are used to install Cloudera Search which is based on Apache Solr.

Environment:

Cloudera CDH 5.12.x

solr-spec 4.10.3

Deploying Cloudera Search

Cloudera Search (powered by Apache Solr) is included in CDH 5. If you have installed CDH 5.0 or higher, you do not need to perform any additional actions to install Search.

When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes, and uses ZooKeeper to simplify management, which results in a cluster of coordinating Apache Solr servers.

For Cloudera Manager installations, if you have not yet added the Solr service to your cluster, do so now from the Cloudera Manager->Home->Add Service dropdown.  The Add a Service wizard automatically configures and initializes the Solr service.

On the next screen in Cloudera Manager, select the hosts where you want to deploy the Solr servers. I selected two hosts that also had ZooKeeper running on them. On the following screen, keep the default ZooKeeper Znode /solr and HDFS Data Directory /solr.

However, while running the Add Service wizard I got the error “Sentry requires that authentication be turned on for Solr.”

I clicked Back and this time removed the Sentry dependency, keeping only HDFS and ZooKeeper. In another browser window I also went to Cloudera Manager->Solr->Configuration and set Sentry Service to None. After that I retried the install and it ran successfully.

Next, in Cloudera Manager go to Hue->Configuration->Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini and remove search from the app blacklist:

[desktop]

app_blacklist=hbase,impala

 

NEXT STEP:

Log in to Hue and click Query->Dashboard. At this point I got an error:

Error!

HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/cores?user.name=hue&doAs=hive&wt=json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fba0805bad0>: Failed to establish a new connection: [Errno 113] No route to host',))

To resolve this issue, go to Cloudera Manager->Hue->Configuration and search for solr.

Under Solr Service->Hue (Service-Wide), select Solr if it is currently set to none.

This automatically updates these lines in hue.ini:

-app_blacklist=spark,zookeeper,hbase,impala,search
+app_blacklist=spark,zookeeper,hbase,impala
+[search]
+solr_url=http://solrhostxyz.com:8983/solr
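The two settings that mattered here (search removed from the blacklist, and a solr_url that points at the real Solr host rather than localhost) can be checked with a few lines of Python. This is a rough sketch only: check_hue_search_config is a hypothetical helper, and it is meant for small fragments like the one above, not for a complete hue.ini, whose nested [[...]] sections are outside what configparser is designed for.

```python
import configparser
from urllib.parse import urlparse

# Fragment mirroring the lines shown above; the host name is the post's placeholder.
HUE_FRAGMENT = """
[desktop]
app_blacklist=spark,zookeeper,hbase,impala
[search]
solr_url=http://solrhostxyz.com:8983/solr
"""


def check_hue_search_config(ini_text):
    """Return a list of problems that would keep the Hue search app from reaching Solr."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    problems = []

    # The search app must not be listed in app_blacklist.
    blacklist = parser.get("desktop", "app_blacklist", fallback="")
    if "search" in [app.strip() for app in blacklist.split(",")]:
        problems.append("search is still in app_blacklist")

    # solr_url must be set and should name the Solr host, not localhost.
    solr_url = parser.get("search", "solr_url", fallback="")
    host = urlparse(solr_url).hostname
    if host is None:
        problems.append("solr_url is not set")
    elif host in ("localhost", "127.0.0.1"):
        problems.append("solr_url points at localhost")
    return problems


if __name__ == "__main__":
    print(check_hue_search_config(HUE_FRAGMENT) or "looks fine")
```

A localhost URL is only a problem when Solr does not run on the Hue host, which was the case in this setup.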

Click Restart in Cloudera Manager. After that, clicking Hue->Query->Dashboard works and shows a new message:

It seems there is nothing to search on..

What about creating a new index?

This indicates that Hue is now successfully configured with Solr search.

NEXT STEPS:

A little about SolrCores and Collections

On a single instance, Solr has something called a SolrCore, which is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of the SolrCores that make up one logical index a collection. A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy.
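The relationship can be pictured with a small data model. This is purely illustrative Python, not Solr code: the class names, host names, and core names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SolrCore:
    """One physical index living on one Solr instance."""
    name: str
    host: str


@dataclass
class Collection:
    """One logical index made up of SolrCores on one or more hosts."""
    name: str
    cores: list = field(default_factory=list)

    def hosts(self):
        """Return the distinct hosts that hold a piece of this collection."""
        return sorted({core.host for core in self.cores})


# Hypothetical layout: one collection spread over two Solr hosts.
reviews = Collection(
    name="index1_reviews",
    cores=[
        SolrCore("index1_reviews_shard1_replica1", "solrhost1.example.com"),
        SolrCore("index1_reviews_shard2_replica1", "solrhost2.example.com"),
    ],
)

print(reviews.hosts())
```

Searching the collection means querying all of its cores; adding more cores on more hosts is how the one logical index scales out or gains redundancy.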

EXAMPLE:  Easy indexing of data into Solr with ETL operations

http://gethue.com/easy-indexing-of-data-into-solr/

Selecting data

We need to create a new Solr collection from business review data. To start let’s put the data file somewhere on HDFS so we can access it.

Click the Hue->Indexes menu option on the left. (Note: in Hue, index and collection mean mostly the same thing.) Click Add Index.

Give the index a name and the location of the .csv file that was uploaded to HDFS using Hue. Hue recognizes the fields and their data types, and clicking Finish creates the index.

After the index is created it appears on the left under Collections. Click the index and then click Search. This displays a dashboard, and you can build various dashboards from the data in Hue.

NEXT STEP:

The Solr admin page gives a lot of detail about Solr and the index (collection) that was created.

http://mysolrhost:8983

 

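Next, running the instancedir list command on the Solr server shows the instance directories (configuration sets) known to Solr, including the one created for our index; solrctl collection --list shows the collections themselves:

/root>solrctl instancedir --list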

index1_reviews

managedTemplate

managedTemplateSecure

predefinedTemplate

predefinedTemplateSecure

schemalessTemplate

schemalessTemplateSecure

 

Reference:

https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cm_ig_install_search.html

http://gethue.com/easy-indexing-of-data-into-solr/

 

 

Kafka install on Cloudera Hadoop

Below are the steps to install the Kafka parcel in Cloudera Manager.

Cloudera Distribution of Apache Kafka Requirements and Supported Versions:

Cloudera Kafka 2.2.x requires at least Cloudera Manager 5.9.x and CDH 5.9.x.

General Information Regarding Installation and Upgrade

These are the official instructions: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_installing.html#concept_jms_yb1_v5

Cloudera recommends that you deploy Kafka on dedicated hosts that are not used for other cluster roles.

Click the Parcels icon in the top right of Cloudera Manager. If you do not see Kafka 2.2.x in the list of parcels, add the parcel URL to the list.

  1. Find the parcel for the version of Kafka you want to use on Cloudera Distribution of Apache Kafka Versions.
  2. I used the URL http://archive.cloudera.com/kafka/parcels/2.2.0/ because the 3.3.0 parcel is not supported on CDH 5.12.x. Copy this parcel repository link.
  3. On the Cloudera Manager Parcels page, click Configuration.
  4. In the field Remote Parcel Repository URLs, click + next to an existing parcel URL to add a new field.
  5. Paste the parcel repository link.
  6. Save your changes.
  7. On the Cloudera Manager Parcels page, download the Kafka parcel, distribute the parcel to the hosts in your cluster, and then activate the parcel. After you activate the Kafka parcel, Cloudera Manager prompts you to restart the cluster.
  8. Add the Kafka service to your cluster using Cloudera Manager->Add Service.
  9. Select HDFS, Sentry and ZooKeeper as dependencies when prompted.
  10. Select the hosts for the Kafka services.
  11. Enter the Destination Broker List and the Source Broker List, including the port:

    Destination Broker List (bootstrap.servers)            myhost.com:9092

    Source Broker List (source.bootstrap.servers)          myhost.com:9092

Please note that both of these server names must be FQDNs that your DNS (or hosts file) can resolve; otherwise you’ll get other errors. The trailing port number is also mandatory!
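The format requirement (a fully qualified host name followed by a port) is easy to check before typing the values into the wizard. The sketch below is a loose format check in Python under my own assumptions: invalid_brokers is a hypothetical helper, the pattern simply requires at least one dot in the host name, and nothing here verifies that the name actually resolves.

```python
import re

# Loose pattern: a dotted host name, a colon, then a numeric port.
_BROKER_RE = re.compile(
    r"^(?P<host>[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+):(?P<port>\d{1,5})$"
)


def invalid_brokers(broker_list):
    """Return the entries of a comma-separated broker list that are not host.domain:port."""
    bad = []
    for entry in broker_list.split(","):
        entry = entry.strip()
        match = _BROKER_RE.match(entry)
        # Reject entries with no dotted host, no port, or a port outside 1-65535.
        if match is None or not 0 < int(match.group("port")) < 65536:
            bad.append(entry)
    return bad


if __name__ == "__main__":
    print(invalid_brokers("myhost.com:9092, myhost:9092, myhost.com"))
```

An empty result means every entry at least looks like an FQDN with a port; resolving the names is still a separate check.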

There seems to be a bug in this Kafka parcel. The following community posting describes the workaround, and the steps below are taken from it with its numbering kept:

https://community.cloudera.com/t5/Cloudera-Manager-Installation/adding-a-Kafka-service-failed/td-p/40526

3) Click “Continue”. Service will NOT start (error). Do not navigate away from that screen

4) Open another Cloudera Manager in another browser pane. You should now see “Kafka” in the list of Services (red, but it should be there). Click on the Kafka Service and then “Configure”.

5) Search for the “java heap space” Configuration Property. The standard Java Heap Space you’ll find already set up should be 50 MBytes. Put in at least 256 MBytes. The original value is simply not enough.

6) Now search for the “whitelist” Configuration Property. In the field, put in “(?!x)x” (without the quotation marks). That’s a regular expression that does not match anything. Given that apparently a Whitelist is mandatory for the Mirrormaker Service to start, and I’m assuming you don’t want to replicate any topics remotely right now, just put in something that won’t replicate anything e.g. that regular expression.

7) Save the changes and go back to the original Configuration Screen on the other browser pane. Click “Retry”, or whatever, or even exit that screen and manually restart the Kafka Service in Cloudera Manager.

After this the Kafka service should start successfully.
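The whitelist value from step 6 can be sanity-checked quickly. MirrorMaker evaluates the whitelist as a Java regular expression; the sketch below uses Python's re module as a stand-in, on the assumption that the negative lookahead behaves the same way for this simple pattern. The topic names are made up.

```python
import re

# The whitelist value suggested in step 6.
NEVER_MATCHES = r"(?!x)x"

# Hypothetical topic names, including ones that contain the letter x.
topics = ["xyztopic1", "x", "orders", "", "xx"]

# (?!x) demands that the next character is not x, and the following x
# demands that it is, so no position in any string can satisfy both.
matched = [t for t in topics if re.search(NEVER_MATCHES, t)]
print(matched)  # prints []
```

Because the pattern matches no topic at all, MirrorMaker has a whitelist to start with but nothing to replicate.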

 

Next try some Kafka commands:

Kafka command-line tools are located in /usr/bin. Log in as root to a server where a Kafka broker is running:

  • kafka-topics is used to create, alter, list, and describe topics.

$ /usr/bin/kafka-topics --create --zookeeper kafkahost.com:2181 --replication-factor 1 --partitions 1 --topic xyztopic1

17/12/18 10:09:34 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
17/12/18 10:09:34 INFO admin.AdminUtils$: Topic creation {"version":1,"partitions":{"0":[112]}}
Created topic "xyztopic1".
17/12/18 10:09:34 INFO zkclient.ZkEventThread: Terminate ZkClient event thread.
17/12/18 10:09:34 INFO zookeeper.ZooKeeper: Session: 0x2602cabe1904fcd closed
17/12/18 10:09:34 INFO zookeeper.ClientCnxn: EventThread shut down

$ kafka-topics --zookeeper kafkahost.com:2181 --list
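When these commands are pasted from a web page, the two ASCII hyphens in front of each option are easily turned into a single long dash, which kafka-topics does not accept. One way around that is to build the argument lists in a script. The following is a small Python sketch; the two function names are hypothetical, and the host and topic are the placeholders used in this post.

```python
import shlex


def kafka_topics_create(zookeeper, topic, partitions=1, replication_factor=1):
    """Build the argument list for the kafka-topics create command used in this post."""
    return [
        "/usr/bin/kafka-topics", "--create",
        "--zookeeper", zookeeper,
        "--replication-factor", str(replication_factor),
        "--partitions", str(partitions),
        "--topic", topic,
    ]


def kafka_topics_list(zookeeper):
    """Build the argument list for listing topics."""
    return ["/usr/bin/kafka-topics", "--zookeeper", zookeeper, "--list"]


if __name__ == "__main__":
    # Print shell-safe command lines; nothing is executed here.
    for args in (
        kafka_topics_create("kafkahost.com:2181", "xyztopic1"),
        kafka_topics_list("kafkahost.com:2181"),
    ):
        print(" ".join(shlex.quote(arg) for arg in args))
```

On the broker host, the same lists can be passed to subprocess.run(args, check=True) to actually run the commands.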

 

 

References:

https://www.rittmanmead.com/blog/2015/03/creating-real-time-search-dashboards-using-apache-solr-hue-flume-and-morphlines/

https://www.cloudera.com/documentation/kafka/latest/topics/kafka_installing.html#concept_jms_yb1_v5

https://www.confluent.io/blog/stream-data-platform-1/

https://developer.ibm.com/opentech/2017/05/31/kafka-acls-in-practice/