
Cloudera Search (Solr) install steps

The following steps install Cloudera Search, which is based on Apache Solr.

Environment:

Cloudera CDH 5.12.x

solr-spec 4.10.3

Deploying Cloudera Search

Cloudera Search (powered by Apache Solr) is included in CDH 5. If you have installed CDH 5.0 or higher, you do not need to perform any additional actions to install Search.

When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes, and uses ZooKeeper to simplify management, which results in a cluster of coordinating Apache Solr servers.

For Cloudera Manager installations, if you have not yet added the Solr service to your cluster, do so now from the Cloudera Manager->Home->Add Service dropdown.  The Add a Service wizard automatically configures and initializes the Solr service.

In the next Cloudera Manager screen, select the hosts where you want to deploy the Solr servers. I selected two hosts which also had ZooKeeper running on them. Keep the default ZooKeeper Znode /solr and HDFS Data Directory /solr in the next screen.

However, while running Add Service, I got the error "Sentry requires that authentication be turned on for Solr."

I clicked back on the page and this time removed the dependency on Sentry, keeping only HDFS and ZooKeeper. In another browser tab I also went to Cloudera Manager->Solr->Configuration and set Sentry Service to None. After that I retried the install and it ran successfully.

Next, in Cloudera Manager go to Hue->Configuration->Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini and remove search from the app blacklist:

[desktop]

app_blacklist=hbase,impala

 

NEXT STEP:

Log in to Hue and click on Query->Dashboard. This returned an error:

Error!

HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/cores?user.name=hue&doAs=hive&wt=json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fba0805bad0>: Failed to establish a new connection: [Errno 113] No route to host',))

To resolve this issue, go to Cloudera Manager->Hue->Configuration and search for solr.

Under Hue (Service-Wide), in the Solr Service setting, select the Solr radio button if none is currently selected.

This will automatically update these lines in the hue.ini

-app_blacklist=spark,zookeeper,hbase,impala,search
+app_blacklist=spark,zookeeper,hbase,impala
+[search]
+solr_url=http://solrhostxyz.com:8983/solr
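
Before restarting, you can optionally check that the Hue host can actually reach Solr at that URL. A quick sketch, using the hostname from solr_url above and the /solr/admin/cores endpoint from the earlier error message:

# Run from the Hue host; this should return core status JSON rather than "No route to host"
$ curl "http://solrhostxyz.com:8983/solr/admin/cores?wt=json"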

Click Restart in Cloudera Manager. After that, clicking Hue->Query->Dashboard works and gives a new message:

It seems there is nothing to search on..

What about creating a new index?

This indicates that Hue is now successfully configured with Solr search.

NEXT STEPS:

A little about SolrCores and Collections

On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of these SolrCores that make up one logical index a collection. A collection is essentially a single index that spans many SolrCores, both for index scaling as well as redundancy.
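
For reference, the same concepts map onto the solrctl command line. A minimal sketch (the names myconfig and mycollection and the shard/replica counts are placeholders, not something created in this walkthrough):

$ solrctl instancedir --generate $HOME/myconfig                      # generate a template config locally
$ solrctl instancedir --create myconfig $HOME/myconfig               # upload it to ZooKeeper as an instance directory
$ solrctl collection --create mycollection -s 2 -c myconfig -r 1     # collection with 2 shards, 1 replica each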

EXAMPLE:  Easy indexing of data into Solr with ETL operations

http://gethue.com/easy-indexing-of-data-into-solr/

Selecting data

We need to create a new Solr collection from business review data. To start let’s put the data file somewhere on HDFS so we can access it.

Click on the Hue->Indexes menu option on the left. (Note: index and collection mean mostly the same thing here.) Click on Add Index.

Give the index a name and the location of the .csv file that was uploaded to HDFS using Hue. It will recognize the fields and data types, and we can create the index by clicking Finish.

After the index is created we can see it on the left under Collections. Click on the index then click on Search. This will display a Dashboard and you can create various Dashboards from the data in Hue.
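
As an alternative to the Hue wizard, the same data can be posted straight to Solr's CSV update handler. A sketch, assuming a local copy of the file named reviews.csv (a placeholder name) and the index1_reviews collection created above:

# Load the CSV into the collection and commit
$ curl "http://mysolrhost:8983/solr/index1_reviews/update?commit=true" --data-binary @reviews.csv -H "Content-Type: text/csv"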

NEXT STEP:

If you go to the Solr admin page you can get a lot of details about Solr and the index (collection) created.

http://mysolrhost:8983
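
The collection can also be queried directly over HTTP, which is handy for a quick sanity check. Assuming the collection name index1_reviews from above:

# Return the first 5 documents from the collection as JSON
$ curl "http://mysolrhost:8983/solr/index1_reviews/select?q=*:*&rows=5&wt=json"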

 

Next, running a list command on the Solr server shows the instance directories, which include the new index along with the default templates:

/root>solrctl instancedir --list

index1_reviews

managedTemplate

managedTemplateSecure

predefinedTemplate

predefinedTemplateSecure

schemalessTemplate

schemalessTemplateSecure
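
To list the collections themselves (as opposed to instance directories), solrctl has a separate subcommand:

$ solrctl collection --list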

 

References:

https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cm_ig_install_search.html

http://gethue.com/easy-indexing-of-data-into-solr/

 

 


Kafka install on Cloudera Hadoop

Below are the steps to install Kafka parcel in Cloudera manager.

Cloudera Distribution of Apache Kafka Requirements and Supported Versions:

For Cloudera Kafka 2.2.x, the lowest supported Cloudera Manager version is 5.9.x, and CDH 5.9.x or higher is required.

General Information Regarding Installation and Upgrade

These are the official instructions: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_installing.html#concept_jms_yb1_v5

Cloudera recommends that you deploy Kafka on dedicated hosts that are not used for other cluster roles.

Click on the Parcels icon at the top right in Cloudera Manager. If you do not see Kafka 2.2.x in the list of parcels, you can add the parcel URL to the list.

  1. Find the parcel for the version of Kafka you want to use on Cloudera Distribution of Apache Kafka Versions.
  2. Use the parcel URL http://archive.cloudera.com/kafka/parcels/2.2.0/ since the 3.3.0 parcel is not supported on CDH 5.12.x. Copy this parcel repository link.
  3. On the Cloudera Manager Parcels page, click Configuration.
  4. In the field Remote Parcel Repository URLs, click + next to an existing parcel URL to add a new field.
  5. Paste the parcel repository link.
  6. Save your changes.
  7. On the Cloudera Manager Parcels page, download the Kafka parcel, distribute the parcel to the hosts in your cluster, and then activate the parcel. After you activate the Kafka parcel, Cloudera Manager prompts you to restart the cluster.
  8. Add the Kafka service to your cluster using the Cloudera manager->Add Service
  9. Select HDFS, Sentry and Zookeeper as list of dependencies when prompted.
  10. Next download, distribute, activate the parcel.
  11. Add the Kafka Service in Cloudera manager.
  12. Select the hosts for kafka services.
  13. Enter the Destination Broker List and the Source Broker List, including the port:

    Destination Broker List (bootstrap.servers):            myhost.com:9092

    Source Broker List (source.bootstrap.servers):          myhost.com:9092

Please note that both of these server names must be FQDNs and resolvable by your DNS (or hosts file), otherwise you'll get other errors. Also, the format with the trailing port number is mandatory!

There seems to be a bug in this Kafka parcel; review this posting to find the solution:

https://community.cloudera.com/t5/Cloudera-Manager-Installation/adding-a-Kafka-service-failed/td-p/40526

3) Click "Continue". The service will NOT start (error). Do not navigate away from that screen.

4) Open Cloudera Manager in another browser pane. You should now see "Kafka" in the list of services (red, but it should be there). Click on the Kafka service and then "Configure".

5) Search for the "java heap space" configuration property. The standard Java heap space you'll find already set up is 50 MB. Put in at least 256 MB; the original value is simply not enough.

6) Now search for the "whitelist" configuration property. In the field, put in "(?!x)x" (without the quotation marks). That's a regular expression that does not match anything. Given that a whitelist is apparently mandatory for the MirrorMaker service to start, and assuming you don't want to replicate any topics remotely right now, just put in something that won't replicate anything, e.g. that regular expression.

7) Save the changes and go back to the original configuration screen in the other browser pane. Click "Retry", or exit that screen and manually restart the Kafka service in Cloudera Manager.

After this the Kafka service should start successfully.
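
One quick way to confirm the brokers came up is to check their registrations in ZooKeeper. A sketch, assuming the zookeeper-client wrapper is installed and kafkahost.com (a placeholder) runs a ZooKeeper server:

# Each live broker registers an id under /brokers/ids
$ zookeeper-client -server kafkahost.com:2181 ls /brokers/ids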

 

Next try some Kafka commands:

Kafka command-line tools are located in /usr/bin. Log in as root to the server where a Kafka broker is running:

  • kafka-topics examples to create, alter, list, and describe topics.

$ /usr/bin/kafka-topics --create --zookeeper kafkahost.com:2181 --replication-factor 1 --partitions 1 --topic xyztopic1

17/12/18 10:09:34 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
17/12/18 10:09:34 INFO admin.AdminUtils$: Topic creation {"version":1,"partitions":{"0":[112]}}
Created topic "gctopic1".
17/12/18 10:09:34 INFO zkclient.ZkEventThread: Terminate ZkClient event thread.
17/12/18 10:09:34 INFO zookeeper.ZooKeeper: Session: 0x2602cabe1904fcd closed
17/12/18 10:09:34 INFO zookeeper.ClientCnxn: EventThread shut down

$ kafka-topics --zookeeper Bkafkahost.com:2181 --list
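
To verify the topic end to end, you can produce and consume a test message with the console tools, which are also in /usr/bin. A sketch, assuming a broker is listening on kafkahost.com:9092 (the port configured earlier):

# Produce a test message (type a line and press Enter, then Ctrl+C to exit)
$ kafka-console-producer --broker-list kafkahost.com:9092 --topic xyztopic1

# Read the topic from the beginning in another terminal
$ kafka-console-consumer --bootstrap-server kafkahost.com:9092 --topic xyztopic1 --from-beginning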

 

 

References:

https://www.rittmanmead.com/blog/2015/03/creating-real-time-search-dashboards-using-apache-solr-hue-flume-and-morphlines/

https://www.cloudera.com/documentation/kafka/latest/topics/kafka_installing.html#concept_jms_yb1_v5

https://www.confluent.io/blog/stream-data-platform-1/

https://developer.ibm.com/opentech/2017/05/31/kafka-acls-in-practice/

 

 

Install Jupyter notebook with Livy for Spark on Cloudera Hadoop

Environment

  • Cloudera CDH 5.12.x running Livy and Spark (see other blog on this website to install Livy)
  • Anaconda parcel installed using Cloudera Manager (see other blog on this website to install Anaconda parcel on CDH)

We will first install Anaconda on Windows 10, which provides Jupyter Notebook, and then add Sparkmagic.

  1. We strongly recommend installing Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

2. Download the Anaconda installer from https://www.anaconda.com/download/; in this case it was Anaconda 5.0.1 for Windows, Python 3.6 version, 64-bit.

3. After a successful install on Windows, click on Anaconda Navigator in the Windows Start menu, then click the JupyterLab Launch button.

4. In the JupyterLab notebook, I tried to load sparkmagic using:

%load_ext sparkmagic.magics

This gave a module-not-found error.

5. We need to install sparkmagic. Run the following command in the notebook.

!pip install sparkmagic

The output shows a successful install message indicating that sparkmagic 0.12.5 was installed along with a few other packages.

After this run in the notebook:

!pip show sparkmagic

Name: sparkmagic

Version: 0.12.5

Summary: SparkMagic: Spark execution via Livy

Home-page: https://github.com/jupyter-incubator/sparkmagic

Author: Jupyter Development Team

Author-email: jupyter@googlegroups.org

License: BSD 3-clause

 

NEXT STEP:

From https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json

Copy example_config.json to ~/.sparkmagic/config.json (on Windows the folder location is usually C:\Users\youruserid\.sparkmagic\config.json). Change the username, password and url to point to your Livy server; the remaining settings in the file can be left at their defaults.

{
  "kernel_python_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998"
  },
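
Since the edited file must remain valid JSON for sparkmagic to load it, it is worth validating it after editing; for example, from a command prompt (Windows path as above):

python -m json.tool C:\Users\youruserid\.sparkmagic\config.json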

 

Next launch JupyterLab:

Click Windows->Start->Anaconda Navigator and launch JupyterLab. Then run the command below in the Jupyter notebook:

!jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension…
– Validating: ok

 

Next run the following commands in the Jupyter notebook:

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pysparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pyspark3kernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkrkernel

!jupyter serverextension enable --py sparkmagic
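
You can check that the kernels registered correctly before switching to them; for example, in another notebook cell:

!jupyter kernelspec list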

 

NEXT STEP:

From the top right corner of Jupyter, click on Python 3 and change to the PySpark kernel.

Make sure the Livy server is running by running curl on the Livy host:

$ curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}

Run a simple command such as 1+1 in the Jupyter notebook; the notebook will connect to the Spark cluster to execute your commands. It will start a Spark application with your first command.

Run another command:

%%sql

show databases

It will display the list of databases defined in Hive in the Hadoop Spark Cluster.

 

Another EXAMPLE: draw a plot and store it in a PDF file on the Livy server:

import os
import matplotlib
matplotlib.use('Agg')            # non-interactive backend, since this code runs on the Livy/Spark side
import matplotlib.pyplot as plt

mydir = r'/home/yourlinuxid/'    # your home directory on the Livy server
os.chdir(mydir)
os.getcwd()

plt.plot([1, 4, 3, 6, 12, 20])   # simple line plot of sample values
plt.savefig('myplot1.pdf')       # the PDF is written to the current directory on the Livy server

 

Now if you download myplot1.pdf from your home directory on the Linux server running Livy, you can see the graph created in the PDF.
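
For example, with any scp client available on your workstation (the user and host names below are the placeholders used earlier):

scp yourlinuxid@livyhostname:~/myplot1.pdf .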

You have now successfully installed Jupyter notebook on Windows 10 and run Python through PySpark, using Livy to reach Spark on the Hadoop backend.

 


REFERENCES:

https://github.com/jupyter-incubator/sparkmagic

https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d

https://spark-summit.org/east-2017/events/secured-kerberos-based-spark-notebook-for-data-science/