
Cloudera Search (Solr) install steps

The following steps are used to install Cloudera Search which is based on Apache Solr.


Cloudera CDH 5.12.x

solr-spec 4.10.3

Deploying Cloudera Search

Cloudera Search (powered by Apache Solr) is included in CDH 5. If you have installed CDH 5.0 or higher, you do not need to perform any additional actions to install Search.

When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes, and uses ZooKeeper to simplify management, which results in a cluster of coordinating Apache Solr servers.

For Cloudera Manager installations, if you have not yet added the Solr service to your cluster, do so now from the Cloudera Manager->Home->Add Service dropdown.  The Add a Service wizard automatically configures and initializes the Solr service.

Select the hosts in the next CMgr screen where you want to deploy the Solr servers. I selected two hosts that also had ZooKeeper running on them. Keep the default ZooKeeper Znode /solr and HDFS Data Directory /solr in the next screen.

However, while running the Add Service wizard I got an error: "Sentry requires that authentication be turned on for Solr."

I clicked Back on the page and this time removed the dependency on Sentry, keeping only HDFS and ZooKeeper. In another browser tab I also went to Cloudera Manager->Solr->Configuration and set Sentry Service to None. After that I retried the install and it ran successfully.

Next, in CMgr go to Hue->Configuration->Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini and remove Search from the blacklist.
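For reference, the blacklist lives under the [desktop] section of the safety valve; removing search looks roughly like this (the remaining app names are just examples, keep whatever else you blacklist):

```ini
[desktop]
# 'search' removed from the blacklist; other entries are examples
app_blacklist=hbase,impala
```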





Log in to Hue and click Query->Dashboard. I got an error:


HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/admin/cores?user.name=hue&doAs=hive&wt=json (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fba0805bad0>: Failed to establish a new connection: [Errno 113] No route to host',))

To resolve this issue, go to Cloudera Manager->Hue->Configuration and search for solr.

In Solr Service->Hue (Service-Wide), select the Solr option if "none" is currently selected.

This will automatically update these lines in hue.ini:
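The lines in question are in the [search] section of hue.ini; a sketch with an assumed hostname (point it at whichever node runs the Solr Server):

```ini
[search]
# assumed host/port; use your Solr Server node here
solr_url=http://solr-host.example.com:8983/solr/
```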


Click restart in CMgr. After that, clicking Hue->Query->Dashboard works and shows a new message:

It seems there is nothing to search on..

What about creating a new index?

This indicates that Hue is now successfully configured with Solr search.


A little about SolrCores and Collections

On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of the SolrCores that make up one logical index a collection. A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy.

EXAMPLE:  Easy indexing of data into Solr with ETL operations


Selecting data

We need to create a new Solr collection from business review data. To start let’s put the data file somewhere on HDFS so we can access it.

Click on the Hue->Indexes menu option on the left. (Note: index and collection are mostly the same thing here.) Click Add Index.

Give the index a name and the location of the .csv file that was uploaded to HDFS using Hue. The wizard recognizes the fields and data types, and we can create the index by clicking Finish.
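The field and type recognition the wizard does can be sketched in plain Python. The helpers below (guess_type, infer_fields) are hypothetical illustrations, not Hue's actual code:

```python
import csv
from io import StringIO

def guess_type(value):
    """Very rough type guess from a sample value (hypothetical helper)."""
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

def infer_fields(csv_text):
    """Return [(field_name, guessed_type), ...] from a CSV with a header row,
    guessing each type from the first data row."""
    reader = csv.reader(StringIO(csv_text))
    header = next(reader)
    first_row = next(reader)
    return list(zip(header, [guess_type(v) for v in first_row]))

sample = "business,stars,city\nAcme,4.5,Austin\n"
infer_fields(sample)  # [('business', 'string'), ('stars', 'double'), ('city', 'string')]
```

Real type inference looks at more than one row, but the idea is the same: name from the header, type from the data.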

After the index is created we can see it on the left under Collections. Click on the index then click on Search. This will display a Dashboard and you can create various Dashboards from the data in Hue.


If you go to the Solr admin page you can get a lot of details about Solr and the index (collection) you created.



Next, running a list command on the Solr server shows the registered instance directories (one per collection config):

/root>solrctl instancedir --list















Install Anaconda Python package on Cloudera CDH.


This blog will show how to install Anaconda parcel in CDH to enable Pandas and other python libraries on Hue pySpark notebook.

Install Steps:

Installing the Anaconda Parcel

1. From the Cloudera Manager Admin Console, click the "Parcels" indicator in the top navigation bar.

2. Click the "Configuration" button on the top right of the Parcels page.

3. Click the plus symbol in the "Remote Parcel Repository URLs" section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/

4. Click the "Save Changes" button.

NOTE: The Anaconda parcel still didn't show up in the Parcels list. There was a NullPointerException in the Cloudera Manager log.

2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:com.cloudera.server.web.cmf.ParcelController: Synchronizing repos based on user request admin

2017-11-17 12:43:08,880 WARN ParcelUpdateService:com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: Error while executing CmfEntityManager task


at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)

at com.google.common.collect.Collections2.filter(Collections2.java:92)

at com.cloudera.parcel.components.ParcelDownloaderImpl$RepositoryInfo.getParcelsWithValidNames(ParcelDownloaderImpl.java:673)


The workaround is to remove the Sqoop URLs (both the http and https entries) from Cloudera Manager Parcels->Configuration->Remote Parcel Repository URLs. After that the Anaconda parcel showed up in the parcels list.




5. Click the "Download" button to the right of the Anaconda parcel listing.

6. After the parcel is downloaded, click the "Distribute" and "Activate" buttons to distribute and activate the parcel on all of the cluster nodes.

7. After the parcel is activated, Anaconda is available on all of the cluster nodes.

8. The Cloudera Manager Spark status will show a stale configuration. Restart and deploy the stale configuration by clicking the button in CM.

9. Anaconda is now deployed and can be used with the Hue pySpark notebook. Test a small example in Hue->Query->Editor->pySpark:

Example: Pandas test

#!/usr/bin/env python

def import_pandas(x):
    import pandas
    return x + 10

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x)).collect()



[11, 12, 13, 14]


Now you can start using all the Python libraries in the Anaconda package, such as pandas, NumPy and SciPy, from the Hue pySpark notebook.



Install Hue Spark Notebook with Livy on Cloudera

This blog will show simple steps to install and configure the Hue Spark notebook to run interactive pySpark scripts using Livy.

Environment used:

CDH 5.12.x , Cloudera Manager, Hue 4.0, Livy 0.3.0, Spark 1.6.0 on RHEL linux.

Sentry was installed in unsecured mode.

NOTE: Make sure the user who logs into Hue has access to Hive metastore.



Normally in Hue 4.0 the pySpark notebook is hidden and needs to be enabled in Cloudera Manager. In Cloudera Manager, go to Hue->Configuration, search for "safety", and in Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini add the following configuration, save the changes, and restart Hue. Note that you can keep some apps blacklisted, like hbase, impala and search, by listing them: app_blacklist=hbase,impala,search
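The safety-valve lines look roughly like this (the host and port values are assumptions for a single Livy server on its default port; adjust to your environment):

```ini
[desktop]
# an empty blacklist un-hides all apps, including the notebook
app_blacklist=

[notebook]
show_notebooks=true

[spark]
livy_server_host=localhost
livy_server_port=8998
livy_server_session_kind=yarn
```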



Livy requires at least Spark 1.6 and supports both Scala 2.10 and 2.11 builds of Spark. Check the Spark version on Linux with:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/





NOTE: I initially tried to install Livy 0.4.0 after downloading Apache Livy 0.4.0-incubating (zip) from https://livy.apache.org/download/.

That version didn't work with Hue 4.0, so I used version 0.3.0 from the Cloudera archive instead, as shown below.

Create Livy install directory under /opt/cloudera:

/opt/cloudera>mkdir livy

/opt/cloudera/livy>$ wget https://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip

2017-11-13 10:37:42 (65.1 MB/s) - "livy-server-0.3.0.zip" saved [95253743/95253743]

/opt/cloudera/livy>unzip livy-server-0.3.0.zip

Go to /opt/cloudera/livy/livy-server-0.3.0/conf and backup livy-env.sh

In /opt/cloudera/livy/livy-server-0.3.0/conf/livy-env.sh add below variables:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: create the directory /var/log/livy and change ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy



Start the Livy server using hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin>sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background run the command:

/opt/cloudera/livy/livy-server-0.3.0/bin>nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &


17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on http://xyz.com:8998

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check if Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0>curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [],
    "total": 0
}
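The same check can be scripted against Livy's REST API (the GET /sessions endpoint used above). The helpers below just build the endpoint URL and parse the JSON response; the base URL is an assumption (Livy's default port):

```python
import json

LIVY_URL = "http://localhost:8998"  # assumption: Livy on its default port

def sessions_url(base=LIVY_URL):
    # Livy's session-listing endpoint: GET /sessions
    return base + "/sessions"

def parse_sessions(body):
    # Parse the JSON body from GET /sessions into (total, session states)
    doc = json.loads(body)
    return doc["total"], [s["state"] for s in doc["sessions"]]

# With the empty response shown above:
parse_sessions('{"from": 0, "sessions": [], "total": 0}')  # (0, [])
```

On a node that can reach Livy, feed it live data with something like requests.get(sessions_url()).text.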



Go to the Hue UI and bring up Query->Editor->pySpark.

Run an example. The first time, it may take about 10 seconds to start the Spark context:


You will see result:


You can now run interactive pySpark scripts in Hue 4.0 Notebook.

If you now run the command below in Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0>curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}



Below is a pySpark example run on Hue notebook:

from random import random

# define NUM_SAMPLES first, e.g. NUM_SAMPLES = 10000

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


Below is the result:

Pi is roughly 3.140000
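For comparison, the same Monte Carlo estimate works without Spark in plain Python. estimate_pi below is an illustrative helper (the fixed seed just makes runs repeatable); Spark only adds value when NUM_SAMPLES is large enough to parallelize:

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / num_samples

estimate_pi(100000)  # close to 3.14 for large sample counts
```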



EXAMPLE: Query a Hive table and its schema

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create a Hive context
hive_context = HiveContext(sc)

print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM db1.table1 LIMIT 5")

print "Showing table schema and rows..."
mytbl.printSchema()  # Show the table schema
mytbl.show()         # Show the first rows of the DataFrame