Install Hue Spark Notebook with Livy on Cloudera

This blog shows the simple steps to install and configure the Hue Spark notebook to run interactive pySpark scripts using Livy.

Environment used:

CDH 5.12.x, Cloudera Manager, Hue 4.0, Livy 0.3.0, Spark 1.6.0 on RHEL Linux.

Sentry was installed in unsecured mode.

Kerberos was not used in the Hadoop cluster. A Kerberized cluster needs additional steps not covered in this blog.

NOTE: Make sure the user who logs into Hue has access to the Hive metastore.



In Hue 4.0 the pySpark notebook is hidden by default and needs to be enabled in Cloudera Manager. In Cloudera Manager, go to Hue -> Configuration, search for "safety", and in "Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini" add the required configuration, then save the changes and restart Hue. Note that you can keep some apps blacklisted, such as hbase, impala, and search, by listing them as app_blacklist=hbase,impala,search
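As a reference, the safety-valve snippet typically has the shape below. The exact values are an assumption for this setup: adjust the blacklist to taste, and point livy_server_host at the host where the Livy server will run later in this post.

```ini
[desktop]
# Keep any apps you still want hidden; just do not blacklist spark/notebook.
app_blacklist=hbase,impala,search

[spark]
# Livy endpoint (assumed host/port; 8998 is the Livy default).
livy_server_host=localhost
livy_server_port=8998

[notebook]
show_notebooks=true
```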



Livy requires at least Spark 1.6 and supports both the Scala 2.10 and 2.11 builds of Spark. Check the Spark version in Linux using the command:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0





NOTE: I initially tried to install Livy 0.4.0 after downloading the Apache Livy 0.4.0-incubating zip. That version didn't work with Hue 4.0, so I used version 0.3.0 from the Cloudera archive as shown below.

[NOTE: a newer Livy release should also work; try that version first before falling back to Livy 0.3.0.]

Create Livy install directory under /opt/cloudera:

/opt/cloudera> mkdir livy

/opt/cloudera/livy> wget

2017-11-13 10:37:42 (65.1 MB/s) - "" saved [95253743/95253743]


Go to /opt/cloudera/livy/livy-server-0.3.0/conf and back up the existing files.

In /opt/cloudera/livy/livy-server-0.3.0/conf/ add the variables below:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: create the directory /var/log/livy and change ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy



Start the Livy server as the hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin> sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background, use:

/opt/cloudera/livy/livy-server-0.3.0/bin> nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &


17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check that Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [],
    "total": 0
}



In the Hue UI, open Query -> Editor -> pySpark.

Run an example. The first time, it may take about 10 seconds to start the Spark context, after which you will see the result. You can now run interactive pySpark scripts in the Hue 4.0 notebook.

If you now run the command below in Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}
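The session list is plain JSON, so it can also be inspected programmatically. A minimal sketch (the field names are taken from the output above; the response string stands in for what a real GET of /sessions would return):

```python
import json

# Sample response in the shape returned by GET /sessions (values from above).
response = """
{
  "from": 0,
  "sessions": [
    {"id": 2, "kind": "pyspark", "state": "idle",
     "appId": null, "owner": null, "proxyUser": null, "log": []}
  ],
  "total": 1
}
"""

data = json.loads(response)
# List each session's id, kind and state.
for s in data["sessions"]:
    print("session %d: %s (%s)" % (s["id"], s["kind"], s["state"]))
```

In a real deployment you would fetch the response from http://localhost:8998/sessions instead of the inline string.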



Below is a pySpark example run in the Hue notebook (Python 2 syntax, as used by Spark 1.6):

from random import random

NUM_SAMPLES = 100000  # sample count; the value was not shown in the original post

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


Below is the result:

Pi is roughly 3.140000
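The same Monte-Carlo estimate can be sanity-checked locally without Spark. A plain-Python sketch of the idea (a fixed seed is used here so the result is repeatable; the helper name is my own):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte-Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(200000))
```

With enough samples the estimate converges toward 3.14159..., which is exactly what the distributed version above computes across Spark executors.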



EXAMPLE: Query a Hive table and its schema

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext


# Create a Hive context (sc is the SparkContext provided by the notebook)
hive_context = HiveContext(sc)


print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM db1.table1 LIMIT 5")


print "Registering DataFrame as a table..."
mytbl.registerTempTable("mytbl")

# Show the schema and first rows of the DataFrame
mytbl.printSchema()
mytbl.show()





