Install Hue Spark Notebook with Livy on Cloudera

This blog shows simple steps to install and configure the Hue Spark notebook to run interactive pySpark scripts using Livy.

Environment used:

CDH 5.12.x, Cloudera Manager, Hue 4.0, Livy 0.3.0, and Spark 1.6.0 on RHEL Linux.

Sentry was installed in unsecured mode.

Kerberos was not used in the Hadoop cluster. A Kerberized cluster needs additional steps not covered in this blog.

NOTE: Make sure the user who logs into Hue has access to the Hive metastore.

<<<ENABLE THE SPARK NOTEBOOK IN HUE>>>

In Hue 4.0 the pySpark notebook is hidden by default and needs to be enabled in Cloudera Manager. Go to Hue -> Configuration, search for "safety", and in "Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini" add the following configuration. Then save the changes and restart Hue. Note that you can keep some apps blacklisted, such as hbase, impala, and search, by listing them: app_blacklist=hbase,impala,search

[desktop]
app_blacklist=
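For example, if you want the Spark notebook enabled but prefer to keep the HBase, Impala, and Search apps hidden, the safety-valve entry would instead look like this (any comma-separated subset of app names works the same way):

```ini
[desktop]
# Hide only these apps; everything else, including the Spark notebook, is shown
app_blacklist=hbase,impala,search
```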

Livy requires at least Spark 1.6 and supports both the Scala 2.10 and 2.11 builds of Spark. Check the Spark version in Linux using:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

<<<INSTALL LIVY>>>

NOTE: I initially tried Livy 0.4.0 after downloading Apache Livy 0.4.0-incubating (zip) from https://livy.apache.org/download/. That version did not work with Hue 4.0, so I used version 0.3.0 from the Cloudera archive as shown below.

[NOTE: livy-0.5.0-incubating-bin.zip or a higher version from https://livy.apache.org/download/ should also work; try that version first before falling back to Livy 0.3.0.]

Create Livy install directory under /opt/cloudera:

/opt/cloudera> mkdir livy

/opt/cloudera/livy> wget https://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip

2017-11-13 10:37:42 (65.1 MB/s) - "livy-server-0.3.0.zip" saved [95253743/95253743]

/opt/cloudera/livy> unzip livy-server-0.3.0.zip

Go to /opt/cloudera/livy/livy-server-0.3.0/conf and back up livy-env.sh.

Then add the variables below to /opt/cloudera/livy/livy-server-0.3.0/conf/livy-env.sh:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: Create the directory /var/log/livy and change its ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy

<<<START THE LIVY SERVER>>>

Start the Livy server as the hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin>sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background, run:

/opt/cloudera/livy/livy-server-0.3.0/bin>nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &

17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on http://xyz.com:8998

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check that Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [],
    "total": 0
}
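Besides listing sessions, the Livy REST API lets you create a session and submit statements to it directly, which is handy for checking the server before wiring up Hue. Below is a minimal sketch using only the Python standard library. It assumes Livy is reachable at localhost:8998 (adjust `LIVY_URL` for your host); the helper names are my own, not part of Livy.

```python
import json
from urllib import request

LIVY_URL = "http://localhost:8998"  # adjust to your Livy server host/port

def make_session_payload(kind="pyspark"):
    # Body for POST /sessions -- asks Livy to start a Spark session of this kind
    return {"kind": kind}

def make_statement_payload(code):
    # Body for POST /sessions/{id}/statements -- code to run in the session
    return {"code": code}

def post_json(path, payload):
    # POST a JSON payload to the Livy REST API and return the parsed reply
    req = request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def run_example():
    # Create a pyspark session, then submit a trivial statement to it
    session = post_json("/sessions", make_session_payload())
    print("session id:", session["id"])
    result = post_json("/sessions/%d/statements" % session["id"],
                       make_statement_payload("print(1 + 2)"))
    print(result)

# run_example()  # uncomment when a Livy server is reachable on port 8998
```

Note that a freshly created session starts in the "starting" state; poll GET /sessions/{id} until it reports "idle" before submitting statements.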

<<<TEST THE PYSPARK NOTEBOOK IN HUE>>>

Go to the Hue UI and open Query -> Editor -> pySpark.

Run an example. The first time, it may take about 10 seconds to start the Spark context:

print(1+2)

You will see the result:

3

You can now run interactive pySpark scripts in Hue 4.0 Notebook.

If you now run the command below in Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}

<<<SOME EXAMPLES>>>

Below is a pySpark example run in the Hue notebook (a Monte Carlo estimate of pi):


import random
print(random.random())

NUM_SAMPLES = 5000

# Return 1 if a random point in the unit square falls inside the quarter circle
def sample(p):
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
    .reduce(lambda a, b: a + b)

print('Pi is roughly %f' % (4.0 * count / NUM_SAMPLES))
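To see why this works without a cluster, here is the same Monte Carlo logic in plain Python: points are drawn uniformly in the unit square, and the fraction landing inside the quarter circle approximates pi/4. This is just an illustrative sketch (the `estimate_pi` helper and the fixed seed are my own, added for reproducibility), not part of the Spark job above.

```python
import random

def estimate_pi(num_samples, seed=42):
    # Fraction of random points in the unit square that fall inside
    # the quarter circle, times 4, approximates pi
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / num_samples

print('Pi is roughly %f' % estimate_pi(100000))
```

The Spark version simply distributes the `sample` calls across executors with `parallelize` and sums the hits with `reduce`; the math is identical.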

Below is the result:

Pi is roughly 3.140000

EXAMPLE: Query a Hive table and print its schema

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create a Hive context from the notebook's SparkContext (sc)
hive_context = HiveContext(sc)
print("Reading Hive table...")

mytbl = hive_context.sql("SELECT * FROM db1.table1 LIMIT 5")

mytbl.show()         # Show the first rows of the DataFrame
mytbl.printSchema()  # Print the table schema
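Since the script above has a `#!/usr/bin/env python` shebang, it can also be run non-interactively as a Livy batch job via POST /batches, where the request body names a script reachable by the cluster (typically an HDFS path). Below is a sketch that only builds the request body; the HDFS path and the `make_batch_payload` helper are hypothetical examples, not fixed names.

```python
import json

def make_batch_payload(script_path, args=None, conf=None):
    # Body for POST /batches -- Livy runs the script as a Spark batch job.
    # script_path must be reachable by the cluster, e.g. an HDFS path.
    payload = {"file": script_path}
    if args:
        payload["args"] = args
    if conf:
        payload["conf"] = conf
    return payload

# Hypothetical HDFS location; upload the script there first, e.g.:
#   hdfs dfs -put query_hive.py /user/hdfs/
print(json.dumps(make_batch_payload("hdfs:///user/hdfs/query_hive.py")))

# The resulting JSON can then be submitted with, for example:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d '{"file": "hdfs:///user/hdfs/query_hive.py"}' \
#        localhost:8998/batches
```

Batch jobs show up under GET /batches rather than GET /sessions, and each one runs as its own Spark application.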

REFERENCES:

https://blogs.msdn.microsoft.com/pliu/2016/06/18/run-hue-spark-notebook-on-cloudera/

http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark-2-2/
