Install Hue Spark Notebook with Livy on Cloudera

This blog shows the simple steps to install and configure the Hue Spark notebook so that you can run interactive pySpark scripts through Livy.

Environment used:

CDH 5.12.x, Cloudera Manager, Hue 4.0, Livy 0.3.0, Spark 1.6.0 on RHEL Linux.

Sentry was installed in non-secure mode.

Kerberos was not used in the Hadoop cluster. A Kerberized cluster requires additional steps that are not covered in this blog.

NOTE: Make sure the user who logs into Hue has access to the Hive metastore.

 

<<<NEXT STEPS>>>               

In Hue 4.0 the pySpark notebook is hidden by default and needs to be enabled in Cloudera Manager. In Cloudera Manager, go to Hue -> Configuration, search for "safety", and in Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini add the following configuration, save the changes, and restart Hue. Note that you can keep some apps blacklisted, such as hbase, impala, and search, by listing them, for example app_blacklist=hbase,impala,search.

[desktop]
app_blacklist=
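
If Hue and Livy do not run on the same host (or Livy listens on a non-default port), Hue also has to be pointed at the Livy server. Below is a rough sketch of the extra snippet for the same safety valve, assuming the standard Hue [spark] settings; livyhost.example.com is a placeholder for your Livy host:

[spark]
livy_server_host=livyhost.example.com
livy_server_port=8998
livy_server_session_kind=yarn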

 

Livy requires at least Spark 1.6 and supports both the Scala 2.10 and 2.11 builds of Spark. Check the Spark version on Linux with the command:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

 

<<<NEXT STEPS>>>               

 

NOTE: I initially tried to install Livy 0.4.0 after downloading Apache Livy 0.4.0-incubating (zip) from https://livy.apache.org/download/. That version did not work with Hue 4.0, so I used version 0.3.0 from the Cloudera archive as shown below.

[NOTE: The livy-0.5.0-incubating-bin.zip build from https://livy.apache.org/download/ should also work; try that version first before falling back to Livy 0.3.0.]

Create the Livy install directory under /opt/cloudera:

/opt/cloudera>mkdir livy

/opt/cloudera/livy>wget https://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip

2017-11-13 10:37:42 (65.1 MB/s) - "livy-server-0.3.0.zip" saved [95253743/95253743]

/opt/cloudera/livy>unzip livy-server-0.3.0.zip

Go to /opt/cloudera/livy/livy-server-0.3.0/conf and back up livy-env.sh.

In /opt/cloudera/livy/livy-server-0.3.0/conf/livy-env.sh, add the variables below:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: Create the directory /var/log/livy and change its ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy
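
Optionally, Livy's own behaviour can be tuned in /opt/cloudera/livy/livy-server-0.3.0/conf/livy.conf (copy livy.conf.template to livy.conf first). A minimal sketch, assuming the standard Apache Livy configuration keys and that sessions should run on YARN with user impersonation:

# Run Livy sessions on YARN instead of local mode
livy.spark.master = yarn
livy.spark.deploy-mode = client
# Submit jobs as the requesting Hue user rather than the Livy process owner
livy.impersonation.enabled = true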

 

<<<NEXT STEPS>>>               

Start the Livy server as the hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin>sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background, run the command:

/opt/cloudera/livy/livy-server-0.3.0/bin>nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &

 

17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on http://xyz.com:8998

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check if Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0>curl localhost:8998/sessions | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0    34    0    34    0     0   6178      0 --:--:-- --:--:-- --:--:--  8500
{
    "from": 0,
    "sessions": [],
    "total": 0
}
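
The Hue notebook talks to Livy through this same REST API. If you want to exercise the API directly, below is a minimal sketch using the Python requests library (it assumes Livy on localhost:8998 with no authentication, and the submitted statement is just an illustration):

import json
import time
import requests

LIVY_URL = "http://localhost:8998"   # assumed Livy endpoint
HEADERS = {"Content-Type": "application/json"}

# Create an interactive pySpark session
resp = requests.post(LIVY_URL + "/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=HEADERS)
session_url = LIVY_URL + resp.headers["Location"]   # e.g. /sessions/0

# Wait until the session is idle before submitting code
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(5)

# Submit a statement and poll until its result is available
resp = requests.post(session_url + "/statements",
                     data=json.dumps({"code": "print(1 + 2)"}),
                     headers=HEADERS)
statement_url = LIVY_URL + resp.headers["Location"]
while True:
    result = requests.get(statement_url, headers=HEADERS).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)

# Delete the session to release its Spark/YARN resources
requests.delete(session_url, headers=HEADERS)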

 

<<<NEXT STEPS>>>               

Go to the Hue UI and open Query -> Editor -> pySpark.

Run an example. The first time, it may take around 10 seconds to start the Spark context:

print(1+2)

You will see result:

3

You can now run interactive pySpark scripts in Hue 4.0 Notebook.

If you now run the command below on Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0>curl localhost:8998/sessions | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
176   176  176   176    0     0  33690      0 --:--:-- --:--:-- --:--:-- 44000
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}

 

<<<SOME EXAMPLES>>>: 

Below is a pySpark example run in the Hue notebook:


from random import random

NUM_SAMPLES = 1000
def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


 

Below is the result:

Pi is roughly 3.140000

 

 

EXAMPLE: Query a Hive table and its schema

#!/usr/bin/env python

from pyspark import SparkContext

from pyspark.sql import HiveContext

 

# Create a Hive Context

hive_context = HiveContext(sc)

 

print "Reading Hive table..."

mytbl = hive_context.sql("SELECT * FROM db1.table1 LIMIT 5")

 

print "Showing DataFrame rows and schema..."

mytbl.show()  # Show the first rows of the DataFrame

mytbl.printSchema()  # Print the table schema
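
If you want to reuse the query result in further SQL, the DataFrame can also be registered as a temporary table. A short sketch (the temp table name below is arbitrary):

# Register the DataFrame so it can be queried with SQL in this session
mytbl.registerTempTable("table1_tmp")

print "Row count:"
hive_context.sql("SELECT COUNT(*) AS cnt FROM table1_tmp").show()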

 

REFERENCES:

https://blogs.msdn.microsoft.com/pliu/2016/06/18/run-hue-spark-notebook-on-cloudera/

http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark-2-2/

 

 

 
