Install Hue Spark Notebook with Livy on Cloudera

This blog shows the simple steps to install and configure the Hue Spark notebook to run interactive pySpark scripts using Livy.

Environment used:

CDH 5.12.x, Cloudera Manager, Hue 4.0, Livy 0.3.0, Spark 1.6.0 on RHEL Linux.

Sentry was installed in unsecured mode.

Kerberos was not used in the Hadoop cluster. A Kerberized cluster needs additional steps not covered in this blog.

NOTE: Make sure the user who logs into Hue has access to the Hive metastore.



In Hue 4.0 the pySpark notebook is hidden by default and needs to be enabled in Cloudera Manager. In Cloudera Manager, go to Hue -> Configuration, search for "safety", and in "Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini" add the required configuration, then save the changes and restart Hue. Note that you can keep some apps blacklisted, such as hbase, impala, and search, by listing them as app_blacklist=hbase,impala,search
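As a reference, the safety-valve snippet typically has the shape below. The exact values are an assumption for this setup: adjust the blacklist to taste, and point livy_server_host at the host where the Livy server will run later in this post.

```ini
[desktop]
# Keep any apps you still want hidden; just do not blacklist spark/notebook.
app_blacklist=hbase,impala,search

[spark]
# Livy endpoint (assumed host/port; 8998 is the Livy default).
livy_server_host=localhost
livy_server_port=8998

[notebook]
show_notebooks=true
```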



Livy requires at least Spark 1.6 and supports both the Scala 2.10 and 2.11 builds of Spark. Check the Spark version in Linux using the command:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0





NOTE: I initially tried to install Livy 0.4.0 after downloading the Apache Livy 0.4.0-incubating zip. That version didn't work with Hue 4.0, so I used version 0.3.0 from the Cloudera archive as shown below.

[NOTE: a newer Livy release should also work; try that version first before falling back to Livy 0.3.0.]

Create Livy install directory under /opt/cloudera:

/opt/cloudera> mkdir livy

/opt/cloudera/livy> wget

2017-11-13 10:37:42 (65.1 MB/s) - "" saved [95253743/95253743]


Go to /opt/cloudera/livy/livy-server-0.3.0/conf and back up the existing files.

In /opt/cloudera/livy/livy-server-0.3.0/conf/ add the variables below:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: create the directory /var/log/livy and change ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy



Start the Livy server as the hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin> sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background, use:

/opt/cloudera/livy/livy-server-0.3.0/bin> nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &


17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check that Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [],
    "total": 0
}



In the Hue UI, open Query -> Editor -> pySpark.

Run an example. The first time, it may take about 10 seconds to start the Spark context, after which you will see the result. You can now run interactive pySpark scripts in the Hue 4.0 notebook.

If you now run the command below in Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}
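The session list is plain JSON, so it can also be inspected programmatically. A minimal sketch (the field names are taken from the output above; the response string stands in for what a real GET of /sessions would return):

```python
import json

# Sample response in the shape returned by GET /sessions (values from above).
response = """
{
  "from": 0,
  "sessions": [
    {"id": 2, "kind": "pyspark", "state": "idle",
     "appId": null, "owner": null, "proxyUser": null, "log": []}
  ],
  "total": 1
}
"""

data = json.loads(response)
# List each session's id, kind and state.
for s in data["sessions"]:
    print("session %d: %s (%s)" % (s["id"], s["kind"], s["state"]))
```

In a real deployment you would fetch the response from http://localhost:8998/sessions instead of the inline string.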



Below is a pySpark example run in the Hue notebook (Python 2 syntax, as used by Spark 1.6):

from random import random

NUM_SAMPLES = 100000  # sample count; the value was not shown in the original post

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


Below is the result:

Pi is roughly 3.140000
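The same Monte-Carlo estimate can be sanity-checked locally without Spark. A plain-Python sketch of the idea (a fixed seed is used here so the result is repeatable; the helper name is my own):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte-Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(200000))
```

With enough samples the estimate converges toward 3.14159..., which is exactly what the distributed version above computes across Spark executors.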



EXAMPLE: Query a Hive table and its schema

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext


# Create a Hive context (sc is the SparkContext provided by the notebook)
hive_context = HiveContext(sc)


print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM db1.table1 LIMIT 5")


print "Registering DataFrame as a table..."
mytbl.registerTempTable("mytbl")

# Show the schema and first rows of the DataFrame
mytbl.printSchema()
mytbl.show()





