
Install Jupyter notebook with Livy for Spark on Cloudera Hadoop

Environment

  • Cloudera CDH 5.12.x running Livy and Spark (see the other blog post on this website on installing Livy)
  • Anaconda parcel installed using Cloudera Manager (see the other blog post on this website on installing the Anaconda parcel on CDH)

We will first install Anaconda and Sparkmagic on Windows 10; Anaconda provides the Jupyter Notebook.

1. We strongly recommend installing Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

2. Download the Anaconda installer from https://www.anaconda.com/download/ . We used the Anaconda 5.0.1 Windows installer, Python 3.6 version, 64-bit.

3. After the package installs successfully on Windows, open Anaconda Navigator from the Windows Start menu and click the JupyterLab Launch button.

4. In a notebook in the JupyterLab browser, try to load sparkmagic using:

%load_ext sparkmagic.magics

This gives a module-not-found error, because sparkmagic is not installed yet.

5. We need to install sparkmagic. Run the following command in the notebook:

!pip install sparkmagic

The output shows a successful install message: sparkmagic-0.12.5 is installed along with a few other packages.

After this, run in the notebook:

!pip show sparkmagic

Name: sparkmagic
Version: 0.12.5
Summary: SparkMagic: Spark execution via Livy
Home-page: https://github.com/jupyter-incubator/sparkmagic
Author: Jupyter Development Team
Author-email: jupyter@googlegroups.org
License: BSD 3-clause
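
With sparkmagic installed, the extension load from step 4 should now succeed:

%load_ext sparkmagic.magics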

 

NEXTSTEP:

From https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json

Copy example_config.json to ~/.sparkmagic/config.json (on Windows this is usually C:\Users\youruserid\.sparkmagic\config.json). Change the username, password, and URL of the Livy server:

{
  "kernel_python_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998"
  }
}
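
Before moving on, it helps to confirm that the Livy endpoint in config.json is reachable from the Windows machine. Below is a minimal sketch, assuming the requests package bundled with Anaconda and the livyhostname placeholder from the config above:

# Connectivity check for the Livy endpoint configured above.
# Replace "livyhostname" with your Livy server's hostname.
import requests

resp = requests.get("http://livyhostname:8998/sessions")
print(resp.json())  # a fresh server returns {'from': 0, 'total': 0, 'sessions': []}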

 

Next, launch the JupyterLab browser:

Click Windows -> Start -> Anaconda Navigator and launch JupyterLab. Then run the command below in a Jupyter notebook:

!jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
- Validating: ok

 

Next run the following commands in the Jupyter notebook:

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pysparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pyspark3kernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkrkernel

!jupyter serverextension enable --py sparkmagic
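
To confirm that the four sparkmagic kernels were registered, list the installed kernelspecs:

!jupyter kernelspec list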

 

NEXTSTEP:

In the top-right corner of the Jupyter notebook, click on Python 3 and change to the PySpark kernel.

Make sure the Livy server is running by issuing curl on the Livy host:

$ curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}

Run a simple command such as 1+1 in the Jupyter notebook; it will connect to the Spark cluster and start a Spark application with your first command.
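
sparkmagic also provides session magics in the PySpark kernel; for example, %%info prints the Livy sessions for the configured endpoint (behaviour as of sparkmagic 0.12.x):

%%info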

Run another command:

%%sql

show databases

It will display the list of databases defined in Hive on the Hadoop Spark cluster.
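
The %%sql magic can also pull the result set into the local notebook as a pandas dataframe via its -o option (databases_df below is an arbitrary name of our choosing):

%%sql -o databases_df
show databases

The databases_df dataframe can then be inspected on the Windows side in cells marked with the %%local magic.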

 

Another EXAMPLE: draw a plot and store it in a PDF file on the Livy server:

import os
import matplotlib
matplotlib.use('Agg')  # select the non-interactive Agg backend before pyplot is used
import matplotlib.pyplot as plt

mydir = r'/home/yourlinuxid/'
os.chdir(mydir)
os.getcwd()

plt.plot([1, 4, 3, 6, 12, 20])
plt.savefig('myplot1.pdf')

 

Now download myplot1.pdf from your home directory on the Linux server running Livy, and you can see the graph in the PDF.

You have now successfully installed Jupyter Notebook on Windows 10 and run Python through PySpark to access Livy and Spark on the Hadoop backend.

 


REFERENCES:

https://github.com/jupyter-incubator/sparkmagic

https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d

https://spark-summit.org/east-2017/events/secured-kerberos-based-spark-notebook-for-data-science/

 

 

 


Install Hue Spark Notebook with Livy on Cloudera

This blog will show simple steps to install and configure the Hue Spark notebook to run interactive pySpark scripts using Livy.

Environment used:

CDH 5.12.x, Cloudera Manager, Hue 4.0, Livy 0.3.0, Spark 1.6.0 on RHEL Linux.

Sentry was installed in unsecured mode.

NOTE: Make sure the user who logs into Hue has access to Hive metastore.

 

<<<NEXT STEPS>>>               

Normally in Hue 4.0 the pySpark notebook is hidden and needs to be enabled in Cloudera Manager. In Cloudera Manager, go to Hue -> Configuration, search for "safety", and in Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini add the following configuration; then save the changes and restart Hue. Note that you can keep some apps blacklisted (for example hbase, impala, and search) if you want, as shown in the second snippet below.

[desktop]
app_blacklist=
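
For example, to keep hbase, impala, and search blacklisted while still enabling the notebook:

[desktop]
app_blacklist=hbase,impala,search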

 

Livy requires at least Spark 1.6 and supports both Scala 2.10 and 2.11 builds of Spark. Check the Spark version on Linux using the command:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

 

<<<NEXT STEPS>>>               

 

NOTE: We initially tried to install Livy 0.4.0 after downloading Apache Livy 0.4.0-incubating (zip) from https://livy.apache.org/download/ . This version didn't work with Hue 4.0, so we used version 0.3.0 from the Cloudera archive, as below.

Create the Livy install directory under /opt/cloudera:

/opt/cloudera> mkdir livy

/opt/cloudera/livy> wget https://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip

2017-11-13 10:37:42 (65.1 MB/s) - "livy-server-0.3.0.zip" saved [95253743/95253743]

/opt/cloudera/livy> unzip livy-server-0.3.0.zip

Go to /opt/cloudera/livy/livy-server-0.3.0/conf and back up livy-env.sh.

In /opt/cloudera/livy/livy-server-0.3.0/conf/livy-env.sh add the following variables:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: create the directory /var/log/livy and change its ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy

 

<<<NEXT STEPS>>>               

Start the Livy server as the hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin>sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background, run the command:

/opt/cloudera/livy/livy-server-0.3.0/bin>nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &

 

17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on http://xyz.com:8998

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check if Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0    34    0    34    0     0   6178      0 --:--:-- --:--:-- --:--:--  8500
{
    "from": 0,
    "sessions": [],
    "total": 0
}
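
Hue drives Livy through this same REST API, so you can also exercise it directly. Below is a minimal Python sketch, assuming the requests package is installed; /sessions and /sessions/{id}/statements are Livy's standard REST endpoints:

import time
import requests

host = "http://localhost:8998"

# Create an interactive pyspark session
r = requests.post(host + "/sessions", json={"kind": "pyspark"})
session_url = host + r.headers["Location"]

# Wait until the session leaves the "starting" state and becomes idle
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# Submit a statement, then poll until its result is available
r = requests.post(session_url + "/statements", json={"code": "1 + 1"})
statement_url = host + r.headers["Location"]
while True:
    stmt = requests.get(statement_url).json()
    if stmt["state"] == "available":
        print(stmt["output"]["data"]["text/plain"])  # prints: 2
        break
    time.sleep(1)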

 

<<<NEXT STEPS>>>               

Go to the Hue UI and open Query -> Editor -> pySpark.

Run an example. The first time, it may take about 10 seconds to start the Spark context:

print(1+2)

You will see the result:

3

You can now run interactive pySpark scripts in Hue 4.0 Notebook.

If you now run the command below on Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0> curl localhost:8998/sessions | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
176   176  176   176    0     0  33690      0 --:--:-- --:--:-- --:--:-- 44000
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}

 

<<<SOME EXAMPLES>>>: 

Below is a pySpark example run in the Hue notebook:


from random import random

NUM_SAMPLES = 1000

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


 

Below is the result:

Pi is roughly 3.140000

 

 

EXAMPLE: Query a Hive table and its schema

#!/usr/bin/env python
from pyspark.sql import HiveContext

# Create a Hive context (sc is already provided by the Hue notebook session)
hive_context = HiveContext(sc)

print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM db1.table1 LIMIT 5")

print "Showing DataFrame rows and schema..."
mytbl.show()          # Show the first rows of the dataframe
mytbl.printSchema()   # Print the table schema
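
As a small follow-on sketch (db1.table1 being the placeholder table above), the same HiveContext can count rows with the Spark 1.6 DataFrame API:

# Count all rows in the placeholder table
row_count = hive_context.sql("SELECT * FROM db1.table1").count()
print "Row count: %d" % row_count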

 

REFERENCES:

https://blogs.msdn.microsoft.com/pliu/2016/06/18/run-hue-spark-notebook-on-cloudera/

http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark-2-2/