
Install JupyterHub

NOTE: This post is still a draft; do not rely on these instructions yet.

Jupyterhub Prerequisites:

Before installing JupyterHub, you will need:

  • a Linux/Unix based system

  • Python 3.4 or greater. An understanding of using pip or conda for installing Python packages is helpful.

Installation using conda:

Check if Anaconda package is already installed:

dpkg -l | grep conda

If Anaconda is not installed, download the Anaconda Linux installer from https://www.anaconda.com/download/#linux

/home/myuserid/mydownloads/anaconda# wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh

Anaconda3-5.0.1-Linux-x86_64. 100%[==============================================>] 525.28M 3.46MB/s in 2m 35s
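Before running a ~525 MB installer it is worth verifying the download. A minimal sketch (a hypothetical helper; compare the result against the SHA-256 hash published on the Anaconda archive page):

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks so a
    large installer does not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# e.g. sha256sum("Anaconda3-5.0.1-Linux-x86_64.sh")
```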

Install it with the following command:

$ bash Anaconda3-5.0.1-Linux-x86_64.sh

Do you accept the license terms? [yes|no]
[no] >>> yes

Anaconda3 will now be installed into this location:
/root/anaconda3

- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below

[/root/anaconda3] >>> /opt/anaconda3

installing: anaconda-5.0.1-py36hd30a520_1 …

installation finished.

Do you wish the installer to prepend the Anaconda3 install location

to PATH in your /root/.bashrc ? [yes|no]

[no] >>> yes

Appending source /opt/anaconda3/bin/activate to /root/.bashrc

$ source ~/.bashrc

Verify the Anaconda installation version:

$ conda list

 

# packages in environment at /opt/anaconda3:
#
_ipyw_jlab_nb_ext_conf 0.1.0 py36he11e457_0
alabaster 0.7.10 py36h306e16b_0
anaconda 5.0.1 py36hd30a520_1
anaconda-client 1.6.5 py36h19c0dcd_0
anaconda-navigator 1.6.9 py36h11ddaaa_0
anaconda-project 0.8.0 py36h29abdf5_0

 

Next, install the JupyterHub package:

root$:~/mydownloads/anaconda# conda install -c conda-forge jupyterhub

# conda install notebook

 

Test the install:

$ jupyterhub -h

$ configurable-http-proxy -h

$ jupyterhub

Running jupyterhub produced an error:

[I 2017-12-04 00:20:23.809 JupyterHub app:871] Writing cookie_secret to /opt/anaconda3/jupyterhub_cookie_secret
[I 2017-12-04 00:20:23.876 alembic.runtime.migration migration:117] Context impl SQLiteImpl.
[I 2017-12-04 00:20:23.877 alembic.runtime.migration migration:122] Will assume non-transactional DDL.
[I 2017-12-04 00:20:23.922 alembic.runtime.migration migration:327] Running stamp_revision -> 3ec6993fe20c
[W 2017-12-04 00:20:24.272 JupyterHub app:955] No admin users, admin interface will be unavailable.
[W 2017-12-04 00:20:24.272 JupyterHub app:956] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2017-12-04 00:20:24.272 JupyterHub app:983] Not using whitelist. Any authenticated user will be allowed.
[E 2017-12-04 00:20:24.328 JupyterHub app:1525] Failed to bind hub to http://127.0.0.1:8081/hub/
[E 2017-12-04 00:20:24.330 JupyterHub app:1623]
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterhub/app.py", line 1621, in launch_instance_async
    yield self.start()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterhub/app.py", line 1523, in start
    self.http_server.listen(self.hub_port, address=self.hub_ip)
  File "/opt/anaconda3/lib/python3.6/site-packages/tornado/tcpserver.py", line 142, in listen
    sockets = bind_sockets(port, address=address)
  File "/opt/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 197, in bind_sockets
    sock.bind(sockaddr)
PermissionError: [Errno 13] Permission denied

Resolution: checking with netstat -an showed that port 8081 was already in use on the Ubuntu server by a McAfee process. So the hub port was changed in the config file /etc/jupyterhub/jupyterhub_config.py to:

c.JupyterHub.hub_port = 8181

See below for how to generate and update the config file. After this change JupyterHub started successfully.
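The port conflict can also be detected from Python before editing the config. A minimal sketch (a hypothetical helper, standing in for the netstat check):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port,
    a quick stand-in for `netstat -an | grep <port>`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0

# e.g. port_in_use(8081) would have returned True on this server
```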

 

Configuration of JupyterHub:

We recommend storing configuration files in the standard UNIX filesystem location, i.e. /etc/jupyterhub. Create the configuration file:

# jupyterhub --generate-config -f /etc/jupyterhub/jupyterhub_config.py
Writing default config to: /etc/jupyterhub/jupyterhub_config.py

$ openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -keyout jupyterhub.key -out jupyterhub.crt

Add the following values to /etc/jupyterhub/jupyterhub_config.py, pointing at the generated SSL certificates:

c.JupyterHub.ssl_cert = '/etc/jupyterhub/sslcerts/jupyterhub.crt'
c.JupyterHub.ssl_key = '/etc/jupyterhub/sslcerts/jupyterhub.key'
c.JupyterHub.port = 443

c.JupyterHub.hub_port = 8181

Create a user and group in Linux:

$ groupadd mygrp

$ useradd -m -g mygrp -c "my user" myusr

Set a password for myusr (e.g. with passwd myusr).

Next, start the JupyterHub server:

$ jupyterhub -f /etc/jupyterhub/jupyterhub_config.py

[I 2017-12-04 22:27:28.147 JupyterHub app:1581] JupyterHub is now running at https://:443/

Log in as myusr with the password set above.

 

 

 

 

The following steps are based on the JupyterHub PyData 2016 tutorial repository (minrk/jupyterhub-pydata-2016):

$ sudo mkdir /srv/jupyterhub

$ cd /srv/jupyterhub

$ sudo chown charlotte .

$ git clone http://github.com/minrk/jupyterhub-pydata-2016

$ cd jupyterhub-pydata-2016

$ conda env create -f environment.yml

$ source activate jupyterhub-tutorial

$ conda config --add channels conda-forge

To use DockerSpawner, install Docker (https://docs.docker.com/engine/installation), then:

$ pip install dockerspawner

$ docker pull jupyterhub/singleuser

 

/srv/jupyterhub$ git clone https://github.com/letsencrypt/letsencrypt

cd letsencrypt

./letsencrypt-auto certonly --standalone -d mydomain.tld

key: /etc/letsencrypt/live/mydomain.tld/privkey.pem

cert: /etc/letsencrypt/live/mydomain.tld/fullchain.pem

 

A DNS domain name (e.g. mydomain.com) must be set up first.

/srv/jupyterhub/jupyterhub-pydata-2016$ jupyterhub --generate-config

Edit the config file:

c.JupyterHub.ssl_key = 'jupyterhub.key'

c.JupyterHub.ssl_cert = 'jupyterhub.crt'

c.JupyterHub.port = 443

cp jupyterhub_config.py-ssl ./jupyterhub_config.py

#Configuration file for jupyterhub

c.JupyterHub.ssl_key = '/etc/letsencrypt/live/hub-tutorial.jupyter.org/privkey.pem'

c.JupyterHub.ssl_cert = '/etc/letsencrypt/live/hub-tutorial.jupyter.org/fullchain.pem'

c.JupyterHub.port = 443

$sudo setcap cap_net_bind_service=+ep /home/charlotte/anaconda3/envs/jupyterhub-tutorial/bin/node

/etc/letsencrypt$ sudo chmod 777 -R archive/

/etc/letsencrypt$ sudo chmod 777 -R live/

(Note: chmod 777 is overly permissive; tighten these permissions in production.)

/srv/jupyterhub/jupyterhub-pydata-2016$ jupyterhub

 

Install Kernels for all users:

conda create -n py2 python=2 ipykernel

conda run -n py2 -- ipython kernel install

jupyter kernelspec list

 

Authentication

Using GitHub OAuth

https://github.com/settings/applications/new

GitHub provides a client ID and client secret, and we need to provide an OAuth callback URL.

In ./env in repository:

export GITHUB_CLIENT_ID=<client id from GitHub>

export GITHUB_CLIENT_SECRET=<client secret from GitHub>

export OAUTH_CALLBACK_URL=https://hub-tutorial.jupyter.org/hub/oauth_callback

$ cd /srv/jupyterhub

$ source ./env

python3 -m pip install oauthenticator

In jupyterhub_config.py:

from oauthenticator.github import LocalGitHubOAuthenticator

c.JupyterHub.authenticator_class = LocalGitHubOAuthenticator

c.LocalGitHubOAuthenticator.create_system_users = True

Restart JupyterHub and sign in with GitHub.

 

Using DockerSpawner

python3 -m pip install dockerspawner netifaces

docker pull jupyterhub/singleuser

In jupyterhub_config.py:

from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator

from dockerspawner import DockerSpawner

c.JupyterHub.spawner_class = DockerSpawner

 

Reference deployments:

https://github.com/jupyterhub/jupyterhub-deploy-docker (uses docker-compose)

https://blogs.msdn.microsoft.com/uk_faculty_connection/2017/08/10/running-jupyterhub-on-and-off-campus-architectural-scenarios/

 

 

 

 


Install Jupyter notebook with Livy for Spark on Cloudera Hadoop

Environment

  • Cloudera CDH 5.12.x running Livy and Spark (see other blog on this website to install Livy)
  • Anaconda parcel installed using Cloudera Manager (see other blog on this website to install Anaconda parcel on CDH)

We will first install Anaconda on Windows 10 to get Jupyter Notebook, then add sparkmagic.

1. We strongly recommend installing Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

2. Download the Anaconda installer from https://www.anaconda.com/download/ (here, the Anaconda 5.0.1 for Windows installer, Python 3.6, 64-bit).

3. After the package installs successfully on Windows, open Anaconda Navigator from the Windows Start menu and click the JupyterLab Launch button.

4. In the JupyterLab notebook, try to load sparkmagic:

%load_ext sparkmagic.magics

This fails with a "module not found" error because sparkmagic is not yet installed.

5. We need to install sparkmagic. Run the following command in the notebook:

!pip install sparkmagic

The output reports a successful install of sparkmagic-0.12.5 along with a few other packages.

After this, run in the notebook:

!pip show sparkmagic

Name: sparkmagic

Version: 0.12.5

Summary: SparkMagic: Spark execution via Livy

Home-page: https://github.com/jupyter-incubator/sparkmagic

Author: Jupyter Development Team

Author-email: jupyter@googlegroups.org

License: BSD 3-clause

 

NEXT STEP:

From https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json

Copy the example_config.json into ~/.sparkmagic/config.json (usually on Windows the location will be C:\Users\youruserid\.sparkmagic\config.json). Change the username, password, and URL of the Livy server.

{
  "kernel_python_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998"
  },
  ...
}
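A config.json that was copied but never edited is a common mistake; it can be caught from Python. A minimal sketch (a hypothetical helper; the section names match sparkmagic's example_config.json):

```python
import json

REQUIRED_SECTIONS = (
    "kernel_python_credentials",
    "kernel_scala_credentials",
    "kernel_r_credentials",
)

def check_sparkmagic_config(text):
    """Return a list of problems: credential sections that are missing,
    or whose Livy URL still contains the placeholder host name."""
    cfg = json.loads(text)
    problems = []
    for section in REQUIRED_SECTIONS:
        creds = cfg.get(section)
        if creds is None:
            problems.append(section + ": missing")
        elif "livyhostname" in creds.get("url", ""):
            problems.append(section + ": url still has the placeholder host")
    return problems
```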

 

Next, launch the JupyterLab browser:

Click Windows->Start->Anaconda Navigator and launch JupyterLab. Then run the command below in the Jupyter notebook:

!jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension…
– Validating: ok

 

Next run the following commands in the Jupyter notebook:

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pysparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pyspark3kernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkrkernel

!jupyter serverextension enable --py sparkmagic

 

NEXT STEP:

In the top right corner of the notebook, click on the kernel name (Python 3) and change it to the PySpark kernel.

Make sure the Livy server is running by running curl on the Livy host:

$ curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}
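The same check can be done programmatically; a minimal sketch (a hypothetical helper) that summarizes the JSON returned by Livy's GET /sessions endpoint:

```python
import json

def parse_livy_sessions(body):
    """Return (total, [(id, kind, state), ...]) from a GET /sessions response."""
    doc = json.loads(body)
    return doc["total"], [(s["id"], s["kind"], s["state"]) for s in doc["sessions"]]

# With no sessions yet, as in the curl output above:
# parse_livy_sessions('{"from":0,"total":0,"sessions":[]}') returns (0, [])
```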

Run a simple command, 1+1, in the Jupyter notebook; it will connect to the Spark cluster to execute your commands, starting a Spark application on your first command.

Run another command:

%%sql

show databases

It will display the list of databases defined in Hive in the Hadoop Spark Cluster.

 

Another EXAMPLE: draw a plot and store it in a PDF file on the Livy server:

import os
import matplotlib
matplotlib.use('Agg')  # select a non-interactive backend before pyplot is used
import matplotlib.pyplot as plt

mydir = r'/home/yourlinuxid/'
os.chdir(mydir)
os.getcwd()

plt.plot([1, 4, 3, 6, 12, 20])
plt.savefig('myplot1.pdf')

 

Now, if you download myplot1.pdf from your home directory on the Linux server running Livy, you can see the graph in the PDF.

You have now successfully installed Jupyter notebook on Windows 10 and run Python via pySpark against Livy and Spark on the Hadoop backend.

 


REFERENCES:

https://github.com/jupyter-incubator/sparkmagic

https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d

https://spark-summit.org/east-2017/events/secured-kerberos-based-spark-notebook-for-data-science/

 

 

 

Install Hue Spark Notebook with Livy on Cloudera

This blog will show simple steps to install and configure the Hue Spark notebook to run interactive pySpark scripts using Livy.

Environment used:

CDH 5.12.x , Cloudera Manager, Hue 4.0, Livy 0.3.0, Spark 1.6.0 on RHEL linux.

Sentry was installed in insecure mode.

NOTE: Make sure the user who logs into Hue has access to Hive metastore.

 

<<<NEXT STEPS>>>               

Normally in Hue 4.0 the pySpark notebook is hidden and needs to be enabled in Cloudera Manager. Go to Hue->Configuration, search for "safety", and in "Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini" add the following configuration; then save the changes and restart Hue. Note that you can keep some apps blacklisted if you want, e.g. app_blacklist=hbase,impala,search.

[desktop]
app_blacklist=
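Since hue_safety_valve.ini is standard INI syntax, the snippet can be sanity-checked with Python's configparser before pasting it into Cloudera Manager. A minimal sketch (a hypothetical helper):

```python
import configparser

def blacklisted_apps(snippet):
    """Parse a hue_safety_valve.ini snippet and return the blacklisted
    apps; an empty list means all apps (including the notebook) are enabled."""
    parser = configparser.ConfigParser()
    parser.read_string(snippet)
    raw = parser.get("desktop", "app_blacklist", fallback="")
    return [app for app in raw.split(",") if app.strip()]

# blacklisted_apps("[desktop]\napp_blacklist=") returns []
```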

 

Livy requires at least Spark 1.6 and supports both Scala 2.10 and 2.11 builds of Spark. Check the Spark version on Linux with:

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

 

<<<NEXT STEPS>>>               

 

NOTE: We initially tried Livy 0.4.0, downloaded as Apache Livy 0.4.0-incubating (zip) from https://livy.apache.org/download/. That version didn't work with Hue 4.0, so we used version 0.3.0 from the Cloudera archive instead.

Create Livy install directory under /opt/cloudera:

/opt/cloudera>mkdir livy

/opt/cloudera/livy>$ wget https://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip

2017-11-13 10:37:42 (65.1 MB/s) - "livy-server-0.3.0.zip" saved [95253743/95253743]

/opt/cloudera/livy>unzip livy-server-0.3.0.zip

Go to /opt/cloudera/livy/livy-server-0.3.0/conf and back up livy-env.sh.

In /opt/cloudera/livy/livy-server-0.3.0/conf/livy-env.sh add the following variables:

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

export LIVY_LOG_DIR=/var/log/livy

NOTE: create the directory /var/log/livy and change its ownership:

$ mkdir /var/log/livy

$ chown hdfs:hadoop /var/log/livy

drwxr-xr-x 2 hdfs hadoop 4096 Nov 12 22:22 livy

 

<<<NEXT STEPS>>>               

Start the Livy server using hdfs user:

/opt/cloudera/livy/livy-server-0.3.0/bin>sudo -u hdfs ./livy-server

NOTE: To run the Livy server in the background run the command:

/opt/cloudera/livy/livy-server-0.3.0/bin>nohup sudo -u hdfs ./livy-server > /var/log/livy/livy.out 2> /var/log/livy/livy.err < /dev/null &

 

17/11/13 10:49:17 INFO StateStore$: Using BlackholeStateStore for recovery.

17/11/13 10:49:17 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0

17/11/13 10:49:17 INFO InteractiveSessionManager: Heartbeat watchdog thread started.

17/11/13 10:49:17 INFO WebServer: Starting server on http://xyz.com:8998

17/11/13 10:50:05 INFO InteractiveSession$: Creating LivyClient for sessionId: 0

Check if Livy is running correctly:

/opt/cloudera/livy/livy-server-0.3.0>curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [],
    "total": 0
}

 

<<<NEXT STEPS>>>               

Go to the Hue UI and open Query->Editor->pySpark.

Run an example. The first time, it may take about 10 seconds to start the Spark context:

print(1+2)

You will see result:

3

You can now run interactive pySpark scripts in Hue 4.0 Notebook.

If you now run the command below on Linux, you will see the session info:

/opt/cloudera/livy/livy-server-0.3.0>curl localhost:8998/sessions | python -m json.tool
{
    "from": 0,
    "sessions": [
        {
            "appId": null,
            "appInfo": {
                "driverLogUrl": null,
                "sparkUiUrl": null
            },
            "id": 2,
            "kind": "pyspark",
            "log": [],
            "owner": null,
            "proxyUser": null,
            "state": "idle"
        }
    ],
    "total": 1
}

 

<<<SOME EXAMPLES>>>: 

Below is a pySpark example run in the Hue notebook:


from random import random

NUM_SAMPLES = 1000

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
          .reduce(lambda a, b: a + b)

print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


 

Below is the result:

Pi is roughly 3.140000

 

 

EXAMPLE: Query a Hive table and its schema

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create a Hive context
hive_context = HiveContext(sc)

print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM db1.table1 limit 5")

print "Registering DataFrame as a table..."
mytbl.show()  # Show the first rows of the DataFrame
mytbl.printSchema()

 

REFERENCES:

https://blogs.msdn.microsoft.com/pliu/2016/06/18/run-hue-spark-notebook-on-cloudera/

http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark-2-2/