
Install JupyterHub

Note: this post is still a draft, so do not rely on these instructions yet.

JupyterHub Prerequisites:

Before installing JupyterHub, you will need:

  • a Linux/Unix based system

  • Python 3.4 or greater. An understanding of using pip or conda for installing Python packages is helpful.

Installation using conda:

Check if Anaconda package is already installed:

dpkg -l | grep conda

If Anaconda is not installed, download the Anaconda Linux installer from https://www.anaconda.com/download/#linux:

/home/myuserid/mydownloads/anaconda# wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh

Anaconda3-5.0.1-Linux-x86_64. 100%[==============================================>] 525.28M 3.46MB/s in 2m 35s

Install it with the following command:

$ bash Anaconda3-5.0.1-Linux-x86_64.sh

Do you accept the license terms? [yes|no]
[no] >>> yes

Anaconda3 will now be installed into this location:
/root/anaconda3

– Press ENTER to confirm the location
– Press CTRL-C to abort the installation
– Or specify a different location below

[/root/anaconda3] >>> /opt/anaconda3

installing: anaconda-5.0.1-py36hd30a520_1 …

installation finished.

Do you wish the installer to prepend the Anaconda3 install location

to PATH in your /root/.bashrc ? [yes|no]

[no] >>> yes

Appending source /opt/anaconda3/bin/activate to /root/.bashrc

$ source ~/.bashrc

Verify the installed Anaconda version:

$ conda list

 

# packages in environment at /opt/anaconda3:
#
_ipyw_jlab_nb_ext_conf 0.1.0 py36he11e457_0
alabaster 0.7.10 py36h306e16b_0
anaconda 5.0.1 py36hd30a520_1
anaconda-client 1.6.5 py36h19c0dcd_0
anaconda-navigator 1.6.9 py36h11ddaaa_0
anaconda-project 0.8.0 py36h29abdf5_0

 

Next, install the JupyterHub package:

# conda install -c conda-forge jupyterhub

# conda install notebook

 

Test the install:

$ jupyterhub -h

$ configurable-http-proxy -h

$ jupyterhub

Running jupyterhub by itself produced an error:

[I 2017-12-04 00:20:23.809 JupyterHub app:871] Writing cookie_secret to /opt/anaconda3/jupyterhub_cookie_secret
[I 2017-12-04 00:20:23.876 alembic.runtime.migration migration:117] Context impl SQLiteImpl.
[I 2017-12-04 00:20:23.877 alembic.runtime.migration migration:122] Will assume non-transactional DDL.
[I 2017-12-04 00:20:23.922 alembic.runtime.migration migration:327] Running stamp_revision -> 3ec6993fe20c
[W 2017-12-04 00:20:24.272 JupyterHub app:955] No admin users, admin interface will be unavailable.
[W 2017-12-04 00:20:24.272 JupyterHub app:956] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2017-12-04 00:20:24.272 JupyterHub app:983] Not using whitelist. Any authenticated user will be allowed.
[E 2017-12-04 00:20:24.328 JupyterHub app:1525] Failed to bind hub to http://127.0.0.1:8081/hub/
[E 2017-12-04 00:20:24.330 JupyterHub app:1623]
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterhub/app.py", line 1621, in launch_instance_async
    yield self.start()
  File "/opt/anaconda3/lib/python3.6/site-packages/jupyterhub/app.py", line 1523, in start
    self.http_server.listen(self.hub_port, address=self.hub_ip)
  File "/opt/anaconda3/lib/python3.6/site-packages/tornado/tcpserver.py", line 142, in listen
    sockets = bind_sockets(port, address=address)
  File "/opt/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 197, in bind_sockets
    sock.bind(sockaddr)
PermissionError: [Errno 13] Permission denied

Resolution: checking with netstat -an showed that port 8081 was already in use on the Ubuntu server by a McAfee process, so the hub port was changed in the config file /etc/jupyterhub/jupyterhub_config.py to:

c.JupyterHub.hub_port = 8181

See below for how to generate and update the config file. After this change, JupyterHub started successfully.
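As a quick sanity check before picking a new port, a minimal Python sketch (not part of JupyterHub) can test whether a port is free to bind:

# Minimal sketch: report whether a TCP port on 127.0.0.1 is free to bind.
import socket

def port_is_free(port, host='127.0.0.1'):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(8081))  # False here: another process already holds 8081
print(port_is_free(8181))  # True, so this is a safe hub_port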

 

Configuration of JupyterHub:

We recommend storing configuration files in the standard UNIX filesystem location, i.e. /etc/jupyterhub. Create the configuration file:

# jupyterhub --generate-config -f /etc/jupyterhub/jupyterhub_config.py
Writing default config to: /etc/jupyterhub/jupyterhub_config.py

Generate a self-signed SSL certificate (valid for ten years):

$ openssl req -x509 -nodes -days 3650 -newkey rsa:2048 -keyout jupyterhub.key -out jupyterhub.crt

Add these values to /etc/jupyterhub/jupyterhub_config.py, pointing at the generated SSL certificates:

c.JupyterHub.ssl_cert = '/etc/jupyterhub/sslcerts/jupyterhub.crt'
c.JupyterHub.ssl_key = '/etc/jupyterhub/sslcerts/jupyterhub.key'
c.JupyterHub.port = 443

c.JupyterHub.hub_port = 8181
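Putting it together, the relevant part of /etc/jupyterhub/jupyterhub_config.py might look like the sketch below; the admin_users line is an example addition prompted by the "No admin users" warning in the log above:

# /etc/jupyterhub/jupyterhub_config.py: minimal sketch, assuming the
# certificates were placed under /etc/jupyterhub/sslcerts/
c.JupyterHub.ssl_cert = '/etc/jupyterhub/sslcerts/jupyterhub.crt'
c.JupyterHub.ssl_key = '/etc/jupyterhub/sslcerts/jupyterhub.key'
c.JupyterHub.port = 443                   # public HTTPS port
c.JupyterHub.hub_port = 8181              # internal Hub port (8081 was taken, see above)
c.Authenticator.admin_users = {'myusr'}   # example admin user, created in the next step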

Create a user and group in linux:

$ groupadd mygrp

$ useradd -m -g mygrp -c "my user" myusr

Set a password for myusr:

$ passwd myusr

Next, start the JupyterHub server:

$ jupyterhub -f /etc/jupyterhub/jupyterhub_config.py

[I 2017-12-04 22:27:28.147 JupyterHub app:1581] JupyterHub is now running at https://:443/

Log in as myusr with the password set above; JupyterHub uses PAM authentication by default, so any local Linux account can sign in.

The notes below follow minrk's jupyterhub-pydata-2016 tutorial repository:

$ sudo mkdir /srv/jupyterhub
$ cd /srv/jupyterhub
$ sudo chown charlotte /srv/jupyterhub
$ git clone http://github.com/minrk/jupyterhub-pydata-2016
$ cd jupyterhub-pydata-2016
$ conda env create -f environment.yml
$ source activate jupyterhub-tutorial
$ conda config --add channels conda-forge

To use DockerSpawner later, install Docker (https://docs.docker.com/engine/installation), then:

$ pip install dockerspawner
$ docker pull jupyterhub/singleuser

 

A DNS domain name (e.g. mydomain.tld) pointing at the server is needed first. Then obtain SSL certificates from Let's Encrypt:

/srv/jupyterhub$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ ./letsencrypt-auto certonly --standalone -d mydomain.tld

key: /etc/letsencrypt/live/mydomain.tld/privkey.pem
cert: /etc/letsencrypt/live/mydomain.tld/fullchain.pem

/srv/jupyterhub/jupyterhub-pydata-2016$ jupyterhub --generate-config

Edit the config file:

c.JupyterHub.ssl_key = 'jupyterhub.key'
c.JupyterHub.ssl_cert = 'jupyterhub.crt'
c.JupyterHub.port = 443

Or copy the tutorial's SSL config and point it at the Let's Encrypt certificates:

cp jupyterhub_config.py-ssl ./jupyterhub_config.py

# Configuration file for jupyterhub
c.JupyterHub.ssl_key = '/etc/letsencrypt/live/hub-tutorial.jupyter.org/privkey.pem'
c.JupyterHub.ssl_cert = '/etc/letsencrypt/live/hub-tutorial.jupyter.org/fullchain.pem'
c.JupyterHub.port = 443

Allow the proxy's node binary to bind the privileged port 443 without root, and make the certificates readable:

$ sudo setcap cap_net_bind_service=+ep /home/charlotte/anaconda3/envs/jupyterhub-tutorial/bin/node
/etc/letsencrypt$ sudo chmod 777 -R archive/
/etc/letsencrypt$ sudo chmod 777 -R live/

Then start JupyterHub from the tutorial directory:

/srv/jupyterhub/jupyterhub-pydata-2016$ jupyterhub

 

Install Kernels for all users:

conda create -n py2 python=2 ipykernel

conda run -n py2 -- ipython kernel install

jupyter kernelspec list
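As a sketch, the installed kernelspecs can also be listed programmatically through jupyter_client, which ships with Jupyter:

# Sketch: list installed kernelspecs and their resource directories.
from jupyter_client.kernelspec import KernelSpecManager

ksm = KernelSpecManager()
for name, resource_dir in sorted(ksm.find_kernel_specs().items()):
    print(name, '->', resource_dir)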

 

Authentication

Using GitHub OAuth

Register a new OAuth application at https://github.com/settings/applications/new. GitHub provides a client ID and client secret, and we need to provide the OAuth callback URL.

In ./env in the repository:

export GITHUB_CLIENT_ID=<client id from GitHub>

export GITHUB_CLIENT_SECRET=<client secret from GitHub>

export OAUTH_CALLBACK_URL=https://hub-tutorial.jupyter.org/hub/oauth_callback

$ cd /srv/jupyterhub

$ source ./env

python3 -m pip install oauthenticator

In jupyterhub_config.py:

from oauthenticator.github import LocalGitHubOAuthenticator

c.JupyterHub.authenticator_class = LocalGitHubOAuthenticator

c.LocalGitHubOAuthenticator.create_system_users = True
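A fuller sketch of the same jupyterhub_config.py section, assuming the credentials were exported from ./env as above; reading them explicitly via os.environ is an illustration:

# Sketch: configure GitHub OAuth from the environment variables set in ./env.
import os
from oauthenticator.github import LocalGitHubOAuthenticator

c.JupyterHub.authenticator_class = LocalGitHubOAuthenticator
c.LocalGitHubOAuthenticator.create_system_users = True
c.LocalGitHubOAuthenticator.client_id = os.environ['GITHUB_CLIENT_ID']
c.LocalGitHubOAuthenticator.client_secret = os.environ['GITHUB_CLIENT_SECRET']
c.LocalGitHubOAuthenticator.oauth_callback_url = os.environ['OAUTH_CALLBACK_URL']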

Restart JupyterHub and sign in with GitHub.

 

Using DockerSpawner

python3 -m pip install dockerspawner netifaces

docker pull jupyterhub/singleuser

In jupyterhub_config.py:

from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator

from dockerspawner import DockerSpawner

c.JupyterHub.spawner_class = DockerSpawner
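A sketch of the corresponding jupyterhub_config.py section; the trait names are assumptions that vary with the dockerspawner version:

# Sketch: spawn each user's notebook server in a Docker container.
from oauthenticator.github import GitHubOAuthenticator
from dockerspawner import DockerSpawner

c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.JupyterHub.spawner_class = DockerSpawner
# Older dockerspawner releases call this trait container_image; newer ones use image.
c.DockerSpawner.container_image = 'jupyterhub/singleuser'
# Containers must be able to reach the Hub API, so do not bind the Hub to
# localhost only; netifaces (installed above) can help pick the right interface.
c.JupyterHub.hub_ip = '0.0.0.0'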

 

Reference deployments:

https://github.com/jupyterhub/jupyterhub-deploy-docker (a docker-compose based deployment)

https://blogs.msdn.microsoft.com/uk_faculty_connection/2017/08/10/running-jupyterhub-on-and-off-campus-architectural-scenarios/

Some helpful links

Cloudera/Hadoop:

tiny.cloudera.com/hw-reqs

tiny.cloudera.com/aws-ra

http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html

http://docplayer.net/25124019-Hadoop-security-authors-ben-spivey-and-joey-echeverria-provide-in-depth-information-about-the-security-features-available-in-hadoop-and-organize-them.html

http://blog.cloudera.com/blog/2015/03/how-to-quickly-configure-kerberos-for-your-apache-hadoop-cluster/

http://wpcertification.blogspot.com/

 

Jupyter:

https://blog.insightdatascience.com/using-jupyter-on-apache-spark-step-by-step-with-a-terabyte-of-reddit-data-ef4d6c13959a

Docker:

https://www.dataquest.io/blog/docker-data-science/

Miscellaneous:

https://blog.daftcode.pl/hype-driven-development-3469fc2e9b22

https://github.com/parth8891/NYC_Taxi_Data_Analysis

https://keshif.me/demo/VisTools

http://blog.thedigitalgroup.com/dattatrayap/high-speed-ingestion-into-solr-with-custom-talend-component-developed-by-tdg/

Install Jupyter notebook with Livy for Spark on Cloudera Hadoop

Environment

  • Cloudera CDH 5.12.x running Livy and Spark (see other blog on this website to install Livy)
  • Anaconda parcel installed using Cloudera Manager (see other blog on this website to install Anaconda parcel on CDH)

We will first install Anaconda on Windows 10 to get the Jupyter Notebook, then add Sparkmagic so notebooks can connect to Livy.

1. We strongly recommend installing Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

2. Download the Anaconda installer from https://www.anaconda.com/download/; here this was the Anaconda 5.0.1 for Windows installer, Python 3.6 version, 64-bit.

3. After the package installs successfully on Windows, open Anaconda Navigator from the Windows Start menu and click the JupyterLab Launch button.

4. In a JupyterLab notebook, try to load sparkmagic:

%load_ext sparkmagic.magics

This gives a module-not-found error because sparkmagic is not yet installed.

5. Install sparkmagic by running the following command in the notebook:

!pip install sparkmagic

The output reports a successful install of sparkmagic-0.12.5 along with a few other packages.

After this, verify the install in the notebook:

!pip show sparkmagic

Name: sparkmagic

Version: 0.12.5

Summary: SparkMagic: Spark execution via Livy

Home-page: https://github.com/jupyter-incubator/sparkmagic

Author: Jupyter Development Team

Author-email: jupyter@googlegroups.org

License: BSD 3-clause

 

NEXT STEP:

From https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json

Copy example_config.json to ~/.sparkmagic/config.json (on Windows this is usually C:\Users\youruserid\.sparkmagic\config.json). Change the username, password, and URL to match your Livy server.

{
  "kernel_python_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998"
  }
}
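If you prefer, the same file can be written from a notebook cell; this is just a sketch, with the same placeholder credentials as above:

# Sketch: write ~/.sparkmagic/config.json from Python (placeholders as above).
import json
import os

livy_url = "http://livyhostname:8998"  # change to your Livy server
creds = {"username": "youruserid", "password": "xxxxx", "url": livy_url, "auth": "None"}

config = {
    "kernel_python_credentials": dict(creds),
    "kernel_scala_credentials": dict(creds),
    "kernel_r_credentials": {k: creds[k] for k in ("username", "password", "url")},
}

path = os.path.expanduser("~/.sparkmagic/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(config, f, indent=2)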

 

Next, launch the JupyterLab browser:

Click Windows -> Start -> Anaconda Navigator and launch JupyterLab. Then run the command below in the Jupyter notebook:

!jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension…
– Validating: ok

 

Next run the following commands in the Jupyter notebook:

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pysparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pyspark3kernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkrkernel

!jupyter serverextension enable --py sparkmagic

 

NEXT STEP:

In the top right corner of the Jupyter notebook, click on Python 3 and switch to the PySpark kernel.

Make sure the Livy server is running by curling it on the Livy host:

$ curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}

Run a simple command such as 1+1 in the notebook; it will connect to the Spark cluster and start a Spark application with your first command.
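Under the hood, sparkmagic talks to Livy's REST API. As a sketch, you can exercise the same API directly with requests (the session kind and polling loop follow the Livy documentation; run this on or near the Livy host):

# Sketch: create a Livy PySpark session and run a statement over REST.
import json
import time

import requests

livy = "http://localhost:8998"
headers = {"Content-Type": "application/json"}

# Create a PySpark session and wait for it to become idle.
r = requests.post(livy + "/sessions", data=json.dumps({"kind": "pyspark"}), headers=headers)
session_url = livy + r.headers["location"]
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(1)

# Submit a statement and poll until the result is available.
r = requests.post(session_url + "/statements", data=json.dumps({"code": "1 + 1"}), headers=headers)
statement_url = livy + r.headers["location"]
while True:
    result = requests.get(statement_url, headers=headers).json()
    if result["state"] == "available":
        print(result["output"])  # e.g. the text/plain representation '2'
        break
    time.sleep(1)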

Run another command:

%%sql

show databases

It will display the list of databases defined in Hive in the Hadoop Spark Cluster.
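sparkmagic can also pull SQL results back into the local Python session as a pandas dataframe via the -o flag; a sketch (dbs is an arbitrary variable name):

%%sql -o dbs
show databases

The resulting dbs dataframe can then be inspected on the client side with the %%local magic.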

 

Another example: draw a plot and store it as a PDF file on the Livy server:

import os
import matplotlib
matplotlib.use('Agg')  # select a non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

mydir = r'/home/yourlinuxid/'
os.chdir(mydir)
os.getcwd()

plt.plot([1,4,3,6,12,20])
plt.savefig('myplot1.pdf')

 

Now download myplot1.pdf from your home directory on the Linux server running Livy and you can see the plot in the PDF.

You have now successfully installed Jupyter notebook on Windows 10 and run Python through PySpark, with Livy and Spark on the Hadoop backend.

 


REFERENCES:

https://github.com/jupyter-incubator/sparkmagic

https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d

https://spark-summit.org/east-2017/events/secured-kerberos-based-spark-notebook-for-data-science/