
Install Jupyter notebook with Livy for Spark on Cloudera Hadoop

Environment

  • Cloudera CDH 5.12.x running Livy and Spark (see other blog on this website to install Livy)
  • Anaconda parcel installed using Cloudera Manager (see other blog on this website to install Anaconda parcel on CDH)

We will first install Anaconda on Windows 10 to get Jupyter Notebook, and then install Sparkmagic so the notebook can connect to Livy.

  1. We strongly recommend installing Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

2. Download the Anaconda installer from https://www.anaconda.com/download/. At the time of writing this was the Anaconda 5.0.1 Windows installer, Python 3.6 version, 64-bit.

3. After the package installs successfully on Windows, open Anaconda Navigator from the Windows Start menu and click the JupyterLab Launch button.

4. In the JupyterLab browser, open a notebook and try to load sparkmagic with:

%load_ext sparkmagic.magics

This gave a "module not found" error because sparkmagic was not yet installed.

5. We need to install sparkmagic. Run the following command in the notebook:

!pip install sparkmagic

The output reports a successful install of sparkmagic-0.12.5 along with a few other packages.

After this, run the following in the notebook:

!pip show sparkmagic

Name: sparkmagic
Version: 0.12.5
Summary: SparkMagic: Spark execution via Livy
Home-page: https://github.com/jupyter-incubator/sparkmagic
Author: Jupyter Development Team
Author-email: jupyter@googlegroups.org
License: BSD 3-clause

 

NEXT STEP:

From https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json

Copy example_config.json to ~/.sparkmagic/config.json (on Windows this is usually C:\Users\youruserid\.sparkmagic\config.json). Change the username, password, and URL to match your Livy server.

{
  "kernel_python_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "youruserid",
    "password": "xxxxx",
    "url": "http://livyhostname:8998"
  },
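Only the credential sections are shown above; the remaining keys in example_config.json can be left at their defaults. If you prefer to create the file from within the notebook itself, here is a minimal sketch (the user id, password, and Livy hostname are placeholders you must replace, and only the Python kernel section is shown):

import json
import os

# Placeholder credentials and host - replace with your own values
config = {
    "kernel_python_credentials": {
        "username": "youruserid",
        "password": "xxxxx",
        "url": "http://livyhostname:8998",
        "auth": "None",
    }
}

# Write to ~/.sparkmagic/config.json (C:\Users\youruserid\.sparkmagic\config.json on Windows)
cfg_dir = os.path.join(os.path.expanduser("~"), ".sparkmagic")
os.makedirs(cfg_dir, exist_ok=True)
with open(os.path.join(cfg_dir, "config.json"), "w") as f:
    json.dump(config, f, indent=2)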

 

Next launch the JupyterLab browser:

Click Windows -> Start -> Anaconda Navigator and launch JupyterLab. Then run the command below in the Jupyter notebook:

!jupyter nbextension enable --py --sys-prefix widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
- Validating: ok

 

Next run the following commands in the Jupyter notebook:

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pysparkkernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\pyspark3kernel

!jupyter-kernelspec install c:\users\youruserid\appdata\local\continuum\anaconda3\lib\site-packages\sparkmagic\kernels\sparkrkernel

!jupyter serverextension enable --py sparkmagic
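To confirm that the kernels and the server extension are registered, you can list them from the notebook (the paths in the output will vary with your install):

!jupyter kernelspec list

!jupyter serverextension list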

 

NEXT STEP:

In the top right corner of the Jupyter notebook, click on "Python 3" and switch to the PySpark kernel.

Make sure the Livy server is running by issuing a curl request on the Livy host:

$ curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}
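The same check can be run from the Windows client with the requests package that ships with Anaconda (livyhostname is a placeholder for your Livy host):

import requests

# Placeholder host - replace livyhostname with your Livy server
resp = requests.get("http://livyhostname:8998/sessions")
print(resp.status_code, resp.json())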

Run a simple command such as 1+1 in the Jupyter notebook; the notebook will connect to the Spark cluster to execute it, and your first command will start a Spark application on the cluster.
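You can inspect the Livy session that sparkmagic created with the %%info magic:

%%info

It shows the sessions on the configured Livy endpoint, including their id and state.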

Run another command:

%%sql

show databases

It will display the list of databases defined in Hive on the Hadoop Spark cluster.
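Sparkmagic can also pull query results back to the local notebook as a pandas DataFrame with the -o option, for example:

%%sql -o databases_df
show databases

The databases_df variable is then available for client-side work in a %%local cell.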

 

Another example: draw a plot and save it as a PDF file on the Livy server:

import os
import matplotlib
import matplotlib.pyplot as plt

# Use a non-interactive backend since this code runs on the remote Livy/Spark driver
plt.switch_backend("Agg")
matplotlib.use("Agg")

# Save the figure in the home directory of your id on the Livy server
mydir = r'/home/yourlinuxid/'
os.chdir(mydir)
os.getcwd()

plt.plot([1, 4, 3, 6, 12, 20])
plt.savefig('myplot1.pdf')

 

Now download myplot1.pdf from your home directory on the Linux server running Livy and you can view the generated plot in the PDF.

You have now successfully installed Jupyter notebook on Windows 10 and run Python through PySpark against Livy and Spark on the Hadoop backend.

 


REFERENCES:

https://github.com/jupyter-incubator/sparkmagic

https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d

https://spark-summit.org/east-2017/events/secured-kerberos-based-spark-notebook-for-data-science/

 

 

 


Install Anaconda Python package on Cloudera CDH.

 

This blog will show how to install Anaconda parcel in CDH to enable Pandas and other python libraries on Hue pySpark notebook.

Install Steps:

Installing the Anaconda Parcel

1. From the Cloudera Manager Admin Console, click the "Parcels" indicator in the top navigation bar.

2. Click the "Configuration" button on the top right of the Parcels page.

3. Click the plus symbol in the "Remote Parcel Repository URLs" section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/

4. Click the "Save Changes" button.

NOTE: The Anaconda parcel still did not show up in the Parcels list. There was a NullPointerException in the Cloudera Manager log.

2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:com.cloudera.server.web.cmf.ParcelController: Synchronizing repos based on user request admin

2017-11-17 12:43:08,880 WARN ParcelUpdateService:com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: Error while executing CmfEntityManager task

java.lang.NullPointerException

at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)

at com.google.common.collect.Collections2.filter(Collections2.java:92)

at com.cloudera.parcel.components.ParcelDownloaderImpl$RepositoryInfo.getParcelsWithValidNames(ParcelDownloaderImpl.java:673)

 

The workaround is to remove the following Sqoop URLs (both the http and https entries) from Cloudera Manager Parcels -> Configuration -> Remote Parcel Repository URLs. After that the Anaconda parcel showed up in the parcels list.

http://archive.cloudera.com/sqoop-connectors/parcels/latest/

https://archive.cloudera.com/sqoop-connectors/parcels/latest/

 

5. Click the "Download" button to the right of the Anaconda parcel listing.

6. After the parcel is downloaded, click the "Distribute" and "Activate" buttons to distribute and activate the parcel on all of the cluster nodes.

7. After the parcel is activated, Anaconda is available on all of the cluster nodes.

8. The Cloudera Manager Spark status will show a stale configuration. Restart the services and deploy the stale configuration by clicking the button in Cloudera Manager.

9. Now Anaconda is deployed and can be used with the Hue pySpark notebook. Test a small example in Hue->Query->Editor->pySpark:

Example: Pandas test

#!/usr/bin/env python

def import_pandas(x):
    # Import pandas on the executor to verify the Anaconda parcel is usable there
    import pandas
    return x + 10

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x)).collect()

 


Result

[11, 12, 13, 14]
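A similar check can report the pandas version seen by the executors, which confirms that the parcel is active cluster-wide (a minimal sketch):

def pandas_version(_):
    # Import on the executor and return the version it resolved
    import pandas
    return pandas.__version__

sc.parallelize(range(4), 4).map(pandas_version).distinct().collect()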

 

Now you can start using the Python libraries in the Anaconda package, such as pandas, NumPy, and SciPy, from the Hue pySpark notebook.
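For example, a small result can be collected to the driver and analyzed with pandas and NumPy directly in the pySpark notebook (a minimal sketch):

import numpy as np
import pandas as pd

# Collect a small RDD to the driver and continue the analysis locally with pandas/NumPy
rdd = sc.parallelize([(1, 2.0), (2, 4.0), (3, 6.0)])
pdf = pd.DataFrame(rdd.collect(), columns=["id", "value"])
print(pdf.describe())
print(np.mean(pdf["value"]))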

References:

http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/