Install Anaconda Python package on Cloudera CDH.

This post shows how to install the Anaconda parcel on CDH so that Pandas and other Python libraries can be used in the Hue pySpark notebook.

There are two methods of using Anaconda on an existing cluster with Cloudera CDH, Cloudera’s distribution including Apache Hadoop:

  • Use the Anaconda parcel for Cloudera CDH. The following procedure describes how to install the Anaconda parcel on a CDH cluster using Cloudera Manager. The Anaconda parcel provides a static installation of Anaconda, based on Python 2.7, that can be used with Python and PySpark jobs on the cluster.
  • Use Anaconda Scale, which provides additional functionality, including the ability to manage multiple conda environments and packages, including Python and R, alongside an existing CDH cluster. Note: this product has been discontinued and requires an Anaconda Enterprise license, so it may not be a good choice.

Install Steps:

Installing the Anaconda Parcel

1. From the Cloudera Manager Admin Console, click the “Parcels” indicator in the top navigation bar.

2. Click the “Configuration” button on the top right of the Parcels page.

3. Click the plus symbol in the “Remote Parcel Repository URLs” section, and add the following repository URL for the Anaconda parcel:

4. Click the “Save Changes” button.

NOTE: After saving the changes, the Anaconda parcel still did not show up in the Parcels list. The Cloudera Manager log showed a NullPointerException:

2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:com.cloudera.server.web.cmf.ParcelController: Synchronizing repos based on user request admin

2017-11-17 12:43:08,880 WARN ParcelUpdateService:com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: Error while executing CmfEntityManager task




at com.cloudera.parcel.components.ParcelDownloaderImpl$RepositoryInfo.getParcelsWithValidNames(

The workaround is to remove the Sqoop URLs, both the http and the https entries, from Remote Parcel Repository URLs under Parcels -> Configuration in Cloudera Manager. After that, the Anaconda parcel showed up in the Parcels list.
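To check whether another cluster is hitting the same failure, the Cloudera Manager server log can be scanned for warnings logged by ParcelUpdateService. The following is a minimal sketch, not an official tool: the function name `parcel_warnings` is made up for illustration, the pattern is based only on the two log lines quoted above, and the log path mentioned in the comment is an assumption that may differ on your installation.

```python
import re

# Pattern for the "timestamp LEVEL source:message" prefix seen in the log lines above.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>[^:]+):"
    r"(?P<rest>.*)$"
)


def parcel_warnings(lines):
    """Return (timestamp, message) pairs for WARN/ERROR lines from ParcelUpdateService."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # stack-trace lines and blank lines do not carry the prefix
        if match.group("level") not in ("WARN", "ERROR"):
            continue
        if match.group("source").startswith("ParcelUpdateService"):
            hits.append((match.group("ts"), match.group("rest").strip()))
    return hits


# The two log lines from this post, used here as sample input.
sample = [
    "2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:"
    "com.cloudera.server.web.cmf.ParcelController: "
    "Synchronizing repos based on user request admin",
    "2017-11-17 12:43:08,880 WARN ParcelUpdateService:"
    "com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: "
    "Error while executing CmfEntityManager task",
]

for timestamp, message in parcel_warnings(sample):
    print(timestamp, message)

# On a real server you would pass an open file instead of `sample`, for example
# open("/var/log/cloudera-scm-server/cloudera-scm-server.log")  (assumed path).
```

Only the WARN line is reported for the sample input; the INFO line about synchronizing repos is skipped.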

5. Click the “Download” button to the right of the Anaconda parcel listing.

6. After the parcel is downloaded, click the “Distribute” and “Activate” buttons to distribute and activate the parcel to all of the cluster nodes.

7. After the parcel is activated, Anaconda is available on all of the cluster nodes.

8. The Cloudera Manager Spark status will show a stale configuration. Click the button in CM to deploy the stale configuration and restart.

9. Anaconda is now deployed and can be used with the Hue pySpark notebook. (Livy needs to be installed first, as described in another post on this site; search for Livy in the search box.) Test a small example in Hue -> Query -> Editor -> pySpark:

Example: Pandas test

#!/usr/bin/env python

def import_pandas(x):
    # The import runs on the executors, so it fails there if pandas is missing.
    import pandas
    return x + 10

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x)).collect()


[11, 12, 13, 14]
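If the test raises an ImportError instead, it can help to check which Python interpreter the executors are actually running. The sketch below is only a diagnostic aid under a few assumptions: `probe_environment` is a name made up for this post, `sc` is the SparkContext that the pySpark notebook normally provides, and the code is written to run on both Python 2.7 (which the parcel is based on) and Python 3.

```python
import sys


def probe_environment(_):
    """Return (interpreter path, pandas version or None) for the process that runs this."""
    try:
        import pandas
        version = pandas.__version__
    except ImportError:
        version = None
    return (sys.executable, version)


# Check the driver side first; this works without Spark.
print(probe_environment(None))

# In a pySpark session `sc` already exists, so run the same probe on the executors.
if "sc" in globals():
    print(sc.parallelize(range(8), 4).map(probe_environment).distinct().collect())
```

If the executors report a system interpreter rather than one under the Anaconda parcel directory, or a pandas version of None, the parcel is probably not being picked up by Spark, and the stale configuration from step 8 is worth rechecking.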

You can now use the Python libraries in the Anaconda package, such as pandas, NumPy, and SciPy, from the Hue pySpark notebook.
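The download, distribute, and activate clicks in steps 5 and 6 can also be scripted against the Cloudera Manager REST API, which exposes parcel commands per cluster. The sketch below only builds the three command URLs and sends nothing. The host `cm.example.com`, the API version `v18`, the cluster name `Cluster 1`, and the parcel version `4.2.0` are placeholders, not values from this installation, and the endpoint layout should be checked against the API documentation for your Cloudera Manager version.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7, as shipped in the Anaconda parcel

# Parcel commands in the order they need to run.
PARCEL_COMMANDS = ("startDownload", "startDistribution", "activate")


def parcel_command_urls(cm_base_url, api_version, cluster, product, version):
    """Build (command, url) pairs for downloading, distributing, and activating a parcel."""
    base = "{0}/api/{1}/clusters/{2}/parcels/products/{3}/versions/{4}/commands".format(
        cm_base_url.rstrip("/"),
        api_version,
        quote(cluster, safe=""),  # cluster names often contain spaces
        quote(product, safe=""),
        quote(version, safe=""),
    )
    return [(name, base + "/" + name) for name in PARCEL_COMMANDS]


# All argument values below are placeholders for illustration.
for name, url in parcel_command_urls(
    "http://cm.example.com:7180", "v18", "Cluster 1", "Anaconda", "4.2.0"
):
    print(name, url)
```

Each URL is meant to be called with an authenticated POST, for example from curl or an HTTP client of your choice. The commands run asynchronously, so a script would need to wait for the parcel to reach the next stage before issuing the following command.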

