
Install Anaconda Python package on Cloudera CDH.

 

This blog post shows how to install the Anaconda parcel on Cloudera CDH so that pandas and other Python libraries can be used from the Hue pySpark notebook.

Install Steps:

Installing the Anaconda Parcel

1. From the Cloudera Manager Admin Console, click the “Parcels” indicator in the top navigation bar.

2. Click the “Configuration” button on the top right of the Parcels page.

3. Click the plus symbol in the “Remote Parcel Repository URLs” section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/

4. Click the “Save Changes” button.
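Optionally, before moving on, you can check that the repository URL is reachable from the Cloudera Manager host. The sketch below assumes only the Python 2 standard library (which the CM host ships with); every parcel repository publishes a manifest.json listing its parcels:

#!/usr/bin/env python
# Quick reachability check for the Anaconda parcel repository.
# Run on the Cloudera Manager host; Python 2 standard library only.
import json
import urllib2

REPO = "https://repo.continuum.io/pkgs/misc/parcels/"
manifest = json.load(urllib2.urlopen(REPO + "manifest.json"))

# Print the parcels the repository advertises.
for parcel in manifest.get("parcels", []):
    print(parcel.get("parcelName"))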

NOTE: The Anaconda parcel still did not show up in the Parcels list. There was a NullPointerException in the Cloudera Manager log:

2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:com.cloudera.server.web.cmf.ParcelController: Synchronizing repos based on user request admin
2017-11-17 12:43:08,880 WARN ParcelUpdateService:com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: Error while executing CmfEntityManager task
java.lang.NullPointerException
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)
    at com.google.common.collect.Collections2.filter(Collections2.java:92)
    at com.cloudera.parcel.components.ParcelDownloaderImpl$RepositoryInfo.getParcelsWithValidNames(ParcelDownloaderImpl.java:673)

 

The workaround is to remove the following Sqoop connector URLs (both the http and https versions) from Cloudera Manager Parcels -> Configuration -> Remote Parcel Repository URLs. After that, the Anaconda parcel showed up in the Parcels list. If you prefer to script the repository change, see the sketch after the URLs.

http://archive.cloudera.com/sqoop-connectors/parcels/latest/

https://archive.cloudera.com/sqoop-connectors/parcels/latest/
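Here is a minimal sketch of making the same change through the Cloudera Manager REST API instead of the UI. The host name, API version, credentials and the existing CDH repository URL are placeholders, and it assumes the setting is exposed under the REMOTE_PARCEL_REPO_URLS key and that the requests library is available where the script runs:

#!/usr/bin/env python
# Sketch: set the Remote Parcel Repository URLs through the CM REST API.
# NOTE: this replaces the whole list, so include every repository you
# still want (here: the CDH repo plus Anaconda, minus the Sqoop URLs).
import requests

CM_API = "http://cm-host.example.com:7180/api/v17"   # placeholder host and API version
AUTH = ("admin", "admin")                            # placeholder credentials

repos = [
    "https://archive.cloudera.com/cdh5/parcels/latest/",   # existing CDH repo (example)
    "https://repo.continuum.io/pkgs/misc/parcels/",        # Anaconda parcel repo
]

payload = {"items": [{"name": "REMOTE_PARCEL_REPO_URLS",
                      "value": ",".join(repos)}]}
resp = requests.put(CM_API + "/cm/config", json=payload, auth=AUTH)
resp.raise_for_status()
print(resp.json())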

 

5. Click the “Download” button to the right of the Anaconda parcel listing.

6. After the parcel is downloaded, click the “Distribute” and “Activate” buttons to distribute and activate the parcel on all of the cluster nodes.

7. After the parcel is activated, Anaconda is available on all of the cluster nodes.

8. Cloudera Manager will show a stale configuration for the Spark service. Restart the affected services and deploy the stale client configurations by clicking the restart button in Cloudera Manager.
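Steps 5 through 8 can also be driven from the Cloudera Manager REST API. The sketch below simply fires the parcel commands in order; the cluster name and Anaconda parcel version are placeholders (list the available versions with a GET on the cluster's parcels endpoint), and a real script would poll the parcel stage instead of sleeping:

#!/usr/bin/env python
# Sketch: download, distribute and activate the Anaconda parcel via the
# Cloudera Manager REST API (same placeholder host/credentials as above).
import time
import requests

CM_API = "http://cm-host.example.com:7180/api/v17"
AUTH = ("admin", "admin")
CLUSTER = "cluster"       # placeholder cluster name
PRODUCT = "Anaconda"
VERSION = "4.2.0.1"       # placeholder parcel version

base = "%s/clusters/%s/parcels/products/%s/versions/%s" % (
    CM_API, CLUSTER, PRODUCT, VERSION)

for command in ("startDownload", "startDistribution", "activate"):
    resp = requests.post("%s/commands/%s" % (base, command), auth=AUTH)
    resp.raise_for_status()
    print("%s -> %s" % (command, resp.status_code))
    time.sleep(60)        # crude wait; poll the parcel stage in a real script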

9. Anaconda is now deployed and can be used from the Hue pySpark notebook. Test a small example in Hue -> Query -> Editor -> pySpark:

Example: pandas test

#!/usr/bin/env python

def import_pandas(x):
    # importing inside the function means the import runs on the executors
    import pandas
    return x + 10

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x)).collect()

 


Result

[11, 12, 13, 14]

 

You can now use the Python libraries bundled with the Anaconda parcel, such as pandas, NumPy, and SciPy, from the Hue pySpark notebook.
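As a slightly bigger sketch (the numbers, partition count and column name are arbitrary), the following uses pandas and NumPy inside mapPartitions, so the imports are resolved on the executors where the Anaconda parcel is activated:

def summarize(partition):
    # pandas and NumPy come from the Anaconda parcel on every executor,
    # so importing inside the function works on the worker nodes
    import pandas as pd
    import numpy as np
    df = pd.DataFrame({"value": list(partition)})
    yield (float(np.mean(df["value"])), float(df["value"].max()))

rdd = sc.parallelize(range(1, 101), 4)
print(rdd.mapPartitions(summarize).collect())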

References:

http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/