This blog post shows how to install the Anaconda parcel on CDH to enable pandas and other Python libraries in the Hue pySpark notebook.
There are two methods of using Anaconda on an existing Cloudera CDH cluster (CDH is Cloudera's distribution including Apache Hadoop):
- Use the Anaconda parcel for Cloudera CDH. The following procedure describes how to install the Anaconda parcel on a CDH cluster using Cloudera Manager. The Anaconda parcel provides a static installation of Anaconda, based on Python 2.7, that can be used with Python and PySpark jobs on the cluster.
- Use Anaconda Scale (note: this product has been discontinued, so using it may not be a good idea; it also requires an Anaconda Enterprise license), which provides additional functionality, including the ability to manage multiple conda environments and packages, including Python and R, alongside an existing CDH cluster.
Installing the Anaconda Parcel
1. From the Cloudera Manager Admin Console, click the "Parcels" indicator in the top navigation bar.
2. Click the "Configuration" button at the top right of the Parcels page.
3. Click the plus symbol in the "Remote Parcel Repository URLs" section, and add the following repository URL for the Anaconda parcel: https://repo.anaconda.com/pkgs/misc/parcels/
4. Click the "Save Changes" button.
NOTE: In my case, the Anaconda parcel still did not show up in the Parcels list, and there was a NullPointerException in the Cloudera Manager log:
2017-11-17 12:43:08,730 INFO 459754680@scm-web-5592:com.cloudera.server.web.cmf.ParcelController: Synchronizing repos based on user request admin
2017-11-17 12:43:08,880 WARN ParcelUpdateService:com.cloudera.cmf.persist.ReadWriteDatabaseTaskCallable: Error while executing CmfEntityManager task
The workaround is to remove the Sqoop URLs (both the http and https entries) under Cloudera Manager Parcels -> Configuration -> Remote Parcel Repository URLs. After removing them, the Anaconda parcel showed up in the Parcels list.
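If you manage the Remote Parcel Repository URLs programmatically (for example, via the Cloudera Manager API), the same cleanup can be sketched as a simple filter over the URL list. This is a hypothetical illustration: `current_urls` stands in for whatever URLs your Parcels configuration actually contains, and the Sqoop URLs shown are placeholders.

```python
# Hypothetical sketch: drop the problematic Sqoop parcel repository URLs
# from the Remote Parcel Repository URLs list before saving it back.
def prune_sqoop_urls(urls):
    """Return the URL list with any Sqoop parcel repositories removed."""
    return [u for u in urls if "sqoop" not in u.lower()]

# example stand-in for the configured repository URLs
current_urls = [
    "https://repo.anaconda.com/pkgs/misc/parcels/",
    "http://example.com/sqoop-connectors/parcels/latest/",
    "https://example.com/sqoop-connectors/parcels/latest/",
]
print(prune_sqoop_urls(current_urls))
# -> ['https://repo.anaconda.com/pkgs/misc/parcels/']
```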
5. Click the "Download" button to the right of the Anaconda parcel listing.
6. After the parcel is downloaded, click the "Distribute" and "Activate" buttons to distribute and activate the parcel on all of the cluster nodes.
7. After the parcel is activated, Anaconda is available on all of the cluster nodes.
8. The Cloudera Manager Spark status will show a stale configuration. Restart and deploy the stale configurations by clicking the corresponding button in Cloudera Manager.
9. Anaconda is now deployed and can be used with the Hue pySpark notebook (note: Livy needs to be installed first, as described in another blog post on this site; search for Livy in the search box). Test a small example in Hue -> Query -> Editor -> pySpark:
Example: pandas test

```python
# importing pandas inside the function verifies that the Anaconda parcel's
# pandas is available on each executor node; the +10 just makes the
# transformation visible in the output
def import_pandas(x):
    import pandas
    return x + 10

int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_pandas(x)).collect()
```

Output: [11, 12, 13, 14]
You can now use all of the Python libraries in the Anaconda package, such as pandas, NumPy, and SciPy, from the Hue pySpark notebook.
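As a quick follow-up check, a minimal snippet such as the following should now run in the pySpark notebook (pandas runs on the driver here; the DataFrame contents are arbitrary illustration data):

```python
import numpy as np
import pandas as pd

# build a small DataFrame and compute a column mean to confirm that
# pandas and NumPy from the Anaconda parcel are importable and usable
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": np.arange(4)})
print(df["x"].mean())  # -> 2.5
```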