Some notes on the Anaconda Python package manager. Reference: https://medium.freecodecamp.org/why-you-need-python-environments-and-how-to-manage-them-with-conda-85f155f4353c
Conda is the main installer for the Anaconda packages. Conda can be used to create multiple environments with different Python or other package versions. The Anaconda packages are installed under /<some path>/Anaconda3/pkgs and other sub-directories. Inside a new Conda installation, the root environment is activated by default, so you … Continue reading Anaconda Python notes
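To see which environment is active, as discussed in the excerpt above, a small check can be run from inside Python. This is only a sketch: the helper name describe_active_environment is made up for illustration, and it assumes conda has exported the CONDA_DEFAULT_ENV variable, which it does when an environment is activated.

```python
import os
import sys


def describe_active_environment(environ=None, prefix=None):
    """Return the conda environment name (if any) and the interpreter prefix.

    Hypothetical helper for illustration; both arguments default to the
    values of the running process.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    return {
        # conda exports CONDA_DEFAULT_ENV when an environment is activated
        "conda_env": environ.get("CONDA_DEFAULT_ENV", "(not set)"),
        # sys.prefix points at the directory of the environment in use
        "prefix": prefix,
    }


if __name__ == "__main__":
    info = describe_active_environment()
    print("Active conda environment:", info["conda_env"])
    print("Interpreter prefix:", info["prefix"])
```

Running the same script after activating a different environment should report a different name and prefix.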
The following Python code reads a Hive table and converts it to a Pandas DataFrame so you can use Pandas to process the rows. NOTE: Be careful when copying and pasting the code below: the double quotes need to be retyped, because they get changed in the copy and cause a syntax error.
--------------------------------------------------------------------------------------------------------------
import pandas as pd
from pyspark import SparkConf, SparkContext … Continue reading Use Pandas in Jupyter PySpark3 kernel to query Hive table
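Since the excerpt above is cut off, here is a minimal sketch of the same idea, not the post's own code. It uses SparkSession with Hive support rather than the SparkConf/SparkContext imports shown in the excerpt, and the helper name hive_table_to_pandas and the table name default.sample_table are placeholders for illustration.

```python
def hive_table_to_pandas(spark, table_name, limit=1000):
    """Query a Hive table through Spark and return a Pandas DataFrame.

    Hypothetical helper: `spark` is an existing SparkSession with Hive
    support; `limit` caps the rows collected to the driver.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    query = "SELECT * FROM {} LIMIT {}".format(table_name, int(limit))
    # spark.sql() returns a Spark DataFrame; toPandas() collects it locally
    return spark.sql(query).toPandas()


def main():
    # Imported here so the helper above does not require Spark at import time
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-to-pandas")
             .enableHiveSupport()
             .getOrCreate())
    # Placeholder table name; replace with a table that exists in your Hive
    df = hive_table_to_pandas(spark, "default.sample_table", limit=100)
    print(df.head())
```

Call main() from a notebook cell or script on a machine where Spark and Hive are configured. Keeping the limit small matters because toPandas() loads every returned row into the driver's memory.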
The following Python code makes REST calls to a secure, Kerberos-enabled Hadoop cluster, using the webhdfs REST API to get file data. You first need to run $ kinit userid@REALM to authenticate and obtain a Kerberos ticket for the user. Make sure the Python modules requests and requests_kerberos have been installed. Otherwise, install them, for example: … Continue reading Run a Python program to access Hadoop webhdfs and Hive with Kerberos enabled
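The excerpt above stops before the code, so the following is a rough sketch of such a call rather than the post's own program. It assumes a Kerberos ticket already exists for the current user and that requests and requests_kerberos are installed; the function names, host, port, and file path are placeholders for illustration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op="OPEN", **params):
    """Build a WebHDFS URL of the form http://host:port/webhdfs/v1<path>?op=..."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = urlencode(dict(op=op, **params))
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(hdfs_path), query)


def read_hdfs_file(host, port, hdfs_path):
    """Fetch a file's contents over WebHDFS using the current Kerberos ticket."""
    # Third-party modules, imported here so webhdfs_url works without them
    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
    response = requests.get(webhdfs_url(host, port, hdfs_path), auth=auth)
    response.raise_for_status()  # fail loudly on 401/403/404
    return response.text


if __name__ == "__main__":
    # Placeholder NameNode host, port, and path; adjust for your cluster
    print(read_hdfs_file("namenode.example.com", 50070, "/user/userid/sample.txt"))
```

A 401 response usually means the ticket is missing or expired, in which case re-authenticating and retrying is the first thing to check.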
Programming language popularity
JupyterHub prerequisites: Before installing JupyterHub, you will need a Linux/Unix-based system with over 10GB of free space and Python 3.4 or greater. An understanding of using pip or conda for installing Python packages is helpful.
Installation using conda: Check whether the Anaconda package is already installed:
$ dpkg -l | grep conda
$ rpm -ql conda    # if using RHEL/CentOS
If Anaconda … Continue reading Install Jupyterhub
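The two measurable prerequisites above (Python version and free disk space) can be checked with a short script before starting the installation. This is a sketch only: the function name check_prerequisites is made up, "over 10GB" is read here as more than 10 GiB, and the path to check defaults to the root filesystem.

```python
import shutil
import sys

MIN_PYTHON = (3, 4)
MIN_FREE_BYTES = 10 * 1024 ** 3  # assumption: "over 10GB" read as 10 GiB


def check_prerequisites(path="/", version=None, free_bytes=None):
    """Return a list of unmet prerequisites (an empty list means all good).

    `version` and `free_bytes` default to the running interpreter and the
    free space on `path`; they can be passed explicitly for testing.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python {}.{} found, need {}.{} or greater".format(
            version[0], version[1], MIN_PYTHON[0], MIN_PYTHON[1]))
    if free_bytes <= MIN_FREE_BYTES:
        problems.append("only {:.1f} GiB free, need over 10".format(
            free_bytes / 1024 ** 3))
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("All prerequisites met" if not issues else "\n".join(issues))
```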
Cloudera/Hadoop:
tiny.cloudera.com/hw-reqs
tiny.cloudera.com/aws-ra
http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html
http://docplayer.net/25124019-Hadoop-security-authors-ben-spivey-and-joey-echeverria-provide-in-depth-information-about-the-security-features-available-in-hadoop-and-organize-them.html
http://blog.cloudera.com/blog/2015/03/how-to-quickly-configure-kerberos-for-your-apache-hadoop-cluster/
http://wpcertification.blogspot.com/
https://henning.kropponline.de/
https://blogs.msdn.microsoft.com/pliu/2016/01/02/integrating-cloudera-cluster-with-active-directory-part-13/
Jupyter:
https://blog.insightdatascience.com/using-jupyter-on-apache-spark-step-by-step-with-a-terabyte-of-reddit-data-ef4d6c13959a
Docker:
https://www.dataquest.io/blog/docker-data-science/
Miscellaneous:
https://blog.daftcode.pl/hype-driven-development-3469fc2e9b22
https://github.com/parth8891/NYC_Taxi_Data_Analysis
https://keshif.me/demo/VisTools
http://blog.thedigitalgroup.com/dattatrayap/high-speed-ingestion-into-solr-with-custom-talend-component-developed-by-tdg/
http://www.bigendiandata.com/
https://requestbin.com/
Environment:
Cloudera CDH 5.12.x running Livy and Spark (see the other post on this website for installing Livy).
Anaconda parcel installed using Cloudera Manager (see the other post on this website for installing the Anaconda parcel on CDH).
Non-Kerberos cluster. A Kerberos-based Hadoop cluster needs a different setup, and these instructions won't work for it.
We will first install Anaconda and … Continue reading Install Jupyter notebook with Livy for Spark on Cloudera Hadoop
This post shows how to install the Anaconda parcel in CDH to enable Pandas and other Python libraries in the Hue PySpark notebook. http://docs.anaconda.com/anaconda/user-guide/tasks/integration/cloudera/
There are two methods of using Anaconda on an existing cluster with Cloudera CDH, Cloudera’s distribution including Apache Hadoop: Use the Anaconda parcel for Cloudera CDH. The following procedure describes how to install … Continue reading Install Anaconda Python package on Cloudera CDH.