Anaconda Python notes

Some notes on the Anaconda Python package manager.

Reference: https://medium.freecodecamp.org/why-you-need-python-environments-and-how-to-manage-them-with-conda-85f155f4353c

Conda is the main installer for the Anaconda packages.
Conda can be used to create multiple environments with different Python or other package versions.
The Anaconda packages are installed under /<some path>/Anaconda3/pkgs and other sub-directories.
Inside a new Conda installation, the root environment is activated by default, so you …
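As a rough illustration of creating a separate environment with its own Python version, the sketch below assembles a `conda create` command line from Python. The environment name `py36`, the version `3.6` and the extra package are made-up examples, not values from the notes, and the helper name is hypothetical.

```python
def build_conda_create_command(env_name, python_version, packages=()):
    """Return the argument list for creating a new conda environment.

    env_name, python_version and packages are illustrative inputs;
    nothing here comes from the original notes.
    """
    command = ["conda", "create", "--yes", "--name", env_name,
               "python={}".format(python_version)]
    command.extend(packages)  # optional extra packages for the new environment
    return command


# Hypothetical example: a Python 3.6 environment that also contains pandas.
example_command = build_conda_create_command("py36", "3.6", ["pandas"])
print(" ".join(example_command))
```

The list can be passed to `subprocess.run(example_command, check=True)` on a machine where `conda` is on the PATH. The new environment then has to be activated before use.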


Use Pandas in Jupyter PySpark3 kernel to query Hive table

The following Python code reads a Hive table and converts it to a Pandas DataFrame so you can use Pandas to process the rows.

NOTE: Be careful when copying and pasting the code below. The double quotes get changed to typographic quotes, which cause a syntax error, so they need to be retyped.

import pandas as pd
from pyspark import SparkConf, SparkContext
…
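Because the excerpt stops after the imports, here is a minimal sketch of the general idea rather than the original code. It assumes a `SparkSession` with Hive support instead of the `SparkConf`/`SparkContext` pair imported above. The function names, the default application name and the `limit` parameter are hypothetical.

```python
def create_hive_session(app_name="hive-to-pandas"):
    """Create a SparkSession that can see Hive tables (assumes pyspark is installed)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper below stays importable
    return SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()


def hive_table_to_pandas(spark, table_name, limit=1000):
    """Run a SELECT against a Hive table and return the rows as a Pandas DataFrame.

    The limit guards against pulling a very large table onto the driver,
    because toPandas() collects all selected rows into local memory.
    """
    query = "SELECT * FROM {} LIMIT {}".format(table_name, int(limit))
    return spark.sql(query).toPandas()
```

In a PySpark kernel where a session already exists, that object can be passed as `spark` directly. Otherwise `hive_table_to_pandas(create_hive_session(), "some_db.some_table")` would do the same with a placeholder table name.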

Tableau Desktop connect to Cloudera Hadoop using Kerberos

Reference: http://website4everything.blogspot.com/2015/04/connecting-tableau-to-hive-server-2.html

The basic steps to connect Tableau to Cloudera Hive or Impala with Kerberos authentication are:

Download and install the MIT Kerberos Client for Windows.
Set the Kerberos realm and server details in C:\ProgramData\MIT\Kerberos5\krb5.ini.
(Optional) The KRB5CCNAME system environment variable may at times need to be set to a temporary value: FILE:C:\temp\kerberos\krb5cache …
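To give an idea of what the realm and server details in krb5.ini look like, the sketch below renders a minimal file body. The realm `EXAMPLE.COM` and host `kdc.example.com` are placeholders, the helper name is hypothetical, and a real cluster will usually need more settings than these two sections.

```python
def render_krb5_ini(realm, kdc_host, admin_server=None):
    """Render a minimal krb5.ini body with a default realm and one KDC.

    All argument values are supplied by the caller; the ones used in the
    example below are placeholders, not values from the original post.
    """
    realm = realm.upper()                    # Kerberos realms are conventionally upper case
    admin_server = admin_server or kdc_host  # assume the KDC also hosts the admin server
    lines = [
        "[libdefaults]",
        "    default_realm = {}".format(realm),
        "",
        "[realms]",
        "    {} = {{".format(realm),
        "        kdc = {}".format(kdc_host),
        "        admin_server = {}".format(admin_server),
        "    }",
    ]
    return "\n".join(lines) + "\n"


print(render_krb5_ini("EXAMPLE.COM", "kdc.example.com"))
```

The printed text could be saved as C:\ProgramData\MIT\Kerberos5\krb5.ini after replacing the placeholders with the values for your own realm.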

Run a Python program to access Hadoop webhdfs with Kerberos enabled

The following Python code makes REST calls to a secure, Kerberos-enabled Hadoop cluster, using the WebHDFS REST API to get file data. You first need to run $ kinit userid@REALM to authenticate and obtain a Kerberos ticket for the user. Make sure the Python modules requests and requests_kerberos have been installed. Otherwise install them …
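The code itself is cut off in this excerpt, so the following is only a sketch of how such a call might look, not the original program. The host, port and file path are placeholders, both function names are hypothetical, and it assumes a valid ticket already exists from kinit. The NameNode HTTP port differs between Hadoop versions, so it is passed in explicitly.

```python
from urllib.parse import quote


def build_webhdfs_url(host, port, hdfs_path, operation="OPEN"):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    return "http://{}:{}/webhdfs/v1{}?op={}".format(host, port, quote(hdfs_path), operation)


def read_webhdfs_file(host, port, hdfs_path):
    """Fetch a file's bytes over WebHDFS using the current Kerberos ticket."""
    # Third-party modules are imported lazily so the URL helper works without them.
    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    url = build_webhdfs_url(host, port, hdfs_path, "OPEN")
    # op=OPEN normally answers with a redirect to a DataNode; requests follows it.
    response = requests.get(
        url,
        auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
        allow_redirects=True,
    )
    response.raise_for_status()
    return response.content
```

A call such as `read_webhdfs_file("namenode.example.com", 50070, "/user/userid/sample.csv")` would return the file contents as bytes. All three arguments here are illustrative.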

MicroStrategy Desktop connect to Impala

Environment: MicroStrategy Desktop 10.11, Cloudera CDH 5.12, Impala 2.x

Steps to connect MicroStrategy Desktop to Cloudera Impala:

The best thing about MicroStrategy Desktop, unlike Tableau Desktop, is that it is free to download and use, and it is a powerful BI visualization/query tool. Tableau Public Desktop is free, but it has only a few connectors and cannot connect to Hadoop …

ESRI-GIS Tools for Hadoop

The ESRI GIS Tools for Hadoop are a collection of GIS tools for spatial analysis of big data.

Reference: https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive

Aggregation Sample for Hive: point-in-polygon-aggregation-hive

The following steps are taken from the above reference.

Step-1: Make a folder anywhere on the local server where Hive is installed:
$ mkdir /tmp/esri-git
Step-2: Bring down the git repository: …
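To show what a point-in-polygon aggregation computes, here is a small pure-Python sketch that counts points per polygon using the ray-casting test. This is not the Esri implementation and does not run on Hive. The function names and sample coordinates are made up for illustration.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex list."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def count_points_per_polygon(points, polygons):
    """Aggregate: number of points that fall inside each named polygon."""
    counts = Counter()
    for x, y in points:
        for name, polygon in polygons.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break  # assume polygons do not overlap
    return dict(counts)


# Made-up sample data: two square regions and four points.
regions = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(3, 0), (5, 0), (5, 2), (3, 2)],
}
print(count_points_per_polygon([(1, 1), (0.5, 1.5), (4, 1), (10, 10)], regions))
```

In the Hive sample the same kind of count is expressed as a query over tables of points and polygons. The loop above only mirrors the logic on a tiny in-memory data set.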

Business Intelligence, ETL and Data Science tools

Free or open-source BI / ETL tools:

Talend = ETL tool, leader in Gartner Magic Quadrant
Streamsets = ETL tool
Apache Nifi = ETL tool
Pentaho = desktop and server version BI/ETL tool
HUE = Hadoop Analytics server, BI, Query tool
KNIME = Data Science leader in Gartner Magic Quadrant 2017, desktop version
Jupyter Notebook …

Install Jupyterhub

JupyterHub prerequisites:

Before installing JupyterHub, you will need:

A Linux/Unix based system with over 10GB of free space.
Python 3.4 or greater.
An understanding of using pip or conda for installing Python packages is helpful.

Installation using conda:

Check if the Anaconda package is already installed:
$ dpkg -l | grep conda
$ rpm -ql conda        -- if using rhel/centos
If Anaconda …
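The prerequisites above can be checked roughly from Python with only the standard library. In this sketch the function name and the result keys are hypothetical, and the thresholds simply restate the 10GB and Python 3.4 figures from the list.

```python
import shutil
import sys

MIN_PYTHON = (3, 4)              # Python 3.4 or greater
MIN_FREE_BYTES = 10 * 1024 ** 3  # over 10GB of free space


def check_prerequisites(path="/", version_info=None, free_bytes=None):
    """Return a dict of simple pass/fail checks for the stated prerequisites.

    version_info and free_bytes can be passed in to override what is
    detected on the current machine, which also makes the check easy to test.
    """
    if version_info is None:
        version_info = sys.version_info
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return {
        "python_ok": tuple(version_info[:2]) >= MIN_PYTHON,
        "disk_ok": free_bytes > MIN_FREE_BYTES,
        # conda on the PATH suggests Anaconda or Miniconda is already installed.
        "conda_found": shutil.which("conda") is not None,
    }


print(check_prerequisites())
```

Looking for `conda` on the PATH also covers installations made with the Anaconda shell installer, which the system package manager may not know about.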

Some helpful links

Cloudera/Hadoop:
tiny.cloudera.com/hw-reqs
tiny.cloudera.com/aws-ra
http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html
http://docplayer.net/25124019-Hadoop-security-authors-ben-spivey-and-joey-echeverria-provide-in-depth-information-about-the-security-features-available-in-hadoop-and-organize-them.html
http://blog.cloudera.com/blog/2015/03/how-to-quickly-configure-kerberos-for-your-apache-hadoop-cluster/
http://wpcertification.blogspot.com/
https://henning.kropponline.de/
https://blogs.msdn.microsoft.com/pliu/2016/01/02/integrating-cloudera-cluster-with-active-directory-part-13/

Jupyter:
https://blog.insightdatascience.com/using-jupyter-on-apache-spark-step-by-step-with-a-terabyte-of-reddit-data-ef4d6c13959a

Docker:
https://www.dataquest.io/blog/docker-data-science/

Miscellaneous:
https://blog.daftcode.pl/hype-driven-development-3469fc2e9b22
https://github.com/parth8891/NYC_Taxi_Data_Analysis
https://keshif.me/demo/VisTools
http://blog.thedigitalgroup.com/dattatrayap/high-speed-ingestion-into-solr-with-custom-talend-component-developed-by-tdg/
http://www.bigendiandata.com/