conda update conda — ran 4 days and exited: what to do

How to upgrade conda or Anaconda packages: I was trying to update a very old conda version (4.7.12) to the latest, and ran the command "conda update conda" as root on the base Anaconda environment on CentOS 7.9 Linux. This is what happened. Never run "pip install" and "conda install" in the same environment. As … Continue reading conda update conda — ran 4 days and exited: what to do
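One way to spot an environment that has already mixed the two installers: `conda list` marks pip-installed packages with `pypi` in its channel column. A minimal sketch of scanning that output (the sample listing below is illustrative, not taken from the incident described):

```python
# Detect packages installed with pip inside a conda environment.
# `conda list` marks such packages with "pypi" in the channel column.
def pip_installed(conda_list_output):
    names = []
    for line in conda_list_output.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blanks
        fields = line.split()
        # name  version  build  [channel] — channel is "pypi" for pip installs
        if fields[-1] == "pypi":
            names.append(fields[0])
    return names

sample = """\
# packages in environment at /opt/anaconda3:
numpy                     1.21.6          py37h976b520_0
requests                  2.28.1                   pypi_0    pypi
"""
print(pip_installed(sample))  # -> ['requests']
```

If this returns anything, the environment has been touched by pip and a conda solver run can take much longer, or fail outright.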

Upgrade Git in CentOS 7

Upgrade Git from version 1.8 to 2.36 on CentOS 7 (Red Hat Enterprise Linux, Oracle Linux, CentOS, Scientific Linux, et al.). RHEL and its derivatives typically ship older versions of Git. You can download a tarball and build from source, or use a third-party repository such as the IUS Community Project to obtain a more recent version of Git. First install … Continue reading Upgrade Git in CentOS 7
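Before pulling in a third-party repository, it helps to confirm the installed Git really is too old. A small version comparison, sketched in Python (the version strings below are examples, not output from any particular machine):

```python
# Compare a `git --version` string against a required minimum version.
def version_tuple(git_version):
    # "git version 1.8.3.1" -> (1, 8, 3, 1)
    return tuple(int(p) for p in git_version.split()[-1].split("."))

def needs_upgrade(installed, minimum=(2, 0)):
    # Tuple comparison handles differing component counts correctly.
    return version_tuple(installed) < minimum

print(needs_upgrade("git version 1.8.3.1"))  # -> True
print(needs_upgrade("git version 2.36.1"))   # -> False
```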

Copy HDFS file to Azure Synapse ADLS2 using NiFi

The NiFi pipeline for copying HDFS files to an Azure Synapse ADLS2 folder is shown below. This was tested with Cloudera CDP 7.1.7 and Azure Synapse ADLS2. GetHDFSFileInfo generates the list of both directories and files. Keep in mind that the GetHDFSFileInfo processor does not maintain any state, so every time it executes it … Continue reading Copy HDFS file to Azure Synapse ADLS2 using NiFi
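Because GetHDFSFileInfo relists everything on every run, duplicates have to be filtered downstream before the copy step. The filtering idea can be sketched outside NiFi (the paths and the seen-set here are illustrative, not NiFi internals):

```python
# Sketch of handling a stateless relisting: only paths not yet copied pass through.
def new_paths(listing, already_copied):
    """Return paths from the latest full listing that haven't been copied yet."""
    return [p for p in listing if p not in already_copied]

copied = {"/data/2023/a.orc"}                        # previously copied files
listing = ["/data/2023/a.orc", "/data/2023/b.orc"]   # full relist on every run
to_copy = new_paths(listing, copied)
print(to_copy)  # -> ['/data/2023/b.orc']
copied.update(to_copy)  # remember what was copied for the next run
```

In an actual flow this role is typically played by a dedup step (for example, keyed on the file path) between the listing and the copy processors.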

Install Microsoft ODBC driver for SQL Server (Linux)

Reference: https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver16#redhat18

Microsoft ODBC 18 for Red Hat Enterprise Server and Oracle Linux

# Download the appropriate package for the OS version
# Choose only ONE of the following, corresponding to your OS version
# Red Hat Enterprise Server 7 and Oracle Linux 7
(base) [root@]# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.6.1810 (Core)

… Continue reading Install Microsoft ODBC driver for SQL Server (Linux)
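Once the driver package is installed, client code refers to it by driver name. A sketch of assembling a pyodbc-style connection string for ODBC Driver 18 (server, database, and credentials below are placeholders, not values from this post):

```python
# Build an ODBC connection string for the "ODBC Driver 18 for SQL Server".
# The result would normally be passed to pyodbc.connect(); all values here
# are placeholders for illustration.
def mssql_conn_str(server, database, user, password):
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server},1433;"
        f"Database={database};"
        f"Uid={user};Pwd={password};"
        "Encrypt=yes;TrustServerCertificate=no;"
    )

cs = mssql_conn_str("sqlhost.example.com", "mydb", "appuser", "secret")
print(cs.split(";")[0])  # -> Driver={ODBC Driver 18 for SQL Server}
```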

SQLAlchemy query AWS Redshift database with Python

First install psycopg2:

$ conda install psycopg2
$ vi example-sql-redshift.py

### sqlalchemy python script to select rows from AWS Redshift
###################################################################
import psycopg2
import sqlalchemy as db

engine = db.create_engine('postgresql://userid:password@xxxxxxxx.us-west-1.redshift.amazonaws.com:5439/dbname')
connection = engine.connect()
print("after engine.connect")
result = connection.execute("select * from public.tablename123 limit 20;")
print("after execute select * from table")
print(result, type(result))
print("rowcount", result.rowcount)
#for row … Continue reading SQLAlchemy query AWS Redshift database with Python

Bash script to create Kerberos users

#!/bin/bash
if [ $# -eq 0 ]; then
  echo "ERROR: No arguments provided, use syntax: thisscript.sh myusername"
  exit 1
fi
for i in "$@"
do
  echo "Argument: $i"
  echo "KADMIN step..."
  kadmin.local -q "addprinc $i@KERBEROSREALM"
  kinit $i
  # echo "PASSWORD step..."
  # passwd $i
  mkdir /data/user/$i
  chown $i /data/user/$i
  chmod 700 /data/user/$i
  klist
  ls -l … Continue reading Bash script to create Kerberos users

Hive 3 ACID tables creation using ORC format

Hive 3 brings a great improvement: full ACID (atomic, consistent, isolated, and durable) transactions. This keeps data fully consistent, similar to an RDBMS. Without ACID there is a risk of data loss or corruption. As of now there are a few limitations, as below: ACID tables can only be created using … Continue reading Hive 3 ACID tables creation using ORC format
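As a minimal illustration of the kind of table the post describes (table and column names here are made up; in Hive 3 a transactional table must be a managed table stored as ORC):

```sql
-- Hypothetical Hive 3 ACID table: managed table, ORC storage,
-- transactional property enabled.
CREATE TABLE db1.orders_acid (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```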

Migrating StreamSets SDC to Apache NiFi: install steps

Now that StreamSets Data Collector (SDC) is no longer open source after version 4.x (as of mid-2021), we need a similar dataflow ETL tool to replace it. After looking at Apache Airflow and Apache NiFi, two of the most popular open-source ETL tools, it appears NiFi is the most similar to StreamSets SDC. Airflow is more a … Continue reading Migrating StreamSets SDC to Apache NiFi: install steps

Connect SQLAlchemy to Cloudera Impala or Hive

The code below connects to Impala with Kerberos enabled. You can also connect to Hive by changing the host and setting the port to 10000.

import sqlalchemy
from sqlalchemy.engine import create_engine

connect_args = {'auth': 'KERBEROS', 'kerberos_service_name': 'impala'}
engine = create_engine('hive://impalad-host:21050', connect_args=connect_args)
conn = engine.connect()
ResultProxy = conn.execute("SELECT * FROM db1.table1 LIMIT 3")
print(ResultProxy.fetchall())

Is Apache Impala 65 to 200 times faster than Apache Hive on Tez?

Ran the same query, SELECT COUNT(*) FROM DB1.TABLE1, on a 35-million-row table. Each query was run 4 times, without any delay between runs, on both the Apache Impala impala-shell and the Apache Hive Beeline CLI. This was to avoid session-creation and data-caching timing issues during … Continue reading Is Apache Impala 65 to 200 times faster than Apache Hive on Tez?
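The run-it-four-times method can be expressed as a small harness: time every run, then discard the first one (which pays for session setup and cache warm-up) and report the rest. The workload below is a stand-in function, not an actual Impala or Hive query:

```python
import time

# Time a query-like callable several times; the first run is discarded to
# exclude session-creation and cache warm-up effects, as in the methodology above.
def benchmark(run_query, runs=4):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings[1:]  # drop the warm-up run

fake_query = lambda: sum(range(100_000))  # stand-in workload
results = benchmark(fake_query)
print(len(results))  # -> 3 timed runs after discarding the warm-up
```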