Use Pandas in the Jupyter PySpark3 kernel to query a Hive table

The following Python code reads a Hive table into a Spark DataFrame, which can then be converted to a Pandas DataFrame so you can use Pandas to process the rows.

NOTE: If you copy and paste the code below, check the double quotes. Some editors and web pages convert them to curly quotes, which Python rejects with a syntax error; retype them as straight quotes if that happens.


import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext

# Raise the maximum Kryo serialization buffer size to 512 MB
conf = SparkConf().set("spark.kryoserializer.buffer.max", "512m")

sc = SparkContext(conf=conf)
sqlContext = SQLContext.getOrCreate(sc)

# Create a Hive context
hive_context = HiveContext(sc)

print("Reading Hive table...")
sparkdf = hive_context.sql("SELECT * FROM default.table123")

print("Registering DataFrame as a table...")  # Show first rows of dataframe
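
The conversion to Pandas described at the top could be sketched roughly as follows. This is a minimal sketch, not the author's code: the helper name hive_query_to_pandas and the max_rows parameter are hypothetical, and it assumes the hive_context object created earlier. The only Spark calls it relies on are sql(), limit() and toPandas(). Because toPandas() collects every row onto the driver, the sketch caps the row count first.

```python
import pandas as pd


def hive_query_to_pandas(hive_context, query, max_rows=1000) -> pd.DataFrame:
    """Run a query through a Hive context and return the result as a Pandas DataFrame.

    hive_query_to_pandas and max_rows are illustrative names, not part of PySpark.
    """
    spark_df = hive_context.sql(query)
    # limit() caps the number of rows on the cluster side, so toPandas()
    # only collects at most max_rows rows into driver memory.
    return spark_df.limit(max_rows).toPandas()
```

With the context from the listing above, a call might look like pandas_df = hive_query_to_pandas(hive_context, "SELECT * FROM default.table123"), after which pandas_df.head() shows the first rows. Pick max_rows to fit the driver's memory; for a full-table export, aggregating or filtering in the SQL query first is usually the safer route.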



