Use Pandas in Jupyter PySpark3 kernel to query Hive table

The following Python code reads a Hive table into a Spark DataFrame and converts it to a Pandas DataFrame, so you can use Pandas to process the rows.

NOTE: Be careful when copying and pasting the code below. Make sure the double quotes are plain ASCII quotes; web pages often convert them to curly "smart" quotes, which cause a syntax error.

————————————————————————————————————–

import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext

# Increase the Kryo serializer buffer so larger rows can be serialized
conf = (SparkConf().set("spark.kryoserializer.buffer.max", "512m"))

# Stop the SparkContext that the PySpark3 kernel created and
# restart it with the custom configuration
sc.stop()
sc = SparkContext(conf=conf)
sqlContext = SQLContext.getOrCreate(sc)

# Create a Hive context
hive_context = HiveContext(sc)

print("Reading Hive table...")
sparkdf = hive_context.sql("SELECT * FROM default.table123")

print("Previewing the DataFrame...")
sparkdf.show()         # Show the first 20 rows of the Spark DataFrame
sparkdf.printSchema()  # Print the column names and types

# Convert a 10-row sample to Pandas; limit() keeps toPandas() from
# pulling the entire table into the driver's memory
sparkdf.limit(10).toPandas().head(3)
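
Once the data is in a Pandas DataFrame, the full Pandas API is available. Here is a minimal sketch of what that processing could look like, assuming hypothetical column names col1 and col2 (substitute the actual columns of your table):

pdf = sparkdf.limit(1000).toPandas()   # pull a bounded sample to the driver

# Ordinary Pandas operations on the converted frame; "col1" and "col2"
# are hypothetical placeholders for your table's columns
print(pdf.describe())
print(pdf.groupby("col1")["col2"].count())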
