Connecting PolyBase to Spark

I have a blog post on connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._

// Cast the string columns to proper types and drop rows whose
// POPULATION is the non-numeric value 'A', then save the result as an ORC table.
spark.sql("""
  SELECT
    INT(SUMLEV) AS SummaryLevel,
    INT(COUNTY) AS CountyID,
    INT(PLACE) AS PlaceID,
    BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
    NAME AS Name,
    POPTYPE AS PopulationType,
    INT(YEAR) AS Year,
    INT(POPULATION) AS Population
  FROM NorthCarolinaPopulation
  WHERE
    POPULATION <> 'A'
""")
  .write.format("orc").saveAsTable("NorthCarolinaPopulationTyped")
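
For comparison, here is a rough sketch of the same typing step written with the DataFrame API instead of a SQL string. It is not from the original post: the two input rows are made-up placeholders (not census data), the name `typed` is hypothetical, and the local session is only there so the snippet can run on its own. In spark-shell or a notebook, `getOrCreate()` should simply return the existing `spark` session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; on a cluster this reuses the existing one.
val spark = SparkSession.builder()
  .appName("TypedPopulationSketch")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// Placeholder rows, NOT real data: every column starts out as a string,
// mirroring the untyped NorthCarolinaPopulation table.
val raw = Seq(
  ("162", "0", "1", "0", "Example Town", "Resident", "2010", "100"),
  ("162", "0", "2", "0", "Other Town", "Resident", "2010", "A")
).toDF("SUMLEV", "COUNTY", "PLACE", "PRIMGEO_FLAG", "NAME", "POPTYPE", "YEAR", "POPULATION")

// Drop the rows whose POPULATION is the non-numeric 'A', then cast each column.
val typed = raw
  .filter(col("POPULATION") =!= "A")
  .select(
    col("SUMLEV").cast("int").as("SummaryLevel"),
    col("COUNTY").cast("int").as("CountyID"),
    col("PLACE").cast("int").as("PlaceID"),
    col("PRIMGEO_FLAG").cast("boolean").as("IsPrimaryGeography"),
    col("NAME").as("Name"),
    col("POPTYPE").as("PopulationType"),
    col("YEAR").cast("int").as("Year"),
    col("POPULATION").cast("int").as("Population")
  )

// Inspect the resulting column types before writing anything out.
typed.printSchema()
```

Checking the schema with `printSchema()` before calling `.write.format("orc").saveAsTable(...)` is a cheap way to confirm the columns have the intended types before an external consumer such as PolyBase reads the table.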

It’s not all perfect, though: I also cover driver problems that I ran into with Spark and Hive.