Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._
spark.sql("""
SELECT
INT(SUMLEV) AS SummaryLevel,
INT(COUNTY) AS CountyID,
INT(PLACE) AS PlaceID,
BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
NAME AS Name,
POPTYPE AS PopulationType,
INT(YEAR) AS Year,
INT(POPULATION) AS Population
FROM NorthCarolinaPopulation
WHERE
POPULATION <> 'A'
""")
.write.format("orc").saveAsTable("NorthCarolinaPopulationTyped")

It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.

Related Posts

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark: There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and […]

Read More

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow: Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info. Databricks Runtime 5.3 and 5.3 ML and above support […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

June 2019
MTWTFSS
« May  
 12
3456789
10111213141516
17181920212223
24252627282930