Using Polybase To Insert Into HDFS

I have a post on writing to HDFS using Polybase:

What’s interesting is the error message itself is correct, but could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Sometime in the future, I plan to see how it scales:  with multiple files writing to a multi-node Hadoop cluster, do I get better write performance with a Polybase scaleout cluster?  And if so, how close to linear scale can I get?

Related Posts

Generating Load For Kafka With JMeter

Anup Shirolkar shows us a way to use JMeter to generate load for Apache Kafka clusters: The Anomalia Machina is going to require (at least!) one more thing as stated in the intro, loading with lots of data! Kafka is a log aggregation system and operates on a publish-subscribe mechanism. The Kafka cluster in Anomalia Machina […]

Read More

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More


December 2016
« Nov Jan »