Press "Enter" to skip to content

Category: Hadoop

Parquet 1.x Footer Content

Dmitry Tolpeko shows us what the footer of a Parquet 1.x file looks like:

Every Parquet file has a footer that contains metadata: the schema, row groups, and column statistics. The footer is located at the end of the file.

A Parquet file starts and ends with the 4-byte PAR1 “magic” string. Right before the ending PAR1 there is a 4-byte footer length (little-endian encoding).

Click through for more details, as well as one downside to Parquet 1.x.
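
To make that layout concrete, here is a minimal Python sketch (not from the post; the file name is hypothetical) that reads the last eight bytes of a Parquet file to pull out the footer length and the trailing magic string:

```python
import struct

# Hypothetical file name; any Parquet 1.x file will do.
path = "example.parquet"

with open(path, "rb") as f:
    f.seek(-8, 2)                      # last 8 bytes: 4-byte footer length + "PAR1"
    footer_length_bytes = f.read(4)
    magic = f.read(4)

assert magic == b"PAR1", "not a Parquet file"
footer_length = struct.unpack("<I", footer_length_bytes)[0]  # little-endian uint32
print(f"Footer metadata is {footer_length} bytes long")
```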


Grouping Data with Spark

Ed Elliott has two quick examples of grouping data in Spark:

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support, so many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples showing how to do something in T-SQL, then how to do that same thing in Spark SQL and the Spark DataFrame API in C#.

Click through for the first example, GROUP BY.
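
Ed's examples are in T-SQL, Spark SQL, and the C# DataFrame API; as a rough point of comparison, here is a minimal PySpark sketch of the same kind of GROUP BY, using made-up sales data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupby-example").getOrCreate()

# Hypothetical sales data standing in for a T-SQL table.
df = spark.createDataFrame(
    [("North", 100), ("North", 250), ("South", 75)],
    ["region", "amount"],
)

# Spark SQL: SELECT region, SUM(amount) FROM sales GROUP BY region
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

# DataFrame API equivalent of the same GROUP BY
df.groupBy("region").agg(F.sum("amount").alias("total")).show()
```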


Loading a Spark DataFrame in .NET

Ed Elliott shows how to get data and convert it into a Spark DataFrame using .NET:

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. Getting the data into a state that Apache Spark can process involves putting it into a DataFrame. How do you take some data and get it into a DataFrame?

This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.

Click through for several methods.
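
The post covers the .NET routes; for comparison, here is a minimal PySpark sketch of two common ways to build a DataFrame from an in-memory collection (with and without an explicit schema), using hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("create-dataframe").getOrCreate()

# From an in-memory collection, letting Spark infer the schema
people = spark.createDataFrame([("Ed", 42), ("Ana", 35)], ["name", "age"])

# From an in-memory collection with an explicit schema
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])
people_typed = spark.createDataFrame([("Ed", 42), ("Ana", 35)], schema)

# From an external file (hypothetical path)
# csv_df = spark.read.option("header", "true").csv("/data/people.csv")

people_typed.printSchema()
```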


Answering NiFi Questions

Pierre Villard has a few answers to questions about Apache NiFi:

Over the last few weeks, I delivered four live NiFi demo sessions, showing how to use NiFi connectors and processors to connect to various systems, with 1000 attendees in different geographic regions. I want to thank you all for joining and attending these events! Interactive demo sessions and live Q&A are what we all need these days, when working remotely from home is now the norm. If you have not seen my live demo session, you can catch up by watching it here.

I received hundreds of questions during these events, and my colleagues and I tried to answer as many as we could. As promised, here are my answers to some of the most frequently asked questions. 

Click through for the questions and answers.


DataFrame Cleaning in Spark

Craig Covey has an update to the Spark Starter Guide:

Real-world datasets are hardly ever clean and pristine. They commonly include blanks, nulls, duplicates, errors, malformed text, mismatched data types, and a host of other problems that degrade data quality. No matter how much data one might have, a small amount of high quality data is more beneficial than a large amount of garbage data. All decisions derived from data will be better with higher quality data. 

In this section we will introduce some of the methods and techniques that Spark offers for dealing with “dirty data”. The term refers to data that needs to be improved so that the decisions made from it will be more accurate. Dirty data and how to deal with it is a broad topic with a lot to consider. This chapter intends to introduce the problem, show Spark techniques, and educate the user on the effects of “fixing” dirty data.

It’s interesting to see what’s available in Spark and how you can take advantage of it.
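
As a taste of the kind of operations available, here is a minimal PySpark sketch (not from the guide itself; the data and column names are made up) that drops duplicates, treats blanks as nulls, fills missing values, and removes rows that are still incomplete:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dirty-data").getOrCreate()

# Hypothetical data with duplicates, a null, and a blank string
raw = spark.createDataFrame(
    [("a", 1.0), ("a", 1.0), ("b", None), ("", 3.5)],
    ["key", "value"],
)

cleaned = (
    raw.dropDuplicates()                                    # remove exact duplicate rows
       .withColumn("key", F.when(F.col("key") == "", None)  # treat blanks as nulls
                           .otherwise(F.col("key")))
       .na.fill({"value": 0.0})                             # replace null values with a default
       .na.drop(subset=["key"])                             # drop rows still missing a key
)

cleaned.show()
```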
