Press "Enter" to skip to content

Category: Hadoop

Using Spark Pools in Azure Synapse Analytics

Rahul Mehta shows how to create and use an Apache Spark pool in Azure Synapse Analytics:

In the last part of the Azure Synapse Analytics article series, we learned how to create a dedicated SQL pool. Azure Synapse support three different types of pools – on-demand SQL pool, dedicated SQL pool and Spark pool. Spark provides an in-memory distributed processing framework for big data analytics, which suits many big data analytics use-cases. Azure Synapse Analytics provides mechanisms to use SQL on-demand pool to query data as a service, SQL dedicated pool for data warehousing using distributed data processing engine, and Spark pool for analytics using in-memory big data processing engine. This article shows how to create a Spark pool in Azure Synapse Analytics and further how to process the data using it.

Click through for a demo on setup and a sample notebook to get started.

Comments closed

A Review of AWS Athena

John McCormack updates an older review:

AWS’s own documentation is the best place for full details on the Athena offering, this post hopes to serve as further explanation and also act as an anchor to some more detailed information. As it is a managed service, Athena requires no administration, maintenance or patching. It’s not designed for regular querying of tables in a way that you would with an RDBMS. Performance is geared around querying large data sets which may include structured data or semi-structured data. There are no licensing costs like you may have with some Relational Database Management Systems (RDBMS) such as SQL Server and costs are kept low, as you only pay when you run queries in AWS Athena.

Click through for an overview of product benefits.

Comments closed

Parquet 1.x Footer Content

Dmitry Tolpeko shows us what the footer of a Parquet 1.x file looks like:

Every Parquet file has the footer that contains metadata information: schema, row groups and column statistics. The footer is located at the end of the file.

A parquet file content starts and ends with 4-byte PAR1 “magic” string. Right before the ending PAR1 there is 4-byte footer length size (little-endian encoding):

Click through for more details, as well as one downside to Parquet 1.x.

Comments closed

Grouping Data with Spark

Ed Elliott has two quick examples of grouping data in Spark:

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, then translated into how to do that same thing in Spark SQL and the Spark DataFrame API in C#.

Click through for the first example, GROUP BY.

Comments closed

Loading a Spark DataFrame in .NET

Ed Elliott shows how to get data and convert it into a Spark DataFrame using .NET:

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. To get the data in a state that Apache Spark can process it involves putting the data into a DataFrame. How do you take some data and get it into a DataFrame?

This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.

Click through for several methods.

Comments closed

Answering NiFi Questions

Pierre Villard has a few answers to questions about Apache NiFi:

Over the last few weeks, I delivered four live NiFi demo sessions, showing how to use NiFi connectors and processors to connect to various systems, with 1000 attendees in different geographic regions. I want to thank you all for joining and attending these events! Interactive demo sessions and live Q&A are what we all need these days when working remotely from home is now a norm.  If you have not seen my live demo session, you can catch up by watching it here

I received hundreds of questions during these events, and my colleagues and I tried to answer as many as we could. As promised, here are my answers to some of the most frequently asked questions. 

Click through for the questions and answers.

Comments closed