Hadoop – Page 36 – Curated SQL

Using Spark Pools in Azure Synapse Analytics

In the last part of the Azure Synapse Analytics article series, we learned how to create a dedicated SQL pool. Azure Synapse supports three different types of pools – on-demand SQL pool, dedicated SQL pool and Spark pool. Spark provides an in-memory distributed processing framework for big data analytics, which suits many big data analytics use-cases. Azure Synapse Analytics provides mechanisms to use the SQL on-demand pool to query data as a service, the SQL dedicated pool for data warehousing using a distributed data processing engine, and the Spark pool for analytics using an in-memory big data processing engine. This article shows how to create a Spark pool in Azure Synapse Analytics and then how to process data with it.

Click through for a demo on setup and a sample notebook to get started.

A Review of AWS Athena

Published 2021-01-29 by Kevin Feasel

John McCormack updates an older review:

AWS’s own documentation is the best place for full details on the Athena offering; this post hopes to serve as further explanation and also act as an anchor to some more detailed information. As it is a managed service, Athena requires no administration, maintenance or patching. It’s not designed for regular querying of tables in the way that you would query a Relational Database Management System (RDBMS). Performance is geared around querying large data sets which may include structured data or semi-structured data. There are no licensing costs like you may have with some RDBMSs such as SQL Server, and costs are kept low, as you only pay when you run queries in AWS Athena.

Click through for an overview of product benefits.

Joins in Synapse Analytics Spark

Published 2021-01-28 by Kevin Feasel

Ed Elliott continues a series:

This is a bit of a longer one, a look at how to do all the different joins and the exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi oooooooh).

Click through for lots of examples.
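
For readers who have not met the two extra join types, here is a minimal plain-Python sketch of what they mean. It is not Spark code and not Ed's example: the `customers` and `orders` rows and the `semi_join` and `anti_join` helpers are hypothetical, made up for illustration. A semi join keeps the left rows that have at least one match on the right, an anti semi join keeps the left rows that have none, and neither returns any columns from the right side.

```python
def semi_join(left, right, key):
    """Keep left rows whose key appears at least once on the right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep left rows whose key never appears on the right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data, made up for illustration.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cat"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]

with_orders = semi_join(customers, orders, "customer_id")
without_orders = anti_join(customers, orders, "customer_id")

# Ann has two orders but appears only once: a semi join never duplicates left rows.
print([c["name"] for c in with_orders])
print([c["name"] for c in without_orders])
```

In T-SQL the same results are usually written with EXISTS and NOT EXISTS, whereas Spark SQL has LEFT SEMI JOIN and LEFT ANTI JOIN as join types in their own right.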

Parquet 1.x Footer Content

Published 2021-01-28 by Kevin Feasel

Dmitry Tolpeko shows us what the footer of a Parquet 1.x file looks like:

Every Parquet file has a footer that contains metadata: schema, row groups and column statistics. The footer is located at the end of the file.
A Parquet file's content starts and ends with the 4-byte PAR1 “magic” string. Right before the ending PAR1 there is the 4-byte footer length (little-endian encoding).

Click through for more details, as well as one downside to Parquet 1.x.
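
The layout in the quote can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Dmitry's code and not a Parquet reader: the sample bytes are hypothetical placeholders (a real footer holds Thrift-serialized metadata), and `footer_length` and `footer_bytes` are names made up here.

```python
import struct

MAGIC = b"PAR1"


def footer_length(data: bytes) -> int:
    """Return the footer length stored just before the trailing magic."""
    # Smallest possible layout: leading magic + length field + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    # The 4 bytes before the trailing PAR1 are a little-endian unsigned int.
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - 12:
        raise ValueError("footer length is larger than the file body")
    return length


def footer_bytes(data: bytes) -> bytes:
    """Slice out the raw footer, which sits right before the length field."""
    length = footer_length(data)
    end = len(data) - 8  # skip the length field and the trailing magic
    return data[end - length:end]


# Hypothetical stand-in for a file: the payloads are placeholders, not real
# column chunks or real footer metadata.
fake_footer = b"schema+row-groups+stats"
fake_file = (
    MAGIC
    + b"column-chunks"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

print(footer_length(fake_file))
print(footer_bytes(fake_file))
```

With a real file, the same two reads (seek to 8 bytes before the end, then back by the footer length) are enough to locate the metadata without scanning the data pages.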

Delta Table Compatibility between Azure Databricks and Azure Synapse Analytics

Published 2021-01-26 by Kevin Feasel

Paul Andrew performs a test:

Or, to ask the question another way…
Question: Can we use (read/write) Delta tables created in Azure Databricks with Azure Synapse Analytics – Spark Compute Pools and vice versa?

Read on for the answer, as well as a number of specific scenarios.

Common Table Expressions in Spark

Published 2021-01-21 by Kevin Feasel

Ed Elliott continues a series on SQL syntax concepts in Spark:

The next example is how to do a CTE (Common Table Expression). When creating the CTE I will also rename one of the columns from “dataType” to “x”.

Read on for the answer.
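
As a rough sketch of the idea in the quote, the snippet below defines a CTE whose column list renames “dataType” to “x”. It runs on SQLite through Python's standard library purely so that it is self-contained. It is not Ed's Spark code, and the `readings` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows; only the dataType -> x rename comes from the quote.
conn.execute("CREATE TABLE readings (id INTEGER, dataType TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, "int"), (2, "string"), (3, "int")],
)

query = """
WITH renamed (id, x) AS (      -- the column list renames dataType to x
    SELECT id, dataType
    FROM readings
)
SELECT id, x
FROM renamed
WHERE x = 'int'                -- the outer query only sees the new name
ORDER BY id
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The rename could equally be done with an alias inside the CTE body (`SELECT id, dataType AS x`); the column list simply keeps the new names in one place.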

Window Functions in Spark

Published 2021-01-20 by Kevin Feasel

Ed Elliott continues a series on Spark examples:

The next example is how to do a ROW_NUMBER(), my favourite window function.

Ed’s example is ROW_NUMBER() but it also applies to other partitioning window functions such as RANK(), DENSE_RANK(), and NTILE().
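
To make the difference between these functions concrete, here is a small plain-Python sketch of ROW_NUMBER(), RANK() and DENSE_RANK() over a partition. It only illustrates the semantics and is not Spark code: the `number_rows` helper and the `sales` rows are hypothetical, and NTILE() is left out for brevity.

```python
from itertools import groupby
from operator import itemgetter


def number_rows(rows, partition_key, order_key):
    """Add row_number, rank and dense_rank per partition, ordered ascending."""
    out = []
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        rank = dense_rank = 0
        previous = object()  # sentinel that equals no real value
        for row_number, row in enumerate(group, start=1):
            if row[order_key] != previous:
                rank = row_number   # rank jumps past tied rows
                dense_rank += 1     # dense_rank never leaves gaps
                previous = row[order_key]
            out.append(
                {**row, "row_number": row_number, "rank": rank, "dense_rank": dense_rank}
            )
    return out


# Hypothetical data with a tie in the "east" partition.
sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 5},
]

result = number_rows(sales, "region", "amount")
for r in result:
    print(r["region"], r["amount"], r["row_number"], r["rank"], r["dense_rank"])
```

The tied rows show where the three functions diverge: row numbers stay unique, rank repeats and then skips, and dense rank repeats without skipping.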

Grouping Data with Spark

Published 2021-01-19 by Kevin Feasel

Ed Elliott has two quick examples of grouping data in Spark:

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support, so many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, then translated into how to do that same thing in Spark SQL and the Spark DataFrame API in C#.

Click through for the first example, GROUP BY.
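
The "same thing, two ways" idea from the quote can be sketched without a Spark cluster. The snippet below is only an analogy, not Ed's T-SQL, Spark SQL or C# code: it runs a GROUP BY as SQL on SQLite and then repeats the aggregation as ordinary Python, using a hypothetical `sales` table.

```python
import sqlite3
from collections import defaultdict

# Hypothetical data, made up for illustration.
rows = [("east", 10), ("west", 5), ("east", 30)]

# 1) Declarative: GROUP BY expressed in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

# 2) Programmatic: the same grouping written as code.
totals = defaultdict(int)
for region, amount in rows:
    totals[region] += amount
code_totals = sorted(totals.items())

print(sql_totals)
print(code_totals)
```

Both routes produce one total per region; which one reads better tends to depend on whether the surrounding logic is already SQL or already code.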

Loading a Spark DataFrame in .NET

Published 2021-01-18 by Kevin Feasel

Ed Elliott shows how to get data and convert it into a Spark DataFrame using .NET:

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. Getting the data into a state that Apache Spark can process involves putting it into a DataFrame. How do you take some data and get it into a DataFrame?
This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.

Click through for several methods.

Answering NiFi Questions

Published 2021-01-14 by Kevin Feasel

Pierre Villard has a few answers to questions about Apache NiFi:

Over the last few weeks, I delivered four live NiFi demo sessions, showing how to use NiFi connectors and processors to connect to various systems, with 1000 attendees in different geographic regions. I want to thank you all for joining and attending these events! Interactive demo sessions and live Q&A are what we all need these days when working remotely from home is now a norm. If you have not seen my live demo session, you can catch up by watching it here.
I received hundreds of questions during these events, and my colleagues and I tried to answer as many as we could. As promised, here are my answers to some of the most frequently asked questions.

Click through for the questions and answers.
