Press "Enter" to skip to content

Month: June 2021

Improving Parquet External Table Performance in Dedicated SQL Pools

Jovan Popovic shows us two ways of accessing data in Parquet files in Azure Synapse Analytics dedicated SQL pools:

Azure Synapse Analytics enables you to read Parquet files stored in the Azure Data Lake storage using the T-SQL language and high-performance Parquet readers. The key characteristic of these high-performance Parquet readers is that they are using the native (C++) code for reading Parquet files, unlike the existing Polybase Parquet reader technology that uses the Java code. These native readers are introduced in the serverless SQL pools in Azure Synapse Analytics workspaces.

In many experiments, this native technology that is used in the serverless SQL pools demonstrated better performance compared to the existing Polybase external table in the dedicated SQL pools.

This native technology for reading Parquet files is now also available in the dedicated SQL pools. In the dedicated Pools in Azure Synapse Analytics, you can create external tables that use native code to read Parquet files and improve performance of your queries that access external Parquet files.

Click through for the process, as well as what kind of performance differences you can see. Some of the queries ended up being worse for native tables versus PolyBase tables, but the majority were a good bit better.

Leave a Comment

Creating a CSV File from a Table via BCP

Kenneth Fisher shows how to use the bcp command to create a file from a table:

This is a pretty handy little tool in your arsenal. I’ve talked about using bcp to transfer data from one instance to another before and this is another really great use for bcp. If you haven’t used it before bcp stands for Bulk Copy Protocol and is a command line tool for transferring data in and out of SQL Server. In this case you can use this command to generate a csv file from DBName.SchemaName.TableName:

Click through to see the command, as well as some helpful hints.

Leave a Comment

Accessing ADLS Gen2 Data in Serverless SQL Pools with SAS Tokens

Neel Ball shows how you can use various techniques, including SAS tokens, to access data stored in Azure Data Lake Storage Gen2 from Azure Synapse Analytics serverless SQL pools:

You have a data lake that contains employee and social feed data. You have data residing in an employee folder that is used by HR team members and twitter for live social feeds that is usually used by marketing folks. If you use SAS token or RBAC, you cannot control to the folder level.

How do you allow users to perform data exploration using synapse serverless with fine grain control on underlying storage.

Read on for one solution.

Leave a Comment

Using Azure Purview to Catalog S3 Datasets

Gauri Mahajan crosses the clouds:

Data exists in various types of formats and hosted in equally varied type repositories depending on the format of data. With the advent of the cloud, generally, every public cloud provider like Amazon AWS, Microsoft Azure, and Google Cloud Platform provides a variety of data hosting avenues. A large-scale enterprise is very likely to have a multi-cloud footprint, where data is hosted on more than one cloud. As the data footprint on the cloud becomes larger, the need for a data catalog becomes increasingly important. Each cloud provider provides its format of the cloud-native data catalog. But even after employing a cloud provider-specific catalog, an enterprise may struggle to get a 360-degree view of data on the cloud. The reason being each cloud provider providers the ability to catalog data hosted on its own cloud only, which compels the end-users to reconcile or reference data from multiple catalogs, as the data catalog on the cloud may not integrate with cross-cloud data hosting services. As the client needs evolve, so do the cloud services. Azure Purview is the brand-new data catalog and governance-related service on the Azure cloud. It has introduced features to support cataloging data hosted on AWS Simple Storage Service (S3), which is typically considered the storage layer of data lake on AWS.

Read on to see how.

Leave a Comment

Creating a dacpac for a Dedicated SQL Pool

Kevin Chant shows us how to build out a dacpac file for an Azure Synapse Analytics dedicated SQL Pool:

In reality, you can create a dacpac for a database that’s inside an Azure Synapse Analytics dedicated SQL Pool using a lot of the methods that you use to create them for SQL Server databases.

Azure Data Studio can be an appealing alternative SQL Server Data Tools (SSDT) for tasks like this. Due to various reasons. For instance, it’s a multi-platform solution that is easy to install.

With this in mind, I decided in this post to cover how to create a dacpac for an Azure Synapse Analytics dedicated SQL Pool using Azure Data Studio.

Click through to see how.

Leave a Comment

When Expressions Beat Local Variables

Erik Darling talks about a pet peeve of mine:

I see people writing stored procedures that use local variables either because they read that all functions in a where clause are dangerous, or because they need to perform a calculation on one of the parameters that get passed in.

In nearly all cases, you’re better off… not doing that.

Here are a couple examples of why.

I think a lot of this stems from advice to avoid non-SARGable predicates, but miss the idea that expressions of the sort Erik shows are perfectly cromulent.

Leave a Comment

Using Lobe for Training ML Models

Chris Webb reviews a free tool from Microsoft:

The most impressive thing about it is not what it does but how it does it: a lot of tools claim to make machine learning easy for non-technical users but Lobe really is easy to use. My AI/ML knowledge is very basic but I got up and running with it extremely quickly.

To test it out I downloaded lots of pictures of English churches and trained a model to detect whether the church had a tower or a spire. After I labelled the pictures appropriately:

Click through for Chris’s findings. Looks like the only thing it does today is image classification, but more functionality is forthcoming.

Leave a Comment

Scaling HDFS to an Exabyte

Konstantin Shvachko, et al, explain some of the changes to the Hadoop Distributed File System needed to scale to one exabyte of data:

LinkedIn runs its big data analytics on Hadoop. During the last five years, the analytics infrastructure has experienced tremendous growth, almost doubling every year in data size, compute workloads, and in all other dimensions. It recently reached two important milestones.

1. LinkedIn now stores 1 exabyte of total data across all Hadoop clusters.

2. Our largest 10,000-node cluster stores 500 PB of data. It maintains 1 billion objects (directories, files, and blocks) on a single NameNode serving RPCs with an average latency under 10 milliseconds, making it one of the largest (if not the largest) Hadoop cluster in the industry.

From the early days of LinkedIn, Apache Hadoop was the basis of our analytics infrastructure. Many teams assisted in this effort to make Hadoop our canonical big data platform.

Read on for different techniques they’ve used, as well as code changes implemented in HDFS to support this data size.

Leave a Comment