Curated SQL Posts

Speeding Up Power Query with Evaluation Container Memory

Chris Webb notes a new setting in Power Query:

However, if you have just read the docs, you may be wondering what these two new registry key settings actually do. In this post I’m only going to talk about one, MaxEvaluationWorkingSetInMB; I’ll leave ForegroundEvaluationContainerCount for a future post.

At various times in the past I have blogged about how, when you run a Power Query query, the query itself is executed inside a separate process called an evaluation (or mashup) container, and how this process has a limit on the amount of memory it can use. Some transformations, such as sorting a table, doing a group by, pivoting, and unpivoting, require an entire table of data to be held in memory; if these operations need more memory than the evaluation container is allowed to use, it starts paging and query performance gets a lot worse.

Read on to see where setting the maximum evaluation working set memory can help, as well as the caveats that Chris lays out.


Hosting a Python API with Flask

Mrinal Walia shows how you can build a Python API, such as one for generating machine learning predictions, using Flask:

Deployment is a crucial step in the ML workflow. It is the point where we put our ML model to use, so that it can serve predictions in the real world.

But how can we expose the model as a service? We can build an Application Programming Interface (API). With that, we can reach the model from anywhere, whether from a mobile application or a web application. In Python, there’s a library that can assist us in building an API. It’s named Flask.

This article will explain how to construct a REST API for our machine learning model using Flask. Without further ado, let’s begin!

Flask is the first step, but then I’d want to run it under gunicorn behind an Nginx reverse proxy afterward.
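For orientation, here is a minimal sketch of the pattern the article walks through, not Mrinal’s exact code: a Flask endpoint that loads a pre-trained (scikit-learn-style) model and returns predictions as JSON. The model file name and feature names are placeholder assumptions.

# Minimal sketch: serve a pre-trained model over HTTP with Flask.
# "model.pkl" and the feature names are placeholders, not taken from the article.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [[payload["feature_1"], payload["feature_2"]]]
    prediction = model.predict(features)[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    # The built-in development server; this is the part you would swap out for gunicorn.
    app.run(host="0.0.0.0", port=5000)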


Standalone Shiny Apps with systemd

Peter Solymos takes us through configuring a Shiny application which runs via systemd:

Lots of resources describe how you can host Shiny apps with Docker, Shiny Server, or via other means. But we also know Shiny apps can be launched locally. What makes your local setup different from these other options is that your local machine does not usually have a static internet protocol (IPv4) address. Without a static IPv4, it is really hard to share the app with other people because the address keeps changing unpredictably, and you might sometimes power off your machine.

Shiny uses the httpuv R package under the hood, which is an HTTP and websocket server library. Could you just run Shiny directly on a remote server? This post explores this topic using systemd, the system and service manager for most modern Linux distributions. All I am trying to do in this post is to make the point that the shiny R package is really self-sufficient and, in the simplest case, does not need any other layer for sharing an app.

Click through for the process. H/T R-Bloggers.
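As a rough illustration of the idea (not Peter’s exact setup), a unit file along these lines runs a Shiny app directly with Rscript; the service name, paths, user, and port below are all assumptions.

# Hypothetical /etc/systemd/system/shiny-app.service
[Unit]
Description=Standalone Shiny app
After=network.target

[Service]
User=shiny
ExecStart=/usr/bin/Rscript -e "shiny::runApp('/srv/shiny-app', host = '0.0.0.0', port = 8080)"
Restart=on-failure

[Install]
WantedBy=multi-user.target

After a systemctl daemon-reload, systemctl enable --now shiny-app starts the app and brings it back up on reboot.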


TempDB Configuration Tips

Jeff Iannucci adds on to tempdb configuration advice:

Step 1: Provision a separate drive. Put your TempDB away from your precious user data files, as well as the other system databases. This probably isn’t your job, so ask your friends on your infrastructure team to provision and mount a T: drive or whatever of a size that you gauge appropriately. What’s appropriate? Hey, I don’t know your system – that’s why they pay you the big bucks, friend.

Read on for the full set of tips.
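Once that drive exists, the move itself is a handful of ALTER DATABASE statements. This is a hedged sketch with made-up paths and sizes, and the file relocations only take effect after the instance restarts.

-- Sketch only: paths and sizes are placeholders.
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf', SIZE = 8GB);
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\TempDB\templog.ldf', SIZE = 2GB);

-- Additional, equally sized data files (commonly one per core, up to eight).
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 8GB);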


Memory Grants for CHAR vs VARCHAR

Erik Darling has a head-slapping moment:

While working with a client recently, we found that someone, at some point in time, probably during the original migration from Access, had chosen CHAR columns rather than VARCHAR columns.

Okay, fine. How bad could it be?

Bad enough that… A whole bunch of columns that only had a single character in them were stored in CHAR(1000) columns.

I like CHAR…well, to be specific, NCHAR. But only when you’ll need exactly that many characters.
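If you want to see the effect for yourself, a quick repro along these lines (my own sketch, not Erik’s demo) makes the point: run both sorts with actual execution plans and compare the memory grants.

-- Sketch: identical single-character data, fixed-width vs. variable-width columns.
CREATE TABLE dbo.CharDemo    (id int IDENTITY PRIMARY KEY, val char(1000)    NOT NULL);
CREATE TABLE dbo.VarcharDemo (id int IDENTITY PRIMARY KEY, val varchar(1000) NOT NULL);

INSERT dbo.CharDemo    (val) SELECT TOP (100000) 'x' FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;
INSERT dbo.VarcharDemo (val) SELECT TOP (100000) 'x' FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;

-- The char(1000) sort budgets the full declared width for every row, while the
-- varchar(1000) sort is estimated at half the declared width, so its grant is much smaller.
SELECT val FROM dbo.CharDemo    ORDER BY val OPTION (MAXDOP 1);
SELECT val FROM dbo.VarcharDemo ORDER BY val OPTION (MAXDOP 1);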


Improving Parquet External Table Performance in Dedicated SQL Pools

Jovan Popovic shows us two ways of accessing data in Parquet files in Azure Synapse Analytics dedicated SQL pools:

Azure Synapse Analytics enables you to read Parquet files stored in Azure Data Lake Storage using the T-SQL language and high-performance Parquet readers. The key characteristic of these high-performance Parquet readers is that they use native (C++) code to read Parquet files, unlike the existing PolyBase Parquet reader technology, which uses Java code. These native readers were introduced in the serverless SQL pools in Azure Synapse Analytics workspaces.

In many experiments, this native technology used in the serverless SQL pools has demonstrated better performance than the existing PolyBase external tables in the dedicated SQL pools.

This native technology for reading Parquet files is now also available in the dedicated SQL pools. In the dedicated SQL pools in Azure Synapse Analytics, you can create external tables that use native code to read Parquet files and improve the performance of queries that access external Parquet files.

Click through for the process, as well as what kind of performance differences you can see. Some of the queries ended up being worse for native tables versus PolyBase tables, but the majority were a good bit better.
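As a hedged sketch of what the native path looks like (the storage account, container, locations, and column list are all placeholders), the DDL is the familiar three-step pattern; as I read the announcement, creating the external data source without TYPE = HADOOP is what opts the dedicated pool into the native reader.

-- Sketch only: names, locations, and columns are placeholders.
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- No TYPE = HADOOP here: this is the native-reader flavor of data source.
CREATE EXTERNAL DATA SOURCE NativeAdls
WITH (LOCATION = 'https://<storageaccount>.dfs.core.windows.net/<container>');

CREATE EXTERNAL TABLE dbo.SalesParquet
(
    SaleId     bigint,
    SaleAmount decimal(18, 2)
)
WITH (LOCATION = '/sales/2021/', DATA_SOURCE = NativeAdls, FILE_FORMAT = ParquetFormat);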


Creating a CSV File from a Table via BCP

Kenneth Fisher shows how to use the bcp command to create a file from a table:

This is a pretty handy little tool in your arsenal. I’ve talked about using bcp to transfer data from one instance to another before, and this is another really great use for it. If you haven’t used it before, bcp stands for Bulk Copy Program and is a command line tool for transferring data in and out of SQL Server. In this case you can use this command to generate a csv file from DBName.SchemaName.TableName:

Click through to see the command, as well as some helpful hints.
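For reference, an export of that shape generally looks something like the line below; the server name, delimiter, and output path are placeholders, and Kenneth’s post has the exact command and options.

bcp DBName.SchemaName.TableName out C:\temp\TableName.csv -c -t, -S YourServerName -T

Here -c writes character data, -t, sets a comma as the field terminator, -S names the instance, and -T uses Windows authentication.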


Accessing ADLS Gen2 Data in Serverless SQL Pools with SAS Tokens

Neel Ball shows how you can use various techniques, including SAS tokens, to access data stored in Azure Data Lake Storage Gen2 from Azure Synapse Analytics serverless SQL pools:

You have a data lake that contains employee and social feed data. The employee folder is used by HR team members, while a twitter folder with live social feeds is usually used by marketing folks. If you use a SAS token or RBAC, you cannot control access down to the folder level.

How do you allow users to perform data exploration using Synapse serverless with fine-grained control over the underlying storage?

Read on for one solution.
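Whatever scoping approach Neel lands on, the serverless plumbing looks roughly like this sketch: a database scoped credential holding the SAS, an external data source pointing at the relevant path, and OPENROWSET on top. Every name, path, and the token below are placeholders, not his exact setup.

-- Sketch only: credential name, data source, paths, and the SAS token are placeholders.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL EmployeeSas
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = 'sv=2020-08-04&ss=b&srt=co&sp=rl&sig=<redacted>';

CREATE EXTERNAL DATA SOURCE EmployeeData
WITH (
    LOCATION   = 'https://<storageaccount>.dfs.core.windows.net/<container>/employee',
    CREDENTIAL = EmployeeSas
);

SELECT TOP 10 *
FROM OPENROWSET(
    BULK '*.parquet',
    DATA_SOURCE = 'EmployeeData',
    FORMAT      = 'PARQUET'
) AS employees;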


Using Azure Purview to Catalog S3 Datasets

Gauri Mahajan crosses the clouds:

Data exists in various formats and is hosted in equally varied repositories, depending on the format of the data. With the advent of the cloud, every public cloud provider, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform, offers a variety of data hosting options. A large-scale enterprise is very likely to have a multi-cloud footprint, where data is hosted on more than one cloud. As the data footprint on the cloud becomes larger, the need for a data catalog becomes increasingly important.

Each cloud provider offers its own cloud-native data catalog. But even after employing a cloud provider-specific catalog, an enterprise may struggle to get a 360-degree view of its data, because each provider can catalog only the data hosted on its own cloud. That compels end users to reconcile or reference data from multiple catalogs, as a cloud's data catalog may not integrate with cross-cloud data hosting services. As client needs evolve, so do the cloud services. Azure Purview is the brand-new data catalog and governance service on the Azure cloud. It has introduced features to support cataloging data hosted on AWS Simple Storage Service (S3), which is typically considered the storage layer of a data lake on AWS.

Read on to see how.
