
Month: August 2022

An Intro to Key Word Analysis

Lewis Prince continues a series on natural language processing:

Here we are with part 2 of this blog series on web scraping and natural language processing (NLP). In the first part I discussed what web scraping is, why it’s done and how it can be done. In this part I will explain what NLP is at a high level, and then go into detail on an application of NLP called key word analysis (KWA).

Read on for a high-level overview of the topic and how to do it in Cognitive Services. But not the topic model—that’d be a different post.
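As a rough illustration of the idea behind key word analysis, here is a minimal Python sketch that ranks words by raw frequency after dropping common stop words. It is not the approach from the linked post, which uses Cognitive Services; the stop-word list, the `top_keywords` helper, and the sample text are all made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real analysis would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it",
    "on", "for", "with", "that", "this",
}


def top_keywords(text, n=3):
    """Return the n most frequent non-stop-word tokens with their counts."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample text standing in for scraped content.
sample = (
    "Web scraping collects text from the web. "
    "Key word analysis then ranks the words in that text, "
    "so scraping and analysis work together."
)

print(top_keywords(sample, 4))
```

Frequency counting is the simplest possible ranking; hosted services typically also handle multi-word phrases and weigh terms by more than raw counts.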

Comments closed

Purview Access Policies and SQL Server 2022

Srdan Bozovic links Purview and SQL Server 2022:

The focus of this article is on using Microsoft Purview to enable access to user data as well as specific system metadata in SQL Server 2022 running on Azure Arc–enabled servers.

With the SQL Server 2022 release, the goal is to enable three main scenarios:

– Browsing data in user-defined tables and views.

– Performance monitoring with system commands, functions, and views.

– Security auditing with security-related system functions and views.

If Azure Arc-enabled servers are required for Purview to work, I think that will seriously hinder uptake.

Comments closed

Resolving tempdb Issues in Azure SQL DB

Holger Linke troubleshoots some problems:

The tempdb system database is a global resource available to users who are connected to Azure SQL Database or any instance of SQL Server. It holds temporary user objects that are explicitly created by a user or application, and internal objects that are created by the SQL Server database engine itself. The most common tempdb issue is running out of space, either regarding tempdb’s overall size quota or the transaction log.

The available tempdb space in Azure SQL Database depends on two factors: the service tier (pricing tier) that the database is configured with, and the type of workload that is executed against the database. These are also the main factors to control if you are running out of tempdb space.

Click through for several error cases and how we can resolve them.

Comments closed

When Estimated and Actual Plans Differ

Brent Ozar notes that estimated plans are not guarantees:

A reader posted a question for Office Hours:

Hi Brent, What is your take on Hugo Kornelis’s explanation of execution plan naming? As per his explanation, an estimated execution plan is simply an execution plan, whereas an actual execution plan = execution plan + run-time stats. Do you agree that the naming is flawed and confusing? – Yourbiggestfan

Click through to see examples of when estimated plans might look different from actual plans.

Comments closed

What’s New in SynapseML

Nellie Gustafsson and Mark Hamilton share an update:

SynapseML is a massively scalable (feel free to spin up hundreds of machines!) machine learning library built on Apache Spark. SynapseML makes it easy to train production-ready models to solve problems from simple classification and regression to anomaly detection, translation, image analysis, speech to text, and just about any ML challenge you are facing. Under the hood, SynapseML integrates a wide array of ML technologies such as LightGBM, Vowpal Wabbit, ONNX, and the Cognitive Services into a single easy-to-use API compatible with MLflow. We know, we know, everyone hates when developers invent new APIs, but you can rest easy because SynapseML integrates cleanly into existing Spark ML APIs so you can embed models directly into existing pipelines. We strive to make SynapseML available to developers wherever they work, and the library is available in a variety of languages like Python, Scala, Java, and R. As of this release SynapseML is also usable from .NET, C#, and F#.

Saving the best language for last, I see. Click through for the list of updates.

Comments closed

What’s in a Name?

Benjamin Smith analyzes a name change:

Recently, RStudio announced its name change to Posit. For many this name change was accepted with open arms, but for some, not so. Being the statistician that I am, I decided to post a poll on LinkedIn to see the sentiment of my network. After running the poll for a week the results were in:

Read on for the responses as well as an analysis using RStan.

Comments closed

Data Retention: Definition and Policy

Joey Jablonski thinks about data retention:

Data retention policies should be defined so that they are easy to understand and easy to implement programmatically, and they should enable engineering teams to operate independently most of the time when working with datasets that are known and already leveraged by the organization. In addition to policy definitions, data governance leaders should ensure changes are part of data literacy plans for training and rollout to ensure awareness across the organization.

This is something that most DBAs provide input into but don’t directly control. Still, it’s good to know some of the challenges around data retention and figure out how to apply it to your organization.
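To make "easy to implement programmatically" a little more concrete, here is a minimal Python sketch of a retention rule expressed as data plus one check. The `RetentionPolicy` class, the `is_expired` helper, the dataset name, and the 90-day window are all hypothetical and do not come from the linked post.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """A hypothetical retention rule: keep a dataset's records for retain_days."""
    dataset: str
    retain_days: int


def is_expired(policy, created_on, today):
    """Return True when a record is older than the policy's retention window."""
    return today - created_on > timedelta(days=policy.retain_days)


# Example values are assumptions chosen only for illustration.
policy = RetentionPolicy(dataset="web_logs", retain_days=90)

print(is_expired(policy, date(2022, 1, 1), date(2022, 8, 1)))
print(is_expired(policy, date(2022, 7, 1), date(2022, 8, 1)))
```

Keeping the policy as plain data means the same definition can be read by people and checked by a scheduled job, which is one way to meet both of the goals in the quote.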

Comments closed

Cross-Platform SQL Server Availability Groups

Rajendra Gupta shows how to set up an Availability Group in SQL Server which runs on both Windows and Linux:

Microsoft supports SQL Server on Linux, and it has many of the same features as the Windows version. You can restore databases from Windows to Linux SQL or vice versa. The Linux SQL works with Red Hat, Ubuntu, SUSE Enterprise, Kubernetes containers, and Docker.

Windows-based SQL instances support SQL Server Always On Availability Groups for high availability and disaster recovery. If you are not familiar with Windows AG configuration, refer to the extensive series on Always On Availability Groups (ToC at the bottom).

If you have both Windows and Linux SQL Server, is it possible to configure an availability group between them? Let’s explore this in this article.

This example uses async mode, which is the easier one to set up. With synchronous, you’re probably looking at using Pacemaker to sort out AG status.

2 Comments

Database-Driven Parameterization for Synapse Pipelines

Paul Hernandez does some configuring:

In Synapse in particular, there are not even global parameters like those in Azure Data Factory.

When you want to move your development to another environment, CI/CD pipelines are typically used. These pipelines consume an ARM template together with its parameter file to create a workspace in a target environment. The parameters can be overridden in the CD pipeline as explained here: https://techcommunity.microsoft.com/t5/data-architecture-blog/ci-cd-in-azure-synapse-analytics-part-4-the-release-pipeline/ba-p/2034434

Even so, I have not found a proper way to change the values of a pipeline parameter (the same goes for data flow and dataset parameters). I saw some custom parameter manipulation to set the default value of a parameter and then deploy it without any value, or even JSON manipulation with PowerShell (the dark side for me).

Read on for an alternative solution which does the job well.
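The general idea of database-driven parameterization can be sketched in a few lines of Python: keep per-environment values in a table and look them up at run time. This is only an illustration under stated assumptions, not the author's implementation. The `pipeline_parameter` table, its columns, the `load_parameters` helper, and every value below are hypothetical, and an in-memory SQLite database stands in for whatever database a real setup would use.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a real config database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pipeline_parameter (
        environment TEXT NOT NULL,
        pipeline    TEXT NOT NULL,
        name        TEXT NOT NULL,
        value       TEXT NOT NULL,
        PRIMARY KEY (environment, pipeline, name)
    )
    """
)

# Hypothetical parameter values for two environments.
conn.executemany(
    "INSERT INTO pipeline_parameter VALUES (?, ?, ?, ?)",
    [
        ("dev", "load_sales", "source_container", "sales-dev"),
        ("dev", "load_sales", "batch_size", "100"),
        ("prod", "load_sales", "source_container", "sales-prod"),
        ("prod", "load_sales", "batch_size", "5000"),
    ],
)


def load_parameters(conn, environment, pipeline):
    """Return the name/value pairs stored for one pipeline in one environment."""
    rows = conn.execute(
        "SELECT name, value FROM pipeline_parameter "
        "WHERE environment = ? AND pipeline = ?",
        (environment, pipeline),
    )
    return dict(rows.fetchall())


print(load_parameters(conn, "prod", "load_sales"))
```

With this shape, moving to a new environment means adding rows rather than editing deployed pipeline definitions.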

Comments closed

Sharing Power BI Content outside the Organization

Mara Pereira wants to share some data:

I am seeing more and more customers trying to use Premium capabilities to create data products that they can incorporate as part of their main product offering. This kind of reporting as a product solution will add a lot more value to their main product, so I can see why this is becoming quite trendy.

However, it became obvious that the current documentation can be a bit overwhelming and confusing at first.

So I decided to compile the process of sharing content outside of your organisation in a blog post. Happy days!

Click through to see how to share within the Power BI Service.

Comments closed