Press "Enter" to skip to content

Month: December 2020

Spark Starter Guide: Data Standardization

Ladon Robinson continues the Spark Starter Guide:

Standardization is the practice of analyzing columns of data and identifying synonyms or like names for the same item. Similar to how a cat can also be identified as a kitty, kitty cat, kitten or feline, we might want to standardize all of those entries into simply “cat” so our data is less messy and more organized. This can make future processing of the data more streamlined and less complicated. It can also reduce skew, which we address in Addressing Data Cardinality and Skew.

We will learn how to standardize data in the following exercises.

Check it out. I’m excited to see the Spark Starter Guide get fleshed out and written.

Comments closed

Azure Synapse Analytics Goes GA

Sacha Tomey recaps some announcements:

After much anticipation, today, Microsoft have announced the general availability of Azure Synapse Analytics! Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and Big Data analytics all into a single service, accelerating time to insights, enabling organisations to become data-driven. Azure Synapse combines capabilities spanning the needs of data engineering, machine learning, and BI without creating silos in processes and tools.

Read on for more info on this as well as info on Azure Purview.

Comments closed

Power BI Premium Per User

Adam Saxton is excited:

Are you curious what Power BI Premium Per User is all about? Adam walks you through how to get it and what it means from a user experience. Take advantage of Power BI Premium features without the Premium capacity price!

Click through for the video as well as a few links for more info.

Comments closed

Changing a Kubernetes Cluster to containerd

Andrew Pruski wants to get ahead of the game:

DISCLAIMER – You’d never do this for a production cluster. For those clusters, you’d simply get rid of the existing nodes and bring new ones in on a rolling basis. This blog is just me mucking about with my Raspberry Pi cluster to see if the update can be done in-place without having to rebuild the nodes (as I really didn’t want to have to do that).

Check it out. In addition to the Twitter thread Andrew mentions, the Kubernetes group has a full blog post with more details.

Comments closed

Use a Separate Deadlock Extended Events Trace

Kendra Little explains why it makes sense to have an extended events trace specifically for deadlocks:

We recently had customer ask why SQL Monitor creates an Extended Events session to capture deadlock graphs, when SQL Server has a built-in system_health Extended Events trace which also captures deadlock information?

There are a couple of reasons why a dedicated trace is desirable for capturing deadlock graphs, whether you are rolling your own monitoring scripts or building a monitoring application. I like this question a lot because I feel it gets at an interesting tension/balance at the heart of monitoring itself.

Click through for the answer.

Comments closed

Aggregate Functions in SQL Server

Hugo Kornelis takes us through the concept of aggregate functions:

SQL Server currently supports three operators that can compute aggregations: Hash MatchStream Aggregate, and Window Aggregate. These operators all use the same basic principle of maintaining internal counters as rows are processed, so that the final value of those internal counters is the expected value.

Read on to see the full list, as well as how they operate.

Comments closed

Introducing Azure Purview

Wolfgang Strasser gives us a once-over on a new service:

Today, at the Azure Data and Analytics event, a new Azure data governance service called Azure Purview (https://aka.ms/AzurePurview) was presented and made available in a public preview.

I have not had a chance to try the actual service, but I found a very interesting video (Microsoft mechanics video) where I took the following screenshots from.

Read on for Wolfgang’s thoughts. It’s definitely a step up from Azure Data Catalog.

Comments closed

ETL with R in SQL Server

Rajendra Gupta shows one reason for using R inside SQL Server:

Data professionals get requests to import, export data into various formats. These formats can be such as Comma-separated data(.CSV), Excel, HTML, JSON, YAML, Tab-separated data(.TSV). Usually, we use SQL Server integration service ETL packages for data transformations, import or export data.

SQL Machine Learning can be useful in dealing with various file formats. In the article, External packages in R SQL Server, we explored the R services and the various external packages for performing tasks using R scripts.

In this article, we explore the useful Rio package developed by Thomas J. Leeper to simplify the data export and import process.

One thing I’d like to reiterate is that even though you’re using R to move this data, you don’t need to perform any data science activities on the data—R can be the easiest approach for getting and cleaning up certain types of data.

Comments closed