Kevin Feasel – Page 856

In the past couple of days we looked into configurations and infrastructure and today it is again time to do an analysis, let’s call it end-to-end analysis using R or Python or SQL.

Read on for the process.

Comments closed

Row Estimates with Table Variables

Published 2020-12-17 by Kevin Feasel

Gail Shaw explains when table variables estimate one row and when they can generate estimates above one row:

At first glance, the question of how many rows are estimated from a table variable is easy.
But, is it really that simple? Well, not really. To dig into the why, first we need to identify why table variables estimate 1 row. The obvious answer is because they don’t have statistics. However…

Read on to learn the real answer.

Comments closed

A Use Case for Recursive CTEs

Published 2020-12-17 by Kevin Feasel

Jeffin Mathew takes us through a use case for recursive common table expressions:

An individual is working in HR and wants to find out which individual is managing who. This may be for several reasons such as, they need to ask the managers on the progress of their staff and if their appraisal is coming up or is due.
Another scenario may be that the company is enrolling more staff and wants to find out the capacity of the current staff or find individuals who have not yet got anyone to manage to give them the opportunity to do so.

Click through for the solution. Often times, we see recursive CTEs show up in hierarchical queries like this. When the number of records is small, they work really well. The issue comes with scale; that’s when a different table design becomes important.

Comments closed

Power BI Model Documenter

Published 2020-12-17 by Kevin Feasel

Marc Lelijveld has an update for us:

First of all, I worked on the cleanness of my code. As I’m not a native developer, I have to do a lot of trial and error to get stuff working exactly as I have it in mind. As of the beginning, the script contains a task to create the drop off folder for the connection file. Though, if the folder already existed, the script wrote an error to the screen, while everything was actually going as planned. In v1.2.0 of the model documenter, I enhanced the error handling to only write errors to the screen that actually matter.
Secondly, the transcript that runs while the tool is executed, generates a log file. This log file was not always entirely complete. I further enhanced the logging to easier debug in case of undesirable errors.

There are more improvements as well, so check it out.

Comments closed

The Azure Synapse Analytics Manage Hub

Published 2020-12-17 by Kevin Feasel

Saveen Reddy shows off the Manage Hub in Azure Synapse Analytics:

Azure Synapse Analytics allows you to provision a managed virtual network for your workspace. With the managed VNet, administators do not need to handle the burden of configuring traffic management rules, since that configuration is handled by Synapse. Moreover, using the managed VNet provides support for managed private endpoints. These endpoints are created in the managed VNet and enable access to Azure services. Communication between private endpoints and Azure resources occurs over private links, which transfer data through Microsoft’s network infrastructure.

Read on for a walkthrough.

Comments closed

Cross-Cluster and Cross-Service Kusto Queries in ADS

Published 2020-12-17 by Kevin Feasel

Julie Koesmarno shows off some new functionality in Azure Data Studio:

This blog post covers examples of cross-cluster and cross-service querying, including handy syntax, code snippets and notebooks that you can use in Azure Data Studio.
As some of you may already know, Kusto (KQL) extension is available in Azure Data Studio, which allows you to explore Azure Data Explorer (ADX) more natively. ADX also supports cross-cluster and cross-service queries between ADX, Azure AppInsights and Azure Log Analytics. This cross- service query preview feature is documented in Query data in Azure Monitor using Azure Data Explorer.

Click through for the demos.

Comments closed

Database Snapshots in SQL Server

Published 2020-12-17 by Kevin Feasel

Jamie Wick walks us through SQL Server database snapshots:

A SQL Server database snapshot is a read-only view of what the data pages in the source database looked like, at the time that the snapshot was created. Typically, snapshots are used to provide a point-in-time view of the database (for reporting or auditing purposes) or to allow for quick reversions during database upgrades/modifications. Since the snapshot only contains information on which values have changed, and what they were originally, it’s usually faster to revert the snapshot than having to restore the entire database from backup.

Click through for info on how they work as well as how they perform. I have used database snapshots to great effect in the past when testing changes in development environments back before the days of containers.

Comments closed

sparklyr 1.5 Released

Published 2020-12-16 by Kevin Feasel

Yitao Li announces version 1.5 of sparklyr:

A large fraction of pull requests that went into the sparklyr 1.5 release were focused on making Spark dataframes work with various dplyr verbs in the same way that R dataframes do. The full list of dplyr-related bugs and feature requests that were resolved in sparklyr 1.5 can be found in here.
In this section, we will showcase three new dplyr functionalities that were shipped with sparklyr 1.5.

Read on to learn more about this update. H/T R-Bloggers

Comments closed

ETL Anti-Patterns: a Festivus Miracle

Published 2020-12-16 by Kevin Feasel

Tim Mitchell is ready to air some grievances:

We’re rounding the corner to the second half of December, which means it’s time for my favorite holiday: Festivus! Like many of you, I enjoy gathering around the Festivus pole and sharing the time-honored traditions such as the Feats Of Strength and the Airing Of Grievances.
But my favorite Festivus tradition takes place right here on this blog: the Eleven Days of Festivus. Each year, I write a daily blog post each of the eleven days leading up to Festivus, usually around a central theme.

Tim has three posts up so far. First is around jumping straight into the code-writing phase:

Most data architects and developers are intensely curious folks. When we see a set of data, we want to immediately step into a data whisperer role. Where others may see a jumbled mess, we see an opportunity to discover patterns and answers. The best data architects crave those data discovery finds the same way a baseball player craves a bottom-of-the-9th game-winning home run.
That kind of intellectual curiosity is a necessary trait for data architects, but it can lead to a rush straight into writing ETL code. I’ve seen this a lot, and have done it myself (and admittedly still do it on occasion): skipping past the business-value analysis and diving straight into the haystack looking for needles. Getting raw data into a format that can easily be analyzed and validated is a critical part of the ETL development life cycle, but rarely is it the first step.

Second, processing too much data:

A common design flaw in enterprise ETL processes is that they are processing too much data. Having access to a great breadth and depth of data opens up lots of options for historical reporting and data analytics, but very often it is mistakenly assumed that all of the available data must be processed through ETL.
Although it may sound counterintuitive, there are many cases where purposefully leaving some data out of the ETL process leads to a better outcome.

Third is performing full loads when incremental loads are possible:

Earlier this year, I wrote about the concepts of incremental loads and discussed the benefits of loading data incrementally. To recap: an incremental load moves only the new and changed data from each source – rather than the entire bulk of the source data – through the ETL pipeline.
Using incremental loads can improve both the speed and accuracy of data movement and transformation. The time required to process data increases with the volume of said data, and extracting only the new and changed data from the source can ensure an accurate ‘point-in-time’ representation of the data. For these reasons, loading data incrementally is, for most data load needs, the better way to go.

This is a good series to track.

Comments closed

Running Spark on Azure Kubernetes Service

Published 2020-12-16 by Kevin Feasel

Tsuyoshi Matsuzaki walks us through running Apache Spark on Azure Kubernetes Service:

Apache Spark officially includes Kubernetes support, and thereby you can run a Spark job on your own Kubernetes cluster. (See here for official document. Note that Kubernetes scheduler is currently experimental.)
Especially in Microsoft Azure, you can easily run Spark on cloud-managed Kubernetes, Azure Kubernetes Service (AKS).
In this post, I’ll show you step-by-step tutorial for running Apache Spark on AKS. In this tutorial, artifacts, such as, source code, data, and container images are all protected by Azure credentials (keys).
Although managed services for Apache Spark, such as, Azure Databricks, Azure Synapse Analytics, and Azure HDInsight, is the best place to run Spark workloads, you will get much flexibility by running workloads on managed Kubernetes (AKS) – such as, spot VM support, start/stop cluster, confidential computing (Intel SGX) support, so on and so forth.

Read on to see how. Though of these options, I’d probably choose Azure Databricks or Azure Synapse Analytics well before the others.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Author: Kevin Feasel

End-to-End ML with Azure Databricks

Row Estimates with Table Variables

A Use Case for Recursive CTEs

Power BI Model Documenter

The Azure Synapse Analytics Manage Hub

Cross-Cluster and Cross-Service Kusto Queries in ADS

Database Snapshots in SQL Server

sparklyr 1.5 Released

ETL Anti-Patterns: a Festivus Miracle

Running Spark on Azure Kubernetes Service