Press "Enter" to skip to content

Author: Kevin Feasel

Changing Case in SSMS

Steve Jones has a quick tip for us:

I never knew I could change case for objects in SSMS easily. This actually was something that another individual pointed out to me, but once I tried it, I liked it and know I’ll use it at times.

Click through to see how to change your code to lower-case or upper-case in a single command.

Comments closed

The Continuing Relevance of Feature Engineering

Pete Warden points out something which is obvious and still needs to be said:

One of the most exciting aspects of deep learning’s emergence in computer vision a few years ago was that it didn’t appear to require any feature engineering, unlike previous techniques like histograms-of-gradients or Haar cascades. As neural networks ate up other fields like NLP and speech, the hope was that feature engineering would become unnecessary for those domains too. At first I fully bought into this idea, and saw any remaining manually-engineered feature pipelines as legacy code that would soon be subsumed by more advanced models.

Over the last few years of working with product teams to deploy models in production I’ve realized I was wrong. I’m not the first person to raise this idea, but I have some thoughts I haven’t seen widely discussed on exactly why feature engineering isn’t going away anytime soon. One of them is that even the original vision case actually does rely on a *lot* of feature engineering, we just haven’t been paying attention. 

Read the whole thing.

Comments closed

Securely Access VMs with Azure Bastion

I have a post on Azure Bastion:

Azure Bastion is a service which acts as a managed RDP or SSH host, allowing you to use a web browser securely to connect to a virtual machine, even when that virtual machine does not have a public IP address. If you’re new to Azure networking, it may feel a little complicated, but let’s see how to configure and use Bastion.

Click through for a step-by-step guide on how to use the service.

Comments closed

Spark SQL Bucketing and Query Tuning

Tomaz Kastrun continues a series on Apache Spark. Part 13 looks at bucketing and partitioning in Spark SQL:

Partitioning and Bucketing in Hive are used to improve performance by eliminating table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between them is how they split the data.

Part 14 covers query hints:

This hint instructs Spark to use the hinted strategy on specified relation when joining tables together. When BROADCASTJOIN hint is used on Data1 table with Data2 table and overrides the suggested setting of statistics from configuration spark.sql.autoBroadcastJoinThreshold.

Spark also prioritise the join strategy, and also when different JOIN strategies are used, Spark SQL will always prioritise them.

Be sure to check those out.

Comments closed

Hypothetical Indexes in SQL Server

Eitan Blumin explains what hypothetical indexes are and why they might be useful:

Using Hypothetical Indexes, you can generate an estimated execution plan for a given query, that would essentially assume the existence of a “hypothetical” index as if it actually exists as a real index. Compare that estimated execution plan to its counterpart without the hypothetical index, and you’ll be able to determine whether creating this index for real is worth the time and effort.

Hypothetical Indexes are actually nothing new in SQL Server. It existed since SQL Server version 2005. However, its use is still not widespread to this day. Most likely because it’s not very easy to use and the relevant commands are undocumented.

Click through to see how to use them and an important warning if you try it in production.

Comments closed

Query Variation in QDS Toolbox

Jared Poche looks at a procedure to track query variation over time:

Since Query Store stores statistics long term, we can use it to look for trends or major changes in performance. The Query Variation component of the QDS Toolbox helps us review changes and has a lot of options to allow us to select the kinds of changes we are interested in.

So, let’s review our options and go through some examples.

Click through for some examples of how to use the query variation procedure.

Comments closed

Get the Previous Non-NULL Value in DAX

Kristyna Hughes calculates the lagged value in DAX:

If you are an avid report designer or user, you may have wanted to see the percent difference between one row and the previous one. Sounds easy right? Well let’s break it down. Percent difference follows a consistent formula: (current value – previous value)/previous value. Current value? Easy peasy. Previous value? That’s where this blog comes into play. It’s fairly simple to get the previous month value or previous date period value because Power BI can tell it exactly what the filter would be (September minus one month is always August ). However, there may not have been any invoices in August, so how can we compare September with the last month that had an invoice?

Click through to see how.

Comments closed

Preventing Concurrent Pipeline Execution in Azure Data Factory

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out-of-the-box that can help you to prevent concurrent pipeline runs. For tumbling window triggers there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs. It will not cancel any pipeline runs. It depends on your use case if you really want that behavior. 

Instead, the two look at a pair of designs and this post is all about the first one.

Comments closed

Spark DataFrames

Tomaz Kastrun continues a series on working with Apache Spark. Part 10 looks at the DataFrame construct:

We have looked in datasets and seen that a dataset is distributed collection of data. A dataset can be constructer from JVM object and later manipulated with transformation operations (e.g.: filter(), Map(),…). API for these datasets are also available in Scala and in Java. But in both cases of Python and R, you can also access the columns or rows from datasets.

On the other hand, dataframe is organised dataset with named columns. It offers much better optimizations and computations and still resembles a typical table (as we know it from database world). Dataframes can be constructed from arrays or from matrices from variety of files, SQL tables, and datasets (RDDs). Dataframe API is available in all flavours: Java, Scala, R and Python and hence it’s popularity.

Part 11 looks at external R and Python packages and DataFrame support:

When you install Spark, the extension of not only languages but also other packages, systems is huge. For example with R, not only that you can harvest the capabilities of distributed and parallel computations, you can also extend the use of R language.

Part 12 gets into Spark SQL:

Spark SQL is a one of the Spark modules for structured data processing and analysing. Spark provides Spark SQL and also API for execution of SQL queries. Spark SQL can read data from Hive instance, but also from datasets and dataframe. The communication between Spark SQL and execution engine will always result in a dataset or datafrane.

These formats are interchangeable. So interacting with SQL against result from a different API is possible, respectively. Plugging in the Java JDBD or standard ODBC drivers will also give your SQL interface access to different sources. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.

With API unification, user can access Spark SQL using Scala spark-shell, using Python pyspark or using R sparkR shell.

DataFrames are so popular that they’ve become the de facto standard for working with data in Spark, and .NET languages only work with DataFrames, not with the raw RDDs.

Comments closed

Learning Experiences from Transactional Replication

Ned Otter shares war stories:

I’ve dealt with SQL replication for decades, and in a sense, not a lot has changed. I mean this from a basic configuration and troubleshooting perspective, though it has in some ways been extended a bit through the years, for new SQL Server features (like In-Memory OLTP, Azure, etc.).

Many refer to replication as the the Swiss Army Knife of SQL Server, and I can understand why, but with this “extreme flexibility” comes “extreme shortcomings”, and this post will delve into some of the issues you should be aware of.

Click through for plenty of useful tips.

Comments closed