Press "Enter" to skip to content

Author: Kevin Feasel

Generating Basic Features From Text Data In R With textfeatures

Abdul Majed Raja demonstrates the textfeatures package in R:

Michael Kearney, Assistant Professor at the University of Missouri, well known in the R community for the modern Twitter package rtweet, has come up with a new R package called textfeatures that basically generates a bunch of features for any text data that you supply. Before you dream of a deep learning-based package for automated text feature engineering: this isn’t that. This uses very simple text analysis principles and generates features like the number of upper-case letters and the number of punctuation marks – plain, simple stuff, nothing fancy, but pretty useful.

It’s a start for text analysis, though there’s a lot more after this.


SSIS: Target Server Version Leads To Script Task “Corruption”

Slava Murygin walks us through an ugly-looking error with an easy fix:

There are probably a lot of SSIS corruption errors, but this is one that is very easy to solve.
The full error message reads like this:

Script Task:Error: The Script Task is corrupted.
Script Task:Error: There were errors during task validation.
Script Task:Error: There was an exception while loading Script Task from XML: System.Exception: The Script Task “ST_74aca886806a416fa34ae89cac6237c2” uses version 15.0 script that is not supported in this release of Integration Services. To run the package, use the Script Task to create a new VSTA script. In most cases, scripts are converted automatically to use a supported version, when you open a SQL Server Integration Services package in %SQL_PRODUCT_SHORT_NAME% Integration Services. at Microsoft.SqlServer.Dts.Tasks.ScriptTask.ScriptTask.LoadFromXML(XmlElement elemProj, IDTSInfoEvents events)

That error came from the execution of an SSIS package, and it indicates that a particular Script Task is corrupted.
It is very confusing, because if you run the same package in Visual Studio, everything works fine:

I’d consider the initial error, “The Script Task is corrupted.”, to be a bad error message. The longer description is helpful and explains the problem, but the word “corruption” has a scary connotation to DBAs, and tossing it around when you really mean “fix your target server version” is unhelpful.


Getting A Random Row From A Small Data Set

David Fowler has a quick script to get a random row from a relatively small table:

A couple of times recently I’ve seen the question asked, ‘How can I select a single row at random from a table?’.

There are often a few ways of doing this suggested; most seem to rely on using CTEs or temp tables. I thought I’d share, in a quick post, a very simple and easy way of doing it that I’ve used a couple of times.

The script David provides requires a table scan, so it doesn’t scale out very well.  But depending on your hardware, that can still be pretty efficient into the tens of thousands of rows.
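
For reference, here is a minimal sketch of the classic pattern, which may or may not be the exact script David uses; dbo.SmallTable is a hypothetical table:

-- NEWID() produces a fresh uniqueidentifier per row, so ordering by it shuffles the table,
-- which is exactly what forces the full scan.
SELECT TOP (1) *
FROM dbo.SmallTable
ORDER BY NEWID();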


Using DBCC OPTIMIZER_WHATIF To Mimic Production Hardware

Max Vernon has a technique for mimicking production hardware layouts when testing queries in development:

Attempting to debug production performance problems in your development environment can be problematic in many ways, leading to a frustrating troubleshooting experience. One very common situation is that the resources in the development environment are substantially less robust than those on the production system; for instance, prod has 128 GB of RAM while dev only has 16 GB, and prod has 16 cores while dev only has 4. Unintuitively, this disparity can result in queries running faster in development than in production.

SQL Server has a little-known (and undocumented and unsupported) troubleshooting-related DBCC command that can be used to mimic production resource levels in your development environment. As with all undocumented features, do not try this in production.

Read on to learn how DBCC OPTIMIZER_WHATIF can lead the optimizer to choose different plans.  I almost never use this command, but it is helpful to have it in your back pocket.
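
As a hedged sketch of the typical invocation (the command is undocumented, so the property names below come from community write-ups and may vary by version; the hardware numbers are just examples):

DBCC OPTIMIZER_WHATIF(CPUs, 16);          -- tell the optimizer to plan as if there were 16 schedulers
DBCC OPTIMIZER_WHATIF(MemoryMBs, 131072); -- and as if 128 GB of memory were available
-- ...compile the query here (an estimated plan is enough) and compare it to production...
DBCC OPTIMIZER_WHATIF(ResetAll);          -- revert (or set each property back to 0)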


Modifying Data In Temporal Tables

Jeanne Combrinck shows us how we can insert, update, and delete data in temporal tables:

You delete data in the current table with a regular DELETE statement. The end period column for deleted rows will be populated with the begin time of the underlying transaction.
You cannot directly delete rows from the history table while SYSTEM_VERSIONING = ON.

Set SYSTEM_VERSIONING = OFF and delete rows from the current and history tables, but keep in mind that the system will not preserve the history of those changes. TRUNCATE, SWITCH PARTITION OUT of the current table, and SWITCH PARTITION IN of the history table are not supported while SYSTEM_VERSIONING = ON.

Data modification is reasonably straightforward with temporal tables.  Read on for examples.
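
As a hedged sketch of the history-table cleanup Jeanne describes (the table, column, and date here are hypothetical; wrapping it in a transaction keeps versioning from being left off if the delete fails):

-- Hypothetical temporal table dbo.Employee with history table dbo.EmployeeHistory
-- and a period end column named ValidTo; adjust names to your schema.
BEGIN TRANSACTION;
ALTER TABLE dbo.Employee SET (SYSTEM_VERSIONING = OFF);       -- suspend versioning
DELETE FROM dbo.EmployeeHistory WHERE ValidTo < '20170101';   -- prune old history rows
ALTER TABLE dbo.Employee
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));
COMMIT TRANSACTION;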


TANSTAAQRC (Query Result Cache)

Andy Mallon explains that a query result cache does not exist in SQL Server:

I was recently doing a training session when a developer commented that it was OK to run an expensive query twice because on the second execution, SQL Server would use the “results cache” and be “practically free”. It’s not the first time I’ve heard someone refer to a “results cache” in SQL Server. This is one of those myths that is almost true, which makes it that much more believable. If you don’t know better, you might think SQL Server has a “results cache” because the second execution of a query is often faster.

SQL Server does not have a “results cache” and the second execution is not “practically free.”
SQL Server does have a “buffer cache” and the second execution is “faster, but not free.”

The SQL Server buffer cache holds data pages in memory, in the exact form that they reside on disk. The second execution will not have to perform physical I/O operations to satisfy the query, because it can use the buffer cache. However, it does have to perform all other operations. Think of it like this: the second execution still executes the entire execution plan, including all the expensive operations. It is faster, but not “practically free.”
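
If you want to see the distinction for yourself, here is a minimal sketch (dbo.BigTable is a hypothetical stand-in for any reasonably large table, and DBCC DROPCLEANBUFFERS belongs nowhere near production):

DBCC DROPCLEANBUFFERS;                -- empty the buffer cache; dev/test only
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*) FROM dbo.BigTable;    -- first run: physical reads pull pages from disk
SELECT COUNT(*) FROM dbo.BigTable;    -- second run: logical reads only, but the whole plan still executes
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;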

Read the comments for Erik Darling’s plot twist.


Tracking Latency To Azure With PsPing

Arun Sirpal shows us how to use PsPing (part of the Sysinternals tool set) to determine latency between your computer and a VM in an Azure data center:

This is the tool of choice when wanting to find out latency to your Azure SQL Server. In addition to standard ICMP ping functionality, it can report the latency of connecting to TCP ports and the latency of TCP round-trip communication.

I use this to find the latency from my location to various Azure SQL Servers which are in different Azure regions. I am based in the heart of England, so let’s look at and compare a couple of locations (just out of curiosity). Once you have downloaded the tool, you will need to cd to the directory and call the following command.

Read on to see how to use PsPing.


Real-Time Data Visualization With R And SQL Server

Tomaz Kastrun shows how simple it can be to plot real(ish)-time data from SQL Server using R:

In the previous post, I showed how to visualize near real-time data using Python and the Dash module. Now it is time to see one of the many ways to do it in R. This time, I will not use any additional frameworks for visualization, like shiny, plotly, or any others, but will simply use base R functions and the RODBC package to extract data from SQL Server.

Extracting data from SQL Server and simulating inserts into a SQL Server table will primarily simulate the near real-time data. If you have followed the previous post, you will notice that I am using the same T-SQL table and query to extract real-time data.

Tomaz is using the base plot library, but if you want something nicer, there are several good alternatives.


Using word2vec To Model User Behavior

Nishan Subedi walks us through an Etsy project to model user journeys via semantic embedding techniques:

We initially started training the embeddings as a Skip-gram model with negative sampling (NEG as outlined in the original word2vec paper) method. The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It models the context given a target token and attempts to maximize the average likelihood of seeing any of the context tokens given a target token. The negative sampling draws a negative token from the entire corpus with a frequency that is directly proportional to the frequency of the token appearing in the corpus.

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

Similarly, we consider the purchased item in a particular session to be a global contextual token that applies to the entire sequence of user interactions. The intuition behind this is that there are many touch points on the user’s journey that help them come to the final purchase decision, and so we want to share the purchase intent across all the different actions that they took. This is also referred to as the linear multi-touch attribution model.

This is a very interesting article, particularly their attempt at getting around the problem of unexpected explosive growth in demand.
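
For reference, the skip-gram negative sampling objective from the original word2vec paper is the starting point which the Etsy team extends with those implicit negatives. For a target token w_I and an observed context token w_O, with input and output embeddings v and v', sigmoid σ, and k negatives drawn from a noise distribution P_n(w), the per-pair objective is:

\log \sigma\!\left( {v'}_{w_O}^{\top} v_{w_I} \right) \;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left( -{v'}_{w_i}^{\top} v_{w_I} \right) \right]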


When Wait Stats Aren’t Enough

Joe Obbish has an example of diagnosing performance problems when wait stats don’t indicate any problems:

In summary, page allocations and page free events rapidly occur, sometimes in an alternating pattern. SQL Server will often free a number of pages just to immediately request allocations for a similar number of pages. If all of the free page events result in returned memory to the OS then the reason for the scalability bottleneck becomes clear. When running the full workaround with 96 concurrent sessions, a total of 341965 page freed operations were performed. Those events freed about 71.3 million pages in total. That amounts to about 584 GB of memory returned to the OS in total, based on the previous assumptions.
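
For anyone checking the arithmetic behind that figure, a data page is 8 KB: 71.3 million pages × 8,192 bytes per page ≈ 5.84 × 10^11 bytes, or roughly 584 GB.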

This is a great investigation into the depths of debugging in SQL Server.  Joe wasn’t able to get a definitive solution to his problem, but he showed us a lot along the way.
