Author: Kevin Feasel

Power BI Data Profiling

Published 2018-10-16 by Kevin Feasel

Matt Allington takes a look at a new feature in Power BI:

The data profiling tools look at the first 1,000 rows in the preview data loaded and show you the big picture of what the data “looks” like.

Currently the profiling tool only works on the top 1,000 rows of data. It also takes some time to prepare the profile of the columns (as could be expected); however, the benefits of getting this stuff right before moving on far outweigh the slower load times (IMO). I would love to see an option to profile the entire set of data for one or more columns. I am sure this will come.

Teo Lachev shares some thoughts on what it would take to make this a killer feature:

That’s all the data profiling you get for now. Here is what it will take to make Power BI data profiling a killer feature:

Allow data profiling over all the values (understandably there will be performance impact).
Add more aggregates, such as Min/Max/Std/Median.
The ability to dynamically filter the preview data for the selected bar in the profile.

As it is, there’s enough here to see the potential of where it could go.
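
To make the wish list concrete, here is a minimal Python sketch of what a column profile over the first 1,000 rows might compute, including the Min/Max/Std/Median aggregates Teo asks for. This is not how Power BI implements profiling; the function name, the `sample_size` parameter, and the shape of the result are all assumptions made for illustration.

```python
import statistics
from collections import Counter
from itertools import islice


def profile_column(values, sample_size=1000):
    """Profile only the first `sample_size` values of a column (hypothetical helper)."""
    sample = list(islice(values, sample_size))
    present = [v for v in sample if v is not None]
    profile = {
        "rows_profiled": len(sample),
        "empty": len(sample) - len(present),
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }
    # The extra aggregates only make sense for numeric columns.
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric:
        profile.update(
            min=min(numeric),
            max=max(numeric),
            median=statistics.median(numeric),
            std=statistics.pstdev(numeric),
        )
    return profile
```

Passing a larger `sample_size` is the equivalent of profiling more of the data, with the performance cost that implies.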


Testing TDE Performance

Published 2018-10-16 by Kevin Feasel

Eduardo Pivaral tests the performance of a database with Transparent Data Encryption versus that same database without encryption:

Transparent data encryption (TDE) helps you to secure your data at rest; this means the data files and related backups are encrypted, securing your data in case your media is stolen.
This technology works by implementing real-time I/O encryption and decryption, so this implementation is transparent for your applications and users.

However, this type of implementation could lead to some performance degradation since more resources must be allocated in order to perform the encrypt/decrypt operations.

In this post we will compare how much longer some of the most common DB operations take, so that if you are planning to implement it on your database, you have an idea of what to expect from different operations.

These results fit in reasonably well with what I’d heard, but it’s nice to have someone run the numbers.


Using Containers To Build A Home Lab

Published 2018-10-16 by Kevin Feasel

Dmitri Korotkevitch walks us through creating a home lab with Docker containers:

Obviously, in real life, we do not work with a vanilla SQL Server installation. We need to customize it by changing SQL Server settings and logins, creating and/or restoring the databases, and performing other actions. There are a couple of ways to do that.

The first approach is customizing an existing container manually and creating an image from it using the docker container commit command. After that, you can start new containers from the created image the same way as we already discussed. We will cover a couple of ways to move data to and from containers later.

There is a better way, however. You can automate this process by utilizing the docker build command. The process is very simple. You just need to define a Dockerfile, which contains the reference to the main image and specifies the build actions. You can copy scripts and database backups into the image, and run SQLCMD, BCP and PowerShell scripts there; you pretty much have full control. Internally, Docker runs every command inside deployment containers (creating and destroying them during the process), saving the final one as the target image.

Read the whole thing.


Connecting To Elasticsearch With R

Published 2018-10-15 by Kevin Feasel

Jerod Johnson has a sample of connecting to Elasticsearch with R:

You will need the following information to connect to Elasticsearch as a JDBC data source:

Driver Class: Set this to cdata.jdbc.elasticsearch.ElasticsearchDriver.

Classpath: Set this to the location of the driver JAR. By default, this is the lib subfolder of the installation folder.

The DBI functions, such as dbConnect and dbSendQuery, provide a unified interface for writing data access code in R. Use the following line to initialize a DBI driver that can make JDBC requests to the CData JDBC Driver for Elasticsearch:

Read on for the full instructions.


An Incompatible SQL Server Version Was Detected

Published 2018-10-15 by Kevin Feasel

Hamish Watson troubleshoots an issue with Visual Studio 2017 connecting to SQL Server 2017:

This blog post details the errors you may get when using Visual Studio 2017 and you cannot connect to SQL Server 2017 using Test Explorer or SQL Server Object Explorer.

TL;DR – upgrade Visual Studio from base version…..

Read on for Hamish’s explanation.


APPROX_COUNT_DISTINCT

Published 2018-10-15 by Kevin Feasel

Niko Neugebauer is happy with a new function in SQL Server 2019:

A rather interesting result takes place if we scale our database to 100GB TPCH and run the very same queries – the total elapsed time jumps to 50% difference (from 30%), the CPU execution time difference is kept at 50%, but the memory grant gives the biggest difference ever – those 24.476 MB are still intact for the APPROX_COUNT_DISTINCT, while the COUNT(DISTINCT) asks for just a bit over 11GB! Besides going through a completely different gateway on the bigger machines, running COUNT(DISTINCT) will bring your system to a full stop way before the same will take place with the APPROX_COUNT_DISTINCT.
Regarding the precision – in my tests I did not see the difference going over 1%.

Test before using this function, but if you don’t need the exact number and can make do with “close enough,” this can save a boatload of memory on larger tables.
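
To get a feel for why an approximate count can run in a small, fixed amount of memory, here is a textbook HyperLogLog sketch in Python. Approximate distinct-count functions are commonly built on this kind of structure, but this is a generic illustration and not SQL Server's implementation; the class name, the SHA-1 hash, and the precision `p=14` are assumptions chosen for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Generic HyperLogLog sketch; memory use is fixed at 2**p one-byte registers."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit hash
        index = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

However many values pass through `add`, the sketch holds the same 16,384 registers, whereas an exact distinct count has to remember every value it has seen. That is the trade Niko's memory grant numbers reflect: a small, bounded error in exchange for a bounded footprint.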


Word Counts In DAX

Published 2018-10-15 by Kevin Feasel

Philip Seamark shows us a way of splitting strings into words in DAX:

Here is a technique you might consider if you need to split text down to individual words. This could be used to help count, rank or otherwise aggregate the words in some longer text. The approach detailed here uses spaces as a delimiter and will not be tripped up if multiple spaces are used between words.

There is no SPLIT function in DAX, so this approach uses the MID function to help find words.

The PBIX file used for the blog can be downloaded here.

[Updated 14th Oct, 2018]
A slightly updated version that uses UNICHAR/UNICODE to preserve the case (“A” versus “a”) of each letter can be downloaded here. The reason for this is that DAX stores a dictionary of unique values for every column. It is the first instance of any value that is added to the dictionary and assigned a new ID. Subsequent values that are considered the same (“A” and “a” are considered the same) are assigned the same ID. Using the UNICHAR/UNICODE version helps preserve the original case of each letter.

It’s an interesting approach and reminded me a bit of using a tally table to split strings in T-SQL.
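
For comparison, here is a rough Python analogue of the idea: walk the text one character at a time (much as MID pulls out one position at a time), treat spaces as the delimiter, and ignore runs of repeated spaces. This is not Philip's DAX, just a sketch of the same logic; the function names are made up for the example.

```python
from collections import Counter


def split_words(text, delimiter=" "):
    """Split text into words by scanning character positions (hypothetical helper)."""
    words = []
    start = None  # position where the current word began, if any
    for position, char in enumerate(text):
        if char != delimiter and start is None:
            start = position                      # a new word begins here
        elif char == delimiter and start is not None:
            words.append(text[start:position])    # the word ended just before this space
            start = None
    if start is not None:
        words.append(text[start:])                # trailing word with no space after it
    return words


def word_counts(text):
    """Count occurrences of each word, case-sensitively."""
    return Counter(split_words(text))
```

Because a word is only emitted when a non-space run ends, consecutive spaces never produce empty words.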


External Memory Pressure In SQL Server 2019 On Linux

Published 2018-10-15 by Kevin Feasel

Anthony Nocentino walks us through memory pressure in SQL Server on Linux:

Now in SQL Server 2017, running that 7GB program would cause Linux to need to make room in physical memory for this process. Linux does this by swapping least recently used pages from memory out to disk. So under external memory pressure, let’s look at the SQL Server process’ memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 2.95GB. VmSwap has increased to 5.44GB. Wow, that’s a huge portion of the SQL Server process swapped to disk.

In SQL Server 2019, there’s a different outcome! In the data below we see our 16GB VmSize, which won’t change much because of the virtual address space for the process. With that large external process running, SQL Server reduced VmRSS from 7.9GB (from Table 1) to 2.8GB, only placing about 4.68GB in the swap file. That doesn’t sound much better, does it? I thought SQL Server was going to react to the external memory pressure…let’s keep digging and ask SQL Server what it thinks about this.

Anthony is doing some great work digging into this. This is an area where you do have to understand the differences between Windows and Linux.
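
If you want to watch the same counters on your own machine, Linux exposes VmSize, VmRSS, and VmSwap in /proc/<pid>/status. Here is a minimal Python sketch for pulling them out; the function names are mine, and you would need to supply the process ID of your own SQL Server process.

```python
def parse_vm_fields(status_text, fields=("VmSize", "VmRSS", "VmSwap")):
    """Return the requested Vm* fields from /proc/<pid>/status text, in kB."""
    result = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            # Values look like "  10485760 kB"; keep only the number.
            result[name] = int(rest.split()[0])
    return result


def read_vm_fields(pid):
    """Read and parse /proc/<pid>/status for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_fields(status_file.read())
```

Calling `read_vm_fields(pid)` before and after starting a large external process shows how much of the resident set moved to swap.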


Voice Control For Shiny Apps

Published 2018-10-12 by Kevin Feasel

Over at Jumping Rivers, there is an example of using a JavaScript library to control a page using voice commands:

I have found that performance across all devices and browsers is definitely not equal. By far the best browser I have found for viewing the apps is Google Chrome. I have also tended to find that my Ubuntu machines don’t do as well as Microsoft machines in picking up words correctly. A chat I had with someone recently suggested this might be down to drivers under Ubuntu for the microphones but that is not my area of expertise. Voice recognition was also fine on both of my Blackberry phones (one running BB OS 10, the other running Android 7).

It is worth noting that this does require an internet connection to function; in Chrome, the voice-to-text is performed in the cloud.

The other thing I have noticed is that annyang seems relatively sensitive to background noise. This isn’t so bad for functions called using specific phrases but does sometimes have a large effect on the multi-word splats. This is because the splats are greedy and the background noise makes the recognition engine think that you are still talking long after you finished, which gives the appearance of the application hanging.

The solution is by no means perfect, but it does look quite interesting.


Security Improvements In Kafka And Confluent Platform

Published 2018-10-12 by Kevin Feasel

Vahid Fereydouny demonstrates a number of security improvements made to Apache Kafka 2.0 as well as Confluent Platform 5.0:

Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature set to help with:

Using standard and central directory services like Active Directory (AD)/Lightweight Directory Access Protocol (LDAP)

Simplifying the management of access control lists (ACLs)

Proactive management and monitoring of security configurations to address the gaps as soon as possible

The following new security features are available in both Confluent Platform 5.0 and Apache Kafka 2.0:

Support for ACL-prefixed wildcards to simplify the management of access control

Kafka Connect password protection with support for externalizing secrets (to “secrets stores,” etc., like HashiCorp Vault)

The following security features are available only in Confluent Platform 5.0:

AD/LDAP group support

Feature access controls in Confluent Control Center

Viewing of broker configurations in Confluent Control Center, including differences in security configurations between brokers

Let’s walk through each of these enhancements in detail.

Read on for examples.
