Month: June 2018

The need for visualizing real-time (or near-real-time) data has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities to visualize streaming data and this time, I will tackle this issue using Python and the Python Dash package for building web applications and visualizations. Dash is built on top of Flask, React and Plotly and gives a wide range of capabilities to create interactive web applications, interfaces and visualizations.

Tomaz’s example hits SQL Server every half-second to grab the latest changes and gives us an example of roll-your-own streaming.
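
As a rough illustration of that polling pattern, here is a minimal sketch that asks a database for rows newer than the last one seen, once per interval. It is not Tomaz's code: an in-memory SQLite database stands in for SQL Server so the example runs anywhere, and the table `sensor_readings`, its columns, and the function names are all hypothetical.

```python
import sqlite3
import time


def fetch_new_rows(conn, last_id):
    """Return (id, reading) rows newer than last_id, oldest first."""
    return conn.execute(
        "SELECT id, reading FROM sensor_readings WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()


def poll(conn, last_id=0, interval_seconds=0.5, max_polls=3, sleep=time.sleep):
    """Yield each non-empty batch of new rows, advancing the high-water mark."""
    for _ in range(max_polls):
        rows = fetch_new_rows(conn, last_id)
        if rows:
            last_id = rows[-1][0]  # remember the newest id we have seen
            yield rows
        sleep(interval_seconds)


def make_demo_db():
    """Build an in-memory table standing in for the SQL Server source (an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sensor_readings (id INTEGER PRIMARY KEY, reading REAL)"
    )
    conn.executemany(
        "INSERT INTO sensor_readings (reading) VALUES (?)", [(1.5,), (2.5,)]
    )
    return conn


if __name__ == "__main__":
    demo_conn = make_demo_db()
    for batch in poll(demo_conn, max_polls=2):
        print(batch)
```

Tracking a high-water mark keeps each poll cheap, since only rows the dashboard has not yet seen come back. The trade-off against a true streaming source is that latency can never be better than the polling interval.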

Optimizing Conditionals In DAX

Published 2018-06-19 by Kevin Feasel

Marco Russo shows us a way to optimize mutually exclusive conditional calculations using DAX:

In previous articles, we discussed the importance of variables and how to optimize IF functions to reduce multiple evaluations of the same expression or measure. However, there are scenarios where the calculations executed in different branches of the same expression seem impossible to optimize. For example, consider the following pattern:

Amount :=
IF (
    <condition>,
    [Credit],
    [Debit]
)

In cases like this involving measures A and B, there do not seem to be any possible optimizations. However, by considering the nature of the two measures A and B, they might be different evaluations of the same base measure in different filter contexts.

Read on for a couple of examples.

Backing Up SQL Server To S3

Published 2018-06-19 by Kevin Feasel

David Fowler shows how to back up SQL Server directly to an AWS S3 bucket:

I’ve been having a little play around with AWS recently and was looking at S3 (AWS’ cloud storage) when I thought to myself, I wonder if it’s possible to back up an on-premises SQL Server database directly to S3?

When we want to back up directly to Azure, we can use the ‘TO URL’ clause in our backup statement. Although S3 buckets can also be accessed via a URL, I couldn’t find a way to back up directly to that URL. Most of the solutions on the web have you backing up your databases locally and then a second step of the job uses PowerShell to copy those backups up to your S3 buckets. I didn’t really want to do it that way; I want to back up directly to S3 with no middle steps. We like to keep things as simple as possible here at SQL Undercover: the more moving parts you’ve got, the more chance for things to go wrong.

So I needed a way for SQL Server to be able to directly access my buckets. I started to wonder if it’s possible to map a bucket as a network drive. A little hunting around and I came across this lovely tool, TNTDrive. TNTDrive will let us do exactly that and with the bucket mapped as a local drive, it was simply a case of running the backup to that local drive.

Quite useful if your servers are in a disk crunch. In general, I’d probably lean toward keeping on-disk backups and creating a job to migrate those backups to S3.
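
For the two-step approach (back up locally, then copy), the migration job could look roughly like the sketch below. The function names, the `.bak` naming convention, and the bucket name are hypothetical. The uploader is passed in as a callable so the sketch runs without AWS credentials. With boto3 installed, `boto3.client("s3").upload_file` takes the same (filename, bucket, key) arguments.

```python
import tempfile
from pathlib import Path


def pending_backups(backup_dir, already_uploaded=frozenset()):
    """List .bak files in backup_dir that have not been uploaded yet, sorted by name."""
    return sorted(
        p for p in Path(backup_dir).glob("*.bak") if p.name not in already_uploaded
    )


def migrate_backups(backup_dir, bucket, upload, already_uploaded=frozenset()):
    """Upload each pending backup and return the object keys that were sent.

    `upload` is any callable taking (filename, bucket, key). With boto3
    installed, boto3.client("s3").upload_file accepts those three arguments.
    """
    sent = []
    for path in pending_backups(backup_dir, already_uploaded):
        upload(str(path), bucket, path.name)
        sent.append(path.name)
    return sent


if __name__ == "__main__":
    # Demo with a throwaway directory and a fake uploader (no network access).
    with tempfile.TemporaryDirectory() as backup_dir:
        (Path(backup_dir) / "Sales_full.bak").write_bytes(b"demo")
        calls = []

        def fake_upload(filename, bucket, key):
            calls.append((bucket, key))

        print(migrate_backups(backup_dir, "example-backup-bucket", fake_upload))
```

Keeping the local copy until the upload succeeds means a network or S3 outage delays the off-site copy but never fails the backup itself.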

“Server Is Configured For Windows Authentication Only” Error

Published 2018-06-19 by Kevin Feasel

Kenneth Fisher diagnoses a misleading error:

In general, the errors SQL gives are highly useful. Of course every now and again you get one that’s just confounding. The other day I saw the following error in the log:

Login failed for user ''. Reason: An attempt to login using SQL authentication failed. Server is configured for Windows authentication only. [CLIENT: ]

This one confused me for a couple of reasons. First, the user ''. Why an empty user? That’s not really helpful. And second Server is configured for Windows authentication only.

But Kenneth shows that the server is configured for SQL authentication as well as Windows authentication. Click through to see what gives.

Non-SARGable Predicates And Computed Columns

Published 2018-06-19 by Kevin Feasel

Erik Darling shows that you can create a computed, indexed column to make a non-SARGable predicate perform a seek operation:

Before I show you what I mean, we should probably define what’s not SARGable in general.

Wrapping columns in functions: ISNULL, COALESCE, LEFT, RIGHT, YEAR, etc.

Evaluating predicates against things indexes don’t track: DATEDIFF(YEAR, a_col, b_col), a_col +b_col, etc.

Optional predicates: a_col = @a_variable or @a_variable IS NULL

Applying some expression to a column: a_col * 1000 < some_value

Applying predicates like this shows that you don’t predi-care.

They will result in the “bad” kind of index scans that read the entire index, often poor cardinality estimates, and a bunch of other stuff — sometimes a filter operator if the predicate can’t be pushed down to the index access level of the plan.

Read on for an example.
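
To see the general effect without a SQL Server instance, here is a small sketch using SQLite from Python. It is an analogy, not Erik's demo: SQLite's index on an expression stands in for SQL Server's indexed computed column, and the table `t`, the index name, and the helper `plan` are made up for illustration. The predicate `a_col * 1000 < ?` is the "expression applied to a column" case from the list above.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the detail text of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a_col INTEGER)")
conn.executemany("INSERT INTO t (a_col) VALUES (?)", [(i,) for i in range(1000)])

query = "SELECT id FROM t WHERE a_col * 1000 < ?"

# No index matches the expression yet, so the whole table must be read.
before = plan(conn, query, (5000,))

# Index the exact expression used in the predicate (the SQLite stand-in
# for a computed column plus an index on it).
conn.execute("CREATE INDEX ix_a_col_x1000 ON t (a_col * 1000)")
after = plan(conn, query, (5000,))

print("before:", before)
print("after: ", after)
```

The exact plan text varies by SQLite version, but the first plan should describe a scan and the second a search on the new index. The expression in the index has to match the expression in the predicate as written.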

Trying To Force A Plan For A Different Query With Query Store

Published 2018-06-19 by Kevin Feasel

Erin Stellato shows us that you cannot use a plan generated for one query as a forced plan for a different query in Query Store:

This is a question I’ve gotten a few times in class…Can you force a plan for a different query with Query Store?

tl;dr

No.

Assume you have two similar queries, but they have different query_id values in Query Store. One of the queries has a plan that’s stable, and I want to force that plan for the other query. Query Store provides no ability to do this in the UI, but you can try it with the stored procedure. Let’s take a look…

I can see the potential benefit, but the downside risk is huge, so it makes sense not to allow this.

Metacat: Federated Metadata Discovery

Published 2018-06-18 by Kevin Feasel

Ajoy Majumdar and Zhen Li walk us through Metacat:

The core architecture of the big data platform at Netflix involves three key services. These are the execution service (Genie), the metadata service, and the event service. These ideas are not unique to Netflix, but rather a reflection of the architecture that we felt would be necessary to build a system not only for the present, but for the future scale of our data infrastructure.

Many years back, when we started building the platform, we adopted Pig as our ETL language and Hive as our ad-hoc querying language. Since Pig did not natively have a metadata system, it seemed ideal for us to build one that could interoperate between both.

Thus Metacat was born, a system that acts as a federated metadata access layer for all data stores we support. A centralized service that our various compute engines could use to access the different data sets. In general, Metacat serves three main objectives:

Federated views of metadata systems

Unified API for metadata about datasets

Arbitrary business and user metadata storage of datasets

It is worth noting that other companies that have large and distributed data sets also have similar challenges. Apache Atlas, Twitter’s Data Abstraction Layer and LinkedIn’s WhereHows (Data Discovery at LinkedIn), to name a few, are built to tackle similar problems, but in the context of the respective architectural choices of the companies.

If you’re interested, also check out their GitHub repo.

Understanding A Spark Streaming Workflow

Published 2018-06-18 by Kevin Feasel

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode”. It is nothing but one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, let’s read about them in detail:

Complete Mode – This mode updates the entire Result Table which is eventually written to the sink.
Append Mode – In this mode, only the new rows are appended in the Result Table and eventually sent to the sink.
Update Mode – Finally, this mode updates only the rows that have changed in the Result Table since the last trigger, and only those changed rows are sent to the sink. It differs from Complete Mode in that it outputs only the rows that have changed since the last trigger, not the whole table. If the query doesn’t contain any aggregations, it is equivalent to Append Mode.

Check it out.
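
To make the difference between Complete and Update mode concrete, here is a toy simulation of a running word count in plain Python. It does not use the Spark API. The function names are hypothetical, and each call to `apply_trigger` plays the role of one trigger processing a micro-batch of new lines.

```python
from collections import Counter


def apply_trigger(counts, new_lines):
    """Fold one micro-batch of lines into the running counts.

    Returns the set of words whose counts changed in this trigger.
    """
    batch = Counter(word for line in new_lines for word in line.split())
    counts.update(batch)
    return set(batch)


def rows_for_sink(counts, changed, mode):
    """Pick the Result Table rows an output mode would hand to the sink."""
    if mode == "complete":
        return dict(counts)  # the whole Result Table, every trigger
    if mode == "update":
        return {word: counts[word] for word in changed}  # changed rows only
    raise ValueError(f"unsupported mode: {mode}")


counts = Counter()
changed = apply_trigger(counts, ["cat dog", "dog"])  # first trigger
changed = apply_trigger(counts, ["dog owl"])         # second trigger

print("complete:", rows_for_sink(counts, changed, "complete"))
print("update:  ", rows_for_sink(counts, changed, "update"))
```

After the second trigger, Complete mode still sends the untouched `cat` row, while Update mode sends only `dog` and `owl`. Append mode is left out of the sketch because it is about rows that never change once written, which does not fit a running count that keeps revising earlier rows.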

Calculating TF-IDF Using Apache Spark

Published 2018-06-18 by Kevin Feasel

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

Document search.

Document tagging.

Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat the information in these other Internet resources; it just illustrates TF-IDF calculation with the help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD and the Python language. This article, on the other hand, uses the modern Spark SQL API and Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient for learning the concept. MLlib tools are intended to generate feature vectors for ML algorithms. There is no way to figure out the weight for a particular term in a particular document. Well, let’s make it from scratch; this will sharpen our skills.

Read on for the solution. It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.
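
For a feel of what "the weight for a particular term in a particular document" means, here is a from-scratch sketch in plain Python rather than the Spark SQL and Scala approach the article takes. Two assumptions are baked in: term frequency is the raw count of the term in the document, and inverse document frequency is the smoothed form that Spark MLlib documents, log((N + 1) / (DF + 1)). Arseniy's article may define either differently, and the function name and sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} for a dict mapping doc_id to text."""
    # Term frequency: raw count of each term within each document.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    # Smoothed IDF (assumed form): log((N + 1) / (DF + 1)).
    idf = {
        term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()
    }
    return {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


weights = tf_idf({"d1": "spark sql spark", "d2": "scala sql"})
for doc_id, terms in weights.items():
    print(doc_id, terms)
```

With this smoothing, a term that appears in every document (here `sql`) gets a weight of zero, which is the point of the IDF factor: ubiquitous terms say nothing about any one document.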

File Growth Rate Under 1MB

Published 2018-06-18 by Kevin Feasel

John Morehouse shows that the file growth rate GUI for Management Studio doesn’t report values under 1MB correctly:

While I was recently doing a review of a client’s environment, I discovered that the GUI can lie to you when it comes to the database file growth rates. By default, the data file is set to a 1MB growth rate and the log file is configured for a 10% growth rate. Both are horrible settings for most OLTP environments. However, starting with SQL Server 2016, the default growth rates are configured for 64MB, which in my opinion is better than the previous defaults.

Using the GUI to look at a 2017 Scratch database I have, we can see that the data file is configured for 1MB and the log file is set for 64MB growth.

I don’t think there’s a good reason for a file growth rate under 1 MB at this point. That could have made sense in the late ’90s, but the idea of growing 128KB at a time is funny.
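
If you would rather not trust the GUI, the catalog views hold the raw numbers. In `sys.database_files`, a fixed growth increment is stored in the `growth` column as a count of 8 KB pages, and as a whole-number percentage when `is_percent_growth` is 1. The helper below is a hypothetical sketch (not from John's post) that turns those two columns into a readable setting, including values under 1 MB.

```python
PAGE_KB = 8  # SQL Server data pages are 8 KB


def describe_growth(growth, is_percent_growth):
    """Render a sys.database_files growth setting as text."""
    if growth == 0:
        return "no autogrowth"
    if is_percent_growth:
        return f"{growth}%"
    kb = growth * PAGE_KB  # fixed growth is stored as a page count
    if kb < 1024:
        return f"{kb} KB"
    return f"{kb / 1024:g} MB"


for growth, is_percent in [(16, False), (128, False), (8192, False), (10, True)]:
    print(growth, is_percent, "->", describe_growth(growth, is_percent))
```

A `growth` value of 16 pages works out to the 128KB increment mentioned above.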
