Press "Enter" to skip to content

Month: March 2018

Check Constraints To Block Leading And Trailing Spaces

Louis Davidson shows how to use check constraints to block people from inserting records with leading or trailing spaces:

Then, let’s say the requirements are as follows:

1. No values that are either empty or only spaces
2. No leading spaces
3. No trailing spaces
4. Allow NULL if column allows NULL

Let’s look at how we could implement all of these independently, as there certainly are cases where you may wish to allow any or all of the situations in a column.

Click through for the scripts, as well as a time comparison to see how much overhead you’re adding.
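
To give a rough idea of the shape such a constraint can take (a minimal sketch, not Louis's scripts; the table and column names here are invented), something like the following covers all four requirements, since a NULL value makes the predicate unknown and therefore passes the check:

ALTER TABLE dbo.Customer
    ADD CONSTRAINT CHK_Customer_Name_NoEdgeSpaces
    CHECK (LEN(Name) > 0                                        -- not empty, not all spaces (LEN ignores trailing blanks)
       AND Name NOT LIKE ' %'                                   -- no leading space
       AND DATALENGTH(Name) = DATALENGTH(RTRIM(Name)));         -- no trailing space (DATALENGTH does not trim)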


Continuous Processing Mode With Spark Structured Streaming

Joseph Torres, et al, explain how continuous processing mode works with Apache Spark 2.3’s structured streaming:

Suppose we want to build a real-time pipeline to flag fraudulent credit card transactions. Ideally, we want to identify and deny a fraudulent transaction as soon as the culprit has swiped his/her credit card. However, we don’t want to delay legitimate transactions as that would annoy customers. This leads to a strict upper bound on the end-to-end processing latency of our pipeline. Given that there are other delays in transit, the pipeline must process each transaction within 10-20 ms.

Let’s try to build this pipeline in Structured Streaming. Assume that we have a user-defined function “isPaymentFlagged” that can identify the fraudulent transactions. To minimize the latency, we’ll use a 0 second processing time trigger indicating that Spark should start each micro batch as fast as it can with no delays.

They also explain how this newer model differs from the prior model of collecting events in microbatches.


Lazy Evaluation With Scala

Mahesh Chand demonstrates how Scala can use lazy evaluation to reduce memory requirements:

In this blog, we will talk about lazy evaluation in Scala. How can we add efficiency to our application?

Efficiency is achieved not just by running things faster, but by avoiding things that shouldn’t be done in the first place.

In functional programming, lazy evaluation means efficiency. Laziness lets us separate the description of an expression from the evaluation of that expression. This gives us a powerful ability—we may choose to describe a “larger” expression than we need, and then evaluate only a portion of it. There are many ways to achieve lazy evaluation in Scala, e.g., using the lazy keyword, views, streams, etc.

The fastest operation is the one that doesn’t have to run at all.


Building SQL Server Database Projects With YAML

Gavin Campbell shows how to use Visual Studio Team Services’ new YAML build pipeline to build a SQL Server Data Tools project:

The documentation for how to specify build steps in YAML is still a work in progress. In summary, the current procedure is to visit the VSTS Tasks repo on GitHub, open the folder for the task you are interested in, and take a look at the task.json.

In our case, the first task we need is MSBuild, to build the database project. Looking inside task.json, we can see that the name of the task we need is MSBuild, and that there are a huge number of available inputs we can use to configure the task: solution to specify the project or solution to build, platform, configuration, and many more. In our case, we’ll just specify the path to our .sqlproj file and let MSBuild take care of the rest.

Gavin walks us through the entire process step by step, though as he notes, some of these steps are liable to change as the VSTS team continues implementation.


Uploading Files To Azure Blob Storage With Data Factory V2

Ben Jarvis shows how to use Azure Data Factory V2 to upload files from an on-prem server to Azure Blob Storage:

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premises version of the integration runtime that is able to perform copy activities to and from on-premises data stores.

When we configure a self-hosted integration runtime, the data factory service that sits in Azure will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus, meaning our nodes that are hosted on-prem perform all of our data movement and connect to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit, as ADF will use all of the available nodes to perform processing.

Read on for the scripts and full process.


The Risk Of Opening Your SQL Server To The Internet

Max Vernon explains why it’s a bad idea to let your SQL Server be Internet-accessible:

First, it might be helpful to understand why someone might want to connect to their SQL Server via the Internet. Consider any of the following likely scenarios:

  • There is a client application or website hosted in a different data-center that needs access to the data stored in the SQL Server.
  • You need to provide geographical redundancy between several SQL Servers located in different countries that are only connected via the Internet.
  • You’re doing development on a hosted SQL Server and don’t understand the implications of having SQL Server exposed.

The typical way you might expose your SQL Server to the Internet when the server is behind a router or firewall is to open an incoming TCP port at the router or firewall directed at the SQL Server on port 1433. If you do this, you will very quickly start seeing incoming connection attempts from computers on the Internet. These attempts will try to log in, typically via the “sa” account, continuously trying passwords.

It’s best to avoid this mess altogether, though if you do need to expose your SQL Server to the Internet, Max has tips on how to mitigate some of that risk.


Using DM_DB_LOG_STATS

Kellyn Pot’vin-Gorman explains the DM_DB_LOG_STATS dynamic management function in SQL Server 2017:

This is an addition to SQL Server 2017 and is available in Azure SQL Database/Azure SQL Data Warehouse to simplify and synchronize how we query information about log statistics, not just backups. The added benefit is that it only requires the VIEW DATABASE STATE privilege to utilize it, which eases the demand on anyone who’s tried to grant rights to view backup information to non-DBA personnel in previous releases. With this DMV, you can monitor, alert, and kick off jobs to manage the transaction log backups.

It does require the database_id to be passed to it to provide results, but who keeps that around?  Much easier to just do a Cross Apply to sys.databases and make your life simple.

Click through to see what’s included on this useful DMV.
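
The CROSS APPLY pattern Kellyn mentions looks something like this (the column list is just a sample of what the function exposes; it requires SQL Server 2017 and the VIEW DATABASE STATE permission):

-- One row of log statistics per online database
SELECT d.name AS database_name,
       ls.recovery_model,
       ls.log_backup_time,
       ls.log_since_last_log_backup_mb,
       ls.log_truncation_holdup_reason
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_stats(d.database_id) AS ls
WHERE d.state_desc = 'ONLINE';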


Using SQL Server’s PIVOT Operator

Dan Blank shows how to use the PIVOT operator in SQL Server:

I was recently approached by my firm’s Marketing Manager with a request for some information. She wanted to know, “Which departments have our top clients never done any work for?” For some clarity, I work in a law firm with 11 departments. The request seems pretty straightforward at first. Then, once I got thinking about how the output of this report would be presented, it made me reconsider quite how simple a request this was. She handed me a drawing with her vision for the output. What she wanted was:

  1. Clients’ names down the left,
  2. A list of departments across the top,
  3. Ticks and crosses at the intersections to show whether or not each client had done work for each department.

You can make a good argument that presentation mechanics like this are meant for a different tool (a presentation layer), but it’s useful to know how to pivot and unpivot data within T-SQL for more reasons than just presentation.
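
For reference, the general shape of a PIVOT query for this sort of client-by-department matrix looks something like the sketch below (table, column, and department names are invented, and a real solution would normally build the department list dynamically):

SELECT ClientName, [Litigation], [Conveyancing], [Family]
FROM
(
    -- One row per matter, tagged with its client and department
    SELECT c.ClientName, d.DepartmentName, m.MatterID
    FROM dbo.Matters AS m
    INNER JOIN dbo.Clients AS c ON c.ClientID = m.ClientID
    INNER JOIN dbo.Departments AS d ON d.DepartmentID = m.DepartmentID
) AS src
PIVOT
(
    COUNT(MatterID)                     -- 0 means the client has never used that department
    FOR DepartmentName IN ([Litigation], [Conveyancing], [Family])
) AS p
ORDER BY ClientName;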


The Secret Power Of TRIM

Dave Mason points out something quite useful about TRIM in SQL Server 2017:

Now I can tidy things up and remove both leading and trailing spaces with a single call to TRIM():

SELECT TRIM(b.foo)
FROM dbo.bar b;

On the surface, this may not seem like that big of a deal. And I would tend to agree. But, in this example, it saves a few keystrokes and makes the code slightly more readable. And it is nice for T-SQL to finally have a function that has been around in other languages for far longer than I’ve been writing code for a living.

But Wait, There’s More!

Click through for that more.  This makes TRIM a lot more useful, so go check it out.
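
One of the things that makes 2017’s TRIM more than just LTRIM plus RTRIM (and quite possibly part of the “more” Dave hints at) is that it can strip an arbitrary set of characters, not only spaces:

-- Remove leading and trailing spaces, semicolons, and periods in a single call
SELECT TRIM(' ;.' FROM '  ;; Hello, World. ;. ') AS Tidied;
-- Returns: Hello, World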


Joining Multiple Types Of Data With KSQL

Robin Moffatt has an example where he enriches streaming CSV data with information stored in MySQL:

This is a continuous query that executes in the background until explicitly terminated by the user. In effect, these are stream processing applications, and all we need to create them is SQL! Here all we’ve done is an enrichment (joining two sets of data), but we could easily add predicates to the data (simply include a WHERE clause), or even aggregations.

You can see which queries are running with the SHOW QUERIES; statement. All queries will pause if the KSQL server stops, and restart automagically when the KSQL server starts again.

The DESCRIBE EXTENDED command can be used to see information about the derived stream such as the one created above. As well as simply the columns involved, we can see information about the underlying topic, and run-time stats such as the number of messages processed and the timestamp of the most recent one.

It’s pretty easy to do; click through to see just how easy.
