Press "Enter" to skip to content

Month: January 2017

Understanding DATEADD And DATEDIFF

Matan Yungman and Guy Glantser take a crack at DATEDIFF versus DATEADD for date calculations. First up is Matan:

Pretty simple, right?

Well, it is, and since this problem is pretty common, I used this solution in many performance tuning sessions I performed over the years.

There’s a slight problem though: This solution isn’t 100% accurate.

Looking carefully at the results, I find that for the first query, I get 5859 rows, and for the second query, I get 5988 rows. Where does this difference come from?

Then, Guy gives his take on the problem:

I tested both queries on a sample table, which has millions of rows, and only around 500 rows in the last 90 days. The first query produced a table scan, while the second query produced an index seek. Of course, the execution time of the second query was much lower than the first query.

Both queries were supposed to return the orders in the last 90 days, but the first query returned 523 rows, and the second query returned 497 rows. So what’s going on?

The answer has to do with the way DATEDIFF works. This function returns the number of date parts (days, years, seconds, etc.) between two date & time values. It does that by first rounding down each one of the date & time values to the nearest date part value, and then counting the number of date parts between them.

They both start from the same base problem, but end up with slightly different formulations of a solution.
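To make the two patterns concrete, here is a minimal sketch against a hypothetical Sales.Orders table (the table and column names are placeholders, not code from either post):

```sql
-- Hypothetical table and columns, for illustration only.

-- Pattern 1: wrap the column in DATEDIFF.
-- DATEDIFF first rounds both values down to day boundaries and then counts the
-- boundaries between them, and the function on the column blocks an index seek.
SELECT COUNT(*)
FROM Sales.Orders
WHERE DATEDIFF(DAY, OrderDate, GETDATE()) <= 90;

-- Pattern 2: shift the constant side with DATEADD.
-- This compares against an exact point in time 90 days before now and stays sargable.
SELECT COUNT(*)
FROM Sales.Orders
WHERE OrderDate >= DATEADD(DAY, -90, GETDATE());
```

The two predicates draw the 90-day boundary in slightly different places, which is exactly why the row counts in the posts differ.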


Structured Streaming With Spark

Tathagata Das, et al, discuss enterprise-grade streaming of structured data using Spark:

Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.

In the rest of this post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, Parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts, by delivering gzipped JSON log files to an S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON log files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.

This introductory post discusses some of the architecture and setup, and they promise additional posts getting into finer details.


Searching For A Query

Kendra Little shows how to search the plan cache and query store for a particular query:

If I’m looking in SQL Server’s Execution Plan Cache, I like to use the sys.dm_exec_text_query_plan dynamic management view. This stores those XML query plans as text.

I learned about using this DMV from Grant Fritchey in his post, “Querying the Plan Cache, Simplified.” Grant points out that while doing wildcard searches in the text version of a query plan isn’t fast, querying it as XML is often even slower.
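As a rough sketch of the pattern (not Kendra’s or Grant’s exact query), a text search of the plan cache can look something like this, with the search string as a placeholder:

```sql
-- Find cached plans whose XML (treated as text) mentions a given string.
-- The LIKE pattern below is a placeholder; swap in the object or literal you're hunting for.
SELECT TOP (50)
    st.[text]      AS query_text,
    tqp.query_plan AS query_plan_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle,
                                        qs.statement_start_offset,
                                        qs.statement_end_offset) AS tqp
WHERE tqp.query_plan LIKE N'%MyTable%';
```

As Grant notes, the wildcard search isn’t fast, but it tends to beat shredding the plans as XML.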

This is definitely worth a read.


Thinking About Availability Group Outages

Brent Ozar reminds us to think about graceful degradation of applications:

There’s a gray bar across the top that says, “This site is currently in read-only mode; we’ll return with full functionality soon.”

That’s not a hidden feature of Always On Availability Groups. Rather, it’s a hidden feature of really dedicated developers whose application:

This is where a bit of foresight and hard work can really pay off.  Read the whole thing.
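One possible building block for that kind of graceful degradation (an assumption on my part, not something from Brent’s post) is having the application check whether the database it landed on is currently writable:

```sql
-- Returns READ_WRITE on a writable primary, or READ_ONLY when the connection
-- has landed on a read-only copy (for example, a readable secondary replica).
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;
```

An application that runs a check like this when it connects can flip itself into read-only mode instead of throwing errors at users.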


Beginning With Amazon Athena

Jen Underwood looks at the basics behind Amazon Athena:

Today early adopters of Amazon Athena are using it for big data analytics pipeline projects along with Kinesis streaming data and other Amazon data sources.

Athena is a serverless, parallel-query, pay-per-use service. There is no infrastructure to set up or manage. It scales automatically and can handle large datasets or complex distributed queries.

The easy way of thinking about Athena is that it’s ElasticMapReduce (a pay-as-you-go Hadoop cluster) without the ceremony of administering or spinning up the cluster.
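To give a feel for the model, an Athena query is just standard SQL run directly against files in S3. Here is a minimal sketch against a hypothetical table (the table and column names are made up for illustration):

```sql
-- Hypothetical table over data in S3; Athena bills per data scanned,
-- so selecting only the columns you need keeps queries cheap.
SELECT event_date,
       COUNT(*) AS events
FROM   clickstream_logs
WHERE  event_date >= DATE '2017-01-01'
GROUP BY event_date
ORDER BY event_date;
```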


R Visuals In Power BI

Ryan Wade ties ggplot2 visuals into Power BI:

The package that we are going to use to develop our custom visualization is ggplot2. The ggplot2 package is arguably the most popular data visualization package in R. It is based on the “grammar of graphics” concept that was created by the statistician Leland Wilkinson. The ggplot2 package allows you to approach creating charts and graphs in the same manner that Bob Ross approached painting trees in the forest. With ggplot2 you are able to start with a blank canvas and add layer upon layer via short code snippets that build on each other until you end up with the desired visualization.

The pbix file that is being used in this blog post can be found here: http://bit.ly/2jwoCyP. The GentleIntroToR_ChartExample.pbix file contains an example of using R to create a box plot chart that shows the distribution of player scores for the L.A. Lakers. Chiclet slicers were added that allow you to filter by division and/or opponent. The R visualization was created in four steps.

Check out the PBIX file.


Log Shipping On Linux

Andrew Pruski digs into how to set up log shipping for SQL Server on Linux:

What I’m going to do is set up two instances of SQL Server running on Linux and log ship one database from one to another. So the first thing I did was get two VMs running Ubuntu 16.04.1 LTS, which can be downloaded from here.

Once both servers were set up (remember to enable ssh), I then went about getting SQL Server set up. I’m not going to go through the install in this post, as the process is documented fully here. Don’t forget to also install the SQL Tools; the full guide is here.

Read on for the guide, but also be sure to read his disclaimer.
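For readers new to the concept, each log shipping cycle boils down to a transaction log backup on the primary and a restore on the secondary. A stripped-down sketch (the database name and path are placeholders, not from Andrew’s setup):

```sql
-- On the primary: back up the transaction log to a location the secondary can reach.
BACKUP LOG [MyDatabase]
TO DISK = N'/var/opt/mssql/backup/MyDatabase_log.trn';

-- On the secondary (after copying the file across): apply the log backup,
-- leaving the database in a restoring state so later backups can be applied.
RESTORE LOG [MyDatabase]
FROM DISK = N'/var/opt/mssql/backup/MyDatabase_log.trn'
WITH NORECOVERY;
```

A full log shipping setup automates and schedules this cycle; the sketch only shows the core backup and restore pair.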


JSON Basics In SQL Server

Neil Gelder gives an introduction to JSON in SQL Server 2016:

A new feature in SQL Server 2016 (also available in Azure SQL Database) is the ability to create and query JSON (JavaScript Object Notation) documents, which have now become a common alternative to XML.

Let’s look at some examples. I’ll be using tables from the new sample database for SQL Server 2016, WideWorldImporters, which you can download from this link.

I’m of two minds on JSON support: I think it’s very useful for building output sets for service calls, and it might be fine for inputs when you can’t use a table-valued parameter for some reason, but if you’re doing a lot of splitting of JSON data stored in a table, that’s a violation of first normal form.
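To show both directions, here is a minimal sketch; the table and column names are assumptions based on the sample database, not Neil’s examples:

```sql
-- Building a JSON output set, e.g. for a service call:
SELECT TOP (2) OrderID, CustomerID, OrderDate
FROM Sales.Orders
FOR JSON PATH;

-- Shredding a JSON input into rows, e.g. where a table-valued parameter isn't an option:
DECLARE @json NVARCHAR(MAX) = N'[{"OrderID":1,"Qty":3},{"OrderID":2,"Qty":5}]';

SELECT OrderID, Qty
FROM OPENJSON(@json)
WITH (OrderID INT '$.OrderID',
      Qty     INT '$.Qty');
```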
