

Row Goals And Semi Joins

Paul White continues his row goals series:

The remaining physical join type is nested loops, which comes in two flavours: regular (uncorrelated) nested loops and apply nested loops (sometimes also referred to as a correlated or lateral join).

Regular nested loops join is similar to hash and merge join in that the join predicate is evaluated at the join. As before, this means there is no value in setting a row goal on either input. The left (upper) input will always be fully consumed eventually, and the inner input has no way to determine which row(s) should be prioritized, since we cannot know if a row will join or not until the predicate is tested at the join.

By contrast, an apply nested loops join has one or more outer references (correlated parameters) at the join, with the join predicate pushed down the inner (lower) side of the join. This creates an opportunity for the useful application of a row goal. Recall that a semi join only requires us to check for the existence of a row on join input B that matches the current row on join input A (thinking just about nested loops join strategies now).

In other words, on each iteration of an apply, we can stop looking at input B as soon as the first match is found, using the pushed-down join predicate. This is exactly the sort of thing a row goal is good for: generating part of a plan optimized to return the first n matching rows quickly (where n = 1 here).
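
To make that concrete, here is a minimal T-SQL sketch (the table names are hypothetical, not taken from Paul's post): an EXISTS subquery is a semi join, so on each iteration of the apply we only need to find one matching inner row, which is exactly where a row goal of one pays off.

-- Hypothetical tables, for illustration only.
-- EXISTS asks a semi join question: is there at least one matching line for this order?
-- With an apply nested loops plan, the inner seek can stop at the first match (row goal = 1).
SELECT O.OrderID, O.OrderDate
FROM dbo.Orders AS O
WHERE EXISTS
(
    SELECT 1
    FROM dbo.OrderLines AS OL
    WHERE OL.OrderID = O.OrderID -- pushed down to the inner side as an outer reference
);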

This has the depth and quality that you naturally expect from Paul, making it an immediate read.


Digging Into dbachecks

Rob Sewell walks us through dbachecks configuration:

So I can import this configuration and run my checks with it any time I like. This means that I can create many different test configurations for my many different environment or estate configurations.

Yes, I know “good/best practice” says we should use the same configuration for all of our instances, but we know that isn’t true. We have instances that were set up 15 years ago that are still in production. We have instances from the companies our organisation has bought over the years that were set up by system administrators. We have instances that were set up by shadow IT that we now have to support but can’t change.

As well as those though, we also have different environments. Our development or test environment will have different requirements to our production environments.

This is a good, code-filled demonstration with a few videos as well.  If you want to get started with dbachecks, this is a good place to begin.


Loading Data From Excel With Biml

Ben Weissman loads an Excel file with Biml:

Did you know, that you could call GetDatabaseSchema on Excel files? You can!

Just define an ExcelConnection first:

01_Environment.biml
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <ExcelConnection Name="MyExcel" ConnectionString="Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Flatfiles\XLS\MyExcel.xlsx;Extended Properties=&quot;Excel 12.0 XML;HDR=YES&quot;;" />
        <OleDbConnection Name="Target" ConnectionString="Data Source=localhost;initial catalog=MySimpleBiml_Destination;provider=SQLNCLI11;integrated security=SSPI"></OleDbConnection>
    </Connections>
    <Databases>
        <Database Name="MySimpleBiml_Destination" ConnectionName="Target"></Database>
    </Databases>
    <Schemas>
        <Schema Name="dbo" DatabaseName="MySimpleBiml_Destination"></Schema>
    </Schemas>
</Biml>

You can call GetDatabaseSchema on that connection and loop through the tables just like any regular database.

Click through to see what to do with this connection.


Gotchas When Indexing Partitioned Tables

Andrew Pruski gives us a couple of considerations when creating indexes on partitioned tables in SQL Server:

Looking at that data page, we can see that SQL has added a UNIQUIFIER column. Now this is standard SQL Server behaviour: SQL does this to all non-unique clustered indexes, whether they are on a partitioned table or not.

But also look at the CreatedDate column. It’s after the ID column on the page. If this was a non-partitioned table, we would see that after ColA & ColB (the order the columns are on the table). This has happened because SQL has implicitly added the partitioning key into the index definition, which has changed the physical order of the data on the page.
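
The implicit addition of the partitioning key is easy to reproduce. This is a hedged sketch with made-up names and boundary values, not Andrew's exact demo: the clustered index below only names ID, but because the table is partitioned on CreatedDate and the index is not unique, SQL Server adds both the partitioning key and a uniquifier behind the scenes.

CREATE PARTITION FUNCTION pf_CreatedDate (datetime2)
    AS RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01');

CREATE PARTITION SCHEME ps_CreatedDate
    AS PARTITION pf_CreatedDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.PartitionedDemo
(
    ID          int IDENTITY(1, 1) NOT NULL,
    ColA        varchar(50) NULL,
    ColB        varchar(50) NULL,
    CreatedDate datetime2 NOT NULL
) ON ps_CreatedDate (CreatedDate);

-- A non-unique clustered index that does not mention the partitioning column:
-- SQL Server silently appends CreatedDate (and a uniquifier) to the key,
-- which is why CreatedDate shows up right after ID on the data page.
CREATE CLUSTERED INDEX cix_PartitionedDemo
    ON dbo.PartitionedDemo (ID)
    ON ps_CreatedDate (CreatedDate);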

Read the whole thing.


Web Analytics With R

Maelle Salmon performs some analysis on the Locke Data blog:

Often, the URL of a blog post can be guessed based on its title, e.g. this one can be read here. But even if the transition from the Markdown file information to an URL is logical, it was best to get URLs from the in situ blog posts, and then join them to the blog post information collected previously, since some special characters got special treatment that I could not fully understand by looking at blogdown source code.

I first extracted all posts URLs from the website map.

Check it out.


Using stringr To Remove HTML

I have a quick post on removing HTML markup with stringr:

This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is in taking some raw data, which can include HTML markup, and preparing it for a vectorizer.  I don’t need the resulting output to look pretty; I just want to get rid of the HTML characters.

Click through for the script.  If you need to do something nice with the text afterward, my technique is probably too much sledgehammer for niceties, but it does the trick for pre-processing before vectorization.


The DATETIME Type In SQL Server

Randolph West gets into the DATETIME data type:

DATETIME is an eight-byte datatype which stores both a date and time in one column, with an accuracy of three milliseconds. As we’ll see though, the distribution of this granularity may not be exactly what we’d expect.

Valid DATETIME values are January 1, 1753 00:00:00.000, through December 31, 9999 23:59:59.997. On older databases designed prior to SQL Server 2008, because there was no explicit support for date values, it was sometimes customary to leave off the time portion of a DATETIME value, and have it default to midnight on that morning. So for example today would be stored as February 21, 2018 00:00:00.000.
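
That three-millisecond granularity is easy to see for yourself. A quick sketch (the dates here are arbitrary): DATETIME can only store fractional seconds ending in .000, .003, or .007, so nearby values get rounded.

-- Fractional seconds are rounded to the nearest .000 / .003 / .007 increment.
SELECT
    CAST('2018-02-21 10:30:00.001' AS datetime) AS RoundsToZero,   -- 10:30:00.000
    CAST('2018-02-21 10:30:00.002' AS datetime) AS RoundsToThree,  -- 10:30:00.003
    CAST('2018-02-21 10:30:00.005' AS datetime) AS RoundsToSeven;  -- 10:30:00.007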

If you’re not particularly familiar with SQL Server data types, this is detailed enough information to get you going and to explain exactly why you shouldn’t use DATETIME anymore…


Configuring Logstash

Mike Hillwig gets us started on Logstash:

Logstash is an incredibly powerful tool. If you can put data into a text file, Logstash can parse it. It works well with a lot of data, but I’m finding myself using it more and more for event data. When I say event data, I mean anything that triggers a log event and writes to a log. For the purposes of my demos, I’m using data from the Bureau of Transportation Statistics. They track flight performance data, which works perfectly for my uses. It’s a great example dataset that doesn’t use anything related to my real job.

Logstash configuration files typically have three sections, INPUT, FILTER, and OUTPUT. However, FILTER is optional.

This is the first part in a series, so stay tuned.


Converting Int To Time

Bill Fellows has a pop quiz for us:

Given the following DDL

CREATE TABLE dbo.IntToTime
( CREATE_TIME int
);

What will be the result of issuing the following command?

ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME time NULL;

Clearly, if I’m asking, it’s not what you might expect.

Click through if you have not memorized your implicit conversion tables.
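
As a hedged aside (the post doesn’t say what CREATE_TIME actually represents), if the integer were something like seconds since midnight, a safer path than ALTER COLUMN would be to compute the time value explicitly:

-- Assumption: CREATE_TIME stores seconds since midnight. Illustrative only,
-- and not the answer to Bill's quiz.
SELECT DATEADD(SECOND, CREATE_TIME, CAST('00:00:00' AS time)) AS CreateTime
FROM dbo.IntToTime;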


The Power Of Window Functions

Ben Richardson has an introduction to the concept of window functions:

This is a much more efficient result. In the first line of the script the id, name and gender columns are retrieved. These columns do not contain any aggregated results.

Next, for the columns that contain aggregated results, we simply specify the aggregated function, followed by the OVER clause and then within the parenthesis we specify the PARTITION BY clause followed by the name of the column that we want our results to be partitioned as shown below.
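
As a quick illustration of the pattern Ben describes (the table and column names here are stand-ins, not his exact example), a windowed aggregate returns the per-group value on every detail row without collapsing the result set:

-- Hypothetical dbo.Patients table: each row keeps its id, name, and gender,
-- while the average weight is computed per gender via the OVER clause.
SELECT
    id,
    name,
    gender,
    AVG(weight) OVER (PARTITION BY gender) AS AvgWeightForGender
FROM dbo.Patients;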

This post focuses on normal aggregates which accept windows.  Once you understand that, there’s a wide world beyond it, and you’ll quickly see how useful window functions can be.
