Month: October 2018

Deciding Whether To Clean Up Temp Tables

Grant Fritchey looks at what difference explicitly dropping temporary tables in a procedure makes:

I then set up Extended Events to capture the query metrics and I executed each of the queries multiple times (also, just for the test, I discarded the results because I didn’t want that process mucking with my measurements). After executing both procedures 500 times, the results were quite simple. The average execution time with an explicit drop was 8,672 microseconds. Meanwhile, the average for not dropping the temporary table was 8,530 microseconds. That’s about a 1% difference across hundreds of executions. Reads were identical and so were writes.

In short, the behavior is the same.

What about the impact on the system? Could I see changes in memory or I/O as these different processes ran?

Grant didn’t notice any difference, but check Allen White’s and Jay Robinson’s answers in the comments. Temp table reuse can happen (if you follow the rules) and can make a difference when a procedure is called frequently enough.
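If you want to rerun the comparison yourself, the two procedure shapes are easy to sketch. This is a hypothetical reconstruction (the #Work table and its workload query are stand-ins, not Grant's actual test code); the only difference between the two is the explicit drop at the end:

  CREATE PROCEDURE dbo.TempTable_ExplicitDrop
  AS
  BEGIN
      CREATE TABLE #Work (ID int NOT NULL, Amount decimal(18,2) NOT NULL);

      INSERT INTO #Work (ID, Amount)
      SELECT SalesOrderID, TotalDue
      FROM Sales.SalesOrderHeader;          -- any representative workload

      SELECT COUNT(*) AS RowCnt, SUM(Amount) AS Total FROM #Work;

      DROP TABLE #Work;                     -- explicit cleanup
  END;
  GO

  CREATE PROCEDURE dbo.TempTable_NoDrop
  AS
  BEGIN
      CREATE TABLE #Work (ID int NOT NULL, Amount decimal(18,2) NOT NULL);

      INSERT INTO #Work (ID, Amount)
      SELECT SalesOrderID, TotalDue
      FROM Sales.SalesOrderHeader;

      SELECT COUNT(*) AS RowCnt, SUM(Amount) AS Total FROM #Work;
      -- no DROP: the temp table goes out of scope when the procedure ends
  END;
  GO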

Using IO Cache To Speed Up Spark Jobs

Chris Seferlis looks at what the HDInsight team has done to speed up Apache Spark jobs:

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

There are many open source caching products that exist in the ecosystem: Alluxio, Ignite, and RubiX, to name a few big ones. The IO Cache is based on RubiX, and what differentiates RubiX from other comparable caching products is its approach of using SSD and eliminating the need for explicit memory management, while other comparable caching products rely on reserving operating memory for caching the data.

Read on for more details.

Power BI: Choosing Between Default Or Custom Date Dimensions

Reza Rad looks at the default date dimension available in Power BI and compares it to what you can build yourself:

Oh yes! Of course, like any other table structure in the Power BI in-memory model, every date table consumes memory. But it would do the same even if you create your custom date dimension! Whenever you do the role-playing dimension scenario, you are also consuming even more memory! The main difference is that the Power BI default Date dimension will be created even if you do not want to do date-based analysis on a date field. For example, even if you don’t use DueDate in your date-based analysis, Power BI still creates a date dimension for it. You cannot stop it for one field. You have to either stop the default creation of the Date dimension for the entire model or use it for the entire model; you cannot customize it per field. But with the custom date dimension, you can.

Reza does a good job of giving us the trade-offs between these two and explaining when we should use each.

Building Custom R Visuals In Power BI

Brad Lewellyn shows us how to create custom R visuals within Power BI:

Over the last few posts, we’ve shown how to use custom R visuals built by others.  Today, we’re going to build our own using the Custom R Visual available in Power BI Desktop.  If you haven’t read the second post in this series, Getting Started with R Scripts, it is highly recommended you do so now, as it provides necessary context for how to link Power BI to your local R ISE.

In the previous post, we created a bunch of log-transformed measures to find good predictors for Revenue.  We’re going to use these same measures today to create a basic linear regression model to predict Revenue.  If you want to follow along, the dataset can be found here.  Here’s the custom DAX we used to create the necessary measures.

Click through for the example.

Cannot Rollback TRUNCATE In Redshift

Derik Hammer notes that you cannot rollback a TRUNCATE TABLE operation in Redshift:

In SQL Server, or PostgreSQL for that matter, the TRUNCATE command is allowed in a transaction and it will commit or roll back like any other DML operation. In all of the scripts below, I will do the following.

  1. Check my row counts

  2. Begin a transaction

  3. TRUNCATE the table

  4. INSERT one row

  5. Check my row counts from within the transaction

  6. ROLLBACK

  7. Check to see if my row counts revert or not

That’s an important behavioral difference when loading data using a truncate-and-reload pattern.
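The test itself is easy to sketch in SQL. This hypothetical version (the table name and column are placeholders, not Derik's script, and it uses PostgreSQL/Redshift syntax; SQL Server spells the first step BEGIN TRANSACTION) follows the same seven steps. In SQL Server or PostgreSQL the final count returns to the baseline; in Redshift the TRUNCATE commits the transaction it runs in, so the truncated rows are not restored by the ROLLBACK:

  SELECT COUNT(*) FROM public.demo_table;        -- 1. baseline row count

  BEGIN;                                         -- 2. begin a transaction
  TRUNCATE public.demo_table;                    -- 3. in Redshift this commits immediately
  INSERT INTO public.demo_table (id) VALUES (1); -- 4. insert one row
  SELECT COUNT(*) FROM public.demo_table;        -- 5. row count inside the transaction
  ROLLBACK;                                      -- 6. roll back

  SELECT COUNT(*) FROM public.demo_table;        -- 7. did the row counts revert?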

How SQL Server Implements Index Spooling

Hugo Kornelis has a long article which dives into the way SQL Server handles index spooling:

A Table Spool operator stores its data in a worktable that is structured as a clustered index. The index is not built on any of the columns in the data, nor on any artificially added columns. It is structured on zero columns. As is normal for a clustered index on a set of columns that is not unique for the set, a 4-byte uniqueifier is then added to the data to give each row a unique internal address.

The worktable for an Index Spool operator is also structured as a clustered index. However, this operator does index actual columns from its data instead of just relying on a uniqueifier. The indexed columns are chosen to effectively satisfy the Seek Predicate property. The statement in Microsoft’s documentation that a nonclustered index is used for Index Seek is not correct.

A stack spool is represented in execution plans as a combination of an Index Spool and a Table Spool, both with the With Stack property present and set to True. This is misleading because it is actually a different type of spool. The worktable it uses is built as a clustered index on a single column, representing the nesting level. Because this is not unique, a uniqueifier is added where needed.

This is a deep look at some operators which people tend to gloss over but which can have huge performance impacts.
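If you want a plan to poke at while reading, one query shape that often earns an eager Index Spool (depending on indexing, statistics, and version) is a correlated subquery against a column with no supporting index. This is a hypothetical example, not one of Hugo's:

  -- With no index on CustomerID, the optimizer will frequently build an
  -- eager Index Spool over the inner reference to dbo.Orders, indexed on
  -- CustomerID, so it can seek once per distinct outer value instead of
  -- rescanning the table for every outer row.
  SELECT o.OrderID, o.Amount
  FROM dbo.Orders AS o
  WHERE o.Amount > (SELECT AVG(i.Amount)
                    FROM dbo.Orders AS i
                    WHERE i.CustomerID = o.CustomerID);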

Taking Advantage Of Vectorization In R

John Mount explains, using Conway’s Game of Life, the importance of using vectors in R over scalars:

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.

Conway’s Game of Life is one of the most interesting examples of cellular automata. It is traditionally simulated on a rectangular grid (like a chessboard) and each cell is considered either live or dead. The rules of evolution are simple: the next life grid is computed as follows:

  • To compute the state of a cell on the next grid sum the number of live cells in the eight neighboring cells on the current grid.

  • If this sum is 3 or if the current cell is live and the sum is 2 or 3, then the cell in the next grid will be live.

Not only is the R code faster, but it’s also terser.

In Lieu Of Lambda Architecture, Using Faster Databases

Justin Langseth argues that the Lambda architecture is not really necessary if you are using the right data stores:

Basically, the idea is to keep the fast stuff fast and the slow stuff slow. I wrote a paper 14 years ago on the challenges of real-time data warehousing. Fortunately, the data streaming, database, and BI layers have all evolved significantly since then, and now there exist databases and other data storage engines which can support the feature trinity that is needed to do both real-time and historical analytics right, without a Lambda architecture:

  1. Accept real-time streams of data at high rates.
  2. Simultaneously respond to large volumes of queries, including on the most recently added data.
  3. Store all the history needed for analysis.

We call these engines “fast data sinks” and there are four main groups of them today:

It’s an interesting argument.

Automating E-mail Of Windows Event Log Alerts Via Powershell

Dave Bermingham shows us how to automatically fire off e-mails for specific Windows Event Log event IDs:

The first thing that you need to do is write a Powershell script that when run can send an email. While researching this I discovered many ways to accomplish this task, so what I’m about to show you is just one way, but feel free to experiment and use what is right for your environment.

In my lab I do not run my own SMTP server, so I had to write a script that could leverage my Gmail account. You will see in my Powershell script that the password to the email account that authenticates to the SMTP server is in plain text. If you are concerned that someone may have access to your script and discover your password, then you will want to encrypt your credentials. Gmail requires an SSL connection, so your password should be safe on the wire, just like any other email client.

Here is an example of a Powershell script that, when used in conjunction with Task Scheduler, will send an email alert automatically when any specified Event is logged in the Windows Event Log.

Read on for the script as well as some bonus troubleshooting.

Missing Foreign Keys—A Cultural Problem

Martin Catherall tells a spooky Halloween story:

By large databases I roughly mean databases with several hundred tables, and I usually see a lot of these tables with several hundred GBs of data in them.

When I generally ask about the reason for no foreign key, I’m told

  1. they add overhead
  2. they give no benefit
  3. we can’t enter our data properly when we have them

The last one in the above list is generally down to poor modelling – an example being entering a later part of a financial transaction before the first part of the financial transaction has taken place. Once both parts of the financial transaction have taken place then the database is left in a consistent state – BUT, that generally being handled by the app NOT the database – OUCH!

There are times when key constraints are too much; frequently updated fact tables might be one such scenario. And some of “too much” comes down to hardware quality. But for the most part, those key constraints are one of the clearest forms of database documentation available, not to mention their normal benefits.
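For what it's worth, that documentation costs very little to write down. Here is a hypothetical sketch (table and constraint names are made up, not Martin's) of the kind of declaration that also blocks the "later part entered before the first part" problem he describes, because the detail row cannot exist until its parent does:

  -- The leg row cannot be entered until the transaction it belongs to exists.
  CREATE TABLE dbo.FinancialTransaction
  (
      TransactionID int NOT NULL PRIMARY KEY
  );

  CREATE TABLE dbo.FinancialTransactionLeg
  (
      LegID         int NOT NULL PRIMARY KEY,
      TransactionID int NOT NULL,
      Amount        decimal(18,2) NOT NULL,
      CONSTRAINT FK_FinancialTransactionLeg_FinancialTransaction
          FOREIGN KEY (TransactionID)
          REFERENCES dbo.FinancialTransaction (TransactionID)
  );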
