Press "Enter" to skip to content

Category: Uncategorized

Estimating Columnstore Compression

Niko Neugebauer has a way of estimating disk size after creating a columnstore index on a table:

For anyone working with SQL Server since version 2005 (with Service Pack 2, to be precise), there is a very common task when thinking about or researching system optimisation – the usage of compression – and before advancing with this step, the question one usually receives is: “can you estimate how much improvement we shall get?”
For this purpose, since SQL Server 2008 we have had a very useful stored procedure called sp_estimate_data_compression_savings, which is capable of providing us with an estimation of how much storage we can save by enabling or moving to a more effective compression method (as in the NONE | ROW | PAGE types).
Columnstore Indexes appeared in SQL Server 2012 (that is well over 6 years ago), and even though from time to time I would ask and suggest enabling the stored procedure sp_estimate_data_compression_savings to support Columnstore Indexes, until now there has been no such support.

Until now – I am introducing a combination of my scripts in a comprehensive and reasonably capable stored procedure called “cstore_sp_estimate_columnstore_compression_savings”, which is part of my free & open-sourced Columnstore Indexes Script Library, freely available on GitHub.

That’s a useful addition, especially when you’re trying to sell management on using clustered columnstore indexes.
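For comparison, here is roughly what both calls look like. The first procedure and its parameters are documented; the parameter names on Niko's procedure are illustrative, so check the GitHub repository for the actual signature:

```sql
-- The documented procedure covers only the NONE | ROW | PAGE compression types:
EXEC sys.sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'FactSales',    -- hypothetical table
    @index_id = NULL,               -- NULL = all indexes
    @partition_number = NULL,       -- NULL = all partitions
    @data_compression = N'PAGE';

-- Niko's procedure fills the columnstore gap; these parameter names are
-- illustrative -- consult the Script Library on GitHub for the real signature:
EXEC dbo.cstore_sp_estimate_columnstore_compression_savings
    @schemaName = N'dbo',
    @tableName = N'FactSales';
```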

Comments closed

The Evolution Of Hadoop

Holden Ackerman has an interesting analysis of Qubole customers’ adoption of Hadoop 2:

In Qubole’s 2018 Data Activation Report, we did a deep-dive analysis of how companies are adopting and using different big data engines. As part of this research, we found some fascinating details about Hadoop that we will detail in the rest of this blog.

A common misconception in the market is that Hadoop is dying. However, when people say this, they often mean that “MapReduce” as a standalone resource manager and “HDFS” as the primary storage component are dying. Beyond this, Hadoop as a framework is a core base for the entire big data ecosystem (Apache Airflow, Apache Oozie, Apache HBase, Apache Spark, Apache Storm, Apache Flink, Apache Pig, Apache Hive, Apache NiFi, Apache Kafka, Apache Sqoop…the list goes on).

I clipped this portion rather than the direct analysis because I think it’s an important point: the Hadoop ecosystem is thriving as the focus shifts from what was important a decade ago (batch processing of large amounts of data on servers with direct-attached storage) to what is important today (a combination of batch and streaming processing of large amounts of data on virtualized, often cloud-based servers with network-attached flash storage).

Comments closed

DISTINCT, GROUP BY, And Transaction Isolation Levels

Rob Farley has an interesting post where two similar-looking queries can provide different outputs given certain transaction isolation levels:

Now, it’s been pointed out, including by Adam Machanic (@adammachanic) in a tweet referencing Aaron’s post about GROUP BY v DISTINCT, that the two queries are essentially different: one is actually asking for the set of distinct combinations on the results of the sub-query, rather than running the sub-query across the distinct values that are passed in. It’s what we see in the plan, and it’s the reason why the performance is so different.

The thing is that we would all assume that the results are going to be identical.

But that’s an assumption, and isn’t a good one.

Rob starts out with READ UNCOMMITTED but then gets into the “normal” READ COMMITTED transaction isolation level that most places use.
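Here is a sketch of the two query shapes in play (table and column names are hypothetical; this is my paraphrase of the pattern, not Rob's exact repro):

```sql
-- Version 1: DISTINCT. The subquery runs once per row of dbo.Orders and the
-- results are de-duplicated afterwards.
SELECT DISTINCT
       o.CustomerID,
       (SELECT MAX(i.OrderDate)
        FROM dbo.Orders AS i
        WHERE i.CustomerID = o.CustomerID) AS LastOrderDate
FROM dbo.Orders AS o;

-- Version 2: GROUP BY. The optimizer can aggregate first and then run the
-- subquery once per distinct CustomerID.
SELECT o.CustomerID,
       (SELECT MAX(i.OrderDate)
        FROM dbo.Orders AS i
        WHERE i.CustomerID = o.CustomerID) AS LastOrderDate
FROM dbo.Orders AS o
GROUP BY o.CustomerID;

-- Under a low isolation level, a concurrent update can make the subquery
-- return different values for the same CustomerID across executions, so
-- Version 1 can emit two rows for one customer while Version 2 cannot.
```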

Comments closed

Allowing Azure Service Access

Arun Sirpal points out the importance of a tiny checkbox:

When you create a “logical” Azure SQL Server (I say logical because we are not really physically creating anything) there is a setting that is ticked ON by default which is called “Allow Azure services to access server”.

The question is, what does it mean? (See the highlighted section below)

Read on to see what this does and why Arun doesn’t like the default.
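For context, that checkbox corresponds (as far as I'm aware) to a server-level firewall rule spanning the 0.0.0.0 address, which you can manage yourself from the master database of the logical server. A sketch, assuming the conventional rule name:

```sql
-- Run in the master database of the Azure SQL logical server.
-- Equivalent of ticking "Allow Azure services to access server":
EXEC sp_set_firewall_rule N'AllowAllWindowsAzureIps', '0.0.0.0', '0.0.0.0';

-- And to turn it back off, which is what Arun recommends:
EXEC sp_delete_firewall_rule N'AllowAllWindowsAzureIps';
```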

Comments closed

SQL Persistent Storage In Azure Container Services

Andrew Pruski shows how to use Kubernetes persistent volumes in Azure Container Services:

I’ve been playing around with SQL Server running in Kubernetes in Azure Container Services (AKS) for a while now and I think that the technology is really cool.

You can get a highly available instance of SQL Server up and running with a few lines of code! Ok, there’s a bit of setup to do but once you get that out of the way, you’re good to go.

One thing that has been missing though, is persistent storage. Any changes made to the SQL instance would have been lost if the pod that it was running in failed and was brought back up.

Until now.

Click through to learn how.  It’s certainly not trivial, but Andrew does a good job showing us the step-by-step.
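One simple way to convince yourself the storage really persists: create a database on the mounted volume, delete the pod, and check that the database is still there once Kubernetes brings the pod back. A sketch, assuming the persistent volume is mounted at SQL Server on Linux's default /var/opt/mssql path and using a hypothetical database name:

```sql
-- Create a database whose files live on the mounted volume:
CREATE DATABASE PersistTest
ON PRIMARY (NAME = N'PersistTest',
            FILENAME = N'/var/opt/mssql/data/PersistTest.mdf')
LOG ON (NAME = N'PersistTest_log',
        FILENAME = N'/var/opt/mssql/data/PersistTest_log.ldf');

-- After deleting the pod and letting Kubernetes recreate it, this should
-- still return a row if /var/opt/mssql is backed by a persistent volume:
SELECT name, create_date FROM sys.databases WHERE name = N'PersistTest';
```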

Comments closed

Date Correlation Optimization

Monica Rathbun explains another quasi-hidden SQL Server configuration option:

According to MSDN – The DATE_CORRELATION_OPTIMIZATION database SET option improves the performance of queries that perform an equi-join between two tables whose date or datetime columns are correlated, and which specify a date restriction in the query predicate.

How many of you read what MSDN says and think “wuuuuuttt, English please”? I do.

Read on for the English translation.
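In short: turn the option on and SQL Server tracks the correlation between datetime columns across joined tables, letting it infer extra date predicates. A sketch with hypothetical database, table, and column names:

```sql
-- Enable the option (it is OFF by default):
ALTER DATABASE SalesDB SET DATE_CORRELATION_OPTIMIZATION ON;

-- The kind of query it targets: an equi-join between two tables whose
-- datetime columns (here OrderDate and DueDate) are correlated, with a date
-- restriction in the predicate. SQL Server can infer a matching range on
-- OrderLines.DueDate from the filter on Orders.OrderDate.
SELECT o.OrderID, ol.DueDate
FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol
        ON o.OrderID = ol.OrderID
WHERE o.OrderDate BETWEEN '20180101' AND '20180131';
```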

Comments closed

Integrating Azure Data Catalog With Power BI

Gaston Cruz shows how to view Azure Data Catalog data in Power BI:

A self-service culture allows analysts to generate their own reports, lists, and dashboards without depending on the schedule and availability of IT staff. In these cases, reports that combine different sources of information are generated, many of which may not have been used historically in the company; this in turn means that in a large number of cases you do not know which source was used to implement certain reports.

Azure Data Catalog is an option to break that cycle of discovery, which is usually done manually. After the first cycle, where the business analyst discovers the optimal data sources for certain reports, they can register those sources and add information (metadata) to make them easier to discover for future analysts who need the same data for similar reports. The discovery of these sources and the ability to add metadata do not have to happen at the same time; Data Catalog supports annotation by analysts as continuous work over time, with more information added to the repository each time.

Click through for a demo.

Comments closed

Scripting Tables With SSMS

Tim Cost shows a few ways to script tables using SQL Server Management Studio:

Still … there is a trick here, and I don’t see a lot of people using it. Maybe it’s just me, maybe I’m lazier than the average dev, but I often find myself using the Script Table As menu and choosing SELECT To and then Clipboard. This creates a nice select statement with all my fields wrapped in hard brackets. I can then copy this into an INSERT query I might be working on to save myself some typing. I can quickly copy the field list from the query that ‘Script Table As’ gives me and use it in the top of my INSERT query, then I can copy the entire SELECT query into the bottom of my INSERT query and Bob’s yer Uncle, I’ve got a simple INSERT query ready to go. Note: This is most useful when I’m trying to create a new table based on an existing table with only minor changes to field names. I use this frequently when I’m establishing a reporting database based on staging tables.

That’s three ways to do it in Management Studio; the next step in the process is using SMO to script tables from a .NET language (C#, F#, PowerShell).
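As a minimal illustration of the copy-and-paste workflow Tim describes (table and column names are hypothetical), the scripted SELECT drops straight into an INSERT:

```sql
-- The field list and SELECT body come from Script Table As > SELECT To >
-- Clipboard; only the target table name needs typing:
INSERT INTO [dbo].[CustomerStaging] ([CustomerID], [FirstName], [LastName])
SELECT [CustomerID], [FirstName], [LastName]
FROM [dbo].[Customer];
```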

Comments closed