Press "Enter" to skip to content

Estimating Columnstore Compression

Niko Neugebauer has a way of estimating disk size after creating a columnstore index on a table:

For anyone working with SQL Server since version 2005 (with Service Pack 2, to be precise), there is a very common task when thinking about or researching system optimisation – the usage of compression – and before advancing with this step, the question one usually receives is: “can you estimate how much improvement we shall get?”
For this purpose, since SQL Server 2008 we have had a very useful stored procedure called sp_estimate_data_compression_savings, which is capable of providing us with an estimation of how much storage we can save by enabling or moving to a more effective compression method (as in the NONE | ROW | PAGE types).
Columnstore Indexes appeared in SQL Server 2012 (well over 6 years ago), and even though from time to time I would ask and suggest enabling the stored procedure sp_estimate_data_compression_savings to support Columnstore Indexes, until now there has been no such support.

Until now, that is – I am introducing a combination of my scripts in a comprehensive and reasonably capable stored procedure called “cstore_sp_estimate_columnstore_compression_savings”, which is part of my free & open-source Columnstore Indexes Script Library, available on GitHub.

That’s a useful addition, especially when you’re trying to sell management on using clustered columnstore indexes.
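
For comparison, the classic system procedure only covers rowstore compression; a minimal call looks like this (dbo.FactSales is a hypothetical table name, and Niko’s procedure fills the columnstore gap this one leaves):

    -- Estimate savings from PAGE compression (dbo.FactSales is hypothetical)
    EXEC sys.sp_estimate_data_compression_savings
        @schema_name      = N'dbo',
        @object_name      = N'FactSales',
        @index_id         = NULL,   -- NULL = all indexes on the table
        @partition_number = NULL,   -- NULL = all partitions
        @data_compression = N'PAGE';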

Traditional Database Security Doesn’t Protect Data

Alex Yates has a controversial topic and some interesting thoughts:

Unfortunately, traditional database security has failed us.

Developers need access to the dev database to do their work. They need to be able to use appropriate test data to test their code. Traditional security features (logins, roles, and users – even encryption technologies, dynamic data masking, row-level security, etc.) can be used to manage who has access to the data in production systems, but if a dev or test database already holds the sensitive data, these fundamental security features are worthless with regard to data protection. Even encrypted data only remains secure if the keys remain safe.

Sure, traditional security features protect the data in the production system – but not if it has already been copied to a less secure environment. And most people don’t track that with anywhere near as much rigor as they should.

To protect data effectively, we need to think much more consciously, not just about the production database, but also about all the other databases and backups that make up our database lifecycles, including dev and test systems and dev workstations. We need to know exactly where our security perimeter lies. Any copy of sensitive production data needs to live within the security perimeter and not outside it.

Read the whole article.

Biml 2018 Release Day

Andy Leonard has a bunch of new presents to unwrap:

Varigence keeps giving away cool stuff! Nowhere is Varigence’s commitment to community more evident than in the feature list for BimlExpress 2018. The previous version – BimlExpress 2017 – included the Preview Pane. BimlExpress 2018 includes the ability to Convert SSIS Packages to Biml:

How cool is that? And it’s in the free (FREE!) version!

As with BimlFlex and BimlStudio, there are too many cool features to list here. Head over to the BimlExpress 2018 feature page to learn more.

Converting existing packages to Biml was a great feature that I could never afford. It’s exceedingly nice of Scott Currie & crew to make that available in the free product.

Database Snapshots In dbatools

Chrissy LeMaire shows how easy it is to create and manage database snapshots with dbatools:

New-DbaDbSnapshot

To create a new snapshot, you no longer need to know the path of the snapshot location (though we do support custom paths). You don’t even need to specify a name! But you can, of course.

I haven’t used database snapshots in a while, but I do appreciate them, especially for testing scenarios.
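
If you’re curious what the cmdlet automates, the underlying T-SQL for creating and reverting to a snapshot looks roughly like this (the database, logical file name, and path here are assumptions for illustration):

    -- Create a snapshot; NAME must match the source database's logical data file name
    CREATE DATABASE AdventureWorks_Snap
    ON ( NAME = AdventureWorks_Data,
         FILENAME = 'C:\Snapshots\AdventureWorks_Data.ss' )
    AS SNAPSHOT OF AdventureWorks;

    -- Revert the source database to the snapshot, e.g. after a destructive test run
    RESTORE DATABASE AdventureWorks
    FROM DATABASE_SNAPSHOT = 'AdventureWorks_Snap';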

Deploying SQL Server Code Using Jenkins

Chris Adkin has started a series on continuous integration of SQL Server databases with Jenkins and Docker:

The mainstay of my presentation material this year has been my deck on continuous integration, Docker, and Jenkins. For people who have not had the chance to see this presentation, or who have seen it and wanted more context around it, I have written this first in a series of posts. Much – in fact, just about all – of the material in this post features in other posts on my blog. The aim of this set of posts is to present the material in a more digestible manner for people who might not be fully familiar with Docker and Jenkins.

This first post will cover an introduction to Jenkins and the use of the “Sidecar pattern” as a target for deploying DACPACs. Subsequent posts will expand on this to include:

  • Multi branch build pipelines
  • Parallelism
  • Unit testing with tSQLt
  • The management of database state via Docker volumes

Many people in the SQL Server community have displayed a great interest in containers, only to be left scratching their heads thinking “Well, that is nice, but what can I practically use them for?”. In my humble opinion, spinning up SQL Server inside a container as a deployment target for a continuous integration pipeline is one of the best ways – if not the best way – to leverage SQL Server and Docker.

I’m looking forward to the rest of the series.
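
As a taste of the tSQLt step Chris lists above: a database unit test is itself just T-SQL, which is what makes it easy to run from a Jenkins build step. A minimal sketch, assuming the tSQLt framework is installed (the table, function, and test names are hypothetical, not taken from Chris’s series):

    -- dbo.Invoices and dbo.GetInvoiceTotal are hypothetical objects under test
    EXEC tSQLt.NewTestClass 'InvoiceTests';
    GO
    CREATE PROCEDURE InvoiceTests.[test GetInvoiceTotal returns zero with no invoices]
    AS
    BEGIN
        EXEC tSQLt.FakeTable 'dbo.Invoices';   -- isolate the test from real data

        DECLARE @expected money = 0,
                @actual   money = dbo.GetInvoiceTotal();

        EXEC tSQLt.AssertEquals @expected, @actual;
    END;
    GO
    EXEC tSQLt.Run 'InvoiceTests';             -- the call a CI build step would make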

Parallelizing Linear Regression With MapReduce

Arthur Charpentier shows us the math behind using MapReduce to parallelize a linear regression:

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the math. Map-Reduce is one of those: with several cores, it is possible to split the problem, map on each machine, and then aggregate everything back at the end.

Arthur gives us an interesting example in R to boot.
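
To make the trick concrete in standard OLS notation (my notation, not necessarily Arthur’s): the normal equations decompose into sums over row chunks, so each worker only ships small k-by-k pieces back to the reducer:

    \hat{\beta} = (X^\top X)^{-1} X^\top y,
    \quad\text{where}\quad
    X^\top X = \sum_{j=1}^{m} X_j^\top X_j,
    \quad
    X^\top y = \sum_{j=1}^{m} X_j^\top y_j

Here the n rows of X are partitioned into chunks X_1, …, X_m. Each mapper computes its chunk’s k-by-k matrix X_j^\top X_j and k-vector X_j^\top y_j; the reducer sums them and solves one small k-by-k system, so the full n-row matrix never has to sit on a single machine.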

Confluent Hub: A Central Repo For Kafka Connect

Tim Berglund announces Confluent Hub:

Connect has been an integral part of Apache Kafka since version 0.9, released in late 2015. It has proved to be an effective framework for streaming data in and out of Kafka from nearby systems like relational databases, Amazon S3, HDFS clusters, and even nonstandard legacy systems that typically show themselves in the enterprise. Connect is an API on which the connectors themselves are built, plus a run-time framework that runs them in a scalable, fault-tolerant way. The intent was for the community to provide its own connectors to plug into this framework and do the work of data integration, while saving everyone a bunch of unrewarding, near-boilerplate coding that didn’t add a lot of differentiated value to the business.

So where would those connectors live? Well, GitHub, for starters. At the time of this writing, there were 660 repositories matching the search phrase “Kafka Connect” on the popular hosting service, all in various stages of repair and levels of maintenance. Beyond those, Confluent’s popular Connectors page has proven to be one of the best ways to find connectors, some of which are supported by Confluent, and others of which have robust community support behind them. The Connectors page listed, for each entry, the type of connector, the developer, a few tags, and how you could obtain the code – but that’s really all it did. You still had to go find the released JARs for the connector, download them, and know how to install them properly. And if there were no released JARs available, you had to clone the repository, figure out how to run the build, and then install the JARs into your own Kafka Connect installation. Maybe not rocket science, but we all know it’s never as simple as it sounds. And besides, this covered just connectors – no transformations or converters were available on this page.

We knew there was a better way. We wanted something that was easier to use, would spare you having to build a connector from source every time you wanted an update (and having to learn a new build tool every now and then), and would be built on top of a meaningful and functional discovery mechanism. And most importantly, we wanted to avoid the pitfalls of manually moving JARs around and having to debug why Connect didn’t find them.

This looks like a good addition to the Kafka ecosystem.

Tuning Spark Jobs Running On YARN

Anubhav Tarar shows us ways of optimizing YARN to run Apache Spark jobs:

1. yarn-client mode: In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. To manage the memory, first make sure that you have your yarn-site.xml available to Spark:

  • spark.yarn.am.memory: To increase the memory, set the spark.yarn.am.memory property in spark-defaults.conf, but make sure that you do not allocate more memory than the capacity of the node manager, which is defined in yarn-site.xml as yarn.nodemanager.resource.memory-mb. You can also pass it when running spark-submit with the --conf parameter.

For example: $SPARK_HOME/bin/spark-submit --conf spark.yarn.am.memory=1024m

Check it out for a few other configuration settings you can tweak.

YARN Fundamentals

Anushree Subramaniam gives us a primer on Apache YARN, the resource manager which drives Hadoop:

In Hadoop version 1.0, also referred to as MRV1 (MapReduce Version 1), MapReduce performed both processing and resource management functions. It consisted of a Job Tracker, which was the single master. The Job Tracker allocated resources, performed scheduling, and monitored the processing jobs. It assigned map and reduce tasks to a number of subordinate processes called Task Trackers. The Task Trackers periodically reported their progress to the Job Tracker.

This design resulted in a scalability bottleneck due to the single Job Tracker. IBM mentioned in its article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of computational resources was inefficient in MRV1, and the Hadoop framework was limited to the MapReduce processing paradigm alone.

To overcome all these issues, YARN was introduced in Hadoop version 2.0 in 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibilities of resource management and job scheduling, giving Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.

There’s a lot of depth to YARN.

Clustered Columnstore Index Online Rebuild

Niko Neugebauer looks at a feature which will pop up in SQL Server vNext:

The current state of the Clustered Columnstore Index ONLINE rebuild points to an unfinished feature, which will definitely get vastly improved before being released & supported in SQL Server. I have seen a couple of deadlocks and canceled transactions, and so I decided that this blog post will get updated as soon as there is an official announcement of this feature.
If you are still looking to start working with this feature, then I would suggest trying it on smaller tables. Like really, really small ones.
Oh, and for the online rebuild operation, focus on using partition rebuild – you are using partitioning, right? 🙂

Niko gave this a try in Azure SQL Database, as there is no publicly available version of SQL Server which supports this. I’ve been waiting for this feature for 3 years now, so I’ll be happy to see it in production.
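
For reference, the statement under test follows the familiar online rebuild pattern; here is a sketch of the partition-level variant Niko recommends (table, index, and partition number are hypothetical, and the feature was unreleased at the time of writing):

    -- Rebuild one partition of a clustered columnstore index online (hypothetical names)
    ALTER INDEX CCI_FactSales ON dbo.FactSales
    REBUILD PARTITION = 3
    WITH (ONLINE = ON);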
