Press "Enter" to skip to content

Curated SQL Posts

Neural Topic Models On Amazon SageMaker

David Ping, et al, show off topic modeling on Amazon SageMaker:

Topic Modeling is used to organize a corpus of documents into “topics” which is a grouping based on a statistical distribution of words within the documents themselves. Amazon Comprehend, our fully managed text analytics service, provides a pre-configured topic modeling API that is best suited for the most popular use cases like organizing customer feedback, support incidents or workgroup documents. Amazon Comprehend is the suggested topic modeling choice for customers as it removes a lot of the most routine steps associated with topic modeling like tokenization, training a model and adjusting parameters. Amazon SageMaker’s Neural Topic Model (NTM) caters to the use cases where a finer control of the training, optimization, and/or hosting of a topic model is required, such as training models on text corpus of particular writing style or domain, or hosting topic models as part of a web application. While Amazon SageMaker NTM provides a starting point of state-of-the-art topic modeling, customers have the flexibility to modify the network architecture as well as hyperparameters to accommodate the idiosyncrasies of their data sets as well as to tune the trade-off between a multitude of metrics such as document modeling accuracy, human interpretability and granularity of the learned topics, based on their applications. In addition, Amazon SageMaker NTM leverages the full power of the Amazon SageMaker platform: easily configurable training and hosting infrastructure, automatic hyperparameter optimization, and fully-managed hosting with auto-scaling.

They walk through the entire topic modeling process, so check it out.

Comments closed

AVG And Data Types

Kendra Little explains how the AVG() function works with a couple different data types:

This week’s Quizletter featured a quiz on using the AVG() function in SQL Server.

I was inspired to write this quiz because I’ve been teaching TSQL School each week. When we covered aggregate functions, I remembered just how tricksy and non-average that AVG() can be.

Want to take the quiz before you read the answers?

Head over here and take the quiz first.

Or read this post and then take the quiz, if you prefer. I’m not gonna call it cheating, because it’s learning either way.

Kendra explains each of the answers, so I’d recommend taking the quiz first.

Comments closed

Estimating Columnstore Compression

Niko Neugebauer has a way of estimating disk size after creating a columnstore index on a table:

For anyone working with SQL Server since version 2005 (with Service Pack 2 to be precise) there is a very common task when thinking or research the system optimisation – the usage of the compression, and before advancing with this step the question that one usually receives is – “can you estimate how much improvement we shall get?”
For this purpose since SQL Server 2008, we have a very useful stored procedure that is called sp_estimate_data_compression_savings, that is capable of providing us with the estimation of how much storage we can save by enabling or moving to a more effective compression method (as in NONE | ROW | PAGE types).
Columnstore Indexes have appeared in SQL Server 2012 (that is well over 6 years ago) and even though from time to time I would ask and suggest to enable the stored procedure sp_estimate_data_compression_savings to start supporting Columnstore Indexes, until now there is no such support.

Until now – I am introducing a conjunction of my scripts in a comprehensive and reasonably capable stored procedure that is called “cstore_sp_estimate_columnstore_compression_savings” and that is a part of my free & open-sourced Columnstore Indexes Script Library, freely available on GitHub.

That’s a useful addition, especially when you’re trying to sell management on using clustered columnstore indexes.

Comments closed

Traditional Database Security Doesn’t Protect Data

Alex Yates has a controversial topic and some interesting thoughts:

Unfortunately, traditional database security has failed us.

Developers need access to the dev database to do their work. They need to be able to use appropriate test data to test their code. Traditional security features (logins, roles and users – even encryption technologies, dynamic data masking and row level security, etc) can be used to manage who has access to the data in production systems, but if a dev or test database already has the sensitive data these fundamental security features are worthless with regard to data protection. Even encrypted data only remains secure if the keys remain safe.

Sure, traditional security features protect the data in the production system – but not if it has already been copied to a less secure environment. And most people don’t track that with anywhere near as much rigor as they should.

To protect data effectively, we need to think much more consciously, not just about the production database, but also about all the other databases and backups that make up our database lifecycles, including dev and test systems and dev workstations. We need to know exactly where our security perimeter lies. Any copy of sensitive production data needs to live within the security perimeter and not outside it.

Read the whole article.

1 Comment

Biml 2018 Release Day

Andy Leonard has a bunch of new presents to unwrap:

Varigence keeps giving away cool stuff! Nowhere is Varigence’s commitment to community more evident than in the feature list for BimlExpress 2018. The previous version – BimlExpress 2017 – included the Preview Pane. BimlExpress 2018 includes the ability to Convert SSIS Packages to Biml:

How cool is that? And it’s in the free (FREE!) version!

As with BimlFlex and BimlStudio, there are too many cool features to list here. Head over to the BimlExpress 2018 feature page to learn more.

Converting existing packages to Biml was a great feature that I could never afford.  It’s exceedingly nice of Scott Currie & crew to make that available in the free product.

Comments closed

Database Snapshots In dbatools

Chrissy LeMaire shows how easy it is to create and manage database snapshots with dbatools:

New-DbaDbSnapshot

To create a new snapshot, you no longer need to know the path of the snapshot location (though we do support custom paths). You don’t even need to specify a name! But you can, of course.

I haven’t used database snapshots in a while, but I do appreciate them, especially for testing scenarios.

Comments closed

Deploying SQL Server Code Using Jenkins

Chris Adkin has started a series on continuous integration of SQL Server databases with Jenkins and Docker:

The mainstay of my presentation material this year has been my deck on continuous integration, Docker and Jenkins. For people who have not had the chance to see this presentation or have seen it and wanted to get some more context around it, I have written this first in a series of posts. Much, in fact just about all of the material in this post features in other posts on my blog. The aim of this set of posts is to present the material in a more digestible manner for people who might not be fully fully familiar with Docker and Jenkins.

This first post will cover an introduction to Jenkins and use of the “Sidecar pattern” for deploying DACPACs to. Subsequent posts will expand on this to include:

  • Multi branch build pipelines
  • Parallelism
  • Unit testing with tSQLt
  • The management of database state via Docker volumes

Many people in the SQL Server community have displayed a great interest in containers, only to be left scratching their heads thinking “Well, that is nice, but what can I practically use them for ?”. In my humble opinion, spinning up SQL Server inside a container as a deployment target for a continuous integration pipeline, is one of the, if not the best ways to leverage SQL Server and Docker.

I’m looking forward to the rest of the series.

Comments closed

Parallelizing Linear Regression With MapReduce

Arthur Charpentier shows us the math behind using MapReduce to parallelize a linear regression:

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Arthur gives us an interesting example in R to boot.

Comments closed

Confluent Hub: A Central Repo For Kafka Connect

Tim Berglund announces Confluent Hub:

Connect has been an integral part of Apache Kafka since version 0.9, released late 2015. It has proved to be an effective framework for streaming data in and out of Kafka from nearby systems like relational databases, Amazon S3, HDFS clusters, and even nonstandard legacy systems that typically show themselves in the enterprise. Connect is an API on which the connectors themselves are built, plus a run-time framework that runs them in a scalable, fault-tolerant way. The intent was for the community to provide its own connectors to plug into this framework and do the work of data integration while saving everyone a bunch of unrewarding coding that was near-boilerplate and didn’t add a lot of differentiated value to the business.

So where would those connectors live? Well, GitHub, for starters. At the time of this writing, there were 660 repositories matching the search phrase “Kafka Connect” on the popular hosting service, all in various stages of repair and levels of maintenance. Beyond those, Confluent’s popular Connectors page has proven to be one of the best ways to find connectors, some of which are supported by Confluent, and others of which have robust community support behind them. The Connectors page lists for each entry the type of connector, the developer, a few tags, and how you could obtain the code—but that’s really all it did. You still had to go find the released JARs for the connector, download them, and know how to install them properly. And if there were no released JARs available, you had to clone the repository, figure out how to run the build, and then install the JARs into your own Kafka Connect installation. Maybe not rocket science, but we all know it’s never as simple as it sounds. And besides, this was just connectors—no transformations or converters were available on this page.

We knew there was a better way. We wanted something that was easier to use, would avoid you having to building a connector from source every time you wanted an update (and learning a new build tool every now and then), and would be built on top of a meaningful and functional discovery mechanism. And most importantly, we wanted to avoid the pitfalls of manually moving JARs around and having to debug why Connect didn’t find them.

This looks like a good addition to the Kafka ecosystem.

Comments closed

Tuning Spark Jobs Running On YARN

Anubhav Tarar shows us ways of optimizing YARN to run Apache Spark jobs:

1. yarn-client mode:  In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. To manage the memory first make sure that you have your yarn-site.xml in spark,

  • spark.yarn.am.memory: To increase the memory you should set spark.yarn.am.memory property in spark-defaults.conf but make sure that you do not allocate more memory than capacity of node manager which is defined in yarn-site.xml as yarn.nodemanager.resource.memory-mb or you can also give it when you are running spark submit with –conf parameter

For example $SPARK_HOME/bin/spark-submit –conf spark.yarn.am.memory=1024m

Check it out for a few other configuration settings you can tweak.

Comments closed