Converting Factors To Numbers In R

Kevin Feasel

2018-06-26

R

Sebastian Sauer shows us a pitfall of brute-force conversion of factors to integers:

Oh no! That’s not what we wanted! R has messed the thing up (?). The reason is that R sees the first factor level internally as the number 1 . The second level as number two. What’s the first factor level in our case? Let’s see:

factor(tips$sex) %>% head()#> [1] Female Male Male Male Female Male #> Levels: Female Malefactor(tips$sex_r) %>% head()#> [1] 1 0 0 0 1 0#> Levels: 0 1

That’s confusing: “0” is the first level of sex_r – internally for R represented by “1”. The second level of sex_r is “1” – internally represented by “2”.

Fortunately, we get the easy answer at the end of the post.

Neural Topic Models On Amazon SageMaker

David Ping, et al, show off topic modeling on Amazon SageMaker:

Topic Modeling is used to organize a corpus of documents into “topics” which is a grouping based on a statistical distribution of words within the documents themselves. Amazon Comprehend, our fully managed text analytics service, provides a pre-configured topic modeling API that is best suited for the most popular use cases like organizing customer feedback, support incidents or workgroup documents. Amazon Comprehend is the suggested topic modeling choice for customers as it removes a lot of the most routine steps associated with topic modeling like tokenization, training a model and adjusting parameters. Amazon SageMaker’s Neural Topic Model (NTM) caters to the use cases where a finer control of the training, optimization, and/or hosting of a topic model is required, such as training models on text corpus of particular writing style or domain, or hosting topic models as part of a web application. While Amazon SageMaker NTM provides a starting point of state-of-the-art topic modeling, customers have the flexibility to modify the network architecture as well as hyperparameters to accommodate the idiosyncrasies of their data sets as well as to tune the trade-off between a multitude of metrics such as document modeling accuracy, human interpretability and granularity of the learned topics, based on their applications. In addition, Amazon SageMaker NTM leverages the full power of the Amazon SageMaker platform: easily configurable training and hosting infrastructure, automatic hyperparameter optimization, and fully-managed hosting with auto-scaling.

They walk through the entire topic modeling process, so check it out.

AVG And Data Types

Kendra Little explains how the AVG() function works with a couple different data types:

This week’s Quizletter featured a quiz on using the AVG() function in SQL Server.

I was inspired to write this quiz because I’ve been teaching TSQL School each week. When we covered aggregate functions, I remembered just how tricksy and non-average that AVG() can be.

Want to take the quiz before you read the answers?

Head over here and take the quiz first.

Or read this post and then take the quiz, if you prefer. I’m not gonna call it cheating, because it’s learning either way.

Kendra explains each of the answers, so I’d recommend taking the quiz first.

Estimating Columnstore Compression

Niko Neugebauer has a way of estimating disk size after creating a columnstore index on a table:

For anyone working with SQL Server since version 2005 (with Service Pack 2 to be precise) there is a very common task when thinking or research the system optimisation – the usage of the compression, and before advancing with this step the question that one usually receives is – “can you estimate how much improvement we shall get?”
For this purpose since SQL Server 2008, we have a very useful stored procedure that is called sp_estimate_data_compression_savings, that is capable of providing us with the estimation of how much storage we can save by enabling or moving to a more effective compression method (as in NONE | ROW | PAGE types).
Columnstore Indexes have appeared in SQL Server 2012 (that is well over 6 years ago) and even though from time to time I would ask and suggest to enable the stored procedure sp_estimate_data_compression_savings to start supporting Columnstore Indexes, until now there is no such support.

Until now – I am introducing a conjunction of my scripts in a comprehensive and reasonably capable stored procedure that is called “cstore_sp_estimate_columnstore_compression_savings” and that is a part of my free & open-sourced Columnstore Indexes Script Library, freely available on GitHub.

That’s a useful addition, especially when you’re trying to sell management on using clustered columnstore indexes.

Traditional Database Security Doesn’t Protect Data

Alex Yates has a controversial topic and some interesting thoughts:

Unfortunately, traditional database security has failed us.

Developers need access to the dev database to do their work. They need to be able to use appropriate test data to test their code. Traditional security features (logins, roles and users – even encryption technologies, dynamic data masking and row level security, etc) can be used to manage who has access to the data in production systems, but if a dev or test database already has the sensitive data these fundamental security features are worthless with regard to data protection. Even encrypted data only remains secure if the keys remain safe.

Sure, traditional security features protect the data in the production system – but not if it has already been copied to a less secure environment. And most people don’t track that with anywhere near as much rigor as they should.

To protect data effectively, we need to think much more consciously, not just about the production database, but also about all the other databases and backups that make up our database lifecycles, including dev and test systems and dev workstations. We need to know exactly where our security perimeter lies. Any copy of sensitive production data needs to live within the security perimeter and not outside it.

Read the whole article.

Biml 2018 Release Day

Andy Leonard has a bunch of new presents to unwrap:

Varigence keeps giving away cool stuff! Nowhere is Varigence’s commitment to community more evident than in the feature list for BimlExpress 2018. The previous version – BimlExpress 2017 – included the Preview Pane. BimlExpress 2018 includes the ability to Convert SSIS Packages to Biml:

How cool is that? And it’s in the free (FREE!) version!

As with BimlFlex and BimlStudio, there are too many cool features to list here. Head over to the BimlExpress 2018 feature page to learn more.

Converting existing packages to Biml was a great feature that I could never afford.  It’s exceedingly nice of Scott Currie & crew to make that available in the free product.

Database Snapshots In dbatools

Chrissy LeMaire shows how easy it is to create and manage database snapshots with dbatools:

New-DbaDbSnapshot

To create a new snapshot, you no longer need to know the path of the snapshot location (though we do support custom paths). You don’t even need to specify a name! But you can, of course.

I haven’t used database snapshots in a while, but I do appreciate them, especially for testing scenarios.

Deploying SQL Server Code Using Jenkins

Chris Adkin has started a series on continuous integration of SQL Server databases with Jenkins and Docker:

The mainstay of my presentation material this year has been my deck on continuous integration, Docker and Jenkins. For people who have not had the chance to see this presentation or have seen it and wanted to get some more context around it, I have written this first in a series of posts. Much, in fact just about all of the material in this post features in other posts on my blog. The aim of this set of posts is to present the material in a more digestible manner for people who might not be fully fully familiar with Docker and Jenkins.

This first post will cover an introduction to Jenkins and use of the “Sidecar pattern” for deploying DACPACs to. Subsequent posts will expand on this to include:

  • Multi branch build pipelines
  • Parallelism
  • Unit testing with tSQLt
  • The management of database state via Docker volumes

Many people in the SQL Server community have displayed a great interest in containers, only to be left scratching their heads thinking “Well, that is nice, but what can I practically use them for ?”. In my humble opinion, spinning up SQL Server inside a container as a deployment target for a continuous integration pipeline, is one of the, if not the best ways to leverage SQL Server and Docker.

I’m looking forward to the rest of the series.

Categories

June 2018
MTWTFSS
« May Jul »
 123
45678910
11121314151617
18192021222324
252627282930