Binning And Recoding In R

Kevin Feasel

2018-08-10

R

Sebastian Sauer shows a few methods of practical data reshaping in R:

Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels to one, for instance aggregating the values From “1.00 meter” to “1.60 meter” to “small_size”.

Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.

Click through for examples of techniques you can use.

Running Apache Kafka On Kubernetes

Rohit Bakhshi walks us through how to install Kafka on a Kubernetes cluster:

Now available on GitHub in developer preview are open-source Helm Chart deployment templates for Confluent Platform components. These templates enable developers to quickly provision Apache Kafka, Apache ZooKeeper, Confluent Schema Registry, Confluent REST Proxy, and Kafka Connect on Kubernetes, using official Confluent Platform Docker images.

Helm is an open-source packaging tool that helps you install applications and services on Kubernetes. Helm uses a packaging format called charts. A chart is a collection of YAML templates that describe a related set of Kubernetes resources.

For stateful components like Kafka and ZooKeeper, the Helm Charts use both StatefulSets to provide an identity to each pod in the form of an ordinal index, and Persistent Volumes that are always mounted for the pod. For stateless components, like REST Proxy, the Helm Charts utilize Deployments instead to provide an identity to each pod. Each component’s charts utilize Services to provide access to each pod.

Read on for more.

Creating A Basic Availability Group

Ginger Keys shows us two methods for creating Basic Availability Groups in SQL Server 2016 and later:

AlwaysOn Basic Availability Groups (BAGs) are available with SQL Server 2016 and 2017 Standard edition.  The functionality is generally the same as database mirroring (which has been deprecated).  This feature replicates transactions to a database on a secondary server, and is useful for disaster recovery should something happen to the primary server.

If you have a database that requires an extra layer of protection or ‘BAG of tricks’, deploying a Basic Availability Group is useful for providing disaster recovery and high availability for the one database.  Also there is major cost savings since it is not necessary to purchase SQL Enterprise edition…this can be done in Standard edition of SQL Server.

BAGs provide a failover environment for only one database, and there can only be two replicas in the group.  Replication can be synchronous or asynchronous, and there is no read access, no backups and no integrity checks on the secondary.   The secondary replica remains inactive unless there is a failover, and Basic AGs can remain on-premises or span from on-prem to Azure.

Read on for the two methods.

In Defense Of The Layered Architecture

Sylvia Fronczak defends the layered architecture approach to software development:

There are disadvantages to the layered architecture approach. And some aspects of it have been deemed not architecturally pure for domain-driven design.

However, there are disadvantages to most approaches. The key is to understand both the advantages and disadvantages. Pick what architectural patterns work best for your particular problem.

Here are some factors you want to keep in mind.

  1. The layered architecture is very database-centric. As mentioned before, since everything flows down, it’s often the database that’s the last layer. Critics of this architecture point out that your application isn’t about storing data; it’s about solving a business problem. However, when so many applications are simple CRUD apps, maybe the database is more than just a secondary player.

  2. Scaleability can be difficult with a layered architecture. This is tied to the fact that many layered applications tend to take on monolithic properties. If you need to scale your app, you have to scale the whole app! However, that doesn’t mean your layered application has to be a monolith. Once it becomes large enough, it’s time to split it out—just like you would with any other architecture.

  3. A layered application is harder to evolve, as changes in requirements will often touch all layers.

  4. A layered architecture is deployed as a whole. That’s even if it’s modular and separated into good components and namespaces. But that might not be a bad thing. Unless you have separate teams working on different parts of the application, deploying all at once isn’t the worst thing you can do.

I completely disagree with critique #1 (as opposed to Sylvia’s defense):  most of the time, the database will outlive that puny application by a matter of decades.  That is, when some team is developing version 8 of the application, they’ll use essentially the same database as in V1.  Creating a design which optimizes for data storage and recall is, in general, much more important than which Javascript library you’re using or how you lay out those files.

The Key Hierarchy And SQL Server Encryption

David Fowler walks us through the various keys used in encrypting data in SQL Server:

I’m sure that we all know that SQL Server includes all sorts of interesting functionality to allow us to encrypt our data and like with all encryption techniques, that data is encrypted using keys.

In SQL Server we’ve got a number of different keys, we’ve got the Service Master Key, Database Master Keys, Symmetric Keys, Asymmetric Keys and Certificates.  These keys can be used to encrypt data but they can also be used to encrypt other keys and this is where the key hierarchy comes in.

Warning:  it’s turtles all the way down.

Power BI August Release And SSAS Performance Improvements

Chris Webb points out something new in the Power BI August 2018 release:

While I was playing around with the new release (August 2018) of Power BI Desktop I noticed there was an undocumented change: similar to the OData improvements I blogged about here, there is a new option in the AnalysisServices.Database() and AnalysisServices.Databases() M functions that turns on a newer version of the MDX generation layer used by the Power Query engine. Like the OData improvements it is an option called Implementation=”2.0”, used like this:

AnalysisServices.Databases(	"localhost",	[	TypedMeasureColumns=true,	Implementation="2.0"	]
)

…and also, as with the OData improvements, you will need to manually edit any existing M queries to take advantage of this.

Read on for Chris’s test and analysis of the resulting MDX output.

Ways To Check For Non-Existence

Brent Ozar shows two methods for finding records missing associated child records:

You’re writing a query, and you wanna check to see if rows exist in a table.

I’m using the free Stack Overflow database, and I wanna find all of the users who have not left a comment. The tables involved are:

  • In dbo.Users, the Id field uniquely identifies a user.
  • In dbo.Comments, there’s a UserId field that links to who left the comment.

A quick way to write it is:

And this works fine. When you read the query, you might think SQL Server would run that SELECT * FROM dbo.Comments query for every single row of the Users table – but it’s way smarter than that, bucko. It scans the Comments index first because it’s much larger, and then joins that to the Users table.

But this isn’t the only way to query these tables, and Brent shows how to tell which method works better.

Is sp_execute_external_script Replacing CLR?

Kevin Feasel

2018-08-10

CLR

Solomon Rutzky makes me invoke Betteridge’s Law of Headlines:

With the additional (and annoying) configuration step required to get SQLCLR Assemblies to load starting in SQL Server 2017, some people have been wondering what is going on with SQLCLR. Considering that this new restriction is the only real change to SQLCLR since SQL Server 2012 (three versions ago), then with (limited) support for languages such as R (starting in SQL Server 2016) and Python (starting in SQL Server 2017) being added, it might even look like SQLCLR is being deprecated (i.e. phased-out) in favor of these new languages.

Could this be true? There is no official indication, but could it be unofficially / “effectively” deprecated? Well, let’s take a look.

As someone who likes CLR, I want both CLR and Machine Learning Services to co-exist.  This would be true even if ML Services supported F# and the lesser .NET languages.

Categories

August 2018
MTWTFSS
« Jul Sep »
 12345
6789101112
13141516171819
20212223242526
2728293031