Press "Enter" to skip to content

Month: August 2017

Compacting Shared Libraries In R

Dirk Eddelbuettel compacts the tidyverse:

Of course, there is a third way: just run strip --strip-debug over all the shared libraries after the build. As the path is standardized, and the shell does proper globbing, we can just do

$ strip --strip-debug /usr/local/lib/R/site-library/*/libs/*.so

using a double-wildcard to get all packages (in that R package directory) and all their shared libraries. Users on macOS probably want .dylibon the end, users on Windows want another computer as usual (just kidding: use .dll). Either may have to adjust the path which is left as an exercise to the reader.

When running this against the tidyverse library, shared library sizes dropped from 78 MB down to 6.6 MB.  Not bad for a single command. H/T R-Bloggers

Comments closed

Cross-Container Transactions With Memory-Optimized Objects

Ned Otter continues his series on In-Memory OLTP isolation levels:

Why will it fail?

It will fail because the initiation mode of this transaction is not autocommit, which is required for READ COMMITED SNAPSHOT when referencing memory-optimized tables (the initiation mode is explicit, because we explicitly defined a transaction).  So to be totally clear, for queries that only reference memory-optimized tables, we can use the READ COMMITTED or READ COMMITTED SNAPSHOT isolation levels, but the transaction initiation mode must be autocommit. Keep this in mind, because in a moment, you’ll be questioning that statement….

There are some interesting implications that Ned teases out, so I recommend giving it a careful read.

Comments closed

A dplyr Quiz

John Mount wants to know how well you understand dplyr:

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible?

dplyr makes sense to those of us who use it a lot. And we can teach part time R users a lot of the common good use patterns.

But, is it an easy task to study and characterize dplyr itself?

Take John’s quiz and find out.  He wasn’t kidding about it being an advanced quiz.

Comments closed

Basics Of Dplyr

Dave Mason is dipping his toes into the R waters:

I think my first exposure to R was at PASS Summit 2016. Since then, I’ve made an effort to attend R sessions at SQL Saturdays. The one commonality I seem to find in all of them is a demo with (or mention of) the dplyr package. It’s a package of functions that manipulate data in data frame objects (think of them as SQL Server/relational tables…or if you’re a .NET developer, a System.Data.DataTable object). R feels inexorably tied to dplyr at this early stage for me. R is probably way more vast than I realize, but what would it be without dplyr? Would it still be as popular? Would it still be as powerful?

What’s It Good For

I’m not sure if I’m perceiving this the right way yet, but dplyr sure feels a lot like LINQ, a .NET Framework technology that provides query-like capability for C#. For instance, you can select a subset of objects from an array, sort them, find a minimum or maximum, etc. It’s kind of like querying SQL Server, just without SQL Server.

I like the comparison of dplyr against LINQ, as they’re both data querying and transformation tools whose motif is a series of functions chained together.

Comments closed

Powershell: Text Search In A Table Value

Shane O’Neill clarifies a misunderstanding in Powershell:

and you are running the following PowerShell command to check if the results contain a value…

1
2
3
$String = "abc"
$Array = @(Invoke-Sqlcmd -ServerInstance "SQLServer" -Database "Database" -Query "SELECT code FROM dbo.users")
$Array.Contains($string)

It will return FALSE.

Now we know that the FALSE is false because we know that the string is in there!
This code is proven to work with arrays as stated here by the “Hey, Scripting Guy!”s so this was getting filed under “WTF PowerShell”

It’s a good post, so check it out and remember that DataRows aren’t strings.

Comments closed

Getting Started With SQL Server For Free

Jeff Mlakar points out some free resources for getting started with SQL Server:

The basics of the “Learn SQL Server Starter Pack”:

  1. SQL Server 2016 DE
    1. You can get the Developer Edition (DE) for…wait for it…FREE!
    2. Out in the wild you’ll see mostly Standard Edition (SE) or Enterprise Edition (EE). The great thing about DE is that it is identical to EE (it has all the features) in every aspect except that it cannot be licensed on a production machine. It must only be used for TEST or DEV environments. For home lab purposes you can use it as your development environment and have access to all the features to learn on!
    3. Download it here – SQL Server Downloads
    4. While you are here get used to reading the release notes and what is new in the version. You don’t need to understand everything in here right away but get used to the jargon and how Microsoft describes their features.
  2. Windows Server 2016 Evaluation Edition
    1. Download Windows Server 2016 Evaluation Edition
    2. You can evaluate the software for 180 days then will need to activate it. Then you can try to register for another eval and try again
  3. Virtual Box
    1. Virtual Box is a free, simple, and reliable virtualization tool. You’ll be able to do a lot to get started and build up your virtualization knowledge with this.
    2. Download the latest version of Virtual Box
    3. You don’t need to know very much about hypervisors and such – Virtual Box is very easy to learn with good documentation.

Evaluation versions are good for learning because they force you to tear down and rebuild your environment!

Jeff then links to a number of free resources to help out with the learning experience.

Comments closed

A Simple Example With Spark And Kafka

Gary Dusbabek has a nice example showing how to build a simple application with Spark and Kafka:

This is a hands-on tutorial that can be followed along by anyone with programming experience. If your programming skills are rusty, or you are technically minded but new to programming, we have done our best to make this tutorial approachable. Still, there are a few prerequisites in terms of knowledge and tools.

The following tools will be used:

  • Git—to manage and clone source code

  • Docker—to run some services in containers

  • Java 8 (Oracle JDK)—programming language and a runtime (execution) environment used by Maven and Scala

  • Maven 3—to compile the code we write

  • Some kind of code editor or IDE—we used the community edition of IntelliJ while creating this tutorial

  • Scala—programming language that uses the Java runtime. All examples are written using Scala 2.12. Note: You do not need to download Scala.

The Hello World of streaming apps is a Twitter client.

Comments closed

Talking To Secure Hadoop Clusters

Mubashir Kazia shows how to connect to a secured Hadoop cluster using Active Directory:

The primary form of strong authentication used on a secure cluster is Kerberos. Kerberos supports credentials delegation where a server process to which a user has authenticated, can perform actions on behalf of the user. This involves the server process accessing databases or other web services as the authenticated user. Historically the form of delegation that was supported by Kerberos is now called “full delegation”. In this type of delegation, the Ticket Granting Ticket (TGT) of the user is made available to the server process and server can then authenticate to any service where the user has been granted authorization. Until recently most Kerberos Key Distribution Center(KDC)s other than Active Directory supported only this form of delegation. Also Java until Java 7 supported only this form of delegation. Starting with Java 8, Java now supports Kerberos constrained delegation (S4U2Proxy), where if the KDC supports it, it is possible to specify which particular services the server process can be delegated access to.

Hadoop within its security framework has implemented impersonation or proxy support that is independent of Kerberos delegation. With Hadoop impersonation support you can assign certain accounts proxy privileges where the proxy accounts can access Hadoop resources or run jobs on behalf of other users. We can restrict proxy privileges granted to a proxy account to act on behalf of only certain users who are members of certain groups and/or only for connections originating from certain hosts. However we can’t restrict the proxy privileges to only certain services within the cluster.

What we are discussing in this article is how to setup Kerberos constrained delegation and access a secure cluster. The example here involves Apache Tomcat, however you can easily extend this to other Java Application Servers.

This is a good article showing specific details on using Kerberos in applications connecting to Hadoop.

Comments closed

Power Query: Joining On Date Ranges

Reza Rad shows how to build merge joins between date ranges in Power Query:

Customer’s table has the history details of changes through the time. For example, the customer ID 2, has a track of change. John was living in Sydney for a period of time, then moved to Melbourne after that.

The problem we are trying to solve is to join these two tables based on their customer ID, and find out the City related to that for that specific period of time. We have to check the Date field from Sales Table to fit into FromDate and ToDate of the Customer table.

This is a common type 2 SCD scenario.  I’d be concerned that this solution would not work with large data sets which may already be pushing the size limits of the Vertipaq engine.

Comments closed