Author: Kevin Feasel

Scraping The PASS Budget

Published 2018-01-02 by Kevin Feasel

Steph Locke shows us how to scrape a PDF, specifically, the PASS operating budget:

With tabulizer, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!

We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.
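As a rough illustration of the kind of wrangling involved once you only have raw text, here is a minimal Python sketch (not Steph's R code). The sample text, the column layout, and the `parse_rows` helper are all assumptions made up for this example; they are not figures or structure taken from the PASS budget.

```python
import re

# Hypothetical text as it might come out of a PDF text extractor.
# These rows are invented for illustration only.
SAMPLE_TEXT = """\
Category          FY2017    FY2018
Travel             1,200     1,500
Marketing         10,000     9,250
Total             11,200    10,750
"""

# A label, then two or more spaces, then two numeric columns.
ROW_PATTERN = re.compile(r"^(?P<label>.+?)\s{2,}(?P<a>[\d,]+)\s+(?P<b>[\d,]+)$")


def parse_rows(text):
    """Turn fixed-width text lines into (label, value, value) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match is None:
            continue  # skip headers, blank lines, and anything non-tabular
        rows.append((
            match.group("label"),
            int(match.group("a").replace(",", "")),
            int(match.group("b").replace(",", "")),
        ))
    return rows


rows = parse_rows(SAMPLE_TEXT)
```

A real multi-page document would need more rules than this (wrapped labels, page headers, tables continuing across pages), which is presumably why the original post describes it as a lot of wrangling.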

Steph shows her work, so click through to see the scripts.


More On Machine Learning Services

Published 2018-01-02 by Kevin Feasel

Ginger Grant continues her Machine Learning Services series with a couple more posts. First up is a post on memory allocation:

Enabling Machine Learning Services on SQL Server, which I discussed in a previous blog post, requires you to enable external scripts. Machine Learning Services are run as external processes to SQLPAL. This means that when you are running Python or R code you are running it outside of the managed processes of SQL Server and SQLPAL. This design means that the resources used to run Machine Learning Services will run outside of the resources allocated for SQL Server. If you are planning on using Machine Learning Services you will want to review the server memory options, such as max server memory, which you may have set for SQL Server. For example, if your server has 16 GB of RAM, and you have allocated 8 GB to SQL Server and you estimate that the operating system will use an additional 4 GB, that means that Machine Learning Services will have 4 GB remaining which it can use.

By design, Machine Learning Services will not starve out all of the memory for SQL Server because it doesn’t use it. This means DBAs do not have to worry about SQL Server processes not running because some R program is using all the memory, as it does not use the memory SQL Server has allocated. You do have to worry about the amount of memory allocated to Machine Learning Services as by default, using our previous example where there was 4 GB which Machine Learning Services can use, it will only use 20% of the available memory, or about 819 MB of memory. That is not a lot of memory. Most likely if you are doing a lot of Machine Learning Services work you will want to use more memory, which means you will want to change the default memory allocation for external services.
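The arithmetic in that example is easy to check. Here is a small Python sketch of it; the function name and parameters are made up for illustration, and the 20% default is the figure quoted above.

```python
def external_pool_memory_mb(total_gb, sql_server_gb, os_gb, max_memory_percent=20):
    """Estimate memory available to external scripts, in MB.

    Assumes the simple model from the quoted example: whatever is left after
    SQL Server's allocation and the operating system's estimated share is
    "available", and external scripts may use a percentage of it.
    """
    remaining_mb = (total_gb - sql_server_gb - os_gb) * 1024
    return remaining_mb * max_memory_percent / 100


# The example from the post: 16 GB total, 8 GB for SQL Server, 4 GB for the OS.
default_mb = external_pool_memory_mb(16, 8, 4)

# The same server if the external allocation were raised to 50%.
raised_mb = external_pool_memory_mb(16, 8, 4, max_memory_percent=50)
```

With the default percentage this works out to roughly 819 MB of the 4 GB that is left over, and 2048 MB at 50%.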

Ginger also talks about the Launchpad service:

When calling external processes, internally SQL Server uses User IDs to call the Launchpad service, which is installed as part of Machine Learning Services and must be running for SQL Server to be able to execute code written in R or Python. The number of users is set by default. To change the number of users, open up SQL Server Configuration Manager by typing SQLServerManager14.msc at the run prompt. For some unknowable reason Microsoft decided to hide this application, which was previously available by looking at the installed programs on the server. Now for some reason they think everyone should memorize this obscure command. Once you have the SQL Server Configuration Manager open, right click on the SQL Server Launchpad service and select the properties, which will show the window, as shown below. You will notice I am running an instance called SQLServer2017, which is listed in parentheses in the window name.

Both are worth reading.


In-Memory OLTP: When You’re Out Of Space

Published 2018-01-02 by Kevin Feasel

Ned Otter shows us what happens when you run out of disk space and you’re using memory-optimized objects:

In my lab, I’m running Windows Server 2012. Let’s use PowerShell to install the File System Resource Manager, which will allow us to create a quota for the relevant folder:

Add-WindowsFeature -Name FS-Resource-Manager -IncludeManagementTools

After installing the Windows feature we can set the quota for the folder, but we shouldn’t enable it just yet, because first we have to verify the current size of the folder.

On my server, I created a quota of 1.5GB, and then enabled it.

Now let’s INSERT rows into the table, in batches of 1000, until we reach the limit (the INSERT script is listed in Part 2; I’m trying to keep this post from getting too long).

Click through to see what happens. It’s not exactly a swath of carnage, but it’s also something you really don’t want to happen.


Simulating Network Latency

Published 2018-01-02 by Kevin Feasel

John Paul Cook shows how to use WANem to simulate network latency in a Hyper-V environment:

Access WANem from either SQL Server virtual machine using a case sensitive URL that includes WANem’s IP address. In this example, the URL is http://99.99.99.99/WANem. Inside the SQL Server virtual machines, I set the browser’s start page to the WANem home page. Create a delay of 1000 msec and retest SQL Server to SQL Server connectivity.

It looks like a good way of proving out whether your setup can handle extreme latency before you build it for real.


Avoid Impersonation And The Trustworthy Flag

Published 2018-01-02 by Kevin Feasel

Solomon Rutzky explains how you can use module signing to avoid the security risks which come with impersonation and setting Trustworthy on:

Admittedly, using Cross-Database Ownership Chaining and/or Impersonation and/or TRUSTWORTHY are quicker and easier to implement than Module Signing. However, the relative simplicity in understanding and implementing these options comes at a cost: the security of your system.

Cross-DB Ownership Chaining:
    security risk (can spoof User / DB-level)
        db_ddladmin & db_owner users can create objects for other owners
        Users with CREATE DATABASE permission can create new databases and attach existing databases

Impersonation:
    If IMPERSONATE permission is required:
        can be used any time
        No granular control over permissions
    Cross-DB operations need TRUSTWORTHY ON
    Need to use ORIGINAL_LOGIN() for Auditing
    Elevated permissions last until process / sub-process ends or REVERT

TRUSTWORTHY:
    Bigger security risk
        can also spoof Logins, such as “sa”!
    If using SQLCLR Assemblies, no per-Assembly control of ability to be marked as either EXTERNAL_ACCESS or UNSAFE; all Assemblies are eligible to be marked as either of those elevated permission sets.

The common theme across all three areas is no control, within a Database, over who or what can make use of the feature / option, or when it can be used.

Read the whole thing.


DBA Salary Gaps

Published 2018-01-02 by Kevin Feasel

Eugene Meidinger has a great post looking at DBA salaries for women versus men:

Goofy outliers are an issue, but the larger the dataset the smaller the issue. If Bill Gates walks into a bar, the average wealth in the bar goes up by a billion. If he walks into a football stadium, everyone gets a million dollar raise.

One way of looking at the issue is to compare the median to the mean. The median is the salary smack dab in the middle, whereas mean is what we normally think of when we think of average.

The median doesn’t care where Bill Gates is, but the mean is sensitive to outliers. If we compare the two, that should give us an idea if we have too much skew in either direction.
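To see the effect in numbers, here is a quick Python sketch using the standard library. The salaries and the single enormous outlier are invented for illustration; they are not from Eugene's dataset.

```python
from statistics import mean, median

# Hypothetical salaries for five people in the bar.
bar = [50_000, 60_000, 70_000, 80_000, 90_000]

# The same bar after one extremely wealthy person walks in.
with_outlier = bar + [100_000_000_000]

before = (mean(bar), median(bar))
after = (mean(with_outlier), median(with_outlier))

# A crude skew check: how far the mean has drifted from the median.
ratio_before = before[0] / before[1]
ratio_after = after[0] / after[1]
```

Before the outlier arrives the mean and median agree at 70,000. Afterward the median only moves to 75,000 while the mean jumps into the billions, so a mean far above the median is a hint that a few large values are pulling the average up.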

If you’re not well-versed in descriptive statistics, Eugene has a good, methodical process and explains each step well.


Outlier Detection With dplyr And ruler

Published 2017-12-29 by Kevin Feasel

Evgeni Chasnovski shows how to use a couple of R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have an essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. The majority of them deal with numerical data. This post will describe the most basic ones with their application using the dplyr and ruler packages.

After reading this post you will know:

Most basic outlier detection techniques.
A way to implement them using dplyr and ruler.
A way to combine their results in order to obtain a new outlier detection method.
A way to discover the notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of the previous point).
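For readers who want a feel for what "basic techniques" and "combining their results" can look like, here is a minimal Python sketch using only the standard library. It is not the post's dplyr and ruler code: the two rules shown (z-score and Tukey's fences), the thresholds, the sample data, and the choice to combine by requiring both rules to agree are all assumptions for illustration.

```python
from statistics import mean, quantiles, stdev


def z_score_outliers(values, threshold=3.0):
    """Indexes of values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return set()  # all values identical, nothing can be an outlier
    return {i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold}


def tukey_outliers(values, k=1.5):
    """Indexes of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < low or v > high}


def combined_outliers(values):
    """Flag a value only when both rules agree (one possible way to combine)."""
    return z_score_outliers(values) & tukey_outliers(values)


# Hypothetical measurements: nineteen ordinary values and one suspicious one.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
        12, 10, 11, 12, 13, 11, 12, 10, 11, 95]

flagged = combined_outliers(data)
```

Requiring agreement makes the combined rule more conservative than either rule alone; taking the union instead would make it more aggressive. One caveat for the z-score rule: with very small samples a single point can never reach a z-score of 3, so the threshold needs to fit the sample size.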

Read the whole thing. H/T R-Bloggers


rquery: Relational Algebra In R

Published 2017-12-29 by Kevin Feasel

John Mount announces rquery:

rquery is Win-Vector LLC’s currently in development big data query tool for R.

rquery supplies a set of operators inspired by Edgar F. Codd’s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).

As an example: rquery operators allow us to write our earlier “treatment and control” example as follows.

dQ <- d %.>%
  extend_se(.,
            if_else_block(
              testexpr =
                "rand()>=0.5",
              thenexprs = qae(
                a_1 := 'treatment',
                a_2 := 'control'),
              elseexprs = qae(
                a_1 := 'control',
                a_2 := 'treatment'))) %.>%
  select_columns(., c("rowNum", "a_1", "a_2"))
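
For anyone who does not read R, the logic of that pipeline can be sketched in plain Python. This is only an analogy to show what the if/else block computes per row; it is not rquery, it does not generate SQL, and the `assign_treatment` helper and sample rows are hypothetical.

```python
import random


def assign_treatment(rows, rng=None):
    """Randomly label each row, keeping a_1 and a_2 complementary.

    Mirrors the shape of the rquery example: one random draw per row decides
    both columns, then only rowNum, a_1 and a_2 are kept.
    """
    if rng is None:
        rng = random.Random()
    result = []
    for row in rows:
        if rng.random() >= 0.5:
            a_1, a_2 = "treatment", "control"
        else:
            a_1, a_2 = "control", "treatment"
        result.append({"rowNum": row["rowNum"], "a_1": a_1, "a_2": a_2})
    return result


# Hypothetical input rows; the extra column is dropped by the projection.
d = [{"rowNum": i, "other": i * 2} for i in range(1, 11)]
dQ = assign_treatment(d, rng=random.Random(2017))
```

The important detail is that both output columns depend on the same random draw, so each row always gets exactly one "treatment" and one "control".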

It’s an interesting idea.


Getting Started With dplyr

Published 2017-12-29 by Kevin Feasel

Abdul Majed Raja has a dplyr tutorial:

dplyr is one of the most popular R packages and is part of the tidyverse developed by Hadley Wickham. The mere fact that the dplyr package is so well known means it is one of the most frequently used. Being a data scientist is not always about creating sophisticated models; data analysis (manipulation) and data visualization play a very important role in the BAU of many of us. In fact, they are a very important step before any modeling exercise, since feature engineering and EDA are the most important differentiating factors between your model and someone else’s.
Hence, this post aims to bring out some well-known and not-so-well-known applications of dplyr so that any data analyst can leverage its potential using a very familiar dataset: the Titanic dataset.

This covers the main pieces of dplyr, including its pipeline. dplyr is a key part of the tidyverse, and knowing it well makes R so much easier. H/T R-Bloggers


Organizing SQL Queries

Published 2017-12-29 by Kevin Feasel

Eleni Markou shows a few techniques available to organize SQL queries, especially for analytics:

Jupyter Notebook

For the advocates of Python, a commonly used application is Jupyter Notebook. Jupyter Notebook is a server-client application that allows editing and running of Python code via a web browser, combining Python code, SQL, equations, text, and visualizations. It also offers syncing with GitHub repositories.

More specifically, Jupyter Notebook will be rendered by GitHub directly on your repo page. This means that one can enjoy all the benefits that Git offers regarding version control, branching, merging and collaborative development when using Jupyter Notebook.

The best strategy is probably a multi-tiered strategy. It absolutely starts with source control, but it doesn’t have to end there.
