Tabulizer

Kevin Feasel

2016-12-02

R

Troy Walters uses the Tabulizer package to extract tables from a PDF and turn them into R matrices or data frames:

Next we will use the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November 25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.

This is nice.  I have to imagine it only works for text-based PDFs and not ones which are generated from a series of images.

Solving The German Tank Problem

Kevin Feasel

2016-12-02

R

Frank Portman shows how to figure out how many taxicabs—or tanks—there are:

For the uninitiated, the Taxicab / German Tank problem is as follows:

Viewing a city from the train, you see a taxi numbered x. Assuming taxicabs are consecutively numbered, how many taxicabs are in the city?

This was also applied to counting German tanks in World War II to know when/if to attack. Statistical methods ended up being accurate within a few tanks (on a scale of 200-300) while “intelligence” (unintelligence) operations overestimated numbers about 6-7x. Read the full details on Wikipedia here (and donate while you’re over there).

Click through for the solution and how to implement it in R.
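
As a rough illustration of the idea (in Python here rather than the R of the linked post, and not necessarily the approach Frank takes), the standard frequentist estimate is m + m/k - 1, where m is the largest serial number seen and k is the number of serials observed. The function name estimate_total is made up for this sketch, and it assumes serials start at 1 and are sampled without replacement.

```python
def estimate_total(serials):
    """Estimate the population size from observed serial numbers.

    Uses the minimum-variance unbiased estimator m + m/k - 1,
    assuming serials run 1..N and are sampled without replacement.
    """
    k = len(serials)
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)
    return m + m / k - 1


# One taxi numbered 60 seen from the train: the estimate is 2 * 60 - 1.
print(estimate_total([60]))               # 119.0

# Four observations with the same maximum tighten the estimate.
print(estimate_total([19, 40, 42, 60]))   # 74.0
```

With a single sighting the estimate is simply twice the observed number minus one; each extra observation shrinks the gap added on top of the maximum.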

Looking For Wait Types

Ewald Cress uses the debugger to search for particular waits:

In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the pattern 000101d4, 000201d4, 000401d4 and so forth up to 800001d4. Ignore the meaning of when which bit is going to be set; with only sixteen permutations, it’s quick enough to try them all.

Let’s focus on sqllang as the likely source – the below would apply to any other module too.

This post reminds me that my debugger skills aren’t very good.
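
The sixteen search patterns Ewald describes can be generated mechanically. This is a small Python sketch of just that step, not of the debugger session itself; the helper name wait_type_patterns is hypothetical, and the only input taken from the post is the index 01d4.

```python
def wait_type_patterns(index):
    """Return the sixteen doubleword patterns for a wait type index.

    Each pattern keeps the index in the low-order word and sets
    exactly one bit in the high-order word.
    """
    if not 0 <= index <= 0xFFFF:
        raise ValueError("index must fit in a 16-bit word")
    return [f"{(1 << (16 + bit)) | index:08x}" for bit in range(16)]


patterns = wait_type_patterns(0x01D4)
print(patterns[0], patterns[1], patterns[2], patterns[-1])
# 000101d4 000201d4 000401d4 800001d4
```

Each string could then be fed to whatever memory-search command the debugger offers.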

Polybase MapReduce Containers

I have a post looking at how Polybase generates MapReduce containers:

Once we did that and I restarted all of the services, I ended up getting an interesting error message from SQL Server:

Msg 7320, Level 16, State 110, Line 2
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to JobSubmitter_SubmitJob: Error [org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=512

The error message is pretty clear:  the Polybase service wants to create containers that are 1536 MB in size, but the maximum size I’m allowing is 512 MB.  Therefore, the Polybase MapReduce operation fails.

Long story short, I needed enough RAM to be able to give 4 1/2 GB to YARN for creating MapReduce containers in order to run my query.
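
The condition in that error message is easy to restate in code. The Python sketch below only mirrors the rule quoted in the message (requested memory below zero or above the configured maximum is rejected); it is an illustration, not YARN's implementation, and the function name is invented. The container count at the end is plain arithmetic on the numbers in this post.

```python
def validate_memory_request(requested_mb, max_mb):
    """Reject a container request outside the allowed memory range."""
    if requested_mb < 0 or requested_mb > max_mb:
        raise ValueError(
            f"Invalid resource request, requestedMemory={requested_mb}, "
            f"maxMemory={max_mb}"
        )
    return requested_mb


# The failing case from the error message: 1536 MB requested, 512 MB allowed.
try:
    validate_memory_request(1536, 512)
except ValueError as err:
    print(err)

# With 4.5 GB (4608 MB) available, 1536 MB containers fit three times.
yarn_budget_mb = 4608
print(yarn_budget_mb // validate_memory_request(1536, yarn_budget_mb))  # 3
```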

AWS Data Lake

Nick Corbett announces that Amazon is rolling out their own data lake solution:

Separating storage from processing can also help to reduce the cost of your data lake. Until you choose to analyze your data, you need to pay only for S3 storage. This model also makes it easier to attribute costs to individual projects. With the correct tagging policy in place, you can allocate the costs to each of your analytical projects based on the infrastructure that they consume. In turn, this makes it easy to work out which projects provide most value to your organization.

The data lake stores metadata in both DynamoDB and Amazon ES. DynamoDB is used as the system of record. Each change of metadata that you make is saved, so you have a complete audit trail of how your package has changed over time. You can see this on the data lake console by choosing History in the package view:

Having a competitor in the data lake space is a good thing for us. Based on this intro post, though, it seems that Amazon and Microsoft are taking different approaches to the data lake: Microsoft wants you to stay in the data lake (e.g., writing U-SQL or Python statements to query it), whereas Amazon wants you to shop the data lake and check out the specific S3 buckets and files for your own processing.

Python Support In Azure Data Lake

Saveen Reddy announces that Python is now a first-class language in the Azure Data Lake:

This week, we are now announcing even more support for Python. As of today Python is now a first-class language supported by our management SDKs. This enables you to develop applications or automate the Data Lake services. Check out our Getting Started articles that now include many Python samples.

Saveen has a Jupyter notebook which demonstrates Python in Azure Data Lake Store.

Canary Tests

Kendra Little gives her thoughts on how to identify a good DBA team:

What I learned was not to judge a team by their SQL Server. Some configurations may look problematic, but make a lot more sense when I talk to the team and dig into problems they’re facing.

For instance, there’ve been many times when a team was facing a performance issue, and at first glance their SQL Server looked stupidly underprovisioned in terms of memory. Upon digging into the problem I found that adding more memory wouldn’t solve their particular problem. One size doesn’t usually fit all.

Read on for hints and thoughts.

Azure Functions

Steph Locke has taken a shine to Azure Functions:

Azure Functions take care of all the hosting, all the retry logic, all the parallelisation, all the authentication gubbins, all the monitoring for you. The only bits of code you really have to write is the important stuff – the code that implements the business process. This makes a coding project go from >500 lines to <50, and it should be better quality too! This is super handy for data integration, and I would recommend it over and above Data Factory, unless you need to do some Hadoop stuff and maybe not even then.

The wag in me says that with F#, you could take it from 50 lines to 10…  Read the whole thing.

Row-Level Security With Power BI

Callum Green shows how to use row-level security with Power BI Desktop:

In the June 2016 monthly Power BI release, Row Level Security (RLS) was introduced into Power BI desktop. This is great news for people using the application, especially as the configuration is stored within the Power BI model.  Previously, you had to create the security in the web environment, which could easily be overwritten when publishing multiple times from a desktop workbook.

In this blog, I will show you how to set up RLS in Power BI desktop and how to test it works. My example uses the AdventureWorksDW2014 database (download here), specifically applying permissions for a manager. Each manager will only be able to see data for the Sales Representatives that report to them.

This is different from the SQL Server 2016 feature of the same name, but the concept is the same.

SQL Server For Linux Tools

Sanjay Nagamangalam looks at different tools you can use to connect to SQL Server:

  • New SQL command line tools for Linux: We’ve created Linux-native versions of your favorite SQL command line tools such as sqlcmd, bcp and sqlpackage, and also added the new mssql-conf tool that lets you configure various properties for the SQL Server instance on Linux (e.g., SA password, TCP port and collation).

  • New versions of SSMS, SSDT and SQL PowerShell: We have released updated versions (v17.0 RC1) of our flagship SQL Server tools including SQL Server Management Studio (SSMS), Visual Studio SQL Server Data Tools (SSDT) and SQL PowerShell with support for the SQL Server v.Next on Windows and Linux.

They also have a plugin for Visual Studio Code, which can be helpful if you’re running on Linux.
