Author: Kevin Feasel

PowerShell Remoting To Linux

Max Trinidad shows that PowerShell remoting works for connecting Linux to Windows and vice versa:

Connecting Linux to Windows

This is strange but, most likely, you will get an error the first time you try to connect. Just try a second time and it will work.

I need to see what PowerShell objects for Linux currently exist; my guess is “not many, if any,” but as those start getting fleshed out, I think even the most adamant of grep-sed-awk users will want to pick up at least a little bit of PowerShell.

Elastic Database Pools

Arun Sirpal describes Azure elastic database pools:

The key to using elastic database pools is that you must understand the characteristics of the databases involved and their utilisation patterns; if you do not understand this, then the idea of using an elastic database pool may cause problems.

My pool has a maximum of 100 eDTUs. I know for a fact that the S2 databases will not be used at the same time, and that at most 3 of the S0 databases might be used at the same time. Basically, what I am saying here is that I know that when the databases peak concurrently, they will not go beyond the 100 eDTU limit.
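To put rough numbers on that reasoning: assuming the standard standalone ratings (an S2 is rated at 50 DTUs and an S0 at 10 DTUs), a worst-case concurrent peak of one S2 plus three S0s needs about 50 + 3 × 10 = 80 eDTUs, comfortably under the 100 eDTU cap.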

One thing that Arun does not mention is the relative ease of interconnecting databases within a pool, so even if it doesn’t end up being cheaper on net, that might be a benefit worth having.

Custom R Visuals In Power BI

Ginger Grant notes that there are R-powered custom visuals for Power BI:

Interacting with R visuals works differently than with other report visualizations, as you cannot click on elements within the visualization and filter other items on the page. Other visuals on the page will filter the data contained within the R visual. For example, let’s say my report contains a total field, a slicer which contains years, and a correlation plot which contains products. If the slicer is changed to select a year, the total field and the data within the R visual will change to reflect that. If, on the other hand, I choose to click on the R visual to select one of the product categories, the total field will not change, and the R visual’s appearance will not change in any way.

Read on for more.

Data Wrangling: R Versus M

Ryan Wade argues that R is a better language choice than M for working with data in Power BI:

Now let’s do something that I think is pretty slick. Let’s create a data set that combines the home games of the Pacers (IND) and the home games of the Hawks (ATL). Given the naming convention used by the files, we will have to identify the files in our working directory that start with eight numeric digits, then a period, then a 3-character team abbreviation for the away team, then either “ATL” or “IND”, then finally “.csv”. We can create a regular expression to find the files that match that pattern. I did so in the code below:
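Ryan’s code isn’t reproduced in the excerpt, but a minimal sketch of that file-matching step in R might look like the following (the file names and working directory are hypothetical; the regular expression simply encodes the naming convention described above):

# Hypothetical sketch: find files where the home team is ATL or IND,
# e.g. "20161125.CHIIND.csv" (date, away team, home team)
home_games <- list.files(
  path = ".",                                       # the working directory
  pattern = "^[0-9]{8}\\.[A-Z]{3}(ATL|IND)\\.csv$"  # 8 digits, period, away, home
)

# Stack the matching files into a single data set
combined <- do.call(rbind, lapply(home_games, read.csv))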

I’m interested in catching the rest of the series.  This is a controversial statement that I’m not entirely sold on yet, but Ryan does set the stage for his full argument.

Tabulizer

Troy Walters uses the tabulizer package to extract tables from a PDF and turn them into R matrices or data frames:

Next we will use the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November 25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.
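As a rough sketch of that call (the URL is a placeholder, and the guess and method arguments are spelled out even though they are the defaults described above, as of the package version discussed in the post):

library(tabulizer)

# Placeholder URL pointing at a text-based PDF that contains tables
pdf_url <- "https://example.com/report.pdf"

# guess = TRUE lets tabulizer locate the tables on its own;
# method = "matrix" returns a list of matrices
tables <- extract_tables(pdf_url, guess = TRUE, method = "matrix")

# Coerce the first extracted table to a data frame if that is more convenient
df <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)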

This is nice.  I have to imagine it only works for text-based PDFs and not ones which are generated from a series of images.

Solving The German Tank Problem

Frank Portman shows how to figure out how many taxicabs—or tanks—there are:

For the uninitiated, the Taxicab / German Tank problem is as follows:

Viewing a city from the train, you see a taxi numbered x. Assuming taxicabs are consecutively numbered, how many taxicabs are in the city?

This was also applied to counting German tanks in World War II to know when/if to attack. Statistical methods ended up being accurate within a few tanks (on a scale of 200-300) while “intelligence” (unintelligence) operations overestimated numbers about 6-7x. Read the full details on Wikipedia here (and donate while you’re over there).
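For reference, the textbook frequentist answer is the minimum-variance unbiased estimator N = m(1 + 1/k) - 1, where m is the largest serial number observed and k is the sample size. A quick sketch in R with made-up serial numbers (Frank’s post walks through the full derivation and may take a different route):

serials <- c(47, 112, 189, 204)  # hypothetical observed serial numbers

k <- length(serials)  # number of sightings
m <- max(serials)     # largest serial number seen

# Minimum-variance unbiased estimate of the total count
n_hat <- m * (1 + 1 / k) - 1
n_hat  # 254 for this sample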

Click through for the solution and how to implement it in R.

Looking For Wait Types

Ewald Cress uses the debugger to search for particular waits:

In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the patterns 000101d4, 000201d4, 000401d4 and so forth, up to 800001d4. Ignore the question of which bit is going to be set when; with only sixteen permutations, it’s quick enough to try them all.

Let’s focus on sqllang as the likely source – the below would apply to any other module too.
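If you would rather generate the sixteen candidate patterns than type them out, here is a quick sketch in R (purely illustrative; the 01d4 index is specific to Ewald’s 2014 RTM instance):

# High-order word has exactly one bit set (0x0001 through 0x8000);
# the low-order word is the wait-type index itself.
high_words <- bitwShiftL(1L, 0:15)
patterns <- sprintf("%04x01d4", high_words)
patterns
# "000101d4" "000201d4" "000401d4" ... "800001d4"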

This post reminds me that my debugger skills aren’t very good.

Polybase MapReduce Containers

I have a post looking at how Polybase generates MapReduce containers:

Once we did that and I restarted all of the services, I ended up getting an interesting error message from SQL Server:

Msg 7320, Level 16, State 110, Line 2
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to JobSubmitter_SubmitJob: Error [org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=512

The error message is pretty clear:  the Polybase service wants to create containers that are 1536 MB in size, but the maximum size I’m allowing is 512 MB.  Therefore, the Polybase MapReduce operation fails.

Long story short, I needed enough RAM to be able to give 4 1/2 GB to YARN for creating MapReduce containers in order to run my query.
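If you hit the same wall, the relevant knob is the YARN scheduler’s maximum container allocation. A sketch of the kind of change involved, assuming a standard yarn-site.xml (the 4608 MB value reflects the roughly 4 1/2 GB figure above; tune it to your hardware):

<!-- yarn-site.xml: raise the per-container ceiling so YARN can grant
     the 1536 MB containers the Polybase MapReduce job requests -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4608</value>
</property>
<!-- total memory the node manager may hand out to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4608</value>
</property>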

AWS Data Lake

Nick Corbett announces that Amazon is rolling out their own data lake solution:

Separating storage from processing can also help to reduce the cost of your data lake. Until you choose to analyze your data, you need to pay only for S3 storage. This model also makes it easier to attribute costs to individual projects. With the correct tagging policy in place, you can allocate the costs to each of your analytical projects based on the infrastructure that they consume. In turn, this makes it easy to work out which projects provide most value to your organization.

The data lake stores metadata in both DynamoDB and Amazon ES. DynamoDB is used as the system of record. Each change of metadata that you make is saved, so you have a complete audit trail of how your package has changed over time. You can see this on the data lake console by choosing History in the package view:

Having a competitor in the data lake space is a good thing for us, though based on this intro post, it seems that Amazon and Microsoft are taking different approaches: Microsoft wants you to stay in the data lake (e.g., writing U-SQL or Python statements to query it), whereas Amazon wants you to shop the data lake and check out the specific S3 buckets and files for your own processing.

Python Support In Azure Data Lake

Saveen Reddy announces that Python is now a first-tier language in the Azure Data Lake:

This week, we are announcing even more support for Python. As of today, Python is a first-class language supported by our management SDKs. This enables you to develop applications or automate the Data Lake services. Check out our Getting Started articles, which now include many Python samples.

Saveen has a Jupyter notebook which demonstrates Python in Azure Data Lake Store.
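For a flavor of what the Python story looks like, here is a minimal sketch using the azure-datalake-store filesystem package (every identifier below is a placeholder, and this is the filesystem SDK rather than the management SDKs the announcement focuses on):

from azure.datalake.store import core, lib

# Service principal credentials -- all placeholders
token = lib.auth(tenant_id='<tenant-guid>',
                 client_id='<application-id>',
                 client_secret='<application-secret>')

# Connect to a Data Lake Store account and list its root directory
adls = core.AzureDLFileSystem(token, store_name='<your-adls-account>')
print(adls.ls('/'))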
