Zippy Base R

Kevin Feasel

2018-01-16

R

John Mount defends the honor of base R:

The graph summarizes the performance of four solutions to the “scoring logistic regression by hand” problem:

  • Optimized Base R: a specialized “pre allocate and work with vectorized indices” method. This is fast as it is able to express our particular task in a small number of purely base R vectorized operations. We are hoping to build some teaching materials about this methodology.

  • Idiomatic Base R (shown dashed): an idiomatic R method using stats::aggregate() to solve the problem. This method is re-plotted in both graphs as a dashed line and works as a good division between what is fast versus what is slow.

  • data.table: a straightforward data.table solution (another possible demarcation between fast and slow).

  • dplyr (no grouped filter): a dplyr solution (tuned to work around some known issues).

Read the whole thing, including the comments section, where there’s a good bit of helpful back-and-forth.

Non-English Natural Language Processing

The folks at BNOSAC have announced a new natural language processing toolkit for R:

BNOSAC is happy to announce the release of the udpipe R package (https://bnosac.github.io/udpipe/en) which is a Natural Language Processing toolkit that provides language-agnostic ‘tokenization’, ‘parts of speech tagging’, ‘lemmatization’, ‘morphological feature tagging’ and ‘dependency parsing’ of raw text. Next to text parsing, the package also allows you to train annotation models based on data of ‘treebanks’ in ‘CoNLL-U’ format as provided at http://universaldependencies.org/format.html.

The package provides direct access to language models trained on more than 50 languages.

Click through to check it out.

SSMS Shortcuts

Wayne Sheffield continues his SSMS shortcuts series.  He starts off with a powerful way of selecting vertical columns of text.  Then he shows how to make text all lowercase or uppercase.

From there, he gets to one of my favorite features which I commonly forget exists:

We’re all used to using the clipboard in Windows programs. You copy something into it with Ctrl+C, and paste it into your document with Ctrl+V. However, did you know that the SSMS clipboard remembers the last 20 items that were put into the clipboard, and that you can cycle through all of these clipboard values? The keyboard shortcut Ctrl+Shift+V will paste the most recent item added to the clipboard. Using this shortcut repeatedly will cycle through the “Clipboard Ring”, pasting that item into the document. Now you don’t have to go back and copy items again!
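To make the cycling behavior concrete, here is a minimal Python sketch of a clipboard ring as the quote describes it. It is a toy model and not how SSMS implements the feature; the class name `ClipboardRing` and its methods are made up for illustration, and the capacity of 20 comes from the quote above.

```python
from collections import deque


class ClipboardRing:
    """Toy model of a clipboard ring (hypothetical, not SSMS internals)."""

    def __init__(self, capacity=20):
        # Newest item lives at index 0; a full deque drops its oldest item.
        self._items = deque(maxlen=capacity)
        self._cursor = 0

    def copy(self, text):
        """Model Ctrl+C: store the text and restart cycling at the newest item."""
        self._items.appendleft(text)
        self._cursor = 0

    def cycle_paste(self):
        """Model Ctrl+Shift+V: return the current item, then step to the next older one."""
        if not self._items:
            return None
        item = self._items[self._cursor]
        # Wrap around to the newest item after the oldest one has been returned.
        self._cursor = (self._cursor + 1) % len(self._items)
        return item


ring = ClipboardRing()
for snippet in ["SELECT 1;", "SELECT 2;", "SELECT 3;"]:
    ring.copy(snippet)

print(ring.cycle_paste())  # most recent copy
print(ring.cycle_paste())  # the one before it
```

A bounded deque keeps the sketch short: once the ring is full, each new copy silently pushes out the oldest entry.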

Next, he shows how you can drag and drop to get all columns into a query window quickly. Finally, Wayne shows you how to create shortcuts for important queries. In my case, various forms of sp_whoisactive dominate the list: Ctrl+F1 for my desired layout, Ctrl+3 for my queries (three for me), and Ctrl+4 for my desired layout plus execution plans (four for more).

SQL Server Internal Row Structures

David Fowler gets to the guts of a row as stored in SQL Server:

DBCC page will take in a database name or id, file id and page id and return a representation of the specified page depending on the print options that you choose.

We’ve got four different print options that we can choose:

0 – Return only the page header
1 – Return the page header and hex dump of each row
2 – Return the page header and full page hex dump
3 – Return the page header, hex dump of each row as well as the details on each column
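As a rough illustration of how those arguments fit together, the Python sketch below builds the T-SQL text for a DBCC PAGE call. The helper name `dbcc_page_command` and the example database, file, and page values are hypothetical. The sketch also assumes you want trace flag 3604 turned on so the output goes to the client session. DBCC PAGE is undocumented, so treat this as a sketch rather than a reference.

```python
# Print options as described in the quoted post.
PRINT_OPTIONS = {
    0: "page header only",
    1: "page header and hex dump of each row",
    2: "page header and full page hex dump",
    3: "page header, hex dump of each row, and details on each column",
}


def dbcc_page_command(database, file_id, page_id, print_option=3):
    """Build the T-SQL for a DBCC PAGE call (hypothetical helper)."""
    if print_option not in PRINT_OPTIONS:
        raise ValueError(f"print_option must be one of {sorted(PRINT_OPTIONS)}")
    # Escape single quotes so the database name is a valid T-SQL string literal.
    db_literal = database.replace("'", "''")
    return (
        "DBCC TRACEON(3604);\n"  # assumption: send DBCC output to the client
        f"DBCC PAGE('{db_literal}', {int(file_id)}, {int(page_id)}, {print_option});"
    )


# Hypothetical database name, file id, and page id.
print(dbcc_page_command("MyDatabase", 1, 200, print_option=3))
```

The function only produces text; running the result against a server is left to whatever client you normally use.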

Read the whole thing.

The Stage-And-Switch Technique For Deployments

Michael Swart amps up the complexity factor in his online deployment series:

There’s two things going on here (and one hidden thing):

  1. The first two messages point out that a procedure is referencing the column ColdRoomSensorNumber with schemabinding. The reason it’s using schemabinding is because it’s a natively compiled stored procedure. And that tells me that the table Warehouse.ColdRoomTemperatures is an In-Memory table. That’s not all. I noticed another wrinkle. The procedure takes a table-valued parameter whose table type contains a column called ColdRoomSensorLabel. We’re going to have to replace that too. Ugh. Part of me wanted to look for another example.

  2. The last message tells me that the table is a system versioned table. So there’s a corresponding archive table where history is maintained. That has to be dealt with too. Luckily Microsoft has a great article on Changing the Schema of a System-Versioned Temporal Table.

  3. One last thing to worry about is an index on ColdRoomSensorNumber. That should be replaced with an index on ColdRoomSensorLabel. SSDT didn’t warn me about that because apparently, it can deal with that pretty nicely.

I’m glad that Michael went with a more complex example. It’s easy to tell this story with a simple procedure versioning example, but a larger change lets you see the rhythm in the process: the same pattern of steps, over and over.

Expanding LVM Drives

David Klee shows how to expand an LVM drive on Linux:

Next in our SQL Server on Linux series is one important question. On Windows, if you’re about to run out of space, you get your VM admin / storage admin to expand one or more of your drives, and you go to Disk Management and expand the drive with no downtime. How do we accomplish this same task on Linux?

First, SSH into your VM. Get your appropriate system engineer to expand the drive that needs to be expanded. You won’t be able to see it at first in Linux because, just like in Windows, it’ll need to rescan the storage to ‘see’ the extra space. Sometimes Windows does it automatically, and sometimes you have to initiate it manually. In Linux it only does this on system startup.

Let’s grow our data drive from 250GB to 300GB first.

Click through to see how to do that.
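For orientation, a common sequence for this kind of online expansion is: rescan the disk, grow the LVM physical volume, then extend the logical volume and its filesystem. The Python sketch below only assembles those shell commands as strings. It is not necessarily the procedure in David's post, and it assumes the physical volume sits directly on the whole disk, with no partition table to grow first. The function name and the disk, volume group, and logical volume names are hypothetical.

```python
def lvm_grow_commands(disk, volume_group, logical_volume):
    """Return shell commands (as strings) for growing an LVM volume.

    Assumes the LVM physical volume is the whole disk, e.g. /dev/sdb,
    and that the commands will be run as root.
    """
    lv_path = f"/dev/{volume_group}/{logical_volume}"
    return [
        # Ask the kernel to re-read the size of the (SCSI) disk.
        f"echo 1 > /sys/class/block/{disk}/device/rescan",
        # Grow the physical volume to fill the enlarged disk.
        f"pvresize /dev/{disk}",
        # Give all free extents to the logical volume; -r also resizes the filesystem.
        f"lvextend -r -l +100%FREE {lv_path}",
    ]


# Hypothetical names; substitute your own after checking lsblk, pvs, and lvs.
for command in lvm_grow_commands("sdb", "vg_data", "lv_data"):
    print(command)
```

Printing the commands instead of running them keeps the sketch safe to try, and gives you a chance to review each step before touching storage.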

Measuring Progress With Power BI

Stacia Varga shows how to use Power BI to simplify data analysis, using the example of New Year’s resolution goals:

First, the actual data represents the accumulation of data by day from the beginning of the year, whereas the target data represents the final tally at the end of a defined period. Each goal has a different frequency: daily, quarterly, and weekly. Currently, the comparison between actual and target data makes it appear that I’m falling way short of my goals. However, even if I were making solid progress on my goals on a daily basis, the comparison of the two values will never meet until the end of the defined period for any given goal. I need a way to prorate the target data so that I can more reasonably measure my progress.

Second, displaying the actual and target values in a table requires me to do mental math to determine how close (or not) I am to achieving my goals. Now, I’m pretty good at mental math, but a better way to see progress is to use data visualizations. I’m sure you’ve heard the saying… A picture is worth a thousand words.
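The prorating idea in the first point is simple arithmetic: scale the target by the fraction of the period that has elapsed. Here is a minimal Python sketch of that calculation. It assumes straight-line proration by calendar day, which may differ from the measure Stacia builds in Power BI, and the function name and sample numbers are made up for illustration.

```python
from datetime import date


def prorated_target(target, period_start, period_end, as_of):
    """Scale a period-end target by the share of days elapsed (inclusive)."""
    total_days = (period_end - period_start).days + 1
    elapsed_days = (as_of - period_start).days + 1
    # Clamp so dates before the period give 0 and dates after it give the full target.
    elapsed_days = max(0, min(elapsed_days, total_days))
    return target * elapsed_days / total_days


# Hypothetical quarterly goal of 90 units for Q1 2018, checked on January 16.
q1_target = prorated_target(90, date(2018, 1, 1), date(2018, 3, 31), date(2018, 1, 16))
print(q1_target)  # 16.0, since 16 of the quarter's 90 days have elapsed
```

Comparing actuals against this prorated figure, rather than the full-period target, shows whether you are on pace on any given day.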

This is a great post if you’re interested in getting started with Power BI.

Row Goals In SQL Server 2017

Erik Darling points out a new bonus when you upgrade to SQL Server 2017 CU3:

Don’t go looking in SSMS just yet. If you get an actual or estimated plan from a query in SSMS, it’s not in the XML.

However, if you get them from the plan cache later, you can see them in the XML.

According to People Much Smarter Than Me®, SSMS strips out XML that it doesn’t recognize, so we’ll have to wait for the next version to drop before we can access it easily.

Erik also has links to get more information.
