Press "Enter" to skip to content

Month: January 2018

Structural Topic Models In R

Julia Silge has a great post on building Structural Topic Models in R using stm and tidytext:

The stm package has a summary() method for trained topic models like these that will print out some details to your screen, but I want to get back to a tidy data frame so I can use dplyr and ggplot2 for data manipulation and data visualization. I can use tidy() on the output of an stm model, and then I will get the probabilities that each word is generated from each topic.

I haven’t watched the video yet, but that’s on my to-do list for today.

Comments closed

The Grammar of Graphics

I’ve started a new series:

Instead, we will start with Wickham’s paper on ggplot2.  This gives us the basic motivation behind the grammar of graphics by covering what a grammar does for us:  “A grammar provides a strong foundation for understanding a diverse range of graphics. A grammar may also help guide us on what a well-formed or correct graphic looks like, but there will still be many grammatically correct but nonsensical graphics. This is easy to see by analogy to the English language: good grammar is just the first step in creating a good sentence” (3).

With a language, we have different language components like nouns (which can be subjects, direct objects, or indirect objects), verbs, adjectives, adverbs, etc.  We put together combinations of those individual components to form complete sentences and transmit ideas.  Our particular word choice and language component usage will affect the likelihood of success in idea transmission, but to an extent, we can work iteratively on a sentence, switching words or adding phrases to get the point across the way we desire.

With graphics, we can do the same thing.  Instead of thinking of “a graph” as something which exists in and of itself, we should think of different objects that we combine into its final product:  a graph.

I call this first post the poor man’s literature review.  The rest of the series is code- and visual-heavy.

Comments closed

The Cost Of Downtime

Allan Hirt points out a new rule in the UK:

However, as of this week, if you have a company or work in the UK, things just got a whole lot more interesting. The UK government officially released a statement on January 28 which affects “critical industries”. Long story short: if you fall under the classification which seems to be limited right now to energy, transport, water, and health firms, you could be fined up to £17 million ($24 million in US Dollars at today’s exchange rate) in the event of a cyber attack taking you down. It was the WannaCry outages that precipitated the response (as an example, FedEx says WannaCry cost them about $300 million US Dollars). Remember this doozie from British Airways? Also covered under this new Network and Information Systems (NIS) Directive; it’s not just about security, but includes other things like power outages, hardware failure, and environmental hazards.

Read the whole thing.

Comments closed

Finding Errors In A SQL Query Using SSMS

Bert Wagner points out a useful feature of SSMS:

While all of these error messages point me in the right direction, I’m too lazyto try and remember how each version of SSMS refers to the error location. This becomes particularly problematic when jumping around environments and not being able to use a consistent version of SSMS.

There is one feature that works consistently though that makes navigating to the specific error lines easily.

Click through for that answer.  It doesn’t work for dynamic SQL, but it can be quite helpful when it does work.

Comments closed

Power BI Desktop File Shrinkage

Eugene Meidinger notes that Power BI Desktop files are a bit smaller now:

I was working on a demo for my upcoming Pluralsight course, and I noticed something odd. It used to be that a empty PBIX file was 123 KB, but some point since May 2017, the file size has become 10 (!) KB. So what’s the cause of the difference?

If you rename a .pbix file to .zip, you can crack it open. If we look at two nearly empty files side by side, we can see the difference comes from the data model. In this example, each data model has a single value that I manually entered.

Yet another reason why it pays to keep up to date on Power BI versions.

Comments closed

Switching Power BI From Imported To Live Query

Dustin Ryan shows us a neat hack to turn a Power BI Desktop file which uses an imported data model into one which uses Live Query:

In this blog post, I’m going to walk you through modifying a Power BI Desktop file with an imported data model to use an external data model hosted in Azure Analysis Services or SQL Server Analysis Services 2017. This isn’t supported by any stretch of the imagination but if you’re in a pinch and have to convert a Power BI Desktop file from an imported data model to Live Query then this may be helpful to you. Also, this method works as of the January 2018 release of Power BI Desktop but there’s no guarantee that this method will work in future releases of Power BI Desktop.

I was inspired to write this blog post after reading this post in the Power BI community forums by odegarun. It’s a great post with some good instructions, but I wanted to provide a clearer walk through as well as validate a couple other things with the process for my customers and anyone else that might be interested.

Check it out with the understanding that this is not a supported operation.

Comments closed

Mapping Server Audit Action IDs

Solomon Rutzky tries to reverse engineer the action_id values used in SQL Server audits:

This post is, for the most part, a continuation of Server Audit Mystery 1: Filtering class_type gets Error Msg 25713. In that post I was trying to filter on the class_type field / predicate source (i.e. the object, or whatever, that the event is on). The tricky part was that class_type is supposed to be filterable on string values such as U for “User Table”, P for “Stored Procedure”, etc. But, the reality is that it has to be filtered using numbers, and that list does not seem to be documented anywhere. Fortunately, I was able to find the list of possible numbers, so all is well with regards to class_type.

When it comes to the action_id field, however, we are not so lucky. It, too, should be filterable on the string codes that are returned by sys.database_audit_specification_details and sys.fn_get_audit_file. But, just like class_type, only numbers can be used for filtering. And, like class_type, that list of numbers does not seem to be documented anywhere. But, unlike class_type, I cannot find any list of numbers (in SQL Server) that corresponds to the available actions. I did find a list of statements by looking at Extended Events info (since Audits use Extended Events):

SELECT *
FROM   sys.dm_xe_map_values
WHERE  [name] = N'statement_type';

but it wasn’t the correct list. So, this means that I have to go hunting for these values the even-harder way

Read on for some fun with binary tree search.

Comments closed

Convenience Functions In wrapr

John Mount walks us through some of the language conveniences available in the wrapr library:

wrapr supplies additional q*() methods.

  • qae() “quote assignment expression” where both sides of assignments is taken as un-evaluated. I.e.: qae(x = 5+1) yields c(‘x’ = ‘5 + 1’) regardless if x is bound or unbound in the environment. This is a bit of a complement to := which looks-up bindings/references (i.e.: x = "z"; x := 5+1 returns c(‘z’ = ‘6’)).

  • qe() “quote expressions” for quoting complex expressions. Similar to quote(), except it returns a list of strings (not a language object). The qe()method is not as interesting to end-users as the other methods mentioned, it is designed to help in implementation of methods that take a non-assignment expression or list of expressions such as rquery::select_rows_nse().

Read the whole thing.  := probably gives the most obvious immediate benefit but the whole set seems useful.

Comments closed

Installing Spark On Windows

Nigel Meakins is starting a new series on Spark and his first post involves installing Spark on Windows:

WinUtils provides a number of HDFS-emulating utilities that allow us to run Spark as though it were talking to an HDFS storage system (at least to a certain degree). Without this you will get all manner of file system-related issues wit Spark and won’t get off the launchpad.

Within the WinUtils archive you may have a number of Hortonworks Data Platform versioned folders. For the version of Spark I’m using, being 2.2.1, I have chosen hadoop-2,7,1\bin for my files. Unzip and copy the contents of the bin directory to a directory of your choice. It must however be called ‘bin’ in order to be located by the calling programs. I actually placed mine in the C:\Spark\bin directory together with the other executables that Spark uses but this is not essential.

Once done, you will need to set the following environment variable:

HADOOP_HOME = <your winutils ‘bin’ parent directory>

Note we don’t include the \bin, so for my example this is C:\Spark.

I have a post on installing Spark on Windows that might help if you get stuck on the WinUtils part.

Comments closed