Simon Jackson introduces pipelearner, a tool to help with creating machine learning pipelines:

This post will demonstrate some examples of what pipeleaner can currently do. For example, the Figure below plots the results of a model fitted to 10% to 100% (in 10% increments) of training data in 50 cross-validation pairs. Fitting all of these models takes about four lines of code in pipelearner.

Click through for some very interesting examples.

Monitoring Car Data With Spark And Kafka

Carol McDonald builds a model to determine where Uber cars are clustered:

Uber trip data is published to a MapR Streams topic using the Kafka API. A Spark streaming application, subscribed to the topic, enriches the data with the cluster Id corresponding to the location using a k-means model, and publishes the results in JSON format to another topic. A Spark streaming application subscribed to the second topic analyzes the JSON messages in real time.

This is a fairly detailed post, well worth the read.

Web App Security

Vishwas Parameshwarappa has an article on securing web applications:

The Cross-site request forgery (CSRF) exploit uses cross-site scripting (mentioned above), browser insecurities, and other techniques to cause a user to unwittingly perform an action within their current authenticated context that allows the attacker to access the user’s account. This type of attack usually occurs when a malicious email, blog, or a message causes a user’s Web browser to perform an unwanted action on a trusted site for which the user is currently authenticated.

This is a nice overview of the most common attack vectors for web applications.

Deleting SSAS Partitions

Chris Koester shows how to use TMSL and Powershell to delete an Analysis Services tabular model partition:

The sample script below shows how this is done. The sequence command is used to delete multiple partitions in a single transaction. This is similar to the batch command in XMLA. In this example we’re only performing delete operations, but many different operations can be performed in sequence (And some in parallel).

Click through for a description of the process as well as a script to do the job.

Always Encrypted With Powershell

Jakub Szymaszek shows how to configure Always Encrypted support from Powershell:

Note: In a production environment, you should always run tools (such as PowerShell or SSMS) provisioning and using Always Encrypted keys on a machine that is different than the machine hosting your database. The primary purpose of Always Encrypted is to protect your data, in case the environment hosting your database gets compromised. If your keys are revealed to the machine hosting the database, an attacker can get them and the benefit of Always Encrypted will be defeated.

That’s a good warning.

Azure Resource Explorer

Kenneth Fisher discusses the Azure Resource Explorer:

Now this is just default tab. The GET, PUT tab. Which basically shows you the get command of the resource manager API that calls this information, and if you hit the edit button you can actually change information in the JSON output and issue a PUT command to send it back. I’ll admit up front that this is a bit beyond me as I don’t do API calls and I’m new enough to Azure that I don’t know what I can and can’t change (everything I’ve tried so far hasn’t worked). There are several other tabs, though, including a Powershell one and I’m a bit more familiar with Powershell. In it, you can see some of the Powershell commands associated with the resource manager and this particular object.

Read on for more information.

Columnstore Partitioning

Niko Neugebauer warns against partitioning small tables with clustered columnstore indexes:

Needless to say that looking at the execution plans you notice that the actual execution plan shows 10 times difference between them, even though both tables contain the very same data!
The query cost for the partitioned table is staggering – it is around 10 times bigger (~8.8) vs (~0.81) for the first query.
The execution times reflect in part this situation: 12 ms vs 91 ms. Non-partitioned table performs almost 9 times faster overall and the spent CPU time is reflecting it: 15 ms vs 94 ms. Remember, that both tables are Columnstore Indexes based ! Partitioning your table in a wrong way will contain a huge penalty that might not be directly detectable through the execution plan of the complex queries. Well, you might want to use the CISL, just saying

If you can’t fill a single rowgroup, your partition is too granular.  Even then, I’d like to see double-digit rowgroups per partition, though that’s just me.

Polybase And Azure SQL Data Warehouse

I have a post on using Polybase with Azure SQL Data Warehouse:

That’s a header row, and I’m okay with it not making its way in.  As a quick aside, I should note that I picked tailnum as my distribution key.  The airplane’s tail number is unique to that craft, so there absolutely will be more than 60 distinct values, and as I recall, this data set didn’t have too many NULL values.  After loading the 2008 data, I loaded all years’ data the same way, except selecting from dbo.Flights instead of Flights2008.

Click through for more details, including the CETAS statement, which I’d love to see in on-prem SQL Server.

Filtered Indexes And Parameters

Erik Darling shows an example of what happens when you have a filtered index and parameterize the filter:

It Is Known

That when you use filtered indexes, they get ignored when your queries are parameterized. This is a Plan Caching Thing©, of course. The simplest example is a bit column with a filtered index. If your index is on WHERE Bit = 1, it doesn’t have data for WHERE Bit = 0. That index would only be suitable for one variation of the query, so caching a plan that uses an index which can’t be reused for every variation isn’t feasible.

Read on for a couple examples, and check the comments on this as well.

Using RTVS

Kevin Feasel



David Eldersveld gives three reasons why you might be interested in R Tools for Visual Studio:

2. Incorporate R projects as part of a broader Visual Studio solution
Many Visual Studio solutions end up being a collection of individual projects. More often than not, these projects are logically joined by virtue of being part of the same business solution, but each one can incorporate different components or languages. For example, you may architect a solution that involves separate projects for loading data­­ with Azure Data Factory, analysis with R, a front-end C# web app, etc. Rather than keep your R code siloed off in a separate solution, unite it with the rest of your code for development and source control.

This is my primary reason.  R Studio is still my go-to option, but RTVS is maturing fairly nicely.  It still feels slower than R Studio when displaying data on-screen (especially when you’re spitting out a couple hundred lines of text), but that Visual Studio integration will go far.  A fourth reason that David does not mention:  it generates the really ugly sp_execute_external_script code for SQL Server R Services.


November 2018
« Oct