Bridging The R-Python Gap

Siddarth Ramesh argues that revoscalepy helps R developers acquaint themselves with Python:

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end to end workflow is built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than Python.

Python has many strengths, including robust data structures such as dictionaries, compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, and for those tasks the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

Having worked with both, my loyalties tend to lie with R for a couple of reasons. But this might help some people bridge the gap.

Using Keras To Predict Customer Churn

Matt Dancho has an example of building a neural net using Keras to predict customer churn:

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

  • correlate(): Performs tidy correlations on numeric data

  • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.

  • fashion(): Makes the formatting aesthetically easier to read.

This is a very useful tutorial.
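
The quick test from the excerpt (does a log transformation increase the magnitude of the correlation with churn?) can also be sketched outside of R. The following is a minimal Python illustration and not the tutorial's code: the `pearson` helper and the eight data points are made up for the example, and it uses only the standard library rather than dplyr or corrr.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, NOT the tutorial's data set.
total_charges = [30.0, 55.0, 120.0, 400.0, 900.0, 2500.0, 4800.0, 7200.0]
churn = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = customer churned

raw_corr = pearson(total_charges, churn)
log_corr = pearson([math.log(v) for v in total_charges], churn)

print(f"TotalCharges vs Churn:      {raw_corr:.3f}")
print(f"log(TotalCharges) vs Churn: {log_corr:.3f}")

# The transformation is worth keeping only if the magnitude goes up.
if abs(log_corr) > abs(raw_corr):
    print("Log transformation increases the correlation magnitude.")
else:
    print("Log transformation does not help here.")
```

Comparing absolute values matters because the correlation with churn may well be negative; the question is strength, not sign.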

PowerShell Speed Testing

Shane O’Neill shows off a PowerShell script which allows you to simplify performance testing:

Apart from catching up on news during my commute I only really use notifications for a certain number of hashtags i.e. #SqlServer, #tsql2sday, #sqlhelp, and #PowerShell.

So during work, every so often a little notification will pop up on the bottom right of my window and I can quickly glance down and decide whether to ignore it or check it out.

That’s what happened with the following tweet:

Click through for Shane’s demo.

Vertical Selection In SSMS

Bert Wagner shows off vertical selection in SSMS (using the Alt key):

Sometimes when writing an ad hoc query you might want to take the results of one query and put them into an IN() statement of another query.

Sure, you can write a subquery to put into your IN() statement…but that’s too much work for a one-time use disposable query.

What you can do instead is:

  1. Copy your values of interest

  2. Paste them into your IN() statement

  3. Hold down the ALT key while dragging the mouse down in front of all of your pasted values

  4. Type a comma (see video above for an easier demonstration).

For SSMS speedrunning strats, you can also hold down ALT + SHIFT and use your keyboard arrow keys instead of using the mouse.

How Non-Clustered Index Key Columns Are Stored

Kendra Little walks through page-level details on a non-clustered index:

Just like in the root page and the intermediate pages, the FirstName and RowID columns are present.

Also in the leaf: CharCol, our included column appears! It was not in any of the other levels we inspected, because included columns only exist in the leaf of a nonclustered index.

Kendra does a great job of explaining the topic.

Handling Session State With Memory-Optimized Tables

Perry Skountrianos shows how to configure ASP.NET to use memory-optimized tables for session state:

ASP.NET session state enables you to store and retrieve values for a user as the user navigates the different ASP.NET pages that make up a Web application. Currently, ASP.NET ships with three session state providers that provide the interface between Microsoft ASP.NET’s session state module and session state data sources:

  • InProcSessionStateStore, which stores session state in memory in the ASP.NET worker process
  • OutOfProcSessionStateStore, which stores session state in memory in an external state server process
  • SqlSessionStateStore, which stores session state in a Microsoft SQL Server database

This blog post focuses on the SqlSessionStateStore provider and describes how you can configure it to use SQL Server In-Memory OLTP as the storage option for session data. You can either use the latest ASP.NET async version of the SQL Session State provider (which is the recommended approach), or configure an earlier version of the provider to work with In-Memory OLTP by downloading and running the In-Memory OLTP SQL scripts from our SQL Server samples GitHub repo.

The me of seven years ago really needed this. But with the strong shift against session-based data collection and back to stateless or client-held state paradigms, I’m not sure how many people this helps.

Power BI Helper: Expression Dependency Trees

Reza Rad announces a new feature of Power BI Helper:

I’m excited to share the news with you that we have added a new feature in Power BI Helper: Expression Tree. Expression Tree will expand the tree of expression for a Measure or calculated column, so you can see what other measures are used to create this expression, and where other measures, calculated columns, or even normal columns are located (in which table). This feature is in addition to the previous two features of this tool, which were showing tables and fields used in visualization pages of a Power BI Report, and the ability to search for a column or table that is used in visualization pages of a report. In this post, I’ll explain how this new feature works.

Read on for the explanation. I can see this being quite useful.

Additional Restore-DbaDatabase Functionality

Stuart Moore shows off a few examples of advanced Restore-DbaDatabase usage:

No matter how hard the dbatools team tries, there’s always someone who wants to do things we’d never thought of. This is one of the great things with getting feedback direct from a great community. Unfortunately a lot of these ideas are either too niche to implement, or would be a lot of complex code for a single use case.

As part of the Restore-DbaDatabase stack rewrite, I wanted to make things easier for users to be able to get their hands dirty within the Restore stack. Not necessarily needing to dive into the core code and the world of GitHub Pull Requests, but by manipulating the data flowing through the pipeline using standard PowerShell techniques, all the while being able to do the heavy lifting without code.

Click through for several examples.

Tidy Word Vectors Revisited

Julia Silge revisits her Hacker News word vectorization problem:

So hooray! We have found word vectors again, a bit faster, with clearer and easier-to-understand code. I do argue that this is a real benefit of this approach; it’s based on counting, dividing, and matrix decomposition and is thus much easier to understand and implement than anything with a neural network. And the results?

Click through to see the new method, as well as some basic analogy testing.
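
For readers who want a feel for what "counting, dividing, and matrix decomposition" can look like, here is a rough Python sketch. It is an assumption about the general shape of such a method (co-occurrence counts, positive pointwise mutual information, then truncated SVD), not a reproduction of the post's R code. The six-line corpus, the whole-document co-occurrence window, and the choice of two dimensions are all invented for illustration, and the sketch assumes NumPy is installed.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np

# Tiny made-up corpus; each "document" acts as one co-occurrence window.
docs = [
    "sql server index page",
    "sql server query plan",
    "python pandas data frame",
    "python numpy data array",
    "sql query index plan",
    "python data frame array",
]

tokens = [d.split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
idx = {w: i for i, w in enumerate(vocab)}
n_docs = len(tokens)

# 1. Counting: how many documents contain each word and each word pair.
word_counts = Counter(w for doc in tokens for w in set(doc))
pair_counts = Counter(
    pair for doc in tokens for pair in combinations(sorted(set(doc)), 2)
)

# 2. Dividing: positive PMI, max(0, log(p(a, b) / (p(a) * p(b)))).
ppmi = np.zeros((len(vocab), len(vocab)))
for (a, b), n_ab in pair_counts.items():
    p_ab = n_ab / n_docs
    p_a = word_counts[a] / n_docs
    p_b = word_counts[b] / n_docs
    value = max(math.log(p_ab / (p_a * p_b)), 0.0)
    ppmi[idx[a], idx[b]] = value
    ppmi[idx[b], idx[a]] = value

# 3. Matrix decomposition: keep the top singular directions as word vectors.
u, s, _ = np.linalg.svd(ppmi)
dims = 2  # arbitrary choice for this toy example
vectors = u[:, :dims] * s[:dims]


def cosine(a, b):
    """Cosine similarity between the vectors of two vocabulary words."""
    va, vb = vectors[idx[a]], vectors[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))


print("sql ~ server:", round(cosine("sql", "server"), 3))
print("sql ~ pandas:", round(cosine("sql", "pandas"), 3))
```

On a corpus this small the similarities only show that words sharing contexts end up close together; a real analysis would use a sliding window over far more text.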

Machine Learning Data Preparation Tips

Jen Underwood has some good tips when preparing data for a machine learning operation:

Data preparation for machine learning requires business domain expertise, bias awareness and an experimental thought process. Before preparing your data, you’ll first define a business problem to solve. During that exercise, you’ll select an outcome metric and brainstorm potential input variables that influence it from many varied perspectives. From there you will begin identifying, collecting, cleaning, shaping and sampling data to run through automated machine learning model processes.

Note that it is also not unusual for relevant machine learning input data to occur outside of existing transactional processes. If that is the case, you can still start creating a first-generation machine learning model with existing data and continue to build new model versions over time as supplementary data is acquired.

Click through for the ten tips.
