
Author: Kevin Feasel

Network And Sankey Diagrams In Python And R

Tony Hirst has a roundup of various R and Python packages which build network charts or Sankey diagrams:

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

Sankey diagrams are on my list of dangerous visuals: done right, they are informative, but it’s easy to try to put too much into the diagram and thereby confuse everybody.
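If sankeyview’s setup proves stubborn, plotly’s Sankey trace is one readily available alternative for sketching the same idea. A minimal example (module names and flow values invented for illustration; note the last link flows backwards, i.e. a feedback loop):

    import plotly.graph_objects as go

    # Hypothetical flows between course modules; source/target are
    # indexes into the labels list, value is the flow magnitude.
    labels = ["Module A", "Module B", "Module C"]
    fig = go.Figure(go.Sankey(
        node=dict(label=labels, pad=15, thickness=20),
        link=dict(
            source=[0, 1, 2],
            target=[1, 2, 0],   # the 2 -> 0 link is a feedback loop
            value=[8, 6, 2]
        )
    ))
    fig.show()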


Predicting Restaurant Reservations With A Neural Net

Kevin Jacobs builds a simple neural net using Pandas and sklearn:

The first thing to notice is that our values are not normalized. The number of visitors is a number and gets larger and larger. To normalize it, we simply divide it by 100, so that all numbers are below 1. The same holds for the lag. Most of the lags are lower than 30. Therefore, I will divide the lag size by 30.

Notice that there are many more approaches for normalizing the data! This is just a quick normalization of the data, but feel free to use your own normalization method. My normalization process is closely related to the MinMaxScaler normalization which can be found in sklearn (scikit-learn).

With just a few lines of Python code we can create a Multi-Layer Perceptron (MLP):

Click through for the code.
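As a rough sketch of the approach, assuming a Pandas DataFrame with visitors and lag columns (the column names and data here are made up), the ad hoc normalization plus an sklearn MLP might look like:

    import pandas as pd
    from sklearn.neural_network import MLPRegressor

    # Hypothetical reservation data; column names invented for illustration
    df = pd.DataFrame({"visitors": [25, 40, 13, 55],
                       "lag": [1, 7, 14, 30],
                       "reservations": [20, 35, 10, 50]})

    # The quick-and-dirty normalization from the post: divide visitors
    # by 100 and lag by 30 so both features land roughly in [0, 1]
    X = df[["visitors", "lag"]].copy()
    X["visitors"] /= 100
    X["lag"] /= 30
    y = df["reservations"]

    # A small Multi-Layer Perceptron; layer sizes are arbitrary here
    mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                       random_state=42)
    mlp.fit(X, y)
    print(mlp.predict(X))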


“Caveman” Graphs In SQL

Denis Gobo puts together some basic Management Studio data visualization:

I found this technique on Rich Benner’s SQL Server Blog: Visualising the Marvel Cinematic Universe in T-SQL and decided to play around with it after someone asked me to give him the sizes of all databases on a development instance of SQL Server.

The way it works is that you take the size of the database and then divide that number by the total size of all databases. You then use the replicate function with the | (pipe) character to generate the ‘graph’, so 8% will look like this: ||||||||

You can use this for tables with the most rows, a count per state, etc. Looking at the output, the graph column adds a nice visual effect IMHO.

It does the job and doesn’t require you to go out to a different product, so it works pretty well for occasional administrative queries.
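A minimal sketch of the technique against sys.master_files (the size math is simplified; size is reported in 8 KB pages):

    -- One row per database: size in MB plus a pipe-character bar sized
    -- by that database's share of the total
    SELECT
        d.name,
        SUM(mf.size) * 8 / 1024 AS size_mb,
        REPLICATE('|', CAST(100.0 * SUM(mf.size)
            / SUM(SUM(mf.size)) OVER () AS int)) AS graph
    FROM sys.master_files mf
        INNER JOIN sys.databases d
            ON d.database_id = mf.database_id
    GROUP BY d.name
    ORDER BY size_mb DESC;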


Locks And Partitioning

Erik Darling looks at the confusing mess that is SQL Server partitioning:

In the Chicago perf class last month, we had a student ask if partition level locks would ever escalate to a table level lock. I wrote up a demo and everything, but we ran out of time before I could go over it.

Not that I’m complaining — partitioning, and especially partition level locking, can be pretty confusing to look at.

If you really wanna learn about it, you should talk to Kendra — after all, this post is where I usually send folks who don’t believe me about the performance stuff.

Click through for that demo and explanation.
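For reference, the behavior hinges on the table’s LOCK_ESCALATION setting: the default (TABLE) escalates straight to a table-level lock, while AUTO lets escalation target individual partitions on a partitioned table. A quick sketch (the table name is hypothetical):

    -- Allow escalation to the partition level instead of the whole table
    ALTER TABLE dbo.PartitionedTable SET (LOCK_ESCALATION = AUTO);

    -- Verify the current setting
    SELECT name, lock_escalation_desc
    FROM sys.tables
    WHERE name = 'PartitionedTable';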


Getting Started With Always Encrypted

Monica Rathbun kicks off a series on Always Encrypted:

There are two possibilities: Deterministic and Randomized.

MSDN defines Deterministic encryption as encryption which always generates the same encrypted value for any given plain text value. This means that if you have a birthdate of 01/03/1958, it will always be encrypted to the same value each time, such as ABCACBACB. This allows you to index it and use it in WHERE clauses, GROUP BY clauses, and JOINs.

Randomized encryption, per MSDN, uses a method that encrypts data in a less predictable manner. This makes Randomized encryption more secure because, using the example above, each encrypted value of 01/03/1958 will be different. It could be ABCACBACB, BBBCCAA, or CCCAAABBB. All three encrypted values are subsequently decrypted to the same value. Since the encrypted value is random, you cannot perform search operations on it as you can with Deterministic.

Part 1 is about building the certificates and keys needed to encrypt data.
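To make the two types concrete, here is a sketch of how they appear in a column definition once the keys are in place (the column encryption key name is hypothetical; the algorithm string is the one Always Encrypted supports):

    CREATE TABLE dbo.Person
    (
        -- Deterministic: same plaintext, same ciphertext, so equality
        -- searches, joins, and grouping still work
        BirthDate date
            ENCRYPTED WITH (
                COLUMN_ENCRYPTION_KEY = CEK_Demo,
                ENCRYPTION_TYPE = DETERMINISTIC,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL,
        -- Randomized: same plaintext, different ciphertext every time,
        -- more secure but not searchable
        SocialSecurityNumber char(11) COLLATE Latin1_General_BIN2
            ENCRYPTED WITH (
                COLUMN_ENCRYPTION_KEY = CEK_Demo,
                ENCRYPTION_TYPE = RANDOMIZED,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL
    );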


Upgrading That Expired Evaluation Copy Of SQL Server

Cody Konior finds a way to extricate the poor souls who need to upgrade expired evaluation copies of SQL Server from their mess:

Common advice here is to set the clock backwards. My problem with that is that you’re probably doing this on an unsupported unknown black-box flaming garbage can of a system set up by someone who wasn’t meant to do it – because otherwise they wouldn’t be using the evaluation edition. So what are the repercussions of setting the clock backwards? Perhaps their application spawning silently in the background and trashing this or other databases with bad date information? Perhaps you’ll lose your RDP connection and then be unable to connect back in because of the SSPI error generated by a clock mismatch?

No thanks. Instead you need to do some detective work.

Read the whole thing.


Using Pester For Configuration Checks

Andrew Pruski shows how to use Pester to audit SQL Server configuration settings:

One Pester test running!

What I like about this is that it can be easily dropped into a job scheduler (e.g., Jenkins) and then you’ve got a way to routinely check (and correct) all the configuration settings of the SQL instances that you monitor.

Pester would not have been my first thought for configuration checking, but it does serve as another useful option.
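For flavor, here is a minimal sketch of such a test, using Pester along with the SqlServer module’s Invoke-Sqlcmd (the instance name and expected value are made up):

    Describe "SQL Server configuration" {
        It "has the expected max degree of parallelism" {
            $maxdop = (Invoke-Sqlcmd -ServerInstance "SQL01" -Query "
                SELECT value_in_use
                FROM sys.configurations
                WHERE name = 'max degree of parallelism';").value_in_use
            $maxdop | Should -Be 8
        }
    }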


Bridging The R-Python Gap

Siddarth Ramesh argues that revoscalepy helps R developers acquaint themselves with Python:

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end-to-end workflow be built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than in Python.

Python has many strengths, such as its robust data structures (e.g., dictionaries), its compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, for which the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

Having worked with both, I find my loyalties tend to lie with R for a couple of reasons. But this might help some people bridge the gap.
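The draw of revoscalepy is that it keeps RevoScaleR’s R-style formula interface in Python. A minimal sketch, assuming a SQL Server Machine Learning Services or Machine Learning Server install where the package is available (the data and column names are invented):

    import pandas as pd
    from revoscalepy import rx_lin_mod, rx_predict

    # Hypothetical data; the point is the R-style formula below
    df = pd.DataFrame({"sales": [10, 20, 30, 40],
                       "ads": [1, 2, 3, 4],
                       "price": [5, 4, 3, 2]})

    # R users will recognize this from lm() / rxLinMod()
    model = rx_lin_mod(formula="sales ~ ads + price", data=df)
    print(rx_predict(model, data=df))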


Using Keras To Predict Customer Churn

Matt Dancho has an example of building a neural net using Keras to predict customer churn:

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

  • correlate(): Performs tidy correlations on numeric data

  • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.

  • fashion(): Makes the formatting aesthetically easier to read.
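Put together, the check might look something like this sketch (churn_tbl is a stand-in for the post’s prepared data, with Churn recoded to numeric first):

    library(dplyr)
    library(corrr)

    # Does log-transforming TotalCharges strengthen its correlation
    # with Churn?
    churn_tbl %>%
      select(Churn, TotalCharges) %>%
      mutate(
        Churn = ifelse(Churn == "Yes", 1, 0),
        LogTotalCharges = log(TotalCharges)
      ) %>%
      correlate() %>%   # tidy correlation matrix on numeric columns
      focus(Churn) %>%  # keep only correlations with Churn
      fashion()         # prettier formatting of the output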

This is a very useful tutorial.
