Press "Enter" to skip to content

Curated SQL Posts

Thinking About Slowly Degrading Page Performance

Ritesh Maheshwari talks about how LinkedIn deals with performance regressions:

Looking at the chart above, where the dotted red line is a reference point to show where we started the year, notice how site speed improvements tend to be significant and noticeable, as they are optimization-driven. Degradations, however, can generally be of any “amount,” as they happen for various reasons. LinkedIn’s page-serving pipeline has many moving parts. We deploy code multiple times per day, operate a micro-service architecture with hundreds of services, and infrastructure upgrades are frequent. A slowdown in any of these components can cause degradations.

While large degradations can be caught using A/B testingcanary analysis, or anomaly detection, small ones tend to leak to production. Thus, performance of a page has a tendency to always degrade over time.

This led to having the centralized Performance Team focus on identifying these leaks, called “site speed regressions,” and to craft tools and processes to fix them.

It’s an interesting principle.  I could see this principle work for tracking database performance degradation as well.

Comments closed

Promises And Closures In R

Damian Rodziewicz looks at the new promises package in R:

Citing Joe Cheng, our aim is to:

  1. Execute long-running code asynchronously on separate thread.
  2. Be able to do something with the result (if success) or error (if failure), when the task completes, back on the main R thread.

A promise object represents the eventual result of an async task. A promise is an R6 object that knows:

  1. Whether the task is running, succeeded, or failed

  2. The result (if succeeded) or error (if failed)

This looks pretty exciting.  H/T R-Bloggers

Also, Sebastian Warnholz has a post on promises and closures in case you’re not familiar with the concepts:

Every argument you pass to a function is a promise until the moment R evaluates it. Consider a function g with arguments x and y. Let’s leave out one argument in the function call:

g <- function(x, y) x
g(1)

## [1] 1

R will be forgiving (lazy) until the argument y is actually needed. Until then y exists in the environment of the function call as a ‘name without a value’. Only when R needs to evaluate y a value is searched for. This means that we can pass some non-existent objects as arguments to the function g and R won’t care until the argument is needed in the functions body.

Read the whole thing.  Once again, H/T R-Bloggers

Comments closed

Team Data Science Process Updates

David Smith announces updates to the Team Data Science Process:

It’s been over a year since we first introduced introduced the Team Data Science Process (TDSP). The data, technology and practices behind Data Science continue to evolve, and the TDSP has evolved in parallel. Over the past year, several new facets have been added, including:

  • The IDEAR (Interactive Data Exploration, Analysis and Reporting) framework, an open source extension to R and Python designed to standardize the process of data exploration and reporting;

  • Guidance for use of Spark 2.0, including an end-to-end Spark v2.0 walkthrough;

  • Guidance for use of in-database Python with SQL Server, including an end-to-end in-database Python tutorial;

Click through for more changes, as well as links to further resources.

Comments closed

Python For The DBA: Copying SQL Logins

David Fowler has an example of how DBAs can use Python to do something interesting:

There are plenty of times when you might want to copy your SQL logins (including the SID) from one server to another.  Perhaps you’re running an AG and need to make sure that all users exist on all your secondaries with the correct SID, perhaps you’re migrating servers and need all the logins on your new server.  Whatever the reason, there are a number of different ways in which you can do this but they usually require scripting out on one server and then running the script into another server, or of course there’s Powershell.

The below script will use Python to copy all or specified logins from one server to another, including the password and SID.

Click through for the script.  The main use case for a SQL Server DBA to learn Python as a DBA scripting language would be if you run SQL on Linux—I don’t think Powershell on Linux is far enough developed to handle the full range of DBA tasks.  Otherwise, I’d use Powershell and dbatools.

Comments closed

Stack Shuffle Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Enlighten Stack Shuffle Custom Visual.  The Enlighten Stack Shuffle is helpful when you want to display a Top N set of values.  For example if you want to display your top 5 selling employees this visual can make that very easy.

This looks pretty good on a dashboard, especially if you have a top-heavy data set, where the top few items are by far the most important.

Comments closed

Sentiment Analysis With Python In SQL Server

Nellie Gustafsson has a quick example of sentiment analysis using SQL Server Machine Learning Services:

You don’t have to be a data scientist to use machine learning in SQL Server. You can use pre-trained models available for usage out of the box to do your analysis. The following example shows you how you quickly get started and do text sentiment analysis.

Before starting to use this model, you need to install it. The installation is quick and instructions for installing the model can be found here: How to install the models on SQL Server

Once you have SQL Server installed with Machine Learning Services, enabled external script execution, and installed the pre-trained model, you can execute the following  script to create a stored procedure that uses Python and the microsoftml function get_sentiment with the pre-trained model to determine the probability of positive sentiment of a text:

Click through to read the whole thing.

Comments closed

SQL Server Operations Studio

Rohan Kumar announces a new management interface for SQL Server (among other things):

These new cross-platform capabilities have made SQL Server accessible to users of Windows, Linux and Mac. At PASS Summit, we are excited to provide a sneak peek at Microsoft SQL Operations Studio. In a few weeks, users will be able to download and evaluate this free, light-weight tool for modern database development and operations on Windows, Mac or Linux machines for SQL Server, Azure SQL Database, and Azure SQL Data Warehouse. Increase your productivity with smart T-SQL code snippets and customizable dashboards to monitor and quickly detect performance bottlenecks in your SQL databases on-premises or in Azure. You’ll be able to leverage your favorite command line tools like Bash, PowerShell, sqlcmd, bcp and ssh in the Integrated Terminal window. Users can contribute directly to SQL Operations Studio via pull requests from the GitHub repository.

Steve Stedman comments on it:

This appears to be a new cross platform (Mac, Linux, and Windows) for running queries against SQL Server.

Perhaps this is a replacement for SQL Server Management Studio, or may just a subset of what SSMS does today, however it works cross platform, unlike SSMS.

You won’t be giving up Management Studio anytime soon, but there are some really cool parts of SQL Operations Studio forthcoming.

1 Comment

Vectorized UDFs For PySpark

Li Jin talks about a performance optimization coming in Apache Spark 2.3:

To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.

Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.

This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.

Comments closed

Hive Without Hadoop

Anubhav Tarar notes that you can run Hive queries against data even without a Hadoop installation:

Starting with release 0.7Hive also supports a mode to run map-reduce jobs in local-mode automatically

you just have to do two things first create your warehouse in local system and give the default fs name to local put these properties inside your hive-site.xml

This is a fairly short post; click through to see the changes you’d make to hive-site.xml.

Comments closed

Text Preprocessing With R

Sibanjan Das has started a new series on text mining in R:

Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.

The series starts off with a quick description of some preprocessing steps and then building an LDA model to extract key terms from articles.

Comments closed