Including A Progress Bar In R

Kevin Feasel

2018-01-24

R

Peter Solymos has an update to his pbapply library:

The pbapply R package that adds progress bar to vectorized functions has been know to accumulate overhead when calling parallel::mclapply with forking (see this post for more background on the issue). Strangely enough, a GitHub issue held the key to the solution that I am going to outline below. Long story short: forking is no longer expensive with pbapply, and as it turns out, it never was.

H/T R-Bloggers

Building A Model Using SQL Server ML Services

Kevin Feasel

2018-01-24

R

I have a post which shows how to build a simple R model to predict demand for an item:

I am a huge fan of the Poisson distribution.  It is special in that its one parameter (lambda) represents both the mean and the variance of the distribution.  At the limit, a Poisson distribution becomes normal.  But it’s most useful in helping us pattern infrequently-occurring events.  For example, selling 3-4 watches per day.

Estimating a Poisson is also easy in R:  lambda is simply the mean of your sample counts, and there is a function called rpois() which takes two parameters:  the number of events you want to generate and the value of lambda.

So what I want to do is take my data from SQL Server, feed it into R, and return back a prediction for the next seven days.

This was a simple post, but the next two in the series will expand upon it and build out a full implementation.

Exploring The MNIST Dataset

David Robinson performs exploratory data analysis on the MNIST digit database:

The challenge is to classify a handwritten digit based on a 28-by-28 black and white image. MNIST is often credited as one of the first datasets to prove the effectiveness of neural networks.

In a series of posts, I’ll be training classifiers to recognize digits from images, while using data exploration and visualization to build our intuitions about why each method works or doesn’t. Like most of my posts I’ll be analyzing the data through tidy principles, particularly using the dplyr, tidyr and ggplot2 packages. In this first post we’ll focus on exploratory data analysis, to show how you can better understand your data before you start training classification algorithms or measuring accuracy. This will help when we’re choosing a model or transforming our features.

Read on for the analysis.

The Whys Of Azure ML Workbench

Ginger Grant explains why Azure Machine Learning Workbench exists:

Microsoft is looking for Azure Machine Learning Workbench for more than a tool to use for Machine Learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation.  To install the Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft’s Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing Machine Learning models.

This migration into management of machine learning components is part of a pattern first seen on the on-premises version of data science functionality.  First Microsoft helped companies manage the deployment of R code with SQL Server 2016 which includes the ability to move R code into SQL Server.  Providing this capability decreased the time it took to implement a data science solution by providing a means for the code can be deployed easily without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well.  With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all Machine Learning services created can be managed in one place.

Read on for more.

Smart Differential Backups

Tracy Boggiano continues her smart backups series, this time looking at differential backups:

SQL Server 2017 introduced a new column for taking smarter backups for differential backups as part of the community-driven enhancements. A new column modified_extent_page_count is introduced in sys.dm_db_file_space_usage to track differential changes in each database file of the database.  The blog referenced states it takes just as many resources to take a differential backup as a full when there are between 70% and 80% of pages changes. With this field and the allocated_extent_page_count field, we can calculate the percentage of pages changed since the last full backup. So I have added logic into the differential backups that I use in combination with the configuration tables from my Github repository.  To support this change we will be adding two new fields to the DatabaseBackupConfig table:

  • SmartBackup
  • DiffChangePercent

The main part of the code determines if you are running SQL Server 2017 then determine which databases the percentage is greater than or equal to the value you put in the table.  Then it puts in two separate variables which databases to take full backups of and which ones to take differential backups of.

Click through for the script.

Lambda Architecture In Azure

Jared Zagelbaum describes the Lambda architecture pattern and explains how you can use tooling in Azure to implement it:

Lambda is an organic result of the limitations of existing tools. Distributed systems architects and developers commonly criticize its complexity – and rightly so. Those of us that have worked extensively in Extract-Transform-Load and symmetric multiprocessing systems see red flags when code is replicated in multiple services. Ensuring data quality and code conformity across multiple systems, whether massively parallel processing (MPP) or symmetrically parallel system (SMP), has the same best practice: the least amount of times you reproduce code is always the correct number of times.

We reproduce code in lambda because different services in MPP systems are better at different tasks. The maturity of tools historically hasn’t allowed us to process streams and batch in a single tool. This is starting to change, with Apache Spark emerging as a single preferred compute service for stream and batch querying, hence the timing of Azure Databricks. However, on the storage side, what was meant to be an immutable store that is the data lake in practice, can become the dreaded swamp when governance or testing fails; which is not uncommon. A fundamentally different assumption to how we process data is required to combat this degradation. Enter: the kappa architecture, which we’ll examine in the next post of this series.

Interesting reading.

Cloning A SQL Server Installation

Jana Sattainathan shows how you can find the configuration options used when installing SQL Server:

Let us say that you want to install SQL Server on another host say NewHost but you want it to have the same settings as another host/instance on say ModelHost. In fact, let us say that these are almost identical hosts with similar drive locations and such and you are positive that you want the exact setup on both.

The hard way is to look at what features ModelHost has and try to click through the installation wizard with the correct options selected/input.

Fortunately, there is an easier way to clone an installation. Even if you installed manually using the installation wizard, SQL Server still generates a configuration file with all the settings used (except passwords) for the installation. We can simply use that file from the ModelHost to drive the installation on NewHost

Click through to see where that configuration file is and how you can use it.

Don’t Run Services As Root On Linux

Kellyn Pot’vin-Gorman explains why running SQL Server as root is a bad idea:

Although enhancements have changed Windows installations for applications to run with a unique user, I created a mssql OS user even back on SQL Server 2000 on Windows as I had a tendency to use similar security practices for all database platforms as a multi-platform DBA.  With that being said-  yes, it introduced complexity, but it was for a reason: users should be restricted to the least amount of privileges required.  To grant any application or database “God” powers on a host is akin to granting DBA to every user in the database, but at the host level.  As important as security is to DBAs INSIDE the database, it should be just as important to us OUTSIDE of it on the host it resides on.

Security is important and has become more complex with the increase of security breaches and introduction of the cloud.  One of the most simple ways to do this is to ensure that all application owners on a host are granted only the privileges they require.  The application user should only utilize SUDO, stick bit, iptables, SUID, SGID and proper group creation/allocation if and when required.

It’s the same reason we don’t recommend giving everyone sa rights to databases.  Read on for more.

Automatic Tuning In SQL Server 2017

Arun Sirpal shows off one of the more interesting features in SQL Server 2017:

Before we begin any further let’s do a little recap. Automatic tuning in SQL Server 2017 notifies you whenever a potential performance issue is detected, and lets you apply corrective actions, or lets the Database Engine automatically fix performance problems, this is also available in Azure SQL Database.

There are 2 parts to it, automatic plan correction and automatic index management, for SQL Server 2017, automatic index management it IS NOT part of the product.

To switch automatic plan correction on you will need to run the following code against your database.

I’m looking forward to seeing this expand much further.

Categories

January 2018
MTWTFSS
« Dec Feb »
1234567
891011121314
15161718192021
22232425262728
293031