Press "Enter" to skip to content

Month: April 2018

Tidy Anomaly Detection With Anomalize

Abdul Majed Raja walks us through an example using the anomalize package:

One of the important things to do with time series data, before starting with forecasting or modelling, is time series decomposition, where the series is decomposed into seasonal, trend, and remainder components. anomalize has a function, time_decompose(), to perform this. Once the components are decomposed, anomalize can detect and flag anomalies in the remainder component, which can then be visualized with plot_anomaly_decomposition().

btc_ts %>% 
  time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()

As you can see from the above code, the decomposition uses the ‘stl’ method, which is a common method of time series decomposition. But if you have been using Twitter’s AnomalyDetection, the same approach can be implemented in anomalize by combining time_decompose(method = "twitter") with anomalize(method = "gesd"). The ‘stl’ decomposition can also be combined with anomalize(method = "iqr") for a different, IQR-based anomaly detection.
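For reference, here is a minimal sketch of that Twitter-style variation, assuming the same btc_ts series and package setup as in the post (the detection arguments are kept identical to the code above):

library(dplyr)
library(anomalize)

# Twitter-style decomposition with GESD-based detection
btc_ts %>% 
  time_decompose(Price, method = "twitter", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()

# STL decomposition with IQR-based detection instead
btc_ts %>% 
  time_decompose(Price, method = "stl", frequency = "auto", trend = "auto") %>%
  anomalize(remainder, method = "iqr", alpha = 0.05, max_anoms = 0.2) %>%
  plot_anomaly_decomposition()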

Read on to see what else you can do with anomalize.


Using CLR To Call HTTP Endpoints From T-SQL

Jovan Popovic shows how to use cURL from within SQL Server with the CLR:

This code will add one function and one procedure in SQL Server that implement the following CURL functions:

  • CURL.XGET – a function that calls an API on an HTTP endpoint using the GET method and fetches the response. It has two parameters:
    • @H representing the header information that should be sent to the remote endpoint (NULL for none).
    • @url representing the endpoint where the HTTP request should be sent.
  • CURL.XPOST – a procedure that sends text to an HTTP endpoint using the POST method and prints the response. It has three parameters:
    • @H representing the header information that should be sent to the remote endpoint (NULL for none).
    • @d representing the data that should be sent to the remote endpoint in the request body.
    • @url representing the endpoint where the HTTP request should be sent.
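As a rough illustration (not from Jovan’s post), here is how those two objects might be invoked from R over DBI/odbc once the assembly is installed; the call syntax is inferred from the parameter list above, and the connection details, header, and URLs are placeholders.

library(DBI)
library(odbc)

# Placeholder connection to the SQL Server instance hosting the CLR objects
con <- dbConnect(odbc(), Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost", Database = "master",
                 Trusted_Connection = "Yes")

# GET: the scalar function returns the response body
dbGetQuery(con, "SELECT CURL.XGET(NULL, 'https://example.com/api/status') AS response;")

# POST: the procedure sends @d in the request body and prints the response
dbExecute(con, "EXEC CURL.XPOST @H = NULL,
                                @d = N'{\"hello\": \"world\"}',
                                @url = 'https://example.com/api/items';")

dbDisconnect(con)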

Click through for the process.  I’ve never been afraid of CLR and it is a rather useful tool, but pushing for CLR is definitely like swimming upstream.


Uploading Data Sets To Azure ML From R

Leila Etaati continues her series on the Azure ML R package by showing how to upload a data set:

There is a function in the AzureML package named “workspace” that creates a reference to an Azure ML Studio workspace, taking the authentication token and workspace ID as inputs.

To work with other AzureML functions, you need to pass this object to them.

For instance, to explore all of the experiments in Azure ML, there is a function named “experiments” that takes the “ws” object as input (to connect to the desired Azure ML environment), along with a filter.
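The general shape is something like this sketch, based on the AzureML package’s documented interface rather than Leila’s exact script (the workspace ID and authorization token are placeholders):

library(AzureML)

# Placeholder credentials from the Azure ML Studio settings page
ws <- workspace(
  id   = "<your workspace id>",
  auth = "<your authorization token>"
)

# Pass the workspace object to other AzureML functions,
# for example to list the experiments in that workspace
experiments(ws, filter = "all")

# And the subject of this post: upload a local data frame as a data set
upload.dataset(iris, ws, name = "iris-from-r")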

Click through for more.


Configuring Telegraf To Monitor Windows Servers

Tracy Boggiano continues her series on configuring Telegraf:

The solution for this is fairly simple now that you have set up Part 1 of this series.  You can download the Windows conf file for Telegraf from my presentation.  Below are the important pieces of the file. The main job of the OUTPUT PLUGINS section is to place the data in the InfluxDB database; the data will be housed in the same database as our SQL performance metrics. Next, you can collect any Windows performance counters you want and group them into a “Measurement”. I’m using the dashboard that is on the Grafana website, along with the performance metrics they have set up to be collected.

Click through for Tracy’s setup script to get an idea of which Perfmon counters she’s tracking.


Limitations Of Object-Level Security In Tabular Models

Teo Lachev gives us the skinny on object-level security in an Analysis Services Tabular model:

Object-level security (OLS) is a frequently requested security feature when implementing semantic models. For example, a current project disallows some salespeople from seeing sensitive measures, such as Margin and Profit. In Multidimensional, modelers could use cell security to meet such requirements, at the risk of compromising performance. Starting with SQL Server 2017 (compatibility level 1400), Tabular supports object-level security to disallow access to entire tables or specific columns without a performance penalty. As it stands, OLS has the following limitations (for a full list, review the documentation):

Click through for the list of limitations.


Synchronizing User Logins Across SQL Server Instances

Hamish Watson shows how easy it is to synchronize SQL authenticated logins using dbatools:

When building new servers, the most important thing after restoring and securing the database is syncing up the users. This is especially important for Availability Groups, as SQL-authenticated users require the SIDs to be the same.

In the past I had some very long-winded code that would do the sync – it was a mixture of T-SQL and PowerShell. It worked, but you know – it was cumbersome.

Read on to see how life gets easier with dbatools.


How The Hash Match Join Operation Works

Dmitry Piliugin explains what happens when SQL Server calls for a hash match to join two tables together:

Hash Match in join mode consumes two inputs, as we are joining two tables. The main idea is to build a hash table using the first (“build”) input, and then apply the same hashing approach to the second (“probe”) input to see whether there are matches of hashed values.

The Query Processor (QP) puts a lot of effort into choosing the correct join order while building the plan. From the Hash Match perspective, this means the QP should choose which table goes on the build side and which goes on the probe side. The build side should be the smaller one, as it will be stored in memory while building the hash table.

Building the hash table begins with hashing the join key values of the build table and placing them into one bucket or another depending on the hash value. Then the QP starts processing the probe side: it applies the same hash function to the probe values, determines the bucket, and compares the values inside that bucket. If there is a match, the row is returned.

That would be the whole story if we had infinite memory, but in the real world that is not true. More to the point, SQL Server allocates memory to the query before execution starts and does not change it during execution. That means that if the allocated memory is much smaller than the amount of data that arrives during execution, Hash Match should be able to partition the joining data and process it in portions that fit the allocated memory, while the rest of the data is spilled to disk, waiting to be processed. Here is where the dancing begins.
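To make the build/probe idea concrete outside of SQL Server, here is a toy in-memory version in R (illustration only; it uses an environment as the hash table and ignores memory grants and spills entirely):

# Toy hash join: bucket build-side row indices by key, then probe with the second input
hash_join_pairs <- function(build, probe, key) {
  ht <- new.env(hash = TRUE)

  # Build phase: hash each build-side key and append the row index to its bucket
  for (i in seq_len(nrow(build))) {
    k <- as.character(build[[key]][i])
    bucket <- if (exists(k, envir = ht, inherits = FALSE)) get(k, envir = ht) else integer(0)
    assign(k, c(bucket, i), envir = ht)
  }

  # Probe phase: hash each probe-side key, look in its bucket, and emit matching pairs
  pairs <- list()
  for (j in seq_len(nrow(probe))) {
    k <- as.character(probe[[key]][j])
    if (exists(k, envir = ht, inherits = FALSE)) {
      pairs[[length(pairs) + 1]] <- data.frame(build_row = get(k, envir = ht), probe_row = j)
    }
  }
  do.call(rbind, pairs)
}

# Example: which rows of two small tables match when joined on id
build <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
probe <- data.frame(id = c(2, 3, 3, 4), y = c(10, 20, 30, 40))
hash_join_pairs(build, probe, "id")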

Read on to learn more about the details of this operation.


Learn Machine Learning In Just 7 Years

Rwiddhi Chakraborty explains that machine learning isn’t a topic you pick up overnight:

Of course you could write a Hello World program in C++ in 24 hours, or a program to find the area of a circle in 24 hours, but that’s not the point. Do you grasp object oriented programming as a paradigm? Do you understand the use cases of namespaces and templates? Do you know your way around the famed STL? If you do, you certainly didn’t learn all this in a week, or even a month. It took you a considerable amount of time. And the more you learned, the more you realised that the abyss is deeper than it looks from the cliff.

I’ve found a similar situation in the current atmosphere surrounding Machine Learning, Deep Learning, and Artificial Intelligence as a whole. Feeding the hype, thousands of blogs, articles, and courses have popped up everywhere. Thousands of them have the same kind of headlines — “Machine Learning in 7 lines of code”, “Machine Learning in 10 days”, etc. This has, in turn, led people on Quora to ask questions like “How do I learn Machine Learning in 30 days?”. The short answer is, “You can’t. No one can. And no expert (or even one comfortable with its ins and outs) did.”

This is a good antidote to the “I read a blog post and now I’m an expert” mentality, which is particularly pernicious.


The Theory Behind ARIMA

Bidyut Ghosh explains how the ARIMA forecasting method works:

The earlier models of time series are based on the assumption that the time series variable is stationary (at least in the weak sense).

But in practice, most time series variables are non-stationary; they are integrated series.

This implies that you need to take either the first or second difference of the non-stationary time series to convert it into a stationary one.
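A quick R illustration of that differencing idea (my own toy example, not from Bidyut’s post; it uses the forecast package, which may or may not be what part 2 covers):

library(forecast)

set.seed(42)
y <- ts(cumsum(rnorm(200)))  # a random walk, i.e. an integrated (non-stationary) series

ndiffs(y)      # estimated number of differences needed for stationarity (1 here)
dy <- diff(y)  # the first difference is (weakly) stationary

auto.arima(y)  # selects p, d, q; d should come out as 1 for this series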

Bidyut ends with a little bit of implementation in R, but I’d guess that’ll be the focus of part 2.


Slicing In R

John Mount recommends learning about the array slicing system in R:

R has a very powerful array slicing ability that allows for some very slick data processing.

Suppose we have a data.frame “d”, and for every row where d$n_observations < 5 we wish to “NA-out” some other columns (mark them as not yet reliably available). Using slicing techniques this can be done quite quickly as follows.

library("wrapr")

d[d$n_observations < 5, 
  qc(mean_cost, mean_revenue, mean_duration)] <- NA

Read on for more.  In general, I prefer the pipeline mechanics offered with the Tidyverse for readability.  But this is a good example of why you should know both styles.
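For comparison, here is roughly how the same NA-out step looks in a dplyr pipeline, as a sketch against the same hypothetical data frame d (across() requires a newer dplyr than existed when this post was written, and the example assumes those columns are numeric):

library(dplyr)

d <- d %>%
  mutate(across(c(mean_cost, mean_revenue, mean_duration),
                ~ if_else(n_observations < 5, NA_real_, .x)))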
