Multi-Model Time Series Analysis

The folks at ELEKS discuss what to do when a single time series model just won’t cut it:

With the emergence of the powerful forecasting methods based on Machine Learning, future predictions have become more accurate. In general, forecasting techniques can be grouped into two categories: qualitative and quantitative. Qualitative forecasts are applied when there is no data available and prediction is based only on expert judgement. Quantitative forecasts are based on time series modeling. This kind of models uses historical data and is especially efficient in forecasting some events that occur over periods of time: for example prices, sales figures, volume of production etc.

The existing models for time series prediction include the ARIMA models that are mainly used to model time series data without directly handling seasonality; VAR models, Holt-Winters seasonal methods, TAR models and others. Unfortunately, these algorithms may fail to deliver the required level of the prediction accuracy, as they can involve raw data that might be incomplete, inconsistent or contain some errors. As quality decisions are based only on quality data, it is crucial to perform preprocessing to prepare entry information for further processing.

Treating time series data as a set of waveform functions can generate some very interesting results.

Unassigned Shards In Elasticsearch

Emily Chang shows how to find and fix unassigned shards in Elasticsearch:

As nodes join and leave the cluster, the master node reassigns shards automatically, ensuring that multiple copies of a shard aren’t assigned to the same node. In other words, the master node will not assign a primary shard to the same node as its replica, nor will it assign two replicas of the same shard to the same node. A shard may linger in an unassigned state if there are not enough nodes to distribute the shards accordingly.

To avoid this issue, make sure that every index in your cluster is initialized with fewer replicas per primary shard than the number of nodes in your cluster by following the formula below:
N >= R + 1

Where N is the number of nodes in your cluster, and R is the largest shard replication factor across all indices in your cluster.

Read the whole thing if you’re an Elasticsearch administrator.
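The N >= R + 1 rule from the excerpt is easy to check mechanically. Here is a minimal Python sketch of that arithmetic; the function names, index names, and replica counts are hypothetical illustrations rather than anything from Emily's post, and nothing here talks to a real cluster.

```python
def max_safe_replicas(node_count):
    """Largest replica count R that still satisfies N >= R + 1."""
    return max(node_count - 1, 0)


def indices_at_risk(node_count, replicas_by_index):
    """Names of indices whose replica count violates N >= R + 1."""
    return sorted(
        name
        for name, replicas in replicas_by_index.items()
        if node_count < replicas + 1
    )


# Hypothetical three-node cluster with made-up index settings.
replicas = {"logs": 1, "metrics": 2, "audit": 3}
at_risk = indices_at_risk(3, replicas)
print(at_risk)
```

In this made-up cluster, only the index asking for three replicas fails the check, because three nodes cannot hold a primary plus three replicas on separate nodes.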

OLAP On Hadoop

Tim Spann discusses OLAP options on the Hadoop stack:

Apache Kylin

For an introduction to this interesting Hadoop project, check out this article. Apache Kylin, originally from eBay, is a Distributed Analytics Engine that provides SQL and OLAP access to Hadoop datasets utilizing Hive and HBase. It can be called through SparkSQL as well, making for a very useful project. This project lets you work with PowerBI, Tableau and Excel, with more tool support coming soon. You can do MOLAP cubes and support many users with fast queries over billions of rows. Apache Kylin provides JDBC and ODBC drivers.

There are a few interesting options here.

Exporting To Flat Files

Kevin Feasel



Ben Weissman shows how to dump tables to flat files:

In our next step, we loop through all tables in that database (feel free to limit the results by playing with GetDatabaseSchema) and create a FlatFileFormat for each of them. We will include all columns except those with datatype Binary or Object. As flatfiles don’t really care about actual data formats, we will just define every column as a string with maximum length. We will also add an annotation with the table’s original name, the list of columns as well as a list of primary keys (we’ll need the latter for a later step :)):

Like most Biml-related things, it’s not that many lines of code, so check it out.

Cast Or Convert

Aaron Bertrand discusses the Cast and Convert functions:

Neither is really any more typing than the other, and they both work the exact same way under the covers. So it would seem that the choice between CAST and CONVERT is merely a subjective style preference. If that were always true, however, you probably wouldn’t be reading this post.

There are other cases where you need to use CONVERT in order to output correct data or perform the right comparison. Some examples:

Read on for examples.  My preference is CAST, mostly because it’s fewer characters to type.  But there are certainly advantages to using CONVERT.

Wait Stats

David Alcock provides an introduction to wait stats and why they’re useful for performance tuning:

So here are two different ways that we can use SQL Server’s wait statistics for troubleshooting purposes. Both views give us really useful information but both have different purposes. If we wanted to look back over time then the sys.dm_os_wait_stats will give us a view of wait time totals. Typically we would capture the information via a scheduled job and analyse the data for spikes during periods where issues might be suspected.

For performing real-time analysis of wait statistics then we should base queries on the sys.dm_os_waiting_tasks view where we can see accurate wait duration values as they are happening within our instance.

In my opinion wait statistics are the most important piece of information when troubleshooting SQL Server so learning about the different types is vital for anyone using SQL. Thankfully there is a wealth of really useful information about wait statistics out there; I’ve listed some of my favourite posts below.

Click through for an example, as well as links to more resources.
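Because sys.dm_os_wait_stats reports running totals, the "capture on a schedule and look for spikes" approach comes down to subtracting one snapshot from the next. The following is a minimal Python sketch of that subtraction, assuming each snapshot has already been loaded into a dictionary of wait type to wait_time_ms; the function name and the sample numbers are hypothetical, and this is not David's code.

```python
def wait_deltas(earlier, later):
    """Wait time (ms) accumulated per wait type between two snapshots.

    Both arguments map wait type -> cumulative wait_time_ms.
    Wait types with no growth are left out; a negative difference
    (for example after the counters were reset) is also skipped.
    """
    deltas = {}
    for wait_type, total_ms in later.items():
        delta = total_ms - earlier.get(wait_type, 0)
        if delta > 0:
            deltas[wait_type] = delta
    # Largest increase first, so the likely culprit is at the top.
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical snapshots taken by a scheduled job.
earlier = {"PAGEIOLATCH_SH": 1000, "CXPACKET": 5000, "WRITELOG": 200}
later = {"PAGEIOLATCH_SH": 4000, "CXPACKET": 5500, "WRITELOG": 200, "LCK_M_X": 900}

for wait_type, delta_ms in wait_deltas(earlier, later).items():
    print(wait_type, delta_ms)
```

Comparing deltas rather than raw totals matters because a wait type with a huge lifetime total may not have grown at all during the period you care about.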

Grouping And Binning

Reza Rad discusses a couple new additions to Power BI, grouping and binning:

Binning is grouping a numeric field based on a division. This type of grouping is called Banding as well. For example you might have customers with different yearlyIncome range from $10,000 to $100,000 and you want to create a banding by $25,000. This will generate 4 groups of yearly income for you. This is exactly what Binning in Power BI does. Let’s look at the example.

Create a Table in Power BI Report and visualize YearlyIncome (from DimCustomer), and SalesAmount (from FactInternetSales) in it. Change the aggregation of YearlyIncome from Sum to Do Not Summarize.

You could already build this yourself, but I’m glad they introduced this, as it’s an easier solution.
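For a sense of what building it yourself looks like, here is a minimal Python sketch of fixed-width binning: each income is floored to the start of its $25,000 band and sales are summed per band. The function names and sample rows are hypothetical, and Power BI may draw its bin boundaries differently than this simple floor-division does.

```python
from collections import defaultdict


def bin_start(value, bin_size):
    """Lower edge of the fixed-width bin that contains value."""
    return (value // bin_size) * bin_size


def sales_by_income_band(rows, bin_size=25_000):
    """Sum sales per income band; rows are (yearly_income, sales_amount)."""
    totals = defaultdict(float)
    for yearly_income, sales_amount in rows:
        totals[bin_start(yearly_income, bin_size)] += sales_amount
    return dict(sorted(totals.items()))


# Hypothetical customer rows: (yearly income, sales amount).
rows = [(10_000, 100.0), (30_000, 250.0), (40_000, 50.0), (90_000, 400.0)]
bands = sales_by_income_band(rows)
print(bands)
```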

Logical Windowing

Lukas Eder discusses window functions:

Now, let’s assume I’m interested in these things:

  1. How many payments were there in the same hour as any given payment?
  2. How many payments were there in the same hour before any given payment?
  3. How many payments were there within one hour before any given payment?

Those are three entirely different questions.

Lukas’s solution uses Oracle syntax, but most of it also applies to SQL Server 2012 and higher.  The part that doesn’t apply, unfortunately, is the RANGE BETWEEN INTERVAL, which allows you to find values clustered in the same time period (one hour in his example).
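To see why the three questions differ, here is a small Python sketch that answers each one by brute force over a list of timestamps. It is an illustration of the logic only, not Lukas's SQL: the function names and sample payments are hypothetical, and I assume "before" means strictly earlier and excludes the payment itself, whereas a SQL window frame ending at CURRENT ROW would include it.

```python
from datetime import datetime, timedelta


def hour_bucket(ts):
    """Truncate a timestamp to the start of its clock hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def window_counts(payments):
    """Per payment: (same hour, same hour and earlier, within one hour before)."""
    results = []
    for ts in payments:
        same_hour = sum(1 for o in payments if hour_bucket(o) == hour_bucket(ts))
        same_hour_before = sum(
            1 for o in payments if hour_bucket(o) == hour_bucket(ts) and o < ts
        )
        # Sliding window: the 60 minutes leading up to this payment,
        # regardless of where the clock-hour boundary falls.
        within_hour_before = sum(
            1 for o in payments if ts - timedelta(hours=1) <= o < ts
        )
        results.append((same_hour, same_hour_before, within_hour_before))
    return results


# Hypothetical payment times on a single day.
payments = [
    datetime(2016, 11, 1, 10, 5),
    datetime(2016, 11, 1, 10, 50),
    datetime(2016, 11, 1, 11, 10),
    datetime(2016, 11, 1, 11, 55),
]
for ts, counts in zip(payments, window_counts(payments)):
    print(ts.time(), counts)
```

The 11:10 payment shows the difference: nothing earlier falls in its clock hour, but the 10:50 payment is within the preceding sixty minutes. That sliding window is the part RANGE BETWEEN INTERVAL handles in Oracle.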

Hypothetical Indexes

Kenneth Fisher discusses hypothetical indexes:

I saw something like this the other day. My first thought was “Hu, never seen that before.” My second thought was “Wow, that’s really cool. I wonder what a hypothetical index is?” A quick search later and I discovered that the DTA (database tuning adviser) uses them to test out what indexes will work best. A pretend (one might almost say hypothetical) index is created, with statistics, but without the actual index structure. Then a query plan is created allowing for that index.

This is pretty cool since creating a real index can take quite a bit of time, particularly on a really large table. It would be nice to be able to tell SQL that an index exists and try it out before actually spending the time creating it. I’d learned about a DB2 method of doing this a while back but wasn’t aware of one for SQL Server. In part that’s because it’s undocumented. Because the commands I’m going to use here are undocumented standard warnings apply.

That’s completely new to me.


November 2016