Kalman Filters With Spark And Kafka

Konur Unyelioglu goes deep into Kalman filters:

In simple terms, a Kalman filter is a theoretical model to predict the state of a dynamic system under measurement noise. Originally developed in the 1960s, the Kalman filter has found applications in many different fields of technology including vehicle guidance and control, signal processing, transportation, analysis of economic data, and human health state monitoring, to name a few (see the Kalman filter Wikipedia page for a detailed discussion). A particular application area for the Kalman filter is signal estimation as part of time series analysis. Apache Spark provides a great framework to facilitate time series stream processing. As such, it would be useful to discuss how the Kalman filter can be combined with Apache Spark.

In this article, we will implement a Kalman filter for a simple dynamic model using the Apache Spark Structured Streaming engine and an Apache Kafka data source. We will use Apache Spark version 2.3.1 (latest, as of writing this article), Java version 1.8, and Kafka version 2.0.0. The article is organized as follows: the next section gives an overview of the dynamic model and the corresponding Kalman filter; the following section will discuss the application architecture and the corresponding deployment model, and in that section we will also review the Java code comprising different modules of the application; then, we will show graphically how the Kalman filter performs by comparing the predicted variables to measured variables under random measurement noise; we’ll wrap up the article by giving concluding remarks.

This is going on my “reread carefully” list; it’s very interesting and goes deep into the topic.

Values Belong In Columns

Kevin Feasel

2018-09-04

R

John Mount argues that to reduce ambiguity, ensure that your values are columns on appropriate data frames:

Here is an (artificial) example.

chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
# (Intercept) chamber_sizes
# 2.937 4.104 

Notice: one of the variables came from a vector in the environment, not from the primary data.framechamber_sizes was first looked for in the data.frame, and then in the environment the formula was defined (which happens to be the global environment), and (if that hadn’t worked) in the executing environment (which is again the global environment).

Our advice is: do not do that. Place all of your values in columns. Make it unambiguous all variables are names of columns in your data.frame of interest. This allows you to write simple code that works over explicit data. The style we recommend looks like the following.

Read the whole thing.

Important Community Tools

Chrissy LeMaire shares some great community-driven Powershell modules:

dbops

dbops was created by Kirill Kravtsov.

dbops is a Powershell module that provides Continuous Integration/Continuous Deployment capabilities for SQL database deployments.

It is based on DbUp, which is DbUp is an open source .NET library that helps you to deploy changes to SQL Server databases. dbops currently supports both SQL Server and Oracle.

Read on for links to several more projects.

Using R With Excel

Kevin Feasel

2018-09-04

R

David Smith walks us through various ways to integrate R and Excel:

If you’re familiar with analyzing data in Excel and want to learn how to work with the same data in R, Alyssa Columbus has put together a very useful guide: How To Use R With Excel. In addition to providing you with a guide for installing and setting up R and the RStudio IDE, it provide a wealth of useful tips for working with Excel data in R, including:

  • To import Excel data into R, use the readxl package

  • To export Excel data from R, use the openxlsx package

  • How to remove symbols like “$” and “%” from currency and percentage columns in Excel, and convert them to numeric variables suitable for analysis in R

  • How to do computations on variables in R, and a list of common Excel functions (like RAND and VLOOKUP) with their R equivalents

  • How to emulate common Excel chart types (like histograms and line plots) using R plotting functions

David also shows how to run R within Excel.  One of the big benefits of readxl is that it doesn’t require Java; most other Excel readers do.

Installing External Python Modules In SQL Server

Kevin Feasel

2018-09-04

Python

David Fowler shows how to import an external Python module into SQL Server Machine Learning Services:

But how do we go about installing them into SQL Server?  Now I’m a DBA and not a Python wizz so had to do a little digging to figure it out but to be honest, it’s fairly easy.

I don’t know how many other DBAs know that we can install these modules or even how to do it so I thought I’d write up a quick post explaining it.

First things first, you’re going to need to do this from your SQL Server.

Read on for the instructions.

Getting Anatomical With SSIS Catalog Compare

Andy Leonard shows off parts of SSIS Catalog Compare.  First up is the catalog reference script:

As you may glean from this analysis of one script generated for SSIS Catalog References management, the Transact-SQL for scripting SSIS Catalog artifacts in SSIS Catalog Compare is rigorous and includes several existence and error-condition checks prior to performing any updates. The script is designed to be idempotent, as well, meaning the script will succeed and the results will be repeatable and predictable each time the script is executed – and that the script itself is re-executable.

Then an environment script:

The final portion of the script checks for the existence of the Environment Variables and responds accordingly. This is a three-step process:

  1. Drop the Environment Variable if it exists.
  2. Create the Environment Variable.
  3. Set the Environment Variable value.

If the Environment Variable exists the script drops it. Why? SSIS Catalog Compare wants to be sure the environment variable is created with the proper data type and initial values.

And connection literals:

These messages are intended to be copied and stored in the Notes field of a ticketing system in a DevOps enterprise. Note the detail contained herein:

  • Script Name – the path to the file used to perform the operation.

  • Generated From – the SQL Server instance of the SSIS Catalog host from which the script was generated.

  • Catalog Name – redundant at present because all SSIS Catalogs are named “SSISDB.”

  • Folder Name – the name of the SSIS Catalog Folder that contains the scripted artifact.

  • Project Name – the name of the SSIS Project that contains the scripted artifact.

  • Project Connection Name – the name of the SSIS Project Connection.

  • Generated By – the name of the enterprise account used to generate the artifact’s script.

    • Note: SSIS Catalog Compare respects the security model of the SSIS Catalog. Windows Authentication is required to perform many SSIS Catalog operations.
  • Generated Date – the date and time the script was generated.

  • Generated From – the version of CatalogBase used in the generation of the artifact script.

    • Executing On – the name of the machine on which CatalogBase was running.
  • Deployed to Instance – the SQL Server instance hosting the target SSIS Catalog.

  • Deploy Date – the date and time the deployment script was executed.

  • Deploy By – the enterprise account used to deploy the artifact script.

Andy has put a lot of thought into SSIS Catalog Compare, so go check it out.

SQL Operations Studio August Release

Alan Yu announces a new release of SQL Operations Studio:

SQL Operations Studio was announced for Public Preview on November 15th at Connect(), and this August release is the ninth major update since the announcement. If you missed it, the July release announcement is available here.

Highlights for this release include the following.

  • Announcing the SQL Server Import extension

  • SQL Server Profiler Session management

  • New community extension: First responder kit

  • Quality of Life improvements: Connection strings

  • Bug bash galore

That’s a nice set of improvements this month.

Power BI Without Active Directory

Ginger Grant shows us how we can expose Power BI dashboards without needing users to have Power BI or Active Directory accounts:

There are many companies which would like to provide Power BI reports which would allow customers to interactively work with their data, but they don’t want to create Power BI accounts for customers as that can be a lot of work from an administrative standpoint.  For the same reason, these customers are not added to the corporate network which means they are not added Active Directory.  For example, if Desert Isle SQL contracts with Acme Corporation to create a custom conference display, Acme might want to show me a report showing when the components were purchased, when they were modified and when the order is in process and when the order is completed.  How do I show a Power BI report containing information? From an application design perspective data from all of the customers should be stored in the same place and Desert Isle SQL should only see their orders when logging in to Acme’s site.

Ginger also covers a bit about the licensing cost of going down this route.

Burndown Charts In Power BI

Paul Turley shows how to create burn-down and Pareto charts in Power BI:

I’m managing an Agile team project using Microsoft Teams – the new project management platform integrated with Office 365.  Teams is a simple and useful project management tool but it’s new and light on features.  Using Power BI, we want to show the hourly task burn-down for each two-week sprint.  In JIRA and some other more mature project management platforms, the burn-down chart is a standard feature in the tool that shows the number of hours or story points remaining, compared to the estimated number for the sprint.  Just as I began working on that, a client asked for some help creating a Pareto chart and it occurred to me that burn-down and Pareto charts are very similar variations of the same type of chart presentation.  These are not so much chart types as they are a set of calculations and techniques for displaying a desired result.

Read the whole thing.

Finding Gaps In Identity Columns

Kevin Feasel

2018-09-04

T-SQL

Shaun J Stuart walks us through a couple of solutions for finding gaps in identity ranges:

Have you ever had random inserts into a large table fail? Most of the time, inserts happen fine, but every so often you get a failure with a “primary key violation” error? If your primary key is an integer column with the identity property, you may be wondering how this is possible.

What is likely happening is your table has grown very large or has been in use for a long time and your identity column ran out of numbers. An integer column has a maximum value of 2,147,483,647. Now an integer can start at -2,147,483,648, but most people  start at 0 or 1, so that leaves you with 2 billion numbers.

This is a specific sub-case of the more general gaps and islands problem.

Categories

September 2018
MTWTFSS
« Aug Oct »
 12
3456789
10111213141516
17181920212223
24252627282930