
Author: Kevin Feasel

Kafka Connect Done Easy

Robin Moffatt shows how to build a simple Kafka Connect flow:

This is pretty cool – the update_ts column is managed automagically by MySQL (other RDBMS have similar functionality), and Kafka Connect’s JDBC connector is using this to pick out new and updated rows from the database.
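
For context, the timestamp-based capture Robin describes is driven by a handful of settings on the JDBC source connector. Here is a rough sketch (the property names come from Confluent's kafka-connect-jdbc; the connection details and topic prefix are illustrative placeholders, not Robin's exact configuration):

name=mysql-source-example
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
# illustrative connection string; point this at your own MySQL database
connection.url=jdbc:mysql://localhost:3306/demo?user=connect_user&password=********
# timestamp mode: fetch rows whose update_ts is newer than the last value seen
mode=timestamp
timestamp.column.name=update_ts
# each captured table lands in a topic named <prefix><table>
topic.prefix=mysql-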

As a side note here, Kafka Connect tracks the offset of the data that it’s read using the connect-offsets topic. Even if you delete and recreate the connector, if the connector has the same name it will retain the same offsets previously stored. So if you want to start from scratch, you’ll want to change the connector name – for example, use an incrementing suffix for each test version you work with. You can actually check the content of the connect-offsets topic easily:
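
(The command itself isn’t reproduced in this excerpt. As a rough sketch, assuming a broker listening on localhost:9092 and the stock console tools on your path – kafka-console-consumer.sh in the Apache distribution – something like this will dump the topic, keys included:)

kafka-console-consumer --bootstrap-server localhost:9092 --topic connect-offsets --from-beginning --property print.key=true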

This is part 1 of a mini-series, but it does show you how to build connections to stream data from MySQL into Kafka and then into a flat file.

Comments closed

Power BI With What-If Analysis

Dusty Ryan shows off What-If Analysis in Power BI:

Next, I’m going to create a calculated measure that multiplies Revenue Last Year by Revenue Target % Value. Here’s the formula for the calculated measure:

Revenue Target = [Revenue Last Year] * [Revenue Target % Value]

Now I’m going to add this new measure, Revenue Target, to my line chart. And now when I use my slicer slider bar, I can dynamically change my Revenue Target line on the chart!
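
For reference, the slicer in Dusty’s example comes from a What-If parameter, which Power BI Desktop backs with a generated series table plus a SELECTEDVALUE measure. Roughly (the series bounds and default below are illustrative, not necessarily the ones Dusty chose):

-- the What-If parameter generates a one-column table of candidate percentages...
Revenue Target % = GENERATESERIES(0, 1.5, 0.01)
-- ...and a measure that returns whatever value the slicer currently selects
Revenue Target % Value = SELECTEDVALUE('Revenue Target %'[Revenue Target %], 1)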

This is pretty cool.  Definitely check out Dusty’s example; it’s something that might make many an executive happy.

Comments closed

Saving Statistics Sample Rates

Pedro Lopes shows off a new feature in the latest SQL Server 2016 CU:

When SQL Server creates or updates statistics and a sampling rate is not manually specified, SQL Server calculates a default sampling rate. Depending on the real distribution of data in the underlying table, the default sampling rate may not accurately represent the data distribution and then cause degradation of query plan efficiency.

To improve this scenario, a database administrator can choose to manually update statistics with a specific sampling rate that can better represent the distribution of data. However, a subsequent automatic update statistics operation will reset back to the default sampling rate, possibly reintroducing degradation of query plan efficiency.

With the most recent SQL Server 2016 SP1 CU4, we released an enhancement for the CREATE and UPDATE STATISTICS command – the ability to persist sampling rates between updates with a PERSIST_SAMPLE_PERCENT keyword.
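
As a quick sketch of the syntax (the table and statistics names here are made up; PERSIST_SAMPLE_PERCENT is the new piece):

-- sample 75 percent of the table and keep that rate for later automatic updates
UPDATE STATISTICS dbo.SalesOrderDetail IX_SalesOrderDetail_ProductID
    WITH SAMPLE 75 PERCENT, PERSIST_SAMPLE_PERCENT = ON;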

This seems rather useful.

Comments closed

Substrings: PowerShell Versus T-SQL

Shane O’Neill contrasts the SUBSTRING function in T-SQL with PowerShell’s Substring method:

The main difference that I can see when using SUBSTRING() in SQL Server versus in PowerShell is that SQL Server is very forgiving.

If you have a string that is 20 characters long and you ask for everything from the 5th character to the 100th character, SQL Server is going to look at this, see that the string does not go to the 100th character, and just give you everything that it can.
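
A quick illustration of that forgiving behavior (the string is just an example; the equivalent .Substring(4, 100) call on a 20-character .NET string would instead throw an ArgumentOutOfRangeException):

DECLARE @s varchar(20) = 'abcdefghijklmnopqrst';  -- exactly 20 characters

-- asks for 100 characters starting at position 5; the string ends at 20,
-- so SQL Server simply returns characters 5 through 20: 'efghijklmnopqrst'
SELECT SUBSTRING(@s, 5, 100);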

It’s a small difference but an important one.

Comments closed

Anti-Virus On Your Database Server?

Steve Stedman gives you food for thought if you need to run anti-virus software on your SQL Server instance:

In a perfect world, your SQL Server would be so secure that you would not need antivirus software: it would sit behind layers of firewalls, nobody would ever connect with remote desktop to install anything, and it would always have all of the latest security patches… But that is not the real world.

Given that your SQL Server oftentimes contains extremely valuable information, and that the damage that could be done by viruses, malware, and ransomware could be so great, it is strongly recommended that you run antivirus software on your SQL Server. There are some files that you will want to exclude from the virus check.

I’m not a big fan of running anti-virus software on database instances, but if you have to run it for whatever reason, be sure to check out Steve’s advice.

Comments closed

R Services 182 Error

Joey D’Antoni provides a solution to a tricky SQL Server R Services error:

Recently (and unfortunately I don’t have an exact date on when this started, though it was around Service Pack 1 install time), R Services began failing with the following error:

Error
Msg 39012, Level 16, State 1, Line 10
Unable to communicate with the runtime for ‘R’ script. Please check the requirements of ‘R’ runtime.
STDERR message(s) from external script:

DLL ‘C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER1601\MSSQL\Binn\sqlsatellite.dll’ cannot be loaded.
Error in eval(expr, envir, enclos) :
DLL ‘C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER1601\MSSQL\Binn\sqlsatellite.dll’ cannot be loaded.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted
STDOUT message(s) from external script:

Failed to load dll ‘C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER1601\MSSQL\Binn\sqlsatellite.dll’ with 182 error.

Click through to see how to resolve this issue.

Comments closed

Putting Measures On Rows In Power BI

Meagan Longoria shifts our perspective by 90 degrees:

Back in January 2016, I wrote a blog post explaining a DAX workaround that allows you to put measures on rows in a matrix in a Power BI report. I’m happy to say that you no longer need my workaround because you can now natively put measures on rows in a matrix in both Power BI Desktop and PowerBI.com.

This is accomplished via a new formatting option for the matrix.

Click through to see how to pull this off.

Comments closed

Date Conversions In Oracle And SQL Server

Daniel Janik compares Oracle and SQL Server date conversion functions:

There are many ways to create a date from a string. First you’ve got the CONVERT() function and the CAST() function. Next you’ve got DATEFROMPARTS(), DATETIMEFROMPARTS(), DATETIME2FROMPARTS(), SMALLDATETIMEFROMPARTS(), TIMEFROMPARTS(), and DATETIMEOFFSETFROMPARTS().

That’s a lot of functions for one simple task, isn’t it? To be fair, it’s really more than one simple task. Each of these functions is meant to be paired with the matching data type so you get just what you want. To go along with these, you also have the ISDATE() function, which tests the value to be sure it’s a date.
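
A few quick illustrations of the SQL Server side (the dates and style code are arbitrary; the Oracle equivalent of the first line would be something like TO_DATE('2017-08-29', 'YYYY-MM-DD')):

-- string to date, with an explicit style code describing the input format (23 = yyyy-mm-dd)
SELECT CONVERT(date, '2017-08-29', 23);
-- build date and time values from their integer parts
SELECT DATEFROMPARTS(2017, 8, 29);
SELECT DATETIMEFROMPARTS(2017, 8, 29, 13, 30, 0, 0);
-- returns 1 if the value can be interpreted as a date, 0 otherwise
SELECT ISDATE('2017-08-29');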

I never liked the verbosity of the Oracle TO_DATE() function…but I am biased.

Comments closed

Analyzing Twitter Data With Storm In HDInsight

Nischal S shows how to configure an HDInsight cluster to process tweets, followed by loading them into a Power BI dashboard:

When we need to process streams of real-time data, Storm is a great contender. Examples of streaming data are the number of consumer clicks and navigations on a website, IIS or user logs, IoT data, and social network information. In all these scenarios, we use real-time data processing. Apache Storm can process real-time unbounded streams of data.

The term “unbounded” defines streams of data with no start or end. Here, the processing of data is continuous and in real-time. Twitter is a good example. Twitter data is continuous, has no start or end time, and is provided in real-time by millions of Twitter users around the world.

Storm wouldn’t rank in my top three technologies for doing this, but it certainly does the job.

Comments closed

Diamond: Solving Generalized Linear Models Using Python

Tim Sweester and Aaron Bradley announce Diamond, a Python library which solves certain kinds of generalized linear models.  In a two-part series, they explain more.  Part 1 covers the mathematical principles behind it:

Many computational problems in data science and statistics can be cast as convex problems. There are many advantages to doing so:

  • Convex problems have a unique global solution, i.e. there is one best answer
  • There are well-known, efficient, and reliable algorithms for finding it

One ubiquitous example of a convex problem in data science is finding the coefficients of an L2-regularized logistic regression model using maximum likelihood. In this post, we’ll talk about some basic algorithms for convex optimization, and discuss our attempts to make them scale up to the size of our models. Unlike many applications, the “scale” challenge we faced was not the number of observations, but the number of features in our datasets. First, let’s review the model we want to fit.
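
For concreteness, that problem is the usual penalized maximum-likelihood objective (notation mine, with labels y_i in {-1, +1} and feature vectors x_i):

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i^{\top} \beta}\right) + \frac{\lambda}{2} \lVert \beta \rVert_2^2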

Part 2 looks at one interesting use case:

In this example, GLMMs allow you to pool information across different brands, while still learning individual effects for each brand. It breaks the problem into sets of fixed and random effects. The fixed effects are similar to what you would find in a traditional logistic regression model, while the random effects allow the regression relationship to vary for each brand. One of the advantages of GLMMs is that they learn how different brands are from each other. Brands that are very similar to the overall average will have small random effect estimates. Because of the regularization of these models, brands with few observations will also have small random effect estimates, and be treated more like the overall average. In contrast, for brands that are very different from the average, with lots of data to support that, GLMMs will learn large random effect estimates.
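
In the standard random-intercept form of such a model (notation mine, sketching the generic setup rather than the post’s exact specification), observation i for brand b(i) is modeled as

\operatorname{logit}(p_i) = x_i^{\top} \beta + u_{b(i)}, \qquad u_b \sim \mathcal{N}(0, \sigma_b^2)

where β holds the fixed effects shared across brands and each u_b is a per-brand random effect, shrunk toward zero when a brand has little data.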

Check it out.  Part 2 also contains a link to the GitHub repo if you want to try it on your own.

Comments closed