Using Kafka To Drive Machine Learning

Kai Waehner has a nice architectural post on using Kafka as the focal point for machine learning training and prediction:

The essence of this architecture is that it uses Kafka as an intermediary between the various data sources from which feature data is collected, the model building environment where the model is fit, and the production application that serves predictions.

Feature data is pulled into Kafka from the various apps and databases that host it. This data is used to build models. The environment for this will vary based on the skills and preferred toolset of the team. The model building could be a data warehouse, a big data environment like Spark or Hadoop, or a simple server running python scripts. The model can be published where the production app that gets the same model parameters can apply it to incoming examples (perhaps using Kafka Streams to help index the feature data for easy usage on demand). The production app can either receive data from Kafka as a pipeline or even be a Kafka Streams application itself.

This is approximately 80% of my interests wrapped up in one post, so of course I’m going to read it…

Non-Cost-Based Optimizations In Relational Databases

Lukas Eder has a big article on ten query optimizations that don’t involve looking at statistics or query costs:

This optimisation is really silly, but hey, why not. If users write impossible predicates, then why even execute them? Here are some examples:

-- "Obvious"
SELECT * FROM actor WHERE 1 = 0
-- "Subtle"

The first query should obviously never return any results, but the same is true for the second one, because while NULL IS NULL yields TRUE, always, NULL = NULL evaluates to NULL, which has the same effect as FALSE according to three-valued logic.

This doesn’t need much explanation, so let’s immediately jump to see which databases optimise this:

I was a bit surprised at how well DB2 did in this set.

Spark Cluster Managers

Sangeet Agulia walks us through the cluster managers available within Spark:

Spark Offers three types of Cluster Managers :

1) Standalone

2) Mesos

3) Yarn

4) Kubernetes (experimental) – In addition to the above, there is experimental support for Kubernetes. Kubernetes is an open-source platform for providing container-centric infrastructure.

Read on for a description of the top three cluster managers.

SQL Server R Services Troubleshooting

Ginger Grant walks us through a few troubleshooting tactics with SQL Server 2016 R Services:

Much to my surprise after this I received an error

Msg 39019, Level 16, State 1, Line 1
An external script error occurred:
Unable to launch the runtime. ErrorCode 0x80070490: 1168(Element not found.).
Msg 11536, Level 16, State 1, Line 1
EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.

I looked in the log files and didn’t find any errors.  I checked the configuration manager to ensure that I had some user ids configured in the configuration manager.  Nothing seemed to make any difference.  Looking online, the only error that I saw which might possibly be close was a different error message about 8.3 naming and the working directory.

This service is somewhat finicky to set up in my experience, though once you have it configured, it tends to be pretty stable.

Cross-Platform Variables In Powershell Core

Max Trinidad shows how to use the Is* variables in Powershell Core to write cross-platform code:

Use the cmdlet Get-Variable to find them, and keep in mind, these variables are not found in Windows PowerShell 5.x.

Get-Variable Is*

Although, the results will display four variable, but let’s pay attention to three of them. Below are the variables with their default values:

IsLinux                            False
IsOSX                              False
IsWindows                    True

These three variables can help in identifying which Operating System the script are been executed.  This way just adding the necessary logic, in order to take the correct action.

Read on for a code example showing how to use these variables.

Docker Stop Versus Docker Kill

Andrew Pruski explains why docker kill is so much faster than docker stop:

When running demos and experimenting with containers I always clear down my environment. It’s good practice to leave a clean environment once you’ve finished working.

To do this I blow all my containers away, usually by running the docker stop command.

But there’s a quicker way to stop containers, the docker kill command.

Sending SIGTERM isn’t particularly polite and doesn’t let processes clean up, which could leave your process in an undesirable state during future runs.  But if you’re just re-deploying a container, you don’t really care about the prior state of the now-disposed container.

Using Query Performance Insight To Find High-IO Queries

Jim Donahoe shows how he used Azure’s Query Performance Insight to eliminate 10 billion logical reads:

To access QPI, you simply need to click on the database you want to work with. Once you click on your database, scroll down in the portal to Query Performance Insight(QPI). Once QPI opens, you will see three options to sort on: CPU, DATA I/O, and LOG I/O.  You can also set the timeframe to view, I set for 24 hours.  Now, I have my timeline of 24 hours, and I am able to view which queries had the highest DATA I/O. I made a list of the top 3 from each category(CPU, DATA I/O, and LOG I/O) and presented it to my client. I presented the number of times it was executed, and the usage it utilized each time(all from the QPI information). The client then sent me 10 queries they wanted tuned and listed them in a prioritized list.

Well, by the end of tuning their 3 highest priority queries, we removed over 10 billion logical reads!  Yep, 10 BILLION! The client was very happy with our results and is currently awaiting the preview Standard Elastic Pools to come out of Preview and become GA. I have provided a few screenshots of an AdventureWorksLT database on my personal instance just to show you how to access QPI, and change metrics.

Click through for a demo.

Estimates Outside The Histogram

Lonny Niederstadt is building up some information on how the cardinality estimator works when it needs to generate an estimate outside the histogram it has:

SQL Server keeps track of how many inserts and deletes since last stats update – when the number of inserts/deletes exceeds the stats update threshold the next time a query requests those stats it’ll qualify for an update.  Trace flag 2371 alters the threshold function before SQL Server 2016. With 2016 compatibility mode, the T2371 function becomes default behavior.  Auto-stats update and auto-stats update async settings of the database determine what happens once the stats qualify for an update.  But whether an auto-stats update or a manual stats update, the density, histogram, etc are all updated.

Trace flags 2389, 2390, 4139, and the ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS hint operate outside the full stats framework, bringing in the quickstats update.  They have slightly different scope in terms of which stats qualify for quickstats updates – but in each case its *only* stats for indexes, not stats for non-indexed columns that can qualify.  After 3 consecutive stats updates on an index, SQL Server “brands” the stats type as ascending or static, until then it is branded ‘unknown’. The brand of a stat can be seen by setting trace flag 2388 at the session level and using dbcc show_statistics.

Right now there are just a few details and several links, but it does look like he’s going to expand it out.

Trusted Assemblies And Module Signing

Solomon Rutzky continues his SQLCLR and trusted assemblies series:

Ownership chaining is quite handy as it makes it easier to not grant explicit permissions on base objects (i.e. Tables, etc) to everyone. Instead, you just grant EXECUTE / SELECTpermissions to Stored Procedures, Views, etc.

However, one situation where ownership chaining does not work is when using Dynamic SQL. And, any SQL submitted by a SQLCLR object is, by its very nature, Dynamic SQL. Hence, any SQLCLR objects that a) do any data access, even just SELECT statements, and b) will be executed by a User that is neither the owner of the objects being accessed nor one that has been granted permissions to the sub-objects, needs to consider module signing in order to maintain good and proper security practices. BUT, the catch here is that in order to sign any Assembly’s T-SQL wrapper objects, that Assembly needs to have been signed with a Strong Name Key or Certificate prior to being loaded into SQL Server. Neither “Trusted Assemblies” nor even signing the Assembly with a Certificate within SQL Server suffices for this purpose, as we will see below.

Read on for more details.


October 2017
« Sep