Monitoring Kafka Streaming Pipelines

Randhir Singh shows how to use open-source tools to monitor Kafka streaming pipelines:

The solution uses the following open-source tools. The solution architecture is illustrated below.

  • Apache Kafka Connect is a tool to stream data between Apache Kafka and other components.
  • InfluxDB which is a time series database from InfluxData. It will be used to store time series data from Kafka input and output topics.
  • Influx sink connector from Datamountaineer. It is a connector and sink to write events from Kafka to InfluxDB.
  • Chronograf is an open-source monitoring solution from InfluxData.

Click through for the solution.

Functions In Spark

Fisseha Berhane continues his RDDs vs DataFrames vs SparkSQL series, this time looking at functions:

Let’s use Spark SQL and DataFrame APIs ro retrieve companies ranked by sales totals from the SalesOrderHeader and SalesLTCustomer tables. We will display the first 10 rows from the solution using each method to just compare our answers to make sure we are doing it right.

All three approaches give the same results, though the SQL approach seems to me to be the easiest.

Building ML Services Models

Kevin Feasel

2018-01-25

R

I have part two of my three-part series on SQL Server Machine Learning Services modeling:

We used sp_execute_external_script to build a model and generate some number of days worth of predictions.  Now I’d like to break that up into two operations:  training a model and generating predictions.  The biggest reason I might want to do this is if model generation is very expensive relative to prediction generation.  It’s obviously not very expensive to build a random number generator following a Poisson distribution, but imagine we had a complicated neural net that took hours or days to train—we’d want to use that model over and over, as long as the model came close enough to representing the underlying reality.

So now we’re going to create a model.

Click through to see a more complete example, something closer to production-ready.

Make Your Powershell Cmdlets Hypothetical

Rob Sewell shows you how to implement Confirm, Verbose, and WhatIf parameters in Powershell cmdlets:

I have done this before because if the file does not exist then Set-Content will create a new file for you, but with this function I can check if the file exists first with the ValidateScript before running the rest of the function.

As you can see I add variables from my PowerShell code into the “The Item” and “The Change”. If I need to add a property of an object I use $($Item.Property).

So now, if I want to see what my new function would do if I ran it without actually making any changes I have -WhatIf added to my function automagically.

It’s easy to do and makes your code that much better.

Using Computed Columns For Pattern Matching

Erik Darling shows how you can use a computed column to speed up %something searches:

We were talking about computed columns, and one of our students mentioned that he uses computed columns that run the REVERSE() function on a column for easier back-searching.

What’s back-searching? It’s a word I just made up.

The easiest example to think about and demo is Social Security Numbers.

One security requirement is often to give the last four.

Obviously running this for a search WHERE ssn LIKE '%0000' would perform badly over large data sets.

One of the trickiest parts of performance tuning is understanding that the way people want to see data is not necessarily the way you should store the data.

SQL Persistent Storage In Azure Container Services

Andrew Pruski shows how to use Kubernetes persistent volumes in Azure Container Services:

I’ve been playing around with SQL Server running in Kubernetes in Azure Container Services (AKS)for a while now and I think that the technology is really cool.

You can get a highly available instance of SQL Server up and running with a few lines of code! Ok, there’s a bit of setup to do but once you get that out of the way, you’re good to go.

One thing that has been missing though, is persistent storage. Any changes made to the SQL instance would have been lost if the pod that it was running in failed and was brought back up.

Until now.

Click through to learn how.  It’s certainly not trivial, but Andrew does a good job showing us the step-by-step.

Cycling The SQL Server Error Log

Kenneth Fisher recommends cycling the SQL Server error log regularly:

You’ll note I have it set to keep 32 files. This way I have at least a months worth of information. It’s probably more than I need, but for my personal box, that’s what I chose. For your company, it’s going to depend on your needs. You can keep up to 99 files but that seems a bit excessive to me. Of course, these files aren’t all that large (usually) so it’s up to you.

I’d go with a bit more than 32—if you have three service restarts in March, you won’t have a full month’s worth of information.  Regardless of the exact number you choose, it’s a smart idea to cycle this regularly.

Diagnosing TempDB Growth

Ginger Keys has a few tips around managing tempdb:

So what happened to cause this inconsistent growth and how should I fix it?

Tempdb is used (and can grow) when users explicitly create objects like temp tables, variables, cursors, or stored procedures. Tempdb is also used when the database engine creates work tables for sorts or spooling, and row versioning operations.

It’s often difficult to diagnose what happened on a SQL Server after the fact. There are many resources online to determine what is currently using tempdb but not a lot of information from processes that may have run days ago.

Read on for some tips to help manage tempdb growth better.

Categories

January 2018
MTWTFSS
« Dec Feb »
1234567
891011121314
15161718192021
22232425262728
293031