Press "Enter" to skip to content

Curated SQL Posts

Machine Learning From Kafka

Kai Waehner has a post covering a recent talk he did on using Kafka as a data source for neural networks:

This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem – and why this is a great fit for machine learning at extreme scale.

The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.

Sensor analytics for predictive alerting in real time is used as real world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.

Check out the slide deck as well for more details.

Comments closed

Overfitting With Polynomial Regression

Vincent Granville shows us a few problems with polynomial regression:

Even if the function to be estimated is very smooth, due to machine precision, only the first three or four coefficients can be accurately computed. With infinite precision, all coefficients would be correctly computed without over-fitting. We first explore this problem from a mathematical point of view in the next section, then provide recommendations for practical model implementations in the last section.

This is also a good read for professionals with a math background interested in learning more about data science, as we start with some simple math, then discuss how it relates to data science. Also, this is an original article, not something you will learn in college classes or data camps, and it even features the solution to a linear regression involving an infinite number of variables.

Granville’s point that overfitting is a relatively small concern is rather interesting.  But the advice to avoid polynomial regression is generally pretty solid.

Comments closed

Building Flow Charts In R

Alan Haynes shows how to build flow charts in R using the grid Gmisc packages:

Flow charts are an important part of a clinical trial report. Making them can be a pain though. One good way to do it seems to be with the grid and Gmisc packages in R. X and Y coordinates can be designated based on the center of the boxes in normalized device coordinates (proportions of the device space – 0.5 is this middle) which saves a lot of messing around with corners of boxes and arrows.

A very basic flow chart, based very roughly on the CONSORT version, can be generated as follows…

Click through for sample code and a resulting image.  H/T R-bloggers

Comments closed

Building Palettes From Pictures In R

Andrea Cirillo takes inspiration from the great works to build palettes:

If you see this painting you will find a profound of colours with a great equilibrium between different hues, the hardy usage of complementary colours and the ability expressed in the “chiaroscuro” technique. While I was looking at the painting I started, wondering how we moved from this wisdom to the ugly charts you can easily find within today’s corporate reports ( find a great sample on the WTF visualization website)

This is where Paletter comes from: bring the Renaissance wisdom and beauty within the plots we produce every day.

Introducing paletter

PaletteR is a lean R package which lets you draw from any custom image an optimized palette of colours. The package extracts a custom number of representative colours from the image. Let’s try to apply it on the “Vergine con il Bambino, angeli e Santi” before looking into its functional specification.

It’s an interesting package.  I’ll have to play around with it.

Comments closed

SSMS 17.7 Released

Alan Yu announces SQL Server Management Studio 17.7:

In addition to enhancements and bug fixes, SSMS 17.7 comes with several exciting new features:

  • Support package scheduling in Azure-SSIS integration runtime.

  • Support for SSIS package scheduling in SQL Agent on SQL Managed instance. It is now possible to create SQL Agent jobs to execute SSIS packages on the managed instance.

  • Replication monitor now supports registering a listener for scenarios where publisher database and/or distributor database is part of Availability Group. So with this release of SSMS, you can monitor replication environments where publisher database and/or distribution database is part of Always On.

There are also several bugfixes that they call out.

Comments closed

Saving A SQL Server Image In The Azure Container Repository

Andrew Pruski shows us how to store Docker images in Azure:

I’m going to push my custom dbafromthecold/sqlserverlinuxagent image. It’s a public image so if you want to use it, just run: –

docker pull dbafromthecold/sqlserverlinuxagent:latest

So similar to pushing to the Docker hub, we need to tag the image with the login server name that we retrieved a couple of commands ago and the name of the image: –

docker tag dbafromthecold/sqlserverlinuxagent apcontainerregistry01.azurecr.io/sqlserverlinuxagent:latest

Read on for a full example.

Comments closed

Getting Per-Table Space Utilization With Powershell

Drew Furgiuele provides us a script and a homework assignment:

Of course, PowerShell excels at this. By using the SQL Server module, it’s really easy to:

  • Connect to an instance and collect every user database, and
  • From each database, collect every table, and
  • For each table, collect row counts and space used, and
  • If there are any indexes, group them, and sum their usage and report that as well

Here’s the script. Note that I have the server name hard-coded in there as localhost (more on that in a coming paragraph). Go ahead and take a look before we break it down.

Click through for the script, and homework is due next Tuesday on his desk.

Comments closed

Data Types In R

Ellen Talbot gives us an overview of the different data types in R:

Now here’s something we didn’t cover in the video and is especially helpful if something just WILL NOT work and you’ve spent all morning panic eating biscuits.

You can write checks to see if something is numeric, or an integer, with is.numeric() or is.integer().

The general “‘is.XXXXX()’” function will take many of the data types we cover here and more, and can be a real time/life saver.

We could also use class() here and inspect the result.^[You might recall that class(1) had the result of “numeric” – R was not by default considering 1 as an integer for the purpose of the class() function. ### Special numbers As well as i to denote imaginary numbers, there are some additional symbols you might encounter or want to use.

There’s a video as well as a full blog post.

Comments closed

Backing Up Cloudera Search Data

Eva Nahari explains different techniques to back up Cloudera Search data, as well as setting up disaster recovery:

If you have the raw data in HDFS (which most do, and which you should!), the most straightforward way to have a hot-warm disaster recovery setup is to use our Backup and Disaster Recovery tool. It allows you to set up regular incremental updates between two clusters. You then have the option of using MapReduce Indexer or Spark Indexer to regularly index the raw data in your recovery cluster and append to a running Solr service in that same recovery cluster. This way you can easily switch over from one Solr service to the backup Solr service if you experience downtime in the original cluster.

The lag would be depending on the network between the clusters and how frequent you transfer data between the clusters. To some extent it would also depend on how long time you need (i.e. how much resources you have available) to complete the MapReduce or Spark indexing workload and append it (using the Cloudera Search GoLive feature) into Solr active indexes on the recovery site.

Read on for several options.

Comments closed