Author: Kevin Feasel

Non-Linear Classifiers with Support Vector Machines

Rahul Khanna continues a series on support vector machines:

In this blog post, we will look in detail at how to use SVM for complex decision boundaries and build non-linear classifiers. The primary method for doing this is to use kernels.

In linear SVM, we find the margin-maximizing hyperplane with features Xi. Similarly, in logistic regression, we try to find the hyperplane that minimizes logistic loss with features Xi. Most often, the two techniques give much the same results. But linear SVM, and for the same reason logistic regression, fails where complex or non-linear decision boundaries are needed. SVM achieves these types of boundaries using kernels. So let us understand how SVM creates non-linear boundaries using kernels.
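To make the kernel idea concrete, here is a minimal sketch in plain Python. It is not taken from the linked post: the feature map `phi`, the toy data, and the threshold are all assumptions chosen for illustration. The points cannot be split by a single threshold on x, but after mapping each x to (x, x^2) a straight line in the new space separates them. The kernel function returns the same number as the dot product in that mapped space without building the mapped vectors, which is the shortcut a kernel SVM relies on.

```python
import math

def phi(x):
    """Hypothetical explicit feature map: 1-D input -> 2-D point (x, x^2)."""
    return (x, x * x)

def dot(u, v):
    """Ordinary dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(a, b):
    """Kernel matching phi: a*b + (a*b)^2 equals dot(phi(a), phi(b))."""
    return a * b + (a * b) ** 2

# Toy data (assumed): outer points are class +1, inner points are class -1.
# No single threshold on x separates the two classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def classify(x, threshold=2.0):
    """Linear rule in the mapped space: compare the x^2 coordinate to a threshold."""
    return 1 if phi(x)[1] > threshold else -1

predictions = [classify(x) for x in points]

# The kernel agrees with the explicit dot product for every pair of points.
kernel_matches = all(
    math.isclose(poly_kernel(a, b), dot(phi(a), phi(b)))
    for a in points
    for b in points
)
```

A real SVM would learn the separating hyperplane rather than use a hand-picked threshold; the sketch only shows why the mapped space makes a linear boundary possible.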

Read on to see how it works.

Comments closed

Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Data Type Conversions in Predicates

Bert Wagner takes us through a troublesome table design:

This table stores data for an application that has many different types of Pages. Each Page stores different types of data, but instead of creating a separate table for each type, we store all the different data in the varchar DataValue column and maintain the original data type in the DataType column.

This structure reduces the complexity required for maintaining our database (compared to creating possibly hundreds of tables, one for each PageName) and makes querying easier (only need to query one table). However, this design could also lead to some unexpected query results.
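One way a design like this can surprise you, sketched in Python rather than T-SQL: if the conversion runs before the filter on DataType, it hits values that were never meant to be numbers. The column names come from the excerpt; the rows, the function names, and the assumption that conversion order is the culprit are illustrative guesses, not details from the linked post. A SQL optimizer is free to pick the evaluation order of predicates, whereas Python's `and` always evaluates left to right, so the two orderings are spelled out explicitly here.

```python
# Hypothetical rows shaped like the table in the excerpt.
rows = [
    {"PageName": "Home", "DataValue": "42", "DataType": "int"},
    {"PageName": "About", "DataValue": "hello", "DataType": "varchar"},
    {"PageName": "Cart", "DataValue": "7", "DataType": "int"},
]

def convert_then_filter(rows, minimum):
    """Converts every DataValue first, so a non-numeric value raises ValueError."""
    return [
        r["PageName"]
        for r in rows
        if int(r["DataValue"]) > minimum and r["DataType"] == "int"
    ]

def filter_then_convert(rows, minimum):
    """Checks DataType first, so only integer-typed values are ever converted."""
    return [
        r["PageName"]
        for r in rows
        if r["DataType"] == "int" and int(r["DataValue"]) > minimum
    ]

safe_result = filter_then_convert(rows, 10)

try:
    convert_then_filter(rows, 10)
    conversion_failed = False
except ValueError:
    conversion_failed = True
```

The same two conditions produce either a result or an error depending only on which one runs first.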

This is your daily reminder that an attribute should be a thing which describes an entity, not one of multiple things.

Pulling Docker Images

Grant Fritchey starts us off slowly with containers:

The first command you have to learn is ‘docker pull’. You then have to supply something for it to pull, an image that will be used to create your containers. I’m using PowerShell for the commands I’m posting this week. Here’s how you get an image with SQL Server 2017:

docker pull mcr.microsoft.com/mssql/server:2017-latest

Click through to learn more.

Azure SQL Database and Extended Events

Dave Bland shows how to set up and read an extended event file on Azure SQL Database:

The first step when using T-SQL to read Extended Event files that are stored in an Azure Storage Account is to create a database credential. The credential provides the security information needed to connect to the Azure Storage Account. The first data point you will need is the URL to a blob storage container in your storage account. If you look below, you can see where you would place your storage account name and the blob storage container name.
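The excerpt does not show the URL itself, so here is a small Python sketch of how the two names fit together. It assumes the standard public-cloud blob endpoint (`blob.core.windows.net`); the function name and the example account and container names are hypothetical.

```python
def container_url(storage_account, container):
    """Build a blob container URL, assuming the public Azure blob endpoint."""
    return f"https://{storage_account}.blob.core.windows.net/{container}"

# Hypothetical names, for illustration only.
example_url = container_url("mystorageacct", "xevents")
```

Sovereign or private clouds use different endpoint suffixes, so treat the host name as something to confirm for your own account.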

Dave gives us the grand tour of the configuration process, including where things differ between on-prem SQL Server and Azure SQL Database (which is quite a bit).

Overriding Spark Dependencies

Landon Robinson shows how to override a Spark dependency located on the classpath:

This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:
– gson-2.2.4.jar: the method is private, and therefore too old for use here
– gson-2.6.1: the method is public, and works fine.
Somewhere between the two, the method’s status changed.

So, because I had some functionality that required the method be public and accessible, it was important I specify the right version in my dependency manager (SBT). “That’s easy,” I thought. “No problem.”

Spoilers: there was a problem.

Kafka and MirrorMaker

Renu Tewari describes what MirrorMaker does for Kafka today and what is coming with version 2:

Apache Kafka has become an essential component of enterprise data pipelines and is used for tracking clickstream event data, collecting logs, gathering metrics, and serving as the enterprise data bus in microservices-based architectures. Kafka is essentially a highly available and highly scalable distributed log of all the messages flowing in an enterprise data pipeline. Kafka supports internal replication to provide data availability within a cluster. However, enterprises require data availability and durability guarantees that cover entire cluster and site failures.

The solution thus far in the Apache Kafka community has been to use MirrorMaker, an external utility that helps replicate data between two Kafka clusters within or across data centers. MirrorMaker is essentially a Kafka high-level consumer and producer pair, efficiently moving data from the source cluster to the destination cluster and not offering much else. The initial use case that MirrorMaker was designed for was to move data from clusters to an aggregate cluster within a data center or to another data center to feed batch or streaming analytics pipelines. Enterprises have a much broader set of use cases and requirements on replication guarantees.

Read on for the list of benefits and upcoming features.

Collecting Hadoop Metrics from Multiple Clusters

Dmitry Tolpeko shows how you can collate Hadoop metrics from several Elastic MapReduce clusters:

The first step is to dynamically get the list of clusters and their IPs. Hadoop clusters are often reprovisioned, added and terminated, so you cannot rely on a static list of clusters and addresses. In the case of Amazon EMR, you can use the following Linux shell command to get the list of active clusters:

aws emr list-clusters --active

From its output you can get the cluster IDs and names. While a cluster's ID and IP can change over time, its name is usually permanent (like DEV or Adhoc-Analytics cluster), so it can be useful for various aggregation reports.
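As a rough sketch of that step, the command's JSON output can be turned into a name-to-ID lookup in Python. The sample below is hand-written rather than captured from a real account: the cluster IDs are made up, and only the `Clusters`, `Id`, `Name`, and `Status` fields of the response are shown. The function name is hypothetical.

```python
import json

# Hand-written sample in the shape of `aws emr list-clusters --active` output.
# IDs are invented; real responses carry more fields per cluster.
SAMPLE_OUTPUT = """
{
  "Clusters": [
    {"Id": "j-AAAAAAAAAAAA", "Name": "DEV", "Status": {"State": "WAITING"}},
    {"Id": "j-BBBBBBBBBBBB", "Name": "Adhoc-Analytics", "Status": {"State": "RUNNING"}}
  ]
}
"""

def clusters_by_name(cli_output):
    """Map each cluster's stable name to its current (changeable) cluster ID."""
    doc = json.loads(cli_output)
    return {c["Name"]: c["Id"] for c in doc.get("Clusters", [])}

clusters = clusters_by_name(SAMPLE_OUTPUT)
```

In practice the input string would come from running the command (for example through `subprocess.run` with captured output) instead of a constant. Keying on the name assumes names are unique across active clusters, which EMR itself does not enforce.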

Read on to see what you can do with this list of clusters.
