Press "Enter" to skip to content

Author: Kevin Feasel

Bayesian vs Frequentist Approaches to Machine Learning

Ajit Jaokar has an interesting series. Here’s part one:

The arguments / discussions between the Bayesian vs frequentist approaches in statistics are long running. I am interested in how these approaches impact machine learning. Often, books on machine learning combine the two approaches, or in some cases, take only one approach. This does not help from a learning standpoint.  

So, in this two-part blog we first discuss the differences between the Frequentist and Bayesian approaches. Then, we discuss how they apply to machine learning algorithms.

Part two extends from there:

Sampled from a distribution: Many machine learning algorithms make assumptions that the data is sampled from a frequency. For example, linear regression assumes gaussian distribution and logistic regression assumes that the data is sampled from a Bernoulli distribution. Hence, these algorithms take a frequentist approach

My biases push me toward Bayesian approaches, and I really like what I see in Stan, but these techniques do often require a lot more processing power.

1 Comment

Delivering Data Insights using the Microsoft Data Platform

Paul Andrew has a talk:

Let’s start with a story, not a ‘once upon a time story‘, a story for your backlog 

As a solution architect
I need to design and build an Azure data analytics platform end to end
to deliver data insights for my customer.

In February 2021 I delivered a talk as part of the Scottish Summit conference on how you could/should build an end to end data platform solution in Azure to deliver data insights and analytics. This is one of my favourite sessions so thought it worth re-sharing the recording here.

Click through for the abstract as well as the video.

Comments closed

Testing Kafka with Kerberos and SSH

Daniel Osvath has a guide for us:

Kerberos authentication is widely used in today’s client/server applications; however getting started with Kerberos may be a daunting task if you don’t have prior experience. Information on setting up Kerberos with an SSH server and client on the web is fragmented and hasn’t been presented in a comprehensive end-to-end way on a simple local setup.

At Confluent, several of our connectors for Apache Kafka® support Kerberos-based authentication. For development and testing of these connectors, we often leverage containers due to their fast, iterative benefits. This tutorial aims to provide a simple setup for a Kerberos test environment with SSH for a passwordless authentication that uses Kerberos tickets. You may use this as a guide for testing the Kerberos functionality of SSH-based client-server applications in a local environment or as a hands-on tutorial if you’re new to Kerberos. To understand the basics of Kerberos before diving into this tutorial, you may find this video helpful. Additionally, if you are looking for a non-SSH-based setup, the setup below for the KDC server container may also be useful.

Click through for two approaches to the problem.

Comments closed

Clarifying Key Errors on Graph Tables in SQL Server

Louis Davidson clears up some noise:

This causes the following error message:

Msg 2627, Level 14, State 1, Line 14Violation of UNIQUE KEY constraint 'AKEdge'. Cannot insert duplicate key in object 'dbo.Edge'. The duplicate key value is (455672671, 0, 455672671, 1).

So what is this: (455672671, 0, 455672671, 1)?

Click through to understand what this all means. Louis also has a quick procedure which looks up those details.

Comments closed

Ignite 2021 Data and AI Announcements

James Serra has a roundup of the Data & AI announcements at Microsoft Ignite 2021:

Azure Arc enabled machine learning (preview): Build models on-premises, in multi-cloud, and at the edge with Azure Arc. It does this by deploying Azure Machine Learning to a Kubernetes cluster that resides on-prem or in another cloud. More info

Azure Migrate new features: Discover and assess your SQL servers and their databases for migration to Azure from within the Azure Migrate portal, get target SKU recommendations, and estimate monthly costs. More info

Azure SQL enhancementsMaintenance window – Azure SQL Database and Azure Managed Instance fixed maintenance windows (preview) – More infoAdvance notifications – Enable you to prepare for planned maintenance events on your Azure SQL Database resources and minimize the impact of database failover on your sensitive workloads – More info

Click through for details on each of the announcements.

Comments closed

Analyzing XGBoost Training Reports

Simon Zamarin, et al, walk us through using XGBoost reports in Amazon’s Sagemaker Debugger:

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.

This post shows you an end-to-end example of training an XGBoost model on Sagemaker and how to enable the automatic XGBoost report functionality in Sagemaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on information provided by the report. The code from this post is available in the GitHub repo.

Click through for an example of this in action.

Comments closed

Apache Spark 3.1 Released

Hyukjin Kwon, et al, announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030) and EXPLAIN (SPARK-32337). Starting from this release, in Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e. using EXPLAIN FORMATTED)

There have been quite a few advancements around the SQL side.

Comments closed

Viewing Stats Used in Creating Execution Plans

Matthew McGiffen shows us how to find the statistics used when generating an execution plan:

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with Statistics really helps when you are performance tuning

One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented traceflag 8666 which can be used to save internal debugging informational into the plan XML – including details of the Statistics objects used. 

Click through for a couple of caveats about this, as well as a primer on how to see those precious statistics.

Comments closed

Contrasting Redis and Memcached

Shane Ducksbury has a showdown:

Caching is an important step in increasing the performance of many applications. It can be difficult to determine which caching solution is the best for your use case. Two likely contenders that will often make an appearance in your search for the answer are Redis vs Memcached.

In the green corner is Memcached (est. 2003), the classic, high performance caching solution. In the red corner is Redis, a slightly newer (est. 2009) but very mature and feature-rich caching in-memory database. Below, we’ll take a look at the differences between the two to help you make the decision which to choose.

I’ve personally had better experiences with Redis; my experience is that both work well from low to high load, and both have a tendency fall over at really high load.

Comments closed

Always Encrypted Setup

Chad Callihan takes us through an example of configuring Always Encrypted:

Always Encrypted can encrypt columns with deterministic encryption or randomized encryption. Your choice on which is better for you depends on how you plan to use the encrypted data. Deterministic encryption will produce the same encrypted value every time whereas randomized will not have the same encrypted value.

If you want to encrypt records but will also want to be querying encrypted records, you’ll want to choose deterministic for more efficient queries. Deterministic encryption will still allow point lookups, equality joins, grouping, and indexing when querying data.

Click through for the step-by-step process.

Comments closed