Press "Enter" to skip to content

Author: Kevin Feasel

Analyzing XGBoost Training Reports

Simon Zamarin, et al, walk us through using XGBoost reports in Amazon’s Sagemaker Debugger:

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.

This post shows you an end-to-end example of training an XGBoost model on Sagemaker and how to enable the automatic XGBoost report functionality in Sagemaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on information provided by the report. The code from this post is available in the GitHub repo.

Click through for an example of this in action.

Leave a Comment

Apache Spark 3.1 Released

Hyukjin Kwon, et al, announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030) and EXPLAIN (SPARK-32337). Starting from this release, in Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e. using EXPLAIN FORMATTED)

There have been quite a few advancements around the SQL side.

Leave a Comment

Contrasting Redis and Memcached

Shane Ducksbury has a showdown:

Caching is an important step in increasing the performance of many applications. It can be difficult to determine which caching solution is the best for your use case. Two likely contenders that will often make an appearance in your search for the answer are Redis vs Memcached.

In the green corner is Memcached (est. 2003), the classic, high performance caching solution. In the red corner is Redis, a slightly newer (est. 2009) but very mature and feature-rich caching in-memory database. Below, we’ll take a look at the differences between the two to help you make the decision which to choose.

I’ve personally had better experiences with Redis; my experience is that both work well from low to high load, and both have a tendency fall over at really high load.

Leave a Comment

Viewing Stats Used in Creating Execution Plans

Matthew McGiffen shows us how to find the statistics used when generating an execution plan:

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with Statistics really helps when you are performance tuning

One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented traceflag 8666 which can be used to save internal debugging informational into the plan XML – including details of the Statistics objects used. 

Click through for a couple of caveats about this, as well as a primer on how to see those precious statistics.

Leave a Comment

SQL Server Configuration Settings Requiring Restarts

Randolph West enumerates configuration settings which do (or do not) require a restart of the SQL Server database engine:

SQL Server is a complex beast, with many configuration options that can range from recommended to completely avoided.

Since the release of SQL Server 2016, several options that were recommended post-install have been rolled into the default installation options and no longer need to be done, and similar changes were made with SQL Server 2017. Even so, there are configuration changes we data professionals need to make after installation, during maintenance windows, and sometimes even during operating hours, so here’s a handy list of changes that do and don’t require a restart of your operating system or SQL Server instance.

As Randolph mentions, this set is not conclusive. For example, enabling PolyBase requires a restart of the database engine. I believe enabling ML Services technically does not, though I do out of caution because the back of my mind remembers something weird about the service’s behavior if you don’t restart the database engine, but p > 0 my brain made up the whole thing.

Leave a Comment

Always Encrypted Setup

Chad Callihan takes us through an example of configuring Always Encrypted:

Always Encrypted can encrypt columns with deterministic encryption or randomized encryption. Your choice on which is better for you depends on how you plan to use the encrypted data. Deterministic encryption will produce the same encrypted value every time whereas randomized will not have the same encrypted value.

If you want to encrypt records but will also want to be querying encrypted records, you’ll want to choose deterministic for more efficient queries. Deterministic encryption will still allow point lookups, equality joins, grouping, and indexing when querying data.

Click through for the step-by-step process.

Leave a Comment

Power BI Object Level Security

Gilbert Quevauvilliers shows us an example of Object Level Security in Power BI:

My example which I am going to detail below will show you how I will restrict a user from viewing sales data. The same user will be able to see Quantity amounts. This becomes really powerful because not all users need to see all the data.

My goal here is to show you how to the basics on how to use Object Level Security. Yes, there are more advanced options to configure a combination of Row Level Security and Object Level Security.

By using Object Level Security, it means that I can now have a single model which can be used for Financial and Non-Financial reporting.

Read on for an example.

Leave a Comment

Including and Resizing External Images in knitr

The folks at Jumping Rivers continue a series on knitr and rmarkdown:

In this third post, we’ll look at including eternal images, such as figures and logos in HTML documents. This is relevant for all R markdown files, including fancy things like {bookdown}, {distill} and {pkgdown}. The main difference with the images discussed in this post, is that the image isn’t generated by R. Instead, we’re thinking of something like a photograph. When including an image in your web-page, the two key points are

– What size is your image?
– What’s the size of your HTML/CSS container on your web-page?

Read the whole thing.

Leave a Comment

Semantic Search in Azure Cognitive Search

Rangan Majumder, et al, have an article on how semantic search works in Azure Cognitive Search:

As part of our AI at Scale effort, we lean heavily on recent developments in large Transformer-based language models to improve the relevance quality of Microsoft Bing. These improvements allow a search engine to go beyond keyword matching to searching using the semantic meaning behind words and content. We call this transformational ability semantic search—a major showcase of what AI at Scale can deliver for customers.

Semantic search has significantly advanced the quality of Bing search results, and it has been a companywide effort: top applied scientists and engineers from Bing leverage the latest technology from Microsoft Research and Microsoft Azure. Maximizing the power of AI at Scale requires a lot of sophistication. One needs to pretrain large Transformer-based models, perform multi-task fine-tuning across various tasks, and distill big models to a servable form with very minimal loss of quality. We recognize that it takes a large group of specialized talent to integrate and deploy AI at Scale products for customers, and many companies can’t afford these types of teams. To empower every person and every organization on the planet, we need to significantly lower the bar for everyone to use AI at Scale technology.

Click through to learn more about the technology.

Leave a Comment