Press "Enter" to skip to content

Day: March 4, 2021

Analyzing XGBoost Training Reports

Simon Zamarin, et al., walk us through using XGBoost reports in Amazon’s SageMaker Debugger:

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.

This post shows you an end-to-end example of training an XGBoost model on SageMaker and how to enable the automatic XGBoost report functionality in SageMaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on information provided by the report. The code from this post is available in the GitHub repo.

Click through for an example of this in action.
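
To give a sense of what “a minimal amount of code changes” means, here is a rough sketch of my own (not the post’s code; the bucket, hyperparameters, and instance type are placeholders) of attaching the built-in report rule to a SageMaker XGBoost training job:

import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# The built-in Debugger rule that generates the XGBoost training report.
rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.2-1"
    ),
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgboost-output",  # placeholder bucket
    rules=rules,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)
estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder training data

Once the job finishes, Debugger writes the report files to S3 under the rule’s output path alongside the model artifacts.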


Apache Spark 3.1 Released

Hyukjin Kwon, et al., announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884)) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030) and EXPLAIN (SPARK-32337). Starting from this release, in the Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e., using EXPLAIN FORMATTED).

There have been quite a few advancements around the SQL side.
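
For a quick taste of the new functions, here is a small PySpark sketch of my own (not from the announcement; the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark31-sql-features").getOrCreate()
spark.range(1, 11).createOrReplaceTempView("nums")

# width_bucket(value, min, max, buckets) assigns each value to a histogram bucket.
spark.sql("SELECT id, width_bucket(id, 1, 10, 4) AS bucket FROM nums").show()

# regexp_extract_all returns every match of the pattern, not just the first.
spark.sql(
    "SELECT regexp_extract_all('a1-b2, c3-d4', '(\\\\w)(\\\\d)', 2) AS digits"
).show()

# The simpler plan layout mentioned above corresponds to EXPLAIN FORMATTED.
spark.sql("EXPLAIN FORMATTED SELECT id FROM nums WHERE id > 5").show(truncate=False)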


Contrasting Redis and Memcached

Shane Ducksbury has a showdown:

Caching is an important step in increasing the performance of many applications. It can be difficult to determine which caching solution is the best for your use case. Two likely contenders that will often make an appearance in your search for the answer are Redis and Memcached.

In the green corner is Memcached (est. 2003), the classic, high-performance caching solution. In the red corner is Redis, a slightly newer (est. 2009) but very mature and feature-rich in-memory database. Below, we’ll take a look at the differences between the two to help you decide which to choose.

I’ve personally had better experiences with Redis; both work well from low to high load, and both have a tendency to fall over at really high load.
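
For the simplest get/set workloads the two really are interchangeable, as a minimal sketch shows (assuming default local installs of both servers and the redis and pymemcache client libraries):

import redis
from pymemcache.client.base import Client as MemcacheClient

r = redis.Redis(host="localhost", port=6379)
mc = MemcacheClient(("localhost", 11211))

# Both support simple get/set with a TTL.
r.set("greeting", "hello", ex=60)          # expires after 60 seconds
mc.set("greeting", "hello", expire=60)

print(r.get("greeting"))   # b'hello'
print(mc.get("greeting"))  # b'hello'

# Redis pulls ahead once you need richer data structures, e.g. sorted sets.
r.zadd("scores", {"alice": 10, "bob": 5})
print(r.zrange("scores", 0, -1, withscores=True))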


Viewing Stats Used in Creating Execution Plans

Matthew McGiffen shows us how to find the statistics used when generating an execution plan:

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with Statistics really helps when you are performance tuning.

One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented trace flag 8666, which can be used to save internal debugging information into the plan XML, including details of the statistics objects used.

Click through for a couple of caveats about this, as well as a primer on how to see those precious statistics.
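
As a rough idea of how you might pull the plan XML programmatically, here is a sketch of my own using pyodbc (not Matthew’s code; the connection string and table are placeholders, and the trace flag requires elevated permissions):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=AdventureWorks;Trusted_Connection=yes"  # placeholder connection
)
cursor = conn.cursor()

# Return the estimated plan as XML instead of executing the query.
cursor.execute("SET SHOWPLAN_XML ON")
cursor.execute(
    "SELECT * FROM Person.Person WHERE LastName = N'Smith' "
    "OPTION (QUERYTRACEON 8666)"  # apply the undocumented trace flag per-query
)
plan_xml = cursor.fetchone()[0]
cursor.execute("SET SHOWPLAN_XML OFF")

# With the trace flag on, the plan XML gains extra internal-debugging
# elements that name the statistics objects the optimizer loaded.
print(plan_xml)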


SQL Server Configuration Settings Requiring Restarts

Randolph West enumerates configuration settings which do (or do not) require a restart of the SQL Server database engine:

SQL Server is a complex beast, with many configuration options that can range from recommended to completely avoided.

Since the release of SQL Server 2016, several options that were recommended post-install have been rolled into the default installation options and no longer need to be done, and similar changes were made with SQL Server 2017. Even so, there are configuration changes we data professionals need to make after installation, during maintenance windows, and sometimes even during operating hours, so here’s a handy list of changes that do and don’t require a restart of your operating system or SQL Server instance.

As Randolph mentions, this set is not comprehensive. For example, enabling PolyBase requires a restart of the database engine. I believe enabling ML Services technically does not, though I restart anyway out of caution: the back of my mind remembers something weird about the service’s behavior if you don’t restart the database engine, but p > 0 that my brain made the whole thing up.
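
It is also worth knowing that SQL Server tracks restart behavior itself for sp_configure settings: the is_dynamic column of sys.configurations tells you whether RECONFIGURE is enough. A quick pyodbc sketch of my own (placeholder connection string; this covers only sp_configure options, not features like PolyBase or startup trace flags):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes"  # placeholder connection
)

# is_dynamic = 1: takes effect on RECONFIGURE; 0: waits for an instance restart.
for name, is_dynamic in conn.execute(
    "SELECT name, is_dynamic FROM sys.configurations ORDER BY name"
):
    print(f"{name}: {'RECONFIGURE only' if is_dynamic else 'requires restart'}")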


Always Encrypted Setup

Chad Callihan takes us through an example of configuring Always Encrypted:

Always Encrypted can encrypt columns with deterministic encryption or randomized encryption. Your choice of which is better for you depends on how you plan to use the encrypted data. Deterministic encryption will produce the same encrypted value every time, whereas randomized encryption will not.

If you want to encrypt records but will also want to query encrypted records, you’ll want to choose deterministic for more efficient queries. Deterministic encryption will still allow point lookups, equality joins, grouping, and indexing when querying data.

Click through for the step-by-step process.
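
On the client side, the deterministic-equality behavior looks something like the following sketch of mine (the table, column, and connection details are hypothetical, and it assumes the encryption keys are already configured and accessible to the client):

import pyodbc

# ColumnEncryption=Enabled tells the ODBC driver to encrypt parameters
# and decrypt result columns transparently.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Clinic;Trusted_Connection=yes;ColumnEncryption=Enabled"
)
cursor = conn.cursor()

# An equality predicate works against a deterministically encrypted column
# because equal plaintexts produce equal ciphertexts; the same query against
# a randomized column would fail.
cursor.execute("SELECT PatientName FROM dbo.Patients WHERE SSN = ?", "123-45-6789")
for row in cursor.fetchall():
    print(row.PatientName)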


Power BI Object Level Security

Gilbert Quevauvilliers shows us an example of Object Level Security in Power BI:

My example, which I am going to detail below, will show you how I restrict a user from viewing sales data. The same user will be able to see Quantity amounts. This becomes really powerful because not all users need to see all the data.

My goal here is to show you the basics of how to use Object Level Security. Yes, there are more advanced options to configure a combination of Row Level Security and Object Level Security.

By using Object Level Security, it means that I can now have a single model which can be used for Financial and Non-Financial reporting.

Read on for an example.
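
Under the covers, Object Level Security is metadata on a role in the tabular model; in TMSL terms, a role that hides a column looks roughly like this hypothetical snippet (the role, table, and column names are made up, and it would be set through a tool such as Tabular Editor rather than Power BI Desktop):

{
  "name": "No Sales Values",
  "modelPermission": "read",
  "tablePermissions": [
    {
      "name": "Sales",
      "columnPermissions": [
        { "name": "Sales Amount", "metadataPermission": "none" }
      ]
    }
  ]
}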
