Press "Enter" to skip to content

Month: March 2021

Apache Spark 3.1 Released

Hyukjin Kwon, et al, announce Apache Spark 3.1:

Various new SQL features are added in this release. The widely used standard CHAR/VARCHAR data types are added as variants of the supported String types. More built-in functions (e.g., width_bucket (SPARK-21117) and regexp_extract_all (SPARK-24884)) were added. The current number of built-in operators/functions has now reached 350. More DDL/DML/utility commands have been enhanced, including INSERT (SPARK-32976), MERGE (SPARK-32030) and EXPLAIN (SPARK-32337). Starting from this release, in Spark WebUI, the SQL plans are presented in a simpler and structured format (i.e. using EXPLAIN FORMATTED).

There have been quite a few advancements on the SQL side.
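
If you want to kick the tires on a couple of those additions, here is a hedged sketch in Spark SQL; the function examples mirror the built-in documentation, and the table t in the EXPLAIN statement is hypothetical.

```sql
-- width_bucket(value, min, max, numBuckets): places 5.3 into one of 5 equal-width
-- buckets spanning [0.2, 10.6); returns 3.
SELECT width_bucket(5.3, 0.2, 10.6, 5);

-- regexp_extract_all(str, regexp, idx): returns every match of capture group 1;
-- here, ["100", "300"].
SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);

-- The simpler, structured plan output mentioned above (t is a hypothetical table).
EXPLAIN FORMATTED SELECT * FROM t WHERE id > 10;
```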

Contrasting Redis and Memcached

Shane Ducksbury has a showdown:

Caching is an important step in increasing the performance of many applications. It can be difficult to determine which caching solution is the best for your use case. Two likely contenders that will often make an appearance in your search for the answer are Redis and Memcached.

In the green corner is Memcached (est. 2003), the classic, high-performance caching solution. In the red corner is Redis, a slightly newer (est. 2009) but very mature and feature-rich in-memory caching database. Below, we’ll take a look at the differences between the two to help you decide which to choose.

I’ve personally had better experiences with Redis; my experience is that both work well from low to high load, and both have a tendency to fall over at really high load.

Viewing Stats Used in Creating Execution Plans

Matthew McGiffen shows us how to find the statistics used when generating an execution plan:

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with Statistics really helps when you are performance tuning.

One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented trace flag 8666, which can be used to save internal debugging information into the plan XML, including details of the Statistics objects used.

Click through for a couple of caveats about this, as well as a primer on how to see those precious statistics.
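As a rough sketch of the approach (dbo.Orders is a hypothetical table, and trace flag 8666 is undocumented and unsupported, so keep it out of production):

```sql
-- Enable the trace flag for just this query, then capture the actual execution plan;
-- the plan XML will contain extra internal debugging nodes that name the statistics
-- objects the optimizer used.
SELECT *
FROM dbo.Orders                       -- hypothetical table
WHERE OrderDate >= '20210101'
OPTION (QUERYTRACEON 8666);
```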

SQL Server Configuration Settings Requiring Restarts

Randolph West enumerates configuration settings which do (or do not) require a restart of the SQL Server database engine:

SQL Server is a complex beast, with many configuration options that can range from recommended to completely avoided.

Since the release of SQL Server 2016, several options that were recommended post-install have been rolled into the default installation options and no longer need to be set manually, and similar changes were made with SQL Server 2017. Even so, there are configuration changes we data professionals need to make after installation, during maintenance windows, and sometimes even during operating hours, so here’s a handy list of changes that do and don’t require a restart of your operating system or SQL Server instance.

As Randolph mentions, this set is not exhaustive. For example, enabling PolyBase requires a restart of the database engine. I believe enabling ML Services technically does not, though I restart out of caution: the back of my mind remembers something weird about the service’s behavior if you don’t restart the database engine, but p > 0 that my brain made up the whole thing.
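
As a small illustration of the distinction (the value below is an example only): a dynamic setting such as max server memory takes effect at RECONFIGURE with no restart, whereas enabling PolyBase requires restarting the engine.

```sql
-- Dynamic settings: take effect as soon as RECONFIGURE runs, no service restart needed.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

EXEC sys.sp_configure N'max server memory (MB)', 8192;   -- example value
RECONFIGURE;
```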

Always Encrypted Setup

Chad Callihan takes us through an example of configuring Always Encrypted:

Always Encrypted can encrypt columns with deterministic encryption or randomized encryption. Your choice of which is better for you depends on how you plan to use the encrypted data. Deterministic encryption will produce the same encrypted value every time, whereas randomized encryption will not.

If you want to encrypt records but also want to query those encrypted records, you’ll want to choose deterministic for more efficient queries. Deterministic encryption will still allow point lookups, equality joins, grouping, and indexing when querying data.

Click through for the step-by-step process.
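
As a minimal sketch of what that looks like in DDL, assuming a column encryption key named CEK1 already exists (all object names here are hypothetical):

```sql
CREATE TABLE dbo.Customers
(
    CustomerID int IDENTITY(1,1) PRIMARY KEY,
    -- Deterministic: the same plaintext always produces the same ciphertext, so point
    -- lookups, equality joins, and grouping still work. Character columns must use a
    -- BIN2 collation.
    SSN char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = DETERMINISTIC,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'),
    -- Randomized: stronger protection, but the column cannot be searched or joined on.
    Salary money
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK1,
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256')
);
```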

Power BI Object Level Security

Gilbert Quevauvilliers shows us an example of Object Level Security in Power BI:

My example, which I detail below, will show you how to restrict a user from viewing sales data. The same user will still be able to see Quantity amounts. This becomes really powerful because not all users need to see all the data.

My goal here is to show you the basics of how to use Object Level Security. Yes, there are more advanced options to configure a combination of Row Level Security and Object Level Security.

By using Object Level Security, I can now have a single model which can be used for Financial and Non-Financial reporting.

Read on for an example.

Including and Resizing External Images in knitr

The folks at Jumping Rivers continue a series on knitr and rmarkdown:

In this third post, we’ll look at including external images, such as figures and logos, in HTML documents. This is relevant for all R markdown files, including fancy things like {bookdown}, {distill} and {pkgdown}. The main difference with the images discussed in this post is that the image isn’t generated by R. Instead, we’re thinking of something like a photograph. When including an image in your web page, the two key points are

– What size is your image?
– What’s the size of your HTML/CSS container on your web-page?

Read the whole thing.

Semantic Search in Azure Cognitive Search

Rangan Majumder, et al, have an article on how semantic search works in Azure Cognitive Search:

As part of our AI at Scale effort, we lean heavily on recent developments in large Transformer-based language models to improve the relevance quality of Microsoft Bing. These improvements allow a search engine to go beyond keyword matching to searching using the semantic meaning behind words and content. We call this transformational ability semantic search—a major showcase of what AI at Scale can deliver for customers.

Semantic search has significantly advanced the quality of Bing search results, and it has been a companywide effort: top applied scientists and engineers from Bing leverage the latest technology from Microsoft Research and Microsoft Azure. Maximizing the power of AI at Scale requires a lot of sophistication. One needs to pretrain large Transformer-based models, perform multi-task fine-tuning across various tasks, and distill big models to a servable form with very minimal loss of quality. We recognize that it takes a large group of specialized talent to integrate and deploy AI at Scale products for customers, and many companies can’t afford these types of teams. To empower every person and every organization on the planet, we need to significantly lower the bar for everyone to use AI at Scale technology.

Click through to learn more about the technology.

The Logging Costs of DROP TABLE and TRUNCATE

Paul Randal explains that DROP TABLE and TRUNCATE TABLE are logged operations:

Hopefully you all know that it’s a myth that DROP TABLE and TRUNCATE TABLE are non-logged operations. If you didn’t know that, read my blog post on sqlperformance.com that explains about the deferred drop mechanism. Both operations are fully logged, and will generate quite a bit of transaction log.

The bulk of the log that’s generated comes from having to log the deallocation of extents and the pages within them. For each extent, a bit must be cleared in the corresponding GAM page and IAM page, and all 8 pages in the extent must be marked as deallocated in the corresponding PFS page (turning off the 0x40 bit in each PFS byte). So that’s three log records per allocated extent.

To get a feeling for how much that is, Paul provides an example of a 20TB table being dropped.
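
If you want to see the effect on a smaller scale, here is a rough sketch using the undocumented fn_dblog function (dbo.BigTable is a hypothetical table; run this in a test database):

```sql
CHECKPOINT;   -- in the SIMPLE recovery model, lets the reusable portion of the log clear first

TRUNCATE TABLE dbo.BigTable;

-- Log records and bytes generated since the checkpoint. For large tables, the
-- deferred-drop background task may still be adding more after this returns.
SELECT COUNT(*)                 AS LogRecords,
       SUM([Log Record Length]) AS LogBytes
FROM sys.fn_dblog(NULL, NULL);
```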
