Press "Enter" to skip to content

Author: Kevin Feasel

Order, Sort, Cluster, and Distribute in Hive

The Hadoop in Real World team give us three methods (and one synonym) to organize results in Hive:

Hive provides 3 options to order or sort the result of records – order by, sort by, cluster by and distribute by. Which option you choose has performance implications. So it is important to understand the difference between the options and choose the right one for the use case at hand.

Click through for a high-level overview of the techniques.

Comments closed

Creating and Using Variables in DAX

Jeet Kainth takes us through the process of working with variables in DAX:

Variables can simplify your DAX code, help with debugging and help improve performance. To use variables in your DAX measure you need to declare a variable using the VAR keyword, give the variable a name, and then assign an expression to the variable. You can use multiple variables in a measure but when using variables you must use a RETURN statement to specify the final output to be returned.

Read on for a demonstration, as well as several examples of how you can use variables to make your DAX-writing life easier.

Comments closed

AWS EC2 I3 Instance Types and Storage Persistence

Steve REzhener has a warning for us:

Amazon Web Services Elastic Cloud Computing (a.k.a. EC2)  is a service that lets anyone with a credit card rent a virtualized server from Amazon. To cater to different clients’ needs, AWS provides various instance types that are either general instance or specific-purpose instances (focused on CPU, RAM, IO). You can see the different types in Fig 1. This blog post is going to talk about a storage optimized instance. the I3 instance type family, its little-known problem, and the solution in the form of  Elastic Block Storage (a.k.a. EBS).

Click through for the warning, more explanation, and what you can do about it. H/T the SQLServerCentral newsletter.

Comments closed

Exchange Spill Wait Stats

Erik Darling looks at exchange spills:

There are quite high waits on PAGEIOLATCH_EX, SLEEP_TASK, and SLEEP_BPOOL_STEAL rounding out the top five. This is quite interesting, because I’ve never explicitly thought of PAGEIOLATCH_EX waits in the context of exchange spills. Normally, I think of them when queries read pages from disk into memory for modification.

Going down the line, SLEEP_TASK is familiar from our time spent with hash spills, but SLEEP_BPOOL_STEAL is so far undocumented anywhere.

Erik also does the math on this query and recommends that you not write a query like this one.

Comments closed

Data Lakehouse Point-of-Sale Analytics Demo

Bryan Smith and Rob Saker share a pattern:

Disruptions in the supply chain – from reduced product supply and diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless omnichannel experiences are driving retailers to rethink how they use data to manage their operations. Prior to the pandemic, 71% of retailers named lack of real-time visibility into inventory as a top obstacle to achieving their omnichannel goals. The pandemic only increased demand for integrated online and in-store experiences, placing even more pressure on retailers to present accurate product availability and manage order changes on the fly. Better access to real-time information is the key to meeting consumer demands in the new normal.

In this blog, we’ll address the need for real-time data in retail, and how to overcome the challenges of moving real-time streaming of point-of-sale data at scale with a data lakehouse.

It’s a cool scenario, at the least.

Comments closed

Fixing Remote Desktop Resolution to Match Screen Size

Kenneth Fisher fixes a problem:

One of the things I hate most about using remote desktop is when the resolution for the remote server is way off what the size of my window is. You end up with the scroll bars on the side and bottom of the screen and spend half your time moving the screen around so that you can find what you need. Open Notepad, scroll to the top of the screen so you can save it, then scroll to the bottom of the screen so you can switch to the Excel spreadsheet that’s hidden by the Notepad window, etc.

It turns out there is an easy setting that forces the resolution to match the window size. Even if you end up with really weird window sizes.

My fix historically was to use mRemoteNg, which did this for you.

Comments closed

Building a Deployment Pipeline with sqlpackage

Daniel Hutmacher builds a cloudless deployment pipeline:

I did some googling to see just how simple I could make a database deployment pipeline. I keep the DDL scripts in a git repository on the local network, but I can’t use Azure DevOps or any other cloud service, and I don’t have Visual Studio installed, so the traditional database project in SSDT that I know and love is unfortunately not an option for me.

So I googled a little, and here’s what I ended up doing.

Click through to see how Daniel was able to do it.

Comments closed

Hash Spill Wait Stats

Erik Darling continues a series on interesting wait stats:

Hash spills are, as we’ll see, sometimes identified by a different wait than sort spills. In small quantities, spills are often not worth bothering with. But when they pile up, they can really cause some severe performance issues.

In this post, I want to show that both Hash Aggregate and Joins can cause the same wait type to show, along with some evidence that strings make things worse.

Click through for a good explanation of the wait stat you’re liable to see the most.

Comments closed