Press "Enter" to skip to content

Month: April 2021

Working on Multiple Repos with Azure Data Studio

Deborah Melkin shows off a feature of Azure Data Studio:

If you read my T-SQL Tuesday post from this month, I mentioned that I’ve been using Azure Data Studio on daily basis. One of the things that I find I use it for the most is for Source Control with Git. I’m incredibly surprised by this. Maybe it comes from years of using Management Studio and not being able to check in code from the tool that I’m using to write it. (Or maybe I’ve been able to do that all this time and no one told me…?)

As I’m using it, I found two things that have helped me out. So naturally, I thought I’d share.

Click through for information on how to use multiple repos, as well as a bonus item.

Comments closed

Installing Kubernetes on EC2

Praveen Sripati eschews EKS:

There are tons of ways of setting up K8S on AWS. Today we will see one of the easiest way to get started with K8S on AWS. The good thing is that we would be using t2.micro instance type, which falls under the AWS free tier. This configuration is good enough to get started with K8S and not for production setup. It’s with the assumption that the reader is familiar with the basic concepts of AWS.

Click through for the process.

Comments closed

A Primer on Columnstore Indexes

Gail Shaw gives us an introduction to columnstore indexes:

Columnstores are… different.

The first, and I would say most important thing to realise about columnstore indexes is that they don’t have keys. These are not seekable indexes. Any access to a columnstore index is going to be a scan.

Instead of storing the rows together on a page, a columnstore index instead stores column values together. The rows in the table are divided into chunks of max a million rows, called a row group, and the columns are then stored separately, in what are called segments. A segment will only ever contain one column’s values.

Read the whole thing.

Comments closed

Altering Columns in Temporal Tables

Meagan Longoria explains the process around table alteration when it’s a temporal table:

System-versioned temporal tables were introduced in SQL Server 2016. They provide information about data stored in the table at any point in time by storing an effective dated version of each row rather than only the data that is correct at the current time

You can alter a temporal table to add or change columns, but you must first turn off system versioning. Let’s look at an example.

The example here relates to a computed column, so a bit more work has to happen due to the way you define computed columns.

Comments closed

OFFSET and FETCH in SQL Server

Steve Jones explains how the OFFSET-FETCH process works:

The other day I saw an article on the OFFSET clause in a SELECT. I had seen this come out and looked at it briefly in SQL Server 2012, but hadn’t done much with it.

NOTE: if you use this, be sure you read about potential performance problems and solutions.

Read on for more info. This is something I really wanted to work better than it does, as paging is really awkward in SQL Server. OFFSET-FETCH is great syntactically, but doesn’t do much for performance. The best solution I’ve seen is to take the results of a paging operation and store them in a dedicated paging table, allowing the user to query by page in that small table quickly. But at that point, OFFSET-FETCH isn’t really much less complicated than TOP(NumberOfRows) WHERE NumberOfRows >= NumberOfRows * (NumberOfPages – 1) ORDER BY NumberOfRows.

Comments closed

sample_ms in sys.dm_io_virtual_file_stats and Data Types

Paul Randal points out an interesting bug:

A prior student emailed me yesterday about some strange behavior of the sample_ms column in sys.dm_io_virtual_file_stats. It’s supposed to be the number of milliseconds since the SQL Server instance has been started. He has a SQL Server 2016 instance that’s been running since August 2019, and it shows the following:

ms_ticks from sys.dm_os_sys_info: 51915112684 (which works out to be August 28, 2019)

sample_ms from sys.dm_io_virtual_file_stats: 375504432 (which works out to be about 4.5 days)

Read on to get a determination and an alternative in case you’re using that field.

Comments closed

Pareto Charts in Power BI

Imran Burki builds a chart:

Last week we built a Manufacturing Yield dashboard that showed first and final pass yield numbers. In this post, we’re going to introduce the concept of manufacturing defects and build a Pareto chart in Power BI. Unlike Excel, the Pareto chart isn’t out-of-the-box in Power BI. Instead, we must create DAX to build the Pareto. Before we dig into the DAX, let’s talk about why we would create a Pareto in the context of manufacturing and why defects are important to track.

Read on to learn what a Pareto chart is and how you can build the DAX function which gives us the relevant information.

Comments closed

EMR Studio Now Generally Availabile

Shuang Li announces that Amazon EMR Studio is now in GA:

EMR Studio provides fully managed Jupyter notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Studio uses AWS Single Sign-On and allows you to log in directly with your corporate credentials without signing in to the AWS Management Console. You can install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and Bitbucket, and run parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow and Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

With EMR Studio, you can run notebook code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), and take advantage of the performance-optimized EMR runtime for Apache Spark. You can set up EMR Studio to run applications on existing EMR clusters or create new clusters using Cloud Formation templates for Amazon EMR.

Click through for more information.

Comments closed

reduceByKey and aggregateByKey in Spark

The Hadoop in Real World team compares two functions against RDDs in Spark:

Let’s examine the below aggregateByKey. The first parameter – 0 is the initial value and also indicates the type of the output.

First _+_  function indicates the function on the map side combine and second _+_ function indicates the reduce side combine. Both functions are the same in this case.

This is a demo-driven post, so check it out.

Comments closed

Offloading Maintenance Operations

Taryn Pratt has a process for offloading maintenance operations onto another server:

Early on when I started working on the SQL Servers at Stack Overflow, we were taking daily backups. We had a handful of databases that were being restored for other processes, but the majority weren’t actively tested to ensure the backups were good. Since you never want to be in a situation where you need to restore a database and find it doesn’t work, my goal was to create a process to automatically restore our backups to a separate server, and then run DBCC CHECKDB on it.

This is a T-SQL-driven process and I appreciate that. If you want a Powershell-driven process, Kevin Hill has you covered.

Comments closed