
Month: September 2020

Building dacpac Files on Non-Windows Machines

Erik Ejlskov Jensen points out another advantage of Azure SQL Database's database projects:

For a while now, it has been possible to publish a .dacpac file (meaning apply it to a new or existing database) using the cross-platform version of sqlpackage.

But authoring and building a database project (sqlproj) was only possible on Windows, as the .sqlproj project type is based on the classic .NET Framework .csproj project type.

Now, thanks to the new Database Project extension in the Azure Data Studio Insiders build, it is possible to author, build, and manually publish a SQL Server database project.

And by using the new MsBuild.Sdk.SqlProj SDK and project type, it is also possible to build and publish a database project from a build agent (CI pipeline) without having to install the sqlpackage tool. Read on!

You heard Erik.
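
For a sense of what that looks like in practice, here is a minimal sketch of an MsBuild.Sdk.SqlProj project file, based on the SDK's README; the SDK version and the SqlServerVersion value are assumptions to verify against the current release:

<Project Sdk="MsBuild.Sdk.SqlProj/1.0.0">
  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
    <!-- assumed: target a SQL Server 2019-level model -->
    <SqlServerVersion>Sql150</SqlServerVersion>
  </PropertyGroup>
</Project>

With that in place, dotnet build produces the .dacpac on any OS, and the SDK's documentation describes a dotnet publish step (with /p:TargetServerName and /p:TargetDatabaseName properties) to apply it, so a Linux build agent never needs sqlpackage installed.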


Data Lineage and SSIS

Aveek Das has a two-parter. First up is a discussion of data lineage:

In this article, I am going to explain what data lineage in ETL is and how to implement it. In this modern world, where companies deal with humongous amounts of data every day, there also lies the challenge of efficiently managing and monitoring that data. There are systems that generate data every second, and that data is processed into a final reporting or monitoring tool for analysis. In order to process this data, we use a variety of ETL tools, which in turn make the data transformation possible in a managed way.

While moving through the ETL pipeline, the data goes through multiple transformation steps in order to achieve the final result. For example, when the ETL receives raw data from the source, operations like filtering, sorting, merging, or splitting columns may be applied to it. There can also be aggregations or other calculations made on this raw data before it finally moves into a data warehouse or is prepared for reporting. In order to detect what the source of a particular record is, we need to implement something known as Data Lineage. It is a simple piece of metadata that helps us detect gaps in the data processing pipeline and enables us to fix issues later.

Part two covers data lineage with SQL Server Integration Services:

In this article, I am going to discuss SSIS data lineage concepts, which are often used while designing ETL workloads on a data warehouse. Although this article focuses on implementing data lineage using SSIS, the idea is not confined to SSIS; it applies to any ETL tool on the market that moves data from a source to a destination. In my previous article, Understanding Data Lineage in ETL, I have already discussed the general importance of data lineage concepts for any ETL tool. I would definitely suggest you have a look at it if you want to understand how data lineage helps track the source of a single record in the warehouse.

If you’re fairly new to this world, it’s a good introduction to an important topic.
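
To make the concept concrete, lineage is often implemented as a few audit columns stamped onto every row as it flows through the pipeline. A minimal T-SQL sketch, with table and column names invented for illustration:

CREATE TABLE dbo.DimCustomer
(
    CustomerKey    INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName   NVARCHAR(100) NOT NULL,
    -- lineage metadata: which system the row came from and which load produced it
    SourceSystem   NVARCHAR(50) NOT NULL,
    LineageBatchId INT NOT NULL,
    LoadedAtUtc    DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);

With a batch identifier on every row, you can join back to an ETL audit table and answer exactly the question above: where did this record come from, and when?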


Join Operations in Spark

Swantika Gupta compares hash and merge join operations in Apache Spark:

One of the most frequently used transformations in Apache Spark is the join operation. Joins in Apache Spark allow the developer to combine two or more data frames based on certain (sortable) keys. The syntax for writing a join operation is simple, but sometimes what goes on behind the curtain is lost. Internally, Spark proposes a couple of algorithms for joins and then chooses one of them. Not knowing what these internal algorithms are, and which one Spark chooses, might make a simple join operation expensive.

While opting for a join algorithm, Spark looks at the size of the data frames involved. It considers the join type, the condition specified, and any hint to finally decide upon the algorithm to use. In most cases, Sort Merge join and Shuffle Hash join are the two major workhorses that drive Spark SQL joins. But if Spark finds the size of one of the data frames to be less than a certain threshold, it puts up Broadcast Join as its top contender.

Click through for the comparison, though do note that this comparison doesn’t include nested loop joins, which are possible in Spark as well.
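
As a small illustration of the hint mechanism described above (a sketch; the SparkSession setup and data frames are invented for the example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ordersDf    = Seq((1, "widget"), (2, "gadget")).toDF("customer_id", "item")
val customersDf = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// Left to itself, Spark picks the algorithm from the frame sizes, the join
// type and condition, and spark.sql.autoBroadcastJoinThreshold.
val joined = ordersDf.join(customersDf, Seq("customer_id"))

// Nudging Spark toward a broadcast hash join when one side is known to be small:
val broadcasted = ordersDf.join(broadcast(customersDf), Seq("customer_id"))
broadcasted.explain() // the physical plan should show BroadcastHashJoin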


Automatic Retention Periods with Power Query

Gilbert Quevauvilliers has an easy method of constraining data sizes in Power Query:

How cool would it be to not have to manually update your dataset to keep data for the last two years (last year and this year to the current date)?

In this blog post I will show you how you can easily filter dates in Power Query to show dates for last year and year to date by using the GUI and not having to hardcode anything.

I must admit the first time I saw this was when I was watching my good friend Reid Havens presenting. Thanks Reid!

Read on for the technique.
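
For the curious, the M that this kind of GUI filter generates looks roughly like the following; this is a sketch rather than Gilbert's exact code, and it assumes a [Date] column plus the built-in Date.IsInPreviousYear and Date.IsInYearToDate functions:

let
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
    // keep last year plus this year to date, with nothing hardcoded
    Filtered = Table.SelectRows(
        Sales,
        each Date.IsInPreviousYear([Date]) or Date.IsInYearToDate([Date])
    )
in
    Filtered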


Data Importation and Exportation with dbatools

Mikey Bronowski continues a series on dbatools functionality vis-a-vis SQL Server Management Studio:

SSMS offers to script out lots of SQL Server objects; however, it can be limited in some areas. Using Get-Dba* commands and piping them into Export-DbaScript may add a few more options. For example, SQL Agent jobs:
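
The code sample didn't survive the excerpt, but the pattern looks something like this sketch; the instance name and output path are placeholders, and the file parameter has been renamed across dbatools versions (older releases use -Path rather than -FilePath):

# Script out every SQL Agent job on an instance into one .sql file
Get-DbaAgentJob -SqlInstance 'SQL01' |
    Export-DbaScript -FilePath 'C:\temp\agent-jobs.sql'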

Click through for just shy of a dozen cmdlets to help you run your data import-export business.


DATETIME2 and Storage Size

Randolph West digs into an issue:

Two years ago I wrote a post that got a lot of traction in the comments at the time. Last month there was renewed interest because one of the commenters noted that the official SQL Server documentation for DATETIME2 disagreed with my assertions, and that I was under-representing the storage requirements.

To remind you, I have been saying for years that you can use DATETIME2(3) as a drop-in replacement for DATETIME and get better granularity (1ms versus 3ms) for 12.5% less storage (1 byte per column per row). The commenter intimated that because my statement conflicted with the documentation, I must be wrong. As it turns out, the documentation was wrong, but I also learned something new in the process!

It’s an interesting internal look at how difficult it is to get documentation right, even on something which sounds simple.
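
If you want to see the per-row difference for yourself, here is a quick sketch; note that DATALENGTH measures the value's bytes and deliberately sidesteps the per-column precision byte the documentation debate was about:

-- DATETIME is always 8 bytes; DATETIME2(3) stores the same moment in 7
SELECT
    DATALENGTH(CAST(SYSDATETIME() AS datetime))     AS datetime_bytes,    -- 8
    DATALENGTH(CAST(SYSDATETIME() AS datetime2(3))) AS datetime2_3_bytes; -- 7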


Restoring the Master Database

Kenneth Igiri walks us through restoring the master database in SQL Server:

The master database contains records of the structure/configuration for both the current instance and all other databases. When you run sp_configure, you are writing data to the master database.  It also contains most of the dynamic management views that are necessary to monitor the instance.

The master database is crucial. First, it has the information necessary for opening all other databases, and it has to be opened first. Second, it holds all instance-level principals for the current instance.

It is crucial to back up the master database daily. Equally important is knowing how to restore the master database to the instance. The most frequent cases are a database crash, or the need to restore the master database to another instance when the source instance is no longer in use. In this article, we will examine the specific case of moving the master database to another instance.

It’s definitely not as easy as restoring other databases, but it is possible.
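
For reference, the broad strokes of the standard approach look like this sketch (default instance, Windows authentication, and paths are all placeholder assumptions; this is not necessarily Kenneth's exact sequence):

REM 1. Restart the instance in single-user mode, which a master restore requires
net stop MSSQLSERVER
net start MSSQLSERVER /m

REM 2. Restore master; the instance shuts itself down when the restore completes
sqlcmd -S localhost -E -Q "RESTORE DATABASE master FROM DISK = N'C:\Backups\master.bak' WITH REPLACE;"

REM 3. Start the instance normally again
net start MSSQLSERVER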


Azure Data Studio Extension Generator

Anjali Agarwal and Laura Jiang announce a new product:

The Azure Data Studio extension generator is now available. Install the generator through npm and get started with extension development with these Azure Data Studio extension tutorials.

The Azure Data Studio extension generator is a command line tool designed to help extension authors get started with the process of extension development. It includes extension templates that enable users to create and publish extensions with minimal technical knowledge required. In our most recent release, we have added three highly requested extension templates to the generator.

Anything which helps make extension development easier is fine by me.
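
Getting started should be just a couple of commands, assuming the generator still ships as the generator-azuredatastudio package on npm:

# Install Yeoman plus the Azure Data Studio extension generator
npm install -g yo generator-azuredatastudio

# Scaffold a new extension from one of the templates
yo azuredatastudio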


An UPSERT Pattern to Avoid

Aaron Bertrand doesn’t like a common insert/update pattern:

I think everyone already knows my opinions about MERGE and why I stay away from it. But here’s another (anti-)pattern I see all over the place when people want to perform an upsert (update a row if it exists and insert it if it doesn’t):

IF EXISTS (SELECT 1 FROM dbo.t WHERE [key] = @key)
BEGIN
    UPDATE dbo.t SET val = @val WHERE [key] = @key;
END
ELSE
BEGIN
    INSERT dbo.t([key], val) VALUES(@key, @val);
END

This looks like a pretty logical flow that reflects how we think about this in real life:

Does a row already exist for this key?
YES: OK, update that row.
NO: OK, then add it.

Click through to learn why this is a bad idea.
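
The short version: the check and the write are two separate statements, so under the default isolation level two concurrent sessions can both pass the IF EXISTS test and then collide. One commonly recommended alternative, sketched here (not necessarily the exact pattern Aaron lands on), takes the right locks up front and holds them for the duration:

BEGIN TRANSACTION;

UPDATE dbo.t WITH (UPDLOCK, SERIALIZABLE)
    SET val = @val
    WHERE [key] = @key;

IF @@ROWCOUNT = 0
BEGIN
    INSERT dbo.t([key], val) VALUES(@key, @val);
END

COMMIT TRANSACTION;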
