Press "Enter" to skip to content

December 3, 2020

ETL with R in SQL Server

Rajendra Gupta shows one reason for using R inside SQL Server:

Data professionals get requests to import and export data in various formats, such as comma-separated data (.CSV), Excel, HTML, JSON, YAML, and tab-separated data (.TSV). Usually, we use SQL Server Integration Services ETL packages for data transformations and to import or export data.

SQL Machine Learning can be useful in dealing with various file formats. In the article External packages in R SQL Server, we explored R Services and the various external packages for performing tasks using R scripts.

In this article, we explore the useful Rio package developed by Thomas J. Leeper to simplify the data export and import process.

One thing I’d like to reiterate is that even though you’re using R to move this data, you don’t need to perform any data science activities on the data—R can be the easiest approach for getting and cleaning up certain types of data.
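To make that concrete, here is a minimal sketch of the idea, not taken from the article: it assumes R Services is enabled and rio is installed in the instance’s R library, and the output path is hypothetical.

```sql
-- A minimal sketch: export a query's results to CSV using rio from inside
-- SQL Server. Assumes R Services is enabled and the rio package is installed
-- in the instance's R library; the output path C:/Temp/objects.csv is
-- hypothetical. Forward slashes avoid R string-escape issues.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'rio::export(InputDataSet, "C:/Temp/objects.csv");',
    @input_data_1 = N'SELECT name, object_id, create_date FROM sys.objects;';
```

Because rio infers the format from the file extension, swapping .csv for .json or .xlsx in the path is all it takes to change the output format.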


So You’ve Hit the Limits of ADF Concurrency

Paul Andrew shows what happens when you break the ADF concurrency barrier:

Firstly, understanding how these limits apply to your Data Factory pipelines takes a little bit of thinking about, considering you need to understand the difference between an internal and an external activity. Then you need to think about this with the caveats of the limits being per subscription and, importantly, per Azure Integration Runtime region.

Assuming you know that, and you’ve hit these limits!

Click through to see what happens. It’s not pretty.


Logging Schema Changes with DDL Triggers

Andreas Wolter shows how we can implement a log of object changes with DDL triggers:

Over the years working on customer systems, I personally found it to be invaluable and, as a best practice, equipped any database that I designed with such a small trigger and DDL log table, just in case. It has helped many times to quickly solve issues with deployment scripts, non-scripted changes to the systems, problems with source control, and simply getting answers quickly.
The concept is almost trivial, and because DDL changes are usually not in performance-critical code paths, the theoretical overhead on the DDL statement runtimes is not relevant. (Unless frequent schema changes are part of a performance-sensitive workload, in which case I would question whether using DDL is a good idea at all in such a place. Note that temporary tables are not caught by DDL triggers.)

Click through for more information, as well as a sample script.
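For a sense of the shape of the pattern, here is a minimal sketch (the table and trigger names are made up; Andreas’s script is more complete):

```sql
-- A minimal sketch of the pattern; dbo.DDLLog and trg_LogDDL are illustrative names.
CREATE TABLE dbo.DDLLog
(
    LogId     int IDENTITY(1, 1) PRIMARY KEY,
    EventTime datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    LoginName sysname   NOT NULL DEFAULT ORIGINAL_LOGIN(),
    EventData xml       NOT NULL
);
GO

-- EVENTDATA() returns the full XML payload describing the DDL event
-- that fired the trigger: statement text, object, login, and more.
CREATE TRIGGER trg_LogDDL
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS
AS
BEGIN
    SET NOCOUNT ON;
    INSERT dbo.DDLLog (EventData) VALUES (EVENTDATA());
END;
GO
```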


Compressing and Storing LOB Data

Randolph West hits the archives:

When we look at the several data types for storing LOB data in SQL Server, the ones we might think of first are specifically designed for storing clear text. These are VARCHAR(MAX) and NVARCHAR(MAX), which you can read up about on Microsoft Docs. They’re variable-length data types that are used to store clear text between 8,000 bytes and 2 GB. The key distinction is that NVARCHAR is for Unicode strings, which use at least two bytes per character.

Then we get VARBINARY(MAX), also for storing between 8,000 bytes and 2 GB. This data type does not make any assumptions about the stored blob’s format. You can store text, zip files, images, public key security certificates, you name it. If it can be persisted to storage, it can be stored in VARBINARY(MAX).

Read on for an example of Randolph using Gzip compression and storing XML data as binary for auditing purposes.
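The Gzip work comes from the built-in COMPRESS() and DECOMPRESS() functions (SQL Server 2016 and later). A minimal sketch of the round trip, with made-up sample data:

```sql
-- A minimal sketch: COMPRESS() gzips its input into VARBINARY(MAX);
-- DECOMPRESS() reverses it. The audit document here is made up.
DECLARE @doc nvarchar(max) = N'<audit><change table="dbo.Orders" column="Amount" /></audit>';
DECLARE @packed varbinary(max) = COMPRESS(@doc);

-- Cast the decompressed bytes back to the original type to read them.
SELECT DATALENGTH(@packed) AS CompressedBytes,
       CONVERT(xml, CONVERT(nvarchar(max), DECOMPRESS(@packed))) AS Unpacked;
```

Note that DECOMPRESS() only returns raw bytes, so you have to convert back to the same type you compressed, or the result will be garbage.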


Working with Self-Hosted Integration Runtimes

Craig Porteous walks us through some of the planning necessary for self-hosted integration runtimes:

If your Data Factory contains a self-hosted Integration Runtime, you will need to do some planning work before everything will work nicely with CI/CD pipelines. Unlike all other resources in your Data Factory, runtimes won’t deploy cleanly between environments, primarily because you connect the installed runtime directly to a single Data Factory. (We can add more runtime nodes to a single Data Factory, but we cannot share a single node between many data factories*.) An excerpt from Microsoft’s docs on Continuous integration and delivery in Azure Data Factory mentions this caveat.

Read on for the consequences and two options available to you.


Temp Tables as a Release Valve for Design Issues

Erik Darling shows a benefit of temp tables and cold, bitter experience:

A lot of the time, when I see queries written with all sorts of gymnastics in the join or where clause and I ask some questions about them, people usually start complaining about the design of the table.

That’s fine, but when I ask about changing the design, everyone gets quiet. Normalizing tables, especially for Applications Of A Certain Age™ can be a tremendously painful project. This is why it’s worth it to get things right the first time. Simple!

Rather than make someone re-design their schema in front of me, oftentimes a temp table is a good workaround.

Click through for an example.
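Roughly, the pattern looks like this (the tables and the date-stored-as-text column are hypothetical stand-ins for the gymnastics):

```sql
-- A minimal sketch of the workaround; dbo.Orders, dbo.OrderDetails, and the
-- OrderDateText column are hypothetical. Do the awkward conversion once,
-- materialize it into a temp table, then join and filter on the clean column.
SELECT o.OrderId,
       TRY_CONVERT(date, o.OrderDateText) AS OrderDate
INTO #CleanOrders
FROM dbo.Orders AS o;

SELECT c.OrderId, c.OrderDate, d.Amount
FROM #CleanOrders AS c
JOIN dbo.OrderDetails AS d
    ON d.OrderId = c.OrderId
WHERE c.OrderDate >= '20200101';
```

As a bonus, the temp table has its own statistics, which can give the optimizer a much better picture than a nest of expressions buried in a join clause.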
