Data – Page 9 – Curated SQL

After selecting the raw data for ML training, the most important task is data pre-processing. In broad sense, data preprocessing will convert the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it can be as per the expectation of machine learning algorithm

Read on for examples of pre-processing steps and how pre-processing differs from data cleaning.

Comments closed

Unity Catalog in Azure Databricks

Published 2022-06-10 by Kevin Feasel

Paul Roome, et al, announce Unity Catalog:

We are excited to announce that data lineage for Unity Catalog, the unified governance solution for all data and AI assets on lakehouse, is now available in preview.
This blog will discuss the importance of data lineage, some of the common use cases, our vision for better data transparency and data understanding with data lineage, and a sneak peek into some of the data provenance and governance features we’re building.

Click through to see what it currently supports. My curious question is around whether this and Microsoft Purview will play nice in an Azure Databricks setup.

Comments closed

Knowledge Graphs, Data Fabrics, and Data Meshes

Published 2022-06-06 by Kevin Feasel

Alan Morrison describes the history of three concepts:

By 2014, SAP was using “in-memory data fabric” to describe a virtual data warehouse, a key element of its HANA “360-degree customer view” product line. Gartner for its part uses the term “data fabric” to this day as an all-encompassing means of heterogeneous data integration. From a 2021 post on data fabric architecture:

Read on for a high-level discussion of what each is and how it fits into the context of data warehouses and data lakes.

Comments closed

Searching for Strings in a Database

Published 2022-05-10 by Kevin Feasel

Aaron Bertrand updates an old tip:

Back in 2015, I wrote a tip called Search all string columns in all SQL Server databases. That tip focused on finding strings within string-based columns in all tables across all user databases. I was recently asked if this could be made more flexible; for example, can it search views as well, and can it search only in a specific database?

Click through for a new version which works in SQL Server 2016 and later.

Comments closed

File Format Throwdown

Published 2022-05-09 by Kevin Feasel

Tomaz Kastrun tries out several file formats in Azure Data Lake Storage (Gen2):

CSV data format is an old format and very common for data tasks, like import, export or storing. And when it comes performance of creating CSV file, reading and writing CSV files, how does it still stand against some other formats.
We will be looking at benchmarking the CRUD operations with different data formats; from CSV to ORC, Parquet, AVRO and others with the simple Azure data storage operations, like Create, Write, read and transform.

It’s important to remember that Parquet and ORC are intended to solve radically different problems than Avro. Parquet and ORC are columnar datasets intended to aggregate quickly and efficiently, whereas Avro is intended for efficient row storage. CSV is intended for easy-to-work-with row storage.

Then, Tomaz follows up with some R:

we have created Azure blob storage, connected secure connection using Python and started uploading files to blob store from SQL Server. Alongside, we compared the performance of different file types. ORC, AVRO, Parquet, CSV and Feather. Coming to conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.
We will be doing a similar benchmark with R language. The goal is to see, if CSV file format can be replaced by a file type that better, both in performance and storage.

The Feather file format, by the way, comes from Apache Arrow and works especially well with Python and R. You might not get the same performance benefits in other languages, depending on its library support.

Comments closed

“Production” in Data Analytics

Published 2022-03-22 by Kevin Feasel

Joey Jablonski brings up an important point:

Data-driven environments have a fundamentally different set of needs around testing, deployment, and visibility then traditional business applications. Data driven environments need access to fresh data on a high level of update frequency to ensure that data engineers and data scientists are able to effect outputs and recommendations on a timeline that has a positive impact on business decisions and customer experiences.

My day job involves running a predictive analytics team. We train models on production data—there’s very little value in training models on artificial dev data (outside of understanding the parameters of the modeling process), so even our development data generally comes from production. I don’t know that I’m sold on data mesh as a solution to this but it’s worth investigation.

Comments closed

Defining Data Quality

Published 2022-03-16 by Kevin Feasel

Ust Oldfield notes the importance of data quality:

We can safely assume this because a lot of organisations do not have data quality at the top of the priority lists. Why might this be the case? Because monitoring data quality and correcting poor quality data is hard. If it was easy, every organisation would have a strategy and method for tracking and improving data quality.
Often, inertia – driven by an overwhelming amount of information to the risks – sets in as it’s difficult to know where to start. So, where do you start?

I’d say there’s an incentive alignment problem with data quality: organizations want it but not enough that they’d trade anything else for it. And agents within the organization consider data quality a chore, so they’re looking for the minimum viable path. Then, for end users, we consider it even more of a chore (or a nuisance). Furthermore, I’m one of those end users who will put in fake data if I can get away with it on the principle that I don’t want you to have my personal information because you’re probably going to sell it or lose it.

Comments closed

The Importance of Data Dictionaries

Published 2021-12-16 by Kevin Feasel

John Morehouse takes us through data dictionaries:

Data professionals—whether they’re database administrators (DBAs), developers, or data scientists—work in a wide and varied landscape usually in flux and filled with challenges. These challenges could range from changing business requirements to keeping up with the sheer velocity at which technology evolves.
It’s also critical for these professionals to understand their organization’s data and how it applies to a given application or business unit. Better outcomes usually come from employing data dictionaries throughout the organization. Through many years of experience in IT professions, I’ve seen the utilization of data dictionaries range from “not at all” to “I’m documenting every possible data attribute known to humankind.” In my experience, data-related projects with data dictionaries as part of the process are far more likely to be successful than projects without them (even extremely populated data dictionaries are more useful than nothing at all). Trust me on this.

Click through to understand why you should trust John on this one.

Comments closed

Determining the Right Batch Size for Deletes

Published 2021-12-09 by Kevin Feasel

Jess Pomfret breaks out the lab coat and safety goggles:

I found myself needing to clear out a large amount of data from a table this week as part of a clean up job. In order to avoid the transaction log catching fire from a long running, massive delete, I wrote the following T-SQL to chunk through the rows that needed to be deleted in batches. The question is though, what’s the optimal batch size?

I usually go with a rule of thumb: 1K for wide tables (in terms of columns and row size) or when there are foreign key constraints, 10K for medium-width tables, and about 25K for narrow tables. But if this is an operation you run frequently, it’s worth experimenting a bit.

Comments closed

What Good Data Governance Means

Published 2021-12-07 by Kevin Feasel

Paul Andrew shares some thoughts on the true meaning of data governance:

… Someone asked me what I thought ‘good’ Data Governance might look like for a given data platform – warehouse or analytics solution. This included all aspects of data governance is was a very broad question.
To add some actual context, not related to Star Wars, this was during the time when Microsoft started talking about version 2 of the Azure Data Catalogue offering and what else could/should be included in a suite of governance tools. Certainly, long before the days of what we now call Azure Purview. It was also a time when GDPR had a lot of focus for the data community as we battled with the principals and technical implications.
Anyway, with a recent fresh perspective on things, I’ve decided to dust off my original notes and attempt to distill the answer of good data governanace into the following areas. What I also find with data governanace that the more I experience in the industry the deeper my perspective on the subject goes.

Paul has put together a lengthy answer on the topic, well worth the read.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Data

Example Data Pre-Processing Activities

Unity Catalog in Azure Databricks

Knowledge Graphs, Data Fabrics, and Data Meshes

Searching for Strings in a Database

File Format Throwdown

“Production” in Data Analytics

Defining Data Quality

The Importance of Data Dictionaries

Determining the Right Batch Size for Deletes

What Good Data Governance Means