Press "Enter" to skip to content

Category: Data

The Importance of Data Governance

Rob Farley riffs on another T-SQL Tuesday topic:

But the checks that we do are more about things that the database can allow, but are business scenarios that should never happen.

Plenty of businesses seem to recognise these scenarios all too well, and can point them out when they come across them. You hear phrases like “Oh, we know that’s not right, it should be XYZ instead”. And they become reasons why they don’t really trust their data. It’s a data quality issue, and every time someone comes across a data quality issue, they trust the data a little less.

Click through for Rob’s thoughts.

Comments closed

Generating Artificial Data with Databricks Generator

Ust Oldfield shows off a new tool:

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.

1 Comment

Generating Mock Data for SQL Server

Chad Callihan has a few options for creating fake data:

It’s easy enough to create a handful of records for testing in SQL Server. What if you want 100 rows or 1000 rows? What if you want data that looks more legitimate compared to gibberish? In this post, we’ll look at different ways to generate mock data.

One of the trickiest things about creating mock data is getting the distributions right. For example, ABS(CHECKSUM(NEWID()) is great (just as RAND(CHECKSUM(NEWID())), but the results follow a uniform distribution because of the nature of checksums and random number generators. This makes charting numeric values look unnatural. Here’s an example I put together of generating data off of a normal distribution. It does take more effort, but if you’re generating this fake data to show it to users in tools like Power BI or Tableau, having data follow reasonable distributions is a good thing. That is, use whatever distribution makes sense for the particular data element: uniform, normal, Pareto (power law), gamma, etc.

Comments closed

A Show about Nothing

Joe Celko has a moment of zen:

Human beings are not very good at handling nothing. The printing press didn’t just lead to civilization as we know it, but it also changed our mindset about text. When we wrote text manually on paper, a blank or space was not seen as a character. It was just the background upon which characters were written.

It was centuries before the zero was accepted as a number. After all, it represents the absence of a quantity or magnitude or position; how could it possibly be a number? Before it was accepted as a number, it was considered a symbol or mark in a positional notation to indicate that there was nothing in that position.

It’s an interesting riff, so check it out.

Comments closed

Creating a CSV File from a Table via BCP

Kenneth Fisher shows how to use the bcp command to create a file from a table:

This is a pretty handy little tool in your arsenal. I’ve talked about using bcp to transfer data from one instance to another before and this is another really great use for bcp. If you haven’t used it before bcp stands for Bulk Copy Protocol and is a command line tool for transferring data in and out of SQL Server. In this case you can use this command to generate a csv file from DBName.SchemaName.TableName:

Click through to see the command, as well as some helpful hints.

Comments closed

From Excel to SQL Server via Powershell

Kevin Wilkie combines Excel and dbatools like a mad scientist:

Those ways work great if you’re inserting one tab of an Excel spreadsheet. I don’t know about y’all, but I don’t want to have to do all of this work for each tab – especially if I have 30+ tabs to import into SQL Server (and yes, I’ve been asked to do that quite a few times.)

One of the easiest ways I’ve seen to insert a lot of data (once you’ve made sure that the data is how you want it) is to use the power of PowerShell.

Read on to see how.

Comments closed

Data Hubs, Warehouses, and Lakes

Trevor Legg compares and contrasts data hubs, data warehouses, and data lakes:

Data hubs, data warehouses, and data lakes are significant investment areas for data and analytics leaders and are vital to support increasingly complex, distributed, and varied data workloads.

Gartner finds that 57% of data and analytics leaders are investing in data warehouses, 46% are using data hubs, and 39% are using data lakes. However, they also found that these same data and analytics leaders don’t necessarily understand the difference between the three…

To best support specific business requirements, it’s vital to understand the difference and purpose of each type of structure, and the role it can play in modern data management infrastructure.

Click through for the definitions and comparisons.

Comments closed

Power Query Data Profiling

Ed Hansberry takes us through the data profiler in Power Query:

A solid green bar is usually best. It means there are no issues in that column, as shown in the Discounts field. On the far right in the COGS field, there is a tiny bit of gray. That means there is no data, or a null, in one or more records. The bigger the gray bar, the higher percentage of nulls there are. In the middle though we have a problem. Anything in red is an error and must be fixed. These may cause more errors in further transformations downstream, and will usually cause refresh errors.

Before we address this, let’s get a bit more information about our data. Go to the View ribbon, and there are 3 settings we want to look at.

I really like what the data profiler provides us. If you’re a regular Power BI user, I highly recommend checking it out if you haven’t already.

Comments closed

Bulk Copying Lots of Rows into SQL Server

Esat Erkec shows us how to use the Bulk Copy Program (BCP) to bulk load data into SQL Server:

If the installed version is older than the last version, we can download and install the latest version from the Microsoft website. The main capability of the SQL Server BCP is not much complex because it can only run with several arguments. The syntax of the BCP is like below:

bcp {table|view|”query”} {out|queryout|in|format} {data_file|nul} {[optional_argument]…}

For example, if we want to export any data of a table to a text file, we have to specify the table name, the out option, and the data file. The following command will export the Production table into the specified text file.

bcp AdventureWorks2017.Production.Product out C:\ExportedData\Product.txt -S localhost -T –w

I don’t know if I’m the only person for which this is true, but the data file format has always been a royal pain for me to get right, to the point where I’d happily build an SSIS package to perform bulk loading over having to use BCP myself.

2 Comments

Drift Monitoring with Azure Machine Learning

I take a look at dataset drift monitoring in Azure Machine Learning:

One of the things I like to say about machine learning model is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.

But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But when do we know if we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.

Read on for a demonstration of how the product works and a couple of things to keep in mind.

Comments closed