Data – Page 14 – Curated SQL

Generating Fake Data with R

Published 2020-02-03 by Kevin Feasel

Dave Mason takes a look at generating fake PII in R:

I’ve been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren’t data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?
In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I’ll focus on the “generator” package.

Click through to see what it does and Dave’s thoughts on the topic. It would also be possible to generate fake data in R by hitting a web API like Daniel Hutmacher’s service.

Comments closed

Moving Data Around in Azure Synapse Analytics

Published 2020-01-31 by Kevin Feasel

Niko Neugebauer looks at some techniques for copying data into a table in an Azure Synapse Analytics SQL Pool:

First of all, let us list some of them (and I am not even attempting on providing all of them, of course):
– INSERT INTO … SELECT FROM … (the most well known one)
– SELECT INTO … FROM … (the most well-known to perform well, since it will create a HEAP while copying most of the properties from the original table(s))
– CREATE TABLE … AS SELECT … (the old way, which must be like 10 years old on PDW/APS & Azure SQL DW, but that has never gotten into a Box Product or Azure SQL Database)
– Polybase (that will use the External Tables & externally allocated data to transfer into Azure SQL DW)
– BCP (good old tested friend that will give you a pain in the neck until you dominate it)
– OPENROWSET / BULK INSERT (some very good and very old friends with complicated histories (who remembers all the code pages?, settings and uncertain future mostly because of their original restrictions, I guess)
– COPY INTO … (the brand new command that will allow you under very neat privileges to copy data from the external storage accounts, much like BULK INSERT)
In this blog post I will simply focus on those features that have not been ported (hopefully just yet): CTAS & COPY INTO.

Read on to see how these two work. Also, I too have wanted CTAS in on-premises SQL Server for years.

Comments closed

Defining Data Governance

Published 2020-01-28 by Kevin Feasel

Melissa Coates explains just what it is that data governance does for you:

I believe it comes down to 4 main areas. Data is governed when it is:
Trustworthy
Secured
Managed & Audited
Documented & Understood

Read on to see how this applies to Power BI.

Comments closed

Generating Synthetic Data with R

Published 2020-01-24 by Kevin Feasel

Sidharth Macherla uses the conjurer package in R to generate synthetic data:

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’.

One of the toughest problems of generating data is making it look realistic enough. It’s one level of difficulty to build “steady-state” data, but if you want data to follow a combination of trend and random walk…that’s when things get dicey. H/T R-Bloggers

Comments closed

2020 Data Professional Salary Survey Results

Published 2020-01-10 by Kevin Feasel

Brent Ozar has another year of salary data for us:

A few things to know about it:
– The data is public domain. The license tab makes it clear that you can use this data for any purpose, and you don’t have to credit or mention anyone.
– The spreadsheet includes the results for all 4 years (2017-2020.) We’ve gradually asked different questions over time, so if a question wasn’t asked in a year, the answers are populated with Not Asked.
– The postal code field was totally optional, and may be wildly unreliable. Folks asked to be able to put in small portions of their zip code, like the leading numbers.
– Frankly, anytime you let human beings enter data directly, the data can be pretty questionable – for example, there were 14 folks this year who entered annual salaries below $500. If you’re doing analysis on this, you’re going to want to discard some outliers.

It’s on my agenda (somewhere…probably a bit further back than I’d like) to dig into this year’s data and try to come up with something a little more comprehensive now that there are four years of data.

Comments closed

Data Professional Salary Survey Now Open

Published 2019-12-16 by Kevin Feasel

Brent Ozar announces the 2020 edition of the Data Professional Salary Survey:

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:
Take the Data Professional Salary Survey now.
The anonymous survey closes Sunday, January 5, 2020. The results will be completely open source, and shared with the community for your analysis.

Please take a few minutes and fill this out before January 5th. I’d really love to see a lot of non-SQL Server professionals fill out the survey, as every year, I end up having to ignore database platform because it’s 95% SQL Server and 5% everything else.

Comments closed

Scripting and Deploying Data with SSDT

Published 2019-11-04 by Kevin Feasel

Kamil Nowinski shows how you can script out and deploy data as part of your SSDT database project:

It’s a very common scenario when we have a database (project) and require some data in selected tables even in the freshly deployed, empty database.
Nomenclature for these data/tables that you can meet are:
– reference data
– master data
– dictionary
Which term have you heard and is closest to you?
By default, SSDT does not support scripting or deploying the data.
Hence: the question is: how we can cope with that?

Read on for the answer.

Comments closed

Generating Anonymous Data

Published 2019-10-03 by Kevin Feasel

Daniel Hutmacher has a nice web API to generate fake customer data:

I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.
In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.

Read the whole thing and check out his service. Also, Daniel was the one who spurred me on to update the theme here to get rid of some problems, so you can thank him for that too.

Comments closed

Predicting Application Problems from the Database

Published 2019-10-01 by Kevin Feasel

Ed Pollack has a pattern for rooting out application problems based on database activity:

We can approach I/O file stats very similarly to how we handled row counts above: Regularly collect data, store it in a reporting table, and then run analytics against it as needed. Since these database metrics are reset when SQL Server services restart, we need to collect a bit more often. We’ll also want to collect often enough to be able to correlate changes to ongoing application activity. Hourly is typically an acceptable collection frequency, but your environment may lend itself to the more frequent or less frequent collection.

What’s nice is that you can get a long way with heuristics and domain knowledge, even before applying data science techniques.

Comments closed

Delta Lake Schema Enforcement

Published 2019-09-25 by Kevin Feasel

Burak Yavuz, et al, explain the concept of schema enforcement with Databricks Delta Lake:

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list.

Something something “relational database” something something. They also walk us through some examples in a Databricks notebook, so check that out.

Comments closed

Category: Data