Press "Enter" to skip to content

Category: Data

2020 Data Professional Salary Survey Results

Brent Ozar has another year of salary data for us:

A few things to know about it:

– The data is public domain. The license tab makes it clear that you can use this data for any purpose, and you don’t have to credit or mention anyone.
– The spreadsheet includes the results for all 4 years (2017-2020). We’ve gradually asked different questions over time, so if a question wasn’t asked in a year, the answers are populated with Not Asked.
– The postal code field was totally optional, and may be wildly unreliable. Folks asked to be able to put in small portions of their zip code, like the leading numbers.
– Frankly, anytime you let human beings enter data directly, the data can be pretty questionable – for example, there were 14 folks this year who entered annual salaries below $500. If you’re doing analysis on this, you’re going to want to discard some outliers.

It’s on my agenda (somewhere…probably a bit further back than I’d like) to dig into this year’s data and try to come up with something a little more comprehensive now that there are four years of data.
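If you pull the spreadsheet into pandas for that kind of analysis, the outlier cleanup Brent mentions is only a line or two. Here is a minimal sketch; the file name and the SalaryUSD / Survey Year column names are assumptions about the published workbook, so adjust them to whatever you actually download.

import pandas as pd

# File name is a placeholder; column names are assumptions about the
# published workbook, so check the actual spreadsheet before running.
df = pd.read_excel("data_professional_salary_survey.xlsx")

# Some entries are free text, so coerce salary to numeric first.
df["SalaryUSD"] = pd.to_numeric(df["SalaryUSD"], errors="coerce")

# Discard implausible entries such as the sub-$500 "annual" salaries.
plausible = df[df["SalaryUSD"].between(10_000, 1_000_000)]

# Quick sanity check: median salary by survey year.
print(plausible.groupby("Survey Year")["SalaryUSD"].median())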


Data Professional Salary Survey Now Open

Brent Ozar announces the 2020 edition of the Data Professional Salary Survey:

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:

Take the Data Professional Salary Survey now.

The anonymous survey closes Sunday, January 5, 2020. The results will be completely open source, and shared with the community for your analysis.

Please take a few minutes and fill this out before January 5th. I’d really love to see a lot of non-SQL Server professionals fill out the survey, as every year I end up having to ignore database platform because it’s 95% SQL Server and 5% everything else.


Scripting and Deploying Data with SSDT

Kamil Nowinski shows how you can script out and deploy data as part of your SSDT database project:

It’s a very common scenario: we have a database (project) and need certain data in selected tables, even in a freshly deployed, empty database.
Terms you may encounter for this kind of data/tables include:
– reference data
– master data
– dictionary
Which of these terms have you heard, and which is closest to you?

By default, SSDT does not support scripting or deploying data.
Hence the question: how can we cope with that?

Read on for the answer.


Generating Anonymous Data

Daniel Hutmacher has a nice web API to generate fake customer data:

I’ve been working on a little gadget for a while now, and today I finally got around to completing it and so now I’ve published it for everyone to try out. It’s a web API (wait, wait, don’t go away – it’s for database people!) that creates a randomized list of names, addresses, etc.

In this post, I’ll show you how easy it is to use this service to anonymize a development or test database so you don’t have all that personally identifiable information floating around.

Read the whole thing and check out his service. Also, Daniel was the one who spurred me on to update the theme here to get rid of some problems, so you can thank him for that too.
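If you want a feel for consuming a service like this from code, here is a rough sketch. The endpoint, parameters, and response shape below are placeholders rather than Daniel’s actual API, so check his post for the real details.

import requests

# Placeholder endpoint and parameters; substitute the real URL and
# query options from Daniel's post.
API_URL = "https://example.com/api/people"

response = requests.get(API_URL, params={"count": 100}, timeout=30)
response.raise_for_status()

# Assumed response: a JSON list of fake name/address records, which you
# would then pair with real customer keys to overwrite PII columns in a
# development or test copy of the database.
for person in response.json()[:5]:
    print(person)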


Predicting Application Problems from the Database

Ed Pollack has a pattern for rooting out application problems based on database activity:

We can approach I/O file stats very similarly to how we handled row counts above: Regularly collect data, store it in a reporting table, and then run analytics against it as needed. Since these database metrics are reset when SQL Server services restart, we need to collect a bit more often. We’ll also want to collect often enough to be able to correlate changes to ongoing application activity. Hourly is typically an acceptable collection frequency, but your environment may lend itself to more frequent or less frequent collection.

What’s nice is that you can get a long way with heuristics and domain knowledge, even before applying data science techniques.
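Ed works in T-SQL; purely as an illustration of the collection loop he describes, a sketch of the hourly snapshot from the application side might look like this. The connection string and the dbo.FileStatsHistory reporting table are assumptions, not his schema.

import pyodbc

# Placeholder connection string; point it at the monitored instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=DBA;Trusted_Connection=yes;"
)

# Snapshot cumulative file-level I/O stats into an assumed reporting
# table. The DMV resets on service restart, so keep the collection
# timestamp and diff consecutive snapshots for per-interval activity.
snapshot_sql = """
INSERT INTO dbo.FileStatsHistory
    (CollectionTime, DatabaseId, FileId, NumOfReads, NumOfBytesRead,
     NumOfWrites, NumOfBytesWritten, IoStallMs)
SELECT SYSUTCDATETIME(), database_id, file_id, num_of_reads,
       num_of_bytes_read, num_of_writes, num_of_bytes_written, io_stall
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
"""

with conn:
    conn.execute(snapshot_sql)

# Schedule this hourly (cron, SQL Agent, etc.), then run your analytics
# against dbo.FileStatsHistory as needed.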


Delta Lake Schema Enforcement

Burak Yavuz, et al, explain the concept of schema enforcement with Databricks Delta Lake:

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list.

Something something “relational database” something something. They also walk us through some examples in a Databricks notebook, so check that out.
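To see the rejection in action outside of that notebook, here is a minimal PySpark sketch. The table path and columns are invented for illustration, and it assumes a Spark session with Delta Lake configured.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured
path = "/tmp/delta/loans"  # illustrative table location

# Create a Delta table with two columns.
(spark.createDataFrame([(1, 1000.0)], ["loan_id", "amount"])
    .write.format("delta").save(path))

# Appending a DataFrame with an unexpected extra column fails the write;
# Delta Lake raises an AnalysisException describing the schema mismatch.
bad = spark.createDataFrame([(2, 2000.0, "TX")], ["loan_id", "amount", "state"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("Write rejected:", e)

# To accept the new column deliberately, opt in to schema evolution:
# bad.write.format("delta").mode("append").option("mergeSchema", "true").save(path)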


Null Checks in Spark DataFrames

Bipin Patwardhan gives us four techniques for validating whether data in Spark exists:

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”

Click through for those techniques.
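His techniques are behind the link; as a general baseline rather than his code, a couple of common null-check idioms in PySpark look like this, using a small made-up DataFrame in place of his “df.”

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice"), (2, None), (None, "Carol")],
    ["customer_id", "name"],
)

# Rows where one column is null.
print(df.filter(F.col("customer_id").isNull()).count())

# Null counts for every column in a single pass.
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).show()

# Keep only fully populated rows.
clean = df.na.drop("any")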


“Big” Data

Buck Woody explains that “Big Data” is just data:

A few years ago it was all the rage to talk about “Big Data”. Lots of descriptions of “Big Data” popped up, including the “V’s” (Variety, Velocity, Volume, etc.) that proved very helpful. I even have my own definition:

Big Data is any data you can’t process
in the time you want
with the systems you have

This post is quite reasonable in its depiction of the problem. I extend it a bit further than that and talk about the difficulty of processing the data. Nonetheless, read Buck’s full thoughts and check out the Big Data Clusters workshop.


Azure Data Share

James Serra takes us through a new product announcement:

A brand new product by Microsoft called Azure Data Share was recently announced. It is in public preview. To explain the product in short, any data which resides in Azure storage can be securely shared between a data provider and a data consumer. It does this by copying a snapshot of the data to the consumer’s subscription (called snapshot-based copying, and in the future there will be in-place sharing). It currently supports ADLS Gen1, ADLS Gen2, and Blob storage, and eventually will support Azure Data Explorer, SQL DB, and SQL DW. Check out the Documentation and a video, and then go try it out.

You can share the data with a few clicks as long as the user you are trying to share with has access to an Azure Subscription and storage account. The copying and updating of the data is handled for you using the Microsoft backbone for best performance, and is encrypted during transit. You can specify the frequency at which the data consumers receive updates. It also is a simple way to control, manage, and monitor all of your data sharing.

This is a smart idea. Sharing data between companies is a key requirement in a lot of B2B solutions, yet methods for sharing range from high-development and medium-friction (create your own API) to low-development and high-friction (FTP, e-mail).


Containers and Data

Randolph West argues that you should keep data and containers separated:

Where it gets interesting is that the SQL Server container is also where the database files are stored by default. I raised a point (which Grant and others have already noted in the past) that persisted storage volumes allow us to throw away a SQL Server container without throwing away the database files, provided that the container is set up to use that persisted storage.

For example, I can map the SQL Server container to a local directory on my Ubuntu or Windows Server, or — as is the case with Kubernetes — a second container can serve as the storage volume. SQL Server is then just a compute engine, or a “service” as Anthony points out in the Twitter thread.

Because every rule has a counter-example (even this one), there are cases when you do want the data to live with the container. For example, a test database for a unit test runner should probably live inside the container rather than being a persisted volume. The reason is that you can blow away the database after a test run and start over for the next run. Putting together a database for a hackathon or user group can be another exception for the same reason. But for pretty much every other purpose, I’d rather have a persisted volume.
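To make the persisted-volume pattern concrete, here is a rough sketch using the Docker SDK for Python. The image tag, SA password, and host path are placeholders, and Randolph’s point is the volume mapping rather than this particular tooling.

import docker

client = docker.from_env()

# The host directory (placeholder path) holds the data, log, and secrets
# folders; the container itself is just the compute engine on top.
container = client.containers.run(
    "mcr.microsoft.com/mssql/server:2019-latest",
    detach=True,
    environment={"ACCEPT_EULA": "Y", "SA_PASSWORD": "YourStrong!Passw0rd"},
    ports={"1433/tcp": 1433},
    volumes={"/srv/mssql-data": {"bind": "/var/opt/mssql", "mode": "rw"}},
    name="sql-dev",
)

# Throwing this container away leaves /srv/mssql-data intact; a new
# container mounted on the same path picks up the existing databases.
# container.stop(); container.remove()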
