Press "Enter" to skip to content

Category: Data

2023 Data Professional Survey Results

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that inflation in the US went up considerably. Inflation wasn’t something we had to factor in from 2017-2021, as it was 1.5-2%. In 2021, it increased to more than 4% and in 2022 was closer to 8-9%, so converting these from nominal (pre-inflation) to real (post-inflation) will help tell the full story.

Comments closed

A New Sample Database

Daniel Hutmacher isn’t satisfied with AdventureWorks:

The database collation is Latin1_General_100_CI_AS_SC.

I’ve divided information into schemas based on their respective sources. The “ReferenceData” schema will have mixed sources, all of them publicly available.

Because the data is so geographically bound, many of the tables with have geo data as well, though I technically put it in a geometry type and not a geography type – just because it was easier. This can make for some cool map visuals in SSMS if you want.

Most columns and tables are annotated using the extended property MS_description, so if you view the extended properties in SSMS, or if you use my sp_ctrl3 utility, a brief description will show up for each object.

Read on for an overview of the database’s schema, as well as the link to download the DB. I’ll have to check it out.

Comments closed

E-Mailing Power BI Query Outputs as CSV Files

Gilbert Quevauvilliers finds a way:

I recently had a requirement from a customer where they wanted a list of all Customers and the Primary Contact to be emailed to them every day.

The reason for this to be emailed daily is to ensure that when calling the customer, they know whom to speak to.

This got me thinking and I could use Power Automate to achieve this task which I detail in this blog post below.

I appreciate the ingenuity involved in getting this to work, though this also presents a good case for having this data in a warehouse, where data export to CSV would be easier.

Comments closed

The Importance of Data Shaping

Paul Turley shapes the youth of data:

Power BI is a new tool and dimensional modeling is an old idea. One of the challenges is that, like other modern self-service analytics products on the market, Power BI doesn’t force self-service data jockeys to transform their data before reporting with it. If you want to import a big, wide spreadsheet full of numbers and create charts in a Power BI report, knock yourself out. But, the solution won’t scale and you will inevitably run into walls when you try to make future enhancements. Similar problems arise from importing many tables from different sources and transactional systems. Several tables all chained together with creative mashups and relationships present their own set of problems. The first iteration of such an effort is usually a valuable discovery method and learning experience. Great… treat it as such; take notes, make note of the good parts and then throw it away and start over! In Fredrick Brooks’ “The Mythical Man Month“, he cites that for most engineering projects, the first six attempts should be abandoned before the team will be prepared to start over and complete the work successfully. He was a chemical engineer before working for IBM; and hopefully, our methods in the data engineering business are more effective then his 6-to-1 rule. But, this makes the case the prototypes and proof-of-concept projects are a critical part of the learning path.

The tools don’t make the rules.

Unless you’re talking about the lambda architecture, in which case that’s kind of accurate. But we’re not talking about that here.

Comments closed

Data Retention: Definition and Policy

Joey Jablonski thinks about data retention:

Data retention policies should be defined in a way that they are easy to understand, easy to be implemented programmatically, and should enable engineering teams to operate independently most of the time when working with datasets that are known and already leveraged by the organization. In addition to policy definitions, data governance leaders should ensure changes are part of data literacy plans for training and rollout to ensure awareness across the organization.

This is something that most DBAs provide input into but don’t directly control. Still, it’s good to know some of the challenges around data retention and figure out how to apply it to your organization.

Comments closed

Security Practices for Delta Sharing

Andrew Weaver, et al, share some advice:

When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token lifetime to 0, recipient tokens never expire.

Setting the appropriate token lifetime is critically important for regulatory, compliance and reputational standpoint. Having a token that never expires is a huge risk; therefore, it is recommended using short-lived tokens as best practice. It is far easier to grant a new token to a recipient whose token has expired than it is to investigate the use of a token whose lifetime has been improperly set.

Click through for eight such tips.

Comments closed

Detecting Data Changes in Power BI Incremental Refresh

Chris Webb writes some M:

One feature of Power BI incremental refresh I’ve always been meaning to test out is the ability to create your own M queries to work with the “detect data changes” feature, and last week I finally had the chance to do it. The documentation is reasonably detailed but I thought it would be a good idea to show a worked example of how to use it to get direct control over what data is refreshed during an incremental refresh.

Read on to see how it works, including a couple gotchas around things like the shape of query results.

Comments closed

Finding Sample Data Online

Mara Pereira goes searching for data:

Have you ever struggled to find sample data to play with in Power BI?

Did you spend hours (sounds crazy, but it happened to me too!) just looking for a dataset with insurance data? Healthcare data? Housing prices data?

Did you ever wonder “where are people finding the data to create those Netflix and Amazon reports that seem to be everywhere these days? Seriously, tell me your secret!”?

Click through for three good sites. Another one I’ve taken to is the US Bureau of Labor Statistics, which is the federal department responsible for tracking things like employment data, consumer prices, and compensation data. The plus side to these datasets is that you get the whole range of data cleanup, warehousing, querying, and analytics over data which is both real and fairly interesting. For a broader take, data.gov has open source data hosted by the US federal government, though I haven’t spent much time working with it.

Comments closed

Visualizing Kafka Stream Lineage

David Araujo and Julia Peng show off stream lineage in Confluent Cloud:

Stream Lineage is a tool Confluent built to address the lack of data visibility in Kafka and event-driven architectures. Confluent’s Stream Lineage provides an interactive map of all your data flows that enable users to:

1. Understand what data flows are running both now or at any point in the past

2. Trace where each data flow originated from

3. Track how data is transformed along its journey

4. Observe where each data flow ends up

Read on to see how it works.

Comments closed

Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al, announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…

Comments closed