Data – Page 12 – Curated SQL

Validating Data Model Results in Power BI

Published 2020-10-12 by Kevin Feasel

Paul Turley continues a series on doing Power BI the right way:

When designing a new data model, this is typically the first thing I do… For every fact table and for each large dimension table, I create a measure that returns the record count for that table. Users normally think about their data in business terms (like sums, ratios and averages) and not about how many records there are in a table. Record counts are a convenient sanity check for record completeness and distribution across groups; and may also be a good indicator for model size and performance.

Paul takes several passes at the problem, getting a bit deeper into it each time.

Comments closed

Processing Larger Messages with Apache Kafka

Published 2020-08-18 by Kevin Feasel

Kai Wähner walks us through the tradeoffs of sending large messages in Apache Kafka:

After exploring use cases for large message payloads, let’s clarify what Kafka is not:
Kafka is usually not the right technology to store and process large files (images, videos, proprietary files, etc.) as a whole. Products were built specifically for these use cases.
For instance, a Content Delivery Network (CDN) such as Akamai, Limelight Networks, or Amazon CloudFront distribute video streams and other software downloads across the globe. Or “big file editing and processing” (like a video processing tool). Or video editing tools from Adobe, Autodesk, Camtasia, and many other vendors are used to structure and present all video information, including films and television shows, video advertisements, and video essays.

There’s a lot of good advice in here. I think the best advice is essentially “don’t do this unless you need it” but I appreciate that Kai goes a lot further than that.

Comments closed

The Dunder Mifflin Data Set

Published 2020-08-10 by Kevin Feasel

Tim Mitchell has a new data set for us:

I’ve been a fan of Dunder Mifflin ever since I first learned about this small midwestern paper company. Over the years I’ve gotten to know their people and processes, following from a distance their successes, failures, and various adventures. Who would have known the paper business would be so interesting?
Based on what I learned about this company, I built this Dunder Mifflin data set based on the old Northwind structure, adapting it to meet the needs of this small paper company. It includes most of the employees, regional locations (both current and now-closed), and has a modestly-sized set of sales data for demos and testing.

Check out Tim’s GitHub repo and give it a try.

Comments closed

Trust and Warehouse Data

Published 2020-07-14 by Kevin Feasel

Rob Farley explains one way that people might lose trust in your warehouse data:

The scenario is that there’s a source system, and there’s a table in a warehouse that is being used to report on it. Maybe it’s being populated by Integration Services or Data Factory. Maybe it’s being populated by T-SQL. I don’t really care. What I care about is whether the data in the warehouse is a true representation of what’s in the source system.
If it’s not a true representation, then we have all kinds of problems.
Mostly, that our warehouse is rubbish.

Read on for an example of how this might occur and what you can do to prevent it.

Comments closed

Free Data Sets

Published 2020-07-06 by Kevin Feasel

Chris Taylor puts on the Santa Claus getup:

This is just a short overview page of various dataset sources I’ve used in the past for usage in my SQL Server, Azure, AWS and Power BI demo’s. Most if not all of these are free as I don’t like paying for stuff and there are some whereby you can generate your own data if there is something more specific you need……..just be careful as although its “randomly” generated, I have had issues where I pushed the files to github and as it happens, one of the email addresses randomly generated was that of an employee of a company which I’d never heard of. Pure chance (I believe) but something to be mindful of if that is your intention for use.

Chris also has sub-links to Google and Kaggle data sets.

Comments closed

Obfuscating Data in SQL Server

Published 2020-06-15 by Kevin Feasel

Dave Mason has a data obfuscator:

In a previous post, I explored an option for generating fake data in sql server using Machine Learning services and the R language. I’ve expanded on that by creating some stored procedures that can be used for both generating data sets of fake data, and for obfuscating existing SQL Server data with fake data.
The code is available in a Github repository. For now, it consists of ten stored procedures.

Unlike something like Dynamic Data Masking, this is a permanent update to the table. That makes it quite helpful for getting production distributions and use cases into non-production environments.

Comments closed

Sharing a Dataset in Power BI

Published 2020-04-08 by Kevin Feasel

Marc Lelijveld shows how you can share a dataset in Power BI:

There are many different use cases to consider where shared datasets can be an advantage. Below I have quickly listed a few advantages, but probably you can think of many more.
– Centrally managed definitions and calculations to avoid different calculations for the same metrics and different versions of the truth.
– One central load from source to Power BI dataset, which lowers the performance impact on the source system.
– Easier to kickstart the data driven analytics experience for the business users and any other self-service analytics purposes.

Sharing here doesn’t mean giving to the broader world; it’s sharing within an organization.

Comments closed

Handling Bad Records with Apache Spark

Published 2020-04-06 by Kevin Feasel

Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:

Most of the time writing ETL jobs becomes very expensive when it comes to handling corrupt records. And in such cases, ETL pipelines need a good solution to handle corrupted records. Because, larger the ETL pipeline is, the more complex it becomes to handle such bad records in between. Corrupt data includes:
– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types
Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.

This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.

Comments closed

Handling Missing Data

Published 2020-04-03 by Kevin Feasel

Marina Wyss explains various techniques for handling missing data in data sets:

Missing or incomplete data can have a huge negative impact on any data science project. This is particularly relevant for companies in the early stages of developing solid data collection and management systems.
While the best solution for missing data is to avoid it in the first place by developing good data-collection and stewardship policies, often we have to make due with what’s available.
This blog covers the different kinds of missing data, and what we can do about missing data once we know what we’re dealing with. These strategies range from simple – for example, choosing models that handle missings automatically, or simply deleting problematic observations – to (probably superior) methods for estimating what those missing values may be, otherwise known as imputation.

I like the distinction in form Marina draws, and we also get a good set of techniques for filling the gaps.

Comments closed

Data Exfiltration Protection when Using Azure Databricks

Published 2020-03-31 by Kevin Feasel

Bhavin Kukadia, et al, explain how to prevent users from taking data from your Databricks cluster without authorization:

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or it processes the data in the service provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.

Click through for the architectural pattern.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Data