Sampling Data Lake Data

Alex Whittles shows how to use U-SQL to sample data to read in Power BI:

The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

Read on for a couple of sampling methods.

Related Posts

Working With Azure SQL Managed Instances

Jovan Popovic has a couple of posts covering configuration for Azure SQL Managed Instances.  First, he looks at how to configure tempdb: One limitation in the current public preview is that tempdb don’t preserves custom settings after fail-over happens. If you add new files to tempdb or change file size, these settings will not be preserved after fail-over, and […]

Read More

Flushing The Authentication Cache

Arun Sirpal describes an Azure SQL DB-only DBCC command: This command only applies to Azure SQL Database, at a high level it empties the database authentication cache for logins and firewall rules for the current USER database. In Azure SQL Database the authentication cache makes a copy of logins and server firewall rules which are […]

Read More


January 2017
« Dec Feb »