Sampling Data Lake Data

Alex Whittles shows how to use U-SQL to sample data lake data for use in Power BI:

The answer is sampling: we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, depending on how much you need to reduce your dataset. It is, however, critical to know how to sample data correctly in order to maintain a level of accuracy in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region? You’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.
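
To make the problem concrete, here is a minimal Python sketch (not the U-SQL from the linked post). The product names, row counts, and the `top_x` helper are all hypothetical, chosen only to mimic a source file that arrives pre-sorted by product.

```python
from collections import Counter

# Hypothetical source data: 250 rows per product, already sorted by product name.
products = ["apple", "banana", "cherry", "date"]
rows = [{"product": p, "order_id": i} for p in products for i in range(250)]


def top_x(rows, x):
    """Return the first x rows, exactly as a 'top x' query would."""
    return rows[:x]


sample = top_x(rows, 100)

# Count how many sampled rows belong to each product.
products_in_sample = Counter(r["product"] for r in sample)
print(products_in_sample)
```

Because the first 250 rows all belong to the first product, the 100-row sample contains no rows at all for the other three, so any report built on it would describe only one product.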

Option 2: Take a random % sample
Now we’re talking. This option will take, for example, 1 in every 100 rows of data, so it picks up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?
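
As a rough illustration of the idea, and not the U-SQL approach from the linked post, the following Python sketch keeps each row independently with a fixed probability. The `sample_fraction` helper, the product names, and the row counts are assumptions made up for this example.

```python
import random
from collections import Counter


def sample_fraction(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # seeded so the sample is repeatable
    return [row for row in rows if rng.random() < fraction]


# Hypothetical source data: 10,000 rows per product, sorted by product name.
products = ["apple", "banana", "cherry", "date"]
rows = [{"product": p, "order_id": i} for p in products for i in range(10_000)]

# Take roughly 1 in every 100 rows.
sample = sample_fraction(rows, 0.01, seed=42)

# Count how many sampled rows belong to each product.
counts = Counter(r["product"] for r in sample)
print(len(sample), counts)
```

Note that this style of sampling returns approximately 1% of the rows rather than exactly 1%, and the sampled rows are spread across every product even though the source is sorted.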

Read on for a couple of sampling methods.
