Press "Enter" to skip to content

Day: March 12, 2020

Data Lakes and the Power of Data Catalogs

Ashish Kumar and Jorge Villamariona take us through data lakes and data catalogs:

Any data lake design should incorporate a metadata storage strategy to enable business users to search, locate and learn about the datasets that are available in the lake. While traditional data warehousing stores a fixed and static set of meaningful data definitions and characteristics within the relational storage layer, data lake storage is intended to support the application of schema at read time with flexibility. However, this means that a separate storage layer is required to house cataloging metadata that represents technical and business meaning. While organizations sometimes simply accumulate content in a data lake without a metadata layer, this is a recipe for an unmanageable data swamp instead of a useful data lake. There are a wide range of approaches and solutions to ensure that appropriate metadata is created and maintained. Here are some important principles and patterns to keep in mind. Single data set can have multiple metadata layers dependent on use cases. e.g. Hive Metastore, Apache Glue etc. Same data can be exported to some NoSQL database which would have different schema.

Having a bunch of data isn’t helpful if you don’t know where it is, how it’s formatted, or anything else about the data.

Comments closed

Python Cross-Validation

John Mount has some advice if you’re doing cross-validation in Python:

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. The default can default to a deterministic, and even ordered split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible as since they don’t touch the pseudo-random number generator they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

Click through to see the problem and how you can fix it.

Comments closed

Extended Events and Query Store

Jason Brimhall takes us through some of the internals of Query Store as exposed by Extended Events:

One of my favorite questions to ask during some of presentations on XE is “What was the first version of SQL Server to have Query Store?” You can imagine the wide array of answers but what is interesting is how often  the correct answer is always missed. I hear lots of answers for 2012, some answers for 2017 and somewhere in between for 2016. But never does the correct answer pop up.

Right now, I hope you are scratching your head at that last statement. You see, the question is somewhat of a trick question. The first version of SQL Server that has QDS is SQL Server 2014. However, the first version where you can actually use it is SQL Server 2016. This fun fact is visible when we start exploring SQL Server from the realm of XE. Let’s take a look.

Read the whole thing.

Comments closed

Getting Started with Azure Cognitive Search

Matt How introduces us to Azure Cognitive Search:

Intuitive and powerful search technologies are becoming more and more important as businesses look to get more value from their unstructured data. Having the ability to full text search across an entire organisation’s worth of files can present huge opportunities for efficiency and understanding. Modern search tools now offer Artificial Intelligence (AI) capabilities that allow value driven enrichment of the raw content using Machine Learning and Data Science techniques. Microsoft’s Azure Cognitive Search product is a leader in this space and offers an excellent search experience with many out-of-the-box AI competencies.

Click through for an overview and a demo.

Comments closed

Oracle’s Automatic Workload Repository Explained

Kellyn Pot’vin-Gorman explains to us what the Automatic Workload Repository is:

The Automatic Workload Repository, (AWR) had been around since Oracle 10g and requires the diagnostic and tuning management pack licensing to use all of its features in Oracle’s Enterprise Edition database. Versions before had limited collections vs. the modern reporting schema and every subsequent release of Oracle has added to it’s content, which explains the size increase stored in the objects/number of objects in the SYSAUX tablespace.

By default and since version, the AWR retention is 8 days and takes an automatic snapshot once per hour. It’s common for DBAs to up this retention to at least 31 days to capture a month of workload information and these snapshot identifiers can then be used to identify workload intervals for querying and reporting. Oracle can be also be configured to lessen the intervals between snapshots to change the granularity of the AWR reports, or my preference, the DBA or privileged user can take manual snapshots to identify an important beginning or ending of a period.

Kellyn goes into a good amount of detail in this post and, based on the title, promises at least a part 2. Though this could be a History of the World: Part I trick Kellyn is playing on us.

Comments closed

Preparing for Demos with Pester

Jess Pomfret has some advice for you if you ever give a presentation:

If you don’t know what Pester is, it’s a test framework for PowerShell.  In the simplest explanation, using their Domain-Specific Language (DSL) you describe how things should look. If all looks good it returns output in green and if it doesn’t you get red output.  There are a lot of great use cases for Pester, like using it to ensure your code does what it’s supposed to, using it to validate your SQL Server environment (dbachecks), or in this example using it to make sure your demos are setup and ready to go.

When I’m preparing for a presentation I go through the demos over and over again, so it’s easy to accidentally leave things in a state that will cause issues when I go to do my demos in the presentation. If you’re creating a table, for example, during the demo and you already created it practicing and then forgot to drop it, the demo gods will strike and it’ll fail when it matters most! A simple Pester test to check whether the table exists will solve this issue.

Even if you aren’t giving talks in public (or inside your company), Pester is a useful tool for ensuring that the thing you expect to be the case actually is the case.

Comments closed

From IDENTITY to Sequences

Andy Levy recommends checking out sequences:

The SEQUENCE object eliminates all of this. It gives you a simple way to just grab a new number and increment for the next caller. It’s very fast as there’s no visible table I/O, and it’s unaffected by rollbacks.

And it’s so much easier to use! You just ask for the next number in the sequence!

I like sequences, though the interesting thing is that 2020 me has created them a lot less frequently than 2012 me was sure I would. I’m glad they’re in the product, however.

One thing I should point out is that sequences are like identity columns in that you can have gaps due to user behavior, such as rolling back transactions. If you absolutely need gap-free sets of numbers, you’re back to Andy’s Method One, except everything has to be serializable and wrapped in explicit transactions.

Comments closed

Power BI Security Features

James Serra takes us through different ways to secure your Power BI dashboards and reports:

Row-Level Security: With Row-level security (RLS) you are given the ability to publish a single report to your users but expose the data differently to each person. So instead of creating multiple copies of the same report in order to limit the data, you can just create one report that will only show the data the logged in user is allowed to see. This is done with filters, which restrict data access at the row level, and you define filters within roles. For example, creating a role called “United States” that filters the data in a table where the Region = “United States”. You then add members (user, security group, or distribution list) who can only see data for the United States to the “United States” role (the assignment of members can only be done within the Power BI Service). If a user should not have access to a report, then just don’t include that person in any of the roles for that report, so they would always see a blank report.

Click through for several more options and links to additional resources.

Comments closed