
Month: April 2020

Operational Database Management Tools in Cloudera Data Platform

Gokul Kamaraj, et al., describe tools available to DBAs in the Cloudera Data Platform:

Cloudera provides multiple mechanisms to allow backup and recovery, including:

– Snapshots
– Replication
– Export
– CopyTable
– HTable API
– Offline backup of HDFS data

These can be run manually or scheduled using Replication Manager. Backups can also be moved to other instances of the OpDB or alternate storage targets such as AWS S3 or Azure ADLS gen 2.

Even in the Platform-as-a-Service world, there’s still plenty of scope for database administration.


Understanding Key Lookups

Hugo Kornelis continues a series on SQL Server plan operators:

The Key Lookup operator provides a subset of the functionality of the Clustered Index Seek operator, but within a specific context. It is used when another operator (usually an Index Seek, sometimes an Index Scan, rarely a combination of two or more of these or other operators) is used to find rows that need to be processed, but the index used does not include all columns needed for the query. The Key Lookup operator is then used to fetch the remaining columns from the clustered index.

A Key Lookup operator will always be found on the inner input of a Nested Loops operator. It will be executed once for each row found. Since the key values passed in always come from another index, the requested row will always exist (except in rare race scenarios when read uncommitted isolation level is used).
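
Hugo's examples are in the post itself. Purely as a toy analogy (Python dictionaries and a made-up table, not SQL Server internals), the "one extra fetch per row" behavior he describes looks something like this:

```python
# Toy analogy only: a nonclustered index holds a subset of columns plus the
# clustering key, so each row it finds needs one extra lookup into the
# clustered index -- once per row, like the inner input of a Nested Loops join.
clustered_index = {  # clustering key -> full row
    1: {"OrderID": 1, "CustomerID": 42, "Total": 10.0, "ShipCity": "Oslo"},
    2: {"OrderID": 2, "CustomerID": 42, "Total": 25.0, "ShipCity": "Bergen"},
    3: {"OrderID": 3, "CustomerID": 7,  "Total": 99.0, "ShipCity": "Stavanger"},
}
nc_index = {42: [1, 2], 7: [3]}  # nonclustered index on CustomerID

def seek_with_key_lookup(customer_id):
    for order_id in nc_index.get(customer_id, []):   # index seek
        yield clustered_index[order_id]               # key lookup, one per row

print(list(seek_with_key_lookup(42)))
```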

Click through for a great deal of information on key lookups.


Decoding Statistics Names

Jason Brimhall explains how SQL Server comes up with names for auto-created statistics:

Every now and again I am asked about the meaning behind the automatically generated names for statistics in SQL Server. The quick answer is short, sweet and really easy. I give them a quick explanation and then often refer them to the blog post by Paul Randal on the topic.

The better answer is to show them what the auto-generated names really mean, alongside the great explanation from Paul. Finally, after years of the topic being on my backlog, I am sharing a script that will help decode those names and help to prove out fully what’s in a statistic name.
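
Jason's decoder is written in T-SQL, so it isn't reproduced here. Purely as a rough sketch of the convention itself (on recent versions of SQL Server, auto-created statistics are named _WA_Sys_ followed by the column id and then the object id, each rendered as hex digits), the decoding step looks like this; the statistic name below is made up:

```python
def decode_stats_name(name: str) -> dict:
    """Split an auto-created statistics name into its column id and object id."""
    prefix = "_WA_Sys_"
    if not name.startswith(prefix):
        raise ValueError(f"{name} does not look like an auto-created statistic")
    column_hex, object_hex = name[len(prefix):].split("_", 1)
    # Join these values back to sys.columns.column_id and sys.objects.object_id
    # in the database to recover the column and table names.
    return {"column_id": int(column_hex, 16), "object_id": int(object_hex, 16)}

print(decode_stats_name("_WA_Sys_00000007_3A81B327"))  # hypothetical name
```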

The proof is in the SQL; click through to see it.


Adding Time Zone-Adjusted Report Execution Times

Brett Powell shows how you can display a report’s execution time in a particular time zone:

For reports being viewed by users around the world, simply modifying the footer text box expression to note that this time is UTC may be sufficient. However, for many paginated reports the users are all in one time zone, and some of these users may ask to have the time zone conversion handled within the BI solution. The example in this post targets this scenario.

Even if the report serves users in multiple time zones, it’s technically feasible to leverage the UserID global field and a simple user to time zone mapping table to provide a local report execution time to all users. However, I tend to think most projects would not want to commit the time/resources for this logic – UTC date/time is what the users would get.
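
Brett's implementation lives in the report's footer expression, so it isn't repeated here. As a quick Python illustration of the conversion idea itself (not his SSRS code, and the target time zone here is just an example), the point is to capture the execution time in UTC and render it in the audience's zone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Capture the execution time in UTC, then display it in an assumed target zone.
execution_time_utc = datetime.now(timezone.utc)
local_time = execution_time_utc.astimezone(ZoneInfo("America/New_York"))
print(f"Report executed at {local_time:%Y-%m-%d %H:%M} (America/New_York)")
```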

If you do need local report execution time, Brett has you covered.


Generating Entity Framework Core Classes from a Database Project

Erik Ejlskov Jensen walks us through generating Entity Framework classes from a Visual Studio database project and from a .dacpac file:

EF Core Power Tools adds the ability to generate code directly from a Database project, without having to publish to a live database first and without having a SQL Server database engine running locally. It can also generate code from live SQL Server, Azure SQL DB, MySQL, Postgres and SQLite databases. It has a large number of customization options – pluralization, renaming, file and namespace choices and more, which are not available via the EF Core commands. And you do not have to install any design-time libraries in your own project.

Read on for a demo of that as well as a dacpac reverse engineering tool.


Using Azure Functions Inside Azure Data Factory

Rayis Imayev shows how you can call an Azure Function from inside your Azure Data Factory Pipeline:

Creating a data solution with Azure Data Factory (ADF) may look like a straightforward process: you have incoming datasets, business rules of how to connect and change them, and a final destination environment to save this transformed data. Very often your data transformation may require more complex business logic that can only be developed externally (scripts, functions, web services, Databricks notebooks, etc.).

In this blog post, I will try to share my experience of using Azure Functions in my Data Factory workflows: my highs and lows of using them, my victories and struggles to make them work.

This includes a description of the options, a demo function, and additional notes for each technique.
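
Rayis's demo function is in the post; as a minimal sketch of the kind of HTTP-triggered endpoint an ADF Azure Function activity might call (names and payload fields here are placeholders), a Python function could look like the one below. One practical note: the Azure Function activity expects a JSON object back, so it helps to wrap the result rather than return plain text.

```python
import json
import logging

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    """Minimal HTTP-triggered function that a Data Factory pipeline could call."""
    logging.info("Request received from Data Factory.")
    payload = req.get_json() if req.get_body() else {}

    # Return a JSON object: ADF's Azure Function activity expects one in the body.
    result = {"status": "ok", "rows_processed": payload.get("rowCount", 0)}
    return func.HttpResponse(json.dumps(result), mimetype="application/json")
```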


Mistakes to Avoid in a BI Platform Migration

Chris Webb covers five things to consider when migrating your BI platform, using Power BI as an example:

Every report has a data source, and getting source data in the right format for your BI platform is a substantial task – so much so that you might be tempted to put Power BI on top of the data sources you have created for your previous BI platform with no changes. However, different BI platforms need their data in different formats. Many BI platforms like their data munged together in one big table, sometimes even with data at different granularities in the same table. Power BI, on the other hand, likes its source data modelled as a star schema (you can find out what a star schema is and why it’s important here). If you don’t model your data as a star schema you may find that you see incorrect values in your reports, that report performance is poor, and that it’s a lot harder to write the DAX calculations that you need.

Four out of the five fit just as well with any other data platform technology.


Avoiding Loops in Python with NumPy

Swantika Gupta walks us through vectorization and broadcasting with NumPy:

Vectorization is a powerful ability within NumPy which is used to speed up code execution without using loops. It expresses operations as occurring on entire arrays rather than on their individual elements.

Looping over an array or any data structure in Python has a lot of overhead involved. In NumPy, vectorized operations delegate the looping internally to highly optimized C and Fortran functions, making for cleaner and faster Python code. So vectorization refers to the concept of replacing explicit for-loops with array expressions, which can then be computed internally in a low-level language like C.
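
The post has fuller examples; as a quick sketch (not taken from it) of the loop-versus-vectorized contrast, plus a one-line broadcasting case:

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Explicit Python loop: every element is handled in interpreted code
squares_loop = np.empty_like(values)
for i in range(values.size):
    squares_loop[i] = values[i] ** 2

# Vectorized: the loop happens inside NumPy's compiled routines
squares_vec = values ** 2
assert np.array_equal(squares_loop, squares_vec)

# Broadcasting: shapes (3, 1) and (3,) stretch to a common (3, 3) result
grid = np.arange(3).reshape(3, 1) + np.arange(3)
```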

Read on for a few examples of this and broadcasting.
