Press "Enter" to skip to content

Curated SQL Posts

Visualizing Linear Regression Results

Bernardo Lares gives us a few ways of visually interpreting linear regression results in R:

The most obvious plot to study for a linear regression model, you guessed it, is the regression itself. If we plot the predicted values vs the real values we can see how close they are to our reference line of 45° (intercept = 0, slope = 1). If we have a very sparse plot where we can see no clear tendency over that line, then we have a bad regression. On the other hand, if we have all our points over the line, I bet you gave the model the results you wished for!

Then, the Adjusted R² on the plot gives us an easy parameter for comparing models and seeing how well each fits our reference line. The nearer this value gets to 1, the better. Without getting too technical: if you add more and more useless variables to a model, this value will decrease; but if you add useful variables, the Adjusted R² will improve.

We also get the RMSE and MAE (Root Mean Squared Error and Mean Absolute Error) for our regression’s results. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. On the other hand we have RMSE, which is a quadratic scoring rule that also measures the average magnitude of the error: it’s the square root of the average of squared differences between prediction and actual observation. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better.

I like this approach to explaining models.
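
If you want to poke at the same ideas outside of R, here is a minimal Python sketch (not the R approach from the post) that fits a toy regression, plots predicted against actual values over the 45° reference line, and prints RMSE and MAE; the generated data is an assumption purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy data: three useful predictors plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

pred = LinearRegression().fit(X, y).predict(X)

rmse = mean_squared_error(y, pred) ** 0.5   # quadratic: penalizes large errors more
mae = mean_absolute_error(y, pred)          # average absolute error, same units as y

# Predicted vs actual, with the 45-degree (intercept 0, slope 1) reference line.
plt.scatter(y, pred, alpha=0.5)
lims = [y.min(), y.max()]
plt.plot(lims, lims, color="red")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
plt.show()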


Test-Driven Database Development

Haroon Ashraf walks us through a simplified example of test-driven database development:

In TDDD, business requirements are encapsulated in database unit tests.

If the requirement is to add a new category to the Category table, TDDD is implemented according to the following steps:

  1. Create a database unit test to check for the existence of the AddNewCategory database object.

  2. The unit test fails because the database object does not exist.

  3. Create the AddNewCategory object so that the unit test passes.

  4. Write a unit test to determine whether the AddNewCategory stored procedure actually adds a new category.

  5. That unit test also fails.

  6. Change the AddNewCategory procedure code to add a new category, then verify by rerunning the unit test, which now passes.

Laying out my biases, I’m not a fan of TDD for application development and definitely not a fan of it for database development. “Unit testing” inside a database is extremely limited, particularly when there are so many side effects and when encapsulation tends to be actively harmful.
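
For anyone who does want to kick the tires on the cycle above, here is a hedged sketch of what the two tests might look like in pytest and pyodbc; the post itself uses SQL Server tooling rather than Python, and every name below (the connection string, the Store database, the Category table, and the AddNewCategory parameter) is an assumption for illustration.

import pyodbc
import pytest

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=Store;Trusted_Connection=yes")

@pytest.fixture
def conn():
    cn = pyodbc.connect(CONN_STR, autocommit=False)
    yield cn
    cn.rollback()   # leave no test data behind
    cn.close()

def test_add_new_category_procedure_exists(conn):
    # Steps 1-3: fails until the AddNewCategory object is created.
    row = conn.execute("SELECT OBJECT_ID('dbo.AddNewCategory', 'P')").fetchone()
    assert row[0] is not None

def test_add_new_category_inserts_a_row(conn):
    # Steps 4-6: fails until the procedure actually adds the category.
    conn.execute("EXEC dbo.AddNewCategory @Name = ?", "Stationery")
    count = conn.execute("SELECT COUNT(*) FROM dbo.Category WHERE Name = ?",
                         "Stationery").fetchone()[0]
    assert count == 1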


Improving AG Database Level Failover

Sourabh Agarwal announces improvements to Availability Groups when it comes to database level failover:

In addition to the existing checks, the new implementation has the following additional checks.

  1. The new implementation stores and uses a historical snapshot of the database state information to decide if a failover should be initiated. The health check routine caches the database state and associated error information for the last two executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the error codes mentioned below) exists in three consecutive runs of the health detection routine, a failover is initiated. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of availability groups.

  2. The new implementation checks for the following additional errors. The majority of these errors are indicative of hardware issues on the server. Please note that this is not an exhaustive list of errors which could impact database availability. There is an outstanding item to include error 824 in this list.

Great news from the Tiger Team.
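
To make the three-consecutive-runs idea concrete, here is a toy Python sketch of that rule as I read it; this is a paraphrase of the described behavior, not SQL Server's actual code, and the error codes in the example are an illustrative subset rather than the official list.

from collections import deque

CONSECUTIVE_RUNS = 3                      # current run plus the two cached runs
history = deque(maxlen=CONSECUTIVE_RUNS)  # cached error state per health-check run

def should_fail_over(current_error_code, monitored_errors):
    """Record this run's error state and decide whether to initiate a failover."""
    history.append(current_error_code)
    if len(history) < CONSECUTIVE_RUNS:
        return False
    first = history[0]
    # Fail over only when the same monitored error appears in every cached run;
    # anything transient (fixable by auto page repair) falls out of the window.
    return (first is not None
            and first in monitored_errors
            and all(code == first for code in history))

# Example: error 823 seen in three consecutive health-check runs triggers failover.
monitored = {605, 823, 829, 832}          # illustrative subset, not the official list
for code in (None, 823, 823, 823):
    print(should_fail_over(code, monitored))   # False, False, False, True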


Managing SQL Logins Across Different Instances

Raul Gonzalez shows us how to maintain the same login across different SQL Server instances—it’s all about the SIDs:

Most servers out there would have both enabled, so sooner or later we DBAs need to deal with SQL logins, and there is more to it than providing a name and a password (a strong one, of course).

If you also have different environments, most likely you want to create different logins to avoid DEV apps or users connecting to LIVE, or vice versa.

But when you have different logins, and therefore different database users, whenever you need to refresh your DEV (TEST, QA…) environment you have to re-apply all of the granted permissions to the right user, because the login does not exist in that environment. Does that sound familiar?

In this post I will show you how you can handle this problem in a very simple way.

Click through to read the whole thing.
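
As a rough Python illustration of the SID idea (not Raul's code), here is a pyodbc sketch that reads a SQL login's SID from one instance and re-creates the login with the same SID on another; the server names, login name, and password are assumptions.

import pyodbc

def connect(server):
    return pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=" + server +
        ";DATABASE=master;Trusted_Connection=yes", autocommit=True)

source = connect("PROD01")   # instance that already has the login
target = connect("DEV01")    # refreshed environment that needs the matching login

# Read the SID of the SQL login on the source instance.
sid = source.execute(
    "SELECT sid FROM sys.server_principals WHERE name = ? AND type = 'S'",
    "app_login").fetchone()[0]

# Re-create the login on the target with the same SID, so database users in any
# copy restored from the source map straight back to it instead of being orphaned.
target.execute("CREATE LOGIN [app_login] WITH PASSWORD = N'Str0ng!Passw0rd', "
               "SID = 0x" + sid.hex())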


It’s 10 O’Clock; Do You Know Where Your Backups Are?

Adrian Buckman has a script which makes sure your backups are where msdb says they are:

Here is the information that the script provides:

  1. Warns of restores over the top of the database since its last FULL backup.

  2. Shows database snapshots currently against the database.

  3. Shows the last FULL, DIFF, and LOG backup for the database, including the backup durations and backup age.

  4. Shows backup file information such as backup start/finish time, file path, first LSN, and last LSN, plus a status column which states whether the log chain is intact (based on first and last LSN and on whether the file exists on disk), and finally a file-exists column which tells you if the file still exists on disk.

This is a great script if you take transaction log backups frequently (typically a good idea).
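
If you want the gist in a few lines (this is emphatically not Adrian's script, which does far more), here is a small Python sketch that reads recent backup history from msdb and checks whether each file still exists; the connection string and the seven-day window are assumptions, and the file check only makes sense when run on the SQL Server host or against UNC paths.

import os
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=msdb;Trusted_Connection=yes")

# Backup history as msdb recorded it: one row per backup file written in the last week.
rows = conn.execute("""
    SELECT bs.database_name, bs.type, bs.backup_finish_date, bmf.physical_device_name
    FROM msdb.dbo.backupset AS bs
    JOIN msdb.dbo.backupmediafamily AS bmf ON bmf.media_set_id = bs.media_set_id
    WHERE bs.backup_finish_date >= DATEADD(DAY, -7, GETDATE())
    ORDER BY bs.backup_finish_date DESC
""").fetchall()

for db_name, backup_type, finished, path in rows:
    # type: D = full, I = differential, L = log
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status:8}{db_name:30}{backup_type:3}{finished}  {path}")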


SSAS Tabular Deployment Fails: Newtonsoft.Json Missing

Alex Whittles walks us through an error deploying a SQL Server Analysis Services tabular model:

Deploying an Analysis Services Tabular model to SSAS Azure using the Analysis Services Deployment Wizard, with both Visual Studio 2017 and SQL Server 2017 installed on the client.

Trying to click on the ellipses to change the data source connection string or impersonation information results in a Newtonsoft.Json error:

“Could not load file or assembly ‘Newtonsoft.Json, Version 6.0.0.0, Culture=neutral, ……”

As I like to joke, every single .NET project in existence includes Newtonsoft.Json.  As Alex shows, sometimes they don’t reference the right version.


Connecting GitHub To Azure Container Registry

Andrew Pruski automates the generation of SQL Server Docker images in Azure Container Registry, generating a new image with each GitHub repo check-in:

Fantastic, one build task created! How easy was that??

Let’s test by running:

az acr build-task run --registry TestContainerRegistry01 --name buildsqlimage

And the progress of the build task can be monitored:

az acr build-task logs --registry TestContainerRegistry01

Andrew gives us the step-by-step details, so check it out.


Parsing T-SQL Scripts With Pester

Rob Sewell shows us how to use Pester to ensure that a set of SQL scripts are valid T-SQL:

This is a quick Pester test I wrote to ensure that some SQL scripts in a directory would parse, so there was some guarantee that they were valid T-SQL. It uses SQLParser.dll, and because it was running on a build server without SQL Server I had to load the required DLLs from the dbatools module (thank you, dbatools 🙂).

It simply runs through all of the .sql files, runs the parser against them, and checks for errors. In the case of failures, it will also output where the parse failed in the error message of the failed Pester result.

This particular example doesn’t ensure that the scripts do what you want them to do, but hey, Pester was built for that as well.
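
The same kind of check translates to other stacks, too. As a loose analogue (my assumption, not anything from Rob's post, which uses Pester and SQLParser.dll), pytest plus the sqlglot library can assert that every script at least parses as T-SQL, with the caveat that sqlglot is not Microsoft's parser and will not catch everything SQLParser.dll does.

import pathlib

import pytest
import sqlglot
from sqlglot.errors import ParseError

SCRIPT_DIR = pathlib.Path("sql")   # directory of .sql files; adjust as needed

@pytest.mark.parametrize("script", sorted(SCRIPT_DIR.glob("*.sql")), ids=str)
def test_script_parses_as_tsql(script):
    sql = script.read_text(encoding="utf-8")
    try:
        sqlglot.parse(sql, read="tsql")   # raises ParseError on invalid syntax
    except ParseError as err:
        pytest.fail(f"{script} failed to parse: {err}")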


Data Engineering Remains As Important As Ever

Prashanth Southekal has good news for ETL developers:

While many companies have embarked on data analytics initiatives, only a few have been successful. Studies have shown that over 70% of data analytics programs fail to realize their full potential and over 80% of digital transformation initiatives fail. While there are many reasons that affect the successful deployment of data analytics, one fundamental reason is lack of good quality data. However, many business enterprises realize this and invest considerable time and effort in data cleansing and remediation, technically known as data engineering. It is estimated that about 60 to 70% of the effort in data analytics is spent on data engineering. Given that data quality is an essential requirement for analytics, there are five key reasons why data analytics is heavy on data engineering.

1. Different systems and technology mechanisms to integrate data.

Business systems are designed and implemented for a purpose, mainly for recording business transactions. The mechanism for data capture in business systems such as ERP is batch/discrete data, while in SCADA/IoT field systems it is continuous/time-series data. This means that these business systems store diverse data types caused by the velocity, volume, and variety dimensions of the data. Hence the technology (including the database itself) to capture data is varied and complex. And when you are trying to integrate data from these diverse systems from different vendors, the metadata model varies, resulting in data integration challenges.

That 60-70% on data engineering is probably a moderate underestimate.


Analyzing Clickstream Data With Spark

Tony Cruz and Denny Lee analyze advertising data in Spark and predict click counts given certain input features:

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle.  The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

  • ETL

  • Data Exploration, for example, using SQL

  • Advanced Analytics / Machine Learning

The Databricks blog has a couple other examples, but this was the most interesting one for me.
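
To give a flavour of what those three steps look like in code, here is a condensed PySpark sketch rather than the Databricks notebook itself; the column names exist in the Kaggle CTR dataset, but the feature selection, file path, and model choice are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ctr-prediction").getOrCreate()

# ETL: load the raw impressions and keep a handful of categorical features.
df = spark.read.csv("/data/ctr/train.csv", header=True, inferSchema=True)
features = ["banner_pos", "site_category", "device_type"]

# Data exploration could happen here via df.createOrReplaceTempView("impressions")
# and plain SQL; below is the machine learning step: index the categories, assemble
# a feature vector, and fit a logistic regression to predict the click label.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in features]
assembler = VectorAssembler(inputCols=[c + "_idx" for c in features],
                            outputCol="features")
lr = LogisticRegression(labelCol="click", featuresCol="features")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=indexers + [assembler, lr]).fit(train)

auc = BinaryClassificationEvaluator(labelCol="click").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")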
