Author: Kevin Feasel

Scala Views

Girish Bharti takes us through a performance-tuning technique in Scala:

We all know the power of lazy variables in Scala programming. If you are developing an application that handles huge amounts of data, then you must have worked with the Scala collections. Some of the most commonly used collections are List, Seq, Vector, etc. Similarly, you must be aware of the power of Streams. Streams are a very powerful tool for handling an infinite flow of data, and they are powerful because of their lazy transformations. As we know, most of the Scala collections are strict, so applying an operation on an immutable collection creates a new collection. The size of the collection can be huge in the big data world. So, what if you have to apply a lot of transformations to the collection? Is there a way to handle collections in a lazy way? What if you can find a way to apply operations on your usual collections lazily? In this blog, we will be talking about the Scala views and how to use them.

Read the whole thing.
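
To make the strict-versus-lazy distinction in the excerpt concrete, here is a minimal sketch, assuming Scala 2.13 collections. The object name, the expensive helper, and the evaluation counter are hypothetical and exist only to show how much work each pipeline does; this is not code from the linked post.

```scala
object ViewsSketch {
  // Hypothetical counter so we can see how often the mapping function runs.
  var evaluations = 0

  // Stand-in for a costly transformation (illustrative only).
  def expensive(n: Int): Int = {
    evaluations += 1
    n * 2
  }

  // Strict pipeline: map and filter each build a full intermediate List.
  def strictFirstThree(xs: List[Int]): List[Int] =
    xs.map(expensive).filter(_ % 3 == 0).take(3)

  // Lazy pipeline: the view records the steps; toList runs them on demand.
  def lazyFirstThree(xs: List[Int]): List[Int] =
    xs.view.map(expensive).filter(_ % 3 == 0).take(3).toList

  def main(args: Array[String]): Unit = {
    val numbers = (1 to 100000).toList

    evaluations = 0
    val strict = strictFirstThree(numbers)
    val strictCount = evaluations

    evaluations = 0
    val viaView = lazyFirstThree(numbers)
    val lazyCount = evaluations

    println(s"strict: $strict using $strictCount calls")
    println(s"view:   $viaView using $lazyCount calls")
  }
}
```

Both pipelines return the same three values, but the view version should call expensive only until the third match is found rather than once per element. Views re-run their transformations each time they are traversed, so they tend to suit one-pass pipelines better than results that are read repeatedly.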

Delta Lake to Become an Open Standard

Michael Armbrust and Reynold Xin have exciting news about Delta Lake:

At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged.

Delta Lake focuses on improving the reliability and scalability of data lakes. Its higher level abstractions and guarantees, including ACID transactions and time travel, drastically simplify the complexity of real-world data engineering architecture. Since we open sourced Delta Lake six months ago, we have been humbled by the reception. The project has been deployed at thousands of organizations and processes exabytes of data each month, becoming an indispensable pillar in data and AI architectures.

Read on to see what this means for Delta Lake.

Benchmarking JSON Query Times

Silvano Coriani compares different options for loading and querying JSON data in Azure SQL Database:

Storing and retrieving data from JSON fragments is a common need in many application scenarios, like IoT solutions or microservice-based architectures. These fragments can be persisted in a variety of data stores, from blob or file shares to relational and non-relational databases, and there’s a long-standing debate in the industry about which database technology fits “better” for this task.
 
Azure SQL Database offers several options for parsing, transforming and querying JSON data, and this article doesn’t pretend to provide a definitive answer to that debate, but rather to explore these options for common scenarios like data loading and retrieving, and benchmarking results to provide a clear indication of how Azure SQL Database will perform manipulating JSON data.

Read on for the results.

Query Folding with Power BI Dataflows

Matthew Roche shares a few important points about Power BI dataflows and query folding:

In a recent post I mentioned an approach for working around the import-only nature of Power BI dataflows as a data source in Power BI Desktop, and in an older post I shared information about the enhanced compute engine that’s currently available in preview.

Some recent conversations have led me to believe that I should summarize a few points about dataflows and query folding, because these existing posts don’t make them easy to find and understand.

Read on for those points.

Fixing Key Lookup Problems

Erik Darling has a couple of techniques for mitigating key lookup-related performance problems:

They’re one of those things — I’d say even the most common thing — that makes parameterized code sensitive to the bad kind of parameter sniffing, so they get a lot of attention.

The thing is, most of the attention that they get is just for columns you’re selecting, and most of the advice you get is to “create covering indexes”.

That’s not always possible, and that’s why I did this session a while back on a different way to rewrite queries to sometimes make them more efficient. Especially since key lookups may cause blocking issues.

Read on to see what you can do when a covering index isn’t a viable option.

Training, Validation, and Test Data Sets with SAS Viya

Beth Ebersole takes us through creating training, validation, and test data sets using SAS Viya:

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.

For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal.

Read on for some good notes, including the difference between mean squared error and average squared error.
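
As a rough illustration of the excerpt's one-input regression, the sketch below partitions a data set into training, validation, and test sets, fits ordinary least squares on the training portion only, and reports average squared error on each partition. It is plain Scala rather than SAS Viya, and every name in it (PartitionSketch, Obs, Line, the 60/20/20 split, the seeds, and the made-up data) is a hypothetical choice for illustration, not something taken from the linked post.

```scala
import scala.util.Random

object PartitionSketch {
  // One observation: x = hours of training, y = the target (illustrative).
  final case class Obs(x: Double, y: Double)

  final case class Line(intercept: Double, slope: Double) {
    def predict(x: Double): Double = intercept + slope * x
  }

  // Shuffle with a fixed seed, then cut into train / validation / test.
  def split(data: Seq[Obs], trainFrac: Double, validFrac: Double,
            seed: Long): (Seq[Obs], Seq[Obs], Seq[Obs]) = {
    val shuffled = new Random(seed).shuffle(data)
    val nTrain = (data.size * trainFrac).toInt
    val nValid = (data.size * validFrac).toInt
    val (train, rest) = shuffled.splitAt(nTrain)
    val (valid, test) = rest.splitAt(nValid)
    (train, valid, test)
  }

  // Closed-form OLS for a single input: minimizes the sum of squared errors.
  def fit(train: Seq[Obs]): Line = {
    val n = train.size.toDouble
    val meanX = train.map(_.x).sum / n
    val meanY = train.map(_.y).sum / n
    val sxy = train.map(o => (o.x - meanX) * (o.y - meanY)).sum
    val sxx = train.map(o => math.pow(o.x - meanX, 2)).sum
    val slope = sxy / sxx
    Line(meanY - slope * meanX, slope)
  }

  // Average squared error: sum of squared errors divided by the row count.
  def averageSquaredError(model: Line, data: Seq[Obs]): Double =
    data.map(o => math.pow(o.y - model.predict(o.x), 2)).sum / data.size

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Made-up data: y = 10 - 0.5 * x plus Gaussian noise.
    val data = (0 until 100).map { i =>
      val x = i.toDouble / 5
      Obs(x, 10.0 - 0.5 * x + rng.nextGaussian())
    }
    val (train, valid, test) = split(data, 0.6, 0.2, seed = 1L)
    val model = fit(train) // only the training rows influence the fit
    println(s"model: $model")
    println(s"train ASE: ${averageSquaredError(model, train)}")
    println(s"valid ASE: ${averageSquaredError(model, valid)}")
    println(s"test ASE:  ${averageSquaredError(model, test)}")
  }
}
```

The validation error is what you would compare across candidate models, while the test partition is held back for a final, unbiased check. On the terminology, my understanding of the usual SAS convention is that average squared error divides the sum of squared errors by the number of observations, whereas mean squared error divides by the error degrees of freedom; the linked post is the place to confirm the details.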

Shaded Ranges in Excel

Elizabeth Ricks shows how to create shaded ranges in Excel:

We can see there’s clear seasonality in this business—overall volume is highest in the summer and each outing type generally follows the same monthly pattern. Let’s say you manage the Family rentals and you’d like to compare your monthly volume to what you’re seeing across the entire fleet. 

For the purpose of this tactical illustration, let’s assume the shape of the data—relative peaks and valleys—is more important than the specifics of each category individually. If that’s the case, I can simplify by showing a shaded region to depict the range of absolute passengers each month.

This technique is excellent when you have a large number of lines but only care about one versus the norm, and individual lines would be too distracting.

Running Oracle on Azure

Kellyn Pot’vin-Gorman takes us through various options for running Oracle on Azure:

For a DBA, running Oracle on Azure VM environments isn’t that different from running Oracle on VMs on-premises. The DBAs and developers that I work with still have their jobs, still work with their favorite tools, and also get the chance to learn new skills such as cloud administration.

Click through for more, including a setup script.

MSDTC and Availability Groups

Ryan Adams provides guidance on using distributed transactions against Availability Groups:

A paramount concept to understand is how to make the DTC highly available.  We can see from the precedence order that SQL Server will use the local DTC out of the box.  This makes it appear that everything is working, and it is, but it is not exactly highly available.

I see a lot of customers leave it configured this way because they either don’t know the ramifications or do not realize they are using the MSDTC (Linked Servers). Since it simply works out of the box, things get left this way until they end up with a suspect database and error messages that look like this:

“SQL Server detected a DTC/KTM in-doubt transaction with UOW  {598B7EDD-F7A1-9DC1-8D3E-303A4C93AAB4}.Please resolve it following the guideline for Troubleshooting DTC Transactions.”

Read the whole thing. There are a lot of small areas between processes where things can fail, and the combination of DTC + AGs is no different.
