Month: January 2024

New Limits for Maximum Connections per Data Source in Power BI

Chris Webb notes a change:

One of the most important properties you can set in a Power BI DirectQuery semantic model is the “Maximum connections per data source” property, which controls the number of connections that can be used to run queries against a data source. The good news is that the maximum value that you can set this property to has just been increased in Premium.

Read on to learn why this setting is important.
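
To make the idea of a connection cap concrete, here is a minimal Python sketch of how a per-source limit bounds the number of queries running at once. It is an illustration only, not how Power BI implements the setting: `MAX_CONNECTIONS`, `run_query`, and the simulated query delay are all hypothetical names and values.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 3  # hypothetical cap, standing in for the per-source setting

slots = threading.BoundedSemaphore(MAX_CONNECTIONS)
lock = threading.Lock()
active = 0  # queries currently holding a connection
peak = 0    # highest number of simultaneous queries observed


def run_query(query_id):
    """Simulate one query that must hold a connection while it runs."""
    global active, peak
    with slots:  # block here until one of the connections is free
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # stand-in for the time the data source takes
        with lock:
            active -= 1
    return query_id


# Ten queries arrive at once, but at most MAX_CONNECTIONS run concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_query, range(10)))

print(f"queries completed: {len(results)}, peak concurrency: {peak}")
```

With a low cap, the remaining queries simply wait their turn, which is why raising the limit can help a report that fires many queries at a data source able to handle the extra load.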

tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model generalizes to the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.

Data Reading and Writing with arrow

Colin Gillespie performs two of the three R’s:

Apache Arrow is a cross-language development platform for in-memory data. As it’s in-memory (as opposed to data stored on disk), it provides additional speed boosts. It’s designed for efficient analytic operations, and uses a standardised language-independent columnar memory format for flat and hierarchical data. The {arrow} R package provides an interface to the ‘Arrow C++’ library – an efficient package for analytic operations on modern hardware.

There are many great tutorials on using {arrow} (see the links at the bottom of the post for example). The purpose of this blog post isn’t to simply reproduce a few examples, but to understand some of what’s happening behind the scenes. In this particular post, we’re interested in understanding the reading/writing aspects of {arrow}.

Read on to see it in action in R.

Exporting to CSV in Azure ML Designer

Tom LaRock saves a file:

The most popular feature in any application is an easy-to-find button saying “Export to CSV.” If this button is not visibly available, a simple right-click of your mouse should present such an option. You really should not be forced to spend any additional time on this Earth looking for a way to export your data to a CSV file.

Well, in Azure ML Studio, exporting to a CSV file should be simple, but is not, unless you already know what you are doing and where to look. I was reminded of this recently, and decided to write a quick post in case a person new to ML Studio was wondering how to export data to a CSV file.

Click through for one false start and then the correct answer.

Globs of Tabs in SSMS

Warwick Rudd has cramped environs:

Working in SQL Server Management Studio is potentially an everyday occurrence for you! And having to work with many queries open at the same time is probably the norm. Depending on the size of your screen that you may be working on, you are limited with the amount of screen real estate you can work in.

Personally, I get frustrated with having to continually go to the open query drop down window to see what queries I have open and be able to cycle through them to make my life easier and be more productive.

Warwick shows off one built-in way to solve this problem. When I was a database developer, I would have 40-50 tabs open at a time sometimes. I used Tabs Studio (commercial product but it’s not that expensive if you’re buying for yourself) to manage all of that.

Storing SQL Server Backups in Cloudflare R2

Daniel Hutmacher saves a buck:

R2 is Cloudflare’s own implementation of AWS S3 storage, with some big benefits – one of them being no egress fees, which is great if you want to publish or distribute a lot of data (like I did with this demo database). In this post, I thought I’d briefly document how to set up R2, and how to use it to back up and restore your SQL Server databases.

You’ll need a Cloudflare account to follow along. The account and a lot of their services are free, but R2 storage obviously comes with a small cost. For scale, I’m running an almost-terabyte bucket at just a couple of dollars per month.

Given the number of times I’ve pushed Daniel’s excellent Chicago parking tickets database (including right now—it’s a great database that I’ve used in several presentations and videos!), the lack of egress charges is pretty big.

Hash Aggregates and Hash Joins in Postgres

Muhammad Ali plays matchmaker:

PostgreSQL employs various techniques for data joining and aggregation in its queries, among which the hash-based method stands out for its efficiency in particular situations and different data sizes. We will discuss hash joins and hash aggregates in PostgreSQL, providing insights on how they work and parameters which influence this algorithm.

Read on to learn more. This looks fundamentally similar to hash matches in SQL Server, so if you’re familiar with that, the concepts should be pretty clear.
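
For readers new to the idea, the build-and-probe pattern behind a hash join, and the one-entry-per-group pattern behind a hash aggregate, can be sketched in a few lines of Python. This is a simplified illustration rather than PostgreSQL's implementation: the function names and sample rows are made up, and a real engine also has to deal with memory limits (for example by spilling to disk in batches), which this sketch ignores.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: hash the (ideally smaller) build side, then probe it."""
    # Build phase: bucket every build-side row by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: one hash lookup per probe-side row.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


def hash_aggregate(rows, group_key, value_key):
    """GROUP BY group_key with SUM(value_key), one hash-table entry per group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Hypothetical sample data.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25},
    {"order_id": 11, "customer_id": 2, "amount": 40},
    {"order_id": 12, "customer_id": 1, "amount": 15},
    {"order_id": 13, "customer_id": 3, "amount": 99},  # no matching customer
]

joined = hash_join(customers, orders, "customer_id", "customer_id")
totals = hash_aggregate(joined, "name", "amount")
print(totals)
```

Hashing the smaller input keeps the in-memory table small, and neither operation needs its input sorted, which is a large part of why the hash-based approach suits big, unsorted inputs.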

Using Spark Connect from .NET

Ed Elliott keeps the hope alive:

Over the past couple of decades working in IT, I have found a particular interest in protocols. When I was learning how MSSQL worked, I spent a while figuring out how to read data from disk via backups rather than via the database server (MS Tape Format, if anyone cared). I spent more time than anyone should learning how to parse TDS (before the [MS-TDS] documentation was a thing)—having my head buried in a set of network traces and a pencil and pen has given me more pleasure than I can tell you.

This intersection of protocols and Spark piqued my interest in using Spark Connect to connect to Spark and run jobs from .NET rather than Python or Scala.

There’s a whole lot more ceremony involved than the Microsoft .NET for Apache Spark project, but read on to see how it all works. Also, I hereby officially chastise Ed for having examples in C# and VB.NET but not the greatest .NET language of them all: F#. Chastisement aside, I appreciate the work Ed put into this to bring Spark Connect to the .NET masses.

Trying out Data Wrangler

Ginger Grant tries out a feature in Microsoft Fabric:

The second element in my series on new Fabric Features is Data Wrangler. Data Wrangler is an entirely new feature found inside of the Data Engineering and Machine Learning Experience of Fabric. It was created to help analyze data in a lakehouse using Spark and generated code. You may find that there’s a lot of data in the data lake that you need to evaluate to determine how you might incorporate the data into a data model. It’s important to examine the data to evaluate what the data contains. Is there anything missing? Incorrectly data typed? Bad Data? There is an easy method to discover what is missing with your data which uses some techniques commonly used by data scientists. Data Wrangler is used inside of notebooks in the Data Engineering or Machine Learning Environments, as the functionality does not exist within the Power BI experience.

Click through to see how it works. I liken it to Power Query for people who don’t like Python.