Press "Enter" to skip to content

Month: September 2024

Working with lapply() in R

Steven Sanderson applies a function:

R is a powerful programming language primarily used for statistical computing and data analysis. Among its many features, the lapply() function stands out as a versatile tool for simplifying code and reducing redundancy. Whether you’re working with lists, vectors, or data frames, understanding how to use lapply() effectively can greatly enhance your programming efficiency. For beginners, mastering lapply() is a crucial step in becoming proficient in R.

Read on to see how lapply() works.

Comments closed

The Importance of Versioning Data

John Mount demonstrates an important concept:

Our business goal is to build a model relating attendance to popcorn sales, which we will apply to future data in order to predict future popcorn sales. This allows us to plan staffing and purchasing, and also to predict snack bar revenue.

In the above example data, all dates in August of 2024 are “in the past” (available as training and test/validation data) and all dates in September of 2024 are “in the future” (dates we want to make predictions for). The movie attendance service we are subscribing to supplies

  • past schedules
  • past (recorded) attendance
  • future schedules, and
  • (estimated) future attendance.

John’s example scenario covers the problem of future estimations interfering with model quality. Another important scenario is when the past changes. As one example, digital marketing providers (think Google, Bing, Amazon, etc.) will provide you impression and click data pretty quickly, and each day they close the books on a prior day’s data at some normal time. For some of these providers, that prior day’s data is yesterday’s data—on Tuesday, provider X closes the books on Monday’s data and promises that it won’t change after that. But for other providers, they might change data over the course of the next 10 days. This means that the data you’re using for model training might change from under you, and you might never know if you don’t keep track of the actual data you used for training at the time of training.

Comments closed

Finding Missing Indexes in SQL Server

Jared Westover goes searching for where those missing indexes got off to:

In the past, while using the missing index Dynamic Management Views (DMVs), something always seemed to be missing from the results. It was hard to put my finger on it then, but looking back, it now seems obvious. Why can’t we see the queries prompting SQL Server to give suggestions? Did you know Microsoft added a DMV with the query text? Since discovering this gem, we no longer need to search through Plan Cache or Query Store.

Click through for the article, but do especially read the list of limitations Jared links to in the summary section before going off and creating a bunch of indexes.

Comments closed

A Reminder for Server Consistency

Chad Callihan resolves an issue:

I connected to the latest SQL Server, opened SSMS, and tried to restore from there. Sure enough, I was presented with the error:

Cannot access the specified path or file on the server. Verify that you have the necessary security privileges and that the path or file exists.

If you know that the service account can access a specific file, type in the full path for the file in the File Name control in the Locate dialog box.

Read on for the solution, which was easy enough, but serves as a reminder that having (and occasionally running!) idempotent configuration scripts can be quite useful.

Comments closed

Implicit Conversions in SQL Server

Vlad Drumea explains what it means implicitly to convert:

If you’re here, you most likely know what a data type conversion is, but, in short, it’s the operation of converting a value from one data type to another.

There are two types of conversions in SQL Server:

  • explicit – which are done by explicitly applying the CAST and CONVERT functions on a column, variable, or value.
  • implicit – when CAST and CONVERT are not used explicitly, but SQL Server ends up doing the conversation behind the scenes due to two distinct data types being compared.

Read on to learn more about which types of implicit conversion are relevant for performance and what you can do instead.

Comments closed

Migrating Power BI Dataflows from Gen1 to Gen2

Reza Rad talks migration:

Unfortunately, there isn’t a migration tool to convert your Power BI dataflow (gen1) to Microsoft Fabric dataflow (gen2). If you have Fabric capacity licenses, it just makes sense to do that migration because Dataflow Gen2 gives you data destinations into four destinations, which we don’t have in Dataflow Gen1. However, converting Gen1 to Gen2 isn’t that complicated. The process is explained in this blog and video.

Click through for the blog post and the video.

Comments closed

Query Start Times in Query Store

Hugo Kornelis describes an issue:

I was hired by a customer who had a very annoying issue with the daily data load of their data warehouse. The volume of data to be loaded is high and they were already struggling to finish the load before business opens. But that was not their biggest issue. The biggest problem, the real pain point that they hired me for, is that at unpredictable moments, the load would run much longer than normal, pushing it well into business hours. They wanted me to find out what caused those irregular delays, and find a way to stop them from happening.

Read on to learn more about the issue itself, as well as a discrepancy in what Query Store showed. Hugo also points out that the quick-and-easy solution may not be the right solution.

Comments closed

Building a Test Data Generator for PostgreSQL

Mika Sutinen builds some data:

I recently had a project where I needed quickly to generate some realistic looking test data to PostgreSQL database. While I often like to go for ready-made solutions, this felt like a good opportunity to stretch my coding muscles and develop it myself. Moreover, this seemed like a fun puzzle to solve, and I could probably use the same solution later on elsewhere.

Click through for a description of the generator, as well as a link to Mika’s GitHub repo. Taking a quick peek at it, it does appear that you could probably use this for other data platforms like SQL Server with very limited modification.

Comments closed

Myths and Reality of Copilot for Power BI

Kurt Buhler puts together an essay:

However, recent months reveal rising skepticism, concern and possibly even disillusionment with generative AI tools, both from investors (especially from investors) and from the public. Despite the massive investment, enthusiasm, and promotion, these tools seem to be seeing limited adoption and aren’t yet showing the measurable value that fulfills their promises. And yet, paradoxically, many professionals will agree anecdotally that they use generative AI tools regularly, and that these tools seem to help them be more productive in certain tasks. Furthermore, there are concrete success stories where generative AI is bringing value, such as the models like the latest versions of Alphafold (from Google) and ESMfold (from Meta) that aid in protein folding for pharmaceutical companies more effectively find potential new drug candidates. So, who are these tools for, what problems do they solve, and how can we use them effectively? This is too big of a topic for even Bink and Bonk the Data Goblins to solve, so let’s narrow the focus, a bit.

This is a must-read, and Kurt even provides a de-goblinified PDF version for management.

Comments closed