Press "Enter" to skip to content

Author: Kevin Feasel

Chat with Your Own Data in Streamlit and Azure Open AI

I have a new video:

In this video, I show how we can make a GPT-4 deployment aware of our own custom data, without needing to fine-tune the model. I talk about meta prompts and the Retrieval Augmented Generation (RAG) pattern, and then show how you can set this up using Azure AI Search and Azure OpenAI. Then, I bring it back to Streamlit and give users the option between chatting with a generic GPT-4 deployment and chatting over custom data.

I try to make my videos 10 minutes in length. They usually end up at 15-18 minutes. This one clocks in at more than 30 minutes and there’s very little fluff.

Comments closed

Filesystem Access for Database Restoration via dbatools

Andy Levy shares a lesson learned:

While performing an instance migration this spring, I happened upon something I didn’t expect in [dbatools](https://dbatools.io/). It should have been a simple backup/restore copy of the databases, with the backup files residing on a fileshare on the destination server after being copied there. I kept getting a warning that the backup files I was attempting to restore couldn’t be read, and the restores (via Restore-DbaDatabase) wouldn’t execute.

I checked permissions on the server over and over again. Both on the filesystem and for the share that I was attempting to read from. Even more curious, if I executed the restore database statements directly from within Management Studio, the databases restored without issue.

After doing quite a bit of digging, I managed to find the reason.

Read on to learn more about necessary permissions, as well as the issue Andy hit, as well as the solution.

Comments closed

Managed Private Endpoints and Trusted Workspace Access for All

Wolfgang Strasser is very pleased with a recent announcement:

In times of data breaches and millions of customer entries breached, the security of your data platform is one of the things you need to consider upfront and – preferably in all your data solutions.

When Microsoft Fabric was announced the concepts of connecting to other parts of your already secured data platform in Azure was not possible. The options to (securely) connect Fabric to other parts of your Azure platform were not available initially.

Read on to learn more about Managed Private Endpoints and Trusted Workspace Access, the initial problem with them both, and how Microsoft has definitely improved things recently.

Comments closed

Reading a Lakehouse Table from another Microsoft Fabric Workspace

Gilbert Quevauvilliers spans the gap:

I was doing some work recently for a customer and they had data stored in different Lakehouse’s which was in a different App Workspace.

I was pleasantly surprised that this can be quite easy to do.

In my example below I am going to show you how in my notebook I can read a table in a Lakehouse table when it is not attached to any Lakehouse.

It’s good that this is so easy to do, considering that current advice leans toward having multiple workspaces and not cramming everything into one.

Comments closed

Systematic Sampling in R

Steven Sanderson continues a series on sampling:

In this post, we will explore systematic sampling in R using base R functions. Systematic sampling is a technique where you select every (k^{th}) element from a list or dataset. This method is straightforward and useful when you want a representative sample without the complexity of more advanced sampling techniques.

Let’s dive into an example to understand how it works.

In very technical circles, this is also known as the “eenie-meenie-meiney-moe technique” and is very similar to the “duck-duck-goose” algorithm, though that has an additional stochastic input.

Comments closed

Instrumenting Postgres

Grant Fritchey goes looking:

I’m still learning PostgreSQL and one of the things I’ve been looking at a lot lately is instrumentation. What do I mean? Well, if you’re in SQL Server, think, Dynamic Management Views (DMV), Extended Events, Query Store, <hack, spit> Trace <spit>. How do we know how long a query took to run? PostgreSQL can tell you, but, and this is one of those wild, cool, but, honestly, slightly frustrating things about PostgreSQL, not natively.

Read on for what Grant’s seen.

Comments closed

Fun(?) with CPU Affinity

Rod Edwards plays with fire:

Here we have the options, to set the processor affinity mask for all process, and also the I/O affinity mask. The default is to let SQL handle this automatically as above, as it will provide better performance for most workloads.

On a basic level, these settings restrict which CPUs can be used when restricting SQL server processing and/or SQL IO operations.

There are very specific use cases in which setting CPU affinity makes sense. Rod does a really good job of showing just why mucking with CPU affinity is not for the faint of heart.

Comments closed

Temporal Table History Cleanup

Chad Baldwin reads the docs:

I recently built a system for collecting index usage statistics utilizing temporal tables, clustered columnstore indexes (CCIs) and a temporal table data retention policy. The basic idea behind the system is that it collects various stats about indexes and updates this stats table. However, because it’s a temporal table, all changes are logged to the underlying history table.

My history table is built using a clustered columnstore index and had a data retention policy set up for the temporal table, like so:

Read on to see the problem Chad ran into and why it turned out not to be an actual problem (except maybe of the PEBKAC variety). In fairness, I would have made the same mistake.

Comments closed

Columnstore Key Lookups are Bad News

Forrest McDaniel does not want to perform that key lookup:

I’ve read it repeatedly, columnstore key lookups are extra slow. The question of course, is why?

In my mental model, it makes sense. A normal key lookup adds about 3 reads.

While a columnstore lookup should add at least a read per column, since each column lives in its own segments.

But it turns out that it’s not a read per column, oh no. Columnstore indexes are amazing for large-scale aggregations and awful for individual lookups.

Comments closed