Press "Enter" to skip to content

Month: May 2021

Fitting Excel Macros into Data Science Pipelines

Bryan Shalloway has a process for us:

While I no longer use it regularly for the purposes of analysis, I will always have a soft spot in my heart for excel. Furthermore, using a “correct” set of data science tools often requires a bridge. Integrating a rigorous component into a messy spreadsheet based pipeline can be an initial step towards the pipeline or team or organization starting on a path of continuous improvement in their processes. Also, spreadsheets are foundational to many (probably most) BizOps teams and therefore are sometimes unavoidable…

In this post I will walk through a short example and some considerations for when you might decide (perhaps against your preferences) to integrate your work with extant spreadsheets or shadow “pipelines” within your organization.

Click through for Bryan’s thoughts on the topic.

Comments closed

Building a Kafka Test Environment with Kafdrop

Diogo Souza walks us through an interesting project:

From a daily life standpoint, it’s challenging to manage Kafka brokers, partitions, topics, producers, and consumers all via command line. An interface would be quite helpful.

There is a ton of available options for managing your Kafka brokers for web UI applications. Perhaps Confluent’s version is one of the most complete, although it is part of a paid combo for mostly enterprise means.

Amongst the myriad of open-source options, Kafdrop stands out for being simple, fast, and easy to use. It is an open-source web project that allows you to view information from Kafka brokers as existing topics, consumers, and even the content of messages sent.

This article explores creating a more flexible test environment to work alongside the .NET app built in the previous article. This way, you’ll have more powerful tools to understand what’s happening with your topics.

Read on to learn how you can install and use Kafdrop.

Comments closed

Tracking Query Compile Time

Grant Fritchey doesn’t have time to wait:

A question that came up recently was how to track the query compile time. It’s actually a pretty interesting question because, there aren’t that many ways to tell how long it took to compile the query, and they don’t necessarily agree. For most of us, most of the time, compile time for a given query doesn’t matter. However, I love telling the story of the query I had on an old system that could run in 90ms, but took 5 minutes to compile. In short, sometimes compile time matters.

Read on to learn several ways to determine how long it took that query to compile.

Comments closed

Power Query Data Profiling

Ed Hansberry takes us through the data profiler in Power Query:

A solid green bar is usually best. It means there are no issues in that column, as shown in the Discounts field. On the far right in the COGS field, there is a tiny bit of gray. That means there is no data, or a null, in one or more records. The bigger the gray bar, the higher percentage of nulls there are. In the middle though we have a problem. Anything in red is an error and must be fixed. These may cause more errors in further transformations downstream, and will usually cause refresh errors.

Before we address this, let’s get a bit more information about our data. Go to the View ribbon, and there are 3 settings we want to look at.

I really like what the data profiler provides us. If you’re a regular Power BI user, I highly recommend checking it out if you haven’t already.

Comments closed

Centralized Data Modeling via Power BI Templates

Haroon Ashraf aims to square the circle:

Data modeling is the way you can arrange and link your organizational data (typically in the form of tables) for reporting and analysis.

In other words, it is the strategy of lining tables with each other to get useful information by following the standard practices and domain knowledge of the organization.

Traditionally, it stands for implementing the star or snowflake schema from the perspective of the data warehouse BI solution.

What is Centralized Data Modeling?

Centralized data modeling means a generic data model consisting of some commonly used tables, relationships, and hierarchies that are shared across the organization. These elements the starting point for Power BI report development to anyone eligible, interested, and capable to do so.

With that in mind, read on to learn how you can use Power BI templates to bring this about. I joke about squaring the circle here because if you treat Power BI as a self-service business intelligence tool, the users may not be totally familiar with what you’re doing and could end up accidentally undermining your plans. That said, it’s a good approach to solving this common problem.

Comments closed

Event Log Mining with Powershell

Jeffrey Hicks proffers some advice:

The other day someone who is learning PowerShell reached out to me with a problem. He couldn’t understand why the relatively simple PowerShell expression to pull information from the System event log wasn’t working. He wasn’t seeing errors, but he also wasn’t seeing the events he was expecting. Searching event logs with PowerShell is a common task. But as you’ll see, you may need to update your approach to mining event logs with PowerShell. Things change in the PowerShell world, and sometimes in subtle ways that you may not notice. Although to be fair, some of these changes my arise from new versions of the .NET Framework and/or Windows 10. Here’s what we encountered.

Read on to learn a more efficient technique for this process.

Comments closed

Cannot Recover the Master Database

Erik Cobb takes us through an undesirable experience:

Recently, during patching for a 2019 SQL Server, the SQL services refused to start after the patching. It was throwing the following heart attack inducing error:

Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it.

This is not the first time I have seen this error. Actually, this seems to be the default error that Microsoft throws any time there is a problem with the patching process. The good news is, most likely this is a false error and the master database is perfectly fine.

Read on to learn more.

Comments closed

Recursive UDF Bug in SQL Server 2019

Erik Darling finds the bugs so you don’t have to:

I see people do things like this fairly often with UDFs. I don’t know why. It’s almost like they read a list of best practices and decided the opposite was better.

This is a quite simplified function, but it’s enough to show the bug behavior.

While writing this, I learned that you can’t create a recursive (self-referencing) scalar UDF with the schemabinding option. I don’t know why that is either.

I’m going to go out on a limb and say that if you run into this bug, it’s your own fault.

Comments closed

Hot, Cool, and Large Numbers

Holger von Jouanne-Diedrich hits the casino:

The longest streak in roulette purportedly happened in 1943 in the US when the colour red won 32 consecutive times in a row! A quick calculation shows that the probability of this happening seems to be beyond crazy:

0.5^32[1] 2.328306e-10

So, what is going on here? For once streaks and clustering happen quite naturally in random sequences: if you got something like “red, black, red, black, red, black” and so on I would worry if there was any randomness involved at all (read more about this here: Learning Statistics: Randomness is a strange beast). The point is that any sequence that is defined beforehand is as probable as any other (see also my post last week: The Solution to my Viral Coin Tossing Poll). Yet streaks catch our eye, they stick out.

There’s one critical assumption in this post, which is that the game is fair, in that each event has an equal probability of happening. But as a Bayesian, if a roulette table hits red 32 times in a row, it certainly opens the door to the idea that maybe the odds on that table with that dealer aren’t quite equal between red and black.

Comments closed

Creating Diagrams from Code

Sheldon Hull walks us through the diagrams package in Python:

LucidChart, Draw.io and other tools are great for a quick solution.

Mermaid also provides a nice simple text based diagramming tool that is integrated with many markdown tools.

For me, this just never fit. I like a bit of polish and beauty in a visual presentation and most of these are very utilitarian in their output.

I came across diagrams and found it a perfect fit for intuitive and beautiful diagram rendering of cloud architecture, and figured it would be worth a blog post to share this.

Back when GitPitch was still a viable concern, I had just gotten into using the diagrams package. It takes some getting used to and has very strong preferences on the sorts of relationships diagram elements can have, but it’s good at its job.

Comments closed