Press "Enter" to skip to content

Curated SQL Posts

Pattern Learning in Amazon SageMaker

Vishaal Kapoor, et al., take us through an example of pattern learning in Amazon SageMaker:

Pattern learning automatically analyzes your data and surfaces textual constraints that may apply to your dataset. For the example with phone numbers, pattern learning can analyze the data and identify that the vast majority of phone numbers follow the textual constraint [1-9][0-9]{2}-[0-9]{4}. It can also alert you that there are examples of invalid data so that you can exclude or correct them.

In the following sections, we demonstrate how to use pattern learning in Data Wrangler using a fictional dataset of product categories and SKU (stock keeping unit) codes.

Read on for the scenario.
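
Pattern learning itself is a Data Wrangler feature, but the core idea is easy to sketch outside of it. Here is a rough Python illustration (mine, not SageMaker's implementation) of checking values against the learned constraint and flagging the outliers:

    import re

    # The textual constraint surfaced by pattern learning in the quote above.
    PATTERN = re.compile(r"[1-9][0-9]{2}-[0-9]{4}")

    phone_numbers = ["555-0123", "867-5309", "55-012", "5550123", "123-456789"]

    # Partition the values into those satisfying the constraint and outliers.
    valid = [p for p in phone_numbers if PATTERN.fullmatch(p)]
    invalid = [p for p in phone_numbers if not PATTERN.fullmatch(p)]

    print("valid:", valid)      # ['555-0123', '867-5309']
    print("invalid:", invalid)  # candidates to exclude or correct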


Splitting Strings with KQL

Robert Cain splits the baby:

In databases, we often find columns that are stored in a hierarchical structure, not unlike a file path on your drive. For example, in the Microsoft Logs sample database the Perf table stores its counter path this way: \\computername\Memory\Available MBytes.

It would be helpful to have a way to easily break this path out into its individual parts. KQL provides us a way of doing this using the split function.

Check out examples of how you can perform splitting.
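
As it happens, Python's str.split behaves much like KQL's split here, right down to the empty leading elements produced by the two leading backslashes, so a rough analog (my sketch, not Robert's examples) looks like this:

    # Rough analog of KQL's split() applied to the counter path from the quote.
    counter_path = r"\\computername\Memory\Available MBytes"

    parts = counter_path.split("\\")
    print(parts)  # ['', '', 'computername', 'Memory', 'Available MBytes']

    # Drop the empty elements to keep only the meaningful path segments.
    segments = [p for p in parts if p]
    print(segments)  # ['computername', 'Memory', 'Available MBytes']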


Getting Row Counts for Different DBMS Platforms

Brendan Tierney wants row counts:

A little warning before using these queries. They may or may not give the true, accurate number of records in the tables. These examples illustrate extracting the number of records from the data dictionaries of the databases. This is dependent on background processes being run to gather this information. These background processes run from time to time, anything from a few minutes to many tens of minutes. So, these results are a good indication of the number of records in each table.

Click through for examples in Oracle, MySQL, Postgres, SQL Server, and Snowflake. Though the SQL Server one does need a GROUP BY clause because it’s a sum of the partitions’ rows.
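
To illustrate that SQL Server case, here is a minimal sketch (the connection string is a placeholder; adjust it for your environment) that rolls the partition row counts up to one estimate per table:

    import pyodbc

    # Placeholder connection string -- point it at your own server and database.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=YourDatabase;Trusted_Connection=yes;"
    )

    # sys.partitions has one row per partition, so SUM plus GROUP BY is needed
    # to get a single estimate per table. index_id 0 (heap) and 1 (clustered
    # index) cover the base table data.
    sql = """
    SELECT s.name AS schema_name, t.name AS table_name, SUM(p.rows) AS row_estimate
    FROM sys.partitions p
    JOIN sys.tables t ON p.object_id = t.object_id
    JOIN sys.schemas s ON t.schema_id = s.schema_id
    WHERE p.index_id IN (0, 1)
    GROUP BY s.name, t.name
    ORDER BY row_estimate DESC;
    """

    for schema_name, table_name, row_estimate in conn.cursor().execute(sql):
        print(f"{schema_name}.{table_name}: {row_estimate}")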


When Totals in Power BI Look Inaccurate

Marco Russo and Alberto Ferrari ask who you believe, them or your lying eyes:

When looking at a report, it is natural to double-check the numbers produced. The simplest and most intuitive way is to verify whether the total equals the sum of individual rows. This behavior is extremely natural and mostly effective. Nonetheless, the total is the sum of rows only for additive measures, which are measures that are naturally computed as a sum.

When working with business intelligence solutions, sooner or later a developer will author a calculation that is non-additive. At that point, the total can no longer be computed by summing the rows for a very good reason: it would be inaccurate. When users complain about the fact that the rows do not sum up, seasoned BI developers offer a rational explanation of the reasons why the numbers are not summed: this process often provides a better understanding of how values are computed. Choosing the easy way out of introducing additivity in a naturally non-additive calculation means losing the opportunity to generate accurate calculations, and relying on inaccurate values.

Read on for examples and to understand how to deal with non-additive or semi-additive measures.
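
As a toy illustration of why the rows cannot always sum to the total (my example, not the authors'), consider a distinct count of customers: the same customer can appear in several months, so the grand total is not the sum of the monthly rows.

    import pandas as pd

    # Toy sales data: customer A buys in three different months.
    sales = pd.DataFrame({
        "month":    ["Jan", "Jan", "Feb", "Feb", "Mar"],
        "customer": ["A",   "B",   "A",   "C",   "A"],
        "amount":   [100,   200,   150,   300,    50],
    })

    # Additive measure: the total really is the sum of the rows.
    print(sales.groupby("month", sort=False)["amount"].sum().sum())  # 800
    print(sales["amount"].sum())                                     # 800

    # Non-additive measure: distinct customers per month.
    print(sales.groupby("month", sort=False)["customer"].nunique().sum())  # 5
    print(sales["customer"].nunique())  # 3 -- the accurate grand total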


Tips for Getting Technical Help Online

Grant Fritchey provides some tips:

One thing you see a lot is that people, for whatever reason, will absolutely not simply state what the actual problem is. You’ll get stuff like “What SQL Server internal behavior prevents dynamically naming local variables?” And you’re left scratching your head, why on earth would someone want to dynamically name local variables? Only to find out, after lots of comment & discussion, they thought that you needed to rename variables when changing values.

When you get stuck, take a moment to describe the problem, but aim for the root of the problem, not peripheries. “I want to do X. I’m doing X. It looks like this. The results are this when they should be that. What am I doing wrong?” That’s the “magic” formula.

Grant provides several helpful tips for increasing the likelihood that someone will help you find the right answer.

I want to note two quick things here, though. The first is that, quite often, people don’t even know what they intend to ask. In the cited bit above, this quite often comes from basic confusion cascading into specific “I’m stuck here” questions. This leads to foundational misunderstandings between people and frustrates the question-and-answer process.

The second thing I’d note is: be clear in your writing. The biggest tip I have here is to make sure your sentences and ideas flow. In other words, within a paragraph or section, keep thoughts related to the key topic and avoid asides or rambling. Your inner monologue can make sense of a variety of asides, but a reader without access to your thoughts will struggle to follow along. Furthermore, it is common to write a post and then fill in additional details before submission; a common result of this is that you add some information intended to clarify a point which you only actually introduce later in the post. Before posting, read it through and make sure it makes sense from start to finish.


Designing Event Streams for Kafka

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.

In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out; it is free, after all.


Bitemporal Modeling and Running Totals

John Mount solves a running total problem in Python:

An example of this is wanting to know how many reservations for a San Francisco Symphony concert scheduled for December 4th 2022 are known to have been made by October 22nd 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python and just how much we get to take for granted.
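
For comparison, here is what that running total looks like in pandas (my column names, mirroring the window function above; not John's code):

    import pandas as pd

    # Reservations made (action_date) for future concert dates (target_date).
    reservations = pd.DataFrame({
        "target_date": ["2022-12-04", "2022-12-04", "2022-12-04",
                        "2022-12-11", "2022-12-11"],
        "action_date": ["2022-10-01", "2022-10-15", "2022-10-22",
                        "2022-10-10", "2022-10-20"],
        "reservation_count": [10, 25, 5, 8, 12],
    })

    # PARTITION BY target_date ORDER BY action_date, then a running sum.
    reservations = reservations.sort_values(["target_date", "action_date"])
    reservations["running_total"] = (
        reservations.groupby("target_date")["reservation_count"].cumsum()
    )
    print(reservations)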


Fine-Tuning Hugging Face for Named Entity Recognition in Japanese

Tsuyoshi Matsuzaki tries out a named entity recognition project with the Hugging Face library:

Nowadays a lot of AI companies (such as OpenAI, NLP Cloud, Google, and NVIDIA) provide pre-trained large language models, along with methods for fine-tuning them. Among these tools and frameworks, Hugging Face is widely used, providing over 20,000 transformer-based models.

In this post, I’ll show you a brief example of fine-tuning transformer models in Hugging Face to get you started.
In the last part of this post, I’ll also optimize training with DeepSpeed, which is well integrated with Hugging Face transformers.

Click through for the results of this analysis.
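
If you want a quick taste of the inference side before working through the full fine-tuning walkthrough, the transformers pipeline API makes it short. The model id below is an assumption; swap in whichever Japanese NER checkpoint you fine-tune or pull from the Hugging Face Hub:

    from transformers import pipeline

    # Assumed model id -- substitute your own fine-tuned checkpoint.
    ner = pipeline(
        "token-classification",
        model="tsmatz/xlm-roberta-ner-japanese",
        aggregation_strategy="simple",  # merge subword tokens into whole entities
    )

    text = "鈴木一郎は東京都渋谷区で働いています。"  # "Ichiro Suzuki works in Shibuya, Tokyo."
    for entity in ner(text):
        print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))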
