Press "Enter" to skip to content

Day: October 4, 2022

Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome with a faster speed, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.

Comments closed

English and Japanese Vowels

John Cook plots vowels:

Vowel sounds can be visualized in a two-dimensional space according to tongue position. The vertical axis is runs from open down to closed, and the horizontal runs from front to back. See a linguistics textbook for far more detail.

English has five vowel letters, but a lot more than five vowel sounds. Scholars argue about how many vowel sounds English and other languages have because there’s room for disagreement on how much two sounds can differ and still be considered variations on the same sound. The IPA Handbook [1] lists 11 vowel sounds in American English, not counting diphthongs.

Click through for comparative charts and some bonus expertise in the comments.

Comments closed

Streaming Datasets in Power BI

Reza Rad needs data in real time:

Datasets in Power BI can have connection types such as Import, DirectQuery or Live Connection. However, there is also one specific type of dataset which is different. This type of dataset is called Streaming Dataset. A streaming dataset is for a real-time dashboard and comes with various setups and configurations. In this video and article, we’ll talk about this type of dataset.

Reza includes a video as well as a very helpful walkthrough.

Comments closed

sqlpackage and Managed Identities

Nora Yang provides a guide:

– Enable AAD auth on Azure SQL server

– Conn to Azure SQL database via AAD admin

– Create contained user for the managed identity (using Azure VM name as contained username)

        create user <vmname> from external provider;

        alter role db_owner add member <vmname>;

Read on for the full set of steps.

Comments closed

Tuning a Range Query

Grant Fritchey gets forum-sniped:

Recently I was looking through DBA.StackExchange when I saw a pretty simple question that I decided to answer. I went off, set up a test database and some test tables and quickly wrote a query to answer the question. I got it all formatted pretty and was on my way to post it when I saw that another answer was already there.

Yeah. Identical to mine. Almost line for line.

Well, nuts.

I know. I’ll write a blog post.

In thinking about the problem, the thing that caught my mind was Grant’s comment about poor design. This got me thinking about one of my favorite topics: orthogonal design for relational excellence. The idea of a BETWEEN table of [ MinValue : MaxValue ] is the first thing people think of but is also the worst because you have two big problems: gaps and overlap.

The second solution is to use MinValue and calculate MaxValue (if we actually need it) as LEAD(MinValue) OVER (ORDER BY MinValue) - e, where e represents the smallest reasonable increment we’d need. Queries would find, for each Value in the main table, the largest MinValue below Value. That removes gaps and overlap but might be a performance concern as the main table’s data size grows.

The big-brain solution, which generally works best when you have a discrete number of options, could be a tally table. In Grant’s example, we see values from 1 to 1000, with a rank for each. If it’s really as simple as that, we would create a new lookup table with Value + RankDesc and simply join the main table’s Value to the lookup table’s Value to get the appropriate RankDesc. Yeah, you have 1000 rows instead of 3 but queries are trivial at that point. The downside is that this approach doesn’t work for continuous variables (e.g., give me the exact amount of your household income for the prior tax year) and the utility of this solution probably breaks down once you get past tens of thousands of rows.

In the case of a continuous variable or an enormous discrete variable, we have the simplest option of all: ignore something. If you care about the range, use the table from the second solution and use that ID on the main table. If you care about the value but not the range, just have the value and no lookup table.

Comments closed

A Month of Community Tools

Erik Darling is no tool:

Over the past month (plus or minus a couple days), I’ve shown you in a series of quick posts how I use different SQL Server Community Tools that are free and open source to troubleshoot SQL Server issues.

There’s a lot of great content from Erik here; if you haven’t already been following along.

Comments closed