The Elitist Shuffle And Recommenders

Rodrigo Agundez shows us a way of displaying fresh recommendations without retraining the recommender system:

Suppose you have 10,000 items in total that can be recommended to your user, you run the recommendation system over all the items and those 10,000 items get ranked in order of relevance of the content.

The application shows 5 items on the entry screen. The first time the user opens the application after the re-scoring process the top 5 ranked items are shown. It is decided that from now on (based on user control groups, investigation, AB testing, etc.) until the next re-scoring process the entry screen should not be the same every time and remain relevant for the user.

Based on an investigation from the data scientist it turns out that somewhat relevant items appear until item 100. Then the idea is to somehow shuffle those 100 items such that the top 5 items shown are still relevant but not the same.

Click through for an example in Python and how it compares favorably to a couple of other shuffling algorithms.
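To make the idea concrete, here is a minimal sketch of a rank-weighted shuffle. It is not the code from the linked post: the weight formula, the `inequality` parameter, and the function name are assumptions chosen for illustration, and it uses only the standard library. Items near the top of the ranking get larger weights, so they tend to stay near the top without the order being identical every time.

```python
import math
import random


def elitist_shuffle(ranked_items, inequality=10.0, rng=None):
    """Return a new ordering that favors items near the top of ranked_items.

    ranked_items: items sorted from most to least relevant.
    inequality:   0 gives a uniform shuffle; larger values keep the
                  original top items near the top more often.
    """
    if rng is None:
        rng = random.Random()
    n = len(ranked_items)
    keyed = []
    for rank, item in enumerate(ranked_items):
        # Assumed weight for this rank: ((n - rank) / n) ** inequality,
        # kept in log space so large inequality values do not underflow.
        log_weight = inequality * math.log((n - rank) / n)
        # Gumbel-max trick: adding Gumbel noise to the log-weights and
        # sorting in descending order draws a weighted sample without
        # replacement.
        exponential = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
        keyed.append((log_weight - math.log(exponential), item))
    keyed.sort(key=lambda entry: entry[0], reverse=True)
    return [item for _, item in keyed]


# Shuffle only the 100 "somewhat relevant" items and show the first 5.
top_100 = [f"item_{i}" for i in range(1, 101)]
entry_screen = elitist_shuffle(top_100, inequality=10.0)[:5]
print(entry_screen)
```

Each call produces a different entry screen, and raising `inequality` trades freshness for relevance.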

Scaling Kafka With Consumer Groups

Kevin Feasel

2018-05-15

Hadoop

Suhita Goswami explains how to use consumer groups to scale processing from Apache Kafka:

Kafka builds on the publish-subscribe model with the advantages of a message queuing system. It achieves this with:

  • the use of consumer groups
  • message retention by brokers

When consumers join a group and subscribe to a topic, only one consumer from the group actually consumes each message from the topic. The messages are also retained by the brokers in their topic partitions, unlike traditional message queues.

Multiple consumer groups can read from the same set of topics, and at different times catering to different logical application domains. Thus, Kafka provides both the advantage of high scalability via consumers belonging to the same consumer group and the ability to serve multiple independent downstream applications simultaneously.

Consumer groups are a great solution to the problem of long-running consumers when items to process are independent and can run concurrently.
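As a rough illustration of those semantics, the toy model below assigns topic partitions to the members of each group. It is plain Python rather than a Kafka client, and the group names, consumer names, and round-robin strategy are assumptions made for the example; in a real cluster the assignment is handled by Kafka itself. The point is only that, within a group, each partition has exactly one reader, while separate groups each see the whole topic.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in a group (round-robin)."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


partitions = range(6)  # hypothetical topic with 6 partitions

# Two independent groups read the same topic; each group covers every partition.
groups = {
    "billing": ["billing-1", "billing-2", "billing-3"],
    "analytics": ["analytics-1", "analytics-2"],
}
assignments = {
    group: assign_partitions(partitions, members)
    for group, members in groups.items()
}
for group, assignment in assignments.items():
    print(group, assignment)
```

Adding a consumer to a group spreads the same partitions over more workers, which is where the scaling comes from.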

Selecting All Columns But One In Postgres

Lukas Eder shows off a BigQuery feature which you can partially implement in Postgres:

In BigQuery syntax, we could now simply write

SELECT * EXCEPT rk
FROM (...) t
WHERE rk = 1
ORDER BY first_name, last_name

Which is really quite convenient! We want to project everything, except this one column. But none of the more popular SQL databases support this syntax.

Luckily, in PostgreSQL, we can use a workaround: Nested records:

SELECT (a).*, (f).* -- Unnesting the records again
FROM (
  SELECT
    a, -- Nesting the actor table
    f, -- Nesting the film table
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY (a).first_name, (a).last_name;

Notice how we’re no longer projecting A.* and F.* inside of the derived table T, but instead, the entire table (record). In the outer query, we have to use some slightly different syntax to unnest the record again (e.g. (A).FIRST_NAME), and we’re done.

Read the whole thing.  Lukas has a workaround for SQL Server, but I’d really like to see SELECT * EXCEPT [something] be viable syntax.  I’d mostly want it for ad hoc diagnostic queries:  I have one scenario where most columns on a table are narrow, but there is also a big VARBINARY(MAX) column (for good reason, I promise) that I almost never want to see.  I use a third-party SSMS plugin to populate all the columns and remove the one I don’t want, but it’d be nice to specify it the other way around because that is so much faster to type.
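Until that syntax exists, the "list every column, then drop one" approach is easy to script. The helper below is a hypothetical sketch, not part of any plugin or library: the function name, table name, and column names are made up, the column list is assumed to come from somewhere like the database catalog, and the table name is assumed to be trusted input.

```python
def select_all_except(table, columns, excluded):
    """Build a SELECT statement listing every column except the excluded ones."""
    excluded_names = {name.lower() for name in excluded}
    kept = [name for name in columns if name.lower() not in excluded_names]
    if not kept:
        raise ValueError("no columns left to select")
    # Double-quote each identifier, doubling any embedded quote characters.
    quoted = ", ".join('"' + name.replace('"', '""') + '"' for name in kept)
    return f"SELECT {quoted} FROM {table}"


# Hypothetical table with one wide binary column we rarely want to see.
columns = ["document_id", "file_name", "created_at", "payload"]
query = select_all_except("documents", columns, excluded=["payload"])
print(query)
```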

Deleting Analysis Services Tabular Partitions Using C#

Chris Koester shows how to delete SQL Server Analysis Services Tabular partitions using the .NET library:

This post shows how to delete SSAS Tabular partitions with C#. This is an improvement over a PowerShell script I previously wrote about, Delete SSAS Partitions with TMSL and PowerShell, because it is more dynamic and doesn’t require a manually-authored TMSL script.

Specifically, the code below deletes all partitions that contain a specified suffix, which is useful for quickly deleting all partitions used for development purposes. I’ve worked with SSAS models that required several development partitions in order to prevent the workspace database from growing too large. Deleting them manually after a deployment to a server would be tedious, and it would be easy to forget one and end up processing duplicate data.

Click through for the code.

mssql-cli Update

Alan Yu announces an update to mssql-cli:

GDPR compliance

As many of us are familiar with, GDPR is approaching and we made some updates. In the past, file history stored entire T-SQL queries. However, if the query contained any secrets or passwords, it wasn’t smart enough to scrub those out. This is no longer the case, and now file history has been updated to no longer store secrets or passwords.

In addition, we have added 24-hour rotation of UserID when we collect telemetry.

Read on for other improvements.

Useful Azure SQL Database T-SQL Statements

Arun Sirpal has a few T-SQL statements to help manage a database in Azure SQL Database:

Creating a database

Very easy to create something like the below – a S2 database.

CREATE DATABASE [MeeTwoDB]
( MAXSIZE = 5GB, EDITION = 'standard', SERVICE_OBJECTIVE = 'S2' ) ;
--checking details
SELECT Edition = DATABASEPROPERTYEX('MeeTwoDB', 'Edition'),
       ServiceObjective = DATABASEPROPERTYEX('MeeTwoDB', 'ServiceObjective');

Read on for several more examples.

Troubleshooting SQL Server Error 18456 State 73

Thomas Rushton reproduces an error state in SQL Server:

A question asked on one of the forums today wasn’t easily answerable by Googling. Summary of the question: “I have error 18456 State 73 – why?”

Google seemed remarkably quiet on the subject of that particular state code. Even Aaron Bertrand’s list of causes of state codes for SQL Server error 18456 missed this one.

However, some searching did find a link to what appears to be some in-depth VMWare VSAN training documentation that includes that error in some logging information, which made me wonder if it was related to the error 18456 state 38 that followed.

Read on for more details.

Power BI: Parameterized Loading With Web.Contents()

Stacia Varga shows off Power BI automated refresh when hitting dynamically generated endpoint URLs:

There’s just one problem. I’m using the Web.Contents() function in several queries to get the JSON data from the NHL API.

Actually, the function is not the problem. It’s the way I’m using it to dynamically construct URLs so that I can iterate through lists of players or teams or whatever to retrieve the data I want.

Clearly that technique works just fine because I have data. However, when I try to schedule the refresh in the Power BI service, I get the dreaded error: “You can’t schedule refresh for this dataset because one or more sources currently don’t support refresh.”

I explained how to solve this particular problem once upon a time. For the hockey data queries, I had to use variations on that theme to get the dynamic URLs to work. To do this, I had to fix up my queries (which are actually functions) in the Query Editor by opening up the Advanced Editor and fixing the code as noted below.

Read on for her examples.  Given some of the problems she ran into, it seems like it might be a good idea to pull that data into SQL Server (or somewhere) and thereby separate data retrieval from data processing.

Views Don’t Improve Performance

Grant Fritchey lays down the law on views:

One day, it’s going to happen. I’m going to hear some crazy theory about how SQL Server works and I’m going to literally explode. Instead of some long silly rant with oddball literary & pop culture references you’ll get a screed the size of Ulysses (and about as much fun to read). However, for the moment, like Robin Williams describing a dance move, I’m going to keep it all inside. Here’s our query:

No, no where clause because we have to compare this to this, our view:

Grant used up much of his strategic reserve of GIFs in that post, so check it out.
