
Author: Kevin Feasel

Measuring Semantic Relatedness

Sandipan Dey re-works a university assignment on semantic relatedness in Python:

Let’s define the semantic relatedness of two WordNet nouns x and y as follows:

  • A = set of synsets in which x appears
  • B = set of synsets in which y appears
  • distance(x, y) = length of shortest ancestral path of subsets A and B
  • sca(x, y) = a shortest common ancestor of subsets A and B

This is the notion of distance that we need to use to implement the distance() and sca() methods in the WordNet data type.
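
As a rough illustration of the definitions above, here is a minimal Python sketch that finds a shortest ancestral path with two breadth-first searches. It is not Sandipan's implementation. The toy hypernym graph, the function names, and the dictionary-based graph representation are all assumptions made for the example.

```python
from collections import deque


def bfs_distances(graph, sources):
    """Return {vertex: hops from the nearest source}, following hypernym edges."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for parent in graph.get(v, ()):
            if parent not in dist:
                dist[parent] = dist[v] + 1
                queue.append(parent)
    return dist


def shortest_ancestral_path(graph, a, b):
    """Return (length, ancestor) for synset subsets a and b, or (None, None)."""
    dist_a = bfs_distances(graph, a)
    dist_b = bfs_distances(graph, b)
    best_len, best_anc = None, None
    # A common ancestor is any vertex reachable from both subsets.
    for v in dist_a.keys() & dist_b.keys():
        total = dist_a[v] + dist_b[v]
        if best_len is None or total < best_len:
            best_len, best_anc = total, v
    return best_len, best_anc


# Hypothetical toy graph: child synset -> list of parent (hypernym) synsets.
toy_graph = {
    "cat": ["feline"],
    "dog": ["canine"],
    "feline": ["carnivore"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
}

length, ancestor = shortest_ancestral_path(toy_graph, {"cat"}, {"dog"})
print(length, ancestor)  # 4 carnivore
```

Here distance(x, y) corresponds to the returned length and sca(x, y) to the returned ancestor, with A and B passed in as the two synset subsets.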

It looks like a helpful assignment for understanding natural language processing a little better.


Trigger Nuance

Denis Gobo offers up some good advice on triggers:

The most common mistake people make when they first start writing triggers is writing them so that they only work if you insert, update, or delete one row at a time. A trigger fires once per statement, not once per row; you have to take this into consideration, otherwise your DML statements will blow up. How to do this is explained in the post Coding SQL Server triggers for multi-row operations; there is no point recreating that post here.
Another problem that I see is that some people think a trigger is SQL Server’s version of crontab; you will see code that sends email, kicks off jobs, and runs stored procedures. This is the wrong approach. A trigger should be lean and mean and execute as fast as possible. If you need to do additional things, dump some data from the trigger into a processing table and then use that table to do your additional tasks. Don’t use triggers as a messaging system either; SQL Server comes with Service Broker, so use that instead.

Good reading.  There are valid reasons for triggers, and ignoring them altogether is almost as bad as misusing them.


Basics Of Elasticsearch In .NET

Ivan Cesar gives us a brief tutorial of the Elasticsearch .NET API:

To be able to search something, we must store some data into ES. The term used is “indexing.”

The term “mapping” is used for mapping our data in the database to objects which will be serialized and stored in Elasticsearch. We will be using Entity Framework (EF) in this tutorial.

Generally, when using Elasticsearch, you are probably looking for a site-wide search engine solution. You will either use some sort of feed or digest, or a Google-like search which returns all the results from various entities, such as users, blog entries, products, categories, events, etc.

These will probably not just be one table or entity in your database, but rather, you will want to aggregate diverse data and maybe extract or derive some common properties like title, description, date, author/owner, photo, and so on. You also probably won’t do it in one query; if you are using an ORM, you will have to write a separate query for each entity type: blog entries, users, products, categories, events, or something else.
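
To make the aggregation idea concrete, here is a small Python sketch (Ivan's tutorial itself uses .NET and Entity Framework, so this is a stand-in, not his code). It maps rows from two different entities onto one common document shape before indexing. The `SearchDocument` class, the field names, and the sample rows are all hypothetical, and no Elasticsearch client call is shown.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class SearchDocument:
    """Hypothetical common shape shared by every searchable entity."""
    entity_type: str
    entity_id: int
    title: str
    description: str
    published: str  # ISO date string, so it serializes to JSON directly


def from_blog_entry(row):
    # Derive the common properties from a blog entry row.
    return SearchDocument("blog_entry", row["id"], row["headline"],
                          row["body"][:200], row["posted_on"].isoformat())


def from_user(row):
    # Derive the same properties from a user row.
    return SearchDocument("user", row["id"], row["display_name"],
                          row["bio"], row["joined_on"].isoformat())


# Stand-ins for the results of one query per entity type.
blog_rows = [{"id": 1, "headline": "Indexing basics",
              "body": "How indexing works.", "posted_on": date(2017, 10, 1)}]
user_rows = [{"id": 7, "display_name": "Ivan",
              "bio": "Developer", "joined_on": date(2016, 5, 20)}]

documents = [from_blog_entry(r) for r in blog_rows] + [from_user(r) for r in user_rows]
payload = [asdict(d) for d in documents]  # plain dicts, ready to serialize
print(payload)
```

Each entity keeps its own query and its own mapping function, but everything that reaches the index has the same fields, which is what lets one search return mixed results.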

Check out Ivan’s tutorial for several examples.  Elasticsearch is really good for text-based search and simple aggregations, but it probably shouldn’t be a primary data store for any data you really care about.


Spelunking In The SSRS REST API

Chris Webb uses Power BI to look at the new SQL Server Reporting Services 2017 REST-based API:

And the online documentation for the API is here:

https://app.swaggerhub.com/apis/microsoft-rs/SSRS/2.0

Interestingly, the new API seems to be OData compliant – which means you can browse it in Power BI/Get&Transform/Power Query and build your own reports from it. For example in Power BI Desktop I can browse the API of the SSRS instance installed on my local machine by entering the following URL:

This is something that SSRS has been missing for a long time.  I’m glad they’re introducing a real API.


Checking Azure Status

Arun Sirpal shows where to look if you think you’re experiencing an Azure SQL Database outage:

It shows the many different layers involved with a product like Azure SQL Database. What happens if there is a loss of service for a specific component? Obviously we as customers would not be able to fix the issue, as this is the responsibility of Microsoft engineers. The key for me is being kept in the loop with the issue, and it is something that they do pretty well. So what happens if the load balancer has issues?

All communication is done via Service Health within the Azure portal.

Check the comments for another useful Azure status site.


Benchmarking Streaming Systems

Burak Yavuz shares a benchmark of Spark Streaming versus Flink and Kafka Streams:

At Databricks, we used Databricks Notebooks and cluster management to set up a reproducible benchmarking harness that compares the performance of Apache Spark’s Structured Streaming, running on Databricks Unified Analytics Platform, against other open source streaming systems such as Apache Kafka Streams and Apache Flink. In particular, we used the following systems and versions in our benchmarks:

The Yahoo Streaming Benchmark is a well-known benchmark used in industry to evaluate streaming systems. When setting up our benchmark, we wanted to push each streaming system to its absolute limits, yet keep the business logic the same as in the Yahoo Streaming Benchmark. We shared some of the results we achieved from these benchmarks during the Spark Summit West 2017 keynote, showing that Spark can reach 5x or higher throughput over other popular streaming systems. In this blog, we discuss in more detail how we performed this benchmark, and how you can reproduce the results yourselves.

Standard vendor-based metric warnings aside, you can get the benchmark details at their GitHub repo.


Linear Discriminant Analysis

Jake Hoare explains Linear Discriminant Analysis:

Linear Discriminant Analysis takes a data set of cases (also known as observations) as input. For each case, you need to have a categorical variable to define the class and several predictor variables (which are numeric). We often visualize this input data as a matrix, such as shown below, with each case being a row and each variable a column. In this example, the categorical variable is called “class” and the predictive variables (which are numeric) are the other columns.

Following this is a clear example of how to use LDA.  This post is also the second time this week somebody has suggested The Elements of Statistical Learning, so I probably should make time to look at the book.


Bayesian Nonparametric Models

Luba Belokon asked Vadim Smolyakov to explain Bayesian Nonparametric models and here’s the result:

Bayesian Nonparametrics are a class of models for which the number of parameters grows with data. A simple example is non-parametric K-means clustering [1]. Instead of fixing the number of clusters K, we let data determine the best number of clusters. By letting the number of model parameters (cluster means and covariances) grow with data, we are better able to describe the data as well as generate new data given our model.

Of course, to avoid over-fitting, we penalize the number of clusters K via a regularization parameter which controls the rate at which new clusters are created. 
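
A minimal pure-Python sketch of this idea follows, in the spirit of DP-means-style clustering rather than a reproduction of Vadim's code. A point opens a new cluster whenever its squared distance to the nearest centroid exceeds a penalty. The function names, the `penalty` parameter, and the sample points are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dp_means(points, penalty, iterations=10):
    """Cluster points, opening a new cluster when the nearest centroid is
    farther than `penalty` (measured in squared distance)."""
    centroids = [mean(points)]  # start with a single global cluster
    assignments = [0] * len(points)
    for _ in range(iterations):
        for i, p in enumerate(points):
            distances = [squared_distance(p, c) for c in centroids]
            nearest = min(range(len(centroids)), key=distances.__getitem__)
            if distances[nearest] > penalty:
                centroids.append(p)  # the point seeds a new cluster
                nearest = len(centroids) - 1
            assignments[i] = nearest
        # Recompute centroids, dropping clusters that lost all their points.
        members = {}
        for i, k in enumerate(assignments):
            members.setdefault(k, []).append(points[i])
        kept = sorted(members)
        centroids = [mean(members[k]) for k in kept]
        relabel = {k: new for new, k in enumerate(kept)}
        assignments = [relabel[k] for k in assignments]
    return centroids, assignments


# Hypothetical sample data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]

centroids, assignments = dp_means(points, penalty=4.0)
print(len(centroids), assignments)  # 2 [0, 0, 0, 1, 1, 1]
```

Raising the penalty makes new clusters more expensive, so a very large value leaves all six points in one cluster; lowering it lets K grow.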

This is an interesting discussion of the Dirichlet process, particularly as applied to K-means clustering.  It helps you figure out your best choice for K, no small task.


Restoration With Replacement

Joey D’Antoni tests whether RESTORE WITH REPLACE is functionally different from dropping a database and performing a restoration:

I recently read something that said using the RESTORE WITH REPLACE command could be faster than dropping a database and then performing a RESTORE, because the shell of the file could be used and therefore skip file initialization. I did not think that was the case, but Books Online wasn’t clear about the situation, so I went ahead and built a quick test case, using ProcMon from Sysinternals. If you aren’t familiar with the Sysinternals tools, you should be: they are a good way to get under the hood of your Windows Server to see what’s going on, and if you’re old like me, you probably used PSEXEC to “telnet” into a Windows server to restart a service before RDP was a thing.

Read on to see how the processes compare.


Default Schemas In SQL Server

Daniel Hutmacher looks at specifying default schemas on a database:

If your user is a database owner (i.e., is a member of the db_owner role or has CONTROL permissions on the database), the default schema will always be dbo. This is something you can’t change.

So if your legacy application needs quasi-administrative privileges in the database, you can’t make it a database owner, but you can grant those permissions on the schema instead (which is actually a better idea anyway).

What Daniel is doing is akin to the pre-2005 concept of user spaces, where Bob had a schema and Mary had a schema and Jill had a schema and so forth.
