# Day: October 16, 2017

Thankfully not only have modern data analysis tools made data collection cheap and easy, they have made the process of exploratory data analysis cheaper and easier as well. Yet when we use these tools to explore data and look for anomalies or interesting features, we are implicitly formulating and testing hypotheses after we have observed the outcomes. The ease with which we are now able to collect and explore data makes it very difficult to put into practice even basic concepts of data analysis that we have learned — things such as:

• Correlation does not imply causation.
• When we segment our data into subpopulations by characteristics of interest, members are not randomly assigned (rather, they are chosen deliberately) and suffer from selection bias.
• We must correct for multiple hypothesis tests.
• We ought not dredge our data.

All of those principles are well known to statisticians, and have been so for many decades. What is newer is just how cheap it is to posit hypotheses. For better and for worse, technology has led to a democratization of data within organizations. More people than ever are using statistical analysis packages and dashboards, explicitly or more often implicitly, to develop and test hypotheses.

This is a thoughtful essay well worth reading.

### KStream-KStream Join

It is a sliding window join, that means, all tuples close to each other with regard to time are joined. Time here is the difference up to size of the window.

These joins are always windowed joins because otherwise, the size of the internal state store used to perform the join would grow indefinitely.

In the following example, we perform an inner join between two streams. The output the joined stream will be of type `KStream<K, ...>`

Let’s define the semantic relatedness of two WordNet nouns x and y as follows:

• A = set of synsets in which x appears
• B = set of synsets in which y appears
• distance(x, y) = length of shortest ancestral path of subsets A and B
• sca(x, y) = a shortest common ancestor of subsets A and B

This is the notion of distance that we need to use to implement the `distance()` and `sca()` methods in the `WordNet` data type.

It looks like a helpful assignment for understanding natural language processing a little better.

Most common mistake people make when first starting writing triggers is that they write it in such a way that it will only work if you insert/update/delete one row at a time. A trigger fires per batch not per row, you have to take this into consideration otherwise your DML statements will blow up. How to do this is explained in this post Coding SQL Server triggers for multi-row operations, there is no point recreating that post here.
Another problem that I see is that some people think a trigger is SQL Server’s version of crontab, you will see code that sends email, kicks off jobs, runs stored procedures. This is the wrong approach, a trigger should be lean and mean, it should execute as fast as possible, if you need to do some additional things then dump some data from the trigger into a processing table and then use that table to do your additional tasks. Don’t use triggers as a messaging system either, SQL Server comes with Service Broker, use that instead.

Good reading.  There are valid reasons for triggers, and ignoring them altogether is almost as bad as misusing them.

To be able to search something, we must store some data into ES. The term used is “indexing.”

The term “mapping” is used for mapping our data in the database to objects which will be serialized and stored in Elasticsearch. We will be using Entity Framework (EF) in this tutorial.

Generally, when using Elasticsearch, you are probably looking for a site-wide search engine solution. You will either use some sort of feed or digest, or Google-like search which returns all the results from various entities, such as users, blog entries, products, categories, events, etc.

These will probably not just be one table or entity in your database, but rather, you will want to aggregate diverse data and maybe extract or derive some common properties like title, description, date, author/owner, photo, and so on. Another thing is, you probably won’t do it in one query, but if you are using an ORM, you will have to write a separate query for each of those blog entries, users, products, categories, events, or something else.

Check out Ivan’s tutorial for several examples.  Elasticsearch is really good for text-based search and simple aggregations, but it probably shouldn’t be a primary data store for any data you really care about.

And the online documentation for the API is here:

https://app.swaggerhub.com/apis/microsoft-rs/SSRS/2.0

Interestingly, the new API seems to be OData compliant – which means you can browse it in Power BI/Get&Transform/Power Query and build your own reports from it. For example in Power BI Desktop I can browse the API of the SSRS instance installed on my local machine by entering the following URL:

This is something that SSRS has been missing for a long time.  I’m glad they’re introducing a real API.

It shows the many different layers involved with a product like Azure SQL Database. What happens if there is a loss of service for a specific component?  Obviously we as customers would not be able to fix the issue as this is the responsibility of Microsoft Engineers, the key for me is being kept in the loop with the issue and it is something that they do pretty well. So what happens if the load balancer has issues?

All communication is done via Service Health within the Azure portal.

Check the comments for another useful Azure status site.