Press "Enter" to skip to content

Author: Kevin Feasel

What Is The Data Platform?

Rolf Tesmer has weighed in with his thoughts on the “Data Platform”:

What this has meant is that innovation – in particular in the Azure Public Cloud, ISV’s, new data services/products, and new data related infrastructure – has accelerated dramatically and changed the very definitions of what was previously accepted as comprising the “Data Platform”.

Nowadays when I talk to customers about the “Data Platform” it encompasses a range of services across a mix of IaaS, PaaS and SaaS.  The decision of which data service to deploy now comes down to matching the business case technical requirements with the capability of a purpose built cloud service – as opposed to (in the past) trying to fit an obvious NoSQL use case into a traditional RDBMS platform.

I now see the “New Data Platform” as much broader than ever before and includes many other “non-traditional” data services…

Cf. Eugene Meidinger (who started this) and me (who exacerbated this).  This is an area ripe for consideration.

Stop Being Your Own Worst Enemy With Code

Bert Wagner has advice on making code understandable for future-you:

At the time I wrote it, I probably thought my code was beautiful. An elegant masterpiece. It should have been printed, framed, and hung on a wall of The Programming Hall of Fame. As clever as I thought I may have been a few years ago, I rarely am able to read my old code without some serious time wasted debugging.

This problem plagued me regularly. I tried different techniques to try and make my code easier to understand.

Bert has some good thoughts here, and I’ll add two small bits.  First, there’s a saying that it takes more mental effort to debug code than it takes to write it, so if you’re writing code at the edge of your understanding, effective debugging becomes difficult to impossible.  Second, unless you see a business rule frequently enough to internalize it, your greatest familiarity with the “whys” of the system is right when you are developing.  There is huge value in taking the time to document the rules in an accessible manner; even if you wrote the code, you probably won’t remember that weird edge case at 4 AM six months from now, when you need to remember it the most.

Considerations For Reducing I/O Costs

Monica Rathbun gives a few methods for reducing how many I/O operations a query requires:

Implicit Conversions

Implicit conversions often happen when a query is comparing two or more columns with different data types. In the below example, the system is having to perform extra I/O in order to compare a varchar(max) column to an nvarchar(4000) column, which leads to an implicit conversion, and ultimately a scan instead of a seek. By fixing the tables to have matching data types, or simply converting this value before evaluation, you can greatly reduce I/O and improve cardinality (the estimated rows the optimizer should expect).

There’s some good advice here if your main hardware constraint is I/O.
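
To make the pattern concrete, here is a minimal sketch of the mismatch Monica describes, using a hypothetical dbo.Customers table with an indexed VARCHAR(20) AccountNumber column (the object names are mine, not hers). Because NVARCHAR has higher data type precedence, the VARCHAR column gets converted row by row and the optimizer typically falls back to a scan; matching the parameter’s type to the column’s avoids the conversion:

-- Mismatched types: the VARCHAR column is implicitly converted to NVARCHAR,
-- so the index on AccountNumber is scanned and the cardinality estimate suffers.
DECLARE @AccountNumber NVARCHAR(4000) = N'AB-12345';

SELECT CustomerID, AccountNumber
FROM dbo.Customers
WHERE AccountNumber = @AccountNumber;

-- Matching types: no conversion on the column side, so a seek is possible.
DECLARE @AccountNumberTyped VARCHAR(20) = 'AB-12345';

SELECT CustomerID, AccountNumber
FROM dbo.Customers
WHERE AccountNumber = @AccountNumberTyped;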

Dark Queries

Michael Swart helps root out queries which get recompiled frequently and so won’t be in the cache:

Some of my favorite monitoring solutions rely on the cached queries:

but some queries will fall out of cache or don’t ever make it into cache. Those are the dark queries I’m interested in today. Today let’s look at query recompiles to shed light on some of those dark queries that maybe we’re not measuring.

By the way, if you’re using SQL Server 2016’s query store then this post isn’t for you because Query Store is awesome. Query Store doesn’t rely on the cache. It captures all activity and stores queries separately – Truth in advertising!

Click through for an Extended Events session which looks for recompilations.
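
Before clicking through, here is a bare-bones sketch of what such a session can look like (this is my own minimal version, not Michael’s exact definition, and the session and file names are made up):

-- Capture statement-level recompiles along with the statement text.
CREATE EVENT SESSION [TrackRecompiles] ON SERVER
ADD EVENT sqlserver.sql_statement_recompile
    (ACTION (sqlserver.sql_text, sqlserver.database_id))
ADD TARGET package0.event_file
    (SET filename = N'TrackRecompiles.xel');
GO

ALTER EVENT SESSION [TrackRecompiles] ON SERVER STATE = START;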

Resetting SQL Administrators

Chris Lumnah shows how to use dbatools to reset a SQL-authenticated administrative account:

As I was going through my environment, I realized I created a new domain controller for my tests. This DC has a new name and domain name which is different from my other VMs. I quickly realized that this will cause me issues later with authentication. No worries. I will just boot up the VMs and join them to the new domain. Easy-peasy. Now let me go test out my SQL Servers.

DOH!!

I received a login failure with access is denied. Using Windows Authentication with my new domain and recently joined server is not working. Why?…..Oh right, my new user id does not have access to SQL Server itself. As I sit there smacking myself in the head, I am also thinking about the amount of time it will take me to rebuild those VMs. Then it hit me!!!

Read on to see the solution, including a PowerShell one-liner showing how it’s done.
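
The one-liner itself uses dbatools, so read the post for that; as a rough idea of what the reset amounts to under the hood, here is a manual T-SQL sketch you could run once connected with sysadmin rights (for example, after restarting the instance in single-user mode as a local administrator – the domain and login names below are hypothetical):

-- Give the account from the new domain sysadmin membership.
CREATE LOGIN [NEWDOMAIN\ChrisL] FROM WINDOWS;
ALTER SERVER ROLE sysadmin ADD MEMBER [NEWDOMAIN\ChrisL];

-- Or fall back to SQL authentication by re-enabling sa with a new password.
ALTER LOGIN sa ENABLE;
ALTER LOGIN sa WITH PASSWORD = N'UseAStrongPasswordHere1!';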

Ignoring SSAS Dynamic Formatting

Chris Webb shows that tools like Power BI ignore formatting in SCOPE statements:

What’s more (and this is a bit strange), if you look at the DAX queries that are generated by Power BI to get data from the cube, they now request a new column to get the format string for the measure even though that format string isn’t used. Since it increases the amount of data returned by the query, this extra column can have a negative impact on query performance if you’re bringing back large amounts of data.

There is no way of avoiding this problem at the moment, unfortunately. If you need to display formatted values in Power BI you will have to create a calculated measure that returns the value of your original measure, set the format string property on that calculated measure appropriately, and use that calculated measure in your Power BI reports instead:

Click through for more details and a workaround.

Minimizing Shared State

Vladimir Khorikov has an example of removing shared state and making application code more “honest” as a result:

Note how we’ve removed the private fields. Getting rid of the shared state automatically decoupled the three methods and made the workflow explicit. Without the shared state, the only way we can carry data around is by using the methods’ arguments and return values. And that is exactly what we did: all three members now explicitly state required inputs and possible outputs in their signatures.

This is the essence of functional programming. With honest method signatures, it’s extremely easy to reason about the code as we don’t need to keep in mind hidden relationships between its different parts. It’s also impossible to mess up with the invocation order. If we try, for example, to put the second line above the first one, the code simply wouldn’t compile:

This is one of many reasons why I’m fond of functional programming.

Understanding The Problem: Churn Edition

Emre Yazici points out the importance and difficulty of nailing down good definitions, using bookings churn as an example:

WHEN: Let’s say we have made our design, constructed a model, and obtained good accuracy. However, our model predicts (even with 95% accuracy) the customers who are going to churn the next day! That means our business department has to prevent (somehow, as explained before) those customers from churning within “one day”. Because the next day, they will not be our customers. Taking action on “3000” customers (let’s say) in just one day is impossible. So even if our project predicts with very high accuracy, it will not be useful. This approach also creates another problem: consider that N months ago, a customer “A” was a happy customer who was working with (providing to) us (let’s say a customer with 100% efficiency – happiness), and tomorrow it will be a customer who is no longer working with us (a customer with 0% efficiency – happiness). And we can predict the result today. So most probably, the customer has already decided to leave our company by that last day. This is a dead end and we cannot prevent the customer from churning at this point – because it is already too late.

So we need to have a certain time limit… such that we are able to warn the business department “M months” before the customers churn, so they can take action before the customers leave. Here comes another problem: what is the time limit… 2 months, 2.5 months, 3 months…? How do we determine how far in advance we need to predict customer churn?

There’s a lot more to a good solution than “I ran a regression against a data set.”

Generating Homoglyphs In R

Bob Rudis shows how to create homoglyphs (character sequences which look similar to other character sequences) using a few R packages:

We can try it out with a very familiar domain:

(converted <- to_homoglyph("google.com"))
## [1] "ƍ၀໐|.com"

Now, that’s using all possible homoglyphs and it might not look like google.com to you, but imagine whittling down the list to ones that are really close to Latin character set matches. Or, imagine you’re in a hurry and see that version of Google’s URL with a shiny, green lock icon from Let’s Encrypt. You might not really give it a second thought if the page looked fine (or were on a mobile browser without a location bar showing).

Click through for more details, as well as information on punycode.

When Binomials Converge

Mala Mahadevan shows an example of the central limit theorem in action, as a large enough sample from a binomial distribution approximates the normal:

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow normal distribution if the sample size is large enough. We will use that as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are the chances that a maximum of 35 people wind up with lung disease?

Click through for the example.
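
For a rough sense of the arithmetic (my own back-of-the-envelope version, using a continuity correction – Mala’s worked example may set it up slightly differently):

\mu = np = 100 \times 0.3 = 30
\sigma = \sqrt{np(1-p)} = \sqrt{100 \times 0.3 \times 0.7} = \sqrt{21} \approx 4.58
P(X \le 35) \approx P\left(Z \le \frac{35.5 - 30}{4.58}\right) = P(Z \le 1.20) \approx 0.885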
