Press "Enter" to skip to content

Author: Kevin Feasel

ggplot2 Scales And Coordinates

I continue my series on ggplot2:

The other thing I want to cover today is coordinate systems.  The ggplot2 documentation shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the xlim and ylim parameters.

The special coordinate system I want to point out is coord_flip, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.

There are a lot of pictures and more step-by-step work.  Most of these are still 3-4 lines of code, so again, pretty simple.
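For a rough feel of what the coordinate pieces do, here is a minimal sketch in Python via plotnine, which mirrors the ggplot2 grammar (the series itself uses R, and the life expectancy numbers below are made up rather than taken from the post):

# plotnine is a Python port of the ggplot2 grammar; this data frame is hypothetical.
import pandas as pd
from plotnine import ggplot, aes, geom_col, coord_flip, coord_cartesian

life_exp = pd.DataFrame({
    "continent": ["Africa", "Americas", "Asia", "Europe", "Oceania"],
    "life_expectancy": [61.0, 75.0, 72.0, 78.0, 81.0],
})

# A column chart: continents on the x axis, values as vertical columns.
columns = ggplot(life_exp, aes(x="continent", y="life_expectancy")) + geom_col()

# coord_flip() swaps the axes, turning the column chart into a bar chart.
bars = columns + coord_flip()

# coord_cartesian's xlim/ylim zoom the viewing area without discarding data.
zoomed = columns + coord_cartesian(ylim=(50, 85))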


Configuring Disk Controllers In VMs

David Klee explains how you can get a performance improvement in high-I/O virtual machines:

Both Hyper-V’s and VMware’s default controllers emulate the LSI Logic SAS controller, because that’s what is built into Windows’ driver store, and it *just works* without having to do anything fancy.

But… it’s not necessarily there for speed. It’s there for compatibility so that you can boot up a VM without having to deal with extra drivers.

VMware created a driver a while back that comes with the VMware Tools package called the Paravirtual SCSI controller, and it gives a 10-30% bump in performance (depending on the speed of the underlying storage) because it’s built for speed from the beginning. It’s just not native to Windows, so I don’t personally feel comfortable using it for the C: drive controller unless required. You can change the controller type for these controllers, so we use it by default for non-OS SQL Server object drives.

Definitely worth the read.


GDPR In The UK

Ed Elliott covers that lesser-known Sex Pistols track in a multi-part series.

Part 1 covers some of the official documentation around how the ICO interprets GDPR:

To read the articles and the actual requirements, I would start at page 32, which begins “HAVE ADOPTED THIS REGULATION:”; this lists each of the articles (requirements). You can go through each of these and make sure you are compliant with them.

The exciting bit, the fines

The exciting, headline-grabbing parts of GDPR are the fines that can be enforced. We don’t yet know how the ICO will apply the fines; words like “maximum” are used, and the maximum possible fines are large. It is possible that the maximum fines will apply, but we will look in part 2 at previous ICO enforcement actions to see if the ICO’s past performance gives us any clues as to its possible future decisions.

Part 2 looks at a couple of prior cases and how the ICO handled them:

Talk Talk started mitigating the issue by writing to all of its customers telling them how to deal with scam calls. Talk Talk told the ICO what happened and they responded with their own investigation and a £100,000 fine. The reasons were:

– The system failed to have adequate controls over who could access which records, i.e. anyone could access any record not just the cases they were working on
– The exports allowed all fields, not just the ones required for the regulatory reports
– Wipro were able to make wildcard searches
– The issue was long-running, from 2004 (when Wipro were given access) until 2014

One of the mitigating factors was that there was no evidence that this was even the source of the scam calls; there was also no evidence that anyone suffered any damage or distress as a result of this incident.

Part 3 looks at a couple more cases, too.  And Ed promises part 4.


Yarn Service Framework Coming

Jian He, et al, announce the Yarn Service Framework:

Apache Hadoop YARN is well known as the general resource-management platform for big-data applications such as MapReduce, Hive / Tez and Spark. It abstracts the complicated cluster resource management and scheduling from higher level applications and enables them to focus solely on their own application specific logic.

In addition to big-data apps, another broad spectrum of workloads we see today are long running services such as HBase, Hive/LLAP and container (e.g. Docker) based services. In the past year, the YARN community has been working hard to build first-class support for long running services on YARN.

This is going to ship with Hadoop 3.1.


PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames, with a bigger data set, more precisely some transformation techniques.  We often say that most of the leg work in machine learning is data cleansing.  Similarly, we can affirm that the clever & insightful aggregation query performed on a large dataset can only be executed after a considerable amount of work has gone into formatting, filtering & massaging data:  data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that just about everything in this article isn’t specific to Azure Databricks and would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.

Read on for explanation, or check out the notebook to work on it at your own pace.
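As a hedged sketch of that wrangle-then-aggregate flow in PySpark (the file path and column names below are assumptions about the Kaggle extract, not pulled from the notebook):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("h1b-wrangling").getOrCreate()

# Load the raw CSV; the path and schema inference are placeholders.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/data/h1b_kaggle.csv"))

# Data wrangling: keep certified petitions with a usable wage figure.
certified = (raw
             .filter(F.col("CASE_STATUS") == "CERTIFIED")
             .filter(F.col("PREVAILING_WAGE").isNotNull()))

# The insightful aggregation only pays off after the wrangling above:
# petition counts and an approximate median wage per employer per year.
summary = (certified
           .groupBy("EMPLOYER_NAME", "YEAR")
           .agg(F.count("*").alias("petitions"),
                F.expr("percentile_approx(PREVAILING_WAGE, 0.5)").alias("median_wage"))
           .orderBy(F.desc("petitions")))

summary.show(10, truncate=False)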


ggplot2 Mappings And Geoms

I continue my ggplot2 series:

We have used a new geom here, geom_smooth.  The geom_smooth function creates a smoothed conditional mean.  Basically, we’re drawing some line as a result of a function based on this input data.  Notice that there are two parameters that I set:  method and se.  The method parameter tells the function which method to use.  There are five methods available, including using a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of Linear Models (lm, glm, and rlm).  The se parameter controls whether we want to see the standard error bar.

I don’t cover all of the mapping options and all of the geoms, but I think it’s enough to get a grip on the concept.
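To make the method and se parameters concrete, here is a small sketch in Python via plotnine again (the post’s code is R, and the data frame below is synthetic):

import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_point, geom_smooth

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
df = pd.DataFrame({"x": x, "y": np.sin(x) + rng.normal(scale=0.3, size=x.size)})

base = ggplot(df, aes(x="x", y="y")) + geom_point(alpha=0.4)

# method chooses the smoother (a linear model here, a local smoother below);
# se=False suppresses the standard error band around the fitted line.
linear = base + geom_smooth(method="lm", se=False)
smoothed = base + geom_smooth(method="lowess", se=True)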


Using FreeTDS To Connect To SQL Server

Steph Locke embraces the pain of FreeTDS:

If you use SQL Server (or Azure SQL DB) as your data store and you need to connect to the database from shinyapps.io, you’re presently stuck with FreeTDS. If you have any control over your infrastructure, I cannot recommend highly enough the actual ODBC Driver on Linux for ease. Alas, shinyapps.io does not let you control the infrastructure. We have to make do with FreeTDS and it can be pretty painful to get right.

Given how obtuse the error messages you get back from FreeTDS in your shiny app are, and how long it takes to deploy an app, you might just want to cry a little. I know I did. Determined to succeed, here is my solution to getting a working database connection that you can also use to test you’re doing it right. If you’re on a particularly old version of SQL Server, though, I can’t guarantee this will work for you.

Read on for more.  I also have an older post on working with FreeTDS, though I ended up using TDS_Version = 8.0 instead of 7.4.
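If it helps to see the shape of such a connection string, here is a rough pyodbc version of the same idea (Steph works in R with the odbc package; the server, database, and credentials below are placeholders, and the driver name has to match your FreeTDS entry in odbcinst.ini):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={FreeTDS};"
    "SERVER=yourserver.database.windows.net;"
    "PORT=1433;"
    "DATABASE=yourdb;"
    "UID=youruser;"
    "PWD=yourpassword;"
    "TDS_Version=7.4;"   # the setting mentioned above; I ended up needing 8.0 in my older post
)

cur = conn.cursor()
cur.execute("SELECT @@VERSION;")   # quick sanity check that the handshake works
print(cur.fetchone()[0])
conn.close()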


Power BI Cumulative Totaling

Martin Schoombee runs into an interesting issue with cumulative totals in Power BI:

A common practice in the data warehousing world is to use a Date Key as the unique identifier in a date dimension. This attribute is usually a number in the format yyyymmdd. I’m not going to dive into all the reasons why it is used in data warehouse environments here, but (for fun) let’s change our data model to use the Date Key attribute in the relationship between the two tables.

If we look at our visualizations again, we see a very different picture. Sales by date still looks the same, but the sales by month seems a little out of whack (image below). If you had cumulative sales at any other aggregated level (quarter, year, etc.) it would also have been incorrect.

The answer is not immediately intuitive, so it’s good to know this ahead of time rather than have to struggle with it later.
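For anyone who hasn’t seen the convention, the Date Key is just an integer built from the date’s parts, so anything that implicitly treats it as a real date can behave differently.  A tiny illustration (not code from the post):

from datetime import date

def date_key(d: date) -> int:
    """Build a yyyymmdd-style surrogate key, e.g. 20180226."""
    return d.year * 10000 + d.month * 100 + d.day

print(date_key(date(2018, 2, 26)))   # 20180226 -- an integer, not a date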


Processing Tabular Models With Helpful Information

Ust Oldfield has a stored procedure which runs a SQL Agent job and provides notice when processing completes:

Recently, at a client, I was challenged to create a stored procedure that would process a tabular model. This stored procedure would then be executed from a web application. The process behind it being: a user enters data into a web application, which gets written to a database. That data then needs to be immediately surfaced up into reports, with additional calculations and measures along the way. Therefore the tabular model, which handles all the additional calculations and measures, needs to be processed by a user from the web application.

Click through for the script.
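The actual script is in the linked post; as a loose Python sketch of the kick-off-and-wait pattern described above (the connection details and job name are placeholders, while sp_start_job and sysjobactivity are standard SQL Agent objects):

import time
import pyodbc

conn = pyodbc.connect("DSN=sqlserver;UID=app_user;PWD=app_password", autocommit=True)
cur = conn.cursor()

# Start the Agent job that processes the tabular model.
cur.execute("EXEC msdb.dbo.sp_start_job @job_name = N'Process Tabular Model';")

# Poll until no instance of the job is still running, then notify the caller.
while True:
    cur.execute("""
        SELECT COUNT(*)
        FROM msdb.dbo.sysjobactivity AS ja
        JOIN msdb.dbo.sysjobs AS j ON j.job_id = ja.job_id
        WHERE j.name = N'Process Tabular Model'
          AND ja.start_execution_date IS NOT NULL
          AND ja.stop_execution_date IS NULL;
    """)
    if cur.fetchone()[0] == 0:
        break
    time.sleep(5)

print("Tabular model processing finished.")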


Ungrouped Results In Query Store

Erin Stellato explains why you might see two rows for the same query plan in Query Store’s run-time stats:

You can see that there are two rows for the same query_id and plan_id, but the count_executions is different, as are the avg_duration and avg_logical_io_reads values.  The data is not truly a duplicate.  This behavior occurs because the Query Store data is stored in memory before it is flushed to disk, and when you query the data SQL Server is pulling it from both locations (and doing a UNION ALL) and displaying it in the output.  If I waited a bit and ran the query again, the two rows for that interval would probably disappear – most likely because the in-memory data had been flushed to disk.

Read the whole thing.
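Since the apparent duplicates come from in-memory and on-disk data being unioned together, one way to report a single row per plan per interval is to group and weight the averages by execution count.  Here is a sketch against the standard Query Store catalog views (not Erin’s exact query), run through pyodbc with a placeholder DSN:

import pyodbc

conn = pyodbc.connect("DSN=sqlserver;Trusted_Connection=yes;")
cur = conn.cursor()

cur.execute("""
    SELECT p.query_id,
           rs.plan_id,
           rs.runtime_stats_interval_id,
           SUM(rs.count_executions) AS count_executions,
           SUM(rs.avg_duration * rs.count_executions)
               / SUM(rs.count_executions) AS avg_duration,
           SUM(rs.avg_logical_io_reads * rs.count_executions)
               / SUM(rs.count_executions) AS avg_logical_io_reads
    FROM sys.query_store_runtime_stats AS rs
    JOIN sys.query_store_plan AS p ON p.plan_id = rs.plan_id
    GROUP BY p.query_id, rs.plan_id, rs.runtime_stats_interval_id;
""")

for row in cur.fetchall():
    print(row)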
