
Author: Kevin Feasel

Random Forest Feature Importance

Selcuk Disci takes us through an important concept with random forest models:

The random forest algorithm averages these results; that is, it reduces variance by training trees on different parts of the training set. This improves the performance of the final model, although it creates a small increase in bias.

The random forest uses the bootstrap aggregating (bagging) algorithm. We take a training sample X = x1, …, xn with outputs Y = y1, …, yn. The bagging process is repeated B times, each time selecting a random sample from the training set and fitting a tree to that sample. This fitting function is denoted fb in the formula below.
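To make the bagging step concrete, here is a minimal sketch in plain Python. It is not the article's code: the one-split regression stump stands in for a full decision tree, the names fit_stump, bagging_fit, and bagging_predict are hypothetical, and the final prediction is assumed to be the plain average of the B fitted functions, which is the usual choice for regression.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump (a stand-in for a full tree)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # Every x in the sample is identical: fall back to the mean output.
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def bagging_fit(xs, ys, B=25, seed=0):
    """Repeat B times: draw a bootstrap sample and fit one model f_b to it."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(B):
        # Bootstrap sample: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Average the B fitted functions at the point x."""
    return mean(f(x) for f in models)


# Toy data: the output jumps from 0 to 10 halfway through.
xs = list(range(10))
ys = [0] * 5 + [10] * 5
models = bagging_fit(xs, ys, B=25, seed=0)
low = bagging_predict(models, 0)
high = bagging_predict(models, 9)
```

A real random forest also samples a random subset of features at each split; that step is left out here to keep the focus on bagging.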

As far as the article goes, inflation is always and everywhere a monetary phenomenon. H/T R-Bloggers.


Using Filter Based Feature Selection in Text Analytics

Dinesh Asanka takes us through a text analytics technique in Azure Machine Learning:

There are two parameters to be defined in the Feature Hashing control. The hashing bitsize defines the maximum number of vectors. A hashing bitsize of 10 means 1,024 vectors (2^10), which is more than enough even for large text files. Next, we need to choose the N-grams value; we use 2, as that is the optimal number of N-grams for most situations. A detailed description of N-grams is given in the link in the reference section.

After the vectors are generated, we do not need the other text columns. Apart from the vectors, we need only the dependent attribute, which is the category column in this example. Therefore, we can remove the unnecessary attributes with the Select Columns in Dataset control. However, this control lists all 1,024 vectors even though not all of them were produced by the previous step, Feature Hashing. Therefore, in the Select Columns in Dataset control you need to choose only the attributes that Feature Hashing actually produced. In the above example, only 93 vectors were generated.
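The gap between 1,024 possible vectors and the handful actually generated is easier to see with a small sketch of the hashing trick in plain Python. This is an illustration under stated assumptions, not Azure Machine Learning's implementation: SHA-256 stands in for whatever hash the Feature Hashing control uses, an N-grams setting of 2 is assumed to mean unigrams plus bigrams, and the names ngrams, hash_features, and docs are hypothetical.

```python
import hashlib


def ngrams(tokens, n_max=2):
    """Yield every n-gram of length 1..n_max as a space-joined string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def hash_features(text, bits=10, n_max=2):
    """Map a text to a count vector with 2**bits slots via the hashing trick."""
    size = 1 << bits  # 10 bits -> 1,024 slots
    vec = [0] * size
    for gram in ngrams(text.lower().split(), n_max):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as the bucket index.
        idx = int.from_bytes(digest[:4], "little") % size
        vec[idx] += 1
    return vec


# Hypothetical mini-corpus; real input would be the text column of the dataset.
docs = [
    "the server failed to start",
    "the query plan changed after the upgrade",
    "backup failed on the reporting server",
]
vectors = [hash_features(d) for d in docs]

# Slots that are non-zero in at least one document. Only these carry
# information, so only these are worth keeping as columns downstream.
used = [j for j in range(1 << 10) if any(v[j] for v in vectors)]
```

With a small corpus, len(used) is far below 1,024, which mirrors the situation described above where only some of the possible vectors are populated.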

Click through to learn more.


Centralized and Decentralized Data Architectures

James Serra looks at a pattern:

A centralized data architecture means the data from each domain/subject (e.g. payroll, operations, finance) is copied to one location (e.g. a data lake under one storage account), and that the data from the multiple domains/subjects are combined to create centralized data models and unified views. It also means centralized ownership of the data (usually IT). This is the approach used by a Data Fabric.

A decentralized distributed data architecture means the data from each domain is not copied but rather kept within the domain (each domain/subject has its own data lake under one storage account) and each domain has its own data models. It also means distributed ownership of the data, with each domain having its own owner.

So is decentralized better than centralized?

Read on for James’s answer, and allow me to include a Dilbert cartoon so old, the boss didn’t even have pointy hair yet.

[Image: Dilbert cartoon, from "How Decentralized Organizations Can be Effective" on The Fourth Revolution Blog]


A Show about Nothing

Joe Celko has a moment of zen:

Human beings are not very good at handling nothing. The printing press didn’t just lead to civilization as we know it, but it also changed our mindset about text. When we wrote text manually on paper, a blank or space was not seen as a character. It was just the background upon which characters were written.

It was centuries before the zero was accepted as a number. After all, it represents the absence of a quantity or magnitude or position; how could it possibly be a number? Before it was accepted as a number, it was considered a symbol or mark in a positional notation to indicate that there was nothing in that position.

It’s an interesting riff, so check it out.


The Future of R with SQL Server

James Rowland-Jones has an update for us:

The importance of R was first recognized by the SQL Server team back in 2016 with the launch of SQL ML Services and R Server. Over the years we have added Python to SQL ML Services in 2017 and Java support through our language extensions in 2019. Earlier this year we also announced the general availability of SQL ML Services into Azure SQL Managed Instance. SparkR, sparklyr, and PySpark are also available as part of SQL Server Big Data Clusters. We remain committed to R.

With that said, much has changed in the world of data science and analytics since 2016. Microsoft’s approach to open-source software has undergone a similar transformation in the same period. It is therefore time for us to share how we, in Azure SQL and SQL Server, are changing to meet the needs of our users and the R community moving forward.

I never used ML Server (but have used SQL Server ML Services a lot), so that part of the announcement doesn’t affect me and I’m not sure how many organizations it does affect. Switching to CRAN R is a good idea and I appreciate that they’re open-sourcing the RevoScaleR and revoscalepy code bases. The one thing I’d really like to see in vNext’s Machine Learning Services is an easy way to update the version of R.


Monitoring Power Virtual Agent Chatbots

Devin Knight has a video for us:

Power Virtual Agents empowers subject matter experts to build intelligent conversational bots, using a guided, no-code graphical interface. In this video you will learn how to monitor how successful your chatbots are at answering your users’ questions. Using the monitoring capability you will uncover areas of your chatbot that can be improved.

If I were familiar enough with Latin, I’d try a play on “Quis custodiet ipsos custodes?” with this.


Queues and Watermarks

Forrest McDaniel wants a zippier queue in SQL Server:

I recently had the pleasure of investigating a procedure that pulled from a queue. Normally it was fast, but occasionally runtime would spike. The spooky thing was the query was using an ordered index scan that should only read one row, but during the spikes it was reading thousands.

Surely there’s a rational explanation…

Spoilers: there was. And Forrest ain’t afraid of no ghosts.

(sotto voce – I’m so glad that Forrest didn’t sneak in any Ghostbusters references so that I could do that here and be original.)


Aggregation and Indexed Views

Randolph West dives into the archives:

Ten years of hindsight (and being able to read the wrap-up post with all the responses) gives me an advantage in this retrospective, I admit, but I didn’t find the thing I was going to write about anyway even though one or two people had a similar idea. And that, dear reader, means that I can write about one of my favourite performance secret weapons: the indexed view. It’s essentially a regular view with an index (or indexes) attached to it. Oracle calls them materialized views. Unlike a regular view which is simply a query definition, the indexed view persists the results, making it a lot more efficient to query that data:

Read on for more information.
