Category: Data Science

Finding Maxima And Minima

Published 2018-01-04 by Kevin Feasel

Jobil Louis shares various techniques for finding a global maximum or minimum:

Let’s say we want to find the minimum value of y and the value of x that gives that minimum. There are many ways to find this. I will explain three of those.

1) Search based methods: Here the idea is to search for the minimum value of y by feeding in different values of x. There are two different ways to do this.

a) Grid search: In grid search, you give a list of values for x (as in a grid), calculate y for each, and take the minimum of those.

b) Random search: In this method, you randomly generate values of x and compute y and find the minimum among those.

The drawback of search-based methods is that there is no guarantee that we will find a local or global minimum. The global minimum is the overall minimum of a curve. A local minimum is a point that is a minimum relative to its neighboring values.
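
As a rough sketch of the two search-based methods described above, the snippet below runs a grid search and a random search over a one-dimensional function. The objective `f`, the search interval, and the sample counts are all assumptions chosen for illustration; they do not come from the linked post.

```python
import random


def f(x):
    # Hypothetical objective: a parabola whose minimum is y = 1 at x = 2.
    return (x - 2.0) ** 2 + 1.0


def grid_search(func, lo, hi, steps):
    """Evaluate func on an evenly spaced grid and return the best (x, y) pair."""
    xs = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(((x, func(x)) for x in xs), key=lambda pair: pair[1])


def random_search(func, lo, hi, samples, seed=0):
    """Evaluate func at uniformly random points and return the best (x, y) pair."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    xs = [rng.uniform(lo, hi) for _ in range(samples)]
    return min(((x, func(x)) for x in xs), key=lambda pair: pair[1])


grid_best = grid_search(f, -5.0, 5.0, 101)
random_best = random_search(f, -5.0, 5.0, 1000)
print("grid search:  ", grid_best)
print("random search:", random_best)
```

Both functions only report the best point they happened to evaluate. Here the grid lands exactly on x = 2 because of how the interval was chosen; with a different grid or a bumpier function, neither method is guaranteed to hit a true minimum.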

My favorite class of algorithms here is evolutionary algorithms, particularly genetic algorithms and genetic programming. They’re a last-ditch effort when nothing else works, but the funny thing about them is that when nothing else works, they tend to step up.

DBA Salary Gaps

Published 2018-01-02 by Kevin Feasel

Eugene Meidinger has a great post looking at DBA salaries for women versus men:

Goofy outliers are an issue, but the larger the dataset the smaller the issue. If Bill Gates walks into a bar, the average wealth in the bar goes up by a billion. If he walks into a football stadium, everyone gets a million dollar raise.

One way of looking at the issue is to compare the median to the mean. The median is the salary smack dab in the middle, whereas mean is what we normally think of when we think of average.

The median doesn’t care where Bill Gates is, but the mean is sensitive to outliers. If we compare the two, that should give us an idea if we have too much skew in either direction.
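
To make the mean-versus-median comparison concrete, here is a small sketch using Python’s standard `statistics` module. The salary figures are made up purely for illustration and are not taken from the survey data in the linked post.

```python
from statistics import mean, median

# Hypothetical salaries, for illustration only.
salaries = [60_000, 65_000, 70_000, 75_000, 80_000]

# The same group after one extreme earner walks in.
with_outlier = salaries + [10_000_000]

print("without outlier: mean =", mean(salaries), "median =", median(salaries))
print("with outlier:    mean =", mean(with_outlier), "median =", median(with_outlier))

# A large gap between mean and median hints at skew in the data.
gap = mean(with_outlier) - median(with_outlier)
print("mean - median with outlier:", gap)
```

In this toy data the median barely moves when the outlier arrives, while the mean jumps by more than an order of magnitude.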

If you’re not well-versed in descriptive statistics, Eugene has a good, methodical process and explains each step well.

Outlier Detection With dplyr And ruler

Published 2017-12-29 by Kevin Feasel

Evgeni Chasnovski shows how to use a couple R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. The majority of them deal with numerical data. This post will describe the most basic ones with their application using the dplyr and ruler packages.

After reading this post you will know:

Most basic outlier detection techniques.
A way to implement them using dplyr and ruler.
A way to combine their results in order to obtain a new outlier detection method.
A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).
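
The excerpt does not name the techniques, so the sketch below is only an assumption about what "basic" might look like: a z-score rule and Tukey’s fences, combined by flagging a value only when both rules agree. It uses plain Python rather than dplyr and ruler, and the data, the 2.5 threshold, and the function names are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def z_score_outliers(values, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)  # all values identical: nothing to flag
    return [abs(v - mu) / sigma > threshold for v in values]


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]


# Hypothetical data with one obviously unusual value.
data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 95]

by_z = z_score_outliers(data)
by_tukey = tukey_outliers(data)

# One simple way to combine rules: require both to agree.
combined = [a and b for a, b in zip(by_z, by_tukey)]
print([v for v, flag in zip(data, combined) if flag])
```

Requiring agreement is the conservative choice; flagging a value when either rule fires would be the more aggressive alternative.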

Read the whole thing. H/T R-Bloggers

Breeze: Mathematics In Scala

Published 2017-12-28 by Kevin Feasel

Nitin Aggarwal introduces the mathematics library behind Spark’s machine learning library, MLlib:

In simple terms, Breeze is a Scala library that extends the Scala collection library to provide support for vectors and matrices in addition to providing a whole bunch of functions that support their manipulation. We could safely compare Breeze to NumPy in Python terms. Breeze forms the foundation of MLlib, the Machine Learning library in Spark.

Breeze comprises four libraries:

breeze-math: Numerics and Linear Algebra. Fast linear algebra backed by native libraries (via JBlas) where appropriate.
breeze-process: Tools for tokenizing, processing, and massaging data, especially textual data. Includes stemmers, tokenizers, and stop word filtering, among other features.
breeze-learn: Optimization and Machine Learning. Contains state-of-the-art routines for convex optimization, sampling distributions, several classifiers, and DSLs for Linear Programming and Belief Propagation.
breeze-viz: (Very alpha) Basic support for plotting, using JFreeChart.

Read on for samples and basic usage.

Data Science At A Small Tech Company

Published 2017-12-28 by Kevin Feasel

Julia Silge blogs about her first year as a data scientist at Stack Overflow:

In the fall I saw this post by Shanif Dhanani about being a data scientist at a small company, and it is entirely on point, the whole way through. So much of that post resonates with my own experience of being a data scientist at a small company. And yes, I do keep saying “small company”; Stack Overflow is likely smaller than you think it is, 250 or so employees in total. I am the second data scientist here, joining David Robinson who was the first data science hire, on a data team that is five in total.

I cannot emphasize enough how much of my day-to-day work is communicating, collaborating with others, and answering not-entirely-specified questions. Data science is highly technical work, but the value of my technical work would be much lower if I could not communicate what it means in clear and compelling ways. My definition of communication here is pretty broad, and includes speaking, writing, and data visualization.

If you’re interested in a career in data science, this is food for thought.

The Importance Of Model Interpretability

Published 2017-12-21 by Kevin Feasel

Ilknur Kaynar Kabul explains why it’s important that your data science models be interpretable:

Some machine learning models are simple and easy to understand. We know how changing the inputs will affect the predicted outcome and can make justification for each prediction. However, with the recent advances in machine learning and artificial intelligence, models have become very complex, including complex deep neural networks and ensembles of different models. We refer to these complex models as black box models.

Unfortunately, the complexity that gives extraordinary predictive abilities to black box models also makes them very difficult to understand and trust. The algorithms inside the black box models do not expose their secrets. They don’t, in general, provide a clear explanation of why they made a certain prediction. They just give us a probability, and they are opaque and hard to interpret. Sometimes there are thousands (even millions) of model parameters, there’s no one-to-one relationship between input features and parameters, and often combinations of multiple models using many parameters affect the prediction. Some of them are also data hungry. They need enormous amounts of data to achieve high accuracy. It’s hard to figure out what they learned from those data sets and which of those data points have more influence on the outcome than the others.

This post reminds me of a story I’d heard about a financial organization using neural networks to build accurate models, but then needing to decompose the models into complex decision trees to explain to auditors that they weren’t violating any laws in the process.

Hierarchical Clustering

Published 2017-12-20 by Kevin Feasel

Chaitanya Sagar explains hierarchical clustering with examples in R:

Hope now you have a better understanding of clustering algorithms than what you started with. We discussed Divisive and Agglomerative clustering techniques and four linkage methods, namely Single, Complete, Average and Ward’s method. Next, we implemented the discussed techniques in R using a numeric dataset. Note that we didn’t have any categorical variable in the dataset we used. You need to treat the categorical variables in order to incorporate them into a clustering algorithm. Lastly, we discussed a couple of plots to visualise the clusters/groups formed. Note here that we have assumed the value of ‘k’ (number of clusters) is known. However, this is not always the case. There are a number of heuristics and rules of thumb for picking the number of clusters. A given heuristic will work better on some datasets than others. It’s best to take advantage of domain knowledge to help set the number of clusters, if that’s possible. Otherwise, try a variety of heuristics, and perhaps a few different values of k.
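
The linked post works in R; as a language-neutral illustration of the agglomerative idea, here is a minimal pure-Python sketch that repeatedly merges the two closest clusters until k remain. It covers single, complete, and average linkage only (Ward’s method is left out to keep it short), and the sample points and function names are hypothetical.

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, method):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [dist(p, q) for p in a for q in b]
    if method == "single":    # closest pair of points
        return min(pairwise)
    if method == "complete":  # farthest pair of points
        return max(pairwise)
    if method == "average":   # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported linkage: {method}")


def agglomerative(points, k, method="single"):
    """Start with one cluster per point; merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = agglomerative(points, k=2, method="average")
print(result)
```

This brute-force version recomputes every pairwise cluster distance on each merge, so it is only suitable for small examples; real libraries use far more efficient update rules.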

There’s a lot to pick out of this post, but you’re able to walk through it step by step. H/T R-Bloggers

Matrix Transposition In T-SQL

Published 2017-12-14 by Kevin Feasel

Phil Factor has some fun transposing a matrix using T-SQL:

What I’m doing is simply converting the table into its JSON form, and then using this to create a table using the multi-row VALUES syntax which paradoxically allows expressions. The expression I’m using is JSON_Value, which allows me to effectively dictate the source within the table, via that JSON Path expression, and the destination. As it is an expression, I can do all sorts of manipulation as well as a transpose. I could, if I wanted, (in SQL 2017) provide that path parameter as a variable. This sort of technique can be used for several other reporting purposes, and it is well worth experimenting with it because it is so versatile.

That is not at all what I would have thought up; very interesting approach. I’d probably just be lazy and shell out to R Services.

Taking A Random Walk

Published 2017-12-12 by Kevin Feasel

Dan Goldstein describes the basics of Brownian motion:

I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:

1 up (North)
2 diagonal to the upper right (Northeast)
3 diagonal to the lower right (Southeast)
4 down (South)
5 diagonal to the lower left (Southwest)
6 diagonal to the upper left (Northwest)

Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”

After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.

We did this about 50 times. The results are pictured above, along with kid hands for scale.
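
The die-rolling procedure above can be simulated in a few lines. This sketch assumes flat-topped hexagons addressed with axial (q, r) coordinates, which is one common convention and not something the original post specifies; the function and variable names are likewise hypothetical.

```python
import random

# Die face -> (dq, dr) step in axial coordinates for flat-topped hexagons.
# The coordinate convention is an assumption made for this sketch.
STEPS = {
    1: (0, -1),   # up (North)
    2: (1, -1),   # upper right (Northeast)
    3: (1, 0),    # lower right (Southeast)
    4: (0, 1),    # down (South)
    5: (-1, 1),   # lower left (Southwest)
    6: (-1, 0),   # upper left (Northwest)
}


def hex_walk(n_rolls, seed=None):
    """Roll a six-sided die n_rolls times and return every hexagon visited."""
    rng = random.Random(seed)
    q, r = 0, 0  # start at the hexagon in the middle of the page
    path = [(q, r)]
    for _ in range(n_rolls):
        dq, dr = STEPS[rng.randint(1, 6)]
        q, r = q + dq, r + dr
        path.append((q, r))
    return path


path = hex_walk(50, seed=42)
print("final hexagon:", path[-1])
print("distinct hexagons visited:", len(set(path)))
```

Comparing the number of distinct hexagons with the number of rolls shows how often the walk doubles back on itself, which is what prompted the second line drawn alongside the first in the story.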

Javi Fernandez-Lopez then shows how to generate an animated GIF displaying Brownian motion:

Last Monday we celebrated a “Scientific Marathon” at Royal Botanic Garden in Madrid, a kind of mini-conference to talk about our research. I was talking about the relation between fungal spore size and environmental variables such as temperature and precipitation. To make my presentation more friendly, I created a GIF to explain the Brownian Motion model. In evolutionary biology, we can use this model to simulate the random variation of a continuous trait through time. Under this model, we can notice how closer species tend to maintain closer trait values due to shared evolutionary history. You have a lot of information about Brownian Motion models in evolutionary biology everywhere!
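
A one-dimensional Brownian motion path can be approximated by summing independent Gaussian increments. The sketch below is not the GIF-generating code from the linked post; it is a minimal illustration, with hypothetical names and parameters, of two lineages that share a common history before diverging.

```python
import random


def brownian_path(n_steps, sigma=1.0, dt=1.0, start=0.0, seed=None):
    """Return a list of trait values; each step adds Normal(0, sigma * sqrt(dt)) noise."""
    rng = random.Random(seed)
    values = [start]
    for _ in range(n_steps):
        values.append(values[-1] + rng.gauss(0.0, sigma * dt ** 0.5))
    return values


# Shared evolutionary history: a single ancestral lineage.
trunk = brownian_path(80, seed=1)

# Two descendant species start from the ancestor's final trait value
# and then evolve independently for a shorter time.
species_a = brownian_path(20, start=trunk[-1], seed=2)
species_b = brownian_path(20, start=trunk[-1], seed=3)

print("ancestor trait at split:", trunk[-1])
print("species A final trait:  ", species_a[-1])
print("species B final trait:  ", species_b[-1])
```

Because the two species have only 20 independent steps each after a shared 80-step history, their final values tend to sit closer together than those of two lineages that evolved separately for all 100 steps, though any single random run can differ.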

Another place that this is useful is in describing stock market movements in the short run.

Estimating Used Car Prices

Published 2017-12-07 by Kevin Feasel

Kevin Jacobs wants to estimate the value of his car and shows how to set up a machine learning job to do this:

As you can see, I collected the brand (Peugeot 106), the type (1.0, 1.1, …), the color of the car (black, blue, …), the construction year of the car, the odometer of the car (which is the distance in kilometers (km) traveled with the car at this point in space and time), the ask price of the car (in euros), the days until the MOT (Ministry of Transport test, a required periodical check-up of your car) and the horsepower (HP) of the car. Feel free to use your own variables/units!

It’s an interesting example of how you can approach a real problem.
