Category: Data Science

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and formed the basis of a lot of text processing (especially text generation) due to the algorithm’s relatively low computational requirements.

Comments closed

More DBA Salary Research

Published 2018-01-22 by Kevin Feasel

Ginger Grant digs into the DBA salary survey a bit further:

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers. According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact on telecommuting and the top 5%? Well, it depends if you are looking at the much smaller female population. The majority of females in the top 5% telecommute. Those who commute 100% of the time do very well, as well as those who spend every day at a job site. Males report working more hours and telecommuting less than females do as well. If you look at people who are in the average category, they do not telecommute. The average category has 25% of people who work less than 40 hours a week too. If you look at the number of items in the category by country you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in locations.

Another area of importance here is in trying to normalize salaries for standard of living: it’s a lot easier to get a $100K/year job in Manhattan, NY than Manhattan, KS, but $100K in the latter goes much further. Based on my little digging into the set, it’d be tough to draw any conclusions on that front, but it is an a priori factor that I’d want to consider when dealing with salary survey data.

Comments closed

DBA Salary Calculations

Published 2018-01-15 by Kevin Feasel

Eugene Meidinger takes a whack at the data professional salary survey:

So I’m using something called a multiple linear regression to make a formula to predict your salary based on specific variables. Unfortunately, the highest Coefficient of Determination (or R²) I’ve been able to get is 0.37. Which means, as far as I understand it, that at most the model explains 37% of the variation.

Additionally the spread on the results isn’t great either. The standard deviation, a measure of spread, is about $25,000 on the original subset of data. Which means we’d expect 68% to be within +/- $25,000 of the average and 95% to be within +/- $50,000 of the average. So what happens when we apply our model?

Read on for Eugene’s early findings and a roadmap for additional posts.

Comments closed

Choose Your Own Regression Adventure

Published 2018-01-09 by Kevin Feasel

Jim Frost explains when you might use different types of regression analysis:

Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.

I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.

It’s a good overview of several techniques.

Comments closed

Finding Maxima And Minima

Published 2018-01-04 by Kevin Feasel

Jobil Louis shares various techniques for finding a global maximum or minimum:

Let’s say we want to find the minimum point in y and value of x which gives that minimum y. There are many ways to find this. I will explain three of those.

1) Search based methods: Here the idea is to search for the minimum value of y by feeding in different values of x. There are two different ways to do this.

a) Grid search: In grid search, you give a list of values for x(as in a grid) and calculate y and see the minimum of those.

b) Random search: In this method, you randomly generate values of x and compute y and find the minimum among those.

The drawback of search based methods is that there is no guarantee that we will find a local or global minimum. Global minimum means the overall minimum of a curve. Local minimum means a point which is minimum relatively to its neighboring values.

My favorite class of algorithm here is evolutionary algorithms, particularly genetic algorithms and genetic programming. They’re a last-ditch effort when nothing else works, but the funny thing about them is that when nothing else works, they tend to step up.

Comments closed

DBA Salary Gaps

Published 2018-01-02 by Kevin Feasel

Eugene Meidinger has a great post looking at DBA salaries for women versus men:

Goofy outliers are an issue, but the larger the dataset the smaller the issue. If Bill Gates walks into a bar, the average wealth in the bar goes up by a billion. If he walks into a football stadium, everyone gets a million dollar raise.

One way of looking at the issue is to compare the median to the mean. The median is the salary smack dab in the middle, whereas mean is what we normally think of when we think of average.

The median doesn’t care where Bill Gates is, but the mean is sensitive to outliers. If we compare the two, that should give us an idea if we have too much skew in either direction.

If you’re not well-versed in descriptive statistics, Eugene has a good, methodical process and explains each step well.

Comments closed

Outlier Detection With dplyr And ruler

Published 2017-12-29 by Kevin Feasel

Evgeni Chasnovski shows how to use a couple R packages in concert to find outliers:

During the process of data analysis one of the most crucial steps is to identify and account for outliers, observations that have essentially different nature than most other observations. Their presence can lead to untrustworthy conclusions. The most complicated part of this task is to define a notion of “outlier”. After that, it is straightforward to identify them based on given data.

There are many techniques developed for outlier detection. Majority of them deal with numerical data. This post will describe the most basic ones with their application using dplyrand ruler packages.

After reading this post you will know:

Most basic outlier detection techniques.
A way to implement them using dplyr and ruler.
A way to combine their results in order to obtain a new outlier detection method.
A way to discover notion of “diamond quality” without prior knowledge of this topic (as a happy consequence of previous point).

Read the whole thing. H/T R-Bloggers

Comments closed

Breeze: Mathematics In Scala

Published 2017-12-28 by Kevin Feasel

Nitin Aggarwal introduces the mathematics library behind Spark’s machine learning library, MLlib:

In simple terms, Breeze is a Scala library that extends the Scala collection library to provide support for vectors and matrices in addition to providing a whole bunch of functions that support their manipulation. We could safely compare Breeze to NumPy in Python terms. Breeze forms the foundation of MLlib—the Machine Learning library in Spark

Breeze comprises four libraries:

breeze-math: Numerics and Linear Algebra. Fast linear algebra backed by native libraries (via JBlas) where appropriate.
breeze-process: Tools for tokenizing, processing, and massaging data, especially textual data. Includes stemmers, tokenizers, and stop word filtering, among other features.
breeze-learn: Optimization and Machine Learning. Contains state-of-the-art routines for convex optimization, sampling distributions, several classifiers, and DSLs for Linear Programming and Belief Propagation.
breeze-viz: (Very alpha) Basic support for plotting, using JFreeChart.

Read on for samples and basic usage.

Comments closed

Data Science At A Small Tech Company

Published 2017-12-28 by Kevin Feasel

Julia Silge blogs about her first year as a data scientist at Stack Overflow:

In the fall I saw this post by Shanif Dhanani about being a data scientist at a small company, and it is entirely on point, the whole way through. So much of that post resonates with my own experience of being a data scientist at a small company. And yes, I do keep saying “small company”; Stack Overflow is likely smaller than you think it is, 250 or so employees in total. I am the second data scientist here, joining David Robinson who was the first data science hire, on a data team that is five in total.

I cannot emphasize enough how much of my day-to-day work is communicating, collaborating with others, and answering not-entirely-specified questions. Data science is highly technical work, but the value of my technical work would be much lower if I could not communicate what it means in clear and compelling ways. My definition of communication here is pretty broad, and includes speaking, writing, and data visualization.

If you’re interested in a career in data science, this is food for thought.

Comments closed

The Importance Of Model Interpretability

Published 2017-12-21 by Kevin Feasel

Ilknur Kaynar Kabul explains why it’s important that your data science models be interpretable:

Some machine learning models are simple and easy to understand. We know how changing the inputs will affect the predicted outcome and can make justification for each prediction. However, with the recent advances in machine learning and artificial intelligence, models have become very complex, including complex deep neural networks and ensembles of different models. We refer to these complex models as black box models.

Unfortunately, the complexity that gives extraordinary predictive abilities to black box models also makes them very difficult to understand and trust. The algorithms inside the black box models do not expose their secrets. They don’t, in general, provide a clear explanation of why they made a certain prediction. They just give us a probability, and they are opaque and hard to interpret. Sometimes there are thousands (even millions) of model parameters, there’s no one-to-one relationship between input features and parameters, and often combinations of multiple models using many parameters affect the prediction. Some of them are also data hungry. They need enormous amounts of data to achieve high accuracy. It’s hard to figure out what they learned from those data sets and which of those data points have more influence on the outcome than the others.

This post reminds me of a story I’d heard about a financial organization using neural networks to build accurate models, but then needing to decompose the models into complex decision trees to explain to auditors that they weren’t violating any laws in the process.

Comments closed