Data Science – Page 46

Forensic Accounting: Cohort Analysis

Published 2019-04-19 by Kevin Feasel

I continue my series on forensic accounting techniques with cohort analysis:

In the last post, we focused on high-level aggregates to gain a basic understanding of our data. We saw some suspicious results but couldn’t say much more than “This looks weird” due to our level of aggregation. In this post, I want to dig into data at a lower level of detail. My working conception is the cohort, a broad-based comparison of data sliced by some business-relevant or analysis-relevant component.
Those familiar with Kimball-style data warehousing already understand where I’m going with this. In the basic analysis, we essentially look at fact data with a little bit of disaggregation, such as looking at data by year. In this analysis, we introduce dimensions (sort of) and slice our data by dimensions.

Click through for some fraud-finding fun.
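
As a rough illustration of slicing by a dimension, here is a minimal Python sketch that totals spend per (vendor, year) cohort and compares growth across vendors. The records, vendor names, and function names are all made up for this example; they are not taken from the linked post, which works in a different toolset.

```python
from collections import defaultdict

# Hypothetical expense records: (vendor, year, amount). Not data from the post.
records = [
    ("Vendor A", 2017, 1200.0),
    ("Vendor A", 2018, 1250.0),
    ("Vendor B", 2017, 900.0),
    ("Vendor B", 2018, 2700.0),
    ("Vendor C", 2017, 1100.0),
    ("Vendor C", 2018, 1150.0),
]


def totals_by_cohort(rows):
    """Sum amounts for each (vendor, year) cohort."""
    totals = defaultdict(float)
    for vendor, year, amount in rows:
        totals[(vendor, year)] += amount
    return dict(totals)


def growth_by_vendor(totals, start_year, end_year):
    """Ratio of end-year spend to start-year spend for each vendor."""
    vendors = sorted({vendor for vendor, _ in totals})
    return {v: totals[(v, end_year)] / totals[(v, start_year)] for v in vendors}


totals = totals_by_cohort(records)
ratios = growth_by_vendor(totals, 2017, 2018)
for vendor, ratio in ratios.items():
    print(f"{vendor}: {ratio:.2f}x")
```

A vendor whose ratio sits far from its peers is not proof of anything, but it is the kind of cohort-level oddity worth a closer look.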


Bayes’ Theorem In A Picture

Published 2019-04-17 by Kevin Feasel

Stephanie Glen gives us the basics of Bayes’ Theorem in a picture:

Bayes’ Theorem is a way to calculate conditional probability. The formula is very simple to calculate, but it can be challenging to fit the right pieces into the puzzle. The first challenge comes from defining your event (A) and test (B); the second challenge is rephrasing your question so that you can work backwards: turning P(A|B) into P(B|A). The following image shows a basic example involving website traffic. For more simple examples, see: Bayes Theorem Problems.

Click through for the image and related links.
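
To make the "work backwards" step concrete, here is a small Python sketch of the formula P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded over A and not-A. The website-traffic numbers are hypothetical and are not the ones in Stephanie's image.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' Theorem."""
    # Total probability of observing B, whether or not A holds.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: 10% of visitors are returning customers (A).
# 60% of returning customers click a banner (B); 20% of everyone else does.
p = posterior(prior_a=0.10, p_b_given_a=0.60, p_b_given_not_a=0.20)
print(f"P(returning | clicked) = {p:.2f}")
```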


Basic Forensic Accounting Techniques

Published 2019-04-17 by Kevin Feasel

I continue my series on forensic accounting techniques:

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.
For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).
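
For a feel of what growth analysis looks like in code, here is a minimal Python sketch that computes year-over-year revenue growth, cost growth, and net margin. The annual figures and function names are invented for illustration; the post itself works from a SQL query over its own data set.

```python
# Hypothetical annual figures: year -> (revenue, cost). Not data from the post.
annual = {
    2015: (1000.0, 800.0),
    2016: (1100.0, 900.0),
    2017: (1210.0, 1035.0),
}


def pct_change(old, new):
    """Fractional change from old to new."""
    return (new - old) / old


def growth_table(figures):
    """Year-over-year revenue growth, cost growth, and net margin per year."""
    years = sorted(figures)
    rows = []
    for prev, curr in zip(years, years[1:]):
        rev_prev, cost_prev = figures[prev]
        rev_curr, cost_curr = figures[curr]
        rows.append({
            "year": curr,
            "revenue_growth": pct_change(rev_prev, rev_curr),
            "cost_growth": pct_change(cost_prev, cost_curr),
            "net_margin": (rev_curr - cost_curr) / rev_curr,
        })
    return rows


table = growth_table(annual)
for row in table:
    print(row)
```

In this made-up series, costs grow faster than revenue in both years, which is the "not good for the long-term outlook" pattern described in the excerpt.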


Techniques For Standardizing Raw Scores

Published 2019-04-16 by Kevin Feasel

Sebastian Sauer shows a few techniques you can use to translate raw scores into standardized scores in R:

A common undertaking in applied research settings such as in some areas of psychology is to convert a raw score into some type of standardized score such as z-scores.
This post shows one way to accomplish that.

Read on for three techniques.
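
Sebastian's post is in R, but the basic z-score conversion is easy to sketch in Python as well. This is only the textbook formula (subtract the mean, divide by the standard deviation), not necessarily any of his three techniques, and the raw scores are made up.

```python
from statistics import mean, stdev


def z_scores(raw):
    """Convert raw scores to z-scores: (x - mean) / sample standard deviation."""
    m = mean(raw)
    s = stdev(raw)  # sample standard deviation (n - 1 denominator), as R's sd() uses
    return [(x - m) / s for x in raw]


raw_scores = [12, 15, 9, 20, 14]  # hypothetical test scores
z = z_scores(raw_scores)
print([round(value, 2) for value in z])
```

After standardizing, the scores have a mean of 0 and a standard deviation of 1, which makes results from different scales comparable.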


K-Nearest Neighbors in Python

Published 2019-04-11 by Kevin Feasel

Hardik Jaroli shows how to use the k-Nearest Neighbors algorithm with scikit-learn:

K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some imaginary data on Dogs and Horses, with heights and weights.
Training Algorithm:
1. Store all the Data
Prediction Algorithm:
1. Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points
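
The steps above can be sketched in a few lines of plain Python. This is a hand-rolled illustration rather than the scikit-learn code from the linked post, and the dog and horse measurements are invented.

```python
from collections import Counter
from math import dist


def knn_predict(train, x, k=3):
    """Predict a label for x from train, a list of (features, label) pairs."""
    # 1. Calculate the distance from x to every stored point.
    distances = [(dist(features, x), label) for features, label in train]
    # 2. Sort the points by increasing distance from x.
    distances.sort(key=lambda pair: pair[0])
    # 3. Predict the majority label among the k closest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# "Training" is just storing the data: hypothetical (height cm, weight kg) pairs.
train = [
    ((50.0, 20.0), "dog"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
    ((160.0, 500.0), "horse"),
    ((170.0, 550.0), "horse"),
    ((150.0, 450.0), "horse"),
]

print(knn_predict(train, (58.0, 28.0)))
print(knn_predict(train, (165.0, 520.0)))
```

In practice you would usually scale the features first, since raw weight would otherwise dominate the distance calculation.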


Learning with Limited Data

Published 2019-04-09 by Kevin Feasel

Shioulin Sam and Nisha Muktewar have new research on machine learning when getting labeled data is time-consuming or difficult:

We are excited to release Learning with Limited Labeled Data, the latest report and prototype from Cloudera Fast Forward Labs.
Being able to learn with limited labeled data relaxes the stringent labeled data requirement for supervised machine learning. Our report focuses on active learning, a technique that relies on collaboration between machines and humans to label smartly.
Active learning makes it possible to build applications using a small set of labeled data, and enables enterprises to leverage their large pools of unlabeled data. In this blog post, we explore how active learning works. (For a higher-level introduction, please see our previous blogpost.)

The research itself is behind a paywall but you can see their write-up to get an idea of the topic.
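
To give a flavor of "labeling smartly," here is a tiny Python sketch of uncertainty sampling, one common active learning strategy: ask a human to label the items the current model is least sure about. I am assuming this strategy for illustration; the report may use different ones, and the probabilities below are invented.

```python
def most_uncertain(probabilities, n):
    """Return indices of the n items whose predicted P(positive) is closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:n]


# Hypothetical model outputs for five unlabeled items.
predicted = [0.95, 0.52, 0.10, 0.47, 0.80]
to_label = most_uncertain(predicted, 2)
print(to_label)
```

The selected items go to a human for labeling, the model is retrained, and the loop repeats until the labeling budget runs out.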


Getting Started with Azure Databricks

Published 2019-04-08 by Kevin Feasel

Brad Llewellyn has a tutorial for Azure Databricks:

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post. The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform. In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product. It is owned and managed by the company Databricks and available in Azure and AWS. However, Databricks is a “first party offering” in Azure. This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products. You can read more about Azure Databricks here, here, and here.

Click through for a demonstration of the product.


Solving Logistic Regression Problems with Python

Published 2019-04-08 by Kevin Feasel

Hardik Jaroli shows how we can solve logistic regression problems in Python, using the Titanic data set as an example:

We will be working with the Titanic Data Set from Kaggle. We’ll be trying to predict a classification: survival or deceased.
Let’s begin by implementing Logistic Regression in Python for classification. We’ll use a “semi-cleaned” version of the Titanic data set; if you use the data set hosted directly on Kaggle, you may need to do some additional cleaning.

Click through for the demo.
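
If you want to see the mechanics without any libraries, here is a minimal Python sketch of logistic regression with a single feature, fit by batch gradient descent on log loss. This is not the code from the linked post and does not use the Titanic data; the toy data, learning rate, and epoch count are arbitrary choices for illustration.

```python
from math import exp


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b for one feature via batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical toy data: one standardized feature and a binary outcome.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(predict(2.0, w, b), predict(-2.0, w, b))
```

For real work, a library implementation handles multiple features, regularization, and convergence checks for you.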


Finding an Unfair Coin with R

Published 2019-04-05 by Kevin Feasel

Sebastian Sauer works out a coin flip problem:

A stochastic problem, with application to financial theory. Some say it goes back to Warren Buffett. I relied on my colleague Norman Markgraf, who pointed it out to me.
Assume there are two coins. One is fair, one is loaded. The loaded coin has a bias of 60-40. Now, the question is: How many coin flips do you need to be “sure enough” (say, 95%) that you found the loaded coin?
Let’s simulate la chose.

It took a few more flips than I had expected but the number is not outlandish.
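
Sebastian simulates this in R. For comparison, here is a Python sketch of one possible framing, which is my assumption rather than his method: flip the loaded coin repeatedly, start from a 50/50 prior over "fair" versus "loaded," update the posterior after each flip, and stop once it reaches 95%. The function name, seed, and trial count are arbitrary.

```python
import random


def flips_until_confident(p_loaded=0.6, threshold=0.95, rng=random, max_flips=100_000):
    """Flip a loaded coin until the posterior that it is loaded reaches threshold."""
    posterior = 0.5  # prior: equally likely to be the fair or the loaded coin
    for n in range(1, max_flips + 1):
        heads = rng.random() < p_loaded
        # Likelihood of this outcome under each hypothesis.
        like_loaded = p_loaded if heads else 1 - p_loaded
        like_fair = 0.5
        numerator = like_loaded * posterior
        posterior = numerator / (numerator + like_fair * (1 - posterior))
        if posterior >= threshold:
            return n
    return max_flips


rng = random.Random(42)  # fixed seed so repeated runs match
trials = [flips_until_confident(rng=rng) for _ in range(1000)]
print("average flips:", sum(trials) / len(trials))
print("fewest flips:", min(trials))
```

Even an unbroken run of heads needs 17 flips under this framing, because each head multiplies the odds by only 1.2 and the odds must reach 19 to 1.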


Python Natural Language Processing Tools

Published 2019-04-03 by Kevin Feasel

Sandeep Aspari takes us through some of the tooling available in Python around Natural Language Processing:

TextBlob
TextBlob is a Python library and an extension of NLTK. It provides a simple API to its methods, wraps a large number of NLTK functions, and also includes functionality from the pattern library. If you are just getting started, it might be an excellent tool for learning, and it can be used in production applications that don’t require heavy performance. TextBlob objects behave like Python strings, so we can transform and manipulate them much as we would ordinary strings. Finally, TextBlob is widely used and is best suited for smaller projects.

There are several tools from which you can choose. Sandeep also gives us some Node- and Java-based tools.



Category: Data Science