
Category: Data Science

JupyterLab Now Available

Project Jupyter announces the general availability of JupyterLab:

JupyterLab is an interactive development environment for working with notebooks, code and data. Most importantly, JupyterLab has full support for Jupyter notebooks. Additionally, JupyterLab enables you to use text editors, terminals, data file viewers, and other custom components side by side with notebooks in a tabbed work area.

JupyterLab provides a high level of integration between notebooks, documents, and activities:

  • Drag-and-drop to reorder notebook cells and copy them between notebooks.

  • Run code blocks interactively from text files (.py, .R, .md, .tex, etc.).

  • Link a code console to a notebook kernel to explore code interactively without cluttering up the notebook with temporary scratch work.

  • Edit popular file formats with live preview, such as Markdown, JSON, CSV, Vega, VegaLite, and more.

I like this. I’m a big fan of notebooks, but sometimes you just want to write some diagnostic queries, and an IDE is way better for that. H/T Giovanni Lanzani


The Basics Of PCA In R

Prashant Shekhar gives us an overview of Principal Component Analysis using R:

PCA rotates the axes toward the directions of maximum variance and then projects the data onto these new axes. The direction of maximum variance is represented by the first principal component (PC1). There are multiple principal components, depending on the number of dimensions (features) in the dataset, and they are orthogonal to each other. The maximum number of principal components equals the number of dimensions of the data. For example, in the figure in the original post, two-dimensional data has at most two principal components (PC1 and PC2). The first principal component captures the most variance, followed by the second principal component, the third, and so on. Dimension reduction comes from the fact that the last few principal components can be discarded, as they capture little of the variance in the data.

PCA is a useful technique for reducing dimensionality and removing covariance.
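
If you want to try it yourself, base R’s prcomp() covers the basics. Here is a minimal sketch; the iris measurements are just a stand-in dataset for illustration:

    # PCA on the four numeric iris columns using base R's prcomp()
    X <- iris[, 1:4]

    pca <- prcomp(X, center = TRUE, scale. = TRUE)

    summary(pca)         # proportion of variance captured by each principal component
    head(pca$x[, 1:2])   # the data projected onto PC1 and PC2 (two dimensions kept)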


Investigating The gcForest Algorithm

William Vorhies describes a new algorithm with strong potential:

gcForest (multi-Grained Cascade Forest) is a decision tree ensemble approach in which the cascade structure of deep nets is retained but where the opaque edges and node neurons are replaced by groups of random forests paired with completely-random tree forests.  In this case, typically two of each for a total of four in each cascade layer.

Image and text problems are categorized as ‘feature learning’ or ‘representation learning’ problems where features are neither predefined nor engineered as in traditional ML problems.  And the basic rule in these feature discovery problems is to go deep, using multiple layers, each of which learns relevant features of the data in order to classify them.  Hence the multi-layer structure so familiar from DNNs is retained.

By using both random forests and completely-random tree forests, the authors gain the advantage of diversity.  Each forest contains 500 completely-random trees, allowed to split until each leaf node contains only instances of the same class, which makes the growth self-limiting and adaptive, unlike the fixed and large depth required by DNNs.

The estimated class distribution forms a class vector which is then concatenated with the original feature vector to be the input of the next level cascade.  Not dissimilar from CNNs.

The final model is a cascade of cascade forests.  The final prediction is obtained by aggregating the class vectors and selecting the class with the highest maximum score.

Click through to see how it fares on the usual sample data sets, like MNIST.
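
To make the cascade idea concrete, here is a toy sketch in R. This is not the authors’ gcForest implementation, just an illustration of the mechanism: each layer trains a couple of forests, appends their class-probability vectors to the original features, and feeds the augmented data into the next layer.

    library(randomForest)

    # Train one cascade layer: a handful of forests on the same inputs.
    train_layer <- function(X, y, n_forests = 2) {
      lapply(seq_len(n_forests), function(i) randomForest(x = X, y = y, ntree = 100))
    }

    # Concatenate each forest's class-probability vectors with the original features.
    augment <- function(forests, X) {
      probs <- do.call(cbind, lapply(forests, function(f) predict(f, X, type = "prob")))
      colnames(probs) <- paste0("p", seq_len(ncol(probs)))   # avoid duplicate column names
      cbind(X, probs)
    }

    data(iris)
    X <- iris[, 1:4]
    y <- iris$Species

    layer1 <- train_layer(X, y)     # first cascade layer
    X2     <- augment(layer1, X)    # original features + class vectors
    layer2 <- train_layer(X2, y)    # second cascade layer

    # Final prediction: average the class vectors from the last layer
    # and take the class with the highest score.
    avg  <- Reduce(`+`, lapply(layer2, function(f) predict(f, X2, type = "prob"))) / length(layer2)
    pred <- factor(colnames(avg)[max.col(avg)], levels = levels(y))
    mean(pred == y)   # training accuracy, just to show the pieces fit together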


DataExplorer

Boxuan Cui introduces DataExplorer, an R package dedicated to assist with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task, eating up roughly 80% of the effort. DataExplorer is one of the resources out there whose sole mission is to minimize that 80% and make the process enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any data.frame-like objects. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in a later part of the post.

For my money, that number is closer to 90%.  I will have to check this package out.
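
If you want to kick the tires, the calls below are the sort of one-liners the package advertises. This is a sketch against the CRAN version as I understand it; check the package documentation for the exact function set.

    # install.packages("DataExplorer")   # if needed
    library(DataExplorer)

    introduce(airquality)         # row/column counts, missing values, memory usage
    plot_missing(airquality)      # missing-value profile per column
    plot_histogram(airquality)    # histograms for all continuous columns
    plot_correlation(airquality)  # correlation heatmap
    create_report(airquality)     # full HTML EDA report in one call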


The Year Of The Data Engineer

Alex Woodie points out that data science also requires data engineers:

The shortage of data scientists – those triple-threat types who possess advanced statistics, business, and coding skills – has been well-documented over the years. But increasingly, businesses are facing a shortage of another key individual on the big data team who’s critical to achieving success – the data engineer.

Data engineers are experts in designing, building, and maintaining the data-based systems in support of an organization’s analytical and transactional operations. While they don’t boast the quantitative skills that a data scientist would use to, say, build a complex machine learning model, data engineers do much of the other work required to support that data science workload, such as:

  • Building data pipelines to collect data and move it into storage;

  • Preparing the data as part of an ETL or ELT process;

  • Stitching the data together with scripting languages;

  • Working with the DBA to construct data stores;

  • Ensuring the data is ready for use;

  • Using frameworks and microservices to serve data.

Read the whole thing.  My experience is that most shops looking to hire a data scientist really need to get data engineers first; otherwise, you’re wasting that high-priced data scientist’s time.  The plus side is that if you’re already a database developer, getting into data engineering is much easier than mastering statistics or neural networks.


ARIMA In R

Subhasree Chatterjee shows us how to use R to implement an ARIMA model:

Once the data is ready and satisfies all the assumptions of modeling, we need three values to determine the order of the model to be fitted: p, d, and q, which are non-negative integers referring to the order of the autoregressive, integrated, and moving average parts of the model, respectively.

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

pacf() at lag k is the partial autocorrelation function, which describes the correlation between data points exactly k steps apart after accounting for their correlation with the data between those k steps. It helps identify the number of autoregressive (AR) coefficients (the p value) in an ARIMA model.

ARIMA feels like it should be too simple to work, but it does.
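
As a minimal base-R sketch of the workflow described above, using a simulated AR(1) series rather than the data from the linked post:

    set.seed(42)
    y <- arima.sim(model = list(ar = 0.6), n = 200)   # simulated AR(1) series

    acf(y)    # ACF tails off for an AR process; cuts off at lag q for an MA process
    pacf(y)   # PACF cuts off at lag p for an AR process

    fit <- arima(y, order = c(1, 0, 0))   # p = 1, d = 0, q = 0
    fit
    predict(fit, n.ahead = 12)$pred       # 12-step-ahead forecast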


Structural Topic Models In R

Julia Silge has a great post on building Structural Topic Models in R using stm and tidytext:

The stm package has a summary() method for trained topic models like these that will print out some details to your screen, but I want to get back to a tidy data frame so I can use dplyr and ggplot2 for data manipulation and data visualization. I can use tidy() on the output of an stm model, and then I will get the probabilities that each word is generated from each topic.

I haven’t watched the video yet, but that’s on my to-do list for today.
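
In the meantime, here is a rough sketch of the tidy() workflow she describes, using the gadarian sample data that ships with stm; K and the other parameters are purely illustrative, not her settings.

    library(stm)
    library(tidytext)
    library(dplyr)
    library(ggplot2)

    # gadarian is a small open-ended survey data set bundled with stm
    processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
    out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
    fit <- stm(out$documents, out$vocab, K = 3, max.em.its = 25, verbose = FALSE)

    td_beta <- tidy(fit)   # per-topic word probabilities as a tidy data frame

    td_beta %>%
      group_by(topic) %>%
      top_n(10, beta) %>%
      ungroup() %>%
      ggplot(aes(reorder(term, beta), beta, fill = factor(topic))) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free_y") +
      coord_flip()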


Exploring The MNIST Dataset

David Robinson performs exploratory data analysis on the MNIST digit database:

The challenge is to classify a handwritten digit based on a 28-by-28 black and white image. MNIST is often credited as one of the first datasets to prove the effectiveness of neural networks.

In a series of posts, I’ll be training classifiers to recognize digits from images, while using data exploration and visualization to build our intuitions about why each method works or doesn’t. Like most of my posts I’ll be analyzing the data through tidy principles, particularly using the dplyr, tidyr and ggplot2 packages. In this first post we’ll focus on exploratory data analysis, to show how you can better understand your data before you start training classification algorithms or measuring accuracy. This will help when we’re choosing a model or transforming our features.

Read on for the analysis.
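
If you want to follow along, the reshaping step looks something like the sketch below. It assumes MNIST exported as a CSV with a label column plus pixel0 through pixel783 columns (the common Kaggle layout), which may differ from the file David uses.

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    mnist <- read.csv("mnist.csv")   # hypothetical path: label + pixel0..pixel783

    pixels <- mnist %>%
      head(12) %>%                                     # first twelve digits
      mutate(instance = row_number()) %>%
      pivot_longer(cols = starts_with("pixel"),
                   names_to = "pixel", values_to = "value") %>%
      mutate(pixel = as.integer(sub("pixel", "", pixel)),
             x = pixel %% 28,                          # column within the 28x28 grid
             y = 27 - pixel %/% 28)                    # flip so digits render upright

    ggplot(pixels, aes(x, y, fill = value)) +
      geom_tile() +
      facet_wrap(~ instance + label) +
      scale_fill_gradient(low = "white", high = "black") +
      theme_void() +
      theme(legend.position = "none")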


Markov Chains In Python

Sandipan Dey shows off various uses of Markov chains as well as how to create one in Python:

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and formed the basis of a lot of text processing (especially text generation) thanks to their relatively low computational requirements.
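
The linked post works in Python, but the text-generation trick is small enough to sketch in R as well: a first-order, word-level chain, purely for illustration.

    # Build a list of observed successors keyed by the current word.
    build_chain <- function(text) {
      words <- strsplit(tolower(text), "\\s+")[[1]]
      split(words[-1], words[-length(words)])
    }

    # Walk the chain, sampling a successor of the last word at each step.
    generate <- function(chain, start, n = 20) {
      out <- start
      for (i in seq_len(n)) {
        nexts <- chain[[tail(out, 1)]]
        if (is.null(nexts)) break          # dead end: no observed successor
        out <- c(out, sample(nexts, 1))
      }
      paste(out, collapse = " ")
    }

    chain <- build_chain("the cat sat on the mat and the dog sat on the rug")
    set.seed(1)
    generate(chain, "the")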


More DBA Salary Research

Ginger Grant digs into the DBA salary survey a bit further:

I have heard that if you want to make money, you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers.  According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact of telecommuting on the top 5%?  Well, it depends on whether you are looking at the much smaller female population. The majority of females in the top 5% telecommute.  Those who telecommute 100% of the time do very well, as do those who spend every day at a job site.  Males report working more hours and telecommuting less than females do. People in the average category do not telecommute, and 25% of them work less than 40 hours a week. If you look at the number of responses by country, you can see that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in those locations.

Another area of importance here is trying to normalize salaries for standard of living: it’s a lot easier to get a $100K/year job in Manhattan, NY than in Manhattan, KS, but $100K goes much further in the latter.  Based on my limited digging into the data set, it’d be tough to draw any conclusions on that front, but it is an a priori factor that I’d want to consider when dealing with salary survey data.
