Press "Enter" to skip to content

Category: Data Science

Avoiding Overfitting and Underfitting in Neural Networks

Manas Narkar provides some advice on optimizing neural network models:

Adding Dropout

Dropout is considered as one of the most effective regularization methods. Dropout is basically randomly zero-ing or dropping out features from your layer during the training process, or introducing some noise in the samples. The key thing to note is that this is only applied at training time. At test time, no values are dropped out. Instead, they are scaled. The typical dropout rate is between 0.2 to 0.5.

Click through for a demo on dropout, as well as coverage of several other techniques.

Comments closed

Fun with Randomness

Holger von Jouanne-Diedrich takes us through some random thoughts:

Many a student thinks that the first pic was created by some underlying pattern (because of its points clumping together in some areas while leaving others empty) and that the second one is “more” random. The truth is that technically both are not random (but only pseudo-random) but the first resembles “true” randomness more closely while the second is a low-discrepancy sequence.

While coming to the point of pseudo-randomness in a moment “true” randomness may appear to have a tendency to occur in clusters or clumps (technically called Poisson clumping). This is the effect seen (or shall I say heard) in the iPod shuffling function. Apple changed it to a more regular behaviour (in the spirit of the second picture)… which was then perceived to be more random (as with my students)!

Read the whole thing.

Comments closed

Text Mining and Sentiment Analysis in R

Sanil Mhatre walks us through a sentiment analysis scenario in R:

Sentiments can be classified as positive, neutral or negative. They can also be represented on a numeric scale, to better express the degree of positive or negative strength of the sentiment contained in a body of text.

This example uses the Syuzhet package for generating sentiment scores, which has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford. The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method. The selected method determines which of the four available sentiment extraction methods will be used. The four methods are syuzhet (this is the default), bingafinn and nrc. Each method uses a different scale and hence returns slightly different results. Please note the outcome of nrc method is more than just a numeric score, requires additional interpretations and is out of scope for this article. The descriptions of the get_sentiment function has been sourced from : https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

Comments closed

Choosing a Recommender

Tori Tompkins walks us through four methods for generating models for recommendation systems:

One of the most popular approaches to recommendation problems is Collaborative Filtering (CF).

This approach is based on the assumption that users who have agreed in the past, also tend to agree in the future. For example, if Alice likes Star Wars, Lord of the Rings and Harry Potter and Bob likes Stars Wars and Lord of the Rings too. Then it is likely that Bob would also enjoy Harry Potter. The collaborative in Collaborative Filtering is in relation to looking at the way that you multiple users interact with the same data and share the same commonalities.

Read on for a comparison of these four systems as well as a handy chart to help you figure out which system might work best for you.

Comments closed

The Siren Song of High Accuracy

Holger von Jouanne-Diedrich notes that accuracy is not in itself necessarily a good thing for a machine learning model:

In one of my most popular posts So, what is AI really? I showed that Artificial Intelligence (AI) basically boils down to autonomously learned rules, i.e. conditional statements or simply, conditionals.

In this post, I create the simplest possible classifier, called ZeroR, to show that even this classifier can achieve surprisingly high values for accuracy (i.e. the ratio of correctly predicted instances)… and why this is not necessarily a good thing, so read on!

The nuanced answer here is that with classifiers, accuracy is not in itself a great measure in the case of class imbalance. The more balanced your classes are, the more likely it is that a model with high accuracy is a good model. That’s where other measures such as specificity and sensitivity, positive & negative predictive value, etc. come into play.

Comments closed

Useful R Packages for Data Scientists

Tomaz Kastrun has a nice collection of useful R packages for data scientists:

Among thousand of R packages available on CRAN (with all the  mirror sites) or Github and any developer’s repository.

Many useful functions are available in many different R packages, many of the same functionalities also in different packages, so it all boils down to user preferences and work, that one decides to use particular package. From the perspective of a statistician and data scientist, I will cover the essential and major packages in sections. And by no means, this is not a definite list, and only a personal preference.

Click through for Tomaz’s recommendations.

Comments closed

Sentiment Analysis with Power BI

Teo Lachev takes us through two options available for sentiment analysis with Power BI:

Cognitive Services is an Azure PaaS cloud service that supports text analytics and image recognition. It’s automatically included in Power BI Premium or Embedded capacities (make sure that AI workloads are enabled in the capacity settings). If you organization doesn’t have Power BI Premium or Embedded, you can provision Cognitive Services in Azure (requires an Azure subscription) and then write a custom Power Query function to invoke its APIs, as demonstrated by this tutorial. If you provision Cognitive Services outside Power BI Premium,  you’ll be charged per transaction. In the case of Power BI, the number of transactions equates to the number of rows in your table. So, if you refresh five times a table with 1,000 rows and calculate the sentiment polarity score for each row, you’ll be charged for 5,000 transactions.

Read on for the full report on each option.

Comments closed

Automated ML and Data Scientists

Sophia Rowland takes us through an experiment:

Ever since automated machine learning has entered the scene, people are asking, “Will automated machine learning replace data scientists?” I personally don’t think we need to be worried about losing our jobs any time soon. Automated machine learning is great at efficiently trying a lot of different options and can save a data scientist hours of work. The caveat is that automated machine learning cannot replace all tasks. Automated machine learning does not understand context as well as a human being. This means that it may not know to create specific features that are the norm for certain tasks. Also, automated machine learning may not know when it has created things that are irrelevant or unhelpful.

To strengthen my points, I created a competition between myself and two SAS Viya tools for automated machine learning. To show that we are really better together, I combined my work with automated machine learning and compared all approaches. Before we dive into the results, let’s discuss the task.

The results are in line with my expectations: a good automated ML tool will make life easier, but doesn’t replace the expert system of a human.

Comments closed

Time Series Forecasting Best Practices

David Smith talks about a new GitHub repo:

The repository includes detailed examples of various time series modeling techniques, as Jupyter Notebooks for Python, and R Markdown documents for R. It also includes Python notebooks to fit time series models in the Azure Machine Learning service, and then operationalize the forecasts as a web service.

The R examples demonstrate several techniques for forecasting time series, specifically data on refrigerated orange juice sales from 83 stores (sourced from the the bayesm package). The forecasting techniques vary (mean forecasting with interpolation, ARIMA, exponential smoothing, and additive models), but all make extensive use of the tidyverts suite of packages, which provides “tidy time series forecasting for R“. The forecasting methods themselves are explained in detail in the book (readable online) Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (Monash University).

This looks really cool.

Comments closed