Category: Data Science

Useful R Packages for Data Scientists

Tomaz Kastrun has a nice collection of useful R packages for data scientists:

There are thousands of R packages available on CRAN (with all its mirror sites), on GitHub, and in individual developers' repositories.

Many useful functions are available across many different R packages, and many of the same functionalities appear in more than one package, so it all boils down to user preference and workflow which package one decides to use. From the perspective of a statistician and data scientist, I will cover the essential and major packages in sections. By no means is this a definitive list; it is only a personal preference.

Click through for Tomaz’s recommendations.

Sentiment Analysis with Power BI

Teo Lachev takes us through two options available for sentiment analysis with Power BI:

Cognitive Services is an Azure PaaS cloud service that supports text analytics and image recognition. It's automatically included in Power BI Premium or Embedded capacities (make sure that AI workloads are enabled in the capacity settings). If your organization doesn't have Power BI Premium or Embedded, you can provision Cognitive Services in Azure (requires an Azure subscription) and then write a custom Power Query function to invoke its APIs, as demonstrated by this tutorial. If you provision Cognitive Services outside Power BI Premium, you'll be charged per transaction. In the case of Power BI, the number of transactions equates to the number of rows in your table. So, if you refresh a table with 1,000 rows five times and calculate the sentiment polarity score for each row, you'll be charged for 5,000 transactions.

Read on for the full report on each option.
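If you want to prototype the Cognitive Services call before wiring it into a Power Query function, here is a minimal R sketch using httr against the Text Analytics v3.0 sentiment endpoint. The resource host, key variable, and sample documents are placeholders of my own, not anything from Teo's post.

# Minimal sketch: score sentiment for a couple of documents via Text Analytics v3.0.
# Replace the resource host and key with your own Cognitive Services details.
library(httr)
library(jsonlite)

endpoint <- "https://<your-resource>.cognitiveservices.azure.com/text/analytics/v3.0/sentiment"
key      <- Sys.getenv("COGNITIVE_SERVICES_KEY")

docs <- list(documents = list(
  list(id = "1", language = "en", text = "The service was fast and friendly."),
  list(id = "2", language = "en", text = "I waited an hour and nobody helped.")
))

resp   <- POST(endpoint, add_headers(`Ocp-Apim-Subscription-Key` = key),
               body = docs, encode = "json")
result <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
result$documents[, c("id", "sentiment")]   # each scored document counts as one billed transaction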

Automated ML and Data Scientists

Sophia Rowland takes us through an experiment:

Ever since automated machine learning has entered the scene, people are asking, “Will automated machine learning replace data scientists?” I personally don’t think we need to be worried about losing our jobs any time soon. Automated machine learning is great at efficiently trying a lot of different options and can save a data scientist hours of work. The caveat is that automated machine learning cannot replace all tasks. Automated machine learning does not understand context as well as a human being. This means that it may not know to create specific features that are the norm for certain tasks. Also, automated machine learning may not know when it has created things that are irrelevant or unhelpful.

To strengthen my points, I created a competition between myself and two SAS Viya tools for automated machine learning. To show that we are really better together, I combined my work with automated machine learning and compared all approaches. Before we dive into the results, let’s discuss the task.

The results are in line with my expectations: a good automated ML tool will make life easier, but doesn’t replace the expert system of a human.

Time Series Forecasting Best Practices

David Smith talks about a new GitHub repo:

The repository includes detailed examples of various time series modeling techniques, as Jupyter Notebooks for Python, and R Markdown documents for R. It also includes Python notebooks to fit time series models in the Azure Machine Learning service, and then operationalize the forecasts as a web service.

The R examples demonstrate several techniques for forecasting time series, specifically data on refrigerated orange juice sales from 83 stores (sourced from the bayesm package). The forecasting techniques vary (mean forecasting with interpolation, ARIMA, exponential smoothing, and additive models), but all make extensive use of the tidyverts suite of packages, which provides "tidy time series forecasting for R". The forecasting methods themselves are explained in detail in the book (readable online) Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (Monash University).

This looks really cool.
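To give a flavor of the tidyverts workflow the repository builds on (this is my own minimal sketch on a built-in dataset, not code from the repo), fitting ARIMA and exponential smoothing models and forecasting ahead looks roughly like this:

# Minimal tidyverts sketch: fit ARIMA and ETS models on a monthly series, then forecast a year ahead.
library(tsibble)
library(fable)
library(dplyr)

fit <- as_tsibble(USAccDeaths) %>%      # coerce a base ts object to a tsibble (columns: index, value)
  model(arima = ARIMA(value),           # automatically selected ARIMA model
        ets   = ETS(value))             # exponential smoothing state space model

fc <- forecast(fit, h = "1 year")       # 12-month-ahead forecasts for each model
fc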

Tidy Simulation of Stochastic Processes in R

David Robinson shows off my favorite distribution:

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of "an event is equally likely to happen at any moment." It's named because the number of events occurring in a time interval of length T is distributed according to Pois(λT), for some rate parameter λ (for this puzzle, the rate is described as one per day, λ = 1).

How can we simulate a Poisson process? This relies on an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().

Read on to learn about the Poisson distribution and Yule processes.
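To make that connection concrete, here is a minimal sketch (mine, not from the post) that simulates event times of a rate-1 Poisson process over one day by accumulating exponential waiting times drawn with rexp():

# Simulate a Poisson process with a rate of 1 event per day over a single day.
# Waiting times between events are Exponential(rate), so cumulative sums of rexp() draws give event times.
set.seed(42)

simulate_poisson_process <- function(rate, t_max) {
  n_draws <- ceiling(10 * rate * t_max) + 10      # draw more waiting times than we expect to need
  times   <- cumsum(rexp(n_draws, rate = rate))   # cumulative waiting times are the event times
  times[times <= t_max]                           # keep only events inside the window
}

events <- simulate_poisson_process(rate = 1, t_max = 1)
length(events)   # across many simulations, this count is distributed Pois(rate * t_max)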

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

DASK is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as Numpy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.

DASK uses Python natively, distinguishing it from Spark, which runs on the JVM and carries the overhead of Java virtual machines and of context switching between Python and Java. It is also much harder to debug Spark errors than to read a Python stack trace that comes from DASK.

We will run XGBoost on DASK to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.

Generating Random Numbers with R

The folks at Data Sharkie walk us through random number generation in R:

Why is random numbers generation important and where is it used?

Random number generation has applications in various fields like statistical sampling, simulation, test design, and so on. Generally, when a data scientist is in need of a set of random numbers, they will have in mind …

The R programming language allows users to generate randomly distributed numbers with a set of built-in functions: runif(), rnorm(), and rbinom().

Read on to generate random numbers across two separate distributions.
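As a quick illustration of those three built-in generators (a minimal sketch of my own, with arbitrary parameters):

# Draw from the uniform, normal, and binomial distributions with base R's generators.
set.seed(123)                       # fix the seed so the draws are reproducible
runif(5, min = 0, max = 10)         # 5 uniform values between 0 and 10
rnorm(5, mean = 50, sd = 5)         # 5 normal values with mean 50 and standard deviation 5
rbinom(5, size = 10, prob = 0.5)    # 5 binomial counts of successes out of 10 trials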

Handling Missing Data

Marina Wyss explains various techniques for handling missing data in data sets:

Missing or incomplete data can have a huge negative impact on any data science project. This is particularly relevant for companies in the early stages of developing solid data collection and management systems.

While the best solution for missing data is to avoid it in the first place by developing good data-collection and stewardship policies, often we have to make do with what's available.

This blog covers the different kinds of missing data, and what we can do about missing data once we know what we're dealing with. These strategies range from simple – for example, choosing models that handle missing values automatically, or simply deleting problematic observations – to (probably superior) methods for estimating what those missing values may be, otherwise known as imputation.

I like the distinctions Marina draws between the forms of missing data, and we also get a good set of techniques for filling the gaps.
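To make the simple end of that spectrum concrete, here is a small base-R sketch (my own toy example, not from the post) contrasting listwise deletion with naive mean imputation:

# Toy data frame with missing values in both columns.
df <- data.frame(x = c(1, 2, NA, 4, 5),
                 y = c(10, NA, 30, 40, 50))

# Listwise deletion: keep only fully observed rows.
deleted <- df[complete.cases(df), ]

# Naive mean imputation: fill each NA with its column's observed mean.
imputed <- df
imputed$x[is.na(imputed$x)] <- mean(df$x, na.rm = TRUE)
imputed$y[is.na(imputed$y)] <- mean(df$y, na.rm = TRUE)

nrow(deleted)   # only 3 of the 5 rows survive deletion
imputed         # all 5 rows kept, with gaps filled by column means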
