Data Science – Page 17

Multivariate Histograms in R

Published 2023-09-19 by Kevin Feasel

Steven Sanderson wants multiple breakdowns:

Histograms are powerful tools for visualizing the distribution of a single variable, but what if you want to compare the distributions of two variables side by side? In this blog post, we’ll explore how to create a histogram of two variables in R, a popular programming language for data analysis and visualization.

We’ll cover various scenarios, from basic histograms to more advanced techniques, and explain the code step by step in simple terms. So, grab your favorite dataset or generate some random data, and let’s dive into the world of dual-variable histograms!

Click through for several techniques.

Comments closed

Initial Thoughts on the Microsoft Fabric Data Science Experience

Published 2023-09-14 by Kevin Feasel

Tori Tompkins shares some thoughts:

Fabric is Microsoft’s recently announced SaaS all-in-one analytics platform. It brings together Azure Data Factory, Azure Synapse Analytics and Power BI into a single cohesive platform without the overhead of setting up resources, maintenance, and configuration. Fabric wouldn’t be an end-to-end data analytics platform without data science, so in this blog we will explore the data science and machine learning capabilities of Microsoft Fabric and assess where the platform fits in the completive data science landscape.

Click through for Tori’s overview, where Fabric does a good job in its preview, and where it currently falls short.

Comments closed

Plotting SVM Decision Boundaries in R

Published 2023-09-12 by Kevin Feasel

Steven Sanderson goes right up to the edge:

Support Vector Machines (SVM) are a powerful tool in the world of machine learning and classification. They excel in finding the optimal decision boundary between different classes of data. However, understanding and visualizing these decision boundaries can be a bit tricky. In this blog post, we’ll explore how to plot an SVM object using the e1071 library in R, making it easier to grasp the magic happening under the hood.

Read on to see how you can perform this analysis as well.

Comments closed

Plotting a Subset of Data in R

Published 2023-09-11 by Kevin Feasel

Steven Sanderson doesn’t need all of those data points:

Data visualization is a powerful tool for gaining insights from your data. In R, you have a plethora of libraries and functions at your disposal to create stunning and informative plots. One common task is to plot a subset of your data, which allows you to focus on specific aspects or trends within your dataset. In this blog post, we’ll explore various techniques to plot subsets of data in R, and I’ll explain each step in simple terms. Don’t worry if you’re new to R – by the end of this post, you’ll be equipped to create customized plots with ease!

Click through for several techniques for subsetting data, as well as reasons why you might want to do it.

Comments closed

Statistical Tests in R

Published 2023-09-11 by Kevin Feasel

Adrian Tam tries out a couple of tests:

R as a data analytics platform is expected to have a lot of support for various statistical tests. In this post, you are going to see how you can run statistical tests using the built-in functions in R. Specifically, you are going to learn:

What is t-test and how to do it in R

What is F-test and how to do it in R

This is one of the things that R does best among any language: statistical testing. R has support for an enormous number of statistical functions, either built into the base language or available as packages.

Comments closed

Finding Omitted Variables in Logistic Regression

Published 2023-09-08 by Kevin Feasel

John Mount picks up on a prior post:

For this note, let’s work out how to directly try and overcome the omitted variable bias by solving for the hidden or unobserved detailed data. We will work our example in R. We will derive some deep results out of a simple set-up. We show how to “un-marginalize” or “un-summarize” data.

This is an interesting dive into a common problem, and something which we can easily work around in linear regression, but not in logistic regression.

Comments closed

Random Number Generation in R

Published 2023-09-06 by Kevin Feasel

Adrian Tam rolls the dice:

Whether working on a machine learning project, a simulation, or other models, you need to generate random numbers in your code. R as a programming language, has several functions for random number generation. In this post, you will learn about them and see how they can be used in a larger program. Specifically, you will learn

How to generate Gaussian random numbers into a vector

How to generate uniform random numbers

How to manipulate random vectors and random matrices

And, of course, these are pseudo-random numbers because we’re still dealing with computers and random seeds, after all.

Comments closed

Visualizing Univariate Data Distributions in R

Published 2023-08-25 by Kevin Feasel

Steven Sanderson reviews the shape of the data:

Understanding the distribution of your data is a fundamental step in any data analysis process. It gives you insights into the spread, central tendency, and overall shape of your data. In this blog post, we’ll explore two popular functions in R for visualizing data distribution: density() and hist(). We’ll use the classic Iris dataset for our examples. Additionally, we will introduce the {TidyDensity} library and show how it can be used to create distribution plots.

Click through for three different functions for visualizing the density of a variable.

Comments closed

Adding Mean to Box Plots in R

Published 2023-08-24 by Kevin Feasel

Steven Sanderson tracks the sixth number of a five-number summary:

Data visualization is a powerful tool for understanding and interpreting data. In this blog post, we will explore how to create box plots with mean values using both base R and ggplot2. We will use the famous iris dataset as an example. So, grab your coding tools and let’s dive into the world of box plots!

Note that this is mean in addition to median in these visuals, not replacing the median.

Comments closed

Omitted Variables and Logistic Regression

Published 2023-08-21 by Kevin Feasel

John Mount misses a variable:

I would like to illustrate a way which omitted variables interfere in logistic regression inference (or coefficient estimation). These effects are different than what is seen in linear regression, and possibly different than some expectations or intuitions.

This is an interesting article and there’s a really good comment helping to explain this effect in epidemiology.

Comments closed

Category: Data Science