Descriptive Statistics | The Chemical Statistician

A macro to execute PROC TTEST for multiple binary grouping variables in SAS (and sorting t-test statistics by their absolute values)

May 4, 2018 Leave a comment

In SAS, you can perform PROC TTEST for multiple numeric variables in the same procedure. Here is an example using the built-in data set SASHELP.BASEBALL; I will compare the number of at-bats and number of walks between the American League and the National League.

proc ttest
     data = sashelp.baseball;
     class League;
     var nAtBat nBB; 
     ods select ttests;
run;

Here are the resulting tables.

Method	Variances	DF	t Value	Pr > \|t\|
Pooled	Equal	320	2.05	0.0410
Satterthwaite	Unequal	313.66	2.06	0.04

Method	Variances	DF	t Value	Pr > \|t\|
Pooled	Equal	320	0.85	0.3940
Satterthwaite	Unequal	319.53	0.86	0.3884

What if you want to perform PROC TTEST for multiple grouping (a.k.a. classification) variables? You cannot put more than one variable in the CLASS statement, so you would have to run PROC TTEST separately for each binary grouping variable. If you do put LEAGUE and DIVISION in the same CLASS statement, here is the resulting log.

1303 proc ttest
1304 data = sashelp.baseball;
1305 class league division;
 --------
 22
 202
ERROR 22-322: Expecting ;.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
1306 var natbat;
1307 ods select ttests;
1308 run;

There is no syntax in PROC TTEST to use multiple grouping variables at the same time, so this tutorial provides a macro to do so. There are several nice features about my macro:

It allows you to use multiple grouping variables at the same time.
It sorts the t-test statistics by their absolute values within each grouping variable.
It shows the name of each continuous variable in the output table, unlike the above output.

Here is its basic skeleton.

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, SAS Programming, Statistics, Tutorials Tagged with applied statistics, data analysis, do loop, macro, proc ttest, SAS, sas macro, sashelp.baseball, statistics, Student's t-test, t-test

Store multiple strings of text as a macro variable in SAS with PROC SQL and the INTO statement

September 8, 2017 Leave a comment

I often need to work with many variables at a time in SAS, but I don’t like to type all of their names manually – not only is it messy to read, it also induces errors in transcription, even when copying and pasting. I recently learned of an elegant and efficient way to store multiple variable names into a macro variable that overcomes those problems. This technique uses the INTO statement in PROC SQL.

To illustrate how this storage method can be applied in a practical context, suppose that we want to determine the factors that contribute to a baseball player’s salary in the built-in SASHELP.BASEBALL data set. I will consider all continuous variables other than “Salary” and “logSalary”, but I don’t want to write them explicitly in any programming statements. To do this, I first obtain the variable names and types of a data set using PROC CONTENTS.

* create a data set of the variable names;
proc contents
     data = sashelp.baseball
          noprint
     out = bvars (keep = name type);
run;

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, SAS Programming, Statistics Tagged with applied statistics, correlation, correlation coefficient, data analysis, data manipulation, into statement, macro, macro variable, PROC SQL, programming, SAS, sas programming, SQL, statistics

Sorting correlation coefficients by their magnitudes in a SAS macro

March 21, 2017 Leave a comment

Theoretical Background

Many statisticians and data scientists use the correlation coefficient to study the relationship between 2 variables. For 2 random variables, $X$ and $Y$ , the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations. Algebraically, this can be expressed as

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$ .

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data. The most common estimator for $\rho$ is the Pearson correlation coefficient, which is defined as the sample covariance between $X$ and $Y$ divided by the product of their sample standard deviations. Since there is a common factor of

$\frac{1}{n - 1}$

in the numerator and the denominator, they cancel out each other, so the formula simplifies to

$r_P = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i - \bar{x})^2 \sum_{i = 1}^{n}(y_i - \bar{y})^2}}$ .

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model. You can do this by

computing the correlation coefficients
obtaining their absolute values
sorting them by their absolute values.

Read more of this post

Filed under Applied Statistics, Data Analysis, Descriptive Statistics, Mathematical Statistics, Predictive Modelling, SAS Programming, Statistics, Tutorials Tagged with correlation, macro, pearson correlation, pearson correlation coefficient, predictive modelling, PROC CORR, regression, regression modelling, SAS, sas macro

The advantages of using count() to get N-way frequency tables as data frames in R

February 12, 2015 5 Comments

Introduction

I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables in R. Several commenters provided alternative ways of doing so, and they are all appreciated. Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format – this will magnify the superiority of this function over other functions like table() and xtabs().

2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

To get a 2-way frequency table (i.e. a frequency table of the counts of a data set as divided by 2 categorical variables), you can display it in a cross-tabulated format or in a list format.

In R, the xtabs() function is good for cross-tabulation. Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.

> y = xtabs(~ cyl + gear, mtcars)
> y
          gear
 cyl      3     4     5
 4        1     8     2
 6        2     4     1
 8        12    0     2

Read more of this post

Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with count, cross-tabulation, data analysis, frequency table, R, R programming, statistics, table(), xtabs()

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

February 3, 2015 32 Comments

Introduction

One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.

I recently needed to get a frequency table of a categorical variable in R, and I wanted the output as a data table that I can access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in Base R that can easily generate this. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “gear” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.

> head(subset(mtcars, select = 'gear'))
                     gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3

Read more of this post

Filed under Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with categorical variable, class(), count, data frame, factor, frequency table, install.packages(0, mtcars, names(), plyr, R, R programming, subset, table()

Exploratory Data Analysis – All Blog Posts on The Chemical Statistician

December 11, 2014 Leave a comment

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples. Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis! Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

I will return to this blog post to add new links as I write more tutorials.

Useful R Functions for Exploring a Data Frame

The 5-Number Summary – Two Different Methods in R

Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

Quantile-Quantile Plots for New York’s Ozone Pollution Data

Kernel Density Estimation and Rug Plots in R on Ozone Data in New York and Ozonopolis

2 Ways of Plotting Empirical Cumulative Distribution Functions in R

Conceptual Foundations of Empirical Cumulative Distribution Functions

Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

Kernel Density Estimation – Conceptual Foundations

Variations of Box Plots in R for Ozone Concentrations in New York City and Ozonopolis

Computing Descriptive Statistics in R for Data on Ozone Pollution in New York City

How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

The advantages of using count() to get N-way frequency tables as data frames in R

Filed under Applied Statistics, Data Analysis, Data Visualization, Descriptive Statistics, R programming, Statistics Tagged with 5-number summary, applied statistics, box plot, data analysis, Data Visualization, ecdf(), empirical cumulative distribution function, exploratory data analysis, five-number summary, frequency table, histogram, kernel density estimation, kernel density plot, quantile, quantile-quantile plot, R, R programming, violin plot

Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

October 28, 2014 3 Comments

Introduction

A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable. If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class.

This sounds like a simple task, yet I took a surprisingly long time to learn how to do this in SAS and get exactly what I want – a new data with with each category as the identifier and the calculated sum/mean as the value of a second variable. Here is an example to show you how to do it using PROC MEANS.

Read more to see an example data set and get the SAS code to calculate the sum or mean of a continuous variable by a categorical variable!

Read more of this post

Filed under Data Analysis, Descriptive Statistics, SAS Programming, Statistics, Tutorials Tagged with average, by statement, class statement, data analysis, mean by group, mean(), nway, PROC MEANS, PROC PRINT, programming, SAS, sas programming, statistics, sum by group, sum(), summation

Applied Statistics Lesson of the Day – The Coefficient of Variation

August 12, 2014 3 Comments

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set. However, consider the following 2 hypothetical cases:

the standard deviation for the incomes of households in Canada is $2,000
the standard deviation for the incomes of the 5 major banks in Canada is $2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas $2,000 is not a lot of money for one of the “Big Five” banks. Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets. One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

$\text{Sample Coefficient of Variation (} \bar{c_v} \text{)} \ = \ s \ \div \ \bar{x},$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean.

Analogously, the coefficient of variation for a random variable is

$\text{Coefficient of Variation} \ (c_v) \ = \ \sigma \div \ \mu,$

where $\sigma$ is the random variable’s standard deviation and $\mu$ is the random variable’s expected value.

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes. I hope that all new statistics students get to learn this alternative measure of dispersion.

Filed under Applied Statistics, Descriptive Statistics, Statistics, Statistics Lesson of the Day Tagged with applied statistics, coefficient of variation, data, descriptive statistics, expected value, relative variability, sample mean, sample standard deviation, standard deviation, statistics

Mathematics and Applied Statistics Lesson of the Day – The Harmonic Mean

June 12, 2014 Leave a comment

The harmonic mean, H, for $n$ positive real numbers $x_1, x_2, ..., x_n$ is defined as

$H = n \div (1/x_1 + 1/x_2 + .. + 1/x_n) = n \div \sum_{i = 1}^{n}x_i^{-1}$ .

This type of mean is useful for measuring the average of rates. For example, consider a car travelling for 240 kilometres at 2 different speeds:

60 km/hr for 120 km
40 km/hr for another 120 km

Then its average speed for this trip is

$S_{avg} = 2 \div (1/60 + 1/40) = 48 \text{ km/hr}$

Notice that the speed for the 2 trips have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds. If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics. The average speed is defined as

$S_{avg} = \Delta \text{distance} \div \Delta \text{time}$

We already have the elapsed distance – it’s 240 km. Let’s find the time elapsed for this trip.

$\Delta \text{ time} = 120 \text{ km} \times (1 \text{ hr}/60 \text{ km}) + 120 \text{ km} \times (1 \text{ hr}/40 \text{ km})$

$\Delta \text{time} = 5 \text{ hours}$

Thus,

$S_{avg} = 240 \text{ km} \div 5 \text{ hours} = 48 \text { km/hr}$

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!

Filed under Applied Statistics, Descriptive Statistics, Mathematics, Statistics, Statistics Lesson of the Day Tagged with applied statistics, average, average speed, harmonic mean, kinematics, math, mathematics, physics, rate, statistics

Statistics Lesson and Warning of the Day – Confusion Between the Median and the Average

April 25, 2014 4 Comments

Yesterday, I attended an interesting seminar called “Transforming Healthcare through Big Data” at the Providence Health Care Research Institute‘s 2014 Research Day. The seminar was delivered by Martin Kohn from Jointly Health, and I enjoyed it overall. However, I noticed a glaring error about basic statistics that needs correction.

Martin wanted to highlight the overconfidence that many doctors have about their abilities, and he quoted Vinod Kohsla, the co-founder of Sun Microsystems, who said, “50% of doctors are below average.” Martin then presented a study showing an absurdly high percentage of doctors who think that they are “above average”. A Twitter conversation between attendees of a TED conference in San Francisco and Vinod himself confirms this quotation.

The statement “50% of doctors are below average” is wrong in general. By definition, 50% of any population is below the median, and the median is only equal to the average if the population is symmetric. (Examples of symmetric probability distributions are the normal distribution and the Student’s t-distribution.) Vinod meant to say that “50% of doctors are below the median”, and he confirmed this in the aforementioned Twitter conversation; I am disappointed that he justified this mistake by claiming that it would be less understood. I think that a TED audience would know what “median” means, and those who don’t can easily search for its meaning online or in books on their own.

In communicating truth, let’s use the correct vocabulary.

Filed under Descriptive Statistics, Statistics, Statistics Lesson of the Day Tagged with average, mean(), median, statistics

Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

April 10, 2014 22 Comments

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only. I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations. For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!

Read more of this post

Filed under Data Visualization, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with axis(), box plots, boxplots(), Data Visualization, legend(), patterned box plots, plotting, rect(), side-by-side box plots, statistics, title()

Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data

September 22, 2013 3 Comments

Introduction

Continuing my recent series on exploratory data analysis, today’s post focuses on quantile-quantile (Q-Q) plots, which are very useful plots for assessing how closely a data set fits a particular distribution. I will discuss how Q-Q plots are constructed and use Q-Q plots to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.

Previous posts in this series on EDA include

Learn how to create a quantile-quantile plot like this one with R code in the rest of this blog!

Read more of this post

Filed under Data Visualization, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with data, data analysis, descriptive statistics, exploratory data analysis, gamma distribution, normal distribution, q-q plot, qq plot, quantile, quantile-quantile plot, R, R programming, statistics, summary()

Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

August 19, 2013 6 Comments

Introduction

Data in R are often stored in data frames, because they can store multiple types of data. (In R, data frames are more general than matrices, because matrices can only store one type of data.) Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis. I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (character) and continuous (numeric) data, and that allows me to show different ways of exploring these 2 types of data.

If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!

This post continues a recent series on exploratory data analysis. Previous posts in this series include

Useful Functions for Exploring Data Frames

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns). The output is a vector.

> dim(InsectSprays)
[1] 72 2

Use nrow() and ncol() to get the number of rows and number of columns, respectively. You can get the same information by extracting the first and second element of the output vector from dim().

> nrow(InsectSprays) 
# same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays)
# same as dim(InsectSprays)[2]
[1] 2

Read more of this post

Filed under Descriptive Statistics, R programming, Statistics, Tutorials Tagged with 5-number summary, data, data analysis, data frame, descriptive statistics, dim(), five-number summary, head(), ncol(), nrow(), R, R programming, statistics, str(), summary statistics, summary(), tail()

Exploratory Data Analysis: The 5-Number Summary – Two Different Methods in R

August 12, 2013 6 Comments

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on 5-number summaries, which were previously mentioned in the post on descriptive statistics in this series. I will define and calculate the 5-number summary in 2 different ways that are commonly used in R. (It turns out that different methods arise from the lack of universal agreement among statisticians on how to calculate quantiles.) I will show that the fivenum() function uses a simpler and more interpretable method to calculate the 5-number summary than the summary() function. This post expands on a recent comment that I made to correct an error in the post on box plots.

> y = seq(1, 11, by = 2)
> y
[1]  1  3  5  7  9 11
> fivenum(y)
[1]  1  3  6  9 11
> summary(y)
     Min.   1st Qu.   Median    Mean     3rd Qu.    Max. 
     1.0     3.5       6.0       6.0      8.5       11.0

Why do these 2 methods of calculating the 5–number summary in R give different results? Read the rest of this post to find out the answer!

Previous posts in this series on EDA include

Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

July 29, 2013 9 Comments

Introduction

This is a follow-up post to my recent introduction of histograms. Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R. Today, I will examine this distribution in more detail by overlaying the histogram with parametric and non-parametric kernel density plots. I will finally answer the question that I have asked (and hinted to answer) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like this above plot!

This is another post in my continuing series on exploratory data analysis (EDA). Previous posts in this series on EDA include

Read more of this post

Filed under Applied Statistics, Data Visualization, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with curve(), data, data analysis, Data Visualization, density(), dgamma(), dnorm(), expectation, expected value, exploratory data analysis, gamma, gamma distribution, hist(), histograms, lines(), New York, normal, normal distribution, ozone, plot, plots, plotting, R, R programming, sample mean, sample variance, statistics, variance

Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

July 9, 2013 5 Comments

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set. I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R. In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.

Previous posts in this series on EDA include

Read the rest of this post to learn how to construct a histogram and get the R code for producing the above plot!

Read more of this post

Filed under Data Visualization, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with data, data analysis, Data Visualization, exploratory data analysis, hist(), histograms, New York, ozone, plot, plots, plotting, R, R programming, statistics

Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R

June 25, 2013 1 Comment

Introduction

Continuing my recent series on exploratory data analysis (EDA), and following up on the last post on the conceptual foundations of empirical cumulative distribution functions (CDFs), this post shows how to plot them in R. (Previous posts in this series on EDA include descriptive statistics, box plots, kernel density estimation, and violin plots.)

I will plot empirical CDFs in 2 ways:

using the built-in ecdf() and plot() functions in R
calculating and plotting the cumulative probabilities against the ordered data

Continuing from the previous posts in this series on EDA, I will use the “Ozone” data from the built-in “airquality” data set in R. Recall that this data set has missing values, and, just as before, this problem needs to be addressed when constructing plots of the empirical CDFs.

Recall the plot of the empirical CDF of random standard normal numbers in my earlier post on the conceptual foundations of empirical CDFs. That plot will be compared to the plots of the empirical CDFs of the ozone data to check if they came from a normal distribution.

Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

June 24, 2013 13 Comments

Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R. (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution. The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

Read to rest of this post to learn what an empirical CDF is and how to produce the above plot!

Read more of this post

Filed under Applied Statistics, Data Visualization, Descriptive Statistics, Mathematical Statistics, R programming, Statistics, Tutorials Tagged with cdf, consistency, convergence, cumulative distribution function, data, data analysis, ecdf(), empirical cdf, empirical cumulative distribution function, estimator, expected value, exploratory data analysis, normal distribution, plot, plots, plotting, R, R programming, standard normal distribution, statistics, unbiased estimator, uniform convergence, variance

Exploratory Data Analysis: Combining Box Plots and Kernel Density Plots into Violin Plots for Ozone Pollution Data

June 16, 2013 9 Comments

Introduction

Recently, I began a series on exploratory data analysis (EDA), and I have written about descriptive statistics, box plots, and kernel density plots so far. As previously mentioned in my post on box plots, there is a way to combine box plots and kernel density plots. This combination results in violin plots, and I will show how to create them in R today.

Continuing from my previous posts on EDA, I will use 2 univariate data sets. One is the “ozone” data vector that is part of the “airquality” data set that is built into R; this data set contains data on New York’s air pollution. The other is a simulated data set of ozone pollution in a fictitious city called “Ozonopolis”. It is important to remember that the ozone data from New York has missing values, and this has created complications that needed to be addressed in previous posts; missing values need to be addressed for violin plots, too, and in a different way than before.

The vioplot() command in the “vioplot” package creates violin plots; the plotting options in this function are different and less versatile than other plotting functions that I have used in R. Thus, I needed to be more creative with the plot(), title(), and axis() functions to create the plots that I want. Read the details carefully to understand and benefit fully from the code.

Read further to learn how to create these violin plots that combine box plots with kernel density plots! Be careful – the syntax is more complicated than usual!

Read more of this post

Filed under Applied Statistics, Data Visualization, Descriptive Statistics, R programming, Statistics Tagged with axis(), box plot, data, data analysis, exploratory data analysis, kernel density plot, library, New York, ozone, Ozonopolis, package, plot, plots, plotting, R, R programming, sm, statistics, title(), violin plot, vioplot()

Exploratory Data Analysis: Kernel Density Estimation – Conceptual Foundations

June 9, 2013 34 Comments

For the sake of brevity, this post has been created from the first half of a previous long post on kernel density estimation. This first half focuses on the conceptual foundations of kernel density estimation. The second half will focus on constructing kernel density plots and rug plots in R.

Introduction

Recently, I began a series on exploratory data analysis; so far, I have written about computing descriptive statistics and creating box plots in R for a univariate data set with missing values. Today, I will continue this series by introducing the underlying concepts of kernel density estimation, a useful non-parametric technique for visualizing the underlying distribution of a continuous variable. In the follow-up post, I will show how to construct kernel density estimates and plot them in R. I will also introduce rug plots and show how they can complement kernel density plots.

But first – read the rest of this post to learn the conceptual foundations of kernel density estimation.

Read more of this post

Filed under Applied Statistics, Data Visualization, Descriptive Statistics, R programming, Statistics, Tutorials Tagged with applied statistics, density plot, density(), dunif(), Gaussian distribution, kernel, kernel density estimate, kernel density estimation, kernel density plot, kernel function, legend(), lines(), normal distribution, pdf, plot, plots, plotting, probability density function, R, R programming, rug plot, rug(), set.seed(), statistics, summary(), triangular kernel, uniform distribution, uniform kernel

← Older posts

	Eric Cai - The Chemi… on Convert multiple variables bet…
	Jack on Convert multiple variables bet…
	Eric Cai - The Chemi… on Getting the names, types, form…
	Emily V on Getting the names, types, form…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Convert multiple variables bet…
	Lauren McClain on Convert multiple variables bet…
	Eric Cai - The Chemi… on Exploratory Data Analysis: Com…
	CK on Exploratory Data Analysis: Com…
	Eric Cai - The Chemi… on Video Tutorial: Breaking Down…

Theoretical Background

Introduction

2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format

Introduction

The Example Data Set – mtcars

Introduction

Introduction

Introduction

Useful Functions for Exploring Data Frames

Introduction

Introduction

Introduction

Introduction

Introduction

Introduction

Introduction

Eric’s Twitter Feed (@chemstateric)

Recent Comments

Popular Topics

Recent Posts

About Eric

Blogs and Web Sites That I Like to Read

Archives

Categories