Monday, January 4, 2010

Concepts from Mathematical Statistics

A. Concepts from Mathematical Statistics

Probability Density Functions

Random Variables

A random variable takes on values that have specific probabilities of occurring (Nicholson, 2002). An example of a random variable would be the number of car accidents per year among sixteen year olds.

If we know how random variables are distributed in a population, then we may have an idea of how rare an observation may be.

Example: How often sixteen year olds are involved in auto accidents in a year’s time.

This information is then useful for making inferences (or drawing conclusions) about the population from sample data.

Example: We could look at a sample of data consisting of 1,000 sixteen year olds in the Midwest and make inferences or draw conclusions about the population consisting of all sixteen year olds in the Midwest.

In summary, it is important to be able to specify how a random variable is distributed. It enables us to gauge how rare an observation (or sample) is and then gives us grounds to make predictions, or inferences, about the population.

Random variables can be discrete, that is, observed in whole units, as in the counting numbers 1, 2, 3, 4, etc. Random variables may also be continuous, in which case they can take on any of the infinitely many values in an interval. An example would be crop yields, which can be measured in bushels down to a fraction or decimal.

The distributions of discrete random variables can be presented in tabular form or with histograms. Probability is represented by the area of a ‘rectangle’ in a histogram (Billingsly, 1993).

Distributions for continuous random variables cannot be represented in tabular format due to their characteristic of taking on an infinite number of values. They are better represented by a smooth curve defined by a function (Billingsly, 1993). This function is referred to as a probability density function or p.d.f.

The p.d.f. f(x) gives the probability, f(x) dx, that a random variable ‘X’ takes on values in a narrow interval ‘dx’ (Nicholson, 2002). This probability is equivalent to the area under the p.d.f. curve over that interval. This area can be described by the cumulative distribution function, or c.d.f. The c.d.f. gives the value of an integral involving the p.d.f.

Let f(x) be a p.d.f.

$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$

This can be interpreted to mean that the probability that X is between the values of ‘a’ and ‘b’ is given by the integral of the p.d.f. from ‘a’ to ‘b.’ This value can be given by the c.d.f., $F(x) = P(X \le x)$, evaluated at the two endpoints. Those familiar with calculus know that F(x) is an anti-derivative of f(x).

Common p.d.f’s

Most students are familiar with using tables in the back of textbooks for normal, chi-square, t, and F distributions. These tables are generated by the p.d.f’s for these particular distributions. For example, if we make the assumption that the random variable X is normally distributed then its p.d.f. is specified as follows:

$f(x) = (2\pi\sigma^2)^{-1/2} \, e^{-\frac{1}{2}(x - \mu)^2 / \sigma^2}$

where $X \sim N(\mu, \sigma^2)$

In the beginning of this section I stated that it was important to be able to specify how a random variable is distributed. Through experience, statisticians have found that they are justified in modeling many random variables with these p.d.f.’s. Therefore, in many cases one can be justified in using one of these p.d.f.’s to determine how rare a sample observation is, and then to make inferences about the population. This is in fact what takes place when you look up values from the tables in the back of statistics textbooks.
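As a rough illustration of how such a table entry comes from a p.d.f., the sketch below integrates the standard normal p.d.f. numerically and compares the area with the difference of c.d.f. values. The function names and the trapezoid rule are my own choices for illustration, not taken from any of the cited texts.

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Normal p.d.f. with mean mu and variance sigma2."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def normal_cdf(x, mu=0.0, sigma2=1.0):
    """Normal c.d.f. F(x) = P(X <= x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * sigma2)))

def integrate(f, a, b, steps=10_000):
    """Trapezoid-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Area under the standard normal p.d.f. between -1.96 and 1.96,
# computed two ways: by integrating f(x) and as F(b) - F(a).
area = integrate(normal_pdf, -1.96, 1.96)
cdf_difference = normal_cdf(1.96) - normal_cdf(-1.96)
print(round(area, 4), round(cdf_difference, 4))
```

Both numbers should agree closely with the familiar table value of about .95.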

Mathematical Expectation

The expected value E(X) for a discrete random variable can be defined as ‘the sum of products of each observation Xi and the probability of observing that particular value of Xi’ (Billingsly, 1993). The expected value of an observation is the mean of a distribution of observations. It can be thought of conceptually as an average or weighted mean.

Example: given the p.d.f. for the discrete random variable X in tabular format.

Xi:          1     2     3
P(X = xi):   .25   .50   .25

$E(X) = \sum_i x_i P(X = x_i) = 1(.25) + 2(.50) + 3(.25) = 2.0$
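The same arithmetic can be sketched in a few lines of Python. The variable names are only for illustration.

```python
# Values and probabilities from the table above.
values = [1, 2, 3]
probs = [0.25, 0.50, 0.25]

# The probabilities of a discrete distribution must sum to one.
assert abs(sum(probs) - 1.0) < 1e-12

# E(X) is the probability-weighted sum of the values.
expected_value = sum(x * p for x, p in zip(values, probs))
print(expected_value)  # 2.0
```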

Conceptually, the expected value of a random variable can be viewed as a point of balance or center of gravity. In reality, the actual value of an observed random variable will likely be larger or smaller than the ‘expected value.’ However, due to the nature of the expected value, actual observations will have a distribution that is balanced around the expected value. For this reason the expected value can be viewed as the balancing point for a distribution, or the center of gravity (Billingsly, 1993).

That is to say that population values cluster or gravitate around the expected value. Most of the population values are expected to be found within a small interval (i.e. measured in standard deviations) about the population’s expected value or mean. Hence the expected value gives a hint about how rare a sample may be. Values lying near a population mean should not be rare but quite common.

The expected value for a continuous random variable must be calculated using integral calculus.

E(X) = ∫ x f(x) dx

If x is an observed value and f(x) is the p.d.f. for x, then the product x f(x) dx is the continuous version of the discrete case Xi P(x =xi) and integration is the continuous version of summation.


Variance

As I mentioned previously, actual observations will often depart from the expected value. Variance quantifies the degree to which observations in a distribution depart from an expected value. It is a mean of squared deviations between an observation and an expected value/mean (Billingsly, 1993).

In the discrete case we have the following mathematical description:

$\sum_i (x_i - \mu)^2 P(X = x_i) = E[(X - \mu)^2] = \sigma^2$

In the continuous case:

$\int (x - \mu)^2 f(x)\,dx = E[(X - \mu)^2] = \sigma^2$

Given the mean ($\mu$), or expected value of a random variable, one knows the value a random variable is likely to assume on average. The variance ($\sigma^2$) indicates how close these observations are likely to be to the mean or expected value on average.
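To make the discrete case concrete, here is a small sketch that computes the variance for the three-value table used in the expected value example. The variable names are my own.

```python
import math

# Same discrete distribution as in the expected value example.
values = [1, 2, 3]
probs = [0.25, 0.50, 0.25]

# Mean: probability-weighted sum of the values.
mu = sum(x * p for x, p in zip(values, probs))

# Variance: probability-weighted sum of squared deviations from the mean.
variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
std_dev = math.sqrt(variance)

print(mu, variance)  # 2.0 0.5
```

The standard deviation, the square root of the variance, puts the spread back in the same units as X.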

Sample Estimates

Given knowledge of the population mean and variance, one can characterize the population distribution for a random variable. As I mentioned at the beginning of the previous section, it is important to be able to specify how a random variable is distributed. It enables us to gauge how rare an observation (or sample) is and then gives us grounds to make predictions, or inferences, about the population.

It is not always the case that we have access to all of the data from a population necessary for determining population parameters like the mean and variance. In this case we must estimate these parameters using sample data to compute estimators or statistics.

Estimators or statistics are mathematical functions of sample data. The approach most students are familiar with in computing sample statistics is to compute the sample mean ($\bar{X}$) and sample variance ($s^2$) from sample data to estimate the population mean ($\mu$) and variance ($\sigma^2$). This is referred to as the analogy principle, which involves using a corresponding sample feature to estimate a population feature (Bollinger, 2002).
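A minimal sketch of these two estimators, using Python's standard `statistics` module. The eight data values are made up purely for illustration.

```python
import statistics

# Hypothetical sample of n = 8 observations (invented for this example).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample mean: estimates the population mean.
x_bar = statistics.mean(sample)

# Sample variance with the n - 1 divisor: estimates the population variance.
s2 = statistics.variance(sample)

print(x_bar, s2)
```

Here the squared deviations from the mean sum to 32, so the sample variance is 32 / 7.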

Properties of Estimators

A question seldom answered in many undergraduate statistics or research methods courses is how we justify computing a sample mean from a small sample of observations and then using it to make inferences about the population.

I have discussed the fact that most of the values in a population can be expected to be found within a small interval about the population mean or expected value. The question that remains is whether we can also expect most of the values in a population to be found within a small interval about a sample mean. This must be true if we are to make inferences about the population using sample data and estimators like the sample mean.

Just like random variables, estimators or statistics have distributions. The distributions of estimators/statistics are referred to as sampling distributions. If the sampling distribution of a statistic is similar to that of the population then it may be useful to use that statistic to make inferences about the population. In that case sample means and variances may be good estimators for population means and variances.

Fortunately statisticians have developed criteria for evaluating ‘estimators’ like the sample mean and variance. There are four properties that characterize good estimators.

Unbiased Estimators

If ^ is an estimator for the population parameter , and if
E (^ then ^ is an unbiased estimator of the population parameter . This implies that the distribution of the sample statistic/estimator is centered around the population parameter ( Bollinger, 2002).
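Unbiasedness can be checked informally by simulation. The sketch below assumes a normal population with mean 10 and variance 4 (values I picked for illustration), draws many small samples, and averages the estimates. It also includes the variance estimator with divisor n, which is biased, for contrast.

```python
import random
import statistics

random.seed(0)              # fixed seed so the run is repeatable
mu, sigma = 10.0, 2.0       # assumed population mean and standard deviation
n, reps = 5, 10_000         # sample size and number of simulated samples

means, s2_unbiased, s2_biased = [], [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    s2_unbiased.append(statistics.variance(sample))   # divisor n - 1
    s2_biased.append(statistics.pvariance(sample))    # divisor n

# Averages over many samples approximate the expected value of each estimator.
avg_mean = statistics.fmean(means)
avg_s2_unbiased = statistics.fmean(s2_unbiased)
avg_s2_biased = statistics.fmean(s2_biased)
print(avg_mean, avg_s2_unbiased, avg_s2_biased)
```

Under these assumptions the first two averages should land near 10 and 4, while the divisor-n estimator should average near 4(n - 1)/n = 3.2.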


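Consistent Estimators

$\hat{\theta}$ is a consistent estimator of $\theta$ if, for every c > 0,

$\lim_{n \to \infty} \Pr[\,|\hat{\theta} - \theta| < c\,] = 1$

This implies that as you add more and more data, the probability that $\hat{\theta}$ lies within any fixed distance c of $\theta$ approaches one. This is the case, for example, when $\hat{\theta}$ is unbiased and its variance approaches zero as n approaches infinity. It can be said that $\hat{\theta} \xrightarrow{p} \theta$, or that $\hat{\theta}$ converges in probability to $\theta$ (Bollinger, 2002).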
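A rough simulation can show this probability rising with n for the sample mean. The standard normal population, the tolerance c = 0.1, and the function name are all assumptions of mine for this sketch.

```python
import random
import statistics

random.seed(1)
mu, sigma = 0.0, 1.0   # assumed population mean and standard deviation
c = 0.1                # tolerance around the true mean
reps = 1000            # simulated samples per sample size

def prob_within(n):
    """Estimate Pr[|x_bar - mu| < c] for samples of size n by simulation."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        if abs(statistics.fmean(sample) - mu) < c:
            hits += 1
    return hits / reps

# The estimated probability should climb toward one as n grows.
results = {n: prob_within(n) for n in (10, 100, 1000)}
for n, p in results.items():
    print(n, p)
```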


Efficient Estimators

Efficiency is based on the variance of the sample statistic/estimator. Given the estimator $\hat{\theta}$ and an alternative estimator $\tilde{\theta}$, $\hat{\theta}$ is more efficient given that

$\mathrm{Var}(\hat{\theta}) < \mathrm{Var}(\tilde{\theta})$

Mean Squared Error is a method of quantifying efficiency.

$MSE = E[(\theta - \hat{\theta})^2] = V(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2$ = variance + bias squared.

It can then be concluded that for an unbiased estimator the measure of efficiency reduces to the variance (Bollinger, 2002).
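The decomposition of mean squared error into variance plus squared bias can be verified numerically. This sketch uses the divisor-n variance estimator as a deliberately biased example, with a population variance of 4 that I assumed for illustration.

```python
import random
import statistics

random.seed(2)
sigma2 = 4.0            # assumed true population variance (the parameter)
n, reps = 5, 10_000

# Divisor-n variance estimates from many simulated normal samples.
estimates = [
    statistics.pvariance([random.gauss(0.0, 2.0) for _ in range(n)])
    for _ in range(reps)
]

# Mean squared error of the estimator around the true parameter.
mse = statistics.fmean([(e - sigma2) ** 2 for e in estimates])

# Variance of the estimates and bias of their average.
variance = statistics.pvariance(estimates)
bias = statistics.fmean(estimates) - sigma2

print(mse, variance + bias ** 2)
```

The two printed numbers should match up to rounding error, and the bias should come out negative because dividing by n understates the variance on average.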


Robust Estimators

Robustness is determined by how sensitive the other properties (i.e. unbiasedness, consistency, efficiency) are to the assumptions made (Bollinger, 2002).

When sample statistics or estimators exhibit these four properties, statisticians feel that they can rely on these computations to estimate population parameters. It can be shown mathematically that the formulas used in computing the sample mean and variance meet these criteria.

Confidence Intervals

Confidence intervals are based on the sampling distributions of a sample statistic/estimator. Confidence intervals based on these distributions tell us what values an estimator (e.g. the sample mean) is likely to take, and how likely or rare the value is (DeGroot, 2002). Confidence intervals are the basis for hypothesis testing.

Theoretical Confidence Interval

If we assume that our sample data is distributed normally, $X_i \sim N(\mu, \sigma^2)$, then it can be shown that the statistic

$Z = (\bar{X} - \mu) / (\sigma^2 / n)^{1/2} \sim N(0, 1)$

follows the standard normal distribution (Billingsly, 1993).

Given the probabilities represented by the standard normal distribution, it can be shown that

Pr ( -1.96 <= Z <= 1.96) = .95

The value 1.96 is the quantile of the standard normal distribution such that there is only a .025 or 2.5% chance that we will find a Z value greater than 1.96. Conversely, there is only a 2.5% chance of finding a computed Z value less than -1.96 (Steele, 1997).

As a matter of algebra it can be shown that

$\Pr(\bar{X} - 1.96\,(\sigma^2 / n)^{1/2} \le \mu \le \bar{X} + 1.96\,(\sigma^2 / n)^{1/2}) = .95$

(Goldberger, 1991).

This implies that in 95% of repeated samples, an interval extending 1.96 standard errors, $(\sigma^2 / n)^{1/2}$, on either side of the sample mean will contain the population mean. The interval above then represents a 95% confidence interval (Bollinger, 2002).
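The repeated-sampling reading of the interval can be illustrated with a short simulation. The population values (mean 50, variance 9) and the sample size of 25 are assumptions I chose for this sketch.

```python
import math
import random
import statistics

random.seed(3)
mu, sigma2 = 50.0, 9.0     # assumed population mean and known variance
n, reps = 25, 4000

# Half-width of the interval: 1.96 standard errors of the sample mean.
half_width = 1.96 * math.sqrt(sigma2 / n)

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    # Does this sample's interval contain the true population mean?
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1

coverage = covered / reps
print(half_width, coverage)
```

The share of intervals that contain the population mean should come out close to .95.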

The Central Limit Theorem - Asymptotic Results

The above confidence interval is referred to as a theoretical confidence interval. It is theoretical because it is based on knowledge of the population distribution being normal. According to the central limit theorem:

Given random sampling, $E(X) = \mu$, and $V(X) = \sigma^2$, the Z statistic $Z = (\bar{X} - \mu) / (\sigma^2 / n)^{1/2}$ converges in distribution to N(0, 1). That is

$(\bar{X} - \mu) / (\sigma^2 / n)^{1/2} \overset{A}{\sim} N(0, 1)$

It can then be stated that the statistic is asymptotically distributed standard normal. Asymptotic properties are characteristics that hold as the sample size becomes large or approaches infinity. The CLT holds regardless of how the sample data is distributed, provided the population mean and variance are finite; hence there are no assumptions about normality necessary (DeGroot, 2002).
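As an informal check, the sketch below draws samples from an exponential distribution, which is clearly not normal, and looks at how often the Z statistic falls between -1.96 and 1.96. The choice of an exponential population with rate 1 (so the mean and variance are both 1) and of n = 200 is mine, for illustration only.

```python
import math
import random
import statistics

random.seed(4)
mu, sigma2 = 1.0, 1.0     # mean and variance of an exponential with rate 1
n, reps = 200, 4000

inside = 0
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    # Z statistic using the known population mean and variance.
    z = (statistics.fmean(sample) - mu) / math.sqrt(sigma2 / n)
    if abs(z) <= 1.96:
        inside += 1

share_inside = inside / reps
print(share_inside)
```

If the normal approximation is working, the printed share should be near .95 even though the underlying data are skewed.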

Student’s t Distribution - Exact Results

A limitation of the central limit theorem result above is that the Z statistic requires knowledge of the population variance. In many cases we use $s^2$ to estimate $\sigma^2$. Gosset, a brewer for Guinness in the early 1900s, was interested in normally distributed data with small sample sizes. He found that using Z with $s^2$ to estimate $\sigma^2$ did not work well with small samples. He wanted a statistic that relied on exact results vs. large sample asymptotics (Steele, 1997).

Working under the name Student, he developed the t distribution, where

$t = (\bar{X} - \mu) / (s^2 / n)^{1/2} \sim t(n-1)$

The t statistic is the ratio of a standard normal variable to the square root of an independent chi-square variable divided by its degrees of freedom (DeGroot, 2002). It is important to note that the central limit theorem does not apply because we are using $s^2$ instead of $\sigma^2$. Here we can rely on using the t-table for constructing confidence intervals and rely on exact results vs. the approximate or asymptotic results of the CLT (DeGroot, 2002).
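The small-sample problem Gosset ran into can be seen in a quick simulation. This sketch assumes standard normal data with n = 5 and compares the normal cutoff of 1.96 with 2.776, the t-table value for the 97.5th percentile with 4 degrees of freedom. The setup and names are my own.

```python
import math
import random
import statistics

random.seed(5)
mu, sigma = 0.0, 1.0      # assumed normal population
n, reps = 5, 4000
Z_CRIT = 1.96             # standard normal cutoff
T_CRIT = 2.776            # t-table cutoff for n - 1 = 4 degrees of freedom

z_hits = 0
t_hits = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    # Standard error built from the sample variance s^2, not sigma^2.
    se = math.sqrt(statistics.variance(sample) / n)
    t_stat = (statistics.fmean(sample) - mu) / se
    if abs(t_stat) <= Z_CRIT:
        z_hits += 1
    if abs(t_stat) <= T_CRIT:
        t_hits += 1

z_coverage = z_hits / reps
t_coverage = t_hits / reps
print(z_coverage, t_coverage)
```

With the normal cutoff the interval should cover the true mean noticeably less often than 95% of the time, while the t cutoff should bring coverage back to roughly .95.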

More Asymptotics - Extending the CLT

Sometimes we don’t know the distribution of the data we are working with, or don’t feel comfortable making assumptions of normality. Usually we have to estimate $\sigma^2$ with $s^2$. In this case we can’t rely on the asymptotic results of the CLT or the exact results of the t-distribution.

In this case there are some powerful theorems regarding asymptotic properties of sample statistics known as the Slutsky Theorems.