Sunday, January 31, 2010

Animal Cruelty and Statistical Reasoning

In a recent article, animal rights activists (Mercy for Animals, or MFA) went undercover and made some observations about animal abuse on dairy farms. See:
Governor Paterson, Shut This Dairy Down

The author of the above article states:

"But the grisly footage that every farm randomly chosen for investigation--MFA has investigated 11--seems to yield, indicates the violence is not isolated, not coincidental, but agribusiness-as-usual."

Where the statement above could go too far is if someone tried to apply it not only to the population of dairy farms in that state or region, but to the industry as a whole. It's not clear how broadly the author is using the term 'agribusiness-as-usual,' but suppose a reader of the article wanted to apply it to the entire dairy industry.

This is exactly why economists and scientists employ statistical methods. Anyone can make outrageous claims about a number of policies, but are these claims really consistent with evidence? How do we determine if some claims are more valid than others?

Statistical inference is the process by which we take a sample and then try to make statements about the population based on what we observe from the sample. If we take a sample (like a sample of dairy farms) and make observations, the fact that our sample was 'random' doesn't necessarily make our conclusions about the population it came from valid.

Before we can say anything about the population, we need to know 'how rare is this sample?' We need to know something about our 'sampling distribution' to make these claims.

According to the USDA, in 2006 there were about 75,000 dairy operations in the U.S. According to the activists' claims, they 'randomly' sampled 11 dairies and found abuse on all of them. That represents less than 0.015% of all dairies. Suppose we wanted to estimate the proportion of dairy farms that abuse animals, and we wanted to be 90% confident in our estimate (that is, construct a 90% confidence interval) with a margin of error of .05. The sample size required to estimate this proportion is given by the following formula:

n = (z / (2E))^2   where

z = value from the standard normal distribution associated with a 90% confidence interval

E = the margin of error

The sample size we would need is: (1.645 / (2 * .05))^2 = (16.45)^2 ≈ 270.6, or about 271 farms!

To do this we have to make some assumptions:

Since we don't know the actual proportion of dairy farms that abuse animals, the most conservative and arguably most objective assumption is 50%; the formula above is derived based on that assumption. (If we instead assumed 90%, the math (not shown) gives the same required sample size as assuming that only 10% of farms abuse their animals, which is about 98 farms, still far more than 11.) The calculation also assumes the sampling distribution of the estimated proportion is approximately normal. And to measure anything at all, we would still have to depend on someone's subjective opinion of whether a farm was engaging in abuse or not.
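For those following along in R, here is a minimal sketch of the sample size arithmetic above. The function name sample.size is just something I made up for illustration; it plugs p, the confidence level, and the margin of error into n = z^2 * p(1-p) / E^2.

# SAMPLE SIZE NEEDED TO ESTIMATE A PROPORTION WITHIN MARGIN OF ERROR E
# n = z^2 * p * (1 - p) / E^2 (with p = .5 this reduces to (z/(2E))^2)

sample.size <- function(p, conf = 0.90, E = 0.05) {
   z <- qnorm(1 - (1 - conf) / 2)      # z-value for the confidence level (1.645 for 90%)
   ceiling(z^2 * p * (1 - p) / E^2)    # round up to a whole number of farms
}

sample.size(0.5)   # most conservative assumption: about 271 farms
sample.size(0.9)   # same answer as assuming 10%: about 98 farms
sample.size(0.1)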

I'm sure the article that I'm referring to above was never intended to be scientific, but the author should have chosen their words more carefully. What they have is a handful of allegedly 'random' observations and nothing more. They have no empirical basis for inferring from their 'random' samples that these abuses are 'agribusiness-as-usual' for the whole population of dairy farmers.

While MFA may have evidence sufficient for taking action against these individual dairies, the question becomes: how high should the burden of proof be to support an increase in government oversight of the industry as a whole (which seems to be the goal of many activist organizations)? This kind of analysis involves consideration of tradeoffs, and may depend partly on subjective views. We can use statistics to validate claims made on both sides of the debate, but statistical tests have no 'power' to weigh one person's preferences over another's. Economics has no way to make interpersonal comparisons of utility.

Note: The University of Iowa has a great number of statistical calculators for doing these sorts of calculations. The sample size option can be found here. In the box, just select 'CI for one proportion,' deselect 'finite population' (since the population of dairies is quite large at 75,000), then select your level of confidence and margin of error.

References:

Profits, Costs, and the Changing Structure of Dairy Farming / ERR-47, Economic Research Service/USDA. Link

"Governor Paterson, Shut This Dairy Down," Jan 27, 2010. OpEdNews.com

R Code for Basic Histograms

#####################################################
## THIS IS A LITTLE MORE STRAIGHTFORWARD THAN THE  ##
## OTHER EXAMPLE - USING A SIMPLE DATA SET ENTERED ##
## DIRECTLY IN R VS. READ FROM A FILE OR THE WEB   ##
#####################################################

#----------------------------------------#
# SCHAUM'S P. 49 3.5 SALARY DATA
# BASIC HISTOGRAM
#----------------------------------------#


# LOAD SALARY DATA INTO VARIABLE SALARY

salary <- c(240,240,240,240,240,240,240,240,255,255,265,265,280,280,290,300,305,325,330,340)

print(salary)      # SEE IF IT IS CORRECT

library(lattice)   # LOAD THE lattice PACKAGE (OPTIONAL HERE; hist() IS IN BASE R GRAPHICS)

hist(salary)       # PRODUCE THE HISTOGRAM - SEE OUTPUT BELOW

#---------------------------------------
# HISTOGRAM OPTIONS FOR COLOR ETC.
#----------------------------------------

hist(salary, breaks=6, col="blue") # MANIPULATE THE # OF BREAKS AND COLOR

# CHANGE TITLE AND X-AXIS LABEL

hist(salary, breaks=6, col="blue", xlab ="Weekly Salary", main ="Distribution of Salary")

#-------------------------------------------------
# FIT A SMOOTH CURVE TO THE DATA - KERNEL DENSITY
#-------------------------------------------------

d <- density(salary) # CREATE A DENSITY CURVE TO FIT THE DATA

plot(d) # PLOT THE CURVE

plot(d, main="Kernel Density Distribution of Salary") # ADD TITLE

polygon(d, col="yellow", border="blue") # ADD COLOR

Saturday, January 30, 2010

R Code for Mean, Variance, and Standard Deviation

##################################################
## REMEMBER, THIS IS ONLY FOR THOSE INTERESTED IN R
## YOU ARE NOT REQUIRED TO UNDERSTAND THIS CODE
## WE WILL WORK THROUGH THIS IN CLASS ON THE BOARD
## AND IN EXCEL
##################################################

#---------------------------------#
# DEMONSTRATE MEAN AND VARIANCE #
#---------------------------------#

###################################
# DATA: Corn Yields
#
# Garst: 148 150 152 146 154
#
# Pioneer: 150 175 120 140 165
###################################

# compute descriptive statistics for Garst corn yields

garst<-c(148, 150, 152, 146, 154) # enter values for 'garst'

print(garst) # see if the data is there

summary(garst) # this gives the mean, median, min, max, Q1, Q3

var(garst) # compute variance

sd(garst) # compute standard deviation

plot(garst)


# compute descriptive statistics for Pioneer Corn Yields

pioneer<-c(150, 175, 120, 140, 165)

print(pioneer)

summary(pioneer)

var(pioneer)

sd(pioneer)

plot(pioneer)

Wednesday, January 27, 2010

APPLICATIONS USING R

While we will primarily rely on Excel for your empirical projects and for examples in class, if you have an interest in programming, I encourage you to explore the free statistical programming language R.

The links below highlight applications of the R programming language. Some of these are far more advanced than what we will address in class, and some don't necessarily involve statistics, but I share them with you to illustrate how flexible the language can be for a number of tasks. Learning to use R for statistics is a great way to get started. If you are interested in learning more, please don't hesitate to contact me.

Visualizing Taxes and Deficits Using R and Google Visualization API

R in the New York Times

How to integrate R into web-based applications (video) using Rapache

Using Social Network Analysis to Analyze Tweets with R

Tuesday, January 5, 2010

Creating a Histogram in SAS

*---------------------------------------------------*
READ RAW DATA INTO SAS

ASSUMPTIONS:

1) DATA IS IN A CSV FILE LOCATED ON YOUR DESKTOP NAMED
'hs0.csv'

*---------------------------------------------------*
;

PROC IMPORT OUT= WORK.HSO
DATAFILE= "C:\Documents and Settings\wkuuser\Desktop\hs0.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

*---------------------------------------*
CREATING A HISTOGRAM IN SAS
*---------------------------------------*
;

PROC UNIVARIATE DATA = HSO;
VAR WRITE;
HISTOGRAM WRITE;
TITLE 'HISTOGRAM OF THE HS0 DATA';
RUN;

SAS Links

UCLA SAS Links

SAS-R-Blog

R Statistical Software Links

Using R in the Real World

Why use R Statistical Software?

Google's R Style Guide - Set up for Google's corporate programmers


UCLA R Resources

Statistical Computing with R: A Tutorial (Illinois State University)

(UCLA R Class Notes) -great example code for getting started with R

Quick R (Statistics Guide)

R-Project (where you go to get the R software, free)

SAS-R Blog

Using R for Introductory Statistics (John Verzani)

Creating a Histogram In R

# READ DATA SET CALLED 'hs0' from Web:

hs0 <- read.table("http://www.ats.ucla.edu/stat/R/notes/hs0.csv", header=T, sep=",")

attach(hs0)

# Or if you have downloaded the data set to your desktop:

hs0 <- read.table("C:\\Documents and Settings\\wkuuser\\Desktop\\hs0.csv", header=T, sep=",")

attach(hs0)


# print summary of first 20 observations to see what the data looks like

hs0[1:20, ]

# CREATE HISTOGRAM FOR VARIABLE 'write'

library(lattice)   # load the lattice package (optional here; hist() is in base R graphics)

hist(write)

Why use R Statistical Software

From: http://datamining.togaware.com/survivor/Pros_Cons.html


* R is the most comprehensive statistical analysis package available. It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data.

* R is a programming language and environment developed for statistical analysis by practising statisticians and researchers.

* R is developed by a core team of some 10 developers, including some of the world's leading statisticians.

* The validity of the R software is ensured through openly validated and comprehensive governance as documented for the American Food and Drug Authority in XXXX. Because R is open source, unlike commercial software, R has been reviewed by many internationally renowned statisticians and computational scientists.

* R has over 1400 packages available, specialising in topics ranging from Econometrics and Data Mining to Spatial Analysis and Bio-Informatics.

* R is free and open source software allowing anyone to use and, importantly, to modify it. R is licensed under the GNU General Public License, with Copyright held by The R Foundation for Statistical Computing.

* Anyone can freely download and install the R software and even freely modify the software, or look at the code behind the software to learn how things are done.

* Anyone is welcome to provide bug fixes, code enhancements, and new packages, and the wealth of quality packages available for R is a testament to this approach to software development and sharing.

* R integrates well with packages written in other languages, including Java (hence the RWeka package), Fortran (hence randomForest), C (hence arules), C++, and Python.

* The R command line is much more powerful than a graphical user interface.

* R is cross platform. R runs on many operating systems and different hardware. It is popularly used on GNU/Linux, Macintosh, and MS/Windows, running on both 32-bit and 64-bit processors.

* R has active user groups where questions can be asked and are often quickly responded to, and often responded to by the very people who have developed the environment--this support is second to none. Have you ever tried getting support from people who really know SAS or are core developers of SAS?

* New books for R (the Springer Use R! series) are emerging and there will soon be a very good library of books for using R.

* No license restrictions (other than ensuring our freedom to use it at our own discretion) and so you can run R anywhere and at any time.

* R probably has the most complete collection of statistical functions of any statistical or data mining package. New technology and ideas often appear first in R.

* The graphic capabilities of R are outstanding, providing a fully programmable graphics language which surpasses most other statistical and graphical packages.

* A very active email list, with some of the world's leading statisticians actively responding, is available for anyone to join. Questions are quickly answered and the archive provides a wealth of user solutions and examples. Be sure to read the Posting Guide first.

* Being open source the R source code is peer reviewed, and anyone is welcome to review it and suggest improvements. Bugs are fixed very quickly. Consequently, R is a rock solid product. New packages provided with R do go through a life cycle, often beginning as somewhat less quality tools, but usually quickly evolving into top quality products.

* R plays well with many other tools, importing data, for example, from CSV files, SAS, and SPSS, or directly from MS/Excel, MS/Access, Oracle, MySQL, and SQLite. It can also produce graphics output in PDF, JPG, PNG, and SVG formats, and table output for LATEX and HTML.

EXAMPLE CODE

SAS CODE

Histogram


R CODE

NOTE: the code in these posts looks sloppy, but if you cut and paste it into Notepad or the R scripting window, it lines up nicely.

Histogram

Histogram using Schaum's Homework Data (more straightforward than example above)

Kernel Density Plots

R Code Mean, Variance, Standard Deviation

Selected Practice Problems for Test 1

Regression using R


Plotting Normal Curves and Calculating Probability

Monday, January 4, 2010

Statistics News and Industry

These applications are examples of statistics used in the real world. Many of them may involve more advanced concepts than what we will discuss in class, but my goal is to show you the numerous applications and possibilities with statistics. If you have any questions about anything posted here, don't hesitate to ask.

The new field of Data Science
Applications Using R
Statistics Summary For Media (humor)
For today's graduate just one word: Statistics
Animal Cruelty and Statistical Reasoning
Copulas and the Financial Crisis

Concepts from Mathematical Statistics

A. Concepts from Mathematical Statistics

Probability Density Functions

Random Variables

A random variable takes on values that have specific probabilities of occurring (Nicholson, 2002). An example of a random variable would be the number of car accidents per year among sixteen-year-olds.

If we know how random variables are distributed in a population, then we may have an idea of how rare an observation may be.

Example: How often sixteen year olds are involved in auto accidents in a year’s time.

This information is then useful for making inferences ( or drawing conclusions) about the population of random variables from sample data.

Example: We could look at a sample of data consisting of 1,000 sixteen year olds in the Midwest and make inferences or draw conclusions about the population consisting of all sixteen year olds in the Midwest.

In summary, it is important to be able to specify how a random variable is distributed. It enables us to gauge how rare an observation ( or sample) is and then gives us ground to make predictions, or inferences about the population.

Random variables can be discrete, that is, observed in whole units as in the counting numbers 1, 2, 3, 4, etc. Random variables may also be continuous; in this case they can take on an infinite number of values. An example would be crop yields: yields can be measured in bushels down to a fraction or decimal.

The distributions of discrete random variables can be presented in tabular form or with histograms. Probability is represented by the area of a 'rectangle' in a histogram (Billingsly, 1993).

Distributions for continuous random variables cannot be represented in tabular format due to their characteristic of taking on an infinite number of values. They are better represented by a smooth curve defined by a function ( Billingsly, 1993). This function is referred to as a probability density function or p.d.f.

The p.d.f. gives the probability that a random variable 'X' takes on values in a narrow interval 'dx' (Nicholson, 2002). This probability is equivalent to the area under the p.d.f. curve. This area can be described by the cumulative distribution function, or c.d.f. The c.d.f. gives the value of an integral involving the p.d.f.

Let f(x) be a p.d.f.

P( a <= X <= b ) = ∫[a, b] f(x) dx = F(b) - F(a)

This can be interpreted to mean that the probability that X is between the values of 'a' and 'b' is given by the integral of the p.d.f. from 'a' to 'b.' This value can be obtained from the c.d.f., F(x), evaluated at the endpoints: F(b) - F(a). Those familiar with calculus will recognize that F(x) is the anti-derivative of f(x).
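As a quick illustration (not part of the original notes' examples), R can verify the relationship between the p.d.f. and the c.d.f. numerically. Here I assume a standard normal random variable purely as an example:

# CHECK THAT P(a <= X <= b) = F(b) - F(a) FOR A STANDARD NORMAL X

a <- -1
b <- 1

integrate(dnorm, lower = a, upper = b)$value   # area under the p.d.f. from a to b, about 0.6827

pnorm(b) - pnorm(a)                            # same probability from the c.d.f.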


Common p.d.f’s

Most students are familiar with using tables in the back of textbooks for normal, chi-square, t, and F distributions. These tables are generated by the p.d.f’s for these particular distributions. For example, if we make the assumption that the random variable X is normally distributed then its p.d.f. is specified as follows:

f(x)= 1 / (2 ) -1/2 e –1/2 (x - )2/

where X~ N (  )

In the beginning of this section I stated that it was important to be able to specify how a random variable is distributed. Through experience, statisticians have found that they are justified in modeling many random variables with these p.d.f.’s. Therefore, in many cases one can be justified in using one of these p.d.f.’s to determine how rare a sample observation is, and then to make inferences about the population. This is in fact what takes place when you look up values from the tables in the back of statistics textbooks.
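If you want to convince yourself that R's built-in dnorm() is just the formula above, here is a small sketch; the values mu = 100 and sigma = 15 are arbitrary choices for illustration:

# THE NORMAL p.d.f. WRITTEN OUT, COMPARED WITH R'S BUILT-IN dnorm()

mu <- 100
sigma <- 15

f <- function(x) (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))

f(110)                              # p.d.f. evaluated from the formula
dnorm(110, mean = mu, sd = sigma)   # same value from R

curve(dnorm(x, mu, sigma), from = 40, to = 160,
      main = "Normal p.d.f.", xlab = "x", ylab = "f(x)")   # plot the curve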

Mathematical Expectation

The expected value E(X) for a discrete random variable can be defined as 'the sum of products of each observation Xi and the probability of observing that particular value of Xi' (Billingsly, 1993). The expected value of an observation is the mean of a distribution of observations. It can be thought of conceptually as an average or weighted mean.

Example: the p.d.f. for the discrete random variable X, given in tabular format:

Xi:         1    2    3
P(X = xi):  .25  .50  .25

E(X) = Σ Xi P(X = xi) = 1(.25) + 2(.50) + 3(.25) = 2.0

Conceptually, the expected value of a random variable can be viewed as a point of balance or center of gravity. In reality, the actual value of an observed random variable will likely be larger or smaller than the ‘expected value.’ However due to the nature of the expected value, actual observations will have a distribution that is balanced around the expected value. For this reason the expected value can be viewed as the balancing point for a distribution, or the center of gravity ( Billingsly, 1993).

That is to say that population values cluster or gravitate around the expected value. Most of the population values are expected to be found within a small interval (i.e., measured in standard deviations) about the population's expected value or mean. Hence the expected value gives a hint about how rare a sample may be: values lying near a population mean should not be rare but quite common.

The expected value for a continuous random variable must be calculated using integral calculus.

E(X) = ∫ x f(x) dx

If x is an observed value and f(x) is the p.d.f. for x, then the product x f(x) dx is the continuous version of the discrete term Xi P(X = xi), and integration is the continuous version of summation.

Variance

As I mentioned previously, actual observations will often depart from the expected value. Variance quantifies the degree to which observations in a distribution depart from an expected value. It is a mean of squared deviations between an observation and an expected value/mean (Billingsly, 1993).

In the discrete case we have the following mathematical description:

Σ (Xi - μ)² P(X = xi) = σ²

In the continuous case:

∫ (x - μ)² f(x) dx = E[ (X - μ)² ] = σ²


Given the mean (), or expected value of a random variable, one knows the value a random variable is likely to assume on average. The variance (2) indicates how close these observations are likely to be to the mean or expected value on average.




Sample Estimates

Given knowledge of the population mean and variance, one can characterize the population distribution for a random variable. As I mentioned at the beginning of the previous section, it is important to be able to specify how a random variable is distributed. It enables us to gauge how rare an observation ( or sample) is and then gives us grounds to make predictions, or inferences about the population.

It is not always the case that we have access to all of the data from a population necessary for determining population parameters like the mean and variance. In this case we must estimate these parameters using sample data to compute estimators or statistics.

Estimators or statistics are mathematical functions of sample data. The approach most students are familiar with in computing sample statistics is to compute the sample mean (Xbar) and sample variance (s²) from sample data to estimate the population mean (μ) and variance (σ²). This is referred to as the analogy principle, which involves using a corresponding sample feature to estimate a population feature (Bollinger, 2002).


Properties of Estimators

A question seldom answered in many undergraduate statistics or research methods courses is: how do we justify computing a sample mean from a small sample of observations and then using it to make inferences about the population?

I have discussed the fact that most of the values in a population can be expected to be found within a small interval about the population mean or expected value. The question remains: can we expect most of the values in the population to be found within a small interval about a sample mean? This must be true if we are to make inferences about the population using sample data and estimators like the sample mean.

Just like random variables, estimators or statistics have distributions. The distributions of estimators/statistics are referred to as sampling distributions. If the sampling distribution of a statistic is similar to that of the population then it may be useful to use that statistic to make inferences about the population. In that case sample means and variances may be good estimators for population means and variances.

Fortunately statisticians have developed criteria for evaluating ‘estimators’ like the sample mean and variance. There are four properties that characterize good estimators.




Unbiased Estimators

If ^ is an estimator for the population parameter , and if
E (^ then ^ is an unbiased estimator of the population parameter . This implies that the distribution of the sample statistic/estimator is centered around the population parameter ( Bollinger, 2002).

Consistency

θ̂ is a consistent estimator of θ if

lim (n → ∞) Pr[ |θ̂ - θ| < c ] = 1 for any c > 0

This implies that as you add more and more data, the probability of being within any small distance of θ gets large, or the variance of θ̂ approaches zero as n approaches infinity. It can be said that θ̂ →p θ, that is, θ̂ converges in probability to θ (Bollinger, 2002).
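A small simulation sketch of this idea (the population here is an assumed normal with mean 5, purely for illustration): as n grows, the sample mean settles down around the population mean.

# CONSISTENCY OF THE SAMPLE MEAN, BY SIMULATION

set.seed(123)
mu <- 5
for (n in c(10, 100, 1000, 10000)) {
   xbar <- mean(rnorm(n, mean = mu, sd = 2))
   cat("n =", n, " sample mean =", round(xbar, 3), "\n")
}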

Efficiency

This is based on the variance of the sample statistic/estimator. Given the estimator θ̂ and an alternative estimator θ̃ ('theta tilde'), θ̂ is more efficient if

Variance(θ̂) < Variance(θ̃)

Mean Squared Error is a method of quantifying efficiency.

MSE = E[ (θ - θ̂)² ] = V(θ̂) + [ E(θ̂) - θ ]² = variance + bias squared.

It can then be concluded that for an unbiased estimator the measure of efficiency reduces to the variance (Bollinger, 2002).

Robustness

Robustness is determined by how the other properties (i.e., unbiasedness, consistency, efficiency) are affected by the assumptions made (Bollinger, 2002).

When sample statistics or estimators exhibit these four properties, statisticians feel that they can rely on these computations to estimate population parameters. It can be shown mathematically that the formulas used in computing the sample mean and variance meet these criteria.
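As a rough illustration of the first two properties, here is a simulation sketch in R: repeated samples of size 5 from an assumed N(mu = 10, sigma = 3) population have sample means and sample variances that average out to the population values, and the MSE of the sample mean is essentially its variance since the bias is near zero.

# CHECKING UNBIASEDNESS BY SIMULATION (10,000 SAMPLES OF SIZE 5)

set.seed(42)
mu <- 10
sigma <- 3
xbars <- replicate(10000, mean(rnorm(5, mu, sigma)))
s2s   <- replicate(10000, var(rnorm(5, mu, sigma)))

mean(xbars)                           # close to mu = 10, so E(Xbar) = mu
mean(s2s)                             # close to sigma^2 = 9, so E(s^2) = sigma^2

var(xbars) + (mean(xbars) - mu)^2     # MSE of Xbar = variance + bias^2
sigma^2 / 5                           # theoretical variance of Xbar for n = 5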






Confidence Intervals

Confidence intervals are based on the sampling distribution of a sample statistic/estimator. Confidence intervals based on these distributions tell us what values an estimator (e.g., the sample mean) is likely to take, and how likely or rare a given value is (DeGroot, 2002). Confidence intervals are the basis for hypothesis testing.

Theoretical Confidence Interval

If we assume that our sample data is distributed normally, Xi ~ N(μ, σ²), then it can be shown that the statistic

Z = (Xbar - μ) / (σ² / n)^(1/2) ~ N(0, 1), the standard normal distribution (Billingsly, 1993).

Given the probabilities represented by the standard normal distribution it can be shown as a matter of algebra that

Pr ( -1.96 <= Z <= 1.96) = .95

The value 1.96 is the quantile of the standard normal distribution such that there is only a .025, or 2.5%, chance of finding a Z value greater than 1.96. Conversely, there is only a 2.5% chance of finding a computed Z value less than -1.96 (Steele, 1997).

As a matter of algebra it can be shown that


Pr( Xbar - 1.96 (σ² / n)^(1/2) <= μ <= Xbar + 1.96 (σ² / n)^(1/2) ) = .95

(Goldberger, 1991).

This implies that, 95% of the time, the population mean will lie within 1.96 standard errors of the sample mean. The interval above then represents a 95% confidence interval (Bollinger, 2002).
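Here is a sketch of that interval in R, reusing the Garst yield data from the earlier post; the value of sigma is an assumed 'known' population standard deviation, since the theoretical interval requires one:

# 95% CONFIDENCE INTERVAL FOR THE MEAN WHEN sigma IS KNOWN

garst <- c(148, 150, 152, 146, 154)
sigma <- 3                # assumed known population standard deviation
n <- length(garst)
xbar <- mean(garst)
z <- qnorm(0.975)         # 1.96

c(xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))   # lower and upper limits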


The Central Limit Theorem –Asymptotic Results

The above confidence interval is referred to as a theoretical confidence interval. It is theoretical because it is based on knowledge of the population distribution being normal. According to the central limit theorem:

Given random sampling with E(X) = μ and V(X) = σ², the Z-statistic Z = (Xbar - μ) / (σ² / n)^(1/2) converges in distribution to N(0, 1). That is,

(Xbar - μ) / (σ² / n)^(1/2) ~A N(0, 1)

It can then be stated that the statistic is asymptotically distributed standard normal. Asymptotic properties are characteristics that hold as the sample size becomes large or approaches infinity. The CLT holds regardless of how the sample data is distributed, hence no assumption of normality is necessary (DeGroot, 2002).
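A quick simulation sketch of the CLT: the population below is exponential (skewed, clearly not normal), yet the histogram of the sample means looks approximately normal.

# CENTRAL LIMIT THEOREM BY SIMULATION

set.seed(1)
means <- replicate(5000, mean(rexp(50, rate = 1)))   # 5,000 sample means, each from n = 50

hist(means, breaks = 40,
     main = "Sampling Distribution of the Mean (n = 50)",
     xlab = "sample mean")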

Student’s t Distribution-Exact Results

A limitation of the Z-statistic above is that it requires knowledge of the population variance σ². In many cases we use s² to estimate σ². Gosset, a brewer for Guinness in the early 1900s, was interested in normally distributed data with small sample sizes. He found that using Z with s² in place of σ² did not work well with small samples. He wanted a statistic that relied on exact results vs. large-sample asymptotics (Steele, 1997).

Working under the name Student, he developed the t distribution, where

t = (Xbar - μ) / (s² / n)^(1/2) ~ t(n-1)


The t distribution is the ratio of a standard normal variable to the square root of a chi-square variable divided by its degrees of freedom (DeGroot, 2002). It is important to note that the central limit theorem does not apply here because we are using s² instead of σ². We can rely on the t-table for constructing confidence intervals and obtain exact results vs. the approximate or asymptotic results of the CLT (DeGroot, 2002).
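A sketch of the t-based interval in R, again reusing the Garst yield data from the earlier post; t.test() gives the same interval as the hand calculation:

# t-BASED 95% CONFIDENCE INTERVAL WHEN sigma IS ESTIMATED BY s

garst <- c(148, 150, 152, 146, 154)
n <- length(garst)
xbar <- mean(garst)
s <- sd(garst)
tval <- qt(0.975, df = n - 1)         # t quantile with n - 1 degrees of freedom

c(xbar - tval * s / sqrt(n), xbar + tval * s / sqrt(n))   # hand calculation

t.test(garst)$conf.int                # same interval from t.test()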

More Asymptotics- Extending the CLT

Sometimes we don’t know the distribution of the data we are working with, or don’t feel comfortable making assumptions of normality. Usually we have to estimate 2 with s2 . In this case we can’t rely on the asymptotic results of the CLT or the exact results of the t-distribution.

In this case there are some powerful theorems regarding asymptotic properties of sample statistics known as the Slutsky Theorems.