Fitting Distribution X to Data From Distribution Y
I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I’m not a “closed form” kinda guy. I’m more of a “numerical simulation” type of fellow. So I whipped up a little R code to illustrate the process, and then we changed the parameters of the gamma distribution to see how that affected the fit. An exercise like this is what I call building a “toy model,” and I think it is invaluable as a method for building intuition and a visceral understanding of data.
Here’s some example code which we played with:
set.seed(3)
x <- rgamma(1e5, 2, .2)
plot(density(x))

# normalize the gamma so it's between 0 & 1
# .0001 added because having exactly 1 causes fail
xt <- x / ( max( x ) + .0001 )

# fit a beta distribution to xt
library( MASS )
fit.beta <- fitdistr( xt, "beta", start = list( shape1=2, shape2=5 ) )

x.beta <- rbeta(1e5,fit.beta$estimate[[1]],fit.beta$estimate[[2]])

## plot the pdfs on top of each other
plot(density(xt))
lines(density(x.beta), col="red" )

## plot the qqplots
qqplot(xt, x.beta)
It’s not illustrated above, but it’s probably useful to transform the simulated data (x.beta) back into pre-normalized space by multiplying by max( x ) + .0001. I swore I’d never say this, but I lied: I’ll leave that as an exercise for the reader.
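For anyone who wants to check their answer to that exercise, here is one possible sketch. It repeats the setup from the code above so it runs on its own. The names scale.factor and x.beta.orig are mine, introduced only for illustration.

```r
library(MASS)

set.seed(3)
x <- rgamma(1e5, 2, .2)

# same normalization as before, but keep the scale factor so it can be undone
# (scale.factor is a name made up for this sketch)
scale.factor <- max(x) + .0001
xt <- x / scale.factor

# fit the beta and simulate from it, as in the original example
fit.beta <- fitdistr(xt, "beta", start = list(shape1 = 2, shape2 = 5))
x.beta <- rbeta(1e5, fit.beta$estimate[["shape1"]], fit.beta$estimate[["shape2"]])

# undo the normalization so the simulated draws are back on the original scale
x.beta.orig <- x.beta * scale.factor

# compare against the raw gamma draws instead of the normalized ones
plot(density(x))
lines(density(x.beta.orig), col = "red")
qqplot(x, x.beta.orig)
```

Since the back-transform is just a rescaling, the shape of the comparison should not change. What it buys you is axes in the units of the original data.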
Another very useful tool in building a mental road map of distributions is the graphical chart of distribution relationships that John Cook introduced me to.
Thanks for this post.
I’m trying to fit dataset y to the distribution of dataset x. I’ve followed your post but I can’t get it to work. In particular, how do I get the values of shape1 and shape2 from dataset x?
x <- abs(rnorm(100))
y <- abs(rnorm(100))
plot(density(x), type="l")
lines(density(y), col="red")
Hey Muhammad,
I’m not completely sure what you are asking. What do you mean by “values of shape1”?
I don’t know what shape1 and shape2 are. I don’t create objects by those names in my example. Can you help me understand? I’m happy to help if I can grasp what you are asking!
-JD
Hi JD,
The shape1 and shape2 are in the following line (as part of the inputs of fitdistr):
fit.beta <- fitdistr( xt, "beta", start = list( shape1=2, shape2=5 ) )
I am not sure how the values 2 and 5 for shape1 and shape2, respectively, are derived.
-Muhammad
Ohhhh. Sorry. I got it now.
If you look at the help for fitdistr you’ll see that the beta distribution requires starting values for the search over shape1 and shape2. Those aren’t derived, they are just sane guesses. HTH
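If guessing feels too arbitrary, one common alternative (not what the original example does) is to take the starting values from the data with the method of moments. The sketch below assumes the data are already scaled into (0, 1), as xt is above. The names m, v, common and start.mom are made up for this illustration.

```r
library(MASS)

set.seed(3)
x <- rgamma(1e5, 2, .2)
xt <- x / (max(x) + .0001)

# method-of-moments estimates for a beta distribution:
#   shape1 = m * (m * (1 - m) / v - 1)
#   shape2 = (1 - m) * (m * (1 - m) / v - 1)
# where m and v are the sample mean and variance
m <- mean(xt)
v <- var(xt)
common <- m * (1 - m) / v - 1   # needs to be positive for this to work

start.mom <- list(shape1 = m * common, shape2 = (1 - m) * common)
print(start.mom)

# use the data-driven guesses as the starting point for the search
fit.beta <- fitdistr(xt, "beta", start = start.mom)
print(fit.beta)
```

Either way, the start list is only where the optimizer begins looking. The fitted values come out of fit.beta$estimate.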
In your case, you can look at the parametric forms of the beta and gamma distributions to compare them. But since you mentioned John’s blog, your readers might also enjoy reading
http://www.johndcook.com/blog/2010/08/11/what-distribution-does-my-data-have/
in which John asks a different question: given observational data, why should any famous distribution fit the data?
You can also use the Kolmogorov-Smirnov test if you don’t feel like doing any graphing.
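As a rough sketch of what that could look like with the example from the post, ks.test in base R can compare xt against a beta CDF with the fitted shapes. The setup is repeated so the snippet runs on its own; the variable name ks is mine.

```r
library(MASS)

set.seed(3)
x <- rgamma(1e5, 2, .2)
xt <- x / (max(x) + .0001)
fit.beta <- fitdistr(xt, "beta", start = list(shape1 = 2, shape2 = 5))

# one-sample KS test of xt against the fitted beta CDF;
# extra arguments after "pbeta" are passed through to pbeta()
ks <- ks.test(xt, "pbeta",
              shape1 = fit.beta$estimate[["shape1"]],
              shape2 = fit.beta$estimate[["shape2"]])

print(ks$statistic)  # D: largest gap between empirical and fitted CDFs
print(ks$p.value)
```

Two caveats worth keeping in mind. The shapes were estimated from the same data being tested, so the p-value is not strictly valid. And with 1e5 points even a small mismatch can produce a tiny p-value, so the D statistic is probably the more useful number to watch as you change the gamma parameters.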