Message: 34726
Posted by: Mike
Posted on: Tuesday, 21st October 2003
Hi,
I have several thousand rows of data which I need to group into appropriately sized bins for a histogram. The problem is in determining the correct number and size of the bins, i.e. if I had a bunch of test scores and didn't have pre-determined bins (A,B,C,D,F) that were already sized (>= 90, 80-89, 70-79, 60-69, <60), how would I figure out the number and size of the bins?
Thanks!
Message: 34732
Posted by: Gabriel
Posted on: Tuesday, 21st October 2003
I learned the following steps some time ago and have used them ever since. The method has no theoretical basis as far as I know, but it seems to work fine:
1) Number of bins (first trial): n1=sqrt(N-1)-1, where N is the number of individuals.
2) Bin size (first trial): s1=(max-min)/n1, where max and min are the highest and lowest values in the data.
3) Bin size (definitive): s = s1 rounded UP to the precision of the data. For example, if s1=0.32 mm and the data is recorded to 0.1 mm, then s=0.4 mm.
4) The lower limit of the first bin is min-(1/2 of the precision). The upper limit of the first bin is that lower limit + s, which is also the lower limit of the second bin. Add another s to get the upper limit of the second bin, which is also the lower limit of the third bin, and so on.
As said, this is a guideline. If you don't like the result you can increase or reduce the bin size, but always keep the size a multiple of the precision, as said in point 3) (if not, some bins will contain more possible results than others, and the bars of those bins will be artificially higher), and always keep the limits of the bins "between" possible readings, as said in point 4); if not, the bins will be "unbalanced". For example, a bin "larger than 10, up to 12" has its center at 11, but if the resolution is 1 the possible results are 11 and 12, whose center is 11.5. A bin (10.5; 12.5) has its center at 11.5, which matches the center of the possible results. It also means you don't have to think about whether a limit is "larger" or "larger or equal" than 10.5 and "lower" or "lower or equal" than 12.5, because you will never have a data point exactly equal to 10.5 or 12.5 anyway.
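For anyone who wants to try the four steps on their own data, here is a rough Python sketch of them. The function name gabriel_bins and the precision argument are made up for illustration, and the small tolerance used when rounding up is an assumption to guard against floating-point noise, not part of the method as described.

```python
import math

def gabriel_bins(data, precision):
    """Sketch of the four steps: returns (bin_size, list_of_bin_edges).

    `precision` is the resolution of the readings (e.g. 0.1 for 0.1 mm data).
    """
    n = len(data)
    lo, hi = min(data), max(data)

    # Step 1: first-trial number of bins.
    n1 = math.sqrt(n - 1) - 1
    if n1 <= 0:
        raise ValueError("need more than 2 data points")

    # Step 2: first-trial bin size.
    s1 = (hi - lo) / n1

    # Step 3: round UP to a whole number of precision units.
    # The 1e-9 tolerance is an assumption to absorb float rounding.
    units = max(1, math.ceil(s1 / precision - 1e-9))
    s = units * precision

    # Step 4: first edge sits half a precision unit below the minimum,
    # so every edge falls "between" possible readings.
    first = lo - precision / 2
    edges = [first]
    k = 0
    while edges[-1] < hi:
        k += 1
        edges.append(first + k * s)
    return s, edges


if __name__ == "__main__":
    size, edges = gabriel_bins(list(range(1, 102)), precision=1)
    print(size, edges)
```

With integer readings from 1 to 101 and a precision of 1, this gives a bin size of 12 and edges starting at 0.5, so no reading can ever land exactly on an edge.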
Message: 34733
Posted by: Heebeegeebee BB
Posted on: Tuesday, 21st October 2003
Check this link out:
http://www.sytsma.com/tqmtools/hist.html
Message: 34909
Posted by: Mike
Posted on: Thursday, 23rd October 2003
Thanks for the replies Gabriel & Heebeegeebee!
The following two are from published studies:
1) bin width = 3.49*σ*N^(-1/3)
2) bin width = 2*(IQR)*N^(-1/3)
where σ = standard deviation of the data; IQR = 75th pctl - 25th pctl; N = number of samples; and the number of bins would be based on dividing the dataset range by the bin width.
This one is a rule of thumb I found on the Internet:
3) number of bins = 1+3.3*ln(N), where the bin width would be the dataset range divided by the number of bins
4) I've also tried Excel's built-in data analysis tools.
5) Gabriel's method
Here is what I get with the test data I'm reviewing (I've left out some small % of some bins so it won't total 100%):
1) bin width = 888; number of bins = 338; 97% of items in one bin, 1% in next bin, then 1%
2) bin width = 17; number of bins = 17564; 20% of items in one bin, 13% in next bin, then 11%,6%,5%,5%,3%,3%,3%,2%,2%
3) bin width = 9606; number of bins = 31; 99% of items in one bin, 1% in next bin
4) bin width = 3093; number of bins = 97; 99% of items in one bin, 1% in next bin
5) bin width = 3158, that's as far as I took it
All of these give way too many bins, because most of the data is clustered below a certain number and the range between the lowest and highest numbers is quite large.
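In case it helps anyone compare rules 1) to 3) on their own data, here is a minimal Python sketch using only the standard library. The function names are made up for illustration. Two assumptions to be aware of: the sample standard deviation is used for σ, and the quartiles come from the default method of statistics.quantiles; other conventions will shift the numbers slightly. Rule 3) is coded with ln exactly as written above, although the commonly cited form of Sturges' rule uses a base-10 logarithm (1 + 3.322*log10(N)), which gives far fewer bins.

```python
import math
import statistics

def rule1_width(data):
    """Rule 1: bin width = 3.49 * sigma * N^(-1/3) (sample std dev assumed)."""
    return 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)

def rule2_width(data):
    """Rule 2: bin width = 2 * IQR * N^(-1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default quartile convention
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

def rule3_bins(data):
    """Rule 3 as quoted in the post: number of bins = 1 + 3.3 * ln(N)."""
    return 1 + 3.3 * math.log(len(data))

def bins_from_width(data, width):
    """Number of bins needed to cover the dataset range at a given width."""
    return math.ceil((max(data) - min(data)) / width)


if __name__ == "__main__":
    sample = list(range(1, 1001))  # placeholder data, not the real dataset
    for name, width in (("rule 1", rule1_width(sample)),
                        ("rule 2", rule2_width(sample))):
        print(name, round(width, 2), bins_from_width(sample, width))
    print("rule 3", round(rule3_bins(sample), 1))
```

Rule 2) depends only on the middle half of the data, so on heavily skewed data it produces a narrow width and a very large bin count, while rule 1) is inflated by the long tail; that matches the pattern in the results listed above.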
Message: 101928
Posted by: hide
Posted on: Sunday, 1st October 2006
This page provides a method for selecting the histogram bin size (or number of bins) for your data:
http://www.ton.scphys.kyoto-u.ac.jp/~hideaki/res/histogram.html
Best,