This page is devoted to contributions submitted by colleagues and
readers of the text. You are welcome to make your own contribution to any
of the areas listed. Visit this section often as it is likely to change
frequently, especially with the help of interested readers.
If you feel that a new area should be created, please contact me by email at James.Ramsey@NYU.edu.
This section lists comments by readers suggesting possible additions to or deletions from the current text, error corrections, and other suggestions to improve the text or this web site.
In some cases, I have commented on the suggestions. In any event, all reader contributions are edited for relevance and good taste, but criticisms are faithfully reported. You are welcome to make your own contribution here by emailing me at the above address.
Corrections to the Manuscript
This is the section in which I will insert, at your instruction, a hyperlink to your favorite data sets or case studies. You may prefer to send me text and request that I enter the material myself on your behalf. You can do so by pressing the Feedback button to email me.
If you submit data sets or case studies, please indicate whether they are in the public domain or whether permission to use the data has already been obtained.
Note that I retain the right to edit any contributions to this web site in the interests of space, propriety, and relevance.
Case Studies and Data Links
New York University Department of Economics Data Resources: Includes statistical software links
Data on the Net (University of California at San Diego): Search for databases on the web
The Globally Accessible Statistical Procedures Initiative
Statistical Material on the Web
Many readers have their own favorite questions and exercises and may wish to share them with other readers of this text. This is the spot where you can submit your questions and exercises by pressing the Feedback button to email me.
Here is the first suggestion.
Q: Statistical Analysis of Literary Texts
In the first chapter of the text and in one of the exercises, mention is made of using statistical tools to analyze textual details. This has been done to detect authorship, compare the same author's output under different circumstances, detect the presence of embedded cryptographic writing, and so on. The comparisons that have been made include the relative frequency of certain constructions, phrases, and types of words (for example, certain prepositions); the distribution of word, sentence, or paragraph lengths; the distribution of interval lengths between certain phrases; and so on. You are challenged to think of your own novel statistical analysis of literary texts and to consider new ideas for testing or discovery in the field of textual analysis. I will be pleased to post references to the outcome of such endeavors.
Listed below are some references to a part of this literature that will provide anyone who is interested with an exciting introduction to this somewhat unusual branch of applied statistics. The list has been culled from a dialogue on the S-news network without carefully checking the citations for accuracy. My apologies, but I do not think that you will have much difficulty in finding these references.
Binongo & Smith (1999); J. of Applied Statistics, vol. 26-7, p. 781, for a study of Oscar Wilde's writings;
Mosteller, Frederick, and David L. Wallace (1984); "Applied Bayesian and Classical Inference: the Case of the Federalist Papers," in the 2nd edition of Inference and Disputed Authorship, The Federalist, Springer-Verlag;
Williams, C.B. (1975); "Mendenhall's studies of word length distribution in the works of Shakespeare and Bacon," Biometrika, 62, 207-212;
Efron, Brad, and R. Thisted (1976); "Estimating the number of unseen species: How many words did Shakespeare know?" Biometrika, 63, 435-448;
Elliott, Ward, and R. Valenza (1991); "Who was Shakespeare?", Chance, vol. 4, no. 3, 8-14.
This is just a smattering of the literature using statistical methods for the analysis of literary texts. I am sure that if you look further, you will discover many more interesting examples.
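As a starting point for your own experiments, here is a minimal Python sketch of two of the comparisons mentioned in this question: the distribution of word lengths and the relative frequency of selected words. The function names, the list of marker words, and the sample sentence are all hypothetical and chosen only for illustration; a serious study would justify its own word list and use full texts.

```python
import re
from collections import Counter

# Hypothetical marker words; a real study would justify its own list.
MARKER_WORDS = ("the", "of", "and", "to", "upon", "by")


def tokenize(text):
    """Return the lower-cased words of a text, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_length_distribution(text):
    """Relative frequency of each word length (letters only)."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(len(w.replace("'", "")) for w in words)
    return {length: counts[length] / len(words) for length in sorted(counts)}


def marker_word_rates(text, markers=MARKER_WORDS):
    """Occurrences of each marker word per 1,000 words of text."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    return {m: 1000 * counts[m] / len(words) for m in markers}


# Illustrative sentence only; substitute the full texts you wish to compare.
sample = "The cat sat on the mat by the door"
print(word_length_distribution(sample))
print(marker_word_rates(sample))
```

Computing these summaries for two bodies of text gives you distributions that can then be compared with the tools developed in the text, for example a chi-square test of homogeneity.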
Q: The placebo effect; does it exist?
An article in the New York Times of May 27th, 2001 discussed recent research casting doubt on a heretofore widely accepted "fact": that one third of patients given a "dummy pill or sham treatment" get better. There has been much research over the years attempting to explain the placebo effect.
Two Danish researchers discovered on careful examination that, despite numerous cross citations, there seemed to be a single source: a paper published in 1955 by Henry Beecher entitled "The Powerful Placebo." Dr. Beecher examined 15 studies that compared placebo effects to active treatment, but ignored the patients who got worse, and concluded that about one third got better with the placebo. The Danish researchers, Drs. Gotzsche and Hrobjartsson, studied 7500 patients over 114 studies that included patients who were given nothing as a control group. They concluded that those who got nothing got better at the same rate as those given a placebo; in short, the placebo effect is "nothing more than a medical legend."
Given what you have learned in this text, comment on this result. Indicate how you would have designed the original study, how you would have chosen the sample size, and comment on the importance in such studies of a control group. Indicate some of the reasons why a control group is so important.
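To make the sample size part of this question concrete, here is a rough Python sketch using the usual normal-approximation formula for comparing two proportions with equal group sizes. The improvement rates used in the example (25% with no treatment, 33% with a placebo), the significance level, and the power are assumptions chosen purely for illustration; they are not taken from the studies discussed above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_placebo, alpha=0.05, power=0.80):
    """Approximate patients needed per group for a two-sided test of
    p_control == p_placebo, using the normal approximation."""
    if p_control == p_placebo:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_placebo) / 2
    # Spread under the null (common rate) and under the alternative.
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(
        p_control * (1 - p_control) + p_placebo * (1 - p_placebo)
    )
    return ceil((null_term + alt_term) ** 2 / (p_control - p_placebo) ** 2)


# Hypothetical rates: 25% improve untreated, 33% improve on a placebo.
print(sample_size_two_proportions(0.25, 0.33))
```

Try smaller differences between the two rates and note how quickly the required sample size grows.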
Q: Shark Attacks
An editorial in the New York Times of September 6th, 2001 commented on the recent increase in reported shark attacks, saying that the incidents were isolated and that more people are in the water, so more attacks are being reported. In a subsequent article in The Science Times, September 21st, 2001, fishermen were quoted as blaming the ban on shark fishing by Federal authorities, while fishery biologists blamed the fishermen: by overfishing certain species of shark, they have increased the relative number of bull sharks, which are the most aggressive. We have here two types of opinion. On one view, there is no real increase in shark attacks, merely a statistical blip that has occurred by chance; at worst, shark attacks are tracking increased bathing in shark-prevalent areas. On the other view, there is agreement that there has been an increase in shark attacks, but a dispute as to the causal mechanism. The article of the 21st presents a graph showing the number of Florida attacks since 1990, from which it is clear that in 1994 there was a sizable increase in the average number of attacks. In 1993 the government began an aggressive program to protect Atlantic shark populations from commercial and sports fishermen.
Discuss this problem and explain how you would address these issues using statistical analysis. What data would you collect? Would you run experiments, and if so what experiments and under what conditions?
A useful source of information to aid you in your investigation of this topic is: http://www.flmnh.ufl.edu/fish/Sharks/ISAF/ISAF.htm.
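One possible first step, sketched below in Python, is to ask whether an apparent jump in the attack rate could plausibly be a chance blip. The sketch treats attacks in each period as Poisson counts and uses the standard conditional argument: given the total number of attacks, the number falling in the later period is binomial if the rate per unit of exposure is unchanged. The counts and exposures in the example are hypothetical and are not the figures from the article or the web site above.

```python
from math import comb


def poisson_rate_increase_pvalue(count_before, exposure_before,
                                 count_after, exposure_after):
    """One-sided exact p-value for H0: equal attack rates per unit exposure,
    against the alternative that the later rate is higher."""
    total = count_before + count_after
    # Under H0, each attack lands in the later period with this probability.
    share = exposure_after / (exposure_before + exposure_after)
    return sum(
        comb(total, k) * share**k * (1 - share) ** (total - k)
        for k in range(count_after, total + 1)
    )


# Hypothetical data: 60 attacks in 4 years, then 170 attacks in 7 years.
print(poisson_rate_increase_pvalue(60, 4, 170, 7))
```

Here exposure is measured in years. Measuring it instead in, say, estimated bather-days would be one way to examine the claim that attacks are merely tracking increased bathing.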
Q: Disease Clusters
A recent edition of the Journal of the Royal Statistical Society, Series A, volume 164, Part 1 was devoted to the topic of "disease clusters." In ecological studies, data are often grouped by geographical areas and by time. If we ignore contagious and infectious diseases, and a grouping of cases is observed within some geographical region and within a relatively short time interval, the question arises as to whether we are observing a random occurrence or an event that requires investigation. Because there are many potential biases in estimating the parameters of spatial distributions, such effects have been summarized as the "ecological fallacy."
To compound the problem, researchers are usually required to investigate actual clusters for which some causal mechanism is assigned on the basis of observed correlations. For example, observers may notice a higher than average number of skin cancers in a given community and that recently a cell 'phone tower was installed. There is a tendency to claim that the observed correlation "proves" that the erection of the tower caused the skin cancers.
Explore this topic and debate how you might devise a statistical test to distinguish whether cell 'phone towers cause skin cancer.
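Before debating causes, it helps to see how often an apparent cluster arises purely by chance when many communities are examined. The Python sketch below is a simple Monte Carlo experiment under strong assumptions that are mine, not the Journal's: every community has the same population, every case is assigned to a community independently and uniformly at random, and the numbers of cases and communities are hypothetical.

```python
import random
from collections import Counter


def prob_some_cluster(n_cases, n_communities, cluster_size,
                      n_sims=2000, seed=0):
    """Estimate the probability that at least one community receives
    cluster_size or more cases when cases are scattered purely at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Scatter the cases uniformly over equally sized communities.
        counts = Counter(rng.randrange(n_communities) for _ in range(n_cases))
        if max(counts.values()) >= cluster_size:
            hits += 1
    return hits / n_sims


# Hypothetical setting: 400 cases over 200 communities (2 per community
# on average); how often does some community show 8 or more cases?
print(prob_some_cluster(400, 200, 8, n_sims=1000))
```

Compare the estimate with the probability that one community named in advance shows that many cases. The gap between the two illustrates why a cluster noticed after the fact is weak evidence of a causal mechanism.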
Q: Significant Differences from Overlapping Intervals
N. Schenker and Jane F. Gentleman, in the article "On the Significance of Differences by Examining the Overlap Between Confidence Intervals," American Statistician, 55, #3, 182-6, discussed the practice of declaring a difference between two means significant if the corresponding confidence intervals do not overlap, and not significant if they do. They show that, relative to the standard procedure of calculating the variance of the difference and testing whether that difference is zero, the overlap method at the same nominally assigned alpha level is conservative: it has a lower probability of Type I error and lower power.
Devise a computer experiment to check this result in the simple case where both population variances are the same. Recheck when the variances differ substantially. Comment on your results.
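One way to set up such an experiment is sketched below in Python. It draws two normal samples repeatedly and records how often each procedure declares a significant difference. For simplicity both procedures use the normal critical value with estimated standard errors; the sample size, the number of replications, and the parameter values are arbitrary choices of mine for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def rejection_rates(mu1, mu2, sd1, sd2, n=30, alpha=0.05,
                    n_sims=2000, seed=0):
    """Estimate how often the overlap rule and the standard test each
    declare the two means different."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    overlap_rejects = standard_rejects = 0
    for _ in range(n_sims):
        x = [rng.gauss(mu1, sd1) for _ in range(n)]
        y = [rng.gauss(mu2, sd2) for _ in range(n)]
        se_x = stdev(x) / sqrt(n)
        se_y = stdev(y) / sqrt(n)
        diff = abs(mean(x) - mean(y))
        # The two intervals fail to overlap when the gap between the
        # means exceeds the sum of the two half-widths.
        if diff > z * (se_x + se_y):
            overlap_rejects += 1
        # Standard test based on the standard error of the difference.
        if diff > z * sqrt(se_x**2 + se_y**2):
            standard_rejects += 1
    return {"overlap": overlap_rejects / n_sims,
            "standard": standard_rejects / n_sims}


# Null hypothesis true: equal variances, then very unequal variances.
print(rejection_rates(0.0, 0.0, 1.0, 1.0))
print(rejection_rates(0.0, 0.0, 1.0, 5.0))
```

Setting mu2 different from mu1 turns the same function into a power comparison. Compare the two rates in each case with the nominal alpha.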
Q: Is the Property of Being Positively Correlated Transitive?
Click Here for this Question.
Q: Is Hormone Therapy a Failure?
On Wednesday, July 10, there were prominent articles in the Wall Street Journal and the New York Times reporting that the latest research on hormone treatment, estrogen plus progestin, indicated that the treatment's negative side effects outweighed the claimed benefits. The planned period of the study was 8.5 years. In light of the current findings the study was halted and the women in the experiment were told to stop taking the treatment, because it was determined that the hormone replacement regimen did more harm than good. Needless to say, this was a shock for the approximately six million women who had been taking the treatment. The Times stated that "A rigorous study found that the drugs, a combination of estrogen and progestin, caused small increases in breast cancer, heart attacks, strokes and blood clots. Those risks outweighed the drugs' benefits - a small decrease in hip fractures and a decrease in colorectal cancer."
There were over 16,000 women in the study, so we are not dealing with a small study. You can access the Journal of the American Medical Association article at http://jama.ama-assn.org/issues/v288n3/fig_tab/joc21036_t2.html. Further information can be obtained from the WHI HRT Update, http://www.nhlbi.nih.gov/whi/hrtupd/upd2002.html. This latter report states that about 2.5% of women in the study had one or more heart attacks, strokes, blood clots, etc.; that is, "for every 10,000 women taking estrogen plus progestin, we would expect:
18 more women with blood clots.
These results also suggest that for every 10,000 women taking estrogen plus progestin, we would expect:
In the JAMA article cited above you will see the numbers of women in the study group who took the treatment and the numbers who were given a placebo and who suffered various medical problems. The table cited shows the "hazard ratio"; that is, the ratio of the probabilities of contracting a medical problem with and without the treatment. If the ratio is greater than one, there is a greater risk of a medical problem under the treatment than under the placebo. With eight exceptions out of twenty-two medical conditions tracked in the study, the ratios were all greater than one. On its face, this appears to be an overwhelming case against hormone therapy, and it was so treated by the authors of the report. As the reader will appreciate, this news caused great consternation and debate among women, but little discernible analysis of the statistical implications.
You should first recognize that we are dealing with estimates and that the estimates are subject to error, certainly random variation from sample to sample. One has to ask how significant these results are; you should recall here the concepts of both statistical and operational significance. While the conditions of the experiment seem to be well thought out and carefully executed, the results were apparently cited without due recognition of the statistical difficulties in drawing well supported inferences. Recognize first that the actual numbers for the differences are small, 5, 6, 7, 8 per 10,000 women as cited above, so that there is some difficulty in interpretation.
In order to obtain a better feel for these results, examine the data in the JAMA article. With a sample size of over 8,000 in the treatment and the control groups, we can safely rely on approximate normality for the estimates of the probabilities and on the integrity of the researchers for statistical independence of the two groups. Devise a test for the statistical significance of the differences in the estimated probabilities of occurrences of the selected medical events. You will soon discover that only a few of the differences are statistically significant, so that this alone should cause some concern for the hasty cancellation of the trials. Further, reconsider the matter from the point of view of the operational significance.
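One test of the kind asked for here is the usual large-sample test for the difference between two proportions, sketched below in Python. The function name and the counts in the example are hypothetical placeholders of mine; substitute the actual numbers of events and of women in each group from the JAMA table.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(events_treat, n_treat, events_ctrl, n_ctrl):
    """Two-sided z test of equal event probabilities in two independent
    groups, using the pooled estimate of the common probability."""
    p_treat = events_treat / n_treat
    p_ctrl = events_ctrl / n_ctrl
    pooled = (events_treat + events_ctrl) / (n_treat + n_ctrl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    z = (p_treat - p_ctrl) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, NOT the study's data: 160 events among 8,500
# treated women versus 120 events among 8,100 women given a placebo.
z, p_value = two_proportion_z_test(160, 8500, 120, 8100)
excess_per_10000 = 10000 * (160 / 8500 - 120 / 8100)
print(z, p_value, excess_per_10000)
```

The last quantity, the excess number of events per 10,000 women, speaks to operational rather than statistical significance. Remember also that applying the test to each of many medical events raises the chance that at least one difference appears significant by chance alone.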
There is one further aspect of some concern that you should consider: 14 of 22 medical events had hazard ratios greater than one. That is, while very few of the differences are statistically significant, the estimated probabilities are more often greater for the treatment than for the placebo. But be careful; the occurrences of multiple medical events are not independent, and the degree of association between medical events may be enhanced by the treatment. This identifies a further effect that may well impact on the analysis: the treatment affects the degree of correlation between various medical events.
From this analysis, you should learn that even in very carefully controlled and executed clinical trials, a deep understanding of statistical procedures is needed to interpret the results correctly. What at first sight seems to be overwhelming evidence becomes, on closer examination, far more problematical. Think carefully about your conclusions and how you would advise someone contemplating taking hormone therapy treatment. How would you devise experiments to help resolve these issues?
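As a rough benchmark for the count of 14 out of 22, the Python sketch below computes a simple sign test: the probability of seeing 14 or more ratios above one if each of the 22 events were independently equally likely to fall on either side. The independence assumption is exactly what this paragraph warns against, so treat the result only as a reference point, not as a valid test for these data.

```python
from math import comb


def sign_test_upper_tail(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 1/2)."""
    return sum(comb(trials, k) for k in range(successes, trials + 1)) / 2**trials


# 14 of the 22 tracked medical events had hazard ratios above one.
print(sign_test_upper_tail(14, 22))
```

The result is about 0.14, so even under the favorable assumption of independence a split of 14 to 8 is not especially unusual.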
Q: The Effect of the Relative Cost of Types I and II Errors on Decisions
Click Here for this Question.
Examples of the Use and Abuse of Statistics
This section is for you to record for other readers of the text your favorite examples of the exemplary use of statistical reasoning and procedures and of the abuse of statistical thinking. I hope that this section will soon contain an array of instructive and amusing examples submitted by readers and colleagues.
The text has pointed out repeatedly that the theory of statistics is what enables us to interpret the observed data as other than observations of the moment. However, often data are collected that are not analyzed in the sense that we have used above, but merely to support a preconceived position. In such circumstances the ideas of random sampling; looking at all the data, not just those that support one's position; estimation precision, tests of hypotheses, and so on, do not have any meaning at all; they are essentially irrelevant to the discussion. Unfortunately, the data so amassed are still labeled "statistics," even though no statistician would recognize them as such. The quotes listed below illustrate this contention and should be a strong warning of the lengths to which some will go to provide misleading "numbers."
The quotes below are extracted from a wonderful book, The Skeptical Environmentalist, by Bjorn Lomborg, published by Cambridge University Press, 2001. The quotes are from chapter 23, various pages. I urge you to obtain a copy of the text itself as it provides a clear insight into how "statistics" can be manipulated. Beware!
We lose something in the region of 40,000 species every year, 109 a day. One species will be extinct before you have finished reading this chapter.
This was what we were told 22 years ago when Norman Myers first published his book The Sinking Ark in 1979.
The original estimate of 40,000 species lost every year came from Myers in 1979. His arguments make astonishing reading.
"Yet even this figure seems low…Let us suppose that, as a consequence of this man-handling of natural environments [the clearing of tropical forest], the final one-quarter of this century witnesses the elimination of 1 million species - a far from unlikely prospect. This would work out, during the course of 25 years, at an average extinction rate of 40,000 species per year, or rather over 100 species per day."
This is Myers' argument in its entirety. If we assume that 1 million species will become extinct in 25 years, that makes 40,000 a year. A perfectly circular argument.
This assertion is 40,000 times greater than his own data, 10,000 times the latest observed rate and 400 times the maximum guess as seen in Figure 131.
Colinvaux admits in Scientific American that the rate is "incalculable." Even so, E.O. Wilson attempts to put a lid on the problem with the weight of his authority: "Believe me, species become extinct. We're easily eliminating a hundred thousand a year". His figures are "absolutely undeniable" and based on "literally hundreds of anecdotal reports."
Similarly, Ariel Lugo explains that "no credible effort" has yet been made to pin down the scientific assumptions behind the mega-extinction scenario. "But," he adds, "if you point this out, people say you are collaborating with the devil."
According to Professor Ehrlich, we do not know just how many species are becoming extinct each year. Yet, "biologists don't need to know how many species there are, how they are related to one another, or how many disappear annually to recognize that Earth's biota is entering a gigantic spasm of extinction." This is a most surprising statement. Apparently it alleviates scientists of the need to demonstrate the amount of losses as long as they can feel they are right. Such a statement seems to abandon the ordinarily assumed duty of scientists to objectively gather evidence to help society make real, well-informed choices.
Jared Diamond, a professor at UCLA and the author of well-known books such as The Third Chimpanzee and the Pulitzer Prize winner Guns, Germs and Steel, actually develops Ehrlich's idea. He emphasizes that we can only know something about the familiar species in the developed part of the world (where practically no extinction has taken place). For this reason we ought to reverse the burden of proof and assume that all species are extinct unless their existence can be proven. "We biologists should not bear the burden of proof to convince economists advocating unlimited human population growth [overconfident economists] that the extinction crisis is real. Instead, it should be left to those economists to fund research in the jungles that would positively support their implausible claim of a healthy biological world." [Emphasis added.]
The biologists seriously argue that any skeptic should himself go to the jungle and carry out the biologists' research, because the biologists already know that things are going askew. In reality, of course, they are asking society for a blank check to prevent something which is claimed to be a catastrophe (50 percent over the next 50 years) but which is not supported by data (indicating a problem in the region of 0.7 percent over the next 50 years).
The marvelous aspect of this set of quotes is that one needs to know very little statistical theory to recognize that what we have in this example is propaganda of the worst sort. Unfortunately, with constant repetition, such numbers as "40,000 species lost per year" begin to take on a life of their own. The purported truth of the statement becomes a well known fact, irrespective of all evidence to the contrary; actual evidence ceases to be believed.
Reader beware!