[ExI] Alcock on Bem in Skeptical Inquirer
Damien Broderick
thespike at satx.rr.com
Sat Dec 4 23:08:37 UTC 2010
On 12/4/2010 4:41 AM, BillK wrote:
> Good article!
Terrible article, filled with errors and disgraceful innuendo, including
unjustified imputations of dishonesty ("one cannot help but wonder if
two experiments were indeed run, and each failed to produce significant
results, and so the data from the two were combined, with the focus
shifted to only the erotic pictures common to all participants," even
though Alcock adds disingenuously, "Surely that was not done, for such
an action would make a mockery of experimental rigor").
> With nice historical context.
A biased and misleading "context". Full of stuff like "Because of the
lack of clear and replicable evidence, the Ganzfeld procedure has not
lived up to the promise of providing the long-sought breakthrough that
would lead to acceptance by mainstream science." In fact, the ganzfeld
data are replicable and have been replicated (as shown in almost all the
papers he himself cites), despite a botched critique by Richard Wiseman.
Professor Bem posted a quick reply to the careless critique of his
paper, which he allows me to repost here:
=======
I strongly dispute Alcock's charge that I ran multiple t tests that
should have been statistically corrected for multiple comparisons.
For each experimental hypothesis, I did a single t test. For example,
based on presentiment experiments in which subjects have shown
pre-stimulus arousal to erotic stimuli, my first experiment tested the
straightforward prediction that subjects would be able to select the
curtain behind which an erotic picture would appear significantly more
frequently than chance. (A blank wall was the non-target alternative on
each trial.) Since there are two curtains, the null is 50%.
Subjects achieved a mean score of 53.1%, which is significant by a
one-sample, one-tailed t test. But, because t tests assume normal
distributions, I also submitted that same figure, .531, to a
nonparametric binomial test across all trials and sessions. Throughout
the article, I did the same thing, presenting a parametric test and a
nonparametric test on the same result. The point was to counter the
potential criticism that I relied on a statistical test that makes
assumptions about the underlying distribution. It was not a fishing expedition.
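[As a concrete illustration of the pairing of tests Bem describes, here is a
minimal sketch in Python/SciPy; the per-subject hit rates and the trial count
below are invented placeholders for the example, not the actual data from the
experiment.]

    # Illustrative sketch only; all values below are invented, not Bem's data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    hit_rates = rng.normal(loc=0.531, scale=0.1, size=100)  # 100 hypothetical subjects

    # Parametric: one-sample, one-tailed t test against the 50% chance level.
    t_res = stats.ttest_1samp(hit_rates, popmean=0.50, alternative="greater")
    print(f"t = {t_res.statistic:.2f}, one-tailed p = {t_res.pvalue:.4f}")

    # Nonparametric check on the same result: exact binomial test on the
    # pooled hit count across all trials and sessions.
    n_trials = 100 * 36                   # e.g., 36 trials per subject (assumed)
    n_hits = int(round(0.531 * n_trials))
    b_res = stats.binomtest(n_hits, n_trials, p=0.50, alternative="greater")
    print(f"binomial p = {b_res.pvalue:.4f}")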
In that same study, I left as an open question whether there was
something unique about erotic stimuli above and beyond their high
arousal level and positive valence. It might be that subjects could
significantly detect other future stimuli, too, especially stimuli with
high arousal and positive valence. I discovered that—at least in this
unselected population—subjects could not. I did one t test showing that
they scored significantly higher on erotic stimuli than on nonerotic
stimuli and another t test showing that their performance on nonerotic
stimuli did not differ from chance.
Finally, I did t tests showing that they did not differ from chance on
any of the subcategories of nonerotic stimuli either (e.g., negative
stimuli, neutral stimuli, positive stimuli, romantic-but-nonerotic
stimuli). So, yes, if one glances at the page, one will see many t
tests, but they are all in the service of showing no significant effects
on nonerotic stimuli. Correcting the p levels for multiple tests would
have revealed—voila!—no significant psi hitting on nonerotic stimuli.
The objection would have had more merit if I had found and then claimed
that one subtype of nonerotic stimuli (e.g., romantic stimuli) had
shown significant psi hitting.
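[The arithmetic behind this point is simple: a Bonferroni-style correction
multiplies each p value by the number of tests, so p values that are already
nonsignificant can only move further from significance. A toy sketch with
invented p values for the subcategories:]

    # Invented p values for the nonerotic subcategories (not the reported ones).
    p_values = {"negative": 0.41, "neutral": 0.73, "positive": 0.28, "romantic": 0.55}
    m = len(p_values)                     # number of tests being corrected for

    for label, p in p_values.items():
        p_corrected = min(1.0, p * m)     # Bonferroni: multiply by m, cap at 1
        print(f"{label:>9}: uncorrected p = {p:.2f}, corrected p = {p_corrected:.2f}")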
A similar misreading of multiple tests occurs in Experiments 2 & 7,
where I expressed concerns about potential nonrandomness in the
computer's successive left/right placements of targets. To counter this
possibility, I did 4 different analyses of the data (3 of them involving
t tests), each one controlling in a different way for possible
nonrandomness. So, yes, if one glances superficially at Tables 2, 3,
and 6, it looks like a lot of t tests were conducted. But every test
was aimed at showing that the same conclusion arises from different
treatments of the same data. This is not the same thing as conducting
several t tests on different portions of the data and then concluding
that one of them showed a significant p level.
Ironically, the whole point of multiple tests here was to demonstrate
that my statistical conclusions were the same no matter which kind of
test I conducted and to defend against the potential charge that I must
have tried several statistical tests and then cherry-picked and reported
only the one that worked.
File this under the maxim that no good deed goes unpunished.
I have not yet taken the time to analyze the negative correlation
between effect size and sample size that Alcock reports, which is a
legitimate concern. A similar debate occurred between Honorton and Hyman on a
similar negative correlation found across ganzfeld experiments. But,
unlike the ganzfeld database, which included many data points, Alcock's
calculation could not have had more than 9 data pairs to correlate.
Correlations are notoriously unstable with such low numbers. I suspect
the entire correlation rests on Experiment 7, which I designed to check
out a serendipitous finding from the previous experiment and which
hence called for a large number of subjects (200), and Experiment 9, a
highly successful 50-subject replication of the previous Retroactive
Recall experiment. The other experiments offered very little variation;
most of them involved 100 subjects.
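[To see how fragile a nine-point correlation can be, here is a sketch using
invented (sample size, effect size) pairs shaped roughly like the design
described above (mostly 100-subject studies, one with 200 and one with 50);
they are placeholders, not Bem's results. Dropping a single experiment can
change the correlation substantially.]

    # Invented (sample size, effect size) pairs; placeholders, not actual results.
    import numpy as np
    from scipy import stats

    n_subjects   = np.array([100, 100, 100, 100, 100, 150, 200, 100, 50])
    effect_sizes = np.array([0.25, 0.20, 0.23, 0.22, 0.19, 0.18, 0.09, 0.21, 0.42])

    r_all, _ = stats.pearsonr(n_subjects, effect_sizes)
    print(f"all 9 pairs: r = {r_all:.2f}")

    # Leave-one-out: recompute the correlation with each experiment dropped in turn.
    for i in range(len(n_subjects)):
        keep = np.arange(len(n_subjects)) != i
        r_i, _ = stats.pearsonr(n_subjects[keep], effect_sizes[keep])
        print(f"without experiment {i + 1}: r = {r_i:.2f}")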
I note, too, that many of the critics of my article accuse me of running
exploratory experiments, even though the predictions are always simply
that I will find the same effect that is found in non-time-reversed
versions of these standard effects. Even more frequently overlooked is
that 4 of my 9 experiments are themselves replications of the experiment
that immediately preceded them. (Hence, Retroactive Priming I and II;
Retroactive Habituation I and II; and Retroactive Recall I and II). I
did this, in part, to make sure that I wasn't misleading myself because
of forgotten pilot testing conducted to work out the procedures of each
initial experiment.
On another matter, I have now posted the complete replication package
for the Retroactive Recall 1.1 experiment at http://dbem.ws/psistuff.
=============
Damien Broderick