[ExI] Alcock on Bem in Skeptical Inquirer
Damien Broderick
thespike at satx.rr.com
Sat Dec 4 23:08:37 UTC 2010
On 12/4/2010 4:41 AM, BillK wrote:
> Good article!
Terrible article, filled with errors and disgraceful innuendo, including
unjustified imputations of dishonesty ("one cannot help but wonder if
two experiments were indeed run, and each failed to produce significant
results, and so the data from the two were combined, with the focus
shifted to only the erotic pictures common to all participants," even
though Alcock adds disingenuously, "Surely that was not done, for such
an action would make a mockery of experimental rigor").
> With nice historical context.
A biased and misleading "context". Full of stuff like "Because of the
lack of clear and replicable evidence, the Ganzfeld procedure has not
lived up to the promise of providing the long-sought breakthrough that
would lead to acceptance by mainstream science." In fact, the ganzfeld
data are replicable and have been replicated (as shown in almost all the
papers he himself cites), despite a botched critique by Richard Wiseman.
Professor Bem posted a quick reply to the careless critique of his
paper, which he allows me to repost here:
=======
I strongly dispute Alcock's charge that I ran multiple t tests that
should have been statistically corrected for multiple comparisons.
For each experimental hypothesis, I did a single t test. For example,
based on presentiment experiments in which subjects have shown
pre-stimulus arousal to erotic stimuli, my first experiment tested the
straightforward prediction that subjects would be able to select the
curtain behind which an erotic picture would appear significantly more
frequently than chance. (A blank wall was the non-target alternative on
each trial.) Since there are two curtains, the null is 50%.
Subjects achieved a mean score of 53.1%, which is significant by a
one-sample, one-tailed t test. But, because t tests assume normal
distributions, I also submitted that same figure, .531, to a
nonparametric binomial test across all trials and sessions. Throughout
the article, I did the same thing, presenting a parametric test and a
nonparametric test on the same result. The point was to counter the
potential criticism that I relied on a statistical test that makes
assumptions about the underlying distribution. It was not a fishing expedition.
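[As a concrete illustration of the pairing of tests Bem describes, here is a
minimal sketch in Python/SciPy; the per-subject hit rates and the trial count
below are invented placeholders for the example, not the actual data from the
experiment.]

    # Illustrative sketch only; all values below are invented, not Bem's data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    hit_rates = rng.normal(loc=0.531, scale=0.1, size=100)  # 100 hypothetical subjects

    # Parametric: one-sample, one-tailed t test against the 50% chance level.
    t_res = stats.ttest_1samp(hit_rates, popmean=0.50, alternative="greater")
    print(f"t = {t_res.statistic:.2f}, one-tailed p = {t_res.pvalue:.4f}")

    # Nonparametric check on the same result: exact binomial test on the
    # pooled hit count across all trials and sessions.
    n_trials = 100 * 36                   # e.g., 36 trials per subject (assumed)
    n_hits = int(round(0.531 * n_trials))
    b_res = stats.binomtest(n_hits, n_trials, p=0.50, alternative="greater")
    print(f"binomial p = {b_res.pvalue:.4f}")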
In that same study, I left as an open question whether there was
something unique about erotic stimuli above and beyond their high
arousal level and positive valence. It might be that subjects could
significantly detect other future stimuli, too, especially stimuli with
high arousal and positive valence. I discovered that—at least in this
unselected population—subjects could not. I did one t test showing that
they scored significantly higher on erotic stimuli than on nonerotic
stimuli and another t test showing that their performance on nonerotic
stimuli did not differ from chance.
Finally, I did t tests showing that they did not differ from chance on
any of the subcategories of nonerotic stimuli either (e.g., negative
stimuli, neutral stimuli, positive stimuli, romantic-but-nonerotic
stimuli). So, yes, if one glances at the page, one will see many t
tests, but they are all in the service of showing no significant effects
on nonerotic stimuli. Correcting the p levels for multiple tests would
have revealed—voila!—no significant psi hitting on nonerotic stimuli.
The objection would have had more merit if I had found and then claimed
that one subtype of nonerotic stimuli (e.g., romantic stimuli) had
shown significant psi hitting.
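[The arithmetic behind this point is simple: a Bonferroni-style correction
multiplies each p value by the number of tests, so p values that are already
nonsignificant can only move further from significance. A toy sketch with
invented p values for the subcategories:]

    # Invented p values for the nonerotic subcategories (not the reported ones).
    p_values = {"negative": 0.41, "neutral": 0.73, "positive": 0.28, "romantic": 0.55}
    m = len(p_values)                     # number of tests being corrected for

    for label, p in p_values.items():
        p_corrected = min(1.0, p * m)     # Bonferroni: multiply by m, cap at 1
        print(f"{label:>9}: uncorrected p = {p:.2f}, corrected p = {p_corrected:.2f}")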
A similar misreading of multiple tests occurs in Experiments 2 & 7,
where I expressed concerns about potential nonrandomness in the
computer's successive left/right placements of targets. To counter this
possibility, I did 4 different analyses of the data (3 of them involving
t tests), each one controlling in a different way for possible
nonrandomness. So, yes, if one glances superficially at Tables 2, 3,
and 6, it looks like a lot of t tests were conducted. But every test
was aimed at showing that the same conclusion arises from different
treatments of the same data. This is not the same thing as conducting
several t tests on different portions of the data and then concluding
that one of them showed a significant p level.
Ironically, the whole point of multiple tests here was to demonstrate
that my statistical conclusions were the same no matter which kind of
test I conducted and to defend against the potential charge that I must
have tried several statistical tests and then cherry-picked and reported
only the one that worked.
File this under the maxim that no good deed goes unpunished.
I have not yet taken the time to analyze the negative correlation
between effect size and sample size that Alcock reports, which is a
legitimate concern. A similar debate occurred between Honorton and Hyman on a
similar negative correlation found across ganzfeld experiments. But,
unlike the ganzfeld database, which included many data points, Alcock's
calculation could not have had more than 9 data pairs to correlate.
Correlations are notoriously unstable with such low numbers. I suspect
the entire correlation rests on Experiment 7, which I designed to check
out a serendipitous finding from the previous experiment and which
hence called for a large number of subjects (200), and Experiment 9, a
highly successful 50-subject replication of the previous Retroactive
Recall experiment. The other experiments offered very little variation;
most of them involved 100 subjects.
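[To see how fragile a nine-point correlation can be, here is a sketch using
invented (sample size, effect size) pairs shaped roughly like the design
described above (mostly 100-subject studies, one with 200 and one with 50);
they are placeholders, not Bem's results. Dropping a single experiment can
change the correlation substantially.]

    # Invented (sample size, effect size) pairs; placeholders, not actual results.
    import numpy as np
    from scipy import stats

    n_subjects   = np.array([100, 100, 100, 100, 100, 150, 200, 100, 50])
    effect_sizes = np.array([0.25, 0.20, 0.23, 0.22, 0.19, 0.18, 0.09, 0.21, 0.42])

    r_all, _ = stats.pearsonr(n_subjects, effect_sizes)
    print(f"all 9 pairs: r = {r_all:.2f}")

    # Leave-one-out: recompute the correlation with each experiment dropped in turn.
    for i in range(len(n_subjects)):
        keep = np.arange(len(n_subjects)) != i
        r_i, _ = stats.pearsonr(n_subjects[keep], effect_sizes[keep])
        print(f"without experiment {i + 1}: r = {r_i:.2f}")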
I note, too, that many of the critics of my article accuse me of running
exploratory experiments, even though the predictions are always simply
that I will find the same effect that is found in non-time-reversed
versions of these standard effects. Even more frequently overlooked is
that 4 of my 9 experiments are themselves replications of the experiment
that immediately preceded them. (Hence, Retroactive Priming I and II;
Retroactive Habituation I and II; and Retroactive Recall I and II). I
did this, in part, to make sure that I wasn't misleading myself because
of forgotten pilot testing conducted to work out the procedures of each
initial experiment.
On another matter, I have now posted the complete replication package
for the Retroactive Recall 1.1 experiment at http://dbem.ws/psistuff.
=============
Damien Broderick