[Paleopsych] Science: Three Articles on Race

Thu Mar 31 21:01:12 UTC 2005

GENETICS: Harvesting Medical Information from the Human Family Tree
David Altshuler and Andrew G. Clark*

[More on racial medicine. The third article is a decreasingly obligatory 
one not to use racial information for evil purposes.]

              A central goal of human genetics is to identify and understand
causal links between variant forms of genes and disease risk in patients.
To date, most progress has been made studying rare, Mendelian diseases in
which a mutation in a single gene acts strictly in a deterministic manner,
that is, the mutation causes the disease. The fact that such mutations
strictly cosegregate with disease in families offers a shortcut to
identifying the relevant chromosomal region, and means that the enrichment
of mutations in patients with the disease compared with healthy controls
can be convincingly documented in small numbers of individuals. In
contrast, common diseases typically are caused by a complex combination of
multiple genetic risk factors, environmental exposures, and behaviors.
Because mutations involved in complex diseases act probabilistically--that
is, the clinical outcome depends on many factors in addition to variation
in the sequence of a single gene--the effect of any specific mutation is
smaller. Thus, such effects can only be revealed by searching for variants
that differ in frequency among large numbers of patients and controls
drawn from the general population.

              Limited knowledge about genetic variants in the human
population, and the scarcity of technologies to score them accurately and
at a reasonable cost, have been key impediments to performing this type of
search. On page 1072 of this issue, Hinds et al. (1) describe dramatic
progress toward overcoming these impediments. They describe a publicly
available, genome-wide data set of 1.58 million common single-nucleotide
polymorphisms (SNPs)--genome sequence sites where two alternative
"spellings" exist in the population--that have been accurately genotyped
in each of 71 people from three population samples. A second public data
set of more than 1 million SNPs typed in each of 270 people has been
generated by the International Haplotype Map (HapMap) Project (2). These
two public data sets, combined with multiple new technologies for rapid
and inexpensive SNP genotyping, are paving the way for comprehensive
association studies involving common human genetic variations.

              The rationale for genetic mapping by association to a dense
map of common polymorphisms is based on two observations. The first is
that most heterozygosity in the human population is due to a finite
collection of common variants (on the order of 10 million and with a
frequency exceeding 1%). The second is that nearby variants tend to
correlate with one another in the population (known as linkage
disequilibrium). Correlations among variants exist because when a mutation
first arises, it does so on a single chromosome that carries a particular
combination of alleles at flanking polymorphisms. Over time the mutation
may spread to become common in a population, carrying with it the nearby
flanking markers (see the figure). This correlation is eroded over the
generations by recombination, just as in a pedigree study, except that the
time scale may be thousands of generations, instead of one or two. In
essence, genetic mapping with linkage disequilibrium treats the entire
human population as a large family study with an unknown pedigree. The use
of unrelated individuals makes it feasible to obtain sample sizes large
enough to demonstrate modest relationships between genotype and phenotype
through statistical associations.
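
              The erosion of correlation by recombination can be made
concrete with a small sketch. This is an illustration added here, not
something from the article: it assumes the standard textbook model in which
disequilibrium between two sites shrinks by a factor (1 - c) per generation,
where c is the recombination fraction between them, and it assumes a round
genome-average rate of 1e-8 recombinations per base pair per generation.
The names ld_remaining and RATE_PER_BP are hypothetical.

```python
def ld_remaining(recomb_fraction: float, generations: int) -> float:
    """Fraction of the initial disequilibrium left after `generations`.

    Assumes the simple deterministic model D_t = D_0 * (1 - c)^t, ignoring
    drift, selection, and recombination hotspots.
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - recomb_fraction) ** generations


# Assumed average rate (roughly 1 cM per Mb); real rates vary widely.
RATE_PER_BP = 1e-8

if __name__ == "__main__":
    for distance_bp in (1_000, 10_000, 100_000, 1_000_000):
        c = min(0.5, distance_bp * RATE_PER_BP)
        kept = ld_remaining(c, generations=1000)
        print(f"{distance_bp:>9} bp: {kept:.3f} of initial LD after 1000 generations")
```

Under these assumptions, markers a few kilobases apart keep most of their
correlation over a thousand generations, whereas markers a megabase apart
keep almost none, which is why dense maps are needed.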

                       Gene mapping in families and in populations. (Left)
The cotransmission of genes in families remains an important approach for
the genetic mapping of human diseases. Pedigrees are collected, and
genetic markers of many types (SNPs, microsatellites, insertions and
deletions) are scored in each individual. Computer programs then calculate
the probability that the pattern of transmission through the family is
consistent with linkage of the disease and certain markers. (Right) For
linkage disequilibrium mapping, the time scale is much longer, going back
thousands of years. The diagram depicts a gene genealogy. At the top is an
ancestral chromosome, with time flowing down the page, and the tips of the
tree are individual chromosomes in the population today. Across a
population sample, linkage is inferred if there is a statistical
correlation (linkage disequilibrium) between the disease and a SNP marker.
Numbers indicate mutations that generate SNP variations. A Mendelian
disease is caused by mutation 2 (blue); all descendant chromosomes also
carry mutation 1. Because recombination may occur over many generations,
this correlation between variants is found only when the two are very
close together (less than about 100 kb).

              Patterns of linkage disequilibrium are shaped by the local
recombination rate, genealogical history, and chance. In the human genome,
recombination is highly variable (3) and often clusters in regions of
local high intensity or "hotspots" (4). Moreover, the human population
has expanded recently from a much smaller founder pool, experiencing
bottlenecks as well as expansions in its history. These forces combine to
make human SNP patterns simpler and broader, such that they extend over
longer distances than would otherwise be the case. Nevertheless, because
the typical span of linkage disequilibrium is from thousands to more than
100,000 base pairs, genetic maps of very high density are needed to use
linkage disequilibrium for mapping genes. These goals motivated the
creation of the public human SNP map, which today contains more than 8
million variants (5). Developing genotyping assays for large numbers of
these variants, determining their frequencies in population samples, and
establishing their patterns of correlation have been the goals of both
Perlegen--a private company whose work is described in the Hinds et al.
(1) paper--and of the International HapMap Project.

              In their study, Hinds et al. describe the genotyping of 1.58
million SNPs in each of 71 individuals. Critically, the authors document
that the data are highly accurate. They identified 157,000 SNPs and nine
individuals in their own data set that had also been collected and
released by the public HapMap Project (6). Comparing these overlapping
genotypes, the authors show that both data sets are of exceptionally high
accuracy: 99.6% of genotype calls were identical in the two independent
studies. High-quality data are extremely important because the goal is to
identify associations among variants. Errors cause both an underestimation
of correlation and an overestimation of diversity.

              The SNPs genotyped by Hinds et al. are distributed across the
genome, but as with all methods, certain biases of experimental design
have shaped the data collected. Two major biases arise in the Perlegen
study: One is caused by a desire to study variants that have appreciable
frequency in the population (also shared by the HapMap project); the other
is a particular technical aspect of the oligonucleotide chips used by
Perlegen that limits analysis to unique (nonrepetitive) DNA sequences. A
critical question is how completely this subset of SNPs allows prediction
of the larger set of all common variants.

              To answer this question, the authors cleverly included in
their study a set of DNA samples in which a large collection of genes had
been resequenced by a project at the University of Washington called
SeattleSNPs (7). Hinds and colleagues measured the fraction of variants
in this more complete data set that could be highly correlated with
Perlegen's less complete set of genetic markers. The results are
encouraging: 73% of all common variants in the SeattleSNP genes showed a
strong correlation (r^2 > 0.8) with Perlegen's 1.58 million SNPs, and the
mean correlation coefficient was 0.84. Moreover, the authors find that for
future studies, an equivalent level of statistical power can be maintained
by typing a selected set of just 300,000 SNPs in the samples with ancestry
from Europe and Asia, and 500,000 SNPs in the African American sample.
Because it is likely that the pairwise method of tagging used by the
authors is conservative (8, 9), even fewer markers would be likely to
achieve a similar power.

              In addition to the potential utility for disease research,
such data are an excellent resource for population and evolutionary
geneticists. Of particular interest is the inference of past natural
selection. For example, if a mutation with a strong positive effect is
"swept" to fixation, it leaves a footprint of low diversity and a skewed
spectrum of allele frequencies nearby (10). Methods that incorporate
information on this genomic scale are being devised to find these
selective footprints. The hope is, of course, to locate genes that have
evolved under positive selection in the recent history of humans,
presumably because those changes were required for local adaptation to
different environments. Some of these changes may cause differences in
susceptibility to modern diseases in today's human populations.

              Although the data described by Hinds et al. represent a major
step forward, much more is needed to develop the resources for
comprehensive genetic association studies. As the number and density of
SNPs typed in reference samples become more complete, the power and
efficiency of the markers selected will rise. In this regard, it is
exciting that Perlegen and the public HapMap Project are now working
together to generate an even denser map for the 270 HapMap samples.
Integrating these SNP data with duplication, deletion, and inversion
polymorphisms (11, 12) will be required to fully capture all common
sequence variations. It will be important to document how well allele
frequencies and patterns of linkage disequilibrium observed in the 71
samples studied by Perlegen, and the 270 samples studied by the HapMap
Project, will project over disease cohorts collected across the globe.
Collecting data on diet, exercise, and relevant environmental exposures in
long-term studies is key if we are ever to understand the confounding
roles of genes and environment in influencing disease risk. Although there
are many promising technologies for collecting genotype data, there is an
acute need for improved methods to analyze these data for association with
disease and to achieve robust results (13).

              Ultimately, a complete description of each disease will
require finding all variants, common and rare, and understanding their
interactions with one another, with environmental exposures, and with
multiple disease phenotypes. Association studies with common variants
represent a screening method to find the most prevalent genetic risk
factors. Although our population clearly contains common allelic variants
that contribute to disease, the ultimate explanatory power of this
approach depends critically on the unknown frequency distribution of
genetic variants that contribute to disease risk, and on the magnitude of
the effect of each allelic variant. There may be diseases for which there
are no common alleles, presumably because the mutations that occurred long
ago have been lost due to purifying selection, leaving only the more
recent, rare mutations in the population. In such cases, because it is so
hard to demonstrate association with rare variants, even direct
resequencing data may be difficult to interpret. Where effects of common
alleles are particularly weak, or if they are entangled in complex
interactions with other genes and environmental factors, all methods will
have correspondingly lower power. Suppose, for example, that a disease had
the genetic architecture of the oil content of corn, where at least 50
genes, all of small effect, have been found to influence the trait (14,
15). Such a disease would demand an enormous amount of resources and
yield little predictive information of use to public health--although the
biological insights could still be of tremendous value. In short, we need
to pick the targets for these approaches judiciously, and to modify the
approach in light of what is learned.

              Although population genetic theory has played a vital role in
shaping our thinking about these problems (16), ultimately the
contribution of common and rare variants in complex disorders is an
empirical question that will only be answered by collecting data on an
adequate scale. It is exciting to live in a time when the necessary tools
are becoming available so that we can stop debating the hypothetical, and
turn our attention to what we can learn from the data about real human
diseases.

              References

                1. D. A. Hinds et al., Science 307, 1072 (2005).
                2. The International HapMap Consortium, Nature 426, 789
(2003).
                3. A. G. Clark et al., Am. J. Hum. Genet. 73, 285 (2003).
                4. G. A. McVean et al., Science 304, 581 (2004).
                5. See www.ncbi.nlm.nih.gov/projects/SNP/.
                6. See www.hapmap.org.
                7. D. C. Crawford et al., Am. J. Hum. Genet. 74, 610 (2004).
                8. K. R. Ahmadi et al., Nature Genet. 37, 84 (2005).
                9. D. M. Evans, L. R. Cardon, A. P. Morris, Genet.
Epidemiol. 27, 375 (2004).
                10. Y. Kim, W. Stephan, Genetics 164, 389 (2003).
                11. J. Sebat et al., Science 305, 525 (2004).
                12. A. J. Iafrate et al., Nature Genet. 36, 949 (2004).
                13. J. N. Hirschhorn, K. Lohmueller, E. Byrne, K.
Hirschhorn, Genet. Med. 4, 45 (2002).
                14. C. C. Laurie et al., Genetics 168, 2141 (2005).
                15. W. G. Hill, Science 307, 683 (2005).
                16. J. K. Pritchard, N. J. Cox, Hum. Mol. Genet. 11, 2417
(2002).
--------------------------------------------------------------------
              D. Altshuler is at the Broad Institute of Harvard and
Massachusetts Institute of Technology, and at the Massachusetts General
Hospital, Boston, MA 02114, USA. E-mail: altshuler at molbio.mgh.harvard.edu
A. G. Clark is in the Department of Molecular Biology and Genetics,
Cornell University, Ithaca, NY 14853, USA. E-mail: ac347 at cornell.edu
10.1126/science.1109682

              Volume 307, Number 5712, Issue of 18 Feb 2005, pp. 1052-1053.
-------------------

              Science, Vol 307, Issue 5712, 1072-1079 , 18 February 2005

              Whole-Genome Patterns of Common DNA Variation in Three Human
Populations

              David A. Hinds,1 Laura L. Stuve,1 Geoffrey B. Nilsen,1 Eran
Halperin,2 Eleazar Eskin,3 Dennis G. Ballinger,1 Kelly A. Frazer,1 David
R. Cox1*

              Individual differences in DNA sequence are the genetic basis
of human variability. We have characterized whole-genome patterns of
common human DNA variation by genotyping 1,586,383 single-nucleotide
polymorphisms (SNPs) in 71 Americans of European, African, and Asian
ancestry. Our results indicate that these SNPs capture most common genetic
variation as a result of linkage disequilibrium, the correlation among
common SNP alleles. We observe a strong correlation between extended
regions of linkage disequilibrium and functional genomic elements. Our
data provide a tool for exploring many questions that remain regarding the
causal role of common human DNA variation in complex human traits and for
investigating the nature of genetic variation within and between human
populations.

              1 Perlegen Sciences Inc., 2021 Stierlin Court, Mountain View,
CA 94043, USA.
              2 International Computer Science Institute, Berkeley, CA
94704, USA.
              3 Department of Computer Science and Engineering, University
of California-San Diego, La Jolla, CA 92093, USA.

              * To whom correspondence should be addressed. E-mail:
david_cox at perlegen.com

--------------------------------------------------------------------
              Single-nucleotide polymorphisms (SNPs) are the most abundant
form of DNA variation in the human genome. It has been estimated that
there are 7 million common SNPs with a minor allele frequency (MAF) of at
least 5% across the entire human population (1). Most common SNPs are to
be found in most major populations, although the frequency of any allele
may vary considerably between populations (2). An additional 4 million
SNPs exist with a MAF between 1 and 5%. In addition, there are innumerable
very rare single-base variants, most of which exist in only a single
individual.

              The relationship between DNA variation and human phenotypic
differences (such as height, eye color, and disease susceptibility) is
poorly understood. Although there is evidence that both common SNPs and
rare variants contribute to the observed variation in complex human traits
(3, 4), the relative contribution of common versus rare variants remains
to be determined. The structure of genetic variation between populations
and its relationship to phenotypic variation is unclear. Similarly, the
relative contribution to complex human traits of DNA variants that alter
protein structure by amino acid replacement, versus variants that alter
the spatial or temporal pattern of gene expression without altering
protein structure, is unknown. In some cases, these issues have been
studied in limited genomic intervals, but comprehensive genomic analyses
have not been possible.

              Genome-wide association studies to identify alleles
contributing to complex traits of medical interest are currently performed
with subsets of common SNPs, and thus they rely on the expectation that a
disease allele is likely to be correlated with an allele of an assayed
SNP. Although studies have shown that variants in close physical proximity
are often strongly correlated, this correlation structure, or linkage
disequilibrium (LD), is complex and varies from one region of the genome
to another, as well as between different populations (5, 6). Selection of
a maximally informative subset of common SNPs for use in association
studies is necessary to provide sufficient power to assess the causal role
of common DNA variation in complex human traits. Although a large fraction
of all common human SNPs are available in public databases, lack of
information concerning SNP allele frequencies and the correlation
structure of SNPs within and between human populations has made it
difficult to select an optimal subset.

              Here we examine the SNP allele frequencies and patterns of LD
between 1,586,383 SNPs distributed uniformly across the human genome in
unrelated individuals of European, African, and Asian ancestry. Our
primary aim was to create a resource for further investigation of the
structure of human genetic variation and its relationship to phenotypic
differences.

              A dense SNP map. To characterize a panel of markers that would
be informative in whole-genome association studies, we selected a total of
2,384,494 SNPs likely to be common in individuals of diverse ancestry (7).
We identified the majority (69%) of the SNPs by performing array-based
resequencing of 24 human DNA samples of diverse ancestry (5). These SNPs
were supplemented with SNPs chosen from public databases to obtain a more
uniform physical distribution across the human genome. Further details of
the SNP ascertainment are given in the supporting online material (7). We
designed 49 high-density oligonucleotide arrays for genotyping these SNPs
(8, 9) and roughly 300,000 long-range polymerase chain reaction (PCR)
primer pairs covering the selected SNPs, with an average of eight SNPs per
individual region being amplified by PCR. The amplicons had an average
length of 9 kb and covered 92% of the available human genome. An average
of 6250 amplicons derived from a single individual were pooled and
hybridized to a single high-density oligonucleotide array, producing
genotypes for 48,000 SNPs.

              We genotyped 71 unrelated individuals from three populations:
24 European Americans, 23 African Americans, and 24 Han Chinese from the
Los Angeles area. The 71 individuals genotyped here were not related to
the individuals previously used for SNP discovery. DNA samples were
selected from the Coriell Cell Repositories' Human Variation Collection,
and we relied on Coriell's determinations of sample populations. We
complied with all Coriell policies for research use of DNA samples from
named populations.

              Each SNP was scored with a combination of metrics that had
been shown to correlate with genotype quality on our platform, and data
for poorly performing SNPs were rejected. These metrics included the call
rate; the number of observed genotype clusters; the existence of
near-perfect matches for SNP flanking sequences elsewhere in the genome;
the presence of other known SNPs in probe-flanking sequences; and
consistency with Hardy-Weinberg equilibrium. Tests for Hardy-Weinberg
equilibrium are very effective for identifying some types of genotyping
artifacts (10); however, because we used these tests for quality control,
our genotype data are unsuitable for investigating biologically
interesting true deviations from Hardy-Weinberg equilibrium. Further
details of our genotype quality control are described in the supporting
online material (7).
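
              To illustrate how a Hardy-Weinberg check can flag a
suspicious SNP, here is a minimal sketch. It is not the authors' quality
control procedure, whose details are in their supporting material; it
assumes a plain chi-square goodness-of-fit statistic on the three genotype
counts, and the function name hwe_chi_square is hypothetical. With samples
as small as 23 or 24 individuals an exact test would normally be preferred.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing genotype counts with Hardy-Weinberg
    expectations (p^2, 2pq, q^2) computed from the observed allele frequency.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs give 0.0).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # 24 individuals in perfect Hardy-Weinberg proportions.
    print(hwe_chi_square(6, 12, 6))
    # 24 individuals with no heterozygotes at all: a typical artifact pattern.
    print(hwe_chi_square(12, 0, 12))
```

A complete absence of heterozygotes at an intermediate allele frequency
gives a very large statistic, the kind of pattern a clustering failure on
an array might produce.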

              A subset of 1,586,383 SNPs was successfully genotyped based on
our quality criteria, with two alleles each observed at least once among
the 71 individuals. In total, more than 112 million individual genotypes
were determined for these SNPs. There were no missing genotypes for 64% of
these SNPs, and 92% of these SNPs had less than 5% missing data. The
overall frequency of successful genotype calls was 98.6%. SNP assay
details and individual genotypes have been deposited in the National
Center for Biotechnology Information (NCBI)'s SNP database (dbSNP, build
123, accession nos. ss23145044 to ss24731426). Genotypes for 156,757 SNPs
for nine of the European-American individuals that were part of this
project had been previously determined by the International HapMap
Project, using a variety of genotyping platforms (11). Our data for these
1.6 million genotypes is 99.54% concordant with the HapMap project data.
The distribution of discordant genotypes is very nonrandom; only 0.3% of
the SNPs account for 50% of all the discrepancies, and we estimate that
90% of the SNPs in the complete data set have no incorrect genotypes.
Haplotype analyses in particular will generally benefit from this error
distribution, because accurate inference of haplotypes requires consistent
genotypes across large groups of nearby markers.

              The distribution of the 1.6 million high-quality genotyped
SNPs (table S1) is similar to that of a previously reported map of 1.42
million SNPs (12). More than 95% of the genome is in inter-SNP intervals
of less than 50 kb, and roughly two-thirds of the sequenced genome is
covered by inter-SNP intervals of 10 kb or less (table S2). The average
distance between adjacent SNPs is 1871 base pairs (bp). Although
repetitive elements are underrepresented in our collection, we genotyped
269,611 SNPs within repetitive elements where the SNP flanking sequences
could be uniquely mapped. There are 735,094 SNPs (46%) in genic regions of
the genome, which we define as being within 10 kb of the transcribed
intervals for 22,904 protein-coding genes in release 3 of NCBI's build 34
annotations. At least one SNP is present in 78% of all transcripts. When
the 10-kb region of DNA upstream and downstream of each transcript is
included, 93% of all the protein-coding genes contain at least one SNP. A
total of 20,165 SNPs (1.3%) are present in amino acid coding sequences and
9370 of these SNPs are nonsynonymous, leading to an amino acid change
(table S3). Although our SNP ascertainment is not random, this subset of
SNPs is quite uniformly distributed throughout the human genome with
respect to annotated protein-coding genes as well as physical distance.

              Common SNPs in three populations. Table 1 illustrates our
success in obtaining a set of common SNPs that are informative in human
populations of diverse ancestry. Most of the 1,586,383 SNPs with
high-quality genotypes are polymorphic in each of the three population
samples genotyped in this study. Ninety-four percent of the SNPs
(1,483,594 SNPs) have two alleles in the African-American sample; 81%
(1,286,277 SNPs) have two alleles in the European-American sample; and 74%
(1,168,029 SNPs) have two alleles in the Han Chinese sample. In each
population, the majority of the segregating SNPs have a MAF greater than
10%, ranging from 68% of all segregating SNPs in the African-American
sample to 57% of all segregating SNPs in the Han Chinese sample. Only
263,029 of the 1,586,383 SNPs (17%) have a MAF of less than 10% in all
three of the population samples. The distributions of MAFs are very
similar for the European-American and Han Chinese samples, with a higher
frequency of rarer alleles in the African-American sample (fig. S1).
Consistent with previous studies (2,
13), we observed the greatest genetic diversity in individuals of African
descent. Our SNP ascertainment strategy makes it difficult to make more
definitive statements regarding the precise distribution of SNP allele
frequencies in different populations.

Table 1. SNPs segregating in the three genotyped populations. Percentages
are of 1,586,383 genotyped SNPs or of 291,012 private SNPs.

----------------------------------------------------------------------------
                      Segregating        MAF > 0.05         MAF > 0.10
Population            SNPs        %      SNPs        %      SNPs        %
----------------------------------------------------------------------------
All SNPs
  African-American    1,483,594   93.5   1,267,594   79.9   1,083,652   68.3
  European-American   1,286,277   81.1   1,123,765   70.8     991,046   62.5
  Han Chinese         1,168,029   73.6   1,027,109   64.7     910,451   57.4
Private SNPs
  African-American      218,500   75.1     139,536   47.9      88,525   30.4
  European-American      44,555   15.3      18,284    6.3       8,062    2.8
  Han Chinese            27,957    9.6      15,946    5.5       9,817    3.4
----------------------------------------------------------------------------

              Although the small sample sizes in this study preclude any
definite conclusion regarding the complete absence of a particular allele
in any given population, we observed 291,012 SNPs (18%) that were
segregating in only one population sample ("private SNPs"). Most of these
private SNPs (75%) were segregating in the African-American sample,
although private SNPs were observed for each of the three population
samples (Table 1). Although private SNPs tend to have lower MAFs than
other SNPs in our collection, a substantial fraction are common: 106,404,
or 37%, have MAF > 0.10.
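
              The definitions used above (MAF, "segregating," and
"private") can be sketched in a few lines. This is an illustration under
stated assumptions rather than the authors' code: genotypes are assumed to
be coded as the number of copies of one allele (0, 1, or 2) with None for a
missing call, and all function names are hypothetical.

```python
from typing import Dict, List, Optional

Genotypes = List[Optional[int]]  # copies of allele B per individual; None = missing


def minor_allele_frequency(genotypes: Genotypes) -> Optional[float]:
    """MAF within one population sample, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1.0 - freq_b)


def is_segregating(genotypes: Genotypes) -> bool:
    """True if both alleles are observed at least once in the sample."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf > 0.0


def private_population(snp: Dict[str, Genotypes]) -> Optional[str]:
    """Name of the only population in which the SNP segregates, else None."""
    segregating = [pop for pop, g in snp.items() if is_segregating(g)]
    return segregating[0] if len(segregating) == 1 else None


if __name__ == "__main__":
    toy_snp = {
        "African-American": [0, 1, 0, 2],
        "European-American": [0, 0, 0, 0],
        "Han Chinese": [0, 0, None, 0],
    }
    print(private_population(toy_snp))
```

Because an allele at 2% frequency is easily missed among 24 people, a SNP
labeled private by this rule may simply be rare elsewhere, which is the
caveat stated at the start of the paragraph above.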

              To quantify genetic variation within and between populations,
we calculated FST for each SNP in each pair of populations, as well as
combined values across all three populations (14). FST measures the
genetic variance between populations as a fraction of the total genetic
variance. Because African Americans are a relatively admixed population
with substantial but heterogeneous European genetic contributions (15),
the FST estimates for comparisons with this group will be more variable
but should generally underestimate the results that would be obtained with
a native African sample. The distribution of pairwise FST is very similar
for the African-American versus European-American and European-American
versus Han Chinese samples, with more large FST values between the
African-American and Han Chinese samples (fig. S2). These findings are
consistent with prior studies (16, 17) showing that most common DNA
variation is shared across human populations, with differences in allele
frequencies between populations.

              Markers with large between-population variance will be useful
for admixture mapping studies to identify genetic variants causing
phenotypic differences (18). Admixture mapping exploits relatively
long-range allelic correlations in a recently admixed population to
identify functional variants that have different prevalences in the
ancestral populations, whether because of genetic drift or local natural
selection. The technique requires selection and genotyping of limited
numbers of "ancestry-informative markers." Our identification of large
numbers of such markers removes one of the major barriers to practical use
of this promising but largely untested technique.

              Evidence for natural selection between populations. It has
been suggested that natural selection distorts the observed distribution
of FST across the human genome and that large FST values can be used to
identify candidate loci likely to have undergone local selection (13, 19).
If this is true, then larger FST values should be found near functional
genetic elements. We looked at the distribution of FST for SNPs that were
genic or nongenic, coding or noncoding, and synonymous or nonsynonymous.
We performed the analysis within subsets of SNPs grouped by MAF, so that
effectively, we looked at the fraction of between-population variance for
SNPs with the same total genetic variance (fig. S3). Common SNPs in genic
regions do have slightly but significantly higher FST values than nongenic
SNPs with the same MAF [analysis of variance (ANOVA), P = 1.8 x 10^-46],
and common coding SNPs have slightly higher FST values than noncoding SNPs
in genic regions (ANOVA, P = 1.1 x 10^-4). We did not see a significant
difference in FST between synonymous and nonsynonymous coding SNPs, but
our sensitivity is limited by the small sample sizes and expected
correlations among SNPs within the same transcript. These results are
consistent with local selection changing the distribution of FST near
functional sequences. However, because the distributions of FST among
genic and nongenic SNPs are very similar, large FST values by themselves
appear to be very weak evidence of selection.

              We performed a similar analysis to see if there is also an
association between private SNPs and functional genetic elements. When
conditioned on MAF, we saw no difference in frequency of private SNPs
among genic and nongenic SNPs or among coding and noncoding SNPs (fig.
S4). This indicates that the SNPs responsible for evidence of local
selection in the FST analysis tend not to be private and instead are
segregating in multiple populations. Although there are known examples
linking population-specific SNP alleles to phenotypic differences (20-22),
our results are more consistent with the conclusion that most functional
human genetic variation is not population-specific.

              Correlation structure of common SNPs. DNA variants in physical
proximity along a chromosome tend to be correlated, and these correlations
are known as linkage disequilibrium (LD). LD results from a combination of
processes, including mutation, natural selection, and genetic drift. It
can initially extend over very long genomic distances but is steadily
broken down over time by recombination. The observed structure of LD in
any particular genomic interval thus depends on a complex interplay of
demographic history, stochastic events, and functional constraints.
Several metrics exist for measuring LD between pairs of SNPs; we used r2,
the squared correlation coefficient for a 2 by 2 table of haplotype
frequencies (23).
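
              As an illustration of the r2 metric described above, the
following is a minimal sketch that computes the squared correlation from a
2 by 2 table of haplotype frequencies. The function name and argument
names are hypothetical and are not taken from the study; the formula is
the standard one, r2 = D^2 / [pA(1 - pA) pB(1 - pB)], with D the
disequilibrium coefficient.

```python
def ld_r2(f_AB: float, f_Ab: float, f_aB: float, f_ab: float) -> float:
    """Squared correlation r^2 between two biallelic SNPs.

    Arguments are the four haplotype frequencies (or counts) for a SNP
    with alleles A/a and a SNP with alleles B/b. Names are illustrative.
    """
    total = f_AB + f_Ab + f_aB + f_ab
    if total <= 0:
        raise ValueError("haplotype frequencies must sum to a positive value")
    # Normalize so that either counts or frequencies are accepted.
    f_AB, f_Ab, f_aB, f_ab = (x / total for x in (f_AB, f_Ab, f_aB, f_ab))
    p_A = f_AB + f_Ab  # frequency of allele A at the first SNP
    p_B = f_AB + f_aB  # frequency of allele B at the second SNP
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = f_AB * f_ab - f_Ab * f_aB  # disequilibrium coefficient D
    return d * d / denom
```

              Two SNPs whose alleles always occur together give r2 = 1,
and two SNPs whose alleles are independent give r2 = 0.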

              We have used a modification of a previously described
algorithm to identify bins of common SNPs that are in very strong LD,
where each bin has at least one "tag SNP" with an r2 of at least 0.8 with
every other SNP in the bin (24). This "greedy" algorithm works by
iteratively identifying the largest possible subset with these properties
from a list of available SNPs, then removing those SNPs from the list used
in the next iteration. By assaying a reduced set of tag SNPs, the
genotyping burden of an association study may be substantially reduced
while retaining most of the power to discover disease associations of the
entire SNP set. Unlike haplotype blocks, which are defined as contiguous
groups of SNPs, the SNPs that make up a bin may be interdigitated with
SNPs that are part of other bins.
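
              The greedy binning step described above can be sketched as
follows. This is a simplified illustration of the general approach, not
the modified algorithm used in the study; the function name, the input
format (a dictionary of pairwise r2 values), and the tie-breaking rule are
all assumptions made for the example.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Partition SNPs into LD bins with a simple greedy procedure.

    snps: iterable of SNP identifiers.
    r2: dict mapping frozenset({a, b}) to pairwise r^2; pairs that are
        missing from the dict are treated as r^2 = 0 (an assumption).
    Returns a list of (bin_members, tag_snps) tuples.
    """
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # For each candidate, collect the remaining SNPs it would tag,
        # then keep the largest such group (first one wins on ties).
        best = max(
            ([s] + [t for t in remaining if t != s and linked(s, t)]
             for s in remaining),
            key=len,
        )
        members = set(best)
        # Any member linked to every other member can act as a tag SNP.
        tags = [s for s in best
                if all(linked(s, t) for t in members if t != s)]
        bins.append((best, tags))
        # Remove the binned SNPs before the next iteration.
        remaining = [s for s in remaining if s not in members]
    return bins
```

              Because bins are built from pairwise correlation rather than
position, nothing in this procedure requires the members of a bin to be
contiguous, which is why bins can be interdigitated.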

              Table 2 summarizes bin characteristics across the genome,
excluding the Y chromosome, for each of the three population samples. We
focused on common SNPs with MAF > 10% in this analysis, because estimates
of LD for variants with lower MAF are unreliable unless large numbers of
individuals are genotyped (23). Although most LD bins contained just one
SNP, these isolated SNPs were a small proportion of all SNPs, and most
SNPs were tightly correlated with multiple other SNPs. In the
European-American data, 52.3% of 293,677 bins contained one SNP; however,
these constituted only 15.5% of the 991,185 common SNPs. A substantial
portion of all SNPs qualified as tag SNPs by having a high r2 value with
every other bin member, indicating that the bins are generally quite
densely connected. For the African-American sample, there were
substantially fewer bins made up of large numbers of SNPs extending over
large distances (Fig. 1). It should be kept in mind that the LD structure
we observed is based on an analysis of only 25% of all common SNPs in the
genome. Although the sizes of longer intervals of LD should be relatively
robust to our incomplete ascertainment, the proportion of all common SNPs
in high LD with other SNPs may be substantially underestimated from our
data.

--------------------------------------------------------------------

                     Fig. 1. Size distribution of LD bins. We show, for a
given minimum bin size, the fraction of SNPs in bins of that size or
larger. The size distributions for the European-American and Han Chinese
LD maps are essentially identical.

--------------------------------------------------------------------

                      Table 2. LD bin statistics in three populations. Bins
were classified by the number of SNPs they contained.

--------------------------------------------------------------------
Size*      Bins       % Bins    kb†     SNPs         % SNPs
--------------------------------------------------------------------
African-American
1          362,465    67.4      0.0     362,465      33.5
2 to 4     131,737    24.5      12.4    337,877      31.2
5 to 9     32,081     6.0       37.2    202,512      18.7
≥10        11,530     2.1       78.4    180,556      16.7
Total      537,813                      1,083,410
--------------------------------------------------------------------
European-American
1          153,511    52.3      0.0     153,511      15.5
2 to 4     84,890     28.9      14.6    226,172      22.8
5 to 9     33,745     11.5      37.3    218,491      22.0
≥10        21,531     7.3       89.5    393,011      39.7
Total      293,677                      991,185
--------------------------------------------------------------------
Han Chinese
1          129,759    50.8      0.0     129,759      14.3
2 to 4     74,232     29.1      13.2    198,422      21.8
5 to 9     30,569     12.0      34.8    198,429      21.8
≥10        20,708     8.1       83.7    383,580      42.1
Total      255,268                      910,190
--------------------------------------------------------------------

                          * The number of SNPs per LD bin.

                          † Average distance spanned by the SNPs in each LD
bin, in kb.

              LD and functional elements. We observed a strong relationship
between extended intervals of LD and functional genomic features (Table
3). Large bins were significantly overpopulated with genic versus nongenic
SNPs (trend test, P ≈ 0), and in genic regions, coding SNPs were
significantly enriched over noncoding SNPs (trend test, P = 1.9 x 10^-26).
Large bins were also overrepresented for nonsynonymous versus synonymous
SNPs (trend test, P = 5.3 x 10^-4). This result is consistent with the
hypothesis of an association between selection and some regions of
extended LD (25, 26) and suggests that some genomic regions of extended LD
may play a particularly important role in determining the genetic basis of
human phenotypic differences.

                      Table 3. Distribution of genic, synonymous, and
nonsynonymous coding SNPs spanned by bins of extended LD in any of the
three population samples. Genic SNPs are defined as within 10 kb of a
protein-coding gene annotation.

--------------------------------------------------------------------------
Longest spanning               Genic             Synonymous    Nonsynonymous
LD bin (kb)      SNPs          SNPs      %       SNPs    %     SNPs    %
--------------------------------------------------------------------------
<500             1,536,094     707,950   46.1    10,330  0.67  8,898   0.58
500 to 1000      42,432        22,189    52.3    347     0.82  302     0.71
≥1000            7,857         4,955     63.1    120     1.52  171     2.17
--------------------------------------------------------------------------

              We identified five bins of more than 200 SNPs each and 17
genomic intervals containing bins that span more than 1000 kb in one or
more populations (tables S4 and S5). Several of these large bins spanned
similarly large genes. The bin with the most SNPs was on chromosome 17 in
the European-American map and had an unusual pattern of variation, with
two previously reported haplotypes extending across 518 SNPs and spanning
a distance of 800 kb (27). The rarer haplotype had a frequency of 25% in
the European-American sample and a 9% frequency in the African-American
sample and was absent in the Han Chinese sample. This bin includes the
gene for microtubule-associated protein tau, mutations of which are
associated with a variety of neurodegenerative disorders; a gene coding
for a protease similar to presenilins, mutations of which result in
Alzheimer's disease; and the gene for corticotropin-releasing hormone
receptor, which mediates immune, endocrine, autonomic, and behavioral
responses to stress (27-29).

              Large-scale patterns of LD. The distribution of SNPs and LD
across the entire human genome is shown in Fig. 2 and can be examined in
more detail online. The top track illustrates the relative uniformity of
coverage of the analyzed SNPs apart from intervals of centromeric and
telomeric heterochromatin. The middle track shows the fraction of common
SNPs that are in high LD with at least one other SNP. In most regions, we
observed a high level of redundancy for the European-American and Han
Chinese samples and somewhat less redundancy in the African-American
sample. The bottom track shows the fraction of common SNPs observed to be
in relatively large LD bins in each population. This track shows
substantial structure on a scale of megabases. Although there is generally
good agreement between populations, there are also intervals where there
is substantial divergence.

--------------------------------------------------------------------

                      Fig. 2. Distribution of SNP positions and LD structure
across the genome. For each chromosome, the top track shows SNP density
per kb, with a window size of 500 kb. The middle track shows, for each
population, the fraction of common SNPs with MAF > 10% that are in high LD
(r2 > 0.8) with at least one other common SNP, with a window size of 500
kb. The bottom track shows, for each population, the fraction of common
SNPs that are in an LD bin extending over at least 50 kb, with a window
size of 1000 kb.

--------------------------------------------------------------------

              Our whole-genome analysis reveals that the large-scale
structure of LD across the genome is correlated with large-scale
differences in recombination rates, consistent with previous findings for
a single chromosome (30). In particular, regions of very strong LD are
mostly located in regions of low recombination (fig. S5). This correlation
of large-scale LD structure with recombination rate and the finding that
regions of extended LD show evidence of selection provide strong support
for the hypothesis that the LD structure of the human genome has
functional significance and is not simply a byproduct of random genetic
drift and population demographics.

              SNP subsets capture most common variation. As only a fraction
of all common SNPs in human populations have been characterized to date,
association studies based on available subsets of SNPs rely on the
expectation that an undiscovered, disease-associated variant is likely to
be correlated with an allele of an assayed SNP. The statistical power to
detect an unassayed, disease-associated allele indirectly with a
correlated allele of an assayed SNP is related to r2. Specifically, the
power to detect an association indirectly in N individuals is equivalent
to the power to detect it directly in N x r2 individuals (31). The actual
power to detect a particular causal variant depends on that variant's mode
of action and penetrance as well as details of the study design. Thus, r2
can only be used to answer the narrower question of what is the sample
size penalty, in an otherwise appropriately designed study, for not
directly assaying a causal variant.
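
              The sample size penalty stated above can be worked through
with a short sketch. The function names are hypothetical; the only
relationship used is the one given in the text, namely that N individuals
typed at a correlated SNP are equivalent to N x r2 individuals typed at
the causal variant itself.

```python
import math


def effective_sample_size(n: int, r2: float) -> float:
    """Equivalent number of individuals if the causal variant were assayed
    directly, given n individuals assayed at a correlated SNP."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in the interval (0, 1]")
    return n * r2


def required_sample_size(n_direct: int, r2: float) -> int:
    """Number of individuals needed at the correlated SNP to match the
    power of n_direct individuals assayed at the causal variant."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in the interval (0, 1]")
    return math.ceil(n_direct / r2)
```

              For example, with r2 = 0.5 a study would need twice as many
individuals to match the power of a direct assay, whereas at the r2 = 0.8
threshold used for the LD bins the inflation is a factor of 1/0.8.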

              To determine our ability to detect unassayed,
disease-associated variants with this SNP collection, we took advantage of
the fact that the European-American and African-American individuals
genotyped in this study were also sequenced across selected genes by the
SeattleSNPs Program for Genomic Applications (PGA) (32). For these
individuals, these data provide an essentially complete assessment of
genetic variation in the sequenced regions, allowing us to estimate the
fraction of all variation contained in our SNP set. In addition, the data
allow us to determine the coverage of our genotyped SNPs for the sites we
did not directly assay.

              We evaluated data for 16,601 sequence variants identified in
152 genes, of which 2465 were part of our SNP set. The concordance between
our genotype data and the PGA data for these 2465 SNPs was 99.2%. Our SNP
set contained 24% of all SNPs with a MAF ≥ 10% for these 152 genes in the
African-American and European-American samples. SNPs with low MAF are
underrepresented in our data compared to the PGA data, because our SNPs
were typically discovered with sequence data from fewer distinct
chromosomes. These rarer variants account for relatively small fractions
of the total nucleotide diversity. In the PGA data for the European
Americans, 45% of SNPs have MAF < 10% but account for only 15% of
nucleotide diversity; for the African Americans, 58% of SNPs have MAF <
10% and account for 23% of nucleotide diversity.

              Table 4 shows the average r2 and the fraction of r2 values
exceeding thresholds, for any PGA SNP with the most-correlated SNP in the
same region that was included in our SNP set. These results indicate that,
with the stringent threshold of r2 > 0.8, our SNP set ascertains 73% of
common variation in the European-American sample and 54% of common
variation in the African-American sample. These values are similar to
those previously predicted if 2.7 million SNPs from public databases were
developed into genotyping assays (17). This analysis sets a very
conservative lower bound on coverage, because it treats SNPs below the
threshold of r2 = 0.8 as completely uncovered and does not reward coverage
that exceeds the threshold. Using a less stringent threshold of r2 > 0.5,
coverage would improve to 86% in the European-American sample and 71% in
the African-American sample. The skewed distribution of r2 toward high
values is apparent in the mean values of 0.84 for the European-American
sample and 0.72 for the African-American sample. These numbers are
especially impressive considering that we did not genotype 75% of all the
common SNPs in these intervals.
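
              The coverage statistics reported here (mean r2 and the
fraction of variants exceeding each threshold) can be computed along the
following lines. This is a minimal sketch under stated assumptions: the
function name and the input format are hypothetical, and the input is
taken to be, for each resequenced variant, the highest r2 with any
genotyped SNP in the same region.

```python
def coverage_summary(max_r2_values, thresholds=(0.5, 0.8)):
    """Summarize how well a genotyped SNP set covers reference variants.

    max_r2_values: one value per reference variant, the maximum r^2 with
        any genotyped SNP in the same locus (1.0 if the variant itself
        was genotyped).
    Returns a dict with the mean maximum r^2 and, for each threshold, the
    fraction of variants strictly above it.
    """
    values = list(max_r2_values)
    if not values:
        raise ValueError("at least one reference variant is required")
    n = len(values)
    summary = {"mean_r2": sum(values) / n}
    for t in thresholds:
        # Variants at or below the threshold count as uncovered, which is
        # why this measure is a conservative lower bound.
        summary[f"frac_r2_gt_{t}"] = sum(v > t for v in values) / n
    summary["frac_r2_eq_1"] = sum(v >= 1.0 for v in values) / n
    return summary
```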

                      Table 4. LD statistics for common SNPs genotyped in
this study, with common variants identified by complete resequencing in
152 genes.

--------------------------------------------------------------------------
Subset*   Yield (%)†   r2‡     r2 > 0.5 (%)§   r2 > 0.8 (%)§   r2 = 1.0 (%)§
--------------------------------------------------------------------------
African-American
  All     23.3         0.715   70.9            53.7            41.5
  Tag     12.3         0.698   70.1            51.9            33.2
European-American
  All     25.0         0.841   86.5            72.6            62.4
  Tag     8.1          0.810   85.6            69.7            44.8
--------------------------------------------------------------------------

                          * SNPs from the current study: either all common
SNPs or a minimal tagging subset.

                          † Percentage of all SeattleSNPs PGA variants that
were in the selected set.

                          ‡ Across all PGA variants, the mean maximum r2 with
a selected SNP in the same locus.

                          § Percentages of PGA variants having an r2 greater
than the specified threshold with any selected SNP in the same locus.

              Selection of one tag SNP from each LD bin for the three
population samples yielded 296,313 of the 991,398 SNPs segregating in the
European-American sample (30%); 256,766 of the 909,824 SNPs segregating in
the Han Chinese sample (28%); and 540,533 of the 1,083,638 SNPs
segregating in the African-American sample (50%). When tag SNPs from
European Americans and African Americans were used to assess common
variation in the PGA data, for MAF ≥ 10%, the amount of all common
variation ascertained was reduced very little compared to that ascertained
with the complete sets of common SNPs (Table 4). These tag SNP numbers are
smaller than have previously been predicted with a similar selection
strategy (24); however, we did not attempt to achieve 100% coverage as in
that work. Although choosing subsets of SNPs based on bin relationships
reduces the genotyping burden for a comprehensive whole-genome scan to
some degree in all populations, these data indicate that even taking
advantage of such tag SNP selection, a comprehensive whole-genome
association study requires genotyping each individual for at least several
hundreds of thousands of SNPs.

              Haplotype block structure. LD maps and haplotype maps
represent somewhat different aspects of the local structure of genetic
variation. The genetic architecture of a particular phenotype will
determine which representation is most powerful for the identification of
functional variants (33). In parallel with our LD analysis, we used the
HAP program (34) to infer haplotypes from our diploid genotype data. We
partitioned these reconstructed haplotypes into blocks with limited
diversity, separately for each of the three population samples. These
blocks were defined as sets of SNPs for which at least 80% of the inferred
haplotypes could be grouped into common patterns with population
frequencies of at least 5%.
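
              The block criterion stated above can be expressed as a
simple check on a set of inferred haplotypes. This is only a sketch of the
stated 80%/5% rule; it is not the HAP program's partitioning procedure,
and the function name, parameters, and input format are assumptions. It
also ignores missing data and phasing uncertainty.

```python
from collections import Counter


def is_haplotype_block(haplotypes, min_pattern_freq=0.05, min_coverage=0.80):
    """Test whether a candidate SNP interval satisfies the block criterion.

    haplotypes: one entry per chromosome (for example a string of
        alleles), restricted to the SNPs in the candidate interval.
    A pattern is "common" if its frequency is at least min_pattern_freq;
    the interval qualifies if common patterns account for at least
    min_coverage of all haplotypes.
    """
    n = len(haplotypes)
    if n == 0:
        return False
    counts = Counter(haplotypes)
    in_common_patterns = sum(
        c for c in counts.values() if c / n >= min_pattern_freq
    )
    return in_common_patterns / n >= min_coverage
```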

              Table 5 summarizes the structure of the three resulting
haplotype maps for the whole genome, excluding the Y chromosome. The
haplotype map statistics across the three populations appear qualitatively
similar to the LD maps, with substantially more blocks in the map derived
from the African-American sample than in the maps from the
European-American and Han Chinese samples. The numbers of SNPs required to
represent frequencies of common haplotype patterns were similar to the
numbers of tag SNPs identified in the LD maps. Substantial fractions of LD
bins of two or more SNPs crossed haplotype block boundaries, ranging from
33% in the Han Chinese map to 48% in the African-American map.

                      Table 5. Haplotype block partition results for the
three populations.

--------------------------------------------------------------------
Population          Blocks    Average size, kb*   Required SNPs†
--------------------------------------------------------------------
African-American    235,663   8.8                 570,886
European-American   109,913   20.7                275,960
Han Chinese         89,994    25.2                220,809
--------------------------------------------------------------------

                          * Average distance spanned by segregating sites in
each block.

                          † Minimum number of SNPs required to distinguish
common haplotype patterns with frequencies of 5% or higher.

              The bin structure for SNPs in the region of the CFTR gene on
chromosome 7 (Fig. 3) demonstrates some of the differences between the LD
bin and haplotype block maps and further illustrates that there can be
substantial population differences in local map structure. In this
interval, the European-American and African-American LD maps have similar
complexity, with multiple overlapping bins, but the Han Chinese map is
dominated by two disjoint bins of highly correlated SNPs. Conversely, a
break point near the 116,790-kb position is shared in the African-American
and Han Chinese LD maps but is bridged by multiple LD groupings in the
European-American map. All three haplotype maps share this break point.
However, the African-American map contains many more distinct haplotype
blocks than the maps for the other two population samples.

--------------------------------------------------------------------

                     Fig. 3. Extended LD bin and haplotype block structure
around the CFTR gene. LD bins, where each bin has at least one SNP with
r2 > 0.8 with every other SNP, are depicted as light horizontal bars, with
the positions of constituent SNPs indicated by vertical tick marks as well
as the extreme ends of the bars. Isolated SNPs are indicated by plain tick
marks. Haplotype blocks, within which at least 80% of observed haplotypes
could be grouped into common patterns with frequencies of at least 5%, are
depicted as dark horizontal bars. Unlike haplotype blocks that are by
design sequential and nonoverlapping, SNPs in one LD bin can be
interdigitated with SNPs in multiple other overlapping bins.

--------------------------------------------------------------------

              Common genetic variation and human health. Our focus on common
genetic variation has several motivations. Common variants account for a
larger share of human nucleotide diversity than rare variants and are more
experimentally tractable. For the same allelic effect, a common variant
represents a larger fraction of phenotypic variance and population
attributable risk than a rare one, so common variants are more valuable
from the perspective of diagnostics and intervention. Finally, detecting
and characterizing effects of rare variants requires very large sample
sizes to obtain statistically meaningful numbers of individuals carrying a
rare allele. There is no doubt that rare variants play a role in the
etiology of common disease, but pursuit of common variants is more
tractable with available technologies.

              Common human diseases, such as cardiovascular disease and
psychiatric illness, are caused by the interplay of multiple genetic and
environmental factors. The bounded nature of the human genome and the
availability of the complete human genome sequence have resulted in
extensive efforts to define the genetic basis of a wide variety of complex
human traits. One approach for identifying such genetic risk factors is
the case-control association study, in which a group of individuals with
disease is found to have an increased frequency of a particular genetic
variant compared to a group of control individuals. A number of genetic
risk factors for common disease have been identified by such association
studies (3, 4, 35, 36). These studies suggest that many different genes
distributed throughout the human genome contribute to the total genetic
variability of a particular complex trait, with any single gene accounting
for no more than a few percent of the overall variability of the trait
(37). Case-control study designs that include on the order of 1000
individuals can provide adequate power to identify genes accounting for
only a few percent of the overall genetic variability of a complex trait,
even using the very stringent significance levels required when testing
large numbers of common DNA variants (37). Using such study designs in
conjunction with the detailed description of common human DNA variation
presented here, it may be possible to identify a set of major genetic risk
factors contributing to the variability in a complex disease and/or
treatment response. Although knowledge of a single genetic risk factor can
seldom be used to predict the treatment outcome of a common disease,
knowledge of a large fraction of all the major genetic risk factors
contributing to a treatment response or common disease could have
immediate utility, allowing existing treatment options to be matched to
individual patients without requiring additional knowledge of the
mechanisms by which the genetic differences lead to different outcomes.

              In our analyses, we selected representations of the data,
including pairwise LD as well as a haplotype-based approach, that we felt
would be most useful for an initial characterization of this resource. We
focused attention on pairwise LD analyses because they provide a
particularly simple framework for evaluating coverage and information
content of different SNP collections. The optimal representation of
genetic variation data remains an area of active research. Although we
have determined example haplotype maps of the human genome in these three
populations, the most appropriate representation of the data depends
substantially on the specific questions to be answered. There will be many
maps of human genetic variation, each tailored for specific uses.

              Public data availability. We have implemented an instance of
the Generic Genome Browser (38) at http://genome.perlegen.com for viewing
the SNP, LD, and haplotype data reported here; these data will also be
available from Science upon request. More detailed haplotype analysis
results are available at http://research.calit2.net/hap/wgha/ and through
dbSNP. The data reported here represent a massive increase in the
available number of SNPs characterized in multiple populations. For
comparison, although the public SNP database, dbSNP build 122, contained
map positions for more than 8.1 million human SNPs, frequencies were
available for only 797,000 of these SNPs, mostly in just one population,
and genotypes were available for only 210,000 SNPs. Our data also
complement the results of the International HapMap Project (11), by
providing data for many more SNPs across fewer individuals.

              This work enables detailed analyses of the structure of human
genetic variation on a whole-genome scale. We examined genetic variation
in individuals from three populations with substantially different
histories and describe general features of variation within and between
populations. Because these samples do not capture the full genetic
diversity of the populations from which they were selected, our data are
not suitable for answering many questions about the detailed genetic
structure of human populations (39). However, the public availability of
these data will enable a wide variety of additional analyses to be carried
out by scientists investigating the structure of human genetic variation
as well as the genetic basis of human phenotypic differences.

              References and Notes

                    1. L. Kruglyak, D. A. Nickerson, Nature Genet. 27, 234
(2001).
                    2. C. Romualdi et al., Genome Res. 12, 602 (2002).
                    3. J. P. Hugot et al., Nature 411, 599 (2001).
                    4. A. D. Roses, Neurogenetics 1, 3 (1997).
                    5. N. Patil et al., Science 294, 1719 (2001).
                    6. S. B. Gabriel et al., Science 296, 2225 (2002).
                    7. Materials and methods are available as supporting
material on Science Online.
                    8. D. A. Hinds et al., Am. J. Hum. Genet. 74, 317
(2004).
                    9. D. A. Hinds et al., Hum. Genom. 1, 421 (2004).
                    10. L. Hosking et al., Eur. J. Hum. Genet. 12, 395
(2004).
                    11. International HapMap Consortium, Nature 426, 789
(2003), available at www.hapmap.org.
                    12. International SNP Map Working Group, Nature 409, 928
(2001).
                    13. A. M. Bowcock et al., Proc. Natl. Acad. Sci. U.S.A.
88, 839 (1991).
                    14. B. S. Weir, C. C. Cockerham, Evolution 38, 1358
(1984).
                    15. E. J. Parra et al., Am. J. Hum. Genet. 63, 1839
(1998).
                    16. N. A. Rosenberg et al., Science 298, 2381 (2002).
                    17. C. S. Carlson et al., Nature Genet. 33, 518 (2003).
                    18. N. Patterson et al., Am. J. Hum. Genet. 74, 979
(2004).
                    19. J. M. Akey, G. Zhang, K. Zhang, L. Jin, M. D.
Shriver, Genome Res. 12, 1805 (2002).
                    20. Q. Xiao, H. Weiner, D. W. Crabb, J. Clin. Invest.
98, 2027 (1996).
                    21. P. Duggal et al., Genes Immun. 4, 245 (2003).
                    22. K. Nakayama et al., J. Hum. Genet. 47, 92 (2002).
                    23. B. Devlin, N. Risch, Genomics 29, 311 (1995).
                    24. C. S. Carlson et al., Am. J. Hum. Genet. 74, 106
(2004).
                    25. G. A. Huttley, M. W. Smith, M. Carrington, S. J.
O'Brien, Genetics 152, 1711 (1999).
                    26. P. C. Sabeti et al., Nature 419, 832 (2002).
                    27. A. M. Pittman et al., Hum. Mol. Genet. 13, 1267
(2004).
                    28. G. Van Gassen, W. Annaert, Neuroscientist 9, 117
(2003).
                    29. E. R. De Kloet, Ann. N.Y. Acad. Sci. 1018, 1 (2004).
                    30. E. Dawson et al., Nature 418, 544 (2002).
                    31. J. K. Pritchard, M. Przeworski, Am. J. Hum. Genet.
69, 1 (2001).
                    32. SeattleSNPs, National Heart, Lung, and Blood
Institute Program for Genomic Applications, University of Washington-Fred
Hutchinson Cancer Research Center, Seattle, WA, available at
http://pga.gs.washington.edu.
                    33. J. S. Bader, Pharmacogenomics 2, 11 (2001).
                    34. E. Halperin, E. Eskin, Bioinformatics 20, 1842
(2004).
                    35. D. Altshuler et al., Nature Genet. 26, 76 (2000).
                    36. L. A. Pennacchio et al., Science 294, 169 (2001).
                    37. N. J. Risch, Nature 405, 847 (2000).
                    38. L. D. Stein et al., Genome Res. 12, 1599 (2002).
                    39. D. Serre, S. Pääbo, Genome Res. 14, 1679 (2004).
                    40. We thank B. Margus and S. Fodor for many helpful
discussions, S. Ptak for comments on the manuscript, and the following
individuals for expert technical assistance: high throughput genotyping,
C. Chen, P. Chu, D. Dalija, J. Doshi, P. Jain, A. Johnson, L. Kamigaki, J.
Karbowski, C. Kautzer, V. Mendoza, M. Morenzoni, B. Nguyen, C. Owyang, N.
Patil, K. Perry, R. Patel, C. Pethiyagoda, T. L. Pham, C. Sanders, A.
Sparks, R. Stokowski, D. Telman, R. Vergara, P. Vu, and P.-H. Wang;
bioinformatics design, W. Barrett, H. Huang, M. Jen, X. Li, B. Mooney, and
S. Pitts; data analysis, A. Berno, K. Konvicka, A. Ollmann, K. Pant, and
J. Sheehan; laboratory information management, R. Gupta, E. Jacobs, C.
Radu, and P. Starink; engineering and instrumentation, R. Hartlage, M.
Norris, G. Park, and A. Yee; computer systems and operations, T. Fleury,
R. Galvez, R. Gordon, P. Hickey, C. LaPlante, J. Nordhal, T. Ogi, and J.
VandenHengel. E.E. is supported by the California Institute for
Telecommunications and Information Technology.

              Supporting Online Material

              www.sciencemag.org/cgi/content/full/307/5712/1072/DC1

              Materials and Methods

              Figs. S1 to S5

              Tables S1 to S5

              References and Notes

              20 September 2004; accepted 14 January 2005

              Volume 307, Number 5712, Issue of 18 Feb 2005, pp. 1072-1079.
From checker at panix.com Sat Feb 19 20:38:53 2005
Date: Sat, 19 Feb 2005 20:38:53 -0500 (EST)
From: Premise Checker <checker at panix.com>
To: Premise Checker <checker at panix.com>
Subject: Science: Enhanced: Race and Reification in Science

-----------

MEDICINE: Enhanced: Race and Reification in Science
               Troy Duster[HN12]*
               Alfred North Whitehead warned many years ago about "the
fallacy of misplaced concreteness" [HN1] (1), by which he meant the
tendency to assume that categories of thought coincide with the obdurate
character of the empirical world. If we think of a shoe as "really a
shoe," then we are not likely to use it as a hammer (when no hammer is
around). Whitehead's insight about misplaced concreteness is also known as
the fallacy of reification [HN2] . Recent research in medicine and
genetics makes it even more crucial to resist actively the temptation to
deploy racial categories as if immutable in nature and society.

               Hypertension and Heart Disease
               In the last two decades, there has been extensive publication
on the differences in hypertension and heart disease between Americans of
European descent and Americans of African descent (2-4). Racial
designations are frequently used in efforts to assess the respective
influences of environmental and genetic factors.

               In November, a study was published regarding a combination of
isosorbide dinitrate and hydralazine (BiDil) [HN3] that was originally
found to be ineffective in treating heart disease in the general
population but was then shown to work in a 3-year trial of a group of 1050
individuals designated as African Americans (5). BiDil is likely to get
FDA approval this year and has been labeled "the first ethnic drug,"
although in medical practice, this becomes "the first racial drug." In
presenting their justification for FDA approval of an ethnic/race-specific
drug, the company (NitroMed) [HN4] announced, "The African American
community is affected at a greater rate by heart failure than that of the
corresponding Caucasian population. African Americans between the ages of
45 and 64 are 2.5 times more likely to die from heart failure than
Caucasians in the same age range" (6).

               However, both age and survey population complicate this
picture. The age group 45 to 64 only accounts for about 6% of heart
failure mortality, and for those over 65, the statistical differences
between "African Americans and Caucasians" nearly completely disappear
(7). Researchers recently published a study that was explicitly designed
to compare racial differences, by sampling whites from eight surveys
completed in Europe, the United States, and Canada and contrasting these
results with those of a sample of three surveys among blacks from Africa,
the Caribbean, and the United States (8). Hypertension rates were
measured in 85,000 subjects. The data from Brazil, Trinidad, and Cuba show
a significantly smaller racial disparity in blood pressure than is found
in North America (8).[HN5]

               Even within the category African American, the highly variable
phenotype of skin color complicates the hypertension and race thesis. A
classic epidemiological study on the topic also found differences within
the African American population--with darker-skinned blacks generally
having higher mean blood pressure than lighter-skinned blacks. The authors
concluded that it was not the color of the skin that produced a direct
causal outcome in hypertension, but that darker skin color in the United
States is associated with less access to scarce and valued resources of
the society. There is a complex feedback loop and interaction effect
between phenotype and social practices related to that phenotype (4, 9).

               Others have voiced concerns about the pitfalls of using race
as anything but a temporary proxy: As the geneticist David Goldstein [HN6]
observed, "Race for prescription is only an interim solution to carry us
through a period of ignorance until we find the underlying causes" (10).
There is every evidence that these underlying causes interact with each
other. However, race is such a dominant category in the cognitive field
that the "interim solution" can leave its own indelible mark once given
even the temporary imprimatur of scientific legitimacy by molecular
genetics.

               Studies of Human Genetic Diversity
               The procedures for answering any inquiry into the empirical
world determine the scientific legitimacy of claims to validity and
reliable knowledge, but the prior question will always be: Why that
particular question? The first principle of knowledge construction is,
therefore, which question gets asked in the research enterprise.

               A paper published in this week's issue of Science [HN7] (11)
is well-intentioned, well-crafted, and designed to help better understand
the molecular basis of disease. The researchers were searching for and
found patterns of SNPs [HN8] differentially distributed in three
population groups, formed from a total of 71 persons who were Americans of
African, European, or Han Chinese descent.

               Why was the question raised in this manner? The answer is a
scientific Catch-22. This and other similar efforts (12) to create
linkage disequilibrium and haplotype maps have a logic for choosing to
study people from disparate geographic regions of the world. The purpose
is to generate maps that can indicate subtle differences in the patterning
or structuring of human genetic diversity across the globe. [HN9] An
increased understanding of these patterns of genetic diversity will help
scientists doing gene-association studies by identifying new variants and
reducing the likelihood of false-positive associations. The hope is that
it may aid scientists to identify medically relevant genes for diseases.

               However, the particular groups of individuals chosen to
represent each region of the world are often chosen because of their
convenience and accessibility. Cell and tissue repositories are created to
decrease the cost and difficulty of obtaining samples, and the archived
samples will be extensively characterized and frequently utilized. Sample
collections from repositories may be treated as populations in the narrow
sense of the term, even when there is little evidence that they represent
a geographically localized, reproductively isolated group. These samples
are often subtly portrayed as representing racially categorized
populations. Finding a higher frequency of some alleles in one population
versus another is a guaranteed outcome of modern technology, even for two
randomly chosen populations. When the boundaries of those populations
coincide with the social definition of race, a delicate tightrope needs to
be better navigated between: (i) acknowledging race as a stratifying
practice in societies that can lead to different frequencies of alleles in
different modern populations but also to different access to
health-related resources, and (ii) reifying race as having genetically
sufficiently distinctive features, i.e., with "distinctive gene pathways,"
which are used to explain health disparities between racially categorized
populations.
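
               The sampling side of this point can be illustrated with a
small simulation. What follows is only a minimal sketch under assumptions
that are not part of the study discussed above: the function name, the
group size of 48 chromosomes, the range of allele frequencies, and the
uncorrected 0.05 threshold are all hypothetical choices made for
illustration. Both groups are drawn from one homogeneous pool, so any
difference the code reports is sampling noise alone; it does not model real
frequency differences between populations.

```python
import math
import random


def simulate_random_split(n_snps=2000, n_per_group=48, seed=0):
    """Compare allele frequencies in two groups drawn from ONE pool.

    Returns (fraction of SNPs whose sample counts differ at all,
             fraction passing an uncorrected two-sided z-test at 0.05).
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    differing = 0      # SNPs where the two sample counts are not identical
    nominal_hits = 0   # SNPs that look "significant" without correction
    for _ in range(n_snps):
        # One true allele frequency, shared by both groups.
        p = rng.uniform(0.1, 0.9)
        a = sum(rng.random() < p for _ in range(n_per_group))
        b = sum(rng.random() < p for _ in range(n_per_group))
        if a != b:
            differing += 1
        # Two-proportion z-test with a pooled frequency estimate.
        pooled = (a + b) / (2 * n_per_group)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if se > 0 and abs(a - b) / n_per_group / se > 1.96:
            nominal_hits += 1
    return differing / n_snps, nominal_hits / n_snps


if __name__ == "__main__":
    frac_differing, frac_nominal = simulate_random_split()
    print(f"SNPs with any frequency difference: {frac_differing:.2f}")
    print(f"SNPs passing an uncorrected 0.05 test: {frac_nominal:.3f}")
```

               Under these assumptions most simulated SNPs show some
frequency difference between the two groups, and roughly one in twenty
passes the uncorrected test, even though the groups were drawn from the same
pool. With over a million SNPs examined, that proportion would amount to a
very large number of apparent differences.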

               If we fall into the trap of accepting the categories of stored
data sets, then it can be an easy slide down the slope to the
misconceptions of "black" or "white" diseases. By accepting the
prefabricated racial designations of stored samples and then reporting
patterns of differences in SNPs between those categories, misplaced
genetic concreteness is nearly inevitable.

               SNP Patterns and Searches for a Biological Basis for Criminal
Behavior
               Several countries now have national DNA databases (13). [HN10]
Although I use the U.S. criminal justice system as an example, I have no
doubt that the principles being considered are universal ones.

               It is now relatively common for scholars to acknowledge the
considerable and documented racial and ethnic bias in the criminal justice
system, from police procedures and prosecutorial discretion to jury selection
and sentencing practices--of which racial profiling is but the tip of an
iceberg (14-16). If the FBI's DNA database is primarily composed of those
who have been touched by the criminal justice system and that system has
engaged in practices that routinely select more from one group, there will
be an obvious skew or bias toward this group in this database.

               If we turn the clock back just 60 years, whites constituted
about 77% of all prisoners in America, while blacks were only 22% (17).
In just six decades, the incarceration rate of African Americans in
relation to whites has gone up in a striking manner. In 1933, blacks were
incarcerated at a rate about three times that of whites (18). In 1950,
the ratio had increased to about four times; in 1970, it was six times;
and in 1990, it was seven times that of whites.

               Among humans, gene pools and SNP patterns cannot change much
in 60 years, but economic conditions and the practices of the criminal
justice system demonstrably do. The comparative explanatory power of SNP
patterns surely pales before the analytic utility of examining shifting
institutional practices and economic conditions. However, given the body
of "ethnic-estimation" research being published on behalf of forensic
applications (19, 20) and the exponential growth of national DNA databases
(21, 22), it is not at all unreasonable to expect that a project that
proposed to search for SNP profiles among sex offenders and felons
convicted of violent crimes would meet with some success, both for funding
and for finding "something." This could begin with the phenotype of "three
populations," as in the study cited above (11), because that is the way
these data are collected by the FBI and the contributing states. We must
maintain vigilance to prevent SNP profiling from providing the thin veneer
of neutral scientific investigation, while reinscribing the racial
taxonomies of already collected data. [HN11]

               Conclusions
               As I have tried to show, a set of assumptions about race has
animated the development of BiDil, genetic diversity analyses, "ethnic
estimation" research, and the siren's call to do SNP research on the
ever-expanding databases of DNA from the incarcerated. These elements are
poised to exert a cascading effect--reinscribing taxonomies of race across
a broad range of scientific practices and fields. Biomedical research must
resist setting off the cascade and, while still moving forward in its
efforts to identify the molecular correlates of disease, climb back on the
tightrope to address racial disparities in health, in all their biosocial
complexity.

               The ability to use genomic knowledge to deliver effective
pharmaceuticals more safely to special subpopulations that have some
functional genetic markers holds promise. Thus, if the FDA approves BiDil,
it should do so only under the condition that further research be
conducted to find the markers that have the actual functional association
with drug responsiveness--thus assuring that the drug be approved for
everyone with those markers, regardless of their ancestry, or even of
their ancestral informative markers.

               The technology will be increasingly available to provide SNP
profiles of populations. When the phenotype distinguishing these
populations is race, the likelihood of committing the fallacy of misplaced
concreteness, in science, is nearly overwhelming. For this reason, when
geneticists report population data, they should always attach a caveat or
warning label that could read something like this, "allelic frequencies
vary between any selected human groups--to assume that those variations
reflect 'racial categories' is unwarranted." Whereas this will not
completely block the tendency to reify race, it will be an appropriately
cautious intervention that tries to prevent science from unwittingly
joining the current march toward a biological reinscription of the
concept.

               References and Notes

                 1. A. N. Whitehead, Process and Reality (Harper, New York,
1929), p. 11.
                 2. J. Kahn, Yale J. Health Policy Law Ethics 4, 1 (2004).
                 3. R. S. Cooper, J. S. Kaufman, Hypertension 32, 813 (1998).
                 4. M. J. Klag et al., JAMA 265, 599 (6 February 1991).
                 5. A. L. Taylor et al., N. Engl. J. Med. 351, 2049 (2004).
                 6. NitroMed, Inc., "BiDil Named to American Heart
Association's 2004 'Top 10 Advances' List; Only Cardiovascular Drug
Recognized by AHA for Dramatically Improving Survival in African American
Heart Failure Patients," PR Newswire US, 11 January 2005.
                 7. Jonathan Kahn, Jay Kaufman, personal communication.
                 8. R. S. Cooper et al., BMC Med. 3, 11 (2005).
                 9. V. Griffith, "FDA backs ethnically targeted drug,"
Financial Times, 9 March 2001, p. 13.
                 10. www.bioitworld.com/news/102904_report6447.html
                 11. D. A. Hinds et al., Science 307, 1072 (2005).
                 12. International HapMap Consortium, Nature 426, 789
(2003); available at www.hapmap.org.
                 13. M. Jobling, P. Gill, Nature Rev. Genet. 5, 739 (2004).
                 14. M. Mauer, Race to Incarcerate (New Press, New York,
1999).
                 15. J. Donohue, S. Levitt, J. Law Econ. 44, 367 (2001).
                 16. J. Knowles, N. Persico, P. Todd, J. Polit. Econ. 109,
203 (2001).
                 17. A. Hacker, Two Nations: Black and White, Separate,
Hostile, Unequal (Scribner's, New York, 1992), p. 197.
                 18. T. Duster, in DNA and the Criminal Justice System: The
Technology of Justice, D. Lazer, Ed. (MIT Press, Cambridge, MA, 2004), pp.
315-334.
                 19. M. D. Shriver et al., Am. J. Hum. Genet. 60, 957 (1997).
                 20. A. L. Lowe, A. Urquhart, L. A. Foreman, I. W. Evett,
Forensic Sci. Int. 119, 17 (2001).
                 21. D. Lazer, Ed., DNA and the Criminal Justice System: The
Technology of Justice (MIT Press, Cambridge, MA, 2004), pp. 1-2.
                 22. T. Simoncelli, Genewatch 17 (March and April 2004).

--------------------------------------------------------------------
               The author is director of the Institute for the History of the
Production of Knowledge, New York University, 269 Mercer Street, New York,
NY 10003-6687, USA. E-mail: troy.duster at nyu.edu

--------------------------------------------------------------------

               HyperNotes
               Related Resources on the World Wide Web
               General Hypernotes
                 Dictionaries and Glossaries

                 The Talking Glossary of Genetic Terms is made available by
the National Human Genome Research Institute.

                 A genome glossary is provided by the Human Genome Project
Information Web site.

                 Web Collections, References, and Resource Lists

                 MedlinePlus from the U.S. National Library of Medicine
provides links to news and Internet resources on medical topics.

                 The library of the Karolinska Institutet, Stockholm, Sweden,
provides collections of Internet resources on genetics.

                 The Public Health Genetics Unit, funded by the UK Department
of Health and the Wellcome Trust, provides a collection of Internet links.

                 The Human Genome Project Information Web site provides a
resource page on minorities, race, and genomics.

                 The National Human Genome Research Institute (NHGRI)
provides a list of online bioethics resources.

                 For an anthropology course on race and racism in the modern
world, P. Willoughby, Department of Anthropology, University of Alberta,
offers links to related Internet resources.

                 Online Texts and Lecture Notes

                 The Rediscovering Biology Web site includes a textbook
chapter and other resources on genomics.

                 The National Center for Biotechnology Information (NCBI)
provides a science primer on genetics and genomics.

                 DNA Interactive from the Dolan DNA Learning Center at Cold
Spring Harbor Laboratory includes a section on DNA applications.

                 The Human Genome is an educational presentation of the
Wellcome Trust. A section on genetics and society is included.

                 The University of Montréal's HumGen Web site deals with the
social, ethical, and legal aspects of human genetics. The FAQ includes a
section on human population genetics issues.

                 The History of Race in Science Web site is a resource for
scholars and students interested in the history of "race" in science,
medicine, and technology.

                 Molecular Genetics is a tutorial provided by U. Melcher,
Department of Biochemistry and Molecular Biology, Oklahoma State
University.

                 R. L. Miesfeld, Department of Biochemistry and Molecular
Biophysics, University of Arizona, provides lecture notes for a course on
applied molecular genetics. Lecture notes on DNA forensics and
pharmacogenomics are included.

                 General Reports and Articles

                 The 14 June 1997 issue of BMJ had an article by R. Bhopal
titled "Is research into ethnicity and health racist, unsound, or
important science?"

                 NHGRI provides links to the series of articles on "Genes,
race and psychology in the genome era" that appeared in the January 2005
issue of the American Psychologist.

                 V. Randall, Institute on Race, Health Care and the Law,
University of Dayton School of Law, makes available an article by S. S.-J.
Lee, J. Mountain, and B. Koenig titled "The reification of race in health
research," which was adapted from a Spring 2001 article (available in PDF
format) in the Yale Journal of Health Policy, Law, and Ethics.

                 The November 1998 issue of Hypertension had an article by R.
S. Cooper and J. S. Kaufman titled "Race and hypertension: Science and
nescience" (3).

                 The 24 October 2003 issue of Science was a special issue on
genomic medicine. Included was a News article by C. Holden titled "Race
and medicine."

                 The 15 November 2002 issue of Science had an Enhanced
Perspective by P. Sankar and M. K. Cho titled "Toward a new vocabulary of
human genetic variation."

                 The November 2004 issue of Nature Genetics had a special
supplement titled "Genetics for the human race."

               Numbered Hypernotes
                 1. Alfred North Whitehead. Biographical information about
Alfred North Whitehead is provided by the Columbia Encyclopedia and the
MacTutor History of Mathematics archive. The Stanford Encyclopedia of
Philosophy has an entry on Alfred North Whitehead with a discussion of his
"fallacy of misplaced concreteness."

                 2. Reification is defined by the Principia Cybernetica Web.
Wikipedia has an entry on reification.

                 3. The BiDil drug study. The 11 November 2004 issue of the
New England Journal of Medicine had an article by A. L. Taylor, J. N. Cohn
et al. (for the African-American Heart Failure Trial Investigators) titled
"Combination of isosorbide dinitrate and hydralazine in Blacks with heart
failure" (5); the issue included a related editorial ("Nitroso-redox
balance in the cardiovascular system" by J. M. Hare) and a perspective
("Race-based therapeutics" by M. G. Bloche). The University of Minnesota
Academic Health Center issued an 8 November 2004 press release titled
"First African American Heart Failure Trial shows 43 percent increase in
survival." The 6 December 2004 issue of American Medical News had an
article by S. J. Landers titled "New drug combo intensifies race-based
medicine debate." Cardiology Online makes available an 11 November 2004
news article about the research. ClinicalTrials.gov provides information
on the African-American Heart Failure Trial. The Minnesota
Spokesman-Recorder makes available a 24 November 2004 article by L. Boyce
titled "BiDil controversy raises specter of racial profiling in medicine."

                 4. NitroMed offers a presentation about BiDil and makes
available press releases about BiDil dated 1 November 2004 and 8 November
2004, as well as a 3 February 2005 press release titled "FDA accepts
NitroMed's new drug application resubmission for BiDil."

                 5. Recent study of race and hypertension. BMC Medicine had
a 5 January 2005 article by R. S. Cooper et al. titled "An international
comparative study of blood pressure in populations of European vs. African
descent" (8). BMC Medicine had a 7 January 2005 article by J. Tomson and
G. Y. H. Lip titled "Blood pressure demographics: Nature or nurture ...
genes or environment?" that includes a discussion of Cooper et al.'s
research.

                 6. David B. Goldstein is in the Department of Biology,
University College London. Goldstein is quoted in a 29 October 2004 Bio-IT
World article by K. Davies titled "Scientists debate race, genetics, and
'ethnic medicine'" (10). The November 2004 Nature Genetics Supplement had
an article by S. K. Tate and D. B. Goldstein titled "Will tomorrow's
medicines work for everyone?"

                 7. Paper in this issue of Science. The Research Article in
this issue by D. A. Hinds et al. is titled "Whole genome patterns of
common DNA variation in three human populations" (11). The article authors
David A. Hinds, Laura L. Stuve, Geoffrey B. Nilsen, Dennis G. Ballinger,
Kelly A. Frazer, and David R. Cox are at Perlegen Sciences, Inc., Mountain
View, CA; Eran Halperin is at the International Computer Science
Institute, Berkeley, CA; and Eleazar Eskin is in the Department of
Computer Science and Engineering, University of California, San Diego.
Also in this issue of Science is a related Perspective by D. Altshuler and
A. G. Clark titled "Harvesting medical information from the human family
tree."

                 8. SNPs. NHGRI's Talking Glossary of Genetic Terms defines
SNPs (single nucleotide polymorphisms); an extended audio definition is
also provided. The Wellcome Trust's Human Genome Web site provides an
introduction to single nucleotide polymorphisms. The SNP Consortium
provides an introduction to SNP markers. The NCBI's Science Primer
includes a presentation on SNPs. A SNP fact sheet is provided by the Human
Genome Project Information Web site. BioTechniques had a June 2002
supplement titled "SNPs: Discovery of markers for disease." Perlegen
Sciences, Inc. offers an introduction to SNPs and genetic variation.

                 9. Human genetic diversity and the HapMap Project. D.
O'Neil, Behavioral Sciences Department, Palomar College, San Marcos, CA,
provides a tutorial on human variation as part of a collection of physical
anthropology tutorials. The Genome News Network offers a presentation
titled "Human genome variations." Rediscovering Biology offers an
introduction to genetic variation within species and SNPs. The
International HapMap Project is an international partnership of scientists
and funding agencies to develop a public resource to identify genes
associated with human disease and response to pharmaceuticals; the project
makes available in PDF format the 18 December 2003 Nature article titled
"The International HapMap Project" (12). The 30 April 2004 issue of
Science had a News Focus article by J. Couzin titled "Consensus emerges on
HapMap strategy." NHGRI provides a resource page about the HapMap project;
a 7 February 2005 press release titled "International HapMap Consortium
expands mapping effort" is made available.

                 10. National DNA databases. The Israel Academy of Sciences
and Humanities makes available a December 2002 report titled
"Population-based large-scale collections of DNA samples and databases of
genetic information." M. A. Jobling, Department of Genetics, University of
Leicester, UK, makes available in PDF format an October 2004 Nature
Reviews Genetics article by M. A. Jobling and P. Gill titled "Encoded
evidence: DNA in forensic analysis" (13). The U.S. Federal Bureau of
Investigation provides the text of Congressional testimony on the FBI DNA
database program and information on the Combined DNA Index System. The
March-April 2004 issue of GeneWatch had an article by T. Simoncelli titled
"Retreating justice: Proposed expansion of federal DNA database threatens
civil liberties" (22). A November 2004 report titled "Genetic information
and crime investigation: Social, ethical and public policy aspects of the
establishment, expansion and police use of the National DNA Database" by
R. Williams, P. Johnson, and P. Martin is available in PDF format from the
authors at the School of Applied Social Sciences, University of Durham,
UK. GeneWatch UK makes available a briefing and a press release about its
January 2005 report titled "The Police National DNA Database: Balancing
crime detection, human rights and privacy." The UK Forensic Science
Service provides a collection of fact sheets, including one on the UK
National DNA database. The National DNA Data Bank of Canada provides a
FAQ.

                 11. Race, genes, and the criminal justice system. The
National Criminal Justice Reference Service makes available in PDF format
the September 2004 summary report titled "Disproportionate minority
confinement: 2002 update," issued by the Office of Juvenile Justice and
Delinquency Prevention, and other publications on minority
overrepresentation. T. Duster makes available in PDF format a chapter
titled "Selective arrests, an ever-expanding DNA forensic database, and
the specter of an early-twenty-first-century equivalent of phrenology."
The Human Genome Project Information Web site makes available a
November-December 1999 Judicature special issue on genes and justice. The
June 2004 issue of Genomics & Proteomics had an article by A. Dove titled
"Molecular cops: Forensic genomics harnessed for the law." DNAPrint
genomics, Inc. makes available a 13 August 2002 article on forensic
science (from Australian Biotechnology News) by M. Trudinger titled "From
the textbooks to the courts." Forensic Bioinformatics makes available the
presentations from its August 2004 conference ("DNA from crime scene to
court room: An expert forum"); included is a presentation (PowerPoint,
with relevant articles provided in PDF format) by T. Kessis titled "Racial
identification and future application of SNPs."

                 12. Troy Duster is at the Institute for the History of the
Production of Knowledge and in the Department of Sociology, New York
University.

Volume 307, Number 5712, Issue of 18 Feb 2005, pp. 1050-1051.