• Search Menu
  • Advance articles
  • Author Guidelines
  • Submission Site
  • Open Access
  • Reasons to publish
  • About Human Molecular Genetics
  • Editorial Board
  • Advertising and Corporate Services
  • Journals Career Network
  • Self-Archiving Policy
  • Dispatch Dates
  • Journals on Oxford Academic
  • Books on Oxford Academic

Issue Cover

Article Contents

Source of human dna, internationality in both research and clinical human genetics, acknowledgements.

  • < Previous

The International Human Genome Project

  • Article contents
  • Figures & tables
  • Supplementary Data

Ewan Birney, The International Human Genome Project, Human Molecular Genetics , Volume 30, Issue R2, 15 October 2021, Pages R161–R163, https://doi.org/10.1093/hmg/ddab198

  • Permissions Icon Permissions

The human genome project was conceived and executed as an international project, due to both pragmatic and principled reasons. This internationality has served the project well, with the resulting human genome being freely available for all researchers in all countries. Over time the reference human genome will likely have to evolve to a graph genome, and tap into more diverse sequences worldwide. A similar international mindset underpins data analysis for the interpretation of the human genome from basic to clinical research.

The Human Genome Project was conceived as an international endeavor ( 1 ). Partly this was pragmatism. In the mid 1980s, the scale of sequencing any organism’s genome beyond the smallest of viruses was daunting, and the human genome was almost beyond conception. A number of early ‘demonstration’ genomes, such as the Nematode was conceived as international projects, showing how such collaboration can work in practice ( 2 ). The international collaboration helped bind the academic community together, with the sense that the breakthroughs in technology and understanding were best shared, if only to ensure that you could make the soundest argument to funders at home. But it was also a matter of principle by the participants; if the human genome was going to be a key data resource for humanity, ideally a diverse group of humans should participate in its creation, and have collective ownership of the result.

Later on in the project the internationality narrowed to predominantly US and UK academic groups racing with the US company Celera to complete a draft of the human genome. This narrative often fails to bring in the Japanese and German chromosome 21, Japanese contribution to chromosome 22 and the French chromosome 14—the international community had made use of the necessity of having a clone-based map first to allow coordinated chromosomes to be delivered by specific groups. However, the overall genetic map of the human genome (using polymorphic microsatellite markers) was created by bold insightful work from the CEPH project in France in the late 1980s and early 1990s. In the latter part of the 1990s, the announcement that Celera was aiming to create a human reference genome using just whole genome shotgun sequencing altered the strategy of the academic project and narrowed its major delivery partners to four large laboratories in the US and one in the UK, with the exception of Chromosomes 21 and 14.

Despite this narrowing later in the 1990s, a key principle of international data sharing had been established. In 1997, the majority of the human genome academic project leads met in Bermuda ( 3 ) and agreed to share sequence data via the international DNA databases (ENA/GenBank/DDBJ) within 24 h of having passed QC checks. Here, the principle of the genome being a common resource for everyone was a large driver, but this also had a strong streak of pragmatism; this ‘show your data’ provided a way to coordinate the effort across the project. The end result was the announcement of the completion of two drafts of the human genome on 26 June 2000 by Bill Clinton, in the White House with leaders of both the public and private projects present, and a video call with Tony Blair to the UK. Most importantly there was a draft human genome in the public domain for all humanity to use ( 4 , 5 ).

This draft was progressively improved upon over the decades, with the Genome Reference Consortium (GRC ( 6 )) providing the definitive ‘release’ of the human genome against which other information, from gene annotation through to polymorphism is described. The new releases improve representation of the genome and increasingly model aspects of structural variation. However, there is a large amount of inertia around moving between reference versions, in particular in the clinical domain ( 7 ).

Humans, Homo sapiens , are a young species. There is an increasingly complex and tangled web during the latest stage of human evolution beginning some 100 000 years ago. A variety of other hominid species co-evolved and sometimes mixed with us during that founding period, but the rapid migration and expansion of humans across the world starting some 50 000 years ago has meant that human genetics (variation in human DNA) is mainly due to the variation present in Africa at this point in time. This also means there is a relatively moderate amount of large structural variation, such as large insertions, deletions or rearrangements, compared to even our great ape cousins—let alone the chaos within some vertebrate genomes. This means that for much of our genome ‘any human’ will provide a reasonable reference representation that other human genome sequences can be described against. However, there are enough regions of structural variation, in particular in important biological regions such as the major histocompatibility complex (MHC), that the choice of reference becomes an important aspect of analysis.

In the 1980s and 1990s, the workhorse scheme for isolating DNA was to create bacterial artificial chromosomes (BACs) which could each store around 250 KB of DNA stably in bacteria. The resulting bacteria could be grown as clones each containing different single regions of the human genome in a BAC; the full collection of such clones was called ‘a library.’ The public human genome sequence is made from around 50 such libraries (including some other technologies than BAC), and BAC RP11 is the most common source of information for the human genome. We can infer that the donor for the RP11 library was African-American. As such, the public human genome includes more ancestral diversity than most people appreciated, though it is still substantially biased towards recent European ancestry by regions sequenced from the other libraries.

More recently, new long-read DNA technologies from Oxford Nanopore Technologies and Pacific BioSciences have provided a new way to sequence the human genome. In particular, these technologies can span the complex repeat structures present in a variety of locations across the human genome—the centromeres, ribosomal RNA arrays and peri-centromeric repeats had been impossible to tackle with previous technology. Recently a full ‘telomere to telomere’ assembly has been released for a single human haplotype ( 8 ). As well as this being a technical tour-de-force, it opens up the potential to characterize many human genomes, if not all, in a way which can capture the complete sequence for both maternal and paternal copies.

Handling both our understanding of existing, complete, human genomes and the representation, which might be partial, of any particular individual’s genome will have to move beyond the concept of a linear reference genome with simple edits performed against this reference. The ability to think of sets of genomes as a graph elegantly solves these problems, where any type of insertion, deletion or rearrangement from one sequence to another can be represented. ‘Graph genomes’ have been used in sequence analysis and bioinformatics since the early 2000s ( 9 ), with diverse applications spanning assembly through to splicing patterns, but their routine use for representing, annotating and manipulating information beyond these select applications has been limited. As more and more human genomes are generated in an end to end manner, and as there becomes more appreciation of the biology present in some of these more tangled regions we will have to have better tools, visualizations and mindset that can accommodate this representation. Indeed, this ‘multiple genome’ problem is present in representing each individual’s diploid haplotypes, and so is a direct concern for a complete view of an individual’s human genome ( 10 ). We should be thankful that we do not have the genomic complexity of most other metazoa with far higher levels of structural variation, let alone the complex polyploid structures present across plants.

The human genome provides a natural ‘index’ for all the RNAs and proteins made in a cell, and has been a key part of basic research in the design of reagents (from microarrays in 2000s to CRISPR libraries in the 2020s) and the interpretation of results from RNAseq through to Mass-spec proteomics. Much of this is supported by the presence of large scale open access databases in molecular biology, which aggregate this information for all scientists to use worldwide ( 11 , 12 ). In addition, the human genome has revitalized human genetics—the study of human biology using natural variation present between individuals. In the latter case, the combination of cheap genotyping (still predominantly via microarrays) and cheap sequencing (via the short-read technology of Illumina) has allowed the routine generation of near complete maps of individuals for their common DNA variation (common meaning present in around 1% of individuals or more). The cheapness of this genetic assay has allowed for large scale genotyping, and increasingly now full sequencing, of cohorts to occur in many places across the world. The mainstay analysis of these cohorts is genome-wide association studies (GWAS), described in more depth in this special issue. More recently, the cheapness of effective short-read sequencing, which can capture the majority of changes in protein coding genes has shifted clinical genetics from targeted gene-by-gene diagnosis to a global whole exome (WES) or whole genome (WGS) approach.

For both research and clinical analysis, responsible international data analysis has been critically important in unlocking insights. In the former research setting, replication between cohorts was important to generate confidence in the results of GWAS. This has shifted to the almost routine global consortium around a particular phenotype to maximize power; the presence of diverse cohorts not only increases the power around each tested variant, but the differential frequencies of rarer alleles means that each cohort ‘sees’ a slightly different spectrum of variants. These mega-author list papers march on, and despite the slightly repetitive nature of the science, each phenotype under study is worth understanding in as much detail as possible. Similarly in rare disease genetics, where there might be a handful of individuals worldwide who have the same mutation, international collaboration has been key to providing robust diagnosis and gene discovery for human genetics. This has been codified in projects such as the Matchmaker exchange ( 13 , 14 ), which allows clinical genetics groups to exchange information of genes of interest in a secure, responsible and even-handed manner.

Like much of the developing world, African nations are now bringing in more genetics research using the technologies developed over the previous decade and are now organizing research cohorts and deploy human clinical genetics more broadly across Africa; this is the start of rebalancing inequity in this area of research in general, but is also an opportunity globally as the richest source of genetic diversity in humans is found in the continent of the birthplace of our species. More recently the excellent H3Africa ( 15 ) resources, led by African scientists, have been creating more research cohorts that span different nations in Africa. Whilst keeping the African-led nature of this project, and placing African scientists to the fore, H3Africa has also committed to responsible data sharing. Similar efforts to H3Africa and continuation of H3Africa’s work itself are needed to broaden the practice of genetics and genomics globally over the coming decades.

To enable the most utility from these datasets, we must have responsible joint data analysis of both research cohorts and secondary use of clinical genomics. Such data sharing must be rooted in the ethical framework and the legal processes derived from them present in each country. Furthermore, international data analysis necessitates international standards for the datasets. Here the Global Alliance for Genomics and Health (GA4GH) is an organization founded in 2014 to enable responsible data sharing in genomics globally. Nearly every country has the goal to better understand the health and disease present in its population via science, and this broad goal is present in the UN Charter for Human Rights ( 16 ).The GA4GH ethical frameworks aim to activate these rights and align the discussions happening in many countries for responsible global data analysis; in practice, this maps to easier mutual recognition of processes and concepts. On the technical side, the entire endeavor of human genomics, from its earliest days in the 1980s have required well understood data structures and protocols to share data or analysis, often created as de facto standards between academics by virtue of the need to share data. GA4GH provides a responsible home for these standards (such as the widely used BAM/CRAM and VCF standards) and a process for creating new standards in the Cloud-enabled and connected world we live in now.

The human genome is a dataset which is owned by all of us, for use by humanity. Human genetics and genomics has always flourished in an international context and leaders of the field in the 1970s, 1980s and 1990s insisted on international, open data sharing of key resources. The future is likely to be as demanding for the need for as open as possible data sharing, adapting to the world of even more genetic and genomic data, again for the benefit of all humanity.

E.B. is funded by European Molecular Biology Laboratory. E.B. is paid consultant of Oxford Nanopore Technologies.

Watson , J.D. and Cook-Deegan , R.M. ( 1991 ) Origins of the human genome project . FASEB J. , 5 , 8 – 11 .

Google Scholar

Wilson , R.K. ( 1999 ) How the worm was won: the C. elegans genome sequencing project . Trends Genet. , 15 , 51 – 58 .

Guyer , M. ( 1998 ) Statement on the rapid release of genomic DNA sequence . Genome Res. , 8 , 413 .

Lander , E.S. , Linton , L.M. , Birren , B. , Nusbaum , C. , Zody , M.C. , Baldwin , J. , Devon , K. , Dewar , K. , Doyle , M. , FitzHugh , W.  et al.  ( 2001 ) Initial sequencing and analysis of the human genome . Nature , 409 , 860 – 921 .

Venter , J.C. , Adams , M.D. , Myers , E.W. , Li , P.W. , Mural , R.J. , Sutton , G.G. , Smith , H.O. , Yandell , M. , Evans , C.A. , Holt , R.A.  et al.  ( 2001 ) The sequence of the human genome . Science , 291 , 1304 – 1351 .

Church , D.M. , Schneider , V.A. , Graves , T. , Auger , K. , Cunningham , F. , Bouk , N. , Chen , H.C. , Agarwala , R. , McLaren , W.M. , Ritchie , G.R.S.  et al.  ( 2011 ) Modernizing reference genome assemblies . PLoS Biol. , 9 , e1001091 .

Lansdon , L.A. , Cadieux-Dion , M. , Yoo , B. , Miller , N. , Cohen , A.S.A. , Zellmer , L. , Zhang , L. , Farrow , E.G. , Thiffault , I. , Repnikova , E.A.  et al.  ( 2021 ) Factors affecting migration to GRCh38 in laboratories performing clinical next-generation sequencing . J. Mol. Diagn. , 23 , 651 – 657 .

Nurk , S. , Koren , S. , Rhie , A.  et al.  ( 2021 ) The complete sequence of a human genome. The complete sequence of a human genome . bioRxiv doi: https://doi.org/10.1101/2021.05.26.445798 .

Flicek , P. and Birney , E. ( 2009 ) Sense from sequence reads: methods for alignment and assembly . Nat. Methods , 6 , S6 – S12 .

Garg , S. , Rautiainen , M. , Novak , A.M. , Garrison , E. , Durbin , R. and Marschall , T. ( 2018 ) A graph-based approach to diploid genome assembly . Bioinformatics , 34 , i105 – i114 .

Cantelli , G. , Cochrane , G. , Brooksbank , C. , McDonagh , E. , Flicek , P. , McEntyre , J. , Birney , E. and Apweiler , R. ( 2021 ) The European bioinformatics institute: empowering cooperation in response to a global health crisis . Nucleic Acids Res. , 49 , D29 – D37 .

Sayers , E.W. , Beck , J. , Bolton , E.E. , Bourexis , D. , Brister , J.R. , Canese , K. , Comeau , D.C. , Funk , K. , Kim , S. , Klimke , W.  et al.  ( 2021 ) Database resources of the National Center for biotechnology information . Nucleic Acids Res. , 49 , D10 – D17 .

Philippakis , A.A. , Azzariti , D.R. , Beltran , S. , Brookes , A.J. , Brownstein , C.A. , Brudno , M. , Brunner , H.G. , Buske , O.J. , Carey , K. , Doll , C.  et al.  ( 2015 ) The matchmaker exchange: a platform for rare disease gene discovery . Hum. Mutat. , 36 , 915 – 921 .

Sobreira , N.L.M. , Arachchi , H. , Buske , O.J. , Chong , J.X. , Hutton , B. , Foreman , J. , Schiettecatte , F. , Groza , T. , Jacobsen , J.O.B. , Haendel , M.A.  et al.  ( 2017 ) Matchmaker exchange . Curr. Protoc. Hum. Genet. , 95 , 9.31.1 – 9.31.15 .

The H3Africa Consortium ( 2014 ) Enabling the genomic revolution in Africa . Science , 344 , 1346 – 1348 .

Knoppers , B.M. ( 2014 ) Framework for responsible sharing of genomic and health-related data . HUGO J. , 8 , 3 .

  • human genetics
  • genome, human
  • human genome project
  • graphical displays
  • clinical research
  • internationality
  • data analysis

Email alerts

Citing articles via.

  • Recommend to your Library

Affiliations

  • Online ISSN 1460-2083
  • Print ISSN 0964-6906
  • Copyright © 2024 Oxford University Press
  • About Oxford Academic
  • Publish journals with us
  • University press partners
  • What we publish
  • New features  
  • Open access
  • Institutional account management
  • Rights and permissions
  • Get help with access
  • Accessibility
  • Advertising
  • Media enquiries
  • Oxford University Press
  • Oxford Languages
  • University of Oxford

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

  • Copyright © 2024 Oxford University Press
  • Cookie settings
  • Cookie policy
  • Privacy policy
  • Legal notice

This Feature Is Available To Subscribers Only

Sign In or Create an Account

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

  • Open access
  • Published: 24 March 2021

How to design a national genomic project—a systematic review of active projects

  • Anja Kovanda   ORCID: orcid.org/0000-0002-7468-878X 1 ,
  • Ana Nyasha Zimani 1 &
  • Borut Peterlin 1  

Human Genomics volume  15 , Article number:  20 ( 2021 ) Cite this article

8203 Accesses

12 Citations

3 Altmetric

Metrics details

An increasing number of countries are investing efforts to exploit the human genome, in order to improve genetic diagnostics and to pave the way for the integration of precision medicine into health systems. The expected benefits include improved understanding of normal and pathological genomic variation, shorter time-to-diagnosis, cost-effective diagnostics, targeted prevention and treatment, and research advances.

We review the 41 currently active individual national projects concerning their aims and scope, the number and age structure of included subjects, funding, data sharing goals and methods, and linkage with biobanks, medical data, and non-medical data (exposome). The main aims of ongoing projects were to determine normal genomic variation (90%), determine pathological genomic variation (rare disease, complex diseases, cancer, etc.) (71%), improve infrastructure (59%), and enable personalized medicine (37%). Numbers of subjects to be sequenced ranges substantially, from a hundred to over a million, representing in some cases a significant portion of the population. Approximately half of the projects report public funding, with the rest having various mixed or private funding arrangements. 90% of projects report data sharing (public, academic, and/or commercial with various levels of access) and plan on linking genomic data and medical data (78%), existing biobanks (44%), and/or non-medical data (24%) as the basis for enabling personal/precision medicine in the future.

Our results show substantial diversity in the analysed categories of 41 ongoing national projects. The overview of current designs will hopefully inform national initiatives in designing new genomic projects and contribute to standardisation and international collaboration.

Genomic medicine is the use of genetic information to inform medical care or predict the risk of disease and has been importantly influenced by novel technology such as whole-exome sequencing and whole-genome sequencing [ 1 , 2 ]. This has led to a significant improvement of health systems particularly in the diagnosis of rare genetic disorders and cancer [ 3 , 4 , 5 , 6 , 7 ] as well as in the development of precision medicine, which is the use of diagnostic tools and treatments targeted to the needs of the individual patient based on their genomics, epigenomics, proteomics, metabolomics, lipidomics, and other data such as environmental and lifestyle information [ 3 , 8 ].

Thirty years ago, in 1990, the Human Genome Project was initiated with the primary goal to obtain a highly accurate sequence of the human genome and to identify its genes [ 9 , 10 ]. It was followed, in 1998, by the Icelandic deCode Project, the first major attempt to link genomic data with other medical and non-medical data [ 11 ], and in 2010 by the UK10K project, a collaboration among several UK public and private institutions, to identify genetic causes of rare diseases [ 12 ]. In 2015, the large precision medicine initiatives of the USA and China were started (to be completed within the next decade) [ 13 , 14 , 15 , 16 ]. In Europe, the initiative “Towards access to at least 1 million sequenced genomes in the EU by 2022” started in 2018 with the aim to share genomic information and best practices among member states [ 13 , 14 , 17 , 18 ]. There are high expectations on the benefits of whole genomic sequencing in terms of the development of precision medicine including improved and cost-effective diagnostics, more targeted prevention and treatment. Nevertheless, few of the projected gains have been demonstrated and no standards on designing the national genome projects have been developed so far.

With this systematic review, we aimed to provide an overview of available information on active national genome projects worldwide in terms of identifying common characteristics and differences among them, which could provide a basis for developing best practices and standards for the design of national projects and sharing of national genome resources.

Materials and methods

The principles of the PRISMA model were used in the preparation of this work, where possible and appropriate (Fig. 1 ) [ 19 ].

figure 1

PRISMA type approach to the selection of projects to be included in the analysis

Shortly, to identify existing national genomic projects, PubMed ( www.ncbi.nlm.nih.gov/pubmed ), Google, and European Genome Phenome Archive-EGA ( https://ega-archive.org/ ) searches were performed in April 2020 by using the search strings: (<country name> [Title]) and (human genome project).

Country names were used in their English language form as listed on Wikipedia countries and dependencies site ( https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population ).

The following exclusion criteria were used to classify on-going projects: projects concluded prior to the year 2020 or planned with no imminent date in the year 2021 were classified as ‘not currently on-going’; international projects and/or those providing only samples/sequencing facilities were defined as ‘international-scope projects’; and finally, those with unavailable information on key features examined in the article (non-functional websites, announcements with insufficient information, no information in the English language) were defined as ‘limited scope projects'. All three authors analysed and co-reviewed the data and any discrepancies and/or inconsistencies were resolved through agreement. Projects that were not currently on-going, were of limited-scope, and those of international, rather than national scope were excluded from the analysis (Fig. 1 ).

The complete list of categories for all identified projects is given in Supplement Table 1 .

The contents of the individual national project websites were browsed for information pertaining to (1) the aims and scope of the individual project (determining normal and pathological genomic variation, infrastructure (including sequencing and analysis capacities, implementation of standards, data management, education, integration of genomics into existing health-care systems), and intention of facilitating personalized medicine); (2) the number and age structure of included subjects; (3) funding; (4) data sharing goals and methods; and (5) linkage with biobanks, medical data, and non-medical data.

A PRISMA flow-chart diagram was generated using the on-line template ( http://www.prisma-statement.org/ ).

Shared aims of national genomic projects were visualized using an online VENN diagram tool ( http://bioinformatics.psb.ugent.be/cgi-bin/liste/Venn/calculate_venn.htpl .).

World maps of national genomic projects were constructed using the online tools available at Mapchart.net ( https://mapchart.net/world.html ).

A total of 86 countries with genomic projects and/or genomic databases were identified among the 240 countries and territories searched, of which 41 projects were currently active, according to the information provided by respective websites [ 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 ] (Fig. 1 ). The remaining projects were either not active at the moment or were part of larger international projects (such as H3Africa) and hence not actual ‘national’ projects in a strict sense (Fig. 2 ). The full list of identified projects is given in Supplement table 1 , List of national projects.

figure 2

National genomic projects across the world

Aims and scope

The aims of the national genomic projects consisted of four major categories: (1) determining normal genomic variation, (2) determining pathological genomic variation (clinical cohorts such as rare diseases, cancer, complex diseases, etc.), (3) infrastructure, and (4) facilitating personalized and precision medicine (Fig. 3 ). Additionally, many country-specific aims were also identified, such as history/ethnic studies (Armenia, Brazil, Chile, Hong Kong, Iran, Malta, Mexico, New Zealand, Russia, Singapore, Vietnam) [ 20 , 21 , 22 , 25 , 31 , 34 , 41 , 42 , 45 , 56 , 61 ], drug discovery (Australia, Bahrain, Cyprus, Hong Kong, Japan, Malta, Switzerland, Thailand, UK) [ 23 , 37 , 39 , 41 , 43 , 45 , 46 , 48 , 60 , 62 ], reparation efforts (Argentina) [ 63 ], or specific health-related goals (infectious diseases interactions—e.g. malaria, tuberculosis in endemic countries) [ 64 , 65 ].

figure 3

Overlap of major aims of the 41 currently active national genomic projects

Determining normal genomic variation

The most common aim (90%, 37/41) of national genomic projects was to investigate normal genomic variation by sequencing healthy participants. Because defining health in the context of genomic testing can be challenging, especially in the case of non-penetrant mutations and late-onset disorders, most national projects approached this challenge by either creating cohorts based on demographic data (9/41 projects) and linking them with medical data or specific exclusion criteria, or by specifically identifying healthy individuals (healthy parents from trio testing in rare diseases, longitudinal health-tracking cohorts from previous studies) (Supplement Table 1 ).

Determining pathological genomic variation

The second most common aim was to determine pathological genomic variation through the sequencing of clinical cohorts (71%, 29/41). Seven of the 29 (24%) of the national projects clearly defined the number of subjects they plan to include in their clinical cohorts in advance (France, UK, Australia, Hong Kong, New Zealand, Thailand, and Slovenia), as well as the cohorts or pilot projects themselves. In case of France, 48 clinical cohorts will be included [ 30 ], the UK project will include over 190 rare diseases and cancer program [ 37 ], and similarly, Australia will include 18 rare disease and cancer flagship projects [ 66 ]. The final cohorts in the rest of the projects aiming to determine pathological genomic variation will depend on various factors (funding, pilot initiatives etc.) and will be discussed further below.

Infrastructure

The third most common aim, which was reported by roughly two thirds of the projects (59%, 24/41), was the implementation of various infrastructural goals (Supplement Table 1 ). Infrastructural goals were not a homologous category and reflected the individual projects’ existing sequencing and data-analysis infrastructure, and personnel capacities. The most frequently reported infrastructural project objectives apart from increasing sequencing capacity itself were data management (79%, 19/24), followed by establishing standards of analyses (71%, 17/24), and education (54%, 13/24). Several additional projects (20%, 8/41) intended to approach these goals without reporting them under ‘infrastructure’, probably reflecting cultural conceptual differences in what is considered as infrastructure.

Personalized and precision medicine

Finally, 37% (15/41) of the projects presented tangible plans for the development of personalized medicine, although most projects (85%, 35/41) reported personalized medicine as one of their rationales.

As part of the effort toward introducing personalized medicine, a further subset of countries (e.g. Australia, USA, Japan, Switzerland, etc.) intend to use their genomic data for drug discovery/precision therapy (Supplement Table 1 ).

Number and age structure of the included subjects

Websites of 37 of 41 national projects (90%) reported information on the total number of subjects to be included in the project. The number of included subjects ranged from a hundred to up to over a million subjects, representing from 0.0001 to 32% of the population. Approximately half of the projects aimed to sequence more than 10,000 subjects, with approximately a quarter aiming to sequence 1000 or less (Table 1 ). Similarly, in terms of population percentage, only four countries aimed to sequence more than 1% of their population. Of the remaining countries, half aimed to sequence more than 0.02%, and half planned to sequence less than 0.02% of their respective population.

Of the few projects with missing information on the number of subjects included, most were focused primarily on infrastructure, whereas in the remaining projects the exact number of included subjects was reported to be determined during the project (Supplement Table 1 ).

The age structure of healthy subjects was reported in five projects. In the projects that provided this information, the most common strategy for determining normal genomic variation was to include the general adult population or existing health-tracking cohorts. In the case of pathological genomic variation, some groups of minors were also planned (e.g. in rare diseases). For detailed information on the included cohorts, please see the ‘Discussion’ section.

Approximately half (51%, 21/41) of all national projects stated the total funding planned (Supplement Table 2 ). The declared amounts reflect the scopes of the individual projects, ranging from 0.32 M USD to over 9200.00 M USD. Roughly half (49%, 20/41) of national genomic projects reported public funding, with some projects having mixed state and federal (Australia) or EU co-funded projects (e.g. Cyprus, Czech Republic) [ 35 , 36 , 46 , 49 , 57 , 67 ]. The remaining national genomic projects either reported mixed public-private type funding (44%, 18/41) (including for example, USA and Switzerland), or fully private funding (7%, 3/41) (Qatar, Ireland, and Vietnam) [ 13 , 25 , 30 , 31 , 33 , 40 , 50 , 55 , 62 , 68 , 69 ]. The private funding partners were diverse, including sequencing, investment, and insurance companies, as will be reviewed in the discussion.

Data sharing goals and methods

Data sharing involves the analysis and curation of genomic and associated information obtained during the projects for public, academic, and/or commercial use with various levels of access. It inevitably concerns ethics and legal issues, identifying stakeholders as well as technical aspects and data security. Data sharing represents an important aspect of the national genomic projects, as most reported their main objectives to be determining normal population genomic variation that will enable the use of personalized and precision medicine. 90% (37/41) of the projects reported their intention of sharing the data obtained (Supplement Table 3 ), and over half of the projects (54%, 22/41) already implemented some form of data sharing. Of the existing data-sharing solutions, the most common format was a database platform with various levels of access for the public, academia, and researchers, whereas the second most common solution consisted of a fully public database containing anonymized or pooled genomic data. For example, Estonia reports it will make their data and DNA available per request and pending approval of the Ethical committee. On the other hand, several of the projects with private funding report they will provide access for approved pharmaceutical/biotechnology companies and research groups (e.g. Ireland, Switzerland, USA).

Association with biobanks, medical, and non-medical data

The majority of the national projects plan on linking their sequencing data with other medical data (78%, 32/41), existing or planned biobanks (54%, 22/41), and/or non-medical data (24%, 10/41), such as environmental and other factors, as the basis for enabling personal/precision medicine (Table 2 ) (Fig. 4 ). Additional countries explicitly plan to establish/connect biobanks and databases during the course of their projects (for example Australia, Slovenia) (Supplement Table 1 ). Finally, 56% (23/41) projects reported their intention to unify or establish standards for analysis and thus make provisions for adequate data management, two key prerequisites for establishing personalized medicine.

figure 4

Primary aims of active national genomic projects

Our results show several common goals but also substantial diversity of 41 ongoing national projects across the analysed categories.

Since its onset, one of the main aims of genomics has been to enable personalized and precision medicine, which is the use of diagnostic tools and treatments tailored to the needs of the individual patient [ 3 , 8 ].

Pioneering projects, such as that of the UK, that has been previously reviewed [ 16 , 70 ]), have focused on determining both the normal and pathological genomic variation (clinical cohorts consisting of rare disease and cancer patient cohorts). Consequently, the fields of rare diseases [ 3 , 4 , 5 , 6 , 7 ] and cancer [ 71 , 72 , 73 ]) are currently closest to the implementation of personalized medicine.

Additionally, population genomics has helped us to better understand complex diseases and traits. Indeed, many national projects, for example, Finland and Estonia, report they will link their genomic effort with existing national prevention and intervention health programs in order to maximise their positive impact [ 29 , 74 ]. The currently active projects have multiple, overlapping aims (Fig. 5 ) and the different strategies in which they intend to achieve them will be further discussed below.

figure 5

Overlap between 32 projects linking genomic data with biobanks, medical and non-medical data

The most common goal among the national genomic projects was to determine normal genomic variation through the sequencing of presumably healthy population cohorts. This is not surprising, as determining the genomic variability/genomic background in the general population is necessary for a polygenic risk assessment approach to various complex and multifactorial human diseases. Furthermore, knowledge of the normal population-specific genomic variation helps improve the diagnostic yield of WES and whole-genome sequencing, showing a research return on investment in a short time-frame. Defining health in the context of genomic testing can be challenging, especially in the case of non-penetrant mutations, late-onset disorders, etc. Therefore, most national projects approached this challenge by either creating demographic cohorts and linking them with medical data or specific exclusion criteria, or by specifically identifying healthy individuals (healthy parents from trio testing in rare diseases, longitudinal health-tracking cohorts from previous studies, etc.), as is further discussed under the ‘Number and age structure of the included subjects’ section. Nine projects designed their normal genomic variability cohorts based on ethnicity data (Supplement Table 1 ). This approach is preferable, especially in case of large countries with many ethnic groups, or countries with considerable migration (both historic and present).

Genomic projects traditionally focused predominantly on rare disease and cancer cohorts. This approach has proved successful, and personalised medicine has begun in both of these fields [ 3 , 4 , 5 , 6 , 7 , 72 , 73 ].

The 29 countries with clinical cohorts approach this issue in various ways (Supplement Table 1 ). France, for example, plans to sequence over 235,000 genomes of at least 48 cohorts with clearly defined genetic conditions [ 30 ], UK plans to sequence approximately 100,000 patient genomes (rare diseases programme, which includes over 190 rare diseases, and cancer programme) [ 37 ], Australia and Hong Kong aim to sequence 20,000 patients each (18 rare disease and cancer flagship projects) [ 41 , 66 ], while Thailand [ 39 ], New Zealand [ 21 ] and Slovenia [ 54 ] each plan a few hundred patients from rare disease and cancers cohorts. In the remainder of the countries that have clearly indicated the diseases included in their clinical cohorts (e.g. Ireland: rare disorders and 10 chronic conditions), the numbers of included patients remain to be finalized.

As can be seen from the results, most larger projects have made provisions to sequence complex clinical cohorts. Interestingly, as far as the composition of their cohorts can be analysed from the data provided on the websites, it is apparent how other factors, such as funding clearly influence clinical priority in genomics. Large, initially publicly funded initiatives such as that of UK and France [ 30 , 37 ] have very complex clinical cohorts including over 190 rare diseases and cancer programme, in case of the former, and 48 conditions in case of the latter. On the other hand, privately funded projects will focus primarily on conditions where the biggest return on investment can most reasonably be expected (e.g. Ireland project aims to focus on Alzheimer’s disease, asthma, inflammatory bowel disease, multiple sclerosis, diabetes, nonalcoholic liver diseases, inflammatory skin conditions, ankylosing spondylitis, etc.).

Infrastructural goals include the most heterogeneous aims, ranging from establishing new and linking existing sequencing facilities (e.g. France), improving computing/analysing capacities (e.g. Brazil, Portugal), establishing standards for analysis (e.g. Slovenia), data management (e.g. Finland, Switzerland), sharing and platform building (e.g. Estonia), education of medical personnel and incorporation of sequencing technology/diagnostics into existing health-care structures (e.g. Finland, France). This is not surprising, as the national genomic projects defined their infrastructural goals depending on their respective general situation regarding genomic sequencing and health-care systems. The most shared infrastructural goals were data management (79%, 19/24), standards of analyses (71%, 17/24), and interestingly, the goal of education-which was defined as the aim of half (54%, 13/24) of countries with infrastructural projects; however, these include major projects such as that of Australia, UK and Finland (with excellent existing health-care informatics infrastructure). Interestingly, a few projects (e.g. Slovenia) defined the goals of education, standards of analyses, and data management independently of infrastructure-highlighting the differences in the definition of the concept of ‘infrastructure’ itself. Furthermore, 68% of projects (28/41) reported their intention to unify or establish standards for analysis and/or make provisions for appropriate data management, which is not surprising as these two factors are crucial features for the establishment of personalized and precision medicine [ 75 ].

Personalised and precision medicine

The majority of the national genomic projects aspire to integrate personalized medicine with their existing healthcare infrastructure. However, only a third (37%, 15/41) of the projects have thus far proposed specific strategies for the implementation of personalised medicine. Preparing the ground for implementing personalised and precision medicine is a complex endeavour as it cannot precede the achievement of other important goals, such as identifying and cataloguing local normal genomic variability, the existence of adequate sequencing and informatics infrastructure, data security, clear ethical guidelines for reporting and interventions, education of medical professionals and health-care system integration.

Indeed, in most countries with tangible plans of implementing personalized medicine, this aim overlaps with all three other major aims: determining normal and pathological genomic variability, and infrastructural aims (Fig. 5 ). Therefore, the countries pursuing more aims are most likely to implement personalised medicine in the foreseeable future. For example, Finland, with its well-established medical data infrastructure, is in a good position to undertake the personal genomics challenge posed by complex diseases [ 29 ]. Additionally, several countries, such as Japan [ 76 ], report they will use their genomic data for drug discovery/precision therapy and have planned their cohorts accordingly.

The 34 national projects reporting the number of subjects to be included showed high heterogeneity, ranging from a hundred subjects to over a million individuals (Table 1 ).

In terms of sequenced genome numbers per country population, only four countries plan to sequence more than 1% of their respective population, while the majority of projects plan to sequence less than 0.2% of their population (Table 1 ). Five countries defined the age structure of their healthy participants (Supplement Table 1 ). For example, the Estonian national genomic project aims to analyse 32.5% of the country’s population and reports the plan to link this information with the national biobank, medical data and non-medical exposome data. Furthermore, in Estonia, the subjects for sequencing will be chosen to reflect the age structure in the country [ 74 ]. Similarly, the Latvian genome project will analyse healthy adult individuals included in their genetic biobank [ 32 ]. In the Czech Republic, approximately half of the healthy subjects included in the population cohort reflect the general population, whereas the other half is composed of healthy subjects above the age of 70 years [ 49 ]. Likewise, in Malta, a senior citizen cohort will be used to determine the normal genomic variation background [ 45 ]. Additionally, despite the relatively low number of planned genomic analyses, Brazil has an excellent population cohort from which to choose those to be sequenced. The Brazilian public servant cohort—ELSA (Longitudinal Study of Adult Health) has tracked the health of public servants aged 35–74 years and the factors associating with complex diseases since 2008 [ 77 ]. As discussed under normal genomic variability section, nine projects designed their cohorts based on ethnicity data (Supplement Table 1 ), which should be the preferred approach in case of countries with considerable migration and/or many different ethnic groups.

Similarly, to the reported range in the number of subjects, the funding amounts vary greatly from less than a million USD to 9.2 billion USD. Approximately half (49%, 20/41) of ongoing projects have public funding, which is not a surprise given a high initial investment and unlikely short-term return on the research performed. The remaining projects have either mixed (44%, 18/41) or fully private (7%, 3/41) funding. The private partners of the mixed public-private funded projects are either sequencing companies such as Illumina, Macrogen, BGI, and insurance companies, research and pharmaceutical companies, universities or a combination of several such partners. Additionally, several projects report they will collaborate and/or share data with private companies in the future (Supplement table 2 and 3 ).

An interesting comparison can be made between the approach to the selection of the clinical cohorts based on the type of funding, which is mixed (initially public) in the case of France and private in the case of Ireland. While the clinical cohorts of the initially publicly funded project were chosen based on their potential public-health impact as well as scientific rationale, the cohorts included in the fully privately funded project reflect the conditions where the biggest return on investment can most reasonably be expected: Alzheimer’s disease, asthma, inflammatory bowel disease, multiple sclerosis, diabetes, liver disease, inflammatory skin conditions, ankylosing spondylitis, and non-radiographic axial spondyloarthritis, and rare disorders. Similarly, the privately funded Qatar national project aims to analyse 100,000 individual genomes (3.6% of the population) and so far reports only clinical cohorts consisting of cardiovascular disease, diabetes, neurological disease and cancer.

Private funding of genomic research represents several challenges that have been reviewed previously [ 78 ]; however, few countries possess adequate resources to be able to pursue genomics from research to full implementation of personalised medicine without outside involvement.

As a possible solution to these challenges, several countries plan to establish designated agencies that will act as gatekeepers between the public and private conflict(s) of interest (e.g. data-security versus profit), in order to enable interested private parties to join the project and get involved in generating added value (design of novel drugs, treatments, data-mining), while maintaining public control of the data itself, as far (and as long), as possible.

Data sharing in the context of genomic projects concerns ethics and legal issues, identifying stakeholders, as well as technical aspects and security of the data itself. The ethical and legal issues depend on each national project as well as the projects' funding (public vs. private). The interested parties have been identified by several projects (please see Supplement Table 3 for detailed information) as the patient/healthy participants, referring physicians, the general public, researchers and research organizations, private corporations (such as pharmaceutical and insurance companies) and international organizations. The different stakeholders can access various levels of data either through fully public databases containing de-identified information or by formal request to the particular national ethical committee. Regarding the technical solutions for data sharing, some projects have already provided dissemination platforms, data access per request, or a synopsis of their results, whereas the remainder have announced their plans to do so (Supplement Table 3 ). The projects with significant funding, such as that of USA [ 79 ], China [ 80 ], UK [ 12 ], Australia [ 81 ], Japan [ 82 ] and Switzerland [ 82 ] as well as smaller projects such as Brazil [ 83 ], Latvia [ 84 ] and Saudi Arabia [ 85 ], have designed database platforms with various levels of access (for the interested public, academia and researchers), whereas probably due to significantly lesser financial input, the majority of projects created public databases with anonymized or pooled genetic data [ 42 , 67 , 86 , 87 , 88 , 89 , 90 ].

Additionally, five countries, Denmark, Estonia, France, Latvia and Qatar either have or plan to make available both data and the collected DNA, per request and pending the approval of their Ethical Committee (Supplement Table 3 ).

Linkage with biobanks, medical and non-medical data

Unsurprisingly, most of the ongoing projects aim to link the sequencing data with other medical data (78%, 32/41), either as part of their reported clinical cohorts, existing medical infrastructure or collected de novo. Furthermore, likely because of the high costs associated with such operations and their maintenance, roughly half of these projects (54%, 22/41) will integrate the results of the sequencing experiments with the existing biobanks or will create such biobanks as part of the project (Table 2 ).

The projects in the best position to achieve this goal are those of relatively small countries with a public healthcare system and well-established biobanks, such as Finland and Estonia. Estonia’s biobank includes close to 200,000 participants with information on their medical history, current health status and medications, in addition to anthropometric measurements and blood aliquots. In the case of Finland, the genome database will be linked with the existing National Health Data Repository (Kanta), which is already integrated into the public healthcare system. Pilot projects supporting the utilization of genomic data in Finnish healthcare, such as the GeneRISK study, that aim to analyse how information about risk-factors influences lifestyle changes and acts to prevent disease, are already underway [ 29 ].

Linkage with non-medical data, reported by 24% of projects (10/41), was less clearly defined as the exposome is both difficult to define and measure, and requires a significant investment in terms of the effort to collect and perform analyses. Current strategies examining the human exposome, that is the totality of lifetime-exposure, include many different factors (lifestyle, environment, microbiome, pollutants (sound, chemical), stress, etc.) and remain far from standardized. However, despite the fact that many issues remain before this field can be standardised, our efforts should strive to enable the linkage of data between studies in the future [ 91 , 92 , 93 , 94 ], and it is foreseeable that the exposome-genome paradigm will strengthen the application of precision medicine as these fields progress [ 95 ].

Challenges and future directions

Our aim was to provide an overview of available information on active national genome projects worldwide, in order to aid the design of such projects and usefulness of their results. We showed that despite the obvious, and substantial, diversity of the 41 ongoing projects, their overarching efforts aim to overcome the existing barriers to obtaining data, its integration, and the translation of this knowledge into personalised medicine. The challenges for this ambitious aim are many, such as addressing data security, privacy issues, inconsistency in data generation and analysis, issues with data sharing resources (both technical and ethical), incompatible data models and terminology, etc. The projects we have reviewed approach these issues in different ways, although some have already recognised the need to standardise their efforts in order to enable an interoperable framework of responsible data sharing.

Open science initiatives, such as the Global Alliance for Genomics and Health (GA4GH) [ 96 ], have been established to address the need for common standards and approaches to using genomic and related data. Their standards have so far been adopted by more than 40 leading genomics institutions as well as several of the projects described in this report, such as ‘All of Us’ USA, Genomics England, Australian Genomics, and Slovenia, to name a few, and will hopefully be even more widely adopted by such projects in the future.

Additionally, we would like to suggest that in isolation, genomic data represents only a part of the larger effort needed for implementing personalised and precision medicine, and as more and more genomes are included, the need for supporting medical and non-medical data (exposomics, integratomics, etc.) has become more and more apparent. Therefore, we would like to suggest that it is preferable for project designers to make provisions for the systematic inclusion of additional, medical and environmental/exposure data that will enable better genomic data curation and interpretation. It is foreseeable that in this aspect, open science initiatives will once again prove helpful in enabling frameworks and standards for successful data integration.

Limitations of the study

The study faces several limitations. Firstly, the information obtained by the authors is based on what was provided on the web sites of individual national projects in the English language. As individual projects’ websites do not need to adhere to standards as strict as those of scientific publishing, this prevented us from fully following all of the principles outlined as part of the PRISMA approach to systematic reviews and meta-analyses [ 19 ]. We would also like to recognize that our analysis may not reflect the full or final scope of the individual projects.

Secondly, all information we have attempted to gather was not available in case of all projects or was yet to be determined. The final scope and results of several projects will depend on the results of their many pilots and supporting/preparatory measures. Therefore, we would like to point out that perhaps not all aims may be achieved to the extent envisioned initially and that possibly additional features will be added to many of the projects at a later date.

In case of determining both healthy and pathological genomic variation, the recruitment of cohorts is an ongoing process that may result in changes to the original proposal, and new technological solutions, ethical standards and the results of international efforts (such as the European ‘1+ Million Genomes’ Initiative) may and probably will act to (re)shape the projects in the future.

Finally, this study was partly conducted during the COVID-19 global pandemic, which may influence the national genomic projects in unforeseeable ways.

Conclusions

In conclusion, this systematic review demonstrated considerable diversity among the 41 currently ongoing national genomic projects. The overview of the existing designs will hopefully inform national initiatives in designing new genomic projects and contribute to standardisation and international collaboration, thus enabling the individual projects to better contribute to the global development of genomics and personalized medicine.

Availability of data and materials

All data generated or analysed during this study are included in this published article [and its supplementary information files].

Abbreviations

Whole exome sequencing

Scott RH, Fowler TA, Caulfield M. Genomic medicine: time for health-care transformation. Lancet. 2019;394:454–6.

Article   Google Scholar  

Auffray C, Griffin JL, Khoury MJ, Lupski JR, Schwab M. Ten years of genome medicine. Genome Med. 2019;11:7.

Schee Genannt Halfmann S, Mählmann L, Leyens L, Reumann M, Brand A. Personalized medicine: what’s in it for rare diseases? Adv Exp Med Biol. 2017;1031:387–404.

Groft SC, Posada de la Paz M. Preparing for the future of rare diseases. Adv Exp Med Biol. 2017;1031:641–8.

Austin CP, Cutillo CM, Lau LPL, Jonker AH, Rath A, Julkowska D, et al. Future of rare diseases research 2017-2027: An IRDiRC Perspective. Clin Transl Sci. 2018;11:21–7.

Posey JE. Genome sequencing and implications for rare disorders. Orphanet J Rare Dis. 2019;14:153.

Prohaska A, Racimo F, Schork AJ, Sikora M, Stern AJ, Ilardo M, et al. Human disease variation in the light of population genomics. Cell. 2019;177:115–31.

Article   CAS   Google Scholar  

Ramaswami R, Bayer R, Galea S. Precision medicine from a public health perspective. Annu Rev Public Health. 2018;39:153–68.

Watson JD. The human genome project: past, present, and future. Science. 1990;248:44–9.

Cantor CR. Orchestrating the Human Genome Project. Science. 1990;248:49–51.

Pálsson G, Rabinow P. Iceland: the case of a national human genome project. Anthropol Today. 1999;15:14–8 [Wiley, Royal Anthropological Institute of Great Britain and Ireland].

100,000 Genomes Project dataset, Genomics England. Available from: https://www.genomicsengland.co.uk/about-gecip/for-gecip-members/data-and-data-access/ . Accessed 10 Feb 2021.

All of US Research Program, USA. Available from: https://allofus.nih.gov/ . Cited 2020 Oct 10

New research center seeks to map out China’s genes. Available from: http://www.globaltimes.cn/content/1072485.shtml . Cited 2020 Oct 10

Cyranoski D. China embraces precision medicine on a massive scale. Nature. 2016;529:9–10.

Stark Z, Dolman L, Manolio TA, Ozenberger B, Hill SL, Caulfied MJ, et al. Integrating Genomics into healthcare: a global responsibility. Am J Hum Genet. 2019;104:13–20.

Saunders G, Baudis M, Becker R, Beltran S, Béroud C, Birney E, et al. Leveraging European infrastructures to access 1 million human genomes by 2022. Nat Rev Genet. 2019;20:693–701.

Pawleni. European “1+ Million Genomes” Initiative. In: Shaping Europe’s digital future - European Commission; 2019. Available from: https://ec.europa.eu/digital-single-market/en/european-1-million-genomes-initiative . Cited 2020 Jul 31.

Google Scholar  

Moher D, Liberati A, Tetzlaff J, Altman DG, for the PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535.

Le VS, Tran KT, Bui HTP, Le HTT, Nguyen CD, Do DH, et al. A Vietnamese human genetic variation database. Hum Mutat. 2019;40:1664–75.

Aotearoa New Zealand genomic variome. Available from: https://www.genomics-aotearoa.org.nz/projects/aotearoa-nz-genomic-variome . Accessed 10 Feb 2021.

Armenian Genome Project. Available from: http://armeniangenome.am/ . Accessed 10 Feb 2021.

Australian Genomics Health Alliance. Available from: https://www.australiangenomics.org.au/ . Accessed 10 Feb 2021.

Centre for Arab Genomic Studies, UAE. Available from: http://www.cags.org.ae/ . Accessed 10 Feb 2021.

ChileGenomico - Genomics of the Chilean Population (FONDEF). Available from: http://chilegenomico.med.uchile.cl/chilegenomico1/ . Accessed 10 Feb 2021.

National Genomics Data Center Members and Partners, Zhang Z, Zhao W, Xiao J, Bao Y, He S, et al. Database Resources of the National Genomics Data Center in 2020. Nucleic Acids Res. 2019;48(D1):D24–D33. https://doi.org/10.1093/nar/gkz913 .

Egyptian Genome. Available from: https://www.egyptian-genome.org/ . Accessed 10 Feb 2021.

Estonian Biobank. Available from: https://genomics.ut.ee/en/access-biobank . Accessed 10 Feb 2021.

Finland’s Genome Strategy. Working group proposal. Available from: http://julkaisut.valtioneuvosto.fi/handle/10024/74712 . Accessed 10 Feb 2021.

France Medecine Genomique 2025. Available from: https://pfmg2025.aviesan.fr/en/

Genome Asia 100k. Available from: https://genomeasia100k.org/ . Accessed 10 Feb 2021.

Rovite V, Wolff-Sagi Y, Zaharenko L, Nikitina-Zake L, Grens E, Klovins J. Genome Database of the Latvian Population (LGDB): design, goals, and primary results. J Epidemiol. 2018;28:353–60.

Genome Denmark. Available from: http://www.genomedenmark.dk/english/ . Accessed 10 Feb 2021.

Genome Russia Project, Saint Petersburg State University. Available from: http://genomerussia.spbu.ru/?lang=en . Accessed 10 Feb 2021.

GenomePT, Portugal. Available from: https://www.genomept.pt/ . Accessed 10 Feb 2021.

Genomic map of Poland. Available from: http://www.ecbig.pl/page/genomic-map-of-poland/ . Accessed 10 Feb 2021.

Genomics England. Available from: https://www.genomicsengland.co.uk/ . Accessed 10 Feb 2021.

Genomics Medicine Ireland. Available from: https://genomicsmed.ie/how-you-can-help/ . Accessed 10 Feb 2021.

Genomics Thailand Initiative. Available from: https://genomicsthailand.com/Genomic/about . Accessed 10 Feb 2021.

Health 2030 genome center, Switzerland. Available from: https://www.health2030genome.ch/about/ . Accessed 10 Feb 2021.

Hong Kong Genome Project. Available from: https://www.info.gov.hk/gia/general/202005/14/P2020051400636.htm . Accessed 10 Feb 2021.

Iranome. Available from: http://www.iranome.ir/ . Accessed 10 Feb 2021.

Japan Genomic Medicine Program. Available from: https://www.amed.go.jp/en/program/index05.html . Accessed 10 Feb 2021.

Korean Personal Genome Project. Available from: http://kpgp.kr/ . Accessed 10 Feb 2021.

Borg J. Malta Human Genome Project. Unpublished; 2018; Available from: http://rgdoi.net/10.13140/RG.2.2.27666.91847 . Cited 2020 Jun 1

Molecular Medicine Research Center Biobank, University of Cyprus. Available from: https://www.ucy.ac.cy/mmrc/en/biobank . Accessed 10 Feb 2021.

MX BioBank Project. Available from: http://www.morenolab.org/projects/ . Accessed 10 Feb 2021.

National Genome Center, Kingdom of Bahrain Ministry of Health. Available from: https://www.moh.gov.bh/GenomeProject?lang=en . Accessed 10 Feb 2021.

NCMG database of genomic variants, National Center for Medical Genomics, Czech Republic. Available from: https://ncmg.cz/en/#section-projects . Accessed 10 Feb 2021.

NHRI Taiwan. Available from: http://enews.nhri.org.tw/en/?p=858 . Accessed 10 Feb 2021.

Personal Genome Project Canada (PGP-Canada). Available from: https://personalgenomes.ca/ . Accessed 10 Feb 2021.

Qatar Genome. Available from: https://qatargenome.org.qa/node/5 . Accessed 10 Feb 2021.

Saudi Omics Undertakings. Available from: https://www.saudigenomeprogram.org/en/ . Accessed 10 Feb 2021.

Slovenian genome project. Available from: https://www.sicris.si/public/jqm/prj.aspx?lang=eng&opdescr=search&opt=2&subopt=400&code1=cmn&code2=auto&psize=1&hits=1&page=1&count=&search_term=peterlin%20borut&id=17959&slng=&order_by= . Accessed 10 Feb 2021.

SNU College of Medicine Starts Uruguay Genome Project, Uruguay. Available from: https://en.snu.ac.kr/research/highlights?md=v&bbsidx=121064 . Accessed 10 Feb 2021.

The Brazilian Intiative on Precision Medicine (BIPMed). Available from: https://bipmed.org/theproject/ . Accessed 10 Feb 2021.

Turkish Genome Project. Available from: https://www.bbmri-eric.eu/news-events/turkish-genome-project-launched/

WGS first - Whole Genome Sequencing, Netherlands. Available from: https://www.wgs-first.nl/en/project . Accessed 10 Feb 2021.

Khan S, Akter S, Goswami B, Habib A, Banu TA, Barton C, et al. Whole genome analysis of four Bangladeshi individuals. Genomics. 2020; Available from: http://biorxiv.org/lookup/doi/10.1101/2020.05.21.109058 . Accessed 10 Feb 2021.

Swiss Personal Health Network (SPHN). Available from: https://sphn.ch/organization/about-sphn/ . Accessed 10 Feb 2021.

Ávila-Arcos MC, McManus KF, Sandoval K, Rodríguez-Rodríguez JE, Villa-Islas V, Martin AR, et al. Population history and gene divergence in Native Mexicans inferred from 76 human exomes. Mol Biol Evol. 2020;37:994–1006 Falush D, editor.

Genuity Science. Available from: https://genomicsmed.ie/ . Cited 2020 Oct 10.

Vishnopolska SA, Turjanski AG, Herrera Piñero M, Groisman B, Liascovich R, Chiesa A, et al. Genetics and genomic medicine in Argentina. Mol Genet Genomic Med. 2018;6:481–91.

Ariani Y, Soeharso P, Sjarif DR. Genetics and genomic medicine in Indonesia. Mol Genet Genomic Med. 2017;5:103–9.

Belhassan K, Ouldim K, Sefiani AA. Genetics and genomic medicine in Morocco: the present hope can make the future bright. Mol Genet Genomic Med. 2016;4:588–98.

Stark Z, Boughtwood T, Phillips P, Christodoulou J, Hansen DP, Braithwaite J, et al. Australian genomics: a federated model for integrating genomics into healthcare. Am J Hum Genet. 2019;105:7–14.

Egyptian genome, EgyptRef. Available from: https://www.egyptian-genome.org/

Human Population Genomics Lab. 2020. http://www.morenolab.org/projects/ . Accessed 10 Feb 2021.

Wu D, Dou J, Chai X, Bellis C, Wilm A, Shih CC, et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell. 2019;179:736–749.e15.

Brittain HK, Scott R, Thomas E. The rise of the genome and personalised medicine. Clin Med. 2017;17:545–51.

Hayashi T, Konishi I. Prospects and problems of cancer genome analysis for establishing cancer precision medicine. Cancer Investig. 2019;37:427–31.

Nakagawa H, Fujita M. Whole genome sequencing analysis for cancer genomics and precision medicine. Cancer Sci. 2018;109:513–22.

Mukherjee S. Genomics-guided immunotherapy for precision medicine in cancer. Cancer Biother Radiopharm. 2019;34:487–97.

Leitsalu L, Haller T, Esko T, Tammesoo M-L, Alavere H, Snieder H, et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int J Epidemiol. 2015;44:1137–47.

Louie B, Mork P, Martin-Sanchez F, Halevy A, Tarczy-Hornoch P. Data integration and genomic medicine. J Biomed Inform. 2007;40:5–16.

Tohoku Medical Megabank Project. Available from: https://www.amed.go.jp/en/program/list/14/01/002.html . Cited 2020 Oct 10

de Oliveira C, Marmot MG, Demakakos P, Vaz de Melo Mambrini J, Peixoto SV, Lima-Costa MF. Mortality risk attributable to smoking, hypertension and diabetes among English and Brazilian older adults (The ELSA and Bambui cohort ageing studies). Eur J Pub Health. 2016;26:831–5.

Lowrance WW, Collins FS. ETHICS: identifiability in genomic research. Science. 2007;317:600–2.

All of Us Research Hub, NIH USA. Available from: https://www.researchallofus.org/ . Accessed 10 Feb 2021.

Virtual Chinese Genome Database. https://bigd.big.ac.cn/vcg/index.html . Accessed 10 Feb 2021.

A Variant Atlas Platform for Australian Genomics. Available from: https://www.australiangenomics.org.au/resources/tools/variant-atlas/ . Accessed 10 Feb 2021.

BioMedIT, Swiss Personalized Health Network. Available from: https://sphn.ch/network/projects/biomedit/ . Accessed 10 Feb 2021.

Brazilian initiative on precision medicine data sharing. Available from: https://bipmed.org/datasharing/ . Accessed 10 Feb 2021.

Genome database of Latvian population. Available from: http://www.genomadatubaze.lv/en/ . Accessed 10 Feb 2021.

Saudi Human Genome Program Database. Available from: https://genomics.saudigenomeprogram.org/en/researchers/#db-access . Accessed 10 Feb 2021.

PGP Canada Data. Available from: https://personalgenomes.ca/data . Accessed 10 Feb 2021.

Genome Asia 100K Browser. Available from: https://browser.genomeasia100k.org/ . Accessed 10 Feb 2021.

PGP Korea. Available from: http://opengenome.net/Main_Page . Accessed 10 Feb 2021.

CTGA Database, Centre for Arab Genomic Studies. Available from: http://www.cags.org.ae/ctga/ . Accessed 10 Feb 2021.

Vietnamese Genetic Variation Database. Available from: https://genomes.vn/ . Accessed 10 Feb 2021.

Sabbioni G, Berset J-D, Day BW. Is it realistic to propose determination of a lifetime internal exposome? Chem Res Toxicol. 2020;33(8):2010–21. https://doi.org/10.1021/acs.chemrestox.0c00092 .

Barupal DK, Fiehn O. Generating the blood exposome database using a comprehensive text mining and database fusion approach. Environ Health Perspect. 2019;127:97008.

Manrai AK, Cui Y, Bushel PR, Hall M, Karakitsios S, Mattingly CJ, et al. Informatics and data analytics to support exposome-based discovery for public health. Annu Rev Public Health. 2017;38:279–94.

Vineis P, Avendano-Pabon M, Barros H, Bartley M, Carmeli C, Carra L, et al. Special report: the biology of inequalities in health: the lifepath consortium. Front Public Health. 2020;8:118.

Barouki R, Audouze K, Coumoul X, Demenais F, Gauguier D. Integration of the human exposome with the human genome to advance medicine. Biochimie. 2018;152:155–8.

The Global Alliance for Genomics and Health (GA4GH). Available from: https://www.ga4gh.org/ . Accessed 10 Feb 2021.

Download references

Acknowledgements

This work was funded by the ARRS programme: P3-0326 and the ARRS project: V3-1911 Slovenian genome project.

Author information

Authors and affiliations.

Clinical Institute of Genomic Medicine, University Medical Centre Ljubljana, Slajmerjeva 4, Ljubljana, Slovenia

Anja Kovanda, Ana Nyasha Zimani & Borut Peterlin

You can also search for this author in PubMed   Google Scholar

Contributions

AK, ANZ, and BP analysed and co-reviewed the data and wrote the manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Borut Peterlin .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: table s1.

. Identified national genomic projects and categories analyzed in ongoing national genomic projects.

Additional file 2: Table S2

. Funding details of national genomic projects.

Additional file 3: Table S3

. Data sharing solutions of national genomic projects.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Kovanda, A., Zimani, A.N. & Peterlin, B. How to design a national genomic project—a systematic review of active projects. Hum Genomics 15 , 20 (2021). https://doi.org/10.1186/s40246-021-00315-6

Download citation

Received : 15 December 2020

Accepted : 23 February 2021

Published : 24 March 2021

DOI : https://doi.org/10.1186/s40246-021-00315-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • National genomic projects
  • Precision medicine
  • Personalized medicine
  • Normal genomic variation
  • Pathological genomic variation
  • Population genomics

Human Genomics

ISSN: 1479-7364

human genome project literature review

SEP home page

  • Table of Contents
  • Random Entry
  • Chronological
  • Editorial Information
  • About the SEP
  • Editorial Board
  • How to Cite the SEP
  • Special Characters
  • Advanced Tools
  • Support the SEP
  • PDFs for SEP Friends
  • Make a Donation
  • SEPIA for Libraries
  • Entry Contents

Bibliography

Academic tools.

  • Friends PDF Preview
  • Author and Citation Info
  • Back to Top

The Human Genome Project

The 20 th century opened with rediscoveries of Gregor Mendel’s studies on patterns of inheritance in peas and closed with a research project in molecular biology that was heralded as the initial and necessary step for attaining a complete understanding of the hereditary nature of humankind. Both basic science and technological feat, the Human Genome Project (HGP) sought to map and sequence the haploid human genome’s 22 autosomes and 2 sex chromosomes, bringing to biology a “big science” model previously confined to physics. Officially launched in October 1990, the project’s official date of completion was timed to coincide with celebrations of the 50 th anniversary of James D. Watson and Francis Crick’s discovery of the double-helical structure of DNA. On 12 April 2003, heads of government of the six countries that contributed to the sequencing efforts (the U.S., the U.K., Japan, France, Germany, and China) issued a joint proclamation that the “essential sequence of three billion base pairs of DNA of the Human Genome, the molecular instruction book of human life,” had been achieved (Dept. of Trade 2003). HGP researchers compared their feat to the Apollo moon landing and splitting the atom. They foresaw the dawn of a new era, “the era of the genome,” in which the genome sequence would provide a “tremendous foundation on which to build the science and medicine of the 21 st century” (NHGRI 2003).

This article begins by providing a brief history of the Human Genome Project. An overview of various scientific developments that unfolded in the aftermath of the HGP follows; these developments came to be referred to as “postgenomics” to distinguish them from activities labelled “genomics” that are associated specifically with the mapping and sequencing of the genomes of humans and other organisms. The article then discusses some of the conceptual and social and ethical issues that gained the attention of philosophers during the project’s planning stages and as it unfolded, and which remain salient today. Novel with the HGP was the decision of its scientific leadership to set aside funds to study the project’s ethical, legal, and social implications (ELSI). Today, from a vantage point more than two-and-a-half decades after that decision was made, it is possible to reflect on the ELSI model and its relevance for ongoing biomedical research in genetics and genomics/postgenomics.

1.1 Map First, Sequence Later

1.2 race to the genome, 1.3 aftermath, 2.1 geneticization, genetic reductionism, and genetic determinism, 2.2 genetic testing, genetic discrimination, and genetic privacy, 2.3 identity and difference: the “normal” human genome, 2.4 identity and difference: race, ethnicity, and the genome, 2.5 elsi and its legacy, other internet resources, related entries, 1. the human genome project: from genomics to postgenomics.

The idea of sequencing the entire human genome arose in the U.S. in the mid-1980s and is attributed to University of California at Santa Cruz chancellor Robert Sinsheimer, Salk Institute researcher Renato Dulbecco, and the Department of Energy’s (DOE’s) Charles DeLisi. While the idea found supporters among prominent molecular biologists and human geneticists such as Walter Bodmer, Walter Gilbert, Leroy Hood, Victor McKusick, and James D. Watson, many of their colleagues expressed misgivings. There were concerns among molecular biologists about the routine nature of sequencing and the amount of “junk DNA” that would be sequenced, that the expense and big science approach would drain resources from smaller and more worthy projects, and that knowledge of gene sequence was inadequate to yield knowledge of gene function (Davis and Colleagues 1990).

Committees established to study the feasibility of a publicly funded project to sequence the human genome released reports in 1988 that responded to these concerns. The Office for Technology Assessment report, Mapping Our Genes: Genome Projects: How Big, How Fast? downplayed the concerns of scientist critics by emphasizing that there was not one but many genome projects, that these were not on the scale of the Manhattan or Apollo projects, that no agency was committed to massive sequencing, and that the study of other organisms was needed to understand human genes. The National Research Council report, Mapping and Sequencing the Human Genome , sought to accommodate the scientists’ concerns by formulating recommendations that genetic and physical mapping and the development of cheaper, more efficient sequencing technologies precede large-scale sequencing, and that funding be provided for the mapping and sequencing of nonhuman (“model”) organisms as well. Genome projects were underway even before the Office for Technology Assessment and National Research Council reports were released. The DOE made the first push toward a “big science” genome project, with DeLisi advancing a five-year plan in 1986. The DOE undertaking produced consternation among biomedical researchers who were traditionally supported by the National Institutes of Health’s (NIH’s) intramural and extramural programs, and James Wyngaarden, head of the NIH, was persuaded to lend his agency’s support to the project in 1987. Congressional funding for both agencies was in place in time for fiscal year 1988. The National Research Council report estimated the total cost of the HGP at $3 billion.

The DOE and NIH coordinated their efforts with a Memorandum of Understanding in 1988 that agreed on an official launch of the Human Genome Project on October 1, 1990 and an expected date of completion of 2005. The DOE established three genome centers in 1988–89: at Lawrence Berkeley, Lawrence Livermore, and Los Alamos National Laboratories. David Smith led the DOE-HGP at the outset; he was followed by David Galas from 1990 to 1993, and Ari Patrinos for the remainder of the project. The NIH instituted a university grant-based program for human genome research and placed Watson, co-discoverer of the structure of DNA and director of Cold Spring Harbor Laboratory, in charge in 1988. In October 1989, Watson assumed the helm of the newly established National Center for Human Genome Research (NCHGR) at the NIH. During 1990 and 1991, Watson expanded the grants-based program to fund seven genome centers for five-year periods to work on large-scale mapping projects: Washington University, St. Louis; University of California, San Francisco; Massachusetts Institute of Technology; University of Michigan; University of Utah; Baylor College of Medicine; and Children’s Hospital of Philadelphia. Francis Collins succeeded Watson in 1993, establishing an intramural research program at the NCHGR to complement the extramural program of grants for university-based research that already existed. In 1997, the NCHGR was elevated to the status of a research institute and renamed the National Human Genome Research Institute (NHGRI).

Although the HGP’s inceptions were in the U.S., it did not take long for mapping and sequencing the human genome to become an international venture (see Cook-Deegan 1994). France began to fund genome research in 1988 and had developed a more centralized, although not very well-funded, program by 1990. More significant were the contributions of Centre d’Etudes du Polymorphisme Humain (CEPH) and Généthon. CEPH, founded in 1983 by Jean Dausset, maintained a collection of DNA donated by intergenerational families to help in the study of hereditary disease; in 1991, with funding from the French muscular dystrophy association, CEPH director Daniel Cohen oversaw the launching of Généthon as an industrial-sized mapping and sequencing operation. The U.K.’s genome project received its official start in 1989, though Sydney Brenner had commenced genome research at the Medical Research Council laboratory several years before this. Medical Research Council funding was supplemented with private monies from the Imperial Cancer Research Fund and, later, the Wellcome Trust. The Sanger Centre, led by John Sulston and funded by Wellcome and the Medical Research Council, opened in October 1993. Japan, ahead of the U.S. in having funded the development of automated sequencing technologies since the early 1980s, was the major genome player outside the U.S. and Europe with several government agencies beginning small-scale genome projects in the late-1980s and early-1990s (Swinbanks 1991). Germany and China subsequently joined the U.S., France, U.K., and Japan in the publicly funded international consortium that was ultimately responsible for sequencing the genome.

The NIH and DOE released a joint five-year plan in 1990 that set specific benchmarks for mapping, sequencing, and technological development. The plan was updated in 1993 to accommodate progress that had been made, with the new five-year plan in effect through 1998 (Collins and Galas 1993). As the National Research Council report had recommended, priority at the outset of the project was given to mapping rather than sequencing the human genome. HGP scientists sought to construct two kinds of maps: genetic maps and physical maps. Genetic maps order polymorphic markers linearly on chromosomes; the aim is to have these markers densely enough situated that linkage relations can be used to locate chromosomal regions containing genes of interest to researchers. Physical maps order collections (or “libraries”) of cloned DNA fragments that cover an organism’s genome; these fragments can then be replicated in quantity for sequencing. Technological progress was needed to make sequencing more efficient and less costly for any significant progress to be made. For the meantime, efforts would focus on sequencing the smaller genomes of less complex model organisms (Watson 1990). The model organisms selected for the project were the bacterium Escherichia coli , the yeast Saccharomyces cerevisiae , the roundworm Caenorhabditis elegans , the fruitfly Drosophila melanogaster , and the mouse Mus musculans .

As 1998, the last year of the revised five-year plan and midpoint of the project’s projected 15-year span, approached, many mapping goals had been met. In 1994, Généthon completed a genetic map with more than 2,000 microsatellite markers at an average spacing of 2.9 centimorgans (cM) and only one gap larger than 20 cM (Gyapay et al. 1994); the goal was a resolution of 2 to 5 cM by 2005. The genetic mapping phase of the project came to a final close in March 1996 with Généthon’s completion of a genetic map containing 5,264 microsatellite markers located to 2,335 positions with an average spacing of 1.6 cM (Dib et al. 1996). In 1995, a physical map with 94 percent coverage of the genome and 15,086 sequence-tagged site (STS) markers at average intervals of 199 kilobases (kb) was published (Hudson et al. 1995); the initial goal was STS markers spaced approximately 100 kb apart by 1995, a deadline the revised plan extended to 1998. In 1998, a physical map of 41,664 STS markers was published (Deloukas et al. 1998). Sequencing presented more of a challenge, despite ramped-up sequencing efforts over the previous several years at the U.K.’s Wellcome Trust-funded Sanger Centre in Cambridge and the NHGRI (previously NCHGR)-funded centers at Houston’s Baylor College of Medicine, Stanford University, The Institute for Genomic Research (TIGR), University of Washington-Seattle, Washington University School of Medicine in St. Louis, and Whitehead Institute for Biomedical Research/MIT Genome Center. The genomes of the smallest model organisms had been sequenced. In April 1996, an international consortium of mostly European laboratories published the sequence for S. cerevisiae which was the first eukaryote completed, with 12 million base pairs and 5,885 genes and at a cost of $40 million (Goffeau et al. 1996). In January 1997, University of Wisconsin researchers completed the sequence of E. coli with 4,638,858 base pairs and 4,286 genes (Blattner et al. 1997). However, with only three percent of the human genome sequenced, sequencing costs hovering at $.40/base, and the desired high output not yet achieved by the sequencing centers, and about $1.8 billion spent, doubts existed about whether the HGP’s target date of 2005 could be met.

Suddenly, the publicly funded HGP faced a challenge from the private sector. In May 1998, TIGR’s J. Craig Venter announced a partnership with Applied Biosystems to sequence the entire genome in three short years and for a fraction of the cost. The new company, based in Rockville, MD and later named Celera Genomics, planned to use “whole-genome shotgun” (WGS) sequencing, an approach different from the HGP’s. The HGP confined the shotgun method to cloned fragments already mapped to specific chromosomal regions: these are broken down into smaller bits then amplified by bacterial clones, sequences are generated randomly by automated machines, and computational resources are used to reassemble sequence using overlapping areas of bits. Shotgunning is followed by painstaking “finishing” to fill in gaps, correct mistakes, and resolve ambiguities. What Celera was proposing for the shotgun method was to break the organism’s entire genome into millions of pieces of DNA with high-frequency sound waves, sequence these pieces using hundreds of Applied Biosystem’s new capillary model machines, and reassemble the sequences with one of the world’s largest civilian supercomputers without the assistance provided by the preliminary mapping of clones to chromosomes. When WGS sequencing was considered as a possibility by the HGP, it was rejected because of the risk that repeat sequences would yield mistakes in reassembly (Green 1997; Venter et al. 1996; Weber and Myers 1997). But Venter by this time had successfully used the method to sequence the 1.83 million nucleotide bases of the bacterium Hemophilus influenzae —the first free-living organism to be completely sequenced—in a year’s time (Fleischmann et al. 1995).

HGP scientists downplayed the media image of a race to sequence the genome often over the next couple of years, but they were certainly propelled by worries that funding would dry up before the sequence was complete given private sector willingness to take over and that the sequence data would become proprietary information. Wellcome more than doubled its funds to the Sanger Centre (to £205 million) and the center changed its goal from sequencing one-sixth of the genome to sequencing one-third, and possibly one-half (Dickson 1998). The NHGRI and DOE published a new five-year plan for 1998-2003 (Collins et al. 1998). The plan moved the final completion date forward from 2005 to 2003 and aimed for a “working draft” of the human genome sequence to be completed by December 2001. This would be achieved by delaying the finishing process, no longer going clone-by-clone to shotgun, reassemble, and finish the sequence of one clone before proceeding to the next. With only six percent of the human genome sequence completed, the plan called for new and improved sequencing technologies that could increase the sequencing capacity from 90 Mb per year at about $.50 per base to 500 Mb per year at no more than $.25 per base. Goals for completing the sequencing of the remaining model organisms were also set: December 1998 for C. elegans which was 80 percent complete, 2002 for D. melanogaster which was nine percent complete, and 2005 for M. musculus which was still at the physical mapping stage.

An interim victory for the publicly funded project followed when, on schedule, the first animal sequence, that of C. elegans with 97 million bases and 19,099 genes, was published in Science in December 1998 (The C. elegans Sequencing Consortium 1998). This was the product of a 10-year collaboration between scientists at Washington University in St. Louis (headed by Bob Waterston) and the Sanger Centre (headed by John Sulston), carried out at a semi-industrial scale with more than 200 people employed in each lab working around the clock. In March 1999, the main players—the NHGRI, Sanger Centre, and DOE—advanced the date of completion of the “working draft”: five-fold coverage of at least 90 percent of the genome was to be completed by the following spring (Pennisi 1999; Wadman 1999). This change reflected improved output of the new model of automated sequencing machines, diminished sequencing costs at $.20 to $.30 per base, and the desire to speed up the release of medically relevant data. NHGRI would take responsibility for 60 percent of the sequence, concentrating these efforts at Baylor, Washington University, and Whitehead/MIT; 33 percent of the sequence would be the responsibility of the Sanger Centre; and the remaining sequence would be supplied by the DOE’s Joint Genome Institute (JGI) in Walnut Creek, CA into which its three centers had merged in January 1997.

The first chromosomes to be completed (this was to finished, not working draft, standards) were the two smallest: the sequence for chromosome 22 was published by scientists at the Sanger Centre and partners at University of Oklahoma, Washington University in St. Louis, and Keio University in Japan in December 1999 (Dunham et al. 1999); the sequence for chromosome 21 was published by an international consortium of mostly Japanese and German labs—with half the sequencing carried out at Japan’s RIKEN—in May 2000 (Hattori et al. 2000). The remaining chromosomes lagged behind. On 26 June 2000, when Collins, Venter, and the DOE’s Patrinos joined U.S. President Bill Clinton (and British Prime Minister Tony Blair by satellite link) at a White House press conference (see Clinton, et al. 2000) to announce that the human genome had been sequenced, this was more an arranged truce than a tie for the prize. An editorial in Nature described the fanfare of 26 June as an “extravagant” example—one reaching “an all-out zenith or nadir, according to taste”—of scientists making public announcements not linked to peer-reviewed publication, here to bolster share prices (Celera) and for political effect (the HGP) given the “months to go before even a draft sequence will be scientifically useful” (Anonymous 2000, p. 981). Neither of the two sequence maps was complete (Pennisi 2000). The HGP had not met its previous year’s goal of a working draft covering 90 percent of the genome. Assisted by its researchers’ access to HGP data stored on public databases, [ 1 ] Celera’s efforts were accepted as being further along: the company’s press release that day announced 99 percent coverage of the genome.

Peer-reviewed publications came almost eight months later. Negotiated plans for joint publication in Science broke down when terms of agreement over data release could not be negotiated, with the journal’s editors willing to publish Celera’s findings without Venter meeting the standard requirement that the sequence data be submitted to GenBank. Press conferences in London and Washington, D.C. on 12 February preceded publications that week—by HGP scientists in Nature on 15 February 2001 and by Venter’s team in Science on 16 February 2001. The HGP draft genome sequence covered about 94 percent of the genome, with about 25 percent in the finished form already attained for chromosomes 21 and 22. Indeed, the authors themselves described it as “an incomplete, intermediate product” which “contains many gaps and errors” (International Human Genome Sequencing Consortium 2001, p. 871). The results published by Celera had 84–90 percent of the genome covered by scaffolds at least 100 kb in length, with the composition of the scaffolds averaging 91–92 percent sequence and 8–9 percent gaps (Venter et al. 2001). In the end, Celera’s published genome assembly made significant use of the HGP’s publicly available map and sequence data, which left open for debate the question whether WGS sequencing alone would have worked (see Waterston et al. 2002; Green 2002; and Myers et al. 2002).

Since the gaps in the sequence were unlikely to contain genes, and only genes as functional segments of DNA have potential commercial value, Celera was happy to leave the gaps for the HGP scientists to fill in. Despite being timed to coincide with celebrations of the 50 th anniversary of the Watson–Crick discovery of the double-helical structure of DNA, there was less fanfare surrounding the official date of completion of the HGP in April 2003, two years earlier than had been anticipated at the time of its official launch in October 1990, and several months earlier than called for in the most recent five-year plan. In the end, sequencing—the third phase of the publicly-funded project—was carried out at 16 centers in six countries by divvying up among them sections of chromosomes for sequencing. 85 percent of the sequencing, however, was done at the five major sequencing centers (Baylor, Washington University, Whitehead/MIT, Sanger Center, and DOE’s JGI), with the Sanger Centre responsible for nearly one-third. The cost was lower than anticipated, with $2.7 billion spent by U.S. agencies and £150 million spent by Wellcome Trust. The “finished” reference DNA sequence for Homo sapiens was made publicly accessible on the Internet. However, for various technical reasons, the human genome’s 3.1 billion nucleotide bases had not yet been completely sequenced at the close of the HGP in 2003. [ 2 ]

Public support was won for the HGP through scientists’ promises of the revolutionary benefits of genome-based research for pharmaceutical and other biomedical applications. At the outset of the HGP, these promises were sometimes alarmingly deterministic, reductionistic, and overblown, such as when Science editor Daniel Koshland (1989) submitted that genes are responsible not only for manic-depression and schizophrenia but also poverty and homelessness, and that sequencing the genome represented “a great new technology to aid the poor, the infirm, and the underprivileged” (p. 189). More circumspect claims by scientist-proponents of the HGP were no less optimistic. Leroy Hood expressed the belief that “we will learn more about human development and pathology in the next twenty-five years than we have in the past two thousand” (1992, p. 163). Hood expected the HGP to facilitate movement from a reactive to preventive mode of medicine, which would “enable most individuals to live a normal, healthy, and intellectually alert life without disease” (p. 158). Francis Collins predicted that sequencing the genome would “dramatically accelerate the development of new strategies for the diagnosis, prevention, and treatment of disease, not just for single-gene disorders but for the host of more common complex diseases (e.g., diabetes, heart disease, schizophrenia, and cancer)” (1999, p. 29). Collins envisioned that, by 2010, genetically-based “individualized medicine” would be a reality: physicians would routinely take cheek swabs from patients and send their DNA out for testing; based on results of genetic testing (returned within a week), physicians would be able to advise their patients about their absolute and relative risks for contracting various adult-onset diseases; by taking preventive measures (e.g., quitting smoking, having an annual colonoscopy, etc.), patients would be able to prevent the onset of any such diseases or minimize their effects; and the field of pharmacogenomics would have “blossomed” sufficiently for physicians to be able to prescribe prophylactic medications tailored precisely to the genetic make-up of their patients, so to promote efficacy and prevent adverse reactions.

Genome-wide association studies (GWAS), in which single nucleotide polymorphisms (SNPs) across the genome are compared in case–control fashion, are the main approach used to investigate the genetic bases of complex traits. The importance of developing rapid, inexpensive methods of genome sequencing and building a database of single nucleotide polymorphisms (SNPs) to support the investigation of complex traits was recognized by the project’s leadership even before completion of the HGP. Worried about the private sector’s efforts to patent SNPs, which would make them costly to use for research, the NHGRI-DOE’s five-year plan for 1998–2003 included the goal of mapping 100,000 SNPs by 2003 (Collins et al. 1998). The development of a public database of SNPs received a $138 million push from the International HapMap Project, a three-year public-private partnership completed in 2005 that mapped variation in four population groups (The International HapMap Consortium 2005). The 1000 Genomes Project, which ran between 2008 and 2015 (Birney and Soranzo 2015), sought to identify genetic variants that occur with a frequency of at least one percent in the populations studied, with the final data set consisting of 2,504 individuals from 26 populations from five continental regions.

The first genome-wide association study was published in 2005; by 2010, 500 genome-wide association studies had been published (Green et al. 2011); and by 2018, more than 5,000 genome-wide association studies had been published and their results added to the GWAS Catalog (Buniello et al. 2019). Despite these efforts, variants isolated by GWAS for complex traits account for a low percentage of heritability associated with the traits, a phenomenon known as “missing heritability” (Maher 2008). Although knowledge of the pathogenesis of common complex diseases such as diabetes, heart disease, schizophrenia, and cancer that arise due to the interaction of numerous genetic and nongenetic factors is lacking despite the numerous GWAS completed, the pathogenesis of so-called single-gene disorders is far better understood. Aided by the HGP’s dense map of genetic markers for use in positional mapping and subsequent development of genome-wide sequencing technologies, since the draft sequence was published 20 years ago, the number of Mendelian diseases with a known genetic basis has increased from 1,257 to 4,377 (Alkuraya 2021). This progress speeds up clinical diagnosis and facilitates prenatal genetic testing. Prevention and treatment remain challenging, however. There are only 59 “actionable genes” on the American College of Medical Genetics and Genomics’ most recent list of genes that are highly penetrant and associated with established interventions (Kalia et al. 2017). Gene therapy, touted as a potential cure for such disorders, yielded discouraging results in trials; only with the discovery of CRISPR, a novel technology of gene editing, has optimism been restored, though how well this basic science translates into clinical applications is yet to be seen (Doudna and Sternberg 2017; Baylis 2019).

The confident claims of leading genome scientists such as Hood and Collins proved to be overly so. Commentary at the time of the 10-year anniversary of completion of the HGP showed unison among scientists in recognizing that the promised revolution had not yet arrived. The HGP has been advanced as a case study in support of the “social bubble” hypothesis: the hypothesis that people will dive into new opportunities that present without due regard for potential risks because they are carried along by social interactions driven by enthusiasts who generate high expectations of returns—for the HGP, projections of commercially lucrative pharmaceutical and other biomedical applications advanced by project proponents (Gisler et al. 2010). The scientific consensus appears to be that although the HGP failed at least in the short term to fulfill proponents’ overly optimistic prognostications for clinical applications, it has been a boon for basic science (Evans 2010). In the HGP’s early years, Norton Zinder, who chaired the NIH’s Program Advisory Committee on the Human Genome, characterized it in this way: “This Project is creating an infrastructure for doing science; it’s not the doing of the science per se. It will provide the biological community with the basic materials for doing research in human biology” (in Cooper 1994, p. 74).

Indeed, the infrastructure of mapping and sequencing technologies and bioinformatics that was developed as part of the HGP—especially the ability to sequence entire genomes of organisms and traffic in big data—has changed the way biology, not just human biology, is done (Stevens 2013). It is recognized that genome structure by itself tells us only so much. Functional genomics places interest in how entire genomes—not just individual genes—function. A surprising discovery of the HGP was that the number of coding genes in humans is many orders smaller than what scientists had assumed at the outset of the project—that is, around 20,000, as in other vertebrates, rather than 80,000–100,000, though the final number remains an open question (Salzberg 2018). The majority of the genome’s DNA is transcribed but not translated and serves a regulatory function, with causal processes at the molecular level associated with interactive networks not linear pathways. The deterministic and reductionistic assumptions underlying the HGP that portrayed the genome as a blueprint for organismal development have been undermined by the research in molecular biology the project made possible (Keller 2000). Systems biology has emerged as a new discipline that seeks to understand this complexity by using computational methods (Hood 2003). In fact, since completion of the HGP, discovery of non-protein-coding elements of the genome and their contributions to regulatory networks has far exceeded the discovery of protein-coding genes (Gates et al. 2021). Evolutionary studies are aided by the ability of scientists to compare the human genome reference sequence to reference sequences for close relatives, such as Neandertals (Green et al. 2010), bonobos (Mao et al. 2021), and chimpanzees (The Chimpanzee Sequencing and Analysis Consortium 2005).

For scientists to deliver on their promises for a revolution in medicine, genome sequencing would need to be far faster, easier, and cheaper for its use to become routine in both research and clinical settings. In the closing years of the HGP, the cost of sequencing a human genome using existing Sanger technology was about $100 million. With high-throughput next-generation sequencing technology, in 2007, Venter’s diploid genome was sequenced at a cost of $10 million, and by 2013, the cost of sequencing an average genome had been lowered to $5,000. These developments were aided by a NHGRI grant program that set its sights on a $1,000 genome, a price point believed to place the “personal genome” within reach for routine use (Check Hayden 2014). Early in 2014, Illumina, a Californian company, claimed a win in the contest, with availability of its HiSeq X Ten system for population-scale whole-genome sequencing initiatives (Sheridan 2014); in March 2016, Veritas Genetics, a company cofounded by Harvard medical geneticist George M. Church, announced commercial availability of whole-genome sequencing for individuals, including interpretation and counseling, for $999 (Veritas Genetics 2016). Church had initiated the Personal Genome Project in 2005 (Church 2005); there are now Personal Genome Projects in Canada, the U.K., Austria, and China as well. The projects recruit volunteers who are willing to support research by releasing their genomes and health and physical information publicly. Research that takes this longitudinal approach combining genetic and clinical data is considered crucial for the promise of genomics to be fulfilled: with faster, easier, and cheaper genome sequencing technologies, genetic data are readily obtainable, but analysis of the data, without which there can be no revolution in medicine, remains challenging.

The National Research Council’s 2011 report saw the route to “personalized medicine,” or “precision medicine” as it was renamed (see Juengst et al. 2016 and Ferryman and Pitcan 2018 for an account of that change), proceeding via a “New Taxonomy” of disease informed by two data repositories: an “Information Commons” that stores molecular data (genome, transcriptome, proteome, metabolome, lipidome, and epigenome) and additional information (phenotypes, treatment outcomes, test results, etc.) gleaned from the electronic health records of millions of individuals; and a “Knowledge Network of Disease” that integrates this information with “fundamental biological knowledge.” Disease generalizations would be “built up from” this large number of individuals, a departure from studies that group individuals based on particular characteristics (e.g., GWAS). Research efforts approaching the scale called for include the All of Us Research Program in the U.S. and the UK Biobank. The data-intensive approach to molecular biology made possible by information technology need not stop at electronic health records but could also include other electronic records such as credit card purchases and social media postings (Weber et al. 2014), and biometric measurements from mobile apps and fitness trackers (Shi and Wu 2017). The increased importance of “big data” is illustrated by contrasting futuristic scenarios envisioned by Collins (1999) and Hood and Rowen (2013) almost 15 years apart. The “individualized medicine” circa 2010 forecasted by Collins is centered in the physician’s office and assumes a traditional view of the doctor–patient relationship. Hood and Rowen foresee individual genome sequences playing a larger role in medical practice and a changed doctor–patient relationship, driven by “patients” who are likely to bring consumer genetic data to their appointments and understand themselves to be active participants in their medical care. The new “P4 medicine” will be not only predictive, preventive, and personalized, but participatory, and based on a data-driven systems approach to disease. Write Hood and Rowen, “We envision a time in the future when all patients will be surrounded by a virtual cloud of billions of data points, and when we will have the analytical tools to reduce this enormous data dimensionality to simple hypotheses to optimize wellness and minimize disease for each individual” (p. 83). In the meanwhile, as Jenny Reardon (2017) tells us, the vast expanse between data and meaning characterizes “the postgenomic condition.”

Physicians are likely to encounter patients who bring consumer genetic data to their appointments because of a development largely unanticipated by HGP proponents and critics alike: “personal genomics” or “recreational genomics.” Efforts to compile SNP databases and develop rapid, inexpensive, whole-genome sequencing technologies have not yet supported the dawn of a new era of personalized medicine and drug development guided by pharmacogenomics, but direct-to-consumer (DTC) genomics has taken off as an industry, with profit-making seemingly unhampered by the lack of treatments for diseases based on knowledge of DNA sequences. The first DTC whole-genome test was marketed in 2006 (Green at al. 2011). By 2018, more than 10 million people had ordered DTC personal genomics tests (Khan and Mittelman 2018), and, that year, the NHGRI, celebrating the HGP’s 15 th anniversary, identified DTC genetic testing as one of the “15 for 15” ways in which genomics is influencing the world. In 2019, the global DTC genetic testing market was valued at over $1 billion and forecast to climb to $3.4 billion by 2028 (Ugalmugle and Swain 2020). Although health and ancestry are the most common genetic tests sought, there is a broad range of tests available, and DTC companies usually offer more than one service (Phillips 2016). 23andMe and Ancestry.com, for example, offer both health and ancestry tests. The family match function offered by these tests allows biological parentage to be discovered in cases of adoption and gamete donorship/sale. For people who want to confirm paternity, out a cheating spouse, ascertain athletic ability, identify nutritional needs, or find a romantic partner, there are genetic tests and companies for those interests too.

2. Philosophy and the Human Genome Project

At an October 1988 news conference called to announce his appointment, Watson, in an apparently off-the-cuff response to a reporter who asked about the social implications of the project, promised that a portion of the funding would be set aside to study such issues (Marshall 1996b). The result was the NIH/DOE Joint Working Group on Ethical, Legal, and Social Implications (ELSI) of Human Genome Research, chaired by Nancy Wexler, which began to meet in September 1989. The Joint Working Group identified four areas of high priority: “quality and access in the use of genetic tests; fair use of genetic information by employers and insurers; privacy and confidentiality of genetic information; and public and professional education” (Wexler in Cooper 1994, p. 321). The NIH and DOE each established ELSI programs: philosopher Eric T. Juengst served as the first director of the NIH-NCHGR ELSI program from 1990 to 1994. ELSI was funded initially to the tune of three percent of the HGP budget for both agencies; this was increased to four and later five percent at the NIH, a huge boost in bioethics funding, on the order of tens of millions of dollars each year.

Ethical issues such as genetic privacy, access to genetic testing, and genetic discrimination were not the only considerations of interest to philosophers, and besides ethicists, philosophers of science, political theorists and philosophers working in other areas benefited from ELSI-related funding. There is now a vast literature on human genome-related topics. From among these topics, this section attempts to provide a synopsis of those that are most directly associated with the HGP itself, of greatest concern and enduring interest to philosophers, and not covered in other SEP entries. Since there is interest in exporting the ELSI model to other biomedical contexts, such as neuroscience, consideration is also given to its legacy.

Various HGP proponents told us that we would discover our human essence in the genome. According to Dulbecco (1986), “the sequence of the human DNA is the reality of our species” (p. 1056); Gilbert was quoted as saying “sequencing the human genome is like pursuing the holy grail” (in Lee 1991, p. 9); on the topic of his decision to dedicate three percent of HGP funds to ELSI, Watson wrote: “The Human Genome Project is much more than a vast roll call of As, Ts, Gs, and Cs: it is as precious a body of knowledge as humankind will ever acquire, with a potential to speak to our most basic philosophical questions about human nature, for purposes of good and mischief alike” (with Berry 2003, p. 172).

“Geneticization” is a term used to describe the phenomenon characterized by an increasing tendency to reduce human differences to genetic ones (Lippman 1991). The several billion dollars of funding for the HGP was justified by the belief that genes are key determinants of not only rare Mendelian diseases like Huntington’s disease or cystic fibrosis but common multi-factorial conditions like cancer, depression, and heart disease. Wrote an early critic of the HGP: “Without question, it was the technical prowess that molecular biology had achieved by the early 1980s that made it possible even to imagine a task as formidable as that of sequencing what has come to be called ‘the human genome.’ But it was the concept of genetic disease that created the climate in which such a project could appear both reasonable and desirable” (Keller 1992, p. 293). Given that the development of any trait involves the interaction of both genetic and nongenetic factors, on what bases can genes be privileged as causes to claim that a particular disease or nondisease trait is “genetic” or caused by a “genetic susceptibility” or “genetic predisposition”? This question has led philosophers of science to grapple with appropriate definitions for terms such as “genetic disease” and “genetic susceptibility” and how best to conceptualize genetic causation and gene–environment interaction (e.g., Kitcher 1996; Gannett 1999; Kronfeldner 2009). Closely related to the concepts of geneticization and genetic disease/ susceptibility/ predisposition are assumptions about genetic reductionism and genetic determinism.

Genetic reductionism can be understood as governing the whole–part relation in which organismal properties are explained solely in terms of genes and organisms are identified with their genomes. Definitions of health and disease attach to organisms and their physiological and developmental processes in particular contexts (provided by populations and environments) and cannot simply be relocated to the level of the genome (Griesemer 1994; Limoges 1994; Lloyd 1994), however, and diseases do not become more objectively defined entities once they receive a genetic basis since social and cultural values implicated in designations of health and disease can become incorporated at the level of the genome, in what counts as a normal or abnormal gene (Gannett 1998). In contrast to physical reductionism, which does not privilege DNA but considers it on par with proteins, lipids, and other molecules (Sarkar 1998), genetic reductionism assumes that genes are in some sense more causally efficacious. Genetic determinism concerns such assumptions about the causal efficacy of genes.

In a public lecture held to celebrate completion of the HGP, Collins characterized the project as “an amazing adventure into ourselves, to understand our own DNA instruction book, the shared inheritance of all humankind” (see National Human Genome Research Institute, 2003). At the cellular level, the book is said to contain “the genetic instructions for the entire repertoire of cellular components” (Collins et al. 2003, p. 3). At this level, genetic determinism is sustained by metaphors of Weismannism and DNA as “code” or “master molecule” (Griesemer 1994; Keller 1994), which accord DNA causal priority over other cellular components. This may be in a physical sense: Weismannism assumes (falsely) that intergenerational continuity exists only for germ cell nuclei whereas somatic cells and germ cell cytoplasm arise anew in each generation. It may also be in the sense of a point of origin for the transfer of information: the central dogma of molecular biology, which represents a 1950s reformulation of Weismannism in terms of information theory, asserts that information travels unidirectionally from nucleic acids to protein, and never vice versa. It is contentious, however, whether amongst the cell’s components only nucleic acids can be said to transmit information: for some philosophers, genetic coding plays a theoretical role at least at this cellular level (Godfrey-Smith 2000); for others, genetic coding is merely (and misleadingly) metaphorical, and all cellular components are potential bearers of information (Griffiths 2001; Griffiths and Gray 1994; Sarkar 1996).

At the organismal level, new research in functional genomics may lead to less deterministic accounts even of so-called single gene disorders. For these, the concepts of penetrance and expressivity operate in ways that accommodate the one–one genetic determinist model where the mutation is necessary and/or sufficient for both the presence of the condition and confounding patterns of phenotypic variability. But the severity of even a fully penetrant condition like Huntington’s disease seems to depend on not just genetic factors like the number of DNA repeats in the mutation but epigenetic factors like the sex of the parent who transmitted the mutation (Ridley et al. 1991). For complex conditions to which both genetic and environmental differences contribute—for example, psychiatric disorders or behavioral differences—genetic determinism is denied, and everyone is an interactionist these days, in some sense of “interaction.” Both genes and environment are recognized to be necessary for development: by themselves, genes cannot determine or do anything. Yet, theorists still seem to give the nod to one or the other, suggesting that it is mostly genes or mostly the environment, mostly nature or mostly nurture, that make us what we are. This implies that it is possible to apportion the relative contributions of each. Gilbert (1992) suggests this in his dismissal of a more simplistic version of genetic determinism: “We must see beyond a first reaction that we are the consequences of our genes; that we are guilty of a crime because our genes made us do it; or that we are noble because our genes made us so. This shallow genetic determinism is unwise and untrue. But society will have to wrestle with the questions of how much of our makeup is dictated by the environment, how much is dictated by our genetics, and how much is dictated by our own will and determination” (pp. 96–97). However, the assertion that the relative contributions of genes and environment can be apportioned in this way is misleading if not outright false. Building on R. C. Lewontin’s (1974) classic paper on heritability, work in developmental systems theory (DST) undermines any such attempts to apportion causal responsibility in organismal development: traits are jointly determined by multiple causes, each context-sensitive and contingent (Griffiths and Gray 1994; Griffiths and Knight 1998; Oyama 1985; Oyama et al. 2001; Robert 2004).

Geneticization, genetic reductionism, and genetic determinism helped to sell the HGP. Gilbert (1992) endorsed the reduction of individual humans to their genes: “The information carried on the DNA, that genetic information passed down from our parents,” he wrote, “is the most fundamental property of the body” (p. 83), so much so, in fact, that “one will be able to pull a CD out of one’s pocket and say, ‘Here is a human being; it’s me!’” (p. 96). Cancers that we consider to be environmental in their origins were recast as genetically determined. In Watson’s words: “Some call New Jersey the Cancer State because of all the chemical companies there, but in fact, the major factor is probably your genetic constitution” (in Cooper 1994, p. 326). In Bodmer’s words: “Cancer, scientists have discovered, is a genetic condition in which cells spread uncontrollably, and cigarette smoke contains chemicals which stimulate those molecular changes” (Bodmer and McKie 1994, p. 89). (See Proctor 1992 and Plutynski 2018 for discussions of cancer as a genetic disease.) Marking the 20 th anniversary of the release of the draft sequences, Richard Gibbs, director of the sequencing center at Baylor, admits that “there was plenty of hype that was shared with the media and the wider community” and that such “outlandish visions” as personalizing therapies, revealing the “mysteries of the architecture of common complex diseases,” and predicting criminality have not been realized. But Gibbs excuses the hype as necessary for generating support for the project: “The hyperbole that we look back on did not, however, come from the front line. It came from those who championed the programme, mindful of its long-term benefits. Thanks to them, they generated the enthusiasm to fund this transformative work” (2020, p. 575).

Like Gibbs, bioethicist Timothy Caulfield (2018) finds the hype and hyperbole of scientists understandable: “Enthusiasm and optimistic predictions of near-future applications are required in order to mobilize the scientific community and potential funders, both public and private. This is particularly so in areas like genomics, where large amounts of sustained funding are required in order to achieve the hoped for scientific and translational goals” (p. 561). However, unlike Gibbs, Caulfield details possibilities of “real harm,” which include “potentially eroding public trust and support for science; inappropriately skewing research priorities and the allocation of resources and funding; creating unrealistic expectations of benefit for patients; facilitating the premature uptake of expensive and potentially harmful emerging technologies by health systems; misinforming policy and ethics debates; and accelerating the marketing and utilization of unproven therapies” (p. 567). The hype and hyperbole used to promote personalized (or precision) medicine carry the risks Caulfield mentions.

Approaching 20 years since completion of the HGP, genome science has not revolutionized medicine or markedly improved human health. Progress has been made on rare diseases (e.g., spinal muscular atrophy) and some forms of cancer (e.g., non-small cell lung cancer), though these interventions can be prohibitively expensive (Tabery 2023). For most complex diseases, however, predictions based on “family history, neighborhood, socioeconomic circumstances, or even measurements made with nothing more than a tape measure and a bathroom scale” outperform predictions based on the possession of genetic variants identified by GWAS. Pressing public health problems such as increasing obesity, the opiate epidemic, and mental illness fail to be addressed by the “human genome-driven research agenda” to which the lion’s share of resources go (Joyner and Paneth 2019). So even though the deterministic and reductionistic assumptions underlying the HGP have been undermined by the research in molecular biology the project made possible (Keller 2000), the critics’ worries about geneticization, genetic reductionism, and genetic determinism remain relevant, in particular their belief that embracing a reductionist approach to medicine that conceives of human health and disease in wholly molecular or genetic terms individualizes these and detracts attention from risks factors associated with our shared social and physical environments (Nelkin and Tancredi 1989; Hubbard and Wald 1993; Tabery 2023).

Genetic testing is carried out for a range of purposes: diagnostic, predictive, and reproductive. Genetic testing carried out at the population level for any of these purposes is referred to as genetic screening. Diagnostic genetic testing is performed on individuals already experiencing signs and symptoms of disease as part of their clinical care. Newborn screening programs to diagnose conditions such as PKU and hemoglobinopathies based on blood components and circulating metabolites (thus providing indirect genetic tests) have been carried out for many decades. Predictive genetic testing is performed on individuals who are at risk for inheriting a familial condition, such as cystic fibrosis or Huntingdon’s disease, but do not yet show any signs or symptoms. Reproductive genetic testing is carried out through carrier screening, prenatal testing of the fetus in utero, and preimplantation genetic diagnosis (PGD) of embryos created by in vitro fertilization (IVF). In carrier screening, prospective parents find out whether they are at risk for passing on disease-related genes to their offspring. Prenatal genetic testing of fetuses in utero is conducted using blood tests early in a woman’s pregnancy, chorionic villus sampling (CVS) at 10–12 weeks, and amniocentesis at 15–18 weeks. Testing is increasingly offered to all women who are pregnant, not just those for whom risk is elevated because of age or family history; based on the results, women can elect to continue the pregnancy or abort the fetus. In PGD, a single cell is removed from the 8-cell embryo for testing; based on the results, a decision is made about which embryo(s) to implant in the woman’s uterus.

There are significant ethical issues associated with genetic testing. These issues are informed by empirical studies of the psychosocial effects of testing (Wade 2019). Increased knowledge that comes from predictive genetic testing is not an unmitigated good: denial may be a coping mechanism; individuals may feel guilty for passing on harmful mutations to their offspring or stigmatized as having the potential to do so; survivor guilt may arise in those who find out they are not at risk for a disease such as Huntingdon’s after all, or they may become at a loss about how to live their lives differently; those who find out they are destined to develop Huntingdon’s or early-onset Alzheimer’s disease may become depressed or even suicidal; paternity may not be what it is assumed to be; decisions about disclosing results have implications for family members. During debates about the HGP, many authors appealed to the history of eugenics to warn about the dangers of reproductive genetic testing and urge caution as we move forward—so much so that historian Diane Paul (1994) characterized eugenics as the “‘approved’ project anxiety” (p. 143). Paul noted that attempts to draw lessons from the history of eugenics are confounded by disagreements about how to define “eugenics”—whether to characterize eugenics according to a program’s intentions or effects, its use of coercive rather than voluntary means, or its appeals to social and political aims that extend beyond the immediate concerns of individual families. The label “liberal eugenics” has become increasingly accepted for characterizing offspring selection based on parental choice. Reproductive rights are no longer just about the right not to have a child (to use contraception, to have an abortion) or the right to bear a child (to refuse population control measures). Reproductive rights have come to encompass the right to access technological assistance to procreate and to have a certain kind of child (Callahan 1998).

Concerns about genetic discrimination resulting from genetic testing were frequently expressed at the outset of the HGP. Concerns focused mostly on insurance companies and employers, but possibilities for genetic discrimination occurring in other institutional settings were raised as well (Nelkin 1992; Nelkin and Tancredi 1989). A number of general arguments have been made against institutional forms of genetic discrimination: we don’t choose our genes and ought not be punished for what is outside our control (Gostin 1991); the social costs of creating a “biologic” or “genetic underclass” of people who lack health care and are unemployed or stuck in low-wage jobs are too great (Lee 1993; Nelkin and Tancredi 1989); people’s fears of genetic discrimination, whether realistic or not, may lead them to forego genetic testing that might benefit their lives and be less inclined to participate in genetic research (Kass 1997); people have the right not to know their genetic risk status (Kass 1997). Genetic discrimination may also occur in less formal circumstances. Mate choice could increasingly proceed based on genetic information, with certain people being labeled as undesirable. As more and more fetuses are aborted on genetic grounds, families of children born with similar conditions, and people with disabilities and their advocates more broadly, worry that increased stigmatization will result. In addition, group-based genetic research into diseases or behavioral differences risks stigmatizing people based on racial, ethnic, and gender differences, with such risks informed by the troubling history of the study of the genetics of intelligence (Tabery 2015).

In the U.S., where unlike other industrialized countries there is no publicly funded system of universal health care, genetic discrimination by insurance companies and employers has been a particularly serious worry; existing or prospective employees found to be at genetic risk could be fired or not hired by employers to reduce costs of providing health care coverage. ELSI research relating to genetic privacy and the risk of genetic discrimination is credited with bringing about changes in federal law with “far reaching” effects on society (McEwen et al. 2014)—in particular, passage of the Genetic Information Nondiscrimination Act (GINA) in May 2008, which prohibits U.S. health insurance companies and employers from discriminating based on genetic information, defined to include genetic test results and family history but not manifest disease. The Affordable Care Act, passed in 2010, by prohibiting discrimination by health insurers based on preexisting conditions, which include genetic test results and manifest disease, fills in that gap and negates the need for GINA in the context of health insurance. As for employment, there remains a gap: employees who are substantially impaired are covered by the Americans with Disabilities Act and employees who are asymptomatic with genetic tests showing a predisposition for disease are covered by GINA, but employees with manifest disease who are not substantially impaired are covered by neither (Green et al. 2015). GINA does not prohibit use of genetic information in underwriting for life, disability, or mortgage insurance. Discrimination takes the form of refusing coverage on the basis that the genetic susceptibility counts as a “preexisting condition,” charging high premiums for the policy, limiting benefits, or excluding certain conditions. In 2020, Florida became the first state to prohibit use of genetic test results by life insurance companies.

The insurance industry argues that there is no principled reason to treat genetic information any differently from other medical information used in underwriting. They point to the problem of “adverse selection”: people who know themselves to be at high risk are more likely to seek insurance than people who know themselves to be at low risk, which threatens the market when insurers are deprived of the same information (Meyer 2004; Pokorski 1994). Taking an approach to legislation and policy that singles out genetic information for protection has also been criticized philosophically for being based on “misconceptions [that] include the presumption that a clear distinction exists between genetic and nongenetic information, tests, and diseases and the genetic essentialist belief that genetic information is more definitive, has greater predictive value, and is a greater threat to our privacy than is nongenetic medical information” (Beckwith and Alper 1998, p. 208; see also Rothstein 2005). The approach has been dubbed “genetic exceptionalism” and is criticized for drawing from, and in turn fostering, myths of genetic determinism and genetic reductionism (Murray 1997, p. 61; see also O’Neill 2001). Rather than assuming the binarism implicated in here—that genetic information is unique and targeted policies are necessary or that genetic information is not unique and targeted policies are unnecessary—“genomic contextualism” has been recommended as alternative approach (Garrison et al. 2019a). This approach recognizes that there are both similarities and differences between genomic and other types of clinically relevant information and that the specific context in which the policy or practice is implemented determines how best to proceed. Since completion of the HGP, the contexts in which genomics is practiced have changed sufficiently that while privacy concerns remain pressing, the debate has been largely recast.

At the outset of the HGP, concerns about genetic privacy focused on how to protect the public from intrusive governments, employers, and insurance companies. But the explosion of DTC genomics has raised privacy concerns caused by the very public in need of protection! In DTC genomics, Y-chromosomal, mitochondrial, and autosomal DNA ancestry tests are used to provide familial matches. These matches enable adoptees in closed adoptions and offspring of anonymous gamete donors to track down biological parents and other family members, raising obvious privacy concerns for those who gave up children for adoption or agreed to donate or sell eggs or sperm expecting that their anonymity would be protected. Additional privacy concerns arise as the result of genetic genealogy’s use of matches to cousins of various degrees to fill out missing branches in family trees. Genome scientists rely on cell lines, DNA sequence data, and clinical data sets that have been di-identified to protect the anonymity of volunteers; however, access to genetic genealogy databases that combine genetic and traditional genealogical information makes re-identification possible. In genetic genealogy, surname projects based on the co-inheritance of surnames and Y-chromosomal haplotypes furnish candidate surnames when sequence information is available; using Internet searches to match surnames with year of birth and U.S. state of residence, researchers were able to identify individuals who had participated in the 1000 Genomes Project; by extension, they also identified family members who had not participated in the project or consented to share information (Gymrek et al. 2013). Based on population genetic modelling, researchers suggest that with a genetic genealogy database that covers two percent of a target population, a third cousin match can be obtained for 99 percent of the population. With this match, family trees constructed using traditional genealogical methods and additional sources of information can be used to identify an unknown individual for whom DNA is available: this is the “long range familial search” approach that law enforcement is using in an increasing number of active as well as cold cases (Erlich et al. 2018).

A response to the inability to guarantee anonymity for participants in genomic research is to consider concerns about genetic privacy and privacy more generally to be passée. Such concerns are increasingly seen to stand in the way of scientific progress. Watson and Venter have promoted the idea that there is nothing to fear by making one’s sequence public rather than protecting it as private. Venter’s diploid genome was fully sequenced and the findings published in the October 2007 issue of PLoS Biology (Levy et al. 2007); this was followed by the publication of Watson’s “complete” genome in the 17 April 2008 issue of Nature (Wheeler et al. 2008). Relevant to the privacy question, Watson did not bare all: at Watson’s request, the APOE gene which is linked to Alzheimer’s disease was omitted from his sequence. Along similar lines, the volunteers Church recruits for the Personal Genome Project agree to release their genomes and health and physical information publicly, a model of “open consent” replacing genetic privacy (Lunshof et al. 2008). Says Church: “Ideally, everybody on the planet would share their medical and genomic information” (in Dizikes 2007). Large-scale biobanking initiatives such as the All of Us Research Program, originally called the Precision Medicine Initiative or PMI, appeal to collective altruism. The Internet has radically changed people’s expectations of privacy, the boundary between their personal and public lives, and their expectations of accessing information, and these changes are welcomed by genome scientists. To ensure “the free flow of research data,” the 2011 National Research Council report calls for the “[g]radual elimination of institutional, cultural, and regulatory barriers to widespread sharing of the molecular profiles and health histories of individuals, while still protecting patients’ rights” (p. 60).

With this free flow of research data, across institutions and globally, with varying degrees of oversight, people’s ability to consent to use of their biospecimens and data for some but not other purposes becomes impossible (Zarate et al. 2016). Privacy concerns are amplified insofar as genomics is a science driven by big data. The promise of personalized and precision medicine is premised, for some, on amassing all obtainable data on individuals, whether mined from electronic health records, government databases, DTC genomics, genealogy sites, mobile devices, social media, credit card transactions, fitbits, etc. Data-driven biology forgoes hypotheses for algorithms, but these algorithms are not innocuous. Hallam Stevens (2021) describes “an emerging medical-industrial complex” that presents “substantial challenges for privacy, data ownership, and algorithmic bias,” which, if not addressed, will lead to a genomic science that operates in the interests of “surveillance capitalism” and the corporate tech giants (pp. 565–566). Governments, especially authoritarian ones, are building databases that supplement DNA with biometics and social media posts to carry out genomic surveillance that often targets minorities (Moreau 2019).

Of course, significant social privilege attaches to some people’s ability not to worry that genetic testing offered as an employee benefit (Singer 2018) or their virtual cloud of billions of data points will cause them harm. Or that they will regret spitting into a tube and sending it off to 23andMe. As Anna Jabloner (2019) argues, “molecular identification technologies … tell a tale of two molecular Californias: one is a tale of an unchanged biological determinism that continues to mark some bodies as risky and criminal, the other tale is of individual empowerment through the consumption of molecular knowledge” (p. 15). Black and Latino men are overrepresented in CAL-DNA, which is one of the largest criminological DNA databases in the world, while 23andMe’s even larger database contains the DNA of mostly wealthy white Americans. Similarly, Reardon (2017) comments on the tension between the democratizing impulse of the open-data model of Church’s Personal Genome Project and the overwhelming Whiteness, affluence, and maleness of the tech-savvy volunteers who have contributed their genomes

Indigenous groups resist this open consent model that appeals to collective altruism and seek to maintain control over biospecimens contributed and data generated. As participants in scientific studies, they have experienced lack of support for their interests and priorities, failure to share benefits of research, disrespect for cultural and spiritual beliefs, theft of traditional knowledge, conduct of unapproved secondary research, and opportunistic commercialization (Garrison et al. 2019b). Genetics and genomics bring specific concerns, as “genomic data are commonly seen by Indigenous communities as more sensitive than other types of health data, particularly with regard to genealogy and ancestry research that can influence traditionally held beliefs, cultural histories and identity claims affecting rights to land and other resources ” (Hudson et al. 2020, p. 378). Given distrust of funding agencies, universities, and researchers arising from these experiences and concerns, Indigenous underrepresentation in biobanks, DNA sequence databases, and clinical datasets is not surprising. Genome scientists desire access to biospecimens and data of Indigenous peoples for a range of purposes: geographical isolation of populations over tens of thousands of years can yield genetic variants of physiological and clinical interest associated with adaptive responses to environments; comparative genomics and ancient DNA studies contribute to knowledge of human evolutionary history; and efforts to identify genetic contributions to complex traits using GWAS and admixture mapping depend on access to populations that incorporate the genetic diversity of the species. Any progress in the prevention and treatment of disease that arises through precision medicine will be weighted towards populations most studied. Indigenous peoples make up only 0.022% of participants in GWAS conducted worldwide (Mills and Rahal 2019).

In recognition of the overrepresentation of people of European descent in biobanks, DNA sequence databases, and clinical datasets, the NIH’s All of Us initiative seeks to include at least 50 percent underrepresented minorities, motivated by the goal that precision medicine benefit everyone. Keolu Fox (2020) points out that NIH plans to include Indigenous communities in the All of Us initiative fail to appreciate that the open-source data approach used for previous government-funded, large-scale human genome sequencing efforts such as the International HapMap Project and 1000 Genomes Project facilitates the commodification of data by pharmaceutical and ancestry-testing companies. Nanibaa’ A. Garrison et al. (2019b) suggest that development of alternative models for genetic and genomic research involving Indigenous peoples begin by recognizing Indigenous sovereignty, which is “the inherent right and capacity of Indigenous peoples to develop culturally, socially, and economically along lines consistent with their respective histories and values” (pp. 496–497). There should be tangible benefits for Indigenous communities (e.g., support for health promotion, meaningful results), with equitable sharing of profits should commercialization occur (Hudson et al. 2020). Community engagement is crucial; this may extend to community-based participatory research that views Indigenous communities as partners in research, not merely subjects of research. Individual consent is insufficient and should be preceded by collective consent, and consent needs to be an ongoing process for any subsequent research contemplated (Garrison et al. 2019b; Tsosie et al. 2019). Indigenous control over the use and disposal of DNA samples is needed: on the “DNA on loan” approach developed in Canada (Arbour and Cook 2006), the participant or community retains ownership of biological materials and entrusts these to researchers or research institutions as stewards. As for data gleaned from biological materials, the model of open consent is counter to the concept of Indigenous data sovereignty, which is “the inherent and inalienable rights and interests of indigenous peoples relating to the collection, ownership and application of data about their people, lifeways and territories” (Kukutai and Taylor 2016, p. 2).

Early in the debates surrounding plans for the HGP, questions arose concerning what it means to map and sequence the human genome—“get the genome,” as Watson (1992) put it. About these concerns, McKusick (1989) wrote: “The question often asked, especially by journalists, is ‘Whose genome will be sequenced?’ The answer is that it need not, and surely will not, be the genome of any one person. Keeping track of the origin of the DNA that is studied will be important, but the DNA can come from different persons chosen for study for particular parts of the genome” (p. 913). The HGP and Celera reference sequences are indeed composites based on chromosomal segments that originate from different individuals: the sequence in any given region of the genome belongs to a single individual, but sequences in different regions of the genome belong to different individuals. However, in both cases, the majority of the sequence originates from just one person. As HGP sequencing efforts accelerated, concerns arose that only four genomes, a couple of which belonged to known laboratory personnel, were being used for physical mapping and sequencing (Marshall 1996a). The decision was made to construct 10 new clone libraries for sequencing with each library contributing about 10 percent of the total DNA. In the end, 74.3 percent of the total number of bases sequenced was derived from a single clone library—that of a male, presumably from the Buffalo area; seven other clone libraries contributed to an additional 17.3 percent of the sequence (International Human Genome Sequencing Consortium 2001, p. 866). A similar proportion—close to 71 percent—of the Celera sequence belongs to just one male even though five ethnically diverse donors were selected; incredibly enough, rumors were eventually confirmed that this individual is Venter himself (McKie 2002).

The deeper question, of course, is how we might understand a single human genome sequence, a composite that belongs to no actual individual in its entirety and only a handful of individuals in its parts, to be representative of the entire species. This seems to ignore the extensive genetic variability that exists. Early critics of the HGP pointed out numerous faults with the concept of a representative or putatively normal genome: many DNA polymorphisms are functionally equivalent (Sarkar and Tauber 1991); the genome sequence will contain unknown defective genes (since no one, including donors, is free of these), and it is impossible to identify the genetic basis of a disorder simply by comparing the sequences of sick and well people since there will be many differences between them (Lewontin 2000 [1992]); and from an evolutionary viewpoint, mutations are not “errors” in the genetic code or “damage” to the genome’s structure, but the genetic variants that provide the raw materials that make it possible for new species to arise (Limoges 1994, p. 124). There were related worries that the human genome reference sequence would arbitrate a standard of genetic normality; for example, the application of concepts like “genetic error” and “damage” to the genome institutes a call for correction or repair (Limoges 1994; also Murphy 1994). Indeed, the 1988 Office for Technology Assessment report on the HGP recommended the “eugenic use of genetic information … to ensure … that each individual has at least a modicum of normal genes” (p. 85).

Science named “human genetic variation” as “Breakthrough of the Year” for 2007. Humans have been found to be 99.9 percent alike genetically, but notable for the magazine was the extent to which individuals had been found to differ genetically from one another—in SNPs, insertions, deletions, and other structural elements—and the promise this apparently unexpected amount of variation holds for using genome-wide association studies (GWAS) to discover the genetic bases for complex traits, both disease and non-disease traits, to which multiple genetic and nongenetic factors contribute. The human genome reference sequence has been useful as a tool for discovering and cataloguing that genetic variation by providing a standard shared by the scientific community: “The current reference genome assembly works as the foundation for all genomic data and databases. It provides a scaffold for genome assembly, variant calling, RNA or other sequencing read alignment, gene annotation, and functional analysis. Genes are referred to by their loci, with their base positions defined by reference genome coordinates. Variants and alleles are labeled as such when compared to the reference (i.e., reference (REF) versus alternative (ALT)). Diploid and personal genomes are assembled using the reference as a scaffold, and RNA-seq reads are typically mapped to the reference genome” (Ballouz et al. 2019, p. 159). However, what had portrayed as a journalist’s or philosopher’s question is now being asked by genome scientists too. Even as a tool, there are challenges to overcome. If alleles included in the reference sequence are relatively rare, “reference bias” is introduced: genomes that resemble the reference genome are easier to assemble and align, and variants are missed or misidentified (Ballouz et al. 2019). And when entire stretches of sequence are missing in the reference sequence, those sequences will be discarded and missed entirely because of the reliance on the genome reference sequence for assembling and aligning sequenced genomes (Sherman and Salzberg 2020).

Since 2003, there have been ongoing efforts to update the human genome reference sequence by filling gaps, correcting errors, and replacing minor alleles (the current version of the reference sequence is GRCh38). Diversity has been incorporated in the reference sequence by tacking on additional sequences, but by maintaining a linear representation, this loses location information (Kaye and Wasserman 2021). Suggestions have been made for ways to further improve the genome reference sequence. Recently developed long read sequencing technologies facilitate the discovery of large structural variants (SVs) and not just genetic variants (i.e., SNPs and smaller insertions and deletions, or “indels”). One suggestion calls for “reconstructing a more precise canonical human reference genome” by using those SVs to correct misassemblies in the reference sequence and adding the more common SVs to improve variant detection (Yang et al. 2019). Another suggestion recommends adopting a “consensus sequence” approach, in which the most common alleles and variants in the population are chosen for inclusion (Ballouz et al. 2019). Problems remain, however: these remain composite genomes and may contain sequences that would not be found together in any individual, and ongoing updates undermine the stability of the reference sequence as a reference (Kaye and Wasserman 2021). Suggestions have also been made for the replacement of the genome reference sequence. Long read sequencing allows the “ de novo assembly” of genomes, which obviates need of the reference sequence for scaffolding (Chiasson et al. 2015). A human “pan-genome” would accommodate variation by serving as a collection of all the DNA sequences found in the species, both SVs and genetic variants, and replace a linear representation of the genome with more complex genome graphs (Sherman and Salzberg 2020; Miga and Wang 2021). Another possibility is “The Genome Atlas,” which foregoes use of the reference sequence even as a coordinate system, with entries in the atlas instead features of the genome assigned unique feature object identifiers (FOIs). The database generates blueprint genomes based on selected features for use in sequencing reads (Kaye and Wasserman 2021).

Philosophical concerns about whether a human genome reference sequence arbitrates a standard of genetic normality remain, though these may be mitigated by the pan-genome and genome atlas approaches. An empirically validated consensus sequence approach that includes the most common alleles and variants in the population in the genome reference sequence does not imply that those alleles and variants are of biomedical significance because they are conducive to health or of evolutionary significance because they are ancestral. Science ’s 18 February 2011 issue in celebration of the 10-year anniversary of publication of the draft human genome sequence contains an essay by genome scientist Maynard V. Olson, which asks, in its title, “What Does a ‘Normal’ Human Genome Look Like?” Although the HGP was criticized as anti-evolutionary, pre-Darwinian, typological, and essentialist for seemingly instituting a standard of genetic normality, it was also argued that the HGP might be seen instead as incorporating a specific set of evolutionary assumptions (Gannett 2003). Indeed, from an evolutionary perspective, Olson contends that genetic variability among relatively healthy humans is largely composed of deleterious mutations, rather than adaptive mutations due to balancing or diversifying selection, and that, consequently, “there actually is a ‘wild-type’ human genome—one in which most genes exist in an evolutionarily optimized form” (p. 872), though individual humans inevitably “fall short of this Platonic ideal.” Judgments about what constitutes a “normal,” “wild-type,” or “ideal” human genome do not escape the socio-cultural contexts in which they arise. Social and cultural values that attach to judgements at the phenotypic level are simply embedded in the genome, where they are less visible as such (Gannett 1998).

In promoting the HGP, Gilbert (1992) suggested that we will find answers to the age-old question about human nature in our genome: “At the end of the genome project, we will want to be able to identify all the genes that make up a human being…. So by comparing a human to a primate, we will be able to identify the genes that encode the features of primates and distinguish them from other mammals. Then, by tweaking our computer programs, we will finally identify the regions of DNA that differ between the primate and the human—and understand those genes that make us uniquely human” (p. 94). Although philosophers challenge the species essentialism that defines species in terms of genetic properties shared by all and only their members (Gannett 2003; Robert and Baylis 2003), genome scientists are indeed comparing the human genome reference sequence to chimpanzee, bonobo, and Neandertal genome reference sequences to explore questions about human nature. Already in 1969, Sinsheimer foresaw the promise of molecular biology to remake human nature: “For the first time in all time, a living creature understands its origin and can undertake to design its future” (in Kevles 1992, p. 18). Remaking human nature is likely to begin with genetic modifications that convey the possibility of resistance to a serious disease, like HIV/AIDS, or minimize the effects of aging to extend lifespan, but transhumanists who view human nature as “a work-in-progress, a half-baked beginning that we can learn to remold in desirable ways” welcome improvements in memory, intelligence, and emotional capacities as well (Bostrom 2003, p. 493). Theories of justice are typically based on conceptions of human nature; although the new field of sociogenomics continues to favor nature over nurture (Bliss 2018), with the capacity to remake human nature, this foundation disappears (Fukuyama 2002; Habermas 2003).

Concerns about race, ethnicity, and the genome were raised in the early years of the HGP. Racial profiling in the legal system was one such concern: if people belonging to particular racial and ethnic groups are more likely to be arrested, charged, or convicted of a criminal offense, they are more likely to be required to provide DNA samples to forensic databases, and therefore more likely to come back into the system with future offenses (Kitcher 1996). Another concern was that if genetic discrimination by insurers and employers creates a “genetic underclass” (Lee 1993; Nelkin and Tancredi 1989), then to the extent that race and ethnicity correlate with socioeconomic status, some groups—already affected by disparities in health outcomes unrelated to genetic differences—will be disproportionately represented among this “genetic underclass.” A further concern was the social stakes involved when group-based differences are identified, whether these involve sequences localized to particular groups or varying in frequency among groups (Lappé 1994). And given the history of using biological explanations to provide ideological justification for social inequalities associated with oppressive power structures, the prospective use of molecular genetics to explain race differences was met with caution (Hubbard 1994). These concerns were dismissed by HGP proponents, who argued that mapping and sequencing the human genome celebrate our common humanity. At the June 26, 2000 White House press conference announcing completion of “a working draft” of the sequence of the human genome, Venter announced that the results show that “the concept of race has no genetic or scientific basis” (see Clinton, et al. 2000).

Post-HGP genetics and genomics have not lived up to the mantra that because we are 99.9 percent the same, there is no such thing as race. Indeed, a predominantly African American racial identity has been ascribed to the human genome reference sequence itself (Reich et al. 2009), and the de novo assembly of genomes permitted by long read sequencing has led to several countries, such as China, Korea, and Denmark, producing their own “ethnicity-specific reference genomes” (Kowal and Llamas 2019). The International HapMap Project, which was initiated in 2002 with the goal of compiling a haplotype map adequately dense with SNP markers to permit the identification of genes implicated in common diseases and drug responses, sampled the DNA of four populations (European-Americans in Utah; Yoruba in Ibadan, Nigeria; Japanese in Tokyo; and Han Chinese in Beijing). This reproduction of racial categories (here, European, African, and Asian) at the level of the genome has been characterized as the “molecular reinscription of race” (Duster 2015). DTC ancestry testing appeals to a range of group categories, which are defined by geography, nationality, ethnicity, and race: Ancestry.com advertises tests for Irish ancestry in early March each year; FamilyTreeDNA confirms Jewish ancestry, whether Ashkenazi or Sephardi; African Ancestry, Inc. finds ancestral ties to present-day African countries and ethnic groups dating back more than 500 years; and DNAPrint’s panels of ancestry informative markers determine proportions of (Indo)European, East Asian, sub-Saharan African, and Native American heritage.

The resurgence in biological thinking about race and ethnicity since the HGP is due in large part to the postgenomic use of racial and ethnic categories of difference to try to capture patterns of group genetic differences in various fields of research. The revolutionary benefits of postgenomic “personalized” or “precision” medicine were supposed to focus on individual genetic differences within populations, not group genetic differences across populations. Pharmaceuticals, a powerful engine driving post-HGP research into human genetic differences, were supposed to be tailored to individual genomes. In 2003, Venter opposed the U.S. Food and Drug Administration (FDA) proposal to carry out pharmaceutical testing using the Office of Management and Budget (OMB) racial and ethnic classification system, arguing that these are “social” not “scientific” categories of race and ethnicity and that the promise of pharmacogenetics lies in its implementation as individualized medicine given the likelihood that variation in drug responses will vary more within racial and ethnic groups than among them (Haga and Venter 2003). However, en route to a “personalized” or “precision” medicine based on individual genetic differences and pharmaceuticals tailored to individual genomes, a detour via research into group genetic differences has been taken. Now that group genetic differences have become of interest to more than just evolutionary biologists and population geneticists, impetus is provided to debates to which philosophers of science have contributed: longstanding debates about whether race is biologically real or socially constructed (Andreasen 2000; Pigliucci and Kaplan 2003; Gannett 2010; Hochman 2013; Spencer 2014) and more recent ones concerning the appropriateness of the use of racial categories in biomedical research (Root 2003; Gannett 2005; Kaplan 2010; Hardimon 2013).

Despite the detour via group genetic differences en route to “personalized” or “precision” medicine, genome-based research into common diseases and drug responses has focused predominantly on Europeans. For GWAS, a 2009 analysis showed that 96 percent of participants were of European descent; by 2016, though the proportion of participants of European descent had decreased to 81 percent, the change was mostly accounted for by a greater number of studies being carried out in Asian countries (Popejoy and Fullerton 2016). Even though African populations are the most genetically diverse in the world, by 2018, only 2 percent of GWAS participants had been African in origin (Sirugo et al. 2019). This European bias makes it more difficult to isolate rare genetic variants contributing to disease, to provide accurate and informative genetic test results to nonEuropeans, and to ensure that any clinical benefits that arise from genomic research will be equitably distributed within the U.S. and globally. Sociologist Dorothy E. Roberts (2021) calls for genetic researchers to “stop using a white, European standard for human genetics and instead study a fuller range of human genetic variation,” which will “give scientists a richer resource to understand human biology” as well as promoting equitable access to the benefits of research. That study of human genetic variation, Roberts argues, should abandon use of race “as a biological variable that can explain differences in health, disease, or responses to therapies,” as it obscures “how structural racism has biological effects and produces health disparities in racialized populations” (p. 566). Structural racism is advanced as a key determinant of population health: for example, the racial and economic segregation of neighborhoods contributes to health disparities because of differences in quality of housing, exposure to pollutants and toxins, good education and employment opportunities, and access to decent health care (Bailey et al. 2017).

The NHGRI has affirmed its commitment to improve the inclusion of participants from diverse populations in research and begun to appreciate that genomic studies are designed in ways that fail to consider the contribution of social and physical environments to disease (Hindorff et al. 2018). However, as sociologist Steven Epstein (2007) has argued more generally, the “inclusion-and-difference paradigm” that has prevailed in U.S. research over the past couple of decades, though correcting researchers’ previously held default assumption of the white, middle-aged, white male as the normative standard, serves to amplify the role of biology in health and disease while drawing attention away from society. Genome scientists understand the ramifications attached to their use of racial and ethnic categories in research, and despite longstanding, well-considered ELSI-funded research that urges care be attached to the use of these categories (Sankar and Cho 2002; Sankar et al. 2007), problems are ongoing and attended to by NHGRI leaders concerned about “the misuse of social categories of race and ethnicity as a proxy for genomic variation” (Bonham et al. 2018, p. 1534). The “weaponization” of genomics by White nationalists and the alt-right has raised the stakes for genome scientists. The self-described fascist, White supremacist, racist, and anti-Semite who murdered 10 African Americans at a Buffalo supermarket in 2022 posted writings that cited dozens of scientific studies, including use of a GWAS of educational attainment to support hereditarian views of racial differences in intelligence and use of a principal components analysis (PCA) that resolved human genetic diversity into continental-level clusters to support realism about biological race (Carlson 2022).

White supremacists chug milk to celebrate their origins in European populations that evolved the ability to digest lactose in adulthood (Harmon 2018), and they appeal to traces of Neanderthal DNA in their genomes to celebrate their origins in populations that evolved outside Africa (Wolinsky 2019). While it seems incumbent on genome scientists to confront racist misuses of their research, they face challenges in doing so. Misuse may result from misunderstanding science, but not always: lay experts among White nationalists capably use cutting-edge genomics research to justify hereditarian views, thereby building a counter-knowledge (Doron, in press) or citizen science of sorts (Panofsky and Donovan 2019). In the U.S., White nationalists take DTC ancestry tests to prove their genetic purity, that they are 100% European/White/non-Jewish (Panofsky and Donovan 2019), while in Europe, this assumed homogeneity of Whiteness comes into question with Nordicists and Mediterraneans using genetic admixture mapping as they vie to prove themselves the most European of Europeans (Doron, in press). When scientists engage with racists, even ones who are scientifically literate, there is the risk of unintentionally helping their cause. “Furthermore,” as Aaron Panofsky et al. (2021) argue, “many of the findings about human evolution and variation are genuinely complex, ambiguous, contested, changing, and involve historically contingent judgments” (p. 396); hence, it may be difficult to claim that research has been misconstrued. As Claude-Olivier Doron notes, insofar as lay experts exploit ambiguities constitutive of scientific discourse in population genetics as they transfer scientific findings to a White supremacist ideological framework, these findings operate within the framework without distortion.

Genome scientists have made recommendations about sampling protocols and standards for visualizations in population genomics to discourage the misappropriation of research by racists who draw conclusions inconsistent with the intentions of scientists (Carlson et al. 2022). However, these recommendations portray population genetic structure, unlike race, as wholly biological, and as science studies scholarship suggests, the challenge of accessing the biological without recourse to the social—and the interests, biases, and imaginaries associated with the social—may be impossible to overcome. Historians, sociologists, and anthropologists of science have insightfully documented how social, political, and cultural constructions of identity are incorporated in, and become defined by, genetics and genomics research in ways specific to their locations: reenactment of continentally-defined races as biogeographical ancestry in the U.S. (Fullwiley 2008; Gannett 2014); influence of population genetics on Irish origin stories and genealogy of the Irish Travellers (Nash 2008; Nash 2017); naturalization and even pathologization of caste and regional differences in India (Egorova 2010); geneticization/genomicization of Mestizo identity in the context of Mexico’s own genome mapping project (López Beltrán 2011); post-apartheid South Africa’s “genomic archive” bound to apartheid’s racialized subjectivities despite advancing nonracial unity through common origins (Schramm 2021); genetic ancestry testing as a basis for Jewishness (El-Haj 2012); Native American DNA as proof of tribal identity (TallBear 2013); and nationalism’s role in interpreting differences between the Korean Reference Genome (KOREF) and HGP’s genome reference sequence as occurring at the population rather than individual level (Kowal and Llamas 2019).

Use of “ancestry” as a category for genetics and genomics research is considered a means of averting problems associated with the use of “race” and “ethnicity”; for example, for the mapping of complex traits, rather than relying on self-identified race, it has been recommended that population structure be assessed empirically by genotyping individuals to determine their “continental ancestry” proportions (Shields et al. 2005). Critics contend, however, that “continental ancestry” belies the continuous pattern with which genetic variation is distributed across the species and reenacts race as it has been traditionally defined, thus contributing to its reification (Fullwiley 2008; Gannett 2014; Lewis et al. 2022). In 2023, the National Academy of Sciences published a Consensus Study Report, “Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field,” in response to a request by the National Institutes of Health to assess the status of use of race, ethnicity, ancestry, and other population descriptors. The report recommends against use of racial labels in genetics and genomics research, a possible exception being studies of health disparities with genomic data, in which race serves as a proxy for environmental variables (e.g., racism). A distinction is drawn between genetic ancestry (paths through which an individual’s DNA is inherited from specific ancestors, known as the “ancestral recombination graph”) and genetic similarity (a quantitative measure of genetic resemblance among individuals that reflects shared genetic ancestry). The report recognizes geographic origins, ethnicity, and genetic ancestry as appropriate categories for reconstructing human evolutionary history, but advocates use of genetic similarity to constitute groups in most other research contexts, including gene discovery for complex traits.

Relying on genetic similarity does not necessarily lead researchers away from race, ethnicity, and ancestry. In an ethnographic study, sociologists Joan H. Fujimura and Ramya Rajagopalan (2011) found that although the statistical machinery associated with GWAS allows researchers to avoid race and ethnic categories by analyzing samples based wholly on genetic similarity, there was “slippage” from genetic similarity to shared ancestry, which, in turn, since mediated by geography and genealogy, became interpreted as racial or ethnic. While ancestral recombination graphs situate individuals in the context of their genealogical relations without assigning them to geographically or culturally defined populations or groups, in practice, these categories are almost always incorporated in genetic ancestry estimation (Lewis et al. 2022). Population geneticist Graham Coop (2022) favours replacing genetic ancestry with genetic similarity, arguing that describing a sampled individual’s genetic similarity to a reference panel (e.g., “ X is genetically similar to the GBR 1000 Genome samples”) is preferable to attributing genetic ancestry to that individual (e.g., “ X has Northwestern European genetic ancestry”), as it recognizes the conventionalism of the reference panel and continuity of genetic variation (pp. 11–12). However, given that the 1000 Genomes Project’s 26 populations across five continental regions are named using language that reflects “both the ancestral geography or ethnicity of each population and the geographic location where the samples from that population were collected” (Coriell Institute)—e.g., “British from England and Scotland”—the slippage remarked upon by Fujimura and Rajagopalan is encouraged. Nevertheless, as sociologists Aaron Panofsky and Catherine Bliss (2017) observe, “Geneticists face a complex set of pressures regarding population labeling” (p. 75). Ambiguous labels for populations that conflate geography, race, and ethnicity may offer geneticists the flexibility to fulfill their own research goals while accessing repositories of preclassified biospecimens and data, collaborating with researchers pursuing quite different agendas, and maintaining goodwill with populations studied. These pressures compete with pressures about labels imposed by funders and journals, which may provoke skepticism and resistance.

Although ELSI may have had an inauspicious argued beginning in Watson’s apparent off-the-cuff remarks at a 1988 news conference, the research program has outlasted the HGP itself. ELSI funding is mandated through the National Institutes of Health Revitalization Act of 1993, which calls for a minimum of 5 percent of the NIH budget for the HGP—the monies directed to the NCHGR-NHGRI—to be set aside to study the ethical, legal, and social implications of the science of genomics (McEwen et al. 2014). The Division of Genomics and Society at the NIH’s NHGRI, created in 2012, maintains the Ethical, Legal and Social Implications (ELSI) Research Program as an extramural grant funding initiative to the present day; the division also includes an intramural bioethics program. Input concerning the ELSI program is provided by the NHGRI Genomics and Society Working Group through the National Advisory Council for Human Genome Research.

A review article by NIH staff (McEwen et al. 2014) characterizes the ELSI program as “an ongoing experiment.” Since it was established in 1990, the ELSI program has supported empirical and conceptual research carried out by researchers from a broad range of disciplines: “genetics and genomics, clinical medicine, bioethics, the social sciences (e.g., psychology, sociology, anthropology, political science, and communication science), history, philosophy, literature, law, economics, health services, and public policy” (p. 485). This research is considered to have had impacts on genomics studies (e.g., requirements for informed consent, protection of the privacy of subjects, and nomenclature for socially defined groups), genomic medicine (e.g., personal impacts of acquiring genetic information from screening and testing carried out in clinical, research, and direct-to-consumer settings), and wider society (e.g., federal legislation prohibiting genetic discrimination in health insurance and employment, increased awareness about DNA forensics, and policies on gene patenting). The experimental aspect remarked upon refers to the organizational and physical situation of ELSI, a program charged with critically evaluating the implications genomics research, within the very agency that funds that research. While this institutional arrangement supports the growing trend to integrate ELSI research with genomics research and policy formulation, ensuring that ELSI research is scientifically informed and practically relevant, excessive proximity also risks compromising “the autonomy, objectivity, and intellectual independence of ELSI investigators” (McEwen et al. 2014).

Since the early years of the HGP, bioethicists have criticized ELSI on various institutional grounds: for a lack of independence from scientist-overseers (Murray 1992; Yesley 2008), for an absence of structure conducive to providing guidance regarding policy (Hanna 1995), and for a negative impact on bioethics in narrowing the range of topics covered and creating an isolated subspecialty within the field (Annas and Elias 1992; Hanna et al. 1993). And from history and philosophy of science quarters, broader philosophical concerns have been raised about ELSI’s focus on the ethical, legal, and social implications of genetic research. Such a focus promotes a “downstream” rather than “upstream” framework for understanding the relationship between science and ethics that fails to appreciate that foundational concepts in genetic research such as normality and mutation are themselves evaluative and operate as directives to action (Limoges 1994). ELSI’s European counterpart, Ethical, Legal and Social Aspects or ELSA, chose to use the term “aspects” in order to avoid the connotations of narrowness, linearity, and determinism attached to the term “implications” (Hilgartner et al. 2016).

The “ongoing experiment” at the NIH’s NHGRI has come to define a particular model for doing research in bioethics that is being exported to other rapidly developing scientific fields, as expressed in the title of a target article published in AJOB Neuroscience , “To ELSI or Not to ELSI Neuroscience: Lessons for Neuroethics from the Human Genome Project” (Klein 2010). Indeed, although “ELSI” entered the lexicon as an acronym for the specific extramural research program supported by US government funding set aside for the HGP (elsewhere in the world, similar programs received their own appellations and acronyms—e.g., Genomics and its Ethical, Environmental, Economic, Legal, and Social Aspects or GE 3 LS in Canada), with that program offering a possible model for other emerging sciences, the term has come to receive a broader meaning that refers instead to a field of research, defined by its “research and scholarship content, rather than a particular set of funding sources” (Morrissey and Walker 2012, p. 52). Given the interest in exporting the ELSI model, its merits as a field of research, as bioethicists Clair Morrissey and Rebecca L. Walker argue (Morrissey and Walker 2012; Walker and Morrissey 2014), need to be examined. Investigating the content and methods of ELSI as a field of research, Morrissey and Walker combed through hundreds of articles and book chapters published between 2003 and 2008 (Morrissey and Walker 2012, Walker and Morrissey 2014). They found that funding sources influenced what research is carried out: though only 17 percent of all publications involved empirical research, for publications whose authors received US government funding, 30 percent of those with non-NHGRI support and 52 percent of those with NHGRI support were empirically based. They found that institutional and professional forces, irrespective of funding sources, promoted the coverage of topics of greatest interest to affluent populations (e.g., “genomics and clinical practice,” “intellectual property,” “genetic enhancement,” and “biorepositories”). They found that the vast majority (89 percent) of publications were prescriptive, recommending to diverse actors (scientists, clinicians, bioethicists, government, etc.) that certain policies or practices be pursued. Given this overwhelmingly prescriptive posture, they were dismayed to find that publications made use of multiple bioethical methods in piecemeal fashion with little depth. For the most part (77 percent), publications did not reflect on methods.

ELSI itself has become an object of research in science and technology studies (STS) scholarship. The ELSI model, as incorporated more recently in areas such as nano and synthetic biology, is understood as serving as “a new governance tool built on the prior institutionalization of ‘bioethics’ as a way to manage problems of moral ambiguity and disagreement in biomedicine” (Hilgartner et al. 2016, p. 824). At the outset of the HGP, ELSI relied on a governance model in which ethicists and social scientists lent their expertise by producing a body of scholarship that could inform public policy; subsequently, especially in Europe, social scientists were expected to facilitate mechanisms for making public policy in more democratic ways by engaging stakeholders and the broader public. Hilgartner et al. (2016) argue that STS scholarship sits somewhat uneasily alongside ELSI scholarship inasmuch as STS scholarship problematizes elements of the “traditional imaginaries of orderly science-society relations” to which ELSI subscribes, such as the fact/value distinction, the “neutrality” of science and technology, and “the self-evidence of power relations” (p. 832). Criticisms are also made that as a tool of governance, bioethics exercises power in ways often unseen, thereby foreclosing questions asked and debates had—for e.g., taking the boundary between facts and values as given not made, presenting rational moral arguments as outside politics to dismiss issues of public concern, or circumventing legislation by justifying the extension of existing regulations.

  • Alkuraya, Fowzan S., 2021, “A Genetic Revolution in Rare-Disease Medicine,” Nature 590 (11 Feb): 218–219.
  • Andreasen, Robin O., 2000, “Race: Biological Reality or Social Construct?” Philosophy of Science , 67: S653–S666.
  • Annas, G. J., and S. Elias, 1992, “Social Policy Research Priorities for the Human Genome Project,” in Gene Mapping: Using Law and Ethics as Guides , edited by G. J. Annas and S. Elias, 269–275, New York: Oxford University Press.
  • Anonymous, 2000, “Human Genome Projects: Work in Progress,” Nature , 405 (29 June): 981.
  • Anonymous, 2003, “International Consortium Completes Human Genome Project,” Genomics & Genetics Weekly (9 May): 32.
  • Arbour, Laura, and Doris Cook, 2006, “DNA on Loan: Issues to Consider when Carrying Out Genetic Research with Aboriginal Families and Communities,” Community Genetics 9: 153–160.
  • Bailey, Zinzi D., Nancy Krieger, Madina Agénor, Jasmine Graves, Natalia Linos, Mary T. Bassett, 2017, “Structural Racism and Health Inequities in the USA: Evidence and Interventions,” The Lancet , 389: 1453–63.
  • Ballouz, Sara, Alexander Dobin, and Jesse A. Gillis, 2019, “Is It Time To Change the Reference Genome?” Genome Biology , 20: 159 [9pp]. doi:10.1186/s13059-019-1774-4
  • Baylis, Françoise, 2019, Altered Inheritance: CRISPR and the Ethics of Human Genome Editing , Harvard University Press.
  • Beckwith, Jon, and Joseph S. Alper, 1998, “Reconsidering Genetic Antidiscrimination Legislation,” Journal of Law, Medicine & Ethics , 26: 205–210.
  • Birney, Ewan, and Nicole Soranzo, 2015, “The End of the Start for Population Sequencing,” Nature , 526 (30 Sep): 52–53.
  • Blattner, Frederick R. et al., 1997, “The Complete Genome Sequence of Escherichia coli K-12,” Science , 277 (5 Sep): 1453–1462.
  • Bliss, Catherine, 2018, Social by Nature: The Promise and Peril of Sociogenomics , Stanford University Press.
  • Bodmer, Walter, and Robin McKie, 1994, The Book of Man: The Quest to Discover Our Genetic Heritage , Toronto: Viking Press.
  • Bonham, Vence L., Eric D. Green, and Eliseo J. Pérez-Stable, 2018, “Examining How Race, Ethnicity, and Ancestry Data Are Used in Biomedical Research,” JAMA , 320: 1533–1534. doi:10.1001/jama.2018.13609
  • Bostrom, Nick, 2003, “Human Genetic Enhancements: A Transhumanist Perspective,” Journal of Value Inquiry , 37: 493–506.
  • Buniello, Annalisa, Jacqueline A.L. MacArthur, Maria Cerezo, Laura W. Harris, James Hayhurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sollis, Daniel Suveges, Olga Vrousgou, Patricia L. Whetzel, Ridwan Amode, Jose A. Guillen, Harpreet S. Riat, Stephen J. Trevanion, Peggy Hall, Heather Junkins, Paul Flicek, Tony Burdett, Lucia A. Hindorff, Fiona Cunningham, and Helen Parkinson, 2019, “The NHGRI-EBI GWAS Catalog of Published Genome-Wide Association Studies, Targeted Arrays and Summary Statistics 2019,” Nucleic Acids Research 47: D1005–D1012. doi:10.1093/nar/gky1120
  • Callahan, Daniel, 1998, “Cloning: Then and Now,” Cambridge Quarterly of Healthcare Ethics 7: 141–144.
  • Carlson, Jedidiah, 2022, “Spread This Like Wildfire!” Science for the People , [ available online ]
  • Carlson, Jedidiah, Brenna M. Henn, Dana R. Al-Hindi, and Sohini Ramachandran, 2022, “Counter the Weaponization of Genetics Research by Extremists,” Nature 610 (20 Oct): 444–447.
  • Caulfield, Timothy, 2018, “Spinning the Genome: Why Science Hype Matters,” Perspectives in Biology and Medicine , 61: 560–571. doi:10.1353/pbm.2018.0065
  • Chaisson, M.J.P., R.K. Wilson, and E.E. Eichler, 2015, “Genetic Variation and the De Novo Assembly of Human Genomes,” Nature Reviews Genetics , 16: 627–640.
  • Check Hayden, Erika, 2014, “Technology: The $1,000 Genome,” Nature , 507 (19 Mar): 294–295. doi:10.1038/507294a
  • Church, G.M., 2005, “The Personal Genome Project,” Molecular Systems Biology , 1: article no. 0030. doi:10.1038/msb4100040
  • Clinton, Bill, Tony Blair, Francis S. Collins, J. Craig Venter, 2000, “White House Remarks on Decoding of Genome”, transcript of June 26, 2000 news conference, New York Times , June 27, 2000. [ Clinton, et al. 2000 available online ]
  • Collins, Francis S., 1999, “Medical and Societal Consequences of the Human Genome Project,” New England Journal of Medicine , 341: 28–37.
  • Collins, Francis and David Galas, 1993, “A New Five-Year Plan for the U.S. Human Genome Project,” Science , 262 (1 Oct): 43–46.
  • Collins, Francis S., Ari Patrinos, Elke Jordan, Aravinda Chakravarti, Raymond Gesteland, LeRoy Walters, and the members of the DOE and NIH planning groups, 1998, “New Goals for the U.S. Human Genome Project: 1998–2003,” Science , 282 (23 Oct): 682–689.
  • Collins, Francis S., Eric D. Green, Alan E. Guttmacher, and Mark S. Guyer. 2003. “A Vision for the Future of Genomics Research: A Blueprint for the Genomic Era,” Nature , 422 (24 April): 1–13.
  • Cook-Deegan, Robert, 1994, The Gene Wars: Science, Politics, and the Human Genome , New York: W. W. Norton.
  • Coop, Graham, 2022, “Genetic Similarity versus Genetic Ancestry Groups as Sample Descriptors in Human Genetics,” [ available online ]
  • Cooper, Necia Grant, 1994, The Human Genome Project: Deciphering the Blueprint of Heredity , Mill Valley, CA: University Science Books.
  • Coriell Institute for Medical Research, n.d., “Guidelines for Referring to Populations,” [ available online ]
  • Cranor, Carl F. (ed.), 1994, Are Genes Us? : The Social Consequences of the New Genetics , New Brunswick, NJ: Rutgers University Press.
  • Davis, Bernard D. and Colleagues, 1990, “The Human Genome and Other Initiatives,” Science 249 (27 July): 342–343.
  • Deloukas, P., et al., 1998, “A Physical Map of 30,000 Human Genes,” Science , 282 (23 Oct): 744–746.
  • Department of Trade and Industry (U.K.), 2003, “Heads of Government Congratulate Scientists on Completion of Human Genome Project,” Hermes Database (12 April); LexisNexis Academic.
  • Dib, Colette et al., 1996, “A Comprehensive Genetic Map of the Human Genome Based on 5,264 Microsatellites,” Nature , 380 (14 Mar): 152–154.
  • Dickson, David, 1998, “British Funding Boost is Wellcome News,” Nature , 393 (21 May): 201.
  • Dizikes, Peter, 2007, “Gene Information Opens New Frontier in Privacy Debate,” Boston Globe , 24 September 2007.
  • Doron, Claude-Olivier, in press, “Who is the Most European of Us All? Occidentalism, White Supremacy and the Counter-Knowledge on Race and Genetics,” in Ordering People, Naming Populations: Critical Perspective on Biological Diversity and the Classification Concepts in the Life Sciences , edited by N. Ellebrecht, T. Plümeke, V. Lipphardt, J. Reardon.
  • Doudna, Jennifer A., and Samuel H. Sternberg, 2017, A Crack in Creation: Gene Editing and the Unthinkable Power to Control Evolution , Boston: Houghton Mifflin Harcourt.
  • Dulbecco, Renato, 1986, “A Turning Point in Cancer Research: Sequencing the Human Genome,” Science , 231 (7 Mar): 1055–1056.
  • Dunham, I. et al., 1999, “The DNA Sequence of Human Chromosome 22,” Nature , 402 (2 Dec): 489–495.
  • Duster, Troy, 2015, “A Post-genomic Surprise: The Molecular Reinscription of Race in Science, Law and Medicine,” The British Journal of Sociology , 66: 1–27. doi:10.1111/1468-4446.12118
  • Egorova, Yulia, 2010, “Castes of Genes? Representing Human Genetic Diversity in India,” Genomics, Society and Policy , 6(3): 32–49.
  • El-Haj, Nadia Abu, 2012, The Genealogical Science: The Search for Jewish Origins and the Politics of Epistemology , University of Chicago Press.
  • Epstein, Steven, 2007, Inclusion: The Politics of Difference in Medical Research , University of Chicago Press.
  • Erlich, Yaniv, Tal Shor, Itsik Pe’er, and Shai Carmi, 2018, “Identity Inference of Genomic Data Using Long-range Familial Searches,” Science , 362 (9 Nov): 690–694.
  • Evans, James P., 2010, “The Human Genome Project at 10 Years: A Teachable Moment,” Genetics in Medicine , 12: 477. doi:10.1097/GIM.0b013e3181ef16b6
  • Ferryman, Kadija, and Mikaela Pitcan, 2018, “What Is Precision Medicine,” Data & Society , Report, February 26, 2018. [ Ferryman & Pitcan 2018 available online ]
  • Fleischmann, Robert D. et al., 1995, “Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd,” Science , 269 (28 Jul): 496–512.
  • Fox, Keolu, 2020, “The Illusion of Inclusion—The ‘All of Us’ Research Program and Indigenous Peoples’ DNA,” New England Journal of Medicine , 383: 411–414. doi:10.1056/NEJMp1915987
  • Fujimura, Joan H., and Ramya Rajagopalan, 2011, “Different Differences: The Use of ‘Genetic Ancestry’ versus Race in Biomedical Human Genetic Research,” Social Studies of Science 41: 5–30. doi:10.1177/0306312710379170
  • Fukuyama, Francis, 2002, Our Posthuman Future: Consequences of the Biotechnology Revolution , New York: Picador.
  • Fullwiley, Duana, 2008, “The Biologistical Construction of Race: ‘Admixture’ Technology and the New Genetic Medicine,” Social Studies of Science , 38: 695–735. doi:10.1177/0306312708090796.
  • Gannett, Lisa, 1998, “Genetic Variation: Difference, Deviation, or Deviance?” Ph.D. Dissertation, University of Western Ontario, Gannett 1998 available online .
  • –––, 1999, “What’s in a Cause? The Pragmatic Dimensions of Genetic Explanations,” Biology and Philosophy , 14: 349–374.
  • –––, 2003, “The Normal Genome in Twentieth-Century Evolutionary Thought,” Studies in History and Philosophy of Biological and Biomedical Sciences 34: 143–185.
  • –––, 2005, “Group Categories in Pharmacogenetics Research,” Philosophy of Science 72: 1232–1247.
  • –––, 2010, “Questions Asked and Unasked: How by Worrying Less about the ‘ Really Real’ Philosophers of Science Might Better Contribute to Debates about Genetics and Race,” Synthese , 177: 363–385.
  • –––, 2014, “Biogeographical Ancestry and Race,” Studies in History and Philosophy of Biological and Biomedical Sciences , 47: 173–184. doi:10.1016/j.shpsc.2014.05.017
  • Garrison, Nanibaa’ A., Kyle B. Brothers, Aaron J. Goldenberg, and John A. Lynch, 2019a, “Genomic Contextualism: Shifting the Rhetoric of Genetic Exceptionalism,” The American Journal of Bioethics , 19: 51–63. doi:10.1080/15265161.2018.1544304
  • Garrison, Nanibaa’ A., Māui Hudson, Leah L. Ballantyne, Ibrahim Garba, Andrew Martinez, Maile Taualii, Laura Arbour, Nadine R. Caron, and Stephanie Carroll Rainie, 2019b, “Genomic Research Through an Indigenous Lens: Understanding the Expectations,” Annual Review of Genomics and Human Genetics 20: 495–517. doi:10.1146/annurev-genom-083118-015434
  • Gates, Alexander J., Deisy Morselli Gysi, Manolis Kellis, and Albert-László Barabási, 2021, “A Wealth of Discovery Built on the Human Genome Project — By the Numbers,” Nature , 590 (11 Feb): 212–215.
  • Gibbs, Richard A., 2020, “The Human Genome Project Changed Everything,” Nature Reviews Genetics , 21: 575–576. doi: 10.1038/s41576-020-0275-3
  • Gilbert, Walter, 1992, “A Vision of the Grail,” in The Code of Codes: Scientific and Social Issues in the Human Genome Project , edited by Daniel J. Kevles and Leroy Hood, 83–97, Cambridge, MA and London: Harvard University Press.
  • Gisler, Monika, Didier Sornette, and Ryan Woodard, 2010, “Exuberant Innovation: The Human Genome Project,” Swiss Finance Institute Research Paper No. 10–12. doi:10.2139/ssrn.1573682
  • Godfrey-Smith, Peter, 2000, “On the Theoretical Role of ‘Genetic Coding,’” Philosophy of Science , 67: 26–44.
  • Goffeau, A. et al., 1996, “Life with 6000 Genes,” Science , 274 (25 Oct): 546–567.
  • Gostin, Larry, 1991, “Genetic Discrimination: The Use of Genetically Based Diagnostic and Prognostic Tests by Employers and Insurers,” American Journal of Law and Medicine , 17(1–2): 109–144.
  • Green, Eric D., Mark S. Guyer, and National Human Genome Research Institute, 2011, “Charting a Course for Genomic Medicine from Base Pairs to Bedside,” Nature , 470 (10 Feb): 204–213. doi:10.1038/nature09764
  • Green, Philip, 1997, “Against a Whole-Genome Shotgun,” Genome Research , 7: 410–417.
  • –––, 2002, “Whole-Genome Disassembly,” Proceedings of the National Academy of Sciences , 99: 4143–4144.
  • Green, Richard E., et al., 2010, “A Draft Sequence of the Neandertal Genome,” Science , 328 (7 May): 710–722. doi: 10.1126/science.1188021
  • Green, Robert C., Denise Lautenbach, and Amy L. McGuire, 2015, “GINA, Genetic Discrimination, and Genomic Medicine,” The New England Journal of Medicine 372: 397–399. doi:10.1056/NEJMp1404776
  • Griesemer, James R., 1994, “Tools for Talking: Human Nature, Weismannism, and the Interpretation of Genetic Information,” in Cranor (ed.) 1994, 69–88.
  • Griffiths, Paul E., 2001, “Genetic Information: A Metaphor in Search of a Theory,” Philosophy of Science , 68: 394–412.
  • Griffiths, P.E. and R.D. Gray, 1994, “Developmental Systems and Evolutionary Explanation,” Journal of Philosophy , 91: 277–304.
  • Griffiths, Paul E. and Robin D. Knight, 1998, “What Is the Developmentalist Challenge?” Philosophy of Science , 65: 253–258.
  • Gyapay, Gabor et al., 1994, “The 1993–94 Généthon Human Genetic Linkage Map,” Nature Genetics , 7: 246–339.
  • Gymrek, Melissa, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich, 2013, “Identifying Personal Genomes by Surname Inference,” Science , 339 (18 Jan): 321–324. doi:10.1126/science.1229566
  • Habermas, Jürgen, 2003, The Future of Human Nature , Polity Press.
  • Haga, Susanne B., and J. Craig Venter (2003), “FDA Races in Wrong Direction,” Science , 301 (25 Jul): 466. doi:10.1126/science.1087004
  • Hanna, K. E., 1995, “The Ethical, Legal, and Social Implications Program of the National Center for Human Genome Research: A Missed Opportunity?” in Society’s Choices: Social and Ethical Decision Making in Biomedicine , edited by R. E. Bulger, E. M. Bobby, and H. V. Fineberg, 432–457, Washington, DC: National Academy Press.
  • Hanna, K. E., R. M. Cook-Deegan, and R. Y. Nishimi. 1993, “Finding a Forum for Bioethics in U.S. Public Policy,” Politics and the Life Sciences: The Journal of the Association for Politics and the Life Sciences , 12: 205–219.
  • Hardimon, Michael O., 2013, “Race Concepts in Medicine,” Journal of Medicine and Philosophy , 38: 6–31.
  • Harmon, Amy, 2018, “Why White Supremacists Are Chugging Milk (and Why Geneticists Are Alarmed),” New York Times , 17 October 2018. [ available online ]
  • Hattori, M., et al., 2000, “The DNA Sequence of Human Chromosome 21,” Nature , 405 (18 May): 311–319.
  • Hilgartner Stephen, Barbara Prainsack, and J. Benjamin Hurlbut, 2016, “Ethics as Governance in Genomics and Beyond,” in Handbook of Science and Technology Studies , edited by Ulrike Felt, Rayvon Fouche, Clark A. Miller, and Laurel Smith-Doerr, Cambridge, MA: MIT Press.
  • Hindorff, Lucia A., Vence L. Bonham, Jr, Lawrence C. Brody, Margaret E. C. Ginoza, Carolyn M. Hutter, Teri A. Manolio, and Eric D. Green, 2018, Nature Reviews Genetics 19: 175–185. doi:10.1038/nrg.2017.89.
  • Hochman, Adam, 2013, “Against the New Racial Naturalism,” The Journal of Philosophy , 110: 331–351.
  • Hood, Leroy, 1992, “Biology and Medicine in the Twenty-First Century,” in The Code of Codes , 136–163.
  • –––, 2003, “Systems Biology: Integrating Technology, Biology, and Computation,” Mechanisms of Ageing and Development , 124: 9–16. doi:10.1016/S0047-6374(02)00164-1.
  • Hood, Leroy, and Lee Rowen. “The Human Genome Project: Big Science Transforms Biology and Medicine,” Genome Medicine , 5: 79 (8pp). doi:10.1186/gm483
  • Hubbard, Ruth, 1994, “Constructs of Genetic Difference: Race and Sex,” in Genes and Human Self-Knowledge: Historical and Philosophical Reflections on Modern Genetics , edited by Robert F. Weir, Susan C. Lawrence, and Evan Fales, Iowa City, IA: University of Iowa Press.
  • Hubbard, Ruth and Elijah Wald, 1993, Exploding the Gene Myth: How Genetic Information is Produced and Manipulated by Scientists, Physicians, Employers, Insurance Companies, Educators, and Law Enforcers , Boston: Beacon Press.
  • Hudson, Māui, Nanibaa’ A. Garrison, Rogena Sterling, Nadine R. Caron, Keolu Fox, Joseph Yracheta, Jane Anderson, Phil Wilcox, Laura Arbour, Alex Brown, Maile Taualii, Tahu Kukutai, Rodney Haring, Ben Te Aika, Gareth S. Baynam, Peter K. Dearden, David Chagné, Ripan S. Malhi, Ibrahim Garba, Nicki Tiffin, Deborah Bolnick, Matthew Stott, Anna K. Rolleston, Leah L. Ballantyne, Ray Lovett, Dominique David-Chavez, Andrew Martinez, Andrew Sporle, Maggie Walter, Jeff Reading, and Stephanie Russo Carroll, 2020, “Rights, Interests and Expectations: Indigenous Perspectives on Unrestricted Access to Genomic Data,” Nature Reviews Genetics 21: 377–384.
  • Hudson, Thomas J., et al., 1995, “An STS-Based Map of the Human Genome,” Science (22 Dec): 1945–1954. International Human Genome Sequencing Consortium, 2001, “Initial Sequencing and Analysis of the Human Genome,” Nature , 409 (15 Feb): 860–921.
  • Jabloner, Anna, 2019, “A Tale of Two Molecular Californias,” Science as Culture , 28: 1–24. doi: 10.1080/09505431.2018.1524863
  • Jones, Kathryn Maxson, Rachel A. Ankeny, and Robert Cook-Deegan, 2018, “The Bermuda Triangle: The Pragmatics, Policies, and Principles for Data Sharing in the History of the Human Genome Project,” Journal of the History of Biology , 51: 693–805.
  • Joyner, Michael J., and Nigel Paneth, 2019, “Promises, Promises, and Precision Medicine,” The Journal of Clinical Investigation , 129: 946–948. doi:10.1172/JCI126119
  • Juengst, Eric, Michelle L. McGowan, Jennifer R. Fishman, and Richard A. Settersten, Jr., 2016, “From ‘Personalized’ to ‘Precision’ Medicine: The Ethical and Social Implications of Rhetorical Reform in Genomic Medicine,” Hastings Center Report , 46: 21–33. doi:10.1002/hast.614.
  • Kalia, Sarah S., Kathy Adelman, Sherri J. Bale SJ, Wendy K. Chung, Christine Eng, James P. Evans, Gail E. Herman, Sophia B. Hufnagel, Teri E. Klein, Bruce R. Korf, Kent D. McKelvey, Kelly E. Ormond, C. Sue Richards, Christopher N. Vlangos, Michael Watson, Christa L. Martin, and David T. Miller, 2017, “Recommendations for Reporting of Secondary Findings in Clinical Exome and Genome Sequencing, 2016 Update (ACMG SF v2.0): A Policy Statement of the American College of Medical Genetics and Genomics,” Genetics in Medicine , 19: 249–255. doi:10.1038/gim.2016.190
  • Kaplan, Jonathan M., 2010, “When Socially Determined Categories Make Biological Realities: Understanding Black/White Health Disparities in the U.S.,” Monist , 93: 281–297.
  • Kass, Nancy E., 1997, “The Implications of Genetic Testing for Health and Life Insurance,” in Genetic Secrets: Protecting Privacy and Confidentiality in the Genetic Era , edited by Mark A. Rothstein, 299–316, New Haven and London: Yale University Press.
  • Kaye, Alice M., and Wyeth W. Wasserman, 2021, “The Genome Atlas: Navigating a New Era of Reference Genomes,” Trends in Genetics , in press. doi:10.1016/j.tig.2020.12.002
  • Keller, Evelyn Fox, 1992. “Nature, Nurture, and the Human Genome Project.” In Code of Codes , 281–299.
  • –––, 1994, “Master Molecules,” in Cranor (ed.) 1994, 89–98.
  • –––, 2000, The Century of the Gene , Cambridge, MA and London: Harvard University Press.
  • Kevles, Daniel J., 1992, “Out of Eugenics: The Historical Politics of the Human Genome,” in Code of Codes , 3–36.
  • Khan, Razib, and David Mittelman, 2018, “Consumer Genomics Will Change Your Life, Whether You Get Tested or Not,” Genome Biology , 19: 120–123. doi:10.1186/s13059-018-1506-1
  • Kitcher, Philip, 1996, The Lives to Come: The Genetic Revolution and Human Possibilities , New York: Simon & Schuster.
  • Klein, Eran, 2010, “To ELSI or Not to ELSI Neuroscience: Lessons for Neuroethics from the Human Genome Project,” AJOB Neuroscience , 1 (4): 3–8. doi:10.1080/21507740.2010.510821
  • Koshland, Daniel E. Jr., 1989, “Sequences and Consequences of the Human Genome,” Science , 246 (13 Oct): 189.
  • Kowal, Emma, and Bastien Llamas, 2019, “Race in a Genome: Long Read Sequencing, Ethnicity-Specific Reference Genomes and the Shifting Horizon of Race,” Journal of Anthropological Sciences , 97: 91–106. doi: 10.4436/jass.97004
  • Kronfeldner, Maria E., 2009, “Genetic Determinism and the Innate-Acquired Distinction in Medicine,” Medicine Studies , 1: 167–181. doi:10.1007/s12376-009-0014-8
  • Kukutai, Tahu, and John Taylor, 2016, “Data Sovereignty for Indigenous Peoples: Current Practice and Future Needs,” in Indigenous Data Sovereignty: Toward an Agenda , edited by Tahu Kukutai and John Taylor, 1–22, Australia National University Press. [ Kukutai and Taylor 2016 available online ]
  • Lappé, Marc A., 1994, “Justice and the Limitations of Genetic Knowledge,” in Justice and the Human Genome Project , 153–168.
  • Lee, Carol, 1993, “Creating A Genetic Underclass: The Potential for Genetic Discrimination by the Health Insurance Industry,” Pace Law Review , 13: 189–228.
  • Lee, Thomas F., 1991, The Human Genome Project: Cracking the Genetic Code of Life , New York: Plenum Press.
  • Levy, Samuel et al., 2007, “The Diploid Genome Sequence of an Individual Human,” PLoS Biology , 5: 2113–2144.
  • Lewis, Anna C. F., Santiago J. Molina, Paul S. Appelbaum, Bege Dauda, Anna Di Rienzo, Agustin Fuentes, Stephanie M. Fullerton, Nanibaa’ A. Garrison, Nayanika Ghosh, Evelynn M. Hammonds, David S. Jones, Eimear E. Kenny, Peter Kraft, Sandra S.-J. Lee, Madelyn Mauro, John Novembre, Aaron Panofsky, Mashaal Sohail, Benjamin M. Neale, and Danielle S. Allen, 2022, “Getting Genetic Ancestry Right for Science and Society,” Science 376 (15 Apr): 250–252.
  • Lewontin, R. C., 1974, “The Analysis of Variance and the Analysis of Causes,” American Journal of Human Genetics , 26: 400–411.
  • –––, 2000, It Ain’t Necessarily So: The Dream of the Human Genome and Other Illusions , New York: New York Review of Books; chapter 5, “The Dream of the Human Genome,” was originally published on May 28, 1992 in The New York Review of Books .
  • Limoges, Camille, 1994, “ Errare Humanum Est : Do Genetic Errors Have a Future?” in Cranor (ed.) 1994, 113–124.
  • Lippman, Abby. 1991. “Prenatal Genetic Testing and Screening: Constructing Needs and Reinforcing Inequities.” American Journal of Law and Medicine , 42: 15–50.
  • Lloyd, Elisabeth A., 1994, “Normality and Variation: The Human Genome Project and the Ideal Human Type,” in Cranor (ed.) 1994, 99–112.
  • López Beltrán, Carlos, 2011, ed., Genes (&) Mestizos: Genómica y Raza en la Biomedicina Mexicana , Mexico: Ficticia.
  • Lunshof, Jeantine, Ruth Chadwick, Daniel B. Vorhaus, and George M. Church, 2008, “From Genetic Privacy to Open Consent,” Nature Reviews Genetics , 9: 406–411. doi: 10.1038/nrg2360
  • Maher, Brendan, 2008, “The Case of the Missing Heritability,” Nature , 456 (6 Nov): 18–21.
  • Mao, Yafei, Claudia R. Catacchio, LaDeana W. Hillier, et al., 2021, “A High-Quality Bonobo Genome Refines the Analysis of Hominid Evolution,” Nature , 594: 77–81. doi:10.1038/s41586-021-03519-x
  • Marshall, Eliot, 1996a, “Whose Genome Is It, Anyway?” Science , 273 (27 Sept): 1788–1789.
  • –––, 1996b, “The Genome Project’s Conscience,” Science , 274 (25 Oct): 488–490.
  • McEwen, Jean E., Joy T. Boyer, Kathie Y. Sun, Karen R. Rothenberg, Nicole C. Lockhart, and Mark S. Guyer, 2014, “The Ethical, Legal, and Social Implications Program of the National Human Genome Research Institute: Reflections on an Ongoing Experiment,” Annual Review of Genomics and Human Genetics 15: 481–505. doi: 10.1146/annurev-genom-090413-025327
  • McKie, Robin, 2002, “I’m the Human Genome, says ‘Darth Venter’ of Genetics,” Observer , 28 April 2002.
  • McKusick, Victor A., 1989, “Mapping and Sequencing the Human Genome,” The New England Journal of Medicine , 320 (6 Apr): 910–915.
  • Meyer, Roberta A., 2004, “The Insurer Perspective,” in Genetics and Life Insurance: Medical Underwriting and Social Policy , edited by Mark A. Rothstein, 27–47, Cambridge and London: MIT Press.
  • Miga, Karen H., 2021, “Bridging the Gaps,” Nature , 590 (11 Feb): 217–218.
  • Miga, Karen H., and Ting Wang, 2021, “The Need for a Human Pangenome Reference Sequence,” Annual Review of Genomics and Human Genetics , 22: 11.1–11.22. doi: 10.1146/annurev-genom-120120-081921
  • Mills, Melinda C., and Charles Rahal, 2019, “A Scientometric Review of Genome-Wide Association Studies,” Communications Biology 2: 9. doi:10.1038/s42003-018-0261-x
  • Moreau, Yves, 2019, “Crack Down on Genomic Surveillance,” Nature , 576 (5 Dec): 36–38.
  • Morrissey, Clair, and Rebecca L. Walker, 2012, “Funding and Forums for ELSI Research: Who (or What) Is Setting the Agenda?” AJOB Primary Research , 3(3): 51–60. doi: 10.1080/21507716.2012.678550
  • Murphy, Timothy F., 1994, “The Genome Project and the Meaning of Difference,” in Justice and the Human Genome Project , 1–13.
  • Murray, T. H., 1992, “Speaking Unsmooth Things about the Human Genome Project,” in Gene Mapping: Using Law and Ethics as Guides , edited by G. J. Annas and S. Elias, 246–254, New York: Oxford University Press.
  • –––, 1997, “Genetic Exceptionalism and ‘Future Diaries’: Is Genetic Information Different from Other Medical Information?” in Genetic Secrets , 60–73.
  • Myers, Eugene M., Granger G. Sutton, Hamilton O. Smith, Mark D. Adams, and J. Craig Venter, 2002, “On the Sequencing and Assembly of the Human Genome,” Proceedings of the National Academy of Sciences , 99: 4145–4146.
  • Nash, Catherine, 2008, Of Irish Descent: Origin Stories, Genealogy, and the Politics of Belonging , Syracuse University Press.
  • –––, 2017, “The Politics of Genealogical Incorporation: Ethnic Difference, Genetic Relatedness and National Belonging,” Ethnic and Racial Studies , 40: 2539–2557. doi: 10.1080/01419870.2016.1242763
  • National Academies of Sciences, Engineering, and Medicine, 2023, “Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field,” Washington, DC: The National Academies Press. doi:10.17226/26902
  • National Human Genome Research Institute, 2003, “International Consortium Completes Human Genome Project” (14 April). [ available online ]
  • National Research Council Committee on Mapping and Sequencing the Human Genome, 1988, Mapping and Sequencing the Human Genome , Washington, D.C.: National Academy Press.
  • National Research Council Committee on A Framework for Developing a New Taxonomy of Disease, 2011, Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease , Washington, DC: The National Academies Press. doi:10.17226/13284
  • Nelkin, Dorothy, 1992, “The Social Power of Genetic Information,” in Code of Codes , pp. 177–190.
  • Nelkin, Dorothy, and Laurence Tancredi, 1989, Dangerous Diagnostics: The Social Power of Biological Information , New York: Basic Books.
  • Office of Technology Assessment, 1988, Mapping Our Genes: Genome Projects: How Big, How Fast? Baltimore and London: Johns Hopkins University Press.
  • Olson, Maynard V., 2011, “What Does a ‘Normal’ Human Genome Look Like?” Science , 331 (18 Feb): 872.
  • O’Neill, Onora, 2001, “Informed Consent and Genetic Information,” Studies in History and Philosophy of Biological and Biomedical Sciences 32: 689–704.
  • Oyama, Susan, 1985, The Ontogeny of Information: Developmental Systems and Evolution , Cambridge: Cambridge University Press.
  • Oyama, Susan, Paul E. Griffiths, and Russell Gray, editors, 2001, Cycles of Contingency: Developmental Systems and Evolution , Cambridge, MA: MIT Press.
  • Panofsky, Aaron, and Catherine Bliss, 2017, “Ambiguity and Scientific Authority: Population Classification in Genomic Science,” American Sociological Review 82: 59–87. doi:10.1177/0003122416685812
  • Panofsky, Aaron, and Joan Donovan, 2019, “Genetic Ancestry Testing among White Nationalists: From Identity Repair to Citizen Science,” Social Studies of Science 49: 653–681. doi:10.1177/0306312719861434
  • Panofsky, Aaron, Kushan Dasgupta, and Nicole Iturriaga, 2021, “How White Nationalists Mobilize Genetics: From Genetic Ancestry and Human Biodiversity to Counterscience and Metapolitics,” American Journal of Physical Anthropology 175: 387–398. doi:10.1002/ajpa.24150
  • Paul, Diane B., 1994, “Eugenic Anxieties, Social Realities, and Political Choices,” in Cranor (ed.) 1994, 142–154.
  • Pennisi, Elizabeth, 1999, “Academic Sequencers Challenge Celera in a Sprint to the Finish,” Science , (18 Mar): 1822–1823.
  • –––, 2000, “Finally, the Book of Life and Instructions for Navigating It,” Science , 288 (30 Jun): 2304–2307.
  • Phillips, Andelka M., 2016, “Only a Click Away—DTC Genetics for Ancestry, Health, Love…and More: A View of the Business and Regulatory Landscape,” Applied & Translational Genomics , 8: 16–22. doi:10.1016/j.atg.2016.01.001
  • Pigliucci, Massimo, and Jonathan Kaplan, 2003, “On the Concept of Biological Race and Its Applicability to Humans,” Philosophy of Science , 70: 1161–1172.
  • Plutynski, Anya, 2018, Explaining Cancer , Oxford University Press.
  • Pokorski, Robert J., 1994, “Use of Genetic Information by Private Insurers,” in Justice and the Human Genome Project , 91–109.
  • Popejoy, Alice B., and Stephanie M. Fullerton, 2016, “Genomics Is Failing on Diversity,” Nature , 538 (12 Oct): 161–164. doi: 10.1038/538161a
  • Proctor, Robert N., 1992, “Genomics and Eugenics: How Fair Is the Comparison?” in Gene Mapping: Using Law and Ethics as Guides , edited by George J. Annas and Sherman Elias, 57–93, New York and Oxford: Oxford University Press.
  • Reardon, Jenny, 2004, Race to the Finish: Identity and Governance in an Age of Genomics , Princeton: Princeton University Press.
  • –––, 2017, The Postgenomic Condition: Ethics, Justice, and Knowledge after the Genome , University of Chicago Press.
  • Reich, David, Michael A. Nalls, W.H. Linda Kao, Ermeg L. Akylbekova, Arti Tandon, Nick Patterson, James Mullikin, Wen-Chi Hsueh, Ching-Yu Cheng, Josef Coresh, Eric Boerwinkle, Man Li, Alicja Waliszewska, Julie Neubauer, Rongling Li, Tennille S. Leak, Lynette Ekunwe, Joe C. Files, Cheryl L. Hardy, Joseph M. Zmuda, Herman A. Taylor, Elad Ziv, Tamara B. Harris, James G. Wilson, 2009, “Reduced Neutrophil Count in People of African Descent Is Due to a Regulatory Variant in the Duffy Antigen Receptor for Chemokines Gene,” PLOS Genetics , 5 (1): 1–14. doi:10.1371/journal.pgen.1000360
  • Ridley, R.M., C.D. Frith, L.A. Farrer, and P.M. Conneally, 1991, “Patterns of Inheritance of the Symptoms of Huntington’s Disease Suggestive of an Effect of Genomic Imprinting,” Journal of Medical Genetics , 28: 224–231.
  • Robert, Jason Scott, 2004, Embryology, Epigenesis, and Evolution , Cambridge: Cambridge University Press.
  • Robert, Jason Scott, and Françoise Baylis, 2003, “Crossing Species Boundaries,” American Journal of Bioethics , 3(3): 1–13.
  • Roberts, Dorothy E., 2021, “End the Entanglement of Race and Genetics,” Science , 371 (5 Feb): 566.
  • Root, Michael, 2003, “The Use of Race in Medicine as a Proxy for Genetic Differences,” Philosophy of Science , 70: 1173–1183.
  • Rothstein, Mark A., 2005, “Genetic Exceptionalism and Legislative Pragmatism,” Hastings Center Report , 35(4): 27–33.
  • Salzberg, Steven L., 2018, “Open Questions: How Many Genes Do We Have?” BMC Biology , 16: 94–96.
  • Sankar, Pamela, and Mildred K. Cho, 2002, “Toward a New Vocabulary of Human Genetic Variation,” Science , 298 (15 Nov): 1337–1338. doi:10.1126/science.1074447
  • Sankar, Pamela, Mildred K. Cho, and Joanna Mountain, 2007, “Race and Ethnicity in Genetic Research,” American Journal of Medical Genetics , Part A, 143A: 961–970. doi:10.1002/ajmg.a.31575
  • Sarkar, Sahotra, 1996, “Biological Information: A Sceptical Look at Some Central Dogmas of Molecular Biology,” in The Philosophy and History of Molecular Biology: New Perspectives , edited by Sahotra Sarkar, 187–232, Dordrecht: Kluwer.
  • –––, 1998, Genetics and Reductionism , Cambridge: Cambridge University Press.
  • Sarkar, Sahotra and Alfred I. Tauber, 1991, “Fallacious Claims for the HGP,” Nature , 353 (24 Oct): 691.
  • Schneider, Valerie A., et al., 2017, “Evaluation of GRCh38 and De Novo Haploid Genome Assemblies Demonstrates the Enduring Quality of the Reference Assembly,” Genome Research , 27: 849–864.
  • Schramm, Katharina, 2021, “Race, Genealogy, and the Genomic Archive in Post-apartheid South Africa,” Social Analysis: The International Journal of Anthropology 65(4): 49–69.
  • Sheridan, Cormac, 2014, “Illumina Claims $1,000 Genome Win,” Nature Biotechnology , 32: 115. doi:10.1038/nbt0214-115a
  • Sherman, Rachel M. and Steven L. Salzberg, 2020, “Pan-genomics in the Human Genome,” Nature Reviews Genetics , 21: 243–254. doi:10.1038/s41576-020-0210-7
  • Shi, Xinghua, and Xintao Wu, 2017, “An Overview of Human Genetic Privacy,” Annals of the New York Academy of Sciences , 1387: 61–72. doi:10.1111/nyas.13211
  • Singer, Natasha, 2018, “Employees Jump at Genetic Testing. Is That a Good Thing?” New York Times , 15 April 2018. [ Singer 2018 available online ]
  • Sirugo, Giorgio, Scott M. Williams, and Sarah A. Tishkoff, 2019, “The Missing Diversity in Human Genetic Studies,” Cell , 177: 26–31. doi:10.1016/j.cell.2019.02.048
  • Spencer, Quayshawn, 2014, “A Radical Solution to the Race Problem,” Philosophy of Science , 81: 1025–1038.
  • Stevens, Hallam, 2013, Life Out of Sequence: A Data-Driven History of Bioinformatics , University of Chicago Press. doi: 10.7208/9780226080345
  • –––, “Algorithmic Biology Unleashed,” Science , 371 (5 Feb): 565–566.
  • Swinbanks, David, 1991, “Japan’s Human Genome Project Takes Shape,” Nature , 351 (20 Jun): 593.
  • Tabery, James, 2015, “Why Is Studying the Genetics of Intelligence So Controversial?” Hastings Center Report , 45(5): S9–S14. doi: 10.1002/hast.492
  • –––, 2023, Tyranny of the Gene: Personalized Medicine and Its Threat to Public Health , Penguin Random House.
  • TallBear, Kimberly, 2013, Native American DNA: Tribal Belonging and the False Promise of Genetic Science , University of Minnesota Press.
  • The C. elegans Sequencing Consortium, 1998, “Genome Sequence of the Nematode C. elegans : A Platform for Investigating Biology,” Science , 282 (11 Dec): 2012–2018.
  • The Chimpanzee Sequencing and Analysis Consortium, 2005, “Initial Sequence of the Chimpanzee Genome and Comparison with the Human Genome,” Nature , 437 (1 Sep): 69–87, doi.org: 10.1038/nature04072.
  • The International HapMap Consortium, 2005, “A Haplotype Map of the Human Genome,” Nature , 437 (27 Oct): 1299–1320. doi:10.1038/nature04226
  • Tsosie, Krystal S., Joseph M. Yracheta, and Donna Dickenson, 2019, “Overvaluing Individual Consent Ignores Risks to Tribal Participants,” Nature Reviews Genetics 20: 497–498.
  • Ugalmugle, Sumant, and Rupali Swain, 2020, “Direct-To-Consumer (DTC) Genetic Testing Market Size By Test Type (Carrier Testing, Predictive Testing, Ancestry & Relationship Testing, Nutrigenomics Testing), By Distribution Channel (Online Platforms, Over-the-Counter), By Technology (Targeted Analysis, Single Nucleotide Polymorphism (SNP) Chips, Whole Genome Sequencing (WGS)), Industry Analysis Report, Regional Outlook, Application Potential, Price Trends, Competitive Market Share & Forecast, 2022–2028,” Global Market Insights , Report GMI3033. [ Ugalmugle & Swain 2020 available online ]
  • Venter, J. Craig, Hamilton O. Smith, and Leroy Hood, 1996, “A New Strategy for Genome Sequencing,” Nature , 381 (30 May): 364–366.
  • Venter, J. Craig, et al., 2001, “The Sequence of the Human Genome,” Science , 291 (16 Feb): 1304–1351.
  • Veritas Genetics, 2016, “Veritas Genetics Launches $999 Whole Genome And Sets New Standard For Genetic Testing,” PRNewswire , 3 March 2016. [ available online ]
  • Wade, Christopher H., 2019, “What Is the Psychosocial Impact of Providing Genetic and Genomic Health Information to Individuals? An Overview of Systematic Reviews,” Hastings Center Report , 49 (3): S88–S96. doi:10.1002/hast.1021
  • Wade, Nicholas, 2003, “Once Again, Scientists Say Human Genome is Complete,” New York Times , 15 April 2003, F1; LexisNexis Academic.
  • Wadman, Meredith, 1999, “Human Genome Project Aims to Finish ‘Working Draft’ Next Year,” Nature , 398 (18 Mar): 177.
  • Walker, Rebecca L., and Clair Morrissey, 2014, “Bioethics Methods in the Ethical, Legal, and Social Implications of the Human Genome Project Literature,” Bioethics , 28: 481–490. doi:10.1111/bioe.12023
  • Waterston, Robert H., Eric S. Lander, and John E. Sulston, 2002, “On the Sequencing of the Human Genome,” Proceedings of the National Academy of Sciences , 99: 3712–3716.
  • Watson, James D. 1990. “The Human Genome Project: Past, Present, and Future.” Science , 248 (6 Apr): 44–49.
  • Watson, James D., 1992, “A Personal View of the Project,” in Code of Codes , pp. 164–173.
  • –––, with Andrew Berry, 2003, DNA: The Secret of Life , New York: Alfred A. Kopf.
  • Weber, Griffin M., Kenneth D. Mandl, and Isaac S. Kohane, 2014, “Finding the Missing Link for Big Biomedical Data,” The Journal of the American Medical Association , 311: 2479–2480. doi:10.1001/jama.2014.4228
  • Weber, James L., and Eugene W. Myers, 1997, “Human Whole-Genome Shotgun Sequencing,” Genome Research 7: 401–409.
  • Wheeler, David A., et al., 2008, “The Complete Genome of an Individual by Massively Parallel DNA Sequencing,” Nature , 452 (17 April): 872–876.
  • Wolinsky, Howard, 2019, “Ancient DNA and Contemporary Politics,” EMBO Reports 20: e49507. doi:10.15252/embr.201949507
  • Yang, Xiaofei, Wan-Ping Lee, Kai Ye, and Charles Lee, 2019, “One Reference Genome Is Not Enough,” Genome Biology , 20: 104 [3pp]. doi:10.1186/s13059-019-1717-0
  • Yesley, M. S., 2008, “What’s ELSI Got to Do with It? Bioethics and the Human Genome Project,” New Genetics and Society , 27(1): 1–6.
  • Zarate, Oscar A., Julia Green Brody, Phil Brown, Mónica D. Ramírez-Andreotta, Laura Perovich, and Jacob Matz, 2016, “Balancing Benefits and Risks of Immortal Data: Participants’ Views of Open Consent in the Personal Genome Project,” Hastings Center Report , 46(1): 36–45. doi:10.1002/hast.523
  • Zhao, Tingting, Zhongqu Duan, Georgi Z Genchev, and Hui Lu, 2020, “Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences,” G3 Genes|Genomes|Genetics , 10: 2801–2809. doi:10.1534/g3.120.401280
How to cite this entry . Preview the PDF version of this entry at the Friends of the SEP Society . Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers , with links to its database.
  • About | National Institutes of Health (NIH) — All of Us
  • Council for Responsible Genetics
  • Department of Energy (DOE) Human Genome Project Information
  • Ensembl Human Genome Browser
  • ESRC Centre for Genomics in Society (Egenis), University of Exeter
  • Human Genome Organization (HUGO)
  • Human Genome News (DOE/NHGRI publication)
  • HumGen International (ELSI resources)
  • National Center for Biotechnology Information (NCBI) Human Genome Resources
  • National Human Genome Research Institute (NHGRI) All About the Human Genome Project (HGP)
  • Native BioData Consortium
  • NCBI Human Genome Resources
  • NHGRI ELSI Research Program
  • Nature Human Genome Collection
  • Nuffield Council on Bioethics
  • Science Human Genome Special Issue 16 February 2001
  • The Human Genome Project: An Annotated & Interactive Scholarly Guide to the Project in the United States (cshl.edu)
  • UK Biobank – UK Biobank
  • Wellcome Trust Sanger Institute Human Genome Project
  • World Health Organization Genomic Resource Centre

biological development: theories of | developmental biology | developmental biology: evolution and development | disability: critical disability theory | disability: definitions and models | donation and sale of human eggs and sperm | ethics, biomedical: privacy and medicine | eugenics | feminist philosophy, interventions: bioethics | feminist philosophy, interventions: philosophy of biology | feminist philosophy, topics: perspectives on disability | feminist philosophy, topics: perspectives on reproduction and the family | gene | genetics | genetics: genotype/phenotype distinction | genetics: molecular | genetics: population | genomics and postgenomics | health | heritability | human enhancement | human nature | information: biological | medicine, philosophy of | molecular biology | parenthood and procreation | race | reduction, scientific: in biology | scientific research and big data | systems and synthetic biology, philosophy of

Acknowledgment

For the original entry, I am grateful for the assistance of a California State University Faculty Development Grant and the help of three very capable student research assistants: Isabel Casimiro at California State University, Chico and Andrew Inkpen and Ashley Pringle at Saint Mary’s University. This revised entry has benefited from the excellent advice and thoughtful encouragement provided by Jim Griesemer and Jim Tabery.

Copyright © 2023 by Lisa Gannett < lisa . gannett @ smu . ca >

  • Accessibility

Support SEP

Mirror sites.

View this site from another server:

  • Info about mirror sites

The Stanford Encyclopedia of Philosophy is copyright © 2023 by The Metaphysics Research Lab , Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

  • Published: 13 September 2013

The Human Genome Project: big science transforms biology and medicine

  • Leroy Hood 1 &
  • Lee Rowen 1  

Genome Medicine volume  5 , Article number:  79 ( 2013 ) Cite this article

156k Accesses

124 Citations

122 Altmetric

Metrics details

The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called ‘big science’ - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project.

Origins of the human genome project

The Human Genome Project (HGP) has profoundly changed biology and is rapidly catalyzing a transformation of medicine [ 1 – 3 ]. The idea of the HGP was first publicly advocated by Renato Dulbecco in an article published in 1984, in which he argued that knowing the human genome sequence would facilitate an understanding of cancer [ 4 ]. In May 1985 a meeting focused entirely on the HGP was held, with Robert Sinsheimer, the Chancellor of the University of California, Santa Cruz (UCSC), assembling 12 experts to debate the merits of this potential project [ 5 ]. The meeting concluded that the project was technically possible, although very challenging. However, there was controversy as to whether it was a good idea, with six of those assembled declaring themselves for the project, six against (and those against felt very strongly). The naysayers argued that big science is bad science because it diverts resources from the ‘real’ small science (such as single investigator science); that the genome is mostly junk that would not be worth sequencing; that we were not ready to undertake such a complex project and should wait until the technology was adequate for the task; and that mapping and sequencing the genome was a routine and monotonous task that would not attract appropriate scientific talent. Throughout the early years of advocacy for the HGP (mid- to late 1980s) perhaps 80% of biologists were against it, as was the National Institutes of Health (NIH) [ 6 ]. The US Department of Energy (DOE) initially pushed for the HGP, partly using the argument that knowing the genome sequence would help us understand the radiation effects on the human genome resulting from exposure to atom bombs and other aspects of energy transmission [ 7 ]. This DOE advocacy was critical to stimulating the debate and ultimately the acceptance of the HGP. Curiously, there was more support from the US Congress than from most biologists. Those in Congress understood the appeal of international competitiveness in biology and medicine, the potential for industrial spin-offs and economic benefits, and the potential for more effective approaches to dealing with disease. A National Academy of Science committee report endorsed the project in 1988 [ 8 ] and the tide of opinion turned: in 1990, the program was initiated, with the finished sequence published in 2004 ahead of schedule and under budget [ 9 ].

What did the human genome project entail?

This 3-billion-dollar, 15-year program evolved considerably as genomics technologies improved. Initially, the HGP set out to determine a human genetic map, then a physical map of the human genome [ 10 ], and finally the sequence map. Throughout, the HGP was instrumental in pushing the development of high-throughput technologies for preparing, mapping and sequencing DNA [ 11 ]. At the inception of the HGP in the early 1990s, there was optimism that the then-prevailing sequencing technology would be replaced. This technology, now called ‘first-generation sequencing’, relied on gel electrophoresis to create sequencing ladders, and radioactive- or fluorescent-based labeling strategies to perform base calling [ 12 ]. It was considered to be too cumbersome and low throughput for efficient genomic sequencing. As it turned out, the initial human genome reference sequence was deciphered using a 96-capillary (highly parallelized) version of first-generation technology. Alternative approaches such as multiplexing [ 13 ] and sequencing by hybridization [ 14 ] were attempted but not effectively scaled up. Meanwhile, thanks to the efforts of biotech companies, successive incremental improvements in the cost, throughput, speed and accuracy of first-generation automated fluorescent-based sequencing strategies were made throughout the duration of the HGP. Because biologists were clamoring for sequence data, the goal of obtaining a full-fledged physical map of the human genome was abandoned in the later stages of the HGP in favor of generating the sequence earlier than originally planned. This push was accelerated by Craig Venter’s bold plan to create a company (Celera) for the purpose of using a whole-genome shotgun approach [ 15 ] to decipher the sequence instead of the piecemeal clone-by-clone approach using bacterial artificial chromosome (BAC) vectors that was being employed by the International Consortium. Venter’s initiative prompted government funding agencies to endorse production of a clone-based draft sequence for each chromosome, with the finishing to come in a subsequent phase. These parallel efforts accelerated the timetable for producing a genome sequence of immense value to biologists [ 16 , 17 ].

As a key component of the HGP, it was wisely decided to sequence the smaller genomes of significant experimental model organisms such as yeast, a small flowering plant ( Arabidopsis thaliana ), worm and fruit fly before taking on the far more challenging human genome. The efforts of multiple centers were integrated to produce these reference genome sequences, fostering a culture of cooperation. There were originally 20 centers mapping and sequencing the human genome as part of an international consortium [ 18 ]; in the end five large centers (the Wellcome Trust Sanger Institute, the Broad Institute of MIT and Harvard, The Genome Institute of Washington University in St Louis, the Joint Genome Institute, and the Whole Genome Laboratory at Baylor College of Medicine) emerged from this effort, with these five centers continuing to provide genome sequence and technology development. The HGP also fostered the development of mathematical, computational and statistical tools for handling all the data it generated.

The HGP produced a curated and accurate reference sequence for each human chromosome, with only a small number of gaps, and excluding large heterochromatic regions [ 9 ]. In addition to providing a foundation for subsequent studies in human genomic variation, the reference sequence has proven essential for the development and subsequent widespread use of second-generation sequencing technologies, which began in the mid-2000s. Second-generation cyclic array sequencing platforms produce, in a single run, up to hundreds of millions of short reads (originally approximately 30 to 70 bases, now up to several hundred bases), which are typically mapped to a reference genome at highly redundant coverage [ 19 ]. A variety of cyclic array sequencing strategies (such as RNA-Seq, ChIP-Seq, bisulfite sequencing) have significantly advanced biological studies of transcription and gene regulation as well as genomics, progress for which the HGP paved the way.

Impact of the human genome project on biology and technology

First, the human genome sequence initiated the comprehensive discovery and cataloguing of a ‘parts list’ of most human genes [ 16 , 17 ], and by inference most human proteins, along with other important elements such as non-coding regulatory RNAs. Understanding a complex biological system requires knowing the parts, how they are connected, their dynamics and how all of these relate to function [ 20 ]. The parts list has been essential for the emergence of ‘systems biology’, which has transformed our approaches to biology and medicine [ 21 , 22 ].

As an example, the ENCODE (Encyclopedia Of DNA Elements) Project, launched by the NIH in 2003, aims to discover and understand the functional parts of the genome [ 23 ]. Using multiple approaches, many based on second-generation sequencing, the ENCODE Project Consortium has produced voluminous and valuable data related to the regulatory networks that govern the expression of genes [ 24 ]. Large datasets such as those produced by ENCODE raise challenging questions regarding genome functionality. How can a true biological signal be distinguished from the inevitable biological noise produced by large datasets [ 25 , 26 ]? To what extent is the functionality of individual genomic elements only observable (used) in specific contexts (for example, regulatory networks and mRNAs that are operative only during embryogenesis)? It is clear that much work remains to be done before the functions of poorly annotated protein-coding genes will be deciphered, let alone those of the large regions of the non-coding portions of the genome that are transcribed. What is signal and what is noise is a critical question.

Second, the HGP also led to the emergence of proteomics, a discipline focused on identifying and quantifying the proteins present in discrete biological compartments, such as a cellular organelle, an organ or the blood. Proteins - whether they act as signaling devices, molecular machines or structural components - constitute the cell-specific functionality of the parts list of an organism’s genome. The HGP has facilitated the use of a key analytical tool, mass spectrometry, by providing the reference sequences and therefore the predicted masses of all the tryptic peptides in the human proteome - an essential requirement for the analysis of mass-spectrometry-based proteomics [ 27 ]. This mass-spectrometry-based accessibility to proteomes has driven striking new applications such as targeted proteomics [ 28 ]. Proteomics requires extremely sophisticated computational techniques, examples of which are PeptideAtlas [ 29 ] and the Trans-Proteomic Pipeline [ 30 ].

Third, our understanding of evolution has been transformed. Since the completion of the HGP, over 4,000 finished or quality draft genome sequences have been produced, mostly from bacterial species but including 183 eukaryotes [ 31 ]. These genomes provide insights into how diverse organisms from microbes to human are connected on the genealogical tree of life - clearly demonstrating that all of the species that exist today descended from a single ancestor [ 32 ]. Questions of longstanding interest with implications for biology and medicine have become approachable. Where do new genes come from? What might be the role of stretches of sequence highly conserved across all metazoa? How much large-scale gene organization is conserved across species and what drives local and global genome reorganization? Which regions of the genome appear to be resistant (or particularly susceptible) to mutation or highly susceptible to recombination? How do regulatory networks evolve and alter patterns of gene expression [ 33 ]? The latter question is of particular interest now that the genomes of several primates and hominids have been or are being sequenced [ 34 , 35 ] in hopes of shedding light on the evolution of distinctively human characteristics. The sequence of the Neanderthal genome [ 36 ] has had fascinating implications for human evolution; namely, that a few percent of Neanderthal DNA and hence the encoded genes are intermixed in the human genome, suggesting that there was some interbreeding while the two species were diverging [ 36 , 37 ].

Fourth, the HGP drove the development of sophisticated computational and mathematical approaches to data and brought computer scientists, mathematicians, engineers and theoretical physicists together with biologists, fostering a more cross-disciplinary culture [ 1 , 21 , 38 ]. It is important to note that the HGP popularized the idea of making data available to the public immediately in user-friendly databases such as GenBank [ 39 ] and the UCSC Genome Browser [ 40 ]. Moreover, the HGP also promoted the idea of open-source software, in which the source code of programs is made available to and can be edited by those interested in extending their reach and improving them [ 41 , 42 ]. The open-source operating system of Linux and the community it has spawned have shown the power of this approach. Data accessibility is a critical concept for the culture and success of biology in the future because the ‘democratization of data’ is critical for attracting available talent to focus on the challenging problems of biological systems with their inherent complexity [ 43 ]. This will be even more critical in medicine, as scientists need access to the data cloud available from each individual human to mine for the predictive medicine of the future - an effort that could transform the health of our children and grandchildren [ 44 ].

Fifth, the HGP, as conceived and implemented, was the first example of ‘big science’ in biology, and it clearly demonstrated both the power and the necessity of this approach for dealing with its integrated biological and technological aims. The HGP was characterized by a clear set of ambitious goals and plans for achieving them; a limited number of funded investigators typically organized around centers or consortia; a commitment to public data/resource release; and a need for significant funding to support project infrastructure and new technology development. Big science and smaller-scope individual-investigator-oriented science are powerfully complementary, in that the former generates resources that are foundational for all researchers while the latter adds detailed experimental clarification of specific questions, and analytical depth and detail to the data produced by big science. There are many levels of complexity in biology and medicine; big science projects are essential to tackle this complexity in a comprehensive and integrative manner [ 45 ].

The HGP benefited biology and medicine by creating a sequence of the human genome; sequencing model organisms; developing high-throughput sequencing technologies; and examining the ethical and social issues implicit in such technologies. It was able to take advantage of economies of scale and the coordinated effort of an international consortium with a limited number of players, which rendered the endeavor vastly more efficient than would have been possible if the genome were sequenced on a gene-by-gene basis in small labs. It is also worth noting that one aspect that attracted governmental support to the HGP was its potential for economic benefits. The Battelle Institute published a report on the economic impact of the HGP [ 46 ]. For an initial investment of approximately $3.5 billion, the return, according to the report, has been about $800 billion - a staggering return on investment.

Even today, as budgets tighten, there is a cry to withdraw support from big science and focus our resources on small science. This would be a drastic mistake. In the wake of the HGP there are further valuable biological resource-generating projects and analyses of biological complexity that require a big science approach, including the HapMap Project to catalogue human genetic variation [ 47 , 48 ], the ENCODE project, the Human Proteome Project (described below) and the European Commission’s Human Brain Project, as well as another brain-mapping project recently announced by President Obama [ 49 ]. Similarly to the HGP, significant returns on investment will be possible for other big science projects that are now under consideration if they are done properly. It should be stressed that discretion must be employed in choosing big science projects that are fundamentally important. Clearly funding agencies should maintain a mixed portfolio of big and small science - and the two are synergistic [ 1 , 45 ].

Last, the HGP ignited the imaginations of unusually talented scientists - Jim Watson, Eric Lander, John Sulston, Bob Waterston and Sydney Brenner to mention only a few. So virtually every argument initially posed by the opponents of the HGP turned out to be wrong. The HGP is a wonderful example of a fundamental paradigm change in biology: initially fiercely resisted, it was ultimately far more transformational than expected by even the most optimistic of its proponents.

Impact of the human genome project on medicine

Since the conclusion of the HGP, several big science projects specifically geared towards a better understanding of human genetic variation and its connection to human health have been initiated. These include the HapMap Project aimed at identifying haplotype blocks of common single nucleotide polymorphisms (SNPs) in different human populations [ 47 , 48 ], and its successor, the 1000 Genomes project, an ongoing endeavor to catalogue common and rare single nucleotide and structural variation in multiple populations [ 50 ]. Data produced by both projects have supported smaller-scale clinical genome-wide association studies (GWAS), which correlate specific genetic variants with disease risk of varying statistical significance based on case–control comparisons. Since 2005, over 1,350 GWAS have been published [ 51 ]. Although GWAS analyses give hints as to where in the genome to look for disease-causing variants, the results can be difficult to interpret because the actual disease-causing variant might be rare, the sample size of the study might be too small, or the disease phenotype might not be well stratified. Moreover, most of the GWAS hits are outside of coding regions - and we do not have effective methods for easily determining whether these hits reflect the mis-functioning of regulatory elements. The question as to what fraction of the thousands of GWAS hits are signal and what fraction are noise is a concern. Pedigree-based whole-genome sequencing offers a powerful alternative approach to identifying potential disease-causing variants [ 52 ].

Five years ago, a mere handful of personal genomes had been fully sequenced (for example, [ 53 , 54 ]). Now there are thousands of exome and whole-genome sequences (soon to be tens of thousands, and eventually millions), which have been determined with the aim of identifying disease-causing variants and, more broadly, establishing well-founded correlations between sequence variation and specific phenotypes. For example, the International Cancer Genome Consortium [ 55 ] and The Cancer Genome Atlas [ 56 ] are undertaking large-scale genomic data collection and analyses for numerous cancer types (sequencing both the normal and cancer genome for each individual patient), with a commitment to making their resources available to the research community.

We predict that individual genome sequences will soon play a larger role in medical practice. In the ideal scenario, patients or consumers will use the information to improve their own healthcare by taking advantage of prevention or therapeutic strategies that are known to be appropriate for real or potential medical conditions suggested by their individual genome sequence. Physicians will need to educate themselves on how best to advise patients who bring consumer genetic data to their appointments, which may well be a common occurrence in a few years [ 57 ].

In fact, the application of systems approaches to disease has already begun to transform our understanding of human disease and the practice of healthcare and push us towards a medicine that is predictive, preventive, personalized and participatory: P4 medicine. A key assumption of P4 medicine is that in diseased tissues biological networks become perturbed - and change dynamically with the progression of the disease. Hence, knowing how the information encoded by disease-perturbed networks changes provides insights into disease mechanisms, new approaches to diagnosis and new strategies for therapeutics [ 58 , 59 ].

Let us provide some examples. First, pharmacogenomics has identified more than 70 genes for which specific variants cause humans to metabolize drugs ineffectively (too fast or too slow). Second, there are hundreds of ‘actionable gene variants’ - variants that cause disease but whose consequences can be avoided by available medical strategies with knowledge of their presence [ 60 ]. Third, in some cases, cancer-driving mutations in tumors, once identified, can be counteracted by treatments with currently available drugs [ 61 ]. And last, a systems approach to blood protein diagnostics has generated powerful new diagnostic panels for human diseases such as hepatitis [ 62 ] and lung cancer [ 63 ].

These latter examples portend a revolution in blood diagnostics that will lead to early detection of disease, the ability to follow disease progression and responses to treatment, and the ability to stratify a disease type (for instance, breast cancer) into its different subtypes for proper impedance match against effective drugs [ 59 ]. We envision a time in the future when all patients will be surrounded by a virtual cloud of billions of data points, and when we will have the analytical tools to reduce this enormous data dimensionality to simple hypotheses to optimize wellness and minimize disease for each individual [ 58 ].

Impact of the human genome project on society

The HGP challenged biologists to consider the social implications of their research. Indeed, it devoted 5% of its budget to considering the social, ethical and legal aspects of acquiring and understanding the human genome sequence [ 64 ]. That process continues as different societal issues arise, such as genetic privacy, potential discrimination, justice in apportioning the benefits from genomic sequencing, human subject protections, genetic determinism (or not), identity politics, and the philosophical concept of what it means to be human beings who are intrinsically connected to the natural world.

Strikingly, we have learned from the HGP that there are no race-specific genes in humans [ 65 – 68 ]. Rather, an individual’s genome reveals his or her ancestral lineage, which is a function of the migrations and interbreeding among population groups. We are one race and we honor our species’ heritage when we treat each other accordingly, and address issues of concern to us all, such as human rights, education, job opportunities, climate change and global health.

What is to come?

There remain fundamental challenges for fully understanding the human genome. For example, as yet at least 5% of the human genome has not been successfully sequenced or assembled for technical reasons that relate to eukaryotic islands being embedded in heterochromatic repeats, copy number variations, and unusually high or low GC content [ 69 ]. The question of what information these regions contain is a fascinating one. In addition, there are highly conserved regions of the human genome whose functions have not yet been identified; presumably they are regulatory, but why they should be strongly conserved over a half a billion years of evolution remains a mystery.

There will continue to be advances in genome analysis. Developing improved analytical techniques to identify biological information in genomes and decipher what this information relates to functionally and evolutionarily will be important. Developing the ability to rapidly analyze complete human genomes with regard to actionable gene variants is essential. It is also essential to develop software that can accurately fold genome-predicted proteins into three dimensions, so that their functions can be predicted from structural homologies. Likewise, it will be fascinating to determine whether we can make predictions about the structures of biological networks directly from the information of their cognate genomes. Indeed, the idea that we can decipher the ‘logic of life’ of an organism solely from its genome sequence is intriguing. While we have become relatively proficient at determining static and stable genome sequences, we are still learning how to measure and interpret the dynamic effects of the genome: gene expression and regulation, as well as the dynamics and functioning of non-coding RNAs, metabolites, proteins and other products of genetically encoded information.

The HGP, with its focus on developing the technology to enumerate a parts list, was critical for launching systems biology, with its concomitant focus on high-throughput ‘omics’ data generation and the idea of ‘big data’ in biology [ 21 , 38 ]. The practice of systems biology begins with a complete parts list of the information elements of living organisms (for example, genes, RNAs, proteins and metabolites). The goals of systems biology are comprehensive yet open ended because, as seen with the HGP, the field is experiencing an infusion of talented scientists applying multidisciplinary approaches to a variety of problems. A core feature of systems biology, as we see it, is to integrate many different types of biological information to create the ‘network of networks’ - recognizing that networks operate at the genomic, the molecular, the cellular, the organ, and the social network levels, and that these are integrated in the individual organism in a seamless manner [ 58 ]. Integrating these data allows the creation of models that are predictive and actionable for particular types of organisms and individual patients. These goals require developing new types of high-throughput omic technologies and ever increasingly powerful analytical tools.

The HGP infused a technological capacity into biology that has resulted in enormous increases in the range of research, for both big and small science. Experiments that were inconceivable 20 years ago are now routine, thanks to the proliferation of academic and commercial wet lab and bioinformatics resources geared towards facilitating research. In particular, rapid increases in throughput and accuracy of the massively parallel second-generation sequencing platforms with their correlated decreases in cost of sequencing have resulted in a great wealth of accessible genomic and transcriptional sequence data for myriad microbial, plant and animal genomes. These data in turn have enabled large- and small-scale functional studies that catalyze and enhance further research when the results are provided in publicly accessible databases [ 70 ].

One descendant of the HGP is the Human Proteome Project, which is beginning to gather momentum, although it is still poorly funded. This exciting endeavor has the potential to be enormously beneficial to biology [ 71 – 73 ]. The Human Proteome Project aims to create assays for all human and model organism proteins, including the myriad protein isoforms produced from the RNA splicing and editing of protein-coding genes, chemical modifications of mature proteins, and protein processing. The project also aims to pioneer technologies that will achieve several goals: enable single-cell proteomics; create microfluidic platforms for thousands of protein enzyme-linked immunosorbent assays (ELISAs) for rapid and quantitative analyses of, for example, a fraction of a droplet of blood; develop protein-capture agents that are small, stable, easy to produce and can be targeted to specific protein epitopes and hence avoid extensive cross-reactivity; and develop the software that will enable the ordinary biologist to analyze the massive amounts of proteomics data that are beginning to emerge from human and other organisms.

Newer generations of DNA sequencing platforms will be introduced that will transform how we gather genome information. Third-generation sequencing [ 74 ] will employ nanopores or nanochannels, utilize electronic signals, and sequence single DNA molecules for read lengths of 10,000 to 100,000 bases. Third-generation sequencing will solve many current problems with human genome sequences. First, contemporary short-read sequencing approaches make it impossible to assemble human genome sequences de novo ; hence, they are usually compared against a prototype reference sequence that is itself not fully accurate, especially with respect to variations other than SNPs. This makes it extremely difficult to precisely identify the insertion-deletion and structural variations in the human genome, both for our species as a whole and for any single individual. The long reads of third-generation sequencing will allow for the de novo assembly of human (and other) genomes, and hence delineate all of the individually unique variability: nucleotide substitutions, indels, and structural variations. Second, we do not have global techniques for identifying the 16 different chemical modifications of human DNA (epigenetic marks, reviewed in [ 75 ]). It is increasingly clear that these epigenetic modifications play important roles in gene expression [ 76 ]. Thus, single-molecule analyses should be able to identify all the epigenetic marks on DNA. Third, single-molecule sequencing will facilitate the full-length sequencing of RNAs; thus, for example, enhancing interpretation of the transcriptome by enabling the identification of RNA editing, alternative splice forms with a given transcript, and different start and termination sites. Last, it is exciting to contemplate that the ability to parallelize this process (for example, by generating millions of nanopores that can be used simultaneously) could enable the sequencing of a human genome in 15 minutes or less [ 77 ]. The high-throughput nature of this sequencing may eventually lead to human genome costs of $100 or under. The interesting question is how long it will take to make third-generation sequencing a mature technology.

The HGP has thus opened many avenues in biology, medicine, technology and computation that we are just beginning to explore.

Abbreviations

Bacterial artificial chromosome

Department of Energy

Enzyme-linked immunosorbent assay

Genome-wide association studies

  • Human Genome Project

National Institutes of Health

Single nucleotide polymorphism

University of California, Santa Cruz.

Hood L: Acceptance remarks for Fritz J. and Delores H. Russ Prize. The Bridge. 2011, 41: 46-49.

Google Scholar  

Collins FS, McKusick VA: Implications of the Human Genome Project for medical science. JAMA. 2001, 285: 540-544. 10.1001/jama.285.5.540.

Article   CAS   PubMed   Google Scholar  

Green ED, Guyer MS, National Human Genome Research Institute: Charting a course for genomic medicine from base to bedside. Nature. 2011, 470: 204-213. 10.1038/nature09764.

Dulbecco R: A turning point in cancer research: sequencing the human genome. Science. 1984, 231: 1055-1056.

Article   Google Scholar  

Sinsheimer RL: The Santa Cruz workshop - May 1985. Genomics. 1989, 5: 954-956. 10.1016/0888-7543(89)90142-0.

Cooke-Degan RM: The Gene Wars: Science, Politics and the Human Genome. 1994, New York: WW Norton

Report on the Human Genome Initiative for the Office of Health and Environmental Research. http://www.ornl.gov/sci/techresources/Human_Genome/project/herac2.shtml ,

National Academy of Science: Report of the Committee on Mapping and Sequencing the Human Genome. 1988, Washington DC: National Academy Press

Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.

Understanding Our Genetic Inheritance. The United States Human Genome Project, The First Five Years: Fiscal Years. 1991, http://www.genome.gov/10001477 , –1995,

Collins FS, Galas D: A new five-year plan for the U.S. Human Genome Program. Science. 1993, 262: 43-46. 10.1126/science.8211127.

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, Hood LE: Fluorescence detection in automated DNA sequence analysis. Nature. 1986, 321: 674-679. 10.1038/321674a0.

Church G, Kieffer-Higgins S: Multiplex DNA sequencing. Science. 1988, 240: 185-188. 10.1126/science.3353714.

Strezoska Z, Paunesku T, Radosavljević D, Labat I, Drmanac R, Crkvenjakov R: DNA sequencing by hybridization: 100 bases read by a non-gel-based method. Proc Natl Acad Sci USA. 1991, 88: 10089-10093. 10.1073/pnas.88.22.10089.

Article   PubMed Central   CAS   PubMed   Google Scholar  

Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M: Shotgun sequencing of the human genome. Science. 1998, 280: 1540-1542. 10.1126/science.280.5369.1540.

International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Miklos GLG, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.

International Human Genome Sequencing Consortium. http://www.genome.gov/11006939 ,

Shendure J, Aiden ER: The expanding scope of DNA sequencing. Nat Biotechnol. 2012, 30: 1084-1094. 10.1038/nbt.2421.

Hood L: A personal journey of discovery: developing technology and changing biology. Annu Rev Anal Chem. 2008, 1: 1-43. 10.1146/annurev.anchem.1.031207.113113.

Article   CAS   Google Scholar  

Committee on a New Biology for the 21st Century: A New Biology for the 21st Century. 2009, Washington DC: The National Academies Press

Ideker T, Galitski T, Hood L: A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001, 2: 343-372. 10.1146/annurev.genom.2.1.343.

Encyclopedia of DNA Elements. http://encodeproject.org/ENCODE/ ,

ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.

Editorial: Form and function. Nature. 2013, 495: 141-142.

ENCODE Project Consortium: A user’s guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol. 2011, 9: e1001046-10.1371/journal.pbio.1001046.

Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.

Picotti P, Aebersold R: Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat Methods. 2012, 9: 555-566. 10.1038/nmeth.2015.

Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R: The PeptideAtlas Project. Nucleic Acids Res. 2006, 34: D655-D658. 10.1093/nar/gkj040.

Deutsch ED, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii A, Aebersold R: A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010, 10: 1150-1159. 10.1002/pmic.200900375.

Genomes Online Database: complete genome projects. http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Complete+Genome+Projects ,

Theobald DL: A formal test of the theory of universal common ancestry. Nature. 2010, 465: 219-222. 10.1038/nature09014.

Wolfe KE, Li W-H: Molecular evolution meets the genomics evolution. Nat Genet. 2003, Suppl 33: 255-265.

Marques-Bonet T, Ryder OA, Eichler EE: Sequencing primate genomes: what have we learned?. Annu Rev Genomics Hum Genet. 2009, 10: 355-386. 10.1146/annurev.genom.9.081307.164420.

Noonan JP: Neanderthal genomics and the evolution of modern human. Genome Res. 2010, 20: 547-553. 10.1101/gr.076000.108.

Stoneking M, Krause J: Learning about human population history from ancient and modern genomes. Nat Rev Genet. 2011, 12: 603-614.

Sankararaman S, Patterson N, Li H, Paabo S, Reich D: The date of interbreeding between Neanderthals and Modern Humans. PLoS Genet. 2012, 8: e1002947-10.1371/journal.pgen.1002947.

Schatz MC: Computational thinking in the era of big data biology. Genome Biol. 2012, 13: 177-10.1186/gb-2012-13-11-177.

Article   PubMed Central   PubMed   Google Scholar  

Mizrachi I: GenBank: the Nucleotide Sequence Database. The NCBI Handbook. Edited by: McEntyre J, Ostell J. 2002, Bethesda: National Center for Biotechnology Information

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.

SourceForge. http://sourceforge.net/ ,

Bioconductor: open source software for bioinformatics. http://www.bioconductor.org/ ,

Field D, Sansone S-A, Collina A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka M, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Omics data sharing. Science. 2009, 326: 234-236. 10.1126/science.1180598.

Knoppers BM, Harris JR, Tasse AM, Budin-Ljosne I, Kaye J, Deschenes M, Zawati M: Towards a data-sharing Code of Conduct for international genomic research. Genome Med. 2011, 3: 46-10.1186/gm262.

Hood L: Biological complexity under attack: a personal view of systems biology and the coming of “big science”. Genet Eng Biotechnol News. 2011, 31: 17-

Tripp S, Grueber M: Economic Impact of the Human Genome Project. 2011, Columbus: Battelle Memorial Institute

International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.

The International HapMap3 Consortium: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467: 52-58. 10.1038/nature09298.

Abbott A: Neuroscience: solving the brain. Nature. 2013, 499: 272-274. 10.1038/499272a.

The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.

Article   PubMed Central   Google Scholar  

A Catalog of Published Genome-wide Association Studies. http://www.genome.gov/gwastudies/ ,

Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ: Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010, 328: 636-639. 10.1126/science.1186802.

Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254-10.1371/journal.pbio.0050254.

Wheeler DA, Srinivasian M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song X, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876. 10.1038/nature06884.

International Cancer Genome Consortium. http://icgc.org/ ,

The Cancer Genome Atlas. http://cancergenome.nih.gov/ ,

Pandey A: Preparing for the 21 st century patient. JAMA. 2013, 309: 1471-1472. 10.1001/jama.2012.116971.

Hood L, Flores M: A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. Nat Biotechnol. 2012, 29: 613-624.

CAS   Google Scholar  

Price ND, Edelman LB, Lee I, Yoo H, Hwang D, Carlson G, Galas DJ, Heath JR, Hood L: Systems biology and the emergence of systems medicine. Genomic and Personalized Medicine: From Principles to Practice. Volume 1. Edited by: Ginsburg G, Willard H. 2009, Philadelphia: Elsevier, 131-141.

Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire A, Nussbaum RL, O’Daniel JM, Ormond KE, Rehm HL, Watson MS, Williams MS, Biesecker LG: ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing. 2013, Bethesda: American College of Medical Genetics and Genomics

Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010, 11: 685-696. 10.1038/nrg2841.

Qin S, Zhou Y, Lok AS, Tsodikov A, Yan X, Gray L, Yuan M, Moritz RL, Galas D, Omenn GS, Hood L: SRM targeted proteomics in search for biomarkers of HCV-induced progression of fibrosis to cirrhosis in HALT-C patients. Proteomics. 2012, 12: 1244-1252. 10.1002/pmic.201100601.

Li X-J, Hayward C, Fong P-Y, Dominguez M, Hunsucker SW, Lee LW, McClean M, Law S, Butler H, Schirm M, Gingras O, Lamontague J, Allard R, Chelsky D, Price ND, Lam S, Massion PP, Pass H, Rom WN, Vachani A, Fang KC, Hood L, Kearney P: A blood-based proteomic classifier for the molecular characterization of pulmonary nodules. Sci Transl Med. in press

Knoppers BM, Thorogood A, Chadwick R: The Human Genome Organisation: towards next-generation ethics. Genome Med. 2013, 5: 38-10.1186/gm442.

Hood L: Who we are: the book of life. Commencement Address. Whitman College Magazine. 2002, 4-7.

Foster MW, Sharp RR: Beyond race: towards a whole-genome perspective on human populations and genetic variation. Nat Rev Genet. 2004, 5: 790-796. 10.1038/nrg1452.

Royal CDM, Dunston GM: Changing the paradigm from ‘race’ to human genetic variation. Nat Genet. 2004, 36: S5-S7. 10.1038/ng1454.

Witherspoon DJ, Wooding S, Rogers AR, Marchani EE, Watkins WS, Batzer MA, Jorde LB: Genetic similarities within and between populations. Genetics. 2007, 176: 351-359. 10.1534/genetics.106.067355.

Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K, Pasaniuk B, Price AL, Reich D, Morton CC, Pollak MR, Wilson JG, McCarroll SA: Using population admixture to help complete maps of the human genome. Nat Genet. 2013, 45: 406-414. 10.1038/ng.2565.

Fernandez-Suarez XM, Galperin MY: The, Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2013, 2013: D1-D7.

Human Proteome Project. http://www.hupo.org/research/hpp/ ,

Hood LE, Omenn GS, Moritz RL, Aebersold R, Yamamoto KR, Amos M, Hunter-Cevera J, Locascio L, Workshop Participants: New and improved proteomics technologies for understanding complex biological systems: addressing a grand challenge in the life sciences. Proteomics. 2012, 12: 2773-2783. 10.1002/pmic.201270086.

Editorial: The call of the human proteome. Nat Methods. 2010, 7: 661-

Schadt E, Turner S, Kasarskis A: A window into third-generation sequencing. Hum Mol Genet. 2010, 19: R227-R240. 10.1093/hmg/ddq416.

Kim JK, Samaranayake M, Pradhan S: Epigenetic mechanisms in mammals. Cell Mol Life Sci. 2009, 66: 596-612. 10.1007/s00018-008-8432-4.

Hon G, Ren B, Wang W: ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008, 4: e1000201-10.1371/journal.pcbi.1000201.

Hayden EC: Nanopore genome sequencer makes its debut. Nature News. 2012,  -10.1038/nature.2012.10051.

Download references

Acknowledgements

The authors gratefully acknowledge support from the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg; from the NIH, through award 2P50GM076547-06A; and the US Department of Defense (DOD), through award W911SR-09-C-0062. LH receives support from NIH P01 NS041997; 1U54CA151819-01; and DOD awards W911NF-10-2-0111 and W81XWH-09-1-0107.

Author information

Authors and affiliations.

Institute for Systems Biology, 401 Terry Ave N., Seattle, WA, 98109, USA

Leroy Hood & Lee Rowen

You can also search for this author in PubMed   Google Scholar

Corresponding authors

Correspondence to Leroy Hood or Lee Rowen .

Additional information

Competing interests.

The authors declare that they have no competing interests.

Rights and permissions

Reprints and permissions

About this article

Cite this article.

Hood, L., Rowen, L. The Human Genome Project: big science transforms biology and medicine. Genome Med 5 , 79 (2013). https://doi.org/10.1186/gm483

Download citation

Published : 13 September 2013

DOI : https://doi.org/10.1186/gm483

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Human Genome Sequence
  • Human Brain Project
  • Small Science
  • Individual Genome Sequence

Genome Medicine

ISSN: 1756-994X

human genome project literature review

  • Introduction to Genomics
  • Educational Resources
  • Policy Issues in Genomics

The Human Genome Project

  • Funding Opportunities
  • Funded Programs & Projects
  • Division and Program Directors
  • Scientific Program Analysts
  • Contact by Research Area
  • News & Events
  • Research Areas
  • Research investigators
  • Research Projects
  • Clinical Research
  • Data Tools & Resources
  • Genomics & Medicine
  • Family Health History
  • For Patients & Families
  • For Health Professionals
  • Jobs at NHGRI
  • Training at NHGRI
  • Funding for Research Training
  • Professional Development Programs
  • NHGRI Culture
  • Social Media
  • Broadcast Media
  • Image Gallery
  • Press Resources
  • Organization
  • NHGRI Director
  • Mission & Vision
  • Policies & Guidance
  • Institute Advisors
  • Strategic Vision
  • Leadership Initiatives
  • Diversity, Equity, and Inclusion
  • Partner with NHGRI
  • Staff Search

The Human Genome Project (HGP) is one of the greatest scientific feats in history. The project was a voyage of biological discovery led by an international group of researchers looking to comprehensively study all of the DNA (known as a genome) of a select set of organisms. Launched in October 1990 and completed in April 2003, the Human Genome Project’s signature accomplishment – generating the first sequence of the human genome – provided fundamental information about the human blueprint, which has since accelerated the study of human biology and improved the practice of medicine.

Learn more about the Human Genome Project below.

G5 Reunion

A virtual discussion with the leaders of the five genome-sequencing centers that provides the untold story on how they got the HGP across the finish line in 2003.

DNA sequencing by gel electrophoresis

A fact sheet detailing how the project began and how it shaped the future of research and technology.

Human Genome Project Timeline of Events | NHGRI

An interactive timeline listing key moments from the history of the project.

HGP Timeline

A downloadable poster containing major scientific landmarks before and throughout the project.

Francis Collins

Prominent scientists involved in the project reflect on the lessons learned.

HGP Banbury Meeting

Commentary in the journal Nature written by NHGRI leaders discussing the legacies of the project.

Science and Nature Covers

Lecture-oriented slides telling the story of the project by a front-line participant.

Human Genome Project

Related Content

Jay Shendure

Last updated: September 7, 2023

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Published: 01 April 2005

The Human Genome Diversity Project: past, present and future

  • L. Luca Cavalli-Sforza 1  

Nature Reviews Genetics volume  6 ,  pages 333–340 ( 2005 ) Cite this article

9105 Accesses

234 Citations

21 Altmetric

Metrics details

The Human Genome Project, in accomplishing its goal of sequencing one human genome, heralded a new era of research, a component of which is the systematic study of human genetic variation. Despite delays, the Human Genome Diversity Project has started to make progress in understanding the patterns of this variation and its causes, and also promises to provide important information for biomedical studies.

This is a preview of subscription content, access via your institution

Access options

Subscribe to this journal

Receive 12 print issues and online access

$189.00 per year

only $15.75 per issue

Buy this article

  • Purchase on Springer Link
  • Instant access to full article PDF

Prices may be subject to local taxes which are calculated during checkout

human genome project literature review

Similar content being viewed by others

human genome project literature review

The Human Pangenome Project: a global resource to map genomic diversity

Ting Wang, Lucinda Antonacci-Fulton, … the Human Pangenome Reference Consortium

human genome project literature review

Pan-genomics in the human genome era

Rachel M. Sherman & Steven L. Salzberg

human genome project literature review

The road ahead in genetics and genomics

Amy L. McGuire, Stacey Gabriel, … Jin-Soo Kim

Greely, H. T. Human genome diversity: what about the other human genome project? Nature Rev. Genet. 2 , 222–227 (2001).

Article   CAS   PubMed   Google Scholar  

Cann, H. M. et al. A human genome diversity cell line panel. (letter). Science 296 , 261 (2002).

The International HapMap Consortium. The International HapMap Project. Nature 426 , 789–795 (2003).

The International HapMap Consortium. Integrating ethics and science in the International HapMap Project. Nature Rev. Genet. 5 , 467–475 (2004).

Cavalli-Sforza, L. L. How can one study individual variation for three billion nucleotides of the human genome? Am. J. Hum. Genet. 46 , 649–651 (1990).

CAS   PubMed   PubMed Central   Google Scholar  

Cavalli-Sforza, L. L., Wilson, A. C., Cantor, C. R., Cook-Deegan, R. M. & King, M. -C. Call for a worldwide survey of human genetic diversity: a vanishing opportunity for the Human Genome Project. Genomics 11 , 490–491 (1991).

Committee on Human Genome Diversity, National Research Council. Evaluating Human Genetic Diversity (US National Academy of Sciences, Washington DC, 1997).

Dausset, J. et al. Centre d'Etude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6 , 575–577 (1990) (in French).

Rosenberg, N. A. et al. Genetic structure of human populations. Science 298 , 2381–2385 (2002).

Zhivotovsky, L. A., Rosenberg, N. A. & Feldman, M. W. Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am. J. Hum. Genet. 72 , 1171–1186 (2003).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Ramachandran, S., Rosenberg, N. A., Zhivotovsky, L. A. & Feldman, M. W. On the robustness of the inference of human population structure. Hum. Genomics 1 , 87–97 (2004).

Shi, M., Caprau, D., Romitti, P., Christensen, K. & Murray, J. C. Genotype frequencies and linkage disequilibrium in the CEPH Human Diversity Panel for variants in folate pathway genes MTHFR, MTHFD , MTRR , RFLI and GCP2 . Birth Defects Res. A 67 , 545–549 (2003).

Article   CAS   Google Scholar  

Bersaglieri, T. et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74 , 1111–1120 (2004).

Macpherson, M. J., Ramachandran, S., Diamond, L. & Feldman, M. W. Demographic estimates from Y-chromosome microsatellite polymorphisms: analysis of a worldwide sample. Hum. Genomics 1 , 345–354 (2004).

Cavalli-Sforza, L. L. & Feldman, M. W. Biology as history: population genetic approaches to modern human evolution. Nature Genet. 33 , 266–275 (2003).

Serre, D. & Paabo, S. Evidence for gradients of human genetic diversity within and among continents. Genome Res. 14 , 1679–1685 (2004).

Horten, R. et al. Read all about it: the Lancet 's paper of the Year, 2003. Lancet 362 , 2101–2103 (2003).

Article   Google Scholar  

Cavalli-Sforza, L. L. & Bodmer, W. The Genetics of Human Populations (Freeman, San Francisco, 1971; Dover, New York, 1999).

Google Scholar  

Risch, N. & Merikangas K. The future of genetic studies of complex human diseases. Science 273 , 1516–1517 (1996).

Roseman, C. C. Detecting inter-regionally diversifying natural selection of modern human cranial form using matched molecular and morphometric data. Proc. Natl Acad. Sci. 101 , 12824–12829 (2004).

Reich, D. E. et al. Linkage disequilibrium in the human genome. Nature 411 , 199–204 (2001).

Wall, J. D. & Pritchard, J. K. Haplotype blocks and linkage disequilibrium in the human genome. Nature Rev. Genet. 4 , 587–597 (2003).

McVean, G. A. T. et al. The fine scale structure of recombination rate variation in the human genome. Science 304 , 581–584 (2004).

Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, New Jersey, 1994).

Cavalli-Sforza, L. L. The DNA revolution in population genetics. Trends Genet. 14 , 60–65 (1998).

Underhill, P. A. et al. The phylogeography of Y chromosome binary haplotypes and the origins of modern human populations. Ann. Hum. Genet. 65 , 43–62 (2001).

Edmonds, C. A., Lillie, A. S. & Cavalli-Sforza, L. L. Mutations arising in the wave front of an expanding population. Proc. Natl Acad. Sci. USA 101 , 975–979 (2004).

Cavalli-Sforza, L. L. & Edwards, A. W. F. Analysis of human evolution. Genet. Today Proc. 11 th Int. Congress Genet. 3 , 923–933 (1964).

Menozzi, P., Piazza, A. & Cavalli-Sforza, L. L. Synthetic gene frequency maps in Europe. Science 201 , 786–792 (1978).

Cavalli-Sforza, L. L. Some current problems in human population genetics. Am. J. Hum. Genet. 25 , 82–104 (1973).

Hirszfeld, L. & Hirszfeld, H. Essai d'application des methodes au probleme des races. Anthropologie 29 , 505–537 (1919) (in French).

Race, R. R. & Sanger, R. Blood Groups in Man (Blackwell Scientific, Oxford, 1975).

Pauling, L., Itano, A. H., Singer, S. J. & Wells, I. C. Sickle cell anemia, a molecular disease. Science 110 , 543–548 (1949).

Harris, H. The Principles of Human Biochemical Genetics 3rd edn (Elsevier; North Holland Biomedical Press, Amsterdam, 1980).

Cavalli-Sforza, L. L. et al. DNA markers and genetic variation in the human species. Cold Spring Harb. Symp. Quant. Biol. 51 , 411–417 (1987).

Cavalli-Sforza, L. L. (ed.) African Pygmies (Academic, Orlando, 1986).

Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155 , 945–959 (2000).

Bowcock, A. M. et al. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368 , 455–457 (1994).

Barbujani, G. et al. An apportionment of human DNA diversity. Proc. Natl Acad. Sci. USA 84 , 4516–4519 (1997).

Download references

Acknowledgements

This work has been made possible by donors of blood samples and cell lines to the Human Genome Diversity Project (HGDP) and the Center for the Study of Human Polymorphism (CEPH). The collaboration with CEPH has been a decisive contribution. Support for preparing the first African cell lines in the Stanford laboratory in 1984–1985 came initially from the Lucille P. Markey Trust, with later additions from a National Institutes of Health Institute for General Medical Science programme and the HGDP–CEPH initiative from the Ellison Medical Foundation. H. Cann, M. Feldman, H. Greely and M.-C. King are thanked for suggesting improvements to the manuscript.

Author information

Authors and affiliations.

Genetics Department, Stanford Medical School, Stanford, 94305, California, USA

L. Luca Cavalli-Sforza

You can also search for this author in PubMed   Google Scholar

Ethics declarations

Competing interests.

The author declares no competing financial interests.

Related links

Further information.

Fondation Jean Dausset — CEPH

International HapMap Project

Human Genome Diversity Project

Human Genome Organisation

Human Genome Project

Marcus Feldman's laboratory

National Research Council

Noah Rosenberg's web site

Stanford Human Population Genetics Laboratory

The mixture of two or more genetically distinct populations.

A method for localizing genes that are responsible for specific diseases by comparing the DNA of a selected set of patients who are believed to carry the same mutation/s because of their ancestral origin, with that of unrelated healthy controls from the same population.

An increase in the breadth to length ratio of the skull.

Processes of substantial demographical growth causing geographical expansions of a population. These are made possible by innovations that affect production of food, such as agro-pastoral economies and/or other improved technologies (for example, transportation, hunting and other weapons).

A set of genetic markers that show complete or nearly complete linkage disequilibrium; that is, they are inherited through generations without being changed by crossing-over or other recombination mechanisms.

A classical mathematical principle in population genetics used for testing random mating. It gives the expected frequencies of genotypes for a gene after one generation of random mating if the parental allele frequencies are known.

The tendency for markers that are physically close to each other on the same chromosome to be transmitted to the progeny together, as there is a low probability that they will be split through recombination.

Mapping genes by typing genetic markers in families to identify regions that are associated with disease or trait values that occur within pedigrees more often than is expected by chance. Such linked regions are more likely to contain a causal genetic variant.

Lymphoblastoid cell lines are obtained from B lymphocytes, a fraction of white cells from blood that can be grown indefinitely in the laboratory after special treatment of the cells with Epstein–Barr virus.

Microsatellites are tandem repeats of short nucleotide sequences (2–6 bases). They have a large number of alleles compared with SNPs, owing to a much higher mutation rate.

Rights and permissions

Reprints and permissions

About this article

Cite this article.

Cavalli-Sforza, L. The Human Genome Diversity Project: past, present and future. Nat Rev Genet 6 , 333–340 (2005). https://doi.org/10.1038/nrg1596

Download citation

Issue Date : 01 April 2005

DOI : https://doi.org/10.1038/nrg1596

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

This article is cited by

Characterization of danube swabian population samples on a high-resolution genome-wide basis.

  • Zsolt Bánfai
  • Erzsébet Kövesdi
  • Béla Melegh

BMC Genomics (2023)

The Cost and Benefit of Regional Cultural Diversity on the Income of Rural Workers: Evidence from China

  • Jianqing Ruan

Social Indicators Research (2023)

HMOX1 STR polymorphism and malaria: an analysis of a large clinical dataset

  • Fergus Hamilton
  • Ruth Mitchell
  • Nicholas J. Timpson

Malaria Journal (2022)

Genetic data sharing and artificial intelligence in the era of personalized medicine based on a cross‐sectional analysis of the Saudi human genome program

  • Abdulmajeed F. Alrefaei
  • Yousef M. Hawsawi
  • Muhammed A. Bakhrebah

Scientific Reports (2022)

Achieving equity through science and integrity: dismantling race-based medicine

  • Joseph L. Wright
  • Gary L. Freed
  • Tamera D. Coyne-Beasley

Pediatric Research (2022)

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

human genome project literature review

Advances in the Human Genome Project. A review

Affiliation.

  • 1 Department of Biological Sciences, Clark Atlanta University, GA 30314, USA.
  • PMID: 9540065
  • DOI: 10.1023/a:1006834711989

While celebrating its fifth official birthday last year it seems that the Human Genome Project (HGP) has and will continue to yield important biochemical information to mankind. It is exhilarating to think about the transition from studying genome structure to understanding genome function. The collective actions of information dessimination, technology development for efficient and faster sequencing, high-volume sequencing and developing model organisms has led to its success sofar. Various genome-wide STS-based human maps were completed in 1995, including a genetic map, a YAC map, a RH map with, and an integrated YAC-RH genetic map. These maps provide comprehensive frameworks for positioning additional loci, with the current genetic and RH maps spanning essentially 100% of the human genome and the YAC maps covering 95%. Few genes, however, have yet been localized on these framework maps. To date the Human Genome Project has experienced gratifying success. The technology and data produced by the genome project will provide a strong stimulus to broad areas of biological research and biotechnology. However, enormous challenges remain.

Publication types

  • Research Support, U.S. Gov't, P.H.S.
  • Chromosome Mapping
  • Cloning, Molecular
  • Databases, Factual
  • Human Genome Project*

Grants and funding

  • 3G12RR03062/RR/NCRR NIH HHS/United States
  • S06GM08247/GM/NIGMS NIH HHS/United States

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • J Med Internet Res
  • v.23(9); 2021 Sep

Logo of jmir

Exploring the Use of Genomic and Routinely Collected Data: Narrative Literature Review and Interview Study

Helen daniels.

1 Population Data Science, Swansea University, Swansea, United Kingdom

Kerina Helen Jones

Sharon heys, david vincent ford, associated data.

Summary of findings.

Advancing the use of genomic data with routinely collected health data holds great promise for health care and research. Increasing the use of these data is a high priority to understand and address the causes of disease.

This study aims to provide an outline of the use of genomic data alongside routinely collected data in health research to date. As this field prepares to move forward, it is important to take stock of the current state of play in order to highlight new avenues for development, identify challenges, and ensure that adequate data governance models are in place for safe and socially acceptable progress.

We conducted a literature review to draw information from past studies that have used genomic and routinely collected data and conducted interviews with individuals who use these data for health research. We collected data on the following: the rationale of using genomic data in conjunction with routinely collected data, types of genomic and routinely collected data used, data sources, project approvals, governance and access models, and challenges encountered.

The main purpose of using genomic and routinely collected data was to conduct genome-wide and phenome-wide association studies. Routine data sources included electronic health records, disease and death registries, health insurance systems, and deprivation indices. The types of genomic data included polygenic risk scores, single nucleotide polymorphisms, and measures of genetic activity, and biobanks generally provided these data. Although the literature search showed that biobanks released data to researchers, the case studies revealed a growing tendency for use within a data safe haven. Challenges of working with these data revolved around data collection, data storage, technical, and data privacy issues.

Conclusions

Using genomic and routinely collected data holds great promise for progressing health research. Several challenges are involved, particularly in terms of privacy. Overcoming these barriers will ensure that the use of these data to progress health research can be exploited to its full potential.

Introduction

The progression of genomics in the last few decades has been remarkable. Since 2001, when the Human Genome Project mapped and sequenced virtually every gene in the human genome, genetic sequencing technology has advanced rapidly in both the public and private domains. Next-generation sequencing costs have plummeted by almost 100%, and research opportunities have grown exponentially as a result [ 1 ]. For example, a simple search in the medical database, PubMed, shows that research on genomics has more than quadrupled since 2000, from around 340,000 published articles on this topic growing to 1.5 million by 2020. This increase has translated into quicker diagnoses, better outcomes, and more effective health care for patients [ 2 - 4 ]. Great strides have been made in cancer research, for example, where patients are now being treated according to their own or the tumor’s genomic data [ 5 ].

Being able to use genomic data in conjunction with routinely collected data holds even greater potential to advance knowledge by including factors wider in scope. Precision medicine requires that novel correlations of genotype, phenotype, and the environment be identified to inform new methods for diagnosing, treating, and preventing disease in a way that is responsive to the individual [ 5 ]. Knowledge of gene-environment interactions can also contribute on a population level by informing health and public health services in areas such as service planning, population genetic testing, disease prevention programs, and policy development [ 6 ]. Routinely collected data, electronic health records (EHRs) in particular, already hold vast amounts of clinical and environmental information on large numbers of people and preclude the need for lengthy and expensive data collection. Adding these phenotypic data to knowledge about a person’s genome can elucidate new knowledge, such as that about gene-environment and gene-drug interactions, and can thus provide a richer understanding of health and disease [ 7 ].

Increasing the use of genomic data for health research is a high government priority to understand and address the causes of disease. The potential of integrating genomic and routine data sets has been recognized in the United Kingdom by the Welsh Government [ 8 ] via their Genomics for Precision Medicine Strategy and by Genomics England [ 9 ] with the inception of the 100,000 genome project in 2018, both of whom are investigating ways to link genomic data and EHRs. From a more international perspective, the former president of the United States, Barack Obama, launched the Precision Medicine Initiative to improve individualized care by combining genomic data and EHRs with diet and lifestyle information from US citizens [ 10 ]. The UK Chief Medical Officer, Dame Sally Davies’s “genomic dream ” of mainstreaming genomic medicine into National Health Service (NHS) standard care is becoming ever closer, which means that data from a person’s genome will likely be directly recorded into EHRs, making this type of research far more accessible [ 11 ].

As this field prepares to move forward, it is important to take stock of the current state of play in order to highlight new avenues for development, identify challenges, and ensure that adequate data governance models are in place for safe and socially acceptable progress. Previous work has examined the benefits and logistical challenges of integrating genomic and routinely collected data in health care practice, but less is known of this specifically in a research setting [ 6 , 12 , 13 ]. Therefore, our objective is to add to this literature by conducting a narrative literature review and a series of interviews that would provide an outline of the use of genomic data alongside routinely collected data in health research to date. It focuses on the types of data that have been used, the role of routinely collected data in these studies, the data sources, how researchers access the data, and the challenges surrounding their use. This will inform further work in developing a framework for working with genomic and routinely collected data [ 14 ].

Literature Review

First, we conducted a literature search of research that used genomic data in conjunction with routinely collected data. We define routinely collected data as data collected as a matter of course and not specifically for research [ 8 ]. Genomic data refer to the data generated after processing a person’s genome, in full or in part, for example, by sequencing [ 7 ]. Studies were eligible for inclusion if they had used both types of data in combination to answer a health research question. We included studies of any design published in either peer-reviewed journals or gray literature in the English language.

We searched the following databases to identify these studies: PubMed, Ovid, CINAHL, OpenGrey, CENTRAL, LILACS, and Web of Knowledge from inception until January 31, 2019. We also searched for books, gray literature, and websites. We used a piloted search strategy that included keywords representing genetic and routinely collected data, and the following is the search strategy used for PubMed and modified for use with the remaining databases:

  • (Gene OR Genetic* OR Genome* OR Genomic*)
  • ( Administration record* OR Anonymised OR Anonymized OR Anonymisation OR Anonymization OR Big data OR Clinical record* OR Data linkage OR Data mining OR Data science OR Education record* OR Ehealth OR EHR OR Electronic data OR Health record* OR Housing record* OR Encrypt* OR Insurance OR Linked data OR Medical record* OR Patient record* OR Prison record* OR Publically available OR Publicly available OR Register OR Registry OR Registries OR Routine data OR Routinely collected OR Safe haven )

This search in PubMed resulted in more than 50,000 hits. After initial pilot screenings, for example, by restricting to only publications within the last 10 years, it was clear that, given the number and heterogeneity of potentially relevant articles, neither a systematic review nor a meta-analysis would be practical. We took a pragmatic approach by scanning these articles to retrieve information on the following items until we reached data saturation, when no new information appeared in the text: types of genomic data; types and roles of routinely collected data; data sources; and data access models, that is, how researchers access the data. We chose examples from each criterion to ensure that we included a range of health conditions and presented them in a narrative format, which followed the format of the criteria given above.

To understand the use of genomic and routinely collected data in context, we recruited a purposive sample of individuals who have been involved in leading research projects using a combination of genomic and routinely collected data. We identified potential participants from the literature search outlined above and sent 19 interview invitations via email. Reminders were sent after 2 weeks, and if there was no response, we made no further contact. In total, 11 individuals agreed to participate in either an in-person or teleconferencing interview, depending on their geographical location. Participants were involved with the following projects: the Swansea Neurology Biobank (Swansea, Wales); Dementia Platform UK (Oxford, England); PsyCymru (Swansea, Wales); UK Biobank (Stockport, England); BC Generations Project (British Columbia, Canada); the Province of Ontario Neurodevelopmental Disorders (POND) Network, IC/ES (Ontario, Canada); Electronic and Medical Records and Genomics (eMERGE) Network (Vanderbilt, United States); and the Sax Institute’s 45 and Up study (New South Wales, Australia). We omitted any further information on the participants to maintain their anonymity. We developed interview questions with our advisory board: a group of UK geneticists and data scientists who were interested in using genomic data and our discussions centered around these:

  • What is the purpose of integrating the genetic data with health data?
  • What types of genetic data are being included?
  • Were there particular approvals you had to obtain? And if yes, what were these?
  • What were the main challenges encountered?
  • How did you address the challenges?
  • What is your main model for storing these data?
  • What access model(s) do you use? For example, safe room only, remote access, data released externally to researchers;
  • What are the conditions for access to data?

We followed up interviews by email if any answers needed clarification.

Textbox 1 provides a summary of the results of the literature review and interviews. Multimedia Appendix 1 Table S1 [ 15 - 32 ] provides a detailed summary of the results.

Summary of results.

Type of genomic data

  • Single nucleotide polymorphisms
  • Polygenic risk scores
  • Gene activity scores
  • DNA methylation status

Purpose of combining data

  • Genome-wide association studies
  • Phenome-wide association studies
  • Longitudinal studies
  • Candidate gene studies
  • Gene profiling studies
  • Exploratory studies

Types of routinely collected data

  • Electronic health records
  • Disease registry data
  • Disease registries
  • Mortality registers
  • Deprivation indices
  • Health insurance

Role of routinely collected data

  • Identifying cases and controls
  • Baseline data
  • Deep phenotyping
  • Long-term follow-up
  • Sociodemographic information

Sources of data

Governance models for data access

  • Publicly available on the web
  • Released to researchers
  • Data safe havens
  • Data collection
  • Data storage and costs
  • Technical and/or software issues
  • Data privacy and data protection laws

We included 19 studies in this literature review that provided broad examples of the different types of genomic and routinely collected data that can be used together to answer a health-related research question. The countries where this research was based were the United Kingdom [ 15 - 20 ], China [ 21 ], United States [ 22 - 29 ], Canada [ 19 , 30 , 31 ], and Australia [ 32 ].

Types of Genomic Data

Examples found of the types of genomic data used in these studies included single nucleotide polymorphisms (SNPs), gene activity scores, and DNA methylation status. The most frequently used were SNPs [ 16 , 17 , 19 , 23 - 29 ], which represent a single base-pair change in the DNA sequence and are highly granular; hence, they are popular in health research [ 33 ]. Research studies tend to refer to SNPs by their reference SNP number, a unique identifier given to a SNP, or a cluster of SNPs by the National Center for Biotechnology (NCBI) [ 34 ].

Polygenic risk scores are closely related to SNPs and were used in 2 of the studies included in this review [ 15 , 16 ]. These predict a person’s likelihood of developing a particular disease based on the cumulative effect of a number of genetic variants [ 15 ]. Our literature search found other instances of quantitative measures of gene activity combined with routine data, including the 21-gene Recurrence Score, which measures the activity of 21 genes (16 cancer-related and 5 reference) for patients with breast cancer [ 33 ] and the enzyme activity score for the CYP2D6 (cytochrome P450 family 2 subfamily D member 6) gene, which codes for an important drug-metabolizing enzyme, and is highly polymorphic in humans [ 34 ]. Rusiecki et al [ 29 ] measured changes in DNA methylation status in their participants to investigate whether this predicted posttraumatic stress disorder in US military service members.

Purpose of Combining Genomic and Routine Data

Conducting genome-wide association studies (GWAS) [ 15 , 17 - 19 , 21 ], phenome-wide association studies (PheWAS) [ 24 , 25 ], and a combination of both [ 26 , 27 ] were the main purposes of combining genomic and routine data. Both these methods use powerful statistical techniques to find associations between genetic variants (SNPs) and phenotypes, which can then be used to predict the genetic risk factors of disease, levels of gene expression, and even social and behavioral characteristics such as educational attainment, impulsivity, and recreational drug experimentation [ 35 , 36 ]. Given the number of tested associations, these studies require large sample sizes to yield enough statistical power to detect differences in genetic variation between cases and controls [ 37 ]. Other study designs included longitudinal studies [ 16 , 18 , 30 - 32 ], a candidate gene study [ 23 ], case control studies [ 19 , 29 ], a gene profiling study [ 22 ], and an exploratory study [ 28 ]. Each of these studies was designed either to identify genetic risk factors of the disease or to investigate drug safety or efficacy ( Multimedia Appendix 1 Table S1).

Types and Role of Routinely Collected Data

EHRs appear to be, by far, the most common form of routinely collected data used in the studies identified in this review. For the type of research discussed here, data in EHRs have been used to identify eligible participants, for phenotyping, and to provide long-term follow-up on specific health outcomes. These data are collected as a matter of course in health care systems, and depending on their country of origin, EHR content can vary, although usually this digital record will include the patient’s name, address, demographics, medical history, care preferences, lifestyle information (such as diet, exercise, and smoking status), and free-text notes [ 38 ]. An example of EHR data used in the identified studies was the International Classification of Disease (ICD) coding. This allows clinicians to record the status of a patient in a standardized way, whether it is for disease, disorder, injury, infection, or symptoms [ 39 ]. Hebbring et al [ 26 ] used ICD codes in EHRs to identify appropriate cases and controls for their PheWAS study of the HLA-DRB1*150 gene involved in immune regulation. Rusiecki et al [ 29 ] used ICD codes to identify cases with a postdeployment diagnosis of posttraumatic stress disorder in their study on the link between gene expression and posttraumatic stress disorder in US military service members.

EHRs were also used to describe and validate a phenotype of interest, which is particularly important for PheWAS studies. The large number of phenotypes used in these analyses need to be clearly defined to ensure that any associations made with genetic variants are precise and replicable [ 40 ]. For example, Breitenstein et al [ 23 ] used EHRs to define the type 2 diabetes phenotype needed for their candidate gene study based on diagnoses, medications, and laboratory tests. The algorithm used to achieve this was developed by the eMERGE Network (see below for more details on this organization) and has been used successfully many times since [ 41 ].

In addition to providing a baseline snapshot of the patient, EHRs are also longitudinal in nature, and this makes them ideally placed to provide long-term follow-up to study participants. An example of this is the Genetics and Psychosis (GAP) study, which looked at a purported association between a variant of the ZNF804A gene and poor outcomes after first-episode psychosis [ 20 ]. Using individual-level linkage between EHRs and genotype data, this study followed the clinical outcomes for 291 patients over a period of 2 years, and subsequently found strong evidence for their hypothesis.

Other examples of routine data used in genomic research include disease registry morbidity and mortality records [ 20 , 22 ]. Disease registries collect information on clinical outcomes and care for a specific patient population over time. EHRs often feed data into these registries, but registry data can also include patient-reported outcomes and other biometric data, and therefore provide a more holistic view of the patient than an EHR would in isolation, for example, the UK MS Register [ 42 ]. Routine data sets need not be individual-level or person-based to prove useful in genetic research. A case in point is the Scottish Index of Multiple Deprivation, which gives an indication of a geographical area’s socioeconomic deprivation based on data about employment, income, health, education, housing, crime, and access to services [ 43 ]. Clarke et al [ 15 ] were able to link participants’ postcodes to the Scottish Index of Multiple Deprivation and generate socioeconomic deprivation variables to investigate their association with polygenic risk scores for alcohol dependence.

Sources of Genomic and Routinely Collected Data

Although individual research projects often collect biological samples and generate their own genomic data for combination with routinely collected data [ 19 , 29 ], studies can also make use of the many sources of genomic and routinely collected data already available. Our interview participants spoke to us about the different data sources they used in their research and provided details on the use of these data and the participants of their projects. From this information, there seem to be two general categories of these sources: databanks and biobanks, the former where only the data are stored, and the latter, which store both biological samples and genomic data. Below are illustrative examples; therefore, this is not an exhaustive list.

Dementias Platform UK

Dementias Platform UK (DPUK) [ 16 ] is a data portal funded by the Medical Research Council that hosts the data of 2 million people from over 40 cohorts relevant to dementia research. Combining these data enhances the individual research power of each study and brings together knowledge from a number of stakeholders to facilitate and accelerate new discoveries. Examples of these cohorts are the GENetic Frontotemporal Dementia Initiative (GENFI; genotype data for GRN , MAPT, and C9ORF72 genes) and the Genetic and Environmental Risk in Alzheimer’s Disease (GERAD) Consortium (whole exome sequences) [ 16 ]. Presently, DPUK provides EHR linkage for Welsh participants via the Secure Anonymized Information Linkage (SAIL) databank [ 44 ].

UK Biobank [ 45 ] is a national resource with data from 500,000 participants aged between 40 and 69 years who have donated blood, urine, and saliva samples, have undergone a number of baseline measures, and have provided detailed health and behavioral information about themselves. Genomic data are available for 488,000 participants and comprise SNPs, genotypes, and haplotypes. The UK Biobank holds a number of routine data sets, including hospital inpatient episodes, cancer registrations, and deaths. Studies using the UK Biobank genomic and routinely collected data include genome-wide meta-analyses of depression [ 17 ] and identifying candidate gene and disease associations that could help predict adverse drug reactions [ 26 ].

Personal Genome Project

The Personal Genomes Project (PGP) is a databank founded in 2005 at Harvard University and now extends worldwide. It provides a web-based platform for individuals (over 100,000 people to date) to share their genomic data publicly, along with their EHRs, and other trait information to progress science without many of the governance restrictions of traditional research. Most of the genomic data on the PGP database are in SNP format, although files of raw sequence data are available for some participants [ 43 ].

eMERGE Network

The eMERGE Network is a consortium of American medical institutions whose goal is to use EHR data in combination with a variety of genomic data types to advance translational research. The network also releases its genomic data, including GWAS, whole genome, and whole exome sequence data, along with a subset of phenotypic elements to the broader community of researchers via the dbGaP—an NCBI [ 46 ] database of genotypes and phenotypes. Through this mechanism, any research project can be applied to the uploaded data. A wealth of publications have resulted from eMERGE’s work; these are available to view on the web [ 24 , 25 , 47 ].

POND Network

The POND network is an IC/ES [ 31 ] initiative based in Ontario, Canada, and involves a cohort of children and young people (the total number of the sample is approximately 3000) with a neurological development disorder, with a particular focus on autism. This is a highly phenotyped cohort with all participants having undergone multiple clinical tests, including those for attention-deficit/hyperactivity disorder, obsessive-compulsive disorder, and family history and demographic data. A subset (n=667) provided consent for linkage with administrative data held in IC/ES via health card numbers. The aim is to identify subgroups of autism based on the co-occurrence of other developmental conditions, other comorbidities, and health service use, and to characterize these groups based on clinical attributes and genomics. The network data are project-level only, although it is anticipated that linkage of genomic data to administrative and health data will become more routine in the future.

BC Generations Project

The BC Generations Project [ 48 ] is British Columbia’s (BC) largest ever health study and is part of a national initiative—the Canadian Partnership for Tomorrow Project—to aid researchers in answering questions about how environment, lifestyle, and genes contribute to cancer and other chronic diseases. Almost 30,000 participants were involved in the project, and they provided baseline information about their health, diet, lifestyle, and medical and family history. Many have also donated blood and urine samples, and the type of genomic data generated from these samples is based on the needs of the researchers (subject to approval). The BC cohort is one of several provincial cohorts that can be combined for national studies or can be linked to provincial administrative data via PopDataBC [ 49 ].

Sax Institute and the 45 and Up Study

Based in Australia, the main business of the Sax Institute is to manage the 45 and up study, which has been following 260,000 people for over 12 years who provide both routinely collected and self-reported health data [ 32 ]. The aim of this study was to collect samples from 50,000 individuals for full genome sequencing. The Sax Institute has a partnership with the Garvan Institute of Medical Research [ 50 ], which acts as a genome sequencing facility and makes data available for research subject to approval. The Garvan Institute retains all the genomic data, but with data linkage to the Sax Institute.

Data Governance Access Models

We surveyed the data access models and information governance systems of genomic and routine data sources identified by our literature search and interviews. From the additional information given to us by our interview participants, we were able to categorize data access models as follows: (1) publicly available on the web, (2) released to researchers, or accessed via a (3) data safe haven.

Publicly Available on the Web

There are a wealth of free, genomic data sources on the web, made available because of individual projects or from the pooled results of a variety of different projects. These data are either downloadable or viewable on the web. For this type of resource, data tend to be at the gene or variant level (eg, GIANT Consortium [ 51 ]; GWAS Catalog [ 52 ]), but some do hold individual-level data, eg, Database of Genomic Variants [ 53 ]. The Personal Genome Project, described above, is the only data source that provides identifiable, individual-level genomic data [ 43 ]. All other data were deidentified.

Released to Researchers

Currently, the most common way for genomic data to be accessed for research is through secure electronic file transfer to the researcher. This generally occurs after successful application to an internal review board (IRB) and signed data use agreement (DUA). All of the published research studies identified by our literature search used this model, unless the genomic data were generated specifically for that particular project. For instance, Cronin et al [ 24 ] received genomic data regarding 54 SNPs, and also demographic, vital sign, and billing data derived linked EHRs by the eMERGE network. Other biobanks using this model include Mayo Genome Consortia [ 23 ], China Kadoorie Biobank [ 21 ], UK Biobank [ 17 , 26 ], BioVu [ 24 ], and Generation Scotland [ 17 ], although access to BioVu data is only granted to Vanderbilt faculty members [ 54 ]. In addition to IRB and DUA procedures, the use of individual-level EHR data supplied by Generation Scotland also requires an application to the NHS Research Ethics Committee [ 55 ].

Data Safe Havens

The use of large-scale population data in health research has become increasingly popular in the last few decades, and this has seen the evolution of data safe havens as a way to ensure its safe and secure use. Lea et al [ 56 ] defined data safe havens as a system that invokes procedural, technical, and physical controls, including access to data within a secure environment (rather than data release), in order to safeguard the identities of people providing the data. Despite there being a greater tendency in the literature for releasing data to researchers, we have seen from our work with interview participants, a trend toward using a data safe haven system for both younger and more well-established organizations.

The PsyCymru study [ 18 ] and the Swansea Neurology Biobank [ 19 ] have deposited polygenic risk scores and SNPs, respectively, into the SAIL Databank [ 44 ], a data safe haven based at Swansea University. The SAIL Databank provides remote access to many linkable anonymized data sets, and both of these studies have used SAIL to link their genetic data and other phenotypic data to EHRs and other routinely collected data. These data are available only for project access, and currently cannot be shared. However, their intention is to make linked genetic and routinely collected data available for research in the near future [ 57 , 58 ].

The Sax Institute operates in a similar way to the SAIL Databank in that it provides access to genomic and health data via a virtual lab by remote access anywhere in the world. Data access requires approval by two data access committees: one at the Garvan Institute and one at the Sax Institute. The BC Generations Project said, “should the researcher access the project’s data via PopDataBC then they would only be allowed to use the data on within the secured research environment” (Participant 5, BC Generations Project).

For the UK Biobank, the current modus operandi releases anonymized data externally to researchers. However, they stated that this is unlikely to be sustainable because genomic data files are too large. The UK Biobank will likely be changing to a remote access model in the near future. The eMERGE Network also confirmed that they are experimenting with remote access into a data safe haven. One of our participants who used IC/ES data explained that “as the data are considered highly sensitive, access is only by an ICES analyst, with results provided to the project lead” (Participant 4, IC/ES).

Combining genomic and routine data does not come without its challenges. Our participants, who are currently working with these data, spoke to us about these different challenges during their interviews, and their experiences are summarized below.

Data Collection

A challenge described by several participants concerned data collection. Conducting long-term follow-up over long periods and at regular intervals means that participants need to be invited and reconsented to provide more blood and other health information. In addition, poor quality sequence alignment could render the samples useless, and with repeated use by researchers, blood samples will eventually become exhausted. Each of these issues necessitates a lengthy and expensive process of resampling thousands of individuals and imposes a burden on participants. A possible solution to this, as one participant suggested, is to “join up with others biobanks in order to work towards epidemiologically valid sample sizes” (Participant 1, 45 and Up Study)

Data Storage and Costs

Genomic data are huge, approximately 90 GB for the raw data of 1 whole genome, and this is often the source of technical issues surrounding its use [ 59 ]. One participant told us that their main challenge was data storage capacity and that “it may no longer be cost or space efficient for organisations to hold multiple datasets” (Participant 2, UK Biobank). They go on to explain that there is certainly a case to be made for storing genomic data as VCF files only, which keep record only of gene sequence variations and are much smaller and easier to work with. However, this restricts the type of analyses possible, particularly in the advent of new discoveries about the anatomy of the genome. There may be a need to find solutions regarding storage space for raw genomic data and for specialist platforms required to conduct analyses.

Technical and/or Software Issues

Many of the challenges faced by participants working with these data are technical in nature. Participant 3 (eMERGE) described creating a sequencing platform from the ground up, which was much longer than initially projected. Furthermore, they described that multiple sequencing centers needed harmonization, as well as the needs across sites and projects for network-wide data collection. Another participant found difficulties surrounding analysis software and analyst capacity, since this is a very specialized skill (Participant 4, IC/ES). Researchers often conduct genetic data analysis using publicly available, downloadable software applications, which are subject to frequent updates for improvements. This is a challenge to incorporate this software and keep it up to date. Several other participants spoke of similar software issues. Potential solutions discussed during our interviews were to upscale and have dedicated servers for genomic data, as well as to install specialist toolsets within the secure research environment.

Data Privacy and Data Protection Laws

Given the possibility of identifying individuals from genomic data [ 60 ], privacy is of primary concern. We were told that, for one research project using IC/ES data (Participant 4), the privacy approval group was concerned about identifiability due to the genomic data being unique, particularly where there are rare variants. This means that the genomic data contained in variant call files have been brought into the databank to show the feasibility of transfer but has to be integrated into the analytical platform.

The General Data Protection Regulation [ 61 ] in the United Kingdom states that if researchers are to rely on the lawful basis of consent to process medical genomic data, the reasons for doing so must be described in an explicit and transparent way. The rapid decrease in the cost of whole genome sequencing in the last few decades has opened many new avenues for genetic research, but this means that it is impossible to predict what this research will actually look like in the future [ 55 ]. One participant felt that this means they would need to seek consent from the participant for each new research proposal [ 62 ], rather than just once when their sample was taken. They expressed concerns that “continued and lengthy re-contact with participants was not only costly and difficult, it may also be invasive and burdensome” (Participant 11, Swansea Neurology Biobank). Also mentioned was Canada’s antidiscrimination laws (Bill S-201: Genetic Nondiscrimination Act) to avoid prejudice on the grounds of genetics (including insurance and employability) [ 63 ] (Participant 5, BC Generations Project). Several participants believed that the public acceptability of their work was important, but they “weren’t sure how to go about ensuring it” (Participant 7, Dementia Platform UK).

Principal Findings

Using genomic and routinely collected data holds great potential for health research. The genomic data we identified in our literature review included SNPs, polygenic risk scores, and gene activity scores. Routine data primarily consisted of EHRs, but we did find other routine data types including registry data and deprivation indices that had been combined with genomic data. This paper shows how genomic research has progressed in recent years—from basic GWA and PheWA studies—to more complex methodologies in which health records are linked to genomic data at the individual level. Associations between genetic variants and phenotypes, identification of drug targets, knowledge about drug toxicity, and effectiveness can all be studied from the combination of genomic and routine data and leveraged for public benefit.

The EHRs created during routinely collected care provide the large sample sizes needed for GWAS analysis without the need for costly and time-consuming prospective data collection. These larger samples allow for greater validity, generalizability, and yield adequate statistical power, while also minimizing participant burden and reducing attrition during follow-up [ 64 ]. In addition, these data sets are not limited to a circumscribed number of phenotypes in the way that a traditional research study might be [ 40 ]. This richness and diversity of EHR data means that the large number of phenotypes used in PheWAS can be very clearly defined (referred to as deep phenotyping), which enhances the accuracy of phenotype-genotype associations [ 40 ].

Most importantly, with regard to EHRs, perhaps, is the additional information that routinely collected data provide about an individual’s environment. This can include lifestyle factors, education, work history, pollution, and even traumatic events. Genetic determinism has long been rejected, and we now widely accept the powerful influence that lifestyle and the environment have on the way that our genes are expressed [ 65 - 67 ]. Precision medicine requires novel correlations not only between genotype and phenotype to be made but also with an individual’s environment to inform new methods for diagnosing, treating, and preventing disease in a way that is tailored to that individual. Linking genomic data to routine data promises to elucidate important findings for precision medicine research, which will enable researchers to understand the relationship between an individual’s genome and their complete life course [ 68 ]. More of these types of studies are needed for precision medicine to reach its full potential, to make the intricate genotype-phenotype associations needed to advance the understanding and treatment of disease.

From our interviews, we identified 3 main ways that researchers can access genomic and routine data: publicly available on the web, released to researchers, and via a data safe haven. We also observed that biobanks and databanks seem to be moving toward renouncing a data release model and instead favoring a data safe haven approach. Aside from potentially solving data storage issues, this will also help assuage privacy and governance concerns. Reidentification and disclosure from genomic data is possible [ 69 , 70 ] and can lead to many undesirable consequences: discrimination by health or life insurance companies, societal stigma, and the discovery of a genetic predisposition to a condition when one does not want to be told. This is complicated further because of the familial nature of genes, and disclosure could cause some devastating effects for biological relatives as well [ 71 - 73 ].

Despite this, we must keep in mind that simply because genomic data are unique, this in itself does not render it identifiable. Rather, the reidentification risk of genomic data depends on the way they are accessed, the type of analyses that are conducted, and the format in which the results are finally published [ 70 ]. This means that decisions to use certain data access models based on practicality and decreased costs are not sufficient. Genomic research relies heavily on human participation, and the public should be consulted to inform the way in which their data are accessed [ 74 ]. There is a plethora of activity taking place to consult the public about health research in general, and the success and acceptability of large-scale data research is owed, in part, to the extensive public engagement activities that have been taking place [ 75 , 76 ]. Organizations such as the Global Alliance for Genomics and Health [ 77 ] are engaging with the public on the use of genomic data in research, but there is still more work to be done regarding the use of genomic and routinely collected data [ 78 ].

Limitations

This review was not intended to be exhaustive or systematic; therefore, it does not include all health research studies that have used genomic and routinely collected data. We also excluded biobanks and databanks that do not house routine data; therefore, we only include examples here. This may have resulted in inadvertently excluding some countries and institutions where this research takes place and the types of genomic and routinely collected data that have been used. Not implementing systematic methods in study selection may have introduced bias in this review’s conclusions; however, the pragmatic approach used here was deemed sufficient to meet our objectives.

We included 19 studies in this review, only one of these involved data and research from a non-Western country (China) [ 21 ]. We did not come across any research that had taken place in low- to middle-income countries, although as this review was not conducted systematically, we may have unintentionally overlooked these. However, the absence of any studies from such countries may be, as Tekola-Ayele and Rotimi [ 79 ] explain, a result of having a limited number of well-trained genomic scientists and poor research infrastructure, and due to a less well-established routine data collection infrastructure and procedures such as EHRs [ 80 ].

Qualitative interviews may be subject to recruitment bias, which means that some viewpoints and experiences were excluded. In addition, there is limited information about participants; however, as this is a relatively small field of research, it was deemed necessary to maintain participants’ privacy.

Given the projected increase in the availability of genomic data, the potential to be obtained from its combination with routine health data is vast. This review has shown examples of what has been done in this field so far with, for example, GWAS and PheWAS plus other study designs. For fields such as pharmacogenomics, these methodologies need to be used further, where using routinely collected data will simplify the process of tracking longer-term outcomes of personalized medical treatments, and elucidate new findings on the effects of the environment on drug-gene interactions. Our take away from this study is that there are several challenges involved in using these data, particularly surrounding privacy. Therefore, it is imperative that appropriate data governance be documented and that public engagement activities take place to ensure socially acceptable practices. Overcoming these barriers will ensure that the use of these data to progress health research can be exploited to its full potential.

Acknowledgments

This study was funded by the UK Medical Research Council: MC_PC_16035.

Abbreviations

Multimedia appendix 1.

Conflicts of Interest: None declared.

  • Open access
  • Published: 19 April 2024

Single Cell Atlas: a single-cell multi-omics human cell encyclopedia

  • Paolo Parini 2 , 3 ,
  • Roman Tremmel 4 , 5 ,
  • Joseph Loscalzo 6 ,
  • Volker M. Lauschke 4 , 5 , 7 ,
  • Bradley A. Maron 6 ,
  • Paola Paci 8 ,
  • Ingemar Ernberg 9 ,
  • Nguan Soon Tan 10 , 11 ,
  • Zehuan Liao 10 , 9 ,
  • Weiyao Yin 1 ,
  • Sundararaman Rengarajan 12 ,
  • Xuexin Li   ORCID: orcid.org/0000-0001-5824-9720 13 , 14 on behalf of

The SCA Consortium

Genome Biology volume  25 , Article number:  104 ( 2024 ) Cite this article

577 Accesses

32 Altmetric

Metrics details

Single-cell sequencing datasets are key in biology and medicine for unraveling insights into heterogeneous cell populations with unprecedented resolution. Here, we construct a single-cell multi-omics map of human tissues through in-depth characterizations of datasets from five single-cell omics, spatial transcriptomics, and two bulk omics across 125 healthy adult and fetal tissues. We construct its complement web-based platform, the Single Cell Atlas (SCA, www.singlecellatlas.org ), to enable vast interactive data exploration of deep multi-omics signatures across human fetal and adult tissues. The atlas resources and database queries aspire to serve as a one-stop, comprehensive, and time-effective resource for various omics studies.

The human body is a highly complex system with dynamic cellular infrastructures and networks of biological events. Thanks to the rapid evolution of single-cell technologies, we are now able to describe and quantify different aspects of single cellular activities using various omics techniques [ 1 , 2 , 3 , 4 ]. Observing or integrating multiple molecular layers of single cells has promoted profound discoveries in cellular mechanisms [ 5 , 6 , 7 , 8 ]. To accommodate the exponential growth of single-cell data [ 9 , 10 ] and to provide comprehensive reference catalogs of human cells [ 11 ], many have dedicated to single-cell database or repository constructions [ 9 , 11 , 12 , 13 , 14 , 15 ]. These databases vary in purpose and scope: some served as data repositories for raw/processed data retrieval [ 11 , 12 , 14 ]; quick references to cell type compositions and cellular molecular phenotypes across tissues [ 11 , 16 , 17 ]; summarized published study findings for global cellular queries across tissues or diseases [ 9 , 13 , 18 ]; or simply web-indexed published results [ 19 ]. The aim of these resources is to provide immediate information sharing among the scientific communities and real-time queries of diverse cellular phenotypes, which, in turn, to accelerate research progress and to provide additional research opportunities.

However, majority of these databases often provide simple cellular overviews or signature profiles largely based on single-cell RNA-sequencing (scRNA-seq) data confined to limited multi-omics landscape [ 9 , 11 , 13 , 20 ]. The need for a database capable of conducting in-depth, real-time rapid queries of several single-cell omics at a time across almost all human tissues has not yet been met. This limitation has motivated us to build a one-stop single-cell multi-omics queryable database on top of constructing the multi-tissue and multi-omics human atlas.

Here, we present the Single Cell Atlas (SCA), a single-cell multi-omics map of human tissues, through a comprehensive characterization of molecular phenotypic variations across 125 healthy adult and fetal tissues and eight omics, including five single-cell (sc) omics modalities, i.e., scRNA-seq [ 21 ], scATAC-seq [ 22 ], scImmune profiling [ 23 ], mass cytometry (CyTOF) [ 24 , 25 ], and flow cytometry [ 26 , 27 ]; alongside spatial transcriptomics [ 28 ]; and two bulk omics, i.e., RNA-seq [ 29 ] and whole-genome sequencing (WGS) [ 30 ]. Prior to quality control (QC) filtering, we have collected 67,674,775 cells from scRNA-Seq, 1,607,924 cells from scATAC-Seq, 526,559 clonotypes from scImmune profiling, and 330,912 cells from multimodal scImmune profiling with scRNA-Seq, 95,021,025 cells from CyTOF, and 334,287,430 cells from flow cytometry; 13 tissues from spatial transcriptomics; and 17,382 samples from RNA-seq and 837 samples from WGS. We demonstrated through case studies the inter-/intra-tissue and cell-type variabilities in molecular phenotypes between adult and fetal tissues, immune repertoire variations across different T and B cell types in various tissues, and the interplay between multiple omics in adult and fetal colon tissues. We also exemplified the extensive effects of monocyte chemoattractant family ligands (i.e., the CCL family) [ 31 ] on interactions between fibroblasts and other cell types, which demonstrates its key regulatory role in immune cell recruitment for localized immunity [ 32 , 33 ].

Construction and content

An overview of the multi-omics healthy human map.

We conducted integrative assessments of eight omics types from 125 adult and fetal tissues from published resources and constructed a comprehensive single-cell multi-omics healthy human map termed SCA (Fig.  1 ). Each tissue consisted of at least two omics types, with the colon having the full spectrum of omics layers, which allowed us to investigate extensively the key mechanisms in each molecular layer of colonic tissue. Organs and tissues with at least five omics layers included colon, blood (whole blood and PBMCs), skin, bone marrow, lung, lymph node, muscle, spleen, and uterus (Additional file 2 : Table S1). Overall, the scRNA-seq data set contained the highest number of matching tissues between adult and fetal groups, which allowed us to study the developmental differences between their cell types. For scRNA-seq data, majority of the sample matrices retrieved from published studies have already undergone filtering to eliminate background noise, including low-quality cells which are most probable empty droplets. However, some samples downloaded retained their raw matrix form, which contained a significant amount of background noise. Consequently, before proceeding with any additional QC filtering, we standardized all scRNA-seq data inputs to the filtered matrix format, ensuring that all samples underwent the removal of background noise before further processing (Additional file 2 : Table S2). This preprocessing step resulted in the removal of 61,774,307 cells out of the original 67,674,775 cells in the downloaded scRNA-seq dataset, leaving us with 5,900,468 cells for subsequent QC filtering. Strict QC was then carried out to filter debris, damaged cells, low-quality cells, and doublets for single-cell omics data [ 34 ], as well as low-quality samples for bulk omics data. After QC filtering, 3,881,472 high-quality cells were obtained for scRNA-Seq; 773,190 cells for scATAC-Seq; 209,708 cells for multimodal scImmune profiling with scRNA-seq data; 2,278,550 cells for CyTOF; and 192,925,633 cells for flow cytometry data. For scImmune profiling alone, clonotypes with missing CDR3 sequences and amino acid information were filtered, leaving 167,379 unique clonotypes across 21 tissues in the TCR repertoires and 16 tissues in the BCR repertoires. For RNA-seq and WGS, 163 severed autolysis samples were removed, leaving 16,704 samples for RNA-seq and 837 for genotyping data.

figure 1

A multi-omics healthy human single-cell atlas. Circos plot depicting the tissues present in the atlas. Tissues belonging to the same organ were placed under the same cluster and marked with the same color. Circles and stars represent adult and fetal tissues, respectively. The size of a circle or a star indicates the number of its omics data sets present in the atlas. The intensity of the heatmap in the middle of the Circos plot represents the cell count for single-cell omics or the sample count for bulk omics. The bar plots on the outer surface of the Circos represent the number of cell types in the scRNA-seq tissues (in blue) or the number of samples in bulk RNA-seq tissues (in red)

Single-cell RNA-sequencing analysis of adult and fetal tissues revealed cell-type-specific developmental differences

In total, out of the 125 adult and fetal tissues from all omics types, the scRNA-seq molecular layer in the SCA consisted of 92 adult and fetal tissues (Additional file 1 : Fig. S1, Additional file 2 : Additional file 2 : Table S1), spanning almost all organs and tissues of the human body. We profiled all cells from scRNA-seq data and annotated 417 cell types at fine granularity, in which we categorized them into 17 major cell type classes (Fig.  2 A). Comparing across tissues, most of them contained stromal cells, endothelial cells, monocytes, epithelial cells, and T cells (Fig.  2 A). Comparing across the cell type classes, epithelial cells constituted the highest cell count proportions, followed by stromal cells, neurons, and immune cells (Fig.  2 A). For adult tissues, most of the cells were epithelial cells, immune cells, and endothelial cells; whereas in fetal tissues, stromal cells, epithelial cells, and hematocytes constituted the largest cell type class proportions. Of these 92 tissues from the scRNA-seq data, we carried out integrative assessments of these tissues (Figs. 2 and 3 ) to study cellular heterogeneities in different developmental stages of the tissues.

figure 2

scRNA-seq integrative analysis revealed similarity and heterogeneity between adult and fetal tissues. A Clustering of the 417 cell types from scRNA-seq data, consisting of 92 tissues based on their cell type proportion within each tissue group. Cell types were colored based on the cell type class indicated in the legend. The numbers in the bracket represent the cell number within the tissue group. B UMAP of the cells present in the 94 adult and fetal tissues from scRNA-seq data, colored based on their cell type class. C Phylogenetic tree of the adult (left) and fetal (right) cell types. Clustering was performed based on their top regulated genes. The color represents the cell type class. Distinct clusters are outlined in black and labeled

figure 3

In-depth assessment of the integrated scRNA-seq further revealed inter-and intra-group similarities between adult and fetal tissues. A Chord diagrams of the highly correlated (AUROC > 0.9) adult and fetal cell types. Each connective line in the middle of the diagrams represents the correlation between two cell types. The color represents the cell type class. B Top receptor-ligand interactions between cell type classes in adult tissues (left) and fetal tissues (right). Color blocks on the outer circle represent the cell type class, and the color in the inner circle represents the receptor (blue) and ligand (red). Arrows indicate the direction of receptor-ligand interactions. C 3D tSNE of the integrative analysis between scRNA-seq and bulk RNA-seq tissues. The colors of the solid dots represent cell types in scRNA-seq data, and the colors of the spheres represent tissues of the bulk data. T indicates the T cell cluster, and B indicates the B cell cluster. D Heatmap showing the top DE genes in each cell type class of the adult and fetal tissues. Scaled expression values were used. Color blocks on the top of the heatmap represent cell type classes. Red arrows indicate the selected cell type classes for subsequent analyses. E Top significant GO BP and KEGG pathways for the cell type classes in adult and fetal tissues. The size of the dots represents the significance level. The color represents the cell type class

For each cell type, we performed differential expression (DE) analysis for each tissue to obtain the DE gene (DEG) signature for each cell type. We assessed the global gene expression patterns between cell types across the tissues based on their upregulated genes (Additional file 2 : Table S3) for adult and fetal tissues (Fig.  2 C, Additional file 1 : Fig. S2). In adult tissues, immune cells (i.e., B, T, monocytes, and NK cells) with hematocytes, stromal cells, neurons, endothelial cells, and epithelial cells formed distinct cellular clusters (Fig.  2 C, Additional file 1 : Fig. S2A), demonstrating highly similar DEG signatures within each of these cell type classes, consistent with the clustering patterns in the previous scRNA-seq atlas [ 35 ]. In fetal tissues, segregation is comparatively less distinctive such that only a subgroup of epithelial cells formed a distinct cell type cluster, cells from the immune cell type classes as well as hematocytes coalesced to form another cluster, and stromal cells formed small clusters between other fetal cell types (Fig.  2 C, Additional file 1 : Fig. S2B), which could represent the similarity in gene expression with other cell types during lineage commitment of stromal cell differentiation [ 36 ].

We next investigated the underlying gene regulatory network (GRN) of the transcriptional activities of cell types across adult and fetal tissues [ 37 ]. We identified active transcription factors (TFs) detected for cell types within each tissue (AUROC > 0.1), and based on these TF signatures, we measured similarities between cell types for adult and fetal tissues (Additional file 1 : Fig. S3). For adult tissues, clustering patterns similar to Additional file 1 : Fig. S1A were observed (Fig.  2 C, Additional file 1 : Fig. S3A). In fetal tissues, two unique clusters, including immune cells with hematocytes and stromal cells, were observed (Additional file 1 : Fig. S3B). Higher similarity in transcription regulatory patterns of stromal cells was observed compared to their gene expression patterns. The concordance between gene expression and transcription regulatory patterns within adult and fetal tissues demonstrated a direct and uniform interplay between the two molecular activities. In terms of the varying TF and DEG clustering patterns between adult and fetal tissues, the adult cell types demonstrated more similar transcriptional activities within the cell type classes than the less-differentiated fetal cell types, which shared more common transcriptional activities.

We dissected the correlation pattern of the clusters shown in Fig.  2 C by drawing inferences from their highly correlated (AUROC > 0.9) cell-type pairs (Fig.  3 A). Specifically, for the immune cluster in adult tissues, monocytes accounted for most of the high correlations within the immune cell cluster, followed by T cells (Fig.  3 A). For fetal tissues, a high number of correlations was observed between the immune cells (i.e., mostly monocytes and T cells) and hematocytes (Fig.  3 A), which explained the clustering pattern observed in fetal tissues (Fig.  2 C). For fetal stromal cells, other than with their own cell types, large coexpression patterns were observed with the hematocytes and the epithelial cells, and a smaller proportion of correlations with other clusters (Fig.  3 A), which accounted for the small clusters of stromal cells formed between other cell types (Fig.  2 C, Additional file 1 : Fig. S2B).

To describe possible cellular networking between the cell type class clusters in Fig.  2 C, we inferred cell–cell interactions [ 38 ] based on their gene expression (Additional file 2 : Table S4), and variations between adult and fetal tissues were observed (Fig.  3 B). In adult tissues, many cell type classes displayed interactions with the neurons, in which they networked with epithelial cells through UNC5D/NTN1 interaction; with stromal cells through SORCS3/NGF; with T cells through LRRC4C/NTNG2; etc. (Fig.  3 B). Among the top interactions of fetal tissues, among the top interactions, monocytes actively network with other cells, such as via CCR1/CCL7 with hematocytes, CSF1R/CSF1 with stromal cells, and FPR1/SSA1 with epithelial cells.

We performed a pseudobulk integrative analysis of the cell types of the scRNA-seq data from 19 tissues found in both adult and fetal tissues, with the 54 tissues from the bulk RNA-seq data (Fig.  3 C) to compare single-cell tissues with the corresponding tissues in the bulk datasets. For cell types of scRNA-seq data, adult cell types formed distinct clusters of T cells, B cells, hematocytes, stromal cells, epithelial cells, endothelial cells, and neurons (Fig.  3 C). Fetal cell types, by comparison, formed a unique cluster of cell types separating themselves from adult cell types. Internally, a gradient of cell types from brain tissues to cell types from the digestive system was observed in this fetal cluster. Fusing the bulk tissue-specific RNA-seq data sets with the pseudobulk scRNA-seq cell types gave close proximities of the bulk brain tissues with the pseudobulk brain-specific cell types, such as neurons and astrocytes (Fig.  3 C). Bulk whole blood clustered with pseudobulk hematocytes, and bulk EBV-transformed lymphocytes clustered with pseudobulk B cells. Other distinctive clusters included bulk colon and small intestine clustered with pseudobulk colon- and small intestine-specific epithelial cells, and bulk heart clustered with pseudobulk cardiomyocytes and other muscle cells (Fig.  3 C).

Next, we conducted gene ontology (GO) of biological processes (BPs) and KEGG pathway analyses [ 39 , 40 , 41 , 42 ] of the top upregulated genes of each cell type class cluster (Fig.  3 D) found in Fig.  2 C. Multiple testing correction for each cell type class was performed using Benjamini & Hochberg (BH) false discovery rate (FDR) [ 43 ]. At 5% FDR and average log2-fold-change > 0.25 (ranked by decreasing fold-change), the top three most significant genes of the remaining cell type classes were each scanned through the phenotypic traits from 442 genome-wide association studies (GWAS) and the UK Biobank [ 44 , 45 ] to seek significant genotypic associations of the top genes with diseases and traits. Notably, for GO pathways, the most significant BPs for B and T cells in both adult and fetal tissues were similar (Fig.  3 E). In contrast, epithelial cells and neurons differ in their associated BPs between adult and fetal tissues. For KEGG pathways, adult and fetal tissues shared common top pathways in T cells and in epithelial cells (Fig.  3 E). Among the top genotype–phenotype association results of the top genes (Additional file 1 : Fig. S4), SNP rs2239805 in HLA-DRA of adult monocytes has a high-risk association with primary biliary cholangitis, which is consistent with previous studies showing associations of HLA-DRA or monocytes with the disease [ 46 , 47 , 48 , 49 , 50 ].

Multimodal analysis of scImmune profiling with scRNA-sequencing in multiple tissues

To decipher the immune landscape at the cell type level in the scImmune profiling data, we carried out an integrative in-depth analysis of the immune repertoires with their corresponding scRNA-seq data. The overall landscape of the cell types mainly included clusters of naïve and memory B cells, naïve T/helper T/cytotoxic T cells, NK cells, monocytes, and dendritic cells (Fig.  4 A) and mainly comprised immune repertoires from the blood, cervix, colon, esophagus, and lung (Additional file 1 : Fig. S5). On a global scale, we examined clonal expansions [ 51 , 52 ] in both T and B cells across all tissues. Here, we defined unique clonal types as unique combinations of VDJ genes of the T cell receptor (TCR) chains (i.e., alpha and beta chains) and immunoglobin (Ig) chains on T cells and B cells, respectively. Integrating clonal type information from both the T and B cell repertoires with their scRNA-seq revealed sites of differential clonal expansion in various cell types (Fig.  4 B and C, Additional file 1 : Fig. S5). In T cell repertoires, high proportions of large or hyperexpanded clones were found in terminally differentiated effector memory cells reexpressing CD45RA (Temra) CD8 T cells [ 53 , 54 ] and cytotoxic T cells, and a large proportion of them was found in the lung (Fig.  4 C, Additional file 1 : Fig. S5), which interplays with the highly immune regulatory environment of the lungs to defend against pathogen or microbiota infections [ 55 , 56 ]. MAIT cells [ 57 , 58 ] have also demonstrated their large or high expansions across tissues, especially in the blood, colon, and cervix (Additional file 1 : Fig. S5A), with their main function to protect the host from microbial infections and to maintain mucosal barrier integrity [ 58 , 59 ]. In contrast, single clones were present mostly in naïve helper T cells and naïve cytotoxic T cells. (Additional file 1 : Fig. S5B) and were almost homogeneously across tissues (Fig.  4 C). This observation ensures the availability of high TCR diversity to trigger sufficient immune response for new pathogens [ 60 ]. For the B cell repertoire in blood, most of these immunocytes remained as single clones or small clones, with a small subset of naïve B cells and memory B cells exhibiting medium clonal expansion (Additional file 1 : Fig. S5B).

figure 4

Multi-modal analysis of scImmune profiling with scRNA-seq revealed a clonotype expansion landscape in six tissues. A tSNE of cell types from the multi-modal tissues of the scImmune-profiling data. Colors represent cell types. Cell clusters were outlined and labeled. B tSNE of cell types from the multi-modal tissues of the scImmune-profiling data. Colors indicate clonal-type expansion groups of the cells. Cells not present in the T or B repertoires are shown in gray (NA group). C Stacked bar plots revealing the clonal expansion landscapes of the T and B cell repertoires across 6 tissues. Colors represent clonal type groups. D Alluvial plot showing the top clonal types in T cell repertoires and their proportions shared across the cell types. Colors represent clonotypes. E Alluvial plot showing the top clonal types in B cell repertoires and their proportions shared across the cell types. Colors represent clonotypes

Among the top clones (Fig.  4 D), TRAV17.TRAJ49.TRAC_TRBV6-5.TRBJ1-1.TRBD1.TRBC1 was present mostly in Temra CD8 T cells and shared the same clonal type sequence with cytotoxic T and helper T cells (Additional file 2 : Table S5). This top clone was found to be highly represented in the lung, and comparatively, other large clones of CD8 T cells were found in the blood (Additional file 1 : Fig. S5C). The top ten clones were found in Temra CD8 T cells of blood and lung tissues and cytotoxic T cells and helper T cells from blood, cervix, and lung tissues (Additional file 1 : Fig. S5C). Some of them exhibited a high prevalence of cell proportions in Temra CD8 T cells (Fig.  4 D). In the B cell repertoire of blood, the top clones were found only in naïve and memory B cells, with similar proportions for each of the top clones (Fig.  4 E).

Multi-omics analysis of colon tissues across five omics data sets

To examine the phenotypic landscapes and interplays between different omics methods and data sets, we carried out an interrogative analysis of colon tissue across five omics data sets, including scRNA-Seq, scATAC-Seq, spatial transcriptomics, RNA-seq, and WGS, to examine the phenotypic landscapes across omics layers and the interplays and transitions between omics layers. In the overview of the transcriptome landscapes in adult and fetal colons (Fig.  5 A and B), the adult colon consisted of a large proportion of immune cells (such as B cells, T cells, and macrophages) and epithelial cells (such as mucin-secreting goblet cells and enterocytes) (Fig.  5 A). In contrast, the fetal colon contained a substantial number (proportion) of mesenchymal stem cells (MSCs), fibroblasts, smooth muscle cells, neurons, and enterocytes and a very small proportion of immune cells (Fig.  5 B).

figure 5

In-depth scRNA-seq analysis revealed distinct variations between adult and fetal colons. A tSNE of the adult colon; colors represent cell types. B tSNE of the fetal colon; colors represent cell types. C Heatmap showing the correlations of the cell types of the MSC lineage from adult and fetal colons based on their top upregulated genes. The intensity of the heatmap shows the AUROC level between cell types. Color blocks on the top of the heatmap represent classes (first row from the top), cell types (second row), and cell type classes (third row). D Heatmap showing the correlations of the cell types of the MSC lineage from adult and fetal colons based on the expression of the TFs. The intensity of the heatmap shows the AUROC level between cell types. Color blocks on the top of the heatmap represent classes (first row from the top), cell types (second row), and cell type classes (third row). E Pseudotime trajectory of the MSC lineage in the adult colon. The color represents the cell type, and the violin plots represent the density of cells across pseudo-time. F Pseudo-time trajectory of the MSC lineage in the fetal colon. The color represents the cell type, and the violin plots represent the density of cells across pseudotime. G Heatmap showing the pseudotemporal expression patterns of TFs in the lineage transition of MSCs to enterocytes in both adult and fetal colons. Intensity represents scaled expression data. The top 25 TFs for MSCs or their differentiated cells are labeled. H Pseudotemporal expression transitions of the top TFs in the MSC-to-enterocyte transitions for both adult and fetal colons. I Heatmap showing the pseudotemporal expression patterns of TFs in the lineage transition of MSCs to fibroblasts in both adult and fetal colons. Intensity represents scaled expression data. The top 25 TFs for MSCs or their differentiated cells are labeled. J Pseudotemporal expression transitions of the top TFs in the MSC-to-fibroblast transitions for both adult and fetal colons

As there were fewer immune cells observed in the fetal colon as compared to the adult colon, we compared the MSC lineage cell types between the two groups. Based on their differential gene expression signatures (Fig.  5 C) and their TF expression (Fig.  5 D), the highly specialized columnar epithelial cells, enterocytes, for both molecular layers correlated well between adult and fetal colons, unlike other cell types, which did not demonstrate high correlations between their adult and fetal cells. Other than the enterocytes, adult and fetal fibroblasts were highly similar to MSCs in both transcriptomic and regulatory patterns (Fig.  5 C and D). We modeled pseudo-temporal transitions of MSC lineage cells, and similar phenomena were observed (Fig.  5 E and F). Both adult and fetal fibroblasts were pseudotemporally closer to MSCs, and the transitions were much earlier than other cells. Analysis across regulatory, gene expression, and pseudotemporal patterns showed in both adult and fetal colons that fibroblasts were more similar to MSCs phenotypically, as shown in prior literature reports [ 61 , 62 , 63 ] and recently with therapeutic implications [ 64 , 65 ]. In addition, transient phases of cells along the MSC lineage trajectory were observed for enterocytes and goblet cells (Fig.  5 E and F), which demonstrated that these high plasticity cells were at different cell-state transitions before their full maturation, as evident in the literature [ 66 , 67 ]. By contrast, the fetal intestine was more primitive than the adult intestine during fetal development, and as a key cell type in extracellular matrix (ECM) construction [ 68 ], fibroblasts displayed transitional cell stages of cells along the pseudotime trajectory (Fig.  5 F).

Comparing regulatory elements of these transitions demonstrated similarities and differences (Fig.  5 G–J, Additional file 1 : Fig. S6). For MSC-to-enterocyte transitions (Fig.  5 G, Additional file 2 : Table S6), the leading TFs with significant pseudotemporal changes were labeled. The expression E74 Like ETS transcription factor 3, ELF3, which belongs to the epithelium-specific ETS (ESE) subfamily [ 69 ], increased during the transition for both adult and fetal enterocytes (Fig.  5 H, Additional file 2 : Table S6) and as previously demonstrated is important in intestinal epithelial differentiation during embryonic development in mice [ 69 , 70 ]. Conversely, high mobility group box 1, HMGB1 [ 71 ], decreased pseudotemporally for both adult and fetal enterocytes (Fig.  5 H, Additional file 2 : Table S6) and has been shown to inhibit enterocyte migration [ 72 ]. The nuclear orphan receptor, NR2F6, a non-redundant negative regulator of adaptive immunity, [ 73 , 74 ], displayed a comparative decline in expression halfway through the pseudotime transition for adult enterocytes but continued to increase for fetal enterocytes (Fig.  5 H, Additional file 2 : Table S6). Another TF from the ETS family, Spi-B transcription factor, SPIB, also showed differential expression during the transition between adult and fetal enterocytes (Fig.  5 H), which was up-regulated in fetal enterocytes and down-regulated in adult enterocytes, suggesting its potential bi-functional role in enterocyte differentiation in fetal-to-adult transition.

For MSC-to-fibroblast transitions (Fig.  5 I, Additional file 2 : Table S6), TFs such as ARID5B, FOS, FOSB, JUN, and JUNB displayed almost identical trajectory patterns between adult and fetal fibroblasts (Fig.  5 J, Additional file 2 : Table S6). Of these TFs, FOS, FOSB, JUN, and JUNB were shown to be absent in the healthy mucosa transcriptional networks [ 75 ], in line with their observations in Fig.  5 J. By contrast, Bcl-2-associated transcription factor 1, BCLAF1, was pseudotemporally up-regulated in fetal fibroblasts but downregulated in adult fibroblasts. Prior studies showed that knocking out BCLAF1 is embryonic lethal [ 76 , 77 ] and yet could be oncogenic in colon cancer [ 78 ], which could explain the trajectory difference of it in fetal and adult. Other cell types also displayed varying degrees of similarities and differences (Additional file 1 : Fig. S5, Additional file 2 : Table S6).

In scATAC-Sequencing, we examined the contributions of cis -regulatory elements in the adult colon. We identified DA peaks for cell clusters and identified corresponding genes closest to these DA peak regions. Cell type identities were postulated based on the gene activities of the scATAC-Seq data (GSEA) [ 79 , 80 ] (Fig.  6 A). Common cell types were detected in scATAC-Seq compared to scRNA-seq (Figs. 5 A and 6 A). We performed sequence motif analysis to detect regulatory sequences unique to each cell type based on their leading DA peaks; among the top enriched motifs, many of the Myocyte Enhancer Factors such as MEF2B, MEF2C, and MEF2D from cells such as smooth muscle cells and pericytes, were found to be significantly enriched (Fig.  6 B), which were also up-regulated in the scRNASeq findings shown earlier (Additional file 2 : Table S6).

figure 6

Multi-omics analysis of adult and fetal colon tissues revealed distinct variations between adults and fetuses as well as across omics. A UMAP of cell types present in the scATAC-Seq of the adult colon. Colors represent cell types. B Top enriched motif sequences in cell types of the adult colon scATAC-Seq data. C , D Spatial transcriptomic profiles of adult colon sample 1 ( C ) and sample 2 ( D ). The top TFs were selected, and their spatial expressions were mapped onto the slide images. E , F Top receptor-ligand interactions between cell type classes in colon 1 ( E ) and colon 2 ( F ) of the spatial transcriptomics data. Color blocks on the outer circle represent the cell type class, and the color in the inner circle represents the receptor (blue) and ligand (red). Arrows indicate the direction of receptor-ligand interactions. G , H Top receptor-ligand interactions between cell type classes in the adult colon ( G ) and fetal colon ( H ) of the scRNA-seq data. Color blocks on the outer circle represent the cell type class, and the color in the inner circle represents the receptor (blue) and ligand (red). Arrows indicate the direction of receptor-ligand interactions

We examined the physical landscape of the leading TFs (found in scRNA-Seq and scATAC-Seq) in spatial transcriptomics data from two adult colons [ 5 ]. TFs ELF3 and NR2F6 were expressed generally in many locations in colonic tissue and displayed similar expression patterns for both of the adult colons (Fig.  6 C and D), consistent with significant up-regulation in almost all MSC lineage cell types in the pseudotemporal transitions (Additional file 2 : Table S6). In contrast, SPIB was not up-regulated in general, while displaying higher expression in B cells (Fig.  6 C and D), consistent with its role in adaptive immunity, as previously discussed. For other leading TFs, such as BCLAF1, EPAS1, and PLAG1, there were no clear discrete patterns of expression among the cell types.

To examine how cells interact with one another in spatial transcriptomics of the adult colon, we performed receptor-ligand interaction analysis [ 38 ]. Leading interactions included VIP/VIPR2 and ADCYAP1/VIPR2 interactions between neurons and fibroblasts, the NCAM1/GFRA1 interaction between neuronal cells, as well as LTB/CD40 and LY86/CD180 interactions between B cells (Fig.  6 E, Additional file 2 : Table S7). In colon 2, leading interactions occurred between the B cells and between the B cells and enterocytes or fibroblasts. These included LTB/CD40, APOE/LRP8, LY86/CD180, and VCAM1/ITGB7 between B cells; APOE/VLDLR between B cells (APOE) and enterocytes (VLDLR); and CXCL12/CXCR4, FN1/CD79A, CD34/SELL, and ICAM2/ITGAL between fibroblasts and B cells (Fig.  6 F, Additional file 2 : Table S7).

The same type of analysis was performed on both scRNA-seq from both adult and fetal colons. In the adult colon in scRNA-seq (Fig.  6 G), the fibroblasts comprised the leading interactions with cells such as CD8 T cells (CCL8-ACKR2), with (other) fibroblasts (CCL13-CCR9), goblet cells (CCL13-CCR3), and mast cells (PROC-PROCR). In the fetal colon, leading interaction pairs were derived mostly from fibroblasts and macrophages with other cells (Fig.  6 H, Additional file 2 : Table S7), including C4BPA-CD40 between fibroblasts (C4BPA) and endothelial cells (CD40); CCL24-CCR2 between neuronal cells (CCL24) and macrophages (CCR2); CCL13-CCR1 and MUC7-SELL between goblet cells (CCL13 and MUC7) and macrophages (CCR1 and SELL); and IL21-IL21R between smooth muscle cells (IL21) and macrophages (IL21R). In scRNA-seq of both adult and fetal colons, the active interactions of fibroblasts with other cells based on CCL family ligand-receptor interactions seemed to suggest its key regulatory role in immune cell recruitment in the colon (via the active interaction and activation of monocyte chemoattractants, i.e., the CCL family), consistent with prior publications [ 32 , 33 ].

Comparing the two omics data sets, both colon samples from spatial transcriptomics data shared leading interactions with that of the scRNA-seq from adult and fetal colons (Additional file 2 : Table S7). Between spatial colon 1 and the scRNA-seq fetal colon, common interaction pairs were found between neuronal cells, enterocytes with neurons, and neurons with fibroblasts (Additional file 2 : Table S7). For spatial colon 2, 25 of its 95 top unique interactions were shared with the scRNA-seq adult colon, and 10 were shared with the scRNA-seq fetal colon (Additional file 2 : Table S7). For the scRNA-seq adult colon, 445 of its 852 top unique interactions were found in the scRNA-seq fetal colon. For example, CLEC3A-CLEC10A interactions between macrophages (CLEC10A) and enterocytes (CLEC3A), goblet cells (CLEC3A), or smooth muscle cells (CLEC3A), as well as between macrophages. Among them, the scRNA-seq fetal colon seemed to share the greatest number of cell-type-specific interactions with the other three groups (Additional file 2 : Table S7).

At 1% BH FDR and log2FC > 0.25 for the bulk RNA-seq data in adult transverse colon data, we compared these upregulated genes with the top genes in scRNA-seq and the top genes in expression quantitative trait loci (eQTL) (eGenes) and splicing QTL (sQTL) (sGenes) of WGS of the corresponding transverse colon data (Additional file 1 : Fig. S6). Comparing the top 10 genes of eGenes and sGenes, no common genes were found (Additional file 1 : Figs. S7A and S7B). Comparing the overlapping patterns in bulk transcriptomics with scRNA-seq data, there was a much higher number of overlaps in scRNA-seq with eGenes and sGenes compared to bulk RNA-seq (Additional file 1 : Fig. S7C). We grouped the overlapping genes according to their cell types in scRNA-seq (Additional file 1 : Fig. S7D). In particular, the goblet cells and enterocytes in eGenes were similar in proportion within eGenes for bulk RNA-seq compared to scRNA-Seq. Similar phenomena were observed in sGenes (Additional file 1 : Fig. S7D).

Utility and discussion

User interface (ui) overview.

SCA offers an intuitive, user-friendly interface designed to facilitate seamless navigation and efficient phenotype retrieval by researchers across eight single-cell and bulk omics from 125 healthy adult and fetal tissues. Designed with a focus on user experience, the UI offers intuitive and simple navigations for users to explore complex layers of multi-omics multi-tissue resources. Here is an overview of the SCA UI, (I) Home Page: Landing page of the database to serve as the gateway to the comprehensive features of the SCA, offering users a starting point to dive into the wealth of multi-omics data. (II) About: This section offers a thorough description of the portal, complemented by an introductory video summarizing the key features of the database to provide guidance to new users. (II) Overview: Here, we highlight the diversity of omics data available, providing a snapshot of the various omics types and summarizing key information about each. (IV) Atlas: Features interactive representations of human adult and fetal anatomies, and a gateway for users to explore each tissue in-depth with detailed phenotypes specific to each tissue and their corresponding omics. (V) Query: While the Atlas tab is to showcase comprehensive features in each tissue, the Query tab is dedicated to exploring key phenotypic features across all tissues for different omics types, such as regulon search, receptor-ligand interactions, and clonotype abundance, etc. (VI) Demo: Offers a comprehensive walkthrough of the database, using the adult colon transverse tissue as an illustrative example, to demonstrate the capability of the platform and how users can extract meaningful insights. (VII) Analyze: Provides an extensive suite of tools tailored to assist users in performing single-cell analyses across a wide array of omics, along with rapid plotting tools that allow for the creation of customizable plots quickly and efficiently. (VIII) Download: Provides the option for batch downloads, enabling users to conveniently download the data utilized within the database based on their specific selections. (IX) Sources: Offers detailed information about the origins of the raw data used to construct the database, ensuring transparency and trust in the data provided. (X) Discussion: Facilitates a collaborative community space where users can interact, offer assistance, pose questions, and share feedback and suggestions, enhancing the collective utility of the platform. (XI) News: Keeps users informed about the latest updates, additions, and enhancements to the database, ensuring the SCA community stays abreast of new developments.

Intended uses of the database and envisioned benefits

SCA is crafted to serve as a comprehensive resource in the burgeoning field of single-cell and multi-omics research. Its primary intention is to facilitate a deeper understanding of the cellular complexity and diversity inherent in healthy adult and fetal tissues through simultaneous exploration of multiple omics. Beyond this, SCA aims to serve as a robust analysis platform to support post-quantification analysis of high-throughput single-cell sequencing data. As such, researchers can leverage SCA for comparative studies, hypothesis generation, and validation purposes. The integration of multi-omics data facilitates a deeper understanding of cellular mechanisms, potentially accelerating discoveries in cellular mechanisms, developmental biology, and potential therapeutic targets.

Explicitly, SCA enables scientists to quickly derive insights that would otherwise require extensive time and resources to uncover, thereby speeding up the cycle of hypothesis, experimentation, and conclusion. The database will significantly enhance data accessibility and integration, allowing researchers to easily combine data from different omics types and tissues to obtain a holistic view of cellular functions. This integrative approach is crucial for understanding complex biological systems and for the development of comprehensive models of human health and disease. By cataloging cellular characteristics across a range of tissues and conditions, SCA empowers precision medicine initiatives. It provides a detailed cellular context for phenotypic variations and potential markers at the single-cell level and with bulk level for comparative assessments, supporting the development of potential personalized treatment plans based on cellular profiles.

SCA fosters a collaborative research environment by providing a common platform for scientists from diverse backgrounds with research specialties across tissues, diseases, and omics analysis. It encourages interdisciplinary approaches, connecting researchers from diverse fields and promoting the exchange of knowledge and methodologies. This collaborative ethos is expected to drive forward innovations in research and technology.

Benchmarking with existing databases

Here, we evaluated our SCA database against other existing databases [ 9 , 11 , 13 , 20 , 81 ], emphasizing the distinctive attributes that make SCA stand out (Additional file 2 : Table S8). SCA integrates eight distinct omics types, surpassing the scope of Single Cell Portal (SCP) [ 20 ], Human Cell Atlas (HCA) [ 11 ], GTEx Portal [ 81 ], DISCO [ 9 ], and Panglaodb [ 13 ] in providing a wide-ranging multi-omics platform for exhaustive single-cell omics research. Data accessibility is publicly available for all these platforms, except that GTEx Portal encompassing both public and protected datasets (Additional file 2 : Table S8). SCA is noteworthy for its extensive coverage of eight single-cell and bulk omics over 125 differentiated tissues, established a significant lead over the other portals in terms of omics types. Furthermore, SCA sets a new standard with its unmatched capabilities. Other than the typical representations of cell type proportions and visualizing basic features in cell types, features that are notably limited or absent in SCP, HCA, DISCO, and Panglaodb, such as cell–cell interactions, transcription factor activities, the visualization of regulon modules, motif enrichments, clonotype abundance, detailed repertoire profiles, etc., are areas unaddressed by other databases. SCA is the sole provider of specialized queries targeting various phenotypes across multiple omics (Additional file 2 : Table S8). This specificity of analysis remains unparalleled when juxtaposed with other databases in our comparative cohort. Ultimately, SCA stands out as a premier, all-encompassing resource for the omics research community.

Future development and maintenance

In an effort to ensure the platform remains relevant, up-to-date, and increasingly valuable to the broad spectrum of researchers, we will be implementing annual updates. These will incorporate findings from newly published studies and novel phenotypic analyses gathered over the year. As we strive to continually enrich our platform, these updates will address gaps in tissue representation for each omics type, and simultaneously expand the sample size within each tissue. Our commitment to transparency and traceability is reflected in our approach to versioning. We will systematically denote improvements to the database, including new features and datasets, in an accessible point-form format. Updates will be marked by adjustments to the database accession number, with the current version designated as SCA V1.0.0. In addition to serving as a resource for data and phenotypic features, our ultimate aim is for SCA to function as a user-friendly platform, facilitating rapid access to multi-omics data resources and enabling cross-comparison of user datasets with our own.

Conclusions

Our study establishes a comprehensive evaluation of the healthy human multi-tissue and multi-omics landscape at the single-cell level, culminating in the construction of a multi-omics human map and its accompanying web-based platform SCA. This innovative platform streamlines the delivery of multi-omics insights, potentially reducing costs and accelerating research by obviating the need for extensive data consolidation. The big data framework of SCA facilitates the exploration of a broad spectrum of phenotypic features, offering a more representative snapshot of the study population than traditional single omics or bulk analysis could achieve. This multi-omics approach is poised to be instrumental in unraveling the complexities of multidimensional biological systems, offering a holistic perspective that enhances our understanding of biological phenomena.

Despite its robust capabilities, SCA faces challenges associated with the technological limitations of flow cytometry and CyTOF modalities, which restrict the number of detectable proteins. These constraints complicate the integration of data from different studies. We have consciously chosen not to pursue the imputation of expression values across these datasets due to concerns about reliability. Moving forward, we aim to refine tissue stratification within the portal by introducing more detailed sample classifications, such as sampling sites, age groups, genders across tissues, and for fetal tissues, different developmental stages. This advancement depends on the acquisition of comprehensive data to support more precise and accurate analyses.

SCA is designed not only as a database but as a catalyst for a paradigm shift towards a multi-omics-focused research approach. It encourages the scientific community to embrace a multi-omics perspective in their research, facilitating the generation of new hypotheses and the discovery of novel insights. This platform is expected to foster an environment rich in intellectual exploration, propelling forward the development of groundbreaking research trajectories. In essence, SCA emerges as a pioneering open-access, single-cell multi-omics atlas, offering an in-depth view of healthy human tissues across a wide array of omics disciplines and 125 diverse adult and fetal tissues. It unlocks new avenues for exploration in multi-omics research, positioning itself as a vital tool in advancing our understanding of life sciences. SCA is set to become an invaluable asset in the research community, significantly contributing to advancements in biology and medicine by facilitating a deeper comprehension of complex biological systems.

Availability of data and materials

This paper used and analyzed publicly available data sets and their resource references are available at http://www.singlecellatlas.org . Codes used for the construction of the database, data analysis, and visualization have been deposited on GitHub and can be accessed via https://github.com/eudoraleer/sca and is under the MIT License [ 82 ], and is also on Zenodo at https://zenodo.org/records/10906053 [ 83 ]. Web-based platforms hosting the interactive atlas and database queries are available at https://www.singlecellatlas.org .

Aldridge S, Teichmann SA. Single cell transcriptomics comes of age. Nat Commun. 2020;11:4307.

Article   CAS   PubMed   PubMed Central   Google Scholar  

Zhu C, Preissl S, Ren B. Single-cell multimodal omics: the power of many. Nat Methods. 2020;17:11–4.

Article   CAS   PubMed   Google Scholar  

Mimitou EP, Lareau CA, Chen KY, Zorzetto-Fernandes AL, Hao Y, Takeshima Y, Luo W, Huang T-S, Yeung BZ, Papalexi E, et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat Biotechnol. 2021;39:1246–58.

Li X. Harnessing the potential of spatial multiomics: a timely opportunity. Signal Transduct Target Ther. 2023;8:234.

Article   PubMed   PubMed Central   Google Scholar  

Fawkner-Corbett D, Antanaviciute A, Parikh K, Jagielowicz M, Gerós AS, Gupta T, Ashley N, Khamis D, Fowler D, Morrissey E, et al. Spatiotemporal analysis of human intestinal development at single-cell resolution. Cell. 2021;184:810-826.e823.

Miao Z, Humphreys BD, McMahon AP, Kim J. Multi-omics integration in the age of million single-cell data. Nat Rev Nephrol. 2021;17:710–24.

Chappell L, Russell AJC, Voet T. Single-Cell (Multi)omics Technologies. Annu Rev Genomics Hum Genet. 2018;19:15–41.

Li H, Qu L, Yang Y, Zhang H, Li X, Zhang X. Single-cell transcriptomic architecture unraveling the complexity of tumor heterogeneity in distal cholangiocarcinoma. Cell Mol Gastroenterol Hepatol. 2022;13(1592–1609): e1599.

Google Scholar  

Li M, Zhang X, Ang KS, Ling J, Sethi R, Lee NYS, Ginhoux F, Chen J. DISCO: a database of Deeply Integrated human Single-Cell Omics data. Nucleic Acids Res. 2022;50:D596-d602.

Pan L, Mou T, Huang Y, Hong W, Yu M, Li X. Ursa: A comprehensive multiomics toolbox for high-throughput single-cell analysis. Mol Biol Evol. 2023;40(12):msad267.

Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al. The Human Cell Atlas eLife. 2017;6: e27041.

PubMed   Google Scholar  

Clough E, Barrett T. The gene expression omnibus database. Statistical Genomics: Methods and Protocols. 2016:93–110.

Franzén O, Gan L-M, Björkegren JLM: PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, 2019.

Cummins C, Ahamed A, Aslam R, Burgin J, Devraj R, Edbali O, Gupta D, Harrison PW, Haseeb M, Holt S, et al. The European Nucleotide Archive in 2021. Nucleic Acids Res. 2022;50:D106-d110.

Pan L, Shan S, Tremmel R, Li W, Liao Z, Shi H, Chen Q, Zhang X, Li X. HTCA: a database with an in-depth characterization of the single-cell human transcriptome. Nucleic Acids Res. 2022;51:D1019–28.

Article   PubMed Central   Google Scholar  

Elmentaite R, Domínguez Conde C, Yang L, Teichmann SA. Single-cell atlases: shared and tissue-specific cell types across human organs. Nat Rev Genet. 2022;23:395–410.

Quake SR: A decade of molecular cell atlases. Trends in Genetics 2022.

Zeng J, Zhang Y, Shang Y, Mai J, Shi S, Lu M, Bu C, Zhang Z, Zhang Z, Li Y, et al. CancerSCEM: a database of single-cell expression map across various human cancers. Nucleic Acids Res. 2022;50:D1147-d1155.

Ner-Gaon H, Melchior A, Golan N, Ben-Haim Y, Shay T. JingleBells: A Repository of Immune-Related Single-Cell RNA-Sequencing Datasets. J Immunol. 2017;198:3375–9.

Tarhan L, Bistline J, Chang J, Galloway B, Hanna E, Weitz E: Single Cell Portal: an interactive home for single-cell genomics data. bioRxiv 2023.

Kolodziejczyk Aleksandra A, Kim JK, Svensson V, Marioni John C, Teichmann Sarah A. The Technology and Biology of Single-Cell RNA Sequencing. Mol Cell. 2015;58:610–20.

Schwartzman O, Tanay A. Single-cell epigenomics: techniques and emerging applications. Nat Rev Genet. 2015;16:716–26.

Gomes T, Teichmann SA, Talavera-López C. Immunology Driven by Large-Scale Single-Cell Sequencing. Trends Immunol. 2019;40:1011–21.

Cheung RK, Utz PJ. CyTOF—the next generation of cell detection. Nat Rev Rheumatol. 2011;7:502–3.

Spitzer Matthew H, Nolan Garry P. Mass Cytometry: Single Cells. Many Features Cell. 2016;165:780–91.

CAS   PubMed   Google Scholar  

Tian Y, Carpp LN, Miller HER, Zager M, Newell EW, Gottardo R. Single-cell immunology of SARS-CoV-2 infection. Nat Biotechnol. 2022;40:30–41.

McKinnon KM: Flow Cytometry: An Overview. Current Protocols in Immunology 2018, 120:5.1.1–5.1.11.

Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596:211–20.

Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019;20:631–56.

Ng PC, Kirkness EF. Whole Genome Sequencing. In: Barnes MR, Breen G, editors. Genetic Variation: Methods and Protocols. Totowa, NJ: Humana Press; 2010. p. 215–26.

Chapter   Google Scholar  

Hughes CE, Nibbs RJB. A guide to chemokines and their receptors. Febs j. 2018;285:2944–71.

Stadler M, Pudelko K, Biermeier A, Walterskirchen N, Gaigneaux A, Weindorfer C, Harrer N, Klett H, Hengstschläger M, Schüler J, et al. Stromal fibroblasts shape the myeloid phenotype in normal colon and colorectal cancer and induce CD163 and CCL2 expression in macrophages. Cancer Lett. 2021;520:184–200.

Davidson S, Coles M, Thomas T, Kollias G, Ludewig B, Turley S, Brenner M, Buckley CD. Fibroblasts as immune regulators in infection, inflammation and cancer. Nat Rev Immunol. 2021;21:704–17.

Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zager M, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573-3587.e3529.

Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, Chen H, Wang J, Tang H, Ge W, et al. Construction of a human cell landscape at single-cell level. Nature. 2020;581:303–9.

Kariminekoo S, Movassaghpour A, Rahimzadeh A, Talebi M, Shamsasenjan K, Akbarzadeh A. Implications of mesenchymal stem cells in regenerative medicine. Artificial Cells, Nanomedicine, and Biotechnology. 2016;44:749–57.

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14:1083–6.

Cillo AR, Kürten CHL, Tabib T, Qi Z, Onkar S, Wang T, Liu A, Duvvuri U, Kim S, Soose RJ, et al. Immune Landscape of Viral- and Carcinogen-Driven Head and Neck Cancer. Immunity. 2020;52:183-199.e189.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47–e47.

Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 2021;49:D545-d551.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.

The Gene Ontology resource. enriching a GOld mine. Nucleic Acids Res. 2021;49:D325-d334.

Article   Google Scholar  

Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc: Ser B (Methodol). 1995;57:289–300.

Staley JR, Blackshaw J, Kamat MA, Ellis S, Surendran P, Sun BB, Paul DS, Freitag D, Burgess S, Danesh J, et al. PhenoScanner: a database of human genotype-phenotype associations. Bioinformatics. 2016;32:3207–9.

Kamat MA, Blackshaw JA, Young R, Surendran P, Burgess S, Danesh J, Butterworth AS, Staley JR. PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics. 2019;35:4851–3.

Ballardini G, Bianchi F, Doniach D, Mirakian R, Pisi E, Bottazzo G. ABERRANT EXPRESSION OF HLA-DR ANTIGENS ON BILEDUCT EPITHELIUM IN PRIMARY BILIARY CIRRHOSIS: RELEVANCE TO PATHOGENESIS. The Lancet. 1984;324:1009–13.

Hirschfield GM, Liu X, Xu C, Lu Y, Xie G, Lu Y, Gu X, Walker EJ, Jing K, Juran BD, et al. Primary Biliary Cirrhosis Associated with HLA, IL12A, and IL12RB2 Variants. N Engl J Med. 2009;360:2544–55.

Peng A, Ke P, Zhao R, Lu X, Zhang C, Huang X, Tian G, Huang J, Wang J, Invernizzi P, et al. Elevated circulating CD14(low)CD16(+) monocyte subset in primary biliary cirrhosis correlates with liver injury and promotes Th1 polarization. Clin Exp Med. 2016;16:511–21.

Chen Y-Y, Arndtz K, Webb G, Corrigan M, Akiror S, Liaskou E, Woodward P, Adams DH, Weston CJ, Hirschfield GM. Intrahepatic macrophage populations in the pathophysiology of primary sclerosing cholangitis. JHEP Reports. 2019;1:369–76.

Olmos JM, García JD, Jiménez A, de Castro S. Impaired monocyte function in primary biliary cirrhosis. Allergol Immunopathol (Madr). 1988;16:353–8.

Britanova OV, Putintseva EV, Shugay M, Merzlyak EM, Turchaninova MA, Staroverov DB, Bolotin DA, Lukyanov S, Bogdanova EA, Mamedov IZ, et al. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J Immunol. 2014;192:2689–98.

Borcherding N, Bormann NL, Kraus G. scRepertoire: An R-based toolkit for single-cell immune receptor analysis. F1000Research. 2020;9.

Larbi A, Fulop T. From “truly naïve” to “exhausted senescent” T cells: When markers predict functionality. Cytometry A. 2014;85:25–35.

Article   PubMed   Google Scholar  

Lee S-W, Choi HY, Lee G-W, Kim T, Cho H-J, Oh I-J, Song SY, Yang DH, Cho J-H. CD8<sup>+</sup> TILs in NSCLC differentiate into TEMRA via a bifurcated trajectory: deciphering immunogenicity of tumor antigens. J Immunother Cancer. 2021;9: e002709.

Chen K, Kolls JK. T Cell-Mediated Host Immune Defenses in the Lung. Annu Rev Immunol. 2013;31:605–33.

Mowat AM, Agace WW. Regional specialization within the intestinal immune system. Nat Rev Immunol. 2014;14:667–85.

Godfrey DI, Koay H-F, McCluskey J, Gherardin NA. The biology and functional importance of MAIT cells. Nat Immunol. 2019;20:1110–28.

Nel I, Bertrand L, Toubal A, Lehuen A. MAIT cells, guardians of skin and mucosa? Mucosal Immunol. 2021;14:803–14.

Legoux F, Salou M, Lantz O. MAIT Cell Development and Functions: the Microbial Connection. Immunity. 2020;53:710–23.

van den Broek T, Borghans JAM, van Wijk F. The full spectrum of human naive T cells. Nat Rev Immunol. 2018;18:363–73.

Soundararajan M, Kannan S. Fibroblasts and mesenchymal stem cells: Two sides of the same coin? J Cell Physiol. 2018;233:9099–109.

Muzlifah AH, Matthew PC, Christopher DB, Francesco D. Mesenchymal stem cells: the fibroblasts’ new clothes? Haematologica. 2009;94:258–63.

Lendahl U, Muhl L, Betsholtz C. Identification, discrimination and heterogeneity of fibroblasts. Nat Commun. 2022;13:3409.

Steens J, Unger K, Klar L, Neureiter A, Wieber K, Hess J, Jakob HG, Klump H, Klein D. Direct conversion of human fibroblasts into therapeutically active vascular wall-typical mesenchymal stem cells. Cell Mol Life Sci. 2020;77:3401–22.

Ichim TE, O’Heeron P, Kesari S. Fibroblasts as a practical alternative to mesenchymal stem cells. J Transl Med. 2018;16:212.

Beumer J, Clevers H. Cell fate specification and differentiation in the adult mammalian intestine. Nat Rev Mol Cell Biol. 2021;22:39–53.

Moor AE, Harnik Y, Ben-Moshe S, Massasa EE, Rozenberg M, Eilam R, Bahar Halpern K, Itzkovitz S. Spatial Reconstruction of Single Enterocytes Uncovers Broad Zonation along the Intestinal Villus Axis. Cell. 2018;175:1156-1167.e1115.

Kendall RT, Feghali-Bostwick CA. Fibroblasts in fibrosis: novel roles and mediators. Front Pharmacol. 2014;5:123.

Oliver JR, Kushwah R, Wu J, Pan J, Cutz E, Yeger H, Waddell TK, Hu J. Elf3 plays a role in regulating bronchiolar epithelial repair kinetics following Clara cell-specific injury. Lab Invest. 2011;91:1514–29.

Ng AYN, Waring P, Ristevski S, Wang C, Wilson T, Pritchard M, Hertzog P, Kola I. Inactivation of the transcription factor Elf3 in mice results in dysmorphogenesis and altered differentiation of intestinal epithelium. Gastroenterology. 2002;122:1455–66.

Chen R, Kang R, Tang D. The mechanism of HMGB1 secretion and release. Exp Mol Med. 2022;54:91–102.

Dai S, Sodhi C, Cetin S, Richardson W, Branca M, Neal MD, Prindle T, Ma C, Shapiro RA, Li B, et al. Extracellular High Mobility Group Box-1 (HMGB1) Inhibits Enterocyte Migration via Activation of Toll-like Receptor-4 and Increased Cell-Matrix Adhesiveness 2<sup></sup>. J Biol Chem. 2010;285:4995–5002.

Klepsch V, Gerner RR, Klepsch S, Olson WJ, Tilg H, Moschen AR, Baier G, Hermann-Kleiter N. Nuclear orphan receptor NR2F6 as a safeguard against experimental murine colitis. Gut. 2018;67:1434–44.

Klepsch V, Hermann-Kleiter N, Baier G. Beyond CTLA-4 and PD-1: Orphan nuclear receptor NR2F6 as T cell signaling switch and emerging target in cancer immunotherapy. Immunol Lett. 2016;178:31–6.

Sanz-Pamplona R, Berenguer A, Cordero D, Molleví DG, Crous-Bou M, Sole X, Paré-Brunet L, Guino E, Salazar R, Santos C, et al. Aberrant gene expression in mucosa adjacent to tumor reveals a molecular crosstalk in colon cancer. Mol Cancer. 2014;13:46.

McPherson JP, Sarras H, Lemmers B, Tamblyn L, Migon E, Matysiak-Zablocki E, Hakem A, Azami SA, Cardoso R, Fish J, et al. Essential role for Bclaf1 in lung development and immune system function. Cell Death Differ. 2009;16:331–9.

Aw S. Sun H, Geng Y, Peng Q, Wang P, Chen J, Xiong T, Cao R, Tang J: Bclaf1 is an important NF-κB signaling transducer and C/EBPβ regulator in DNA damage-induced senescence. Cell Death Differ. 2016;23:865–75.

Zhou X, Li X, Cheng Y, Wu W, Xie Z, Xi Q, Han J, Wu G, Fang J, Feng Y. BCLAF1 and its splicing regulator SRSF10 regulate the tumorigenic potential of colon cancer cells. Nat Commun. 2014;5:4581.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102:15545–50.

Liberzon A, Subramanian A, Pinchback R. Thorvaldsdottir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–40.

GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–30.

Pan L, Parini P, Tremmel R, Loscalzo J, Lauschke VM, Maron BA, Paci P, Ernberg I, Tan NS, Liao Z, Yin W, Rengarajan S, Li X: Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Github. https://github.com/eudoraleer/sca/ ; 2024.

Pan L, Parini P, Tremmel R, Loscalzo J, Lauschke VM, Maron BA, Paci P, Ernberg I, Tan NS, Liao Z, Yin W, Rengarajan S, Wang ZN, Li X: Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Zenodo. https://zenodo.org/doi/10.5281/zenodo.10906053 ; 2024.

Download references

Acknowledgements

The computations and data handling were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Rackham, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. We would like to thank Vladimir Kuznetsov for his advice on the manuscript, and Liming Zhang and Xueqiang Peng for their help in data handling.

Members of The SCA Consortium

Lu Pan 1 , Paolo Parini 2,3 , Roman Tremmel 4,5 , Joseph Loscalzo 6 , Volker M. Lauschke 4,5,7 , Bradley A. Maron 6 , Paola Paci 8 , Ingemar Ernberg 9 , Nguan Soon Tan 10,11 , Zehuan Liao 9,10 , Weiyao Yin 1 , Sundararaman Rengarajan 12 , Xuexin Li 13,14,*

1 Institute of Environmental Medicine, Karolinska Institutet, Solna, 171 65, Sweden.

2 Cardio Metabolic Unit, Department of Medicine, and Department of Laboratory Medicine, Karolinska Institutet, Stockholm, 141 86, Sweden.

3 Medicine Unit, Theme Inflammation and Ageing, Karolinska University Hospital, Stockholm, 141 86, Sweden.

4 Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, 70376, Germany.

5 University of Tuebingen, Tuebingen, 72076, Germany.

6 Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA.

7 Department of Physiology and Pharmacology, Karolinska Institutet, Solna, 171 65, Sweden.

8 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, 00185, Italy.

9 Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Solna, 171 65, Sweden.

10 School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore.

11 Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore 308232, Singapore.

12 Department of Physical Therapy, Movement & Rehabilitation Sciences, Northeastern University, Boston, MA, 02115, USA.

13 Department of General Surgery, The Fourth Affiliated Hospital, China Medical University, Shenyang 110032, China.

14 Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, 171 65, Sweden.

Review history

The review history is available as Additional File 4 .

Peer review information

Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Open access funding provided by Karolinska Institute. This work is supported by the Karolinska Institute Network Medicine Global Alliance (KI NMA) collaborative grant C24401073 (X.L., L.P.), C62623013 (X.L., L.P.), and C331612602 (X.L., L.P.).

Author information

Authors and affiliations.

Institute of Environmental Medicine, Karolinska Institutet, 171 65, Solna, Sweden

Lu Pan & Weiyao Yin

Cardio Metabolic Unit, Department of Medicine, and, Department of Laboratory Medicine , Karolinska Institutet, 141 86, Stockholm, Sweden

Paolo Parini

Theme Inflammation and Ageing, Medicine Unit, Karolinska University Hospital, 141 86, Stockholm, Sweden

Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, 70376, Stuttgart, Germany

Roman Tremmel & Volker M. Lauschke

University of Tuebingen, 72076, Tuebingen, Germany

Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA

Joseph Loscalzo & Bradley A. Maron

Department of Physiology and Pharmacology, Karolinska Institutet, 171 65, Solna, Sweden

Volker M. Lauschke

Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185, Rome, Italy

Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, 171 65, Solna, Sweden

Ingemar Ernberg & Zehuan Liao

School of Biological Sciences, Nanyang Technological University, Singapore, 637551, Singapore

Nguan Soon Tan & Zehuan Liao

Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, 308232, Singapore

Nguan Soon Tan

Department of Physical Therapy, Movement & Rehabilitation Sciences, Northeastern University, Boston, MA, 02115, USA

Sundararaman Rengarajan

Department of General Surgery, The Fourth Affiliated Hospital, China Medical University, Shenyang, 110032, China

Department of Medical Biochemistry and Biophysics, Karolinska Institutet, 171 65, Solna, Sweden

You can also search for this author in PubMed   Google Scholar

  • , Paolo Parini
  • , Roman Tremmel
  • , Joseph Loscalzo
  • , Volker M. Lauschke
  • , Bradley A. Maron
  • , Paola Paci
  • , Ingemar Ernberg
  • , Nguan Soon Tan
  • , Zehuan Liao
  • , Weiyao Yin
  • , Sundararaman Rengarajan
  •  & Xuexin Li

Contributions

Conceptualization, X.L., L.P., and J.L.; methodology, X.L. and L.P.; investigation, X.L., L.P., V.M.L., R.T., and J.L.; analysis and visualization, L.P.; cross-checking and validation, X.L. and L.P.; website construction, L.P., X.L., and R.T.; funding acquisition, X.L. and L.P.; project administration, X.L., L.P., P.P., and V.M.L.; supervision, X.L. and J.L.; writing, L.P. and X.L. All authors edited and reviewed the manuscript.

Corresponding author

Correspondence to Xuexin Li .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

VML is CEO and shareholder of HepaPredict AB, co-founder, and shareholder of PersoMedix AB, and discloses consultancy work for Enginzyme AB. The other authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:.

Figure S1. Sample count in fetal and adult groups across tissues and omics types. Figure S2. Correlations between cell types based on gene expression signatures revealed distinct cell type class clusters. (A-B) Heatmap showing the correlations of the cell types from adult (A) and fetal (B) cell types based on the expression of their top upregulated genes. The intensity of the heatmap shows the AUROC level between cell types. Colour blocks on the top of the heatmap represent tissues (first row from the top), biological systems (second row), cell types (third row) and cell type classes (fourth row). Figure S3. Correlations between cell types based on TF signatures revealed similar clustering patterns. (A-B) Heatmap showing the correlations of the cell types from adult (A) and fetal (B) cell types based on the expression of the TF signatures of each cell type. The intensity of the heatmap shows the AUROC level between cell types. Colour blocks on the top of the heatmap represent tissues (first row from the top), biological systems (second row), cell types (third row) and cell type classes (fourth row). Figure S4. Phenotype or disease trait associations. Forest plot showing the associations of phenotype or disease traits in selected cell type classes of scRNA-seq data for both adult and fetal tissues. The X-axis displays the odds ratio of each trait, and the colors of the points represent cell type classes. Figure S5. Landscape of clonal expansion patterns across tissues. (A) tSNE of the tissues from the multi-modal tissues of the scImmune-profiling data. Colors indicate clonal type expansion groups of the cells. Cells not present in the T or B repertoires are colored gray (NA group). Tissues with too few cells present in the T or B repertoires were filtered (i.e., bile duct and kidney) in the main analysis. (B) Stacked bar plots revealing the overall clonal expansion landscapes of the T and B cell repertoires. Colors represent clonal type groups. (C) Alluvial plot showing the top clonal types in T cell repertoires and their proportions shared across tissues containing these clonotypes. Colors represent clonotypes. Figure S6. Pseudotime heatmaps of MSC lineage cell types in the adult and fetal colon. (A-B) Pseudotime trajectory of each cell type in the MSC lineage of adult (A) and fetal (B) colons. The color represents the cell type, and the violin plots represent the density of cells across pseudotime. Figure S7. Comparison of DE gene overlaps between bulk RNA-seq, scRNA-seq and WGS. (A) Chromosomal positions of the top 10 eGenes in colon transverse bulk RNA-seq data. Gene names and their SNP rsid are shown. (B) Chromosomal positions of the top 10 sGenes in colon transverse bulk RNA-seq data. Gene names and their SNP rsid are shown. (C) Stacked bar plot showing the number of shared DE genes of the bulk RNA-seq data and the scRNA-seq data with the genes of the top eQTLs and sQTLs. The color represents the omics type. (D) Stacked bar plot showing the number of shared DE genes across the bulk RNA-seq data, the scRNA-seq data, genes of the top eQTLs and sQTLs. Colors represent the cell types to which the genes belonged with reference to the DE genes of the cell types in the scRNA-seq data. Fig. S8. Comprehensive workflow for scATAC-Seq data analyses in SCA V1.0.0.

Additional file 2:

Table S1. Cell counts of the adult and fetal tissue groups at each omics level. Table S2. Filtered matrix raw read counts for scRNA-Seq across tissues in both fetal and adult groups. Cell_Count_Filtered_Matrix column represents raw read counts initially obtained from published studies or after filtering for the removal of background noises. Table S3. Statistics of the upregulated genes from adult and fetal tissues, filtered by average Log2FoldChange > 0.25 and adjusted P of 0.05. Clusters represent cell types. Genes were ranked by average log2-fold-change. Table S4. Top receptor–ligand interaction profiles of the cell types in the 38 matching adult and fetal tissues. Interaction analysis was done separately for each tissue, and information on the interaction pairs can be viewed from the first column. Table S5: Top clonotypes (VDJ gene combinations) of each cell type present in the T and B cell repertoires. Table S6. Top TFs in the pseudotime transitions of adult and fetal colon cell types. Table S7 . Top receptor-ligand pairs in spatial transcriptomics of adult colons (colon 1 and colon 2) as well as in scRNA-seq adult and fetal colons. The first column represents the data type to which the interactions belong. Table ranked by decreasing interaction ratios. Table S8 . Comparison of SCA with other single-cell omics databases. Green tick indicates a yes and a red cross indicates a no. Table S9. List of public resources included in the SCA database portal. SCA_PID refers to SCA-designated project identity number (PID).

Additional file 3.

Supplementary Methods.

Additional file 4.

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Pan, L., Parini, P., Tremmel, R. et al. Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Genome Biol 25 , 104 (2024). https://doi.org/10.1186/s13059-024-03246-2

Download citation

Received : 16 November 2022

Accepted : 12 April 2024

Published : 19 April 2024

DOI : https://doi.org/10.1186/s13059-024-03246-2

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Single-cell omics
  • Multi-omics
  • Single Cell Atlas
  • Human database
  • Single-cell RNA-sequencing
  • Spatial transcriptomics
  • Single-cell ATAC-sequencing
  • Single-cell immune profiling
  • Mass cytometry
  • Flow cytometry

Genome Biology

ISSN: 1474-760X

human genome project literature review

IMAGES

  1. (PDF) The Human Genome Project

    human genome project literature review

  2. Human Genome Project

    human genome project literature review

  3. The Human Genome Project Essay Example

    human genome project literature review

  4. (PDF) The Human Genome Project, and recent advances in personalized

    human genome project literature review

  5. (PDF) The Human Genome Project

    human genome project literature review

  6. Human Genome Project, 978-613-0-03707-9, 6130037074 ,9786130037079

    human genome project literature review

VIDEO

  1. Genomics for everyone: UCSC researchers release first human pangenome

  2. Human Genome Project (HGP)

  3. Human Genome Project-Biology- Session 116

  4. Human genome project

  5. ZOOLOGY || Human Genome Project || Siva Chandra

  6. Human Genome Project PART A

COMMENTS

  1. The Human Genome Project changed everything

    The joint announcement of the release of the human 'draft' genome sequences occurred 20 years ago, at a ceremony in the White House. The first analyses by two groups, the publicly funded ...

  2. The Human Genome Project: big science transforms biology and medicine

    Impact of the human genome project on biology and technology. First, the human genome sequence initiated the comprehensive discovery and cataloguing of a 'parts list' of most human genes [16,17], and by inference most human proteins, along with other important elements such as non-coding regulatory RNAs.Understanding a complex biological system requires knowing the parts, how they are ...

  3. The Human Genome Project

    The Human Genome Project. The signature aim of the Human Genome Project (HGP), which was launched in 1990, was to sequence the 3 billion bases of the human genome. Additional goals included the ...

  4. A review of human genome project (HGP) from ethical perspectives

    Apparently, this study employed descriptive literature review and the results show that the Human Genome Project has significantly increased the level of understanding of the basic damage or ...

  5. Anticipating the Ethical, Legal, and Social Implications of Human

    The basic charge to this multidisciplinary group in the first five-year plan for the U.S. Human Genome Project was two-fold: to "[d]evelop programs addressed at understanding the ethical, legal and social implications of the human genome project," and to "[i]dentify and define the major issues and develop initial policy options to address ...

  6. The International Human Genome Project

    The human genome project was conceived and executed as an international project, due to both pragmatic and principled reasons. This internationality has served the project well, with the resulting human genome being freely available for all researchers in all countries. Over time the reference human genome will likely have to evolve to a graph ...

  7. A decade of human genome project conclusion: Scientific ...

    The fact that only 20,000 genes are protein and RNA-coding is one of the most striking HGP results. A new concept about the organization of genome arose. The ENCODE project was initiated in 2003 and targeted to map the functional elements of the human genome. This project revealed that the human genome is pervasively transcribed.

  8. PDF Twenty years of the human genome

    human genome To fulfil the promises of the Human Genome Project, researchers, journals and funders must re-commit to equity and open data sharing. T he first drafts of the human genome, published in Nature and Science 20 years ago, flung open the doors for what some predicted would be 'biology's century'. In just one-fifth of the

  9. The International Human Genome Project

    The human genome project was conceived and executed as an international project, due to both pragmatic and principled reasons. This internationality has served the project well, with the resulting human genome being freely available for all researchers in all countries. Over time the reference human genome will likely have to evolve to a graph ...

  10. Progress, Challenges, and Surprises in Annotating the Human Genome

    Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities.

  11. Full article: The Human Genome Project, and recent advances in

    The Human Genome Project represented the collective efforts of sci... Skip to Main Content ... In this review we consider the evidence concerning the application of such personalized genomics within the context of population screening, and potential implications that arise from this. ... There is an extensive literature on psychosocial aspects ...

  12. How to design a national genomic project—a systematic review of active

    An increasing number of countries are investing efforts to exploit the human genome, in order to improve genetic diagnostics and to pave the way for the integration of precision medicine into health systems. The expected benefits include improved understanding of normal and pathological genomic variation, shorter time-to-diagnosis, cost-effective diagnostics, targeted prevention and treatment ...

  13. The Human Genome Project

    The Human Genome Project. First published Wed Nov 26, 2008; substantive revision Thu Sep 14, 2023. The 20 th century opened with rediscoveries of Gregor Mendel's studies on patterns of inheritance in peas and closed with a research project in molecular biology that was heralded as the initial and necessary step for attaining a complete ...

  14. A review of human genome project (HGP) from ethical perspectives

    The results show that the Human Genome Project has significantly increased the level of understanding of the basic damage or genetic defects, the structure of DNA, the identification of the position of all genes and human genome databases as well as HGP major contributions in the field of biology specifically in developmental biology and neurobiology. Article history: Received 7 August 2017 ...

  15. The Human Genome Project: big science transforms biology and medicine

    The Human Genome Project (HGP) has profoundly changed biology and is rapidly catalyzing a transformation of medicine [1-3].The idea of the HGP was first publicly advocated by Renato Dulbecco in an article published in 1984, in which he argued that knowing the human genome sequence would facilitate an understanding of cancer [].In May 1985 a meeting focused entirely on the HGP was held, with ...

  16. The Human Genome Project

    The Human Genome Project (HGP) is one of the greatest scientific feats in history. The project was a voyage of biological discovery led by an international group of researchers looking to comprehensively study all of the DNA (known as a genome) of a select set of organisms. Launched in October 1990 and completed in April 2003, the Human Genome ...

  17. Was the Human Genome Project Worth the Effort?

    One of the promises of the Human Genome Project was that it would provide tools for identifying genetic factors that contribute to common, complex diseases such as cancer and diabetes. Finding these factors would, in turn, suggest possible targets for drug therapy and other forms of treatment. Three papers in this week's issue—by Edwards et al. (1) on page 421, Haines et al. (2) on page 419 ...

  18. PDF The Human Genome Diversity Project: past, present and future

    The Human Genome Diversity Project (HGDP)provides a resource that is aimed at promoting worldwide research on human genetic diversity, with the ultimate goal of understanding how and when patterns of

  19. The Human Genome Project

    The Human Genome Project is an international research project whose primary mission is to decipher the chemical sequence of the complete human genetic material (i.e., the entire genome), identify all 50,000 to 100,000 genes contained within the genome, and provide research tools to analyze all this genetic information.

  20. Advances in the Human Genome Project. A review

    While celebrating its fifth official birthday last year it seems that the Human Genome Project (HGP) has and will continue to yield important biochemical information to mankind. ... Advances in the Human Genome Project. A review Mol Biol Rep. 1998 Jan;25(1):27-43. doi: 10.1023/a:1006834711989. Authors U Kelavkar 1 , K Shah. Affiliation 1 ...

  21. Human Genome Project (HGP)

    The Human Genome Project (HGP), which operated from 1990 to 2003, provided researchers with basic information about the sequences of the three billion chemical base pairs (i.e., adenine [A], thymine [T], guanine [G], and cytosine [C]) that make up human genomic DNA (deoxyribonucleic acid). The HGP was further intended to improve the ...

  22. Human Genome Project- A Review

    A review of the process by which the actual sequencing of human genome was done, which aimed to find out the function of every gene in human body and replace any gene that is mutated and causes disease. Biotechnology and genetic engineering has advanced medical science so that we are now blessed with treatments of numerous diseases with the gene therapy. Mapping and sequencing of human genome ...

  23. Exploring the Use of Genomic and Routinely Collected Data: Narrative

    We conducted a literature review to draw information from past studies that have used genomic and routinely collected data and conducted interviews with individuals who use these data for health research. ... The progression of genomics in the last few decades has been remarkable. Since 2001, when the Human Genome Project mapped and sequenced ...

  24. Decoding triancestral origins, archaic introgression, and natural

    These analyses have yielded insights into the characteristics of human genome variation , unveiled complex histories of human populations (3, 4), and shed light on the processes of evolutionary adaptation and positive selection (5, 6). In terms of application in genetics, WGS datasets are indispensable for imputation analysis.

  25. Single Cell Atlas: a single-cell multi-omics human cell encyclopedia

    The human body is a highly complex system with dynamic cellular infrastructures and networks of biological events. Thanks to the rapid evolution of single-cell technologies, we are now able to describe and quantify different aspects of single cellular activities using various omics techniques [1,2,3,4].Observing or integrating multiple molecular layers of single cells has promoted profound ...

  26. Buildings

    Introduction: This study examines the impact of building information modeling on the cost management of engineering projects, focusing specifically on the Mombasa Port Area Development Project. The objective of this research is to determine the mechanisms through which building information modeling facilitates stakeholder collaboration, reduces construction-related expenses, and enhances the ...