Genome Architecture and Its Roles in Human Copy Number Variation
Abstract
Besides single-nucleotide variants in the human genome, large-scale genomic variants, such as copy number variations (CNVs), are increasingly being recognized as a genetic source of human diversity and as pathogenic factors in disease. Recent experimental findings have shed light on the links between different genome architectures and CNV mutagenesis. In this review, we summarize various genomic features and discuss their contributions to CNV formation. Genomic repeats, including both low-copy and high-copy repeats, play important roles in CNV instability, which was initially attributed to DNA recombination events. Furthermore, it has been found that human genomic repeats can also induce DNA replication errors and consequently result in CNV mutations. Some recent studies showed that DNA replication timing, which reflects the high-order information of genomic organization, is involved in human CNV mutations. Our review highlights that genome architecture, from DNA sequence to high-order genomic organization, is an important molecular factor in CNV mutagenesis and human genomic instability.
Introduction
Genetic mutations have been known as one of the key factors in the pathogenesis of human diseases. Besides the well-known single-nucleotide variants, it has been shown that the large-scale genomic variants also make a great contribution to human health. 'Genomic disorders' are human diseases caused by relatively large genomic rearrangements [1]. Such large-scale genomic variants (named copy number variation [CNV]) can also be frequent in human populations [2, 3]. CNV involves DNA segments larger than 1 kb and exhibits variable copy numbers among individuals, comprising deletions and duplications/insertions [4, 5]. In the past 10 years, CNV has been found to play an important role in both sporadic Mendelian disorders and complex diseases. Previous studies have reported that CNV can be mediated by multiple molecular mechanisms involving various genomic features. Here, we focus on CNV mutagenesis and review the involvement of human genome architecture in CNV instability and the underlying molecular mechanisms.
Non-allelic Homologous Recombination between Human Genomic Repeats
Genomic disorders and low-copy repeats
Large-scale genomic changes in the human genome can be associated with human diseases. Such clinical conditions resulting from human genome architecture are termed 'genomic disorders' [1]. The structural features, such as genomic repeats, can provide substrates for homologous recombination and induce genomic rearrangements and genomic disorders.
Stankiewicz and Lupski [6] defined region-specific low-copy repeats (LCRs) as paralogous genomic segments spanning 10-400 kb of genomic DNA and sharing ≥95%-97% sequence identity. Non-allelic homologous recombination (NAHR) between directly oriented LCRs can generate microdeletions and microduplications that are megabases in size and are frequently associated with genomic disorders (Fig. 1). For example, the 22q11.2 deletion syndrome is a well-investigated disorder caused by microdeletions between the paired LCRs in human 22q11.2, which delete one copy of TBX1, CRKL, MAPK1, and several additional genes [7, 8, 9, 10]. In addition to microdeletions, microduplications can also manifest as genomic disorders. The 1.4-Mb microduplication involving the PMP22 gene in human 17p12 can lead to CMT1A, which is a classical model for disease resulting from gene dosage effects [11, 12].

Fig. 1. The non-allelic homologous recombination (NAHR) events between paired low-copy repeats (LCRs)/segmental duplications (SDs) [1]. Paired LCRs/SDs are depicted as bold arrows (red and blue) with the orientation indicated by arrowheads. Capital letters near the LCRs/SDs refer to the flanking unique sequences, while the same letter on different lines indicates the homologues on the other strand. Dashed crossed lines represent a homologous recombination event. (A) The NAHR event between reversely oriented LCRs/SDs can cause inversion, a copy-neutral structural variation. (B) The inter-chromatid NAHR events between directly oriented LCRs/SDs result in deletions and duplications. (C) The intra-chromatid NAHR events between directly oriented LCRs/SDs can generate deletions and ring-shaped DNA segments that will be lost in subsequent cell divisions.
Segmental duplication and NAHR
Genomic repeats play a significant role in human evolution and have a strong association with genomic CNVs [6, 13, 14, 15]. In 2001, Eichler [16] first conducted a systematic bioinformatic analysis of such low-copy genomic repeats and defined them as segmental duplications (SDs), which have a high degree of sequence identity (>90%-95%) and large genomic sizes (1-100 kb). Bailey et al. [17] then performed a whole-genome assembly comparison to detect SDs with pair-wise alignments of ≥ 90% identity and ≥ 1 kb in length in the human genome. In addition to human SDs, subsequent analyses also identified the SD architecture in the genomes of other primates, including chimpanzee, gorilla, and orangutan, and even in the mouse genome [17, 18, 19, 20], all of which have been archived in the online database of the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/).
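The detection thresholds quoted above (pair-wise alignments of ≥ 90% identity and ≥ 1 kb) can be illustrated with a simple filter. The following is only a minimal sketch: the `Alignment` record, its field names, and the example coordinates are hypothetical and do not reproduce the pipeline of Bailey et al. [17] or the schema of any UCSC table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one pairwise self-alignment of a genome."""
    chrom_a: str
    start_a: int
    end_a: int
    chrom_b: str
    start_b: int
    end_b: int
    identity: float  # fraction of matching aligned bases, 0.0-1.0


MIN_IDENTITY = 0.90  # >= 90% sequence identity (threshold quoted in the text)
MIN_LENGTH = 1_000   # >= 1 kb (threshold quoted in the text)


def is_segmental_duplication(aln: Alignment) -> bool:
    """Return True if both aligned segments pass the length and identity cutoffs."""
    len_a = aln.end_a - aln.start_a
    len_b = aln.end_b - aln.start_b
    return min(len_a, len_b) >= MIN_LENGTH and aln.identity >= MIN_IDENTITY


# Made-up alignments, for illustration only.
alignments = [
    Alignment("chr22", 18_000_000, 18_040_000, "chr22", 21_000_000, 21_040_000, 0.97),
    Alignment("chr1", 500_000, 500_600, "chr5", 900_000, 900_600, 0.99),  # too short
    Alignment("chr7", 100_000, 105_000, "chr7", 300_000, 305_000, 0.85),  # too divergent
]
sds = [a for a in alignments if is_segmental_duplication(a)]
```

Only the first toy alignment passes both cutoffs; tightening `MIN_IDENTITY` to 0.95 and raising `MIN_LENGTH` to 10 kb would approximate the stricter LCR definition cited earlier.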
LCR/SD is a very important category of DNA architecture in the human genome. It has been found that they are associated with duplicated genes and pseudogenes [21], co-localize and overlap with Alu elements and CNV [22, 23], and play an important role in genome evolution [24, 25, 26]. LCR/SD pairs, acting as substrates, are thought to be a key factor in triggering NAHR events and causing CNV mutations [4, 27, 28, 29, 30, 31, 32].
Generally, reversely oriented SDs can align and subsequently cross over with each other via NAHR, resulting in copy-neutral inversions of the intervening DNA fragments (Fig. 1A). Similarly, NAHR events between direct SD pairs can cause CNVs, and the type of CNV depends on the relative positions of the SD pair. Duplications or deletions arise from NAHR events between different chromatids (inter-chromatid) (Fig. 1B), while only deletions arise from NAHR on the same chromatid (intra-chromatid) (Fig. 1C).
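The three outcomes in Fig. 1 can be made concrete with a toy model in which a chromatid is a list of labeled segments. This is a minimal sketch under simplifying assumptions: the function name `nahr_products`, the segment labels, and the use of a trailing prime to mark inverted orientation are all illustrative, and the hybrid nature of the recombined repeats is ignored.

```python
def nahr_products(segments, i, j, orientation, same_chromatid):
    """Return the rearranged chromatid(s) produced by NAHR between the
    paired repeats at indices i < j of `segments`.

    orientation: "direct" or "inverted" (relative orientation of the pair).
    same_chromatid: True for intra-chromatid, False for inter-chromatid events.
    """
    left, middle, right = segments[:i], segments[i + 1:j], segments[j + 1:]
    repeat = segments[i]

    if orientation == "inverted":
        # Fig. 1A: the intervening DNA is flipped; copy number is unchanged.
        flipped = [label + "'" for label in reversed(middle)]
        return [left + [repeat] + flipped + [segments[j]] + right]

    if orientation != "direct":
        raise ValueError("orientation must be 'direct' or 'inverted'")

    # Direct repeats: the intervening DNA and one repeat copy are lost.
    deletion = left + [repeat] + right
    if same_chromatid:
        # Fig. 1C: the excised ring is lost, so only the deletion persists.
        return [deletion]

    # Fig. 1B: unequal crossover yields reciprocal deletion and duplication.
    duplication = left + [repeat] + middle + [repeat] + middle + [repeat] + right
    return [deletion, duplication]


chromatid = ["A", "R", "B", "R", "C"]
intra = nahr_products(chromatid, 1, 3, "direct", same_chromatid=True)
inter = nahr_products(chromatid, 1, 3, "direct", same_chromatid=False)
inverted = nahr_products(["A", "R", "B", "R'", "C"], 1, 3, "inverted", same_chromatid=True)
```

In this toy example the intra-chromatid event leaves only `A R C`, the inter-chromatid event additionally yields `A R B R B R C`, and the inverted pair flips `B` without changing copy number.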
Based on the observations in specific pathogenic loci, SD properties (including homology length, distance, and sequence similarity) were shown to affect the incidence of NAHR [33]. In a recent study on common CNVs in human populations, it was found that SD length and inter-SD distance were the major SD properties involved in NAHR frequency [34]. A model of chromosomal compression/extension/looping has also been proposed for homology mispairing in NAHR [34].
High-copy repeats in CNV instability
The genomic repeats representing DNA primary structures can be divided into LCR/SD and high-copy repeats. Compared to LCR/SD, high-copy repeats constitute a great portion of the human genome. Interspersed repeats are the most common type of high-copy repeats, which cover over 44% of the human genome [35]. One of the major classes of interspersed repeats is the retrotransposon, including short interspersed elements (SINEs), long interspersed elements (LINEs), and endogenous retroviruses (ERVs).
SINEs are short DNA sequences (100-400 bp in length) with an internal (mobile) polymerase III promoter [15], making up about 11% of the human genome [36]. The most common SINEs are Alu elements, which expanded rapidly during primate evolution [37]. Moreover, Alu elements play an important role in diseases, including breast cancer, Ewing's sarcoma, and familial hypercholesterolemia [38]. In 2008, Kim et al. [22] found a strong association between Alu elements and old SDs. By means of NAHR, Alu elements contribute to the formation of CNVs, especially deletions [39, 40, 41, 42, 43].
The most classic repeats in LINEs are the LINE-1 (L1) elements, which are 6-8 kb in length [44]. Covering about 17% of the human genome, L1 elements can elevate genomic instability, provide resources for NAHR [39, 40, 41, 45, 46], and cause human diseases [47, 48, 49, 50].
In the human genome, there exist at least 50,000 copies of ERVs [51], which are defined as human endogenous retroviruses (HERVs), covering about 4.9% of human DNA sequences [21]. Via the NAHR mechanism, HERVs were found to induce large deletions and cause hypotonia and motor, language, and cognitive delays [52, 53]. Intriguingly, a series of studies show that there is a strong association between HERVs and CNVs in the region of AZFa, a well-known locus related to male infertility [54, 55, 56, 57].
Repeat-Induced DNA Replication Error and CNV Mutation
As discussed above, repeat-mediated NAHR is one of the major mutational mechanisms for CNV formations. In these recombination-based events, paired repeats in direct orientation contribute to CNV instability and disease traits. However, is it the only way for the genomic repeats to induce CNV mutations? The recent investigation of DNA replication-based mechanisms provides novel insights into repeat-mediated CNV instability.
Inverted repeats involved in CNV instability
Genomic repeats, especially inverted repeats (IRs), can cause DNA replication errors and induce CNV formation. IRs, which share high sequence similarity at adjacent loci, can align or cross over with each other and form specific DNA secondary structures, such as cruciform structures, during DNA replication [58]. Formation of such secondary structures can stall the DNA replication fork, after which replication may jump to the wrong locus and continue there (Fig. 2). This mechanism of triggering replication errors subsequently results in genomic rearrangements and CNV mutations [24].
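On a single DNA strand, an inverted repeat appears as a sequence followed, after a short spacer, by its own reverse complement, which is what allows the two arms to pair into a hairpin or cruciform. The sketch below scans for perfect IRs by brute force. It is illustrative only: the function names, the parameters `arm_len` and `max_spacer`, and the example sequence are assumptions, and real IRs tolerate mismatches that this exact-match search ignores.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, arm_len: int, max_spacer: int):
    """Return (left_start, right_start) pairs where the arm starting at
    left_start is the exact reverse complement of the arm starting at
    right_start, with at most max_spacer bases between the two arms."""
    hits = []
    n = len(seq)
    for i in range(n - 2 * arm_len + 1):
        target = reverse_complement(seq[i:i + arm_len])
        first = i + arm_len                       # spacer of length 0
        last = min(first + max_spacer, n - arm_len)
        for j in range(first, last + 1):
            if seq[j:j + arm_len] == target:
                hits.append((i, j))
    return hits


# Toy sequence: a 7-bp arm, a 4-bp spacer, then the arm's reverse complement.
arm = "GATTACA"
toy = "TT" + arm + "AAAA" + reverse_complement(arm) + "TT"
hits = find_inverted_repeats(toy, arm_len=7, max_spacer=10)
```

The scan reports the single arm pair starting at positions 2 and 13 of the toy sequence, the configuration that could fold into a hairpin with a 4-base loop.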

Fig. 2. Repeat-induced DNA replication errors and copy number variation (CNV) formation. The straight lines depict single DNA strands, and the solid arrows (red and blue) represent genomic repeats. The dashed lines indicate newly synthesized DNA strands. During DNA replication, adjacent repeats could form DNA secondary structures (such as hairpin) that consequently result in replication fork stalling. Then, CNVs are generated via DNA template switching. For example, (1) jumping over the secondary structures and restarting DNA replication lead to deletions and (2) switching to a new template (shown in green lines) and switching back result in duplications of the green DNA segment.
IRs induce complex CNVs by replication errors
During DNA replication, IRs could form DNA secondary structures, which will induce replication fork stalling. Template switching and replication resumption further result in CNV mutations (Fig. 2). Notably, replication-associated events usually lead to complex CNVs, which include the combined segments of deletions and duplications. As was reported, Chen et al. [59] identified three complex CNVs that could be explained by a model of serial replication slippage (SRS). In this model, IRs have the potential to induce SRS and cause CNV mutations.
IRs can induce complex CNVs, as observed in the MECP2 locus in chromosome Xq28 and the PLP1 locus in chromosome Xq22 [60]. To elucidate the mechanisms of complex CNVs in the PLP1 locus, Hastings et al. [61] found both microhomology and IRs at the breakpoints. They proposed that both breakage of replication forks and the IR-mediated aberrant repair process can result in complex CNVs. This model was termed 'microhomology-mediated break-induced replication,' which was used to explain the formation of the complex CNVs involving individual genes or even single exons [62].
Self-chains in CNV formation
The aforementioned SDs are long (>1 kb) LCRs in the human genome. Besides SDs, self-chains (SCs) are another type of LCR, but short in length; they were previously analyzed and mapped via self-alignment of the human genome using BLASTZ [63, 64]. Of all SCs, 91% range from 150 bp to 1 kb in size [14]. Furthermore, SCs have a limited number of matched alignments in the human genome. Thus, SCs represent a distinct category of human short LCRs.
In 2013, Chen et al. [65] studied deletion CNVs in the NRXN1 gene and its flanking regions. After mapping and analyzing the breakpoints of 32 deletions, they found that minus SCs (i.e., paired SCs in the inverted orientation) were significantly overrepresented in the vicinity of deletion breakpoints in the NRXN1 region. Furthermore, they proposed that SCs can increase genomic instability and cause deletions via DNA replication errors. Their work contributes to the exploration of SC-mediated CNVs.
To perform a genome-wide analysis of the contribution of SCs to human CNV instability, Zhou et al. [14] plotted the numbers of SC regions with different orientations across the entire human genome. After masking the SDs and gaps in the human genome, and using both germline CNVs in human populations and somatic CNVs in various cancer genomes, they observed a significantly biased distribution of CNV breakpoints toward SC regions. This indicated that SC-mediated secondary structures may induce DNA replication errors and potentially generate different types of CNVs, such as deletions and duplications. In this case, SCs represent a new genomic architecture underlying regional susceptibility to genomic instability, further giving rise to CNVs.
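One generic way to ask whether breakpoints are biased toward a set of regions is a permutation test that compares the observed overlap with the overlap of randomly placed breakpoints. The sketch below illustrates that idea only; it is not the statistical procedure of Zhou et al. [14], the function names and toy coordinates are hypothetical, and it does not model the masking of SDs and assembly gaps described above.

```python
import bisect
import random


def count_in_regions(points, regions):
    """Count points falling inside sorted, non-overlapping half-open
    (start, end) intervals."""
    starts = [start for start, _ in regions]
    hits = 0
    for p in points:
        k = bisect.bisect_right(starts, p) - 1  # last region starting at or before p
        if k >= 0 and p < regions[k][1]:
            hits += 1
    return hits


def permutation_p_value(breakpoints, regions, genome_len, n_perm=1000, seed=0):
    """Return (observed overlap, one-sided empirical p-value) obtained by
    placing the same number of breakpoints uniformly at random."""
    rng = random.Random(seed)
    observed = count_in_regions(breakpoints, regions)
    at_least = 0
    for _ in range(n_perm):
        shuffled = [rng.randrange(genome_len) for _ in breakpoints]
        if count_in_regions(shuffled, regions) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)


# Toy data: the regions cover 2% of a 100-kb "genome"; 5 of 6 breakpoints hit them.
sc_regions = [(1_000, 2_000), (50_000, 51_000)]
breakpoints = [1_100, 1_500, 1_900, 50_100, 50_500, 70_000]
observed, p_value = permutation_p_value(breakpoints, sc_regions, genome_len=100_000)
```

A small empirical p-value here means that random placement rarely reproduces the observed overlap; a real analysis would restrict the random placement to the unmasked, mappable portion of the genome.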
DNA repair and nonhomologous end-joining
When DNA double-strand breaks (DSBs) occur, nonhomologous end-joining (NHEJ) is one of the molecular mechanisms for repairing them and maintaining genome integrity. Once a DSB is detected, the broken DNA ends are bridged and modified by the enzyme machinery, and a final ligation step completes the repair. Unlike NAHR, NHEJ can take place without any homology as the substrate. Notably, deletions or insertions of several base pairs are usually introduced at the junction. More mechanistic details of NHEJ are provided in previous works [66, 67, 68].
DNA Replication Dynamics in CNV Mutagenesis
In addition to the aforementioned genomic features, some high-order genome organizations might contribute to genome instability. New observations in the human genome showed that the DNA mutation rate is associated with DNA replication timing. Stamatoyannopoulos et al. [69] found that the human point mutation rate is markedly increased in genomic regions of late replication. This correlation indicates that DNA replication timing, as an important feature of replication dynamics, is involved in genomic instability and enlightens the investigation on the relationship of replication timing and CNV instability.
Replication timing as a high-order genomic feature
DNA replication takes place at replication forks in a defined manner [70]. In the human genome, the segments of chromosomes replicate in a temporal order [71], and the whole genome is spatially segregated into replication zones of different organizations. When the replicons within one spatial chromatin compartment fire synchronously, that chromosomal unit shares the same replication timing and is termed a 'replication domain.' Therefore, the genome consists of several replication domains with different replication timing, together with the timing transition regions between them.
Replication timing can be measured by two distinct methods based on current genome technologies [72, 73, 74]. One method labels newly replicated DNA with chemically tagged nucleotides; the labeled DNA is then isolated from cells at various times during S phase by immunoprecipitation or density fractionation. The other method relies on DNA content: since DNA segments that replicate earlier accumulate more copies than those that replicate late, the DNA content of a region across replicating cells reflects its replication timing. After the cells are sorted by fluorescence-activated cell sorting, the DNA extracted from S phase cells is compared with that from G1 phase cells by next-generation sequencing or microarray technologies. Either way, a replication timing profile can be generated (Fig. 3) [75].
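The DNA-content approach reduces to comparing read depth between S phase and G1 phase cells in each genomic window. The following is a minimal sketch of that comparison, assuming per-window read counts are already available. The function name, the pseudocount, and the toy counts are illustrative choices, and the smoothing and normalization steps used in published profiles are omitted.

```python
import math


def timing_profile(s_counts, g1_counts, pseudocount=1.0):
    """Return the per-window log2 ratio of depth-normalized S phase to G1
    phase read counts; higher values indicate earlier replication."""
    if len(s_counts) != len(g1_counts):
        raise ValueError("S and G1 count vectors must have the same length")
    s_total = sum(s_counts)
    g1_total = sum(g1_counts)
    profile = []
    for s, g in zip(s_counts, g1_counts):
        # Normalize by library size so sequencing depth does not bias the ratio.
        s_norm = (s + pseudocount) / s_total
        g_norm = (g + pseudocount) / g1_total
        profile.append(math.log2(s_norm / g_norm))
    return profile


# Toy counts for four windows ordered from early- to late-replicating.
s_phase = [180, 150, 100, 70]
g1_phase = [100, 100, 100, 100]
profile = timing_profile(s_phase, g1_phase)
```

In this toy example the profile decreases monotonically across the four windows, with the first window above zero (relatively early) and the last below zero (relatively late).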
Based on the timing profiles, a lot of progress has been made on understanding the replication program and its relationship with other genome architectures. Recent findings indicate significant links between replication timing and the features of primary genomic structures [76]. The genomic regions where DNA replicates earlier usually have more genes, fewer LINEs, and higher GC content [77, 78]. Moreover, it is noticed that DNA replication timing correlates with transcription [79, 80, 81]. Expressed genes replicate earlier, while repressed genes replicate late. Although this correlation shows a discrepancy between multicellular and single-celled organisms, it is worth noting that such works indeed reveal the striking association of replication timing and transcriptional activity in humans [77, 82, 83]. Moreover, recent findings show that replication timing strongly correlates with three-dimensional chromatin structures [84]. In Hi-C data, it has been observed that chromatin is organized into two separate compartments. Remarkably, DNA that resides in close spatial structures replicates in near time, and chromatin that interacts between two compartments is exactly at the timing transition regions. This observation suggests replication timing as an independent advanced genomic feature.
Replication timing and CNV instability in human populations and cancers
The relationship between DNA replication timing and genomic instability, which is involved in genomic mutation and human disease, is of particular interest. As mentioned above, human mutation rates, based on evolutionary divergence and single-nucleotide polymorphism frequency, are increased in late-replicating regions [69]. Koren et al. [75] generated a high-resolution timing profile of the human genome and investigated the relationship between DNA replication timing and point mutations. Consistent with the earlier finding, the association was again observed and proved to be much stronger.
How is CNV related to replication timing? Recent studies showed some distinct but multi-dimensional relationships between CNVs and replication timing. Based on the duplication hotspots conserved between two species of Drosophila, Cardoso-Moreira et al. [85] explored the roles of replication timing in genomic instability. They found that Drosophila duplication hotspots were enriched in late-replicated regions, unlike the aforementioned sequences of high sequence identity in the human genome. However, in spite of the association observed in Drosophila, the situation seems to be more complicated in mammalian genomes. In the study of Koren et al. [75], the relationship between early/late replication timing and CNV mutation was also investigated. The CNVs, mediated by different mechanisms, showed divergent patterns, suggesting a multi-dimensional interaction between CNVs and replication timing.
In addition to the observations in human populations, recent findings have also discovered the relationship of genomic reorganization and the subsequently generated genetic variation during cell fate changes. Lu et al. [86] have investigated the impact of altered replication timing on the CNV landscape during reprogramming. Approximately 40% of the human genome changes with regard to replication timing between human induced pluripotent stem cells (iPSCs) and their parent fibroblasts. Intriguingly, the CNV distribution shows a correlation with the changed timing profile. In particular, CNV gains tend to be located in the genomic regions that switch to replicate earlier. This correlation is conserved among different reprogramming methods.
Beyond cell fate changes, replication timing is also disrupted in many disease states, including cancer [87]. Numerous alterations to the replication program take place during carcinogenesis. One of these changes is the aberrant asynchronous replication of loci that replicate synchronously in normal cells. This phenomenon exists not only in cancer cells but also in noncancerous cells [88, 89, 90]. Such an abnormal replication program appears to have a notable impact on genomic stability and thus increases the frequency of chromosomal rearrangements and CNVs. Recent findings have indicated that aberrant DNA replication timing is involved in changes in gene expression, epigenetic modifications, and an increased CNV mutation frequency [91, 92]. An analysis of 331,724 somatic copy number alterations (SCNAs) has shown that SCNAs increase in late-replicating regions among cells of different cancer types. As with the findings in iPSCs, the SCNA distribution is related to replication timing in tumor cells. In particular, amplification boundaries tend to be located in early-replicating regions, whereas deletion boundaries are more likely to reside in late-replicating regions [93].
Integrated replication dynamics related to CNV instability
In the study of Koren et al. [75], point mutations and CNVs showed different patterns in their correlations with replication timing. These observations may reflect the distinct mutational mechanisms of these two types of genomic variants and suggest complex effects of DNA replication on CNV instability. We hypothesized that integrated replication dynamics, rather than early/late replication timing alone, contribute to CNV mutation. It has been reported that dividing the genome into early- and late-replicating regions does not capture the full characteristics of DNA replication fork dynamics [72]. The timing transition regions, which represent the interactions of two spatially dependent chromatin compartments, are DNA segments with low rates of replication fork progression. Notably, slower fork speed and increased fork stalling have been found to be associated with cancer cells and to result in CNV mutations [94]. Chen et al. [95] developed a statistical method for estimating replication dynamics and observed a significant association between these dynamics and CNV instability. Replication dynamics may be used as a measure of the progress of genome replication and of regional replication stress, and provide novel insights into the roles of DNA replication in CNV mutagenesis.
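Timing transition regions can be pictured as the stretches of a timing profile where the value changes steeply between an early and a late domain. The sketch below flags such windows with a central-difference slope. It is a hypothetical illustration only and not the statistical method of Chen et al. [95]; the function name, the `min_slope` threshold, and the toy profile are all assumptions.

```python
def transition_windows(timing, min_slope):
    """Return indices of interior windows whose central-difference slope,
    (timing[i + 1] - timing[i - 1]) / 2, has magnitude >= min_slope."""
    hits = []
    for i in range(1, len(timing) - 1):
        slope = (timing[i + 1] - timing[i - 1]) / 2
        if abs(slope) >= min_slope:
            hits.append(i)
    return hits


# Toy profile: an early domain (1.0), a transition, then a late domain (-1.0).
toy_timing = [1.0, 1.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.0, -1.0]
transitions = transition_windows(toy_timing, min_slope=0.4)
```

Here windows 3 to 5 are flagged, which is the span connecting the two flat domains; CNV breakpoints could then be tallied separately for early, late, and transition windows.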
Conclusion
Human genomic repeats play an important role in CNV mutation, genomic disorders, and genome evolution. Both low-copy genomic repeats (including LCRs and SDs) and high-copy repeats (including Alu, LINEs, and HERVs) can induce CNV formation via classical DNA recombination-based mechanisms, such as NAHR. Furthermore, paired repeats (especially those in the inverted orientation) are even more crucial as substrates to form DNA secondary structures and cause DNA replication fork stalling and replication stress. This will induce DNA replication errors and subsequently generate CNV mutations. Besides the primary structural features (e.g., the organization of repeat sequences in the human genome) and repeat-mediated secondary DNA structures, higher-order genomic architecture (such as replication timing) is also involved in CNV instability. Further investigation of the role of DNA replication dynamics in CNV mutagenesis will reveal more mutational mechanisms underlying genomic disorders and genome evolution.
Acknowledgments
This work was supported by the National Basic Research Program of China (2012CB944600 and 2011CBA00401), National Natural Science Foundation of China (81222014, 31171210 and 31000552), Shu Guang Project (12SG08), Shanghai Pujiang Program (10PJ1400300), and Recruitment Program of Global Experts.