Early detection of emerging viral variants through analysis of community structure of coordinated substitution networks
The major goal of this study is to develop and validate a methodology that, given viral sequences sampled at several time points, infers potentially emerging viral haplotypes by analyzing dense communities of coordinated substitution networks. To this end, the section is organized as follows. First, we provide a theoretical justification of the proposed approach by considering an idealized model of an evolving population with a given fitness landscape and epistatic network (“Model-based rationale of the proposed approach”). As epistatic networks are not directly observable, “Inference of coordinated substitution networks” outlines the methodology to infer them from sequencing data. “Estimation of density-based P values of viral haplotypes” describes our approach to validate the statistical significance of associations between known VOCs/VOIs and dense communities in inferred epistatic networks. Finally, “Inference of viral haplotypes” presents the algorithmic framework to infer emerging viral variants de novo.
Model-based rationale of the proposed approach
The major idea of this study is to predict emerging viral variants as dense communities in epistatic networks. This idea can be partially substantiated by the following simple combinatorial population genetics model assuming that the basic mutational mechanism consists of random point mutations.
Consider a population of haploid genotypes \({{{{{{{\mathcal{P}}}}}}}}=\{{g}_{1},\ldots,{g}_{n}\}\) of a fixed length L with two potential allelic states 0 and 1 at each locus, where 0 stands for the reference allele and 1 stands for an alternative allele. Each genotype is thus represented as a binary sequence, and all \({2}^{L}\) possible genotypes form a sequence space represented as an L-dimensional hypercube \({{{{{{{\mathcal{H}}}}}}}}\)48, i.e., a graph whose vertices are 0–1 sequences of length L, and two vertices are adjacent whenever they differ in a single coordinate.
Each genotype gi is assigned the fitness fi – a real number that serves as a quantitative measure of its reproductive capacity49,50. The function mapping genotypes into the set of their fitness values is referred to as a fitness landscape49.
In our model, the genotypes are subject to negative epistasis51,52. Following refs. 50, 53, it is defined as the statistical effect where the combined effect of mutations at two specific loci leads to a lower fitness than if these mutations occurred independently, i.e., f11 < f10 + f01 − f00, where fij, i, j = 0, 1 are expected fitnesses of genotypes with allelic states (i, j) at the loci. Similarly, positive epistasis occurs when the combined effect of multiple mutations results in a higher than expected fitness, i.e., f11 > f10 + f01 − f0050,53. In our case, negative epistasis is assumed to render the corresponding genomes non-viable or evolutionarily non-competitive (f11 ≤ 0). Epistatic interactions can be represented by the coordinated substitution network \({{{{{{{\mathcal{G}}}}}}}}\) (Fig. 5a), where vertices correspond to loci, and two vertices are adjacent when a 2-haplotype (1, 1) at the respective loci is viable (i.e., the loci are not under negative epistasis).
Epistasis has been proposed to constrain the selective accessibility of genomic variants and restrict potential evolutionary trajectories of a population51,54. This effect can be described in graph-theoretical terms as follows. Viable genotypes (i.e., genotypes with positive fitness values) constitute a subgraph of the hypercube \({{{{{{{\mathcal{H}}}}}}}}\), referred to as the viable space. In the model under consideration, a genotype is deemed viable if all its alternative alleles are pairwise adjacent in the network \({{{{{{{\mathcal{G}}}}}}}}\), i.e., form a clique within \({{{{{{{\mathcal{G}}}}}}}}\) (Fig. 5b). As a result, each inclusion-maximal clique C of \({{{{{{{\mathcal{G}}}}}}}}\) (i.e., a clique that is not contained in another clique) generates a complete sub-hypercube \({{{{{{{\mathcal{H}}}}}}}}(C)\) in the viable space; this sub-hypercube is a projection of genotype vectors onto the subspace formed by the loci of C. Thus, decomposing the epistatic network into maximal cliques yields a partition of the viable space into sub-hypercubes. This partition defines a set of restricted evolutionary trajectories that the population could potentially explore.
More specifically, within each sub-hypercube \({{{{{{{\mathcal{H}}}}}}}}(C)\), only additive and positive epistatic fitness effects can be present. Therefore, evolutionary trajectories within \({{{{{{{\mathcal{H}}}}}}}}(C)\) will eventually accumulate all mutations in C and converge to the genotype gC with the maximum fitness within \({{{{{{{\mathcal{H}}}}}}}}(C)\) that contains all alternative alleles of the clique C (Fig. 5c). Overall, the proposed model indicates that any evolutionary trajectory within the entire viable space will ultimately converge to a genotype determined by one of the maximal cliques in the epistatic network.
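The clique-to-genotype correspondence above can be illustrated with a short sketch: a basic Bron–Kerbosch routine enumerates the maximal cliques of a small epistatic network, and each clique C yields the convergent genotype gC carrying all alternative alleles of C. The toy network and its parameters are illustrative assumptions, not data from the study:

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with a basic Bron-Kerbosch recursion.
    adj: dict mapping each vertex to the set of its neighbours."""
    cliques = []
    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    expand(set(), set(adj), set())
    return cliques

def convergent_genotypes(adj, L):
    """Genotype g_C for each maximal clique C: the alternative allele (1)
    exactly at the loci of C, the reference allele (0) elsewhere."""
    return [tuple(1 if i in C else 0 for i in range(L))
            for C in maximal_cliques(adj)]

# Toy epistatic network on 4 loci: loci 0, 1, 2 are pairwise compatible,
# locus 3 is compatible only with locus 2 (other pairs: negative epistasis).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sorted(convergent_genotypes(adj, 4)))  # [(0, 0, 1, 1), (1, 1, 1, 0)]
```

The two maximal cliques {0, 1, 2} and {2, 3} give exactly the two genotypes to which trajectories in the respective sub-hypercubes can converge.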
In practical settings, epistatic networks are not directly observable. Therefore, in accordance with refs. 23, 26, 34, we approximate them using coordinated substitution networks, which are statistically inferred from genomic data. Since the inferred networks may not encompass all true links, we consider dense subgraphs rather than cliques.
Inference of coordinated substitution networks
Consider a population consisting of N haploid genotypes of length L whose observed abundances change over time points t = 1, …, T. For a pair of distinct loci u, v ∈ {1, …, L} we consider 4 possible 2-haplotypes (i, j) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}, where 0 and 1 are reference and alternative alleles in u and v respectively. Let also \({O}_{ij}^{t}\) be an observed count of 2-haplotypes (i, j), i, j = 0, 1, at a time point t ∈ {1, …, T}.
We define a coordinated substitution network at the time point t as a graph \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) with nodes representing SAVs, and two nodes being adjacent whenever the corresponding non-reference alleles are simultaneously observed more frequently than expected by chance. Formally, SAVs at positions u and v are adjacent in \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) (or linked), when the following inequality holds:
$$1-\mathop{\sum }\limits_{i=0}^{{O}_{11}^{t}-1}\left(\begin{array}{c}N\\ i\end{array}\right){\left(\frac{{O}_{10}^{t}\cdot {O}_{01}^{t}}{{O}_{00}^{t}\cdot N}\right)}^{i}{\left(1-\frac{{O}_{10}^{t}\cdot {O}_{01}^{t}}{{O}_{00}^{t}\cdot N}\right)}^{N-i}\le \frac{\rho }{\left(\begin{array}{c}L\\ 2\end{array}\right)},$$
(3)
where ρ is a pre-defined P value (in this study, we used ρ = 0.05).
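The test (3) can be sketched in a few lines of stdlib Python: the left-hand side is the upper binomial tail evaluated at the observed count, compared against the Bonferroni-corrected threshold. All counts in the example are hypothetical:

```python
from math import comb

def linked(O00, O01, O10, O11, N, L, rho=0.05):
    """One-sided binomial test of Eq. (3): are the alternative alleles at
    two loci observed together more often than expected by chance?
    O_ij: 2-haplotype counts at time t; N: sample size; L: genome length;
    rho: significance level, Bonferroni-corrected over C(L, 2) SAV pairs."""
    p = (O10 * O01) / (O00 * N)   # upper bound on P(1,1) under f11 = 0
    cdf = sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(O11))
    return 1 - cdf <= rho / comb(L, 2)

# Hypothetical counts: the (1,1) haplotype appears 30 times in 1000 genomes,
# far above the ~1.3 occurrences expected under the null, so the SAVs link.
print(linked(O00=900, O01=40, O10=30, O11=30, N=1000, L=100))  # True
```

With only 2 observed co-occurrences instead of 30, the tail probability (about 0.39) would stay far above the corrected threshold and the same call would return False.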
In the remaining part of this subsection, we provide a justification for the formula (3). We suppose that viral evolution is driven by mutation and selection, where (a) each 2-haplotype (i, j) has fitness fij; (b) each transition (mutation) from the allele k to the allele l at the position u (resp. v) happens with probability \({q}_{kl}^{u}\) (resp. \({q}_{kl}^{v}\)). Thus, expected 2-haplotype counts \({E}_{ij}^{t}\) can be described by the quasispecies model55 (or mutation-selection balance model in the classical population genetics terms56) in the following form:
$${E}_{ij}^{t}=\mathop{\sum}\limits_{k,l\in \{0,1\}}{f}_{kl}{q}_{ki}^{u}{q}_{lj}^{v}{E}_{kl}^{t-1}$$
(4)
Mutation probabilities per genomic position per year of most viruses have orders of magnitude between 10−3 and 10−557. Thus, we can assume that for time intervals considered in this study, the non-negative probability of allelic change is smaller than the probability of no-change, i.e.,
$$0 \, < \, {q}_{01}^{u}\,{q}_{10}^{u} \, < \, {q}_{00}^{u}\,{q}_{11}^{u},\qquad 0 \, < \, {q}_{01}^{v}\,{q}_{10}^{v} \, < \, {q}_{00}^{v}\,{q}_{11}^{v}.$$
(5)
We can use the model (4) to decide whether the 2-haplotype (1, 1) is viable or its observed appearances can be plausibly explained by random mutations. The corresponding test is based on the following fact:
Theorem 1 Suppose that the 2-haplotype (1, 1) is not viable, i.e., f11 = 0. Then
$${E}_{11}^{t}\le \frac{{E}_{01}^{t}\cdot {E}_{10}^{t}}{{E}_{00}^{t}}$$
(6)
Proof The proof follows the same lines as the proof in ref. 39. Given that f11 = 0, we have
$${E}_{00}^{t}\cdot {E}_{11}^{t}= \left(\mathop{\sum}\limits_{k,l=0,1}{f}_{kl}{q}_{k0}^{u}{q}_{l0}^{v}{E}_{kl}^{t-1}\right)\left(\mathop{\sum}\limits_{k,l=0,1}{f}_{kl}{q}_{k1}^{u}{q}_{l1}^{v}{E}_{kl}^{t-1}\right)\\= {q}_{00}^{u}{q}_{00}^{v}{q}_{01}^{u}{q}_{01}^{v}{(\;{f}_{00}{E}_{00}^{t-1})}^{2}+{q}_{10}^{u}{q}_{00}^{v}{q}_{11}^{u}{q}_{01}^{v}{(\;{f}_{10}{E}_{10}^{t-1})}^{2}+{q}_{00}^{u}{q}_{10}^{v}{q}_{01}^{u}{q}_{11}^{v}{(\;{f}_{01}{E}_{01}^{t-1})}^{2}\\ +({q}_{00}^{u}{q}_{00}^{v}{q}_{01}^{u}{q}_{11}^{v}+{q}_{00}^{u}{q}_{10}^{v}{q}_{01}^{u}{q}_{01}^{v}){f}_{00}\,{f}_{01}{E}_{00}^{t-1}{E}_{01}^{t-1}\\ +({q}_{00}^{u}{q}_{00}^{v}{q}_{11}^{u}{q}_{01}^{v}+{q}_{10}^{u}{q}_{00}^{v}{q}_{01}^{u}{q}_{01}^{v}){f}_{00}\,{f}_{10}{E}_{00}^{t-1}{E}_{10}^{t-1}\\ +({q}_{00}^{u}{q}_{10}^{v}{q}_{11}^{u}{q}_{01}^{v}+{q}_{10}^{u}{q}_{00}^{v}{q}_{01}^{u}{q}_{11}^{v}){f}_{01}\,{f}_{10}{E}_{01}^{t-1}{E}_{10}^{t-1}$$
(7)
and
$${E}_{01}^{t}\cdot {E}_{10}^{t}= \left(\mathop{\sum}\limits_{k,l=0,1}{f}_{kl}{q}_{k0}^{u}{q}_{l1}^{v}{E}_{kl}^{t-1}\right)\left(\mathop{\sum}\limits_{k,l=0,1}{f}_{kl}{q}_{k1}^{u}{q}_{l0}^{v}{E}_{kl}^{t-1}\right)\\= {q}_{00}^{u}{q}_{01}^{v}{q}_{01}^{u}{q}_{00}^{v}{(\;{f}_{00}{E}_{00}^{t-1})}^{2}+{q}_{10}^{u}{q}_{01}^{v}{q}_{11}^{u}{q}_{00}^{v}{(\;{f}_{10}{E}_{10}^{t-1})}^{2}+{q}_{00}^{u}{q}_{11}^{v}{q}_{01}^{u}{q}_{10}^{v}{(\;{f}_{01}{E}_{01}^{t-1})}^{2}\\ +({q}_{00}^{u}{q}_{01}^{v}{q}_{01}^{u}{q}_{10}^{v}+{q}_{00}^{u}{q}_{11}^{v}{q}_{01}^{u}{q}_{00}^{v}){f}_{00}\,{f}_{01}{E}_{00}^{t-1}{E}_{01}^{t-1}\\ +({q}_{00}^{u}{q}_{01}^{v}{q}_{11}^{u}{q}_{00}^{v}+{q}_{10}^{u}{q}_{01}^{v}{q}_{01}^{u}{q}_{00}^{v}){f}_{00}\,{f}_{10}{E}_{00}^{t-1}{E}_{10}^{t-1}\\ +({q}_{00}^{u}{q}_{11}^{v}{q}_{11}^{u}{q}_{00}^{v}+{q}_{10}^{u}{q}_{01}^{v}{q}_{01}^{u}{q}_{10}^{v}){f}_{01}\,{f}_{10}{E}_{01}^{t-1}{E}_{10}^{t-1}$$
(8)
It is easy to see that the terms in Eqs. (7) and (8) except for the last ones are equal. Thus we have
$${E}_{01}^{t}\cdot {E}_{10}^{t}-{E}_{00}^{t}\cdot {E}_{11}^{t}\\ =({q}_{00}^{u}{q}_{11}^{v}{q}_{11}^{u}{q}_{00}^{v}+{q}_{10}^{u}{q}_{01}^{v}{q}_{01}^{u}{q}_{10}^{v}-{q}_{00}^{u}{q}_{10}^{v}{q}_{11}^{u}{q}_{01}^{v}-{q}_{10}^{u}{q}_{00}^{v}{q}_{01}^{u}{q}_{11}^{v}){f}_{01}\,{f}_{10}{E}_{01}^{t-1}{E}_{10}^{t-1}\\ =\left(1-\frac{{q}_{01}^{u}{q}_{10}^{u}}{{q}_{00}^{u}{q}_{11}^{u}}\right)\left(1-\frac{{q}_{01}^{v}{q}_{10}^{v}}{{q}_{00}^{v}{q}_{11}^{v}}\right){q}_{00}^{u}{q}_{11}^{v}{q}_{11}^{u}{q}_{00}^{v}\,{f}_{01}\,{f}_{10}{E}_{01}^{t-1}{E}_{10}^{t-1}\ge 0,$$
(9)
where the last inequality follows from (5). Thus, the inequality (6) holds.
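Theorem 1 can also be checked numerically by iterating the recurrence (4) with f11 = 0 and verifying inequality (6) at each generation. The fitness values, mutation rate, and initial counts below are assumptions made purely for illustration:

```python
def step(E, f, qu, qv):
    """One generation of the mutation-selection recurrence (4).
    E, f: dicts over 2-haplotypes (i, j); qu, qv: 2x2 mutation matrices."""
    return {(i, j): sum(f[k, l] * qu[k][i] * qv[l][j] * E[k, l]
                        for k in (0, 1) for l in (0, 1))
            for i in (0, 1) for j in (0, 1)}

# Toy parameters (assumed for illustration): f11 = 0, mutation rate 1e-3.
f = {(0, 0): 1.0, (0, 1): 1.2, (1, 0): 1.1, (1, 1): 0.0}
q = [[0.999, 0.001], [0.001, 0.999]]   # satisfies inequality (5)
E = {(0, 0): 1e6, (0, 1): 1e3, (1, 0): 1e3, (1, 1): 0.0}
for t in range(50):
    E = step(E, f, q, q)
    # Theorem 1, Eq. (6): E11 never exceeds E01 * E10 / E00 when f11 = 0.
    assert E[1, 1] <= E[0, 1] * E[1, 0] / E[0, 0]
print("inequality (6) holds for 50 generations")
```

The (1, 1) haplotype is continually regenerated by mutation, but its expected count always stays within the bound of Theorem 1.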
Using Theorem 1, we can evaluate the likelihood of the event that a large number of genomes contain the 2-haplotype (1, 1) given that this 2-haplotype is not viable. Considering the density of sampling and the number of SARS-CoV-2 genomes analyzed in this study, we assume that the observed and expected numbers of 2-haplotypes are close to each other. Let q be the probability of observing a genome containing the 2-haplotype (1, 1) among N genomes given that f11 = 0. Following ref. 36, we model the count of such genomes, X, with a binomial distribution B(N, q). The probability that \(X\ge {O}_{11}^{t}\) is:
$$p(X\ge {O}_{11}^{t}| {f}_{11}=0)=1-{F}_{X}({O}_{11}^{t}-1| N,q)=1-\mathop{\sum }\limits_{i=0}^{{O}_{11}^{t}-1}\left(\begin{array}{c}N\\ i\end{array}\right){q}^{i}{(1-q)}^{N-i},$$
(10)
where FX is the binomial cumulative distribution function. Theorem 1 implies an upper bound for q: \(q\le p=\frac{{O}_{10}^{t}\cdot {O}_{01}^{t}}{{O}_{00}^{t}\cdot N}\). Therefore
$$p(X\ge {O}_{11}^{t}| \, {f}_{11}=0)=1-{F}_{X}({O}_{11}^{t}-1| N,q)\le 1-{F}_{X}({O}_{11}^{t}-1| N,p).$$
(11)
We consider SAVs at positions u and v linked when the probability \(p(X\ge {O}_{11}^{t}| {f}_{11}=0)\) is sufficiently low, which is guaranteed when its upper bound in Eq. (11) is sufficiently low, i.e.,
$$p(X\ge {O}_{11}^{t}| {f}_{11}=0)\le 1-{F}_{X}({O}_{11}^{t}-1| N,p)\le \frac{\rho }{\left(\begin{array}{c}L\\ 2\end{array}\right)},$$
(12)
where ρ is a chosen significance level, and the denominator \(\left(L\atop 2\right)\) is a Bonferroni correction. The latter is used to account for multiple comparisons between \(\left(L\atop 2\right)\) pairs of SAVs being tested for linkage. Substituting the upper bound p into (12) yields the formula (3).
Estimation of density-based P values of viral haplotypes
We hypothesize that viral haplotypes corresponding to potential VOCs and VOIs form dense subgraphs of coordinated substitution networks. Below we describe the method used to statistically verify this hypothesis.
In what follows, we will use the standard graph-theoretical notation: \(V({{{{{{{\mathcal{G}}}}}}}})\) and \(E({{{{{{{\mathcal{G}}}}}}}})\) are the sets of vertices and edges of the graph \({{{{{{{\mathcal{G}}}}}}}}\), respectively; \({N}_{{{{{{{{\mathcal{G}}}}}}}}}(v)\) is the set of neighbors of a vertex v in \({{{{{{{\mathcal{G}}}}}}}}\); the subgraph of \({{{{{{{\mathcal{G}}}}}}}}\) induced by a vertex subset S is denoted by \({{{{{{{\mathcal{G}}}}}}}}[S]\).
We use the statistical test (12) to construct coordinated substitution networks \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) for different time points t using SARS-CoV-2 sequences sampled before or at the time t. These networks have the same set of vertices but different sets of edges. A viral haplotype thus can be associated with a subset of vertices \(H\subseteq V({{{{{{{{\mathcal{G}}}}}}}}}_{t})\) of a network \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\). The density of a haplotype H is thus defined as the density of the subgraph of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) induced by H, i.e.,
$${d}_{{{{{{{{{\mathcal{G}}}}}}}}}_{t}}(H)=\frac{| E({{{{{{{{\mathcal{G}}}}}}}}}_{t}[H])| }{| H| }$$
(13)
We estimate the statistical significance of our hypothesis by producing density-based P values of known VOC and VOI haplotypes H. Given the subgraph sample \({{{{{{{{\mathcal{S}}}}}}}}}^{*}=\{{S}_{1},\ldots,{S}_{| {{{{{{{{\mathcal{S}}}}}}}}}^{*}| }\}\), P value of a haplotype H in the network \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) is defined as
$${p}_{{{{{{{{{\mathcal{G}}}}}}}}}_{t}}(H)=\frac{\left|\left\{{S}_{j}\in {{{{{{{{\mathcal{S}}}}}}}}}^{*}:{d}_{{{{{{{{{\mathcal{G}}}}}}}}}_{t}}({S}_{j})\ge {d}_{{{{{{{{{\mathcal{G}}}}}}}}}_{t}}(H)\right\}\right|}{| {{{{{{{{\mathcal{S}}}}}}}}}^{*}| }$$
(14)
A low P value indicates that the subgraph representing haplotype H is denser compared to other subgraphs of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\).
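Definitions (13) and (14) translate directly into code. The following minimal sketch uses a hypothetical toy network and a hand-picked subgraph sample in place of the sampler described below:

```python
def density(edges, H):
    """Eq. (13): edges of the induced subgraph G[H] divided by |H|."""
    H = set(H)
    return sum(1 for u, v in edges if u in H and v in H) / len(H)

def p_value(edges, H, samples):
    """Eq. (14): fraction of sampled subgraphs at least as dense as H."""
    d_H = density(edges, H)
    return sum(density(edges, S) >= d_H for S in samples) / len(samples)

# Hypothetical 5-vertex network; H = {0, 1, 2} is a triangle (density 1),
# compared against three sampled connected subgraphs of the same size.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
samples = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}]
print(p_value(edges, {0, 1, 2}, samples))   # 1/3: only H itself is as dense
```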
The naive way to produce the sample \({{{{{{{{\mathcal{S}}}}}}}}}^{*}\) is to randomly generate subgraphs of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) of the size ∣H∣. However, SARS-CoV-2 coordinated substitution networks are relatively sparse, and thus many sampled subgraphs will be a priori disconnected and, consequently, also sparse. As a result, such a sampling scheme is inherently biased towards assigning low P values to haplotypes corresponding to connected subgraphs and subgraphs with few connected components. Known VOCs and VOIs at most time points have these properties, and thus their statistical significance could be overestimated. This problem can be resolved by sampling only connected subgraphs.
The following numerical example shows why an advanced method for connected subgraph sampling is essential and a naive approach is ineffective. Consider one of the coordinated substitution networks generated in this study with 1273 vertices and 7329 edges. For a tree (a minimal connected subgraph) with 10 vertices, naive sampling of 1,000,000 subgraphs yielded 188 subgraphs at least as dense as the tree, resulting in a P value of 0.000188. In contrast, the P value from connected subgraph sampling is 1, as a tree has the minimal density among all connected subgraphs. Moreover, naive sampling produced only 2 connected subgraphs, making re-normalization with respect to such a small subsample unreliable. Generating a sufficiently large sample of connected subgraphs via the naive method is thus impractical due to the enormous naive sample size required.
To sample connected subgraphs, we utilize a more sophisticated randomized enumeration sampling algorithm that follows the network motif sampling scheme introduced in ref. 58. The algorithm assumes that the vertices of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) are labeled by the unique integers 1, …, L, and performs a recursive backtracking. For each vertex v in ascending numerical order, the algorithm iteratively grows a connected subgraph S by adding a randomly chosen new vertex w from the set of allowed extensions W. The set W is then updated to include the neighbors of w that are not in the exclusion set X. The exclusion set X speeds up the calculations and prevents double sampling by excluding (a) neighbors of vertices already in S, to avoid multiple additions of the same vertex to W, and (b) vertices numbered 1 to v, to prevent re-sampling subgraphs that should have been sampled at earlier iterations. The process continues until a subgraph of the desired size k is formed. Sampling of subgraphs containing the vertex v continues until the predetermined sample count is reached. The full procedure is detailed in Algorithm 1, and the proof of the correctness of this scheme can be found in ref. 58.
If, at some point, the subgraph induced by the haplotype H is disconnected, we replace it with its largest connected component. In this study, for each analyzed coordinated substitution network \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\), the sampling was performed until \(M=\min \{3000,{\eta }_{{{{{{{{{\mathcal{G}}}}}}}}}_{t}}(v)\}\) subgraphs for each vertex v are generated, where \({\eta }_{{{{{{{{{\mathcal{G}}}}}}}}}_{t}}(v)\) is the total number of connected subgraphs containing v. The value of 3000 was selected empirically to provide a sufficient number of sampled subgraphs for all analyzed viral variants.
Algorithm 1
Sampling of connected k-subgraphs without forbidden pairs
1: Input: graph \({{{{{{{\mathcal{G}}}}}}}}\) with \(V({{{{{{{\mathcal{G}}}}}}}})=\{1,\ldots,L\}\), an integer k, and the sample size per vertex M.
2: for v = 1: L do
3: S ← {v}; \(W\leftarrow {N}_{{{{{{{{\mathcal{G}}}}}}}}}(v)\setminus \{1,\ldots,v\}\); \(X\leftarrow {N}_{{{{{{{{\mathcal{G}}}}}}}}}(v)\cup \{1,\ldots,v\}\)
4: global Mv = 0
5: call SampleSubgraph(v,S,W,X)
6: end for
SampleSubgraph(v,S,W,X)
7: if ∣S∣ = k then
8: output S, Mv ← Mv + 1 and return
9: end if
10: while \(W \,\ne \, {{\emptyset}}\) and Mv ≤ M do
11: sample a random vertex w ∈ W and set W ← W⧹{w}
12: \({S}^{{\prime} }\leftarrow S\cup \{w\}\); \({W}^{{\prime} }\leftarrow W\cup ({N}_{{{{{{{{\mathcal{G}}}}}}}}}(w)\setminus X)\); \({X}^{{\prime} }\leftarrow X\cup {N}_{{{{{{{{\mathcal{G}}}}}}}}}(w)\)
13: call SampleSubgraph(v,\({S}^{{\prime} }\),\({W}^{{\prime} }\),\({X}^{{\prime} }\))
14: end while
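A pure-Python rendering of Algorithm 1 might look as follows. It mirrors the recursive scheme above (growing the extension set from the neighbors of the newly added vertex w); the path graph used for the demonstration is an assumption:

```python
import random

def sample_connected_subgraphs(adj, k, M, seed=0):
    """Algorithm-1-style sampler: for each start vertex v, generate up to M
    connected k-subgraphs whose smallest label is v, by randomized
    recursive extension. adj: dict vertex -> set of neighbours (labels 1..L)."""
    rng = random.Random(seed)
    out = []
    for v in sorted(adj):
        lower = {u for u in adj if u <= v}
        count = [0]                       # M_v, the per-vertex sample counter
        def grow(S, W, X):
            if len(S) == k:
                out.append(frozenset(S))
                count[0] += 1
                return
            W = set(W)
            while W and count[0] < M:
                w = rng.choice(sorted(W)) # random extension vertex
                W.discard(w)
                grow(S | {w}, W | (adj[w] - X), X | adj[w])
        grow({v}, adj[v] - lower, adj[v] | lower)
    return out

# Path graph 1-2-3-4: its connected 3-subgraphs are {1,2,3} and {2,3,4}.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(sorted(sorted(S) for S in sample_connected_subgraphs(adj, 3, M=10)))
# [[1, 2, 3], [2, 3, 4]]
```

Because the exclusion set rules out vertices with labels at most v, each connected subgraph is generated from its minimum-label vertex only, so no subgraph is produced twice.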
Inference of viral haplotypes
In this subsection, we describe the method for inference of viral haplotypes as dense communities in coordinated substitution networks. Community detection is a well-established field of network science, with numerous algorithmic solutions proposed over the last two decades59. Typically (though not always), the collection of communities in a network is defined as a partition60. However, in the case of viral genomic variants, there can be overlaps, as observed in known VOCs and VOIs. Additionally, most existing algorithms are heuristics designed to scale to the sizes of extremely large networks rather than to produce optimal solutions. Viral coordinated substitution networks, although containing hundreds of vertices, are typically smaller than most networks studied in applied network theory. Thus we use our own community detection approach, which extends our previously developed methodology36. This approach uses exact algorithms rather than heuristics and is tailored to account for the characteristics of viral data.
Major steps of our computational framework called HELEN (Heralding Emerging Lineages in Epistatic Networks) are depicted in Fig. 6, and the full algorithmic workflow is described by Algorithm 2. For a given time point t, HELEN starts by constructing a coordinated substitution network \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\), as described in “Inference of coordinated substitution networks”. Then it generates a pool of candidate dense subgraphs of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) using Integer Linear Programming (ILP). Finally, it combines generated subgraphs into clusters corresponding to different haplotypes, and infers a haplotype from each cluster.
Generation of dense subgraphs: Our approach is based on a Linear Programming (LP) formulation61 for finding the densest subgraphs of networks \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) at each time point t. This formulation contains variables xi for each vertex \(i\in V({{{{{{{{\mathcal{G}}}}}}}}}_{t})\), variables yij for each edge \(ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})\), and the following objective function and constraints:
$$\mathop{\sum}\limits_{ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})}{y}_{ij}\to \max$$
(15)
$${y}_{ij}\le {x}_{i},\,\,\,{y}_{ij}\le {x}_{j},\,\,\,ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})$$
(16)
$$\mathop{\sum}\limits_{i\in V({{{{{{{{\mathcal{G}}}}}}}}}_{t})}{x}_{i}\le 1$$
(17)
$${x}_{i},{y}_{ij}\ge 0,\,\,\,i\in V({{{{{{{{\mathcal{G}}}}}}}}}_{t}),ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})$$
(18)
Note that the variables xi, yij are continuous rather than integer, since it can be shown that the value of the optimal solution of the LP (15)–(18) and the maximum subgraph density of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) coincide61; furthermore, if \(U\subseteq V({{{{{{{{\mathcal{G}}}}}}}}}_{t})\) is the vertex set of the densest subgraph, then setting \({x}_{i}=\frac{1}{| U| }\) for i ∈ U, \({y}_{ij}=\frac{1}{| U| }\) for edges ij with both endpoints in U, and all remaining variables to 0 yields an optimal solution of (15)–(18). Thus, densest subgraphs of the networks \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) can be found in polynomial time.
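On toy instances, the LP optimum can be cross-checked by exhaustive search over all vertex subsets. The brute-force sketch below is exponential in the number of vertices and serves only as a small sanity check of the densest-subgraph objective, not as a replacement for the LP; the example graph is hypothetical:

```python
from itertools import combinations

def densest_subgraph(vertices, edges):
    """Exhaustive maximisation of |E(G[S])| / |S| over all non-empty S.
    Exponential in |V|; usable only to cross-check the LP on toy graphs."""
    best, best_d = None, -1.0
    for r in range(1, len(vertices) + 1):
        for S in combinations(vertices, r):
            Sset = set(S)
            d = sum(1 for u, v in edges if u in Sset and v in Sset) / r
            if d > best_d:
                best, best_d = Sset, d
    return best, best_d

# A K4 on {0,1,2,3} with a pendant vertex 4: the K4 is the densest part.
V = [0, 1, 2, 3, 4]
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(densest_subgraph(V, E))   # ({0, 1, 2, 3}, 1.5)
```

Here the whole graph has density 7/5 = 1.4, so the optimum is attained on the clique, matching what the LP would return.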
The single densest subgraph can, however, provide only a single haplotype per time point. We need to produce multiple dense communities to infer multiple haplotypes that could correspond to VOCs and VOIs. So, we generate a pool of candidate dense subgraphs of \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) as follows. We iterate through a given range of fixed subgraph sizes k from \({k}_{\max }\) down to \({k}_{\min }\); at each iteration, we generate a set \({{{{{{{{\mathcal{S}}}}}}}}}_{k}\) of up to \({n}_{\max }\) densest subgraphs of size k that are not contained in subgraphs generated in the previous iterations. Here \({k}_{\max }\), \({k}_{\min }\) and \({n}_{\max }\) are parameters of the algorithm. However, finding the densest subgraph of a given size is an NP-hard problem62,63. Therefore, for each value of k, we use the following Integer Linear Programming formulation:
$$\frac{1}{k}\mathop{\sum}\limits_{ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})}{y}_{ij}\to \max$$
(19)
$${y}_{ij}\le {x}_{i},\,\,\,{y}_{ij}\le {x}_{j},\,\,\,ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})$$
(20)
$$\mathop{\sum}\limits_{i\in V({{{{{{{{\mathcal{G}}}}}}}}}_{t})}{x}_{i}=k$$
(21)
$$\mathop{\sum}\limits_{i\in V({{{{{{{{\mathcal{G}}}}}}}}}_{t})\setminus S}{x}_{i}\ge 1,\,\,\,S\in \mathop{\bigcup }\limits_{{k}^{{\prime} }=k+1}^{{k}_{\max }}{{{{{{{{\mathcal{S}}}}}}}}}_{{k}^{{\prime} }}$$
(22)
$${x}_{i},{y}_{ij}\in \{0,1\},\,\,\,i\in V({{{{{{{{\mathcal{G}}}}}}}}}_{t}),ij\in E({{{{{{{{\mathcal{G}}}}}}}}}_{t})$$
(23)
Here the constraint (21) sets the size of a dense subgraph, and the constraints (22) ensure that for any subgraph S previously generated, the subgraphs produced in the current iteration must include at least one vertex not in S, meaning they should not be subsets of S. The problems (15)–(18) and (19)–(23) are solved using Gurobi (Gurobi Optimization, LLC); for the latter, we used an option to continue the search until the pool of up to \({n}_{\max }\) optimal solutions is produced.
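On a toy graph, the iterative pool generation can be mimicked by brute force: the sketch below enforces the size constraint (21) and the non-containment constraints (22) by enumeration rather than by ILP, so it is only a small-scale illustration of the scheme, with a hypothetical example graph:

```python
from itertools import combinations

def dense_subgraph_pool(vertices, edges, k_min, k_max, n_max):
    """Brute-force analogue of the ILP (19)-(23): for each size k
    (descending), keep up to n_max densest k-subsets that are not
    contained in any previously generated subgraph."""
    pool = []
    for k in range(k_max, k_min - 1, -1):
        cands = []
        for S in combinations(vertices, k):
            Sset = set(S)
            if any(Sset <= P for P in pool):   # constraint (22)
                continue
            d = sum(1 for u, v in edges if u in Sset and v in Sset) / k
            cands.append((d, Sset))
        best = max(d for d, _ in cands) if cands else None
        pool += [S for d, S in cands if d == best][:n_max]
    return pool

# K4 on {0,1,2,3} plus pendant vertex 4 (same toy construction as above).
V = [0, 1, 2, 3, 4]
E = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print([sorted(S) for S in dense_subgraph_pool(V, E, k_min=3, k_max=4, n_max=2)])
# [[0, 1, 2, 3], [0, 3, 4], [1, 3, 4]]
```

At k = 4 the K4 is kept; at k = 3 all triangles inside it are excluded by (22), so the pool is extended with the densest 3-subsets containing the pendant vertex.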
Inference of haplotypes from dense subgraphs: Now, let \({\hat{{{{{{{{\mathcal{S}}}}}}}}}}_{t}=\{{S}_{t,1},\ldots,{S}_{t,| {\hat{{{{{{{{\mathcal{S}}}}}}}}}}_{t}| }\}\) be the set of generated densest subgraphs with sizes ranging from \({k}_{\min }\) to \({k}_{\max }\). This set does not necessarily have a one-to-one correspondence with the true haplotypes for two reasons. First, some haplotypes may consist of more than \({k}_{\max }\) SAVs, so the generated subgraphs only cover parts of these haplotypes. Second, many generated subgraphs overlap significantly and thus most likely correspond to the same haplotypes.
To obtain full-length haplotypes, we employ the algorithmic pipeline described in detail in steps 3–5 of Algorithm 2 and Fig. 6. Initially, we partition the set of generated dense subgraphs into clusters such that the subgraphs from each cluster ideally correspond to the same true haplotype. To achieve this, we construct a graph of subgraphs \({{{{{{{\mathcal{L}}}}}}}}(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}})\), whose edges represent pairs of subgraphs with large overlaps, and split it into clusters using a series of graph clustering techniques. Then, we locate the haplotype for each cluster of subgraphs by finding the densest core community in the union of elements of that cluster.
Algorithm 2
HELEN: inference of viral haplotypes using coordinated substitution networks.
Input: the set \({{{{{{{{\mathcal{P}}}}}}}}}_{t}\) of aligned viral sequences sampled before or at the time point t.
Output: the set of haplotypes \({{{{{{{{\mathcal{H}}}}}}}}}_{t}=\{{H}_{t,1},\ldots,{H}_{t,| {{{{{{{{\mathcal{H}}}}}}}}}_{t}| }\}\) designated as potential variants with altered phenotypes.
1) Construct a coordinated substitution network \({{{{{{{{\mathcal{G}}}}}}}}}_{t}\) from sequences \({{{{{{{{\mathcal{P}}}}}}}}}_{t}\), as described in “Inference of coordinated substitution networks”.
2) Using the Integer Linear Programming formulation (19)–(23), iteratively generate the set of candidate dense subgraphs \({\hat{{{{{{{{\mathcal{S}}}}}}}}}}_{t}=\{{S}_{t,1},\ldots,{S}_{t,| \hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}}| }\}\) of sizes \(k\in \{{k}_{\max },{k}_{\max }-1,\ldots,{k}_{\min }\}\), so that the elements of \({\hat{{{{{{{{\mathcal{S}}}}}}}}}}_{t}\) are not subgraphs of each other.
3) Construct an intersection graph \({{{{{{{\mathcal{L}}}}}}}}(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}})\), whose vertex set is \(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}}\), and two vertices St,i and St,j are adjacent, whenever \(| {S}_{t,i}\cap {S}_{t,j}| \ge \min \{| {S}_{t,i}|,| {S}_{t,j}| \}-1\). In other words, vertices of this graph of subgraphs are adjacent whenever they have the largest possible intersection.
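Step 3 can be sketched directly from its definition; the pool of subgraphs below is a hypothetical example:

```python
def intersection_graph(subgraphs):
    """Step 3: vertices are dense subgraphs; S_i ~ S_j whenever
    |S_i ∩ S_j| >= min(|S_i|, |S_j|) - 1, the largest overlap possible
    for two subgraphs neither of which contains the other."""
    edges = []
    for i in range(len(subgraphs)):
        for j in range(i + 1, len(subgraphs)):
            Si, Sj = subgraphs[i], subgraphs[j]
            if len(Si & Sj) >= min(len(Si), len(Sj)) - 1:
                edges.append((i, j))
    return edges

# Hypothetical pool: the first two subgraphs share all but one vertex and
# are joined; the third shares too little with either of them.
pool = [{1, 2, 3, 4}, {2, 3, 4, 5}, {7, 8, 9}]
print(intersection_graph(pool))   # [(0, 1)]
```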
4) Partition the intersection graph \({{{{{{{\mathcal{L}}}}}}}}(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}})\) into clusters Lt,1, …, Lt,r, with each cluster corresponding to a single haplotype. The partition is carried out in stages as follows:
4.1) Split the graph \({{{{{{{\mathcal{L}}}}}}}}(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}})\) into connected components and then subdivide each component into (κ + 1)-connected components, where κ denotes the minimum size of a vertex cut (vertex connectivity). To achieve this, we use an algorithm proposed in ref. 64, which computes the vertex connectivity and corresponding vertex cut as the smallest of (s, t)-cuts between the fixed vertex v of the minimal degree and its non-neighbors ordered by their distance to v, as well as between non-adjacent pairs of neighbors of v. The algorithm computes these (s, t)-cuts using network flow techniques65.
We further augmented this algorithm by adding an extra step. Consider a pair of vertices (s, t) for which the minimal vertex cut of size κs,t has been found, and \({P}_{s,t}^{1},\ldots,{P}_{s,t}^{{\kappa }_{s,t}}\) are the corresponding internal vertex-disjoint (s, t)-paths (which can be found using network flows65 and whose existence is guaranteed by Menger’s theorem66). If a vertex \({s}^{{\prime} }\) is adjacent to the internal vertices of all of these paths, then we can exclude the pair \(({s}^{{\prime} },t)\) from further consideration because \({\kappa }_{{s}^{{\prime} },t}\ge {\kappa }_{s,t}\). This step significantly accelerates the connectivity calculation for graphs with many high-degree vertices, and the connected components of \({{{{{{{\mathcal{L}}}}}}}}(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}})\) typically exhibit this property.
4.2) Suppose that \({L}_{t,1},\ldots,{L}_{t,{r}^{{\prime} }}\) are the components produced at the previous step. Further subdivide each component Lt,i as follows: first, find an embedding of the subgraph \({{{{{{{\mathcal{L}}}}}}}}(\hat{{{{{{{{{\mathcal{S}}}}}}}}}_{t}})[{L}_{t,i}]\) formed by the vertices of Lt,i into \({{\mathbb{R}}}^{3}\) using a force-directed graph drawing algorithm67; second, cluster the obtained embedded graph by a spectral clustering algorithm68 using the largest Laplacian eigenvalue gap to estimate the number of clusters.
Each cluster produced at steps (4.1)–(4.2) is supposed to contain dense subgraphs corresponding to a single haplotype.
5) For every cluster Lt,i, we examine the induced subgraph \({{{{{{{{\mathcal{G}}}}}}}}}_{t,i}={{{{{{{{\mathcal{G}}}}}}}}}_{t}[{\bigcup }_{{S}_{t,j}\in {L}_{t,i}}{S}_{t,j}]\), which consists of the SAVs covered by the dense subgraphs contained in Lt,i.
5.1) Suppose that Dt,i is the sequence of vertex degrees of \({{{{{{{{\mathcal{G}}}}}}}}}_{t,i}\). We cluster the elements of Dt,i using the k-means algorithm and select the subset of vertices Ct,i with degrees from the cluster with the largest mean value. The goal of this procedure is to identify the core of \({{{{{{{{\mathcal{G}}}}}}}}}_{t,i}\) consisting of high-degree vertices. To choose the number of clusters k, we use the gap statistic69.
5.2) Find the densest subgraph Ht,i of \({{{{{{{{\mathcal{G}}}}}}}}}_{t,i}[{C}_{t,i}]\) using the LP formulation (15)–(18). If the subgraph is large enough (by default ∣Ht,i∣≥5), then output Ht,i as an inferred haplotype.
In addition to the set of haplotypes \({{{{{{{{\mathcal{H}}}}}}}}}_{t}\), Algorithm 2 returns a support σ(Ht,i) for each inferred haplotype, that is defined as a relative number of elements (i.e., candidate dense subgraphs) in the cluster Lt,i: \(\sigma ({H}_{t,i})={\sigma }_{t,i}=\frac{| {L}_{t,i}| }{{\sum }_{j}| {L}_{t,j}| }\).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.