Pattern recognition in the nucleation kinetics of non-equilibrium self-assembly
Multifarious DNA tile system design
Previous theoretical proposals23,24,57 for multifarious mixtures require each component to accept multiple strongly binding partners at each binding site. However, in DNA tile assembly, each binding site can usually only bind its Watson–Crick complement, not an arbitrary set of other domains. Hence, we used an alternative approach: we laid out three structures made of entirely unique, abstract tiles, designed a merging algorithm to reuse tiles in multiple locations if the consequences for unintentional binding between other tiles were minimal, and then designed DNA sequences reflecting the resulting abstract layout of tiles.
The three target shapes were drawn on a 24 × 24 single-stranded tile (SST) molecular canvas32, at an abstract level without sequences. Each location in each shape was initially a unique tile, with four abstract binding sites referred to as ‘glues’ in place of binding domains with sequences: after sequence design, ‘matching’ glues correspond to domains with complementary sequences. Edges of the shapes used a special ‘null glue’ with no valid binding partner. In total, this initial design had 2,706 glues and 1,456 tiles.
The three shapes were then processed through a ‘merging’ algorithm that attempted to reuse the same tiles in different shapes. Each step of the algorithm randomly chose two tiles in two different shapes, with null glues on the same sides of each tile, if any. It then considered a modified set where the two tiles were identical, by making them use the same four glues, and propagating the changes in the glues to all other places they occurred within all shapes, starting with the neighbouring tiles (for example, Extended Data Fig. 2c). Such a change could create undesired growth pathways, for example, allowing chimeras of multiple shapes. Thus, the algorithm then checked the modified set for two criteria taken from algorithmic self-assembly (Extended Data Fig. 2a,b). The self-healing criterion requires that, for any correct subassembly of any shape, whereas attachments of the wrong tile for a particular location may take place by one bond, only the correct tile can attach by two or more bonds58. The second-order sensitivity criterion for proofreading requires that, for any correct subassembly of any shape, if an incorrect attachment by one bond takes place, the incorrectly attached tile will not create a neighbourhood where an additional incorrect tile can attach by two bonds, and thus the initial error will be likely to fall off35,36. If the modified set satisfied these two criteria, which are trivially satisfied when every tile and bond is unique to a particular location, then the merging algorithm accepted the modified set and continued to another step with a different pair of randomly chosen tiles. Thus, we ensured that there is at least a minimum barrier to continued incorrect growth in a regime where tile attachment by two or more bonds is favourable, and attachment by one bond is unfavourable, which is the case close to the melting temperature of most DNA tile assembly systems59,60.
The algorithm repeatedly merged tiles that satisfied the two criteria until no further acceptable merges were possible. Each merge could affect the acceptability of later merges by changing the glues around each tile. To guide the algorithm towards a sequence of mutually compatible merges, it was initially restricted to considering pairs of tiles from an alternating ‘chequerboard’ subset, which, apart from edges, were likely to be merge-able. After exhausting acceptable merges from this subset, the algorithm attempted merges using all tiles in the system. After repeating this stochastic algorithm multiple times, and selecting the system with the smallest number of tiles, the final resulting system had 698 binding domains and 917 tiles, with 371 tiles shared between at least two shapes (Extended Data Fig. 2d).
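The merge-and-check loop can be illustrated with a minimal sketch. This is not the code used for the design: the tile representation, the `NULL` label and the function names are hypothetical, only the self-healing criterion is checked (in a simplified, location-independent form), and the second-order sensitivity check and the chequerboard restriction are omitted.

```python
import random
from itertools import combinations

NULL = 0  # hypothetical label for the 'null glue' on shape edges


def merge_tiles(tiles, keep, drop):
    """Return a new tile set in which tile `drop` uses the glues of tile `keep`.

    `tiles` maps a location name to its glues (north, east, south, west).
    Renamed glues are propagated to every tile that carries them.
    Returns None if the merge is not attempted by this sketch.
    """
    a, b = tiles[keep], tiles[drop]
    # Null glues must sit on the same sides of both tiles.
    if any((ga == NULL) != (gb == NULL) for ga, gb in zip(a, b)):
        return None
    rename = {}
    for ga, gb in zip(a, b):
        if gb == NULL or gb == ga:
            continue
        if rename.setdefault(gb, ga) != ga:
            return None  # one glue would need two different new names
    if set(rename) & set(rename.values()):
        return None  # chained renames are conservatively rejected here
    return {name: tuple(rename.get(g, g) for g in glues)
            for name, glues in tiles.items()}


def two_bond_unique(tiles):
    """Simplified self-healing check: no two distinct tile types share
    non-null glues on two or more of the same sides."""
    for t, u in combinations(set(tiles.values()), 2):
        if sum(1 for gt, gu in zip(t, u) if gt == gu != NULL) >= 2:
            return False
    return True


def merge_pass(tiles, shape_of, rng, attempts=1000):
    """Try random merges between tiles of different shapes, keeping valid ones."""
    names = list(tiles)
    for _ in range(attempts):
        keep, drop = rng.sample(names, 2)
        if shape_of[keep] == shape_of[drop] or tiles[keep] == tiles[drop]:
            continue
        candidate = merge_tiles(tiles, keep, drop)
        if candidate is not None and two_bond_unique(candidate):
            tiles = candidate
    return tiles
```

In this representation the number of distinct tile types after a pass is `len(set(tiles.values()))`, which is the quantity that would be minimized over repeated stochastic runs.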
After the assignment of abstract binding domains to each tile by the merging algorithm, the sequences for the binding domains, and thus tiles themselves, were generated using the sequence design software of Woods et al.16. Tiles used a standard SST motif, with alternating 10 and 11 nt binding domains, designed to have similar binding strengths as predicted using a standard thermodynamic model16,29,61. Following Woods et al.16, we set a target range of −8.9 to −9.2 kcal mol−1 for a single domain at 53 °C, which was between the melting temperature and growth temperature for their system. Null binding domains on the edges of shapes, not intended to bind to any other tiles, were assigned poly-T sequences.
Models of nucleation
To model the dependence of the nucleation rates of the three shapes on patterns of unequal concentration, we developed a simple nucleation model based on the stochastic generation of possible nucleation pathways and critical nuclei, which we call the Stochastic Greedy Model (SGM). The model estimates nucleation rates by analysing stochastic paths generated in a greedy manner by making single-tile additions starting from a particular monomer in the system. At each step, all favourable attachments are added and then an unfavourable attachment is performed with probability weighted by the relative free-energy differences of the available tile attachment positions. When multiple favourable attachments are available, the most favourable attachment is made deterministically. This procedure is repeated for many paths over all possible initial positions within the shape considered, and the barrier (highest free-energy state visited in ‘growing’ a full structure) is recorded for each path. A nucleation rate is estimated by assuming an equilibrium occupation of this barrier state (Arrhenius’ approximation26) and summing over the kinetics of the available attachments from this state (see Extended Data Fig. 4 and Supplementary Information section 2.2 for a detailed discussion). The approximations here could be improved by running fully reversible simulations, for example, using xgrow and the kinetic Tile Assembly Model59,62 augmented with Forward Flux Sampling63.
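As a rough illustration of the greedy path generation described above, the following sketch grows a single shape on a square lattice and records the barrier of each path. It is a simplification under stated assumptions, not the SGM implementation: free energies are in units of kT, `g_mc` (the cost of adding a tile, assumed to be the negative logarithm of its concentration relative to a reference) and `g_se` (the gain per bond) are hypothetical parameter names, only one favourable attachment is added per step, and the final score averages the Boltzmann factor of the barrier without the kinetic sum over attachments from the barrier state.

```python
import math
import random


def greedy_path(sites, g_mc, g_se, start, rng):
    """Grow one assembly from `start`; return the highest free energy visited.

    sites : set of (x, y) lattice positions making up a connected shape
    g_mc  : dict position -> free-energy cost of adding that tile (kT)
    g_se  : free-energy gain per bond formed (kT)
    """
    placed = {start}
    g = g_mc[start]
    barrier = g
    while len(placed) < len(sites):
        # Free-energy change for every empty site touching the assembly.
        options = {}
        for (x, y) in sites - placed:
            bonds = sum((x + dx, y + dy) in placed
                        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))
            if bonds:
                options[(x, y)] = g_mc[(x, y)] - bonds * g_se
        if not options:
            raise ValueError("shape is not connected")
        favourable = {p: dg for p, dg in options.items() if dg < 0}
        if favourable:
            # The most favourable attachment is made deterministically.
            pos = min(favourable, key=lambda p: (favourable[p], p))
        else:
            # Otherwise take an unfavourable step with Boltzmann weights.
            choices = sorted(options)
            weights = [math.exp(-options[p]) for p in choices]
            pos = rng.choices(choices, weights=weights)[0]
        placed.add(pos)
        g += options[pos]
        barrier = max(barrier, g)
    return barrier


def nucleation_score(sites, g_mc, g_se, paths_per_start=20, seed=0):
    """Crude proxy for a nucleation rate: mean of exp(-barrier) over paths."""
    rng = random.Random(seed)
    total, n = 0.0, 0
    for start in sorted(sites):
        for _ in range(paths_per_start):
            total += math.exp(-greedy_path(sites, g_mc, g_se, start, rng))
            n += 1
    return total / n
```

Lowering `g_mc` for a subset of positions, which corresponds to raising those tile concentrations, lowers the barriers of paths that pass through that region and so raises the score for that shape.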
Fluorophore labels and DNA synthesis
Sites for fluorophore and quencher modifications were chosen to avoid edges, modify only unshared tiles and provide a reasonable distribution of locations on each shape. Fluorophores were chosen for spectral compatibility and temperature stability64. ROX, ATTO550 and ATTO647N were paired with Iowa Black RQ, and FAM was paired with Iowa Black FQ. Both fluorophore and quencher modifications were made on the 5′ ends of tiles; to sufficiently colocalize fluorophores and quenchers, one tile in the label pair used a reversed orientation (Fig. 4a). Fluorophore labels are discussed in detail in Supplementary Information section 3.
Tiles without fluorophore or quencher modifications were ordered unpurified (desalted) and normalized to 400 μM in TE buffer (Integrated DNA Technologies). Tiles with fluorophore or quencher modifications were ordered purified by high-performance liquid chromatography (HPLC) and normalized to 100 μM. Given that typically fewer than 40 to 60% of the molecules in unpurified synthetic oligonucleotides are full length, it is remarkable (although consistent with Woods et al.16) that this did not prevent successful pattern recognition by nucleation.
Experimental overview
The basic workflow for the main experiments was as follows: for a chosen set of concentration patterns (flag or image), samples were prepared on a 96-well plate using an acoustic liquid handler to mix strand stocks in the necessary proportions; vortexed, spun and transferred to PCR tubes for the days-long anneal in the quantitative PCR (qPCR) machine; then samples were deposited on mica for AFM imaging. Fluorescence from the qPCR machine and AFM images were subsequently analysed.
Mixing and growth
Individual tiles were mixed, in the concentration patterns used for experiments, using an Echo 525 acoustic liquid handler (Beckman Coulter). Samples used TEMg buffer (TE buffer with 12.5 mM MgCl2) in a total volume of roughly 20 μl. Flag experiments used a 50 nM base concentration for unenhanced tiles and 880 nM for enhanced tiles, whereas pattern recognition experiments used tiles with nominal concentrations between 16.67 and 450 nM, which were then quantized into ten discrete values to simplify mixing and conserve material (Supplementary Information section 2.8).
For each concentration pattern in the flag experiments and pattern recognition of trained images, four samples were prepared, each with the same concentration pattern of tiles, but with tiles in different locations replaced by their fluorophore–quencher-modified alternates: one sample for each shape with tiles for all four fluorophore labels on only that shape, to monitor growth of multiple regions on each shape, and an additional sample with one fluorophore on each shape: ROX, ATTO550 (‘five’) and ATTO647N (‘six’) on H, A and M structures, respectively. To reduce the total number of samples, only the lattermost sample type was prepared for pattern recognition of test images. Fluorophore and quencher-modified tile locations always had tiles mixed at the lowest concentration used in the experiment.
After transfer to PCR tubes, samples were grown in an mx3005p qPCR machine (Agilent), which provided a controlled temperature program while monitoring fluorescence. Growth protocols began with a ramp from 71 to 53 °C over 40 min to ensure any potentially pre-existing complexes were melted, and then a slower ramp from 53 °C to an initial growth temperature at 1 °C h−1. At this point, three different protocols were used. For constant temperature flag growth experiments, the growth temperature was 47 °C and this was held for 51 h. For temperature ramp flag growth, the initial growth temperature was 48 °C, which was reduced over the course of 100 h to 46 °C. For pattern recognition, a ramp from 48 to 45 °C over 150 h was used. Fluorescence readings were taken every 12 min for constant temperature experiments and every 30 min for other experiments. After the growth period, the temperature was lowered to 39 °C at 1 °C per 26 min. See Supplementary Information sections 5 and 6 for temperature protocols plotted as a function of time. The experimental timescales and temperatures were chosen not to test the potential speed of selective nucleation, but rather to provide robustness to unknown nucleation temperatures and to convincingly show that nucleation of incorrect structures is limited over long timescales. Thus, on-target nucleation often took place over a comparatively small span of time and temperature in the experiment, with the remaining time spent either above the expected nucleation temperature or waiting to observe potential off-target nucleation. We also did not try to optimize the system’s speed: the winner-take-all (WTA) mechanism suggests that significantly faster timescales are possible, and smaller assemblies would reduce the time needed for growth after nucleation. Because of the small sample size and long experiment duration, great care was necessary to avoid evaporation.
Once protocols were finished, samples were stored at room temperature until ready for AFM imaging.
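For orientation, the pattern recognition temperature program can be written as a list of linear ramps. The sketch below uses only the values stated above; the segment representation and the function names are illustrative assumptions, not the qPCR machine's program format.

```python
# Pattern recognition protocol as (start degC, end degC, duration in minutes).
PATTERN_RECOGNITION = [
    (71.0, 53.0, 40.0),        # melt any pre-existing complexes
    (53.0, 48.0, 5 * 60.0),    # 1 degC per hour to the initial growth temperature
    (48.0, 45.0, 150 * 60.0),  # growth ramp over 150 h
    (45.0, 39.0, 6 * 26.0),    # final cooling at 1 degC per 26 min
]


def total_minutes(protocol):
    """Total duration of the protocol in minutes."""
    return sum(duration for _, _, duration in protocol)


def temperature_at(protocol, minute):
    """Temperature at a given time, interpolating linearly within each ramp."""
    elapsed = 0.0
    for start, end, duration in protocol:
        if minute <= elapsed + duration:
            fraction = (minute - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    return protocol[-1][1]  # hold the final temperature afterwards
```

With these values the whole program lasts 9,496 min, of which 9,000 min is the growth ramp itself.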
Imaging
AFM imaging was performed using a FastScan AFM (Bruker) in fluid tapping mode directly after annealing was completed. In contrast to previous studies32,33,34 in which uniquely addressed SST shapes were gel purified before imaging, we did not do so here, thus we were able to observe assembly intermediates. To achieve better images, two techniques were combined: sample warming to prevent non-specific clumping of structures, and washing with Na-supplemented buffer to prevent smaller material, such as unbound, single DNA tile strands, from adhering to the mica surface. Each sample was diluted 50 times into TEMg buffer with an added 100 mM NaCl, then warmed to roughly 40 °C for 15 min. Next, 50 μl of the sample mix was deposited on freshly cleaved mica, then left for 2 min. As much liquid as possible was pipetted off the mica and discarded, then immediately replaced with Na-supplemented buffer again and mixed by pipetting up and down. This washing process of buffer removal and addition was repeated twice with added-Na buffer, then once with TEMg buffer to remove remaining Na, before imaging was performed in TEMg buffer. As adhesion of DNA to mica is dependent on the ratio of monovalent and divalent cations in the imaging buffer, this process was meant to ensure that unbound tiles were removed during the washing process where Na and Mg were present, whereas imaging itself took place with only Mg so that the lattice structures would be more strongly adhered to the surface resulting in better image quality.
Fluorescence and AFM data analysis
Fluorophore signals are known to be affected by extraneous factors such as temperature, pH, secondary structure and the local base sequence near the fluorophore64, which complicates quantitative interpretation of absolute fluorescence levels. Our own control experiments also illustrated effects due to partial assembly intermediates as well as due to the total amount of single-stranded DNA in solution (Supplementary Information section 3). For this reason, the fluorescence of each fluorophore was normalized to the maximum raw fluorescence value of that fluorophore in that particular sample, and the time at which the fluorescence signal decreased by 10% was then used as a measure of the extent of nucleation that appears less sensitive to these artefacts (Extended Data Fig. 5). The duration between the point of 10% quenching and the end of the growth segment of the experiment was defined as the ‘growth time’ for that fluorophore label; the growth time was defined as 0 in the event of quenching never reaching 10%. For concentration patterns with four samples with different fluorophore arrangements, the total growth time of a shape was defined as the average of the growth time of the five total fluorophore labels on the shape across the four samples (four in the shape-specific sample and one in the each-shape sample), whereas for concentration patterns with only one sample, the growth time of the corresponding fluorophore label was used. As the position of the fluorophore within the shape, relative to where nucleation occurs, has a substantial influence on growth time measurements, the considerable variability in these measurements relative to the true nucleation kinetics must be acknowledged.
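The growth time definition above can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the analysis: the input is taken to cover only the growth segment, and the 10% drop is searched for from the time of maximum fluorescence onwards, without interpolating between readings.

```python
from statistics import fmean


def growth_time(times, signal, threshold=0.10):
    """Time between 10% quenching and the end of the growth segment.

    times  : list of increasing time points (for example, hours)
    signal : list of raw fluorescence readings at those time points
    Returns 0.0 if the signal never drops by `threshold` from its maximum.
    """
    peak = max(signal)
    peak_index = signal.index(peak)
    for t, s in zip(times[peak_index:], signal[peak_index:]):
        if s <= (1.0 - threshold) * peak:
            return times[-1] - t
    return 0.0


def shape_growth_time(label_growth_times):
    """Average growth time over all fluorophore labels on one shape."""
    return fmean(label_growth_times)
```

For a concentration pattern with four samples, `shape_growth_time` would receive the five label growth times for a shape; for a pattern with one sample, the single label's `growth_time` is used directly.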
For flag experiments, AFM imaging was done only for qualitative confirmation of the selective nucleation and growth indicated by fluorescence results. For pattern recognition and equal-concentration experiments, however, shapes in AFM images were uniformly quantified. At least one sample of each of the patterns had three 5 × 5 μm images taken under comparable conditions. The sample corresponding with each image was blinded, and structures were counted independently by each of the four authors, classifying structures as either ‘nearly complete’ or ‘clearly identifiable’ examples of each of the three shapes. For the purposes of analysing pattern-dependent nucleation and growth, no clear distinction between the number of nearly complete and clearly identifiable shapes was found, and so the two categories were summed. Counts were averaged across the three images, then averaged across the counts of the four authors, to obtain a count per shape per 25 μm2 region for each pattern. Each author used their own, subjective, interpretation of ‘nearly complete’ and ‘clearly identifiable’ structures, and the total number of structures counted in each image differed by up to ±50% for different authors. However, the ratios of different shapes in each image counted by each author remained within 5% of the mean ratios for most images, and across all images no author had a bias of more than ±4% towards identifying a particular shape more or less often than average. Results are detailed in Supplementary Information section 6.3.
To measure the selectivity of patterns, the fraction of on-target shape growth time and AFM counts, compared to the sum of shape growth times and AFM counts, was used. The total growth times, and total AFM counts, of the on-target shapes were used to measure overall shape growth.
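The count averaging and the selectivity fraction amount to the short calculation sketched below. The data layout and function names are illustrative, and returning 0.0 when nothing grew is a choice made for this sketch only.

```python
from statistics import fmean


def mean_count(counts_by_author):
    """Average counts over images for each author, then over authors.

    counts_by_author : list (one entry per author) of lists of counts
                       (one entry per image) for a single shape
    """
    return fmean(fmean(images) for images in counts_by_author)


def selectivity(values, target):
    """Fraction of the total growth time or AFM count on the target shape.

    values : dict shape name -> growth time or AFM count
    """
    total = sum(values.values())
    # Assumption for this sketch: report 0.0 if no shape grew at all.
    return values[target] / total if total > 0 else 0.0
```

For example, growth times of 30, 5 and 5 h for H, A and M give a selectivity of 0.75 for an H pattern.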
Pattern recognition training
Images for pattern recognition were adapted from several sources (Fig. 5d). Each image was rescaled to 30 × 30, discretized to ten grayscale values and adjusted so that the number of pixels with each value was consistent across all images. Each pixel’s grayscale value, \(0 \le p_n \le 1\), was converted to the concentration \(c_i\) for the corresponding tile \(t_i\), where \(i = \theta(n)\), using an exponential formula, \(c_i = c\,\mathrm{e}^{3 p_n \ln 3}\), where the base concentration is c = 16.67 nM. The intention of the numbers used was to make the average tile concentration 60 nM for each image. As each image had 900 pixels and there are 917 tiles in the system, 17 tiles did not have their concentrations set by any pixel; these tile concentrations were uniformly set to the lowest concentration, and the assignment of these tiles was used to ensure that fluorophore label locations did not vary in concentration.
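The exponential formula above is equivalent to multiplying the base concentration by 27 raised to the power of the grayscale value, so that concentrations span 16.67 nM to roughly 450 nM. A short sketch makes the mapping concrete; the function name is hypothetical, and the ten evenly spaced grey levels are an assumption for illustration.

```python
import math

BASE_NM = 16.67  # base concentration c, in nM


def pixel_to_concentration(p, base=BASE_NM):
    """Tile concentration in nM for a grayscale value p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("grayscale value must lie between 0 and 1")
    return base * math.exp(3.0 * p * math.log(3.0))


# Assumption for illustration: ten evenly spaced grey levels from 0 to 1.
levels = [k / 9 for k in range(10)]
concentrations = [pixel_to_concentration(p) for p in levels]
```

The average concentration of an image then depends on how many pixels take each grey level, which is why the pixel histogram was made consistent across images.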
The tile-pixel assignment was optimized through a simple hill-climbing algorithm, starting from a random assignment, where random modifications to the assignment map are attempted at each step and accepted if the move increases the efficacy of the map. This efficacy was quantified through a heuristic function that accounts for relative nucleation rates, location of nucleation sites (with preference given to locations that succeeded in the flag experiments shown in Fig. 4d) and satisfaction of constraints related to the fluorescent reporters. Because the nucleation algorithm described above, the SGM, is computationally expensive, a simplistic model of nucleation we call the Window Nucleation Model (WNM) was used to evaluate relative nucleation rates for most of the optimization steps. The WNM is based on the Boltzmann-weighted sum of concentrations over a k × k window swept over each structure, similar to the model used in Zhong et al.24. The more detailed but computationally costly SGM was then used for an additional several hours in hopes of improving the mapping. The WNM, along with all constraints about nucleation location and fluorescent reporters, was also used to explore the capacity of this map-training procedure in Extended Data Fig. 8. Details of the pattern recognition training and the window-based nucleation model are discussed in Supplementary Information sections 2.4 and 2.5.
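The hill-climbing step can be sketched generically. In this sketch the assignment is a list mapping each pixel index to a tile index, the random modification is a swap of two entries, and `efficacy` is a caller-supplied scoring function; all three are assumptions for illustration, and the heuristic actually used (nucleation rates, nucleation locations and reporter constraints) is not reproduced here.

```python
import random


def hill_climb(assignment, efficacy, steps, rng):
    """Improve a pixel-to-tile assignment by accepting only uphill moves.

    assignment : list where assignment[n] is the tile index for pixel n
    efficacy   : function scoring an assignment (higher is better)
    steps      : number of random swaps to attempt
    rng        : random.Random instance
    """
    best = list(assignment)  # work on a copy; the input is left unchanged
    best_score = efficacy(best)
    for _ in range(steps):
        i, j = rng.sample(range(len(best)), 2)
        best[i], best[j] = best[j], best[i]
        score = efficacy(best)
        if score > best_score:
            best_score = score
        else:
            best[i], best[j] = best[j], best[i]  # undo the rejected swap
    return best, best_score
```

Because every call to `efficacy` rescores the full map, a cheap scoring function can be used for most steps and a costlier one for a final refinement, which mirrors the use of the WNM followed by the SGM described above.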