Application of mixsep software package: Performance verification of male-mixed DNA analysis

An experimental model of male-mixed DNA (n=297) was constructed according to the mixed DNA construction principle. This comprised the use of the Applied Biosystems (ABI) 7500 quantitative polymerase chain reaction system, with scientific validation of mixture proportion (Mx; root-mean-square error ≤0.02). Statistical analysis was performed on locus separation accuracy using mixsep, a DNA mixture separation R-package, and the analytical performance of mixsep was assessed by examining the data distribution pattern of different mixed gradients, short tandem repeat (STR) loci and mixed DNA types. The results showed that locus separation accuracy had a negative linear correlation with the mixed gradient (R2=−0.7121). With increasing mixed gradient imbalance, locus separation accuracy first increased and then decreased, with the highest value detected at a gradient of 1:3 (≥90%). The mixed gradient, which is the theoretical Mx, was one of the primary factors that influenced the success of mixed DNA analysis. Among the 16 STR loci detected by Identifiler®, the separation accuracy was relatively high (>88%) for loci D5S818, D8S1179 and FGA, whereas the median separation accuracy value was lowest for the D7S820 locus. STR loci with relatively large numbers of allelic drop-out (ADO; >15) were all located in the yellow and red channels, including loci D18S51, D19S433, FGA, TPOX and vWA. These five loci featured low allele peak heights, which was consistent with the low sensitivity of the ABI 3130xl Genetic Analyzer to yellow and red fluorescence. The locus separation accuracy of the mixsep package was substantially different with and without the inclusion of ADO loci; inclusion of ADO significantly reduced the analytical performance of the mixsep package, which was consistent with the lack of an ADO functional module in this software. The present study demonstrated that the mixsep software had a number of advantages and was recommended for analysis of mixed DNA. 
This software was easy to operate and produced understandable results with a degree of controllability.


Introduction
The R programming language, which was created as a branch of the S language in the 1980s, is widely used in the field of statistics. R is a free, open-source software environment that is part of the GNU (GNU's Not Unix) project. As an implementation of the S programming language, R provides a complete software system for data processing, statistical computing and graphics (1). Its primary functions include data storage and processing; an array of operation tools, among which vector and matrix operations are particularly powerful; statistical analysis tools; statistical graphics; and a simple yet powerful programming language that controls data input and output and supports branching and looping.
The source code for R is freely downloadable and compiled executable files are available online. R is available for multiple computer platforms, including UNIX (FreeBSD and Linux), Windows and MacOS. R is predominantly operated through commands, and a number of graphical user interfaces have been developed, among which RStudio is the most commonly used (http://www.rstudio.com) (2). In addition, the Comprehensive R Archive Network (CRAN; http://cran.r-project.org) provides downloadable executable files, source code and documentation for R, as well as various software packages written by R users. There are >100 CRAN mirrors worldwide, which relieve the load on the primary R server. There are five CRAN mirrors in China, allowing Chinese users to quickly download R packages.
In bioinformatics, the R language is commonly used for the analysis of molecular biological data. The Bioconductor project (3), which uses R as a genome analysis tool, has been available since its launch in 2001 and is updated twice per year (http://www.bioconductor.org). At present, the Bioconductor project is used in bioinformatics analysis of high-throughput data, microarray data and sequence data, with a large number of metadata packages for pathways, microarrays, genetic markers and organs (3-11). The purpose of the Bioconductor project is to provide powerful statistical analysis and graphics functions for genomic data analysis in order to efficiently analyze metadata in various species and to provide a common platform for bioinformatics. The mixsep package (12-15) is a DNA mixture separator in R, which is developed and maintained by Dr Torben Tvedebrink (Aalborg University, Aalborg, Denmark). This software is a forensic genetics tool used for the analysis of mixed DNA. The present study used mixsep version 0.2.1-2, updated on May 3, 2013. The user interface of this version is shown in Fig. 1; URL, http://cran.r-project.org/web/packages/mixsep/index.html; reference manual, http://cran.r-project.org/web/packages/mixsep/mixsep.pdf (12).
The mixsep package constructs a statistical model based on a greedy algorithm (13) that separates and infers the majority of two-person mixed DNA profiles (separation results are often not unique) and then conducts the individual identification of mixed DNA. The model does not consider the influence of allelic drop-out (ADO), stutter or drop-in (DNA contamination). ADO occurs when the quantity of a specific DNA is so low that its relative fluorescence cannot be separated from the background, which results in the loss of an allelic peak and the appearance of a false homozygote. The mixsep package also includes a module for use in complex mixed DNA analysis (more than three people), which has shown limited analytical performance in experimental data validation.

Materials and methods
DNA sample collection. Anti-coagulated blood samples (5 ml) were collected from 40 unrelated healthy males at the Blood Center of Hebei Province (Shijiazhuang, China).
Experimental design. DNA was extracted from each of the 40 whole blood samples and quantified using the ABI 7500 quantitative polymerase chain reaction (qPCR) system (Life Technologies Inc., Carlsbad, CA, USA). Single DNA samples were classified according to whether there were minimal differences in DNA concentrations (≤0.5 ng/µl) and then used to generate simulated male-mixed DNA samples of two individuals. This approach allowed the preparation of different mixed DNA gradients by adjusting the volume of DNA solution. To avoid potential over-fitting in statistical analysis caused by a single sample type and inadequate sample size, various combinations of mixed DNA samples were generated using different sources (individuals), and each of these combinations was prepared in multiple mixed gradients. This procedure ensured that the influence of mixed DNA profiles and mixed gradients was objectively reflected in the analytical performance of the mixsep software. In addition, the concentration of simulated mixed DNA stock solutions was adjusted to desired levels within the range of 0.5-1.25 ng/µl (that is, the working solution concentration), to achieve the DNA template quantity required by the DNA testing kits.
DNA quantification. DNA quantification was performed using the Quantifiler® Human DNA Quantification kit (Life Technologies Inc.), containing DNA standard (200 ng/µl), Human Primer mix and PCR Reaction mix. Human Primer mix (10.5 µl/sample) and PCR Reaction mix (12.5 µl/sample) were mixed and dispensed into reaction wells (23 µl), followed by the addition of 2 µl sample or standard to each well, in order to obtain a 25-µl PCR reaction mixture. DNA quantification was repeated three times for each sample, and the mean of these was taken as the final DNA concentration.
Principles of mixed DNA preparation. Simulated male-mixed DNA was prepared by classifying DNA quantification results of the 40 male samples (nos. 1-40) and the Promega Male-DNA standard; the classification criterion was that single DNA samples have similar concentrations (difference, ≤0.5 ng/µl). The prepared, simulated male-mixed DNA was quantified using the ABI 7500 real-time PCR system (Applied Biosystems). The concentration of DNA templates was adjusted to 0.5-1.25 ng/µl as recommended in the instructions for the AmpFlSTR® Identifiler® PCR Amplification kit, and the simulated mixed DNA was further diluted whenever necessary.
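The preparation principle above can be illustrated with a short Python sketch. It is not part of the study protocol: the function names, the example concentrations and the total volume are hypothetical, and the gradients are assumed to run from 1:1 to 1:9 (minor:major), with Mx taken as the proportion of the minor contributor.

```python
from fractions import Fraction


def theoretical_mx(minor_parts, major_parts):
    """Theoretical mixture proportion (Mx) of the minor contributor."""
    return Fraction(minor_parts, minor_parts + major_parts)


def mixing_volumes(conc_minor, conc_major, minor_parts, major_parts, total_volume):
    """Volumes (µl) of two DNA stocks giving a minor:major DNA mass ratio.

    Mass = concentration (ng/µl) x volume (µl), so each volume is proportional
    to parts / concentration; the two volumes are scaled to total_volume.
    """
    weight_minor = minor_parts / conc_minor
    weight_major = major_parts / conc_major
    scale = total_volume / (weight_minor + weight_major)
    return weight_minor * scale, weight_major * scale


# Assumed gradients 1:1 to 1:9 and their theoretical Mx values.
GRADIENT_MX = {f"1:{k}": theoretical_mx(1, k) for k in range(1, 10)}

# Hypothetical example: two stocks of equal concentration mixed at 1:3.
EXAMPLE_VOLUMES = mixing_volumes(1.0, 1.0, 1, 3, 20.0)
```

When the two stocks have similar concentrations, as required by the classification criterion, the volume ratio is close to the intended mass ratio, which is why the gradients can be prepared by adjusting volumes alone.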
Identifiler PCR and electrophoresis. The 25-µl PCR system contained 10.5 µl PCR Reaction mix, 5.5 µl Identifiler Primer set, 0.5 µl Gold® DNA Polymerase, 9.0 µl Nuclease-Free Water and 1 µl template DNA. Identifiler PCR amplification was performed under the following conditions: Pre-denaturation at 95°C for 11 min; 28 cycles of denaturation at 94°C for 1 min, annealing at 59°C for 1 min and extension at 72°C for 1 min; and a final extension step at 60°C for 60 min. The AmpFlSTR® Identifiler® (Life Technologies) PCR products were checked using a 10-µl electrophoresis system containing 0.25 µl GeneScan™ 500 LIZ® Size Standard, 9.25 µl Hi-Di™ formamide and 0.50 µl of PCR product or Allelic Ladder. Capillary electrophoresis was performed on an ABI 3130xl Genetic Analyzer (Applied Biosystems Life Technologies, Foster City, CA, USA). All PCR reagents were purchased from Invitrogen Life Technologies Inc. (Carlsbad, CA, USA).

Software operation of mixsep
Rationale for use. According to the required significance level for statistical analysis, the mixsep package provided the optimal and alternative genotype combinations of short tandem repeat (STR) loci, estimated the parameter of mixture proportion (Mx), fitted the residual peak area error and calculated goodness of fit. Additionally, the mixsep package screened out and removed STR loci with poor goodness of fit, which contributed to the overall variance.
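The idea of choosing the genotype combination that best fits the observed peaks can be sketched in Python as follows. This is a deliberately simplified single-locus illustration and not the mixsep implementation: it assumes a fixed, known mixture proportion (alpha), unweighted least squares, no ADO, stutter or drop-in, and hypothetical peak heights. The function name separate_locus is invented for this example.

```python
from itertools import combinations_with_replacement


def separate_locus(heights, alpha):
    """Return (rss, major_genotype, minor_genotype) with the smallest residual.

    heights: dict mapping allele label -> observed peak height (or area).
    alpha:   assumed proportion of the minor contributor (0 < alpha <= 0.5).
    """
    alleles = sorted(heights)
    genotypes = list(combinations_with_replacement(alleles, 2))
    best = None
    for major in genotypes:
        for minor in genotypes:
            # Without drop-out or drop-in, the two genotypes together must
            # explain exactly the observed alleles.
            if set(major) | set(minor) != set(alleles):
                continue
            # Expected relative signal per allele from the allele doses.
            expected = {a: (1 - alpha) * major.count(a) + alpha * minor.count(a)
                        for a in alleles}
            # Least-squares scale factor, then residual sum of squares.
            denom = sum(e * e for e in expected.values())
            scale = sum(heights[a] * expected[a] for a in alleles) / denom
            rss = sum((heights[a] - scale * expected[a]) ** 2 for a in alleles)
            if best is None or rss < best[0]:
                best = (rss, major, minor)
    return best


# Hypothetical four-allele locus from a 1:3 mixture (alpha = 0.25).
PEAKS = {"12": 1500.0, "13": 1480.0, "14": 520.0, "15": 490.0}
RSS, MAJOR, MINOR = separate_locus(PEAKS, alpha=0.25)
```

In this sketch, swapping the two genotypes at alpha = 0.5 gives identical expected signals and therefore identical residuals, so the major and minor profiles cannot be told apart at a 1:1 gradient.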
Data formatting and loading. Experimental data were saved as a CSV file containing six variables. These were: Locus, allele, height, area, bp and dye. In the majority of cases, data analysis was performed using the first four of these, as shown in Fig. 2. Data were loaded as a CSV file by clicking 'Add file'.
Variables and genetic marker selection. The variables of locus and allele were required, height and area were alternative, and bp and dye were optional. A DNA testing kit (such as the Identifiler PCR Amplification kit) was selected prior to clicking 'select column (and kit)'.
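The data layout described above can be illustrated with a minimal Python sketch that reads such a CSV file and checks the column requirements. The exact header spelling (locus, allele, height, area, bp, dye), the sample rows and the function name load_peaks are assumptions made for illustration; the layout actually used in the study is the one shown in Fig. 2.

```python
import csv
import io

REQUIRED = ("locus", "allele")      # must be present
QUANTITATIVE = ("height", "area")   # at least one must be present
OPTIONAL = ("bp", "dye")            # may be omitted

# Hypothetical rows for two loci.
SAMPLE = """locus,allele,height,area,bp,dye
D8S1179,13,1520,12011,140.2,B
D8S1179,14,480,3790,144.3,B
TH01,9.3,910,7105,185.6,G
"""


def load_peaks(text):
    """Group peaks by locus as lists of (allele, signal) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    signal = next((c for c in QUANTITATIVE if c in columns), None)
    if signal is None:
        raise ValueError("either a height or an area column is needed")
    peaks = {}
    for row in reader:
        # Alleles stay as strings so that labels such as 9.3 or X are preserved.
        peaks.setdefault(row["locus"], []).append((row["allele"], float(row[signal])))
    return peaks


PEAKS_BY_LOCUS = load_peaks(SAMPLE)
```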
Selecting loci and alleles. The mixsep default setting analyzed all loci and alleles. Specific loci and alleles were selected whenever necessary and the parameter setting interface was entered by clicking 'continue'.
Parameter setting and mixed DNA analysis. The parameters included 'Number of contributors', 'Search for alternatives', 'Specify significance level' and 'Use fixed profile'. Mixed DNA analysis was started by clicking 'Analyze mixture!'.

Parameters of analytical performance for mixsep
Rationale for use. The primary function of mixsep, which lacks a function module for ADO, is the separation of mixed DNA genotype combinations. Therefore, the simulated mixed DNA profiles of STR loci (n = 4566) were statistically analyzed excluding ADO.
Locus separation accuracy. Locus separation accuracy refers to an accurate separation of the genotype combination for a specific locus in a sample of mixed DNA profiles.
Horizontal analysis. The mixed DNA profile was used as a unit for statistical analysis of locus separation accuracy in order to compare the distribution patterns of the DNA profile data in association with different mixed gradients and mixed sample types.
Vertical analysis. The STR locus was used as a unit for the statistical analysis of locus separation accuracy in order to compare the distribution patterns of DNA profile data in association with the 16 STR loci used in the present study.
The separation efficiency of mixsep in male-mixed DNA profiles was assessed using statistical analysis in the horizontal and vertical dimensions.
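The horizontal and vertical analyses can be sketched in Python as two groupings of the same per-locus outcomes. The record layout, the profile labels and the outcomes below are hypothetical; only the definition of accuracy (correctly separated loci divided by loci analyzed) follows the text.

```python
from collections import defaultdict


def separation_accuracy(records, key):
    """Fraction of correctly separated loci, grouped by the given key.

    key="profile" gives the horizontal analysis (one value per mixed profile);
    key="locus" gives the vertical analysis (one value per STR locus).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical outcomes for two mixed profiles and three loci.
RECORDS = [
    {"profile": "P1", "locus": "D8S1179", "correct": True},
    {"profile": "P1", "locus": "FGA", "correct": True},
    {"profile": "P1", "locus": "D7S820", "correct": False},
    {"profile": "P2", "locus": "D8S1179", "correct": True},
    {"profile": "P2", "locus": "FGA", "correct": False},
    {"profile": "P2", "locus": "D7S820", "correct": False},
]

HORIZONTAL = separation_accuracy(RECORDS, "profile")
VERTICAL = separation_accuracy(RECORDS, "locus")
```

Loci with ADO would be removed from the records before grouping when the analysis excludes ADO.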

Results
Preparation of simulated male-mixed DNA. The male DNA samples (n=40; nos. 1-40) and Promega male-DNA standards were classified according to the criterion of a DNA concentration difference of no greater than 0.5 ng/µl. The 22 single DNA samples that met this criterion were prepared into eleven groups of two-male mixed DNA samples. To include the Promega male-DNA standard in constructing simulated mixed DNA, the ten-fold-diluted 2800M control DNA working solution was further diluted twice, yielding a final concentration of 0.243 ng/µl (Table I). Each group of male-mixed DNA was prepared into nine mixed gradients, and the samples of each mixed gradient were amplified by PCR three times (thus, n=297). The mixed gradients of male-mixed DNA samples are shown in Table II.
DNA quantity of male-mixed DNA. The simulated male-mixed DNA samples were checked by assessing selected samples using an ABI 7500 qPCR system (Applied Biosystems), including eleven groups of male-mixed DNA at a mixed gradient of 1:9. DNA quantification of each sample was repeated three times, and the mean values were taken as the DNA concentration (Table III).
To fit the concentration range (0.5-1.25 ng/µl) of template DNA recommended by the kit used in this study, 99 male-mixed DNA working solutions (eleven groups of mixed DNA with nine mixed gradients in each group) were diluted appropriately. According to the DNA quantification results (Table III), samples with concentrations >0.5 ng/µl were not diluted. The volume of the DNA template was 2 µl for the mixed DNA sample, Sample 11, which was composed of the male-DNA standards (n=27), and 1 µl for the other groups, including single DNA samples used for constructing male-mixed DNA.
Table III. DNA quantity in eleven groups of male-mixed DNA with mixed gradient of 1:9.
In the present study, the estimated Mx values of mixsep were used as the estimated alpha and the pre-set mixed gradients of male-mixed DNA were used as the theoretical alpha. The distribution of estimated and theoretical alpha values in Identifiler (ID)-STR profiles of the mixed DNA was examined by excluding STR loci with ADO. In Fig. 3, the red line indicates y=x and the blue line represents the locally weighted regression curve. This approach had acceptable anti-noise performance and thus accurately reflected the correlation between estimated and theoretical alpha values. The results showed that with a theoretical alpha value ≤0.33 (that is, mixed gradients of 1:2 to 1:9), the alpha estimated by mixsep was greater than the theoretical value. However, with a gradient of 1:1, the estimated alpha value was smaller than the theoretical value. This observation may have been based on the assumption of normal distribution in constructing statistical models by mixsep, which led to conservative estimation of relatively extreme mixture proportions (such as 1:5, 1:6, 1:7, 1:8 and 1:9), inclining toward relatively balanced mixture proportions.
Two values showed an abnormal distribution in Fig. 3 and deviated significantly from the locally weighted regression curve. These two data points corresponded to the third repetition of the gradient of 1:5 and the first repetition of the gradient of 1:6 for the mixed DNA samples of group no. 9, respectively. The two abnormal values were obtained when running mixsep from source code. However, when running mixsep from the software interface, the obtained alpha values were 0.1742 and 0.1537, respectively, which were each located near the weighted regression curve and followed a normal distribution. The reason for this result is elusive, since all other alpha values estimated using mixsep through source code were consistent with those estimated when using it through the software interface, and no bug was found when running mixsep through the software interface. In view of this situation, the results estimated by mixsep through the software interface are referred to in this article.
Root mean square error (RMSE) statistics showed that in ID-STR profiles, large RMSEs of estimated alpha values were scattered across the eleven groups of male-mixed DNA samples, with relatively high frequencies in groups 8 and 9. In terms of mixed gradients, RMSEs were relatively large at a mixed gradient of 1:1 (>0.02) and ranged from 0.01 to 0.02 at the other gradients. Theoretically, mixed DNA at a gradient of 1:1 cannot be accurately separated (although this is ignored in statistical analysis). These results demonstrated that the RMSE between estimated and theoretical Mx was small (≤0.02) in ID-STR profiles of the male-mixed DNA model established in the present study. Thus, the obtained ID-STR profile data allowed scientific and rational analysis of mixed DNA.
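The RMSE criterion can be illustrated with a short Python sketch. The two estimated alpha values are the interface estimates reported above for the 1:5 and 1:6 gradients of group no. 9; treating the theoretical alpha of a 1:k gradient as 1/(1+k) and pooling these two values into one RMSE are assumptions made only for this example.

```python
import math


def rmse(estimated, theoretical):
    """Root mean square error between paired estimated and theoretical values."""
    if not estimated or len(estimated) != len(theoretical):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, theoretical)]
    return math.sqrt(sum(squared) / len(squared))


ESTIMATED_ALPHA = [0.1742, 0.1537]    # reported interface estimates (1:5, 1:6)
THEORETICAL_ALPHA = [1 / 6, 1 / 7]    # assumed alpha = 1 / (1 + k) for 1:k

EXAMPLE_RMSE = rmse(ESTIMATED_ALPHA, THEORETICAL_ALPHA)
WITHIN_CRITERION = EXAMPLE_RMSE <= 0.02
```

Both estimates lie above their theoretical values, in line with the observation that mixsep estimated a larger alpha than the theoretical value at unbalanced gradients.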
Performance analysis of mixsep
Horizontal analysis. The eleven groups of male-mixed DNA profiles (with three parallel tests) at each mixed gradient involved 528 STR loci. Data statistics (Table VI) and distribution (Fig. 4) of locus separation accuracy and ADO number show that the ADO number increased from a gradient of 1:4 [...] significant correlation between these two parameters. Fig. 5 shows the distribution of average locus separation accuracy at different mixed gradients in the three parallel tests, in which the results were generally consistent. Locus separation accuracy was lowest at a mixed gradient of 1:1; with an increasing mixed gradient, the accuracy first increased and then decreased. Specifically, locus separation accuracy was relatively high at gradients of 1:2, 1:3 and 1:4 but decreased to low levels and fluctuated at gradients of 1:1 and 1:9. The accuracy was slightly higher in mixed DNA profiles excluding loci with ADO compared with those including ADO.
Data statistics (Table VII) and distribution (Fig. 6) of locus separation accuracy in the eleven groups of male-mixed DNA samples at different mixed gradients show that the distribution pattern of the accuracy in every group of mixed DNA was generally consistent with the overall distribution mentioned above. The accuracy was lowest at a gradient of 1:1 (with the exception of group no. 9). With an increasing mixed gradient, the accuracy first increased and then decreased. Among the eleven groups of mixed DNA, large fluctuations in locus separation accuracy were observed in groups no. 7, 9 and 11, which may have been due to variations in experimental operations. The accuracy was generally high in groups no. 1, 3 and 4. There were differences in the overall level of locus separation accuracy among the eleven groups of mixed DNA, demonstrating the stochastic effect of sampling.
In the mixed DNA experimental model, each of the nine mixed gradients of a specific locus involved 33 values of locus separation accuracy. Data statistics (Table IX) and distribution (Fig. 8) of locus separation accuracy for 16 STR loci at each mixed gradient show that for a gradient of 1:1, the accuracy was ≤70% for the STR loci, with the exception of AMEL and D3S1358 (outliers are shown in the lower area of the box-whisker plot, Fig. 8). For gradients of 1:2, 1:3, 1:4 and 1:5, the accuracy of each locus was relatively high, particularly at the gradient of 1:3 (≥90%), while at the gradients of 1:8 and 1:9, the accuracy underwent large fluctuations and declined to lower levels. According to the data distribution shown in the box-whisker plot, the average separation accuracy was lowest for the D7S820 locus among the 16 STR loci. Data statistics (Table X) and distribution (Fig. 9) of locus separation accuracy in the 297 simulated male-mixed DNA profiles show that the accuracy was relatively high for loci D5S818, D8S1179 and FGA (>88%), but relatively low for loci D19S433, D2S1338 and D7S820 (≤80%). The number of ADO was lowest in AMEL, D5S818 and D8S1179, but was relatively high in loci D18S51, D19S433, FGA, TPOX and vWA (>15). The latter five loci were all distributed in the yellow and red channels with lower allele peak heights (APH), consistent with the relatively low sensitivity to yellow and red fluorescence in the ABI 3130xl Genetic Analyzer. There was no significant correlation between the accuracy of the 16 STR loci and the number of ADO (R2=−0.3095; P=0.2434).
[Table fragment; table number, column headers and the label of the first row were not recovered]
(first row): 3, 1, 0, 5, 2, 0, 64, 69, 39, 0, 3
Loci no.: 429, 431, 432, 427, 430, 432, 368, 363, 393, 432, 429
Sum: 432, 432, 432, 432, 432, 432, 432, 432, 432, 432, 432
ADO, allelic drop-out.
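The counts in the table fragment above can be cross-checked against the study design with a few lines of Python. Two readings are assumptions here, since the headers were lost in extraction: that the eleven columns correspond to the eleven groups of mixed DNA, and that the first row gives the number of loci with ADO per group.

```python
GROUPS, GRADIENTS, REPLICATES, LOCI = 11, 9, 3, 16

# Rows as they appear in the table fragment (interpretation assumed).
ADO_PER_GROUP = [3, 1, 0, 5, 2, 0, 64, 69, 39, 0, 3]
NON_ADO_PER_GROUP = [429, 431, 432, 427, 430, 432, 368, 363, 393, 432, 429]

PROFILES = GROUPS * GRADIENTS * REPLICATES       # simulated mixed DNA profiles
LOCI_PER_GROUP = GRADIENTS * REPLICATES * LOCI   # loci typed per group
LOCI_PER_GRADIENT = GROUPS * REPLICATES * LOCI   # loci typed per mixed gradient
TOTAL_LOCI = PROFILES * LOCI

# Each column should add up to the number of loci typed per group.
COLUMNS_CONSISTENT = all(
    ado + kept == LOCI_PER_GROUP
    for ado, kept in zip(ADO_PER_GROUP, NON_ADO_PER_GROUP)
)
TOTAL_ADO = sum(ADO_PER_GROUP)
TOTAL_NON_ADO = sum(NON_ADO_PER_GROUP)
```

Under these assumptions the design gives 297 profiles, 432 loci per group and 528 loci per gradient, and the 4,566 loci remaining after exclusion of ADO match the number of STR loci analyzed excluding ADO.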
Data statistics (Table XI) and distribution (Fig. 10) of locus separation accuracy in the eleven groups of simulated male-mixed DNA profiles show that groups no. 1 and 7 contained relatively large numbers of loci with a separation accuracy ≤0.5. The accuracy of loci D19S433, D2S1338 and D7S820 was associated with relatively large fluctuations, with the lowest median accuracy for D7S820. These results were generally consistent with the overall distribution of locus separation accuracy at the nine mixed gradients in the results from the other experiments.

Discussion
In the present study, an experimental model comprising eleven groups of male-mixed DNA (n=297) was established by following the mixed DNA construction principle of using an ABI 7500 real-time PCR system with scientific validation of the Mx parameter (RMSE≤0.02). The locus separation accuracy of the mixsep package was statistically analyzed using horizontal and vertical analysis of experimental data, with mixed DNA profiles and STR loci as units. The DNA profile distribution data corresponding to different mixed gradients, STR loci and mixed DNA types was examined to assess the performance of the mixsep package in the analysis of mixed DNA.
Locus separation accuracy of mixsep had a negative linear correlation with the Mx value (R2=−0.7121), with the exception of the 1:1 gradient; overall, the accuracy first increased and then decreased with increasing mixed gradient imbalance. Thus, the Mx value was one of the primary factors that determined the success of mixed DNA analysis. Among the 16 STR loci, the number of ADO was relatively high in the D18S51, D19S433, FGA, TPOX and vWA loci (>15). These five loci were all located in the yellow and red channels and had a low allele peak height (APH), consistent with the low sensitivity to yellow and red fluorescence of the ABI 3130xl Genetic Analyzer. In addition, there was a large, albeit non-significant, difference in locus separation accuracy depending on whether the loci with ADO were included or excluded (~10%). The presence of ADO reduced the analytical performance of mixsep, consistent with the lack of ADO functional modules in this software.
The present study demonstrated that the mixsep software had a number of advantages. It was easy to operate and produced understandable results with a degree of controllability. It produced intuitive results presented in visual typing maps. Furthermore, rational assumptions were made in the established model with appropriate reasoning, and produced results with high validity. However, certain limitations remained in the use of mixsep, including the existence of bugs, which may result in the occasional generation of outliers in data analysis, as well as graphic dysfunction. In addition, the control of the software interface was inflexible and presentation was occasionally incomplete. Due to these limitations, the lack of analysis modules for dealing with stutter, drop-out and drop-in, and the unknown prior conditions in model assumptions, it is necessary to further optimize and improve the mixsep package in order to produce consistently reliable results.