Critical evaluation of a simple retention time predictor based on Log Kow as a complementary tool in the identification of emerging contaminants in water

There has been great interest in environmental analytical chemistry in developing screening methods based on liquid chromatography-high resolution mass spectrometry (LC-HRMS) for emerging contaminants. Using HRMS, compound identification relies on the high mass resolving power and mass accuracy attainable by these analyzers. When dealing with wide-scope screening, retention time prediction can be a complementary tool for the identification of compounds, and can also reduce tedious data processing when several peaks appear in the extracted ion chromatograms. There are many in silico Quantitative Structure-Retention Relationship methods available for the prediction of retention time in LC. However, most of these methods use commercial software to predict retention time based on various molecular descriptors. This paper explores the applicability of, and critically discusses, a far simpler and cheaper approach that predicts retention times using LogKow. The predictor was based on a database of 595 compounds, their respective LogKow values and a chromatographic run time of 18 min. Approximately 95% of the compounds were found within 4.0 min of their actual retention times, and 70% within 2.0 min. A predictor based purely on pesticides was also made, enabling 80% of these compounds to be found within 2.0 min of their actual retention times. To demonstrate the utility of the predictors, they were successfully used as an additional tool in the identification of 30 commonly found emerging contaminants in water. Furthermore, different mass extraction windows were compared to minimize the number of false positives obtained.


Introduction
Many environmental chemists have focused their research on "emerging contaminants" in water, which encompass a wide range of compounds including drugs for health and personal care, flame retardants, illicit drugs and all kinds of transformation products or byproducts [1]. These compounds can have detrimental effects on the environment [2], and it is therefore important to monitor them frequently in order to know their concentrations and fate in water.
High resolution mass spectrometry (HRMS), using analyzers such as Quadrupole Time-of-Flight (QTOF)-MS and Linear Ion Trap (LTQ) Orbitrap MS, has revolutionized screening of emerging contaminants [3][4][5][6]. It offers the possibility to investigate the presence of a theoretically unlimited number of compounds once the analysis has been performed and data acquired, provided that the compounds are compatible with the requirements of the chromatographic separation and MS ionization. Due to their high resolution, both hybrid instruments provide data with high mass accuracy and frequently allow tentative identification of compounds even without having reference standards [7,8].
Identification of the compounds detected is obviously facilitated when reference standards are available, as relevant information on retention time and fragment ions is included. However, it is also possible to perform screening without the need for reference standards, simply on the basis of a large database, where empirical formulae (i.e. exact mass) are the only information required. Post-target screening without standards (i.e. suspect screening) is becoming more and more common. Here, the exact masses of the compounds of interest are gathered, then searched and extracted from HRMS spectra, using a narrow mass window (20 mDa), in order to find potential positives [5,7,9-11]. The increased resolving power of modern TOF and Orbitrap analyzers allows an even narrower mass window, reducing matrix interferences and leading to a cleaner chromatogram. Ideally, there would be a single peak in the chromatogram, coming from the suspect. However, in more complex environmental matrices, it is likely that there is more than one peak in the eXtracted Ion Chromatogram (XIC), arising from various isobaric or isomeric compounds, complicating the identification of the compound of interest. A variety of techniques can be used to aid in the detection/identification process, such as mass window filtering, mass defect analysis or isotopic pattern fit.
With the high quality data obtained by HRMS, compound separation by liquid chromatography (LC) is sometimes overlooked, although it is also an important parameter [12]. Retention time prediction takes this LC aspect into account in the identification process and can therefore be a useful technique when performing a wide-scope screening. Reliable information on retention time can focus the identification process solely on those peaks that are in agreement with the predicted retention time, ignoring false positives that may appear in complex matrix samples and do not correspond to the candidate under investigation.
There are several in silico Quantitative Structure-Retention Relationship (QSRR) methods available for the prediction of LC retention times, used in a variety of research [13,14]. The principal aim of QSRR is to predict retention data from the molecular structure, using descriptors such as molecular mass, polar surface area, Log P, molar polarisability and molar volume. Linear Solvation Energy Relationships (LSERs) have also been proposed for retention time prediction. LSERs analyze any free energy related property by five fundamental solute parameters: the hydrogen bond acidity and hydrogen bond basicity, excess molar refraction, dipolarity/polarizability and the logarithm of the gas-hexadecane partition coefficient or the characteristic McGowan volume [14]. The major problem for LSERs is that all terms are needed and any missing value can be problematic [15]. While work with LSERs in this area is ongoing, Artificial Neural Networks (ANNs), a predictive computing technique, have shown themselves to be promising retention time predictors [16].
In the presented work, a free and much simpler approach was tested using only Log Kow for retention time prediction. By utilizing this predictor, we show a way to reduce the amount of time spent poring through chromatograms, with many peaks in the chromatogram able to be disregarded. A critical discussion is made on this predictor, showing the advantages and drawbacks. The usefulness of mass window filtering as a complementary tool to eliminate peaks not arising from the compound of interest is also evaluated in order to facilitate the identification process of emerging contaminants in water in wide-scope screening procedures.

Reagents and Chemicals
A total of 595 reference standards were used in the development of the retention time predictor. This set combined the retention times of 311 individual standards from an in-house database and 284 from a Waters database (Waters, Milford, MA, USA). See Table S1 for a list of all compounds used, with their retention times and LogKow values. There were some duplicates between the two sets of standards, which were used to check the consistency in retention time (see Development of Retention Time Predictor for more information).
Details relating to the standards can be found elsewhere [17]. Retention times were obtained by injecting mixed working standard solutions (25 μg/L or 50 μg/L, diluted from mixed standard solutions in methanol or acetonitrile with water). In addition, grab samples were taken from 11 surface waters from several points located in Spain and Colombia between November, 2010 and May, 2013. All these samples had previously been used in different studies performed at our lab using UHPLC-QTOF MS for their analysis. Sample treatment was based on solid phase extraction using polymeric Oasis HLB cartridges, which are able to retain organic compounds within a wide range of polarity.

UHPLC-QTOF MS
A Waters Acquity UPLC system (Waters, Milford, MA, USA) was interfaced to a hybrid quadrupole-orthogonal acceleration-TOF mass spectrometer (XEVO G2 QTOF, Waters Micromass, Manchester, UK), using an ESI (Z-Spray) interface operating in positive ion mode. The chromatographic separation was performed using an Acquity UPLC BEH C18 100 × 2.1 mm, 1.7 µm particle size column (Waters) at a flow rate of 300 µL/min. The mobile phases used were A = H2O and B = MeOH, both with 0.01% formic acid. The initial percentage of B was 10%, which was linearly increased to 90% in 14 min, followed by a 2 min isocratic period and then returned to initial conditions during 2 min. The total run time was 18 min. Nitrogen was used as drying gas and nebulizing gas. TOF-MS resolution was approximately 20,000 at full width half maximum at m/z 556. MS data were acquired over an m/z range of 50-1000, at 0.4 s scan time. A capillary voltage of 0.7 kV and cone voltage of 20 V were used. Collision gas was argon 99.995% (Praxair, Valencia, Spain). The desolvation temperature was set to 600 °C, and the source temperature to 135 °C. The column temperature was set to 40 °C. MS data were acquired in MSE mode, selecting a collision energy of 4 eV for low energy (LE) and a ramp of 15-40 eV for high energy (HE) [5,18].

Data Processing
MS data processing was performed manually on MassLynx v 4.1 (Waters Corporation), looking at the raw data in chromatogram view, initially with a mass extraction window of 20mDa using the retention time predictor developed in this study. Later, different mass extraction windows (50mDa, 10mDa and 5mDa) were also evaluated (see "Application of the retention time predictor to real water samples"). All peaks above an intensity of 3000 were counted.

Development of Retention Time Predictor
A dataset of 595 compounds was used to initially prepare the retention time predictor. The retention times for the compounds from the "Waters" database were obtained using the same column but a slightly different gradient from the one described in this paper. To check consistency, the 94 compounds common to both datasets were compared by linear correlation in Excel (R² = 0.9279). These compounds cover the Log Kow range of -1 to 8, thereby covering the entire Log Kow range of the compounds under investigation. The equation from this correlation was used to convert all the Waters retention times to fit our in-house retention times (Figure S1). The LogKow of each of these compounds was estimated using the freely available ALOGPS 2.1 software (VCCLAB, 2005 [19]).
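The conversion between the two retention time scales can be sketched as follows. This is only a minimal illustration and not the Excel workflow actually used: the function names (fit_line, r_squared, waters_to_inhouse) and the toy retention times are hypothetical, and only the idea (fit a straight line on the compounds common to both databases, then apply it to the remaining Waters values) comes from the text.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical retention times (min) of compounds present in both databases.
waters_rt = [2.0, 4.0, 6.0, 8.0, 10.0]
inhouse_rt = [2.6, 4.9, 7.1, 9.4, 11.6]

slope, intercept = fit_line(waters_rt, inhouse_rt)


def waters_to_inhouse(rt):
    """Map a Waters-database retention time onto the in-house scale."""
    return slope * rt + intercept


print("R2 of common compounds:", round(r_squared(waters_rt, inhouse_rt, slope, intercept), 4))
print("Waters 5.0 min on in-house scale:", round(waters_to_inhouse(5.0), 2))
```

In practice the fit would use the 94 common compounds, and the resulting equation would be applied to every Waters retention time before merging the two sets.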
A linear correlation was again made to compare the LogKow and retention time ( of organic contaminants. Specific compounds were used for the training set, with only standards that were estimated to be predominantly neutral at an elution pH of 3 deemed acceptable, to reduce the number of false negatives for both neutral and ionic TPs as the latter are less retained than their corresponding neutral species. Nevertheless, the correlation for the 92 reference standards was very good, with an R² of 0.87. However, the current work is more wide-reaching, containing nearly 600 compounds of differing classes and physicochemical properties, and is therefore expected to have a worse correlation. In order to improve this correlation, the 595 compounds were subdivided, not on the basis of physicochemical parameters but solely on class, into two smaller subsets: pesticides and non-pesticides. The dataset initially contained pesticides, drugs of abuse, antibiotics, pharmaceuticals, veterinary drugs and mycotoxins. The vast majority of the compounds were pesticides (345 compounds), which made up one subgroup, while all the other compounds made up the other (250 compounds) (Figures S2 and S3).
The "pesticides only" grouping gave a slightly better correlation (R² = 0.6947), while the "non-pesticides" had a slightly worse correlation (R² = 0.6518), compared with the overall correlation (R² = 0.6704). Although these correlations have rather large variability (especially for "non-pesticides"), it was decided to compare the predicted retention time (calculated from the equations in each of the figures, where "x" is the LogKow) with the actual retention time for each compound. Some deviation between the experimental and predicted retention times was expected, because it is difficult for the algorithm used by the LogKow predictor (ALOGPS 2.1) to cope with complex molecules, leading to some inaccurate LogKow values. Table 1 shows the differences between the predicted and actual retention times for the three cases.
In spite of the variability of the data, the retention times of approximately 95% of the compounds were found within 4 minutes of their actual retention time. In the case of the subsets, 79% of the pesticides can be found within 2 minutes of their actual retention time, which is 14% better than for non-pesticides.
From these results, it was decided to use a ± 2 minute window (± 11% of the chromatographic run) in real samples. The "pesticides only" predictor was selected for pesticides and the "all compounds" predictor for the other compounds, so that approximately 80% and 70% of the compounds, respectively, are expected to be found within the window.
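The way such a window is evaluated can be sketched in a few lines. The slope and intercept below are placeholders, since the actual regression equations are given only in the figures, and the (LogKow, retention time) pairs are invented for illustration; the function names are likewise hypothetical.

```python
RUN_TIME_MIN = 18.0  # total chromatographic run time stated in the method


def predict_rt(log_kow, slope, intercept):
    """Linear predictor: retention time (min) as a function of LogKow."""
    return slope * log_kow + intercept


def coverage(records, slope, intercept, half_window):
    """Fraction of compounds whose measured retention time lies within
    +/- half_window minutes of the predicted retention time."""
    hits = sum(
        1
        for log_kow, measured_rt in records
        if abs(measured_rt - predict_rt(log_kow, slope, intercept)) <= half_window
    )
    return hits / len(records)


# Placeholder coefficients and invented (LogKow, measured RT in min) pairs.
SLOPE, INTERCEPT = 1.5, 3.0
records = [(0.5, 4.0), (2.0, 6.5), (3.0, 10.2), (4.5, 9.9), (1.0, 7.1)]

for half_window in (2.0, 4.0):
    share_of_run = half_window / RUN_TIME_MIN
    print(
        f"+/- {half_window} min ({share_of_run:.0%} of the run): "
        f"{coverage(records, SLOPE, INTERCEPT, half_window):.0%} of compounds inside"
    )
```

The ± 11% figure quoted above corresponds to 2 min out of the 18 min run.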
This predictor was designed specifically for this precise chromatographic system, gradient and method. If applied to other separation systems, the correlation between LogKow and retention time for these particular compounds may differ widely.
However, it is easy to adapt this retention time predictor to other systems using the methodology outlined in "Development of Retention Time Predictor". Some groups have introduced a retention time index [6] to cope with this limitation. To obtain a more complete predictor, training/validation sets comprising different compounds could be incorporated. Furthermore, different chromatographic conditions, such as different columns and gradients, could be evaluated.

Impact of experimental and predicted LogKow
In an attempt to gain narrower retention time windows, the impact of experimental versus predicted LogKow values was studied. A study comparing the accuracy of predictive software showed ALOGPS 2.1 to be quite accurate, differing from the measured values by up to 0.5 Log units [15]. Nevertheless, software such as ALOGPS 2.1 is prone to systematic errors, especially for complex molecular structures, because correction factors for certain structural configurations might be missing [9]. To test the difference between experimental and predicted LogKow for the retention time predictors, "experimental" LogKow values were found for 280 of the pesticides in the Pesticide Manual [20] and for 52 drugs of abuse, antibiotics, veterinary drugs and pharmaceuticals on DrugBank [21]. Cumulatively, 77% of the compounds had an absolute difference within 1 Log unit; however, 10% still had a difference greater than 2 Log units.
These findings show that, while predicted values give a good estimate, the fact that some compounds differed by more than 2 Log units means that experimentally derived values are preferable for any later retention time predictions. The 36 compounds whose experimental and predicted LogKow values differed by more than 2 Log units were removed from the overall compound list to see the impact on the correlation coefficient. The R² barely changed, moving only from 0.6704 to 0.6737 following their removal (Figure S4).
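The comparison described in this section amounts to computing the absolute difference between experimental and predicted LogKow for each compound, summarising how many fall within a given number of Log units, and discarding the largest discrepancies. A minimal sketch is given below; the compound names and values are invented, and the function names are hypothetical.

```python
def discrepancy_summary(pairs, thresholds=(1.0, 2.0)):
    """pairs: list of (name, experimental_logkow, predicted_logkow).

    Returns the absolute difference per compound and, for each threshold,
    the fraction of compounds whose difference is within that threshold."""
    diffs = {name: abs(exp - pred) for name, exp, pred in pairs}
    n = len(diffs)
    within = {t: sum(1 for d in diffs.values() if d <= t) / n for t in thresholds}
    return diffs, within


def drop_outliers(pairs, max_diff=2.0):
    """Keep only compounds whose experimental and predicted LogKow differ
    by no more than max_diff Log units."""
    return [p for p in pairs if abs(p[1] - p[2]) <= max_diff]


# Invented example values: (name, experimental LogKow, predicted LogKow).
pairs = [
    ("compound A", 1.2, 1.0),
    ("compound B", 3.5, 2.9),
    ("compound C", 0.4, 2.0),
    ("compound D", 5.1, 2.8),
    ("compound E", 2.2, 2.3),
]

diffs, within = discrepancy_summary(pairs)
kept = drop_outliers(pairs)
print("within 1 Log unit:", within[1.0])
print("within 2 Log units:", within[2.0])
print("kept after removing > 2 Log units:", [name for name, _, _ in kept])
```

The retained list would then be used to refit the LogKow-retention time correlation and compare the resulting R² with the original one.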

Application of the retention time predictor to real water samples
Fifteen influent wastewater and surface water samples were selected, representing different types of water sources, to test the retention time predictors. As stated previously, two predictors were used: the "pesticides only" predictor for pesticides and the "all compounds" predictor for the other compounds. A set of 30 compounds was selected, based on their prevalence in environmental water samples [22][23][24], and their retention times were predicted using the aforementioned equations with the predicted LogKow of each of the compounds (Table S2).
A retention time window of ± 2 minutes (from the predicted value) was given for each compound, and they were searched with a mass window of 20mDa. All peaks in the narrow window (nw)-XIC for each compound were counted manually (above intensity of 3000), while all peaks outside the prediction window were disregarded.
All of the 30 compounds were detected in the water samples. Of these, 20 were found in the ± 2 minute retention time window (Figure 3). Remarkably, eight of the 20 compounds had only one peak inside the retention time window. In addition, the percentage of peaks outside the retention time window of ± 2 minutes (and therefore not pertaining to the compound of interest) was found to be 35%, meaning that over one third of the peaks in the XICs could be disregarded through retention time prediction. This retention time prediction shows a noticeable reduction in time spent processing data for potential positive samples.
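The peak counting described above can be illustrated with a short sketch. It assumes a peak list of (retention time, intensity) pairs has already been read from the XIC; the peak values, the predicted retention time and the function name are hypothetical, while the intensity threshold of 3000 and the ± 2 minute window are taken from the text.

```python
INTENSITY_THRESHOLD = 3000  # peaks above this intensity were counted
HALF_WINDOW_MIN = 2.0       # +/- 2 minute retention time window


def split_peaks(peaks, predicted_rt, half_window=HALF_WINDOW_MIN,
                min_intensity=INTENSITY_THRESHOLD):
    """Split counted peaks into those inside and outside the predicted
    retention time window. peaks: list of (rt_min, intensity)."""
    counted = [p for p in peaks if p[1] > min_intensity]
    inside = [p for p in counted if abs(p[0] - predicted_rt) <= half_window]
    outside = [p for p in counted if abs(p[0] - predicted_rt) > half_window]
    return inside, outside


# Hypothetical XIC peak list and predicted retention time (min).
peaks = [(1.2, 8000), (3.9, 15000), (5.0, 2500), (6.1, 4200), (9.4, 7000)]
predicted_rt = 5.0

inside, outside = split_peaks(peaks, predicted_rt)
fraction_outside = len(outside) / (len(inside) + len(outside))
print("candidate peaks:", inside)
print("disregarded peaks:", outside)
print(f"fraction disregarded: {fraction_outside:.0%}")
```

Only the peaks returned as candidates would need further inspection (accurate mass, fragment ions, isotopic pattern).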

Complementary use of mass chromatogram extraction window
Using a mass window of 20 mDa, four compounds had only one peak in the XIC (Figure 3). To complement the information and applicability of the retention time predictor, a comparison was made with three additional mass chromatogram extraction windows (50 mDa, 10 mDa and 5 mDa). The 20 mDa window was used as a reference for these tests.
The ten extra compounds, whose predicted retention time fell outside of the ± 2 minute window, were also included in this test (see Table S2). The results of this comparison are shown in Table 2. As expected, the number of peaks observable in the XIC decreases as the extraction window decreases, from 119 (50 mDa) down to 62 peaks (5 mDa). However, it is of note that even at the 5 mDa extraction window and within the retention time window, there were still unknown peaks that did not pertain to the compound of interest. In these situations, retention time prediction is very helpful in reducing false positive findings. Figure 4 shows the influence of mass windows in XICs and the predicted retention time window for two compounds, one detected in surface water (benzoylecgonine, BE) and one in influent wastewater (trimethoprim). In the case of BE (the major metabolite of cocaine), several isobaric compounds are also seen using a 50 mDa XIC; however, by decreasing the mass extraction window, all of these peaks disappear, leaving just the peak of BE at 4.53 minutes and a small spike at 6.5 minutes. Although BE could be identified solely with a nw-XIC, its retention time fell just outside the predicted window (2.48 ± 2 min). As stated above, the use of a ± 2 min retention time window led to a 70% success rate in the predictions for the "all compounds" group, in which BE is included. Using a slightly wider window, such as ± 2.5 min, to reach a success rate of 80%, similar to that of the pesticides group (see Table 1), would have placed BE inside the prediction window. In any case, it seems that improvements are needed in the retention time prediction in order for it to be more useful in the identification process.
In the case of the antibiotic trimethoprim, the peak is easily seen at 5 mDa (3.66 minutes); however, at a larger mass extraction window, two isobaric compounds are observed at a much higher intensity (2.5-3.0 minutes). With the retention time window (5.61 ± 2 min), these peaks, as well as the one at 7.6 minutes, are removed, leaving just the two peaks at 3.66 and 4.6 minutes. The first one corresponded to trimethoprim, while the second was an unknown. This example shows the true utility of retention time prediction, especially in combination with HRMS. While mass extraction windows can be narrowed to remove some interfering peaks, even at a 5 mDa mass window, pseudoisobaric interferences remain. By incorporating retention time prediction, some of these false positive peaks can be removed.
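The effect of narrowing the mass extraction window can be sketched as below. This is an illustration under stated assumptions rather than the MassLynx procedure: the window is treated here as a total width centred on the target m/z (the text does not specify whether it is a total or a ± width), and the target m/z, scan data and function name are all hypothetical.

```python
def extract_xic(scans, target_mz, window_mda):
    """Build an extracted ion chromatogram.

    scans: list of (rt_min, [(mz, intensity), ...]).
    window_mda: total extraction window width in mDa, centred on target_mz
    (an assumption made for this sketch)."""
    half_width = window_mda / 1000.0 / 2.0  # mDa -> Da, then half on each side
    xic = []
    for rt, ions in scans:
        signal = sum(i for mz, i in ions if abs(mz - target_mz) <= half_width)
        xic.append((rt, signal))
    return xic


# Hypothetical target ion and centroided scans.
TARGET_MZ = 300.1000
scans = [
    (3.00, [(300.1180, 9000), (300.1004, 500)]),  # isobaric interference, 18 mDa away
    (3.66, [(300.1010, 12000)]),                  # compound of interest
    (4.60, [(300.1021, 4000)]),                   # pseudo-isobaric unknown, 2.1 mDa away
]

for window in (50, 20, 10, 5):
    print(f"{window} mDa:", extract_xic(scans, TARGET_MZ, window))
```

In this toy data set the interference 18 mDa away disappears once the window is narrowed to 20 mDa or less, whereas the signal 2.1 mDa away survives even at 5 mDa, which is the situation in which a retention time window adds discriminating power.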

Conclusions
A critical evaluation has been made of the applicability of retention time prediction based on Log Kow to help in the identification of suspect compounds in screening procedures. Two predictors were used: one based on pesticides only, and one for all other compounds (mainly licit and illicit drugs). Both were tested on 30 emerging contaminants commonly found in environmental and wastewater samples by retrospective analysis with QTOF-MS. In addition to helping identify the compound of interest, the retention time predictors also allowed over one third of isobaric chromatographic peaks to be disregarded for further analysis as they were outside the retention time window, thereby reducing tedious data processing. This is relevant when applying wide-scope screening for a large number of compounds (e.g. emerging contaminants and pesticides) or when investigating the presence of transformation products, where many of the required reference standards are not available at the laboratory. In addition to the retention time prediction, the impact of extracted mass windows was investigated as a complementary tool for the screening. A smaller mass extraction window also removed unwanted peaks from the chromatogram.
However, in some cases, even at a narrow mass extraction window of 5mDa, some isobaric peaks still appeared.
The combination of this simple retention time predictor with extracted mass windows facilitated the removal of many false positives. In this work, 70-80% of the compounds studied could be found in a ± 2 minute window, but a ± 5 minute window was needed for 98% confidence. Our present research is focused on alternative and more sophisticated retention time predictors in order to improve the precision. This would allow the use of narrower time windows, thereby simplifying data processing because fewer peaks would need to be investigated in the chromatograms.

Supplementary Information
In this section, a table of the 595 compounds investigated, with their predicted LogKow values and empirical retention times, is given (Table S1), as well as a table of the compounds investigated in real samples, including their predicted and empirical retention times and precursor ion ([M+H]+) (Table S2). Furthermore, correlations of the Waters versus in-house compounds (Figure S1), pesticides only (n=345) with 95% confidence interval (Figure S2) and all other compounds (n=250) with 95% confidence interval (Figure S3) are included. Finally, a figure is included showing the impact of removing, from the original LogKow-retention time correlation, the compounds whose predicted Log Kow differs by more than 2 Log units from the experimental value (Figure S4).