Machine Learning Pattern Recognition for Forensic Analysis of Detected PFAS in Environmental Samples

Objective

Aqueous film-forming foam (AFFF) formulations based on per- and polyfluoroalkyl substances (PFAS) were used extensively throughout the United States, and groundwater impact is a concern at hundreds of locations. Because PFAS have also been widely used in non-AFFF applications, it is important to be able to distinguish PFAS impact that originates from AFFF sources from impact that originates from non-AFFF sources. This proof-of-concept project explored the use of modern machine learning algorithms to search for recognizable patterns in PFAS-containing samples and so identify whether PFAS detected in environmental samples originate from AFFF sources.

Technical Approach

The project focused on three main tasks:

Task 1. Data collection and preprocessing. An extensive, machine-readable dataset was created from PFAS concentration data collected from around the world, specifically formatted to be used as an input to machine learning applications.

Task 2. Evaluation of machine learning algorithms for source identification. A range of supervised and unsupervised learning activities were conducted to assess which methods exhibit the greatest performance for source identification. Work was also conducted to determine how PFAS component selection for machine learning impacts source identification.

Task 3. Preliminary data mining and exploration of deep learning algorithms. Hybrid approaches integrating machine learning with an understanding of PFAS physicochemical behaviors were used to explore the relationship between compositions in different phases. Preliminary tests of deep neural networks were also conducted for source identification.
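The preprocessing step in Task 1 can be illustrated with a minimal sketch. It assumes long-format records of (sample, component, concentration), treats non-detects as zero, and normalizes each sample to fractions of its total so that the features describe composition rather than magnitude. The record layout, the component list, the non-detect handling, and all values are hypothetical choices made for illustration and are not taken from the project dataset.

```python
from collections import defaultdict

# Hypothetical long-format records: (sample_id, component, concentration in ng/L).
# None marks a non-detect. All values are invented for illustration.
RECORDS = [
    ("S1", "PFOS", 120.0),
    ("S1", "PFOA", 40.0),
    ("S1", "PFHxS", 40.0),
    ("S2", "PFOS", None),
    ("S2", "PFOA", 30.0),
    ("S2", "PFHxS", 10.0),
]

# Assumed fixed feature order, so every sample yields a vector of equal length.
COMPONENTS = ["PFOS", "PFOA", "PFHxS"]


def build_feature_table(records, components):
    """Pivot long-format records into one fractional-composition row per sample."""
    by_sample = defaultdict(dict)
    for sample_id, component, conc in records:
        # Assumption: a non-detect is treated as zero concentration.
        by_sample[sample_id][component] = 0.0 if conc is None else conc

    table = {}
    for sample_id, concs in by_sample.items():
        row = [concs.get(c, 0.0) for c in components]
        total = sum(row)
        # Normalize to fractions of the total; leave all-zero rows unchanged.
        table[sample_id] = [v / total for v in row] if total > 0 else row
    return table


FEATURES = build_feature_table(RECORDS, COMPONENTS)

if __name__ == "__main__":
    for sample_id, row in FEATURES.items():
        print(sample_id, [round(v, 3) for v in row])
```

Each row of the resulting table is a fixed-length vector that could be passed to a classifier. Other treatments of non-detects (for example, substituting a fraction of the detection limit) would change the vectors and are equally plausible choices.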

Results

An initial PFAS dataset containing more than 22,000 samples was compiled and used to explore machine learning classification. Twelve supervised machine learning classifiers were tested, and all were found to provide excellent classification performance. Classification generally improved with greater numbers of components, but high performance could be achieved with as few as four components, provided they were selected such that they were present above detection limits in both the training dataset and the samples being classified. In contrast with the supervised learning results, tests with unsupervised learning (clustering) found that the clustering methods tested were not able to distinguish between PFAS from different sources, and were instead more sensitive to the effects of transport on composition. Finally, work partially supported by this project was conducted to develop a group contribution model for estimating the physicochemical parameters that describe hydrophobicity-driven interfacial and partitioning phenomena for individual PFAS components. In this work, the model was combined with scaling relationships to transform PFAS groundwater data into predicted PFAS compositions in phases in equilibrium with water. Overall, the transformed water data showed strong agreement with the distributions in non-water phases, and supervised learning classifiers trained on the transformed data exhibited improved ability to classify samples in non-water phases.
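The component-selection condition described above can be sketched as follows. The example keeps only components detected in the sample being classified and in every training sample, then assigns the sample to the class with the nearest mean composition. The nearest-centroid classifier is a stand-in chosen for brevity and is not claimed to be one of the twelve classifiers tested; requiring detection in every training sample is one strict reading of the condition. All names and concentrations are hypothetical.

```python
import math

# Hypothetical training data: component -> concentration (ng/L); None = non-detect.
TRAINING = [
    ({"PFOS": 150.0, "PFHxS": 60.0, "PFOA": 20.0, "PFBA": None}, "AFFF"),
    ({"PFOS": 200.0, "PFHxS": 90.0, "PFOA": 30.0, "PFBA": 5.0}, "AFFF"),
    ({"PFOS": 10.0, "PFHxS": 5.0, "PFOA": 80.0, "PFBA": None}, "non-AFFF"),
    ({"PFOS": 15.0, "PFHxS": 4.0, "PFOA": 120.0, "PFBA": 8.0}, "non-AFFF"),
]


def detected(sample):
    """Set of components reported above the detection limit."""
    return {c for c, v in sample.items() if v is not None}


def select_components(training, sample):
    """Components detected in the target sample and in every training sample."""
    common = detected(sample)
    for features, _ in training:
        common &= detected(features)
    return sorted(common)


def composition(sample, components):
    """Fractions of the total over the selected components only."""
    values = [sample[c] for c in components]
    total = sum(values)
    return [v / total for v in values]


def classify(training, sample):
    """Nearest-centroid classification on the shared, detected components."""
    components = select_components(training, sample)
    if not components:
        raise ValueError("no component is detected in both sample and training set")

    sums, counts = {}, {}
    for features, label in training:
        vec = composition(features, components)
        acc = sums.setdefault(label, [0.0] * len(components))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

    target = composition(sample, components)
    label = min(centroids, key=lambda lab: math.dist(target, centroids[lab]))
    return label, components


SAMPLE = {"PFOS": 90.0, "PFHxS": 35.0, "PFOA": 15.0, "PFBA": None}
LABEL, USED = classify(TRAINING, SAMPLE)

if __name__ == "__main__":
    print(LABEL, USED)
```

Because PFBA is a non-detect in the sample and in part of the training data, it is dropped before classification, and the remaining three components carry the decision.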

Benefits

The results of this project show that supervised machine learning has significant promise as a means of distinguishing between PFAS from AFFF and non-AFFF sources. A tool created using the methods from this proof-of-concept project could be of significant value, providing an immediate, quantitative means of assessing the origin of PFAS constituents in a sample. However, additional work is needed to enhance the predictive capabilities and utility of the tool. Areas where additional work is needed include the following:

Creation of an expanded dataset containing additional examples of specific source types.
Selection of methods for rejection or open set recognition to handle cases of PFAS compositions not represented in the training set.
Exploration of advanced training methods.
Use of the expanded dataset to allow multiclass classification.
Simulations to better understand classification in complex environmental settings.
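One simple form of the rejection idea listed above is a distance threshold: a sample is labeled only if it lies close to some training example, and is otherwise reported as unknown. The sketch below assumes fractional compositions as features, a nearest-neighbor rule, and an arbitrary threshold of 0.2. All three are illustrative assumptions, not methods selected by the project, and the data are invented.

```python
import math

# Hypothetical fractional compositions (three components) with source labels.
TRAINING = [
    ([0.65, 0.25, 0.10], "AFFF"),
    ([0.60, 0.30, 0.10], "AFFF"),
    ([0.10, 0.05, 0.85], "non-AFFF"),
    ([0.15, 0.05, 0.80], "non-AFFF"),
]


def classify_with_rejection(training, sample, max_distance=0.2):
    """Return the nearest neighbor's label, or 'unknown' if nothing is close.

    max_distance is an assumed, untuned threshold on Euclidean distance.
    """
    nearest_vec, nearest_label = min(
        training, key=lambda item: math.dist(sample, item[0])
    )
    if math.dist(sample, nearest_vec) > max_distance:
        return "unknown"  # composition not represented in the training set
    return nearest_label


if __name__ == "__main__":
    print(classify_with_rejection(TRAINING, [0.62, 0.28, 0.10]))
    print(classify_with_rejection(TRAINING, [0.34, 0.33, 0.33]))
```

In practice the threshold would need to be chosen from held-out data, and more formal open set recognition methods may behave quite differently from this sketch.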

Follow-up work would be designed to culminate in a user-level tool for PFAS source allocation. Such a tool would provide immediate, low-cost information that could guide the allocation of resources by highlighting samples or areas for more detailed study, potentially providing substantial utility and cost savings to DoD.

Publications

Kibbey, T.C.G., R. Jabrzemski, and D.M. O'Carroll. 2020. Supervised Machine Learning for Source Allocation of Per- and Polyfluoroalkyl Substances (PFAS) in Environmental Samples. Chemosphere, 252:126593. https://doi.org/10.1016/j.chemosphere.2020.126593

Kibbey, T.C.G., R. Jabrzemski, and D.M. O'Carroll. 2021. Predicting the Relationship Between PFAS Component Signatures in Water and Non-Water Phases Through Mathematical Transformation: Application to Machine Learning Classification. Chemosphere, 282:131097. https://doi.org/10.1016/j.chemosphere.2021.131097

Le, S.T., T.C.G. Kibbey, K.P. Weber, W.C. Glamore, and D.M. O'Carroll. 2021. A Group-contribution Model for Predicting the Physicochemical Behavior of PFAS Components for Understanding Environmental Fate. Science of the Total Environment, 764:142882. https://doi.org/10.1016/j.scitotenv.2020.142882

Kibbey, T.C.G., R. Jabrzemski, and D.M. O'Carroll. 2021. Source Allocation of Per- and Polyfluoroalkyl Substances (PFAS) with Supervised Machine Learning: Classification Performance and the Role of Feature Selection in an Expanded Dataset. Chemosphere, 275:130124. https://doi.org/10.1016/j.chemosphere.2021.130124