Machine learning in earth sciences

{{Short description|Applications of machine learning in earth sciences}}

Applications of machine learning (ML) in earth sciences include geological mapping, gas leakage detection and geological feature identification. Machine learning is a subdiscipline of artificial intelligence aimed at developing programs that are able to classify, cluster, identify, and analyze vast and complex data sets without being explicitly programmed to do so.{{Cite book|last1=Mueller|first1=J. P.|last2=Massaron|first2=L.|date=2021|title=Machine learning for dummies|publisher=John Wiley & Sons}} Earth science is the study of the origin, evolution, and future{{Cite book|last=Resources.|first=National Academies Press (U.S.) National Research Council (U.S.). Commission on Geosciences, Environment, and|url=http://worldcat.org/oclc/439353646|title=Basic research opportunities in earth science|date=2001|publisher=National Academies Press|oclc=439353646}} of the Earth. The Earth system can be subdivided into four major components: the solid earth, atmosphere, hydrosphere, and biosphere.{{Cite journal|last=Miall|first=A.D.|date=December 1995|title=The blue planet: An introduction to earth system science|url=http://dx.doi.org/10.1016/0012-8252(95)90023-3|journal=Earth-Science Reviews|volume=39|issue=3–4|pages=269–271|doi=10.1016/0012-8252(95)90023-3|issn=0012-8252|url-access=subscription}}

A variety of algorithms may be applied depending on the nature of the task. Some algorithms may perform significantly better than others for particular objectives. For example, convolutional neural networks (CNNs) are good at interpreting images, whilst more general neural networks may be used for soil classification,{{Cite journal|last1=Bhattacharya|first1=B.|last2=Solomatine|first2=D.P.|date=March 2006|title=Machine learning in soil classification|url=http://dx.doi.org/10.1016/j.neunet.2006.01.005|journal=Neural Networks|volume=19|issue=2|pages=186–195|doi=10.1016/j.neunet.2006.01.005|pmid=16530382|s2cid=14421859|issn=0893-6080|url-access=subscription}} but can be more computationally expensive to train than alternatives such as support vector machines. The range of tasks to which ML (including deep learning) is applied has been ever-growing in recent decades, as has the development of other technologies such as unmanned aerial vehicles (UAVs), ultra-high resolution remote sensing technology, and high-performance computing.{{Cite journal|last1=Si|first1=Lei|last2=Xiong|first2=Xiangxiang|last3=Wang|first3=Zhongbin|last4=Tan|first4=Chao|date=2020-03-14|title=A Deep Convolutional Neural Network Model for Intelligent Discrimination between Coal and Rocks in Coal Mining Face|journal=Mathematical Problems in Engineering|volume=2020|pages=1–12|doi=10.1155/2020/2616510|issn=1024-123X|doi-access=free}} This has led to the availability of large high-quality datasets and more advanced algorithms.

Significance

= Complexity of earth science =

Problems in earth science are often complex. Well-known, well-described mathematical models are difficult to apply to the natural environment, so machine learning is commonly a better alternative for such non-linear problems.{{Cite book|last1=Merembayev|first1=Timur|last2=Yunussov|first2=Rassul|last3=Yedilkhan|first3=Amirgaliyev|title=2018 14th International Conference on Electronics Computer and Computation (ICECCO) |chapter=Machine Learning Algorithms for Classification Geology Data from Well Logging |date=November 2018|chapter-url=http://dx.doi.org/10.1109/icecco.2018.8634775|pages=206–212|publisher=IEEE|doi=10.1109/icecco.2018.8634775|isbn=978-1-7281-0132-3|s2cid=59620103}} Ecological data are commonly non-linear, involve higher-order interactions, and contain missing values, so traditional statistics may underperform when unrealistic assumptions such as linearity are applied to the model.{{Cite journal|last1=De'ath|first1=Glenn|last2=Fabricius|first2=Katharina E.|title=Classification and Regression Trees: A Powerful Yet Simple Technique for Ecological Data Analysis|date=November 2000|url=http://dx.doi.org/10.1890/0012-9658(2000)081[3178:cartap]2.0.co;2|journal=Ecology|volume=81|issue=11|pages=3178–3192|doi=10.1890/0012-9658(2000)081[3178:cartap]2.0.co;2|issn=0012-9658|url-access=subscription}}{{Cite journal|last=Thessen|first=Anne|date=2016-06-27|title=Adoption of Machine Learning Techniques in Ecology and Earth Science|journal=One Ecosystem|volume=1|pages=e8621|doi=10.3897/oneeco.1.e8621|issn=2367-8194|doi-access=free|bibcode=2016OneEc...1E8621T }} A number of researchers have found that machine learning outperforms traditional statistical models in earth science, such as in characterizing forest canopy structure,{{Cite journal|last1=Zhao|first1=Kaiguang|last2=Popescu|first2=Sorin|last3=Meng|first3=Xuelian|last4=Pang|first4=Yong|last5=Agca|first5=Muge|date=August 2011|title=Characterizing forest canopy structure with lidar composite metrics and machine learning|url=http://dx.doi.org/10.1016/j.rse.2011.04.001|journal=Remote Sensing of Environment|volume=115|issue=8|pages=1978–1996|doi=10.1016/j.rse.2011.04.001|bibcode=2011RSEnv.115.1978Z|issn=0034-4257|url-access=subscription}} predicting climate-induced range shifts,{{Cite journal|last1=Lawler|first1=Joshua J.|last2=White|first2=Denis|last3=Neilson|first3=Ronald P.|last4=Blaustein|first4=Andrew R.|date=2006-06-26|title=Predicting climate-induced range shifts: model differences and model reliability|url=http://dx.doi.org/10.1111/j.1365-2486.2006.01191.x|journal=Global Change Biology|volume=12|issue=8|pages=1568–1584|doi=10.1111/j.1365-2486.2006.01191.x|bibcode=2006GCBio..12.1568L|issn=1354-1013|citeseerx=10.1.1.582.9206|s2cid=37416127 }} and delineating geologic facies.{{Cite journal|last=Tartakovsky|first=Daniel M.|date=2004|title=Delineation of geologic facies with statistical learning theory|url=http://dx.doi.org/10.1029/2004gl020864|journal=Geophysical Research Letters|volume=31|issue=18|doi=10.1029/2004gl020864|bibcode=2004GeoRL..3118502T|issn=0094-8276|citeseerx=10.1.1.146.5147|s2cid=16256805 }} Characterizing forest canopy structure enables scientists to study vegetation response to climate change.{{Cite journal|last1=Hurtt|first1=George C.|last2=Dubayah|first2=Ralph|last3=Drake|first3=Jason|last4=Moorcroft|first4=Paul R.|last5=Pacala|first5=Stephen W.|last6=Blair|first6=J. Bryan|last7=Fearon|first7=Matthew G.|title=Beyond Potential Vegetation: Combining Lidar Data and a Height-Structured Model for Carbon Studies|date=June 2004|url=http://dx.doi.org/10.1890/02-5317|journal=Ecological Applications|volume=14|issue=3|pages=873–883|doi=10.1890/02-5317|bibcode=2004EcoAp..14..873H |issn=1051-0761|url-access=subscription}} Predicting climate-induced range shifts enables policy makers to adopt suitable conservation methods to overcome the consequences of climate change.
Delineating geologic facies helps geologists to understand the geology of an area, which is essential for the development and management of an area.{{Cite journal|last=Akpokodje|first=E. G.|date=June 1979|title=The importance of engineering geological mapping in the development of the Niger delta basin|url=http://dx.doi.org/10.1007/bf02600459|journal=Bulletin of the International Association of Engineering Geology|volume=19|issue=1|pages=101–108|doi=10.1007/bf02600459|bibcode= |s2cid=129112606|issn=1435-9529|url-access=subscription}}

= Inaccessible data =

In earth sciences, some data are difficult to access or collect, so it is desirable to use machine learning to infer them from data that are easily available. For example, geological mapping in tropical rainforests is challenging because vegetation cover is thick and rock outcrops are poorly exposed. Applying remote sensing with machine learning approaches provides an alternative way to map rapidly without the need for manual mapping in unreachable areas.

= Reduce time costs =

Machine learning can also reduce the effort required of experts, as manual tasks such as classification and annotation are bottlenecks in earth science research workflows. Geological mapping, especially in a vast, remote area, is labour-, cost- and time-intensive with traditional methods.{{Cite journal|last1=Latifovic|first1=Rasim|last2=Pouliot|first2=Darren|last3=Campbell|first3=Janet|date=2018-02-16|title=Assessment of Convolution Neural Networks for Surficial Geology Mapping in the South Rae Geological Region, Northwest Territories, Canada|journal=Remote Sensing|volume=10|issue=2|pages=307|doi=10.3390/rs10020307|bibcode=2018RemS...10..307L|issn=2072-4292|doi-access=free}} Incorporating remote sensing and machine learning approaches can remove the need for some field mapping.

= Consistent and bias-free =

Consistency and freedom from bias are also advantages of machine learning over manual work by humans. In research comparing the performance of humans and machine learning in the identification of dinoflagellates, machine learning was found to be less prone to systematic bias than humans.{{Cite journal|last1=Culverhouse|first1=PF|last2=Williams|first2=R|last3=Reguera|first3=B|last4=Herry|first4=V|last5=González-Gil|first5=S|date=2003|title=Do experts make mistakes? A comparison of human and machine identification of dinoflagellates|journal=Marine Ecology Progress Series|volume=247|pages=17–25|doi=10.3354/meps247017|bibcode=2003MEPS..247...17C|issn=0171-8630|doi-access=free}} Humans exhibit a recency effect, in which classification is biased towards the most recently recalled classes. In a labelling task in that research, if one kind of dinoflagellate occurred rarely in the samples, expert ecologists commonly did not classify it correctly. Such systematic bias strongly degrades the classification accuracy of humans.

Optimal machine learning algorithm

The extensive use of machine learning in various fields has led to a wide range of learning algorithms being applied. Choosing the optimal algorithm for a specific purpose can lead to a significant boost in accuracy: for example, in the lithological mapping of gold-bearing granite-greenstone rocks in Hutti, India, with AVIRIS-NG hyperspectral data, overall accuracy differed by more than 10% between support vector machines (SVMs) and random forest.

Some algorithms can also reveal hidden important information: white box models are transparent models, the outputs of which can be easily explained, while black box models are the opposite.{{Cite journal|last=Loyola-Gonzalez|first=Octavio|date=2019|title=Black-Box vs. White-Box: Understanding Their Advantages and Weaknesses From a Practical Point of View|journal=IEEE Access|volume=7|pages=154096–154113|doi=10.1109/ACCESS.2019.2949286|s2cid=207831043|issn=2169-3536|doi-access=free|bibcode=2019IEEEA...7o4096L }} For example, although an SVM yielded the best result in landslide susceptibility assessment accuracy, the result cannot be rewritten in the form of expert rules that explain how and why an area was classified as that specific class. In contrast, decision trees are transparent and easily understood, and the user can observe and fix the bias if any is present in such models.
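As a rough illustration of what transparency means in practice, the sketch below writes a small decision tree as explicit rules that return both a class and the rule that produced it. The features (slope, lithology, distance to drainage) and every threshold are hypothetical and chosen only for illustration; they are not taken from the landslide study mentioned above.

```python
def classify_susceptibility(slope_deg, lithology, drainage_dist_m):
    """Toy decision tree; returns (label, rule) so the reasoning stays visible.

    All features and thresholds are hypothetical illustration values.
    """
    if slope_deg < 10:
        return "low", "slope < 10 degrees"
    if lithology == "marl":
        return "high", "slope >= 10 degrees and lithology is marl"
    if drainage_dist_m < 100:
        return "high", "slope >= 10 degrees, not marl, within 100 m of drainage"
    return "moderate", "slope >= 10 degrees, not marl, 100 m or more from drainage"


label, rule = classify_susceptibility(25, "sandstone", 60)
print(label, "because", rule)
```

Because each output carries its rule, a user can inspect a questionable classification and adjust the offending threshold, which is not directly possible with a black box model.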

If computational resources are a concern, more computationally demanding learning methods such as deep neural networks are less preferred, even though they may outperform other algorithms, as in soil classification.

Usage

= Mapping =

== Geological or lithological mapping and mineral prospectivity mapping ==

Geological or lithological mapping produces maps showing geological features and geological units. Mineral prospectivity mapping utilizes a variety of datasets such as geological maps and aeromagnetic imagery to produce maps that are specialized for mineral exploration.{{Cite web |title=Geologic mapping (UB) - AAPG Wiki |url=https://wiki.aapg.org/Geologic_mapping_(UB)#:~:text=Geological%20maps%20give%20informations%20about,had%20occurred%20in%20earlier%20times. |access-date=2024-06-27 |website=wiki.aapg.org}} Geological, lithological, and mineral prospectivity mapping can be carried out by processing data with ML techniques, with the input of spectral imagery obtained from remote sensing and geophysical data.{{Cite journal|last1=Harvey|first1=A. S.|last2=Fotopoulos|first2=G.|title=Geological Mapping Using Machine Learning Algorithms|date=2016-06-23|journal=ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences|volume=XLI-B8|pages=423–430|doi=10.5194/isprsarchives-xli-b8-423-2016|issn=2194-9034|doi-access=free}} Spectral imaging captures wavelength bands across the electromagnetic spectrum, whereas conventional imaging captures only three bands (red, green, and blue).{{Cite journal|last=Mattikalli|first=N|date=January 1997|title=Soil color modeling for the visible and near-infrared bands of Landsat sensors using laboratory spectral measurements|url=http://dx.doi.org/10.1016/s0034-4257(96)00075-2|journal=Remote Sensing of Environment|volume=59|issue=1|pages=14–28|doi=10.1016/s0034-4257(96)00075-2|bibcode=1997RSEnv..59...14M|issn=0034-4257|url-access=subscription}}

Random forests and SVMs are commonly used with remotely sensed geophysical data, while Simple Linear Iterative Clustering-Convolutional Neural Network (SLIC-CNN) and Convolutional Neural Networks (CNNs) are commonly applied to aerial imagery. Large-scale mapping can be carried out with geophysical data from airborne and satellite remote sensing, and smaller-scale mapping can be carried out with images from Unmanned Aerial Vehicles (UAVs) for higher resolution.

Vegetation cover is one of the major obstacles to geological mapping with remote sensing, as reported in several studies of both large-scale and small-scale mapping. Vegetation degrades the quality of spectral images and obscures rock information in aerial images.

{| class="wikitable"
|+Example applications in Geological, Lithological, and Mineral Prospectivity Mapping
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Lithological Mapping of Gold-bearing granite-greenstone rocks{{Cite journal|last1=Kumar|first1=Chandan|last2=Chatterjee|first2=Snehamoy|last3=Oommen|first3=Thomas|last4=Guha|first4=Arindam|date=April 2020|title=Automated lithological mapping by integrating spectral enhancement techniques and machine learning algorithms using AVIRIS-NG hyperspectral data in Gold-bearing granite-greenstone rocks in Hutti, India|journal=International Journal of Applied Earth Observation and Geoinformation|volume=86|pages=102006|doi=10.1016/j.jag.2019.102006|bibcode=2020IJAEO..8602006K|s2cid=210040191|issn=0303-2434|doi-access=free}}
|AVIRIS-NG hyperspectral data
|Hutti, India
|Linear Discriminant Analysis (LDA), Random Forest, Support Vector Machine (SVM)
|Support Vector Machine (SVM) outperforms the other Machine Learning Algorithms (MLAs)
|-
|Lithological Mapping in the Tropical Rainforest{{Cite journal|last1=Costa|first1=Iago|last2=Tavares|first2=Felipe|last3=Oliveira|first3=Junny|date=April 2019|title=Predictive lithological mapping through machine learning methods: a case study in the Cinzento Lineament, Carajás Province, Brazil|journal=Journal of the Geological Survey of Brazil|volume=2|issue=1|pages=26–36|doi=10.29396/jgsb.2019.v2.n1.3|s2cid=134822423|issn=2595-1939|doi-access=free}}
|Magnetic Vector Inversion, Ternary RGB map, Shuttle Radar Topography Mission (SRTM), false color (RGB) of Landsat 8 combining bands 4, 3 and 2
|Cinzento Lineament, Brazil
|Random Forest
|Two predictive maps were generated:<br />(1) The map generated with remote sensing data only has a 52.7% accuracy when compared to the geological map, but several new possible lithological units are identified<br />(2) The map generated with remote sensing data and spatial constraints has a 78.7% accuracy, but no new possible lithological units are identified
|-
|Geological Mapping for mineral exploration{{Cite journal|last1=Radford|first1=D. D.|last2=Cracknell|first2=M. J.|last3=Roach|first3=M. J.|last4=Cumming|first4=G. V.|date=2018|title=Geological mapping in western Tasmania using radar and random forests|journal=IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing|volume=11|issue=9|pages=3075–3087}}
|Airborne polarimetric Terrain Observation with Progressive Scans SAR (TopSAR), geophysical data
|Western Tasmania
|Random Forest
|Low reliability of TopSAR for geological mapping, but accurate with geophysical data.
|-
|Geological and Mineralogical mapping{{citation needed|date=March 2022}}
|Multispectral and hyperspectral satellite data
|Central Jebilet, Morocco
|Support Vector Machine (SVM)
|Classification accuracy with hyperspectral data (93.05%) is slightly higher than with multispectral data (89.24%), showing that machine learning is a reliable tool for mineral exploration.
|-
|Integrating Multigeophysical Data into a Cluster Map{{Cite journal|last1=Wang|first1=Y.|last2=Ksienzyk|first2=A. K.|last3=Liu|first3=M.|last4=Brönner|first4=M.|date=2021|title=Multigeophysical data integration using cluster analysis: assisting geological mapping in Trøndelag, Mid-Norway|journal=Geophysical Journal International|volume=225|issue=2|pages=1142–1157}}
|Airborne magnetic, frequency electromagnetic, radiometric measurements, ground gravity measurements
|Trøndelag, Mid-Norway
|Random Forest
|The cluster map produced has a satisfactory relationship with the existing geological map but with minor misfits.
|-
|High-Resolution Geological Mapping with Unmanned Aerial Vehicle (UAV){{Cite journal|last1=Sang|first1=Xuejia|last2=Xue|first2=Linfu|last3=Ran|first3=Xiangjin|last4=Li|first4=Xiaoshun|last5=Liu|first5=Jiwen|last6=Liu|first6=Zeyu|date=2020-02-05|title=Intelligent High-Resolution Geological Mapping Based on SLIC-CNN|journal=ISPRS International Journal of Geo-Information|volume=9|issue=2|pages=99|doi=10.3390/ijgi9020099|bibcode=2020IJGI....9...99S|issn=2220-9964|doi-access=free}}
|Ultra-resolution RGB images
|Taili waterfront, Liaoning Province, China
|Simple Linear Iterative Clustering-Convolutional Neural Network (SLIC-CNN)
|The result is satisfactory in mapping major geological units but showed poor performance in mapping pegmatites, fine-grained rocks and dykes. UAVs were unable to collect rock information where the rocks were not exposed.
|-
|Surficial Geology Mapping<br />Remote Predictive Mapping (RPM)
|Aerial Photos, Landsat Reflectance, High-Resolution Digital Elevation Data
|South Rae Geological Region, Northwest Territories, Canada
|Convolutional Neural Networks (CNN), Random Forest
|The accuracy of the CNN was 76% in the locally trained area and 68% in an independent test area. The CNN's accuracy was about 4% higher than that of the Random Forest.
|}

File:Landslide susceptibility mapping dataset splitting.png

== Landslide susceptibility and hazard mapping ==

Landslide susceptibility refers to the probability of a landslide occurring at a given geographical location, which depends on local terrain conditions.{{Citation|title=Phillips River landslide hazard mapping project|date=2005-06-30|url=http://dx.doi.org/10.1201/9781439833711-28|work=Landslide Risk Management|pages=457–466|publisher=CRC Press|doi=10.1201/9781439833711-28|isbn=9780429151354|access-date=2021-11-12|url-access=subscription}} Landslide susceptibility mapping can highlight areas prone to landslide risks, which is useful for urban planning and disaster management. Input datasets for ML algorithms usually include topographic information, lithological information, and satellite images, and some may include land use, land cover, drainage information, and vegetation cover{{Cite journal|last1=Dou|first1=Jie|last2=Yamagishi|first2=Hiromitsu|last3=Pourghasemi|first3=Hamid Reza|last4=Yunus|first4=Ali P.|last5=Song|first5=Xuan|last6=Xu|first6=Yueren|last7=Zhu|first7=Zhongfan|date=2015-05-19|title=An integrated artificial neural network model for the landslide susceptibility assessment of Osado Island, Japan|url=http://dx.doi.org/10.1007/s11069-015-1799-2|journal=Natural Hazards|volume=78|issue=3|pages=1749–1776|doi=10.1007/s11069-015-1799-2|bibcode=2015NatHa..78.1749D |s2cid=51960414|issn=0921-030X|url-access=subscription}} according to the study requirements. Training an ML model for landslide susceptibility mapping requires training and testing datasets. There are two methods of allocating data to them: one is to split the study area randomly; the other is to split the whole study area into two adjacent parts, one for each dataset.
To test classification models, the common practice is to split the study area randomly; however, splitting the study area into two adjacent parts is more useful, because it tests whether an automated algorithm can map a new area given expert-processed data from adjacent land.
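The two allocation strategies can be sketched as follows. This is a minimal illustration on a hypothetical 10 × 10 grid of map cells identified by (x, y) coordinates; the function names, the 30% test fraction, and the boundary position are assumptions made for the example, not details from any cited study.

```python
import random


def random_split(cells, test_fraction=0.3, seed=0):
    """Shuffle all cells and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = cells[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def adjacent_split(cells, boundary_x):
    """Train on cells west of a boundary, test on the adjacent area east of it."""
    train = [c for c in cells if c[0] < boundary_x]
    test = [c for c in cells if c[0] >= boundary_x]
    return train, test


# Hypothetical study area: a 10 x 10 grid of cells.
cells = [(x, y) for x in range(10) for y in range(10)]
train_r, test_r = random_split(cells)
train_a, test_a = adjacent_split(cells, boundary_x=7)
print(len(train_r), len(test_r), len(train_a), len(test_a))
```

With the random split, test cells are interleaved with training cells, so neighbouring cells with similar terrain end up in both sets. With the adjacent split, the test set is a contiguous block the model has never seen, which is closer to the task of mapping a new area.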

{| class="wikitable"
|+Example applications in Landslide Susceptibility and Hazard Mapping
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Landslide Susceptibility Assessment{{Cite journal|last1=Marjanović|first1=Miloš|last2=Kovačević|first2=Miloš|last3=Bajat|first3=Branislav|last4=Voženílek|first4=Vít|date=November 2011|title=Landslide susceptibility assessment using SVM machine learning algorithm|url=http://dx.doi.org/10.1016/j.enggeo.2011.09.006|journal=Engineering Geology|volume=123|issue=3|pages=225–234|doi=10.1016/j.enggeo.2011.09.006|bibcode=2011EngGe.123..225M |issn=0013-7952|url-access=subscription}}
|Digital Elevation Model (DEM), Geological Map, 30m Landsat Imagery
|Fruška Gora Mountain, Serbia
|Support Vector Machine (SVM), Decision Trees, Logistic Regression
|Support Vector Machine (SVM) outperforms others
|-
|Landslide Susceptibility Mapping{{Cite journal|last1=Kawabata|first1=Daisaku|last2=Bandibas|first2=Joel|date=December 2009|title=Landslide susceptibility mapping using geological data, a DEM from ASTER images and an Artificial Neural Network (ANN)|url=http://dx.doi.org/10.1016/j.geomorph.2009.06.006|journal=Geomorphology|volume=113|issue=1–2|pages=97–109|doi=10.1016/j.geomorph.2009.06.006|bibcode=2009Geomo.113...97K|issn=0169-555X|url-access=subscription}}
|ASTER satellite-based geomorphic data, geological maps
|Honshu Island, Japan
|Artificial Neural Network (ANN)
|Accuracy greater than 90% for determining the probability of landslide.
|-
|Landslide Susceptibility Zonation through ratings{{Cite journal|last1=Chauhan|first1=S.|last2=Sharma|first2=M.|last3=Arora|first3=M. K.|last4=Gupta|first4=N. K.|date=2010|title=Landslide susceptibility zonation through ratings derived from artificial neural network|journal=International Journal of Applied Earth Observation and Geoinformation|volume=12|issue=5|pages=340–350}}
|Spatial data layers with slope, aspect, relative relief, lithology, structural features, land use, land cover, drainage density
|Parts of Chamoli and Rudraprayag districts of the State of Uttarakhand, India
|Artificial Neural Network (ANN)
|The AUC of this approach reaches 0.88. This approach generated an accurate assessment of landslide risks.
|-
|Regional Landslide Hazard Analysis{{Cite journal|last1=Biswajeet|first1=Pradhan|last2=Saro|first2=Lee|date=November 2007|title=Utilization of Optical Remote Sensing Data and GIS Tools for Regional Landslide Hazard Analysis Using an Artificial Neural Network Model|url=http://dx.doi.org/10.1016/s1872-5791(08)60008-1|journal=Earth Science Frontiers|volume=14|issue=6|pages=143–151|doi=10.1016/s1872-5791(08)60008-1|bibcode=2007ESF....14..143B|issn=1872-5791|url-access=subscription}}
|Topographic slope, aspect, and curvature; distance from drainage, lithology, distance from lineament, land cover from TM satellite images, vegetation index (NDVI), precipitation data
|Eastern Selangor state, Malaysia
|Artificial Neural Network (ANN)
|The approach achieved 82.92% accuracy of prediction.
|}
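One of the performance figures above is an AUC (area under the receiver operating characteristic curve). As a hedged sketch of what that number measures, the function below computes AUC as the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell, counting ties as half. The scores and labels are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical susceptibility scores; label 1 marks a mapped landslide cell.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(scores, labels), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of landslide and non-landslide cells.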

= Feature identification and detection =

File:Data Augmentation of rock images revised.jpg

== Discontinuity analyses ==

Discontinuities such as fault planes and bedding planes have important implications in civil engineering.{{Cite journal|date=December 1978|title=International society for rock mechanics commission on standardization of laboratory and field tests|url=http://dx.doi.org/10.1016/0148-9062(78)91472-9|journal=International Journal of Rock Mechanics and Mining Sciences & Geomechanics Abstracts|volume=15|issue=6|pages=319–368|doi=10.1016/0148-9062(78)91472-9|issn=0148-9062|url-access=subscription}} Rock fractures can be recognized automatically by machine learning through photogrammetric analysis, even in the presence of interfering objects such as vegetation.{{Cite journal|last1=Byun|first1=Hoon|last2=Kim|first2=Jineon|last3=Yoon|first3=Dongyoung|last4=Kang|first4=Il-Seok|last5=Song|first5=Jae-Joon|date=2021-07-08|title=A deep convolutional neural network for rock fracture image segmentation|url=http://dx.doi.org/10.1007/s12145-021-00650-1|journal=Earth Science Informatics|volume=14|issue=4|pages=1937–1951|doi=10.1007/s12145-021-00650-1|bibcode=2021EScIn..14.1937B|s2cid=235762914|issn=1865-0473|url-access=subscription}} In ML training for classifying images, data augmentation is a common practice to avoid overfitting and to increase the size and variability of the training dataset. For example, in a study of rock fracture recognition, 68 images for training and 23 images for testing were prepared via random splitting. Data augmentation by flipping and random cropping then increased the training dataset to 8704 images. The approach was able to recognize rock fractures accurately in most cases. Both the negative predictive value (NPV) and the specificity were over 0.99. This demonstrated the robustness of discontinuity analyses with machine learning.
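Flipping and random cropping can be sketched in a few lines. The example below is an illustration only: it treats an image as a nested list of pixel values, and the 8 × 8 image, the 4 × 4 crop size, and the 50% flip probabilities are assumptions, since the exact recipe of the study is not given here. The figure of 128 copies is simply 8704 ÷ 68, the average number of augmented images per original implied by the numbers above.

```python
import random


def flip_horizontal(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows (up-down flip)."""
    return [row[:] for row in img[::-1]]


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    height, width = len(img), len(img[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive at both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def augment(img, n_copies, crop_h, crop_w, seed=0):
    """Produce n_copies randomly flipped and cropped variants of one image."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_copies):
        candidate = img
        if rng.random() < 0.5:
            candidate = flip_horizontal(candidate)
        if rng.random() < 0.5:
            candidate = flip_vertical(candidate)
        variants.append(random_crop(candidate, crop_h, crop_w, rng))
    return variants


# Hypothetical 8 x 8 single-channel image with distinct pixel values.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = augment(image, n_copies=128, crop_h=4, crop_w=4)
print(len(patches), len(patches[0]), len(patches[0][0]))
```

In a segmentation task the same flip and crop would also have to be applied to the label mask of each image so that pixels and labels stay aligned.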

{| class="wikitable"
|+Example applications in Discontinuity Analysis
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Recognition of Rock Fractures
|Rock images collected in field survey
|Gwanak Mountain and Bukhan Mountain, Seoul, Korea and Jeongseon-gun, Gangwon-do, Korea
|Convolutional Neural Network (CNN)
|The approach was able to recognize the rock fractures accurately in most cases. The NPV and the specificity were over 0.99.
|}
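The two metrics reported for fracture recognition can be computed from confusion-matrix counts. The sketch below uses hypothetical pixel counts, not figures from the study: `tn` is the number of true negatives (non-fracture pixels predicted as non-fracture), `fn` the false negatives, and `fp` the false positives.

```python
def negative_predictive_value(tn, fn):
    """Share of predicted negatives that are truly negative."""
    return tn / (tn + fn)


def specificity(tn, fp):
    """Share of actual negatives that are predicted negative."""
    return tn / (tn + fp)


# Hypothetical counts for a fracture / non-fracture pixel classifier.
tn, fn, fp = 9950, 40, 50
print(round(negative_predictive_value(tn, fn), 4))
print(round(specificity(tn, fp), 4))
```

Both metrics describe performance on the negative (non-fracture) class only, so they are usually read alongside measures of the positive class.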

== Carbon dioxide leakage detection ==

Quantifying carbon dioxide leakage from a geological sequestration site has gained increased attention as the public is interested in whether carbon dioxide is stored underground safely and effectively.{{Cite report|last=Repasky|first=Kevin|date=2014-03-31|title=Development and Deployment of a Compact Eye-Safe Scanning Differential absorption Lidar (DIAL) for Spatial Mapping of Carbon Dioxide for Monitoring/Verification/Accounting at Geologic Sequestration Sites|doi=10.2172/1155030|osti=1155030|url=http://dx.doi.org/10.2172/1155030}} Carbon dioxide leakage from a geological sequestration site can be detected indirectly with the aid of remote sensing and an unsupervised clustering algorithm such as Iterative Self-Organizing Data Analysis Technique (ISODATA).{{Cite journal|last1=Bellante|first1=G.J.|last2=Powell|first2=S.L.|last3=Lawrence|first3=R.L.|last4=Repasky|first4=K.S.|last5=Dougher|first5=T.A.O.|date=March 2013|title=Aerial detection of a simulated CO₂ leak from a geologic sequestration site using hyperspectral imagery|url=http://dx.doi.org/10.1016/j.ijggc.2012.11.034|journal=International Journal of Greenhouse Gas Control|volume=13|pages=124–137|doi=10.1016/j.ijggc.2012.11.034|bibcode=2013IJGGC..13..124B |issn=1750-5836|url-access=subscription}} An increase in soil CO₂ concentration causes a stress response in plants by inhibiting plant respiration, as oxygen is displaced by carbon dioxide.{{Cite journal|last1=Bateson|first1=L.|last2=Vellico|first2=M.|last3=Beaubien|first3=S.|last4=Pearce|first4=J.|last5=Annunziatellis|first5=A.|last6=Ciotoli|first6=G.|last7=Coren|first7=F.|last8=Lombardi|first8=S.|last9=Marsh|first9=S.|date=July 2008|title=The application of remote-sensing techniques to monitor CO₂-storage sites for surface leakage: Method development and testing at Latera (Italy) where naturally produced CO₂ is leaking to the atmosphere|url=http://dx.doi.org/10.1016/j.ijggc.2007.12.005|journal=International Journal of Greenhouse Gas Control|volume=2|issue=3|pages=388–400|doi=10.1016/j.ijggc.2007.12.005|bibcode=2008IJGGC...2..388B|issn=1750-5836}} The vegetation stress signal can be detected with the Normalized Difference Red Edge Index (NDRE). The hyperspectral images are processed by the unsupervised algorithm, which clusters pixels with similar plant responses. The hyperspectral information in areas with known CO₂ leakage is extracted so that areas with leakage can be matched with the clustered pixels with spectral anomalies. Although the approach can identify CO₂ leakage efficiently, there are some limitations that require further study. The NDRE may not be accurate for reasons such as higher chlorophyll absorption, variation in vegetation, and shadowing effects; therefore, some stressed pixels can be incorrectly classed as healthy. Seasonality and groundwater table height may also affect the stress response of the vegetation to CO₂.
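The index and the clustering step can be sketched together. NDRE is computed per pixel as (NIR − red edge) / (NIR + red edge). The clustering below is a plain one-dimensional k-means used as a simplified stand-in: ISODATA additionally splits and merges clusters during iteration, which is not implemented here. The reflectance values are hypothetical, and treating the cluster with the lowest mean NDRE as "stressed" is an assumption made for the illustration.

```python
def ndre(nir, red_edge):
    """Normalized Difference Red Edge Index for one pixel."""
    total = nir + red_edge
    return 0.0 if total == 0 else (nir - red_edge) / total


def kmeans_1d(values, k=2, iterations=20):
    """Plain 1-D k-means (simplified stand-in for ISODATA, no split/merge)."""
    ordered = sorted(values)
    # Spread the initial centres across the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels, centres


# Hypothetical (NIR, red-edge) reflectance pairs for six pixels.
pixels = [(0.50, 0.20), (0.52, 0.21), (0.48, 0.19),
          (0.35, 0.28), (0.33, 0.27), (0.51, 0.20)]
index = [ndre(nir, red_edge) for nir, red_edge in pixels]
labels, centres = kmeans_1d(index, k=2)

# Assumption: the cluster with the lowest mean NDRE holds the stressed pixels.
stressed_cluster = centres.index(min(centres))
stressed_pixels = [i for i, lab in enumerate(labels) if lab == stressed_cluster]
print(stressed_pixels)
```

In practice the clusters would then be compared with locations of known leakage, as described above, rather than labelled by index value alone.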

{| class="wikitable"
|+Example applications in Carbon Dioxide Leakage Detection
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Detection of CO₂ leak from a geologic sequestration site
|Aerial hyperspectral imagery
|The Zero Emissions Research and Technology (ZERT), US
|Iterative Self-Organizing Data Analysis Technique (ISODATA) method
|The approach was able to detect areas with CO₂ leaks; however, other factors such as the growing seasons of the vegetation also interfere with the results.
|}

== Quantification of water inflow ==

The rock mass rating (RMR){{Citation|last=Bieniawski|first=Z. T.|title=The Rock Mass Rating (RMR) System (Geomechanics Classification) in Engineering Practice|url=http://dx.doi.org/10.1520/stp48461s|work=Rock Classification Systems for Engineering Purposes|year=1988|pages=17–17–18|place=West Conshohocken, PA|publisher=ASTM International|doi=10.1520/stp48461s|isbn=978-0-8031-6663-9|access-date=2021-11-12|url-access=subscription}} system is a widely adopted geomechanical rock mass classification system with six input parameters. The amount of water inflow is one of the inputs of the classification scheme, representing the groundwater condition. Quantification of the water inflow in the faces of a rock tunnel was traditionally carried out by visual observation in the field, which is labour-intensive, time-consuming, and fraught with safety concerns. Machine learning can determine water inflow by analyzing images taken on the construction site.{{Cite journal|last1=Chen|first1=Jiayao|last2=Zhou|first2=Mingliang|last3=Zhang|first3=Dongming|last4=Huang|first4=Hongwei|last5=Zhang|first5=Fengshou|date=March 2021|title=Quantification of water inflow in rock tunnel faces via convolutional neural network approach|url=http://dx.doi.org/10.1016/j.autcon.2020.103526|journal=Automation in Construction|volume=123|pages=103526|doi=10.1016/j.autcon.2020.103526|s2cid=233849934|issn=0926-5805|url-access=subscription}} The classification used in the approach mostly follows the RMR system, but combines the damp and wet states, as they are difficult to distinguish by visual inspection alone. The images were classified into the non-damaged state, wet state, dripping state, flowing state, and gushing state. The accuracy of classifying the images was approximately 90%.

{| class="wikitable"
|+Example applications in Quantification of Water Inflow
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Quantification of water inflow in rock tunnel faces
|Images of water inflow
| -
|Convolutional Neural Network (CNN)
|The approach achieved a mean accuracy of 93.01%.
|}

= Classification =

== Soil classification ==

The most popular cost-effective method of soil investigation is cone penetration testing (CPT).{{Cite book|author=Coerts, Alfred|url=http://worldcat.org/oclc/37725852|title=Analysis of static cone penetration test data for subsurface modelling : a methodology|date=1996|publisher=Koninklijk Nederlands Aardrijkskundig Genootschap/Faculteit Ruimtelijke Wetenschappen Universiteit Utrecht|isbn=90-6809-230-8|oclc=37725852}} The test is carried out by pushing a metallic cone through the soil: the force required to push at a constant rate is recorded as a quasi-continuous log. Machine learning can classify soil from CPT data. Classifying with ML involves two tasks, namely segmentation and classification. Segmentation can be carried out with the Constraint Clustering and Classification (CONCC) algorithm, which splits a single data series into segments. Classification can then be carried out by algorithms such as decision trees, SVMs, or neural networks.
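The two-stage workflow of segmentation followed by classification can be illustrated with a toy example. The segmentation below is a simple greedy rule, not the CONCC algorithm, and the classifier is a single-threshold rule standing in for a trained decision tree, SVM, or neural network. The readings, the tolerance, the threshold, and the labels "soft" and "stiff" are all hypothetical.

```python
def segment_log(values, tolerance):
    """Greedy segmentation of a 1-D log into (start, end) index ranges.

    A new segment starts when a reading departs from the running mean of
    the current segment by more than `tolerance`. Not the CONCC algorithm.
    """
    if not values:
        return []
    segments = []
    start, total = 0, 0.0
    for i, value in enumerate(values):
        count = i - start
        if count > 0 and abs(value - total / count) > tolerance:
            segments.append((start, i))
            start, total = i, 0.0
        total += value
    segments.append((start, len(values)))
    return segments


def classify_segment(mean_resistance, threshold=2.0):
    """Hypothetical one-split rule standing in for a trained classifier."""
    return "soft" if mean_resistance < threshold else "stiff"


# Hypothetical cone resistance readings along depth (arbitrary units).
cone_resistance = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 1.0, 1.2]
segments = segment_log(cone_resistance, tolerance=1.5)
labels = []
for start, end in segments:
    mean = sum(cone_resistance[start:end]) / (end - start)
    labels.append(classify_segment(mean))
print(list(zip(segments, labels)))
```

Separating the two stages means the classifier sees one feature vector per soil layer rather than one per reading, which is the motivation for segmenting first.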

{| class="wikitable"
|+Example applications in Soil Classification
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Soil classification
|Cone Penetration Test (CPT) logs
| -
|Decision Trees, Artificial Neural Network (ANN), Support Vector Machine
|The Artificial Neural Network (ANN) outperformed the others in classifying humus clay and peat, while decision trees outperformed the others in classifying clayey peat. SVMs gave the poorest performance among the three.
|}
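The two-step workflow of segmentation followed by classification can be sketched as below. This is not the CONCC algorithm: the segmentation is a simple running-mean change detector, the classifier is a hand-written one-feature rule, and the tolerance, thresholds, class labels and readings are all hypothetical values in arbitrary units.

```python
def segment_log(values, tolerance):
    """Split a 1-D log into runs of similar readings.

    A new segment starts when a reading differs from the running mean of
    the current segment by more than `tolerance`. Returns (start, end)
    index pairs with `end` exclusive.
    """
    segments = []
    start = 0
    total = values[0]
    for i in range(1, len(values)):
        mean = total / (i - start)
        if abs(values[i] - mean) > tolerance:
            segments.append((start, i))
            start = i
            total = values[i]
        else:
            total += values[i]
    segments.append((start, len(values)))
    return segments


def classify_segment(mean_resistance):
    """Toy one-feature decision rule; thresholds and labels are hypothetical."""
    if mean_resistance < 1.0:
        return "soft"
    if mean_resistance < 5.0:
        return "medium"
    return "stiff"


def classify_log(values, tolerance=1.0):
    """Segment the log, then label each segment by its mean reading."""
    results = []
    for start, end in segment_log(values, tolerance):
        mean = sum(values[start:end]) / (end - start)
        results.append((start, end, classify_segment(mean)))
    return results


# Made-up cone resistance readings with three visibly different layers.
cone_resistance = [0.4, 0.5, 0.6, 3.0, 3.2, 2.9, 8.0, 8.5, 8.1]
layers = classify_log(cone_resistance)
print(layers)
```

In practice the second step would be a trained model (a decision tree, SVM or neural network) operating on several features per segment rather than a fixed rule on a single mean.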

== Geological structure classification ==

File:Geological feature recognition.png

Exposed geological structures such as anticlines, ripple marks, and xenoliths can be identified automatically with deep learning models. Research has demonstrated that three-layer CNNs and transfer learning achieve high accuracy (about 80% and 90% respectively), while other methods such as k-nearest neighbors (k-NN), regular neural networks, and extreme gradient boosting (XGBoost) achieve low accuracy (ranging from 10% to 30%). Both grayscale and colour images were tested, and the difference in accuracy was small, implying that colour is not very important in identifying geological structures.{{Cite journal|last1=Zhang|first1=Ye|last2=Wang|first2=Gang|last3=Li|first3=Mingchao|last4=Han|first4=Shuai|date=2018-12-04|title=Automated Classification Analysis of Geological Structures Based on Images Data and Deep Learning Model|journal=Applied Sciences|volume=8|issue=12|pages=2493|doi=10.3390/app8122493|issn=2076-3417|doi-access=free}}

{| class="wikitable"
|+Example applications in Geological Structure Classification
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Geological structures classification
|Images of geological structures
| -
|k-nearest neighbors (k-NN), Artificial Neural Network (ANN), Extreme Gradient Boosting (XGBoost), three-layer Convolutional Neural Network (CNN), transfer learning
|Three-layer Convolutional Neural Network (CNN) and Transfer Learning reached accuracies of about 80% and 90% respectively, while others were low (10% to 30%).
|}
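The comparison between colour and grayscale inputs relies on discarding colour information before training. A minimal sketch of one common conversion is shown below, using the ITU-R BT.601 luma weights; whether the cited study used these particular weights is an assumption, and the tiny image is made up.

```python
def to_grayscale(rgb_image):
    """Convert a nested list of (R, G, B) pixels to luminance values.

    Uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114), so the
    three colour channels collapse into a single brightness channel.
    """
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


# A 2x2 image: white, black, pure red, pure blue.
tiny = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(tiny)
print(gray)
```

After conversion, a model sees only brightness patterns such as edges, folds and textures, which is consistent with the observation that accuracy changed little without colour.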

= Forecast and predictions =

== Earthquake early warning systems and forecasting ==

Earthquake early warning systems are often vulnerable to local impulsive noise, which can trigger false alerts. False alerts can be reduced by discriminating earthquake waveforms from noise signals with the aid of ML methods. One such method consists of two parts: first, unsupervised learning with a generative adversarial network (GAN) to learn and extract features of first-arrival P-waves, and second, a random forest that discriminates P-waves from noise. This approach achieved 99.2% accuracy in recognizing P-waves and avoided false triggers by noise signals with 98.4% accuracy.{{Cite journal|last1=Li|first1=Zefeng|last2=Meier|first2=Men-Andrin|last3=Hauksson|first3=Egill|last4=Zhan|first4=Zhongwen|last5=Andrews|first5=Jennifer|date=2018-05-28|title=Machine Learning Seismic Wave Discrimination: Application to Earthquake Early Warning|journal=Geophysical Research Letters|volume=45|issue=10|pages=4773–4779|doi=10.1029/2018gl077870|bibcode=2018GeoRL..45.4773L|s2cid=54926314|issn=0094-8276|doi-access=free}}
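The two-part structure of the method described above (feature extraction, then classification) can be illustrated with a deliberately simplified sketch. Both stages are stand-ins: two hand-crafted features replace the GAN-learned features, and a fixed threshold rule replaces the trained random forest. The feature definitions, thresholds and waveforms are hypothetical.

```python
def extract_features(waveform):
    """Stage 1 stand-in: two hand-crafted features of a waveform window.

    Returns (peak_to_mean, active_fraction):
      peak_to_mean    - peak absolute amplitude / mean absolute amplitude
      active_fraction - share of samples exceeding half the peak amplitude
    """
    magnitudes = [abs(x) for x in waveform]
    peak = max(magnitudes)
    if peak == 0:
        return 0.0, 0.0  # flat window: no signal at all
    mean = sum(magnitudes) / len(magnitudes)
    active = sum(1 for m in magnitudes if m > 0.5 * peak)
    return peak / mean, active / len(magnitudes)


def is_p_wave(features, max_peak_to_mean=5.0, min_active_fraction=0.2):
    """Stage 2 stand-in: a fixed rule instead of a trained random forest."""
    peak_to_mean, active_fraction = features
    return (peak_to_mean <= max_peak_to_mean
            and active_fraction >= min_active_fraction)


# Made-up windows: a single-sample impulse and a sustained oscillation.
spike = [0.0] * 19 + [10.0]
sustained = [1.0, -1.2, 0.9, -1.1, 1.0, -0.8, 1.1, -1.0, 0.9, -1.0] * 2

print(is_p_wave(extract_features(spike)))
print(is_p_wave(extract_features(sustained)))
```

Separating the stages means the feature extractor can be trained without labels while the classifier is trained on a smaller labelled set, which is the design choice the GAN plus random forest combination reflects.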

Earthquakes can be produced in a laboratory setting to mimic real-world ones. With the help of machine learning, patterns in acoustic signals that act as precursors to earthquakes can be identified. One study demonstrated prediction of the time remaining before failure using continuous acoustic time series data recorded from a fault. The algorithm applied was a random forest, trained on a set of slip events, which performed strongly in predicting the time to failure. It identified acoustic signals that predict failure, one of which was previously unidentified. Although a laboratory earthquake is not as complex as a natural one, the progress made guides future earthquake prediction work.{{Cite journal|last1=Rouet-Leduc|first1=Bertrand|last2=Hulbert|first2=Claudia|last3=Lubbers|first3=Nicholas|last4=Barros|first4=Kipton|last5=Humphreys|first5=Colin J.|last6=Johnson|first6=Paul A.|date=2017-09-22|title=Machine Learning Predicts Laboratory Earthquakes|url=http://dx.doi.org/10.1002/2017gl074677|journal=Geophysical Research Letters|volume=44|issue=18|pages=9276–9282|doi=10.1002/2017gl074677|arxiv=1702.05774|bibcode=2017GeoRL..44.9276R|s2cid=118842086|issn=0094-8276}}

{| class="wikitable"
|+Example applications in Earthquake Prediction
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Discriminating earthquake waveforms
|Earthquake dataset
|Southern California and Japan
|Generative adversarial network (GAN), random forest
|This approach can recognise P waves with 99.2% accuracy and avoid false triggers by noise signals with 98.4% accuracy.
|-
|Predicting time remaining for next earthquake
|Continuous acoustic time series data
| -
|Random Forest
|The R² value of the prediction reached 0.89, which demonstrated excellent performance.
|}
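The R² value reported for the time-to-failure prediction is the coefficient of determination. A minimal sketch of how it is computed from observed and predicted values follows; the numbers are made up and are not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Squared error of the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Squared error of always predicting the mean of the observations.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up times to failure in seconds, for illustration only.
observed = [8.0, 6.0, 4.0, 2.0, 0.0]
predicted = [7.0, 6.5, 4.0, 2.5, 0.5]
score = r_squared(observed, predicted)
print(round(score, 5))
```

A value of 1 means the predictions match the observations exactly, while a value of 0 means the model does no better than always predicting the mean observed time to failure.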

== Streamflow discharge prediction ==

Real-time streamflow data is integral to decision making (e.g., evacuations, or regulation of reservoir water levels during flooding).{{Cite journal|last=Kirchner|first=James W.|date=March 2006|title=Getting the right answers for the right reasons: Linking measurements, analyses, and models to advance the science of hydrology|url=http://dx.doi.org/10.1029/2005wr004362|journal=Water Resources Research|volume=42|issue=3|doi=10.1029/2005wr004362|bibcode=2006WRR....42.3S04K|s2cid=2089939 |issn=0043-1397|url-access=subscription}} Streamflow can be estimated from data provided by stream gauges, which measure the water level of a river. However, water and debris from flooding may damage stream gauges, resulting in a lack of essential real-time data. Because machine learning can infer missing data, it can predict streamflow from a combination of historical stream gauge data and real-time data.

Streamflow Hydrology Estimate using Machine Learning (SHEM) is a model that can serve this purpose. To verify its accuracy, its predictions were compared with the actual recorded data, and the accuracy was found to range from 0.78 to 0.99.

{| class="wikitable"
|+Example applications in Streamflow Discharge Prediction
!Objective
!Input dataset
!Location
!Machine Learning Algorithms (MLAs)
!Performance
|-
|Streamflow Estimate with data missing{{Cite journal|last1=Petty|first1=T.R.|last2=Dhingra|first2=P.|date=2017-08-08|title=Streamflow Hydrology Estimate Using Machine Learning (SHEM)|url=http://dx.doi.org/10.1111/1752-1688.12555|journal=JAWRA Journal of the American Water Resources Association|volume=54|issue=1|pages=55–68|doi=10.1111/1752-1688.12555|s2cid=135100027 |issn=1093-474X|url-access=subscription}}
|Streamgage data from NWIS-Web
|Four diverse watersheds in Idaho, US and Washington, US
|Random Forests
|The estimates correlated well to the historical data of the discharges. The accuracy ranges from 0.78 to 0.99.
|}
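The idea of inferring readings from a damaged gauge can be sketched as follows. This is not the SHEM model: an ordinary least squares line stands in for the random forest, and the sketch assumes that a neighbouring gauge (a hypothetical second series) keeps reporting while the target gauge has gaps. All values are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_gaps(target, neighbour):
    """Replace None entries in `target` using the fitted relationship.

    The fit uses only the time steps where the target gauge reported.
    """
    pairs = [(n, t) for n, t in zip(neighbour, target) if t is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [
        t if t is not None else slope * n + intercept
        for n, t in zip(neighbour, target)
    ]


# Made-up discharge series; the target follows 2 * neighbour + 1.
neighbour = [10.0, 20.0, 30.0, 40.0, 50.0]
target = [21.0, 41.0, None, 81.0, 101.0]
filled = fill_gaps(target, neighbour)
print(filled)
```

A random forest plays the same role as the fitted line here but can capture non-linear relationships and combine several input gauges.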

Challenge

= Inadequate training data =

An adequate amount of training and validation data is required for machine learning. However, some very useful data products, such as satellite remote sensing data, extend back only to the 1970s. For analyses of yearly data, fewer than 50 samples are therefore available.{{Cite journal|last1=Karpatne|first1=Anuj|last2=Ebert-Uphoff|first2=Imme|last3=Ravela|first3=Sai|last4=Babaie|first4=Hassan Ali|last5=Kumar|first5=Vipin|date=2019-08-01|title=Machine Learning for the Geosciences: Challenges and Opportunities|url=http://dx.doi.org/10.1109/tkde.2018.2861006|journal=IEEE Transactions on Knowledge and Data Engineering|volume=31|issue=8|pages=1544–1554|doi=10.1109/tkde.2018.2861006|arxiv=1711.04708|s2cid=42476116|issn=1041-4347}} This amount of data may not be adequate. In a study of automatic classification of geological structures, the small training dataset was a weakness of the model, even though data augmentation was used to increase its size. Another study, on streamflow prediction, found that accuracy depended on the availability of sufficient historical data; the amount of training data therefore strongly affects the performance of machine learning. Inadequate training data may lead to a problem called overfitting. Overfitting causes inaccuracies in machine learning{{Cite journal|last1=Farrar|first1=Donald E.|last2=Glauber|first2=Robert R.|date=February 1967|title=Multicollinearity in Regression Analysis: The Problem Revisited|url=http://dx.doi.org/10.2307/1937887|journal=The Review of Economics and Statistics|volume=49|issue=1|pages=92|doi=10.2307/1937887|jstor=1937887|hdl=1721.1/48530|issn=0034-6535|hdl-access=free}} as the model learns the noise and undesired details of the training data.
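Overfitting on a small dataset can be demonstrated with a minimal sketch. The model below simply memorises its ten training points (a one-nearest-neighbour predictor), so it reproduces their noise perfectly but generalises worse to new samples. The synthetic data, noise level and sample sizes are arbitrary choices for illustration.

```python
import random


def make_data(n, rng, noise=1.0):
    """Samples of y = 2x plus Gaussian noise, with x drawn from [0, 10)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, noise)) for x in xs]


def nearest_neighbour_predict(train, x):
    """Memorising model: return the y of the closest training x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def mean_squared_error(train, data):
    """Mean squared error of the memorising model on `data`."""
    errors = [(nearest_neighbour_predict(train, x) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(10, rng)
test = make_data(200, rng)

train_error = mean_squared_error(train, train)
test_error = mean_squared_error(train, test)
print(train_error, test_error)
```

The training error is zero because every training point is its own nearest neighbour, while the error on unseen samples is larger. This gap between training and test error is the usual symptom of overfitting.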

= Limited by data input =

Machine learning cannot carry out some tasks that a human performs easily. For example, in the image-based quantification of water inflow in rock tunnel faces for the rock mass rating (RMR) system, the damp and wet states were not classified separately, because the two cannot be discriminated by visual inspection alone. In some tasks, machine learning may not be able to fully substitute for manual work by a human.

= Black-box operation =

File:Blackbox3D-withGraphs.svg

Many machine learning algorithms, such as artificial neural networks (ANNs), are considered 'black-box' approaches because the relationships by which results are generated in the hidden layers are not clearly known.{{Cite journal|last1=Taghizadeh-Mehrjardi|first1=R.|last2=Nabiollahi|first2=K.|last3=Kerry|first3=R.|date=March 2016|title=Digital mapping of soil organic carbon at multiple depths using different data mining techniques in Baneh region, Iran|url=http://dx.doi.org/10.1016/j.geoderma.2015.12.003|journal=Geoderma|volume=266|pages=98–110|doi=10.1016/j.geoderma.2015.12.003|bibcode=2016Geode.266...98T|issn=0016-7061|url-access=subscription}} 'White-box' approaches such as decision trees can reveal the details of the algorithm to users.{{Cite journal|last1=Delibasic|first1=Boris|last2=Vukicevic|first2=Milan|last3=Jovanovic|first3=Milos|last4=Suknovic|first4=Milija|date=August 2013|title=White-Box or Black-Box Decision Tree Algorithms: Which to Use in Education?|url=http://dx.doi.org/10.1109/te.2012.2217342|journal=IEEE Transactions on Education|volume=56|issue=3|pages=287–291|doi=10.1109/te.2012.2217342|bibcode=2013ITEdu..56..287D|s2cid=11792899|issn=0018-9359|url-access=subscription}} 'Black-box' approaches are therefore unsuitable when the aim is to investigate these relationships. However, 'black-box' algorithms usually perform better.{{Cite journal|last1=Merghadi|first1=Abdelaziz|last2=Yunus|first2=Ali P.|last3=Dou|first3=Jie|last4=Whiteley|first4=Jim|last5=ThaiPham|first5=Binh|last6=Bui|first6=Dieu Tien|last7=Avtar|first7=Ram|last8=Abderrahmane|first8=Boumezbeur|date=August 2020|title=Machine learning methods for landslide susceptibility studies: A comparative overview of algorithm performance|url=http://dx.doi.org/10.1016/j.earscirev.2020.103225|journal=Earth-Science Reviews|volume=207|pages=103225|doi=10.1016/j.earscirev.2020.103225|bibcode=2020ESRv..20703225M|s2cid=225816933|issn=0012-8252|url-access=subscription}}
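What makes a decision tree a 'white-box' model can be shown with a minimal sketch: every prediction comes with the exact sequence of tests that produced it. The tree below is written by hand rather than trained, and its feature names, thresholds and class labels are hypothetical.

```python
# Hypothetical tree: feature names, thresholds and labels are made up.
TREE = {
    "feature": "cone_resistance", "threshold": 2.0,
    "below": {"label": "type A"},
    "above": {
        "feature": "sleeve_friction", "threshold": 0.1,
        "below": {"label": "type B"},
        "above": {"label": "type C"},
    },
}


def explain(tree, sample):
    """Return (label, list of human-readable tests applied to the sample)."""
    path = []
    node = tree
    while "label" not in node:
        name = node["feature"]
        limit = node["threshold"]
        value = sample[name]
        if value <= limit:
            path.append(f"{name} = {value} <= {limit}")
            node = node["below"]
        else:
            path.append(f"{name} = {value} > {limit}")
            node = node["above"]
    return node["label"], path


label, path = explain(TREE, {"cone_resistance": 3.5, "sleeve_friction": 0.25})
print(label)
for step in path:
    print("  because", step)
```

A neural network offers no comparable trace: its output depends on many weighted sums in the hidden layers, none of which corresponds to a single readable rule.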

References
