Water Detection in High-Resolution Satellite Images using the WaterDetect Python package

Mauricio Cordeiro

Dec 20, 20238 min read

Updated: Jan 27, 2024

Enjoy an easy-to-use, unsupervised water detection algorithm for Sentinel 2 and Landsat 8 images that uses a multi-dimensional clustering coupled with a naïve Bayes classifier for improved performance

Figure 1. Example of water mask extracted from a Camargue scene in France. Image by author.

This story is divided into two parts: Methodology and WaterDetect package. In the methodology, the main concepts of the algorithm are given to provide the reader with a better understanding of the package and how to tune it. The second part is a tutorial on the waterdetect package with sample codes to run it.

Methodology

Introduction

The use of deep learning techniques for remote sensing applications has been increasing in recent years. The recently published review paper “Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review-Part I: Evolution and Recent Trends” [1] presents the evolution of Convolutional Neural Networks (CNN) in satellite applications, notably Object Detection and Image Segmentation, where it achieves state-of-the-art results.

One drawback of this approach is the need to train the neural network for every possible condition (water type, atmospheric condition, etc.), and there are few pre-trained models ready to be used anywhere on the globe (DeepWaterMap 2.0 being one of them) [2, 3]. Besides, there are moments when we just want a quick, unsupervised tool to do the task without the complexities of model training. In such cases, I believe there is still room for a more traditional approach, and that’s what is proposed in the waterdetect package.

The Algorithm

To fill this gap, we have recently proposed the methodology that is implemented in the waterdetect package through the following paper: “Automatic Water Detection from Multidimensional Hierarchical Clustering for Sentinel-2 Images and a Comparison with Level 2A Processors” [4].

The method combines multidimensional agglomerative clustering with a machine learning classifier to make the detection effective and fast at the same time. It’s not the objective of this post to enter the details of the algorithm, as it is already described in the paper. However, some knowledge of its rationale is needed to use the package properly.

Detecting water

The main idea of the algorithm is to combine water indexes (NDWI, MNDWI, etc.) with reflectance bands (NIR, SWIR, etc.) into an automated clustering process. Water indexes tend to have high values on water surfaces; however, as shown in the literature, just a single index associated with a threshold suffers from false positives or false negatives, and the optimal threshold value is hard to find, depending on the scene. In multidimensional clustering, we can take advantage of water reflectance properties like the high absorption on SWIR and combine them with indexes for better pixel discrimination. Figure 2 shows an example of water pixels (blue dots) separated through a multidimensional clustering and what the single threshold (red line) would be.

Figure 2. Scatter plots of pixels plotted in B8 x MNDWI dimensions. The image on the left has the ground truth where blue dots are water pixels, and green dots are non-water pixels. The right image has the result of the clustering process. The red line represents a single threshold on MNDWI index. Image by author.

This is important because the algorithm lets the user choose the best combination for the water detection in the desired area in the WaterDetect.ini configuration file. Currently, the supported indexes are:

NDWI — Normalized Water Index
MNDWI — Modified Normalized Water Index
AWEI — Automated Water Extraction Index
MBWI — Multiband Water Index

The combinations that provide a better balance between robustness and accuracy are [MNDWI, NDWI, Mir2] and [NDWI, Mir2], where Mir2 is the second SWIR band present on Sentinel 2 and Landsat 8 images.

The algorithm will look for the best number of clusters (K) by testing different possibilities and deciding on the best according to the Calinsk Harabasz index. The maximum and minimum values for K are also configurable in the WaterDetect.ini file.

To identify among all the clusters, the one that contains the water pixels, the algorithm implements the following methods:

minmir: selects as water the pixels in the cluster with minimum mir value;
maxmndwi: selects as water the pixels in the cluster with maximum mndwi value;
maxndwi: selects as water the pixels in the cluster with maximum ndwi value;
maxmbwi: selects as water the pixels in the cluster the cluster with maximum mbwi value.

Performance

The clustering algorithm is agglomerative because the usual K-means don’t provide good results when the clusters have different sizes. The problem is that a single Sentinel 2 scene has 120 million pixels in full resolution (10m), and the agglomerative clustering has time complexity (O²) and space complexity of (O³), making it unfeasible to process this kind of image.

To overcome this limitation, we subsample the pixels randomly and apply the clustering in this subsample. Afterwards, a machine learning classifier (we chose näive Bayes) is applied to generalize from the subsampled pixels to reconstruct the whole scene. With this solution, the full-resolution Sentinel 2 image can be processed in less than 3 minutes. I have already written a full story about the k-means problem and this up-sampling procedure in “Leveraging the Performance of Agglomerative Clustering for High-Resolution Satellite Images”. Figure 3 has an overview of how the algorithm works.

Figure 3. Overview of the waterdetect algorithm. Image by author.

The WaterDetect package

The algorithm proposed in [4] is available as a Python package called waterdetect. The source code can be found in the following git repository https://github.com/cordmaur/WaterDetect.

Installation

The easiest way to install the waterdetect package is through the command:

pip install waterdetect

Alternatively, you can clone the repository and install it from its root through the following commands:

git clone https://github.com/cordmaur/WaterDetect.git
cd WaterDetect
pip install .

Running from Script

Once installed, a waterdetect entry point is created in the environment's path. The waterdetect can be run from this entry point (refer to the git repository for more information). Typing waterdetect -h displays the help.

usage: waterdetect [-h] [-GC] [-i INPUT] [-o OUT] [-s SHP] [-p PRODUCT]
                   [-c CONFIG]

The waterdetect is a high speed water detection algorithm for satellite
images. It will loop through all images available in the input folder and
write results for every combination specified in the .ini file to the output
folder. It can also run for single images from Python console or Jupyter
notebook. Refer to the onlinedocumentation

optional arguments:
  -h, --help            show this help message and exit
  -GC, --GetConfig      Copy the WaterDetect.ini from the package into the
                        specifieddirectory and skips the processing. Once
                        copied you can edit the .ini file and launch the
                        waterdetect without -c option.
  -i INPUT, --input INPUT
                        The products input folder. Required.
  -o OUT, --out OUT     Output directory. Required.
  -s SHP, --shp SHP     SHP file. Optional.
  -p PRODUCT, --product PRODUCT
                        The product to be processed (S2_THEIA, L8_USGS, S2_L1C
                        or S2_S2COR)
  -c CONFIG, --config CONFIG
                        Configuration .ini file. If not specified
                        WaterDetect.ini from current dir and used as default

To copy the package's default .ini file into the current directory, type:
`waterdetect -GC .` without other arguments and it will copy WaterDetect.ini
into the current directory.

A configuration file specifying the clustering bands, water cluster detection method and other parameters is necessary for the algorithm to run. The command waterdetect -GC will copy the default configuration file to the current directory. You can create other variants of this file and pass them as an argument using the -c option. If it is not explicitly specified, a WaterDetect.ini in the current directory is searched for. The input_folder argument should point to a directory that contains uncompressed images of the same product type (an example of the structure is shown in Figure 4), so the algorithm can loop through all the images and process them at once.

Figure 4. Examples of input and output structures. Image by author.

During running, one folder for each image will be created in the output directory. The final water mask, as well as the clustering results, will be available inside the folder with the name corresponding to the bands used for clustering. If more than one combination of bands is specified in the config file, they all will be processed and saved. In the config, it is also possible to specify pdf_reports=True and plot_graphs=True. With these options, the algorithm will save a PDF file with the results in low resolution and include any graphs that you have specified in the configuration. For the above directory configuration, the commands to run are:

(waterdetect_env) PS D:\> waterdetect -GC
Copying d:\programs\anaconda\envs\waterdetect_env\lib\site-packages\waterdetect\WaterDetect.ini into current dir.
WaterDetect.ini copied into D:\.
(waterdetect_env) PS D:\> waterdetect -i d:\Images\Download\France-MAJA -o d:\Images\out -p S2_THEIA
Loading configuration file WaterDetect.ini
File WaterDetect.ini verified.
Folder d:\Images\Download\France-MAJA verified.
Folder d:\Images\out verified.
Opening image in loader
Retrieving bands for image: d:/Images/Download/France-MAJA/SENTINEL2B_20190224-103835-289_L2A_T31TGK_C_V1-0
The following bands were found:
SENTINEL2B_20190224-103835-289_L2A_T31TGK_C_V1-0_SRE_B11.tif
SENTINEL2B_20190224-103835-289_L2A_T31TGK_C_V1-0_SRE_B12.tif
SENTINEL2B_20190224-103835-289_L2A_T31TGK_C_V1-0_SRE_B2.tif
...
T31TGK_C_V1-0_MTD_ALL.xml verified.
---------------------------
VALUES ANGLE GLINT
[58.89489588415046, 59.00516357439775, 59.125230327623676, 59.1975351244004, 59.27428320701952, 59.35296101146046, 58.94729756082948, 59.43995100981233, 59.260483478237774, 59.45799750367317]
PAS DE GLINT SUR IMAGE d:/Images/Download/France-MAJA/SENTINEL2B_20190224-103835-289_L2A_T31TGK_C_V1-0/SENTINEL2B_20190224-103835-289_L2A_T31TGK_C_V1-0_MTD_ALL.xml

A full Sentinel 2 tile has approximately 120 million pixels and it takes about 3–5 min depending on the processor and operating system. At the end of each image’s processing, a message indicating a probability of sun glint is displayed. An example of the PDF report is shown in Figure 5.

Figure 5. Example of the output pdf report for one image.

Running it from a Jupyter notebook

Instead of calling the waterdetect from script, it is also possible to import the package into a console or Jupyter notebook to launch it. This way, two options are available. There is the run_batch, that is pretty similar to the script call and also the run_single, that runs the algorithm for one image only and returns the result in memory.

When you execute import waterdetect it initializes and declares the most important classes.

DWWaterDetect
DWImageClustering

The DWWaterDetect class is responsible for orchestrating the full chain, since the opening of the satellite images, to the construction of reports. For that, it will use the other modules. The DWWaterDetect allows two main modes:

Batch mode: the algorithm will loop through all the images available in the input folder and save the results to the output folder. Additionally, it can provide more than one result for each image, with different clustering parameters, depending on the configuration.
Single mode: in the single mode, just one image is processed and just one combination of bands (the first) is used to create the products. Additionally, a DWClusteringImage instance is returned with the resultings mask and clustering.

The following code shows an example on how to launch both modes from a Jupyter notebook.

Running without GDAL

If GDAL is not installed in the environment, the waterdetect will display a message and none of the above methods will work as GDAL is needed to open the satellite images. However, if you want to feed the algorithm with your own arrays in the case you use other library (such as rasterio) or the satellite you are using is not supported it is still possible to use the waterdetect package.

In such case, the bands should be passed to the DWImageClustering object as a dictionary. To work with multiple satellites (Sentinel 2 and Landsat at the moment) the waterdetect uses the following convention for the bands names:

Blue: 492nm
Green: 559nm
Red: 665nm
RedEdg1: 703nm
RedEdg2: 739nm
RedEdg3: 780nm
Nir: 833nm
Nir2: 864nm
Mir: 1610nm
Mir2: 2185nm

The minimum required bands will depend on the bands combination that has been chosen and the method for detecting the water cluster. The following code shows an example on using the waterdetect with bands loaded manually using the rasterio.

As you can see, the attributes water_mask and clustering_matrix store numpy arrays with the resulting mask and the intermediate clustering result. The resulting values are:

water_mask: 0= non water; 1= water; 255= no data

clustering_matrix: 1= water; 2…n= other clusters; 255= no data

Conclusion

As seen throughout this story, the waterdetect package is a good solution for extracting water masks from high-resolution satellite images without any prior knowledge about the scene or arbitrary thresholds. Important to notice that for tasks other than water detection, it can still be used as a high-performance clustering algorithm that delivers much better results than the most common K-means that would not be feasible due to memory and time to process.

The package is still under development, so if you find any trouble or have suggestions on how to improve this tutorial feel free to leave a message or open an issue in the github repository (https://github.com/cordmaur/WaterDetect).

References

[1] Hoeser, T.; Kuenzer, C. Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review-Part I: Evolution and Recent Trends. Remote Sensing 2020, 12 (10), 1667. https://doi.org/10.3390/rs12101667 .

[2] Isikdogan, F., Bovik, A. C., & Passalacqua, P. (2017). Surface Water Mapping by Deep Learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(11), 4909–4918. https://doi.org/10.1109/JSTARS.2017.2735443

[3] Isikdogan, L. F., Bovik, A., & Passalacqua, P. (2019). Seeing Through the Clouds With DeepWaterMap. IEEE Geoscience and Remote Sensing Letters, 1–5. https://doi.org/10.1109/LGRS.2019.2953261

[4] Cordeiro, M. C. R.; Martinez, J.-M.; Peña-Luque, S. Automatic Water Detection from Multidimensional Hierarchical Clustering for Sentinel-2 Images and a Comparison with Level 2A Processors. Remote Sensing of Environment 2021, 253, 112209. https://doi.org/10.1016/j.rse.2020.112209.