Python for Geosciences: Spectral Analysis (Step by Step)

Mauricio Cordeiro

Mar 21, 20229 min read

This is the third post in a series that will teach non-programmers how to use Python to handle and analyze geospatial data

Introduction

Hello and welcome back. This is the third story of the series Python for Geosciences, which has the objective to help non-programmers to start using Python for spatial data analysis and to automate it’s geospatial related processes. In the first post of the series (here), we learned how to prepare the Python environment on Windows using the Anaconda package manager and how to open a GeoTiff image from a Jupyter notebook. Next, on the second post (here), we saw the basics of matrix manipulation and we also learned how to created a flexible function to calculate Normalized Difference Indexes (such as NDVI, MNDWI, etc.), given two bands.

For this third post, we will continue to analyze the image’s data, and our goal is to plot some spectral data for different targets. In remote sensing, target is usually referred to what type of coverage is present that we want to select. For example, by the end if today's post we shall be able to display the mean spectral curve of the water pixels, or vegetation. To achieve this, we will first learn:

Array slicing: how to access part of the image;
Boolean Array Indexing: to mask the portions of the image we don't want;
Compute mean (or other arithmetic) along a specific dimension; and
How to plot the spectral data.

Let's go!

Step 1- Understanding array slicing

On the first post we saw how to access single pixel values by passing indexes into square brackets to the array variable, like so: img[3000, 3000]. However, accessing single values is not of much use in satellite images, where we can easily have millions of values do deal with. So, the first concept we have to learn before moving on to the spectral analysis itself is how to crop just a part of the array, something called array slicing. NumPy extends the basic concept of Python’s slice (usually applied to lists, strings or tuples) to N dimensional arrays and we can use it to access a subset of our image. It could be a line, a row, a sub-array or evenly spaced points.

The basic syntax to slice an array involves passing parameters to the array in square brackets, separated by ":". For example, considering a one-dimensional array called v: v[first_index:last_index:step]

It is important to note some points:

The index of the first element is always 0 (this is called zero-based indexing);
The first index is included in the slice, and the last index is excluded. for example, if we have an one-dimensional array with 5 elements and we want to grab 3 elements, we shall use [0:3]. Note that if index 0 is the first element, index 3 is the 4th element but it is excluded from the slice;
For more than one dimension (in our examples we have 2 or more dimensions), we should pass one slice for each dimension, separated by “,” . Ex: img[0:1000, 0:1000];
The indexes can be counted backwards, using for example, -1 (last value), -2, -3;
When a index in the slice is omitted, it means that we want all elements in that “direction”. For example: img[:2000, -3000:] will access the first 2000 lines (starting from 0 to 2000) and the last 3000 rows (from -3000 to the end), that’s represented in Figure 1.

Figure 1: Slice [:2000, -3000:] highlighted. Image by Author.

More information about array indexing and slicing will be provided as we progress, but for any special need you can refer to the NumPy’ s online documentation (here). Now that we know the basics of slicing, let’s get to the practice.

Zoom in to Manaus city

For example, in the first post I displayed a picture of the Manaus city zoomed in, and in natural colors. Let’s reproduce it.

The following code uses the same load_landsat_image that we've defined in the last post to open the image and then grab the city of Manaus. Note from the image rulers (Figure 1) that Manaus city is located approximately between lines 4000 and 5000 and rows 5400 and 6400, so we will use these values for the slicing.

Figure 2: City of Manaus zoomed in. Image by author.

Note that we used a slice in the third dimension (that represents our channels), but this could have been achieved by other syntaxes with the same results, for example (you can check the shape of these results in your Jupyter notebook to compare):

rgb[4000:5000, 5400:6400, 0:3]
rgb[4000:5000, 5400:6400, :3]
rgb[4000:5000, 5400:6400, 0:]
rgb[4000:5000, 5400:6400, :]
rgb[4000:5000, 5400:6400]

A slice can also be used to change the values within an image. The next code, will highlight the Manaus area by increasing it’s brightness (results in Figure 3).

rgb[4000:5000, 5400:6400] = rgb[4000:5000, 5400:6400] * 2
plt.figure(figsize=(7,7))
plt.imshow(rgb)

Figure 3: Manaus area highlighted. Image by author.

Step 2- Masking non-water pixels

Now that we know the basics of slicing, let’s try to mask the non-water pixels. That is not feasible using the slicing method we just saw, because our water pixels are not in contiguous areas. In this case, we will use another concept of indexing value called Boolean Array Indexing. Remember from the last post that we calculated the MNDWI index and then we compared the index value to a constant (0.0) to obtain a Boolean array (Trues and Falses) as a result:

water_mask = mndwi > 0.0
water_mask
result:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

Note: This is just an introductory approach to understand array indexing. The correct way of dealing with masks is to use Masked Arrays provided in NumPy.

So, in NumPy, we can use this resulting Boolean array to “slice” the values we want. If we would like to mask the non-water pixels, we could simply assign value 0 to pixels that are not in the water mask. We can do that easily by inverting the water_mask array using ~water_mask, where the "~" operator is the shorthand for numpy.invert (more info here). We will reload our image including band B5 and use the normalized_difference function to calculate the MNDWI.

Figure 4: Non-water pixels masked out. Image by author.

Broadcasting

One think you might be wondering is that we used a 2-dimensional Boolean (7761, 7601) mask to index a 3-dimensional RGB image (7761, 7601, 3) and it magically… worked! It worked for the same reason our rgb[4000:5000, 5400:6400] * 2 calculation worked. The was applied element-wise to an “increased” array of 2s. This is called broadcasting and the NumPy does that automatically to try to match arrays of different sizes. Figure 5, from the site Software Carpentry, has a good visual example for 2D broadcasting.

Step 3- Overlay the water mask in RGB

If instead of zero, we would like to apply a blue color to water pixels, we could do it individually for each band, like so:

rgb[..., 0][water_mask] = 0.1 # Red
rgb[..., 1][water_mask] = 0.1 # Green
rgb[..., 2][water_mask] = 0.9 # Blue

The ellipsis notation means that we don’t want to slice the first dimensions. In our example, it would be the same as [:, :, 0], but with the advantage that we don’t need to know exactly how many dimensions we have.

The other option, now that we know that the NumPy does a broadcasting for us, it to apply the mask directly to the RGB and assign the value we want. The important thing is to pass a size-3 vector with one value for each R, G and B channel and, voilà.

Figure 6: Water mask applied to the RGB image. Image by author.

Step 4- Simple spectral signature

Now we will learn how to plot spectral curves for some points. First thing is that we need all the available bands, so we will call the load_landsat_image with all the Landsat bands.

bands = ['B1','B2', 'B3', 'B4', 'B5', 'B6', 'B7']
img = load_landsat_image('D:/Images/Landsat/LC08_L2SP_231062_20201026_20201106_02_T1/', bands)

To stack all the bands into a cube we will create a function. For that, we will introduce another concept called list comprehension. It is just a reduced syntax to create a list, based on a for-loop. For example, instead of writing:

def stack_img(img, bands):
  arrays = []                # create an empty list
  for band in bands:         # loop through the bands
    arrays.append(img[band]) # append our array to the arrays list
  return np.stack(arrays, axis=-1)

We could simply write:

def stack_img(img, bands):
  arrays = [img[band] for band in bands]
  return np.stack(arrays, axis=-1)

Our cube has a shape of (7761, 7601, 7), so if we access a single pixel, we will get a size 7 vector. For example:

cube[4000, 4000].shape

code output:
(7,)

We can use these 7 values to plot a the spectral signature using the plt.plot function. The plot function accepts many different syntaxes, but we will use plt.plot(X, Y).

Figure 7: Spectral signature of a point in the image. Image by author.

It is also possible to add more line plots to the chart by calling plt.plot more than once. We can even put plt.plot inside a loop to display the curve for many points. In the example bellow, I have used the NumPy random module to randomly choose N different points in the image to plot their signatures.

Figure 8: Spectral signatures for random points in the Manaus area. Image by author.

Step 5- Spectral signature of water pixels

Ok, but it is not very useful to plot random points. It would be more interesting, for example, to check the spectral signature of a desired target, as the vegetation, or our water pixels. Let’s plot, then, the mean spectral signature of the water pixels, and also 1 standard deviation around the mean to check it’s variability. But there are many water pixels, and spread all over the scene, so how can we do this?

We have already a mask for the water pixels. So, let’s take a look what is the result if we just grab these points from the cube, like so:

cube[water_mask].shape

code output:
(3062430, 7)

Note that instead of an array of (N x M x 7), we now have a 2-dimension array (3062430, 7), with 3,062,430 lines (total amount of water pixels) and 7 columns (our bands). That happened because when we use Boolean Array Indexing, our results come out as a 1-dimension vector, but considering we are applying the same Boolean Array 7 times (remember the broadcasting?) our result is a 1-dimension vector stacked 7 times, thus a 2-dimension array in this case. Each [index, :] values represent the 7 band reflectances for the water pixels. Let’s take a look at the first 5 rows of our water pixels array. Ah, and I forgot to mention, but in order to get the real reflectance value, we should divide the integer values by 10.000.

water_pts = cube[water_mask] / 10000
water_pts[:5]

result:
array([[0.7962, 0.7995, 0.8006, 0.7976, 0.7857, 0.8014, 0.7663],
       [0.7941, 0.7894, 0.7922, 0.7787, 0.7914, 0.7727, 0.753 ],
       [0.8146, 0.808 , 0.8061, 0.7915, 0.7956, 0.7782, 0.7565],
       [0.8052, 0.8007, 0.8017, 0.7945, 0.8007, 0.7591, 0.7458],
       [0.8009, 0.7959, 0.7938, 0.7847, 0.7932, 0.7542, 0.743 ]])

The problem is that we have more than 3 million lines in the array. Let’s get the mean values, then.

water_pts.mean()

code output:
0.8466179462854205

When we try to calculate the mean, the NumPy returns just a single value. The overall mean of all values and this is not exactly what we want. We need the mean values for each band separately. In this case, we have to specify the axis (NumPy nomenclature for dimension) in which we want to apply the function. Considering our samples are identified in the first dimension, we can call mean(axis=0). Let’s do this to get the mean values and the standard deviation of each band for the water pixels.

mean = water_pts.mean(axis=0)
std = water_pts.std(axis=0)
mean.shape, std.shape

code output:
((7,), (7,))

As expected we have two 7 size vectors, representing the mean reflectance of water pixels and it’s standard deviation for each L8 band. Before plotting, we will also create a list with the wavelenghts of the bands. When we use string values for the X axis the matplotlib space the X ticks evenly, and to have the real X distances, we should pass the central wavelenghts for each band. That’s easy as the wavelenghts for each band is provided in the Landsat 8 specification (here).

wls = [0.44, 0.47, 0.56, 0.655, 0.865, 1.61, 2.2]

To finish, we will shade the area using the function plt.fill_between(X, Y1, Y2). For Y1 and Y2, we will pass mean-0.5*std and mean+0.5*std respectively. Let’s take a look at the code snippet:

Figure 9: Mean spectrum of water pixels. Image by author.

Notebook

As always, the full code to run these examples is provided in the Notebook bellow. You can run it and make experimentations by yourself to get used to dealing with array’s dimensions, slicing, etc.

Conclusion

In this post, we’ve learned the basics of array slicing and indexing to access the pixel values and plot the spectrum of a desired group of pixels. The same procedure can be repeated to plot, for example, water and vegetation in the same chart for comparison, like the example in Figure 10.

Figure 10: Comparison of water reflectance and vegetation reflectance. Image by author.

A good exercise would be to try to reproduce this figure (or even include more targets) by using the concepts we’ve just saw. If it becomes too difficult, don’t worry, I left the code to reproduce Figure 10 as an additional section at the end of the Notebook.

In the next story we will learn how to deal with the masks provided by Level-2A images from Landsat and Sentinel 2 satellites (from SEN2COR atmospheric correction). So, stay tuned and see you in the next story!