In social research, it might sometimes be useful to connect IPUMS International geography data with climate data. Linking these data sources makes it possible to explore how social variables might change over time in relation to environmental variables. For example, in their 2020 paper in Global Environmental Change, Mueller et al. estimated the effects that climate-related variables had on migration in Botswana, Kenya, and Zambia between 1989 and 2011 https://doi.org/10.1016/j.gloenvcha.2020.102183. Often, however, even expert social scientists lack experience joining social data to climate data, which is usually large and packaged in unfamiliar ways.
In the following tutorial, we will show you one way to access a reputable source of precipitation data and then process these data to get estimates of average precipitation for an areal unit. For our example, we will be remotely downloading the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) https://www.chc.ucsb.edu/data/chirps, processing it locally, and then arriving at daily estimates of the average precipitation in Bangladesh's 1991-2010 harmonized level one and level two geography units. Harmonized geographical units are units that remain consistent over time, and they can be useful for exploring social change over multiple years. Level one and level two refer to the resolution of the spatial data, where level two data is considered more granular than level one data. For more information and to download harmonized shapefiles, see IPUMS International's Geography and GIS page https://international.ipums.org/international/gis.shtml.
In the following tutorial, a basic understanding of python programming is assumed, but this example should be largely self-contained. Where appropriate, some helpful links will be provided.
Before we can obtain average precipitation estimates for the 1991-2010 harmonized level one and level two Bangladesh geography units, we need to import the requisite libraries and create a few useful functions.
Remember that in Python, in order to import libraries, you need to install the requisite packages. Instructions for installing python libraries prior to importing them can be found on the web. This is one place to start https://packaging.python.org/en/latest/tutorials/installing-packages/
Once you've installed the necessary packages, we can begin importing them. Information on the packages used in this tutorial can be found by searching for the package on the PyPI repository website. https://pypi.org/project/requests/
#import the libraries necessary for running this notebook.
import requests #package to make HTTP/1.1 requests
import xarray #package for working with multi-dimensional arrays (e.g. climate data over space and time)
import pandas #two-dimensional data processing package
import geopandas #package for importing and manipulating geographic data (shapefile polygon data)
import rioxarray #package for using xarray with spatial data
import dask #package for multiprocessing data with xarray
import matplotlib.pyplot as plot #package for visualizing data
import shapely #package for manipulating spatial geometry data
import bottleneck #package for performing mean calculations quickly
from dask.diagnostics import ProgressBar #this will allow us to see our progress when manipulating the large multi-dimensional array containing our precipitation data.
/Users/computer/miniconda3/envs/ipums_py/lib/python3.10/site-packages/geopandas/_compat.py:123: UserWarning: The Shapely GEOS version (3.11.1-CAPI-1.17.1) is incompatible with the GEOS version PyGEOS was compiled with (3.10.4-CAPI-1.16.2). Conversions between both will be slow. warnings.warn( /var/folders/s2/sd6pfh1s3px7tndrnp9cdx940000gn/T/ipykernel_97444/1510659582.py:5: UserWarning: Shapely 2.0 is installed, but because PyGEOS is also installed, GeoPandas will still use PyGEOS by default for now. To force to use and test Shapely 2.0, you have to set the environment variable USE_PYGEOS=0. You can do this before starting the Python process, or in your code before importing geopandas: import os os.environ['USE_PYGEOS'] = '0' import geopandas In a future release, GeoPandas will switch to using Shapely by default. If you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html). import geopandas #package for importing and manipulating geographic data (shapefile polygon data)
Once the packages have been installed and imported, we will begin by writing a few bespoke functions that will make our code easier to follow. We won't use these functions right now, but they will be useful later. In general, it is good practice to write and use functions when possible. This will make your work more reproducible and less prone to error.
#create a function to download the CHIRPS data, this will make downloading multiple sample years easier later on
def get_chirps_data(year, #year of the data you want to download
resolution, #resolution of the data
download_directory = ""): #place to save the data locally (defaults to the home directory)
#create a filename for the specific data of interest
download_filename = f"chirps-v2.0.{year}.days_{resolution}.nc"
#create the path to the file of interest. For this tutorial, we are interested in daily data, and we want to download the data in netcdf format.
#Note that there are other paths to other data formats and resolutions available, and you can modify this section of the script accordingly:
#Just go to the https://data.chc.ucsb.edu/ webpage, find the product and format you'd like to use, and copy that information here.
#Note, however, that depending on the path selected, you may need to modify other parts of this script.
chirps_url = f"https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/netcdf/{resolution}/{download_filename}"
#request the data from UCSB
request_object = requests.get(chirps_url,
allow_redirects = True)
#create a download path. this is where your data will be saved
download_path = f"{download_directory}{download_filename}"
#save the requested data to the download_path file
with open(download_path, 'wb') as request_content:
for chunk in request_object.iter_content(chunk_size = 128):
request_content.write(chunk)
#create a function that calculates the average CHIRPS variable estimate within a defined geography.
#this is the function that will actually open the CHIRPS data, reduce it to the area and time of interest, and then calculate its average value within that area.
#remember, the goal is to calculate an average precipitation value for each unit in Bangladesh's harmonized geography across all time points relevant to that geography.
def get_mean_chirps_variable_by_geography(shp_path, #location of the shapefile whose geographies you want CHIRPS estimates for
nc_path, #location of the netcdf file containing the spatio-temporal CHIRPS data
variable): #name of of the variable of interest
#read in the shapefile containing the geography information (in this case, the IPUMSI harmonized geography file)
ipumsi_shp = geopandas.read_file(shp_path)
#open the chirps netcdf files and set the coordinate reference system
chirps = xarray.open_dataset(nc_path)
chirps_epsg = chirps.rio.write_crs("epsg:4326")
#create an empty list to store the unit-level geography data
dataframes = []
#for each region in the IPUMS Internation geography shapefile
for region in range(0, len(ipumsi_shp)):
#get the name of the administrative unit of interest
admin = ipumsi_shp["ADMIN_NAME"][region]
#get the geometry of the administrative unit of interest
geometry = ipumsi_shp["geometry"][region]
#reformat the geography so it plays nicely with rioxarray
geometry_formatted = [shapely.geometry.mapping(geometry)]
#clip the xarray to the regional geometry
clipped = chirps_epsg.rio.clip(geometry_formatted,
ipumsi_shp.crs,
all_touched = False, #note here we use all_touched = False, this means that only pixels (CHIRPS data) whose center is within the area of interest are included in the clip. If all_touched equals true, then all pixels touching the area of interest are included in the clip.
from_disk = True,
invert = False)
#get a mean of the CHIRPS variable pixel values accross the spatial dimension of the geography (this means you'll have an average estimate for the region in unit of your time dimension)
chirps_variable = clipped.mean(dim = ["latitude", "longitude"])
#convert this multi-dimensional array to a two-dimensional panda array and orgnaize it
chirps_variable_df = chirps_variable.to_pandas()[[variable]].rename(columns = {variable: admin})
#add this region's temporal data to the list of other regions temporal data
dataframes.append(chirps_variable_df)
#concatenate all of these individual region dataframes into a single dataframe
dataframe = pandas.concat(dataframes, axis = 1)
#return the dataframe
return(dataframe)
Now that we've imported the important packages and created the functions we'll need, we can begin processing our data.
The first step in our data pipeline will be to acquire the CHIRPS data. We are going to download daily CHIRPS data in netcdf format. Though not always familiar to social scientists, the netcdf data format is commonly used to store climate information because it can efficiently handle large amounts of data. Luckily, there are python tools designed to handle it. To download our data from CHIRPS, we will use the bespoke function we created above. Remember, this is just one example of acquiring data. This function could be modified to make other http requests for Climate Hazard Group Data (e.g. temperature) by changing the function's "chirps_url" object to a different address.
#Create a list of sample years to download (these are the years for which Bangladesh data is available on IPUMSI)
sample_year_list = [1991, 2001, 2011]
#a simple loop to download the daily chirps-2.0 data for each sample year
#Note: This could take a long time depending on your system and internet connection. Each sample year will have a relatively large file associated with it
for sample_year in sample_year_list:
get_chirps_data(year = sample_year,
resolution = "p05", #we could also use p25
download_directory = "chirps_raw/") #here I am saving these data to an empty directory I created called chirps_raw. To follow along with this tutorial, it is important that the downloaded data be saved in a separate empty directory.
Once our requested data has downloaded to our "chirps_raw" directory (or whatever you chose to call it), we can begin processing it. As mentioned above, the requested dataset is really large (it's a global dataset with more than 1000 estimates of precipitation at each grid location), so our next step will be to reduce the dataset down to only include our values relevant to our country of interest. This will significantly speed up our processing later.
For this tutorial, we are going to reduce the dataset down to the extent of Bangladesh. Remember we are just using Bangladesh as a test case. It is a good candidate for this tutorial because the harmonized level one and level two geography files available through IPUMS International will allow us to explore average precipitation in Bangladesh's harmonized areal units through time, but you could use any country shapefile.
To begin, we need to import our harmonized Bangladesh shapefile. We can obtain this file from the IPUMS International website https://international.ipums.org/international/gis.shtml. Here we'll be using the "spatially-harmonized first-level geography" file for Bangladesh. Simply download this zipped file from IPUMS International, unzip the file, and we'll be ready to work with the harmonized data.
First, let's visualize our shapefile.
#read in the ipumsi harmonized shapefile. My file is located within the unzipped directory I downloaded from the IPUMSI website.
geo1_bd = geopandas.read_file("geo1_bd1991_2011/geo1_bd1991_2011.shp") #this is just the path to your file.
#visualize the harmonized first-level bangladesh geography data
geo1_bd.plot()
<Axes: >
Now that we have a sense of what the Bangladesh harmonized level-one geography looks like, we want to reduce our CHIRPS data down to just it's extent to improve our processing speed. First, we'll open up the global CHIRPS precipitation data we downloaded and assign the appropriate coordinate reference system to the object. Next, we'll make sure the Bangladesh shapefile has the same coordinate reference system (they need to match) and create a "bounding box" for Bangladesh, which is just a set of coordinates whose internal area covers all points within our shapefile, that we can use to reduce the CHIRPS data down to only data relevant to Bangladesh. Finally, we'll send a set of instructions through the "xarray" package to make the data reduction. That is, to "clip" the CHIRPS data to only include the values that were relevant to Bangladesh.
#open the chirps netcdf files. Here we are openning all of the sample years of the raw data and reading it into one large multidimensional array (time and space)
precipitation_xarray = xarray.open_mfdataset("chirps_raw/*.nc", parallel = False)
#In order to manipulated the spatial data, we need to assign a coordinate reference system to the data. The CHIRPS data can be assigned an epsg code of 4326
precipitation_xarray_epsg = precipitation_xarray.rio.write_crs("epsg:4326")
#check to make sure the shapefile has the same coordinate reference system as the CHIRPS data
#here they do, but if they didn't you could use the .to_crs() method https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_crs.html
geo1_bd.crs
<Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World. - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 ensemble - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
#get a bounding box of geo1_bd geometry and format it to work with rioxarray. This will allow us to clip our large global CHIRPS dataset to only the extent of bangladesh and reduce our processing time.
bd_bbox = shapely.geometry.box(*geo1_bd.total_bounds)
bd_bbox_formatted = [shapely.geometry.mapping(bd_bbox)]
#clip the precipitation_xarray to barisal's geometry, including all pixels that touch Bangladesh.
geo1_bd_chirps_xarry = precipitation_xarray_epsg.rio.clip(bd_bbox_formatted,
geo1_bd.crs,
all_touched = True,
from_disk = True,
invert = False)
Note, however, that at this point xarray hasn't actually clipped our global dataset. Instead, we've created a set of instructions that we want xarray to carry out on our command.
The next step will be to ask xarray to perform these instructions and save our new clipped CHIRPS dataset (the much smaller dataset that only includes data relevant to bangladesh). We can save these data anywhere we'd like, but for the purposes of this tutorial, don't save these data to the same location where we saved the raw chirps global dataset we downloaded earlier. It might also be useful to keep that data separate in case you want to clip a different country from that same global data.
#create an operation to save the bangladesh chirps data. Here I am simply saving a new netcdf file "bd_chirps.nc" that will contain the daily CHIRPS data for only the Bangladesh area for the 1991, 2001, and 2010 years
delayed_obj = geo1_bd_chirps_xarry.to_netcdf("bd_chirps.nc",
compute = False)
#save the bangladesh chirps data and show a progress bar for the clipping and saving process
#note that this process can take a long time
with ProgressBar():
results = delayed_obj.compute()
[########################################] | 100% Completed | 527.57 s
We should now have a new netcdf file (in this case named "bd_chirps.nc") that we can use for calculating the daily average precipitation estimates.
For calculating our daily average precipitation estimates for each 1991-2011 harmonized region in Bangladesh, we can just use one of our bespoke functions: get_mean_chirps_variable_by_geography()
But, it is useful to understand how this process works, so in the following cells, we'll demonstrate the process of calculating these values for one of the 1991-2011 Bangladesh first-level harmonized geography regions.
First, let's open our new Bangladesh-specific CHIRPS data, and visualize one day of precipitation across our harmonized Bangladesh level one geography.
#open the chirps netcdf files
bd_chirps = xarray.open_dataset("bd_chirps.nc")
#again we need to add a coordinate reference system to our CHIRPS data in order to work with it spatially
bd_chirps_epsg = bd_chirps.rio.write_crs("epsg:4326")
#visualize the CHIRPS data on January 1st 1991 across all of Bangladesh.
ax = geo1_bd.boundary.plot(color = "red")
bd_chirps_epsg.sel(time = "1991-01-01")["precip"].plot(ax = ax)
<matplotlib.collections.QuadMesh at 0x12efb7850>
Now, that we've got an idea of what the data we're working with looks like. Let's visualize the process of calculating daily average precipitation for a single harmonized Bangladesh level one unit. For instruction purposes, we'll use the areal unit "Barisal".
First, we'll extract Barisal's geometry and then clip our Bangladesh-specific CHIRPS data to the Barisal extent using two-different clipping approaches.
#separate out Barisal (a region in the 1991-2010 first-level harmonized Bangladesh geometry) to visualize the process
barisal = geo1_bd[geo1_bd["ADMIN_NAME"] == "Barisal"]
barisal.plot()
<Axes: >
#get a bounding box of barisal's geometry
barisal_geometry = barisal["geometry"][0]
barisal_geometry_formatted = [shapely.geometry.mapping(barisal_geometry)]
Next, let's visualize the two different options we have for clipping our CHIRPS data to Barisal's geometry. First we'll use the "all_touched = True" option in the rio.clip method. This will clip our CHIRPS data to include all data for which the CHIRPS data grid cell touches Barisal.
#clip the precipitation_xarray to barisal's geometry, including all pixels that touch barisal
barisal_clipped_all = bd_chirps_epsg.rio.clip(barisal_geometry_formatted,
barisal.crs,
all_touched = True,
from_disk = True,
invert = False)
ax = barisal.boundary.plot(color = "red")
barisal_clipped_all.sel(time = "1991-01-01")["precip"].plot(ax = ax)
<matplotlib.collections.QuadMesh at 0x131a04a90>
Next, we'll use the "all_touched = False" option. This will clip our CHIRPS data to include only those data for which the centroid of the CHIRPS data grid cell touches Barisal.
#clip the precipitation_xarray to barisal's geometry, including all pixels that touch barisal
barisal_clipped = bd_chirps_epsg.rio.clip(barisal_geometry_formatted,
barisal.crs,
all_touched = False,
from_disk = True,
invert = False)
ax = barisal.boundary.plot(color = "red")
barisal_clipped.sel(time = "1991-01-01")["precip"].plot(ax = ax)
<matplotlib.collections.QuadMesh at 0x131afc580>
From the above map overlays, we can see that the data included in these two methods differs. This means that the mean precipitation values we estimate for a region on a given day will differ slightly depending on which method we select. This difference may or may not be meaningful to you depending on your research. In the following cells, and in the function we created above, we used the "all_touched = False" option, meaning we only use CHIRPS data whose grid cell centroid is located within a region. This, though, can be modified depending on your needs by simply modifying above function's "all_touched" option value.
In the above visualizations, we singled out the precipitation on a single day in order to explore our dataset, but we want to get average precipitation values for Barisal each date in our CHIRPS data. We can do this using the mean method on our "xarray" object. We will take the mean of our precipitation values across the spatial dimensions.
#understand xarray's architecture. Note that there are three dimensions (latitude, longitude, and time) and one data variable.
#for more on xarray and how it can be used with netcdf files check out its documentation https://docs.xarray.dev/en/stable/
barisal_clipped
<xarray.Dataset> Dimensions: (latitude: 25, longitude: 23, time: 1095) Coordinates: * latitude (latitude) float32 21.82 21.88 21.92 ... 22.92 22.97 23.02 * longitude (longitude) float32 89.88 89.92 89.98 ... 90.88 90.92 90.98 * time (time) datetime64[ns] 1991-01-01 1991-01-02 ... 2011-12-31 spatial_ref int64 0 Data variables: precip (time, latitude, longitude) float32 nan nan nan ... nan nan nan Attributes: (12/15) Conventions: CF-1.6 title: CHIRPS Version 2.0 history: created by Climate Hazards Group version: Version 2.0 date_created: 2015-11-20 creator_name: Pete Peterson ... ... reference: Funk, C.C., Peterson, P.J., Landsfeld, M.F., Pedreros,... comments: time variable denotes the first day of the given day. acknowledgements: The Climate Hazards Group InfraRed Precipitation with ... ftp_url: ftp://chg-ftpout.geog.ucsb.edu/pub/org/chg/products/CH... website: http://chg.geog.ucsb.edu/data/chirps/index.html faq: http://chg-wiki.geog.ucsb.edu/wiki/CHIRPS_FAQ
#calculate the mean of barisal across the spatial dimensions (latitude, longitude). This will yield an average value across space for each time point (each day in the 1991, 2001, and 2010 years) for Barisal
barisal_precip = barisal_clipped.mean(dim = ["latitude", "longitude"])
#restructure the data to make it into a long-format two-dimensional array
barisal_precip_df = barisal_precip.to_pandas()[["precip"]].rename(columns = {"precip": "barisal"})
barisal_precip_df
barisal | |
---|---|
time | |
1991-01-01 | 1.733025 |
1991-01-02 | 5.439487 |
1991-01-03 | 0.683614 |
1991-01-04 | 0.000000 |
1991-01-05 | 0.000000 |
... | ... |
2011-12-27 | 0.057392 |
2011-12-28 | 0.685558 |
2011-12-29 | 0.000000 |
2011-12-30 | 0.000000 |
2011-12-31 | 0.000000 |
1095 rows × 1 columns
The above cells are useful for visualizing the process of calculating daily average precipitation for a harmonized first-level region, but we want to calculate these values for each of the regions in our harmonized shapefile, and rather than going through this whole process for each region, we can use the function we created at the beginning of this notebook. This function performs the operations we described in the cells above, but loops through each region in the harmonized shapefile, creating a single dataframe containing a daily estimated average precipitation value for each harmonized region.
#run our function on the 1991-2011 harmonized first-level geography data
precip_estimates_level_one = get_mean_chirps_variable_by_geography(shp_path = "geo1_bd1991_2011/geo1_bd1991_2011.shp", #path to our shapefile
nc_path = "bd_chirps.nc", #path to the CHIRPS data we clipped to the Bangladesh extent
variable = "precip") #name of the variable in the CHIRPS data
#We can easily run it on the 1991-2011 harmonized second-level geography data, which can also be obtained from IPUMS International
precip_estimates_level_two = get_mean_chirps_variable_by_geography(shp_path = "geo2_bd1991_2011/geo2_bd1991_2011.shp",
nc_path = "bd_chirps.nc",
variable = "precip")
#Finally, we can save the data out to our local computer
precip_estimates_level_one.to_csv("precip_estimates_level_one.csv")
precip_estimates_level_two.to_csv("precip_estimates_level_two.csv")
We can now integrate these precipitation data with other IPUMS International variables to enhance our social research.
Note that the above is just one example of how to integrate climate data into social research using some of the geography data made available by IPUMS International. This script can be modified to handle other data sources, but often different data sources will require different methods of integration. Other sources of contextual information that might be useful to researchers using IPUMS International data can be found at https://international.ipums.org/international/geog_contextual.shtml