2.1. Pre-processing: Subnational statistics

Subnational statistics on harvested area are a key input for mapspamc. Before they can be used the data need to be put in the right format, made consistent and aggregated (or split) to the crops and production systems that are used by mapspamc. Unfortunately, the level of detail (e.g. number of crops covered at the level 1 and 2 administrative unit) and quality (do lower levels of subnational data add up to higher levels?) of these statistics may differ considerably across countries. Cleaning, harmonizing and preparing the crop statistics is probably the most time consuming and difficult part of running mapspamc.

In many cases the user has to make a value judgment on which statistics to keep and which to discard in order to make them consistent. There is no simple formula or function that makes it possible to (semi) automatically prepare the raw subnational statistics so that they can be used by mapspamc and therefore it requires a substantial amount of user input and ad hoc coding to prepare the data.

We prefer to break down the processing and preparation of the subnational statistics into two steps:

Processing of raw subnational statistics. In this step, the raw statistics, meaning the various data files that were downloaded from the national statistical office or taken from other sources, are reshaped, aggregated and cleaned so that they are (almost) ready to be used by mapspamc.
Checking and calibrating processed subnational statistics.: In this step, the processed statistics are checked for consistency, adjusted where needed and calibrated to the FAOSTAT national statistics.

The scripts to prepare the data for Malawi offer some guidance on how to organize these two steps and can be adapted where needed. They are located in the 02_1_pre_procesing_statistics folder. They also illustrate the use of several functions and tools that we developed to support the process. We recommend to code all the steps that are needed to process the statistics in R so everything is documented and fully reproducible. The tidyr and dplyr packages offer a variety of functions that were specifically designed to clean up and process ‘messy’ data. It is also possible to use Excel or any other software to process the raw statistics as long as the output is saved in the correct format so that they can be used for further processing. Obviously a major disadvantage of using Excel is that it will be very difficult to track changes or quickly make adjustments when new data becomes available.

Support scripts

Apart from the main scripts that clean and check the statistics, which will be discussed below, the template contains two other scripts that support these processes.

01_process_faostat_crops.R
02_process_faostat_crop_prices.R

The first script processes the national FAOSTAT crop statistics. These are used for two purposes. The first is for calibration of the subnational statistics so they are in line with FAOSTAT. As FAOSTAT is the leading source of agricultural data, it is useful to ensure that the subnational data adds up to the FAOSTAT national totals. For many countries the totals will already be the same or at least in the same range as the FAOSTAT data because subnational agricultural statistics are often the basis for the construction of the national statistics, which in turn are used as input by FAOSTAT. Nonetheless, one might also encounter large differences between the two sources of data. The second purpose is to determine the list of crops that is needed to prepare the subnational agricultural statistics. We will discuss this further below. Make sure to run 01_process_faostat_crops.R before starting to prepare the subnational statistics.

The second script extracts the crop prices from FAOSTAT, which are needed to calculate the potential revenue at the grid cell level, which is used to construct the scores/priors for the high-input and irrigated production systems (see Model description for details). As price data is often incomplete at the country level, we decided to use continental averaged prices instead.

Processing of raw subnational statistics

The 03_prepare_subnational_statistics.r can be used to store all code that is needed to process the raw subnational statistics. In the case of Malawi, we took all data from SPAM2010 and therefore it was already in the right format. As no further adjustments were needed, we did not prepare additional code to process the data. Nonetheless, we discuss several key processing steps that most users will need to go through when preparing the subnational statistics for a country case-study.

Three files with subnational statistical information need to be created, which will be combined later:

A file with Harvest area statistics (ha) in hectares for all subnational administrative unit at each level.
A file with Cropping intensity (ci) for each subnational administrative unit at the national level. If the model is solved at the subnational level 1 (by setting solve_level to 1), also cropping intensity data is required at subnational level 1.
A file with production system area shares (ps) (not percentages!) for each subnational administrative unit at the national level (same format as the cropping intensity information).

It is best to start with processing the harvest area statistics and then work on the files with cropping intensity and production system shares. Data for these three files is most likely coming from different sources and can be processed independently. Also cropping intensity and, to a lesser extent, production system shares data, sometimes needs to be adjusted after running the model as it may occur that there is not enough cropland to allocate all the harvested area from the statistics, which suggests that the cropping intensity is too low for some regions.

Putting the subnational statistics in the right format

All crop statistics need to be stored in the wide format. With this we mean that the first four columns of the data table list the administrative unit code (adm_code), name (adm_name), level (adm_level) and in case of production system shares or cropping intensity data, the production system (system), followed by named columns, one for each crop, with harvested area, production system shares or cropping intensity per subnational unit. adm_name and adm_code must be provided up to the most detailed administrative unit for which data is available. Hence, in case level 2 data is available, data should be supplied for level 0, 1 and 2. In case only level 1 data is available, data should be supplied for level 0 and level 1.

create_statistics_template() will create templates for the harvested area, production system and cropping intensity data files - also see the Malawi data for examples. The number and depth of the administrative units will be determined by the model parameters stored in the mapspamc_par object as well as the adm_list file that is created with create_adm_list(). By default, the template will have columns for all 40 mapspamc crops.

The user can save the template as csv file and use it as a basis to prepare the statistics. Note that is is essential that you follow the template closely! Adding or removing administrative units will result in errors as the data no longer matches with the map that contains the location of the administrative units - see below for more information how to deal with missing crop information. In case of Malawi, most of the data was already in the right format so there was no need to use the templates.

# Create template for ha. Replace "ha" by "ps" or "ci" for other templates.
create_statistics_template("ha", param)

For processing it is easier to transform the data into the long format, which is similar to the tidy data format as implemented by the tidyverse packages. In our case this means that the 40 crop columns are squeezed into two columns, one with the crop code and another with the data. The pivot_longer() and pivot_wider() functions in the tidyr package were designed to switch between wide and long tables.

Aggregating to `mapspamc` crop classes

The raw statistics need to be mapped to the 40 crops used by mapspamc. The list of crops can be found in crop.csv, which is automatically copied to the mappings folder after running create_spam_folders(). In practice, almost none of the countries produce all 40 crops. To determine the set of relevant crops, the user can use the list of crops for which FAOSTAT presents data. Running 01_process_faostat_crops.R will save a faostat_crop_list.csv file in the processed_data/lists folder. Note that it is no problem to keep all the 40 crops in the database as long as the values for crops that are not grown in a country are set to NA or -999 (see below) so that they are automatically removed when the data is further processed.

The easiest way to aggregate the raw subnational statistics to the FAOSTAT crop list is to create an orig2crop concordance table, which links the two lists of crops. It sometimes happens that the subnational statistics contain crops that are not present in FAOSTAT. In such cases, the best solution is to add these crops to one of the crops for which FAOSTAT presents data.

The script below illustrates how to aggregate the raw statistics to the mapspamc crops. Note that it is important to use sum with na.rm = FALSE as otherwise NA+NA will result in 0 not NA. This is crucial as missing data in the statistics need to be set to NA for mapspamc in order to process them correctly.

# NB: use sum with na.rm = F as we want NA+NA = NA, not NA+NA = 0!
stat <- stat %>%
  left_join(orig2crop) %>%
  group_by(crop, adm_code, adm_name, adm_level) %>%
  summarize(value_ha = sum(value_ha, na.rm = F)) %>%
  ungroup()

Treatment of missing data

It is unlikely that subnational statistics are available for all crops in a country. As a minimum requirement, mapspamc needs full national level statistics for each crop. This is generally no problem as information on harvested area can be taken from FAOSTAT. production system shares and cropping intensity data are generally more difficult to find and often require making strong assumptions. Missing values in the subnational harvested area statistics are allowed. In such cases the algorithm will automatically change the constraints in order to substitute the missing data with information from lower level administrative units. Full information (harvest area statistics, production system shares and cropping intensity) is only needed at administrative level 1 if the model is run at level 1 (solve_level = 1) and never for administrative level 2.

In case data on harvested area is missing for a certain administrative unit, you can indicate this by using NA or -999. We highly recommend using the latter, particularly if you are using Excel to process the data because it avoids potential errors that may result between an NA and an empty string "" value. These are both displayed as empty cells in Excel but will be treated differently by R after loading the file. Replacing NA values by -999 can be done simultaneously when putting the data in the wide format.

# Put in preferred mapspam format, adding -999 for missing values
stat_mapspam <- stat %>%
  pivot_wider(names_from = crop, value_from = value_ha, values_fill = -999) %>%
  arrange(adm_code, adm_code, adm_level)

Data consistency

Data should always be consistent and add up, meaning that the sum of harvested area for all administrative units and a given crop is the same or smaller (in case of missing information) as the corresponding lower level unit. The script to check the statistics includes a function to reaggregate the data from the bottom up and check for consistency. It is however, essential that the user also checks whether the data is consistent in the data processing step as otherwise a large bias might be introduced resulting in completely erroneous data and model output. For example, suppose there is an error in the raw harvested area statistics for maize in a certain level 2 administrative region. As a result the sum of harvested maize area of all level 2 units is much higher than the maize area of the corresponding level 1 unit, which is a clear inconsistency. The function that reaggregates the statistics will simply take the level 2 total for maize and substitute the level 1 maize data. This error is subsequently carried forward when the maize area for all level 1 units are aggregated to the national total. The final data table will include maize data that is strongly biased upwards.

Check and calibrate statistics

After the raw subnational statistics are cleaned and harmonized, the harvested area, production systems and cropping intensity files need to be checked for consistency and calibrated to the FAOSTAT national statistics. This is done in 04_check_and_calibrate_statistics.R. The first step is to check if the data is consistent, which is done by check_statistics(). If out = TRUE a report is returned which shows where the inconsistencies occur.

# Check the consistency of the ha statistics
check_statistics(mapspamc::ha_df, param, out = TRUE)
#> adm0 and adm1 are not equal!. Did you run reaggregate_statistics()?
#> # A tibble: 29 × 4
#> # Groups:   crop [29]
#>    crop     adm0    adm1 difference
#>    <chr>   <dbl>   <dbl>      <dbl>
#>  1 bana    18342   18208        134
#>  2 bean   281065  281058          7
#>  3 cass   193993  193932         61
#>  4 chic   128936  128933          3
#>  5 coff     3001    3001          0
#>  6 cott    63552   63552          0
#>  7 cowp    64297   64297          0
#>  8 grou   290092  290093         -1
#>  9 lent     1904    1905         -1
#> 10 maiz  1660214 1660044        170
#> # ℹ 19 more rows

To reaggreate the statistics from the bottom up, use reaggregate_statistics().

# Reaggreate from the bottom up.
ha_df <- reaggregate_statistics(mapspamc::ha_df, param)

# Check again.
check_statistics(mapspamc::ha_df, param, out = T)

Next, the statistics are calibrated to the national FAOSTAT statistics. The code below shows how this is implemented. First, the code checks if there are still crops, which are not in the FAOSTAT data and removes them if needed. Next, crops are identified that are not present in the statistics, after which the FAOSTAT national total is added. Finally all the subnational statistics are proportionally scaled so the total is equal to the FAOSTAT figures.

# Identify crops that are present in ha_df but not in fao and remove them from ha_df.
crop_rem <- setdiff(unique(ha_df$crop), unique(fao$crop))
ha_df <- ha_df %>%
  filter(!crop %in% crop_rem)

# Identify crops that are present in fao but not in ha_df.
# We will add then to ha_df
crop_add <- setdiff(unique(fao$crop), unique(ha_df$crop))
ha_df <- ha_df %>%
  bind_rows(
    fao %>%
      filter(crop %in% crop_add) %>%
      mutate(
        fips = unique(ha_df$adm_code[ha_df$adm_level == 0]),
        adm_level = 0,
        adm = unique(ha_df$adm_name[ha_df$adm_level == 0])
      )
  )

# Calculate scaling factor
fao_stat_sf <- bind_rows(
  fao %>%
    mutate(source = "fao"),
  ha_df %>%
    filter(adm_level == 0) %>%
    mutate(source = "ha_df")
) %>%
  dplyr::select(crop, source, ha) %>%
  spread(source, ha) %>%
  mutate(sf = fao / ha_df) %>%
  dplyr::select(crop, sf)

# rescale ha_df
ha_df <- ha_df %>%
  left_join(fao_stat_sf) %>%
  mutate(ha = ha * sf) %>%
  dplyr::select(-sf)

Finally the harvested area, production systems and cropping intensity files are saved in the processed_data/agricultural_statistics folder. It is important to use the correct file names otherwise other functions cannot load the data.

write_csv(ha_df, file.path(param$mapspamc_path, glue("processed_data/agricultural_statistics/ha_adm_{param$year}_{param$iso3c}.csv")))
write_csv(ps_df, file.path(param$mapspamc_path, glue("processed_data/agricultural_statistics/ps_adm_{param$year}_{param$iso3c}.csv")))
write_csv(ci_df, file.path(param$mapspamc_path, glue("processed_data/agricultural_statistics/ci_adm_{param$year}_{param$iso3c}.csv")))

The next step is to process the spatial data.