vignettes/preprocessing_subnational_statistics.Rmd
preprocessing_subnational_statistics.Rmd
Subnational statistics on harvested area are a key input for
mapspamc
. Before they can be used the data need to be put
in the right format, made consistent and aggregated (or split) to the
crops and production systems that are used by mapspamc
.
Unfortunately, the level of detail (e.g. number of crops covered at the
level 1 and 2 administrative unit) and quality (do lower levels of
subnational data add up to higher levels?) of these statistics may
differ considerably across countries. Cleaning, harmonizing and
preparing the crop statistics is probably the most time consuming and
difficult part of running mapspamc
.
In many cases the user has to make a value judgment on which
statistics to keep and which to discard in order to make them
consistent. There is no simple formula or function that makes it
possible to (semi) automatically prepare the raw subnational statistics
so that they can be used by mapspamc
and therefore it
requires a substantial amount of user input and ad hoc coding to prepare
the data.
We prefer to break down the processing and preparation of the subnational statistics into two steps:
mapspamc
.The scripts to prepare the data for Malawi offer some guidance on how
to organize these two steps and can be adapted where needed. They are
located in the 02_1_pre_procesing_statistics
folder. They
also illustrate the use of several functions and tools that we developed
to support the process. We recommend to code all the steps that are
needed to process the statistics in R so everything is documented and
fully reproducible. The tidyr
and dplyr packages offer a
variety of functions that were specifically designed to clean up and
process ‘messy’ data. It is also possible to use Excel or any other
software to process the raw statistics as long as the output is saved in
the correct format so that they can be used for further processing.
Obviously a major disadvantage of using Excel is that it will be very
difficult to track changes or quickly make adjustments when new data
becomes available.
Apart from the main scripts that clean and check the statistics, which will be discussed below, the template contains two other scripts that support these processes.
01_process_faostat_crops.R
02_process_faostat_crop_prices.R
The first script processes the national FAOSTAT crop statistics.
These are used for two purposes. The first is for calibration of the
subnational statistics so they are in line with FAOSTAT. As FAOSTAT is
the leading source of agricultural data, it is useful to ensure that the
subnational data adds up to the FAOSTAT national totals. For many
countries the totals will already be the same or at least in the same
range as the FAOSTAT data because subnational agricultural statistics
are often the basis for the construction of the national statistics,
which in turn are used as input by FAOSTAT. Nonetheless, one might also
encounter large differences between the two sources of data. The second
purpose is to determine the list of crops that is needed to prepare the
subnational agricultural statistics. We will discuss this further below.
Make sure to run 01_process_faostat_crops.R
before starting
to prepare the subnational statistics.
The second script extracts the crop prices from FAOSTAT, which are needed to calculate the potential revenue at the grid cell level, which is used to construct the scores/priors for the high-input and irrigated production systems (see Model description for details). As price data is often incomplete at the country level, we decided to use continental averaged prices instead.
The 03_prepare_subnational_statistics.r
can be used to
store all code that is needed to process the raw subnational statistics.
In the case of Malawi, we took all data from SPAM2010 and therefore it
was already in the right format. As no further adjustments were needed,
we did not prepare additional code to process the data. Nonetheless, we
discuss several key processing steps that most users will need to go
through when preparing the subnational statistics for a country
case-study.
Three files with subnational statistical information need to be created, which will be combined later:
solve_level
to 1), also cropping intensity data is required at subnational level
1.It is best to start with processing the harvest area statistics and then work on the files with cropping intensity and production system shares. Data for these three files is most likely coming from different sources and can be processed independently. Also cropping intensity and, to a lesser extent, production system shares data, sometimes needs to be adjusted after running the model as it may occur that there is not enough cropland to allocate all the harvested area from the statistics, which suggests that the cropping intensity is too low for some regions.
All crop statistics need to be stored in the wide
format. With this we mean that the first four columns of the data table
list the administrative unit code (adm_code
), name
(adm_name
), level (adm_level
) and in case of
production system shares or cropping intensity data, the production
system (system
), followed by named columns, one for each
crop, with harvested area, production system shares or cropping
intensity per subnational unit. adm_name
and
adm_code
must be provided up to the most detailed
administrative unit for which data is available. Hence, in case level 2
data is available, data should be supplied for level 0, 1 and 2. In case
only level 1 data is available, data should be supplied for level 0 and
level 1.
create_statistics_template()
will create templates for
the harvested area, production system and cropping intensity data files
- also see the Malawi data for examples. The number and depth of the
administrative units will be determined by the model parameters stored
in the mapspamc_par
object as well as the
adm_list
file that is created with
create_adm_list()
. By default, the template will have
columns for all 40 mapspamc
crops.
The user can save the template as csv file and use it as a basis to prepare the statistics. Note that is is essential that you follow the template closely! Adding or removing administrative units will result in errors as the data no longer matches with the map that contains the location of the administrative units - see below for more information how to deal with missing crop information. In case of Malawi, most of the data was already in the right format so there was no need to use the templates.
# Create template for ha. Replace "ha" by "ps" or "ci" for other templates.
create_statistics_template("ha", param)
For processing it is easier to transform the data into the
long format, which is similar to the tidy data
format as implemented by the tidyverse packages. In our case this
means that the 40 crop columns are squeezed into two columns, one with
the crop code and another with the data. The pivot_longer()
and pivot_wider()
functions in the tidyr package were designed to
switch between wide and long tables.
mapspamc
crop classes
The raw statistics need to be mapped to the 40 crops used by mapspamc
. The
list of crops can be found in crop.csv
, which is
automatically copied to the mappings
folder after running
create_spam_folders()
. In practice, almost none of the
countries produce all 40 crops. To determine the set of relevant crops,
the user can use the list of crops for which FAOSTAT presents data.
Running 01_process_faostat_crops.R
will save a
faostat_crop_list.csv
file in the
processed_data/lists folder
. Note that it is no problem to
keep all the 40 crops in the database as long as the values for crops
that are not grown in a country are set to NA
or
-999
(see below) so that they are automatically removed
when the data is further processed.
The easiest way to aggregate the raw subnational statistics to the
FAOSTAT crop list is to create an orig2crop
concordance
table, which links the two lists of crops. It sometimes happens that the
subnational statistics contain crops that are not present in FAOSTAT. In
such cases, the best solution is to add these crops to one of the crops
for which FAOSTAT presents data.
The script below illustrates how to aggregate the raw statistics to
the mapspamc
crops. Note that it is important to use sum
with na.rm = FALSE
as otherwise NA+NA
will
result in 0
not NA
. This is crucial as missing
data in the statistics need to be set to NA
for
mapspamc
in order to process them correctly.
# NB: use sum with na.rm = F as we want NA+NA = NA, not NA+NA = 0!
stat <- stat %>%
left_join(orig2crop) %>%
group_by(crop, adm_code, adm_name, adm_level) %>%
summarize(value_ha = sum(value_ha, na.rm = F)) %>%
ungroup()
It is unlikely that subnational statistics are available for all
crops in a country. As a minimum requirement, mapspamc
needs full national level statistics for each crop. This is generally no
problem as information on harvested area can be taken from FAOSTAT.
production system shares and cropping intensity data are generally more
difficult to find and often require making strong assumptions. Missing
values in the subnational harvested area statistics are allowed. In such
cases the algorithm will automatically change the constraints in order
to substitute the missing data with information from lower level
administrative units. Full information (harvest area statistics,
production system shares and cropping intensity) is only needed at
administrative level 1 if the model is run at level 1
(solve_level
= 1) and never for administrative level 2.
In case data on harvested area is missing for a certain
administrative unit, you can indicate this by using NA
or
-999
. We highly recommend using the latter, particularly if
you are using Excel to process the data because it avoids potential
errors that may result between an NA
and an empty string
""
value. These are both displayed as empty cells in Excel
but will be treated differently by R after loading the file. Replacing
NA
values by -999
can be done simultaneously
when putting the data in the wide format.
# Put in preferred mapspam format, adding -999 for missing values
stat_mapspam <- stat %>%
pivot_wider(names_from = crop, value_from = value_ha, values_fill = -999) %>%
arrange(adm_code, adm_code, adm_level)
Data should always be consistent and add up, meaning that the sum of harvested area for all administrative units and a given crop is the same or smaller (in case of missing information) as the corresponding lower level unit. The script to check the statistics includes a function to reaggregate the data from the bottom up and check for consistency. It is however, essential that the user also checks whether the data is consistent in the data processing step as otherwise a large bias might be introduced resulting in completely erroneous data and model output. For example, suppose there is an error in the raw harvested area statistics for maize in a certain level 2 administrative region. As a result the sum of harvested maize area of all level 2 units is much higher than the maize area of the corresponding level 1 unit, which is a clear inconsistency. The function that reaggregates the statistics will simply take the level 2 total for maize and substitute the level 1 maize data. This error is subsequently carried forward when the maize area for all level 1 units are aggregated to the national total. The final data table will include maize data that is strongly biased upwards.
After the raw subnational statistics are cleaned and harmonized, the
harvested area, production systems and cropping intensity files need to
be checked for consistency and calibrated to the FAOSTAT national
statistics. This is done in
04_check_and_calibrate_statistics.R
. The first step is to
check if the data is consistent, which is done by
check_statistics()
. If out = TRUE
a report is
returned which shows where the inconsistencies occur.
# Check the consistency of the ha statistics
check_statistics(mapspamc::ha_df, param, out = TRUE)
#> adm0 and adm1 are not equal!. Did you run reaggregate_statistics()?
#> # A tibble: 29 × 4
#> # Groups: crop [29]
#> crop adm0 adm1 difference
#> <chr> <dbl> <dbl> <dbl>
#> 1 bana 18342 18208 134
#> 2 bean 281065 281058 7
#> 3 cass 193993 193932 61
#> 4 chic 128936 128933 3
#> 5 coff 3001 3001 0
#> 6 cott 63552 63552 0
#> 7 cowp 64297 64297 0
#> 8 grou 290092 290093 -1
#> 9 lent 1904 1905 -1
#> 10 maiz 1660214 1660044 170
#> # ℹ 19 more rows
To reaggreate the statistics from the bottom up, use
reaggregate_statistics()
.
# Reaggreate from the bottom up.
ha_df <- reaggregate_statistics(mapspamc::ha_df, param)
# Check again.
check_statistics(mapspamc::ha_df, param, out = T)
Next, the statistics are calibrated to the national FAOSTAT statistics. The code below shows how this is implemented. First, the code checks if there are still crops, which are not in the FAOSTAT data and removes them if needed. Next, crops are identified that are not present in the statistics, after which the FAOSTAT national total is added. Finally all the subnational statistics are proportionally scaled so the total is equal to the FAOSTAT figures.
# Identify crops that are present in ha_df but not in fao and remove them from ha_df.
crop_rem <- setdiff(unique(ha_df$crop), unique(fao$crop))
ha_df <- ha_df %>%
filter(!crop %in% crop_rem)
# Identify crops that are present in fao but not in ha_df.
# We will add then to ha_df
crop_add <- setdiff(unique(fao$crop), unique(ha_df$crop))
ha_df <- ha_df %>%
bind_rows(
fao %>%
filter(crop %in% crop_add) %>%
mutate(
fips = unique(ha_df$adm_code[ha_df$adm_level == 0]),
adm_level = 0,
adm = unique(ha_df$adm_name[ha_df$adm_level == 0])
)
)
# Calculate scaling factor
fao_stat_sf <- bind_rows(
fao %>%
mutate(source = "fao"),
ha_df %>%
filter(adm_level == 0) %>%
mutate(source = "ha_df")
) %>%
dplyr::select(crop, source, ha) %>%
spread(source, ha) %>%
mutate(sf = fao / ha_df) %>%
dplyr::select(crop, sf)
# rescale ha_df
ha_df <- ha_df %>%
left_join(fao_stat_sf) %>%
mutate(ha = ha * sf) %>%
dplyr::select(-sf)
Finally the harvested area, production systems and cropping intensity
files are saved in the
processed_data/agricultural_statistics
folder. It is
important to use the correct file names otherwise other functions cannot
load the data.
write_csv(ha_df, file.path(param$mapspamc_path, glue("processed_data/agricultural_statistics/ha_adm_{param$year}_{param$iso3c}.csv")))
write_csv(ps_df, file.path(param$mapspamc_path, glue("processed_data/agricultural_statistics/ps_adm_{param$year}_{param$iso3c}.csv")))
write_csv(ci_df, file.path(param$mapspamc_path, glue("processed_data/agricultural_statistics/ci_adm_{param$year}_{param$iso3c}.csv")))
The next step is to process the spatial data.