Ksapa - A Data Science Approach to Process Data Collected from Vulnerable Populations

Note: this article was originally posted on https://agrilinks.org/post/data-science-approach-process-data-collected-vulnerable-populations

This article outlines a data science approach to processing data collected from vulnerable populations, which we employ at Scale Up Training Traceability Impact (SUTTI), an initiative by Ksapa. SUTTI enables the transformation of the first mile of agricultural supply chains. We work with, and for, small farmers in medium- and large-scale programs, cooperating with industrial and financial actors, public authorities, civil society organizations and academics in several countries in Asia and Africa. For that purpose, we have designed our low-tech, inclusive SUTTI digital suite, which enables us to disseminate content and information to smallholders and also to collect data from them to monitor our social, economic and environmental impact.

Collecting and interpreting data in these contexts is complex and requires careful attention to context, data quality and statistical validity. To that end, we gathered a Data Committee composed of professionals and experts in data science to help our team build a scientifically robust approach.

The challenges to solve

These are some key problems we need to consider to get usable data:

The data needs to be collected at the first mile of the supply chain, that is, directly from the farmers. Our SUTTI digital suite enables us to solve this problem.
To claim a positive social and environmental impact, three criteria must classically be met: intentionality, additionality and measurability.
The data collected at the ground level may contain inconsistencies or gaps caused by differences in interpretation, data entry errors, or even respondents answering according to what they believe is expected of them, all of which can degrade data quality.

The data science approach we outline below is designed to meet the last two challenges

We built a three-step process comprising improvement of data robustness, calculation of key performance indicators (KPIs) on robust data, and inference (projection) of impact to the SUTTI Program considered.

As part of this process, we use artificial intelligence-based techniques to automate the clustering of data for outlier detection, drawing on density-based, distance-based and time-series-based methods such as z-score, autoregressive integrated moving average (ARIMA), long short-term memory (LSTM), k-means, density-based spatial clustering of applications with noise (DBSCAN) and Isolation Forests. We also use a linear regression model (called stacked difference-in-difference) for impact inference.

Step 1: Data robustness improvement. This step consists of finding appropriate statistical methods or algorithms to spot outliers, flag unusable or misleading data, and correct the data. This can be done through four methods:

Historical data consistency: Normal data values should fall within a reasonable range, and outliers are those that fall outside it. Verifying historical data consistency is thus mainly based on density or distance, using a three-sigma/z-score rule.
Sample comparison: With this approach, we study the relationship between a target variable and other factors; outliers are the points that do not follow the relationship. For instance, we may observe a negative relationship between rubber yields and farmers' age. Using linear regression, we flag the points that deviate from the fitted relationship as suspected outliers (the dark orange points in the original figure).

External data comparison: Several external data sources are under consideration, including the Food and Agriculture Organization of the United Nations (FAO), the Organisation for Economic Co-operation and Development (OECD), the World Bank and the United Nations Economic and Social Commission for Asia and the Pacific (UNESCAP). For example, depending on the country and location, a farmer is likely to sell their production at 40 to 95% of the world price, so a reported price outside that range warrants a check.
Data redundancy: Data redundancy occurs when the same piece of data exists in multiple places. It enables us to double-check the consistency of the main KPIs.
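
To make the first three checks concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the SUTTI implementation: the function names, the thresholds (3 for the z-score rule, 2 for standardized regression residuals) and the sample numbers are all hypothetical; only the 40 to 95% price band comes from the text above.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Historical consistency: indices whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def regression_outliers(xs, ys, threshold=2.0):
    """Sample comparison: fit y = a + b*x by least squares and flag points
    whose residual is more than `threshold` residual standard deviations away."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # perfect fit, nothing to flag
    return [i for i, r in enumerate(residuals) if abs(r) / spread > threshold]


def price_outside_band(farm_price, world_price, low=0.40, high=0.95):
    """External comparison: True if the farm-gate price falls outside the
    expected share of the world price."""
    ratio = farm_price / world_price
    return not (low <= ratio <= high)


# Hypothetical yields: twenty plausible values and one data-entry error.
yields = [98, 102, 100, 101, 99] * 4 + [500]
print(zscore_outliers(yields))

# Hypothetical ages and yields with a negative relationship and one odd point.
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
rubber = [2000 - 10 * a for a in ages]
rubber[4] += 600
print(regression_outliers(ages, rubber))

print(price_outside_band(farm_price=30, world_price=100))
```

Note that the z-score rule needs a reasonably large sample to be useful: with very few observations, a single extreme value inflates the standard deviation so much that it can never exceed three sigma.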

Data correction is mostly done manually, through automated routine confirmation requests answered by farmers, champion farmers or instructors. Inevitably, only some of the outliers can ultimately be corrected (as depicted in the original article's picture), but this ensures that the KPIs are calculated only on robust data (i.e., normal data plus corrected outliers).

Step 2: KPI calculation on robust data. Now equipped with robust data, we can calculate KPIs on our key metrics and benchmark them against our impact targets. Further, we can build regression models to identify which aspects of the program create the most value, that is, which independent variable has the largest effect on the dependent variable, so that our efforts can be better focused.
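
As a rough sketch of what "KPIs on robust data" can look like in practice, the snippet below keeps only records marked as normal or corrected and compares the resulting average against a target. The record layout, the status labels, the KPI (average yield) and the target value are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical records: field names and status labels are assumptions.
records = [
    {"farmer_id": 1, "yield_kg_ha": 1200, "status": "normal"},
    {"farmer_id": 2, "yield_kg_ha": 1350, "status": "corrected"},
    {"farmer_id": 3, "yield_kg_ha": 9000, "status": "flagged"},  # uncorrected outlier
    {"farmer_id": 4, "yield_kg_ha": 1050, "status": "normal"},
]

# Robust data = normal data plus corrected outliers.
ROBUST_STATUSES = {"normal", "corrected"}


def kpi_on_robust_data(rows, field, target):
    """Average `field` over robust rows only and report the gap to `target`."""
    robust = [row[field] for row in rows if row["status"] in ROBUST_STATUSES]
    if not robust:
        raise ValueError("no robust records available")
    value = mean(robust)
    return {
        "kpi": value,
        "target": target,
        "gap": value - target,
        "n_robust": len(robust),
        "n_total": len(rows),
    }


summary = kpi_on_robust_data(records, "yield_kg_ha", target=1300)
print(summary)
```

Reporting `n_robust` next to `n_total` keeps visible how much of the collected data the KPI actually rests on.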

Step 3: SUTTI impact inference to all data. The basic idea of impact inference is simple: we estimate the impact on sample data using classical models (e.g., a difference-in-difference model), then we adjust for bias (e.g., by introducing other variables into the model or changing the model structure). We can thus project the results estimated on sample data as the impact on all data. The estimate on sample data is written as follows: estimated impact of the SUTTI program = actual value of the KPI – counterfactual value of the KPI.
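
The relation "impact = actual value – counterfactual value" can be illustrated with the simplest form of difference-in-difference: two groups observed over two periods. This is a simplified sketch rather than the stacked regression model mentioned earlier, and the function name and all numbers are hypothetical. It assumes that, without the program, the participating group would have changed by the same amount as the comparison group.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Basic two-group, two-period difference-in-difference.

    The counterfactual is the treated group's baseline plus the change
    observed in the control group (parallel-trends assumption).
    Returns (estimated_impact, counterfactual).
    """
    control_change = mean(control_post) - mean(control_pre)
    counterfactual = mean(treated_pre) + control_change
    impact = mean(treated_post) - counterfactual
    return impact, counterfactual


# Hypothetical monthly incomes (USD) before and after the program.
impact, counterfactual = difference_in_differences(
    treated_pre=[100, 120, 110],
    treated_post=[150, 170, 160],
    control_pre=[100, 110, 105],
    control_post=[115, 125, 120],
)
print(f"counterfactual KPI: {counterfactual}, estimated impact: {impact}")
```

In this made-up case the participating farmers' average rose by 50, the comparison group's by 15, so the estimate attributes the remaining 35 to the program.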

For example, suppose that the SUTTI Program data covers 1,000 farmers and the data of only 600 farmers is robust. First, we run impact inference on those 600 farmers and find that their average income increase is $1,000 per month. Then, we realize that these farmers are younger than the full population, and younger farmers find it easier to adopt new technology, so the sample likely overstates the impact. We therefore make some adjustments, and the final estimated impact on farmers' income is an increase of $800 per month.
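
The article does not specify how the adjustment is made. One common option, shown here purely as an assumption, is to reweight the per-group impact by each age group's share in the full population (post-stratification). The age-group split and the per-group impacts below are hypothetical, picked so that they reproduce the $1,000 and $800 figures of the example.

```python
def weighted_mean_impact(group_impact, group_counts):
    """Average the per-group impact, weighting each group by its head count."""
    total = sum(group_counts.values())
    return sum(group_impact[g] * n for g, n in group_counts.items()) / total


# Hypothetical split: the robust sample over-represents younger farmers.
sample_counts = {"younger": 450, "older": 150}      # 600 robust farmers
population_counts = {"younger": 500, "older": 500}  # all 1,000 farmers

# Hypothetical estimated income increase per group (USD per month).
group_impact = {"younger": 1200, "older": 400}

naive = weighted_mean_impact(group_impact, sample_counts)
adjusted = weighted_mean_impact(group_impact, population_counts)
print(naive, adjusted)
```

With these made-up inputs the unadjusted sample average is $1,000 and the population-weighted average is $800, which matches the direction of the correction described in the example.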

In conclusion, we believe that it is essential to collect impact data from the beneficiaries themselves, and the digital solutions we have developed make it possible to collect this data even under difficult access conditions. Coupled with other modes of collection by instructors, they allow us to collect statements and opinions from program beneficiaries, thus filling gaps in data on vulnerable populations and documenting the impact of our programs.

Iman Barua

Junior Consultant at KSAPA

Iman works across the consulting team and SUTTI program at Ksapa.
He is a second-year student at HEC Paris in the Grande École program specialising in Impact Investing. He has over 3 years of experience in growth and innovation consulting for large corporates across SEA, South Asia and Latin America. More recently he was an investment analyst at an early-stage VC in India.
He speaks English, Hindi, Bangla and Punjabi.

