A Data Science Approach To Process Data Collected From Vulnerable Populations: A Case Study Of Impact Measurement Among Small-scale Farmers

Collecting and analyzing data on and from vulnerable populations remains a black hole in a world focused on Big Data and the growing use of Artificial Intelligence. A glaring example of this paradox is our poor statistical understanding of the 600 million farms under 2 ha, on which a good quarter of humanity depends.

How can we collect data on these populations? How can we collect it directly from them? How can we use digital tools while taking into account the difficult or technically limited environments in which they live and work?

Here are some of the answers we propose, using as an example the approach we have developed in our SUTTI (Scale Up Training Traceability Impact) initiative.

Ksapa has designed the SUTTI initiative and associated solutions to enable the transformation of the first mile of agricultural supply chains. We work with and for small farmers in medium and large-scale programs, cooperating with industrial and financial actors, public authorities, civil society organizations and academics in several countries in Asia and Africa. SUTTI combines design and program management (technical assistance), low-tech digital solutions and innovative impact finance.

The issue of impact measurement and therefore the quality of impact data is key to the success of these projects: 

  • We need to monitor and evaluate the social, economic and environmental impact of our programs to ensure that they are useful and to adjust our focus when impact objectives prove hard to reach: access to quality vocational training and digitalization, income increase and diversification, social inclusion, impact on the carbon intensity of supply chains, etc.
  • The coalitions organized around our programs need tangible evidence of the positive impact we bring. Because the stakeholders involved all have different goals and agendas, this evidence is what holds these coalitions together in the long term. For example, we are working in Sri Lanka with the Sri Lankan and French governments, the Michelin Group, experts in various fields – agronomy, auditing, IT, legal advice, etc. – research institutes, universities, etc.
  • Finally, even though the 600 million farms of less than 2 hectares provide a living for more than a quarter of humanity, this part of the world’s population is a real black hole in terms of data. What exists is scarce, scattered and rarely comparable, which makes it hard to establish and understand the profiles and functioning of small farmers, who are nonetheless essential to the world’s food supply and agro-industrial value chains.

For this reason, among others, we have designed the “SUTTI Digital Suite”, a digital solution in a hybrid format to support the face-to-face deployment of these programs:  

  • to disseminate content and information to participating small farmers (e-learning, information on prices or collection rounds, etc.),
  • but also to collect data from these same farmers (initial profile, P&L, geoplotting, and regular surveys).

This solution is specifically designed for difficult environments, with little access to recent devices, limited connectivity or low literacy. The SUTTI Digital Suite therefore includes a multilingual farmer application that works on all types of phones (not only smartphones), works offline and is usable by illiterate people.

It also includes a back-office tool to administer the whole system and a reporting tool to follow the progress of the program daily.

The SUTTI Data Committee

Collecting data in difficult environments raises several challenges:

  • First of all, it is necessary to have the means to collect data. We collect it not only at the level of the civil society organizations running these programs but also, by choice, directly from small farmers through our digital solutions. The participating farmers are made aware of the purpose of data collection and trained to carry it out.
  • But the data collected may contain inconsistencies or gaps caused by differences in interpretation or data entry errors, or even by respondents answering according to what they believe is expected of them.
  • This is made even more difficult because data collected directly from so-called vulnerable populations is relatively rare, and methods for processing and interpreting it are lacking.
  • Finally, to claim a positive impact, three criteria must be met: intentionality, additionality and measurability. While some impact indicators are relatively easy to calculate, such as the number of people who had access to the program and reported not having had access to vocational training in the last five years, others are more complicated to measure or demonstrate. For example, monitoring changes in income alone cannot measure the additionality of our efforts: changes in income can come from the training provided or the resources provided in the program, but also from the increase in world prices, the negotiation capacities of farming communities, or the change in the margin taken by intermediaries in the supply chain. Thus, indicators that are truly representative of additionality should be included in the data collection: for example, the increase in yield as a result of changes in techniques or the provision of methods to control diseases in the plants and trees harvested, or the increase in income from crop diversifications promoted by the program.

Unevenly Collected Data Sources

With these principles in mind, we began collecting data, only to realize that the quality of the data was uneven and that some of the data was not usable, threatening to make the overall analysis meaningless. 

Missing or erroneous values prevent the calculation of averages or correlations that would allow us to monitor the program’s impact in a meaningful way, depriving us of the means of analysis to adjust our actions. These issues can also undermine our ability to report to different stakeholders (and especially to the funders of these programs) with well-calibrated, good-quality impact indicators, which stakeholders should be able to rely on to guide their actions or communicate to their own stakeholders.

The experts of the SUTTI Data Committee 

We have decided to strengthen our team’s data science capabilities, but also to add a Data Committee composed of professionals and experts in data science, to guide our choices and validate their scientific and operational logic. 

This committee is composed of:

  • Fatima Roudani, ESG & Impact specialist, who has been responsible for ESG big data issues in banking institutions
  • Edwige Rey, Partner at Mazars and specialist in extra-financial auditing
  • Nathan Hara, an astronomer specializing in the use of data to find exoplanets 

They meet regularly to guide and validate our data science approach and its successful integration into our SUTTI solutions. In particular, the committee has allowed us to develop a 3-step approach, described below: improvement of data robustness, calculation of KPIs on robust data, and inference of the SUTTI program’s impact to all data.

Step 1: Data Robustness Improvement

Data robustness improvement consists of finding appropriate statistical methods or algorithms to spot outliers and then correct the data. Outliers are abnormal data points that differ significantly from other observations.

To spot outliers, we mainly rely on four types of methods or sources: historical data consistency, sample comparison, external data comparison, and data redundancy. These methods are detailed below:

(1) Historical data consistency. Here, a normal value should fall within a reasonable range (neither too big nor too small), and outliers are the values outside this range. Verification of historical data consistency is thus mainly based on density or distance. The most common way is the three-sigma rule: nearly all values should lie within three standard deviations of the mean. Typically, if a farmer steadily produces x kg per month for 2 years, a monthly declaration several times higher than this usual range would be considered a “data outlier”.
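The three-sigma rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `three_sigma_outliers`, the threshold parameter `k` and the production figures are hypothetical and do not come from the SUTTI dataset. The mean and standard deviation are computed on past declarations only, so that a new extreme value does not inflate the range it is tested against.

```python
from statistics import mean, stdev

def three_sigma_outliers(history, new_values, k=3.0):
    """Return the new values lying more than k standard deviations
    away from the mean of the historical declarations."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of past declarations
    lower, upper = mu - k * sigma, mu + k * sigma
    return [v for v in new_values if v < lower or v > upper]

# Hypothetical monthly production declarations (kg) over two years.
history = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96, 105, 101,
           99, 100, 102, 98, 103, 97, 101, 100, 104, 99, 98, 102]

# Three hypothetical new declarations; only the second is far out of range.
flagged = three_sigma_outliers(history, [101, 350, 95])
print(flagged)
```

A flagged value is only a candidate for review, not necessarily an error: a real change in the farm (new plot, new trees) can also move a declaration outside the historical range.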

(2) Sample comparison. With this approach, we study the relationship between the target variable and other factors, and outliers are the points that do not follow this relationship. For instance, we have found a negative relationship between rubber yield and age. In the linear regression shown in the following picture, the dark orange points clearly do not follow this relationship (they lie outside the confidence intervals). We therefore mark them as suspected outliers: they represent data points that differ significantly from the average or usual range of a group of comparable farmer profiles. This does not necessarily mean that they will ultimately be excluded from calculations, but they will be investigated with more caution.
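A minimal sketch of this kind of regression-based screening is given below. It is a simplification, not the model actually used: it fits an ordinary least-squares line and flags points whose residual exceeds `z` residual standard errors, which only approximates a proper confidence or prediction band. The function names, the threshold `z = 2` and the age and yield figures are all hypothetical.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def suspected_outliers(xs, ys, z=2.0):
    """Return the indices of points whose residual is larger than
    z times the residual standard error of the fitted line."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    std_err = sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > z * std_err]

# Hypothetical data: yield decreases with age, except for one declaration.
ages = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
yields = [1910, 1790, 1705, 1595, 1510, 2200, 1290, 1205, 1095, 1010, 890, 805]

print(suspected_outliers(ages, yields))  # index of the suspicious point
```

Because the suspicious point is itself part of the fit, a very large error can distort the line; with many outliers a robust regression would be a safer choice.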

(3) External data comparison. Several external data sources are under consideration: FAO, OECD, World Bank, UNESCAP, etc. For example, a given farmer is likely to sell their production at a price ranging from 40 to 95% of the world price, depending on location and country; we use external data to detect outliers for rubber prices in our Indonesian programs. Similar checks can be made on yield per ha. From this external data, we found that the first-differenced rubber price series is stationary (stable), so we use ARIMA(1,1,1), a classical time series model, to model the rubber price. We can then use this model to predict the rubber price for our own dataset; outliers are the values outside the confidence intervals.
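The simplest form of external comparison mentioned above, checking a declared price against a band of 40 to 95% of the world price, can be sketched as follows. It does not reproduce the ARIMA model; it only illustrates the band check. The function name, the record layout, the farmer identifiers and the world price of 2.00 (in an arbitrary currency per kg) are hypothetical.

```python
def price_outliers(records, world_price, low=0.40, high=0.95):
    """Return the ids of farmers whose declared selling price falls
    outside the [low, high] share of the external world price."""
    lower, upper = low * world_price, high * world_price
    return [farmer_id for farmer_id, price in records
            if not lower <= price <= upper]

# Hypothetical declarations: (farmer id, declared price per kg).
records = [("F001", 1.20), ("F002", 0.50), ("F003", 1.85), ("F004", 2.60)]

# With a hypothetical world price of 2.00, the accepted band is 0.80 to 1.90.
print(price_outliers(records, world_price=2.00))
```

In practice the bounds would be set per country and location, since the text notes that the share of the world price received by farmers depends on both.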

(4) Data redundancy. Data redundancy occurs when the same piece of data exists in multiple places. It usually happens when the design of the dataset is not appropriate.
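One way to exploit redundancy as a check is to look for the same declaration stored more than once with different values. The sketch below assumes a hypothetical record layout of (farmer id, month, declared kg); none of these names or figures comes from the SUTTI data model.

```python
from collections import defaultdict

def conflicting_duplicates(records):
    """Return the (farmer_id, month) keys that appear several times
    with different declared quantities."""
    values_by_key = defaultdict(set)
    for farmer_id, month, kg in records:
        values_by_key[(farmer_id, month)].add(kg)
    return sorted(key for key, values in values_by_key.items()
                  if len(values) > 1)

# Hypothetical declarations, some of them stored twice.
records = [
    ("F001", "2023-01", 100),
    ("F001", "2023-01", 100),  # exact duplicate: redundant but consistent
    ("F002", "2023-01", 80),
    ("F002", "2023-01", 95),   # same farmer and month, different value
    ("F003", "2023-01", 60),
]

print(conflicting_duplicates(records))
```

Exact duplicates can simply be merged, whereas conflicting copies need to be confirmed with the farmer or instructor.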

As for data correction, the most efficient approach is to run pre-configured analysis routines that trigger manual reprocessing or confirmation by farmers, agricultural champions or instructors. As shown in the following image, only a portion of the outliers can ultimately be corrected; the Key Performance Indicators (KPIs) are therefore calculated only on robust data (i.e., normal data together with the outliers that have been corrected).

Step 2: KPI Calculation on Robust Data

Secondly, we calculate KPIs on robust data. The design of the KPIs focuses mainly on the following aspects:

  • Number of people reached (direct & indirect beneficiaries, % women, % youth…)
  • Volume of professional training in hours/person (both in-person and digitally)
  • Impact on increase and diversification of revenues
  • Environmental impact, including carbon sequestration and greenhouse gas (GHG) avoidance which are part of the program objectives

Hence, there are three main types of KPIs in SUTTI programs: number of participants, professional training, and impact on the incomes of smallholder farmers. The approach can also be adapted to the qualification and monitoring of carbon projects, for example: carbon sequestered or GHG emissions avoided could be monitored through large-scale data collection directly from farmers, complemented by approaches such as carbon accounting, satellite monitoring and so on.

The first type (number of people reached) includes all kinds of participant counts, broken down by gender, age, relatives, employment and so on. Examples: active participants count, direct beneficiaries count, number of people whose livelihood will be impacted, etc.

The second type (volume of professional training) mainly tracks how training evolves, for instance through the training hours of the courses. Examples: in-person training hours, number of people accessing capacity building on the SUTTI App, etc.

For the third type (impact on increase and diversification of revenues), a case-by-case study is made to monitor the main levers we use: productivity and its impact on yield, for example, but also the diversification of crops recommended or brought through SUTTI. There can be other levers, but they have to be proven and have a “logical” rationale.
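The first two KPI types can be illustrated with a short sketch. The record fields (`gender`, `age`, `in_person_hours`, `digital_hours`), the youth threshold of 35 years and the four participants are hypothetical, chosen only to show the arithmetic; the records are assumed to have already passed the robustness checks of Step 1.

```python
def reach_and_training_kpis(participants, youth_max_age=35):
    """Compute simple reach and training KPIs from participant records."""
    n = len(participants)
    women = sum(1 for p in participants if p["gender"] == "F")
    youth = sum(1 for p in participants if p["age"] < youth_max_age)
    hours = sum(p["in_person_hours"] + p["digital_hours"] for p in participants)
    return {
        "participants": n,
        "pct_women": 100 * women / n,
        "pct_youth": 100 * youth / n,
        "training_hours_per_person": hours / n,
    }

# Hypothetical robust records for four participants.
participants = [
    {"gender": "F", "age": 28, "in_person_hours": 6, "digital_hours": 2},
    {"gender": "M", "age": 45, "in_person_hours": 4, "digital_hours": 0},
    {"gender": "F", "age": 52, "in_person_hours": 8, "digital_hours": 4},
    {"gender": "M", "age": 31, "in_person_hours": 2, "digital_hours": 2},
]

kpis = reach_and_training_kpis(participants)
print(kpis)
```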

Step 3: SUTTI Impact Inference to All Data

The need to infer impact stems from two reasons. First, only the performance of the responding farmers is observable, whereas we are more interested in the overall impact. Second, only some of the outliers are ultimately corrected, so the key performance indicators are calculated on robust data only, not on all data.

We therefore proceed to impact inference. The basic idea is simple: we estimate the impact on the sample data, make some adjustments for identified biases, and then project the impact onto all data. The estimated impact on the sample data is written as follows: estimated impact of the SUTTI program = actual value of the KPI – counterfactual value of the KPI (i.e., the value the KPI would have if there were no SUTTI program).

For example, suppose that in the SUTTI data we have 1000 farmers and only the data for 600 farmers is robust. First, we perform an impact inference on these 600 farmers and find that their average income increases by $1,000 per year. We then realize that these farmers are younger than the data set as a whole, and that young farmers adopt a new technology more easily, so projecting their average increase onto all participants would be too optimistic. We therefore make some adjustments (which may be the subject of a future article) that consider the statistical distribution of the sample of “respondents” as well as that of all participants. The final estimated impact on farmers’ income is then, for example, an increase of $1,000 reduced by x% per year across all participants.
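The adjustment itself is not detailed in this article. As one possible illustration, the sketch below uses post-stratification, a standard reweighting technique that is an assumption here and not necessarily the method used in SUTTI. Only the 1000 farmers, the 600 robust respondents and the $1,000 average increase come from the example above; the age groups, their shares and the per-group gains are hypothetical.

```python
def poststratified_mean(sample, population_shares):
    """Reweight per-stratum sample means by each stratum's share among
    all participants. Every stratum listed in population_shares must
    appear in the sample, otherwise a KeyError is raised."""
    values_by_stratum = {}
    for stratum, value in sample:
        values_by_stratum.setdefault(stratum, []).append(value)
    return sum(
        share * sum(values_by_stratum[s]) / len(values_by_stratum[s])
        for s, share in population_shares.items()
    )

# Hypothetical split of the 600 robust respondents: 400 younger farmers
# gaining $1,200 per year and 200 older farmers gaining $600 per year.
sample = [("young", 1200.0)] * 400 + [("older", 600.0)] * 200
raw_mean = sum(value for _, value in sample) / len(sample)

# Hypothetical shares among all 1000 participants.
population_shares = {"young": 0.5, "older": 0.5}
adjusted_mean = poststratified_mean(sample, population_shares)

print(raw_mean, adjusted_mean)
```

With these hypothetical figures the unadjusted average is $1,000 per year, while the reweighted average is $900, which would correspond to x = 10 in the example above.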

By definition, the counterfactual value of a KPI is unobservable, so logical links and statistical models are needed to “predict” it. Several methods are available to do this, but when we design the impact inference model, the two key points are the introduction of control variables and the resolution of selection bias.

The model includes control variables because they may also influence the outcome variables (KPIs). For example, we introduce age and gender as control variables when we estimate the SUTTI impact on main crop productivity. In general, the more relevant control variables we introduce, the more accurate the estimated impact will be.
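To show why a control variable matters, the sketch below compares a naive difference in means with a difference computed within age groups. It uses stratification rather than a regression model, and it assumes that a comparison group of non-participants is available, which this article does not state. All names and figures are hypothetical.

```python
from statistics import mean

def naive_impact(records):
    """Difference in mean outcome between participants and non-participants."""
    treated = [v for participates, _, v in records if participates]
    control = [v for participates, _, v in records if not participates]
    return mean(treated) - mean(control)

def stratified_impact(records):
    """Average the within-group differences, weighting each group by its
    number of participants. Each group needs both kinds of farmers."""
    groups = {}
    for participates, group, value in records:
        groups.setdefault(group, {True: [], False: []})[participates].append(value)
    total_treated = sum(len(g[True]) for g in groups.values())
    return sum(
        (mean(g[True]) - mean(g[False])) * len(g[True]) / total_treated
        for g in groups.values()
    )

# Hypothetical records: (participates, age group, yield).
# Younger farmers have higher yields and are over-represented among participants.
records = (
    [(True, "young", 1300)] * 3 + [(False, "young", 1200)]
    + [(True, "older", 1000)] + [(False, "older", 900)] * 3
)

print(naive_impact(records), stratified_impact(records))
```

In this hypothetical data the naive comparison gives 250, while comparing like with like gives 100 in each age group: the difference is the part of the gap explained by age rather than by the program.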

As for selection bias, it is the bias that results from failing to ensure proper randomization of a population sample, for example through self-selection, survivorship bias or pre-screening of participants. In any case, the most radical solution is to know exactly what the selection bias looks like in the dataset so that we can adjust for it in the estimation model.


In conclusion, we believe that it is essential to collect impact data from the beneficiaries themselves, and that the digital solutions we have developed make it possible to collect this data, even under difficult access conditions. Coupled with other modes of collection by instructors, they allow us to collect statements and opinions from program beneficiaries, thus filling gaps in data on vulnerable populations as well as documenting the measurement of the impact of our programs.

The data science approaches developed under the supervision of our Data Committee make it possible to process significant amounts of data and to derive in-depth analyses and impact projections that are widely applicable outside the agricultural field for which they were designed: for example, these solutions can also be applied to mass studies on issues of decent income or working conditions for workers.

20+ years of experience in investment & asset management.
Raphael Hara works on relationships between finance and sustainability, in particular through the development and management of impact investment funds and projects.

Xianlin Ding
Data Analyst at Ksapa

Xianlin works as a Data Analyst at Ksapa. Among other things, he manages and analyses the data collected from agricultural populations via the SUTTI application, as well as the implementation of impact monitoring systems and indicators.

Xianlin is a student of both ENSAE Paris in Data Sciences and Social Sciences and Sciences Po Paris in International Economic Policy. He has already worked in applied statistics (GDP estimation in China, real estate price forecasting in Japan...). He has previously analysed data from “European Societies” to bring out the valued elements of the journal.

Xianlin speaks Mandarin, English, French and Spanish.
