  • Open access
  • Published: 07 September 2021

An interaction regression model for crop yield prediction

  • Javad Ansarifar 1 ,
  • Lizhi Wang 1 &
  • Sotirios V. Archontoulis 2  

Scientific Reports volume 11, Article number: 17754 (2021)


  • Machine learning
  • Plant sciences

Crop yield prediction is crucial for global food security yet notoriously challenging due to the multitude of factors that jointly determine the yield, including genotype, environment, management, and their complex interactions. Integrating the power of optimization, machine learning, and agronomic insight, we present a new predictive model (referred to as the interaction regression model) for crop yield prediction, which has three salient properties. First, it achieved a relative root mean square error of 8% or less in three Midwest states (Illinois, Indiana, and Iowa) in the US for both corn and soybean yield prediction, outperforming state-of-the-art machine learning algorithms. Second, it identified about a dozen environment by management interactions for corn and soybean yield, some of which are consistent with conventional agronomic knowledge whereas others require additional analysis or experiments to prove or disprove. Third, it quantitatively dissected crop yield into contributions from weather, soil, management, and their interactions, allowing agronomists to pinpoint the factors that favorably or unfavorably affect the yield of a given location under a given weather and management scenario. The most significant contribution of the new prediction model is its capability to produce accurate predictions and explainable insights simultaneously. This was achieved by training the algorithm to select features and interactions that are spatially and temporally robust to balance prediction accuracy for the training data and generalizability to the test data.


Introduction

Predicting crop yield is crucial to addressing emerging challenges in food security, particularly in an era of global climate change. Accurate yield predictions not only help farmers make informed economic and management decisions but also support famine prevention efforts. Underlying crop yield prediction is a fundamental research question in plant biology, which is to understand how plant phenotype is determined by genotype (G), environment (E), management (M), and their interactions (G \(\times \) E \(\times \) M) 1 , 2 , 3 , 4 , 5 , 6 . State-of-the-art crop yield prediction methods fall into three main categories: linear models, machine learning models, and crop models, which have complementary strengths and limitations. Linear models are explainable by quantifying the additive effect of each variable, but they often struggle to achieve high prediction accuracy due to the inability to capture the intrinsically nonlinear interactions among G, E, and M variables.

Machine learning models have been successfully used for crop yield prediction, including stepwise multiple linear regression 7 , random forest 8 , neural networks 9 , 10 , 11 , convolutional neural networks 12 , recurrent neural networks 13 , weighted histograms regression 14 , interaction based model 15 , and association rule mining and decision tree 16 . Most of these studies were based on environmental and managerial variables only, due to lack of publicly available genotype data at the state or national scale. Some studies 16 , 17 , 18 , 19 explored the relationship between genotype and grain yield from regional yield trials from a plant breeding perspective, which would be hard to scale up to statewide or nationwide predictions. Many machine learning algorithms are scalable to large datasets and have reasonably high prediction accuracy. However, due to the black-box nature of these models, prediction accuracy is sensitive to model structure and parameter calibration, and it can prove difficult to explain why predictions are accurate or inaccurate.

Crop models are another type of nonlinear models, including APSIM 20 , DSSAT 21 , 22 , RZWQM 23 , and SWAP/WOFOST 24 , which build upon the physiological understanding of plant and soil processes to develop biologically meaningful non-linear equations to predict crop yield and other phenotypes. These models provide explicit (albeit complex) explanations of the interactions between traits and environmental conditions in different phases of the crop growth cycle. They also offer biological insights into causes of phenotypic variation 25 . Nevertheless, the collection of trait measurement data and calibration of model coefficients can be labor intensive and time consuming 26 , 27 , 28 , 29 , computation speed could be low 29 , and prediction accuracy may not be as high as some machine learning algorithms.

We propose a novel model, the interaction regression model, for crop yield prediction, which attempts to combine the strengths and avoid the limitations of the aforementioned approaches. At the core of this model lies a combinatorial optimization algorithm, which not only selects the most revealing E and M features but also detects their most pronounced interactions; the contributions of these features and interactions to the crop yield are then quantified with a multiple linear regression. To ensure the explainability of the results, we trained our algorithm to find features and interactions that are spatially and temporally robust, which means that they should be consistently predictive of crop yield across all counties in all years. As such, results from this model have the potential to propose biologically and agronomically insightful hypotheses on E \(\times \) M interactions that can be validated experimentally. A similar concept of robust inference model in spatial–temporal models was presented in Santos and Erniel 30 . A measure of robustness was proposed in Nogueira et al. 31 , which was based on the number of overlapping features selected using different subsets of training data. In our approach, the robustness measure is defined as the average prediction performance in multiple validation datasets at different temporal and spatial spectra. As such, our robustness definition allowed the algorithm to strike a balance between prediction accuracy and generalizability.

The proposed model has demonstrated notable performance in a comprehensive case study, in which it was compared with eight other machine learning models to predict corn and soybean yield in 293 counties of the states of Illinois, Indiana, and Iowa from 2015 to 2018. We also tested prediction performance with and without knowledge of the weather during the growing season, as well as the temporal and spatial extrapolation performance of the proposed model in unseen counties. The proposed model not only achieved a relative root mean square error (RRMSE) of less than 8% for both corn and soybean in all three states, outperforming all other machine learning models in the case study, but also produced explainable insights. In particular, our model identified 11 E \(\times \) M interactions for corn and 12 for soybean, and also dissected the total yield into contributions from weather, soil, management, and their interactions. To test the generalizability of the model in terms of both temporal and spatial extrapolation, we trained the model using historical data from two states up to 2017 and applied it to predict corn yield in a third state for 2018, and the resulting average RRMSE was less than 10%.

Let X denote the set of explanatory (including environment and management) variables and y the crop yield of a given county for a given year. We propose the interaction regression model to describe the relationship between X and y as follows.

\[\hat{y}_i = \beta _0 + \sum _{j \in \mathcal {P}} \beta _j X_{i,j} + \sum _{m \in \mathcal {M}} b_m Z_{i,m}, \quad \forall i \in \mathcal {N}, \qquad (1)\]

where \(\mathcal {N}\) is the set of sample observations (one sample per county per year), \(\mathcal {P}\) is the set of explanatory variables, \(\mathcal {M}\) is the set of interactions, \(\hat{y}_i\) is the predicted crop yield of sample i , \(\beta _0\) is the intercept of crop yield, \(\beta _j\) is the additive effect of variable j , \(X_{i,j}\) is the explanatory variable j of sample i , \(b_m\) is the effect of interaction m , and \(Z_{i,m}\) is the interaction variable m of sample i .

Key to Eq. ( 1 ) is to decipher the interaction matrix Z from the explanatory variables. We use a kernel-based approach to represent the interactions as

\[Z_{i,m} = \sum _{k \in \mathcal {K}} \delta _{m,k} K_k(\cdot ), \quad \forall i \in \mathcal {N},\ m \in \mathcal {M}, \qquad (2)\]

where \(K_k(\cdot )\) is the type k kernel function, evaluated on the explanatory variables of sample i that are involved in interaction m , \(\mathcal {K}\) is the set of kernel functions that we use to describe nonlinear relationships between explanatory variables and crop yield, and \(\delta _{m,k}\) is a binary variable indicating whether interaction m is best described by the type k kernel ( \(\delta _{m,k} = 1\) ) or not ( \(\delta _{m,k} = 0\) ).
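The model structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the two kernels (`product` and `squared_difference`) are hypothetical stand-ins, since the actual kernel set is given in Appendix 1, and all coefficient values are made up.

```python
# Hypothetical kernels; the actual kernel set is defined in the paper's Appendix 1.
KERNELS = {
    "product": lambda a, b: a * b,
    "squared_difference": lambda a, b: (a - b) ** 2,
}


def interaction_value(x, interaction):
    """Z_{i,m}: apply the one kernel selected for interaction m (delta_{m,k} = 1)."""
    j1, j2, kernel_name = interaction
    return KERNELS[kernel_name](x[j1], x[j2])


def predict(x, beta0, beta, interactions, b):
    """y_hat_i = beta0 + sum_j beta_j * X_ij + sum_m b_m * Z_im."""
    additive = sum(beta_j * x[j] for j, beta_j in beta.items())
    interactive = sum(
        b_m * interaction_value(x, inter) for b_m, inter in zip(b, interactions)
    )
    return beta0 + additive + interactive


# Toy sample with three features already scaled to [0, 1] (made-up numbers).
x = [0.5, 0.2, 0.8]
beta = {0: 2.0, 2: -1.0}  # only selected features carry a coefficient
interactions = [(0, 1, "product"), (2, 2, "product")]  # the second is a self-interaction
b = [4.0, 0.5]
y_hat = predict(x, beta0=1.0, beta=beta, interactions=interactions, b=b)
```

Because each interaction enters the prediction as one extra column Z with its own coefficient, the fitted model stays linear in its parameters, which is what keeps the additive and interactive effects separately readable.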

In order to solve Eq. ( 1 ), we propose an approach that consists of three major steps: data pre-processing, robust feature and interaction selection, and linear regression, as illustrated in Fig. 1 . Key elements of the three steps are summarized as follows.

figure 1

Illustration of the proposed interaction regression model for crop yield prediction. Step 1 is data pre-processing. In step 2, Algorithms 1 and 2 select robust features and interactions, which are then used in step 3 to predict the crop yield with a multiple linear regression model. Here, \(\hat{y}\) is the predicted yield, \(\beta _\text {W}\) , \(\beta _\text {S}\) , and \(\beta _\text {M}\) are, respectively, the additive effects of weather, soil, and management features, whereas \(\beta _\text {I}\) is the effect of E × M interactions. This plot was created with Microsoft PowerPoint (Version 16.0.12827.20200 32-bit).

Step 1: Data pre-processing

We collected weather data from the Iowa Environmental Mesonet 32 , soil data from the Gridded Soil Survey Geographic Database 33 , and management and yield performance data from the National Agricultural Statistics Service 34 for all 293 counties of the states of Illinois, Indiana, and Iowa from 1990 to 2018. Weather variables include precipitation (Prcp, mm), solar radiation (Srad, MJ m \(^{-2}\) ), maximum temperature (Tmax, \(^\circ \) C), and minimum temperature (Tmin, \(^\circ \) C) from weeks 13 (late March) to 52 (late December). Soil variables include dry bulk density (BDdry, g cm \(^{-3}\) ), clay percentage (clay, %), soil pH (pH), drained upper limit (dul, mm mm \(^{-1}\) ), soil saturated hydraulic conductivity (ksat, mm day \(^{-1}\) ), wilting point (ll, mm mm \(^{-1}\) ), soil organic matter (om, %), sand percentage (sand, %), and saturated volumetric water content (sat, mm mm \(^{-1}\) ) at nine different soil depths: 0–5, 5–10, 10–15, 15–30, 30–45, 45–60, 60–80, 80–100, and 100–120 cm. Weather and soil data were available at 1 km \(^2\) spatial resolution. To compute county-level information, we had to scale up and aggregate the soil and weather information. We took the average of the gridded soil values within each county to obtain county-level soil information. In contrast, we took the median of the gridded weather values within each county to obtain county-level weather information. Management variables include acres planted at the county level and the weekly cumulative percentages of planted and harvested acreage. We also created additional variables from the weather and management data, based on agronomic insight, to help enhance the performance of the model, such as growing degree days, number of rainy days, and heat units.

Due to the lack of publicly available genotypic data, we extracted two new variables using additional data from the National Agricultural Statistics Service 34 to account for the trend of genetic improvements 2 : (1) trend of historical yields and (2) trend of population density for corn and pod count for soybean. These two variables were put in the category of management variables. All variables were normalized to the [0, 1] interval.
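The aggregation and scaling steps above can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the authors' pipeline: the function names are hypothetical, and the growing degree day formula uses a 10 °C base and 30 °C cap, a convention commonly used for corn that the text does not specify.

```python
from statistics import mean, median


def county_soil(grid_values):
    """Average the 1 km^2 soil cells of a county into one county-level value."""
    return mean(grid_values)


def county_weather(grid_values):
    """Take the median of the 1 km^2 weather cells of a county."""
    return median(grid_values)


def growing_degree_days(tmax, tmin, base=10.0, cap=30.0):
    """Daily GDD with an ASSUMED 10/30 degC base and cap (not stated in the text)."""
    tmax = min(max(tmax, base), cap)
    tmin = min(max(tmin, base), cap)
    return (tmax + tmin) / 2.0 - base


def min_max_normalize(values):
    """Rescale one variable to the [0, 1] interval; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_normalize([2, 4, 6])` maps the smallest value to 0 and the largest to 1, so that coefficients of differently scaled variables become comparable in the later regression step.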

Step 2: Robust feature and interaction selection

To avoid overfitting, we selected a subset of all explanatory variables (features) to predict crop yield. We applied an elastic net regularization model to select a set of high-quality features for each category of weather, soil, and management, and then we used forward and backward stepwise selection to identify features and interactions that are spatially and temporally robust across different counties over different years. These robust features and interactions were selected using an algorithm similar to the one in our previous study 35 , which was modified to iterate between exploring new interactions and cross-validating their performance. This process continues until a set of robust features and interactions has been discovered that leads to good prediction accuracy on the training data and generalizability on the validation data. The way interactions are represented in our model differs from the classical factorial interaction. However, the two are similar in the sense that our algorithm explores all possible factorial combinations to identify the most effective interactions to include in the model.
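The idea of scoring candidates by their average validation performance and then searching forward and backward can be sketched as follows. This is a simplified greedy illustration, not Algorithms 1 and 2 from the paper: `fit_and_score` is a hypothetical callable that fits a regression on the training part of a fold and returns its validation error, and the first-improvement acceptance rule is an assumption.

```python
def robust_score(features, folds, fit_and_score):
    """Average validation error over several spatial/temporal folds."""
    errors = [fit_and_score(features, train, valid) for train, valid in folds]
    return sum(errors) / len(errors)


def stepwise_select(candidates, folds, fit_and_score):
    """Greedy forward-backward search that keeps any change lowering the robust score."""
    selected, best = [], float("inf")
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each candidate that is not yet selected.
        for c in candidates:
            if c in selected:
                continue
            score = robust_score(selected + [c], folds, fit_and_score)
            if score < best:
                best, selected, improved = score, selected + [c], True
        # Backward step: try dropping each selected candidate.
        for c in list(selected):
            trial = [s for s in selected if s != c]
            if trial:
                score = robust_score(trial, folds, fit_and_score)
                if score < best:
                    best, selected, improved = score, trial, True
    return selected, best
```

A candidate here can be either a single feature or an interaction term, so the same loop covers both kinds of selection. Averaging over folds drawn from different years and regions is what penalizes terms that only help in one part of the data.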

Step 3: Linear regression

The last step of the prediction model is a multiple linear regression, which attributes crop yield to additive contributions from weather, soil, management, and their interactions. As such, this prediction model combines the strengths of explainability of linear regression, prediction accuracy of machine learning, and agronomic insights.

More details about the kernel functions in Eq. ( 1 ) and the algorithm for solving it are provided in Appendix 1 .

Experimental setting

We compared the performance of the proposed algorithm with that of eight other machine learning algorithms from the literature: linear regression was implemented in R; stepwise regression was implemented in R using the MASS package 36 ; LASSO, ridge, and elastic net were implemented in R using the glmnet package 37 ; random forest was implemented in R using the ranger package 38 ; extreme gradient boosting (XGBoost) was implemented in R using the xgboost package 39 ; and neural network was implemented in Python using the scikit-learn package 40 . We fed all original explanatory variables as input to these eight algorithms. The linear regression algorithm uses all features without interaction selection; stepwise regression, LASSO regression, ridge regression, and elastic net use their default feature selection settings in the software packages, without interaction selection; random forest, XGBoost, and neural network use different modeling structures for feature and interaction selection. As such, the different performances of these algorithms can be attributed to how they select features and interactions from the same set of explanatory data.

All nine algorithms were deployed to predict both corn and soybean yields in the states of Illinois, Indiana, and Iowa from 2015 to 2018. To predict yield for the test year t , the training data included all the explanatory (weather, soil, and management) and response (crop yield) data from 1990 to year \(t-1\) . A 10-fold cross-validation (CV) over training and validation partitions was applied to tune the hyperparameters using a grid search approach.
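The expanding-window setup described above (train on 1990 to t-1, test on year t) can be written as a small generator. This is a minimal sketch; the record layout, a list of dictionaries with a `year` key, is an assumption made for illustration.

```python
def expanding_window_splits(records, test_years, first_year=1990):
    """Yield (test_year, train, test) where train covers first_year .. test_year - 1."""
    for t in test_years:
        train = [r for r in records if first_year <= r["year"] < t]
        test = [r for r in records if r["year"] == t]
        yield t, train, test


# Toy usage: one record per year stands in for the county-year samples.
records = [{"year": y} for y in range(1990, 2019)]
splits = list(expanding_window_splits(records, test_years=[2015, 2016, 2017, 2018]))
```

Each later test year sees one more year of training data, so the model never uses information from the year it is asked to predict.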

figure 2

Illustration of generating scenarios and predicting yield at each week during the growing season. This plot was created with Microsoft PowerPoint (Version 16.0.12827.20200 32-bit).

Crop yield prediction during the growing season helps farmers make economic and management decisions, but it is also very challenging due to weather and management uncertainty. Our model was able to provide weekly predictions by integrating continuously updated weather and management data with future weather scenarios. For this purpose, we first trained the proposed model on historical data and then used the trained model to predict yield during the growing season. The process of generating scenarios during the growing season and predicting yield is illustrated in Fig. 2 . For the prediction at each week, we recorded the weather and management information observed so far and estimated the remainder in advance to construct complete weather and management profiles. For the unknown part of the data, we used the observations from previous years as different scenarios at each week. Therefore, we could generate several predictions for corn and soybean at each week, one for each scenario. As more weather and management data were observed, the uncertainty decreased; thus, the prediction accuracy was expected to improve over time as actual observations became available to replace the estimated weather and management data. Our previous work using a crop model suggested that weather uncertainty decreased by 60% by mid-July in Iowa for both corn and soybean 41 . The final prediction at each week was the median of the predicted yields across scenarios.
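The scenario procedure can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: weather is reduced to one number per week, `model` is a hypothetical callable that maps a full-season profile to a yield, and the interquartile band mirrors the interval shown in Fig. 4.

```python
from statistics import median, quantiles


def build_scenarios(observed, history):
    """Splice this year's observed weeks onto each past year's remaining weeks."""
    n = len(observed)
    return [observed + past[n:] for past in history]


def weekly_prediction(observed, history, model):
    """Return the median prediction and the (first, third) quartile band."""
    preds = [model(s) for s in build_scenarios(observed, history)]
    q1, _, q3 = quantiles(preds, n=4)  # needs at least two scenarios
    return median(preds), (q1, q3)


# Toy usage: two observed weeks, three past years of four weeks each (made-up numbers).
observed = [1.0, 2.0]
history = [[0.0, 0.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 2.0, 2.0]]
point, (low, high) = weekly_prediction(observed, history, model=sum)
```

As the season progresses, `observed` grows and the part borrowed from past years shrinks, so the scenarios converge and the band narrows.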

To explore the prediction performance of the proposed Interaction–Regression model for corn and soybean in completely unseen counties, we created four datasets by removing the historical data of some counties from the training and validation sets. For the first three datasets, we removed the data for Illinois (IL), Indiana (IN), and Iowa (IA), respectively, from the training and validation sets; for the last dataset, we randomly picked 100 out of the 293 counties and removed all their data from the training and validation sets. For each test dataset of unseen counties in 2018, the historical data of the seen counties from 1990 to 2017 were divided into four time-wise folds, which the proposed framework then used for feature selection and interaction detection. After extracting robust features and interactions for each dataset, we used the two years preceding the test year 2018 (2016 and 2017) as the validation set and the remaining years (1990 to 2015) as the training set. Then, for each test dataset, we trained the model using its training partition and the robust features and interactions, and the trained model was used to predict crop yield in the unseen counties in 2018.
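The leave-one-state-out partition described above can be sketched as a single function. The record layout (dictionaries with `state` and `year` keys) and the function name are assumptions made for illustration.

```python
def leave_state_out(records, held_out_state, test_year=2018, valid_years=(2016, 2017)):
    """Split records so the held-out state appears only in the test set."""
    seen = [
        r for r in records
        if r["state"] != held_out_state and r["year"] < test_year
    ]
    train = [r for r in seen if r["year"] not in valid_years]
    valid = [r for r in seen if r["year"] in valid_years]
    test = [
        r for r in records
        if r["state"] == held_out_state and r["year"] == test_year
    ]
    return train, valid, test


# Toy usage: one record per state and year stands in for the county-year samples.
records = [
    {"state": s, "year": y} for s in ("IL", "IN", "IA") for y in range(1990, 2019)
]
train, valid, test = leave_state_out(records, held_out_state="IA")
```

The fourth dataset in the text, with 100 randomly chosen counties held out, follows the same pattern with a county identifier in place of the state.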

Prediction accuracy comparison with other machine learning models

Prediction errors for the two crops over four test years using nine algorithms are summarized in Table 1 . Further comparisons of the nine models in terms of the relative RMSE (RRMSE), the relative squared error (RSE), the mean absolute error (MAE), the relative absolute error (RAE), and the coefficient of determination ( \(R^2\) ) are reported in Appendix 2 . These results suggest that the proposed model outperformed the other models for all test years for both corn and soybean in all evaluation criteria. The test root mean square errors (RMSE) are also lower than what has been reported in the literature 13 , 14 , 16 , 29 . Two factors may explain this performance. First, the difference between our model and the others can be attributed to how our model selects high-quality and robust features and interactions from the same set of explanatory data. Second, due to the sparsity of the modeling structure, which specifically separates interactive effects from additive effects of features, the algorithm is less prone to overfitting than some machine learning approaches. In terms of computation time, the proposed approach took approximately two hours for each test year, which was comparable with the neural network model.
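For readers who want to reproduce the evaluation criteria, the sketch below computes them in plain Python. The formulas are the commonly used definitions and are an assumption here, since the text does not spell them out: RRMSE is taken as RMSE divided by the mean observed yield (in percent), RSE and RAE are normalized by the deviations of the observations from their mean, and \(R^2\) is one minus RSE.

```python
from math import sqrt


def yield_metrics(y_true, y_pred):
    """Return RMSE, RRMSE (%), RSE, MAE, RAE, and R2 under common definitions."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_dev = sum((t - y_bar) ** 2 for t in y_true)
    abs_dev = sum(abs(t - y_bar) for t in y_true)
    rmse = sqrt(sq_err / n)
    return {
        "RMSE": rmse,
        "RRMSE": 100.0 * rmse / y_bar,
        "RSE": sq_err / sq_dev,
        "MAE": abs_err / n,
        "RAE": abs_err / abs_dev,
        "R2": 1.0 - sq_err / sq_dev,
    }


# Toy usage with made-up yields in t/ha.
metrics = yield_metrics([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
```

Dividing RMSE by the mean yield makes errors comparable between corn and soybean, whose absolute yields differ considerably.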

Prediction performance with known weather after growing season

Figure 3 illustrates the prediction performance of the proposed model after the end of the growing season, when all the weather data have been observed. These results indicate that the proposed model has an RRMSE lower than 8% in all three states (and most of the counties) over multiple years for both corn and soybean. For reference, the prediction accuracy of other recent studies ranged from 7.6% mean absolute percentage error for corn using deep neural networks 42 to 16.7% RRMSE for corn using random forest 8 .

figure 3

RRMSE for corn and soybean yield prediction from 2015 to 2018. These plots were created with R (version 3.6.3) 43 .

Prediction performance with updating weather during growing season

Figure 4 shows the predictions of corn and soybean yield during the growing season of 2018 in the three states, updated weekly to incorporate new weather data. Compared with the USDA predictions, results from the proposed model have two advantages: (1) interval predictions throughout the growing season with weekly updates, and (2) county-level (as opposed to state-level) predictions with good accuracy. The pattern of increased yield prediction from April to July was caused by weather and planting time in 2018, and it varied across different counties. Our prediction continues to update until the end of December, which is more than 2 months after the end of the growing season. This is because the model is able to capture factors that affect crop yield from crop maturity to harvest, such as adverse weather conditions during harvesting.

Temporal and spatial extrapolation performance

The prediction performance of the proposed Interaction–Regression model for corn and soybean in unseen counties in the test year 2018 is reported in Table 2 . The results on the four datasets, each created by removing the historical data of some counties from the training and validation sets, suggest that the proposed approach has a satisfactory prediction performance in both temporal and spatial extrapolation.

The corn yield results reveal that a model trained on two of the three states (Illinois, Indiana, and Iowa) is able to predict corn yield in the held-out state with at most 8.98% error. In contrast, a model trained on seen locations cannot provide a sufficiently robust soybean yield prediction for unseen locations. This means that corn yield is more predictable than soybean yield at completely unseen locations with new weather, soil, and management profiles. The results also suggest that soybean yield prediction is more sensitive to the model than corn yield prediction.

figure 4

State-level predictions of corn and soybean during the growing season for three states in 2018. Our model provided weekly predictions based on observed weather information; prediction intervals were constructed using historical weather scenarios for yet-to-be-observed weather. The dashed red curve is the median prediction, and the pink interval is defined by the first and third quartiles under multiple weather scenarios, constructed using historical weather data. The dotted blue curves are USDA predictions, which were released in August, September, and October of 2018 at the state level. The solid black line indicates the actual state average yield, which was announced by USDA in February 2019. These plots were created with MATLAB R2018a (Version 9.4.0.813654 64-bit).

Explainable insights

The proposed model provided accurate predictions together with estimates of additive and interactive effects, which could help farmers, breeders, and agronomists better understand the complex and interactive relationship between environment and management. Our model selected 202 robust features and 11 two-way interactions to predict the corn yield. Out of the 202 features, 155 were for weather, 37 for soil, and 10 for management. For reference, the total number of variables is 613 (including 440 for weather, 90 for soil, and 83 for management), thus the total number of possible two-way interactions is \(613^2 = 375,769\) (quadratic effects are considered self-interactions 44 , 45 ). These features and interactions were carefully selected to balance prediction accuracy with spatial and temporal consistency. As such, the same set of features and interactions applies to all counties in the three states for all years between 2015 and 2018. Similarly, our model selected 160 robust features (including 91 for weather, 59 for soil, and 10 for management) and 12 two-way interactions to predict the soybean yield. The contributions of the selected features and interactions for corn and soybean are visualized in Fig. 5 .

The size of the bars shows the effects of variables and interactions on yield. The yield trend is a significant factor in estimating the yield of both corn and soybean. Soybean has one self-interaction, which involves the minimum temperature between October 15 and October 21 and has a negative effect on soybean yield. Corn has two self-interactions, involving cold days from April 2 to April 8 and the cumulative percentage of planted acreages from May 14 to May 20, with positive and negative effects, respectively. More weather factors are used to estimate corn yield than soybean yield. In contrast, more soil factors are used to estimate soybean yield than corn yield (59 versus 37). Corn yield is more sensitive than soybean yield to management factors. The detected interactions reveal that most of the interactions are between weather variables from April to September (emergence to reproductive stages). Moreover, temperature plays an important role in most interactions, in the form of maximum temperature, minimum temperature, and number of cold days. A close-up view of the interactions is shown in the two circular graphs in Fig. 6 , in which all 11 interactions for corn and 12 for soybean are numbered.

figure 5

The circular graphs indicate additive and interactive effects for corn and soybean. Curves inside the inner circle connect the two variables involved in the two-way interactions. The bars in the first layer around the circle represent the effects of the interactions, and the bars in the second layer show the additive effects of the features. Positive and negative effects are illustrated with red and blue colors, respectively. These plots were created with MATLAB R2018a (Version 9.4.0.813654 64-bit).

figure 6

The circular graphs show the interactions for corn (left) and soybean (right) that were discovered by the proposed model. Curves inside the inner circle connect the two variables involved in the interactions. The first layer outside the circle shows the positive (red) or negative (blue) effects of the interactions. These plots were created with MATLAB R2018a (Version 9.4.0.813654 64-bit).

We show the contributions of weather ( \(\beta _\text {W} W\) ), soil ( \(\beta _\text {S} S\) ), management ( \(\beta _\text {M} M\) ), and their interactions ( \(\beta _\text {I} I\) ) in all counties in 2015 and 2018 as violin plots in Fig. 7 . The size of each violin indicates the contribution of the corresponding variable to yield. Although the contributions change from year to year, high-impact features, including maximum and minimum temperatures, number of cold days, soil organic matter, wilting point, planting time, and yield trend, show consistently high contributions to yield over time. The contributions of the yield trend and heat units are skewed to the positive side, which means that they increase yield. The high variance in the contributions of temperature, soil organic matter, wilting point, clay percentage, and drained upper limit indicates that the counties across the US Corn Belt experienced very different climates and have a wide range of soil structures, especially in 2015. The cumulative percentage of planted acreages, which is self-interaction number 9 in corn yield prediction, has a negative impact on yield, contributing as much as \(-4.5\) t/ha in 2015 and \(-2\) t/ha in 2018. However, interactions number 6, 7, and 8 contribute positively to corn yield. Interactions play a more important role in the yield prediction of corn than of soybean. The results also reveal that weather conditions in earlier weeks of the growing season have more influence on yield than later ones, and that late planting is associated with lower yield. These findings are consistent with results from field experimental studies 41 , 46 , 47 , 48 , 49 , 50 , 51 .

figure 7

Violin plots of estimated contributions of weather (first row), soil (second row), management (third row) and interaction (fourth row) variables on corn and soybean yield in 2015 (left) and 2018 (right). Each dot on a violin plot represents a county level observation. X-axis numbers of lower panels correspond to the associated numbers to interactions in Fig. 6 . These plots were created with MATLAB R2018a (Version 9.4.0.813654 64-bit).

Insightful interactions

The upper row of Fig. 8 illustrates three of the interactions for corn using partial dependence plots, which are a popular way to show the marginal effect that one or two features have on the predicted outcome of a machine learning model.

Two-way interaction ❹ for corn: the combination of low solar radiation and high maximum temperature during the late grain filling period negatively affects corn yields. This is consistent with agronomic intuition, as low solar radiation limits the energy for photosynthesis, and high maximum temperatures are associated with additional yield losses through tissue respiration and increased evapotranspiration stress.

Self interaction ❽ for corn: average yield drops from 9.455 to 9.15 t/ha as the number of cold days in the week of April 2 increases from 0 to 4. This is insightful because soil organic matter mineralization and soil water evaporation slow down at low temperatures, leading to delayed field operations due to reduced production of nitrogen and a wetter soil surface. The upward trend of yield as the number of cold days increases from 4 to 7 is biologically counter-intuitive, but it may reveal an important agronomic insight: when the low temperatures last long enough, farmers may start to take actions (e.g., more fertilization and irrigation) to offset their negative impact on corn yield.

Self interaction ❾ for corn: completing planting by May 14 is ideal for the yield, and leaving 50% of planting unfinished by May 20 may reduce the yield by 1.25 t/ha. This is consistent with the well-known benefit of early planting 47 . It was also validated in 2019, when the weather-caused delay in planting in IL and IN led to decreased yields 34 .

The lower row of Fig. 8 illustrates two of the interactions for soybean using partial dependence plots.

Self interaction ❸ for soybean: lower temperature, even near freezing, in mid- to late-October is favorable for soybean yield.

Two-way interaction ❺ for soybean: high precipitation in mid-July makes the yield sensitive to night temperature in late August; warmer nights may lead to a 0.45 t/ha higher yield than cooler nights. It has been reported that higher temperature will negatively impact soybean yield 52 , 53 ; our results further suggest that precipitation may also affect the extent of such impact. A possible interpretation is that higher temperature accelerates leaf senescence and increases remobilization of nitrogen and dry matter from vegetative tissues to grains, and this process may be more sensitive to temperature at a higher level of soil moisture.

figure 8

The upper row indicates partial dependence plots of interactions ❹ (left), ❽ (center), and ❾ (right) for corn. The lower row shows partial dependence of interactions ❸ (left) and ❺ (right) for soybean. These plots were created with MATLAB R2018a (Version 9.4.0.813654 64-bit).
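A partial dependence curve averages a model's prediction over the data while one feature is held at a fixed value. The sketch below shows that computation in Python. The model `toy_yield`, the feature names, and the observations are hypothetical and invented for illustration; they are not the fitted model or the data behind Fig. 8.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model prediction with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the data is not mutated
            modified[feature] = value  # hold the feature of interest fixed
            total += model(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical toy model: cold days reduce yield only when radiation is low.
def toy_yield(row):
    penalty = 0.1 * row["cold_days"] if row["radiation"] < 15 else 0.0
    return 9.5 - penalty


# Made-up observations, for illustration only.
rows = [
    {"cold_days": 0, "radiation": 10},
    {"cold_days": 2, "radiation": 20},
]

pd_curve = partial_dependence(toy_yield, rows, "cold_days", [0, 2, 4])
```

Plotting `pd_curve` against the grid values would give a one-dimensional partial dependence plot; a two-way plot repeats the same averaging over a grid of value pairs.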

Dissection of crop yield

Breakdowns of observed yields in three states from 2015 to 2018 to contributions of weather ( \(\beta _\text {W} W\) ), soil ( \(\beta _\text {S} S\) ), management ( \(\beta _\text {M} M\) ), and their interactions ( \(\beta _\text {I} I\) ) are shown in Figs. 9 and 10 for corn and soybean, respectively. These contributions differ by county and change over time. In 2015, weather was the deciding variable for the yield, whereas interactions played a more important role in 2018. Due to the relatively static nature and lack of dramatic changes across the three Midwest states, soil variables demonstrated a lower effect on crop yield than the dynamic weather, management, and their interactions 28 , 54 .
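Because the four terms are additive, a predicted yield can be split into group contributions by summing coefficient times value within each group. The sketch below illustrates this. The coefficients, feature names, and the intercept are assumptions made up for the example, not values from the fitted model.

```python
def dissect_yield(coefs, features, groups, intercept=0.0):
    """Split an additive prediction into per-group contributions.

    coefs, features: dicts keyed by feature name.
    groups: dict mapping feature name -> group label ("W", "S", "M", or "I").
    """
    contributions = {"W": 0.0, "S": 0.0, "M": 0.0, "I": 0.0}
    for name, beta in coefs.items():
        contributions[groups[name]] += beta * features[name]
    prediction = intercept + sum(contributions.values())
    return contributions, prediction


# Hypothetical values, for illustration only.
coefs = {"rain_july": 0.5, "clay_pct": -0.25, "planting_progress": 2.0,
         "rain_x_planting": 1.0}
features = {"rain_july": 2.0, "clay_pct": 4.0, "planting_progress": 0.5,
            "rain_x_planting": 1.0}
groups = {"rain_july": "W", "clay_pct": "S", "planting_progress": "M",
          "rain_x_planting": "I"}

parts, total = dissect_yield(coefs, features, groups, intercept=8.0)
```

Here `parts` holds one number per group, so contributions can be compared across counties and years in the way the figures do.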

figure 9

Breakdown of observed corn yield in three states from 2015 to 2018 to contributions of weather ( \(\beta _\text {W} W\) ), soil ( \(\beta _\text {S} S\) ), management ( \(\beta _\text {M} M\) ), and their interactions ( \(\beta _\text {I} I\) ). These plots were created with R (version 3.6.3) 43 .

figure 10

Breakdown of observed soybean yield in three states from 2015 to 2018 to contributions of weather ( \(\beta _\text {W} W\) ), soil ( \(\beta _\text {S} S\) ), management ( \(\beta _\text {M} M\) ), and their interactions ( \(\beta _\text {I} I\) ). These plots were created with R (version 3.6.3) 43 .

The main contributions of the proposed model are summarized in its three salient properties compared with other machine learning models.

The first property is the use of robust features and interactions in designing a yield prediction model for year-to-year prediction. From an agronomic point of view, conventional feature selection techniques are not well suited to yield prediction, because the training data set changes from year to year and a different set of features is selected each time. Consequently, the biological conclusions drawn from these different feature sets also differ. Conventional techniques lack a robust selection structure of this kind.

Second, the proposed model addresses the limited transparency of machine learning models by deciphering environment by management interactions for corn and soybean yield. It was designed to efficiently select a subset of interactions that are spatially and temporally robust, which yields high performance and makes the model less prone to overfitting.

Third, the proposed model quantifies the contributions of weather ( \(\beta _\text {W} W\) ), soil ( \(\beta _\text {S} S\) ), management ( \(\beta _\text {M} M\) ), and their interactions ( \(\beta _\text {I} I\) ) to observed yield, whereas capable machine learning models such as neural networks, random forest, and XGBoost cannot quantify these contributions.
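One way to picture the robustness requirement in the first property is to keep only the features that a selector picks in most training years. The sketch below does that with a simple frequency filter. The `select` function, the threshold, and the yearly scores are hypothetical; this is an illustration of the idea, not the authors' heuristic algorithm.

```python
from collections import Counter


def robust_features(yearly_data, select, min_share=0.75):
    """Keep features chosen by `select` in at least `min_share` of the years."""
    counts = Counter()
    for data in yearly_data.values():
        counts.update(set(select(data)))  # count each feature once per year
    threshold = min_share * len(yearly_data)
    return sorted(f for f, n in counts.items() if n >= threshold)


# Hypothetical selector: keep features whose score exceeds 0.5.
def select(data):
    return [name for name, score in data.items() if score > 0.5]


# Made-up feature scores per training year, for illustration only.
yearly_data = {
    2015: {"rain_july": 0.9, "cold_days_april": 0.7, "clay_pct": 0.2},
    2016: {"rain_july": 0.8, "cold_days_april": 0.6, "clay_pct": 0.6},
    2017: {"rain_july": 0.7, "cold_days_april": 0.4, "clay_pct": 0.1},
    2018: {"rain_july": 0.9, "cold_days_april": 0.8, "clay_pct": 0.3},
}

stable = robust_features(yearly_data, select)
```

A feature that appears in only one year, such as `clay_pct` in this toy data, is dropped, so the interpretation does not change each time the training window moves.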

We proposed the interaction regression model for crop yield prediction, which made three major contributions. First, it outperformed state-of-the-art machine learning algorithms with respect to prediction accuracy in a comprehensive case study, which used historical data of three Midwest states from 1990 to 2018. Second, it was able to identify about a dozen E \(\times \) M interactions for corn and soybean yield, which are spatially and temporally robust and can be used to form counter-intuitive, insightful, and testable hypotheses. Third, it was able to explain the contributions of weather, soil, management, and their interactions to crop yield. Achieving these three contributions simultaneously is particularly significant, since no other crop yield prediction algorithms have been able to satisfactorily address both prediction accuracy and explainability.

The proposed model and computational experiments are not without limitations. For example, the robust feature and interaction selection algorithms were heuristic in nature, which can find high-quality solutions efficiently but do not guarantee global optimality. As the number of features increases (for example, with genetic information), the proposed heuristic algorithm may lose efficiency in terms of the running time needed to find robust features and interactions. Our model seeks only self- or two-way interactions; new models are required to discover high-order interactions between variables. The non-linear functions of interaction in this paper are limited to six defined kernel functions, which can be extended in future research. The performance of the algorithm may be further improved by applying more advanced techniques for hyperparameter tuning 55 . Due to the lack of publicly available information on genotype and management, the W, S, and M data used in our case study may be disproportional to their true contributions to crop yield. However, the proposed modeling approach was designed for both discrete and continuous explanatory variables and is capable of analyzing all G, W, S, and M variables and their interactions. Future research should explore the possibility of including additional data (such as high-dimensional genotype data, plant traits, detailed management strategies, and satellite images) to further improve prediction accuracy and make more biologically and agronomically insightful discoveries.
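Random search, described by Bergstra and Bengio in the reference list, is one technique for hyperparameter tuning: settings are sampled at random and the best one is kept. The sketch below is a minimal illustration. The validation-loss function `loss`, the parameter name `penalty`, and its range are hypothetical and do not come from the study.

```python
import random


def random_search(loss, space, n_trials, seed=0):
    """Sample `n_trials` settings uniformly from `space`; return the best."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        value = loss(params)
        if value < best_loss:
            best_params, best_loss = params, value
    return best_params, best_loss


# Hypothetical validation loss with its minimum at penalty = 1.0.
def loss(params):
    return (params["penalty"] - 1.0) ** 2


space = {"penalty": (0.0, 2.0)}  # made-up range for a regularization weight
best_params, best_loss = random_search(loss, space, n_trials=50)
```

In practice `loss` would train the model with the sampled settings and return its error on held-out years.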

Data availability

The implementation of the proposed model and dataset used in this study are available at https://github.com/ansarifar/An-Explainable-Model-for-Crop-Yield-Prediction .

Cooper, M. et al. Integrating Genetic Gain and Gap Analysis to Predict Improvements in Crop Productivity (Crop Science, 2020).

Duvick, D. Genetic progress in yield of United States maize ( Zea mays L.). Maydica 50 , 193 (2005).

Hipólito, J., Boscolo, D. & Viana, B. F. Landscape and crop management strategies to conserve pollination services and increase yields in tropical coffee farms. Agric. Ecosyst. Environ. 256 , 218–225 (2018).

Filippi, C., Mansini, R. & Stevanato, E. Mixed integer linear programming models for optimal crop selection. Comput. Oper. Res. 81 , 26–39 (2017).

Alminana, M. et al. Wische: A DSS for water irrigation scheduling. Omega 38 , 492–500 (2010).

Dai, Z. & Li, Y. A multistage irrigation water allocation model for agricultural land-use planning under uncertainty. Agric. Water Manag. 129 , 69–79 (2013).

Drummond, S. T., Sudduth, K. A., Joshi, A., Birrell, S. J. & Kitchen, N. R. Statistical and neural methods for site-specific yield prediction. Trans. ASAE 46 , 5 (2003).

Jeong, J. H. et al. Random forests for global and regional crop yield predictions. PLoS One 11 , 210 (2016).

Liu, J., Goering, C. & Tian, L. A neural network for setting target corn yields. Trans. ASAE 44 , 705 (2001).

Kaul, M., Hill, R. L. & Walthall, C. Artificial neural networks for corn and soybean yield prediction. Agric. Syst. 85 , 1–18 (2005).

Crane-Droesch, A. Machine learning methods for crop yield prediction and climate change impact assessment in agriculture. Environ. Res. Lett. 13 , 114003 (2018).

Russello, H. Convolutional Neural Networks for Crop Yield Prediction Using Satellite Images (IBM Center for Advanced Studies, 2018).

You, J., Li, X., Low, M., Lobell, D. & Ermon, S. Deep Gaussian process for crop yield prediction based on remote sensing data. In Thirty-First AAAI Conference on Artificial Intelligence (2017).

Marko, O., Brdar, S., Panic, M., Lugonja, P. & Crnojevic, V. Soybean varieties portfolio optimisation based on yield prediction. Comput. Electron. Agric. 127 , 467–474 (2016).

Ansarifar, J., Akhavizadegan, F. & Wang, L. Performance prediction of crosses in plant breeding through genotype by environment interactions. Sci. Rep. 10 , 1–11 (2020).

Romero, J. R. et al. Using classification algorithms for predicting durum wheat yield in the province of Buenos Aires. Comput. Electron. Agric. 96 , 173–179 (2013).

González-Camacho, J. M. et al. Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome 11 , 1–15 (2018).

Basnet, B. R. et al. Hybrid wheat prediction using genomic, pedigree, and environmental covariables interaction models. Plant Genome 12 , 1–13 (2019).

González-Camacho, J. M., Crossa, J., Pérez-Rodríguez, P., Ornella, L. & Gianola, D. Genome-enabled prediction using probabilistic neural network classifiers. BMC Genom. 17 , 208 (2016).

Keating, B. A. et al. An overview of APSIM, a model designed for farming systems simulation. Eur. J. Agron. 18 , 267–288 (2003).

Basso, B., Liu, L. & Ritchie, J. T. A comprehensive review of the CERES-wheat,-maize and -rice models’ performances. In Advances in Agronomy Vol. 136 27–132 (Elsevier, 2016).

Monsi, M. & Saeki, T. On the factor light in plant communities and its importance for matter production. Ann. Bot. 95 , 549 (2005).

Ahuja, L. & Ma, L. Methods of Introducing System Models into Agricultural Research (American Society of Agronomy, 2011).

Eitzinger, J., Trnka, M., Hösch, J., Žalud, Z. & Dubrovskỳ, M. Comparison of CERES, WOFOST and SWAP models in simulating soil water content during growing season under different soil conditions. Ecol. Model. 171 , 223–246 (2004).

Heslot, N., Akdemir, D., Sorrells, M. & Jannink, J.-L. Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions. Theor. Appl. Genet. 127 , 463–480 (2014).

Bassu, S. et al. How do various maize crop models vary in their responses to climate change factors?. Glob. Change Biol. 20 , 2301–2320 (2014).

Lamsal, A. et al. Efficient crop model parameter estimation and site characterization using large breeding trial data sets. Agric. Syst. 157 , 170–184 (2017).

Puntel, L. A., Pagani, A. & Archontoulis, S. V. Development of a nitrogen recommendation tool for corn considering static and dynamic variables. Eur. J. Agron. 105 , 189–199 (2019).

Akhavizadegan, F., Ansarifar, J., Wang, L., Huber, I. & Archontoulis, S. V. A time-dependent parameter estimation framework for crop modeling. Sci. Rep. 11 , 1–15 (2021).

Santos, J. & Barrios, E. Robust inference in semiparametric spatial-temporal models. Commun. Stat. Simul. Comput. 20 , 1–20 (2019).

Nogueira, S., Sechidis, K. & Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 18 , 6345–6398 (2017).

Iowa Environmental Mesonet. https://mesonet.agron.iastate.edu .

Gridded Soil Survey Geographic Database. https://gdg.sc.egov.usda.gov .

National Agricultural Statistics Service. https://quickstats.nass.usda.gov .

Ansarifar, J. & Wang, L. New algorithms for detecting multi-effect and multi-way epistatic interactions. Bioinformatics 35 , 5078–5085 (2019).

Ripley, B. et al. Mass: Support functions and datasets for venables and Ripley’s mass. R Package Version 7-3 (2011).

Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 , 1 (2010).

Wright, M. N. & Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv:1508.04409 (arXiv preprint) (2015).

Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining , 785–794 (ACM, 2016).

Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Archontoulis, S. V. et al. Predicting crop yields and soil-plant nitrogen dynamics in the US corn belt. Crop Sci. 60 , 721–738 (2020).

Kim, N. et al. A comparison between major artificial intelligence models for crop yield prediction: Case study of the midwestern United States, 2006–2015. ISPRS Int. J. Geo Inf. 8 , 240 (2019).

Hornik, K. R FAQ. https://CRAN.R-project.org/doc/FAQ/R-FAQ.html (2020).

Alvarez, R. & Grigera, S. Analysis of soil fertility and management effects on yields of wheat and corn in the rolling pampa of Argentina. J. Agron. Crop Sci. 191 , 321–329 (2005).

Leeper, R., Runge, E. & Walker, W. Effect of plant-available stored soil moisture on corn yields. I. Constant climatic conditions 1. Agron. J. 66 , 723–727 (1974).

Kessler, A., Archontoulis, S. V. & Licht, M. A. Soybean yield and crop stage response to planting date and cultivar maturity in Iowa, USA. Agron. J. 112 , 382–394 (2020).

Baum, M., Archontoulis, S. & Licht, M. Planting date, hybrid maturity, and weather effects on maize yield and crop stage. Agron. J. 111 , 303–313 (2019).

Fan, Y., Li, H. & Miguez-Macho, G. Global patterns of groundwater table depth. Science 339 , 940–943 (2013).

Rizzo, G., Edreira, J. I. R., Archontoulis, S. V., Yang, H. S. & Grassini, P. Do shallow water tables contribute to high and stable maize yields in the US corn belt?. Glob. Food Sec. 18 , 27–34 (2018).

Pasley, H. R. et al. Nitrogen rate impacts on tropical maize nitrogen use efficiency and soil nitrogen depletion in eastern and southern Africa. Nutr. Cycling Agroecosyst. 20 , 1–12 (2020).

Nichols, V. A. et al. Maize root distributions strongly associated with water tables in Iowa, USA. Plant Soil 444 , 225–238 (2019).

Wilhelm, W. & Wortmann, C. S. Tillage and rotation interactions for corn and soybean grain yield as affected by precipitation and air temperature. Agron. J. 96 , 425–432 (2004).

Zhao, C. et al. Temperature increase reduces global yields of major crops in four independent estimates. Proc. Natl. Acad. Sci. 114 , 9326–9331 (2017).

Zipper, S. C., Soylu, M. E., Booth, E. G. & Loheide, S. P. Untangling the effects of shallow groundwater and soil texture as drivers of subfield-scale yield variability. Water Resour. Res. 51 , 6338–6358 (2015).

Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 , 281–305 (2012).

This work was partially supported by the National Science Foundation under the LEAP HI and GOALI programs (Grant number 1830478) and under the EAGER program (Grant number 1842097).

Author information

Authors and affiliations.

Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, IA, 50011, USA

Javad Ansarifar & Lizhi Wang

Department of Agronomy, Iowa State University, Ames, IA, 50011, USA

Sotirios V. Archontoulis

Contributions

J.A., L.W., and S.V. designed the research questions. J.A. prepared and cleaned the database. J.A. performed the experiment, statistical analysis, and analyzed the dataset. J.A. designed and implemented a new algorithm. J.A. created the figures. J.A., L.W., and S.V. interpreted experiment results. J.A., L.W., and S.V. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Javad Ansarifar .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary information 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Ansarifar, J., Wang, L. & Archontoulis, S.V. An interaction regression model for crop yield prediction. Sci Rep 11 , 17754 (2021). https://doi.org/10.1038/s41598-021-97221-7

Received : 05 February 2021

Accepted : 23 August 2021

Published : 07 September 2021

DOI : https://doi.org/10.1038/s41598-021-97221-7

Crop prediction using machine learning

Madhuri Shripathi Rao 1 , Arushi Singh 1 , N.V. Subba Reddy 1 and Dinesh U Acharya 1

Published under licence by IOP Publishing Ltd Journal of Physics: Conference Series , Volume 2161 , 1st International Conference on Artificial Intelligence, Computational Electronics and Communication System (AICECS 2021) 28-30 October 2021, Manipal, India Citation Madhuri Shripathi Rao et al 2022 J. Phys.: Conf. Ser. 2161 012033 DOI 10.1088/1742-6596/2161/1/012033

Author affiliations.

1 Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal, 576104, Udupi district Karnataka, India

For most developing countries, agriculture is the primary source of revenue. Modern agriculture is a constantly evolving field of agricultural advances and farming techniques. It is challenging for farmers to satisfy our planet's evolving requirements and the expectations of merchants, customers, etc. Some of the challenges farmers face are: (i) dealing with climatic changes caused by soil erosion and industry emissions; (ii) nutrient deficiency in the soil, caused by a shortage of crucial minerals such as potassium, nitrogen, and phosphorus, which can result in reduced crop growth; (iii) cultivating the same crops year after year without experimenting with different varieties, and adding fertilizers randomly without understanding their quality or quantity. The paper aims to discover the best model for crop prediction, which can help farmers decide the type of crop to grow based on the climatic conditions and nutrients present in the soil. This paper compares popular algorithms such as K-Nearest Neighbor (KNN), Decision Tree, and Random Forest Classifier using two different criteria, Gini and Entropy. Results reveal that Random Forest gives the highest accuracy among the three.
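The Gini and Entropy criteria mentioned above are two standard ways of scoring how mixed the class labels at a tree node are. The sketch below computes both from their textbook definitions. The crop labels are made up for illustration and are not from the paper's dataset.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


# Made-up crop labels at a tree node, for illustration only.
node = ["rice", "rice", "maize", "maize"]
pure = ["rice", "rice", "rice", "rice"]
```

Both scores are zero for a pure node and largest for an evenly mixed one; a tree learner chooses the split that reduces the chosen score the most.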

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence . Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Crop Prediction using Machine Learning

Title: A Machine Learning Approach for Crop Yield and Disease Prediction Integrating Soil Nutrition and Weather Factors

Abstract: The development of an intelligent agricultural decision-supporting system for crop selection and disease forecasting in Bangladesh is the main objective of this work. The economy of the nation depends heavily on agriculture. However, choosing crops with better production rates and efficiently controlling crop disease are obstacles that farmers have to face. These issues are addressed in this research by utilizing machine learning methods and real-world datasets. The recommended approach uses a variety of datasets on the production of crops, soil conditions, agro-meteorological regions, crop disease, and meteorological factors. These datasets offer insightful information on disease trends, soil nutrition demand of crops, and agricultural production history. By incorporating this knowledge, the model first recommends the list of primarily selected crops based on the soil nutrition of a particular user location. Then the predictions of meteorological variables like temperature, rainfall, and humidity are made using SARIMAX models. These weather predictions are then used to forecast the possibilities of diseases for the primary crops list by utilizing the support vector classifier. Finally, the developed model makes use of the decision tree regression model to forecast crop yield and provides a final crop list along with associated possible disease forecast. Utilizing the outcome of the model, farmers may choose the best productive crops as well as prevent crop diseases and reduce output losses by taking preventive actions. Consequently, planning and decision-making processes are supported and farmers can predict possible crop yields. Overall, by offering a detailed decision support system for crop selection and disease prediction, this work can play a vital role in advancing agricultural practices in Bangladesh.

  • Volume 11, Number 3

Crop Selection and Yield Prediction using Machine Learning Approach

Department of Information Technology, AISSMS Institute of Information Technology, Pune, India.

Corresponding Author Email : [email protected]

DOI : http://dx.doi.org/10.12944/CARJ.11.3.26

In recent years, the agriculture sector has been researched extensively alongside advances in technologies like machine learning and smart computing. With the dynamic economics of agri-produce, it is becoming challenging for farmers to utilize land efficiently to get maximum profit in a specific landscape. Crop Yield Prediction (CYP) is crucial and depends greatly on environmental factors like soil contents, humidity, and rainfall, as well as area under cultivation and other required metrics. Because they insufficiently incorporate these multiple environmental circumstances, a number of existing tools and techniques used for CYP, such as historical averages, tend to produce inaccurate findings. In such a situation, with multiple crop options, it is essential for farmers to plan the crop strategy in advance. If the farmer can get an estimate of the crop yield in advance, cultivation can be planned accordingly. To solve this problem, a machine learning approach is implemented as a base for accurate predictions. Crop prediction is done by a classification model, and yield prediction uses regression models to learn from the data. Multiple ML models are analyzed based on performance metrics, and the best-performing model is incorporated in the backend. Among the models used for yield prediction, Random Forest Regression gives the best results, with an MAE of 0.64 and an R2 score of 0.96. For crop prediction, the Naïve Bayes classifier gives the most accurate results, with an accuracy of 99.39. The study emphasizes how machine learning could revolutionize crop management techniques by giving farmers insights into optimizing resource allocation and boosting overall crop yield.

Crop Yield Prediction; Digital Agriculture; Machine Learning; Naïve Bayes; Random Forest
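The two regression metrics quoted in the abstract, MAE and the R2 score, can be computed directly from their standard definitions. The sketch below does so on made-up yield values; the numbers are for illustration only and are unrelated to the reported 0.64 and 0.96.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted yields, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

error = mae(y_true, y_pred)
fit = r2_score(y_true, y_pred)
```

MAE is in the same units as the yield, while R2 is unitless, which is why the two are usually reported together.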

Introduction  

The field of machine learning is advancing day by day. Learning is needed when a computer program cannot be designed directly to solve a particular problem and must instead rely on sample data or experience. It becomes important when there is no human expertise, or when people are unable to express their expertise.

Computers are programmed with machine learning in order to improve performance criteria based on actual or hypothetical facts. A computer program learns to optimize the parameters of a model using training input or previous information. The model may be descriptive, drawing conclusions from the data, or predictive, estimating future trends. 1 A subset of artificial intelligence (AI), machine learning (ML) enables computers to learn a specific task, such as playing chess or making recommendations on social networks, without having to be explicitly programmed. Precision farming and agri-technology, now referred to as Digital Agriculture, are evolving into emerging research fields that employ highly data-driven techniques to boost agricultural productivity while reducing adverse effects on the environment. Machine learning, alongside big data technology and robust computing infrastructure, has arisen to create potential solutions for unravelling, quantifying, and comprehending data-intensive processes in agricultural operational environments. Data analysis, as an evolved scientific discipline, is essential to the development of a wide range of crop management applications. Often, ML can be used efficiently without having to integrate data from many sources. There tends to be less emphasis on data integration when large datasets are easily available, especially on a major scale. The main force behind this development is the complexity of data preprocessing and analytical processes, as opposed to the generally straightforward implementation of machine learning models. 2 The agriculture sector contributed almost 20% of India’s GDP in the year 2019-20. 3 It is also the principal source of employment in India. In addition to being a significant part of the global economy, it is crucial for the continued existence of humanity.
Weather, pests, and the readiness of harvesting operations are the main factors that influence agricultural production. For managing agricultural risk, it is essential to have accurate crop history information. 4 As the population grows, unethical practices are being used to produce higher yields of less-nutritious hybrid cultivars. These techniques tend to harm soil quality and result in environmental loss. Given the changing patterns of weather conditions and economics, it is getting difficult for farmers to choose the right crop. The use of various fertilizers is also unclear because of seasonal climate variations and changes in the availability of fundamental resources like soil, water, and air. The agricultural yield rate is continuously decreasing in this situation. 5 Farmers today cultivate crops based on knowledge gained from earlier generations. Since the traditional method of cultivation has been refined, there are either excessive or insufficient yields without really meeting the need. 6 If the producer knows yield estimates in advance, it helps in forming the crop strategy. Machine learning is a rapidly expanding methodology that supports and guides decision processes in applications across many industries. The majority of modern gadgets benefit from models being examined before deployment. The primary idea is to increase the efficiency and profits of the agriculture industry by using data as a tool with models. Precision farming, which prioritizes quality above unfavorable environmental variables, would be the main focus. 7 ML has advanced its applications in agriculture in areas like predicting soil properties, rainfall analysis, yield prediction, disease and weed detection, ML-based computer vision, and many more. 8

The use of computer vision, machine learning, and IoT applications will help boost productivity, enhance quality, and ultimately increase the profitability of farmers and related industries. To increase the overall harvesting output, precision farming is crucial in the world of agriculture. 9 Smart irrigation systems, crop disease prediction, crop selection, weather forecasting, and determining the minimum support price are all examples of techniques employed in agriculture. These methods will increase field productivity while requiring less work from farmers. 10 Crop yield estimation may be used for a variety of purposes, including helping farmers enhance production, optimizing the supply-demand cycle for fertilizers, insecticides, and other agricultural products, predicting prices, and calculating the risk levels for agricultural insurance. 11

Literature Review  

Prior research 12 used data that included nutrients and other environmental elements to predict crops. For CYP, several feature selection techniques and ML models were employed. To assess the effectiveness of the feature selection and classification algorithms, F1 Score, Mean Absolute Error (MAE), Logarithmic Loss (LL), Accuracy (ACC), Specificity (S), Recall (R), Precision (P), and area under the curve (AUC) were utilized. Using Modified Recursive Feature Elimination (MRFE), six variables were selected: average soil temperature, average air temperature, minimum and maximum air temperatures, precipitation, and humidity. A variety of data splits, including (25-75), (30-70), (35-65), (40-60), (45-55), (50-50), (55-45), (60-40), (65-35), (70-30), and (75-25), were used and evaluated against the previously stated accuracy criteria. Additionally, the feature selection techniques MRFE, RFE, and Boruta were applied. According to the results, the Random Forest classifier is the most accurate in comparison with kNN and the other classifiers compared. As characteristic ranges broadened, the measurement values decreased.
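Several of the classification metrics listed above follow from the four counts of a confusion matrix. The sketch below computes them for a binary simplification with made-up labels; the cited study's own data and multi-class setup are not reproduced here, and the function assumes every denominator is non-zero.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity, and F1 for 0/1 labels.

    Assumes both classes occur and the positive class is predicted at
    least once, so that no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up labels: 1 = crop suitable, 0 = not suitable.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

Repeating the calculation for each train-test split ratio gives the kind of comparison table the study describes.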

Another study, by Anakha Venugopal, Jinsu Mani, Aparna S, Rima Mathew, and Prof. Vinu Williams, 13 uses several machine learning approaches to forecast agricultural production. By taking into account variables like temperature, rainfall, and area, farmers will be able to select the crop that will provide the highest yield using the forecasts made by ML models. The study focuses on Kerala’s agricultural produce. Among the classifier models used, Random Forest has the highest accuracy, followed by Logistic Regression and Naive Bayes.

In another study, 14 the proposed method uses a smartphone app to connect farmers to the internet. GPS helps locate the user, who enters the location and soil type. Machine learning algorithms then pick the most profitable crops and forecast yields for user-selected crops. Machine learning models, including random forest (RF), artificial neural network (ANN), support vector machine (SVM), multivariate linear regression (MLR), and k-nearest neighbor (KNN), were built to estimate crop productivity and compared for accuracy on data from the states of Maharashtra and Karnataka. The RF regressor gave the best results, with 95% accuracy on the presented datasets. The algorithm also recommends when to apply fertilizers to increase yield. The research examined the limitations of existing approaches and their applicability to yield prediction. Users may select from a number of attributes to help them choose a crop, and the integrated recommendation system lets them research possible crops and their yields in order to make better-informed decisions.

In one study, 15 the Random Forest algorithm is used. Despite extensive research into factors such as weather, temperature, humidity, and rainfall, there are still no satisfactory solutions to the problem. In nations like India, many sectors contribute to economic growth, including agriculture. Crop yield predictions can also be made by processing such data. The study demonstrated the value of data mining techniques for predicting agricultural output from climate-related input features. All new grains and regions chosen for the investigation should have prediction accuracy above 75%, demonstrating improved predictive performance. The resulting website, developed using data from that area to predict crop yield, is user-friendly.

According to a study, 16 selecting the best crop before sowing will increase agricultural yield. The choice depends on a variety of factors, such as soil type and composition, climate, local terrain, crop yield, and market prices. Techniques like Decision Trees, K-nearest Neighbors, and Artificial Neural Networks have a place in the crop selection framework. Machine learning has been used to choose crops based on how disasters such as famine could affect them. Researchers have successfully employed artificial neural networks to choose crops depending on soil and climate.

When attempting to create a high-performance predictive model, ML studies face a variety of difficulties. To tackle the issue at hand, it is essential to choose the appropriate algorithms, and both the algorithms and the supporting platforms must be able to handle the sheer amount of data. 17

A study 18 suggested a method for unsupervised fuzzy classification that identifies crop types with springtime harvests. The classification outcomes are likely to improve with time. The strategy used in 19 relied on a Bayesian network, a supervised classification model. Crop information is analyzed together with environmental parameters like temperature and rainfall to categorize crops.

A study by D. A. Reddy, B. Dadore, and A. Watekar 20 highlights that, despite India being one of the nations with the highest agricultural output, its agricultural productivity is still fairly low. Productivity needs to be increased so that farmers can earn better profits at lower cost. To propose a suitable crop reliably from soil data, the study offers a recommender built on an ensemble approach with majority voting, employing random tree, CHAID, kNN, and naive Bayes classifiers. Soil types, soil characteristics, and crop yield data are taken into consideration when advising the farmer on the best crop to grow. Majority voting, the most popular ensemble technique, is used in this system. Any number of base learners may be used in the voting process, but at least two are required. The learners are chosen to be both competitive with and complementary to one another; stronger competition among them raises the chance of a better prediction. The specified training dataset is used to train each model. When a new record has to be classified, each model predicts a class independently, and the class predicted by the majority of learners is chosen as the label for that record.
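The majority voting step described above can be illustrated with a minimal sketch. The function name `majority_vote` and the example labels are hypothetical and are not taken from the cited study; ties are broken here by whichever label was seen first, which is only one possible convention.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by most base learners.

    `predictions` holds one label per base learner for a single record.
    At least two learners are required, as in the description above.
    """
    if len(predictions) < 2:
        raise ValueError("majority voting needs at least two base learners")
    # most_common(1) returns the label with the highest count;
    # on a tie, Counter keeps the label that was encountered first.
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes from four base learners for one soil record.
votes = ["rice", "maize", "rice", "rice"]
label = majority_vote(votes)
```

In this example three of the four learners agree, so the record is labelled with their shared prediction.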

Another study 21 builds a random forest, an ensemble of decision trees, from datasets on temperature, production, precipitation, and rainfall. Each tree is grown on about two-thirds of the records and then applied to the remaining records to check its classification accuracy. For crop production prediction based on the input attributes, the test data are applied to the trained model. The RF method and the dataset were used to evaluate the efficacy of this technique. An advantage of random forests is that overfitting is less of an issue than with a single decision tree, and the forest does not need to be pruned. The loaded datasets are divided into training and test sets of 67% and 33% (0.67 and 0.33), respectively. The training data must be categorized so that attribute values can be mapped to the appropriate classes. Probabilities are determined by comparing the data with the model’s predictions, and the class with the highest likelihood is used as the forecast. The accuracy is calculated by comparing the predicted class values with the test dataset.

According to a different study, 22 agriculture has positive economic effects on the country but falls short in its use of modern machine learning techniques. Farmers therefore ought to be knowledgeable about recent machine learning technology and new approaches, which increase agricultural productivity and can help with agricultural problems. The accuracy of yield predictions can be assessed by comparing several methods, and comparing accuracy across several crops can improve performance further. Sensor technology is also widely used in agriculture. The study helps increase agricultural yield rates and helps choose the right crop for the chosen site and season.

Materials and Methods

Data Pre-processing

Data pre-processing is a technique that transforms raw, uncleaned data into a form ready for further analysis. Data may be gathered from multiple sources, but because they are collected in raw form, they cannot be analyzed directly. We convert the data into a usable format with several strategies, such as substituting missing and null values. Fields in the dataset that are insignificant for label prediction are eliminated. If required, One-Hot Encoding is applied so that the dataset is ready for regression model fitting. The division into training and test data is the final stage of data preparation. Because training a machine learning algorithm usually benefits from as many data points as possible, the split is typically uneven. The training dataset, which in this case makes up 70% of the total data, is used to fit the machine learning models; the remaining 30% is held out for testing.

Factors Affecting Crop Yield

The yield of every crop is impacted by a wide range of variables. These are essentially the characteristics that aid in estimating a crop yield. For crop yield prediction, this study includes parameters such as temperature, rainfall, area, humidity, soil nutrients, pH, and AUC (Area under Cultivation).

Comparison and Selection of ML Algorithm

We must first assess and compare different algorithms before selecting the one that best fits this particular dataset. Machine learning is an effective way to solve the crop prediction problem because it learns from past data and gives predictions for current parameters. To make precise predictions and cope with erratic patterns in weather conditions like temperature and rainfall, machine learning classifiers such as Logistic Regression, Naïve Bayes, Random Forest, and KNN are compared on performance metrics, and the model with the best accuracy is selected for crop prediction.
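A comparison of this kind can be sketched as follows. This is only an illustration on synthetic data generated with scikit-learn's `make_blobs`, not the crop dataset used in this work; the hyperparameters (`max_iter`, `n_estimators`, `n_neighbors`) are assumptions chosen for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the crop data: 3 classes, 7 numeric features.
X, y = make_blobs(n_samples=300, centers=3, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The four classifier families named in the text, with assumed settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Select the model with the highest test accuracy.
best_name = max(scores, key=scores.get)
```

Selecting on a single held-out split keeps the sketch short; cross-validation would give a more stable ranking when the scores are close.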

For Yield prediction, regressors like Linear Regression, Random Forests and Decision Trees Regression are compared for metrics like MAE, Median Absolute Error and R2 Score. The model with best values is selected for predicting yield.
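To make the three regression metrics concrete, the sketch below computes them by hand on four made-up observations (the numbers are illustrative only and are not results from this work). scikit-learn provides equivalent functions with the same names in `sklearn.metrics`.

```python
from statistics import mean, median


def mean_absolute_error(y_true, y_pred):
    # Average size of the prediction errors.
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def median_absolute_error(y_true, y_pred):
    # Middle value of the absolute errors; robust to a few large misses.
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares).
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up yields (tonnes per hectare) and predictions.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)       # 0.5
medae = median_absolute_error(y_true, y_pred)   # 0.5
r2 = r2_score(y_true, y_pred)                   # approximately 0.925
```

Lower MAE and median absolute error are better, while an R2 score closer to 1 indicates that the model explains more of the variance in yield.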

Naïve Bayes

Based on Bayes’ theorem, the Naïve Bayes model is frequently employed in classification tasks. Its three common variants are the multinomial, Bernoulli, and Gaussian Naïve Bayes algorithms. It operates under the presumption that every feature contributes equally to the outcome and that each feature is conditionally independent of all other features given the class. Bayes’ theorem gives the probability of an event given that another event has occurred, and it extends naturally to multi-class classification. In comparison to other ML techniques, Naïve Bayes is quicker and simpler to construct, and it does not need a lot of training data. It may be used with both discrete and continuous data. It is highly scalable and relatively insensitive to irrelevant features.
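The independence assumption can be written compactly. In the notation assumed here (not used elsewhere in this paper), y is the crop class and x_1, ..., x_n are the input features; the classifier picks the class with the largest posterior score:

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The multinomial, Bernoulli, and Gaussian variants differ only in how the per-feature likelihood P(x_i | y) is modelled.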

Decision Trees

A decision tree (DT) is a tree structure that resembles a flowchart and is frequently employed in supervised machine learning for classification and prediction. A DT may be transformed into a set of rules, with each path from the root node to a leaf node serving as a separate rule. Each internal node corresponds to a test on an attribute, and each leaf node holds the class assigned when the conditions along the branches leading to it are satisfied. 6

K-Nearest Neighbors (kNN)

kNN is a supervised, nonparametric machine learning approach used to solve classification and regression problems. As a supervised algorithm, it works with labelled data. The technique relies on the distances between points, which may be calculated in several ways. A distance must always be zero or positive, which is ensured by squaring the coordinate differences, raising them to a given power, or taking their absolute values. All labelled data must be pre-processed before the kNN algorithm is applied. First, the data must be normalized. Because kNN struggles when too many features are present, feature selection must then be used to eliminate the insignificant ones. Missing data must be filled in; otherwise, the affected record must be removed. Performance can be enhanced by including more training samples. The fundamental drawback of kNN is that, as the dataset grows, the cost of computing distances rises and the algorithm slows down.
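The distance options mentioned above are all special cases of the Minkowski distance. The following pure-Python sketch (function names, points, and labels are hypothetical) shows how the power `p` selects the Manhattan (`p=1`) or Euclidean (`p=2`) distance and how the k nearest labelled points vote on a class.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: minkowski(pair[0], query, p),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Two small, well-separated groups of already-normalized points.
train_X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["rice", "rice", "rice", "maize", "maize", "maize"]

near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
near_far_group = knn_predict(train_X, train_y, (5.5, 5.5))
```

Every prediction sorts the whole training set by distance, which is the computational cost that grows with dataset size.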

Random Forests (RF)

The RF technique is a typical example of ensemble learning, since it combines several classifiers to tackle a challenging problem and improve a model’s performance. The “forest” created with this approach is a collection of decision trees. At each split, the candidate features are chosen at random, which reduces the correlation across trees. Random Forest belongs to the bagging family of ensemble learning methods: a sample of rows and features from the primary dataset is selected at random and fed to each decision tree, and the final output is generated by combining the results of all the trees. It can carry out both regression and classification tasks. It also works well with large, high-dimensional datasets and, most significantly, it typically improves accuracy and reduces overfitting relative to a single decision tree.
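As a rough illustration of these ideas, the sketch below fits scikit-learn's `RandomForestClassifier` on synthetic data (not the datasets used in this work). The settings `n_estimators=25`, `max_features="sqrt"`, and `bootstrap=True` are assumptions chosen to make the row sampling and per-split feature sampling explicit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class problem with 7 features, 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=25,      # number of decision trees in the forest
    max_features="sqrt",  # random subset of features tried at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=0,
)
forest.fit(X_train, y_train)

# The fitted trees are exposed individually; the forest combines their outputs.
n_trees = len(forest.estimators_)
accuracy = forest.score(X_test, y_test)
```

The same interface is available for regression through `RandomForestRegressor`.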

System Architecture

The system architecture is shown in Figure 1. The end user interacts with a web user interface (UI) hosted on a server. The OpenWeatherMap API is connected to the server to deliver weather data. The machine learning models are trained and tested by the admin and loaded on the server to predict the crop and its yield in tonnes per hectare of land area.

The public datasets have been chosen because they are readily available and easily accessible. Kaggle is a popular platform for finding and sharing datasets, so we were able to find datasets that met our criteria. We selected 3 datasets namely:

India Agriculture Crop Production 23

The dataset has the following features: State, District, Crop, Year, Season, Area, Area Units, Production, Production Units, Yield. This dataset is used to build the regression model for yield prediction; Yield is the required label. It has a total of 12176 records containing 27 unique crops and 4 unique seasons. The crops are as follows: Arhar/tur, bajra, castor seed, gram, groundnut, jowar, linseed, maize, moong, niger seed, other cereals, kharif pulses, rabi pulses, summer pulses, ragi, rapeseed and mustard, rice, safflower, sesamum, millets, soyabean, sugarcane, sunflower, tobacco, urad, wheat, oilseeds.

District wise rainfall normal 23

This dataset provides district-wise rainfall data for yield prediction. It is used to extract the average annual rainfall for each district of Maharashtra, India. This feature is combined with the yield dataset mentioned above to estimate production for a particular crop in a given season.

Crop Recommendation 23

The dataset is used for crop prediction. It has features like N, P, K, rainfall, humidity, pH and crop. N, P, and K stand for the Nitrogen, Phosphorous, and Potassium nutrients in the soil. It has 2200 total records containing 22 unique crops. The data consist of 100 records for each of the following crops: rice, maize, chickpea, kidney beans, pigeon peas, moth beans, mung beans, black gram, lentil, pomegranate, banana, mango, grapes, watermelon, muskmelon, apple, orange, papaya, coconut, cotton, jute, coffee.

Data pre-processing

Crop prediction

Before modelling the data, we need to carry out data pre-processing. It is done in the following steps, as shown in the figure.

Handling Missing Values: The line df = pd.read_csv('Crop_recommendation2.csv', na_values='=') reads the CSV file into a DataFrame (df), replacing any occurrences of '=' with NaN values, which pandas uses to represent missing data.

Separating the Target Variable: The line b = df['label'] extracts the target variable column ('label') from the DataFrame df and assigns it to the variable b.

Creating a Preprocessing Pipeline: The code creates a pipeline (my_pipeline) using scikit-learn’s Pipeline class. The pipeline consists of two steps:

Imputation: The missing values in the DataFrame are imputed using the SimpleImputer transformer with a strategy set to “mean”. This strategy replaces the missing values with the mean of the corresponding column.

Standardization: The features are standardized using the StandardScaler transformer. This step scales the features to have zero mean and unit variance.

Applying the Preprocessing Pipeline: The line X = my_pipeline.fit_transform(df) applies the preprocessing pipeline (my_pipeline) to the entire DataFrame (df). It fits the pipeline on the data to learn the mean values (for imputation) and the standardization parameters. Then, it transforms the data by applying the learned transformations.

Train-Test Split: The train_test_split() function from scikit-learn is used to split the processed features (X) and the target variable (b) into training and testing sets. The stratify=b parameter ensures that the class distribution is maintained in both the training and testing sets. The split is performed with a test size of 30% (test_size=0.3) and a random state of 42 (random_state=42).

These pre-processing steps help handle missing values, standardize the features, and split the data into training and testing sets for further analysis and model training.
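The steps above can be put together as in the following sketch. It reuses the names `df`, `b`, `my_pipeline`, and `X` from the text but runs on a small made-up DataFrame instead of the CSV file, and it assumes that the non-numeric 'label' column is dropped before the pipeline is applied, since mean imputation only works on numeric columns. The feature values are random placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the crop recommendation data: 2 crops, 10 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "N": rng.uniform(0, 140, 20),
    "P": rng.uniform(5, 145, 20),
    "K": rng.uniform(5, 205, 20),
    "ph": rng.uniform(3.5, 9.5, 20),
    "label": ["rice"] * 10 + ["maize"] * 10,
})
df.loc[3, "ph"] = np.nan  # simulate a missing value, as na_values would produce

b = df["label"]                      # target variable
features = df.drop(columns="label")  # assumption: only numeric predictors are scaled

my_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # fill NaN with the column mean
    ("scaler", StandardScaler()),                 # zero mean, unit variance
])
X = my_pipeline.fit_transform(features)

# Stratified 70/30 split that preserves the class proportions.
X_train, X_test, b_train, b_test = train_test_split(
    X, b, test_size=0.3, stratify=b, random_state=42
)
```

One design choice worth noting: fitting the pipeline on the full data, as described, lets the test rows influence the imputation means and scaling parameters. Fitting it on the training split only and then calling `transform` on the test split would avoid that.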

Yield Prediction

In data pre-processing, we cleaned the data by addressing missing values, outliers, and errors before the data were used for machine learning. We also integrated the District wise rainfall normal 23 and India Agriculture Crop Production 23 datasets, since they had to be merged before being passed to the machine learning model. The India Agriculture Crop Production 23 dataset contains categorical variables that must be encoded as numerical values, so we transformed it using One-Hot Encoding. Finally, we reduced the data to the state of Maharashtra, because the full dataset would otherwise have been too large in terms of rows and columns.
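A minimal pandas sketch of the reduction, integration, and encoding steps is given below. The rows and values are made up for illustration, and the rainfall column name `Annual_Rainfall_mm` is an assumption; the other column names follow the dataset description given earlier.

```python
import pandas as pd

# Made-up rows in the shape of the crop production dataset.
production = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Karnataka"],
    "District": ["Pune", "Nagpur", "Mysuru"],
    "Crop": ["rice", "wheat", "rice"],
    "Season": ["Kharif", "Rabi", "Kharif"],
    "Yield": [2.0, 1.5, 2.5],
})

# Made-up district-wise rainfall normals (hypothetical column name).
rainfall = pd.DataFrame({
    "District": ["Pune", "Nagpur"],
    "Annual_Rainfall_mm": [700.0, 1000.0],
})

# Data reduction: keep only Maharashtra.
maharashtra = production[production["State"] == "Maharashtra"]

# Data integration: attach the rainfall normal of each district.
merged = maharashtra.merge(rainfall, on="District", how="left")
merged = merged.drop(columns="State")

# Data transformation: one-hot encode the categorical columns.
encoded = pd.get_dummies(merged, columns=["District", "Crop", "Season"])
```

Each categorical column becomes one indicator column per distinct value, which is why restricting the data to a single state also limits the number of columns.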

Feature Selection

A machine learning model’s performance can be improved through feature selection, which is the process of choosing a subset of the relevant features from available data. For Crop Prediction, following features were selected: Nitrogen, Phosphorous, Potassium, Temperature, Humidity, pH and Rainfall. For Yield Prediction, features selected are as follows: City, Crop, Annual Rainfall (in mm), Season.

Train Test Splitting of Data

We have split the data in the ratio 70:30 using random sampling and stratification. Choosing an appropriate train-test split is important in ML, because it can affect the accuracy and generalization of the resulting model.

API Integration

The user’s city, taken as an input, is passed to the API call as a parameter. The temperature and humidity fields from the API response are given to the crop prediction model as input along with the other data, which allows the system to give real-time predictions. The “OpenWeatherMap” API is used for this purpose. The API URL is built from the base URL and an API key, which is unique to each subscription. The user’s city name is passed in the complete URL as a parameter and the response is collected. From the collected response, the required fields, i.e., temperature and humidity, are passed to the ML model for predicting the crop.
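The URL construction and field extraction can be sketched as follows. The endpoint, the `units=metric` parameter, and the `main.temp` / `main.humidity` response fields are assumptions based on OpenWeatherMap's current-weather API rather than details given in this paper; the helper names and the sample payload are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

# Assumed base URL of the OpenWeatherMap current-weather endpoint.
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Combine the base URL, the user's city, and the subscription's API key."""
    query = urlencode({"q": city, "appid": api_key, "units": "metric"})
    return f"{BASE_URL}?{query}"


def extract_model_inputs(payload):
    """Pick the two fields the crop model needs from a decoded JSON response."""
    main = payload["main"]
    return main["temp"], main["humidity"]


# Hypothetical key and a hand-written payload in the assumed response shape.
url = build_weather_url("Pune", "YOUR_API_KEY")
sample_response = {"main": {"temp": 30.0, "humidity": 60}}
temperature, humidity = extract_model_inputs(sample_response)
```

In the running system the payload would come from an HTTP GET on `url`; keeping URL building and parsing in separate functions makes both easy to test without network access.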

Training and Evaluation of Models

Crop prediction uses a multi-class classification model to predict the crop for a given set of input features, whereas yield prediction uses a regression model to predict the yield for a given set of input features. The Google Colab platform is used for training and evaluating the models. The user interface for the project is built with ReactJS, and the backend is built with the Python Flask framework.

Application and Advantages over existing versions

The model can support the right crop selection, since the user gets a fair prediction of both crop and yield. Yield prediction is also important in the financial assessment of a crop strategy. The model is useful if the user wants to compare yields for multiple crop options and then select the best one. It could also be used over a wide geography to estimate the yield of a particular crop. The project can be used directly by end users such as farmers to obtain predictions for their own conditions. Alternatively, it can be used by government agencies for planning and policy making if it is extended with wider access to reliable closed-source government data. It can also be used by NGOs that educate farmers in adopting new technologies and precision agriculture. Finally, it can be used where monetary calculations depend on how much yield could be produced, as in insurance claims or loan policies.

The project improves prediction accuracy through suitable data gathering and cleaning and by selecting the most accurate model. It incorporates both crop and yield prediction, and therefore uses classification as well as regression models. It adds value to the modern agricultural setup by making crop selection more reliable, which in turn improves yield and financial stability.

Results and Discussions

Crop Prediction

First, the datasets are loaded and insignificant features are removed. After data preparation, the data are split into training and testing sets, and various models are fitted and tested for accuracy. Feature importance is calculated to determine the relative contribution of individual features to the ML model. For the crop prediction model, Drop Column Importance is calculated. It is closely related to “permutation importance” (also called “feature importance by feature shuffling”), but it retrains the model without the feature instead of shuffling the feature’s values.

Drop Column Importance = Baseline Metric − Metric Without the Feature

Drop column importance is based on the idea that removing a feature that is crucial to the model would cause the model to perform worse than before. It is calculated in the following steps:

1. Train a model with all features

2. Measure baseline performance on validation data

3. Select one feature whose importance is to be calculated

4. Train a model with all features except the selected one

5. Calculate its performance on the validation data

6. The feature importance is the drop in performance from the baseline

7. Repeat steps 3 through 6 for every feature
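The procedure listed above can be sketched as follows. The helper `drop_column_importance` is hypothetical, the data are synthetic (one informative column and one noise column), and accuracy from the estimator's `score` method is assumed as the performance metric.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def drop_column_importance(model, X_train, y_train, X_val, y_val):
    """Return baseline accuracy and the accuracy drop when each column is removed."""
    # Train with all features and measure the baseline on validation data.
    baseline = clone(model).fit(X_train, y_train).score(X_val, y_val)
    drops = []
    for col in range(X_train.shape[1]):  # repeat for every feature
        keep = [c for c in range(X_train.shape[1]) if c != col]
        # Retrain without the selected feature and score on validation data.
        reduced = clone(model).fit(X_train[:, keep], y_train)
        drops.append(baseline - reduced.score(X_val[:, keep], y_val))
    return baseline, np.array(drops)


# Synthetic data: column 0 carries the class signal, column 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = np.column_stack([4.0 * y + rng.normal(size=400), rng.normal(size=400)])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

baseline, importances = drop_column_importance(
    GaussianNB(), X_train, y_train, X_val, y_val
)
```

Because the model is retrained once per feature, the cost grows with the number of features; an importance near zero or below zero suggests that the feature adds little on this validation set.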

As shown in Figure 4, rainfall is the most important feature for crop prediction, followed by humidity, Potassium (K), Phosphorous (P), Nitrogen (N), and pH.

The classification models’ results for crop prediction are depicted in Figure 5.

When trained on the dataset, KNN gives an accuracy of 97.72%, RF 99.24%, the Naïve Bayes classifier 99.39%, and Logistic Regression 94.69%. Based on these results, the Naïve Bayes classifier is incorporated in the backend for crop prediction.

Calculating feature importance for a Random Forest Regressor with one-hot encoded features involves determining the contribution of each feature to the model’s predictive performance. It is done through following steps.

Train the Random Forest Regressor

Access Feature Importances: Random Forest Regressor has built in attribute named feature_importances_.

Map Feature Importances to Original Features: Every one-hot encoded feature is mapped with its original feature.

Aggregate Feature Importances: By aggregating we get every categorical feature’s importance

Rank Feature Importances in descending order of importance.
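These steps can be sketched as follows, using synthetic data in which the yield is driven mainly by the crop. The column names mirror the features discussed in this paper, but the values, the crop effects, and the helper `original_name` are made up for illustration; the mapping relies on the `<feature>_<value>` column names that `pd.get_dummies` produces.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: yield depends on the crop, not on season or rainfall.
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "Crop": rng.choice(["rice", "wheat", "sugarcane"], n),
    "Season": rng.choice(["Kharif", "Rabi"], n),
    "Rainfall": rng.uniform(400, 1200, n),
})
crop_effect = raw["Crop"].map({"rice": 2.0, "wheat": 1.5, "sugarcane": 60.0})
y = crop_effect + rng.normal(scale=0.1, size=n)

# Train the regressor on one-hot encoded features.
categorical = ["Crop", "Season"]
X = pd.get_dummies(raw, columns=categorical)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)


def original_name(column):
    """Map a one-hot column such as 'Crop_rice' back to 'Crop'."""
    for name in categorical:
        if column.startswith(name + "_"):
            return name
    return column


# Access, map, aggregate, and rank the importances.
importances = pd.Series(forest.feature_importances_, index=X.columns)
aggregated = importances.groupby(original_name).sum().sort_values(ascending=False)
```

Summing the indicator columns gives one importance per original feature, and the aggregated values still add up to 1 because `feature_importances_` is normalized.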

As shown in Figure 7, Crop is the most important feature for predicting yield, followed by District, Rainfall, and Season.

Yield prediction is done by regression. To compare the regression models, the performance metrics Mean Absolute Error, Median Absolute Error, and R2 Score are used. The results are depicted in Figure 8.

The Random Forest Regressor gives the most reliable results, with a Mean Absolute Error of 0.64, a Median Absolute Error of 0.16, and an R2 Score of 0.96.

The Decision Tree Regressor has a Mean Absolute Error of 0.80, a Median Absolute Error of 0.18, and an R2 Score of 0.94. Linear Regression has a Mean Absolute Error of 1.08, a Median Absolute Error of 0.47, and an R2 Score of 0.92.

Conclusion  

Crop yield prediction is a complex process that relies on several factors, including weather, soil, fertilizers, and pest infestations. In this paper, we predict the crop yield using weather and soil parameters. The research is based on datasets limited to districts in Maharashtra. The system uses regression techniques to estimate yield and multi-class classification to predict the type of crop. Among the models used for yield prediction, Random Forest Regression gives the best results, with an MAE of 0.64 and an R2 score of 0.96. For crop prediction, the Naïve Bayes classifier gives the most accurate results, with an accuracy of 99.39%. The suggested method helps farmers choose which crop to plant and estimate how much yield a crop would give in that specific environment. The dataset used in the research can be improved by collecting real-time data through IoT devices. Factors like irrigation and fertilizer use can also be included for better prediction. A mobile app can be developed with added services like price estimates in accordance with current market prices. Paid datasets may provide more reliable and accurate data, which in turn might improve model accuracy; they may also contain more features that correlate with the label.

Acknowledgment

We are grateful to Prof. Pritesh A Patil for providing valuable input and feedback throughout the research process.

Conflict of Interest  

The authors declare no conflict of interest regarding this research. However, it should be noted that the first author of this paper is an employee of a company that develops and markets machine learning software for crop yield prediction. The results and conclusions presented here are solely based on the authors’ research and do not reflect any external influence.  

References  

  • Alpaydın, “Introduction to machine learning, second edition.” MIT Press, 2010. ISBN: 978-0-262-01243-0.
  • Liakos, K.G.; Busato, P.; Moshou, D.; Pearson, S.; Bochtis, D., “machine learning in agriculture: a review”, Sensors, vol. 18, no. 8, pp. 2674, August 2018. https://doi.org/10.3390/s18082674.
  • Sabitha, “A study on sectorial contribution of gdp in india from 2010 to 2019”, AJEBA, 19, no. 1, pp. 18-31, January 2020. Article no. AJEBA. 62227.
  • Jain A., “Analysis of growth and instability in the area, production, yield, and price of rice in India”, Journal of Social Change and Development, vol. 2, pp. 46-66, N/A,
  • Wolfert S, Ge L, Verdouw C, Bogaardt MJ, “Big data in smart farming– a review. Agricultural Systems”, 153, pp. 69-80, May 2017.
  • Sangeeta, Shruthi G. “Design and implementation of crop yield prediction model in ” International Journal of Engineering Research & Technology (IJERT), vol. 9, no. 4, pp. 305-310, Apr. 2020.
  • Johnson LK, Bloom JD, Dunning RD, Gunter CC, Boyette MD, Creamer NG, “Farmer harvest decisions and vegetable loss in primary production. Agricultural Systems”, 176, pp. 102672, November 2019.
  • Sharma A, Jain A, Gupta P, Chowdary V. Machine learning applications for precision agriculture: A comprehensive review. IEEE Access. 2020 Dec 31;9:4843-73.
  • Meshram V, Patil K, Meshram V, Hanchate D, Ramkteke SD. Machine learning in agriculture domain: A state-of-art survey. Artificial Intelligence in the Life Sciences. 2021 Dec 1;1:100010.
  • Reddy, D. J., & Kumar, M. R. (2021). Crop Yield Prediction using Machine Learning Algorithm. 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS). doi:10.1109/iciccs51141.2021.9432236
  • Ranjini B Guruprasad, Kumar Saurav, Sukanya Randhawa, “Machine learning methodologies for paddy yield Estimation in India: a case study”,
  • S. P. Raja, B. Sawicka, Z. Stamenkovic and G. Mariammal, “Crop prediction based on characteristics of the agricultural environment using various feature selection techniques and classifiers,” IEEE Access, vol. 10, pp. 23625-23641, 2022, doi: 10.1109/ACCESS.2022.3154350.
  • Venugopal, Anakha, S, Aparna, Mani, Jinsu, Mathew, Rima, Williams, Vinu. “Crop yield prediction using machine learning algorithms.” International Journal of Engineering Research & Technology (IJERT) NCREIS – 2021, vol. 09, no. 13, pp. 1-6, 2021.
  • S. M. PANDE, P. K. RAMESH, A. ANMOL, B. R. AISHWARYA, K. ROHILLA and SHAURYA, “Crop recommender system using machine learning approach,” 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), 2021, pp. 1066-1071, doi: 10.1109/ICCMC51019.2021.9418351.
  • Suresh, N., et al. “Crop yield prediction using random forest algorithm.” 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 279-282, 2021, doi: 10.1109/ICACCS51430.2021.9441871.
  • E. Manjula and S. Djodiltachoumy, “A model for prediction of crop yield,” Int. J. Comput. Intell. Inform., vol. 6, no. 4, pp. 298–305, 2017.
  • van Klompenburg, , Kassahun, A., & Catal, C. (2020). Crop yield prediction using machine learning: A systematic literature review in computers and electronics in Agriculture, 177, pp. 105709. doi: 10.1016/j.compag.2020.105709.
  • M. Liu, T. Wang, A. K. Skidmore, and X. Liu, “Heavy metal-induced stress in rice crops detected using multi-temporal Sentinel-2 satellite images,” Sci. Total Environ., vol. 637-638, pp. 18-29, Oct. 2018.
  • K. E. Eswari and L. Vinitha, “Crop yield prediction in tamil nadu using bayesian network,” Int. J. Intell. Adv. Res. Eng. Comput., vol. 6, no. 2, pp. 1571-1576, 2018.
  • D. A. Reddy, B. Dadore, and A. Watekar, “Crop recommendation system to maximize crop yield in ramtek region using machine learning,” Int. J. Sci. Res. Sci. Technol., vol. 6, no. 1, pp. 485-489, Feb. 2019.
  • Priya, P., Muthaiah, U., Balamurugan, M. “Predicting yield of the crop using machine learning algorithm.” International Journal of Computer Science and Mobile Computing, 4, no. 5, pp. 1-7, May 2015.
  • Medar, Ramesh, S, Vijay, Shweta. “Crop yield prediction using machine learning ” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 9, no. 5, pp. 1-6, May 2019.
  • https://www.kaggle.com/

Can machine learning models provide accurate fertilizer recommendations?

  • Open access
  • Published: 25 March 2024

  • Takashi S. T. Tanaka (ORCID: orcid.org/0000-0001-7116-6962) 1 , 2 , 7 ,
  • Gerard B. M. Heuvelink 3 , 4 ,
  • Taro Mieno 5 &
  • David S. Bullock 6  

Accurate modeling of site-specific crop yield response is key to providing farmers with accurate site-specific economically optimal input rates (EOIRs) recommendations. Many studies have demonstrated that machine learning models can accurately predict yield. These models have also been used to analyze the effect of fertilizer application rates on yield and derive EOIRs. But models with accurate yield prediction can still provide highly inaccurate input application recommendations. This study quantified the uncertainty generated when using machine learning methods to model the effect of fertilizer application on site-specific crop yield response. The study uses real on-farm precision experimental data to evaluate the influence of the choice of machine learning algorithms and covariate selection on yield and EOIR prediction. The crop is winter wheat, and the inputs considered are a slow-release basal fertilizer NPK 25–6–4 and a top-dressed fertilizer NPK 17–0–17. Random forest, XGBoost, support vector regression, and artificial neural network algorithms were trained with 255 sets of covariates derived from combining eight different soil properties. Results indicate that both the predicted EOIRs and associated gained profits are highly sensitive to the choice of machine learning algorithm and covariate selection. The coefficients of variation of EOIRs derived from all possible combinations of covariate selection ranged from 13.3 to 31.5% for basal fertilization and from 14.2 to 30.5% for top-dressing. These findings indicate that while machine learning can be useful for predicting site-specific crop yield levels, it must be used with caution in making fertilizer application rate recommendations.

Introduction

Site-specific crop management aims to use information about within-field variability of soil and topographic properties to increase farming profitability and sustainability. Understanding of site-specific crop yield response facilitates effective site-specific crop management (Bullock et al., 2019). Until recently, site-specific crop management was principally based on farmers’ and agronomists’ experiences and expectations about crop responses to agronomic inputs. The expectations are based largely on inferences obtained from conventional small-plot trials that are presumed to represent what is occurring elsewhere. But these trials are expensive and labor-intensive, and it may be inappropriate to draw inferences from small-plot trials to improve management across many farms (Bullock et al., 2019; Lacoste et al., 2022). In contrast, on-farm experimentation has the potential to provide more actionable and practical insights to farmers as an alternative to small-plot trials.

Based on experimental data, process-based crop simulation models, such as APSIM (Holzworth et al., 2014), DSSAT (Hoogenboom et al., 2019), and WOFOST (Boogaard & de Wit, 2020) have been developed to understand the effects of crop management, soil, and weather on crop growth and final yield. Crop simulation models are basically point-based (Heuvelink et al., 2010). Spatialization of these models is of interest to precision agriculture (PA) as it might contribute to optimizing site-specific crop management (Pasquel et al., 2022). However, spatialization of crop simulation models requires knowledge of site-specific input application rates and model parameters that are difficult to estimate because of data scarcity. Environmental and agricultural models (e.g., crop simulation models) also suffer from error propagation, as the uncertainty in model inputs influences the output (Corner et al., 2008; Heuvelink, 1998). Furthermore, crop simulation models can only predict potential, water-limited, or nutrient-limited yield, not the actual yield when environmental variables that are not accounted for (e.g., weeds, insect pests, and disease) greatly affect yields (de Wit et al., 2019). Therefore, it is not straightforward to use crop simulation models for the purpose of optimizing site-specific input management.

On-farm precision experimentation (OFPE) is a form of on-farm experimentation that uses PA technology to generate large amounts of crop input application and yield response data. Such data can be used to estimate spatially variable optimal input application rates and thus improve site-specific decision making (Bullock et al., 2019 ). Combining OFPE and machine learning approaches is expected to present an opportunity to facilitate understanding of site-specific crop yield response (Bullock et al., 2019 ). Since the early stage in the development of site-specific crop management, a wide range of models (e.g., intuitive, stochastic, and machine learning models) has been proposed to support farmers’ decision on the rate and timing of fertilizer application at a given location (Adams et al., 2000 ). The use of various statistical approaches and machine learning algorithms is nowadays becoming a hot topic, but no consensus has been reached on which model is the best.

Many studies have demonstrated the advantages of using spatial statistical modeling methods, including geographically weighted regression (Evans et al., 2020; Trevisan et al., 2021) and machine learning techniques, such as random forest (RF) (Krause et al., 2020; Paccioretti et al., 2021; Wen et al., 2021) and convolutional neural networks (Barbosa et al., 2020). Although Evans et al. (2020) and Trevisan et al. (2021) attempted to explicitly model the spatially variable crop yield responses, most previous studies focused only on the accuracy of crop yield prediction. Kakimoto et al. (2022) demonstrated that a machine learning model that accurately predicts site-specific yield levels does not necessarily accurately predict yield response and the associated site-specific economically optimal input rates (EOIRs) of fertilizer. They highlighted the distinction between predicting yield levels at observed input rates and estimating yield response to input. For site-specific input management recommendations, the latter is critical, but not necessarily the former. Estimating site-specific EOIRs accurately requires that the causal relationship between agronomic inputs and crop yield be discovered accurately.

Covariate selection is an essential process in machine learning modeling. Using machine learning for yield prediction can underestimate the impact of nitrogen fertilizer (N) on crop yields and EOIRs because of the inclusion of redundant or strongly correlated covariates (Kakimoto et al., 2022 ). Estimation of the impact of input on yield may also be biased when an important covariate is not included. With the increased adoption of PA technologies in commercial farms, there are numerous possibilities for selecting covariates (e.g., elevation data, satellite imagery, on-the-go soil sensor data, and digital soil maps) in establishing a yield prediction model for OFPE. Selecting only influential covariates may be a gold standard in establishing models, but rarely can practitioners identify and quantify the complete set of covariates contributing to yield variability. Previous studies have paid very little attention to the sensitivity of the quality of fertilizer management recommendations to different machine learning approaches.

Many studies have used synthetic data to compare the effects of machine learning algorithms, covariate selection, and experimental design on yield and EOIR prediction accuracies (Alesso et al., 2020 ; Kakimoto et al., 2022 ; Saikai et al., 2020 ). To assess the prediction accuracy of site-specific crop yield response modeling, synthetic data can be generated using crop yield response functions, such as process-based crop simulation models (e.g., APSIM) and mechanistic models (e.g., the Mitscherlich-Baule function). Synthetic data can simulate ‘true’ crop yield response, which enables validating EOIR prediction accuracy. One of the shortcomings of synthetic data is that the spatial distribution of yield and yield-limiting factors are generated based on simple assumptions. Although previous studies have considered random noise (e.g., the nugget effect), real farms have more artifacts, such as wheels, overlaps, missing strips of inputs, and further historical land uses (Roques et al., 2022 ; Zhou et al., 2022 ). Therefore, synthetic data cannot fully represent the real-world conditions, and may not be capable of providing fair insights into the model uncertainty in machine learning approaches to the analysis of OFPE data.

The aim of this study was to quantify the uncertainty involved in modeling the inclusion of the application rates of two fertilizers and soil properties as covariates in a machine learning model of site-specific crop yield prediction, and to examine how the model uncertainty quantitatively affects the estimation of site-specific EOIRs and gained profits. An OFPE was conducted in Japan to assess the effects of soil properties and application rates of basal and top-dressed fertilizer on winter wheat yield. Site-specific crop yield response models were established using different combinations of machine learning algorithm and covariate selection. Site-specific EOIRs were derived for each of these combinations. A frequency distribution of the estimated EOIRs and gained profits was further assessed as a measure of model uncertainty.

Materials and methods

Experimental design and data collection.

A split-plot or checkerboard OFPE (Fig. 1) was implemented in 2019–2020 in Gifu, Japan (35°11’N, 136°39’E) to measure the effects of changing fertilizer application rates on the yield of the ‘Satonosora’ wheat variety. The trial was conducted in cooperation with the Japanese farming company Fukue-eino, which owned the variable-rate application and yield monitoring equipment. In the first OFPE season, a checkerboard design was not implemented across all fields because the farmer was not convinced that the rate transition between plots could be achieved smoothly. Therefore, a split-plot design was implemented for the rest of the fields. Just before seeding (early November), a slow-release basal fertilizer NPK 25–6–4 was applied at rates of 0, 270, 360, 450, and 540 kg ha –1 . Before the booting stage (early March, Zadoks 41), NPK 17–0–17 was top-dressed at rates of 0, 222, 296, 370, and 444 kg ha –1 . The number of plots receiving no fertilizer was limited due to the risk of yield loss. A variable-rate fertilizer broadcaster with an 18-m working width (Axis 40.2, Kuhn, France) was used for both applications. All other management practices (e.g., disease and weed control) were uniform. No serious disease or weed problems were observed. Yield data were collected using a combine harvester with a yield monitor sensor (WRH1200, Kubota, Japan). Although the combine had a 2.6-m header width, after data preprocessing based on the manufacturer’s recommended procedures, yield values were averaged to obtain single values for each 5 m × 5 m cell within a grid. Cells in “transition zones” at the beginnings and ends of trial plots, in headlands and/or in buffer zones around the field’s perimeter were excluded from further analysis. The resulting dataset used for analysis contained 970 observations at 5 m × 5 m spatial resolution.

figure 1

Experimental design of the on-farm precision experiment. White space represents borders (e.g., transition zones and headlands). The numerals beside the X marks show the six randomly selected locations from which data were obtained to create scatterplots and histograms of the EOIRs of the basal NPK 25–6–4 and top-dressed NPK 17–0–17 fertilizers (Fig. 8)

Soil properties were used as covariates to perform site-specific crop yield response assessment. In mid-October 2019, prior to the basal fertilizer application, a total of 52 soil samples were collected near the centroids of a 30 m × 30 m grid defined over the field. Within a 1 m 2 area over the centroid of each soil sampling grid cell, three randomly located partial surface soil samples (0–150 mm) weighing approximately 0.5 kg each were collected and mixed to produce one composite sample. The composite samples were air-dried and sieved through a 2.0-mm mesh before chemical analysis. Soil pH, electrical conductivity (EC), mineralizable N, available phosphorus (P), cation exchange capacity (CEC), exchangeable calcium (Ca), exchangeable magnesium (Mg) and exchangeable potassium (K) were measured. Mineralizable N was determined according to Inoko’s (1986) method. Soils were anaerobically incubated at 30 °C for four weeks, and inorganic N was extracted with a 2 M KCl solution. The concentrations of NH 4 + and NO 3 − in the extracts were determined using the indophenol method (Keeney & Nelson, 2015) and the Cataldo method (Cataldo et al., 1975). Mineralizable N was calculated by balancing the inorganic N (NH 4 + and NO 3 − ) before and after anaerobic incubation. Available P was measured by the Truog method (Truog, 1930). Cation exchange capacity was measured by saturating the soil with a neutral 1 mol L –1 ammonium acetate solution, washing with 80% ethanol to remove soluble NH 4 + , and extracting exchangeable NH 4 + with 2 mol L –1 KCl. The concentrations of Ca, Mg, and K were determined by inductively coupled plasma atomic emission spectroscopy (ICP-AES, ULTIMA 2, HORIBA, Japan).

Interpolation of soil sample property values

Because the 5 m × 5 m resolution of the dataset’s grid was finer than the 30 m × 30 m resolution of the soil sampling grid, soil property measurements taken from samples collected near the centroids of the 30 m × 30 m cells had to be spatially interpolated to assign values at the centroids of the 5 m × 5 m cells. Interpolated values were calculated using the ‘geoR’ package (Ribeiro & Diggle, 2001) of R version 3.6.2 (R Development Core Team, 2019) and applying the empirical best linear unbiased prediction (E-BLUP) method (Lark et al., 2006). Box–Cox transformation (Box & Cox, 1964) was applied prior to geostatistical modeling when the distribution of the observations was highly skewed, and predicted mean values from the E-BLUP were back-transformed. The Matérn covariance function (Webster & Oliver, 2007) and the restricted maximum likelihood estimator were used for estimation of the semi-variogram parameters. The resultant interpolated soil maps are presented in S1. Since covariate values were smoothed using kriging with external drift, they also had interpolation errors (S2). Interpolation errors were evaluated as coefficients of variation (CV), obtained by dividing the kriging standard deviation by the kriging prediction.

Data analysis

Four machine learning regression models, RF, XGBoost, support vector regression (SVR), and artificial neural network (ANN) were trained with different combinations of covariates to model site-specific yield responses to the fertilizers. RF and SVR were implemented using the Python module ‘scikit-learn’ (version 1.1.1) (Pedregosa et al., 2011 ). XGBoost was implemented with xgboost (version 1.5.1) (Chen & Guestrin, 2016 ). ANN was implemented using the Keras (version 2.9.0) machine learning application programming interface (Chollet, 2015 ) with the TensorFlow (version 2.9.0) (Abadi et al., 2015 ) backend.

All 255 possible combinations (i.e., \({2}^{8}-1\) ) that can be made from using between one and eight soil properties as covariates were included in the estimations, for each of the four machine learning algorithms, meaning that a total of 1,020 cases were examined. Because the study area was relatively small, spatial differences in weather and other environmental factors were assumed negligible and excluded from the analysis. Of course, the inference space of the experiment should not be assumed to be expandable beyond the field itself. Further research with real-world large-scale experiments is needed to test the robustness of the results reported.
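The case count above can be checked with a short enumeration. This is a minimal sketch, not the authors' code: the short labels for the eight soil properties are illustrative names, and it assumes that the two fertilizer rates enter every model alongside whichever soil subset is chosen.

```python
from itertools import combinations

# Soil properties listed in the text; the short labels are illustrative names.
SOIL_PROPERTIES = ["pH", "EC", "mineralizable_N", "available_P",
                   "CEC", "Ca", "Mg", "K"]
ALGORITHMS = ["RF", "XGBoost", "SVR", "ANN"]


def covariate_subsets(properties):
    """Return every non-empty subset of the given covariates."""
    subsets = []
    for size in range(1, len(properties) + 1):
        subsets.extend(combinations(properties, size))
    return subsets


subsets = covariate_subsets(SOIL_PROPERTIES)
# One case per (algorithm, covariate subset) pair.
cases = [(algorithm, subset) for algorithm in ALGORITHMS for subset in subsets]

print(len(subsets), len(cases))
```

Each entry of `cases` would then correspond to one trained model, with the basal and top-dressing rates added to the selected soil covariates (an assumption of this sketch).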

The dataset’s 970 observations were randomly split into a training dataset of 679 observations and a test dataset of 291 observations. Hyperparameters of RF and SVR were determined by grid search with five-fold cross-validation using the training dataset. This procedure was repeated three times with different subsets. The models were then retrained with the optimal hyperparameters using the training dataset. For RF, grid searches were performed to optimize n_estimators (the number of decision trees). For SVR, grid searches were conducted to optimally assign values to parameters C and ε . The ANN architecture involves three hidden layers. The input layer was fed into a rectified linear unit (ReLU) layer with 64 neurons, followed by batch normalization. The batch normalization layer was fed into a ReLU layer with 128 neurons, followed by another ReLU layer with 128 neurons. Finally, the last fully-connected ReLU layer with 128 neurons was fed into an output layer with a linear activation function. According to a preliminary experiment, the model performance of ANN did not largely depend on the architecture, and three to four layers and 32–128 neurons were sufficient to model crop yield. To avoid over-fitting, early stopping was used to monitor validation loss with a patience of ten epochs. 30% of the training dataset was used to calculate the validation loss for ANN. For both the training and test datasets, model prediction accuracies were evaluated by root mean square error (RMSE) and R 2 .
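The split sizes and the two accuracy metrics named above can be illustrated with the standard library alone. This is a sketch under stated assumptions: the seed and the helper names are hypothetical, and R 2 is computed here as the coefficient of determination, which the text does not specify.

```python
import math
import random


def train_test_split_indices(n_obs, test_fraction, seed):
    """Shuffle observation indices and split them into training and test sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_test = round(n_obs * test_fraction)
    return indices[n_test:], indices[:n_test]


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# A 70/30 split of the 970 observations reproduces the 679/291 sizes in the text.
train_idx, test_idx = train_test_split_indices(970, 0.30, seed=0)
print(len(train_idx), len(test_idx))
```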

In this study, model uncertainty refers to the variability in the EOIRs and gained profits predicted from different algorithms and covariate sets. To evaluate site-specific uncertainty in decision making for fertilizer application among the selection of algorithms and covariates, site-specific EOIRs were calculated by treating the predicted values from the models as deterministic outcomes. First, site-specific net revenue ($ ha –1 ) at location i was defined as

\[\text{Net revenue}_{i} = p\,y_{i} - w_{BF}\,BF_{i} - w_{TF}\,TF_{i}\]

where p = $1.16 kg –1 (136.8 JPY kg –1 ) is the price of wheat grain, y i is the model’s predicted wheat grain yield at location i , w BF = $1.58 kg –1 (187.0 JPY kg –1 ) is the basal fertilizer price, BF i is the basal fertilizer application rate at location i , w TF = $0.60 kg –1 (71.2 JPY kg –1 ) is the top-dressing fertilizer price, and TF i is the top-dressing fertilizer application rate at location i . Prices were obtained from the farmer in the corresponding year.
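The net revenue definition can be written as a small function: grain revenue minus the cost of the two fertilizers, using the prices quoted above. This is a minimal sketch; the function and constant names are hypothetical, and it assumes yield is supplied in kg ha –1 (yields reported in t ha –1 must be converted first).

```python
# Prices from the text, in $ per kg.
WHEAT_PRICE = 1.16   # p
BASAL_PRICE = 1.58   # w_BF
TOP_PRICE = 0.60     # w_TF


def net_revenue(yield_kg_ha, basal_kg_ha, top_kg_ha):
    """Site-specific net revenue ($ per ha): grain revenue minus fertilizer costs."""
    return (WHEAT_PRICE * yield_kg_ha
            - BASAL_PRICE * basal_kg_ha
            - TOP_PRICE * top_kg_ha)


# Illustrative input only: a predicted yield of 4.0 t/ha (4000 kg/ha) at the
# conventional rates of 450 and 370 kg/ha.
example = net_revenue(4000.0, 450.0, 370.0)
print(round(example, 2))
```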

To assess EOIR estimation robustness, fertilizer application rates were optimized by running the model at intervals of 5 kg ha –1 (basal fertilizer: 1.25 kg N ha –1 , 0.30 kg P 2 O 5 ha –1 , 0.20 kg K 2 O ha –1 ; top-dressing fertilizer: 0.85 kg N ha –1 , 0.00 kg P 2 O 5 ha –1 , 0.85 kg K 2 O ha –1 ) with the values of the soil property covariates unchanged. Application rate ranges were 270–540 kg ha –1 and 222–444 kg ha –1 for the basal and top-dressing fertilizers, respectively. Rates less than the minimum application rates were not tested because of the limited number of experimental plots receiving no fertilizer. According to information from a local crop advisory service, N and K were assumed to limit crop yield across the fields because the interpolated values were smaller than the recommended ranges (S1). Therefore, the machine learning models might assess the effect of multiple nutrients, such as N and K, on crop yield. To explore the robustness of the site-specific EOIR estimations, mean values and CVs were calculated for each experimental grid. CVs were evaluated as the ratio of the standard deviation to the mean, either from all combinations of algorithm and covariate selection ( n = 1,020) or from combinations of covariates for each algorithm ( n = 255), for each experimental grid. Six locations were randomly selected for visualizing the distributions of the basal and top-dressing EOIRs (Fig. 1). Furthermore, gained profits from adopting optimal site-specific fertilization were evaluated by subtracting the net revenue under the uniform conventional rate (i.e., 450 and 370 kg ha –1 for the basal and top-dressing fertilizers) from the net revenue under the optimal site-specific fertilization rate.
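The rate optimization and gained-profit calculation can be sketched as an exhaustive search over candidate rates. Several things here are assumptions rather than the study's implementation: `toy_yield` is a made-up quadratic response standing in for a trained model with soil covariates held fixed, the two rates are searched jointly, and the upper bound of each range is appended when the 5 kg ha –1 step does not land on it.

```python
from itertools import product

WHEAT_PRICE, BASAL_PRICE, TOP_PRICE = 1.16, 1.58, 0.60  # $ per kg, from the text
CONVENTIONAL = (450, 370)  # uniform basal and top-dressing rates, kg per ha


def rate_grid(low, high, step=5):
    """Candidate rates from low to high; the upper bound is always included."""
    rates = list(range(low, high + 1, step))
    if rates[-1] != high:
        rates.append(high)
    return rates


def net_revenue(yield_kg_ha, basal, top):
    """Grain revenue minus the cost of both fertilizers ($ per ha)."""
    return WHEAT_PRICE * yield_kg_ha - BASAL_PRICE * basal - TOP_PRICE * top


def toy_yield(basal, top):
    """Hypothetical yield response (kg per ha); a stand-in for model predictions."""
    return 2000 + 8 * basal - 0.01 * basal ** 2 + 4 * top - 0.006 * top ** 2


def find_eoir(predict_yield, basal_rates, top_rates):
    """Return the (basal, top) pair that maximizes predicted net revenue."""
    return max(product(basal_rates, top_rates),
               key=lambda rates: net_revenue(predict_yield(*rates), *rates))


basal_rates = rate_grid(270, 540)
top_rates = rate_grid(222, 444)
eoir = find_eoir(toy_yield, basal_rates, top_rates)

# Gained profit: net revenue at the EOIR minus net revenue at the uniform rates.
gained_profit = (net_revenue(toy_yield(*eoir), *eoir)
                 - net_revenue(toy_yield(*CONVENTIONAL), *CONVENTIONAL))
print(eoir, round(gained_profit, 2))
```

In practice `predict_yield` would wrap one trained model at one grid cell, so repeating the search over all 1,020 cases yields the per-cell EOIR distribution analyzed in the study.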

Results of on-farm precision experiment and yield prediction performance

The relationships between yield and inputs are shown as box plots in Fig.  2 . Yield tended to increase with basal fertilizer rates. The median value of yield was extremely low when no top-dressing fertilizer was applied. There were no large differences in yield among the top-dressing application rates from 222 to 444 kg ha –1 . This indicates the importance of OFPE for recommending basal fertilizer application rates rather than top-dressing application rates. Furthermore, high variations for each treatment indicate that yield responses can vary substantially, even within a small area.

figure 2

Box plots of yield for each application rate for basal and top-dressing fertilizers across the fields. Lower and upper box boundaries indicate 25th and 75th percentiles. Lines inside boxes represent medians. The ranges between the lower and upper whiskers are 1.5 times the interquartile range. Filled circles show outliers falling outside 1.5 times the interquartile range

Both the RMSE and R 2 values of the test dataset indicated that RF had the best yield prediction performance (Fig. 3). Although XGBoost generally showed high prediction accuracies, there were some cases with very high RMSEs (> 0.5 t ha –1 ) and low R 2 values (< 0.2). SVR and ANN were not capable of predicting yield values above approx. 4.0 t ha –1 (Fig. 4). Meanwhile, RF and XGBoost underestimated yield values above approx. 4.5 t ha –1 . This result indicates that the difference in yield underestimation in the high-yield range might affect overall yield prediction accuracies. All models failed to predict extremely high yield values (> 6 t ha –1 ). This might be due to the lack of important covariates that are related to high yield values. Importantly, the inaccuracies of yield prediction could lead to underestimation of EOIRs at high yield levels.

figure 3

Histograms of RMSE and R 2 values for all four machine learning models with different combinations of covariates

figure 4

Density scatter plots of predicted against observed yield for all four machine learning models with different combinations of covariates. The black line indicates the 1:1 reference line

Variability in EOIRs and gained profits

CV values of the EOIRs were spatially heterogeneous (Fig.  5 ) for all covariate combinations using all machine learning algorithms, ranging from 13.3 to 31.5% for the basal and from 14.2 to 30.5% for the top-dressing fertilizer. Spatial distributions of EOIR CV values for each machine learning algorithm are shown in Fig.  6 . RF was not sensitive to covariate selection, while SVR and ANN were very sensitive to it. However, this study did not attempt to identify which machine learning models were best for generating economic profits. Indeed, doing so is not possible since the true EOIRs, which are needed to validate the model performance, cannot be directly observed. Therefore, it should not be concluded that RF is the best machine learning model for optimizing site-specific input management.

figure 5

Spatial distributions of EOIR CV values resulting from the combined effects of algorithm and covariate selection

figure 6

Spatial distributions of EOIR CV values derived from 255 covariate combination runs per combination of machine learning algorithm and fertilizer type

Spatial distributions of mean values of the EOIRs varied greatly among machine learning algorithms (Fig.  7 ). For both fertilizers, recommended application rates from the tree-based models RF and XGBoost were less spatially heterogeneous and relatively lower than those from SVR and ANN. For example, for basal fertilizer, the estimated EOIRs ranged from 270 to 427 kg ha –1 for RF. In contrast, the corresponding range was from 348 to 538 kg ha –1 for ANN. This result indicates that crop yield response can vary substantially even within a small area according to the algorithm selection.

figure 7

Spatial distributions of mean EOIR values for each machine learning algorithm

Figure 8 shows the scatterplots and histograms of estimated EOIRs from all the 1,020 cases at the six locations (see Fig. 1). The 5th- and 95th-percentile borders (indicated by the dashed lines in Fig. 8) show that the EOIRs for both fertilizers had large variations. More specifically, the interval of values containing the central 90% of EOIRs ranged almost from the lowest to the highest applied rates for top-dressing fertilizer. Furthermore, a clear unimodal distribution was found only for the basal fertilizer at location 1. In contrast, a bimodal distribution was evident for basal fertilizer at locations 3 and 4 and for top-dressing fertilizer at locations 1 and 2. In these cases, decision makers would be forced to make an extreme choice between the lowest and highest fertilizer application rates, which could ultimately result in completely different revenues. These results indicate that the EOIR predictions are highly sensitive to algorithm and covariate selection, and that simple averaging methods, such as the ensemble learning approach, may not provide reliable recommendations. The high model uncertainty raises the question of whether machine learning approaches can be used effectively for site-specific input management.

figure 8

Scatterplots and histograms of EOIRs of the basal and top-dressing fertilizers in six randomly selected locations (Fig. 1). Data derived from all combinations of algorithms and covariate selection ( n = 1,020). Dashed lines in scatterplots represent borders of the 5th and 95th percentiles
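The concern about averaging a bimodal set of EOIRs can be made concrete with a small summary of one location's estimates. The data below are hypothetical (half of the runs at the lowest top-dressing rate, half at the highest), and the use of the sample standard deviation for the CV is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize_eoirs(eoirs):
    """Mean, CV (%), and 5th/95th percentiles of EOIR estimates at one location."""
    mean = statistics.mean(eoirs)
    cv_percent = 100 * statistics.stdev(eoirs) / mean
    # n=20 yields 19 cut points: the first is the 5th and the last the 95th percentile.
    cuts = statistics.quantiles(eoirs, n=20, method="inclusive")
    return {"mean": mean, "cv_percent": cv_percent,
            "p05": cuts[0], "p95": cuts[-1]}


# Hypothetical bimodal estimates, in kg per ha.
eoirs = [222] * 50 + [444] * 50
summary = summarize_eoirs(eoirs)
print(summary)
```

With such a distribution, the mean lies midway between the two modes at a rate that none of the individual runs recommended, which is why an ensemble average is a poor summary in this situation.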

The estimated gained profits (relative to the uniform input management) ranged from 150 to 660 $ ha –1 depending on the selected algorithms and covariates (Fig. 9). RF and XGBoost occasionally showed extremely high gained profits. ANN showed lower gained profits and higher uncertainty than other models. SVR showed a lower uncertainty in predicted gained profits. Thus, gained profits predicted by machine learning approaches are quite sensitive to algorithm and covariate selection.

figure 9

Histograms of gained profits of entire simulations ( n  = 1,020) and each algorithm ( n  = 255)

Precise yield prediction and estimation of the causal effects of inputs are essential for the successful implementation of site-specific crop management supported by OFPE. Generally, researchers prefer to select the ‘best’ model based on the metrics of yield prediction accuracies, such as RMSE, mean absolute error, and R 2 values (Barbosa et al., 2020 ; Wen et al., 2021 ). But this study has provided valuable information about the impacts of algorithm and covariate selection on fertilization rate management. Overall, machine learning models can predict crop yield well based on RMSE and R 2 values (Figs.  3 and 4 ). But each machine learning model showed very different EOIR predictions even within a small area (Fig.  7 ). Predicted site-specific EOIRs were very sensitive to the choice of algorithm and selection of covariates (Figs.  5 , 6 and 8 ). These results highlight that practitioners need a careful consideration of model uncertainty before providing decision makers with fertilizer management recommendations.

The reason that the CV values of EOIRs are not constant in space (Figs. 5 and 6) must be that soil properties are not constant in space (S1), as they are the only model inputs that vary spatially. There are also other factors that influence spatial variability of EOIRs. For instance, spatial variability in established seedlings significantly affects yield (Tanaka et al., 2019), while locations that were trenches before land consolidation had approx. 1.0 t ha –1 lower yield than the other parts of the study area, probably due to differences in temporal change in soil moisture conditions (Zhou et al., 2022). However, it is not practical to conduct manual counting of seedlings or to install soil moisture sensors across fields in order to include these factors in the crop yield response modeling. Given the expense of data collection for OFPE, remotely or proximally sensed data might be useful to develop better machine learning models. Although causal factors affecting crop yield could not be identified, either in-season crop sensing or historical yield maps might also help explain the pattern of yield response. Given that a better management strategy might be established by combining multiple sources of information, it might be necessary to include not only soil data but also in-season and previous proximal/remote sensing data as covariates in crop yield response modeling.

Insufficient consideration of model uncertainty may lead to highly undesirable input use decisions. Bimodal distributions with two peaks at the lowest and highest application rates were found at several locations (Fig. 8), indicating large uncertainty about the effect of fertilizer application on yield. Given the small crop yield response to top-dressing fertilizer at the high rate of basal fertilizer (Fig. 2), the highest top-dressing application rates could lead to considerable revenue loss. Although basal fertilizer is a main determinant of yield variation, top-dressing fertilizer might be a nuisance covariate for the yield prediction model. Thus, not only outcomes from machine learning but also supplementary insights from agronomic knowledge might be important for sensible EOIR recommendations.

High uncertainty in gained profits was evident depending on model and covariate selection (Fig. 9). For instance, the estimated gained profits ranged from 150 to 660 $ ha –1 . This has major implications for the cost-benefit analysis of PA technology, which in turn affects decisions on adopting new technology. Previous studies have evaluated the farm scale or the number of seasons needed to recover the purchase cost of variable-rate application equipment (Maine et al., 2010; Tanaka et al., 2023a). In such cases, model uncertainty will affect decision making not only on fertilization but also on the adoption of variable-rate application equipment. ANN had higher uncertainty and lower gained profits. This indicates that ANN provided a more pessimistic and conservative scenario than other models. Therefore, it should be noted that each model tends to provide different uncertainty and recommendations in assessing gained profits. In practice, researchers tend to use only one model. But this is not without risk, since different models produce different results and lead to different recommendations. Thus, it seems reasonable to use an ensemble learning approach. However, as discussed in the case of site-specific EOIR predictions, an ensemble learning approach might not enhance prediction accuracy for the causal impact derived from the yield prediction models, because the predicted EOIRs diverged toward opposite extremes (i.e., either the lowest or the highest input rate) (Fig. 8). Practitioners should keep in mind that ensemble learning might be helpful to assess the model uncertainty, but the resultant recommendations should not simply be derived from the average.

This study focused only on the effect of model and covariate selection on spatial uncertainty in the EOIRs for fertilizer recommendation. But there are also other sources of uncertainty that deserve attention in future research. For instance, the CV value of exchangeable K was not small (approx. 27%) (S2), indicating substantial uncertainty in model input. While spatial interpolation of soil properties based on geostatistics has been common in producing digital soil maps (Heuvelink & Webster, 2022 ), it may lead to interpolation errors. If crop yield response models were linear in the soil input, propagation of interpolation errors in digital soil maps could be obtained by simple error propagation rules (Taylor, 1982 ), but that simple approach is not available for non-linear machine learning models. Due to the high computational cost of Monte Carlo uncertainty propagation methods, this study did not pursue this topic but instead focused on the effect of algorithm and covariate selection on prediction uncertainty. Further study might be needed to explore the effect of machine learning covariate uncertainty on EOIRs and profit prediction.

To train machine learning models, 970 observations with up to eight covariates were used in this study. The data size seemed to be sufficient to achieve accurate prediction of site-specific yield levels (Figs. 3 and 4). However, this study used only small-scale on-farm experimental data. The sensitivity of the model uncertainty to the size of the training dataset should be explored in future studies. The differences in predicted EOIRs and profits between model algorithms are expected to be smaller when the training dataset is larger. Furthermore, not only spatial but also temporal uncertainty would be essential to consider for better fertilizer recommendations. This study used real on-farm data from a single season only, and it is difficult to repeat the experiments at the same site for multiple years due to the constraints of real farm situations. Synthetic data generated by mechanistic models (e.g., the Mitscherlich-Baule function) are not capable of simulating the impact of weather on crop yield. Therefore, a possible solution for the data scarcity problem is to integrate geostatistical simulation and a crop simulation model to simulate space-time variability in crop yield (Tanaka et al., 2023b). A surrogate model consisting of a machine learning model trained with synthetic data from a crop simulation model has been proposed to combine biophysical domain knowledge of crop simulation models with data-driven machine learning approaches (Pylianidis et al., 2022). Therefore, integration of Gaussian simulation, a crop simulation model, and a machine learning approach would provide a chance to assess the space-time model uncertainty.

Conclusions

OFPEs are frequently conducted to generate data for the estimation of site-specific crop yield response models. Interest is growing in employing machine learning algorithms to identify spatially heterogeneous and non-linear relationships between agronomic inputs and crop yield. This study showed that EOIR and profit-gain predictions were very sensitive to the selection of machine learning algorithm and covariates. Furthermore, the yield response to fertilization could vary substantially from site to site, even in a small area. This might be due to the model uncertainty derived from algorithm and covariate selection. These results highlight the difficulty of providing reliable site-specific input application rate recommendations based on one specific machine learning algorithm and one specific set of covariates. Note that the outcomes of this study were based on small-scale on-farm data collected in a single season. Further research using data from large-scale OFPEs or synthetic data generated by process-based models should be oriented towards causal inference, to support accurate and robust EOIR predictions.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Code availability

The codes generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Accessed 7 August 2022.

Adams, M. L., Cook, S., & Corner, R. (2000). Managing uncertainty in site-specific management: What is the best model? Precision Agriculture , 2 , 39–54.

Alesso, C. A., Cipriotti, P. A., Bollero, G. A., & Martin, N. F. (2020). Design of on-farm precision experiments to estimate site-specific crop responses. Agronomy Journal , (December 2020) , 1–15. https://doi.org/10.1002/agj2.20572 .

Barbosa, A., Trevisan, R., Hovakimyan, N., & Martin, N. F. (2020). Modeling yield response to crop management using convolutional neural networks. Computers and Electronics in Agriculture , 170 (May 2019), 105197. https://doi.org/10.1016/j.compag.2019.105197 .

Boogaard, H., & de Wit, A. (2020). WOFOST: simulation model for quantitative analysis of growth/production of annual crops, (April).

Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological) , 26 (2), 211–243. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x .

Bullock, D. S., Boerngen, M., Tao, H., Maxwell, B., Luck, J. D., Shiratsuchi, L., et al. (2019). The Data-Intensive Farm Management Project: Changing Agronomic Research through On‐Farm Precision Experimentation. Agronomy Journal , 111 (6), 2736–2746. https://doi.org/10.2134/agronj2019.03.0165 .

Cataldo, D. A., Maroon, M., Schrader, L. E., & Youngs, V. L. (1975). Rapid colorimetric determination of nitrate in plant tissue by nitration of salicylic acid. Communications in Soil Science and Plant Analysis , 6 (1), 71–80. https://doi.org/10.1080/00103627509366547 .

Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785 .

Chollet, F., et al. (2015). Keras. GitHub. https://github.com/fchollet/keras. Accessed 7 August 2022.

Corner, R., Marinelli, M., & Wright, G. (2008). Error propagation analysis techniques Applied to Precision Agriculture and Environmental models. Quality aspects in spatial data mining (pp. 131–145). CRC. https://doi.org/10.1201/9781420069273.ch11 .

de Wit, A., Boogaard, H., Fumagalli, D., Janssen, S., Knapen, R., van Kraalingen, D., et al. (2019). 25 years of the WOFOST cropping systems model. Agricultural Systems , 168 (July 2018), 154–167. https://doi.org/10.1016/j.agsy.2018.06.018 .

R Development Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.r-project.org . Accessed 30 March 2022.

Evans, F. H., Salas, A. R., Rakshit, S., Scanlan, C. A., & Cook, S. E. (2020). Assessment of the use of geographically weighted regression for analysis of large on-farm experiments and implications for practical application. Agronomy , 10 (11), 1720. https://doi.org/10.3390/agronomy10111720 .

Heuvelink, G. B. M. (1998). Error propagation in Environmental Modelling with GIS . CRC. https://doi.org/10.4324/9780203016114 .

Heuvelink, G. B. M., & Webster, R. (2022). Spatial statistics and soil mapping: A blossoming partnership under pressure. Spatial Statistics , 100639. https://doi.org/10.1016/j.spasta.2022.100639 .

Heuvelink, G. B. M., Brus, D. J., & Reinds, G. (2010). Accounting for spatial sampling effects in regional uncertainty propagation analysis. The 9th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, Leicester . https://edepot.wur.nl/160785 . Accessed 19 October 2022.

Holzworth, D. P., Huth, N. I., deVoil, P. G., Zurcher, E. J., Herrmann, N. I., McLean, G., et al. (2014). APSIM – Evolution towards a new generation of agricultural systems simulation. Environmental Modelling & Software , 62 , 327–350. https://doi.org/10.1016/j.envsoft.2014.07.009 .

Hoogenboom, G., Porter, C. H., Boote, K. J., Shelia, V., Wilkens, P. W., Singh, U. (2019). The DSSAT crop modeling ecosystem (pp. 173–216). https://doi.org/10.19103/AS.2019.0061.10 .

Inoko, A. (1986). Available nitrogen. In Y. Onikura, et al. (Eds.), Standard methods of soil analysis and measurement (pp. 118–121). Hakuyuusha.

Kakimoto, S., Mieno, T., Tanaka, T. S. T., & Bullock, D. S. (2022). Causal forest approach for site-specific input management via on-farm precision experimentation. Computers and Electronics in Agriculture , 199 , 107164. https://doi.org/10.1016/j.compag.2022.107164 .

Keeney, D. R., & Nelson, D. W. (2015). Nitrogen-Inorganic Forms (pp. 643–698). https://doi.org/10.2134/agronmonogr9.2.2ed.c33 .

Krause, M. R., Crossman, S., DuMond, T., Lott, R., Swede, J., Arliss, S., et al. (2020). Random forest regression for optimizing variable planting rates for corn and soybean using topographical and soil data. Agronomy Journal , 112 (6), 5045–5066. https://doi.org/10.1002/agj2.20442 .

Lacoste, M., Cook, S., McNee, M., Gale, D., Ingram, J., Bellon-Maurel, V., et al. (2022). On-Farm Experimentation to transform global agriculture. Nature Food , 3 (1), 11–18. https://doi.org/10.1038/s43016-021-00424-4 .

Lark, R. M., Cullis, B. R., & Welham, S. J. (2006). On spatial prediction of soil properties in the presence of a spatial trend: The empirical best linear unbiased predictor (E-BLUP) with REML. European Journal of Soil Science , 57 , 787–799. https://doi.org/10.1111/j.1365-2389.2005.00768.x .

Maine, N., Lowenberg-DeBoer, J., Nell, W. T., & Alemu, Z. G. (2010). Impact of variable-rate application of nitrogen on yield and profit: A case study from South Africa. Precision Agriculture , 11 , 448–463. https://doi.org/10.1007/s11119-009-9139-8 .

Paccioretti, P., Bruno, C., Gianinni Kurina, F., Córdoba, M., Bullock, D. S., & Balzarini, M. (2021). Statistical models of yield in on-farm precision experimentation. Agronomy Journal , 113 (6), 4916–4929. https://doi.org/10.1002/agj2.20833 .

Pasquel, D., Roux, S., Richetti, J., Cammarano, D., Tisseyre, B., & Taylor, J. A. (2022). A review of methods to evaluate crop model performance at multiple and changing spatial scales. Precision Agriculture . https://doi.org/10.1007/s11119-022-09885-4 .

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf. Accessed 30 March 2022.

Pylianidis, C., Snow, V., Overweg, H., Osinga, S., Kean, J., & Athanasiadis, I. N. (2022). Simulation-assisted machine learning for operational digital twins. Environmental Modelling and Software , 148 , https://doi.org/10.1016/j.envsoft.2021.105274 .

Ribeiro, P. J., & Diggle, P. J. (2001). The geoR package. R-NEWS , 1 , 15–18.

Roques, S. E., Kindred, D. R., Berry, P., & Helliwell, J. (2022). Successful approaches for on-farm experimentation. Field Crops Research , 287 , 108651. https://doi.org/10.1016/j.fcr.2022.108651 .

Saikai, Y., Patel, V., & Mitchell, P. D. (2020). Machine learning for optimizing complex site-specific management. Computers and Electronics in Agriculture , 174 , https://doi.org/10.1016/j.compag.2020.105381 .

Tanaka, T. S. T., Kono, Y., & Matsui, T. (2019). Assessing the spatial variability of winter wheat yield in large-scale paddy fields of Japan using structural equation modelling. Precision agriculture ’19 (pp. 751–757). https://doi.org/10.3920/978-90-8686-888-9_93.

Tanaka, T. S. T., Mieno, T., Tanabe, R., Matsui, T., & Bullock, D. S. (2023a). Toward an effective approach for on-farm experimentation: Lessons learned from a case study of fertilizer application optimization in Japan. Precision Agriculture . https://doi.org/10.1007/s11119-023-10029-5 .

Tanaka, T. S. T., Yokoyama, Y., Mieno, T., & de Wit, A. (2023b). 27. Synthetic data generation for validating site-specific crop yield response modelling using WOFOST and gaussian geostatistical simulations. Precision agriculture ’23 (pp. 229–235). Wageningen Academic. https://doi.org/10.3920/978-90-8686-947-3_27 .

Taylor, J. R. (1982). An introduction to Error Analysis: The study of uncertainties in physical measurements . University Science Books.

Trevisan, R. G., Bullock, D. S., & Martin, N. F. (2021). Spatial variability of crop responses to agronomic inputs in on-farm precision experimentation. Precision Agriculture , 22 , 342–363. https://doi.org/10.1007/s11119-020-09720-8 .

Truog, E. (1930). The determination of the readily available phosphorus of soils. Agronomy Journal, 22(10), 874–882. https://doi.org/10.2134/agronj1930.00021962002200100008x.

Webster, R., & Oliver, M. A. (2007). Geostatistics for Environmental Scientists (2nd ed.).

Wen, G., Ma, B. L., Vanasse, A., Caldwell, C. D., Earl, H. J., & Smith, D. L. (2021). Machine learning-based canola yield prediction for site-specific nitrogen recommendations. Nutrient Cycling in Agroecosystems , 121 (2–3), 241–256. https://doi.org/10.1007/s10705-021-10170-5 .

Zhou, X., Heuvelink, G. B. M., Kono, Y., Matsui, T., & Tanaka, T. S. T. (2022). Using linear mixed-effects modeling to evaluate the impact of edaphic factors on spatial variation in winter wheat grain yield in Japanese consolidated paddy fields. European Journal of Agronomy , 133 , 126447. https://doi.org/10.1016/j.eja.2021.126447 .

Acknowledgements

The authors wish to thank the farming company ‘Fukue-eino’ for allowing the survey of their fields. This study was supported by a JST ACT-X (JPMJAX20AF), Japan. The study was conducted using the ICP-AES of the Division of Instrument Analysis, Gifu University. This study was funded in part by USDA-NRCS On-Farm Trials Conservation Innovation Grant, “Improving the Economic and Ecological Sustainability of US Crop Production through On-Farm Precision Experimentation,” award number NR213A7500013G021, and by USDA-NIFA Hatch Project 470-362.

Open access funding provided by Aarhus Universitet

Author information

Takashi S. T. Tanaka

Present address: Department of Agroecology, Faculty of Technical Sciences, Aarhus University, Forsøgsvej 1, Slagelse, 4200, Denmark

Authors and Affiliations

Faculty of Applied Biological Sciences, Gifu University, Yanagido, Gifu, 5011193, Japan

Artificial Intelligence Advanced Research Center, Gifu University, Yanagido, Gifu, 5011193, Japan

ISRIC-World Soil Information, Wageningen, The Netherlands

Gerard B. M. Heuvelink

Soil Geography and Landscape Group, Wageningen University, Wageningen, The Netherlands

Department of Agricultural Economics, University of Nebraska-Lincoln, Lincoln, NE, 68583-0922, USA

Agricultural and Consumer Economics, University of Illinois, 326 Mumford Hall, 1301 W. Gregory, Urbana, IL, 61801, USA

David S. Bullock

Contributions

Conceptualization, T.S.T.T. and G.B.M.H.; Methodology, T.S.T.T., G.B.M.H., and T.M.; Investigation, T.S.T.T.; Writing – Original Draft, T.S.T.T.; Writing – Review & Editing, T.S.T.T., G.B.M.H., T.M., and D.S.B.; Visualization, T.S.T.T.; Funding Acquisition, T.S.T.T. and D.S.B.; Resources, T.S.T.T.; Supervision, T.S.T.T., G.B.M.H., and D.S.B.

Corresponding author

Correspondence to Takashi S. T. Tanaka .

Ethics declarations

Ethical approval.

Not available.

Competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Tanaka, T.S.T., Heuvelink, G.B.M., Mieno, T. et al. Can machine learning models provide accurate fertilizer recommendations?. Precision Agric (2024). https://doi.org/10.1007/s11119-024-10136-x

Accepted : 08 March 2024

Published : 25 March 2024

DOI : https://doi.org/10.1007/s11119-024-10136-x

  • Economically optimal input rate
  • On-farm experimentation
  • Site-specific management
  • Variable-rate application
  • Winter wheat

Machine-learning model demonstrates effect of public breeding on rice yields in climate change

  • Story by Lindsey Berebitsky
  • March 25, 2024

Climate change, extreme weather events, unprecedented records in temperatures and higher, acidic oceans make it difficult to predict the long-term fate of modern crop varieties.

In a paper published in the March 18, 2024, issue of the Proceedings of the National Academy of Sciences, Diane Wang, an assistant professor in Purdue’s Department of Agronomy, and her post-doctoral researcher Sajad Jamshidi reported on a predictive model they’ve developed that uses machine-learning algorithms to predict how rice yields will be affected by climate change. Their work was completed in collaboration with researchers at Cornell University and the Dale Bumpers National Rice Research Center.

“With these kinds of large-scale statistical models, you're basically taking a set of predictors — like weather or genetics — and mapping them to solve for an outcome. Here, we are interested in predicting yield,” Wang said.

The U.S. is in the top five exporters of rice, making rice production across several southern states important to diets around the world. Wang and Jamshidi’s work lays a foundation for artificial intelligence predictions in rice and other crops, potentially helping agriculture hone breeding practices where crop varieties are most vulnerable to climate change.

Graph showing how different groups of varieties respond to climate change in terms of yield

“The ensemble model predicts that modern groups of rice varieties will do less badly than groups of older varieties, but I would be careful to say we’ve finished our job,” Wang said. “There is a lot of uncertainty with respect to future climates, and these kinds of models are just one tool to explore scenarios.”

Rice has a small genome compared with other crops. That and the availability of historical data and old-variety seeds made it the ideal study system to design a predictive model. The team obtained historical temperatures and weather data as well as what Wang called the “serendipitous discovery of variety acreage reports.”

The southern U.S. rice-growing states of the Mississippi Delta region have recorded what variety of rice was grown in what proportion at the county level since the 1970s. Many of these acreage reports were sent to the team as typewritten documents. The group then was able to obtain, from collaborators at the Dale Bumpers National Rice Research Center, seeds from old rice varieties that are no longer commonly grown.

A graph showing how the different rice varieties changed in popularity over time

These rice varieties were analyzed at the genetic level, and Wang and collaborators grouped varieties based on alleles, or gene variations, that they shared. They translated this information from the variety acreage reports into county-level “bags of alleles” and then trained machine-learning models using the allele groups and county-level yields with historical environmental data, like temperature and precipitation.

Jamshidi’s efforts in building this model are especially novel because the final model combines 10 methods of machine learning to create an ensemble model that can process information with a more multifaceted approach. The ensemble model’s output offers more accurate results under the same predictors.
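The idea of combining several learners into one ensemble can be sketched as follows. The article does not say how the 10 methods were combined, so the weighted average used here is an assumption, and the three base models, their coefficients, and the feature names (`precip_mm`, `tmax_c`, `modern_allele_share`) are hypothetical stand-ins rather than the team's actual models.

```python
import statistics


def ensemble_predict(base_models, features, weights=None):
    """Combine base-model predictions by an (optionally weighted) average."""
    preds = [model(features) for model in base_models]
    if weights is None:
        return statistics.fmean(preds)
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)


# Hypothetical base learners mapping county-level predictors to yield (t/ha).
def linear_model(x):
    return 5.0 + 0.02 * x["precip_mm"] - 0.10 * x["tmax_c"]


def threshold_model(x):
    return 7.5 if x["tmax_c"] < 33.0 else 6.0


def allele_model(x):
    return 6.0 + 1.5 * x["modern_allele_share"]


models = [linear_model, threshold_model, allele_model]
county = {"precip_mm": 100.0, "tmax_c": 30.0, "modern_allele_share": 0.5}

equal_weight = ensemble_predict(models, county)
weighted = ensemble_predict(models, county, weights=[1.0, 1.0, 2.0])
print(equal_weight, weighted)
```

Averaging tends to damp the errors of any single learner when the base models make different kinds of mistakes, which is one common motivation for ensembles.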

Not only will this study provide a framework to build models for other crops with similar predictors, but Wang sees another possible direction for this research. Carrying out physical experiments by growing both old and modern rice varieties under predicted conditions could serve as an additional evaluation of the model, as well as give hints to the genetic and physiological makeup causing the difference in resilience between the variety groups.

Wang said, “These kinds of predictions are really the first step. The model has given us some potential outcomes, but now someone has to run the follow-up experiments to get at underlying mechanisms.”

Wang and her lab continue to study the interactions between crops’ genetics and their environment, and they are using modeling and other technologies to create a more predictable future for agriculture.
