Eindhoven University of Technology research portal

Data Mining

  • Data Science
  • Data and Artificial Intelligence

Student theses

  • 1 - 50 out of 258 results

Search results

3D face reconstruction using deep learning.

Supervisor: Medeiros de Carvalho, R. (Supervisor 1), Gallucci, A. (Supervisor 2) & Vanschoren, J. (Supervisor 2)

Student thesis : Master

Achieving Long Term Fairness through Curiosity Driven Reinforcement Learning: How intrinsic motivation influences fairness in algorithmic decision making

Supervisor: Pechenizkiy, M. (Supervisor 1), Gajane, P. (Supervisor 2) & Kapodistria, S. (Supervisor 2)

Activity Recognition Using Deep Learning in Videos under Clinical Setting

Supervisor: Duivesteijn, W. (Supervisor 1), Papapetrou, O. (Supervisor 2), Zhang, L. (External person) (External coach) & Vasu, J. D. (External coach)

A Data Cleaning Assistant

Supervisor: Vanschoren, J. (Supervisor 1)

Student thesis : Bachelor

A Data Cleaning Assistant for Machine Learning

A deep learning approach for clustering a multi-class dataset.

Supervisor: Pei, Y. (Supervisor 1), Marczak, M. (External person) (External coach) & Groen, J. (External person) (External coach)

Aerial Imagery Pixel-level Segmentation

A framework for understanding business process remaining time predictions.

Supervisor: Pechenizkiy, M. (Supervisor 1) & Scheepens, R. J. (Supervisor 2)

A Hybrid Model for Pedestrian Motion Prediction

Supervisor: Pechenizkiy, M. (Supervisor 1), Muñoz Sánchez, M. (Supervisor 2), Silvas, E. (External coach) & Smit, R. M. B. (External coach)

Algorithms for center-based trajectory clustering

Supervisor: Buchin, K. (Supervisor 1) & Driemel, A. (Supervisor 2)

Allocation Decision-Making in Service Supply Chain with Deep Reinforcement Learning

Supervisor: Zhang, Y. (Supervisor 1), van Jaarsveld, W. L. (Supervisor 2), Menkovski, V. (Supervisor 2) & Lamghari-Idrissi, D. (Supervisor 2)

Analyzing Policy Gradient approaches towards Rapid Policy Transfer

An empirical study on dynamic curriculum learning in information retrieval.

Supervisor: Fang, M. (Supervisor 1)

An Explainable Approach to Multi-contextual Fake News Detection

Supervisor: Pechenizkiy, M. (Supervisor 1), Pei, Y. (Supervisor 2) & Das, B. (External person) (External coach)

An exploration and evaluation of concept based interpretability methods as a measure of representation quality in neural networks

Supervisor: Menkovski, V. (Supervisor 1) & Stolikj, M. (External coach)

Anomaly detection in image data sets using disentangled representations

Supervisor: Menkovski, V. (Supervisor 1) & Tonnaer, L. M. A. (Supervisor 2)

Anomaly Detection in Polysomnography signals using AI

Supervisor: Pechenizkiy, M. (Supervisor 1), Schwanz Dias, S. (Supervisor 2) & Belur Nagaraj, S. (External person) (External coach)

Anomaly detection in text data using deep generative models

Supervisor: Menkovski, V. (Supervisor 1) & van Ipenburg, W. (External person) (External coach)

Anomaly Detection on Dynamic Graph

Supervisor: Pei, Y. (Supervisor 1), Fang, M. (Supervisor 2) & Monemizadeh, M. (Supervisor 2)

Anomaly Detection on Finite Multivariate Time Series from Semi-Automated Screwing Applications

Supervisor: Pechenizkiy, M. (Supervisor 1) & Schwanz Dias, S. (Supervisor 2)

Anomaly Detection on Multivariate Time Series Using GANs

Supervisor: Pei, Y. (Supervisor 1) & Kruizinga, P. (External person) (External coach)

Anomaly detection on vibration data

Supervisor: Hess, S. (Supervisor 1), Pechenizkiy, M. (Supervisor 2), Yakovets, N. (Supervisor 2) & Uusitalo, J. (External person) (External coach)

Application of P&ID symbol detection and classification for generation of material take-off documents (MTOs)

Supervisor: Pechenizkiy, M. (Supervisor 1), Banotra, R. (External person) (External coach) & Ya-alimadad, M. (External person) (External coach)

Applications of deep generative models to Tokamak Nuclear Fusion

Supervisor: Koelman, J. M. V. A. (Supervisor 1), Menkovski, V. (Supervisor 2), Citrin, J. (Supervisor 2) & van de Plassche, K. L. (External coach)

A Similarity Based Meta-Learning Approach to Building Pipeline Portfolios for Automated Machine Learning

Aspect-based few-shot learning.

Supervisor: Menkovski, V. (Supervisor 1)

Assessing Bias and Fairness in Machine Learning through a Causal Lens

Supervisor: Pechenizkiy, M. (Supervisor 1)

Assessing fairness in anomaly detection: A framework for developing a context-aware fairness tool to assess rule-based models

Supervisor: Pechenizkiy, M. (Supervisor 1), Weerts, H. J. P. (Supervisor 2), van Ipenburg, W. (External person) (External coach) & Veldsink, J. W. (External person) (External coach)

A Study of an Open-Ended Strategy for Learning Complex Locomotion Skills

A systematic determination of metrics for classification tasks in OpenML, a universally applicable EMM framework.

Supervisor: Duivesteijn, W. (Supervisor 1), van Dongen, B. F. (Supervisor 2) & Yakovets, N. (Supervisor 2)

Automated machine learning with gradient boosting and meta-learning

Automated object recognition of solar panels in aerial photographs: a case study in the Liander service area.

Supervisor: Pechenizkiy, M. (Supervisor 1), Medeiros de Carvalho, R. (Supervisor 2) & Weelinck, T. (External person) (External coach)

Automatic data cleaning

Automatic scoring of short open-ended questions.

Supervisor: Pechenizkiy, M. (Supervisor 1) & van Gils, S. (External coach)

Automatic Synthesis of Machine Learning Pipelines consisting of Pre-Trained Models for Multimodal Data

Automating string encoding in AutoML, autoregressive neural networks to model electroencephalography signals.

Supervisor: Vanschoren, J. (Supervisor 1), Pfundtner, S. (External person) (External coach) & Radha, M. (External coach)

Balancing Efficiency and Fairness on Ride-Hailing Platforms via Reinforcement Learning

Supervisor: Tavakol, M. (Supervisor 1), Pechenizkiy, M. (Supervisor 2) & Boon, M. A. A. (Supervisor 2)

Benchmarking Audio DeepFake Detection

Better clustering evaluation for the OpenML evaluation engine.

Supervisor: Vanschoren, J. (Supervisor 1), Gijsbers, P. (Supervisor 2) & Singh, P. (Supervisor 2)

Bi-level pipeline optimization for scalable AutoML

Supervisor: Nobile, M. (Supervisor 1), Vanschoren, J. (Supervisor 1), Medeiros de Carvalho, R. (Supervisor 2) & Bliek, L. (Supervisor 2)

Block-sparse evolutionary training using weight momentum evolution: training methods for hardware efficient sparse neural networks

Supervisor: Mocanu, D. (Supervisor 1), Zhang, Y. (Supervisor 2) & Lowet, D. J. C. (External coach)

Boolean Matrix Factorization and Completion

Supervisor: Peharz, R. (Supervisor 1) & Hess, S. (Supervisor 2)

Bootstrap Hypothesis Tests for Evaluating Subgroup Descriptions in Exceptional Model Mining

Supervisor: Duivesteijn, W. (Supervisor 1) & Schouten, R. M. (Supervisor 2)

Bottom-Up Search: A Distance-Based Search Strategy for Supervised Local Pattern Mining on Multi-Dimensional Target Spaces

Supervisor: Duivesteijn, W. (Supervisor 1), Serebrenik, A. (Supervisor 2) & Kromwijk, T. J. (Supervisor 2)

Bridging the Domain-Gap in Computer Vision Tasks

Supervisor: Mocanu, D. C. (Supervisor 1) & Lowet, D. J. C. (External coach)

CCESO: Auditing AI Fairness By Comparing Counterfactual Explanations of Similar Objects

Supervisor: Pechenizkiy, M. (Supervisor 1) & Hoogland, K. (External person) (External coach)

Clean-Label Poison Attacks on Machine Learning

Supervisor: Michiels, W. P. A. J. (Supervisor 1), Schalij, F. D. (External coach) & Hess, S. (Supervisor 2)

  • Research article
  • Open access
  • Published: 07 June 2021

Predicting the incidence of COVID-19 using data mining

  • Fatemeh Ahouz 1 &
  • Amin Golabpour   ORCID: orcid.org/0000-0001-7649-4033 2  

BMC Public Health volume 21, Article number: 1087 (2021)

The high prevalence of COVID-19 has made it a new pandemic. Predicting both its prevalence and incidence throughout the world is crucial to help health professionals make key decisions. In this study, we aim to predict the incidence of COVID-19 within a two-week period to better manage the disease.

The COVID-19 datasets provided by Johns Hopkins University, contain information on COVID-19 cases in different geographic regions since January 22, 2020 and are updated daily. Data from 252 such regions were analyzed as of March 29, 2020, with 17,136 records and 4 variables, namely latitude, longitude, date, and records. In order to design the incidence pattern for each geographic region, the information was utilized on the region and its neighboring areas gathered 2 weeks prior to the designing. Then, a model was developed to predict the incidence rate for the coming 2 weeks via a Least-Square Boosting Classification algorithm.

The model was presented for three groups based on the incidence rate: less than 200, between 200 and 1000, and above 1000. The mean absolute errors of the model evaluation were 4.71, 8.54, and 6.13%, respectively. Also, comparing the forecast results with the actual values in the period in question showed that the proposed model predicted the number of globally confirmed cases of COVID-19 with a very high accuracy of 98.45%.

Using data from different geographical regions within a country and discovering the pattern of prevalence in a region and its neighboring areas, our boosting-based model was able to accurately predict the incidence of COVID-19 within a two-week period.


On December 8, 2019, the Chinese government reported the death of one patient and the hospitalization of 41 others with unknown etiology in Wuhan [ 1 ]. This cluster initiated the novel coronavirus (COVID-19) epidemic respiratory disease. While the early cases were linked to the wet market, human-to-human transmission led to a widespread outbreak of the virus nationwide [ 2 ]. On January 30, 2020, the World Health Organization (WHO) declared COVID-19 a public health emergency of international concern (PHEIC) [ 3 ].

On the basis of the global spread and severity of the disease, on March 11, 2020, the Director-General of WHO officially declared the COVID-19 outbreak a pandemic [ 4 ]. The pandemic thus entered a new stage, with rapid spread in countries outside China [ 5 ]. According to the 56th WHO situation report [ 6 ], as of March 16, 2020, the number of COVID-19 confirmed cases outside China exceeded those inside. Consequently, after March 17, 2020, WHO began to report the number of confirmed and dead cases on each continent, as opposed to merely providing patient statistics inside and outside China.

According to the 70th WHO situation report [ 7 ], by March 30, 2020 the number of people infected with COVID-19 worldwide was 693,282, of whom 392,815 (about 57%) were in Europe, 142,081 (about 20%) in the Americas, 103,775 (about 15%) in the Western Pacific, 46,329 (about 7%) in the Eastern Mediterranean, 4084 (about 0.5%) in South-East Asia, and 3486 (about 0.5%) in Africa. Of that total, 33,106 died worldwide, 23,962 of whom (around 72% of all deaths) were in Europe, 3649 (around 11%) in the Western Pacific, and 5488 (around 17%) in other regions collectively.

Due to the growing prevalence of COVID-19 across the world, several works have examined different aspects of the disease. They involve identifying the source of the virus and analyzing its gene sequences [ 8 , 9 ], patient information [ 10 ], early cases in the infected countries [ 11 , 12 , 13 ], methods of virus detection [ 14 , 15 ], the epidemiological outbreak [ 16 , 17 ], and predicting COVID-19 cases [ 2 , 17 , 18 , 19 , 20 ].

In [ 18 ], using a heuristic method and WHO situation reports, an exponential curve was proposed to predict the number of cases in the 2 weeks up to March 30, 2020. The model was then tested against the 58th situation report, and the authors reported a 1.29% error. Afterwards, on the assumption that the current trend would continue for the next 17 days, they predicted that by March 30, 1 million cases outside China would be reported in the 70/71st WHO situation report. Given that the number of confirmed cases outside China was 693,176 on March 30 [ 21 ], their forecast error was 44.26%.

In [ 17 ], the CoronaTracker team proposed a Susceptible-Exposed-Infectious-Recovered (SEIR) model based on the data queried on their website, and made a 240-day prediction of COVID-19 cases in and out of China, starting on 20 January 2020. They predicted that the outbreak would reach its peak on May 23, 2020, with the maximum number of infected individuals amounting to 425.066 million globally. In addition, the authors stated that this number would start to drop around early July 2020 and fall below 10,000 on 14 September 2020. Given the information available now, these predictions were far from what really happened around the world.

Elsewhere [ 19 ], the authors examined some available models to predict cumulative cases 5 and 10 days ahead in Guangdong and Zhejiang by February 23, 2020. They used the generalized logistic growth model, the Richards growth model, and a sub-epidemic wave model, which had previously been utilized to forecast earlier infectious outbreaks.

Although some works have proposed methods for predicting COVID-19 cases, to our knowledge at the time of writing this paper none has been comprehensive enough to predict new cases in each geographical region as well as on each continent. In this study, using the COVID-19 Cases dataset provided by Johns Hopkins University [ 22 ], we aim to predict the number of COVID-19 cases in each geographical region included in the dataset, as well as on each continent, in the coming 2-week period. Predicting the situation in the current pandemic is crucial to containing the threat because it helps in taking timely medical measures, e.g., equipping medical facilities, managing resource allocation, sending more personnel to high-risk areas, deciding whether to close borders or resume traffic, and suspending or resuming community services.

COVID-19 epidemiological data have been compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [ 22 ]. The data are provided in three separate datasets for confirmed, recovered, and death cases since January 22, 2020 and are updated daily. In each of these datasets there is a record (row) for every geographic region. The variables in each dataset are province/state, country/region, latitude, longitude, and the incremental dates since January 22. For each region, the value for any date indicates the cumulative number of confirmed/recovered/death cases since January 22, 2020.

In this study, according to the input requirements of the proposed model, we changed the data representation so that, instead of three separate datasets for the confirmed, recovered, and death cases, a single dataset containing the information of all three groups was arranged. In this new dataset, each record (row) contains information about the number of confirmed, recovered, or death cases per day for each geographic region. As a result, the variables in the new dataset are: Province/State, Country/Region, Latitude (Lat), Longitude (Long), Date (specifying a certain date), Cases (the number of confirmed, recovered, or death cases on that date), and Type (the type of cases, i.e. confirmed, recovered, or death), as suggested by Rami Krispin [ 23 ].

In this study, data up to March 29, 2020 were used in the analysis, comprising 50,660 records and 7 variables. This period includes parts of winter and spring in the northern hemisphere and summer and autumn in the southern hemisphere. By March 29, the dataset consisted of cases from 177 countries and 252 different regions around the world. There were 720,139 confirmed, 33,925 death, and 149,082 recovered cases in the dataset.

Preprocessing step

Pre-processing was carried out on the dataset before training the proposed model. Figure 1 shows the preprocessing steps. The dataset was first examined for noise; noisy records were defined as those with negative values in the Cases variable. The dataset contained 42 such negative values. After deleting these records, the number of records was reduced to 50,618.

Fig. 1. Preprocessing steps on the COVID-19 dataset

Subsequently, the Date variable was converted to a numerical format and renamed the "Day" variable. To that end, January 22, 2020 marked the beginning of the outbreak and subsequent days were numbered by their distance from this origin. As a result, January 22 and March 29 were considered Day 1 and Day 68, respectively.

Since each region is uniquely identified by its latitude and longitude, the Province/State and Country/Region variables were excluded from the dataset. Moreover, as the study aimed at predicting the incidence in each geographical region, we kept only the records providing information on confirmed cases (17,179 records), not on deaths or recoveries. After preserving the records with the "Confirmed" value in the Type variable, that variable was deleted from the dataset. In this study, "Cases" is the dependent variable.
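A minimal sketch of this preprocessing in Python/pandas is shown below; the authors implemented their pipeline in MATLAB, and the file name and column labels (Province/State, Country/Region, date, cases, type) are assumptions about the exported layout of the coronavirus dataset [ 23 ].

```python
import pandas as pd

# Load the long-format COVID-19 dataset (confirmed/recovered/death cases per region per day).
df = pd.read_csv("coronavirus.csv")  # assumed export of the Rami Krispin dataset [23]

# 1. Remove noisy records: negative daily case counts.
df = df[df["cases"] >= 0]

# 2. Convert the Date variable to a numeric "Day" variable (Jan 22, 2020 = Day 1).
origin = pd.Timestamp("2020-01-22")
df["Day"] = (pd.to_datetime(df["date"]) - origin).dt.days + 1

# 3. Keep only confirmed cases, then drop the Type column and the textual region names;
#    each region remains uniquely identified by its latitude and longitude.
df = df[df["type"] == "confirmed"]
df = df.drop(columns=["type", "Province/State", "Country/Region", "date"])

# Resulting variables: Lat, Long, Day and the dependent variable "cases".
print(df.head())
```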

Constructing the prediction model

An ensemble of regression learners was utilized to predict the incidence of COVID-19 in different regions. The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models called weak learners [ 24 ]. At every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all previously grown learners. One of the most commonly used loss functions is the least-squares (LS) error [ 25 ].

In this study, the model employed a set of individual least-squares boosting (LSBoost) learners in order to minimize the mean squared error (MSE). The output of the model at step m, F_m(x), was calculated using Eq. 1:
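Following Friedman's LS_Boost formulation, which reference [ 25 ] describes and which the notation here presumably matches, Eq. 1 is

$$F_m(x) = F_{m-1}(x) + \rho_m \, h(x;\, a_m)$$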

where x is the input variable and h(x; a) is a parameterized function of x, characterized by the parameters a [ 25 ]. The values of ρ and a were obtained from Eq. 2:
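which, in the same formulation, is presumably the least-squares fit of the new learner to the current residuals:

$$(\rho_m, a_m) = \arg\min_{\rho,\, a} \sum_{i=1}^{N} \left[\tilde{y}_i - \rho\, h(x_i;\, a)\right]^2$$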

where N is the number of training data points and \( \tilde{y}_{i} \) is the difference between the observed response and the aggregated prediction up to the previous step.

Due to the recent major changes in the incidence of COVID-19 worldwide over the past 2 weeks, we aimed to predict the number of new cases as an indicator of prevalence over the next 2 weeks. The structure of the proposed method is shown in Fig.  2 .

Fig. 2. The structure of the proposed model

Since the incubation period of COVID-19 can be up to 14 days [ 26 ], we assumed that at least 14 days of prior information were needed to predict the incidence of COVID-19 on a given day. Therefore, the proposed model examined all possible intervals ending at least 14 days before the target day to find the optimal time period whose information is used to predict the number of cases in the coming days.

We hypothesized that the incidence in any region might follow the pattern of recent days in the same region and in nearby regions. Therefore, after determining the optimal time period, the model added the information on confirmed cases in each region and its neighbors over the specified period to that region's record in the dataset.

After setting the time interval [A, B] and the number of neighbors, the dataset was rearranged. In this case, the number of records was reduced from N to M, where M is calculated from Eq. 3:

where R is the number of different regions in the dataset and B is the last day of the time period. Similarly, the number of variables stored for each record increased from the original 4 variables (latitude, longitude, Day and Cases) to F, which is calculated from Eq. 4:
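which, based on the term-by-term description that follows, presumably has the form

$$F = 4 + 2\,NN + NN\,|B - A + 1| + 2\,|B - A + 1|$$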

where NN is the number of neighbors and 4 is the number of variables in the original dataset, because for each geographical region Lat, Long, Day and Cases are stored. |B − A + 1| is the number of days within the period that participate in the forecast of the next 14 days. The value of NN is multiplied by 2 because, for each neighbor, latitude and longitude are added to the record. Furthermore, for each day within the period, the neighbors' Cases were added to the record, so NN was multiplied by |B − A + 1|. For each region, the Day and Cases data during the period were added as well; thus, |B − A + 1| was multiplied by 2. It should be noted, however, that the dependent variable remained the Cases of the current day.

Since the number of nearby regions and the number of previous days effective in forecasting were both unknown, we treated them as unknown variables and obtained the most accurate model by examining all possible combinations of these variables in an iterative process.

The accuracy of the model was evaluated in terms of the mean squared error (MSE) and mean absolute error (MAE); due to the normalization of the MAE to [0, 1], the evaluation error is equal to 2 times the MAE. To this end, the information of the last 2 weeks for all regions was used as a validation set, and the model was trained using the remaining information in the dataset.
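The iterative search over neighborhood sizes and time windows, together with the validation-set evaluation, is not spelled out in code in the paper (the authors used MATLAB's LSBoost). The sketch below illustrates the idea in Python, with scikit-learn's least-squares gradient boosting as a stand-in; the structure of the precomputed feature tables is an assumption.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def select_model(feature_tables):
    """feature_tables: dict mapping (n_neighbors, first_day, last_day) to a tuple
    (X_train, y_train, X_valid, y_valid), where the validation rows are the last
    two weeks of every region, as described in the text. Returns the best setting."""
    best = None
    for (nn, a, b), (X_tr, y_tr, X_va, y_va) in feature_tables.items():
        # Least-squares boosting: each new tree is fit to the residuals of the ensemble.
        model = GradientBoostingRegressor(loss="squared_error", n_estimators=200)
        model.fit(X_tr, y_tr)
        mae = mean_absolute_error(y_va, model.predict(X_va))
        if best is None or mae < best["mae"]:
            best = {"mae": mae, "neighbors": nn, "window": (a, b), "model": model}
    return best
```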

Forecast incidence in the next 2 weeks

A new test set was created to predict the incidence in the next 2 weeks (by April 12, 2020). The number of records in this set was equal to the number of unique geographical regions in the COVID-19 dataset. Then, according to the best neighborhood and optimal time interval identified in the previous step, the necessary features were constructed for each record. After that, the best model created in the previous step was retrained on the entire dataset as a training set. This model was then applied to the new test set to predict the incidence rate.

Evaluating the actual performance of the proposed model

Given that the actual number of confirmed cases within the March 30 – April 12, 2020 period was available at the time of review, the performance of the proposed model was measured based on the percent error between the predicted and the actual values. The percent error was calculated from Eq. 5:
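which, in its usual form and consistent with the symbols defined below, is

$$\delta = \left|\frac{v_A - v_E}{v_A}\right| \times 100\%$$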

where δ is the percent error, v_A is the actual observed value and v_E is the expected (predicted) value. Furthermore, according to the predicted and actual confirmed cases in the 252 geographical regions in the dataset, the continental incidence rate was calculated using Eq. 6:

where I_C is the incidence on each continent and I_W is the global incidence of COVID-19 from March 30 to April 12, 2020.

The experimentation platform was an Intel® Core™ i7-8550U CPU @ 1.80 GHz with 12.0 GB of RAM, running a 64-bit MS Windows OS. The pre-processing and model construction were implemented in MATLAB.

Model construction

The number of neighbors ranged from zero to 10; the value of 10 was obtained by trial and error. Euclidean distance based on latitude and longitude was used to find the nearest neighbors. Given that the dataset contains data from January 22, 2020 to March 29, 2020, the nearest and farthest days relative to the day whose incidence is being predicted were set to 14 and 54, respectively. Because the number of confirmed cases varies greatly from region to region, the proposed algorithm was implemented for 3 different groups of regions: regions with fewer than 200 confirmed cases per day (16,825 records), those with 200 to 1000 cases per day (220 records), and those with over 1000 cases per day (152 records).

Table 1 shows the results of the best proposed model for the different combinations of neighborhood size and preceding days. To predict the incidence of COVID-19 in regions with more than 1000 confirmed cases per day, the proposed model performed best, with an MAE of 6.13%, when using the information of the last 14 to 17 days of the region and its two neighboring areas. In the dataset, the number of daily cases recorded in these regions varied from 1019 to 19,821.

For regions with 200 to 1000 cases per day, the proposed model performed best with the 9 nearest neighboring areas and data from the last 14 to 20 days, with an MAE of 8.54% on the validation set. For regions with fewer than 200 cases per day, the proposed model performed best, with an MAE of 4.71%, when taking into account the region's own data for the last 14 to 34 days.

Prediction of incidence by April 12, 2020

Figure 3 shows the prevalence of COVID-19 from the first week to the tenth week in different regions, based on the information provided by the COVID-19 epidemiological dataset [ 22 ]. In this figure, the diameter of each circle is proportional to the prevalence in that region and the center of each circle matches the geographical coordinates of the region.

Fig. 3. Visualization of the outbreak over time (created by the authors with the open-source GIMP software)

Table 2 shows the forecast results for the number of new cases per day on each continent. According to the location of the continents in the northern and southern hemispheres, the period in question covers winter and early spring in North America, Europe and almost all of Asia. It covers summer and parts of autumn in Australia and approximately the whole of South America. Given that Africa lies in all four hemispheres, the data recorded for this continent in this period include all seasons.

By April 12, 1,134,018 new cases were expected to be recorded worldwide. Of these, Europe with 687,665 (60.64%), North America with 272,957 (24.07%) and Asia with 107,000 (9.44%) new cases had the highest expected incidence, whereas Australia with 14,526 (1.28%), Africa with 19,131 (1.69%) and South America with 32,739 (2.89%) new cases had the lowest. Africa, Europe and South America had the highest rates of increase in COVID-19 incidence, with 283, 221.23, and 178.87%, respectively. Asia was the only continent whose growth had slowed, with an incidence rate of −34%.

Figure 4 shows the predicted incidence rates in different regions. Accordingly, the prevalence would decrease over the next 2 weeks in the Middle East, yet it would increase in North America and Europe. Outbreak forecasts for 244 geographic regions are provided in Additional file 1 : Appendix 1.

Fig. 4. Prediction of the incidence in weeks 10 and 11 (created by the authors with the open-source GIMP software)

Comparison of predicted and actual cases from March 30 to April 12, 2020

Table 3 shows the total number of daily cases in the 252 regions surveyed between March 30 and April 12, 2020. As shown, the daily percent error is below 20%. The best accuracy of the proposed model in predicting the incidence of COVID-19 was obtained on April 10, with 99.6%, and the worst on April 11, with 81.3%. The analysis of the two-week continental incidence rates is also shown in Fig. 5. The best-predicted continental incidence rates were found in South America and Asia, with 18.15 and 21.04% percent error, respectively. The worst, however, were observed in Africa and Australia, with more than 80% percent error.

Fig. 5. Comparison of predicted and actual continental incidence rates between March 30 and April 12, 2020

Data mining is capable of producing a predictive model and extracting new knowledge from retrospective data. The way the data are processed, as well as the variables selected, has a significant impact on knowledge discovery. Various data mining techniques are used to predict an outbreak. As a pressing global health concern, COVID-19 has developed into one of the world's major emergencies. The present study investigated its outbreak worldwide during a two-week period via a predictive model based on retrospective data. It was concluded that such a model can be presented with acceptable error rates.

The study made use of a coronavirus dataset to design a model for predicting the incidence of COVID-19. According to the incidence rate per day, the model was trained on three groups of regions: below 200, 200–1000 and above 1000 cases. One-way ANOVA showed a statistically significant difference between the prevalence rates in the three groups (p-value < 0.001). For each group, the prediction model was implemented and the incidence was predicted for the next 2 weeks. The proposed model achieved about 10% error (90% accuracy) for the group with fewer than 200 cases, 18% error (82% accuracy) for the group with 200–1000 cases, and 13% error (87% accuracy) for the group exceeding 1000 cases.

In this study, as the incidence of COVID-19 was evaluated over 68 days worldwide and a prediction model was presented for the two-week period of March 30 – April 12, 2020, more than 1,000,000 people were expected to contract the disease within the next 2 weeks, up about 58% from the roughly 700,000 cases of the outbreak by March 29, 2020.

The study found that adjacent regions with a prevalence of less than 1000 had similar incidence, so the incidence in each of these regions could be determined from information on neighboring areas. The use of neighborhood information enables the model to indirectly take into account the effective policies of other regions when predicting the incidence of COVID-19 in each region.

Given that the proposed model was trained using only 68 days of data (the most up-to-date information at the time of writing), a prediction accuracy above 81% was deemed acceptable for such a poorly understood disease. Further, according to the results shown in Table 3, the model's prediction error for a total of 12 days across the 252 regions was less than 2%. Therefore, if the data of each country were recorded at a finer geographical resolution, it would be promising to build an accurate model for predicting the incidence of COVID-19 over a two-week period in each country. While many unknowns are to be expected of a new pandemic, having this information can guide planning and resource allocation for prevention, treatment, and palliative care.

Although time series usually need to be long enough (normally a few years) to adequately account for seasonality, based on the results of the model implementations we believe that this model, even with such a short time series, is able to manage seasonality and can predict the number of cases with acceptable accuracy (see Additional file 1 : Appendices 2 and 3 for the results of all analyses). However, it is suggested that future research specifically address the effect of seasonal changes on the prevalence of this disease.

One of the limitations of the study was that the dataset did not provide sufficient information from all continents. Given that the disease did not occur simultaneously on all continents, and that continental spread in most cases began after the 40th day from the first case in China, 68 days of data did not seem sufficient to predict the prevalence of such a poorly understood disease.

In Africa, the first case was reported in more than 80% of the 45 geographical regions after the 50th day. The number of confirmed cases since then was 4682, which is 97.83% of the total of 4783 confirmed cases in Africa. In Australia, the first case was reported in more than 45% of the 11 geographical regions from the 40th day onwards. Also, out of a total of 4504 cases on the continent, 4478 cases (99.4%) were confirmed after that point.

In Europe, the first case was reported in 60 of the 69 geographic regions in the dataset from the 40th day onwards. Out of a total of 385,735 cases, information on 384,268 cases (i.e. 99.62%) was also entered from that day on. Similarly, South America confirmed its first case after the 40th day in 16 of its 17 regions. It is noteworthy that out of a total of 11,642 cases, 11,542 (99.14%) were confirmed from day 50 onwards.

In contrast, 88% of the North American regions had their first cases confirmed after day 50. In addition, of the 46 geographical regions on the continent by March 29, 2020, 38 reported their first case after day 50 (82.61%) and 41 from day 40 onwards (89.13%).

Due to insufficient information for some continents, where the outbreak spread later than the declared beginning of the pandemic, the effect of measures such as increasing the number of tests taken per day, as well as quarantine restrictions on some continents such as Europe that were put in place from March 30 to April 12, was not reflected in the dataset.

Nevertheless, the inaccurate prediction of the number of cases in Africa could be attributed to the insufficient information about this continent in the dataset. In 80% of the African regions, the first confirmed case was recorded 50 days into the outbreak. Out of a total of 4786 cases there up until the 68th day, 4682 cases (more than 97%) were reported after day 50.

In addition, because latitude and longitude are two important indicators in the dataset, the non-uniformity with which this information is recorded for different geographical regions is another limitation of the work; for some areas the information refers to one state of a country, while for others it refers to the whole country. For example, in the dataset all USA cases are provided under a single latitude and longitude, but for the Netherlands the COVID-19 cases are provided for four different latitude and longitude pairs.

Another limitation of this study was the use of data from countries each coping with COVID-19 under its own protocols for testing and identifying patients. However, this is, in general, the only global dataset for COVID-19, and it has been used in other studies [ 16 , 17 ]. Besides, early information on each country was taken into account by the proposed model when predicting the incidence in that country, which reduces this limitation.

It is worth noting that the model rests on both the information provided by the dataset and the measures currently taken in dealing with the disease. Hence, if governments' policies to tackle the disease change, so will the accuracy of the model's predictions.

Conclusions

Since epidemiological models such as SIR failed to accurately predict COVID-19 cases, as stated in [ 17 , 27 , 28 ], the current study relied on data from January 22 to March 29 provided by Johns Hopkins University and proposed a more complex model based on machine learning methods. The mean absolute error of the proposed model was 6.13% in predicting the incidence of COVID-19 in the two-week period of March 16–29 for regions with more than 1000 cases per day. The MAE was 8.54 and 4.71% for regions with a daily incidence rate between 200 and 1000 cases and less than 200 cases, respectively. An accuracy of more than 82% on the evaluation set confirms our perception that the pattern of incidence in a region is influenced by the pattern of the disease in recent days in the same region and neighboring areas.

Last but not least, despite the numerous limitations of the dataset, the lack of knowledge about such a new disease, and the changes in disease-control policies in different countries during the period under scrutiny, the proposed model proved effective in predicting the global incidence of COVID-19 in the two-week period of March 30 to April 12 with 98.45% accuracy. In addition, the accuracy of the proposed model in predicting daily cases in the worst case was 81.31%.

The model is written generically and can be run for different intervals (see Additional file 1 : Appendix 4). It is suggested that the model be applied to future data as well.

Availability of data and materials

The dataset analyzed during the current study is public and it is available in the [ https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases ] and in [ https://codeload.github.com/RamiKrispin/coronavirus-csv/zip/master ].

Abbreviations

WHO: World Health Organization

PHEIC: Public Health Emergency of International Concern

SEIR: Susceptible-Exposed-Infectious-Recovered

JHU CSSE: Johns Hopkins University Center for Systems Science and Engineering

LSBoost: Least-squares boosting

MSE: Mean Squared Error

MAE: Mean Absolute Error

Nkengasong J. Author Correction: China’s response to a novel coronavirus stands in stark contrast to the 2002 SARS outbreak response. Nat Med. 2020;26(3):441. https://doi.org/10.1038/s41591-020-0816-5 .

Roosa K, Lee Y, Luo R, Kirpich A, Rothenberg R, Hyman JM, et al. Real-time forecasts of the COVID-19 epidemic in China from February 5th to February 24th, 2020. Infect Dis Model. 2020;5:256–63. https://doi.org/10.1016/j.idm.2020.02.002 .


Eurosurveillance Editorial Team. Note from the editors: World Health Organization declares novel coronavirus (2019-nCoV) sixth public health emergency of international concern. Eurosurveillance. 2020;25(5):2–3.


World Health Organization, WHO Director-General's opening remarks at the media briefing on COVID-19 - 11 March 2020. 2020. Available from: https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19%2D%2D-11-march-2020 . Accessed 27 May 2021.

Bedford J, et al. COVID-19: towards controlling of a pandemic . 2020.


World Health Organization. Coronavirus disease 2019 (COVID-19) Situation Report – 60. 2020.

World Health Organization. Coronavirus disease 2019 (COVID-19) Situation Report – 70. 2020 [updated 19 March 2020]. Available from: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_4 . Accessed 27 May 2021.

Ji W, Wang W, Zhao X, Zai J, Li X. Cross-species transmission of the newly identified coronavirus 2019-nCoV. J Med Virol. 2020;92(4):433–40. https://doi.org/10.1002/jmv.25682 .

Paraskevis D, Kostaki EG, Magiorkinis G, Panayiotakopoulos G, Sourvinos G, Tsiodras S. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infect Genet Evol. 2020;79:104212. https://doi.org/10.1016/j.meegid.2020.104212 .

Huang C, Wang Y, Li X. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China (vol 395, pg 497, 2020). Lancet. 2020;395(10223):496.

Kim JY, Choe PG, Oh Y, Oh KJ, Kim J, Park SJ, et al. The first case of 2019 novel coronavirus pneumonia imported into Korea from Wuhan, China: implication for infection prevention and control measures. J Korean Med Sci. 2020;35(5):e61.  https://doi.org/10.3346/jkms.2020.35.e61 .

Bernard Stoecklin S, Rolland P, Silue Y, Mailles A, Campese C, Simondon A, et al. First cases of coronavirus disease 2019 (COVID-19) in France: surveillance, investigations and control measures, January 2020. Euro Surveill. 2020;25(6):2000094. https://doi.org/10.2807/1560-7917.ES.2020.25.6.2000094 .

Giovanetti M, Benvenuto D, Angeletti S, Ciccozzi M. The first two cases of 2019-nCoV in Italy: Where they come from? J Med Virol. 92(5):518–21. https://doi.org/10.1002/jmv.25699 .

Corman VM, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020;25(3):23–30.

Zhang NR, et al. Recent advances in the detection of respiratory virus infection in humans. J Med Virol. 2020;92(4):408–17. https://doi.org/10.1002/jmv.25674 .

Dey SK, Rahman MM, Siddiqi UR, Howlader A. Analyzing the epidemiological outbreak of COVID-19: a visual exploratory data analysis approach. J Med Virol. 92(6):632–8. https://doi.org/10.1002/jmv.25743 .

Binti Hamzah FA, et al. CoronaTracker: world-wide COVID-19 outbreak data analysis and prediction . 2020.

Koczkodaj WW, Mansournia MA, Pedrycz W, Wolny-Dominiak A, Zabrodskii PF, Strzałka D, et al. 1,000,000 cases of COVID-19 outside of China: The date predicted by a simple heuristic. Glob Epidemiol. 2020;2:100023. https://doi.org/10.1016/j.gloepi.2020.100023 .

Roosa K, Lee Y, Luo R, Kirpich A, Rothenberg R, Hyman JM, et al. Short-term Forecasts of the COVID-19 Epidemic in Guangdong and Zhejiang, China: February 13–23, 2020. J Clin Med. 2020;9(2):596. https://doi.org/10.3390/jcm9020596 .

Nishiura H, Jung SM, Linton NM, Kinoshita R, Yang YC, Hayashi K, et al. The extent of transmission of novel coronavirus in Wuhan, China, 2020. J Clin Med. 2020;9(2):330. https://doi.org/10.3390/jcm9020330 .

World Health Organization. Coronavirus disease 2019 (COVID-19) Situation Report – 70. 2020. Available from: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_4 .

Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). Novel Coronavirus (COVID-19) Cases Data. 2020. Available from: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases .

Krispin R. Coronavirus. 2020. Available from: https://github.com/RamiKrispin/coronavirus .

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, second edition. Springer Series in Statistics. New York: Springer-Verlag; 2008.

Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat. 2000;29:1189–232. https://doi.org/10.1214/aos/1013203451 .

World Health Organization. Transmission of SARS-CoV-2: implications for infection prevention precautions. 2020. Available from: https://www.who.int/news-room/commentaries/detail/transmission-of-sars-cov-2-implications-for-infection-prevention-precautions#:~:text=The%20incubation%20period%20of%20COVID,to%20a%20confirmed%20case .

Postnikov EB. Estimation of COVID-19 dynamics “on a back-of-envelope”: Does the simplest SIR model provide quantitative parameters and predictions? Chaos, Solitons Fractals. 2020;135:109841. https://doi.org/10.1016/j.chaos.2020.109841 .

Cooper I, Mondal A, Antonopoulos CG. A SIR model assumption for the spread of COVID-19 in different communities. Chaos, Solitons Fractals. 2020;139:110057.


Acknowledgements

The authors appreciate Deputy of research and technology of Khatam Alanbia University of technology.

Funding: Not applicable.

Author information

Authors and Affiliations

Department of Computer Engineering, School of Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran

Fatemeh Ahouz

School of Medicine, Shahroud University of Medical Sciences, Shahroud, Iran

Amin Golabpour


Contributions

‘FA’ and ‘AG’ equally contributed to the conception, design of the work, analysis and interpretation of data. In addition, they read and approved the final manuscript.

Corresponding author

Correspondence to Amin Golabpour.

Ethics declarations

Ethics approval and consent to participate; consent for publication; competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Appendix 1.

Point-to-point forecast for all areas in the dataset. Appendix 2. Investigation the effect of seasonal changes on model performance. Appendix 3. The performance of the proposed method on randomly selected regions. Appendix 4. The results of the proposed method on the updated data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Ahouz, F., Golabpour, A. Predicting the incidence of COVID-19 using data mining. BMC Public Health 21 , 1087 (2021). https://doi.org/10.1186/s12889-021-11058-3


Received : 03 April 2020

Accepted : 13 May 2021

Published : 07 June 2021

DOI : https://doi.org/10.1186/s12889-021-11058-3


Keywords

  • Data mining



  • Open access
  • Published: 05 July 2017

Analysis of agriculture data using data mining techniques: application of big data

  • Jharna Majumdar 1 ,
  • Sneha Naraseeyappa 1 &
  • Shilpa Ankalaki 1  

Journal of Big Data volume 4, Article number: 20 (2017)

In the agriculture sector, farmers and agribusinesses have to make innumerable decisions every day, and intricate complexities link the various factors influencing them. An essential issue for agricultural planning is accurate yield estimation for the numerous crops involved in the planning. Data mining techniques are a necessary approach for obtaining practical and effective solutions to this problem. Agriculture has been an obvious target for big data. Environmental conditions, variability in soil, input levels, combinations and commodity prices have made it all the more relevant for farmers to use information and get help in making critical farming decisions. This paper focuses on the analysis of agriculture data and on finding optimal parameters to maximize crop production using data mining techniques such as PAM, CLARA, DBSCAN and multiple linear regression. Mining the large amount of existing crop, soil and climatic data, and analysing new, non-experimental data, optimizes production and makes agriculture more resilient to climatic change.

Today, India ranks second worldwide in farm output. Agriculture is demographically the broadest economic sector and plays a significant role in the overall socio-economic fabric of India. Agriculture is a unique business; crop production depends on many climatic and economic factors. Some of the factors on which agriculture depends are soil, climate, cultivation, irrigation, fertilizers, temperature, rainfall, harvesting, pesticides, weeds and others. Historical crop yield information is also important for the supply chain operations of companies engaged in related industries. These industries use agricultural products as raw material: livestock, food, animal feed, chemicals, poultry, fertilizer, pesticides, seed and paper. An accurate estimate of crop production and risk helps these companies in planning supply chain decisions such as production scheduling. Businesses such as the seed, fertilizer, agrochemical and agricultural machinery industries plan production and marketing activities based on crop production estimates [ 1 , 2 ]. There are two ways in which such estimates help farmers and the government in decision making, namely:

It helps farmers by providing historical crop yield records together with a forecast, reducing risk in their management decisions.

It helps the government in making crop insurance policies and policies for supply chain operation.

Data mining techniques play a vital role in the analysis of data. Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. Unsupervised learning (clustering) and supervised learning (classification) are two different types of learning methods in data mining. Clustering is the process of examining a collection of "data points" and grouping them into "clusters" according to some distance measure. The goal is that data points in the same cluster have a small distance from one another, while data points in different clusters are at a large distance from one another. Cluster analysis divides data into well-formed groups, and well-formed clusters should capture the "natural" structure of the data [ 3 ]. This paper focuses on the PAM, CLARA and DBSCAN clustering methods. These methods are used to categorize the districts of Karnataka that have similar crop production.

Literature survey

Clustering is considered as an unsupervised classification process [ 4 ]. A large number of clustering algorithms have been developed for different purposes [ 4 , 5 , 6 ]. Clustering techniques can be categorised into Partitioning clustering, Hierarchical clustering, Density-based methods, Grid-based methods and Model based clustering methods.

Partitioning clustering algorithms, such as K-means, K-medoids (PAM), CLARA and CLARANS, assign objects into k (a predefined number of) clusters and iteratively reallocate objects to improve the quality of the clustering results. Hierarchical clustering algorithms assign objects to tree-structured clusters, i.e., a cluster can have data points representative of lower-level clusters [ 7 ]. The idea of density-based clustering methods is that, for each point of a cluster, the neighbourhood within a given unit distance has to contain at least a minimum number of points, i.e. the density in the neighbourhood should reach some threshold [ 8 ].

There are different forecasting methodologies developed and evaluated by researchers all over the world in the field of agriculture. Some such studies are the following. Ramesh and Vishnu Vardhan analysed agriculture data for the years 1965–2009 in the East Godavari district of Andhra Pradesh, India. Rainfall data were clustered into 4 clusters by adopting the K-means clustering method. Multiple linear regression (MLR) was used to model the linear relationship between a dependent variable and one or more independent variables; the dependent variable is rainfall and the independent variables are year, area of sowing, and production. The purpose of this work was to find suitable data models that achieve high accuracy and high generality in terms of yield prediction capabilities [ 9 ].

Bangladesh grows several varieties of rice, which have different cropping seasons [ 10 ]. A prior study of the climate in Bangladesh (temperature and rainfall) and its effect on agricultural rice production was carried out, and the results were then taken into a regression analysis with temperature and rainfall; temperature has an adverse effect on crop production. The data were taken from the Bangladesh Agricultural Research Council (BARC) for the past 20 years with 7 attributes: rainfall, maximum and minimum temperature, sunlight, wind speed, humidity and cloud coverage. In pre-processing, the whole dataset was divided into three-month phases (March to June, July to October, November to February); for each phase, the average of every attribute was taken and associated with it, and this pre-processing was done for each rice variety. In clustering, the different pre-processed tables were analysed to find groups of regions sharing similar weather attributes.

Soil characteristics have also been studied and analysed using data mining techniques; for example, k-means clustering has been used for clustering soils in combination with GPS-based technologies [ 11 ]. Alberto Gonzalez-Sanchez, Juan Frausto-Solis and Waldo Ojeda-Bustamante have done an extensive study on the predictive ability of machine learning techniques such as multiple linear regression, regression trees, artificial neural networks, support vector regression and k-nearest neighbour for crop yield prediction [ 12 ]. Wheat yield prediction using machine learning and advanced sensing techniques was done by Pantazi, Dimitrios Moshou, Thomas Alexandridis and Abdul Mounem Mouazen [ 13 ]. The aim of their work was to predict within-field variation in wheat yield, based on on-line multi-layer soil data and satellite-imagery crop growth characteristics; supervised self-organizing maps capable of handling existing information from different soil and crop sensors by utilizing an unsupervised learning algorithm were used. The software tool "Crop Advisor", developed by S. Veenadhari, B. Misra and C.D. Singh [ 14 ], is a user-friendly web page for predicting the influence of climatic parameters on crop yields. The C4.5 algorithm is used to find the most influential climatic parameter on the yields of selected crops in selected districts of Madhya Pradesh.

The objective of the proposed work is to analyse agriculture data using data mining techniques. In the proposed work, agriculture data were collected from the following sources:

Dataset in agricultural sector [ https://data.gov.in/ , http://raitamitra.kar.nic.in/statistics ],

Crop wise agriculture data [html://CROPWISE_NORMAL_AREA],

Agriculture data of different districts [ http://14.139.94.101/fertimeter/Distkar.aspx , http://raitamitra.kar.nic.in/ENG/statistics.asp ],

Agriculture data based on weather, temperature, and relative humidity [ http://dmc.kar.nic.in/trg.pdf ].

The input dataset consists of 6 years of data with the following parameters: year, state (Karnataka, 28 districts), district, crop (cotton, groundnut, jowar, rice and wheat), season (kharif, rabi, summer), area (in hectares), production (in tonnes), average temperature (°C), average rainfall (mm), soil pH value, soil type, major fertilizers, nitrogen (kg/ha), phosphorus (kg/ha), potassium (kg/ha), minimum rainfall required, and minimum temperature required.

In the proposed work, a modified approach to the DBSCAN method is used to cluster the data into districts that have similar temperature, rainfall and soil type. PAM and CLARA are used to cluster the data into districts that achieve maximum crop production (the wheat crop is considered as an example). Based on these analyses, we obtain the optimal parameters for maximum crop production. The multiple linear regression method is used to forecast the annual crop yield.

Modified approach of DBSCAN

DBSCAN is a baseline algorithm for density-based clustering of large amounts of data containing noise and outliers. DBSCAN has two parameters, Eps and MinPts. However, traditional DBSCAN cannot determine the optimal Eps value [ 15 ]. Automatic determination of the optimal Eps value is one of the most necessary modifications of DBSCAN. Figure 1 outlines the modified approach to the DBSCAN method.

Determine the Eps value automatically

The modified DBSCAN proposes a method to find the minimum points and epsilon (radius) values automatically. A KNN plot is used to find the epsilon value, where the input to the KNN plot (the K value) is normally user-defined. To avoid a user-defined K value as input to the KNN plot, the Batchelor-Wilkins clustering algorithm is applied to the database to obtain the K value along with its respective cluster centres. This K value is given as input to the KNN plot.

Determination of Eps and Minpts

The epsilon (Eps) value can be found by drawing a "K-distance graph" for all data points in the dataset for a given 'K', obtained by the Batchelor-Wilkins algorithm [ 16 ]. Initially, the distance of each point to every one of its 'K' nearest neighbours is calculated. The KNN plot is drawn by taking the sorted average distance values. When the graph is plotted, a knee point is determined in order to find the optimal Eps value [ 15 ].
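As an illustration of this step, the sketch below estimates Eps from a k-distance curve and then runs DBSCAN; it uses scikit-learn and a simple "largest jump" knee heuristic, and the synthetic district features and the choice of k (which in the paper comes from the Batchelor-Wilkins algorithm) are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def estimate_eps(X, k):
    """Estimate Eps from the sorted k-distance curve (knee = largest jump)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbour
    dist, _ = nn.kneighbors(X)
    kdist = np.sort(dist[:, -1])                      # distance to the k-th neighbour, sorted
    knee = np.argmax(np.diff(kdist))                  # crude knee: largest increase between points
    return kdist[knee]

# Example with synthetic district features (temperature, rainfall, soil-type code).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 3)) for loc in ([25, 600, 1], [28, 900, 2])])

k = 4                                  # in the paper, K comes from the Batchelor-Wilkins algorithm
eps = estimate_eps(X, k)
labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
print(f"Eps = {eps:.3f}, clusters found: {len(set(labels) - {-1})}")
```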

Partition around medoids (PAM)

PAM is a partitioning-based algorithm. It breaks the input data into a number of groups and finds a set of objects, called medoids, that are centrally located. With the medoids, the nearest data points can be assigned to them to form clusters. The algorithm has two phases:

In the BUILD phase, a collection of k objects is selected as an initial set S:

Arbitrarily choose k objects as the initial medoids.

Repeat until no change:

(Re)assign each object to the cluster with the nearest medoid.

Improve the quality of the k-medoids (randomly select a non-medoid object, O random, and compute the total cost of swapping a medoid with O random).

In the SWAP phase, the quality of the clustering is improved by exchanging selected objects with unselected objects, choosing the swap with the minimum cost.

Example: for each medoid m1 and each non-medoid data point d, swap m1 and d and recompute the cost (the sum of distances of points to their medoid); if the total cost of the configuration increased, undo the swap. Figure 2 depicts the steps involved in the PAM algorithm.

PAM Algorithm steps
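A compact way to see the BUILD/SWAP logic is a direct k-medoids implementation. The following is a simplified sketch of PAM rather than the authors' code; the random initialisation and the iteration limit are illustrative choices.

```python
# Minimal PAM (k-medoids) sketch following the BUILD/SWAP outline above.
import numpy as np

def pam(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)      # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)   # BUILD: arbitrary initial medoids

    def cost(meds):
        return D[:, meds].min(axis=1).sum()               # sum of distances to nearest medoid

    best = cost(medoids)
    for _ in range(n_iter):                               # SWAP: try medoid / non-medoid exchanges
        improved = False
        for i, m in enumerate(medoids):
            for d in range(len(X)):
                if d in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = d
                c = cost(trial)
                if c < best:                               # keep the swap only if total cost decreases
                    best, medoids, improved = c, trial, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)                  # assign points to their nearest medoid
    return medoids, labels

# Example use: group districts into LOW / MODERATE / HIGH production clusters (k = 3)
X = np.random.rand(28, 4)                                  # placeholder district features
medoids, labels = pam(X, k=3)
```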

CLARA (clustering large applications)

Designed by Kaufman and Rousseeuw to handle large datasets, CLARA (Clustering LARge Applications) relies on sampling [17, 18]. Instead of finding representative objects for the entire dataset, CLARA draws a sample of the dataset, applies PAM to the sample, and finds the medoids of the sample. To obtain better approximations, CLARA draws multiple samples and returns the best clustering as the output. For accuracy, the quality of a clustering is measured by the average dissimilarity of all objects in the entire dataset. Figure 3 outlines the steps of the CLARA algorithm.

Steps involved in CLARA algorithm
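Because CLARA simply repeats PAM on samples and scores each candidate medoid set on the full dataset, it can be sketched on top of the pam() function above. The authors report their CLARA results from an R implementation (see Fig. 11); the Python sketch below, with its sample size and sample count, is only an illustrative assumption.

```python
# Minimal CLARA sketch built on the pam() function defined earlier.
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        med_local, _ = pam(X[idx], k)                     # run PAM on the sample
        medoids = X[idx][med_local]                       # medoid coordinates
        # score the candidate medoids on the *entire* dataset
        d = np.linalg.norm(X[:, None] - medoids[None], axis=2)
        total = d.min(axis=1).sum()
        if total < best_cost:
            best_medoids, best_cost = medoids, total
    labels = np.linalg.norm(X[:, None] - best_medoids[None], axis=2).argmin(axis=1)
    return best_medoids, labels
```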

Multiple linear regression to forecast the crop yield

Multiple linear regression is a variant of linear regression analysis. The model is built to establish the relationship between one dependent variable and two or more independent variables [19]. For a given dataset where \( x_{1}, \ldots, x_{k} \) are the independent variables and \( Y \) is the dependent variable, multiple linear regression fits the dataset to the model

\( Y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{k} x_{k} + \varepsilon , \)

where \( \beta_{0} \) is the y-intercept and the parameters \( \beta_{1}, \beta_{2}, \ldots, \beta_{k} \) are called the partial coefficients. In matrix form, \( \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \).

Before applying multiple linear regression to forecast the crop yield, it is necessary to identify the significant attributes in the database. Not all attributes in the database are significant; changing the values of such attributes has no effect on the dependent variable, so they can be neglected. A p-value test is performed on the database to find the significant attributes, and multiple linear regression is applied only to the significant attributes to forecast the crop yield.
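As a concrete illustration of this step, the sketch below fits an ordinary least squares model with statsmodels, screens predictors by p-value at the 0.05 level, and refits on the significant ones. The file name and column names are hypothetical placeholders, not the authors' actual dataset layout.

```python
# Minimal sketch of p-value screening followed by multiple linear regression.
import pandas as pd
import statsmodels.api as sm

# df: one row per district-year, with a 'yield' column and candidate predictors (assumed layout)
df = pd.read_csv("wheat_data.csv")
predictors = ["temperature", "rainfall", "ph", "nitrogen", "potassium", "water_req"]

X = sm.add_constant(df[predictors])                       # adds the intercept term beta_0
model = sm.OLS(df["yield"], X).fit()

# keep only predictors with p-value < 0.05, then refit on the significant ones
significant = [c for c in predictors if model.pvalues[c] < 0.05]
final_model = sm.OLS(df["yield"], sm.add_constant(df[significant])).fit()
print(final_model.summary())                              # coefficients form the yield equation
```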

Evaluation methods

Data mining algorithms work on different principles and can be influenced by different kinds of associations in the data. To ensure fair conditions in the evaluation, this work identifies the optimal clustering method for agriculture data analysis. The proposed work adopts external quality metrics [3] such as purity, homogeneity, completeness, V-measure, Rand index, precision, recall and F-measure to compare the PAM, CLARA and DBSCAN clustering methods.

Purity is computed by assigning each cluster to the class that is most frequent in that cluster. Homogeneity expresses the degree to which each cluster contains only members of a single class. Completeness expresses the degree to which all members of a given class are assigned to the same cluster. V-measure is the harmonic mean of the homogeneity and completeness scores. The Rand index measures the percentage of pairwise decisions that are correct. Precision is the fraction of pairs placed in the same cluster that truly belong together. Recall is the fraction of truly related pairs that were identified. F-measure is the harmonic mean of precision and recall. Higher metric values indicate better cluster quality.
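Most of these external metrics are available directly in scikit-learn, and purity can be derived from the contingency matrix. The snippet below is a minimal sketch with made-up labels; pair-counting precision, recall and F-measure can be computed from the same contingency matrix in a similar way.

```python
# Minimal sketch of the external quality metrics used to compare clusterings.
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

true = np.array([0, 0, 1, 1, 2, 2, 2])   # reference classes (e.g. yield category)
pred = np.array([0, 0, 1, 2, 2, 2, 2])   # cluster labels from PAM, CLARA or DBSCAN

def purity(y_true, y_pred):
    cm = contingency_matrix(y_true, y_pred)
    return cm.max(axis=0).sum() / cm.sum()   # majority class per cluster

scores = {
    "purity":       purity(true, pred),
    "homogeneity":  metrics.homogeneity_score(true, pred),
    "completeness": metrics.completeness_score(true, pred),
    "v_measure":    metrics.v_measure_score(true, pred),
    "rand_index":   metrics.rand_score(true, pred),
}
print(scores)
```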

Experimental results

Before applying the DBSCAN algorithm to the dataset, the MinPts and Eps values need to be determined. The Batchelor–Wilkins algorithm is applied to the dataset in order to determine the K value (number of clusters) automatically. For the dataset used in the proposed work, the K value obtained from Batchelor–Wilkins is 7, with the corresponding districts as cluster centres. The results of the Batchelor–Wilkins algorithm are shown in Fig. 4.

Cluster centres obtained from the Batchelor Wilkins algorithm

The KNN plot is drawn using the K value obtained from the Batchelor–Wilkins algorithm to determine the Epsilon value and the MinPts for DBSCAN.

Figure 5 depicts the KNN plot, drawn using the K value obtained from the Batchelor–Wilkins algorithm (here K = 7). The Eps value is calculated by taking the slope of the line between successive points and searching for the pair of points with the greatest slope. This knee is located at 0.4, which is taken as the optimal Eps value [20].

KNN plot for given dataset

Districts of Karnataka considered for the analysis

The DBSCAN clustering algorithm is applied to the dataset, using the optimal Eps value, to cluster the districts of Karnataka that have similar rainfall, temperature and soil type.

Figure 6 depicts the districts of Karnataka considered for the analysis.

Figures 7, 8 and 9 depict the districts of Karnataka that have similar temperature ranges, rainfall ranges and soil types, respectively.

Districts having a similar temperature range

Districts having similar rainfall over the 6-year period

Districts having a similar soil type

To apply the PAM algorithm to the dataset, the user must first specify k (the number of clusters); k is set to 3 in the current experiment. Crop yield is categorised into LOW, MODERATE and HIGH production, and the districts are grouped into 3 clusters using the PAM clustering method. The resulting clusters are shown in Table 1.

The study and analysis of wheat crop production in the different districts of Karnataka is shown in Fig. 10.

Production of yield in tonnes per hectare of different districts

The analysis shows that North Karnataka districts such as Bijapur, Dharwad, Bagalkot, Belgaum, Raichur, Bellary, Chitradurga and Davangere have the maximum wheat crop production.

The districts in the dataset are grouped into 3 clusters using the CLARA algorithm; the clusters are shown in Fig. 11. They represent districts that share similar factors such as area, production, rainfall and temperature. The result of the CLARA algorithm is shown in Table 2.

The study and analysis of temperature and wheat crop production in the different districts of Karnataka is shown in Fig. 12. From Fig. 12, the optimal temperature for wheat crop production is 29.9 °C.

Results of CLARA in R language

Plot temperature vs. production

Multiple linear regression

Before applying multiple linear regression, the p-value test is performed on the dataset to determine the significant attributes. Table 3 lists the significant values. An independent variable with a p-value of less than 0.05 indicates that the null hypothesis can be rejected, meaning it has an effect in the regression analysis; such independent variables can be added to the model. If, on the other hand, the p-value is greater than the common alpha level of 0.05, the variable is said not to be significant for the model.

Table 4 shows the multiple linear regression equations for the different crop yields. For example, for the wheat crop, if all the independent variables are zero the yield is 112. A 1-unit increase in temperature reduces the yield by 4.14e−02 units, a 1-unit increase in rainfall increases the yield by 1.34e−04 units, a 1-unit increase in pH increases the yield by 0.079153 units, a 1-unit increase in nitrogen reduces the yield by 1.31e−03 units, a 1-unit increase in potassium decreases the yield by 0.00167 units, and a 1-unit increase in water requirement decreases the yield by 0.28125 units.
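Plugging example values into the reported wheat equation makes the interpretation concrete. The coefficients below are taken from the text above; the input values (other than the 29.9 °C optimal temperature reported earlier) are purely illustrative assumptions.

```python
# Wheat yield equation as reported in Table 4 (coefficients from the text);
# the example input values are illustrative assumptions, not data from the paper.
def wheat_yield(temp, rain, ph, nitrogen, potassium, water):
    return (112
            - 4.14e-02 * temp        # temperature
            + 1.34e-04 * rain        # rainfall
            + 0.079153 * ph          # soil pH
            - 1.31e-03 * nitrogen
            - 0.00167  * potassium
            - 0.28125  * water)      # water requirement

print(wheat_yield(temp=29.9, rain=900, ph=6.5, nitrogen=100, potassium=50, water=40))
```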

For a 1-unit increase in pH, the yields of jowar, rice and wheat increase, whereas the yields of groundnut and cotton decrease.

Results for optimal temperature and rainfall for wheat—Table  5

Table 5 shows the optimal parameters for achieving higher wheat production.

Comparison of clustering methods

As mentioned earlier, the clustering comparison is done using the performance quality metrics. Table 6 shows the comparison of the PAM, CLARA and DBSCAN methods for clustering the districts that have similar crop productivity.

Table 6 and Fig. 13 depict the comparison of the PAM, CLARA and DBSCAN clustering methods; higher quality metric values indicate better clustering quality. The analysis of the quality metrics for the different clustering methods is shown in Fig. 13. From Fig. 13, DBSCAN has the higher value for most of the quality metrics: DBSCAN gives better clustering quality than PAM and CLARA, and CLARA gives better clustering quality than PAM.

Crops are usually selected by their economic importance; however, the agricultural planning process requires yield estimates for several crops. In this sense, five crops were selected for this work, with data availability as the key criterion: a crop was selected when enough data samples were available over the 6 years under analysis. The present work is therefore limited to five crops: cotton, wheat, groundnut, jowar and rice. As an example, the wheat crop analysis is discussed in this paper.

The present work covers the PAM, CLARA and modified DBSCAN clustering methods and the multiple linear regression method. PAM and CLARA are traditional clustering methods, whereas DBSCAN is modified by introducing the Batchelor–Wilkins clustering method to determine the 'k' value and the KNN method to determine the minimum points and radius value automatically. Using these methods, the crop dataset is analysed and the optimal parameters for wheat crop production are determined. Multiple linear regression is used to find the significant attributes and to form the equation for yield prediction.

Some works measure the quality of clustering methods using internal quality metrics [21], and others use external quality metrics. This work is limited to external quality metrics, which combine several families of measures [22]: set-matching metrics, metrics based on counting pairs, and metrics based on entropy. According to the purity, homogeneity, completeness, V-measure, precision, recall and Rand index results, the methods were ranked from best to worst as: DBSCAN, CLARA and PAM.

Various data mining techniques were applied to the input data to assess which method yields the best performance. The present work used the data mining techniques PAM, CLARA and DBSCAN to obtain the optimal climate requirements of wheat, such as the optimal temperature and rainfall ranges, to achieve higher wheat production. The clustering methods were compared using quality metrics. According to this analysis, DBSCAN gives better clustering quality than PAM and CLARA, and CLARA gives better clustering quality than PAM. The proposed work can be extended to analyse the soil and other factors for the crop and to increase crop production under different climatic conditions.

References

1. Veenadhari S, Misra B, Singh CD. Data mining techniques for predicting crop productivity—a review article. IJCST. 2011;2(1).

2. Gleaso CP. Large area yield estimation/forecasting using plant process models. Paper presented at the winter meeting of the American Society of Agricultural Engineers, Palmer House, Chicago, Illinois. 1982;14–17.

3. Majumdar J, Ankalaki S. Comparison of clustering algorithms using quality metrics with invariant features extracted from plant leaves. In: Paper presented at the international conference on computational science and engineering. 2016.

4. Jain A, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.

5. Jain AK, Dubes RC. Algorithms for clustering data. New Jersey: Prentice Hall; 1988.

6. Berkhin P. A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M, editors. Grouping multidimensional data. Berlin: Springer; 2006. p. 25–72.

7. Han J, Kamber M. Data mining: concepts and techniques. Massachusetts: Morgan Kaufmann Publishers; 2001.

8. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Paper presented at the international conference on knowledge discovery and data mining. 1996.

9. Ramesh D, Vishnu Vardhan B. Data mining techniques and applications to agricultural yield data. International Journal of Advanced Research in Computer and Communication Engineering. 2013;2(9).

10. MotiurRahman M, Haq N, Rahman RM. Application of data mining tools for rice yield prediction on clustered regions of Bangladesh. IEEE. 2014;2014:8–13.

11. Verheyen K, Adrianens M, Hermy S Deckers. High resolution continuous soil classification using morphological soil profile descriptions. Geoderma. 2001;101:31–48.

12. Gonzalez-Sanchez A, Frausto-Solis J, Ojeda-Bustamante W. Predictive ability of machine learning methods for massive crop yield prediction. Span J Agric Res. 2014;12(2):313–28.

13. Pantazi XE, Moshou D, Alexandridis T, Mouazen AM. Wheat yield prediction using machine learning and advanced sensing techniques. Comput Electron Agric. 2016;121:57–65.

14. Veenadhari S, Misra B, Singh D. Machine learning approach for forecasting crop yield based on climatic parameters. In: Paper presented at the international conference on computer communication and informatics (ICCCI-2014), Coimbatore. 2014.

15. Rahmah N, Sitanggang IS. Determination of optimal epsilon (Eps) value on DBSCAN algorithm to clustering data on peatland hotspots in Sumatra. IOP Conference Series: Earth and Environmental Science. 2016;31:012012.

16. Forbes G. The automatic detection of patterns in people's movements. Dissertation, University of Cape Town. 2002.

17. Ng RT, Han J. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering. 2002;14(5).

18. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. Wiley; 1990. doi:10.1002/9780470316801.

19. Multiple linear regression. http://www.originlab.com/doc/Origin-Help/Multi-Regression-Algorithm . Accessed 3 July 2017.

20. Elbatta MNT. An improvement for DBSCAN algorithm for best results in varied densities. Dissertation, Gaza (PS): Islamic University of Gaza. 2012.

21. Kirkl O, De La Iglesia B. Experimental evaluation of cluster quality measures. IEEE; 2013. 978-1-4799-1568-2/13.

22. Meila M. Comparing clustering. In: Proceedings of COLT. 2003.

Authors’ contributions

JM, Dean R&D and Professor & HOD of the Department of M.Tech CSE at NMIT, with 40 years of experience in India and abroad, guided and gave extensive help in developing the data mining algorithms. SN, Assistant Professor in the Department of M.Tech CSE at NMIT, developed the PAM and CLARA algorithms with the help of Dr. Jharna Majumdar. SA, Assistant Professor in the Department of M.Tech CSE at NMIT, developed the modified approach of DBSCAN, the multiple linear regression, and the quality metrics for cluster comparison with the guidance and help of Dr. Jharna Majumdar. All authors together analysed the crop dataset to determine the optimal parameters to maximise the crop yield. All authors read and approved the final manuscript.

Acknowledgements

The authors express their sincere gratitude to Prof N.R Shetty, Advisor and Dr H.C Nagaraj, Principal, Nitte Meenakshi Institute of Technology for giving constant encouragement and support to carry out research at NMIT.

The authors extend their thanks to the Vision Group on Science and Technology (VGST), Government of Karnataka, for recognising this research and providing financial support to set up the infrastructure required to carry out the research.

Competing interests

The authors declare that they have no competing interests.

This work was supported by the Research Department of Computer science, Nitte Meenakshi Institute of Technology.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and affiliations.

Department of M.Tech CSE, NMIT, Bangalore, 560064, India

Jharna Majumdar, Sneha Naraseeyappa & Shilpa Ankalaki

Corresponding author

Correspondence to Jharna Majumdar .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article.

Majumdar, J., Naraseeyappa, S. & Ankalaki, S. Analysis of agriculture data using data mining techniques: application of big data. J Big Data 4 , 20 (2017). https://doi.org/10.1186/s40537-017-0077-4

Received : 25 February 2017

Accepted : 31 May 2017

Published : 05 July 2017

DOI : https://doi.org/10.1186/s40537-017-0077-4

50 selected papers in Data Mining and Machine Learning

Here is the list of 50 selected papers in Data Mining and Machine Learning. You can download them for your detailed reading and research. Enjoy!

Data Mining and Statistics: What’s the Connection?

Data Mining: Statistics and More? , D. Hand, American Statistician, 52(2):112-118.

Data Mining , G. Weiss and B. Davison, in Handbook of Technology Management, John Wiley and Sons, expected 2010.

From Data Mining to Knowledge Discovery in Databases , U. Fayyad, G. Piatesky-Shapiro & P. Smyth, AI Magazine, 17(3):37-54, Fall 1996.

Mining Business Databases , Communications of the ACM, 39(11): 42-48.

10 Challenging Problems in Data Mining Research , Q. Yang and X. Wu, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

The Long Tail , by Anderson, C., Wired magazine.

AOL’s Disturbing Glimpse Into Users’ Lives , by McCullagh, D., News.com, August 9, 2006

General Data Mining Methods and Algorithms

Top 10 Algorithms in Data Mining , X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, Knowl Inf Syst (2008) 14:1-37.

Induction of Decision Trees , R. Quinlan, Machine Learning, 1(1):81-106, 1986.

Web and Link Mining

The Pagerank Citation Ranking: Bringing Order to the Web , L. Page, S. Brin, R. Motwani, T. Winograd, Technical Report, Stanford University, 1999.

The Structure and Function of Complex Networks , M. E. J. Newman, SIAM Review, 2003, 45, 167-256.

Link Mining: A New Data Mining Challenge , L. Getoor, SIGKDD Explorations, 2003, 5(1), 84-89.

Link Mining: A Survey , L. Getoor, SIGKDD Explorations, 2005, 7(2), 3-12.

Semi-supervised Learning

Semi-Supervised Learning Literature Survey , X. Zhu, Computer Sciences TR 1530, University of Wisconsin — Madison.

Introduction to Semi-Supervised Learning, in Semi-Supervised Learning (Chapter 1) O. Chapelle, B. Scholkopf, A. Zien (eds.), MIT Press, 2006. (Fordham’s library has online access to the entire text)

Learning with Labeled and Unlabeled Data , M. Seeger, University of Edinburgh (unpublished), 2002.

Person Identification in Webcam Images: An Application of Semi-Supervised Learning , M. Balcan, A. Blum, P. Choi, J. lafferty, B. Pantano, M. Rwebangira, X. Zhu, Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data , 2005.

Learning from Labeled and Unlabeled Data: An Empirical Study across Techniques and Domains , N. Chawla, G. Karakoulas, Journal of Artificial Intelligence Research , 23:331-366, 2005.

Text Classification from Labeled and Unlabeled Documents using EM , K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Machine Learning , 39, 103-134, 2000.

Self-taught Learning: Transfer Learning from Unlabeled Data , R. Raina, A. Battle, H. Lee, B. Packer, A. Ng, in Proceedings of the 24th International Conference on Machine Learning , 2007.

An iterative algorithm for extending learners to a semisupervised setting , M. Culp, G. Michailidis, 2007 Joint Statistical Meetings (JSM), 2007

Partially-Supervised Learning / Learning with Uncertain Class Labels

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers , V. Sheng, F. Provost, P. Ipeirotis, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Logistic Regression for Partial Labels , in 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems , Volume III, pp. 1935-1941, 2002.

Classification with Partial labels , N. Nguyen, R. Caruana, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2008.

Imprecise and Uncertain Labelling: A Solution based on Mixture Model and Belief Functions, E. Come, 2008 (powerpoint slides).

Induction of Decision Trees from Partially Classified Data Using Belief Functions , M. Bjanger, Norweigen University of Science and Technology, 2000.

Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth , P. Smyth, M. Burl, U. Fayyad, P. Perona, KDD Workshop 1994, AAAI Technical Report WS-94-03, pp. 109-120, 1994.

Recommender Systems

Trust No One: Evaluating Trust-based Filtering for Recommenders , J. O’Donovan and B. Smyth, In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005, 1663-1665.

Trust in Recommender Systems, J. O’Donovan and B. Symyth, In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 2005, 167-174.

General resources available on this topic :

ICML 2003 Workshop: Learning from Imbalanced Data Sets II

AAAI ‘2000 Workshop on Learning from Imbalanced Data Sets

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data , G. Batista, R. Prati, and M. Monard, SIGKDD Explorations , 6(1):20-29, 2004.

Class Imbalance versus Small Disjuncts , T. Jo and N. Japkowicz, SIGKDD Explorations , 6(1): 40-49, 2004.

Extreme Re-balancing for SVMs: a Case Study , B. Raskutti and A. Kowalczyk, SIGKDD Explorations , 6(1):60-69, 2004.

A Multiple Resampling Method for Learning from Imbalanced Data Sets , A. Estabrooks, T. Jo, and N. Japkowicz, in Computational Intelligence , 20(1), 2004.

SMOTE: Synthetic Minority Over-sampling Technique , N. Chawla, K. Boyer, L. Hall, and W. Kegelmeyer, Journal of Artificial Intelligence Research , 16:321-357.

Generative Oversampling for Mining Imbalanced Datasets, A. Liu, J. Ghosh, and C. Martin, Third International Conference on Data Mining (DMIN-07), 66-72.

Learning from Little: Comparison of Classifiers Given Little Training , G. Forman and I. Cohen, in 8th European Conference on Principles and Practice of Knowledge Discovery in Databases , 161-172, 2004.

Issues in Mining Imbalanced Data Sets – A Review Paper , S. Visa and A. Ralescu, in Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference , pp. 67-73, 2005.

Wrapper-based Computation and Evaluation of Sampling Methods for Imbalanced Datasets , N. Chawla, L. Hall, and A. Joshi, in Proceedings of the 1st International Workshop on Utility-based Data Mining , 24-33, 2005.

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , C. Drummond and R. Holte, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

C4.5 and Imbalanced Data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure , N. Chawla, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Class Imbalances: Are we Focusing on the Right Issue?, N. Japkowicz, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Learning when Data Sets are Imbalanced and When Costs are Unequal and Unknown , M. Maloof, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Uncertainty Sampling Methods for One-class Classifiers , P. Juszcak and R. Duin, in ICML Workshop on Learning from Imbalanced Datasets II , 2003.

Active Learning

Improving Generalization with Active Learning , D Cohn, L. Atlas, and R. Ladner, Machine Learning 15(2), 201-221, May 1994.

On Active Learning for Data Acquisition , Z. Zheng and B. Padmanabhan, In Proc. of IEEE Intl. Conf. on Data Mining, 2002.

Active Sampling for Class Probability Estimation and Ranking , M. Saar-Tsechansky and F. Provost, Machine Learning 54:2 2004, 153-178.

The Learning-Curve Sampling Method Applied to Model-Based Clustering , C. Meek, B. Thiesson, and D. Heckerman, Journal of Machine Learning Research 2:397-418, 2002.

Active Sampling for Feature Selection , S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003.

Heterogeneous Uncertainty Sampling for Supervised Learning , D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994.

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , G. Weiss and F. Provost, Journal of Artificial Intelligence Research, 19:315-354, 2003.

Active Learning using Adaptive Resampling , KDD 2000, 91-98.

Cost-Sensitive Learning

Types of Cost in Inductive Concept Learning , P. Turney, In Proceedings Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning.

Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection , P. Chan and S. Stolfo, KDD 1998.

Dissertations / Theses on the topic 'Data mining'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles.

Consult the top 50 dissertations / theses for your research on the topic 'Data mining.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and the bibliographic reference to the chosen work will be generated automatically in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

Mrázek, Michal. "Data mining." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2019. http://www.nusl.cz/ntk/nusl-400441.

Payyappillil, Hemambika. "Data mining framework." Morgantown, W. Va. : [West Virginia University Libraries], 2005. https://etd.wvu.edu/etd/controller.jsp?moduleName=documentdata&jsp%5FetdId=3807.

Abedjan, Ziawasch. "Improving RDF data with data mining." Phd thesis, Universität Potsdam, 2014. http://opus.kobv.de/ubp/volltexte/2014/7133/.

Liu, Tantan. "Data Mining over Hidden Data Sources." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1343313341.

Taylor, Phillip. "Data mining of vehicle telemetry data." Thesis, University of Warwick, 2015. http://wrap.warwick.ac.uk/77645/.

Sherikar, Vishnu Vardhan Reddy. "I2MAPREDUCE: DATA MINING FOR BIG DATA." CSUSB ScholarWorks, 2017. https://scholarworks.lib.csusb.edu/etd/437.

Zhang, Nan. "Privacy-preserving data mining." [College Station, Tex. : Texas A&M University, 2006. http://hdl.handle.net/1969.1/ETD-TAMU-1080.

Hulten, Geoffrey. "Mining massive data streams /." Thesis, Connect to this title online; UW restricted, 2005. http://hdl.handle.net/1773/6937.

Büchel, Nina. "Faktorenvorselektion im Data Mining /." Berlin : Logos, 2009. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=019006997&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

Shao, Junming. "Synchronization Inspired Data Mining." Diss., lmu, 2011. http://nbn-resolving.de/urn:nbn:de:bvb:19-137356.

Wang, Xiaohong. "Data mining with bilattices." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ59344.pdf.

Knobbe, Arno J. "Multi-relational data mining /." Amsterdam [u.a.] : IOS Press, 2007. http://www.loc.gov/catdir/toc/fy0709/2006931539.html.

丁嘉慧 and Ka-wai Ting. "Time sequences: data mining." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B31226760.

Wan, Chang, and 萬暢. "Mining multi-faceted data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hdl.handle.net/10722/197527.

García-Osorio, César. "Data mining and visualization." Thesis, University of Exeter, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.414266.

Wang, Grant J. (Grant Jenhorn) 1979. "Algorithms for data mining." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/38315.

Anwar, Muhammad Naveed. "Data mining of audiology." Thesis, University of Sunderland, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.573120.

Santos, José Carlos Almeida. "Mining protein structure data." Master's thesis, FCT - UNL, 2006. http://hdl.handle.net/10362/1130.

Garda-Osorio, Cesar. "Data mining and visualisation." Thesis, University of the West of Scotland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.742763.

Rawles, Simon Alan. "Object-oriented data mining." Thesis, University of Bristol, 2007. http://hdl.handle.net/1983/c13bda2c-75c9-4bfa-b86b-04ac06ba0278.

Mao, Shihong. "Comparative Microarray Data Mining." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1198695415.

Novák, Petr. "Data mining časových řad." Master's thesis, Vysoká škola ekonomická v Praze, 2009. http://www.nusl.cz/ntk/nusl-72068.

Blunt, Gordon. "Mining credit card data." Thesis, n.p, 2002. http://ethos.bl.uk/.

Niggemann, Oliver. "Visual data mining of graph based data." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=962400505.

Li, Liangchun. "Web-based data visualization for data mining." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp03/MQ35845.pdf.

Al-Hashemi, Idrees Yousef. "Applying data mining techniques over big data." Thesis, Boston University, 2013. https://hdl.handle.net/2144/21119.

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

KAVOOSIFAR, MOHAMMAD REZA. "Data Mining and Indexing Big Multimedia Data." Doctoral thesis, Politecnico di Torino, 2019. http://hdl.handle.net/11583/2742526.

Adderly, Darryl M. "Data mining meets e-commerce using data mining to improve customer relationship management /." [Gainesville, Fla.]: University of Florida, 2002. http://purl.fcla.edu/fcla/etd/UFE0000500.

Vithal, Kadam Omkar. "Novel applications of Association Rule Mining- Data Stream Mining." AUT University, 2009. http://hdl.handle.net/10292/826.

Patel, Akash. "Data Mining of Process Data in Multivariable Systems." Thesis, KTH, Skolan för elektro- och systemteknik (EES), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-201087.

Cordeiro, Robson Leonardo Ferreira. "Data mining in large sets of complex data." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-22112011-083653/.

XIAO, XIN. "Data Mining Techniques for Complex User-Generated Data." Doctoral thesis, Politecnico di Torino, 2016. http://hdl.handle.net/11583/2644046.

Tong, Suk-man Ivy. "Techniques in data stream mining." Click to view the E-thesis via HKUTO, 2005. http://sunzi.lib.hku.hk/hkuto/record/B34737376.

Borgelt, Christian. "Data mining with graphical models." [S.l. : s.n.], 2000. http://deposit.ddb.de/cgi-bin/dokserv?idn=962912107.

Weber, Irene. "Suchraumbeschränkung für relationales Data Mining." [S.l. : s.n.], 2004. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB11380447.

Maden, Engin. "Data Mining On Architecture Simulation." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/2/12611635/index.pdf.

Drwal, Maciej. "Data mining in distributedcomputer systems." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5709.

Thun, Julia, and Rebin Kadouri. "Automating debugging through data mining." Thesis, KTH, Data- och elektroteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-203244.

Rahman, Sardar Muhammad Monzurur, and mrahman99@yahoo com. "Data Mining Using Neural Networks." RMIT University. Electrical & Computer Engineering, 2006. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080813.094814.

Guo, Shishan. "Data mining in crystallographic databases." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0012/NQ52854.pdf.

Sun, Wenyi. "Data mining extension for economics." Diss., Columbia, Mo. : University of Missouri-Columbia, 2006. http://hdl.handle.net/10355/5869.

Papadatos, George. "Data mining for lead optimisation." Thesis, University of Sheffield, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556989.

Rice, Simon B. "Text data mining in bioinformatics." Thesis, University of Manchester, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.488351.

Lin, Zhenmin. "Privacy Preserving Distributed Data Mining." UKnowledge, 2012. http://uknowledge.uky.edu/cs_etds/9.

Tong, Suk-man Ivy, and 湯淑敏. "Techniques in data stream mining." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2005. http://hub.hku.hk/bib/B34737376.

Luo, Man. "Data mining and classical statistics." Virtual Press, 2004. http://liblink.bsu.edu/uhtbin/catkey/1304657.

Cai, Zhongming. "Technical aspects of data mining." Thesis, Cardiff University, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.395784.

Shioda, Romy 1977. "Integer optimization in data mining." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/17579.

Lo, Ya-Chin, and 羅雅琴. "Data mining in bioinformatics -- NCBI tools for data mining." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/38227591029165701821.

Data Mining

Domain Driven Data Mining

It is a methodology of data mining to discover actionable knowledge and insight from complex data in a composite environment. Data-driven pattern mining faces challenges in the discovery of actionable knowledge from databases. To tackle this issue, domain driven data mining has been proposed and this will promote the paradigm shift from data-driven pattern mining to domain-driven data mining. This is another good thesis topic in Data Mining.

Decision Support System

It is a type of information system to support businesses and organizations in decision making. It helps people to make a better decision about problems which may be unstructured or semi-structured. Data Mining techniques are used in decision support systems. These techniques help in finding hidden patterns and relations from the data. Developing a decision support system requires time, cost, and effort.

Opinion Mining

Opinion mining, also known as sentiment mining, is a natural language processing method to analyze the sentiments of customers about a particular product. It is widely used in areas like surveys, public reviews, social media, healthcare systems, marketing etc. Automated opinion mining employs machine learning algorithms to analyze the sentiments.

These were the list of latest research, project, and thesis topics in data mining. M.Tech and Ph.D. students can  contact Techsparks  for thesis and research help in data mining.

Click the following link to download the latest thesis topics in Data Mining :

  • Latest Topics in Data Mining for thesis and research(PDF)

Techsparks provides the following two guidance packages:

Techsparks Standard Package

  • Problem Definition/Topic Selection
  • Latest IEEE Base Paper (Research Paper Selection)
  • Synopsis/Proposal (Plagiarism Free)
  • Complete Implementation (Base Paper Implementation, Solution Implementation, Result Analysis and Comparison)
  • All Kind Of Changes And Modifications
  • Online Live Video Classes Through Skype

Techsparks Ultimate Package

  • Thesis Report (Plagiarism Free)
  • Research Paper (With Guaranteed Acceptance In Any International Journal Like IEEE, Scopus, Springer, Science Direct)

Trending Data Mining Thesis Topics

Data mining is the act of analyzing large amounts of data in order to uncover business insights that can assist firms in fixing issues, reducing risks, and embracing new possibilities. This article provides a complete picture of data mining thesis topics, covering all the information you need regarding data mining research.

How to Implement Data Mining Thesis Topics

How does data mining work?

  • A standard data mining design begins with the appropriate business statement in the questionnaire, the appropriate data is collected to tackle it, and the data is prepared for the examination.
  • What happens in the earlier stages determines how successful the later versions are.
  • Data miners should assure the data quality they utilize as input for research because bad data quality results in poor outcomes.
  • Establishing a detailed understanding of the design factors, such as the present business scenario, the project’s main business goal, and the performance objectives.
  • Identifying the data required to address the problem as well as collecting this from all sorts of sources.
  • Addressing any errors and bugs, like incomplete or duplicate data, and processing the data in a suitable format to solve the research questions.
  • Algorithms are used to find patterns from data.
  • Identifying if or how another model’s output will contribute to the achievement of a business objective.
  • In order to acquire the optimum outcome, an iterative process is frequently used to identify the best method.
  • Getting the project’s findings suitable for making decisions in real-time

The techniques and actions listed above are repeated until the best outcomes are achieved. Our engineers and developers have extensive knowledge of the tools, techniques, and approaches used in the processes described above. We guarantee that we will provide the best research advice with respect to data mining thesis topics and complete your project on schedule. What are the important data mining tasks?

Data Mining Tasks 

  • Data mining finds application in many ways including description, Analysis, summarization of data, and clarifying the conceptual understanding by data description
  • And also prediction, classification, dependency analysis, segmentation, and case-based reasoning are some of the important data mining tasks
  • Regression – numerical data prediction (stock prices, temperatures, and total sales)
  • Data warehousing – business decision making and large-scale data mining
  • Classification – accurate prediction of target classes and their categorization
  • Association rule learning – market-based analytical tools that were involved in establishing variable data set relationship
  • Machine learning – statistical probability-based decision making method without complicated programming
  • Data analytics – digital data evaluation for business purposes
  • Clustering – dataset partitioning into clusters and subclasses for analyzing natural data structure and format
  • Artificial intelligence – human-based Data analytics for reasoning, solving problems, learning, and planning
  • Data preparation and cleansing – conversion of raw data into a processed form for identification and removal of errors

You can look at our website for a more in-depth look at all of these operations. We supply you with the needed data, as well as any additional data you may need for your data mining thesis topics . We supply non-plagiarized data mining thesis assistance in any fresh idea of your choice. Let us now discuss the stages in data mining that are to be included in your thesis topics

How to work on a data mining thesis topic? 

 The following are the important stages or phases in developing data mining thesis topics.

  • First of all, you need to identify the present demand and address the question
  • The next step is defining or specifying the problem
  • Collection of data is the third step
  • Alternative solutions and designs have to be analyzed in the next step
  • The proposed methodology has to be designed
  • The system is then to be implemented

Usually, our experts help in writing code and implementing it successfully without hassle. By consistently following the above steps you can develop one of the best data mining thesis topics of recent times. Furthermore, it is important for you to have a good idea of all the tasks and techniques involved in data mining, which are discussed below.

  • Data visualization
  • Neural networks
  • Statistical modeling
  • Genetic algorithms and neural networks
  • Decision trees and induction
  • Discriminant analysis
  • Induction techniques
  • Association rules and data visualization
  • Bayesian networks
  • Correlation
  • Regression analysis
  • Regression analysis and regression trees

If you are looking forward to selecting the best tool for your data mining project then evaluating its consistency and efficiency stands first. For this, you need to gain enough technical data from real-time executed projects for which you can directly contact us. Since we have delivered an ample number of data mining thesis topics successfully we can help you in finding better solutions to all your research issues. What are the points to be remembered about the data mining strategy?

  • Furthermore, data mining strategies must be picked before instruments in order to prevent using strategies that do not align with the article’s true purposes.
  • The typical data mining strategy has always been to evaluate a variety of methodologies in order to select one which best fits the situation.
  • As previously said, there are some principles that may be used to choose effective strategies for data mining projects.
  • Since they are easy to handle and comprehend
  • They could indeed collaborate with definitional and parametric data
  • They are unaffected by critical values and can function with incomplete information
  • They could also expose various interrelationships and an absence of linear combinations
  • They could indeed handle noise in records
  • They can process huge amounts of data.
  • Decision trees, on the other hand, have significant drawbacks.
  • Many rules are frequently necessary for dependent variables or numerous regressions, and tiny changes in the data can result in very different tree architectures.

All such pros and cons of various data mining aspects are discussed on our website. We will provide you with high-quality research assistance and thesis writing assistance . You may see proof of our skill and the unique approach that we generated in the field by looking at the samples of the thesis that we produced on our website. We also offer an internal review to help you feel more confident. Let us now discuss the recent data mining methodologies

Current methods in Data Mining

  • Prediction of data (time series data mining)
  • Discriminant and cluster analysis
  • Logistic regression and segmentation

Our technical specialists and technicians usually give adequate accurate data, a thorough and detailed explanation, and technical notes for all of these processes and algorithms. As a result, you can get all of your questions answered in one spot. Our technical team is also well-versed in current trends, allowing us to provide realistic explanations for all new developments. We will now talk about the latest data mining trends

Latest Trending Data Mining Thesis Topics

  • Visual data mining and data mining software engineering
  • Interaction and scalability in data mining
  • Exploring applications of data mining
  • Biological and visual data mining
  • Cloud computing and big data integration
  • Data security and protecting privacy in data mining
  • Novel methodologies in complex data mining
  • Data mining in multiple databases and rationalities
  • Query language standardization in data mining
  • Integration of MapReduce, Amazon EC2, S3, Apache Spark, and Hadoop into data mining

These are the recent trends in data mining. We insist that you choose one of the topics that interest you the most. Having an appropriate content structure or template is essential while writing a thesis . We design the plan in a chronological order relevant to the study assessment with this in mind. The incorporation of citations is one of the most important aspects of the thesis. We focus not only on authoring but also on citing essential sources in the text. Students frequently struggle to deal with appropriate proposals when commencing their thesis. We have years of experience in providing the greatest study and data mining thesis writing services to the scientific community, which are promptly and widely acknowledged. We will now talk about future research directions of research in various data mining thesis topics

Future Research Directions of Data Mining

  • The potential of data mining and data science seems promising, as the volume of data continues to grow.
  • It is expected that the total amount of data in our digital cosmos will have grown from 4.4 zettabytes to 44 zettabytes.
  • We’ll also generate 1.7 gigabytes of new data for every human being on this planet each second.
  • Mining algorithms have completely transformed as technology has advanced, and thus have tools for obtaining useful insights from data.
  • Only corporations like NASA could utilize their powerful computers to examine data once upon a time because the cost of producing and processing data was simply too high.
  • Organizations are now using cloud-based data warehouses to accomplish any kinds of great activities with machine learning, artificial intelligence, and deep learning.

The Internet of Things and wearable electronics, for instance, have transformed connected devices into data-generating engines that provide limitless insight into people and organizations, if firms can gather, store, and analyze the data quickly enough. What are the aspects to be remembered when choosing the best data mining thesis topics?

  • An excellent thesis topic is a broad concept that has to be developed, verified, or refuted.
  • Your thesis topic must capture your curiosity, as well as the involvement of both the supervisor and the academicians.
  • Your thesis topic must be relevant to your studies and should be able to withstand examination.

Our engineers and experts can provide you with any type of research assistance on any of these data mining development tools . We satisfy the criteria of your universities by ensuring several revisions, appropriate formatting and editing of your thesis, comprehensive grammar check, and so on . As a result, you can contact us with confidence for complete assistance with your data mining thesis. What are the important data mining thesis topics?

Trending Data Mining Research Thesis Topics

Research Topics in Data Mining

  • Handling cost-effective, unbalanced non-static data
  • Issues related to data mining and their solutions
  • Network settings in data mining and ensuring privacy, security, and integrity of data
  • Environmental and biological issues in data mining
  • Complex data mining and sequential data mining (time series data)
  • Data mining at higher dimensions
  • Multi-agent data mining and distributed data mining
  • High-speed data mining
  • Development of unified data mining theory

We currently provide full support for all parts of research study, development, investigation, including project planning, technical advice, legitimate scientific data, thesis writing, paper publication, assignments and project planning, internal review, and many other services. As a result, you can contact us for any kind of help with your data mining thesis topics.

  • Open access
  • Published: 03 March 2022

Educational data mining: prediction of students' academic performance using machine learning algorithms

  • Mustafa Yağcı   ORCID: orcid.org/0000-0003-2911-3909 1  

Smart Learning Environments, volume 9, Article number: 11 (2022)

Educational data mining has become an effective tool for exploring the hidden relationships in educational data and predicting students' academic achievements. This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, nearest neighbour, support vector machines, logistic regression, Naïve Bayes, and k-nearest neighbour algorithms, which are among the machine learning algorithms, were calculated and compared to predict the final exam grades of the students. The dataset consisted of the academic achievement grades of 1854 students who took the Turkish Language-I course in a state University in Turkey during the fall semester of 2019–2020. The results show that the proposed model achieved a classification accuracy of 70–75%. The predictions were made using only three types of parameters; midterm exam grades, Department data and Faculty data. Such data-driven studies are very important in terms of establishing a learning analysis framework in higher education and contributing to the decision-making processes. Finally, this study presents a contribution to the early prediction of students at high risk of failure and determines the most effective machine learning methods.
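The comparison described in the abstract can be prototyped with standard scikit-learn components. The sketch below is not the authors' code: the file name, column names and the pass threshold are hypothetical, and the classifiers are used with default-style settings purely to illustrate a cross-validated comparison of random forest, logistic regression, Naïve Bayes, k-nearest neighbour and support vector machine models.

```python
# Illustrative sketch only: predict final-exam outcome from midterm grade,
# department and faculty, and compare several classifiers by cross-validation.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("grades.csv")                       # hypothetical dataset layout
X = df[["midterm", "department", "faculty"]]
y = (df["final"] >= 50).astype(int)                  # assumed pass/fail threshold

# scale the numeric grade, one-hot encode the categorical columns;
# sparse_threshold=0 keeps the output dense so GaussianNB can consume it
pre = make_column_transformer(
    (StandardScaler(), ["midterm"]),
    (OneHotEncoder(handle_unknown="ignore"), ["department", "faculty"]),
    sparse_threshold=0,
)

models = {
    "random forest":          RandomForestClassifier(n_estimators=200, random_state=0),
    "logistic regression":    LogisticRegression(max_iter=1000),
    "naive Bayes":            GaussianNB(),
    "k-nearest neighbour":    KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(),
}

for name, clf in models.items():
    acc = cross_val_score(make_pipeline(pre, clf), X, y, cv=5, scoring="accuracy")
    print(f"{name:24s} mean accuracy = {acc.mean():.3f}")
```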

Introduction

The application of data mining methods in the field of education has attracted great attention in recent years. Data Mining (DM) is the field of discovering new and potentially useful information or meaningful results from big data (Witten et al., 2011). It also aims to obtain new trends and new patterns from large datasets by using different classification algorithms (Baker & Inventado, 2014).

Educational data mining (EDM) is the use of traditional DM methods to solve problems related to education (Baker & Yacef, 2009 ; cited in Fernandes et al., 2019 ). EDM is the use of DM methods on educational data such as student information, educational records, exam results, student participation in class, and the frequency of students' asking questions. In recent years, EDM has become an effective tool used to identify hidden patterns in educational data, predict academic achievement, and improve the learning/teaching environment.

Learning analytics has gained a new dimension through the use of EDM (Waheed et al., 2020). Learning analytics covers the various aspects of collecting student information, better understanding the learning environment by examining and analysing it, and revealing the best student/teacher performance (Long & Siemens, 2011). Learning analytics is the compilation, measurement, and reporting of data about students and their contexts in order to understand and optimize learning and the environments in which it takes place. It also supports institutions in developing new strategies.

Another dimension of learning analytics is predicting student academic performance, uncovering patterns of system access and navigational actions, and determining students who are potentially at risk of failing (Waheed et al., 2020). Learning management systems (LMS), student information systems (SIS), intelligent teaching systems (ITS), MOOCs, and other web-based education systems leave digital data that can be examined to evaluate students' possible behaviour. Using EDM methods, these data can be employed to analyse the activities of successful students and those who are at risk of failure, to develop corrective strategies based on student academic performance, and therefore to assist educators in the development of pedagogical methods (Casquero et al., 2016; Fidalgo-Blanco et al., 2015).

The data collected on educational processes offer new opportunities to improve the learning experience and to optimize users' interaction with technological platforms (Shorfuzzaman et al., 2019 ). The processing of educational data yields improvements in many areas such as predicting student behaviour, analytical learning, and new approaches to education policies (Capuano & Toti, 2019 ; Viberg et al., 2018 ). This comprehensive collection of data will not only allow education authorities to make data-based policies, but also form the basis of software to be developed with artificial intelligence on the learning process.

EDM enables educators to predict situations such as dropping out of school or losing interest in a course, to analyse the internal factors affecting student performance, and to apply statistical techniques to predict students' academic performance. A variety of DM methods are employed to predict student performance and to identify slow learners and dropouts (Hardman et al., 2013; Kaur et al., 2015). Early prediction is a new phenomenon that includes assessment methods to support students by proposing appropriate corrective strategies and policies in this field (Waheed et al., 2020).

Learning management systems, quickly put into practice especially during the pandemic period, have become an indispensable part of higher education. As students use these systems, the log records they produce have become ever more accessible (Macfadyen & Dawson, 2010; Kotsiantis et al., 2013; Saqr et al., 2017). Universities should now improve their capacity to use these data to predict academic success and ensure student progress (Bernacki et al., 2020).

As a result, EDM provides the educators with new information by discovering hidden patterns in educational data. Using this model, some aspects of the education system can be evaluated and improved to ensure the quality of education.

In various studies on EDM, e-learning systems have been successfully analysed (Lara et al., 2014 ). Some studies have also classified educational data (Chakraborty et al., 2016 ), while some have tried to predict student performance (Fernandes et al., 2019 ).

Asif et al. ( 2017 ) focused on two aspects of the performance of undergraduate students using DM methods. The first aspect is to predict the academic achievements of students at the end of a four-year study program. The second one is to examine the development of students and combine it with predictive results. They divided the students into two groups, a low achievement and a high achievement group. They found that it is important for educators to focus on a small number of courses indicating particularly good or poor performance in order to offer timely warnings, support underperforming students, and offer advice and opportunities to high-performing students. Cruz-Jesus et al. ( 2020 ) predicted student academic performance with 16 demographic and other features such as age, gender, class attendance, internet access, computer possession, and the number of courses taken. Random forest, logistic regression, k-nearest neighbours and support vector machines, which are among the machine learning methods, were able to predict students’ performance with accuracy ranging from 50 to 81%.

Fernandes et al. ( 2019 ) developed a model with the demographic characteristics of the students and the achievement grades obtained from the in-term activities. In that study, students' academic achievement was predicted with classification models based on the Gradient Boosting Machine (GBM). The results showed that the best attributes for estimating achievement scores were the previous year's achievement scores and absenteeism. The authors found that demographic characteristics such as neighbourhood, school, and age information were also potential indicators of success or failure. In addition, they argued that this model could guide the development of new policies to prevent failure. Similarly, by using the student data requested during registration and environmental factors, Hoffait and Schyns ( 2017 ) identified the students with the potential to fail. They found that students with potential difficulties could be classified more precisely by using DM methods. Moreover, their approach makes it possible to rank the students by level of risk. Rebai et al. ( 2020 ) proposed a machine learning-based model to identify the key factors affecting the academic performance of schools and to determine the relationships between these factors. They concluded that the regression trees showed that the most important factors associated with higher performance were school size, competition, class size, parental pressure, and gender proportions. In addition, according to the random forest algorithm results, the school size and the percentage of girls had a powerful impact on the predictive accuracy of the model.

Ahmad and Shahzadi ( 2018 ) proposed a machine learning-based model to answer the question of whether students were at risk regarding their academic performance. Using the students' learning skills, study habits, and academic interaction features, they made a prediction with a classification accuracy of 85%. The researchers concluded that the model they proposed could be used to identify academically unsuccessful students. Musso et al. ( 2020 ) proposed a machine learning model based on learning strategies, perception of social support, motivation, socio-demographics, health condition, and academic performance characteristics. With this model, they predicted academic performance and dropouts. They concluded that the predictive variable with the highest effect on predicting GPA was learning strategies, while the variable with the greatest effect on determining dropouts was background information.

Waheed et al. ( 2020 ) designed a model with artificial neural networks on students' records related to their navigation through the LMS. The results showed that demographics and student clickstream activities had a significant impact on student performance. Students who navigated through courses performed higher, while students' mere participation in the learning environment had nothing to do with their performance. Nevertheless, the authors concluded that the deep learning model could be an important tool in the early prediction of student performance. Xu et al. ( 2019 ) determined the relationship between the internet usage behaviours of university students and their academic performance and predicted students’ performance with machine learning methods. The model they proposed predicted students' academic performance with a high level of accuracy. The results suggested that internet connection frequency features were positively correlated with academic performance, whereas internet traffic volume features were negatively correlated with academic performance. In addition, they concluded that internet usage features had an important role in students' academic performance. Bernacki et al. ( 2020 ) investigated whether the log records in the learning management system alone would be sufficient to predict achievement. They concluded that the behaviour-based prediction model successfully predicted 75% of those who would need to repeat a course. They also stated that, with this model, students who might be unsuccessful in subsequent semesters could be identified and supported. Burgos et al. ( 2018 ) predicted the achievement grades that students might obtain in subsequent semesters and designed a tool for students who were likely to fail. They found that the number of unsuccessful students decreased by 14% compared to previous years. A comparative analysis of studies predicting academic achievement grades using machine learning methods is given in Table 1 .

A review of previous research that aimed to predict academic achievement indicates that researchers have applied a range of machine learning algorithms, including multiple, probit, and logistic regression, neural networks, and C4.5 and J48 decision trees. More recently, random forests (Zabriskie et al., 2019 ), genetic programming (Xing et al., 2015 ), and Naïve Bayes algorithms (Ornelas & Ordonez, 2017 ) have also been used. The prediction accuracy of these models reaches very high levels.

Accurately predicting student academic performance requires a deep understanding of the factors and features that affect student results and achievement (Alshanqiti & Namoun, 2020 ). For this purpose, Hellas et al. ( 2018 ) reviewed 357 articles on student performance, detailing the impact of 29 features. These features were mainly related to psychomotor skills such as course and pre-course performance, student participation, student demographics such as gender, high school performance, and self-regulation. The dropout rates, however, were mainly influenced by student motivation, habits, social and financial issues, lack of progress, and career transitions.

The literature review suggests that it is necessary to improve the quality of education by predicting students' academic performance and supporting those who are in the risk group. In the literature, academic performance has been predicted with many and various variables: digital traces left by students on the internet (browsing, lesson time, percentage of participation) (Fernandes et al., 2019 ; Rubin et al., 2010 ; Waheed et al., 2020 ; Xu et al., 2019 ), students' demographic characteristics (gender, age, economic status, number of courses attended, internet access, etc.) (Bernacki et al., 2020 ; Rizvi et al., 2019 ; García-González & Skrita, 2019 ; Rebai et al., 2020 ; Cruz-Jesus et al., 2020 ; Aydemir, 2017 ), learning skills, study approaches and study habits (Ahmad & Shahzadi, 2018 ), learning strategies, social support perception, motivation, socio-demography, health condition and academic performance characteristics (Costa-Mendes et al., 2020 ; Gök, 2017 ; Kılınç, 2015 ; Musso et al., 2020 ), homework, projects and quizzes (Kardaş & Güvenir, 2020 ), etc. In almost all models developed in such studies, prediction accuracy ranges from 70 to 95%. However, collecting and processing such a variety of data both takes a lot of time and requires expert knowledge. Similarly, Hoffait and Schyns ( 2017 ) noted that collecting so much data is difficult and that socio-economic data are unnecessary. Moreover, these demographic or socio-economic data may not always give the right idea for preventing failure (Bernacki et al., 2020 ).

This study concerns predicting students' academic achievement using grades only, with no demographic characteristics and no socio-economic data. It aims to develop a new model based on machine learning algorithms to predict the final exam grades of undergraduate students from their midterm exam grades together with the faculty and department in which they are enrolled.

For this purpose, classification algorithms with the highest performance in predicting students’ academic achievement were determined by using machine learning classification algorithms. The reason for choosing the Turkish Language-I course was that it is a compulsory course that all students enrolled in the university must take. Using this model, students’ final exam grades were predicted. These models will enable the development of pedagogical interventions and new policies to improve students' academic performance. In this way, the number of potentially unsuccessful students can be reduced following the assessments made after each midterm.

This section describes the details of the dataset, pre-processing techniques, and machine learning algorithms employed in this study.

Educational institutions regularly store all available data about students in electronic form. Data are stored in databases for processing. These data can be of many types and volumes, from students' demographics to their academic achievements. In this study, the data were taken from the Student Information System (SIS), where all student records are stored at a state university in Turkey. From these records, the midterm exam grades, final exam grades, faculty, and department of 1854 students who took the Turkish Language-I course in the 2019–2020 fall semester were selected as the dataset. Table 2 shows the distribution of students according to academic unit. Moreover, the dataset is provided as Additional file 1.

Midterm and final exam grades range from 0 to 100. In this system, the end-of-semester achievement grade is calculated by taking 40% of the midterm exam and 60% of the final exam. Students with an achievement grade below 60 are unsuccessful and those above 60 are successful. The midterm exam is usually held in the middle of the academic semester and the final exam at the end of the semester. There are approximately 9 weeks (2.5 months) between the midterm exam and the final exam. In other words, the final exam predictions leave a two-and-a-half-month window for corrective actions for students who are at risk of failing. The study thus investigated how strongly a student's performance in the middle of the semester determines his or her performance at the end of the semester.
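As a minimal sketch of the grading scheme just described (assuming, for illustration only, that a grade of exactly 60 counts as successful), the 40/60 weighting and pass rule can be written as:

```python
def achievement_grade(midterm: float, final: float) -> float:
    """End-of-semester achievement grade: 40% midterm + 60% final (0-100 scale)."""
    return 0.4 * midterm + 0.6 * final

def is_successful(midterm: float, final: float) -> bool:
    """Assumed pass rule: an achievement grade of at least 60 is successful."""
    return achievement_grade(midterm, final) >= 60

# A student with 55 on the midterm needs roughly 63.4 on the final to pass.
print(is_successful(55, 64))  # True  (grade is about 60.4)
print(is_successful(55, 63))  # False (grade is about 59.8)
```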

Data identification and collection

At this phase, it is determined from which sources the data will be collected, which features of the data will be used, and whether the collected data are suitable for the purpose. Feature selection involves decreasing the number of variables used to predict a particular outcome. The goal is to facilitate the interpretability of the model, reduce complexity, increase the computational efficiency of algorithms, and avoid overfitting.

Establishing the DM model and implementing the algorithms

RF, NN, LR, SVM, NB and kNN were employed to predict students' academic performance. The prediction accuracy was evaluated using tenfold cross validation. The DM process serves two main purposes. The first purpose is to make predictions by analyzing the data in the database (predictive model). The second one is to describe behaviors (descriptive model). In predictive models, a model is created by using data with known results. Then, using this model, the result values are predicted for datasets whose results are unknown. In descriptive models, the patterns in the existing data are defined to make decisions.
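The study carried out this comparison in the Orange tool described below; an equivalent sketch with scikit-learn is shown here, where the file name and column names are assumptions, and each of the six classifiers is scored with tenfold cross-validation as stated above.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical file and column names; the real dataset is provided as Additional file 1.
data = pd.read_csv("turkish_language_I.csv")
X = pd.get_dummies(data[["midterm", "faculty", "department"]],
                   columns=["faculty", "department"])  # one-hot encode categorical features
y = data["final_category"]  # final exam grade converted to a categorical label

models = {
    "RF": RandomForestClassifier(),
    "NN": MLPClassifier(max_iter=1000),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "kNN": KNeighborsClassifier(),
}

# Tenfold cross-validated classification accuracy for each algorithm.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

The mean accuracy printed for each algorithm plays the role of the CA metric reported later for the models.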

When the focus is on analysing the causes of success or failure, statistical methods such as logistic regression and time series can be employed (Ortiz & Dehon, 2008 ; Arias Ortiz & Dehon, 2013 ). However, when the focus is on forecasting, neural networks (Delen, 2010 ; Vandamme et al., 2007 ), support vector machines (Huang & Fang, 2013 ), decision trees (Delen, 2011 ; Nandeshwar et al., 2011 ) and random forests (Delen, 2010 ; Vandamme et al., 2007 ) are more efficient and give more accurate results. Statistical techniques aim to create a model that can successfully predict output values from the available input data. Machine learning methods, on the other hand, automatically create a model that matches the input data to the expected target values when given a supervised optimization problem.

The performance of the model was measured by confusion matrix indicators. It is understood from the literature that there is no single classifier that works best for every prediction problem. Therefore, it is necessary to investigate which classifiers are better suited to the analysed data (Asif et al., 2017 ).

Experiments and results

The entire experimental phase was performed with the Orange machine learning software. Orange is a powerful and easy-to-use component-based DM programming tool for expert data scientists as well as for data science beginners. In Orange, data analysis is done by stacking widgets into workflows. Each widget performs a data retrieval, data pre-processing, visualization, modelling, or evaluation task. A workflow is a series of actions that are performed on the platform to accomplish a specific task. Comprehensive data analysis charts can be created by combining different components in a workflow. Figure  1 shows the workflow diagram designed.

Figure 1. The workflow of the designed model.

The dataset included midterm exam grades, final exam grades, Faculty, and Department of 1854 students taking the Turkish Language-I course in the 2019–2020 Fall Semester. The entire dataset is provided as Additional file 1 . Table 3 shows part of the dataset.

In the dataset, students' midterm exam grades, final exam grades, faculty, and department information were determined as features. Each record contains the data associated with one student. The midterm exam and final exam grade variables were explained under the heading "dataset". The faculty variable represents the faculties of Kırşehir Ahi Evran University, and the department variable represents the departments within the faculties. In the development of the model, the midterm, faculty, and department information were determined as the independent variables and the final exam grade as the dependent variable. Table 4 shows the variable model.

After the variable model was determined, the midterm exam grades and final exam grades were categorized according to the equal-width discretization model. Table 5 shows the criteria used in converting midterm exam grades and final exam grades into the categorical format.
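A minimal sketch of the equal-width discretization step is given below, assuming the four grade categories implied by the results section (below 32.5, 32.5–55, 55–77.5, above 77.5); the exact labels used in Table 5 may differ.

```python
import pandas as pd

# Four equal-width bins over the 0-100 grade scale (edges inferred from the reported ranges).
bins = [-0.001, 32.5, 55, 77.5, 100]
labels = ["<32.5", "32.5-55", "55-77.5", ">77.5"]

grades = pd.Series([28, 47, 61, 83, 95])
categories = pd.cut(grades, bins=bins, labels=labels)
print(categories.tolist())  # ['<32.5', '32.5-55', '55-77.5', '>77.5', '>77.5']
```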

In Table 6 , the values in the final column are the actual values. The values in the RF, SVM, LR, KNN, NB, and NN columns are the values predicted by the proposed model. For example, according to Table 5 , std1's actual final grade was in the range 55 to 77.5. While the predicted values of the RF, SVM, LR, NB, and NN models were also in this range, the predicted value of the kNN model was greater than 77.5.

Evaluation of the model performance

The performance of the model was evaluated with the confusion matrix, classification accuracy (CA), precision, recall, F-score (F1), and area under the ROC curve (AUC) metrics.

Confusion matrix

The confusion matrix shows the current situation in the dataset and the number of correct/incorrect predictions of the model. Table 7 shows the confusion matrix. The performance of the model is calculated from the numbers of correctly and incorrectly classified instances. The rows show the actual numbers of samples in the test set, and the columns represent the model's predictions.

In Table 7 , true positive (TP) and true negative (TN) show the number of correctly classified instances. False positive (FP) shows the number of instances predicted as 1 (positive) when they should be in the 0 (negative) class. False negative (FN) shows the number of instances predicted as 0 (negative) when they should be in class 1 (positive).

Table 8 shows the confusion matrix for the RF algorithm. In the 4 × 4 confusion matrix, the main diagonal shows the percentages of correctly predicted instances, and the elements other than the main diagonal show the percentages of incorrect predictions.

Table 8 shows that 84.9% of those with an actual final grade greater than 77.5, 71.2% of those in the range 55–77.5, 65.4% of those in the range 32.5–55, and 60% of those with less than 32.5 were predicted correctly. The confusion matrices of the other algorithms are shown in Tables 9 , 10 , 11 , 12 , and 13 .

Classification accuracy:  CA is the ratio of the correct predictions (TP + TN) to the total number of instances (TP + TN + FP + FN).

Precision: Precision is the ratio of the number of positive instances that are correctly classified to the total number of instances that are predicted positive. It takes a value in the range [0, 1].

Recall: Recall is the ratio of the number of correctly classified positive instances to the number of all instances whose actual class is positive. Recall is also called the true positive rate. It takes a value in the range [0, 1].

F-Criterion (F1):  There is an inverse relationship between precision and recall. Therefore, the harmonic mean of both criteria is calculated for more accurate and sensitive results. This is called the F-criterion.
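Written in terms of the confusion-matrix counts, CA = (TP + TN)/(TP + TN + FP + FN), precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 = 2 · precision · recall/(precision + recall). The snippet below is only an illustration with made-up labels, using scikit-learn to cross-check the hand computation.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Illustrative binary labels only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ca        = (tp + tn) / (tp + tn + fp + fn)                 # classification accuracy
precision = tp / (tp + fp)
recall    = tp / (tp + fn)                                  # true positive rate
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(ca, 3), round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.8 0.8 0.8
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```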

Receiver operating characteristics (ROC) curve

The AUC-ROC curve is used to evaluate the performance of a classification problem. AUC-ROC is a widely used metric for evaluating machine learning algorithms, especially on imbalanced datasets, and indicates how well the model separates the classes.

AUC: Area under the ROC curve

The larger the area covered, the better the machine learning algorithm is at distinguishing the given classes. The ideal AUC value is 1. The AUC, classification accuracy (CA), F-criterion (F1), precision, and recall values of the models are shown in Table 14 .
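Since the final grade is discretized into four categories, the AUC reported for each algorithm is a multi-class value, typically obtained by averaging one-vs-rest ROC curves over the classes. A hedged sketch with scikit-learn, on synthetic stand-in data rather than the study's dataset, is:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a four-class grade prediction task; illustration only.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Cross-validated class probabilities, then a macro-averaged one-vs-rest AUC.
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                          cv=10, method="predict_proba")
print(f"multi-class AUC (OvR) = {roc_auc_score(y, proba, multi_class='ovr'):.3f}")
```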

The AUC values of the RF, NN, SVM, LR, NB, and kNN algorithms were 0.860, 0.863, 0.804, 0.826, 0.810, and 0.810, respectively. The classification accuracies of the RF, NN, SVM, LR, NB, and kNN algorithms were 0.746, 0.746, 0.735, 0.717, 0.713, and 0.699, respectively. According to these findings, the RF algorithm, for example, was able to achieve 74.6% accuracy. In other words, there was very strong agreement between the predicted data and the actual data: 74.6% of the samples were classified correctly.

Discussion and conclusion

This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, neural networks, support vector machines, logistic regression, Naïve Bayes, and k-nearest neighbour algorithms were calculated and compared to predict the students' final exam grades. This study focused on two parameters. The first parameter was the prediction of academic performance based on previous achievement grades. The second one was the comparison of the performance indicators of the machine learning algorithms.

The results show that the proposed model achieved a classification accuracy of 70–75%. According to this result, it can be said that students' midterm exam grades are an important predictor of their final exam grades. RF, NN, SVM, LR, NB, and kNN are algorithms with a very high accuracy rate that can be used to predict students' final exam grades. Furthermore, the predictions were made using only three types of parameters: midterm exam grades, department data, and faculty data. The results of this study were compared with studies that predicted the academic achievement grades of students using various demographic and socio-economic variables. Hoffait and Schyns ( 2017 ) proposed a model that uses the academic achievement of students in previous years. With this model, they predicted students' chances of success in the courses they would take in the new semester. They found that 12.2% of the students had a very high risk of failure, with a 90% confidence rate. Waheed et al. ( 2020 ) predicted student achievement with demographic and geographic characteristics. They found that these characteristics have a significant effect on students' academic performance and predicted the failure or success of the students with 85% accuracy. Xu et al. ( 2019 ) found that internet usage data can distinguish and predict students' academic performance. Costa-Mendes et al. ( 2020 ) and Cruz-Jesus et al. ( 2020 ) predicted the academic achievement of students in the light of income, age, employment, cultural level indicators, place of residence, and socio-economic information. Similarly, Babić ( 2017 ) predicted students’ performance with an accuracy of 65% to 100% with artificial neural network, classification tree, and support vector machine methods.

Another result of this study was that the RF, NN, and SVM algorithms had the highest classification accuracy, while kNN had the lowest. According to this result, it can be said that the RF, NN, and SVM algorithms give more accurate results when predicting the academic achievement grades of students with machine learning algorithms. The results were compared with the results of research in which machine learning algorithms were employed to predict academic performance from various variables. For example, Hoffait and Schyns ( 2017 ) compared the performances of the LR, ANN, and RF algorithms in identifying students at high risk of academic failure from various demographic characteristics. They ranked the algorithms from highest to lowest accuracy as LR, ANN, and RF. On the other hand, Waheed et al. ( 2020 ) found that the SVM algorithm performed better than the LR algorithm. According to Xu et al. ( 2019 ), the algorithm with the highest performance is SVM, followed by the NN algorithm, and the decision tree is the algorithm with the lowest performance.

The proposed model predicted the final exam grades of students with 73% accuracy. According to this result, it can be said that academic achievement can be predicted with this model in the future. By predicting students' achievement grades in advance, students can be given the chance to review their working methods and improve their performance. The importance of the proposed method can be better understood considering that there are approximately 2.5 months between the midterm exams and the final exams in higher education. Similarly, Bernacki et al. ( 2020 ) worked on an early warning model. They proposed a model to predict the academic achievements of students using their behaviour data in the learning management system before the first exam. Their algorithm correctly identified 75% of students who failed to earn the grade of B or better needed to advance to the next course. Ahmad and Shahzadi ( 2018 ) predicted students at risk regarding academic performance with 85% accuracy by evaluating their study habits, learning skills, and academic interaction features. Cruz-Jesus et al. ( 2020 ) predicted students' end-of-semester grades with 16 independent variables. They concluded that students could be given the opportunity of early intervention.

As a result, students' academic performances were predicted using different predictors, different algorithms, and different approaches. The results confirm that machine learning algorithms can be used to predict students’ academic performance. More importantly, the prediction was made only with the parameters of midterm grade, faculty, and department. Teaching staff can benefit from the results of this research for the early recognition of students who have below- or above-average academic motivation. Then, for example, as Babić ( 2017 ) points out, they can match students with below-average academic motivation with students with above-average academic motivation and encourage them to work in groups or on project work. In this way, students' motivation can be improved, and their active participation in learning can be ensured. In addition, such data-driven studies should assist higher education in establishing a learning analytics framework and contribute to decision-making processes.

Future research can be conducted by including other parameters as input variables and adding other machine learning algorithms to the modelling process. In addition, it is necessary to harness the effectiveness of DM methods to investigate students' learning behaviors, address their problems, optimize the educational environment, and enable data-driven decision making.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

  • EDM: Educational data mining
  • RF: Random forests
  • NN: Neural networks
  • SVM: Support vector machines
  • LR: Logistic regression
  • NB: Naïve Bayes
  • kNN: K-nearest neighbour
  • DT: Decision trees
  • ANN: Artificial neural networks
  • ERT: Extremely randomized trees
  • RT: Regression trees
  • MLP: Multilayer perceptron neural network
  • FFNN: Feed-forward neural network
  • ARTMAP: Adaptive resonance theory mapping
  • LMS: Learning management systems
  • SIS: Student information systems
  • ITS: Intelligent teaching systems
  • CA: Classification accuracy
  • AUC: Area under ROC curve
  • TP: True positive
  • TN: True negative
  • FP: False positive
  • FN: False negative
  • ROC: Receiver operating characteristics

Ahmad, Z., & Shahzadi, E. (2018). Prediction of students’ academic performance using artificial neural network. Bulletin of Education and Research, 40 (3), 157–164.


Alshanqiti, A., & Namoun, A. (2020). Predicting student performance and its influential factors using hybrid regression and multi-label classification. IEEE Access, 8 , 203827–203844. https://doi.org/10.1109/access.2020.3036572


Arias Ortiz, E., & Dehon, C. (2013). Roads to success in the Belgian French Community’s higher education system: predictors of dropout and degree completion at the Université Libre de Bruxelles. Research in Higher Education, 54 (6), 693–723. https://doi.org/10.1007/s11162-013-9290-y

Asif, R., Merceron, A., Ali, S. A., & Haider, N. G. (2017). Analyzing undergraduate students’ performance using educational data mining. Computers and Education, 113 , 177–194. https://doi.org/10.1016/j.compedu.2017.05.007

Aydemir, B. (2017). Predicting academic success of vocational high school students using data mining methods graduate . [Unpublished master’s thesis]. Pamukkale University Institute of Science.

Babić, I. D. (2017). Machine learning methods in predicting the student academic motivation. Croatian Operational Research Review, 8 (2), 443–461. https://doi.org/10.17535/crorr.2017.0028

Baker, R. S., & Inventado, P. S. (2014). Educational data mining and learning analytics. Learning analytics (pp. 61–75). Springer.


Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1 (1), 3–17.

Bernacki, M. L., Chavez, M. M., & Uesbeck, P. M. (2020). Predicting achievement and providing support before STEM majors begin to fail. Computers & Education, 158 (August), 103999. https://doi.org/10.1016/j.compedu.2020.103999

Burgos, C., Campanario, M. L., De, D., Lara, J. A., Lizcano, D., & Martínez, M. A. (2018). Data mining for modeling students’ performance: A tutoring action plan to prevent academic dropout. Computers and Electrical Engineering, 66 (2018), 541–556. https://doi.org/10.1016/j.compeleceng.2017.03.005

Capuano, N., & Toti, D. (2019). Experimentation of a smart learning system for law based on knowledge discovery and cognitive computing. Computers in Human Behavior, 92 , 459–467. https://doi.org/10.1016/j.chb.2018.03.034

Casquero, O., Ovelar, R., Romo, J., Benito, M., & Alberdi, M. (2016). Students’ personal networks in virtual and personal learning environments: A case study in higher education using learning analytics approach. Interactive Learning Environments, 24 (1), 49–67. https://doi.org/10.1080/10494820.2013.817441

Chakraborty, B., Chakma, K., & Mukherjee, A. (2016). A density-based clustering algorithm and experiments on student dataset with noises using Rough set theory. In Proceedings of 2nd IEEE international conference on engineering and technology, ICETECH 2016 , March (pp. 431–436). https://doi.org/10.1109/ICETECH.2016.7569290

Costa-Mendes, R., Oliveira, T., Castelli, M., & Cruz-Jesus, F. (2020). A machine learning approximation of the 2015 Portuguese high school student grades: A hybrid approach. Education and Information Technologies, 26 , 1527–1547. https://doi.org/10.1007/s10639-020-10316-y

Cruz-Jesus, F., Castelli, M., Oliveira, T., Mendes, R., Nunes, C., Sa-Velho, M., & Rosa-Louro, A. (2020). Using artificial intelligence methods to assess academic achievement in public high schools of a European Union country. Heliyon . https://doi.org/10.1016/j.heliyon.2020.e04081

Delen, D. (2010). A comparative analysis of machine learning techniques for student retention management. Decision Support Systems, 49 (4), 498–506. https://doi.org/10.1016/j.dss.2010.06.003

Delen, D. (2011). Predicting student attrition with data mining methods. Journal of College Student Retention: Research, Theory and Practice, 13 (1), 17–35. https://doi.org/10.2190/CS.13.1.b

Fernandes, E., Holanda, M., Victorino, M., Borges, V., Carvalho, R., & Van Erven, G. (2019). Educational data mining : Predictive analysis of academic performance of public school students in the capital of Brazil. Journal of Business Research, 94 (February 2018), 335–343. https://doi.org/10.1016/j.jbusres.2018.02.012

Fidalgo-Blanco, Á., Sein-Echaluce, M. L., García-Peñalvo, F. J., & Conde, M. Á. (2015). Using Learning Analytics to improve teamwork assessment. Computers in Human Behavior, 47 , 149–156. https://doi.org/10.1016/j.chb.2014.11.050

García-González, J. D., & Skrita, A. (2019). Predicting academic performance based on students’ family environment: Evidence for Colombia using classification trees. Psychology, Society and Education, 11 (3), 299–311. https://doi.org/10.25115/psye.v11i3.2056

Gök, M. (2017). Predicting academic achievement with machine learning methods. Gazi University Journal of Science Part c: Design and Technology, 5 (3), 139–148.

Hardman, J., Paucar-Caceres, A., & Fielding, A. (2013). Predicting students’ progression in higher education by using the random forest algorithm. Systems Research and Behavioral Science, 30 (2), 194–203. https://doi.org/10.1002/sres.2130

Hellas, A., Ihantola, P., Petersen, A., Ajanovski, V.V., Gutica, M., Hynninen, T., Knutas, A., Leinonen, J., Messom, C., & Liao, S.N. (2018). Predicting academic performance: a systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education (pp. 175–199).

Hoffait, A., & Schyns, M. (2017). Early detection of university students with potential difficulties. Decision Support Systems, 101 (2017), 1–11. https://doi.org/10.1016/j.dss.2017.05.003

Huang, S., & Fang, N. (2013). Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models. Computers and Education, 61 (1), 133–145. https://doi.org/10.1016/j.compedu.2012.08.015

Kardaş, K., & Güvenir, A. (2020). Analysis of the effects of Quizzes, homeworks and projects on final exam with different machine learning techniques. EMO Journal of Scientific, 10 (1), 22–29.

Kaur, P., Singh, M., & Josan, G. S. (2015). Classification and prediction based data mining algorithms to predict slow learners in education sector. Procedia Computer Science, 57 , 500–508. https://doi.org/10.1016/j.procs.2015.07.372

Kılınç, Ç. (2015). Examining the effects on university student success by data mining techniques. [Unpublished master’s thesis]. Eskişehir Osmangazi University Institute of Science.

Kotsiantis, S., Tselios, N., Filippidi, A., & Komis, V. (2013). Using learning analytics to identify successful learners in a blended learning course. International Journal of Technology Enhanced Learning, 5 (2), 133–150. https://doi.org/10.1504/IJTEL.2013.059088

Lara, J. A., Lizcano, D., Martínez, M. A., Pazos, J., & Riera, T. (2014). A system for knowledge discovery in e-learning environments within the European Higher Education Area—Application to student data from Open University of Madrid, UDIMA. Computers and Education, 72 , 23–36. https://doi.org/10.1016/j.compedu.2013.10.009

Long, P., & Siemens, G. (2011). Penetrating the fog: Analytics in learning and education. Educause Review, 46 (5), 31–40.

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education, 54 (2), 588–599. https://doi.org/10.1016/j.compedu.2009.09.008

Musso, M. F., Hernández, C. F. R., & Cascallar, E. C. (2020). Predicting key educational outcomes in academic trajectories: A machine-learning approach. Higher Education, 80 (5), 875–894. https://doi.org/10.1007/s10734-020-00520-7

Nandeshwar, A., Menzies, T., & Nelson, A. (2011). Learning patterns of university student retention. Expert Systems with Applications, 38 (12), 14984–14996. https://doi.org/10.1016/j.eswa.2011.05.048

Ornelas, F., & Ordonez, C. (2017). Predicting student success: A naïve bayesian application to community college data. Technology, Knowledge and Learning, 22 (3), 299–315. https://doi.org/10.1007/s10758-017-9334-z

Ortiz, E. A., & Dehon, C. (2008). What are the factors of success at University? A case study in Belgium. Cesifo Economic Studies, 54 (2), 121–148. https://doi.org/10.1093/cesifo/ifn012

Rebai, S., Ben Yahia, F., & Essid, H. (2020). A graphically based machine learning approach to predict secondary schools performance in Tunisia. Socio-Economic Planning Sciences, 70 (August 2018), 100724. https://doi.org/10.1016/j.seps.2019.06.009

Rizvi, S., Rienties, B., & Ahmed, S. (2019). The role of demographics in online learning; A decision tree based approach. Computers & Education, 137 (August 2018), 32–47. https://doi.org/10.1016/j.compedu.2019.04.001

Rubin, B., Fernandes, R., Avgerinou, M. D., & Moore, J. (2010). The effect of learning management systems on student and faculty outcomes. The Internet and Higher Education, 13 (1–2), 82–83. https://doi.org/10.1016/j.iheduc.2009.10.008

Saqr, M., Fors, U., & Tedre, M. (2017). How learning analytics can early predict under-achieving students in a blended medical education course. Medical Teacher, 39 (7), 757–767. https://doi.org/10.1080/0142159X.2017.1309376

Shorfuzzaman, M., Hossain, M. S., Nazir, A., Muhammad, G., & Alamri, A. (2019). Harnessing the power of big data analytics in the cloud to support learning analytics in mobile learning environment. Computers in Human Behavior, 92 (February 2017), 578–588. https://doi.org/10.1016/j.chb.2018.07.002

Vandamme, J.-P., Meskens, N., & Superby, J.-F. (2007). Predicting academic performance by data mining methods. Education Economics, 15 (4), 405–419. https://doi.org/10.1080/09645290701409939

Viberg, O., Hatakka, M., Bälter, O., & Mavroudi, A. (2018). The current landscape of learning analytics in higher education. Computers in Human Behavior, 89 (July), 98–110. https://doi.org/10.1016/j.chb.2018.07.027

Waheed, H., Hassan, S. U., Aljohani, N. R., Hardman, J., Alelyani, S., & Nawaz, R. (2020). Predicting academic performance of students from VLE big data using deep learning models. Computers in Human Behavior, 104 (October 2019), 106189. https://doi.org/10.1016/j.chb.2019.106189

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining practical machine learning tools and techniques (3rd ed.). Morgan Kaufmann.

Xing, W., Guo, R., Petakovic, E., & Goggins, S. (2015). Participation-based student final performance prediction model through interpretable Genetic Programming: Integrating learning analytics, educational data mining and theory. Computers in Human Behavior, 47 , 168–181.

Xu, X., Wang, J., Peng, H., & Wu, R. (2019). Prediction of academic performance associated with internet usage behaviors using machine learning algorithms. Computers in Human Behavior, 98 (January), 166–173. https://doi.org/10.1016/j.chb.2019.04.015

Zabriskie, C., Yang, J., DeVore, S., & Stewart, J. (2019). Using machine learning to predict physics course outcomes. Physical Review Physics Education Research, 15 (2), 020120. https://doi.org/10.1103/PhysRevPhysEducRes.15.020120


Acknowledgements

Not applicable.

Author information

Authors and Affiliations

Kırşehir Ahi Evran University, Faculty of Engineering and Architecture, 40100, Kırşehir, Turkey

Mustafa Yağcı


Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mustafa Yağcı .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Yağcı, M. Educational data mining: prediction of students' academic performance using machine learning algorithms. Smart Learn. Environ. 9 , 11 (2022). https://doi.org/10.1186/s40561-022-00192-z


Received : 15 November 2021

Accepted : 15 February 2022

Published : 03 March 2022

DOI : https://doi.org/10.1186/s40561-022-00192-z

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Machine learning
  • Predicting achievement
  • Learning analytics
  • Early warning systems


Data Mining Project Ideas

Data mining is the process of analysing large volumes of data, which are usually unordered, in order to find relationships within them. To learn more about the process, read this overview of data mining in full.

  • Define Data Mining

Data mining involves exploring and analysing data in large volumes in order to find patterns, hidden correlations, and trends, and to build an understanding of the project. It uses statistical and computational techniques to collect information from big datasets and supports prediction, decision making, and the discovery of new knowledge in science, research, and business.

  • What is Data Mining?

This process is very helpful in identifying trends and patterns in a big dataset with the help of different algorithms and techniques. It is useful for analysing data, finding valuable information, and decoding complex or unstructured data sources.

  • Where Data Mining is used?

In this section we discuss the uses of the data mining process. It is applied in many areas and fields, some of which are listed here: marketing and business, education, scientific research, healthcare, environmental science, e-commerce, and finance.

  • Why Data Mining is proposed? Previous Technology Issues

Moving on to the next section, we discuss the reasons this technology was proposed and the challenges it faces. Data mining was proposed so that analysing and collecting data from large datasets becomes easier. The technology helps institutions and businesses make data-driven decisions, improve efficiency, and gain more knowledge about their data, leading to better results.

The challenges and issues faced by the earlier technologies of data mining include:

Scalability: Because earlier systems were limited in storage capacity and computational power, processing complex datasets was challenging.

Data Quality: Problems related to data quality, such as missing values, inconsistencies, and noise, make data mining difficult.

Complex Algorithms: Early data mining algorithms were computationally intensive and complex, which made them difficult to run efficiently.

Interpretability: Some of the models produced in data mining, such as deep learning models, are hard to interpret, so they could not be adopted in all fields.

Privacy Concerns: The privacy and security of sensitive data must be addressed, which leads to regulatory challenges.

  • Algorithms / Protocols

After covering the technology, its uses, and the issues faced in the earlier stages, we now turn to the algorithms used. The algorithms proposed for data mining to overcome the earlier issues are: “Distributed Adaptive Trust-based Authentication”, “Hybrid Gray Level Co-occurrence Matrix Fast Fourier Transform” (HGLCM-FFT), “Particle Swarm Optimized Symmetrical Blowfish” (PSOSB), and “Hierarchical Gradient Boosted Isolation Forest” (HGB-IF).

  • Comparative study / Analysis

The comparative study is done in order to find the algorithm best suited to overcoming the issues faced by earlier technologies. The previous method faced trust issues with cloud data. In the proposed work, separate algorithms are used for each process to overcome the trust issue. Techniques such as normalization, feature encoding, and dimensionality reduction are used in processing the data. For feature extraction, the “Hybrid Gray Level Co-occurrence Matrix Fast Fourier Transform” (HGLCM-FFT) is used. Information gain (IG), symmetric uncertainty, chi-squared, and gain ratio are used for feature selection. For increasing trust in cloud data, algorithms such as the Hierarchical Gradient Boosted Isolation Forest (HGB-IF) and the “Distributed Adaptive Trust-based Authentication” method are used. For data encryption, the “Particle Swarm Optimized Symmetrical Blowfish” (PSOSB) algorithm is used.
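The specialised algorithms named above (HGLCM-FFT, HGB-IF, PSOSB) are not off-the-shelf library components, but the generic preprocessing and feature-selection steps mentioned (normalization, feature encoding, and chi-squared based selection) can be illustrated with a small, hypothetical example; the columns and values below are made up for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical traffic records with numeric and categorical columns.
df = pd.DataFrame({
    "bytes_sent": [120, 3400, 80, 9000, 150, 7200],
    "duration":   [0.5, 12.0, 0.2, 30.0, 0.7, 25.0],
    "protocol":   ["tcp", "udp", "tcp", "icmp", "tcp", "udp"],
    "label":      [0, 1, 0, 1, 0, 1],
})

X = pd.get_dummies(df.drop(columns="label"), columns=["protocol"])  # feature encoding
X_norm = MinMaxScaler().fit_transform(X)                            # normalization to [0, 1]

# Chi-squared scoring; information gain or gain ratio would be scored analogously.
selector = SelectKBest(chi2, k=3).fit(X_norm, df["label"])
print(dict(zip(X.columns, selector.scores_.round(3))))
print("selected features:", list(X.columns[selector.get_support()]))
```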

  • Simulation results / Parameters

The approaches proposed above to overcome the issues faced by data mining are tested with different methodologies to analyse their performance. The comparison is done using metrics such as attack detection rate, CPU usage, decryption time, encryption time, false alarm rate, network usage, throughput, and true positive rate.

  • Dataset LINKS / Important URL

Here are some links that may be useful for gaining more knowledge about data mining:

  • https://www.hindawi.com/journals/wcmc/2022/7272405/
  • https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10162196
  • https://www.mdpi.com/2504-2289/6/4/101
  • https://www.mdpi.com/2079-9292/12/11/2427
  • https://www.mdpi.com/1999-5903/14/12/354
  • Data Mining Applications

In this next section we discuss the applications of data mining. This technology has been employed in many industries, some of which are listed here: astronomy, customer relationship management (CRM), crime analysis, education, environmental monitoring, fraud detection, telecommunications, and supply chain management.

In this study, topology refers to the organization or structuring of data. There are many types of topology used to structure data, each suited to a particular context; some of them are listed here: graph topology, geometric topology, network topology, spatial topology, sensor network topology, textual topology, temporal topology, and topological data analysis (TDA).

  • Environment

The process of data mining can be carried out in several tools and environments, such as SAS or R, programming languages such as Python, specialized data mining tools such as RapidMiner, cloud services such as AWS, and data platforms such as Spark and Hadoop. It can also draw on other tooling, such as business intelligence tools, spatial data tools, and text mining tools, depending on the specific requirements. The environment in which these tools function best depends on factors such as complexity, the type of analysis, and data volume.

  • Simulation Tools

Here we describe the simulation setup for data mining, which is built with Python (version 3.11.4) to enhance performance.

After going through this overview of data mining, you can use the information provided to clarify your doubts about the technology, its applications, its different topologies, the algorithms it follows, as well as its limitations and how they can be overcome.

Data Mining Project Ideas & Topics

  • The Significance of using Data Extraction Methods for an Effective Big Data Mining Process
  • Application of Data Mining Technology in Financial Data Analysis Methods under the Background of Big Data
  • Big Data Mining Algorithm of Internet of Things Based on Artificial Intelligence Technology
  • Research on The Transformation and Development of K9 Education and Training Institutions under Xuzhou Double Reduction Policy based on Data Mining Technology
  • Exploring Research Opportunities to Apply Data Mining Techniques in Software Engineering Lifecycle
  • Effective Multi-Data-Set Kernel Culture System Development in Data Mining
  • Research on Medical Big Data Mining and Intelligent Analysis for Smart Healthcare
  • An Exploration of an Operational Multi-Data-Set Kernel Culture Scheme for Practice in Data Mining
  • Research on Multi-XCTDs Measurement Information Receiving and Data Mining System
  • Predictive maintenance project implementation based on data-driven & data mining
  • A Novel Data Mining Algorithm for Power Marketing Information
  • Design of Analysis Platform for College Students’ Physical Learning Effect Based on Data Mining Algorithm
  • Boosted Hybrid Privacy Preserving Data Mining (BHPPDM) Technique to Increase Privacy and Accuracy
  • Extracting Behavioral Characteristics of College Students Using Data Mining on Big Data
  • Construction of scientific and technological innovation enterprise management information system under big data mining technology
  • Design of TCM Research Demand System Based on Data Mining Technology
  • Analysis of K-means and K-DBSCAN Commonly Used in Data Mining
  • Data Mining of Prescription Rules for Six Basic Diseases of Mongolian Medicine Based on Decision Tree
  • Detection of Behavioral Patterns of Viral Hepatitis Using Data Mining
  • Teaching Resource Sharing System in OBE Mode Based on Data Mining Technology
  • Machine Learning based Data Mining for Detection of Credit Card Frauds
  • Digit-DM: A Sustainable Data Mining Modell for Continuous Digitization in Manufacturing
  • Digitization of Emergency Monitoring Processes and Data Mining
  • Public Comment Analysis Model of Network Media Based on Big Data Mining and Implementation Plans
  • The application of data mining techniques for predicting education to new undergraduate students at Chiang Mai Rajabhat University
  • A Multi-Label Classification Method Based On Textual Data Mining
  • Implementation of Railway Accident Judgment Criteria Optimization Based on Data Mining and Digital Programming Technology
  • Waste Miner: An Efficient Waste Collection System for Smart Cities Leveraging IoT and Data Mining Technique
  • A Review of Time Series Data Mining Methods Based on Cluster Analysis
  • Application of Data Mining Technology in the Analysis of CET-4 Scores
  • A Method of Filling Missing Values in Data using Data Mining
  • Predicting Student’s Academic Performance Using Data Mining Methods: Review Paper
  • Application of Machine Learning in Data Mining under the Background of Big Data
  • Hybrid Clustering Techniques for Optimizing Online Datasets Using Data Mining Techniques
  • Remote monitoring method of deep foundation pit operation equipment based on AIOT technology and data mining
  • Research and Practice of Enterprise Education Mode in Universities Based on Data Mining
  • Vehicle Trajectory Data Mining for Artificial Intelligence and Real-Time Traffic Information Extraction
  • A DAG-NOTEARS-based Data Mining Method for Faulty Samples
  • Research on Personalized Recommendation Algorithm of Tourism E-commerce Platform Products Based on Data Mining
  • Review of Data Mining Techniques in Performance Prediction for Medical Schools
  • English pronunciation quality evaluation system based on data mining algorithm
  • Detection of Early Fault in Power Electronic Converters through Machine Learning and Data Mining Techniques
  • Brain-like Intelligent Data Mining Mechanism Based on Convolutional Neural Network
  • Implementation Data Mining with the Naive Bayes Classifier Algorithm in Determining the Type of Stroke
  • Improve Data Mining Performance by Noise Redistribution: A Mixed Integer Programming Formulation
  • Enhancing the detection of fraudulent activities in the distribution of energy through data mining algorithms
  • An Analysis of Cancer Data Sets Utilizing Data Mining
  • Optimization techniques for preserving privacy in data mining
  • Multiple Agents based Disaster Prediction for Public Environments using Data Mining Techniques

  • Int J Environ Res Public Health

Data Mining in Healthcare: Applying Strategic Intelligence Techniques to Depict 25 Years of Research Development

Maikel Luis Kolling

1 Graduate Program of Industrial Systems and Processes, University of Santa Cruz do Sul, Santa Cruz do Sul 96816-501, Brazil; [email protected] (M.L.K.); [email protected] (M.K.S.)

Leonardo B. Furstenau

2 Department of Industrial Engineering, Federal University of Rio Grande do Sul, Porto Alegre 90035-190, Brazil

Michele Kremer Sott

Bruna Rabaioli

3 Department of Medicine, University of Santa Cruz do Sul, Santa Cruz do Sul 96816-501, Brazil

Pedro Henrique Ulmi

4 Department of Computer Science, University of Santa Cruz do Sul, Santa Cruz do Sul 96816-501, Brazil; [email protected]

Nicola Luigi Bragazzi

5 Laboratory for Industrial and Applied Mathematics (LIAM), Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada

Leonel Pablo Carvalho Tedesco

Associated Data

Not applicable.

In order to identify the strategic topics and the thematic evolution structure of data mining applied to healthcare, a bibliometric performance and network analysis (BPNA) was conducted in this paper. For this purpose, 6138 articles were sourced from the Web of Science covering the period from 1995 to July 2020, and the SciMAT software was used. Our results present a strategic diagram composed of 19 themes, of which the 8 motor themes (‘NEURAL-NETWORKS’, ‘CANCER’, ‘ELECTRONIC-HEALTH-RECORDS’, ‘DIABETES-MELLITUS’, ‘ALZHEIMER’S-DISEASE’, ‘BREAST-CANCER’, ‘DEPRESSION’, and ‘RANDOM-FOREST’) are depicted in a thematic network. An in-depth analysis was carried out in order to find hidden patterns and to provide a general perspective of the field. The thematic network structure is arranged such that its subjects are organized into two different areas, (i) practices and techniques related to data mining in healthcare, and (ii) health concepts and diseases supported by data mining, embodying, respectively, the hotspots related to the data mining and medical scopes, hence demonstrating the field’s evolution over time. These results can form the basis for future research and facilitate decision-making by researchers, practitioners, institutions, and governments interested in data mining in healthcare.

1. Introduction

Deriving from Industry 4.0, which pursues the expansion of autonomy and efficiency through data-driven automatization and artificial intelligence employing cyber-physical spaces, Healthcare 4.0 portrays the overhaul of medical business models towards data-driven management [ 1 ]. In such environments, substantial amounts of information associated with organizational processes and patient care are generated. Furthermore, the maturation of state-of-the-art technologies, namely wearable devices, which are likely to transform the whole industry through more personalized and proactive treatments, will lead to a noteworthy increase in patient data. Moreover, annual global growth in healthcare data is forecast to soon exceed 1.2 exabytes a year [ 1 ]. Despite the massive and growing volume of health and patient care information [ 2 ], it is still, to a great extent, underused [ 3 ].

Data mining, a subfield of artificial intelligence that makes use of vast amounts of data in order to allow significant information to be extracted through previously unknown patterns, has been progressively applied in healthcare to assist clinical diagnoses and disease predictions [ 2 ]. This information is known to be rather complex and difficult to analyze. Furthermore, data mining concepts can also perform the analysis and classification of colossal bulks of information, grouping variables with similar behaviors and foreseeing future events, among other advantages for monitoring and managing health systems, while ceaselessly seeking to protect patients’ privacy [ 4 ]. The knowledge resulting from the application of the aforesaid methods may potentially improve resource management and patient care systems and assist in infection control and risk stratification [ 5 ]. Several studies in healthcare have explored data mining techniques to predict incidence [ 6 ] and patient characteristics in pandemic scenarios [ 7 ], to identify depressive symptoms [ 8 ], and to predict diabetes [ 9 ], cancer [ 10 ], and scenarios in emergency departments [ 11 ], among others. Thus, the utilization of data mining in health organizations improves the efficiency of service provision [ 12 ] and the quality of decision-making, and reduces human subjectivity and errors [ 13 ].

The understanding of data mining in the healthcare sector is, in this context, vital, and some researchers have executed bibliometric analyses in the field with the intention of investigating the challenges, limitations, novel opportunities, and trends [ 14 , 15 , 16 , 17 ]. However, at the time of this study, there were no published works that provided a complete analysis of the field using a bibliometric performance and network analysis (BPNA) (see Table 1 ). In light of this, we have defined three research questions:

  • RQ1: What are the strategic themes of data mining in healthcare?
  • RQ2: What is the thematic evolution structure of data mining in healthcare?
  • RQ3: What are the trends and opportunities of data mining in healthcare for academics and practitioners?

Existing bibliometric analyses of data mining in healthcare in the Web of Science (WoS).

Thus, with the objective of laying out a superior understanding of data mining usage in the healthcare sector and answering the defined research questions, we have performed a bibliometric performance and network analysis (BPNA) to set forth an overview of the area. We used the Science Mapping Analysis Software Tool (SciMAT), a software developed by Cobo et al. [ 18 ] with the purpose of identifying strategic themes and the thematic evolution structure of a given field, which can be used as a strategic intelligence tool. Strategic intelligence, an approach that can enhance decision-making in terms of science and technology trends [ 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 ], can help researchers and practitioners to understand the area and devise new ideas for future works, as well as to identify the trends and opportunities of data mining in healthcare.

This research is structured as follows: Section 2 highlights the methodology and the dataset. Section 3 presents the bibliometric performance of data mining in healthcare. In Section 4 , the strategic diagram presents the most relevant themes according to our bibliometric indicators as well as the thematic network structure of the motor themes and the thematic evolution structure, which provide a complete overview of data mining over time. Section 5 presents the conclusions, limitations, and suggestions for future works.

2. Methodology and Dataset

Attracting attention from companies, universities, and scientific journals, bibliometric analysis enhances decision-making by providing a reliable method to collect information from databases, to transform the aforementioned data into knowledge, and to stimulate wisdom development. Furthermore, the techniques of bibliometric analysis can provide broader and different perspectives on scientific production by using advanced measurement tools and methods to depict how authors, works, journals, and institutions are advancing in a specific field of research through the hidden patterns that are embedded in large datasets.

The existing works on bibliometric analysis of data mining in healthcare in the Web of Science are shown in Table 1 , which shows that only three such studies have been performed and explains how these approaches differ from this work.

2.1. Methodology

For this study we have applied BPNA, a method that combines science mapping with performance analysis, to the field of data mining in healthcare with the support of the SciMAT software. This methodology has been chosen in view of the fact that such a combination, in addition to assisting decision-making for academics and practitioners, allows us to perform a deep investigation into the field of research by giving a new perspective of its intricacies. The BPNA conducted in this paper was composed of four steps outlined below.

2.1.1. Discovery of Research Themes

The themes were identified using a frequency and network reduction of keywords. In this process, the keywords were first normalized using Salton’s cosine, a correlation coefficient, and then clustered through the simple center algorithm. Finally, the co-word network of the thematic evolution structure was normalized using the equivalence index.
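
To make these normalization steps concrete, the following minimal Python sketch computes Salton’s cosine and the equivalence index from per-document keyword sets. It only illustrates the two formulas; it is not the SciMAT implementation, and the document keyword sets are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-document keyword sets (stand-ins for the preprocessed WoS records).
documents = [
    {"data mining", "cancer", "neural networks"},
    {"data mining", "neural networks", "feature selection"},
    {"cancer", "gene expression"},
]

occurrence = Counter()       # c_i: number of documents containing keyword i
co_occurrence = Counter()    # c_ij: number of documents containing both i and j
for keywords in documents:
    occurrence.update(keywords)
    co_occurrence.update(frozenset(pair) for pair in combinations(sorted(keywords), 2))

def salton_cosine(a, b):
    """Salton's cosine: c_ij / sqrt(c_i * c_j), used to normalize the co-word network."""
    return co_occurrence[frozenset((a, b))] / (occurrence[a] * occurrence[b]) ** 0.5

def equivalence_index(a, b):
    """Equivalence index: c_ij^2 / (c_i * c_j), used for the thematic evolution map."""
    return co_occurrence[frozenset((a, b))] ** 2 / (occurrence[a] * occurrence[b])

print(salton_cosine("data mining", "neural networks"))     # 2 / sqrt(2 * 2) = 1.0
print(equivalence_index("data mining", "neural networks"))  # 2^2 / (2 * 2) = 1.0
```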

2.1.2. Depicting Research Themes

The previously identified themes were then plotted on a bi-dimensional diagram composed of four quadrants, in which the “vertical axis” characterizes the density (D) and the “horizontal axis” characterizes the centrality (C) of the theme [ 28 , 29 ] ( Figure 1 a) [ 18 , 20 , 25 , 30 , 31 , 32 , 33 ].

Figure 1. Strategic diagram (a). Thematic network structure (b). Thematic evolution structure (c).

  • (a) First quadrant—motor themes: trending themes for the field of research with high development.
  • (b) Second quadrant—basic and transversal themes: themes that are inclined to become motor themes in the future due to their high centrality.
  • (c) Third quadrant—emerging or declining themes: themes that require a qualitative analysis to define whether they are emerging or declining.
  • (d) Fourth quadrant—highly developed and isolated themes: themes that are no longer trending due to a new concept or technology.
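
As a rough illustration of how themes are placed into these quadrants, the sketch below classifies a handful of themes by comparing their centrality and density against the medians of the set. The scores are invented for illustration and are not the values reported in this paper.

```python
from statistics import median

# Hypothetical (centrality, density) scores; the real values come from the co-word network.
themes = {
    "NEURAL-NETWORKS": (92.0, 88.0),
    "CANCER": (80.0, 75.0),
    "CORONARY-ARTERY-DISEASE": (70.0, 20.0),
    "CLOUD-COMPUTING": (15.0, 18.0),
    "METABOLOMICS": (12.0, 81.0),
}

c_med = median(c for c, _ in themes.values())
d_med = median(d for _, d in themes.values())

def quadrant(centrality, density):
    # Position in the strategic diagram relative to the median axes.
    if centrality >= c_med and density >= d_med:
        return "Q1: motor theme"
    if centrality >= c_med:
        return "Q2: basic and transversal theme"
    if density >= d_med:
        return "Q4: highly developed and isolated theme"
    return "Q3: emerging or declining theme"

for name, (c, d) in themes.items():
    print(f"{name:>25}: {quadrant(c, d)}")
```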

2.1.3. Thematic Network Structure and Detection of Thematic Areas

The results were organized and structured in (a) a strategic diagram, (b) a thematic network structure of motor themes, and (c) a thematic evolution structure. The thematic network structure ( Figure 1 b) represents the co-occurrence between the research themes and underlines the number of relationships (C) and the internal strength among them (D). The thematic evolution structure ( Figure 1 c) provides a proper picture of how the themes preserve a conceptual nexus throughout the following sub-periods [ 23 , 34 ]. The size of the clusters is proportional to the number of core documents, and the links indicate co-occurrence among the clusters. Solid lines indicate that clusters share the main theme, and dashed lines represent shared cluster elements that are not the name of the themes [ 35 ]. The thickness of the lines is proportional to the inclusion index, which indicates that the themes have elements in common [ 35 ]. Furthermore, in the thematic network structure the themes were manually classified into data mining techniques and medical research concepts.
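
The inclusion index that weights these links can be written as |A ∩ B| / min(|A|, |B|) for two keyword clusters A and B. A small sketch with hypothetical clusters follows; it is illustrative only.

```python
def inclusion_index(theme_a, theme_b):
    """Inclusion index between two keyword clusters: |A ∩ B| / min(|A|, |B|)."""
    return len(theme_a & theme_b) / min(len(theme_a), len(theme_b))

# Hypothetical keyword clusters from two consecutive sub-periods.
knowledge_discovery = {"knowledge discovery", "classification", "prediction", "neural network"}
machine_learning = {"machine learning", "classification", "prediction", "feature selection"}

# 0.5 here: half of the smaller cluster is shared, which would set the link thickness.
print(inclusion_index(knowledge_discovery, machine_learning))
```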

2.1.4. Performance Analysis

The scientific contribution was measured by analyzing the most important research themes and thematic areas using the h-index, sum of citations, core documents, centrality, density, and nexus among themes. The results can be used as a strategic intelligence approach to identify the most relevant topics in the research field.
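
Of these indicators, the h-index is the only one that requires a small computation: a theme (or author) has index h if h of its documents have at least h citations each. A minimal sketch with made-up citation counts:

```python
def h_index(citations):
    """h-index: the largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Hypothetical citation counts for the core documents of one theme.
print(h_index([48, 33, 12, 9, 7, 3, 1]))  # prints 5
```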

2.2. Dataset

Composed of 6138 non-duplicated articles and reviews in the English language, the dataset used in this work was sourced from the Web of Science (WoS) database utilizing the following query string: (“data mining” and (“health*” OR “clinic*” OR “medic*” OR “disease”)). The documents were then processed and had their keywords, both the authors’ and the index controlled and uncontrolled terms, extracted and grouped in accordance with their meaning. In order to remove duplicates and terms which had fewer than two occurrences in the documents, a preprocessing step was applied to the authors, years, publication dates, and keywords. For instance, the preprocessing reduced the total number of keywords from 21,838 to 5310, thus improving the clarity of the bibliometric analysis. With the exception of the strategic diagram, which was plotted utilizing a single period (1995–July 2020), in this study the timeline was divided into three sub-periods: 1995–2003, 2004–2012, and 2013–July 2020.
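
A minimal pandas sketch of this kind of preprocessing, using invented records and a toy synonym mapping rather than the actual WoS export, might look as follows.

```python
import pandas as pd

# Hypothetical extracted records: one row per document with a raw keyword list.
records = pd.DataFrame({
    "year": [1997, 2008, 2015, 2019],
    "keywords": [
        ["DATA MINING", "NEURAL NETWORKS", "Neural-Networks"],
        ["data mining", "EHR", "electronic health records"],
        ["machine learning", "ehr"],
        ["machine learning", "depression"],
    ],
})

# Group spelling variants under a single label (a tiny stand-in for the manual grouping step).
synonyms = {"neural-networks": "neural networks", "electronic health records": "ehr"}
normalize = lambda kw: synonyms.get(kw.lower(), kw.lower())
records["keywords"] = records["keywords"].apply(lambda kws: sorted({normalize(k) for k in kws}))

# Drop keywords with fewer than two occurrences across the corpus.
counts = records["keywords"].explode().value_counts()
frequent = set(counts[counts >= 2].index)
records["keywords"] = records["keywords"].apply(lambda kws: [k for k in kws if k in frequent])

# Assign each document to one of the three sub-periods used in the study.
bins, labels = [1994, 2003, 2012, 2020], ["1995-2003", "2004-2012", "2013-2020"]
records["sub_period"] = pd.cut(records["year"], bins=bins, labels=labels)
print(records)
```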

Subsequently, a network reduction was applied in order to exclude irrelevant words and co-occurrences. The network was extracted on the basis of co-occurrence among words. For the mapping process, we used the simple center algorithm. Finally, a core mapper was used, and the h-index and sum of citations were selected. Figure 2 shows the steps of the BPNA.

Figure 2. Workflow of the bibliometric performance and network analysis (BPNA).

3. Bibliometric Performance of Data Mining in Healthcare

In this section, we measured the performance of the field of data mining in healthcare in terms of publications and citations over time, the most productive and cited researchers, as well as productivity of scientific journals, institutions, countries, and most important research areas in the WoS. To do this, we used indicators such as: number of publications, sum of citations by year, journal impact factor (JIF), geographic distribution of publications, and research field. For this, we examined the complete period (1995 to July 2020).

3.1. Publications and Citations over Time

Figure 3 shows the performance analysis of publications and citations of data mining in healthcare over time, from 1995 to July 2020, in the WoS. The first sub-period (1995–2003) shows the beginning of the research field, with 316 documents and a total of 13,483 citations. The first article in the WoS was published by Szolovits (1995) [ 36 ], who presented a tutorial for handling uncertainty in healthcare and highlighted the importance of developing data mining techniques to assist the healthcare sector. This sub-period shows a slightly increasing number of citations until 2003, with 2002 being the year with the highest number of citations.

Figure 3. Number of publications over time (1995–July 2020).

The slight increase continues from the first sub-period into the second sub-period (2004–2012), which accumulated a total of 1572 publications and 55,734 citations. The year 2006 presents the highest number of citations, mainly due to the study of Fawcett [ 37 ], which attracted 7762 citations. The author introduced the concept of Receiver Operating Characteristic (ROC) analysis, a technique widely used in data mining to assist medical decision-making.
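
For readers unfamiliar with ROC analysis, the hedged sketch below shows how an ROC curve and its area (AUC) are typically computed with scikit-learn on a public diagnostic dataset; it is purely illustrative and unrelated to the bibliometric data of this study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Public diagnostic dataset, used here purely for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # operating points of the ROC curve
print("AUC:", roc_auc_score(y_test, scores))      # area under the ROC curve
```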

From the second to the third sub-period, it is possible to observe a huge increase in the number of publications (4250 publications) and 41,821 citations. This marked increase may have occurred due to the creation of strategies to implement emerging technologies in the healthcare sector in order to move forward with the third digital revolution in healthcare, the so-called Healthcare 4.0 [ 1 , 38 ]. Furthermore, although the citations show a positive trend overall, it is still possible to observe a downward movement from 2014 to 2020. This may happen, as Wang [ 39 ] highlights, because a scientific document needs three to seven years to reach its peak point of citation [ 34 ]. Therefore, this downward movement is likely not a real trend.

3.2. Most Productive and Cited Authors

Table 2 displays the most productive and most cited authors on data mining in healthcare in the WoS from 1995 to July 2020. Leading as the most productive researcher in the field is Li, Chien-Feng, a pathologist at Chi Mei Hospital, the sixth-ranked institution in publication numbers, who dedicates his studies to the molecular diagnosis of cancer with innovative technologies. In the sequence, Acharya, U. Rajendra, ranked in the top 1% of highly cited researchers in computer science for five consecutive years (2016, 2017, 2018, 2019, and 2020) according to Thomson’s essential science indicators, shares second place with Chung, Kyungyong from the Division of Engineering and Computer Science at Kyonggi University in Suwon-si, South Korea. On the other hand, Bate, Andrew C., a member of the Food and Drug Administration (FDA) Science Council of Pharmacovigilance Subcommittee, the FDA being the fourth-ranked institution in publication count, is the most cited researcher with 945 citations. Subsequently, Lindquist, Marie, who monitors global pharmacovigilance and data management development at the World Health Organization (WHO), is ranked second with 943 citations. Finally, Edwards, E.R., an orthopedic surgeon at the Royal Australasian College of Surgeons, is ranked third with 888 citations. Notably, this study does not demonstrate a direct correlation between the number of publications and the number of citations.

Most Cited/Productive authors from 1995 to July 2020.

3.3. Productivity of Scientific Journals, Universities, Countries and Most Important Research Fields

Table 3 shows the journals that publish studies related to data mining in healthcare. PLOS One is ranked first with 124 publications, followed by Expert Systems with Applications with 105, and Artificial Intelligence in Medicine with 75. Among these, Expert Systems with Applications had the highest Journal Impact Factor (JIF) in 2019–2020.

Journals that publish studies related to data mining in healthcare.

Table 4 shows the most productive institutions and the most productive countries. The first ranked is Columbia University, followed by U.S. FDA Registration and Harvard University. In terms of country productivity, the United States is ranked first, followed by China and England. In comparison with Table 2 , it is possible to notice that the most productive author is not affiliated with the most productive institutions (Columbia University and U.S. FDA Registration). Besides, the institution with the highest number of publications is located in the United States, which is also the most productive country.

Institutions and countries that publish studies related to data mining in healthcare.

Regarding Columbia University, it is possible to verify its prominence in data mining in healthcare through its advanced data science programs, which are among the most highly rated and advanced in the world. We highlight the Columbia Data Science Society, an interdisciplinary society that promotes data science at Columbia University and in the New York City community.

The U.S. FDA Registration has a data mining council to promote the prioritization and governance of data mining initiatives within the Center for Biological Research and Evaluation to assess spontaneous reports of adverse events after the administration of regulated medical products. In addition, they created an Advanced and Standards-Based Network Analyzer for Clinical Assessment and Evaluation (PANACEA), which supports the application of standards recognition and network analysis for reporting these adverse events. It is noteworthy that the FDA Adverse Events Reporting System (FAERS) database is the main resource that identifies adverse reactions in medications marketed in the United States. A text mining system based on EHR that retrieves important clinical and temporal information is also highlighted along with support for the Cancer Prevention and Control Division at the Centers for Disease Control and Prevention in a big data project.

Harvard University offers online data mining courses and has a Center for Healthcare Data Analytics, created out of the need to analyze large public and private data sets. Harvard research includes funding and providing healthcare, quality of care, studies on special and disadvantaged populations, and access to care.

Table 5 presents the most important WoS subject research fields of data mining in healthcare from 1995 to July 2020. Computer Science Artificial Intelligence is the first ranked with 768 documents, followed by Medical Informatics with 744 documents, and Computer Science Information Systems with 722 documents.

Most relevant WoS subject categories and research fields.

4. Science Mapping Analysis of Data Mining in Healthcare

In this section the science mapping analysis of data mining in healthcare is depicted. The strategic diagram shows the most relevant themes in terms of centrality and density. The thematic network structure uncovers the relationship (co-occurrence) between themes and hidden patterns. Lastly, the thematic evolution structure underlines the most important themes of each sub-period and shows how the field of study is evolving over time.

4.1. Strategic Diagram Analysis

Figure 4 presents 19 clusters, 8 of which are categorized as motor themes (‘NEURAL-NETWORKS’, ‘CANCER’, ‘ELECTRONIC-HEALTH-RECORDS’, ‘DIABETES-MELLITUS’, ‘ADVERSE-DRUG-EVENTS’, ‘BREAST-CANCER’, ‘DEPRESSION’ and ‘RANDOM-FOREST’), 2 as basic and transversal themes (‘CORONARY-ARTERY-DISEASE’ and ‘PHOSPHORYLATION’), 7 as emerging or declining themes (‘PERSONALIZED-MEDICINE’, ‘DATA-INTEGRATION’, ‘INTENSIVE-CARE-UNIT’, ‘CLUSTER-ANALYSIS’, ‘INFORMATION-EXTRACTION’, ‘CLOUD-COMPUTING’ and ‘SENSORS’), and 2 as highly developed and isolated themes (‘ALZHEIMERS-DISEASE’ and ‘METABOLOMICS’).

Figure 4. Strategic diagram of data mining in healthcare (1995–July 2020).

Each cluster of themes was measured in terms of core documents, h-index, citations, centrality, and density. The cluster ‘NEURAL-NETWORKS’ has the highest number of core documents (336) and is ranked first in terms of centrality and density. On the other hand, the cluster ‘CANCER’ is the most widely cited with 5810 citations.

4.2. Thematic Network Structure Analysis of Motor Themes

The motor themes play an important role regarding the shape and future of the research field because they correspond to the key topics for everyone interested in the subject. Therefore, they can be considered strategic themes for developing the field of data mining in healthcare. The eight motor themes are discussed below and displayed in Figure 5 together with the network structure of each theme.

Figure 5. Thematic network structure of data mining in healthcare (1995–July 2020). (a) The cluster ‘NEURAL-NETWORKS’. (b) The cluster ‘CANCER’. (c) The cluster ‘ELECTRONIC-HEALTH-RECORDS’. (d) The cluster ‘DIABETES-MELLITUS’. (e) The cluster ‘BREAST-CANCER’. (f) The cluster ‘ALZHEIMER’S DISEASE’. (g) The cluster ‘DEPRESSION’. (h) The cluster ‘RANDOM-FOREST’.

4.2.1. Neural Network (a)

The cluster ‘NEURAL-NETWORKS’ ( Figure 5 a) is ranked first in terms of core documents, h-index, centrality, and density. The ‘NEURAL-NETWORKS’ cluster is strongly influenced by subthemes related to data science algorithms, such as ‘SUPPORT-VECTOR-MACHINE’ and ‘DECISION-TREE’, among others. This network represents the use of data mining techniques to detect patterns and find important information correlated with patient health and medical diagnosis. A reasonable explanation for this network might be the high number of studies which benchmark neural networks against other techniques to evaluate performance (e.g., resource usage, efficiency, accuracy, scalability, etc.) [ 40 , 41 , 42 ]. Besides, the significant size of the cluster ‘MACHINE-LEARNING’ is expected, since neural networks are a type of machine learning. On the other hand, the subtheme ‘HEART-DISEASE’ stands out as the only disease in this network, which can be justified by the large number of studies aiming to apply data mining to support decision-making in heart disease treatment and diagnosis.

4.2.2. Cancer (b)

The cluster ‘CANCER’ ( Figure 5 b) is ranked second in terms of core documents, h-index, and density, and first in terms of citations (5810). This cluster is highly influenced by subthemes related to studies of cancer gene mutations, such as ‘BIOMARKERS’ and ‘GENE-EXPRESSION’, among others. The use of data mining techniques has been attracting attention and effort from academics in order to help solve problems in the field of oncology. Cancer is the disease that kills the most people in the 21st century, driven by factors such as environmental pollution, food pesticides and additives [ 14 ], eating habits, and mental health, among others. Thus, controlling any form of cancer is a global strategy and can be enhanced by applying data mining techniques. Furthermore, the subtheme ‘PROSTATE-CANCER’ highlights that most data mining efforts in this cluster focus on prostate cancer studies. Prostate cancer is the most common cancer in men. Despite the benefits of traditional clinical screening exams (digital rectal examination, the prostate-specific antigen blood test, and transrectal ultrasound), such tests still lack efficacy in reducing mortality [ 43 ]. In this sense, data mining may be a suitable solution, since it has been used in bioinformatics analyses to understand prostate cancer mutation [ 44 , 45 ] and to uncover useful information that can be used for diagnoses and future prognostic tests which enhance both patient and clinical decision-making [ 46 ].

4.2.3. Electronic Health Records (EHR—c)

The cluster ‘ELECTRONIC-HEALTH-RECORDS’ ( Figure 5 c) represents the concept under which patients’ health data are stored. Such data are continuously increasing over time, thereby creating a large amount of data (big data) which has been used as input (EHR) for healthcare decision support systems to enhance clinical decision-making. The clusters ‘NATURAL-LANGUAGE-PROCESSING’ and ‘TEXT MINING’ highlight that these mining techniques are the ones most frequently combined with data mining in healthcare. Another pattern that must be highlighted is the considerable density between the clusters ‘SIGNAL-DETECTION’ and ‘PHARMACOVIGILANCE’, which represents the use of data mining to depict a broad range of adverse drug effects and to identify signals almost in real time by using EHR [ 47 , 48 ]. Besides, the cluster ‘MISSING-DATA’ is related to studies focused on the challenge of incomplete EHR and missing data in healthcare centers, which compromise the performance of several prediction models [ 49 ]. In this sense, techniques to handle missing data have been under improvement in order to move forward with accurate prediction based on medical data mining applications [ 50 ].

4.2.4. Diabetes Mellitus (DM—d)

Nowadays, DM is one of the most frequent endocrine disorders [ 51 ]; it affected more than 450 million people worldwide in 2017 and is expected to grow to 693 million by the year 2045, with 850 billion dollars spent by the health sector in 2017 alone [ 52 ]. The cluster ‘DIABETES-MELLITUS’ ( Figure 5 d) has a strong association with the risk factor subtheme group (e.g., ‘INSULIN-RESISTANCE’, ‘OBESITY’, ‘BODY-MASS-INDEX’, ‘CARDIOVASCULAR-DISEASE’, and ‘HYPERTENSION’). However, obesity (cluster ‘OBESITY’) is the major risk factor related to DM, particularly in Type 2 Diabetes (T2D) [ 51 ]. T2D accounts for about 90% of diabetic patients worldwide when compared with T1D and T3D and is mainly characterized by insulin resistance [ 51 ]. Thus, this might justify the presence of the clusters ‘TYPE-2-DIABETES’ and ‘INSULIN-RESISTANCE’, which seem to be highly developed by data mining academics and practitioners. The massive amount of research into all facets of DM has led to the formation of huge volumes of EHR, on which the most frequently applied data mining technique is association rules, used to identify associations among risk factors [ 51 ], thus justifying the appearance of the cluster ‘ASSOCIATION-RULES’.
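
As a rough illustration of the association rules technique mentioned above, the sketch below mines rules from a tiny, invented table of one-hot encoded patient risk factors using the mlxtend library; it is not the method of any study cited here, and the records are hypothetical.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot encoded patient records (risk factors and conditions per patient).
records = pd.DataFrame(
    [
        {"obesity": 1, "hypertension": 1, "type_2_diabetes": 1, "insulin_resistance": 1},
        {"obesity": 1, "hypertension": 0, "type_2_diabetes": 1, "insulin_resistance": 1},
        {"obesity": 0, "hypertension": 1, "type_2_diabetes": 0, "insulin_resistance": 0},
        {"obesity": 1, "hypertension": 1, "type_2_diabetes": 1, "insulin_resistance": 0},
        {"obesity": 0, "hypertension": 0, "type_2_diabetes": 0, "insulin_resistance": 0},
    ]
).astype(bool)

# Frequent itemsets and rules linking risk factors to diabetes (e.g., {obesity} -> {type_2_diabetes}).
itemsets = apriori(records, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```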

4.2.5. Breast Cancer (e)

The cluster ‘BREAST-CANCER’ ( Figure 5 e) represents the most prevalent type of cancer, affecting approximately 12.5% of women worldwide [ 53 , 54 ]. The clusters ‘OVEREXPRESSION’ and ‘METASTASIS’ highlight the high number of studies using data mining to understand the association of the overexpression of molecules (e.g., MUC1 [ 54 ], TRIM29 [ 55 ], FKBP4 [ 56 ], etc.) with breast cancer metastasis. Such overexpression of molecules also appears in other forms of cancer, justifying the group of subthemes ‘LUNG CANCER’, ‘GASTRIC-CANCER’, ‘OVARIAN-CANCER’, and ‘COLORECTAL-CANCER’. Moreover, the cluster ‘IMPUTATION’ highlights efforts to develop imputation techniques (for data missingness) for breast cancer record analysis [ 57 , 58 ]. Besides, the application of data mining to depict breast cancer characteristics and their causes and effects has been highly supported by ‘MICROARRAY-DATA’ [ 59 , 60 ], ‘PATHWAY’ [ 61 ], and ‘COMPUTER-AIDED-DIAGNOSIS’ [ 62 ].

4.2.6. Alzheimer’s Disease (AD—f)

The cluster ‘ALZHEIMER’S DISEASE’ ( Figure 5 f) is highly influenced by subthemes related to other conditions, such as ‘DEMENTIA’ and ‘PARKINSON’S-DISEASE’. This co-occurrence happens because AD is a neurodegenerative illness which leads to dementia and is related to Parkinson’s disease. Studies show that the money spent on AD in 2015 was about $828 billion [ 63 ]. In this sense, data mining has been widely used with ‘GENOME-WIDE-ASSOCIATION’ techniques in order to identify genes related to AD [ 64 , 65 ] and for the prediction of AD from ‘MRI’ brain images [ 66 , 67 ]. The cluster ‘NF-KAPPA-B’ highlights the efforts to identify associations of NF-κB (nuclear factor kappa B) with AD by using data mining techniques, which can be used to advance drug development against AD [ 68 ].

4.2.7. Depression (g)

The cluster ‘DEPRESSION’ ( Figure 5 g) represents a common disease which affects over 260 million people. In the worst case, it can lead to suicide, which is the second leading cause of death in young adults. The cluster ‘DEPRESSION’ is highly interconnected, and its connections mostly represent the subthemes that have been the research focus of data mining applications [ 69 ]. The connections with the subthemes ‘SOCIAL-MEDIA’ and ‘ADOLESCENTS’, especially in times of social isolation, are extremely relevant to help identify early symptoms and tendencies among the population [ 70 ]. Furthermore, the presence of ‘COMORBIDITY’ and ‘SYMPTOMS’ is not surprising, given that the knowledge discovery properties of the data mining field could provide significant insights into the etiology of depression [ 71 ].

4.2.8. Random Forest (h)

The last cluster, ‘RANDOM-FOREST’ ( Figure 5 h), represents an ensemble learning method used, among other things, for classification. The presence of the ‘BAYESIAN-NETWORK’ subtheme, supported by its connection with ‘INFERENCE’, might represent an alternative against which data mining applications using random forests are benchmarked [ 72 ]. Since the ‘RANDOM-FOREST’ cluster has barely passed the threshold from a basic and transversal theme to a motor theme, the works developed under this cluster are not yet as interconnected as those of the previous ones. Thus, the most representative theme is ‘AIR-POLLUTION’, in conjunction with ‘POLLUTION’, where studies have been performed in order to obtain ‘RISK-ASSESSMENT’ through the exploration of the knowledge hidden in large databases [ 73 ].

4.3. Thematic Evolution Structure Analysis

The computer science themes related to data mining and the medical research concepts, depicted, respectively, in the grey and blue areas of the thematic evolution diagram ( Figure 6 ), demonstrate the evolution of the research field over the different sub-periods addressed in this study. Each individual theme’s relevance is illustrated through its cluster size as well as through its relationships throughout the different sub-periods. Thus, in this section, an analysis of the different trends in themes is presented to give a brief insight into the factors that might have influenced this evolution. The analysis is split into two thematic areas: first, the grey area (practices and techniques related to data mining in healthcare) is discussed, followed by the blue one (health concepts and diseases supported by data mining).

Figure 6. Thematic evolution structure of data mining in healthcare (1995–July 2020).

4.3.1. Practices and Techniques Related to Data Mining in Healthcare

The cluster ‘KNOWLEDGE-DISCOVERY’ ( Figure 6 , 1995–2012), often used as a synonym for data mining, provides a broader view of the field than the algorithm-focused data mining theme; its appearance and, later in the third period, its fading provide a first insight into the overall evolution of data mining papers applied to healthcare. The occurrence of the knowledge discovery cluster in the first two periods could demonstrate that data mining techniques were mainly applied to classify and predict conditions in the medical field. This gave rise to a competition with early machine learning techniques, potentially evidenced by the presence of the cluster ‘NEURAL-NETWORK’, against which the data mining techniques were probably benchmarked. The introduction of the ‘FEATURE-SELECTION’, ‘ARTIFICIAL-INTELLIGENCE’, and ‘MACHINE-LEARNING’ clusters, together with the fading of ‘KNOWLEDGE-DISCOVERY’, could imply a disruption of the field in the third sub-period that led to a change in the perspective of the studies.

One instance that could represent such a disruption is the well-known paper published by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton [ 74 ], where a novel neural network technique was first applied to a major image recognition competition and obtained a vast advantage over the other algorithms in use. The connection between the aforementioned work and its impact on data mining in healthcare research is largely supported by the disappearance of the cluster ‘IMAGE-MINING’ after the second sub-period, which has no connections further on. Furthermore, the presence of the clusters ‘MACHINE-LEARNING’, ‘ARTIFICIAL-INTELLIGENCE’, ‘SUPPORT-VECTOR-MACHINES’, and ‘LOGISTIC-REGRESSION’ may be evidence of a shift of focus in the data mining community for healthcare: besides attempting to compete with machine learning algorithms, researchers are now striving to further improve, through data mining, the results previously obtained with machine learning. Moreover, the colossal ‘FEATURE-SELECTION’ cluster, which circumscribes algorithms that enhance classification accuracy through a better selection of parameters, lends credence to this trend, since it may encompass publications from the previously mentioned clusters.

Although still small, the presence of the cluster ‘SECURITY’ in the last sub-period ( Figure 6 , 2013–2020) is, at the very least, relevant given the sensitive data that are handled in the medical space, such as patients’ histories and diseases. Above all, recent leaks of personal information have drawn ever-increasing attention to this topic, focusing on, among other things, the de-identification of personal information [ 75 , 76 , 77 ]. These kinds of security processes allow data mining researchers, among others, to make use of the vast amount of sensitive information stored in hospitals without any linkage that could associate a person with the data. For instance, the MIMIC Critical Care Database [ 78 ], an example of a de-identified database, has been allowing further research into many diseases and conditions in a secure way that would otherwise have been extremely impaired due to data limitations.

4.3.2. Health Concepts and Disease Supported by Data Mining

The cluster ‘GENE-EXPRESSION’ stands out in the first and second periods ( Figure 6 , 1995–2012) of medical research concepts and establishes a strong co-occurrence with the cluster ‘CANCER’ in the third sub-period. This link can be explained by research involving microarray technology, which makes it possible to detect deletions and duplications in the human genome by analyzing the expression of thousands of genes in different tissues. It is also possible to confirm the importance of genetic screening not only for cancer but for several diseases, such as ‘ALZHEIMER’ and other brain disorders, thereby assisting in preventive medicine and enabling more efficient treatment plans [ 79 ]. For example, a study was carried out to analyze complex brain disorders such as schizophrenia from gene expression microarrays [ 80 ].

Sequencing technologies have undergone major improvements in recent decades to determine evolutionary changes in genetic and epigenetic mechanisms and in ‘MOLECULAR-CLASSIFICATION’, a topic that gained prominence as a cluster in the first period. An example of this can be found in a study published in 2010 which combined a global optimization algorithm called Dongguang Li (DGL) with cancer diagnostic methods based on gene selection and microarray analysis. It performed the molecular classification of colon cancers and leukemia and demonstrated the importance of machine learning, data mining, and good optimization algorithms for analyzing microarray data in the presence of subsets of thousands of genes [ 81 ].

The cluster ‘PROSTATE-CANCER’ in the second period ( Figure 6 , 2004–2012) presents a strong conceptual nexus to ‘MOLECULAR-CLASSIFICATION’ in the first sub-period, and the same happens with clusters such as ‘METASTASIS’, ‘BREAST-CANCER’, and ‘ALZHEIMER’, which appear more recently in the third sub-period. The significant increase in the incidence of prostate cancer in recent years creates the need for a greater understanding of the disease in order to increase patient survival, since metastatic prostate cancer has not been well explored despite having a much lower survival rate than the early stages. In this sense, understanding the age-specific survival of patients with prostate cancer in a hospital setting using machine learning started to gain attention from academics and highlighted the importance of knowing survival after diagnosis for decision-making and better genetic counseling [ 82 ]. In addition, the relationship between prostate cancer and Alzheimer’s disease is explained by the fact that the use of androgen deprivation therapy, used to treat prostate cancer, is associated with an increased risk of Alzheimer’s disease and dementia [ 81 ]. Therefore, the risks and benefits of long-term exposure to this therapy must be weighed. Finally, the relationship between prostate cancer and breast cancer in the thematic evolution can be explained by studies showing that men with a family history of breast cancer have a 21% higher risk of developing prostate cancer, including lethal disease [ 83 ].

The cluster ‘PHARMACOVIGILANCE’ appears in the second sub-period ( Figure 6 , 2004–2012), showing a strong co-occurrence with clusters of the third sub-period: ‘ADVERSE-DRUGS-REACTIONS’ and ‘ELECTRONIC-HEALTH-RECORDS’. In recent years, data mining algorithms have stood out for their usefulness in detecting and screening patients with potential adverse drug reactions and, consequently, they have become a central component of pharmacovigilance, important for reducing the morbidity and mortality associated with the use of medications [ 48 ]. The importance of electronic medical records for pharmacovigilance is evident, as they act as a health database and enable drug safety assessors to collect information. In addition, such medical records are also essential to optimize processes within health institutions, ensure greater safety of patient data, integrate information, and facilitate the promotion of science and research in the health field [ 84 ]. These characteristics explain the large number of studies of ‘ELECTRONIC-HEALTH-RECORDS’ in the third sub-period and the growth of this theme in recent years, since the world has started to introduce electronic medical records, although a few institutions still use physical medical records.

The cluster ‘DEPRESSION’ appears in the second sub-period ( Figure 6 , 2004–2012) and remains a trend in the third sub-period, with a significant increase in publications on the topic. It is known that this disease is widespread and increasing worldwide, but its treatment and diagnosis still carry many stigmas. Globalization and the contemporary work environment [ 85 ] can be explanatory factors for the increase in the theme from the 2000s onwards, and the COVID-19 pandemic certainly contributed to the large number of articles on mental health published in 2020. In this context, improving the detection of mental disorders is essential for global health; detection can be enhanced by applying data mining to quantitative electroencephalogram signals to classify depressed and healthy people, acting as an adjuvant clinical decision support to identify depression [ 69 ].

5. Conclusions

In this research, we have performed a BPNA to depict the strategic themes, the thematic network structure, and the thematic evolution structure of data mining applied in healthcare. Our results highlight several significant pieces of information that can be used by decision-makers to advance the field of data mining in healthcare systems. For instance, our results could be used by editors of scientific journals to enhance decision-making regarding special issues and manuscript review. From the same perspective, healthcare institutions could use this research in the recruiting process to better align position needs with candidates’ qualifications based on the expanded clusters. Furthermore, Table 2 presents a series of authors whose collaboration network may be used as a reference to identify emerging talents in a specific research field and who might become persons of interest to greatly expand a healthcare institution’s research division. Additionally, Table 3 and Table 4 could also be used by researchers to enhance the alignment of their research intentions and partner institutions to, for instance, encourage the development of data mining applications in healthcare and advance the field’s knowledge.

The strategic diagram ( Figure 4 ) depicted the most important themes in terms of centrality and density. Such results could be used by researchers to provide insights for a better comprehension of how diseases like ‘CANCER’, ‘DIABETES-MELLITUS’, ‘ALZHEIMER’S-DISEASE’, ‘BREAST-CANCER’, ‘DEPRESSION’, and ‘CORONARY-ARTERY-DISEASE’ have made use of the innovations in the data mining field. Interestingly, none of the clusters have highlighted studies related to infectious diseases, and, therefore, it is reasonable to suggest the exploration of data mining techniques in this domain, especially given the global impact that the coronavirus pandemic has had on the world.

The thematic network structure ( Figure 5 ) demonstrates the co-occurrences among clusters and may be used to identify hidden patterns in the field of research to expand the knowledge and promote the development of scientific insights. Even though exhaustive research of the motor themes and their subthemes has been performed in this article, future research must be conducted in order to depict themes from the other quadrants (Q2, Q3, and Q4), especially emerging and declining themes, to bring to light relations between the rise and decay of themes that might be hidden inside the clusters.

The thematic evolution structure showed how the field is evolving over time and presented future trends of data mining in healthcare. It is reasonable to predict that clusters such as ‘NEURAL-NETWORKS’, ‘FEATURE-SELECTION’, and ‘EHR’ will not decay in the near future due to their prevalence in the field and, most likely, due to the exponential increase in the amount of patient health data being generated and stored daily in large data lakes. This unprecedented increase in data volume, which is often of dubious quality, leads to great challenges in the search for hidden information through data mining. Moreover, as a consequence of the ever-increasing data sensitivity, the cluster ‘SECURITY’, which is related to the confidentiality of patients’ information, is likely to keep growing over the next years as governments and institutions further develop structures, algorithms, and laws that aim to ensure data security. In this context, blockchain technologies specifically designed to ensure the integrity and availability of de-identified data, similarly to what is done by MIMIC-III (Medical Information Mart for Intensive Care III) [ 78 ], may be crucial to accelerate the advancement of the field by providing reliable information for health researchers across the world. Furthermore, future research should be conducted in order to understand how these themes will behave and evolve during the next years and to interpret the cluster changes to properly assess the trends presented here. These results could also be used as teaching material for classes, as they provide strategic intelligence applications and the field’s historical data.

In terms of limitations, we used the WoS database since it indexes journals with high JIF. Therefore, we suggest analyzing other databases, such as Scopus and PubMed, among others, in future works. Besides, we used SciMAT to perform the analysis, and other bibliometric software, such as VOSviewer, CiteSpace, and the Sci2 Tool, could be used to explore different points of view. Such information will support this study and future works to advance the field of data mining in healthcare.

Author Contributions

Conceptualization, M.L.K., L.B.F., L.P.C.T. and N.L.B.; Data curation, L.B.F.; Formal analysis, L.B.F., B.R., and P.H.U.; Funding acquisition, N.L.B.; Investigation, M.L.K., L.B.F., L.P.C.T. and M.K.S.; Methodology, L.B.F.; Project administration, L.B.F., N.L.B. and L.P.C.T.; Resources, N.L.B.; Supervision, L.B.F., N.L.B. and L.P.C.T.; Validation, N.L.B. and L.P.C.T.; Visualization, N.L.B.; Writing—original draft, L.B.F. and N.L.B.; Writing—review & editing, N.L.B. All authors have read and agreed to the published version of the manuscript.

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brazil (CAPES)—Finance Code 001, and in part by the Brazilian Ministry of Health. N.L.B. is partially supported by the CIHR 2019 Novel Coronavirus (COVID-19) rapid research program.

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Technical University of Munich
Data Analytics and Machine Learning Group, TUM School of Computation, Information and Technology

Open Topics

We offer multiple Bachelor's/Master's theses, Guided Research projects, and IDPs in the area of data mining/machine learning. A non-exhaustive list of open topics is given below.

If you are interested in a thesis or a guided research project, please send your CV and transcript of records to Prof. Stephan Günnemann via email and we will arrange a meeting to talk about the potential topics.

Robustness of Large Language Models

Type: Master's Thesis

Prerequisites:

  • Strong knowledge in machine learning
  • Very good coding skills
  • Proficiency with Python and deep learning frameworks (TensorFlow or PyTorch)
  • Knowledge about NLP and LLMs

Description:

The success of Large Language Models (LLMs) has precipitated their deployment across a diverse range of applications. With the integration of plugins enhancing their capabilities, it becomes imperative to ensure that the governing rules of these LLMs are foolproof and immune to circumvention. Recent studies have exposed significant vulnerabilities inherent to these models, underlining an urgent need for more rigorous research to fortify their resilience and reliability. A focus in this work will be the understanding of the working mechanisms of these attacks.

We are currently seeking students for the upcoming Summer Semester of 2024, so we welcome prompt applications. 

Contact: Tom Wollschläger

References:

  • Universal and Transferable Adversarial Attacks on Aligned Language Models
  • Attacking Large Language Models with Projected Gradient Descent
  • Representation Engineering: A Top-Down Approach to AI Transparency
  • Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Generative Models for Drug Discovery

Type: Master's Thesis / Guided Research

  • Strong machine learning knowledge
  • Proficiency with Python and deep learning frameworks (PyTorch or TensorFlow)
  • Knowledge of graph neural networks (e.g. GCN, MPNN)
  • No formal education in chemistry, physics or biology needed!

Effectively designing molecular geometries is essential to advancing pharmaceutical innovation, a domain which has experienced great attention through the success of generative models. These models promise a more efficient exploration of the vast chemical space and the generation of novel compounds with specific properties by leveraging their learned representations, potentially leading to the discovery of molecules with unique properties that would otherwise go undiscovered. Our topics lie at the intersection of generative models like diffusion/flow matching models and graph representation learning, e.g., graph neural networks. The focus of our projects can be model development with an emphasis on downstream tasks (e.g., diffusion guidance at inference time) and a better understanding of the limitations of existing models.

Contact :  Johanna Sommer , Leon Hetzel

Equivariant Diffusion for Molecule Generation in 3D

Equivariant Flow Matching with Hybrid Probability Transport for 3D Molecule Generation

Structure-based Drug Design with Equivariant Diffusion Models

Data Pruning and Active Learning

Type: Interdisciplinary Project (IDP) / Hiwi / Guided Research / Master's Thesis

Data pruning and active learning are vital techniques in scaling machine learning applications efficiently. Data pruning involves the removal of redundant or irrelevant data, which enables training models with considerably less data but the same performance. Similarly, active learning describes the process of selecting the most informative data points for labeling, thus reducing annotation costs and accelerating model training. However, current methods are often computationally expensive, which makes them difficult to apply in practice. Our objective is to scale active learning and data pruning methods to large datasets using an extrapolation-based approach.
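
To make the selection step tangible, the hedged sketch below runs a simple uncertainty-sampling active-learning loop on synthetic data with scikit-learn; it illustrates the general idea only, not the extrapolation-based approach targeted in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic pool standing in for a large unlabeled dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = RandomForestClassifier(random_state=0)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])
    # Uncertainty sampling: query the pool points whose top class probability is lowest.
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = np.argsort(uncertainty)[-10:]                  # 10 most uncertain pool positions
    picked = [unlabeled[i] for i in query]
    labeled.extend(picked)                                 # pretend an annotator labels them
    unlabeled = [i for i in unlabeled if i not in set(picked)]
    print(f"round {round_}: labeled pool size = {len(labeled)}")
```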

Contact: Sebastian Schmidt , Tom Wollschläger , Leo Schwinn

  • Large-scale Dataset Pruning with Dynamic Uncertainty

Efficient Machine Learning: Pruning, Quantization, Distillation, and More - DAML x Pruna AI

Type: Master's Thesis / Guided Research / Hiwi

The efficiency of machine learning algorithms is commonly evaluated by looking at target performance, speed, and memory footprint metrics. Reducing the costs associated with these metrics is of primary importance for real-world applications with limited resources (e.g., embedded systems, real-time predictions). In this project, you will work in collaboration with the DAML research group and the Pruna AI startup on investigating solutions to improve the efficiency of machine learning models by looking at multiple techniques like pruning, quantization, distillation, and more.
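
As a flavour of one of these techniques, the sketch below applies global magnitude pruning to a toy PyTorch model with torch.nn.utils.prune; it is only an illustrative starting point, not part of the project or of the Pruna AI tooling.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Small toy model; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Global unstructured magnitude pruning: mask the 50% smallest-magnitude weights
# across the selected layers (the weights are masked, not physically removed).
parameters_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.5)

# Make the masks permanent and report the resulting sparsity.
for module, name in parameters_to_prune:
    prune.remove(module, name)
zeros = sum((m.weight == 0).sum().item() for m, _ in parameters_to_prune)
total = sum(m.weight.numel() for m, _ in parameters_to_prune)
print(f"global sparsity: {zeros / total:.1%}")
```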

Contact: Bertrand Charpentier

  • The Efficiency Misnomer
  • A Gradient Flow Framework for Analyzing Network Pruning
  • Distilling the Knowledge in a Neural Network
  • A Survey of Quantization Methods for Efficient Neural Network Inference

Deep Generative Models

Type:  Master Thesis / Guided Research

  • Strong machine learning and probability theory knowledge
  • Knowledge of generative models and their basics (e.g., Normalizing Flows, Diffusion Models, VAE)
  • Optional: Neural ODEs/SDEs, Optimal Transport, Measure Theory

With recent advances, such as Diffusion Models, Transformers, Normalizing Flows, Flow Matching, etc., the field of generative models has gained significant attention in the machine learning and artificial intelligence research community. However, many problems and questions remain open, and the application to complex data domains such as graphs, time series, point processes, and sets is often non-trivial. We are interested in supervising motivated students to explore and extend the capabilities of state-of-the-art generative models for various data domains.

Contact : Marcel Kollovieh , David Lüdke

  • Flow Matching for Generative Modeling
  • Auto-Encoding Variational Bayes
  • Denoising Diffusion Probabilistic Models 
  • Structured Denoising Diffusion Models in Discrete State-Spaces

Graph Structure Learning

Type:  Guided Research / Hiwi

  • Optional: Knowledge of graph theory and mathematical optimization

Graph deep learning is a powerful ML concept that enables the generalisation of successful deep neural architectures to non-Euclidean structured data. Such methods have shown promising results in a vast range of applications spanning the social sciences, biomedicine, particle physics, computer vision, graphics, and chemistry. One of the major limitations of most current graph neural network architectures is that they often rely on the assumption that the underlying graph is known and fixed. However, this assumption is not always true, as the graph may be noisy, partially available, or even completely unknown. In the case of noisy or partially available graphs, it would be useful to jointly learn an optimised graph structure and the corresponding graph representations for the downstream task. On the other hand, when the graph is completely absent, it would be useful to infer it directly from the data. This is particularly interesting in inductive settings where some of the nodes were not present at training time. Furthermore, learning a graph can become an end in itself, as the inferred structure can provide complementary insights with respect to the downstream task. In this project, we aim to investigate solutions and devise new methods to construct an optimal graph structure based on the available (unstructured) data; a toy sketch of the joint-learning idea is given below.
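
A minimal, hypothetical sketch of jointly learning a dense adjacency matrix and a node classifier in plain PyTorch (not a method from the referenced papers) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentGraphClassifier(nn.Module):
    """Jointly learns a dense adjacency matrix and node representations for classification."""

    def __init__(self, n_nodes, in_dim, n_classes, hidden=32):
        super().__init__()
        self.adj_logits = nn.Parameter(torch.zeros(n_nodes, n_nodes))  # learnable graph structure
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # Symmetric adjacency with entries in (0, 1), plus self-loops and row normalisation.
        a = torch.sigmoid((self.adj_logits + self.adj_logits.T) / 2)
        a = a + torch.eye(a.size(0))
        a = a / a.sum(dim=1, keepdim=True)
        h = F.relu(self.lin1(a @ x))        # message passing over the learned graph
        return self.lin2(a @ h)

# Toy data: 50 nodes, 16 features, 3 classes, roughly 30% of the nodes labelled for training.
x = torch.randn(50, 16)
y = torch.randint(0, 3, (50,))
train_mask = torch.rand(50) < 0.3

model = LatentGraphClassifier(n_nodes=50, in_dim=16, n_classes=3)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
    opt.zero_grad()
    out = model(x)
    # Classification loss plus an L1 penalty encouraging a sparse learned graph.
    loss = F.cross_entropy(out[train_mask], y[train_mask]) + 1e-3 * model.adj_logits.abs().mean()
    loss.backward()
    opt.step()
print("learned edges above 0.5:", (torch.sigmoid(model.adj_logits) > 0.5).sum().item())
```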

Contact : Filippo Guerranti

  • A Survey on Graph Structure Learning: Progress and Opportunities
  • Differentiable Graph Module (DGM) for Graph Convolutional Networks
  • Learning Discrete Structures for Graph Neural Networks

NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification

A Machine Learning Perspective on Corner Cases in Autonomous Driving Perception  

Type: Master's Thesis 

Industrial partner: BMW 

Prerequisites: 

  • Strong knowledge in machine learning 
  • Knowledge of Semantic Segmentation  
  • Good programming skills 
  • Proficiency with Python and deep learning frameworks (TensorFlow or PyTorch) 

Description: 

In autonomous driving, state-of-the-art deep neural networks are used for perception tasks such as semantic segmentation. While the environment in curated datasets is controlled, novel classes or unknown disturbances can occur in real-world applications. To provide safe autonomous driving, these cases must be identified.

The objective is to explore novel-class segmentation and out-of-distribution detection approaches for semantic segmentation in the context of corner cases for autonomous driving.

Contact: Sebastian Schmidt

References: 

  • Segmenting Known Objects and Unseen Unknowns without Prior Knowledge 
  • Efficient Uncertainty Estimation for Semantic Segmentation in Videos  
  • Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family  
  • Description of Corner Cases in Automated Driving: Goals and Challenges 
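
For intuition, a minimal sketch of a common out-of-distribution baseline for segmentation, per-pixel softmax entropy, is shown below (assuming PyTorch); the one-layer "segmenter" is a placeholder rather than the industrial model, and thresholding entropy is only one of many possible scoring rules.

    # Minimal sketch: flag pixels whose softmax distribution is close to uniform.
    import torch
    import torch.nn as nn

    num_classes = 19
    seg_net = nn.Conv2d(3, num_classes, kernel_size=1)   # placeholder "segmenter"

    def ood_map(image, threshold=0.5):
        """Return per-pixel normalized entropy and a binary 'unknown' mask."""
        logits = seg_net(image)                            # (B, C, H, W)
        probs = torch.softmax(logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
        entropy = entropy / torch.log(torch.tensor(float(num_classes)))  # scale to [0, 1]
        return entropy, entropy > threshold

    entropy, unknown_mask = ood_map(torch.rand(1, 3, 64, 128))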

Active Learning for Multi Agent 3D Object Detection 

Type: Master's Thesis

Industrial partner: BMW

  • Knowledge in Object Detection 
  • Excellent programming skills 

In autonomous driving, state-of-the-art deep neural networks are used for perception tasks such as 3D object detection. To provide promising results, these networks often require large amounts of complex annotated data for training. These annotations are often costly and redundant. Active learning is used to select the most informative samples for annotation and to cover a dataset with as little annotated data as possible.

The objective is to explore active learning approaches for 3D object detection using combined uncertainty- and diversity-based methods.

  • Exploring Diversity-based Active Learning for 3D Object Detection in Autonomous Driving   
  • Efficient Uncertainty Estimation for Semantic Segmentation in Videos   
  • KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection
  • Towards Open World Active Learning for 3D Object Detection   
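
The following is a minimal sketch of a combined uncertainty/diversity acquisition step of the kind such a thesis would study; detection-specific details (3D boxes, per-object scores) are abstracted into one scalar score and one feature vector per unlabeled frame, and the greedy farthest-point-style trade-off is just one possible design.

    # Minimal sketch: greedily pick frames that are both uncertain and diverse.
    import numpy as np

    def select_batch(uncertainty, features, k):
        """Pick k samples trading off high uncertainty and feature-space diversity."""
        selected = []
        for _ in range(k):
            if not selected:
                diversity = np.ones(len(features))
            else:
                # Distance to the closest already-selected sample (farthest-point style).
                d = np.linalg.norm(features[:, None] - features[selected][None], axis=-1)
                diversity = d.min(axis=1)
            score = uncertainty * diversity
            score[selected] = -np.inf          # never pick the same frame twice
            selected.append(int(score.argmax()))
        return selected

    pool_uncertainty = np.random.rand(500)     # e.g. mean predictive entropy per frame
    pool_features = np.random.randn(500, 64)   # e.g. pooled detector embeddings
    to_annotate = select_batch(pool_uncertainty, pool_features, k=20)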

Graph Neural Networks

Type:  Master's thesis / Bachelor's thesis / guided research

  • Knowledge of graph/network theory

Graph neural networks (GNNs) have recently achieved great successes in a wide variety of applications, such as chemistry, reinforcement learning, knowledge graphs, traffic networks, or computer vision. These models leverage graph data by updating node representations based on messages passed between nodes connected by edges, or by transforming node representation using spectral graph properties. These approaches are very effective, but many theoretical aspects of these models remain unclear and there are many possible extensions to improve GNNs and go beyond the nodes' direct neighbors and simple message aggregation.

Contact: Simon Geisler

  • Semi-supervised classification with graph convolutional networks
  • Relational inductive biases, deep learning, and graph networks
  • Diffusion Improves Graph Learning
  • Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks
  • Reliable Graph Neural Networks via Robust Aggregation
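
As background, a minimal sketch of a single GCN-style message-passing layer with symmetric normalization is shown below (assuming PyTorch and a dense adjacency matrix); spectral variants and more expressive aggregation schemes build on the same pattern.

    # Minimal sketch of one graph convolution layer: normalize, aggregate, transform.
    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim)

        def forward(self, x, adj):
            a_hat = adj + torch.eye(adj.shape[0])          # add self-loops
            d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
            norm_adj = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
            return torch.relu(norm_adj @ self.lin(x))      # aggregate neighbor messages

    adj = (torch.rand(50, 50) < 0.1).float()
    adj = ((adj + adj.t()) > 0).float()                    # symmetric toy graph
    h = GCNLayer(16, 32)(torch.randn(50, 16), adj)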

Physics-aware Graph Neural Networks

Type:  Master's thesis / guided research

  • Proficiency with Python and deep learning frameworks (JAX or PyTorch)
  • Knowledge of graph neural networks (e.g. GCN, MPNN, SchNet)
  • Optional: Knowledge of machine learning on molecules and quantum chemistry

Deep learning models, especially graph neural networks (GNNs), have recently achieved great successes in predicting quantum mechanical properties of molecules. There is a vast amount of applications for these models, such as finding the best method of chemical synthesis or selecting candidates for drugs, construction materials, batteries, or solar cells. However, GNNs have only been proposed in recent years and there remain many open questions about how to best represent and leverage quantum mechanical properties and methods.

Contact: Nicholas Gao

  • Directional Message Passing for Molecular Graphs
  • Neural message passing for quantum chemistry
  • Learning to Simulate Complex Physics with Graph Networks
  • Ab initio solution of the many-electron Schrödinger equation with deep neural networks
  • Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
  • Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds
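
For illustration, below is a minimal sketch of a distance-aware message-passing step of the kind used by molecular GNNs (assuming PyTorch); the radial-basis expansion of interatomic distances and the fully connected toy "molecule" are deliberate simplifications of schemes such as SchNet or directional message passing.

    # Minimal sketch: messages conditioned on interatomic distances via an RBF expansion.
    import torch
    import torch.nn as nn

    class DistanceMPLayer(nn.Module):
        def __init__(self, hid_dim, num_rbf=16, cutoff=5.0):
            super().__init__()
            self.centers = nn.Parameter(torch.linspace(0, cutoff, num_rbf),
                                        requires_grad=False)
            self.msg = nn.Linear(hid_dim + num_rbf, hid_dim)

        def forward(self, h, pos, edge_index):
            src, dst = edge_index                               # (2, E) neighbor pairs
            dist = (pos[src] - pos[dst]).norm(dim=-1, keepdim=True)
            rbf = torch.exp(-(dist - self.centers) ** 2)        # (E, num_rbf)
            messages = self.msg(torch.cat([h[src], rbf], dim=-1))
            out = torch.zeros_like(h)
            out.index_add_(0, dst, messages)                    # sum messages per atom
            return h + out

    h, pos = torch.randn(8, 64), torch.randn(8, 3)              # 8 atoms
    edge_index = torch.stack([torch.arange(8).repeat(8),
                              torch.arange(8).repeat_interleave(8)])
    h_new = DistanceMPLayer(64)(h, pos, edge_index)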

Robustness Verification for Deep Classifiers

Type: Master's thesis / Guided research

  • Strong machine learning knowledge (at least equivalent to IN2064 plus an advanced course on deep learning)
  • Strong background in mathematical optimization (preferably combined with Machine Learning setting)
  • Proficiency with Python and deep learning frameworks (PyTorch or TensorFlow)
  • (Preferred) Knowledge of training techniques to obtain classifiers that are robust against small perturbations in data

Description: Recent work shows that deep classifiers suffer in the presence of adversarial examples: misclassified points that are very close to the training samples or even visually indistinguishable from them. This undesired behaviour constrains the possibilities of deploying promising neural-network-based classification methods in safety-critical scenarios. Therefore, new training methods should be proposed that promote (or, preferably, ensure) robust behaviour of the classifier around training samples.

Contact: Aleksei Kuvshinov

References (Background):

  • Intriguing properties of neural networks
  • Explaining and harnessing adversarial examples
  • SoK: Certified Robustness for Deep Neural Networks
  • Certified Adversarial Robustness via Randomized Smoothing
  • Formal guarantees on the robustness of a classifier against adversarial manipulation
  • Towards deep learning models resistant to adversarial attacks
  • Provable defenses against adversarial examples via the convex outer adversarial polytope
  • Certified defenses against adversarial examples
  • Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks
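
As background for the empirical side of the problem, here is a minimal sketch of the one-step FGSM attack (assuming PyTorch and a placeholder linear classifier); verification methods, in contrast to such attacks, aim to certify that no perturbation within a given radius can change the prediction.

    # Minimal sketch: craft an adversarial example by one signed-gradient step.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    loss_fn = nn.CrossEntropyLoss()

    def fgsm(x, y, eps=0.03):
        """Move each pixel by eps in the direction that increases the loss."""
        x_adv = x.clone().requires_grad_(True)
        loss_fn(model(x_adv), y).backward()
        return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

    x, y = torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,))
    x_adv = fgsm(x, y)
    flipped = (model(x).argmax(1) != model(x_adv).argmax(1)).float().mean()
    print(f"fraction of predictions changed by the attack: {flipped:.2f}")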

Uncertainty Estimation in Deep Learning

Type: Master's Thesis / Guided Research

  • Strong knowledge in probability theory

Safe prediction is a key feature in many intelligent systems. Classically, machine learning models compute output predictions regardless of the underlying uncertainty of the encountered situations. In contrast, aleatoric and epistemic uncertainty bring knowledge about undecidable and uncommon situations. The uncertainty view can substantially help to detect and explain unsafe predictions, and therefore make ML systems more robust. The goal of this project is to improve uncertainty estimation in ML models across various types of tasks.

Contact: Tom Wollschläger, Dominik Fuchsgruber, Bertrand Charpentier

  • Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
  • Predictive Uncertainty Estimation via Prior Networks
  • Posterior Network: Uncertainty Estimation without OOD samples via Density-based Pseudo-Counts
  • Evidential Deep Learning to Quantify Classification Uncertainty
  • Weight Uncertainty in Neural Networks
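
For a concrete baseline, a minimal sketch of epistemic uncertainty estimation via Monte Carlo dropout is shown below (assuming PyTorch); this is a generic baseline, not the prior/posterior-network approaches cited above.

    # Minimal sketch: keep dropout active at test time and average sampled predictions.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 3))

    def mc_dropout_predict(x, n_samples=30):
        """Average n_samples stochastic forward passes and report predictive entropy."""
        model.train()                        # keeps dropout enabled during inference
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
        mean = probs.mean(0)
        entropy = -(mean * mean.clamp_min(1e-8).log()).sum(-1)
        return mean, entropy

    mean_probs, uncertainty = mc_dropout_predict(torch.randn(8, 16))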

Hierarchies in Deep Learning

Type:  Master's Thesis / Guided Research

Multi-scale structures are ubiquitous in real-life datasets. As an example, phylogenetic nomenclature naturally reveals a hierarchical classification of species based on their evolutionary history. Learning multi-scale structures can help to exhibit natural and meaningful organizations in the data and to obtain compact data representations. The goal of this project is to leverage multi-scale structures to improve the speed, performance, and understanding of deep learning models.

Contact: Marcel Kollovieh, Bertrand Charpentier

  • Tree Sampling Divergence: An Information-Theoretic Metric for Hierarchical Graph Clustering
  • Hierarchical Graph Representation Learning with Differentiable Pooling
  • Gradient-based Hierarchical Clustering
  • Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space
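
As one concrete instance of learning a level of a hierarchy, here is a minimal sketch of a soft, differentiable cluster assignment in the spirit of differentiable pooling (assuming PyTorch); it coarsens node features and the adjacency into a smaller "next level" that can be stacked.

    # Minimal sketch: soft assignment of nodes to clusters, pooled features and adjacency.
    import torch
    import torch.nn as nn

    class SoftPool(nn.Module):
        def __init__(self, in_dim, num_clusters):
            super().__init__()
            self.assign = nn.Linear(in_dim, num_clusters)

        def forward(self, x, adj):
            s = torch.softmax(self.assign(x), dim=-1)   # (N, K) soft assignments
            x_coarse = s.t() @ x                        # cluster-level features
            adj_coarse = s.t() @ adj @ s                # cluster-level connectivity
            return x_coarse, adj_coarse, s

    x, adj = torch.randn(100, 32), (torch.rand(100, 100) < 0.05).float()
    x2, adj2, assignments = SoftPool(32, num_clusters=10)(x, adj)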

The Research Repository @ WVU


Mining Engineering Graduate Theses and Dissertations

Theses/Dissertations from 2023

Development of A Hydrometallurgical Process for the Extraction of Cobalt, Manganese, and Nickel from Acid Mine Drainage Treatment Byproduct , Alejandro Agudelo Mira

Selective Recovery of Rare Earth Elements from Acid Mine Drainage Treatment Byproduct , Zeynep Cicek

Identification of Rockmass Deformation and Lithological Changes in Underground Mines by Using Slam-Based Lidar Technology , Francisco Eduardo Gil Hurtado

Analysis of the Brittle Failure Mechanism of Underground Stone Mine Pillars by Implementing Numerical Modeling in FLAC3D , Rosbel Jimenez

Analysis of the root causes of fatal injuries in the United States surface mines between 2008 and 2021 , Maria Fernanda Quintero

Augmented Reality and Mobile Systems for Heavy Equipment Operators in Surface Mining , Juan David Valencia Quiceno

Theses/Dissertations from 2022

Integrated Large Discontinuity Factor, Lamodel and Stability Mapping Approach for Stone Mine Pillar Stability , Mustafa Baris Ates

Noise Exposure Trends Among Violating Coal Mines, 2000 to 2021 , Hanna Grace Davis

Calcite depression in bastnaesite-calcite flotation system using organic acids , Emmy Muhoza

Investigation of Geomechanical Behavior of Laminated Rock Mass Through Experimental and Numerical Approach , Qingwen Shi

Static Liquefaction in Tailing Dams , Jose Raul Zela Concha

Experimental and Theoretical Investigation on the Initiation Mechanism of Low-Rank Coal's Self-Heating Process , Yinan Zhang

Development of an Entry-Scale Modeling Methodology to Provide Ground Reaction Curves for Longwall Gateroad Support Evaluation , Haochen Zhao

Size effect and anisotropy on the strength of shale under compressive stress conditions , Yun Zhao

Theses/Dissertations from 2021

Evaluation of LIDAR systems for rock mass discontinuity identification in underground stone mines from 3D point cloud data , Mario Alejandro Bendezu de la Cruz

Implementing the Empirical Stone Mine Pillar Strength Equation into the Boundary Element Method Software LaModel , Samuel Escobar

Recovery of Phosphorus from Florida Phosphatic Waste Clay , Amir Eskanlou

Optimization of Operating Conditions and Design Parameters on Coal Ultra-Fine Grinding Through Kinetic Stirred Mill Tests and Numerical Modeling , Francisco Patino

The Effect of Natural Fractures on the Mechanical Behavior of Limestone Pillars: A Synthetic Rock Mass Approach Application , Mustafa Can Süner

Evaluation of Various Separation Techniques for the Removal of Actinides from A Rare Earth-Containing Solution Generated from Coarse Coal Refuse , Deniz Talan

Geology Oriented Loading Approach for Underground Coal Mines , Deniz Tuncay

Various Operational Aspects of the Extraction of Critical Minerals from Acid Mine Drainage and Its Treatment By-product , Zhongqing Xiao

Theses/Dissertations from 2020

Adaptation of Coal Mine Floor Rating (CMFR) to Eastern U.S. Coal Mines , Sena Cicek

Upstream Tailings Dam - Liquefaction , Mladen Dragic

Development, Analysis and Case Studies of Impact Resistant Steel Sets for Underground Roof Fall Rehabilitation , Dakota D. Faulkner

The influence of spatial variance on rock strength and mechanism of failure , Danqing Gao

Fundamental Studies on the Recovery of Rare Earth Elements from Acid Mine Drainage , Xue Huang

Rational drilling control parameters to reduce respirable dust during roof bolting operations , Hua Jiang

Solutions to Some Mine Subsidence Research Challenges , Jian Yang

An Interactive Mobile Equipment Task-Training with Virtual Reality , Lazar Zujovic

Theses/Dissertations from 2019

Fundamental Mechanism of Time Dependent Failure in Shale , Neel Gupta

A Critical Assessment on the Resources and Extraction of Rare Earth Elements from Acid Mine Drainage , Christopher R. Vass

Time-dependent deformation and associated failure of roof in underground mines , Yuting Xue

Theses/Dissertations from 2018

Parametric Study of Coal Liberation Behavior Using Silica Grinding Media , Adewale Wasiu Adeniji

Three-dimensional Numerical Modeling Encompassing the Stability of a Vertical Gas Well Subjected to Longwall Mining Operation - A Case Study , Bonaventura Alves Mangu Bali

Shale Characterization and Size-effect study using Scanning Electron Microscopy and X-Ray Diffraction , Debashis Das

Behaviour Of Laminated Roof Under High Horizontal Stress , Prasoon Garg

Theses/Dissertations from 2017

Optimization of Mineral Processing Circuit Design under Uncertainty , Seyed Hassan Amini

Evaluation of Ultrasonic Velocity Tests to Characterize Extraterrestrial Rock Masses , Thomas W. Edge II

A Photogrammetry Program for Physical Modeling of Subsurface Subsidence Process , Yujia Lian

An Area-Based Calculation of the Analysis of Roof Bolt Systems (ARBS) , Aanand Nandula

Developing and implementing new algorithms into the LaModel program for numerical analysis of multiple seam interactions , Mehdi Rajaeebaygi

Adapting Roof Support Methods for Anchoring Satellites on Asteroids , Grant B. Speer

Simulation of Venturi Tube Design for Column Flotation Using Computational Fluid Dynamics , Wan Wang

Theses/Dissertations from 2016

Critical Analysis of Longwall Ventilation Systems and Removal of Methane , Robert B. Krog

Implementing the Local Mine Stiffness Calculation in LaModel , Kaifang Li

Development of Emission Factors (EFs) Model for Coal Train Loading Operations , Bisleshana Brahma Prakash

Nondestructive Methods to Characterize Rock Mechanical Properties at Low-Temperature: Applications for Asteroid Capture Technologies , Kara A. Savage

Mineral Asset Valuation Under Economic Uncertainty: A Complex System for Operational Flexibility , Marcell B. B. Silveira

A Feasibility Study for the Automated Monitoring and Control of Mine Water Discharges , Christopher R. Vass

Spontaneous Combustion of South American Coal , Brunno C. C. Vieira

Calibrating LaModel for Subsidence , Jian Yang

Theses/Dissertations from 2015

Coal Quality Management Model for a Dome Storage (DS-CQMM) , Manuel Alejandro Badani Prado

Design Programs for Highwall Mining Operations , Ming Fan

Development of Drilling Control Technology to Reduce Drilling Noise during Roof Bolting Operations , Mingming Li

The Online LaModel User's & Training Manual Development & Testing , Christopher R. Newman

How to mitigate coal mine bumps through understanding the violent failure of coal specimens , Gamal Rashed

Theses/Dissertations from 2014

Effect of biaxial and triaxial stresses on coal mine shale rocks , Shrey Arora

Stability Analysis of Bleeder Entries in Underground Coal Mines Using the Displacement-Discontinuity and Finite-Difference Programs , Xu Tang

Experimental and Theoretical Studies of Kinetics and Quality Parameters to Determine Spontaneous Combustion Propensity of U.S. Coals , Xinyang Wang

Bubble Size Effects in Coal Flotation and Phosphate Reverse Flotation using a Pico-nano Bubble Generator , Yu Xiong

Integrating the LaModel and ARMPS Programs (ARMPS-LAM) , Peng Zhang

Theses/Dissertations from 2013

Column Flotation of Subbituminous Coal Using the Blend of Trimethyl Pentanediol Derivatives and Pico-Nano Bubbles , Jinxiang Chen

Applications of Surface and Subsurface Subsidence Theories to Solve Ground Control Problems , Biao Qiu

Calibrating the LaModel Program for Shallow Cover Multiple-Seam Mines , Morgan M. Sears

The Integration of a Coal Mine Emergency Communication Network into Pre-Mine Planning and Development , Mark F. Sindelar

Factors considered for increasing longwall panel width , Jack D. Trackemas

An experimental investigation of the creep behavior of an underground coalmine roof with shale formation , Priyesh Verma

Evaluation of Rope Shovel Operators in Surface Coal Mining Using a Multi-Attribute Decision-Making Model , Ivana M. Vukotic

Theses/Dissertations from 2012

Calculating the Surface Seismic Signal from a Trapped Miner , Adeniyi A. Adebisi

Comprehensive and Integrated Model for Atmospheric Status in Sealed Underground Mine Areas , Jianwei Cheng

Production and Cost Assessment of a Potential Application of Surface Miners in Coal Mining in West Virginia , Timothy A. Nolan

The Integration of Geomorphic Design into West Virginia Surface Mine Reclamation , Alison E. Sears

Truck Cycle and Delay Automated Data Collection System (TCD-ADCS) for Surface Coal Mining , Patricio G. Terrazas Prado

New Abutment Angle Concept for Underground Coal Mining , Ihsan Berk Tulu

Theses/Dissertations from 2011

Experimental analysis of the post-failure behavior of coal and rock under laboratory compression tests , Dachao Neil Nie

The influence of interface friction and w/h ratio on the violence of coal specimen failure , Simon H. Prassetyo

Theses/Dissertations from 2010

A risk management approach to pillar extraction in the Central Appalachian coalfields , Patrick R. Bucks

The Impacts of Longwall Mining on Groundwater Systems -- A Case of Cumberland Mine Panels B5 and B6 , Xinzhi Du

Evaluation of ultrafine spiral concentrators for coal cleaning , Meng Yang

Theses/Dissertations from 2009

Development of a coal reserve GIS model and estimation of the recoverability and extraction costs , Chandrakanth Reddy Apala

Application and evaluation of spiral separators for fine coal cleaning , Zhuping Che

Weak floor stability in the Illinois Basin underground coal mines , Murali M. Gadde

Design of reinforced concrete seals for underground coal mines , Rajagopala Reddy Kallu

Employing laboratory physical modeling to study the radio imaging method (RIM) , Jun Lu

Influence of cutting sequence and time effects on cutters and roof falls in underground coal mine -- numerical approach , Anil Kumar Ray

Implementing energy release rate calculations into the LaModel program , Morgan M. Sears

Modeling PDC cutter rock interaction , Ihsan Berk Tulu

Analytical determination of strain energy for the studies of coal mine bumps , Qiang Xu

Improvement of the mine fire simulation program MFIRE , Lihong Zhou

Theses/Dissertations from 2008

Program-assisted analysis of the transverse pressure capacity of block stoppings for mine ventilation control , Timothy J. Batchler

Analysis of factors affecting wireless communication systems in underground coal mines , David P. McGraw

Analysis of underground coal mine refuge shelters , Mickey D. Mitchell

Theses/Dissertations from 2007

Dolomite flotation of high magnesium phosphate ores using fatty acid soap collectors , Zhengxing Gu

Evaluation of longwall face support hydraulic supply systems , Ted M. Klemetti II

Experimental studies of electromagnetic signals to enhance radio imaging method (RIM) , William D. Monaghan

Analysis of water monitoring data for longwall panels , Joseph R. Zirkle

Theses/Dissertations from 2006

Measurements of the electrical properties of coal measure rocks , Nikolay D. Boykov

Geomechanical and weathering properties of weak roof shales in coal mines , Hakan Gurgenli

Assessment and evaluation of noise controls on roof bolting equipment and a method for predicting sound pressure levels in underground coal mining , Rudy J. Matetic


Jingbo Shang

Assistant Professor

  • CSE 4104 / SDSC 211E
  • University of California, San Diego
  • Google Scholar

2024-Spring-CSE291-DSC253-Advanced Data-Driven Text Mining

Graduate Class, CSE, UCSD , 2024

Class Time: Tuesdays and Thursdays, 9:30 to 11:50 AM. Room: CENTER 222. Piazza: piazza.com/ucsd/spring2024/cse291dsc253

Online Lecturing for First Week

To offer waitlisted students an opportunity to learn more about this course, the first week's lectures will be delivered over Zoom: https://ucsd.zoom.us/j/98881116686 . The lectures will be recorded.

This course mainly focuses on introducing current methods and models that are useful in analyzing and mining real-world text data. It will put emphasis on unsupervised, weakly supervised, and distantly supervised methods for text mining problems, including information retrieval, open-domain information extraction, text summarization (both extractive and generative), and knowledge graph construction. Bootstrapping, comparative analysis, learning from seed words and existing knowledge bases will be the key methodologies.
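
As a toy illustration of the seed-word idea mentioned above, the following sketch assigns noisy class labels to documents by seed-term matching before any classifier is trained; the seed lists and the example document are invented for illustration and are not course materials.

    # Minimal sketch of distant supervision: label documents by counting seed-word hits.
    from collections import Counter

    seed_words = {
        "sports": {"game", "season", "coach"},
        "finance": {"stock", "market", "earnings"},
    }

    def weak_label(doc):
        """Assign the class whose seed words occur most often; None if nothing matches."""
        tokens = doc.lower().split()
        counts = Counter({label: sum(tokens.count(w) for w in seeds)
                          for label, seeds in seed_words.items()})
        label, hits = counts.most_common(1)[0]
        return label if hits > 0 else None

    print(weak_label("The coach praised the team after the season opener game"))  # sports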

There is no textbook required, but there are recommended readings for each lecture (at the end of the slides).

  • We need your time commitment for projects
  • Feel free to audit the course with 0 unit

Prerequisites

Knowledge of Machine Learning and Data Mining; comfortable coding in Python, C/C++, or Java; math and statistics skills.

TA and Office Hours

  • Office Hour: Wednesdays, 10 AM to 11 AM
  • Zoom link: https://ucsd.zoom.us/my/jshang
  • Office Hour: TBD
  • Location: TBD

Note: all times are in Pacific Time .

  • Homework: 30%. There will be two homework assignments. 15% each.
  • Text Mining Challenge: 30%.
  • Project: 40%.
  • You should complete all work individually, except for the Project.
  • Late submissions are NOT accepted.

Lecture Schedule

Recording Note: Please download the recording video for the full length; the Dropbox website will only show you the first hour.

HW Note: All HWs are due by the end of the day, Pacific Time.

Homework (30%)

  • Due: April 23
  • Due: May 21

Data Mining Challenge (30%)

It is an individual-based text mining competition with quantitative evaluation. The challenge runs from April 23, 2023, 0:00:01 AM to May 23, 2023, 4:59:59 PM PT. Note that the time displayed on Kaggle is in UTC, not PT.

  • Challenge Statement, Dataset, and Details: TBD
  • Kaggle challenge link: TBD
  • Survey to map Kaggle account name to student names

Project (40%)

  • 1 to 4 members per team. More members, higher expectation.
  • 3 to 4 members are recommended, given the limited presentation slots.

Final Deliverables

  • Define your own research problem and justify its importance
  • Be ambitious! We could aim for ACL/EMNLP conference!
  • Report due on Jun 9 , End of the day, Pacific Time.
  • Write a 5- to 9-page report (research-paper style, following the ACL template); the page count does not include references.
  • Come up with your hypothesis and find some datasets for verification
  • Design your own models or try a large variety of existing models
  • Submit your codes and datasets; Github repos are welcome
  • Up to 5% bonus toward the total course grade for working demos/apps
  • The orders will be decided randomly after the teams are formed.
  • The slides must be ready 2 days before the presentation date, so that other students can have access and think about questions.
  • The presentation follows a typical conference style: 20 mins for each team including Q&A
  • Asking questions is an important part of research. You are strongly encouraged to ask questions to other teams. It will be a part of your presentation grade.
  • Handling questions is also an important skill for researchers.


Computer Science > Computer Vision and Pattern Recognition

Title: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Abstract: In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at this https URL .
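
Purely as an illustration of the stated idea of refining a fixed set of visual tokens with features from a second high-resolution encoder (and not the authors' implementation; all sizes are made up), a cross-attention sketch in PyTorch might look as follows.

    # Illustrative sketch only: refine low-resolution tokens with high-resolution features
    # so that the number of tokens passed to the language model stays unchanged.
    import torch
    import torch.nn as nn

    d = 256
    refine = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

    low_res_tokens = torch.randn(1, 576, d)    # e.g. 24x24 tokens fed to the LLM
    high_res_feats = torch.randn(1, 2304, d)   # e.g. 48x48 features from a second encoder

    refined, _ = refine(query=low_res_tokens, key=high_res_feats, value=high_res_feats)
    refined = refined + low_res_tokens         # residual: same 576 tokens, richer content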


COMMENTS

  1. (PDF) Implementation of Data Mining Techniques for ...

    Part-II of the thesis is about Implementing Data Mining Techniques in finding the trends of celebrities death causes over the past decade. The database for training is created from the public and ...

  2. PDF The application of data mining methods

    This thesis first introduces the basic concepts of data mining, such as the definition of data mining, its basic function, common methods and basic process, and two common data mining methods, classification and clustering. Then a data mining application in network is discussed in detail, followed by a brief introduction on data mining ...

  3. Data Mining

    Data Mining. Data Science; Data and Artificial Intelligence ... Fingerprint; Network; Researchers (46) Projects (2) Research output (628) Datasets (4) Prizes (17) Activities (6) Press/Media (12) Student theses (258) Student theses 1 - 50 out of 258 results ... Student thesis: Master. File. Activity Recognition Using Deep Learning in Videos ...

  4. A Study of Heart Disease Diagnosis Using Machine Learning and Data Mining

    3) Machine Learning algorithms allowed us to analyze clinical data, draw. relationships between diagnostic variables, design the predictive model, and. tests it against the new case. The predictive model achieved an accuracy of 89.4. percent using RandomForest Classifier's default setting to predict heart diseases.

  5. MASTER'S THESIS

    Data mining is an area where computer science, machine learning and statistics meet ... 1.3 Project goal The goal with this thesis can be split up as following. Concept: Understand the concept of data analysis and how these methods can be applied on small speci c problems such as structured data classi cation. Method: After the initial research ...

  6. Predicting the incidence of COVID-19 using data mining

    Data mining is capable of presenting a predictive model and extracting new knowledge from retrospective data. The way data is processed, as well as the variables selected, had a significant impact on knowledge discovery. There are various data mining techniques used to predict an outbreak. As an actual global health concern, COVID-19 had ...

  7. data mining Latest Research Papers

    The accurate average value is 74.05% of the existing COID algorithm, and our proposed algorithm has 77.21%. The average recall value is 81.19% and 89.51% of the existing and proposed algorithm, which shows that the proposed work efficiency is better than the existing COID algorithm. Download Full-text.

  8. Analysis of agriculture data using data mining techniques: application

    In agriculture sector where farmers and agribusinesses have to make innumerable decisions every day and intricate complexities involves the various factors influencing them. An essential issue for agricultural planning intention is the accurate yield estimation for the numerous crops involved in the planning. Data mining techniques are necessary approach for accomplishing practical and ...

  9. (PDF) Determining and Managing the Scope of Data Mining Project

    Determining and Managing the Scope of Data Mining Project - a Case Study in O&G Refinery. 10.13140/RG.2.2.25490.79046. Thesis for: MBA Information Technology Management. Advisor: Dr. Colin Price ...

  10. 50 selected papers in Data Mining and Machine Learning

    Active Sampling for Feature Selection, S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003. Heterogeneous Uncertainty Sampling for Supervised Learning, D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994. Learning When Training Data are Costly: The Effect of ...

  11. PDF Data mining in medical diagnostic support system

    1 Overview of the thesis topic 1.1 Introduction to data mining Nowadays, most industrial and business fields are applying information technology to the storage and processing data, this has created a very large amount of data that stored and increased constantly. It is a good chance for mining the data in warehouse to provide use-

  12. Dissertations / Theses: 'Data mining'

    This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. ... This project is an extension of ...

  13. (PDF) Data mining techniques and methodologies

    The SFB data set is a text-based dataset and data pre-processing and cleaning is a challenging task in Text and Data Mining (TDM) and Machine Learning (ML) [1], [2]. TDM is a cycle of finding ...

  14. Latest Research and Thesis topics in Data Mining

    Topics to study in data mining. Data mining is a relatively new thing and many are not aware of this technology. This can also be a good topic for M.Tech thesis and for presentations. Following are the topics under data mining to study: Fraud Detection. Crime Rate Prediction.

  15. PDF Data mining techniques applied in educational environments:

    III. Implementing a Data Mining Project The data mining projects are implemented with the aim of discovering patterns of relevant and interesting information in large volumes. This is done with the development of four phases (Virseda Benito & Carrillo, 2008), which are usually: 1. Filtering data. 2. Selection of variables. 3. Extracting ...

  16. Adaptations of data mining methodologies: a systematic literature

    Thus, data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project (Mariscal, Marbán & Fernández, 2010). ... Solarte J. A proposed data mining methodology and its application to industrial engineering. 2002. Ph.D. thesis, University of Tennessee. Strohbach et al. ...

  17. Trending Data Mining Thesis Topics

    Integration of MapReduce, Amazon EC2, S3, Apache Spark, and Hadoop into data mining. These are the recent trends in data mining. We insist that you choose one of the topics that interest you the most. Having an appropriate content structure or template is essential while writing a thesis.

  18. Educational data mining: prediction of students' academic performance

    Educational data mining has become an effective tool for exploring the hidden relationships in educational data and predicting students' academic achievements. This study proposes a new model based on machine learning algorithms to predict the final exam grades of undergraduate students, taking their midterm exam grades as the source data. The performances of the random forests, nearest ...

  19. 82 Data Mining Essay Topic Ideas & Examples

    Commercial Uses of Data Mining. Data mining process entails the use of large relational database to identify the correlation that exists in a given data. The principal role of the applications is to sift the data to identify correlations. A Discussion on the Acceptability of Data Mining.

  20. Data Mining Project Ideas & Thesis Topics

    Data Mining Project Ideas. Data mining is the process of analyzing data in large size which are usually unordered and to find some of the relation between them. In order to learn more about process you can read this research paper completely which is based on Data mining. Data mining involves exploring and analyzing data's in large volume in ...

  21. Data Mining in Healthcare: Applying Strategic Intelligence Techniques

    1. Introduction. Deriving from Industry 4.0 that pursues the expansion of its autonomy and efficiency through data-driven automatization and artificial intelligence employing cyber-physical spaces, the Healthcare 4.0 portrays the overhaul of medical business models towards a data-driven management [].In akin environments, substantial amounts of information associated to organizational ...

  22. Open Theses

    Open Topics We offer multiple Bachelor/Master theses, Guided Research projects and IDPs in the area of data mining/machine learning. A non-exhaustive list of open topics is listed below.. If you are interested in a thesis or a guided research project, please send your CV and transcript of records to Prof. Stephan Günnemann via email and we will arrange a meeting to talk about the potential ...

  23. Mining Engineering Graduate Theses and Dissertations

    Truck Cycle and Delay Automated Data Collection System (TCD-ADCS) for Surface Coal Mining, Patricio G. Terrazas Prado. PDF. New Abutment Angle Concept for Underground Coal Mining, Ihsan Berk Tulu. Theses/Dissertations from 2011 PDF. Experimental analysis of the post-failure behavior of coal and rock under laboratory compression tests, Dachao ...

  24. 2024-Spring-CSE291-DSC253-Advanced Data-Driven Text Mining

    Phrase Mining Applications and Future Work. Due: May 21; Data Mining Challenge (30%) It is a individual-based text mining competition with quantitative evaluation. The challenge runs from April 23 2023 0:00:01 AM to May 23, 2023 4:59:59 PM PT. Note that the time displayed on Kaggle is in UTC, not PT. Challenge Statement, Dataset, and Details: TBD

  25. Title: Mini-Gemini: Mining the Potential of Multi-modality Vision

    In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from ...