
Research on Application of Machine Learning in Data Mining

Xiuyi Teng 1,2 and Yuxia Gong 1,2

Published under licence by IOP Publishing Ltd in IOP Conference Series: Materials Science and Engineering, Volume 392, Issue 6. Citation: Xiuyi Teng and Yuxia Gong 2018 IOP Conf. Ser.: Mater. Sci. Eng. 392 062202. DOI: 10.1088/1757-899X/392/6/062202


Author affiliations

1 Economics and Management School, Tianjin University of Science and Technology, Tianjin 300222, China

2 Financial Engineering and Risk Management Research Center, Tianjin University of Science and Technology, Tianjin 300222, China


Data mining has been widely used in the business field, and machine learning can perform data analysis and pattern discovery, thus playing a key role in data mining applications. This paper expounds the definition, models, development stages, classification and commercial applications of machine learning, and emphasizes the role of machine learning in data mining. Understanding the various machine learning techniques helps in choosing the right method for a specific application. Therefore, this paper summarizes and analyzes machine learning techniques, and discusses their advantages and disadvantages in data mining.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.


Data mining articles from across Nature Portfolio

Data mining is the process of extracting potentially useful information from data sets. It uses a suite of methods to organise, examine and combine large data sets, including machine learning, visualisation methods and statistical analyses. Data mining is used in computational biology and bioinformatics to detect trends or patterns without knowledge of the meaning of the data.


Review Paper on Data Mining Techniques and Applications

International Journal of Innovative Research in Computer Science & Technology (IJIRCST), Volume-7, Issue-2, March 2019

Anshu (Contact Author), GVMGC Sonipat

5 pages. Posted: 2 March 2020. Date written: March 31, 2019

Data mining is the process of extracting hidden and useful patterns and information from data. It is a new technology that helps businesses predict future trends and behaviors, allowing them to make proactive, knowledge-driven decisions. The aim of this paper is to show the process of data mining and how it can help decision makers make better decisions. Practically, data mining is useful for any organization that has a huge amount of data. Data mining helps regular databases perform faster, and it helps increase profit through the better decisions it enables. This paper shows the various steps performed during the process of data mining and how it can be used by various industries to get better answers from huge amounts of data.

Keywords: Data Mining, Regression, Time Series, Prediction, Association


PeerJ Computer Science

Adaptations of data mining methodologies: a systematic literature review

Associated data

The following information was supplied regarding data availability:

The SLR protocol (also shared via an online repository) and the corpus with definitions and mappings are provided as a Supplemental File.

The use of end-to-end data mining methodologies such as CRISP-DM, the KDD process, and SEMMA has grown substantially over the past decade. However, little is known as to how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies, and we note that their number is growing rapidly. The dominant adaptation pattern is methodology adjustment at a granular level (modifications), followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects could help to mitigate these gaps.

Introduction

The availability of Big Data has stimulated widespread adoption of data mining and data analytics in research and in business settings (Columbus, 2017). Over the years, a number of data mining methodologies have been proposed, and these are being used extensively in practice and in research. However, little is known about which data mining methodologies are applied and how, and the topic has been neither widely researched nor discussed. Further, there is no consolidated view on what constitutes quality of the methodological process in data mining and data analytics, how data mining and data analytics are applied and used in organizational settings, and how application practices relate to each other. This motivates the need for a comprehensive survey of the field.

There have been surveys, quasi-surveys, and summaries in related fields. Notably, there have been two systematic literature reviews; a Systematic Literature Review (hereinafter, SLR) is the most suitable and widely used research method for identifying, evaluating and interpreting research relevant to a particular research question, topic or phenomenon (Kitchenham, Budgen & Brereton, 2015). These reviews concerned Big Data Analytics, but not general-purpose data mining methodologies. Adrian et al. (2004) executed an SLR on the implementation of Big Data Analytics (BDA), specifically the capability components necessary for BDA value discovery and realization. The authors identified BDA implementation studies, determined their main focus areas, and discussed BDA applications and capability components in detail. Saltz & Shamshurin (2016) published an SLR paper on Big Data team process methodologies. The authors identified the lack of a standard for how Big Data projects are executed, highlighted growing research in this area and the potential benefits of such a process standard, and synthesized a list of the 33 most important success factors for executing Big Data activities. Finally, there are studies that surveyed data mining techniques and applications across domains; yet they focus on data mining process artifacts and outcomes (Madni, Anwar & Shah, 2017; Liao, Chu & Hsiao, 2012), not on end-to-end process methodology.

There have been a number of surveys conducted in domain-specific settings such as the hospitality, accounting, education, manufacturing, and banking fields. Mariani et al. (2018) focused on an SLR of Business Intelligence (BI) and Big Data in the hospitality and tourism context. Amani & Fadlalla (2017) explored the application of data mining methods in accounting, while Romero & Ventura (2013) investigated educational data mining. Similarly, Hassani, Huang & Silva (2018) addressed data mining application case studies in banking and explored them along three dimensions: topics, applied techniques, and software. All these studies were performed by means of systematic literature reviews. Lastly, Bi & Cochran (2014) undertook a standard literature review of Big Data Analytics and its applications in manufacturing.

Apart from domain-specific studies, there have been very few general-purpose surveys that comprehensively overview, classify, and contextualize existing data mining methodologies. A valuable synthesis was presented by Kurgan & Musilek (2006) as a comparative study of the state of the art of data mining methodologies. The study was not an SLR and focused on a comprehensive comparison of the phases, processes, and activities of data mining methodologies; the application aspect was summarized briefly as application statistics by industry and citations. Three more comparative, non-SLR studies were undertaken by Marban, Mariscal & Segovia (2009), Mariscal, Marbán & Fernández (2010), and, most recently and most closely related, Martínez-Plumed et al. (2017). They followed the same pattern of systematizing existing data mining frameworks through comparative analysis. There, the purpose and context of consolidation were even more practical: to support the derivation and proposal of a new artifact, that is, a novel data mining methodology. The majority of these general-purpose surveys are more than a decade old and have natural limitations: (1) they are non-SLR studies, and (2) they are restricted to comparing methodologies in terms of phases, activities, and other elements.

The key characteristic common to all these studies is that data mining methodologies are treated as normative and standardized (‘one-size-fits-all’) processes. A complementary perspective, not considered in the above studies, is that data mining methodologies are not normative standardized processes but frameworks that need to be specialized to different industry domains, organizational contexts, and business objectives. In the last few years, a number of extensions and adaptations of data mining methodologies have emerged, which suggests that existing methodologies are not sufficient to cover the needs of all application domains. In particular, extensions of data mining methodologies have been proposed in the medical domain (Niaksu, 2015), the educational domain (Tavares, Vieira & Pedro, 2017), the industrial engineering domain (Huber et al., 2019; Solarte, 2002), and software engineering (Marbán et al., 2007, 2009). However, little attention has been given to studying how data mining methodologies are applied and used in industry settings; so far, only non-scientific practitioners’ surveys provide such evidence.

Given this research gap, the central objective of this article is to investigate how data mining methodologies are applied by researchers and practitioners, both in their generic (standardized) form and in specialized settings. This is achieved by investigating whether data mining methodologies are applied ‘as-is’ or adapted, and for what purposes such adaptations are implemented.

Guided by the Systematic Literature Review method, we initially identified a corpus of primary studies covering both peer-reviewed and ‘grey’ literature from 1997 to 2018. An analysis of these studies led us to a taxonomy of uses of data mining methodologies, focusing on the distinction between ‘as-is’ usage and various types of methodology adaptation. By analyzing the different types of methodology adaptation, this article identifies potential gaps in standard data mining methodologies at both the technological and the organizational level.

The rest of the article is organized as follows. The Background section provides an overview of the key concepts of data mining and the associated methodologies. Next, Research Design describes the research methodology. The Findings and Discussion section presents the study results and their interpretation. Finally, threats to validity are addressed in Threats to Validity, while the Conclusion summarizes the findings and outlines directions for future work.

Background

This section introduces the main data mining concepts and provides an overview of existing data mining methodologies and their evolution.

Data mining is defined as a set of rules, processes, and algorithms designed to generate actionable insights, extract patterns, and identify relationships from large datasets (Morabito, 2016). Data mining incorporates automated data extraction, processing, and modeling by means of a range of methods and techniques. In contrast, data analytics refers to techniques used to analyze and acquire intelligence from data (including ‘big data’) (Gandomi & Haider, 2015) and is positioned as a broader field encompassing a wider spectrum of methods, including both statistical methods and data mining (Chen, Chiang & Storey, 2012). A number of algorithms have been developed in the statistics, machine learning, and artificial intelligence domains to support and enable data mining. While statistical approaches precede them, they inherently come with limitations, the best known being rigid data distribution conditions. Machine learning techniques gained popularity as they impose fewer restrictions while deriving understandable patterns from data (Bose & Mahapatra, 2001).

Data mining projects commonly follow a structured process or methodology, as exemplified by Mariscal, Marbán & Fernández (2010) and Marban, Mariscal & Segovia (2009). A data mining methodology specifies tasks, inputs, and outputs, and provides guidelines and instructions on how the tasks are to be executed (Mariscal, Marbán & Fernández, 2010). Thus, a data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project (Mariscal, Marbán & Fernández, 2010).

The foundations of structured data mining methodologies were first proposed by Fayyad, Piatetsky-Shapiro & Smyth (1996a, 1996b, 1996c) and were initially related to Knowledge Discovery in Databases (KDD). KDD presents a conceptual process model of the computational theories and tools that support the extraction of information (knowledge) from data (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). In KDD, the overall approach to knowledge discovery includes data mining as a specific step. As such, KDD, with its nine main steps (exhibited in Fig. 1), has the advantage of considering data storage and access, algorithm scaling, interpretation and visualization of results, and human-computer interaction (Fayyad, Piatetsky-Shapiro & Smyth, 1996a, 1996c). The introduction of KDD also formalized a clearer distinction between data mining and data analytics, as formulated for example in Tsai et al. (2015): “…by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining”.

[Figure 1. The nine main steps of the KDD process.]

The main steps of KDD are as follows (a minimal code sketch follows the list):

  • Step 1: Learning the application domain: The first step develops an understanding of the application domain and relevant prior knowledge, followed by identifying the goal of the KDD process from the customer’s viewpoint.
  • Step 2: Dataset creation: The second step involves selecting a dataset, focusing on a subset of variables or data samples on which discovery is to be performed.
  • Step 3: Data cleaning and preprocessing: In the third step, basic operations to remove noise or outliers are performed. Also considered are the collection of information necessary to model or account for noise, strategies for handling missing data fields, and accounting for data types, schema, and the mapping of missing and unknown values.
  • Step 4: Data reduction and projection: Here, useful features to represent the data are found, depending on the goal of the task, and transformation methods are applied to find an optimal feature set for the data.
  • Step 5: Choosing the function of data mining: In the fifth step, the target outcome (e.g., summarization, classification, regression, clustering) is defined.
  • Step 6: Choosing the data mining algorithm: The sixth step concerns selecting the method(s) to search for patterns in the data, deciding which models and parameters are appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
  • Step 7: Data mining: The seventh step is the mining of the data itself, that is, searching for patterns of interest in a particular representational form or a set of such representations: classification rules or trees, regression, clustering.
  • Step 8: Interpretation: In this step, redundant and irrelevant patterns are filtered out, and relevant patterns are interpreted and visualized in such a way as to make the result understandable to the users.
  • Step 9: Using discovered knowledge: In the last step, the results are incorporated into the performance system, documented and reported to stakeholders, and used as a basis for decisions.
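To make the nine steps concrete, the following minimal sketch maps them onto a small scikit-learn workflow. The dataset, preprocessing choices, and model are illustrative assumptions for this sketch only; KDD itself prescribes no particular tools.

```python
# Hypothetical mapping of the nine KDD steps onto a small scikit-learn
# workflow. Dataset and model choices are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Step 1: learning the application domain -> framed here as a goal statement
goal = "classify tumours as malignant or benign"

# Step 2: dataset creation -> select the data and variables of interest
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: data cleaning and preprocessing -> e.g. scale the features
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 4: data reduction and projection -> find a useful feature representation
pca = PCA(n_components=10).fit(X_train)
X_train, X_test = pca.transform(X_train), pca.transform(X_test)

# Steps 5-6: choose the data mining function (classification) and algorithm
model = DecisionTreeClassifier(max_depth=4, random_state=0)

# Step 7: data mining -> search for patterns (here, fit the tree)
model.fit(X_train, y_train)

# Step 8: interpretation -> inspect and evaluate the discovered patterns
print(classification_report(y_test, model.predict(X_test)))

# Step 9: using discovered knowledge -> persist the model for deployment
# (e.g. joblib.dump(model, "model.joblib")) and report to stakeholders
```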

The KDD process became dominant in industrial and academic domains (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). As the timeline-based evolution of data mining methodologies and process models shows (Fig. 2 below), the original KDD model also served as the basis for other methodologies and process models, which addressed various gaps and deficiencies of the original KDD process. These approaches extended the initial KDD framework, although the degree of extension varied, ranging from process restructuring to a complete change in focus. For example, Brachman & Anand (1996), and later Gertosio & Dussauchoy (2004) in the form of a case study, introduced practical adjustments based on the iterative and interactive nature of the process. The complete KDD process in their view was enhanced with supplementary tasks, and the focus shifted to the user’s point of view (a human-centered approach), highlighting the decisions that the user needs to make in the course of the data mining process. In contrast, Cabena et al. (1997) proposed a different number of steps, emphasizing and detailing data processing and discovery tasks. Similarly, in a series of works, Anand & Büchner (1998), Anand et al. (1998), and Buchner et al. (1999) presented additional data mining process steps, concentrating on adapting the data mining process to practical settings. They focused on cross-sales (the entire life-cycle of the online customer), with the further incorporation of an internet data discovery process (web-based mining). Further, the Two Crows data mining process model is a consultancy-originated framework that defines the steps differently but remains close to the original KDD. Finally, SEMMA (Sample, Explore, Modify, Model and Assess), based on KDD, was developed by the SAS Institute in 2005 (SAS Institute Inc., 2017). It is defined as a logical organization of the functional toolset of SAS Enterprise Miner for carrying out the core tasks of data mining. Compared to KDD, it is a vendor-specific process model, which limits its application in different environments. It also skips two steps of the original KDD process (‘Learning the Application Domain’ and ‘Using Discovered Knowledge’), which are regarded as essential for the success of a data mining project (Mariscal, Marbán & Fernández, 2010). In terms of adoption, the new KDD-based proposals received limited attention in academia and industry (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Subsequently, most of these methodologies converged into the CRISP-DM methodology.

[Figure 2. Timeline-based evolution of data mining methodologies and process models.]

Additionally, only two non-KDD-based approaches were proposed alongside the extensions of KDD. The first is the 5 A’s approach, presented by De Pisón Ascacbar (2003) and used by the SPSS vendor. Its key contribution was the addition of an ‘Automate’ step, while its disadvantage was the omission of a ‘Data Understanding’ step. The second is 6-Sigma, an industry-originated method to improve quality and customer satisfaction (Pyzdek & Keller, 2003), which has been successfully applied to data mining projects in conjunction with the DMAIC performance improvement model (Define, Measure, Analyze, Improve, Control).

In 2000, in response to common issues and needs (Marban, Mariscal & Segovia, 2009), an industry-driven methodology called the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced as an alternative to KDD. It consolidated the original KDD model and its various extensions. While CRISP-DM builds upon KDD, it consists of six phases that are executed in iterations (Marban, Mariscal & Segovia, 2009). This iterative execution is the most distinguishing feature of CRISP-DM compared to the initial KDD, which assumes a sequential execution of its steps. CRISP-DM, much like KDD, aims at providing practitioners with guidelines to perform data mining on large datasets. However, CRISP-DM, with its six main phases comprising a total of 24 tasks and outputs, is more refined than KDD. The main phases of CRISP-DM, as depicted in Fig. 3 below, are as follows (a process-level sketch follows the list):

  • Phase 1: Business understanding: The focus of the first phase is to gain an understanding of the project objectives and requirements from a business perspective, followed by converting these into data mining problem definitions. Presentation of a preliminary plan to achieve the objectives is also included in this phase.
  • Phase 2: Data understanding: This phase begins with an initial data collection and proceeds with activities to become familiar with the data, identify data quality issues, discover first insights into the data, and potentially detect and form hypotheses.
  • Phase 3: Data preparation: The third phase covers the activities required to construct the final dataset from the initial raw data. Data preparation tasks are performed repeatedly.
  • Phase 4: Modeling: In this phase, various modeling techniques are selected and applied, followed by the calibration of their parameters. Typically, several techniques are used for the same data mining problem.
  • Phase 5: Evaluation: The fifth phase begins with a quality assessment and then, before proceeding to final model deployment, ascertains that the model(s) achieve the business objectives. At the end of this phase, a decision should be reached on how to use the data mining results.
  • Phase 6: Deployment: In the final phase, the models are deployed to enable end-customers to use the data as a basis for decisions or as support in the business process. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized, presented, and distributed in a way that the end-user can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.
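Since iteration is CRISP-DM's most distinguishing feature, a process-level sketch may be more instructive than a modeling example. Below is a hypothetical control-flow skeleton, assuming each phase is a function that passes a context object to the next; the evaluation gate either releases the model for deployment or triggers another iteration, as the methodology prescribes. The phase bodies are stubs.

```python
# Hypothetical skeleton of the CRISP-DM control flow. Phase bodies are
# stubs; the point is the iterative structure and the evaluation gate.
def business_understanding():  return {"objective": "reduce churn"}
def data_understanding(ctx):   return {**ctx, "data_ok": True}
def data_preparation(ctx):     return {**ctx, "dataset": "final"}
def modeling(ctx):             return {**ctx, "model": "tree-v1"}
def evaluation(ctx):           return {**ctx, "meets_objectives": True}
def deployment(ctx):           print(f"deploying {ctx['model']}")

max_iterations = 3
for i in range(max_iterations):
    ctx = business_understanding()
    ctx = data_understanding(ctx)
    ctx = data_preparation(ctx)   # preparation tasks repeat as needed
    ctx = modeling(ctx)           # in practice often loops back to preparation
    ctx = evaluation(ctx)
    if ctx["meets_objectives"]:   # gate: only deploy validated models
        deployment(ctx)
        break
    # otherwise revisit business understanding with what was learned
```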

[Figure 3. The six phases of the CRISP-DM process.]

The development of CRISP-DM was led by an industry consortium. It is designed to be domain-agnostic (Mariscal, Marbán & Fernández, 2010) and, as such, is now widely used by industry and research communities (Marban, Mariscal & Segovia, 2009). These characteristics have made CRISP-DM the ‘de facto’ standard data mining methodology and a reference framework against which other methodologies are benchmarked (Mariscal, Marbán & Fernández, 2010).

Similarly to KDD, a number of refinements and extensions of the CRISP-DM methodology have been proposed, in two main directions: extensions of the process model itself, and adaptations of, or mergers with, process models and methodologies from other domains. The extension direction is exemplified by Cios & Kurgan (2005), who proposed the integrated Data Mining & Knowledge Discovery (DMKD) process model. It contains several explicit feedback mechanisms and a modified last step to incorporate the application of discovered knowledge and insights, and it relies on technologies for results deployment. In the same vein, Moyle & Jorge (2001) and Blockeel & Moyle (2002) proposed the Rapid Collaborative Data Mining System (RAMSYS) framework, which is both a data mining methodology and a system for remote collaborative data mining projects. RAMSYS attempts to combine a problem-solving methodology, knowledge sharing, and ease of communication. It is intended to allow remotely located data miners to collaborate in a disciplined manner as regards information flow, while allowing the free flow of ideas for problem solving (Moyle & Jorge, 2001). CRISP-DM modifications and integrations with other specific domains were proposed in Industrial Engineering (Data Mining for Industrial Engineering by Solarte (2002)) and Software Engineering by Marbán et al. (2007, 2009). Both approaches enhanced CRISP-DM with additional phases, activities, and tasks typical of engineering processes, addressing ongoing support (Solarte, 2002) as well as project management, organizational, and quality assurance tasks (Marbán et al., 2009).

Finally, a limited number of attempts to create independent or semi-dependent data mining frameworks were undertaken after the creation of CRISP-DM. These efforts were driven by industry players and comprise the KDD Roadmap by Debuse et al. (2001) for a proprietary predictive toolkit (Lanner Group), and a recent effort by IBM with the Analytics Solutions Unified Method for Data Mining (ASUM-DM) in 2015 (IBM Corporation, 2016: https://developer.ibm.com/technologies/artificial-intelligence/articles/architectural-thinking-in-the-wild-west-of-data-science/). Both frameworks contributed additional tasks, for example resourcing in the KDD Roadmap, or the hybrid approach assumed in ASUM, that is, a combination of agile and traditional implementation principles.

Table 1 below summarizes the reviewed data mining process models and methodologies by their origin, basis, and key concepts.

Name | Origin | Basis | Key concept | Year
---- | ------ | ----- | ----------- | ----
Human-Centered | Academy | KDD | Iterative process and interactivity (user’s point of view and needed decisions) | 1996, 2004
Cabena et al. | Academy | KDD | Focus on data processing and discovery tasks | 1997
Anand and Buchner | Academy | KDD | Supplementary steps and integration of web-mining | 1998, 1999
Two Crows | Industry | KDD | Modified definitions of steps | 1998
SEMMA | Industry | KDD | Tool-specific (SAS Institute), elimination of some steps | 2005
5 A’s | Industry | Independent | Supplementary steps | 2003
6 Sigmas | Industry | Independent | Six Sigma quality improvement paradigm in conjunction with DMAIC performance improvement model | 2003
CRISP-DM | Joint industry and academy | KDD | Iterative execution of steps, significant refinements to tasks and outputs | 2000
Cios et al. | Academy | CRISP-DM | Integration of data mining and knowledge discovery, feedback mechanisms, usage of received insights supported by technologies | 2005
RAMSYS | Academy | CRISP-DM | Integration of collaborative work aspects | 2001–2002
DMIE | Academy | CRISP-DM | Integration and adaptation to the Industrial Engineering domain | 2001
Marban | Academy | CRISP-DM | Integration and adaptation to the Software Engineering domain | 2007
KDD roadmap | Joint industry and academy | Independent | Tool-specific, resourcing task | 2001
ASUM | Industry | CRISP-DM | Tool-specific, combination of traditional CRISP-DM and agile implementation approach | 2015

Research Design

The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use the systematic literature review (SLR) method for two reasons. Firstly, a systematic review is based on a trustworthy, rigorous, and auditable methodology. Secondly, an SLR supports structured synthesis of existing evidence, identifies research gaps, and provides a framework for positioning new research activities (Kitchenham, Budgen & Brereton, 2015). For our SLR, we followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). All SLR details are documented in a separate, peer-reviewed SLR protocol (available at https://figshare.com/articles/Systematic-Literature-Review-Protocol/10315961).

Research questions

As suggested by Kitchenham, Budgen & Brereton (2015), we formulated the research questions and motivate them as follows. In the preliminary phase of the research we discovered a very limited number of studies investigating data mining methodology application practices as such. Further, we found a number of surveys conducted in domain-specific settings and very few general-purpose surveys, but none of them considered application practices either. As a contrasting trend, the recent emergence of a limited number of adaptation studies clearly pinpoints the research gap in the area of application practices. Given this gap, an in-depth investigation of the phenomenon led us to ask RQ1: “How are data mining methodologies applied (‘as-is’ vs. adapted)?” Further, as we intended to investigate the universe of adaptation scenarios in depth, this naturally led to RQ2: “How have existing data mining methodologies been adapted?” Finally, if adaptations are made, we wished to explore the associated reasons and purposes, which in turn led to RQ3: “For what purposes are data mining methodologies adapted?”

Thus, for this review, there are three research questions defined:

  • Research Question 1: How are data mining methodologies applied (‘as-is’ versus adapted)? This question aims to identify patterns and trends in how data mining methodologies are applied and used.
  • Research Question 2: How have existing data mining methodologies been adapted? This question aims to identify and classify data mining methodology adaptation patterns and scenarios.
  • Research Question 3: For what purposes have existing data mining methodologies been adapted? This question aims to identify, explain, classify, and produce insights on the reasons for, and the benefits achieved by, adaptations of existing data mining methodologies: specifically, what gaps these adaptations seek to fill and what their benefits have been. Such systematic evidence and insights will be valuable input to a potentially new, refined data mining methodology, and will be of interest to practitioners and researchers.

Data collection strategy

Our data collection and search strategy followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). It defined the scope of the search, the selection of literature and electronic databases, the search terms and strings, and the screening procedures.

Primary search

The primary search aimed to identify an initial set of papers. To this end, search strings were derived from the research objective and research questions. The term ‘data mining’ was the key term, but we also included ‘data analytics’ to be consistent with observed research practices. The terms ‘methodology’ and ‘framework’ were also included. Thus, the following search strings were developed and validated in accordance with the guidelines suggested by Kitchenham, Budgen & Brereton (2015):

(‘data mining methodology’) OR (‘data mining framework’) OR (‘data analytics methodology’) OR (‘data analytics framework’)
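As a rough illustration of the boolean logic behind these strings (not the databases' actual query engines), the search amounts to the following predicate over a title or abstract:

```python
# Illustrative boolean filter equivalent to the four search strings.
SEARCH_PHRASES = (
    "data mining methodology", "data mining framework",
    "data analytics methodology", "data analytics framework",
)

def matches_search(text: str) -> bool:
    """True if any of the four exact phrases occurs (case-insensitive)."""
    t = text.lower()
    return any(phrase in t for phrase in SEARCH_PHRASES)

assert matches_search("A Data Mining Framework for Retail")
assert not matches_search("A framework for mining data")  # phrase must be exact
```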

The search strings were applied to the indexed scientific databases Scopus and Web of Science (for peer-reviewed, academic literature) and to the non-indexed Google Scholar (for non-peer-reviewed, so-called ‘grey’ literature). The decision to cover ‘grey’ literature in this research is motivated as follows. As proposed in a number of information systems and software engineering publications (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019), an SLR as a stand-alone method may not provide sufficient insight into the ‘state of practice’. It has also been shown (Garousi, Felderer & Mäntylä, 2016) that ‘grey’ literature can provide substantial benefits in certain areas of software engineering, in particular when the research topic relates to industrial and practical settings. Taking into consideration the research objective, which is to investigate data mining methodology application practices, we opted to include elements of a Multivocal Literature Review (MLR) in our study. Kitchenham, Budgen & Brereton (2015) also recommend including ‘grey’ literature to minimize publication bias, as positive results and research outcomes are more likely to be published than negative ones. Following MLR practices, we also designed inclusion criteria for types of ‘grey’ literature, reported below.

The selection of databases is motivated as follows. For peer-reviewed literature sources, we sought to avoid potential omission bias, which is discussed in IS research (Levy & Ellis, 2006) for the case where research is concentrated in a limited set of disciplinary data sources. Thus, a broad selection of data sources, including multidisciplinary (Scopus, Web of Science, Wiley Online Library) and domain-oriented (ACM Digital Library, IEEE Xplore Digital Library) scientific electronic databases, was evaluated. Multidisciplinary databases were preferred due to their wider domain coverage, and it was validated and confirmed that they include the publications originating from the domain-oriented databases, such as ACM and IEEE. Of the multidisciplinary databases, Scopus was selected for its widest possible coverage (it is the world’s largest database, covering approximately 80% of all international peer-reviewed journals), while Web of Science was selected for its longer temporal range; the two databases thus complement each other. The selected non-indexed source for ‘grey’ literature is Google Scholar, as it is a comprehensive source of both academic and ‘grey’ publications and is extensively referred to as such (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Further, Garousi, Felderer & Mäntylä (2019) present a three-tier categorization framework for types of ‘grey’ literature. In our study we restricted ourselves to 1st-tier ‘grey’ literature publications from a limited number of ‘grey’ literature producers. In particular, from the list of producers (Neto et al., 2019) we focused on government departments and agencies; non-profit economic and trade organizations (‘think tanks’) and professional associations; academic and research institutions; and businesses and corporations (consultancy companies and established private companies). The selected 1st-tier ‘grey’ literature items include: (1) government, academic, and private-sector consultancy reports; (2) theses (not lower than Master level) and PhD dissertations; (3) research reports; (4) working papers; (5) conference proceedings and preprints. With the 1st-tier ‘grey’ literature criteria we mitigate the quality assessment challenge that is especially relevant for, and reported about, such literature (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Scope and domains inclusion

As recommended by Kitchenham, Budgen & Brereton (2015), it is necessary to define the research scope initially. To clarify the scope, we defined what is not included and is out of scope of this research. The following aspects are not included in the scope of our study:

  • The context of technology and infrastructure for data mining/data analytics tasks and projects.
  • The application of granular methods within the data mining process itself, or their application to data mining tasks, for example constructing business queries or applying regression or neural network modeling techniques to solve classification problems. Studies with granular methods are included in the primary text corpus as long as the method application is part of an overall methodological approach.
  • Technological aspects of data mining, for example data engineering, dataflows, and workflows.
  • Traditional statistical methods not directly associated with data mining, including statistical control methods.

Similarly to Budgen et al. (2006) and Levy & Ellis (2006), initial piloting revealed that the search engines retrieved literature from all major scientific domains, including ones outside the authors’ area of expertise (e.g., medicine). Even though such studies could be retrieved, it would be impossible for us to analyze and correctly interpret literature published outside our area of expertise. The search strategy was therefore adjusted by retaining the domains closely associated with Information Systems and Software Engineering research. Thus, for the Scopus database the final set of included domains was limited to nine: Computer Science; Engineering; Mathematics; Business, Management and Accounting; Decision Science; Economics, Econometrics and Finance; and Multidisciplinary, as well as Undefined studies. The excluded domains covered 11.5%, or 106 out of 925 publications; the validation process confirmed that they primarily focused on specific case studies in the fundamental sciences and medicine. The included domains from Scopus were mapped to Web of Science to ensure a consistent approach across databases, and the correctness of the mapping was validated.

Screening criteria and procedures

Based on SLR practices (as in Kitchenham, Budgen & Brereton (2015) and Brereton et al. (2007)) and the defined SLR scope, we designed multi-step screening procedures (quality and relevancy) with an associated set of Screening Criteria and a Scoring System. The purpose of relevancy screening is to find relevant primary studies in an unbiased way (Vanwersch et al., 2011). Quality screening, in turn, aims to assess the relevant primary studies in terms of quality, also in an unbiased way.

Screening Criteria consisted of two subsets: Exclusion Criteria, applied for initial filtering, and Relevance Criteria, also known as Inclusion Criteria.

Exclusion Criteria were initial threshold quality controls aimed at eliminating studies with limited or no scientific contribution. They also address issues of understandability, accessibility, and availability. The Exclusion Criteria were as follows:

  • Quality 1: The publication item is not in English (understandability).
  • Quality 2: The publication item is a duplicate, that is:
      • the same document is retrieved from two or all three databases;
      • or different versions of the same publication are retrieved (i.e., the same study published in different sources); based on best practices, the decision rule is that the most recent paper is retained, as well as the one with the highest score (Kofod-Petersen, 2014);
      • or, if a publication is published both as a conference proceeding and as a journal article with the same name and authors, or as an extended version of a conference paper, the latter is selected.
  • Quality 3: The length of the publication is less than six pages; short papers do not have the space to expand and discuss the presented ideas in sufficient depth for us to examine.
  • Quality 4: The paper is not accessible in full length online through the university subscriptions of databases or via Google Scholar; the lack of full availability prevents us from assessing and analyzing the text.

The initially retrieved list of papers was filtered based on the Exclusion Criteria. Only papers that passed all criteria were retained in the final studies corpus. The mapping of criteria to screening steps is exhibited in Fig. 4.

[Figure 4. The data extraction and screening process, with criteria mapped to screening steps.]

Relevance Criteria were designed to identify relevant publications; they are presented in Table 2 below, while the mapping to the respective process steps is presented in Fig. 4. These criteria were applied iteratively.

Relevance criteria | Criteria definition | Criteria justification
------------------ | ------------------- | ----------------------
Relevance 1 | Is the study about a data mining or data analytics approach, and is it within the designated list of domains? | Exclude studies conducted outside the designated domain list. Exclude studies not directly describing and/or discussing data mining and data analytics.
Relevance 2 | Is the study introducing/describing a data mining or data analytics methodology/framework, or modifying existing approaches? | Exclude texts considering only specific, granular data mining and data analytics techniques, methods, or traditional statistical methods. Exclude publications focusing on specific, granular data mining and data analytics process/sub-process aspects. Exclude texts where description and discussion of data mining methodologies or frameworks is manifestly missing.

As a final SLR step, a quality assessment of the full texts was performed with the constructed Scoring Metrics (in line with Kitchenham & Charters (2007)), presented in Table 3 below.

Score | Criteria definition
----- | -------------------
3 | Data mining methodology or framework is presented in full. All steps are described and explained, tests performed, results compared and evaluated. There is a clear proposal on the usage, application, or deployment of the solution in the organization’s business process(es) and IT/IS systems, and/or a prototype or full solution implementation is discussed. Success factors are described and presented.
2 | Data mining methodology or framework is presented; some process steps are missing, but they do not impact the holistic view and understanding of the performed work. The data mining process is clearly presented and described, tests performed, results compared and evaluated. There is a proposal on the usage, application, or deployment of the solution in the organization’s business process(es) and IT/IS system(s).
1 | Data mining methodology or framework is not presented in full; some key phases and process steps are missing. The publication focuses on one or a few aspects (e.g., a method or technique).
0 | Data mining methodology or framework is not presented as a holistic approach but on a fragmented basis; the study is limited to some aspects (e.g., method or technique discussion).

Data extraction and screening process

The data extraction and screening process is presented in Fig. 4. In Step 1, the initial publication lists were retrieved from the pre-defined databases: Scopus, Web of Science, and Google Scholar. The lists were merged and duplicates eliminated in Step 2. Afterwards, texts shorter than six pages were excluded (Step 3). Steps 1–3 were guided by the Exclusion Criteria. In the next stage (Step 4), publications were screened by title based on the pre-defined Relevance Criteria. Those that passed were evaluated for availability (Step 5). As long as a study was available, it was evaluated again by the same Relevance Criteria, applied to the abstract, conclusion and, if necessary, introduction (Step 6). The texts that passed this threshold formed the primary publications corpus, which was extracted from the databases in full. These primary texts were evaluated once more based on the full text (Step 7), applying first the Relevance Criteria and then the Scoring Metrics.
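These steps can be read as a filtering funnel. The sketch below is a schematic rendering of Steps 2–7 under the decision rule reported in the next subsection (scores of 2–3 retained); the record fields and predicates are hypothetical stand-ins for the reviewers' manual judgments.

```python
# Schematic rendering of the screening funnel (Steps 2-7). The predicate
# fields stand in for manual judgments made by the reviewers.
def deduplicate(papers):            # Step 2: merge lists, drop duplicates
    return list({p["title"]: p for p in papers}.values())

def screen(papers):
    papers = deduplicate(papers)
    papers = [p for p in papers if p["pages"] >= 6]          # Step 3
    papers = [p for p in papers if p["title_relevant"]]      # Step 4
    papers = [p for p in papers if p["available"]]           # Step 5
    papers = [p for p in papers if p["abstract_relevant"]]   # Step 6
    return [p for p in papers if p["score"] >= 2]            # Step 7: keep 2-3

corpus = screen([
    {"title": "A data mining framework", "pages": 12, "title_relevant": True,
     "available": True, "abstract_relevant": True, "score": 3},
    {"title": "Short note", "pages": 4, "title_relevant": True,
     "available": True, "abstract_relevant": True, "score": 3},
])
print(len(corpus))  # 1 -- the short note is excluded at the length gate
```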

Results and quantitative analysis

In Step 1, 1,715 publications were extracted from the relevant databases, with the following composition: Scopus (819), Web of Science (489), Google Scholar (407). In terms of scientific publication domains, Computer Science (42.4%), Engineering (20.6%) and Mathematics (11.1%) accounted for approximately 74% of the Scopus-originated texts; the same applies to the Web of Science harvest. The application of the Exclusion Criteria produced the following results: in Step 2, after eliminating duplicates, 1,186 texts were passed on for minimum-length evaluation, and 767 reached assessment by the Relevance Criteria.

As mentioned, the Relevance Criteria were applied iteratively (Steps 4–6) and in conjunction with the availability assessment. As a result, only 298 texts were retained for full evaluation, with 241 originating from the scientific databases and 57 being ‘grey’. These studies formed the primary text corpus, which was extracted, read in full, and evaluated by the Relevance Criteria combined with the Scoring Metrics. The decision rule was set as follows: studies that scored ‘1’ or ‘0’ were rejected, while texts scoring ‘3’ or ‘2’ were admitted to the final primary studies corpus. To this end, as the outcome of an SLR-based, broad, cross-domain publication collection and screening, we identified 207 relevant publications from peer-reviewed (156 texts) and ‘grey’ literature (51 texts). Figure 5 below exhibits the yearly number of published studies, broken down by ‘peer-reviewed’ and ‘grey’ literature, starting from 1997.

[Figure 5. Number of publications per year (1997–2018), broken down by peer-reviewed and ‘grey’ literature.]

In terms of composition, the ‘peer-reviewed’ corpus is well balanced, with 72 journal articles and 82 conference papers, while book chapters account for only 4 instances. In contrast, in the ‘grey’ literature subset, articles in moderated and non-peer-reviewed journals are dominant (n = 34) compared to the overall number of conference papers (n = 13), followed by a small number of technical reports and preprints (n = 4).

A temporal analysis of the text corpus (as per Fig. 5) resulted in two observations. Firstly, stable and significant research interest (in terms of numbers) in the application of data mining methodologies started around a decade ago, in 2007; research efforts made prior to 2007 were relatively limited, with the number of publications below 10 per year. Secondly, research on data mining methodologies has grown substantially since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications roughly tripled over the past decade, hitting an all-time high of 24 texts released in 2017.
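The 3-year and 10-year trendlines mentioned above are, presumably, moving averages over annual publication counts. With placeholder counts (not the study's actual data), such trendlines can be computed as follows:

```python
# Rolling-mean trendlines over annual publication counts. The counts below
# are hypothetical placeholders, not the figures reported in the study.
import pandas as pd

counts = pd.Series(
    [2, 3, 1, 4, 3, 5, 4, 6, 5, 7, 12, 14, 10, 13, 15, 14, 18, 20, 22, 19, 24, 21],
    index=range(1997, 2019), name="publications",
)
trend_3y = counts.rolling(window=3).mean()    # short-term trendline
trend_10y = counts.rolling(window=10).mean()  # long-term trendline
print(pd.DataFrame({"3y": trend_3y, "10y": trend_10y}).tail(3))
```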

Further, there are two distinct spike sub-periods, in 2007–2009 and 2014–2017, followed by a stable pattern with an overall higher number of publications released annually. This observation is in line with the trend of increased penetration of data mining methodologies, tools, cross-industry applications, and academic research.

Findings and Discussion

In this section, we address the research questions of the paper. Initially, as part of RQ1, we present an overview of ‘as-is’ versus adapted usage trends of data mining methodologies. In addressing RQ2, we classify the identified adaptations. Then, as part of the RQ3 subsection, each category identified under RQ2 is analyzed with a particular focus on the goals of the adaptations.

RQ1: How are data mining methodologies applied (‘as-is’ vs. adapted)?

The first research question examines the extent to which data mining methodologies are used ‘as-is’ versus adapted. Our review of the 207 publications identified two distinct paradigms in how data mining methodologies are applied. The first is ‘as-is’, where a data mining methodology is applied as stipulated. The second is ‘with adaptations’, where the methodology is modified by introducing various changes to the standard process model when applied.

We aggregated the research by decade to differentiate application patterns between two time periods: 1997–2007, with limited data mining application, and 2008–2018, with more intensive application. This cut was guided not only by the extracted publications corpus but also by earlier surveys. In particular, in the pre-2007 period ten new methodologies were proposed, whereas since then only two new methodologies have appeared. Thus, there is a distinct trend over the last decade of a large number of proposed extensions and adaptations rather than entirely new methodologies.

We note that during the first decade of our time scope (1997–2007), the ratio of data mining methodologies applied ‘as-is’ was 40% (as presented in Fig. 6A), whereas for the following decade it was 32% (Fig. 6B). Thus, in terms of relative shares, we note a clear decrease in using data mining methodologies ‘as-is’ in favor of adapting them to cater to specific needs. The trend is even more pronounced in absolute numbers: adaptations more than tripled (from 30 to 106), while the ‘as-is’ scenario increased modestly (from 20 to 51). Given this finding, we continue by analyzing how data mining methodologies have been adapted under RQ2.
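The reported shares follow directly from these counts, as the following check shows:

```python
# Sanity check of the reported 'as-is' shares from the counts in the text.
as_is_1997_2007, adapted_1997_2007 = 20, 30
as_is_2008_2018, adapted_2008_2018 = 51, 106

share_first = as_is_1997_2007 / (as_is_1997_2007 + adapted_1997_2007)
share_second = as_is_2008_2018 / (as_is_2008_2018 + adapted_2008_2018)
print(f"{share_first:.0%}, {share_second:.0%}")  # 40%, 32%
```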

[Figure 6. Shares of ‘as-is’ vs. adapted applications of data mining methodologies in (A) 1997–2007 and (B) 2008–2018.]

RQ2: How have existing data mining methodologies been adapted?

We identified that data mining methodologies have been adapted to cater to specific needs. To categorize the adaptation scenarios, we applied a two-level dichotomy, specifically the following decision tree (transcribed in code after the list):

  • Level 1 Decision: Has the methodology been combined with another methodology? If yes, the resulting methodology is classified in the ‘integration’ category. Otherwise, we pose the next question.
  • Level 2 Decision: Are any new elements (phases, tasks, deliverables) added to the methodology? If yes, we designate the resulting methodology as an ‘extension’ of the original one. Otherwise, we classify it as a ‘modification’ of the original one.
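This two-level dichotomy transcribes directly into code; the boolean flags below are hypothetical per-study annotations:

```python
# Direct transcription of the two-level classification used for RQ2.
def classify_adaptation(combined_with_other: bool, new_elements: bool) -> str:
    """Classify an adapted methodology per the two-level decision tree."""
    if combined_with_other:   # Level 1: merged with another methodology?
        return "integration"
    if new_elements:          # Level 2: new phases/tasks/deliverables added?
        return "extension"
    return "modification"     # granular changes within existing elements

assert classify_adaptation(True, False) == "integration"
assert classify_adaptation(False, True) == "extension"
assert classify_adaptation(False, False) == "modification"
```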

Thus, when adapted, three distinct types of adaptation scenarios can be distinguished:

  • Scenario ‘Modification’: introduces specialized sub-tasks and deliverables to address specific use cases or business problems. Modifications typically concentrate on granular adjustments at the level of sub-phases, tasks, or deliverables within the stages of the existing reference frameworks (e.g., CRISP-DM or KDD). For example, Chernov et al. (2014), in a study in the mobile network domain, proposed an automated decision-making enhancement in the deployment phase; in addition, the evaluation phase was modified by using both conventional and own-developed performance metrics. Further, in a study within the financial services domain, Yang et al. (2016) present feature transformation and feature selection as sub-phases, thereby enhancing the data mining modeling stage.
  • Scenario ‘Extension’: proposes significant extensions to the reference data mining methodologies. Such extensions result in integrated data mining solutions, in data mining frameworks serving as a component or tool of automated IS systems, or in transformations to fit specialized environments. The main purposes of extensions are to integrate fully scaled data mining solutions into IS/IT systems and business processes, and to provide broader context with useful architectures, algorithms, and the like. Adaptations where extensions have been made elicit and explicitly present various artifacts in the form of system and model architectures, process views, workflows, and implementation aspects. A number of soft goals are also achieved, such as providing a holistic perspective on the data mining process and contextualizing it with organizational needs. This scenario also includes extensions where data mining process methodologies are substantially changed and extended in all key phases to enable execution of the data mining life-cycle with new (Big) Data technologies and tools and in new prototyping and deployment environments (e.g., Hadoop platforms or real-time customer interfaces). For example, Kisilevich, Keim & Rokach (2013) extended traditional CRISP-DM data mining outcomes with a fully fledged Decision Support System (DSS) for the hotel brokerage business. The authors introduced spatial/non-spatial data management (extending data preparation) and analytical and spatial modeling capabilities (extending the modeling phase), and provided spatial display and reporting capabilities (enhancing the deployment phase). In the same work, domain knowledge was introduced into all phases of the data mining process, and usability and ease of use were also addressed.
  • Scenario ‘Integration’: combines a reference methodology, for example CRISP-DM, with: (1) data mining methodologies originating from other domains (e.g., software engineering development methodologies), (2) organizational frameworks (Balanced Scorecard, Analytics Canvas, etc.), or (3) adjustments to accommodate Big Data technologies and tools. Adaptations in the form of ‘Integration’ typically introduce various types of ontologies and ontology-based tools, domain knowledge, software engineering, and BI-driven framework elements. Fundamental data mining process adjustments to new types of data and IS architectures (e.g., real-time data, multi-layer IS) are also presented. Key gaps addressed by such adjustments are the prescriptive nature and low degree of formalization of CRISP-DM, its obsolescence with respect to tools, and its lack of integration with other organizational frameworks. For example, Brisson & Collard (2008) developed the KEOPS data mining methodology (CRISP-DM based), centered on domain knowledge integration. An ontology-driven information system was proposed, with integration and enhancements to all steps of the data mining process. Further, integrated expert knowledge used in all data mining phases was shown to produce value in the data mining process.

To examine how the application scenarios of data mining methodologies have developed over time, we mapped peer-reviewed texts and ‘grey’ literature to the respective adaptation scenarios, aggregated by decade (as presented in Fig. 7 for peer-reviewed literature and Fig. 8 for ‘grey’ literature).

[Figure 7: peer-reviewed literature mapped to adaptation scenarios, aggregated by decade]

For peer-reviewed research, this temporal analysis resulted in three observations. Firstly, research effort in each adaptation scenario has been growing, and the number of publications more than quadrupled (128 vs. 28). Secondly, as noted above, the relative proportion of ‘as-is’ studies is diluted (from 39% to 33%) and primarily replaced by the ‘Extension’ paradigm (from 25% to 30%); in contrast, the relative gains of the ‘Modification’ and ‘Integration’ paradigms are modest. Further, this finding is reinforced by another observation: the most notable gap in terms of publication counts remains in the ‘Integration’ category, where, excluding the 2008–2009 spike, research efforts are limited and the number of texts is just 13. This is in stark contrast with the prolific, though recent, research in the ‘Extension’ category. We can hypothesize that existing reference methodologies do not accommodate and support the increasing complexity of data mining projects and IS/IT infrastructure, as well as certain domain specifics, and as such need to be adapted.

In the ‘grey’ literature, in contrast to peer-reviewed research, growth in the number of publications is less pronounced: 29 vs. 22 publications, or 32%, comparing across the two decades (as per Fig. 8). The growth is solely driven by ‘Integration’ scenario applications (13 vs. 4 publications), while both ‘as-is’ usage and the other adaptation scenarios are stagnating or in decline.
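As a quick sanity check, the growth figures quoted above can be reproduced with a few lines of arithmetic. The counts and decade boundaries are taken from the text; the dictionary labels are ours.

```python
# Publication counts per decade, as quoted in the text.
counts = {
    "peer-reviewed": {"1997-2007": 28, "2008 onward": 128},
    "grey":          {"1997-2007": 22, "2008 onward": 29},
}

for corpus, by_decade in counts.items():
    old, new = by_decade["1997-2007"], by_decade["2008 onward"]
    growth = (new - old) / old
    print(f"{corpus}: {new / old:.2f}x ({growth:.0%} growth)")

# peer-reviewed: 4.57x (357% growth)  -> more than quadrupled
# grey: 1.32x (32% growth)
```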

RQ3: For what purposes have existing data mining methodologies been adapted?

We address the third research question by analyzing what gaps the data mining methodology adaptations seek to fill and the benefits of such adaptations. We identified three adaptation scenarios, namely ‘Modification’, ‘Extension’, and ‘Integration’. Here, we analyze each of them.

Modification

Modifications of data mining methodologies are present in 30 peer-reviewed and 4 ‘grey’ literature studies. The analysis shows that modifications overwhelmingly consist of specific case studies. The major differentiating point compared to ‘as-is’ case studies, however, is the clear presence of specific adjustments to standard data mining process methodologies. Yet the proposed modifications and their purposes do not go beyond the phases of traditional data mining methodologies; they are granular, specialized, and executed at the level of tasks, sub-tasks, and deliverables. With modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report real implementations in IS/IT systems and business processes.

Further, this research subcategory is best classified by the domains in which the case studies were performed and the methodology modifications executed. We identified four distinct domain-driven applications, presented in Fig. 9.

[Figure 9: domain-driven applications of data mining methodology modifications]

IT, IS domain

The largest number of publications (14, or approximately 40%) concerns IT, IS security, software development, and specific data mining and processing topics. Authors address the intrusion detection problem in Hossain, Bridges & Vaughn (2003), Fan, Ye & Chen (2016) and Lee, Stolfo & Mok (1999); specialized algorithms for processing a variety of data types in Yang & Shi (2010), Chen et al. (2001), Yi, Teng & Xu (2016) and Pouyanfar & Chen (2016); and effective and efficient computer and mobile network management in Guan & Fu (2010), Ertek, Chi & Zhang (2017), Zaki & Sobh (2005), Chernov, Petrov & Ristaniemi (2015) and Chernov et al. (2014).

Manufacturing and engineering

The next most popular research area is manufacturing/engineering, with 10 case studies. The central topic here is high-technology manufacturing, for example the semiconductor-related study of Chien, Diaz & Lan (2014), and various complex prognostics case studies in the rail and aerospace domains concentrated on failure prediction (Létourneau et al., 2005; Zaluski et al., 2011). These are complemented by studies on equipment fault and failure prediction and maintenance (Kumar, Shankar & Thakur, 2018; Kang et al., 2017; Wang, 2017), as well as a monitoring system (García et al., 2017).

Sales and services, incl. financial industry

The third category comprises seven business application papers concerning customer service, targeting and advertising (Karimi-Majd & Mahootchi, 2015; Reutterer et al., 2017; Wang, 2017), credit risk assessment in financial services (Smith, Willis & Brooks, 2000), supply chain management (Nohuddin et al., 2018), and property management (Yu, Fung & Haghighat, 2013).

As a consequence of this specialization, these studies concentrate on developing ‘state-of-the-art’ solutions to their respective domain-specific problems.

Extension

The ‘Extension’ scenario was identified in 46 peer-reviewed and 12 ‘grey’ publications. We noted that extensions to existing data mining methodologies were executed with four major purposes:

  • Purpose 1: To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process, addressing model and algorithm deployment and implementation design (including architecture, workflows and corresponding IS integration). A complementary goal is to tackle changes to business processes so as to incorporate data mining into organizational activities.
  • Purpose 2: To implement complex, specifically designed systems and integrated business applications with a data mining model/solution as a component or tool. Typically, this adaptation is also oriented towards Big Data specifics and is complemented by proposed artifacts such as Big Data architectures, system models, workflows, and data flows.
  • Purpose 3: To implement data mining as part of integrated/combined specialized infrastructures, data environments and types (e.g., IoT, cloud, mobile networks).
  • Purpose 4: To incorporate context-awareness aspects.

The specific list of studies mapped to each of these purposes is presented in the Appendix (Table A1). The main purposes of adaptations, the associated gaps and/or benefits, and related observations and artifacts are documented in Fig. 10 below.

[Figure 10: main adaptation purposes for the ‘Extension’ scenario and the publications mapped to each purpose]

In the ‘Extension’ category, studies executed with Purpose 1 propose fully scaled, integrated data mining solutions comprising specific data mining models and the associated frameworks and processes. The distinctive trait of this research subclass is that it ensures the repeatability and reproducibility of the delivered data mining solution in different organizational and industry settings. Both the results of the data mining use case and its deployment and integration into IS/IT systems and the associated business process(es) are presented explicitly. Thus, this subclass is geared towards specific solution design, tackling a concrete business or industrial problem or addressing specific research gaps, thereby resembling a comprehensive case study.

This direction is well exemplified by the expert finder system for research social network services proposed by Sun et al. (2015), the data mining solution for functional test content optimization by Wang (2015), and the time-series mining framework for estimating unobservable time-series by Hu et al. (2010). Similarly, Du et al. (2017) tackle online log anomaly detection; automated association rule mining is addressed by Çinicioğlu et al. (2011), software effort estimation by Deng, Purvis & Purvis (2011), and visual discovery of network patterns by Simoff & Galloway (2008). A number of studies address solutions in the IS security (Shin & Jeong, 2005), manufacturing (Güder et al., 2014; Chee, Baharudin & Karkonasasi, 2016), materials engineering (Doreswamy, 2008), and business domains (Xu & Qiu, 2008; Ding & Daniel, 2007).

In contrast, ‘Extension’ studies executed for Purpose 2 concentrate on the design of complex, multi-component information systems and architectures. These are holistic systems and integrated business applications with a data mining framework serving as a component or tool. Moreover, the data mining methodology in these studies is extended with systems integration phases.

For example, Mobasher (2007) presents a data mining application in a Web personalization system and the associated process; here, the data mining cycle is extended in all phases, with the ultimate goal of leveraging multiple data sources and using the discovered models and corresponding algorithms in an automatic personalization system. The authors comprehensively address data processing, algorithm and design adjustments, and the respective integration into an automated system. Similarly, Haruechaiyasak, Shyu & Chen (2004) tackle the improvement of a Webpage recommender system by presenting an extended data mining methodology, including the design and implementation of the data mining model. A holistic view on web mining, with support for all data sources, integration of data warehousing and data mining techniques, and multiple problem-oriented analytical outcomes with rich business application scenarios (personalization, adaptation, profiling, and recommendations) in the e-commerce domain, was proposed and discussed by Büchner & Mulvenna (1998). Further, Singh et al. (2014) tackled a scalable implementation of a Network Threat Intrusion Detection System; in this study, the data mining methodology and resulting model are extended, scaled and deployed as a module of a quasi-real-time system for capturing Peer-to-Peer Botnet attacks. A similar complex solution was presented in a series of publications by Lee et al. (2000, 2001), who designed a real-time data mining-based Intrusion Detection System (IDS). These works are complemented by the comprehensive study of Barbará et al. (2001), who constructed an experimental testbed for intrusion detection with data mining methods. A detection model combining data fusion and mining, with respective components for Botnet identification, was developed by Kiayias et al. (2009). A similar approach is presented by Alazab et al. (2011), who proposed and implemented a zero-day malware detection system with an associated machine-learning-based framework. Finally, Ahmed, Rafique & Abulaish (2011) presented a multi-layer framework for fuzzy attacks in 3G cellular IP networks.

A number of authors have considered data mining methodologies in the context of Decision Support Systems and other systems that generate information for decision-making, across a variety of domains. For example, Kisilevich, Keim & Rokach (2013) executed a significant extension of the data mining methodology by designing an integrated Decision Support System (DSS) with six components, acting as a supporting tool for the hotel brokerage business to increase deal profitability. A similar approach is taken by Capozzoli et al. (2017), who focus on improving the energy management of properties through the provision of occupancy pattern information and a reconfiguration framework. Kabir (2016) presented a data mining information service providing improved sales forecasting to address the under/over-stocking problem, while Lau, Zhang & Xu (2018) addressed sales forecasting with sentiment analysis on Big Data. Kamrani, Rong & Gonzalez (2001) proposed a GA-based intelligent diagnosis system for fault diagnostics in the manufacturing domain; the latter topic was tackled further by Shahbaz et al. (2010) with a complex, integrated data mining system for diagnosing and solving manufacturing problems in real time.

Lenz, Wuest & Westkämper (2018) propose a framework for capturing data analytics objectives and creating holistic, cross-departmental data mining systems in the manufacturing domain. This work is representative of a cohort of studies that aim to extend data mining methodologies to support the design and implementation of enterprise-wide data mining systems. In the same research cohort, we classify Luna, Castro & Romero (2017), which presents a data mining toolset integrated into the Moodle learning management system with the aim of supporting university-wide learning analytics.

One study addresses a multi-agent based data mining concept. Khan, Mohamudally & Babajee (2013) developed a unified theoretical framework for data mining by formulating a unified data mining theory. The framework is tested by means of agent programming, proposing integration into a multi-agent system, which is useful due to its scalability, robustness and simplicity.

The subcategory of ‘Extension’ research executed with Purpose 3 is devoted to data mining methodologies and solutions in the specialized IT/IS, data and process environments which have emerged recently as a consequence of the development of Big Data technologies and tools. Exemplary studies include IoT-associated environment research, for example the Smart City IoT application presented by Strohbach et al. (2015). In the same domain, Bashir & Gill (2016) addressed IoT-enabled smart buildings, with the additional challenge of large amounts of high-speed real-time data and real-time analytics requirements; the authors proposed an integrated IoT Big Data Analytics framework. This research is complemented by the interdisciplinary study of Zhong et al. (2017), where IoT and wireless technologies are used to create an RFID-enabled environment producing KPI analyses to improve logistics.

A significant number of studies address various mobile environments, sometimes complemented by, or combined with, cloud-based environments. Gomes, Phua & Krishnaswamy (2013) addressed mobile data mining with execution on the mobile device itself; the proposed framework takes an innovative approach, extending all aspects of data mining, including contextual data, end-user privacy preservation, data management and scalability. Yuan, Herbert & Emamian (2014) and Yuan & Herbert (2014) introduced a cloud-based mobile data analytics framework with an application case study for a smart-home-based monitoring system. Cuzzocrea, Psaila & Toccu (2016) presented the innovative FollowMe suite, which implements a data mining framework for mobile social media analytics with several tools and the respective architecture and functionalities. An interesting paper was presented by Torres et al. (2017), who addressed a data mining methodology and its implementation for congestion prediction in mobile LTE networks, also tackling the feedback reaction that triggers network reconfigurations.

Further, Biliri et al. (2014) presented a cloud-based Future Internet Enabler: an automated social data analytics solution that also addresses Social Network Interoperability, supporting enterprises in interconnecting and utilizing social networks for collaboration. Real-time streamed social media data and the resulting data mining methodology and application were extensively discussed by Zhang, Lau & Li (2014), who proposed the design of a comprehensive ABIGDAD framework with seven main components implementing data-mining-based deceptive review identification. An interdisciplinary study tackling both of these topics was developed by Puthal et al. (2016), who proposed an integrated framework and architecture for a disaster management system based on streamed data in a cloud environment, ensuring end-to-end security. Additionally, key extensions to the data mining framework have been proposed, merging a variety of data sources and types, security verification and data flow access controls. Finally, cloud-based manufacturing was addressed in the context of fault diagnostics by Kumar et al. (2016).

Mahmood et al. (2013) tackled Wireless Sensor Networks and the required extensions to the associated data mining framework. Interesting work was carried out by Nestorov & Jukic (2003), addressing the rarely covered topic of integrating data mining solutions within traditional data warehouses and actively mining the data repositories themselves.

Supported by a new generation of visualization technologies (including Virtual Reality environments), Wijayasekara, Linda & Manic (2011) proposed and implemented CAVE-SOM, a 3D visual data mining framework which offers interactive, immersive visual data mining with multiple visualization modes supported by a plethora of methods. An earlier version of a visual data mining framework was successfully developed and presented by Ganesh et al. (1996) as early as 1996.

Large-scale social media data is successfully tackled by Lemieux (2016) with a comprehensive framework accompanied by a set of data mining tools and an interface. Real-time data analytics was addressed by Shrivastava & Pal (2017) in the domain of enterprise service ecosystems. Image data was addressed by Huang et al. (2002), who proposed a multimedia data mining framework and its implementation with user relevance feedback integration and instance learning. Further, the explosion of data diversity and the associated need to extend standard data mining are addressed by Singh et al. (2016) in a study devoted to object detection in video surveillance systems supporting real-time video analysis.

Finally, there is also a limited number of studies which address context awareness (Purpose 4) and extend the data mining methodology with context elements and adjustments. In comparison with research in the ‘Integration’ category, these studies operate at a lower abstraction level, capturing and presenting lists of adjustments. Singh, Vajirkar & Lee (2003) generate a taxonomy of context factors, develop an extended data mining framework and propose a deployment including a detailed IS architecture. The context-awareness aspect is also addressed in papers reviewed above, for example Lenz, Wuest & Westkämper (2018), Kisilevich, Keim & Rokach (2013), and Sun et al. (2015).

Integration

The ‘Integration’ adaptation scenario was identified in 27 peer-reviewed and 17 ‘grey’ studies. Our analysis revealed that, at a higher abstraction level, this adaptation scenario is typically executed with five key purposes:

  • Purpose 1: to integrate/combine with various ontologies existing in the organization.
  • Purpose 2: to introduce context-awareness and incorporate domain knowledge .
  • Purpose 3: to integrate/combine with frameworks, process methodologies and concepts from other research or industry domains.
  • Purpose 4: to integrate/combine with other well-known organizational governance frameworks, process methodologies and concepts .
  • Purpose 5: to accommodate and/or leverage upon newly available Big Data technologies, tools and methods.

The specific list of studies mapped to each of these purposes is presented in the Appendix (Table A2). The main purposes of adaptations, the associated gaps and/or benefits, and related observations and artifacts are documented in Fig. 11 below.

[Figure 11: main adaptation purposes for the ‘Integration’ scenario and the publications mapped to each purpose]

As mentioned, a number of studies concentrate on proposing ontology-based integrated data mining frameworks accompanied by various types of ontologies (Purpose 1). For example, Sharma & Osei-Bryson (2008) focus on an ontology-based organizational view with Actors, Goals and Objectives, which supports the execution of the business understanding phase. Brisson & Collard (2008) propose the KEOPS framework, which is CRISP-DM compliant and integrates a knowledge base and ontology, with the purpose of building an ontology-driven information system (OIS) for the business and data understanding phases, while the knowledge base is used in the post-processing step of model interpretation. Park et al. (2017) propose and design IRIS, a comprehensive ontology-based data analytics tool aimed at aligning analytics and business. IRIS is based on the concept of connecting dots, i.e., analytics methods, or transforming insights into business value, and supports a standardized process for applying ontology to match business problems and solutions.

Further, Ying et al. (2014) propose a domain-specific data mining framework oriented to the business problem of customer demand discovery. They construct an ontology for customer demand and for the customer demand discovery task, which allows structured knowledge extraction in the form of knowledge patterns and rules. Here, the purpose is to facilitate business value realization and support the actionability of extracted knowledge via marketing strategies and tactics. In the same vein, Cannataro & Comito (2003) presented an ontology for the data mining domain whose main goal is to simplify the development of distributed knowledge discovery applications. The authors offered domain experts a reference model for different kinds of data mining tasks, methodologies, and software capable of solving a given business problem and finding the most appropriate solution.

Apart from ontologies, in another study Sharma & Osei-Bryson (2009) propose an IS-inspired data mining methodology driven by an input-output model, which supports the formal implementation of the business understanding phase. This research exemplifies studies executed with Purpose 2; the goal of the paper is to tackle the prescriptive nature of CRISP-DM and address how the entire process can be implemented. The study by Cao, Schurmann & Zhang (2005) is also exemplary in aggregating and introducing several fundamental concepts into the traditional CRISP-DM data mining cycle: context awareness, in-depth pattern mining, human-machine cooperative knowledge discovery (in essence, following the human-centricity paradigm in data mining), and a loop-closed iterative refinement process (similar to Agile-based methodologies in software development). Several further concepts, such as data, domain, interestingness, and rules, are proposed to tackle a number of fundamental constraints identified in CRISP-DM. These have been discussed and further extended by Cao & Zhang (2007, 2008) and Cao (2010) into an integrated domain-driven data mining concept, resulting in the fully fledged D3M (domain-driven) data mining framework. Interestingly, the same concepts are investigated individually by other authors; for example, a context-aware data mining methodology is tackled by Xiang (2009a, 2009b) in the context of the financial sector. Pournaras et al. (2016) addressed the crucial topic of privacy preservation in the context of achieving an effective data analytics methodology; the authors introduced metrics and a self-regulatory (reconfigurable) information-sharing mechanism providing customers with controls for information disclosure.

A number of studies have proposed CRISP-DM adjustments based on existing frameworks, process models or concepts originating in other domains (Purpose 3), for example, software engineering ( Marbán et al., 2007 , 2009 ; Marban, Mariscal & Segovia, 2009 ) and industrial engineering ( Solarte, 2002 ; Zhao et al., 2005 ).

Meanwhile, Mariscal, Marbán & Fernández (2010) proposed a new refined data mining process based on a global comparative analysis of existing frameworks, while Angelov (2014) outlined a data analytics framework based on statistical concepts. Following a similar approach, some researchers suggest explicit integration with other areas and organizational functions, for example BI-driven data mining by Hang & Fong (2009). Similarly, Chen, Kazman & Haziyev (2016) developed an architecture-centric agile Big Data analytics methodology and an architecture-centric agile analytics and DevOps model. Alternatively, several authors tackled data mining methodology adaptations in other domains, for example educational data mining (Tavares, Vieira & Pedro, 2017), decision support in learning management systems (Murnion & Helfert, 2011), and accounting systems (Amani & Fadlalla, 2017).

Other studies are concerned with actionability of data mining and closer integration with business processes and organizational management frameworks (Purpose 4). In particular, there is a recurrent focus on embedding data mining solutions into knowledge-based decision making processes in organizations, and supporting fast and effective knowledge discovery ( Bohanec, Robnik-Sikonja & Borstnar, 2017 ).

Examples of adaptations made for this purpose include: (1) integration of CRISP-DM with the Balanced Scorecard framework used for strategic performance management in organizations (Yun, Weihua & Yang, 2014); (2) integration with a strategic decision-making framework for revenue management (Segarra et al., 2016); (3) integration with a strategic analytics methodology (Van Rooyen & Simoff, 2008); and (4) integration with a so-called ‘Analytics Canvas’ for the management of portfolios of data analytics projects (Kühn et al., 2018). Finally, Ahangama & Poo (2015) explored methodological attributes important for the adoption of data mining methodologies by novice users; this study uncovered factors that could reduce resistance to the use of data mining methodologies. Conversely, Lawler & Joseph (2017) comprehensively evaluated factors that may increase the benefits of Big Data Analytics projects in an organization.

Lastly, a number of studies have proposed adaptations of data mining frameworks (e.g., CRISP-DM) to cater for new technological architectures, new types of datasets and new applications (Purpose 5). For example, Lu et al. (2017) proposed a data mining system based on a Service-Oriented Architecture (SOA), Zaghloul, Ali-Eldin & Salem (2013) developed a concept of self-service data analytics, Osman, Elragal & Bergvall-Kåreborn (2017) blended CRISP-DM into a Big Data Analytics framework for Smart Cities, and Niesen et al. (2016) proposed a data-driven risk management framework for Industry 4.0 applications.

Our analysis of RQ3, regarding the purposes of adaptations of existing data mining methodologies, revealed the following key findings. Firstly, adaptations of type ‘Modification’ are predominantly targeted at addressing problems specific to a given case study; the majority of modifications were made within the domain of IS security, followed by case studies in the manufacturing and financial services domains. Secondly, in clear contrast, adaptations of type ‘Extension’ are primarily aimed at customizing the methodology to take into account specialized development environments and deployment infrastructures, and at incorporating context-awareness aspects. Thirdly, a recurrent purpose of adaptations of type ‘Integration’ is to combine a data mining methodology either with existing ontologies in an organization or with other domain frameworks, methodologies, and concepts. ‘Integration’ is also used to instill context-awareness and domain knowledge into a data mining methodology, or to adapt it to specialized methods and tools, such as Big Data. The distinctive outcomes and value (gaps filled) of ‘Integration’ stem from improved knowledge discovery, better actionability of results, closer combination with key organizational processes and domain-specific methodologies, and improved usage of Big Data technologies.

We discovered that the adaptations of existing data mining methodologies found in the literature can be classified into three categories: modification, extension, or integration.

We also noted that adaptations are executed either to address deficiencies and the lack of important elements or aspects in the reference methodology (chiefly CRISP-DM), or to improve certain phases, deliverables or process outcomes.

In short, adaptations are made to:

  • improve key phases of the reference data mining methodologies; in the case of CRISP-DM, these are primarily the business understanding and deployment phases.
  • support knowledge discovery and actionability.
  • introduce context-awareness and higher degree of formalization.
  • integrate the data mining solution more closely with key organizational processes and frameworks.
  • significantly update CRISP-DM with respect to Big Data technologies, tools, environments and infrastructure.
  • incorporate broader, explicit context of architectures, algorithms and toolsets as integral deliverables or supporting tools to execute data mining process.
  • expand and accommodate broader unified perspective for incorporating and implementing data mining solutions in organization, IT infrastructure and business processes.

Threats to Validity

Systematic literature reviews have inherent limitations that must be acknowledged. These threats to validity include subjective bias (internal validity) and incompleteness of search results (external validity).

The internal validity threat stems from the subjective screening and rating of studies, particularly when assessing the studies with respect to relevance and quality criteria. We have mitigated these effects by documenting the survey protocol (SLR Protocol), strictly adhering to the inclusion criteria, and performing significant validation procedures, as documented in the Protocol.

The external validity threat relates to the extent to which the findings of the SLR reflect the actual state of the art in the field of data mining methodologies, given that the SLR only considers published studies that can be retrieved using specific search strings and databases. We have addressed this threat to validity by conducting trial searches to validate our search strings in terms of their ability to identify relevant papers that we knew about beforehand. Also, the fact that the searches led to 1,700 hits overall suggests that a significant portion of the relevant literature has been covered.

Conclusion

In this study, we have examined the use of data mining methodologies by means of a systematic literature review covering both peer-reviewed and ‘grey’ literature. We have found that the use of data mining methodologies, as reported in the literature, has grown substantially since 2007 (a four-fold increase relative to the previous decade). We have also observed that data mining methodologies were predominantly applied ‘as-is’ from 1997 to 2007. This trend was reversed from 2008 onward, when the use of adapted data mining methodologies gradually started to replace ‘as-is’ usage.

The most frequent adaptations have been in the ‘Extension’ category. This category refers to adaptations that imply significant changes to key phases of the reference methodology (chiefly CRISP-DM). These adaptations particularly target the business understanding, deployment and implementation phases of CRISP-DM (or other methodologies). Moreover, we have found that the most frequent purposes of adaptations are: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). A key finding is that standard data mining methodologies do not pay sufficient attention to the deployment aspects required to scale and transform data mining models into software products integrated into large IT/IS systems and business processes.

Apart from the adaptations in the ‘Extension’ category, we have also identified an increasing number of studies focusing on the ‘Integration’ of data mining methodologies with other domain-specific and organizational methodologies, frameworks, and concepts. These adaptations are aimed at embedding the data mining methodology into broader organizational aspects.

Overall, the findings of the study highlight the need to develop refinements of existing data mining methodologies that would allow them to seamlessly interact with IT development platforms and processes (technological adaptation) and with organizational management frameworks (organizational adaptation). In other words, there is a need to frame existing data mining methodologies as being part of a broader ecosystem of methodologies, as opposed to the traditional view where data mining methodologies are defined in isolation from broader IT systems engineering and organizational management methodologies.

Supplemental Information

Supplemental Information 1

Unfortunately, we were not able to upload the original graph files (PNG format). Based on the PeerJ Overleaf template, we constructed the graph files following the template examples, but were unable to determine why they did not fit; redoing them in new formats would change the text flow and the generated PDF file. We therefore submit the graphs in an archived file as part of the supplementary material, and will redo them based on further instructions from you.

Supplemental Information 2

The file starts with a Definitions page, which lists and explains all column definitions as well as the SLR scoring metrics. The second page contains the "peer-reviewed" texts, while the next one contains the "grey" literature corpus.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare that they have no competing interests.

Veronika Plotnikova conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Marlon Dumas conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Fredrik Milani conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Primary Sources

Research on the Application of Data Mining in Corporate Financial Management

  • Conference paper
  • First Online: 30 January 2024

  • Zhen Chen

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 1133))

Included in the following conference series:

  • International Conference on Frontier Computing

66 Accesses

The financial management of a company is the process of allocating resources to maximize the company's profits. The success of any company depends on how effectively it allocates resources. This involves deciding what and how much to produce, when to produce it, what costs should be incurred, who is responsible for paying those costs, and whether they are necessary. These decisions must be made in consideration of several factors, such as production cost, sales volume and market conditions. The process of managing the company's finances includes forecasting, budgeting and control. Financial management is a key function in every business and must be implemented for the organization to run smoothly and effectively. In this article, I discuss how data mining can help in this area by predicting future trends and better estimating the budget set for each department in the company.




Author information

Authors and affiliations.

Software Engineering Institute of Guangzhou, Guangzhou, 510990, Guangdong, China


Corresponding author

Correspondence to Zhen Chen .

Editor information

Editors and affiliations.

Department of Computer Science and Information Engineering, National Taichung University of Science and Technology, Taichung City, Taiwan

Jason C. Hung

School of Computer Science and Engineering, University of Aizu, Aizuwakamatsu, Japan

Jia-Wei Chang


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Chen, Z. (2024). Research on the Application of Data Mining in Corporate Financial Management. In: Hung, J.C., Yen, N., Chang, JW. (eds) Frontier Computing on Industrial Applications Volume 3. FC 2023. Lecture Notes in Electrical Engineering, vol 1133. Springer, Singapore. https://doi.org/10.1007/978-981-99-9416-8_26


DOI : https://doi.org/10.1007/978-981-99-9416-8_26

Published : 30 January 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-99-9415-1

Online ISBN : 978-981-99-9416-8



  • Open access
  • Published: 18 June 2024

Using GPT-4 to write a scientific review article: a pilot evaluation study

  • Zhiping Paul Wang 1 ,
  • Priyanka Bhandary 1 ,
  • Yizhou Wang 1 &
  • Jason H. Moore 1  

BioData Mining volume 17, Article number: 16 (2024)

187 Accesses

5 Altmetric


GPT-4, as the most advanced version of OpenAI’s large language models, has attracted widespread attention, rapidly becoming an indispensable AI tool across various areas. This includes its exploration by scientists for diverse applications. Our study focused on assessing GPT-4’s capabilities in generating text, tables, and diagrams for biomedical review papers. We also assessed the consistency in text generation by GPT-4, along with potential plagiarism issues when employing this model for the composition of scientific review papers. Based on the results, we suggest the development of enhanced functionalities in ChatGPT, aiming to meet the needs of the scientific community more effectively. This includes enhancements in uploaded document processing for reference materials, a deeper grasp of intricate biomedical concepts, more precise and efficient information distillation for table generation, and a further refined model specifically tailored for scientific diagram creation.


Introduction

A comprehensive review of a research field can significantly aid researchers in quickly grasping the nuances of a specific domain, leading to well-informed research strategies, efficient resource utilization, and enhanced productivity. However, the process of writing such reviews is intricate, involving multiple time-intensive steps. These include the collection of relevant papers and materials, the distillation of key points from potentially hundreds or even thousands of sources into a cohesive overview, the synthesis of this information into a meaningful and impactful knowledge framework, and the illumination of potential future research directions within the domain. Given the breadth and depth of biomedical research—one of the most expansive and dynamic fields—crafting a literature review in this area can be particularly challenging and time-consuming, often requiring months of dedicated effort from domain experts to sift through the extensive body of work and produce a valuable review paper [ 1 , 2 ].

The swift progress in Natural Language Processing (NLP) technology, particularly with the rise of Generative Pre-trained Transformers (GPT) and other Large Language Models (LLMs), has equipped researchers with a potent tool for swiftly processing extensive literature. A recent survey indicates that ChatGPT has become an asset for researchers across various fields [ 3 ]. For instance, a PubMed search for “ChatGPT” yielded over 1,400 articles with ChatGPT in their titles as of November 30th, 2023, marking a significant uptake just one year after ChatGPT’s introduction.

The exploration of NLP technology’s capability to synthesize scientific publications into comprehensive reviews is ongoing. The interest in ChatGPT’s application across scientific domains is evident. Studies have evaluated ChatGPT’s potential in clinical and academic writing [ 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 ], and discussions are underway about its use as a scientific review article generator [ 11 , 12 , 13 ]. However, many of these studies predate the release of the more advanced GPT-4, which may render their findings outdated. In addition, there is no study specifically evaluating ChatGPT (GPT-4) for writing biomedical review papers.

As the applications of ChatGPT are explored, the scientific community is also examining the evolving role of AI in research. Unlike any tool previously utilized in the history of science, ChatGPT has been accorded a role akin to that of a scientist, even being credited as an author in scholarly articles [ 14 ]. This development has sparked ethical debates. While thorough evaluations of the quality of AI-generated scientific review articles are yet to be conducted, some AI tools, such as Scopus AI [ 15 ], are already being employed to summarize and synthesize knowledge from scientific literature databases. However, these tools often come with disclaimers cautioning users about the possibility of AI generating erroneous or offensive content. Concurrently, as ChatGPT’s potential contributions to science are probed, concerns about the possible detrimental effects of ChatGPT and other AI tools on scientific integrity have been raised [ 16 ]. These considerations highlight the necessity for more comprehensive evaluations of ChatGPT from various perspectives.

In this study, we hypothesized that ChatGPT can compose text, tables and figures for a biomedical research paper using two cancer research papers as benchmarks. To test this hypothesis, we used the first paper [ 17 ] to prompt ChatGPT to generate the main ideas and summarize text. Next, we used the second paper [ 18 ] to assess its ability to create tables and figures/graphs. We simulated the steps a scientist would take in writing a cancer research review and assessed GPT-4’s performance at each stage. Our findings are presented across four dimensions: the ability to summarize insights from reference papers on specific topics, the semantic similarity of GPT-4 generated text to benchmark texts, the projection of future research directions based on current publications, and the synthesis of context in the form of tables and graphs. We conclude with a discussion of our overall experience and the insights gained from this study.

Review text content generation by ChatGPT

The design of this study aims to replicate the process a scientist undergoes when composing a biomedical review paper. This involves the meticulous collection, examination, and organization of pertinent references, followed by the articulation of key topics of interest into a structured format of sections, subsections, and main points. The scientist then synthesizes information from the relevant references to develop a comprehensive narrative. A primary objective of this study is to assess ChatGPT’s proficiency in distilling insights from references into coherent text. To this end, a review paper on sex differences in cancer [ 17 ] was chosen as a benchmark, referred to as BRP1 (Benchmark Review Paper 1). Using BRP1 for comparison, ChatGPT’s content generation was evaluated across three dimensions: (1) summarization of main points; (2) generation of review content for each main point; and (3) synthesis of information from references to project future research directions.

Main point summarization

The effectiveness of GPT-4 in summarizing information was tested by providing it with the 113 reference articles from BRP1 to generate a list of potential sections for a review paper. The generated sections were then compared with BRP1’s actual section titles for coverage evaluation (Fig.  1 (A)). Additionally, GPT-4 was tasked with creating possible subsections using the BRP1 section titles and reference articles, which were compared with the actual subsection titles in BRP1.

Review content generation

The review content generation test compared GPT-4's ability to summarize a given point with the actual text content from BRP1 (Fig. 1(B)). BRP1 comprises three sections with seven subsections, presenting a total of eight main points. The corresponding text content for each point was manually extracted from BRP1. Three strategies were employed for GPT-4 to generate detailed elaborations of these main points: (1) providing a point only in the prompt, for baseline content generation; (2) feeding all references used by BRP1 to GPT-4, for reference-based content generation; (3) using only the references corresponding to a main point, i.e., the articles referred to in the relevant subsection of BRP1, to generate the content for that point. The semantic similarity of the text generated by these strategies was then compared with the manually extracted content from BRP1.

[Figure 1: (A) GPT-4 summarizes sections and subsections; (B) GPT-4 generated review content evaluation]

Projections on future research

The section on “outstanding questions” in the Concluding Remarks of BRP1 serves a dual purpose: it summarizes conclusions and sets a trajectory for future research into sex differences in cancer. This is a common feature in biomedical review papers, where a forward-looking analysis is synthesized from the main discussions within the paper. The pivotal inquiry is whether ChatGPT, without further refinement, can emulate this forward projection using all referenced articles. The relevance of such a projection is contingent upon its alignment with the main points and references of the review. Moreover, it raises the question of whether the baseline GPT-4 LLM would perform comparably.

To address these queries, all references from BRP1 were inputted into GPT-4 to generate a section akin to Concluding Remarks, encompassing a description of sex differences in cancer, future work, and potential research trajectories. Additionally, three distinct strategies were employed to assess GPT-4’s ability to formulate specific “outstanding questions,” thereby evaluating ChatGPT’s predictive capabilities for future research. These strategies involved uploading all BRP1 reference articles to GPT-4 for projection: (1) without any contextual information; (2) with the inclusion of BRP1’s main points; (3) with a brief description of broad areas of interest. The outputs from these strategies, along with the base model’s output—GPT-4 without reference articles—were juxtaposed with BRP1’s original “outstanding questions” for comparison.

Data processing

ChatGPT query

In initiating this study, we utilized the ChatGPT web application ( https://chat.openai.com/ ). However, we encountered several limitations that impeded our progress:

A cap of ten file uploads, which restricts the analysis of content synthesized from over ten articles.

A file size limit of 50 MB, hindering the consolidation of multiple articles into a single file to circumvent the upload constraint.

Inconsistencies in text file interpretation when converted from PDF format, rendering the conversion of large PDFs to smaller text files ineffective.

Anomalies in file scanning, where ChatGPT would occasionally process only one of several uploaded files, despite instructions to utilize all provided files.

Due to these constraints, we transitioned to using GPT-4 API calls for all tests involving document processing. The GPT-4 API accommodates up to twenty file uploads simultaneously, efficiently processes text files converted from PDFs, and demonstrates reliable file scanning for multiple documents. The Python code, ChatGPT prompts, and outputs pertinent to this study are available in the supplementary materials.

The web version of ChatGPT cannot read from all the PDFs uploaded and is able to process only a subset of them. However, the API version of ChatGPT was set up to be able to upload and process 20 PDFs at a time. Several validation tests were carried out to make sure that it is able to read from all of them equally well. One common validation test was to ask ChatGPT if it could reiterate the Methods section of the 18th PDF and so on. This test was carried out randomly multiple times with a different PDF each time to see if ChatGPT is truly able to upload and process the PDFs.
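The following sketch illustrates the kind of API-based workflow described above. It assumes the OpenAI Assistants API roughly as available around the time of the study (file uploads with a retrieval tool, up to 20 files per assistant); model names, file names, and exact parameters may differ across SDK versions, so treat this as illustrative rather than a record of the authors' exact code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload reference articles (file names here are hypothetical).
file_ids = [
    client.files.create(file=open(path, "rb"), purpose="assistants").id
    for path in ["brp1_ref_01.pdf", "brp1_ref_02.pdf"]
]

# Create an assistant that can retrieve content from the uploaded files.
assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",        # assumed GPT-4 snapshot
    tools=[{"type": "retrieval"}],
    file_ids=file_ids,
)

# Ask for a section outline grounded in the uploaded references.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Using all uploaded reference articles, propose sections "
            "for a review paper on sex differences in cancer.",
)
run = client.beta.threads.runs.create(thread_id=thread.id,
                                      assistant_id=assistant.id)
```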

Text similarity comparison

To assess text content similarity, we employed a transformer network-based pre-trained model [ 19 ] to calculate the semantic similarity between the original text in BRP1 and the text generated by GPT-4. We utilized the util.pytorch_cos_sim function from the sentence_transformers package to compute the cosine similarity of semantic content. Additionally, we conducted a manual validation where one of the authors compared the two texts and then categorized the similarity between the GPT-4 generated content and the original BRP1 content into three distinct levels: semantically very similar (Y), partially similar (P), and not similar (N).
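In code, this comparison amounts to encoding both passages and taking the cosine similarity of their embeddings. A minimal sketch using the sentence_transformers package is shown below; the checkpoint name is a placeholder, since the text cites a transformer-based pre-trained model [19] without naming the specific checkpoint here.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder checkpoint; substitute the model actually used in the study.
model = SentenceTransformer("all-MiniLM-L6-v2")

original_text = "..."   # subsection text manually extracted from BRP1
generated_text = "..."  # GPT-4 output for the same main point

emb_original = model.encode(original_text, convert_to_tensor=True)
emb_generated = model.encode(generated_text, convert_to_tensor=True)

# Cosine similarity of the semantic embeddings, as used in the study.
score = util.pytorch_cos_sim(emb_original, emb_generated).item()
print(f"semantic similarity: {score:.3f}")
```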

Reproducibility and plagiarism evaluation

The inherent randomness in ChatGPT’s output, attributable to the probabilistic nature of large language models (LLMs), necessitates the validation of reproducibility for results derived from ChatGPT outputs. To obtain relatively consistent responses from ChatGPT, it is advantageous to provide detailed context within the prompt, thereby guiding the model towards the desired response. Consequently, we replicated two review content generation tests, as depicted in Fig.  1 (B)—one based on point references and the other on the GPT-4 base model—one week apart using identical reference articles and prompts via API calls to GPT-4. The first test aimed to evaluate the consistency of file-based content generation by GPT-4, while the second assessed the base model. We compared the outputs from the subsequent run with those from the initial run to determine the reproducibility of the text content generated by ChatGPT.

Prior to considering the utilization of ChatGPT for generating content suitable for publication in a review paper, it is critical to address potential plagiarism concerns. The pivotal question is whether text produced by GPT-4 would be flagged as plagiarized by anti-plagiarism software. In this study, GPT-4 generated a substantial volume of text, particularly for the text content comparison test (Fig.  1 (B)). We subjected both the base model-generated review content and the reference-based GPT-4 review content to scrutiny using iThenticate to ascertain the presence of plagiarism.

Table and figure generation by ChatGPT

Review papers often distill the content from references into tables and further synthesize this information into figures. In this study, we evaluated ChatGPT's proficiency in generating content in tabular and diagrammatic formats, using benchmark review paper 2 (BRP2) [ 18 ] as a reference, as illustrated in Fig.  2 . The authors of BRP2 developed the seminal Cancer-Immunity Cycle concept, encapsulated in a cycle diagram, which has since become a structural foundation for research in cancer immunotherapy.

Table content generation

Analogous to the file scan anomaly, ChatGPT may disproportionately prioritize one task over others when presented with multiple tasks simultaneously. To mitigate this in the table generation test, we adopted a divide-and-conquer approach, submitting separate GPT-4 prompts to generate content for each column of the table. This strategy facilitated the straightforward assembly of the individual outputs into a comprehensive table, either through GPT-4 or manual compilation.

In BRP2, eleven reference articles were utilized to construct a table (specifically, Table 1 of BRP2) that categorized positive and negative regulators at each stage of the Cancer-Immunity Cycle. These articles were compiled and inputted into ChatGPT, prompting GPT-4 to summarize information for the corresponding table columns: Steps, Stimulators, Inhibitors, Other Considerations, and Example References. The content for each column was generated through separate GPT-4 API calls and subsequently compared manually with the content of the original BRP2 table. Semantic similarity and manual validations were carried out for each row of Table 1 from BRP2. With the API version, we uploaded the references cited within the corresponding row of the table and used them to generate the contents of that row.
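A minimal sketch of this divide-and-conquer strategy follows. It uses a plain chat-completions call as a stand-in for the file-based workflow sketched earlier; the column names are those of Table 1 in BRP2, while the wrapper function and prompt wording are hypothetical.

```python
from openai import OpenAI

client = OpenAI()

def ask_gpt4(prompt: str) -> str:
    """Hypothetical wrapper; the study attached the eleven reference
    articles via file upload rather than inlining them in the prompt."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# One prompt per column; the outputs are assembled into a table afterwards.
columns = ["Steps", "Stimulators", "Inhibitors",
           "Other Considerations", "Example References"]
table = {
    column: ask_gpt4(
        f"From the reference articles on the Cancer-Immunity Cycle, "
        f"summarize the '{column}' entry for each step of the cycle."
    )
    for column in columns
}
```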

Diagram creation

ChatGPT is primarily designed for text handling, yet its capabilities in graph generation are increasingly being explored [ 20 ]. DALL-E, the model utilized by ChatGPT for diagram creation, has been trained on a diverse array of images, encompassing various subjects, styles, contexts, and including scientific and technical imagery. To direct ChatGPT towards producing a diagram that closely aligns with the intended visualization, a precise and succinct description of the diagram is essential. Like the approach for table generation, multiple prompts may be required to facilitate incremental revisions in the drawing process.

In this evaluation, we implemented three distinct strategies for diagram generation, as demonstrated in Fig.  2 . Initially, the 11 reference articles used for table generation were also employed by GPT-4 to generate a description for the cancer immunity cycle, followed by the creation of a diagrammatic representation of the cycle by GPT-4. This approach not only tested the information synthesis capability of GPT-4 but also its diagram drawing proficiency. Secondly, we extracted the paragraph under the section titled ‘The Cancer-Immunity Cycle’ from BRP2 to serve as the diagram description. Terms indicative of a cyclical structure, such as ‘cycle’ and ‘step 1 again,’ were omitted from the description prior to its use as a prompt for diagram drawing. This tested GPT-4’s ability to synthesize the provided information into an innovative cyclical structure for cancer immunotherapy. Lastly, the GPT-4 base model was tasked with generating a cancer immunity mechanism and its diagrammatic representation without any given context. The diagrams produced through these three strategies were scrutinized and compared with the original cancer immunity cycle figure in BRP2 to assess the scientific diagram drawing capabilities of GPT-4.
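For completeness, the sketch below shows how a diagram description could be passed to DALL-E programmatically. The study drove drawing through ChatGPT prompts, so this images-endpoint call is only an illustrative analogue, and the description text is an abbreviated stand-in for the cycle description extracted from BRP2.

```python
from openai import OpenAI

client = OpenAI()

# Abbreviated stand-in for the Cancer-Immunity Cycle description.
description = (
    "A cyclical scientific diagram of the cancer-immunity mechanism, "
    "from release of cancer cell antigens through T-cell killing of "
    "cancer cells, with labeled arrows between the steps."
)

image = client.images.generate(
    model="dall-e-3",   # the image model used by ChatGPT
    prompt=description,
    size="1024x1024",
    n=1,
)
print(image.data[0].url)  # URL of the generated diagram
```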

Figure 2. GPT-4 table generation and figure creation.

Results and discussion

Main point summary

As depicted in Fig.  1 A, GPT-4 generated nine potential sections for a proposed paper entitled ‘The Spectrum of Sex Differences in Cancer,’ utilizing the 113 reference articles uploaded, which encompassed all three sections in BRP1. Upon request to generate possible subsections using BRP1 section titles and references, GPT-4 produced four subsections for each section, totaling twelve subsections that encompassed all seven subsections in BRP1. Detailed information regarding GPT-4 prompts, outputs, and comparisons with BRP1 section and subsection titles is provided in the supplementary materials.

The results suggest that ChatGPT can effectively summarize the key points from a comprehensive list of documents, which is particularly beneficial when composing a review paper that references hundreds of articles. With ChatGPT’s assistance, authors can swiftly summarize a list of main topics for further refinement, organization, and editing. Once the topics are finalized, GPT-4 can easily summarize different aspects for each topic, aiding authors in organizing the subsections. This indicates a novel approach to review paper composition that could be more efficient and productive than traditional methods. It represents a collaborative effort between ChatGPT and the review writer, with ChatGPT sorting and summarizing articles, and the author conducting high-level and creative analysis and editing.

During this evaluation, one limitation of GPT-4 was identified: its inability to provide an accurate list of articles referenced for point generation. This presents a challenge in developing an automated pipeline that enables both information summarization and file classification.

Figure  3 illustrates a sample of the text content generation, including the original BRP1 text, the prompt, and ChatGPT’s output. The evaluation results for GPT-4’s review content generation are presented in Table  1 (refer to Fig.  1 B). When generating review content using corresponding references as in BRP1, GPT-4 achieved an average similarity score of 0.748 with the original content in BRP1 across all main points. Manual similarity validation confirmed that GPT-4 generated content that was semantically similar for all 8 points, with 6 points matching very well (Y) and 2 points matching partially (P). When utilizing all reference articles for GPT-4 to generate review content for a point, the mean similarity score was slightly lower at 0.699, with a manual validation result of 5Y3P. The results from the GPT-4 based model were comparable to the corresponding reference-based results, with a mean similarity score of 0.755 and a 6Y2P manual validation outcome.
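A minimal sketch of how such similarity scores can be computed is given below. The methodology cites Sentence-BERT [19]; the specific checkpoint and the stand-in sentences here are assumptions for illustration.

```python
# Hedged sketch of semantic-similarity scoring between GPT-4 output and
# the original BRP1 text. The paper cites Sentence-BERT [19]; the exact
# checkpoint is not stated, so 'all-MiniLM-L6-v2' is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

gpt4_text = "Sex differences influence cancer incidence and outcomes."  # stand-in
brp1_text = "Cancer incidence and outcome differ between the sexes."    # stand-in

embeddings = model.encode([gpt4_text, brp1_text], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")  # ~0.7+ indicates a strong match
```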

Figure 3. Text generation using GPT-4 with specific references. (A) Original section in BRP1; (B) prompt for the same section; (C) response from GPT-4.

As the GPT-4 base model has been trained on an extensive corpus of scientific literature, including journals and articles that explore sex differences in cancer, it is plausible for it to generate text content similar to the original review paper, even for a defined point without any contextual input. The performance when using corresponding references is notably better than when using all references, suggesting that GPT-4 processes information more effectively with relevant and less noisy input.

The similarity score represents only the level of semantic similarity between the GPT-4 output and the original review paper text. It should not be construed as a measure of the quality of the text content generated by GPT-4. While it is relatively straightforward to assess the relevance of content for a point, gauging comprehensiveness is nearly impossible without a gold standard. However, scientific review papers are often required in research areas where such standards do not yet exist. Consequently, this review content similarity test merely indicates whether GPT-4 can produce text content that is semantically akin to that of a human scholar. Based on the results presented in Table  1 , GPT-4 has demonstrated adequate capability in this regard.

In this evaluation, GPT-4 initially synthesized content analogous to the Concluding Remarks section of BRP1 by utilizing all reference articles, further assessing its capability to integrate information into coherent conclusions. Subsequently, GPT-4 projected future research directions using three distinct methodologies. The findings, as detailed in Table  2 , reveal that GPT-4’s content generation performance significantly increased from 0.45 to 0.71 upon the integration of all pertinent references, indicating that the provision of relevant information markedly enhances the model’s guidance. Consequently, although GPT-4 may face challenges in precisely replicating future research due to thematic discrepancies, equipping it with a distinct theme can empower it to produce content that more accurately represents the intended research trajectory. In contrast, the performance of the GPT-4 base model remained comparably stable, regardless of additional contextual cues. Manual verification confirmed GPT-4’s ability to synthesize information from the provided documents and to make reasonably accurate predictions about future research trajectories.

Reproducibility

The comparative analysis of GPT-4 outputs from different runs is presented in Table  3 . Based on previous similarity assessments, a similarity score of 0.7 is generally indicative of a strong semantic correlation in the context of this review paper. In this instance, GPT-4 outputs using corresponding references exhibited an average similarity score of 0.8 between two runs, while the base model scored 0.9. A manual review confirmed that both outputs expressed the same semantic meaning at different times. Consequently, it can be concluded that GPT-4 consistently generates uniform text responses when provided with identical prompts and reference materials.

An intriguing observation is that the GPT-4 base model appears to be more stable than when utilizing uploaded documents. This may suggest limitations in GPT-4’s ability to process external documents, particularly those that are unstructured or highly specialized in scientific content. This limitation aligns with our previous observation regarding GPT-4’s deficiency in cataloging citations within its content summaries.

Plagiarism check

The plagiarism assessment conducted via iThenticate ( https://www.ithenticate.com/ ) yielded a percentage score of 34% for reference-based GPT-4 content generation and 10% for the base model. Of these percentages, only 2% and 3%, respectively, were attributed to matches with the original review paper (BRP1), predominantly due to title similarities, as we maintained the same section and subsection titles. A score of 34% is typically indicative of significant plagiarism concerns, whereas 10% is considered minimal. These results demonstrate the GPT-4 base model’s capacity to expound upon designated points in a novel manner, minimally influenced by the original paper. However, the reference-based content generation raised concerns due to a couple of instances of ‘copy-paste’ style matches from two paragraphs in BRP1 references [ 21 , 22 ], which contributed to the elevated 34% score. In summary, while the overall content generated by ChatGPT appears to be novel, the occurrence of sporadic close matches warrants scrutiny.

This finding aligns with the theoretical low risk of direct plagiarism by ChatGPT, as AI-generated text responses are based on learned patterns and information, rather than direct ‘copy-paste’ from specific sources. Nonetheless, the potential for plagiarism and related academic integrity issues are of serious concern in academia. Researchers have been exploring appropriate methods to disclose ChatGPT’s contributions in publications and strategies to detect AI-generated content [ 23 , 24 , 25 ].

Table construction in scientific publications often necessitates a more succinct representation of relationships and key terms compared to text content summarization and synthesis. This requires ChatGPT to extract information with greater precision. For the five columns of information compiled by GPT-4 for Table 1 in BRP2, the Steps column is akin to summarizing section and subsection titles in BRP1. ‘Stimulators’ and ‘Inhibitors’ involve listing immune regulation factors, demanding more concise and precise information extraction. ‘Other Considerations’ encompasses additional relevant information, while ‘Example References’ lists citations.

For the Steps column, GPT-4 partially succeeded but struggled to accurately summarize information into numbered steps. For the remaining columns, GPT-4 was unable to extract the corresponding information accurately. Extracting concise and precise information from uploaded documents for specific scientific categories remains a significant challenge for GPT-4, which also lacks the ability to provide reference citations, as observed in previous tests. All results, including GPT prompts, outputs, and evaluations, are detailed in the supplementary materials.

In summary, GPT-4 has not yet achieved the capability to generate table content with the necessary conciseness and accuracy for information summary and synthesis.

Figure creation

In the diagram drawing test, we removed all terms indicative of a cyclical graph from the diagram description in the prompt to evaluate whether GPT-4 could independently recreate the original, pioneering depiction of the cancer immune system cycle. We employed three strategies for diagram generation, as depicted in Fig.  2 , which included: (1) using a diagram description generated from references and incorporated into the drawing prompt; (2) using the description from BRP2; (3) relying on the GPT-4 base model. The resulting diagrams produced by GPT-4 are presented in Fig.  4 , with detailed information provided in the supplementary materials.

Figure 4. (A) Original figure; (B) reference-based description; (C) BRP2 description; (D) base model.

These diagrams highlight common inaccuracies in GPT-4’s drawings, such as misspelled words, omitted numbers, and a lack of visual clarity due to superfluous icons and cluttered labeling. Despite these issues, GPT-4 demonstrated remarkable proficiency in constructing an accurate cycle architecture, even without explicit instructions to do so.

In conclusion, while GPT-4 can serve as a valuable tool for conceptualizing diagrams for various biomedical reactions, mechanisms, or systems, professional graph drawing tools are essential for the actual creation of diagrams.

Conclusions

In this study, we evaluated the capabilities of the language model GPT-4 within ChatGPT for composing a biomedical review article. We focused on four key areas: (1) summarizing insights from reference papers; (2) generating text content based on these insights; (3) suggesting avenues for future research; and (4) creating tables and graphs. GPT-4 exhibited commendable performance in the first three tasks but was unable to fulfill the fourth.

ChatGPT’s design is centered around text generation, with its language model finely tuned for this purpose through extensive training on a wide array of sources, including scientific literature. Consequently, GPT-4’s proficiency in text summarization and synthesis is anticipated. When comparing the API model’s performance on a section given only that section’s references against its performance given all references from the entire paper, the model does better with section-specific references, because supplying all references introduces considerable noise. Note also that the prompt explicitly forbids the use of external knowledge, so the model must sift through more than a hundred publications, identify the information relevant to the section, and then compose a reply; this further explains why section-specific references improve performance. Remarkably, the GPT-4 base model’s performance is on par with, and in some cases slightly surpasses, that of reference-based text content generation, owing to its training on a diverse collection of research articles and web text. Hence, when given a prompt and some basic points, it performs well because it already possesses the information needed to generate an appropriate response. Furthermore, reproducibility tests demonstrated GPT-4’s ability to generate consistent text content, whether utilizing references or relying solely on its base model.

In addition, we assessed GPT-4’s proficiency in extracting precise and pertinent information for the construction of research-related tables. GPT-4 encountered difficulties with this task, indicating that ChatGPT’s language model requires additional training to enhance its ability to discern and comprehend specialized scientific terminology from literature. This improvement necessitates addressing complex scientific concepts and integrating knowledge across various disciplines.

Moreover, GPT-4’s capability to produce scientific diagrams does not meet the standards required for publication. This shortfall may stem from its associated image generation module, DALL-E, being trained on a broad spectrum of images that encompass both scientific and general content. However, with ongoing updates and targeted retraining to include a greater volume of scientific imagery, the prospect of a more sophisticated language model with improved diagrammatic capabilities could be a foreseeable advancement.

To advance the assessment of ChatGPT’s utility in publishing biomedical review articles, we executed a plagiarism analysis on the text generated by GPT-4. This analysis revealed potential issues when references were employed, with GPT-4 occasionally producing outputs that closely resemble content from reference articles. Although GPT-4 predominantly generates original text, we advise conducting a plagiarism check on ChatGPT’s output before any formal dissemination. Moreover, despite the possibility that the original review paper BRP1 was part of GPT-4’s training dataset, the plagiarism evaluation suggests that the output does not unduly prioritize it, considering the extensive data corpus used for training the language model.

Our study also highlights the robust performance of the GPT-4 base model, which shows adeptness even without specific reference articles. This observation leads to the conjecture that incorporating the entirety of scientific literature into the training of a future ChatGPT language model could facilitate the on-demand extraction of review materials. Thus, it posits the potential for ChatGPT to eventually author comprehensive summary- and synthesis-based scientific review articles. At the time this work was written, ChatGPT did not provide citations for the PDFs supplied to it. In such a situation, we therefore advise working section by section: supply a single paper, obtain a summary of that publication alone, and use it to write a few sentences for that section with proper credit to the paper. Alternatively, for commonly recognized knowledge, the user can supply all articles at once to produce a well-rounded set of statements supported by a set of citations.

ChatGPT’s power and versatility warrant additional exploration of various facets. While these are beyond the scope of the current paper, we highlight selected topics that are instrumental in fostering a more science-oriented ChatGPT environment. Holistically speaking, to thoroughly assess ChatGPT’s proficiency in generating biomedical review papers, it is imperative to include a diverse range of review paper types in the evaluation process. For instance, ChatGPT is already equipped to devise data analysis strategies and perform data science tasks in real time. This capability suggests potential for generating review papers that include performance comparisons and benchmarks of computational tools. However, this extends beyond the scope of our pilot study, which serves as a foundational step toward more extensive research endeavors.

Ideally, ChatGPT would conduct essential statistical analyses of uploaded documents, such as ranking insights, categorizing documents per insight, and assigning relevance weights to each document. This functionality would enable scientists to quickly synthesize the progression and extensively studied areas within a field. When it comes to mitigating hallucination, employing uploaded documents as reference material can reduce the occurrence of generating inaccurate or ‘hallucinated’ content. However, when queries exceed the scope of these documents, ChatGPT may still integrate its intrinsic knowledge base. In such cases, verifying ChatGPT’s responses against the documents’ content is vital. A feasible method is to cross-reference responses with the documents, although this may require significant manual effort. Alternatively, requesting ChatGPT to annotate its output with corresponding references from the documents could be explored, despite being a current limitation of GPT-4.
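As a sketch of this cross-referencing idea, one could embed both the generated sentences and the source-document chunks and flag any sentence with no sufficiently similar passage. The threshold, chunking, and model choice below are illustrative assumptions, not a validated method.

```python
# Sketch of a grounding check: flag generated sentences that lack a
# sufficiently similar passage in the uploaded documents. Threshold and
# embedding model are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def unsupported_sentences(answer_sentences, doc_chunks, threshold=0.5):
    ans = model.encode(answer_sentences, convert_to_tensor=True)
    docs = model.encode(doc_chunks, convert_to_tensor=True)
    sims = util.cos_sim(ans, docs)    # one row per answer sentence
    best = sims.max(dim=1).values     # best-matching document chunk score
    return [s for s, b in zip(answer_sentences, best) if b < threshold]

flagged = unsupported_sentences(
    ["T cells infiltrate the tumor bed."],
    ["Step 5: T cells infiltrate the tumor bed.",
     "Step 6: T cells recognize cancer cells."])
print(flagged)  # [] when every sentence has document support
```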

To address academic integrity concerns, as the development of LLMs progresses towards features that could potentially expedite or even automate the creation of scientific review papers, the establishment of a widely accepted ethical practice guide becomes paramount. Until such guidelines are in place, it remains essential to conduct plagiarism checks on AI-generated content and transparently disclose the extent of AI’s contribution to the published work. The advent of large language models like Google’s Gemini AI [ 26 ] and Perplexity.ai has showcased NLP capabilities comparable to those of GPT-4. This, coupled with the emergence of specialized models such as BioBert [ 27 ], BioBART [ 28 ], and BioGPT [ 29 ] for biomedical applications, highlights the imperative for in-depth comparative studies. These assessments are vital for identifying the optimal AI tool for particular tasks, taking into account aspects such as multimodal functionalities, domain-specific precision, and ethical considerations. Conducting such comparative analyses will not only aid users in making informed choices but also promote the ethical and efficacious application of these sophisticated AI technologies across diverse sectors, including healthcare and education.

Data availability

All source code, GPT-4-generated content, and results are available in the GitHub repository.

References

1. Dhillon P. How to write a good scientific review article. FEBS J. 2022;289:3592–602.
2. Health sciences added to the Nature Index. Nature Index. https://www.nature.com/nature-index/news/health-sciences-added-to-nature-index.
3. Van Noorden R, Perkel JM. AI and science: what 1,600 researchers think. Nature. 2023;621:672–5.
4. Ariyaratne S, Iyengar KP, Nischal N, Chitti Babu N, Botchu R. A comparison of ChatGPT-generated articles with human-written articles. Skeletal Radiol. 2023;52:1755–8.
5. Kumar AH. Analysis of ChatGPT tool to assess the potential of its utility for academic writing in biomedical domain. Biology Eng Med Sci Rep. 2023;9:24–30.
6. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15:e35179.
7. Meyer JG, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 2023;16:20.
8. Mondal H, Mondal S. ChatGPT in academic writing: maximizing its benefits and minimizing the risks. Indian J Ophthalmol. 2023;71:3600.
9. Misra DP, Chandwar K. ChatGPT, artificial intelligence and scientific writing: what authors, peer reviewers and editors should know. J R Coll Physicians Edinb. 2023;53:90–3.
10. Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. Using ChatGPT to write patient clinic letters. Lancet Digit Health. 2023;5:e179–81.
11. Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems. 2023;11:351.
12. Huang J, Tan M. The role of ChatGPT in scientific communication: writing better scientific review articles. Am J Cancer Res. 2023;13:1148–54.
13. Haman M, Školník M. Using ChatGPT to conduct a literature review. Account Res. 2023:1–3.
14. ChatGPT listed as author on research papers: many scientists disapprove. Nature. https://www.nature.com/articles/d41586-023-00107-z.
15. Scopus AI: trusted content, powered by responsible AI. Elsevier. https://www.elsevier.com/products/scopus/scopus-ai.
16. Conroy G. How ChatGPT and other AI tools could disrupt scientific publishing. Nature. 2023;622:234–6.
17. Rubin JB. The spectrum of sex differences in cancer. Trends Cancer. 2022;8:303–15.
18. Chen DS, Mellman I. Oncology meets immunology: the Cancer-Immunity Cycle. Immunity. 2013;39:1–10.
19. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 3982–92. https://doi.org/10.18653/v1/D19-1410.
20. Jin B, et al. Large language models on graphs: a comprehensive survey. Preprint at https://doi.org/10.48550/arXiv.2312.02783 (2024).
21. Polkinghorn WR, et al. Androgen receptor signaling regulates DNA repair in prostate cancers. Cancer Discov. 2013;3:1245–53.
22. Broestl L, Rubin JB. Sexual differentiation specifies cellular responses to DNA damage. Endocrinology. 2021;162:bqab192.
23. ChatGPT and academic integrity concerns: detecting artificial intelligence generated content. Language Education and Technology. https://www.langedutech.com/letjournal/index.php/let/article/view/49.
24. Gao CA, et al. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. Preprint at https://doi.org/10.1101/2022.12.23.521610 (2022).
25. Homolak J. In reply: we do not stand a ghost of a chance of detecting plagiarism with ChatGPT employed as a ghost author. Croat Med J. 2023;64:293–4.
26. Introducing Gemini: Google’s most capable AI model yet. https://blog.google/technology/ai/google-gemini-ai/#sundar-note.
27. Lee J, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–40.
28. Yuan H, et al. BioBART: pretraining and evaluation of a biomedical generative language model. In: Proceedings of the 21st Workshop on Biomedical Language Processing. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 97–109. https://doi.org/10.18653/v1/2022.bionlp-1.9.
29. Luo R, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23:bbac409.

Funding

This research received no specific grant from any funding agency.

Author information

Authors and affiliations

Department of Computational Biomedicine, Cedars Sinai Medical Center, 700 N. San Vicente Blvd, Pacific Design Center, Suite G-541, West Hollywood, CA, 90069, USA

Zhiping Paul Wang, Priyanka Bhandary, Yizhou Wang & Jason H. Moore


Contributions

Z. Wang authored the main manuscript text, while P. Bhandary conducted most of the tests and collected the results. Y. Wang and J. Moore contributed scientific suggestions and collaborated on the project. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jason H. Moore .

Ethics declarations

Supplements

All supplementary materials are available at https://github.com/EpistasisLab/GPT4_and_Review .

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Wang, Z.P., Bhandary, P., Wang, Y. et al. Using GPT-4 to write a scientific review article: a pilot evaluation study. BioData Mining 17 , 16 (2024). https://doi.org/10.1186/s13040-024-00371-3


Received : 26 April 2024

Accepted : 11 June 2024

Published : 18 June 2024

DOI : https://doi.org/10.1186/s13040-024-00371-3



Automatic Rainwater Quality Monitoring System Using Low-Cost Technology

1. Introduction

2. Case Study

3. Methodology

3.1. Selection of Monitored Water Quality Parameters

3.2. Analysis of Requirements

  • Power supply: The prototype will be connected to the SAIH stations and powered through the electrical network.
  • Connectivity: SAIH stations and the developed prototype connect to the internet wirelessly via Wi-Fi.
  • Visualization: SAIH stations display their information in real time on a web page using ThingSpeak platform widgets. However, the free version only stores data for a month before deleting it. To overcome this limitation, a database was developed for this project on web hosting (a sketch of such an archival job follows this list).
  • Operating conditions: The water quality sensors consist of a circuit and a probe. The probes are waterproof, but the circuits are not. To protect the circuits from the elements, a housing was designed to store them, while the probes were left exposed to collect and renew rainwater. Maintenance of the probes, due to corrosion or sensor replacement, is estimated to increase the total cost by roughly 5%.
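A minimal sketch of archiving ThingSpeak readings past the free tier’s one-month retention is shown below. The channel ID, field mapping, and SQLite store are illustrative assumptions; the project itself used a MySQL database on web hosting.

```python
# Sketch of an archival job that copies ThingSpeak feed entries into a
# local database. Channel ID and field-to-parameter mapping are
# hypothetical; the project used MySQL on web hosting, not SQLite.
import sqlite3
import requests

CHANNEL_ID = 123456  # hypothetical ThingSpeak channel
URL = f"https://api.thingspeak.com/channels/{CHANNEL_ID}/feeds.json"

db = sqlite3.connect("rainwater.db")
db.execute("""CREATE TABLE IF NOT EXISTS readings (
    created_at TEXT PRIMARY KEY, ph REAL, ec REAL, turbidity REAL)""")

# ThingSpeak returns at most 8000 entries per request.
feeds = requests.get(URL, params={"results": 8000}, timeout=30).json()["feeds"]
for f in feeds:
    db.execute("INSERT OR IGNORE INTO readings VALUES (?, ?, ?, ?)",
               (f["created_at"], f.get("field1"), f.get("field2"),
                f.get("field3")))
db.commit()
```

Run periodically (for example, from a cron job), the primary-key constraint keeps the archive free of duplicate entries.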

3.3. Design and Development

3.3.1. Calibration

  • Clean the sensor probe with distilled water and dry it with a disposable tissue.
  • Correctly place the sensor inside the calibration standard solution.
  • Carry out the measurement.
  • Record the parameter value and the voltage read by the device.
  • Remove the probe from the calibration standard solution.
  • Repeat steps 1 to 5 ten times.
  • Calculate the relative error percentage using Equation (1). If it is within the accuracy guaranteed by the manufacturer, the sensor is considered calibrated; otherwise, the procedure is repeated (a code sketch of this check follows the list).
  • Repeat steps 1 through 7 for each sensor.
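A minimal sketch of the step 7 check follows. Equation (1) is not reproduced in this excerpt, so a standard mean relative error against the calibration standard is assumed, and the readings shown are illustrative.

```python
# Sketch of the relative-error check from the calibration procedure.
# Equation (1) is assumed to be the mean relative error vs. the standard.
def relative_error_percent(readings, standard_value):
    """Mean relative error (%) of repeated readings against the standard."""
    mean_reading = sum(readings) / len(readings)
    return abs(mean_reading - standard_value) / standard_value * 100

# Ten illustrative readings of a pH 7 buffer (step 6 repeats the measurement).
ph7_readings = [7.02, 6.98, 7.01, 7.03, 6.99, 7.00, 7.02, 6.97, 7.01, 7.00]
error = relative_error_percent(ph7_readings, 7.0)
print(f"{error:.2f}% (compare against the manufacturer's stated accuracy)")
```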

3.3.2. Validation

3.3.3. Data Visualization

  • Analysis of requirements: The monitoring stations were displayed on a map with markers to visualize them and show the latest information upon interaction.
  • Architecture and technology: A hosting provider was hired for platform development. A dynamic website was developed using web technologies such as HTML, PHP, JavaScript, and CSS. MySQL 8.0.17 was used as the database manager. A free Bootstrap web template (SB Admin 2) was used as a base [ 25 ] to develop the web application and mobile-first sites with a layout that adapts to the user’s screen [ 26 ].
  • Design of the logical and physical structure of the site: The main page contains the map with the monitoring stations. The site contains several sections with different functions, such as selecting consulted parameters in real-time, downloading data for a given period, and consulting data recorded on a specific time and date. Additionally, complementary pages such as Team and Contact were added.
  • Content creation: The platform’s content is primarily graphic, with more extensive text found in the Team and Contact tabs. The rest of the site contains short indications for the user or information on the indicators.
  • Graphic design: The interface features various shades of white, blue, gray, black, and green. White is used for the navigation bar, page background, and pop-up windows. Blue is used for some text, radio buttons, weather indicators, and real-time graphs. Gray is used for some text, station markers, indicator icons, and back buttons. Black is used for most text, and green is used for water quality indicators and some text. The default typography was retained from the Bootstrap template, and the sizes were adapted according to the device accessing the site. The indicator icons were obtained from the Font Awesome platform [ 27 ], which offers free icons that can be added to the website.
  • Creation of the static pages: The static pages include the Team and Contact pages, which will not change according to the database.
  • Creation of the dynamic pages: The dynamic page is the home page, where the station markers are displayed on the map. These markers change color based on the intensity of precipitation, and the magnitude of the indicators is updated in real-time.
  • Verification of the site’s operation: The page’s connection with the database was verified, ensuring that the most recent data was updated. The links within the site were also confirmed to redirect to the correct site. The site was tested on different browsers and devices, and the content was adjusted to fit the screen size. Finally, the site’s loading time was tested.
  • Start-up: After verifying the site’s operation locally, it was published on the web domain, making it accessible to the public.

3.3.4. Materials

4. Discussion and Results

4.1. Selection of Monitored Water Quality Parameters

4.2. Design and Development

4.3. Calibration

4.4. Validation

4.5. Implementation of the Interface to Visualize the Data

5. Conclusions

Author Contributions

Data Availability Statement

Acknowledgments

Conflicts of Interest

  • Krishna, H. The Texas Manual on Rainwater Harvesting, 3rd ed.; Texas Water Development Board: Austin, TX, USA, 2005.
  • Cambio de Michoacán. Suministro Irregular de Agua Desde Este Jueves En 115 Colonias de Morelia. Available online: https://cambiodemichoacan.com.mx/2022/03/30/suministro-irregular-de-agua-desde-este-jueves-en-115-colonias-de-morelia/ (accessed on 30 March 2022).
  • Morales Magaña, M. Flujos de Agua y Poder: La Gestión Del Agua Urbanizada En La Ciudad de Morelia, Michoacán. Ph.D. Thesis, Colegio de Michoacán, Zamora, Mexico, 2015.
  • García-Estrada, L.; Hernández-Guerrero, J. Ciclo Hidrosocial y Acceso Al Agua En La Periferia de La Ciudad de Morelia, México: Estudio de Caso En La Aldea. Rev. Geográfica América Cent. 2020, 64, 245–273.
  • Garduño Monroy, V.H.; Giordano, N.; Ávila Olivera, J.A.; Hernández Madrigal, V.M.; Sámano Nateras, A.; Díaz Salmerón, J.E. Estudio Hidrogeológico Del Sistema Acuífero de Morelia, Michoacán, Para Una Correcta Planificación Del Territorio; Urbanización, Sociedad y Ambiente; UNAM Centro de Investigaciones en Geografía Ambiental: Morelia, Mexico, 2014; pp. 197–222.
  • Ahmed, W.; Gardner, T.; Toze, S. Microbiological Quality of Roof-Harvested Rainwater and Health Risks: A Review. J. Environ. Qual. 2011, 40, 13–21.
  • Gillette, D.A.; Sinclair, P.C. Estimation of Suspension of Alkaline Material by Dust Devils in the United States. Atmos. Environ. 1990, 24, 1135–1142.
  • Rastogi, N.; Sarin, M.M. Chemical Characteristics of Individual Rain Events from a Semi-Arid Region in India: Three-Year Study. Atmos. Environ. 2005, 39, 3313–3323.
  • Wu, Y.; Xu, Z.; Liu, W.; Zhao, T.; Zhang, X.; Jiang, H.; Yu, C.; Zhou, L.; Zhou, X. Chemical Compositions of Precipitation at Three Non-Urban Sites of Hebei Province, North China: Influence of Terrestrial Sources on Ionic Composition. Atmos. Res. 2016, 181, 115–123.
  • Migliavacca, D.; Teixeira, E.C.; Wiegand, F.; Machado, A.C.M.; Sanchez, J. Atmospheric Precipitation and Chemical Composition of an Urban Site, Guaíba Hydrographic Basin, Brazil. Atmos. Environ. 2005, 39, 1829–1844.
  • Xu, Z.; Han, G. Chemical and Strontium Isotope Characterization of Rainwater in Beijing, China. Atmos. Environ. 2009, 43, 1954–1961.
  • Huang, D.Y.; Xu, Y.G.; Peng, P.; Zhang, H.H.; Lan, J.B. Chemical Composition and Seasonal Variation of Acid Deposition in Guangzhou, South China: Comparison with Precipitation in Other Major Chinese Cities. Environ. Pollut. 2009, 157, 35–41.
  • Larssen, T.; Semb, A.; Mulder, J.; Muniz, I.; Vogt, R.; Lydersen, E.; Angell, V.; Dagang, T.; Eilester, O.; Seip, H.M. Acid Deposition and Its Effects in China: An Overview. Environ. Sci. Policy 1999, 2, 9–24.
  • Bolaños, K.; Sibara, J.; Mora, J.; Umaña, D.; Cambronero, M.; Sandolval, L.; Martínez, M. Estudio Preliminar Sobre La Composición Atmosférica Del Agua de Lluvia En y Los Alrededores Del Parque Nacional Del Volcán Poás. In Memorias del I Congreso Internacional de Ciencias Exactas y Naturales de la Universidad Nacional, Costa Rica; Universidad Nacional: Heredia, Costa Rica, 2019; pp. 1–11.
  • Cousins, I.T.; Johansson, J.H.; Salter, M.E.; Sha, B.; Scheringer, M. Outside the Safe Operating Space of a New Planetary Boundary for Per- and Polyfluoroalkyl Substances (PFAS). Environ. Sci. Technol. 2022, 56, 11172–11179.
  • Rao, A.S.; Marshall, S.; Gubbi, J.; Palaniswami, M.; Sinnott, R.; Pettigrove, V. Design of Low-Cost Autonomous Water Quality Monitoring System. In Proceedings of the 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Mysore, India, 22–25 August 2013; pp. 14–19.
  • Cloete, N.A.; Malekian, R.; Nair, L. Design of Smart Sensors for Real-Time Water Quality Monitoring. IEEE Access 2016, 4, 3975–3990.
  • Oelen, A.; Van Aart, C.; De Boer, V. Measuring Surface Water Quality Using a Low-Cost Sensor Kit within the Context of Rural Africa. In Proceedings of the CEUR Workshop Proceedings, Amsterdam, The Netherlands, 27 May 2018; CEUR-WS: Aachen, Germany, 2018; Volume 2120.
  • Malhotra, R.; Devaraj, H.; Baldwin, B.; Kolli, V.; Lehman, K.; Li, A.; Lin, C. Integrating Electronics with Solid Structures Using 3D Circuits. In Proceedings of the 2019 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 11–13 October 2019; pp. 1–11.
  • Hong, W.J.; Shamsuddin, N.; Abas, E.; Apong, R.A.; Masri, Z.; Suhaimi, H.; Gödeke, S.H.; Noh, M.N.A. Water Quality Monitoring with Arduino Based Sensors. Environments 2021, 8, 6.
  • Rodríguez Licea, D.; Sánchez Quispe, S.T.; Domínguez Mota, F.J. Sistema Automático de Información Hidrológica de Morelia; Michoacan University of Saint Nicholas of Hidalgo: Morelia, Mexico, 2020.
  • Lambrou, T.P.; Anastasiou, C.; Panayiotou, C.; Polycarpou, M. A Low-Cost Sensor Network for Real-Time Monitoring and Contamination Detection in Drinking Water Distribution Systems. IEEE Sens. J. 2014, 14, 2765–2772.
  • Pitula, K.; Dysart-Gale, D.; Radhakrishnan, T. Expanding the Boundaries of HCI: A Case Study in Requirements Engineering for ICT4D. Inf. Technol. Int. Dev. 2010, 6, 78–93.
  • Luján Mora, S. Programación de Aplicaciones Web: Historia, Principios Básicos y Clientes Web; Editorial Club Universitario: Alicante, Spain, 2002.
  • Start Bootstrap. SB Admin 2—Free Bootstrap Admin Theme. Available online: https://startbootstrap.com/theme/sb-admin-2 (accessed on 1 January 2024).
  • Rockcontent. Bootstrap: ¿Qué Es, Para Qué Sirve y Cómo Instalarlo? Available online: https://rockcontent.com/es/blog/bootstrap/ (accessed on 15 January 2024).
  • Zlatanov, N. Arduino and Open Source Computer Hardware and Software. IEEE Sens. J. 2015, 11, 10.
  • Arduino. What Is Arduino? Available online: https://www.arduino.cc/en/Guide/Introduction (accessed on 1 January 2024).
  • Negara, R.M.; Tulloh, R.; Hadiansyah, P.N.; Zahra, R.T. My Locker: Loaning Locker System Based on QR Code. Int. J. Eng. Adv. Technol. 2019, 9, 12–19.
  • Matulka, R.; Greene, M. How 3D Printers Work. Available online: https://www.energy.gov/articles/how-3d-printers-work (accessed on 1 January 2024).
  • Balluff. Cómo Funciona Un Sistema de Sensores. Available online: https://www.balluff.com/es-mx/mx/service/basics-of-automation/fundamentals-of-automation/basic-of-sensing/ (accessed on 1 February 2024).
  • USEPA. Conductivity. Available online: https://archive.epa.gov/water/archive/web/html/vms59.html (accessed on 1 February 2024).
  • West, J.; Charlton, C.; Kaplan, K. Conductivity Meters. Available online: https://encyclopedia.che.engin.umich.edu (accessed on 1 February 2024).
  • DFRobot. Gravity: Analog TDS Sensor Meter for Arduino SKU SEN0244. Available online: https://www.dfrobot.com/product-1662.html (accessed on 1 March 2024).
  • Omega Engineering. Conductivity Meter. Available online: https://www.omega.co.uk/prodinfo/conductivity-meter.html (accessed on 1 February 2024).
  • Aqion. Temperature Compensation for Conductivity. Available online: https://www.aqion.de/site/112 (accessed on 1 March 2024).
  • DFRobot. Turbidity Sensor SKU SEN0189. Available online: https://wiki.dfrobot.com/Turbidity_sensor_SKU__SEN0189 (accessed on 1 May 2024).
  • SINEC NMX-AA-008-SCFI-2016; Medición Del pH En Aguas Naturales, Residuales y Residuales Tratadas. Secretaría de Economía: Gobierno de México, Mexico, 2016.
  • SINEC NMX-AA-093-SCFI-2000; Determinación de La Conductividad Electrolítica. Secretaría de Economía: Gobierno de México, Mexico, 2000.
  • SINEC NMX-AA-038-SCFI-2001; Determinación de Turbiedad En Aguas Naturales, Residuales y Residuales Tratadas. Secretaría de Economía: Gobierno de México, Mexico, 2001.


| Sensor | Model | Input Voltage (V) | Measuring Range | Measuring Accuracy | Operating Temperature (°C) | Price (USD) |
|---|---|---|---|---|---|---|
| Analog TDS and EC sensor | TDS Meter v1 | 3.3~5.5 | 0~1000 mg/L | ±10% (25 °C) | 0~55 | $32.50 |
| Analog turbidity sensor | SEN0189 | 5 | 0~1000 NTU | ±10% | 5~90 | $34.50 |
| Analog pH sensor | PH-4502C | 5 | 0~14 | ±10% | 0~80 | $30.65 |
| Digital water temperature sensor | DS18B20 | 3.0~5.5 | −10~+85 °C | ±0.5 °C | −55~+125 | $2.75 |
| pH | Voltage (V) | Average Voltage (V) |
|---|---|---|
| 4 | 3.0365, 3.0356, 3.0358, 3.0351, 3.0361, 3.0359, 3.0355, 3.0352, 3.0352, 3.0357 | 3.0357 |
| 7 | 2.5299, 2.5296, 2.5295, 2.5303, 2.5297, 2.5287, 2.5291, 2.5296, 2.5296, 2.5288 | 2.5295 |
| 10 | 2.0352, 2.0361, 2.0359, 2.0355, 2.0352, 2.0353, 2.0348, 2.0351, 2.0357, 2.0351 | 2.0354 |
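The averaged voltages above lend themselves to a linear pH calibration. The sketch below fits a first-order curve to the tabulated averages; treating the PH-4502C response as linear is an assumption consistent with these three calibration points.

```python
# Sketch of a linear pH-voltage calibration derived from the tabulated
# average voltages. A first-order fit is an assumption; the three points
# above are very nearly collinear.
import numpy as np

avg_voltage = np.array([3.0357, 2.5295, 2.0354])  # from the table above
ph_standard = np.array([4.0, 7.0, 10.0])

slope, intercept = np.polyfit(avg_voltage, ph_standard, 1)  # ~ -6.0, ~22.2

def voltage_to_ph(v: float) -> float:
    return slope * v + intercept

print(round(voltage_to_ph(2.5295), 2))  # ~7.0 for the pH 7 buffer
```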
| Temperature (°C) | Voltage (V) | EC (µS/cm) |
|---|---|---|
| 23.94 | 1.7427 | 1423.2627 |
| 23.94 | 1.7365 | 1415.9561 |
| 23.93 | 1.7369 | 1416.3735 |
| 23.93 | 1.7410 | 1421.2037 |
| 23.95 | 1.7381 | 1417.8912 |
| 23.94 | 1.7377 | 1417.3677 |
| 23.96 | 1.7394 | 1419.3567 |
| 23.92 | 1.7361 | 1415.4971 |
| 23.94 | 1.7345 | 1413.6064 |
| 23.95 | 1.7360 | 1415.4216 |
| Voltage Calibration Coefficient | Temperature (°C) | Read Voltage (V) | Affected Voltage (V) | EC (µS/cm) |
|---|---|---|---|---|
| 0.9950 | 23.94 | 1.7427 | 1.7340 | 1413.0001 |
| 0.9986 | 23.94 | 1.7365 | 1.7340 | 1413.0001 |
| 0.9984 | 23.93 | 1.7369 | 1.7340 | 1413.0001 |
| 0.9960 | 23.93 | 1.7410 | 1.7340 | 1413.0001 |
| 0.9976 | 23.95 | 1.7381 | 1.7340 | 1413.0001 |
| 0.9979 | 23.94 | 1.7377 | 1.7340 | 1413.0001 |
| 0.9969 | 23.96 | 1.7394 | 1.7340 | 1413.0001 |
| 0.9988 | 23.92 | 1.7361 | 1.7340 | 1413.0001 |
| 0.9997 | 23.94 | 1.7345 | 1.7340 | 1413.0000 |
| 0.9988 | 23.95 | 1.7360 | 1.7340 | 1413.0001 |
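The tabulated EC values are consistent with the voltage-to-EC cubic published in the DFRobot Gravity TDS/EC documentation (cited in the reference list), and the calibration coefficient is simply the ratio that forces the 1413 µS/cm buffer to read its nominal value. The sketch below reconstructs both; presenting this as the authors’ exact computation is an inference from the numbers, not from published code.

```python
# Sketch of the EC conversion and per-reading calibration coefficient.
# The cubic below is the DFRobot Gravity TDS/EC curve; it reproduces the
# tabulated values (e.g. 1.7427 V -> 1423.26 uS/cm), but its use here is
# an inference. Temperature compensation to 25 degC (e.g. dividing by
# 1 + 0.02*(T - 25), per the Aqion reference) is commonly applied as well,
# though the tabulated EC values above are computed from voltage alone.
def voltage_to_ec(v: float) -> float:
    """EC in uS/cm from probe voltage."""
    return 133.42 * v**3 - 255.86 * v**2 + 857.39 * v

V_STANDARD = 1.7340  # voltage that maps to the 1413 uS/cm standard

def calibration_coefficient(read_voltage: float) -> float:
    """Scale factor so the 1413 uS/cm buffer reads its nominal value."""
    return V_STANDARD / read_voltage

v = 1.7427
k = calibration_coefficient(v)      # ~0.9950, matching the first table row
print(round(k, 4), round(voltage_to_ec(k * v), 1))  # -> 0.995 1413.0
```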
Readings for turbidity standards of 0, 5, 10, 20, and 50 NTU (voltage and raw computed turbidity):

| 0 NTU: V | 0 NTU: Turb. | 5 NTU: V | 5 NTU: Turb. | 10 NTU: V | 10 NTU: Turb. | 20 NTU: V | 20 NTU: Turb. | 50 NTU: V | 50 NTU: Turb. |
|---|---|---|---|---|---|---|---|---|---|
| 4.2666 | −248.4250 | 4.2567 | −210.7411 | 4.2464 | −171.6875 | 4.2432 | −159.6708 | 4.2340 | −125.2497 |
| 4.2656 | −244.6078 | 4.2557 | −206.9549 | 4.2454 | −167.9335 | 4.2422 | −155.9267 | 4.2330 | −121.5342 |
| 4.2661 | −246.5161 | 4.2562 | −208.8477 | 4.2459 | −169.8102 | 4.2427 | −157.7979 | 4.2335 | −123.3911 |
| 4.2642 | −239.2675 | 4.2543 | −201.6580 | 4.2440 | −162.6816 | 4.2408 | −150.6885 | 4.2316 | −116.3360 |
| 4.2651 | −242.7001 | 4.2552 | −205.0626 | 4.2449 | −166.0573 | 4.2417 | −154.0551 | 4.2325 | −119.6769 |
| 4.2656 | −244.6078 | 4.2557 | −206.9549 | 4.2454 | −167.9335 | 4.2422 | −155.9262 | 4.2330 | −121.5337 |
| 4.2646 | −240.7929 | 4.2547 | −203.1709 | 4.2444 | −164.1817 | 4.2412 | −152.1846 | 4.2320 | −117.8206 |
| 4.2637 | −237.3614 | 4.2538 | −199.7673 | 4.2435 | −160.8070 | 4.2403 | −148.8189 | 4.2312 | −114.4807 |
| 4.2666 | −248.4250 | 4.2567 | −210.7411 | 4.2464 | −171.6875 | 4.2432 | −159.6701 | 4.2340 | −125.2489 |
| 4.2656 | −244.6078 | 4.2557 | −206.9549 | 4.2454 | −167.9335 | 4.2422 | −155.9262 | 4.2330 | −121.5337 |
| Voltage Calibration Coefficient | Read Voltage (V) | Affected Voltage (V) | Turbidity Standard (NTU) | Arduino Turbidity (NTU) |
|---|---|---|---|---|
| 0.9847 | 4.2654 | 4.2002 | 0 | 0.0000 |
| 0.9867 | 4.2555 | 4.1989 | 5 | 5.0000 |
| 0.9888 | 4.2452 | 4.1975 | 10 | 9.9996 |
| 0.9889 | 4.2420 | 4.1948 | 20 | 19.9996 |
| 0.9891 | 4.2328 | 4.1866 | 50 | 49.9998 |
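The negative raw turbidity readings in the previous table are explained by the quadratic voltage-to-NTU curve from the DFRobot SEN0189 documentation, which reproduces the tabulated values; after applying the voltage calibration coefficient, the same curve returns values close to the standards. The sketch below is a reconstruction under that assumption, not the authors’ published code.

```python
# Sketch of the voltage-to-turbidity conversion. The quadratic is the
# curve from the DFRobot SEN0189 documentation; it reproduces the raw
# values above (e.g. 4.2666 V -> about -248.4 NTU), which explains the
# negative readings before the voltage coefficient is applied.
def voltage_to_ntu(v: float) -> float:
    return -1120.4 * v**2 + 5742.3 * v - 4352.9

print(round(voltage_to_ntu(4.2666), 1))           # ~ -248.4 (uncalibrated)
print(round(voltage_to_ntu(0.9847 * 4.2654), 1))  # ~ 0 after the coefficient
```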
| Aggregate Volume (mL) | Concentration (M) | pH Thermo | pH Arduino | Difference | % Error |
|---|---|---|---|---|---|
| 0 | 0 | 7.5680 | 7.6060 | 0.0380 | 0.5020 |
| 0.05 | 4.998 × 10⁻⁵ | 5.8760 | 6.0680 | 0.1920 | 3.2680 |
| 0.10 | 9.990 × 10⁻⁵ | 4.3860 | 4.5660 | 0.1800 | 4.1040 |
| 0.15 | 1.498 × 10⁻⁴ | 4.0660 | 4.2580 | 0.1920 | 4.7220 |
| 0.25 | 2.494 × 10⁻⁴ | 3.8760 | 4.0600 | 0.1840 | 4.7470 |
| Aggregate Volume (mL) | Concentration (M) | EC Thermo (µS/cm) | EC Arduino (µS/cm) | Difference | % Error |
|---|---|---|---|---|---|
| 0 | 0 | 5.0376 | 5.2428 | 0.2052 | 4.0730 |
| 0.1 | 0.0001 | 16.7010 | 17.1555 | 0.4545 | 2.7210 |
| 1 | 0.0010 | 147.2650 | 148.6772 | 1.4122 | 0.9590 |
| 10 | 0.0100 | 1417.3322 | 1411.6059 | 5.7263 | 0.4040 |
| Turbidity Standard (NTU) | Arduino Turbidity (NTU) | Difference | % Error |
|---|---|---|---|
| 0 | −0.1645 | 0.1645 | – |
| 5 | 4.7987 | 0.2013 | 4.0252 |
| 10 | 9.8590 | 0.1410 | 1.4096 |
| 20 | 19.8080 | 0.1920 | 0.9600 |
| 50 | 49.6212 | 0.3788 | 0.7575 |
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Mejía-Ferreyra, L.D.; García-Romero, L.; Sánchez-Quispe, S.T.; Apolinar-Cortés, J.; Orantes-Avalos, J.C. Automatic Rainwater Quality Monitoring System Using Low-Cost Technology. Water 2024 , 16 , 1735. https://doi.org/10.3390/w16121735

Mejía-Ferreyra LD, García-Romero L, Sánchez-Quispe ST, Apolinar-Cortés J, Orantes-Avalos JC. Automatic Rainwater Quality Monitoring System Using Low-Cost Technology. Water . 2024; 16(12):1735. https://doi.org/10.3390/w16121735

Mejía-Ferreyra, Luis Daniel, Liliana García-Romero, Sonia Tatiana Sánchez-Quispe, José Apolinar-Cortés, and Julio César Orantes-Avalos. 2024. "Automatic Rainwater Quality Monitoring System Using Low-Cost Technology" Water 16, no. 12: 1735. https://doi.org/10.3390/w16121735
