Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

Abstract

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view of "Data Science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, urban and rural data science, and so on, taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of "data science and advanced analytics", where almost everything in our daily lives is digitally recorded as data [17]. Thus the current electronic world holds a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [112]. The data can be structured, semi-structured, or unstructured, and it increases day by day [105]. Data science is typically a "concept to unify statistics, data analysis, and their related methods" in order to understand and analyze actual phenomena with data. According to Cao et al. [17], "data science is the science of data" or "data science is the study of data", where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of "Data science" is increasing day by day, as shown in Fig. 1 according to Google Trends data over the last 5 years [36]. In addition to data science, the figure also shows the popularity trends of related areas such as "Data analytics", "Data mining", "Big data", and "Machine learning". According to Fig. 1, the popularity indication values for these data-driven domains, particularly "Data science" and "Machine learning", are increasing day by day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study "Data science" and machine-learning-based "Advanced analytics" in this paper.

Fig. 1: The worldwide popularity score of data science compared with relevant areas, in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. Advanced analytics places more emphasis on using data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of data, while advanced analytics goes a step further in offering a deeper understanding of data and helping to analyze granular data, which is our interest here. In the field of data science, several types of analytics are popular, such as "Descriptive analytics", which answers the question of what happened; "Diagnostic analytics", which answers the question of why it happened; "Predictive analytics", which predicts what will happen in the future; and "Prescriptive analytics", which prescribes what action should be taken, discussed briefly in "Advanced analytics methods and smart computing". Such advanced analytics and decision-making based on machine learning techniques [105], a major part of artificial intelligence (AI) [102], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to their learning capability for smart computing as well as automation [121].

Although the area of "data science" is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data, and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from artificial neural networks, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in "Advanced analytics methods and smart computing". Thus, it is important to understand the principles of the various advanced analytics methods mentioned above and their applicability in various real-world application areas. For instance, in our earlier paper, Sarker et al. [114], we have discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account the data science application areas and real-world problems in ten potential domains, including business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in "Real-world application domains".

Based on the importance of machine learning modeling for extracting useful insights from the data mentioned above and data-driven smart decision-making, in this paper we present a comprehensive view of "Data Science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytics methods from a solution perspective, and their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for academia and industry professionals who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in the real world. We also briefly discuss the concept of data science modeling, from business problems to data products and automation, to understand its applicability and to provide intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confused. In the following, we define these terms and differentiate them from the term "Data Science" according to our goal.

The term "Data analysis" refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [17]. The term "Data analytics", on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [17]. Statistical and mathematical analysis of the data is the major concern in this process. "Data mining" is another popular term of the last decade, which has a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [38], it should have been more appropriately named "knowledge mining from data". Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [38]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. "Big data" is another popular term nowadays, which may change statistical and data analysis approaches as it has the unique features of being "massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous" [74]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [129]. Several unique features, including volume, velocity, variety, veracity, value (5Vs), and complexity, are used to understand and describe big data [69].

In terms of analytics, basic analytics provides a summary of data, whereas "Advanced Analytics" takes a step forward in offering a deeper understanding of data and helping to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights, make predictions, or generate recommendations, typically beyond traditional business intelligence or analytics. "Machine learning", a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics, which can automate analytical model building [112]. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [38, 115]. "Deep Learning" is a subfield of machine learning concerning algorithms inspired by the structure and function of the human brain, called artificial neural networks [38, 139].

Unlike the above data-related terms, "Data science" is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [17], Cao et al. defined data science from the disciplinary perspective as "data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology". In "Understanding data science modeling", we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working on particular real-world problem domains within the area of data science and analytics.

Related Work

Several papers in the area have reviewed data science and its significance. For example, the authors in [19] identify the evolving field of data science and its importance in the broader knowledge environment, and discuss some issues that differentiate data science and informatics from conventional approaches in the information sciences. Donoho et al. [27] present 50 years of data science, including recent commentary on data science in the mass media and on how/whether data science differs from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [53] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [17], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [17, 112, 114]. The data can be of different types, such as (i) Structured—data that has a well-defined structure and follows a standard order; examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—data that has no pre-defined format or organization; examples are sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—data that has elements of both structured and unstructured data, containing certain organizational properties; examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—data about the data; examples are author, file type, file size, creation date and time, last modification date and time, etc. [38, 105].

In the area of data science, researchers use various widely-used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ] etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “ Real-world application domains ”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in "Background and related work". In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling, starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves getting a clear understanding of the problem that needs to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, the data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, whether the behavior is unrealistic/abnormal, which option should be taken, what action, etc. could be relevant questions depending on the nature of the problems. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge, which enables organizations to enhance their decision-making process, is known as "Business Intelligence" [65]. Identifying the relevant data sources that can help to answer the formulated questions, and what kinds of actions should be taken from the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem.
  • Understanding data: Data science is largely driven by the availability of data [114]. Thus, a sound understanding of the data is needed for building a data-driven model or system. The reason is that real-world datasets are often noisy and contain missing values, inconsistencies, or other data issues, which need to be handled effectively [101]. To gain actionable insights, the appropriate data, of sufficient quality, must be sourced and cleansed, which is fundamental to any data science engagement. For this, a data assessment that evaluates what data is available and how it aligns to the business problem could be the first step in data understanding. Several aspects, such as the data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc., need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would be best needed and the best ways to acquire it.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [135]. This examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus, data exploration is typically used to figure out the gist of the data and to develop a first-step assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily exploratory analysis offers tools for creating hypotheses by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [72, 91]. Before the data is ready for modeling, it is necessary to use data summarization and visualization to audit the quality of the data and provide the information needed to process it. To ensure the quality of the data, the data pre-processing technique, which is typically the process of cleaning and transforming raw data [107] before analysis, is important. It also involves reformatting information, making data corrections, and merging data sets to enrich the data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in data and dealing with them, ensuring data quality, etc. could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building is dependent on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in "Advanced analytics methods and smart computing". To best fit the data according to the type of analytics, different types of data-driven or machine learning models, summarized in our earlier paper Sarker et al. [105], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually dividing the data in a ratio of 80:20 or using the popular k-fold cross-validation data splitting method [38]. This is done to observe whether the model performs well on unseen data and to maximize the model performance; a minimal sketch of this split-and-evaluate step is shown after this list. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [38, 115] are used to measure the model performance, which can guide the data scientists to choose or design the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced analytics techniques such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, etc. to improve the ultimate data-driven model to solve a particular business problem through smart decision making.
  • Data product and automation: A data product is typically the output of any data science activity [17]. A data product, in general terms, is a data deliverable, or data-enabled or guided product, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in "Real-world application domains", where various data products can play a significant role in relevant business problems to make them smart and automated.
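
The following is a minimal, illustrative sketch of the split-and-evaluate step referenced above; it is not the paper's own code. The synthetic dataset, scikit-learn usage, and parameter values are assumptions chosen purely for demonstration.

```python
# Minimal sketch of the modeling-and-evaluation step: an 80:20 train/test
# split, a simple classifier, and common evaluation metrics (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in for a prepared, cleaned dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 80:20 train/test split, as mentioned above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Common model assessment metrics
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))

# Alternatively, k-fold cross-validation (here k = 5) on the full dataset
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```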

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. A key part of the data science process is having a deep understanding of the business problem to be solved; without that, it would be much harder to gather the right data and extract the most useful information from it for making decisions that solve the problem. In terms of role, "Data Scientists" typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today's computing process smarter and more intelligent, as discussed briefly in the following section.

Fig. 2: An example of data science modeling from real-world data to a data-driven system and decision making

Advanced Analytics Methods and Smart Computing

As mentioned earlier in "Background and related work", basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helping in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, "advanced analytics" can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and the outcomes needed to solve the associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions, such as "What happened?", "Why did it happen?", "What will happen in the future?", and "What action should be taken?", are common and important. Based on these questions, in this paper we categorize analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: This is the interpretation of historical data to better understand the changes that have occurred in a business. Thus, descriptive analytics answers the question "what happened in the past?" by summarizing past data, such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn, or Facebook, etc. For instance, descriptive analytics of customers' historical shopping data, through analyzing trends, patterns, and anomalies, can be used to estimate the probability of a customer purchasing a product. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business, and eventually adopt more effective management strategies and business decisions.
  • Diagnostic analytics: This is a form of advanced analytics that examines data or content to answer the question, "why did it happen?" The goal of diagnostic analytics is to help find the root cause of a problem. For example, the human resource management department of a business organization may use diagnostic analytics to find the best applicant for a position, select them, and compare them to others in similar positions to see how well they perform. In a healthcare example, it might help to figure out whether the patients' symptoms, such as high fever, dry cough, headache, fatigue, etc., are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes, such as assessing business risks, anticipating potential market patterns, and deciding when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, "what will happen in the future?" Thus, the primary goal of predictive analytics is to answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analysis and machine learning techniques, which can be applied in various application domains for a better outcome. Companies, for example, can use predictive analytics to minimize costs by better anticipating future demand and adjusting output and inventory; banks and other financial institutions can reduce fraud and risks by predicting suspicious activity; medical specialists can make effective decisions by predicting which patients are at risk of disease; retailers can increase sales and customer satisfaction by understanding and predicting customer preferences; manufacturers can optimize production capacity by predicting maintenance requirements; and many more. Thus, predictive analytics can be considered the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, and typically answers the question, "what action should be taken?" In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed decisions that produce the most successful business results.

In summary, to clarify what happened and why it happened, both descriptive analytics and diagnostic analytics look at the past. Historical data is used by predictive analytics and prescriptive analytics to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we have summarized these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes in business processes and improvements. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Table 1: Various types of analytical methods with examples

| Analytical method | Data-driven model building | Examples |
| --- | --- | --- |
| Descriptive analytics | Answers the question, "what happened in the past?" | Summarizing past events, e.g., sales, business data, social media usage; reporting general trends, etc. |
| Diagnostic analytics | Answers the question, "why did it happen?" | Identifying anomalies and determining causal relationships, e.g., finding the cause of a business loss, identifying the influence of medications, etc. |
| Predictive analytics | Answers the question, "what will happen in the future?" | Predicting customer preferences, recommending products, identifying possible security breaches, predicting staff and resource needs, etc. |
| Prescriptive analytics | Answers the question, "what action should be taken?" | Improving business management and maintenance, improving patient care and healthcare administration, determining optimal marketing strategies, etc. |

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of a machine learning-based predictive model considering both the training and testing phases. In the following, we discuss a wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on, within the scope of our study.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Regression Analysis

In data science, regression techniques are among the most common statistical approaches used for predictive modeling and data mining tasks [38]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict a continuous-valued output [105, 117]. Equations 1, 2, and 3 [85, 105] represent simple, multiple (multivariate), and polynomial regression respectively, where $x$ denotes an independent variable, $y$ the predicted/target output, $\beta_i$ the regression coefficients, and $\varepsilon$ the error term:

$y = \beta_0 + \beta_1 x + \varepsilon$  (1)

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$  (2)

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \varepsilon$  (3)
Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., to find the causal relationship between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem. In that case, polynomial regression performs better, though it increases model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [85, 105] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression techniques [85, 105] can be used to build effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
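
As an illustrative sketch (not taken from the paper), the following shows how linear and polynomial regression models of the kind in Eqs. 1 and 3 might be fitted with scikit-learn; the toy data and parameter choices are assumptions for demonstration only.

```python
# Illustrative sketch: fitting linear and polynomial regression models
# (cf. Eqs. 1 and 3) with scikit-learn on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a mild non-linear trend (assumption for demonstration)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 + 1.5 * x.ravel() + 0.3 * x.ravel() ** 2 + np.random.normal(0, 1, 50)

# Simple linear regression (Eq. 1): y = b0 + b1*x
linear = LinearRegression().fit(x, y)

# Polynomial regression (Eq. 3): expand x into [x, x^2], then fit a linear model
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)

print("linear coefficients    :", linear.intercept_, linear.coef_)
print("polynomial coefficients:", poly.intercept_, poly.coef_)
```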

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [38]. Spam identification, such as 'spam' and 'not spam' in email service providers, is an example of a classification problem. There are several forms of classification analysis in the area, such as binary classification—which refers to the prediction of one of two classes; multi-class classification—which involves the prediction of one of more than two classes; and multi-label classification—a generalization of multi-class classification in which multiple, non-exclusive class labels may be assigned to each example [105].

Several popular classification techniques, such as k-nearest neighbors [5], support vector machines [55], naive Bayes [49], adaptive boosting [32], extreme gradient boosting [85], logistic regression [66], decision trees ID3 [92] and C4.5 [93], and random forests [13], exist to solve classification problems. Tree-based classification techniques, e.g., random forests built from multiple decision trees, perform better than others in many real-world cases due to their capability of producing logic rules [103, 115]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [109], and IntrudTree [106] can be used for building effective classification or prediction models in the relevant tasks within the domain of data science and analytics.
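
As a brief sketch of the classification idea (not the paper's own experiment), the code below trains a random forest, an ensemble of decision trees as in Fig. 4, on synthetic data and predicts class labels; the dataset and parameter values are illustrative assumptions.

```python
# Illustrative sketch: a random forest classifier (an ensemble of decision
# trees) trained on synthetic data; values are for demonstration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 decision trees combined by majority vote
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print("predicted labels (first 5):", clf.predict(X_test[:5]))
print("test accuracy             :", clf.score(X_test, y_test))
```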

Fig. 4: An example of a random forest structure considering multiple decision trees

Cluster Analysis

Clustering is a form of unsupervised machine learning technique that is well-known in many data science application areas for statistical data analysis [38]. Usually, clustering techniques search for structures inside a dataset and, where no class labels are known in advance, group homogeneous cases together. This means that data points are similar to each other within a cluster, and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort various data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [105]. Clustering is often used to gain insight into how data is distributed in a given dataset or as a preprocessing phase for other algorithms. Data clustering, for example, assists retail businesses with understanding customer shopping behavior, targeting sales campaigns, retaining consumers, detecting anomalies, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [98, 138, 141]. In our earlier paper, Sarker et al. [105], we have summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical methods, model-based methods, etc. In the literature, the popular K-means [75], K-Medoids [84], CLARA [54], etc. are known as partitioning methods; DBSCAN [30], OPTICS [8], etc. are known as density-based methods; and single linkage [122], complete linkage [123], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [134], CLIQUE [2], etc.; model-based clustering such as neural network learning [141], GMM [94], SOM [18, 104], etc.; and constraint-based methods such as COP K-means [131], CMWK-Means [25], etc. are used in the area. Recently, Sarker et al. [111] proposed BOTS, a hierarchical clustering method based on a bottom-up agglomerative technique for capturing users' similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
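
The following is a minimal sketch of partitioning-based clustering with K-means (not tied to any particular study cited above); the synthetic blobs and the choice of three clusters are assumptions for illustration.

```python
# Illustrative sketch: partitioning synthetic data into groups with K-means,
# a popular partitioning-based clustering method mentioned above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups (assumption for demonstration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
labels = kmeans.fit_predict(X)

print("cluster sizes  :", [int((labels == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```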

Association Rule Analysis

Association rule learning is a rule-based, unsupervised machine learning method typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets to discover interesting relationships or patterns. The association rule learning technique's main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [138].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between data sets inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to adapt the marketing and sales plan. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern based [ 4 , 47 , 73 ], logic-based [ 31 ], tree-based [ 39 ], fuzzy-rules [ 126 ], belief rule [ 148 ] etc. The rule learning techniques such as AIS [ 3 ], Apriori [ 4 ], Apriori-TID and Apriori-Hybrid [ 4 ], FP-Tree [ 39 ], Eclat [ 144 ], RARM [ 24 ] exist to solve the relevant business problems. Apriori [ 4 ] is the most commonly used algorithm for discovering association rules from a given dataset among the association rule learning techniques [ 145 ]. The recent association rule-learning technique ABC-RuleMiner proposed in our earlier paper by Sarker et al. [ 113 ] could give significant results in terms of generating non-redundant rules that can be used for smart decision making according to human preferences, within the area of data science applications.
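
As a minimal sketch of the support and confidence measures that underpin association rule mining (a from-scratch illustration, not the Apriori algorithm itself), consider the following made-up market-basket transactions.

```python
# Minimal sketch of the support/confidence computation behind association
# rule mining, on a few made-up market-basket transactions.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

min_support, min_confidence = 0.4, 0.6
items = sorted({i for t in transactions for i in t})

# Enumerate rules of the form {a} -> {b} that meet both thresholds
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        sup = support({lhs, rhs})
        conf = sup / support({lhs})
        if sup >= min_support and conf >= min_confidence:
            print(f"{{{lhs}}} -> {{{rhs}}}  support={sup:.2f}  confidence={conf:.2f}")
```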

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [111]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), and so on in relevant domains.

A mathematical method dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [130] learns the behavioral trends or patterns of past data. The moving average (MA) model [40] is another simple and common form of smoothing used in time-series analysis and forecasting, which uses past forecast errors in a regression-like model to describe an averaged trend across the data. The autoregressive moving average (ARMA) model [12, 120] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [12, 120]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model also covers the non-stationary case. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average model with exogenous inputs (ARMAX) are also used as time-series models [120].

In addition to the stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [111] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users [111]. The authors in [118] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for forecasting time series, outperforming traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any applied science and engineering domain with temporal measurements. Thus, it covers a wide range of application areas in data science.
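
As a rough sketch of statistical time-series forecasting with one of the models discussed above (assuming the statsmodels library and a synthetic monthly series, both illustrative choices), an ARIMA model can be fitted and used to forecast future points as follows.

```python
# Rough sketch: fitting an ARIMA model to a synthetic monthly series and
# forecasting future points; assumes the statsmodels library is available.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series with a linear trend plus noise (assumption for demonstration)
index = pd.date_range("2020-01-01", periods=48, freq="MS")
values = 100 + 0.8 * np.arange(48) + np.random.normal(0, 2, 48)
series = pd.Series(values, index=index)

# ARIMA(p=1, d=1, q=1): autoregressive, differencing, and moving-average terms
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))  # forecast the next six months
```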

Fig. 5: An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [ 71 ]. There are three kinds of sentiments: positive, negative, and neutral, along with more extreme feelings such as angry, happy and sad, or interested or not interested, etc. More refined sentiments to evaluate the feelings of individuals in various situations can also be found according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain opinions from the public or customers about its products and services in order to refine its business policy and make better business decisions. It can thus benefit a business to understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think of a service or product before they use or purchase it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [45].

Several popular techniques such as lexicon-based including dictionary-based and corpus-based methods, machine learning including supervised and unsupervised learning, deep learning, and hybrid methods are used in sentiment analysis-related tasks [ 70 ]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, it incorporates the use of statistics, natural language processing (NLP), machine learning as well as deep learning methods. Sentiment analysis is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus sentiment analysis has a big influence in many data science applications, where public sentiment is involved in various real-world issues.
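
To make the lexicon-based (dictionary-based) approach concrete, the following is a minimal sketch in which each word contributes a polarity score and the sum decides the sentiment; the tiny lexicon and example reviews are assumptions for illustration only.

```python
# Minimal sketch of a lexicon (dictionary) based sentiment scorer: each word
# contributes a polarity score and the total decides positive/negative/neutral.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "poor": -2, "hate": -2}

def sentiment(text):
    score = sum(LEXICON.get(word.strip(",.!?"), 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = ["I love this product, great value!", "Poor quality, really bad.", "It arrived on time."]
for review in reviews:
    print(f"{sentiment(review):8s} <- {review}")
```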

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more [112]. Behavioral analysis aims to understand how and why consumers or users behave, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics uses the large quantities of raw user event data gathered during sessions in which people use apps, games, or websites, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness. In our earlier papers, Sarker et al. [101, 111, 113], we have discussed how to extract users' phone usage behavioral patterns utilizing real-life phone log data for various purposes.

In the real-world scenario, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications, to find opportunities for optimization to achieve particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [111], behavioral decision tree classification [109], behavioral association rules [113], etc., can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [108], which takes into account recent behavioral patterns, could be effective while analyzing behavioral data, as behavior may not be static in the real world and changes over time.
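
To illustrate the cohort idea described above, the sketch below groups made-up users by their signup month and counts how many from each cohort were active in later months; the data frame and its column names are assumptions for demonstration.

```python
# Illustrative sketch of cohort analysis: group users by signup month and
# count how many from each cohort were active in each later month.
import pandas as pd

events = pd.DataFrame({
    "user":         ["a", "a", "b", "b", "c", "c", "d"],
    "signup_month": ["2021-01", "2021-01", "2021-01", "2021-01", "2021-02", "2021-02", "2021-02"],
    "active_month": ["2021-01", "2021-02", "2021-01", "2021-03", "2021-02", "2021-03", "2021-02"],
})

# Count distinct active users per (cohort, month) pair
cohorts = (events.groupby(["signup_month", "active_month"])["user"]
                 .nunique()
                 .unstack(fill_value=0))
print(cohorts)
```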

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from a dataset's regular or normal behavior. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, and exceptions [63, 114]. Anomaly detection techniques may identify new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraud or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing tasks to remove anomalous or inconsistent data from real-world data collected from various sources, including user logs, devices, networks, and servers. For anomaly detection, several machine learning techniques can be used, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [105]. The exclusion of anomalous data from the dataset also results in a statistically significant improvement in accuracy during supervised learning [101]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc. could be challenging in the process of anomaly detection. Anomaly detection is applicable in a variety of domains such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more, and can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
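
As a short sketch of one of the techniques named above, the code below applies an isolation forest to synthetic two-dimensional data in which a few points are deliberately placed far from the normal cluster; the data and the contamination value are assumptions for illustration.

```python
# Illustrative sketch: detecting outliers with an Isolation Forest on
# synthetic two-dimensional data (values chosen purely for demonstration).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # regular behavior
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # obvious anomalies
X = np.vstack([normal, outliers])

# contamination = expected fraction of anomalies (an assumption here)
detector = IsolationForest(contamination=0.03, random_state=42)
labels = detector.fit_predict(X)   # -1 marks an anomaly, 1 a normal point

print("number of detected anomalies:", int((labels == -1).sum()))
```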

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [ 23 ]. It’s usually used to organize variables into a small number of clusters based on their common variance, where mathematical or statistical procedures are used. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the existence of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [ 143 ].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex trends by analyzing the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [143]. Factor analysis is one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analytics are principal components analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [48]. Methods of correlation analysis, such as Pearson correlation, canonical correlation, etc., may also be useful in the field, as they can quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
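
As a brief sketch of the dimensionality-reduction side of factor analytics mentioned above, principal components analysis (PCA) can be applied as follows; the choice of the iris dataset and of two components is an illustrative assumption.

```python
# Illustrative sketch: reducing dimensionality with principal components
# analysis (PCA), one of the common methods mentioned above.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # 4 original variables (features)
pca = PCA(n_components=2)            # keep the 2 strongest components
X_reduced = pca.fit_transform(X)

print("reduced shape           :", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```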

Log Analysis

Logs are commonly used in system management, as logs are often the only data available that record detailed system runtime activities or behaviors in production [44]. Log analysis can thus be considered the method of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [88, 110], SMS logs [28], mobile application usage logs [124, 149], notification logs [77], game logs [82], context logs [16, 149], web logs [37], smartphone life logs [95], etc. are some examples of log data for smartphone devices. The main characteristic of these log data is that they contain users' actual behavioral activities with their devices. Other similar log data can be search logs [50, 133], application logs [26], server logs [33], network logs [57], event logs [83], network and security logs [142], etc.

Several techniques, such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [105], can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by facilitating the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors, and Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take into account machine learning modeling can play a significant role in extracting insightful patterns from these log data, which can be used for building automated and smart applications, and thus can be considered a key working area in data science.
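
As a minimal sketch of a first log analysis step (parsing and counting, before any machine learning modeling), the example below parses a few made-up server log lines with a regular expression and counts entries per severity level; the log format and field names are assumptions.

```python
# Minimal sketch of log analysis: parsing made-up server log lines with a
# regular expression and counting entries per severity level.
import re
from collections import Counter

log_lines = [
    "2021-06-01 10:00:01 INFO  user=alice action=login",
    "2021-06-01 10:00:05 ERROR user=bob   action=login_failed",
    "2021-06-01 10:01:12 INFO  user=alice action=view_page",
    "2021-06-01 10:02:30 WARN  user=carol action=slow_response",
    "2021-06-01 10:03:00 ERROR user=bob   action=login_failed",
]

pattern = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>\w+)\s+user=(?P<user>\w+)")
level_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        level_counts[match.group("level")] += 1

print(dict(level_counts))   # counts of INFO / WARN / ERROR entries
```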

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are: multi-layer perceptron (MLP) [ 85 ], convolutional neural network (CNN or ConvNet) [ 67 ], and long short term memory recurrent neural network (LSTM-RNN) [ 34 ]. Figure 6 shows a structure of an artificial neural network modeling with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ] improve on the design of traditional artificial neural networks (ANNs) by including convolutional layers, pooling layers, and fully connected layers. CNNs are commonly used in a variety of fields, including natural language processing, speech recognition, image processing, and other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on CNN are also used in the field.

Fig. 6: A structure of an artificial neural network modeling with multiple processing layers

In addition to CNN, recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and making predictions based on time-series data. Therefore, when the data is in a sequential format, such as time series, sentences, etc., LSTM can be used, and it is widely used in the areas of time-series analysis, natural language processing, speech recognition, and so on.
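A minimal, illustrative LSTM sketch is shown below (our own example, assuming TensorFlow/Keras is installed); the synthetic data, shapes, and hyperparameters are arbitrary choices for demonstration, not a prescribed configuration.

```python
# An illustrative LSTM for binary sequence classification on synthetic time-series data.
import numpy as np
import tensorflow as tf

# Synthetic sequential data: 1000 samples, 20 time steps, 8 features each,
# with a binary label per sequence.
X = np.random.rand(1000, 20, 8).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 8)),    # learns temporal dependencies
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary sequence classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```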

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, reducing dimensionality. Another learning technique that is commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBM) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling, according to [ 46 ]. A deep belief network (DBN) is usually made up of a backpropagation neural network (BPNN) and unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics that are similar to the input data. Transfer learning, which is usually the re-use of a pre-trained model on a new problem, is now widely used because it allows deep neural networks to be trained with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article Sarker et al. [ 104 ], we have summarized a brief discussion of various artificial neural networks (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
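To make the autoencoder idea concrete, the following hedged sketch (again assuming TensorFlow/Keras; the dimensions are arbitrary) compresses 30-dimensional synthetic data into a 3-dimensional latent representation that can be reused for downstream feature extraction.

```python
# An illustrative autoencoder for unsupervised dimensionality reduction / feature extraction.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 30).astype("float32")            # synthetic unlabeled data

inputs = tf.keras.Input(shape=(30,))
encoded = tf.keras.layers.Dense(3, activation="relu")(inputs)        # bottleneck layer
decoded = tf.keras.layers.Dense(30, activation="sigmoid")(encoded)   # reconstruction

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, encoded)                 # reusable encoder for features

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0) # learn to reconstruct the input

latent_features = encoder.predict(X, verbose=0)           # 3-dimensional compressed representation
print(latent_features.shape)                               # (500, 3)
```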

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “Data Science” including advanced analytics with machine learning modeling can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policies, and every possible industry where data gets generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

  • Business or financial data science: In general, business data science can be considered as the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as taking high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models predicting customer behavior, identifying patterns and trends based on historical business data, which can help companies to reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In terms of finance, financial institutions use historical data to make high-stakes business decisions, mostly for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the future generation business or finance industry, particularly in terms of business automation, intelligence, and smart decision-making and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in such revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling thus can be used to maximize production, reduce costs and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves the extrapolation of actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained from several sources, e.g., the electronic health record, billing claims, cost estimates, and patient satisfaction surveys, and then analyzed. In reality, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, prevent preventable diseases, and generally improve the quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today’s methods of delivery of care. Thus health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and even better monitor the spread of diseases. Eventually, it may lead to new approaches to improve patient care, clinical expertise, diagnosis, and management.
  • IoT data science: Internet of things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered to be the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT’s main fields of application is a smart city, which uses technology to improve city services and citizens’ living experiences. For example, using the relevant data, data science methods can be used for traffic prediction in smart cities, to estimate the total usage of energy of the citizens for a particular period. Deep learning-based models in data science can be built based on a large scale of IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, and industry, and many others.
  • Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while surfing, or protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [ 103 , 106 ]. To generate security policy rules, association rule learning can play a significant role to build rule-based systems [ 102 ]. Deep learning-based security models can perform better when utilizing the large scale of security datasets [ 140 ]. Thus data science modeling can enable professionals in cybersecurity to be more proactive in preventing threats and reacting in real-time to active attacks, through extracting actionable insights from the security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as a PC, tablet, or smartphone [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems, etc. are all common sources of behavioral data. Behavioral data is much more than just plain data; it is dynamic rather than static [ 108 ]. Advanced analytics of these data including machine learning modeling can facilitate several areas such as predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; detecting compromised credentials and insider threats by locating anomalous behavior, or making suggestions, etc. Overall, behavioral data science modeling typically enables making the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and IoT. In a social context, analyzing the behavioral data of human beings using advanced analytics methods and the extracted insights from social data can be used for data-driven intelligent social services, which can be considered as social data science.
  • Mobile data science: Today’s smart mobile phones are considered as “next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity” [ 146 ]. In our earlier paper [ 112 ], we have shown that users’ interest in “Mobile Phones” has been consistently higher than in other platforms like “Desktop Computer”, “Laptop Computer” or “Tablet Computer” in recent years. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, Linkedin, and Twitter, and various IoT services such as smart cities, health, and transportation services, and many others. Intelligent apps are built on the insights extracted from the relevant datasets, depending on app characteristics such as action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-awareness, and cross-platform operation [ 112 ]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.
  • Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as image, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered as valuable sources of multimedia big data [ 89 ]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world’s population live in urban areas or cities [ 80 ], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, “urban area” can refer to the surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting daily events, perceptions, thoughts, and emotions of citizens or people is recorded, which can be loosely categorized into personal data, e.g., household, education, employment, health, immigration, crime, etc.; proprietary data, e.g., banking, retail, online platforms data, etc.; government data, e.g., citywide crime statistics, or government institutions, etc.; open and public data, e.g., data.gov, ordnance survey; and organic and crowdsourced data, e.g., user-generated web data, social media, Wikipedia, etc. [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective, through extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can facilitate the efficient management of urban areas including real-time management, e.g., traffic flow management, evidence-based planning decisions which pertain to the longer-term strategic role of forecasting for urban planning, e.g., crime prevention, public safety, and security, or framing the future, e.g., political decision-making [ 29 ]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment through data-driven smart decision-making and policies, which lead to smart cities and improve the quality of human life.
  • Smart villages or rural data science: Rural areas or countryside are the opposite of urban areas, and include villages, hamlets, or agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions that include protecting public safety, providing critical health services, supporting agriculture, and fostering economic development from a data-driven perspective, through extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data including machine learning [ 105 ] modeling can provide new opportunities for rural communities to build insights and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [ 105 ] can help farmers to enhance their decisions to adopt sustainable agriculture utilizing the increasing amount of data captured by emerging technologies, e.g., the internet of things (IoT), mobile technologies and devices, etc. [ 1 , 51 , 52 ]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, or other services, etc. that lead to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector in our real-world life, where the relevant data is available to analyze. To gather the right data and extract useful knowledge or actionable insights from the data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion on the above ten potential real-world application domains by taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. The “Data Scientists” typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical modeling or algorithms, or data-driven tools, or solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in “ Understanding data science modeling ”, advanced analytics methods and smart computing in “ Advanced analytics methods and smart computing ”, and real-world application areas in “ Real-world application domains ” open several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions to build data-driven products.

  • Understanding the real-world business problems and associated data, including their nature, e.g., what forms, types, sizes, labels, etc., is the first challenge in data science modeling, discussed briefly in “ Understanding data science modeling ”. This is actually to identify, specify, represent and quantify the domain-specific business problems and data according to the requirements. For a data-driven effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured or unstructured, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, which is typically the process of categorization, tagging, or labeling of raw data, for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is the extraction of the relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, the real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [ 101 ]. The quality and availability of the data highly impact the advanced analytics methods, including machine and deep learning modeling, discussed in “ Advanced analytics methods and smart computing ”. Thus, understanding the real-world business scenario and associated data, including whether, how, and why they are insufficient, missing, or problematic, and then extending or redeveloping the existing methods, such as large-scale hypothesis testing, learning under inconsistency and uncertainty, etc., to address the complexities in data and business problems is important. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.
  • Understanding and selecting the appropriate analytical methods to extract the useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns to determine what is likely to occur in the future. Basic analytics offer a description of data in general, while advanced analytics is a step forward in offering a deeper understanding of data and helping with granular data analysis. Thus, understanding the advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “ Advanced analytics methods and smart computing ” may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules from the data that make the decision-making process complex and ineffective [ 113 ]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how the techniques are robust or fragile to input data, is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [ 38 , 105 ] to solve a particular business problem is needed; consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms/techniques with higher accuracy, becomes a significant challenge for the future generation of data scientists.
  • The traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, the new trends are more likely to be interesting and useful for modeling and predicting the future than older ones. For example, smartphone user behavior modeling, IoT services, stock market forecasting, health or transport service, job market analysis, and other related areas where time-series and actual human interests or preferences are involved over time. Thus, rather than considering the traditional data analysis, the concept of RecencyMiner, i.e., recent pattern-based extracted insight or knowledge proposed in our earlier paper Sarker et al. [ 108 ] might be effective. Therefore, to propose the new techniques by taking into account the recent data patterns, and consequently to build a recency-based data-driven model for solving real-world problems, is another significant challenging issue in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports data science modeling discussed in “ Understanding data science modeling ”. As a result, advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal context, spatial context, social context, environmental context, etc. [ 100 ] can be used for building an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, as well as experimental evaluation, is a very important direction to effectively solve a business problem in a particular domain, as well as a big challenge for the data scientists.
  • In several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain the result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy; robustness versus efficiency; fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions, within the scope of our study in the area of data science and advanced analytics. The data scientists in academia/industry and the researchers in the relevant area have the opportunity to contribute to each issue identified above and build effective data-driven models or systems, to make smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view on data science including various types of advanced analytical methods that can be applied to enhance the intelligence and the capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from the relevant terms used in the area, to position this paper. We have then provided a thorough study of data science modeling with its various processing modules that are needed to extract actionable insights from the data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. For this, we have also summarized various types of advanced analytical methods and outcomes as well as machine learning modeling that are needed to solve the associated business problems. Thus, this study’s key contribution has been identified as the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. As a result, the challenges identified provide promising research opportunities in the field that can be explored with effective solutions to improve the data-driven model and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods, leads in a positive direction and can be used as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Survey paper | Open access | Published: 01 October 2015

Big data analytics: a survey

Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao & Athanasios V. Vasilakos

Journal of Big Data, volume 2, Article number: 21 (2015)


The age of big data is now coming, but traditional data analytics may not be able to handle such large quantities of data. The questions that arise now are how to develop a high-performance platform to efficiently analyze big data and how to design appropriate mining algorithms to find useful things in big data. To discuss this issue in depth, this paper begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions will also be presented for the next step of big data analytics.

Introduction

As information technology spreads fast, most data today are born digital as well as exchanged on the internet. According to the estimation of Lyman and Varian [ 1 ], more than 92% of new data were already stored in digital media devices in 2002, while the size of these new data was also more than five exabytes. In fact, the problems of analyzing large-scale data did not occur suddenly but have been there for several years, because the creation of data is usually much easier than finding useful things from the data. Even though computer systems today are much faster than those in the 1930s, large-scale data remain a strain for the computers we have today to analyze.

In response to the problems of analyzing large-scale data, quite a few efficient methods [ 2 ], such as sampling, data condensation, density-based approaches, grid-based approaches, divide and conquer, incremental learning, and distributed computing, have been presented. Of course, these methods are constantly used to improve the performance of the operators of the data analytics process. The results of these methods illustrate that with the efficient methods at hand, we may be able to analyze the large-scale data in a reasonable time. The dimensionality reduction method (e.g., principal components analysis; PCA [ 3 ]) is a typical example that is aimed at reducing the input data volume to accelerate the process of data analytics. Another reduction method that reduces the data computations of data clustering is sampling [ 4 ], which can also be used to speed up the computation time of data analytics.

Although computing hardware has followed Moore’s law for several decades with the advances of computer systems and internet technologies, the problems of handling large-scale data still exist as we enter the age of big data. That is why Fisher et al. [ 5 ] pointed out that big data means data that cannot be handled and processed by most current information systems or methods: data in the big data era will not only become too big to be loaded into a single machine, but most traditional data mining methods or data analytics developed for a centralized data analysis process may also not be directly applicable to big data. In addition to the issues of data size, Laney [ 6 ] presented a well-known definition (also called the 3Vs) to explain what “big” data is: volume, velocity, and variety. The definition of the 3Vs implies that the data size is large, the data will be created rapidly, and the data will exist in multiple types and be captured from different sources, respectively. Later studies [ 7 , 8 ] pointed out that the definition of the 3Vs is insufficient to explain the big data we face now. Thus, veracity, validity, value, variability, venue, vocabulary, and vagueness were added to complement the explanation of big data [ 8 ].

Fig. 1: Expected trend of the big data market between 2012 and 2018. The yellow, red, and blue colored boxes represent the order of appearance of references in this paper for a particular year

The report of IDC [ 9 ] indicates that the big data market was about $16.1 billion in 2014. Another report of IDC [ 10 ] forecasts that it will grow to $32.4 billion by 2017. The reports of [ 11 ] and [ 12 ] further pointed out that the big data market will reach $46.34 billion and $114 billion by 2018, respectively. As shown in Fig. 1, even though the market values of big data in these research and technology reports [ 9 – 15 ] are different, these forecasts usually indicate that the scope of big data will grow rapidly in the near future.

In addition to the market, from the results of disease control and prevention [ 16 ], business intelligence [ 17 ], and smart cities [ 18 ], we can easily understand that big data is of vital importance everywhere. A large body of research is therefore focusing on developing effective technologies to analyze big data. To discuss big data analytics in depth, this paper gives not only a systematic description of traditional large-scale data analytics but also a detailed discussion about the differences between data and big data analytics frameworks, for the data scientists or researchers who focus on big data analytics.

Moreover, although several data analytics and frameworks have been presented in recent years, with their pros and cons being discussed in different studies, a complete discussion from the perspective of data mining and knowledge discovery in databases still is needed. As a result, this paper is aimed at providing a brief review for the researchers on the data mining and distributed computing domains to have a basic idea to use or develop data analytics for big data.

Roadmap of this paper

Figure 2 shows the roadmap of this paper, and the remainder of the paper is organized as follows. “ Data analytics ” begins with a brief introduction to the data analytics, and then “ Big data analytics ” will turn to the discussion of big data analytics as well as state-of-the-art data analytics algorithms and frameworks. The open issues are discussed in “ The open issues ” while the conclusions and future trends are drawn in “ Conclusions ”.

Data analytics

To make the whole process of knowledge discovery in databases (KDD) more clear, Fayyad and his colleagues summarized the KDD process by a few operations in [ 19 ], which are selection, preprocessing, transformation, data mining, and interpretation/evaluation. As shown in Fig. 3 , with these operators at hand we will be able to build a complete data analytics system to gather data first and then find information from the data and display the knowledge to the user. According to our observation, the number of research articles and technical reports that focus on data mining is typically more than the number focusing on other operators, but it does not mean that the other operators of KDD are unimportant. The other operators also play the vital roles in KDD process because they will strongly impact the final result of KDD. To make the discussions on the main operators of KDD process more concise, the following sections will focus on those depicted in Fig. 3 , which were simplified to three parts (input, data analytics, and output) and seven operators (gathering, selection, preprocessing, transformation, data mining, evaluation, and interpretation).

Fig. 3: The process of knowledge discovery in databases

As shown in Fig. 3, the gathering, selection, preprocessing, and transformation operators are in the input part. The selection operator usually plays the role of determining which kind of data is required for data analysis and selecting the relevant information from the gathered data or databases; thus, the data gathered from different data resources will need to be integrated into the target data. The preprocessing operator plays a different role in dealing with the input data: it is aimed at detecting, cleaning, and filtering the unnecessary, inconsistent, and incomplete data to make them useful. After the selection and preprocessing operators, the secondary data may still be in a number of different data formats; therefore, the KDD process needs to transform them into a data-mining-capable format, which is performed by the transformation operator. Methods for reducing the complexity and downsizing the data scale to make the data useful for the data analysis part, such as dimensionality reduction, sampling, coding, or transformation, are usually employed in the transformation step.

The data extraction, data cleaning, data integration, data transformation, and data reduction operators can be regarded as the preprocessing processes of data analysis [ 20 ] which attempts to extract useful data from the raw data (also called the primary data) and refine them so that they can be used by the following data analyses. If the data are a duplicate copy, incomplete, inconsistent, noisy, or outliers, then these operators have to clean them up. If the data are too complex or too large to be handled, these operators will also try to reduce them. If the raw data have errors or omissions, the roles of these operators are to identify them and make them consistent. It can be expected that these operators may affect the analytics result of KDD, be it positive or negative. In summary, the systematic solutions are usually to reduce the complexity of data to accelerate the computation time of KDD and to improve the accuracy of the analytics result.

Data analysis

Since the data analysis (as shown in Fig. 3 ) in KDD is responsible for finding the hidden patterns/rules/information from the data, most researchers in this field use the term data mining to describe how they refine the “ground” (i.e., raw data) into the “gold nugget” (i.e., information or knowledge). The data mining methods [ 20 ] are not limited to data problem specific methods. In fact, other technologies (e.g., statistical or machine learning technologies) have also been used to analyze the data for many years. In the early stages of data analysis, the statistical methods were used for analyzing the data to help us understand the situation we are facing, such as public opinion polls or TV programme ratings. Like the statistical analysis, the problem specific methods for data mining also attempted to understand the meaning from the collected data.

After the data mining problem was presented, some domain specific algorithms were also developed. An example is the apriori algorithm [ 21 ], which is one of the useful algorithms designed for the association rules problem. Although most definitions of data mining problems are simple, the computation costs are quite high. To speed up the response time of a data mining operator, machine learning [ 22 ], metaheuristic algorithms [ 23 ], and distributed computing [ 24 ] were used alone or combined with the traditional data mining algorithms to provide more efficient ways for solving the data mining problem. One of the well-known combinations can be found in [ 25 ], where Krishna and Murty attempted to combine the genetic algorithm and k -means to get better clustering results than k -means alone does.

Data mining algorithm

As Fig. 4 shows, most data mining algorithms contain the initialization, data input and output, data scan, rules construction, and rules update operators [ 26 ]. In Fig. 4 , D represents the raw data, d the data from the scan operator, r the rules, o the predefined measurement, and v the candidate rules. The scan, construct, and update operators will be performed repeatedly until the termination criterion is met. The timing to employ the scan operator depends on the design of the data mining algorithm; thus, it can be considered as an optional operator. Most of the data mining algorithms can be described by Fig. 4 , which also shows that the representative algorithms— clustering , classification , association rules , and sequential patterns —will apply these operators to find the hidden information from the raw data. Thus, modifying these operators will be one of the possible ways for enhancing the performance of the data analysis.

Clustering is one of the well-known data mining problems because it can be used to understand the “new” input data. The basic idea of this problem [ 27 ] is to separate a set of unlabeled input data into k different groups, e.g., by k -means [ 28 ]. Classification [ 20 ] is the opposite of clustering because it relies on a set of labeled input data to construct a set of classifiers (i.e., groups) which will then be used to classify the unlabeled input data to the groups to which they belong. To solve the classification problem, the decision tree-based algorithm [ 29 ], naïve Bayesian classification [ 30 ], and support vector machine (SVM) [ 31 ] have been widely used in recent years.
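To make the clustering idea and the iterative "scan the data / update the rules (centroids)" pattern of Fig. 4 concrete, here is a compact, illustrative k-means sketch (our own example, not code from the surveyed studies):

```python
# An illustrative k-means loop: repeated data scan (assignment) and rule update (centroids).
import numpy as np

def kmeans(X, k=2, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialization
    for _ in range(iterations):
        # scan: assign each datum to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update: recompute each centroid as the mean of its assigned data
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two synthetic blobs
labels, centroids = kmeans(X, k=2)
print(centroids)
```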

Unlike clustering and classification that attempt to classify the input data into k groups, association rules and sequential patterns are focused on finding out the “relationships” between the input data. The basic idea of association rules [ 21 ] is to find all the co-occurrence relationships between the input data. For the association rules problem, the apriori algorithm [ 21 ] is one of the most popular methods. Nevertheless, because it is computationally very expensive, later studies [ 32 ] have attempted to use different approaches to reduce the cost of the apriori algorithm, such as applying the genetic algorithm to this problem [ 33 ]. In addition to considering the relationships between the input data, if we also consider the sequence or time series of the input data, then it will be referred to as the sequential pattern mining problem [ 34 ]. Several apriori-like algorithms were presented for solving it, such as generalized sequential pattern [ 34 ] and sequential pattern discovery using equivalence classes [ 35 ].
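The toy sketch below (illustrative only, not an apriori implementation from the cited works; the transactions and the 0.5 minimum-support threshold are made up for demonstration) shows the underlying support and confidence computations on a few market-basket transactions.

```python
# Support and confidence of co-occurrence relationships over toy market-basket data.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pairs above an assumed minimum support threshold of 0.5
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[frozenset(pair)] += 1
frequent_pairs = [set(p) for p in pair_counts if support(p) >= 0.5]

# Confidence of the rule {diaper} -> {beer}
confidence = support({"diaper", "beer"}) / support({"diaper"})
print(frequent_pairs, confidence)
```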

Output the result

Evaluation and interpretation are two vital operators of the output. Evaluation typically plays the role of measuring the results. It can also be one of the operators for the data mining algorithm, such as the sum of squared errors which was used by the selection operator of the genetic algorithm for the clustering problem [ 25 ].

To solve the data mining problems that attempt to classify the input data, two of the major goals are: (1) cohesion—the distance between each data and the centroid (mean) of its cluster should be as small as possible, and (2) coupling—the distance between data which belong to different clusters should be as large as possible. In most studies of data clustering or classification problems, the sum of squared errors (SSE), which was used to measure the cohesion of the data mining results, can be defined as

\(\mathrm{SSE} = \sum _{i=1}^{k} \sum _{j=1}^{n_i} D(x_{ij}, c_i)^2,\)

where k is the number of clusters which is typically given by the user; \(n_i\) the number of data in the i th cluster; \(x_{ij}\) the j th datum in the i th cluster; \(c_i\) is the mean of the i th cluster; and \(n= \sum ^k_{i=1} n_i\) is the number of data. The most commonly used distance measure for the data mining problem is the Euclidean distance, which is defined as

\(D(p_i, p_j) = \sqrt{\textstyle \sum _{l} (p_{il} - p_{jl})^2},\)

where \(p_i\) and \(p_j\) are the positions of two different data. For solving different data mining problems, the distance measurement \(D(p_i, p_j)\) can be the Manhattan distance, the Minkowski distance, or even the cosine similarity [ 36 ] between two different documents.

Accuracy (ACC) is another well-known measurement [ 37 ], which is defined as

\(\mathrm{ACC} = \dfrac{\text {Number of correctly classified data}}{\text {Total number of data}}.\)

To evaluate the classification results, precision ( p ), recall ( r ), and F -measure can be used to measure how many data that do not belong to group A are incorrectly classified into group A ; and how many data that belong to group A are not classified into group A . A simple confusion matrix of a classifier [ 37 ] as given in Table 1 can be used to cover all the situations of the classification results.

In Table 1 , TP and TN indicate the numbers of positive examples and negative examples that are correctly classified, respectively; FN and FP indicate the numbers of positive examples and negative examples that are incorrectly classified, respectively. With the confusion matrix at hand, it is much easier to describe the meaning of precision ( p ), which is defined as

\(p = \dfrac{TP}{TP + FP},\)

and the meaning of recall ( r ), which is defined as

\(r = \dfrac{TP}{TP + FN}.\)

The F -measure can then be computed as

\(F = \dfrac{2 \cdot p \cdot r}{p + r}.\)
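For concreteness, a small helper (our own sketch, not from the survey) that computes these measures from confusion-matrix counts might look as follows:

```python
# Illustrative computation of accuracy, precision, recall, and F-measure from TP/TN/FP/FN.
def classification_measures(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f_measure

print(classification_measures(tp=40, tn=45, fp=5, fn=10))  # (0.85, ~0.889, 0.8, ~0.842)
```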

In addition to the above-mentioned measurements for evaluating the data mining results, the computation cost and response time are another two well-known measurements. When two different mining algorithms can find the same or similar results, of course, how fast they can get the final mining results will become the most important research topic.

After something (e.g., classification rules) is found by data mining methods, the two essential research topics are: (1) interpretation, i.e., the work to navigate and explore the meaning of the results from the data analysis to further support the user in making applicable decisions, which can be regarded as the interpretation operator [ 38 ] and, in most cases, provides a useful interface to display the information [ 39 ]; and (2) summarization, where a meaningful summarization of the mining results [ 40 ] is made to make it easier for the user to understand the information from the data analysis. Data summarization is generally expected to be one of the simple ways to provide a concise piece of information to the user because humans have trouble understanding vast amounts of complicated information. A simple data summarization can be found in the clustering search engine: when a query “oasis” is sent to Carrot2 ( http://search.carrot2.org/stable/search ), it will return some keywords to represent each group of the clustering results for web links, helping us recognize which category is needed by the user, as shown on the left side of Fig. 5.

Fig. 5: Screenshot of the results of a clustering search engine

A useful graphical user interface is another way to provide meaningful information to a user. As explained by Shneiderman in [ 39 ], we need “overview first, zoom and filter, then retrieve the details on demand”. A useful graphical user interface [ 38 , 41 ] also makes it easier for the user to comprehend the meaning of the results when the number of dimensions is higher than three. How the results of data mining are displayed will affect the user’s perspective in making decisions. For instance, data mining can help us find “type A influenza” in a particular region, but without the time series and flu virus infection information of patients, the government cannot recognize what situation (pandemic or controlled) we are facing so as to make appropriate responses. For this reason, a better solution that merges the information from different sources and mining algorithm results will be useful to let the user make the right decision.

Since the problems of handling and analyzing large-scale and complex input data always exist in data analytics, several efficient analysis methods were presented to accelerate the computation time or to reduce the memory cost for the KDD process, as shown in Table 2. The study of [ 42 ] shows that basic mathematical concepts (i.e., the triangle inequality) can be used to reduce the computation cost of a clustering algorithm. Another study [ 43 ] shows that new technologies (i.e., distributed computing by GPU) can also be used to reduce the computation time of a data analysis method. In addition to the well-known improved methods for these analysis methods (e.g., triangle inequality or distributed computing), a large proportion of studies designed their efficient methods based on the characteristics of the mining algorithms or the problem itself, which can be found in [ 32 , 44 , 45 ], and so forth. This kind of improved method was typically designed to address the drawbacks of the mining algorithms or to solve the mining problem in a different way. These situations can be found in most association rules and sequential patterns problems because the original assumption of these problems is the analysis of large-scale datasets. Since the earlier frequent pattern algorithms (e.g., the apriori algorithm) need to scan the whole dataset many times, which is computationally very expensive, how to reduce the number of times the whole dataset is scanned so as to save the computation cost is one of the most important issues in all frequent pattern studies. A similar situation also exists in data clustering and classification studies; design concepts such as mining the patterns on-the-fly [ 46 ], mining partial patterns at different stages [ 47 ], and reducing the number of times the whole dataset is scanned [ 32 ] have therefore been presented to enhance the performance of these mining algorithms. Since some of the data mining problems are NP-hard [ 48 ] or their solution space is very large, several recent studies [ 23 , 49 ] have attempted to use metaheuristic algorithms as the mining algorithm to get an approximate solution within a reasonable time.

Abundant research results on data analysis [ 20 , 27 , 63 ] show possible solutions for dealing with the dilemmas of data mining algorithms. It means that the open issues of data analysis from the literature [ 2 , 64 ] can usually help us easily find possible solutions. For instance, the clustering result is extremely sensitive to the initial means, which can be mitigated by using multiple sets of initial means [ 65 ]. According to our observation, most data analysis methods have limitations for big data, which can be described as follows:

  • Unscalability and centralization: Most data analysis methods are not designed for large-scale and complex datasets. Traditional data analysis methods cannot be scaled up because their design does not take into account large or complex datasets. The design of traditional data analysis methods typically assumes they will be performed on a single machine, with all the data in memory for the data analysis process. For this reason, the performance of traditional data analytics will be limited in solving the volume problem of big data.

  • Non-dynamic: Most traditional data analysis methods cannot be dynamically adjusted for different situations, meaning that they do not analyze the input data on-the-fly. For example, the classifiers are usually fixed and cannot be automatically changed. Incremental learning [ 66 ] is a promising research trend because it can dynamically adjust the classifiers in the training process with limited resources. As a result, traditional data analytics may not be useful for the velocity problem of big data.

  • Uniform data structure: Most of the data mining problems assume that the format of the input data will be the same. Therefore, the traditional data mining algorithms may not be able to deal with the problem that the formats of different input data may be different and some of the data may be incomplete. How to make the input data from different sources have the same format will be a possible solution to the variety problem of big data.

Because the traditional data analysis methods are not designed for large-scale and complex data, it is almost impossible for them to analyze big data. Redesigning and changing the way the data analysis methods are designed are two critical trends for big data analysis. Several important concepts in the design of the big data analysis method will be given in the following sections.

Big data analytics

Nowadays, the data that need to be analyzed are not just large, but are also composed of various data types, even including streaming data [ 67 ]. Big data has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous,” which may change the statistical and data analysis approaches [ 68 ]. Although it seems that big data makes it possible for us to collect more data to find more useful information, the truth is that more data do not necessarily mean more useful information; they may contain more ambiguous or abnormal data. For instance, a user may have multiple accounts, or an account may be used by multiple users, which may degrade the accuracy of the mining results [ 69 ]. Therefore, several new issues for data analytics come up, such as privacy, security, storage, fault tolerance, and quality of data [ 70 ].

Fig. 6: The comparison between traditional data analysis and big data analysis on a wireless sensor network

Big data may be created by handheld devices, social networks, the internet of things, multimedia, and many other new applications that all have the characteristics of volume, velocity, and variety. As a result, the whole data analytics process has to be re-examined from the following perspectives:

From the volume perspective, the deluge of input data is the very first thing that we need to face because it may paralyze the data analytics. Different from traditional data analytics, for wireless sensor network data analysis, Baraniuk [ 71 ] pointed out that the bottleneck of big data analytics will be shifted from the sensor to the processing, communications, and storage of sensing data, as shown in Fig. 6. This is because sensors can gather much more data, but when such large data are uploaded to upper-layer systems, bottlenecks may be created everywhere.

In addition, from the velocity perspective, real-time or streaming data bring up the problem of a large quantity of data coming into the data analytics within a short duration, while the device and system may not be able to handle these input data. This situation is similar to that of network flow analysis, for which we typically cannot mirror and analyze everything we can gather.

From the variety perspective, because the incoming data may be of different types or be incomplete, how to handle them also brings up another issue for the input operators of data analytics.

In this section, we will turn the discussion to the big data analytics process.

Big data input

The problem of handling a vast quantity of data that the system is unable to process is not a brand-new research issue; in fact, it appeared in several early approaches [ 2 , 21 , 72 ], e.g., marketing analysis, network flow monitoring, gene expression analysis, weather forecasting, and even astronomy analysis. This problem still exists in big data analytics today; thus, preprocessing is an important task to make the computer, platform, and analysis algorithm able to handle the input data. The traditional data preprocessing methods [ 73 ] (e.g., compression, sampling, feature selection, and so on) are expected to be able to operate effectively in the big data age. However, a portion of the studies still focus on how to reduce the complexity of the input data because even the most advanced computer technology cannot efficiently process the whole input data by using a single machine in most cases. Using domain knowledge to design the preprocessing operator is a possible solution for big data. In [ 74 ], Ham and Lee used domain knowledge, B -tree, and divide-and-conquer to filter the unrelated log information for mobile web log analysis. A later study [ 75 ] considered that the computation cost of preprocessing will be quite high for massive log, sensor, or marketing data analysis. Thus, Dawelbeit and McCrindle employed the bin packing partitioning method to divide the input data between the computing processors to handle the high computation cost of preprocessing on a cloud system. The cloud system is employed to preprocess the raw data and then output the refined data (e.g., data with a uniform format) to make it easier for the data analysis method or system to perform the further analysis work.

Sampling and compression are two representative data reduction methods for big data analytics because reducing the size of the data makes the data analytics computationally less expensive, and thus faster, especially for data coming into the system rapidly. In addition to making the sampled data represent the original data effectively [ 76 ], how many instances need to be selected for the data mining method is another research issue [ 77 ] because it will affect the performance of the sampling method in most cases.
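
As a minimal sketch of sampling as a data reduction step, the following Python code shows reservoir sampling, a standard way to keep a fixed-size uniform sample from data that arrive as a stream too large to store; the stream and the sample size are illustrative only.

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, n)   # replace an old item with decreasing probability
            if j < k:
                sample[j] = item
    return sample

# Illustrative use: reduce one million incoming records to a 1,000-record sample.
print(len(reservoir_sample(range(1_000_000), 1_000)))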

To avoid the application-level slow-down caused by the compression process, in [ 78 ], Jun et al. attempted to use an FPGA to accelerate the compression process. I/O performance optimization is another issue for the compression method. For this reason, Zou et al. [ 79 ] employed tentative selection and predictive dynamic selection and switched to the appropriate compression method between two different strategies to improve the performance of the compression process. To make it possible for the compression method to compress the data efficiently, a promising solution is to apply a clustering method to the input data to divide them into several different groups and then compress these input data according to the clustering information. The compression method described in [ 80 ] is one such solution: it first clusters the input data and then compresses them via the clustering results, while the study [ 81 ] also used a clustering method to improve the performance of the compression process.
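
The cluster-then-compress strategy can be sketched as follows. This is only an illustration of the general idea (group similar records first so that the compressor can exploit the redundancy within each group), not the specific methods of [ 80 , 81 ]; the synthetic sensor readings, the quantization step, and the use of k-means with zlib are all assumptions made for the example.

import zlib
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_compress(X, n_clusters=4):
    """Cluster numeric records, then compress each cluster separately so that
    the redundancy among similar records can be exploited by the compressor."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    blobs = {}
    for c in range(n_clusters):
        rows = np.round(X[labels == c], 2)       # quantize slightly to expose redundancy
        blobs[c] = zlib.compress(rows.tobytes())
    return blobs

# Illustrative data: four groups of similar sensor readings.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.01, size=(2_500, 3)) for m in (0.0, 1.0, 2.0, 3.0)])
print({c: len(b) for c, b in cluster_then_compress(X).items()})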

In summary, in addition to handling the large and fast data input, the research issues of heterogeneous data sources, incomplete data, and noisy data may also affect the performance of the data analysis. The input operators will have a stronger impact on the data analytics in the big data age than they had in the past. As a result, the design of big data analytics needs to consider how to make these tasks (e.g., data cleaning, data sampling, data compression) work well.

Big data analysis frameworks and platforms

Various solutions have been presented for big data analytics, which can be divided [ 82 ] into (1) Processing/Compute: Hadoop [ 83 ], Nvidia CUDA [ 84 ], or Twitter Storm [ 85 ], (2) Storage: Titan or HDFS, and (3) Analytics: MLPACK [ 86 ] or Mahout [ 87 ]. Although there exist commercial products for data analysis [ 83 – 86 ], most of the studies on traditional data analysis are focused on the design and development of efficient and/or effective “ways” to find useful things from the data. But when we enter the age of big data, most of the current computer systems will not be able to handle the whole dataset all at once; thus, how to design a good data analytics framework or platform Footnote 3 and how to design analysis methods are both important for the data analysis process. In this section, we will start with a brief introduction to data analysis frameworks and platforms, followed by a comparison of them.

Fig. 7 The basic idea of big data analytics on cloud system

Research on frameworks and platforms

To date, we can easily find tools and platforms presented by well-known organizations. Cloud computing technologies are widely used on these platforms and frameworks to satisfy the large demands for computing power and storage. As shown in Fig. 7, most of the works on KDD for big data can be moved to a cloud system to speed up the response time or to increase the memory space. With the advance of these works, handling and analyzing big data within a reasonable time is no longer so far away. Because the foundation functions to handle and manage big data have been developed gradually, data scientists nowadays do not have to take care of everything, from raw data gathering to data analysis, by themselves if they use existing platforms or technologies to handle and manage the data. Data scientists nowadays can pay more attention to finding useful information in the data, even though this task is typically like looking for a needle in a haystack. That is why several recent studies have tried to present efficient and effective frameworks to analyze big data, especially for finding out the useful things.

Performance-oriented From the perspective of platform performance, Huai et al. [ 88 ] pointed out that most of the traditional parallel processing models improve the performance of the system by using a new, larger computer system to replace the old computer system, which is usually referred to as “scale up”, as shown in Fig. 8a. But for big data analytics, most studies improve the performance of the system by adding more similar computer systems to make it possible for a system to handle all the tasks that cannot be loaded or computed on a single computer system (called “scale out”), as shown in Fig. 8b, where M1, M2, and M3 represent computer systems that have different computing power. For the scale-up-based solution, the computing power of the three systems is in the order of \(\text {M3}>\text {M2}>\text {M1}\); but for the scale-out-based system, all we have to do is to keep adding more similar computer systems to a system to increase its ability. To build a scalable and fault-tolerant manager for big data analysis, Huai et al. [ 88 ] presented a matrix model, called DOT, which consists of three matrices: for the data set (D), concurrent data processing operations (O), and data transformations (T). The big data are divided into n subsets, each of which is processed by a computer node (worker) in such a way that all the subsets are processed concurrently, and then the results from these n computer nodes are collected and transformed to a computer node. By using this framework, the whole data analysis framework is composed of several DOT blocks, and the system performance can easily be enhanced by adding more DOT blocks to the system.

Fig. 8 The comparisons between scale up and scale out
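
A minimal way to mimic the scale-out style of processing on a single machine is to divide the data into n subsets, let separate workers process them concurrently, and then collect and transform the partial results, in the spirit of the D, O, and T steps described above. The sketch below uses Python's multiprocessing as a stand-in for the computer nodes and computes only a global mean for illustration; it is not the DOT implementation itself.

from multiprocessing import Pool
import numpy as np

def local_stats(chunk):
    """The per-worker operation (O): a partial sum and count for a global mean."""
    return chunk.sum(), len(chunk)

if __name__ == "__main__":
    data = np.random.rand(1_000_000)
    chunks = np.array_split(data, 4)                 # D: divide the data into n subsets
    with Pool(processes=4) as pool:
        partials = pool.map(local_stats, chunks)     # O: process the subsets concurrently
    total, count = map(sum, zip(*partials))          # T: transform/aggregate the results
    print("global mean:", total / count)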

Another efficient big data analytics system was presented in [ 89 ], called the generalized linear aggregates distributed engine (GLADE). The GLADE is a multi-level tree-based data analytics system which consists of two types of computer nodes: a coordinator and workers. The simulation results [ 90 ] show that GLADE can provide better performance than Hadoop in terms of execution time. Because Hadoop requires large memory and storage for data replication and has a single master, Footnote 4 Essa et al. [ 91 ] presented a mobile-agent-based framework to solve these two problems, called map reduce agent mobility (MRAM). The main reason is that each mobile agent can send its code and data to any other machine; therefore, the whole system will not go down if the master fails. Compared to Hadoop, the architecture of MRAM was changed from client/server to a distributed agent architecture. The load time of MRAM is less than that of Hadoop even though both of them use the map-reduce solution and the Java language. In [ 92 ], Herodotou et al. considered the issues of user needs and system workloads. They presented a self-tuning analytics system built on Hadoop for big data analysis. Since one of the major goals of their system is to adjust the system based on the user needs and system workloads to provide good performance automatically, the user usually does not need to understand and manipulate the Hadoop system. The study [ 93 ] started from the perspectives of data-centric architecture and operational models to present a big data architecture framework (BDAF) which includes: big data infrastructure, big data analytics, data structures and models, big data lifecycle management, and big data security. According to the observations of Demchenko et al. [ 93 ], cluster services, Hadoop related services, data analytics tools, databases, servers, and massively parallel processing databases are typically the required applications and services in a big data analytics infrastructure.

Result-oriented Fisher et al. [ 5 ] presented a big data pipeline to show the workflow of big data analytics for extracting valuable knowledge from big data, which consists of acquiring the data, choosing the architecture, shaping the data into the architecture, coding/debugging, and reflecting. From the perspectives of statistical computation and data mining, Ye et al. [ 94 ] presented an architecture of a services platform which integrates R to provide better data analysis services, called the cloud-based big data mining and analyzing services platform (CBDMASP). The design of this platform is composed of four layers: the infrastructure services layer, the virtualization layer, the dataset processing layer, and the services layer. Several large-scale clustering problems (with datasets of size from 0.1 G up to 25.6 G) were also used to evaluate the performance of the CBDMASP. The simulation results show that using map-reduce is much faster than using a single machine when the input data become too large. Although the size of the test dataset cannot be regarded as big data, this kind of testing shows how big data analytics using map-reduce can be sped up. In this study, map-reduce is the better solution when the dataset is of size more than 0.2 G, and a single machine is unable to handle a dataset of size more than 1.6 G.

Another study [ 95 ] presented a theorem, called HACE, to explain the characteristics of big data: big data usually involve a large volume of data from Heterogeneous, Autonomous sources with distributed and decentralized control, and we usually try to find useful and interesting things from the complex and evolving relationships among the data. Based on these concerns and data mining issues, Wu and his colleagues [ 95 ] also presented a big data processing framework which includes a data accessing and computing tier, a data privacy and domain knowledge tier, and a big data mining algorithm tier. This work explains that the data mining algorithm will become much more important and much more difficult; thus, challenges will also occur in the design and implementation of big data analytics platforms. In addition to the platform performance and data mining issues, the privacy issue for big data analytics has been a promising research topic in recent years. In [ 96 ], Laurila et al. explained that privacy is an essential problem when we try to find something from data that are gathered from mobile devices; thus, data security and data anonymization should also be considered in analyzing this kind of data. Demirkan and Delen [ 97 ] presented a service-oriented decision support system (SODSS) for big data analytics which includes information source, data management, information management, and operations management.

Comparison between the frameworks/platforms of big data

In [ 98 ], Talia pointed out that cloud-based data analytics services can be divided into data analytics software as a service, data analytics platform as a service, and data analytics infrastructure as a service. A later study [ 99 ] presented a general architecture of big data analytics which contains multi-source big data collecting, distributed big data storing, and intra/inter big data processing. Since many kinds of data analytics frameworks and platforms have been presented, some of the studies attempted to compare them to give guidance on choosing the applicable frameworks or platforms for relevant works. To give a brief introduction to big data analytics, especially the platforms and frameworks, in [ 100 ], Cuzzocrea et al. first discuss how recent studies responded to the “computational emergency” issue of big data analytics. Some open issues, such as data source heterogeneity and uncorrelated data filtering, and possible research directions are also given in the same study. In [ 101 ], Zhang and Huang used the 5Ws model to explain what kind of framework and method we need for different big data approaches. Zhang and Huang further explained that the 5Ws model represents what kind of data, why we have these data, where the data come from, when the data occur, who receives the data, and how the data are transferred. A later study [ 102 ] used several features (i.e., owner, workload, source code, low latency, and complexity) to compare the frameworks of Hadoop [ 83 ], Storm [ 85 ] and Drill [ 103 ]. From this comparison, it can easily be seen that the framework of Apache Hadoop has high latency compared with the other two frameworks. To better understand the strong and weak points of big data solutions, Chalmers et al. [ 82 ] then employed volume, variety, variability, velocity, user skill/experience, and infrastructure to evaluate eight big data analytics solutions.

In [ 104 ], in addition to defining that a big data system should include data generation, data acquisition, data storage, and data analytics modules, Hu et al. also mentioned that a big data system can be decomposed into infrastructure, computing, and application layers. Moreover, promising research on NoSQL storage systems was also discussed in this study, which can be divided into key-value , column , document , and row databases. Since big data analysis is generally regarded as work with a high computation cost, the high performance computing cluster system (HPCC) is also a possible solution in the early stage of big data analytics. Sagiroglu and Sinanc [ 105 ] therefore compare the characteristics of HPCC and Hadoop. They then emphasized that the HPCC system uses multikey and multivariate indexes on a distributed file system while Hadoop uses a column-oriented database. In [ 17 ], Chen et al. give a brief introduction to the big data analytics of business intelligence (BI) from the perspective of evolution, applications, and emerging research topics. In their survey, Chen et al. explained that the evolution of business intelligence and analytics (BI&A) went from BI&A 1.0 and BI&A 2.0 to BI&A 3.0, which are characterized by DBMS-based and structured content, web-based and unstructured content, and mobile and sensor-based content, respectively.

Big data analysis algorithms

Mining algorithms for specific problems.

Big data issues have been around for many years; in fact, in [ 106 ], Fan and Bifet pointed out that the terms “big data” [ 107 ] and “big data mining” [ 108 ] were each first presented in 1998. That big data and big data mining appeared at almost the same time indicates that finding something useful from big data will be one of the major tasks in this research domain. Data mining algorithms for data analysis also play a vital role in big data analysis, in terms of the computation cost, memory requirement, and accuracy of the end results. In this section, we will give a brief discussion from the perspective of analysis and search algorithms to explain their importance for big data analytics.

Clustering algorithms In the big data age, traditional clustering algorithms will become even more limited than before because they typically require that all the data be in the same format and be loaded into the same machine so as to find some useful things from the whole data. Although the problem [ 64 ] of analyzing large-scale and high-dimensional datasets has attracted many researchers from various disciplines in the last century, and several solutions [ 2 , 109 ] have been presented in recent years, the characteristics of big data still bring up several new challenges for data clustering. Among them, how to reduce the data complexity is one of the important issues for big data clustering. In [ 110 ], Shirkhorshidi et al. divided big data clustering into two categories: single-machine clustering (i.e., sampling and dimension reduction solutions) and multiple-machine clustering (parallel and MapReduce solutions). This means that traditional reduction solutions can also be used in the big data age because the complexity and memory space needed for the process of data analysis will be decreased by using sampling and dimension reduction methods. More precisely, sampling can be regarded as reducing the “amount of data” entered into a data analyzing process while dimension reduction can be regarded as “downsizing the whole dataset” because irrelevant dimensions will be discarded before the data analyzing process is carried out.

CloudVista [ 111 ] is a representative solution for clustering big data which used cloud computing to perform the clustering process in parallel. BIRCH [ 44 ] and a sampling method were used in CloudVista to show that it is able to handle large-scale data, e.g., 25 million census records. Using a GPU to enhance the performance of a clustering algorithm is another promising solution for big data mining. The multiple species flocking (MSF) algorithm [ 112 ] was applied to the CUDA platform from NVIDIA to reduce the computation time of the clustering algorithm in [ 113 ]. The simulation results show that the speedup factor can be increased from 30 up to 60 by using the GPU for data clustering. Since most traditional clustering algorithms (e.g., k-means) require centralized computation, how to make them capable of handling big data clustering problems is the major concern of Feldman et al. [ 114 ], who use a tree construction for generating coresets in parallel, called the “merge-and-reduce” approach. Moreover, Feldman et al. pointed out that by using this solution for clustering, the update time per datum and the memory of the traditional clustering algorithms can be significantly reduced.
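
The divide-and-conquer flavour of such approaches can be illustrated with the following sketch, which is a simplification rather than the coreset construction of Feldman et al. [ 114 ]: each partition of the data is clustered locally, and the resulting centers, weighted by the number of points they summarize, are clustered again to obtain approximate global centers. The partition count, cluster count, and synthetic data are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def two_level_kmeans(X, n_parts=8, k=3):
    """Cluster each partition locally, then cluster the local centers
    (weighted by cluster size) to obtain approximate global centers."""
    centers, weights = [], []
    for part in np.array_split(X, n_parts):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(part)
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    merged = KMeans(n_clusters=k, n_init=10, random_state=0)
    merged.fit(np.vstack(centers), sample_weight=np.concatenate(weights))  # merge step on summaries only
    return merged.cluster_centers_

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, size=(5_000, 2)) for m in ((0, 0), (3, 3), (6, 0))])
print(two_level_kmeans(X))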

Classification algorithms Similar to the clustering algorithms for big data mining, several studies also attempted to modify the traditional classification algorithms to make them work in a parallel computing environment or to develop new classification algorithms which work naturally in a parallel computing environment. In [ 115 ], the design of the classification algorithm took into account input data that are gathered from distributed data sources and processed by a heterogeneous set of learners. Footnote 5 In this study, Tekin et al. presented a novel classification algorithm called “classify or send for classification” (CoS). They assumed that each learner can be used to process the input data in two different ways in a distributed data classification system. One is to perform a classification function by itself while the other is to forward the input data to another learner to have them labeled. The information will be exchanged between different learners. In brief, this kind of solution can be regarded as cooperative learning to improve the accuracy in solving the big data classification problem. An interesting solution uses quantum computing to reduce the memory space and computing cost of a classification algorithm. For example, in [ 116 ], Rebentrost et al. presented a quantum-based support vector machine for big data classification and argued that the classification algorithm they proposed can be implemented with a time complexity \(O(\log NM)\) where N is the number of dimensions and M is the number of training data. There are bright prospects for big data mining using quantum-based search algorithms once the hardware of quantum computing has become mature.
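
The classify-or-send idea can be sketched roughly as follows; this is a toy version, not the CoS algorithm of Tekin et al. [ 115 ]: a local learner labels an instance only when its confidence exceeds a threshold, and otherwise forwards the instance to another learner. The data split, the choice of models, and the threshold are all illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=4_000, n_features=20, random_state=0)
X_local, y_local = X[:1_000], y[:1_000]                # data seen by the local learner
X_remote, y_remote = X[1_000:3_000], y[1_000:3_000]    # data seen by the remote learner
X_new = X[3_000:]                                      # unlabeled instances to classify

local = LogisticRegression(max_iter=1_000).fit(X_local, y_local)
remote = RandomForestClassifier(random_state=0).fit(X_remote, y_remote)

def classify_or_send(x, threshold=0.8):
    """Label locally when confident enough; otherwise forward to the other learner."""
    p = local.predict_proba([x])[0]
    if p.max() >= threshold:
        return int(np.argmax(p)), "local"
    return int(remote.predict([x])[0]), "forwarded"

print([classify_or_send(x) for x in X_new[:10]])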

Frequent pattern mining algorithms Most of the research on frequent pattern mining (i.e., association rule and sequential pattern mining) was focused on handling large-scale datasets from the very beginning because some of the early approaches attempted to analyze the transaction data of large shopping malls. Because the number of transactions is usually more than “tens of thousands”, the issue of how to handle large-scale data was studied for several years; the FP-tree [ 32 ], for example, uses a tree structure to store the frequent patterns and thus further reduce the computation time of association rule mining. In addition to the traditional frequent pattern mining algorithms, of course, parallel computing and cloud computing technologies have also attracted researchers in this research domain. Among them, the map-reduce solution was used in the studies [ 117 – 119 ] to enhance the performance of the frequent pattern mining algorithm. Given the use of the map-reduce model for frequent pattern mining algorithms, it can easily be expected that their application to “cloud platforms” [ 120 , 121 ] will become a popular trend in the near future. The study of [ 119 ] not only used the map-reduce model, it also allowed users to express their specific interest constraints in the process of frequent pattern mining. The performance of these map-reduce-based methods for big data analysis is, no doubt, better than that of the traditional frequent pattern mining algorithms running on a single machine.
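
The first scan of a frequent pattern mining algorithm, counting how often each item occurs, maps naturally onto the map-reduce model. The following single-machine sketch emulates the map, shuffle, and reduce phases with ordinary Python data structures; the transactions and the minimum support are illustrative only.

from collections import defaultdict
from itertools import chain

transactions = [["milk", "bread"], ["bread", "beer"], ["milk", "bread", "beer"], ["milk"]]

# Map phase: emit an (item, 1) pair for every item in every transaction.
mapped = ((item, 1) for item in chain.from_iterable(transactions))

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for item, count in mapped:
    groups[item].append(count)

# Reduce phase: sum the counts per item and keep only the frequent items.
min_support = 2
frequent = {item: sum(counts) for item, counts in groups.items() if sum(counts) >= min_support}
print(frequent)   # {'milk': 3, 'bread': 3, 'beer': 2}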

Machine learning for big data mining

The potential of machine learning for data analytics can easily be found in the early literature [ 22 , 49 ]. Different from data mining algorithms designed for specific problems, machine learning algorithms can be used for different mining and analysis problems because they are typically employed as the “search” algorithm of the required solution. Since most machine learning algorithms can be used to find an approximate solution for an optimization problem, they can be employed for most data analysis problems if the data analysis problems can be formulated as optimization problems. For example, the genetic algorithm, one of the machine learning algorithms, can be used not only to solve the clustering problem [ 25 ] but also to solve the frequent pattern mining problem [ 33 ]. The potential of machine learning is not merely in solving different mining problems in the data analysis operator of KDD; it also has the potential of enhancing the performance of the other parts of KDD, such as feature reduction for the input operators [ 72 ].

A recent study [ 68 ] shows that some traditional mining algorithms, statistical methods, preprocessing solutions, and even GUIs have been applied to several representative tools and platforms for big data analytics. The results show clearly that machine learning algorithms will be one of the essential parts of big data analytics. One of the problems in using current machine learning methods for big data analytics is similar to that of most traditional data mining algorithms: they are designed for sequential or centralized computing. However, one of the most promising solutions is to make them work for parallel computing. Fortunately, some of the machine learning algorithms (e.g., population-based algorithms) can essentially be used for parallel computing, which has been demonstrated for several years, such as the parallel computing version of the genetic algorithm [ 122 ]. Different from the traditional GA, as shown in Fig. 9a, the population of the island model genetic algorithm, one of the parallel GAs, can be divided into several sub-populations, as shown in Fig. 9b. This means that the sub-populations can be assigned to different threads or computer nodes for parallel computing, by a simple modification of the GA.

Fig. 9 The comparison between basic idea of traditional GA (TGA) and parallel genetic algorithm (PGA)
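
A minimal island-model sketch is given below; it uses a toy one-max objective and is only meant to show the structure of this kind of parallel GA: the population is split into sub-populations that evolve independently and occasionally exchange their best individuals, so each sub-population could be assigned to a different thread or computer node. The population sizes, generation counts, and migration scheme are illustrative assumptions.

import random

def fitness(ind):                       # toy "one-max" objective: count the 1s
    return sum(ind)

def evolve(pop, generations=20, mut_rate=0.05):
    """Evolve one island: tournament selection, uniform crossover, bit-flip mutation."""
    for _ in range(generations):
        new_pop = []
        for _ in range(len(pop)):
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            child = [random.choice(genes) for genes in zip(p1, p2)]
            child = [1 - g if random.random() < mut_rate else g for g in child]
            new_pop.append(child)
        pop = new_pop
    return pop

random.seed(0)
n_islands, island_size, length = 4, 30, 40
islands = [[[random.randint(0, 1) for _ in range(length)] for _ in range(island_size)]
           for _ in range(n_islands)]

for epoch in range(5):                                   # independent evolution plus migration
    islands = [evolve(pop) for pop in islands]           # each island could run on its own node
    best = [max(pop, key=fitness) for pop in islands]
    for i, pop in enumerate(islands):                    # ring migration of the best individual
        pop[random.randrange(island_size)] = list(best[(i - 1) % n_islands])

print([fitness(max(pop, key=fitness)) for pop in islands])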

For this reason, in [ 123 ], Kiran and Babu explained that the framework for distributed data mining algorithms still needs to aggregate the information from different computer nodes. As shown in Fig. 10, the common design of a distributed data mining algorithm is as follows: each mining algorithm is performed on a computer node (worker) which has its locally coherent data, but not the whole data. To construct globally meaningful knowledge after each mining algorithm finds its local model, the local models from the computer nodes have to be aggregated and integrated into a final model to represent the complete knowledge. Kiran and Babu [ 123 ] also pointed out that communication will be the bottleneck when using this kind of distributed computing framework.

Fig. 10 A simple example of distributed data mining framework [ 86 ]

Bu et al. [ 124 ] found some research issues when trying to apply machine learning algorithms to parallel computing platforms. For instance, the early version of the map-reduce framework does not support “iteration” (i.e., recursion). But the good news is that some recent works [ 87 , 125 ] have paid close attention to this problem and tried to fix it. Similar to the solutions for enhancing the performance of the traditional data mining algorithms, one of the possible solutions for enhancing the performance of a machine learning algorithm is to use CUDA, i.e., a GPU, to reduce the computing time of data analysis. Hasan et al. [ 126 ] used CUDA to implement the self-organizing map (SOM) and multiple back-propagation (MBP) for the classification problem. The simulation results show that using a GPU is faster than using a CPU. More precisely, SOM running on a GPU is three times faster than SOM running on a CPU, and MBP running on a GPU is twenty-seven times faster than MBP running on a CPU. Another study [ 127 ] attempted to apply the ant-based algorithm to a grid computing platform. Since the proposed mining algorithm is extended from the ant clustering algorithm of Deneubourg et al. [ 128 ], Footnote 6 Ku-Mahamud modified the ant behavior of this ant clustering algorithm for big data clustering; that is, each ant is randomly placed on the grid, which means that the ant clustering algorithm can then be used in a parallel computing environment.

The trends of machine learning studies for big data analytics are twofold: one attempts to make machine learning algorithms run on parallel platforms, such as Radoop [ 129 ], Mahout [ 87 ], and PIMRU [ 124 ]; the other is to redesign the machine learning algorithms to make them suitable for parallel computing or for a parallel computing environment, such as neural network algorithms for the GPU [ 126 ] and the ant-based algorithm for the grid [ 127 ]. In summary, both of them make it possible to apply machine learning algorithms to big data analytics, although many research issues still need to be solved, such as the communication cost between different computer nodes [ 86 ] and the large computation cost that most machine learning algorithms require [ 126 ].

Output the result of big data analysis

The benchmarks of PigMix [ 130 ], GridMix [ 131 ], TeraSort and GraySort [ 132 ], TPC-C, TPC-H, TPC-DS [ 133 ], and the yahoo cloud serving benchmark (YCSB) [ 134 ] have been presented for evaluating the performance of cloud computing and big data analytics systems. Ghazal et al. [ 135 ] presented another benchmark (called BigBench) to be used as an end-to-end big data benchmark which covers the 3V characteristics of big data and uses the loading time, time for queries, time for procedural processing queries, and time for the remaining queries as the metrics. With these benchmarks, the computation time is one of the intuitive metrics for evaluating the performance of different big data analytics platforms or algorithms. That is why Cheptsov [ 136 ] compared high performance computing (HPC) and cloud systems by using the measurement of computation time to understand their scalability for text file analysis. In addition to the computation time, the throughput (e.g., the number of operations per second) and the read/write latency of operations are other measurements of big data analytics [ 137 ]. In the study of [ 138 ], Zhao et al. believe that the maximum size of data and the maximum number of jobs are the two important metrics to understand the performance of a big data analytics platform. Another study described in [ 139 ] presented a systematic evaluation method which contains the data throughput, concurrency during the map and reduce phases, response times, and the execution time of map and reduce. Moreover, most benchmarks for evaluating the performance of big data analytics typically can only provide the response time or the computation cost; however, the fact is that several factors need to be taken into account at the same time when building a big data analytics system: the hardware, the bandwidth for data transmission, fault tolerance, cost, and power consumption of these systems are all relevant issues [ 70 , 104 ]. Several solutions available today are to install the big data analytics on a cloud computing system or a cluster system. Therefore, the measurements of fault tolerance, task execution, and cost of cloud computing systems can then be used to evaluate the performance of the corresponding factors of big data analytics.

How to present the analysis results to a user is another important task in the output part of big data analytics because if the user cannot easily understand the meaning of the results, the results will be entirely useless. Business intelligence and network monitoring are the two common approaches because their user interfaces play the vital role of making them workable. Zhang et al. [ 140 ] pointed out that the tasks of visual analytics for commercial systems can be divided into four categories: exploration, dashboards, reporting, and alerting. The study [ 141 ] showed that the interface for electroencephalography (EEG) interpretation is another noticeable research issue in big data analytics. The user interface for cloud systems [ 142 , 143 ] is a recent trend in big data analytics. The user interface usually plays two vital roles in a big data analytics system: one is to simplify the explanation of the needed knowledge to the users, while the other is to make it easier for the users to operate the data analytics system in line with their own opinions. According to our observations, a flexible user interface is needed because although big data analytics can help us find some hidden information, the information found usually is not knowledge. This situation is just like the example we mentioned in “ Output the result ”. The mining or statistical techniques can be employed to know the flu situation of each region, but data scientists sometimes need additional ways to display the information to find the knowledge they need or to prove their assumptions. Thus, the user interface should be adjustable by the user to display the knowledge that is urgently needed for big data analytics.

Summary of process of big data analytics

The discussion of big data analytics in this section was divided into input, analysis, and output to map the data analysis process of KDD. For the input (see also “ Big data input ”) and output (see also “ Output the result of big data analysis ”) of big data, several methods and solutions proposed before the big data age (see also “ Data input ”) can also be employed for big data analytics in most cases.

However, there still exist some new issues of the input and output that the data scientists need to confront. A representative example we mentioned in “ Big data input ” is that the bottleneck will not only be at the sensors or input devices; it may also appear in other places of the data analytics [ 71 ]. Although we can employ traditional compression and sampling technologies to deal with this problem, they can only mitigate the problems instead of solving them completely. Similar situations also exist in the output part. Although several measurements can be used to evaluate the performance of the frameworks, platforms, and even data mining algorithms, there still exist several new issues in the big data age, such as information fusion from different information sources or information accumulation from different times.

Several studies attempted to present an efficient or effective solution at the system (e.g., framework and platform) or algorithm level. A simple comparison of these big data analysis technologies from different perspectives is described in Table 3, to give a brief introduction to the current studies and trends of data analysis technologies for big data. The “Perspective” column of this table indicates whether a study is focused on the framework or algorithm level; the “Description” column gives the further goal of the study; and the “Name” column gives the abbreviated name of the method or platform/framework. From the analysis framework perspective, this table shows that big data framework , platform , and machine learning are the current research trends in big data analytics systems. From the mining algorithm perspective, the clustering , classification , and frequent pattern mining issues play the vital role in these studies because several data analysis problems can be mapped to these essential issues.

A promising trend that can easily be found from these successful examples is to use machine learning as the search algorithm (i.e., mining algorithm) for the data mining problems of a big data analytics system. The machine learning-based methods are able to make the mining algorithms and relevant platforms smarter or to reduce the redundant computation costs. That parallel computing and cloud computing technologies have a strong impact on big data analytics can also be recognized as follows: (1) most of the big data analytics frameworks and platforms use Hadoop and Hadoop-related technologies to design their solutions; and (2) most of the mining algorithms for big data analysis have been designed for parallel computing via software or hardware, or designed for map-reduce-based platforms.

From the results of recent studies of big data analytics, it is still at the early stage of Nolan’s stages of growth model [ 146 ], which is similar to the situation of the research topics of cloud computing, the internet of things, and the smart grid. This is because several studies just attempted to apply the traditional solutions to the new problems/platforms/environments. For example, several studies [ 114 , 145 ] used k-means as an example to analyze big data, but not many studies applied the state-of-the-art data mining algorithms and machine learning algorithms to the analysis of big data. This suggests that the performance of big data analytics can be improved by the data mining algorithms and metaheuristic algorithms presented in recent years [ 147 ]. The relevant technologies for compression, sampling, or even the platforms presented in recent years may also be used to enhance the performance of the big data analytics system. As a result, although these research topics still have several open issues that need to be solved, these situations, on the contrary, also illustrate that everything is possible in these studies.

The open issues

The data analytics of today may be inefficient for big data because the environments, devices, systems, and even problems are quite different from those of traditional mining problems, even though several characteristics of big data also exist in traditional data analytics. Several open issues caused by big data will be addressed from the platform/framework and data mining perspectives in this section to explain what dilemmas we may confront because of big data. Here are some of the open issues:

Platform and framework perspective

Input and output ratio of platform.

A large number of reports and studies mentioned that we will enter the big data age in the near future. Some of them insinuated to us that these fruitful results of big data will lead us to a whole new world where “everything” is possible; therefore, big data analytics will be an omniscient and omnipotent system. From the pragmatic perspective, big data analytics is indeed useful and has many possibilities which can help us more accurately understand the so-called “things.” However, the situation in most studies of big data analytics is that they argued that the results of big data are valuable, but the business models of most big data analytics are not clear. The fact is that assuming we have infinite computing resources for big data analytics is a thoroughly impracticable plan; the input and output ratio (e.g., return on investment) will need to be taken into account before an organization constructs a big data analytics center.

Communication between systems

Since most big data analytics systems will be designed for parallel computing, and they typically will work on other systems (e.g., a cloud platform) or work with other systems (e.g., a search engine or knowledge base), the communication between the big data analytics and other systems will strongly impact the performance of the whole process of KDD. The first research issue for the communication is the communication cost incurred between the systems of data analytics; how to reduce this cost will be the very first thing that data scientists need to care about. Another research issue for the communication is how the big data analytics communicates with other systems. The consistency of data between different systems, modules, and operators is also an important open issue for the communication between systems. Because communication will appear more frequently between systems of big data analytics, how to reduce the cost of communication and how to make the communication between these systems as reliable as possible will be the two important open issues for big data analytics.

Bottlenecks on data analytics system

Bottlenecks will appear in different places of the data analytics for big data because the environments, systems, and input data have changed and are different from those of traditional data analytics. The data deluge of big data will fill up the “input” system of data analytics, and it will also increase the computation load of the data “analysis” system. This situation is just like a torrent of water (i.e., the data deluge) rushing down the mountain (i.e., the data analytics): how to split it and how to avoid it flowing into a narrow place (e.g., an operator that is not able to handle the input data) will be the most important things for avoiding the bottlenecks in a data analytics system. One of the current solutions to the avoidance of bottlenecks in a data analytics system is to add more computation resources, while the other is to split the analysis work among different computation nodes. A complete consideration of the whole data analytics process to avoid the bottlenecks of that kind of analytics system is still needed for big data.

Security issues

Since much more environmental data and human behavior data will be gathered by big data analytics, how to protect them will also be an open issue, because without a secure way to handle the collected data, big data analytics cannot be a reliable system. Even though security has to be tightened for big data analytics before it can gather more data from everywhere, the fact is that, until now, there are still not many studies focusing on the security issues of big data analytics. According to our observation, the security issues of big data analytics can be divided into four categories: input, data analysis, output, and communication with other systems. The input can be regarded as the data gathering, which is relevant to sensors, handheld devices, and even the devices of the internet of things. One of the important security issues in the input part of big data analytics is to make sure that the sensors will not be compromised by attacks. For the analysis and output, they can be regarded as the security problems of the system itself. For communication with other systems, the security problem lies in the communications between big data analytics and other external systems. Because of these latent problems, security has become one of the open issues of big data analytics.

Data mining perspective

Data mining algorithms for map-reduce solutions.

As we mentioned in the previous sections, most of the traditional data mining algorithms are not designed for parallel computing; therefore, they are not particularly useful for big data mining. Several recent studies have attempted to modify the traditional data mining algorithms to make them applicable to Hadoop-based platforms. As long as porting the data mining algorithms to Hadoop is inevitable, making the data mining algorithms work on a map-reduce architecture is the very first thing to do to apply traditional data mining methods to big data analytics. Unfortunately, not many studies have attempted to make the data mining and soft computing algorithms work on Hadoop because several different backgrounds are needed to develop and design such algorithms. For instance, the researcher and his or her research group need to have backgrounds in both data mining and Hadoop so as to develop and design such algorithms. Another open issue is that most data mining algorithms are designed for centralized computing; that is, they can only work on all the data at the same time. Thus, how to make them work on a parallel computing system is also a difficult task. The good news is that some studies [ 145 ] have successfully applied the traditional data mining algorithms to the map-reduce architecture. These results imply that it is possible to do so. According to our observation, although traditional mining or soft computing algorithms can be used to help us analyze the data in big data analytics, unfortunately, until now, not many studies have focused on this. As a consequence, it is an important open issue in big data analytics.

Noise, outliers, incomplete and inconsistent data

Although big data analytics marks a new age for data analysis, because several solutions adopt classical ways to analyze the data in big data analytics, the open issues of traditional data mining algorithms also exist in these new systems. The open issues of noise, outliers, incomplete, and inconsistent data in traditional data mining algorithms will also appear in big data mining algorithms. More incomplete and inconsistent data will easily appear because the data are captured by or generated from different sensors and systems. The impact of noise, outliers, incomplete and inconsistent data will be enlarged for big data analytics. Therefore, how to mitigate this impact will be an open issue for big data analytics.

Bottlenecks on data mining algorithm

Most of the data mining algorithms in big data analytics will be designed for parallel computing. However, once data mining algorithms are designed or modified for parallel computing, it is the information exchange between different data mining procedures that may incur bottlenecks. One of them is the synchronization issue, because different mining procedures will finish their jobs at different times even though they use the same mining algorithm to work on the same amount of data. Thus, some of the mining procedures will have to wait until the others have finished their jobs. This situation may occur because the loads of different computer nodes may be different during the data mining process, or because the convergence speeds are different for the same data mining algorithm. The bottlenecks of data mining algorithms will become an open issue for big data analytics, which means that we need to take this issue into account when we develop and design a new data mining algorithm for big data analytics.

Privacy issues

The privacy concern typically makes most people uncomfortable, especially if systems cannot guarantee that their personal information will not be accessed by other people and organizations. Different from the security concern, the privacy issue is about whether it is possible for the system to restore or infer personal information from the results of big data analytics, even though the input data are anonymous. The privacy issue has become very important because, as data mining and other analysis technologies are widely used in big data analytics, private information may be exposed to other people after the analysis process. For example, although all the gathered data on shopping behavior are anonymous (e.g., someone buying a pistol), because the data can easily be collected by different devices and systems (e.g., the location of the shop and the age of the buyer), a data mining algorithm can easily infer who bought this pistol. More precisely, the data analytics is able to reduce the scope of the database because the location of the shop and the age of the buyer provide information to help the system find out the possible persons. For this reason, any sensitive information needs to be carefully protected and used. Anonymization, temporary identification, and encryption are the representative technologies for the privacy of data analytics, but the critical factor is how to use, what to use, and why to use the collected data in big data analytics.

Conclusions

In this paper, we reviewed studies on data analytics, from traditional data analysis to recent big data analysis. From the system perspective, the KDD process is used as the framework for these studies and is summarized into three parts: input, analysis, and output. From the perspective of big data analytics frameworks and platforms, the discussions are focused on performance-oriented and result-oriented issues. From the perspective of the data mining problem, this paper gives a brief introduction to the data and big data mining algorithms, which consist of clustering, classification, and frequent pattern mining technologies. To better understand the changes brought about by big data, this paper is focused on the data analysis of KDD from the platform/framework to data mining. The open issues regarding computation, quality of the end result, security, and privacy are then discussed to explain which open issues we may face. Last but not least, to help the audience of the paper find solutions to welcome the new age of big data, the possible high-impact research trends are given below:

For the computation time, there is no doubt at all that parallel computing is one of the important future trends to make the data analytics work for big data, and consequently the technologies of cloud computing, Hadoop, and map-reduce will play important roles in big data analytics. To handle the computation resources of a cloud-based platform and to finish the task of data analysis as fast as possible, the scheduling method is another future trend.

Using efficient methods to reduce the computation time of the input (e.g., comparison, sampling, and a variety of reduction methods) will play an important role in big data analytics. Because these methods typically do not consider a parallel computing environment, how to make them work in a parallel computing environment will be a future research trend. Similar to the input, the data mining algorithms also face the situation that we mentioned in the previous section; how to make them work in a parallel computing environment will be a very important research trend because there are abundant research results on traditional data mining algorithms.

How to model the mining problem to find something from big data and how to display the knowledge obtained from big data analytics will also be two vital future trends because the results of these two lines of research will decide whether the data analytics can practically work in real-world settings, not just remain a theoretical exercise.

The methods of extracting information from external and related knowledge resources to further reinforce big data analytics are, until now, not very popular in big data analytics. However, combining information from different resources to add value to the output knowledge is a common solution in the area of information retrieval, such as clustering search engines or document summarization. For this reason, information fusion will also be a future trend for improving the end results of big data analytics.

Because metaheuristic algorithms are capable of finding an approximate solution within a reasonable time, they have been widely used in solving data mining problems in recent years. Until now, many state-of-the-art metaheuristic algorithms have still not been applied to big data analytics. In addition, compared to some early data mining algorithms, the performance of metaheuristics is no doubt superior in terms of computation time and the quality of the end result. From these observations, the application of metaheuristic algorithms to big data analytics will also be an important research topic.

Because social networks are part of the daily life of most people and because their data are also a kind of big data, how to analyze the data of a social network has become a promising research issue. Obviously, such analysis can be used to predict the behavior of a user, and after that, we can devise applicable strategies for the user. For instance, a business intelligence system can use the analysis results to encourage particular customers to buy the goods they are interested in.

The security and privacy issues that accompany the work of data analysis are intuitive research topics which include how to safely store the data, how to make sure the data communication is protected, and how to prevent someone from finding out information about us. Many problems of data security and privacy are essentially the same as those of traditional data analysis even as we enter the big data age. Thus, how to protect the data will also appear in the research on big data analytics.

In this paper, by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining.

In this paper, by an unlabeled input data, we mean that it is unknown to which group the input data belongs. If all the input data are unlabeled, it means that the distribution of the input data is unknown.

In this paper, the analysis framework refers to the whole system, from raw data gathering, data reformat, data analysis, all the way to knowledge representation.

For a system that has only one master, the whole system may go down when the master machine crashes.

The learner typically represents the classification function which creates the classifier to help us classify the unknown input data.

The basic idea of [ 128 ] is that each ant picks up and drops data items according to the similarity of its local neighbors.

Abbreviations

PCA: principal components analysis

3Vs: volume, velocity, and variety

IDC: International Data Corporation

KDD: knowledge discovery in databases

SVM: support vector machine

SSE: sum of squared errors

GLADE: generalized linear aggregates distributed engine

BDAF: big data architecture framework

CBDMASP: cloud-based big data mining and analyzing services platform

SODSS: service-oriented decision support system

HPCC: high performance computing cluster system

BI&A: business intelligence and analytics

DBMS: database management system

MSF: multiple species flocking

GA: genetic algorithm

SOM: self-organizing map

MBP: multiple back-propagation

YCSB: yahoo cloud serving benchmark

HPC: high performance computing

EEG: electroencephalography

Lyman P, Varian H. How much information 2003? Tech. Rep, 2004. [Online]. Available: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf .

Xu R, Wunsch D. Clustering. Hoboken: Wiley-IEEE Press; 2009.

Ding C, He X. K-means clustering via principal component analysis. In: Proceedings of the Twenty-first International Conference on Machine Learning, 2004, pp 1–9.

Kollios G, Gunopulos D, Koudas N, Berchtold S. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans Knowl Data Eng. 2003;15(5):1170–87.

Fisher D, DeLine R, Czerwinski M, Drucker S. Interactions with big data analytics. Interactions. 2012;19(3):50–9.

Laney D. 3D data management: controlling data volume, velocity, and variety, META Group, Tech. Rep. 2001. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf .

van Rijmenam M. Why the 3v’s are not sufficient to describe big data, BigData Startups, Tech. Rep. 2013. [Online]. Available: http://www.bigdata-startups.com/3vs-sufficient-describe-big-data/ .

Borne K. Top 10 big data challenges a serious look at 10 big data v’s, Tech. Rep. 2014. [Online]. Available: https://www.mapr.com/blog/top-10-big-data-challenges-look-10-big-data-v .

Press G. $16.1 billion big data market: 2014 predictions from IDC and IIA, Forbes, Tech. Rep. 2013. [Online]. Available: http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/ .

Big data and analytics—an IDC four pillar research area, IDC, Tech. Rep. 2013. [Online]. Available: http://www.idc.com/prodserv/FourPillars/bigData/index.jsp .

Taft DK. Big data market to reach $46.34 billion by 2018, EWEEK, Tech. Rep. 2013. [Online]. Available: http://www.eweek.com/database/big-data-market-to-reach-46.34-billion-by-2018.html .

Research A. Big data spending to reach $114 billion in 2018; look for machine learning to drive analytics, ABI Research, Tech. Rep. 2013. [Online]. Available: https://www.abiresearch.com/press/big-data-spending-to-reach-114-billion-in-2018-loo .

Furrier J. Big data market $50 billion by 2017—HP vertica comes out #1—according to wikibon research, SiliconANGLE, Tech. Rep. 2012. [Online]. Available: http://siliconangle.com/blog/2012/02/15/big-data-market-15-billion-by-2017-hp-vertica-comes-out-1-according-to-wikibon-research/ .

Kelly J, Vellante D, Floyer D. Big data market size and vendor revenues, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues .

Kelly J, Floyer D, Vellante D, Miniman S. Big data vendor revenue and market forecast 2012-2017, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017 .

Mayer-Schonberger V, Cukier K. Big data: a revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt; 2013.

Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Quart. 2012;36(4):1165–88.

Kitchin R. The real-time city? big data and smart urbanism. Geo J. 2014;79(1):1–14.

Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag. 1996;17(3):37–54.

Han J. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. Proc ACM SIGMOD Int Conf Manag Data. 1993;22(2):207–16.

Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Abbass H, Newton C, Sarker R. Data mining: a heuristic approach. Hershey: IGI Global; 2002.

Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining on grids: services, tools, and applications. IEEE Trans Syst Man Cyber Part B Cyber. 2004;34(6):2451–65.

Krishna K, Murty MN. Genetic k-means algorithm. IEEE Trans Syst Man Cyber Part B Cyber. 1999;29(3):433–9.

Tsai C-W, Lai C-F, Chiang M-C, Yang L. Data mining for internet of things: a survey. IEEE Commun Surveys Tutor. 2014;16(1):77–97.

Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comp Surveys. 1999;31(3):264–323.

McQueen JB. Some methods of classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1967. pp 281–297.

Safavian S, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cyber. 1991;21(3):660–74.

McCallum A, Nigam K. A comparison of event models for naive bayes text classification. In: Proceedings of the National Conference on Artificial Intelligence, 1998. pp. 41–48.

Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the annual workshop on Computational learning theory, 1992. pp. 144–152.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In : Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. pp. 1–12.

Kaya M, Alhajj R. Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets Syst. 2005;152(3):587–601.

Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, 1996. pp 3–17.

Zaki MJ. Spade: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Baeza-Yates RA, Ribeiro-Neto B. Modern Information Retrieval. Boston: Addison-Wesley Longman Publishing Co., Inc; 1999.

Liu B. Web data mining: exploring hyperlinks, contents, and usage data. Berlin, Heidelberg: Springer-Verlag; 2007.

d’Aquin M, Jay N. Interpreting data mining results with linked data for learning analytics: motivation, case study and directions. In: Proceedings of the International Conference on Learning Analytics and Knowledge, pp 155–164.

Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, 1996, pp 336–343.

Mani I, Bloedorn E. Multi-document summarization by graph search and matching. In: Proceedings of the National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, 1997, pp 622–628.

Kopanakis I, Pelekis N, Karanikas H, Mavroudkis T. Visual techniques for the interpretation of data mining outcomes. In: Proceedings of the Panhellenic Conference on Advances in Informatics, 2005. pp 25–35.

Elkan C. Using the triangle inequality to accelerate k-means. In: Proceedings of the International Conference on Machine Learning, 2003, pp 147–153.

Catanzaro B, Sundaram N, Keutzer K. Fast support vector machine training and classification on graphics processors. In: Proceedings of the International Conference on Machine Learning, 2008. pp 104–111.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996. pp 103–114.

Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996. pp 226–231.

Ester M, Kriegel HP, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the International Conference on Very Large Data Bases, 1998. pp 323–333.

Ordonez C, Omiecinski E. Efficient disk-based k-means clustering for relational databases. IEEE Trans Knowl Data Eng. 2004;16(8):909–21.

Kogan J. Introduction to clustering large and high-dimensional data. Cambridge: Cambridge Univ Press; 2007.

MATH   Google Scholar  

Mitra S, Pal S, Mitra P. Data mining in soft computing framework: a survey. IEEE Trans Neural Netw. 2002;13(1):3–14.

Mehta M, Agrawal R, Rissanen J. SLIQ: a fast scalable classifier for data mining. In: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. 1996. pp 18–32.

Micó L, Oncina J, Carrasco RC. A fast branch and bound nearest neighbour classifier in metric spaces. Pattern Recogn Lett. 1996;17(7):731–9.

Djouadi A, Bouktache E. A fast algorithm for the nearest-neighbor classifier. IEEE Trans Pattern Anal Mach Intel. 1997;19(3):277–82.

Ververidis D, Kotropoulos C. Fast and accurate sequential floating forward feature selection with the bayes classifier applied to speech emotion recognition. Signal Process. 2008;88(12):2956–70.

Pei J, Han J, Mao R. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000. pp 21–30.

Zaki MJ, Hsiao C-J. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng. 2005;17(4):462–78.

Burdick D, Calimlim M, Gehrke J. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the International Conference on Data Engineering, 2001. pp 443–452.

Chen B, Haas P, Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 462–468.

Zaki MJ. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Yan X, Han J, Afshar R. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, 2003. pp 166–177.

Pei J, Han J, Asl MB, Pinto H, Chen Q, Dayal U, Hsu MC. PrefixSpan mining sequential patterns efficiently by prefix projected pattern growth. In: Proceedings of the International Conference on Data Engineering, 2001. pp 215–226.

Ayres J, Flannick J, Gehrke J, Yiu T. Sequential PAttern Mining using a bitmap representation. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 429–435.

Masseglia F, Poncelet P, Teisseire M. Incremental mining of sequential patterns in large databases. Data Knowl Eng. 2003;46(1):97–121.

Xu R, Wunsch-II DC. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78.

Chiang M-C, Tsai C-W, Yang C-S. A time-efficient pattern reduction algorithm for k-means clustering. Inform Sci. 2011;181(4):716–31.

Bradley PS, Fayyad UM. Refining initial points for k-means clustering. In: Proceedings of the International Conference on Machine Learning, 1998. pp 91–99.

Laskov P, Gehl C, Krüger S, Müller K-R. Incremental support vector learning: analysis, implementation and applications. J Mach Learn Res. 2006;7:1909–36.

MATH   MathSciNet   Google Scholar  

Russom P. Big data analytics. TDWI: Tech. Rep ; 2011.

Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.

Boyd D, Crawford K. Critical questions for big data. Inform Commun Soc. 2012;15(5):662–79.

Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools and good practices. In: Proceedings of the International Conference on Contemporary Computing, 2013. pp 404–409.

Baraniuk RG. More is less: signal processing and the data deluge. Science. 2011;331(6018):717–9.

Lee J, Hong S, Lee JH. An efficient prediction for heavy rain from big weather data using genetic algorithm. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2014. pp 25:1–25:7.

Famili A, Shen W-M, Weber R, Simoudis E. Data preprocessing and intelligent data analysis. Intel Data Anal. 1997;1(1–4):3–23.

Zhang H. A novel data preprocessing solution for large scale digital forensics investigation on big data, Master’s thesis, Norway, 2013.

Ham YJ, Lee H-W. International journal of advances in soft computing and its applications. Calc Paralleles Reseaux et Syst Repar. 2014;6(1):1–18.

Cormode G, Duffield N. Sampling for big data: a tutorial. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014. pp 1975–1975.

Satyanarayana A. Intelligent sampling for big data using bootstrap sampling and chebyshev inequality. In: Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, 2014. pp 1–6.

Jun SW, Fleming K, Adler M, Emer JS. Zip-io: architecture for application-specific compression of big data. In: Proceedings of the International Conference on Field-Programmable Technology, 2012, pp 343–351.

Zou H, Yu Y, Tang W, Chen HM. Improving I/O performance with adaptive data compression for big data applications. In: Proceedings of the International Parallel and Distributed Processing Symposium Workshops, 2014. pp 1228–1237.

Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J. A spatiotemporal compression based approach for efficient big data processing on cloud. J Comp Syst Sci. 2014;80(8):1563–83.

Xue Z, Shen G, Li J, Xu Q, Zhang Y, Shao J. Compression-aware I/O performance analysis for big data clustering. In: Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 2012. pp 45–52.

Pospiech M, Felden C. Big data—a state-of-the-art. In: Proceedings of the Americas Conference on Information Systems, 2012, pp 1–23. [Online]. Available: http://aisel.aisnet.org/amcis2012/proceedings/DecisionSupport/22 .

Apache Hadoop, February 2, 2015. [Online]. Available: http://hadoop.apache.org .

Cuda, February 2, 2015. [Online]. Available: URL: http://www.nvidia.com/object/cuda_home_new.html .

Apache Storm, February 2, 2015. [Online]. Available: URL: http://storm.apache.org/ .

Curtin RR, Cline JR, Slagle NP, March WB, Ram P, Mehta NA, Gray AG. MLPACK: a scalable C++ machine learning library. J Mach Learn Res. 2013;14:801–5.

Apache Mahout, February 2, 2015. [Online]. Available: http://mahout.apache.org/ .

Huai Y, Lee R, Zhang S, Xia CH, Zhang X. DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems. In: Proceedings of the ACM Symposium on Cloud Computing, 2011. pp 4:1–4:14.

Rusu F, Dobra A. GLADE: a scalable framework for efficient analytics. In: Proceedings of LADIS Workshop held in conjunction with VLDB, 2012. pp 1–6.

Cheng Y, Qin C, Rusu F. GLADE: big data analytics made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012. pp 697–700.

Essa YM, Attiya G, El-Sayed A. Mobile agent based new framework for improving big data analysis. In: Proceedings of the International Conference on Cloud Computing and Big Data. 2013, pp 381–386.

Wonner J, Grosjean J, Capobianco A, Bechmann D Starfish: a selection technique for dense virtual environments. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2012. pp 101–104.

Demchenko Y, de Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2014. pp 104–112.

Ye F, Wang ZJ, Zhou FC, Wang YP, Zhou YC. Cloud-based big data mining and analyzing services platform integrating r. In: Proceedings of the International Conference on Advanced Cloud and Big Data, 2013. pp 147–151.

Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97–107.

Laurila JK, Gatica-Perez D, Aad I, Blom J, Bornet O, Do T, Dousse O, Eberle J, Miettinen M. The mobile data challenge: big data for mobile computing research. In: Proceedings of the Mobile Data Challenge by Nokia Workshop, 2012. pp 1–8.

Demirkan H, Delen D. Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud. Decision Support Syst. 2013;55(1):412–21.

Talia D. Clouds for scalable big data analytics. Computer. 2013;46(5):98–101.

Lu R, Zhu H, Liu X, Liu JK, Shao J. Toward efficient and privacy-preserving computing in big data era. IEEE Netw. 2014;28(4):46–50.

Cuzzocrea A, Song IY, Davis KC. Analytics over large-scale multidimensional data: The big data revolution!. In: Proceedings of the ACM International Workshop on Data Warehousing and OLAP, 2011. pp 101–104.

Zhang J, Huang ML. 5Ws model for big data analysis and visualization. In: Proceedings of the International Conference on Computational Science and Engineering, 2013. pp 1021–1028.

Chandarana P, Vijayalakshmi M. Big data analytics frameworks. In: Proceedings of the International Conference on Circuits, Systems, Communication and Information Technology Applications, 2014. pp 430–434.

Apache Drill February 2, 2015. [Online]. Available: URL: http://drill.apache.org/ .

Hu H, Wen Y, Chua T-S, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

Sagiroglu S, Sinanc D, Big data: a review. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2013. pp 42–47.

Fan W, Bifet A. Mining big data: current status, and forecast to the future. ACM SIGKDD Explor Newslett. 2013;14(2):1–5.

Diebold FX. On the origin(s) and development of the term “big data”, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, Tech. Rep. 2012. [Online]. Available: http://economics.sas.upenn.edu/sites/economics.sas.upenn.edu/files/12-037.pdf .

Weiss SM, Indurkhya N. Predictive data mining: a practical guide. San Francisco: Morgan Kaufmann Publishers Inc.; 1998.

Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comp. 2014;2(3):267–79.

Shirkhorshidi AS, Aghabozorgi SR, Teh YW, Herawan T. Big data clustering: a review. In: Proceedings of the International Conference on Computational Science and Its Applications, 2014. pp 707–720.

Xu H, Li Z, Guo S, Chen K. Cloudvista: interactive and economical visual cluster analysis for big data in the cloud. Proc VLDB Endowment. 2012;5(12):1886–9.

Cui X, Gao J, Potok TE. A flocking based algorithm for document clustering analysis. J Syst Archit. 2006;52(89):505–15.

Cui X, Charles JS, Potok T. GPU enhanced parallel computing for large scale data clustering. Future Gener Comp Syst. 2013;29(7):1736–41.

Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2013. pp 1434–1453.

Tekin C, van der Schaar M. Distributed online big data classification using context information. In: Proceedings of the Allerton Conference on Communication, Control, and Computing, 2013. pp 1435–1442.

Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big feature and big data classification. CoRR , vol. abs/1307.0471, 2014. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1307.html#RebentrostML13 .

Lin MY, Lee PY, Hsueh SC. Apriori-based frequent itemset mining algorithms on mapreduce. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2012. pp 76:1–76:8.

Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in mapreduce. In: Proceedings of the ACM International Conference on Information and Knowledge Management, 2012. pp 85–94.

Leung CS, MacKinnon R, Jiang F. Reducing the search space for big data mining for interesting patterns from uncertain data. In: Proceedings of the International Congress on Big Data, 2014. pp 315–322.

Yang L, Shi Z, Xu L, Liang F, Kirsh I. DH-TRIE frequent pattern mining on hadoop using JPA. In: Proceedings of the International Conference on Granular Computing, 2011. pp 875–878.

Huang JW, Lin SC, Chen MS. DPSP: Distributed progressive sequential pattern mining on the cloud. In: Proceedings of the Advances in Knowledge Discovery and Data Mining, vol. 6119, 2010, pp 27–34.

Paz CE. A survey of parallel genetic algorithms. Calc Paralleles Reseaux et Syst Repar. 1998;10(2):141–71.

kranthi Kiran B, Babu AV. A comparative study of issues in big data clustering algorithm with constraint based genetic algorithm for associative clustering. Int J Innov Res Comp Commun Eng 2014; 2(8): 5423–5432.

Bu Y, Borkar VR, Carey MJ, Rosen J, Polyzotis N, Condie T, Weimer M, Ramakrishnan R. Scaling datalog for machine learning on big data, CoRR , vol. abs/1203.0160, 2012. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1203.html#abs-1203-0160 .

Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010. pp 135–146.

Hasan S, Shamsuddin S,  Lopes N. Soft computing methods for big data problems. In: Proceedings of the Symposium on GPU Computing and Applications, 2013. pp 235–247.

Ku-Mahamud KR. Big data clustering using grid computing and ant-based algorithm. In: Proceedings of the International Conference on Computing and Informatics, 2013. pp 6–14.

Deneubourg JL, Goss S, Franks N, Sendova-Franks A, Detrain C, Chrétien L. The dynamics of collective sorting robot-like ants and ant-like robots. In: Proceedings of the International Conference on Simulation of Adaptive Behavior on From Animals to Animats, 1990. pp 356–363.

Radoop [Online]. https://rapidminer.com/products/radoop/ . Accessed 2 Feb 2015.

PigMix [Online]. https://cwiki.apache.org/confluence/display/PIG/PigMix . Accessed 2 Feb 2015.

GridMix [Online]. http://hadoop.apache.org/docs/r1.2.1/gridmix.html . Accessed 2 Feb 2015.

TeraSoft [Online]. http://sortbenchmark.org/ . Accessed 2 Feb 2015.

TPC, transaction processing performance council [Online]. http://www.tpc.org/ . Accessed 2 Feb 2015.

Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with ycsb. In: Proceedings of the ACM Symposium on Cloud Computing, 2010. pp 143–154.

Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA. BigBench: Towards an industry standard benchmark for big data analytics. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013. pp 1197–1208.

Cheptsov A. Hpc in big data age: An evaluation report for java-based data-intensive applications implemented with hadoop and openmpi. In: Proceedings of the European MPI Users’ Group Meeting, 2014. pp 175:175–175:180.

Yuan LY, Wu L, You JH, Chi Y. Rubato db: A highly scalable staged grid database system for oltp and big data applications. In: Proceedings of the ACM International Conference on Conference on Information and Knowledge Management, 2014. pp 1–10.

Zhao JM, Wang WS, Liu X, Chen YF. Big data benchmark - big DS. In: Proceedings of the Advancing Big Data Benchmarks, 2014, pp. 49–57.

 Saletore V, Krishnan K, Viswanathan V, Tolentino M. HcBench: Methodology, development, and full-system characterization of a customer usage representative big data/hadoop benchmark. In: Advancing Big Data Benchmarks, 2014. pp 73–93.

Zhang L, Stoffel A, Behrisch M,  Mittelstadt S, Schreck T, Pompl R, Weber S, Last H, Keim D. Visual analytics for the big data era—a comparative review of state-of-the-art commercial systems. In: Proceedings of the IEEE Conference on Visual Analytics Science and Technology, 2012. pp 173–182.

Harati A, Lopez S, Obeid I, Picone J, Jacobson M, Tobochnik S. The TUH EEG CORPUS: A big data resource for automated eeg interpretation. In: Proceeding of the IEEE Signal Processing in Medicine and Biology Symposium, 2014. pp 1–5.

Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment. 2009;2(2):1626–9.

Beckmann M, Ebecken NFF, de Lima BSLP, Costa MA. A user interface for big data with rapidminer. RapidMiner World, Boston, MA, Tech. Rep., 2014. [Online]. Available: http://www.slideshare.net/RapidMiner/a-user-interface-for-big-data-with-rapidminer-marcelo-beckmann .

Januzaj E, Kriegel HP, Pfeifle M. DBDC: Density based distributed clustering. In: Proceedings of the Advances in Database Technology, 2004; vol. 2992, 2004, pp 88–105.

Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce. Proceedings Cloud Comp. 2009;5931:674–9.

Nolan RL. Managing the crises in data processing. Harvard Bus Rev. 1979;57(1):115–26.

Tsai CW, Huang WC, Chiang MC. Recent development of metaheuristics for clustering. In: Proceedings of the Mobile, Ubiquitous, and Intelligent Computing, 2014; vol. 274, pp. 629–636.

Download references

Authors’ contributions

CWT contributed to the paper review and drafted the first version of the manuscript. CFL contributed to the paper collection and manuscript organization. HCC and AVV double checked the manuscript and provided several advanced ideas for this manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions on the paper. This work was supported in part by the Ministry of Science and Technology of Taiwan, R.O.C., under Contracts MOST103-2221-E-197-034, MOST104-2221-E-197-005, and MOST104-2221-E-197-014.

Compliance with ethical guidelines

Competing interests: The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan

Chun-Wei Tsai & Han-Chieh Chao

Institute of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan

Chin-Feng Lai

Information Engineering College, Yangzhou University, Yangzhou, Jiangsu, China

Han-Chieh Chao

School of Information Science and Engineering, Fujian University of Technology, Fuzhou, Fujian, China

Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, SE-931 87, Skellefteå, Sweden

Athanasios V. Vasilakos


Corresponding author

Correspondence to Athanasios V. Vasilakos.

Rights and permissions

Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Tsai, CW., Lai, CF., Chao, HC. et al. Big data analytics: a survey. Journal of Big Data 2, 21 (2015). https://doi.org/10.1186/s40537-015-0030-3


Received: 14 May 2015

Accepted: 02 September 2015

Published: 01 October 2015

DOI: https://doi.org/10.1186/s40537-015-0030-3


Keywords: data analytics, data mining


Grad Coach

Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project


If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas, including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap. If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, consider our 1-on-1 coaching service.


Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.


Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic and non-specific. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking. These are actual studies,  so they can provide some useful insight as to what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • MACHINE LEARNING FOR NON-MAJORS: A WHITE BOX APPROACH (Mike & Hazzan, 2022)
  • COMPONENTS OF DATA SCIENCE AND ITS APPLICATIONS (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. So, for you to develop a high-quality research topic, you’ll need to get specific and laser-focused on a specific context with specific variables of interest.  In the video below, we explore some other important things you’ll need to consider when crafting your research topic.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Research Topic Kickstarter service, which is the perfect starting point for developing a unique, well-justified research topic.




Different Types of Data Analysis; Data Analysis Methods and Techniques in Research Projects

International Journal of Academic Research in Management, 9(1):1-9, 2022 http://elvedit.com/journals/IJARM/wp-content/uploads/Different-Types-of-Data-Analysis-Data-Analysis-Methods-and-Tec

9 Pages Posted: 18 Aug 2022

Hamed Taherdoost

Hamta Group

Date Written: August 1, 2022

This article focuses on defining data analysis and the concept of data preparation. It then discusses data analysis methods: first, the six main categories are described briefly; next, the statistical tools of the most commonly used methods, including descriptive, explanatory, and inferential analyses, are investigated in detail. Finally, the article focuses on qualitative data analysis, covering data preparation and strategies in this context.

Keywords: Data Analysis, Data Preparation, Data Analysis Methods, Data Analysis Types, Descriptive Analysis, Explanatory Analysis, Inferential Analysis, Predictive Analysis, Explanatory Analysis, Causal Analysis and Mechanistic Analysis, Statistical Analysis.

Suggested Citation

Hamed Taherdoost (Contact Author)

Hamta Group

Vancouver, Canada


  • DOI: 10.1080/2573234x.2024.2365917
  • Corpus ID: 270622352

The role of data science and data analytics for innovation: a literature review

  • Pedro Natividade Joergensen , Michael Zaggl
  • Published in The Journal of Business… 19 June 2024
  • Business, Computer Science



A Review of Predictive Analytics Models in the Oil and Gas Industries


1. Introduction

2. Predictive Analytics Models for O&G

2.1. Application of Artificial Neural Network Models

2.2. Application of Deep Learning Models

2.3. Application of Fuzzy Logic and Neuro-Fuzzy Models

2.4. Application of Decision Tree, Random Forest, and Hybrid Models

2.5. Application of Interrelated AI Models

2.6. Application of Statistical Models

2.7. Alternative ML Models Utilized for Predictive Analytics in the O&G

3. Literature Review Assessment

  • Table 1, Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 provide a comprehensive overview of the reviewed papers, presenting essential details such as the author names, the applied AI model types, the temporality of the dataset, the O&G domain addressed in the study, dataset sources, number of data samples, input and output parameters, the performance measures employed, the best-performing models found, and the advantages or drawbacks of those models. The researchers consistently focused on carefully selecting input combinations for O&G predictive analytics modeling.
  • ANN models can be expanded from binary to multiclass cases. Furthermore, the complexity of ANN models may be easily changed by modifying model structure and learning methods and assigning transfer functions using empirical evidence or correlation analysis. The findings revealed that ANNs could effectively predict, classify, or cluster O&G cases, including crater width in buried gas pipelines, corrosion defect depth, flowing bottom-hole pressure in vertical oil wells, concentrations of gas-phase pollutants for contamination removal, drilling-related occurrences based on epochs, age, formation, lithology, and fields, as well as predicting gas routes and chimneys in drilling activities and DGA datasets. ANNs may be compared to various models, like the SARIMA and QDA.
  • Across the reviewed articles from 2021 to 2023, RF has become much more popular for predictive analytics in O&G than other modeling techniques, such as the MLP, DT, and LSTM, because it resists overfitting and achieves higher prediction accuracy. In the O&G sector, RF appears to be a typical, flexible, and effective ML framework because of its capacity to handle complicated O&G datasets that may be fragmented; the O&G industry has also become a field with data scarcity for modeling. In pipeline failure risk prediction and transformer fault classification, RF is included in model ensembles to help achieve good results. Its use in drilling, well data analysis, lithology identification, crude oil data analysis, and burst pressure prediction demonstrates RF’s robust application performance. RF stands out for its dependability, obtaining excellent accuracy, precision, and recall values in many applications within the O&G area, emphasizing its applicability to multiple data formats such as binary or multiclass cases.
  • The O&G industry has seen a rise in the use of DL, an effective subset of ML, especially for predicting the lifespan of equipment and modeling groundwater levels. DL frameworks, especially the CNN and LSTM, outperform other models in prediction accuracy. Industry uses of DL include assessing algorithm performance, integrating data into DL algorithms, and developing simulation frameworks. Significant studies demonstrate DL’s efficacy in estimating oil output and pressure in wells, identifying pipeline fractures, and producing hydrocarbons in the gas sector. The evaluations of hybrid models, such as DCNN+LSTM and LSTM+Seq2Seq, show outstanding accuracy, indicating DL’s potential for optimizing operations and decision-making processes in the O&G field. The hybrid model is more efficient due to feature extraction and the capacity to learn patterns in extended data sequences.
  • AI models are widely employed in the O&G sector to deliver predictive analytics. In non-linear modeling, SVR is a kernel-based ML method often used to translate data to a higher-dimensional space. This makes it an effective tool for regression problems with complicated input and interaction of target variables. MLR is still an excellent approach for examining dependencies since it is a powerful tool for analyzing the connection between dependent and several independent variables. Non-temporal gas well data are analyzed using MLR, SVR, and GPR models because they provide a good blend of interpretability, simplicity, performance, and adaptability. However, the decision between these models is ultimately determined by the dataset’s particular properties and the problem’s needs. The other research focused on the temporal prediction of corrosion in pipes using several AI models, with the RNN showing promising results. Non-temporal O&G production categorization, reservoir data analysis, and transformer fault prediction were all explored using various AI models, demonstrating industry flexibility.
  • The O&G sector replicates real-world system behavior with mathematical models, namely regression and time series analysis. Statistical models such as the SARIMA, AR, and ARIMA can be more accurate for such problems since they account for temporal relationships. Research has validated the efficacy of the SARIMA in forecasting DGA gas concentrations in transformers, highlighting its ability to capture seasonal fluctuations across temporal data points (a minimal forecasting sketch is given after this list). These techniques have also been used to forecast shale gas output, producing a satisfactory mean outcome. Statistical approaches have thus proven adaptable for dealing with temporal dependencies and forecasting concerns in the O&G area.
  • The limited sample size of the dataset utilized in earlier research on predictive analytics in O&G industries is a key limitation that can have a major impact on the results’ generalizability and dependability. It is challenging to obtain reliable results from small sample numbers since they frequently result in more variability and fewer accurate estimations. This limitation may also lead to a loss of statistical power, which lowers the capacity to identify important variations or connections in the data. Additionally, there is a higher chance that a smaller sample size of data may not accurately reflect the larger population, which could introduce bias and restrict the findings’ application to other groups. Therefore, to maintain robustness and accuracy, researchers need to take precautions when interpreting studies based on limited datasets and think about confirming their findings using larger and more varied sample sizes.
  • A few input parameters, collected from various sensors, were used in predictive analytics to classify, cluster, and forecast defects in wells. Because of the data’s accessibility and availability, researchers regularly employ five parameters such as P-PDG, P-TPT, T-TPT, and P-MON-CKP. Data limitations are widespread due to the difficulty of drilling wells in severe environments such as the deep sea. One previous study implemented two model configurations, an RF model using 15 input parameters and an RF model using only five parameters, and compared their performance; the outcome of employing the 15 input parameters with the tree-based models was superior to the five-input-parameter models (a hedged sketch of this kind of feature-subset comparison appears after this list). Table 8 outlines the input parameters utilized by the researchers in their research papers.
  • Detecting internal transformer failures is another O&G-related topic that has been the subject of several previous studies. Specifically, a few gas compositions were used as input variables, including acetylene (C2H2), ethylene (C2H4), ethane (C2H6), methane (CH4), and hydrogen (H2), which were applied across most of the studies because of the high correlation between these input variables and the target variable when detecting transformer faults. However, the use of other parameters, such as total hydrocarbon (TH), carbon monoxide (CO), carbon dioxide (CO2), ammonia (NH3), acetaldehyde (CH3CHO), acetone ((CH3)2CO), toluene (C6H5CH3), oxygen (O2), nitrogen (N2), and ethanol (CH3CH2OH), varied between studies; these parameters rank lower in correlation with the target variable, so not all studies implemented every gas composition mentioned above. In the model comparisons that used only C2H2, C2H4, C2H6, CH4, and H2 (five variables), models such as KNN, QDA, and LGBM reached accuracies of 88%, 99.29%, and 87.06%, respectively. In contrast, accuracies of 92%, 98%, and 96.2% were obtained for the MTGNN, KNN+SMOTE, and RF, respectively, when the models employed C2H2, C2H4, C2H6, CH4, H2, TH, CO, CO2, NH3, CH3CHO, (CH3)2CO, C6H5CH3, O2, N2, and CH3CH2OH (15 variables). As can be observed from the average accuracies, the use of 15 variables produces superior outcomes to the five-variable models (a hedged DGA classification sketch follows this list). Previous research publications may be found in Table 9.
  • Table 10 summarizes the input parameters for a well logging predictive analytics model. The researchers commonly used 14 parameters for well logging, including gamma ray (GR), sonic (Vp), deep and shallow resistivities (LLD and LLS), neutron porosity (NPHI), density (RHOB), caliper (CALI), neutron (NEU), sonic transit time (DT), bulk density (DEN), deep resistivity (RD), true resistivity (RT), shallow resistivity (RES SLW), total porosity (PHIT), and water saturation (SW). The correlation coefficient between the input parameters and the target variables, together with the data type (numerical or categorical), is essential for determining which parameters are appropriate for predictive analytics; a few important variables can then be chosen to construct the best model for increased accuracy. The model using all 14 variables with XGBoost produced a substantial result of 97%, whereas the study that utilized only GR, Vp, LLD, LLS, NPHI, and RHOB with an LSTM model achieved a slightly lower result of 94%. These three well-known datasets, which have been utilized in recent research in the O&G sector, demonstrate the importance of determining the correlation between target and input parameters when deciding which variables are appropriate for models to provide significant outcomes.
  • The assessment of O&G research revealed an increase in published papers over time. As seen in Figure 2, the rise in O&G discoveries, the dependence of technological advancement on the usage of gas and petroleum, and the yearly progress of ML and AI tools have resulted in more studies in this field utilizing AI-based models. There was notable growth throughout 2021, with 32 research publications published in this field. However, the number of articles released in 2022 decreased by seven, with just 25 published research papers; this reduction can be attributed to the continued development of AI and a gradual shift of interest within O&G research. The trend turned positive again, with 34 articles published in this field in 2023. This increase may reflect recognition of the need to improve AI-based models in the O&G area. Many O&G companies have followed the IR4.0 road map to integrate AI into their organizations and to reduce future expenses by forecasting future events.
  • Throughout the research period, developments in AI models resulted in more complicated and interconnected models, giving researchers tools to construct more exact and resilient models. A similar finding was reached while investigating the use of various models in predictive analytics in the O&G industry during the last three years. Figure 4a depicts a thorough breakdown of the most common model types used for predictive analytics in the O&G industry, illustrated by a pie chart. The chart shows that 37% of all models are classified as “others”, which primarily include foundational models such as SVR, GRU, MLP, and boosting-based models (shown in Figure 4b). Due to their improved efficiency, accuracy, and capacity to handle non-linear datasets, these models have become quite popular. This distribution of models shows that there is still a lot of remaining potential in this field.
  • The analysis of predictive analytics research publications from 2021 to 2023 focuses heavily on several areas of the O&G sector. Crude oils (7), oil (5), reservoirs (16), pipelines (16), drilling (5), wells (20), transformers (10), gas (10), and lithology (2) all appear as similar subjects in different research. The frequency of these terms demonstrates the industry’s strong interest in using predictive analytics to optimize operations and decision-making in various sectors, including reservoir management, drilling procedures, pipeline integrity, and transformer health. This trend represents a deliberate effort in the O&G industry to use sophisticated analytics for greater efficiency, risk management, and overall operational excellence. Figure 5 is the graphical summary of the types of O&G sectors in research articles.
  • Several performance measures have been utilized in O&G research, demonstrating diverse assessment criteria for predictive analytics models (see Figure 6). The performance metrics help in understanding how the models perform, since each metric highlights different model characteristics. Figure 6a, which shows the various performance measures used in the research, demonstrates that accuracy (49) was the most preferred measure for comparing correctly predicted values against actual ones. This performance measure is appropriate for categorical data types and classification predictive analysis because it is simple to grasp and works well when all the classes are balanced. However, utilizing accuracy for unbalanced classes has limitations since it can be deceiving; alternative measures like precision, recall, F1 score, or AUC may be more helpful (a short sketch computing these metrics follows this list). Aside from that, the researchers’ second most frequently chosen performance indicator is R² (41). This performance indicator is commonly employed in regression analysis on numerical data since it measures how much of the variation in the dependent variable is explained by the independent variables.
  • Furthermore, R² is simple to interpret because it ranges from 0 to 1, with values closer to 1 indicating that the model explains more of the variability between the independent and dependent variables. However, relying only on R² to demonstrate how effectively a model performs has drawbacks. One of the disadvantages is that it is vulnerable to outliers; even a single outlier might alter the results. Figure 6b is an expansion of the “others” section that depicts the additional performance indicators used in the previous studies.
  • Based on the data presented in Table 11 , a thorough analysis of model performance for diverse applications identifies numerous key performers across multiple categories. In the field of ANNs, significant high performers include ANN models with accuracies of 99.6% and ANNs integrated with PSO (ANN+PSO) with 99% accuracy. This suggests that adding optimization techniques such as PSO can considerably improve ANN performance. DL models also perform well, with DCNN+LSTM obtaining 99.37% accuracy and GRU models reaching 99% accuracy. These studies demonstrate the effectiveness of DL systems, particularly in managing complicated data patterns.
  • Within the class of Fuzzy Logic and Neuro-fuzzy models, every variation—LSSVM+CSA, ANFIS+PCA, and Control Chart+RF—achieves 99% accuracy on average. This consistency emphasizes the dependability of Fuzzy Logic systems in certain applications. DT, RF, and hybrid models exhibit considerable variability, with top performers such as DT and CATBOOST reaching 99.9% accuracy. However, the high number of models with much lower accuracies indicates a considerable sensitivity to certain data properties and model settings.
  • Interrelated AI models, particularly the SVR combined with the Genetic Algorithm and Particle Swarm Optimization (SVR+GA+PSO), outperform others with 99% accuracy, demonstrating the potential of hybrid approaches to increase prediction accuracy. The ARIMA is the most accurate statistical models in the research, with a performance of 63%. However, it has limitations when dealing with complex datasets compared to advanced AI models.
  • Finally, in predictive analytics for the O&G domain, the Hybrid-Physics Guided-Variational Bayesian Spatial-Temporal Neural Network and GRU models approach 99% accuracy, demonstrating the usefulness of merging domain-specific knowledge with sophisticated neural network designs. ANN and DL models perform well in a variety of situations, but using hybrid approaches and optimization techniques can improve their accuracy even more. However, the difference in performance across DT and RF models indicates that careful model selection and tuning are necessary to achieve optimal outcomes.
  • The study indicates various patterns in model performance. ANNs generally show excellent accuracy but have a few performance outliers; the MLP, for example, reaches only 10% accuracy in one case. DL models consistently perform well even though there is some volatility in performance, as seen in Faster R-CNN+ClusterRPN’s 71% accuracy. Fuzzy Logic models provide particularly consistent high performance. DT and RF models are very variable, with some obtaining outstanding accuracy and others doing poorly. Interrelated AI models have consistently obtained excellent accuracy. Statistical models, such as the ARIMA, perform poorly compared to other categories, showing their limits with complicated datasets. Predictive analytics models for the O&G domain normally perform well, yet there are significant outliers, for example, K+MC with 18% accuracy.
  • Performance levels differ among model categories, as shown in Figure 7. ANN models perform well on average, with an accuracy of 89.23%, but performance can vary greatly depending on specific variations and modifications, as shown by several outliers. DL models perform well, with an average accuracy of 93.73%, demonstrating less variability and solid outcomes across diverse versions. Fuzzy Logic and Neuro-fuzzy models stand out for their excellent and constant performance, with an average accuracy of 99%, making them extremely trustworthy for their applications. DT, RF, and hybrid models exhibit great variability; although models like CATBOOST and DT attain excellent accuracy, others, such as RF+Analog-to-digital converters, perform poorly. Interrelated AI models perform consistently well, with an average accuracy of 97.67%. In comparison, the ARIMA model from the statistical model category performs inadequately, with 63% accuracy, demonstrating limits in dealing with complex information. Models used for predictive analytics in the O&G field typically perform well, although there are a few distinct exceptions. Overall, while the most advanced AI models perform well, the diversity within particular categories emphasizes the significance of model selection and modification for the best outcomes.
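To make the statistical-modeling point above concrete, the following is a minimal, hedged SARIMA forecasting sketch in Python with statsmodels. The monthly series, its 12-month seasonality, and the (1, 1, 1)(1, 1, 1, 12) orders are illustrative assumptions rather than values taken from the reviewed studies; a real DGA application would use measured gas concentrations and tuned orders.

```python
# Minimal SARIMA forecasting sketch (all values synthetic and assumed):
# a monthly series with a yearly cycle stands in for a DGA gas-concentration trend.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")    # 8 years, monthly
seasonal = 5 * np.sin(2 * np.pi * np.arange(96) / 12)       # 12-month seasonality
trend = 0.2 * np.arange(96)                                  # slow upward drift
series = pd.Series(50 + trend + seasonal + rng.normal(0, 1, 96), index=idx)

# (p, d, q) x (P, D, Q, s) orders are illustrative, not tuned on real data.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
print(result.forecast(steps=12).round(2))                    # next 12 months
```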
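The five-parameter versus fifteen-parameter well-monitoring comparison described above can be sketched as follows. This is not the reviewed studies' code: the data are synthetic (via make_classification), and the Random Forest settings and the column split are assumptions made purely for illustration. Reporting F1 alongside accuracy reflects the imbalanced fault labels typical of well-failure data.

```python
# Sketch of the 5-parameter vs 15-parameter comparison using Random Forest.
# The data are synthetic; the column counts simply mimic the reviewed setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic "well sensor" table: 15 candidate inputs, imbalanced binary fault label.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=10,
                           n_redundant=2, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

for n_cols, label in [(5, "5-parameter RF"), (15, "15-parameter RF")]:
    clf = RandomForestClassifier(n_estimators=300, random_state=42)
    clf.fit(X_train[:, :n_cols], y_train)        # restrict to the first n_cols inputs
    pred = clf.predict(X_test[:, :n_cols])
    print(f"{label}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"F1={f1_score(y_test, pred):.3f}")
```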
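For the transformer fault detection bullet, a hedged sketch of DGA-based classification is shown below. The gas columns (H2, CH4, C2H2, C2H4, C2H6), the ratio rule used to fabricate toy fault labels, and the KNN/RF configurations are all hypothetical; actual studies train on measured ppm values with expert-labelled fault classes.

```python
# Hypothetical DGA-based transformer fault classification sketch.
# Gas columns and the label-generating rule are invented for illustration only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
gases = ["H2", "CH4", "C2H2", "C2H4", "C2H6"]
X = pd.DataFrame(rng.lognormal(mean=3.0, sigma=1.0, size=(600, len(gases))),
                 columns=gases)                   # toy ppm-like concentrations
# Toy labels: 0 = normal, 1 = thermal fault, 2 = discharge fault (ratio-based rule).
ratio = X["C2H2"] / (X["C2H4"] + 1e-6)
y = pd.cut(ratio, bins=[-np.inf, 0.1, 1.0, np.inf], labels=[0, 1, 2]).astype(int)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```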
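Finally, to illustrate the point about accuracy versus other performance measures, here is a small sketch using scikit-learn metrics. The label vectors are made up solely to show how a high accuracy can coexist with zero recall on a rare fault class, and how R² is read in a regression setting.

```python
# Why accuracy alone can mislead on imbalanced classes, and how R^2 is read.
# All labels and predictions below are illustrative only.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             r2_score, recall_score)

# Imbalanced classification: 95% "normal" (0), 5% "fault" (1).
y_true = np.array([0] * 95 + [1] * 5)
y_naive = np.zeros(100, dtype=int)                # always predicts "normal"
print("accuracy :", accuracy_score(y_true, y_naive))                    # 0.95, looks good
print("precision:", precision_score(y_true, y_naive, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_naive, zero_division=0))     # 0.0, reveals the problem
print("F1       :", f1_score(y_true, y_naive, zero_division=0))         # 0.0

# Regression: R^2 close to 1 means most of the variance is explained.
y_reg_true = np.array([2.0, 3.5, 5.1, 7.9, 9.2])
y_reg_pred = np.array([2.2, 3.3, 5.0, 8.1, 9.0])
print("R^2      :", round(r2_score(y_reg_true, y_reg_pred), 3))
```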

4. Future Research Directions

5. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

RF: Random Forest; DNN: Deep Neural Network
GAM: Generalized Additive Model; MELM: Multivariate Empirical Mode Decomposition
NN: Neural Network; ANFIS: Adaptive Neuro-Fuzzy Inference System
SVR-GA: Support Vector Regression with Genetic Algorithm; SOM: Self-Organizing Map
SVR-PSO: Support Vector Regression with Particle Swarm Optimization; ANN: Artificial Neural Network
SVR-FFA: Support Vector Regression with Firefly Algorithm; MRGC: Maximum Relevant Gain Clustering
GB: Gradient Boosting; CatBoost: Categorical Boosting
LSSVM-CSA: Least Squares Support Vector Machine with Cuckoo Search Algorithm; MLR: Multiple Linear Regression
AHC: Agglomerative Hierarchical Clustering; SVM: Support Vector Machine
XGBoost: Extreme Gradient Boosting; FN: Fuzzy Network
GPR: Gaussian Process Regression; LDA: Linear Discriminant Analysis
LWQPSO-ANN: Linearly Weighted Quantum Particle Swarm Optimization with Artificial Neural Network; LSSVM: Least Squares Support Vector Machine
PCA: Principal Component Analysis; DL: Deep Learning
MLP-ANN: Multilayer Perceptron with Artificial Neural Network; MLSTM: Multilayer Long Short-Term Memory
MLP-PSO: Multilayer Perceptron with Particle Swarm Optimization; GRU: Gated Recurrent Unit
DT: Decision Tree; AdaBoost: Adaptive Boosting
LSTM: Long Short-Term Memory; LSTM-AE-IF: Long Short-Term Memory Autoencoder with Isolation Forest
KNN: k-Nearest Neighbors; DNN: Deep Neural Network
NB: Naive Bayes; CNN: Convolutional Neural Network
GP: Genetic Programming; O&G: Oil and Gas
ELM: Extreme Learning Machine; AI: Artificial Intelligence
DF: Deep Forest; MSE: Mean Squared Error
QDA: Quadratic Discriminant Analysis; MAPE: Mean Absolute Percentage Error
ML: Machine Learning; AAPE: Arithmetic Average Percentage Error
DGA: Dissolved Gas Analysis; SMAPE: Symmetric Mean Absolute Percentage Error
RMSE: Root Mean Squared Error; RSE: Relative Squared Error
MAE: Mean Absolute Error; RFR: Random Forest Regression
AUC: Area Under the Curve; FNACC: Faulty-Normal Accuracy
ARE: Absolute Relative Error; TPC: Total Percent Correct
EVS: Explained Variance Score; VAF: Variance Accounted For
DTR: Decision Tree Regression; WI: Weighted Index
PLR: Polynomial Linear Regression; LMI: Linear Mean Index
SNR: Signal-to-Noise Ratio; AP: Average Precision
RFNACC: Real Faulty-Normal Accuracy; MAP: Mean Average Percentage
RMSPE: Root Mean Square Percentage Error; ARD: Absolute Relative Difference
MARE: Mean Absolute Relative Error; MPa: Megapascal
SI: Severity Index; P-JUS-CKGL: Pressure Downstream of Gas Lift Choke
ENS: Energy Normalized Score; P-CKGL: Pressure Downstream of Gas Lift Choke (CKGL)
MPE: Mean Percentage Error; QGL: Gas Lift Flow Rate
R: Correlation Coefficient; T-PDG: Temperature at the Permanent Downhole Gauge Sensor
AARD: Average Absolute Relative Deviation; T-PCK: Temperature Downstream of the Production Choke
P-PDG: Pressure at Permanent Downhole Gauge (PDG); LSB: Least Square Boosting
P-TPT: Pressure at Temperature/Pressure Transducer (TPT); PLS: Partial Least Squares
T-TPT: Temperature at TPT; FPM: Feature Projection Model
P-MON-CKP: Pressure Upstream of Production Choke (CKP); FP-DNN: Feature Projection-Deep Neural Network
T-JUS-CKP: Temperature Downstream of CKP; GNN: Graph Neural Network
T-JUS-CKGL: Temperature Downstream of CKGL; MLP: Multilayer Perceptron
FP-PLS: Feature Projection-PLS; Bi-LSTM: Bidirectional Long Short-Term Memory
MGGP: Multi-Gene Genetic Programming; SHAP: Shapley Additive Explanation
xNES: Exponential Natural Evolution Strategies; LR: Logistic Regression
RNN: Recurrent Neural Network; LOF: Local Outlier Factor
LGBM: Light Gradient Boosting Machine; ICA: Imperialist Competitive Algorithm
SMOTE: Synthetic Minority Oversampling Technique; SFLA: Shuffled Frog-Leaping Algorithm
LIME: Local Interpretable Model-Agnostic Explanations; SA: Simulated Annealing
XAI: Explainable Artificial Intelligence; PBBLR: Physics-Based Bayesian Linear Regression
GSK: Gaining-Sharing Knowledge-Based Algorithm; ARIMA: Autoregressive Integrated Moving Average
BayesOpt-XGBoost: Bayesian Optimization XGBoost; GM: Generalized Method of Moments
FA: Firefly Algorithm; PSO-FDGGM: PSO-Based Data Grouping Grey Model with a Fractional Order Accumulation
COA: Cuckoo Optimization Algorithm; PSOGM: PSO for Grey Model
GWO: Grey Wolf Optimizer; LSSVM: Least Square Support Vector Machine
HAS: Harmony Search; GA: Genetic Algorithm
BLR: Bayesian Linear Regression; OCSVM: One-Class Support Vector Machine
SARIMA: Seasonal Autoregressive Integrated Moving Average; BAE: Basic Autoencoder
GM: Grey Model; CAE: Convolutional Autoencoder
FGM: Fractional Grey Model; AE: Autoencoder
DGGM: Data Grouping-Based Grey Modeling Method; VAE: Variational Autoencoder
GPR: Gaussian Process Regression; MARS: Multivariate Adaptive Regression Spline
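Several of the error metrics abbreviated above (RMSE, MAE, MAPE, SMAPE) follow standard textbook definitions. The short NumPy sketch below restates those definitions for reference; it is illustrative only, the array values are placeholders, and it is not code drawn from any of the reviewed studies.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (assumes y_true contains no zeros)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

# Illustrative usage with placeholder values.
y_true = np.array([10.0, 12.5, 9.8, 11.2])
y_pred = np.array([9.5, 13.0, 10.1, 10.8])
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred), smape(y_true, y_pred))
```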
  • Liang, J.; Li, C.; Sun, K.; Zhang, S.; Wang, S.; Xiang, J.; Hu, S.; Wang, Y.; Hu, X. Activation of mixed sawdust and spirulina with or without a pre-carbonization step: Probing roles of volatile-char interaction on evolution of pyrolytic products. Fuel Process. Technol. 2023 , 250 , 107926. [ Google Scholar ] [ CrossRef ]
  • Xu, L.; Wang, Y.; Mo, L.; Tang, Y.; Wang, F.; Li, C. The research progress and prospect of data mining methods on corrosion prediction of oil and gas pipelines. Eng. Fail. Anal. 2023 , 144 , 106951. [ Google Scholar ] [ CrossRef ]
  • Yusoff, M.; Ehsan, D.; Sharif, M.Y.; Sallehud-Din, M.T.M. Topology Approach for Crude Oil Price Forecasting of Particle Swarm Optimization and Long Short-Term Memory. Int. J. Adv. Comput. Sci. Appl. 2024 , 15 , 524–532. [ Google Scholar ] [ CrossRef ]
  • Yusoff, M.; Sharif, M.Y.; Sallehud-Din, M.T.M. Long Term Short Memory with Particle Swarm Optimization for Crude Oil Price Prediction. In Proceedings of the 2023 7th International Symposium on Innovative Approaches in Smart Technologies (ISAS), Istanbul, Turkiye, 23–25 November 2023; pp. 1–4. [ Google Scholar ] [ CrossRef ]
  • Sharma, R.; Villányi, B. Evaluation of corporate requirements for smart manufacturing systems using predictive analytics. Internet Things 2022 , 19 , 100554. [ Google Scholar ] [ CrossRef ]
  • Mahfuz, N.M.; Yusoff, M.; Ahmad, Z. Review of single clustering methods. IAES Int. J. Artif. Intell. 2019 , 8 , 221–227. [ Google Scholar ] [ CrossRef ]
  • Henrys, K. Role of Predictive Analytics in Business. SSRN Electron. J. 2021 . [ Google Scholar ] [ CrossRef ]
  • Tewari, S.; Dwivedi, U.D.; Biswas, S. A novel application of ensemble methods with data resampling techniques for drill bit selection in the oil and gas industry. Energies 2021 , 14 , 432. [ Google Scholar ] [ CrossRef ]
  • Allouche, I.; Zheng, Q.; Yoosef-Ghodsi, N.; Fowler, M.; Li, Y.; Adeeb, S. Enhanced predictive method for pipeline strain demand subject to permanent ground displacements with internal pressure & temperature: A finite difference approach. J. Infrastruct. Intell. Resil. 2023 , 2 , 100030. [ Google Scholar ] [ CrossRef ]
  • Carvalho, B.G.; Vargas, R.E.V.; Salgado, R.M.; Munaro, C.J.; Varejao, F.M. Flow Instability Detection in Offshore Oil Wells with Multivariate Time Series Machine Learning Classifiers. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Ohalete, N.C.; Aderibigbe, A.O.; Ani, E.C.; Ohenhen, P.E.; Akinoso, A. Advancements in predictive maintenance in the oil and gas industry: A review of AI and data science applications. World J. Adv. Res. Rev. 2023 , 20 , 167–181. [ Google Scholar ] [ CrossRef ]
  • Tariq, Z.; Aljawad, M.S.; Hasan, A.; Murtaza, M.; Mohammed, E.; El-Husseiny, A.; Alarifi, S.A.; Mahmoud, M.; Abdulraheem, A. A Systematic Review of Data Science and Machine Learning Applications to the Oil and Gas Industry. J. Pet. Explor. Prod. Technol. 2021 , 11 , 4339–4374. [ Google Scholar ] [ CrossRef ]
  • Yu, X.; Wang, J.; Hong, Q.-Q.; Teku, R.; Wang, S.-H.; Zhang, Y.-D. Transfer learning for medical images analyses: A survey. Neurocomputing 2022 , 489 , 230–254. [ Google Scholar ] [ CrossRef ]
  • Barkana, B.D.; Ozkan, Y.; Badara, J.A. Analysis of working memory from EEG signals under different emotional states. Biomed. Signal Process. Control. 2022 , 71 , 103249. [ Google Scholar ] [ CrossRef ]
  • Chen, W.; Huang, H.; Huang, J.; Wang, K.; Qin, H.; Wong, K.K. Deep learning-based medical image segmentation of the aorta using XR-MSF-U-Net. Comput. Methods Programs Biomed. 2022 , 225 , 107073. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Huang, C.; Gu, B.; Chen, Y.; Tan, X.; Feng, L. Energy return on energy, carbon, and water investment in oil and gas resource extraction: Methods and applications to the Daqing and Shengli oilfields. Energy Policy 2019 , 134 , 110979. [ Google Scholar ] [ CrossRef ]
  • Hazboun, S.; Boudet, H. Chapter 8—A ‘thin green line’ of resistance? Assessing public views on oil, natural gas, and coal export in the Pacific Northwest region of the United States and Canada. In Public Responses to Fossil Fuel Export ; Boudet, H., Hazboun, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2022; pp. 121–139. [ Google Scholar ]
  • Champeecharoensuk, A.; Dhakal, S.; Chollacoop, N.; Phdungsilp, A. Greenhouse gas emissions trends and drivers insights from the domestic aviation in Thailand. Heliyon 2024 , 10 , e24206. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Centobelli, P.; Cerchione, R.; Del Vecchio, P.; Oropallo, E.; Secundo, G. Blockchain technology for bridging trust, traceability and transparency in circular supply chain. Inf. Manag. 2022 , 59 , 103508. [ Google Scholar ] [ CrossRef ]
  • Majed, H.; Al-Janabi, S.; Mahmood, S. Data Science for Genomics (GSK-XGBoost) for Prediction Six Types of Gas Based on Intelligent Analytics. In Proceedings of the 2022 22nd International Conference on Computational Science and Its Applications (ICCSA), Malaga, Spain, 4–7 July 2022; pp. 28–34. [ Google Scholar ] [ CrossRef ]
  • Waterworth, A.; Bradshaw, M.J. Unconventional trade-offs? National oil companies, foreign investment and oil and gas development in Argentina and Brazil. Energy Policy 2018 , 122 , 7–16. [ Google Scholar ] [ CrossRef ]
  • Marins, M.A.; Barros, B.D.; Santos, I.H.; Barrionuevo, D.C.; Vargas, R.E.; de M. Prego, T.; de Lima, A.A.; de Campos, M.L.; da Silva, E.A.; Netto, S.L. Fault detection and classification in oil wells and production/service lines using random forest. J. Pet. Sci. Eng. 2020 , 197 , 107879. [ Google Scholar ] [ CrossRef ]
  • Dhaked, D.K.; Dadhich, S.; Birla, D. Power output forecasting of solar photovoltaic plant using LSTM. Green Energy Intell. Transp. 2023 , 2 , 100113. [ Google Scholar ] [ CrossRef ]
  • Yan, R.; Wang, S.; Peng, C. An Artificial Intelligence Model Considering Data Imbalance for Ship Selection in Port State Control Based on Detention Probabilities. J. Comput. Sci. 2021 , 48 , 101257. [ Google Scholar ] [ CrossRef ]
  • Agwu, O.E.; Okoro, E.E.; Sanni, S.E. Modelling oil and gas flow rate through chokes: A critical review of extant models. J. Pet. Sci. Eng. 2022 , 208 , 109775. [ Google Scholar ] [ CrossRef ]
  • Nandhini, K.; Tamilpavai, G. Hybrid CNN-LSTM and modified wild horse herd Model-based prediction of genome sequences for genetic disorders. Biomed. Signal Process. Control. 2022 , 78 , 103840. [ Google Scholar ] [ CrossRef ]
  • Balaji, S.; Karthik, S. Deep Learning Based Energy Consumption Prediction on Internet of Things Environment. Intell. Autom. Soft Comput. 2023 , 37 , 727–743. [ Google Scholar ] [ CrossRef ]
  • Yang, H.; Liu, X.; Chu, X.; Xie, B.; Zhu, G.; Li, H.; Yang, J. Optimization of tight gas reservoir fracturing parameters via gradient boosting regression modeling. Heliyon 2024 , 10 , e27015. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • de los Ángeles Sánchez Morales, M.; Anguiano, F.I.S. Data science—Time series analysis of oil & gas production in mexican fields. Procedia Comput. Sci. 2022 , 200 , 21–30. [ Google Scholar ] [ CrossRef ]
  • Tan, Y.; Al-Huqail, A.A.; Chen, Q.; Majdi, H.S.; Algethami, J.S.; Ali, H.E. Analysis of groundwater pollution in a petroleum refinery energy contributed in rock mechanics through ANFIS-AHP. Int. J. Energy Res. 2022 , 46 , 20928–20938. [ Google Scholar ] [ CrossRef ]
  • Wu, M.; Wang, G.; Liu, H. Research on Transformer Fault Diagnosis Based on SMOTE and Random Forest. In Proceedings of the 2022 4th International Conference on Electrical Engineering and Control Technologies (CEECT), Shanghai, China, 16–18 December 2022; pp. 359–363. [ Google Scholar ] [ CrossRef ]
  • Dashti, Q.; Matar, S.; Abdulrazzaq, H.; Al-Shammari, N.; Franco, F.; Haryanto, E.; Zhang, M.Q.; Prakash, R.; Bolanos, N.; Ibrahim, M.; et al. Data Analytics into Hydraulic Modelling for Better Understanding of Well/Surface Network Limits, Proactively Identify Challenges and, Provide Solutions for Improved System Performance in the Greater Burgan Field. In Proceedings of the Abu Dhabi International Petroleum Exhibition & Conference, Abu Dhabi, United Arab Emirates, 15–18 November 2021. [ Google Scholar ] [ CrossRef ]
  • Wang, X.; Daryapour, M.; Shahrabadi, A.; Pirasteh, S.; Razavirad, F. Artificial neural networks in predicting of the gas molecular diffusion coefficient. Chem. Eng. Res. Des. 2023 , 200 , 407–418. [ Google Scholar ] [ CrossRef ]
  • Kamarudin, R.; Ang, Y.; Topare, N.; Ismail, M.; Mustafa, K.; Gunnasegaran, P.; Abdullah, M.; Mazlan, N.; Badruddin, I.; Zedan, A.; et al. Influence of oxyhydrogen gas retrofit into two-stroke engine on emissions and exhaust gas temperature variations. Heliyon 2024 , 10 , e26597. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Raghuraman, R.; Darvishi, A. Detecting Transformer Fault Types from Dissolved Gas Analysis Data Using Machine Learning Techniques. In Proceedings of the 2022 IEEE 15th Dallas Circuit and System Conference (DCAS), Dallas, TX, USA, 17–19 June 2022; pp. 1–5. [ Google Scholar ] [ CrossRef ]
  • Mukherjee, T.; Burgett, T.; Ghanchi, T.; Donegan, C.; Ward, T. Predicting Gas Production Using Machine Learning Methods: A Case Study. In Proceedings of the SEG International Exposition and Annual Meeting, San Antonio, TX, USA, 25 September 2019; pp. 2248–2252. [ Google Scholar ] [ CrossRef ]
  • Dixit, N.; McColgan, P.; Kusler, K. Machine Learning-Based Probabilistic Lithofacies Prediction from Conventional Well Logs: A Case from the Umiat Oil Field of Alaska. Energies 2020 , 13 , 4862. [ Google Scholar ] [ CrossRef ]
  • Aldosari, H.; Elfouly, R.; Ammar, R. Evaluation of Machine Learning-Based Regression Techniques for Prediction of Oil and Gas Pipelines Defect. In Proceedings of the 2020 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 16–18 December 2020; pp. 1452–1456. [ Google Scholar ] [ CrossRef ]
  • Elmousalami, H.H.; Elaskary, M. Drilling stuck pipe classification and mitigation in the Gulf of Suez oil fields using artificial intelligence. J. Pet. Explor. Prod. Technol. 2020 , 10 , 2055–2068. [ Google Scholar ] [ CrossRef ]
  • Taha, I.B.; Mansour, D.-E.A. Novel Power Transformer Fault Diagnosis Using Optimized Machine Learning Methods. Intell. Autom. Soft Comput. 2021 , 28 , 739–752. [ Google Scholar ] [ CrossRef ]
  • Tiyasha; Tung, T.M.; Yaseen, Z.M. A survey on river water quality modelling using artificial intelligence models: 2000–2020. J. Hydrol. 2020 , 585 , 124670. [ Google Scholar ] [ CrossRef ]
  • Agatonovic-Kustrin, S.; Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 2000 , 22 , 717–727. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Tao, H.; Hameed, M.M.; Marhoon, H.A.; Zounemat-Kermani, M.; Heddam, S.; Kim, S.; Sulaiman, S.O.; Tan, M.L.; Sa’adi, Z.; Mehr, A.D.; et al. Groundwater level prediction using machine learning models: A comprehensive review. Neurocomputing 2022 , 489 , 271–308. [ Google Scholar ] [ CrossRef ]
  • Kalam, S.; Yousuf, U.; Abu-Khamsin, S.A.; Bin Waheed, U.; Khan, R.A. An ANN model to predict oil recovery from a 5-spot waterflood of a heterogeneous reservoir. J. Pet. Sci. Eng. 2022 , 210 , 110012. [ Google Scholar ] [ CrossRef ]
  • Eckert, E.; Bělohlav, Z.; Vaněk, T.; Zámostný, P.; Herink, T. ANN modelling of pyrolysis utilising the characterisation of atmospheric gas oil based on incomplete data. Chem. Eng. Sci. 2007 , 62 , 5021–5025. [ Google Scholar ] [ CrossRef ]
  • Qin, G.; Xia, A.; Lu, H.; Wang, Y.; Li, R.; Wang, C. A hybrid machine learning model for predicting crater width formed by explosions of natural gas pipelines. J. Loss Prev. Process. Ind. 2023 , 82 , 104994. [ Google Scholar ] [ CrossRef ]
  • Wang, Q.; Song, Y.; Zhang, X.; Dong, L.; Xi, Y.; Zeng, D.; Liu, Q.; Zhang, H.; Zhang, Z.; Yan, R.; et al. Evolution of corrosion prediction models for oil and gas pipelines: From empirical-driven to data-driven. Eng. Fail. Anal. 2023 , 146 , 107097. [ Google Scholar ] [ CrossRef ]
  • Sami, N.A.; Ibrahim, D.S. Forecasting multiphase flowing bottom-hole pressure of vertical oil wells using three machine learning techniques. Pet. Res. 2021 , 6 , 417–422. [ Google Scholar ] [ CrossRef ]
  • Chohan, H.Q.; Ahmad, I.; Mohammad, N.; Manca, D.; Caliskan, H. An integrated approach of artificial neural networks and polynomial chaos expansion for prediction and analysis of yield and environmental impact of oil shale retorting process under uncertainty. Fuel 2022 , 329 , 125351. [ Google Scholar ] [ CrossRef ]
  • Carvalho, G.d.A.; Minnett, P.J.; Ebecken, N.F.F.; Landau, L. Machine-Learning Classification of SAR Remotely-Sensed Sea-Surface Petroleum Signatures—Part 1: Training and Testing Cross Validation. Remote Sens. 2022 , 14 , 3027. [ Google Scholar ] [ CrossRef ]
  • Li, X.; Han, W.; Shao, W.; Chen, L.; Zhao, D. Data-Driven Predictive Model for Mixed Oil Length Prediction in Long-Distance Transportation Pipeline. In Proceedings of the 2021 IEEE 10th Data Driven Control and Learning Systems Conference (DDCLS), Suzhou, China, 14–16 May 2021; pp. 1486–1491. [ Google Scholar ] [ CrossRef ]
  • Mendoza, J.H.; Tariq, R.; Espinosa, L.F.S.; Anguebes, F.; Bassam, A. Soft Computing Tools for Multiobjective Optimization of Offshore Crude Oil and Gas Separation Plant for the Best Operational Condition. In Proceedings of the 2021 18th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 10–12 November 2021; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Sakhaei, A.; Zamir, S.M.; Rene, E.R.; Veiga, M.C.; Kennes, C. Neural network-based performance assessment of one- and two-liquid phase biotrickling filters for the removal of a waste-gas mixture containing methanol, α-pinene, and hydrogen sulfide. Environ. Res. 2023 , 237 , 116978. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Hasanzadeh, M.; Madani, M. Deterministic tools to predict gas assisted gravity drainage recovery factor. Energy Geosci. 2023 , 5 , 100267. [ Google Scholar ] [ CrossRef ]
  • Zhang, X.-Q.; Cheng, Q.-L.; Sun, W.; Zhao, Y.; Li, Z.-M. Research on a TOPSIS energy efficiency evaluation system for crude oil gathering and transportation systems based on a GA-BP neural network. Pet. Sci. 2023 , 21 , 621–640. [ Google Scholar ] [ CrossRef ]
  • Ismail, A.; Ewida, H.F.; Nazeri, S.; Al-Ibiary, M.G.; Zollo, A. Gas channels and chimneys prediction using artificial neural networks and multi-seismic attributes, offshore West Nile Delta, Egypt. J. Pet. Sci. Eng. 2022 , 208 , 109349. [ Google Scholar ] [ CrossRef ]
  • Goliatt, L.; Saporetti, C.; Oliveira, L.; Pereira, E. Performance of evolutionary optimized machine learning for modeling total organic carbon in core samples of shale gas fields. Petroleum 2023 , 10 , 150–164. [ Google Scholar ] [ CrossRef ]
  • Amar, M.N.; Ghahfarokhi, A.J.; Ng, C.S.W.; Zeraibi, N. Optimization of WAG in real geological field using rigorous soft computing techniques and nature-inspired algorithms. J. Pet. Sci. Eng. 2021 , 206 , 109038. [ Google Scholar ] [ CrossRef ]
  • Mao, W.; Wei, B.; Xu, X.; Chen, L.; Wu, T.; Peng, Z.; Ren, C. Power transformers fault diagnosis using graph neural networks based on dissolved gas data. J. Phys. Conf. Ser. 2022 , 2387 , 012029. [ Google Scholar ] [ CrossRef ]
  • Ghosh, I.; Chaudhuri, T.D.; Alfaro-Cortés, E.; Gámez, M.; García, N. A hybrid approach to forecasting futures prices with simultaneous consideration of optimality in ensemble feature selection and advanced artificial intelligence. Technol. Forecast. Soc. Chang. 2022 , 181 , 121757. [ Google Scholar ] [ CrossRef ]
  • Wang, B.; Guo, Y.; Wang, D.; Zhang, Y.; He, R.; Chen, J. Prediction model of natural gas pipeline crack evolution based on optimized DCNN-LSTM. Mech. Syst. Signal Process. 2022 , 181 , 109557. [ Google Scholar ] [ CrossRef ]
  • Yang, R.; Liu, X.; Yu, R.; Hu, Z.; Duan, X. Long short-term memory suggests a model for predicting shale gas production. Appl. Energy 2022 , 322 , 119415. [ Google Scholar ] [ CrossRef ]
  • Werneck, R.d.O.; Prates, R.; Moura, R.; Gonçalves, M.M.; Castro, M.; Soriano-Vargas, A.; Júnior, P.R.M.; Hossain, M.M.; Zampieri, M.F.; Ferreira, A.; et al. Data-driven deep-learning forecasting for oil production and pressure. J. Pet. Sci. Eng. 2022 , 210 , 109937. [ Google Scholar ] [ CrossRef ]
  • Antariksa, G.; Muammar, R.; Nugraha, A.; Lee, J. Deep sequence model-based approach to well log data imputation and petrophysical analysis: A case study on the West Natuna Basin, Indonesia. J. Appl. Geophys. 2023 , 218 , 105213. [ Google Scholar ] [ CrossRef ]
  • Das, S.; Paramane, A.; Chatterjee, S.; Rao, U.M. Accurate Identification of Transformer Faults from Dissolved Gas Data Using Recursive Feature Elimination Method. IEEE Trans. Dielectr. Electr. Insul. 2023 , 30 , 466–473. [ Google Scholar ] [ CrossRef ]
  • Barjouei, H.S.; Ghorbani, H.; Mohamadian, N.; Wood, D.A.; Davoodi, S.; Moghadasi, J.; Saberi, H. Prediction performance advantages of deep machine learning algorithms for two-phase flow rates through wellhead chokes. J. Pet. Explor. Prod. Technol. 2021 , 11 , 1233–1261. [ Google Scholar ] [ CrossRef ]
  • Martínez, V.; Rocha, A. The Golem: A General Data-Driven Model for Oil & Gas Forecasting Based on Recurrent Neural Networks. IEEE Access 2023 , 11 , 41105–41132. [ Google Scholar ] [ CrossRef ]
  • Wang, Z.; Bai, L.; Song, G.; Zhang, Y.; Zhu, M.; Zhao, M.; Chen, L.; Wang, M. Optimized faster R-CNN for oil wells detection from high-resolution remote sensing images. Int. J. Remote Sens. 2023 , 44 , 6897–6928. [ Google Scholar ] [ CrossRef ]
  • Hiassat, A.; Diabat, A.; Rahwan, I. A genetic algorithm approach for location-inventory-routing problem with perishable products. J. Manuf. Syst. 2017 , 42 , 93–103. [ Google Scholar ] [ CrossRef ]
  • Sharma, V.; Cali, Ü.; Sardana, B.; Kuzlu, M.; Banga, D.; Pipattanasomporn, M. Data-driven short-term natural gas demand forecasting with machine learning techniques. J. Pet. Sci. Eng. 2021 , 206 , 108979. [ Google Scholar ] [ CrossRef ]
  • Phan, H.C.; Duong, H.T. Predicting burst pressure of defected pipeline with Principal Component Analysis and adaptive Neuro Fuzzy Inference System. Int. J. Press. Vessel. Pip. 2021 , 189 , 104274. [ Google Scholar ] [ CrossRef ]
  • Hamedi, H.; Zendehboudi, S.; Rezaei, N.; Saady, N.M.C.; Zhang, B. Modeling and optimization of oil adsorption capacity on functionalized magnetic nanoparticles using machine learning approach. J. Mol. Liq. 2023 , 392 , 123378. [ Google Scholar ] [ CrossRef ]
  • Castro, A.O.D.S.; Santos, M.D.J.R.; Leta, F.R.; Lima, C.B.C.; Lima, G.B.A. Unsupervised Methods to Classify Real Data from Offshore Wells. Am. J. Oper. Res. 2021 , 11 , 227–241. [ Google Scholar ] [ CrossRef ]
  • Ma, B.; Shuai, J.; Liu, D.; Xu, K. Assessment on failure pressure of high strength pipeline with corrosion defects. Eng. Fail. Anal. 2013 , 32 , 209–219. [ Google Scholar ] [ CrossRef ]
  • Shuai, Y.; Shuai, J.; Xu, K. Probabilistic analysis of corroded pipelines based on a new failure pressure model. Eng. Fail. Anal. 2017 , 81 , 216–233. [ Google Scholar ] [ CrossRef ]
  • Phan, H.C.; Dhar, A.S.; Mondal, B.C. Revisiting burst pressure models for corroded pipelines. Can. J. Civ. Eng. 2017 , 44 , 485–494. [ Google Scholar ] [ CrossRef ]
  • Freire, J.; Vieira, R.; Castro, J.; Benjamin, A. Part 3: Burst tests of pipeline with extensive longitudinal metal loss. Exp. Tech. 2006 , 30 , 60–65. [ Google Scholar ] [ CrossRef ]
  • Cronin, D.S. Assessment of Corrosion Defects in Pipelines. Ph.D. Thesis, University of Waterloo, Waterloo, ON, Canada, 2000. [ Google Scholar ]
  • Ghasemieh, A.; Lloyed, A.; Bahrami, P.; Vajar, P.; Kashef, R. A novel machine learning model with Stacking Ensemble Learner for predicting emergency readmission of heart-disease patients. Decis. Anal. J. 2023 , 7 , 100242. [ Google Scholar ] [ CrossRef ]
  • Jeny, J.R.V.; Reddy, N.S.; Aishwarya, P.; Samreen. A Classification Approach for Heart Disease Diagnosis using Machine Learning. In Proceedings of the 2021 6th International Conference on Signal Processing, Computing and Control (ISPCC), Solan, India, 7–9 October 2021; pp. 456–459. [ Google Scholar ] [ CrossRef ]
  • Mazumder, R.K.; Salman, A.M.; Li, Y. Failure risk analysis of pipelines using data-driven machine learning algorithms. Struct. Saf. 2021 , 89 , 102047. [ Google Scholar ] [ CrossRef ]
  • Liu, S.; Zhao, Y.; Wang, Z. Artificial Intelligence Method for Shear Wave Travel Time Prediction considering Reservoir Geological Continuity. Math. Probl. Eng. 2021 , 2021 , 5520428. [ Google Scholar ] [ CrossRef ]
  • Saroja, S.; Haseena, S.; Madavan, R. Dissolved Gas Analysis of Transformer: An Approach Based on ML and MCDM. IEEE Trans. Dielectr. Electr. Insul. 2023 , 30 , 2429–2438. [ Google Scholar ] [ CrossRef ]
  • Raj, R.A.; Sarathkumar, D.; Venkatachary, S.K.; Andrews, L.J.B. Classification and Prediction of Incipient Faults in Transformer Oil by Supervised Machine Learning using Decision Tree. In Proceedings of the 2023 3rd International conference on Artificial Intelligence and Signal Processing (AISP), Vijayawada, India, 18–20 March 2023; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Aslam, N.; Khan, I.U.; Alansari, A.; Alrammah, M.; Alghwairy, A.; Alqahtani, R.; Alqahtani, R.; Almushikes, M.; AL Hashim, M. Anomaly Detection Using Explainable Random Forest for the Prediction of Undesirable Events in Oil Wells. Appl. Comput. Intell. Soft Comput. 2022 , 2022 , 1558381. [ Google Scholar ] [ CrossRef ]
  • Turan, E.M.; Jaschke, J. Classification of undesirable events in oil well operation. In Proceedings of the 2021 23rd International Conference on Process Control (PC), Strbske Pleso, Slovakia, 1–4 June 2021; pp. 157–162. [ Google Scholar ] [ CrossRef ]
  • Gatta, F.; Giampaolo, F.; Chiaro, D.; Piccialli, F. Predictive maintenance for offshore oil wells by means of deep learning features extraction. Expert Syst. 2022 , 41 , e13128. [ Google Scholar ] [ CrossRef ]
  • Brønstad, C.; Netto, S.L.; Ramos, A.L.L. Data-driven Detection and Identification of Undesirable Events in Subsea Oil Wells. In Proceedings of the SENSORDEVICES 2021 Twelfth International Conference on Sensor Device Technologies and Applications, Athens, Greece, 14–18 November 2021; pp. 1–6. [ Google Scholar ]
  • Ben Jabeur, S.; Khalfaoui, R.; Ben Arfi, W. The effect of green energy, global environmental indexes, and stock markets in predicting oil price crashes: Evidence from explainable machine learning. J. Environ. Manag. 2021 , 298 , 113511. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Baabbad, H.K.H.; Artun, E.; Kulga, B. Understanding the Controlling Factors for CO 2 Sequestration in Depleted Shale Reservoirs Using Data Analytics and Machine Learning. In Proceedings of the SPE EuropEC—Europe Energy Conference featured at the 83rd EAGE Annual Conference & Exhibition, Madrid, Spain, 6–9 June 2022. [ Google Scholar ] [ CrossRef ]
  • Alsaihati, A.; Elkatatny, S.; Mahmoud, A.A.; Abdulraheem, A. Use of Machine Learning and Data Analytics to Detect Downhole Abnormalities While Drilling Horizontal Wells, with Real Case Study. J. Energy Resour. Technol. Trans. ASME 2021 , 143 , 043201. [ Google Scholar ] [ CrossRef ]
  • Kumar, A.; Hassanzadeh, H. A qualitative study of the impact of random shale barriers on SAGD performance using data analytics and machine learning. J. Pet. Sci. Eng. 2021 , 205 , 108950. [ Google Scholar ] [ CrossRef ]
  • Ma, H.; Wang, H.; Geng, M.; Ai, Y.; Zhang, W.; Zheng, W. A new hybrid approach model for predicting burst pressure of corroded pipelines of gas and oil. Eng. Fail. Anal. 2023 , 149 , 107248. [ Google Scholar ] [ CrossRef ]
  • Canonaco, G.; Roveri, M.; Alippi, C.; Podenzani, F.; Bennardo, A.; Conti, M.; Mancini, N. A Machine-Learning Approach for the Prediction of Internal Corrosion in Pipeline Infrastructures. In Proceedings of the 2021 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Glasgow, UK, 17–20 May 2021; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Fang, J.; Cheng, X.; Gai, H.; Lin, S.; Lou, H. Development of machine learning algorithms for predicting internal corrosion of crude oil and natural gas pipelines. Comput. Chem. Eng. 2023 , 177 , 108358. [ Google Scholar ] [ CrossRef ]
  • Lv, Q.; Zheng, R.; Guo, X.; Larestani, A.; Hadavimoghaddam, F.; Riazi, M.; Hemmati-Sarapardeh, A.; Wang, K.; Li, J. Modelling minimum miscibility pressure of CO 2 -crude oil systems using deep learning, tree-based, and thermodynamic models: Application to CO 2 sequestration and enhanced oil recovery. Sep. Purif. Technol. 2023 , 310 , 123086. [ Google Scholar ] [ CrossRef ]
  • Zhu, X.; Zhang, H.; Ren, Q.; Zhang, D.; Zeng, F.; Zhu, X.; Zhang, L. An automatic identification method of imbalanced lithology based on Deep Forest and K-means SMOTE. Geoenergy Sci. Eng. 2023 , 224 , 211595. [ Google Scholar ] [ CrossRef ]
  • Chanchotisatien, P.; Vong, C. Feature engineering and feature selection for fault type classification from dissolved gas values in transformer oil. In Proceedings of the ICSEC 2021—25th International Computer Science and Engineering Conference, Chiang Rai, Thailand, 18–20 November 2021; pp. 75–80. [ Google Scholar ] [ CrossRef ]
  • de Jesus Rocha Santos, M.; de Salvo Castro, A.O.; Leta, F.R.; De Araujo, J.F.M.; de Souza Ferreira, G.; de Araújo Santos, R.; de Campos Lima, C.B.; Lima, G.B.A. Statistical analysis of offshore production sensors for failure detection applications / Análise estatística dos sensores de produção offshore para aplicações de detecção de falhas. Braz. J. Dev. 2021 , 7 , 85880–85898. [ Google Scholar ] [ CrossRef ]
  • Ali, M.; Zhu, P.; Jiang, R.; Huolin, M.; Ehsan, M.; Hussain, W.; Zhang, H.; Ashraf, U.; Ullaah, J.; Ullah, J. Reservoir characterization through comprehensive modeling of elastic logs prediction in heterogeneous rocks using unsupervised clustering and class-based ensemble machine learning. Appl. Soft Comput. 2023 , 148 , 110843. [ Google Scholar ] [ CrossRef ]
  • Salamai, A.A. Deep learning framework for predictive modeling of crude oil price for sustainable management in oil markets. Expert Syst. Appl. 2023 , 211 , 118658. [ Google Scholar ] [ CrossRef ]
  • Ashayeri, C.; Jha, B. Evaluation of transfer learning in data-driven methods in the assessment of unconventional resources. J. Pet. Sci. Eng. 2021 , 207 , 109178. [ Google Scholar ] [ CrossRef ]
  • Vuttipittayamongkol, P.; Tung, A.; Elyan, E. A Data-Driven Decision Support Tool for Offshore Oil and Gas Decommissioning. IEEE Access 2021 , 9 , 137063–137082. [ Google Scholar ] [ CrossRef ]
  • Song, T.; Zhu, W.; Chen, Z.; Jin, W.; Song, H.; Fan, L.; Yue, M. A novel well-logging data generation model integrated with random forests and adaptive domain clustering algorithms. Geoenergy Sci. Eng. 2023 , 231 , 212381. [ Google Scholar ] [ CrossRef ]
  • Awuku, B.; Huang, Y.; Yodo, N. Predicting Natural Gas Pipeline Failures Caused by Natural Forces: An Artificial Intelligence Classification Approach. Appl. Sci. 2023 , 13 , 4322. [ Google Scholar ] [ CrossRef ]
  • Al-Mudhafar, W.J.; Abbas, M.A.; Wood, D.A. Performance evaluation of boosting machine learning algorithms for lithofacies classification in heterogeneous carbonate reservoirs. Mar. Pet. Geol. 2022 , 145 , 105886. [ Google Scholar ] [ CrossRef ]
  • Wen, H.; Liu, L.; Zhang, J.; Hu, J.; Huang, X. A hybrid machine learning model for landslide-oriented risk assessment of long-distance pipelines. J. Environ. Manag. 2023 , 342 , 118177. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Otchere, D.A.; Ganat, T.O.A.; Nta, V.; Brantson, E.T.; Sharma, T. Data analytics and Bayesian Optimised Extreme Gradient Boosting approach to estimate cut-offs from wireline logs for net reservoir and pay classification. Appl. Soft Comput. 2022 , 120 , 108680. [ Google Scholar ] [ CrossRef ]
  • Gamal, H.; Elkatatny, S.; Alsaihati, A.; Abdulraheem, A. Intelligent Prediction for Rock Porosity While Drilling Complex Lithology in Real Time. Comput. Intell. Neurosci. 2021 , 2021 , 9960478. [ Google Scholar ] [ CrossRef ]
  • Ismail, M.F.H.; May, Z.; Asirvadam, V.S.; Nayan, N.A. Machine-Learning-Based Classification for Pipeline Corrosion with Monte Carlo Probabilistic Analysis. Energies 2023 , 16 , 3589. [ Google Scholar ] [ CrossRef ]
  • Prasojo, R.A.; Putra, M.A.A.; Ekojono; Apriyani, M.E.; Rahmanto, A.N.; Ghoneim, S.S.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M. Precise transformer fault diagnosis via random forest model enhanced by synthetic minority over-sampling technique. Electr. Power Syst. Res. 2023 , 220 , 109361. [ Google Scholar ] [ CrossRef ]
  • Ma, Z.; Chang, H.; Sun, Z.; Liu, F.; Li, W.; Zhao, D.; Chen, C. Very Short-Term Renewable Energy Power Prediction Using XGBoost Optimized by TPE Algorithm. In Proceedings of the 2020 4th International Conference on HVDC (HVDC), Xi’an, China, 6–9 November 2020; pp. 1236–1241. [ Google Scholar ] [ CrossRef ]
  • Ma, S.; Jiang, Z.; Liu, W. Modeling Drying-Energy Consumption in Automotive Painting Line Based on ANN and MLR for Real-Time Prediction. Int. J. Precis. Eng. Manuf. Technol. 2019 , 6 , 241–254. [ Google Scholar ] [ CrossRef ]
  • Guo, Z.; Wang, H.; Kong, X.; Shen, L.; Jia, Y. Machine Learning-Based Production Prediction Model and Its Application in Duvernay Formation. Energies 2021 , 14 , 5509. [ Google Scholar ] [ CrossRef ]
  • Ibrahim, N.M.; Alharbi, A.A.; Alzahrani, T.A.; Abdulkarim, A.M.; Alessa, I.A.; Hameed, A.M.; Albabtain, A.S.; Alqahtani, D.A.; Alsawwaf, M.K.; Almuqhim, A.A. Well Performance Classification and Prediction: Deep Learning and Machine Learning Long Term Regression Experiments on Oil, Gas, and Water Production. Sensors 2022 , 22 , 5326. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Yin, H.; Liu, C.; Wu, W.; Song, K.; Dan, Y.; Cheng, G. An integrated framework for criticality evaluation of oil & gas pipelines based on fuzzy logic inference and machine learning. J. Nat. Gas Sci. Eng. 2021 , 96 , 104264. [ Google Scholar ] [ CrossRef ]
  • Chen, H.; Zhang, C.; Jia, N.; Duncan, I.; Yang, S.; Yang, Y. A machine learning model for predicting the minimum miscibility pressure of CO 2 and crude oil system based on a support vector machine algorithm approach. Fuel 2021 , 290 , 120048. [ Google Scholar ] [ CrossRef ]
  • Naserzadeh, Z.; Nohegar, A. Development of HGAPSO-SVR corrosion prediction approach for offshore oil and gas pipelines. J. Loss Prev. Process. Ind. 2023 , 84 , 105092. [ Google Scholar ] [ CrossRef ]
  • Yuan, Z.; Chen, L.; Liu, G.; Shao, W.; Zhang, Y.; Yang, W. Physics-based Bayesian linear regression model for predicting length of mixed oil. Geoenergy Sci. Eng. 2023 , 223 , 211466. [ Google Scholar ] [ CrossRef ]
  • Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control ; John Wiley & Sons: Hoboken, NJ, USA, 2015. [ Google Scholar ]
  • McCuen, R.H. Modeling Hydrologic Change: Statistical Methods ; CRC Press: Boca Raton, FL, USA, 2016. [ Google Scholar ]
  • Liu, J.; Zhao, Z.; Zhong, Y.; Zhao, C.; Zhang, G. Prediction of the dissolved gas concentration in power transformer oil based on SARIMA model. Energy Rep. 2022 , 8 , 1360–1367. [ Google Scholar ] [ CrossRef ]
  • Li, X.; Guo, X.; Liu, L.; Cao, Y.; Yang, B. A novel seasonal grey model for forecasting the quarterly natural gas production in China. Energy Rep. 2022 , 8 , 9142–9157. [ Google Scholar ] [ CrossRef ]
  • Rashidi, S.; Mehrad, M.; Ghorbani, H.; Wood, D.A.; Mohamadian, N.; Moghadasi, J.; Davoodi, S. Determination of bubble point pressure & oil formation volume factor of crude oils applying multiple hidden layers extreme learning machine algorithms. J. Pet. Sci. Eng. 2021 , 202 , 108425. [ Google Scholar ] [ CrossRef ]
  • Gong, X.; Liu, L.; Ma, L.; Dai, J.; Zhang, H.; Liang, J.; Liang, S. A Leak Sample Dataset Construction Method for Gas Pipeline Leakage Estimation Using Pipeline Studio. In Proceedings of the International Conference on Advanced Mechatronic Systems (ICAMechS), Tokyo, Japan, 9–12 December 2021; pp. 28–32. [ Google Scholar ] [ CrossRef ]
  • Chung, S.; Loh, A.; Jennings, C.M.; Sosnowski, K.; Ha, S.Y.; Yim, U.H.; Yoon, J.-Y. Capillary flow velocity profile analysis on paper-based microfluidic chips for screening oil types using machine learning. J. Hazard. Mater. 2023 , 447 , 130806. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Mohamadian, N.; Ghorbani, H.; Wood, D.A.; Mehrad, M.; Davoodi, S.; Rashidi, S.; Soleimanian, A.; Shahvand, A.K. A geomechanical approach to casing collapse prediction in oil and gas wells aided by machine learning. J. Pet. Sci. Eng. 2021 , 196 , 107811. [ Google Scholar ] [ CrossRef ]
  • Sabah, M.; Mehrad, M.; Ashrafi, S.B.; Wood, D.A.; Fathi, S. Hybrid machine learning algorithms to enhance lost-circulation prediction and management in the Marun oil field. J. Pet. Sci. Eng. 2021 , 198 , 108125. [ Google Scholar ] [ CrossRef ]
  • Shi, J.; Xie, W.; Huang, X.; Xiao, F.; Usmani, A.S.; Khan, F.; Yin, X.; Chen, G. Real-time natural gas release forecasting by using physics-guided deep learning probability model. J. Clean. Prod. 2022 , 368 , 133201. [ Google Scholar ] [ CrossRef ]
  • Machado, A.P.F.; Vargas, R.E.V.; Ciarelli, P.M.; Munaro, C.J. Improving performance of one-class classifiers applied to anomaly detection in oil wells. J. Pet. Sci. Eng. 2022 , 218 , 110983. [ Google Scholar ] [ CrossRef ]
  • Zhou, J.; Liu, B.; Shao, M.; Yin, C.; Jiang, Y.; Song, Y. Lithologic classification of pyroclastic rocks: A case study for the third member of the Huoshiling Formation, Dehui fault depression, Songliao Basin, NE China. J. Pet. Sci. Eng. 2022 , 214 , 110456. [ Google Scholar ] [ CrossRef ]
  • Zhang, G.; Wang, Z.; Mohaghegh, S.; Lin, C.; Sun, Y.; Pei, S. Pattern visualization and understanding of machine learning models for permeability prediction in tight sandstone reservoirs. J. Pet. Sci. Eng. 2021 , 200 , 108142. [ Google Scholar ] [ CrossRef ]
  • Zuo, Z.; Ma, L.; Liang, S.; Liang, J.; Zhang, H.; Liu, T. A semi-supervised leakage detection method driven by multivariate time series for natural gas gathering pipeline. Process. Saf. Environ. Prot. 2022 , 164 , 468–478. [ Google Scholar ] [ CrossRef ]
  • Chen, Z.; Yu, W.; Liang, J.-T.; Wang, S.; Liang, H.-C. Application of statistical machine learning clustering algorithms to improve EUR predictions using decline curve analysis in shale-gas reservoirs. J. Pet. Sci. Eng. 2022 , 208 , 109216. [ Google Scholar ] [ CrossRef ]
  • Fernandes, W.; Komati, K.S.; Gazolli, K.A.d.S. Anomaly detection in oil-producing wells: A comparative study of one-class classifiers in a multivariate time series dataset. J. Pet. Explor. Prod. Technol. 2023 , 14 , 343–363. [ Google Scholar ] [ CrossRef ]
  • Gao, G.; Hazbeh, O.; Rajabi, M.; Tabasi, S.; Ghorbani, H.; Seyedkamali, R.; Shayanmanesh, M.; Radwan, A.E.; Mosavi, A.H. Application of GMDH model to predict pore pressure. Front. Earth Sci. 2023 , 10 , 1043719. [ Google Scholar ] [ CrossRef ]
  • Cirac, G.; Farfan, J.; Avansi, G.D.; Schiozer, D.J.; Rocha, A. Deep hierarchical distillation proxy-oil modeling for heterogeneous carbonate reservoirs. Eng. Appl. Artif. Intell. 2023 , 126 , 107076. [ Google Scholar ] [ CrossRef ]
  • Dayev, Z.; Shopanova, G.; Toksanbaeva, B.; Yetilmezsoy, K.; Sultanov, N.; Sihag, P.; Bahramian, M.; Kıyan, E. Modeling the flow rate of dry part in the wet gas mixture using decision tree/kernel/non-parametric regression-based soft-computing techniques. Flow Meas. Instrum. 2022 , 86 , 102195. [ Google Scholar ] [ CrossRef ]
  • Das, S.; Paramane, A.; Chatterjee, S.; Rao, U.M. Sensing Incipient Faults in Power Transformers Using Bi-Directional Long Short-Term Memory Network. IEEE Sens. Lett. 2023 , 7 , 7000304. [ Google Scholar ] [ CrossRef ]
  • Gao, J.; Li, Z.; Zhang, M.; Gao, Y.; Gao, W. Unsupervised Seismic Random Noise Suppression Based on Local Similarity and Replacement Strategy. IEEE Access 2023 , 11 , 48924–48934. [ Google Scholar ] [ CrossRef ]


Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]SVM, QPSO-ANN, WQPSO-ANN, and LWQPSO-ANNNon-temporalPipelineBuried gas pipeline
99 samples
PredictionPipe diameter (mm), operating pressure (MPa), cover depth (m), and crater width (m)Crater widthMap, R , MSE. RMSE, MAPE, and MAELWQPSO-ANNThe proposed method outperformed the other method by more than 95%.
[ ]RF, KNN, and ANNNon-temporalWellsMiddle East fields: for vertical wells
206 samples
PredictionOil gravity (API), well perforation depth (depth (ft), surface temperature (ST (F)), well bottom-hole temperature (BT (F)), flowing gas rate (Qg (Mscf/day)), flowing water rate (Qw (bbl/day)), production tubing internal diameter (ID (inches)), and wellhead pressure (Pwh (psia)).Vertical oil wells’ flowing bottom-hole pressure Pwf (psia)MSE and R ANN
R = 97% (training) and 93% (testing)
The suggested model had a much greater value than the other models.
[ ]ANN, LSB, and BaggingNon-temporalOilOil shale
2600 samples
PredictionAir molar flowrate, illite silica, carbon, hydrogen content, feed preheater temp, and air preheater tempPetroleum output with CO emissionsRMSEANN
RMSE oil yield = 99.6%
RMSE CO = 99.9%
The suggested model’s precision outperformed the performance of the remaining models.
[ ]NB, KNN, DT, RF, SVM, and ANNTemporalOilOcean slick signature
769 samples
ClassificationThe data are confidential.Sea-surface petroleum signaturesAccuracy, sensitivity, specificity, and predictive valuesANN
Accuracy = 90%
The proposed model did not give significant results.
[ ]ANN, SVM, EL, and SVRNon-temporalPipelineThe data are confidential.ClassificationCO , temperature, pH, liquid velocity, pressure, stress, glycol concentration, H S, organic acid, oil type, water chemistry, and hydraulic diameterCorrosion defect depthMSE and R EL, ANN, and SVRThe proposed methods had a low error rate.
[ ]PLS, DNN, FPM, FP-DNN, and FP-PLSNon-temporalPipelineLong-distance pipelines
2093 samples
PredictionMixed oil length, inner diameter, pipeline width, Reynolds number, equivalent length, and actual mixed oil length.Mixed oil lengthRMSEDNN
RMSE = 146%
The error rate is not convincing and is the highest one.
[ ]ANN and GANon-temporalCrude OilASPEN HYSYS
V11 process simulator
PredictionWell, feed flow rate,
the pressure of gas products, interstage gas discharge pressure, isentropic efficiency of centrifugal compressor
Enhance petroleum productionR ANNThe performance of ANN+GA to enhance petroleum production is improved.
[ ]ANNNon-temporalGasThe data are confidential.
104 samples
PredictionSulfur dioxide, methanol, and α-pineneThe removal of gas-phase M, P, and H in an OLP-BTF and a TLP-BTF.R and MSEANN+PSO
R > 99%
The proposed model is good, and the author suggested improving the model with real-world applications.
[ ]ANN, LSSVM, and MGGPTemporalReservoirPrevious experimental and simulation studies
223 samples
PredictionHeight, dip angle, wetting phase viscosity, non-wetting phase viscosity, wetting phase density, non-wetting phase density, matrix porosity, fracture porosity, matrix permeability, fracture permeability, injection rate, production time, and recovery factorGas-assisted gravity drainage (GAGD)R , RMSE, MSE, ARE, and AAREANN
R = 97%
RMSE = 0.0520
The ANN outperformed the proposed method (MGGP = 89% (R ) and 0.0846 (RMSE)).
[ ]GNN and Multivariate Time SeriesTemporalTransformerDGA
1408 samples
ClusteringH , CH , C H , C H , C H , CO, and CO Power transformer fault diagnosisAccuracyMTGNN
Accuracy = 92%
The model was proven to be effective in its application.
[ ]ANN and Multilayer Perceptron with BackpropagationNon-temporalCrude OilRecent literature
172 samples
PredictionPressure (P) [Kpa], temperature (T) [C], liquid viscosity (uL) [c.p.], gas viscosity (uG) [c.p.], liquid molar volume (VL) [m /kmol], gas molar volume (VG) [m /kmol], liquid molecular weight (MWL) [kg/kmol], gas molecular weight (MWG) [kg/kmol], and interfacial tension (o) [Dyne]Diffusion coefficient (D) [m /s]MSE and RMSEMultilayer Perceptron with Backpropagation
R :
Training dataset = 88%
Testing dataset = 89%
The suggested model had low accuracy.
The hybrid model did not improve the model’s accuracy.
[ ]GA with a backpropagation neural networkTemporalCrude oilCrude oil gathering and transportation system
509 samples
PredictionThe inlet temperature of the combined system, outlet temperature of the combined system, inlet pressure of the combined system, outlet pressure of the combined system, inlet and outlet temperature of the transfer station system, inlet and outlet pressure of the transfer station system, inlet and outlet of the oil gathering wellhead system, treatment liquid volume, total power consumption, and total gas consumptionEnergy = 99%
Heat = 99%
Power = 97%
R GA with a backpropagation neural networkThe model provided considerable results.
[ ]MLP and ANNTemporalDrillingEgyptian General Petroleum Corporation (EGPC)
1045 samples
Clustering and classificationEpoch, age, formation, lithology, and fieldsGas channels and chimney predictionRMSPEMLP
RMSE = 0.10
The proposed model had a lower error rate and outperformed the other method.
[ ]ELM, Elastic Net Linear, Linear-SVR, Multivariate Adaptive Regression Spline, Artificial Bee Colony, PSO, Differential Evolution, Simple Genetic Algorithm, GWO, and xNESTemporalShale gasYuDong-Nan shale gas fieldPredictionThe minerals were quartz, calcite, dolomite, barite, pyrite, siderite, clay, and K-feldspar.Total organic carbonR , RMSE, MAE, MAPE, MARE, and WIDE+ELM = 0.497 (RMSE)Acceptable results for hybrid ELM models with the proposed method, except for GWO
[ ]MLP and Radial Basis Function Neural NetworkTemporalReservoirGullfaks in the North SeaPredictionInjection rate for water, gas, and half-cycle time. Downtime.Water alternating gasAverage absolute relative deviation (AARD)MLP-LMAThe proposed model outperformed the other two proxy models and significantly reduced the simulation time.
Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]LSTM and GRUTemporalReservoirMetro Interstate Traffic Volume dataset, Appliances Energy Prediction dataset, and UNISIM-II-M-CO
301 samples
PredictionFluid production (oil, gas, and water), pressure (bottom-hole), and their ratios (water cut, gas–oil ratio, and gas–liquid ratio).Oil production and pressureMAE, RMSE, and SMAPELSTM + Seq2Seq and GRU2 architecturesThe author suggested looking at another metaheuristic method, such as GA.
[ ]DCNN + LSTM, ANN, SVR, LSTM, and RNNTemporalPipelineReal-time pipeline crack
90,000 data samples
PredictionPipeline condition, label, crack size, data length, sampling frequency, and tube pressureNatural gas pipeline crackRMSE, MAPE, MAE, MSE, and SNROptimized DCNN + LSTM
Accuracy = 99.37%
The model showcased impressive performance.
[ ]LSTM, Bi-LSTM, and GRUTemporalWellWest Natuna Basin dataset
11,497 samples
PredictionGR, Vp, LLD, LLS, NPHI, and RHOBWell log data imputationMAE, RMSE, MAPE, and R LSTM
RMSE = 94%
The suggested model provided a greater accuracy.
[ ]KNN, SVM, and XGBoostNon-temporalTransformerDGA local power utilities and IEC TC 10 dataset
1530 samples
ClassificationF7, F10, F17, F18, F19, F21, F24, F34, F36, and F40Transformer faultsAccuracy, precision, and recallKNN + SMOTE
Accuracy:
DGA = 98%
IEC TC 10 = 97%
The proposed model outperformed the other model.
[ ]DL, DT, RF, ANN, and SVRNon-temporalReservoirSorush oil field and oil field in southern Iran
7245 samples
PredictionMeasure choke size (D64), wellhead pressure (Pwh), oil specific gravity (γo), and gas–liquid ratio (GLR).Wellhead choke flow ratesRMSE and R DL
R = 99%
Compared to the other model, the accuracy of the suggested model was greater.
[ ]LSTM and GRUTemporalReservoirsUNISIM-IIH and Volve Oilfield
3257 samples
ClassificationOil, gas, water, or pressureOil and gas forecastingSMAPE and R GRU
R = 99%
The proposed model had the highest accuracy.
[ ]Faster R-CNN_Res50, Faster R-CNN_Res50_DC, Faster-R_CNN_Res50_FPN with Edge Detection, and Cluster+Soft-NMSNon-temporalWellGoogle Earth Imagery
439 samples
ClusteringWidth and heightClustered oil wellsPrecision, recall, F1 score, and APFaster R-CNN with ClusterRPN = 71%The proposed method’s running time was higher than the other models, and its accuracy was less than 90%.
Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]ANFIS, LSSVM-CSA, and Gene Expression ProgrammingNon-temporalOilThe data are confidential.PredictionMixing time (min), MNP dosage (g/L), and oil concentration (ppm)Oil adsorption capacity (mg/g adsorbent)R , MPE, and MAPELSSVM-CSA
R = 99%
The proposed method was outperformed by the other two models.
[ ]ANFIS and ANFIS+PCANon-temporalPipelinePublished studies
[ , , , , ]
217 samples
ClassificationPipe dimension, burst pressure, pipe wall thickness, defect depth, and defect widthPressureRMSE, MAE, and R ANFIS+PCA
R = 99%
The proposed method outperformed other models and significantly improved the model’s accuracy.
[ ]ANN, SVR, and ANFISNon-temporalReservoirCPG’s waterflooding research group at the King Fahd University of Petroleum and Minerals in Saudi Arabia
9000 samples
ClusteringReservoir heterogeneity degree (V), mobility ratio (M), permeability anisotropy ratio (kz/kx), wettability indicator (WI), production water cut (fw), and oil/water density ratio (DR)The effectiveness of moveable oil recovery during a flood (RFM)MAPE, MAE, MSE, and R ANNThe proposed model had a better accuracy than the other models and had lower a runtime and cost.
[ ]RF, Fuzzy C Means, and Control ChartTemporalWell3W dataset
50,000 samples
ClassificationP-PDG, T-PDG, and T-PCK, and grouping of three classes (“normal”, “high fault”, and “high fault”)Failure detection applicationsTotal varianceControl chart + RF
Specificity = 99%
Sensitivity = 100%
The proposed method showed higher sensitivity and specificity.
Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]KNN, DT, RF, NB, AdaBoost, XGBoost, and CatBoostNon-temporalPipelineNational Science Foundation (NSF) Critical Resilient Interdependent Infrastructure Systems and Processes (CRISP)
959 samples
ClassificationPipe diameter, wall thickness, defect depth, defect length, yield strength, ultimate tensile strength, and operating pressureFailure risk pipelinePrecision, recall, and Mean accuracyXGBoost
Accuracy = 85%
The proposed model needs improvement in accuracy.
[ ]LR, RF, SVM, XGBoost, and ANNNon-temporalReservoirWell log data from North China
1500 samples
ClassificationCAL, CNL, AC, GR, PE, RD, RMLL, RS, SP, DEN, DTS, and SPShear wave travel time (DTS)R XGBoost
R = 99% (training) and 96% (testing)
The best model was significant.
[ ]ELM, SVM, KNN, DT, RF, and ELTemporalTransformerDGA
542 samples
ClassificationC H , C H , CH , and H Power transformer faultsMean accuracyEN
Accuracy = 78% (Training) and 84% (Testing)
The proposed model’s performance accuracy was not above 90%.
[ ]DT, LDA, GB, Ensemble Tree, LGBM, RF, KNN, NB, LR, QDA, Ridge, and SVM-LinearNon-temporalTransformerDGA
3147 samples
ClassificationC H , C H , C H , and CH Transformer faultsAccuracy, AUC, recall, precision, F1 score, Kappa, MCC, and Processing runtimeQDA
Accuracy = 99.29%
The proposed method had the
best accuracy classifier model.
[ ]DTTemporalWellKG composition
180 samples
ClassificationKG, including hydrogen (H ), methane (CH ), ethane (C H ), ethylene (C H ), and acetylene (C H )Incipient faults in transformer oil.Accuracy and AUCDT
Accuracy = 62.9%
The current model exhibited potential, and we recommend exploring opportunities for refinement to enhance its overall efficacy.
[ ]LR, DT, RF, KNN, SMOTE, XAI, SHAP, and LIMENon-temporalWell3W
1984 samples
ClassificationP-PDG, P-TPT, T-TPT, P-MON- PCK, T-JUS, PCK, P-JUS- CKGL, T-JUS- CKGL, and QGLDetect anomalies in oil wellsAccuracy, recall, precision, F1 score, and AUCRF
Accuracy = 99.6%, recall = 99.64%, precision = 99.91%, F1 score = 99.77%, and AUC = 1.00%.
The result of the proposed model was significant.
[ ]LDA, QDA, Linear SVC, LR, DT, RF, and AdaboostTemporal Well3W dataset
2000 samples
ClassificationP-PDG, P-TPT, T-TPT, P-MON-CKP, and T-JUS-CKPUndesirable eventsF1 score and accuracyDT
Accuracy = 97%
The feature selection did not boost accuracy, and training time was increased with feature selection. The proposed method struggled with class 2 due to limited data and mismatched labels from calculated features.
[ ]DT, ANN, SVM. LR, KNN, and NBTemporalPipelineExternal defects of pipelines in the United States
7000 samples
ClassificationConsider the defect’s length, breadth, and pipeline’s nominal thickness.Classification for pipeline corrosionAccuracyDT
Accuracy = 99.9%
The accuracy of the model was significant to the research.
[ ]LGBM, CatBoost, XGBoost, RF, and NNTemporalCrude oilWTI crude oil
2687 samples
ClassificationGold, silver, crude oil, platinum, copper, the dollar index, the volatility index, and the Euro Bitcoin: Green Energy Resources ESG.Oil pricesAccuracy and AUCLGBM and RFThe proposed method indicated superiority over traditional methods.
[ ]GB, RF, and MLRNon-temporalReservoirShale gas reservoirs
1400 samples
PredictionHorizontal wellbore length, hydraulic fracture length, reservoir length, SRV fracture porosity, permeability, spacing, pressure, and total production time.CO MSERFThe best method surpassed the other method in ML.
[ ]RF, ANN, and FNTemporalDrillingReal time Well-1 data
8983 samples
ClassificationStandpipe pressure (SPP), weight on bit (WOB), rotary speed (RS), flow rate (Q), hook load (HL), rate of penetration (ROP), and rotary speed (RS)Torque and drag (T&D)R and AAPERFThe proposed model had higher accuracy than the other two models.
[ ]RFTemporalReservoir2D simulation in STARS
240 samples
PredictionFormation compressibility, volumetric heat capacity, rock, water, oil, and thermal conductivityShale barrierR and RMSERFThe author suggested incorporating more training data and features to improve the proposed method.
[ ]RF, XGBoost, SVM, and LGBMNon-temporalPipelineFull-scale corroded O&G pipelines
314 samples
PredictionDepth, length, and width of corrosion defects, wall thickness, pipe diameter, steel grade, and burst pressureBurst pressure of gas and oil corroded pipelinesR , RMSE, MAE, and MAPEXGBoost
R = 99% (training) and 98% (testing)
The hybrid proposed model had significantly higher prediction accuracy.
[ ]XGBoost, SVM, and NNNon-temporalPipelineOLGA data and PIG data
1700 samples
ClassificationGeometrical parameters: start of odometry, end of odometry. Latitude, longitude, elevation, and the length of bar. Water volumetric flow rate, continuous velocity, water film shear stress, hold-up, flow regime, pressure, total mass, and volumetric flow rate inclination, temperature, section area, gas mass and volumetric flow rates, gas velocity, wall shear stress, total water mass and flow rate (including vapor),Internal corrosion in pipeline infrastructuresMean accuracy and F1 scoreXGBoost
Accuracy = 62%
The proposed model needs improvement in accuracy.
[ ]RF and CatBoostNon-temporalPipelineCrude oil dataset
3240 samples
PredictionStream composition (NO , NH S, and NCO ), pressure (P), velocity (v), and temperature (T)Corrosion ratesR , MSE, MAE, and RMSECatBoost
Accuracy = 99.9% (training and testing)
The proposed model’s accuracy outperformed the other models.
[ ]RF and KNNTemporalTransformerDGA
11,400 samples
ClassificationAcetylene (CC HH ), ethylene (CC HH ), ethane (CC HH ), methane (CCHH ), and hydrogen (HH )Identify transformer fault typesMean accuracyKNN
Accuracy = 88%
The proposed model needs an improvement in accuracy.
[ ]XGBoost, CatBoost, LGBM, RF, deep MLN, DBN, and CNNNon-temporalCrude oilPrevious studies on CO –oil MMP databank
310 samples
ClassificationCrude oil fractions (N , C , H S, CO , and C -C ), average critical injection gas temperature (Tcave), reservoir temperature (Tres), and molecular weight of C5+ fraction (MWc5+)Estimating the MMP of CO –crude oil systemARD, AARD, RMSE, MPa, and SDCatBoost
R = 99%
The proposed model confirmed its superiority over other models.
[ ]DF + K-means, RF, SVM, DNN, and DFNon-temporalLithologyLithology dataset from the Pearl River Mouth Basin
601 samples
ClassificationSandstone (S00), siltstone (S06), grey siltstone (S37), mudstone (N00), sandy mudstone (N01), and limestone (H00).Lithology identificationPrecision, recall, and FβDF + K-means
Accuracy = 90%
The baseline method had poor prediction of the minority class, small-amount data label, error labeling, and noisy data.
[ ]GSK- XGBoostTemporalTransformerDGA
128 samples
ClassificationAmmonia, acetaldehyde, acetone, ethylene, ethanol, and tolueneEthanol, ethylene, ammonia, acetaldehyde, acetone, and tolueneAccuracy, precision, recall, F-measure, and beta-factor GSK- XGBoost
Mean accuracy = 50%
The accuracy of the GSK-XGBoost model fell below 90% after employing the developed strategy, while computational time increased.
[ ]LGBM, XGBoost, RF, LR, SVM, NB, KNN, and DTNon-temporalTransformerDGA
796 samples
ClassificationH2, CH4, C2H2, C2H4, and C2H6Fault type classificationAccuracy, precision, recall, and F1 scoreLGBM
Accuracy = 87.06%
The model demonstrated a high level of competence.
[ ]Adaboost, RF, KNN, NB, MLP, and SVMNon-temporalDrillingDrill bit type in Norwegian wells
4312 samples
ClassificationParameter used:
Depth as measured, vertical true depth, penetration rate, bit weight, minutes per round, torque, standpipe pressure, mud mass, flow rate, total gas, bit kind, bot quantity, D-exponent, area of total flow, specific mechanical energy, cut depth, and aggressiveness of drill bit.
Drill bit selectionAccuracy, precision, F1 score, recall, MCC, and G-meanRF
Accuracy = 97% (training) and 91% (testing)
The proposed method was more reliable, stable, and accurate than
previous models.
[ ]RFTemporalWell3W
1984 samples
ClassificationP-PDG, P-TPT, P-PCK, T-PCK, P-JUS-CKGL, T-JUS-CKGL, and gas lift flowEarly fault detectionAccuracy, faulty-normal accuracy (FNACC), real faulty-normal accuracy (RFNACC)RF
Accuracy = 94%
The proposed method had good detection of the early fault.
[ ]One-Directional, CNN, RF, GNN, and QDATemporalWell3W
1984 samples
ClassificationP-PDG, T-TPT, P-MON-CKP, T-JUS-CKP, P-JUS-CKGL, and QGLAnomalous events in oilAccuracy, precision, recall, and F1 scoreRF
Mean accuracy = 95%
The time windows increased.
[ ]RF and PCATemporalWell3W
1984 samples
ClassificationP-PDG, P-TPT, T-TPT, P-MON-CKP, and T-PCKAnomalous events in oil wellsAccuracyRF+PCA
Accuracy = 90%
The proposed method’s accuracy > 95% for all the classes.
[ ]SVM, LOF, and RFTemporalReservoirWell log data
37 samples
ClusteringDepth, gamma ray, shallow resistivity, deep resistivity, neutron, density, CALI, and DTSSonic (DTC)R K-Means+RF
R = 0.92 to R = 0.98
The proposed hybrid approach outperformed several baseline methods.
[ ]RFTemporalWellField and well dataset from public dataset U.S. well
934 samples
ClusteringAPI, On-stream date, Surface latitude and longitude, formation thickness, TVD, lateral length, total proppant mass, total injected fluid volume, API gravity, porosity, permeability, TOC, VClay, oil production rate, gas production rate, water production rate, GPI, and frac fluidBarrel of oil equivalent (BOE)RMSE and R RF
RMSE:
Train = 7.25%
Test = 17.49%
The proposed method needs improvement in accuracy.
The RF model was overfitting, and the accuracy of the proposed method must be improved.
[ ]RF with Analog-to-digital convertersNon-temporalWellWell logging dataset
100 samples
ClusteringNeutron (CNL), gamma ray (GR), density (DEN), and compressional slowness (DTC)Well logging data generationRMSE, MAE, MAPE, and MSERF with analog-to-digital converters
RMSE = 9%, MAE = 6%, MAPE = 0.031%, and MSE = 86%
The proposed model needs improvement in accuracy for clustering.
[ ]RFTemporalTransformerDPM1 and DPM2 for DGA
2123 samples
ClassificationH2 (hydrogen), CH4 (methane), C2H2 (acetylene), C2H4 (ethylene), C2H6 (ethane), CO (carbon monoxide), CO2 (carbon dioxide), O2 (oxygen), and N2 (nitrogen)Transformer fault diagnosisAccuracyRF
Accuracy:
DPM1 = 96.2%
DPM2 = 96.5%
For the evaluation dataset, the suggested models diagnosed errors with a satisfactory level of performance.
[ ]KNN, Multilayer Perceptron Neural Network, multiclass SVM, and XGBoostTemporalPipelineClimate change data
81 samples
ClassificationLocation, time, pipeline age, pipeline material, temperature, humidity, and wind speed.Gas pipelineAccuracy, precision, recall, and F1 scoreXGBOOST
Accuracy = 92%
The model outperformed other models; however, it needs improvement.
[ ]LogitBoost, GBM, XGBoost, AdaBoost, and KNNTemporalWellLithofacies and well log dataset
399 samples
ClassificationGR, CALI, NEU, DT, DEN, RES DEP, RES SLW, PHIT, and SWLithofacies predictionsTotal Percent Correct (TPC) is an accuracy measureXGBoost
TPC = 97%
The model gave significant results for the proposed method.
[ ]Recursive feature elimination and particle swarm optimization-AdaBoostNon-temporalPipelineChangshou-Fuling-Wulong-Nanchuan (CN) gas pipeline dataset
3986 samples
ClusteringLandslide susceptibility area, percentage, and historical landslidesLong-distance pipelinesAccuracy, sensitivity, precision, and F1 scoreRecursive feature elimination and particle swarm optimization-AdaBoost
Accuracy = 90% (training) and 83% (testing)
The proposed model needs improvement in accuracy.
[ ]LSTM, AdaBoost, LR, SVR, DNN, RF, and adaptive RFTemporalCrude oilUnited states’ Energy Information Administration
Brent COP data
PredictionShape, location, and scaleCrude oil price (COP)MAPE, MSE, RMSE, MAE, and EVSAdaptive RF
MAPE = 112.31%; MAE = 52%; MSE = 53%; RMSE =73%; R = 99%; and EVS = 99%
The proposed model outperformed the others; however, the running time was higher than those of the other models.
[ ]RF and DTTemporalDrillingThe data are confidential.PredictionWOB, torque, standpipe pressure, drill string rotation speed, rate of penetration, and pump rateRock porosityR , AAPE, and VAFRF
Accuracy = 99% (training) and 90% (testing)
The model stood out for its exceptional performance.
[ ]BayesOpt-XGBoost, and XGBoostNon-temporalReservoirEquinor Volve Field datasets
2853 samples
ClassificationDT, GR, NPHI, RT, and RHOBVshale, porosity, horizontal permeability (KLOGH), and water saturationRMSE and MAEBayesOpt-XGBoost
Accuracy = 93%, precision score = 98%, recall score = 86%, and combined F1 score = 93%
The proposed method was not robust enough to predict all the output.
[ ]RF, KNN, NB, DT, and NNTemporalTransformerNew O&G decommissioning dataset from GitHub
1846 samples
ClassificationDimensions, circumference, length, metal, plastic, concrete, residues, environmental expenses, and weight Predictive decommissioning optionsRecall, precision, F1 score, and AUCRF
Accuracy: Full features = 80.06%
Redundant removed = 80.66%
The proposed method needs improvement.
Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]MLR, SVR, and GPRNon-temporalGasM6COND and M6GAS
129 samples
ClusteringCondensate–gas ratio, total horizontal lateral length, gas saturation, total organic carbon content, cluster and stage counts, proppant amount, fluid volume, and total horizontal lateral lengthGas wellRMSE and R GPRThe proposed method needs improvement in accuracy.
[ ]XGBoost, ANN, RNN, MLR, PLR, SVR, DTR, and RFRTemporalO&G productionSaudi Aramco of five well reservoirs
1,968 samples
ClassificationLocation, contact, average permeability, volume, production, pressure ratio between the wellhead and bottom-hole, and productionOil, gas, and waterR , MAE, MSE, and RMSERNN
R :
Oil = 98%
Gas = 87%
Water = 92%
The proposed model needs improvement in output.
[ ]MLP, RF, and SVRNon-temporalPipelineHistory record of pipeline failure
149,940 samples
ClassificationEffects of transportation disruptions on safety and health, the environment and ecology, and equipment maintenanceNatural gas pipeline failureRMSE, MAE, MSE, and R RFThe proposed methods had the shortest computing times and best-fitting results.
[ ]SVMNon-temporalReservoirMMP data
147 samples
ClassificationReservoir temperature, oil composition, and gas compositionMinimum miscibility pressure of CO and crude oilMSESVM-POLY kernelThe proposed model’s accuracy outperformed the other models.
[ ]RF, ARN, LSTM, Independently Recurrent Neural Network, component-wise gradientTemporalWell3W
1984 samples
ClassificationP-PDG, T-TPT, P-TPT, Initial Normal, Steady-state, and transientOil well productionAccuracy, precision, recall, F scoreARN
Accuracy = 96%
Precision = 88%
Recall = 84%
F-measure = 85%
The proposed model was not robust due to misclassifications for undesirable events for type 3 and type 8.
[ ]SVR-GA-PSO, SVR, SVR-GA, SVR-FA, SVR-PSO, SVR-ABC, SVR-BAT, SVR-COA, SVR-GWO, SVR-HAS, SVR-ICA, and SVR-SFLATemporalPipelineIranian oil fields
340 samples
ClassificationOnshore oil and gas pipelines: pit depths, exposure times, pitting start times, operational pressures, temperatures, water cuts, redox potentials, resistivities, pH, concentrations of sulfate and chloride ions, and production ratesCarbon steel corrosion rateMSE, RMSE, MAE, EVS, R , and RSESVR-GA-PSO
R = 99%
RMSE = 0.0099
MSE = 9.84 × 10
MAE = 0.008
RSE = 0.001
EVS = 0.955
The proposed model showed a better result than the other ones.
[ ]BLR, PBBLR, ANN, and Gradient Boosting DTNon-temporalPipelineSCADA (Supervisory Control and Data Acquisition) system
728 samples
PredictionDiameter, Reynolds number, transportation distance, and mixed oil lengthActual mixed oil lengthRMSE, MAE, and R PBBLRThe PBBLR method needs improvement on the accuracy of using SCADA dataset to predict actual mixed oil length
Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]SARIMA, LSTM, and ARTemporalTransformerDGA
610 samples
PredictionH , CH , C H , C H , CO, CO , and total hydrocarbon (TH).Dissolved gas concentrationARESARIMAThe SARIMA method had a good average accuracy
[ ]LSTM and ARIMATemporalWellsLongmaxi Formation of the Sichuan Basin
3650 samples
PredictionDate and daily productionShale gas productionMAE, RMSE, and R LSTM
Accuracy = 0.63%
The accuracy of the model needs improvement.
[ ]GM, FGM, DGGM, ARIMA, PSOGM, and PSO-FDGGMTemporalGasQuarterly production of natural gas in ChinaPredictionTraining period and natural gas productionNatural gas productionMAPEPSO-FDGGM
MAPE = 3.19%
The model’s performance was noteworthy and reliable.
Reference | Models | Temporality | Field | Dataset | Class | Input Parameter | Output Parameter | Performance Metrics | Best Model | Advantages/Disadvantages
[ ]Multivariate Empirical Mode Decomposition with Genetic Algorithm, LSSVM-GA, and LSSVM-PSONon-temporalCrude oilsBubble point pressure and oil formation volume factor
638 samples
ClusteringTemperature (T), oil gravity (API), gas specific gravity (γg), and ratio of gas oil solutionBubble point pressure and oil formation volume factor of crude oilsRMSEMELM-PSOThe hybrid proposed model outperformed the empirical method.
[ ]PCA, SVM, and LDATemporalOilReal-time oil samples
30 samples
ClassificationPore size remained the same. The capillary flow rate (l2/t) was a function of interfacial properties (γLG and θ) and viscosity (μ).Oil typesAccuracySVM
Accuracy = 90%
The proposed model needs improvement in accuracy because the accuracy < 95%.
[ ]MLP-PSO and MLP-GANon-temporalWell logThree wellbores drilled
22,323 samples
PredictionWell depth, compressional wave velocity (Vp), shear wave velocity (Vs), bulk density (ρ), and pressure pore (Pp),Probable depth of casing collapseR and RMSEMLP-PSOThe proposed model outperformed the other models’ accuracy.
[ ]LSSVM-COA, LSSVM-PSO, LSSVM-GA, MLP-COA, MLP-PSO, MLP-GA, LSSVM, and MLPNon-temporalDrilling305 drilled wells in the Marun oil field
2820 samples
PredictionNorthing, easting, depth, meterage, formation type, hole size, WOB, flow rate, MW, MFVIS, retort solid, pore pressure, drilling time, fracture pressure, fan 600/fan 300, gel10min/gel10s, pump pressure, and RPMSeverity of mud lossR and RMSEMLP-GA
RMSE = 93%
The accuracy of the proposed model can be improved.
[ ]Hybrid-Physics Guided-Variational Bayesian Spatial-Temporal neural networkTemporalGasNatural gas
600 samples
PredictionSize of geometry, release point position, release diameter, released gas, volumetric release rate, length of release, and sensor locationNatural gas concentrationR Hybrid_PG_VBSTnn
R = 99%
The proposed integration enhanced the spatiotemporal forecasting performance.
[ ]CNN, Linear SVM, Gaussian SVM, and SVM+CNNTemporalGasLeakage dataset
1000 samples
ClassificationMethane, ethane, propane, isobutane, butane, helium, nitrogen, hydrogen sulfide, carbon dioxideGas pipeline leakage estimationAccuracySVM
Accuracy = 95.5%
The model stood out for its exceptional performance.
[ ]LSTM and OCSVMTemporalWell3W
1984 samples
ClassificationP-PDG, P-TPT, T-TPT, P-MON-CKP, and T-JUS-CKPIdentify two types of faultsRecall, specificity, and accuracyOCSVM
Accuracy = 91%
The use of feature selection did not improve the classifier accuracy. The proposed model was not robust enough to classify 2 types of wells.
[ ]Ordered Nearest Neighbors, Weighted Nearest Neighbors, LDA, and QDATemporalWell3W
1984 samples
ClassificationP-PDG, P-TPT, T-TPT, P-MON-CKP, T-JUS-CKP, and CLASSPredicting flow instabilityRecall, specificity, and accuracyONN
Accuracy = 81%
The author suggested investigating another metaheuristic method.
[ ]CNN, SVM, and SVM+CNNTemporalPipelineLeakage dataset
1000 samples
PredictionLength, outer diameter, wall thickness, and location in the modelPrediction in tight sandstone reservoirsAccuracySVM+CNN model, achieved 95.5%The SVM+CNN model outperformed the CNN and SVM
[ ]DT and SVMNon-temporalReservoirHigh-resolution FMI dataClassificationResponse of logging, pyroclastic lava, normal pyroclastic rock, and sedimentary pyroclastic rockLithologic classification of pyroclastic rocksAccuracySVM
Accuracy = 98.6%
The SVM accuracy was higher than 95% which is 98.6%
[ ]BAE-OCSVM, CAE-OCSVM, LSTM-AE- OCSVM, RD-OCSVM, RF-OCSVM, PCA-OCSVM, VAE-OCSVM, and LSTM-AE-IFTemporalGasData from SCADA
9980 samples
ClassificationDiameter, wall thickness, and lengthLeakage of natural gasAUC, accuracy, F1 score, precision, TPR, and FPRLSTM-AE-OCSVM
Accuracy = 98%
The best model achieved higher accuracy, and the author suggested using abnormal data for future work.
[ ]LSTM and GRUTemporalReservoirsUNISIM-IIH and Volve oilfield
3257 samples
ClassificationOil, gas, water, or pressureOil and gas forecastingSMAPE and R GRU
R = 99%
The proposed model had the highest accuracy.
[ ]OCSVM, LOF, Elliptical Envelope, and Autoencoder withfeedforward+LSTMTemporalWell3W
1984 samples
ClassificationP-PDG, P-TPT, T-TPT, P-MON-CKP, T-JUS-CKP, P-JUS-CKGL, T-JUS-CKGL, QGL, and Label vectorFault detectionF1 scoreLOF
F1 score = 85%
The proposed method needs improvement in accuracy.
[ ]K-Means Clustering and KNNTemporalReservoirsAntrim, Barnett, Eager Ford, Woodford, Fayetteville, Haynesville, and Marcellus
55,623 samples
ClusteringWell location, well depth, well length, and production starting yearEUR predictionsR K-MC
R = 0.18
The proposed model outperformed the other models using average fitting parameters.
[ ]GS-GMDHNon-temporalWellOil fields located in the Middle East
2748 samples
PredictionLaterolog (LLS), photoelectric index (PEF), compressional wave velocity (Vp), porosity (NPHI), gamma ray (spectral) (SGR), density (RHOB), gamma ray (corrected) (CGR), shear wave velocity (Vs), caliper (CALI), resistivity (ILD), and sonic transit time (DT)Pore pressureRMSE, R2, MSE, SI, and ENSGS-GMDH
RMSE = 1.88 psi and R = 0.9997
GS-GMDH had the best accuracy.
[ ]RF, Gradient Boosting Regressor, Bagging, CNN, KNN, and Deep Hierarchical DecompositionTemporalReservoirGeological data
180 samples
ClassificationPorosity, fracture porosity, fracture permeability, rocky type, net gross, matrix permeability, water relative permeability, formation volume factor, rock compressibility, pressure dependence of water viscosity, gas density, water density, vertical continuity, relative permeability curves, oil–water contact, and fluid viscosityOil production, water production, water injection, and liquid productionMAE and SMAPEDeep Hierarchical Decomposition
MAE:
OP = 0.76%
The proposed method decreased the computational speed.
[ ]M5P tree model, RF, Random Tree, Reduced Error Pruning Tree, GPR, SVM, and MARSNon-temporalGasCoriolis flow meter
201 samples
ClassificationWet gas flow rate (kg/h) and absolute gas humidity (g/m³)Estimation of the dry gas flow rate (kg/h)RMSE, MAE, LMI, and WIGPR-RBKF
MAE = 163.3266 kg/h, RMSE = 483.1359 kg/h, CC = 0.9915 for the dataset used for testing
The best model was superior to the other models, and the author suggested exploring other soft-computing methods.
Input Parameter of Undesirable Well Events[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
P-PDG
P-TPT
T-TPT
P-MON-CKP
T-JUS-CKP
T-JUS-CKGL
P-JUS-CKGL
P-CKGL
QGL
T-PDG
T-PCK
Input Parameter of Internal Transformer Defects[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
Acetylene (C2H2)
Ethylene (C2H4)
Ethane (C2H6)
Methane (CH4)
Hydrogen (H2)
Total Hydrocarbon (TH)
Carbon Monoxide (CO)
Carbon Dioxide (CO2)
Ammonia (NH3)
Acetaldehyde (CH3CHO)
Acetone (CH3COCH3)
Nitrogen (N2)
Ethanol (CH3CH2OH)
Input Parameter of Well Logging[ ][ ][ ][ ][ ][ ]
Gamma Ray (GR)
Sonic (Vp)
Deep and Shallow Resistivities (LLD and LLS)
Neuro-porosity (NPHI)
Density (RHOB)
Caliper (CALI)
Neutron (NEU)
Sonic Transit Time (DT)
Bulk Density (DEN)
Deep Resistivity (RD)
True Resistivity (RT)
Shallow Resistivity (RES SLW)
Total Porosity (PHIT)
Water Saturation (SW)
Compressional Slowness (DTC)
Depth
Summary of ML methods, model variants, and reported model performance (%):

Artificial Neural Network: LWQPSO-ANN (95), ANN (93), ANN (99.6), ANN (90), DNN (146), ANN+PSO (99), ANN (97), MTGNN (92), Multilayer Perceptron Backpropagation (89), GA backpropagation neural network (97), MLP (10), DE+ELM (49.7)

Deep Learning: DCNN+LSTM (99.37), LSTM (94), KNN+SMOTE (98), DL (99), GRU (99), Faster R-CNN+ClusterRPN (71)

Fuzzy Logic and Neuro-fuzzy: LSSVM+CSA (99), ANFIS+PCA (99), Control Chart+RF (99)

Decision Tree, Random Forest, and Hybrid: XGBoost (85), XGBoost (96), EL (84), QDA (99.29), DT (62.9), RF (99.6), DT (97), DT (99.9), XGBoost (62), CatBoost (99.9), KNN (88), CatBoost (99), DF+K-Means (90), GSK+XGBoost (50), LGBM (87.06), RF (91), RF (94), RF (95), RF+PCA (90), K-Means+RF (98), RF (17.49), RF+Analog-to-digital converters (9), RF (96), XGBoost (92), XGBoost (97), Recursive feature elimination+PSO+AdaBoost (83), Adaptive RF (73), RF (90), BayesOpt+XGBoost (93), RF (80.06)

Interrelated AI: RNN (98), ARN (96), SVR+GA+PSO (99)

Statistical model: ARIMA (63)

ML models utilized for predictive analytics in the O&G field: SVM (90), MLP+GA (93), Hybrid-Physics Guided-Variational Bayesian Spatial-Temporal Neural Network (99), SVM (95.5), OCSVM (91), ONN (81), SVM+CNN (95.5), SVM (98.6), LSTM+AE+OCSVM (98), GRU (99), LOF (85), K+MC (18), Deep Hierarchical Decomposition (76)
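Most of the studies tabulated above share the same tabular-learning recipe: assemble numeric features (gas concentrations, well-log curves, pipeline geometry), fit a tree ensemble, and report accuracy or error metrics on a held-out split. The following minimal sketch illustrates that recipe for a hypothetical transformer fault-type classifier on DGA-style features; the feature names, synthetic data, labelling rule, and hyperparameters are illustrative assumptions and are not taken from any of the cited studies.

```python
# Minimal sketch of the tabular workflow common to the studies above:
# features -> train/test split -> tree ensemble -> accuracy report.
# The DGA feature names and synthetic data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "H2": rng.lognormal(3.0, 1.0, n),      # hydrogen, ppm
    "CH4": rng.lognormal(2.5, 1.0, n),     # methane, ppm
    "C2H2": rng.lognormal(1.0, 1.2, n),    # acetylene, ppm
    "C2H4": rng.lognormal(2.0, 1.0, n),    # ethylene, ppm
    "C2H6": rng.lognormal(2.0, 1.0, n),    # ethane, ppm
})
# Toy label rule standing in for expert-labelled fault types
# (0 = normal, 1 = thermal fault, 2 = discharge fault).
y = np.select(
    [X["C2H2"] > X["C2H4"], X["C2H4"] > 2 * X["CH4"]],
    [2, 1],
    default=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```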
Source: Azmi, P.A.; Yusoff, M.; Mohd Sallehud-din, M.T. A Review of Predictive Analytics Models in the Oil and Gas Industries. Sensors 2024, 24, 4013. https://doi.org/10.3390/s24124013


Data science and big data analytics: a systematic review of methodologies used in the supply chain and logistics research

  • Original Research
  • Open access
  • Published: 11 July 2023


  • Hamed Jahani   ORCID: orcid.org/0000-0002-7091-6060 1 ,
  • Richa Jain   ORCID: orcid.org/0000-0002-8307-2442 2 &
  • Dmitry Ivanov   ORCID: orcid.org/0000-0003-4932-9627 3  


Data science and big data analytics (DS &BDA) methodologies and tools are used extensively in supply chains and logistics (SC &L). However, the existing insights are scattered over different literature sources and there is a lack of a structured and unbiased review methodology to systematise DS &BDA application areas in the SC &L comprehensively covering efficiency, resilience and sustainability paradigms. In this study, we first propose an unique systematic review methodology for the field of DS &BDA in SC &L. Second, we use the methodology proposed for a systematic literature review on DS &BDA techniques in the SC &L fields aiming at classifying the existing DS &BDA models/techniques employed, structuring their practical application areas, identifying the research gaps and potential future research directions. We analyse 364 publications which use a variety of DS &BDA-driven modelling methods for SC &L processes across different decision-making levels. Our analysis is triangulated across efficiency, resilience, and sustainability perspectives. The developed review methodology and proposed novel classifications and categorisations can be used by researchers and practitioners alike for a structured analysis and applications of DS &BDA in SC &L.


1 Introduction and background

In supply chains (SCs), large data sets are available through multiple sources such as enterprise resource planning (ERP) systems, logistics service providers, sales, supplier collaboration platforms, digital manufacturing, Blockchain, sensors, and customer buying patterns (Li et al., 2020b ; Rai et al., 2021 ; Li et al., 2022a ). Such data can be structured, semi-structured, and unstructured. Big data analytics (BDA) can be used to create knowledge from data to improve SC performance and decision-making capabilities. While BDA offers substantial opportunities for value creation, it also presents significant challenges for organisations (Chen et al., 2014 ; Choi et al., 2018 ).

Compared to BDA that deal with collecting, storing, and analysing data, data science (DS) focuses on more complex data analytics. In particular, predictive analytics such as machine learning and deep learning algorithms are considered. From the methodological perspective, DS &BDA contribute to decision-making at strategic, tactical, and operational levels of SC management. Organisations can use DS &BDA capabilities to achieve competitive advantage in the markets (Kamley et al., 2016 ). DS &BDA techniques also help organisations improve their SC design and management by reducing costs, increasing sustainability, mitigating risk and improving resilience (Baryannis et al., 2019b ), understanding customer demands, and predicting market trends (Potočnik et al., 2019 ).

Along with methodological advancements, a progress in the DS &BDA tools can be observed. SC analytics software help researchers and practitioners alike to develop better forecasting, optimization, and simulation models (Analytics, 2020 ). These tools can also extract data and produce advanced visualizations. Along with the large corporations such as SAP®, IBM, and Oracle, there are also specific SC software such as anyLogistix™ and LLamasoft™, that allow to integrate simulation and network design with SC operations data to build digital SC twins (Ivanov, 2021b ; Burgos & Ivanov, 2021 ). The advanced methodical and software developments result in growing opportunities for SC researchers and practitioners. However, the existing insights are scattered over different literature sources and there is a lack of a structured review on the DS &BDA application areas in SC and logistics (SC &L) areas, comprehensively covering efficiency, resilience and sustainability paradigms which encouraged us to conduct this systematic and comprehensive literature review. In the next section, we elaborate in detail on our motivation for this study.

1.1 Motivation of the study

Google Trends for “Data Science” and “Big Data” have exhibited continuously increasing interest in the DS &BDA in SC &L field over the last 19 years, while interest in SCs has remained steadily high (see Fig. 1). However, interest in BDA started increasing earlier than interest in DS. We can also observe a recent convergence in the trends for “Big Data” and “Data Science”.

Figure 1. Trends of interest in the topics of this research (2004–2022)

From an academic point-of-view, various literature review studies have recently indicated the benefits of using DS &BDA in SC &L management (Pournader et al., 2021 ; Riahi et al., 2021 ; Novais et al., 2019 ; Neilson et al., 2019 ; Ameri Sianaki et al., 2019 ; Baryannis et al., 2019b ; Choi et al., 2018 ; Govindan et al., 2018 ; Mishra et al., 2018 ; Arunachalam et al., 2018 ; Tiwari et al., 2018 ). Table  3 demonstrates the latest literature review publications in line with DS &BDA and affirms that although several review papers can be found around this topic, the reviews only explore SC &L from the specific viewpoint of BDA. Kotu and Deshpande ( 2018 ) concede that although the concept of big data is worthy of being explored separately, a holistic view on all aspects of data science with consideration of big data is of utmost importance and still needs to be researched in several areas such as SCs. Our investigation also shows that studies including BDA mostly discuss architecture and tools for BDA, but lack a contextualisation in the general data science methodologies. Waller and Fawcett ( 2013 ) affirm that along with the importance of data analysis in SCs, other issues related to data science are important in the SC, such as “data generation”, “data acquisition”, “data storage methods”, “fundamental technologies”, and “data-driven applications”, which are not necessarily connected to BDA.

The growing number of studies in DS &BDA and SC &L substantiates the need to adopt systematic approaches to aggregating and assessing research results to provide an objective and unbiased summary of research evidence. A systematic literature review is a procedural aggregation of precise outcomes of research. We explored several survey studies around our topic, shown in Table  3 , to understand how researchers employ a systematic approach for their review process. Our general observation from analysis of the literature is that the existing surveys mostly focus on the BDA while missing a detailed analysis of DS and intersections of BDA and DS - a distinct and substantial contribution made by our study (Grover & Kar, 2017 ; Brinch, 2018 ; Nguyen et al., 2018 ; Kamble & Gunasekaran, 2020 ; Neilson et al., 2019 ; Talwar et al., 2021 ; Maheshwari et al., 2021 ). For instance, Maheshwari et al. ( 2021 ) conduct a systematic review for finding the role of BDA in SCM, but only select the keywords “Big data analytics” with “Supply chain management” or “Logistics management” or “Inventory management” which definitely miss many relevant studies using DS applications with big data.

1.2 Basic terminologies

Since several terms are used in the area of DS &BDA, we introduce here some of the main terminologies in the domain of our research.

Data science is a knowledge-based field of study that provides not only predictive and statistical tools for decision-makers, but also an effective solution that can help manage organisations from a data-driven perspective. DS requires integration of different skills such as statistics, machine learning, predictive analyses, data-driven techniques, and computer sciences (Kotu & Deshpande, 2018 ; Waller & Fawcett, 2013 ).

Big data includes the mass of structured or unstructured data and has been commonly characterised in the literature by 6Vs, i.e., “volume” (high-volume data), “variety” (a great variety of formats and sources), “velocity” (rapid growth in generation), “veracity” (quality, trust, and importance of data), “variability” (statistical variation in the data parameters) and “value” (huge economic benefits from low-data density) (Mishra et al., 2018 ; Chen et al., 2014 ).

Predictive analytics project the future of a SC by investigating its data and employing mathematical and forecasting models (Kotu & Deshpande, 2018 ).

Prescriptive analytics employs optimisation, simulation, and decision-making mechanisms to enhance business performance (Kotu & Deshpande, 2018 ; Chen et al., 2022 ).

Diagnostic analytics is a financial-analytical approach that aims to discover the causes of events and behaviours (Xu & Li, 2016 ; Windt & Hütt, 2011 ).

Descriptive analytics aims to analyse problems and provide historical analytics regarding the organisation’s processes by applying some techniques such as data mining, data aggregation, online analytical processing (OLAP), or business intelligence (BI) (Kotu & Deshpande, 2018 ).
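As a deliberately simplified illustration of the contrast between the descriptive and predictive analytics types defined above, the sketch below first summarises a hypothetical weekly-demand series (descriptive) and then fits a simple trend model to project it forward (predictive); all column names and numbers are assumptions for illustration only.

```python
# Hypothetical illustration of descriptive vs. predictive analytics
# on a toy weekly-demand table (all names and numbers are assumed).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
weeks = np.arange(1, 105)
demand = 500 + 3 * weeks + 40 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 25, weeks.size)
df = pd.DataFrame({"week": weeks, "demand": demand})

# Descriptive analytics: summarise what has already happened.
print(df["demand"].describe())

# Predictive analytics: fit a simple trend model and project ahead.
model = LinearRegression().fit(df[["week"]], df["demand"])
future = pd.DataFrame({"week": np.arange(105, 117)})
print(model.predict(future))
```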

The remainder of this paper is organised as follows: Sect.  2 describes our systematic research methodology to introduce the research questions, objectives, and conceptual framework, and to identify potential related studies. Section  4 presents and describes our content analysis results of the selected studies. Section  5 identifies gaps in the literature of DS &BDA within the context of SC &L. Finally, Sect.  6 concludes our study by summarising the significant features of our detailed framework and by providing several future research avenues.

2 Research methodology

2.1 Research questions and objectives

To develop a conceptual framework for our research, the following research questions (RQs) have been framed:

What strategies are required (in line with the systematic review protocol) to identify studies related to our research topic? (RQ1)

What can be inferred about the research process and guidelines from the previous survey studies related to DS &BDA in SC &L? (RQ2)

What research topics and methodologies have been investigated in DS &BDA in the context of SC &L? (RQ3)

What are the existing gaps in the literature for using DS &BDA techniques in SC &L? (RQ4)

Consequently, the research objectives are defined as follows:

Developing a comprehensive and unbiased systematic process to identify a methodological taxonomy of DS &BDA in SC &L.

Proposing a conceptual framework to categorize application areas of DS &BDA in SC &L.

Identifying the gaps and future research areas in development and application of DS &BDA techniques in SC &L.

2.2 Research process

Figure  2 depicts the nine main steps of our research process derived from Kitchenham ( 2004 ). The process includes three major phases: “research planning” , “conducting review” , and “reporting results” . We initially prepared the research plan by clarifying the research questions , defining the research objectives , and developing a review protocol for our study. A review protocol is an essential element in undertaking a systematic review and determines how primary studies are chosen and analysed. It also involves choosing beneficial resources/databases, study selection procedures or criteria (inclusion and exclusion criteria), and the proposed data synthesis method.

According to our defined review protocol, the second phase of the proposed research process (conducting the review) involves:

Conducting the analysis of recent review studies.

Material collection and identification of the available studies concerning the domain.

Developing a conceptual framework for reviewing and coding the collected studies.

Finally, we analyse the results of the content analysis and coding/classifying the selected studies. This phase also involves exploring the potential gaps and concluding with significant insights.

Figure 2. Outline of the research process

2.3 Review protocol

Our review protocol is a systematic process of searching, demarcating, appraising, and selecting of articles. A similar protocol has been adopted by a number of highly cited review papers in the literature (Nguyen et al., 2018 ; Wang et al., 2018b ; Brinch, 2018 ). Material was collected from standard academic databases such as Web of Knowledge, Science Direct, Scopus, and Google Scholar, and only included “articles”, “research papers” or “reviews”. Results were limited to articles written in English language only between the years 2005 and 2021. The rationale behind this year range is the following: it will allow us to overview the latest studies to identify the research gaps in the area of DS &BDA in SC &L. Additionally, it will enable us to develop a coding strategy to formulate a conceptual framework for classifying the literature.

Initially a broad set of keywords were chosen to select potentially relevant studies. These keywords were “supply chain” OR “logistics”, along with at least one of the following keywords: “data science”, “data driven”, “data mining”, “text mining”, “data analytics”, “big data”, “predictive analytics”, and “machine learning”. However, additional search terms were identified later on from the relevant identified articles, to formulate more sophisticated search strings. We limited our search to articles that include the search keywords in their “title”, “abstract”, or “keywords”. The entire contents of the articles were not studied at this stage. If any database returned a huge number of articles during the search, we then followed a strategy to exclude or make selections from that database.
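For transparency, the Boolean keyword combinations described above can be composed mechanically. The sketch below shows one possible way to generate them; the exact field tags and syntax accepted by each database differ, so this is only an illustration of the combination logic rather than the literal queries submitted to Scopus, Web of Knowledge, or Science Direct.

```python
# Sketch of how the review's keyword combinations could be composed:
# ("supply chain" OR "logistics") AND one analytics keyword at a time.
# Database-specific field tags (title/abstract/keyword filters) are omitted.
domain_terms = ['"supply chain"', '"logistics"']
analytics_terms = [
    '"data science"', '"data driven"', '"data mining"', '"text mining"',
    '"data analytics"', '"big data"', '"predictive analytics"', '"machine learning"',
]

queries = [f'({" OR ".join(domain_terms)}) AND {term}' for term in analytics_terms]
for query in queries:
    print(query)
```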

Table  1 shows the number of extracted papers from each database. This is further subdivided as per the keywords in Table  2 . Since we followed a comprehensive approach and selected a broad range of keywords, it resulted in a large number of studies, in comparison with the related review papers. Investigating the search results from Google Scholar demonstrated that most of the articles were irrelevant. Therefore, we identified Google Scholar database’s result as unreliable and did not consider the associated articles for the selection process. Moreover, after a thorough content analysis of the review articles and an examination of their search keywords with our proposed keywords set (listed in Table  2 ), we recognised that the “SC analytics” and “big data analytics” set of words had been commonly used in most of these articles; thus, to provide a more comprehensive search process, we also added these two keywords to our previous set of search keywords. According to the above-mentioned selection process, the number of preliminary papers extracted from the three search databases was reduced to 6064. The last search process was applied in January 2023.

In the next stages, duplicates from the databases were removed and, in order to ensure quality, only papers that were published in A* and A-ranked journals, or in Q1-ranked journals in the SJR Report, were selected. This process was repeated twice: once for identifying a database of only literature review studies, and once for all other studies (see Figs. 3 and 4). Regarding the selection of review studies, we also looked at papers with the most citations in the Google Scholar database. We selected these studies by sorting the search list collected by each keyword set (see the last column of Table 2). This stage could not be applied to the process of selecting all studies, as the most cited non-review studies, listed in the Google Scholar search, were books, chapters, and other non-relevant articles. At the last stage, a content analysis was done to exclude those studies that were not closely associated with our field of interest.

2.4 Analysis of recent relevant review studies

As noted in the research process, we initially aimed to deliberate recent relevant review studies. The purpose of this approach is twofold. First, we are able to overview the latest review approaches and identify the interest and research gaps in the area of techniques involving DS &BDA in SC &L. Second, it helps us summarise coding strategies to develop a conceptual framework for classifying the literature.

Following our review protocol, the word “review” was added to the previous keywords, and the search process was repeated. This was done to extract only literature review studies from the shortlisted databases. No thorough analysis regarding the content of these studies was done at this time. This reduced the number of potential studies to 459. Amongst these, we found 317 duplicates. Furthermore, the focus on A* and A-ranked journals reduced the number of papers to 18. Then, we investigated the relevance of the remaining papers to our field of interest. After precisely reading the abstract and introduction sections, the papers not strongly associated with the subject or to the field of this research were also removed. Finally, 16 potential review studies were selected in this stage for a full text analysis. Figure  3 illustrates our meticulous selection procedure for selecting these articles. To ensure the comprehensiveness of the analysis, we also selected the most cited review studies listed in the Google Scholar database. The keywords set, as well as the search and filtering process, were applied as noted in the review protocol. We found three more relevant review papers in this stage (see the second stage of the process in Fig.  3 ). It is worth mentioning that some survey studies that only focus on BDA and SCs without a relevance to DS and logistics (e.g. Xu et al. ( 2021a )), have been removed from our list. Finally, 23 papers were selected from our two stages after full text filtering and content analysis.

Figure 3. Research selection process regarding the literature review papers

3 Content analysis and framework development

3.1 Lessons from the review studies

To answer the second research question (RQ2) of our study, outlined in Sect.  2.1 , we analysed the content of the 23 selected review articles. Table  3 summarises these latest review articles. We categorise the lessons gained from this content analysis step in the following subsections.

3.1.1 Review methodologies

The investigation of venues of the selected survey studies introduced the top journals, listed in Table  3 . These journals are mostly among A*/A/Q1 -ranked journals. This confirmed that our approach regarding the inclusion of highly ranked journals was a correct strategy to limit the selected documents. Moreover, by looking over the search engines used by the survey studies (see the column Search Engines in Table  3 ), we confirmed our main databases for the material selection process in which we selected all relevant studies.

In Table 3, the column Type of Review refers to the research methodology employed for reviewing the selected studies. Of the 23 review papers, eight did not utilise any type of “systematic” (SR) or “bibliometric” (BIB) method; by investigating their research methodology in more detail, we categorised these articles as “others” (ORS), meaning that they did not utilise an organised research methodology for their review process. Three articles reviewed the literature bibliographically (Pournader et al., 2021 ; Mishra et al., 2018 ; Iftikhar et al., 2022b ), and one article (Arunachalam et al., 2018 ) chose both methods (SR and BIB). A systematic approach was also claimed to be implemented in some BIB-based studies (see Pournader et al. ( 2021 )).

3.1.2 Gaps identification in research topics

Although the authors asserted a holistic view in their research process, they mostly investigated the SC from an operations viewpoint, i.e., production, logistics, inventory management, transportation, and demand planning (see the coding and classification in these studies: Nguyen et al. ( 2018 ); Tiwari et al. ( 2018 ); Choi et al. ( 2018 ); Maheshwari et al. ( 2021 )). Moreover, it seems that some aspects of SC operations were overlooked by the researchers (see the column Perspective and Special Features of Neilson et al. ( 2019 ) and Novais et al. ( 2019 )). For instance, some of the production and transportation aspects, such as the “network design”, “facilities capacity”, and “vehicle routing” were not reviewed by Nguyen et al. ( 2018 ) or Maheshwari et al. ( 2021 ), although they follow a comprehensive approach.

Any decision around a SC can be classified at three planning levels, i.e., “strategic”, “tactical”, and “operational” (Stadtler & Kilger, 2002 ; Ivanov et al., 2021b ). DS &BDA can provide useful solutions at each of these planning levels (Nguyen et al., 2018 ). Wang et al. ( 2016a ) focused on the value of SC analytics and the applications of BDA on strategic and operational levels. They acknowledge the importance of BDA for the SC strategies, which in turn, affects the “SC network”, “product design and development”, and “strategic sourcing”. They also note that at the operational level, BDA plays a critical role in the effective performance of analysing and measuring “demand”, “production”, “inventory”, “transportation” and “logistics”. The authors do not utilise the basic definitions and categorisations of the decisions levels that existed in the literature (Stadtler & Kilger, 2002 ).

3.1.3 Gaps identification in DS &BDA techniques

Grover and Kar ( 2017 ) classify BDA into “predictive”, “prescriptive”, “diagnostic”, and “descriptive” categories. Kotu and Deshpande ( 2018 ) note that these classifications can be considered for any research using DS &BDA tools. However, some review studies around BDA only refer to three of these classifications (predictive, prescriptive, and descriptive categories) (Wang et al., 2016a ; Arunachalam et al., 2018 ; Nguyen et al., 2018 ). Nguyen et al. ( 2018 ) classify the applications of BDA in SC &L based on the main three categories and conclude that prescriptive analytics is more controversial than the other two, since the results of this type of analytics are strongly influenced by the descriptive and predictive types.

Considering a broader exploration of logistics for any company, DS &BDA has significant importance in transportation systems for enhancing safety and sustainability. Neilson et al. ( 2019 ) review the applications of BDA from only the logistics perspective. They concede that the applications of BDA in the transportation system can be categorised as sharing traffic information (avoiding traffic congestion), urban planning (developing transportation infrastructures), and analysing accidents (improving traffic safety). Since the authors focus only on transportation systems, they explore the data collection process from urban facilities only such as smartphones, traffic lights, roadside sensors, global positioning systems (GPSs), and vehicles. The authors focus on special data types and formats that are mostly used in urban applications. They also classify the application of BDA in transportation into several categories, including predictive, real-time, historical, visual, video, and image analytics.

The special characteristics of big data as noted in our research terminologies (see Sect.  1.2 ), have been researched in certain survey studies. For instance, Addo-Tenkorang and Helo ( 2016 ) propose a framework based on the Internet of things (IoT), referred to as “IoT-value adding”, and extend five traits for big data: variety, velocity, volume, veracity, and value-adding. IoT is defined as the connectivity and sharing of data between physical things or technical equipments via the Internet (Addo-Tenkorang & Helo, 2016 ). Studies around BDA also list recent technologies and tools employed for dealing with large data sets. These technologies include but are not limited to cloud computing, IoT, and master database management systems (MDMS), which are associated with the veracity characteristic of big data; additionally, the tools include Apache Hadoop, Apache Spark, and Map-Reduce. In the case of the SC, Chen et al. ( 2014 ) concede that big data can be acquired from elements of the SC network, such as suppliers, manufacturers, warehouses, retailers, and customers, which are related to the variety characteristic of big data. Some of the researchers such as Brinch ( 2018 ) and Addo-Tenkorang and Helo ( 2016 ) consider the value of big data in SC &L. Brinch ( 2018 ) introduces a conceptual model for discovering, creating, and capturing value in SC management. Arunachalam et al. ( 2018 ) note that assessing the current state of an organisation on BDA will help its managers enhance the company’s capabilities. The authors suggest five BDA capabilities dimensions: “data generation”, “data integration and management”, “data advanced analytics”, “data visualisation”, and “data-driven culture”. The first two capabilities represent the level of data resources, whereas the second two demonstrate the level of analytical resources. The last is the foundation capability, compared to the other capabilities, which needs to be institutionalised in any organisation. Kamble and Gunasekaran ( 2020 ) also affirm that the performance measures used in a data-driven SC must be different from a traditional SC. For this purpose, the authors identify two categories of measures for data-driven SC performance monitoring: BDA capability and evaluating processes. BDA tools and platforms are also categorised into five groups according to the type of provided service: “Hadoop”, “Grid Gain”, “Map Reduce”, “High-performance computing cluster (HPCC) systems”, and “Storm” (Grover & Kar, 2017 ; Addo-Tenkorang & Helo, 2016 ). Each of these tools has different applications in SC &L.

Regarding the statistical techniques indexed in the review studies and their categories, we found the following classifications:

Techniques to measure data correlation (such as statistical regression (Zhang et al., 2019 ) and multivariate statistical analysis (Wesonga & Nabugoomu, 2016 )).

Simulation techniques (Wojtusiak et al., 2012a ; Antomarioni et al., 2021 )).

Optimisation techniques (including heuristic algorithms such as the genetic algorithm (Chi et al., 2007 ) and particle filters (Wang et al., 2018c )).

Machine learning methods (e.g. neural networks (Tsai & Huang, 2017 ), support vector machines (Weiss et al., 2016 )).

Data mining methods (e.g., classification (Merchán & Winkenbach, 2019 ), clustering (Windt & Hütt, 2011 ), regression (Benzidia et al., 2021 )).

These studies note that every technique has its strengths and weaknesses. For instance, statistical methods are fast but not adaptable enough to all problems. These methods cannot be applied to an unstructured and heterogeneous data set, while machine learning techniques are flexible, adaptable, yet time-consuming (Wang et al., 2016a ; Choi et al., 2018 ; Pournader et al., 2021 ). Some studies such as Ameri Sianaki et al. ( 2019 ) investigate the applications of DS &BDA in a specific industry such as healthcare or smart cities. The authors find applications of DS &BDA in healthcare SCs and classify them as “patients monitoring”, “diagnosing disorders”, and “remote surgery”. Each of the mentioned techniques can also be applied in different types/levels of analytics. For example, optimisation is a prescriptive analytic and cannot be predictive. However, simulation techniques can be used in predictive, diagnostic, and prescriptive analytics (Baryannis et al., 2019a ). Therefore, one perspective that can help us define the conceptual analysis of articles is the categorisation of DS &BDA techniques based on different analytical types/levels. This categorisation proposes guidelines for practitioners as well.

These review studies also investigate their selected articles in certain specific domains in SC &L such as SC risk management, in which decision-making is required to be fast, and the data is acquired from multi-dimensional sources. Baryannis et al. ( 2019a ) explore risk and uncertainty in the SC by reviewing the applications of AI in BDA. The authors categorise the methods proposed for SC risk management in two main classes: mathematical programming and network-based models, and find that mathematical programming approaches have received more attention in the literature. Data-driven optimisation (DDO) approaches are other recent and effective approaches used in this area of research (Jiao et al., 2018 ; Gao et al., 2019 ; Zhao & You, 2019 ; Ning & You, 2018 ). The related methods are recognised as a combination of machine learning and mathematical programming methods for making decisions under uncertainty. DDO approaches can be further subdivided into four categories: “stochastic programming”, “robust optimisation”, “chance-constrained programming”, and “scenario-based optimisation” (Ning & You, 2019 ; Nguyen et al., 2021 ). In the DDO approaches, uncertainty is not predetermined, and decisions are made based on real data. Therefore, these are the main differences between data-driven approaches and traditional mathematical approaches. The results of the DDO methods are also less conservative, and consequently, closer to reality. The selection of techniques and tools is very critical because they strongly influence the outputs of analytics.
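To make the contrast between data-driven and distribution-based optimisation concrete, the following minimal sketch solves a single-item newsvendor problem in a scenario-based fashion: the order quantity is chosen to minimise the average underage and overage cost over historical demand samples, with no distributional assumption, and is compared against the empirical critical-ratio quantile. The cost parameters and demand history are hypothetical.

```python
# Scenario-based (data-driven) newsvendor: choose the order quantity that
# minimises average underage + overage cost over historical demand samples,
# instead of assuming a demand distribution. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(7)
demand_history = rng.gamma(shape=4.0, scale=50.0, size=500)  # observed demand scenarios

underage_cost = 5.0   # lost margin per unit short
overage_cost = 2.0    # holding/disposal cost per unit left over

def average_cost(q, demand):
    shortage = np.maximum(demand - q, 0.0)
    leftover = np.maximum(q - demand, 0.0)
    return np.mean(underage_cost * shortage + overage_cost * leftover)

# Grid search over candidate order quantities (scenario-based optimisation).
candidates = np.linspace(0, demand_history.max(), 2000)
costs = [average_cost(q, demand_history) for q in candidates]
q_data_driven = candidates[int(np.argmin(costs))]

# For comparison: the classical critical-ratio solution is the empirical
# quantile at cu / (cu + co), which the scenario solution should approximate.
critical_ratio = underage_cost / (underage_cost + overage_cost)
q_quantile = np.quantile(demand_history, critical_ratio)

print(f"scenario-based order quantity: {q_data_driven:.1f}")
print(f"empirical-quantile benchmark:  {q_quantile:.1f}")
```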

One of the most widely used tools for managing and integrating data in SC &L is cloud computing (Mourtzis et al., 2021 ; Sokolov et al., 2020 ). With this tool, the data is stored in cyberspace and serviced according to user needs. This technology can play an important role in SC &L. Novais et al. ( 2019 ) explore the role of cloud computing on the chain’s integrity. Some studies (Jiang, 2019 ; Zhu, 2018 ; Zhong et al., 2016 ) show that the impact of cloud computing on the integration of the SC (financially or commercially) is positive. This technology helps improve the integration of information, financial and physical flows in the SC via information sharing between the SC members, optimising payment and cash processes among partners, and controlling inventory levels and costs. In the case of information sharing, we also found the fuzzy model developed by Ming et al. ( 2021 ) as a valuable method considering BDA concerns.

3.2 Material collection

By looking at the keywords, we found that two combinations of them, i.e., “data mining” AND “logistics” and “machine learning” AND “logistics”, had yielded the most search results. Precisely reviewing some of the articles, instead of merely the “logistics” word, the “logistics regression” phrase was detected, which is a common methodology in data mining, and not in the transportation field. Therefore, the keyword “logistics regression” was excluded from the list via “AND NOT”. The selection process resulted in 6681 potential papers. In the next step, we excluded irrelevant papers by overviewing the abstracts and keywords. Articles related to “conceptual analysis”, “resource dependency theory”, “the importance of BDA”, “management capabilities”, and “the role or application of the BDA in the SC” were identified as unrelated articles. We also removed the review papers, as we explored them in the previous step, separately. These filtering criteria reduced the number of papers to 2583 (see Table  4 ). Figure  4 illustrates the selection procedure in detail. After removing duplicates and filtering for highly ranked journals, 1167 studies remained. In the next step, we scanned the papers’ abstracts (and in some cases, the full text) to examine the relevance of the paper to our domain. Since we aimed to limit our selection to research employing any DS &BDA models/techniques, we removed several papers that were using only conceptual models. A total of 364 articles were finally selected to go through the full text analysis and coding step.

Figure 4. Research selection process regarding all relevant papers excluding review papers

3.3 Conceptual framework of DS &BDA in SC &L

Considering all the insights gained from the previous review studies, we propose a conceptual framework of our study encompassing two perspectives: (1) SC &L research problems/topics and (2) DS &BDA main approaches. This structure can help practitioners apply DS &BDA approaches for creating a competitive advantage. According to our research process outlined in Fig.  2 , we revised the list of each category with the help of a recursive process and gathered feedback from the full content analysis of the selected studies.

Figure  5 illustrates our proposed categorisation from the SC &L research problems/topics viewpoint in a framework. We classify SC operational processes (i.e., procurement, production, distribution, logistics, and sales) into three hierarchical levels of decision-making used in SC &L companies (Stadtler & Kilger, 2002 ). In the first operational process, we highlight decisions about procurement planning, which includes concerns about raw materials and suppliers (Cui et al., 2022 ). Production planning organises the products’ design and development (Ialongo et al., 2022 ). These issues coordinate suppliers’ and customers’ requirements. Distribution planning influences production and transportation decisions. Logistics or transportation planning deals with methodologies related to delivering products to end-customers or retailers. Sales planning is related to trades in business markets. We also consider SC design as a strategic decision and classify the studies in resilient, sustainable, and closed loop and reverse logistics categories.

Figure 5. Conceptual framework of reviewing the selected studies from the SC &L research problems/topics viewpoint

Figure  6 demonstrates our conceptual framework proposed for the classifications of the DS &BDA main approaches. DS &BDA algorithms/techniques for SC &L are categorised based on this proposed framework.

Figure 6. Conceptual framework of reviewing the selected studies from the viewpoint of the employed DS &BDA main approaches

All DS &BDA approaches, shown in Fig.  6 , are applied to each topic listed in Fig.  5 . In the next section, we explore our 364 selected articles in more detail with respect to each of these categories.

4 Context analysis of results

Responding to the third research question (RQ3), we initially visualise the research sources in the scope of a yearly distribution, publication venues, and analytics types. In the coding process, we review the context of the selected papers precisely and classify them based on the proposed framework. In this step, we explore the main contributions of the selected papers. It is worth mentioning that with the recursive process, we complete the proposed framework so as to cover all topics (the final version of the framework is delineated in Figs.  5 and 6 ). Consequently, we evaluate and synthesise the selected studies at the end of this section.

4.1 Data visualisation and descriptive analysis

4.1.1 Distribution of papers per year and publications

To identify the journals with the highest number of contributions, and to provide an overview of the research trends, we classify all selected papers by publication venue and year (see Fig. 7). Figure 7a depicts the distribution of published papers between 2005 and August 2022. It can be observed that before 2005, the domain of DS &BDA was not investigated, and the contribution remained insignificant until 2012. In fact, before 2012, the concept of DS &BDA was considered as data mining or BI (Arunachalam et al., 2018 ). The Google trend of interest regarding DS &BDA topics, depicted in Fig. 1, also confirms this trend and the growing consideration of DS &BDA after the year 2012. The publication trend also shows that the applications of DS &BDA in SC &L have attracted the attention of many researchers in the past four years. As the chart shows, the number of papers published in the last five years is approximately double the total of those published in all previous years. The number of studies appears to have declined since 2020, which is expected given the specifics of the COVID-19 period.

Figure 7. Distribution of the selected papers per year and for the top ten journal venues

The number of publications in the top ten journals is illustrated in Fig. 7b. Overall, we found 157 different journal venues for our 364 selected studies in the domain of DS &BDA, with most of them in the “information system management” and “transportation” 2020 SJR subject classifications. It is noticeable that a significant proportion of the studies (over 45%) have been published by high-impact journals, such as CIE , ESA , IJPR , JCP , and IJPE . Also, it is worth mentioning that the ESA journal has recently had the most publications in the field of DS &BDA applications in SC &L. The ESA journal is an open access journal whose focus is on intelligent systems applied in industry. The CIE and IJPR journals rank second, and both concentrate more on SC &L than on “information systems”. Other journals in Fig. 7b are among the most popular journals published in the field of SC &L.

4.1.2 Types of analytics approaches

The analytics type for each selected study needs to be further investigated. According to the classification introduced by Grover and Kar ( 2017 ) and Arunachalam et al. ( 2018 ), four types of analytics can be defined: descriptive, diagnostic, predictive, and prescriptive. Due to an extremely limited number of studies classified on diagnostic analytics (7 out of 364 publications in our data set), this area was excluded from our classification, similar to the survey study of Nguyen et al. ( 2018 ). A classification in each field of analytics is conducted based on the applied models and common techniques of analysis, as outlined in Table  5 (see also Wang et al. ( 2016a ); Grover and Kar ( 2017 ); Nguyen et al. ( 2018 ) for a description of these analytics types). The simulation approach is listed in both the predictive and prescriptive analytics (Viet et al., 2020 ; Wojtusiak et al., 2012b ; Wang et al., 2018c ).

Figure 8 shows the annual distribution of analytics types over time. Predictive analytics methodologies became more popular in 2019–2022: 45% of the articles follow a predictive approach in their proposed solution, the highest proportion among the analytics types. This is explained by the development of analytical tools and the ability to access dynamic data in addition to historical data (Arunachalam et al., 2018).

Fig. 8 Annual distribution of the selected studies with respect to analytics type

Fig. 9 Distribution of the articles by DS &BDA approaches

Additionally, we analyse the distribution of approaches used in the articles (see Table 5). Figure 9a–c show the distribution of the main approaches employed in the selected studies for each type of analytics. Among all predictive techniques, the neural network is the most frequently used one, employed in 19% of the selected papers across the various main DS &BDA approaches such as forecasting, classification, and clustering. Moreover, among all main approaches and algorithms, graph visualisation techniques are the most widely employed methods in the scope of this survey (29% of the selected papers used them).

Ensemble learning is the process by which several algorithms/techniques (including forecasting or classification techniques) are strategically combined to solve a particular DS &BDA problem. Regarding the selection of appropriate techniques, ensemble learning helps reduce the probability of an unlucky choice of a single poor technique and can improve the performance of the overall model (Zhu et al., 2019b; Hosseini & Al Khaled, 2019). Deep learning is an evolution of machine learning based on multi-layer neural networks and can be employed for forecasting, classification, or any other predictive task (Bao et al., 2019; Pournader et al., 2021; Rolf et al., 2022).
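To illustrate the ensemble idea, the following minimal Python sketch combines three of the base techniques named above (a decision tree, logistic regression, and an SVM) in a soft-voting classifier; the data set, feature names, and parameter values are purely hypothetical and are not taken from any of the reviewed studies.

```python
# Illustrative sketch only: a soft-voting ensemble on synthetic data,
# standing in for e.g. a supplier-disruption or backorder classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature matrix (lead time, order size, delay history, ...)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("logit", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities of the base models
)
ensemble.fit(X_train, y_train)
print("Hold-out accuracy:", ensemble.score(X_test, y_test))
```

Averaging the probabilities of several base models in this way is one simple means of reducing the risk of relying on a single poorly performing technique.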

4.1.3 Methodological perspectives

Descriptive analysis is adopted in approximately 33% of the examined literature. These articles commonly use clustering, association, visualisation, and descriptive approaches in DS &BDA (see Fig. 9a). The use of these approaches has generally been increasing, especially for visualisation, which has received much attention in the last four years. Data visualisation is a beneficial tool for SC &L in different areas. Graph and OLAP techniques have been the main methods used in data visualisation, because visualisation approaches are able to depict a portion of the research problem and are applicable to all areas of SC &L. In the clustering approach, there is a variety of techniques and algorithms. K-means clustering is the most discussed technique, used for analysing energy logistics vehicles (Mao et al., 2020), traffic accidents (Kuvvetli & Firuzan, 2019), traffic flows (Bhattacharya et al., 2014), pricing (Hogenboom et al., 2015), and routing (Ehmke et al., 2012). The third most commonly used approach in descriptive analytics is the association approach, which measures the association between variables. The Apriori algorithm is the most popular association algorithm and has been used for various issues, including transportation risk (Yang, 2020), demand forecasting (Kilimci et al., 2019), quality management (Wang & Yue, 2017), vehicle routing (Ting et al., 2014), research and development (R &D) (Liao et al., 2008b), and customer feedback (Singh et al., 2018a).
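As a concrete, simplified illustration of the clustering approach, the short Python sketch below applies k-means to a synthetic two-feature data set (trip distance and vehicle load); all values are invented for illustration and do not reproduce any model from the cited papers.

```python
# Minimal k-means sketch on synthetic trip records (distance km, load tonnes).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
trips = np.vstack([
    rng.normal(loc=[20, 2], scale=[5, 0.5], size=(100, 2)),    # short urban trips
    rng.normal(loc=[150, 12], scale=[30, 2], size=(100, 2)),   # long-haul trips
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(trips)
print("Cluster centres:\n", kmeans.cluster_centers_)
print("Cluster sizes:", np.bincount(kmeans.labels_))
```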

In the predictive analytics type, the classification approach is very popular (see Fig. 9b). The most common algorithms used in classification are SVM (20%), decision trees (19%), logistic regression (19%), and neural networks (11%). This approach is usually applied to decisions concerning demand forecasting (Nikolopoulos et al., 2021; Yu et al., 2019b; Gružauskas et al., 2019; Zhu et al., 2019a), quality management (Bahaghighat et al., 2019), customer churn (Coussement et al., 2017), delivery planning (Proto et al., 2020; Wang et al., 2020; Praet & Martens, 2020), and routing (Spoel et al., 2017). Regression techniques have also received considerable attention, with linear regression models (37%) and SVR (24%) being the most commonly used. These regression models are mainly applied to logistics decisions such as traffic accidents (Farid et al., 2019; Wang et al., 2016b), vehicle delays (Eltoukhy et al., 2019), and delivery planning (Ghasri et al., 2016; Merchán & Winkenbach, 2019), and to sales decisions such as demand forecasting (Nikolopoulos et al., 2021, 2016) and sales forecasting (Lau et al., 2018).
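The following brief sketch contrasts the two regression families mentioned above (a linear model and SVR) on a synthetic weekly demand series driven by price and promotions; the data-generating process and parameter values are assumptions made purely for illustration.

```python
# Sketch: linear regression vs. support vector regression (SVR) on
# synthetic weekly demand driven by price and a promotion flag.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200
price = rng.uniform(8, 12, n)
promo = rng.integers(0, 2, n)
demand = 500 - 30 * price + 80 * promo + rng.normal(0, 15, n)  # hypothetical

X = np.column_stack([price, promo])
for name, model in [
    ("linear regression", LinearRegression()),
    ("SVR", make_pipeline(StandardScaler(), SVR(C=10.0))),
]:
    r2 = cross_val_score(model, X, demand, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")
```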

The neural network is an important and common technique for forecasting and can be applied to a wide variety of problems such as supplier selection (Pramanik et al., 2020) and demand or sales forecasting (Verstraete et al., 2019). Time series modelling is the fourth predictive approach; ARIMA (34%), exponential smoothing (17%), and moving averages (18%) are the most popular time series techniques in DS &BDA and are usually applied for demand forecasting (Kilimci et al., 2019; Huber et al., 2017). Regarding simulation, we find that the ARENA and AnyLogic software packages are used more than others, for shop floor control simulations (Yang et al., 2013), machine scheduling (Heger et al., 2016), and routing (Ehmke et al., 2016). Text mining is a useful approach for understanding the feelings and opinions of customers or other stakeholders; in the examined papers, this method has been used in only 8 articles, in the fields of customer feedback (Hao et al., 2021), sales forecasting (Cui et al., 2018), and SC mapping (Wichmann et al., 2020).
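As an example of the time series family, the minimal sketch below fits an ARIMA(1,1,1) model to a synthetic monthly demand series with statsmodels and produces a six-month forecast; the series, the chosen order, and the horizon are illustrative assumptions rather than settings taken from the reviewed studies.

```python
# Sketch: ARIMA(1,1,1) on a synthetic monthly demand series.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
months = pd.date_range("2018-01-01", periods=60, freq="MS")
demand = pd.Series(
    200 + np.cumsum(rng.normal(2, 10, 60)),  # hypothetical trending demand
    index=months,
)

model = ARIMA(demand, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))  # point forecasts for the next six months
```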

Prescriptive analytics has the lowest number of contributions compared to the other types of analytics. Optimisation models, simulation, and multi-criteria decision-making (MCDM) are the main approaches of the prescriptive analytics type. Among them, optimisation has the most contributions (78% of the prescriptive analytics studies). Optimisation techniques are most often used to optimise facility location (Doolun et al., 2018), the location of distribution centres (Wang et al., 2018a), the type of technology (Shen How & Lam, 2018), capacity planning (Ning & You, 2018), the number of facilities (Tayal & Singh, 2018), inventory management (Çimen & Kirkbride, 2017), and vehicle routing (Mokhtarinejad et al., 2015). In addition to optimisation, MCDM approaches are also used in decision-making. This approach is classified into two main technique categories (MADM and MODM) and is applied to customer credit risk (Lyu & Zhao, 2019), supplier selection (Maghsoodi et al., 2018), inventory management (Kartal et al., 2016), and SC resilience (Belhadi et al., 2022).
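To give a flavour of the optimisation models underlying prescriptive analytics, the toy sketch below solves a two-distribution-centre, three-customer transportation problem as a linear programme with SciPy; the cost matrix, capacities, and demands are invented for illustration and do not correspond to any study in our corpus.

```python
# Sketch: a tiny transportation LP (two distribution centres, three
# customers) solved with scipy, illustrating prescriptive optimisation.
import numpy as np
from scipy.optimize import linprog

cost = np.array([[4.0, 6.0, 9.0],    # shipping cost DC1 -> customers 1..3
                 [5.0, 3.0, 7.0]])   # shipping cost DC2 -> customers 1..3
supply = np.array([70.0, 50.0])      # hypothetical DC capacities
demand = np.array([40.0, 35.0, 45.0])

c = cost.flatten()                    # decision variables x[i, j], flattened row-wise
A_ub = np.zeros((2, 6))               # supply constraints: sum_j x[i, j] <= supply[i]
A_ub[0, 0:3] = 1.0
A_ub[1, 3:6] = 1.0
A_eq = np.zeros((3, 6))               # demand constraints: sum_i x[i, j] == demand[j]
for j in range(3):
    A_eq[j, j] = 1.0
    A_eq[j, 3 + j] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=supply, A_eq=A_eq, b_eq=demand, bounds=(0, None))
print("Minimum shipping cost:", res.fun)
print("Optimal flows:\n", res.x.reshape(2, 3))
```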

4.1.4 Technique verification strategies

In order to solve an SC &L problem, a suitable algorithm/technique must be selected and then evaluated on a proper data set. Figure 10 shows the percentages of the applied verification strategies. In the examined literature, researchers mainly employ a case study strategy with real data to verify their selected approaches and models (Antomarioni et al., 2021; Nuss et al., 2019). A few others use a data generation strategy (i.e., synthetic data), mainly in combination with simulation techniques (Kang et al., 2019). Hence, almost all algorithms/techniques require real data to be verified (Choi et al., 2018).

Fig. 10 Distribution of research verification strategies

4.1.5 Comparison with previous survey studies

We compare our results with the recent survey studies listed in Table 3. The comparison of top journals demonstrates that our unbiased approach to finding studies includes more of the relevant journals focusing on information systems (e.g., the ESA and IEEEA journals). For instance, in the survey by Nguyen et al. (2021), all top journals are SC &L-focused journals (IJPE, TRC, IJPR, and ICE). That survey employed only the keywords “data-driven” and “data-based”, which do not cover all aspects of data science or data analytics applications (e.g., machine learning, deep learning, big data), and the authors relied only on previous survey studies to identify keywords related to SC &L.

The comparison of “Search Engines” in Table 3 demonstrates that most of the previous surveys use a single database (mostly Scopus) for their search process and do not cross-check the results against other databases. Our systematic process identified many duplicates (see Tables 1 and 2), but by handling these duplicates we reached a cleaner and more accurate data set.

4.2 Classification of studies based on the conceptual framework

As depicted in Fig. 5, SC &L comprises five internal processes: procurement, production, distribution, logistics, and sales. In each process, a three-level hierarchical planning structure is required: (1) long-term planning or strategic decision-making over a multi-year scheduling horizon, (2) mid-term planning or tactical decision-making over a seasonal or at most one-year scheduling horizon, and (3) short-term planning or operational decision-making, with a planning period ranging from a few days up to one season (Sugrue & Adriaens, 2021).

An overview of the processes shows that the logistics process has received the most interest, especially during the last two years (128 papers, 31% of the corpus). Sales is another frequently studied field for applying DS &BDA (83 papers, 20% of the examined literature). Figure 11 illustrates the distribution of studies by decision level. In the procurement process, supplier selection (27 papers) and order allocation (17 papers) are the most discussed topics. The results of our investigations indicate that studies on long-term decisions such as plant location (Doolun et al., 2018), type of technologies (Vondra et al., 2019), and R &D (Liao et al., 2009) have contributed considerably to improving production decisions; for example, an inappropriate network design incurs high costs (Song & Kusiak, 2009). The two key aspects at the mid-term production decision level in DS &BDA are master production scheduling (determining production quantities in each period) and quality management (Masna et al., 2019). Shop floor control has been of interest to researchers in short-term production planning (Yang et al., 2013). The results further show that among tactical distribution decisions, most of the papers discuss inventory management (Sachs, 2015), while at the long- and short-term decision levels of this process, the issues of distribution centre location (Wang et al., 2018c), warehouse replenishment (Priore et al., 2019), and order picking (Mirzaei et al., 2021) have attracted researchers.

Fig. 11 Distribution of the selected studies with respect to SC &L processes and planning levels

Logistics decisions have mainly been studied at the short-term level, i.e., vehicle routing (Tsolakis et al., 2021) and delivery planning (Vieira et al., 2019a). Unlike the other processes, logistics is dominated by operational decisions (59% of the logistics planning studies), followed by mid-term transportation planning decisions, including material flow rate issues (Wu et al., 2019). In the sales process, decisions and issues are mainly addressed at the mid-term level; customer demand forecasting (Yu et al., 2019b), pricing (Liu, 2019), and sales forecasting (Villegas & Pedregal, 2019) are the three most commonly studied issues in this process.

Overall, at the long-term planning level, most articles contribute to production (35 papers) and procurement decisions (30 papers). At the mid-term decision level, due to attractive issues such as demand forecasting, the sales process has been the most investigated area (70 papers), followed by the logistics process (40 papers), mainly through contributions on transportation planning issues. At the short-term level, the logistics process is at the forefront (76 papers), with vehicle routing (Yao et al., 2019), delivery planning (Vieira et al., 2019a), financing risk (Ying et al., 2021), and transportation risk (Zhao et al., 2017) being addressed most frequently; the distribution process ranks second, with a considerably smaller contribution (20 papers). Material ordering (Vieira et al., 2019b) and customer feedback (Singh et al., 2018a) have the lowest contributions in terms of applying DS &BDA at the short-term decision level.

4.2.1 Long-term decisions in SC &L

Long-term procurement decisions deal with supplier selection (Hosseini & Al Khaled, 2019) and supplier performance (Chen et al., 2012). In the production and distribution processes, these decisions concern the network design of factories and distribution centres, such as the location, number, types of facilities, and centre capacity (Mishra & Singh, 2020; Flores & Villalobos, 2020; Mohseni & Pishvaee, 2020), whereas strategic decisions in the logistics process comprise planning with respect to the transportation system infrastructure, carrier selection, and capacity design (Lamba & Singh, 2019; Lamba et al., 2019). In the sales category, long-term decisions include customer service level determination, strategic sales planning, and customer targeting.

We also consider the SC design decisions in this category, including resilient SC (Brintrup et al., 2020 ; Belhadi et al., 2022 ; Mungo et al., 2023 ; Mishra & Singh, 2022 ; Hägele et al., 2023 ), sustainable SC (Bag et al., 2022b ), closed-loop (Govindan & Gholizadeh, 2021 ) and reverse logistics (Shokouhyar et al., 2022 ). A more complete categorisation of the related articles is summarised in Table  6 .

4.2.2 Efficiency, sustainability, and resilience paradigms

The COVID-19 pandemic has clearly shown the importance of resilient SC designs (Rozhkov et al., 2022 ). SC resilience refers to having the capability to absorb or even avoid disruptions (Ivanov, 2021a ; Kosasih & Brintrup, 2021 ; Yang & Peng, 2023 ). Belhadi et al. ( 2022 ) concede that AI techniques provide capable solutions for designing and upgrading more resilient SCs. Zhao and You ( 2019 ) develop a resilient SC design by employing a data-driven robust optimisation approach and demonstrate how the DS &BDA concepts should be considered in SC models.

SC sustainability refers to consideration of environmental, societal, and human-centric aspects in SC decisions (Li et al., 2021b ; Homayouni et al., 2021 ; Sun et al., 2020 ; Li et al., 2020a ). Mishra and Singh ( 2020 ) develop a sustainable reverse logistics model by considering realistic parameters. They affirm that all three aspects of sustainability can be covered by BDA. Tsolakis et al. ( 2021 ) conduct a comprehensive literature review for AI-driven sustainability and conclude that the most essential techniques in modelling SCs are AI and optimisation techniques.

A closed-loop SC employs reverse logistics to supply re-manufactured products back into the forward logistics process. Jiao et al. ( 2018 ) develop a data-driven optimisation model to integrate sustainability features in a closed-loop SC. Shokouhyar et al. ( 2022 ) employ social media data for modelling a customer-centric reverse logistics with an emphasis on the BDA approaches for designing reverse logistics SCs.

4.2.3 Mid-term decisions in SC &L

The selected paper categorisation at the tactical decision level is outlined in Table  7 . Decisions regarding the allocation of orders to suppliers such as the order quantity planning and lot sizing (Lamba & Singh, 2019 ), supply risk management (Baryannis et al., 2019a ), raw materials quality management (Bouzembrak & Marvin, 2019 ), material requirement planning (Zhao & You, 2019 ), material cost management (Ou et al., 2016 ), and demand forecasting (Stip & Van Houtum, 2020 ) are all considered as mid-term procurement decisions.

The main tasks in mid-term production planning are master production scheduling (Flores & Villalobos, 2020 ), capacity planning (Sugrue & Adriaens, 2021 ), quality management (Ou et al., 2016 ), and demand forecasting (Dombi et al., 2018 ), while inventory management decisions (Ning & You, 2018 ), capacity planning (Oh & Jeong, 2019 ), in-stock product quality management issues (Ou et al., 2016 ), and warehouse demand forecasting (Zhou et al., 2022b ) are among the tactical distribution decisions.

Some of the main mid-term logistics decisions are transportation planning (Wu et al., 2020 ; Gao et al., 2019 ), service quality management (Gürbüz et al., 2019 ; Molka-Danielsen et al., 2018 ), transportation modes (Jula & Leachman, 2011 ), and demand forecasting (Potočnik et al., 2019 ; Boutselis & McNaught, 2019 ). Demand forecasting (Lee et al., 2011 ; Shukla et al., 2012 ), demand shaping (e.g., marketing) (Aguilar-Palacios et al., 2019 ; Liao et al., 2009 ), sales forecasting (Wong & Guo, 2010 ), pricing (Hogenboom et al., 2015 ), consumer behaviour (e.g., purchasing pattern) (Bodendorf et al., 2022b ; Garcia et al., 2019 ), and customer churn (Coussement et al., 2017 ) are planned in the tactical sale decisions.

4.2.4 Short-term decisions in SC &L

Short-term procurement planning includes ordering materials (Vieira et al., 2019b ). Production operational decisions include machine scheduling (Yue et al., 2021 ), shop floor control (such as preventive maintenance scheduling (Celik & Son, 2012 ) and material flows (Zhong et al., 2015 )), and decisions regarding the size of the production batch (Sadic et al., 2018 ). In the area of distribution , planning associated with packaging (Kim, 2018 ), warehouse replenishment (Taube & Minner, 2018 ), order picking (Mirzaei et al., 2021 ), and inventory turnover (Zhang et al., 2019 ) could be made in short-term decisions. A variety of operational decisions can be made at the logistics stage, including delivery planning (Praet & Martens, 2020 ), vehicle delay management (Kim et al., 2017 ), routing planning (Liu et al., 2019 ), and transportation risk management (Wu et al., 2017 ).

At this level of decision-making, due to the wide variety of decisions, we consider more categories than other levels. For example, we consider vehicle delivery planning (Vieira et al., 2019b ) and vehicle routing (Yao et al., 2019 ) as two separate categories. Also, in order to reduce the number of categories, we aggregate crash risk (Bao et al., 2019 ), traffic safety (Arbabzadeh & Jafari, 2017 ), and fraud detection decisions (Triepels et al., 2018 ) in the transport risk management category. Table  8 shows the results of reviewing the short-term decisions.

5 Identification of research gaps

To answer the fourth research question (RQ4), we evaluate the selected studies in detail to identify existing gaps in the literature on using DS &BDA approaches in SC &L. We categorise our findings in the following sub-sections.

5.1 Data-driven optimisation

DDO has received considerable attention. In our study, we aimed to identify related techniques by adding the word “data-driven” to our keyword set (see the preliminary search results for DDO in Table 2). DDO is a mathematical programming approach that combines optimisation under uncertainty with machine learning algorithms; the objective functions are often cost-related (Alhameli et al., 2021; Baryannis et al., 2019a). Ning and You (2019) divided DDO into four modelling methods: stochastic programming, chance-constrained programming, robust optimisation, and scenario-based optimisation. In the SC &L area, some of the problem parameters may be considered uncertain, such as customer demand (Medina-González et al., 2018; Taube & Minner, 2018), production capacity (Jiao et al., 2018), and delivery time (Lee & Chien, 2014). In contrast to traditional optimisation models under uncertainty, which assume perfect information about the parameters, DDO approaches use observations of the random variables as direct inputs to the programming problem.

In our examined material, 21 papers studied optimisation under uncertainty. Stochastic programming methods (e.g., MILP and MINLP) were the most applied ones (e.g., Flores and Villalobos (2020); Taube and Minner (2018)). Chance-constrained programming is an optimisation method in which probabilistic constraints must be satisfied with a prescribed probability; this method has practical applicability in SC &L (Jiao et al., 2018). In robust optimisation, the uncertainty sets (the sets of possible values of the uncertain parameters) must be specified, which can be done directly from available data sets; since SC &L problems mostly involve uncertain data, this method seems particularly efficient for addressing uncertainty in this area (Gao et al., 2019). In scenario-based optimisation, uncertainty scenarios are used to find an optimal solution. None of our selected studies used this method, yet it appears to have research potential, provided that the scenarios are constructed from data sets; scenario-based DDO methods are especially applicable in risk management (Baryannis et al., 2019a).
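A minimal illustration of the data-driven idea is the sample-average (scenario-based) newsvendor below, where historical demand observations enter the optimisation directly as the empirical distribution; the cost figures and the demand history are hypothetical and serve only to show the mechanics.

```python
# Minimal data-driven newsvendor sketch: past demand observations are used
# directly (sample average approximation) instead of fitting a distribution.
import numpy as np

rng = np.random.default_rng(7)
demand_history = rng.gamma(shape=5.0, scale=20.0, size=250)  # hypothetical samples

unit_cost, unit_price = 6.0, 10.0
underage = unit_price - unit_cost   # lost margin per unit of unmet demand
overage = unit_cost                 # cost per unsold unit
critical_ratio = underage / (underage + overage)

# Optimal order quantity = empirical demand quantile at the critical ratio.
order_quantity = np.quantile(demand_history, critical_ratio)

expected_profit = np.mean(
    unit_price * np.minimum(demand_history, order_quantity)
    - unit_cost * order_quantity
)
print(f"Order quantity: {order_quantity:.1f}, expected profit: {expected_profit:.1f}")
```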

Considering that BDA applications in SC &L are still under development, employing BDA techniques (e.g., cloud computing or parallel computing) or tools (e.g., Hadoop, Spark, or Map-Reduce solutions) can be considered an important future direction for using DDO methods in decision-making (Ning & You, 2019). Big data-driven optimisation (BDDO) methods, which combine methods for handling big data with DDO techniques, could be of interest for solving several problems in SC &L.
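For readers unfamiliar with such tooling, the sketch below shows the kind of distributed aggregation that Spark enables, computing average delays per route from a hypothetical shipments.csv file with route and delay_minutes columns; the file, column names, and application name are assumptions for illustration only.

```python
# Sketch only: distributed aggregation of shipment delays with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sc-delay-analytics").getOrCreate()

shipments = spark.read.csv("shipments.csv", header=True, inferSchema=True)
delay_by_route = (
    shipments.groupBy("route")
    .agg(F.avg("delay_minutes").alias("avg_delay"),
         F.count("*").alias("n_shipments"))
    .orderBy(F.desc("avg_delay"))
)
delay_by_route.show(10)
spark.stop()
```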

5.2 SC &L processes and decision levels

The framework used in our study revealed the contributions of DS &BDA across SC &L processes. The evaluation of the material from the SC &L process point of view shows that the distribution and procurement processes are discussed less often at all three hierarchical levels of decision-making. Since the SC is a set of hierarchical processes in which the decisions at each level are influenced by those at other levels and processes (Stadtler & Kilger, 2002), more attention should be given to distribution and procurement decisions, especially at the strategic level.

In the process of procurement, most of the studies focused on mid-term decisions (such as order allocation decisions (Kaur & Singh, 2018 ), supplier risk management (Brintrup et al., 2018 ), MRP (Zhao & You, 2019 ), and so forth), while short-term decisions in this process (e.g., ordering materials (Vieira et al., 2019b )) have received the least amount of attention.

Short-term decisions in the production process, such as lot sizing (Gao et al., 2019) and machine scheduling (Simkoff & Baldea, 2019), have received less attention than strategic decisions. In the distribution process, warehouse capacity planning (Oh & Jeong, 2019) and inventory turnover (Zhang et al., 2019) decisions have been partially ignored, representing a visible research gap. In the logistics domain, capacity design (Gao et al., 2019) requires more attention, and shipment size planning (Piendl et al., 2019) has been identified as one of the most important decisions. In the sales process, customer feedback (Hao et al., 2021) is crucial for determining organisational strategies; however, this field has not received enough attention so far.

5.3 DS &BDA approaches, techniques, and tools

Our results demonstrate that a wide range of models and techniques can be used in the SC &L area. Nevertheless, some techniques are employed less often. For instance, OLAP is a powerful technique behind many BI software solutions, but it is rarely employed in the models, even though OLAP is designed for processing multidimensional data or data collected from different databases, which are routine issues in the SC &L area. As another example, in data clustering, alternative techniques such as fuzzy k-modes, k-medoids, and fuzzy c-means are rarely used in the reviewed articles. For instance, Kuvvetli and Firuzan (2019) apply k-means clustering to classify the number of traffic accidents in urban public transportation; however, the model is not benchmarked against other clustering techniques such as k-medoids or fuzzy c-means to verify that the selected clustering technique is more accurate or efficient than the alternatives.
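One straightforward way to perform such a benchmark is to fit several clustering methods to the same data and compare an internal validity index; the short sketch below does this for k-means and agglomerative clustering using the silhouette score on synthetic data, purely as an illustration of the kind of check we suggest.

```python
# Sketch: comparing two clustering techniques on the same synthetic data
# with the silhouette score as a simple internal validity check.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=0)

candidates = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    print(f"{name}: silhouette = {silhouette_score(X, labels):.3f}")
```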

Our study on the types of analytics indicates that the predictive analytics approach has attracted the most attention. Nevertheless, this approach has its own challenges: executing predictive analytics techniques is time-consuming and requires iterative stages of testing, adaptation, and evaluation of results (Arunachalam et al., 2018). The majority of the studies have not discussed these challenges. Machine learning is one of the most effective AI methods for analysing and learning from data. Some articles use machine learning methods but refer to them only under the broader label of “AI”, which encompasses a much wider range of techniques (Li et al., 2021a).

Among the machine learning techniques used in the examined literature, deep learning (Punia et al., 2020; Kilimci et al., 2019) and ensemble learning (Zhu et al., 2019b) have received very limited attention, even though these techniques can substantially improve prediction accuracy (Baryannis et al., 2019a). Moreover, “transfer learning” and “reinforcement learning” have not been employed in the examined literature; these methods can further enhance neural network and deep learning approaches.

5.4 Big data analytics (BDA)

Although some scholars have argued in favour of BDA approaches, they have not fully addressed BDA challenges such as data generation, data integration, and BDA techniques (Arunachalam et al., 2018; Novais et al., 2019). Among the 227 examined articles, 107 used the buzzword “Big Data” in their publications, but only a few of them (we found 13 articles) focused on big data characteristics, techniques, and architectures. We therefore suppose that the rest probably used large data sets, but not necessarily big data. Considering the special characteristics of big data, studies on BDA should unequivocally and practically refer to big data techniques (Chen et al., 2014; Grover & Kar, 2017; Arunachalam et al., 2018; Brinch, 2018).

Since big data in SC &L can be generated from various SC processes and from different data collection resources (such as GPS, sensors, and multimedia tools), extracting knowledge from these various types of data is another concern in BDA. The diversity of the data is anticipated to increase in the future (Baryannis et al., 2019b); thus, integration in data analysis is an important debate in BDA, and it is expected that researchers will focus considerably on data integration in the future.

BDA implementation, like other analytical tools and types of process monitoring, is time-consuming and requires management commitment. Executive BDA challenges, such as strategic management, business process management, knowledge management and performance measurement, need to be reviewed and analysed (Brinch, 2018 ; Choi et al., 2018 ; Kamble & Gunasekaran, 2020 ). Moreover, instead of focusing on some limited performance metrics, the key performance indicators of an SC &L company, such as financial or profitability indicators, must be monitored for proper BDA implementation. In the future, with the development of BDA techniques, such as the proposed BDDO techniques, some prescriptive analytics approaches will become more preferred (Arunachalam et al., 2018 ).

5.5 Data collection and generation

Unstructured data, such as data extracted from social media and websites, are rich sources of data acquisition that seem to be ignored in the SC &L literature and should be considered more in the future. Besides, in order to extract more value from DS &BDA approaches, real-time data is much more reliable than historical data because it better describes SC behaviour (Nguyen et al., 2018). Therefore, SC &L companies should rapidly adopt analytics with real-time processing tools. IoT, RFID, and sensor devices are technologies that facilitate real-time data capture (Zhu, 2018; Zhong et al., 2016), and it is suggested that these tools be used in any of the real-time processes in SC &L. A special role in this area will be played by digital twins and associated technologies for real-time data collection such as 5G (Ivanov et al., 2021a; Choi et al., 2022; Ivanov & Dolgui, 2021a; Dolgui & Ivanov, 2022; Ivanov et al., 2022).
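As a minimal, purely illustrative sketch of real-time processing, the snippet below computes a rolling average over a simulated stream of cold-chain temperature readings and raises an alert when a threshold is exceeded; the readings, window length, and threshold are all hypothetical.

```python
# Minimal sketch: rolling-window average over a simulated sensor stream.
import random
from collections import deque

def sensor_stream(n_readings):
    """Yield hypothetical cold-chain temperature readings (degrees C)."""
    for _ in range(n_readings):
        yield 4.0 + random.gauss(0.0, 0.8)

window = deque(maxlen=12)   # e.g. the last hour of readings at 5-minute intervals
for i, reading in enumerate(sensor_stream(60), start=1):
    window.append(reading)
    rolling_avg = sum(window) / len(window)
    if rolling_avg > 5.0:   # hypothetical alert threshold
        print(f"reading {i}: rolling average {rolling_avg:.2f} C exceeds threshold")
```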

5.6 SC design

Analysis of the DS &BDA models indicated that only a few papers consider not only efficient but also sustainable and resilient network designs. Table 6 illustrates that there is a large gap in the literature with respect to considering DS &BDA concepts in resilient SCs. Belhadi et al. (2022) confirm that the COVID-19 pandemic made SCs focus on resilience principles and affirm that DS &BDA techniques strongly support SC resilience strategies.

Our observations on employing DS &BDA techniques in sustainable SCs reveal another future direction for SC &L research. We found that only 5% of the studies consider sustainability concepts in their models. Although considering environmental and human impacts in SC design is a contemporary subject for SCs, Tsolakis et al. (2021) acknowledge that Industry 4.0 and the Internet of Things necessitate the application of DS &BDA techniques, but with deliberate attention to social and environmental aspects in line with SCs’ progress. The authors confirm that the recent literature has not adequately covered the sustainability implications of DS &BDA innovations.

Closed-loop SC and reverse logistics are also among the rarely considered design configurations for DS &BDA models. Govindan and Gholizadeh (2021) concede that analysing the processes of a closed-loop SC requires big data, and that once sustainability and resilience features are incorporated into the model, a BDA model is capable of addressing the problems posed by such SCs. This means that the volume, velocity, and variety of the input data should be considered in the models.

5.7 COVID-19 and pandemic

Since 2020, the COVID-19 pandemic has posed significant challenges for SCs, with different SC echelons having to collaborate under deep uncertainty. Academic research has introduced several new models and frameworks (Ivanov & Dolgui, 2021b; Ivanov, 2021c; Ardolino et al., 2022), and we identified several studies within this research stream in our selected data set. For instance, Barnes et al. (2021) study consumer behaviour during the pandemic, termed “panic buying”; using big data from social media, the authors apply text mining together with compensatory control theory to provide early warning of potential demand problems. Nikolopoulos et al. (2021) study forecasting and planning during a pandemic using a nearest-neighbours clustering method, employing Google Trends data to predict COVID-19 growth rates and to model excess demand for products.

One of the central questions regarding the pandemic is how to design a pandemic-resilient SC (Ivanov & Dolgui, 2020; Nikolopoulos et al., 2021; Ivanov & Dolgui, 2021a; Ivanov, 2021a; Choi et al., 2022) and how to adapt to the “new normal” (Bag et al., 2022a; Ivanov, 2021b). Emphasising the role of BDA in SC &L, Belhadi et al. (2022) examine the effect of the COVID-19 outbreak on manufacturing and service SC resilience. Kar et al. (2022) investigate the effect of fake news on consumer buying behaviour during the pandemic, focusing on how the resulting fear drives hoarding of necessary products. SC performance in the COVID-19 era is also investigated by researchers through BDA (Li et al., 2022b; Rozhkov et al., 2022). Although several studies have contributed to the use of DS &BDA approaches in this area, the literature still needs a dedicated survey study similar to Ardolino et al. (2021) and Queiroz et al. (2022). Novel contributions can be made with BDA and DS applications in the context of SC viability and Industry 5.0 (Ivanov, 2023; Ivanov & Keskin, 2023).

6 Conclusion and research directions

In this study, we proposed a literature review methodology and a holistic conceptual framework for classifying the applications of DS &BDA in SC &L. An investigation of the relevant review studies illustrated several gaps in former studies, which motivated us to build our reviewing process around a conceptual framework. Our broad keyword search initially found a large variety of papers published from 2005 to 2022. Employing a detailed review protocol and process, we selected 364 publications from highly ranked journals and focused on studies using DS &BDA modelling methods for solving SC &L problems. We revealed the contributions of DS &BDA in SC &L processes and highlighted the potential for future studies in each SC &L process. We also indicated the effective role of DS &BDA applications/techniques across the three hierarchical decision levels. Three main types of analytics were used to categorise DS &BDA techniques and tools. The overall results indicated that the predictive approach is the most popular one; however, with the development of BDA techniques and DDO approaches, the prescriptive approach is likely to become more attractive in the future. We also emphasised the deployment of effective deep learning, ensemble learning, and machine learning techniques in SCs. In the area of SC design, we provided a structured and unbiased review of the DS &BDA application areas in SC &L, comprehensively covering the efficiency, resilience, and sustainability paradigms.

As with any study, limitations exist. Although we conducted a systematic literature review, the selected papers were restricted by our inclusion and exclusion criteria. We tried to include all relevant papers and selected highly ranked journals to increase the quality of the research; nevertheless, a larger data set including computer science/engineering conferences and journals may allow a better exploration of the literature. This would also reduce the echo-chamber effect of citations, in which a specific subset of journals keeps citing, and validating, each other. The proposed conceptual framework may need to be extended, especially for prescriptive analytics approaches. Also, our interpretation of the literature may be subject to bias. The material collection process showed that studies on the topic of DS &BDA in SC &L are growing substantially; therefore, annual survey studies on this topic (with a broad range of keywords) are suggested for future research. Furthermore, any of the main approaches in DS &BDA applications (such as clustering, classification, simulation, text mining, or time series analysis) could be investigated separately in SC &L.

Footnote 1: 2019 Australian Business Deans Council (ABDC) journal rank, https://abdc.edu.au/research/abdc-journal-quality-list/.

Footnote 2: 2020 Scimago Journal & Country Rank (SJR), https://www.scimagojr.com/journalrank.php.

Footnote 3: https://www.scimagojr.com/journalrank.php.

Abbasi, B., Babaei, T., Hosseinifard, Z., Smith-Miles, K., & Dehghani, M. (2020). Predicting solutions of large-scale optimization problems via machine learning: A case study in blood supply chain management. Computers and Operations Research, 119 , 104941.


Addo-Tenkorang, R., & Helo, P. T. (2016). Big data applications in operations/supply-chain management: A literature review. Computers and Industrial Engineering, 101 , 528–543.

Aguilar-Palacios, C., Muñoz-Romero, S., & Rojo-Álvarez, J. L. (2019). Forecasting promotional sales within the neighbourhood. IEEE Access, 7 , 74759–74775.

Akinade, O. O., & Oyedele, L. O. (2019). Integrating construction supply chains within a circular economy: An ANFIS-based waste analytics system (A-WAS). Journal of Cleaner Production, 229 , 863–873.

Alahmadi, D., & Jamjoom, A. (2022). Decision support system for handling control decisions and decision-maker related to supply chain. Journal of Big Data, 9 (1).

Alhameli, F., Ahmadian, A., & Elkamel, A. (2021). Multiscale decision-making for enterprise-wide operations incorporating clustering of high-dimensional attributes and big data analytics: Applications to energy hub. Energies, 14 (20).

Aloini, D., Benevento, E., Stefanini, A., & Zerbino, P. (2019). Process fragmentation and port performance: Merging SNA and text mining. International Journal of Information Management, 51 , 101925.

Altintas, N., & Trick, M. (2014). A data mining approach to forecast behavior. Annals of Operations Research, 216 (1), 3–22.

Ameri Sianaki, O., Yousefi, A., Tabesh, A. R., & Mahdavi, M. (2019). Machine learning applications: The past and current research trend in diverse industries. Inventions, 4 (1), 8.

Amoozad Mahdiraji, H., Yaftiyan, F., Abbasi-Kamardi, A., & Garza-Reyes, J. (2022). Investigating potential interventions on disruptive impacts of Industry 4.0 technologies in circular supply chains: Evidence from SMEs of an emerging economy. Computers and Industrial Engineering, 174 .

Analytics, T. S. C. (2020). Top supply chain analytics: 50 useful software solutions and data analysis tools to gain valuable supply chain insights. Visited on 2020-01-31. www.camcode.com/asset-tags/top-supply-chain-analytics/

Anparasan, A. A., & Lejeune, M. A. (2018). Data laboratory for supply chain response models during epidemic outbreaks. Annals of Operations Research, 270 (1–2), 53–64.

Antomarioni, S., Lucantoni, L., Ciarapica, F. E., & Bevilacqua, M. (2021). Data-driven decision support system for managing item allocation in an ASRS: A framework development and a case study. Expert Systems with Applications, 185 , 115622.

Arbabzadeh, N., & Jafari, M. (2017). A data-driven approach for driving safety risk prediction using driver behavior and roadway information data. IEEE Transactions on Intelligent Transportation Systems, 19 (2), 446–460.

Ardolino, M., Bacchetti, A., Dolgui, A., Franchini, G., Ivanov, D., & Nair, A. (2022). The Impacts of digital technologies on coping with the COVID-19 pandemic in the manufacturing industry: A systematic literature review. International Journal of Production Research , 1–24.

Ardolino, M., Bacchetti, A., & Ivanov, D. (2021). Analysis of the COVID-19 pandemic’s impacts on manufacturing: A systematic literature review and future research agenda. Operations Management Research .

Arunachalam, D., Kumar, N., & Kawalek, J. P. (2018). Understanding big data analytics capabilities in supply chain management: Unravelling the issues, challenges and implications for practice. Transportation Research Part E: Logistics and Transportation Review, 114 , 416–436.

Bag, S., Choi, T.-M., Rahman, M., Srivastava, G., & Singh, R. (2022a). Examining collaborative buyer-supplier relationships and social sustainability in the “new normal” era: The moderating effects of justice and big data analytical intelligence. Annals of Operations Research , 1–46.

Bag, S., Gupta, S., & Wood, L. (2022). Big data analytics in sustainable humanitarian supply chain: Barriers and their interactions. Annals of Operations Research, 319 (1), 721–760.

Bag, S., Luthra, S., Mangla, S., & Kazancoglu, Y. (2021). Leveraging big data analytics capabilities in making reverse logistics decisions and improving remanufacturing performance. International Journal of Logistics Management, 32 (3), 742–765.


Bahaghighat, M., Akbari, L., & Xin, Q. (2019). A machine learning-based approach for counting blister cards within drug packages. IEEE Access, 7 , 83785–83796.

Baker, T., Jayaraman, V., & Ashley, N. (2013). A data-driven inventory control policy for cash logistics operations: An exploratory case study application at a financial institution. Decision Sciences, 44 (1), 205–226.

Ballings, M., & Van den Poel, D. (2012). Customer event history for churn prediction: How long is long enough? Expert Systems with Applications, 39 (18), 13517–13522.

Bányai, T., Illés, B., & Bányai, Á. (2018). Smart scheduling: An integrated first mile and last mile supply approach. Complexity, 2018 .

Bao, J., Liu, P., & Ukkusuri, S. V. (2019). A spatiotemporal deep learning approach for citywide short-term crash risk prediction with multi-source data. Accident Analysis and Prevention, 122 , 239–254.

Barnes, S. J., Diaz, M., & Arnaboldi, M. (2021). Understanding panic buying during COVID-19: A text analytics approach. Expert Systems with Applications, 169 , 114360.

Barraza, N., Moro, S., Ferreyra, M., & de la Peña, A. (2019). Mutual information and sensitivity analysis for feature selection in customer targeting: A comparative study. Journal of Information Science, 45 (1), 53–67.

Baryannis, G., Dani, S., & Antoniou, G. (2019). Predicting supply chain risks using machine learning: The trade-off between performance and interpretability. Future Generation Computer Systems, 101 , 993–1004.

Baryannis, G., Validi, S., Dani, S., & Antoniou, G. (2019). Supply chain risk management and artificial intelligence: State of the art and future research directions. International Journal of Production Research, 57 (7), 2179–2202.

Belhadi, A., Kamble, S., Fosso Wamba, S., & Queiroz, M. (2022). Building supply-chain resilience: An artificial intelligence-based technique and decision-making framework. International Journal of Production Research, 60 (14), 4487–4507.

Benzidia, S., Makaoui, N., & Bentahar, O. (2021). The impact of big data analytics and artificial intelligence on green supply chain process integration and hospital environmental performance. Technological Forecasting and Social Change, 165 , 120557.

Bhattacharya, A., Kumar, S. A., Tiwari, M., & Talluri, S. (2014). An intermodal freight transport system for optimal supply chain logistics. Transportation Research Part C: Emerging Technologies, 38 , 73–84.

Blackburn, R., Lurz, K., Priese, B., Göb, R., & Darkow, I.-L. (2015). A predictive analytics approach for demand forecasting in the process industry. International Transactions in Operational Research, 22 (3), 407–428.

Bodendorf, F., Dimitrov, G., & Franke, J. (2022a). Analyzing and evaluating supplier carbon footprints in supply networks. Journal of Cleaner Production, 372 .

Bodendorf, F., Merkl, P., & Franke, J. (2022). Artificial neural networks for intelligent cost estimation-a contribution to strategic cost management in the manufacturing supply chain. International Journal of Production Research, 60 (21), 6637–6658.

Boutselis, P., & McNaught, K. (2019). Using Bayesian networks to forecast spares demand from equipment failures in a changing service logistics context. International Journal of Production Economics, 209 , 325–333.

Bouzembrak, Y., & Marvin, H. J. (2019). Impact of drivers of change, including climatic factors, on the occurrence of chemical food safety hazards in fruits and vegetables: A Bayesian Network approach. Food Control, 97 , 67–76.

Brinch, M. (2018). Understanding the value of big data in supply chain management and its business processes. International Journal of Operations and Production Management .

Brintrup, A., Pak, J., Ratiney, D., Pearce, T., Wichmann, P., Woodall, P., & McFarlane, D. (2020). Supply chain data analytics for predicting supplier disruptions: A case study in complex asset manufacturing. International Journal of Production Research, 58 (11), 3330–3341.

Brintrup, A., Wichmann, P., Woodall, P., McFarlane, D., Nicks, E., & Krechel, W. (2018). Predicting hidden links in Supply Networks. Complexity, 2018 .

Bucur, P. A., Hungerländer, P., & Frick, K. (2019). Quality classification methods for ball nut assemblies in a multi-view setting. Mechanical Systems and Signal Processing, 132 , 72–83.

Burgos, D., & Ivanov, D. (2021). Food retail supply chain resilience and the COVID-19 pandemic: A digital twin-based impact analysis and improvement directions. Transportation Research Part E: Logistics and Transportation Review, 152 , 102412.

Carbonneau, R., Laframboise, K., & Vahidov, R. (2008). Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184 (3), 1140–1154.

Cavalcante, I. M., Frazzon, E. M., Forcellini, F. A., & Ivanov, D. (2019). A supervised machine learning approach to data-driven simulation of resilient supplier selection in digital manufacturing. International Journal of Information Management, 49 , 86–97.

Cavallo, D. P., Cefola, M., Pace, B., Logrieco, A. F., & Attolico, G. (2019). Non-destructive and contactless quality evaluation of table grapes by a computer vision system. Computers and Electronics in Agriculture, 156 , 558–564.

Celik, N., Lee, S., Vasudevan, K., & Son, Y.-J. (2010). DDDAS-based multi-fidelity simulation framework for supply chain systems. IIE Transactions, 42 (5), 325–341.

Celik, N., & Son, Y.-J. (2012). Sequential Monte Carlo-based fidelity selection in dynamic-data-driven adaptive multi-scale simulations. International Journal of Production Research, 50 (3), 843–865.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19 (2), 171–209.

Chen, M.-C., Huang, C.-L., Chen, K.-Y., & Wu, H.-P. (2005). Aggregation of orders in distribution centers using data mining. Expert Systems with Applications, 28 (3), 453–460.

Chen, M.-C., & Wu, H.-P. (2005). An association-based clustering approach to order batching considering customer demand patterns. Omega, 33 (4), 333–343.

Chen, R., Wang, Z., Yang, L., Ng, C., & Cheng, T. (2022). A study on operational risk and credit portfolio risk estimation using data analytics. Decision Sciences, 53 (1), 84–123.

Chen, W., Song, J., Shi, L., Pi, L., & Sun, P. (2013). Data mining-based dispatching system for solving the local pickup and delivery problem. Annals of Operations Research, 203 (1), 351–370.

Chen, X., Liu, L., & Guo, X. (2021). Analysing repeat blood donation behavior via big data. Industrial Management and Data Systems, 121 (2), 192–208.

Chen, Y.-S., Cheng, C.-H., & Lai, C.-J. (2012). Extracting performance rules of suppliers in the manufacturing industry: An empirical study. Journal of Intelligent Manufacturing, 23 (5), 2037–2045.

Chen, Y.-T., Sun, E., Chang, M.-F., & Lin, Y.-B. (2021b). Pragmatic real-time logistics management with traffic IoT infrastructure: Big data predictive analytics of freight travel time for Logistics 4.0. International Journal of Production Economics, 238 .

Chi, H.-M., Ersoy, O. K., Moskowitz, H., & Ward, J. (2007). Modeling and optimizing a vendor managed replenishment system using machine learning and genetic algorithms. European Journal of Operational Research, 180 (1), 174–193.

Choi, T.-M., Dolgui, A., Ivanov, D., & Pesch, E. (2022). OR and analytics for digital, resilient, and sustainable manufacturing 4.0. Annals of Operations Research, 310 (1), 1–6.

Choi, T.-M., Wallace, S. W., & Wang, Y. (2018). Big data analytics in operations management. Production and Operations Management, 27 (10), 1868–1883.

Choy, K., Tan, K., & Chan, F. (2007). Design of an intelligent supplier knowledge management system: An integrative approach. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 221 (2), 195–211.

Chuang, Y.-F., Chia, S.-H., & Yih Wong, J. (2013). Customer value assessment of pharmaceutical marketing in Taiwan. Industrial Management and Data Systems, 113 (9), 1315–1333.

Çimen, M., & Kirkbride, C. (2017). Approximate dynamic programming algorithms for multidimensional flexible production-inventory problems. International Journal of Production Research, 55 (7), 2034–2050.

Coussement, K., Lessmann, S., & Verstraeten, G. (2017). A comparative analysis of data preparation algorithms for customer churn prediction: A case study in the telecommunication industry. Decision Support Systems, 95 , 27–36.

Cui, R., Gallino, S., Moreno, A., & Zhang, D. J. (2018). The operational value of social media information. Production and Operations Management, 27 (10), 1749–1769.

Cui, R., Li, M., & Zhang, S. (2022). AI and procurement. Manufacturing and Service Operations Management, 24 (2), 691–706.

Dai, J., Xie, Y., Xu, J., & Lv, C. (2020). Environmentally friendly equilibrium strategy for coal distribution center site selection. Journal of Cleaner Production, 246 , 119017.

Dai, Y., Dou, L., Song, H., Zhou, L., & Li, H. (2022). Two-way information sharing of uncertain demand forecasts in a dual-channel supply chain. Computers and Industrial Engineering, 169 .

De Caigny, A., Coussement, K., & De Bock, K. W. (2018). A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. European Journal of Operational Research, 269 (2), 760–772.

De Clercq, D., Jalota, D., Shang, R., Ni, K., Zhang, Z., Khan, A., Wen, Z., Caicedo, L., & Yuan, K. (2019). Machine learning powered software for accurate prediction of biogas production: A case study on industrial-scale Chinese production data. Journal of Cleaner Production, 218 , 390–399.

De Giovanni, P., Belvedere, V., & Grando, A. (2022). The selection of industry 4.0 technologies through Bayesian networks: An operational perspective. IEEE Transactions on Engineering Management , 1–16.

Dev, N. K., Shankar, R., Gunasekaran, A., & Thakur, L. S. (2016). A hybrid adaptive decision system for supply chain reconfiguration. International Journal of Production Research, 54 (23), 7100–7114.

Di Ciccio, C., Van der Aa, H., Cabanillas, C., Mendling, J., & Prescher, J. (2016). Detecting flight trajectory anomalies and predicting diversions in freight transportation. Decision Support Systems, 88 , 1–17.

Dolgui, A., & Ivanov, D. (2022). 5G in digital supply chain and operations management: Fostering flexibility, end-to-end connectivity and real-time visibility through internet-of-everything. International Journal of Production Research, 60 (2), 442–451.

Dombi, J., Jónás, T., & Tóth, Z. E. (2018). Modeling and long-term forecasting demand in spare parts logistics businesses. International Journal of Production Economics, 201 , 1–17.

Doolun, I. S., Ponnambalam, S., Subramanian, N., & Kanagaraj, G. (2018). Data driven hybrid evolutionary analytical approach for multi objective location allocation decisions: Automotive green supply chain empirical evidence. Computers and Operations Research, 98 , 265–283.

Ehmke, J. F., Campbell, A. M., & Thomas, B. W. (2016). Data-driven approaches for emissions-minimized paths in urban areas. Computers and Operations Research, 67 , 34–47.

Ehmke, J. F., Meisel, S., & Mattfeld, D. C. (2012). Floating car based travel times for city logistics. Transportation Research Part C: Emerging Technologies, 21 (1), 338–352.

Eltoukhy, A. E., Wang, Z., Chan, F. T., & Fu, X. (2019). Data analytics in managing aircraft routing and maintenance staffing with price competition by a Stackelberg–Nash game model. Transportation Research Part E: Logistics and Transportation Review, 122 , 143–168.

Farid, A., Abdel-Aty, M., & Lee, J. (2019). Comparative analysis of multiple techniques for developing and transferring safety performance functions. Accident Analysis and Prevention, 122 , 85–98.

Figueiras, P., Gonçalves, D., Costa, R., Guerreiro, G., Georgakis, P., & Jardim-Gonçalves, R. (2019). Novel Big Data-supported dynamic toll charging system: Impact assessment on Portugal’s shadow-toll highways. Computers and Industrial Engineering, 135 , 476–491.

Flores, H., & Villalobos, J. R. (2020). A stochastic planning framework for the discovery of complementary, agricultural systems. European Journal of Operational Research, 280 (2), 707–729.

Fu, W., & Chien, C.-F. (2019). UNISON data-driven intermittent demand forecast framework to empower supply chain resilience and an empirical study in electronics distribution. Computers and Industrial Engineering, 135 , 940–949.

Fukuda, S., Yasunaga, E., Nagle, M., Yuge, K., Sardsud, V., Spreer, W., & Müller, J. (2014). Modelling the relationship between peel colour and the quality of fresh mango fruit using Random Forests. Journal of Food Engineering, 131 , 7–17.

Gan, M., Yang, S., Li, D., Wang, M., Chen, S., Xie, R., & Liu, J. (2018). A novel intensive distribution logistics network design and profit allocation problem considering sharing economy. Complexity, 2018 .

Gao, J., Ning, C., & You, F. (2019). Data-driven distributionally robust optimization of shale gas supply chains under uncertainty. AIChE Journal, 65 (3), 947–963.

Garcia, S., Cordeiro, A., de Alencar Nääs, I., & Neto, P. L. (2019). The sustainability awareness of Brazilian consumers of cotton clothing. Journal of Cleaner Production, 215 , 1490–1502.

Ghasri, M., Maghrebi, M., Rashidi, T. H., & Waller, S. T. (2016). Hazard-based model for concrete pouring duration using construction site and supply chain parameters. Automation in Construction, 71 , 283–293.

Göçmen, E., & Erol, R. (2019). Transportation problems for intermodal networks: Mathematical models, exact and heuristic algorithms, and machine learning. Expert Systems with Applications, 135 , 374–387.

Gopal, P., Rana, N., Krishna, T., & Ramkumar, M. (2022). Impact of big data analytics on supply chain performance: An analysis of influencing factors. Annals of Operations Research , 1–29.

Gordini, N., & Veglio, V. (2017). Customers churn prediction and marketing retention strategies. An application of support vector machines based on the AUC parameter-selection technique in B2B e-commerce industry. Industrial Marketing Management, 62 , 100–107.

Govindan, K., Cheng, T., Mishra, N., & Shukla, N. (2018). Big data analytics and application for logistics and supply chain management.

Govindan, K., & Gholizadeh, H. (2021). Robust network design for sustainable-resilient reverse logistics network using big data: A case study of end-of-life vehicles. Transportation Research Part E: Logistics and Transportation Review, 149 , 102279.

Grover, P., & Kar, A. K. (2017). Big data analytics: A review on theoretical contributions and tools used in literature. Global Journal of Flexible Systems Management, 18 (3), 203–229.

Gružauskas, V., Gimžauskienė, E., & Navickas, V. (2019). Forecasting accuracy influence on logistics clusters activities: The case of the food industry. Journal of Cleaner Production, 240 , 118225.

Grzybowska, H., Kerferd, B., Gretton, C., & Waller, S. T. (2020). A simulation-optimisation genetic algorithm approach to product allocation in vending machine systems. Expert Systems with Applications, 145 , 113110.

Gumus, A. T., Guneri, A. F., & Keles, S. (2009). Supply chain network design using an integrated neuro-fuzzy and MILP approach: A comparative design study. Expert Systems with Applications, 36 (10), 12570–12577.

Gunduz, M., Demir, S., & Paksoy, T. (2021). Matching functions of supply chain management with smart and sustainable Tools: A novel hybrid BWM-QFD based method. Computers and Industrial Engineering, 162 .

GuoHua, Z., Wei, W., et al. (2021). Study of the game model of E-commerce information sharing in an agricultural product supply chain based on fuzzy big data and LSGDM. Technological Forecasting and Social Change, 172 , 121017.

Gürbüz, F., Eski, İ, Denizhan, B., & Dağlı, C. (2019). Prediction of damage parameters of a 3PL company via data mining and neural networks. Journal of Intelligent Manufacturing, 30 (3), 1437–1449.

Ha, S. H., & Krishnan, R. (2008). A hybrid approach to supplier selection for the maintenance of a competitive supply chain. Expert Systems with Applications, 34 (2), 1303–1311.

Hägele, S., Grosse, E. H., & Ivanov, D. (2023). Supply chain resilience: A tertiary study. International Journal of Integrated Supply Management, 16 (1), 52–81.

Han, S., Cao, B., Fu, Y., & Luo, Z. (2018). A liner shipping competitive model with consideration of service quality management. Annals of Operations Research, 270 (1–2), 155–177.

Han, S., Fu, Y., Cao, B., & Luo, Z. (2018). Pricing and bargaining strategy of e-retail under hybrid operational patterns. Annals of Operations Research, 270 (1–2), 179–200.

Hao, H., Guo, J., Xin, Z., & Qiao, J. (2021). Research on e-commerce distribution optimization of rice agricultural products based on consumer satisfaction. IEEE Access, 9 , 135304–135315.

Heger, J., Branke, J., Hildebrandt, T., & Scholz-Reiter, B. (2016). Dynamic adjustment of dispatching rule parameters in flow shops with sequence-dependent set-up times. International Journal of Production Research, 54 (22), 6812–6824.

Ho, C.-T.B., Koh, S. L., Mahamaneerat, W. K., Shyu, C.-R., Ho, S.-C., & Chang, C. A. (2007). Domain-concept association rules mining for large-scale and complex cellular manufacturing tasks. Journal of Manufacturing Technology Management, 18 (7), 787–806.

Ho, G. T., Lau, H. C., Kwok, S., Lee, C. K., & Ho, W. (2009). Development of a co-operative distributed process mining system for quality assurance. International Journal of Production Research, 47 (4), 883–918.

Hogenboom, A., Ketter, W., Van Dalen, J., Kaymak, U., Collins, J., & Gupta, A. (2015). Adaptive tactical pricing in multi-agent supply chain markets using economic regimes. Decision Sciences, 46 (4), 791–818.

Hojati, A. T., Ferreira, L., Washington, S., & Charles, P. (2013). Hazard based models for freeway traffic incident duration. Accident Analysis and Prevention, 52 , 171–181.

Homayouni, Z., Pishvaee, M. S., Jahani, H., & Ivanov, D. (2021). A robust-heuristic optimization approach to a green supply chain design with consideration of assorted vehicle types and carbon policies under uncertainty. Annals of Operations Research , 1–41.

Hong, G.-H., & Ha, S. H. (2008). Evaluating supply partner’s capability for seasonal products using machine learning techniques. Computers and Industrial Engineering, 54 (4), 721–736.

Hosseini, S., & Al Khaled, A. (2019). A hybrid ensemble and AHP approach for resilient supplier selection. Journal of Intelligent Manufacturing, 30 (1), 207–228.

Hou, F., Li, B., Chong, A.Y.-L., Yannopoulou, N., & Liu, M. J. (2017). Understanding and predicting what influence online product sales? A neural network approach. Production Planning and Control, 28 (11–12), 964–975.

Hsiao, Y.-C., Wu, M.-H., & Li, S. C. (2019). Elevated performance of the smart city: A case study of the IoT by innovation mode. IEEE Transactions on Engineering Management, 68 (5), 1461–1475.

Huber, J., Gossmann, A., & Stuckenschmidt, H. (2017). Cluster-based hierarchical demand forecasting for perishable goods. Expert Systems with Applications, 76 , 140–151.

Ialongo, L. N., de Valk, C., Marchese, E., Jansen, F., Zmarrou, H., Squartini, T., & Garlaschelli, D. (2022). Reconstructing firm-level interactions in the Dutch input-output network from production constraints. Scientific Reports, 12 (1), 1–12.

Iftikhar, A., Ali, I., Arslan, A., & Tarba, S. (2022a). Digital innovation, data analytics, and supply chain resiliency: A bibliometric-based systematic literature review. Annals of Operations Research , 1–24.

Iftikhar, A., Purvis, L., Giannoccaro, I., & Wang, Y. (2022b). The impact of supply chain complexities on supply chain resilience: The mediating effect of big data analytics. Production Planning and Control , 1–21.

Iranitalab, A., & Khattak, A. (2017). Comparison of four statistical and machine learning methods for crash severity prediction. Accident Analysis and Prevention, 108 , 27–36.

Islam, S., & Amin, S. H. (2020). Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting Machine learning techniques. Journal of Big Data, 7 (1), 1–22.

Ivanov, D. (2021a). Digital supply chain management and technology to enhance resilience by building and using end-to-end visibility during the COVID-19 pandemic. IEEE Transactions on Engineering Management , 1–11.

Ivanov, D. (2021b). Exiting the COVID-19 pandemic: After-shock risks and avoidance of disruption tails in supply chains. Annals of Operations Research , 1–18.

Ivanov, D. (2021). Supply Chain Viability and the COVID-19 pandemic: A conceptual and formal generalisation of four major adaptation strategies. International Journal of Production Research, 59 (12), 3535–3552.

Ivanov, D. (2023). The industry 5.0 framework: Viability-based integration of the resilience, sustainability, and human-centricity perspectives. International Journal of Production Research , 61 (5), 1683–1695.

Ivanov, D., & Dolgui, A. (2020). Viability of intertwined supply networks: extending the supply chain resilience angles towards survivability. A position paper motivated by COVID-19 outbreak. International Journal of Production Research, 58 (10), 2904–2915.

Ivanov, D., & Dolgui, A. (2021). A digital supply chain twin for managing the disruption risks and resilience in the era of Industry 4.0. Production Planning and Control, 32 (9), 775–788.

Ivanov, D., & Dolgui, A. (2021). OR-methods for coping with the ripple effect in supply chains during COVID-19 pandemic: Managerial insights and research implications. International Journal of Production Economics, 232 , 107921.

Ivanov, D., Dolgui, A., & Sokolov, B. (2022). Cloud supply chain: Integrating industry 4.0 and digital platforms in the “supply chain-as-a-service’’. Transportation Research Part E: Logistics and Transportation Review, 160 , 102676.

Ivanov, D., & Keskin, B. B. (2023). Post-pandemic adaptation and development of supply chain viability theory. Omega, 116 , 102806.

Ivanov, D., Tang, C. S., Dolgui, A., Battini, D., & Das, A. (2021). Researchers’ perspectives on Industry 4.0: Multi-disciplinary analysis and opportunities for operations management. International Journal of Production Research, 59 (7), 2055–2078.

Ivanov, D., Tsipoulanidis, A., Schönberger, J., et al. (2021). Global supply chain and operations management . Springer.

Jain, R., Singh, A., Yadav, H., & Mishra, P. (2014). Using data mining synergies for evaluating criteria at pre-qualification stage of supplier selection. Journal of Intelligent Manufacturing, 25 (1), 165–175.

Jain, V., Wadhwa, S., & Deshmukh, S. (2007). Supplier selection using fuzzy association rules mining approach. International Journal of Production Research, 45 (6), 1323–1353.

Jeong, H., Jang, Y., Bowman, P. J., & Masoud, N. (2018). Classification of motor vehicle crash injury severity: A hybrid approach for imbalanced data. Accident Analysis and Prevention, 120 , 250–261.

Ji, G., Yu, M., Tan, K., Kumar, A., & Gupta, S. (2022). Decision optimization in cooperation innovation: the impact of big data analytics capability and cooperative modes. Annals of Operations Research , 1–24.

Jiang, C., & Sheng, Z. (2009). Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36 (3), 6520–6526.

Jiang, W. (2019). An intelligent supply chain information collaboration model based on Internet of things and big data. IEEE Access, 7 , 58324–58335.

Jiao, Z., Ran, L., Zhang, Y., Li, Z., & Zhang, W. (2018). Data-driven approaches to integrated closed-loop sustainable supply chain design under multi-uncertainties. Journal of Cleaner Production, 185 , 105–127.

Jula, P., & Leachman, R. C. (2011). Long-and short-run supply-chain optimization models for the allocation and congestion management of containerized imports from Asia to the United States. Transportation Research Part E: Logistics and Transportation Review, 47 (5), 593–608.

Jung, S., Hong, S., & Lee, K. (2018). A data-driven air traffic sequencing model based on pairwise preference learning. IEEE Transactions on Intelligent Transportation Systems, 20 (3), 803–816.

Kamble, S., Belhadi, A., Gunasekaran, A., Ganapathy, L., & Verma, S. (2021a). A large multi-group decision-making technique for prioritizing the big data-driven circular economy practices in the automobile component manufacturing industry. Technological Forecasting and Social Change, 165 .

Kamble, S. S., & Gunasekaran, A. (2020). Big data-driven supply chain performance measurement system: A review and framework for implementation. International Journal of Production Research, 58 (1), 65–86.

Kamble, S. S., Gunasekaran, A., Kumar, V., Belhadi, A., & Foropon, C. (2021). A machine learning based approach for predicting blockchain adoption in supply chain. Technological Forecasting and Social Change, 163 , 120465.

Kamley, S., Jaloree, S., & Thakur, R. (2016). Performance forecasting of share market using machine learning techniques: A review. International Journal of Electrical and Computer Engineering (2088-8708), 6 (6).

Kang, Y., Lee, S., & Do Chung, B. (2019). Learning-based logistics planning and scheduling for crowdsourced parcel delivery. Computers and Industrial Engineering, 132 , 271–279.

Kappelman, A. C., & Sinha, A. K. (2021). Optimal control in dynamic food supply chains using big data. Computers and Operations Research, 126 , 105117.

Kar, A., Tripathi, S., Malik, N., Gupta, S., & Sivarajah, U. (2022). How does misinformation and capricious opinions impact the supply chain: A study on the impacts during the pandemic. Annals of Operations Research , 1–22.

Kartal, H., Oztekin, A., Gunasekaran, A., & Cebi, F. (2016). An integrated decision analytic framework of machine learning with multi-criteria decision making for multi-attribute inventory classification. Computers and Industrial Engineering, 101 , 599–613.

Kaur, H., & Singh, S. P. (2018). Heuristic modeling for sustainable procurement and logistics in a supply chain using big data. Computers and Operations Research, 98 , 301–321.

Kazancoglu, Y., Ozkan-Ozen, Y., Sagnak, M., Kazancoglu, I., & Dora, M. (2021a). Framework for a sustainable supply chain to overcome risks in transition to a circular economy through Industry 4.0. Production Planning and Control , 1–16.

Kazancoglu, Y., Sagnak, M., Mangla, S., Sezer, M., & Pala, M. (2021b). A fuzzy based hybrid decision framework to circularity in dairy supply chains through big data solutions. Technological Forecasting and Social Change, 170 .

Keller, T., Thiesse, F., & Fleisch, E. (2014). Classification models for RFID-based real-time detection of process events in the supply chain: An empirical study. ACM Transactions on Management Information Systems (TMIS), 5 (4), 1–30.

Ketter, W., Collins, J., Gini, M., Gupta, A., & Schrater, P. (2009). Detecting and forecasting economic regimes in multi-agent automated exchanges. Decision Support Systems, 47 (4), 307–318.

Kiekintveld, C., Miller, J., Jordan, P. R., Callender, L. F., & Wellman, M. P. (2009). Forecasting market prices in a supply chain game. Electronic Commerce Research and Applications, 8 (2), 63–77.

Kilimci, Z. H., Akyuz, A. O., Uysal, M., Akyokus, S., Uysal, M. O., Atak Bulbul, B., & Ekmis, M. A. (2019). An improved demand forecasting model using deep learning approach and proposed decision integration strategy for supply chain. Complexity, 2019 .

Kim, S., Kim, H., & Park, Y. (2017). Early detection of vessel delays using combined historical and real-time information. Journal of the Operational Research Society, 68 (2), 182–191.

Kim, S., Sohn, W., Lim, D., & Lee, J. (2021). A multi-stage data mining approach for liquid bulk cargo volume analysis based on bill of lading data. Expert Systems with Applications , 115304.

Kim, T. Y. (2018). Improving warehouse responsiveness by job priority management: A European distribution centre field study. Computers and Industrial Engineering, 139 , 105564.

Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University, 33 (2004), 1–26.

Kosasih, E. E., & Brintrup, A. (2021). A machine learning approach for predicting hidden links in supply chain with graph neural networks. International Journal of Production Research , 1–14.

Kotu, V., & Deshpande, B. (2018). Data science: Concepts and practice . New York: Morgan Kaufmann.

Kumar, S., Nottestad, D. A., & Murphy, E. E. (2009). Effects of product postponement on the distribution network: A case study. Journal of the Operational Research Society, 60 (4), 471–480.

Kuo, R. J., Wang, Y. C., & Tien, F. C. (2010). Integration of artificial neural network and MADA methods for green supplier selection. Journal of Cleaner Production, 18 (12), 1161–1170.

Kusi-Sarpong, S., Orji, I., Gupta, H., & Kunc, M. (2021). Risks associated with the implementation of big data analytics in sustainable supply chains. Omega (United Kingdom), 105 .

Kuvvetli, Ü., & Firuzan, A. R. (2019). Applying Six Sigma in urban public transportation to reduce traffic accidents involving municipality buses. Total Quality Management and Business Excellence, 30 (1–2), 82–107.

Lamba, K., & Singh, S. P. (2019). Dynamic supplier selection and lot-sizing problem considering carbon emissions in a big data environment. Technological Forecasting and Social Change, 144 , 573–584.

Lamba, K., Singh, S. P., & Mishra, N. (2019). Integrated decisions for supplier selection and lot-sizing considering different carbon emission regulations in Big Data environment. Computers and Industrial Engineering, 128 , 1052–1062.

Lau, R. Y. K., Zhang, W., & Xu, W. (2018). Parallel aspect-oriented sentiment analysis for sales forecasting with big data. Production and Operations Management, 27 (10), 1775–1794.

Lázaro, J. L., Jiménez, Á. B., & Takeda, A. (2018). Improving cash logistics in bank branches by coupling machine learning and robust optimization. Expert Systems with Applications, 92 , 236–255.

Le Thi, H. A. (2020). DC programming and DCA for supply chain and production management: State-of-the-art models and methods. International Journal of Production Research, 58 (20), 6078–6114.

Lee, C. (2017). A GA-based optimisation model for big data analytics supporting anticipatory shipping in Retail 4.0. International Journal of Production Research, 55 (2), 593–605.

Lee, C. K., Ho, W., Ho, G. T., & Lau, H. C. (2011). Design and development of logistics workflow systems for demand management with RFID. Expert Systems with Applications, 38 (5), 5428–5437.

Lee, C.-Y., & Chien, C.-F. (2014). Stochastic programming for vendor portfolio selection and order allocation under delivery uncertainty. OR Spectrum, 36 (3), 761–797.

Lee, H., Aydin, N., Choi, Y., Lekhavat, S., & Irani, Z. (2018). A decision support system for vessel speed decision in maritime logistics using weather archive big data. Computers and Operations Research, 98 , 330–342.

Lee, Y.-C., Hsiao, Y.-C., Peng, C.-F., Tsai, S.-B., Wu, C.-H., & Chen, Q. (2015). Using Mahalanobis-Taguchi system, logistic regression, and neural network method to evaluate purchasing audit quality. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 229 (1), 3–12.

Leung, K. H., Mo, D. Y., Ho, G. T., Wu, C.-H., & Huang, G. Q. (2020). Modelling near-real-time order arrival demand in e-commerce context: A machine learning predictive methodology. Industrial Management and Data Systems, 120 (6), 1149–1174.

Li, G., Li, L., Choi, T.-M., & Sethi, S. P. (2020). Green supply chain management in Chinese firms: Innovative measures and the moderating role of quick response technology. Journal of Operations Management, 66 (7–8), 958–988.

Li, G., Li, N., & Sethi, S. P. (2021). Does CSR reduce idiosyncratic risk? Roles of operational efficiency and AI innovation. Production and Operations Management, 30 (7), 2027–2045.

Li, G., Lim, M. K., & Wang, Z. (2020). Stakeholders, green manufacturing, and practice performance: Empirical evidence from Chinese fashion businesses. Annals of Operations Research, 290 (1), 961–982.

Li, G., Wu, H., Sethi, S. P., & Zhang, X. (2021). Contracting green product supply chains considering marketing efforts in the circular economy era. International Journal of Production Economics, 234 , 108041.

Li, G., Xue, J., Li, N., & Ivanov, D. (2022). Blockchain-supported business model design, supply chain resilience, and firm performance. Transportation Research Part E: Logistics and Transportation Review, 163 , 102773.

Li, G.-D., Yamaguchi, D., & Nagai, M. (2008). A grey-based rough decision-making approach to supplier selection. The International Journal of Advanced Manufacturing Technology, 36 (9–10), 1032.

Li, J., Zeng, X., Liu, C., & Zhou, X. (2018). A parallel Lagrange algorithm for order acceptance and scheduling in cluster supply chains. Knowledge-Based Systems, 143 , 271–283.

Li, L., Chi, T., Hao, T., & Yu, T. (2018). Customer demand analysis of the electronic commerce supply chain using Big Data. Annals of Operations Research, 268 (1–2), 113–128.

Li, L., Dai, Y., & Sun, Y. (2021). Impact of data-driven online financial consumption on supply chain services. Industrial Management and Data Systems, 121 (4), 856–878.

Li, L., Gong, Y., Wang, Z., & Liu, S. (2022b). Big data and big disaster: A mechanism of supply chain risk management in global logistics industry. International Journal of Operations and Production Management .

Li, R., Pereira, F. C., & Ben-Akiva, M. E. (2015). Competing risk mixture model and text analysis for sequential incident duration prediction. Transportation Research Part C: Emerging Technologies, 54 , 74–85.

Li, S., & Kuo, X. (2008). The inventory management system for automobile spare parts in a central warehouse. Expert Systems with Applications, 34 (2), 1144–1153.

Liao, S.-H., Chen, C.-M., & Wu, C.-H. (2008). Mining customer knowledge for product line and brand extension in retailing. Expert Systems with Applications, 34 (3), 1763–1776.

Liao, S.-H., Chen, Y.-N., & Tseng, Y.-Y. (2009). Mining demand chain knowledge of life insurance market for new product development. Expert Systems with Applications, 36 (5), 9422–9437.

Liao, S.-H., Hsieh, C.-L., & Huang, S.-P. (2008). Mining product maps for new product development. Expert Systems with Applications, 34 (1), 50–62.

Lim, M., Li, Y., & Song, X. (2021). Exploring customer satisfaction in cold chain logistics using a text mining approach. Industrial Management and Data Systems, 121 (12), 2426–2449.

Lin, R.-H., Chuang, C.-L., Liou, J. J., & Wu, G.-D. (2009). An integrated method for finding key suppliers in SCM. Expert Systems with Applications, 36 (3), 6461–6465.

Lin, W., Wu, Z., Lin, L., Wen, A., & Li, J. (2017). An ensemble random forest algorithm for insurance big data analysis. IEEE Access, 5 , 16568–16575.

Liu, C., Feng, Y., Lin, D., Wu, L., & Guo, M. (2020). IoT based laundry services: an application of big data analytics, intelligent logistics management, and machine learning techniques. International Journal of Production Research, 58 (17), 5113–5131.

Liu, C., Li, H., Tang, Y., Lin, D., & Liu, J. (2019). Next generation integrated smart manufacturing based on big data analytics, reinforced learning, and optimal routes planning methods. International Journal of Computer Integrated Manufacturing, 32 (9), 820–831.

Liu, P. (2019). Pricing policies and coordination of low-carbon supply chain considering targeted advertisement and carbon emission reduction costs in the big data environment. Journal of Cleaner Production, 210 , 343–357.

Liu, P., & Yi, S.-P. (2017). Pricing policies of green supply chain considering targeted advertising and product green degree in the big data environment. Journal of Cleaner Production, 164 , 1614–1622.

Liu, W., Long, S., Xie, D., Liang, Y., & Wang, J. (2021). How to govern the big data discriminatory pricing behavior in the platform service supply chain? An examination with a three-party evolutionary game model. International Journal of Production Economics, 231 , 107910.

Lyu, X., & Zhao, J. (2019). Compressed sensing and its applications in risk assessment for internet supply chain finance under big data. IEEE Access, 7 , 53182–53187.

Ma, D., Hu, J., & Yao, F. (2021). Big data empowering low-carbon smart tourism study on low-carbon tourism O2O supply chain considering consumer behaviors and corporate altruistic preferences. Computers and Industrial Engineering, 153 .

Maghsoodi, A. I., Kavian, A., Khalilzadeh, M., & Brauers, W. K. (2018). CLUS-MCDA: A novel framework based on cluster analysis and multiple criteria decision theory in a supplier selection problem. Computers and Industrial Engineering, 118 , 409–422.

Maheshwari, S., Gautam, P., & Jaggi, C. K. (2021). Role of Big Data Analytics in supply chain management: Current trends and future perspectives. International Journal of Production Research, 59 (6), 1875–1900.

Maldonado, S., González-Ramírez, R. G., Quijada, F., & Ramírez-Nafarrate, A. (2019). Analytics meets port logistics: A decision support system for container stacking operations. Decision Support Systems, 121 , 84–93.

Mancini, M., Mircoli, A., Potena, D., Diamantini, C., Duca, D., & Toscano, G. (2020). Prediction of pellet quality through machine learning techniques and near-infrared spectroscopy. Computers and Industrial Engineering, 147 , 106566.

Mao, J., Hong, D., Ren, R., Li, X., Wang, J., & Nasr, E. S. A. (2020). Driving conditions of new energy logistics vehicles using big data technology. IEEE Access, 8 , 123891–123903.

Masna, N. V. R., Chen, C., Mandal, S., & Bhunia, S. (2019). Robust authentication of consumables with extrinsic tags and chemical fingerprinting. IEEE Access, 7 , 14396–14409.

Matusiak, M., de Koster, R., & Saarinen, J. (2017). Utilizing individual picker skills to improve order batching in a warehouse. European Journal of Operational Research, 263 (3), 888–899.

Medina-González, S., Shokry, A., Silvente, J., Lupera, G., & Espuña, A. (2018). Optimal management of bio-based energy supply chains under parametric uncertainty through a data-driven decision-support framework. Computers and Industrial Engineering, 139 , 105561.

Merchán, D., & Winkenbach, M. (2019). An empirical validation and data-driven extension of continuum approximation approaches for urban route distances. Networks, 73 (4), 418–433.

Metzger, A., Leitner, P., Ivanović, D., Schmieders, E., Franklin, R., Carro, M., Dustdar, S., & Pohl, K. (2014). Comparing and combining predictive business process monitoring techniques. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45 (2), 276–290.

Miguéis, V. L., Van den Poel, D., Camanho, A. S., & e Cunha, J. F. (2012). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications, 39 (12), 11250–11256.

Ming, L., GuoHua, Z., & Wei, W. (2021). Study of the Game Model of E-Commerce Information Sharing in an Agricultural Product Supply Chain based on fuzzy big data and LSGDM. Technological Forecasting and Social Change, 172 .

Mirzaei, M., Zaerpour, N., & de Koster, R. (2021). The impact of integrated cluster-based storage allocation on parts-to-picker warehouse performance. Transportation Research Part E: Logistics and Transportation Review, 146 , 102207.

Mishra, D., Gunasekaran, A., Papadopoulos, T., & Childe, S. J. (2018). Big Data and supply chain management: A review and bibliometric analysis. Annals of Operations Research, 270 (1–2), 313–336.

Mishra, S., & Singh, S. (2022). A stochastic disaster-resilient and sustainable reverse logistics model in big data environment. Annals of Operations Research, 319 (1), 853–884.

Mishra, S., & Singh, S. P. (2020). A stochastic disaster-resilient and sustainable reverse logistics model in big data environment. Annals of Operations Research , 1–32.

Mishra, S., & Singh, S. P. (2021). A clean global production network model considering hybrid facilities. Journal of Cleaner Production, 281 , 124463.

Mocanu, E., Nguyen, P. H., Gibescu, M., & Kling, W. L. (2016). Deep learning for estimating building energy consumption. Sustainable Energy, Grids and Networks, 6 , 91–99.

Mohseni, S., & Pishvaee, M. S. (2020). Data-driven robust optimization for wastewater sludge-to-biodiesel supply chain design. Computers and Industrial Engineering, 139 , 105944.

Mokhtarinejad, M., Ahmadi, A., Karimi, B., & Rahmati, S. H. A. (2015). A novel learning based approach for a new integrated location-routing and scheduling problem within cross-docking considering direct shipment. Applied Soft Computing, 34 , 274–285.

Molka-Danielsen, J., Engelseth, P., & Wang, H. (2018). Large scale integration of wireless sensor network technologies for air quality monitoring at a logistics shipping base. Journal of Industrial Information Integration, 10 , 20–28.

Mourtzis, D., Dolgui, A., Ivanov, D., Peron, M., & Sgarbossa, F. (2021). Design and operation of production networks for mass personalization in the era of cloud technology . Elsevier.

Mungo, L., Lafond, F., Astudillo-Estévez, P., & Farmer, J. D. (2023). Reconstructing production networks using machine learning. Journal of Economic Dynamics and Control , 104607.

Murray, P. W., Agard, B., & Barajas, M. A. (2018). Forecast of individual customer’s demand from a large and noisy dataset. Computers and Industrial Engineering, 118 , 33–43.

Muteki, K., & MacGregor, J. F. (2008). Optimal purchasing of raw materials: A data-driven approach. AIChE Journal, 54 (6), 1554–1559.

Neilson, A., Daniel, B., Tjandra, S., et al. (2019). Systematic review of the literature on big data in the transportation domain: Concepts and applications. Big Data Research, 17 , 35–44.

Newman, W. R., & Krehbiel, T. C. (2007). Linear performance pricing: A collaborative tool for focused supply cost reduction. Journal of Purchasing and Supply Management, 13 (2), 152–165.

Nguyen, A., Lamouri, S., Pellerin, R., Tamayo, S., & Lekens, B. (2022). Data analytics in pharmaceutical supply chains: State of the art, opportunities, and challenges. International Journal of Production Research, 60 (22), 6888–6907.

Nguyen, A., Pellerin, R., Lamouri, S., & Lekens, B. (2022b). Managing demand volatility of pharmaceutical products in times of disruption through news sentiment analysis. International Journal of Production Research , 1–12.

Nguyen, D. T., Adulyasak, Y., Cordeau, J.-F., & Ponce, S. I. (2021). Data-driven operations and supply chain management: Established research clusters from 2000 to early 2020. International Journal of Production Research , 1–25.

Nguyen, T., Li, Z., Spiegler, V., Ieromonachou, P., & Lin, Y. (2018). Big data analytics in supply chain management: A state-of-the-art literature review. Computers and Operations Research, 98 , 254–264.

Ni, M., Xu, X., & Deng, S. (2007). Extended QFD and data-mining-based methods for supplier selection in mass customization. International Journal of Computer Integrated Manufacturing, 20 (2–3), 280–291.

Nikolopoulos, K., Punia, S., Schäfers, A., Tsinopoulos, C., & Vasilakis, C. (2021). Forecasting and planning during a pandemic: COVID-19 growth rates, supply chain disruptions, and governmental decisions. European Journal of Operational Research, 290 (1), 99–115.

Nikolopoulos, K. I., Babai, M. Z., & Bozos, K. (2016). Forecasting supply chain sporadic demand with nearest neighbor approaches. International Journal of Production Economics, 177 , 139–148.

Ning, C., & You, F. (2018). Data-driven stochastic robust optimization: General computational framework and algorithm leveraging machine learning for optimization under uncertainty in the big data era. Computers and Chemical Engineering, 111 , 115–133.

Ning, C., & You, F. (2019). Optimization under uncertainty in the era of big data and deep learning: When machine learning meets mathematical programming. Computers and Chemical Engineering, 125 , 434–448.

Niu, B., Dai, Z., & Chen, L. (2022). Information leakage in a cross-border logistics supply chain considering demand uncertainty and signal inference. Annals of Operations Research, 309 (2), 785–816.

Noroozi, A., Mokhtari, H., & Abadi, I. N. K. (2013). Research on computational intelligence algorithms with adaptive learning approach for scheduling problems with batch processing machines. Neurocomputing, 101 , 190–203.

Novais, L., Maqueira, J. M., & Ortiz-Bas, Á. (2019). A systematic literature review of cloud computing use in supply chain integration. Computers and Industrial Engineering, 129 , 296–314.

Nuss, P., Ohno, H., Chen, W.-Q., & Graedel, T. (2019). Comparative analysis of metals use in the United States economy. Resources, Conservation and Recycling, 145 , 448–456.

Oh, J., & Jeong, B. (2019). Tactical supply planning in smart manufacturing supply chain. Robotics and Computer-Integrated Manufacturing, 55 , 217–233.

Opasanon, S., & Kitthamkesorn, S. (2016). Border crossing design in light of the ASEAN Economic Community: Simulation based approach. Transport Policy, 48 , 1–12.

Ou, T.-Y., Cheng, C.-Y., Chen, P.-J., & Perng, C. (2016). Dynamic cost forecasting model based on extreme learning machine: A case study in steel plant. Computers and Industrial Engineering, 101 , 544–553.

Ozgormus, E., & Smith, A. E. (2020). A data-driven approach to grocery store block layout. Computers and Industrial Engineering, 139 , 105562.

Pal Singh, S., Adhikari, A., Majumdar, A., & Bisi, A. (2022). Does service quality influence operational and financial performance of third party logistics service providers? A mixed multi criteria decision making -text mining-based investigation. Transportation Research Part E: Logistics and Transportation Review, 157 .

Pan, S., Giannikas, V., Han, Y., Grover-Silva, E., & Qiao, B. (2017). Using customer-related data to enhance e-grocery home delivery. Industrial Management and Data Systems, 117 (9), 1917–1933.

Papanagnou, C. I., & Matthews-Amune, O. (2018). Coping with demand volatility in retail pharmacies with the aid of big data exploration. Computers and Operations Research, 98 , 343–354.

Parmar, D., Wu, T., Callarman, T., Fowler, J., & Wolfe, P. (2010). A clustering algorithm for supplier base management. International Journal of Production Research, 48 (13), 3803–3821.

Piendl, R., Matteis, T., & Liedtke, G. (2019). A machine learning approach for the operationalization of latent classes in a discrete shipment size choice model. Transportation Research Part E: Logistics and Transportation Review, 121 , 149–161.

Potočnik, P., Šilc, J., Papa, G., et al. (2019). A comparison of models for forecasting the residential natural gas demand of an urban area. Energy, 167 , 511–522.

Pournader, M., Ghaderi, H., Hassanzadegan, A., & Fahimnia, B. (2021). Artificial intelligence applications in supply chain management. International Journal of Production Economics , 108250.

Praet, S., & Martens, D. (2020). Efficient parcel delivery by predicting customers’ locations. Decision Sciences, 51 (5), 1202–1231.

Prakash, A., & Deshmukh, S. (2011). A multi-criteria customer allocation problem in supply chain environment: An artificial immune system with fuzzy logic controller based approach. Expert Systems with Applications, 38 (4), 3199–3208.

Pramanik, D., Mondal, S. C., & Haldar, A. (2020). Resilient supplier selection to mitigate uncertainty: Soft-computing approach. Journal of Modelling in Management .

Priore, P., Ponte, B., Rosillo, R., & de la Fuente, D. (2019). Applying machine learning to the dynamic selection of replenishment policies in fast-changing supply chain environments. International Journal of Production Research, 57 (11), 3663–3677.

Proto, S., Di Corso, E., Apiletti, D., Cagliero, L., Cerquitelli, T., Malnati, G., & Mazzucchi, D. (2020). REDTag: A predictive maintenance framework for parcel delivery services. IEEE Access, 8, 14953–14964.

Punia, S., Singh, S. P., & Madaan, J. K. (2020). A cross-temporal hierarchical framework and deep learning for supply chain forecasting. Computers and Industrial Engineering, 149 , 106796.

Putra, P., Mahendra, R., & Budi, I. (2022). Traffic and road conditions monitoring system using extracted information from Twitter. Journal of Big Data, 9 (1).

Quariguasi Frota Neto, J., & Dutordoir, M. (2020). Mapping the market for remanufacturing: An application of “Big Data” analytics. International Journal of Production Economics, 230 .

Queiroz, M. M., Ivanov, D., Dolgui, A., & Wamba, S. F. (2022). Impacts of epidemic outbreaks on supply chains: Mapping a research agenda amid the COVID-19 pandemic through a structured literature review. Annals of Operations Research, 319 (1), 1159–1196.

Rahmanzadeh, S., Pishvaee, M., & Govindan, K. (2022). Emergence of open supply chain management: the role of open innovation in the future smart industry using digital twin network. Annals of Operations Research , 1–29.

Rai, R., Tiwari, M. K., Ivanov, D., & Dolgui, A. (2021). Machine learning in manufacturing and Industry 4.0 applications.

Riahi, Y., Saikouk, T., Gunasekaran, A., & Badraoui, I. (2021). Artificial intelligence applications in supply chain: A descriptive bibliometric analysis and future research directions. Expert Systems with Applications, 173 , 114702.

Rolf, B., Jackson, I., Müller, M., Lang, S., Reggelin, T., & Ivanov, D. (2022). A review on reinforcement learning algorithms and applications in supply chain management. International Journal of Production Research , 1–29.

Roy, V., Mitra, S., Chattopadhyay, M., & Sahay, B. (2018). Facilitating the extraction of extended insights on logistics performance from the logistics performance index dataset: A two-stage methodological framework and its application. Research in Transportation Business and Management, 28 , 23–32.

Rozhkov, M., Ivanov, D., Blackhurst, J., & Nair, A. (2022). Adapting supply chain operations in anticipation of and during the COVID-19 pandemic. Omega, 110 , 102635.

Sachs, A.-L. (2015). The data-driven newsvendor with censored demand observations. In Retail analytics (pp. 35–56). Springer.

Sadic, S., de Sousa, J. P., & Crispim, J. A. (2018). A two-phase MILP approach to integrate order, customer and manufacturer characteristics into Dynamic Manufacturing Network formation and operational planning. Expert Systems with Applications, 96 , 462–478.

See-To, E. W., & Ngai, E. W. (2018). Customer reviews for demand distribution and sales nowcasting: A big data approach. Annals of Operations Research, 270 (1–2), 415–431.

Segev, D., Levi, R., Dunn, P. F., & Sandberg, W. S. (2012). Modeling the impact of changing patient transportation systems on peri-operative process performance in a large hospital: Insights from a computer simulation study. Health Care Management Science, 15 (2), 155–169.

Seitz, A., Grunow, M., & Akkerman, R. (2020). Data driven supply allocation to individual customers considering forecast bias. International Journal of Production Economics, 227 , 107683.

Sener, A., Barut, M., Dag, A., & Yildirim, M. B. (2019). Impact of commitment, information sharing, and information usage on supplier performance: A Bayesian belief network approach. Annals of Operations Research , 1–34.

Shajalal, M., Hajek, P., & Abedin, M. Z. (2021). Product backorder prediction using deep neural network on imbalanced data. International Journal of Production Research , 1–18.

Shang, Y., Dunson, D., & Song, J.-S. (2017). Exploiting big data in logistics risk assessment via Bayesian nonparametrics. Operations Research, 65 (6), 1574–1588.

Sharma, R., Kamble, S. S., Gunasekaran, A., Kumar, V., & Kumar, A. (2020). A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Computers and Operations Research, 119 , 104926.

Shen, B., Choi, T.-M., & Chan, H.-L. (2019). Selling green first or not? A Bayesian analysis with service levels and environmental impact considerations in the Big Data Era. Technological Forecasting and Social Change, 144 , 412–420.

How, B. S., & Lam, H. L. (2018). Sustainability evaluation for biomass supply chain synthesis: Novel principal component analysis (PCA) aided optimisation approach. Journal of Cleaner Production, 189, 941–961.

Shokouhyar, S., Dehkhodaei, A., & Amiri, B. (2022). A mixed-method approach for modelling customer-centric mobile phone reverse logistics: Application of social media data. Journal of Modelling in Management, 17 (2), 655–696.

Shukla, V., Naim, M. M., & Thornhill, N. F. (2012). Rogue seasonality detection in supply chains. International Journal of Production Economics, 138 (2), 254–272.

Simkoff, J. M., & Baldea, M. (2019). Parameterizations of data-driven nonlinear dynamic process models for fast scheduling calculations. Computers and Chemical Engineering, 129 , 106498.

Singh, A., Shukla, N., & Mishra, N. (2018). Social media data analytics to improve supply chain management in food industries. Transportation Research Part E: Logistics and Transportation Review, 114 , 398–415.

Singh, A. K., Subramanian, N., Pawar, K. S., & Bai, R. (2018). Cold chain configuration design: Location-allocation decision-making using coordination, value deterioration, and big data approximation. Annals of Operations Research, 270 (1–2), 433–457.

Sodero, A. C., & Rabinovich, E. (2017). Demand and revenue management of deteriorating inventory on the Internet: An empirical study of flash sales markets. Journal of Business Logistics, 38 (3), 170–183.

Sokolov, B., Ivanov, D., & Dolgui, A. (2020). Scheduling in industry 4.0 and cloud manufacturing (Vol. 289). Springer.

Song, Z., & Kusiak, A. (2009). Optimising product configurations with a data-mining approach. International Journal of Production Research, 47 (7), 1733–1751.

van der Spoel, S., Amrit, C., & van Hillegersberg, J. (2017). Predictive analytics for truck arrival time estimation: A field study at a European distribution center. International Journal of Production Research, 55(17), 5062–5078.

Srinivasan, R., Giannikas, V., Kumar, M., Guyot, R., & McFarlane, D. (2019). Modelling food sourcing decisions under climate change: A data-driven approach. Computers and Industrial Engineering, 128 , 911–919.

Stadtler, H., & Kilger, C. (2002). Supply chain management and advanced planning (Vol. 4). New York: Springer.

Stip, J., & Van Houtum, G.-J. (2019). On a method to improve your service BOMs within spare parts management. International Journal of Production Economics , 107466.

Stip, J., & Van Houtum, G.-J. (2020). On a method to improve your service BOMs within spare parts management. International Journal of Production Economics, 221 , 107466.

Sugrue, D., & Adriaens, P. (2021). A data fusion approach to predict shipping efficiency for bulk carriers. Transportation Research Part E: Logistics and Transportation Review, 149 , 102326.

Sun, J., Li, G., & Lim, M. K. (2020). China’s power supply chain sustainability: An analysis of performance and technology gap. Annals of Operations Research , 1–29.

Susanty, A., Puspitasari, N., Prastawa, H., & Renaldi, S. (2021). Exploring the best policy scenario plan for the dairy supply chain: A DEMATEL approach. Journal of Modelling in Management, 16 (1), 240–266.

Talwar, S., Kaur, P., Fosso Wamba, S., & Dhir, A. (2021). Big data in operations and supply chain management: A systematic literature review and future research agenda. International Journal of Production Research, 1–26.

Tan, K. H., Zhan, Y., Ji, G., Ye, F., & Chang, C. (2015). Harvesting big data to enhance supply chain innovation capabilities: An analytic infrastructure based on deduction graph. International Journal of Production Economics, 165 , 223–233.

Tao, Q., Gu, C., Wang, Z., Rocchio, J., Hu, W., & Yu, X. (2018). Big data driven agricultural products supply chain management: A trustworthy scheduling optimization approach. IEEE Access, 6 , 49990–50002.

Taube, F., & Minner, S. (2018). Data-driven assignment of delivery patterns with handling effort considerations in retail. Computers and Operations Research, 100 , 379–393.

Tavana, M., Fallahpour, A., Di Caprio, D., & Santos-Arteaga, F. J. (2016). A hybrid intelligent fuzzy predictive model with simulation for supplier evaluation and selection. Expert Systems with Applications, 61 , 129–144.

Tayal, A., & Singh, S. P. (2018). Integrating big data analytic and hybrid firefly-chaotic simulated annealing approach for facility layout problem. Annals of Operations Research, 270 (1–2), 489–514.

Thomassey, S. (2010). Sales forecasts in clothing industry: The key success factor of the supply chain management. International Journal of Production Economics, 128 (2), 470–483.

Ting, S., Tse, Y., Ho, G., Chung, S., & Pang, G. (2014). Mining logistics data to assure the quality in a sustainable food supply chain: A case in the red wine industry. International Journal of Production Economics, 152 , 200–209.

Tirkel, I. (2013). Forecasting flow time in semiconductor manufacturing using knowledge discovery in databases. International Journal of Production Research, 51 (18), 5536–5548.

Tiwari, S., Wee, H. M., & Daryanto, Y. (2018). Big data analytics in supply chain management between 2010 and 2016: Insights to industries. Computers and Industrial Engineering, 115 , 319–330.

Tomičić-Pupek, K., Srpak, I., Havaš, L., & Srpak, D. (2020). Algorithm for customizing the material selection process for application in power engineering. Energies, 13 (23), 6458.

Triepels, R., Daniels, H., & Feelders, A. (2018). Data-driven fraud detection in international shipping. Expert Systems with Applications, 99 , 193–202.

Tsai, F.-M., & Huang, L. J. (2017). Using artificial neural networks to predict container flows between the major ports of Asia. International Journal of Production Research, 55 (17), 5001–5010.

Tsao, Y.-C. (2017). Managing default risk under trade credit: Who should implement Big-Data analytics in supply chains? Transportation Research Part E: Logistics and Transportation Review, 106 , 276–293.

Tsolakis, N., Zissis, D., Papaefthimiou, S., & Korfiatis, N. (2021). Towards AI driven environmental sustainability: An application of automated logistics in container port terminals. International Journal of Production Research , 1–21.

Tsolakis, N., Zissis, D., Papaefthimiou, S., & Korfiatis, N. (2022). Towards AI driven environmental sustainability: An application of automated logistics in container port terminals. International Journal of Production Research, 60 (14), 4508–4528.

Tsou, C.-M. (2013). On the strategy of supply chain collaboration based on dynamic inventory target level management: A theory of constraint perspective. Applied Mathematical Modelling, 37 (7), 5204–5214.

Tucnik, P., Nachazel, T., Cech, P., & Bures, V. (2018). Comparative analysis of selected path-planning approaches in large-scale multi-agent-based environments. Expert Systems with Applications, 113 , 415–427.

Vahdani, B., Iranmanesh, S., Mousavi, S. M., & Abdollahzade, M. (2012). A locally linear neuro-fuzzy model for supplier selection in cosmetics industry. Applied Mathematical Modelling, 36 (10), 4714–4727.

Verstraete, G., Aghezzaf, E.-H., & Desmet, B. (2019). A data-driven framework for predicting weather impact on high-volume low-margin retail products. Journal of Retailing and Consumer Services, 48 , 169–177.

Vieira, A. A., Dias, L. M., Santos, M. Y., Pereira, G. A., & Oliveira, J. A. (2019). Simulation of an automotive supply chain using big data. Computers and Industrial Engineering, 137 , 106033.

Vieira, A. A., Dias, L. M., Santos, M. Y., Pereira, G. A., & Oliveira, J. A. (2019). Supply chain hybrid simulation: From Big Data to distributions and approaches comparison. Simulation Modelling Practice and Theory, 97 , 101956.

Viet, N. Q., Behdani, B., & Bloemhof, J. (2020). Data-driven process redesign: anticipatory shipping in agro-food supply chains. International Journal of Production Research, 58 (5), 1302–1318.

Villegas, M. A., & Pedregal, D. J. (2019). Automatic selection of unobserved components models for supply chain forecasting. International Journal of Forecasting, 35 (1), 157–169.

Vondra, M., Touš, M., & Teng, S. Y. (2019). Digestate evaporation treatment in biogas plants: A techno-economic assessment by Monte Carlo, neural networks and decision trees. Journal of Cleaner Production, 238 , 117870.

Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34 (2), 77–84.

Wang, F., Zhu, Y., Wang, F., Liu, J., Ma, X., & Fan, X. (2020). Car4Pac: Last mile parcel delivery through intelligent car trip sharing. IEEE Transactions on Intelligent Transportation Systems, 21 (10), 4410–4424.

Wang, G., Gunasekaran, A., & Ngai, E. W. (2018). Distribution network design with big data: Model and analysis. Annals of Operations Research, 270 (1–2), 539–551.

Wang, G., Gunasekaran, A., Ngai, E. W., & Papadopoulos, T. (2016). Big data analytics in logistics and supply chain management: Certain investigations for research and applications. International Journal of Production Economics, 176 , 98–110.

Wang, J., & Yue, H. (2017). Food safety pre-warning system based on data mining for a sustainable food supply chain. Food Control, 73 , 223–229.

Wang, K., Simandl, J. K., Porter, M. D., Graettinger, A. J., & Smith, R. K. (2016). How the choice of safety performance function affects the identification of important crash prediction variables. Accident Analysis and Prevention, 88 , 1–8.

Wang, L., Guo, S., Li, X., Du, B., & Xu, W. (2018). Distributed manufacturing resource selection strategy in cloud manufacturing. The International Journal of Advanced Manufacturing Technology, 94 (9–12), 3375–3388.

Wang, Y., Assogba, K., Liu, Y., Ma, X., Xu, M., & Wang, Y. (2018). Two-echelon location-routing optimization with time windows based on customer clustering. Expert Systems with Applications, 104 , 244–260.

Weiss, S. M., Dhurandhar, A., Baseman, R. J., White, B. F., Logan, R., Winslow, J. K., & Poindexter, D. (2016). Continuous prediction of manufacturing performance throughout the production lifecycle. Journal of Intelligent Manufacturing, 27 (4), 751–763.

Weng, T., Liu, W., & Xiao, J. (2019). Supply chain sales forecasting based on lightGBM and LSTM combination model. Industrial Management and Data Systems, 120 (2), 265–279.

Wesonga, R., & Nabugoomu, F. (2016). Framework for determining airport daily departure and arrival delay thresholds: Statistical modelling approach. SpringerPlus, 5 (1), 1026.

Wey, W.-M., & Huang, J.-Y. (2018). Urban sustainable transportation planning strategies for livable City’s quality of life. Habitat International, 82 , 9–27.

Wichmann, P., Brintrup, A., Baker, S., Woodall, P., & McFarlane, D. (2020). Extracting supply chain maps from news articles using deep neural networks. International Journal of Production Research, 58 (17), 5320–5336.

Windt, K., & Hütt, M.-T. (2011). Exploring due date reliability in production systems using data mining methods adapted from gene expression analysis. CIRP Annals, 60 (1), 473–476.

Wojtusiak, J., Warden, T., & Herzog, O. (2012). Machine learning in agent-based stochastic simulation: Inferential theory and evaluation in transportation logistics. Computers and Mathematics with Applications, 64 (12), 3658–3665.

Wojtusiak, J., Warden, T., & Herzog, O. (2012). The learnable evolution model in agent-based delivery optimization. Memetic Computing, 4 (3), 165–181.

Wong, W., & Guo, Z. (2010). A hybrid intelligent model for medium-term sales forecasting in fashion retail supply chains using extreme learning machine and harmony search algorithm. International Journal of Production Economics, 128 (2), 614–624.

Wu, P.-J., Chen, M.-C., & Tsau, C.-K. (2017). The data-driven analytics for investigating cargo loss in logistics systems. International Journal of Physical Distribution and Logistics Management, 47 (1), 68–83.

Wu, T., Xiao, F., Zhang, C., Zhang, D., & Liang, Z. (2019). Regression and extrapolation guided optimization for production-distribution with ship-buy-exchange options. Transportation Research Part E: Logistics and Transportation Review, 129 , 15–37.

Wu, X., Cao, Y., Xiao, Y., & Guo, J. (2020). Finding of urban rainstorm and waterlogging disasters based on microblogging data and the location-routing problem model of urban emergency logistics. Annals of Operations Research, 290 (1), 865–896.

Wu, Z., Li, Y., Wang, X., Su, J., Yang, L., Nie, Y., & Wang, Y. (2022). Mining factors affecting taxi detour behavior from GPS traces at directional road segment level. IEEE Transactions on Intelligent Transportation Systems, 23 (7), 8013–8023.

Wy, J., Jeong, S., Kim, B.-I., Park, J., Shin, J., Yoon, H., & Lee, S. (2011). A data-driven generic simulation model for logistics-embedded assembly manufacturing lines. Computers and Industrial Engineering, 60 (1), 138–147.

Xiang, Z., & Xu, M. (2019). Dynamic cooperation strategies of the closed-loop supply chain involving the Internet service platform. Journal of Cleaner Production, 220 , 1180–1193.

Xiang, Z., & Xu, M. (2020). Dynamic game strategies of a two-stage remanufacturing closed-loop supply chain considering Big Data marketing, technological innovation and overconfidence. Computers and Industrial Engineering, 145 .

Xu, F., Li, Y., & Feng, L. (2019). The influence of big data system for used product management on manufacturing-remanufacturing operations. Journal of Cleaner Production, 209 , 782–794.

Xu, G., Qiu, X., Fang, M., Kou, X., & Yu, Y. (2019). Data-driven operational risk analysis in E-Commerce Logistics. Advanced Engineering Informatics, 40 , 29–35.

Xu, J., Pero, M. E. P., Ciccullo, F., & Sianesi, A. (2021). On relating big data analytics to supply chain planning: Towards a research agenda. International Journal of Physical Distribution and Logistics Management, 51 (6), 656–682.

Xu, X., Guo, W. G., & Rodgers, M. D. (2020). A real-time decision support framework to mitigate degradation in perishable supply chains. Computers and Industrial Engineering, 150 , 106905.

Xu, X., & Li, Y. (2016). The antecedents of customer satisfaction and dissatisfaction toward various types of hotels: A text mining approach. International Journal of Hospitality Management, 55 , 57–69.

Xu, X., Shen, Y., Chen, W. A., Gong, Y., & Wang, H. (2021). Data-driven decision and analytics of collection and delivery point location problems for online retailers. Omega, 100 , 102280.

Yan, P., Pei, J., Zhou, Y., & Pardalos, P. (2021). When platform exploits data analysis advantage: change of OEM-led supply chain structure. Annals of Operations Research , 1–27.

Yang, B. (2020). Construction of logistics financial security risk ontology model based on risk association and machine learning. Safety Science, 123 .

Yang, H., Bukkapatnam, S. T., & Barajas, L. G. (2013). Continuous flow modelling of multistage assembly line system dynamics. International Journal of Computer Integrated Manufacturing, 26 (5), 401–411.

Yang, L., Jiang, A., & Zhang, J. (2021). Optimal timing of big data application in a two-period decision model with new product sales. Computers and Industrial Engineering, 160 , 107550.

Yang, Y., & Peng, C. (2023). A prediction-based supply chain recovery strategy under disruption risks. International Journal of Production Research , 1–15.

Yao, Y., Zhu, X., Dong, H., Wu, S., Wu, H., Tong, L. C., & Zhou, X. (2019). ADMM-based problem decomposition scheme for vehicle routing problem with time windows. Transportation Research Part B: Methodological, 129 , 156–174.

Yin, S., Jiang, Y., Tian, Y., & Kaynak, O. (2016). A data-driven fuzzy information granulation approach for freight volume forecasting. IEEE Transactions on Industrial Electronics, 64 (2), 1447–1456.

Yin, W., He, S., Zhang, Y., & Hou, J. (2018). A product-focused, cloud-based approach to door-to-door railway freight design. IEEE Access, 6 , 20822–20836.

Ying, H., Chen, L., & Zhao, X. (2021). Application of text mining in identifying the factors of supply chain financing risk management. Industrial Management and Data Systems, 121 (2), 498–518.

Yu, B., Guo, Z., Asian, S., Wang, H., & Chen, G. (2019). Flight delay prediction for commercial air transport: A deep learning approach. Transportation Research Part E: Logistics and Transportation Review, 125 , 203–221.

Yu, C.-C., & Wang, C.-S. (2008). A hybrid mining approach for optimizing returns policies in e-retailing. Expert Systems with Applications, 35 (4), 1575–1582.

Yu, L., Zhao, Y., Tang, L., & Yang, Z. (2019). Online big data-driven oil consumption forecasting with Google trends. International Journal of Forecasting, 35 (1), 213–223.

Yu, Y., He, Y., & Zhao, X. (2021). Impact of demand information sharing on organic farming adoption: An evolutionary game approach. Technological Forecasting and Social Change, 172 .

Yue, G., Tailai, G., & Dan, W. (2021). Multi-layered coding-based study on optimization algorithms for automobile production logistics scheduling. Technological Forecasting and Social Change, 170 , 120889.

Zakeri, A., Saberi, M., Hussain, O. K., & Chang, E. (2018). An early detection system for proactive management of raw milk quality: An Australian case study. IEEE Access, 6 , 64333–64349.

Zamani, E. D., Smyth, C., Gupta, S., & Dennehy, D. (2022). Artificial intelligence and big data analytics for supply chain resilience: A systematic literature review. Annals of Operations Research , 1–28.

Zhang, G., Shang, J., & Li, W. (2012). An information granulation entropy-based model for third-party logistics providers evaluation. International Journal of Production Research, 50 (1), 177–190.

Zhang, K., Qu, T., Zhang, Y., Zhong, R., & Huang, G. (2022). Big data-enabled intelligent synchronisation for the complex production logistics system under the opti-state control strategy. International Journal of Production Research, 60 (13), 4159–4175.

Zhang, R., Li, J., Wu, S., & Meng, D. (2016). Learning to select supplier portfolios for service supply chain. PLoS ONE, 11 (5), e0155672.

Zhang, T., Zhang, C. Y., & Pei, Q. (2019). Misconception of providing supply chain finance: Its stabilising role. International Journal of Production Economics, 213 , 175–184.

Zhao, J., Wang, J., & Deng, W. (2015). Exploring bikesharing travel time and trip chain by gender and day of the week. Transportation Research Part C: Emerging Technologies, 58 , 251–264.

Zhao, K., & Yu, X. (2011). A case based reasoning approach on supplier selection in petroleum enterprises. Expert Systems with Applications, 38 (6), 6839–6847.

Zhao, N., & Wang, Q. (2021). Analysis of two financing modes in green supply chains when considering the role of data collection. Industrial Management and Data Systems, 121 (4), 921–939.

Zhao, R., Liu, Y., Zhang, N., & Huang, T. (2017). An optimization model for green supply chain management by using a big data analytic approach. Journal of Cleaner Production, 142 , 1085–1097.

Zhao, S., & You, F. (2019). Resilient supply chain design and operations with decision-dependent uncertainty using a data-driven robust optimization approach. AIChE Journal, 65 (3), 1006–1021.

Zhao, X., Yeung, K., Huang, Q., & Song, X. (2015). Improving the predictability of business failure of supply chain finance clients by using external big dataset. Industrial Management and Data Systems, 115 (9), 1683–1703.

Zheng, M., Wu, K., Sun, C., & Pan, E. (2019). Optimal decisions for a two-echelon supply chain with capacity and demand information. Advanced Engineering Informatics, 39 , 248–258.

Zhong, R. Y., Huang, G. Q., Lan, S., Dai, Q., Chen, X., & Zhang, T. (2015). A big data approach for logistics trajectory discovery from RFID-enabled production data. International Journal of Production Economics, 165 , 260–272.

Zhong, R. Y., Lan, S., Xu, C., Dai, Q., & Huang, G. Q. (2016). Visualization of RFID-enabled shopfloor logistics Big Data in Cloud Manufacturing. The International Journal of Advanced Manufacturing Technology, 84 (1–4), 5–16.

Zhou, J., Li, X., Zhao, X., & Wang, L. (2021). Driving performance grading and analytics: Learning internal indicators and external factors from multi-source data. Industrial Management and Data Systems, 121 (12), 2530–2570.

Zhou, Y., & Guo, Z. (2021a). Research on intelligent solution of service industry supply chain network optimization based on genetic algorithm. Journal of Healthcare Engineering, 2021 .

Zhou, Y., & Guo, Z. (2021b). Research on intelligent solution of service industry supply chain network optimization based on genetic algorithm. Journal of Healthcare Engineering, 2021 .

Zhou, Y., Yu, L., Chi, G., Ding, S., & Liu, X. (2022a). An enterprise default discriminant model based on optimal misjudgment loss ratio. Expert Systems with Applications, 205 .

Zhou, Z., Wang, M., Huang, J., Lin, S., & Lv, Z. (2022). Blockchain in big data security for intelligent transportation with 6G. IEEE Transactions on Intelligent Transportation Systems, 23 (7), 9736–9746.

Zhu, D. (2018). IOT and big data based cooperative logistical delivery scheduling method and cloud robot system. Future Generation Computer Systems, 86 , 709–715.

Zhu, J. (2022). DEA under big data: Data enabled analytics and network data envelopment analysis. Annals of Operations Research, 309 (2), 761–783.

Zhu, Y., Zhao, Y., Zhang, J., Geng, N., & Huang, D. (2019a). Spring onion seed demand forecasting using a hybrid Holt-Winters and support vector machine model. PLoS ONE, 14 (7).

Zhu, Y., Zhou, L., Xie, C., Wang, G.-J., & Nguyen, T. V. (2019). Forecasting SMEs’ credit risk in supply chain finance with an enhanced hybrid ensemble machine learning approach. International Journal of Production Economics, 211 , 22–33.

Acknowledgements

The authors would like to express their sincere gratitude to the editors and anonymous reviewers for their important comments and suggestions that helped to improve this paper.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

School of Accounting, Information Systems and Supply Chain, RMIT University, Melbourne, Australia

Hamed Jahani

Department of Management, Tilburg School of Economics and Management, Tilburg University, Tilburg, The Netherlands

Berlin School of Economics and Law, Global Supply Chain & Operations Management, Berlin, Germany

Dmitry Ivanov

Corresponding author

Correspondence to Dmitry Ivanov .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acronyms of journal names

See Table  9 .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Jahani, H., Jain, R. & Ivanov, D. Data science and big data analytics: a systematic review of methodologies used in the supply chain and logistics research. Ann Oper Res (2023). https://doi.org/10.1007/s10479-023-05390-7

Accepted: 08 May 2023

Published: 11 July 2023

DOI: https://doi.org/10.1007/s10479-023-05390-7

Keywords

  • Data analytics
  • Predictive analytics
  • Data science
  • Data mining
  • Machine learning
  • Supply chain
  • Data-driven optimisation

The High Cost of Misaligned Business and Analytics Goals

  • Preethika Sainam,
  • Seigyoung Auh,
  • Richard Ettenson,
  • Bulent Menguc

Findings from research on more than 300 companies undergoing data and analytics transformations.

How and where do companies’ investments in new and improved data and analytic capabilities contribute to tangible business benefits like profitability and growth? Should they invest in talent? Technology? Culture? According to new research, the degree of alignment between business goals and analytics capabilities is among the most important factors. While companies that are early in their analytics journey will see value creation even with significant internal misalignment, at higher levels of data maturity aligned companies find that analytics capabilities create significantly more value across growth, financial, and customer KPIs.

Business leaders are feeling acute pressure to ramp up their company’s data and analytics capabilities — and fast — or risk falling behind more data-savvy competitors. If only the path to success were that straightforward! In our previous research, we found that capitalizing on data and analytics requires creating a data culture, obtaining senior leadership commitment, acquiring data and analytics skills and competencies, as well as empowering employees. And each of these dimensions is necessary just to start the analytics journey.

  • Preethika Sainam is an Assistant Professor of Global Marketing at Thunderbird School of Global Management, Arizona State University.
  • Seigyoung Auh is Professor of Global Marketing at Thunderbird School of Global Management, Arizona State University, and Research Faculty at the Center for Services Leadership at the W. P. Carey School of Business, Arizona State University.
  • Richard Ettenson is Professor and Keickhefer Fellow in Global Marketing and Brand Strategy at the Thunderbird School of Global Management, Arizona State University.
  • Bulent Menguc is a Professor of Marketing at the Leeds University Business School in the U.K.

Closing Gaps in Data-Sharing Is Critical for Public Health

Updated federal strategy could also ease burdens on agencies, providers.

Every day, public health officials use data from each other and from doctors, hospitals, and health systems to protect people from infectious and environmental threats. When these officials receive timely, accurate, and complete information from health care providers, they can more clearly detect disease, prevent its spread, and help people connect to care. To improve the quality of this information, the U.S. Centers for Disease Control and Prevention developed the Public Health Data Strategy (PHDS), which was updated in April, to facilitate data-sharing between these many stakeholders. As the director of the CDC’s Office of Public Health Data, Surveillance, and Technology, Dr. Jennifer Layden is responsible for leading, coordinating, and executing the strategy.  

This interview has been edited for clarity and length.

What is the Public Health Data Strategy?

It’s CDC’s two-year plan to provide accountability for the data, technology, policy, and administrative actions necessary to meet our public health data goals. We aim to address challenges in data exchange between health care organizations and public health authorities, moving us toward one interconnected system that protects and improves health.

And what are the main goals of this effort?

The PHDS has four main goals: strengthen the core of public health data; accelerate access to analytic and automated solutions that support public health investigations and advance health equity; visualize and share insights to inform public health action; and advance more open and interoperable public health data. The plan sets milestones that help public health partners, health care organizations and providers, and the public understand what’s being done and what progress is being made toward these goals.

What barriers does the strategy aim to address?

Electronic health records (EHRs) and associated efforts at interoperability [the successful exchange of health information between different systems] have seen over $35 billion of investment over the last couple of decades. This has led to robust and widespread use of EHRs, adoption of health IT standards, and improved data-sharing across health care. Public health, however, hasn’t seen the same investment. And this has contributed to gaps in the completeness of data and the timely exchange of information to support public health.

Can you share an example of these gaps?

At the beginning of the COVID pandemic, we had race and ethnicity data on less than 60% of cases. New investments in public health, largely tied to the COVID response, allowed for advanced connectivity with the use of electronic case reporting, or eCR [the automated electronic reporting of individual cases of illness], as well as electronic laboratory reporting [the automated sharing of lab reports]. This led to a rapid improvement in the completeness of race and ethnicity data, which improved the nation’s ability to identify disparities in COVID burden and severity.

As we work to transform public health systems, we need to leverage existing health IT standards and technical approaches to ensure better connections between public health and health care. This benefits us all through more streamlined data-sharing, reduced burden on health care facilities and providers, and faster detection of health threats and outbreaks. And ultimately, improved bi-directional data-sharing [where data is available to health care providers who generate the information and health departments that receive the data] will benefit patients and those who care for them.

What progress have you seen so far?

The PHDS was launched in 2023 with 15 milestones, such as increasing the number of critical access hospitals sending electronic case reports as well as increasing the number of jurisdictions inputting eCR data into disease surveillance systems. Twelve were met, and work continues on the remaining three. The milestones reached in 2023 have made it easier to share information, provided access to modern tools, and improved the real-time monitoring of health threats, all of which strengthened public health data systems. The latest version of the PHDS includes updated 2024 milestones as well as new ones for 2025 that will advance the nation’s public health data capabilities. Milestones for the next two years focus on improving the completeness and coverage of eCR, syndromic surveillance [which uses anonymized emergency room data to identify emerging threats quickly], and data on mortality and wastewater. [When wastewater contains viruses, bacteria, and other infectious diseases circulating in a community, it can provide early warning even if people don’t have symptoms or seek care.]

How will the strategy make it easier for public health agencies and health care to share data?

Collaboration is at the heart of the new milestones. The updated strategy focuses on accelerating the adoption of eCR to ensure timely detection of illnesses, expanding data-sharing initiatives to improve public health responses and decision-making, and driving innovations in analytics to address health disparities and promote health equity.

These new milestones aim to reduce burdens on public health agencies by reducing the need to manually input case data into disease surveillance systems and will mitigate the overhead for managing individual point-to-point connections with labs to support eCR. The strategy will also let public health agencies more effectively identify and address health disparities based on a wider range of health equity measures.

In addition, the Workforce Accelerator Initiative, launched by the CDC Foundation, will recruit, place and support more than 100 technical experts in public health agencies to achieve the strategy’s goals.

What other partners will be engaged to accomplish the strategy?

Successful implementation will require collaboration with public health agencies, public health partners, private industry, health care partners, and other federal agencies, as well as sustained resources. We will directly engage with public health agencies to understand their priority needs and work with public health partners to support their progress toward key milestones. We’ll also collaborate with private partners to encourage dialogue and promote data exchange pilots, as well as with providers and labs to gather feedback on how we can better support their progress.

The CDC is working with the Office of the National Coordinator for Health Information Technology (ONC) to create a common approach for data exchange among health care, public health agencies, and federal agencies. This effort involves a partnership with representatives from health care, health IT, states, and federal organizations that sets up an exchange system to make it easier for providers to send data to public health agencies and for public health agencies to receive it. The collaboration will provide data standards, common agreements, and exchange networks that will assist public health agencies in their data exchange needs. We’ll continue to collaborate with ONC, as well as the Centers for Medicare & Medicaid Services, to advance a shared understanding of activities that support our milestones and will reach out to other federal agencies to synergize our efforts.

What will success look like?

We have ambitious goals to strengthen the connections between public health and health care. And other federal initiatives, like the movement toward the Trusted Exchange Framework and Common Agreement (TEFCA), adoption of USCDI+ , and new data standards lay out a pathway to making this a reality.

In five years, we aim to have 75% of state and big city jurisdictions, along with CDC, connected to TEFCA. This can eliminate inefficient point-to-point interfaces and enable more reliable exchange of real-time information. We also want to have 90% of emergency room data connected and flowing to public health agencies and envision a future where eCR has replaced most manual reporting of cases of infectious diseases and other conditions.

And big picture, what would this accomplish?

Reaching these goals would mean having more complete data and faster reporting of threats that could put our nation at risk. This will lead to better detection of outbreaks, faster response times, and healthier communities—and ultimately result in an integrated public health ecosystem that produces and uses data to support healthier communities and keep people safe.

Sheri Doyle


Figure 1. Scaled mean values of pretax outcomes and prognostic covariates included in the synthetic control analysis of SSB volume purchased. Mean values are scaled to be between 0 and 100 on the basis of each variable’s maximum and minimum values found in the primary sample. Shaded dots correspond to the mean value for a treated city, and hollow dots correspond to its synthetic control. Other race and ethnicity, as determined by the 2010 US Census, included multiracial and Hispanic individuals.

Figure 2. Percentage change in volume sold measured in ounces (orange squares) and percentage change in shelf prices measured in US dollars (blue circles) for the augmented synthetic control with staggered adoption composite analysis, and for the augmented synthetic control analyses of the 5 treated localities individually. Price elasticities of demand are provided, along with 95% CIs and P values for each percentage change in price or volume.

Figure 3. A, Percentage change in shelf prices (in US dollars) in response to implementing an excise SSB tax for the staggered adoption composite analysis. B, Percentage change in volume sold (in oz). The blue line represents the composite treated unit, and the gray lines represent in-space placebo estimates from the donor pool, which comprises untaxed localities. Percentage changes are calculated for the average of the pretreatment means of each of the 5 treated localities. The light blue dotted line represents the start of the SSB tax. The composite effect size estimates and P values are provided in each panel.

Figure 4. Percentage change in volume sold (in oz) from the staggered adoption composite analysis in immediately adjacent bordering 3-digit zip codes, in response to implementing an excise SSB tax in the 5 treated zip codes. The dark blue line represents the composite adjacent border unit, and the gray lines represent in-space placebo estimates from the donor pool. Percentage changes are calculated for the average of the pretreatment means of each of the 12 adjacent border localities. The light blue dotted line represents the start of the SSB tax. The composite effect size estimates and P values are provided.

eMethods. Supplemental Methods

eTable 1. Total Coverage of SSB Ounces Sold in Matched Nielsen Retail Scanner Data

eTable 2. Total Population (2010) by City within Taxed 3-Digit Zip Codes

eTable 3. Two-Way Fixed Effects Estimation Results for Composite and Individual City Analyses

eFigure 1. Comparing Treated and Synthetic Values of Prognostic Factors from the Analysis of SSB Shelf Prices

eFigure 2. Overlap of US Census Sociodemographic Characteristics Between Each Taxed City and the Donor Pool of Control 3-Digit Zip Codes

eFigure 3. Composite and Individual Locality Price Pass-Through

eFigure 4. Composite and Individual Changes in Volume Sales in Adjacent Border Zip Codes

eFigure 5. Augmented Synthetic Control Estimates for Individual Locality Changes in Price

eFigure 6. Augmented Synthetic Control Estimates for Individual Locality Changes in Volume Sales

eFigure 7. Augmented Synthetic Control Estimates of Individual Locality Changes in Volume Sales of SSB Products in Border Areas

eFigure 8. Augmented Synthetic Control Estimates for Composite Changes in Price and Volume Sales of SSB Products (Population Weighted)

eFigure 9. Augmented Synthetic Control Estimates of Composite Changes in Volume Sales of SSB Products in Border Areas (Population Weighted)

eFigure 10. Composite and Individual Locality Demand Elasticity Estimates (Urbanicity > 0.85)

eFigure 11. Augmented Synthetic Control Estimates for Composite Changes in Price and Volume Sales of SSB Products (Urbanicity > 0.85)

eFigure 12. Augmented Synthetic Control Estimates of Composite Changes in Volume Sales of SSB Products in Border Areas (Urbanicity > 0.85)

eFigure 13. Composite and Individual Locality Demand Elasticity Estimates (Urbanicity > 0.9)

eFigure 14. Augmented Synthetic Control Estimates for Composite Changes in Price and Volume Sales of SSB Products (Urbanicity > 0.9)

eFigure 15. Augmented Synthetic Control Estimates of Composite Changes in Volume Sales of SSB Products in Border Areas (Urbanicity > 0.9)

eFigure 16. TWFE Estimates of Composite Changes in Prices, Volume Sales, and Border Volume Sales

eFigure 17. TWFE Estimates of Individual Locality Changes in Prices

eFigure 18. TWFE Estimates of Individual Locality Changes in Volume Sales

eFigure 19. TWFE Estimates of Individual Locality Changes in Volume Sales of SSB Products in Border Areas

Data Sharing Statement

Kaplan S, White JS, Madsen KA, Basu S, Villas-Boas SB, Schillinger D. Evaluation of Changes in Prices and Purchases Following Implementation of Sugar-Sweetened Beverage Taxes Across the US. JAMA Health Forum. 2024;5(1):e234737. doi:10.1001/jamahealthforum.2023.4737

Evaluation of Changes in Prices and Purchases Following Implementation of Sugar-Sweetened Beverage Taxes Across the US

  • 1 Department of Economics, US Naval Academy, Annapolis, Maryland
  • 2 Department of Health Law, Policy & Management, Boston University School of Public Health, Boston, Massachusetts
  • 3 School of Public Health, University of California, Berkeley
  • 4 Institute of Health Policy, Management and Evaluation, University of Toronto, Ontario, Canada
  • 5 Department of Agricultural & Resource Economics, University of California, Berkeley
  • 6 Division of General Internal Medicine, Center for Vulnerable Populations, San Francisco General Hospital/University of California, San Francisco

Question   What changes occurred in sugar-sweetened beverage (SSB) prices and purchase volume after SSB taxes were implemented in 5 large US cities?

Findings   In this cross-sectional study, SSB taxes in Boulder, Colorado; Philadelphia, Pennsylvania; Oakland, California; San Francisco, California; and Seattle, Washington, were associated with a 33.1% composite increase in SSB prices (92% pass-through of taxes to consumers) and a 33% reduction in purchase volume, without increasing cross-border purchases. The results were sustained in the months following tax implementation.

Meaning   The results suggest substantial, consistent declines in SSB purchases across several US cities; insofar as reducing SSB consumption can improve population health, scaling SSB taxes more broadly should be considered.

Importance   Sugar-sweetened beverage (SSB) taxes are promoted as key policies to reduce cardiometabolic diseases and other conditions, but comprehensive analyses of SSB taxes in the US have been difficult because of the absence of sufficiently large data samples and methods limitations.

Objective   To estimate changes in SSB prices and purchases following SSB taxes in 5 large US cities.

Design, Setting, and Participants   In this cross-sectional study with an augmented synthetic control analysis, changes in prices and purchases of SSBs were estimated following SSB tax implementation in Boulder, Colorado; Philadelphia, Pennsylvania; Oakland, California; Seattle, Washington; and San Francisco, California. Changes in SSB prices (in US dollars) and purchases (volume in ounces) in these cities in the 2 years following tax implementation were estimated and compared with control groups constructed from other cities. Changes in adjacent, untaxed areas were assessed to detect any increase in cross-border purchases. Data used for this analysis spanned from January 1, 2012, to February 29, 2020, and were analyzed between June 1, 2022, and September 29, 2023.

Main Outcomes and Measures   The main outcomes were the changes in SSB prices and volume purchased.

Results   Using nutritional information, 5500 unique universal product codes were classified as SSBs, according to tax designations. The sample included 26 338 stores—496 located in treated localities, 1340 in bordering localities, and 24 502 in the donor pool. Prices of SSBs increased by an average of 33.1% (95% CI, 14.0% to 52.2%; P  < .001) during the 2 years following tax implementation, corresponding to an average price increase of 1.3¢ per oz and a 92% tax pass-through rate from distributors to consumers. SSB purchases declined in total volume by an average of 33.0% (95% CI, −2.2% to −63.8%; P  = .04) following tax implementation, corresponding to a −1.00 price elasticity of demand. The observed price increase and corresponding volume decrease immediately followed tax implementation, and both outcomes were sustained in the months thereafter. No evidence of increased cross-border purchases following tax implementation was found.

Conclusions and Relevance   In this cross-sectional study, SSB taxes led to substantial, consistent declines in SSB purchases across 5 taxed cities following price increases associated with those taxes. Scaling SSB taxes nationally could yield substantial public health benefits.

Sugar-sweetened beverages (SSBs) are a major source of nonnutritional calories and are associated with serious adverse health outcomes, including type 2 diabetes, obesity, cardiovascular disease, gum disease, caries, and others that contribute to morbidity and mortality. 1 , 2 Because of the associations between SSBs and these outcomes, excise taxes on SSBs have been proposed in the US and around the world. As of November 2022, 8 US jurisdictions and more than 50 countries have implemented some form of SSB tax. 3 Several systematic reviews and meta-analyses have examined the association of SSB excise taxes with both prices and consumption. 4 - 6 The most recent international review finds a pass-through rate from distributors to consumers of 82% (95% CI, 66%-98%), a mean reduction in SSB sales of 15% (95% CI, −20% to −9%), and an average demand elasticity of −1.59 (95% CI, −2.11 to −1.08). 7

Yet, nearly all US-based studies of SSB taxes analyzed 1 taxed city and compared it with a control city. To our knowledge, only 2 existing studies have evaluated joint estimates of SSB taxes across multiple taxed cities. 8 , 9 However, recent statistical advances suggest that these estimates likely suffer from bias associated with conventional 2-way fixed effects (TWFE) approaches that cannot account for time-varying confounders, which differ between experimental and control populations. 10 Unbiased estimation of a composite effect, which provides a pooled estimate of SSB taxes across multiple taxed cities, is critical for understanding the generalizability of SSB tax outcomes to different localities with heterogeneous characteristics; such an estimate is complementary to existing estimates from individual localities with SSB taxes in place. This estimate, though imperfect, also better informs the potential effectiveness of a nationwide tax, which was recommended by a recent federal commission on diabetes 11 and is especially relevant considering the beverage industry’s recent efforts to preempt localities from levying SSB taxes. 12

In this cross-sectional study with an augmented synthetic control (ASC) analysis, retail sales data from Boulder, Colorado; Philadelphia, Pennsylvania; Oakland, California; Seattle, Washington; and San Francisco, California, were used to estimate the composite effect of SSB taxes on SSB prices and volume purchased. We applied recent advances in statistical methods to estimate an ASC model with staggered adoption, which produces joint estimates from taxes in several treated cities, despite different timing of policy implementation. Unlike conventional TWFE approaches, an ASC model with staggered adoption addresses time-invariant and time-varying unobserved confounders that differ between taxed cities and their untaxed comparators. 10 , 13 , 14 We also estimated composite changes in cross-border shopping in untaxed adjacent areas to examine if consumers offset SSB purchases in bordering localities following SSB tax implementation.

This study followed the Strengthening the Reporting of Observational Studies in Epidemiology ( STROBE ) reporting guideline for cross-sectional studies. 15 Informed consent was waived because the data were deidentified. The research was determined not to meet the criteria for human participant research by the institutional review board at the University of California, San Francisco. The data used in this analysis spanned from January 1, 2012, to February 29, 2020, and were analyzed between June 1, 2022, and September 29, 2023.

Retail scanner data on SSB prices (in US dollars) and volume sold (in ounces) and a staggered adoption ASC approach were used to estimate the composite change in prices and purchases following the implementation of SSB taxes in Boulder, Philadelphia, Oakland, Seattle, and San Francisco. We also estimated composite changes in cross-border shopping using adjacent, untaxed areas.

The primary data set was the Nielsen Corporation’s retail scanner data. It consisted of product-week-store observations from selected chain stores in nearly all 3-digit zip codes across the US (871) over the study period from January 1, 2012, to February 29, 2020. The data included total units sold and the average sale price per unit for each observation. Beverage products from this data set were supplemented with nutritional and general product information from Label Insight (Nielsen Consumer LLC) 16 and hand-coded nutritional information. This enabled the classification of individual beverage products as SSBs or not, on the basis of tax regulations across the 5 cities. Artificially sweetened beverages were not included in the analysis, despite coverage in Philadelphia’s SSB tax. The eMethods in Supplement 1 contain additional details on product selection and tax status classification procedures.
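To make the classification step concrete, the sketch below flags beverage products as taxable SSBs from nutrition fields. It is only an illustration: the column names (upc, added_sugar_g, serving_oz, artificially_sweetened) and the sugar threshold are hypothetical, and the study applied each city's own tax definitions rather than a single rule like this.

```python
import pandas as pd

def flag_ssb(products: pd.DataFrame, min_sugar_g_per_oz: float = 0.5) -> pd.Series:
    """Illustrative rule: caloric sweetener above a per-ounce threshold and not
    artificially sweetened. Real tax ordinances differ across the five cities."""
    sugar_per_oz = products["added_sugar_g"] / products["serving_oz"]
    return (sugar_per_oz >= min_sugar_g_per_oz) & ~products["artificially_sweetened"]

# Toy example with made-up UPCs and nutrition values
products = pd.DataFrame({
    "upc": ["0001", "0002"],
    "added_sugar_g": [39.0, 0.0],
    "serving_oz": [12.0, 12.0],
    "artificially_sweetened": [False, True],
})
products["is_ssb"] = flag_ssb(products)
print(products[["upc", "is_ssb"]])
```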

The Table provides a summary of information about the study’s localities. Five taxed 3-digit zip codes were examined: 803 (Boulder), 191 (Philadelphia), 946 (Oakland), 981 (Seattle), and 941 (San Francisco); together these formed the full set of taxed jurisdictions examined. The California cities Berkeley and Albany (947) were not included because they were taxed at different times and could not be separately identified from one another (see Limitations). Localities with sales taxes on SSBs, which include the District of Columbia and the Navajo Nation, were omitted because such taxes tend to be smaller in magnitude and less likely to change purchasing behavior. Among the 5 treated localities in this cross-sectional study, SSB taxes were implemented on 3 different dates: in Philadelphia on January 1, 2017; in Boulder and Oakland on July 1, 2017; and in San Francisco and Seattle on January 1, 2018. Tax amounts ranged from 1¢ per oz to 2¢ per oz. Cross-border purchasing was examined in all immediately adjacent 3-digit zip codes, of which there were 13 (Table). These areas did not contain any taxed jurisdictions.

Two primary outcome measures were examined: the monthly weighted average shelf price of SSB products and the monthly total number of ounces of SSB products sold in treated localities, each compared with the synthetic control localities following tax implementation. Total ounces of SSB products sold was also the outcome used in the cross-border shopping analysis.

This cross-sectional study used an ASC approach. 17 The original synthetic control method uses a data-driven approach to construct a synthetic control unit as a weighted average of all potential control units that best match the treated unit on both the pretreatment outcome and prognostic factors. 18 The ASC approach extends this method by (1) allowing for multiple treated units experiencing treatment at different times and (2) providing a robust correction procedure when the synthetic unit’s pretreatment outcomes do not closely match those of the treated units. Using a donor pool of untaxed, nonbordering 3-digit zip codes, a synthetic treated unit was constructed for each of the 5 treated cities using pretax SSB prices and purchases, as well as a set of time-invariant characteristics from the 2010 Decennial Census and 2016 American Community Survey. 19 , 20 Data were analyzed using R statistical software, version 4.3.2 (R Project for Statistical Computing).
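As a rough illustration of the weighting step behind the original synthetic control method, the sketch below solves the classic constrained least-squares problem: non-negative donor weights that sum to one and minimize the pretreatment gap between a taxed city and its donor pool. The array names are assumptions, and the augmented estimator's bias-correction term used in the paper is omitted.

```python
import numpy as np
from scipy.optimize import minimize

def synthetic_weights(y_pre_treated: np.ndarray, y_pre_donors: np.ndarray) -> np.ndarray:
    """y_pre_treated: (T_pre,) pretreatment outcomes for one taxed city.
    y_pre_donors: (T_pre, J) pretreatment outcomes for J untaxed donor zip codes."""
    J = y_pre_donors.shape[1]

    def pre_fit_loss(w):  # squared pretreatment fit error
        return np.sum((y_pre_treated - y_pre_donors @ w) ** 2)

    res = minimize(
        pre_fit_loss,
        x0=np.full(J, 1.0 / J),
        bounds=[(0.0, 1.0)] * J,                                        # weights are non-negative
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],   # and sum to one
        method="SLSQP",
    )
    return res.x

# The synthetic control's post-tax trajectory is then y_post_donors @ weights,
# and the estimated effect is the gap between the taxed city and that trajectory.
```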

The primary ASC analyses were estimated at the 3-digit zip code–by-month level. We used the weighted average shelf price of SSBs and aggregated the total ounces purchased of SSBs at this unit of observation. Then, separate estimations assessed the composite posttax implementation change in shelf prices and volume sold in treated localities compared with a synthetic locality for each. Each individual city was given equal weight in calculating the composite outcome. The percentage change in shelf prices and volume sold was computed using pretax average shelf prices and volume sold in the treated localities.

Adhering to the approach in the study by Abadie, 21 the donor pool was limited to units with similar characteristics, namely jurisdictions within 1 SD (0.35) of the mean urbanicity level of the 5 treated localities (0.98), following the US Census definition of urban vs rural. A total of 284 three-digit zip codes remained, including the 5 treated localities, but omitting the 13 border localities. Sociodemographic and geographic characteristics used in constructing the synthetic units are shown in Figure 1 and described in the second section of the eMethods in Supplement 1 . These characteristics were chosen on the basis of previous research examining SSB taxes. 22 - 25

To determine the statistical significance of the ASC average treatment effects, which are calculated as the average posttax percentage changes in SSB prices and purchases for treated units relative to that of the synthetic control units, placebo estimates were generated for each donor unit one by one, as if each of those units had been treated. 18 Because treated localities implemented taxes at different times, this procedure was repeated for each treated locality, generating 279 × 5 = 1395 placebo estimates. To generate P values, the ratio of mean squared prediction error in the posttax vs pretax period was computed for the composite unit estimate and each placebo estimate, which were then ranked from largest to smallest. 26 The P value was calculated as the ratio of the composite unit ranking to the total number of units (1396) and indicated statistical significance when P  < .05. More details are provided in the eMethods in Supplement 1 .
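The placebo-based inference described above can be sketched as follows; the function names and inputs are illustrative, and each unit (the composite treated unit plus every placebo donor) is assumed to come with its pre- and post-tax gaps between the observed and synthetic outcome paths.

```python
import numpy as np

def mspe_ratio(pre_gap: np.ndarray, post_gap: np.ndarray) -> float:
    """Ratio of post-tax to pre-tax mean squared prediction error for one unit."""
    return float(np.mean(post_gap ** 2) / np.mean(pre_gap ** 2))

def placebo_p_value(treated_ratio: float, placebo_ratios: list[float]) -> float:
    """Rank the treated unit's ratio among all units (largest first);
    the p-value is that rank divided by the total number of units."""
    all_ratios = np.append(np.array(placebo_ratios), treated_ratio)
    rank = int(np.sum(all_ratios >= treated_ratio))
    return rank / all_ratios.size

# With 1395 placebo estimates, a treated unit whose ratio exceeds them all
# would receive a p-value of 1/1396, or roughly 0.0007.
```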

To fully quantify the changes following SSB taxes in treated cities, we also explored whether purchasing behavior changed in adjacent 3-digit border zip codes. The same ASC procedure was implemented, except all adjacent border localities were considered treated, and taxed cities were excluded. Because border localities tended to be semiurban or suburban, the subsample of donor pool units was modified to those featuring an urbanicity level within 1 SD (0.35) of the mean urbanicity of the 13 border localities (0.75). A total of 369 three-digit zip codes remained, including the 13 border localities. This analysis used the same Census characteristics and P value calculation approach.

To assess sensitivity, 2 alternative urbanicity cutoffs were used to determine the donor pool subsample: cutoffs of 0.9 and 0.85 reduced the donor pool of 3-digit zip codes to 204 and 226, respectively.

The main analytic sample included 28 512 three-digit zip code–by-month observations from 297 three-digit zip codes across 98 months. Using nutritional information from the supplementary hand-coded and Label Insight data, 5500 unique universal product codes (UPCs) were confirmed as SSBs according to the tax designations. The sample included 26 338 stores—496 located in treated localities, 1340 in bordering localities, and 24 502 in the donor pool. The Table provides summary information for each group of localities.

Figure 1 compares each treated unit with its corresponding synthetic unit, focusing on pretax mean SSB volume in ounces and the 12 sociodemographic and geographic covariates. (In Supplement 1 , eFigure 1 displays the price analysis comparisons.) Variables were scaled to be between 0 and 100, so that the units of measure were comparable. In most instances, these values were highly similar (within 5 index points), and no comparisons differed by more than 14 index points. In Supplement 1 , eFigure 2 displays sample distributions of each Census characteristic.

In the composite treated locality, shelf prices of SSB products increased by an average of 33.1% (95% CI, 14.0%-52.2%; P  < .001) in the 2 years following tax implementation, relative to the average percentage change in the composite synthetic locality. This corresponded to an average price increase of 1.3¢ per oz ( Figure 2 ) and a 92% price pass-through rate (eFigure 3 in Supplement 1 ). The volume of SSBs purchased declined by an average of 33.0% (95% CI, −2.2% to −63.8%; P  = .04) during the same time frame, relative to the average percentage change in the composite synthetic locality. This corresponded to an average monthly change of 18 534 oz/store-month ( Figure 2 ). Together, these estimates yielded a −1.00 price elasticity of demand, suggesting SSB purchasing behavior was responsive to changes in shelf prices ( Figure 2 ). Figure 2 also shows changes in shelf prices and volume purchased for the 5 taxed localities individually. The demand elasticity estimates were relatively consistent across taxed localities, ranging from −0.80 (Philadelphia) to −1.37 (Seattle). Shelf price changes for individual cities were significant at the 10% level, yet null changes in volume purchased could not be rejected for each city at the 10% level.
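As a back-of-envelope check on the headline figures above, the price elasticity of demand is simply the ratio of the composite percentage changes, and the pass-through rate relates the observed price increase to the tax amount. The average tax used below is an illustrative assumption (the actual taxes ranged from 1¢ to 2¢ per oz), so the numbers are only approximate.

```python
pct_price_change = 33.1    # composite % increase in SSB shelf prices
pct_volume_change = -33.0  # composite % change in SSB volume purchased

elasticity = pct_volume_change / pct_price_change
print(round(elasticity, 2))  # -> -1.0, the reported price elasticity of demand

price_increase_cents_per_oz = 1.3
assumed_avg_tax_cents_per_oz = 1.4   # illustrative average of the 1-2 cent/oz taxes
pass_through = price_increase_cents_per_oz / assumed_avg_tax_cents_per_oz
print(round(pass_through, 2))        # -> 0.93, in line with the reported 92%
```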

Figure 3 A shows time-varying ASC results for SSB shelf prices, and Figure 3 B shows this information for volume sales. The blue line indicates the difference between the composite treated unit and synthetic unit, and the gray lines represent each placebo estimate. In both analyses, a close fit between the composite treated unit and synthetic unit was found in the pretax period. There was a steep, immediate increase in shelf prices and decrease in volume sales following tax implementation, which was sustained in the months thereafter.

Each city in the composite analysis was equally weighted. The procedure and context through which each city introduced an SSB tax vary, and the findings are intended for policymakers considering tax implementation in specific geographies. The population-weighted composite estimates are similar (eFigures 8 and 9 in Supplement 1).

The analyses for different urbanicity cutoffs generated similar results (eFigures 10 and 13 in Supplement 1 ). In Supplement 1 , eFigures 5 and 6 show the individual city ASC analyses.

Figure 4 shows the time-varying ASC results for cross-border SSB volume sales. There was no statistically significant mean change in cross-border purchases of SSBs following tax implementation (−2.4%; 95% CI, −12.8% to 8.1%; P  = .67), which remained stable in the years following the tax. No significant change in cross-border SSB volume purchases was observed in each taxed city (eFigure 4 in Supplement 1 ). Estimates for different urbanicity cutoffs provided similar findings (eFigures 12 and 15 in Supplement 1 ). In Supplement 1 , eFigure 7 displays the time-varying cross-border analyses for each taxed city.

In this cross-sectional study with an ASC analysis, SSB excise taxes were associated with large, consistent declines in SSB purchases across 5 US taxed cities following tax-driven price changes. Quasi-experimental methods were used to estimate the overall changes following SSB taxes implemented at different times and locations relative to a synthetic control of untaxed areas. The results show shelf prices of SSB products increased by an average of 33.1% (1.3¢ per oz) in the years following SSB tax implementation, corresponding to a 92% price pass-through rate from distributors to consumers. Volume sales fell by 33.0% during the same time frame, without evidence of changes in cross-border shopping in untaxed adjacent areas.

Although the estimates generally support previous estimates from single-city studies, these results help answer the critical question of how much variation across taxed localities is due to the unique characteristics of a locality vs the generalizable outcomes of a tax. Compared with a recent international meta-analysis of SSB taxes, the results suggest a slightly higher pass-through rate, a substantially larger reduction in volume purchased, and moderately less demand responsiveness to price changes. 6 These modest discrepancies may reflect differences in geographic areas of comparators, store sample composition, and greater accounting of unmeasured confounders in this analysis than in previous studies. Additionally, conflicting findings have been found regarding cross-border purchasing following SSB taxes, with some studies pointing to significant increases and others finding no changes. 27 - 30 The results provided no evidence of changes in cross-border purchasing.

To further contextualize the findings, we estimated a TWFE event-study model, detailed in the third section of the eMethods in Supplement 1 . This model has been the primary approach taken in previous SSB tax evaluation studies. In Supplement 1 , eTable 3 shows the point estimates are generally comparable with the ASC estimates, although some moderate differences exist. Inspection of the prepolicy coefficients in the event-study plots suggests that these estimates have varying degrees of bias associated with imperfect pretrends (eFigures 16-19 in Supplement 1 ). 31 , 32 The TWFE estimates are much more precisely estimated than the ASC estimates, in part because the TWFE CIs may be overly narrow. 33 - 35 Nevertheless, this trade-off highlights this study’s focus on generating unbiased estimates at the partial expense of precision.
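For comparison, a two-way fixed effects estimate of the kind used in this robustness check can be sketched with statsmodels. The panel below is a toy simulation (the zip codes, months, post-tax indicator, and log volumes are all made up), and the paper's actual event-study specification interacts treatment with event-time dummies rather than using a single post indicator.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy panel: 4 zip codes observed over 6 months; two become taxed after month 3
rng = np.random.default_rng(0)
zips, months = ["191", "981", "600", "750"], range(1, 7)
df = pd.DataFrame([(z, m) for z in zips for m in months], columns=["zip3", "month"])
df["treated_post"] = (df["zip3"].isin(["191", "981"]) & (df["month"] > 3)).astype(int)
df["log_volume"] = 10 - 0.4 * df["treated_post"] + rng.normal(0, 0.05, len(df))

# TWFE: unit and time fixed effects plus the treatment indicator,
# with standard errors clustered by zip code
model = smf.ols("log_volume ~ treated_post + C(zip3) + C(month)", data=df)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["zip3"]})
print(result.params["treated_post"])  # recovers roughly the simulated -0.4 effect
```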

It is important to interpret these estimates in the context of projected health benefits. Several studies have found that a 15% to 20% increase in price/decrease in consumption generates significant health benefits, including reductions in myocardial infarction events, ischemic heart disease, coronary heart events, strokes, diabetes, and obesity. 36 - 38 This study estimated a 33.1% increase in price and a corresponding 33.0% decrease in volume, suggesting health benefits at least as substantial as those found previously.

Additionally, studies have suggested that SSB taxes are highly cost-effective. 22 , 37 , 39 Wang et al 37 found a nationwide tax could have avoided $17 billion in medical costs between 2010 and 2020. Lee et al 39 found approximately $53 billion in cost savings throughout an average individual lifetime. More recently, White et al 22 found that a 27% reduction in consumption in Oakland is expected to accrue more than $100 000 per 10 000 residents in societal cost savings during a 10-year period. This study’s findings suggest SSB taxation would likely generate significant improvements in population health and substantial cost savings.

Limitations

First, the retail scanner data identify purchasing behavior, not direct consumption. It is possible, though unlikely, that taxed populations consumed a different share of purchased SSBs than did untaxed control populations (eg, producing more waste). Second, the data were geocoded by 3-digit zip code. This prevented Berkeley and Albany (3-digit zip code 947) from being included because they could not be separately identified and were taxed at different times. The 3-digit zip codes for included taxed cities contained a small number of untaxed jurisdictions, accounting for less than 7% of the total population of these areas (eTable 2 in Supplement 1). However, this misclassification should only lead to an underestimate of the changes following tax implementation.

We also lacked nutritional information for certain beverage UPCs. Of the UPCs of SSBs in the scanner data, we successfully matched 84.0% of sales volume in ounces using Label Insight and hand-coded data featuring nutritional information. To the extent that the set of unmatched UPCs was similar across taxed and untaxed jurisdictions, the findings should be unaffected. Additionally, the scanner data contained only a subsample of all stores in each zip code; thus, the data did not include all volume sales. Using SSB tax revenues to estimate total volume sales in treated localities, coverage from this set of products was 12.7% (eTable 1 in Supplement 1 ). The coverage estimates were similar but slightly lower than recent SSB tax evaluations using Nielsen data. 29 , 40 Lower coverage in Philadelphia was partially due to the exclusion of artificially sweetened beverages from this analysis. Coverage could not be calculated in donor zip codes because there were no SSB taxes in place. However, the ASC estimation generated a reliable counterfactual group from the existing sample of donor zip codes, which should mitigate any unintended bias caused by unequal SSB coverage across treatment and control localities.

Next, although the ASC estimates for each individual city in the volume analysis ( Figure 2 ; eFigure 6 in Supplement 1 ) were similar to those in prior studies, 7 they were relatively imprecise, and a null effect could not be rejected at the 5% level. Furthermore, although the composite estimates for the volume analysis were much more precise, reductions in purchases as small as 2% or as large as 64% could not be ruled out at a 95% CI level. While synthetic control methods deliver less biased estimates than difference-in-differences approaches, they also generate less statistical power. 41 However, difference-in-differences studies involving a small number of treated units may underestimate the true variance of effect estimates. 33 - 35 As more localities introduce SSB taxes, synthetic control methods with staggered adoption will have greater precision.

In addition, only posted shelf prices were observed in the scanner data, which may lead to underestimates of pass-through rates. While excise taxes are generally reflected in shelf prices, certain retailers may have only included the tax once products were scanned at the register. 42 Moreover, the scanner data were primarily composed of information from large chain stores. Thus, these results may not extend to independent stores, although similar estimates have been found in those settings. 43 Finally, the 5 treated localities studied here, while geographically distinct and racially, ethnically, and socioeconomically diverse, were not fully representative of the US population. Therefore, the findings may not be fully generalizable on a national scale, a limitation most relevant to less urban populations.

In this cross-sectional study with an ASC analysis, SSB taxes in Boulder, Philadelphia, Oakland, San Francisco, and Seattle were associated with 33.1% composite increases in SSB prices (92% pass-through rate) and 33.0% reductions in SSB purchases, with no offset through cross-border purchases of SSBs. The changes in prices and purchases remained stable in the years following tax implementation. The findings have important implications for the potential efficacy of SSB taxes across larger geographic jurisdictions and at the national level. Scaling SSB excise taxes across the US would likely generate significant population health benefits and medical cost savings.

Accepted for Publication: October 26, 2023.

Published: January 5, 2024. doi:10.1001/jamahealthforum.2023.4737

Correction: This article was corrected on February 16, 2024, to fix the upper 95% CI value in Figure 3A from 52.5% to 52.2%.

Open Access: This is an open access article distributed under the terms of the CC-BY License . © 2024 Kaplan S et al. JAMA Health Forum .

Corresponding Author: Scott Kaplan, PhD, Department of Economics, US Naval Academy, 106 and 107 Maryland Ave, Annapolis, MD 21402 ( [email protected] ).

Author Contributions: Prof Kaplan had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Drs Basu and Villas-Boas contributed equally to this work.

Concept and design: All authors.

Acquisition, analysis, or interpretation of data: Kaplan, White, Madsen, Schillinger.

Drafting of the manuscript: Kaplan, White, Schillinger.

Critical review of the manuscript for important intellectual content: All authors.

Statistical analysis: Kaplan, White.

Obtained funding: Schillinger.

Administrative, technical, or material support: Kaplan, Madsen, Schillinger.

Supervision: Kaplan, White, Basu, Villas-Boas, Schillinger.

Conflict of Interest Disclosures: Dr Basu reported grants from the National Institutes of Health; Centers for Disease Control and Prevention; personal fees from the University of California, San Francisco; Collective Health; Waymark; and HealthRIGHT 360 outside the submitted work. No other disclosures were reported.

Funding/Support: This work was supported by grants from the National Institute of Diabetes and Digestive and Kidney Diseases (R01 DK116852 and 2P30 DK092924), and the Centers for Disease Control and Prevention’s National Center for Chronic Disease Prevention and Health Promotion (U18DP006526).

Role of the Funder: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: All estimates and analyses in this article are by the authors and not by the Nielsen Corporation. Researchers’ analyses were calculated (or derived) based in part on data from Nielsen Consumer LLC and marketing databases provided through the NielsenIQ data sets at the Kilts Marketing Data Center at the University of Chicago Booth School of Business. The conclusions drawn from the NielsenIQ data are those of the researchers and do not reflect the views of NielsenIQ. NielsenIQ is not responsible for, had no role in, and was not involved in analyzing and preparing the results reported herein.

Data Sharing Statement: See Supplement 2 .

Additional Contributions: We thank University of California, Berkeley undergraduate students Youssef Andrawis, Ryan Andresen, Anqi Chen, Matthew Chill, Maggie Deng, Amanda Gold, Drake Hayes, Nan Hou, Liam Howell, Hongxian Huang, Zixia Huang, Jason Liu, Julie Maeng, Michael Quiroz, Emaan Saddique, Caroline Wu, Yuemin Xu, April You, Haolin Zhang, and Yihui Zhu for assistance with entering and cleaning the nutritional data and beverage classification.

COMMENTS

  1. Data Science and Analytics: An Overview from Data-Driven Smart

    Introduction. We are living in the age of "data science and advanced analytics", where almost everything in our daily lives is digitally recorded as data [].Thus the current electronic world is a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc [].

  2. Home

    Overview. The International Journal of Data Science and Analytics is a pioneering journal in data science and analytics, publishing original and applied research outcomes. Focuses on fundamental and applied research outcomes in data and analytics theories, technologies and applications. Promotes new scientific and technological approaches for ...

  3. (PDF) Data Analytics: A Literature Review Paper

    This paper aims to analyze some of the different analytics methods and tools which can be applied to big data, as well as the opportunities provided by the application of big data analytics in ...

  4. (PDF) Data Analytics and Techniques: A Review

    The lifecycle for data analysis will help to manage and organize the tasks connected to big data research and analysis. Data Analytics evolution with big data analytics, SQL analytics, and ...

  5. Research on Data Science, Data Analytics and Big Data

    Abstract. Big Data refers to a huge volume of data of various types, i.e., structured, semi structured, and unstructured. This data is generated through various digital channels such as mobile, Internet, social media, e-commerce websites, etc. Big Data has proven to be of great use since its inception, as companies started realizing its importance for various business purposes.

  6. Predictive Analytics: A Review of Trends and Techniques

    Predictive analytics, a branch in the domain of advanced analytics, is used in predicting future events. It analyzes the current and historical data in order to make predictions about the ...

  7. Home page

    The Journal of Big Data publishes open-access original research on data science and data analytics. Deep learning algorithms and all applications of big data are welcomed. Survey papers and case studies are also considered. The journal examines the challenges facing big data today and going forward including, but not limited to: data capture ...

  8. Big data analytics capabilities: Patchwork or progress? A systematic

    This paper presents a systematic literature review of the research field on big data analytics capabilities (BDACs). With the emergence of big data and digital transformation, a growing number of researchers have highlighted the need for organizations to develop BDACs. ... In brief, existing papers have neglected research on BDAC antecedents or ...

  9. Big Data Research

    About the journal. The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in …. View full aims & scope.

  10. Data science: a game changer for science and innovation

    This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We present how data science impacts science and society at large in the coming years, including ethical problems in managing human behavior data and considering the quantitative expectations of data science economic impact. We introduce concepts such as open science and e ...

  11. Big data analytics: a survey

    Abundant research results of data analysis [20, 27, 63] show possible solutions for dealing with the dilemmas of data mining algorithms. It means that the open issues of data analysis from the literature ... In this paper, by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that ...

  12. Big data analytics in healthcare: a systematic literature review

    2.1. Characteristics of big data. The concept of BDA overarches several data-intensive approaches to the analysis and synthesis of large-scale data (Galetsi, Katsaliaki, and Kumar Citation 2020; Mergel, Rethemeyer, and Isett Citation 2016).Such large-scale data derived from information exchange among different systems is often termed 'big data' (Bahri et al. Citation 2018; Khanra, Dhir ...

  13. Learning to Do Qualitative Data Analysis: A Starting Point

    On the basis of Rocco (2010), Storberg-Walker's (2012) amended list on qualitative data analysis in research papers included the following: (a) the article should provide enough details so that reviewers could follow the same analytical steps; (b) the analysis process selected should be logically connected to the purpose of the study; and (c ...

  14. The role of data science and data analytics for innovation: a

    1. Introduction. Methods of data analytics, data science, business analytics, and big data - which we collectively refer to as Data Science/Data Analytics (DS/DA) - are increasingly important topics in business practice as well as for academia.

  15. Research themes in big data analytics for policymaking: Insights from a

    Our results offer scholars in public policy a vantage point to the theoretical foundations of research in big data and data analytics in public policy-making. We also draw from the identified communities to highlight emerging research themes that can guide research forward. ... This approach has been used in several research papers to form the ...

  16. Data Science & Analytics Research Topics (Includes Free Webinar)

    Data Science-Related Research Topics. Developing machine learning models for real-time fraud detection in online transactions. The use of big data analytics in predicting and managing urban traffic flow. Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.

  17. A Review of Artificial Intelligence Methods for Data Science and Data

    By treating the end-to-end data science workflow as data itself and through the conceptual modeling of the goals and functional intent of the data analyst, the entire process of data analytics ...

  18. Big data analytics and firm performance: Findings from a mixed-method

    Several research papers demonstrate that big data analytics, when applied to problems of specific domains such as healthcare, service provision, supply chain management, ... An emerging theme in big data analytics and business value research is that companies differ in the way they operate, and thus require attention in different sets of ...

  19. Different Types of Data Analysis; Data Analysis Methods and ...

    This article is concentrated to define data analysis and the concept of data preparation. Then, the data analysis methods will be discussed. For doing so, the f ... Hamed, Different Types of Data Analysis; Data Analysis Methods and Techniques in Research Projects (August 1, 2022). ... Research Paper Series; Conference Papers; Partners in ...

  20. The role of data science and data analytics for innovation: a

    DOI: 10.1080/2573234x.2024.2365917 Corpus ID: 270622352; The role of data science and data analytics for innovation: a literature review @article{NatividadeJoergensen2024TheRO, title={The role of data science and data analytics for innovation: a literature review}, author={Pedro Natividade Joergensen and Michael Zaggl}, journal={Journal of Business Analytics}, year={2024}, url={https://api ...

  21. A practical guide to data analysis in general literature reviews

    This article is a practical guide to conducting data analysis in general literature reviews. The general literature review is a synthesis and analysis of published research on a relevant clinical issue, and is a common format for academic theses at the bachelor's and master's levels in nursing, physiotherapy, occupational therapy, public health and other related fields.
