Machine Learning: Algorithms, Real-World Applications and Research Directions

  • Review Article
  • Published: 22 March 2021
  • Volume 2, article number 160 (2021)


  • Iqbal H. Sarker, ORCID: orcid.org/0000-0003-1740-5517


In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, and so on. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.



Introduction

We live in the age of data, where everything around us is connected to a data source and everything in our lives is digitally recorded [ 21 , 103 ]. For instance, the current electronic world has a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. These data can be structured, semi-structured, or unstructured, discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”, and their volume is increasing day by day. Insights extracted from these data can be used to build various intelligent applications in the relevant domains. For instance, the relevant cybersecurity data can be used to build a data-driven automated and intelligent cybersecurity system [ 105 ]; the relevant mobile data can be used to build personalized context-aware smart mobile applications [ 103 ]; and so on. Thus, data management tools and techniques that can extract insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Fig. 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically without being explicitly programmed, and is generally regarded as one of the most popular technologies of the fourth industrial revolution (4IR or Industry 4.0) [ 103 , 105 ]. “Industry 4.0” [ 114 ] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and develop the corresponding real-world applications, machine learning algorithms are the key. Learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”. The popularity of these learning approaches is increasing day by day, as shown in Fig. 1 , based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates specific dates, and the corresponding popularity score, within the range of \(0 \; (minimum)\) to \(100 \; (maximum)\), is shown on the y-axis. According to Fig. 1 , the popularity values for these learning types were low in 2015 and have been increasing since. These statistics motivate us to study machine learning in this paper, which can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms . In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41 , 125 ]. Besides, deep learning, which originated from the artificial neural network and is known as part of a wider family of machine learning approaches, can be used to intelligently analyze data [ 96 ]. Thus, selecting a learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of different learning algorithms in the same category may vary depending on the data characteristics [ 106 ]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “ Applications of Machine Learning ”.

Based on the importance and potential of “Machine Learning” to analyze the data mentioned above, in this paper we provide a comprehensive view of various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for academics and industry professionals who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [ 103 , 105 ]. Data can take various forms, such as structured, semi-structured, or unstructured [ 41 , 72 ]. Besides, “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of data.

Structured: Structured data have a well-defined structure, conform to a data model following a standard order, are highly organized, and are easily accessed and used by an entity or a computer program. Structured data are typically stored in well-defined schemes such as relational databases, i.e., in a tabular format. Names, dates, addresses, credit card numbers, stock information, and geolocation are examples of structured data.

Unstructured: On the other hand, unstructured data have no pre-defined format or organization, making them much more difficult to capture, process, and analyze; they mostly contain text and multimedia material. For example, sensor data, emails, blog entries, wikis, word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but they do have certain organizational properties that make them easier to analyze. HTML, XML, JSON documents, and NoSQL databases are some examples of semi-structured data.

Metadata: Metadata are not a normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply material that can classify, measure, or document something relative to an organization’s data properties, while metadata describes the relevant information about those data, giving them more significance for data users. Basic examples of a document’s metadata are its author, file size, creation date, and the keywords that describe it.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These include, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], and Bot-IoT [ 59 ]; smartphone datasets such as phone call logs [ 84 , 101 ], SMS logs [ 29 ], mobile application usage logs [ 117 , 137 ], and mobile phone notification logs [ 73 ]; IoT data [ 16 , 57 , 62 ]; agriculture and e-commerce data [ 120 , 138 ]; health data such as heart disease [ 92 ], diabetes mellitus [ 83 , 134 ], and COVID-19 [ 43 , 74 ]; and many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2 . In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Fig. 2: Various types of machine learning techniques

Supervised: Supervised learning is typically the machine learning task of learning a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are to be accomplished from a certain set of inputs [ 105 ], i.e., it is a task-driven approach . The most common supervised tasks are “classification”, which separates the data, and “regression”, which fits the data. For instance, predicting the class label or sentiment of a piece of text, such as a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., it is a data-driven process [ 41 ]. It is widely used for extracting generative features, identifying meaningful trends and structures, grouping results, and for exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data can be rare in several contexts while unlabeled data are plentiful, which is where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better prediction outcome than would be produced using the labeled data alone. Some application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.

Reinforcement: Reinforcement learning is a type of machine learning that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [ 52 ], i.e., it is an environment-driven approach . This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.
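
To make the reward-driven idea concrete, the following is a minimal Q-learning sketch on an invented toy task (illustrative only; the corridor environment, states, and hyperparameters are assumptions, not taken from the paper). The agent walks a 1-D corridor of five states and receives a reward of +1 only at the goal state.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                 # move left or move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def choose(state):
    # epsilon-greedy: explore occasionally, otherwise pick the best-valued
    # action, breaking ties at random
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

random.seed(0)
for _ in range(300):               # episodes
    s, done = 0, False
    while not done:
        a = choose(s)
        nxt, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward reward + discounted best
        # value of the next state
        Q[(s, a)] += alpha * (r + gamma * max(Q[(nxt, b)] for b in ACTIONS) - Q[(s, a)])
        s = nxt

# The learned greedy policy should move right in every non-goal state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)
```

After training, the greedy policy prefers the action that leads toward the goal in every state, which is exactly the "increase the reward" behavior described above.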

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1 , we summarize various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that include classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model has been shown in Fig. 3 , where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for the new test data.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning. It refers to a problem of predictive modeling, where a class label is predicted for a given example [ 41 ]. Mathematically, it learns a mapping function ( f ) from input variables ( X ) to output variables ( Y ) as targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, with classes “spam” and “not spam” in email service providers, is a classification problem. In the following, we summarize the common classification problems.

Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.

Multiclass classification: Traditionally, this refers to classification tasks having more than two class labels [ 41 ]. Unlike binary classification tasks, multiclass classification does not have the notion of normal and abnormal outcomes; instead, examples are classified as belonging to one of a range of specified classes. For example, classifying the various types of network attacks in the NSL-KDD [ 119 ] dataset can be a multiclass classification task, where the attack categories are classified into four class labels: DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, in which the classes involved in the problem are hierarchically structured and each example may simultaneously belong to more than one class at each hierarchical level, e.g., multi-level text classification. For instance, a Google News story can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification relies on advanced machine learning algorithms that support predicting multiple mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well in many real-world situations, such as document or text classification and spam filtering, and can be used for both binary and multi-class categories. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. Its key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [ 82 ]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
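
As a minimal sketch of the spam-filtering use case mentioned above, the following implements a multinomial-style naive Bayes classifier from scratch (the tiny four-message corpus is invented for illustration; it is not from the paper):

```python
import math
from collections import Counter, defaultdict

train = [
    ("win cash prize now", "spam"),
    ("cheap cash offer now", "spam"),
    ("meeting agenda for tomorrow", "not spam"),
    ("project meeting notes attached", "not spam"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)          # per-class word frequencies
vocab = set()
for text, label in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for label in class_counts:
        # log prior + sum of log likelihoods with Laplace (add-one) smoothing,
        # relying on the naive independence assumption between words
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("cash prize offer"))       # -> spam
print(predict("notes for the meeting"))  # -> not spam
```

Note how little data is needed to estimate the parameters, which mirrors the key benefit described above.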

Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes’ rule [ 51 , 82 ]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that minimizes the complexity of the model or reduces the resulting model’s computational cost. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which also seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is Logistic Regression (LR) [ 64 ]. Logistic regression typically estimates probabilities using the logistic function, also referred to as the sigmoid function, \(g(z) = \frac{1}{1 + e^{-z}}\). It can overfit high-dimensional datasets and works well when the dataset can be separated linearly. Regularization (L1 and L2) techniques [ 82 ] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
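
A minimal sketch of logistic regression trained with gradient descent on the sigmoid follows (the one-dimensional, linearly separable data points are invented for illustration):

```python
import math

# Label 1 when the feature value is "large", 0 otherwise.
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0,   0,   0,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    # gradient of the log-loss over the whole (tiny) dataset
    gw = sum((sigmoid(w * xi + b) - yi) * xi for xi, yi in zip(X, y)) / len(X)
    gb = sum((sigmoid(w * xi + b) - yi) for xi, yi in zip(X, y)) / len(X)
    w -= lr * gw
    b -= lr * gb

preds = [1 if sigmoid(w * xi + b) >= 0.5 else 0 for xi in X]
print(preds)   # should reproduce the training labels on this separable data
```

Since the data are linearly separable, the fitted boundary settles between the two groups, which is exactly the situation where logistic regression works well per the text above.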

K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based” or non-generalizing learning method, also known as a “lazy learning” algorithm. It does not construct a general internal model; instead, it stores all instances of the training data in an n -dimensional space. KNN classifies new data points based on similarity measures (e.g., the Euclidean distance function) [ 82 ]. Classification is computed by a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and its accuracy depends on the data quality. The biggest issue with KNN is choosing the optimal number of neighbors to consider. KNN can be used for both classification and regression.
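
The distance-and-vote procedure above fits in a few lines; here is a minimal sketch with Euclidean distance and majority voting (the 2-D points and labels are invented for illustration):

```python
import math
from collections import Counter

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

def knn_predict(point, k=3):
    # "lazy learning": no model is built; just rank stored instances
    # by Euclidean distance to the query point
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    # simple majority vote among the k nearest labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.4)))  # -> A
print(knn_predict((6.4, 6.6)))  # -> B
```

The choice of k here (k=3) is arbitrary, which illustrates the tuning issue noted above: the optimal number of neighbors depends on the data.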

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is the support vector machine (SVM) [ 56 ]. A support vector machine constructs a hyper-plane or set of hyper-planes in a high- or infinite-dimensional space. Intuitively, the hyper-plane that has the greatest distance from the nearest training data points of any class achieves a strong separation, since in general the greater the margin, the lower the classifier’s generalization error. SVM is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as kernels. Linear, polynomial, radial basis function (RBF), and sigmoid are popular kernel functions used in the SVM classifier [ 82 ]. However, when the dataset contains more noise, such as overlapping target classes, SVM does not perform well.

Decision tree (DT): The decision tree (DT) [ 88 ] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [ 82 ]. ID3 [ 87 ], C4.5 [ 88 ], and CART [ 20 ] are well-known DT algorithms. Moreover, the recently proposed BehavDT [ 100 ] and IntrudTree [ 97 ] by Sarker et al. are effective in the relevant application domains, such as user behavior analytics and cybersecurity analytics, respectively. A DT classifies instances by sorting them down the tree from the root to some leaf node, as shown in Fig. 4 . Starting at the root node of the tree, instances are classified by checking the attribute defined by that node and then moving down the tree branch corresponding to the attribute value. For splitting, the most popular criteria are “gini” for the Gini impurity, \(Gini = 1 - \sum _{i} p_i^2\), and “entropy” for the information gain, \(Entropy = -\sum _{i} p_i \log _2 p_i\), where \(p_i\) is the proportion of instances of class i at the node [ 82 ].
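
Both splitting criteria are straightforward to compute for the class labels at a node; a short sketch (the label lists are invented for illustration):

```python
import math
from collections import Counter

# Gini impurity and entropy of the class labels at a tree node.
# Information gain is the entropy of a node minus the weighted
# entropy of its children after a candidate split.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure, mixed = ["yes"] * 4, ["yes", "yes", "no", "no"]
print(gini(pure), entropy(pure))    # 0.0 0.0  (a pure node has no impurity)
print(gini(mixed), entropy(mixed))  # 0.5 1.0  (a 50/50 split is maximally impure)
```

A splitter simply evaluates these scores for every candidate attribute threshold and keeps the split that most reduces the impurity.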

Fig. 4: An example of a decision tree structure

Fig. 5: An example of a random forest structure considering multiple decision trees

Random forest (RF): The random forest classifier [ 19 ] is well known as an ensemble classification technique used in various application areas of machine learning and data science. This method uses “parallel ensembling”: it fits several decision tree classifiers in parallel on different sub-samples of the dataset, as shown in Fig. 5 , and uses majority voting or averaging for the outcome or final result. It thus minimizes the over-fitting problem and increases prediction accuracy and control [ 82 ]. Therefore, an RF learning model with multiple decision trees is typically more accurate than a single decision tree-based model [ 106 ]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [ 18 ] with random feature selection [ 11 ]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
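
A minimal sketch of the bagging-plus-majority-voting idea follows. For brevity it uses one-level decision stumps rather than full trees, and the one-feature dataset is invented; a real random forest also randomizes the features considered at each split:

```python
import random
from collections import Counter

data = [(x, "low" if x < 5 else "high") for x in range(10)]

def train_stump(sample):
    # choose the threshold t that misclassifies the fewest sample points
    best_t, best_err = 1, float("inf")
    for t in range(1, 10):
        err = sum(("low" if x < t else "high") != label for x, label in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

random.seed(1)
stumps = []
for _ in range(15):
    # bootstrap aggregation (bagging): sample the data with replacement
    bootstrap = [random.choice(data) for _ in data]
    stumps.append(train_stump(bootstrap))

def forest_predict(x):
    # majority vote across the independently trained stumps
    votes = Counter("low" if x < t else "high" for t in stumps)
    return votes.most_common(1)[0][0]

print(forest_predict(2), forest_predict(7))
```

Because each stump sees a different bootstrap sample, individual mistakes tend to be voted away, illustrating why the ensemble is typically more accurate than a single model.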

Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Yoav Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier of high accuracy by combining many poorly performing classifiers. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, but in some instances it can trigger over-fitting. AdaBoost is best used to boost the performance of decision trees, its base estimator [ 82 ], on binary classification problems; however, it is sensitive to noisy data and outliers.

Extreme gradient boosting (XGBoost): Gradient boosting, like random forests [ 19 ] above, is an ensemble learning approach that generates a final model from a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It computes second-order gradients of the loss function to minimize loss and applies advanced regularization (L1 and L2) [ 82 ], which reduces over-fitting and improves model generalization and performance. XGBoost is fast and can handle large datasets well.

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with suitable smoothness properties, where the word “stochastic” refers to random probability. It reduces the computational burden, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function that measures a variable’s degree of change in response to changes in another variable. Mathematically, gradient descent operates on a convex function and takes partial derivatives with respect to its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) the cost of the \(i \mathrm{th}\) training example; then the stochastic gradient descent weight update at the \(j \mathrm{th}\) iteration is \(w_{j+1} = w_j - \alpha \frac{\partial J_i}{\partial w_j}\). In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
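
A minimal sketch of the per-example update applied to a one-variable linear model with squared-error loss follows (the noise-free data and learning rate are invented for illustration):

```python
import random

# Fit y = w*x + b by SGD: each update uses a single randomly chosen
# training example rather than the gradient over the full dataset.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w, b, alpha = 0.0, 0.0, 0.05     # alpha is the learning rate
random.seed(0)
for _ in range(5000):
    x, y = random.choice(data)   # one example per iteration ("stochastic")
    err = (w * x + b) - y
    # weight update w := w - alpha * dJ_i/dw, with J_i = 0.5 * err^2
    w -= alpha * err * x
    b -= alpha * err

print(round(w, 2), round(b, 2))  # close to the true values 2.0 and 1.0
```

Each iteration touches only one example, which is what makes SGD cheap per step on large datasets, at the cost of a noisier path toward the minimum.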

Rule-based classification : The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms with rule-generation capability exist, such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87 , 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ]. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret, the ability to handle high-dimensional data, simplicity and speed, good accuracy, and the capability to produce classification rules that are clear and understandable to humans [ 127 , 128 ]. Decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.
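
A minimal IF-THEN rule learner in the spirit of One-R can be sketched as follows (the tiny weather-style dataset is invented for illustration). One-R builds one rule per value of a single attribute, predicting that value's majority class, and keeps the attribute whose rule set makes the fewest errors on the training data:

```python
from collections import Counter, defaultdict

data = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

def one_r(rows, target):
    best = (float("inf"), None, None)        # (errors, attribute, rules)
    for attr in (k for k in rows[0] if k != target):
        per_value = defaultdict(Counter)
        for row in rows:
            per_value[row[attr]][row[target]] += 1
        # one "IF attr = value THEN majority class" rule per attribute value
        rules = {v: c.most_common(1)[0][0] for v, c in per_value.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in per_value.items())
        if errors < best[0]:
            best = (errors, attr, rules)
    return best[1], best[2]

attr, rules = one_r(data, "play")
print(attr, rules)
```

The output is directly readable as rules such as "IF outlook = sunny THEN play = yes", which illustrates why rule-based classifiers are favored for descriptive models.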

Fig. 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Regression Analysis

Regression analysis includes several machine learning methods that allow one to predict a continuous ( y ) result variable based on the value of one or more ( x ) predictor variables [ 41 ]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows how classification differs from regression models. Some overlap is often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, LASSO, and ridge regression, which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable ( Y ) and one or more independent variables ( X ) (the regression line) using the best-fit straight line [ 41 ]. Simple linear regression, with one independent variable, is defined as

\(y = a + bx + e,\)

where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function [ 41 ]:

\(y = a + b_1x_1 + b_2x_2 + \cdots + b_nx_n + e.\)
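
As a simple illustration, the intercept a and slope b of the best-fit line can be computed in closed form by least squares (the noise-free points below are invented for the example):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated by y = 1 + 2x

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# slope b = cov(x, y) / var(x); intercept a = mean(y) - b * mean(x)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
print(a, b)   # -> 1.0 2.0
```

With noise-free data the fit recovers the generating line exactly; with real data the error term e absorbs the residual scatter around the line.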

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear, but is the polynomial degree of \(n^\mathrm{th}\) in x [ 82 ]. The equation for polynomial regression is also derived from linear regression (polynomial regression of degree 1) equation, which is defined as below:
\( y = b_0 + b_1x + b_2x^2 + b_3x^3 + \cdots + b_nx^n \)
Here, y is the predicted/target output, \(b_0, b_1,... b_n\) are the regression coefficients, and x is an independent/input variable. In simple words, if the data are not distributed linearly but instead follow an \(n^\mathrm{th}\) -degree polynomial, we use polynomial regression to get the desired output.
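One common way to realize this in practice is to expand x into polynomial terms and then fit an ordinary linear model on the expanded features; a sketch with scikit-learn and hypothetical data following \(y = x^2\):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data lying exactly on the parabola y = x^2
x = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = (x ** 2).ravel()

# Expand x into [1, x, x^2], then fit a plain linear model on those terms
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.predict(poly.transform([[4.0]])))  # ≈ 16.0
```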

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques that are typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce model complexity. The LASSO (least absolute shrinkage and selection operator) regression model uses the L 1 regularization technique [ 82 ], which applies shrinkage by penalizing the absolute value of the magnitude of the coefficients ( L 1 penalty). As a result, LASSO can shrink some coefficients to exactly zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L 2 regularization [ 82 ], which penalizes the squared magnitude of the coefficients ( L 2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient value to zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, and ridge regression is useful when a data set has “multicollinearity”, i.e., predictors that are correlated with other predictors.
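The sparsity contrast between the two penalties can be observed directly. In this illustrative sketch (hypothetical data where only one of ten features drives the response, assuming scikit-learn), LASSO zeroes out most irrelevant coefficients while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# 100 samples, 10 features, but only feature 0 actually drives the response
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 drives irrelevant coefficients to exactly zero; L2 only shrinks them
print((lasso.coef_ == 0.0).sum())  # most of the 9 irrelevant features
print((ridge.coef_ == 0.0).sum())  # typically none
```

The regularization strength `alpha` is a hypothetical choice here; in practice it is tuned, e.g., by cross-validation.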

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for the specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [ 41 ]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [ 69 ], K-Medoids [ 80 ], CLARA [ 55 ], etc.

Density-based methods: To identify distinct groups or clusters, this approach uses the concept that a cluster in the data space is a contiguous region of high point density, isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered noise. The typical clustering algorithms based on density are DBSCAN [ 32 ], OPTICS [ 12 ], etc. Density-based methods typically struggle with clusters of varying density and with high-dimensional data.

Hierarchical-based methods: Hierarchical clustering typically seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) Agglomerative—a “bottom-up” approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and (ii) Divisive—a “top-down” approach in which all observations begin in one cluster and splits are performed recursively as one moves down the hierarchy, as shown in Fig. 7 . Our earlier proposed BOTS technique, Sarker et al. [ 102 ], is an example of a hierarchical, particularly bottom-up, clustering algorithm.

Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.

Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [ 121 ], CMWK-Means [ 27 ], etc.

figure 7

A graphical interpretation of the widely-used hierarchical clustering (Bottom-up and top-down) technique

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

K-means clustering: K-means clustering [ 69 ] is a fast, robust, and simple algorithm that provides reliable results when the clusters in a data set are well separated from each other. In this algorithm, the data points are allocated to clusters in such a way that the sum of the squared distances between the data points and the cluster centroids is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster while keeping the clusters as compact as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means clustering algorithm is sensitive to outliers. K-medoids clustering [ 91 ] is a variant of K-means that is more robust to noise and outliers.
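A minimal sketch on two well-separated hypothetical blobs (assuming scikit-learn is available) shows the behavior on exactly the kind of data where K-means is reliable:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs (hypothetical data)
rng = np.random.default_rng(42)
a = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
b = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
X = np.vstack([a, b])

# n_init restarts mitigate the sensitivity to random initial centers
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(km.cluster_centers_)  # close to (0, 0) and (5, 5)
```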

Mean-shift clustering: Mean-shift clustering [ 37 ] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [ 82 ]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points within a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing is an example application domain. Mean shift has the disadvantage of being computationally expensive. Moreover, the mean-shift algorithm does not work well in high-dimensional cases, where the number of clusters can shift abruptly.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [ 32 ] is a base algorithm for density-based clustering, widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of data that is noisy and contains outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions and outliers, i.e., it is robust to outliers.
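The noise-handling behavior can be sketched directly (hypothetical data, assuming scikit-learn; `eps` and `min_samples` are illustrative choices): DBSCAN marks low-density points with the special label \(-1\):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier (hypothetical data)
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(30, 2)),
    rng.normal(loc=(4, 4), scale=0.2, size=(30, 2)),
    [[100.0, 100.0]],            # isolated point far from both blobs
])

# eps = neighborhood radius, min_samples = density threshold
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(set(db.labels_.tolist()))  # two cluster labels plus -1 for noise
```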

GMM clustering: Gaussian mixture models (GMMs) are often used for data clustering, which is a distribution-based clustering algorithm. A Gaussian mixture model is a probabilistic model in which all the data points are produced by a mixture of a finite number of Gaussian distributions with unknown parameters [ 82 ]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [ 82 ] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the likelihood that a data point belongs to one of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.
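The soft assignments mentioned above can be illustrated with a short sketch (hypothetical two-blob data, assuming scikit-learn; its `GaussianMixture` fits the mixture via EM internally):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated Gaussian blobs (hypothetical data)
rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(60, 2)),
    rng.normal(loc=(5, 0), scale=0.3, size=(60, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# Unlike k-means, GMM returns per-component membership probabilities
probs = gmm.predict_proba(X)
print(probs[0])  # nearly all mass on one component for a point deep inside a blob
```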

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
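The bottom-up merging and the dendrogram it produces can be sketched with SciPy (assuming it is installed; the six 1-D points are hypothetical and form two obvious groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 1-D points forming two obvious groups (hypothetical data)
X = np.array([[1.0], [1.1], [1.2], [10.0], [10.1], [10.2]])

# Bottom-up merging with complete linkage; Z encodes the full dendrogram
Z = linkage(X, method="complete")

# Cut the dendrogram into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself; cutting at different heights yields coarser or finer clusterings from the same hierarchy.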

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lowers computational costs, and avoids overfitting and redundancy by simplifying models. Both the feature selection and the feature extraction process can be used for dimensionality reduction. The primary distinction between the two is that “feature selection” keeps a subset of the original features [ 97 ], while “feature extraction” creates brand new ones [ 98 ]. In the following, we briefly discuss these techniques.

Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building machine learning and data science models. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of selected features in a problem domain can minimize the overfitting problem by simplifying and generalizing the model, as well as increase the model’s accuracy [ 97 ]. Thus, “feature selection” [ 66 , 99 ] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a way to reduce computational cost or training time. The aim of “feature extraction” [ 66 , 99 ] is to reduce the number of features in a dataset by generating new ones from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new, reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique that extracts a lower-dimensional space by creating new components from the existing features in a dataset [ 98 ].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple baseline approach to feature selection is the variance threshold [ 82 ]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default, it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the features ( X ), not the desired outputs ( y ), and can, therefore, be used for unsupervised learning.
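A short sketch with scikit-learn's `VarianceThreshold` (hypothetical data in which the first feature is constant and therefore uninformative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features: the first is constant, so it carries no information
X = np.array([
    [0.0, 2.0, 0.5],
    [0.0, 1.0, 0.8],
    [0.0, 3.0, 0.2],
    [0.0, 2.5, 0.9],
])

selector = VarianceThreshold()  # default threshold 0.0 drops constant features
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 2)
```

Note that no `y` is passed to `fit_transform`, reflecting the unsupervised nature of this filter.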

Pearson correlation: Pearson’s correlation is another method to understand a feature’s relation to the response variable and can be used for feature selection [ 99 ]. This method is also used for finding the association between the features in a dataset. The resulting value lies in \([-1, 1]\) , where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If two random variables are represented by X and Y , then the correlation coefficient between X and Y is defined as [ 41 ]
\( \mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma _X \, \sigma _Y}, \)

where \(\mathrm{cov}(X, Y)\) is the covariance of X and Y , and \(\sigma _X\) and \(\sigma _Y\) are their standard deviations.
ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, and that the variables are normally distributed. To statistically test the equality of means, the ANOVA method utilizes F tests. For feature selection, the resulting ‘ANOVA F value’ [ 82 ] of this test can be used, where features that are independent of the target variable can be omitted.

Chi square: The chi-square \({\chi }^2\) [ 82 ] statistic is an estimate of the difference between the observed and expected frequencies of a series of events or variables. The value of \({\chi }^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi }^2\) is commonly used for testing relationships between categorical variables. If \(O_i\) represents an observed value and \(E_i\) represents an expected value, then
\( {\chi }^2 = \sum _{i} \frac{(O_i - E_i)^2}{E_i}. \)
Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [ 82 ] fits the model and removes the weakest feature(s), repeating the process until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to remove dependencies and collinearity in the model.
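A minimal sketch of RFE wrapped around a linear model (assuming scikit-learn; the data are hypothetical, with only two of five features driving the target):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
# Only features 0 and 2 drive the target; the other three are pure noise
y = 4.0 * X[:, 0] + 2.0 * X[:, 2]

# Repeatedly fit and drop the feature with the smallest coefficient magnitude
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
print(rfe.support_)  # True for the two features RFE kept
```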

Model-based selection: To reduce the dimensionality of the data, linear models penalized with L 1 regularization can be used. Least absolute shrinkage and selection operator (Lasso) regression is a type of linear regression that has the property of shrinking some of the coefficients to zero [ 82 ]; such features can then be removed from the model. Thus, the penalized lasso regression method is often used in machine learning to select a subset of variables. The Extra Trees Classifier [ 82 ] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in the field of machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [ 48 , 81 ]. Figure 8 shows an example of the effect of PCA on spaces of various dimensions, where Fig. 8 a shows the original features in 3D space, and Fig. 8 b shows the projection onto a 2D plane using the created principal components PC1 and PC2, and onto a 1D line using the principal component PC1 alone. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of a dataset in order to build an effective machine learning model [ 98 ]. Technically, PCA identifies the eigenvectors of a covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or fewer dimensions [ 82 ].
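A short sketch of such a projection (hypothetical 3-D data that really varies along a single direction, assuming scikit-learn): almost all variance ends up in the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# 3-D data that actually varies mostly along one direction, plus tiny noise
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + rng.normal(scale=0.05, size=200),
    -t + rng.normal(scale=0.05, size=200),
])

# Project the correlated 3-D features onto 2 uncorrelated principal components
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)
print(pca.explained_variance_ratio_)  # first component captures nearly everything
```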

figure 8

An example of a principal component analysis (PCA) and created principal components PC1 and PC2 in different dimension space

Association Rule Learning

Association rule learning is a rule-based machine learning approach to discover interesting relationships, “IF-THEN” statements, between variables in large datasets [ 7 ]. One example is that “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use its parameters, ‘support’ and ‘confidence’, which are introduced in [ 7 ].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [ 7 ] for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm calls for too many passes over the entire dataset to produce the rules. Another approach SETM [ 49 ] exhibits good performance and stable behavior with execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules for a given dataset, Agrawal et al. [ 8 ] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above due to the Apriori property of frequent itemsets [ 8 ]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach to generate the candidate itemsets. To reduce the search space, Apriori uses the property that “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [ 108 ], can also generate rules; however, it may produce unexpected results as it combines both support and confidence. Apriori [ 8 ] is the most widely applicable technique for mining association rules.
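The support and confidence measures underlying Apriori can be computed in a few lines of plain Python; this sketch uses hypothetical market-basket transactions to evaluate the rule "laptop → antivirus" and to illustrate the downward-closure (Apriori) property:

```python
# Hypothetical market-basket transactions
transactions = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "mouse"},
    {"laptop", "mouse"},
    {"antivirus"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Apriori property: a superset can never have higher support than its subsets,
# so any infrequent itemset prunes all of its supersets from the search.
s_laptop = support({"laptop"})
s_both = support({"laptop", "antivirus"})

# Confidence of the rule laptop -> antivirus
confidence = s_both / s_laptop
print(s_both, confidence)
```

Here support({laptop, antivirus}) = 2/4 and confidence = (2/4)/(3/4) = 2/3; a full Apriori implementation generates candidate itemsets level by level using exactly this pruning.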

ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) proposed by Han et al. [ 42 ], is Frequent Pattern Growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [ 8 ] generates frequent candidate itemsets, whereas the FP-Growth algorithm [ 42 ] avoids candidate generation and instead builds a tree using the successful ‘divide and conquer’ strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [ 133 ]. Moreover, the FP-tree may not fit into memory for massive data sets, making it challenging to process big data as well. Another solution is RARM (Rapid Association Rule Mining), proposed by Das et al. [ 26 ], but it faces a related FP-tree issue [ 133 ].

ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper by Sarker et al. [ 104 ], that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down fashion and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment, where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment using feedback from its actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method is based on interacting with the environment. The problem to be solved in reinforcement learning (RL) is defined as a Markov decision process (MDP) [ 86 ], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, rewards, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [ 85 ]. AlphaZero and AlphaGo [ 113 ] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution and the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo Control, SARSA (State–Action–Reward–State–Action), etc. are some examples of model-free algorithms [ 52 ]. The model of the environment, which is required for model-based RL but not for model-free RL, is the key difference between the two. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and drawing samples from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.
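The classic textbook illustration of the idea is estimating \(\pi\) by random sampling, a deterministic quantity recovered from randomness:

```python
import numpy as np

# Estimate pi by sampling uniform points in the unit square and measuring
# the fraction that lands inside the quarter circle of radius 1
rng = np.random.default_rng(0)
pts = rng.random((100_000, 2))
inside = (pts ** 2).sum(axis=1) <= 1.0
pi_estimate = 4.0 * inside.mean()
print(pi_estimate)  # close to 3.14159
```

The error shrinks like \(1/\sqrt{n}\) with the number of samples, which is the characteristic convergence rate of Monte Carlo estimates.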

Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without the need for adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected reward for a given action in a given state.
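A toy sketch of tabular Q-learning on a hypothetical 5-state corridor (states 0–4, where reaching state 4 gives reward 1 and ends the episode): because Q-learning is off-policy, even a purely random behavior policy learns the optimal greedy policy here.

```python
import numpy as np

# Tiny deterministic corridor MDP: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 yields reward 1 and ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9          # learning rate and discount factor
rng = np.random.default_rng(0)

for _ in range(500):             # episodes
    s = 0
    while s != 4:
        a = int(rng.integers(n_actions))   # random behavior policy (off-policy)
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# The greedy policy extracted from Q should always move right
print(np.argmax(Q, axis=1)[:4])
```

The learned values approach \(Q^*(s, \text{right}) = \gamma^{3-s}\), i.e., rewards farther from the goal are discounted more heavily.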

Deep Q-learning: Q-learning works well in reasonably simple settings. However, when the number of states and actions becomes large, deep learning can be used as a function approximator. The basic working step in deep Q-learning [ 52 ] is that the current state is fed into a neural network, which returns the Q-values of all possible actions as output.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture by combining several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning relative to machine learning as the amount of data increases. However, it may vary depending on the data characteristics and experimental setup.

figure 9

Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are: Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN, or ConvNet), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

figure 10

A structure of an artificial neural network modeling with multiple processing layers

MLP: The base architecture of deep learning, which is also known as the feed-forward artificial neural network, is called a multilayer perceptron (MLP) [ 82 ]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10 . Each node in one layer connects, with a certain weight, to every node in the following layer. MLP utilizes the “backpropagation” technique [ 41 ], the most “fundamental building block” in a neural network, to adjust the weight values internally while building the model. MLP is sensitive to feature scaling and allows a variety of hyperparameters to be tuned, such as the number of hidden layers, neurons, and iterations, which can result in a computationally costly model.
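The layered computation can be sketched as a single forward pass in NumPy (the weights below are random and illustrative, not trained; backpropagation would adjust them from data):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# A tiny MLP: 3 inputs -> 4 hidden units -> 2 outputs (illustrative sizes)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])
h = relu(x @ W1 + b1)          # hidden layer: affine transform + nonlinearity
logits = h @ W2 + b2           # output layer

# Softmax turns the output logits into class probabilities
p = np.exp(logits - logits.max())
p /= p.sum()
print(p)
```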

CNN or ConvNet: The convolutional neural network (CNN) [ 65 ] enhances the design of the standard ANN, consisting of convolutional layers, pooling layers, and fully connected layers, as shown in Fig. 11 . As it takes advantage of the two-dimensional (2D) structure of the input data, it is broadly used in several areas such as image and video recognition, image processing and classification, medical image analysis, natural language processing, etc. While CNN has a greater computational burden, it has the advantage of automatically detecting the important features without any manual intervention, and hence CNN is considered to be more powerful than a conventional ANN. A number of advanced deep learning models based on CNN can be used in the field, such as AlexNet [ 60 ], Xception [ 24 ], Inception [ 118 ], Visual Geometry Group (VGG) [ 44 ], ResNet [ 45 ], etc.
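The core convolution operation can be sketched in NumPy; the edge-detecting filter below is hand-set for illustration, whereas a real CNN learns such filters from data during training:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the core operation of a convolutional layer."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny image with a vertical edge between columns 2 and 3
image = np.array([
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
])
# A simple hand-set vertical-edge detector (a CNN would learn these weights)
kernel = np.array([[-1., 1.]])
response = conv2d(image, kernel)
print(response)  # strongest response exactly where the edge is
```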

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. LSTM has feedback links, unlike normal feed-forward neural networks. LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, processing, and predicting data based on time series data, which differentiates it from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time, sentence, etc., and commonly applied in the area of time-series analysis, natural language processing, speech recognition, etc.

figure 11

An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist in the area for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique that is widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBM) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, which is typically the re-use of a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is given in our earlier paper, Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its capability to learn from the past and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict the unknown outcome [ 41 ], for instance, identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application area, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, better managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms such as decision trees, support vector machines, artificial neural networks, etc. [ 106 , 125 ] are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.

Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [114], which is typically the practice of protecting networks, systems, hardware, and data from digital attacks [114]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where “bad neighborhoods” are online, keep people safe while browsing, or secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [97]. Various deep learning-based security models can also be used on large-scale security datasets [96, 129]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [105]. Thus, we can say that the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
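As a rough illustration of the clustering idea mentioned above, the sketch below groups synthetic connection records with k-means and flags the members of the much smaller cluster as candidate anomalies; the features and traffic values are assumptions, not a real intrusion dataset:

```python
# Sketch: clustering-based cyber-anomaly detection on synthetic connection records.
# Features (packets/s, bytes/s) and the flood-like "attack" values are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
normal = rng.normal(loc=[50, 500], scale=[5, 50], size=(200, 2))    # typical traffic
attack = rng.normal(loc=[400, 9000], scale=[20, 200], size=(5, 2))  # flood-like bursts
X = np.vstack([normal, attack])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Treat members of the much smaller cluster as candidate anomalies for review.
labels, counts = np.unique(km.labels_, return_counts=True)
anomaly_cluster = labels[np.argmin(counts)]
flagged = np.where(km.labels_ == anomaly_cluster)[0]
```

A deployed system would combine such unsupervised signals with labeled detection models rather than rely on cluster size alone.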

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [114], which turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered to be the big frontier that can enhance almost all activities in our lives, such as smart governance, smart home, education, communication, transportation, retail, agriculture, health care, business, and many more [70]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [132, 135]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [103]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to people’s current needs.
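For instance, the energy-usage prediction task can be sketched as a simple regression over time-of-day features; the daily-cycle readings below are synthetic stand-ins for smart-meter data:

```python
# Sketch: predicting a city's hourly energy usage from time-of-day features.
# The daily-cycle series is synthetic; a real deployment would use smart-meter data.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.arange(24 * 14)  # two weeks of hourly readings
usage = (300 + 80 * np.sin(2 * np.pi * hours / 24)
         + np.random.default_rng(1).normal(0, 5, hours.size))

# Encode the daily cycle with sine/cosine features so a linear model can fit it.
X = np.column_stack([np.sin(2 * np.pi * hours / 24),
                     np.cos(2 * np.pi * hours / 24)])
reg = LinearRegression().fit(X[:-24], usage[:-24])  # train on all but the last day
r2 = reg.score(X[-24:], usage[-24:])                # evaluate on the held-out day
```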

Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO\(_2\) pollution, accidents, emergencies, and a decline in modern society’s quality of life [40]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [17, 30, 31]. For example, based on travel history and trends of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.
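A minimal traffic-forecasting sketch, assuming hourly volume readings, uses the three preceding observations as autoregressive features; the series is synthetic and stands in for loop-detector or GPS data:

```python
# Sketch: short-term traffic-volume forecasting from lagged observations.
# The volume series is synthetic; real systems would use sensor or GPS feeds.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
t = np.arange(24 * 30)  # one month of hourly readings
volume = 200 + 120 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 10, t.size)

# Autoregressive features: the three preceding hourly readings predict the next one.
lags = 3
X = np.column_stack([volume[i:len(volume) - lags + i] for i in range(lags)])
y = volume[lags:]

model = Ridge().fit(X[:-100], y[:-100])                  # train on the earlier hours
mae = np.mean(np.abs(model.predict(X[-100:]) - y[-100:]))  # error on the last 100
```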

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [33, 77, 112]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [3]. Recently, learning techniques have become popular in the battle against COVID-19 [61, 63]. For the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate mortality rates, and detect other anomalies [61]. They can also be used to better understand the virus’s origin, to predict COVID-19 outbreaks, and for disease diagnosis and treatment [14, 50]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [10, 78, 111]. Overall, machine and deep learning techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the healthcare domain.
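Purely as an illustration of the risk-classification idea (the features, coefficients, and labels below are simulated and carry no clinical meaning), a logistic regression can separate synthetic high-risk from low-risk patient records:

```python
# Sketch: classifying simulated patient records as high- or low-risk.
# All features and the risk rule are invented for illustration, not clinical use.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 90, n)
comorbidities = rng.integers(0, 5, n)
spo2 = rng.normal(96, 2, n)  # blood oxygen saturation (%)
X = np.column_stack([age, comorbidities, spo2])

# Simulated rule: older patients with more comorbidities and lower SpO2 are high risk.
risk = 0.04 * age + 0.5 * comorbidities - 0.3 * (spo2 - 96)
y = (risk + rng.normal(0, 0.5, n) > 3.2).astype(int)

clf = LogisticRegression(max_iter=1000)
mean_acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold cross-validated accuracy
```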

E-commerce and product recommendations: Product recommendation is one of the most well known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to maintain existing customers while attracting new ones.
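A product-recommendation sketch, assuming a small user-item rating matrix where 0 means “not rated”, scores a user’s unrated items by item-to-item cosine similarity (a simplified collaborative-filtering scheme; the products and ratings are invented):

```python
# Sketch: item-based collaborative filtering on a tiny synthetic ratings matrix.
# Rows are users, columns are hypothetical products; 0 marks an unrated item.
import numpy as np

ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 1],
    [1, 2, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns (zeros kept as-is for simplicity).
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user, k=1):
    """Score unrated items by similarity-weighted sums of the user's ratings."""
    rated = ratings[user] > 0
    scores = sim[:, rated] @ ratings[user, rated]
    scores[rated] = -np.inf  # never re-recommend an already-rated item
    return np.argsort(scores)[::-1][:k]

top = recommend(user=0)  # user 0 has rated only the first two items
```

Production recommenders use far richer signals (implicit feedback, matrix factorization, deep models), but the similarity-weighted scoring above is the core intuition.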

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [79, 103]. Thus, NLP helps computers, for instance, to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, where machine learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [90] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions like very happy, happy, sad, very sad, angry, interested, or not interested.
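A toy sentiment-analysis sketch, assuming scikit-learn is available, trains a bag-of-words Naive Bayes classifier on a handful of invented reviews and predicts the polarity of two unseen sentences:

```python
# Sketch: a bag-of-words sentiment classifier on a toy corpus.
# The reviews and labels are invented; real systems train on large labeled datasets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = [
    "great product, works perfectly", "absolutely love it",
    "excellent quality and fast delivery", "very happy with this purchase",
    "terrible, broke after one day", "waste of money, do not buy",
    "awful quality, very disappointed", "worst purchase I have made",
]
labels = ["positive"] * 4 + ["negative"] * 4

# CountVectorizer turns text into word counts; MultinomialNB models their class odds.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(reviews, labels)
pred = clf.predict(["love the excellent quality",
                    "terrible waste, very disappointed"])
```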

Image, speech and pattern recognition: Image recognition [36] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. For instance, labeling an X-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [23], which typically uses sound and linguistic models, is also very popular, e.g., in Google Assistant, Cortana, Siri, and Alexa [67], where machine learning methods are used. Pattern recognition [13] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in this area.
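As a small, runnable instance of image recognition, the sketch below trains a support vector classifier on scikit-learn’s bundled 8×8 handwritten-digit images rather than a full deep learning pipeline:

```python
# Sketch: handwritten-digit recognition with a support vector classifier.
# Uses scikit-learn's small bundled digits dataset as a stand-in for real image data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 1797 grayscale 8x8 digit images, flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)  # RBF kernel, classic settings here
accuracy = clf.score(X_test, y_test)
```

For larger natural images, convolutional networks (as discussed in the deep learning section) replace this hand-fed feature approach.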

Sustainable agriculture: Agriculture is essential to the survival of all human activities [109]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [5, 25, 109]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to enhance their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [5, 53, 54]. Machine learning can be applied in various phases of sustainable agriculture, such as in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.
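The pre-production phase can be sketched as a regression problem; the soil and weather features and the yield formula below are assumptions invented for illustration:

```python
# Sketch: crop-yield regression from synthetic soil and weather features.
# The feature names and the simulated yield rule are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 600
rainfall = rng.uniform(200, 800, n)   # mm per season
nitrogen = rng.uniform(20, 120, n)    # kg/ha applied
ph = rng.uniform(4.5, 8.0, n)         # soil pH
X = np.column_stack([rainfall, nitrogen, ph])

# Simulated yield: rises with rain and nitrogen, peaks near neutral pH.
y = (0.004 * rainfall + 0.02 * nitrogen
     - 0.5 * (ph - 6.5) ** 2 + rng.normal(0, 0.2, n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out fields
```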

User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify behaviors accordingly [28, 93]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their capability to learn from contextual data [103, 136]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior, and support and entertain users [107, 137, 140]. Machine learning techniques are applicable to building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making that intelligently assists end mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [104]. Clustering approaches are useful in capturing users’ diverse behavioral activities by taking time-series data into account [102]. Classification methods can be used to predict future events in various contexts [106, 139]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can help to build context-aware adaptive and smart applications according to the preferences of mobile phone users.
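The classification idea can be sketched for an intelligent phone-call assistant; the context features and the simulated user behavior below are invented for illustration:

```python
# Sketch: predicting a user's phone-call response (answer/decline) from context.
# Contexts and the behavioral rule are simulated to illustrate the modeling idea.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(11)
n = 400
in_meeting = rng.integers(0, 2, n)       # 1 = calendar shows a meeting
hour = rng.integers(0, 24, n)            # hour of day
caller_is_family = rng.integers(0, 2, n) # 1 = caller tagged as family
X = np.column_stack([in_meeting, hour, caller_is_family])

# Simulated behavior: decline during meetings and late at night, unless family calls.
answer = ((in_meeting == 0) & (hour >= 8) & (hour <= 22)) | (caller_is_family == 1)
y = answer.astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# Query a new context: no meeting, 10 a.m., non-family caller.
prediction = int(clf.predict([[0, 10, 0]])[0])
```

A real assistant would learn such rules from the user’s own call logs, as in the context-aware modeling work cited above.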

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in relevant domains, such as cybersecurity, IoT, healthcare, and agriculture, discussed in Sect. “Applications of Machine Learning”, is not straightforward, although the current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for target machine learning-based applications, e.g., smart city applications, and managing that data is important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working with real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The machine learning algorithms discussed in Sect. “Machine Learning Tasks and Algorithms” are highly affected by the quality and availability of data for training, which in turn shapes the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to effectively use the learning algorithms in the associated application domain.
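A typical pre-processing step for such messy, collected data can be sketched with scikit-learn’s imputation and scaling utilities; the records below, with gaps marked as NaN, are invented examples:

```python
# Sketch: a reusable cleaning step for messy real-world records,
# imputing missing values and standardizing features before learning.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Raw records with gaps (np.nan), as often found in collected historical data.
X_raw = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],
    [31.0, np.nan],
    [47.0, 81_000.0],
])

# Median imputation fills gaps; standardization puts features on a common scale.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_clean = prep.fit_transform(X_raw)
```

Packaging the steps as a pipeline lets the same fitted transformation be reapplied consistently to new incoming data.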

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “Machine Learning Tasks and Algorithms”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging. The reason is that the outcomes of different learning algorithms may vary depending on the data characteristics [106]. Selecting the wrong learning algorithm would produce unexpected outcomes, leading to wasted effort as well as reduced model effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can directly be used to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “Applications of Machine Learning”. However, hybrid learning models, e.g., ensembles of methods, the modification or enhancement of existing learning techniques, or the design of new learning methods, could be potential future work in the area.
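Algorithm selection can be made empirical by cross-validating several candidates on the same data; the sketch below compares three common classifiers on scikit-learn’s bundled breast-cancer dataset (the candidate set is an illustrative choice, not a recommendation):

```python
# Sketch: comparing candidate learning algorithms with 5-fold cross-validation,
# since the best choice depends on the data's characteristics.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
# Mean cross-validated accuracy per candidate; the winner guides model selection.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
```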

Thus, the ultimate success of a machine learning-based solution and corresponding applications mainly depends on both the data and the learning algorithms. If the data are unsuitable for learning, e.g., non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to provide solutions to various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. The sophisticated learning algorithms need to be trained with collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We have also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we have summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges that have been identified create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can be used as a reference guide for potential research and applications for both academia and industry professionals, as well as for decision-makers, from a technical point of view.

Canadian Institute for Cybersecurity, University of New Brunswick, ISCX dataset, http://www.unb.ca/cic/datasets/index.html/ (Accessed on 20 October 2019).

CIC-DDoS2019 [online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (Accessed on 28 March 2020).

World Health Organization (WHO). http://www.who.int/.

Google Trends. https://trends.google.com/trends/, 2019.

Adnan N, Nordin SM, Rahman I, Noor A. The effects of knowledge transfer on farmers’ decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Fast algorithms for mining association rules. In: Proceedings of the International Joint Conference on Very Large Data Bases, Santiago Chile. 1994; 1215: 487–499.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos Solit Fract. 2020;140.

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.

Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, 2012; 37–49 .

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181.

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018 pages 4774–4778. IEEE .

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. 2019: 1–8. .

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web, 2020: 1–24 .

Ester M, Kriegel H-P, Sander J, Xu X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, Citeseer. 1996; 96: 148–156

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, ACM. 2000;29: 1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted boltzmann machines. In: Neural networks: Tricks of the trade. Springer. 2012; 599-619

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Data Engineering, 1995. Proceedings of the Eleventh International Conference on, IEEE.1995:25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Solit Fract. 2020:110059.

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA. .

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015;pages 1–6. IEEE .

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Oh Y, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR, Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH, Abushark YB, Alsolami F, Khan A. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Abushark YB, Khan A. Contextpca: predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Sarker IH, Alqahtani H, Alsolami F, Khan A, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. BehavDT: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2019:1–11.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp): Adjunct, Germany, pages 630–634. ACM, 2016.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J Oxf Univ UK. 2018;61(3):349–68.

Sarker IH, Hoque MM, Uddin MK, Tawfeeq A. Mobile data science and intelligent apps: concepts, AI-based modeling and research directions. Mob Netw Appl. 2020:1–19.

Sarker IH, Kayes ASM. ABC-RuleMiner: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020:102762.

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Watters P, Kayes ASM. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Sarker IH, Salah K. Appspred: predicting context-aware smartphone apps using random forest learning. Internet Things. 2019;8:

Scheffer T. Finding association rules that trade support optimally against confidence. Intell Data Anal. 2005;9(4):381–95.

Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A. A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res. 2020;119:

Shengli S, Ling CX. Hybrid cost-sensitive decision tree, knowledge discovery in databases. In: PKDD 2005, Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science, volume 3721, 2005.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

Gökhan S, Nevin Y. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. Mastering the game of go with deep neural networks and tree search. nature. 2016;529(7587):484–9.

Ślusarczyk B. Industry 4.0: Are we ready? Polish J Manag Stud. 17, 2018.

Sneath Peter HA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948; 5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13-17 September, pp. 389–400. ACM, New York, USA. 2014.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In. IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009;2009:1–6.

Tsagkias M. Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. Icml. 2001;1:577–84.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.

Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.

Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7

Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234

Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.

Download references

Author information

Authors and Affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh


Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Conflict of Interest

The author declares no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.


About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN COMPUT. SCI. 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x


Received: 27 January 2021

Accepted: 12 March 2021

Published: 22 March 2021


Keywords

  • Machine learning
  • Deep learning
  • Artificial intelligence
  • Data science
  • Data-driven decision-making
  • Predictive analytics
  • Intelligent applications


Low-rank Variational Bayes correction to the Laplace method Janet van Niekerk, Haavard Rue , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Scaling the Convex Barrier with Sparse Dual Algorithms Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Causal-learn: Causal Discovery in Python Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, Kun Zhang , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics Noga Mudrik, Yenho Chen, Eva Yezerets, Christopher J. Rozell, Adam S. Charles , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification Natalie S. Frank, Jonathan Niles-Weed , 2024. [ abs ][ pdf ][ bib ]

Data Thinning for Convolution-Closed Distributions Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten , 2024. [ abs ][ pdf ][ bib ]      [ code ]

A projected semismooth Newton method for a class of nonconvex composite programs with strong prox-regularity Jiang Hu, Kangkang Deng, Jiayuan Wu, Quanzheng Li , 2024. [ abs ][ pdf ][ bib ]

Revisiting RIP Guarantees for Sketching Operators on Mixture Models Ayoub Belhadji, Rémi Gribonval , 2024. [ abs ][ pdf ][ bib ]

Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization Daniel LeJeune, Jiayu Liu, Reinhard Heckel , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks Dong-Young Lim, Sotirios Sabanis , 2024. [ abs ][ pdf ][ bib ]

Axiomatic effect propagation in structural causal models Raghav Singal, George Michailidis , 2024. [ abs ][ pdf ][ bib ]

Optimal First-Order Algorithms as a Function of Inequalities Chanwoo Park, Ernest K. Ryu , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Resource-Efficient Neural Networks for Embedded Systems Wolfgang Roth, Günther Schindler, Bernhard Klein, Robert Peharz, Sebastian Tschiatschek, Holger Fröning, Franz Pernkopf, Zoubin Ghahramani , 2024. [ abs ][ pdf ][ bib ]

Trained Transformers Learn Linear Models In-Context Ruiqi Zhang, Spencer Frei, Peter L. Bartlett , 2024. [ abs ][ pdf ][ bib ]

Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]

Efficient Modality Selection in Multimodal Learning Yifei He, Runxiang Cheng, Gargi Balasubramaniam, Yao-Hung Hubert Tsai, Han Zhao , 2024. [ abs ][ pdf ][ bib ]

A Multilabel Classification Framework for Approximate Nearest Neighbor Search Ville Hyvönen, Elias Jääsaari, Teemu Roos , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization Lorenzo Pacchiardi, Rilwan A. Adewoyin, Peter Dueben, Ritabrata Dutta , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Multiple Descent in the Multiple Random Feature Model Xuran Meng, Jianfeng Yao, Yuan Cao , 2024. [ abs ][ pdf ][ bib ]

Mean-Square Analysis of Discretized Itô Diffusions for Heavy-tailed Sampling Ye He, Tyler Farghly, Krishnakumar Balasubramanian, Murat A. Erdogdu , 2024. [ abs ][ pdf ][ bib ]

Invariant and Equivariant Reynolds Networks Akiyoshi Sannai, Makoto Kawano, Wataru Kumagai , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Personalized PCA: Decoupling Shared and Unique Features Naichen Shi, Raed Al Kontar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee George H. Chen , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel , 2024. [ abs ][ pdf ][ bib ]

Convergence for nonconvex ADMM, with applications to CT imaging Rina Foygel Barber, Emil Y. Sidky , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms T. Tony Cai, Hongji Wei , 2024. [ abs ][ pdf ][ bib ]

Sparse NMF with Archetypal Regularization: Computational and Robustness Properties Kayhan Behdin, Rahul Mazumder , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Network Approximation: Beyond ReLU to Diverse Activation Functions Shijun Zhang, Jianfeng Lu, Hongkai Zhao , 2024. [ abs ][ pdf ][ bib ]

Effect-Invariant Mechanisms for Policy Generalization Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters , 2024. [ abs ][ pdf ][ bib ]

Pygmtools: A Python Graph Matching Toolkit Runzhong Wang, Ziao Guo, Wenzheng Pan, Jiale Ma, Yikai Zhang, Nan Yang, Qi Liu, Longxuan Wei, Hanxue Zhang, Chang Liu, Zetian Jiang, Xiaokang Yang, Junchi Yan , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Heterogeneous-Agent Reinforcement Learning Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Sample-efficient Adversarial Imitation Learning Dahuin Jung, Hyungyu Lee, Sungroh Yoon , 2024. [ abs ][ pdf ][ bib ]

Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi , 2024. [ abs ][ pdf ][ bib ]

Rates of convergence for density estimation with generative adversarial networks Nikita Puchkin, Sergey Samsonov, Denis Belomestny, Eric Moulines, Alexey Naumov , 2024. [ abs ][ pdf ][ bib ]

Additive smoothing error in backward variational inference for general state-space models Mathis Chagneux, Elisabeth Gassiat, Pierre Gloaguen, Sylvain Le Corff , 2024. [ abs ][ pdf ][ bib ]

Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality Stephan Wojtowytsch , 2024. [ abs ][ pdf ][ bib ]

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Tail Decay Rate Estimation of Loss Function Distributions Etrit Haxholli, Marco Lorenzi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, Wenjing Liao , 2024. [ abs ][ pdf ][ bib ]

Post-Regularization Confidence Bands for Ordinary Differential Equations Xiaowu Dai, Lexin Li , 2024. [ abs ][ pdf ][ bib ]

On the Generalization of Stochastic Gradient Descent with Momentum Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang , 2024. [ abs ][ pdf ][ bib ]

Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension Shotaro Yagishita, Jun-ya Gotoh , 2024. [ abs ][ pdf ][ bib ]

Iterate Averaging in the Quest for Best Test Error Diego Granziol, Nicholas P. Baskerville, Xingchen Wan, Samuel Albanie, Stephen Roberts , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Nonparametric Inference under B-bits Quantization Kexuan Li, Ruiqi Liu, Ganggang Xu, Zuofeng Shang , 2024. [ abs ][ pdf ][ bib ]

Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box Ryan Giordano, Martin Ingram, Tamara Broderick , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Sufficient Graphical Models Bing Li, Kyongwon Kim , 2024. [ abs ][ pdf ][ bib ]

Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond Nathan Kallus, Xiaojie Mao, Masatoshi Uehara , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks Sebastian Neumayer, Lénaïc Chizat, Michael Unser , 2024. [ abs ][ pdf ][ bib ]

Improving physics-informed neural networks with meta-learned optimization Alex Bihlo , 2024. [ abs ][ pdf ][ bib ]

A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent Stefan Ankirchner, Stefan Perko , 2024. [ abs ][ pdf ][ bib ]

Critically Assessing the State of the Art in Neural Network Verification Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn , 2024. [ abs ][ pdf ][ bib ]

Estimating the Minimizer and the Minimum Value of a Regression Function under Passive Design Arya Akhavan, Davit Gogolashvili, Alexandre B. Tsybakov , 2024. [ abs ][ pdf ][ bib ]

Modeling Random Networks with Heterogeneous Reciprocity Daniel Cirkovic, Tiandong Wang , 2024. [ abs ][ pdf ][ bib ]

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment Zixian Yang, Xin Liu, Lei Ying , 2024. [ abs ][ pdf ][ bib ]

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Decorrelated Variable Importance Isabella Verdinelli, Larry Wasserman , 2024. [ abs ][ pdf ][ bib ]

Model-Free Representation Learning and Exploration in Low-Rank MDPs Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal , 2024. [ abs ][ pdf ][ bib ]

Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method Ernesto Araya, Guillaume Braun, Hemant Tyagi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization Shicong Cen, Yuting Wei, Yuejie Chi , 2024. [ abs ][ pdf ][ bib ]

Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic Zheng Tracy Ke, Jun S. Liu, Yucong Ma , 2024. [ abs ][ pdf ][ bib ]

Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction Yuze Han, Guangzeng Xie, Zhihua Zhang , 2024. [ abs ][ pdf ][ bib ]

On Truthing Issues in Supervised Classification Jonathan K. Su , 2024. [ abs ][ pdf ][ bib ]


Tackling the most challenging problems in computer science

Our teams aspire to make discoveries that positively impact society. Core to our approach is sharing our research and tools to fuel progress in the field, to help more people more quickly. We regularly publish in academic journals, release projects as open source, and apply research to Google products to benefit users at scale.

Featured research developments


Mitigating aviation’s climate impact with Project Contrails


Consensus and subjectivity of skin tone annotation for ML fairness


A toolkit for transparency in AI dataset documentation


Building better pangenomes to improve the equity of genomics


A set of methods, best practices, and examples for designing with AI


Learn more from our research

Researchers across Google are innovating across many domains. We challenge conventions and reimagine technology so that everyone can benefit.


Publications

Google publishes over 1,000 papers annually. Publishing our work enables us to collaborate and share ideas with, as well as learn from, the broader scientific community.


Research areas

From conducting fundamental research to influencing product development, our research teams have the opportunity to impact technology used by billions of people every day.


Tools and datasets

We make tools and datasets available to the broader research community with the goal of building a more collaborative ecosystem.


Meet the people behind our innovations


Our teams collaborate with the research and academic communities across the world


Partnerships to improve our AI products


Trending Research

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers


Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.


MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

internlm/mindsearch • 29 Jul 2024

Inspired by the cognitive process when humans solve these problems, we introduce MindSearch to mimic the human minds in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework.

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs.


FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).

Neural General Circulation Models for Weather and Climate


Here we present the first GCM that combines a differentiable solver for atmospheric dynamics with ML components, and show that it can generate forecasts of deterministic weather, ensemble weather and climate on par with the best ML and physics-based methods.

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.


SGLang: Efficient Execution of Structured Language Model Programs

SGLang consists of a frontend language and a runtime.


Global Structure-from-Motion Revisited

colmap/glomap • 29 Jul 2024

Recovering 3D structure and camera motion from images has been a long-standing focus of computer vision research and is known as Structure-from-Motion (SfM).

Autoregressive Image Generation without Vector Quantization

In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space.


LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).


Millions of new materials discovered with deep learning

Amil Merchant and Ekin Dogus Cubuk



AI tool GNoME finds 2.2 million new crystals, including 380,000 stable materials that could power future technologies

Modern technologies from computer chips and batteries to solar panels rely on inorganic crystals. To enable new technologies, crystals must be stable; otherwise they can decompose. Behind each new, stable crystal can be months of painstaking experimentation.

Today, in a paper published in Nature, we share the discovery of 2.2 million new crystals – equivalent to nearly 800 years’ worth of knowledge. We introduce Graph Networks for Materials Exploration (GNoME), our new deep learning tool that dramatically increases the speed and efficiency of discovery by predicting the stability of new materials.

With GNoME, we’ve multiplied the number of technologically viable materials known to humanity. Of its 2.2 million predictions, 380,000 are the most stable, making them promising candidates for experimental synthesis. Among these candidates are materials with the potential to enable transformative future technologies, ranging from superconductors for powering supercomputers to next-generation batteries that could boost the efficiency of electric vehicles.

GNoME shows the potential of using AI to discover and develop new materials at scale. External researchers in labs around the world have independently created 736 of these new structures experimentally in concurrent work. In partnership with Google DeepMind, a team of researchers at the Lawrence Berkeley National Laboratory has also published a second paper in Nature that shows how our AI predictions can be leveraged for autonomous material synthesis.

We’ve made GNoME’s predictions available to the research community. We will be contributing 380,000 materials that we predict to be stable to the Materials Project, which is now processing the compounds and adding them into its online database. We hope these resources will drive forward research into inorganic crystals, and unlock the promise of machine learning tools as guides for experimentation.

Accelerating materials discovery with AI


About 20,000 of the crystals experimentally identified in the ICSD database are computationally stable. Computational approaches drawing from the Materials Project, Open Quantum Materials Database and WBM database boosted this number to 48,000 stable crystals. GNoME expands the number of stable materials known to humanity to 421,000.

In the past, scientists searched for novel crystal structures by tweaking known crystals or experimenting with new combinations of elements - an expensive, trial-and-error process that could take months to deliver even limited results. Over the last decade, computational approaches led by the Materials Project and other groups have helped discover 28,000 new materials. But up until now, new AI-guided approaches hit a fundamental limit in their ability to accurately predict materials that could be experimentally viable. GNoME’s discovery of 2.2 million materials would be equivalent to about 800 years’ worth of knowledge and demonstrates an unprecedented scale and level of accuracy in predictions.

For example, we found 52,000 new layered compounds similar to graphene that have the potential to revolutionize electronics through the development of superconductors. Previously, about 1,000 such materials had been identified. We also found 528 potential lithium-ion conductors, 25 times more than a previous study, which could be used to improve the performance of rechargeable batteries.

We are releasing the predicted structures for 380,000 materials that have the highest chance of successfully being made in the lab and being used in viable applications. For a material to be considered stable, it must not decompose into similar compositions with lower energy. For example, carbon in a graphene-like structure is stable compared to carbon in diamonds. Mathematically, these materials lie on the convex hull. This project discovered 2.2 million new crystals that are stable by current scientific standards and lie below the convex hull of previous discoveries. Of these, 380,000 are considered the most stable, and lie on the “final” convex hull – the new standard we have set for materials stability.
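The convex-hull criterion above can be sketched numerically: in a binary A–B system, a phase is stable if its formation energy lies on the lower convex hull of energy versus composition. The sketch below is a toy illustration of that criterion only (not GNoME's pipeline), and all compositions and energies are made-up numbers:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical phases of a binary A-B system: (fraction of B, formation
# energy in eV/atom). Values are illustrative only.
phases = {"A": (0.0, 0.0), "A3B": (0.25, -0.10), "AB": (0.5, -0.30),
          "AB3": (0.75, -0.05), "B": (1.0, 0.0)}

points = np.array(list(phases.values()))
hull = ConvexHull(points)

# Keep only vertices on the *lower* hull: facets whose outward normal
# points downward in the energy direction (negative second component).
lower = set()
for simplex, eq in zip(hull.simplices, hull.equations):
    if eq[1] < 0:
        lower.update(simplex)

names = list(phases)
stable = sorted(names[i] for i in lower)
print("Stable phases:", stable)  # A3B and AB3 sit above the hull -> unstable
```

A phase above the lower hull (here A3B and AB3) can lower its energy by decomposing into a mixture of the hull phases, which is exactly the instability the post describes.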

GNoME: Harnessing graph networks for materials exploration


GNoME uses two pipelines to discover low-energy (stable) materials. The structural pipeline creates candidates with structures similar to known crystals, while the compositional pipeline follows a more randomized approach based on chemical formulas. The outputs of both pipelines are evaluated using established Density Functional Theory calculations and those results are added to the GNoME database, informing the next round of active learning.

GNoME is a state-of-the-art graph neural network (GNN) model. The input data for GNNs takes the form of a graph that can be likened to connections between atoms, which makes GNNs particularly suited to discovering new crystalline materials.
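As a toy illustration of such a graph representation (not GNoME's actual featurization, which the post does not detail): atoms become nodes, and an edge connects any pair of atoms within a distance cutoff, with the interatomic distance as an edge feature. The coordinates and cutoff below are hypothetical values.

```python
import itertools
import math

# Hypothetical planar arrangement of four atoms; coordinates in angstroms
# and the cutoff radius are illustrative values only.
atoms = {0: ("Na", (0.0, 0.0, 0.0)), 1: ("Cl", (2.8, 0.0, 0.0)),
         2: ("Na", (2.8, 2.8, 0.0)), 3: ("Cl", (0.0, 2.8, 0.0))}
CUTOFF = 3.0

# Build the edge list: one edge per atom pair closer than the cutoff,
# carrying the distance as an edge feature for the GNN.
edges = []
for i, j in itertools.combinations(atoms, 2):
    d = math.dist(atoms[i][1], atoms[j][1])
    if d < CUTOFF:
        edges.append((i, j, round(d, 2)))

print(edges)  # only nearest neighbors are connected; diagonals exceed the cutoff
```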

GNoME was originally trained with data on crystal structures and their stability, openly available through the Materials Project. We used GNoME to generate novel candidate crystals, and also to predict their stability. To assess our model’s predictive power during progressive training cycles, we repeatedly checked its performance using established computational techniques known as Density Functional Theory (DFT), used in physics, chemistry and materials science to understand structures of atoms, which is important to assess the stability of crystals.

We used a training process called ‘active learning’ that dramatically boosted GNoME’s performance. GNoME would generate predictions for the structures of novel, stable crystals, which were then tested using DFT. The resulting high-quality training data was then fed back into our model training.
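The loop described here can be sketched schematically. Everything below is a stand-in for illustration, not GNoME's code: a cheap quadratic "oracle" plays the role of DFT, and a nearest-neighbor lookup plays the role of the trained GNN.

```python
import random

def oracle_energy(x):
    """Expensive ground truth -- stands in for a DFT calculation."""
    return (x - 0.3) ** 2

def model_predict(x, data):
    """Cheap surrogate -- stands in for the trained GNN."""
    if not data:
        return 0.0
    # Predict using the label of the nearest already-verified point.
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

random.seed(0)
training_data = []
for cycle in range(3):
    candidates = [random.random() for _ in range(100)]    # generate candidates
    ranked = sorted(candidates,
                    key=lambda x: model_predict(x, training_data))
    for x in ranked[:5]:                                  # verify most promising
        training_data.append((x, oracle_energy(x)))       # feed results back
print("verified candidates:", len(training_data))
```

Each cycle the surrogate ranks fresh candidates, the oracle labels only the top few, and those labels enlarge the training set, which is the feedback that makes active learning sample-efficient.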

Our research boosted the discovery rate of materials stability prediction from around 50% to 80%, based on MatBench Discovery, an external benchmark set by previous state-of-the-art models. We also managed to scale up the efficiency of our model by improving the discovery rate from under 10% to over 80%; such efficiency increases could have significant impact on how much compute is required per discovery.

AI ‘recipes’ for new materials

The GNoME project aims to drive down the cost of discovering new materials. External researchers have independently created 736 of GNoME’s new materials in the lab, demonstrating that our model’s predictions of stable crystals accurately reflect reality. We’ve released our database of newly discovered crystals to the research community. By giving scientists the full catalog of the promising ‘recipes’ for new candidate materials, we hope this helps them to test and potentially make the best ones.


Upon completion of our latest discovery efforts, we searched the scientific literature and found 736 of our computational discoveries were independently realized by external teams across the globe. Above are six examples ranging from a first-of-its-kind Alkaline-Earth Diamond-Like optical material (Li4MgGe2S7) to a potential superconductor (Mo5GeB2).

Rapidly developing new technologies based on these crystals will depend on the ability to manufacture them. In a paper led by our collaborators at Berkeley Lab, researchers showed a robotic lab could rapidly make new materials with automated synthesis techniques. Using materials from the Materials Project and insights on stability from GNoME, the autonomous lab created new recipes for crystal structures and successfully synthesized more than 41 new materials, opening up new possibilities for AI-driven materials synthesis.


A-Lab, a facility at Berkeley Lab where artificial intelligence guides robots in making new materials. Photo credit: Marilyn Sargent/Berkeley Lab

New materials for new technologies

To build a more sustainable future, we need new materials. GNoME has discovered 380,000 stable crystals that hold the potential to develop greener technologies – from better batteries for electric cars, to superconductors for more efficient computing.

Our research – and that of collaborators at the Berkeley Lab, Google Research, and teams around the world — shows the potential to use AI to guide materials discovery, experimentation, and synthesis. We hope that GNoME together with other AI tools can help revolutionize materials discovery today and shape the future of the field.


Landmark Papers in Machine Learning

daturkel/learning-papers


This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I’ve done my best to select the papers that I think are novel or significant.

My opinions are by no means the final word on these topics. Please create an issue or pull request if you have a suggestion.

Contents:

  • Association Rule Learning
  • Decision Trees
  • AlexNet (image classification CNN)
  • Convolutional Neural Network
  • DeepFace (facial recognition)
  • Generative Adversarial Network
  • Inception (classification/detection CNN)
  • Long Short-Term Memory (LSTM)
  • Residual Neural Network (ResNet)
  • Transformer (sequence-to-sequence modeling)
  • U-Net (image segmentation CNN)
  • VGG (image recognition CNN)
  • Gradient Boosting
  • Random Forest
  • Expectation Maximization
  • Stochastic Gradient Descent
  • Non-negative Matrix Factorization
  • DeepQA (Watson)
  • Latent Dirichlet Allocation
  • Latent Semantic Analysis
  • Back-propagation
  • Batch Normalization
  • Gated Recurrent Unit
  • Collaborative Filtering
  • Matrix Factorization
  • Implicit Matrix Factorization
  • Elastic Net
  • k-Nearest Neighbors
  • Support Vector Machine
  • The Bootstrap

Icon
🔒 Paper behind paywall. In some cases, I provide an alternative link if one is available directly from one of the authors.
🔑 Freely available version of paywalled paper, directly from the author.
💽 Code associated with the paper.
🏛️ Precursor or historically relevant paper. This may be a fundamental breakthrough that paved the way for the concept in question to be developed.
🔬 Iteration, advancement, elaboration, or major popularization of a technique.
📔 Blog post or something other than a formal publication.
🌐 Website associated with the paper.
🎥 Video associated with the paper.
📊 Slides or images associated with the paper.

Papers preceded by “See also” indicate either additional historical context or else major developments, breakthroughs, or applications.

Scalable Algorithms for Association Mining (2000) . Zaki, @IEEE 🔒.

Mining Frequent Patterns without Candidate Generation (2000) . Han, Pei, and Yin, @acm .

Mining Association Rules between Sets of Items in Large Databases (1993) , Agrawal, Imielinski, and Swami, @CiteSeerX 🏛️.

See also: The GUHA method of automatic hypotheses determination (1966) , Hájek, Havel, and Chytil, @Springer 🔒 🏛️.

  • The Enron Corpus: A New Dataset for Email Classification Research (2004) , Klimt and Yang, @Springer 🔒 / @author 🔑.
  • See also: Introducing the Enron Corpus (2004) , Klimt and Yang, @author .
  • ImageNet: A large-scale hierarchical image database (2009) , Deng et al., @IEEE 🔒 / @author 🔑.
  • See also: ImageNet Large Scale Visual Recognition Challenge (2015) , @Springer 🔒 / @arXiv 🔑 + @author 🌐.
  • Induction of Decision Trees (1986) , Quinlan, @Springer .

Deep Learning

  • ImageNet Classification with Deep Convolutional Neural Networks (2012) , Krizhevsky, Sutskever, and Hinton, @NIPS .
  • Gradient-based learning applied to document recognition (1998) , LeCun, Bottou, Bengio, and Haffner, @IEEE 🔒 / @author 🔑.
  • See also: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (1980) , Fukushima, @Springer 🏛️.
  • See also: Phoneme recognition using time-delay neural networks (1989) , Waibel, Hanazawa, Hinton, Shikano, and Lang, @IEEE 🏛️.
  • See also: Fully Convolutional Networks for Semantic Segmentation (2014) , Long, Shelhamer, and Darrell, @arXiv .
  • DeepFace: Closing the Gap to Human-Level Performance in Face Verification (2014) , Taigman, Yang, Ranzato, and Wolf, Facebook Research .
  • Generative Adversarial Nets (2014) , Goodfellow et al., @NIPS + @Github 💽.
  • Improving Language Understanding by Generative Pre-Training (2018) aka GPT, Radford, Narasimhan, Salimans, and Sutskever, @OpenAI + @Github 💽 + @OpenAI 📔.
  • See also: Language Models are Unsupervised Multitask Learners (2019) aka GPT-2, Radford, Wu, Child, Luan, Amodei, and Sutskever, @OpenAI 🔬 + @Github 💽 + @OpenAI 📔.
  • See also: Language Models are Few-Shot Learners (2020) aka GPT-3, Brown et al., @arXiv + @OpenAI 📔.
  • Going Deeper with Convolutions (2014) , Szegedy et al., @ai.google + @Github 💽.
  • See also: Rethinking the Inception Architecture for Computer Vision (2016) , Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna, @ai.google 🔬.
  • See also: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016) , Szegedy, Ioffe, Vanhoucke, and Alemi, @ai.google 🔬.
  • Long Short-term Memory (1995) , Hochreiter and Schmidhuber, @CiteSeerX .
  • Deep Residual Learning for Image Recognition (2015) , He, Zhang, Ren, and Sun, @arXiv .
  • Attention Is All You Need (2017) , Vaswani et al., @NIPS .
  • U-Net: Convolutional Networks for Biomedical Image Segmentation (2015) , Ronneberger, Fischer, Brox, @Springer 🔒 / @arXiv 🔑.
  • Very Deep Convolutional Networks for Large-Scale Image Recognition (2015) , Simonyan and Zisserman, @arXiv + @author 🌐 + @ICLR 📊 + @YouTube 🎥.

Ensemble Methods

A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting (1997—published as abstract in 1995) , Freund and Schapire, @CiteSeerX .

See also: Experiments with a New Boosting Algorithm (1996) , Freund and Schapire, @CiteSeerX 🔬.

  • Bagging Predictors (1996) , Breiman, @Springer .
  • Greedy function approximation: A gradient boosting machine (2001) , Friedman, @Project Euclid .
  • See also: XGBoost: A Scalable Tree Boosting System (2016) , Chen and Guestrin, @arXiv 🔬 + @GitHub 💽.
  • Random Forests (2001) , Breiman, @CiteSeerX .
  • Mastering the game of Go with deep neural networks and tree search (2016) , Silver et al., @Nature .
  • IBM's Deep Blue chess grandmaster chips (1999) , Hsu, @IEEE 🔒.
  • See also: Deep Blue (2002) , Campbell, Hoane, and Hsu, @ScienceDirect 🔒.

Optimization

  • Adam: A Method for Stochastic Optimization (2015) , Kingma and Ba, @arXiv .
  • Maximum likelihood from incomplete data via the EM algorithm (1977) , Dempster, Laird, and Rubin, @CiteSeerX .
  • Stochastic Estimation of the Maximum of a Regression Function (1952) , Kiefer and Wolfowitz, @ProjectEuclid .
  • See also: A Stochastic Approximation Method (1951) , Robbins and Monro, @ProjectEuclid 🏛️.

Miscellaneous

  • Learning the parts of objects by non-negative matrix factorization (1999) , Lee and Seung, @Nature 🔒.
  • The PageRank Citation Ranking: Bringing Order to the Web (1998) , Page, Brin, Motwani, and Winograd, @CiteSeerX .
  • Building Watson: An Overview of the DeepQA Project (2010) , Ferrucci et al., @AAAI .

Natural Language Processing

  • Latent Dirichlet Allocation (2003) , Blei, Ng, and Jordan, @JMLR
  • Indexing by latent semantic analysis (1990) , Deerwester, Dumais, Furnas, Landauer, and Harshman, @CiteSeerX .
  • Efficient Estimation of Word Representations in Vector Space (2013) , Mikolov, Chen, Corrado, and Dean, @arXiv + @Google Code 💽.

Neural Network Components

  • Autograd: Effortless Gradients in Numpy (2015) , Maclaurin, Duvenaud, and Adams, @ICML + @ICML 📊 + @Github 💽.
  • Learning representations by back-propagating errors (1986) , Rumelhart, Hinton, and Williams, @Nature 🔒.
  • See also: Backpropagation Applied to Handwritten Zip Code Recognition (1989) , LeCun et al., @IEEE 🔒🔬 / @author 🔑.
  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015) , Ioffe and Szegedy, @ICML via PMLR .
  • Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014) , Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, @JMLR .
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014) , Cho et al., @arXiv .
  • The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain (1958) , Rosenblatt, @CiteSeerX .

Recommender Systems

  • Using collaborative filtering to weave an information tapestry (1992) , Goldberg, Nichols, Oki, and Terry, @CiteSeerX .
  • Application of Dimensionality Reduction in Recommender System - A Case Study (2000) , Sarwar, Karypis, Konstan, and Riedl, @CiteSeerX .
  • See also: Learning Collaborative Information Filters (1998) , Billsus and Pazzani, @CiteSeerX 🏛️.
  • See also: Netflix Update: Try This at Home (2006) , Funk, @author 📔 🔬.
  • Collaborative Filtering for Implicit Feedback Datasets (2008) , Hu, Koren, and Volinsky, @IEEE 🔒 / @author 🔑.

Regularization

  • Regularization and variable selection via the Elastic Net (2005) , Zou and Hastie, @CiteSeer .
  • Regression Shrinkage and Selection Via the Lasso (1994) , Tibshirani, @CiteSeerX .
  • See also: Linear Inversion of Band-Limited Reflection Seismograms (1986) , Santosa and Symes, @SIAM 🏛️.

Software

  • MapReduce: Simplified Data Processing on Large Clusters (2004) , Dean and Ghemawat, @ai.google .
  • TensorFlow: A system for large-scale machine learning (2016) , Abadi et al., @ai.google + @author 🌐.
  • Torch: A Modular Machine Learning Software Library (2002) , Collobert, Bengio and Mariéthoz, @Idiap + @author 🌐.
  • See also: Automatic differentiation in PyTorch (2017) , Paszke et al., @OpenReview 🔬+ @Github 💽.

Supervised Learning

  • Nearest neighbor pattern classification (1967) , Cover and Hart, @IEEE 🔒.
  • See also: E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation (1989) , Silverman and Jones, @JSTOR 🔒.
  • Support Vector Networks (1995) , Cortes and Vapnik, @Springer .
  • Bootstrap Methods: Another Look at the Jackknife (1979) , Efron, @Project Euclid .
  • See also: Problems in Plane Sampling (1949) , Quenouille, @Project Euclid 🏛️.
  • See also: Notes on Bias in Estimation (1956) , Quenouille, @JSTOR 🏛️.
  • See also: Bias and Confidence in Not-quite Large Samples (1958) , Tukey, @Project Euclid 🔬.

A special thanks to Alexandre Passos for his comment on this Reddit thread , as well as the responders to this Quora post . They provided many great papers to get this list off to a great start.


Diagnosing Diabetic Retinopathy with AI

How a team at Google is using AI to help doctors prevent blindness in diabetics.


Google research scientist Varun Gulshan was looking for a project that would meet a few criteria.

The project would utilize Gulshan’s background developing artificial intelligence (AI) algorithms and stimulate his interest in science and medicine. And ideally, the project would help people in Gulshan’s home country of India.

I had started thinking a lot about working on more fundamental problems. I wanted to use image recognition for something that would benefit society.

Varun Gulshan, Google Research Scientist

He fired off an email to Phil Nelson, director of Google Accelerated Science (GAS), asking if there was such a project in the works.

A few weeks later, Gulshan was opening a digital drive containing hundreds of anonymized retina scans from a hospital in India. Nelson thought he had a project for Gulshan, but first he needed to know: Could an artificial intelligence model learn to identify which of these images showed signs of a specific cause of blindness, a disease called diabetic retinopathy?

“This was like a perfect skill-set match for me,” says Gulshan, whose background was working with AI to recognize hand gestures. “When I looked at these images, I could tell that, okay, deep learning is kind of working well enough,” he says. “We could really use it on these problems.”

A Growing Concern

With 70 million people with diabetes, India has a growing problem with diabetic retinopathy. The disease creates lesions in the back of the retina that can lead to total blindness, and 18 percent of diabetic Indians already have the ailment. With 415 million diabetics at risk for blindness worldwide (the United States, China, and India have the most cases), the disease is a global concern.

18 percent of the 70 million people with diabetes in India have the ailment.

But the good news is that permanent vision loss is not inevitable. For those who are diagnosed early enough, medications, therapies, exercise, and a healthy diet are highly effective treatments for preventing further damage.

The Challenges

Awareness is a huge issue with diabetic retinopathy. Many diabetic patients assume that early signs of the disease are simply minor vision problems, according to Dr. Rajiv Raman, a retina surgeon at Sankara Nethralaya Eye Hospital in Chennai, India. With no word in Hindi for “retina,” just talking about the disease is a challenge. “For cataracts we have a word, for glaucoma we have a word in Hindi as well as in Tamil, but diabetic retinopathy is – there is no translational word,” Dr. Raman says.

A photograph of Doctor Rajiv Raman

Dr. Rajiv Raman, Retina Surgeon at Sankara Nethralaya

Chennai, India

But while an ophthalmologist can explain the disease and how regular exams will monitor its progress, the real difficulty is getting at-risk patients a retinal exam in the first place. For rural communities worldwide, the prevalence of late-stage diabetic retinopathy has more to do with infrastructure than medicine. The journey from home to the nearest specialist can be long, and keeping multiple appointments is often very difficult.

“Many of the rural patients have an advanced stage of diabetic retinopathy, but they don’t know they are diabetics”

It is often impossible for patients in poverty with dependents to also care for themselves. Instead, they will carry on until the effects of diabetic retinopathy become too bad to ignore, which is often too late. “Many of the rural patients have an advanced stage of diabetic retinopathy, but they don’t know they are diabetics,” says Dr. Sheila John, head of teleophthalmology at Sankara Nethralaya. “They are losing sight. In some of the cases they have lost vision in one eye, [and] the other eye we have to save.”

Patients stand in a line outside of Sankara Nethralaya Eye Hospital

Patients waiting outside of Sankara Nethralaya Eye Hospital

Assembling the Team

The biggest challenge with diagnosing diabetic retinopathy, however, is the sheer number of cases. India alone has 70 million diabetics who must be screened, and there just aren’t enough trained clinicians to review their retinal scans.

We need to screen [patients] early on, when their vision is still good

But it simply isn’t feasible for specialists to open practices in rural areas where only a few patients may reside, according to Dr. R. Kim, chief medical officer at Aravind Eye Hospital in Madurai, India. “We need to screen them early on, when their vision is still good. So how do we do that?” Dr. Kim asks. “Because it’s not humanly possible to screen these 70 million.”

Florence Thng, Product Manager, Verily

If Google’s artificial intelligence could help make diagnosing diabetic retinopathy easier by accurately interpreting retinal scans, perhaps the eyesight of millions could be saved.

The tricky part was creating a data set for the AI model to learn from – a task which involved scoring and labeling all the scans one by one for different grades of severity. Solving that problem would eventually require a large team of ophthalmologists whose scoring of the scans would inform the AI model.

But the team would need more quality data if it was going to teach the AI model the nuances to truly read a retinal scan.

Teaching the Model

At the outset, the team was aided by ophthalmologists at Aravind and Sankara Nethralaya to label the retina images. After a few short months, the model was trained to identify key markers of diabetic retinopathy, such as nerve tissue damage, swelling, and hemorrhaging. And with a larger data set, Gulshan was sure they could make the model even more accurate.

Enter Dr. Jorge Cuadros, head of the Eye Picture Archive Communication System ( EyePACS ), a telemedicine network connecting patients in rural areas across the United States to ophthalmologists for diabetic retinopathy scans. But patients seen by EyePACS still have to wait weeks for a graded scan, and Dr. Cuadros was happy to help any effort for a faster diagnosis.

The data EyePACS shared comprised a wide range of patients and was a hundred times as much as the AI team had gathered by that point. That meant a huge labeling workload because each image had to be graded multiple times to compensate for the biases of different graders. “The model learns…what are the things they always did consistently,” says Dale Webster, a software engineer at Google. “This tends to result in something that’s a bit less biased and a bit more robust.”
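The multi-grade labeling step described above can be sketched in a few lines. The scan IDs and grades here are hypothetical, and the production pipeline's aggregation was more involved than a simple median, but the idea is the same: combine several graders' scores per image so the label reflects what they did consistently.

```python
import statistics

# Hypothetical scan IDs and grades: several ophthalmologists score each
# retina image on the 1 (no retinopathy) to 5 (extreme) scale.
grades = {
    "scan_001": [1, 1, 2],
    "scan_002": [3, 4, 4, 3],
    "scan_003": [5, 4, 5],
}

def consensus(grade_lists):
    # The median grade down-weights outlier graders, so the label
    # reflects what graders "did consistently".
    return {scan: statistics.median(gs) for scan, gs in grade_lists.items()}

labels = consensus(grades)  # {"scan_001": 1, "scan_002": 3.5, "scan_003": 5}
```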

To date, close to 100 ophthalmologists have rendered more than 1 million grades for the AI model.

How the AI Works

How the AI Model Works (1/4)

Over 50 ophthalmologists have manually reviewed more than 1 million anonymous retina scans, rating each for the level of diabetic retinopathy present.

How the AI Model Works (2/4)

Each scan is reviewed multiple times, and is graded manually on a scale of 1 (no diabetic retinopathy signs present) to 5 (extreme signs present).

How the AI Model Works (3/4)

The graded images are then fed into an image recognition algorithm. By feeding the algorithm thousands of graded images, it can start to understand signs of diabetic retinopathy just like an ophthalmologist would.

How the AI Model Works (4/4)

Once the algorithm has been trained, it can be used to power an application called an Automated Retinal Disease Assessment (ARDA). ARDA allows a user to upload a retina scan for instant analysis of diabetic retinopathy.
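As a toy illustration of steps 3 and 4 (not the actual ARDA model, which is a deep convolutional network trained on raw retina images), the train-then-predict loop can be shown with a single hypothetical "lesion score" feature and a learned decision threshold:

```python
# Toy stand-in for steps 3-4: learn a "referable vs. non-referable"
# decision from graded examples. (score, grade) pairs are invented;
# grades use the 1-5 scale described above, with 3+ treated as referable.
train = [(0.1, 1), (0.3, 1), (0.4, 2), (1.2, 3), (1.5, 4), (2.0, 5), (0.2, 1), (1.8, 4)]

def fit_threshold(examples, referable_grade=3):
    """Pick the score cutoff with the fewest training mistakes."""
    ys = [grade >= referable_grade for _, grade in examples]
    best_t, best_acc = None, -1.0
    for t, _ in examples:
        preds = [score >= t for score, _ in examples]
        acc = sum(p == y for p, y in zip(preds, ys)) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

threshold = fit_threshold(train)

def predict(score, t=threshold):
    # Step 4: a new, ungraded scan's score gets an instant assessment.
    return "referable" if score >= t else "non-referable"
```

On this separable toy data the learned cutoff classifies every training example correctly; the real model instead learns millions of pixel-level parameters from the graded images.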

From Model to Device

For all the team members, the idea that they could turn this model into an actual Automated Retinal Disease Assessment (ARDA) device was the main reason for their involvement.

The key to that was another Google team member, Lily Peng. Trained as a medical doctor, Peng, like the rest of the Ophthalmology team, is driven by the prospect of creating an actual clinical impact.

I saw that we had a lot of big ideas – a lot of promises, right? But why do some of these never make it to the bedside?

Lily Peng, Google

Peng had a vision that the ARDA could be used in a clinical setting – but getting to that point required trials and regulatory approval. To do this, the team focused on two goals: conducting a clinical trial to begin testing the ARDA in the real world and writing a paper about the results for the Journal of the American Medical Association (JAMA).

“We wanted to go to JAMA because JAMA is about the practice of medicine,” says Nelson. “We didn’t just want to show that we could do this. We wanted to get on the map with doctors.”

Another part of getting the ARDA device on the map was presenting their work to the Food and Drug Administration (FDA). With Nelson at her side, Peng gave a “virtuoso performance” on the virtues of AI. Peng was a key advocate and translator between the different communities involved in bringing the ARDA to life.

“She can speak all languages,” Gulshan says, “so she could talk to us and understand the technical complexities of what we were doing, and also what the doctors were speaking, and what is relevant in terms of impact. Lily brought that and made it into…something that we can now think of putting into a clinic.”

A New Kind of Thermometer

No one on the Google team had any experience actually creating a medical device, so the team turned to Verily, a healthcare-focused Alphabet company (Alphabet also owns Google), to navigate the regulatory and clinical demands of getting the ARDA technology approved as a medical device.

Accepted into the FDA’s recently announced pre-certification pilot program – one of only nine companies selected out of hundreds that applied to participate – Verily is using its expertise to help usher the ARDA through clinical trials in India. And so is Gulshan, who moved back to India to help doctors and nurses use the device.

An ophthalmologist looks for damage in a patient's eye scan.

A Closer Look

After being seen by an optometrist, each patient at Sankara Nethralaya is examined by an ophthalmologist. If the ophthalmologist sees potential damage to their eyes, the patient undergoes a retina scan.

“Getting regulatory approval is important,” Peng says, “but more important is that the clinicians working with us feel confident about what they’re doing and feel good about using the software. And so it’s not just about safety and effectiveness; it’s whether or not this is actually going to be helpful to them.”

In a recent clinical trial, the ARDA was used to grade the images of 3,000 diabetic patients at two hospitals in India. Those grades were compared with doctors’ assessments, confirming the 2016 study reported in JAMA: the model performed on par with the healthcare workers screening patients.

Dr. Rajiv Raman reviews a patient's retina image of signs of diabetic retinopathy.

Dr. Rajiv Raman reviews a patient's retina image of signs of diabetic retinopathy.

For Dr. Cuadros, the key benefit of the ARDA is simple math. He notes that the percentage of people with diabetic retinopathy in the United States is going down, indicating that preventative treatment is working. But because the rate of diabetes is increasing, the overall number of diabetic retinopathy patients remains the same. The number of people who need diabetic eye screening is rising, while the supply of screening expertise remains flat.

And ophthalmologists feel the pinch.

Every day I should screen 3,000 patients, which is impossible. So you definitely need a helping hand. And ARDA is my helping hand.

Dr. Rajiv Raman, Ophthalmologist

In such conditions, inserting expertise into primary care is a huge benefit. “If ARDA could be used in the primary care physician’s office, it would make a huge difference, because you will be screening more patients,” says Dr. Kim. “So the ophthalmologist…can focus on treating only those with retinopathy.”

I never knew that diabetes could cause blindness. I used to ride my bike from place to place. Until one day things got blurry in my left eye. After eight months I lost all vision in that eye.

Elumalai, patient

In fact, Dr. Raman imagines a device that’s as common as a thermometer or even a glucometer, a diagnostic tool that diabetics already use to monitor their blood sugar. “My job is not to screen for diabetic retinopathy,” he says. “My job is to do laser, to do injections, to give – to really do surgeries and help them alleviate their blindness.”

But no matter the vector of diagnosis, all agree that awareness is key to health. In fact, getting a diabetic retinopathy diagnosis can lead to better outcomes overall. “If you detect retinal disease at an early stage when they don’t need treatment,” Dr. Cuadros says, “it’s still an opportunity for the patient to understand that diabetes is beginning to affect their body. Hopefully that would motivate them to control their blood sugar better.”

Mythili, patient

Mythili is a patient of Dr. Rajiv Raman. She has been diabetic for 19 years and discovered she had diabetic retinopathy 5 years ago. She was well informed that her vision could be affected by diabetes and routinely went to get her eyes checked.

A Diagnostic Advance

More studies are underway, including ongoing clinical trials in India – the first time diabetic eye screenings will be performed at this level. And the Google and Verily teams are optimistic about the possibilities even beyond diabetic retinopathy. “Since [the JAMA article] we have made even more progress,” says Nelson. “We recently published a paper in Nature Biomedical Engineering showing that from a retina image we can predict not only several cardiovascular health risk factors but also your risk of a significant cardiovascular event.”

One day, diagnosing serious diseases may be as easy as taking a temperature or checking blood pressure. But in the near term, millions of diabetics could keep their vision thanks to an AI algorithm helping doctors quickly diagnose diabetic retinopathy.

Related Stories

A father uses YouTube to make a better prosthetic eye for his daughter

How your smartphone could help save your life in an emergency

How one woman saves lives with motorbikes, blood banks, and Google Maps

Meet the team using machine learning to help save the world's bees

When dementia takes memories away, a bicycle helps bring them back

The Evolution and Impact of Google Cloud Platform in Machine Learning and AI

6 Pages Posted: 6 Aug 2024

Praveen Borra

Florida Atlantic University

Date Written: June 18, 2024

Google Cloud Platform (GCP) has emerged as a leader in Machine Learning (ML) and Artificial Intelligence (AI), known for its cutting-edge technologies and inclusive accessibility. GCP not only drives innovation but also democratizes access to powerful ML and AI tools, empowering organizations of all sizes to harness data-driven insights for enhanced innovation, efficiency, and scalable growth. GCP's impact transcends technological advancements, representing a significant shift in digital transformation across diverse industries. This paper delves into GCP's transformative influence through real-world examples and practical applications across sectors such as healthcare, finance, retail, and entertainment. By showcasing GCP's scalable computing resources and robust data analytics capabilities, it illuminates how these technologies enable businesses to discover new opportunities and operational efficiencies. GCP's holistic approach to ML and AI fosters a culture of continuous innovation, empowering enterprises to excel in the era of intelligent computing and data-driven decision-making.

Keywords: Google Cloud Platform, Machine Learning, Artificial Intelligence, TensorFlow, AutoML, BigQuery ML, AI Platform, Cloud Computing, Data Science, Deep Learning, Neural Networks, Industry Applications

Suggested Citation

Praveen Borra (Contact Author)



In probing brain-behavior nexus, big datasets are better.

Metallic sculpture of human brain in front of a large language model illustration

(AI-generated image and Adobe Stock, created and edited by Michael S. Helfenbein)

When designing machine learning models, researchers first train the models to recognize data patterns and then test their effectiveness. But if the datasets used to train and test aren’t sufficiently large, models may appear to be less capable than they actually are, a new Yale study reports.

When it comes to models that identify patterns between the brain and behavior, this could have implications for future research, contribute to the replication crisis affecting psychological research, and hamper understanding of the human brain, researchers say.

The findings were published July 31 in the journal Nature Human Behaviour.

Researchers increasingly use machine learning models to uncover patterns that link brain structure or function to, say, cognitive attributes like attention or symptoms of depression. Making these links allows researchers to better understand how the brain contributes to these attributes (and vice versa) and potentially enables them to predict who might be at risk for certain cognitive challenges based on brain imaging alone.

But models are only useful if they’re accurate across the general population, not just among the people included in the training data.

Often, researchers will split one dataset into a larger portion on which they train the model and a smaller portion used to test the model’s ability (since collecting two separate sets of data requires greater resources). A growing number of studies, however, have subjected machine learning models to a more rigorous test in order to evaluate their generalizability, testing them on an entirely different dataset made available by other researchers.

“And that’s good,” said Matthew Rosenblatt, lead author of the study and a graduate student in the lab of Dustin Scheinost, associate professor of radiology and biomedical imaging at Yale School of Medicine. “If you can show something works in a totally different dataset, then it’s probably a robust brain-behavior relationship.”

Adding another dataset into the mix, however, comes with its own complications — namely, in regard to a study’s “power.” Statistical power is the probability that a research study will detect an effect if one exists. For example, a child’s height is closely related to their age. If a study is adequately powered, then that relationship will be observed. If the study is “low-powered,” on the other hand, there’s a higher risk of overlooking the link between age and height.

There are two important aspects to statistical power — the size of the dataset (also known as the sample size) and the effect size. And the smaller that one of those aspects is, the larger the other needs to be. The link between age and height is strong, meaning the effect size is large; one can observe that relationship in even a small dataset. But when the relationship between two factors is more subtle — like, say, age and how well one can sense through touch — researchers would need to collect data from more people to uncover that connection.

While there are equations that can calculate how big a dataset should be to achieve enough power, there aren’t any to easily calculate how large two datasets — one training and one testing — should be.

To understand how training and testing dataset sizes affect study power, researchers in the new study used data from six neuroimaging studies and resampled that data over and over, changing the dataset sizes to see how that affected statistical power.

“We showed that statistical power requires relatively large sample sizes for both training and external testing datasets,” said Rosenblatt. “When we looked at published studies in the field that use this approach — testing models on a second dataset — we found most of their datasets were too small, underpowering their studies.”

Among already published studies, the researchers found that the median sizes for training and testing datasets were 129 and 108 participants, respectively. For measures with large effect sizes, like age, those dataset sizes were big enough to achieve adequate power. But for measures with medium effect sizes, such as working memory, datasets of those sizes resulted in a 51% chance that the study would not detect a relationship between brain structure and the measure; for measures with low effect sizes, like attention problems, those odds increased to 91%.

“For these measures with smaller effect sizes, researchers may need datasets of hundreds to thousands of people,” said Rosenblatt.
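The resampling experiment described above can be imitated in miniature. The sketch below is not the authors' code: it assumes a simple one-feature linear model, approximates the alpha = .05 significance cutoff with a Fisher z critical value, and estimates power as the fraction of simulated train/external-test pairs in which the brain-behavior correlation is detected.

```python
import math
import random
import statistics

def estimate_power(n_train, n_test, effect_r, n_sim=200, seed=0):
    """Fraction of simulated studies that detect a brain-behavior link.

    Illustrative sketch only: a one-feature linear model is trained on
    one dataset and validated on a separate external dataset, as in the
    study design described above.
    """
    rng = random.Random(seed)
    noise = math.sqrt(1.0 - effect_r ** 2)

    def draw(n):
        # "Brain" feature x and "behavior" y with true correlation effect_r.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [effect_r * x + noise * rng.gauss(0, 1) for x in xs]
        return xs, ys

    def pearson_r(a, b):
        ma, mb = statistics.fmean(a), statistics.fmean(b)
        num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
        return num / den

    r_crit = math.tanh(1.96 / math.sqrt(n_test - 3))  # approx. alpha = .05 cutoff
    detected = 0
    for _ in range(n_sim):
        xtr, ytr = draw(n_train)            # training dataset
        xte, yte = draw(n_test)             # separate external test dataset
        slope = pearson_r(xtr, ytr) * statistics.stdev(ytr) / statistics.stdev(xtr)
        preds = [slope * x for x in xte]
        if pearson_r(preds, yte) > r_crit:  # model "works" on external data
            detected += 1
    return detected / n_sim
```

With a large effect (r around 0.5) and 200 participants in each dataset, estimated power approaches 1; with a small effect (r around 0.1) and 50 participants each, most simulated studies miss the relationship, mirroring the pattern the study reports.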

As more neuroimaging datasets become available, Rosenblatt and his colleagues expect more researchers will opt to test their models on separate datasets.

“That’s a move in the right direction,” said Scheinost. “Especially with reproducibility being the problem it is, validating a model on a second, external dataset is one solution. But we want people to think about their dataset sizes. Researchers must do what they can with the data they have, but as more data becomes available, we should all aim to test externally and make sure those test datasets are large.”

Health & Medicine

Media Contact

Fred Mamoun: [email protected] , 203-436-2643


Research Paper Classification Using Machine and Deep Learning Techniques


Index Terms

Computing methodologies

Machine learning

Machine learning approaches

Classification and regression trees

Recommendations

Deep Learning-Based Text Classification: A Comprehensive Review

Deep learning-based models have surpassed classical machine learning-based approaches in various text classification tasks, including sentiment analysis, news categorization, question answering, and natural language inference. In this article, we ...

Intrusion Detection Using Big Data and Deep Learning Techniques

In this paper, Big Data and Deep Learning Techniques are integrated to improve the performance of intrusion detection systems. Three classifiers are used to classify network traffic datasets, and these are Deep Feed-Forward Neural Network (DNN) and two ...

Boosting to correct inductive bias in text classification

This paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier. We identify the inductive biases ...

Published in

cover image ACM Other conferences

Association for Computing Machinery

New York, NY, United States

Author Tags

  • deep learning
  • gradient-boosted trees
  • machine learning
  • text classification
  • topic classification
  • Research-article
  • Refereed limited

Article Metrics

  • 0 Total Citations
  • 3 Total Downloads
  • Downloads (Last 12 months) 3
  • Downloads (Last 6 weeks) 3



Condensed Matter > Materials Science

Title: Learning Atoms from Crystal Structure

Abstract: Computational modelling of materials using machine learning, ML, and historical data has become integral to materials research. The efficiency of computational modelling is strongly affected by the choice of the numerical representation for describing the composition, structure and chemical elements. Structure controls the properties, but often only the composition of a candidate material is available. Existing elemental descriptors lack direct access to structural insights such as the coordination geometry of an element. In this study, we introduce Local Environment-induced Atomic Features, LEAFs, which incorporate information about the statistically preferred local coordination geometry for atoms in crystal structure into descriptors for chemical elements, enabling the modelling of materials solely as compositions without requiring knowledge of their crystal structure. In the crystal structure, each atomic site can be described by similarity to common local structural motifs; by aggregating these features of similarity from the experimentally verified crystal structures of inorganic materials, LEAFs formulate a set of descriptors for chemical elements and compositions. The direct connection of LEAFs to the local coordination geometry enables the analysis of ML model property predictions, linking compositions to the underlying structure-property relationships. We demonstrate the versatility of LEAFs in structure-informed property predictions for compositions, mapping of chemical space in structural terms, and prioritising elemental substitutions. Based on the latter for predicting crystal structures of binary ionic compounds, LEAFs achieve the state-of-the-art accuracy of 86 per cent. These results suggest that the structurally informed description of chemical elements and compositions developed in this work can effectively guide synthetic efforts in discovering new materials.
Comments: 10 pages, 4 figures, supplementary information
Subjects: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Physics (physics.comp-ph)
Comprehensive review of EEG data classification techniques for ADHD detection using machine learning and deep learning

Nitin Ahire, 2023, Revista română de pediatrie

  • Open access
  • Published: 29 July 2024

Predicting hospital length of stay using machine learning on a large open health dataset

  • Raunak Jain,
  • Mrityunjai Singh,
  • A. Ravishankar Rao &
  • Rahul Garg

BMC Health Services Research, volume 24, article number 860 (2024)


Background

Governments worldwide are facing growing pressure to increase transparency, as citizens demand greater insight into decision-making processes and public spending. An example is the release of open healthcare data to researchers, as healthcare is one of the top economic sectors. Significant information systems development and computational experimentation are required to extract meaning and value from these datasets. We use a large open health dataset provided by the New York State Statewide Planning and Research Cooperative System (SPARCS) containing 2.3 million de-identified patient records. One of the fields in these records is a patient’s length of stay (LoS) in a hospital, which is crucial in estimating healthcare costs and planning hospital capacity for future needs. Hence it would be very beneficial for hospitals to be able to predict the LoS early. The area of machine learning offers a potential solution, which is the focus of the current paper.

Methods

We investigated multiple machine learning techniques including feature engineering, regression, and classification trees to predict the length of stay (LoS) of all the hospital procedures currently available in the dataset. Whereas many researchers focus on LoS prediction for a specific disease, a unique feature of our model is its ability to simultaneously handle 285 diagnosis codes from the Clinical Classification System (CCS). We focused on the interpretability and explainability of input features and the resulting models. We developed separate models for newborns and non-newborns.

Results

The study yields promising results, demonstrating the effectiveness of machine learning in predicting LoS. The best R 2 scores achieved are noteworthy: 0.82 for newborns using linear regression and 0.43 for non-newborns using CatBoost regression. Focusing on cardiovascular disease refines the predictive capability, achieving an improved R 2 score of 0.62. The models not only demonstrate high performance but also provide understandable insights. For instance, birth-weight is employed for predicting LoS in newborns, while diagnostic-related group classification proves valuable for non-newborns.

Conclusions

Our study showcases the practical utility of machine learning models in predicting LoS during patient admittance. The emphasis on interpretability ensures that the models can be easily comprehended and replicated by other researchers. Healthcare stakeholders, including providers, administrators, and patients, stand to benefit significantly. The findings offer valuable insights for cost estimation and capacity planning, contributing to the overall enhancement of healthcare management and delivery.

Peer Review reports

Introduction

Democratic governments worldwide are placing an increasing importance on transparency, as this leads to better governance, market efficiency, improvement, and acceptance of government policies. This is highlighted by reports from the Organization for Economic Co-operation and Development (OECD), an international organization whose mission is to shape policies that foster prosperity, equality, opportunity and well-being for all [ 1 ]. Openness and transparency have been recognized as pillars for democracy, and also for fostering sustainable development goals [ 2 ], which is a major focus of the United Nations ( https://sustainabledevelopment.un.org/sdg16 ).

An important government function is to provide for the healthcare needs of its citizens. The U.S. spends about $3.6 trillion a year on healthcare, which represents 18% of its GDP [ 3 ]. Other developed nations spend around 10% of their GDP on healthcare. The percentage of GDP spent on healthcare is rising as populations age. Consequently, research on healthcare expenditure and patient outcomes is crucial to maintain viable national economies. It is advantageous for nations to combine investigations by the private sector, government sector, non-profit agencies, and universities to find the best solutions. A promising path is to make health data open, which allows investigators from all sectors to participate and contribute their expertise. Though there are obvious patient privacy concerns, open health data has been made available by organizations such as New York State Statewide Planning and Research Cooperative System (SPARCS) [ 4 ].

Once the data is made available, it needs to be suitably processed to extract meaning and insights that will help healthcare providers and patients. We favor the creation and use of an open-source analytics system so that the entire research community can benefit from the effort [ 5 , 6 , 7 ]. As a concrete demonstration of the utility of our system and approach, we revealed that there is a growing incidence of mental health issues amongst adolescents in specific counties in New York State [ 8 ]. This has resulted in targeted interventions to address these problems in these communities [ 8 ]. Knowing where the problems lie allows policymakers and funding agencies to direct resources where needed.

Healthcare in the U.S. is largely provided through private insurance companies and it is difficult for patients to reliably understand what their expected healthcare costs are [ 9 , 10 ]. It is ironic that consumers can readily find prices of electronics items, books, clothes etc. online, but cannot find information about healthcare as easily. The availability of healthcare information including costs, incidence of diseases, and the expected length of stay for different procedures will allow consumers and patients to make better and more informed choices. For instance, in the U.S., patients can budget pre-tax contributions to health savings accounts, or decide when to opt for an elective surgery based on the expected duration of that procedure.

To achieve this capability, it is essential to have the underlying data and models that interpret the data. Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to a problem of interest to both healthcare providers and patients. Significant advances have been made recently in the fields of data mining, machine-learning and artificial intelligence, with growing applications in healthcare [ 11 ]. To make our work concrete, we use our machine-learning system to predict the length of stay (LoS) in hospitals given the patient information in the open healthcare data released by New York State SPARCS [ 4 ].

The LoS is an important variable in determining healthcare costs, as costs directly increase for longer stays. The analysis by Jones [ 12 ] shows that the trends in LoS, hospital bed capacity and population growth have to be carefully analyzed for capacity planning and to ensure that adequate healthcare can be provided in the future. With certain health conditions such as cardiovascular disease, the hospital LoS is expected to increase due to the aging of the population in many countries worldwide [ 13 ]. During the COVID-19 pandemic, hospital bed capacity became a critical issue [ 14 ], and many regions in the world experienced a shortage of healthcare resources. Hence it is desirable to have models that can predict the LoS for a variety of diseases from available patient data.

The LoS is usually unknown at the time a patient is admitted. Hence, the objective of our research is to investigate whether we can predict the patient LoS from variables collected at the time of admission. By building a predictive model through machine learning techniques, we demonstrate that it is possible to predict the LoS from data that includes the Clinical Classifications Software (CCS) diagnosis code, severity of illness, and the need for surgery. We investigate several analytics techniques including feature selection, feature encoding, feature engineering, model selection, and model training in order to thoroughly explore the choices that affect eventual model performance. By using a linear regression model, we obtain an R 2 value of 0.42 when we predict the LoS from a set of 23 patient features. The success of our model will be beneficial to healthcare providers and policymakers for capacity planning purposes and to understand how to control healthcare costs. Patients and consumers can also use our model to estimate the LoS for procedures they are undergoing or for planning elective surgeries.

Stone et al. [ 15 ] present a survey of techniques used to predict the LoS, which include statistical and arithmetic methods, intelligent data mining approaches and operations-research based methods. Lequertier et al. [ 16 ] surveyed methods for LoS prediction.

The main gap in the literature is that most methods focus on analyzing trends in the LoS or predicting the LoS only for specific conditions or restrict their analysis to data from specific hospitals. For instance, Sridhar et al. [ 17 ] created a model to predict the LoS for joint replacements in rural hospitals in the state of Montana by using a training set with 127 patients and a test set with 31 patients. In contrast, we have developed our model to predict the LoS for 285 different CCS diagnosis codes, over a set of 2.3 million patients over all hospitals in New York state. The CCS diagnosis code refers to the code used by the Clinical Classifications Software system, which encompasses 285 possible diagnosis and procedure categories [ 18 ]. Since the CCS diagnosis codes are too numerous to list, we give a few examples that we analyzed, including but not limited to abdominal hernia, acute myocardial infarction, acute renal failure, behavioral disorders, bladder cancer, Hodgkin's disease, multiple sclerosis, multiple myeloma, schizophrenia, septicemia, and varicose veins. To the best of our knowledge, we are not aware of models that predict the LoS on such a variety of diagnosis codes, with a patient sample greater than 2 million records, and with freely available open data. Hence, our investigation is unique from this point of view.

Sotodeh et al. [ 19 ] developed a Markov model to predict the LoS in intensive care unit patients. Ma et al. [ 20 ] used decision tree methods to predict LoS in 11,206 patients with respiratory disease.

Burn et al. examined trends in the LoS for patients undergoing hip-replacement and knee-replacement in the U.K. [ 21 ]. Their study demonstrated a steady decline in the LoS from 1997 to 2012. The purpose of their study was to determine factors that contributed to this decline, and they identified improved surgical techniques such as fast-track arthroplasty. However, they did not develop any machine-learning models to predict the LoS.

Hachesu et al. examined the LoS for cardiac disease patients [ 22 ] and found that blood pressure is an important predictor of LoS. Garcia et al. determined factors influencing the LoS for patients undergoing treatment for hip fracture [ 23 ]. Vekaria et al. analyzed the variability of LoS for COVID-19 patients [ 24 ]. Arjannikov et al. [ 25 ] used positive-unlabeled learning to develop a predictive model for LoS.

Gupta et al. [ 26 ] conducted a meta-analysis of previously published papers on the role of nutrition on the LoS of cancer patients, and found that nutrition status is especially important in predicting LoS for gastrointestinal cancer. Similarly, Almashrafi et al. [ 27 ] performed a meta-analysis of existing literature on cardiac patients and reviewed factors affecting their LoS. However, they did not develop quantitative models in their work. Kalgotra et al. [ 28 ] used recurrent neural networks to build a prediction model for LoS.

Daghistani et al. [ 13 ] developed a machine learning model to predict length of stay for cardiac patients. They used a database of 16,414 patient records and predicted the LoS as one of three classes: short (< 3 days), intermediate (3–5 days), and long (> 5 days). They used detailed patient information, including blood test results, blood pressure, and patient history including smoking habits. Such detailed information is not available in the much larger SPARCS dataset that we utilized in our study.

Awad et al. [ 29 ] provide a comprehensive review of various techniques to predict the LoS. Though simple statistical methods have been used in the past, they make assumptions that the LoS is normally distributed, whereas the LoS has an exponential distribution [ 29 ]. Consequently, it is preferable to use techniques that do not make assumptions about the distribution of the data. Candidate techniques include regression, classification and regression trees, random forests, and neural networks. Rather than using statistical parametric techniques that fit parameters to specific statistical distributions, we favor data-driven techniques that apply machine-learning.

In 2020, during the height of the COVID-19 pandemic, The Lancet, a premier medical journal, drew widespread rebuke [ 30 , 31 , 32 ] for publishing a paper based on questionable data. Many medical journals published expressions of concern [ 33 , 34 ]. The Lancet itself retracted the questionable paper [ 35 ], which is available at [ 36 ] with the stamp “retracted” placed on all pages. One possible solution to prevent such incidents from occurring is for top medical journals to require authors to make their data available for verification by the scientific community. Patient privacy concerns can be mitigated by de-identifying the records made available, as is already done by the New York State SPARCS effort [ 4 ]. Our methodology and analytics system design will become more relevant in the future, as there is a desire to prevent a repetition of the Lancet debacle. Even before the Lancet incident, there was declining trust amongst the public related to medicine and healthcare policy [ 37 ]. This situation continues today, with multiple factors at play, including biased news reporting in mainstream media [ 38 ]. A desirable solution is to make these fields more transparent, by releasing data to the public and explaining the various decisions in terms that the public can understand. The research in this paper demonstrates how such a solution can be developed.

Requirements

We describe the following three requirements of an ideal system for processing open healthcare data:

Utilize open-source platforms to permit easy replicability and reproducibility.

Create interpretable and explainable models.

Demonstrate an understanding of how the input features determine the outcomes of interest.

The first requirement captures the need for research to be easily reproduced by peers in the field. There is growing concern that scientific results are becoming hard for researchers to reproduce [ 39 , 40 , 41 ]. This undermines the validity of the research and ultimately hurts the field. Baker termed this the “reproducibility crisis”, and performed an analysis of the top factors that lead to irreproducibility of research [ 39 ]. Two of the top factors consist of the unavailability of raw data and code.

The second requirement addresses the need for the machine-learning models to produce explanations of their results. Though deep-learning models are popular today, they have been criticized for functioning as black-boxes, and the precise working of the model is hard to discern. In the field of healthcare, it is more desirable to have models that can be explained easily [ 42 ]. Unless healthcare providers understand how a model works, they will be reluctant to apply it in their practice. For instance, Reyes et al. determined that interpretable Artificial Intelligence systems can be better verified, trusted, and adopted in radiology practice [ 43 ].

The third requirement shows that it is important for relevant patient features to be captured that can be related to the outcomes of interest, such as LoS, total cost, mortality rate etc. Furthermore, healthcare providers should be able to understand the influence of these features on the performance of the model [ 44 ]. This is especially critical when feature engineering methods are used to combine existing features and create new features.

In the subsequent sections, we present our design for a healthcare analytics system that satisfies these requirements. We apply this methodology to the specific problem of predicting the LoS.

We have designed the overall system architecture as shown in Fig.  1 . This system is built to handle any open data source. We have shown the New York SPARCS as one of the data sources for the sake of specificity. Our framework can be applied to data from multiple sources such as the Center for Medicare and Medicaid Services (CMS in the U.S.) as shown in our previous work [ 6 ]. We chose a Python-based framework that utilizes Pandas [ 45 ] and Scikit learn [ 46 ]. Python is currently the most popular programming language for engineering and system design applications [ 47 ].

figure 1

Shows the system architecture. We use Python-based open-source tools such as Pandas and Scikit-Learn to implement the system

In Fig.  2 , we provide a detailed overview of the necessary processing stages. The specific algorithms used in each stage are described in the following sections.

figure 2

Shows the processing stages in our analytics pipeline

Recent research has shown that it is highly desirable for machine learning models used in the healthcare domain to be explainable to healthcare providers and professionals [ 48 ]. Hence, we focused on the interpretability and explainability of input features in our dataset and the models we chose to explore. We restricted our investigation to models that are explainable, including regression models, multinomial logistic regression, random forests, and decision trees. We also developed separate models for newborns and non-newborns.

Brief description of the dataset

During our investigation, we utilized open-health data provided by the New York State SPARCS system. The data we accessed was from the year 2016, which was the most recent year available at the time. This data was provided in the form of a CSV file, containing 2,343,429 rows and 34 columns. Each row contains de-identified in-patient discharge information. The dataset columns contained various types of information. They included geographic descriptors related to the hospital where care was provided, demographic descriptors such as patient race, ethnicity, and age, medical descriptors such as the CCS diagnosis code, APR DRG code, severity of illness, and length of stay. Additionally, payment descriptors were present, which included information about the type of insurance, total charges, and total cost of the procedure.

Detailed descriptions of all the elements in the data can be found in [ 49 ]. The CCS diagnosis code has been described earlier. The term “DRG” stands for Diagnostic Related Group [ 49 ], which is used by the Center for Medicare and Medicaid services in the U.S. for reimbursement purposes [ 50 ].

The data includes all patients who underwent inpatient procedures at all New York State Hospitals [ 51 ]. The payment for the care can come from multiple sources: Department of Corrections, Federal/State/Local/Veterans Administration, Managed Care, Medicare, Medicaid, Miscellaneous, Private Health Insurance, and Self-Pay. The dataset sourced from the New York State SPARCS system, encompassing a wider patient population beyond Medicare/Medicaid, holds greater value compared to datasets exclusively composed of Medicare/Medicaid patients. For instance, Gilmore et al. analyzed only Medicare patients [ 52 ].

We examine the distribution of the LoS in the dataset, as shown in Fig.  3 . We note that the providers of the data have truncated the length of stay to 120 days. This explains the peak we see at the tail of the distribution.

figure 3

Distribution of the length of stay in the dataset

Data pre-processing and cleaning

We identified 36,280 samples (1.55% of the data) with missing values; these were discarded from further analysis. We also removed samples with Type of Admission = ‘Unknown’ (0.02% of samples), leaving a final dataset of 2,306,668 samples. The columns ‘Payment Typology 2’ and ‘Payment Typology 3’ have missing values in at least 50% of samples; these were replaced with the string ‘None’.

We note that approximately 10% of the dataset consists of rows representing newborns. We treat this group as a separate category. We found that the ‘Birth Weight’ feature had a zero value for non-newborn samples. Accordingly, to better use the ‘Birth Weight’ feature, we partitioned the data into two classes: newborns and non-newborns. This results in two classes of models, one for newborns and the second for all other patients. We removed the ‘Birth Weight’ feature in the input for the non-newborn samples as its value was zero for those samples.

The columns ‘Total Costs’ and ‘Total Charges’ are usually proportional to the LoS, and it would not be fair to use these variables to predict the LoS; hence, we removed them. We found that the columns ‘Discharge Year’ and ‘Abortion Edit Indicator’ are redundant for LoS prediction models, and we removed them as well. We also removed the columns ‘CCS Diagnosis Description’, ‘CCS Procedure Description’, ‘APR DRG Description’, ‘APR MDC Description’, and ‘APR Severity of Illness Description’, as their corresponding numerical codes were already present as features.
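A minimal pandas sketch of these cleaning steps (the column names come from the paper's description of the dataset; the toy rows are invented):

```python
import pandas as pd

# Toy frame mimicking a few SPARCS columns (names from the paper; values invented).
df = pd.DataFrame({
    "Type of Admission": ["Elective", "Unknown", "Emergency", "Newborn"],
    "Payment Typology 2": [None, "Medicaid", None, None],
    "Total Costs": [1200.0, 800.0, 5400.0, 900.0],
    "Length of Stay": [2, 1, 7, 3],
})

# Drop rows whose admission type is 'Unknown', as done for ~0.02% of samples.
df = df[df["Type of Admission"] != "Unknown"].copy()

# Sparse payment columns are kept, but their missing values become the string 'None'.
df["Payment Typology 2"] = df["Payment Typology 2"].fillna("None")

# 'Total Costs' scales with the LoS, so it is removed from the predictors.
df = df.drop(columns=["Total Costs"])
print(df.shape)  # (3, 3)
```

The same pattern scales to the full 2.3-million-row file; only the list of dropped columns changes.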

Since the focus of this paper is on the prediction of the LoS, we analyzed the distribution of LoS values in the dataset.

We developed regression models using all the LoS values, from 1–120. We also developed classification models where we discretized the LoS into specific bins. Since the distribution of LoS values is not uniform, and is heavily clustered around smaller values, we discretized the LoS into a small number of bins, e.g. 6 to 8 bins.

We utilized 10% of the data as a holdout test-set, which was not seen during the training phase. For the remaining 90% of the data, we used tenfold cross-validation in order to train the model and determine the best parameters to use.

Feature encoding

Many variables in the dataset are categorical; e.g., the variable “APR Severity of Illness Description” takes values in the set [Major, Minor, Moderate, Extreme]. We used distribution-dependent target encoding and one-hot encoding techniques to improve model performance [ 53 ]. In the target encoding, we replaced each categorical value with the product of the mean LoS and the median LoS for that category value, so that the encoded feature better captures the dependence of the LoS distribution on the value of the categorical feature.

For the linear regression model [ 54 ], we sampled a set of 6 categorical features, [‘Type of Admission’, ‘Patient Disposition’, ‘APR Severity of Illness Code’, ‘APR Medical Surgical Description’, ‘APR MDC Code’] which we target encoded with the mean of the LoS and the median of the LoS. We then one-hot encoded every feature (all features are categorical) and for each such one-hot encoded feature, created a new feature for each of the features in the sampled set, by replacing the ones in the one-hot encoded feature with the value of the corresponding feature in the sampled set. For example, we one-hot encoded ‘Operating Certificate Number’, and for samples where ‘Operating Certificate Number’ was 3, we created 6 features, each where samples having the value 3 were assigned the target encoded values of the sampled set features, and the other samples were assigned zero. We used such techniques to exploit the linear relation between LoS and each feature.
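The distribution-dependent target encoding described above can be sketched with pandas (the feature name comes from the dataset; the toy LoS values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Type of Admission": ["Elective", "Emergency", "Elective", "Emergency", "Emergency"],
    "Length of Stay":    [2,          5,           4,          9,           7],
})

# Target encoding: replace each category with the product of the mean LoS
# and the median LoS observed for that category.
stats = df.groupby("Type of Admission")["Length of Stay"].agg(["mean", "median"])
encoding = (stats["mean"] * stats["median"]).to_dict()
df["Admission (encoded)"] = df["Type of Admission"].map(encoding)

print(encoding)  # {'Elective': 9.0, 'Emergency': 49.0}
```

Here 'Elective' has mean 3 and median 3 (encoded 9.0), while 'Emergency' has mean 7 and median 7 (encoded 49.0), so categories associated with longer stays receive larger encoded values.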

According to the sklearn documentation [ 55 ], a random forest regressor is “a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting”. The random forest regressor leverages ensemble learning based on many randomized decision trees to make accurate and robust predictions for regression problems. The averaging of many trees protects against single trees overfitting the training data.

The random forest classifier is also an ensemble learning technique and uses many randomized decision trees to make predictions for classification problems. The 'wisdom of crowds' concept suggests that the decision made by a larger group of people is typically better than an individual. The random forest classifier uses this intuition, and allows each decision tree to make a prediction. Finally, the most popular predicted class is chosen as the overall classification.

For the Random Forest Regressor [ 56 , 57 ] and Random Forest Classifier [ 58 ], we used only the distribution-dependent target encoding, as a random forest classifier/regressor is unsuitable for sparse one-hot encoded columns.

Multinomial logistic regression is a type of regression analysis that predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. It allows for more than two discrete outcomes, extending binomial logistic regression for binary classification to models with multiple class membership. For the multinomial logistic regression model [ 59 ], we used only one-hot encoding, and not target encoding, as the target value was categorical.

Finally, we experimented with combinations of target encoding and one-hot encoding. We can either use target encoding, or one-hot encoding, or both. When both encodings are employed, the dimensionality of the data increases to accommodate the one-hot encoded features. For each combination of encodings, we also experimented with different regression models including linear regression and random forest regression.

Feature importance, selection, and feature engineering

We experimented with different feature selection methods. Since the focus of our work is on developing interpretable and explainable models, we used SHAP analysis to determine relevant features.

We examine the importance of different features in the dataset. We used the SHAP value (Shapley Additive Explanations), a popular measure for feature importance [ 60 ]. Intuitively, the SHAP value measures the difference in model predictions when a feature is used versus omitted. It is captured by the following formula.

$${{\varnothing }}_{i}=\sum_{S\subseteq F\setminus \{i\}}\frac{|S|!\,(n-|S|-1)!}{n!}\left[p(S\cup \{i\})-p(S)\right]$$

where \({{\varnothing }}_{i}\) is the SHAP value of feature \(i\), \(p\) is the prediction by the model, \(n\) is the number of features, \(F\) is the set of all features, and \(S\) is any subset of features that does not include feature \(i\). The specific model we used for the prediction was the random forest regressor, where we target-encoded all features with the product of the mean and the median of the LoS, since most of the features were categorical.
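For a handful of features, the Shapley value can be computed exactly by enumerating feature subsets. A small illustrative sketch (the additive toy model, feature names, and contributions are all hypothetical):

```python
from itertools import combinations
from math import factorial

def shap_value(predict, features, i):
    """Exact Shapley value of feature i for a model `predict` that accepts
    any frozenset of feature names (the p(S) in the formula)."""
    others = [f for f in features if f != i]
    n = len(features)
    phi = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            # Weight |S|! (n - |S| - 1)! / n! for each subset S not containing i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += weight * (predict(frozenset(S) | {i}) - predict(frozenset(S)))
    return phi

# Toy additive model: each present feature contributes a fixed amount to the prediction.
contrib = {"severity": 3.0, "drg": 2.0, "age": 1.0}
predict = lambda S: sum(contrib[f] for f in S)

print(shap_value(predict, list(contrib), "severity"))  # 3.0
```

For an additive model like this one, each feature's Shapley value equals its own contribution, which is a handy sanity check; practical SHAP libraries approximate this sum efficiently for tree models instead of enumerating all subsets.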

Classification models

One approach to the problem is to bin the LoS into different classes, and train a classifier to predict which class an input sample falls in. We binned the LoS into roughly balanced classes as follows: 1 day, 2 days, 3 days, 4–6 days, > 6 days. This strategy is based on the distribution of the LoS as shown earlier in Figs.  3 and  4 .

figure 4

A density plot of the distribution of the length of stay. The area under the curve is 1. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot
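The binning into roughly balanced classes described above maps directly onto pandas' `pd.cut` (the toy LoS values are invented):

```python
import pandas as pd

los = pd.Series([1, 2, 2, 3, 5, 4, 10, 120, 6, 1])

# Bin the LoS into the classes used in the paper:
# 1 day, 2 days, 3 days, 4-6 days, >6 days.
bins = [0, 1, 2, 3, 6, float("inf")]
labels = ["1", "2", "3", "4-6", ">6"]
los_class = pd.cut(los, bins=bins, labels=labels)

print(los_class.value_counts().sort_index())
```

Since `pd.cut` uses right-closed intervals by default, a LoS of exactly 6 falls into the "4-6" bin, matching the class boundaries above.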

We used three different classification models, comprising the following:

Multinomial Logistic Regression

Random Forest Classifier

CatBoost classifier [ 62 ].

We used a Multinomial Logistic Regression model [ 59 ] trained and tested using tenfold cross validation to classify the LoS into one of the bins. The multinomial logistic regression model is capable of providing explainable results, which is part of the requirements. We used the feature engineering techniques described in the previous section.

We used a Random Forest Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins. We used a maximum depth of 10 so as to get explainable insights into the model.

Finally, we used a CatBoost Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins.
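The train-and-evaluate loop for these classifiers can be sketched with scikit-learn's `cross_val_score` (synthetic stand-in data; the CatBoost classifier is omitted here because it is a third-party package, but it plugs into the same loop):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data; the paper's encoded SPARCS features and binned LoS labels would go here.
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

models = {
    "multinomial logistic regression": LogisticRegression(max_iter=1000),
    "random forest (max depth 10)": RandomForestClassifier(max_depth=10, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # tenfold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

`cross_val_score` handles the tenfold splitting and refitting; the depth-10 cap on the random forest mirrors the paper's choice of keeping trees shallow enough to inspect.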

Regression models

We used three different regression models with the feature engineering techniques mentioned above ( Feature encoding section). These comprise:

Linear regression

Catboost regression

Random forest regression

The linear regression was implemented using the nn.Linear() function in the open source library PyTorch [ 63 ]. We used the ‘Adam’ optimization algorithm [ 64 ] in mini-batch settings to train the model weights for linear regression.
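To illustrate the update rule, here is a minimal numpy sketch of mini-batch linear regression trained with Adam (the paper's actual implementation uses PyTorch's nn.Linear and built-in Adam optimizer; the data and hyperparameters below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a target that depends linearly on two encoded features plus noise.
X = rng.normal(size=(1000, 2))
true_w, true_b = np.array([2.0, -1.0]), 3.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=1000)

# Mini-batch linear regression trained with the Adam update rule.
w, b = np.zeros(2), 0.0
m, v = np.zeros(3), np.zeros(3)           # Adam moment estimates for [w, b]
lr, b1, b2, eps, t = 0.01, 0.9, 0.999, 1e-8, 0

for epoch in range(200):
    for start in range(0, 1000, 100):     # mini-batches of 100 samples
        xb, yb = X[start:start + 100], y[start:start + 100]
        err = xb @ w + b - yb
        g = np.append(xb.T @ err / len(xb), err.mean())  # gradient of MSE/2
        t += 1
        m = b1 * m + (1 - b1) * g                        # first moment
        v = b2 * v + (1 - b2) * g ** 2                   # second moment
        step = lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        w, b = w - step[:2], b - step[2]

print(np.round(w, 2), round(b, 2))  # close to the true parameters [2.0, -1.0] and 3.0
```

The bias-corrected moment estimates are what distinguish Adam from plain momentum SGD; in PyTorch the same loop reduces to `optimizer = torch.optim.Adam(model.parameters())` plus `optimizer.step()` per batch.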

We investigated CatBoost regression in order to create models with minimal feature sets, whereby models with a low number of input features would provide adequate results. Accordingly, we trained a CatBoost Regressor [ 65 ] in order to determine the relationship between combinations of features and the prediction accuracy as determined by the R 2 correlation score.

The random forest regression was implemented using the function RandomForestRegressor() in scikit learn [ 55 ].

Model performance measures

For the regression models, we used the following metrics to compare the model performance.

The R 2 score and the p-value. We use a significance level of α = 0.05 (5%) for our statistical tests. If the p-value is less than α = 0.05, then the R 2 score is statistically significant.
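Both quantities can be computed in a few lines (toy values invented; the R 2 score from scikit-learn, the correlation p-value from scipy):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import r2_score

y_true = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y_pred = np.array([2.2, 1.1, 3.5, 3.2, 5.5, 5.1])

r2 = r2_score(y_true, y_pred)          # 1 - SS_res / SS_tot
_, p = stats.pearsonr(y_true, y_pred)  # p-value of the underlying correlation
print(f"R^2 = {r2:.3f}, p = {p:.4g}")
```

With the α = 0.05 threshold above, a p-value below 0.05 marks the R 2 score as statistically significant.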

For classifier models, we used the following metrics to compare the model performance.

True positive rate, false negative rate, and F1 score [ 66 ].

We computed the Brier score using Brier's original calculation in his paper [ 67 ]. In this formulation, for R classes the Brier score B can vary between 0 and R, with 0 being the best score possible:

$$B=\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{R}{\left({\widehat{y}}_{i,c}-{I}_{i,c}\right)}^{2}$$

where N is the number of samples, \({\widehat{y}}_{i,c}\) is the class probability as per the model, and \({I}_{i,c}=1\) if the i th sample belongs to class c and \({I}_{i,c}=0\) if it does not belong to class c .
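The multi-class Brier score described above can be computed in a few lines; the probabilities below are illustrative:

```python
# Multi-class Brier score: squared differences between predicted class
# probabilities and one-hot indicators I_{i,c}, summed over classes and
# averaged over samples.
import numpy as np

def brier_score(prob, labels, n_classes):
    """prob: (N, R) predicted probabilities; labels: (N,) true class ids."""
    indicator = np.eye(n_classes)[labels]          # one-hot I_{i,c}
    return np.mean(np.sum((prob - indicator) ** 2, axis=1))

# A perfectly confident, correct model scores 0 (the best possible).
prob = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = np.array([0, 1])
print(brier_score(prob, labels, 3))   # 0.0
```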

We used the Delong test [ 68 ] to compare the AUC for different classifiers.

These metrics will allow other researchers to replicate our study and provide benchmarks for future improvements.

In this section we present the results of applying the techniques in the Methods section.

Descriptive statistics

We provide descriptive statistics that help the reader understand the distributions of the variables of interest.

Table 1 summarizes basic statistical properties of the LoS variable.

Figure  5 shows the distribution of the LoS variable for newborns.

figure 5

This figure depicts the distribution of the LoS variable for newborns

Table 2 shows the top 20 APR DRG descriptions based on their frequency of occurrence in the dataset.

Figure  6 shows the distribution of the LoS variable for the top 20 most frequently occurring APR DRG descriptions shown in Table  2 .

figure 6

A 3-D plot showing the distribution of the LoS for the top-20 most frequently occurring APR DRG descriptions. The x-axis (horizontal) depicts the LoS, the y-axis shows the APR DRG codes and the z-axis shows the density or frequency of occurrence of the LoS

We experimented with different encoding schemes for the categorical variables, and for each encoding we examined different regression techniques. Our results are shown in Table 3 . We experimented with the three encoding schemes shown in the first column. The last row in the table shows a combination of one-hot encoding and target encoding, where the number of columns in the dataset is increased to accommodate one-hot encoded feature values for categorical variables.

Feature importance, selection and feature engineering

We obtained the SHAP plots using a Random Forest Regressor trained with target-encoded features.

Figures  7  and 8 show the SHAP value plots obtained for the newborn and non-newborn partitions of the dataset, respectively. For the newborn partition, we find that the features "APR DRG Code", "APR Severity of Illness Code", "Patient Disposition", and "CCS Procedure Code" are very useful in predicting the LoS. For instance, high feature values for "APR Severity of Illness Code", which are encoded as red dots, have higher SHAP values than the blue dots, which correspond to low feature values.

figure 7

SHAP Value plot for newborns

figure 8

1-D SHAP plot, in order of decreasing feature importance: top to bottom (for non-newborns)

A similar interpretation can be applied to the features in the non-newborn partition of the dataset. We note that “Operating Certificate Number” is among the top-10 most important features in both the newborn and non-newborn partitions. This finding is discussed in the Discussion section.
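The feature-importance ranking above relies on SHAP values, which require the `shap` package. As a self-contained stand-in, the sketch below ranks features with scikit-learn's permutation importance, which answers a similar question: how much does shuffling a feature degrade the model? The data and model are synthetic.

```python
# Permutation importance as a sketch of model-based feature ranking:
# shuffle each feature in turn and measure the drop in model score.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]   # most important first
print("features by importance:", ranking)
```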

From Fig.  9 , we observe that as the severity of illness code increases from 1–4, there is a corresponding increase in the SHAP values.

figure 9

A 2-D plot showing the relationship between SHAP values for one feature, “APR Severity of Illness Code”, and the feature values themselves (non-newborns)

To further understand the relationship between the APR Severity of Illness code and the LoS, we created the plot in Fig.  10 . This shows that the most frequently occurring APR Severity of Illness code is 1 (Minor), and that the most frequently occurring LoS is 2 days. We provide this 2-D projection of the overall distribution of the multi-dimensional data as a way of understanding the relationship between the input features and the target variable, LoS.

figure 10

A density plot showing the relationship between APR Severity of Illness Code and the LoS. The color scale on the right determines the interpretation of colors in the plot. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot
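The Gaussian-kernel density estimation used for these plots can be sketched with SciPy; the (severity, LoS) pairs below are synthetic stand-ins:

```python
# Sketch of the Gaussian kernel density estimation behind the density plots:
# fit a 2-D KDE to (severity code, LoS) pairs and evaluate it on a grid
# (which could then be rendered with e.g. matplotlib's contourf).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
severity = rng.integers(1, 5, size=500).astype(float)
los = severity * 2 + rng.normal(scale=1.0, size=500)  # LoS grows with severity

kde = gaussian_kde(np.vstack([severity, los]))

grid_sev, grid_los = np.mgrid[1:4:20j, 0:10:20j]      # 20 x 20 evaluation grid
density = kde(np.vstack([grid_sev.ravel(), grid_los.ravel()]))
print(density.shape)
```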

Similarly, Fig.  11 shows the relationship between the birth weight and the length of stay. The most common length of stay is two days.

figure 11

A density plot showing the distribution of the birth weight values (in grams) versus the LoS. The colorbar on the right shows the interpretation of color values shown in the plot. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot

Classification

We obtained a classification accuracy of 46.98% using Multinomial Logistic Regression with tenfold cross-validation in the 5-class classification task for non-newborn cases. The confusion matrix in Fig.  12 shows that the highest density of correctly classified samples is in or close to the diagonal region. The regions where our model fails occur between adjacent classes, as can be inferred from the confusion matrix.

figure 12

Confusion matrix for classification of non-newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers

For the newborn cases, we obtained a classification accuracy of 60.08% using a Random Forest Classifier model with tenfold cross-validation in the 5-class classification task. The confusion matrix in Fig.  13 shows that the majority of data samples lie in or close to the diagonal region. The regions where our model does not do well occur between adjacent classes, as can be inferred from the confusion matrix.

figure 13

Confusion matrix for classification of newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers

The density plot in Fig.  14 shows the relationship between the actual LoS and the predicted LoS. For a LoS of 2 days, the centroid of the predicted LoS cluster is between 2 and 3 days.

figure 14

Shows the density plot of the predicted length of stay versus actual length of stay for the classifier model for non-newborns. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot

A quantitative depiction of our model errors is shown in Fig.  15 . The values in Fig.  15 are interpreted as follows. Referring to the column for LoS = 2, the top row shows that 51% of the predicted LoS values for an actual stay of 2 days are also 2 days (zero error), and that 23% of the predicted values for a LoS of 2 days have an error of 1 day, and so on. The relatively high values in the top row indicate that the model is performing well, with an error of less than 1 day. There are relatively few instances of errors between 2 and 3 days (typically less than 10% of the values fall in this row). The only exception is the class corresponding to a LoS greater than 8 days. The truncation of the data used to produce this class results in larger model errors specifically for this class.

figure 15

Shows the distribution of correctly predicted LoS values for each class used in our model. Along the columns, we depict the different classes used in the model, consisting of LoS equal to 1, 2, 3 …8, and more than 8. Each row depicts different errors made in the prediction. For instance, the top row depicts an error of less than or equal to one day between the actual LoS and the predicted Los. The second row from the top depicts an error which is greater than 1 and less than or equal 2 days. And so on for the other rows, for non-newborns
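The error-distribution table of Fig. 15 can be computed as follows; the arrays here are small illustrative stand-ins, not the study's data:

```python
# For each actual LoS class, compute the fraction of predictions whose
# absolute error falls in each band (<= 1 day, 1-2 days, 2-3 days),
# mirroring the rows of Fig. 15.
import numpy as np

actual    = np.array([2, 2, 2, 2, 3, 3, 5, 5, 5, 8])
predicted = np.array([2, 3, 2, 4, 3, 5, 5, 6, 8, 8])

uppers = [1, 2, 3]   # band k covers previous_upper < |error| <= uppers[k]
table = {}
for los in np.unique(actual):
    errs = np.abs(predicted[actual == los] - actual[actual == los])
    prev, row = -1, []
    for hi in uppers:
        row.append(float(np.mean((errs > prev) & (errs <= hi))))
        prev = hi
    table[int(los)] = row

print(table)
```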

Figures  16 and 17 show the scatter plots for the linear regression models. The exact-fit line has a slope of 1; a perfect model would produce all points lying on this line.

figure 16

Scatter plot showing an instance of a linear regression fit to the data (newborns). The R 2 score is 0.82. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)

figure 17

Scatter plot for linear regression. (non-newborns). The R 2 score is 0.42. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)

Figure  18 shows a density plot depicting the relationship between the predicted length of stay and the actual length of stay.

figure 18

Shows the density plot of the predicted length of stay versus actual length of stay for the regression model for non-newborns. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot. The best fit regression line to our predictions is shown in green, whereas the blue line represents the ideal fit (line of slope 1, where actual LoS and predicted LoS are equal)

Most of the existing literature on LoS prediction is based on data for specific disease conditions such as cancer or cardiac disease. Hence, in order to understand which CCS diagnosis codes produce good model fits, we produced the plot in Fig.  19 .

figure 19

This figure shows the three CCS diagnosis codes that produced the top three R 2 scores using linear regression. These are 101, 100 and 109. The three CCS Diagnosis codes that produced the lowest R 2 scores are 159, 657, and 659

We provide descriptions in Table  4 for the 3 CCS Diagnosis Codes in Fig.  19 with the top R 2 scores using linear regression.

Similarly, Table  5 shows the 3 CCS Diagnosis Codes in Fig.  19 with the lowest R 2 scores using linear regression.

Models with minimal feature sets

We trained a CatBoost Regressor [ 65 ] on the complete dataset in order to determine the relationship between combinations of features and the prediction accuracy as determined by the R 2 correlation score. This is shown in Fig.  20 .

figure 20

The labels for each row on the left show combinations of different input features. A CatBoost regression model was developed using the selected combination of features. The R 2 correlation scores for each model is shown in the bar graph

We can infer from Fig.  20 that only four features ('APR MDC Code', 'APR Severity of Illness Code', 'APR DRG Code', 'Patient Disposition') are sufficient for the model to come very close to its maximum performance. We obtained concurring results when using other regression models for the same experiment.

Classification trees

We used a random forest tree approach to generate the trees in Figs.  21 and 22 .

figure 21

A random forest tree that represents a best-fit model to the data for newborns. With 4 levels of the decision tree, the R 2 score is 0.65

figure 22

A random forest tree using only a tree of depth 3 that represents a best-fit model to the data for non-newborns. The R 2 score is 0.28. We can generate trees with greater depth that better fit the data, but we have shown only a depth of 3 for the sake of readability in the printed version of this paper. Otherwise, the tree would be too large to be legible on this page. The main point in this figure is to showcase the ease of interpretation of the working of the model through rules

We used tenfold cross validation to determine the regression scores. The results are summarized in Tables  6 and 7 .

We computed the multi-class classifier metrics for logistic regression, using one-hot encoding for non-newborns. The results are presented in Table  8 . The first row represents the accuracy of the classifier when Class 0 is compared against the rest of the classes. A similar interpretation applies to the other rows in the table, i.e. one-versus-rest. The macro average gives the balanced recall and precision, and the resulting F1 score. The weighted average gives a support-weighted (number of samples) average of the individual class metrics. The overall accuracy is computed by dividing the total number of accurate predictions (49,686) by the total number of samples (105,932), which yields a value of 0.47.
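Per-class, macro-averaged, and weighted-averaged metrics of this kind can be produced with scikit-learn's classification report; the labels below are synthetic:

```python
# Sketch of per-class one-vs-rest metrics with macro and weighted averages,
# plus overall accuracy = correct predictions / total samples.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.default_rng(3)
y_true = rng.integers(0, 5, size=200)
# Simulate a classifier that is right about 60% of the time, random otherwise.
y_pred = np.where(rng.random(200) < 0.6, y_true, rng.integers(0, 5, size=200))

print(classification_report(y_true, y_pred, zero_division=0))
acc = accuracy_score(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
```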

For the category of non-newborns, Fig.  23  provides a graphical plot that visualizes the ROC curves for the different multiclass classifiers we developed.

figure 23

This figure applies to data concerning non-newborns. We show the multiclass ROC curves for the performance of the catboost classifier for the different classes shown. The area under the ROC curve is 0.7844

In Table  9 we compare the performance of our multiclass classifier using logistic regression developed on 2016 SPARCS data against 2017 SPARCS data.

In order to compare the performance of the different classifiers, we computed the AUC measures reported in Table  10 . Figure 24 visualizes the data in Table 10 and Fig. 25 visualizes the data in Table 11 . In Tables 12 and 13 we report the results of the Delong test for non-newborns and newborns, respectively. In Tables 14 and 15 we report the Brier scores for non-newborns and newborns, respectively.

figure 24

A bar chart that depicts the data in Table  10 for non-newborns

figure 25

A bar chart that depicts the data in Table  11
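The one-vs-rest AUC comparison above can be sketched with scikit-learn's `roc_auc_score`, which averages the per-class AUCs when `multi_class="ovr"`. (The DeLong test used in the paper has no standard scipy/sklearn implementation and is not reproduced here.) Data and models are synthetic stand-ins.

```python
# Compare two multiclass classifiers by their one-vs-rest AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = {}
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("random_forest", RandomForestClassifier(random_state=0))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)
    aucs[name] = roc_auc_score(y_te, prob, multi_class="ovr")

print({k: round(v, 3) for k, v in aucs.items()})
```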

Model parameters

In Table  16 we present the parameter and hyperparameter values used in the different models.

Additional results shown in the Appendix/Supplementary material

Due to space restrictions, we show additional results in the Appendix/Supplementary Material. These results are in tabular form and describe the R 2 scores for different segmentations of the variables in the dataset, e.g. according to age group, severity of illness code, etc.

The most significant result we obtain is shown in Figs.  21 and 22 , which provide an interpretable working of the decision trees using random forest modeling. Figure  21 for newborns shows that the birth weight features prominently in the decision tree, occurring at the root node. Low birth weights are represented on the left side of the tree and are typically associated with longer hospital stays. Higher birth weights occur on the right side of the tree, and the node in the bottom row with 189,574 samples shows that the most frequently occurring predicted stay is 2.66 days. Figure  22 for non-newborns shows that the features "APR DRG Code", "APR Severity of Illness Code" and "Patient Disposition" are the most important top-level features for predicting the LoS. This provides a relatively simple rule-based model, which can be easily interpreted by healthcare providers as well as patients. For instance, the right-most branch of the tree classifies the input data into a relatively high LoS (46 days) when the branch conditions APR DRG Code greater than 813.55 and APR Severity of Illness Code less than 91 are met.

The results in Fig.  19 and Table  4 show that if we restrict our model to specific CCS Diagnosis descriptions such as “coronary atherosclerosis and other heart disease”, we obtain a good R 2 Score of 0.62. The objective of our work is not to cherry-pick CCS Diagnosis codes that produce good results, but rather to develop a single model for the entire SPARCS dataset to obtain a birds-eye perspective. For future work, we can explicitly build separate models for each CCS Diagnosis code, and that could have relevance to specific medical specialties, such as cardiovascular care.

Similarly, the results in Fig.  19 and Table  5 show that there are CCS Diagnosis codes corresponding to schizophrenia and mood disorders that produce a poor model fit. Factors that contribute to this include the type of data in the SPARCS dataset, where information about patient vitals, medications, or a patient’s income level is not provided, and the inherent variability in treating schizophrenia and mood disorders. Baeza et al. [ 69 ] identified several variables that affect the LoS in psychiatric patients, which include psychiatric admissions in the previous years, psychiatric rating scale scores, history of attempted suicide, and not having sufficient income. Such variables are not provided in the SPARCS dataset. Hence a policy implication is to collect and make such data available, perhaps as a separate dataset focused on mental health issues, which have proven challenging to treat.

Figures  16 and 17 show that a better regression fit is obtained when a specific CCS Diagnosis code is used to build the model, such as “Newborn” in Fig.  16 . To put these results in context, we note that it is difficult to obtain a high R 2 value for healthcare datasets in general, and especially for large numbers of patient samples that span multiple hospitals. For instance, Bertsimas [ 70 ] reported an R 2 value of 0.2 and Kshirsagar [ 71 ] reported an R 2 value of 0.33 for similar types of prediction problems as studied in this paper.

Further details for a segmentation of R 2 scores by the different variable categories are shown in the Appendix/Supplementary Material section. For instance, the table corresponding to Age Groups shows that there is close agreement between the mean of the predicted LoS from our model and the actual LoS. Furthermore, the mean LoS increases steadily from 4.8 days for Age group 0–17 to 6.4 days for ages 70 or older. A discussion of these tables is outside the scope of this paper. However, they are being provided to help other researchers form hypotheses for further investigations or to find supporting evidence for ongoing research.

Table 3 shows that the best encoding scheme is to combine target encoding with one-hot encoding and then apply linear regression. This produces an R 2 score of 0.42 for the non-newborn data, which is the best fit we could obtain. This table also shows that significant improvements can be obtained by exploring the search space which consists of different strategies of feature encoding and regression methods. There is no theoretical framework which determines the optimum choice, and the best method is to conduct an experimental search. An important contribution of the current paper is to explore this search space so that other researchers can use and build upon our methodology.

The distribution of errors in Fig.  15 shows that the truncation we employed at a LoS of 8 days produces artifacts in the prediction model as all stays of greater than 8 days are lumped into one class. Nevertheless, the distribution of LoS values in Fig.  4 shows that a relatively small number of data samples have LoS greater than 8 days. In the future, we will investigate different truncation levels, and this is outside the scope of the current paper. By using our methodology, the truncation level can also be tuned by practitioners in the field, including hospital administrators and other researchers.

Our results in Fig.  7 show that certain features are not useful in predicting the LoS. The SHAP plot shows that features such as race, gender, and ethnicity are not useful in predicting the LoS. It would have been interesting if this were not the case, as that implies that there is systemic bias based on race, gender or ethnicity. For instance, a person with a given race may have a smaller LoS based on their demographic identity. This would be unacceptable in the medical field. It is satisfying to see that a large and detailed healthcare dataset does not show evidence of bias.

To place this finding in context, racial bias is an important area of research in the U.S., especially in fields such as criminology and access to financial services such as loans. In the U.S., it is well known that there is a disproportional imprisonment of black and Hispanic males [ 72 ]. Researchers working on criminal justice have determined that there is racial bias in the process of sentencing and granting parole, with blacks being adversely affected [ 73 ]. This bias is reinforced through any algorithms that are trained on the underlying data. There is evidence that banks discriminate against applicants for loans based on their race or gender [ 74 ].

This does not appear to be the case in our analysis of the SPARCS data. Though we did not specifically investigate the issue of racial bias in the LoS, the feature analysis we conducted automatically provides relevant answers. Other researchers, including those in the U.K. [ 21 ], have also determined that gender does not have an effect on LoS or costs. Hence the results in the current paper are consistent with the findings of other researchers in other countries working on entirely different datasets.

From Table  6 we see that in the case of data concerning non-newborns, the catboost regression performs the best, with an R 2 score of 0.432. The p -value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through catboost regression is statistically significant. Similarly, the p -values for linear regression and random forest regression indicate that these models produce predictions that are statistically significant, i.e. they did not occur by random chance.

Table  7 , which refers to data from newborns, shows that linear regression performs the best, with an R 2 score of 0.82. The p -value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through linear regression is statistically significant. Similarly, the p -values for random forest regression and catboost regression indicate that these models produce predictions that are statistically significant.

We examine the performance of classifiers on non-newborn data, as shown in Tables  10 and 12 . The Delong test conducted in Table  12 shows that there is a statistically significant difference between the AUCs of the pairwise comparisons of the models. Hence, we conclude that the catboost classifier performs the best with an average AUC of 0.7844. We also note that there is a marginal improvement in performance when we use the catboost classifier instead of the random forest classifier. Both the catboost classifier and the random forest classifier perform better than logistic regression. We conclude that the best performing model for non-newborns is the catboost classifier, followed by the random forest classifier, and then logistic regression.

In the case of newborn data, we examine the performance of the classifiers as shown in Tables  11 and 13 . From Table 13 , we note that the p -values in all the rows are less than 0.05, except for the binary class “one vs. rest for class 3”, random forests vs. catboost. Hence, for this particular comparison between the random forest classifier and the catboost classifier for “one vs. rest for class 3”, we cannot conclude that there is a statistically significant difference between the performance of these two classifiers. From Table  11 we observe that the AUCs of these two classifiers are very similar. We also note that only about 10% of the dataset consists of newborn cases.

From Table  14 we note that the Brier score for the catboost classifier is the lowest. A lower Brier score indicates better performance. According to the Brier scores for the non-newborn data, the catboost classifier performs the best, followed by the random forest classifier and then logistic regression. Table 15 shows that for newborns, the random forest classifier performs the best, followed by the catboost classifier and logistic regression. The performance of the random forest classifier and catboost classifier are very similar.

From a practical perspective, it may make sense to use a catboost classifier on both newborn and non-newborn data as it simplifies the processing pipeline. The ultimate decision rests with the administrators and implementers of these decision systems in the hospital environment.

Burn et al. [ 21 ] observe that though the U.S. has reported declines in LoS similar to those in the U.K., the overall costs of joint replacement have risen. The U.K. government created policies to encourage the formation of specialist centers for joint replacement, which have reduced the LoS as well as delivered cost reductions. The results and analysis presented in our current paper can help educate patients and healthcare consumers about trends in healthcare costs and how they can be reduced. An informed and educated electorate can press their elected representatives to make changes to the healthcare system that benefit the populace.

Hachesu et al. examined the LoS for cardiac disease patients [ 22 ] where they used data from around 5000 patients and considered 35 input variables to build a predictive model. They found that the LoS was longer in patients with high blood pressure. In contrast, our method uses data from 2.5 million patients and considers multiple disease conditions simultaneously. We also do not have access to patient vitals such as blood pressure measurements, due to the limitation of the existing New York State SPARCS data.

Garcia et al. [ 23 ] conducted a study of elderly patients (age greater than 60) to understand factors governing the LoS for hip fracture treatment. They used 660 patient records and determined that the most significant variable was the American Society of Anesthesiologists (ASA) classification. The ASA score ranges from 1 to 5 and captures the anesthesiologist's impression of a patient's health and comorbidities at the time of surgery. Garcia et al. showed a monotonically increasing relationship between the ASA score and the LoS. However, they did not build a specific predictive model. Their work shows that it is possible to find single variables with significant information content for estimating the LoS. The New York SPARCS dataset that we used does not contain the ASA score. Hence a policy implication of our research is to urge healthcare authorities to include variables such as the ASA score, where relevant, in datasets released in the future. The additional storage required is very small (one additional byte per patient record).

Arjannikov et al. [ 25 ] developed predictive models by binarizing the data into two categories, e.g. LoS ≤ 2 days or LoS > 2 days. In our work, we did not employ such a discretization. In contrast, we used continuous regression techniques as well as classification into more than two bins, as it is preferable to stay as close to the actual data as possible.

Almashrafi et al. [ 27 ] and Cots et al. [ 75 ] observed that larger hospitals tended to have longer LoS for patients undergoing cardiac surgery. Though we did not specifically examine cardiac surgery outcomes, our feature analysis indicated that the hospital operating certificate number had lower relevance than other features such as DRG codes. Nevertheless, the SHAP plots in Fig.  7 and Fig.  8 show that the hospital operating certificate number occurs within the top 10 features in order of SHAP values. We will investigate this relationship in more detail in future research, as it requires determining the size of the hospital from the operating certificate number and creating an appropriate machine-learning model. The Appendix contains results that show certain operating certificate numbers that produce a good model fit to the data.

A major focus of our research is on building interpretable and explainable models. Based on the principle of parsimony, it is preferable to utilize models which involve fewer features. This will provide simpler explanations to healthcare professionals as well as patients. We have shown through Fig.  20 that a model with five features performs just as well as a model with seven features. These features also make intuitive sense and the model’s operation can be understood by both patients and healthcare providers.

Patients in the U.S. increasingly have to pay for medical procedures out-of-pocket, as insurance payments do not cover all the expenses, leading to unexpectedly large bills [ 76 ]. Many patients in the U.S. also do not possess health insurance, with the consequence that they get charged the highest rates [ 77 ]. Kullgreen et al. [ 78 ] observe that patients in the U.S. need to be discerning healthcare consumers, as they can optimize the value they receive from out-of-pocket spending. In addition to estimating the cost of medical procedures, patients will also benefit from estimating the expected duration of a procedure such as joint replacement. This will allow them to budget adequate time for their medical procedures. Patients and consumers will benefit from obtaining estimates from an unbiased open data source such as New York State SPARCS and the use of our model.

Other researchers have developed specific LoS models for particular health conditions, such as cardiac disease [ 22 ], hip replacement [ 21 ], cancer [ 26 ], or COVID-19 [ 24 ]. In addition, researchers typically assume a prior statistical distribution for the outcomes, such as a Weibull distribution [ 24 ]. However, we have not made any assumptions of specific prior statistical distributions, nor have we restricted our analysis to specific diseases. Consequently, our model and techniques should be more widely applicable, especially in the face of rapidly changing disease trajectories worldwide.

Our study is based exclusively on freely available open health data. Consequently, we cannot control the granularity of the data and must use the data as-is. We are unable to obtain more detailed patient information, such as physiological variables (e.g., blood pressure, heart rate variability) at the time of admittance and during the stay. Hospitals, healthcare providers, and insurers have access to this data. However, there is no mandate for them to make it available to researchers outside their own organizations. Sometimes they sell de-identified data to interested parties such as pharmaceutical companies [ 79 ]. Due to the high costs involved in purchasing this data, researchers worldwide, especially in developing countries, are at a disadvantage in developing AI algorithms for healthcare.

There is growing recognition that medical researchers need to standardize data formats and tools used for their analysis, and share them openly. One such effort is the organization for Observational Health Data Sciences and Informatics (OHDSI) as described in [ 80 ].

Twitter has demonstrated an interesting path forward, where a small percentage of its data was made available freely to all users for non-commercial purposes through an API [ 81 ]. Recently, Twitter has made a larger proportion of its data available to qualified academic researchers [ 82 ]. In the future, the profit motives of companies need to be balanced with considerations for the greater public good. An advantage of using the Twitter model is that it spurs more academic research and allows universities to train students and the workforce of the future on real-world and relevant datasets.

In the U.S., a new law went into effect in January 2021 requiring hospitals to make pricing data available publicly. The premise is that having this data would provide better transparency into the working of the healthcare system in the U.S. and lead to cost efficiencies. However, most hospitals are not in compliance with this law [ 83 ]. Concerted efforts by government officials as well as pressure by the public will be necessary to achieve compliance. If the eventual release of such data is not accompanied by a corresponding interest from academicians, healthcare researchers, policymakers, and the public, it is likely that the very premise of the utility of this data will be called into question. Furthermore, merely dumping large quantities of data into the public domain is unlikely to benefit anyone. Hence research efforts such as the one presented in this paper will be valuable in demonstrating the utility of this data to all stakeholders.

Our machine-learning pipeline can easily be applied to new data that will be released periodically by New York SPARCS, and also to hospital pricing data [ 83 ]. Due to our open-source methodology, other researchers can easily extend our work and apply it to extract meaning from open health data. This improves reproducibility, which is an essential aspect of science. We will make our code available on Github to interested researchers for non-commercial purposes.

Limitations of our models

Our models are restricted to the data available through New York State SPARCS, which does not provide detailed information about patient vitals. More detailed physiological data is available through the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) framework [ 84 ], though for a smaller number of patients. We plan to extend our methodology to handle such data in the future. Another limitation of our study is that it does not account for patient co-morbidities. This arises from the de-identification process used to release the SPARCS data, where patient information is removed. Hence we are unable to analyze multiple hospital admissions for a given patient, possibly for different conditions. The main advantage of our approach is that it uses large-scale population data (2.3 million patients) but at a coarse level of granularity, where physiological data is not available. Nevertheless, our approach provides a high-level view of the operation of the healthcare system, which provides valuable insights.

There is growing interest in using data analytics to increase government transparency and inform policymaking. The expectation is that insights gained from such evidence-based analysis will translate into better policies and optimal usage of the available infrastructure. This requires cooperation between computer scientists, domain experts, and policymakers. Open healthcare data is especially valuable in this context due to its economic significance. This paper presents an open-source analytics system for conducting evidence-based analysis on openly available healthcare data.

The goal is to develop interpretable machine learning models that identify key drivers and make accurate predictions related to healthcare costs and utilization. Such models can provide actionable insights to guide healthcare administrators and policy makers. A specific illustration is provided via a robust machine learning pipeline that predicts hospital length of stay across 285 disease categories based on 2.3 million de-identified patient records. The length of stay is directly related to costs.
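
As a minimal sketch of the per-category idea (not the paper's actual pipeline), a baseline predictor can be fit by averaging the observed LoS within each disease category; a real pipeline layers feature engineering, categorical encoding, and regression models on top of such a baseline. The data and category names below are toy illustrations.

```python
from collections import defaultdict

def fit_category_means(records):
    """Fit a baseline LoS model: mean length of stay per disease category.

    `records` is an iterable of (category, los) pairs, e.g. rows drawn from
    a de-identified discharge dataset.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, los in records:
        totals[category] += los
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in totals}

def predict(model, category, default=None):
    """Predict LoS for a category; fall back to the global mean for unseen ones."""
    if category in model:
        return model[category]
    if default is None:
        default = sum(model.values()) / len(model)
    return default

# Toy data: (CCS-style category, length of stay in days)
train = [("pneumonia", 4), ("pneumonia", 6), ("hip_replacement", 3), ("hip_replacement", 5)]
model = fit_category_means(train)
print(predict(model, "pneumonia"))     # 5.0
print(predict(model, "appendicitis"))  # 4.5 (global mean fallback)
```

Per-category means are also a useful yardstick: a learned regression model should beat this baseline before its added complexity is justified.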

We focused on the interpretability and explainability of the input features and the resulting models. Hence, we developed separate models for newborns and non-newborns, given their differing input features. The best performing model for newborns was linear regression, which achieved an R² score of 0.82; for non-newborns it was CatBoost regression, with an R² score of 0.43. Key predictors for newborns included birth weight, while the non-newborn models relied heavily on the diagnosis-related group classification. This demonstrates model interpretability, which is important for adoption. There is an opportunity to further improve performance for specific diseases: restricting the analysis to cardiovascular disease yields an improved R² score of 0.62.
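
The R² metric used above can be made concrete with a small, self-contained example: an ordinary least-squares line fit on synthetic data (the numbers are illustrative only, not drawn from SPARCS), scored with the standard 1 - SS_res/SS_tot definition.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least-squares fit of y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Synthetic single-predictor data, loosely mirroring the interpretable
# newborn model (LoS falling as birth weight rises); values are invented.
x = [1.5, 2.0, 2.5, 3.0, 3.5]   # birth weight, kg
y = [12.0, 9.0, 6.5, 4.0, 2.0]  # length of stay, days
slope, intercept = fit_line(x, y)
preds = [slope * a + intercept for a in x]
print(round(r2_score(y, preds), 3))  # 0.995
```

An R² near 1 means the single predictor explains almost all of the variance; the gap between 0.82 (newborns, dominated by birth weight) and 0.43 (non-newborns, many heterogeneous drivers) reflects how much harder the latter population is to model.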

The presented approach has several desirable qualities. Firstly, transparency and reproducibility are enabled through the open-source methodology. Secondly, the models generalize across numerous disease states, facilitating broad insights. Thirdly, the technical framework can easily integrate new data while allowing modular extensions by the research community. Lastly, the evidence generated can readily inform key stakeholders, including healthcare administrators planning capacity, policymakers optimizing delivery, and patients making medical decisions.

Availability of data and materials

Data is publicly available at the website mentioned in the paper, https://www.health.ny.gov/statistics/sparcs/

There is an “About Us” tab on the website which contains contact details. The authors are not affiliated with this website, which is maintained by New York State.

References

Gurría A. Openness and Transparency - Pillars for Democracy, Trust and Progress. OECD.org. Available: https://www.oecd.org/unitedstates/opennessandtransparency-pillarsfordemocracytrustandprogress.htm . Accessed 28 June 2024.

Jetzek T. The Sustainable Value of Open Government Data: Uncovering the Generative Mechanisms of Open Data through a Mixed Methods Approach. Copenhagen Business School, Department of IT Management. 2015.

Move fast and heal things: How health care is turning into a consumer product. The Economist. 2022.  https://www.economist.com/business/how-health-care-is-turning-into-a-consumer-product/21807114 . Accessed 28 June 2024.

New York State Department Of Health, Statewide Planning and Research Cooperative System (SPARCS).  https://www.health.ny.gov/statistics/sparcs/ . Accessed 5 Oct 2022.

Rao AR, Chhabra A, Das R, Ruhil V. A framework for analyzing publicly available healthcare data. In 2015 17th International Conference on E-health Networking, Application & Services (IEEE HealthCom). 2015: IEEE, pp. 653–656.

Rao AR, Clarke D. A fully integrated open-source toolkit for mining healthcare big-data: architecture and applications. In IEEE International Conference on Healthcare Informatics ICHI, Chicago. 2016: IEEE, pp. 255–261.

Rao AR, Garai S, Dey S, Peng H. PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data. SN Computer Science. 2021;2(6):1–22.

Rao AR, Rao S, Chhabra R. Rising mental health incidence among adolescents in Westchester, NY. Community Ment Health J. 2021:1–1. 

Boylan J F. My $145,000 Surprise Medical Bill. New York Times. 2020.  https://www.nytimes.com/2020/02/19/opinion/surprise-medical-bill.html . Accessed 28 June 2024.

Peterson K, Bykowicz J. Congress Debates Push to End Surprise Medical Billing. Wall Street J. 2020.  https://www.wsj.com/articles/congress-debates-push-to-end-surprise-medical-billing-11589448603 . Accessed 28 June 2024.

Wang S, Zhang J, Fu Y, Li Y. ACM TIST Special Issue on Deep Learning for Spatio-Temporal Data: Part 1. 12th ed. NY: ACM New York; 2021. p. 1–3.

Jones R. Declining length of stay and future bed numbers. BJHCM. 2015;21(9):440–1.

Daghistani TA, Elshawi R, Sakr S, Ahmed AM, Al-Thwayee A, Al-Mallah MH. Predictors of in-hospital length of stay among cardiac patients: a machine learning approach. Int J Cardiol. 2019;288:140–7.

Sen-Crowe B, Sutherland M, McKenney M, Elkbuli A. A closer look into global hospital beds capacity and resource shortages during the COVID-19 pandemic. J Surg Res. 2021;260:56–63.

Stone K, Zwiggelaar R, Jones P, Mac Parthaláin N. A systematic review of the prediction of hospital length of stay: Towards a unified framework. PLOS Digital Health. 2022;1(4):e0000017.

Lequertier V, Wang T, Fondrevelle J, Augusto V, Duclos A. Hospital length of stay prediction methods: a systematic review. Med Care. 2021;59(10):929–38.

Sridhar S, Whitaker B, Mouat-Hunter A, McCrory B. Predicting Length of Stay using machine learning for total joint replacements performed at a rural community hospital. PLoS ONE. 2022;17(11):e0277479.

CCS (Clinical Classifications Software) - Synopsis. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CCS/index.html . Accessed 13 Jan 2022.

Sotoodeh M, Ho JC. Improving length of stay prediction using a hidden Markov model. AMIA Summits on Translational Science Proceedings. 2019;2019:425.

Ma F, Yu L, Ye L, Yao DD, Zhuang W. Length-of-stay prediction for pediatric patients with respiratory diseases using decision tree methods. IEEE J Biomed Health Inform. 2020;24(9):2651–62.

Burn E, et al. Trends and determinants of length of stay and hospital reimbursement following knee and hip replacement: evidence from linked primary care and NHS hospital records from 1997 to 2014. BMJ Open. 2018;8(1):e019146.

Hachesu PR, Ahmadi M, Alizadeh S, Sadoughi F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthcare informatics research. 2013;19(2):121–9.

Garcia AE, et al. Patient variables which may predict length of stay and hospital costs in elderly patients with hip fracture. J Orthop Trauma. 2012;26(11):620–3.

Vekaria B, et al. Hospital length of stay for COVID-19 patients: Data-driven methods for forward planning. BMC Infect Dis. 2021;21(1):1–15.

Arjannikov T, Tzanetakis G. An empirical investigation of PU learning for predicting length of stay. In 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). 2021: IEEE, pp. 41–47.

Gupta D, Vashi PG, Lammersfeld CA, Braun DP. Role of nutritional status in predicting the length of stay in cancer: a systematic review of the epidemiological literature. Ann Nutr Metab. 2011;59(2–4):96–106.

Almashrafi A, Elmontsri M, Aylin P. Systematic review of factors influencing length of stay in ICU after adult cardiac surgery. BMC Health Serv Res. 2016;16(1):318.

Kalgotra P, Sharda R. When will I get out of the hospital? Modeling Length of Stay using Comorbidity Networks. J Manag Inf Syst. 2021;38(4):1150–84.

Awad A, Bader-El-Den M, McNicholas J. Patient length of stay and mortality prediction: a survey. Health Serv Manage Res. 2017;30(2):105–20.

Editorial-Board. The Lancet, HCL and Trump. Wall Street J. 2020.  https://www.wsj.com/articles/the-lancet-hcl-and-trump-11591226880 . Accessed 28 June 2024.

Servick  K, Enserink M. A mysterious company’s coronavirus papers in top medical journals may be unraveling. Science. 2020.  https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-top-medical-journals-may-be-unraveling . Accessed 28 June 2024.

Gabler E, Rabin RC. The Doctor Behind the Disputed Covid Data. New York Times. 2020.  https://www.nytimes.com/2020/07/27/science/coronavirus-retracted-studies-data.html . Accessed 28 June 2024.

Lancet-Editors. Expression of concern: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. Lancet. 2020;395(10240).

Editorial-Board. Expression of Concern: Mehra MR et al. Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J Med. 2020.  https://www.nejm.org/doi/full/10.1056/NEJMoa2007621 . Accessed 28 June 2024.

Hopkins JS, Gold R. Authors Retract Studies That Found Risks of Using Antimalaria Drugs Against Covid-19. Wall Street J. 2020. https://www.wsj.com/articles/authors-retract-study-that-found-risks-of-using-antimalaria-drug-against-covid-19-11591299329 . Accessed 28 June 2024.

https://www.thelancet.com/pdfs/journals/lancet/PIIS0140-6736(20)31180-6.pdf . Accessed 9 Jan 2022.

Wolfensberger M, Wrigley A. Trust in Medicine. Cambridge University Press. 2019. ISBN-13: 978-1108487191.

Bhattacharya J, Nicholson T. A Deceptive Covid Study, Unmasked. Wall Street J. 2022. https://www.wsj.com/articles/deceptive-covid-study-unmasked-abc-misleading-omicron-north-carolina-students-duke-mask-test-to-stay-11641933613 . Accessed 28 June 2024.

Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.

Begley CG, Ioannidis JP. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res. 2015;116(1):116–26.

Eisner D. Reproducibility of science: Fraud, impact factors and carelessness. J Mol Cell Cardiol. 2018;114:364–8.

Wang F, Kaushal R, Khullar D. Should health care demand interpretable artificial intelligence or accept “black box” medicine? Am College Phys. 2020;172:59–60.

Reyes M, et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol Art Intell. 2020;2(3):e190043.

Savadjiev P, et al. Demystification of AI-driven medical image interpretation: past, present and future. Eur Radiol. 2019;29(3):1616–24.

McKinney W. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc.; 2012.

Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Machine Learn Res. 2011;12:2825–30.

Cass S. The top programming languages: Our latest rankings put Python on top-again-[Careers]. IEEE Spectr. 2020;57(8):22–22.

Tjoa E, Guan C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Transactions on Neural Networks and Learning Systems. 2020.

https://www.health.ny.gov/statistics/sparcs/docs/sparcs_data_dictionary.xlsx . Accessed 28 June 2024.

Design and development of the Diagnosis Related Group (DRG). https://www.cms.gov/icd10m/version37-fullcode-cms/fullcode_cms/Design_and_development_of_the_Diagnosis_Related_Group_(DRGs).pdf . Accessed 5 Oct 2022.

ARTICLE 28, Hospitals, Public Health (PBH) CHAPTER 45. 2023. Available: https://www.nysenate.gov/legislation/laws/PBH/A28 . Accessed 28 June 2024.

Gilmore‐Bykovskyi A, et al. Disparities in 30‐day readmission rates among Medicare enrollees with dementia. J Am Geriatr Soc. 2023.

Rodríguez P, Bautista MA, Gonzalez J, Escalera S. Beyond one-hot encoding: Lower dimensional target embedding. Image Vis Comput. 2018;75:21–31.

Montgomery DC, Peck EA, Vining GG. Introduction to linear regression analysis. 6th ed. John Wiley & Sons; 2021. ISBN-13 978-1119578727.

Random forest regressor in sklearn. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html . Accessed 28 June 2024.

Breiman L. Random forests. Mach Learn. 2001;45:5–32.

Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003;43(6):1947–58.

Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.

Böhning D. Multinomial logistic regression algorithm. Ann Inst Stat Math. 1992;44(1):197–200.

Vaid A, et al. Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation. J Med Internet Res. 2020;22(11):e24018.

Density Estimation.  https://scikit-learn.org/stable/modules/density.html . Accessed 5 Oct 2022.

CatBoost, a high-performance open source library for gradient boosting on decision trees. Available:  https://catboost.ai/  and https://catboost.ai/en/docs/concepts/python-usages-examples . Accessed 28 June 2024.

PyTorch documentation for torch.nn, the basic building blocks for graphs. Available: https://pytorch.org/docs/stable/nn.html . Accessed 28 June 2024.

Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. arXiv preprint arXiv:1706.09516. 2017.

Tharwat A. Classification assessment methods. Applied computing and informatics. 2020;17(1):168–92.

Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.

DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988:837–45.

Baeza FL, da Rocha NS, Fleck MP. Predictors of length of stay in an acute psychiatric inpatient facility in a general hospital: a prospective study. Brazilian Journal of Psychiatry. 2017;40:89–96.

Bertsimas D, et al. Algorithmic prediction of health-care costs. Oper Res. 2008;56(6):1382–92.

Kshirsagar R. Accurate and Interpretable Machine Learning for Transparent Pricing of Health Insurance Plans. Presented at the AAAI 2021 Conference. 2021.

Ulmer J, Painter-Davis N, Tinik L. Disproportional imprisonment of Black and Hispanic males: Sentencing discretion, processing outcomes, and policy structures. Justice Q. 2016;33(4):642–81.

Angwin J, Larson J, Mattu S, Kirchner L. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. 2016.

Steil JP, Albright L, Rugh JS, Massey DS. The social structure of mortgage discrimination. Hous Stud. 2018;33(5):759–76.

Cots F, Mercadé L, Castells X, Salvador X. Relationship between hospital structural level and length of stay outliers: Implications for hospital payment systems. Health Policy. 2004;68(2):159–68.

Evans M, McGinty T. Hospital Prices Are Arbitrary. Just Look at the Kingsburys’ $100,000 Bill. Wall Street J. 2021.  https://www.wsj.com/articles/hospital-prices-arbitrary-healthcare-medical-bills-insurance-11635428943 . Accessed 28 June 2024.

Evans M. Hospitals Often Charge Uninsured People the Highest Prices, New Data Show. Wall Street J. 2021. https://www.wsj.com/articles/hospitals-often-charge-uninsured-people-the-highest-prices-new-data-show-11625584448 . Accessed 28 June 2024.

Kullgren JT, et al. A survey of Americans with high-deductible health plans identifies opportunities to enhance consumer behaviors. Health Aff. 2019;38(3):416–24.

Wetsman N. Hospitals are selling treasure troves of medical data — what could go wrong? The Verge. 2021. Available: https://www.theverge.com/2021/6/23/22547397/medical-records-health-data-hospitals-research . Accessed 28 June 2024.

Hripcsak G, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574–8.

Gabarron E, Dorronzoro E, Rivera-Romero O, Wynn R. Diabetes on Twitter: a sentiment analysis. J Diabetes Sci Technol. 2019;13(3):439–44.

Statt N. Twitter is opening up its full tweet archive to academic researchers for free. The Verge. 2021. Available: https://www.theverge.com/2021/1/26/22250203/twitter-academic-research-public-tweet-archive-free-access . Accessed 28 June 2024. 

Evans M, Mathews AW, McGinty T. Hospitals Still Not Fully Complying With Federal Price-Disclosure Rules. Wall Street J. 2021.  https://www.wsj.com/articles/hospital-price-public-biden-11640882507 .

Johnson AE, et al. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3(1):1–9.

Acknowledgements

We are grateful to the New York State SPARCS program for making the data available freely to the public. We greatly appreciate the feedback provided by the anonymous reviewers which helped in improving the quality of this manuscript.

No external funding was available for this research.

Author information

Authors and affiliations

Indian Institute of Technology, Delhi, India

Raunak Jain, Mrityunjai Singh & Rahul Garg

Fairleigh Dickinson University, Teaneck, NJ, USA

A. Ravishankar Rao

Contributions

Raunak Jain, Mrityunjai Singh, A. Ravishankar Rao, and Rahul Garg contributed equally to all stages of preparation of the manuscript.

Corresponding author

Correspondence to A. Ravishankar Rao.

Ethics declarations

Ethics approval and consent to participate

Not applicable as no human subjects were used in our study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Jain, R., Singh, M., Rao, A.R. et al. Predicting hospital length of stay using machine learning on a large open health dataset. BMC Health Serv Res 24, 860 (2024). https://doi.org/10.1186/s12913-024-11238-y

Received: 19 June 2023

Accepted: 24 June 2024

Published: 29 July 2024

DOI: https://doi.org/10.1186/s12913-024-11238-y

  • Machine learning
  • Artificial intelligence
  • Health informatics
  • Open-source software
  • Healthcare analytics

BMC Health Services Research

ISSN: 1472-6963
