
Machine learning tasks in ML.NET


A machine learning task is the type of prediction or inference being made, based on the problem or question that is being asked, and the available data. For example, the classification task assigns data to categories, and the clustering task groups data according to similarity.

Machine learning tasks rely on patterns in the data rather than being explicitly programmed.

This article describes the different machine learning tasks that you can choose from in ML.NET and some common use cases.

Once you have decided which task works for your scenario, you need to choose the best algorithm to train your model. The available algorithms are listed in the section for each task.

Binary classification

A supervised machine learning task that is used to predict which of two classes (categories) an instance of data belongs to. The input of a classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of binary classification scenarios include:

  • Understanding sentiment of Twitter comments as either "positive" or "negative".
  • Diagnosing whether a patient has a certain disease or not.
  • Making a decision to mark an email as "spam" or not.
  • Determining if a photo contains a particular item or not, such as a dog or fruit.

For more information, see the Binary classification article on Wikipedia.

Binary classification trainers

You can train a binary classification model using the following algorithms:

  • AveragedPerceptronTrainer
  • SdcaLogisticRegressionBinaryTrainer
  • SdcaNonCalibratedBinaryTrainer
  • SymbolicSgdLogisticRegressionBinaryTrainer
  • LbfgsLogisticRegressionBinaryTrainer
  • LightGbmBinaryTrainer
  • FastTreeBinaryTrainer
  • FastForestBinaryTrainer
  • GamBinaryTrainer
  • FieldAwareFactorizationMachineTrainer
  • PriorTrainer
  • LinearSvmTrainer

Binary classification inputs and outputs

For best results with binary classification, the training data should be balanced (that is, equal numbers of positive and negative training data). Missing values should be handled before training.

The input label column data must be Boolean. The input features column data must be a fixed-size vector of Single.

These trainers output the following columns:

  • Score (Single): the raw score that was calculated by the model.
  • PredictedLabel (Boolean): the predicted label, based on the sign of the score.
  • Probability (Single): for calibrated trainers, the probability of the label being true.
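For example, a minimal C# sketch of a binary classification pipeline might look like the following; the SentimentInput class, its column names, and the file name are illustrative assumptions rather than part of this article.

    // Minimal sketch: train a binary sentiment classifier with ML.NET.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext(seed: 0);
    IDataView trainingData = mlContext.Data.LoadFromTextFile<SentimentInput>("sentiment.tsv", hasHeader: true);

    var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", "Text")
        .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label"));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class SentimentInput
    {
        [LoadColumn(0)] public bool Label { get; set; }
        [LoadColumn(1)] public string Text { get; set; }
    }

Any of the trainers listed above can be substituted in the final Append step of the pipeline.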

Multiclass classification

A supervised machine learning task that is used to predict the class (category) of an instance of data. The input of a classification algorithm is a set of labeled examples. Each label normally starts as text. It is then run through the TermTransform, which converts it to the Key (numeric) type. The output of a classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of multi-class classification scenarios include:

  • Categorizing flights as "early", "on time", or "late".
  • Understanding movie reviews as "positive", "neutral", or "negative".
  • Categorizing hotel reviews as "location", "price", "cleanliness", etc.

For more information, see the Multiclass classification article on Wikipedia.

One-versus-all upgrades any binary classification learner to act on multiclass datasets. More information is available on Wikipedia.

Multiclass classification trainers

You can train a multiclass classification model using the following training algorithms:

  • LightGbmMulticlassTrainer
  • SdcaMaximumEntropyMulticlassTrainer
  • SdcaNonCalibratedMulticlassTrainer
  • LbfgsMaximumEntropyMulticlassTrainer
  • NaiveBayesMulticlassTrainer
  • OneVersusAllTrainer
  • PairwiseCouplingTrainer

Multiclass classification inputs and outputs

The input label column data must be key type. The feature column must be a fixed-size vector of Single.

These trainers output the following columns:

  • Score (vector of Single): the scores of all classes. Higher values mean a higher probability of falling into the associated class.
  • PredictedLabel (key type): the index of the predicted class.
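For example, a minimal C# sketch of a multiclass pipeline might look like the following; the ReviewInput class and its column names are illustrative assumptions.

    // Minimal sketch: categorize hotel reviews with ML.NET.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<ReviewInput>("reviews.tsv", hasHeader: true);

    var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Category")      // text label -> key type
        .Append(mlContext.Transforms.Text.FeaturizeText("Features", "ReviewText"))
        .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy())
        .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));          // key -> original text label

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class ReviewInput
    {
        [LoadColumn(0)] public string Category { get; set; }
        [LoadColumn(1)] public string ReviewText { get; set; }
    }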

Regression

A supervised machine learning task that is used to predict the value of the label from a set of related features. The label can be of any real value and is not from a finite set of values as in classification tasks. Regression algorithms model the dependency of the label on its related features to determine how the label will change as the values of the features are varied. The input of a regression algorithm is a set of examples with labels of known values. The output of a regression algorithm is a function, which you can use to predict the label value for any new set of input features. Examples of regression scenarios include:

  • Predicting house prices based on house attributes such as number of bedrooms, location, or size.
  • Predicting future stock prices based on historical data and current market trends.
  • Predicting sales of a product based on advertising budgets.

Regression trainers

You can train a regression model using the following algorithms:

  • LbfgsPoissonRegressionTrainer
  • LightGbmRegressionTrainer
  • SdcaRegressionTrainer
  • OnlineGradientDescentTrainer
  • FastTreeRegressionTrainer
  • FastTreeTweedieTrainer
  • FastForestRegressionTrainer
  • GamRegressionTrainer

Regression inputs and outputs

The input label column data must be Single.

The trainers for this task output the following:

  • Score (Single): the value predicted by the model.
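For example, a minimal C# sketch of a regression pipeline might look like the following; the HouseInput class and its column names are illustrative assumptions.

    // Minimal sketch: predict house prices from a few numeric features.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<HouseInput>("houses.csv", hasHeader: true, separatorChar: ',');

    var pipeline = mlContext.Transforms.Concatenate("Features", "Size", "Bedrooms")
        .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price"));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class HouseInput
    {
        [LoadColumn(0)] public float Size { get; set; }
        [LoadColumn(1)] public float Bedrooms { get; set; }
        [LoadColumn(2)] public float Price { get; set; }
    }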

Clustering

An unsupervised machine learning task that is used to group instances of data into clusters that contain similar characteristics. Clustering can also be used to identify relationships in a dataset that you might not logically derive by browsing or simple observation. The inputs and outputs of a clustering algorithm depend on the methodology chosen. You can take a distribution, centroid, connectivity, or density-based approach. ML.NET currently supports a centroid-based approach using K-Means clustering. Examples of clustering scenarios include:

  • Understanding segments of hotel guests based on habits and characteristics of hotel choices.
  • Identifying customer segments and demographics to help build targeted advertising campaigns.
  • Categorizing inventory based on manufacturing metrics.

Clustering trainer

You can train a clustering model using the following algorithm:

  • KMeansTrainer

Clustering inputs and outputs

The input features data must be Single. No labels are needed.
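For example, a minimal C# sketch of a K-Means pipeline might look like the following; the IrisInput class and the choice of three clusters are illustrative assumptions.

    // Minimal sketch: group flower measurements into three clusters.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<IrisInput>("iris.csv", hasHeader: true, separatorChar: ',');

    var pipeline = mlContext.Transforms
        .Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
        .Append(mlContext.Clustering.Trainers.KMeans(featureColumnName: "Features", numberOfClusters: 3));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class IrisInput
    {
        [LoadColumn(0)] public float SepalLength { get; set; }
        [LoadColumn(1)] public float SepalWidth { get; set; }
        [LoadColumn(2)] public float PetalLength { get; set; }
        [LoadColumn(3)] public float PetalWidth { get; set; }
    }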

Anomaly detection

This task creates an anomaly detection model by using Principal Component Analysis (PCA). PCA-Based Anomaly Detection helps you build a model in scenarios where it is easy to obtain training data from one class, such as valid transactions, but difficult to obtain sufficient samples of the targeted anomalies.

An established technique in machine learning, PCA is frequently used in exploratory data analysis because it reveals the inner structure of the data and explains the variance in the data. PCA works by analyzing data that contains multiple variables. It looks for correlations among the variables and determines the combination of values that best captures differences in outcomes. These combined feature values are used to create a more compact feature space called the principal components.

Anomaly detection encompasses many important tasks in machine learning:

  • Identifying transactions that are potentially fraudulent.
  • Learning patterns that indicate that a network intrusion has occurred.
  • Finding abnormal clusters of patients.
  • Checking values entered into a system.

Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use for modeling. The algorithms included in this category have been especially designed to address the core challenges of building and training models by using imbalanced data sets.

Anomaly detection trainer

You can train an anomaly detection model using the following algorithm:

  • RandomizedPcaTrainer

Anomaly detection inputs and outputs

The input features must be a fixed-size vector of Single.
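For example, a minimal C# sketch of a PCA-based anomaly detection pipeline might look like the following; the TransactionInput class and its column names are illustrative assumptions.

    // Minimal sketch: flag unusual transactions with randomized PCA.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<TransactionInput>("transactions.csv", hasHeader: true, separatorChar: ',');

    var pipeline = mlContext.Transforms.Concatenate("Features", "Amount", "TimeOfDay", "MerchantRisk")
        .Append(mlContext.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: "Features", rank: 2));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class TransactionInput
    {
        [LoadColumn(0)] public float Amount { get; set; }
        [LoadColumn(1)] public float TimeOfDay { get; set; }
        [LoadColumn(2)] public float MerchantRisk { get; set; }
    }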

Ranking

A ranking task constructs a ranker from a set of labeled examples. This example set consists of instance groups that can be scored with given criteria. The ranking labels are { 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups with unknown scores for each instance. ML.NET ranking learners are based on machine-learned ranking.

Ranking training algorithms

You can train a ranking model with the following algorithms:

  • LightGbmRankingTrainer
  • FastTreeRankingTrainer

Ranking input and outputs

The input label data type must be key type or Single. The value of the label determines relevance, where higher values indicate higher relevance. If the label is a key type, then the key index is the relevance value, where the smallest index is the least relevant. If the label is a Single, larger values indicate higher relevance.

The feature data must be a fixed-size vector of Single and the input row group column must be key type.
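For example, a minimal C# sketch of a ranking pipeline might look like the following; the SearchResultInput class and its column names are illustrative assumptions, and the FastTree trainer requires the Microsoft.ML.FastTree package.

    // Minimal sketch: rank search results within query groups.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<SearchResultInput>("results.tsv", hasHeader: true);

    var pipeline = mlContext.Transforms.Conversion.MapValueToKey("GroupId", "QueryId")      // row group column must be key type
        .Append(mlContext.Transforms.Concatenate("Features", "Feature1", "Feature2"))
        .Append(mlContext.Ranking.Trainers.FastTree(
            labelColumnName: "Relevance",
            rowGroupColumnName: "GroupId"));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class SearchResultInput
    {
        [LoadColumn(0)] public float Relevance { get; set; }
        [LoadColumn(1)] public string QueryId { get; set; }
        [LoadColumn(2)] public float Feature1 { get; set; }
        [LoadColumn(3)] public float Feature2 { get; set; }
    }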

Recommendation

A recommendation task enables producing a list of recommended products or services. ML.NET uses Matrix factorization (MF), a collaborative filtering algorithm, for recommendations when you have historical product rating data in your catalog. For example, you have historical movie rating data for your users and want to recommend other movies they are likely to watch next.

Recommendation training algorithms

You can train a recommendation model with the following algorithm:

  • MatrixFactorizationTrainer
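For example, a minimal C# sketch of a matrix factorization pipeline might look like the following; the RatingInput class and its column names are illustrative assumptions, and the trainer requires the Microsoft.ML.Recommender package.

    // Minimal sketch: learn user/movie rating patterns with matrix factorization.
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using Microsoft.ML.Trainers;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<RatingInput>("ratings.csv", hasHeader: true, separatorChar: ',');

    var options = new MatrixFactorizationTrainer.Options
    {
        MatrixColumnIndexColumnName = "UserIdEncoded",
        MatrixRowIndexColumnName = "MovieIdEncoded",
        LabelColumnName = "Rating",
        NumberOfIterations = 20,
        ApproximationRank = 100
    };

    var pipeline = mlContext.Transforms.Conversion.MapValueToKey("UserIdEncoded", "UserId")
        .Append(mlContext.Transforms.Conversion.MapValueToKey("MovieIdEncoded", "MovieId"))
        .Append(mlContext.Recommendation().Trainers.MatrixFactorization(options));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class RatingInput
    {
        [LoadColumn(0)] public string UserId { get; set; }
        [LoadColumn(1)] public string MovieId { get; set; }
        [LoadColumn(2)] public float Rating { get; set; }
    }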

Forecasting

The forecasting task uses past time-series data to make predictions about future behavior. Scenarios applicable to forecasting include weather forecasting, seasonal sales predictions, and predictive maintenance.

Forecasting trainers

You can train a forecasting model with the following algorithm:

  • ForecastBySsa
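For example, a minimal C# sketch of an SSA forecasting pipeline might look like the following; the SalesInput class, its column name, and the window sizes are illustrative assumptions, and the trainer requires the Microsoft.ML.TimeSeries package.

    // Minimal sketch: forecast the next 7 values of a daily sales series.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<SalesInput>("sales.csv", hasHeader: true, separatorChar: ',');

    var pipeline = mlContext.Forecasting.ForecastBySsa(
        outputColumnName: "ForecastedSales",
        inputColumnName: "Sales",
        windowSize: 7,      // length of the analysis window
        seriesLength: 30,   // length of the series kept in the model
        trainSize: 365,     // number of data points used for training
        horizon: 7);        // number of future values to predict

    var model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class SalesInput
    {
        [LoadColumn(0)] public float Sales { get; set; }
    }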

Image Classification

A supervised machine learning task that is used to predict the class (category) of an image. The input is a set of labeled examples. Each label normally starts as text. It is then run through the TermTransform, which converts it to the Key (numeric) type. The output of the image classification algorithm is a classifier, which you can use to predict the class of new images. The image classification task is a type of multiclass classification. Examples of image classification scenarios include:

  • Determining the breed of a dog as a "Siberian Husky", "Golden Retriever", "Poodle", etc.
  • Determining if a manufacturing product is defective or not.
  • Determining the type of a flower, such as "Rose", "Sunflower", etc.

Image classification trainers

You can train an image classification model using the following training algorithms:

  • ImageClassificationTrainer

Image classification inputs and outputs

The input label column data must be key type. The feature column must be a variable-sized vector of Byte.

This trainer outputs the following columns:

  • Score (vector of Single): the scores of all classes.
  • PredictedLabel (key type): the index of the predicted class.
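For example, a minimal C# sketch of an image classification pipeline might look like the following; the ImageInput class, the folder path, and the column names are illustrative assumptions, and the trainer requires the Microsoft.ML.Vision and Microsoft.ML.ImageAnalytics packages.

    // Minimal sketch: classify images stored on disk.
    using Microsoft.ML;
    using Microsoft.ML.Data;

    var mlContext = new MLContext();
    IDataView trainingData = mlContext.Data.LoadFromTextFile<ImageInput>("images.tsv", hasHeader: true);

    var pipeline = mlContext.Transforms.Conversion.MapValueToKey("LabelAsKey", "Label")          // text label -> key type
        .Append(mlContext.Transforms.LoadRawImageBytes("Image", "assets/images", "ImagePath"))   // image bytes as features
        .Append(mlContext.MulticlassClassification.Trainers.ImageClassification(
            labelColumnName: "LabelAsKey", featureColumnName: "Image"))
        .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

    ITransformer model = pipeline.Fit(trainingData);

    // Hypothetical input schema used above.
    public class ImageInput
    {
        [LoadColumn(0)] public string ImagePath { get; set; }
        [LoadColumn(1)] public string Label { get; set; }
    }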

Object Detection

A supervised machine learning task that is used to predict the class (category) of an image but also gives a bounding box to where that category is within the image. Instead of classifying a single object in an image, object detection can detect multiple objects within an image. Examples of object detection include:

  • Detecting cars, signs, or people on images of a road.
  • Detecting defects on images of products.
  • Detecting areas of concern on X-Ray images.

Object detection model training is currently only available in Model Builder using Azure Machine Learning.


Assignments

Jump to: [Homeworks] [Projects] [Quizzes] [Exams]

There will be one homework (HW) for each topical unit of the course, due about a week after we finish that unit.

These are intended to build your conceptual analysis skills plus your implementation skills in Python.

  • HW0 : Numerical Programming Fundamentals
  • HW1 : Regression, Cross-Validation, and Regularization
  • HW2 : Evaluating Binary Classifiers and Implementing Logistic Regression
  • HW3 : Neural Networks and Stochastic Gradient Descent
  • HW4 : Trees
  • HW5 : Kernel Methods and PCA

After completing each unit, there will be a 20-minute quiz (taken online via Gradescope).

Each quiz will be designed to assess your conceptual understanding about each unit.

Each quiz will probably have 10 questions. Most questions will be true/false or multiple choice, with perhaps 1-3 short-answer questions.

You can view the conceptual questions in each unit's in-class demos/labs and homework as good practice for the corresponding quiz.

There will be three larger "projects" throughout the semester:

  • Project A: Classifying Images with Feature Transformations
  • Project B: Classifying Sentiment from Text Reviews
  • Project C: Recommendation Systems for Movies

Projects are meant to be open-ended and encourage creativity. They are meant to be case studies of applications of the ML concepts from class to three "real world" use cases: image classification, text classification, and recommendations of movies to users.

Each project will be due approximately 4 weeks after being handed out. Start early! Do not wait until the last few days.

Projects will generally be centered around a particular methodology for solving a specific task and involve significant programming (with some combination of developing core methods from scratch or using existing libraries). You will need to consider some conceptual issues, write a program to solve the task, and evaluate your program through experiments to compare the performance of different algorithms and methods.

Your main deliverable will be a short report (2-4 pages), describing your approach and providing several figures/tables to explain your results to the reader.

You’ll be assessed on effort, the sophistication of your technical approach, the clarity of your explanations, the evidence that you present to support your evaluative claims, and the performance of your implementation. A high-performing approach with little explanation will receive little credit, while a careful set of experiments that illuminate why a particular direction turned out to be a dead end may receive close to full credit.


The 3 Core Machine Learning Tasks

Understanding Classification, Regression, and Clustering in Machine Learning

The Three Core Tasks

Let’s talk about the 3 core machine learning tasks: Classification , Regression , and Clustering .

These are the three tasks you’ll want to focus on when learning data science.

Not only are these 3 tasks very common things you’ll want to do for your data science projects, but these projects will help you build the skills and knowledge you need to perform the more specialized aspects of machine learning.

Let’s get started.

Classification

Imagine you were a bank and had historical information about people who have taken out loans and whether or not those loans had been repaid. Using this data set, you could train a machine learning model to predict whether a person is likely to pay back a loan they’re requesting.

Loan Approval Flow

This is an example of classification : predicting what categorical label something might belong to given historic data.

In this case, the label I'm predicting would be whether the loan would be repaid, and the relevant features might be things like the person's annual income, the value of the home, the term of the loan, and the amount being requested.

Example Project: Classifying Die Hard

Another example of classification is a machine learning experiment I did last year around the movie Die Hard .

My wife and I were debating if Die Hard should be considered a Christmas movie. To solve this problem, I built a machine learning model around historical movie information that included both Christmas movies and non-Christmas movies.

Training a Die Hard Model

Once this model was trained, I asked the model if Die Hard should be considered a Christmas movie and it was able to predict the expected value of the Is Christmas Movie label for that movie.

Both Die Hard and the loan approval models are examples of binary classification where something is going to be one of two possibilities.

Other examples might be predicting if a customer or employee will leave your organization or if a mole is cancerous.

Multi-Class Classification

Types of Classification

Sometimes you want to predict if something is one of several different possibilities. When there are 3 or more possibilities, we call this multi-class classification .

Example Project: ESRB Game Rating Prediction

For example, if you have an unreleased video game and wanted to predict the Entertainment Software Rating Board (or ESRB) rating for the game’s content, you could build a classification model and train it on historical games, their content, and the rating they were given.

Sample ESRB Rating

This trained model would then be able to predict ESRB ratings for video games that had yet to be released and generate some degree of probability that a game might be in any given rating.

Using this, I could determine how likely a new video game was to be given a specific rating given historical video game releases.

Regression

Next we have regression models. If classification is all about predicting a single categorical label, then regression is about predicting a single numerical label instead. In other words, we’re no longer predicting what something is, but instead we’re predicting how much of something.

For example, you could train a regression model to predict how much a used car would sell for, given historical data on recent used car sales in the area.

Example Project: Car Defrosting Prediction

A regression experiment I did in the past involved predicting the number of minutes I’d need to spend in the morning scraping off my car’s windshield.

I built a data set over some time by automatically tracking overnight weather predictions and then manually recording the number of minutes I spent defrosting my car.

By the end of the winter I had a model that was trained sufficiently to be able to predict how much time I’d need to scrape off my car’s windshield.

Of course, by the next winter we had a garage and my model was worthless, but this was a good example of a regression model in action.

Clustering

Finally, we reach clustering. Clustering is the process of determining groups of data points based on their similarities.

Clustered Data Plotted in a Scatter Plot

Clustering is sometimes used for things like segmenting different types of users for marketing strategies based on their usage habits.

Clustering is also used for geographical data. If I wanted to host 5 events across the world to meet every person who watched this video in a given year, a clustering algorithm could determine the optimal places to hold each one of those events.

A map showing population clusters

Some of you would still need to travel farther than others, but the average person’s travel distance would be as good as we could make it.

That covers the basics of the three core types of machine learning: classification, regression, and clustering.

As you get started with machine learning, I strongly encourage you to start with classification or regression.

In fact, a standard experiment for new data scientists is to start out with a binary classification experiment that predicts if a passenger on the Titanic would have lived or died based on their ticket information. And no, this is not a joke. Check it out and see!

Until next time, happy coding and keep learning!


Matt Eland is a software engineering leader and data scientist who has served as a senior engineer, software engineering manager, professional programming instructor, and has helped build enterprise-level software at a variety of organizations before distinguishing himself as a Microsoft MVP in Artificial Intelligence by using technology to accomplish ridiculous things in the name of science and teaching others.

Matt is a Microsoft Certified Azure Data Scientist and AI Engineer associate and is pursuing a master's in data analytics focusing on machine learning and artificial intelligence as he continues to build and learn new things and look for ways to share them with the community. Matt is the author of Refactoring with C# and is currently creating a course on computer vision on Azure and a new book.


Types of tasks in machine learning


Machine learning is a broad field with a variety of approaches to addressing a gamut of tasks. In this article, we will describe some of the commonly addressed tasks using machine learning. We will also comment and point to suitable approaches for handling such tasks.

Prerequisites

To understand the variety of tasks in machine learning, we recommend familiarity with the concepts in

  • Introduction to machine learning

Follow the above link to first get acquainted with the corresponding concepts.

Classification

Classification is the task of assigning categories (or classes) to given instances automatically. The machine learning model that has been trained to achieve such a goal is known as a classifier . Classification falls in the realm of supervised learning — the sub-field of machine learning that enables models to be trained by observing labeled or supervised examples. For example, to learn a classifier to identify spam emails, each supervised example will be a tuple consisting of the email information (text, subject, from, to) and its category ( spam or no spam ).

Depending on the number of categories and their relationships, classification problems fall into several types.

  • Binary classification: An instance must belong to exactly one among two categories. The classifier itself is known as a binary classifier .
  • Multi-class classification: An instance must belong to exactly one among many (more than two) categories. In a multi-class scenario, the categories are mutually exclusive.
  • Multi-labeled classification: An instance may simultaneously belong to more than one category from among several categories. Thus, in a multi-labeled setup, the categories are not mutually exclusive.

Mathematically, we can express the classification problem as follows: If \( \mathbf{x} \) denotes an \( N \)-dimensional input instance, then the goal of classification is to assign \( \mathbf{x} \) to the appropriate category (or categories, in the case of multi-labeled classification) from among \( M \) categories \( \{C_1, \ldots, C_M\} \), where \( M \ge 2 \).

The classifier is trained over a collection of labeled observations provided as tuples \( (\mathbf{x}_i, y_i) \) containing the instance vector \( \mathbf{x}_i \) and the true target variable \( y_i \in \{C_1, \ldots, C_M\} \). This collection of \( L \) labeled observations is known as the labeled training set , or simply the training set , \( \mathbb{L} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_L, y_L)\} \).

Below, we list some of the most popular approaches to classification. Follow the corresponding links to study the classifiers in further detail.

  • Logistic regression (binary)
  • Perceptron (binary)
  • Support vector machine (binary)
  • Decision trees
  • Linear discriminant analysis
  • Naive Bayes
  • Nearest neighbors classifier
  • Random forest
  • Tree-boosting

The binary classifiers in the above list can be adapted to support multi-class or multi-labeled classification scenarios through the one-vs-one or one-vs-rest strategies.
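As a sketch of the one-vs-rest strategy (using standard notation that is not defined in this article), one trains \( M \) binary classifiers, where the \( m \)-th classifier produces a score \( f_m(\mathbf{x}) \) for the instance belonging to \( C_m \) versus the remaining categories, and then predicts

\[ \hat{y} = \underset{m \in \{1, \ldots, M\}}{\arg\max} \; f_m(\mathbf{x}). \]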

Regression

Regression is the task of assigning a real-valued output to an input instance. For example, we may need to predict the selling price, a real number, of a house given its location, area, lot-size, number of bedrooms, bathrooms, and installed amenities. Just like classification, regression models are also trained using the supervised learning approach to machine learning.

Mathematically, we can express the regression problem as follows: If \( \mathbf{x} \) denotes an \( N \)-dimensional input instance, then the goal of regression is to predict a real-valued output \( y \in \mathbb{R} \) for the input \( \mathbf{x} \).

The regression model is trained over a collection of supervised observations provided as tuples \( (\mathbf{x}_i, y_i) \) containing the instance vector \( \mathbf{x}_i \) and the true target variable \( y_i \in \mathbb{R} \). This collection of \( L \) labeled observations is known as the training set , \( \mathbb{L} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_L, y_L)\} \).

In some problem settings, the output variable is also a multi-dimensional vector \( \mathbf{y} \in \mathbb{R}^M \). Such scenarios are known as multi-output regression problems.

Below, we list some of the most popular approaches to regression. Follow the corresponding links to study the regression models in further detail.

  • Linear least-squares regression
  • Ridge regression
  • Lasso regression
  • Bayesian linear regression
  • Kernel regression
  • Random forest regression
  • Nearest neighbors regression
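As a concrete instance of this formulation, linear least-squares regression (listed above) fits a weight vector \( \mathbf{w} \in \mathbb{R}^N \) by minimizing the squared error over the training set (bias term omitted for brevity):

\[ \hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\min} \sum_{i=1}^{L} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2. \]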

Clustering

Clustering involves the assignment of input instances into groups or clusters of similar instances. For example, we may wish to automatically group news items coming from disparate sources into clusters of related news to be summarized by a single headline. Clustering is an unsupervised learning scenario, one that does not involve the use of prior supervision or labels about the assignment of individual instances to various groups.

Mathematically, we can express the clustering problem as follows: Consider observations represented as vectors, for example \( \mathbf{x} \in \mathbb{R}^N \) — vectors consisting of \( N \) features, \( \mathbf{x} = [x_1, x_2, \ldots, x_N] \). A collection of such observations is provided in the unlabeled set \( \mathbb{U} = \{\mathbf{x}_1, \ldots, \mathbf{x}_U\} \). The goal of clustering is to automatically infer groups of examples \( G_1, \ldots, G_M \) such that the examples belonging to a single group, say \( G_i \), are similar in some sense. The groups \( G_1, \ldots, G_M \) are not pre-defined. They are automatically discovered as part of the clustering process.

Below, we list some of the most popular approaches to clustering. Follow the corresponding links to study the specifics of a clustering algorithm in further detail.

  • Gaussian mixture models
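As a sketch of the Gaussian mixture model listed above, the data is modeled as a weighted combination of \( K \) Gaussian components, and each example is then assigned to the component with the highest posterior responsibility:

\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \sum_{k=1}^{K} \pi_k = 1. \]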

Density estimation

Density estimation is the task of modeling a probability density function for some feature space to facilitate the estimation of density of any given input instance. For example, given observations of past confirmed locations of underground reserves, a density estimator may be learnt that can provide the likelihood of an oil reserve at any given input location conditioned on the historical observations. Just like clustering, density estimation is also a form of unsupervised learning approach in machine learning.

Mathematically, we can express the density estimation problem as follows: Consider observations represented as vectors, for example \( \mathbf{x} \in \mathbb{R}^N \) — vectors consisting of \( N \) features, \( \mathbf{x} = [x_1, x_2, \ldots, x_N] \). A collection of such observations is provided in the unlabeled set \( \mathbb{U} = \{\mathbf{x}_1, \ldots, \mathbf{x}_U\} \). The goal of density estimation is to automatically infer the probability density function that aligns best with the observations in \( \mathbb{U} \), so that \( p(\mathbf{x}) \) can be estimated accurately for any instance \( \mathbf{x} \) in that feature space.

Below, we list some of the most popular approaches to density estimation. Follow the corresponding links to study the specifics of a density estimation algorithm in further detail.

  • Kernel density estimation
  • Bayesian networks
  • Variational autoencoder
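For example, kernel density estimation (listed above) places a kernel \( K(\cdot) \) with bandwidth \( h \) on each of the \( U \) unlabeled observations and averages them:

\[ \hat{p}(\mathbf{x}) = \frac{1}{U h^{N}} \sum_{i=1}^{U} K\!\left( \frac{\mathbf{x} - \mathbf{x}_i}{h} \right). \]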

Dimensionality reduction

Dimensionality reduction, as the name implies, involves transforming a multivariate input instance to an output instance with fewer dimensions than the input, while retaining task-dependent relevant information in the instance. For example, we may wish to reduce a 10-dimensional dataset to a 2-dimensional dataset for easy visualization as a scatter plot, while retaining, say, the natural groupings among the input instances, even in the two-dimensional space. Dimensionality reduction may involve an unsupervised or a supervised learning strategy, depending on whether the reduced dimensions are arrived at by being informed with some categorical labels.

Mathematically, we can express the dimensionality reduction problem as follows: Consider observations represented as vectors, for example \( \mathbf{x} \in \mathbb{R}^N \) — vectors consisting of \( N \) features, \( \mathbf{x} = [x_1, x_2, \ldots, x_N] \). In dimensionality reduction, we wish to transform these into output vectors with \( M \) dimensions, \( \mathbf{y} \in \mathbb{R}^M \), such that \( M \ll N \).

Below, we list some of the most popular approaches to dimensionality reduction. Follow the corresponding links to study the specifics of a dimensionality reduction algorithm in further detail.

  • Principal component analysis
  • Autoencoder

In addition to these, the approaches to clustering , that we saw earlier, can also be considered as dimensionality reduction approaches where the examples are reduced to a single dimension!
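For instance, principal component analysis computes the reduced representation as a linear projection onto the top \( M \) eigenvectors \( \mathbf{w}_1, \ldots, \mathbf{w}_M \) of the data covariance matrix, collected as the columns of \( \mathbf{W} \in \mathbb{R}^{N \times M} \):

\[ \mathbf{y} = \mathbf{W}^\top (\mathbf{x} - \boldsymbol{\mu}), \]

where \( \boldsymbol{\mu} \) is the mean of the observations.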



One of CS230's main goals is to prepare you to apply machine learning algorithms to real-world tasks, or to leave you well-qualified to start machine learning or AI research. The final project is intended to start you in these directions.

Instructors


Time and Location

Wednesday 9:30AM-11:20AM Zoom

Getting Started

Project Starter Package

The teaching team has put together:

  • A github repository with project code examples, including a computer vision and a natural language processing example (both in Tensorflow and Pytorch).
  • A series of posts to help you familiarize yourself with the project code examples, get ideas on how to structure your deep learning project code, and set up AWS. The code examples posted are optional and are only meant to help you with your final project. The code can be reused in your projects, but the examples presented are not complex enough to meet the expectations of a quarterly project.
  • A sheet of resources to get started with project ideas in several topics.

Project Topics

This quarter in CS230, you will learn about a wide range of deep learning applications. Part of the learning will be online, during in-class lectures and when completing assignments, but you will really experience hands-on work in your final project. We would like you to choose wisely a project that fits your interests. One that would be both motivating and technically challenging.

Most students do one of three kinds of projects:

  • Application project. This is by far the most common: Pick an application that interests you, and explore how best to apply learning algorithms to solve it.
  • Algorithmic project. Pick a problem or family of problems, and develop a new learning algorithm, or a novel variant of an existing algorithm, to solve it.
  • Theoretical project. Prove some interesting/non-trivial properties of a new or an existing learning algorithm. (This is often quite difficult, and so very few, if any, projects will be purely theoretical.) Some projects will also combine elements of applications and algorithms.

Many fantastic class projects come from students picking either an application area that they’re interested in, or picking some subfield of machine learning that they want to explore more. So, pick something that you can get excited and passionate about! Be brave rather than timid, and do feel free to propose ambitious things that you’re excited about. (Just be sure to ask us for help if you’re uncertain how to best get started.) Alternatively, if you’re already working on a research or industry project that deep learning might apply to, then you may already have a great project idea.

Project Hints

A very good CS230 project will be a publishable or nearly-publishable piece of work. Each year, a number of students continue working on their projects after completing CS230, submitting their work to a conference or journal. Thus, for inspiration, you might also look at some recent deep learning research papers. Two of the main machine learning conferences are ICML and NeurIPS . Looking at class projects from previous years of CS230 ( Fall 2017 , Winter 2018 , Spring 2018 , Fall 2018 ) and from other machine learning/deep learning classes ( CS229 , CS229A , CS221 , CS224N , CS231N ) is also a good way to get ideas. Finally, we crowdsourced and curated a list of ideas that you can view here , and an older one here , and a third (requires Stanford login).

Once you have identified a topic of interest, it can be useful to look up existing research on relevant topics by searching for related keywords on an academic search engine such as http://scholar.google.com . Another important aspect of designing your project is to identify one or several datasets suitable for your topic of interest. If that data needs considerable pre-processing to suit your task, or if you intend to collect the needed data yourself, keep in mind that this is only one part of the expected project work and can often take considerable time. We still expect a solid methodology and discussion of results, so pace your project accordingly.

Notes on a few specific types of projects:

  • Computation power. Amazon Web Services is sponsoring the CS230 projects by providing you with GPU credits to run your experiments! We will send updates on how to retrieve your GPU credits. Alternatively, Google Cloud and Microsoft Azure offer free academic credits which you can apply for.
  • Preprocessed datasets. While we don’t want you to have to spend much time collecting raw data, the process of inspecting and visualizing the data, trying out different types of preprocessing, and doing error analysis is often an important part of machine learning. Hence, if you choose to use pre-prepared datasets (e.g. from Kaggle, the UCI machine learning repository, etc.), we encourage you to do some data exploration and analysis to get familiar with the problem.
  • Replicating results. Replicating the results in a paper can be a good way to learn. However, we ask that, instead of just replicating a paper, you also try using the technique on another application, or do some analysis of how each component of the model contributes to final performance.

Project Deliverables

This section contains the detailed instructions for the different parts of your project.

Groups: The project is done in groups of 1-3 people; teams are formed by students.

Submission: We will be using Gradescope for submission of all four parts of the final project. We’ll announce when submissions are open for each part. You should submit on Gradescope as a group: that is, for each part, please make one submission for your entire project group and tag your team members.

Evaluation: We will not be disclosing the breakdown of the 40% that the final project is worth amongst the different parts, but the video and final report will combine to be the majority of the grade. Attendance and participation during your TA meetings will also be considered. Projects will be evaluated based on:

  • The technical quality of the work. (I.e., Does the technical material make sense? Are the things tried reasonable? Are the proposed algorithms or applications clever and interesting? Do the authors convey novel insight about the problem and/or algorithms?)
  • Significance. (Did the authors choose an interesting or a “real” problem to work on, or only a small “toy” problem? Is this work likely to be useful and/or have impact?)
  • The novelty of the work. (Is this project applying a common technique to a well-studied problem, or is the problem or method relatively unexplored?)

In order to highlight these components, it is important you present a solid discussion regarding the learnings from the development of your method, and summarizing how your work compares to existing approaches.

Deadline: April 19, Wednesday 11:59 PM

First, make sure to submit the following Google form so that we can match you to a TA mentor. In the form you will have to provide your project title, team members and relevant research area(s).

In the project proposal, you’ll pick a project idea to work on early and receive feedback from the TAs. If your proposed project will be done jointly with a different class’ project, you should obtain approval from the other instructor and approval from us. Please come to the project office hours to discuss with us if you would like to do a joint project. You should submit your proposals on Gradescope. All students should already be added to the course page on Gradescope via your SUNet IDs. If you are not, please create a private post on Ed and we will give you access to Gradescope.

In the proposal, below your project title, include the project category. The category can be one of:

  • Computer Vision
  • Natural Language Processing
  • Generative Modeling
  • Speech Recognition
  • Reinforcement Learning
  • Others (Please specify!)

Your project proposal should include the following information:

  • What is the problem that you will be investigating? Why is it interesting?
  • What are the challenges of this project?
  • What dataset are you using? How do you plan to collect it?
  • What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations?
  • What reading will you examine to provide context and background? If relevant, what papers do you refer to?
  • How will you evaluate your results? Qualitatively, what kind of results do you expect (e.g. plots or figures)? Quantitatively, what kind of analysis will you use to evaluate and/or compare your results (e.g. what performance metrics or statistical tests)?

Presenting pointers to one relevant dataset and one example of prior research on the topic is a valuable (optional) addition. We link one past example of a good project proposal here and a LaTeX template .

Deadline: May 19, Friday 11:59 PM

The milestone will help you make sure you’re on track, and should describe what you’ve accomplished so far, and very briefly say what else you plan to do. You should write it as if it’s an “early draft” of what will turn into your final project. You can write it as if you’re writing the first few pages of your final project report, so that you can re-use most of the milestone text in your final report. Please write the milestone (and final report) keeping in mind that the intended audience is Profs. Ng and Katanforoosh and the TAs. Thus, for example, you should not spend two pages explaining what logistic regression is. Your milestone should include the full names of all your team members and state the full title of your project.

Note: We will expect your final writeup to be on the same topic as your milestone.

In order to help you the most, we expect you to submit your running code. Your code should contain a baseline model for your application. Along with your baseline model, you are welcome to submit additional parts of your code such as data pre-processing, data augmentation, accuracy metric(s), and/or other models you have tried. Please clean your code before submitting, comment on it, and cite any resources you used. Please do not submit your dataset . However, you may include a few samples of your data in the report if you wish.

Submission Deadline: June 7, Wednesday 11:59 PM (No late days allowed)

Your video is required to be a 3-4 minute summary of your work. There is a hard limit of 4 minutes, and TAs will not watch a video beyond the 4 minute mark. Include diagrams, figures and charts to illustrate the highlights of your work. The video needs to be visually appealing, but also illustrate technical details of your project.

If possible, try to come up with creative visualizations of your project. These could include:

  • System diagrams
  • More detailed examples of data that don’t fit in the space of your report
  • Live demonstrations for end-to-end systems

We recommend searching for conference presentation sessions (AAAI, NeurIPS, ECCV, ICML, ICLR, etc.) and following those formats.

You can find a sample video from a previous iteration of the class here

Final Report

Deadline: June 7, Wednesday 11:59 PM (No late days allowed)

The final report should contain a comprehensive account of your project. We expect the report to be thorough, yet concise. Broadly, we will be looking for the following:

  • Good motivation for the project and an explanation of the problem statement
  • A description of the data
  • Any hyperparameter and architecture choices that were explored
  • Presentation of results
  • Analysis of results
  • Any insights and discussions relevant to the project

After the class, we will post all the final writeups online so that you can read about each other’s work. If you do not want your write-up to be posted online, then please create a private Piazza post.


Real-Time Task Assignment Approach Leveraging Reinforcement Learning with Evolution Strategies for Long-Term Latency Minimization in Fog Computing

1 Department of Information Communication, Materials, and Chemistry Convergence, Soongsil University, Seoul 06978, Korea

Nhu-Ngoc Dao

2 School of Computer Science and Engineering, Chung-Ang University, Seoul 06974, Korea

Abstract

The emerging fog computing technology is characterized by an ultralow latency response, which benefits a massive number of time-sensitive services and applications in the Internet of things (IoT) era. To this end, the fog computing infrastructure must minimize latencies for both service delivery and execution phases. While the transmission latency significantly depends on external factors (e.g., channel bandwidth, communication resources, and interferences), the computation latency can be considered an internal issue that the fog computing infrastructure can actively self-handle. From this viewpoint, we propose a reinforcement learning approach that utilizes evolution strategies for real-time task assignment among fog servers to minimize the total computation latency during a long-term period. Experimental results demonstrate that the proposed approach reduces the latency by approximately 16.1% compared to the existing methods. Additionally, the proposed learning algorithm has low computational complexity and an effectively parallel operation; therefore, it is especially appropriate to be implemented in modern heterogeneous computing platforms.

1. Introduction

Fog computing was developed to act as an intermediate between a remote cloud computing environment and Internet of Things (IoT) devices. It is a novel architecture that extends the cloud to the edge of the network [ 1 , 2 ]. In fog computing, latency-sensitive tasks can be executed at the fog servers, near the devices, while delay-tolerant and computationally intensive applications can be offloaded to the cloud. Fog computing also provides additional advantages such as the ability of processing applications at specific locations. Owing to these advantages, the fog computing infrastructure is increasingly utilized for handling real-time IoT services and applications [ 3 , 4 , 5 ].

While fog computing deployment provides substantial benefit over cloud computing, it exposes a critical challenge in terms of task assignment problem [ 6 , 7 ]. If tasks are not assigned to suitable servers, some servers may suffer from a burden in processing while others with rich resources relax [ 8 ]. Particularly, the imbalance in resource utilization is heightened in scenarios where a large number of IoT devices are present. Consequently, efficient task assignment techniques in real-time are inevitable for fog networks, especially over a long-term period to achieve system stability.

To overcome the aforementioned issues, we propose a real-time task assignment approach that leverages reinforcement learning (RL) [ 9 , 10 , 11 ] with an evolution strategies (ES) training method [ 12 , 13 ] for long-term latency minimization in fog computing. In this approach, a central scheduler that performs task assignment among fog servers treats the fog computing infrastructure as a trainable neural network (NN) [ 14 , 15 ]. In the NN, the fog computing resources, the remaining tasks in the buffers of the fog servers, and the demand of the offloaded task make up the various states of the system. Because the number of system states is extremely large, the ES algorithm is utilized for the learning operations to quickly optimize long-term latency minimization as the training reward.

We experiment with task assignment in the fog computing system of a factory, where IoT tasks arrive frequently but with random noise interference. The system includes 200 IoT devices and 10 fog servers with different task-processing capabilities. Although the complexity of the real-time task assignment problem is high, the proposed reinforcement learning model with the evolution strategies algorithm achieves the objective of optimizing long-term latency and attains a 16.1% higher reward than the greedy method, which is the baseline in real-time task assignment. The contributions of this study are as follows.

  • We propose a reinforcement learning model for the real-time task assignment in fog networks with the objective of minimizing long-term latency. The method for crafting states of the system is novel and is an important contribution to the success of the model.
  • We propose the evolution strategies as a learning method for the reinforcement learning model for optimizing the server selection function, i.e., the trainable neural network. The algorithm has low computational complexity and simplicity in implementation. Additionally, the algorithm is remarkably parallel due to the independence in evaluation of its children. Therefore, it is suitable for modern computers with parallel CPUs.
  • We prove by comprehensive experiments that the proposed model is scalable when the system escalates the number of IoT devices or the number of fog servers. The model attains 15.3% higher reward than the greedy method in a system with 100 IoT devices and five fog servers; and 16.1% with 200 IoT devices and 10 fog servers.

The rest of the paper is organized as follows. Section 2 surveys work related to the fog computing paradigm and real-time task assignment problem. In Section 3 , we propose the reinforcement learning model and how to craft the states of the system. Section 4 introduces the evolution strategies as a learning algorithm for the proposed model. Section 5 shows the experiment results for proving the efficiency of the proposed model and exploring the effectiveness of the parameters of the model. Section 6 concludes the study.

2. Related Work

Literature reviews [ 16 , 17 ] revealed great contributions from the research community toward improving fog computing performance in terms of latency, energy consumption, resource utilization, service availability, their variants, and hybrid solutions [ 2 , 18 , 19 , 20 , 21 , 22 ]. For the latency minimization objective, however, the state-of-the-art solutions mainly considered fog computing on the basis of time intervals. Meanwhile, effective real-time operation, which requires a reaction immediately after the tasks arrive at the fog computing infrastructure, has not yet been significantly taken into account.

For instance, Chamola et al. [ 23 ] proposed a load balancing scheme among fog servers for minimizing the total computation latency in the entire system during an observed period. The developed latency optimization function was relaxed into a convex problem to make it resolvable. Although a total latency reduction is achieved, this work did not consider the offloaded task complexity, which has significant effects on fog computing performance. Moreover, the algorithmic complexity restricts the scheme from being applied to a large-scale environment. In [ 24 ], a pattern-identified online task scheduling (PIOTS) mechanism was proposed for industrial IoT services in multitier fog computing systems. The PIOTS mechanism learned task arrivals from historical task data using the self-organizing map (SOM) technique and then applied the Hungarian assignment method to the identified task patterns. Based on that, incoming tasks arriving at the fog computing system are matched to these patterns and handled by the corresponding assignment policies to minimize the total system latency. However, the dimension of the SOM is constant; therefore, the PIOTS mechanism does not flexibly adapt to time-varying task arrivals. In [ 25 ], Ali et al. considered the latency-aware cloudlet selection in a fog network as a many-to-one matching game, where the IoT devices and cloudlets rank each other for latency minimization. A distributed and self-organizing method was proposed for solving the matching game to obtain the objective function. However, the latency in this work was considered per timeslot of system operation, which is not a truly real-time environment. On the contrary, Dao et al. [ 26 ] proposed an adaptive resource balancing (ARB) scheme that migrates user services among fog servers (located at fog radio access nodes). The ARB scheme orchestrates the workload in the entire system using the backpressure algorithm so as to minimize the computation latency (i.e., maximize serviceability). Similar to the aforementioned studies, the ARB scheme has limitations due to its timeslot-based optimization approach. Optimization over time intervals requires that tasks generated by IoT devices be divided into fragments with a fixed size to be uploaded in time intervals, which is not a trivial process [ 1 ]. Additionally, the approach assumes that the servers complete their assignments and their buffers are cleared after each time period, which is not practical.

By contrast, the greedy method is commonly used when real-time processing is considered. The operation of this method is illustrated in Figure 1 . At t_0 ( Figure 1 a), some servers are processing tasks. When a task is uploaded at t_1 ( Figure 1 b), some tasks have been completed and there are tasks remaining from the previous task list in the buffers. The method examines the demand of the task and the status of the buffers, and selects a server with the objective of minimizing latency at that moment. In Figure 1 , server Fog3 has the lowest latency in processing the remaining tasks plus the new task, so the new task is assigned to that server. Although the greedy method is more realistic than the time-interval approach, it does not work toward long-term latency optimization. Section 3.1 shows an extreme case in which the greedy method does not achieve the minimal latency in the long term, which proves that the method is not sufficient for real-time task assignment.

Figure 1. Real-time task assignment considering buffer status of fog servers.

In summary, most related works lack a truly real-time formulation and flexible adaptation to task arrivals. Therefore, an effective solution that addresses these problems is crucial for fog computing systems.

3. System Model

3.1. Real-Time Task Assignment Problem

In the real-time task assignment problem, we define the system as a fog network that includes IoT devices that generate tasks, fog servers with various capabilities for processing tasks, and a task assignment module that chooses the servers on which the tasks are executed. Tasks are uploaded to the system in real time, which means the time interval between two consecutive tasks lies in the range [0,   ∞ ]. Each fog server has a buffer of unlimited capacity for remaining tasks. For simplicity, we assume that an IoT device always generates tasks with the same size and complexity. In this study, we consider a factory scenario in which tasks uploaded by IoT devices are frequent but occasionally perturbed by random noise. As many devices continuously upload tasks, the traffic reaching the fog servers is extremely noisy and the complexity of the problem is high. Table 1 lists the terms used in this study.

Table 1. Explanation of terms used in the study.

The total latency consists of propagation, execution, and buffering. To remain applicable to various networking environments, we model IoT task arrivals as a random, independent process on the communication channels between the IoT devices and the fog servers; accordingly, the propagation latency is omitted from the scope of this study. Let 〈 s i ,  c i ,  τ i 〉 denote the three-dimensional characteristic vector of the i -th task, where s i , c i , and τ i are the size, complexity, and latency threshold of the task, respectively. In addition, let f j and b j denote the CPU frequency and current buffer size of the j -th fog server, respectively. Assuming that the i -th task is assigned to the j -th fog server, the latency of the i -th task is given by
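A plausible form of this latency, assuming the buffer size b j is expressed in cycles, is

L_{ij} = \frac{b_j + s_i c_i}{f_j},

i.e., the time to drain the current backlog plus the time to execute the new task; this reconstruction is an assumption based on the definitions above.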

Based on the system described above, we define the real-time task assignment problem as selecting a fog server for each task so as to minimize the computation latency of the system over its operational time, which we refer to as long-term latency optimization. To express this problem mathematically, let x i j denote the indicator that the i -th task is assigned to the j -th fog server. The latency minimization function at timeslot t is defined by
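A plausible form, under the assumption that x i j ∈ {0, 1} and each task is assigned to exactly one server, is

f(t) = \min_{\{x_{ij}\}} \sum_{i \in \Omega(t)} \sum_{j \in \Psi} x_{ij} L_{ij},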

where Ω( t ) and Ψ are the sets of the IoT tasks and fog servers, respectively. Therefore, the long-term latency minimization function (ℱ) is given by
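A plausible form, summing the per-timeslot objective over the operational time T, is

\mathcal{F} = \min \sum_{t=0}^{T} f(t);

this reconstruction is an assumption based on the surrounding description, not the paper's exact notation.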

To clearly demonstrate the problem, we consider the extreme case in Figure 2 with only two fog servers, Fog1 and Fog2, and three tasks uploaded in order. In addition, the buffers of the fog servers are assumed to operate with a first-in-first-out (FIFO) policy. The heart of the system is the task assignment module, also known as the task scheduler, which decides on which servers the tasks are executed. At t 1 , Task1 is uploaded and assigned to Fog1. The dark bar represents the time span Fog1 needs to process Task1. The computational latency of the system is the maximum latency among the fog servers to completely process the uploaded tasks. At t 2 , Task2 is uploaded and the module chooses Fog2 to process it. The module makes its decision using the greedy method, which minimizes the latency of the system at the moment a task is uploaded. It is worth noting that at t 2 , part of Task1 has been completed by Fog1 and the remainder is in the server's buffer; this is indicated by a bar with a dash-dotted stroke. Following the greedy policy, Task3 is uploaded and assigned to Fog1, since the current buffer of Fog2 is larger than Fog1's at t 3 . Consequently, the buffer of Fog1 contains both the remainder of Task1 and the new Task3, whereas the buffer of Fog2 contains the remainder of Task2. At that moment, the system latency is determined by Fog1.

Figure 2. Task assignment overview.

In the extreme case above, we used the greedy method for task assignment. We now present an example in Table 2 a to explain the method in detail. As shown in the table, the expected latency is the time a fog server needs to solve a task; the Fog(n) latency is the time the server Fog(n) needs to process all remaining tasks in its buffer. The capability of Fog1 is 2 GHz, i.e., the server can process two gigacycles of tasks per second. Right before the 0 ms time point, all buffers are empty and the system latency is 0. Task1 is uploaded at the 0 ms time point. Processing the task requires 1 Mbit × 10 cycles/bit = 10 megacycles. Therefore, Fog1 and Fog2 need 5 ms and 10 ms for the task, respectively. In real-time task assignment, the greedy method chooses a fog server with the objective of minimizing the system latency at the moment the task is uploaded. Thereby, Task1 is assigned to Fog1. At this moment, Fog1 and Fog2 need 5 ms and 0 ms, respectively, to finish all the tasks in their buffers, so the system latency is 5 ms.

Table 2. An example of real-time task assignment.

At the 2 ms time point, Fog1 needs 3 ms to finish the remaining tasks in its buffer. Task2, which requires 3.5 ms on Fog1 and 7 ms on Fog2, is then uploaded. If Task2 is assigned to Fog1, the server needs 3 ms + 3.5 ms = 6.5 ms to finish its tasks; if Task2 is assigned to Fog2, the server needs 7 ms. Following the greedy method, Task2 is assigned to Fog1, and its latency is 6.5 ms. Task3 is uploaded immediately afterwards; it requires 4 ms on Fog1 and 8 ms on Fog2. If Task3 is assigned to Fog1, the latency of Fog1 becomes 6.5 ms + 4 ms = 10.5 ms; if Task3 is assigned to Fog2, the latency of Fog2 is 8 ms. Therefore, Task3 is assigned to Fog2 and the system latency becomes 8 ms, corresponding to the latency of Fog2.

However, the greedy method is not suitable for long-term latency optimization. In the example above, if we focus on long-term latency, a better solution to the task assignment problem is shown in Table 2 b. In the table, when Task2 is assigned to Fog2, the system latency is 7 ms, an increase over the 6.5 ms of the greedy method. However, Task3 is then assigned to Fog1, which keeps the system latency at 7 ms. As a result of this change in task assignment, the system latency right after the 2 ms time point is reduced (7 ms compared with 8 ms when the greedy method is applied). Intuitively, given the state of the system, which includes the task demand (size and complexity) and the status of the server buffers, the action of assigning a task to a server changes the state of the system at that moment, and a reward is returned, e.g., the inverse of the latency. Since the uploaded tasks are not totally random but largely regular with occasional noise, there should exist a reward pattern for given system states. Therefore, in this study, we utilize reinforcement learning to exploit the state-reward pattern and minimize the latency of the system in real-time task assignment.
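To make the comparison concrete, the following snippet reproduces the arithmetic of Table 2 with the numbers used above; the variable names are illustrative.

```python
# Reproduce the Table 2 comparison. Buffers are expressed as the time (ms)
# each server still needs at t = 2 ms; task costs are per-server processing times.
fog1_backlog, fog2_backlog = 3.0, 0.0          # ms remaining at t = 2 ms
task2 = {"Fog1": 3.5, "Fog2": 7.0}             # processing time of Task2 (ms)
task3 = {"Fog1": 4.0, "Fog2": 8.0}             # processing time of Task3 (ms)

# Greedy: Task2 -> Fog1 (6.5 ms < 7 ms), then Task3 -> Fog2 (8 ms < 10.5 ms)
greedy_latency = max(fog1_backlog + task2["Fog1"], fog2_backlog + task3["Fog2"])

# Alternative: Task2 -> Fog2, Task3 -> Fog1
alt_latency = max(fog1_backlog + task3["Fog1"], fog2_backlog + task2["Fog2"])

print(greedy_latency, alt_latency)  # 8.0 vs 7.0 -> the alternative is better long-term
```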

3.2. Reinforcement Learning Model

Reinforcement learning (RL) is a class of machine learning, alongside supervised learning and unsupervised learning [ 15 , 27 ]. The objective of an RL problem is the automation and control of a system so that it adapts to an unknown environment [ 9 , 28 , 29 , 30 ]. In our problem, the training environment is the system consisting of the fog servers and their buffers. It is worth noting that the environment is consistent, i.e., for each condition it expresses a unique state. In other words, a state represents the environment at the moment we observe it. At the center of the RL model is the action selection function, briefly, the action function (i.e., the task assignment module in the central scheduler). The function selects actions based on the states of the system. Each time the system performs an action, the condition of the environment changes and a new state is expressed. A reward is also given to the system to indicate its adaptation. Thereby, the objective of the model is to maximize the rewards received, i.e., to maximize the adaptation of the system to the environment. In turn, the action function has to reinforce itself to improve its ability to choose actions efficiently by harnessing the rewards. When a new state is expressed, the learning loop continues and the action selection function keeps being reinforced. In the proposed model, the action selection function is a trainable neural network [ 14 ] and its learning rule follows the algorithm described in Section 4 .

In this paper, we construct a novel approach to crafting states of the system as follows. Given the system presented in Section 3.1 , at the moment a task is uploaded (but not yet assigned), several quantities can express the state of the system: (1) the demand (i.e., size and complexity) of the task, (2) the remaining tasks in the buffers of the servers, (3) the time span from the last task upload to the current moment, and (4) the chain of demands of the most recent tasks. As the third factor is affected by noise and the last factor would burden the computers used to train the model, only the first and second factors are chosen to define the state. We craft a state of the system from these two factors as follows. At the moment a task is uploaded, there are remaining tasks in the buffers of the servers. First, we measure the time each server needs to complete its remaining tasks, i.e., the computation latency of the servers, and store the values in a vector of size [ n  × 1]. Next, we calculate the time each server would spend on the arrived task if it were assigned to that server, and store the values in another vector of the same size [ n  × 1]. Concatenating these two vectors, we obtain a vector of size [2 n  × 1] that represents the state of the system at the given moment.
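A minimal sketch of this state-crafting step, assuming each server is described by its CPU frequency and its backlog in cycles (names are illustrative), is given below.

```python
import numpy as np

# Sketch of state crafting: concatenate (1) the time each server needs to clear
# its current backlog and (2) the time it would spend on the newly arrived task.
def craft_state(task_size_bits, task_complexity, servers):
    """servers: list of dicts with 'freq_hz' and 'backlog_cycles'. Returns a [2n x 1] vector."""
    demand = task_size_bits * task_complexity
    backlog_latency = np.array([s["backlog_cycles"] / s["freq_hz"] for s in servers])
    task_latency = np.array([demand / s["freq_hz"] for s in servers])
    return np.concatenate([backlog_latency, task_latency]).reshape(-1, 1)
```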

Figure 3 illustrates a state in the experiments discussed in Section 5 . In the figure, there are three tables. The first table presents five fog servers and their frequencies. The second table lists the remaining tasks in the buffers of the servers at the moment a task is uploaded. We measure the computation latency of the servers for completing the remaining tasks in microseconds (μs). The task requires 1 megacycle for its completion, which leads to a variation in the expected latency among the servers. Based on the latency in processing the remaining tasks, and the expected latency in completion of the task, we craft the state of the system at this moment as a vector of size [10 × 1].

Figure 3. An example of a state.

3.3. Action Selection Function

The action selection function in the RL model is a trainable machine learning (ML) function that reinforces its ability to select actions through rewards. Among the various ML functions applied to RL models, the neural network (NN) is the most popular [ 13 , 14 , 15 ]. Since the NN is a universal function approximator, it can fit various types of RL problems well [ 31 ]. Additionally, the combination of RL and NNs has demonstrated the ability to surpass human-level performance in applications such as the game of Go [ 32 ]. Therefore, an NN is chosen as the action selection function in the proposed RL model.

An example of the NN, which has three layers, is shown in Figure 4 . The state of the system is the input to the NN. Since a state has size [2 n  × 1], the input layer of the NN has 2 n nodes, denoted x ( i ) ,  i = {1, …, 2 n }. All nodes in the input layer connect to all nodes in the hidden layer. Given that the hidden layer has m nodes, denoted h ( j ) ,  j = {1, …,  m }, there are [2 n  ×  m ] connections between the input and hidden layers. Since each connection has a weight, a matrix W (1) stores all of these weights. The weight W i , j ( 1 ) at row i and column j represents the connection between the nodes x ( i ) and h ( j ) . The value of a node h ( j ) in the hidden layer is the sum of the products of the corresponding weights and inputs.

The number of nodes in the hidden layer can affect the training process; therefore, it is carefully explored in the experiments in Section 5 . All nodes in the hidden layer are connected to the softmax layer, which is also the output layer of the NN. The size of the output layer is [ n  × 1]. A matrix W (2) of size [ m  ×  n ] stores all weights of the connections between the two layers. Thereby, the value of a node y ^ ( k ) in the last layer is calculated as follows.

After the values of all nodes are calculated, the probability that a fog server Fog(i), i = 1,  ⋯ ,  n , is selected is derived by the softmax function as follows.

The server that has the highest probability is chosen for assigning the uploaded task.
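A minimal NumPy sketch of this forward pass, assuming W (1) has shape [2 n  ×  m ] and W (2) has shape [ m  ×  n ], and assuming (as in the description above) that the hidden values are plain weighted sums with no bias or extra nonlinearity:

```python
import numpy as np

# Forward pass of the action selection network described above. W1: [2n x m],
# W2: [m x n]. Softmax turns the output values into server selection probabilities.
def select_server(state, W1, W2):
    x = state.reshape(-1)            # [2n] input values
    h = x @ W1                       # [m] hidden values (weighted sums)
    y = h @ W2                       # [n] output values
    exp_y = np.exp(y - y.max())      # numerically stable softmax
    probs = exp_y / exp_y.sum()
    return int(np.argmax(probs)), probs  # server with the highest probability
```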

Figure 4. Neural network in reinforcement learning.

The NN is trained by updating its weight matrices, i.e., W (1) and W (2) , to maximize the feedback from the environment. A popular method for updating the weight matrices is backpropagation, which has proven effective in several RL applications such as the game of Go [ 14 , 33 , 34 ]. However, the method requires the feedback to be scaled to discrete levels of {0, 1}, which is not efficient in this problem, where the latency is a floating-point value. Therefore, in this study, we choose a neural network evolution algorithm [ 12 ], a rival of the backpropagation algorithm, for training the NN. The algorithm has been successfully applied to RL problems by OpenAI and Uber AI Labs [ 13 , 35 ].

4. Evolution Strategies

To train an ML model, we define an objective function that measures how well the model performs on a problem and optimize the model based on this function. Given that RL is used to solve the task assignment problem, our objective is to choose actions that minimize the long-term latency of the system. However, RL is trained to maximize rewards from the system. Consequently, the reward is defined as an inverse of the system latency. More precisely, the reward from the system after choosing an action a ( t ) is defined as follows.
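A plausible form, consistent with the definition of L ( t ) below, is

Reward(t) = \frac{1}{L(t)};

this reconstruction is an assumption based on the surrounding description.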

where L ( t ) =  L ( t  − 1) +  L i j ( t ), L (0) = 0, and L i j ( t ) is the latency generated by the action a ( t ), which assigns the incoming ( i -th) task to the j -th fog server. It can be seen that when t  →  ∞ , L ( t ) →  ∞  and the Reward  → 0. In coordination with the function ℱ in Equation ( 6 ), the objective of minimizing the long-term latency of the system can be relaxed into minimizing the latency over n consecutive tasks, which in turn becomes the objective of maximizing the average reward over the n most recent actions. Therefore, the Reward function is given by
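A plausible form, averaging the instantaneous rewards over the n most recent actions, is

Reward = \frac{1}{n} \sum_{k=t-n+1}^{t} \frac{1}{L(k)};

again, this is a reconstruction from the surrounding description rather than the paper's exact formula.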

To optimize the RL model according to the rewards, we update the NN to enhance the model's ability to choose actions for task assignment.

Backpropagation is the most popular algorithm for updating an NN. The algorithm calculates the derivatives of the objective function with respect to the weights of the NN and updates the network toward maximizing the objective. However, in our problem, if we update the NN according to the current action but not future actions, we cannot attain long-term optimization. Some backpropagation-based paradigms have been proposed for optimizing the long-term reward (e.g., Deep Q-Learning [ 36 ]). In practice, however, such paradigms only work well when the reward received from the environment is either 1 or 0 (i.e., winning or losing the game). This drawback hinders these algorithms in the real-time task assignment problem, where the reward from the environment (i.e., the inverse of the system latency) takes arbitrary floating-point values.

Neuroevolution (NE), i.e., neural network evolution, is another approach for training neural networks, inspired by biological evolution [ 12 ]. In nature, evolution begins when parents produce offspring with random deviations. Among the children, those that fit the environment have a better opportunity to survive and pass on their genomes. As a result of this selection, the next generation is better fitted to the environment. The concept of NE is similar to evolution in nature. Given an NN, in each iteration a new generation is produced from the NN, consisting of perturbed copies of it. The children with the highest rewards are favored, and the NN is updated based on the rewards. The update of the NN is carried out by evolution strategies (ES), the most well-known algorithm that applies the NE approach [ 13 , 35 ].

Algorithm 1 describes the process of updating the NN by ES. In each iteration, m children of the NN are produced by adding Gaussian noise to each weight in the network. Each child NN acts as the task assignment module in the RL model for n consecutive tasks and receives an average reward over the n actions. Since the average reward is the system's feedback to the actions chosen by the child, it is also the fitness of the child to the environment. We calculate the mean reward of the m children and the difference between each child's reward and the mean, which is the child's gain over the root network. If a child has a positive gain, it fits the environment better than the root network and should contribute more to the next generation. Following this idea, the root NN is updated by moving its weights toward the weights of the children in proportion to their gains.

where h and η are the number of children and the learning rate (how fast the weights of the NN are updated), respectively. It is worth noting that since the gain of a child is the difference between its reward and the mean reward of the m children, the gain can be negative. In that case, Equation ( 12 ) discourages the child from contributing to the reproduction of the next generation.

In summary, by adding random noise to copies of the NN, the ES algorithm generates a population of networks in the neighborhood of the NN. In each iteration, the NN moves toward the region that offers high rewards (positive gain) and away from the region that offers negative gain. Over many iterations, the algorithm seeks the region that offers the best reward, i.e., the optimum of the RL model for the task assignment problem.
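The exact update in Equation (12) is not reproduced here; the following is a minimal sketch of one common form of this ES update, in the style of [ 13 ]. The flattened weight vector, the noise scale sigma, and the evaluate helper are assumptions introduced for illustration.

```python
import numpy as np

# Minimal sketch of the ES update (Algorithm 1). Flattened weights `theta` are
# perturbed with Gaussian noise; children are scored by their average reward over
# n consecutive task assignments; the root moves toward high-gain children.
# `evaluate` (average reward of a perturbed network) is assumed to be provided.
def es_update(theta, evaluate, n_children=10, sigma=0.05, lr=0.01):
    noises = [np.random.randn(*theta.shape) for _ in range(n_children)]
    rewards = np.array([evaluate(theta + sigma * eps) for eps in noises])
    gains = rewards - rewards.mean()          # negative gain discourages a child
    step = sum(g * eps for g, eps in zip(gains, noises)) / (n_children * sigma)
    return theta + lr * step
```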

Since the ES algorithm searches for gain in a nearby region during each iteration, it is important to control the deviation of the noise added to the children. If a child is very close to the root NN, the network may become stuck in a small region; if a child is too far from the root NN, the algorithm may skip over the region that contains the solution to the problem. In the experiments in Section 5 , the deviation of the children is tuned empirically. It is worth noting that the ES algorithm does not depend on the derivative of the reward function; hence, it is less prone to getting stuck in local optima than the gradient-based backpropagation algorithm. Furthermore, each child is evaluated independently of the others; therefore, the ES computation is naturally parallel. This makes the ES algorithm efficient on modern computers with many parallel CPUs, whereas deep RL models trained with backpropagation typically update on a single CPU.

5. Experiments

5.1. Experimental Setup

5.1.1. Data Collection

We set up a real-time task assignment system for conducting 11 experiments. There are 100–200 IoT devices in each experiment. All devices are active, with task uploading periods in the range [10, 250] ms. For each device, the probability that a task is uploaded at an abnormal time, i.e., sooner or later than the expected time, is 5%. This abnormal task uploading represents noise interference in the data. A task uploaded at a random time does not violate the order of its predecessor or successor: if a random task is uploaded at t 1 , and its predecessor and successor are uploaded at t 0 and t 2 , respectively, then ( t 1  −  t 0 ) > 0 and ( t 2  −  t 1 ) > 0. With hundreds of IoT devices, 5% random uploads, and each device starting at a different time, the number of possible system states is immense.

For each device, the size and complexity of each uploaded task are the same and do not change during its lifetime. The sizes of tasks are in the range [1, 100] kbits and their complexities are in the range [10, 200] cycles/bit. Consequently, the requirement of a task ranges from 10,000 cycles to 20 megacycles. Since we run the experiments with Python 3.6, the minimum time scale in the experiments is 1 μs. This means that if the uploading times of two tasks are not identical, they differ by at least 1 μs. It is worth noting that this minimum timescale is practical, particularly in factories.

We collect task uploads over one hour and save the tasks in a dataset along with their uploading times. The number of uploaded tasks is over 12 million. The average number of tasks uploaded per second is 2700, and the average task rate per second (the sum of the requirements of all tasks in one second) is 7.68 gigacycles.

5.1.2. Fog Server and System Setup

In the experiments, five fog servers are used for task assignment. We define the capability of a server as the number of cycles the server can execute in one second, which is also the frequency of the server. For instance, a server with a frequency of 1 GHz can process tasks at a rate of 1 gigacycle per second. The total capability of all servers should be higher than the average task rate so that the servers can handle all the uploaded tasks. In the experiments, the five servers have frequencies of {1.2, 1.4, 1.6, 1.8, 2.0} GHz; hence, the average task rate of 7.68 gigacycles per second equals 96% of the total capability of the servers. It is worth noting that the problem of fairly allocating resources among fog servers is out of the scope of this study.

The system is simulated using Python 3.6. In the experiments, we assume that the buffers of the servers are unlimited. Computation latency, i.e., the time span a server needs to execute the remaining tasks in its buffer, is measured at a minimum scale of 1 μs, following the smallest time scale in Python 3.6. It is worth noting that we only consider the computation latency in the fog servers; transmission delay and queueing delay are not covered in this study. The simulation runs on a single computer with an Intel Core i7-4770 3.4 GHz CPU and 16 GB of memory; no GPU is used in the simulation.

5.2. Experimental Method

We conducted 11 independent experiments; at the beginning of each experiment, the buffers of the servers are empty, and the result of one experiment does not affect the others. From the task upload dataset in Section 5.1.1 , we set aside 1 million tasks for testing and use the rest for training the RL model. (Since our objective in real-time task assignment is to optimize long-term latency, whenever we mention n tasks, we mean n consecutive tasks.) Each experiment includes 10,000 iterations, where each iteration consists of a training phase and a testing phase. In the training phase ( Figure 5 a), 600 tasks are randomly chosen for each iteration. The first 500 of the chosen tasks are replayed and assigned to the servers using the greedy method. (When n consecutive tasks are replayed, they exactly follow their original uploading times, i.e., the time spans between task uploads do not change.) The reason we apply the greedy method to the first 500 tasks is to ensure that the buffers are in a normal state: if the buffers were empty, the latency of the system would be low and would not accurately reflect the long-term latency of the system. We store the status of the buffers at the moment right before task 501 is uploaded (which is the first task of the next 100 consecutive tasks).

Figure 5. Training and testing process in the experiments.

We generate 10 children of the model following the method in Section 4 . For each child, the initial state is a combination of the stored buffer status and the first task of the 100 consecutive tasks, crafted into a state as described in Section 3.2 . With each task assignment, the child receives a reward (the inverse of the latency at that moment). The average reward after 100 task assignments is the reward of the child. After all the children have received rewards, we update the system following Algorithm 1. As a consequence of this training, the RL model maximizes the reward of the system over every 100 consecutive tasks.

In the testing phase, we randomly choose 700 consecutive tasks from the testing set. The task assignment method for the first 500 tasks is greedy, as in the training phase. For each of the next 200 tasks, we observe the state of the system, and the trained RL model chooses a server for the task based on that state. The average reward over the 200 task assignments is the reward of the test ( Figure 5 b). We repeat the test five times, and the average reward over the five runs is the reward of the system after the iteration. Each time the reward of the system after an iteration exceeds the maximum reward of the previous iterations, we store the weight matrices of the RL model and discard the previously stored matrices. After 10,000 iterations, the RL model with the stored weight matrices is the model that gives the maximum reward (Algorithm 1).
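Putting the pieces together, a high-level sketch of one iteration might look as follows; the helper functions (replay_greedy, run_child, es_update, evaluate_policy) are assumed placeholders for the steps described above, not code from the original implementation.

```python
# Sketch of one training/testing iteration as described above. Helper functions
# are assumed to implement the greedy warmup, child rollout, ES update (Section 4),
# and policy evaluation, respectively.
def run_iteration(theta, train_tasks, test_tasks, n_children=10):
    # Training phase: warm up buffers with 500 greedily assigned tasks,
    # then score each child on the next 100 consecutive tasks.
    warmup, train_batch = train_tasks[:500], train_tasks[500:600]
    buffers = replay_greedy(warmup)                       # buffer status before task 501
    theta = es_update(theta,
                      evaluate=lambda w: run_child(w, buffers, train_batch),
                      n_children=n_children)

    # Testing phase: 500 greedy warmup tasks, then 200 tasks assigned by the model,
    # averaged over five random draws from the test set.
    rewards = [evaluate_policy(theta, test_tasks, warmup_len=500, eval_len=200)
               for _ in range(5)]
    return theta, sum(rewards) / len(rewards)
```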

The rewards obtained from testing are compared with those of the greedy algorithm. The greedy-based result is calculated as follows: an average reward is computed with a process similar to testing, except that the greedy method is applied to the last 200 tasks ( Figure 5 b). The process is repeated 100 times, and the average reward over the 100 runs is the reward of the system with the greedy method.

5.3. Results

Table 3 summarizes the parameters used in the experiments. The middle column lists all possible values of a parameter, whereas the value in the last column serves not only as the initial value of the parameter but also as an anchor when we explore the effect of other parameters. For instance, when the number of IoT devices varies over a range, the number of fog servers is set to 5. If the initial value of a parameter is marked as fixed, the parameter does not change its value throughout the experiments. A fixed value is the optimal value that contributes most to the results and was discovered by grid search.

Table 3. Simulation parameters.

Based on the parameter values in Table 3 , 11 experiments were conducted. A summary of the results is listed in Table 4 . In each part of the table, the values corresponding to one parameter are highlighted to indicate that we are varying that parameter to discover its effect on the final results; the other parameters are set to their initial values. The only exceptions are the number of IoT devices and the number of fog servers: since these parameters are important in the design of the system, we explore their roles with two different experiments.

Table 4. Experiments.

In experiment 1, the numbers of fog servers and IoT devices are set to five and 100, respectively. We scrutinize the experiment by plotting the results of the 10,000 iterations, as shown in Figure 6 a. In this figure, the thin blue plot denotes the rewards over all iterations, and the dashed red line is the greedy-based result, i.e., the reward obtained with the greedy method. Figure 6 b has the same presentation but shows only the first 1000 iterations. In this experiment, the proposed model overtakes the greedy method after roughly the first 200 iterations and peaks near the 6000-th iteration (magnified area). More specifically, Table 4 indicates that the proposed model outperforms the greedy method by 15.309%.

Figure 6. Reward with 100 IoT devices and five fog servers.

In experiment 2, the number of IoT devices increases to 200; the average task uploading rate is thereby 19.95 gigacycles per second. To maintain the ratio of the average task uploading rate to the total capability of the servers at 0.96 (as in experiment 1), we increase the total capability to 20.8 GHz. The number of servers in this experiment is 10, with processor frequencies ranging from 1.56 GHz to 2.47 GHz. The improvement in experiment 2 is 16.101%, which is noticeably higher than in experiment 1. The reason for this increase with 10 servers is that the RL model has more options for assigning the uploaded tasks, i.e., the model can find better solutions to the task assignment problem in general. However, the increase in the number of fog servers impacts the training time of the model: the model has to calculate the selection probability over 10 servers, which takes more time than choosing among five servers as in experiment 1. More specifically, the average runtimes per iteration of experiments 2 and 1 are 1.667 s and 1.496 s, respectively, i.e., each iteration in experiment 2 requires about 11% more time than in experiment 1.

Figure 7 compares the results of experiments 1 (exp1) and 2 (exp2). In the second experiment, the RL model needs more time to reach the greedy-based result than in experiment 1: it only reaches the greedy-based result after the 1000-th iteration, five times later than the 200-th iteration in experiment 1. This result shows that the complexity of the task assignment problem in fog computing increases significantly when the number of IoT devices increases and, correspondingly, the number of fog servers and their total capability increase. Consequently, the RL model takes more time to find the solution to the problem. On the other hand, the RL model in experiment 2 reaches its optimal solution after the 5000-th iteration, which is not much different from experiment 1, where the optimal solution is reached after the 6000-th iteration. In other words, the RL model in experiment 2 incurs a burden from the increased number of IoT devices in the system, yet it is still able to find the optimal solution to the task assignment problem. From these experiments, we can conclude that the RL approach works well across fog computing systems of different scales.

Figure 7. Exploring efficiency with the number of fog servers.

We explore the effect of the number of training tasks in Figure 8 . The blue line indicates the improvement of the proposed model over the greedy-based result in experiments 3, 1, 4, and 5, and the red line is the average runtime per iteration of these experiments. The number of training tasks n is the number of consecutive tasks over which the RL model optimizes task assignments; in other words, long-term latency optimization means optimization over n consecutive tasks. Therefore, when the number of training tasks is large, the RL model can, in principle, optimize better. On the other hand, the model may take more time to find the optimal solution and, in the worst case, cannot find it within 10,000 iterations. Figure 8 reflects this analysis: the result does not improve when the number of training tasks exceeds 100, although the training time increases significantly. Consequently, this parameter should be set to 100 to attain the best result in the task assignment problem.

Figure 8. Exploring efficiency with the number of training tasks.

The effect of the number of children generated in each iteration by the ES algorithm is explored in Figure 9 . The runtime grows linearly with the number of children, since the RL model has to run the same training and testing procedure for each child. On the other hand, the improvement of the RL model peaks for a population of 15 and decreases for larger populations. Since the ES algorithm explores the solution in a nearby region, it needs enough samples to make a move toward the optimal region; this is why the improvement grows with the number of children in the range of 5 to 15. However, too many samples do not improve the results further, since they increase the risk of the NN getting stuck in local optima, i.e., the model cannot find the best solution to the problem.

Figure 9. Exploring efficiency with the number of children.

In Figure 10 , we explore the effect of the number of hidden nodes in the NN on the final result. Since a neural network is a universal function approximator, an NN with more nodes approximates a function better than an NN with fewer nodes. Moreover, if the number of nodes in the NN is too small, the result is not stable. However, if the number of nodes is large, training the network takes much more time, and in the worst case the NN cannot reach the optimal solution within 10,000 iterations. The figure clearly reflects this analysis: the result fluctuates when the number of nodes is low, the network peaks at 1024 nodes, and it performs poorly when the number of nodes is too large. It is worth noting that the number of hidden nodes only slightly affects the running time; in fact, when the number of hidden nodes is low, it does not affect the running time at all.

Figure 10. Exploring efficiency with the number of hidden nodes in the NN.

6. Conclusions

The long-term latency optimization of real-time task assignment is one of the most critical problems in fog computing. The problem is difficult due to its high complexity; therefore, conventional optimization techniques do not work well. In this study, we address the problem with an RL model and apply the ES algorithm to optimize the model. The experiments show that the proposed model outperforms the greedy approach by approximately 16.1% in terms of long-term latency of task execution. Moreover, the ES algorithm avoids the premature convergence to local optima that affects most existing gradient-based optimization methods. Additionally, the algorithm is embarrassingly parallel in implementation; hence, it can speed up the learning process, particularly in practical frameworks and applications such as Omega, Mesos, Kubernetes, and Aneka [ 37 , 38 , 39 , 40 ]. To the best of our knowledge, this is the first time the algorithm has been applied to fog computing. This study also opens a new direction for real-time task assignment in fog computing based on the RL and neuroevolution approach.

To extend our study, future work should consider applying the proposed approach to real-world datasets, such as telehealth big data and smart city data, on a testbed system. Another possible extension is to use the ES algorithm to simultaneously optimize multiple utilities of the system in order to balance latency and energy consumption [ 10 , 11 ]. In addition, the transmission latency of uploading tasks to fog servers and of returning responses from the servers to IoT devices should be taken into account.

Author Contributions

Conceptualization and experiment methodology, L.M. and N.-N.D.; Data analysis, L.M. and N.-N.D.; Software and validation, L.M.; Writing—original draft preparation, L.M.; Writing—review & editing, N.-N.D. and M.P.; Supervision, project administration and funding acquisition, M.P.

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1B03933405).

Conflicts of Interest

The authors declare no conflict of interest.


Machine Learning tutorial

This Machine Learning tutorial covers basic and advanced concepts and is designed to cater to both students and experienced working professionals.

This machine learning tutorial helps you gain a solid introduction to the fundamentals of machine learning and explore a wide range of techniques, including supervised, unsupervised, and reinforcement learning.

Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on developing systems that learn, or improve performance, based on the data they ingest. Artificial intelligence is a broad term that refers to systems or machines that resemble human intelligence. Machine learning and AI are frequently discussed together, and the terms are occasionally used interchangeably, although they do not mean the same thing. A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.

Machine learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one will ever come across. As the name suggests, it gives computers the ability that makes them more similar to humans: the ability to learn. Machine learning is actively used today, perhaps in many more places than one would expect.

Recent Articles on Machine Learning

  • Introduction
  • Data and it’s Processing
  • Supervised Learning
  • Unsupervised Learning
  • Dimensionality Reduction
  • Natural Language Processing
  • Neural Networks
  • ML – Deployment
  • ML – Applications
  • Miscellaneous

Features of Machine learning

  • Machine learning is a data-driven technology. Organizations generate large amounts of data daily, and by identifying notable relationships in that data, they can make better decisions.
  • A machine can learn from past data by itself and automatically improve.
  • From a given dataset, it detects various patterns in the data.
  • For big organizations, branding is important, and it becomes easier to target a relatable customer base.
  • It is similar to data mining because both deal with huge amounts of data.

Introduction :

  • An Introduction to Machine Learning
  • What is Machine Learning ?
  • Introduction to Data in Machine Learning
  • Demystifying Machine Learning
  • Artificial Intelligence | An Introduction
  • Machine Learning and Artificial Intelligence
  • Difference between Machine learning and Artificial Intelligence
  • Agents in Artificial Intelligence
  • 10 Basic Machine Learning Interview Questions

Data and It’s Processing:

  • Understanding Data Processing
  • Python | Generate test datasets for Machine learning
  • Python | Data Preprocessing in Python
  • Data Cleaning
  • Feature Scaling – Part 1
  • Feature Scaling – Part 2
  • Python | Label Encoding of datasets
  • Python | One Hot Encoding of datasets
  • Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python
  • Dummy variable trap in Regression Models

Supervised learning :

  • Getting started with Classification
  • Basic Concept of Classification
  • Types of Regression Techniques
  • Classification vs Regression
  • ML | Types of Learning – Supervised Learning
  • Multiclass classification using scikit-learn
  • Gradient Descent algorithm and its variants
  • Stochastic Gradient Descent (SGD)
  • Mini-Batch Gradient Descent with Python
  • Optimization techniques for Gradient Descent
  • Introduction to Momentum-based Gradient Optimizer
  • Introduction to Linear Regression
  • Gradient Descent in Linear Regression
  • Mathematical explanation for Linear Regression working
  • Normal Equation in Linear Regression
  • Simple Linear-Regression using R
  • Univariate Linear Regression in Python
  • Multiple Linear Regression using Python
  • Multiple Linear Regression using R
  • Locally weighted Linear Regression
  • Generalized Linear Models
  • Python | Linear Regression using sklearn
  • Linear Regression Using Tensorflow
  • A Practical approach to Simple Linear Regression using R
  • Linear Regression using PyTorch
  • Pyspark | Linear regression using Apache MLlib
  • ML | Boston Housing Kaggle Challenge with Linear Regression
  • Python | Implementation of Polynomial Regression
  • Softmax Regression using TensorFlow
  • Understanding Logistic Regression
  • Why Logistic Regression in Classification ?
  • Logistic Regression using Python
  • Cost function in Logistic Regression
  • Logistic Regression using Tensorflow
  • Naive Bayes Classifiers
  • Support Vector Machines(SVMs) in Python
  • SVM Hyperparameter Tuning using GridSearchCV
  • Support Vector Machines(SVMs) in R
  • Using SVM to perform classification on a non-linear dataset
  • Decision Tree Regression using sklearn
  • Decision Tree Introduction with example
  • Decision tree implementation using Python
  • Decision Tree in Software Engineering
  • Ensemble Classifier
  • Voting Classifier using Sklearn
  • Bagging classifier

Unsupervised learning :

  • ML | Types of Learning – Unsupervised Learning
  • Supervised and Unsupervised learning
  • Clustering in Machine Learning
  • Different Types of Clustering Algorithm
  • K means Clustering – Introduction
  • Elbow Method for optimal value of k in KMeans
  • Random Initialization Trap in K-Means
  • ML | K-means++ Algorithm
  • Analysis of test data using K-Means Clustering in Python
  • Mini Batch K-means clustering algorithm
  • Mean-Shift Clustering
  • DBSCAN – Density based clustering
  • Implementing DBSCAN algorithm using Sklearn
  • Fuzzy Clustering
  • Spectral Clustering
  • OPTICS Clustering
  • OPTICS Clustering Implementing using Sklearn
  • Hierarchical clustering (Agglomerative and Divisive clustering)
  • Implementing Agglomerative Clustering using Sklearn

Reinforcement Learning:

  • Reinforcement Learning Algorithm : Python Implementation using Q-learning
  • Introduction to Thompson Sampling
  • Genetic Algorithm for Reinforcement Learning
  • SARSA Reinforcement Learning

Dimensionality Reduction :

  • Introduction to Dimensionality Reduction
  • Introduction to Kernel PCA
  • Principal Component Analysis with Python
  • Low-Rank Approximations
  • Overview of Linear Discriminant Analysis (LDA)
  • Mathematical Explanation of Linear Discriminant Analysis (LDA)
  • Generalized Discriminant Analysis (GDA)
  • Independent Component Analysis
  • Feature Mapping
  • Extra Tree Classifier for Feature Selection
  • Chi-Square Test for Feature Selection – Mathematical Explanation
  • Python | How and where to apply Feature Scaling?
  • Parameters for Feature Selection
  • Underfitting and Overfitting in Machine Learning

Natural Language Processing :

  • Text Preprocessing in Python | Set – 1
  • Text Preprocessing in Python | Set 2
  • Removing stop words with NLTK in Python
  • Tokenize text using NLTK in python
  • How tokenizing text, sentence, words works
  • Introduction to Stemming
  • Stemming words with NLTK
  • Lemmatization with NLTK
  • Lemmatization with TextBlob
  • How to get synonyms/antonyms from NLTK WordNet in Python?

Neural Networks :

  • Introduction to Artificial Neutral Networks | Set 1
  • Introduction to Artificial Neural Network | Set 2
  • Introduction to ANN (Artificial Neural Networks) | Set 3 (Hybrid Systems)
  • Introduction to ANN | Set 4 (Network Architectures)
  • Activation functions
  • Implementing Artificial Neural Network training process in Python
  • A single neuron neural network in Python
  • Introduction to Pooling Layer
  • Introduction to Padding
  • Types of padding in convolution layer
  • Applying Convolutional Neural Network on mnist dataset
  • Recurrent Neural Networks Explanation
  • seq2seq model
  • Introduction to Long Short Term Memory
  • Long Short Term Memory Networks Explanation
  • Gated Recurrent Unit Networks(GAN)
  • Text Generation using Gated Recurrent Unit Networks
  • Introduction to Generative Adversarial Network
  • Generative Adversarial Networks (GANs)
  • Use Cases of Generative Adversarial Networks
  • Building a Generative Adversarial Network using Keras
  • Modal Collapse in GANs
  • Introduction to Deep Q-Learning
  • Implementing Deep Q-Learning using Tensorflow

ML – Deployment :

  • Deploy your Machine Learning web app (Streamlit) on Heroku
  • Deploy a Machine Learning Model using Streamlit Library
  • Deploy Machine Learning Model using Flask
  • Python – Create UIs for prototyping Machine Learning model with Gradio
  • How to Prepare Data Before Deploying a Machine Learning Model?
  • Deploying ML Models as API using FastAPI
  • Deploying Scrapy spider on ScrapingHub

ML – Applications :

  • Rainfall prediction using Linear regression
  • Identifying handwritten digits using Logistic Regression in PyTorch
  • Kaggle Breast Cancer Wisconsin Diagnosis using Logistic Regression
  • Python | Implementation of Movie Recommender System
  • Support Vector Machine to recognize facial features in C++
  • Decision Trees – Fake (Counterfeit) Coin Puzzle (12 Coin Puzzle)
  • Credit Card Fraud Detection
  • NLP analysis of Restaurant reviews
  • Image compression using K-means clustering
  • Deep learning | Image Caption Generation using the Avengers EndGames Characters
  • How Does Google Use Machine Learning?
  • How Does NASA Use Machine Learning?
  • 5 Mind-Blowing Ways Facebook Uses Machine Learning
  • Targeted Advertising using Machine Learning
  • How Machine Learning Is Used by Famous Companies?
  • Pattern Recognition | Introduction
  • Calculate Efficiency Of Binary Classifier
  • Logistic Regression v/s Decision Tree Classification
  • R vs Python in Datascience
  • Explanation of Fundamental Functions involved in A3C algorithm
  • Differential Privacy and Deep Learning
  • Artificial intelligence vs Machine Learning vs Deep Learning
  • Introduction to Multi-Task Learning(MTL) for Deep Learning
  • Top 10 Algorithms every Machine Learning Engineer should know
  • Azure Virtual Machine for Machine Learning
  • 30 minutes to machine learning
  • Confusion Matrix in Machine Learning

Prerequisites to learn machine learning

  • Knowledge of linear equations, graphs of functions, statistics, linear algebra, probability, calculus, etc.
  • Knowledge of a programming language such as Python, C++, or R is recommended.

FAQs on Machine Learning Tutorial

Q.1 What is machine learning and how is it different from deep learning?

Machine learning develops programs that can access data and learn from it. Deep learning is a subdomain of machine learning. Deep learning supports the automatic extraction of features from raw data.

Q.2 What are the different types of machine learning algorithms?

  • Supervised algorithms: algorithms that learn from labelled data, e.g., images labelled as containing a dog face or not. Examples: regression, object detection, segmentation.
  • Unsupervised algorithms: algorithms that learn from unlabelled data, e.g., grouping a set of images into clusters of similar images. Examples: clustering, dimensionality reduction.
  • Semi-supervised algorithms: algorithms that use both labelled and unlabelled data, where the majority of the data is unlabelled. Example: anomaly detection.

Q.3 Why do we use machine learning?

Machine learning is used to make decisions based on data. By modelling algorithms on historical data, we can find patterns and relationships that are difficult for humans to detect. These patterns can then be used to predict solutions to unseen problems.

Q.4 What is the difference between Artificial Intelligence and Machine Learning?

Artificial Intelligence | Machine Learning
Develops an intelligent system that can perform a variety of complex jobs. | Constructs machines that can only accomplish the jobs for which they have been trained.
It works as a program that does smart work. | The system takes data and learns from the data.
AI has a broad variety of applications. | ML allows systems to learn new things from data.
AI leads to wisdom. | ML leads to knowledge.



In conclusion, the Model Stock technique introduced by the NAVER AI Lab significantly refines the fine-tuning process of pre-trained models, achieving notable accuracies on both ID and OOD benchmarks with just two models. This method reduces computational demands while maintaining performance, showcasing a practical advancement in machine learning. Its success across diverse datasets emphasizes the potential for broader application and efficiency in model optimization, presenting a step forward in addressing current machine learning practices’ computational and environmental challenges.



  • AI Magazine
  • Privacy & TC
  • Cookie Policy

🐝 FREE AI Courses on RAG + Deployment of an Healthcare AI App + LangChain Colab Notebook all included

Thank You 🙌

Privacy Overview

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Open access | Published: 15 March 2024

Foundation model for cancer imaging biomarkers

Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Mateo Sokač, Tafadzwa L. Chaunzwa, Simon Bernatz, Ahmed Hosny, Raymond H. Mak, Nicolai J. Birkbak & Hugo J. W. L. Aerts

Nature Machine Intelligence 6, 354–367 (2024)

Subjects: Cancer imaging, Tumour biomarkers

A preprint version of the article is available at medRxiv.

Foundation models in deep learning are characterized by a single large-scale model trained on vast amounts of data serving as the foundation for various downstream tasks. Foundation models are generally trained using self-supervised learning and excel in reducing the demand for training samples in downstream applications. This is especially important in medicine, where large labelled datasets are often scarce. Here, we developed a foundation model for cancer imaging biomarker discovery by training a convolutional encoder through self-supervised learning using a comprehensive dataset of 11,467 radiographic lesions. The foundation model was evaluated in distinct and clinically relevant applications of cancer imaging-based biomarkers. We found that it facilitated better and more efficient learning of imaging biomarkers and yielded task-specific models that significantly outperformed conventional supervised and other state-of-the-art pretrained implementations on downstream tasks, especially when training dataset sizes were very limited. Furthermore, the foundation model was more stable to input variations and showed strong associations with underlying biology. Our results demonstrate the tremendous potential of foundation models in discovering new imaging biomarkers that may extend to other clinical use cases and can accelerate the widespread translation of imaging biomarkers into clinical settings.

Foundation models, popularized recently due to their unprecedented performance in language, vision and several other domains 1 , are large deep-learning models trained on extensive amounts of unannotated data serving as the base for a wide range of downstream tasks. In the field of natural language processing, for example, foundation models drive the successes of applications such as ChatGPT 2 , BERT 3 and CLIP 4 . Similarly, foundation models, such as SimCLR 5 and DINO 6 , have reported considerable success in computer vision applications.

Medicine represents a vast potential for foundation models as labelled data are scarce, while multimodal data, such as medical images, biologic and clinical notes, are frequently collected in routine clinical care 7 . Indeed, different applications of foundation models, such as augmented surgical procedures, bedside decision support, interactive radiology reports and note-taking, have been reported 8 .

While many studies investigating imaging-based biomarkers incorporate supervised deep-learning algorithms into their models 9 , 10 , 11 , they are typically applied in scenarios where large datasets are available for training and testing. The quantity and quality of annotated data are strongly linked to the robustness of deep-learning models. However, access to large amounts of annotated data for specialized applications is often challenging and demands expertise, time and labour. In such scenarios, many investigators fall back on traditional handcrafted or engineered approaches based on defined mathematical and statistical algorithms that analyse attributes such as the shape and texture of objects in images, which limit the scope of discovery. This caveat is commonplace in many scenarios where insights from imaging-based biomarkers have great potential in informing clinical care.

Foundation models are generally pretrained using self-supervised learning (SSL), a set of methods that leverage innate information available within data by learning generalized, task-agnostic representations from large amounts of unannotated samples. Existing literature 12 has suggested several strategies, such as image reconstruction, to pretrain networks to learn these representations. Following pretraining, foundation models can be applied to task-specific problems, improving generalization, especially in tasks with small datasets. The expanding literature on SSL in medical imaging 13 focuses primarily on two-dimensional (2D) images (X-ray, whole slide images, dermatology images, fundus images and so on) for diagnostic applications. There is still limited evidence investigating whether SSL can help train foundation models that learn general, robust and transferrable representations that can act as imaging biomarkers, especially prognostic, for tasks of clinical relevance.

In this study, we investigated whether foundation models can improve the development of deep-learning-based imaging biomarkers, especially in limited dataset-size scenarios. The foundation model, a convolutional encoder, was self-supervised pretrained on 11,467 diverse and annotated lesions identified on computed tomography (CT) imaging from 2,312 unique patients 14 (Fig. 1a ). The model was first technically validated by classifying lesion anatomical site (use case 1). Subsequently, it was applied to two clinically relevant applications: developing a diagnostic biomarker that predicts the malignancy of lung nodules (use case 2) and a prognostic biomarker for non-small cell lung cancer (NSCLC) tumours (use case 3; Fig. 1b ). We evaluated two distinct implementation approaches of incorporating a pretrained foundation model into training pipelines for downstream tasks: using the foundation model as a feature extractor followed by a linear classifier and another where the foundation model is fine-tuned through transfer learning. The performance of the foundation model approaches was compared to several existing baselines developed using supervised approaches and publicly available pretrained models. Our analysis examines effective pretraining techniques, performance in limited data scenarios, consistency in test–retest and inter-reader evaluations and the interpretability of findings through deep-learning attribution methods along with their biological relevance to gene expression data. Our results demonstrate the potential of foundation models in discovering new imaging biomarkers and their particular strength in applications with limited dataset sizes. This evidence may extend to other clinical use cases and imaging modalities and can accelerate the widespread development and translation of imaging biomarkers into clinical settings.

Figure 1

a , Foundation model pretraining: a foundation model, specifically a deep convolutional encoder, was pretrained by contrasting volumes with and without lesions. b , Clinical application of the foundation model: the foundation model was used to extract biomarkers and subsequently evaluated on three classification tasks using diverse datasets. c , Foundation model implementation approaches: the foundation model was implemented on specific use cases by (1) training a linear classifier on extracted features or (2) through transfer learning by fine-tuning all model parameters. d , Performance evaluations: we compared the performance of the foundation model against supervised models, trained from random initialization and transfer-learned, through fine-tuning, from a different task. Publicly available state-of-the-art models, Med3D and Models Genesis, were also compared against our foundation model using identical implementation approaches. The comparison was made through several criteria for the different use cases, including quantitative performance, stability, biological and efficiency analysis.

We developed a deep-learning foundation model using SSL and tested the model’s performance in three distinct use cases. The study design and the pretraining process are outlined in Fig. 1 . We trained a single foundation model using a dataset with 11,467 annotated CT lesions identified from 2,312 unique patients. Lesion findings were diverse and included multiple lesions, such as lung nodules, cysts and breast lesions, among numerous others. A task-agnostic contrastive learning strategy was used to pretrain the model on these lesion findings (Fig. 1a ). We showed the applicability of our pretrained foundation model to several tasks through the evaluation on three diverse clinical applications over five distinct datasets (Fig. 1b ).

Pretraining strategy selection

We compared simple auto-encoder pretraining and several state-of-the-art self-supervised pretraining approaches—namely SimCLR 5 , SwAV 15 and NNCLR 16 —against the modified version of SimCLR developed in our study ( Methods ). We evaluated pretraining strategies on the technical validation use case of lesion anatomical site classification by comparing linear classifiers trained on top of features extracted from each of the chosen strategies. We observed that our modified SimCLR pretraining surpassed all others ( P  < 0.001) in balanced accuracy (Fig. 2a ) and mean average precision (mAP) (Fig. 2b ), achieving a balanced accuracy of 0.779 (95% confidence interval (CI) 0.750–0.810) and mAP = 0.847 (95% CI 0.750–0.810). As expected, the second best-performing approach was SimCLR (balanced accuracy 0.696 (95% CI 0.663–0.728); mAP = 0.779 (95% CI 0.749–0.811)). The auto-encoder approach, previously popular for pretraining, performed the worst compared to state-of-the-art contrastive SSL approaches.

Figure 2

We determined the best pretraining approach for our foundation model on their ability to extract features that can be linearly classified to best predict lesion anatomical site. a , b , Different pretraining approaches were evaluated using balanced accuracy (BA) ( a ) and mAP ( b ). c , d , After pretraining our foundation model using the best strategy, we adapted them to use case 1, lesion anatomical site classification, and compared them against baseline methods using balanced accuracy ( c ) and mAP ( d ). We show performance on these metrics aggregated across eight anatomical sites when trained on the full training set and when the training data percentage decreased to 50, 20 and 10%. e , f , Similar to use case 1, we implemented our foundation model on use case 2 and compared it against baseline methods using the AUC-ROC ( e ) and mAP ( f ). Both metrics were computed when trained on the full and 50, 20 and 10% of the dataset. In e , f , Models Genesis approaches are shaded and/or dotted as they were trained on the same data split of LUNA16 and therefore do not present a fair comparison due to overfitting. For use case 2, we also added a supervised model fine-tuned through transfer learning from use case 1. The error bars for a – f show 95% CIs of the estimates and the bar centre shows the mean estimate of the displayed metric. The estimates were computed by generating a bootstrap distribution with 1,000 resamples for datasets with n  = 1,221 samples ( a – d ) and n  = 170 samples ( e , f ).

When limited data (50, 20 and 10%) was used for downstream task training, our method demonstrated consistently improved performance. More importantly, it remained robust as evidenced by the smallest decline in balanced accuracy and mAP of 9 and 12%, respectively, when reducing training data from 100 to 10%.

Lesion anatomical site classification (use case 1)

As a technical validation of the foundation model, we selected an in-distribution task (that is, sourced from the same cohort as the foundation model pretraining) and developed classification models to predict anatomical sites on a training and tuning dataset totalling 3,830 lesions (use case 1, Fig. 1b ). On a held-out test set of 1,221 lesions, we evaluated the performance of two different implementations of the foundation model (Fig. 1c ).

We found that foundation model implementations showed superiority over compared baseline methods (Fig. 2c,d ). The fine-tuned foundation model, denoted Foundation (fine-tuned), with a mAP of 0.857 (95% CI 0.828–0.886) significantly ( P  < 0.05) outperformed all baseline methods on mAP. With a balanced accuracy of 0.804 (95% CI 0.775–0.835), a significant ( P  < 0.01) improvement in balanced accuracy was also observed in comparison to all baselines except Med3D (fine-tuned), where the improvement was borderline ( P  = 0.059).

Features extracted from the foundation model, Foundation (features), when linearly classified, showed significantly improved performance in balanced accuracy and mAP over features extracted from Med3D (ref. 17 ) and Models Genesis 18 baseline methods. Models fine-tuned using compute-intensive supervised deep-learning methods—Supervised, Med3D (fine-tuned) and Models Genesis (fine-tuned)—did not significantly improve in balanced accuracy and mAP over the simple linear classification of foundation model features. Moreover, when considering only mAP, the simple linear classification significantly ( P  < 0.05) outperformed all other implementations. To provide deeper insight into feature separability that allows for such strong linear classification performance, we attempted to explore visual associations by interpreting projected features (Extended Data Fig. 1 ). We observed that features from the pretrained foundation model provided consistently interpretable and well-separated clusters across different settings. Modelling using features also provided a computational benefit, with both memory and time, over deep-learning training (Extended Data Fig. 2 ).

The performance advantage of the foundation model was even stronger in limited data scenarios (Fig. 2c,d ). When we reduced training data to 50% ( n  = 2,526), 20% ( n  = 1,010) and 10% ( n  = 505), Foundation (features) significantly improved balanced accuracy and mAP over every baseline method. Foundation (fine-tuned) showed a larger drop in performance and failed to improve significantly over baseline implementations as training data were decreased (losing significance from 20% onward). Individual comparisons between each model can be found in Extended Data Fig. 3 . To show the applicability of our approach across the various anatomical sites, we provide a site-wise breakdown of performance in Extended Data Fig. 4 .

Nodule malignancy prediction (use case 2)

To assess the generalizability of the foundation model, we chose an out-of-distribution task (that is, belonging to a cohort different from the pretraining) and trained classification models to predict the malignancy of 507 lung nodules from the LUNA16 dataset (use case 2 in Fig. 1b ). We then evaluated performance on a separate test set of 170 nodules.

The approach of fine-tuning the foundation model, Foundation (fine-tuned), with an area under the curve (AUC) = 0.944 (95% CI 0.907–0.972) and mAP = 0.953 (95% CI 0.915–0.979) resulted in significant ( P  < 0.01) superiority over most of the baseline implementations (Fig. 2e,f ). The implementation Med3D (fine-tuned), with AUC = 0.917 (95% CI 0.871–0.957) and mAP = 0.9307 (95% CI 0.888–0.964), performs slightly worse than our model, but this is not significant ( P  = 0.134). For features extracted from our foundation model, similar to use case 1, our implementation surpasses ( P  < 0.001) baseline feature-based implementations. Notably, none of the deep-learning fine-tuned baselines significantly improve over linear classification. The baseline Models Genesis implementation was excluded in this analysis as this model was pretrained on the same dataset and, therefore, does not indicate a fair comparison.

Again, the Foundation (features) approach shows improved performance in reduced data analyses, dominating all baselines ( P  < 0.05) on 50% ( n  = 254), 20% ( n  = 101) and 10% ( n  = 51) training data. Foundation (fine-tuned) shows superior performance over all baselines at 50% but shows large drops in performance from a 20% reduction onward. Med3D (fine-tuned), which performed well on the full dataset, shows a large drop from 50% data reduction onward. Detailed comparisons can be found in Extended Data Fig. 5a .

NSCLC prognostication (use case 3)

Next, we evaluated the efficacy of our foundation model in another clinically relevant use case to capture prognostic radiographic phenotypes of NSCLC tumours. We trained and tuned prognostication models using data from the HarvardRT ( n  = 291) cohort to predict 2 year overall survival after treatment and then compared the performance of the foundation model and baseline implementations on two independent testing cohorts, LUNG1 (NSCLC-Radiomics) ( n  = 420) and RADIO (NSCLC-Radiogenomics) ( n  = 133) (use case 3 in Fig. 1b ).

In the LUNG1 cohort, features extracted from the foundation model followed by a linear classifier, Foundation (features), exceeded all baseline performances with an AUC of 0.638 (95% CI 0.584–0.692) (Fig. 3a ). All comparisons were significant ( P  < 0.05) except for Med3D (fine-tuned), where borderline significance was observed ( P  = 0.053). Deep-learning-based implementations in the baseline comparisons did not perform strongly on this use case. In addition to AUC, we plotted Kaplan–Meier estimates for the top-performing implementations (Fig. 3b ). Foundation (features) provided the best stratification ( P  < 0.001), indicating its ability to determine appropriate risk groups on the basis of mortality. More detailed analyses can be found in Extended Data Figs. 5b and 6 .

Figure 3

We compared the foundation model implementation approaches against baseline methods using the AUC. a , c , Each implementation was adapted for 2 year overall survival classification, trained on the HarvardRT dataset and evaluated on LUNG1 ( a ) and RADIO ( c ) datasets. b , d , Kaplan–Meier curves for groups stratified by model predictions from the best performing among implementation approaches are shown for LUNG1 ( b ) and RADIO ( d ). To ensure a fair comparison, we calculated the threshold to split the risk groups on the HarvardRT tuning set for each implementation. Kaplan–Meier curves for all approaches can be found in Extended Data Fig. 6 . The 95% CI of the estimates is shown by error bars in a , c and error bands in b , d . The measure of centre for the error bars is the mean estimate of AUC and the measure of centre for the error bands is the Kaplan–Meier estimate of the survival function. The estimates for the bar plots in a and c have been computed through a bootstrap distribution with 1,000 resamples using dataset sizes of n  = 420 and n  = 133, respectively.

For the RADIO cohort, Foundation (features) shows the best performance with an AUC of 0.653 (95% CI 0.532–0.771). Similar to the LUNG1 cohort, deep-learning implementations did not demonstrate superior performance (Fig. 3c ). Due to the small sample size, none of the models showed significant differences from the rest ( P  > 0.05) except for the Foundation (features) improving over the Supervised model, which had near-random performance (AUC = 0.520). Kaplan–Meier analysis showed that the sole model that offered significant stratification was the Foundation (features) with P  = 0.009 (Fig. 3d ).

Stability of the foundation model

We evaluated the stability of our foundation model through a test–retest scenario and an inter-reader variability analysis. We used scans from 26 patients from the RIDER dataset 19 , routinely used for test–retest robustness analysis in tumour imaging 19 , 20 , 21 . We found that predictions from the overall best-performing models on LUNG1 and RADIO, Foundation (features) and Supervised (fine-tuned), had high stability, with intraclass correlation coefficient (ICC) values of 0.984 and 0.966, respectively. Furthermore, the test–retest features for both networks were strongly correlated (Fig. 4a,b ).

Figure 4

We analysed input stability on the LUNG1 dataset and test–retest robustness on the RIDER dataset by comparing between Foundation (features) and Supervised (fine-tuned) (best performing, overall for LUNG1 and RADIO use cases). a , We compared ICC between test–retest model predictions on the RIDER dataset ( n  = 26). b , We further visualize the linearity between flattened features extracted from test and retest scans on the RIDER dataset. c , We show the sampling distribution for input perturbations that are used to simulate inter-reader variability. We perturbed across x , y and z axes, although the distribution is shown only for x and y perturbations for simplicity. d , We compared the stability of the features across models using mean-squared error (MSE) between feature values across all the trials. e , We demonstrated the prognostic stability of models when the input seed point is perturbed, estimated through calculating AUC for 2-year survival from model predictions. The error bars in a represent the 95% CI of the estimates and the bar centre is the mean estimate. For the box plots ( d , e ), the centre line shows the median, the box edges represent first and third quartiles and the whiskers extend to 1.5 times the inter-quartile range. The distribution of the data is shown alongside the box plot. Each AUC and MSE measure in the box plots ( d , e ) have been computed on a dataset with n  = 422 samples and the distribution of the measures are obtained from 50 independent perturbation trials.

To evaluate stability against inter-reader variability, we used the LUNG1 dataset and perturbed the input seed point to extract the three-dimensional (3D) volume, simulating variations among human readers (Fig. 4c ). We found that the Foundation (features) had significantly ( P  < 0.05) higher stability against simulated inter-reader variations in feature differences and prediction performance (Fig. 4d,e ).

Saliency maps for fine-tuned foundation models

To gain insight into regions of the input volumes that contribute to a given prediction, we used gradient-based saliency maps for Foundation (fine-tuned) on three selected use cases (as depicted in Fig. 5 ).

Figure 5

a – c , We generated gradient-based saliency maps for each of the fine-tuned foundation models from use cases 1 ( a ), 2 ( b ) and 3 ( c ) using smooth guided back-propagation and visualized salient regions on two samples from corresponding test datasets. The first and fourth columns show the central axial slice (50 × 50 mm) of the volume provided as input to the model. The second and fifth columns show isolines for saliency contours overlayed on the image. Finally, the third and sixth columns show saliency maps highlighting areas of the input volume that contribute the most to a change in the output prediction.

Our analysis revealed that for each use case, the focus was primarily around tissues within or in proximity to the tumour, which is consistent with research demonstrating the tumour microenvironment’s influence on cancer development 22 and prognosis. Specifically, in use case 1 (Fig. 5a ), the focus was mainly on areas surrounding the lesions, such as the parenchyma and bone regions in the lung and the trachea in mediastinal lesions. For use case 2 (Fig. 5b ), tissues of the nodule were highlighted, avoiding high-density bone regions. Use case 3 (Fig. 5c ) primarily attributed areas surrounding the centre of mass of the tumour, with some contribution from high-density bone regions. Overall, these findings indicated that the areas that contribute to the networks’ predictions varied in accordance with the specific use case, with the tumour and surrounding tissues playing a pivotal role.

Underlying biological basis of the foundation model

Finally, we investigated the biological basis of our foundation model by analysing gene expression data associated with model predictions for 130 participants from the RADIO dataset. To identify relevant genes, we selected the top 500 genes and performed a correlation analysis, comparing Foundation (features) and Supervised (fine-tuned) predictions with gene expression profiles. We found that absolute correlation coefficients between gene expression profiles and model predictions were significantly higher ( P  = 0.008) for the foundation model, indicating a stronger association with underlying tumour biology (Fig. 6a ).

Figure 6

We compared the Foundation (features) and Supervised (fine-tuned) (best-performing models on the RADIO dataset) model predictions with gene expression profiles. a , Box plot of absolute correlation coefficients ( y axis) of selected genes against model predictions ( x axis) across n  = 130 samples. Statistical significance between the two groups is determined through a two-sided Wilcoxon signed rank test. b , Gene-set enrichment analysis of genes with correlation coefficient greater than 0.1 revealed for the foundation (left) and supervised model predictions (right). Genetic pathways are shown on the y axis, and the gene ratio is shown on the x axis. Gene count and adjusted P values are also shown in the legend. False discovery rates are used to adjust the P values for multiple comparisons. The box plots in a are defined by the median as the centre line, first and third quartiles as the box edges and 1.5 times the inter-quartile range as the whiskers. MHC, major histocompatibility complex.

Additionally, we examined the genes associated with these models through a gene-set enrichment analysis (genes with a correlation coefficient >0.1). Our analysis revealed that the foundation model showed an enrichment pattern of immune-associated pathways, including interferon signalling, interferon gamma signalling, major histocompatibility complex class II antigen presentation and PD-1 signalling. Conversely, while the supervised model did show enrichment of individual pathways, no identifiable pattern was observed (Fig. 6b ).

In this study, we demonstrated that our foundation model, trained using self-supervised contrastive learning, provided robust performance in predicting anatomical site, malignancy and prognosis across three different use cases in four cohorts. Several studies 23 , 24 , 25 have demonstrated the efficacy of SSL in medicine where only limited data might be available for training deep-learning networks. Our findings complement and extend this for identifying reliable imaging biomarkers for cancer-associated use cases. We showed that our foundation model provided superior performance for anatomical lesion site classification on average and across individual anatomical sites, even when very few training samples were available for that site. Similarly, for malignancy prediction, our model outperformed all other baseline approaches. In both these use cases, the benefit of our model was especially evident in limited data scenarios. Modelling using features extracted from the foundation model was the most robust across these use cases when subjected to drops in training data, offering stable performance even when data sizes were considerably reduced, for example, using only 51 samples in use case 2. Using these features provided the best performance on small cohorts in predicting prognosis and also demonstrated significant stratification of patients by their associated risk for each of the LUNG1 and RADIO cohorts ( P  < 0.01). Feature-based implementations were also computationally efficient when considering both time and memory. Additionally, features and predictions from the foundation model features were found to be highly stable against inter-reader and test–retest variations. Regarding interpretability, we observed that models focused on varying regions of the tumour and surrounding tissue relevant to the associated use case. To gain insight into the underlying biological associations of these features, RNA sequencing analysis combined with imaging data showed that these features correlated with immune-associated pathways.

Image-biomarker studies for predicting endpoints, such as overall survival on small cohorts, largely rely on statistical feature extraction (engineered radiomics) and classical machine learning-based modelling. These require precise 3D segmentations for feature extraction, increasing the annotation burden of these studies. Moreover, these statistical features are affected by several confounders, such as inter-reader variability in segmentations 26 and acquisition settings of the scanners 27 , limiting their applicability in diverse settings. Deep-learning methods, in comparison, are robust to differences in acquisition and segmentation variability and provide improved performance 10 . Surveying diagnostic biomarker studies, Shen et al. 28 trained a simple deep convolutional network to extract features from lung nodules followed by malignancy classification using a support vector machine, possibly one of the first convolutional approaches for this use case. In a subsequent study, Shen et al. 29 proposed a new multi-crop convolutional neural networks (CNN) architecture and demonstrated improved performance over auto-encoder-based pretraining and radiomic feature-based training. Kumar et al. 30 identified radiomic sequences through deep convolutional encoders to determine lung nodule malignancy. These developed approaches were specific to nodule malignancy classification, and it is difficult to determine their transferability to other use cases. By contrast, our approach is generalizable to multiple use cases, and for nodule malignancy, we obtain high performance using significantly lesser training data, only 338 nodules (due to our more stringent exclusion criteria). Considering prognostic biomarkers, Hosny et al. 10 trained a deep-learning model for lung cancer prognostication using several multi-institutional cohorts and demonstrated strong performance over traditional radiomics. Haarburger et al. 31 presented a deep convolutional network-based approach to predict survival endpoints on the LUNG1 dataset. Mukherjee et al. 32 developed a shallow CNN for predicting overall survival by round-robin training on four different cohorts and additionally observed that their model transferred well to predicting nodule malignancy. A general trend observed across these studies was that the performance of deep-learning models was more robust when larger and multi-institutional cohorts were available for training, and validation was generally performed on smaller cohorts. A demonstrated strength of our approach is that training on smaller cohorts performs well in larger validation cohorts.

Advances in deep learning, such as SSL, have translated well to medical imaging use cases, with several studies incorporating pretraining for improved performance 23 , 25 , 33 , 34 . More recently, foundation models have become popular for their ability to learn general concepts adaptable to various tasks. Zhou et al. 35 proposed a foundation model where a visual transformer was trained on 1.6 million retinal images and validated on ocular disease use cases. Azizi et al. 36 presented distinct foundation models for five domains trained in a multi-step approach with different amounts of pretraining data for each (ranging from 8,000 to 2.2 million images). Azad et al. 37 conducted an extensive review, highlighting the development of diverse foundation models, both generalist and more specific, across several medical imaging domains.

Developing a reliable and reproducible foundation model for a specific domain involves the consideration of several design choices. Cole et al. 38 present empirical observations on the quantity of pretraining data, the impact of the pretraining domain, the quality of data and task difficulty when using contrastive pretraining methods. They show a saturation point associated with pretraining dataset size and diminishing returns beyond this point. This point largely depends on the nature and sizes of training data in the downstream task. In our study, we pretrained on 11,467 lesion volumes and randomly sampled volumes, from 5,513 unique CT scans, leveraging not only one of the largest lesion-specific datasets but also one of the largest pretraining 3D CT datasets. The only other study we know that uses more data is by Ghesu et al. 25 where 24,000 CT scans are used for pretraining. Cole et al. 38 also showed that pretraining using in-domain data, semantically connected to the downstream task, has a huge impact besides scale of the pretraining data. Azizi et al. 36 also observed improvements when incorporating in-domain data, even when the number of samples used was smaller. In the context of our study, our pretraining process is the closest to the domain of oncological image biomarkers; as a result, improvements over more out-of-domain pretraining methods are seen.

Despite the strengths outlined in our study, we recognize several limitations that need to be addressed before the clinical applicability of our foundation model. First, the retrospective nature of this study constrains our ability to assess the real-world practicality of model-based biomarkers. Second, evaluating the model’s reliability and reproducibility across diverse demographic groups and various biomarker discovery tasks is crucial to ensure broad applicability. This includes examining how well the model handles distribution shifts between the pretraining and application phases. Another key consideration is investigating whether a larger volume of pretraining data could enhance model performance, particularly for complex tasks. Additionally, since imaging features alone may not suffice for comprehensive clinical decision making, integrating clinical data as covariates could notably improve the model’s effectiveness. Third, a significant challenge with deep-learning models, including ours, is their ‘black box’ nature, which limits interpretability and explainability. Although we used established saliency attribution methods to interpret our model’s predictions, the technical limitations 39 , 40 of these methods may restrict the applicability of the insights gained. Furthermore, our initial biological association analysis, aimed at explaining the model’s decisions, is preliminary and requires more rigorous investigation for a concrete understanding.

In conclusion, our foundation model offers a powerful and reliable framework for discovering cancer imaging biomarkers, especially in small datasets. Furthermore, it surpasses current deep-learning techniques in various tasks while fitting conveniently into existing radiomic research methods. This approach can potentially uncover new biomarkers contributing to research and medical practice. We share our foundation model and reproducible workflows so that more studies can investigate our methods, determine their generalizability and incorporate them into their research studies.

Study population

We used a total of five distinct datasets, four of which are publicly accessible and one of which is an internal dataset. These were acquired from various institutions as components of separate investigations (Extended Data Fig. 9 ).

DeepLesion 14 is a dataset comprising 32,735 lesions from 10,594 studies of 4,427 unique patients collected over two decades from the National Institute of Health Clinical Center PACS server. Various lesions, including kidney, bone and liver lesions, as well as enlarged lymph nodes and lung nodules, are annotated. The lesions are identified through radiologist bookmarked RECIST (Response Evaluation Criteria in Solid Tumors, National Cancer Institute, USA) diameters across 32,120 CT slices. In our study, we excluded CT scans with a slice thickness exceeding 3 mm, resulting in 16,518 remaining lesions. Subsequently, we divided this into 11,467 unlabelled lesions for contrastive training and 5,051 labelled lesions for anatomical site classification. The unlabelled lesions were sourced from 5,513 unique CT scans across 2,312 patients. Labelled lesions chosen for the anatomical site classification use cases were excluded from the pretraining data to avoid potential data leakage between pretraining and evaluation tasks. Despite not using class labels during pretraining, we consciously decided to prevent overlapping lesions from being seen at this stage to ensure unbiased evaluation. The labelled lesion data were further separated randomly into training, tuning and testing sets, containing 2,610, 1,220 and 1,221 lesions, respectively.

LUNA16 (ref. 41 ) is a curated version of the LIDC-IDRI dataset of 888 diagnostic and lung cancer screening thoracic CT scans obtained from seven academic centres and eight medical imaging companies comprising 1,186 nodules. The nodules are accompanied by annotations agreed on by at least three out of four radiologists. Alongside nodule location annotations, radiologists also noted various observed attributes such as internal composition, calcification, malignancy, suspiciousness and more. For our evaluation, we chose nodules with at least one indication of malignancy suspicion, totalling 677. We randomly picked 338 nodules for training and 169 for tuning the malignancy prediction networks. The final 170 nodules were used to assess the networks’ performance.

HarvardRT 10 is a cohort of 317 patients with stage I–IIIB NSCLC treated with radiation therapy at the Dana-Farber Cancer Institute and Brigham and Women’s Hospital, Boston, MA, USA, between 2001 and 2015. All CT scans for this cohort were acquired with and without intravenous contrast on the GE Lightspeed CT scanner. The primary tumour site was contoured by radiation oncologists using soft tissue and lung windows. A subset of 291 patients with a follow-up of 2 years was selected for this study. We used 203 tumour volumes for training the prognostication networks and the remaining 88 tumour volumes for tuning.

LUNG1 (ref. 42 ) is a cohort of 422 patients with stage I–IIIB NSCLC treated with radiation therapy at MAASTRO Clinic, Maastricht, the Netherlands. Fluorodeoxyglucose positron emission tomography (PET)-CT scans were acquired with or without contrast on the Siemens Biograph Scanner. Radiation oncologists used PET and CT images to delineate the gross tumour volume. For our study, we selected CT scans of 420 patients (right-censored for 2-year survival) with annotated primary gross tumour volumes and used these as an independent test set for prognostication networks.

The RADIO 43 dataset is a collection of 211 patients with NSCLC stage I–IV recruited between 2008 and 2012 who were referred for surgical treatment and underwent preoperative CT and PET-CT scans. These patients were recruited from the Stanford University School of Medicine and the Palo Alto Veterans Affairs Healthcare System. Scans were obtained using various scanners and protocols depending on the institution and physician. A subset of 144 patients in the cohort have available tumour segmentations independently reviewed by two thoracic radiologists. In addition to imaging data, the dataset includes molecular data from EGFR, KRAS, ALK mutational testing, gene expression microarrays and RNA sequencing. For the current study, we used 133 patients with annotated gross tumour volumes as an independent test set for prognostication after right-censoring for 2 year survival and subsequently investigated the biological basis of our networks using this dataset.

Data preprocessing

CT scans were resampled using linear interpolation to achieve isotropic voxels with a 1 mm 3 resolution to address variations in slice thickness and in-plane resolutions across study populations. We extracted patches of 50 × 50 × 50 voxels from the scans centred around a seed point (Extended Data Fig. 7 ). For the DeepLesion dataset, which provided annotations in the form of RECIST diameters, the seed point was determined by calculating the midpoint of the RECIST diameter. For the other datasets (that is, LUNA16, HarvardRT, LUNG1 and RADIO), which supplied annotations as 3D contours, the seed point was obtained by computing the centre of mass. This approach allows for significantly higher throughput than manual segmentation, which can be more tedious. We then normalized the voxel values in the patches by subtracting −1,024 (lower-bound Hounsfield unit) and dividing by 3,072 (upper-bound Hounsfield unit of 2,048), ensuring the intensity values in the input data ranged between 0 and 1.
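A minimal sketch of this preprocessing, assuming the CT volume is available as a NumPy array with known voxel spacing (the function and variable names are illustrative, not taken from the released code, and edge padding is omitted for brevity):

```python
import numpy as np
from scipy import ndimage

def preprocess_patch(volume, spacing_mm, seed_voxel, patch_size=50):
    """Resample to 1 mm isotropic voxels, crop a cube around the seed
    point and rescale Hounsfield units to roughly [0, 1]."""
    # Linear interpolation (order=1) to 1 mm isotropic resolution.
    iso = ndimage.zoom(volume, zoom=np.asarray(spacing_mm), order=1)
    # Seed point index in the isotropic grid (original index times spacing).
    centre = np.round(np.asarray(seed_voxel) * np.asarray(spacing_mm)).astype(int)
    half = patch_size // 2
    crop = tuple(slice(c - half, c - half + patch_size) for c in centre)
    patch = iso[crop].astype(np.float32)
    # Subtract the lower HU bound (-1024) and divide by the 3072 HU range.
    patch = (patch + 1024.0) / 3072.0
    return np.clip(patch, 0.0, 1.0)
```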

Task-agnostic pretraining of the foundation model

We implemented contrastive pretraining using a modified version of the SimCLR framework 5 . The SimCLR framework’s general principle involves transforming a single data sample (for example, a patch taken from a CT scan) into two correlated and augmented samples (for example, the same patch rotated 15° clockwise and flipped horizontally). A convolutional encoder is then used to extract latent representations from these samples. Through a contrastive loss function 44 , the model learns to identify similar representations from the same data sample and dissimilar representations from different data samples (Extended Data Fig. 8 ). The framework emphasizes effective transformation choices, convolutional encoder architectures and contrastive loss functions for optimal SSL performance. To effectively represent the nature of medical images, we made modifications to each of these components.

Transformations proposed in the original SimCLR framework for natural world images, such as cutout augmentation, Sobel filtering and colour distortion, are unsuited for 3D medical images due to dynamic range and colour depth differences. Therefore, our study applies different augmentations to replace these transformations. For instance, we substituted the random colour jitter transform with a random histogram intensity shift transform, as they both induce variation in intensity distribution.

To extract representations from the transformed 3D volumes, we selected the 3D ResNet50 (ref. 45 ) architecture as our deep convolutional encoder. While the SimCLR authors used a 2D ResNet50 architecture, we opted for its 3D counterpart, which has proven effective in handling 3D medical imaging data 46 .

Regarding loss functions, we extended normalized temperature-scaled cross-entropy loss (NT-Xent) 47 to support contrastive training for lesion volumes. The modifications include: (1) selecting positive pairs as 3D patches surrounding the lesion’s seed point, (2) choosing negative pairs by randomly sampling 3D patches from the rest of the scan and (3) computing the contrastive loss on these positive and negative pairs, with each iteration comprising n positive pairs and n  × 2( n  − 1) negative pairs. We also explored different temperature parameters for the NT-Xent loss. However, the original value of 0.1 proposed by the original paper was the most effective.
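For reference, a compact PyTorch sketch of the standard NT-Xent objective that the pretraining builds on; the lesion-aware positive/negative sampling described above happens upstream when the two views are constructed, so this is a generic illustration rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.1):
    """z_i, z_j: (n, d) embeddings of two augmented views of the same lesions."""
    n = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)       # (2n, d)
    sim = z @ z.t() / temperature                               # cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                       # drop self-pairs
    # The positive for row k is the other view of the same lesion, row (k + n) % 2n.
    targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)
    return F.cross_entropy(sim, targets)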

Our model was pretrained for 100 epochs using an effective batch size of 64 (32 × 2 training nodes) on two NVIDIA Quadro RTX 8,000 graphical processing units (GPUs) taking approximately 5 days. We used stochastic gradient descent as the optimizer, with layer-wise adaptive rate control, momentum and weight-decay enabled. To improve the optimization process, we used learning rate schedulers that combined linear and cosine decay strategies and a warmup phase to modify the learning rate at the beginning of training gradually. While most specifications were consistent with the original SimCLR experiments, we experimented with different batch sizes, patch sizes (50 and 64 mm 3 ), learning rates, transforms and model architectures.

We conducted a comparison of our modified SimCLR version with its original form along with various well-known and recent pretraining methods. Before the rise of contrastive approaches, auto-encoder methods were commonly used for pretraining and, therefore, we added this to the comparison. This was implemented using MONAI’s auto-encoder framework, ensuring a parameter count similar to that of ResNet50 (230 million compared to ResNet50’s 200 million). Despite SimCLR’s ongoing popularity 13 , recent methodologies have shown superior results in particular scenarios and tasks. We adapted SwAV 15 and NNCLR 16 approaches, combining settings from their original designs with modifications suitable for medical imaging contexts. In our comparative analysis, we maintained uniformity in batch sizes and dataset parameters across all methods, while optimizer and loss-specific settings were aligned with each method’s original configuration.

Task-specific training of the foundation model

Our foundation model was adapted for a specific task through two approaches: (1) extracting features from the frozen encoder and fitting a linear classifier and (2) transfer learning the pretrained ResNet50 for the given classification task.

We extracted 4,096 features from the foundation model for each data point and used them to train a logistic regression model using the scikit-learn framework 48 . A comprehensive parameter search for the logistic regression model was performed using the optuna hyperparameter optimization framework 49 . No performance improvements were observed through feature selection strategies; therefore, all 4,096 features were used in accordance with linear evaluation strategies prevalent in SSL literature.
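A minimal sketch of this linear-evaluation setup, assuming the 4,096-dimensional features have already been extracted into arrays; the searched hyperparameter and its range are illustrative, not the study's exact search space:

```python
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def fit_linear_head(X_train, y_train, X_tune, y_tune, n_trials=50):
    """Tune a logistic-regression head on frozen foundation-model features."""
    def objective(trial):
        # Regularisation strength searched on a log scale (assumed range).
        C = trial.suggest_float("C", 1e-4, 1e4, log=True)
        clf = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
        return average_precision_score(y_tune, clf.predict_proba(X_tune)[:, 1])

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return LogisticRegression(C=study.best_params["C"],
                              max_iter=5000).fit(X_train, y_train)
```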

Transfer learning through fine-tuning was carried out with all layers updated during training, using cross-entropy loss. A series of randomly chosen augmentations—random flips, random 90° rotations and random translations of ±10 voxels across all axes—were applied throughout the training. Stochastic gradient descent was used for network training, with momentum enabled and step-wise learning rate decay. Following the original SimCLR experiments, configurations and similar parameters (including learning rate, transforms and model architectures) were explored during hyperparameter tuning. Each network was trained for 100 epochs using a single NVIDIA Quadro RTX 8,000 GPU, and the best-performing model checkpoints were chosen on the basis of the tuning set.
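The augmentation pipeline can be expressed with MONAI transforms roughly as follows; the probabilities and parameter values here are placeholders rather than the tuned settings used in the study:

```python
from monai.transforms import Compose, RandAffine, RandFlip, RandRotate90

# Illustrative training-time augmentations for 3D patches.
train_augment = Compose([
    RandFlip(prob=0.5),                                   # random flips
    RandRotate90(prob=0.5),                               # random 90-degree rotations
    RandAffine(prob=0.5, translate_range=(10, 10, 10)),   # +/-10 voxel translations
])
```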

For supervised models, we selected four different baselines. First, we randomly initialized the weights of a ResNet50 and trained it using task-specific configurations consistent with fine-tuning the foundation model. Second, the randomly initialized model trained on use case 1 was fine-tuned through transfer learning for use cases 2 and 3. For the third and fourth baselines, publicly available pretrained models were investigated to add comparisons against the state of the art. Specifically, Med3D and Models Genesis were selected on the basis of their relevance to similar domains and tasks, and their established popularity within the community. These models were tailored to each task using configurations that mirrored those of our foundational model, taking into account both their inherent feature representations and transfer learning capabilities.

Task-specific training was conducted on reduced dataset sizes in addition to the complete dataset, training models on these reduced samples with the same configuration as used for the entire dataset. As the training dataset sizes decreased, we considered training the models for a higher number of epochs; however, models frequently overfitted during extended training. The entire test dataset was used to allow benchmarking across these splits. However, we did not conduct reduced dataset training for use case 3, as such use cases typically have inherently small sample sizes relative to task complexity due to study-specific inclusion criteria. Therefore, experiments involving further data reduction in this case do not provide any valuable insights.

Performance analysis

Validation of the foundation model was performed using several use case-relevant metrics. Lesion anatomical site classification performance was assessed using balanced accuracy as a multi-label counting metric and mAP as a multi-threshold metric. The multi-label metric, balanced accuracy, adjusts class-wise accuracy on the basis of the class distribution at a chosen threshold (0.5). The multi-threshold metric, mAP, enables the examination of a given class’s performance across a range of prediction thresholds. All classes other than the class of interest are considered negatives, and performance is averaged across all possible classes. We avoided using the AUC-receiver operating curve (AUC-ROC) for this use case due to the high proportion of negatives relative to positives, which results in consistently low false-positive rates and might overestimate the AUC. However, due to a more balanced class distribution, nodule malignancy prediction was evaluated using AUC-ROC. NSCLC prognostication networks also used AUC-ROC for evaluation, as it estimates the ranking of participants on the basis of their survival times.
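As a concrete reading of these metrics, a small scikit-learn sketch is shown below; the multi-label thresholding detail is simplified to an argmax here, so this approximates the described evaluation rather than reimplementing it:

```python
import numpy as np
from sklearn.metrics import average_precision_score, balanced_accuracy_score

def evaluate_multiclass(y_true, y_score):
    """y_true: integer class labels; y_score: (n_samples, n_classes) scores."""
    ba = balanced_accuracy_score(y_true, y_score.argmax(axis=1))
    # mAP: one-vs-rest average precision for each class, then averaged.
    aps = [average_precision_score((y_true == c).astype(int), y_score[:, c])
           for c in range(y_score.shape[1])]
    return ba, float(np.mean(aps))
```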

Models underwent pair-wise comparison using permutation tests. n permutations ( n  = 1,000) were conducted for each pair, and new models were computed after permuting class labels. Metrics were recalculated after resampling, and a two-sided P value was calculated to test the null hypothesis of observations from each pair originating from the same underlying distribution. Additionally, 95% CIs were established for each model using a bootstrap sampling with n  = 1,000 resamples.
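The resampling procedures can be sketched as below; note that this illustrative permutation test swaps paired model outputs rather than refitting models on permuted labels, which is a simplification of the procedure described above:

```python
import numpy as np

def paired_permutation_test(metric, y_true, pred_a, pred_b, n_perm=1000, seed=0):
    """Two-sided test of whether two models' metrics differ on the same test set."""
    rng = np.random.default_rng(seed)
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    null = []
    for _ in range(n_perm):
        swap = rng.random(len(y_true)) < 0.5        # randomly exchange paired outputs
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        null.append(metric(y_true, a) - metric(y_true, b))
    return float(np.mean(np.abs(null) >= abs(observed)))

def bootstrap_ci(metric, y_true, pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a single model's metric."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y_true), size=(n_boot, len(y_true)))
    stats = [metric(y_true[i], pred[i]) for i in idx]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```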

Kaplan–Meier curves were also used to determine the stratification of participants on the basis of their prediction scores for the prognostication models. Groups were selected on the basis of prediction scores on the tuning set, and curves were plotted on the test set for these groups. Multivariate log-rank tests were used to examine the significance of the stratification. Univariate Cox regression models were built using the model predictions as the categorical variables of interest, grouped similarly to the Kaplan–Meier curve.

Feature visualization and saliency maps

We used the foundation model, top-performing supervised model, Med3D and Models Genesis as feature extractors to obtain 4,096 distinct features (except for Med3D’s 2,048 features) per data point. To enable visual interpretation of these high-dimensional features, we used t -stochastic neighbourhood embeddings 50 at different perplexity values and principal component analysis to reduce their dimensionality to 2D. Points in the 2D visualization were colour-coded according to their respective target classes despite dimensionality reduction being agnostic to these distinctions. Density contours were superimposed over the visualizations to enhance the understanding of group patterns, offering a more comprehensive representation of trends across data points.
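A minimal sketch of the projection step, assuming the extracted features sit in an (n_samples, 4096) array; the perplexity sweep and class colour-coding are omitted for brevity:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(features, method="tsne", perplexity=30, seed=0):
    """Reduce high-dimensional model features to 2D for visual inspection."""
    if method == "pca":
        return PCA(n_components=2, random_state=seed).fit_transform(features)
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(features)
```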

To generate saliency maps for each task, the fine-tuned foundation model was used to generate predictions on randomly selected volumes from respective datasets. The fine-tuned foundation model with a single output prediction (corresponding to the predicted target class) was chosen in contrast to the feature extractor as expressing saliency maps over 4,096-dimensional outputs remains challenging in practice. We used a combination of (1) smooth gradient back-propagation, which averages gradients of the output with respect to several noisy inputs, and (2) guided back-propagation, which combines deconvolution with back-propagation, mainly stopping the flow of negative gradients or neurons that decrease the activation signal. The method is termed smooth guided back-propagation 51 , 52 and is implemented in the MONAI framework 53 .

Stability testing

To test the stability of our models, we performed a test–retest and inter-reader variation evaluation. For the test–retest evaluation, we compared model predictions (of outcome) from the best foundation and supervised models generated on chest CT scans taken in a 15-minute interval for 26 patients. ICC was computed using the interrater reliability and agreement package (irr) in R 54 . We also tested the stability of the flattened features computed by the models by calculating Spearman correlation and R 2 .

For the inter-reader variation evaluation, we used the LUNG1 dataset and generated 50 random perturbations sampled from a 3D multivariate normal distribution with zero mean and diagonal covariance matrix for each seed point. Across each dimension, a variance of 16 voxels was used for generating samples. We generated predictions on volumes extracted from perturbed seed points using the best foundation and supervised model, resulting in 50 different prediction sets for each. The mean and variance of the 50 sets were computed for each and compared.
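The perturbation sampling can be written compactly as follows (a sketch with an illustrative helper name, not the released analysis code):

```python
import numpy as np

def perturb_seed_point(seed_voxel, n_trials=50, variance=16.0, seed=0):
    """Simulate inter-reader variability by jittering the lesion seed point
    with a zero-mean 3D normal with diagonal covariance (variance 16 voxels)."""
    rng = np.random.default_rng(seed)
    offsets = rng.multivariate_normal(mean=np.zeros(3),
                                      cov=np.eye(3) * variance,
                                      size=n_trials)
    return np.asarray(seed_voxel) + offsets      # (n_trials, 3) perturbed points
```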

Biological associations

The GSE103584 dataset contains 130 NSCLC samples that consist of paired CT scans and gene expression profiles generated by RNA sequencing. To analyse gene expression profiles, we filtered them on the basis of cohort mean expression and standard deviation. First, we took only the genes with a higher expression than the overall dataset mean and then picked the top 500 genes on the basis of standard deviation. Next, we performed a correlation analysis comparing the best-supervised and foundation models. To further evaluate foundation model features’ association with tumour biology, we computed the absolute value of the correlation coefficients and performed a gene-set enrichment analysis with all genes with a correlation coefficient above 0.1.
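A sketch of this filtering and correlation analysis with pandas; the column orientation and variable names are assumptions about the data layout, not the released analysis code:

```python
import numpy as np
import pandas as pd

def correlate_genes(expr, predictions, n_top=500, r_threshold=0.1):
    """expr: DataFrame of genes (rows) x samples (columns);
    predictions: model outputs aligned with the sample columns."""
    # Keep genes expressed above the overall dataset mean ...
    kept = expr[expr.mean(axis=1) > expr.values.mean()]
    # ... then the most variable n_top genes among them.
    top = kept.loc[kept.std(axis=1).nlargest(n_top).index]
    # Correlate each gene profile with the model predictions.
    corr = top.apply(lambda g: np.corrcoef(g.values, predictions)[0, 1], axis=1)
    # Genes with |r| above the threshold are carried into enrichment analysis.
    return corr[corr.abs() > r_threshold]
```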

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Most of the datasets used in this study are openly accessible for both training and validation purposes and can be obtained from the following sources: (1) DeepLesion 14 , used both for our pretraining and use case 1, (2) LUNA16 (ref. 55 ) used for developing our diagnostic image biomarker, (3) LUNG1 (ref. 56 ) and (4) RADIO 57 used for the validation of our prognostic image-biomarker model. Imaging and clinical data for the LUNG1 and RADIO datasets were obtained from Imaging Data Commons 58 collections. The training dataset for our prognostic biomarker model, HarvardRT, is internal to Mass General Brigham institutions and contains sensitive protected health information. Due to privacy concerns and legal restrictions associated with patient data, the complete dataset cannot be made publicly available. However, we have shared the model predictions obtained on this dataset to ensure that our statistical analyses can be reproduced. Researchers interested in accessing the dataset can submit a formal request detailing the intended use of the data to R.H.M. ([email protected]). Each request will be evaluated on a case-by-case basis in compliance with the ethical guidelines and agreements under which the data were collected.

Code availability

The complete pipeline used in this study can be accessed either from the AIM webpage at https://aim.hms.harvard.edu/foundation-cancer-image-biomarker or directly from https://github.com/AIM-Harvard/foundation-cancer-image-biomarker (ref. 59). This includes the code for (1) data download and preprocessing, from downloading the data to generating the train-validation-test splits used in our study; (2) replicating the training and inference of foundation and baseline models across all tasks through easily readable and customizable YAML files (leveraging project-lighter 60); and (3) reproducing our comprehensive performance validation. In addition to sharing reproducible code, we also provide trained model weights, extracted features and outcome predictions for all the models used in our study. Most importantly, our foundation model is accessible through a simple pip package install and two lines of code to extract features for your dataset. We also provide a detailed documentation website that can be accessed at https://aim-harvard.github.io/foundation-cancer-image-biomarker/ . The final model weights 61 are made available through the Zenodo platform. The full model implementation is also available through https://mhub.ai/ in a reproducible, containerized, off-the-shelf executable format, allowing fast application in several academic and clinical environments.

Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://arxiv.org/abs/2108.07258 (2021).

Ouyang, L. et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 27730–27744 (Curran Associates Inc., 2022).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics (eds Burstein, J. et al.) 4171–4186 (ACL, 2019).

Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 8748–8763 (PMLR, 2021).

Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning (eds III, H.D. & Singh, A.) 1597–1607 (PMLR, 2020).

Oquab, M. et al. DINOv2: learning robust visual features without supervision. Transact. Mach. Learn. Res. 1–32 (2024).

Thieme, A. et al. Foundation models in healthcare: opportunities, risks & strategies forward. In Extended Abstracts 2023 CHI Conference on Human Factors in Computing Systems 1–4 (ACM, 2023).

Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616 , 259–265 (2023).

Mahajan, A. et al. Deep learning-based predictive imaging biomarker model for EGFR mutation status in non-small cell lung cancer from CT imaging. J. Clin. Orthod. 38 , 3106 (2020).

Hosny, A. et al. Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study. PLoS Med. 15 , e1002711 (2018).

Braghetto, A., Marturano, F., Paiusco, M., Baiesi, M. & Bettinelli, A. Radiomics and deep learning methods for the prediction of 2-year overall survival in LUNG1 dataset. Sci. Rep. 12 , 14132 (2022).

Balestriero, R. et al. A cookbook of self-supervised learning. Preprint at https://arxiv.org/abs/2304.12210 (2023).

Huang, S.-C. et al. Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digit. Med. 6 , 74 (2023).

Yan, K., Wang, X., Lu, L. & Summers, R. M. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 5 , 036501 (2018).

Caron, M. et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33 , 9912–9924 (2020).

Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P. & Zisserman, A. With a little help from my friends: nearest-neighbor contrastive learning of visual representations. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 9568–9577 (IEEE, 2021).

Chen, S., Ma, K. & Zheng, Y. Med3D: transfer learning for 3D medical image analysis. Preprint at https://arxiv.org/abs/1904.00625 (2019).

Zhou, Z. et al. Models Genesis: generic autodidactic models for 3D medical image analysis. Med. Image Comput. Comput. Assist. Interv. 11767 , 384–393 (2019).

Zhao, B. et al. Evaluating variability in tumor measurements from same-day repeat CT scans of patients with non-small cell lung cancer. Radiology 252 , 263–272 (2009).

Aerts, H. J. W. L. et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 5 , 4006 (2014).

Hosny, A. et al. Clinical validation of deep learning algorithms for radiotherapy targeting of non-small-cell lung cancer: an observational study. Lancet Digit. Health 4 , e657–e666 (2022).

Hinshaw, D. C. & Shevde, L. A. The tumor microenvironment innately modulates cancer progression. Cancer Res. 79 , 4557–4566 (2019).

Azizi, S. et al. Big self-supervised models advance medical image classification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 3458–3468 (IEEE, 2021).

Krishnan, R., Rajpurkar, P. & Topol, E. J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 6 , 1346–1352 (2022).

Ghesu, F. C. et al. Contrastive self-supervised learning from 100 million medical images with optional supervision. J. Med. Imaging 9 , 064503 (2022).

Haarburger, C. et al. Radiomics feature reproducibility under inter-rater variability in segmentations of CT images. Sci. Rep. https://doi.org/10.1038/s41598-020-69534-6 (2020).

Campello, V. M. et al. Minimising multi-centre radiomics variability through image normalisation: a pilot study. Sci. Rep. 12 , 12532 (2022).

Shen, W., Zhou, M., Yang, F., Yang, C. & Tian, J. Multi-scale convolutional neural networks for lung nodule classification. Inf. Process. Med. Imaging 24 , 588–599 (2015).

Shen, W. et al. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognit. 61 , 663–673 (2017).

Kumar, D. et al. in Image Analysis and Recognition (eds Karray, F. et al.) 54–62 (Springer, 2017).

Haarburger, C., Weitz, P., Rippel, O. & Merhof, D. Image-based survival prediction for lung cancer patients using CNNS. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) 1197–1201 (IEEE, 2019).

Mukherjee, P. et al. A shallow convolutional neural network predicts prognosis of lung cancer patients in multi-institutional computed tomography image datasets. Nat. Mach. Intell. 2 , 274–282 (2020).

Taleb, A. et al. 3D self-supervised methods for medical imaging. Adv. Neural Inf. Process. Syst. 33 , 18158–18172 (2020).

Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6 , 1399–1406 (2022).

Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature https://doi.org/10.1038/s41586-023-06555-x (2023).

Azizi, S. et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng. 7 , 756–779 (2023).

Azad, B. et al. Foundational models in medical imaging: a comprehensive survey and future vision. Preprint at https://arxiv.org/abs/2310.18689 (2023).

Cole, E., Yang, X., Wilber, K., Aodha, O. M. & Belongie, S. When does contrastive visual representation learning work? In Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 14755–14764 (IEEE, 2022).

Adebayo, J. et al. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 9505–9515 (Curran Associates, 2018).

Arun, N. et al. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiol. Artif. Intell. 3 , e200267 (2021).

Setio, A. A. A. et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med. Image Anal. 42 , 1–13 (2017).

Aerts, H. J. W. L. et al. Data from NSCLC-Radiomics (The Cancer Imaging Archive, 2019); https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI

Napel, S. & Plevritis, S. K. NSCLC Radiogenomics: Initial Stanford Study of 26 cases (The Cancer Imaging Archive, 2014); https://doi.org/10.7937/K9/TCIA.2014.X7ONY6B1

Wang, F. & Liu, H. Understanding the behaviour of contrastive loss. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2495–2504 (IEEE, 2021).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).

Uemura, T., Näppi, J. J., Hironaka, T., Kim, H. & Yoshida, H. Comparative performance of 3D-DenseNet, 3D-ResNet, and 3D-VGG models in polyp detection for CT colonography. In Proc. Medical Imaging 2020: Computer-Aided Diagnosis Vol. 11314, 736–741 (SPIE, 2020).

Sohn, K. Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems (eds Lee, D. et al.) 1857–1865 (Curran Associates, 2016).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: a next-generation hyperparameter optimization framework. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2623–2631 (Association for Computing Machinery, 2019).

van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9 , 2579–2605 (2008).

Springenberg, J. T., Dosovitskiy, A., Brox, T. & Riedmiller, M. A. Striving for simplicity: the all convolutional net. In 3rd International Conference on Learning Representations Workshop (ICLR, 2015).

Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. SmoothGrad: removing noise by adding noise. Preprint at https://arxiv.org/abs/1706.03825 (2017).

Jorge Cardoso, M. et al. MONAI: an open-source framework for deep learning in healthcare. Preprint at https://arxiv.org/abs/2211.02701 (2022).

Gamer, M. irr: Various Coefficients of Interrater Reliability and Agreement (R Foundation for Statistical Computing, 2010); cran.r-project.org/web/packages/irr/irr.pdf

The Cancer Imaging Archive. LIDC-IDRI (TCIA, 2023); www.cancerimagingarchive.net/collection/lidc-idri/

The Cancer Imaging Archive. NSCLC-RADIOMICS (TCIA, 2023); www.cancerimagingarchive.net/collection/nsclc-radiomics/

The Cancer Imaging Archive. NSCLC-RADIOGENOMICS-STANFORD (TCIA, 2023); www.cancerimagingarchive.net/analysis-result/nsclc-radiogenomics-stanford/

Fedorov, A. et al. NCI imaging data commons. Cancer Res. 81 , 4188–4193 (2021).

Pai, S. AIM-Harvard/foundation-cancer-image-biomarker: v0.0.1. Zenodo https://doi.org/10.5281/zenodo.10535536 (2024).

Hadzic, I., Pai, S., Bressem, K. & Aerts, H. Lighter. Zenodo https://doi.org/10.5281/zenodo.8007711 (2023).

Pai, S. Foundation model for cancer imaging biomarkers. Zenodo https://doi.org/10.5281/zenodo.10528450 (2024).

Acknowledgements

We acknowledge financial support from the National Institutes of Health (NIH) (H.J.W.L.A. grant nos. NIH-USA U24CA194354, NIH-USA U01CA190234, NIH-USA U01CA209414, NIH-USA R35CA22052 and NIH-USA U54CA274516-01A1), the European Union, European Research Council (H.J.W.L.A. grant no. 866504) and Deutsche Forschungsgemeinschaft, the German Research Foundation (S.B. grant no. 502050303).

Author information

Authors and affiliations

Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Harvard Institutes of Medicine, Boston, MA, USA

Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Tafadzwa L. Chaunzwa, Simon Bernatz, Ahmed Hosny, Raymond H. Mak & Hugo J. W. L. Aerts

Radiology and Nuclear Medicine, CARIM and GROW, Maastricht University, Maastricht, the Netherlands

Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Raymond H. Mak & Hugo J. W. L. Aerts

Department of Radiation Oncology, Brigham and Women’s Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA

Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Tafadzwa L. Chaunzwa, Simon Bernatz, Ahmed Hosny & Hugo J. W. L. Aerts

Department of Molecular Medicine, Aarhus University Hospital, Aarhus, Denmark

Mateo Sokač & Nicolai J. Birkbak

Department of Clinical Medicine, Aarhus University, Aarhus, Denmark

Department of Radiology, Brigham and Women’s Hospital, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA

Hugo J. W. L. Aerts

Contributions

The concept for the study was developed by S.P. and H.J.W.L.A. Data acquisition, analysis and interpretation were done by S.P., D.B., A.H., T.L.C., R.H.M. and H.J.W.L.A. Methodological design and implementation were done by S.P. and D.B. Conceptualization of assessment strategies was developed by S.P., D.B., N.J.B. and H.J.W.L.A. Statistical analyses were carried out by S.P., M.S., N.J.B. and H.J.W.L.A. Code and reproducibility were the responsibility of S.P., I.H. and V.P. The paper was written by S.P., D.B., M.S., S.B., R.H.M. and H.J.W.L.A. Critical revision of the paper was carried out by S.P., D.B., I.H., V.P., M.S., T.L.C., S.B., A.H., R.H.M., N.J.B. and H.J.W.L.A. The study was supervised by H.J.W.L.A.

Corresponding author

Correspondence to Hugo J. W. L. Aerts .

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Paula Jacobs, Pritam Mukherjee and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Visual exploration of the features generated from the foundation and baseline models.

Features from the foundation model and each of the baseline models are extracted on the independent test set for the lesion anatomical site task and visualized using several dimensionality reduction approaches. To avoid biases from parameter selection, t-SNE with different perplexity settings and PCA are used. The x-axis corresponds to dimension 1, and the y-axis to dimension 2, of the dimensionality reduction. The density contours of each class are underlaid to highlight separability between classes in the feature space. Note that the supervised model was trained with lesion anatomical site labels, while all the other models (Foundation, Med3D, ModelsGenesis) were used merely as feature extractors without being trained on these labels.
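
The dimensionality reduction itself can be reproduced with scikit-learn; the sketch below assumes a hypothetical `features` matrix of extracted features and shows PCA plus t-SNE at several perplexity values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

features = np.random.rand(200, 4096)          # placeholder feature matrix
embeddings = {"pca": PCA(n_components=2).fit_transform(features)}
for perplexity in (10, 30, 50):
    embeddings[f"tsne_p{perplexity}"] = TSNE(
        n_components=2, perplexity=perplexity, init="pca",
        random_state=0).fit_transform(features)

# Each 2D embedding is scatter-plotted, coloured by anatomical site, with
# per-class density contours underlaid.
```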

Extended Data Fig. 2 Time and memory efficiency of implementation approaches.

We compare the two implementation approaches of our foundation model: (1) linear modelling on extracted features, which comprises a feature extraction step followed by the linear modelling step, and (2) transfer learning through a fine-tuning step. a , Training times (in minutes) for each of the three use cases and the three steps are shown. b , Memory usage (GPU VRAM and system RAM) is shown for the feature extraction, linear modelling and fine-tuning steps. Memory usage for each step across use cases remains mostly constant due to batch processing and high feature dimensionality. All analyses were run with six cores of an AMD EPYC 7402P processor (24 cores @ 2.80 GHz). The GPU, which was only used for fine-tuning, was the Quadro RTX 8000. For both CPU and GPU runs, where batch processing was used, a batch size of 32 was chosen.

Extended Data Fig. 3 Detailed comparison of the foundation model implementations against baseline methods for lesion anatomical site classification.

Comparison of the balanced accuracy and mean average precision of the Foundation (Features) and Foundation (Finetuned) against all other methods when using 100%, 50%, 20%, and 10% of the training data. For each metric-percentage pair, a p-value heatmap (darker colours show non-significant values) is shown with the foundation models on the y-axis and all comparison models on the x-axis. In each cell, the increase or decrease in metric value is shown along with the corresponding p-value. p-values between models were compared using the permutation test with N = 1000 permutations conducted for each pair-wise comparison.
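
A generic paired permutation test of the kind described here is sketched below; `metric_fn`, `y_true`, `pred_a` and `pred_b` are hypothetical stand-ins for the metric (for example, balanced accuracy) and the two models' test-set predictions, and the exact resampling scheme used in the paper may differ.

```python
import numpy as np

def paired_permutation_test(y_true, pred_a, pred_b, metric_fn,
                            n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = metric_fn(y_true, pred_a) - metric_fn(y_true, pred_b)
    count = 0
    for _ in range(n_perm):
        # Randomly swap the two models' predictions per sample.
        swap = rng.random(len(y_true)) < 0.5
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        if abs(metric_fn(y_true, a) - metric_fn(y_true, b)) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```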

Extended Data Fig. 4 Anatomical site-wise breakdown of foundation model and baseline method performance.

We compare the foundation model against baseline methods across different training data percentages using average precision scores for each anatomical site in the DeepLesion held-out test dataset. This allows us to show the generalizability of approaches across anatomical sites.

Extended Data Fig. 5 Detailed comparison of the foundation model implementations against baseline methods for nodule malignancy classification and NSCLC prognostication.

a , Comparison of the area under the receiver operating curve (AUC) and mean average precision (mAP) of the Foundation (Features) and Foundation (Finetuned) against all other methods when using 100%, 50%, 20%, and 10% of the training data on use case 2. b , Comparison of the AUC of the Foundation (Features) and Foundation (Finetuned) against all other models for the LUNG1 (left) and RADIO (right) datasets for use case 3. For each metric-percentage pair, a p-value heatmap (darker colours show non-significant values) is shown with the foundation models on the y-axis and all comparison models on the x-axis. In each cell, the increase or decrease in metric value is shown along with the corresponding p-value. p-values between models were compared using the permutation test with N = 1000 permutations conducted for each pair-wise comparison.

Extended Data Fig. 6 Survival analysis for all models implemented on NSCLC prognostication.

a , b , Kaplan–Meier curves on the LUNG1 ( a ) and RADIO ( b ) datasets are shown for both foundation model implementation approaches as well as the baseline comparisons. c , d , Hazard ratios (HRs), computed through univariate Cox regression, for each implementation approach on the LUNG1 ( c ) and RADIO ( d ) datasets are shown using forest plots. For both analyses, groups are determined by splitting the respective model predictions on the median of the corresponding HarvardRT tuning-set predictions. The error bands in ( a , b ) represent the 95% confidence interval of the Kaplan–Meier estimates of the survival function, and the log-rank test is used to determine significant differences between the groups in the KM analysis. In ( c , d ), the error bars represent the 95% confidence interval of the hazard ratio, and the p-values are calculated using the Wald test. For LUNG1, n = 420 samples and for RADIO, n = 133 samples are used to compute each of the analyses in the plots above.
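
The survival analyses described in this legend can be reproduced with the lifelines package; the DataFrame columns (`time`, `event`, `risk_score`) and the threshold variable below are hypothetical placeholders.

```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

def km_and_cox(df: pd.DataFrame, threshold: float):
    # Split patients into risk groups at the median of the tuning-set predictions.
    high, low = df[df["risk_score"] >= threshold], df[df["risk_score"] < threshold]

    # Kaplan-Meier curves and log-rank test between the two groups.
    KaplanMeierFitter().fit(high["time"], high["event"], label="high risk").plot_survival_function()
    KaplanMeierFitter().fit(low["time"], low["event"], label="low risk").plot_survival_function()
    lr = logrank_test(high["time"], low["time"], high["event"], low["event"])

    # Univariate Cox regression on the group indicator; the hazard ratio is
    # exp(coef) and its p-value comes from the Wald test.
    cox_df = df.assign(group=(df["risk_score"] >= threshold).astype(int))
    cox = CoxPHFitter().fit(cox_df[["time", "event", "group"]],
                            duration_col="time", event_col="event")
    return lr.p_value, cox.hazard_ratios_["group"]
```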

Extended Data Fig. 7 Diameter distribution of DeepLesion.

Distribution of diameters in the x and y axes for the DeepLesion training dataset based on RECIST bookmarks identified on key slices. Input dimensions of 50 × 50 × 50 mm³ were chosen as they covered 93% and 97% of the distribution in the x and y axes, respectively.

Extended Data Fig. 8 Stages of the implementation pipeline.

a , We first pre-train using a modified version of SimCLR on 11,467 lesions. The pre-training process consists of a positive contrastive and a negative contrastive loss component. The positive contrastive loss encourages augmentations of the same lesion to have similar features, while the negative contrastive loss encourages different features for volumes with and without lesions. b , In the second stage, for each task, different implementation approaches are followed by adapting the pretrained model, either by extracting features from a frozen model followed by linearly predicting a target or by fine-tuning all model weights to predict a target.
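
For reference, the standard NT-Xent objective that SimCLR-style pre-training builds on is sketched below; the paper's modified loss, which additionally contrasts lesion against non-lesion volumes, is not reproduced here, and the embedding names are hypothetical.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Standard SimCLR loss for two batches of embeddings of the same samples."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature            # (2n, 2n) cosine similarities
    sim.fill_diagonal_(float("-inf"))        # exclude each embedding from its own pair
    # The positive for sample i is its augmented counterpart at i +/- n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```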

Extended Data Fig. 9 Dataset breakdown.

The first table shows the six cohorts used in this study along with eligible scans and patients. The second table shows the outcome, sex and age distribution of each cohort.

Supplementary information

Reporting summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Pai, S., Bontempi, D., Hadzic, I. et al. Foundation model for cancer imaging biomarkers. Nat Mach Intell 6 , 354–367 (2024). https://doi.org/10.1038/s42256-024-00807-9

Received: 09 June 2023

Accepted: 08 February 2024

Published: 15 March 2024

Issue Date: March 2024

DOI: https://doi.org/10.1038/s42256-024-00807-9

Title: Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

Abstract: In the domain of data science, the predictive tasks of classification, regression, and imputation of missing values are commonly encountered challenges associated with tabular data. This research endeavors to apply Large Language Models (LLMs) towards addressing these predictive tasks. Despite their proficiency in comprehending natural language, LLMs fall short in dealing with structured tabular data. This limitation stems from their lacking exposure to the intricacies of tabular data during their foundational training. Our research aims to mitigate this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2 on this enriched dataset. Furthermore, we investigate the practical application of the trained model to zero-shot prediction, few-shot prediction, and in-context learning scenarios. Through extensive experiments, our methodology has shown significant improvements over existing benchmarks. These advancements highlight the efficacy of tailoring LLM training to solve table-related problems in data science, thereby establishing a new benchmark in the utilization of LLMs for enhancing tabular intelligence.
