

Journal of Cloud Computing: Advances, Systems and Applications

  • Open access
  • Published: 10 October 2023

MapReduce scheduling algorithms in Hadoop: a systematic study

  • Soudabeh Hedayati,
  • Neda Maleki,
  • Tobias Olsson,
  • Fredrik Ahlgren,
  • Mahdi Seyednezhad &
  • Kamal Berahmand

Journal of Cloud Computing, volume 12, Article number: 143 (2023)


Hadoop is a framework for storing and processing huge volumes of data on clusters. It uses the Hadoop Distributed File System (HDFS) to store data and MapReduce, a parallel computing framework, to process it. Scheduling is one of the most critical aspects of MapReduce because it has a significant impact on the performance and efficiency of the overall system; its goal is to improve performance, minimize response times, and utilize resources efficiently. This paper provides a systematic study of the existing scheduling algorithms, proposes a new classification of such schedulers, and reviews each category. In addition, the scheduling algorithms are examined in terms of their main ideas, main objectives, advantages, and disadvantages.

Introduction

Big data is a term used to describe collections of data so large that they cannot be handled by conventional database management tools [1]. Big data has become very popular in information technology, where data is becoming increasingly complicated and large amounts of it are created every day. Data comes from social networking sites, business transactions, sensors, mobile devices, etc. [2]. The increasing volume, velocity, and variety of data make processing complex and challenging; big data processing thus becomes difficult in terms of correctness, transformation, matching, and relating [3]. Because the data is so big and unstructured, a new platform is required for data transmission, storage, and processing, one that can process and analyze huge volumes of data at a reasonable speed. This necessity led to the development of parallel and distributed processing in clusters and grids [4]. To hide the complexity of the parallel processing system from users, numerous frameworks have been released [5].

MapReduce is a programming model popular across these frameworks. Using its Map and Reduce functions, users define parallel computations without worrying about the details of parallelism, such as data distribution, load balancing, and fault tolerance [6]. MapReduce processes high volumes of data concurrently: map tasks are first assigned to various nodes and performed on the input data; the final results are then generated by combining the map outputs and applying the reduce function [7, 8].
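As an illustration of the model, the classic word-count example can be sketched in a few lines of Python. This is a minimal in-memory simulation of the map/shuffle/reduce flow, not Hadoop itself; all names here are ours:

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one key."""
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)  # shuffle: group values by key
    # Reduce phase: one reduce call per distinct key.
    results = {}
    for k, vs in sorted(intermediate.items()):
        for rk, rv in reduce_fn(k, vs):
            results[rk] = rv
    return results

lines = [(0, "the quick brown fox"), (1, "the lazy dog the end")]
print(mapreduce(lines, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'end': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}
```

The `intermediate` dictionary plays the role of the shuffle stage: it groups every value emitted by the mappers under its key before the reducers run.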

MapReduce competes with other programming frameworks such as Spark and DataMPI. We chose MapReduce for investigation because it is a high-performance open-source framework used by many large companies to process batch jobs [9, 10], and it is our future research area. Scheduling is the process of assigning tasks to nodes, a critical factor for improving system performance in MapReduce. Many algorithms address scheduling issues with different techniques and approaches. The main goal of scheduling is to increase throughput while reducing response time and improving performance and resource utilization [11, 12].

Using recent articles and inclusion and exclusion criteria, we provide a systematic and comprehensive survey of MapReduce scheduling papers. MapReduce scheduling was the subject of a systematic literature review in 2017 [ 13 ], but there hasn't been another systematic study. As far as we know, our review is the first systematic study from 2017 to June 2023. In this paper, we have reviewed the existing scheduling algorithms and identified trends and open challenges in this area. To answer the research questions (RQs), we gathered and analyzed relevant data from papers on MapReduce scheduling and provided the answers.

In this study, our purpose is to present the results of a systematic survey of MapReduce scheduling algorithms. We consider current algorithms, compare the differences between schedulers, and describe several scheduling algorithms. The remainder of the paper is structured as follows. The “Background and research method” section has two parts: part one gives an architectural overview of MapReduce and Hadoop, and part two describes our research method. The “Regular surveys” section reviews existing surveys in the field. The “Schedulers in Hadoop MapReduce” section categorizes the schedulers, discusses them, presents a comparison of selected algorithms, and analyzes the results. Finally, our conclusions are provided in the “Discussion” section.

Background and research method

Doug Cutting and Mike Cafarella designed Hadoop specifically for distributed applications. It is an open-source Apache project based on Google's MapReduce approach. Hadoop has two major parts, the Hadoop Distributed File System (HDFS) and MapReduce [14], as shown in Fig. 1.

figure 1

Hadoop Architecture [ 15 ]

HDFS: HDFS is the data storage part of Hadoop. Its architecture is master/slave, consisting of one NameNode and several DataNodes. The NameNode acts as the master node, while the DataNodes act as slaves [16]. The NameNode maps input data splits to DataNodes and maintains the metadata; the DataNodes store the application data. Data stored in HDFS is fetched by MapReduce for computation [15, 17].
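The NameNode/DataNode split can be pictured with a toy metadata model. This is an illustrative sketch of the idea, not the real HDFS implementation; the round-robin placement policy and all names are our assumptions:

```python
from itertools import cycle

class NameNode:
    """Toy NameNode: maps each block of a file to the DataNodes
    holding its replicas. Real HDFS placement is rack-aware; we use
    naive round-robin here purely for illustration."""

    def __init__(self, datanodes, replication=3):
        self.replication = replication
        self.block_map = {}          # block id -> list of DataNodes
        self._rr = cycle(datanodes)  # round-robin replica placement

    def add_file(self, name, num_blocks):
        for i in range(num_blocks):
            block = f"{name}#blk{i}"
            self.block_map[block] = [next(self._rr) for _ in range(self.replication)]

    def locate(self, block):
        # A MapReduce client asks the NameNode where a block lives,
        # then reads the data directly from a DataNode.
        return self.block_map[block]

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("input.txt", num_blocks=2)
print(nn.locate("input.txt#blk0"))  # ['dn1', 'dn2', 'dn3']
```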

MapReduce is the processing unit of Hadoop, with two primary parts: one JobTracker and numerous TaskTrackers. The JobTracker breaks a job into several map and reduce tasks and assigns them to TaskTrackers, which run them [18]. The MapReduce job execution flow is depicted in Fig. 2. Map tasks run in parallel on input splits and produce a set of intermediate key-value pairs. These pairs are then shuffled across multiple reduce tasks by key. Each reduce task accepts one key at a time and processes that key's data, generating output files that are stored on HDFS [19].

figure 2

MapReduce job execution flow [ 19 ]

Research method

According to [ 20 , 21 , 22 ], the papers are selected using the following method:

Several research questions are established based on the research area;

Keywords are discovered based on the research questions;

Search strings are defined according to the keywords;

The final papers are vetted according to several inclusion and exclusion criteria.

Research question

In this paper, we review and analyze the selected studies to present a summary of current research on MapReduce scheduling. We plan to respond to the following research questions:

RQ1: What was the trend of publications in the field of Hadoop MapReduce scheduling in the last five years?

RQ2: What are the proposed schedulers, and what are the techniques used?

RQ3: What are the key ideas of each paper?

RQ4: What evaluation techniques have been used to assess the results of each paper?

RQ5: What scheduling metrics are considered?

Search strategy

We used search strings to conduct the systematic mapping and literature study. The search strings were defined after the research questions were created, and the source databases were identified. Research keywords: using the research questions, we chose the keywords “Hadoop”, “MapReduce”, and “Scheduling”. Search strings: we defined the search strings from the selected keywords according to the research questions. The survey's initial results were obtained using the search string “Hadoop OR MapReduce” AND “Scheduling”, which was subsequently adapted to the specific requirements of each database. Sources: after defining the search string, we selected the following databases as sources:

ACM Digital Library

IEEE Xplore

Science Direct

Springer Link

The search period was from 2009 to 2023, and the search was carried out in July 2023. A word cloud of the references is shown in Fig. 3.

figure 3

References word cloud. This figure shows the word cloud generated from the references; the 200 most frequent words are shown, and the larger a word appears, the more frequently it is used

Search selection

After retrieving the database results, each paper was carefully examined to ensure it pertains to our survey context. The following inclusion and exclusion criteria were applied to find the studies. The results removed at each stage can be seen in Fig. 4.

Stage 1. Apply the search query to all the sources, gathering the results.

Stage 2. Apply inclusion/exclusion criteria to the paper title.

Stage 3. Apply inclusion/exclusion criteria to the paper abstract.

Stage 4. Apply inclusion/exclusion criteria to the introduction.

figure 4

Selection of the papers

Inclusion criteria

Relevance to the research questions is the core inclusion criterion. First, we examined the titles and abstracts of the studies from the initial database searches. The studies selected at this stage were then analyzed a second time based on their introductions. A study was discarded whenever one of the inclusion criteria was not met. The inclusion criteria are as follows:

Studies address scheduling concepts for Hadoop MapReduce.

Studies use predetermined performance measures to assess their proposed approach.

Studies are presented at conferences or published in peer-reviewed journals.

Exclusion criteria

Studies were disqualified based on an analysis of their title, abstract, and, if necessary, introduction. The exclusion criteria are as follows:

Studies that do not address scheduling in Hadoop MapReduce.

Studies whose entire texts are not accessible at the source.

Studies unrelated to the research questions.

Repeated studies that were published in multiple sources.

Presentations, demonstrations, or tutorials.

Data extraction

During the data extraction phase, the data gathered from the selected studies are summarized for analysis. In this section, we synthesized the data to answer research question RQ1.

Answer to Question RQ1 What was the trend of publications in the field of Hadoop MapReduce scheduling in the last five years?

Figure 5 shows the percentage of selected studies per year. As can be seen, 17 (34%) studies were published in the last five years. More specifically, 1 (2%) article was published in 2023, 7 (14%) articles in 2022, 4 (8%) studies in 2021, 3 (6%) studies in 2020, and 2 (4%) articles in 2019.
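These figures imply a corpus of 50 selected studies (17/50 = 34%); the arithmetic can be checked directly:

```python
# Per-year counts reported above; each percentage is count / 50 * 100.
total = 50
per_year = {2023: 1, 2022: 7, 2021: 4, 2020: 3, 2019: 2}

for year, n in sorted(per_year.items(), reverse=True):
    print(f"{year}: {n} studies = {n / total:.0%}")

last_five = sum(per_year.values())
print(f"last five years: {last_five} studies = {last_five / total:.0%}")
# last five years: 17 studies = 34%
```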

figure 5

Percentage of selected studies per year

Figure 6 shows the percentage of selected studies from each database: 7% of the studies come from the ACM Digital Library, 65% from IEEE Xplore, 5% from Science Direct, and 23% from Springer Link.

figure 6

Percentage of selected studies per database

Figure 7 presents the percentage of studies of each publication type. As can be seen, 61% of the studies were presented at conferences, and 39% were published in journals.

figure 7

Percentage of selected studies per publication type

Regular surveys

In this section, we review the regular (non-systematic) surveys conducted in the field, analyzing and summarizing them to offer a comprehensive understanding of the subject matter.

Ghazali et al. [ 23 ] presented a novel classification system for job schedulers, categorizing them into three distinct groups: job schedulers for mitigating stragglers, job schedulers for enhancing data locality, and job schedulers for optimizing resource utilization. For each job scheduler within these groups, they provided a detailed explanation of their performance-enhancing approach and conducted evaluations to identify their strengths and weaknesses. Additionally, the impact of each scheduler was assessed, and recommendations were offered for selecting the best option in each category. Finally, the authors provided valuable guidelines for choosing the most suitable job scheduler based on specific environmental factors. However, the survey is not systematic and the process of selecting articles is not clear. Moreover, there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Abdallat et al. [ 24 ] focused on the topic of job scheduling algorithms in Big Data Hadoop environment. The authors emphasize the importance of efficient job scheduling in processing large amounts of data in real-time, considering the limitations of traditional ecosystems. The paper provides background information on the Hadoop MapReduce framework and conducts a comparative analysis of different job scheduling algorithms based on various criteria such as cluster environment, job allocation strategy, optimization strategy, and quality metrics. The authors present use cases to illustrate the characteristics of selected algorithms and offer a comparative display of their details. The paper discusses popular scheduling considerations, including locality, synchronization, and fairness, and evaluates Hadoop schedulers based on metrics such as locality, response time, and fairness. However, the survey is not systematic and the process of selecting articles is not clear. Also, there is no comparison between these studies’ advantages and disadvantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Hashem et al. [ 25 ] conducted a comprehensive comparison of resource scheduling mechanisms in three widely-used frameworks: Hadoop, Mesos, and Corona. The scheduling algorithms within MapReduce were systematically categorized based on strategies, resources, workload, optimization approaches, requirements, and speculative execution. The analysis encompassed two aspects: taxonomy and performance evaluation, where they thoroughly reviewed the advancements made in MapReduce scheduling algorithms. Additionally, the authors highlighted existing limitations and identified potential areas for future research in this domain. However, the survey is not systematic and the process of selecting articles is not clear. Also, there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Soualhia et al. [26] analyzed 586 papers to identify the most significant factors impacting scheduler performance. They discussed these factors broadly, including challenges related to resource utilization and total execution time. Additionally, they categorized existing scheduling approaches into adaptive, constrained, dynamic, and multi-objective groups, summarizing their advantages and disadvantages. Furthermore, they classified scheduling issues into categories such as resource management, data management (including data locality, placement, and replication), fairness, workload balancing, and fault tolerance, and analyzed approaches to address these issues, grouping them into four main categories: dynamic scheduling, constrained scheduling, adaptive scheduling, and others. However, there is no comparison between these studies’ advantages and disadvantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Khezr et al. [ 27 ] proposed a comprehensive review of the applications, challenges, and architecture of MapReduce. This review aims to highlight the advantages and disadvantages of various MapReduce implementations in order to discuss the differences between them. They examined the utilization of MapReduce in multi-core systems, cloud computing, and parallel computing environments. This study provides a comprehensive review of MapReduce applications. Additionally, open issues and future work are discussed. However, the survey is not systematic and did not outline a clear process for selecting articles. Moreover, it did not specifically focus on Hadoop.

Senthilkumar et al. [ 28 ] explored different scheduling issues, such as locality, fairness, performance, throughput, and load balancing. They proposed several job scheduling algorithms, including FIFO scheduler, Fair scheduler, Delay scheduler, and Capacity scheduler, and evaluated the pros and cons of each algorithm. The study also discussed various tools for node allocation, load balancing, and job optimization, providing a comparative analysis of their strengths and weaknesses. Additionally, optimization techniques to maximize resource utilization within time and memory constraints were reviewed and compared for their effectiveness. However, the survey is not systematic and did not outline a clear process for selecting articles. Furthermore, there is no comparison between these studies’ advantages and disadvantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Hashem et al. [29] presented a comprehensive review of MapReduce challenges, including data access, data transfer, iteration, data processing, and declarative interfaces. A bibliometric analysis and review are performed on MapReduce research publications indexed in Scopus, covering MapReduce applications from 2006 to 2015. The review also discusses the most significant studies on MapReduce improvements and future research directions related to big data processing with MapReduce, suggesting that power management in Hadoop clusters is one of the main issues to be addressed. In this study, the paper selection process is clear, but there is no comparison between these studies’ advantages and disadvantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Li et al. [ 30 ] addressed the basic concept of the MapReduce framework, its limitations, and the proposed optimization methods. These optimization methods are categorized into several subjects, including job scheduling optimization, improvement in the MapReduce programming model, support for stream data in real-time, performance tuning such as configuration parameters, energy savings as a major cost, and enhanced authentication and authorization. However, there is no comparison between these studies’ pitfalls and advantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles. Moreover, the survey is not systematic, and the process of selecting articles is not clear.

Tiwari et al. [ 31 ] introduced a multidimensional classification framework for comparing and contrasting various MapReduce scheduling algorithms. The framework was based on three dimensions: quality attribute, entity scheduled, and adaptability to runtime environment. The study provided an extensive survey of scheduling algorithms tailored for different quality attributes, analyzing commonalities, gaps, and potential improvements. Additionally, the authors explored the trade-offs that scheduling algorithms must make to meet specific quality requirements. They also proposed an empirical evaluation framework for MapReduce scheduling algorithms and summarized the extent of empirical evaluations conducted against this framework to assess their thoroughness. However, the survey is not systematic and did not outline a clear process for selecting articles. Furthermore, there is no comparison between these studies’ advantages and disadvantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

Polato et al. [ 32 ] performed a systematic literature review to establish a taxonomy for classifying research related to the Hadoop framework architecture. They categorized the studies into four main categories: MapReduce, Data Storage & Manipulation, Hadoop Ecosystem, and Miscellaneous. The MapReduce category encompassed studies on solutions utilizing the paradigm and associated concepts, while the Data Storage & Manipulation category covered research on HDFS, storage, replication, indexing, random access, queries, DBMS infrastructure, Cloud Computing, and Cloud Storage. The Hadoop Ecosystem category focused on studies exploring new approaches within the Hadoop Ecosystem, and the Miscellaneous category included research on topics like GPGPU, cluster cost management, data security, and cryptography. The taxonomy developed in this study provides a comprehensive overview of the diverse research landscape surrounding the Hadoop framework architecture. In this study, the paper selection process is clear, but there is no comparison between these studies’ advantages and disadvantages, and there is no information about the environment and the platform for implementation or simulation of the surveyed articles.

We compared these surveys and highlighted their shortcomings. Table 1 summarizes them and their main properties; it includes references, key ideas, whether the survey is systematic, advantages and disadvantages, comparison algorithms, and evaluation techniques.

Schedulers in Hadoop MapReduce

To respond to RQ2, RQ3, and RQ4, a thorough review of the selected studies was conducted and the most frequently addressed scheduling issues in Hadoop MapReduce were analyzed. We classified the schedulers into six categories:

Deadline-aware schedulers;

Data Locality-aware schedulers;

Cost-aware schedulers;

Resource-aware schedulers;

Makespan-aware schedulers;

Learning-aware Schedulers.

The idea of each paper has been validated by comparing its performance against existing solutions and benchmarks, as shown in Tables 2, 3, 4, 5, 6, 7 and 8 in the “Deadline-aware schedulers” to “Learning-aware schedulers” sections, respectively. Each category is discussed in the following subsections.

Deadline-aware schedulers

Some MapReduce jobs on big data platforms have deadlines and must be completed within them. When a job has a deadline, proper resources must be allocated to it; otherwise, the deadline cannot be satisfied [87]. Meeting job deadlines is therefore crucial in MapReduce clusters. We classify deadline-aware schedulers into two categories: deadline-aware schedulers in heterogeneous Hadoop clusters and deadline-aware schedulers in homogeneous Hadoop clusters. In this section, we first survey and review the most popular deadline-aware schedulers, which aim to minimize job deadline misses; the reviewed schedulers are then compared and summarized.

Deadline-aware schedulers in homogeneous clusters

Gao et al. [ 33 ] proposed a deadline-aware preemptive job scheduling strategy, called DAPS, for minimizing job deadline misses in Hadoop Yarn clusters. The proposed method considers the deadline constraints of MapReduce applications and maximizes the number of jobs that meet their deadlines while improving cluster resource utilization. DAPS is formulated as an online optimization problem, and a preemptive resource allocation algorithm is developed to search for a good job scheduling policy. DAPS is a distributed scheduling framework that includes a central resource scheduler, a job analyzer, and node resource schedulers.

Cheng et al. [ 34 ] provided a MapReduce job scheduler for deadline constraints called RDS. This scheduler assigns resources to jobs based on resource prediction and job completion time estimation. The problem of job scheduling was modeled as an optimization problem. To find the optimal solution, a receding horizon control algorithm was used. They estimated job completion times using a self-learning model.

Kao et al. [35] proposed a scheduling framework, called DamRT, to provide deadline guarantees while considering data locality for MapReduce jobs in homogeneous systems. Tasks in the proposed method are non-preemptive. DamRT schedules jobs in four steps. First, it determines the dispatching order of jobs: if the map tasks of several jobs can be distributed across nodes concurrently, an urgency value, rather than the deadline, determines their priority. The urgency value is calculated by dividing the estimated response time by the slack time slots across nodes, and DamRT dispatches the job with the highest urgency value first. Second, the partition order is adjusted for all map tasks of the job: the scheduler spreads the map tasks of a job across nodes based on the nodes' access probability for the required data. Third, DamRT assigns the map tasks to nodes according to data locality and blocking time. If two map tasks are assigned to one node, one task waits for the other's data transfer and execution; for the other tasks, the blocking time is captured by a partition value. The scheduler calculates the partition value of every node for all the map tasks of the job and assigns each map task to the node that can schedule the job and has the smallest partition value. Finally, once all the map tasks of the job have completed, its reduce tasks become ready for execution. If the reduce tasks can be distributed across nodes concurrently, the urgency value, rather than the deadline, determines their priority; all reduce tasks are then allocated to the least-loaded node where the job can be scheduled.
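The urgency value that drives DamRT's dispatching order can be sketched as follows; the job values are hypothetical, and the estimation of response time and slack is elided:

```python
def urgency(estimated_response_time, slack_time_slots):
    """DamRT-style urgency: estimated response time divided by the
    slack time slots across nodes (larger value = more urgent)."""
    return estimated_response_time / slack_time_slots

# Hypothetical jobs: (estimated response time, slack time slots).
jobs = {
    "j1": urgency(40, 10),  # 4.0
    "j2": urgency(30, 5),   # 6.0
    "j3": urgency(20, 20),  # 1.0
}

# Dispatch the job with the highest urgency first.
order = sorted(jobs, key=jobs.get, reverse=True)
print(order)  # ['j2', 'j1', 'j3']
```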

Verma et al. [36] extended ARIA and proposed a deadline-based Hadoop scheduler called MinEDF-WC (Minimum Earliest Deadline First-Work Conserving). They integrated three mechanisms. 1) A policy for job ordering in the queue: when the job profile is unavailable, job ordering follows the EDF policy (Earliest Deadline First). The EDF policy is used with Hadoop's default scheduler, which allocates the maximum number of slots to the job; when a new job with an earlier deadline arrives, the active tasks are killed in this scheme. 2) A mechanism for allocating the number of map and reduce slots a job requires to satisfy its deadline, proposed to improve the EDF job ordering: when the job profile is available, it assigns the minimum resources required to satisfy the deadline, hence the name MinEDF. 3) A mechanism for allocating and deallocating extra resources among jobs, called MinEDF-WC, designed to use spare slots efficiently: after the minimum number of slots required for a job is allocated, the unallocated slots, referred to as spare slots, are allocated to active jobs. When a new job with an earlier deadline arrives, the mechanism determines whether its deadline can still be met once the running tasks complete and their slots are released. If the released slots cannot meet the new job's deadline, the mechanism stops active tasks and reallocates their slots to the new job.
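A simplified sketch of the MinEDF-WC idea, allocating minimum slots in EDF order and then spreading spare slots over active jobs (job parameters are hypothetical, and the slot-requirement estimation from the job profile is elided):

```python
def min_edf_wc(jobs, total_slots):
    """Sketch of MinEDF-WC slot allocation. `jobs` is a list of
    (name, deadline, min_slots) tuples. Jobs are served in EDF order
    and receive only the minimum slots their deadline requires; the
    work-conserving step then hands leftover (spare) slots to active
    jobs round-robin."""
    alloc, free = {}, total_slots
    for name, _deadline, need in sorted(jobs, key=lambda j: j[1]):
        take = min(need, free)   # minimum allocation, EDF order
        alloc[name] = take
        free -= take
    # Work-conserving step: distribute spare slots over active jobs.
    active = [n for n, a in alloc.items() if a > 0]
    for i in range(free):
        alloc[active[i % len(active)]] += 1
    return alloc

# 12 slots, 9 needed as minimums; the 3 spare slots are spread out.
print(min_edf_wc([("a", 100, 4), ("b", 50, 3), ("c", 200, 2)], total_slots=12))
# {'b': 4, 'a': 5, 'c': 3}
```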

Phan et al. [37] showed that the default schedulers in Hadoop are inadequate for deadline-constrained MapReduce job scheduling. They presented two deadline-based schedulers built on Earliest Deadline First (EDF) but customized for the Hadoop environment. The first is EDF/TD, which minimizes total or maximum tardiness; a job's tardiness is the elapsed time between its deadline and its completion time. The second is EDF/MR, which minimizes the miss rate, i.e., the number of soft real-time jobs that miss their deadlines. The EDF/TD scheduling policy sorts the job queue by job deadlines. For each job, if it has local tasks on a node, the one with the lowest laxity (the difference between the deadline and the estimated execution time) is scheduled. If all the job's tasks are remote, only tasks whose data is close to the node are executed. If there are no such tasks, the scheduler repeats these steps for the next job in the queue. Finally, in the absence of any such tasks, the first task of the first job in the queue is chosen. The EDF/MR scheduling policy classifies jobs into two sets: schedulable jobs, which are expected to meet their deadlines, and unschedulable jobs, which are not. A job is predicted to meet its deadline if the present time plus the estimated remaining execution time is less than the deadline. The scheduler first considers the schedulable set; when it is empty, it considers the unschedulable set. Within each set, task priority is assigned by the EDF/TD policy, and the highest-priority tasks are allocated first.
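The EDF/TD selection rule can be sketched as below. This simplified version keeps the deadline ordering and the lowest-laxity local-task choice but elides the paper's rack-proximity fallback for remote tasks; the data layout is our assumption:

```python
def pick_task(node, job_queue):
    """EDF/TD sketch: walk jobs in deadline order; within a job, prefer
    the node-local task with the smallest laxity (deadline minus
    estimated execution time)."""
    for job in sorted(job_queue, key=lambda j: j["deadline"]):
        local = [t for t in job["tasks"] if node in t["locations"]]
        if local:
            return min(local, key=lambda t: job["deadline"] - t["est_time"])
    # No job has a local task on this node: fall back to the first
    # task of the earliest-deadline job.
    first = sorted(job_queue, key=lambda j: j["deadline"])[0]
    return first["tasks"][0]

jobs = [
    {"deadline": 100, "tasks": [
        {"id": "t1", "est_time": 30, "locations": {"n1"}},
        {"id": "t2", "est_time": 50, "locations": {"n2"}}]},
    {"deadline": 80, "tasks": [
        {"id": "t3", "est_time": 10, "locations": {"n2"}}]},
]
print(pick_task("n2", jobs)["id"])  # 't3': earliest-deadline job has a local task
```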

Kc et al. [38] proposed a deadline-based scheduler for Hadoop clusters. To determine whether a job can be completed by its deadline, the scheduler first performs a schedulability test: it estimates the minimum number of map and reduce slots required to complete the job before the deadline. To schedule a job, the number of free map and reduce slots must be greater than or equal to these minimums [88, 89]. Hence, jobs are only scheduled if they can finish before their deadlines.
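Once the minimum slot counts are estimated, the schedulability test reduces to a pair of comparisons (the estimation itself is elided here; the numbers are hypothetical):

```python
def schedulable(job, free_map_slots, free_reduce_slots):
    """Admit a job only if the free slots cover the minimum map and
    reduce slots estimated to be needed to meet the deadline."""
    return (free_map_slots >= job["min_map_slots"]
            and free_reduce_slots >= job["min_reduce_slots"])

job = {"min_map_slots": 6, "min_reduce_slots": 2}
print(schedulable(job, free_map_slots=8, free_reduce_slots=2))  # True
print(schedulable(job, free_map_slots=4, free_reduce_slots=2))  # False: too few map slots
```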

Teng et al. [39] addressed the problem of deadline-based scheduling from two perspectives. 1) The real-time scheduling problem is formulated by determining the features of the task, the cluster, and the algorithm. A task is modeled as a periodic sequence of instances, and the MapReduce cluster is modeled as an exclusive cluster: tasks run sequentially rather than concurrently, and preemptively, since the cluster supports preemption. 2) To schedule tasks by deadline, they proposed the Paused Rate Monotonic (PRM) method. Since a task is a periodic sequence of successive instances, the current instance must finish before the next instance arrives; the period is thus the deadline of any instance, and the highest priority is given to the task with the shortest deadline. Because a MapReduce job comprises map tasks and reduce tasks, the authors segmented the period T into a mapping period TM and a reducing period TR: TM is the deadline for map tasks, and TR is the deadline for reduce tasks. PRM pauses the reduce tasks until the map tasks are completed, so the reduce tasks are scheduled only after time TM. By pausing between the map and reduce stages, resources are utilized efficiently.
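A sketch of PRM's two ingredients, shortest-period-first priorities and the pause of reduce work until TM (task parameters are hypothetical):

```python
def prm_priorities(tasks):
    """PRM sketch: the period T of each task is its deadline, so the
    shortest period gets the highest priority (rate monotonic)."""
    return [t["name"] for t in sorted(tasks, key=lambda t: t["T"])]

def phase_at(task, time_in_period):
    """The period T is split into TM (map deadline) and TR = T - TM
    (reduce deadline): reduce tasks are paused until time TM."""
    return "map" if time_in_period < task["TM"] else "reduce"

tasks = [{"name": "A", "T": 20, "TM": 12},
         {"name": "B", "T": 10, "TM": 6}]
print(prm_priorities(tasks))   # ['B', 'A']: B has the shorter period
print(phase_at(tasks[0], 5))   # 'map': before TM, only map work runs
print(phase_at(tasks[0], 15))  # 'reduce': after TM, reduce work resumes
```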

Wang et al. [ 40 ] presented a scheduling algorithm using the most effective sequence (SAMES) for scheduling jobs with deadlines. First, they introduced the concept of a sequence: an ordered list of jobs, which constrains the order in which the map phases of jobs finish. Next, they defined the concept of an effective sequence (ES): a sequence seq of a job set is an ES if the completion time of each job is shorter than its deadline. They presented two efficient techniques for finding ESes and, when more than one ES exists, a ranking method to select the most effective sequence (MES). An incremental method determines whether a newly arrived job is acceptable and updates the MES accordingly.

Dong et al. [ 41 ] addressed the problem of scheduling mixed real-time and non-real-time MapReduce jobs. They proposed a two-stage scheduler that is implemented using multiple techniques. First, by using a sampling-based method called TFS, the scheduler predicts the map and reduce task execution time. Next, each job is adaptively controlled by a resource allocation model (AUMD) to run with a minimum slot. Finally, a two-stage scheduler for scheduling real-time and non-real-time jobs is proposed, which supports resource preemption.

Verma et al. [ 42 ] extended ARIA and proposed a novel framework and technique to automate the process of estimating required resources to meet application performance goals and complete data processing by a certain time deadline. The approach involves building a job profile from the job's past executions or by executing the application on a smaller data set using an automated profiling tool. To explain more, they benefited from linear regression to predict the job completion time depending on the size of the input dataset and assigned resources. Scaling rules combined with a fast and efficient capacity planning model are applied to generate a set of resource provisioning options.

We investigated and analyzed deadline-aware schedulers. Table 2 provides a summary of popular deadline-aware schedulers in homogeneous Hadoop clusters and their main properties. The analysis table includes references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.

Deadline-aware schedulers in heterogeneous clusters

Jabbari et al. [ 43 ] addressed the challenge of selecting appropriate virtual machines (VMs) and distributing workload efficiently across them to meet both deadline and cost minimization goals in cloud environments. The paper proposes a cost minimization approach to calculate the total hiring cost before and during the computations, based on the application's input size and the required type and number of VMs. The proposed approach uses a daily price fluctuation timetable to schedule MapReduce computations and minimize the total cost while meeting the deadline.

Shao et al. [ 44 ] investigated the service level agreement violation (SLAV) of the YARN cluster using a Fair Scheduling framework. To assign resources to MapReduce jobs, the authors used dynamic capacity management and a deadline-driven policy. A Multi-dimensional Knapsack Problem (MKP) and a greedy algorithm were employed to model and solve the problem, respectively.

Lin et al. [ 45 ] provided a deadline-aware scheduler for MapReduce jobs, DGIA, in a heterogeneous environment. Using the data locality, DGIA meets the deadlines of new tasks. When the deadline of some new tasks is not met, DGIA re-allocates these tasks. The task re-allocation problem is transformed into a well-known network graph problem: minimum cost maximum-flow (MCMF) to optimize the computing resource utilization.

Chen et al. [ 46 ] addressed the problem of Deadline-Constrained MapReduce Scheduling (DCMRS) in heterogeneous environments. Using bipartite graph modeling, they presented a new scheduling method named BGMRS with three major modules: deadline partition, bipartite graph modeling, and scheduling problem transformation. First, BGMRS adaptively determines deadlines for the map and reduce tasks of a job. Second, to capture the relationship between tasks and slots, it forms a weighted bipartite graph. Finally, the DCMRS problem is transformed into the minimum weighted bipartite matching (MWBM) problem to achieve the best allocation between tasks and resources. To solve the MWBM problem, they presented a heuristic method with a node group technique.

Tang et al. [ 47 ] presented a deadline-based MapReduce job scheduler called MTSD, in which users can specify a job's deadline. MTSD introduces a node classification algorithm that measures each node's computing capacity and classifies the nodes of a heterogeneous cluster accordingly. One purpose of node classification is to support a new data distribution model that increases data locality; another is to improve the accuracy of task-remaining-time estimates. To determine a job's priority, MTSD computes the minimum number of map and reduce slots the job requires.

Verma et al. [ 48 ] proposed ARIA, a framework for deadline-based scheduling in Hadoop clusters. It has three major parts. First, a job profile is created for a production job that runs periodically. A job profile shows the characteristics of the job execution during the map, shuffle, sort, and reduce phases. Second, using the job profile, i) the job completion time is estimated according to the assigned map and reduce slots, and ii) the minimum number of map and reduce slots for meeting the job’s deadline is estimated based on Lagrange’s method. Finally, they benefited from the earliest deadline first policy (EDF) to determine job ordering.

Polo et al. [ 49 ] presented the adaptive scheduler for MapReduce multi-job workloads with deadline constraints. The scheduler divides a job into tasks already completed, not yet started, and currently running. It adaptively determines the number of slots required for the job to satisfy the deadline. Therefore, for each job, the amount of pending work is estimated. To this end, this technique investigates both the tasks that have not yet started and the currently running tasks. Based on these two parameters, it calculates the number of slots. Then, the scheduler calculates the priority of each job based on the number of slots to be assigned simultaneously to each job. Jobs are sorted into a queue based on their priority. In order to account for the hardware heterogeneity, nodes can be classified into two groups: those with general-purpose cores and those with specialized accelerators such as the GPU. When a job requires a GPU to run its tasks, the scheduler assigns slots from the nodes with the GPUs to run the tasks of the job [ 90 ].

We investigated and analyzed deadline-aware schedulers. Table 3 provides a summary of popular deadline-aware schedulers in heterogeneous Hadoop clusters and their main properties. The analysis table includes references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.

Data locality-aware schedulers

In data locality-aware schedulers, tasks are allocated to the node where the task's input data is stored; otherwise, they are assigned to the node closest to the data node [ 56 ]. Researchers proposed several scheduling algorithms to improve data locality because it minimizes data transfer over the network and mitigates the total execution time of tasks, thus improving the Hadoop performance [ 4 ]. Therefore, improving data locality is a crucial problem in MapReduce clusters. In this section, we review several important data locality-aware schedulers.

Kalia et al. [ 50 ] tackled the issue of heterogeneous computing nodes in a Hadoop cluster, which can lead to slower job execution times due to varying processing capabilities. To address this challenge, the authors introduced a K-Nearest Neighbor (KNN) based scheduler that employs speculative prefetching and clustering of intermediate map outputs before sending them to the reducer for final processing. The proposed algorithm prefetches input data and schedules intermediate key-value pairs to reduce tasks using the KNN clustering algorithm with Euclidean distance measure. The study concludes that their scheduler, based on clustering, enhances data locality rate and improves execution time.

Li et al. [ 51 ] concentrated on optimizing computing task scheduling performance on the Hadoop big data platform. They introduced an enhanced task scheduling algorithm for Hadoop that evaluates the data localization saturation of each node and prioritizes nodes with low saturation to prevent preemption by nodes with high saturation. The authors concluded that their proposed scheduler enhances data locality and overall performance and reduces job execution time in the Hadoop environment.

Fu et al. [ 52 ] addressed the problem of cross-node/rack data transfer in the distributed computing framework Spark, which can degrade performance by prolonging total execution time and causing network congestion. The authors propose an optimal locality-aware task scheduling algorithm that uses bipartite graph modeling and considers global optimality to generate the optimal scheduling solution for both map and reduce tasks. The algorithm calculates the communication cost matrix of tasks and formulates an optimal task scheduling scheme that minimizes overall communication cost, which is transformed into the minimum weighted bipartite matching problem and solved with the Kuhn-Munkres algorithm. The paper also proposes a locality-aware executor allocation strategy to further improve data locality.
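The reduction to minimum weighted bipartite matching can be illustrated on a tiny communication-cost matrix. For clarity we brute-force the objective; a real scheduler would use the polynomial-time Kuhn-Munkres (Hungarian) method, and the cost values here are made up.

```python
from itertools import permutations

# Minimum weighted bipartite matching on a toy cost matrix, where
# cost[i][j] is the (assumed) communication cost of running task i on
# executor j. Brute force is exponential but exact on small inputs;
# Kuhn-Munkres solves the same problem in O(n^3).

def min_cost_assignment(cost):
    n = len(cost)
    best = None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best is None or total < best[0]:
            best = (total, perm)
    return best  # (total cost, tuple mapping task -> executor)

# Example: task 0's data sits near executor 1, task 1's near executor 0.
cost = [[9, 1],
        [2, 8]]
```

Here the matching sends task 0 to executor 1 and task 1 to executor 0, for a total cost of 3 instead of 17.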

Gandomi et al. [ 53 ] discussed the importance of data locality-aware scheduling in the Hadoop MapReduce framework, which is designed to process big data on commodity hardware using the divide and conquer approach. The authors propose a new hybrid scheduling algorithm called HybSMRP, which focuses on increasing data locality rate and decreasing completion time by using dynamic priority and localization ID techniques. The algorithm is compared to Hadoop default scheduling algorithms, and experimental results show that HybSMRP can often achieve high data locality rates and low average completion times for map tasks.

He et al. [ 54 ] presented a map task scheduler called MatchMaking to increase data locality. First, when the first job has no map task local to the requesting node, the scheduler searches the succeeding jobs. Second, to give every node a fair chance at its local tasks, a node that fails to find a local task for the first time is not assigned any non-local task; it receives no map task during that heartbeat interval. A node that fails to find a local task a second time is assigned a non-local task to prevent wasting computing resources. Each slave node carries a locality status marker, and based on this marked value the scheduler decides whether a non-local task may be allocated to the node when it has no local map task. Third, a slave node can be assigned at most one non-local task per heartbeat. Lastly, when a new job is added to the queue, the locality markers of all slave nodes are cleared, since the nodes may hold local tasks for the new job.
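The marker rules above can be condensed into a small sketch; the class layout and method names are our own simplification of the described behavior.

```python
# Condensed sketch of the MatchMaking rules: a node that fails to find a
# local task once is skipped for this heartbeat; on a second consecutive
# failure it receives a non-local task; markers reset on new job arrival.

class MatchMaking:
    def __init__(self):
        self.miss = {}  # node -> consecutive heartbeats without a local task

    def on_new_job(self):
        self.miss.clear()  # the new job may have local tasks anywhere

    def assign(self, node, local_task, nonlocal_task):
        if local_task is not None:
            self.miss[node] = 0
            return local_task
        misses = self.miss.get(node, 0) + 1
        self.miss[node] = misses
        if misses >= 2:        # second failure: avoid wasting the slot
            self.miss[node] = 0
            return nonlocal_task
        return None            # first failure: wait one heartbeat
```

A node thus trades at most one idle heartbeat for the chance of a local task before falling back to remote work.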

Ibrahim et al. [ 55 ] presented a map task scheduler, Maestro, to increase the data locality. To accomplish this, the chunks’ locations, replicas’ locations, and how many other chunks each node hosts are tracked by Maestro. There are two waves of scheduling that the Maestro employs: the first wave scheduler and the run-time scheduler. Taking into account the number of map tasks hosted on each node and the replication scheme for the input data, the first wave scheduler fills up empty slots on each node. The run-time scheduler determines a task's probability of being scheduled on a certain machine based on replicas of its input data.

Zhang et al. [ 56 ] proposed a MapReduce job scheduling technique to increase data locality in heterogeneous systems. When a node sends a request, the scheduler assigns it a task whose input data is stored on that node. In the absence of such tasks, the task whose input data is closest to that node is selected, and its transmission and waiting times are calculated. If the waiting time is shorter than the transmission time, the task is reserved for the node storing its input data.

Zhang et al. [ 57 ] provided a map task scheduler named next-k-node scheduling (NKS) to achieve data locality. When a node sends a request, the technique schedules tasks whose input data is stored on that node. In the absence of such tasks, a probability is calculated for each map task, and the ones with higher probability are scheduled. A task whose input data is stored on the next k nodes receives a low probability, so the technique reserves such tasks for those nodes.

Zaharia et al. [ 58 ] proposed a delay scheduling algorithm to increase data locality. Upon receiving a request from a node, if the head-of-line job cannot launch a local task, the scheduler skips it and looks for a local task from subsequent jobs. The maximum delay time is set to D. When the scheduler has skipped a job for longer than D, the job is allocated to a node with non-local data to avoid starvation [ 91 ]. We investigated and analyzed the data locality-aware schedulers. Table 4 provides a summary of data locality-aware schedulers and their main properties. The analysis table includes references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.
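A minimal sketch of the skipping rule, with the delay bound expressed as a maximum number of skips rather than a wall-clock D; the job representation and counter handling are illustrative assumptions.

```python
# Toy delay scheduling: skip the head-of-line job while it has no task
# local to the requesting node, but after max_skips consecutive skips
# launch a non-local task so the job does not starve.

def delay_schedule(jobs, node, max_skips):
    """jobs: list of dicts with 'skips', per-node 'local' lists, 'remote'."""
    for job in jobs:
        local = job["local"].get(node)
        if local:
            job["skips"] = 0
            return local.pop(0)
        if job["skips"] >= max_skips:   # waited long enough: go non-local
            job["skips"] = 0
            return job["remote"].pop(0)
        job["skips"] += 1               # skip this job for now
    return None
```

The trade-off mirrors the paper's observation: a short wait usually finds a local slot, while the bound keeps the wait from becoming starvation.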

Cost-aware schedulers

In big data platforms, data centers store a huge amount of data. Processing this data requires thousands of nodes in Hadoop clusters. Such large clusters consume enormous amounts of power and increase the cost of data centers. Therefore, we face a big challenge in minimizing cost in MapReduce clusters [ 31 ]. In this section, we survey and review several important cost-aware schedulers that reduce the cost of MapReduce systems. Finally, the reviewed schedulers are compared and summarized.

Seethalakshmi et al. [ 59 ] proposed a new scheduling method based on a Real Coded Genetic Algorithm (RCGA) to effectively allocate nodes in heterogeneous Hadoop settings. The paper evaluates metrics such as load, the makespan of each Virtual Machine (VM), execution time, and the memory constraints of each job to identify the challenges in allocating jobs to nodes. The authors propose a solution based on work classification, where jobs are categorized into 'n' classes based on execution rate, priority, and arrival rate. The best set of work classes for each VM is then proposed to solve the issue of resource and work mismatch. The final scheduling is done by the RCGA optimization model, which considers fairness and minimum share satisfaction. Experiments show that the method outperforms existing systems in terms of metrics such as execution time, cost, resource utilization, and throughput.

Tang et al. [ 60 ] addressed the problem of scheduling cloud applications with precedence-constrained tasks that are deadline-constrained and must be executed at minimum financial cost. They proposed a heuristic cost-efficient task scheduling strategy called CETSS, which includes a workflow DDAG model, task sub-deadline initialization, a greedy workflow scheduling algorithm, and a task adjusting method. The greedy workflow scheduling algorithm combines a dynamic task-renting billing period sharing method with an unscheduled-task sub-deadline relaxation technique.

Vinutha et al. [ 61 ] proposed a scheduling algorithm to optimize the MapReduce jobs for performance improvement in processing big data using Hadoop. The goal of the algorithm is to reduce the budget and execution time of cloud models by establishing a relationship between the scheduling of jobs and the allocation of resources. The earliest finish time is considered for cloud resource optimization to assign the map tasks. The algorithm schedules tasks based on the availability of slots and available resources in the cluster. The authors evaluate their proposed method on word count with different input data sizes.

Javanmardi et al. [ 62 ] proposed a high-level architecture model for scheduling in heterogeneous Hadoop clusters. The proposed model reduces the scheduling load by performing part of the scheduling in the user system. They also present a scheduler based on the base unit that can estimate the execution time in heterogeneous Hadoop clusters with low overhead and high accuracy, while being resistant to node failure. The scheduler considers the cost of transfer and processing of jobs in the clusters, which leads to a reduction in the cost of executing the jobs. The paper also designs four algorithms for the scheduler, including the estimation of execution time in the user system, distributing the input data of jobs between data nodes based on performance, job scheduling, and task scheduling.

Rashmi et al. [ 63 ] proposed a cost-effective workflow scheduler for Hadoop in cloud data centers. The motivation behind the scheduling issue is the need to efficiently allocate resources to complete MapReduce jobs within the deadline and at a lower cost. The proposed scheduler takes into account the workflow as a whole rather than treating each job separately, as in many existing schedulers. The scheduler creates and maintains virtual machines (VMs) for jobs in a workflow, even after their completion to avoid time wastage and overheads.

Zacheilas et al. [ 64 ] proposed a novel framework called ChEsS for cost-effective scheduling of MapReduce workloads in multiple cluster environments. The scheduling problem is challenging due to the tradeoff between performance and cost, the presence of locality constraints, and the use of different intra-job scheduling policies across clusters. The goal of the framework is to automatically suggest jobs-to-clusters assignments that minimize the required budget and optimize the end-to-end execution time of the submitted jobs. The framework estimates the impact of various parameters, such as job locality constraints, on the user's makespan/budget and detects near-optimal job-to-cluster assignments by efficiently searching the solution space.

Palanisamy et al. [ 65 ] proposed a new MapReduce cloud service model called Cura for cost-effective provisioning of MapReduce services in a cloud. Unlike existing services, Cura is designed to handle production workloads that have a significant amount of interactive jobs. It leverages MapReduce profiling to automatically create the best cluster configuration for the jobs, implementing a globally efficient resource allocation scheme that significantly reduces the resource usage cost in the cloud. Cura achieves this by effectively multiplexing the available cloud resources among the jobs based on the job requirements and by using core resource management schemes such as cost-aware resource provisioning, VM-aware scheduling, and online virtual machine reconfiguration.

Chen et al. [ 66 ] addressed the problem of optimizing resource provisioning for MapReduce programs in the public cloud to minimize the monetary or time cost for a specific job. The authors study the components in MapReduce processing and build a cost function that models the relationship among the amount of data, the available system resources (Map and Reduce slots), and the complexity of the Reduce function for the target MapReduce program. The model parameters can be learned from test runs, and based on this cost model, the authors propose an approach called Cloud RESource Provisioning (CRESP) to solve a number of decision problems, such as the optimal amount of resources that can minimize the monetary cost with the constraint on monetary budget or job finish time.
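The flavor of such a cost model can be sketched as a brute-force search over slot configurations; the shape of the time function and all constants below are assumptions for illustration, not CRESP's learned parameters.

```python
# Toy CRESP-style provisioning: model job time as a function of map and
# reduce slots (parameters would be learned from test runs), then search
# for the cheapest configuration that still meets the time constraint.

def job_time(m, r, map_work=600.0, reduce_work=200.0, overhead=10.0):
    """Estimated job time (s) with m map slots and r reduce slots."""
    return map_work / m + reduce_work / r + overhead

def cheapest_config(deadline, price_per_slot_hour=0.1, max_slots=20):
    best = None
    for m in range(1, max_slots + 1):
        for r in range(1, max_slots + 1):
            t = job_time(m, r)
            if t > deadline:
                continue  # violates the time constraint
            cost = (m + r) * price_per_slot_hour * t / 3600.0
            if best is None or cost < best[0]:
                best = (cost, m, r)
    return best  # (monetary cost, map slots, reduce slots) or None
```

The same search run with a budget constraint instead of a deadline would answer the dual question CRESP also targets: the fastest configuration affordable within a budget.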

We investigated and analyzed cost-aware schedulers. Table 5 provides a summary of popular cost-aware schedulers and their main properties. The analysis table includes references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.

Resource-aware schedulers

In big data applications, data centers are deployed on large Hadoop clusters. The nodes in these clusters receive a large number of jobs and require more resources to execute them. As a result, race conditions arise among jobs that demand resources such as CPU, memory, and I/O [ 92 ]. Therefore, improving cluster resource utilization has become a major concern in MapReduce clusters. This section presents some of the most popular resource-aware schedulers, which increase resource utilization.

Aarthee et al. [ 67 ] proposed a new scheduler, called the Dynamic Performance Heuristic-based Bin Packing (DP-HBP) MapReduce scheduler, to improve resource utilization in a heterogeneous virtualized environment. By analyzing the exact combination of vCPU and memory capacities, the scheduler can allocate resources effectively and improve the performance of the entire virtual cluster; in other words, it allocates the proper number of virtual machine cores and the right amount of memory at the data centers for cloud users. The scheduler is also a generalized model that can handle data-intensive MapReduce jobs regardless of their nature.
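As an illustration of the packing idea (not DP-HBP's actual heuristic), a first-fit-decreasing sketch over the two resource dimensions:

```python
# First-fit-decreasing bin packing over (vCPU, memory) demands: sort
# tasks by size, place each in the first machine with enough remaining
# capacity, and open a new machine only when none fits.

def first_fit_decreasing(tasks, capacity):
    """tasks: list of (vcpu, mem) demands; capacity: (vcpu, mem) per machine."""
    bins = []    # remaining (vcpu, mem) capacity of each opened machine
    assign = {}  # task index -> machine index
    for i in sorted(range(len(tasks)), key=lambda i: tasks[i], reverse=True):
        cpu, mem = tasks[i]
        for b, (free_cpu, free_mem) in enumerate(bins):
            if cpu <= free_cpu and mem <= free_mem:
                bins[b] = (free_cpu - cpu, free_mem - mem)
                assign[i] = b
                break
        else:
            bins.append((capacity[0] - cpu, capacity[1] - mem))
            assign[i] = len(bins) - 1
    return assign, len(bins)
```

Fewer opened machines for the same task set means higher utilization per machine, which is the effect DP-HBP pursues with its own heuristic.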

Jeyaraj et al. [ 68 ] addressed the challenge of resource utilization in virtual clusters running Hadoop MapReduce workloads, which can suffer from heterogeneities at the hardware, virtual machine, performance, and workload levels. The authors propose an efficient MapReduce scheduler called ACO-BP that places the right combination of map and reduce tasks in each virtual machine to improve resource utilization. They transform the MapReduce task scheduling problem into a 2-Dimensional bin packing model and obtain an optimal schedule using the ant colony optimization (ACO) algorithm. The ACO-BP scheduler minimizes the makespan for a batch of jobs and outperforms three existing schedulers that work well in a heterogeneous environment. The authors conclude that their proposed scheduler is an effective solution to improve resource utilization in virtual clusters running Hadoop MapReduce workloads.

Zhang et al. [ 69 ] presented a phase-level MapReduce scheduler called PRISM. This scheduler divides tasks into phases and schedules them at the phase level, assigning resources according to the phase each task is running. The authors proposed a heuristic algorithm to determine the order in which phases are scheduled on a machine: each phase is assigned a utility value that reflects the benefit of scheduling it, calculated from the phase's fairness and job performance, and the phase with the highest utility is chosen. The utility value depends on the phase type. If the phase is a map or shuffle phase, a new map or reduce task is selected for scheduling, and the scheduler derives the phase's utility from the higher degree of parallelism gained by running an extra task. For other phases, PRISM determines the utility from the urgency of finishing the phase, measured by how long it has been paused in seconds; a task whose execution has been paused for a long time needs to be scheduled as soon as possible.

Rasooli et al. [ 70 ] proposed a Hadoop scheduling algorithm called COSHH for heterogeneous environments. Utilizing k-means clustering, COSHH categorizes jobs into classes according to their requirements. The scheduler then determines which jobs and resources are best suited to each other: it builds a linear program (LP) defined by the characteristics of the job classes and the resources, and after solving the LP it finds a class set for each resource. Afterward, COSHH allocates jobs to resources based on fairness and minimum share requirements. COSHH comprises two major steps: first, upon the arrival of a new job, the algorithm stores it in the proper queue; second, when a heartbeat message is received, the algorithm allocates a job to a free resource.

Polo et al. [ 71 ] proposed the Adaptive Scheduler called RAS for MapReduce multi-job workloads. The main purpose of the method is to utilize the resources efficiently. This scheduler proposes the "job slot" concept instead of the "task slot." Each job slot is a specific slot for a certain job. RAS computes the number of concurrent tasks that need to be assigned to complete a job before its deadline. This calculation is performed using the deadline, the number of pending tasks, and the average task length. Then, each job is assigned a utility value by RAS. The placement algorithm uses a utility value to choose the best placement of tasks on TaskTrackers.

Sharma et al. [ 72 ] proposed MROrchestrator, a fine-grained, dynamic, and coordinated resource management framework. It has two functions: resource bottleneck detection, which collects the run-time resource allocation information of each task and identifies resource bottlenecks, and resource bottleneck mitigation, which resolves the bottlenecks through coordinated resource allocations.

Pastorelli et al. [ 73 ] proposed a scheduler for Hadoop called HFSP that achieves fairness and short response times. HFSP uses size-based scheduling to assign cluster resources to jobs. Size-based scheduling requires job sizes, but Hadoop has no a priori knowledge of them, so HFSP estimates job sizes during execution. Job priorities are then computed with an aging function, and the scheduler allocates resources to jobs by priority. For both small and large jobs, aging is used to prevent starvation.
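The aging idea can be caricatured in a single function; the linear decay and its rate are our own toy choices, not HFSP's virtual-time mechanism.

```python
# Toy aging function: a job's effective priority value starts at its
# estimated size (smaller = served sooner) and shrinks the longer the
# job waits, so large jobs eventually reach the front of the queue.

def aged_priority(est_size, wait_time, aging_rate=0.5):
    """Lower value = scheduled sooner; waiting erodes the size penalty."""
    return max(est_size - aging_rate * wait_time, 0.0)
```

Freshly arrived small jobs beat freshly arrived large ones, but a large job that has waited long enough is no longer postponed.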

Tian et al. [ 74 ] discussed optimal resource provisioning for executing MapReduce programs in public clouds. Using a cost model, they captured, for the target MapReduce job, the relationship between the amount of input data, the Map and Reduce slots, and the complexity of the Reduce function. The cost model can determine how many resources are needed to minimize cost under a specified deadline, or to minimize execution time under a specified budget [ 93 ].

Ghoneem et al. [ 75 ] provided a MapReduce scheduling method to improve efficiency and performance in the heterogeneous cluster. The scheduler uses a classification algorithm based on job processing requirements and resources available in order to categorize jobs as executable and nonexecutable. To obtain the best performance, the scheduler allocates the executable jobs to the proper nodes. We described the most popular resource-aware schedulers. Table 6 provides the summary of the main properties of resource-aware schedulers. The analysis table includes references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.

Makespan-aware schedulers

The makespan (total completion time) of a set of jobs is the total amount of time required to complete all the jobs. To increase a cluster's performance, the makespan should be minimized by distributing the data across the nodes; a low makespan is also a major goal for any scheduler [ 94 ]. Therefore, minimizing the makespan has become an important issue in MapReduce clusters. In this section, we first review several important makespan-aware schedulers. Then, the reviewed schedulers are compared and summarized.

Varalakshmi et al. [ 76 ] proposed a new job scheduler, called the virtual job scheduler (VJS), which is designed to schedule MapReduce jobs in a heterogeneous cluster. VJS creates a virtual job set by considering the CPU and IO resource utilization levels of each job waiting in the execution queue. The partitioning algorithm is the core of VJS, and the authors proposed two novel partitioning algorithms: two-level successive partitioning (TLSP) and predictive partitioning (PRED). The goal of TLSP is to optimize the busy time of reducers in environments where the higher-level scheduler is aware of the idle time of individual reducers. On the other hand, the goal of PRED is to optimize the overall time taken by the Reducer phase, including the idle time of reducers, and is found to produce better makespan and near-zero wait time in all scenarios, despite the prediction error associated with it. Therefore, PRED is used to partition the input of individual jobs in the virtual job set.

Maleki et al. [ 77 ] presented a secure and performance-aware optimization framework called SPO to minimize the makespan of tasks using a two-stage scheduler. SPO applies the HEFT algorithm in the Map and Reduce stages and considers network traffic in the shuffling phase. Moreover, the authors proposed a mathematical optimization model of the scheduler to estimate system performance under the security constraints of tasks.

Maleki et al. [ 78 ] proposed a two-stage MapReduce task scheduler for heterogeneous environments called TMaR, which aims to minimize the makespan of a batch of tasks while considering network traffic. The authors highlight the importance of scheduling Map tasks in cloud deployments of MapReduce, where input data is located on remote storage. TMaR schedules Map and Reduce tasks on servers that minimize the task finish time in each stage, respectively. The proposed dynamic partition binder for Reduce tasks in the Reduce stage reduces shuffling traffic, and TMaR + extends TMaR to improve total power consumption of the cluster, reducing it up to 12%.

Jiang et al. [ 79 ] studied MapReduce scheduling on n parallel machines with different speeds, where each job contains map tasks and reduce tasks, and the reduce tasks can only be processed after all map tasks finish. The authors consider both offline and online scheduling problems and propose approximation algorithms with worst-case ratios for non-preemptive and preemptive reduce tasks. In the offline version, they propose an algorithm with a worst-case ratio of max{1 + Δ^2 − 1/n, Δ} for non-preemptive reduce tasks, where n is the number of servers and Δ is the ratio between the maximum and minimum server speeds. They also design a 2-ratio algorithm for preemptive reduce tasks. In the online version, the authors devise two heuristics for non-preemptive and preemptive reduce tasks, respectively, based on the offline algorithms.

Verma et al. [ 80 ] designed a two-stage scheduler called the BalancedPools to minimize the makespan of multi-wave batch jobs. The method divides the jobs into two pools with the same makespan and assigns resources equally among the pools. Then, to minimize the makespan of each pool, the Johnson algorithm is applied within each pool to determine an order of jobs. Finally, the MapReduce simulator SimMR estimates the overall makespan for two pools.
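Johnson's rule for the two-machine flow shop, which BalancedPools applies within each pool, can be sketched directly; the job durations below are made up.

```python
# Johnson's rule for a two-stage (map, then reduce) flow shop: jobs whose
# map time is at most their reduce time run first in increasing map time;
# the remaining jobs run last in decreasing reduce time. This ordering
# minimizes the makespan of the two-stage pipeline.

def johnson_order(jobs):
    """jobs: list of (name, map_time, reduce_time); returns ordered names."""
    head = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    tail = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: -j[2])
    return [j[0] for j in head + tail]

jobs = [("A", 4, 9), ("B", 8, 3), ("C", 2, 6), ("D", 7, 5)]
```

Intuitively, short map phases go first so the reduce stage fills up early, and short reduce phases go last so the map stage drains without leaving the reduce stage idle.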

Yao et al. [ 81 ] proposed TuMM, a slot management scheme that enables dynamic slot configuration in Hadoop. The scheduling goal is to utilize resources efficiently and minimize the makespan of a batch of jobs. Two major modules achieve this: the Workload Monitor and the Slot Assigner. To predict the present workloads of map and reduce tasks, the Workload Monitor periodically collects prior workload information, including the execution times of completed tasks. The Slot Assigner then uses these estimates to adjust each node's slot ratio between map and reduce tasks, which serves as a tunable knob between the two.

Zheng et al. [ 82 ] studied a joint scheduling optimization for MapReduce in which the map and shuffle phases overlap and execute simultaneously; the scheduling goal is to minimize the average job makespan. The main difficulty is that the map and shuffle phases have a dependency relationship and cannot be fully parallelized: the shuffle phase can only transfer data after the map phase emits it. To address this, the authors introduce the concept of a "strong pair": two jobs are considered a strong pair if their map and shuffle workloads are equal. They prove that when all jobs can be decomposed into strong pairs, the optimal schedule runs the jobs pairwise. Several offline and online scheduling policies are presented to execute jobs in this pairwise manner: jobs are first classified by workload, and then executed pairwise within each group.
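
One reading of the strong-pair definition, pairing jobs whose (map, shuffle) workloads match, can be sketched as a simple bucketing step. The field layout and this particular pairing rule are assumptions for illustration, not the paper's exact construction.

```python
from collections import defaultdict

def strong_pairs(jobs):
    """Pair jobs with identical (map, shuffle) workloads so that one job's
    shuffle can overlap the other's map. Each job is (name, map_w, shuffle_w).
    Returns (pairs, leftover job names)."""
    buckets = defaultdict(list)
    for name, map_w, shuffle_w in jobs:
        buckets[(map_w, shuffle_w)].append(name)
    pairs, leftovers = [], []
    for group in buckets.values():
        while len(group) >= 2:
            pairs.append((group.pop(), group.pop()))
        leftovers.extend(group)
    return pairs, leftovers
```

Jobs that cannot be paired fall back to whatever ordering the surrounding policy applies to leftovers.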

Tang et al. [ 83 ] proposed MRWS, an optimized scheduling algorithm for MapReduce workflows in heterogeneous clusters. Workflows are modeled as DAGs of MapReduce jobs. The scheduler comprises a job-prioritizing phase and a task-allocation phase. First, the jobs are divided into I/O-intensive and computing-intensive categories, and each job's priority is determined with respect to its category. After that, each block is assigned a slot, and tasks are scheduled based on data locality [ 95 ]. We have described the most popular makespan-aware schedulers. Table 7 provides a summary of their main properties, including references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.
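
MRWS's job-prioritizing phase can be sketched as a two-way split by resource profile. The threshold, the field names, and the priority rule here are illustrative assumptions rather than MRWS's exact criteria.

```python
def prioritize_jobs(jobs, io_ratio_threshold=1.0):
    """Split jobs into I/O-intensive and compute-intensive classes by the
    ratio of estimated I/O time to CPU time, then rank each class so the
    highest-priority jobs are scheduled first."""
    io_jobs = [j for j in jobs if j["io_time"] / j["cpu_time"] >= io_ratio_threshold]
    cpu_jobs = [j for j in jobs if j["io_time"] / j["cpu_time"] < io_ratio_threshold]
    by_priority = lambda j: -j["priority"]
    return sorted(io_jobs, key=by_priority), sorted(cpu_jobs, key=by_priority)
```

The two ordered lists would then feed the task-allocation phase, which maps tasks to slots with data locality in mind.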

Learning-aware schedulers

In this section, we first review several important learning-aware schedulers. Then, the reviewed schedulers are compared and summarized.

Ghazali et al. [ 84 ] focused on the scheduling of MapReduce jobs in Hadoop, and specifically on the importance of data and cache locality for performance. The authors propose CLQLMRS (Cache Locality with Q-Learning in MapReduce Scheduler), a job scheduler that uses reinforcement learning to optimize both data and cache locality. The scheduler is evaluated through experiments in a heterogeneous environment; the results demonstrate a significant reduction in execution time, and thus improved Hadoop performance, compared with the FIFO, Delay, and Adaptive Cache Local schedulers.

Naik et al. [ 85 ] focused on the challenges of MapReduce job scheduling in heterogeneous environments and the importance of data locality for the performance of the MapReduce framework. Data locality, which involves moving computation closer to the input data for faster access, is a critical factor in heterogeneous Hadoop clusters, yet the existing MapReduce framework does not fully consider heterogeneity from a data locality perspective. To address this, the paper proposes a novel hybrid scheduler based on reinforcement learning that identifies true straggler tasks and schedules them on fast processing nodes in the heterogeneous cluster, taking data locality into account.
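
Straggler identification by progress rate can be sketched as below. The 50% threshold and the progress-rate criterion are illustrative assumptions; the paper's scheduler learns its policy rather than applying a fixed rule.

```python
def stragglers(tasks, slowdown=0.5):
    """Flag tasks whose progress rate (progress / elapsed time) falls
    below a fraction of the average rate across running tasks.
    `tasks` maps a task name to (progress in [0, 1], elapsed seconds)."""
    rates = {t: p / e for t, (p, e) in tasks.items() if e > 0}
    avg = sum(rates.values()) / len(rates)
    return sorted(t for t, r in rates.items() if r < slowdown * avg)
```

Flagged tasks would then be candidates for speculative re-execution on faster nodes holding (or near) a replica of their input.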

Naik et al. [ 86 ] proposed a novel MapReduce scheduler called the MapReduce Reinforcement Learning (MRRL) scheduler, which leverages reinforcement learning to adaptively schedule tasks in heterogeneous environments. The MRRL scheduler observes the system state during task execution, identifies slower tasks, and suggests speculative re-execution of these tasks on other available nodes in the cluster to achieve faster completion. The approach requires no prior knowledge of the environment's characteristics and adapts to the heterogeneous environment over time. The authors employ the SARSA learning algorithm, a model-free approach in which state transitions depend on the scheduler's own actions. The state-determination criterion and reward function in MRRL are designed around the objective of minimizing job completion time.
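
The SARSA update at the heart of such a scheduler is compact. The state/action encoding and the reward in the example below are illustrative assumptions (the paper shapes its reward around job completion time):

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: move Q(s, a) toward reward + gamma * Q(s', a').
    A state might encode node load, an action a task placement, and the
    reward the negative of the observed task completion time."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        reward + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    )
    return Q
```

Unlike Q-learning, SARSA bootstraps from the action the policy actually takes next, which is why it is called an on-policy method.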

We described several learning-aware schedulers. Table 8 provides a summary of the main properties of learning-aware schedulers. The analysis table includes references, key ideas, advantages and disadvantages, comparison algorithms, and evaluation techniques.

To respond to RQ5, this section presents a comparative analysis of the scheduling metrics addressed in Hadoop MapReduce research. Table 9 and Fig. 8 show how the selected studies addressed these metrics. Figure 8 demonstrates that 19% of the studied algorithms used the deadline metric, 17% used data locality, 15% addressed execution time, 12% addressed makespan, 11% used completion time, 9% used cost, 8% used response time, 7% used resource utilization, and 2% addressed throughput. The majority of algorithms have thus focused on the deadline and data locality metrics.

figure 8

Percentage of scheduling metrics in reviewed algorithms

Figure 9 shows the evaluation techniques used in the selected studies. Implementation is the most common technique, used by 69% of the studies; 6% used simulation, and 25% used both implementation and simulation.

figure 9

Evaluation techniques used by studies addressed by the reviewed algorithms

Scheduling in Hadoop MapReduce is an important challenge facing Hadoop systems. In this paper, we provided a comprehensive systematic study of MapReduce scheduling in Hadoop. First, an overview of the major Hadoop components was presented. Guided by our research questions, 53 primary studies were selected from more than 500 papers. We then thoroughly reviewed and individually analysed the selected MapReduce scheduling algorithms. Based on our research method, we classified these schedulers into six categories: deadline-aware, data locality-aware, cost-aware, resource-aware, makespan-aware, and learning-aware schedulers. We compared the studies in terms of key ideas, main objectives, advantages, disadvantages, comparison algorithms, and evaluation techniques, and summarized the results in a table for each category.

Availability of data and materials

Not applicable.

Assunção MD et al (2015) Big Data computing and clouds: Trends and future directions. J Parallel Distributed Comput 79:3–15


Thusoo A et al (2010) "Hive-a petabyte scale data warehouse using hadoop." 2010 IEEE 26th international conference on data engineering (ICDE 2010). IEEE

Deshai N et al (2019) "Big data Hadoop MapReduce job scheduling: A short survey." Information Systems Design and Intelligent Applications: Proceedings of Fifth International Conference INDIA 2018 Volume 1. Springer, Singapore

Hu H et al (2014) Toward scalable systems for big data analytics: A technology tutorial. IEEE Access 2:652–687


Chen CP, Zhang C-Y (2014) Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Inf Sci 275:314–347

Chen M, Mao S, Liu Y (2014) Big data: A survey. Mobile Netw Appl 19:171–209

Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

Bakni N-E, Assayad I (2021) Survey on improving the performance of MapReduce in Hadoop. In: Proceedings of the 4th International Conference on Networking, Information Systems & Security

Zhang B, Wang X, Zheng Z (2018) The optimization for recurring queries in big data analysis system with MapReduce. Futur Gener Comput Syst 87:549–556

Kashgarani H, Kotthoff L (2021) "Is algorithm selection worth it? Comparing selecting single algorithms and parallel execution." AAAI Workshop on Meta-Learning and MetaDL Challenge. PMLR

Pakize SR (2014) A comprehensive view of Hadoop MapReduce scheduling algorithms. Int J Comput Netw Commun Secur 2(9):308–317

Kang Y, Pan L, Liu S (2022) Job scheduling for big data analytical applications in clouds: A taxonomy study. Futur Gener Comput Syst 135:129–145

Bhosale HS, Gadekar DP (2014) Big data processing using hadoop: survey on scheduling. Int J Sci Res 3(10):272–277

Shvachko K et al (2010) "The hadoop distributed file system." 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). Ieee

Khushboo K, Gupta N (2021) "Analysis of hadoop MapReduce scheduling in heterogeneous environment." Ain Shams Engineering Journal 12(1):1101–1110

White T (2012) Hadoop: The definitive guide. " O'Reilly Media, Inc."

Lu Z et al (2018) IoTDeM: An IoT big data-oriented MapReduce performance prediction extended model in multiple edge clouds. J Parallel Distributed Comput 118:316–327

Singh R, Kaur PJ (2016) Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud. J Big Data 3(1):1–10

Wang H et al (2015) BeTL: MapReduce checkpoint tactics beneath the task level. IEEE Trans Serv Comput 9(1):84–95

Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: An update. Inf Softw Technol 64:1–18

Cruz-Benito J (2016) Systematic literature review & mapping

Lu Q et al (2015) "MapReduce job optimization: a mapping study." 2015 International Conference on Cloud Computing and Big Data (CCBD). IEEE

Ghazali R et al (2021) A classification of Hadoop job schedulers based on performance optimization approaches. Clust Comput 24(4):3381–3403

Abdallat AA, Alahmad AI, AlWidian JA (2019) Hadoop mapreduce job scheduling algorithms survey and use cases. Mod Appl Sci 13(7):1–38

Hashem IAT et al (2020) MapReduce scheduling algorithms: a review. J Supercomput 76:4915–4945

Soualhia M, Khomh F, Tahar S (2017) Task scheduling in big data platforms: a systematic literature review. J Syst Softw 134:170–189

Khezr SN, Navimipour NJ (2017) MapReduce and its applications, challenges, and architecture: a comprehensive review and directions for future research. J Grid Comput 15:295–321

Senthilkumar M, Ilango P (2016) A survey on job scheduling in big data. Cybern Inf Technol 16(3):35–51

Hashem IAT et al (2016) MapReduce: Review and open challenges. Scientometrics 109:389–422

Li R et al (2016) MapReduce parallel programming model: a state-of-the-art survey. Int J Parallel Prog 44:832–866

Tiwari N et al (2015) Classification framework of MapReduce scheduling algorithms. ACM Comput Surveys (CSUR) 47(3):1–38

Polato I et al (2014) A comprehensive view of Hadoop research—A systematic literature review. J Netw Comput Appl 46:1–25

Gao Y, Zhang K (2022) "Deadline-aware preemptive job scheduling in hadoop yarn clusters." 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE

Cheng D et al (2018) Deadline-aware MapReduce job scheduling with dynamic resource availability. IEEE Trans Parallel Distrib Syst 30(4):814–826

Kao Y-C, Chen Y-S (2016) Data-locality-aware mapreduce real-time scheduling framework. J Syst Softw 112:65–77

Verma A et al (2012) "Deadline-based workload management for MapReduce environments: Pieces of the performance puzzle." 2012 IEEE Network Operations and Management Symposium. IEEE

Phan LT et al (2011) "An empirical analysis of scheduling techniques for real-time cloud-based data processing." 2011 IEEE International Conference on Service-Oriented Computing and Applications (SOCA). IEEE

Kc K, Anyanwu K (2010) "Scheduling hadoop jobs to meet deadlines." 2010 IEEE Second International Conference on Cloud Computing Technology and Science. IEEE

Teng F et al (2014) A novel real-time scheduling algorithm and performance analysis of a MapReduce-based cloud. J Supercomput 69(2):739–765

Wang X et al (2015) SAMES: deadline-constraint scheduling in MapReduce. Front Comp Sci 9:128–141

Dong X, Wang Y, Liao H (2011) "Scheduling mixed real-time and non-real-time applications in mapreduce environment." 2011 IEEE 17th International Conference on Parallel and Distributed Systems. IEEE

Verma A, Cherkasova L, Campbell RH (2011) "Resource provisioning framework for mapreduce jobs with performance goals." Middleware 2011: ACM/IFIP/USENIX 12th International Middleware Conference, Lisbon, Portugal, December 12-16, 2011. Proceedings 12. Springer Berlin Heidelberg

Jabbari A et al (2021) "A Cost-Efficient Resource Provisioning and Scheduling Approach for Deadline-Sensitive MapReduce Computations in Cloud Environment." 2021 IEEE 14th International Conference on Cloud Computing (CLOUD). IEEE

Shao Y et al (2018) Efficient jobs scheduling approach for big data applications. Comput Ind Eng 117:249–261

Lin J-W, Arul JM, Lin C-Y (2019) Joint deadline-constrained and influence-aware design for allocating MapReduce jobs in cloud computing systems. Clust Comput 22:6963–6976

Chen C-H, Lin J-W, Kuo S-Y (2015) MapReduce scheduling for deadline-constrained jobs in heterogeneous cloud computing systems. IEEE Trans Cloud Comput 6(1):127–140

Tang Z et al (2013) A MapReduce task scheduling algorithm for deadline constraints. Clust Comput 16:651–662

Verma A, Cherkasova L, Campbell RH (2011) ARIA: automatic resource inference and allocation for MapReduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic Computing

Polo J et al (2013) Deadline-based MapReduce workload management. IEEE Trans Netw Serv Manage 10(2):231–244

Kalia K et al (2022) Improving MapReduce heterogeneous performance using KNN fair share scheduling. Robot Auton Syst 157:104228

Li Y, Hei X  (2022) "Performance optimization of computing task scheduling based on the Hadoop big data platform." Neural Computing and Applications pp. 1-12

Fu Z et al (2020) An optimal locality-aware task scheduling algorithm based on bipartite graph modelling for spark applications. IEEE Trans Parallel Distrib Syst 31(10):2406–2420

Gandomi A et al (2019) HybSMRP: a hybrid scheduling algorithm in Hadoop MapReduce framework. J Big Data 6:1–16

He C, Lu Y, Swanson D (2011) "Matchmaking: A new mapreduce scheduling technique." 2011 IEEE Third International Conference on Cloud Computing Technology and Science. IEEE

Ibrahim S et al (2012) "Maestro: Replica-aware map scheduling for mapreduce." 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012). IEEE

Zhang X et al (2011) "An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments." 2011 International Conference on Cloud and Service Computing. IEEE

Zhang X et al (2011) "Improving data locality of mapreduce by scheduling in homogeneous computing environments." 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications. IEEE

Zaharia M et al (2010) Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European conference on Computer systems

Tang X et al (2021) Cost-efficient workflow scheduling algorithm for applications with deadline constraint on heterogeneous clouds. IEEE Trans Parallel Distrib Syst 33(9):2079–2092

Seethalakshmi V, Govindasamy V, Akila V (2022) Real-coded multi-objective genetic algorithm with effective queuing model for efficient job scheduling in heterogeneous Hadoop environment. J King Saud Univ-Computer Inf Sci 34(6):3178–3190

Vinutha D, Raju G (2021) Budget constraint scheduler for big data using Hadoop MapReduce. SN Comput Sci 2(4):250

Javanmardi AK et al (2021) A unit-based, cost-efficient scheduler for heterogeneous Hadoop systems. J Supercomput 77:1–22

Rashmi S, Basu A (2016) "Deadline constrained Cost Effective Workflow scheduler for Hadoop clusters in cloud datacenter." 2016 International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS). IEEE

Zacheilas N, Kalogeraki V (2016) "Chess: Cost-effective scheduling across multiple heterogeneous mapreduce clusters." 2016 IEEE International Conference on Autonomic Computing (ICAC). IEEE

Palanisamy B, Singh A, Liu L (2014) Cost-effective resource provisioning for mapreduce in a cloud. IEEE Trans Parallel Distrib Syst 26(5):1265–1279

Chen K et al (2013) CRESP: Towards optimal resource provisioning for MapReduce computing in public clouds. IEEE Trans Parallel Distrib Syst 25(6):1403–1412

Aarthee S, Prabakaran R (2023) Energy-aware heuristic scheduling using bin packing mapreduce scheduler for heterogeneous workloads performance in big data. Arab J Sci Eng 48(2):1891–1905

Jeyaraj R, Paul A (2022) Optimizing MapReduce task scheduling on virtualized heterogeneous environments using ant colony optimization. IEEE Access 10:55842–55855

Zhang Q et al (2015) PRISM: Fine-grained resource-aware scheduling for MapReduce. IEEE Trans Cloud Comput 3(2):182–194

Rasooli A, Down DG (2014) COSHH: A classification and optimization based scheduler for heterogeneous Hadoop systems. Futur Gener Comput Syst 36:1–15

Polo J et al (2011) "Resource-aware adaptive scheduling for mapreduce clusters." Middleware 2011: ACM/IFIP/USENIX 12th International Middleware Conference, Lisbon, Portugal, December 12-16, 2011. Proceedings 12. Springer, Berlin Heidelberg

Sharma B et al (2012) "Mrorchestrator: A fine-grained resource orchestration framework for mapreduce clusters." 2012 IEEE Fifth International Conference on Cloud Computing. IEEE

Pastorelli M et al (2015) HFSP: bringing size-based scheduling to hadoop. IEEE Trans Cloud Comput 5(1):43–56

Tian F, Chen K (2011) "Towards optimal resource provisioning for running mapreduce programs in public clouds." 2011 IEEE 4th International Conference on Cloud Computing. IEEE

Ghoneem M, Kulkarni L (2017) "An adaptive MapReduce scheduler for scalable heterogeneous systems." Proceedings of the International Conference on Data Engineering and Communication Technology: ICDECT 2016, Volume 2. Springer, Singapore

Varalakshmi P, Subbiah S (2022) Optimized scheduling of multi-user Map-Reduce jobs in heterogeneous environment. Concurr Comput: Pract Exp 34(27):e7316

Maleki N, Rahmani AM, Conti M (2021) SPO: a secure and performance-aware optimization for MapReduce scheduling. J Netw Comput Appl 176:102944

Maleki N et al (2020) TMaR: a two-stage MapReduce scheduler for heterogeneous environments. HCIS 10:1–26

Jiang Y et al (2017) Makespan minimization for MapReduce systems with different servers. Futur Gener Comput Syst 67:13–21

Verma A, Cherkasova L, Campbell RH (2013) Orchestrating an ensemble of MapReduce jobs for minimizing their makespan. IEEE Trans Dependable Secure Comput 10(5):314–327

Yao Y et al (2015) Self-adjusting slot configurations for homogeneous and heterogeneous hadoop clusters. IEEE Trans Cloud Comput 5(2):344–357

Zheng H, Wan Z, Wu J (2016) "Optimizing MapReduce framework through joint scheduling of overlapping phases." 2016 25th International Conference on Computer Communication and Networks (ICCCN). IEEE

Tang Z et al (2016) An optimized MapReduce workflow scheduling algorithm for heterogeneous computing. J Supercomput 72:2059–2079

Ghazali R et al (2022) CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning. J Cloud Comput 11(1):1–17

Naik NS, Negi A (2017) "A learning-based mapreduce scheduler in heterogeneous environments." 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE

Naik NS, Negi A, Sastry V (2015) Performance improvement of MapReduce framework in heterogeneous context using reinforcement learning. Proc Comput Sci 50:169–175

Varga M, Petrescu-Nita A, Pop F (2018) Deadline scheduling algorithm for sustainable computing in Hadoop environment. Comput Secur 76:354–366

He C, Lu Y, Swanson D (2013) Real-time scheduling in mapreduce clusters. In: 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. IEEE

Gautam JV et al (2015) "A survey on job scheduling algorithms in big data processing." 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT). IEEE

Chen CH, Lin JW, Kuo SY (2014) "Deadline-constrained MapReduce scheduling based on graph modelling." 2014 IEEE 7th International Conference on Cloud Computing. IEEE

Nimbalkar PP, Gadekar DP (2015) Survey on scheduling algorithm in mapreduce framework. IJSETR 4(4):1226–1230

Singh N, Agrawal S (2015) A review of research on MapReduce scheduling algorithms in Hadoop." International Conference on Computing, Communication & Automation. IEEE

Khan M et al (2015) Hadoop performance modeling for job estimation and resource provisioning. IEEE Trans Parallel Distrib Syst 27(2):441–454

Mohamed E, Hong Z (2016) "Hadoop-MapReduce job scheduling algorithms survey." 2016 7th International Conference on Cloud Computing and Big Data (CCBD). IEEE

Mittal R, Kaur H. A survey on data placement and workload scheduling algorithms in heterogeneous network for Hadoop. Int J Comput Appl 975:8887


Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Soudabeh Hedayati

Applied IoT Lab, Department of Computer Science and Media Technology, Linnaeus University, Kalmar, Sweden

Neda Maleki, Tobias Olsson & Fredrik Ahlgren

School of Computing, Florida Institute of Technology, Melbourne, USA

Mahdi Seyednezhad

School of Computer Sciences, Science and Engineering Faculty, Queensland University of Technology (QUT), Brisbane, Australia

Kamal Berahmand


Contributions

Conceptualization: S. Hedayati; methodology: S. Hedayati; validation: S. Hedayati and N. Maleki; formal analysis: S. Hedayati, N. Maleki, T. Olsson, F. Ahlgren, M. Seyednezhad, and K. Berahmand; investigation: S. Hedayati, N. Maleki, T. Olsson, F. Ahlgren, M. Seyednezhad, and K. Berahmand; resources: S. Hedayati; data curation: S. Hedayati and N. Maleki; writing—original draft preparation: S. Hedayati, N. Maleki, T. Olsson, F. Ahlgren, M. Seyednezhad, and K. Berahmand; writing—review and editing: S. Hedayati, N. Maleki, T. Olsson, F. Ahlgren, M. Seyednezhad, and K. Berahmand; visualization: S. Hedayati, N. Maleki, T. Olsson, F. Ahlgren, M. Seyednezhad, and K. Berahmand; supervision: S. Hedayati. It is noted that all authors cooperated with each other to achieve suitable information flow across the entire paper. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Soudabeh Hedayati .

Ethics declarations

Ethics approval and consent to participate

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Hedayati, S., Maleki, N., Olsson, T. et al. MapReduce scheduling algorithms in Hadoop: a systematic study. J Cloud Comp 12 , 143 (2023). https://doi.org/10.1186/s13677-023-00520-9


Received : 31 March 2023

Accepted : 20 September 2023

Published : 10 October 2023

DOI : https://doi.org/10.1186/s13677-023-00520-9


  • Distributed systems
  • Resource allocation
  • Scheduling algorithms
  • Fair scheduling

  • Open access
  • Published: 10 January 2024

Research on a new management model of distribution Internet of Things

  • Chao Chen 1 ,
  • Liwu Gong 1 ,
  • Xin Luo 2 &
  • Fuwang Wang 2  

Scientific Reports volume  14 , Article number:  995 ( 2024 ) Cite this article


  • Computer science
  • Information technology

Based on the controllable-intelligence characteristics of the Internet of Things (IoT) and the functional and transmission-delay requirements of the new distribution network, this study proposes a method that combines edge collaborative computing with the distribution network station area, and builds a distribution network management structure model based on the Packet Transport Network (PTN) structure. A multi-terminal node distribution model of the distribution IoT is established. Finally, a distribution IoT management model is constructed based on an edge multi-node cooperative reasoning algorithm and a collaborative computing architecture model. The purpose of this paper is to solve the problem of large reasoning delay caused by heavy computing tasks on distribution cloud servers. The final results show that the model reduces the inference delay of cloud computing when a large number of smart device terminals of the distribution IoT are connected to the network.


Introduction

The distribution system is characterized by wide geographical distribution, many types of equipment, diverse network connections, and changeable operation modes. With the application of distribution automation monitoring systems, the heterogeneous, multivariate data they generate increases exponentially and has reached big-data scale. The cloud server of the Internet of Things (IoT) carries a heavy computing load, which leads to problems such as excessive reasoning delay and highly redundant data acquisition that reduce the working efficiency of the system. Therefore, how to effectively collect distribution IoT data, optimize the management structure of the distribution network, and improve data-processing efficiency has become an urgent problem in the construction of the distribution IoT.

The information and communication technology of the IoT retains the characteristics of ubiquitous sensing and Internet Protocol (IP) communication, but also has decentralized sensing ability. It distributes intelligent processing units at different levels of the distribution network and realizes comprehensive sensing and supervision of the distribution network and asset information through the combination of cloud computing and edge computing. Edge computing is used to reduce the problems of edge IoT device management, data acquisition and processing, and the large, complex data communication volume of cloud processing 1 . The application of edge computing in the smart grid has attracted extensive attention and research from scholars at home and abroad, mainly focusing on energy monitoring and management, load forecasting and scheduling, security and privacy protection, multi-energy collaborative management, and fault detection and maintenance. Luo et al. proposed a short-term energy consumption prediction system based on an edge computing architecture; experimental results show that the system can provide high-precision, real-time energy consumption prediction 2 . Liu et al. designed an IoT energy management system based on a deep reinforcement learning edge computing infrastructure; simulation results show that, compared with traditional systems, it achieves lower energy consumption while reducing delay 3 . Xu et al. used the SHA-256 hash algorithm to generate a set of code word groups as record indexes: when searching, users input multiple keywords and generate corresponding trapdoors, which are matched exactly against the index, and all matching results are fed back to users; this can effectively protect data stored at the edge of the power grid 4 . Bai et al. proposed a distributed distribution fault detection model based on edge intelligence, which introduces edge computing into the traditional cloud-based system, effectively reducing fault analysis and processing time and ensuring the safe and stable operation of the power communication network 5 . Cai et al. proposed an intelligent decision-making method for quick power supply restoration of the distribution network based on cloud-edge collaboration, providing a new implementation scheme for quick power supply restoration of distribution networks 6 . Li et al. adopted edge computing technology to transfer computing and storage tasks from cloud computing centers to edge devices closer to the data sources, reducing data transmission delay and improving data-processing speed and real-time performance; they constructed a dedicated data-processing framework for the multi-source, strongly coupled acquisition, cleaning and aggregation of distribution network load data, and further built an energy efficiency management system for the distribution network by studying power grid quality optimization and distributed energy scheduling 7 . Literature 8 proposes a unified energy management architecture for distributed edge computing of renewable energy, which improves the response speed of distributed energy control. Literature 9 constructed an edge computing architecture for the smart grid model and, taking intelligent measurement data acquisition as an example, analyzed in detail the role of edge computing in data analysis, together with its efficiency and security. Literature 10 designed and realized service collaboration for the distribution IoT based on a pipeline-mode computing model built around edge computing requirements, which could reduce the computational load, although computational speed was not studied.
Literature 11 takes the intelligent distribution terminal as the core and provides collaborative data assistance based on edge computing, realizing quantitative calculation and intelligent identification of local topology data in low-voltage station areas. In literature 12 , the whole distribution station area is taken as an edge computing node: characteristic indicators of the station area are established from monitoring data, and weights are selected to judge the operating state of the station area, further reducing computational redundancy. In literature 13 , a collaborative management architecture for a power company's jurisdiction area is designed based on edge computing technology for the IoT in the distribution area. In literature 14 , large-scale distributed co-simulation technology is applied to the smart grid system and a new co-simulator is designed. Literatures 15 and 16 used edge computing technology to design diversified load management, optimized the data-processing flow, and proposed solutions to the heterogeneity problem of intelligent terminal devices in the distribution IoT. These studies show that edge computing can be effectively applied to smart distribution networks, but most of the research focuses on the design of the distribution network management structure; the optimized operation and efficient functioning of the distribution IoT have not been studied in depth.

Edge computing extends cloud computing to the intelligent terminals and edge nodes of the Internet of Things, improving computing speed and processing efficiency. A large number of sensing and computing terminals are deployed in the station area. Using the data acquisition and processing capabilities of these edge devices and their ability to interact with users, information is preprocessed locally, saving computing resources at the cloud computing center; the two can then be unified into a whole, enabling precise control and real-time response across the entire distribution IoT 17 , 18 . According to the dimension along which cloud computing and edge computing cooperate, cloud-edge collaboration can be divided into four categories: (1) resource collaboration; (2) data collaboration; (3) intelligent collaboration; (4) service collaboration 19 . Building on these four categories, this paper studies a model for reducing inference delay in distributed IoT cloud computing, designs a multi-terminal distribution model for the distribution IoT, and constructs a distribution IoT management model based on edge multi-node collaborative computing, thereby completing the inference-optimization task of distributed cloud computing.

Edge computing and distributed Internet of Things

Edge computing

Through the integration of computing, communication, and control technologies, the intelligent terminals of the IoT form an organic whole of information and physical systems 20 . Current calculation methods for distribution networks fall short in areas such as fast computation, data management, data-terminal integration, and analysis and decision-making; their calculation rate and processing efficiency are relatively low, although cloud computing and big-data analysis can help 21 . However, the present data environment is complex and redundant, with diverse structures; analyzing and discriminating the various kinds of information is tedious, and real-time response requirements are difficult to meet, which affects the stability of energy transmission and the reliability of intelligent control in the distribution network 22 . Whereas cloud computing processes data remotely and uniformly at the master station, edge computing extends to the intelligent terminals and edge nodes of the IoT; its main purpose is real-time collection and calculation of localized data, online diagnosis, millisecond-level rapid response, and precise control of the controlled nodes. When a single distribution terminal adopts edge computing, data can be calculated and solved on the edge side and the results uploaded to the cloud center, which is more conducive to real-time monitoring of the running state and rapid, intelligent data processing. Traditional control acts on electrical signals, whereas edge computing acts on information: it integrates the key capabilities of communication networks, computational preprocessing, data storage, and applications, with edge nodes at the core 23 .
At present, many industries are promoting the integration of operational, information, and communication technologies (OICT) proposed by the Edge Computing Alliance 19 . With the wide application of big data and related technologies, a further transformation from OICT to data technology (DT) will also be realized, ultimately providing digital automation technology (AT) and intelligent control services. Figure  1 shows a schematic diagram of data positioning for edge computing.

figure 1

Schematic diagram of edge computing positioning.

Connecting edge computing with Internet of Things and distribution network

The traditional distribution network can usually be divided into master-station, sub-station, and terminal layers. With the deepening reform of the power network and the massive access of distributed energy, the traditional passive distribution network with unidirectional transmission has become an active distribution network with bidirectional transmission capacity 24 , 25 , which also drives change in the distribution system. With large-scale power-electronic equipment and intelligent controllable devices now in use, traditional control technology and the power communication grid show shortcomings; only by introducing IoT technology into the power grid can the demands of the new distribution system be met 26 . All layers of distribution equipment in the new distribution network are now combined with embedded systems, so that the traditional distribution network exchanges information with other equipment in the station area over the information communication network, forming a large-scale distribution Internet of Things in which the physical network, information equipment, and computing units are coupled. This further promotes the integration of edge computing, the IoT, and the distribution network, enabling intelligent control of the distribution network at different levels.

Multi-terminal node distribution model of Internet of Things

In the distribution network, intelligent distribution terminals can serve as edge computing nodes with computing, storage, and communication functions. The terminal of the smart platform can manage the connected communication, sensing, measurement, and execution devices, and can connect seamlessly to the data center and cloud platform. Through appropriate communication means, measurement data is collected and local data processing, analysis, and decision-making are carried out. Terminals in adjacent intelligent station areas can form distributed computing nodes, and the measurement data and processing results of the edge nodes of each distributed terminal intelligent device can be shared. The schematic diagram of device data and information sharing is shown in Fig.  2 :

figure 2

Data and information sharing schematic diagram.

In this diagram, the central server plays a centralized management and coordination role. Each smart station terminal is connected to the central server and communicates with the other terminals.

Data sharing can be achieved in the following ways:

Data upload: The intelligent station terminal can upload the collected measurement data and processing results to the central server. These data can include sensor readings, algorithm outputs, etc. The central server is responsible for storing and managing this data, making it accessible to other terminals.

Data download: The terminal can download data uploaded by other terminals from the central server. In this way, each terminal can obtain data from the entire distributed network for further processing and analysis.

Data sharing: Data can also be shared directly between terminals without going through the central server. This method can improve the data transmission speed and reduce the network load. Terminals can send data to other terminals through point-to-point communication or multicast.

Collaborative computing: Distributed terminals can use collaborative computing technology to divide tasks into sub-tasks and execute them in parallel on each terminal. Each terminal is responsible for processing part of the data and passing the results to a central server or other terminal for further integration and analysis.

In these ways, adjacent intelligent station terminals can share data and information, forming distributed computing nodes that provide stronger computing and analysis capabilities.
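The four sharing modes above can be sketched in a few lines of Python. All class and method names here are illustrative, not from the paper:

```python
# Minimal sketch of the four sharing modes: upload, download,
# direct peer-to-peer sharing, and collaborative computing.

class CentralServer:
    """Stores data uploaded by station terminals and serves downloads."""
    def __init__(self):
        self.store = {}

    def upload(self, terminal_id, key, value):
        self.store[(terminal_id, key)] = value

    def download(self, terminal_id, key):
        return self.store.get((terminal_id, key))


class Terminal:
    def __init__(self, tid, server):
        self.tid, self.server, self.inbox = tid, server, {}

    def upload(self, key, value):                 # 1) data upload
        self.server.upload(self.tid, key, value)

    def download(self, src_tid, key):             # 2) data download
        return self.server.download(src_tid, key)

    def send_direct(self, peer, key, value):      # 3) peer-to-peer sharing
        peer.inbox[key] = value


def collaborative_sum(terminals, data):           # 4) collaborative computing
    # Split the data into one chunk per terminal, let each compute its
    # partial result in parallel (here: a plain sum), then integrate.
    chunks = [data[i::len(terminals)] for i in range(len(terminals))]
    partials = [sum(chunk) for chunk in chunks]
    return sum(partials)


server = CentralServer()
a, b = Terminal("A", server), Terminal("B", server)
a.upload("voltage", 231.7)
print(b.download("A", "voltage"))        # -> 231.7
a.send_direct(b, "freq", 50.02)
print(b.inbox["freq"])                   # -> 50.02
print(collaborative_sum([a, b], [1, 2, 3, 4]))   # -> 10
```

The point of the sketch is only the division of roles: the server mediates modes 1 and 2, while modes 3 and 4 bypass it entirely.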

The power distribution IoT is an extension of IoT technology into the field of power distribution networks. Its characteristics cannot be captured by the traditional three-layer physical architecture (master station, substation, terminal). To reflect these characteristics, this paper designs a distribution IoT architecture based on the PTN network model and its PE-P-CE structure, combined with edge computing technology to reduce the computation load and the information interaction of the IoT. The distribution IoT is mainly composed of three types of devices: CE devices, PE devices, and P devices. The structure diagram of the edge technology is shown in Fig.  3 .

figure 3

Edge technology structure diagram.

The CE devices are terminal edge-unit-level IoT devices located in the active distribution network at the PTN access layer; they are self-sensing, self-computing, interchangeable, extensible, and self-determining. In the distribution IoT, the CE devices mainly include distribution transformers, terminal feedback units, intelligent monitoring units, data transmission units, centralized processing units, switch devices, and the configured communication modules. The CE devices carry out real-time monitoring, data collection, and intelligent control of terminal devices; distribution-transformer monitoring terminal units monitor the status of distributed power supplies, energy storage devices, and electricity meters. The sub-station (PE) device is the edge access device installed at the PTN convergence layer, which aggregates and provides access for all edge data; at the active-distribution-network system level it is called the electronic distribution station of the IoT. It can organize and configure the control instructions of the distribution network by itself and make its own decisions and optimizations. Multiple CE devices are pooled and connected to PE devices. Through status collection, data interaction, edge analysis, and cloud integration, global management of the distribution area is finally realized. The master station equipment, as the core (P) layer of the PTN structure, realizes comprehensive information perception in the distribution IoT, integrates and processes the data of the distributed terminal intelligent devices, conducts in-depth analysis, and then makes scientific decisions and issues commands. The master station monitors the status of the distribution network in real time, and the sub-station equipment uploads the information required from the distributed terminal intelligent devices to the master station.
Moreover, the master station can read real-time data from each CE device and use this information to determine the operating state of the smart distribution network.
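The CE → PE → master-station aggregation path described above can be illustrated with a small sketch; the class names and readings are hypothetical, not the paper's implementation:

```python
# Three-layer PTN-style aggregation: CE devices report readings,
# PE devices pool their connected CEs, and the master station (P layer)
# integrates all PE data into a global view.

class CEDevice:
    """Terminal edge unit: monitors one device and holds its reading."""
    def __init__(self, name, reading):
        self.name, self.reading = name, reading


class PEDevice:
    """Sub-station aggregator: pools the CE devices connected to it."""
    def __init__(self, ce_devices):
        self.ce_devices = ce_devices

    def aggregate(self):
        return {ce.name: ce.reading for ce in self.ce_devices}


class MasterStation:
    """Core layer: integrates data from all PEs for global decisions."""
    def __init__(self, pe_devices):
        self.pe_devices = pe_devices

    def global_view(self):
        view = {}
        for pe in self.pe_devices:
            view.update(pe.aggregate())
        return view


pe1 = PEDevice([CEDevice("transformer-1", 398.5), CEDevice("meter-7", 12.3)])
pe2 = PEDevice([CEDevice("storage-2", 0.84)])
print(MasterStation([pe1, pe2]).global_view())
```

The design choice mirrored here is that the master station never talks to CE devices for aggregation; it sees only what each PE has already pooled, which is what keeps the core layer's data volume manageable.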

Distributed Internet of Things management model based on node collaborative computing

Construction of a collaborative management and control model for the Internet of Things

At present, a centralized single-architecture scheme cannot meet the demands of multi-node distribution IoT scenarios. The edge computing mode is therefore introduced, with the intelligent terminal devices of the distribution network acting as edge nodes that preprocess data close to its source. For the edge-node management functions undertaken by the cloud server, the cloud-edge collaboration mechanism can keep the delay of data interaction between distribution-network devices low. In practical applications, however, the communication delay of data transmission between devices is still considerable and may even exceed the computation delay. The following strategies can reduce it. First, the network topology can be optimized to provide more efficient data transmission paths; careful topology design shortens the paths packets travel and thus reduces communication delay 27 . Second, increasing network bandwidth raises data transfer rates and reduces communication delay; this can be achieved by upgrading network hardware or using higher-performance communication equipment 28 . To further optimize communication performance, caching and prefetching can be introduced: a caching mechanism on the device side reduces the frequency of cloud accesses and hence communication latency 29 . Finally, adopting faster, well-suited communication protocols improves data transmission efficiency and accelerates inter-device communication 30 .
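A back-of-the-envelope model makes the bandwidth strategy concrete: one-way delay splits into a serialization term and a propagation term, and only the former shrinks when bandwidth is upgraded. The payload size and link figures below are assumed for illustration:

```python
def transfer_delay_ms(payload_bytes, bandwidth_mbps, propagation_ms):
    """Rough one-way delay: serialization time plus propagation delay."""
    serialization_ms = payload_bytes * 8 / (bandwidth_mbps * 1e6) * 1e3
    return serialization_ms + propagation_ms

# Doubling bandwidth halves only the serialization term, so topology
# optimization and caching (which attack the propagation/access terms)
# still matter once bandwidth is already high.
print(transfer_delay_ms(500_000, 10, 5))   # 10 Mbps link -> 405.0 ms
print(transfer_delay_ms(500_000, 20, 5))   # after upgrade -> 205.0 ms
```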

The terminal equipment of the distribution network is complex, including distributed energy, energy storage equipment, flexible loads, and other controllable devices. The information interactions of these devices are intermixed, and there is interference during transmission. Considering the request-delay and functional-integrity requirements of the intelligent terminal devices, together with the distribution network and its information-system structure, a multi-dimensional collaborative algorithm is designed to meet these requirements. The proposed topology model comprises the master device node, the edge nodes, the user terminal layer, and the edge devices of the distribution IoT. The master node C has direct access to the set of edge nodes it manages, each edge node has its own management center, and user terminals access the edge nodes. A brief topology is shown in Fig.  4 .

figure 4

Brief topology of distribution network management structure.

The distribution network is managed hierarchically: a multi-level combination of centralized and distributed control, spanning global, intermediate, and local control structures, forms the distribution-network management and control structure shown in Fig.  5 . Under this structure, tasks are generated by terminal devices and sent to the computing nodes for processing.

figure 5

Distribution network management and control structure.

The interface definitions in Fig.  5 are shown in Table 1 .

According to the physical structure of the distribution network system, the management and control model is used to represent it hierarchically. The various types of equipment and control modes in the model have the following characteristics:

Multiple terminal devices can realize information interaction and real-time analysis.

Within the same structural layer, different control modes are decoupled from one another, so each can achieve its control objective independently and the interactions among their control effects can be ignored.

When a superior control fails, the subordinate controls are unaffected and continue to work according to their operation rules.

Multi-node collaborative computing method for minimizing effect time target

The traditional load-balancing strategy is mainly used in parallel systems: external data requests are distributed across multiple processing units. It has shortcomings, however, such as poor dynamic performance, insufficient consideration of the real-time service capacity of the processing units, and the overhead of generating task responses and scheduling decisions 31 . Literature 32 adopts a dynamic balancing strategy based on linear regression, which can predict and feed back real-time data of terminal edge nodes, but because it ignores propagation delay it cannot meet the coordination needs of cloud and edge. Therefore, taking into account both the computing capacity of the nodes and the propagation delay between the edge and the cloud center, a collaborative computing model involving multiple nodes is adopted to minimize response time and achieve collaboration between cloud computing and edge computing. Through reasonable task division, node collaborative computing, task scheduling and optimization, and error-handling and fault-tolerance mechanisms, this multi-node collaborative computing model can effectively enhance task execution efficiency and system performance 33 , 34 , 35 .

Following the collaborative architecture of master device and edge computing described in Section " Multi-terminal node distribution model of Internet of Things ", the algorithm is described with minimum delay as the goal. Data interactions of the terminal intelligent devices first reach an edge node, which makes the action decision through collaborative control. The delays involved are the round-trip delay between the edge and the master device, the computing delay at the master device, and the computing delay at the edge node; only after these does data processing complete. The parameters considered are listed in Table 2 .

For the data-exchange tasks of the terminal equipment, the proposed algorithm represents each task as a quaternary data set:

where st i is the number of subtasks in task j i , c i is the number of clock cycles required to complete the task, λ i is the weight of task j i on computing (CPU) resources, and μ i is its weight on memory resources.

The master device and edge nodes are characterized by a ternary data set:

where cn k is the corresponding computing node, cpu k is the idle rate of its computing resources, mem k is the idle rate of its memory during node computation, and t k is the time for a single terminal intelligent device to run independently on the corresponding edge computing node.
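The tuple equations themselves were rendered as images and do not survive extraction; based on the parameter definitions above, a plausible reconstruction (the placement of t k outside the triple is an assumption) is:

```latex
% Reconstruction from the surrounding definitions: task i as a quadruple,
% node k as a triple, with t_k the standalone run time measured on node k.
j_i = \left( st_i,\; c_i,\; \lambda_i,\; \mu_i \right),
\qquad
n_k = \left( cn_k,\; cpu_k,\; mem_k \right)
```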

Based on the above definitions, the computing time can be quantitatively analyzed with minimum delay as the goal. When edge node k performs data-interaction task i , the computing delay is:

When the master device executes data-interaction task i , the computing delay is:

The request response time RT edge ( i , k ) of the edge node is then:

When the master node executes the data interaction task, the request response time RT cloud ( i ,0) is:

To minimize the request response time, the task is assigned to whichever of the local edge node and the master device node responds faster. After a task is submitted, the scheduling decision is made first by the edge node. To decide which node executes the task, the common term T edge is removed from both response times before comparison. The judgment formula is as follows:
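The judgment formula is not reproduced here, so the following sketch only illustrates the decision it describes, under assumed definitions: each path's response time is a compute term plus (for the cloud path) the edge-cloud round trip, the terminal-to-edge delay common to both paths is dropped, and the task goes to whichever node yields the smaller remainder. All speeds, idle rates, and the round-trip time are hypothetical:

```python
def compute_delay(clock_cycles, node_speed_hz, cpu_idle):
    """Execution delay of a task on a node, scaled by its idle CPU share."""
    return clock_cycles / (node_speed_hz * cpu_idle)

def choose_executor(task_cycles, edge, cloud, rtt_edge_cloud_s):
    """Return 'edge' or 'cloud', minimizing response time.
    The terminal-to-edge delay T_edge is common to both paths,
    so it is omitted from the comparison."""
    d_edge = compute_delay(task_cycles, edge["speed"], edge["cpu_idle"])
    d_cloud = rtt_edge_cloud_s + compute_delay(
        task_cycles, cloud["speed"], cloud["cpu_idle"])
    return "edge" if d_edge <= d_cloud else "cloud"

edge = {"speed": 2.0e9, "cpu_idle": 0.8}    # slower node, no round trip
cloud = {"speed": 3.0e9, "cpu_idle": 0.9}   # faster node, behind the link
print(choose_executor(2e8, edge, cloud, rtt_edge_cloud_s=0.1))   # -> edge
print(choose_executor(2e10, edge, cloud, rtt_edge_cloud_s=0.1))  # -> cloud
```

With these numbers the small task stays at the edge (the round trip dominates), while the large task is worth shipping to the faster cloud node, which is the trade-off the decision formula captures.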

There are two data transmission flows from edge node to master device node and from master device node to edge node in data collaboration, all of which are acted by each edge node's own collaboration controller. The specific process of data collaboration is shown in Fig.  6 .

figure 6

The specific process of data collaboration.

In fact, the edge nodes in the system undertake only some application functions. When a user terminal accesses the system, the edge node and the master node respond jointly. The master device node is responsible for global management of the distribution IoT across the whole region and provides storage capacity for the edge nodes; the edge nodes realize device management and control, collaborative decision-making, and task execution, and together they provide application-layer services to user terminals. For a task submitted by a terminal to an edge node, the edge node's collaboration controller first either makes a scheduling decision or forwards the task, according to the task type. The specific workflow is shown in Fig.  7 .

Based on the above scheme, the distribution network is partitioned, the edge nodes are determined after numbering each distributed terminal device, and the deployment scheme is constructed with minimum delay as the goal. During an inference task, factors such as the distance between the information-generating unit and the edge nodes and the communication environment affect the final output. To further reduce delay, a deep neural network (DNN) with device-edge collaborative inference can be introduced. The DNN is first partitioned so that computing resources are used sensibly, reducing the computational complexity and redundancy on the edge servers. By training DNN models of different capacities on multiple nodes and selecting the model suited to the application requirements, the computing burden and total delay are reduced. In operation, the output data of different DNN layers varies with running time and is highly heterogeneous; a layer that runs for a long time does not necessarily produce output efficiently. The DNN is therefore divided into two parts, and the redundant, complicated part can be computed first on the server even over a low-efficiency link, thus reducing the end-to-end waiting time.

figure 7

The specific process of service collaboration.
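The two-part DNN division described above can be made concrete as a search over split points: layers before the split run on the terminal, the activation at the split crosses the link, and the remaining layers run on the edge server; the split minimizing total latency is chosen. The per-layer latencies, activation sizes, and bandwidth below are assumed for illustration:

```python
def best_split(device_ms, server_ms, out_bytes, input_bytes, bandwidth_bps):
    """Latency-minimizing DNN partition: layers [0, s) run on the terminal
    and layers [s, n) on the edge server; the tensor crossing the link is
    the raw input (s == 0) or the activation output of layer s-1. For
    simplicity, s == n still transfers the final output to the server."""
    n = len(device_ms)

    def link_ms(nbytes):
        return nbytes * 8 / bandwidth_bps * 1e3

    best = (float("inf"), -1)
    for s in range(n + 1):
        crossing = input_bytes if s == 0 else out_bytes[s - 1]
        total = sum(device_ms[:s]) + link_ms(crossing) + sum(server_ms[s:])
        best = min(best, (total, s))
    return best   # (total latency in ms, split index)

device_ms = [5, 8, 60, 90]                     # per-layer latency, terminal
server_ms = [1, 2, 4, 6]                       # per-layer latency, server
out_bytes = [400_000, 50_000, 20_000, 1_000]   # activation sizes per layer
print(best_split(device_ms, server_ms, out_bytes,
                 input_bytes=600_000, bandwidth_bps=1e7))
```

With these figures the best split is after the second layer: its small activation is cheap to transmit, while the heavy later layers run on the faster server, which is exactly the effect the collaborative inference scheme relies on.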

Simulation experiment

First, the proposed collaborative inference algorithm is verified by simulation on a cloud simulation platform. The simulation sets up one cloud computing center, three edge nodes, and 30 groups of computing tasks. Each group contains three task units of the same type, submitted to the three edge nodes respectively, and successive groups are separated by 1 ms to simulate the variation of transmission delay in actual operation. During the simulation, the delay of the service-agent module is varied, and the effect of coordinated processing of the tasks submitted to the edge nodes is simulated and compared.

Figure  8 depicts the execution time of the computing tasks submitted to the edge nodes under different scheduling methods. The solid line represents the minimum-response-time collaborative inference algorithm proposed in this paper, the short dotted line the round-robin (polling) scheduling algorithm, and the long dotted line the random scheduling algorithm. The results show that the minimum-response-time method yields stable task completion times and relatively short execution times.

figure 8

Comparison of results of different scheduling algorithms.

Averaging the data in Fig.  8 gives the mean execution times shown in Fig.  9 . The results show that the proposed method has the shortest average execution time.

figure 9

Comparison of the average execution time of different scheduling algorithms.

As can be seen from Figs.  8 and 9 , reducing the time from task submission at the user terminal to completion at the edge nodes improves system efficiency. This is mainly because the algorithm fully accounts for the gap in computing capacity between the cloud computing center and the edge nodes and exploits it through optimized scheduling decisions.
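The qualitative comparison in Figs. 8 and 9 can be reproduced with a toy simulation (node speeds and task sizes are invented, and queueing and transmission delays are ignored): because the minimum-response-time policy picks the best node for every task, its mean execution time can never exceed that of round-robin or random assignment.

```python
import random

random.seed(7)

# Three heterogeneous edge nodes, characterized only by the cycles/s
# they can devote to a task (hypothetical figures).
NODES = [{"speed": 1.0e9}, {"speed": 2.0e9}, {"speed": 4.0e9}]

# 30 task groups; each task is a number of CPU cycles to execute.
tasks = [random.uniform(1e8, 1e9) for _ in range(30)]

def exec_time(task, node):
    return task / node["speed"]

def min_response(tasks):      # proposed: pick the fastest node per task
    return [min(exec_time(t, n) for n in NODES) for t in tasks]

def round_robin(tasks):       # polling: cycle through the nodes
    return [exec_time(t, NODES[i % len(NODES)]) for i, t in enumerate(tasks)]

def random_pick(tasks):       # random scheduling
    return [exec_time(t, random.choice(NODES)) for t in tasks]

for name, policy in [("min-response", min_response),
                     ("round-robin", round_robin),
                     ("random", random_pick)]:
    times = policy(tasks)
    print(f"{name:12s} mean {sum(times) / len(times) * 1e3:7.1f} ms")
```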

To build a heterogeneous edge-node network environment, four different edge devices are used: Jetson TX1, Jetson TX2, Jetson TK1, and Jetson Nano. Each time the network topology is generated, the devices are randomly assigned to the edge nodes. First, the characteristic state indicators of the four edge devices are preprocessed with a weight matrix; the smaller the resulting number, the better the running state. The processed state indicators are shown in Table 3 .

For the different computing-performance parameters of the four device types, the Paleo framework is used to test the running delay on the edge nodes of each branch; the recorded results are shown in Table 4 .

Compared with the delay of the existing cloud servers, efficiency is improved in the later stage of optimization. For the edge-computing-based method, the average delay over different networks is simulated with the inter-node bandwidth set to 1 Mbps. The results are shown in Table 5 .

The test results show that, at the same inference accuracy, the analysis of the simulation data takes the least time when the instructions uploaded by the information-generating unit are removed from the same partition of the network.

To address heavy computing tasks, excessive inference delay, and the high redundancy of complex data acquisition, this paper adopts cloud-edge coordination based on edge computing technology, combining the IoT with the intelligent distribution network, and develops coordination strategies along three dimensions: computing, data, and service. Based on the PTN network structure, a multi-terminal node distribution model of the distribution IoT is proposed, and a distributed management model of the distribution-network IoT based on edge multi-node collaborative computing is established. The distribution network thus achieves local autonomy and collaborative management and control, effectively avoiding the problems of traditional centralized management.

Data availability

The research project involved in this study has not yet been completed, therefore, the data is not suitable for publication at this time. If readers need information, please consult with the corresponding author to obtain it.

Yi, W. et al. The integration of 5G communication and ubiquitous power Internet of Things: application analysis and research prospects. Power System Technology 43 (5), 1575–1585 (2019).


Luo, H. et al. A short-term energy prediction system based on edge computing for smart city. Futur. Gener. Comput. Syst. 101 , 444–457 (2019).


Liu, Y. et al. Intelligent edge computing for IoT-based energy management in smart cities. IEEE Netw. 33 (2), 111–117 (2019).

Xu, A. et al. Multi-keyword ciphertext retrieval method for edge computing of smart grid. Computer applications and software 39 (07), 310–314 (2022).

Bai, M. et al. Research on distributed distribution fault detection based on edge intelligence. Electr. Autom. 45 (04), 79–81 (2023).

Tiantian, C. et al. An intelligent decision-making method for quick power supply restoration of distribution network based on cloud-edge collaboration. Power Syst. Prot. Control 51 (19), 94–103. https://doi.org/10.19783/j.cnki.pspc.221918 (2023).

Jian, Li. et al. Edge computing solutions for distribution networks. Autom. Exp. 40 (02), 120–122 (2023).

Haoyang, S. et al. Edge computing technology for distribution Internet of Things. Power Syst. Technol. 43 (12), 4314–4321 (2019).

Shi, W., Sun, H. & Cao, J. Edge computing-an emerging computing model for the internet of everything era. J. Comput. Res. Dev. 54 (5), 907–924 (2017).

Liang, L. & Hui, Li. Function design and system implementation of edge computing gateway. Electr. Meas. Instrum. 58 (8), 42–48 (2021).

Gu, H. & Chen, S. Application design of low-voltage intelligent stations based on edge computing. Electr. Meas. Instrum. 58 (8), 36–41 (2021).

Wang, X., Song, N., Shi, L., et al. Comprehensive evaluation method of transformer area status based on edge computing. Electr. Meas. Instrum . https://kns.cnki.net/kcms/detail/23.1202.TH.20210525.1231.003.html .

Guan, L. Research on cache replacement technology based on fog computing edge data storage. Shanxi Electron. Technol. 4 , 90–96 (2017).

Zheng, Y., Liu, Y. & Hansen, J. H. L. Navigation-orientated natural spoken language understanding for intelligent vehicle dialogue. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium 559–564 (IEEE, 2017).

Zheng, G. & Yu, X. Design and implementation of IoT system for intelligent electricity management and control based on edge computing. Electr. Meas. Instrum. 58 (8), 28–35 (2021).

Xiaojiang, C. et al. Design of pluralistic load management for intelligent concentrator based on edge computing technology. Electr. Meas. Instrum. 58 (8), 17–27 (2021).

Xu, E. & Dong, E. Explore nine application scenarios of cloud-side collaboration. Commun. World 21 , 42–43 (2019).

Xun, E. & Dong, E. Exploration and practice of the collaborative development of cloud computing and edge computing. Commun. World 801 (9), 48–49 (2019).

Lv, C. & Zhou, G. Research on the overall capability connotation of edge-cloud collaboration and collaborative solutions at all levels. 2019 Guangdong Tongxin Youth Forum Excellent Papers Special Issue, 90–195 (2019).

Li, E., Zhou, Z. & Chen, X. Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 ACM Workshop on Mobile Edge Communications 31–36 (ACM, 2018).

Gong, G., Luo, A. & Chen, Z. Cyber physical system of active distribution network based on container. Power Syst. Technol. 42 (10), 3128–3135 (2018).

Xin, L., Huang, Q. & Wu, D. Distributed large-scale co-simulation for IoT-aided smart grid control. IEEE Access 99 (5), 19951–19960 (2017).

Xing, W. Research on power internet of things architecture for smart grid construction. Electr. Power Big Data 21 (10), 28–31 (2018).

Li, B., Jia, B. & Cao, W. Application prospect of edge computing in power demand response business. Power Syst. Technol. 45 (1), 79–87 (2018).


Okay, F. Y. & Ozdemir, S. A fog computing based smart grid model, 2016 International Symposium on Networks", Computers and Communications (ISNCC) , Hammamet : ISNCC , pp. 1–6 (2016).

Xu, X., Chen, Z. & Ding, H. Discussion on the design of edge computing architecture for regional electricity retailer. Electr. Power Constr. 40 (7), 41–47 (2019).


Tomovic, S. et al. Software-defined fog network architecture for IoT. Wirel. Pers. Commun. 92 , 181–196 (2017).

Ma, Z. et al. High-reliability and low-latency wireless communication for internet of things: Challenges, fundamentals, and enabling technologies. IEEE Internet Things J. 6 (5), 7946–7970 (2019).

Wang, S. et al. A survey on mobile edge networks: Convergence of computing, caching and communications. IEEE Access 5 , 6757–6779 (2017).

Yang, H. et al. Deep reinforcement learning based massive access management for ultra-reliable low-latency communications. IEEE Trans. Wirel. Commun. 20 (5), 2977–2990 (2020).

Optimizing Network Performance with Content Switching: Server, Firewall and Cache Load Balancing (The Radia Perlman Series in Computer Networking and Security). Prentice Hall.

Yang, J. Design and Implementation of Service Collaboration and Load Balancing Strategies for Cloud Services. Beijing University of Posts and Telecommunications (2019).

Tang, Y. et al. Artificial intelligence methods for power grid simulation analysis and decision-making. Chin. J. Electr. Eng. 42 (15), 5384–5406 (2022).

Wang, L., Wu, Z. & Fan, W. A review on resource allocation and task scheduling optimization in edge computing. J. Syst. Simul. 03 , 509–520 (2021).

Ren, J. et al. Collaborative cloud and edge computing for latency minimization. IEEE Trans. Veh. Technol. 68 (5), 5031–5044 (2019).


Acknowledgements

This research was funded by Research on distribution network optimization layout and collaborative control technology for high proportion photovoltaic access and distributed energy storage, grant number B311JX230001.

Author information

Authors and Affiliations

State Grid Zhejiang Electric Power Co., Ltd., Jiaxing Power Supply Company, Jiaxing, 314000, Zhejiang Province, China

Chao Chen & Liwu Gong

School of Electrical Engineering, Northeast Electric Power University, Jilin, 132000, Jilin Province, China

Xin Luo & Fuwang Wang


Contributions

Conceptualization, C.C. and L.W.; methodology, X.L. and F.W.; software, C.C.; validation, F.W., L.W. and C.C.; formal analysis, C.C.; investigation, X.L. and F.W.; resources, L.W.; data curation, F.W. and L.W.; writing—original draft preparation, C.C.; writing—review and editing, L.W.; supervision, C.C. and X.L.; funding acquisition, F.W. and C.C. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Fuwang Wang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Chen, C., Gong, L., Luo, X. et al. Research on a new management model of distribution Internet of Things. Sci Rep 14 , 995 (2024). https://doi.org/10.1038/s41598-024-51570-1


Received: 11 September 2023

Accepted: 06 January 2024

Published: 10 January 2024

DOI: https://doi.org/10.1038/s41598-024-51570-1







Introductory Chapter: The Newest Research in Parallel and Distributed Computing

Submitted: 14 September 2016 Published: 19 July 2017

DOI: 10.5772/intechopen.69201


From the Edited Volume

Recent Progress in Parallel and Distributed Computing

Edited by Wen-Jyi Hwang


Chapter metrics overview

1,344 Chapter Downloads


Author Information

Wen-Jyi Hwang*

  • Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei, Taiwan

*Address all correspondence to: [email protected]

Parallel and distributed computing is concerned with the concurrent use of multiple compute resources to enhance the performance of distributed and/or computationally intensive applications. The compute resources may be a single computer or a number of computers connected by a network. A computer in the system may contain single-core, multi-core and/or many-core processors. The design and implementation of a parallel and distributed system may involve the development, utilization and integration of techniques in computer networks, software and hardware. With the advent of networking and computer technology, parallel and distributed computing systems have been widely employed to solve problems in engineering, management, the natural sciences and the social sciences.
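As an illustration of the concurrent use of multiple compute resources described above, the following minimal Python sketch splits a sum of squares across worker processes. It is not drawn from any chapter in this book; the function names and chunking scheme are invented for the example (and it assumes `n` is at least as large as the number of workers).

```python
from multiprocessing import Pool

def partial_sum(bounds):
    # Each worker computes the sum of squares over its own half-open range.
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_sum_of_squares(n, workers=4):
    # Split [0, n) into one contiguous chunk per worker process.
    step = n // workers
    chunks = [(w * step, n if w == workers - 1 else (w + 1) * step)
              for w in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    assert parallel_sum_of_squares(10_000) == sum(i * i for i in range(10_000))
```

The same divide-compute-combine shape carries over to clusters of networked machines; only the transport (a network protocol instead of local processes) changes.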

There are six chapters in this book. Chapters 2 to 6 cover a wide range of studies on new applications, algorithms, architectures, networks, software implementations and evaluations in this growing field. These studies may be useful to scientists and engineers from various fields of specialization who need the techniques of distributed and parallel computing in their work.

The second chapter of this book considers applications of distributed computing to social networks. The chapter entitled “A Study on the Node Influential Capability in Social Networks by Incorporating Trust Metrics” by Tong-Ming Lim and Hock Yeow Yap provides useful distributed computing models for evaluating the influential capability of nodes in social networks. Two algorithms are presented in this study: the Trust-enabled Generic Algorithm Diffusion Model (T-GADM) and the Domain-Specified Trust-enabled Generic Algorithm Diffusion Model (DST-GADM). Experimental results confirm the hypothesis that social trust plays an important role in influence propagation; moreover, incorporating trust increases the rate of success in influencing other nodes in a social network.
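To make the idea of trust-weighted influence concrete, here is a generic threshold-style diffusion sketch, not the T-GADM or DST-GADM algorithms themselves: a node adopts once the summed trust it places in already-influenced neighbors reaches a threshold. All names, weights and the threshold rule are illustrative assumptions.

```python
def trust_diffusion(in_neighbors, trust, seeds, threshold=0.5):
    """Spread influence until no inactive node accumulates enough trust.

    in_neighbors: {node: [nodes that can influence it]}
    trust:        {(src, dst): trust weight in [0, 1]}
    seeds:        initially influenced nodes
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, sources in in_neighbors.items():
            if node in active:
                continue
            influence = sum(trust.get((u, node), 0.0)
                            for u in sources if u in active)
            if influence >= threshold:
                active.add(node)
                changed = True
    return active
```

With high trust along a chain A → B → C, influence reaches all three nodes; lowering the A → B trust below the threshold stops the cascade at A, mirroring the chapter's finding that trust governs the rate of successful influence.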

Another application presented in this book is the smart grid for power engineering. The chapter entitled “A Distributed Computing Architecture for the Large-Scale Integration of Renewable Energy and Distributed Resources in Smart Grids” by Ignacio Aravena, Anthony Papavasiliou and Alex Papalexopoulos analyzes distributed systems for managing the short-term operation of power systems. The authors propose optimization algorithms for both the distribution grid and the high-voltage grid. Numerical results are included to illustrate the effectiveness of the algorithms.

This book also contains a chapter covering the programming aspect of parallel and distributed computing. For the study of parallel programming, graphics processing units (GPUs) are considered. GPUs have received attention for parallel computing because their many-core capability offers a significant speedup over traditional general-purpose processors. In the chapter entitled “GPU Computing Taxonomy” by Abdelrahman Ahmed Mohamed Osman, a new classification mechanism is proposed to facilitate the use of GPUs for solving single-program multiple-data (SPMD) problems. Based on the number of hosts and the number of devices, GPU computing can be separated into four classes. Examples are included to illustrate the features of each class, and efficient coding techniques are also provided.
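The single-program multiple-data pattern underlying this taxonomy can be sketched without a GPU: every worker runs the same "kernel" and uses its device id to select its slice of the data. The sketch below uses Python threads purely as stand-ins for devices; the names and the doubling kernel are invented for the example and are not from the chapter.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(device_id, num_devices, data):
    # The same program runs on every "device"; only device_id differs,
    # and it selects a strided slice of the input (the SPMD pattern).
    return [x * 2 for x in data[device_id::num_devices]]

def spmd_run(data, num_devices=4):
    # Launch one instance of the identical kernel per simulated device.
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(kernel, d, num_devices, data)
                   for d in range(num_devices)]
        return [f.result() for f in futures]
```

On a real GPU, the "device id" corresponds to thread and block indices, and the host/device counts that define the chapter's four classes determine how many such kernels run and where.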

The final two chapters focus on the software aspects of distributed and parallel computing. Software tools for the wikinomics-oriented development of scientific applications are presented in the chapter entitled “Distributed Software Development Tools for Distributed Scientific Applications” by Vaidas Giedrimas, Anatoly Petrenko and Leonidas Sakalauskas. The applications are based on service-oriented architectures, with the flexibility to reuse deployed code and components and transform them into services. Prototypes are given to demonstrate the effectiveness of the proposed tools.

The chapter entitled “DANP-Evaluation of AHP-DSS” by Wolfgang Ossadnik, Benjamin Föcke and Ralf H. Kaspar evaluates software supporting the Analytic Hierarchy Process (AHP) for use in adequate Decision Support Systems (DSS) in management science. The corresponding software selection, evaluation criteria, evaluation framework, assessments and evaluation results are provided in detail. Issues concerning evaluation assisted by parallel and distributed computing are also addressed.

These chapters offer comprehensive coverage of parallel and distributed computing from engineering and science perspectives. They may help to further stimulate and promote research and development in this rapidly growing area. It is also hoped that newcomers and researchers from other disciplines who wish to learn more about parallel and distributed computing will find this book useful.

© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Quantum Physics

Title: Using Quantum Computing to Infer Dynamic Behaviors of Biological and Artificial Neural Networks

Abstract: The exploration of new problem classes for quantum computation is an active area of research. An essentially completely unexplored topic is the use of quantum algorithms and computing to explore and ask questions about the functional dynamics of neural networks. This is a component of the still-nascent topic of applying quantum computing to the modeling and simulation of biological and artificial neural networks. In this work, we show how a carefully constructed set of conditions can use two foundational quantum algorithms, Grover and Deutsch-Jozsa, in such a way that the output measurements admit an interpretation that guarantees we can infer whether a simple representation of a neural network (one that applies to both biological and artificial networks) has the potential, after some period of time, to continue sustaining dynamic activity, or whether the dynamics are guaranteed to stop, either through 'epileptic' dynamics or quiescence.
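Of the two algorithms the abstract names, Deutsch-Jozsa is small enough to simulate classically. The sketch below is a plain statevector simulation written for this note, not the construction from the paper: given an n-bit function promised to be either constant or balanced, the algorithm distinguishes the two cases with a single oracle query, read off from the probability of measuring the all-zeros state.

```python
import math

def apply_h(state, q):
    """Apply a Hadamard gate to qubit q of a little-endian statevector."""
    h = 1 / math.sqrt(2)
    for i in range(len(state)):
        if not (i >> q) & 1:          # visit each amplitude pair once
            j = i | (1 << q)
            a, b = state[i], state[j]
            state[i], state[j] = h * (a + b), h * (a - b)

def deutsch_jozsa(f, n):
    """Classically simulate Deutsch-Jozsa on n input qubits.

    f maps an n-bit integer to 0 or 1 and is promised to be either
    constant or balanced. A phase oracle |x> -> (-1)^f(x) |x> stands in
    for the usual ancilla-based oracle (equivalent via phase kickback).
    """
    size = 1 << n
    state = [0j] * size
    state[0] = 1 + 0j
    for q in range(n):                # H on every qubit
        apply_h(state, q)
    for i in range(size):             # single oracle query
        if f(i):
            state[i] = -state[i]
    for q in range(n):                # H on every qubit again
        apply_h(state, q)
    p_all_zero = abs(state[0]) ** 2   # probability of measuring |0...0>
    return "constant" if p_all_zero > 0.5 else "balanced"
```

For a constant f the all-zeros outcome has probability 1; for a balanced f it has probability 0, so one measurement settles the question, whereas a classical algorithm may need exponentially many evaluations in the worst case.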


