Help | Advanced Search

Computer Science > Computers and Society

Title: embedded systems education in the 2020s: challenges, reflections, and future directions.

Abstract: Embedded computing systems are pervasive in our everyday lives, imparting digital intelligence to a variety of electronic platforms used in our vehicles, smart appliances, wearables, mobile devices, and computers. The need to train the next generation of embedded systems designers and engineers with relevant skills across hardware, software, and their co-design remains pressing today. This paper describes the evolution of embedded systems education over the past two decades and challenges facing the designers and instructors of embedded systems curricula in the 2020s. Reflections from over a decade of teaching the design of embedded computing systems are presented, with insights on strategies that show promise to address these challenges. Lastly, some important future directions in embedded systems education are highlighted.

Submission history

Access paper:.

  • Other Formats

license icon

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Micromachines (Basel)
  • PMC10220566

Logo of micromac

A Review of Artificial Intelligence in Embedded Systems

Zhaoyun zhang.

1 School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523000, China

Jingpeng Li

2 School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523000, China; moc.621@503eelk

Associated Data

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Advancements in artificial intelligence algorithms and models, along with embedded device support, have resulted in the issue of high energy consumption and poor compatibility when deploying artificial intelligence models and networks on embedded devices becoming solvable. In response to these problems, this paper introduces three aspects of methods and applications for deploying artificial intelligence technologies on embedded devices, including artificial intelligence algorithms and models on resource-constrained hardware, acceleration methods for embedded devices, neural network compression, and current application models of embedded AI. This paper compares relevant literature, highlights the strengths and weaknesses, and concludes with future directions for embedded AI and a summary of the article.

1. Introduction

Over the years, the development of artificial intelligence and its applications has greatly reduced the complexity of many machine learning models, making it easier to deploy them on resource-constrained devices. Furthermore, corresponding support for models and algorithms on these devices has emerged. These developments have facilitated a new research direction: embedded artificial intelligence [ 1 , 2 , 3 ]. The concept of embedded AI was first introduced in reference [ 3 ], which proposed that the IoT could evolve into the Wisdom Web of Things (W2T) and emphasized that embedded intelligence about individuals, the environment, and society could increase the number of users of existing IoT systems, promote environmental sustainability, and enhance social awareness. Recent developments in embedded AI are described in references [ 1 , 2 ], both of which combine embedded AI with IoT technology. The current mainstream research direction of embedded AI is to integrate it with IoT, which includes edge computing with convolutional accelerator [ 4 ] and load distribution [ 5 ]. Reference [ 6 ] mentions that the combination of embedded intelligence and IoT is the future direction of development. In addition, edge computing can be combined with artificial intelligence, resulting in what is known as edge intelligence [ 7 ].

The current development of embedded AI is two-way: the optimization of AI models and algorithms reduces the difficulty of deploying them on embedded devices, while hardware accelerators in embedded devices increase support for AI models and algorithms. Furthermore, hardware resources are being developed, and AI is rapidly advancing in mobile devices. For example, reference [ 8 ] describes the deployment of neural networks on cell phones, and there are also neural networks specifically designed for mobile devices, such as MobileNet [ 9 ]. MobileNet will be discussed in detail in Section 3 on lightweight networks.

To deploy artificial intelligence in embedded devices, several aspects need to be considered: (1) choosing a suitable deployment platform, (2) using hardware accelerators, as mentioned in the second section, (3) applying model compression techniques, as described in the third section [ 10 , 11 ], and (4) enhancing hardware support for AI algorithms and models. All these points are discussed in detail in this paper.

In this paper, we will introduce embedded AI from the following aspects: the first section introduces the development history of embedded AI, the second section introduces the hardware accelerators of embedded AI, the third section introduces the key technologies of embedded AI, the fourth section introduces the application directions of embedded AI, the fifth section discusses the challenges encountered in the current development of embedded AI and provides an outlook for the future, and finally the sixth section concludes with a summary.

2. Hardware Acceleration Methods for Embedded AI

The current acceleration of embedded AI can be achieved through three types of hardware: FPGA, ASIC, and GPU. FPGA (Field-Programmable Gate Array) is a type of programmable logic chip; ASIC (Application-Specific Integrated Circuit) is an integrated circuit specifically designed to meet the needs of a particular application; and a GPU (Graphics Processing Unit) is a processor used for processing graphics data.

FPGA (Field Programmable Gate Array), also known as Field Programmable Gate Arrays, can generate high-performance, low-power hardware mappings in neural networks due to its flexibility. FPGA can be customized to meet system-level requirements in different environments, such as throughput, latency, and power.

The authors of paper [ 12 ] propose an approach for accelerating convolutional neural networks on FPGAs to generate high-performance hardware designs by assembling pre-implemented components of convolutional neural networks into a graph topology-based puzzle. The use of pre-implemented components allows minimal resource utilization for predicting performance, which reduces engineering time and increases yields. The pre-implemented components in the paper are implemented using an out-of-context design approach. Experiments show a 69% yield improvement compared to traditional FPGA implementations, achieving more than 1.5 times the maximum frequency with fewer resources and higher energy efficiency. Reference [ 13 ] proposes the ASimOV framework, which consists of two parts: (1) an AI simulator for finding the optimal parameters in the parameter search space of the AI accelerator and algorithm to achieve maximum performance (e.g., precision); and (2) an HDL code generator for using the optimal parameters of the AI accelerator and algorithm to generate a hardware model, which is implemented in the FPGA for functional testing purposes. This end-to-end framework allows for the optimization of the AI accelerator on specific datasets, while also reducing actual cost. The functionality of ASimOV has been verified in small AI algorithms such as character recognition, clothing recognition, and speech recognition. For the k-NN AI accelerator, the total performance reached 90.7% with 64kb of memory. Reference [ 14 ] compares artificial neural networks and spiking neural networks on FPGAs in terms of method, learning process, and efficiency using the MNIST, GTSRB, and CIFAR-10 datasets. Spiking neural networks were found to be slightly faster than artificial neural networks for the learning process when using the MNIST dataset. In terms of power consumption, spiking neural networks consume less FPGA logic and have 19% more power than artificial neural networks. The authors compared the spiking neural network with the convolutional neural network on the GTSRB dataset and found that the former saved 42% of power, but the high latency of the inference process ultimately did not result in energy savings. On the CIFAR-10 dataset, the test results showed that overall FPGA logic decreased, with a 75% reduction in logic conversion efficiency, resulting in energy savings compared to the GTSRB dataset mentioned above because the inference time was greatly reduced. The increasing complexity of convolutional neural network models has made it challenging to map them to FPGA devices. Therefore, fpgaConvNet research was conducted to reduce the difficulty of deploying network models. For instance, in reference [ 15 ], the authors propose that the inference process of ConvNet could be interpreted as a streaming application. fpgaConvNet uses a computational synchronous data flow model as its modeling core, where synchronous data flow is mainly used in the design and analysis of parallel systems. By transforming the design space exploration task into a multi-objective optimization problem, fpgaConvNet can effectively target applications with various performance requirements, either in terms of high throughput or low latency. The authors present experiments comparing fpgaConvNet with a cluster of embedded GPUs. fpgaConvNet generates a more energy-efficient accelerator and achieves higher raw performance within the same power constraints compared to highly optimized embedded GPU designs. The authors argue that fpgaConvNet deploys the ConvNet model on embedded FPGAs, which provides the necessary infrastructure. Reference [ 16 ] proposes an end-to-end optimized framework called fpgaConvNet, which is an automated mapping framework for convolutional networks on FPGAs, and an automated design approach for the synchronous dataflow paradigm while defining a set of SDF transformations. A system multi-objective optimization formulation is also proposed for fpgaConvNet, enabling it to generate co-optimized hardware designs for convolutional neural network workloads, target devices, and application performance metrics. In the experiments, the authors compare the performance of fpgaConvNet with the existing FPGA-related work. The comprehensive experimental structure demonstrates that fpgaConvNet achieves outstanding results in terms of performance, performance density, and other relevant metrics.

There are two approaches to deploying algorithms on FPGA: enhancing the algorithm to leverage the hardware capabilities of FPGA, or optimizing the execution efficiency of the algorithm in hardware. Most current work focuses on memory compression to maximize operational efficiency, but does not consider energy consumption on the FPGA platform. Additionally, it is important to consider the operational efficiency on the FPGA when selecting the programming language.

ASIC (Application Specific Integrated Circuit) is a type of special-purpose integrated circuit that is highly customized and less programmable, and therefore relies heavily on algorithm design. To deploy neural networks on ASICs, appropriate hardware accelerators need to be designed for the specific network structure. The optimization of hardware accelerators involves utilizing parallel resources and reducing computational complexity.

Reference [ 17 ], on the other hand, proposes an optimization approach that utilizes parallel resources. The authors introduce the YodaNN convolutional neural network accelerator, which simplifies the design complexity by using simpler complementary operations and multiplexers instead of fixed-point MAC units. Experimental results demonstrate that the YodaNN accelerator achieved 2.7 times higher peak performance and 32 times higher energy efficiency than the state-of-the-art CNN accelerator at that time. Reference [ 18 ] proposes the UCNN (Unique Weight CNN Accelerator), which is a convolutional neural network accelerator optimized for reducing the computational effort. As the convolutional neural network design evolves, the size and number of filters in the network structure will keep increasing, and the values of the weights are bound to be repeated. Therefore, each common weight can be extracted from the multiplication it involves; for example, the w i in the following equation can be extracted.

In addition, the author proposes maintaining two hardware structures: the input indirect table and the weight indirect table. F 1 and F 2 in the figure below are two filters, x and z are weights, and IF1 is the feature map, which represents the convolution between the computational filter and the feature map. The operation in F 1 can be approximately replaced by the operation in F 2 , thus reducing the computational effort.

Reference [ 19 ] proposes MSICs (ML-specific integrated circuits), ASICs dedicated to deep neural networks. These circuits use larger on-chip buffers instead of off-chip ones and exploit reusability and sparsity to achieve higher energy efficiency by reducing numerical precision and optimizing data flow. Experimental results show that this deep learning-specific ASIC is over 10 times more energy efficient than a CPU or GPU.

The main approach for deploying networks on ASIC is to create neural network accelerators. Networks can be deployed on ASIC through quantization, which can be further optimized by combining both quantization and neural network accelerator techniques. Additionally, a hardware accelerator can be improved by utilizing sparse architectures.

GPU (graphics processing unit) is a specialized processor used for accelerating the speed of image processing in hardware. Currently, GPUs are also widely used to accelerate neural networks in machine learning. In this paper, we mainly introduce the role of GPUs in enhancing the security of machine learning.

Reference [ 20 ] proposes CRYPTGPU, a privacy-preserving machine learning system that uses a cryptographic interface and a “GPU-friendly” cryptographic protocol to maintain privacy when performing both linear and nonlinear operations on the GPU. Experimental results show that the GPU-private convolution protocol is more than 150 times faster than similar CPU-based protocols, and about 10 times faster than CPU in simulations for nonlinear operations such as ReLU activation. It is concluded that GPUs can accelerate private training and inference for deep learning, and enable privacy-preserving deep learning on complex datasets and networks. Another paper, [ 21 ], proposes ParSecureML, a GPU-based framework that improves the performance of secure machine learning algorithms based on two-party computation in a secure context. It addresses problems of complex computation patterns, frequent intra-node data transfers between CPUs and GPUs, and complex inter-node data dependencies by implementing profiling-guided adaptive GPU utilization, fine-grained double pipeline intra-node CPU-GPU collaboration, and compressed transmission intra-node communication. Experimental results show that ParSecureML improves acceleration by an average of 33.8× over state-of-the-art secure machine learning frameworks, with an average acceleration of 31.7× when applied to inference.

The above two papers address security in machine learning, which is an undeveloped field of research. However, it may be a promising direction for embedded AI in the future.

2.4. Other Acceleration Hardware

In addition to the above three platforms, reference [ 22 ] deploys a pedestrian image detection system on a neuromorphic chip and compares it to a CPU system and a GPU system, demonstrating improved efficiency and reduced energy consumption compared to traditional GPU-accelerated and CPU multi-core systems. This approach is particularly beneficial for resource-constrained platforms. The authors also suggest areas for improvement, noting that the neuromorphic chip NM500 used in the paper supports only one hidden layer, which may limit the functionality of deep neural networks, and lacks dynamic switching capability, making it difficult to manage power consumption. Finally, they suggest that integrating the neuromorphic chip into System on Chips (SOC) could improve the performance and energy efficiency of machine learning on embedded systems. In another paper [ 23 ], the authors optimize and deploy a deep neural network on the NVIDIA Jetson platform, which supports more parallel computing capabilities and multiple output interfaces compared to other platforms. This facilitates the deployment of artificial intelligence algorithms and models on the Jetson platform.

2.5. Summary

The various acceleration methods for the three hardware platforms mentioned in 2.1, 2.2, and 2.3 have their own advantages in embedded AI applications and are suitable for different scenarios. They should be chosen according to different needs. If the scenario requires the frequent switching of computational tasks, FPGA can provide higher flexibility and programmability. If the scenario requires deep learning tasks with parallel computation on large amounts of data, GPUs can provide higher computational performance. If the applications have fixed computational tasks, ASICs can provide higher performance and energy efficiency. Of course, these three platforms can also work together. For example, in [ 24 ], the authors combined these two hardware platforms by exploiting the high-density computational performance of GPUs in machine learning and the low latency demonstrated by FPGAs in model inference. Table 1 provides the characteristics of the various approaches for each deployment platform in this chapter.

Literature on hardware accelerators.

3. The Key Technologies of Embedded AI

To deploy neural networks on resource-constrained devices such as embedded systems, several problems need to be addressed. Firstly, how can embedded devices carry huge models and algorithms of artificial intelligence with their limited resources? Secondly, how can hardware devices support algorithms and models after meeting the above conditions? This chapter proposes three key techniques to solve these challenges. Section 3.1 and Section 3.2 address model compression and algorithm optimization to meet the hardware constraints of resource-constrained devices, while Section 3.3 focuses on algorithm optimization on hardware to enhance support for AI algorithms. Model compression aims to maintain model accuracy while reducing its size as much as possible, while binarization networks aim to compress the model to the smallest possible size while improving model accuracy. Additionally, the chapter explores CPU/GPU algorithms by investigating the support provided by CPUs and GPUs for artificial intelligence algorithms.

3.1. Model Compression of Neural Network

3.1.1. network structure redesign.

The network structure design is a method of improving existing neural networks by designing new network structures. Many researchers have undertaken significant work in this area.

In 2017, Landola et al. [ 25 ] proposed SqueezeNet, a lightweight network that maintains accuracy using fewer parameters. SqueezeNet consists of two parts: a convolutional neural network architecture designed by the authors, and the Fire module. To maintain accuracy, three strategies were used in designing the convolutional neural network architecture: 1. using 1 × 1 filters instead of partial 3 × 3 filters, 2. using a squeeze layer to reduce the input channels of 3 × 3 filters, and 3. delaying downsampling (postponing the downsampling process to the end of the network). The network structure of SqueezeNet is shown in Figure 1 .

An external file that holds a picture, illustration, etc.
Object name is micromachines-14-00897-g001.jpg

SqueezeNet network structure.

In 2017, Howard et al. [ 9 ] proposed MobileNet, a lightweight network for mobile and embedded vision applications. This network introduced two global hyperparameters, α (Width Multiplier) and ρ (Resolution Multiplier), that can be balanced in terms of latency and accuracy. The core components of MobileNet include depthwise separable convolution, width multiplier, and resolution multiplier. The standard convolution is decomposed into a pointwise convolution and a depthwise convolution, where the depth convolution is connected after the input channel, and the pointwise convolution is connected between the depth convolution and the output. This structure is illustrated in Figure 2 .

An external file that holds a picture, illustration, etc.
Object name is micromachines-14-00897-g002.jpg

Depthwise separable convolution.

Depthwise separable convolution separates the filtering and merging functions in standard convolution, using one layer for filtering and another layer for merging. This decomposition method can significantly reduce computational effort and model size. In MobileNet, when targeting specific applications that require smaller models and lower computational costs, the depth and point convolutions are computed as separate network layers, resulting in a total of 28 layers. In 2018, Zhang et al. [ 26 ] proposed ShuffleNet, a lightweight network for devices with computational constraints. The network architecture employs pointwise group convolution and channel shuffle techniques to significantly reduce computational costs while maintaining accuracy. Based on the efficient depth-separable convolution or group convolution of Xception [ 27 ] and ResNeXt [ 28 ], ShuffleNet considers 1 × 1 group convolution. The feature map generated for each group in the previous layer is divided into separate channels for several subgroups in each group to provide different subgroups for each group in the next layer. The Shuffle operation is shown in Figure 3 .

An external file that holds a picture, illustration, etc.
Object name is micromachines-14-00897-g003.jpg

Shuffle group convolution operation.

Reference [ 29 ] proposed the once-for-all network (OFA), which can reduce training costs by selecting dedicated subnetworks without additional training. Additionally, the authors proposed a new progressive shrinkage algorithm for training OFA, which is a generalized pruning method. Experimental results showed that this method outperformed state-of-the-art NAS methods at the time and effectively reduced energy consumption. In another work [ 30 ], the network structure of Yolov5 was redesigned by embedding three network modules: CBAM, C3Ghost, and Ghost. The CBAM module was used to enhance the feature extraction capability, while C3Ghost and Ghost modules were used to reduce the number of parameters and floating-point operations. The MS COCO and PASCAL VOC datasets were used for experiments, and the results show that the new network structure had a slightly decreased average detection time, as well as a 50.61% reduction in floating-point operations and a 47.88% reduction in model parameters compared to the original network structure.

MobileNet has two more hyperparameters than SqueezeNet, which makes the application more flexible and able to be adjusted according to the actual application for computational cost and latency. ShuffleNet can generate more feature maps than SqueezeNet, but its deep convolution operation can only be performed on the bottleneck feature map, which leads to some difficulties in deployment on low-power devices. Additionally, most network structures are designed to adjust convolutional operations by replacing complex kernels with simpler ones to reduce computational complexity and the number of model parameters, while maintaining accuracy and reducing energy consumption. However, some network structures may have poor generalization ability and are only applicable in specific scenarios.

3.1.2. Quantization

Quantization is the compression of floating-point data bits in neural network parameters to reduce model complexity and size by reducing the number of bits used by floating-point numbers, while maintaining model accuracy as much as possible.

In reference [ 31 ], BRECQ, a framework for Post-training Quantization (PTQ), was proposed for the first time to limit the bit-width range of the post-training quantization task to INT2. The authors conducted a comprehensive theoretical study of second-order errors and found that the framework was able to balance cross-layer dependence and generalization errors. They also used approximate inter-layer and intra-layer sensitivity, incorporating hybrid precision techniques. Experimental results show that post-training quantization can obtain a model with similar precision to ResNet and MobileNetV2 with only four bits using the Quantization-Aware Training (QAT) method without additional conditions, and can obtain 240 times the production speed of the quantized model. In another work [ 32 ], the authors propose data-free quantization methods that do not require data, fine-tuning, or hyperparameter optimization. They suggest a method that uses the scale-equivalence property of the activation function to adjust the range of weights in a network and corrects the errors introduced in the quantization process. Experiments show that the data-free quantization method approaches the original model’s accuracy and is even comparable to more sophisticated training-based methods. The authors of reference [ 33 ] propose a mechanism for weight-rounding for post-training quantization, AdaRound, which does not require fine-tuning of the network and can cope with data and task loss by using only a small amount of unlabeled data. Experiments show that this mechanism maintains accuracy loss within 1% by quantizing the weights of ResNet-18 and ResNet-50 to 4 bits. Reference [ 34 ] describes the “deep compression” method, a comprehensive approach that is a three-stage pipeline of pruning, training quantization, and Huffman (Huffman) coding running together. This approach can reduce the storage requirements of neural networks by a factor of 35 to 49 without compromising accuracy. The principle of the method is to first learn the significant connections as the basis for pruning the network, then quantize the weights to achieve weight sharing, and finally apply Huffman coding. Before Huffman coding, the authors retrained the network to fine-tune the remaining connections and quantized centroids, and reduced by nine to thirteen times the connections through pruning. Quantization reduced the connection bitwidth from 32 bits to 5 bits. Experimental results show that the deep compression approach reduces the storage space required for AlexNet by a factor of 35 from 240 mb to 6.9 mb on the AlexNet dataset with no loss of accuracy, and achieved 3 to 4 times the layered acceleration and 3 to 7 times the energy efficiency on CPU, GPU, and mobile GPU benchmarks. Finally, the authors of reference [ 35 ] propose the Efficient Inference Engine (EIE), which can be deployed to SRAM (Static Random-Access Memory) platforms. This engine utilizes the sparsity of activation functions and weights, and the technique of weight sharing and quantization. The EIE can save 120 times the energy, respectively, 10 times the energy by using sparsity, 8 times the energy by weight assignment, and 3 times the energy by skipping zero activation functions using ReLU. Unlike the large deep neural networks trained by the “deep compression” method, the EIE is suitable for on-chip DRAM (Dynamic Random-Access Memory).

Despite the various quantization methods mentioned above, there is still a need for better quantization methods that can achieve higher compression rates with lower accuracy degradation. Additionally, it is crucial to consider the energy consumption of the compressed network model to make it suitable for use in resource-constrained devices such as embedded devices.

3.1.3. Pruning

Pruning is a method used to reduce redundant data in the neural network by determining the importance of each unit and removing unimportant parts.

One pruning method proposed in the literature [ 36 ] consists of three steps: first, train the network to learn the important connections; second, prune the unimportant connections; and finally retrain the network to adjust the weights of the remaining connections. Experimental results have shown that reducing the number of parameters by a factor of nine in the AlexNet network structure does not have a significant impact on performance. Another weight pruning method, ProbMask, was proposed in reference [ 37 ]. It measures the importance of weights by the probability of global criteria in all network layers and features automatic learning by setting constraints. Experimental results show that the ProbMask method can improve top-1 accuracy by about 10% compared with existing methods. Reference [ 38 ] proposes the ManiDp method, which maximizes the dynamic mining and pruning of redundant filters by embedding the manifold information of all instances into the pruned network space. The method achieves the dynamic removal of redundant filters, and experimental results show that it can reduce the number of floating-point operations by 55.3% while decreasing the top-1 accuracy by only 0.57% when applied on ResNet-34. A new channel exploration method, CHEX, is proposed in the literature [ 39 ] to solve the problem that traditional pruning methods require fully pre-trained large models and are limited by the pruning process. CHEX repeatedly prunes and regrows the channels during the training process, reducing the risk of prematurely pruning important channels. Experimental results show that a top-1 accuracy of 76% can be obtained using the CHEX compressed ResNet-50 model on the ImageNet dataset, reducing the Flops (floating point operations per second) to only 25% of the original ResNet-50 model. Finally, reference [ 40 ] proposes a channel pruning method that uses a random search method to determine the channel model of the pruned network structure. Experimental results show that the performance of the models obtained with different network structures and datasets is close under the random pruning scheme. However, the number of parameters has a great impact on the accuracy of the constructed networks, and the more parameters the lower the error rate of the pruned network after a certain amount of computation.

In conclusion, current pruning methods include weight pruning, channel pruning, and neuron pruning, each with its advantages and disadvantages. The criteria for determining the importance of the unit can impact the accuracy. The use of constraint learning methods can improve it, although the implementation is complex. The random search method of pruning is simple to implement, but the network model’s performance is limited compared to other methods that improve it. Therefore, each method has its advantages and needs to be selected in conjunction with the actual application scenario, and there is no single method that can synthesize the complexity and model compression efficiency.

3.2. Binary Neural Networks and Optimization Techniques

Convolutional neural networks consist of multiple network layers and millions of parameters. Due to their large size, it is challenging to deploy them directly on embedded devices that have high hardware requirements.

To address this issue, a binarization method was proposed to simplify the network parameters [ 41 ]. This method quantifies the weights and activation values into one fixed-point parameter, leading to memory savings and reduced inference time. However, the binarization network results in severe information loss. The direction of binarization network research is towards reducing information loss, reducing errors, and preserving model accuracy. The authors of reference [ 42 ], Courbariaux et al., proposed the concept of binarized neural networks and used a randomized binarization method during the training forward propagation. During the backward propagation, a clip function was introduced to intercept the full precision weight to update the range, compressing the number of network model parameters to a great extent and preventing the real weights from growing too fast without affecting the binary weights. One year later, after the Binary network was proposed, reference [ 43 ] proposed Binary-Weight-Networks and XNOR-Networks. The filters of binary-weighted networks approximate binary values, and XNOR-Networks filters and convolutional layer inputs are binary, which accelerates the speed of convolutional operation and saves 32 times the memory. The experiments using the ImageNet dataset show that the method improved the top-1 accuracy by 16% compared to other network binarization methods used at that time. Reference [ 44 ] presents the first hash method training binary method and experiments on CIFAR-10, CIFAR-100 with ImageNet dataset. The main work of the authors was to convert the training binary network into a hash problem, multiply the binary code by a scaling factor to reduce the loss caused by using the hash method, and propose alternate optimization methods to iteratively update the binary code and the scaling factor. Experimental results show a 3.0% improvement in accuracy for the ImageNet classification task compared to the best binarization network at that time. Reference [ 45 ] proposes Center-Symmetric Local Binary Convolutional Neural Networks (CS-LBCNN) for handwritten character recognition to address the problem that local binary networks are affected by randomly assigned local binary convolutional weights. The authors also propose an improvement—Threshold Center-Symmetric Local Binary Convolutional Neural Networks (TCS-LBCNN). The experiments were compared in CS-LBCNN and TCS-LBCNN using bilingual, MNIST, and MADBase datasets, and the average accuracies obtained in CS-LBCNN were 99.465%, 99.596%, and 99.502%, respectively. The final average accuracies achieved in TCS-LBCNN were 99.491%, 99.656%, and 99.534%. Experimental result showing in Figure 4 . The authors also compared with today’s advanced techniques and were able to achieve a small number of improvements in performance and accuracy.

An external file that holds a picture, illustration, etc.
Object name is micromachines-14-00897-g004.jpg

Reference [ 45 ] ( a ) CS-LBCNN experiment result; ( b ) TCS-LBCNN experiment result.

In reference [ 46 ], the AdaBin method was proposed to adaptively obtain the optimal binary set {b1, b2} for each layer of weights and activations. The method defined a new binary quantization function using the center position and distance of 1-bit values. An equalization method was proposed for the weights to minimize their Kullback–Leibler scatter, and a gradient-based optimization method was introduced to obtain the activation parameters. Experimental results showed that a top-1 precision of 66.4% could be obtained using the ResNet-18 architecture on the ImageNet dataset, and 69.4mAp (mean Average Precision, which is a synthesis of both Precision and Recall metrics) was obtained using SSD300 on the PASCAL VOC.

Binarization techniques are currently available to significantly reduce the size and complexity of models, enabling complex neural networks to be deployed on resource-constrained devices. However, accuracy loss remains a serious issue. To address this, researchers have explored using other values besides the traditional {−1, 1} for weights or activation values, increasing the complexity of the network to improve accuracy. Alternatively, adaptive binarization methods may be the future direction. These methods can dynamically adjust the range of weights and activation values based on the situation, providing a more flexible and accurate approach.

3.3. CPU/GPU Acceleration Algorithm

In addition to optimizing network performance, hardware-based CPU and GPU algorithms for neural network acceleration can also be optimized in the field of computer science. The current CPU/GPU acceleration algorithms for neural networks are mainly divided into three categories: adjusting the task scheduling strategy, enhancing CPU-GPU parallel computing efficiency, and strengthening GPU utilization.

Reference [ 47 ] proposed the RedSync method based on the RGC (Residual Gradient Compression) compression algorithm, which can reduce end-to-end training time in multi-GPU systems. This method significantly accelerated the training speed of large-scale deep neural networks. Reference [ 48 ] proposed the RedSync method based on the residual gradient compression (RGC) compression algorithm, which can reduce the end-to-end training time in multi-GPU systems. This method significantly accelerates the training speed of large-scale deep neural networks. In reference [ 9 ], the authors propose Troodon, a load-balanced scheduling heuristic suitable for CPUs and GPUs. The main idea of the algorithm is to organize all jobs into job pools according to device suitability, rank the jobs in the job pools according to the predicted speedup rate, and achieve load-balanced scheduling by considering the processing requirements of the jobs and the computational capacity of the devices. This is calculated by computing the compute shares of each device that are related to the available workload to be scheduled. The authors’ experiments were conducted on two CPU-GPU heterogeneous systems, and Troodon’s final processing time was reduced by 78%, 65%, and 41%, respectively, compared to the other three algorithms in the literature (DS, ISG, and AA). In reference [ 49 ], the authors propose methods to execute a local respective-field-based Extreme Learning Machine (LU) on a GPU platform. The first method is a new chunked logical unit decomposition algorithm that overcomes the global memory size limitation. The second method is used to accelerate the chunked LU decomposition efficient chunking algorithm Cholesky decomposition algorithm. Finally, the authors propose a heterogeneous CPU-GPU parallel algorithm that can make full use of GPU node resources. The experimental results showed that the chunking Cholesky decomposition algorithm was two times faster compared to the LU decomposition algorithm, while the heterogeneous blocking CPU-GPU acceleration algorithm improved the performance by 5% to 10% compared to the Cholesky algorithm. The authors of reference [ 50 ] propose Hybrid Parallel DESPOT (HyP-DESPOT), a massively parallel online planning algorithm based on the DESPOT algorithm that integrates CPU and GPU parallelism in a multilevel scheme for robotic environments. The DESPOT algorithm has three key steps—1. Forward Search, 2. Leaf node initialization, and 3. Backup—with the two steps of forward search and backup being irregular. The experimental results show that the execution speed of the HyP-DESPOT algorithm is significantly faster than that of DESPOT. In addition to CPU and GPU optimization for neural networks, specialized components can also be designed for neural network acceleration, such as the hardware-efficient vector convolutional neural network accelerator proposed in reference [ 51 ], which uses a 3 × 3 filter to optimize the shrink array. Here, a one-dimensional broadcast data stream is used to generate partial sums, thus compressing the computational overhead. The experiments in this paper mainly measure the improved neural network energy consumption and hardware utilization of the convolutional neural network accelerator applied to VGG-16, ResNet-34, GoogleNet, and MobileNet. The results show that the power consumption of the network loaded with the acceleration was reduced, and its hardware utilization reached 99%, 97%, 93.7%, and 94%. Reference [ 52 ] proposes kernel merging methods to improve GPU occupancy and also proposes a machine learning-based GPU sharing mechanism to select kernel pairs in the OpenCL framework. The approach in this work first selects a suitable architecture for the task and then merges GPU cores to improve resource utilization. Experimental results show that the model built using this method achieves 0.91 and 0.98 in F-measure.

The current CPU/GPU algorithms for neural network acceleration have improved support for embedded devices, and research is focused on optimizing their operational efficiency. However, the issue of energy consumption has been largely overlooked. It is important to consider the impact of these algorithms on energy consumption in future research.

3.4. Summary

This section introduces three key technologies: Model Compression, Binary networks, and CPU/GPU acceleration algorithms, respectively. These three technologies will determine whether embedded AI can deploy more complex and efficient models and algorithms in the future. Furthermore, it is important to consider how the network structure and hardware algorithm support can be better implemented in the future to facilitate the integration with IoT edge computing. Table 2 provides the characteristics of each work covered in this section.

Literature covering key technologies.

4. Application Modes of Embedded Artificial Intelligence

The application of embedded AI can be simply divided into three approaches: deploying trained models and weights and other data to embedded devices, training on embedded devices, and training partially on embedded devices and partially on other devices. The first two approaches have their characteristics. Post-training deployment is simpler, but it requires more computational resources in inference. Embedded device training can avoid the bottleneck of data transmission and computational resources, but training on embedded devices requires more computational resources and time compared to post-training deployment. The third approach is currently less studied, but task offloading in edge computing is more compatible with this direction. Table 3 shows the literature reviewed in this chapter.

Literature on application modes.

4.1. Post-Training Deployment

The authors of reference [ 53 ] utilized a wearable smart log patch to monitor and compare health data, such as blood pressure, temperature, and electrocardiogram readings. The wearable IoT smart log patch had a hardware specific chip size of 5 mm, operated at 0.9 V with a 120 Mhz clock speed, and had built-in Wi-Fi and Bluetooth, 14 I/O pins, and 12 sensors. The formula for accuracy in this work was as follows.

The article proposes an EC-BDLN algorithm, which reduces data faults during the transportation of information from the input layer to the output layer, resulting in high throughput. Additionally, when the network detects a data fault, it can transport data from the source to the destination address using minimal energy consumption. The algorithm also introduces a Gaussian factor that helps to maintain unwanted switching activities in the network and improve the prediction accuracy. In the experiments, the proposed algorithm outperforms other methods in terms of efficiency and has lower error rates. Reference [ 54 ] describes a wearable fall monitoring system that uses a wearable motion sensor with three axial devices, including a compass/magnetometer, gyroscope, and accelerometer. The proposed artificial intelligence IOT-CPS (Cyber-Physical Systems) diagnoses patient disease accuracy, recall, F-Measure, and execution time, and shows significant improvement over existing algorithms. Reference [ 55 ] is about deploying machine-learning capabilities on a smartphone mobile platform for target recognition. The proposed target detection system is based on two platforms: OpenCV and Tensorflow. The UAV configuration can transmit video at 720p/1080p with low latency and high quality. The controller is wirelessly connected to the UAV and the Android device, and these three components are synchronized. To deploy Yolov3 on an Android mobile device, the authors used Keras to process the network configuration file and the weight file. This generated an intermediate file with the extension h5, as well as a help file that defines the neural network in JSON format. In their experiments, the authors used SSD, YOLOv3, and Tiny-YOLOv3 deployed on two platforms to evaluate the accuracy, speed, RAM, energy consumption, and temperature. The authors found that OpenCV outperforms TensorFlow when performing image preprocessing on mobile devices. However, TensorFlow performs better when the tracking process is running, and it also has the highest neural network execution speed. TensorFlow significantly outperforms other platforms, achieving a higher number of frames per second. In reference [ 56 ], the authors explored artificial intelligence for basic hand position recognition in Chinese classical dance, using TensorFlow on NVIDIA Jeston series embedded AI development boards. They extracted the coordinates of human joint point positions in the video and calculated the similarity between imitators’ action poses and standard action samples using cosine similarity algorithms to achieve recognition and evaluation of academic poses in Chinese classical dance. Finally, reference [ 57 ] investigates image segmentation, feature extraction, and motion recognition methods based on embedded systems and artificial intelligence platforms for taekwondo technical action processes and motion effects. The authors optimized their algorithm, involving improved background subtraction for image segmentation and the use of principal component analysis to construct feature vectors to reduce the number of dimensions of the quantitative description of features. They chose a microcontroller containing max1300 chip and max9632 chip for deployment and achieved a desirable recognition rate for 10 actions in the video database. However, this work did not deploy the algorithm on a real-time system, only mentioning the possibility.

4.2. Training on Embedded Devices

In [ 4 ], the authors propose a zero-activation-skipping convolutional accelerator (ZASCA), which avoids the non-contributing multiplication of zero-valued activations. The main submodules are convolutional units (CEs), pipelines, and encoding. In the experiments, the authors denote ZASCA, which performs convolution on dense activation, as ZASCAD, and ZASCAS, which performs convolution on sparse activation, as ZASCAS. They compare ZASCAD and ZASCAS with Eyeriss, EEPS, and DSIP using AlexNet, VGG-16, ResNet-18, and ResNet-50 datasets. Eyeriss is tested on AlexNet and VGG-16, and EEPS and DSIP are tested on AlexNet only. The results show that when running AlexNet and VGG-16, ZASCAD runs 1.4 times and 3.1 times faster than Eyeriss, respectively, and ZASCAS performs AlexNet convolution 1.5 times faster than EEPS. Finally, ZASCAD and ZASCAS perform 2.7 times and 4 times better than DSIP at runtime, respectively. The authors also provide an energy consumption analysis in the paper, which yields the result that ZASCAS performs ResNet-50 convolution to obtain the highest energy efficiency. In [ 58 ], the authors used computational intelligence techniques for image analysis to study predictive computational intelligence techniques for the nitrogen status evaluation of wheat crops using Hue–Saturation–Intensity color normalization, genetic algorithm (GA), and artificial neural network (ANN) to predict the crop accuracy status and classify it. They propose an accurate crop yield prediction model that combines ANN and GA to predict crop yield by first extracting low-level color, morphology, and texture features, then further processing using GA, and finally optimizing the features for transmission to the ANN for crop prediction. The experiments in this work were conducted on an Intel Core i5 CPU using a dataset of 18,200 images of wheat crops, divided into four categories based on the age of the wheat, with 5460 images for testing and 12,740 images for the training and implementation of the neural network. The experimental results verified an accuracy of 97.75%, with a minimum error of 0.22 and a loss value reduction of 0.28 compared to other contemporary methods. After comparative experimental analysis, the model provided an effective and reliable approach to the crop yield prediction problem due to the optimization of the training parameters and the low hardware requirements of the method using ANN with GA. In [ 59 ], the authors proposed an energy-efficient thin and deep convolutional neural network architecture dedicated to traffic sign recognition, which contained four convolutional layers, two overlapping maximum pooling layers, and one fully connected hidden layer. Using overlapping maximum pooling layers instead of non-overlapping maximum pooling layers can solve the problem of overfitting and compress the image before processing. The authors used an Intel Core i7 CPU for training, and the datasets were German traffic sign recognition benchmark and Belgian traffic sign classification datasets. The final experimental results showed that the proposed architecture outperforms state-of-the-art traffic sign methods and reduces the system energy consumption due to the lower computational effort of the method used in the literature. Since the computational cost is linearly equivalent to the energy consumption, the method achieves reduced energy consumption and complexity while maintaining the accuracy of other contemporary methods. However, the network structure was designed exclusively for traffic sign recognition and lacked generalization.

4.3. Partial Training

In reference [ 5 ], the authors proposed incorporating deep learning into the edge computing environment to enhance learning performance while reducing network traffic. By utilizing deep learning, edge computing can extract important information and reduce the communication overhead transmitted from edge devices to the cloud server. Additionally, some of the learning layers can be implemented at the edge during model training, and only the reduced intermediate data are transferred to the centralized cloud server. Experimental results demonstrate that this approach outperforms other methods in optimizing IoT deep learning.

4.4. Summary

Currently, most embedded AI applications are deployed on embedded devices after being trained elsewhere. However, research on on-device training and the execution of tasks is relatively scarce. These two deployment modes have different requirements and are more difficult to match. On-device training can lead to better generalization, but also requires higher hardware capabilities. In the future, the development of hardware accelerators will likely promote the development of both deployment modes.

5. The Outlook of Embedded Artificial Intelligence

In the Industry 4.0 environment, the digitalization process of the manufacturing industry relies on embedded intelligence technology. To achieve this, more complex and intelligent artificial intelligence algorithms and models need to be deployed to resource-constrained embedded devices. Embedded intelligence will play an important role in the digital transformation of the manufacturing industry. The following are some considerations for embedded artificial intelligence technology:

  • Efficient algorithms and lightweight models: In the current society, most workers need to frequently switch between different work scenarios. This results in higher requirements for device portability, including weight, volume, energy consumption, and other factors. To ensure the portability of the devices, the development of intelligent devices requires the study of more efficient algorithms and lightweight network models while maintaining model accuracy and reducing network model complexity.
  • Hardware acceleration methods: In addition to optimization in algorithms and models, optimization can also be achieved at the hardware level. The current research on hardware acceleration methods is limited to a single architecture of the neural network accelerator. Applying a hardware neural network accelerator to multiple platforms or using multiple hardware devices in combination may become a solution to the problem in the future.
  • Deployment optimization: Embedded AI deployment can be divided into post-training deployment, training on embedded devices, and part of the training task on embedded devices. Current post-training deployment has a high demand for training speed on other platforms, which can be met by improving the model training speed. The need for training on embedded devices is consistent with the first point of this subsection, requiring more efficient algorithms and lighter network models to reduce the difficulty of model training on embedded devices. For tasks completed on embedded devices, consideration of post-training models for integration is required to ensure model integrity.
  • Compatibility: According to reference [ 60 ], the current embedded intelligence in the industry still faces problems. For example, in legacy automation systems, some dedicated functions lack interoperability with the current automation system due to various reasons. At the same time, there is no standard method to manage the edge computing nodes and data collection. Additionally, utilizing the large amount of data generated by the edge computing and industrial cloud working together in machine learning remains an issue.

6. Conclusions

This paper presents the current state of development in embedded AI from three perspectives: hardware acceleration methods, key technologies, and application models. Currently, embedded intelligence has established a foundation that includes deployment platforms for AI support, such as designing hardware accelerators for neural networks, optimizing network models, which includes network structure design, quantization, pruning, and binarization methods, and improving underlying hardware algorithms. All of the above technologies contribute to the deployment of AI on resource-constrained devices, but further development is needed in the following areas: efficient algorithms and lightweight models, the optimization of hardware acceleration methods, and the optimization of deployment methods and compatibility.

Funding Statement

The research was supported by the Department of Education of Guangdong Province (2022ZDZX1029).

Author Contributions

Conceptualization, Z.Z. and J.L.; writing—original draft preparation, J.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Conflicts of interest.

The authors declare no conflict of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Modern automotive embedded systems with special mention to radars

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

Book cover

  • Conference proceedings
  • © 2020

Embedded Systems and Artificial Intelligence

Proceedings of ESAI 2019, Fez, Morocco

  • Vikrant Bhateja 0 ,
  • Suresh Chandra Satapathy 1 ,
  • Hassan Satori 2

Department of Electronics and Communication Engineering, Shri Ramswaroop Memorial Group of Professional Colleges (SRMGPC), Lucknow, India

You can also search for this editor in PubMed   Google Scholar

School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT), Bhubaneswar, India

Department of computer sciences, faculty of sciences dhar mahraz, sidi mohammed ben abbdallah university, fez, morocco.

  • Gathers outstanding research papers presented at the ESAI 2019
  • Discusses new findings in embedded systems and artificial intelligence
  • Serves as a reference guide for researchers and practitioners in academia and industry

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1076)

125k Accesses

270 Citations

2 Altmetric

  • Table of contents

About this book

Editors and affiliations, about the editors, bibliographic information.

  • Publish with us

Buying options

  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Other ways to access

This is a preview of subscription content, log in via an institution to check for access.

Table of contents (85 papers)

Front matter, embedded computing and applications, real-time implementation of artificial neural network in fpga platform.

  • Mohamed Atibi, Mohamed Boussaa, Abdellatif Bennis, Issam Atouf

Design of a 0.18 μm CMOS Inductor-Based Bandpass Microwave Active Filter

  • Mariem Jarjar, Nabih Pr. El Ouazzani

Improvement of the Authentication on In-Vehicle Controller Area Networks

  • Asmae Zniti, Nabih El Ouazzani

Comparative Evaluation of Speech Recognition Systems Based on Different Toolkits

  • Fatima Barkani, Hassan Satori, Mohamed Hamidi, Ouissam Zealouk, Naouar Laaidi

Stability Analysis and \(H_\infty \) Performance for Nonlinear Fuzzy Networked Control Systems with Time-Varying Delay

  • Ahmed Ech-charqy, Said Idrissi, Mohamed Ouahi, El Houssaine Tissir, Ghali Naami

Driver Behavior Assistance in Road Intersections

  • Safaa Dafrallah, Aouatif Amine, Stéphane Mousset, Abdelaziz Bensrhair

BER Performance of CE-OFDM System: Over AWGN Channel and Frequency-Selective Channel Using MMSE Equalization

  • J. Mestoui, M. El Ghzaoui, K. El Yassini

Towards a Dynamical Adaptive Core for Urban Flows Simulations

  • Hind Talbi, El Miloud Chaabelasri, Najim Salhi, Imad Elmahi

An Efficiency Study of Adaptive Median Filtering for Image Denoising, Based on a Hardware Implementation

  • Hasnae El Khoukhi, Faiz Mohammed Idriss, Ali Yahyaouy, My Abdelouahed Sabri

Processing Time and Computing Resources Optimization in a Mobile Edge Computing Node

  • Mohamed El Ghmary, Tarik Chanyour, Youssef Hmimz, Mohammed Ouçamah Cherkaoui Malki

Viscous Dissipation and Heat Source/Sink Effects on Convection Heat Transfer in a Saturated Porous Medium

  • Elyazid Flilihi, Mohammed Sriti, Driss Achemlal

Fast and Efficient Decoding Algorithm Developed from Concatenation Between a Symbol-by-Symbol Decoder and a Decoder Based on Syndrome Computing and Hash Techniques

  • M. Seddiq El Kasmi Alaoui, Saïd Nouh, Abdelaziz Marzak

SWIMAC: An Ultra-Low Power Consumption MAC Protocol for IR-UWB-Based WSN

  • Anouar Darif

An Optimal Design of a Short-Channel RF Low Noise Amplifier Using a Swarm Intelligence Technique

  • Soufiane Abi, Hamid Bouyghf, Bachir Benhala, Abdelhadi Raihani

Study and Design of a See Through Wall Imaging Radar System

  • M. Abdellaoui, M. Fattah, S. Mazer, M. El bekkali, M. El ghazi, Y. Balboul et al.

Analog Active Filter Component Selection Using Genetic Algorithm

  • Asmae El Beqal, Bachir Benhala, Izeddine Zorkani

Preliminary Study of Roots by Georadar System

  • Ahmed Faize, Gamil Alsharahi

Predictive Agents for the Forecast of CO2 Emissions Issued from Electrical Energy Production and Gas Consumption

  • Seif Eddine Bouziane, Mohamed Tarek Khadir
  • Conference Proceedings
  • Artificial Intelligence
  • Embedded Systems
  • Fuzzy and Expert Systems
  • Networked Embedded Systems
  • Mobile Embedded Systems

Vikrant Bhateja

Suresh Chandra Satapathy

Hassan Satori

Book Title : Embedded Systems and Artificial Intelligence

Book Subtitle : Proceedings of ESAI 2019, Fez, Morocco

Editors : Vikrant Bhateja, Suresh Chandra Satapathy, Hassan Satori

Series Title : Advances in Intelligent Systems and Computing


Publisher : Springer Singapore

eBook Packages : Intelligent Technologies and Robotics , Intelligent Technologies and Robotics (R0)

Copyright Information : Springer Nature Singapore Pte Ltd. 2020

Softcover ISBN : 978-981-15-0946-9 Published: 08 April 2020

eBook ISBN : 978-981-15-0947-6 Published: 07 April 2020

Series ISSN : 2194-5357

Series E-ISSN : 2194-5365

Edition Number : 1

Number of Pages : XXII, 910

Number of Illustrations : 174 b/w illustrations, 277 illustrations in colour

Topics : Computational Intelligence , Artificial Intelligence , Information Systems Applications (incl. Internet) , Systems and Data Security

Policies and ethics

  • Find a journal
  • Track your research


Engineer's Planet

Mtech, Btech Projects, PhD Thesis and Research Paper Writing Services in Delhi

Embedded Systems Project Topics With Abstracts and Base Papers 2024

Embark on a transformative journey into the world of embedded systems with our meticulously curated selection of M.Tech project topics for 2024, thoughtfully complemented by trending IEEE base papers. These projects encapsulate the forefront of innovation and challenges in embedded systems, offering an indispensable resource for  M.Tech students seeking to delve into the dynamic landscape of intelligent and interconnected devices .Our comprehensive compilation covers a diverse range of Embedded Systems project topics, each intricately paired with an associated base paper and a succinct abstract. From IoT applications and real-time systems to edge computing and hardware-software co-design, these projects mirror the current trends in embedded systems development. Stay ahead of the curve by exploring projects that align with the latest IEEE standards and technological breakthroughs. Whether you’re a student, researcher, or industry professional, our collection serves as a gateway to the cutting edge of embedded systems advancements. The project titles are strategically chosen to incorporate keywords that resonate with the latest trends in embedded systems , ensuring relevance and alignment with the evolving technological landscape. Delve into the abstracts to quickly grasp the scope, methodologies, and potential impacts of each project.

M.Tech Projects Topics List In Embedded Systems

Leave a Reply Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed .

  • Quantum Research

Landmark IBM error correction paper published on the cover of Nature

Ibm has created a quantum error-correcting code about 10 times more efficient than prior methods — a milestone in quantum computing research..

Landmark IBM error correction paper published on the cover of Nature

27 Mar 2024

Rafi Letzter

Share this blog

Today, the paper detailing those results was published as the cover story of the scientific journal Nature. 1

Last year, we demonstrated that quantum computers had entered the era of utility , where they are now capable of running quantum circuits better than classical computers can. Over the next few years, we expect to find speedups over classical computing and extract business value from these systems. But there are also algorithms with mathematically proven speedups over leading classical methods that require tuning quantum circuits with hundreds of millions, to billions, of gates. Expanding our quantum computing toolkit to include those algorithms requires us to find a way to compute that corrects the errors inherent to quantum systems — what we call quantum error correction.

Read how a paper from IBM and UC Berkeley shows a path toward useful quantum computing

Quantum error correction requires that we encode quantum information into more qubits than we would otherwise need. However, achieving quantum error correction in a scalable and fault-tolerant way has, to this point, been out of reach without considering scales of one million or more physical qubits. Our new result published today greatly reduces that overhead, and shows that error correction is within reach.

While quantum error correction theory dates back three decades, theoretical error correction techniques capable of running valuable quantum circuits on real hardware have been too impractical to deploy on quantum system. In our new paper, we introduce a new code, which we call the gross code , that overcomes that limitation.

This code is part of our broader strategy to bring useful quantum computing to the world.

While error correction is not a solved problem, this new code makes clear the path toward running quantum circuits with a billion gates or more on our superconducting transmon qubit hardware.

What is error correction?

Quantum information is fragile and susceptible to noise — environmental noise, noise from the control electronics, hardware imperfections, state preparation and measurement errors, and more. In order to run quantum circuits with millions to billions of gates, quantum error correction will be required.

Error correction works by building redundancy into quantum circuits. Many qubits work together to protect a piece of quantum information that a single qubit might lose to errors and noise.

On classical computers, the concept of redundancy is pretty straightforward. Classical error correction involves storing the same piece of information across multiple bits. Instead of storing a 1 as a 1 or a 0 as a 0, the computer might record 11111 or 00000. That way, if an error flips a minority of bits, the computer can treat 11001 as 1, or 10001 as 0. It’s fairly easy to build in more redundancy as needed to introduce finer error correction.

Things are more complicated on quantum computers. Quantum information cannot be copied and pasted like classical information, and the information stored in quantum bits is more complicated than classical data. And of course, qubits can decohere quickly, forgetting their stored information.

Research has shown that quantum fault tolerance is possible, and there are many error correcting schemes on the books. The most popular one is called the “surface code,” where qubits are arranged on a two-dimensional lattice and units of information are encoded into sub-units of the lattice.

But these schemes have problems.

First, they only work if the hardware’s error rates are better than some threshold determined by the specific scheme and the properties of the noise itself — and beating those thresholds can be a challenge.

Second, many of those schemes scale inefficiently — as you build larger quantum computers, the number of extra qubits needed for error correction far outpaces the number of qubits the code can store.

At practical code sizes where many errors can be corrected, the surface code uses hundreds of physical qubits per encoded qubit worth of quantum information, or more. So, while the surface code is useful for benchmarking and learning about error correction, it’s probably not the end of the story for fault-tolerant quantum computers.

Exploring “good” codes

The field of error correction buzzed with excitement in 2022 when Pavel Panteleev and Gleb Kalachev at Moscow State University published a landmark paper proving that there exist asymptotically good codes — codes where the number of extra qubits needed levels off as the quality of the code increases.

This has spurred a lot of new work in error correction, especially in the same family of codes that the surface code hails from, called quantum low-density parity check, or qLDPC codes. These qLDPC codes are quantum error correcting codes where the operations responsible for checking whether or not an error has occurred only have to act on a few qubits, and each qubit only has to participate in a few checks.

But this work was highly theoretical, focused on proving the possibility of this kind of error correction. It didn’t take into account the real constraints of building quantum computers. Most importantly, some qLDPC codes would require many qubits in a system to be physically linked to high numbers of other qubits. In practice, that would require quantum processors folded in on themselves in psychedelic hyper-dimensional origami, or entombed in wildly complex rats’ nests of wires.

In our paper, we looked for fault-tolerant quantum memory with a low qubit overhead, high error threshold, and a large code distance.

High-threshold and low-overhead fault-tolerant quantum memory

Bravyi, S., Cross, A., Gambetta, J., et al. High-threshold and low-overhead fault-tolerant quantum memory. Nature (2024).

In our Nature paper, we specifically looked for fault-tolerant quantum memory with a low qubit overhead, high error threshold, and a large code distance.

Let’s break that down:

Fault-tolerant: The circuits used to detect errors won't spread those errors around too badly in the process, and they can be corrected faster than they occur

Quantum memory: In this paper, we are only encoding and storing quantum information. We are not yet doing calculations on the encoded quantum information.

High error threshold: The higher the threshold, the higher amount of hardware errors the code will allow while still being fault tolerant. We were looking for a code that allowed us to operate the memory reliably at physical error rates as high as 0.001, so we wanted a threshold close to 1 percent.

Large code distance: Distance is the measure of how robust the code is — how many errors it takes to completely flip the value from 0 to 1 and vice versa. In the case of 00000 and 11111, the distance is 5. We wanted one with a large code distance that corrects more than just a couple errors. Large-distance codes can suppress noise by orders of magnitude even if the hardware quality is only marginally better than the code threshold. In contrast, codes with a small distance become useful only if the hardware quality is significantly better than the code threshold.

Low qubit overhead: Overhead is the number of extra qubits required for correcting errors. We want the number of qubits required to do error correction to be far less than we need for a surface code of the same quality, or distance.

We’re excited to report that our team’s mathematical analysis found concrete examples of qLDPC codes that met all of these required conditions. These fall into a family of codes called “Bivariate Bicycle (BB)” codes. And they are going to shape not only our research going forward, but how we architect physical quantum systems.

The gross code

While many qLDPC code families show great promise for advancing error correction theory, most aren’t necessarily pragmatic for real-world application. Our new codes lend themselves better to practical implementation because each qubit needs only to connect to six others, and the connections can be routed on just two layers.

To get an idea of how the qubits are connected, imagine they are put onto a square grid, like a piece of graph paper. Curl up this piece of graph paper so that it forms a tube, and connect the ends of the tube to make a donut. On this donut, each qubit is connected to its four neighbors and two qubits that are farther away on the surface of the donut. No more connections needed.

The good news is we don’t actually have to embed our qubits onto a donut to make these codes work — we can accomplish this by folding the surface differently and adding a few other long-range connectors to satisfy mathematical requirements of the code. It’s an engineering challenge, but much more feasible than a hyper-dimensional shape.

We explored some codes that have this architecture and focused on a particular [[144,12,12]] code. We call this code the gross code because 144 is a gross (or a dozen dozen). It requires 144 qubits to store data — but in our specific implementation, it also uses another 144 qubits to check for errors, so this instance of the code uses 288 qubits. It stores 12 logical qubits well enough that fewer than 12 errors can be detected. Thus: [[144,12,12]].

Using the gross code, you can protect 12 logical qubits for roughly a million cycles of error checks using 288 qubits. Doing roughly the same task with the surface code would require nearly 3,000 qubits.

This is a milestone. We are still looking for qLDPC codes with even more efficient architectures, and our research on performing error-corrected calculations using these codes is ongoing. But with this publication, the future of error correction looks bright.

fig1-Tanner Graphs of Surface and Bivariate Bicycle Codes.png

Fig. 1 | Tanner graphs of surface and BB codes.

Fig. 1 | Tanner graphs of surface and BB codes. a, Tanner graph of a surface code, for comparison. b, Tanner graph of a BB code with parameters [[144, 12, 12]] embedded into a torus. Any edge of the Tanner graph connects a data and a check vertex. Data qubits associated with the registers q(L) and q(R) are shown by blue and orange circles. Each vertex has six incident edges including four short-range edges (pointing north, south, east and west) and two long-range edges. We only show a few long-range edges to avoid clutter. Dashed and solid edges indicate two planar subgraphs spanning the Tanner graph, see the Methods. c, Sketch of a Tanner graph extension for measuring Z ˉ \={Z} and X ˉ \={X} following ref. 50, attaching to a surface code. The ancilla corresponding to the X ˉ \={X} measurement can be connected to a surface code, enabling load-store operations for all logical qubits by means of quantum teleportation and some logical unitaries. This extended Tanner graph also has an implementation in a thickness-2 architecture through the A and B edges (Methods).

Syndrome measurement circuit

Fig. 2 | Syndrome measurement circuit.

Fig. 2 | Syndrome measurement circuit. Full cycle of syndrome measurements relying on seven layers of CNOTs. We provide a local view of the circuit that only includes one data qubit from each register q(L) and q(R) . The circuit is symmetric under horizontal and vertical shifts of the Tanner graph. Each data qubit is coupled by CNOTs with three X-check and three Z-check qubits: see the Methods for more details.

Why error correction matters

Today, our users benefit from novel error mitigation techniques — methods for reducing or eliminating the effect of noise when calculating observables, alongside our work suppressing errors at the hardware level. This work brought us into the era of quantum utility. IBM researchers and partners all over the world are exploring practical applications of quantum computing today with existing quantum systems. Error mitigation lets users begin looking for quantum advantage on real quantum hardware.

But error mitigation comes with its own overhead, requiring running the same executions repeatedly so that classical computers can use statistical methods to extract an accurate result. This limits the scale of the programs you can run, and increasing that scale requires tools beyond error mitigation — like error correction.

Last year, we debuted a new roadmap laying out our plan to continuously improve quantum computers over the next decade. This new paper is an important example of how we plan to continuously increasing the complexity (number of gates) of the quantum circuits that can be run on our hardware. It will allow us to transition from running circuits with 15,000 gates to 100 million, or even 1 billion gates.

Bravyi, S., Cross, A.W., Gambetta, J.M. et al. High-threshold and low-overhead fault-tolerant quantum memory. Nature 627, 778–782 (2024).

Start using our 100+ qubit systems

Keep exploring, computing with error-corrected quantum computers.

Logical gates with magic state distillation

Logical gates with magic state distillation

Error correcting codes for near-term quantum computers

Error correcting codes for near-term quantum computers

recent research papers in embedded systems

A new paper from IBM and UC Berkeley shows a path toward useful quantum computing


  1. (PDF) The Embedded Systems Design Challenge

    recent research papers in embedded systems

  2. VTU Mtech Advanced Embedded systems question papers Dec 2013-Jan 2014-12

    recent research papers in embedded systems

  3. Buy International Journal of Advances in Embedded System Research

    recent research papers in embedded systems

  4. (PDF) Embedded System Design Using UML and Platforms

    recent research papers in embedded systems

  5. Design methodology for embedded systems.

    recent research papers in embedded systems


    recent research papers in embedded systems


  1. Embedded Systems Course

  2. || How to read a Research Papers || AI tool for summarize Research Paper in just 5sec ||#aitools

  3. Benefits of Software Defined Radio RF Globalnet

  4. Embedded Systems || Arrays & Linked Lists || Lecture 9

  5. Differentiating System Architectures: Applying Architecture Measures

  6. UNSW Research


  1. 261298 PDFs

    Electronics, Embedded Systems, and Microcontrollers, Hardware in the Loop | Explore the latest full-text research PDFs, articles, conference papers, preprints and more on EMBEDDED SYSTEMS. Find ...

  2. IEEE Embedded Systems Letters

    Profession and Education. Technical Interests. Need Help? US & Canada:+1 800 678 4333. Worldwide: +1 732 981 0060. Contact & Support. About IEEE Xplore. Contact Us. Help.

  3. embedded systems Latest Research Papers

    Embedded systems are increasingly used in our daily life due to their importance. They are computer platforms consisting of hardware and software. They run specific tasks to realize functional and non functional requirements. Several specific quality attributes were identified as relevant to the embedded system domain.

  4. Embedded Systems Education in the 2020s: Challenges, Reflections, and

    In [12], guiding principles were presented for the embedded systems teaching and research agenda at UC Berkeley. The authors discuss how embedded systems concepts are used to bring together the physical and computational world and how students are educated in critical reasoning about modeling and abstraction.

  5. Real-Time Embedded Systems: Present and Future

    1. Introduction. Real-time and networked embedded systems are important bidirectional bridges between the physical and the information worlds. Embedded intelligence increasingly pervades industry, infrastructure, and public and private spaces, being identified as a society and economy emerging "neural system" that supports both societal ...

  6. A Literature Review on Embedded Systems

    Abstract: Embedded system design is crucial for the development of industry, technology, and science, and it is an area that has significantly grown in recent years throughout Latin America, both in academia and in industry. Embedded System (ES) refers to electronic equipment with a computing core which, unlike a personal computer, is designed to meet a specific function and is usually ...

  7. Emerging and New Technologies in Embedded Systems

    Dear Colleagues, This Special Issue (SI) invites papers on research achievements in emerging and new technologies in embedded systems. Embedded systems have matured as a discriminating technology for a wide variety of applications in our daily life. However, this immense impact is unfortunately confined by the power and cost constraints of ...

  8. [2206.03263] Embedded Systems Education in the 2020s: Challenges

    This paper describes the evolution of embedded systems education over the past two decades and challenges facing the designers and instructors of embedded systems curricula in the 2020s. Reflections from over a decade of teaching the design of embedded computing systems are presented, with insights on strategies that show promise to address ...

  9. Recent Advances in Embedded Computing, Intelligence and Applications

    Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for future research directions and describes possible research applications. ... embedded-system design ...

  10. Research on Modeling and Design of Real-Time Embedded Systems

    This paper proposed an Actor-oriented modeling and designing methodology for complex real time embedded system. Embedded system can be divided into multiple single computing model subsystems according to different computation model and different abstract hierarchy by this method. Then, specialist can construct this subsystem according to their previous experiences and knowledge, and the ...

  11. International Journal of Embedded Systems (IJES) Inderscience

    Embedded systems are increasingly becoming a key technological component of all kinds of complex technical systems, ranging from vehicles, telephones, audio-video equipment, aircraft, toys, security systems, medical diagnostics, to weapons, pacemakers, climate control systems, manufacturing systems, intelligent power systems etc. IJES addresses the state of the art of all aspects of embedded ...

  12. PDF [Electronics] Real-Time Embedded Systems

    electronics Editorial Real-Time Embedded Systems: Present and Future Christos Koulamas 1,* and Mihai T. Lazarescu 2 ID 1 Industrial Systems Institute/"Athena" Research Center, PSP Bldg, Stadiou Strt, 26504 Patras, Greece 2 Dipartimento di Elettronica e Telecomunicazioni Politecnico di Torino Turin, I-10129 Turin, Italy; [email protected] * Correspondence: [email protected]; Tel.: +30 ...

  13. A Review of Artificial Intelligence in Embedded Systems

    These developments have facilitated a new research direction: embedded artificial intelligence [1,2,3]. The concept of embedded AI was first introduced in reference ... (SOC) could improve the performance and energy efficiency of machine learning on embedded systems. In another paper ...

  14. Real-Time Embedded Systems: Architectures, Software and Applications

    Special Issue Information. Dear Colleagues, The objective of this Special Issue is to invite original and high-quality papers that describe research, technical aspects, or developments in real-time embedded systems architecture, software, and applications. The research on real-time embedded systems has gained great interest in the last years as ...

  15. Modern automotive embedded systems with special mention to radars

    Electronics and embedded systems have become an increasingly integral part of modern automobiles. Automotive embedded systems are a class of embedded systems which are based on the concept of generic embedded systems but vary widely in terms of design constrains, operating environments, performance goals and many other factors. This paper serves as a review of the recent trends of automotive ...

  16. embedded system engineering research papers IEEE PAPER

    robotics-research-papers-2012 embedded system-recent papers. Design and implementation of an embedded home surveillance system by use of multiple ultrasonic sensors; Development of Variable Rate Fertilizer Applicator Module Based on 8-bit Embedded System; An operating system architecture for organic computing in embedded real-time systems

  17. Embedded Systems and Artificial Intelligence

    This book gathers selected research papers presented at the First International Conference on Embedded Systems and Artificial Intelligence (ESAI 2019), held at Sidi Mohamed Ben Abdellah University, Fez, Morocco, on 2-3 May 2019. Highlighting the latest innovations in Computer Science, Artificial Intelligence, Information Technologies, and ...

  18. Embedded Systems Project Topics With Abstracts and Base Papers 2024

    Embark on a transformative journey into the world of embedded systems with our meticulously curated selection of M.Tech project topics for 2024, thoughtfully complemented by trending IEEE base papers. These projects encapsulate the forefront of innovation and challenges in embedded systems, offering an indispensable resource for M.Tech students seeking to delve into the dynamic landscape of ...

  19. Sensors

    Special Issue Information. Dear Colleagues, The recent momentum of the Internet of Things (IoT) is driving the need for embedded systems comprising one or more low-power and resource-constrained computing elements and sensors that are organised into part of a more extensive network. At the same time, this also creates new challenges related to ...

  20. PDF National Bureau of Economic Research

    National Bureau of Economic Research

  21. IBM Quantum Computing Blog

    This new paper is an important example of how we plan to continuously increasing the complexity (number of gates) of the quantum circuits that can be run on our hardware. It will allow us to transition from running circuits with 15,000 gates to 100 million, or even 1 billion gates.

  22. Real-Time Systems in Emerging IoT-Embedded Applications

    Dear Colleagues, The objective of this Special Issue is the design, development, and implementation of real-time systems for emerging IoT and industrial IoT-embedded systems and their applications. Research on real-time control of time-sensitive systems has attracted great interest in recent years due to the increasing utilization of embedded ...