Malware Detection with Artificial Intelligence: A Systematic Literature Review

1 Introduction

2 Related Work

| Paper | Malware Sophistication | Static & Dynamic Analysis | Malware Datasets | Feature Selection | ML & DL | Results | Challenges |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [ ] | ≈ | ✓ | ✓ | ✓ | ✓ | | |
| [ ] | ≈ | ✓ | ✓ | ✓ | | | |
| [ ] | ✓ | ✓ | ≈ | | | | |
| [ ] | ✓ | ≈ | ✓ | ✓ | ≈ | ≈ | ≈ |
| [ ] | ✓ | ≈ | ≈ | ≈ | ≈ | | |
| [ ] | ✓ | ✓ | ✓ | ✓ | ≈ | ≈ | ✓ |
| This article | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

3 Review Methodology

3.1 Research Questions

3.2 Search Strategy

3.2.1 ACM Digital Library. 3.2.2 IEEE Xplore. 3.2.3 Scopus. 3.2.4 Google Scholar.

3.3 Inclusion and Exclusion Criteria

4.1 Malware Sophistication

4.1.1 Evasive Malware.

| Operation [ ] | Details |
| --- | --- |
| Instruction Pointer (IP) | The IP is monitored for every instruction that is processed by the Central Processing Unit (CPU). |
| Application Programming Interface (API) | Allows programs to communicate; malware can interact with the OS and use its libraries. |
| System Calls (syscalls) | Allow programs to request a service from the kernel; malware can request a service and, based on the return, determine whether it is running on bare metal. |
| Memory Access | Malware can attempt to read certain parts of memory to detect whether it is running on bare metal or in a virtualized or instrumented environment. |
| Evasive Technique [ ] | Pin Detected [ ] | Malware (%) [ ] | Benign (%) [ ] |
| --- | --- | --- | --- |
| IsDebuggerPresent | | 1,516 (3.34) | 17 (1.79) |
| CheckRemoteDebuggerPresent | | 432 (0.95) | 1 (0.11) |
| OutputDebugString | | 794 (1.75) | 26 (2.74) |
| FindWindow | | 2,245 (4.95) | 51 (5.37) |
| QueryInformationProcess | | 1,028 (2.27) | 4 (0.42) |
| SetInformationThread | | 350 (0.77) | 1 (0.11) |
| OutputDebugString()FormatString | | N.A. | N.A. |
| SeDebugPrivilege OpenProcess | | N.A. | N.A. |
| QIP(ProcessDebugFlags) | ✓ | 1,028 (2.27) | 4 (0.42) |
| QIP(DebugHandleObject) | | 10,651 (23.47) | 157 (16.53) |
| QueryPerformanceCounter | ✓ | 6,038 (13.31) | 140 (14.74) |
| GetTickCount | | 11,029 (24.32) | 198 (20.84) |
| timeGetTime | | 805 (1.77) | 32 (3.37) |
| CloseHandle | | 3,104 (6.84) | 19 (2.0) |
| Hardware Breakpoints | | 46 (0.1) | 0 (0) |
| Control-C Vectored Exception | | 0 (0) | 0 (0) |
| RDTSC | ✓ | 9,518 (20.98) | 168 (17.7) |
| INT 3 Exception (0XCC) | | 747 (1.65) | 0 (0) |
| INT 2D (Kernel Debugger Interrupt) | ✓ | 39 (0.09) | 0 (0) |
| ICE Breakpoint | | 61 (0.13) | 0 (0) |
| Single Step Detection | ✓ | 141 (0.31) | 0 (0) |
| Unhandled Exception Filter | | 15,651 (34.49) | 475 (50) |
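To make the first two rows of the table concrete, the following is a minimal sketch (assuming Windows and CPython) of how a program can probe for an attached debugger through the documented Win32 APIs IsDebuggerPresent and CheckRemoteDebuggerPresent in kernel32.dll; real evasive malware wraps such probes in its control flow rather than printing the result.

```python
# Minimal sketch of two debugger-presence probes via ctypes (Windows only).
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

def is_debugger_present() -> bool:
    # Reads the BeingDebugged flag of the current process's PEB.
    return bool(kernel32.IsDebuggerPresent())

def remote_debugger_present() -> bool:
    # Asks the kernel whether a debugger is attached to this process.
    flag = wintypes.BOOL(False)
    kernel32.CheckRemoteDebuggerPresent(
        kernel32.GetCurrentProcess(), ctypes.byref(flag)
    )
    return bool(flag.value)

if __name__ == "__main__":
    print("IsDebuggerPresent:", is_debugger_present())
    print("CheckRemoteDebuggerPresent:", remote_debugger_present())
```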

4.1.2 Novel Malware.

4.1.3 AI-Powered Malware.

4.2 Analysis Techniques

4.2.1 Static Analysis.


| Features | Tools |
| --- | --- |
| ASM opcode, header, functions, strings | IDA Pro [ ] |
| PE file header & data, strings, imports | Pefile (Python) [ , ] |
| API calls and accessed DLLs | Peframe (Python) [ ] |
| Byte sequences & n-grams | Binary file bytes and n-grams [ ] |
| File, header and text section sizes | Pefile [ , ] |
| Label benign or malicious | VirusTotal [ ] |
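As an illustration of the Pefile rows above, the following sketch extracts a handful of common static features with the pefile library. The field names follow pefile's documented API; the particular feature set is an assumption for illustration, not the exact vector used by any surveyed paper.

```python
# Minimal static feature extraction from a PE file using pefile.
import pefile

def extract_static_features(path: str) -> dict:
    pe = pefile.PE(path)
    return {
        "num_sections": pe.FILE_HEADER.NumberOfSections,
        "timestamp": pe.FILE_HEADER.TimeDateStamp,
        "size_of_image": pe.OPTIONAL_HEADER.SizeOfImage,
        "entry_point": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        # Per-section sizes and entropies (packed sections tend to be high-entropy).
        "sections": [
            (s.Name.rstrip(b"\x00").decode(errors="replace"),
             s.SizeOfRawData, round(s.get_entropy(), 3))
            for s in pe.sections
        ],
        # Imported DLLs and API names, a common static feature source.
        "imports": {
            entry.dll.decode(errors="replace"): [
                imp.name.decode(errors="replace")
                for imp in entry.imports if imp.name
            ]
            for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
        },
    }
```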

4.2.2 Sandbox Analysis.

| Features | Tools |
| --- | --- |
| API calls | Cuckoo Sandbox [ ], DBI Intel Pin [ , ] |
| Registry access | Cuckoo Sandbox [ ] |
| System calls | Cuckoo Sandbox [ ], DBI Intel Pin [ , ] |
| Hardware performance counters | perf [ ] |
| I/O access | Bitvisor [ ] |
| File operations, memory dumps | Cuckoo Sandbox [ ] |
| Network | Cuckoo Sandbox [ ], Wireshark [ ] |
| CPU registers | DBI Intel Pin [ , ], Debugger [ ] |
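As a sketch of consuming the Cuckoo Sandbox output listed above, the snippet below pulls the ordered API-call sequence out of a Cuckoo report.json. The behavior/processes/calls layout matches Cuckoo's report format, but exact keys vary between Cuckoo versions, so treat this as an assumption to verify against your own reports.

```python
# Minimal extraction of an API-call sequence from a Cuckoo JSON report.
import json

def api_call_sequence(report_path: str) -> list[str]:
    with open(report_path, "r", encoding="utf-8") as fh:
        report = json.load(fh)
    sequence = []
    for proc in report.get("behavior", {}).get("processes", []):
        for call in proc.get("calls", []):
            sequence.append(call.get("api", ""))
    return sequence

# The ordered API names can then be turned into n-grams or
# cluster-transition features, as several surveyed papers do.
```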

4.2.3 Dynamic Binary Instrumentation.

| Category | Details |
| --- | --- |
| Indirect Evasion Techniques | DBI tools replicate the behavior of the underlying operating system and CPU architecture. A DBI tool may not implement every possible behavior, which enables two anti-instrumentation techniques: unsupported assembly instructions and unsupported behaviors [ ]. If the DBI tool does not support every available CPU instruction and behavior, malware can crash the DBI by executing an unsupported instruction or behavior [ ]. Further, analysis environments are typically allocated limited resources, and anti-instrumentation techniques can exploit this by attempting to exhaust the available resources [ , ]. |
| Code Cache Artifacts | An instrumented binary is written to memory differently than a binary run on bare metal. Because execution happens in the code cache, this changes the instruction pointer (IP), which normally holds the memory address of the next assembly instruction to be executed [ ]. The changes to the IP can be detected by probes that compare observed addresses against those normally expected [ , ]. |
| Environment Artifacts | Probes that fingerprint aspects of the memory layout can detect DBI tools inside the process memory of the instrumented binary: the memory layout of an instrumented binary differs, and the DBI tool is normally the parent process of the instrumented binary [ , , , ]. |
| JIT Compiler Detection | JIT compilers attempt to conceal the presence of DBI tools by hiding their API and system calls. These hidden calls can be detected by comparing memory addresses and offsets [ ]. |
| Overhead Detection | DBI tools add considerable overhead that impacts execution time, which can be detected by probing the timing of various instructions and resource usage [ , , , ]. |
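The following is a minimal, illustrative timing probe in the spirit of the "Overhead Detection" row: time a cheap loop and compare against a baseline threshold. The threshold here is an arbitrary assumption; real anti-instrumentation code typically uses RDTSC with calibrated baselines rather than a fixed constant.

```python
# Minimal timing probe: instrumented execution inflates per-iteration cost.
import time

def looks_instrumented(iterations: int = 1_000_000,
                       threshold_ns_per_iter: float = 50.0) -> bool:
    start = time.perf_counter_ns()
    x = 0
    for i in range(iterations):
        x ^= i  # trivial work; DBI translation makes it disproportionately slow
    elapsed = time.perf_counter_ns() - start
    return (elapsed / iterations) > threshold_ns_per_iter

if __name__ == "__main__":
    print("Suspiciously slow (possible DBI/VM)?", looks_instrumented())
```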

4.3 Malware Repositories

| Dataset | EMBER [ ] | BODMAS [ ] | SOREL-20M [ ] | VirusShare [ ] | Malware Bazaar [ ] |
| --- | --- | --- | --- | --- | --- |
| Total Samples | 1.1M | 134,435 | 20M | 55M | 700k |
| Dates | 2018 | 2019–2020 | 2017–2019 | N.A. | Yes |
| Timestamp | Partial | Yes | Yes | Yes | Yes |
| Taxonomy | Partial | Complete | No | No | Yes |
| Malware Samples | 400,000 | 57,293 | 10M | 48M | 700k |
| Benign Samples | 400,000 | 77,142 | 10M | N.A. | N.A. |
| Malware Binaries | No | Yes | Disarmed | Yes | Yes |
| Benign Binaries | No | No | No | No | No |
| Feature Vectors | Yes | Yes | Yes | No | No |
| Features | Static | Static | Static | Binaries | Binaries |
| Feature Extractor | Yes | Yes | No | No | No |
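Since EMBER ships pre-vectorized static features (see the Feature Vectors row), a baseline model can be trained directly on them. This sketch assumes the open-source ember package (github.com/elastic/ember) and its read_vectorized_features helper; check the package documentation for the exact signature in your version. The data path is a placeholder.

```python
# Minimal baseline on EMBER's pre-vectorized static features.
import lightgbm as lgb   # EMBER's reference model is a LightGBM GBDT
import ember

X_train, y_train, X_test, y_test = ember.read_vectorized_features("/data/ember2018")

# EMBER marks unlabeled samples with y == -1; keep only labeled ones.
mask = y_train != -1
model = lgb.LGBMClassifier(n_estimators=400)
model.fit(X_train[mask], y_train[mask])
print("Holdout accuracy:", model.score(X_test, y_test))
```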

4.3.1 EMBER.

4.3.2 BODMAS.

4.3.3 SOREL-20M.

4.3.4 VirusShare.


4.3.5 Malware Bazaar.

4.4 Feature Selection

4.5 Machine Learning vs. Deep Learning


| Paper | Dataset | Features | Model | Acc. % |
| --- | --- | --- | --- | --- |
| [ ] | Edge-IIoTset [ ] | DDoS-UDP, DDoS-ICMP, SQL-injection, Password, Vulnerability-scanner, DDoS-TCP, DDoS-HTTP, Uploading, Backdoor, Port-Scanning, XSS, Ransomware, MITM, Fingerprinting | SecurityLLM | 98 |
| | | | CNN | 95 |
| | | | Transformer Model | 95 |
| | | | RNN | 94 |
| | | | DNN | 93 |
| | | | RF | 81 |
| | | | KNN | 79.93 |
| | | | SVM | 78 |
| | | | DT | 67 |
| [ ] | [ ]: 23,080 malware and 21,116 benign samples | Clusters of dynamically extracted API call sequences | API sequence cluster transition matrices | 99.9 |
| | [ ]: 151 malware & 69 benign samples | | | 99 |
| | CSDMC2010: 320 malware & 68 benign samples | | | 98.5 |
| | [ ]: 7,107 malware & 169 benign samples | | | 98.7 |
| [ ] | Benign applications from Software-informer | 50 FastICA features from initial 15,972 Cuckoo features | TensorFlow CNN with 512 nodes in Layer 1 & Layer 2, 500 epochs | 94.84 |
| | Originally 1,232 ransomware samples across 14 families from VirusShare & VirusTotal; functional samples: 483 ransomware & 754 benign | Original 15,972 features from Cuckoo JSON report | TensorFlow CNN with 1,024 nodes in Layer 1 & Layer 2, 500 epochs | 95.96 |
| | | | RF | 90.95 |
| | | | SVM | 89.96 |
| | | | MC | 88.12 |
| [ ] | 150 ransomware & 150 benign | DNA sequence for 26 most significant features using MOGWO & BCS | Proposed LR & AL | 87.91 |
| | | | AB | 83.22 |
| | | | NB | 78.52 |
| | | | DS | 75.83 |
| [ ] | 13,075 malware & 19,747 benign samples | 5 static feature vectors, including strings, opcode, API, library, permission, component & environmental features | Multi-Modal Keras DNN, 5 initial networks with 2 hidden layers & final network with 2 hidden layers | 98.0 |
| [ ] | 70,140 benign & 69,860 malware samples | EMBER static features | DNN | 98.9 |
| | | | RF | 97.0 |
| | | | DT | 96.9 |
| | | | SVM | 96.1 |
| | | | KNN | 95.1 |
| | | | AB | 83.0 |
| | | | LR | 54.0 |
| | | | NB | 53.8 |
| | 121,701 benign & 118,717 malware samples | Cuckoo JSON report | SVM | 96.1 |
| | | | CNN | 93.6 |
| | | | DNN | 91.0 |
| | | | RF | 89.5 |
| | | | DT | 86.0 |
| | | | KNN | 81.5 |
| | | | AB | 73.3 |
| | | | LR | 67.4 |
| | | | NB | 54.6 |
| | 52,245 benign & 50,792 malware samples | Cuckoo JSON report & Python Psutil | CNN | 96.6 |
| | | | DNN | 90.4 |
| | | | RF | 89.9 |
| | | | SVM | 89.0 |
| | | | KNN | 85.9 |
| | | | DT | 82.5 |
| | | | AB | 82.1 |
| | | | LR | 57.34 |
| | | | NB | 50.5 |
| [ ] | Android 4,354 malware & 5,065 benign samples; only 429 malware & 1,700 benign had network traffic | Static permission, intent & component from manifest.xml; network traffic 18.5 G benign & 19.0 G malicious, converted to images | First NN for static feature vector, CACNN for network traffic images if classified as benign by first NN | 99.19 |
| [ ] | Android 18,000 malware & 18,000 benign | 2,290 API calls & 625 manifest properties, TensorFlow models | RNN Bidirectional GRU | 96.78 |
| | | | RNN GRU | 96.75 |
| | | | RNN Stacked GRU | 96.67 |
| | | | RNN Stacked LSTM | 96.64 |
| | | | RNN Bidirectional LSTM | 96.61 |
| | | | RNN LSTM | 96.56 |
| | | | CNN | 95.11 |
| | Android 18,000 malware & 18,000 benign samples | 2,290 API and 625 manifest properties, TensorFlow Lite | RNN GRU | 96.75 |
| | 70,130 malware, 21 families | 2,290 API & 625 manifest properties | RNN GRU | 94.45 |
| [ ] | Real crypto-mining traffic with web surfing, video and audio streaming, file transfer, email & others | 51 network traffic features from Tstat tool & 8 features from NetFlow metrics | NN | 100 |
| | | | RF | 100 |
| | | | DT | 100 |
| | | | LR | 100 |
| | | | CART | 99.99 |
| [ ] | 43,530 malware samples from VirusShare & 3,591 benign samples from Windows 7 OS | Feature vector from debugger & count of the 610 opcodes in the Intel x86/x64 architecture | RC | 99.06 |
| | | | RSS | 99.05 |
| | | | RF | 99.05 |
| | | | AB | 99.02 |
| | | | Bagging | 98.96 |
| | | | PART | 98.51 |
| | | | IBk | 98.34 |
| | | | LWL | 98.33 |
| | | | J48 | 98.12 |
| | | | KStar | 98.09 |
| | | | JRip | 97.83 |
| | | | REPTree | 97.63 |
| | | | RT | 97.06 |
| | | | DT | 94.70 |
| | | | HT | 92.11 |
| | | | OneR | 90.42 |
| | | | DS | 81.91 |
| | | | ZeroR | 49.95 |
| [ ] | VirusShare: 31,609, VXHeaven: 20,713 & MALICIA: 11,368 malware & 13,752 benign samples | Images generated from dynamic CFG (*recall reported, as accuracy not presented) | YOLO-based CNN | 90.26* |
| | | | AIS | 85.88* |
| | | | Simple-CNN | 84.85* |
| | | | SVM | 74.36* |
| [ ] | 56 benign and 50 malware samples for 21,800 volatile memory dumps from Ubuntu VMs: DNS server & HTTP server | 171 features from the memory dumps; all samples used for training, but trained with different dumps & behaviors for the same sample | DNS RF | 98.7 |
| | | | DNS ANN | 98.2 |
| | | | DNS DNN | 97.9 |
| | | | DNS SVM | 97.8 |
| | | | DNS KNN | 97.7 |
| | | | DNS LR | 95.6 |
| | | | DNS NB | 77.6 |
| | | | HTTP ANN | 99.9 |
| | | | HTTP KNN | 99.9 |
| | | | HTTP RF | 99.8 |
| | | | HTTP SVM | 99.8 |
| | | | HTTP DNN | 99.5 |
| | | | HTTP LR | 99.5 |
| | | | HTTP NB | 94.0 |
| | | 171 features extracted from memory dumps; 8 benign and 8 malware samples used for testing | DNS DNN | 95.9 |
| | | | DNS RF | 93.8 |
| | | | DNS LR | 93.5 |
| | | | DNS ANN | 87.9 |
| | | | DNS SVM | 84.5 |
| | | | DNS KNN | 80.4 |
| | | | DNS NB | 67.3 |
| | | | HTTP KNN | 98.9 |
| | | | HTTP RF | 98.5 |
| | | | HTTP DNN | 97.3 |
| | | | HTTP SVM | 96.7 |
| | | | HTTP NB | 96.0 |
| | | | HTTP ANN | 95.0 |
| | | | HTTP LR | 95.0 |
| [ ] | VirusChaser: 139,384 malware & 10,475 benign samples, labelled with VirusTotal | 79 Pefile utility static features, 513 Cuckoo dynamic | AI-Hydra: RF and MLP | 85.1 |
| | | | Sophos AV | 74.9 |
| | | | Clam AV | 74.5 |
| | | | Bitdefender AV | 52 |
| [ ] | 582 ransomware in 11 families & 942 benign samples | Dynamic: binary strings, Windows API, Windows registry, file, system file & directory operations; CAE features reduced from 16,382 to 500 & 100 features | CSPE-R ensemble: CFH, SVM, RF, LR, DNN | 93 |
| | | | CFH-RF100 | 92 |
| | | | CFH-SVM500 | 92 |
| | | | CFH-SVM100 | 90 |
| | | | CFH-LR100 | 90 |
| | | | LR | 90 |
| | | | CFH-RF500 | 89 |
| | | | CFH-LR500 | 89 |
| | | | SVM | 88 |
| | | | RF | 80 |
| [ ] | BODMAS: 400,000 training, 200,000 testing & 19,000 GAN samples | Static BODMAS features | NLP with BERT & TensorFlow DNN | 85.82 |
| [ ] | Malware DB: 3,653 malware samples & 554 benign samples, 5 categories: backdoor, password stealer, rogue, trojan & worm | Intel Pin DBI: opcode frequency, memory addresses, memory reads, memory writes, and unaligned memory access | Specialized DNN (average over the five specialized detectors) | 93.0 |
| | | | Specialized LR (average over the five specialized detectors) | 91.0 |
| | | | General DNN (average over the five types of malware) | 89.0 |
| | | | General LR (average over the five types of malware) | 87.0 |
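The table above reflects a common benchmarking pattern: fit several models on one feature matrix and compare held-out accuracy. The sketch below shows that harness with scikit-learn baselines; X and y are placeholders for any of the surveyed feature sets (EMBER vectors, Cuckoo-derived features, etc.), and the model list is an illustrative assumption.

```python
# Minimal ML benchmarking harness over one labeled feature matrix.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def benchmark(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    models = {
        "RF": RandomForestClassifier(n_estimators=200),
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "SVM": LinearSVC(),
        "AB": AdaBoostClassifier(),
        "LR": LogisticRegression(max_iter=1000),
        "NB": GaussianNB(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name}: {model.score(X_te, y_te):.3f}")
```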

5 Discussion

6 Conclusion

Acknowledgments

Index Terms

  • Computing methodologies
      • Artificial intelligence
          • Machine learning
              • Machine learning approaches
  • General and reference
      • Document types
          • Surveys and overviews
  • Security and privacy
      • Intrusion/anomaly detection and malware mitigation
          • Malware and its mitigation


Information

Published in ACM Computing Surveys, Association for Computing Machinery, New York, NY, United States.

University of Sydney, Australia

Author Tags

  • artificial intelligence
  • machine learning
  • deep learning
  • computer security
  • malware repository
  • malware analysis techniques
  • feature selection
  • evasive malware
  • sophisticated malware


A state-of-the-art survey of malware detection approaches using data mining techniques

Alireza Souri (ORCID: orcid.org/0000-0001-8314-9051) and Rahil Hosseini

Human-centric Computing and Information Sciences, volume 8, article number 3 (2018). Open access. Published: 12 January 2018.

Data mining techniques have received considerable attention for malware detection over the past decade. The battle between security analysts and malware authors is never-ending as technology evolves, and existing methodologies become inadequate as the evolutionary, complex nature of malware changes quickly and grows harder to recognize. This paper presents a systematic and detailed survey of malware detection mechanisms that use data mining techniques. It classifies malware detection approaches into two main categories: signature-based methods and behavior-based methods. The main contributions of this paper are: (1) providing a summary of the current challenges of data mining approaches to malware detection, (2) presenting a systematic and categorized overview of the current machine learning mechanisms, (3) exploring the structure of the significant methods used for malware detection, and (4) discussing the important factors for classifying malware detection approaches in data mining. The detection approaches are compared according to these factors, and their advantages and disadvantages are discussed in terms of data mining models, evaluation methods, and proficiency. This survey helps researchers gain a general understanding of the malware detection field and supports specialists in conducting subsequent studies.

Introduction

In recent years, the use of data mining techniques for malware detection has increased, applying machine learning to recognize malicious files [1, 2]. Machine learning methods can learn hidden patterns from a training set that includes both malware and benign samples; these patterns can then separate malware from benign code [3, 4]. Malware is one of the most serious threats to distributed systems and the Internet [5]. The battle between security analysts and malware authors is never-ending as technology evolves. Malware is a program that makes a system do something that an attacker wants it to do [6]. The most widely used malware detection relies on a straightforward pattern-matching approach to identify malicious code. Typically, malware developers do not write new code from scratch but update old code with new components or obfuscation techniques [7]. With a large number of malware instances appearing each day, efficiently processing the many samples that exhibit similar behavior has become increasingly essential [8].

Malware analysis [9, 10] plays a growing role in determining the purpose, functionality, and behavior of a given suspicious application. Such analysis is an essential prerequisite for developing effective and powerful detection and classification techniques; it is divided into two primary categories, dynamic and static methods [11, 12]. To the best of our knowledge, most data mining methods have both strengths and weaknesses for malware detection [13], and a new literature review can inform future studies and explore technical details of malware detection using data mining. Several earlier works [13, 14, 15, 16, 17] have discussed malware detection approaches, but they have shortcomings: some are out of date and did not consider recent articles in their comparison and analysis, and some lack any systematic classification or article-selection process. For example, Siddiqui et al. [18] presented a survey of malware detection using data mining techniques, but it relied on older research, performed no systematic article selection, did not specify an appropriate categorization of malware detection techniques, and analyzed only scanning and data analysis methods.

To overcome these defects, this paper presents a systematic literature review of recent malware detection techniques that use data mining approaches. The review classifies malware detection approaches into two main fields, signature-based and behavior-based, and makes the following contributions:

Providing a summary of the current challenges related to malware detection approaches in data mining.

Presenting a systematic and categorized overview of the current approaches to machine learning mechanisms in data mining.

Exploring the structure of the methods that are significant in malware detection.

Discussing the important factors for classifying malware detection approaches in data mining, to guide future improvements.

The rest of this paper is organized as follows. "Malware detection approaches" overviews malware detection mechanisms in data mining and classifies them with a technical taxonomy. "Review of the malware detection approaches" presents an analytical comparison of the proposed approaches for the selected studies. "Discussion" examines malware detection issues that have not yet been analyzed comprehensively, as an exploration of new challenges. Finally, "Conclusion" concludes the paper.

Malware detection approaches

As malware continues to evolve, detecting unknown malware is a fundamental problem for machine-learning-based malware recognition [19]. Machine learning strategies are divided into supervised and unsupervised classes. Malware detection approaches fall into two main categories: behavior-based and signature-based methods [20]. In addition, static and dynamic analysis [21] are the two forms of malware analysis generally performed to find malicious applications [22].

In Fig. 1, we illustrate a malware detection taxonomy based on machine learning approaches. According to this figure, API call features, assembly features, and binary features are the existing feature types for malware detection; these features feed machine learning methods that predict and detect malicious files.

Taxonomy of malware detection approaches

Signature-based malware detection

Signature-based detection is currently the most widely used technique in antivirus software, featuring exact comparison. Malware recognition has essentially centered on performing static analysis to review the code-structure signature of viruses, rather than dynamic behavioral methods [23]. A signature-based system finds intrusions using a predefined list of known attacks. Although this arrangement can identify malware in mobile applications, it requires constant updating of the predefined signature database; moreover, it is less effective at identifying malicious activities because of the quickly changing nature of mobile malware [24, 25]. Signature-based strategies rely on unique raw byte patterns or regular expressions, known as signatures, created to match the malicious file; for example, static features of a file are used to decide whether it is malware. The main advantage of signature-based techniques is their thoroughness, since they follow all possible execution paths of a given file.

Within the malware structure, existing malicious objects have characteristics that can be used to generate a unique digital signature. The anti-malware provider uses meta-heuristic algorithms that can efficiently scan the malicious object to derive its signature [26]. After the malicious object is identified, the detected signature is added to the existing database of recognized malware; such databases contain huge numbers of signatures that classify malicious objects. Signature-based malware detection has several desirable qualities: it is fast, easy to run, and broadly available [27].
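The following is a minimal sketch of the signature-database idea described above (my own illustration, not any surveyed implementation): a file is flagged if its cryptographic hash is in a known-malware set, or if it contains a known raw byte pattern. The example signatures are for the harmless EICAR test file; real engines use far richer signatures with wildcards and regular expressions.

```python
# Minimal hash- and byte-pattern-based signature scan.
import hashlib
from pathlib import Path

KNOWN_HASHES = {
    # SHA-256 of the standard EICAR antivirus test file.
    "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f",
}
KNOWN_BYTE_PATTERNS = [
    b"X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR",  # EICAR string prefix
]

def scan(path: str) -> bool:
    data = Path(path).read_bytes()
    if hashlib.sha256(data).hexdigest() in KNOWN_HASHES:
        return True
    return any(pattern in data for pattern in KNOWN_BYTE_PATTERNS)
```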

Since digital signature schemes are derived from known malware, the schemes themselves are widely known; consequently, they can be easily evaded by attackers using simple obfuscation techniques, and malware code can be modified so that signature-based identification is bypassed. Because anti-malware databases are built on the premise of known malware, they cannot distinguish unknown malware, or even variants of known malware; without an exact digital signature, they cannot adequately detect polymorphic malware. Thus, signature-based recognition does not provide zero-day protection. Moreover, since a signature-based detector uses a separate signature for each malware variant, the database of signatures grows at an exponential rate [28]. Signature-based malware detection has two main feature types for applying machine learning methods: assembly features and binary features. Figure 2 illustrates a standard signature-based malware detection framework using data mining approaches.

The signature-based malware detection framework

Also, Table  1 shows the advantages and weaknesses of the signature-based malware detection approach.

Behavior-based malware detection

This subsection describes behavior-based approaches to malware detection, reviews the selected behavior-based approaches in data mining, and compares and summarizes them in the final subsection. Behavior-based methodologies require execution of a given sample in a sandboxed environment, where run-time activities are monitored and logged. Dynamic analysis systems use both virtualization and emulation environments to execute malware and extract its behaviors. The primary advantage of the behavior-based approach is that it gives a superior understanding of how malware is produced and implemented [8, 14].

In the behavior-based approach, suspicious objects are assessed based on the actions they attempt to perform on the system. Attempts to perform actions that are clearly abnormal or unauthorized indicate that the object is malicious, or at least suspicious. Malicious behavior is identified using dynamic analysis, which evaluates malicious intent from the object's code and structure. In behavior-based detection, API calls and assembly features are the two main feature types for applying machine learning algorithms. Figure 3 depicts a standard behavior-based malware detection approach using data mining algorithms.

The behavior-based malware detection framework
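To make the framework above concrete, the following is a minimal sketch (an illustration of the general behavior-based pipeline, not any specific surveyed system): sandbox-logged API-call sequences are turned into n-gram count features and a classifier is trained on labeled traces. The traces and labels are hypothetical.

```python
# Minimal behavior-based classification over API-call n-grams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical traces: one space-joined API-call sequence per sample.
traces = [
    "CreateFileW WriteFile RegSetValueExW CreateRemoteThread",
    "CreateFileW ReadFile CloseHandle",
]
labels = [1, 0]  # 1 = malicious, 0 = benign

# 1- and 2-grams over the API alphabet capture short behavioral motifs.
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(traces)

clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(vectorizer.transform(["CreateFileW ReadFile CloseHandle"])))
```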

Table  2 shows the advantages and weaknesses of the behavior-based malware detection approach.

After describing the existing malware detection approaches, the next section presents a technical analysis of current research on malware detection with data mining algorithms.

Review of the malware detection approaches

In this section, the existing malware detection approaches are analyzed according to several evaluation factors, such as the main idea, advantages and disadvantages, algorithm type, and assessment type in data mining techniques. We analyze and discuss the selected studies according to the existing approaches.

Review of the signature-based approaches

Wu et al. [29] proposed an artificial immune-based smartphone malware detection model (SP-MDM) that combines static and dynamic malware analysis, modeled on the biological immune system that protects organisms from pathogens. In this model, static and dynamic signatures of malware are extracted, and antigens are generated based on a real-valued vector encoding. An immature detector matures if it passes self-tolerance, and detector offspring with higher affinity are created by optimizing the maturing detectors with a clonal selection algorithm. For evaluation, they collected twenty malware and twenty benign files as the test sample set.

Bat-Erdene et al. [30] presented a strategy for classifying the packing algorithms of unknown packed executables. First, they measure the entropy values of a given executable and convert the entropy values of a particular memory region into symbolic representations, using symbolic aggregate approximation (SAX), which is known to be effective for large data transformations. Second, they classify the distribution of symbols using supervised learning methods, namely naive Bayes and support vector machines, to identify the packing algorithm. Experiments on 324 packed benign programs and 326 packed malware programs covering 19 packing algorithms show that the method identifies the packing algorithm of a given executable with an accuracy of 95.35%, a recall of 95.83%, and a precision of 94.13%. The authors propose four similarity measurements for identifying packing algorithms based on SAX representations of the entropy values and an incremental aggregate analysis; among these, the fidelity similarity measurement gives the best-matching result, with accuracy ranging from 95.0 to 99.9%, which is 2 to 13 points higher than the other three measurements. Their study confirms that packing algorithms can be recognized through entropy analysis, based on a measure of the uncertainty of the running processes, without prior knowledge of the executable.
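The following is a minimal sketch of the entropy/SAX idea attributed to Bat-Erdene et al. above (my own illustration, not their code): compute Shannon entropy over fixed-size windows of a binary, then discretize the z-normalized entropy series into SAX symbols using the Gaussian breakpoints for a 4-letter alphabet. The input file name is a placeholder.

```python
# Sliding-window entropy followed by SAX symbolization.
import numpy as np

def window_entropy(data: bytes, window: int = 256) -> np.ndarray:
    values = []
    for start in range(0, len(data) - window + 1, window):
        chunk = data[start:start + window]
        counts = np.bincount(np.frombuffer(chunk, dtype=np.uint8), minlength=256)
        probs = counts[counts > 0] / window
        values.append(-(probs * np.log2(probs)).sum())  # Shannon entropy in bits
    return np.array(values)

def sax(series: np.ndarray) -> str:
    # Breakpoints splitting a standard normal into 4 equiprobable regions.
    breakpoints = [-0.6745, 0.0, 0.6745]
    z = (series - series.mean()) / (series.std() + 1e-9)
    return "".join("abcd"[np.searchsorted(breakpoints, v)] for v in z)

data = open("sample.exe", "rb").read()   # hypothetical input file
print(sax(window_entropy(data)))         # e.g. "aabddcb..." symbol string
```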

Cui et al. [31] presented a novel detection framework based on a cloud environment and packet analysis. The framework identifies malicious mobile malware behavior from network packets using data mining techniques, avoiding the shortcomings of traditional methods. The framework is service-oriented and can be deployed by mobile operators to send alerts to users who have malware on their devices. To improve performance, a new clustering technique called withdrawal clustering was created, which uses prior knowledge to reduce the dataset size. In addition, a multi-module detection scheme was introduced to improve accuracy; its results are produced by integrating the detection outputs of several algorithms, including naive Bayes and decision tree.

Fan et al. [32] proposed an effective sequence mining algorithm to discover malicious sequential patterns, after which an All-Nearest-Neighbor (ANN) classifier is built for malware identification from the discovered patterns. The resulting data mining framework, composed of the proposed sequential pattern mining method and the ANN classifier, can characterize malicious patterns from the collected file sample set and effectively detect newly unseen malware samples. A thorough experimental study on a real data collection was performed to evaluate the detection framework, and the promising results show that it outperforms alternative data mining based detection methods in identifying new malicious executables.

Hellal and Ben Romdhane [33] introduced a graph mining technique to recognize variants of malware using static analysis while addressing existing shortcomings. They proposed a novel algorithm, called the minimal contrast frequent subgraph miner (MCFSM), for extracting minimal discriminative and widely used malicious behavioral patterns, which can precisely identify an entire family of malicious programs in contrast to a set of benign programs. The proposed technique shows high detection rates and low false positive rates and generates a small number of behavioral malware signatures.

Martín et al. [34] used third-party calls to bypass the effects of masking strategies, since such calls cannot be obfuscated. They combined clustering and multi-objective optimization to produce a classifier based on particular behaviors characterized by third-party call groups. The analyzer ensures that these groups are related to malicious or benign behaviors, discarding any non-discriminative pattern. This tool, named MOCDroid, achieves an accuracy of 95.15% in tests, with 1.69% false positives on real applications extracted from the wild, beating all commercial antivirus engines on VirusTotal.

Santos et al. [35] proposed a method to detect unknown malware families based on the frequency of appearance of opcode sequences. They described a technique to mine the relevance of each opcode and to weight the frequency of each opcode sequence, and they provided empirical validation that the method is capable of detecting unknown malware.
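The following is a minimal sketch of representing executables by weighted opcode-sequence frequencies, in the spirit of Santos et al. [35]. TF-IDF stands in here for their opcode-relevance weighting, and the disassembly step is assumed to have already produced one opcode string per program; the traces are hypothetical.

```python
# Weighted opcode 2-gram representation plus a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

programs = [
    "push mov call pop ret",           # hypothetical benign opcode trace
    "xor xor jmp call push push ret",  # hypothetical malicious opcode trace
]
labels = [0, 1]

vec = TfidfVectorizer(ngram_range=(2, 2), token_pattern=r"\S+")
X = vec.fit_transform(programs)
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform(["push mov call pop ret"])))
```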

Wang and Wang [24] presented a malware detection framework that ensures a small classification error through machine learning, using the generalization ability of support vector machines (SVMs). The study built an automatic malware detection system by training an SVM classifier on behavioral signatures, with cross-validation used to address classification accuracy across 60 families of real malware. The experimental results reveal that the classification error decreases as the amount of training data increases: for different sizes (N) of malware samples, the prediction accuracy of malware detection reaches 98.7% with N = 100, and the overall detection accuracy of the SVC exceeds 85% for unspecified mobile malware.

Summary of the reviewed signature-based approaches

According to the reviewed signature-based detection approaches, a comparison of the proposed articles is given in Table 3, which shows the case study used in each work, the main advantages and disadvantages, and the target environment. The main advantage of signature-based detection approaches is that pattern detection decreases system overhead and execution time for malware prediction; the main disadvantage is the omission of feature selection. The target environments fall into three main platforms: embedded systems, Windows-based systems, and smartphones. Most studies on signature-based detection used a Windows-based environment to demonstrate the proposed malware detection approach.

In addition, Table 4 gives a side-by-side comparison of the signature-based detection factors in each article: case-study method, classification or clustering approach, data analysis method, dataset type, and accuracy.

Review of the selected behavior-based approaches

Altaher [38] proposed an evolving hybrid neuro-fuzzy classifier (EHNFC) for Android malware classification using permission-based features. The proposed EHNFC not only can detect obfuscated malware using fuzzy rules, but can also evolve its structure by learning new malware-detection fuzzy rules, improving its detection accuracy as it is applied to more malware applications. To this end, an evolving clustering method for adapting and evolving the malware-detection fuzzy rules was modified to incorporate an adaptive approach for updating the radii and centers of the clustered permission-based features. This change improves cluster convergence and produces rules better tailored to the input data, enhancing the classification accuracy of the proposed EHNFC. The experimental results show that the proposal outperforms several state-of-the-art obfuscated-malware classification approaches in terms of false negative rate (0.05) and false positive rate (0.05), and that it detects Android malware better than other neuro-fuzzy systems in terms of accuracy (90%).

Mohaisen et al. [39] proposed AMAL, an automated, behavior-based malware analysis and labeling system that addresses the shortcomings of existing systems. AMAL consists of two sub-systems, AutoMal and MaLabel. AutoMal provides tools to collect low-granularity behavioral artifacts that characterize malware use of the file system, memory, network, and registry, by running malware samples in virtualized environments. MaLabel uses those artifacts to create representative features, builds classifiers trained on manually vetted training samples, and uses those classifiers to classify malware samples into families with similar behavior. AutoMal also enables unsupervised learning by implementing multiple clustering algorithms for grouping samples. An evaluation of both AutoMal and MaLabel on a medium-scale dataset (4,000 samples) and a large-scale dataset (more than 115,000 samples) collected and analyzed by AutoMal shows AMAL's effectiveness in accurately characterizing, classifying, and grouping malware samples. MaLabel achieves a precision of 99.5% and a recall of 99.6% for certain families, and more than 98% precision and recall for unsupervised classification.

Yuan et al. [40] presented a deep learning method to combine features from static analysis with features from dynamic analysis of Android applications. They implemented an Android malware detection engine based on this deep learning method (DroidDetector) that can automatically determine whether an application is malicious. They tested DroidDetector on a large number of Android applications and performed an in-depth analysis of the features that deep learning exploits to characterize malware. The results show that deep learning is suitable for characterizing Android malware and is especially effective when more training data is available: DroidDetector achieves 96.76% detection accuracy, outperforming traditional machine learning methods.

Boukhtouta et al. [41] addressed the problem of fingerprinting the maliciousness of network traffic for detection and classification. The work fingerprints maliciousness using two approaches, deep packet inspection (DPI) and IP packet header classification, treating malicious traffic generated from dynamic malware analysis as ground truth. On this basis, they showed how the two methodologies are used to detect and attribute maliciousness to various threats, examining the positive and negative aspects of DPI and IP packet header classification. Each approach was evaluated by its detection and attribution accuracy as well as its level of complexity. Both methodologies showed promising detection results and are good candidates to complement or corroborate detection systems in terms of runtime speed and classification accuracy.

Ding et al. [42] proposed an association-mining strategy based on API calls to recognize malware. To increase the detection speed of Objective-Oriented Association (OOA) mining, several techniques are presented: to improve rule quality, criteria for API selection are proposed to remove APIs that cannot yield distinctly frequent items; to find association rules with strong discrimination power, a rule utility measure is defined to evaluate the association rules; and to improve detection accuracy, a classification strategy based on multiple association rules is adopted. The experiments show that the proposed techniques substantially improve the running speed of OOA: the time cost of data mining is reduced by 32%, and the time cost of classification is reduced by 50%.
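The following is a minimal sketch of class association rules over API-call sets, a generic illustration of the idea rather than Ding et al.'s OOA algorithm: count how often each API pair occurs in malware versus all samples, and keep rules {api_a, api_b} → malware whose support and confidence clear chosen thresholds. The traces and thresholds are hypothetical.

```python
# Toy class-association-rule mining over API-call itemsets.
from itertools import combinations
from collections import Counter

samples = [  # (set of APIs used, label) -- hypothetical traces
    ({"CreateRemoteThread", "WriteProcessMemory", "OpenProcess"}, 1),
    ({"CreateFileW", "ReadFile"}, 0),
    ({"CreateRemoteThread", "WriteProcessMemory", "CloseHandle"}, 1),
]

pair_total, pair_malware = Counter(), Counter()
for apis, label in samples:
    for pair in combinations(sorted(apis), 2):
        pair_total[pair] += 1
        pair_malware[pair] += label

MIN_SUPPORT, MIN_CONFIDENCE = 2, 0.9
rules = [
    (pair, pair_malware[pair] / pair_total[pair])
    for pair in pair_total
    if pair_total[pair] >= MIN_SUPPORT
    and pair_malware[pair] / pair_total[pair] >= MIN_CONFIDENCE
]
print(rules)  # e.g. [(('CreateRemoteThread', 'WriteProcessMemory'), 1.0)]
```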

Eskandari et al. [43] presented a novel hybrid approach, HDM-Analyzer, which takes advantage of both dynamic and static analysis methods to raise speed while preserving accuracy at a reasonable level. HDM-Analyzer can predict the majority of decision-making points using statistical information gathered by dynamic analysis, and therefore incurs no performance overhead at scan time. The main contribution of the paper is taking the accuracy advantage of dynamic analysis and incorporating it into static analysis in order to augment the accuracy of static analysis. In fact, the execution overhead is paid only in the learning phase, so it does not burden the feature extraction stage performed during scanning. The experimental results show that HDM-Analyzer achieves better overall accuracy and time complexity than static and dynamic analysis methods alone.

Miao et al. [44] presented a bilayer behavior abstraction method based on semantic analysis of dynamic API sequences. Operations on sensitive system resources and complex behaviors are abstracted in an interpretable way at different semantic layers. At the lower layer, raw API calls are combined to abstract low-layer behaviors via data dependency analysis; at the higher layer, low-layer behaviors are further combined to construct more complex high-layer behaviors with good interpretability. The extracted low-layer and high-layer behaviors are finally embedded into a high-dimensional vector space, so the abstracted behaviors can be used directly by many popular machine learning algorithms. In addition, to handle the problem that benign programs are not adequately sampled, or that malware and benign programs are severely imbalanced, an enhanced one-class support vector machine (OC-SVM) named OC-SVM-Neg is proposed that makes use of the available negative samples. The experimental results show that the proposed feature extraction method with OC-SVM-Neg outperforms binary classifiers in false alarm rate and generalization ability.

Ming et al. [45] presented replacement attacks that defeat behavior-based similarity comparison by poisoning behavior-based specifications. The key technique is to replace a system-call dependency graph with semantically equivalent variants so that similar malware samples from the same family appear distinct. As a result, malware analysts have to put more effort into re-examining samples that may have been analyzed before. They distilled general attack strategies by mining the behavior specifications of more than 5,200 malware samples and implemented a compiler-level prototype to automate the replacement attacks. Evaluating on real malicious samples, they demonstrated the effectiveness of the proposed method in obstructing several behavior-based malware analysis tasks, such as clustering and malware comparison. Finally, they discussed possible countermeasures to strengthen current malware defenses.

Nikolopoulos and Polenakis [46] proposed a graph-based model that, using relations between groups of system calls, determines whether an unknown software sample is malicious or benign, and classifies malicious software into one of a set of known malware families. More precisely, they used System-call Dependency Graphs (ScD-graphs), obtained from traces captured through dynamic taint analysis. They designed the model to be resistant to strong mutations by applying detection and classification on a weighted directed graph, namely the Group Relation Graph (Gr-graph), which results from grouping disjoint subsets of the ScD-graph's vertices. For detection they proposed the Delta-similarity metric, and for classification the SaMe-similarity and NP-similarity metrics, which together constitute the SaMe-NP similarity. Finally, they evaluated the model for malware detection and classification, demonstrating its potential against malicious software by measuring detection rates and classification accuracy.

Sheen et al. [47] considered Android malware and designed an adaptive detection mechanism using multi-feature collaborative decision fusion (MCDF). Different features of a malicious file, such as permission-based features and API-call-based features, are considered in order to provide better detection by training an ensemble of classifiers and combining their decisions with a collaborative approach based on probability theory. The performance of the proposed model was evaluated on a collection of Android malware spanning diverse families, and the results show that the approach performs better than state-of-the-art ensemble schemes.

Norouzi et al. [48] proposed different classification methods to detect malware based on the features and behavior of each malware sample. A dynamic analysis method was presented for identifying malware features, and a program was introduced for converting a malware behavior execution history XML document into suitable input for the WEKA tool. To demonstrate performance on training and test data, the authors applied the proposed approaches to a real case-study dataset using WEKA. The evaluation results showed the viability of the proposed data mining approach; moreover, it is efficient for identifying malware, and behavioral classification of malware can be useful for detection in a behavioral antivirus.

Galal et al. [49] proposed a behavior-based features model that describes the malicious actions exhibited by malware samples. To derive the model, the authors first performed dynamic analysis on a relatively recent malware dataset inside a controlled virtual environment and captured traces of the API calls invoked by malware samples. The traces were then generalized into high-level features referred to as actions. The proposed method was evaluated with several well-known classification algorithms, such as random forest, decision tree, and SVM; the experimental results show that the classifiers attain high precision and satisfactory results in detecting malware variants.
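The following is a minimal sketch of the "API calls → high-level actions" generalization described for Galal et al. [49], with made-up action names: each logged API call is mapped to an abstract action, and actions are counted per trace to form a compact feature vector.

```python
# Toy abstraction of raw API calls into high-level action counts.
ACTION_MAP = {  # hypothetical abstraction table
    "CreateFileW": "file_create",
    "WriteFile": "file_write",
    "RegSetValueExW": "registry_modify",
    "InternetOpenUrlA": "network_access",
    "CreateRemoteThread": "code_injection",
}

def trace_to_actions(api_trace: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for api in api_trace:
        action = ACTION_MAP.get(api, "other")
        counts[action] = counts.get(action, 0) + 1
    return counts

print(trace_to_actions(["CreateFileW", "WriteFile", "CreateRemoteThread"]))
# {'file_create': 1, 'file_write': 1, 'code_injection': 1}
```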

Summary of the reviewed behavior-based approaches

According to the reviewed behavior-based detection approaches, a comparison of the proposed articles is given in Table 5, which presents the main idea, advantages, disadvantages, and target environment of each study. The main advantage of behavior-based detection approaches is that detecting suspicious files by their call behavior increases the accuracy of malware prediction; the main disadvantage is the runtime overhead. The target environments fall into three main platforms: embedded systems, Windows-based systems, and smartphones. Most studies on behavior-based detection used the smartphone environment to demonstrate the proposed malware detection approach.

Also, Table 6 shows a technical comparison of the behavior-based detection factors in each article: case-study method, classification or clustering approach, data analysis method, dataset used, total dataset size, and accuracy.

In this section, a statistical analysis of the reviewed data mining approaches to malware detection is presented. Figure 4 shows the distribution of classification methods across the selected malware detection approaches: SVM is the most used method at 29%, J48 accounts for 17%, NB for 10%, RF for 5%, ANN for 3%, and the remaining methods for less than 2% each. We find that the SVM method also has the best accuracy among the signature-based malware detection approaches using data mining.

Classification methods in malware detection mechanism

Also, Fig. 5 shows the accuracy of each study. As shown, all accuracy values are higher than 80%. The maximum accuracy is 99.2% for the DPIM approach [41], and the minimum is 86% for the DMDAM approach [22].

Accuracy factor for selected approaches in malware detection

Also, Fig. 6 shows the main case study of each work on malware detection. As shown, recent research has concentrated on Android smartphones for analyzing malware detection approaches, at 40% of case studies; symbolic code aggregation on Windows-based platforms accounts for 23%, pattern mining for 11%, and system calls for 8%.

Case study analysis for each research in malware detection

In addition, Fig. 7 illustrates the total dataset size used for malware detection analysis in each study. Five studies used more than 5,000 real samples during evaluation. The BBA approach [44] has the largest dataset, with 17,000 samples, and the AMD approach [38] the smallest, with 500 samples.

The total number of dataset used in each research

Also, Fig. 8 shows the share of data analysis methods, in terms of static, dynamic, and hybrid analysis, across the selected research. Dynamic analysis is the most used at 51%, hybrid analysis accounts for 29%, and static analysis for 20%. Thirty percent of the signature-based approaches used dynamic data analysis, while 65% of the behavior-based malware detection approaches did so.

The data analysis methods in the selected articles

Open issues

Based on this survey of malware detection approaches, the following research challenges are presented as open issues that have not yet been addressed by the research community.

Decryption/encryption detection: One of the important open issues in malware detection is information-hiding malware techniques. Information hiding is used to make information hard to notice; this practice should not be confused with encryption, in which the content is unreadable but its presence is evident. The two mechanisms are often used together to ensure that a communication remains undetected. Steganography is one of the best-known subfields of information hiding and aims to conceal secret data in a suitable carrier.

Meta-heuristic detection: Malware detection analyses using meta-heuristic algorithms can improve both the execution time and the overall accuracy of the data mining process.

Real-time malware detection: Real-time detection based on hybrid analysis, secure multi-objective evolutionary malware detection, secure e-banking environments, and secure healthcare systems all face significant challenges in recognizing malicious files and hidden attacks using data mining approaches.

Further studies are suggested to improve the accuracy of the related malware detection methods using evolutionary mechanisms.

In this survey, we performed a comprehensive search covering more than 35 authors and works. However, considering the increasing pace of publication on this topic, it is not possible to guarantee that all relevant articles were retrieved, particularly for the period since 2010, because the search concluded in July 2017.

Suggestion criteria

Based on the preceding discussion and analysis, some technical suggestions are introduced for extending malware detection approaches to new platforms and architectures, such as Internet of Things (IoT) applications, e-banking, and social networks.

Evolutionary methods can improve malware detection for predicting polymorphic attacks in electronic wallet applications; for example, a meta-heuristic algorithm can find the optimal detection signature for a polymorphic malware attack in electronic mobile payments.

Context-aware detection is a new idea for dynamic malware detection in IoT applications, based on semantic signatures that categorize API calls with respect to the interactions between the end user and the application layer of the IoT. When smart devices cannot mediate between user devices and datacenters, the reliability and availability of smart services decrease.

Providing a safe environment for huge data collections such as big data against malware attacks is a key challenge for malware detection in big data security. To select the minimal sample space of malware damage, the collection and storage of big data can be guided using data mining and synthesis methods.

Conclusion

This paper presented a systematic literature survey of malware detection approaches using data mining. The reviewed papers were investigated and classified into two main categories: (1) signature-based and (2) behavior-based approaches. The malware detection approaches were compared and analyzed according to essential factors such as classification approach, data analysis method, dataset size, accuracy, and case study, and the advantages and disadvantages of each method were discussed. Most of the selected articles in data mining use behavior-based techniques, and in the malware analysis stage most case studies target Android smartphones. In addition, using meta-heuristic algorithms in malware detection analysis can improve both the execution time and the overall accuracy of the data mining process. From the statistical results, we observed that SVM is the most used classification method at 29%, J48 at 17%, decision tree at 14%, NB at 10%, and BF at 5%, with the other methods below 3%; SVM also has the best accuracy among the signature-based malware detection approaches using data mining. The maximum accuracy is 99.2% for the DPIM approach, and the minimum is 86% for the DMDAM approach. We also observed that recent research has concentrated on Android smartphones, at 40% of case studies; symbolic code aggregation on Windows-based platforms accounts for 23%, pattern mining for 11%, and system calls for 8%. Finally, 30% of the signature-based approaches and 65% of the behavior-based approaches used dynamic data analysis. As important open issues, secure multi-objective malware detection, e-banking environments, and healthcare-system malware attacks remain challenging areas for recognizing malicious files and hidden attacks.

References

Souri A, Norouzi M, Asghari P (2017) An analytical automated refinement approach for structural modeling large-scale codes using reverse engineering. Int J Inf Technol 9:329–333. https://doi.org/10.1007/s41870-017-0050-7


Souri A, Navimipour NJ, Rahmani AM (2017) Formal verification approaches and standards in the cloud computing: a comprehensive and systematic review. Comput Stand Interfaces. https://doi.org/10.1016/j.csi.2017.11.007

Hashemi H, Azmoodeh A, Hamzeh A, Hashemi S (2017) Graph embedding as a new approach for unknown malware detection. J Comput Virol Hacking Tech 13:153–166. https://doi.org/10.1007/s11416-016-0278-y


Park JH (2017) Novel approaches for applying linguistic processing techniques based on pattern recognition and machine learning. JIPS (J Inf Process Syst) 13:643–652

Souri A, Asghari P, Rezaei R (2017) Software as a service based CRM providers in the cloud computing: challenges and technical issues. J Serv Sci Res 9:219–237. https://doi.org/10.1007/s12927-017-0011-5

Bhattacharya A, Goswami RT (2017) DMDAM: data mining based detection of android malware. In: Mandal JK, Satapathy SC, Sanyal MK, Bhateja V (eds) Proceedings of the first international conference on intelligent computing and communication. Springer Singapore, Singapore, pp 187–194

Nikolopoulos SD, Polenakis I (2017) A graph-based model for malware detection and classification using system-call groups. J Comput Virol Hacking Tech 13:29–46. https://doi.org/10.1007/s11416-016-0267-1

Pektaş A, Acarman T (2017) Classification of malware families based on runtime behaviors. J Inf Secur Appl 37:91–100. https://doi.org/10.1016/j.jisa.2017.10.005

Ye Y, Chen L, Hou S, Hardy W, Li X (2017) DeepAM: a heterogeneous deep learning framework for intelligent malware detection. Knowl Inf Syst. https://doi.org/10.1007/s10115-017-1058-9

Safarkhanlou A, Souri A, Norouzi M, Sardroud SEH (2015) Formalizing and verification of an antivirus protection service using model checking. Procedia Comput Sci 57:1324–1331. https://doi.org/10.1016/j.procs.2015.07.443

Li Z, Sun L, Yan Q, Srisa-an W, Chen Z (2017) DroidClassifier: efficient adaptive mining of application-layer header for classifying android malware. In: Deng R, Weng J, Ren K, Yegneswaran V (eds) Security and privacy in communication networks: 12th international conference, securecomm 2016, Guangzhou, China, October 10–12, 2016, Proceedings. Springer International Publishing, Cham, pp 597–616

Malhotra R, Jangra R (2017) Prediction & assessment of change prone classes using statistical & machine learning techniques. J Inf Process Syst 13(4):778–804. https://doi.org/10.3745/JIPS.04.0013

Chowdhury M, Rahman A, Islam R (2018) Malware analysis and detection using data mining and machine learning classification. In: Abawajy J, Choo K-KR, Islam R (eds) International conference on applications and techniques in cyber security and intelligence: applications and techniques in cyber security and intelligence. Springer International Publishing, Cham, pp 266–274

Palumbo P, Sayfullina L, Komashinskiy D, Eirola E, Karhunen J (2017) A pragmatic android malware detection procedure. Comput Secur 70:689–701. https://doi.org/10.1016/j.cose.2017.07.013

Narayanan A, Chandramohan M, Chen L, Liu Y (2017) A multi-view context-aware approach to Android malware detection and malicious code localization. Empir Softw Eng. https://doi.org/10.1007/s10664-017-9539-8

Mohamed GAN, Ithnin NB (2018) SBRT: API signature behaviour based representation technique for improving metamorphic malware detection. In: Saeed F, Gazem N, Patnaik S, Saed Balaid AS, Mohammed F (eds) Recent trends in information and communication technology. Proceedings of the 2nd international conference of reliable information and communication technology (IRICT 2017). Springer International Publishing, Cham, pp 767–777

Malhotra A, Bajaj K (2016) A hybrid pattern based text mining approach for malware detection using DBScan. CSI Trans ICT 4:141–149. https://doi.org/10.1007/s40012-016-0095-y

Siddiqui M, Wang MC, Lee J (2008) A survey of data mining techniques for malware detection using file features. In: Proceedings of the 46th annual southeast regional conference on xx. 2008. ACM

Sun L, Li Z, Yan Q, Srisa-an W, Pan Y (2016) SigPID: significant permission identification for android malware detection. In: 2016 11th international conference on malicious and unwanted software (MALWARE), pp 1–8

Boujnouni ME, Jedra M, Zahid N (2015) New malware detection framework based on N-grams and support vector domain description. In: 2015 11th international conference on information assurance and security (IAS), pp 123–128

Wuechner T, Cislak A, Ochoa M, Pretschner A (2017) Leveraging compression-based graph mining for behavior-based malware detection. IEEE Trans Dependable Secur Comput. https://doi.org/10.1109/tdsc.2017.2675881

Bhattacharya A, Goswami RT (2017) Comparative analysis of different feature ranking techniques in data mining-based android malware detection. In: Satapathy SC, Bhateja V, Udgata SK, Pattnaik PK (eds) Proceedings of the 5th international conference on frontiers in intelligent computing: theory and applications: FICTA 2016, Volume 1. Springer Singapore, Singapore, pp 39–49

Fan CI, Hsiao HW, Chou CH, Tseng YF (2015) Malware detection systems based on API log data mining. In: 2015 IEEE 39th annual computer software and applications conference, pp 255–260

Wang P, Wang Y-S (2015) Malware behavioural detection and vaccine development by using a support vector model classifier. J Comput Syst Sci 81:1012–1026. https://doi.org/10.1016/j.jcss.2014.12.014

Fraley JB, Figueroa M (2016) Polymorphic malware detection using topological feature extraction with data mining. In: SoutheastCon 2016, pp 1–7

Sun M, Li X, Lui JC, Ma RT, Liang Z (2017) Monet: a user-oriented behavior-based malware variants detection system for android. IEEE Trans Inf Forensics Secur 12:1103–1112

Sun H, Wang X, Buyya R, Su J (2017) CloudEyes: cloud-based malware detection with reversible sketch for resource-constrained internet of things (IoT) devices. Softw Pract Exp 47:421–441. https://doi.org/10.1002/spe.2420

Tang Y, Xiao B, Lu X (2011) Signature tree generation for polymorphic worms. IEEE Trans Comput 60:565–579. https://doi.org/10.1109/TC.2010.130


Wu B, Lu T, Zheng K, Zhang D, Lin X (2014) Smartphone malware detection model based on artificial immune system. China Commun 11:86–92. https://doi.org/10.1109/CC.2014.7022530

Bat-Erdene M, Park H, Li H, Lee H, Choi MS (2017) Entropy analysis to classify unknown packing algorithms for malware detection. Int J Inf Secur 16(3):227–248. https://doi.org/10.1007/s10207-016-0330-4

Cui B, Jin H, Carullo G, Liu Z (2015) Service-oriented mobile malware detection system based on mining strategies. Pervasive Mob Comput 24:101–116. https://doi.org/10.1016/j.pmcj.2015.06.006

Fan Y, Ye Y, Chen L (2016) Malicious sequential pattern mining for automatic malware detection. Expert Syst Appl 52:16–25. https://doi.org/10.1016/j.eswa.2016.01.002

Hellal A, Romdhane LB (2016) Minimal contrast frequent pattern mining for malware detection. Comput Secur 62:19–32. https://doi.org/10.1016/j.cose.2016.06.004

Martín A, Menéndez HD, Camacho D (2016) MOCDroid: multi-objective evolutionary classifier for Android malware detection. Soft Comput 21:7405–7415. https://doi.org/10.1007/s00500-016-2283-y

Santos I, Brezo F, Ugarte-Pedrero X, Bringas PG (2013) Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf Sci 231:64–82. https://doi.org/10.1016/j.ins.2011.08.020


Rehman Z-U, Khan SN, Muhammad K, Lee JW, Lv Z, Baik SW, Shah PA, Awan K, Mehmood I (2017) Machine learning-assisted signature and heuristic-based detection of malwares in Android devices. Comput Electr Eng. https://doi.org/10.1016/j.compeleceng.2017.11.028

Alam S, Qu Z, Riley R, Chen Y, Rastogi V (2017) DroidNative: automating and optimizing detection of Android native code malware variants. Comput Secur 65:230–246. https://doi.org/10.1016/j.cose.2016.11.011

Altaher A (2016) An improved Android malware detection scheme based on an evolving hybrid neuro-fuzzy classifier (EHNFC) and permission-based features. Neural Comput Appl 28:4147–4157. https://doi.org/10.1007/s00521-016-2708-7

Mohaisen A, Alrawi O, Mohaisen M (2015) AMAL: high-fidelity, behavior-based automated malware analysis and classification. Comput Secur 52:251–266. https://doi.org/10.1016/j.cose.2015.04.001

Yuan Z, Lu Y, Xue Y (2016) Droiddetector: android malware characterization and detection using deep learning. Tsinghua Sci Technol 21:114–123. https://doi.org/10.1109/TST.2016.7399288

Boukhtouta A, Mokhov SA, Lakhdari N-E, Debbabi M, Paquet J (2016) Network malware classification comparison using DPI and flow packet headers. J Comput Virol Hacking Tech 12:69–100. https://doi.org/10.1007/s11416-015-0247-x

Ding Y, Yuan X, Tang K, Xiao X, Zhang Y (2013) A fast malware detection algorithm based on objective-oriented association mining. Comput Secur 39(Part B):315–324. https://doi.org/10.1016/j.cose.2013.08.008

Eskandari M, Khorshidpour Z, Hashemi S (2013) HDM-Analyser: a hybrid analysis approach based on data mining techniques for malware detection. J Comput Virol Hacking Tech 9:77–93. https://doi.org/10.1007/s11416-013-0181-8

Miao Q, Liu J, Cao Y, Song J (2016) Malware detection using bilayer behavior abstraction and improved one-class support vector machines. Int J Inf Secur 15:361–379. https://doi.org/10.1007/s10207-015-0297-6

Ming J, Xin Z, Lan P, Wu D, Liu P, Mao B (2016) Impeding behavior-based malware analysis via replacement attacks to malware specifications. J Comput Virol Hacking Tech 13:193–207. https://doi.org/10.1007/s11416-016-0281-3



Sheen S, Anitha R, Natarajan V (2015) Android based malware detection using a multifeature collaborative decision fusion approach. Neurocomputing 151(Part 2):905–912. https://doi.org/10.1016/j.neucom.2014.10.004

Norouzi M, Souri A, Samad Zamini M (2016) A data mining classification approach for behavioral malware detection. J Comput Netw Commun 2016:9. https://doi.org/10.1155/2016/8069672

Galal HS, Mahdy YB, Atiea MA (2016) Behavior-based features model for malware detection. J Comput Virol Hacking Tech 12:59–67. https://doi.org/10.1007/s11416-015-0244-0

Mao W, Cai Z, Towsley D, Feng Q, Guan X (2017) Security importance assessment for system objects and malware detection. Comput Secur 68:47–68. https://doi.org/10.1016/j.cose.2017.02.009

Wu S, Wang P, Li X, Zhang Y (2016) Effective detection of android malware based on the usage of data flow APIs and machine learning. Inf Softw Technol 75:17–25. https://doi.org/10.1016/j.infsof.2016.03.004

Dali Z, Hao J, Ying Y, Wu D, Weiyi C (2017) DeepFlow: deep learning-based malware detection by mining Android application for abnormal usage of sensitive data. In: 2017 IEEE symposium on computers and communications (ISCC), pp 438–443


Authors’ contributions

AS is the corresponding author and RH is the co-author. Both authors read and approved the final manuscript.

Acknowledgements

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

Consent for publication, ethics approval and consent to participate.

We confirm that this manuscript has not been published elsewhere and is not under consideration by another journal. All authors have approved the manuscript and agree with its submission.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and affiliations.

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Alireza Souri

Department of Computer Engineering, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran

Rahil Hosseini


Corresponding author

Correspondence to Alireza Souri.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Souri, A., Hosseini, R. A state-of-the-art survey of malware detection approaches using data mining techniques. Hum. Cent. Comput. Inf. Sci. 8, 3 (2018). https://doi.org/10.1186/s13673-018-0125-x


Received : 20 July 2017

Accepted : 02 January 2018

Published : 12 January 2018

DOI : https://doi.org/10.1186/s13673-018-0125-x


  • Data mining
  • Malware detection
  • Classification
  • Behavior-based
  • Signature-based


A Recent Research on Malware Detection Using Machine Learning Algorithm: Current Challenges and Future Works

  • Conference paper
  • First Online: 16 November 2021
  • Cite this conference paper


  • Nor Zakiah Gorment 16 , 19 ,
  • Ali Selamat   ORCID: orcid.org/0000-0001-9746-8459 16 , 17 , 18 , 20 &
  • Ondrej Krejcar   ORCID: orcid.org/0000-0002-5992-2574 20  

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13051)

Included in the following conference series:

  • International Visual Informatics Conference


Each year, malware remains one of the leading cybersecurity concerns, since its complexity constantly evolves as innovation rapidly grows. As a result, malware attacks have come to affect everyday life through a variety of media and channels. Machine learning has therefore become an essential part of computer-system security, owing to the ability of machine learning algorithms to keep up with the evolution of malware. This paper reviews the most up-to-date research on malware detection from 2017 to 2021, in which machine learning algorithms including K-Means, Decision Tree, meta-heuristics, Naïve Bayes, Neuro-fuzzy, Bayesian, Gaussian, Support Vector Machine (SVM), K-Nearest Neighbour (KNN) and n-grams were identified through a systematic literature review. The paper aims at the following: (1) it describes each machine learning algorithm; (2) for each algorithm, it reports the malware detection performance; and (3) it presents the challenges and limitations of the algorithms encountered during the research process.





Acknowledgment

The authors sincerely thank Universiti Teknologi Malaysia (UTM) under Research University Grant Vot-20H04, Malaysia Research University Network (MRUN) Vot 4L876, for completing the research. This work was supported/funded by the Ministry of Higher Education under the Fundamental Research Grant Scheme (FRGS/1/2018/ICT04/UTM/01/1). The work is partially supported by the SPEV project (ID: 2102-2021), Faculty of Informatics and Management, University of Hradec Kralove. We are also grateful for the support of Ph.D. students Michal Dobrovolny and Sebastien Mambou in consultations regarding application aspects from Hradec Kralove University, Czech Republic.

Author information

Authors and affiliations.

Malaysia-Japan International Institute of Technology, Universiti Teknologi Malaysia, Jalan Sultan Yahya Petra, 54100, Kuala Lumpur, Malaysia

Nor Zakiah Gorment & Ali Selamat

Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia, 81310, Johor Bahru, Johor, Malaysia

Ali Selamat

MagicX (Media and Games Center of Excellence), Universiti Teknologi Malaysia, 81310, Johor Bahru, Johor, Malaysia

College of Computing and Informatics, Universiti Tenaga Nasional, Jalan IKRAM-UNITEN, 43000, Kajang, Selangor, Malaysia

Nor Zakiah Gorment

Faculty of Informatics and Management, Universiti Hradec Kralove, Rokitanskeho 62, 50003, Hradec Kralove, Czech Republic

Ali Selamat & Ondrej Krejcar


Corresponding author

Correspondence to Ali Selamat.

Editor information

Editors and affiliations.

Universiti Tenaga Nasional, Selangor, Malaysia

Halimah Badioze Zaman

Dublin City University, Dublin, Ireland

Alan F. Smeaton

National Central University, Jhongli, Taiwan

Timothy K. Shih

Queen Mary University of London, London, UK

Sergio Velastin

Toyo University, Tokyo, Japan

Tada Terutoshi

University of Southern Denmark, Odense, Denmark

Bo Nørregaard Jørgensen

Hazleen Aris

Nazrita Ibrahim


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper.

Gorment, N.Z., Selamat, A., Krejcar, O. (2021). A Recent Research on Malware Detection Using Machine Learning Algorithm: Current Challenges and Future Works. In: Badioze Zaman, H., et al. Advances in Visual Informatics. IVIC 2021. Lecture Notes in Computer Science(), vol 13051. Springer, Cham. https://doi.org/10.1007/978-3-030-90235-3_41


DOI : https://doi.org/10.1007/978-3-030-90235-3_41

Published : 16 November 2021

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-90234-6

Online ISBN : 978-3-030-90235-3

eBook Packages : Computer Science Computer Science (R0)



Malware Detection

91 papers with code • 2 benchmarks • 4 datasets

Malware Detection is a significant part of endpoint security, covering workstations, servers, cloud instances and mobile devices. It is used to detect and identify malicious activity caused by malware. With the increasing variety of malware activity on CMS-based websites, such as malicious redirects on WordPress sites (the so-called WordPress Malware Redirect Hack) where the site redirects visitors to spam, the need for automatic detection and classification grows as well. Signature-based malware detection is commonly used for existing malware that has a known signature, but it is not suitable for unknown or zero-day malware.

Source: The Threat of Adversarial Attacks on Machine Learning in Network Security - A Survey
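To make the signature limitation concrete, here is a minimal, hypothetical sketch of hash-based signature matching in Python; the digest set and the file path are placeholders, not part of any real product:

```python
import hashlib
from pathlib import Path

# Hypothetical signature database: SHA-256 digests of known malware samples.
KNOWN_MALWARE_SHA256 = {
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",  # placeholder digest
}

def is_known_malware(path: str) -> bool:
    """Return True only if the file's hash matches a stored signature."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest in KNOWN_MALWARE_SHA256

# A zero-day sample, or a repacked variant that differs by a single byte,
# produces a new digest and silently evades this check, which is why the
# task description above motivates learning-based detectors.
```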

Benchmarks

The task page lists two benchmarks; the best-performing models reported on them are Graph2Vec and Sherlock.


Most implemented papers

Malware Detection by Eating a Whole EXE


In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community.

Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN

A generative network is trained to minimize the generated adversarial examples' malicious probabilities predicted by the substitute detector.

Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning

However, deep learning is often criticized for its lack of robustness in adversarial settings (e. g., vulnerability to adversarial inputs) and general inability to rationalize its predictions.

subgraph2vec: Learning Distributed Representations of Rooted Sub-graphs from Large Graphs

Also, we show that the subgraph vectors could be used for building a deep learning variant of Weisfeiler-Lehman graph kernel.

DeepXplore: Automated Whitebox Testing of Deep Learning Systems

First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs.

A learning model to detect maliciousness of portable executable using integrated feature set

urwithajit9/ClaMP • journal 2017

In the experiments conducted on the novel test data set, the accuracy was observed as 89.23% for the integrated feature set, a 15% improvement over the accuracy achieved with the raw-feature set alone.

Learning the PE Header, Malware Detection with Minimal Domain Knowledge

Many efforts have been made to use various forms of domain knowledge in malware detection.

DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification

tychen5/sportslottery • 21 Nov 2017

While conventional signature and token based methods for malware detection do not detect a majority of new variants for existing malware, the results presented in this paper show that signatures generated by the DBN allow for an accurate classification of new malware variants.

Efficient Formal Safety Analysis of Neural Networks

tcwangshiqi-columbia/ReluVal • NeurIPS 2018

Our approach can check different safety properties and find concrete counterexamples for networks that are 10$\times$ larger than the ones supported by existing analysis techniques.

Automatic Malware Description via Attribute Tagging and Similarity Embedding

With the rapid proliferation and increased sophistication of malicious software (malware), detection methods no longer rely only on manually generated signatures but have also incorporated more general approaches like machine learning detection.



Evaluation of Machine Learning Algorithms for Malware Detection

Associated data.

The data used to support the findings of this study are available from the corresponding author upon request.

This research study focused mainly on dynamic malware detection: because malware changes progressively, dynamic detection techniques were adopted throughout. Each day brings a new influx of malicious programmes that threaten online safety by exploiting vulnerabilities in the Internet, and this proliferation has rendered manual heuristic examination of malware ineffective. Automatic behaviour-based malware detection using machine learning algorithms is therefore considered a game-changing innovation. Threats are automatically evaluated based on their behaviour in a simulated environment, and reports are created. These reports are converted into sparse vector models for use in subsequent machine learning steps. The classifiers used to synthesise the results of this study were kNN, DT, RF, AdaBoost, SGD, extra trees and the Gaussian NB classifier. After reviewing the test and experimental data for all of the classifiers, we found that RF, SGD, extra trees and the Gaussian NB classifier each achieved 100% test accuracy, together with perfect precision (1.00), recall (1.00) and F1-score (1.00). It is therefore reasonable to conclude that a proof of concept employing autonomous behaviour-based malware analysis and machine learning methodologies can identify malware effectively and rapidly.

1. Introduction

Cyberattacks from hackers are currently the leading cause of concern in the technological world.

Traditional antivirus systems that rely on signature matching often overlook polymorphic and newly discovered malicious executables, an issue that must be addressed given the rapid spread of malware on the Internet. As they spread throughout the Internet and into new environments, viruses and other types of malware become more widespread and hazardous [1]. Static analysis based on human heuristic inspection is no longer considered practicable or efficient in light of the continuing growth of malware. However, several novel approaches to detecting and preventing malware are in development. One strategy recommends integrating data mining techniques, such as machine learning algorithms, with autonomous dynamic malware analysis [2].

These days, guarding computer networks is a top priority for security experts. Malware incidents have been on the rise despite the widespread availability of virus scanners and malware detection programmes. Both dynamic and static approaches to malware detection and categorisation have been proposed. A dynamic method can offer advantages over a static one because, in contrast to static detection, hiding dangerous behaviour during execution is far more difficult [3, 4]. In recent years, cybersecurity experts have emphasised machine learning algorithms for detecting malware and predicting the behaviour of malware families. However, there does not appear to be a consolidated repository that compares and rates the various machine learning approaches to identifying malicious data. We therefore conducted a battery of experiments to compare machine learning strategies for identifying malware and classifying it into evolving family clusters. The rate of new malware threats per second is shown in Figure 1 [5].

Figure 1. Threats of new malware per second.

A collected dataset of authentic malware samples, together with innocuous programmes from VirusTotal, was run in a sandboxed setting to record malware behaviour, which we subsequently used to assess machine learning techniques in terms of commonly employed performance metrics [6, 7]. The execution data, collected as JSON reports, provide a promising set of attributes describing the behaviour of a malware sample. The resulting feature set can then be used to distinguish harmful from benign files. The motivation for this work came from the fact that existing approaches have been created to optimise for a wide variety of criteria; as a result, they behave differently even when presented with the same circumstances. We also offer guidelines and future research directions for researchers tackling the challenge of dynamically recognising malware with machine learning techniques. The classification of OS-based threats is depicted in Figure 2 [8]; a minimal vectorisation sketch follows the figure.

Figure 2. Classification of OS-based malware threats.

More and more people access information via the Internet through an ever-wider range of devices, from desktop PCs to embedded systems. Broader Internet access has been incredibly beneficial because of the advantages it provides, such as rapid communication [9]. As long as they have connectivity, users may log in and use their preferred online services whenever they want. It is no surprise that malware has proliferated as Internet use has increased, since it provides a lucrative market for illegal software makers. The exponential growth in malware detections shown in Figure 3 occurred in just the past few years [10, 11, 12].

Figure 3. Total amounts of malware and potentially unwanted applications (PUAs).

Anti-malware, intrusion detection and other forms of detection have all emerged as responses to the harm caused by malware. However, certain urgent issues require immediate attention in light of the ever-changing techniques employed by malicious software and the widespread presence of security flaws in popular programmes [13, 14]. Many approaches from different areas have been presented for efficient malware detection. Since it is more difficult to disguise the destructive behaviour of malware while it is being run, dynamic techniques have proven more effective than static ones. As researchers have discovered the benefits of dynamic and automated techniques, they have shifted their focus away from traditional methods of malware detection [15].

2. Literature Review

A novel representation for tracking the actions of malicious programmes, dubbed MIST, was proposed by Trinius (2016). The representation has been fine-tuned for the efficient study of behaviour using data mining and machine learning. During malware investigation, it can be gathered automatically by a behaviour-monitoring programme or converted manually from existing behaviour reports. Rieck (2018) attempts to use commonalities between malicious programmes to categorise them [16]. According to Patil (2020), there are consistent patterns of behaviour among malware versions that can be used to infer the authors' intentions. The first step in their process involves observing how malware behaves in a sandbox setting; the second relies on a corpus of malware annotated by an antivirus programme; and the third involves analysing the results, as illustrated in Figure 4 [17].

Figure 4. Malware analysis methods [17].

Learning methods are used to train a malware behaviour classifier, and the most distinguishing characteristics of the behaviour models are prioritised in order to provide an explanation for the classifications made. To analyse malware’s activity automatically, Rieck (2018) proposes a methodology based on machine learning [ 18 ]. The framework can classify unknown malware into previously identified classes based on their behaviours. Christodorescu (2018) proposes a method that uses a comparison between the execution patterns of known malware and a collection of innocuous apps to identify potential threats. The authors extract harmful features from a known virus that are absent in a collection of innocuous software. Malware detectors can use the results of the authors’ algorithm to identify new malware [ 19 ].

Machine learning algorithms concentrate on improving the quality of features by means of engineering, selection and representation. The model is trained with data representing the features of each class, producing a goodware/malware plane that can be used to distinguish between malicious and legitimate software [20]. Understanding the domain is crucial for feature engineering and selection. Traditional machine learning-based malware detection systems share a weakness: they can be attacked if adversaries manage to reverse engineer, understand and reproduce the characteristics used by the model. Having a wide variety of examples to learn from is essential for machine learning algorithms, but privacy and security concerns mean that high-quality data for malware analysis are scarce in the public domain. Many researchers therefore create their own datasets for study using the procedures developed by data scientists [21]. Examining the volume of material surveyed by Ye (2017) would be an enormous task. All of these factors make it difficult to develop a real-time machine learning-based malware detection system [22].

Contemporary AI systems employ deep learning models, a refined version of the neural network, to carry out a wide variety of tasks in natural language processing and robotics. Such a model attempts to store a detailed representation of features in its hidden layers during training and can learn from its errors. Neelam (2020) examines studies that use deep learning models for malware analysis [23].

In 2015, Microsoft held a malware classification competition using the Kaggle platform. The provided database contained around 20,000 malware samples or almost half a terabyte. Using this information, Ronen (2015) analysed published studies and proposed studies in the field [ 24 ].

Souri (2020) conducted a thorough literature study of the strategies proposed for malware detection using data analysis techniques. The relevant scientific literature was partitioned into signature-based and behaviour-based groups, and a comparative analysis of the methods was carried out. In addition, recent studies have shown that hybrid techniques are more accurate than static or dynamic analysis alone and therefore should not be neglected [25].

Y. Ye (2018) summarised previous work on cloud-based malware detection methods, feature extraction and classification tactics, and current malware development trends. This research also analysed and compared studies that used static analysis, dynamic analysis and a hybrid approach. However, the latest year covered by these analyses is 2017, and the larger scale of recent investigations has made it clear that this effort needs to be expanded [26, 27].

In a 2017 article, Ucci summed up the various machine learning algorithms under consideration for identifying malicious PE files on Windows [28]. The studies were systematically arranged according to several criteria, including the aims, methodology and sample characteristics of each individual study. The economics of malware analysis (the "malware analysis economy") are discussed, along with the associated difficulties. Three years have now passed since the original study was published, and more research is warranted [29, 30].

A. Research Gap

Cybercriminals create harmful software and introduce it into computer systems in an effort to gain access or cause damage. Antivirus software, log file analysis and interaction monitoring are all used by businesses to look for telltale signs of malicious or suspicious behaviour that may indicate a recognised threat or attack pattern [1]. Signature-based malware detection systems can identify well-known threats effectively; nevertheless, they are easily bypassed by attackers. Much research on improving malicious file detection has focused on increasing detection rates, decreasing false positive rates and reducing processing time, but this sort of research is difficult to extend and develop because of several issues in the malicious software detection ecosystem. In this study, we analysed published methods for finding malware in files and discussed where further work is needed. We examined efforts to standardise the measurement, description, assessment and architecture of malware detection, and we pinpointed elements that may make research on detecting harmful files more accessible and extensible [2].

3. Research Problem

The use of computers and their associated hardware is not only pervasive, but also intrinsically risky. This makes it possible for cybercriminals to create malicious software, take over computers and steal data [ 3 ]. Security professionals have a tough time providing constant, foolproof protection for computer systems because of a number of factors. In order to gain access or inflict damage, cybercriminals create malicious code and inject it into several computing systems. Organisations use antivirus software, log file analysis and interaction monitoring to identify patterns of behaviour that are consistent with known threats or attack vectors [ 4 ].

Malware’s harmful components can be uncovered via static or dynamic analysis. Decompiling the malware and parsing its files with static analysis both aim to find harmful strings hidden inside the files; running the harmful code in a safe environment, such as a virtual machine, allows it to be monitored dynamically. Both methods have their advantages and disadvantages, but it is best to use both when analysing malware [5]. Better malware detection might result from using fewer features, which would also give the researcher more time to examine the data. Our concern is that too many attributes are being used to detect malware when a more limited collection of characteristics would suffice. The initial step in deciding which features to use is to identify candidate approaches or algorithms [6]. What is needed is a way to drastically reduce the number of characteristics required to detect malware while still detecting previously unseen malware [7, 9].

4. Research Framework

The proliferation of more complex kinds of malware poses a growing threat to modern computing infrastructure. Traditional signature-based malware detection technologies are becoming less and less useful in the face of the exponential proliferation of malware samples [10]. Researchers have shown that machine learning can accurately detect and label harmful files. Furthermore, the accuracy of these machine learning models may be improved by using feature selection techniques to identify the most important features and reduce the size of the dataset, which results in fewer calculations. The research framework is depicted in Figure 5 [11].

Figure 5. Research framework.

In this study, we introduced a machine learning-based approach to malware analysis to enhance the efficiency and precision of malware detection and categorisation. We used the Cuckoo sandbox, which executes malware in an isolated environment and generates a report outlining its actions in the sandbox, to perform dynamic analysis [13, 15, 16]. In addition, we proposed a module for feature extraction and selection, which, as the name indicates, extracts features from the report before picking the most important attributes to ensure high accuracy at little computational cost. Then, for fine-grained classification and pinpoint detection, we employed a wide range of machine learning techniques. Our experimental results demonstrated higher detection and classification accuracy than state-of-the-art approaches. The malware detection framework structure is shown in Figure 6 [17]; a brief extraction sketch follows the figure.

Figure 6. Malware detection framework structure.
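As a minimal, hedged sketch of the feature extraction step, the following Python snippet counts API call names in a Cuckoo JSON report. It assumes the usual Cuckoo report layout (behavior → processes → calls → api), which may differ between sandbox versions:

```python
import json
from collections import Counter

def api_call_histogram(report_path: str) -> Counter:
    """Count API call names in a Cuckoo JSON report.

    Assumes report["behavior"]["processes"], each process carrying a list
    of "calls" with an "api" field; adjust the keys if your sandbox
    version uses a different schema.
    """
    with open(report_path) as f:
        report = json.load(f)
    counts = Counter()
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            counts[call.get("api", "UNKNOWN")] += 1
    return counts

# Histograms from many reports can then be turned into sparse vectors,
# e.g. with the DictVectorizer sketch shown earlier.
```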

5. Research Methodology

Figure 6 is a high-level overview of our machine learning-based malware detection procedure [ 18 ]. Some of the steps in this process include finding interesting datasets to train a classifier, detecting sophisticated malware and selecting features to include in the model. The following is a more in-depth explanation of the approach that was taken during this study. The proposed method is shown in Figure 7 [ 21 ].

Figure 7. Proposed method of malware detection.

5.1. Dataset

The selected dataset was taken from the Kaggle library. The training set was built using a combination of native and non-native characteristics extracted from Windows programs. There were 373 samples in the file, of which 301 were malicious and the remaining 72 were benign. There were 531 columns, listed from F1 to F531 and including a label column that indicates whether or not the file is harmful. The Kaggle data were used exclusively for this study. Many of the files in this archive contained log data captured from various forms of malware, and a broad range of models may be trained using the recovered log information. The samples turned out to be infected with malware from five different families. In total, more than 198,063 separate data points were gathered from a wide variety of sources. The data had 373 rows and 531 columns, as shown in Table 1 [22, 23, 24, 25]; a loading sketch follows the table.

Table 1. Dataset preview.

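A minimal sketch of loading and inspecting such a dataset with pandas; the file name and the label column name are assumptions, since the paper does not state them:

```python
import pandas as pd

# Hypothetical file name for the Kaggle export described above.
df = pd.read_csv("malware_dataset.csv")

print(df.shape)                     # expected (373, 531) per the paper
print(df["label"].value_counts())   # expected: 301 malicious, 72 benign

# Separate the feature matrix from the target for the steps that follow.
X = df.drop(columns=["label"])
y = df["label"]
```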

5.2. EDA and Visualisation

Features in the tens of thousands are common in modern datasets, and the problem becomes more noticeable as the number of characteristics in a machine learning model increases [26, 27].
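Continuing the loading sketch above (X as defined there), one cheap EDA step is to drop near-constant columns; the 0.01 threshold is an illustrative choice, not a value reported in the paper:

```python
from sklearn.feature_selection import VarianceThreshold

# Remove features whose variance falls below the (illustrative) cutoff.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(f"{X.shape[1]} features reduced to {X_reduced.shape[1]}")
```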

5.3. Feature Selection

After new features are discovered through the process of feature extraction, the next step is to choose which features to use. Selecting features from a collection of newly recognised traits is called feature selection, and it is instrumental in improving the model accuracy, streamlining the model and reducing overfitting [ 28 ]. A variety of feature classification methods have been used by researchers to try and identify malicious software. As this study’s primary focus is on developing models to detect malware, the feature rank strategy is heavily utilised [ 29 ].
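The paper does not spell out its ranking procedure, but a common feature-rank strategy is to score features with a random forest's impurity-based importances and keep the top k; the cutoff k below is purely illustrative, and the variables continue the earlier sketches:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a forest purely to obtain an importance score per feature.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = pd.Series(forest.feature_importances_, index=X.columns)
ranking = ranking.sort_values(ascending=False)

k = 50  # illustrative cutoff; the paper does not report its value
X_selected = X[ranking.head(k).index]
```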

After feature selection, Figure 8 shows that our dataset was dominated by malicious data points, which made up 78% of the total, while non-malicious data points accounted for the remaining 22%.

Figure 8. Counts of malicious and non-malicious data points after feature selection.

6. Results and Discussion

For any categorisation technique to be effective, training and testing must be conducted, and the system must be trained on both potentially dangerous and benign data [1, 7, 15]. Machine learning techniques allow a classifier to be trained to produce high-quality predictions automatically. Classifiers such as random forest, SGD, extra trees and Gaussian NB all improve as they are exposed to more and more labelled data. During the validation step, the classifier is presented with a set of new files, some harmful and some not, and asked to label them accordingly [18, 19].

A visual representation of the RF, SGD, extra trees and Gaussian NB models is shown in Figure 9. A dropout is used in the final fully connected layer of these models; in most cases, the dropout appears to be used simply to add more layers to the model as a whole, rather than as a regularisation technique [21].

Figure 9. RF, SGD, extra trees and Gaussian NB models.

In this section, we report the outcomes of an experimental evaluation of our suggested strategy for classifying and detecting malware. After generating a dataset of malware and cleanware, we put it to use in testing. We analysed malware and placed it into different groups using a number of supervised machine learning methods: kNN, DT, RF, AdaBoost, SGD, extra trees and the Gaussian NB classifier [22, 25].
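A minimal sketch of this evaluation loop with scikit-learn, continuing the earlier sketches (X_selected and y as defined there); the split parameters are assumptions, although a 20% test split does match the support of 75 reported in the appendix tables:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "kNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SGD": SGDClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
}

# Fit each classifier and print per-class precision/recall/F1,
# mirroring the appendix tables below.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```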

Table 2 summarises the accuracies of the proposed kNN, DT, RF, AdaBoost, SGD, extra trees and Gaussian NB models. This study exemplifies the growing interest across the academic community in applying ML algorithmic approaches to malware detection [26]. We examined these machine learning methods for malware detection to see which is most effective. In terms of detection accuracy, the findings (Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7 and Figure A8, and Table A1, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7 in the Appendix) demonstrate that the RF, SGD, extra trees and Gaussian NB models are the top classifiers, each having a perfect F1 score, 100% accuracy, 100% precision and 100% recall [28].

Table 2. Illustration of the accuracy.

Model                   Accuracy
RF                      1.00
DT                      0.99
KNN                     0.99
AdaBoost                0.99
SGD                     1.00
Extra Tree Classifier   1.00
Gaussian NB             1.00

Table 2 shows that the RF, SGD, extra trees and Gaussian NB classifiers had a perfect F1 score, 100% accuracy, 100% precision and 100% recall.

In terms of online safety, malware is among the highest priorities. In reality, malware is the root of most Internet issues, including spam e-mails and DDoS attacks: infected computers are frequently linked into larger networks called botnets, and many attacks are carried out through these hostile, attacker-controlled networks. We also discuss the major problems and obstacles that researchers confront, focusing in particular on the issue of concept drift and the difficulties of adversarial learning, as well as the issue of class imbalance and the current state of the benchmarks used by the scientific community to measure the effectiveness of their approaches [29].

7. Conclusions

In conclusion, to overcome the deficiencies of manual feature construction and the limitations of existing learning approaches, this research layered the RF, SGD, extra trees and Gaussian NB models to create a novel ensemble for malware detection. An F1 score of 100%, a precision of 100%, a recall of 100% and an accuracy of 100% were all achieved by the SGD, extra trees and Gaussian NB models. The proposed combination of the RF, SGD, extra trees and Gaussian NB models significantly improved malware detection accuracy, which reached around 1 during training and very nearly 1 during testing. Combining these models allows us to model sequences, learn long-term dependencies and extract spatially local correlations. To boost malware detection rates, lower false positive rates and accelerate detection, many experts in the field have turned to machine learning approaches. Researchers divide the data into a training set, used to teach the algorithm the desired function, and a test set, used to gauge how well the algorithm performs with the newly learned function.
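The paper does not specify how the four models were layered; one plausible reading is a stacked ensemble, sketched below with scikit-learn. The logistic-regression meta-learner is an assumption, and X_train/y_train continue the earlier evaluation sketch:

```python
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB

# Stack the four base models; their outputs feed a meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("sgd", SGDClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # assumed combiner
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```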

Figure A1. Testing of the models.

RF classifier (classification accuracy = 100%).

                Precision   Recall   F1-score   Support
malicious          1.00      1.00      1.00        61
non-malicious      1.00      1.00      1.00        14
accuracy                               1.00        75
macro avg          1.00      1.00      1.00        75
weighted avg       1.00      1.00      1.00        75

Figure A2. RF classifier model summary with a confusion matrix.

DT classifier (classification accuracy = 98.76%).

                Precision   Recall   F1-score   Support
malicious          0.98      1.00      0.99        61
non-malicious      1.00      0.93      0.96        14
accuracy                               0.99        75
macro avg          0.99      0.96      0.98        75
weighted avg       0.99      0.99      0.99        75

Figure A3. DT classifier model summary with a confusion matrix.

KNN classifier (classification accuracy = 98.69%).

                Precision   Recall   F1-score   Support
malicious          0.98      1.00      0.99        61
non-malicious      1.00      0.93      0.96        14
accuracy                               0.99        75
macro avg          0.99      0.96      0.98        75
weighted avg       0.99      0.99      0.99        75

Figure A4. KNN classifier model summary with a confusion matrix.

AdaBoost classifier (classification accuracy = 98.71%).

                Precision   Recall   F1-score   Support
malicious          0.98      1.00      0.99        61
non-malicious      1.00      0.93      0.96        14
accuracy                               0.99        75
macro avg          0.99      0.96      0.98        75
weighted avg       0.99      0.99      0.99        75

Figure A5. AdaBoost classifier model summary with a confusion matrix.

SGD classifier (classification accuracy = 100%).

                Precision   Recall   F1-score   Support
malicious          1.00      1.00      1.00        61
non-malicious      1.00      1.00      1.00        14
accuracy                               1.00        75
macro avg          1.00      1.00      1.00        75
weighted avg       1.00      1.00      1.00        75

Figure A6. SGD classifier model summary with a confusion matrix.

Extra trees classifier.

Figure A7. Extra trees classifier model summary with a confusion matrix.

Gaussian NB classifier.

Figure A8. Gaussian NB classifier model summary with a confusion matrix.

Funding Statement

The National Natural Science Foundation of China (Grant No. 62162039).

Author Contributions

M.S.A. and T.F. contributed equally to the study’s conception. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

The authors declare that they have no conflict of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Muhammad Taseer Suleman. Malware Detection and Analysis Using Reverse Engineering. Published in International journal for… (13 March 2024). DOI: 10.54692/ijeci.2024.0801191.


Open access | Published: 19 June 2024

Detecting hallucinations in large language models using semantic entropy

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal

Nature 630, 625–630 (2024)


Subjects: Computer science, Information technology

Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2 , can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3 , 4 . Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents 5 or untrue facts in news articles 6 and even posing a risk to human life in medical domains such as radiology 7 . Encouraging truthfulness through supervision or reinforcement has been only partially successful 8 . Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.


‘Hallucinations’ are a critical problem 9 for natural language generation systems using large language models (LLMs), such as ChatGPT 1 or Gemini 2 , because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating “content that is nonsensical or unfaithful to the provided source content” 9 , 10 , 11 but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call ‘confabulations’ 12 for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked a medical question “What is the target of Sotorasib?” an LLM confabulates by sometimes answering KRAS G12C (correct) and other times KRAS G12D (incorrect) despite identical instructions. We distinguish this from cases in which a similar ‘symptom’ is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions 13 ; when the LLM ‘lies’ in pursuit of a reward 14 ; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight 15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a great source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers 16 , 17 and regressors 18 , 19 , whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy 20 or as a reliability problem 4 . The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism 21 . Although we agree that metaphor must be used carefully with LLMs 22 , the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the ‘semantic’ entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty 23 , 24 , 25 —so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead. Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores 26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the ‘tokens’ (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check 27 for random seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1 .

Figure 1. a, Naive entropy-based uncertainty measures variation in the exact answers, treating ‘Paris’, ‘It’s Paris’ and ‘France’s capital Paris’ as different. But this is unsuitable for language tasks for which sometimes different answers mean the same things. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b, Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally 28 . That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1 . Textual entailment has previously been shown to correlate with faithfulness 10 in the context of factual consistency 29 as well as being used to measure factuality in abstractive summarization 30 , especially when applied at the right granularity 31 .

Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA 32 ), general knowledge (SQuAD 1.1; ref. 33 ), life sciences (BioASQ 34 ) and open-domain natural questions (NQ-Open 35 ) derived from actual queries to Google Search 36 . In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP 37 ) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free and involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters) 38 , Falcon Instruct (7B and 40B) 39 and Mistral Instruct (7B) 40 . In the Supplementary Information , we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1 ). At the time of writing, GPT-4 (ref. 1 ) did not expose output probabilities 41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities, which we use for all GPT-4 results in this paper and which performs similarly well.

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to ‘learn’ how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24 ). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24 which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final ‘embedding’ (hidden state) of the LLM. We also use the P (True) method 24 which looks at the probability with which an LLM predicts that the next token is ‘True’ when few-shot prompted to compare a main answer with ‘brainstormed’ alternatives.

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an un-informative classifier. We also show a new measure, the area under the ‘rejection accuracy’ curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in Supplementary Material ). The AURAC captures the accuracy improvement which users would experience if semantic entropy was used to filter out questions causing the highest entropy.

Detecting confabulations in QA and math

In Fig. 2 , we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

Figure 2. Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y-axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y-axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information.

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution—which mirrors the common real-world case in which there is a distribution shift between training and deployment 42 —the plotted value is the average metric for embedding regression trained on one of the four ‘off-distribution’ datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P (True) which is supervised ‘in-context’; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790 whereas naive entropy (0.691), P (True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P (True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases to have good uncertainty). We use ten generations to compute entropy, selected using analysis in Supplementary Fig. 2 . Further results for short-phrase generations are described in Supplementary Figs. 7 – 10 .

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4 ). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but still are outperformed by semantic entropy, suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1 . These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table) whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers which do not capture the open-ended flexibility of conversational deployments of LLMs.

Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions which might only agree partially 43 . Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.

Applying semantic entropy to this problem is challenging. Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy (Fig. 1 ) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4 ) and computing the semantic entropy over those generations plus the original factual claim. We aggregate these by averaging the semantic entropy over all the questions to get an uncertainty score for each proposition, which we use to detect confabulations. Unaggregated results are shown in Supplementary Figs. 5 and 6 .

As GPT-4 did not allow access to the probability of the generation at the time of writing, we use a discrete variant of semantic entropy which makes the further approximation that we can infer a discrete empirical distribution over semantic meaning clusters from only the generations ( Methods ). This allows us to compute semantic entropy using only the black-box outputs of an LLM. However, we were unable to compute the naive entropy baseline, the standard semantic entropy estimator or the embedding regression baseline for GPT-4 without output probabilities and embeddings.

In Fig. 3 we show that the discrete variant of semantic entropy effectively detects confabulations on this dataset. Its AUROC and AURAC are higher than either a simple ‘self-check’ baseline—which just asks the LLM whether the factoid is likely to be true—or a variant of P (True) which has been adapted to work for the paragraph-length setting. Discrete semantic entropy has better rejection accuracy performance until 20% of the questions have been rejected at which point P (True) has a narrow edge. This indicates that the questions predicted to cause confabulations are indeed more likely to be wrong.

Figure 3. The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y-axis). The AUROC and AURAC are substantially higher than for both baselines. At above 80% of questions being answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the answer accuracy on the remainder for the P(True) baseline exceed semantic entropy.

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination 44 for scalable oversight through debate 45 .

The success of semantic entropy at detecting errors suggests that LLMs are even better at “knowing what they don’t know” than was argued by ref. 24 —they just don’t know they know what they don’t know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or are systematically misleading the user. We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our ‘discrete’ variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Uncertainty and machine learning

We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input 25 . The predictive entropy (PE) for an input sentence x is the conditional entropy ( H ) of the output random variable Y with realization y given x ,

\(\mathrm{PE}({\boldsymbol{x}})=H(Y\mid {\boldsymbol{x}})=-{\sum }_{y}P(y\mid {\boldsymbol{x}})\log P(y\mid {\boldsymbol{x}}).\)

A low predictive entropy indicates an output distribution which is heavily concentrated whereas a high predictive entropy indicates that many possible outputs are similarly likely.

Aleatoric and epistemic uncertainty

We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information) 46 . Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential for our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens

Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s , conditioned on the context, x , is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is \(\log P({\bf{s}}\mid {\boldsymbol{x}})={\sum }_{i}\log P({s}_{i}\mid {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , where \({s}_{i}\) is the i-th output token and \({{\bf{s}}}_{ < i}\) denotes the set of previous tokens.

Length normalization

When comparing the log-probabilities of generated sequences, we use ‘length normalization’, that is, we use an arithmetic mean log-probability, \(\frac{1}{N}{\sum }_{i=1}^{N}\log P({s}_{i}\mid {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , instead of the sum. In expectation, longer sequences have lower joint likelihoods because of the conditional independence of the token probabilities 47 . The joint likelihood of a sequence of length N shrinks exponentially in N . Its negative log-probability therefore grows linearly in N , so longer sentences tend to contribute more to entropy. We therefore interpret length-normalizing the log-probabilities when estimating the entropy as asserting that the expected uncertainty of generations is independent of sentence length. Length normalization has some empirical success 48 , including in our own preliminary experiments, but little theoretical justification in the literature.
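A minimal sketch of this computation, assuming token_logprobs holds the per-token log-probabilities \(\log P({s}_{i}\mid {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) of one generation:

```python
import numpy as np

# Length-normalized sequence log-probability, as described above.
def sequence_logprob(token_logprobs, length_normalize=True):
    total = float(np.sum(token_logprobs))
    if length_normalize:
        return total / len(token_logprobs)  # arithmetic mean log-probability
    return total
```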

Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways for phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps:

Generation: sample output sequences of tokens from the predictive distribution of a LLM given a context x .

Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.

Entropy estimation: estimate semantic entropy by summing probabilities of sequences that share a meaning following equation ( 2 ) and compute their entropy.

Generating a set of answers from the model

Given some context x as input to the LLM, we sample M sequences, \({{\bf{s}}}^{(1)},\ldots ,{{\bf{s}}}^{(M)}\) , and record their token probabilities, \(P({{\bf{s}}}^{(1)}\mid {\boldsymbol{x}}),\ldots ,P({{\bf{s}}}^{(M)}\mid {\boldsymbol{x}})\) . We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling ( P  = 0.9) (ref. 49 ) and top- K sampling ( K  = 50) (ref. 50 ). We also sample a single generation at low temperature (0.1) as an estimate of the ‘best generation’ of the model to the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens.)
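This sampling configuration can be reproduced with the Hugging Face transformers generate API, roughly as below; the model id is the public LLaMA 2 Chat 7B checkpoint and is an assumption, not necessarily the exact setup used in the paper.

```python
# Hedged sketch of the sampling scheme: temperature 1 with nucleus (p = 0.9)
# and top-k (k = 50) sampling for the M entropy samples, and temperature 0.1
# for the 'best' answer. The model id is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = ("Answer the following question in a single brief but complete "
          "sentence. Question: What is the capital of France? Answer:")
inputs = tok(prompt, return_tensors="pt")

samples = model.generate(**inputs, do_sample=True, temperature=1.0,
                         top_p=0.9, top_k=50, num_return_sequences=10,
                         max_new_tokens=64)
best = model.generate(**inputs, do_sample=True, temperature=0.1,
                      max_new_tokens=64)
```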

Clustering by semantic equivalence

To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using ‘semantic equivalence’ which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be \({\mathcal{T}}\) . The space of all possible sequences of tokens of length N is then \({{\mathcal{S}}}_{N}\equiv {{\mathcal{T}}}^{N}\) . Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine and one of the tokens can be a ‘padding’ token which occurs with certainty for each token after the end-of-sequence token. For some sentence \({\bf{s}}\in {{\mathcal{S}}}_{N}\) , composed of a sequence of tokens, \({s}_{i}\in {\mathcal{T}}\) , there is an associated meaning. Theories of meaning are contested 51 . However, for specific models and deployment contexts many considerations can be set aside. Care should be taken comparing very different models and contexts.

Let us introduce a semantic equivalence relation, E (  ⋅  ,  ⋅  ), which holds for any two sentences that mean the same thing—we will operationalize this presently. Recall that an equivalence relation is any reflexive, symmetric and transitive relation and that any equivalence relation on a set corresponds to a set of equivalence classes. Each semantic equivalence class captures outputs that can be considered to express the same meaning. That is, for the space of semantic equivalence classes \({\mathcal{C}}\) the sentences in the set \(c\in {\mathcal{C}}\) can be regarded in many settings as expressing a similar meaning such that \(\forall {\bf{s}},{{\bf{s}}}^{{\prime} }\in c:E({\bf{s}},{{\bf{s}}}^{{\prime} })\) . So we can build up these classes of semantically equivalent sentences by checking if new sentences share a meaning with any sentences we have already clustered and, if so, adding them into that class.

We operationalize E (  ⋅  ,  ⋅  ) using the idea of bidirectional entailment, which has a long history in linguistics 52 and natural language processing 28 , 53 , 54 . A sequence, s , means the same thing as a second sequence, s ′, only if the sequences entail (that is, logically imply) each other. For example, ‘The capital of France is Paris’ entails ‘Paris is the capital of France’ and vice versa because they mean the same thing. (See later for a discussion of soft equivalence and cases in which bidirectional entailment does not guarantee equivalent meanings).

Importantly, we require that the sequences mean the same thing with respect to the context—key meaning is sometimes contained in the context. For example, ‘Paris’ does not entail ‘The capital of France is Paris’ because ‘Paris’ is not a declarative sentence without context. But in the context of the question ‘What is the capital of France?’, the one-word answer does entail the longer answer.

Detecting entailment has been the object of study of a great deal of research in NLI 55 . We rely on language models to predict entailment, such as DeBERTa-Large-MNLI 56 , which has been trained to predict entailment, or general-purpose LLMs such as GPT-3.5 (ref. 57 ), which can predict entailment given suitable prompts.

We then cluster sentences according to whether they bidirectionally entail each other using the algorithm presented in Extended Data Fig. 1 . Note that, to check if a sequence should be added to an existing cluster, it is sufficient to check if the sequence bidirectionally entails any of the existing sequences in that cluster (we arbitrarily pick the first one), given the transitivity of semantic equivalence. If a sequence does not share meaning with any existing cluster, we assign it its own cluster.
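A minimal sketch of this clustering loop, assuming an entails(context, a, b) predicate backed by an NLI model or prompted LLM (see ‘Entailment estimator’ below):

```python
# Bidirectional-entailment clustering, as described above. `entails` is an
# assumed predicate returning True if answer `a` entails answer `b` in context.
def cluster_by_meaning(answers, context, entails):
    clusters = []
    for s in answers:
        for cluster in clusters:
            rep = cluster[0]  # transitivity: comparing to one member suffices
            if entails(context, s, rep) and entails(context, rep, s):
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no shared meaning: start a new cluster
    return clusters
```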

Computing the semantic entropy

Having determined the classes of generated sequences that mean the same thing, we can estimate the likelihood that a sequence generated by the LLM belongs to a given class by computing the sum of the probabilities of all the possible sequences of tokens which can be considered to express the same meaning,

\(P(c\mid {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}P({\bf{s}}\mid {\boldsymbol{x}}).\)  (2)

Formally, this treats the output as a random variable whose event-space is the space of all possible meaning-classes, C , a sub- σ -algebra of the standard event-space S . We can then estimate the semantic entropy (SE) as the entropy over the meaning-distribution,

\(\mathrm{SE}({\boldsymbol{x}})=-{\sum }_{c}P(c\mid {\boldsymbol{x}})\log P(c\mid {\boldsymbol{x}}).\)  (3)

There is a complication which prevents direct computation: we do not have access to every possible meaning-class c . Instead, we can only sample c from the sequence-generating distribution induced by the model. To handle this, we estimate the expectation in equation ( 3 ) using a Rao–Blackwellized Monte Carlo integration over the semantic equivalence classes C ,

\(\mathrm{SE}({\boldsymbol{x}})\approx -{\sum }_{i=1}^{| C| }P({C}_{i}\mid {\boldsymbol{x}})\log P({C}_{i}\mid {\boldsymbol{x}}),\)  (5)

where \(P({C}_{i}\mid {\boldsymbol{x}})=\frac{P({c}_{i}\mid {\boldsymbol{x}})}{{\sum }_{c}P(c\mid {\boldsymbol{x}})}\) estimates a categorical distribution over the cluster meanings, that is, \({\sum }_{i}P({C}_{i}\mid {\boldsymbol{x}})=1\) . Without this normalization step cluster ‘probabilities’ could exceed one because of length normalization, resulting in degeneracies. Equation ( 5 ) is the estimator giving our main method that we refer to as semantic entropy throughout the text.

For scenarios in which the sequence probabilities are not available, we propose a variant of semantic entropy which we call ‘discrete’ semantic entropy. Discrete semantic entropy approximates P ( C i ∣ x ) directly from the number of generations in each cluster, disregarding the token probabilities. That is, we approximate P ( C i ∣ x ) as \(\frac{1}{M}{\sum }_{m=1}^{M}{{\mathbb{I}}}_{[{{\bf{s}}}^{(m)}\in {C}_{i}]}\) , the proportion of all the sampled answers which belong to that cluster. Effectively, this just assumes that each output that was actually generated was equally probable—estimating the underlying distribution as the categorical empirical distribution. In the limit of large M the estimator converges to equation ( 5 ) by the law of large numbers. We find that discrete semantic entropy results in similar performance empirically.
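A sketch of both estimators, under the assumption that the clustering step has already produced per-cluster probability mass (for the standard estimator) or per-cluster generation counts (for the discrete variant):

```python
import numpy as np

# cluster_probs[i]: unnormalized P(c_i | x), the summed (length-normalized)
# probabilities of the sequences in cluster i; cluster_sizes[i]: the number of
# sampled generations in cluster i.
def semantic_entropy(cluster_probs):
    p = np.asarray(cluster_probs, dtype=float)
    p = p / p.sum()  # the normalized categorical P(C_i | x) of equation (5)
    return float(-np.sum(p * np.log(p)))

def discrete_semantic_entropy(cluster_sizes):
    p = np.asarray(cluster_sizes, dtype=float)
    p = p / p.sum()  # empirical proportion of generations per cluster
    return float(-np.sum(p * np.log(p)))
```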

We provide a worked example of the computation of semantic entropy in Supplementary Note  1 .

Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to predict model accuracy, demonstrating that confabulations make up a notable fraction of model mistakes. We further show that semantic uncertainty can be used to improve model accuracy by refusing to answer questions when semantic uncertainty is high. Last, semantic uncertainty can be used to give users a way to know when model generations are probably unreliable.

We use the datasets BioASQ 34 , SQuAD 33 , TriviaQA 32 , SVAMP 37 and NQ-Open 35 . BioASQ is a life-sciences question-answering dataset based on the annual challenge of the same name. The specific dataset we use is based on the QA dataset from Task B of the 2023 BioASQ challenge (11B). SQuAD is a reading comprehension dataset whose context passages are drawn from Wikipedia and for which the answers to questions can be found in these passages. We use SQuAD 1.1 which excludes the unanswerable questions added in v.2.0 that are deliberately constructed to induce mistakes so they do not in practice cause confabulations to occur. TriviaQA is a trivia question-answering dataset. SVAMP is a word-problem maths dataset containing elementary-school mathematical reasoning tasks. NQ-Open is a dataset of realistic questions aggregated from Google Search which have been chosen to be answerable without reference to a source text. For each dataset, we use 400 train examples and 400 test examples randomly sampled from the original larger dataset. Note that only some of the methods require training, for example semantic entropy does not use the training data. If the datasets themselves are already split into train and test (or validation) samples, we sample our examples from within the corresponding split.

All these datasets are free-form, rather than multiple choice, because this better captures the opportunities created by LLMs to produce free-form sentences as answers. We refer to this default scenario as our ‘sentence-length’ experiments. In Supplementary Note  7 , we also present results for confabulation detection in a ‘short-phrase’ scenario, in which we constrain model answers on these datasets to be as concise as possible.

To make the problems more difficult and induce confabulations, we do not provide the context passages for any of the datasets. When the context passages are provided, the accuracy rate is too high for these datasets for the latest generations of models to meaningfully study confabulations.

For sentence-length generations we use: Falcon 39 Instruct (7B and 40B), LLaMA 2 Chat 38 (7B, 13B and 70B) and Mistral 40 Instruct (7B).

In addition to reporting results for semantic entropy, discrete semantic entropy and naive entropy, we consider two strong baselines.

Embedding regression is a supervised baseline inspired by the P (IK) method 24 . In that paper, the authors fine-tune their proprietary LLM on a dataset of questions to predict whether the model would have been correct. This requires access to a dataset of ground-truth answers to the questions. Rather than fine-tuning the entire LLM in this way, we simply take the final hidden units and train a logistic regression classifier to make the same prediction. By contrast to their method, this is much simpler because it does not require fine-tuning the entire language model, as well as being more reproducible because the solution to the logistic regression optimization problem is not as seed-dependent as the fine-tuning procedure. As expected, this supervised approach performs well in-distribution but fails when the distribution of questions is different from that on which the classifier is trained.

The second baseline we consider is the P (True) method 24 , in which the model first samples M answers (identically to our semantic entropy approach) and then is prompted with the list of all answers generated followed by the highest probability answer and a question whether this answer is “(a) True” or “(b) False”. The confidence score is then taken to be the probability with which the LLM responds with ‘a’ to the multiple-choice question. The performance of this method is boosted with a few-shot prompt, in which up to 20 examples from the training set are randomly chosen, filled in as above, but then provided with the actual ground truth of whether the proposed answer was true or false. In this way, the method can be considered as supervised ‘in-context’ because it makes use of some ground-truth training labels but can be used without retraining the model. Because of context-size constraints, this method cannot fit a full 20 few-shot examples in the context when input questions are long or large numbers of generations are used. As a result, we sometimes have to reduce the number of few-shot examples to suit the context size and we note this in the  Supplementary Material .

Entailment estimator

Any NLI classification system could be used for our bidirectional entailment clustering algorithm. We consider two different kinds of entailment detector.

One option is to use an instruction-tuned LLM such as LLaMA 2, GPT-3.5 (Turbo 1106) or GPT-4 to predict entailment between generations. We use the following prompt:

We are evaluating answers to the question {question} Here are two possible answers: Possible Answer 1: {text1} Possible Answer 2: {text2} Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.

Alternatively, we consider using a language model trained for entailment prediction, specifically the DeBERTa-large model 56 fine-tuned on the NLI dataset MNLI 58 . This builds on past work towards paraphrase identification based on embedding similarity 59 , 60 and BERT-style models 61 , 62 . We template more simply, checking if DeBERTa predicts entailment between the concatenation of the question and one answer and the concatenation of the question and another answer. Note that DeBERTa-large is a relatively lightweight model with only 1.5B parameters which is much less powerful than most of the LLMs under study.
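An entailment check with this model might look roughly as follows; the Hugging Face model id and label names are those of the public microsoft/deberta-large-mnli checkpoint and are assumptions about the exact setup:

```python
# Hedged sketch: NLI-based entailment with a DeBERTa MNLI checkpoint.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entails(question, answer_a, answer_b):
    # Concatenate the question with each answer, as described in the text.
    premise = f"{question} {answer_a}"
    hypothesis = f"{question} {answer_b}"
    result = nli({"text": premise, "text_pair": hypothesis})
    return result["label"] == "ENTAILMENT"  # labels per the public checkpoint
```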

In Supplementary Note 2 , we carefully evaluate the benefits and drawbacks of these methods for entailment prediction. We settle on using GPT-3.5 with the above prompt, as its entailment predictions agree well with human raters and lead to good confabulation detection performance.

In Supplementary Note  3 , we provide a discussion of the computational cost and choosing the number of generations for reliable clustering.

Prompting templates

We use a simple generation template for all sentence-length answer datasets:

Answer the following question in a single brief but complete sentence. Question: {question} Answer:

Metrics and accuracy measurements

We use three main metrics to evaluate our method: AUROC, rejection accuracy and AURAC. Each of these is grounded in an automated factuality estimation measurement relative to the reference answers provided by the datasets that we use.

AUROC, rejection accuracy and AURAC

First, we use the AUROC curve, which measures the reliability of a classifier accounting for both precision and recall. The AUROC can be interpreted as the probability that a randomly chosen correct answer has been assigned a higher confidence score than a randomly chosen incorrect answer. For a perfect classifier, this is 1.

Second, we compute the ‘rejection accuracy at X %’, which is the question-answering accuracy of the model on the most-confident X % of the inputs as identified by the respective uncertainty method. If an uncertainty method works well, predictions on the confident subset should be more accurate than predictions on the excluded subset and the rejection accuracy should increase as we reject more inputs.

To summarize this statistic we compute the AURAC—the total area enclosed by the accuracies at all cut-off percentages X %. This should increase towards 1 as a given uncertainty method becomes more accurate and better at detecting likely-inaccurate responses but it is more sensitive to the overall accuracy of the model than the AUROC metric.
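A sketch of these two quantities, assuming scores are per-question confabulation scores (higher means more likely to confabulate) and correct flags whether each answer was right:

```python
import numpy as np

def rejection_accuracy(scores, correct, answer_fraction):
    order = np.argsort(scores)  # most confident (lowest score) first
    keep = max(1, int(round(answer_fraction * len(order))))
    return float(np.asarray(correct, dtype=float)[order][:keep].mean())

def aurac(scores, correct, n_thresholds=100):
    # Area under the rejection-accuracy curve, averaged over cut-offs.
    fractions = np.linspace(1.0 / n_thresholds, 1.0, n_thresholds)
    return float(np.mean([rejection_accuracy(scores, correct, f)
                          for f in fractions]))
```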

In Supplementary Note  5 , we provide the unaggregated rejection accuracies for sentence-length generations.

Assessing accuracy

For the short-phrase-length generation setting presented in Supplementary Note  7 , we simply assess the accuracy of the generations by checking if the F1 score of the commonly used SQuAD metric exceeds 0.5. There are limitations to such simple scoring rules 63 but this method is widely used in practice and its error is comparatively small on these standard datasets.
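A simplified sketch of the token-overlap F1 behind the SQuAD metric (the official implementation additionally strips punctuation and articles before comparing):

```python
from collections import Counter

def squad_f1(prediction, reference):
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# An answer counts as correct when F1 exceeds 0.5:
# is_correct = squad_f1(model_answer, reference_answer) > 0.5
```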

For our default scenario, the longer sentence-length generations, this measure fails, as the overlap between the short reference answer and our long model answer is invariably too small. For sentence-length generations, we therefore automatically determine whether an answer to the question is correct or incorrect by using GPT-4 to compare the given answer to the reference answer. We use the template:

We are assessing the quality of answers to the following question: {question} The expected answer is: {reference answer} The proposed answer is: {predicted answer} Within the context of the question, does the proposed answer mean the same as the expected answer? Respond only with yes or no.

We make a small modification for datasets with several reference answers: line two becomes “The following are expected answers to this question:” and the final line asks “does the proposed answer mean the same as any of the expected answers?”.

In Supplementary Note 6 , we check the quality of our automated ground-truth evaluations against human judgement by hand. We find that GPT-4 gives the best results for determining model accuracy and thus use it in all our sentence-length experiments.

In this section we describe the application of semantic entropy to confabulation detection in longer model generations, specifically paragraph-length biographies.

We introduce a biography-generation dataset—FactualBio—available alongside this paper. FactualBio is a collection of biographies of individuals who are notable enough to have Wikipedia pages but not notable enough to have large amounts of detailed coverage, generated by GPT-4 (v.0613). To generate the dataset, we randomly sampled 21 individuals from the WikiBio dataset 64 . For each biography, we generated a list of factual claims contained in each biography using GPT-4, with 150 total factual claims (the total number is only coincidentally a round number). For each of these factual claims, we manually determined whether the claim was correct or incorrect. Out of 150 claims, 45 were incorrect. As before, we apply confabulation detection to detect incorrect model predictions, even though there may be model errors which are not confabulations.

Prompting and generation

Given a paragraph-length piece of LLM-generated text, we apply the following sequence of steps (a code sketch follows the list):

Automatically decompose the paragraph into specific factual claims using an LLM (not necessarily the same as the original).

For each factual claim, use an LLM to automatically construct Q questions which might have produced that claim.

For each question, prompt the original LLM to generate M answers.

For each question, compute the semantic entropy of the answers, including the original factual claim.

Average the semantic entropies over the questions to arrive at a score for the original factual claim.
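A minimal sketch of the whole pipeline; decompose_claims, questions_for, sample_answers and semantic_entropy_over are assumed helpers wrapping the LLM prompts below and the entropy estimator described earlier:

```python
# Paragraph-level confabulation scoring, following the five steps above.
def score_paragraph(paragraph, llm, n_questions=6, n_answers=3):
    claim_scores = {}
    for claim in decompose_claims(paragraph, llm):                      # step 1
        entropies = []
        for q in questions_for(claim, paragraph, llm, n=n_questions):   # step 2
            answers = sample_answers(q, paragraph, llm, n=n_answers)    # step 3
            answers.append(claim)  # keep the original claim in the cluster set
            entropies.append(semantic_entropy_over(q, answers))         # step 4
        claim_scores[claim] = sum(entropies) / len(entropies)           # step 5
    return claim_scores
```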

We pursue this slightly indirect way of generating answers because we find that simply resampling each sentence creates variation unrelated to the uncertainty of the model about the factual claim, such as differences in paragraph structure.

We decompose the paragraph into factual claims using the following prompt:

Please list the specific factual propositions included in the answer above. Be complete and do not leave any factual claims out. Provide each claim as a separate sentence in a separate bullet point.

We found that we agreed with the decompositions in all cases in the dataset.

We then generate six questions for each of the facts from the decomposition. We generate these questions by prompting the model twice with the following:

Following this text: {text so far} You see the sentence: {proposition} Generate a list of three questions, that might have generated the sentence in the context of the preceding original text, as well as their answers. Please do not use specific facts that appear in the follow-up sentence when formulating the question. Make the questions and answers diverse. Avoid yes-no questions. The answers should not be a full sentence and as short as possible, e.g. only a name, place, or thing. Use the format “1. {question} – {answer}”.

These questions are not necessarily well-targeted and the difficulty of this step is the main source of errors in the procedure. We generate three questions with each prompt, as this encourages diversity of the questions, each question targeting a different aspect of the fact. However, we observed that the generated questions will sometimes miss obvious aspects of the fact. Executing the above prompt twice (for a total of six questions) can improve coverage. We also ask for brief answers because the current version of GPT-4 tends to give long, convoluted and highly hedged answers unless explicitly told not to.

Then, for each question, we generate three new answers using the following prompt:

We are writing an answer to the question “{user question}”. So far we have written: {text so far} The next sentence should be the answer to the following question: {question} Please answer this question. Do not answer in a full sentence. Answer with as few words as possible, e.g. only a name, place, or thing.

We then compute the semantic entropy over these answers plus the original factual claim. Including the original fact ensures that the estimator remains grounded in the original claim and helps detect situations in which the question has been interpreted completely differently from the original context. We make a small modification to handle the fact that GPT-4 generations often include refusals to answer questions. These refusals were not something we commonly observe in our experiments with LLaMA 2, Falcon or Mistral models. If more than half of the answers include one of the strings ‘not available’, ‘not provided’, ‘unknown’ or ‘unclear’ then we treat the semantic uncertainty as maximal.
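The refusal rule can be expressed directly; max_entropy here is an assumed ceiling value (for example, the entropy of a uniform distribution over the sampled answers):

```python
# Treat semantic uncertainty as maximal when most answers are refusals.
REFUSAL_MARKERS = ("not available", "not provided", "unknown", "unclear")

def is_refusal(answer):
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def entropy_with_refusal_guard(answers, entropy, max_entropy):
    if sum(is_refusal(a) for a in answers) > len(answers) / 2:
        return max_entropy
    return entropy
```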

We then average the semantic entropies for each question corresponding to the factual claim to get an entropy for this factual claim.

Despite the extra assumptions and complexity, we find that this method greatly outperforms the baselines.

To compute semantic entailment between the original claim and regenerated answers, we rely on the DeBERTa entailment prediction model as we find empirically that DeBERTa predictions result in higher train-set AUROC than other methods. Because DeBERTa has slightly lower recall than GPT-3.5/4, we use a modified set-up for which we say the answers mean the same as each other if at least one of them entails the other and neither is seen to contradict the other—a kind of ‘non-defeating’ bidirectional entailment check rather than true bidirectional entailment. The good performance of DeBERTa in this scenario is not surprising as both factual claims and regenerated answers are relatively short. We refer to Supplementary Notes 2 and 3 for ablations and experiments regarding our choice of entailment estimator for paragraph-length generations.

We implement two baselines. First, we implement a variant of the P (True) method, which is adapted to the new setting. For each factoid, we generate a question with answers in the same way as for semantic entropy. We then use the following prompt:

Question: {question} Here are some brainstormed ideas: {list of regenerated answers} Possible answer: {original answer} Is the possible answer true? Respond with “yes” or “no”.

As we cannot access the probabilities GPT-4 assigns to predicting ‘yes’ and ‘no’ as the next token, we approximate this using Monte Carlo samples. Concretely, we execute the above prompt ten times (at temperature 1) and then take the fraction of answers which was ‘yes’ as our unbiased Monte Carlo estimate of the token probability GPT-4 assigns to ‘yes’.
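In code, the approximation is just a vote fraction; ask_yes_no is an assumed helper that submits the prompt at temperature 1 and returns 'yes' or 'no':

```python
# Unbiased Monte Carlo estimate of the probability the model assigns to 'yes'.
def p_true_monte_carlo(prompt, ask_yes_no, n=10):
    votes = [ask_yes_no(prompt) for _ in range(n)]
    return sum(v == "yes" for v in votes) / n
```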

As a second, simpler, baseline we check if the model thinks the answer is true. We simply ask:

Following this text: {text so far} You see this statement: {proposition} Is it likely that the statement is true? Respond with ‘yes’ or ‘no’.

It is interesting that this method ought to perform very well if we think that the model has good ‘self-knowledge’ (that is, if “models mostly know what they don’t know” 24 ) but in fact semantic entropy is much better at detecting confabulations.

Data availability

The data used for the short-phrase and sentence-length generations are publicly available and the released code details how to access it. We release a public version of the FactualBio dataset as part of the code base for reproducing the paragraph-length experiments.

Code availability

We release all code used to produce the main experiments. The code for short-phrase and sentence-length experiments can be found at github.com/jlko/semantic_uncertainty and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ). The code for paragraph-length experiments can be found at github.com/jlko/long_hallucinations and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ).

OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Gemini Team, Google. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 Jun 2023).

Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng . 146 , 102182 (2023).

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307 , e230163 (2023).


Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55 , 248 (2023).

Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).

Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).

Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7 , 225–241 (1998).


Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).

Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).

Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9 , 962–977 (2021).


Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).

Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).

Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10, 680–696 (2022).

Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition 83–88 (IEEE, Catalogue no. PR00580, 2002).

Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).

Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67, 68–79 (2024).

MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4, 590–604 (1992).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).

Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27, 986–1005 (1956).

Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).

Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit.

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).

Honovich, O. et al. TRUE: Re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).

Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).

Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10, 163–177 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics, 2017).

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).

Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (2015).

Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).

Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7, 452–466 (2019).

Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Assoc. Comp. Linguistics, 2021).

Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023).

Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).

Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Assoc. Comp. Linguistics, 2023).

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).

Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).

Barnes, B. & Christiano, P. Progress on AI Safety via Debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).

Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Der Kiureghian, A. & Ditlevsen, O. Aleatory or epistemic? Does it matter? Struct. Saf. 31, 105–112 (2009).

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=jN5y-zb5Q7m (2021).

Murray, K. & Chiang, D. Correcting length bias in neural machine translation. In Proc. Third Conference on Machine Translation (eds Bojar, O. et al.) 212–223 (Assoc. Comp. Linguistics, 2018).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=rygGQyrFvH (2020).

Fan, A., Lewis, M. & Dauphin, Y. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

Speaks, J. in The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford Univ., 2021).

Culicover, P. W. Paraphrase generation and information retrieval from stored text. Mech. Transl. Comput. Linguist. 11, 78–88 (1968).

Padó, S., Cer, D., Galley, M., Jurafsky, D. & Manning, C. D. Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach. Transl. 23, 181–193 (2009).

Androutsopoulos, I. & Malakasiotis, P. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38, 135–187 (2010).

MacCartney, B. Natural Language Inference (Stanford Univ., 2009).

He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

Williams, A., Nangia, N. & Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M. et al.) 1112–1122 (Assoc. Comp. Linguistics, 2018).

Yu, L., Hermann, K. M., Blunsom, P. & Pulman, S. Deep learning for answer sentence selection. Preprint at https://arxiv.org/abs/1412.1632 (2014).

Socher, R., Huang, E., Pennin, J., Manning, C. D. & Ng, A. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th Conference on Neural Information Processing Systems (eds Shawe-Taylor, J. et al.) (2011).

He, R., Ravula, A., Kanagal, B. & Ainslie, J. Realformer: Transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zhong, C. et al.) 929–943 (Assoc. Comp. Linguistics, 2021).

Tay, Y. et al. Charformer: fast character transformers via gradient-based subword tokenization. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=JtBRnrlOEFN (2022).

Kane, H., Kocyigit, Y., Abdalla, A., Ajanoh, P. & Coulibali, M. Towards neural similarity evaluators. In Workshop on Document Intelligence at the 32nd Conference on Neural Information Processing Systems (2019).

Lebret, R., Grangier, D. & Auli, M. Neural text generation from structured data with application to the biography domain. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1203–1213 (Association for Computational Linguistics, 2016).

Kossen, J. jlko/semantic_uncertainty: Initial release v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.10964366 (2024).

Acknowledgements

We thank G. Irving, K. Perlin, J. Richens, L. Rimell and M. Turpin for their comments or discussion related to this work. We thank K. Handa for his help with the human evaluation of our automated accuracy assessment. We thank F. Bickford Smith and L. Melo for their code review. Y.G. is supported by a Turing AI Fellowship funded by the UK government’s Office for AI, through UK Research and Innovation (grant reference EP/V030302/1), and delivered by the Alan Turing Institute.

Author information

These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn

Authors and Affiliations

OATML, Department of Computer Science, University of Oxford, Oxford, UK

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal

Contributions

S.F. led the work from conception to completion and proposed using bidirectional entailment to cluster generations as a way of computing entropy in LLMs. He wrote the main text, most of the Methods and Supplementary Information and prepared most of the figures. J.K. improved the mathematical formalization of semantic entropy; led the extension of semantic entropy to sentence- and paragraph-length generations; wrote the code for, and carried out, all the experiments and evaluations; wrote much of the Methods and Supplementary Information and prepared drafts of many figures; and gave critical feedback on the main text. L.K. developed the initial mathematical formalization of semantic entropy; wrote code for, and carried out, the initial experiments around semantic entropy and its variants which demonstrated the promise of the idea and helped narrow down possible research avenues to explore; and gave critical feedback on the main text. Y.G. ideated the project, proposing the idea to differentiate semantic and syntactic diversity as a tool for detecting hallucinations, provided high-level guidance on the research and gave critical feedback on the main text; he runs the research laboratory in which the work was carried out.

Corresponding author

Correspondence to Sebastian Farquhar.

Ethics declarations

Competing interests.

S.F. is currently employed by Google DeepMind and L.K. by OpenAI. For both, this paper was written under their University of Oxford affiliation. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Mirella Lapata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1: Algorithm outline for bidirectional entailment clustering.

Given a set of outputs in response to a context, the bidirectional entailment algorithm returns a set of sets of outputs which have been classified as sharing a meaning.
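
For readers who want the outline in code, below is a minimal Python sketch of this clustering loop. The `entails(a, b)` predicate is an assumed stand-in for an entailment check (in practice a natural language inference model would supply it), and comparing against a single representative per cluster is a simplification of the outline, not the paper's exact implementation.

```python
from typing import Callable, List

def bidirectional_entailment_clusters(
    outputs: List[str],
    entails: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily group generations that mutually entail each other.

    Each output is compared with a representative of every existing
    cluster; it joins the first cluster where entailment holds in both
    directions, otherwise it starts a new semantic cluster.
    """
    clusters: List[List[str]] = []
    for text in outputs:
        for cluster in clusters:
            rep = cluster[0]
            if entails(text, rep) and entails(rep, text):
                cluster.append(text)
                break
        else:  # no existing cluster shares this meaning
            clusters.append([text])
    return clusters

# Toy entailment check (case-insensitive equality), for demonstration only.
toy_entails = lambda a, b: a.casefold() == b.casefold()
print(bidirectional_entailment_clusters(
    ["Paris is the capital.", "paris is the capital.", "It is Lyon."],
    toy_entails,
))  # [['Paris is the capital.', 'paris is the capital.'], ['It is Lyon.']]
```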

Supplementary information

Supplementary Notes 1–7, Figs. 1–10, Tables 1–4 and references. Includes a worked example for the semantic entropy calculation, a discussion of limitations and computational cost of entailment clustering, an ablation of entailment prediction and clustering methods, a discussion of automated accuracy assessment, unaggregated results for sentence-length generations and further results for short-phrase generations.
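
To sketch the kind of worked example referred to above: once generations are clustered by meaning, semantic entropy is the Shannon entropy of the probability mass assigned to each semantic cluster. The snippet below is a simplified discrete version under that assumption, not the paper's full estimator.

```python
import math
from collections import defaultdict

def semantic_entropy(sequence_probs, cluster_ids):
    """Entropy over semantic clusters rather than individual strings.

    sequence_probs: probability (or normalized score) of each generation.
    cluster_ids: semantic cluster index of each generation, e.g. from
    bidirectional entailment clustering.
    """
    mass = defaultdict(float)
    for p, c in zip(sequence_probs, cluster_ids):
        mass[c] += p
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())

# Three generations, two meanings: the clusters carry mass 0.8 and 0.2.
print(semantic_entropy([0.5, 0.3, 0.2], [0, 0, 1]))  # ~0.500 nats
```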

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article.

Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0

Received: 17 July 2023

Accepted: 12 April 2024

Published: 19 June 2024

Issue Date: 20 June 2024

DOI: https://doi.org/10.1038/s41586-024-07421-0

Title: Research on Driver Facial Fatigue Detection Based on YOLOv8 Model

Abstract: In a society where traffic accidents frequently occur, fatigue driving has emerged as a grave issue. Fatigue-driving detection technology, especially approaches based on the YOLOv8 deep learning model, has seen extensive research and application as an effective preventive measure. This paper discusses in depth the methods and technologies utilized in the YOLOv8 model to detect driver fatigue, elaborates on the current state of research both domestically and internationally, and systematically introduces the processing methods and algorithm principles for various datasets. This study aims to provide a robust technical solution for preventing and detecting fatigue driving, thereby contributing significantly to reducing traffic accidents and safeguarding lives.
Comments: Accepted by the 5th International Conference on Information Science, Parallel and Distributed Systems (ISPDS 2024), 2024 IEEE
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: [cs.CV]
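
To make the detection step concrete, the following is a minimal, illustrative sketch of running a YOLOv8 detector with the ultralytics package. The checkpoint name, class labels and image path are assumptions for illustration; the paper's trained fatigue model is not released with this abstract.

```python
# Minimal sketch, assuming the ultralytics package is installed.
# "yolov8n.pt" is a stock checkpoint; a real fatigue detector would be
# fine-tuned on fatigue cues (closed eyes, yawning, head pose) first.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Run inference on a driver-facing camera frame (placeholder path).
results = model("driver_frame.jpg", conf=0.5)

for result in results:
    for box in result.boxes:
        label = model.names[int(box.cls)]
        print(f"{label}: conf={float(box.conf):.2f}, box={box.xyxy.tolist()}")
```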

COMMENTS

  1. A systematic literature review on Windows malware detection: Techniques

    The results are based on the selected papers published between 2009 and 2022, and our research findings are presented in Sections 5 Types of malware detection and their deployment methods (RQ1), 6 Benchmark datasets and extracted features (RQ2), 7 ML-based malware detection techniques (RQ3), 8 DL-based malware detection techniques (RQ4), 9 ...

  2. Malware Detection with Artificial Intelligence: A Systematic Literature

    In this survey, we review the key developments in the field of malware detection using AI and analyze core challenges. We systematically survey state-of-the-art methods across five critical aspects of building an accurate and robust AI-powered malware-detection model: malware sophistication, analysis techniques, malware repositories, feature selection, and machine learning vs. deep learning.

  3. Malware Detection and Analysis: Challenges and Research Opportunities

    The main contributions of this paper are: (1) providing a summary of the current challenges related to the malware detection approaches in data mining, (2) presenting a systematic and categorized ...

  4. The rise of machine learning for detection and classification of

There is some research discussing malware detection methods, but we consider it incomplete (the reader is referred to Section 2). To complement the papers surveyed and mitigate some flaws in the literature, this paper presents a systematic review of traditional and state-of-the-art machine-learning-powered techniques for malware detection ...

  5. Symmetry

    In this research paper, we present a protective mechanism that evaluates three ML algorithm approaches to malware detection and chooses the most appropriate one. According to statistics, the decision tree approach has the maximum detection accuracy (99.01%) and the lowest false positive rate (FPR; 0.021%) on a small dataset.

  6. Artificial Intelligence-Based Malware Detection, Analysis, and ...

    The paper is organized as follows: Section 2 describes the importance of moving from classical malware detection and analysis to smart and autonomous detection/analysis through the incorporation of advanced AI techniques. Moreover, we provide a classification of modern malware based on famous samples and high-profile cases.

  7. A state-of-the-art survey of malware detection ...

    This paper presents a systematic and detailed survey of the malware detection mechanisms using data mining techniques. In addition, it classifies the malware detection approaches in two main categories including signature-based methods and behavior-based detection. ... Also, Fig. 6 shows the main case study diagram of each research in malware ...

  8. [2101.08429] Malware Detection and Analysis: Challenges and Research

    However, several pressing issues (e.g., unknown malware samples detection) still need to be addressed adequately. This article first presents a concise overview of malware along with anti-malware and then summarizes various research challenges. This is a theoretical and perspective article that is hoped to complement earlier articles and works.

  9. A comprehensive survey on deep learning based malware detection

    This paper presents a systematic review of malware detection using Deep Learning techniques. On the basis of the evolution towards Deep Learning-based techniques, research taxonomy is proposed. Recent techniques for detecting malware on Android, iOS, IoT, Windows, APTs, and Ransomware are also explored and compared.

  10. A Recent Research on Malware Detection Using Machine ...

This paper is devoted to reviewing the most up-to-date research works from 2017 to 2021 on malware detection, where machine learning algorithms including K-Means, Decision Tree, Meta-Heuristic, Naïve Bayes, Neuro-fuzzy, Bayesian, Gaussian, Support Vector Machine (SVM), K-Nearest Neighbour (KNN) and n-grams were identified using a systematic ... (a byte n-gram feature sketch is given after this list)

  11. Malware Detection Issues, Future Trends and Challenges: A Survey

    The emergence of new types of malware, such as file-less malware, is also discussed, along with the need for real-time detection and response. The research methodology used in this paper is presented, which includes a literature review of recent papers on the topic, keyword searches, and analysis and representation methods used in each study.

  12. (PDF) Malware Detection using Machine Learning

    The main contributions of this paper are: (1) providing a summary of the current challenges related to the malware detection approaches in data mining, (2) presenting a systematic and categorized ...

  13. PDF Malware Detection Issues, Challenges, and Future Directions: A Survey

… stop at generic detection approaches like signature and behavioral. These taxonomies also overlook data representation methods used by malware analysis and detection research. Additionally, the previous review papers relate the feature extraction to the analysis phase. This does not hold, as the outcome from such a phase is the raw data, from ...

  14. Malware Detection

    91 papers with code • 2 benchmarks • 4 datasets. Malware Detection is a significant part of endpoint security including workstations, servers, cloud instances, and mobile devices. Malware Detection is used to detect and identify malicious activities caused by malware. With the increase in the variety of malware activities on CMS based ...

  15. Malware Detection and Analysis: Challenges and Research Opportunities

Roughly 360,000 novel malware samples hit the scene daily [4]. As anti-malware becomes more avant-garde, so does malware in the wild, escalating the arms race between malware guardians and writers. The quest for scalable and robust automated malware-detection frameworks still has a long way to go. This article presents an overview of malware ...

  16. Malware Analysis and Detection Using Machine Learning Algorithms

The performance of the DT, CNN and SVM algorithms in detecting malware at a small FPR (DT = 2.01%, CNN = 3.97% and SVM = 4.63%) on a given dataset was compared. In this experiment, we evaluated and quantified ... (a minimal classifier-comparison sketch is given after this list)

  17. A Survey on Malware Detection Technology and Future Trends

    Malware has become a serious threat to the internet. Their numbers are constantly increasing, and the level of complexity is rising. This paper aims to conduct a systematic survey on the development of malware detection technology. The main contributions of this paper are: 1) Describing in detail the state-of-the-art of malware detection methods, 2) Exploring the challenges and limitations of ...

  18. Techniques of Malware Detection: Research Review

    Analysis, and detection of malicious software play a crucial role in computer security. Signature-based malware detection methods were a classical solution in this area. However, malware creators are able to bypass these detection methods using some obfuscation methods like metamorphism, polymorphism. To address this issue, methods based on machine learning have been applied. However, some ...

  19. Malware classification and composition analysis: A survey of recent

    Section 7 suggests possible research topics in malware analysis. ... Souri and Hosseini [32] also provide a taxonomy of AI-driven malware detection techniques. Our paper looks at a larger range of articles by including many works on malware classification and composition analysis. We also include other works related to non-AI-driven ...

  20. Evaluation of Machine Learning Algorithms for Malware Detection

This research study mainly focused on dynamic malware detection. Malware changes progressively, which motivated the use of dynamic malware-detection techniques in this study. Each day brings a new influx of malicious software programmes that pose a threat to online safety by exploiting vulnerabilities in the Internet. The proliferation ...

  21. Malware Detection and Analysis Using Reverse Engineering

    This research paper presents a comprehensive exploration of the role of reverse engineering in the domain of malware detection and analysis, delves into the fundamental stages of the reverse engineering process, encompassing code disassembly, static analysis, and dynamic analysis. The pervasive and persistent nature of malware in the contemporary digital realm demands sophisticated ...

  22. Detecting hallucinations in large language models using ...

    Hallucinations (confabulations) in large language model systems can be tackled by measuring uncertainty about the meanings of generated responses rather than the text itself to improve ...

  23. (PDF) Malware Detection and Prevention using Artificial Intelligence

2021 IEEE International Conference on Big Data (Big Data), ©2021 IEEE. Malware Detection and Prevention using Artificial Intelligence Techniques. Md Jobair ...

  24. PDF LLM Critics Help Catch LLM Bugs

determined the final configuration of many of the experiments in the paper and produced the diagrams and a large fraction of all plots in addition to working on the manuscript. • Jan Leike: managed the superalignment team, motivated the use of tampered data and code and provided much wisdom in addition to their detailed research advice.

  25. Research on Driver Facial Fatigue Detection Based on Yolov8 Model

    In a society where traffic accidents frequently occur, fatigue driving has emerged as a grave issue. Fatigue driving detection technology, especially those based on the YOLOv8 deep learning model, has seen extensive research and application as an effective preventive measure. This paper discusses in depth the methods and technologies utilized in the YOLOv8 model to detect driver fatigue ...

  26. New 'Snowblind' Banking Malware Targets Android Users With Linux Kernel

    A new strain of banking malware dubbed "Snowblind" that affects Android mobile devices has been targeting users to swipe their banking credentials this year, cybersecurity firm Promon has found.

  27. (PDF) Malware detection using machine learning

Computer Science and Information Technology, pp. 735–741, ISBN 978-83-60810-22-4, ISSN 1896-7094. Malware Detection Using Machine Learning. Dragoş Gavriluţ, Mihai Cimpoeşu, Dan ...

  28. Ransomware: Recent advances, analysis, challenges and future research

Malware analysis is a standard approach to understand the components and behaviour of malware, ransomware included. This analysis is useful to detect malware attacks and prevent similar attacks in the future. Malware analysis is broadly categorized into static and dynamic analysis.

  29. Agriculture

    Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for future research directions and describes possible research applications. ... "Research on the Detection ...
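
Referring back to item 10, byte n-grams are among the simplest static features consumed by the classifiers those surveys cover. The sketch below is illustrative only; the file path, n-gram size and top-k cutoff are assumptions, not details taken from any of the papers above.

```python
from collections import Counter
from pathlib import Path

def byte_ngrams(path: str, n: int = 2, top_k: int = 10):
    """Count byte n-grams in a binary and return the top_k most common.

    Byte n-grams are a classic static feature for malware classifiers:
    no execution is needed, only the raw bytes of the file.
    """
    data = Path(path).read_bytes()
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return grams.most_common(top_k)

# Illustrative usage; "sample.exe" is a placeholder path.
for gram, count in byte_ngrams("sample.exe", n=2):
    print(gram.hex(), count)
```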
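And for item 16's comparison of classifiers by false positive rate, here is a minimal, self-contained sketch of how such an FPR comparison is typically computed. The data are synthetic (scikit-learn's make_classification), standing in for a labelled malware/benign feature matrix, so the numbers will not match the paper's.

```python
# Minimal sketch: compare classifiers for malware detection by FPR.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for extracted static/dynamic malware features.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("DT", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC(random_state=0))]:
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(f"{name}: FPR = {fp / (fp + tn):.4f}")  # benign flagged as malware
```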