• Survey paper
  • Open access
  • Published: 26 November 2016

Big data privacy: a technological perspective and review

  • Priyank Jain   ORCID: orcid.org/0000-0001-7988-1338 1 ,
  • Manasi Gyanchandani 1 &
  • Nilay Khare 1  

Journal of Big Data volume 3, Article number: 25 (2016)


Big data is a term used for very large data sets that have a more varied and complex structure. These characteristics usually correlate with additional difficulties in storing and analyzing the data and in applying further procedures or extracting results. Big data analytics is the term used to describe the process of researching massive amounts of complex data in order to reveal hidden patterns or identify hidden correlations. However, there is an obvious tension between the widespread use of big data and its security and privacy. This paper focuses on privacy and security concerns in big data, differentiates between privacy and security, and discusses privacy requirements in big data. It covers uses of privacy protection through existing methods such as HybrEx, k-anonymity, T-closeness and L-diversity, and their implementation in business. A number of privacy-preserving mechanisms have been developed for privacy protection at the different stages of the big data life cycle (for example, data generation, data storage, and data processing). The goal of this paper is to provide a comprehensive review of the privacy preservation mechanisms in big data and to present the challenges facing existing mechanisms. This paper also presents recent techniques of privacy preserving in big data, such as hiding a needle in a haystack, identity based anonymization, differential privacy, privacy-preserving big data publishing, and fast anonymization of big data streams, and it addresses privacy and security aspects of healthcare in big data. A comparative study between various recent techniques of big data privacy is provided as well.

Big data [ 1 , 2 ] specifically refers to data sets that are so large or complex that traditional data processing applications are not sufficient. It is the large volume of data—both structured and unstructured—that inundates a business on a day-to-day basis. Due to recent technological developments, the amount of data generated by the internet, social networking sites, sensor networks, healthcare applications, and many other companies is drastically increasing day by day. The enormous amount of data produced from various sources in multiple formats at very high speed [ 3 ] is referred to as big data. The term big data [ 4 , 5 ] is defined as “a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and analysis”. On the premise of this definition, the properties of big data are reflected by the 3V’s: volume, velocity and variety. Later studies pointed out that the definition of 3Vs is insufficient to explain the big data we face now. Thus, veracity, validity, value, variability, venue, vocabulary, and vagueness were added to complement the explanation of big data [ 6 ]. A common theme of big data is that the data are diverse, i.e., they may contain text, audio, image, or video etc. This diversity of data is signified by variety. In order to ensure big data privacy, various mechanisms have been developed in recent years. These mechanisms can be grouped based on the stages of the big data life cycle (Fig.  1 ) [ 7 ], i.e., data generation, storage, and processing. In the data generation phase, access restriction as well as falsifying-data techniques are used for the protection of privacy. The approaches to privacy protection in the data storage phase are chiefly based on encryption procedures.
Encryption based techniques can be further divided into Identity Based Encryption (IBE), Attribute Based Encryption (ABE) and storage path encryption. In addition, to protect sensitive information, hybrid clouds are utilized, where sensitive data are stored in the private cloud. The data processing phase incorporates Privacy Preserving Data Publishing (PPDP) and knowledge extraction from the data. In PPDP, anonymization techniques such as generalization and suppression are utilized to protect the privacy of data. These mechanisms can be further divided into clustering, classification and association rule mining based techniques. While clustering and classification split the input data into various groups, association rule mining based techniques find useful relationships and trends in the input data [ 8 ]. To handle the diverse dimensions of big data in terms of volume, velocity, and variety, there is a need to design efficient and effective frameworks to process massive amounts of data arriving at very high speed from various sources. Big data passes through multiple phases during its life cycle.

Big data life cycle: the stages of the big data life cycle, i.e., data generation, storage, and processing, are shown

As of 2012, 2.5 quintillion bytes of data are created daily. The volumes of data are vast, the generation speed of data is fast and the data/information space is global [ 9 ]. Lightweight incremental algorithms should be considered that are capable of achieving robustness, high accuracy and minimum pre-processing latency. For example, in the case of mining, a lightweight feature selection method using Swarm Search and Accelerated PSO can be used in place of the traditional classification methods [ 10 ]. Further ahead, the Internet of Things (IoT) would lead to the connection of all the things that people care about in the world, due to which much more data would be produced than nowadays [ 11 ]. Indeed, IoT is one of the major driving forces for big data analytics [ 9 ].

In today’s digital world, where lots of information is stored in big data stores, the analysis of these databases can provide opportunities to solve big problems of society, such as healthcare. Smart energy big data analytics is also a very complex and challenging topic that shares many common issues with generic big data analytics. Smart energy big data involve physical processes extensively, where data intelligence can have a huge impact on the safe operation of the systems in real-time [ 12 ]. Such analysis can also be useful for marketing and other commercial companies to grow their business. However, as the databases contain personal information, it is risky to provide direct access to researchers and analysts: the privacy of individuals may be leaked, which can cause harm and is also illegal. The paper is based on research not restricted to a specific timeline; as the references suggest, the cited research papers range from as old as 1998 to papers published in 2016. Also, the number of papers that were retrieved from the keyword-based search can be verified from the presence of references based on the keywords. “ Privacy and security concerns ” section discusses privacy and security concerns in big data, and “ Privacy requirements in big data ” section covers the privacy requirements in big data. “ Big data privacy in data generation phase ”, “ Big data privacy in data storage phase ” and “ Big data privacy preserving in data processing ” sections discuss big data privacy in the data generation, data storage, and data processing phases. “ Privacy Preserving Methods in Big Data ” section covers the privacy preserving techniques using big data. “ Recent Techniques of Privacy Preserving in Big Data ” section presents some recent techniques of big data privacy and a comparative study between these techniques.

Privacy and security concerns in big data


Privacy and security in terms of big data are important issues. A big data security model is often not recommended in the event of complex applications, due to which it gets disabled by default. However, in its absence, data can always be compromised easily. As such, this section focuses on the privacy and security issues.

Privacy Information privacy is the privilege to have some control over how the personal information is collected and used. Information privacy is the capacity of an individual or group to stop information about themselves from becoming known to people other than those they give the information to. One serious user privacy issue is the identification of personal information during transmission over the Internet [ 13 ].

Security Security is the practice of defending information and information assets through the use of technology, processes and training from unauthorized access, disclosure, disruption, modification, inspection, recording, and destruction.

Privacy vs. security Data privacy is focused on the use and governance of individual data—such as putting policies in place to ensure that consumers’ personal information is being collected, shared and utilized in appropriate ways. Security concentrates more on protecting data from malicious attacks and the misuse of stolen data for profit [ 14 ]. While security is fundamental for protecting data, it is not sufficient for addressing privacy. Table  1 focuses on additional differences between privacy and security.

Privacy requirements in big data

Big data analytics attract various organizations, yet a hefty portion of them decide not to utilize these services because of the absence of standard security and privacy protection tools. This section analyses possible strategies to upgrade big data platforms with privacy protection capabilities, and outlines the foundations and development strategies of a framework that supports:

The specification of privacy policies managing the access to data stored into target big data platforms,

The generation of productive enforcement monitors for these policies, and

The integration of the generated monitors into the target analytics platforms. Enforcement techniques proposed for traditional DBMSs appear inadequate for the big data context due to the strict execution requirements needed to handle large data volumes, the heterogeneity of the data, and the speed at which data must be analysed.

Businesses and government agencies are generating and continuously collecting large amounts of data. The current increased focus on substantial sums of data will undoubtedly create opportunities and avenues to understand the processing of such data over numerous varying domains. But the potential of big data comes with a price: the users’ privacy is frequently at risk. Ensuring conformance to privacy terms and regulations is constrained in current big data analytics and mining practices. Developers should be able to verify that their applications conform to privacy agreements and that sensitive information is kept private regardless of changes in the applications and/or privacy regulations. To address these challenges, there is a need for new contributions in the areas of formal methods and testing procedures. New paradigms for privacy conformance testing apply to the four areas of the ETL (Extract, Transform, and Load) process, as shown in Fig.  2 [ 15 , 16 ].

Big data architecture and testing areas: new paradigms for privacy conformance testing applied to the four areas of the ETL (Extract, Transform, and Load) process are shown here

Pre‐Hadoop process validation This step validates the data loading process. At this step, the privacy specifications characterize the sensitive pieces of data that can uniquely identify a user or an entity. Privacy terms can likewise indicate which pieces of data can be stored and for how long. At this step, schema restrictions can take place as well.

Map‐reduce process validation This process transforms big data assets to effectively respond to a query. Privacy terms can specify the minimum number of returned records required to cover individual values, in addition to constraints on data sharing between various processes.

ETL process validation Similar to step (2), warehousing rationale should be confirmed at this step for compliance with privacy terms. Some data values may be aggregated anonymously or excluded from the warehouse if they indicate a high probability of identifying individuals.

Reports testing Reports are another form of queries, conceivably with higher visibility and a wider audience. Privacy terms that characterize ‘purpose’ are fundamental to check that sensitive data are not reported except for specified uses.
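As a simple illustration of the pre-load validation idea described above, the following Python sketch rejects records that still carry identifier attributes before they enter the warehouse. The field names and the policy itself are illustrative assumptions, not part of the framework in [ 15 , 16 ].

```python
# Sketch of a pre-load (pre-Hadoop) privacy conformance check.
# Field names and policy contents are illustrative assumptions.
FORBIDDEN_FIELDS = {"name", "ssn", "phone"}  # identifier attributes: never load

def validate_record(record):
    """Return a copy of the record, or raise if it violates the privacy policy."""
    leaked = FORBIDDEN_FIELDS & record.keys()
    if leaked:
        raise ValueError(f"identifier attributes present: {sorted(leaked)}")
    return dict(record)

clean = validate_record({"age": 34, "zip_code": "462003", "disease": "flu"})
print(clean)
```

A real conformance monitor would also enforce retention limits and purpose restrictions, but the shape of the check—policy evaluated per record before loading—is the same.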

Big data privacy in data generation phase

Data generation can be classified into active data generation and passive data generation. By active data generation, we mean that the data owner willingly gives the data to a third party [ 17 ], while passive data generation refers to circumstances in which the data are produced by the data owner’s online actions (e.g., browsing) and the data owner may not be aware that the data are being gathered by a third party. The risk of privacy violation during data generation can be minimized by either restricting access or by falsifying data.

Access restriction If the data owner thinks that the data may uncover sensitive information which is not supposed to be shared, it can refuse to provide such data. If the data owner is providing the data passively, a few measures can be taken to ensure privacy, such as anti-tracking extensions, advertisement or script blockers and encryption tools.

Falsifying data In some circumstances, it is unrealistic to prevent access to sensitive data. In that case, data can be distorted using certain tools before they are obtained by a third party. If the data are distorted, the true information cannot be easily revealed. The following techniques are utilized by the data owner to falsify the data:

A sockpuppet tool is utilized to hide the online identity of an individual by deception. By utilizing multiple sockpuppets, the data belonging to one specific individual will be regarded as belonging to various people. In that way the data collector will not have enough knowledge to relate the different sockpuppets to one individual.

Certain security tools can be used to mask an individual’s identity, such as MaskMe. This is especially useful when the data owner needs to give credit card details during online shopping.

Big data privacy in data storage phase

Storing high volume data is not a major challenge due to the advancement in data storage technologies, for example the boom in cloud computing [ 18 ]. However, if the big data storage system is compromised, it can be exceptionally destructive, as individuals’ personal information can be disclosed [ 19 ]. In a distributed environment, an application may need several datasets from various data centres and therefore confront the challenge of privacy protection.

The conventional security mechanisms to protect data can be divided into four categories: file level data security schemes, database level data security schemes, media level security schemes and application level encryption schemes [ 20 ]. Responding to the 3V’s nature of big data analytics, the storage infrastructure ought to be scalable and should have the ability to be configured dynamically to accommodate various applications. One promising technology to address these requirements is storage virtualization, empowered by the emerging cloud computing paradigm [ 21 ]. Storage virtualization is a process in which numerous network storage devices are combined into what appears to be a single storage device. SecCloud is one model for data security in the cloud that jointly considers both data storage security and computation auditing security [ 22 ]. Even so, there has been only limited discussion of the privacy of data when stored on the cloud.

Approaches to privacy preservation storage on cloud

When data are stored on the cloud, data security predominantly has three dimensions: confidentiality, integrity and availability [ 23 ]. The first two are directly related to the privacy of the data, i.e., if data confidentiality or integrity is breached it will have a direct effect on users’ privacy. Availability of information refers to ensuring that authorized parties are able to access the information when needed. A basic requirement for a big data storage system is to protect the privacy of an individual, and there are some existing mechanisms to fulfil that requirement. For example, a sender can encrypt his data using public key encryption (PKE) in a manner that only the valid recipient can decrypt the data. The approaches to safeguard the privacy of the user when data are stored on the cloud are as follows [ 7 ]:

Attribute based encryption Access control is based on the attributes of a user rather than the identity, so that a user whose attributes satisfy the access policy gains access over the resources.

Homomorphic encryption Can be deployed in IBE or ABE scheme settings, where updating the ciphertext receiver is possible; it allows computation to be carried out on encrypted data without decrypting it.

Storage path encryption It secures storage of big data on clouds.

Usage of Hybrid clouds A hybrid cloud is a cloud computing environment which utilizes a blend of on-premises private cloud and third-party public cloud services, with orchestration between the two platforms.
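The hybrid-cloud approach above can be sketched in a few lines: records classified as sensitive are routed to the private store, everything else may go to the public cloud. The sensitivity rule and store names below are illustrative assumptions, not prescribed by the cited works.

```python
# Toy illustration of hybrid-cloud data placement: sensitive records stay in
# the private store, non-sensitive records may go to the public cloud.
private_store, public_store = [], []

def is_sensitive(record):
    # illustrative rule: treat health and salary fields as sensitive
    return "disease" in record or "salary" in record

def route(record):
    (private_store if is_sensitive(record) else public_store).append(record)

for r in [{"page": "/home"}, {"disease": "flu", "id": 7}]:
    route(r)

print(len(private_store), len(public_store))
```

Real deployments classify data before a job executes (as HybrEx does, discussed later in this paper), but the placement decision reduces to exactly this kind of sensitivity test.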

Integrity verification of big data storage

When cloud computing is used for big data storage, the data owner loses control over the data. The outsourced data are at risk as the cloud server may not be completely trusted, so the data owner should be firmly convinced that the cloud is storing the data properly according to the service level contract. One way to ensure privacy for the cloud user is to provide a mechanism that allows the data owner to verify that the data stored on the cloud are intact [ 24 , 25 ]. The integrity of data storage in traditional systems can be verified in a number of ways, i.e., through Reed-Solomon codes, checksums, trapdoor hash functions, message authentication codes (MAC), digital signatures, etc. Data integrity verification is therefore of critical importance; different integrity verification schemes are compared in [ 24 , 26 ]. To verify the integrity of the data stored on the cloud, the straightforward approach is to retrieve all the data from the cloud; more practical schemes verify the integrity of data without having to retrieve them [ 25 , 26 ]. In an integrity verification scheme, the cloud server can only provide substantial evidence of integrity when all the data are intact. It is highly recommended that integrity verification be conducted regularly to provide the highest level of data protection [ 26 ].
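Of the traditional mechanisms listed above, a MAC-based check is the simplest to illustrate. In this minimal sketch (the key and data are illustrative), the owner computes a tag before uploading and later re-checks any retrieved copy without having to trust the server:

```python
import hmac, hashlib

# Sketch of MAC-based integrity verification: the data owner keeps a secret
# key and a tag; any retrieved copy is checked against the stored tag.
key = b"owner-secret-key"  # illustrative key, held only by the data owner

def tag(data: bytes) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()

original = b"outsourced big data block"
stored_tag = tag(original)  # computed before upload, kept by the owner

def verify(retrieved: bytes) -> bool:
    return hmac.compare_digest(tag(retrieved), stored_tag)

print(verify(original))               # True: block intact
print(verify(b"tampered data block"))  # False: modification detected
```

Note that this approach still requires retrieving the data to verify it; the provable-data-possession style schemes cited in [ 25 , 26 ] avoid that retrieval, at the cost of more elaborate cryptography.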

Big data privacy preserving in data processing

Big data processing paradigm categorizes systems into batch, stream, graph, and machine learning processing [ 27 , 28 ]. For privacy protection in data processing part, division can be done into two phases. In the first phase, the goal is to safeguard information from unsolicited disclosure since the collected data might contain sensitive information of the data owner. In the second phase, the aim is to extract meaningful information from the data without violating the privacy.

Privacy preserving methods in big data

A few traditional methods for privacy preserving in big data are described in brief here. These traditionally used methods provide privacy to a certain extent, but their demerits led to the advent of newer methods.

De-identification

De-identification [ 29 , 30 ] is a traditional technique for privacy-preserving data mining, where in order to protect individual privacy, data should first be sanitized with generalization (replacing quasi-identifiers with less specific but semantically consistent values) and suppression (not releasing some values at all) before release for data mining. To mitigate the threats from re-identification, the concepts of k-anonymity [ 29 , 31 , 32 ], l-diversity [ 30 , 31 , 33 ] and t-closeness [ 29 , 33 ] have been introduced to enhance traditional privacy-preserving data mining. De-identification is a crucial tool in privacy protection, and can be migrated to privacy preserving big data analytics. Nonetheless, as an attacker can possibly get more external information assistance for re-identification in the big data setting, we have to be aware that big data can also increase the risk of re-identification. As a result, de-identification is not sufficient for protecting big data privacy.

Privacy-preserving big data analytics is still challenging due to either the issues of flexibility along with effectiveness or the de-identification risks.

De-identification is more feasible for privacy-preserving big data analytics if efficient privacy-preserving algorithms are developed to help mitigate the risk of re-identification.

There are three privacy-preserving methods of de-identification, namely K-anonymity, L-diversity and T-closeness. Some common terms used in the privacy field for these methods are:

Identifier attributes include information that uniquely and directly distinguish individuals such as full name, driver license, social security number.

Quasi-identifier attributes are a set of information, for example gender, age, date of birth, and zip code, that can be combined with other external data in order to re-identify individuals.

Sensitive attributes are private and personal information. Examples include sickness, salary, etc.

Insensitive attributes are the general and the innocuous information.

Equivalence classes are sets of all records that consist of the same values on the quasi-identifiers.
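The notion of an equivalence class can be made concrete with a small sketch: records sharing the same quasi-identifier tuple fall into the same class. The attribute names and (already generalized) values below are illustrative.

```python
from collections import defaultdict

# Sketch: grouping records into equivalence classes over the quasi-identifiers
# (gender, age band, zip prefix -- illustrative, already-generalized values).
records = [
    {"gender": "F", "age": "20<age<=30", "zip": "4620*", "disease": "flu"},
    {"gender": "F", "age": "20<age<=30", "zip": "4620*", "disease": "cancer"},
    {"gender": "M", "age": "<=20",       "zip": "4600*", "disease": "flu"},
]
quasi = ("gender", "age", "zip")

classes = defaultdict(list)
for r in records:
    classes[tuple(r[q] for q in quasi)].append(r)

for key, members in classes.items():
    print(key, len(members))
```

Here the first two records form one equivalence class and the third forms another; the de-identification methods that follow all reason about the sizes and contents of such classes.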

K-anonymity

A release of data is said to have the k -anonymity [ 29 , 31 ] property if the information for each person contained in the release cannot be distinguished from at least k−1 other individuals whose information appears in the release. In the context of k -anonymization problems, a database is a table which consists of  n  rows and  m  columns, where each row of the table represents a record relating to a particular individual from a population, and the entries in the different rows need not be unique. The values in the different columns are the values of attributes associated with the members of the population. Table  2 is a non-anonymized database comprising the patient records of some fictitious hospital in Hyderabad.

There are six attributes along with ten records in this data. There are two regular techniques for accomplishing k -anonymity for some value of k .

Suppression In this method, certain values of the attributes are replaced by an asterisk ‘*’. All or some of the values of a column may be replaced by ‘*’. In the anonymized Table  3 , all the values in the ‘Name’ attribute and all the values in the ‘Religion’ attribute have been replaced by a ‘*’.

Generalization In this method, individual values of attributes are replaced with a broader category. For instance, the value ‘19’ of the attribute ‘Age’ may be supplanted by ‘ ≤20’, the value ‘23’ by ‘20 < age ≤ 30’, etc.

Table  3 has 2-anonymity with respect to the attributes ‘Age’, ‘Gender’ and ‘State of domicile’, since for any combination of these attributes found in any row of the table there are always at least two rows with those exact attributes. The attributes that are available to an adversary are called “quasi-identifiers”; in a dataset with k-anonymity, each “quasi-identifier” tuple occurs in at least k records. K-anonymous data can still be vulnerable to attacks such as the unsorted matching attack, temporal attack, and complementary release attack [ 33 , 34 ]. On the positive side, a greedy O(k log k)-approximation algorithm exists for optimal k-anonymity via suppression of entries. However, rendering relations of private records k-anonymous—while minimizing the amount of information that is not released, ensuring the anonymity of individuals up to a group of size k, and withholding a minimum amount of information to achieve this privacy level—is an NP-hard optimization problem. In general, even the further restriction of the problem where whole attributes are suppressed instead of individual entries is NP-hard [ 35 ]. Therefore we move towards the L-diversity strategy of data anonymization.
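A k-anonymity check itself is straightforward once the quasi-identifiers have been generalized: every quasi-identifier tuple must occur in at least k rows. The following sketch uses illustrative data in the spirit of Table 3.

```python
from collections import Counter

# Sketch of a k-anonymity check after generalization: every combination of
# quasi-identifier values must occur in at least k rows. Data are illustrative.
rows = [  # (age band, gender, state of domicile)
    ("<=20", "M", "MP"), ("<=20", "M", "MP"),
    ("20<age<=30", "F", "AP"), ("20<age<=30", "F", "AP"),
]

def is_k_anonymous(rows, k):
    return min(Counter(rows).values()) >= k

print(is_k_anonymous(rows, 2))  # every quasi-identifier tuple appears twice
```

Note that this only verifies the property; as discussed above, choosing generalizations that achieve k-anonymity while suppressing minimal information is NP-hard, so practical anonymizers rely on heuristics and approximation algorithms.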

L-diversity

It is a form of group based anonymization that is utilized to safeguard privacy in data sets by reducing the granularity of data representation. This reduction is a trade-off that results in some loss of effectiveness of data management or mining algorithms in exchange for some privacy. The  l -diversity model (Distinct, Entropy, Recursive) [ 29 , 31 , 34 ] is an extension of the  k -anonymity model which diminishes the granularity of data representation using methods including generalization and suppression, in a way that any given record maps onto at least  k  different records in the data. The  l -diversity model handles some of the weaknesses of the  k -anonymity model, in which protecting identities to the level of  k  individuals is not equivalent to protecting the corresponding sensitive values that were generalized or suppressed, particularly when the sensitive values within a group exhibit homogeneity. The  l -diversity model adds the promotion of intra-group diversity for sensitive values to the anonymization mechanism. The problem with this method is that it depends upon the range of the sensitive attribute: if one wants to make data l-diverse but the sensitive attribute has fewer than l distinct values, fictitious data must be inserted. This fictitious data will improve the security but may cause problems during analysis. The l-diversity method is also subject to skewness and similarity attacks [ 34 ] and thus cannot prevent attribute disclosure.
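The distinct variant of l-diversity described above can be checked with a few lines: each equivalence class must contain at least l distinct sensitive values. Data and attribute names are illustrative.

```python
from collections import defaultdict

# Sketch of a distinct l-diversity check: each equivalence class must contain
# at least l distinct sensitive values. Rows pair a quasi-identifier tuple
# with a sensitive value; the data are illustrative.
rows = [
    (("<=20", "M"), "flu"), (("<=20", "M"), "cancer"),
    (("20<age<=30", "F"), "flu"), (("20<age<=30", "F"), "flu"),
]

def is_l_diverse(rows, l):
    sensitive = defaultdict(set)
    for quasi, value in rows:
        sensitive[quasi].add(value)
    return all(len(vals) >= l for vals in sensitive.values())

print(is_l_diverse(rows, 2))  # second class holds only one distinct value
```

Here the table is 2-anonymous but not 2-diverse: the second equivalence class is homogeneous in its sensitive value, which is exactly the weakness of k-anonymity that l-diversity targets.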

T-closeness

It is a further improvement of  l -diversity group based anonymization that is used to preserve privacy in data sets by decreasing the granularity of a data representation. This reduction is a trade-off that results in some loss of effectiveness of data management or mining algorithms in order to gain some privacy. The  t -closeness model (Equal/Hierarchical distance) [ 29 , 33 ] extends the  l -diversity model by treating the values of an attribute distinctly, taking into account the distribution of data values for that attribute.

An equivalence class is said to have  t -closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is less than a threshold  t . A table is said to have  t -closeness if all equivalence classes have  t -closeness. The main advantage of t-closeness is that it prevents attribute disclosure. The problem with t-closeness is that as the size and variety of the data increase, the odds of re-identification increase as well. The brute-force approach that examines each possible partition of the table to find the optimal solution takes \(n^{O(n)} m^{O(1)} = 2^{O(n \log n)} m^{O(1)}\) time. This bound can be improved to singly exponential in n (note that it cannot be improved to polynomial unless P = NP) [ 36 ].
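A t-closeness check for a categorical sensitive attribute can be sketched as follows. For simplicity this sketch uses the total variation distance as a stand-in for the Earth Mover's Distance used in the original t-closeness definition; data values are illustrative.

```python
from collections import Counter

# Sketch of a t-closeness distance computation for a categorical sensitive
# attribute, using total variation distance as a simplified stand-in for the
# Earth Mover's Distance of the original definition. Data are illustrative.
table = ["flu", "flu", "cancer", "flu", "hiv", "cancer"]  # whole table
eq_class = ["flu", "cancer"]                              # one equivalence class

def distribution(values):
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def distance(p, q):
    keys = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

d = distance(distribution(table), distribution(eq_class))
print(round(d, 3))  # compare against the chosen threshold t
```

The class satisfies t-closeness whenever `d < t`; a full implementation would repeat this comparison for every equivalence class in the table.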

Comparative analysis of de-identification privacy methods

Advanced data analytics can extract valuable information from big data, but at the same time it poses a big risk to the users’ privacy [ 32 ]. There have been numerous proposed approaches to preserve privacy before, during, and after the analytics process on big data. This paper discusses three such privacy methods: K-anonymity, L-diversity, and T-closeness. As consumers’ data continue to grow rapidly and technologies improve unremittingly, the trade-off between privacy breaching and preserving will become more intense. Table  4 presents existing de-identification privacy-preserving measures and their limitations in big data.

The hybrid execution model (HybrEx) [ 37 ] is a model for confidentiality and privacy in cloud computing. It uses public clouds only for operations that are safe while integrating with an organization’s private cloud, i.e., it utilizes public clouds only for an organization’s non-sensitive data and computation classified as public, whereas for an organization’s sensitive, private data and computation, the model utilizes the private cloud. It considers data sensitivity before a job’s execution, and it provides integration with safety.

The four categories in which HybrEx MapReduce enables new kinds of applications that utilize both public and private clouds are as follows:

Map hybrid The map phase is executed in both the public and the private clouds while the reduce phase is executed in only one of the clouds as shown in Fig.  3 a.

HybrEx methods: a map hybrid b vertical partitioning c horizontal partitioning d hybrid. The four categories in which HybrEx MapReduce enables new kinds of applications that utilize both public and private clouds are shown

Vertical partitioning It is shown in Fig.  3 b. Map and reduce tasks are executed in the public cloud using public data as the input, shuffle intermediate data amongst them, and store the result in the public cloud. The same work is done in the private cloud with private data. The jobs are processed in isolation.

Horizontal partitioning The Map phase is executed at public clouds only while the reduce phase is executed at a private cloud as can be seen in Fig.  3 c.

Hybrid As in the figure shown in Fig.  3 d, the map phase and the reduce phase are executed on both public and private clouds. Data transmission among the clouds is also possible.

Integrity check models for full integrity and quick integrity checking are suggested as well. The problem with HybrEx is that it does not deal with the keys that are generated at the public and private clouds in the map phase, and that it treats only the cloud as an adversary.
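The vertical partitioning category above can be mimicked in miniature: the same MapReduce-style job runs over public and private data in isolation, with no cross-cloud shuffle. The word-count job and the data split below are illustrative assumptions, not taken from the HybrEx paper.

```python
from collections import Counter

# Toy simulation of HybrEx vertical partitioning: the same job runs over
# public and private data in isolation; intermediate data never cross clouds.
public_data = ["open report", "open data"]       # stored in the public cloud
private_data = ["salary record", "salary data"]  # stored in the private cloud

def word_count(lines):
    # stands in for a full map + shuffle + reduce pipeline
    return Counter(w for line in lines for w in line.split())

public_result = word_count(public_data)    # result stays in the public cloud
private_result = word_count(private_data)  # result stays in the private cloud
print(public_result["open"], private_result["salary"])
```

The point of the isolation is visible in the results: no word counts from the private partition ever appear in the public cloud's output, which is what keeps sensitive computation confined to the private side.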

Privacy-preserving aggregation

Privacy-preserving aggregation [ 38 ] is built on homomorphic encryption, used as a popular data collecting technique for event statistics. Given a homomorphic public key encryption algorithm, different sources can use the same public key to encrypt their individual data into ciphertexts [ 39 ]. These ciphertexts can be aggregated, and the aggregated result can be recovered with the corresponding private key. However, aggregation is purpose-specific: privacy-preserving aggregation can protect individual privacy in the phases of big data collection and storage, but because of its inflexibility it cannot run complex data mining to exploit new knowledge. As such, privacy-preserving aggregation is insufficient for big data analytics.
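The aggregation idea can be illustrated with a toy Paillier cryptosystem, which is additively homomorphic: multiplying ciphertexts corresponds to adding plaintexts, so an aggregator can combine encrypted readings without ever seeing them. The tiny hard-coded primes below are for illustration only and offer no real security.

```python
import random
from math import gcd, lcm

# Toy Paillier additively homomorphic encryption for privacy-preserving
# aggregation. WARNING: the tiny primes are illustrative, not secure.
p, q = 61, 53
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # modular inverse; this simple form is valid for g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:       # r must be coprime to n
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Three sources encrypt their readings; the aggregator multiplies ciphertexts,
# and decryption of the product yields the sum of the plaintexts.
cipher_product = 1
for reading in (12, 30, 7):
    cipher_product = cipher_product * encrypt(reading) % n2

print(decrypt(cipher_product))  # 49, the sum of the readings
```

The aggregator holding only the public parameters learns nothing about individual readings, which matches the purpose-specific nature noted above: sums come out, but no richer mining is possible on the ciphertexts.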

Operations over encrypted data

Motivated by searching over encrypted data [ 38 ], operations can be run over encrypted data to protect individual privacy in big data analytics. However, since operations over encrypted data are mostly complex and time-consuming, while big data is high-volume and requires mining new knowledge in a reasonable timeframe, running operations over encrypted data is inefficient for big data analytics.

Recent techniques of privacy preserving in big data

Differential privacy

Differential privacy [ 40 ] is a technology that provides researchers and database analysts a facility to obtain useful information from databases that contain personal information of people, without revealing the personal identities of the individuals. This is done by introducing a minimum distortion in the information provided by the database system. The distortion introduced is large enough to protect privacy, yet small enough that the information provided to the analyst is still useful. Earlier, some techniques were used to protect privacy, but they proved to be unsuccessful.

In the mid-90s, the Commonwealth of Massachusetts Group Insurance Commission (GIC) released anonymized health records of its clients for research to benefit society [ 32 ]. GIC hid some information, like name and street address, so as to protect privacy. Latanya Sweeney (then a PhD student at MIT), using the publicly available voter database and the database released by GIC, successfully identified health records just by comparing and correlating them. Thus, hiding some information cannot assure the protection of individual identity.

Differential privacy (DP) provides a solution to this problem, as shown in Fig.  4 . In DP, analysts are not given direct access to the database containing personal information. An intermediary piece of software, also called the privacy guard, is introduced between the database and the analyst to protect privacy.

Fig. 4 Differential privacy (DP) as a solution to privacy preserving in big data

Step 1 The analyst can make a query to the database through this intermediary privacy guard.

Step 2 The privacy guard takes the query from the analyst and evaluates it, together with earlier queries, for privacy risk.

Step 3 The privacy guard then gets the answer from the database.

Step 4 The privacy guard adds some distortion to the answer according to the evaluated privacy risk and finally provides it to the analyst.

The amount of distortion added to the pure data is proportional to the evaluated privacy risk. If the privacy risk is low, the distortion added is small enough that it does not affect the quality of the answer, yet large enough to protect the individual privacy of the database. If the privacy risk is high, more distortion is added.
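The guard's distortion step is commonly realised with the Laplace mechanism: for a counting query (which has sensitivity 1, since one person changes the count by at most 1), noise drawn from Laplace(1/ε) is added to the true answer, with a smaller privacy budget ε meaning more distortion. A minimal sketch, where the dataset and the ε value are purely illustrative:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) by inverse transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    """Differentially private count: true count plus Laplace(1/epsilon) noise."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 45, 67, 34, 52, 41, 29, 60]
# Low risk -> larger epsilon -> small distortion; high risk -> small epsilon.
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
print(round(noisy, 2))  # close to the true count of 5, but perturbed
```

Each repetition of the query returns a differently perturbed answer, which is why the guard must also account for the cumulative risk of earlier queries (Step 2 above).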

Identity based anonymization

Intel encountered issues when combining anonymization, privacy protection, and big data techniques [ 41 ] to analyse usage data while protecting the identities of users. Intel's Human Factors Engineering team wanted to use web page access logs and big data tools to enhance the convenience of Intel's heavily used internal web portal. To protect Intel employees' privacy, they were required to remove personally identifying information (PII) from the portal's usage log repository, but in a way that did not affect the use of big data tools for analysis or the ability to re-identify a log entry in order to investigate unusual behaviour. Cloud computing is a large-scale distributed computing paradigm that has become a driving force for Information and Communications Technology over the past several years, owing to its innovative and promising vision. It offers the possibility of improving IT systems management and is changing the way hardware and software are designed, purchased, and utilized. Cloud storage services bring significant benefits to data owners: (1) reducing cloud users' burden of storage management and equipment maintenance, (2) avoiding large investments in hardware and software, (3) enabling data access independent of geographical position, and (4) allowing data access at any time and from anywhere [ 42 ].

To meet these objectives, Intel created an open architecture for anonymization [ 41 ] that allowed a variety of tools to be used for both de-identifying and re-identifying web log records. In implementing this architecture, Intel found that enterprise data has properties different from the standard examples in the anonymization literature [ 43 ]. This work showed that big data techniques can yield benefits in the enterprise environment even when working on anonymized data. Intel also found that despite masking obvious PII such as usernames and IP addresses, the anonymized data remained vulnerable to correlation attacks. They explored the trade-offs of correcting these vulnerabilities and found that User Agent (browser/OS) information strongly correlates with individual users. This is a case study of an anonymization implementation in an enterprise, describing the requirements, implementation, and experiences encountered when using anonymization to protect privacy in enterprise data analysed with big data techniques. The investigation of anonymization quality used k-anonymity-based metrics. Intel used Hadoop to analyse the anonymized data and obtain valuable results for the Human Factors analysts [ 44 , 45 ]. At the same time, they learned that anonymization needs to be more than simply masking or generalizing certain fields: anonymized datasets need to be carefully analysed to determine whether they are vulnerable to attack.
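The de-identify/re-identify requirement can be sketched as keyed pseudonymization: PII fields are replaced with keyed-hash tokens, so analysts can still group and count by user, while the holder of the key (here, via a token-to-value map) can re-identify a specific entry for an investigation. This is only an illustrative sketch, not Intel's actual architecture, and as noted above such masking alone remains open to correlation attacks via fields like the User Agent.

```python
import hmac
import hashlib

REID_KEY = b"escrowed-key"        # kept by a re-identification authority
reid_map = {}                     # token -> original value, held with the key

def pseudonymize(value):
    """Replace a PII value with a stable keyed token, keeping a reverse map."""
    tok = hmac.new(REID_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
    reid_map[tok] = value
    return tok

log = [
    {"user": "jdoe", "ip": "10.1.2.3", "page": "/hr/benefits"},
    {"user": "jdoe", "ip": "10.1.2.3", "page": "/travel"},
    {"user": "asmith", "ip": "10.9.8.7", "page": "/hr/benefits"},
]
anon_log = [{**e, "user": pseudonymize(e["user"]),
             "ip": pseudonymize(e["ip"])} for e in log]

# Analysts can still count page hits per (pseudonymous) user...
hits = {}
for e in anon_log:
    hits[e["user"]] = hits.get(e["user"], 0) + 1

# ...and the key holder can re-identify a suspicious log entry.
suspect = anon_log[0]["user"]
print(reid_map[suspect])  # 'jdoe'
```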

Privacy preserving Apriori algorithm in MapReduce framework

Hiding a needle in a haystack [ 46 ].

Existing privacy-preserving association rule algorithms modify the original transaction data through noise addition. However, this work keeps the original transactions within the noised transactions, because the goal is to prevent deterioration of data utility while also preventing privacy violation. The possibility therefore remains that an untrusted cloud service provider infers the real frequent item sets [ 47 ]. Despite the risk of association rule leakage, the method provides sufficient privacy protection because the privacy-preserving algorithm is based on the "hiding a needle in a haystack" concept [ 46 ]. The idea is that a rare class of data, the needles, is hard to find within a large amount of data, the haystack, as shown in Fig.  5 . Existing techniques [ 48 ] cannot add noise haphazardly, because the privacy-data utility trade-off must be considered. Instead, this technique incurs additional computation cost in adding the noise that builds the "haystack" to hide the "needle". This trade-off between privacy and computation cost is easier to manage with the Hadoop framework in a cloud environment. In Fig.  5 , the dark diamond dots are the original association rules and the empty circles are noised association rules. The original rules are hard to reveal because there are so many noised association rules [ 46 ].

Fig. 5 Mechanism of hiding a needle in a haystack

In Fig.  6 , the service provider adds a dummy item as noise to the original transaction data collected from the data provider. Subsequently, a unique code is assigned to the dummy and the original items. The service provider keeps the code information so that it can filter out the dummy items after the frequent item sets are extracted by an external cloud platform. The Apriori algorithm is performed by the external cloud platform on the data sent by the service provider. The external cloud platform returns the frequent item sets and support values to the service provider. The service provider filters out the frequent item sets affected by the dummy items, using the codes, and extracts the correct association rules from the frequent item sets without the dummy items. Extracting the association rules is not a burden for the service provider, considering that the amount of computation required is small.

Fig. 6 Overview of the process of association rule mining: the service provider adds a dummy item as noise to the original transaction data collected from the data provider
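The dummy-item workflow can be sketched end to end. For brevity the snippet mines only frequent single items rather than running full Apriori, and all names (the dummy codes, the support threshold, the sample transactions) are illustrative: the service provider injects coded dummy items, an untrusted party counts supports over the noised data, and the provider filters the dummies back out with its private code table.

```python
from collections import Counter

# Original transactions held by the service provider.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# Step 1: add coded dummy items (the "haystack") before outsourcing.
DUMMY_CODES = {"D1", "D2"}            # provider keeps this code table private
noised = [t | {"D1"} if i % 2 == 0 else t | {"D2"}
          for i, t in enumerate(transactions)]

# Step 2: the external cloud platform counts supports over the noised data.
def item_supports(data):
    counts = Counter()
    for t in data:
        counts.update(t)
    return counts

cloud_result = item_supports(noised)

# Step 3: the provider filters out the dummy items using its code table.
MIN_SUPPORT = 2
frequent = {item for item, c in cloud_result.items()
            if c >= MIN_SUPPORT and item not in DUMMY_CODES}
print(sorted(frequent))  # ['bread', 'butter', 'milk']
```

Because the dummy items are only ever added (never substituted for real items), the supports of the original items are unchanged, so filtering by code recovers exactly the true frequent items.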

Privacy-preserving big data publishing

The publication and dissemination of raw data are crucial components in commercial, academic, and medical applications. With an increasing number of open platforms, such as social networks and mobile devices, from which data can be gathered, the volume of such data has also increased over time [ 49 ]. Privacy-preserving models broadly fall into two settings, referred to as input privacy and output privacy. In input privacy, the primary concern is publishing anonymized data with models such as k-anonymity and l-diversity. In output privacy, the interest is generally in problems such as association rule hiding and query auditing, where the output of different data mining algorithms is perturbed or audited in order to preserve privacy. Much of the work on privacy has focused on the quality of privacy preservation (vulnerability quantification) and the utility of the published data. One solution is simply to divide the data into smaller parts (fragments) and anonymize each part independently [ 50 ].

Although k-anonymity can prevent identity attacks, it fails to protect against attribute disclosure attacks because of the lack of diversity in the sensitive attribute within an equivalence class. The l-diversity model mandates that each equivalence class must have at least l well-represented sensitive values. Large data sets are commonly processed with distributed platforms such as the MapReduce framework [ 51 , 52 ] in order to distribute a costly process among multiple nodes and achieve considerable performance improvement. Improvements to these privacy models have therefore been introduced to resolve their inefficiency.
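Both properties can be checked mechanically. The sketch below (with an illustrative schema and data, not drawn from the cited works) groups records by their quasi-identifier tuple and verifies that every equivalence class has at least k records (k-anonymity) and at least l distinct sensitive values (a simplified, "distinct" reading of well-represented l-diversity).

```python
from collections import defaultdict

def equivalence_classes(records, qi_fields):
    """Group records by their quasi-identifier tuple."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in qi_fields)].append(r)
    return groups

def is_k_anonymous(records, qi_fields, k):
    return all(len(g) >= k
               for g in equivalence_classes(records, qi_fields).values())

def is_l_diverse(records, qi_fields, sensitive, l):
    """Simplified (distinct) l-diversity: >= l distinct sensitive values."""
    return all(len({r[sensitive] for r in g}) >= l
               for g in equivalence_classes(records, qi_fields).values())

# Generalized table: age range and zip prefix are the quasi-identifiers.
table = [
    {"age": "20-30", "zip": "476**", "disease": "flu"},
    {"age": "20-30", "zip": "476**", "disease": "cancer"},
    {"age": "30-40", "zip": "479**", "disease": "flu"},
    {"age": "30-40", "zip": "479**", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], k=2))            # True
print(is_l_diverse(table, ["age", "zip"], "disease", l=2))   # False
```

The second equivalence class is 2-anonymous yet every record in it carries the same disease, which is exactly the attribute disclosure attack that motivates l-diversity.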

Trust evaluation plays an important role in trust management. It is a technical approach to representing trust for digital processing, in which the factors influencing trust are evaluated from evidence data to obtain a continuous or discrete number, referred to as a trust value. This work proposes two schemes to preserve privacy in trust evaluation. To reduce communication and computation costs, it introduces two servers to realize privacy preservation and the sharing of evaluation results among various requestors. Consider a scenario with two independent service parties that do not collude with each other because of their business incentives. One is an authorized proxy (AP) responsible for access control and management of aggregated evidence, which enhances the privacy of the entities being evaluated. The other is an evaluation party (EP) (e.g., offered by a cloud service provider) that processes the data collected from a number of trust evidence providers. The EP processes the collected data in encrypted form and produces an encrypted trust pre-evaluation result. When a user requests the pre-evaluation result from the EP, the EP first checks the user's access eligibility with the AP. If the check is positive, the AP re-encrypts the pre-evaluation result so that it can be decrypted by the requester (Scheme 1), or an additional step involving the EP prevents the AP from obtaining the plain pre-evaluation result while still allowing the requester to decrypt it (Scheme 2) [ 53 ].

Improvement of k-anonymity and l-diversity privacy model

MapReduce-based anonymization.

The MapReduce framework was proposed for efficient data processing, and larger data sets are handled with large, distributed MapReduce-like frameworks. The data is split into equal-sized chunks, which are then fed to separate mappers. Each mapper processes its chunk and outputs key-value pairs. Pairs having the same key are transferred by the framework to one reducer. The reducer output sets are then used to produce the final result [ 32 , 34 ].
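The map/shuffle/reduce flow described above can be imitated in a few lines of plain Python (a single-process simulation, not an actual distributed run), here with the classic word-count job:

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit (key, value) pairs for its chunk of lines."""
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    """Framework step: route all pairs with the same key to one reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reduce_phase(key, values):
    """Reducer: combine all values for one key into the final result."""
    return key, sum(values)

# Two chunks standing in for the framework's equal-sized splits.
chunks = [["big data privacy", "big data"], ["privacy matters"]]
pairs = [p for chunk in chunks for p in map_phase(chunk)]  # parallel in reality
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'privacy': 2, 'matters': 1}
```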

K-anonymity with MapReduce

Since the data is automatically split by the MapReduce framework, the k-anonymization algorithm must be insensitive to data distribution across mappers. The MapReduce-based algorithm is reminiscent of the Mondrian algorithm. For better generality and, more importantly, to reduce the required iterations, each equivalence class is split into (at most) q equivalence classes in each iteration, rather than only two [ 50 ].

MapReduce-based l-diversity

The extension of the privacy model from k-anonymity to l-diversity requires integrating the sensitive values into either the output keys or the output values of the mapper. Thus, the pairs generated by mappers and combiners need to be modified appropriately. Unlike the mapper in k-anonymity, the mapper in l-diversity receives both the quasi-identifiers and the sensitive attribute as input [ 50 ].
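In that setting, the mapper's output pairs carry the sensitive value alongside the generalized quasi-identifiers, so the reducer can test diversity per equivalence class. A hypothetical sketch (the field names and generalization rules are illustrative, not taken from [ 50 ]):

```python
from collections import defaultdict

def l_diversity_mapper(record):
    """Emit (generalized-QI, sensitive-value): the sensitive attribute must
    travel with the pair so the reducer can test l-diversity."""
    qi = (record["age"] // 10 * 10, record["zip"][:3] + "**")  # generalize
    return qi, record["disease"]

def l_diversity_reducer(qi, sensitive_values, l):
    """Release a class only if it contains >= l distinct sensitive values."""
    if len(set(sensitive_values)) >= l:
        return [(qi, s) for s in sensitive_values]
    return []  # suppress classes that violate l-diversity

records = [
    {"age": 23, "zip": "47677", "disease": "flu"},
    {"age": 27, "zip": "47602", "disease": "cancer"},
    {"age": 35, "zip": "47905", "disease": "flu"},
    {"age": 36, "zip": "47909", "disease": "flu"},
]
buckets = defaultdict(list)
for r in records:                       # map + shuffle
    qi, s = l_diversity_mapper(r)
    buckets[qi].append(s)
released = [pair for qi, vs in buckets.items()
            for pair in l_diversity_reducer(qi, vs, l=2)]
print(released)  # only the diverse (20, '476**') class survives
```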

Fast anonymization of big data streams

Big data associated with a time stamp is called a big data stream. Sensor data, call centre records, click streams, and healthcare data are examples of big data streams. Quality of service (QoS) parameters such as end-to-end delay, accuracy, and real-time processing are some constraints of big data stream processing. The most important prerequisite of big data stream mining in applications such as healthcare is privacy preservation [ 54 ]. One common approach to anonymizing static data is k-anonymity, but it is not directly applicable to big data stream anonymization, for the following reasons [ 55 ]:

Unlike static data, data streams need real-time processing, and the existing k-anonymity approaches have been proved NP-hard.

To reduce information loss, existing static k-anonymization algorithms must repeatedly scan the data during the anonymization procedure, which is impossible in data stream processing.

The scales of data streams that need to be anonymized in some applications are increasing tremendously.

Data streams have become so large that anonymizing them is becoming a challenge for existing anonymization algorithms.

To cope with the first and second challenges above, the FADS algorithm was chosen. This algorithm is the best choice for data stream anonymization, but it has two main drawbacks:

The FADS algorithm handles tuples sequentially, so it is not suitable for big data streams.

Some tuples may remain in the system for a long time and are released only when a specified threshold expires.

This work provided three contributions. First, it uses parallelism to increase the efficiency of the FADS algorithm and make it applicable to big data stream anonymization. Second, it proposes a simple proactive heuristic, estimated round-time, to prevent a tuple from being published after its expiration. Third, it illustrates through experimental results that FAST is more efficient and effective than FADS and other existing algorithms, while noticeably reducing information loss and the cost metric during the anonymization process.

Proactive heuristic

In FADS, a new parameter is considered that represents the maximum delay tolerable for an application. This parameter is called the expiration-time. To prevent a tuple from being published after its expiration-time has passed, a simple heuristic, estimated-round-time, is defined. In the original FADS there is no check of whether a tuple can remain longer in the system, and as a result some tuples are published after expiration. This violates the real-time condition of a data stream application and also increases the cost metric notably.
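The heuristic can be sketched as a pre-publication check: before admitting a tuple to the current clustering round, compare its remaining lifetime against an estimate of how long a round takes, and publish it immediately if it would otherwise expire while waiting. Everything below (the tuple layout, the moving-average estimate, the timings) is an illustrative reconstruction, not the published FADS/FAST code.

```python
from collections import deque

EXPIRATION_TIME = 5.0      # max tolerable delay for the application (seconds)

class StreamAnonymizer:
    def __init__(self):
        self.buffer = []                     # tuples waiting to be clustered
        self.round_times = deque(maxlen=10)  # durations of recent rounds

    def estimated_round_time(self):
        """Proactive heuristic: average of the recent anonymization rounds."""
        if not self.round_times:
            return 1.0                       # conservative default
        return sum(self.round_times) / len(self.round_times)

    def admit(self, tup, now):
        """Admit a tuple only if it can survive one more round; otherwise
        force-publish it now so it is never released after expiration."""
        remaining = EXPIRATION_TIME - (now - tup["arrival"])
        if remaining <= self.estimated_round_time():
            return ("force-published", tup["id"])
        self.buffer.append(tup)
        return ("buffered", tup["id"])

anon = StreamAnonymizer()
anon.round_times.extend([1.8, 2.2, 2.0])     # past rounds took about 2 s
print(anon.admit({"id": 1, "arrival": 0.0}, now=1.0))  # ('buffered', 1)
print(anon.admit({"id": 2, "arrival": 0.0}, now=3.5))  # ('force-published', 2)
```

A force-published tuple would typically receive coarser generalization than a clustered one, which is the information-loss cost the FAST experiments measure.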

Privacy and security aspects of healthcare in big data

The new wave of digitizing medical records has brought a paradigm shift to the healthcare industry. As a result, the healthcare industry is witnessing an increase in the sheer volume of data in terms of complexity, diversity and timeliness [ 56 – 58 ]. The term "big data" refers to the agglomeration of large and complex data sets, which exceeds the computational, storage and communication capabilities of conventional methods or systems. In healthcare, several factors provide the necessary impetus to harness the power of big data [ 59 ]. Harnessing the power of big data analysis and genomic research, with real-time access to patient records, could allow doctors to make informed decisions on treatments [ 60 ]. Big data will compel insurers to reassess their predictive models. The real-time remote monitoring of vital signs through embedded sensors (attached to patients) allows healthcare providers to be alerted in case of an anomaly. Healthcare digitization with integrated analytics is one of the next big waves in healthcare Information Technology (IT), with Electronic Health Records (EHRs) being a crucial building block for this vision. With the introduction of EHR incentive programs [ 61 ], healthcare organizations recognized EHRs' value proposition of facilitating better access to complete, accurate and sharable healthcare data, eventually leading to improved patient care. With the ever-changing risk environment and the introduction of new emerging threats and vulnerabilities, security violations are expected to grow in the coming years [ 62 ].

A comprehensive survey of the different tools and techniques used in pervasive healthcare has been presented in a disease-specific manner. It covered the major diseases and disorders that can be quickly detected and treated with the use of technology, such as fatal and non-fatal falls, Parkinson's disease, cardiovascular disorders, and stress, and discussed the different pervasive healthcare techniques available to address these diseases and many permanent handicaps, such as blindness, motor disabilities, and paralysis, as well as the plethora of commercially available pervasive healthcare products. It provides an understanding of the various aspects of pervasive healthcare with respect to different diseases [ 63 ].

Adoption of big data in healthcare significantly increases security and patient privacy concerns. At the outset, patient information is stored in data centres with varying levels of security. Traditional security solutions cannot be directly applied to large and inherently diverse data sets. With the increase in popularity of healthcare cloud solutions, complexity in securing massive distributed Software as a Service (SaaS) solutions increases with varying data sources and formats. Hence, big data governance is necessary prior to exposing data to analytics.

Data governance

As the healthcare industry moves towards a value-based business model leveraging healthcare analytics, data governance will be the first step in regulating and managing healthcare data.

The goal is to have a common data representation that encompasses industry standards and local and regional standards.

Data generated by BSN is diverse in nature and would require normalization, standardization and governance prior to analysis.

Real-time security analytics

Analysing security risks and predicting threat sources in real-time is of utmost need in the burgeoning healthcare industry.

Healthcare industry is witnessing a deluge of sophisticated attacks ranging from Distributed Denial of Service (DDoS) to stealthy malware.

As the healthcare industry leverages emerging big data technologies to make better-informed decisions, security analytics will be at the core of any design for a cloud-based SaaS solution hosting protected health information (PHI) [ 64 ].

Privacy-preserving analytics

Invasion of patient privacy is a growing concern in the domain of big data analytics.

Privacy-preserving encryption schemes that allow prediction algorithms to run on encrypted data while protecting the identity of a patient are essential for driving healthcare analytics [ 65 ].

Data quality

Health data is usually collected from different sources with totally different set-ups and database designs, which makes the data complex and dirty, with much missing data and different coding standards for the same fields.

Although problematic handwriting is no longer an issue in EHR systems, the data collected via these systems are not primarily gathered for analytical purposes and contain many issues (missing data, incorrectness, miscoding) owing to clinicians' workloads, unfriendly user interfaces, and the absence of validity checks by humans [ 66 ].

Data sharing and privacy

Because health data contains personal health information (PHI), there will be legal difficulties in accessing the data owing to the risk of invading privacy.

Health data can be anonymized using masking and de-identification techniques, and be disclosed to the researchers based on a legal data sharing agreement [ 67 ].

If the data is anonymized too heavily with the aim of protecting privacy, it loses its quality and is no longer useful for analysis. Striking a balance between the privacy-protection elements (anonymization, sharing agreement, and security controls) is therefore essential to obtain data that is usable for analytics.

Relying on predictive models

There should not be unrealistic expectations of the constructed data mining models; every model has a limited accuracy.

It would be dangerous to rely only on predictive models when making critical decisions that directly affect a patient's life, and this should not be expected of a predictive model.

Variety of methods and complex mathematics

The underlying mathematics of almost all data mining techniques is complex and not easily understood by non-technical people; thus, clinicians and epidemiologists have usually preferred to continue working with traditional statistical methods.

It is essential for the data analyst to be familiar with the different techniques, and with the different accuracy measurements, so as to apply multiple techniques when analysing a specific dataset.

Summary on recent approaches used in big data privacy

In this section, a summary of recent approaches used in big data privacy is given. Table 5 lists different papers, the methods they introduce, their focus, and their demerits. It presents an overview of the work done so far in the field of big data privacy.

Conclusion and future work

Big data [ 2 , 68 ] is analysed for insights that lead to better decisions and strategic moves for competitive businesses, yet only a small percentage of data is actually analysed. In this paper, we have investigated the privacy challenges in big data by first identifying the big data privacy requirements and then discussing whether existing privacy-preserving techniques are sufficient for big data processing. The privacy challenges in each phase of the big data life cycle [ 7 ] are presented, along with the advantages and disadvantages of existing privacy-preserving technologies in the context of big data applications. This paper also presents traditional as well as recent techniques of privacy preserving in big data. Hiding a needle in a haystack [ 46 ] is one example in which privacy preservation is combined with association rule mining. The concepts of identity-based anonymization [ 41 ] and differential privacy [ 40 ], and a comparative study of various recent techniques of big data privacy, are also discussed, together with scalable anonymization methods [ 69 ] within the MapReduce framework, which can easily be scaled up by increasing the number of mappers and reducers. As a future direction, new perspectives are needed to achieve effective solutions to the scalability problem [ 70 ] of privacy and security in the era of big data, and especially to the problem of reconciling security and privacy models by exploiting the MapReduce framework. In healthcare services [ 59 , 64 – 67 ] as well, more efficient privacy techniques need to be developed. Differential privacy is one sphere with much hidden potential still to be exploited.
Also, with the rapid development of the IoT, many challenges arise when IoT and big data meet: the quantity of data is big but the quality is low, and the data come from different sources that inherently possess a great many different types and representation forms; the data is heterogeneous, structured, semi-structured, and even entirely unstructured [ 71 ]. This poses new privacy challenges and open research issues, so different methods of privacy-preserving mining may be studied and implemented in the future. As such, there is huge scope for further research on privacy-preserving methods in big data.

Abadi DJ, Carney D, Cetintemel U, Cherniack M, Convey C, Lee S, Stone-braker M, Tatbul N, Zdonik SB. Aurora: a new model and architecture for data stream management. VLDB J. 2003;12(2):120–39.


Kolomvatsos K, Anagnostopoulos C, Hadjiefthymiades S. An efficient time optimized scheme for progressive analytics in big data. Big Data Res. 2015;2(4):155–65.

Big data at the speed of business, [online]. http://www-01.ibm.com/soft-ware/data/bigdata/2012 .

Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers A. Big data: the next frontier for innovation, competition, and productivity. New York: Mickensy Global Institute; 2011. p. 1–137.


Gantz J, Reinsel D. Extracting value from chaos. In: Proc on IDC IView. 2011. p. 1–12.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big Data Springer Open J. 2015.

Mehmood A, Natgunanathan I, Xiang Y, Hua G, Guo S. Protection of big data privacy. IEEE Access. 2016;4:1821–34.

Jain P, Pathak N, Tapashetti P, Umesh AS. Privacy preserving processing of data decision tree based on sample selection and singular value decomposition. In: 39th international conference on information assurance and security (lAS). 2013.

Qin Y, et al. When things matter: a survey on data-centric internet of things. J Netw Comp Appl. 2016;64:137–53.

Fong S, Wong R, Vasilakos AV. Accelerated PSO swarm search feature selection for data stream mining big data. In: IEEE transactions on services computing, vol. 9, no. 1. 2016.

Middleton P, Kjeldsen P, Tully J. Forecast: the internet of things, worldwide. Stamford: Gartner; 2013.

Hu J, Vasilakos AV. Energy Big data analytics and security: challenges and opportunities. IEEE Trans Smart Grid. 2016;7(5):2423–36.

Porambage P, et al. The quest for privacy in the internet of things. IEEE Cloud Comp. 2016;3(2):36–45.

Jing Q, et al. Security of the internet of things: perspectives and challenges. Wirel Netw. 2014;20(8):2481–501.

Han J, Ishii M, Makino H. A hadoop performance model for multi-rack clusters. In: IEEE 5th international conference on computer science and information technology (CSIT). 2013. p. 265–74.

Gudipati M, Rao S, Mohan ND, Gajja NK. Big data: testing approach to overcome quality challenges. Data Eng. 2012:23–31.

Xu L, Jiang C, Wang J, Yuan J, Ren Y. Information security in big data: privacy and data mining. IEEE Access. 2014;2:1149–76.

Liu S. Exploring the future of computing. IT Prof. 2011;15(1):2–3.

Sokolova M, Matwin S. Personal privacy protection in time of big data. Berlin: Springer; 2015.

Cheng H, Rong C, Hwang K, Wang W, Li Y. Secure big data storage and sharing scheme for cloud tenants. China Commun. 2015;12(6):106–15.

Mell P, Grance T. The NIST definition of cloud computing. Natl Inst Stand Technol. 2009;53(6):50.

Wei L, Zhu H, Cao Z, Dong X, Jia W, Chen Y, Vasilakos AV. Security and privacy for storage and computation in cloud computing. Inf Sci. 2014;258:371–86.

Xiao Z, Xiao Y. Security and privacy in cloud computing. In: IEEE Trans on communications surveys and tutorials, vol 15, no. 2, 2013. p. 843–59.

Wang C, Wang Q, Ren K, Lou W. Privacy-preserving public auditing for data storage security in cloud computing. In: Proc. of IEEE Int. Conf. on INFOCOM. 2010. p. 1–9.

Liu C, Ranjan R, Zhang X, Yang C, Georgakopoulos D, Chen J. Public auditing for big data storage in cloud computing—a survey. In: Proc. of IEEE Int. Conf. on computational science and engineering. 2013. p. 1128–35.

Liu C, Chen J, Yang LT, Zhang X, Yang C, Ranjan R, Rao K. Authorized public auditing of dynamic big data storage on cloud with efficient verifiable fine-grained updates. In: IEEE trans. on parallel and distributed systems, vol 25, no. 9. 2014. p. 2234–44

Xu K, et al. Privacy-preserving machine learning algorithms for big data systems. In: Distributed computing systems (ICDCS) IEEE 35th international conference; 2015.

Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV. Parallel processing systems for big data: a survey. In: Proceedings of the IEEE. 2016.

Li N, et al. t-Closeness: privacy beyond k -anonymity and L -diversity. In: Data engineering (ICDE) IEEE 23rd international conference; 2007.

Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. L-diversity: privacy beyond k - anonymity. In: Proc. 22nd international conference data engineering (ICDE); 2006. p. 24.

Ton A, Saravanan M. Ericsson research. [Online]. http://www.ericsson.com/research-blog/data-knowledge/big-data-privacy-preservation/2015 .

Samarati P. Protecting respondent’s privacy in microdata release. IEEE Trans Knowl Data Eng. 2001;13(6):1010–27.

Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, SRI Computer Science Laboratory; 1998.

Sweeney L. K-anonymity: a model for protecting privacy. Int J Uncertain Fuzz. 2002;10(5):557–70.


Meyerson A, Williams R. On the complexity of optimal k-anonymity. In: Proc. of the ACM Symp. on principles of database systems. 2004.

Bredereck R, Nichterlein A, Niedermeier R, Philip G. The effect of homogeneity on the complexity of k-anonymity. In: FCT; 2011. p. 53–64.

Ko SY, Jeon K, Morales R. The HybrEx model for confidentiality and privacy in cloud computing. In: 3rd USENIX workshop on hot topics in cloud computing, HotCloud’11, Portland; 2011.

Lu R, Zhu H, Liu X, Liu JK, Shao J. Toward efficient and privacy-preserving computing in big data era. IEEE Netw. 2014;28:46–50.

Paillier P. Public-key cryptosystems based on composite degree residuosity classes. In: EUROCRYPT. 1999. p. 223–38.

Microsoft differential privacy for everyone, [online]. 2015. http://download.microsoft.com/…/Differential_Privacy_for_Everyone.pdf .

Sedayao J, Bhardwaj R. Making big data, privacy, and anonymization work together in the enterprise: experiences and issues. Big Data Congress; 2014.

Yong Yu, et al. Cloud data integrity checking with an identity-based auditing mechanism from RSA. Future Gener Comp Syst. 2016;62:85–91.

Oracle Big Data for the Enterprise, 2012. [online]. http://www.oracle.com/ca-en/technoloqies/biq-doto .

Hadoop Tutorials. 2012. https://developer.yahoo.com/hadoop/tutorial .

Fair Scheduler Guide. 2013. http://hadoop.apache.org/docs/r0.20.2/fair_scheduler.html ,

Jung K, Park S, Park S. Hiding a needle in a haystack: privacy preserving Apriori algorithm in MapReduce framework PSBD’14, Shanghai; 2014. p. 11–17.

Ateniese G, Johns RB, Curtmola R, Herring J, Kissner L, Peterson Z, Song D. Provable data possession at untrusted stores. In: Proc. of int. conf. of ACM on computer and communications security. 2007. p. 598–609.

Verma A, Cherkasova L, Campbell RH. Play it again, SimMR!. In: Proc. IEEE Int’l conf. cluster computing (Cluster’11); 2011.

Feng Z, et al. TRAC: Truthful auction for location-aware collaborative sensing in mobile crowd sourcing INFOCOM. Piscataway: IEEE; 2014. p. 1231–39.

Zakerzadeh H, Aggarwal CC, Barker K. Privacy-preserving big data publishing. In: Proc. of the international conference on scientific and statistical database management (SSDBM). La Jolla: ACM; 2015.

Dean J, Ghemawat S. Map reduce: simplied data processing on large clusters. OSDI; 2004.

Lammel R. Google’s MapReduce programming model-revisited. Sci Comput Progr. 2008;70(1):1–30.

Yan Z, et al. Two schemes of privacy-preserving trust evaluation. Future Gener Comp Syst. 2016;62:175–89.

Zhang Y, Fong S, Fiaidhi S, Mohammed S. Real-time clinical decision support system with data stream mining. J Biomed Biotechnol. 2012;2012:8.

Mohammadian E, Noferesti M, Jalili R. FAST: fast anonymization of big data streams. In: ACM proceedings of the 2014 international conference on big data science and computing, article 1. 2014.

Haferlach T, Kohlmann A, Wieczorek L, Basso G, Kronnie GT, Bene M-C, De Vos J, Hernandez JM, Hofmann W-K, Mills KI, Gilkes A, Chiaretti S, Shurtleff SA, Kipps TJ, Rassenti LZ, Yeoh AE, Papenhausen PR, Liu WM, Williams PM, Fo R. Clinical utility of microarray-based gene expression profiling in the diagnosis and sub classification of leukemia: report from the international microarray innovations in leukemia study group. J Clin Oncol. 2010;28(15):2529–37.

Salazar R, Roepman P, Capella G, Moreno V, Simon I, Dreezen C, Lopez-Doriga A, Santos C, Marijnen C, Westerga J, Bruin S, Kerr D, Kuppen P, van de Velde C, Morreau H, Van Velthuysen L, Glas AM, Tollenaar R. Gene expression signature to improve prognosis prediction of stage II and III colorectal cancer. J Clin Oncol. 2011;29(1):17–24.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7.

Groves P, Kayyali B, Knott D, Kuiken SV. The ‘big data’ revolution in healthcare. New York: McKinsey & Company; 2013.

Public Law 111–148—Patient Protection and Affordable Care Act. U.S. Government Printing Office (GPO); 2013.

EHR incentive programs. 2014. [Online]. https://www.cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/index.html.

First things first—highmark makes healthcare-fraud prevention top priority with SAS. SAS; 2006.

Acampora G, et al. Data analytics for pervasive health. In: Healthcare data analytics. 2015. p. 533–76.

Haselton MG, Nettle D, Andrews PW. The evolution of cognitive bias. In: The handbook of evolutionary psychology. Hoboken: Wiley; 2005. p. 724–46.

Hill K. How Target figured out a teen girl was pregnant before her father did. New York: Forbes, Inc.; 2012. [Online]. http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/.

Violán C, Foguet-Boreu Q, Hermosilla-Pérez E, Valderas JM, Bolíbar B, Fàbregas-Escurriola M, Brugulat-Guiteras P, Muñoz-Pérez MÁ. Comparison of the information provided by electronic health records data and a population health survey to estimate prevalence of selected health conditions and multi morbidity. BMC Public Health. 2013;13(1):251.

Emam KE. Guide to the de-identification of personal health information. Boca Raton: CRC Press; 2013.


Wu X. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97–107.

Zhang X, Yang T, Liu C, Chen J. A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans Parallel Distrib Syst. 2014;25(2):363–73.

Zhang X, Dou W, Pei J, Nepal S, Yang C, Liu C, Chen J. Proximity-aware local-recoding anonymization with MapReduce for scalable big data privacy preservation in cloud. In: IEEE transactions on computers, vol. 64, no. 8, 2015.

Chen F, et al. Data mining for the internet of things: literature review and challenges. Int J Distrib Sens Netw. 2015;501:431047.

Fei H, et al. Robust cyber-physical systems: concept, models, and implementation. Future Gener Comp Syst. 2016;56:449–75.

Sweeney L. k-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst. 2002;10(5):557–70.

Dou W, et al. Hiresome-II: towards privacy-aware cross-cloud service composition for big data applications. IEEE Trans Parallel Distrib Syst. 2014;26(2):455–66.

Liang K, Susilo W, Liu JK. Privacy-preserving ciphertext for big data storage. In: IEEE transactions on informatics and forensics security. vol 10, no. 8. 2015.

Xu K, Yue H, Guo Y, Fang Y. Privacy-preserving machine learning algorithms for big data systems. In: IEEE 35th international conference on distributed systems. 2015.

Yan Z, Ding W, Xixun Yu, Zhu H, Deng RH. Deduplication on encrypted big data in cloud. IEEE Trans Big Data. 2016;2(2):138–50.


Authors’ contributions

PJ performed the primary literature review and analysis for this manuscript work. MG worked with PJ to develop the article framework and focus, and MG also drafted the manuscript. NK introduced this topic to PJ. MG and NK revised the manuscript for important intellectual content and have given final approval of the version to be published. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information

Authors and affiliations.

Computer Science Department, MANIT, Bhopal, India

Priyank Jain, Manasi Gyanchandani & Nilay Khare


Corresponding author

Correspondence to Priyank Jain .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article.

Jain, P., Gyanchandani, M. & Khare, N. Big data privacy: a technological perspective and review. J Big Data 3 , 25 (2016). https://doi.org/10.1186/s40537-016-0059-y


Received : 27 July 2016

Accepted : 05 November 2016

Published : 26 November 2016

DOI : https://doi.org/10.1186/s40537-016-0059-y


Keywords

  • Privacy and security
  • Privacy preserving: k-anonymity, T-closeness, L-diversity



Cyber risk and cybersecurity: a systematic review of data availability

  • Open access
  • Published: 17 February 2022
  • Volume 47 , pages 698–736, ( 2022 )


  • Frank Cremer 1 ,
  • Barry Sheehan   ORCID: orcid.org/0000-0003-4592-7558 1 ,
  • Michael Fortmann 2 ,
  • Arash N. Kia 1 ,
  • Martin Mullins 1 ,
  • Finbarr Murphy 1 &
  • Stefan Materne 2  

77k Accesses

92 Citations

43 Altmetric


Cybercrime is estimated to have cost the global economy just under USD 1 trillion in 2020, indicating an increase of more than 50% since 2018. With the average cyber insurance claim rising from USD 145,000 in 2019 to USD 359,000 in 2020, there is a growing necessity for better cyber information sources, standardised databases, mandatory reporting and public awareness. This research analyses the extant academic and industry literature on cybersecurity and cyber risk management with a particular focus on data availability. From a preliminary search resulting in 5219 cyber peer-reviewed studies, the application of the systematic methodology resulted in 79 unique datasets. We posit that the lack of available data on cyber risk poses a serious problem for stakeholders seeking to tackle this issue. In particular, we identify a lacuna in open databases that undermine collective endeavours to better manage this set of risks. The resulting data evaluation and categorisation will support cybersecurity researchers and the insurance industry in their efforts to comprehend, metricise and manage cyber risks.



Introduction

Globalisation, digitalisation and smart technologies have escalated the propensity and severity of cybercrime. Whilst it is an emerging field of research and industry, the importance of robust cybersecurity defence systems has been highlighted at the corporate, national and supranational levels. The impacts of inadequate cybersecurity are estimated to have cost the global economy USD 945 billion in 2020 (Maleks Smith et al. 2020 ). Cyber vulnerabilities pose significant corporate risks, including business interruption, breach of privacy and financial losses (Sheehan et al. 2019 ). Despite the increasing relevance for the international economy, the availability of data on cyber risks remains limited. The reasons for this are many. Firstly, it is an emerging and evolving risk; therefore, historical data sources are limited (Biener et al. 2015 ). It could also be due to the fact that, in general, institutions that have been hacked do not publish the incidents (Eling and Schnell 2016 ). The lack of data poses challenges for many areas, such as research, risk management and cybersecurity (Falco et al. 2019 ). The importance of this topic is demonstrated by the announcement of the European Council in April 2021 that a centre of excellence for cybersecurity will be established to pool investments in research, technology and industrial development. The goal of this centre is to increase the security of the internet and other critical network and information systems (European Council 2021 ).

This research takes a risk management perspective, focusing on cyber risk and considering the role of cybersecurity and cyber insurance in risk mitigation and risk transfer. The study reviews the existing literature and open data sources related to cybersecurity and cyber risk. This is the first systematic review of data availability in the general context of cyber risk and cybersecurity. By identifying and critically analysing the available datasets, this paper supports the research community by aggregating, summarising and categorising all available open datasets. In addition, further information on datasets is attached to provide deeper insights and support stakeholders engaged in cyber risk control and cybersecurity. Finally, this research paper highlights the need for open access to cyber-specific data, without price or permission barriers.

The identified open data can support cyber insurers in their efforts on sustainable product development. To date, traditional risk assessment methods have been untenable for insurance companies due to the absence of historical claims data (Sheehan et al. 2021 ). These high levels of uncertainty mean that cyber insurers are more inclined to overprice cyber risk cover (Kshetri 2018 ). Combining external data with insurance portfolio data therefore seems to be essential to improve the evaluation of the risk and thus lead to risk-adjusted pricing (Bessy-Roland et al. 2021 ). This argument is also supported by the fact that some re/insurers reported that they are working to improve their cyber pricing models (e.g. by creating or purchasing databases from external providers) (EIOPA 2018 ). Figure  1 provides an overview of pricing tools and factors considered in the estimation of cyber insurance based on the findings of EIOPA ( 2018 ) and the research of Romanosky et al. ( 2019 ). The term cyber risk refers to all cyber risks and their potential impact.

figure 1

An overview of the current cyber insurance informational and methodological landscape, adapted from EIOPA ( 2018 ) and Romanosky et al. ( 2019 )

Besides the advantage of risk-adjusted pricing, the availability of open datasets helps companies benchmark their internal cyber posture and cybersecurity measures. The research can also help to improve risk awareness and corporate behaviour. Many companies still underestimate their cyber risk (Leong and Chen 2020 ). For policymakers, this research offers starting points for a comprehensive recording of cyber risks. Although in many countries, companies are obliged to report data breaches to the respective supervisory authority, this information is usually not accessible to the research community. Furthermore, the economic impact of these breaches is usually unclear.

As well as the cyber risk management community, this research also supports cybersecurity stakeholders. Researchers are provided with an up-to-date, peer-reviewed literature of available datasets showing where these datasets have been used. For example, this includes datasets that have been used to evaluate the effectiveness of countermeasures in simulated cyberattacks or to test intrusion detection systems. This reduces a time-consuming search for suitable datasets and ensures a comprehensive review of those available. Through the dataset descriptions, researchers and industry stakeholders can compare and select the most suitable datasets for their purposes. In addition, it is possible to combine the datasets from one source in the context of cybersecurity or cyber risk. This supports efficient and timely progress in cyber risk research and is beneficial given the dynamic nature of cyber risks.

Cyber risks are defined as “operational risks to information and technology assets that have consequences affecting the confidentiality, availability, and/or integrity of information or information systems” (Cebula et al. 2014). Prominent cyber risk events include data breaches and cyberattacks (Agrafiotis et al. 2018). The increasing exposure and potential impact of cyber risk have been highlighted in recent industry reports (e.g. Allianz 2021; World Economic Forum 2020). Cyberattacks on critical infrastructures are ranked 5th in the World Economic Forum's Global Risks Report. Ransomware, malware and distributed denial-of-service (DDoS) attacks are examples of the evolving modes of cyberattack. One example is the ransomware attack on the Colonial Pipeline, which shut down the 5500-mile pipeline system that delivers 2.5 million barrels of fuel per day and critical liquid fuel infrastructure from oil refineries to states along the U.S. East Coast (Brower and McCormick 2021). These and other cyber incidents have led the U.S. to strengthen its cybersecurity and introduce, among other things, a public body to analyse major cyber incidents and make recommendations to prevent a recurrence (Murphey 2021a). Another example of the scope of cyberattacks is the ransomware NotPetya in 2017. The damage amounted to USD 10 billion, as the ransomware exploited a vulnerability in the Windows system, allowing it to spread independently worldwide in the network (GAO 2021). In the same year, the ransomware WannaCry was launched by cybercriminals. The cyberattack on Windows software took user data hostage in exchange for Bitcoin cryptocurrency (Smart 2018). The victims included the National Health Service in Great Britain. As a result, ambulances were redirected to other hospitals because of information technology (IT) systems failing, leaving people in need of urgent assistance waiting.
It has been estimated that 19,000 cancelled treatment appointments resulted in losses of GBP 92 million (Field 2018). Throughout the COVID-19 pandemic, ransomware attacks increased significantly, as working from home arrangements increased vulnerability (Murphey 2021b).

Besides cyberattacks, data breaches can also cause high costs. Under the General Data Protection Regulation (GDPR), companies are obliged to protect personal data and safeguard the data protection rights of all individuals in the EU area. The GDPR allows data protection authorities in each country to impose sanctions and fines on organisations they find in breach. “For data breaches, the maximum fine can be €20 million or 4% of global turnover, whichever is higher” (GDPR.EU 2021). Data breaches often involve a large amount of sensitive data that has been accessed, unauthorised, by external parties, and are therefore considered important for information security due to their far-reaching impact (Goode et al. 2017). A data breach is defined as a “security incident in which sensitive, protected, or confidential data are copied, transmitted, viewed, stolen, or used by an unauthorized individual” (Freeha et al. 2021). Depending on the amount of data, the extent of the damage caused by a data breach can be significant, with the average cost being USD 392 million (IBM Security 2020).
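The GDPR cap quoted above ("€20 million or 4% of global turnover, whichever is higher") reduces to a one-line calculation. A minimal illustrative sketch (the function name and turnover figures are our own, not from GDPR.EU):

```python
def max_gdpr_fine(global_turnover_eur: float) -> float:
    """Upper bound of a GDPR data-breach fine: the greater of
    EUR 20 million or 4% of global annual turnover."""
    return max(20_000_000.0, 0.04 * global_turnover_eur)

# A firm with EUR 1 billion turnover faces a cap of EUR 40 million,
# while a firm with EUR 100 million turnover still faces the EUR 20 million floor.
```

This makes clear why the fixed floor dominates for smaller organisations, while large multinationals are exposed to the turnover-linked cap.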

This research paper reviews the existing literature and open data sources related to cybersecurity and cyber risk, focusing on the datasets used to improve academic understanding and advance the current state-of-the-art in cybersecurity. Furthermore, important information about the available datasets is presented (e.g. use cases), and a plea is made for open data and the standardisation of cyber risk data for academic comparability and replication. The remainder of the paper is structured as follows. The next section describes the related work regarding cybersecurity and cyber risks. The third section outlines the review method used in this work and the process. The fourth section details the results of the identified literature. Further discussion is presented in the penultimate section and the final section concludes.

Related work

Due to the significance of cyber risks, several literature reviews have been conducted in this field. Eling ( 2020 ) reviewed the existing academic literature on the topic of cyber risk and cyber insurance from an economic perspective. A total of 217 papers with the term ‘cyber risk’ were identified and classified in different categories. As a result, open research questions are identified, showing that research on cyber risks is still in its infancy because of their dynamic and emerging nature. Furthermore, the author highlights that particular focus should be placed on the exchange of information between public and private actors. An improved information flow could help to measure the risk more accurately and thus make cyber risks more insurable and help risk managers to determine the right level of cyber risk for their company. In the context of cyber insurance data, Romanosky et al. ( 2019 ) analysed the underwriting process for cyber insurance and revealed how cyber insurers understand and assess cyber risks. For this research, they examined 235 American cyber insurance policies that were publicly available and looked at three components (coverage, application questionnaires and pricing). The authors state in their findings that many of the insurers used very simple, flat-rate pricing (based on a single calculation of expected loss), while others used more parameters such as the asset value of the company (or company revenue) or standard insurance metrics (e.g. deductible, limits), and the industry in the calculation. This is in keeping with Eling ( 2020 ), who states that an increased amount of data could help to make cyber risk more accurately measured and thus more insurable. Similar research on cyber insurance and data was conducted by Nurse et al. ( 2020 ). The authors examined cyber insurance practitioners' perceptions and the challenges they face in collecting and using data. 
In addition, gaps were identified during the research where further data is needed. The authors concluded that cyber insurance is still in its infancy, and there are still several unanswered questions (for example, cyber valuation, risk calculation and recovery). They also pointed out that a better understanding of data collection and use in cyber insurance would be invaluable for future research and practice. Bessy-Roland et al. ( 2021 ) come to a similar conclusion. They proposed a multivariate Hawkes framework to model and predict the frequency of cyberattacks. They used a public dataset with characteristics of data breaches affecting the U.S. industry. In the conclusion, the authors make the argument that an insurer has a better knowledge of cyber losses, but that it is based on a small dataset and therefore combination with external data sources seems essential to improve the assessment of cyber risks.

Several systematic reviews have been published in the area of cybersecurity (Kruse et al. 2017; Lee et al. 2020; Loukas et al. 2013; Ulven and Wangen 2021). In these papers, the authors concentrated on a specific area or sector in the context of cybersecurity. This paper adds to this extant literature by focusing on data availability and its importance to risk management and insurance stakeholders. With a priority on healthcare and cybersecurity, Kruse et al. (2017) conducted a systematic literature review. The authors identified 472 articles with the keywords ‘cybersecurity and healthcare’ or ‘ransomware’ in the databases Cumulative Index of Nursing and Allied Health Literature, PubMed and Proquest. Articles were eligible for this review if they satisfied three criteria: (1) they were published between 2006 and 2016, (2) the full-text version of the article was available, and (3) the publication was a peer-reviewed or scholarly journal. The authors found that technological development and federal policies (in the U.S.) are the main factors exposing the health sector to cyber risks. Loukas et al. (2013) conducted a review with a focus on cyber risks and cybersecurity in emergency management. The authors provided an overview of cyber risks in communication, sensor, information management and vehicle technologies used in emergency management and showed areas for which there is still no solution in the literature. Similarly, Ulven and Wangen (2021) reviewed the literature on cybersecurity risks in higher education institutions. For the literature review, the authors used the keywords ‘cyber’, ‘information threats’ or ‘vulnerability’ in connection with the terms ‘higher education’, ‘university’ or ‘academia’. A similar literature review with a focus on Internet of Things (IoT) cybersecurity was conducted by Lee et al. (2020).
The review revealed that qualitative approaches focus on high-level frameworks, and quantitative approaches to cybersecurity risk management focus on risk assessment and quantification of cyberattacks and impacts. In addition, the findings presented a four-step IoT cyber risk management framework that identifies, quantifies and prioritises cyber risks.

Datasets are an essential part of cybersecurity research, underlined by the following works. Ilhan Firat et al. ( 2021 ) examined various cybersecurity datasets in detail. The study was motivated by the fact that with the proliferation of the internet and smart technologies, the mode of cyberattacks is also evolving. However, in order to prevent such attacks, they must first be detected; the dissemination and further development of cybersecurity datasets is therefore critical. In their work, the authors observed studies of datasets used in intrusion detection systems. Khraisat et al. ( 2019 ) also identified a need for new datasets in the context of cybersecurity. The researchers presented a taxonomy of current intrusion detection systems, a comprehensive review of notable recent work, and an overview of the datasets commonly used for assessment purposes. In their conclusion, the authors noted that new datasets are needed because most machine-learning techniques are trained and evaluated on the knowledge of old datasets. These datasets do not contain new and comprehensive information and are partly derived from datasets from 1999. The authors noted that the core of this issue is the availability of new public datasets as well as their quality. The availability of data, how it is used, created and shared was also investigated by Zheng et al. ( 2018 ). The researchers analysed 965 cybersecurity research papers published between 2012 and 2016. They created a taxonomy of the types of data that are created and shared and then analysed the data collected via datasets. The researchers concluded that while datasets are recognised as valuable for cybersecurity research, the proportion of publicly available datasets is limited.

The main contributions of this review and what differentiates it from previous studies can be summarised as follows. First, as far as we can tell, it is the first work to summarise all available datasets on cyber risk and cybersecurity in the context of a systematic review and present them to the scientific community and cyber insurance and cybersecurity stakeholders. Second, we investigated, analysed, and made available the datasets to support efficient and timely progress in cyber risk research. And third, we enable comparability of datasets so that the appropriate dataset can be selected depending on the research area.

Methodology

Process and eligibility criteria

The structure of this systematic review is inspired by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework (Page et al. 2021 ), and the search was conducted from 3 to 10 May 2021. Due to the continuous development of cyber risks and their countermeasures, only articles published in the last 10 years were considered. In addition, only articles published in peer-reviewed journals written in English were included. As a final criterion, only articles that make use of one or more cybersecurity or cyber risk datasets met the inclusion criteria. Specifically, these studies presented new or existing datasets, used them for methods, or used them to verify new results, as well as analysed them in an economic context and pointed out their effects. The criterion was fulfilled if it was clearly stated in the abstract that one or more datasets were used. A detailed explanation of this selection criterion can be found in the ‘Study selection’ section.

Information sources

In order to cover a complete spectrum of literature, various databases were queried to collect relevant literature on the topic of cybersecurity and cyber risks. Due to the spread of related articles across multiple databases, the literature search was limited to the following four databases for simplicity: IEEE Xplore, Scopus, SpringerLink and Web of Science. This is similar to other literature reviews addressing cyber risks or cybersecurity, including Sardi et al. ( 2021 ), Franke and Brynielsson ( 2014 ), Lagerström (2019), Eling and Schnell ( 2016 ) and Eling ( 2020 ). In this paper, all databases used in the aforementioned works were considered. However, only two studies also used all the databases listed. The IEEE Xplore database contains electrical engineering, computer science, and electronics work from over 200 journals and three million conference papers (IEEE 2021 ). Scopus includes 23,400 peer-reviewed journals from more than 5000 international publishers in the areas of science, engineering, medicine, social sciences and humanities (Scopus 2021 ). SpringerLink contains 3742 journals and indexes over 10 million scientific documents (SpringerLink 2021 ). Finally, Web of Science indexes over 9200 journals in different scientific disciplines (Science 2021 ).

A search string was created and applied to all databases. To make the search efficient and reproducible, the following search string with Boolean operator was used in all databases: cybersecurity OR cyber risk AND dataset OR database. To ensure uniformity of the search across all databases, some adjustments had to be made for the respective search engines. In Scopus, for example, the Advanced Search was used, and the field code ‘Title-ABS-KEY’ was integrated into the search string. For IEEE Xplore, the search was carried out with the Search String in the Command Search and ‘All Metadata’. In the Web of Science database, the Advanced Search was used. The special feature of this search was that it had to be carried out in individual steps. The first search was carried out with the terms cybersecurity OR cyber risk with the field tag Topic (T.S. =) and the second search with dataset OR database. Subsequently, these searches were combined, which then delivered the searched articles for review. For SpringerLink, the search string was used in the Advanced Search under the category ‘Find the resources with all of the words’. After conducting this search string, 5219 studies could be found. According to the eligibility criteria (period, language and only scientific journals), 1581 studies were identified in the databases:

IEEE Xplore: 364

Scopus: 135

Springer Link: 548

Web of Science: 534
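The Boolean search string above can be mirrored in code to make the screening reproducible. A sketch assuming a plain full-text match (real database engines additionally apply field codes such as Title-ABS-KEY, which this deliberately ignores):

```python
def matches_search_string(text: str) -> bool:
    """Apply the review's Boolean query:
    (cybersecurity OR cyber risk) AND (dataset OR database)."""
    t = text.lower()
    topic = "cybersecurity" in t or "cyber risk" in t
    data = "dataset" in t or "database" in t
    return topic and data

# An abstract mentioning cyber risk and a public dataset matches;
# one mentioning only cybersecurity, with no data source, does not.
```

Encoding the query this way also documents the operator precedence (AND binding the two OR groups) that had to be reproduced manually in each database's advanced search.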

An overview of the process is given in Fig.  2 . Combined with the results from the four databases, 854 articles without duplicates were identified.
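The de-duplication step that reduced the combined hits to 854 unique articles can be sketched as a title-normalisation pass. DOI matching would be more robust in practice; normalised titles are used here purely for illustration, and the record fields are our own:

```python
import re

def deduplicate(records):
    """Merge results from several databases, keeping the first
    occurrence of each normalised title."""
    seen, unique = set(), []
    for rec in records:
        # Lowercase and collapse punctuation/whitespace so that minor
        # formatting differences between databases do not hide duplicates.
        key = re.sub(r"[^a-z0-9]+", " ", rec["title"].lower()).strip()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Keeping the first occurrence preserves the record from whichever database was queried first, which is sufficient when the metadata of interest (title, year, venue) agrees across sources.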

figure 2

Literature search process and categorisation of the studies

Study selection

In the final step of the selection process, the articles were screened for relevance. Due to a large number of results, the abstracts were analysed in the first step of the process. The aim was to determine whether the article was relevant for the systematic review. An article fulfilled the criterion if it was recognisable in the abstract that it had made a contribution to datasets or databases with regard to cyber risks or cybersecurity. Specifically, the criterion was considered to be met if the abstract used datasets that address the causes or impacts of cyber risks, and measures in the area of cybersecurity. In this process, the number of articles was reduced to 288. The articles were then read in their entirety, and an expert panel of six people decided whether they should be used. This led to a final number of 255 articles. The years in which the articles were published and the exact number can be seen in Fig.  3 .

figure 3

Distribution of studies

Data collection process and synthesis of the results

For the data collection process, various data were extracted from the studies, including the names of the respective creators, the name of the dataset or database and the corresponding reference. It was also determined where the data came from. In the context of accessibility, it was determined whether access is free, controlled, available for purchase or not available. It was also determined when the datasets were created and the time period referenced. The application type and domain characteristics of the datasets were identified.

Results

This section analyses the results of the systematic literature review. The previously identified studies are divided into three categories: datasets on the causes of cyber risks, datasets on the effects of cyber risks and datasets on cybersecurity. The classification is based on the intended use of the studies. This system of classification makes it easier for stakeholders to find the appropriate datasets. The categories are evaluated individually. Although complete information is available for a large proportion of datasets, this is not true for all of them. Accordingly, the abbreviation N/A has been inserted in the respective fields to indicate that this information could not be determined at the time of submission. The term ‘use cases in the literature’ in the following and supplementary tables refers to the application areas in which the corresponding datasets were used in the literature. The areas listed there refer to the topic area on which the researchers conducted their research. Since some datasets were used interdisciplinarily, the listed use cases in the literature are correspondingly longer. Before discussing each category in the next sections, Fig. 4 provides an overview of the number of datasets found and their year of creation. Figure 5 then shows the relationship between studies and datasets in the period under consideration. Figure 6 shows the distribution of studies, their use of datasets and their creation date. The number of datasets used is higher than the number of studies because the studies often used several datasets (Table 1).
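The extraction fields described in the methodology (creator, dataset name, reference, accessibility, creation date, period covered, application type and use cases) can be modelled as a simple record with N/A as the default wherever information could not be determined. The field names below are illustrative, not the authors' actual extraction schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """One row of a data-extraction sheet for the reviewed datasets."""
    name: str
    creator: str = "N/A"
    reference: str = "N/A"
    accessibility: str = "N/A"   # free | controlled | purchase | not available
    created: str = "N/A"
    period_covered: str = "N/A"
    application_type: str = "N/A"
    use_cases: list = field(default_factory=list)

# Example entry for a well-known dataset mentioned later in the review.
kdd99 = DatasetRecord(name="KDD99", accessibility="free",
                      use_cases=["intrusion detection"])
```

Defaulting every optional field to "N/A" matches the convention in the review's tables, where missing attributes are marked rather than omitted.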

figure 4

Distribution of dataset results

figure 5

Correlation between the studies and the datasets

figure 6

Distribution of studies and their use of datasets

Most of the datasets are generated in the U.S. (up to 58.2%). Canada and Australia rank next, with 11.3% and 5% of all the reviewed datasets, respectively.

Additionally, to create value for the datasets for the cyber insurance industry, an assessment of the applicability of each dataset has been provided for cyber insurers. This ‘Use Case Assessment’ includes the use of the data in the context of different analyses, calculation of cyber insurance premiums, and use of the information for the design of cyber insurance contracts or for additional customer services. To reasonably account for the transition of direct hyperlinks in the future, references were directed to the main websites for longevity (nearest resource point). In addition, the links to the main pages contain further information on the datasets and different versions related to the operating systems. The references were chosen in such a way that practitioners get the best overview of the respective datasets.

Cause datasets

This section presents selected articles that use the datasets to analyse the causes of cyber risks. The datasets help identify emerging trends and allow pattern discovery in cyber risks. This information gives cybersecurity experts and cyber insurers the data to make better predictions and take appropriate action. For example, if certain vulnerabilities are not adequately protected, cyber insurers will demand a risk surcharge leading to an improvement in the risk-adjusted premium. Due to the capricious nature of cyber risks, existing data must be supplemented with new data sources (for example, new events, new methods or security vulnerabilities) to determine prevailing cyber exposure. The datasets of cyber risk causes could be combined with existing portfolio data from cyber insurers and integrated into existing pricing tools and factors to improve the valuation of cyber risks.

A portion of these datasets consists of several taxonomies and classifications of cyber risks. Aassal et al. (2020) proposed a new taxonomy of phishing characteristics based on the interpretation and purpose of each characteristic. In comparison, Hindy et al. (2020) presented a taxonomy of network threats and the impact of current datasets on intrusion detection systems. A similar taxonomy was suggested by Kiwia et al. (2018), who presented a cyber kill chain-based taxonomy of banking Trojan features. The taxonomy was built on a real-world dataset of 127 banking Trojans collected between December 2014 and January 2016 by a major U.K.-based financial organisation.

In the context of classification, Aamir et al. (2021) showed the benefits of machine learning for classifying port scans and DDoS attacks in a mixture of normal and attack traffic. Guo et al. (2020) presented a new method to improve malware classification based on entropy sequence features; the evaluation of this method was conducted on different malware datasets.

To reconstruct attack scenarios and draw conclusions based on the evidence in the alert stream, Barzegar and Shajari (2018) used the DARPA2000 and MACCDC 2012 datasets for their research. Giudici and Raffinetti (2020) proposed a rank-based statistical model aimed at predicting the severity levels of cyber risk, using cyber risk data from the University of Milan. In contrast to the previous datasets, Skrjanc et al. (2018) used the older KDD99 dataset to monitor large-scale cyberattacks with a Cauchy clustering method.

Amin et al. (2021) used a cyberattack dataset from the Canadian Institute for Cybersecurity to identify spatial clusters of countries with high rates of cyberattacks. In the context of cybercrime, Junger et al. (2020) examined crime scripts, key characteristics of the target company and the relationship between criminal effort and financial benefit. For their study, the authors analysed 300 cases of fraudulent activities against Dutch companies. With a similar focus on cybercrime, Mireles et al. (2019) proposed a metric framework to measure the effectiveness of the dynamic evolution of cyberattacks and defensive measures. To validate its usefulness, they used the DEFCON dataset.

Due to the rapidly changing nature of cyber risks, it is often impossible to obtain all information on them. Kim and Kim (2019) proposed an automated dataset generation system called CTIMiner that collects threat data from publicly available security reports and malware repositories. They released a public dataset containing about 640,000 records from 612 security reports published between January 2008 and 2019. A similar approach was proposed by Kim et al. (2020), who used a named entity recognition system to extract core information from cyber threat reports automatically. They created a 498,000-tag dataset during their research.

Within the framework of vulnerabilities and cybersecurity issues, Ulven and Wangen (2021) provided an overview of mission-critical assets and everyday threat events, suggested a generic threat model, and summarised common cybersecurity vulnerabilities. With a focus on hospitality, Chen and Fiscus (2018) identified several cybersecurity issues in this sector, analysing 76 security incidents from the Privacy Rights Clearinghouse database. Supplementary Table 1 lists all findings that belong to the cyber cause datasets.

Impact datasets

This section outlines selected findings of the cyber impact dataset. For cyber insurers, these datasets can form an important basis for information, as they can be used to calculate cyber insurance premiums, evaluate specific cyber risks, formulate inclusions and exclusions in cyber wordings, and re-evaluate as well as supplement the data collected so far on cyber risks. For example, information on financial losses can help to better assess the loss potential of cyber risks. Furthermore, the datasets can provide insight into the frequency of occurrence of these cyber risks. The new datasets can be used to close any data gaps that were previously based on very approximate estimates or to find new results.

Eight studies addressed the costs of data breaches. For instance, Eling and Jung (2018) reviewed 3327 data breach events from 2005 to 2016 and identified an asymmetric dependence of monthly losses by breach type and industry, using datasets from the Privacy Rights Clearinghouse. The Privacy Rights Clearinghouse datasets and the Breach Level Index database were also used by De Giovanni et al. (2020) to describe relationships between data breaches and bitcoin-related variables using the cointegration methodology. The data were obtained from the Department of Health and Human Services, to which healthcare facilities report data breaches, and from a national database of technical and organisational infrastructure information. Also in the context of data breaches, Algarni et al. (2021) developed a comprehensive, formal model that estimates the two components of security risk: breach cost and the likelihood of a data breach within 12 months. For their survey, the authors used two industry reports from the Ponemon Institute and Verizon. To illustrate the scope of data breaches, Neto et al. (2021) identified 430 major data breach incidents among more than 10,000 incidents. The resulting database is publicly available and covers the period 2018 to 2019.
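Loss analyses such as Eling and Jung's typically start by aggregating individual breach records into losses per month and breach type. The following is a minimal sketch of that aggregation step only, with entirely hypothetical records and field names, not the authors' actual data or code:

```python
from collections import defaultdict

# Hypothetical breach records: (month, breach_type, loss_in_usd).
records = [
    ("2016-01", "hacking", 250_000.0),
    ("2016-01", "insider", 40_000.0),
    ("2016-02", "hacking", 900_000.0),
    ("2016-02", "portable_device", 15_000.0),
]

def monthly_losses_by_type(records):
    """Sum losses per (month, breach type) pair."""
    totals = defaultdict(float)
    for month, breach_type, loss in records:
        totals[(month, breach_type)] += loss
    return dict(totals)

totals = monthly_losses_by_type(records)
```

From such a table, dependence between breach types and industries can then be studied with the statistical methods the cited papers describe.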

With a direct focus on insurance, Biener et al. (2015) analysed 994 cyber loss cases from an operational risk database and investigated the insurability of cyber risks based on predefined criteria, using data from the company SAS OpRisk Global Data. Similarly, Eling and Wirfs (2019) looked at a wide range of cyber risk events and actual cost data using the same database; they identified cyber losses and analysed them using methods from statistics and actuarial science. Using a similar reference, Farkas et al. (2021) proposed a method for analysing cyber claims based on regression trees to identify criteria for classifying and evaluating claims. As in Chen and Fiscus (2018), the dataset used was the Privacy Rights Clearinghouse database. Within the framework of reinsurance, Moro (2020) analysed index-based information technology activity to assess whether index-parametric reinsurance coverage could be offered to a cedant, using data from a Symantec dataset.

Paté-Cornell et al. (2018) presented a general probabilistic risk analysis framework for cybersecurity that can be specified for a given organisation. The results are distributions of losses from cyberattacks, with and without the considered countermeasures, in support of risk management decisions based both on past data and on anticipated incidents. The data used came from the Common Vulnerabilities and Exposures database and from confidential access to a database of cyberattacks on a large, U.S.-based organisation. A different conceptual framework for cyber risk classification and assessment was proposed by Sheehan et al. (2021); it showed the importance of proactive and reactive barriers in reducing companies' exposure to cyber risk and in quantifying that risk. Another approach to cyber risk assessment and mitigation was proposed by Mukhopadhyay et al. (2019). They estimated the probability of an attack using generalised linear models, predicted the security technology required to reduce the probability of cyberattacks, and used gamma and exponential distributions to approximate the average loss data for each malicious attack. They also calculated the expected loss due to cyberattacks and the net premium a cyber insurer would need to charge, and suggested cyber insurance as a strategy to minimise losses. They used the CSI-FBI survey (1997–2010) to conduct their research.
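At its core, the pricing logic described by Mukhopadhyay et al. combines an attack probability with an average loss severity. The following is a simplified sketch of that expected-loss and net-premium arithmetic; all numbers are hypothetical, and the toy calculation does not reproduce the authors' fitted GLMs or gamma/exponential severity distributions:

```python
def expected_annual_loss(p_attack, mean_severity):
    """Expected loss = attack probability x average loss given an attack."""
    return p_attack * mean_severity

def net_premium(p_attack, mean_severity, loading=0.0):
    """Net premium = expected loss plus a loading for expenses and profit."""
    return expected_annual_loss(p_attack, mean_severity) * (1.0 + loading)

# Hypothetical figures: 30% annual attack probability, USD 120,000 mean
# severity (for a fitted gamma severity, the mean would be shape * scale).
eal = expected_annual_loss(0.3, 120_000)
premium = net_premium(0.3, 120_000, loading=0.25)
```

The quality of such a calculation depends entirely on how well the probability and severity inputs are estimated, which is precisely where the datasets surveyed here matter.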

In order to highlight the lack of data on cyber risks, Eling (2020) conducted a literature review in the areas of cyber risk and cyber insurance, filtering out the available information on the frequency, severity and dependence structure of cyber risks and raising open questions for future cyber risk research. Another example of data collection on the impact of cyberattacks is provided by Sornette et al. (2013), who used a database of newspaper articles, press reports and other media to develop a predictive method that identifies triggering events and potential accident scenarios and estimates their severity and frequency. A similar approach to data collection was used by Arcuri et al. (2020) to gather an original sample of global cyberattacks from newspaper reports sourced from the LexisNexis database. This collection is also used and applied to the fields of dynamic communication and cyber risk perception by Fang et al. (2021). To create a dataset of cyber incidents and disputes, Valeriano and Maness (2014) collected information on cyber interactions between rival states.

To assess trends and the scale of economic cybercrime, Levi (2017) examined datasets from different countries and their impact on crime policy. Pooser et al. (2018) investigated the trend in cyber risk identification from 2006 to 2015 and company characteristics related to cyber risk perception; the authors used a dataset of various reports from cyber insurers for their study. Walker-Roberts et al. (2020) investigated the spectrum of risk of a cybersecurity incident taking place in the cyber-physical-enabled world using the VERIS Community Database. The impact datasets identified are listed in Supplementary Table 2; due to overlap, some may also appear among the cause datasets.

Cybersecurity datasets

General intrusion detection

General intrusion detection systems account for the largest share of countermeasure datasets. For companies or researchers focused on cybersecurity, the datasets can be used to test their own countermeasures or to obtain information about potential vulnerabilities. For example, Al-Omari et al. (2021) proposed an intelligent intrusion detection model for predicting and detecting attacks in cyberspace, which was applied to the UNSW-NB15 dataset. A similar approach was taken by Choras and Kozik (2015), who used machine learning to detect cyberattacks on web applications, evaluated on the HTTP dataset CSIC 2010. For the identification of unknown attacks on web servers, Kamarudin et al. (2017) proposed an anomaly-based intrusion detection system using an ensemble classification approach. Ganeshan and Rodrigues (2020) showed an intrusion detection system approach that clusters the database into several groups and detects the presence of intrusion in the clusters. In comparison, AlKadi et al. (2019) used a localisation-based model to discover abnormal patterns in network traffic. Hybrid models have been recommended by Bhattacharya et al. (2020) and Agrawal et al. (2019); the former is a machine-learning model based on principal component analysis for the classification of intrusion detection system datasets, while the latter is a hybrid ensemble intrusion detection system for anomaly detection that uses different datasets to detect patterns in network traffic that deviate from normal behaviour.

Agarwal et al. (2021) used three different machine learning algorithms in their research to find the most suitable one for efficiently identifying patterns of suspicious network activity, using the UNSW-NB15 dataset. Kasongo and Sun (2020), with a feed-forward deep neural network (FFDNN), Keshk et al. (2021), with a privacy-preserving anomaly detection framework, and others also used the UNSW-NB15 dataset as part of intrusion detection systems. The same dataset and others were used by Binbusayyis and Vaiyapuri (2019) to identify and compare key features for cyber intrusion detection. Atefinia and Ahmadi (2021) proposed a deep neural network model to reduce the false positive rate of an anomaly-based intrusion detection system. Fossaceca et al. (2015) focused on developing a framework that combines the outputs of multiple learners to improve the efficacy of network intrusion detection, and Gauthama Raman et al. (2020) presented a search algorithm based on a support vector machine to improve detection performance and the false alarm rate of intrusion detection techniques. Ahmad and Alsemmeari (2020) targeted extreme learning machine techniques due to their strong capabilities in classification problems and in handling huge amounts of data, using the NSL-KDD dataset as a benchmark.
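Many of the anomaly-based systems cited above share a common core: learn a statistical profile of benign traffic, then flag observations that deviate from it. The following deliberately minimal sketch illustrates that idea on a single hypothetical feature (bytes per flow) with a z-score threshold; the actual cited systems use many features and trained models such as FFDNNs or support vector machines:

```python
from statistics import mean, stdev

def fit_baseline(benign_values):
    """Learn the mean and standard deviation of a feature from benign traffic."""
    return mean(benign_values), stdev(benign_values)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a flow whose feature lies more than `threshold` std devs from the mean."""
    mu, sigma = baseline
    return sigma > 0 and abs(value - mu) / sigma > threshold

# Hypothetical bytes-per-flow values observed in benign training traffic.
baseline = fit_baseline([500, 520, 480, 510, 495, 505])
flagged = is_anomalous(50_000, baseline)   # exfiltration-sized flow
normal = is_anomalous(498, baseline)       # ordinary flow
```

Benchmark datasets such as UNSW-NB15 or NSL-KDD are what allow the false positive and detection rates of such decision functions to be measured and compared across studies.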

With reference to prediction, Bakdash et al. (2018) used datasets from the U.S. Department of Defense to predict cyberattacks by malware; the dataset consists of weekly counts of cyber events over approximately seven years. Another prediction method was presented by Fan et al. (2018), who showed an improved integrated cybersecurity prediction method based on spatial-time analysis. Similarly, Ashtiani and Azgomi (2014) proposed a framework for the distributed simulation of cyberattacks based on high-level architecture. Kirubavathi and Anitha (2016) recommended an approach to detect botnets, irrespective of their structure, based on network traffic flow behaviour analysis and machine-learning techniques. Dwivedi et al. (2021) introduced a multi-parallel adaptive technique that uses an adaptation mechanism in a group of swarms for network intrusion detection. AlEroud and Karabatis (2018) presented an approach that used contextual information to automatically identify and query possible semantic links between different types of suspicious activities extracted from network flows.

Intrusion detection systems with a focus on IoT

In addition to general intrusion detection systems, a proportion of studies focused on IoT. Habib et al. (2020) presented an approach for converting traditional intrusion detection systems into smart intrusion detection systems for IoT networks. To enhance the diagnostic detection of possible vulnerabilities within an IoT system, Georgescu et al. (2019) introduced a method that uses a named entity recognition-based solution. With regard to IoT in the smart home sector, Heartfield et al. (2021) presented a detection system that is able to autonomously adjust the decision function of its underlying anomaly classification models to a smart home's changing conditions. Another intrusion detection system was suggested by Keserwani et al. (2021), which combined Grey Wolf Optimization and Particle Swarm Optimization to identify various attacks on IoT networks; the KDD Cup 99, NSL-KDD and CICIDS-2017 datasets were used to evaluate the model. Abu Al-Haija and Zein-Sabatto (2020) provided a comprehensive development of a new intelligent and autonomous deep-learning-based detection and classification system for cyberattacks in IoT communication networks, abbreviated IoT-IDCS-CNN (IoT-based Intrusion Detection and Classification System using Convolutional Neural Networks), which leverages the power of convolutional neural networks; the authors used the NSL-KDD dataset for evaluation. Biswas and Roy (2021) recommended a model that identifies malicious botnet traffic using novel deep-learning approaches such as artificial neural networks, gated recurrent units and long short-term memory models, and tested it with the Bot-IoT dataset.

With a more forensic background, Koroniotis et al. (2020) submitted a network forensic framework that describes the digital investigation phases for identifying and tracing attack behaviours in IoT networks. The suggested work was evaluated with the Bot-IoT and UNSW-NB15 datasets. With a focus on big data and IoT, Chhabra et al. (2020) presented a cyber forensic framework for big data analytics in an IoT environment using machine learning and mentioned different publicly available datasets for machine-learning models.

A stronger focus on mobile phones was exhibited by Alazab et al. (2020), who presented a classification model combining permission requests and application programming interface calls; the model was tested with a malware dataset containing 27,891 Android apps. A similar approach was taken by Li et al. (2019a, b), who proposed a reliable classifier for Android malware detection based on a factorisation machine architecture and the extraction of Android app features from manifest files and source code.

Literature reviews

In addition to the different methods and models for intrusion detection systems, various literature reviews on the methods and datasets were also found. Liu and Lang (2019) proposed a taxonomy of intrusion detection systems that uses data objects as the main dimension to classify and summarise the machine learning and deep learning-based intrusion detection literature. They also presented four different benchmark datasets for machine-learning detection systems. Ahmed et al. (2016) presented an in-depth analysis of four major categories of anomaly detection techniques: classification, statistical, information theory and clustering. Hajj et al. (2021) gave a comprehensive overview of anomaly-based intrusion detection systems, covering the requirements, methods, measurements and datasets used in such systems.

Within the framework of machine learning, Chattopadhyay et al. (2018) conducted a comprehensive review and meta-analysis of the application of machine-learning techniques in intrusion detection systems. They also compared different machine-learning techniques across different datasets and summarised their performance. Vidros et al. (2017) presented an overview of characteristics and methods for the automatic detection of online recruitment fraud and published an available dataset of 17,880 annotated job ads retrieved from a real-life system. An empirical study of different unsupervised learning algorithms used in the detection of unknown attacks was presented by Meira et al. (2020).

New datasets

Kilincer et al. (2021) reviewed different intrusion detection system datasets in detail, taking a closer look at the UNSW-NB15, ISCX-2012, NSL-KDD and CIDDS-001 datasets. Stojanovic et al. (2020) also provided a review of datasets and their creation for use in advanced persistent threat detection. Another review of datasets was provided by Sarker et al. (2020), who focused on cybersecurity data science as part of their research and provided an overview from a machine-learning perspective. Avila et al. (2021) conducted a systematic literature review on the use of security logs for data leak detection. They recommended a new classification of information leaks based on the GDPR principles, identified the most widely used publicly available datasets for threat detection, and described the attack types in the datasets and the algorithms used for data leak detection. Tuncer et al. (2020) presented a bytecode-based detection method consisting of feature extraction using local neighbourhood binary patterns, choosing a byte-based malware dataset to investigate the method's performance. With a different focus, Mauro et al. (2020) gave an experimental overview of neural-based techniques relevant to intrusion detection, assessing the value of neural networks using the Bot-IoT and UNSW-NB15 datasets.

Another category of results in the context of countermeasure datasets is those that were presented as new. Moreno et al. (2018) developed a database of 300 security-related accidents from European and American sources, containing cybersecurity-related events in the chemical and process industry. Damasevicius et al. (2020) proposed a new dataset (LITNET-2020) for network intrusion detection: an annotated network benchmark dataset obtained from a real-world academic network, presenting real-world examples of normal and under-attack network traffic. With a focus on IoT intrusion detection systems, Alsaedi et al. (2020) proposed a new benchmark IoT/IIoT dataset for assessing intrusion detection system-enabled IoT systems. Also in the context of IoT, Vaccari et al. (2020) proposed a dataset focusing on message queue telemetry transport protocols, which can be used to train machine-learning models. To evaluate the performance of machine-learning classifiers, Mahfouz et al. (2020) created a dataset called Game Theory and Cybersecurity (GTCS). A dataset containing 22,000 malware and benign samples was constructed by Martin et al. (2019); it can be used as a benchmark for testing Android malware classification and clustering techniques. In addition, Laso et al. (2017) presented a dataset created to investigate how data and information quality estimates enable the detection of anomalies and malicious acts in cyber-physical systems. The dataset contains various cyberattacks and is publicly available.

In addition to the results described above, several other studies were found that fit into the category of countermeasures. Johnson et al. (2016) examined the time between vulnerability disclosures. Using another vulnerability database, Common Vulnerabilities and Exposures (CVE), Subroto and Apriyana (2019) presented an algorithmic model that uses big data analysis of social media and statistical machine learning to predict cyber risks. A similar database, the Common Vulnerability Scoring System, was used with a different focus by Chatterjee and Thekdi (2020) to present an iterative, data-driven learning approach to vulnerability assessment and management for complex systems. Using the CICIDS2017 dataset to evaluate performance, Malik et al. (2020) proposed a control-plane-based orchestration for varied, sophisticated threats and attacks. The same dataset was used in another study by Lee et al. (2019), who developed an artificial security information event management system based on a combination of event profiling for data processing and different artificial network methods. To exploit the interdependence between multiple time series, Fang et al. (2021) proposed a statistical framework and validated it on a dataset of enterprise-level security breaches from the Privacy Rights Clearinghouse and the Identity Theft Center database. Another framework with a defensive aspect was recommended by Li et al. (2021) to increase the robustness of deep neural networks against adversarial malware evasion attacks. Finally, Sarabi et al. (2016) investigated whether and to what extent business details can help assess an organisation's risk of data breaches and the distribution of risk across different types of incidents, in order to create policies for protection, detection and recovery from different forms of security incidents; they used data from the VERIS Community Database.

Datasets that have been classified into the cybersecurity category are detailed in Supplementary Table 3. Due to overlap, records from the previous tables may also be included.

This paper presented a systematic literature review of studies on cyber risk and cybersecurity that used datasets. Within this framework, 255 studies were fully reviewed and then classified into three different categories, and 79 datasets were consolidated from these studies. These datasets were subsequently analysed, and important information was selected through a filtering process. This information was recorded in a table and enhanced with further information as part of the literature analysis, making it possible to create a comprehensive overview of the datasets. For example, each dataset entry contains a description of where the data came from and how the data have been used to date. This allows different datasets to be compared and the appropriate dataset for a given use case to be selected. This research certainly has limitations, so our selection of datasets cannot necessarily be taken as representative of all available datasets related to cyber risks and cybersecurity. For example, literature searches were conducted in four academic databases, and only datasets that were used in the literature were found. Many research projects also used old datasets that may no longer reflect current developments. In addition, the data are often focused on only one observation and are limited in scope; the datasets can only be applied to specific contexts and are also subject to further limitations (e.g. region, industry, operating system). In the context of the applicability of the datasets, it is unfortunately not possible to make a clear statement on the extent to which they can be integrated into academic or practical areas of application, or how great this effort would be. Finally, it should be pointed out that this is an overview of currently available datasets, which are subject to constant change.

Due to the lack of datasets on cyber risks in the academic literature, additional datasets on cyber risks were integrated as part of a further search. The search was conducted on the Google Dataset search portal. The search term used was ‘cyber risk datasets’. Over 100 results were found. However, due to the low significance and verifiability, only 20 selected datasets were included. These can be found in Table 2  in the “ Appendix ”.

The results of the literature review and datasets also showed that there continues to be a lack of available, open cyber datasets. This lack of data is reflected in cyber insurance, for example, as it is difficult to find a risk-based premium without a sufficient database (Nurse et al. 2020). The global cyber insurance market was estimated at USD 5.5 billion in 2020 (Dyson 2020). When compared to the USD 1 trillion global losses from cybercrime (Maleks Smith et al. 2020), it is clear that there exists a significant cyber risk awareness challenge for both the insurance industry and international commerce. Without comprehensive and qualitative data on cyber losses, it can be difficult to estimate potential losses from cyberattacks and price cyber insurance accordingly (GAO 2021). For instance, the average cyber insurance loss increased from USD 145,000 in 2019 to USD 359,000 in 2020 (FitchRatings 2021). Cyber insurance is an important risk management tool to mitigate the financial impact of cybercrime, which is particularly evident in the impact on different industries. In the energy and commodities markets, a ransomware attack on the Colonial Pipeline had a substantial impact on the U.S. economy: as a result of the attack, about 45% of the U.S. East Coast was temporarily unable to obtain supplies of diesel, petrol and jet fuel. This caused the average fuel price in the U.S. to rise 7 cents to USD 3.04 per gallon, the highest in seven years (Garber 2021). In addition, Colonial Pipeline confirmed that it paid a USD 4.4 million ransom to a hacker gang after the attack. Another ransomware attack occurred in the healthcare and government sector; the victim was the Irish Health Service Executive (HSE), and a ransom payment of USD 20 million was demanded from the Irish government to restore services after the hack (Tidy 2021).
In the car manufacturing sector, Miller and Valasek (2015) initiated a cyberattack that resulted in the recall of 1.4 million vehicles and cost manufacturers EUR 761 million. The risk that arises in the context of these events is the potential for the accumulation of cyber losses, which is why cyber insurers are not expanding their capacity. An example of this accumulation of cyber risks is the NotPetya malware attack, which originated in Russia, struck in Ukraine, and rapidly spread around the world, causing at least USD 10 billion in damage (GAO 2021). These events highlight the importance of proper cyber risk management.

This research provides cyber insurance stakeholders with an overview of cyber datasets. Cyber insurers can use the open datasets to improve their understanding and assessment of cyber risks. For example, the impact datasets can be used to better measure financial impacts and their frequencies. These data could be combined with existing portfolio data from cyber insurers and integrated with existing pricing tools and factors to better assess cyber risk valuation. Most cyber insurers have some historical cyber policy and claims data, but these data remain too sparse at present for accurate prediction (Bessy-Roland et al. 2021). A combination of portfolio data and external datasets would support risk-adjusted pricing for cyber insurance, which would also benefit policyholders. In addition, cyber insurance stakeholders can use the datasets to identify patterns and make better predictions, which would support sustainable cyber insurance coverage. In terms of cyber risk cause datasets, cyber insurers can use the data to review their insurance products. For example, the data could provide information on which cyber risks have not been sufficiently considered in product design or where improvements are needed. A combination of cyber cause and cybersecurity datasets can help establish uniform definitions to provide greater transparency and clarity. Consistent terminology could lead to a more sustainable cyber market, where cyber insurers make informed decisions about the level of coverage and policyholders understand their coverage (The Geneva Association 2020).

In addition to the cyber insurance community, this research also supports cybersecurity stakeholders. The reviewed literature can be used to provide a contemporary, contextual and categorised summary of available datasets. This supports efficient and timely progress in cyber risk research and is beneficial given the dynamic nature of cyber risks. With the help of the described cybersecurity datasets and the identified information, a comparison of different datasets is possible. The datasets can be used to evaluate the effectiveness of countermeasures in simulated cyberattacks or to test intrusion detection systems.

In this paper, we conducted a systematic review of studies on cyber risk and cybersecurity databases. We found that most of the datasets are in the field of intrusion detection and machine learning and are used for technical cybersecurity aspects; datasets on cyber risks were comparatively underrepresented. Due to the dynamic nature of cyber risk and the lack of historical data, assessing and understanding it is a major challenge for cyber insurance stakeholders. To address this challenge, a greater density of cyber data is needed to support cyber insurers in risk management and researchers working on cyber risk-related topics. With reference to the 'Open Science' FAIR data principles (Jacobsen et al. 2020), mandatory reporting of cyber incidents could help improve cyber understanding, awareness and loss prevention among companies and insurers. Through greater availability of data, cyber risks can be better understood, enabling researchers to conduct more in-depth research into these risks. Companies could incorporate this new knowledge into their corporate culture to reduce cyber risks. For insurance companies, this would have the advantage that all insurers share the same understanding of cyber risks, which would support sustainable risk-based pricing. In addition, common definitions of cyber risks could be derived from new data.

The cybersecurity databases summarised and categorised in this research offer a complementary perspective on cyber risks that could support the formulation of common definitions in cyber policies. The datasets can help companies that address cybersecurity and cyber risk as part of their risk management to assess their internal cyber posture and cybersecurity measures. The paper can also help improve risk awareness and corporate behaviour, and it provides the research community with a comprehensive overview of peer-reviewed and other available datasets in the area of cyber risk and cybersecurity. This approach is intended to support the free availability of data for research. The complete tabulated review of the literature is included in the Supplementary Material.

This work suggests several directions for future research. First, few datasets for cyber risk and cybersecurity are currently publicly available. The older datasets that are still widely used no longer reflect today's technical environment; moreover, they can often be used in only one context, and their samples are very limited in scope. It would be of great value if more publicly available datasets reflected current environmental conditions. This could help intrusion detection systems account for current events and thus achieve higher success rates, and it could compensate for the disadvantages of older datasets by providing larger samples and broader contextualisation. Another area of research is the integrability and adaptability of cybersecurity and cyber risk datasets: it is often unclear to what extent datasets can be integrated with, or adapted to, existing data. It would be helpful to know what requirements must be met to use the datasets appropriately, and whether datasets can be modified for use in cyber risk or cybersecurity settings. Finally, the ability for stakeholders to identify machine-readable cybersecurity datasets would be useful, as it would allow clearer delineations and comparisons between datasets. Because publicly available datasets are lacking, concrete benchmarks often cannot be applied.
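
A first, mechanical step toward the integrability question raised above is simply comparing the schemas of two datasets before attempting any join. The two inline CSV fragments below are invented examples (column names such as `src_ip` and `duration` are hypothetical); the point is only the schema-diff pattern.

```python
import csv
import io

# Two hypothetical datasets whose integrability we want to assess by
# comparing their column schemas before attempting a join.
ds_a = io.StringIO(
    "timestamp,src_ip,dst_ip,bytes,label\n"
    "2021-01-01,10.0.0.1,10.0.0.2,512,benign\n"
)
ds_b = io.StringIO(
    "timestamp,src_ip,duration,label\n"
    "2021-01-01,10.0.0.1,3.2,attack\n"
)

cols_a = next(csv.reader(ds_a))  # header row of dataset A
cols_b = next(csv.reader(ds_b))  # header row of dataset B

shared = sorted(set(cols_a) & set(cols_b))
only_a = sorted(set(cols_a) - set(cols_b))
only_b = sorted(set(cols_b) - set(cols_a))

print("shared columns:", shared)  # candidate join keys / common features
print("only in A:", only_a)       # need mapping or imputation in B
print("only in B:", only_b)
```

A machine-readable schema published alongside each dataset would let this kind of check run automatically across many candidate datasets, which speaks to the benchmarking gap noted above.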


Aamir, M., S.S.H. Rizvi, M.A. Hashmani, M. Zubair, and J. Ahmad. 2021. Machine learning classification of port scanning and DDoS attacks: A comparative analysis. Mehran University Research Journal of Engineering and Technology 40 (1): 215–229. https://doi.org/10.22581/muet1982.2101.19 .


Aamir, M., and S.M.A. Zaidi. 2019. DDoS attack detection with feature engineering and machine learning: The framework and performance evaluation. International Journal of Information Security 18 (6): 761–785. https://doi.org/10.1007/s10207-019-00434-1 .

Aassal, A. El, S. Baki, A. Das, and R.M. Verma. 2020. An in-depth benchmarking and evaluation of phishing detection research for security needs. IEEE Access 8: 22170–22192. https://doi.org/10.1109/ACCESS.2020.2969780 .

Abu Al-Haija, Q., and S. Zein-Sabatto. 2020. An efficient deep-learning-based detection and classification system for cyber-attacks in IoT communication networks. Electronics 9 (12): 26. https://doi.org/10.3390/electronics9122152 .

Adhikari, U., T.H. Morris, and S.Y. Pan. 2018. Applying Hoeffding adaptive trees for real-time cyber-power event and intrusion classification. IEEE Transactions on Smart Grid 9 (5): 4049–4060. https://doi.org/10.1109/tsg.2017.2647778 .

Agarwal, A., P. Sharma, M. Alshehri, A.A. Mohamed, and O. Alfarraj. 2021. Classification model for accuracy and intrusion detection using machine learning approach. PeerJ Computer Science . https://doi.org/10.7717/peerj-cs.437 .

Agrafiotis, I., J.R.C. Nurse, M. Goldsmith, S. Creese, and D. Upton. 2018. A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. Journal of Cybersecurity 4: tyy006.

Agrawal, A., S. Mohammed, and J. Fiaidhi. 2019. Ensemble technique for intruder detection in network traffic. International Journal of Security and Its Applications 13 (3): 1–8. https://doi.org/10.33832/ijsia.2019.13.3.01 .

Ahmad, I., and R.A. Alsemmeari. 2020. Towards improving the intrusion detection through ELM (extreme learning machine). CMC Computers Materials & Continua 65 (2): 1097–1111. https://doi.org/10.32604/cmc.2020.011732 .

Ahmed, M., A.N. Mahmood, and J.K. Hu. 2016. A survey of network anomaly detection techniques. Journal of Network and Computer Applications 60: 19–31. https://doi.org/10.1016/j.jnca.2015.11.016 .

Al-Jarrah, O.Y., O. Alhussein, P.D. Yoo, S. Muhaidat, K. Taha, and K. Kim. 2016. Data randomization and cluster-based partitioning for Botnet intrusion detection. IEEE Transactions on Cybernetics 46 (8): 1796–1806. https://doi.org/10.1109/TCYB.2015.2490802 .

Al-Mhiqani, M.N., R. Ahmad, Z.Z. Abidin, W. Yassin, A. Hassan, K.H. Abdulkareem, N.S. Ali, and Z. Yunos. 2020. A review of insider threat detection: Classification, machine learning techniques, datasets, open challenges, and recommendations. Applied Sciences—Basel 10 (15): 41. https://doi.org/10.3390/app10155208 .

Al-Omari, M., M. Rawashdeh, F. Qutaishat, M. Alshira’H, and N. Ababneh. 2021. An intelligent tree-based intrusion detection model for cyber security. Journal of Network and Systems Management 29 (2): 18. https://doi.org/10.1007/s10922-021-09591-y .

Alabdallah, A., and M. Awad. 2018. Using weighted Support Vector Machine to address the imbalanced classes problem of Intrusion Detection System. KSII Transactions on Internet and Information Systems 12 (10): 5143–5158. https://doi.org/10.3837/tiis.2018.10.027 .

Alazab, M., M. Alazab, A. Shalaginov, A. Mesleh, and A. Awajan. 2020. Intelligent mobile malware detection using permission requests and API calls. Future Generation Computer Systems—the International Journal of eScience 107: 509–521. https://doi.org/10.1016/j.future.2020.02.002 .

Albahar, M.A., R.A. Al-Falluji, and M. Binsawad. 2020. An empirical comparison on malicious activity detection using different neural network-based models. IEEE Access 8: 61549–61564. https://doi.org/10.1109/ACCESS.2020.2984157 .

AlEroud, A.F., and G. Karabatis. 2018. Queryable semantics to detect cyber-attacks: A flow-based detection approach. IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (2): 207–223. https://doi.org/10.1109/TSMC.2016.2600405 .

Algarni, A.M., V. Thayananthan, and Y.K. Malaiya. 2021. Quantitative assessment of cybersecurity risks for mitigating data breaches in business systems. Applied Sciences (switzerland) . https://doi.org/10.3390/app11083678 .

Alhowaide, A., I. Alsmadi, and J. Tang. 2021. Towards the design of real-time autonomous IoT NIDS. Cluster Computing—the Journal of Networks Software Tools and Applications . https://doi.org/10.1007/s10586-021-03231-5 .

Ali, S., and Y. Li. 2019. Learning multilevel auto-encoders for DDoS attack detection in smart grid network. IEEE Access 7: 108647–108659. https://doi.org/10.1109/ACCESS.2019.2933304 .

AlKadi, O., N. Moustafa, B. Turnbull, and K.K.R. Choo. 2019. Mixture localization-based outliers models for securing data migration in cloud centers. IEEE Access 7: 114607–114618. https://doi.org/10.1109/ACCESS.2019.2935142 .

Allianz. 2021. Allianz Risk Barometer. https://www.agcs.allianz.com/content/dam/onemarketing/agcs/agcs/reports/Allianz-Risk-Barometer-2021.pdf . Accessed 15 May 2021.

Almiani, M., A. AbuGhazleh, A. Al-Rahayfeh, S. Atiewi, and A. Razaque. 2020. Deep recurrent neural network for IoT intrusion detection system. Simulation Modelling Practice and Theory 101: 102031. https://doi.org/10.1016/j.simpat.2019.102031 .

Alsaedi, A., N. Moustafa, Z. Tari, A. Mahmood, and A. Anwar. 2020. TON_IoT telemetry dataset: A new generation dataset of IoT and IIoT for data-driven intrusion detection systems. IEEE Access 8: 165130–165150. https://doi.org/10.1109/access.2020.3022862 .

Alsamiri, J., and K. Alsubhi. 2019. Internet of Things cyber attacks detection using machine learning. International Journal of Advanced Computer Science and Applications 10 (12): 627–634.

Alsharafat, W. 2013. Applying artificial neural network and eXtended classifier system for network intrusion detection. International Arab Journal of Information Technology 10 (3): 230–238.


Amin, R.W., H.E. Sevil, S. Kocak, G. Francia III., and P. Hoover. 2021. The spatial analysis of the malicious uniform resource locators (URLs): 2016 dataset case study. Information (switzerland) 12 (1): 1–18. https://doi.org/10.3390/info12010002 .

Arcuri, M.C., L.Z. Gai, F. Ielasi, and E. Ventisette. 2020. Cyber attacks on hospitality sector: Stock market reaction. Journal of Hospitality and Tourism Technology 11 (2): 277–290. https://doi.org/10.1108/jhtt-05-2019-0080 .

Arp, D., M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C.E.R.T. Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket. In NDSS 14: 23–26.

Ashtiani, M., and M.A. Azgomi. 2014. A distributed simulation framework for modeling cyber attacks and the evaluation of security measures. Simulation 90 (9): 1071–1102. https://doi.org/10.1177/0037549714540221 .

Atefinia, R., and M. Ahmadi. 2021. Network intrusion detection using multi-architectural modular deep neural network. Journal of Supercomputing 77 (4): 3571–3593. https://doi.org/10.1007/s11227-020-03410-y .

Avila, R., R. Khoury, R. Khoury, and F. Petrillo. 2021. Use of security logs for data leak detection: A systematic literature review. Security and Communication Networks 2021: 29. https://doi.org/10.1155/2021/6615899 .

Azeez, N.A., T.J. Ayemobola, S. Misra, R. Maskeliunas, and R. Damasevicius. 2019. Network Intrusion Detection with a Hashing Based Apriori Algorithm Using Hadoop MapReduce. Computers 8 (4): 15. https://doi.org/10.3390/computers8040086 .

Bakdash, J.Z., S. Hutchinson, E.G. Zaroukian, L.R. Marusich, S. Thirumuruganathan, C. Sample, B. Hoffman, and G. Das. 2018. Malware in the future forecasting of analyst detection of cyber events. Journal of Cybersecurity . https://doi.org/10.1093/cybsec/tyy007 .

Barletta, V.S., D. Caivano, A. Nannavecchia, and M. Scalera. 2020. Intrusion detection for in-vehicle communication networks: An unsupervised Kohonen SOM approach. Future Internet . https://doi.org/10.3390/FI12070119 .

Barzegar, M., and M. Shajari. 2018. Attack scenario reconstruction using intrusion semantics. Expert Systems with Applications 108: 119–133. https://doi.org/10.1016/j.eswa.2018.04.030 .

Bessy-Roland, Y., A. Boumezoued, and C. Hillairet. 2021. Multivariate Hawkes process for cyber insurance. Annals of Actuarial Science 15 (1): 14–39.

Bhardwaj, A., V. Mangat, and R. Vig. 2020. Hyperband tuned deep neural network with well posed stacked sparse AutoEncoder for detection of DDoS attacks in cloud. IEEE Access 8: 181916–181929. https://doi.org/10.1109/ACCESS.2020.3028690 .

Bhati, B.S., C.S. Rai, B. Balamurugan, and F. Al-Turjman. 2020. An intrusion detection scheme based on the ensemble of discriminant classifiers. Computers & Electrical Engineering 86: 9. https://doi.org/10.1016/j.compeleceng.2020.106742 .

Bhattacharya, S., S.S.R. Krishnan, P.K.R. Maddikunta, R. Kaluri, S. Singh, T.R. Gadekallu, M. Alazab, and U. Tariq. 2020. A novel PCA-firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics 9 (2): 16. https://doi.org/10.3390/electronics9020219 .

Bibi, I., A. Akhunzada, J. Malik, J. Iqbal, A. Musaddiq, and S. Kim. 2020. A dynamic DL-driven architecture to combat sophisticated android malware. IEEE Access 8: 129600–129612. https://doi.org/10.1109/ACCESS.2020.3009819 .

Biener, C., M. Eling, and J.H. Wirfs. 2015. Insurability of cyber risk: An empirical analysis. The   Geneva Papers on Risk and Insurance—Issues and Practice 40 (1): 131–158. https://doi.org/10.1057/gpp.2014.19 .

Binbusayyis, A., and T. Vaiyapuri. 2019. Identifying and benchmarking key features for cyber intrusion detection: An ensemble approach. IEEE Access 7: 106495–106513. https://doi.org/10.1109/ACCESS.2019.2929487 .

Biswas, R., and S. Roy. 2021. Botnet traffic identification using neural networks. Multimedia Tools and Applications . https://doi.org/10.1007/s11042-021-10765-8 .

Bouyeddou, B., F. Harrou, B. Kadri, and Y. Sun. 2021. Detecting network cyber-attacks using an integrated statistical approach. Cluster Computing—the Journal of Networks Software Tools and Applications 24 (2): 1435–1453. https://doi.org/10.1007/s10586-020-03203-1 .

Bozkir, A.S., and M. Aydos. 2020. LogoSENSE: A companion HOG based logo detection scheme for phishing web page and E-mail brand recognition. Computers & Security 95: 18. https://doi.org/10.1016/j.cose.2020.101855 .

Brower, D., and M. McCormick. 2021. Colonial pipeline resumes operations following ransomware attack. Financial Times .

Cai, H., F. Zhang, and A. Levi. 2019. An unsupervised method for detecting shilling attacks in recommender systems by mining item relationship and identifying target items. The Computer Journal 62 (4): 579–597. https://doi.org/10.1093/comjnl/bxy124 .

Cebula, J.J., M.E. Popeck, and L.R. Young. 2014. A Taxonomy of Operational Cyber Security Risks Version 2 .

Chadza, T., K.G. Kyriakopoulos, and S. Lambotharan. 2020. Learning to learn sequential network attacks using hidden Markov models. IEEE Access 8: 134480–134497. https://doi.org/10.1109/ACCESS.2020.3011293 .

Chatterjee, S., and S. Thekdi. 2020. An iterative learning and inference approach to managing dynamic cyber vulnerabilities of complex systems. Reliability Engineering and System Safety . https://doi.org/10.1016/j.ress.2019.106664 .

Chattopadhyay, M., R. Sen, and S. Gupta. 2018. A comprehensive review and meta-analysis on applications of machine learning techniques in intrusion detection. Australasian Journal of Information Systems 22: 27.

Chen, H.S., and J. Fiscus. 2018. The inhospitable vulnerability: A need for cybersecurity risk assessment in the hospitality industry. Journal of Hospitality and Tourism Technology 9 (2): 223–234. https://doi.org/10.1108/JHTT-07-2017-0044 .

Chhabra, G.S., V.P. Singh, and M. Singh. 2020. Cyber forensics framework for big data analytics in IoT environment using machine learning. Multimedia Tools and Applications 79 (23–24): 15881–15900. https://doi.org/10.1007/s11042-018-6338-1 .

Chiba, Z., N. Abghour, K. Moussaid, A. Elomri, and M. Rida. 2019. Intelligent approach to build a Deep Neural Network based IDS for cloud environment using combination of machine learning algorithms. Computers and Security 86: 291–317. https://doi.org/10.1016/j.cose.2019.06.013 .

Choras, M., and R. Kozik. 2015. Machine learning techniques applied to detect cyber attacks on web applications. Logic Journal of the IGPL 23 (1): 45–56. https://doi.org/10.1093/jigpal/jzu038 .

Chowdhury, S., M. Khanzadeh, R. Akula, F. Zhang, S. Zhang, H. Medal, M. Marufuzzaman, and L. Bian. 2017. Botnet detection using graph-based feature clustering. Journal of Big Data 4 (1): 14. https://doi.org/10.1186/s40537-017-0074-7 .

Cybersecurity and Infrastructure Security Agency (CISA). 2020. Cost of a cyber incident: Systematic review and cross-validation. https://www.cisa.gov/sites/default/files/publications/CISA-OCE_Cost_of_Cyber_Incidents_Study-FINAL_508.pdf .

D’Hooge, L., T. Wauters, B. Volckaert, and F. De Turck. 2019. Classification hardness for supervised learners on 20 years of intrusion detection data. IEEE Access 7: 167455–167469. https://doi.org/10.1109/access.2019.2953451 .

Damasevicius, R., A. Venckauskas, S. Grigaliunas, J. Toldinas, N. Morkevicius, T. Aleliunas, and P. Smuikys. 2020. LITNET-2020: An annotated real-world network flow dataset for network intrusion detection. Electronics 9 (5): 23. https://doi.org/10.3390/electronics9050800 .

De Giovanni, A.L.D., and M. Pirra. 2020. On the determinants of data breaches: A cointegration analysis. Decisions in Economics and Finance . https://doi.org/10.1007/s10203-020-00301-y .

Deng, L., D. Li, X. Yao, and H. Wang. 2019. Retracted Article: Mobile network intrusion detection for IoT system based on transfer learning algorithm. Cluster Computing 22 (4): 9889–9904. https://doi.org/10.1007/s10586-018-1847-2 .

Donkal, G., and G.K. Verma. 2018. A multimodal fusion based framework to reinforce IDS for securing Big Data environment using Spark. Journal of Information Security and Applications 43: 1–11. https://doi.org/10.1016/j.jisa.2018.10.001 .

Dunn, C., N. Moustafa, and B. Turnbull. 2020. Robustness evaluations of sustainable machine learning models against data Poisoning attacks in the Internet of Things. Sustainability 12 (16): 17. https://doi.org/10.3390/su12166434 .

Dwivedi, S., M. Vardhan, and S. Tripathi. 2021. Multi-parallel adaptive grasshopper optimization technique for detecting anonymous attacks in wireless networks. Wireless Personal Communications . https://doi.org/10.1007/s11277-021-08368-5 .

Dyson, B. 2020. COVID-19 crisis could be ‘watershed’ for cyber insurance, says Swiss Re exec. https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/covid-19-crisis-could-be-watershed-for-cyber-insurance-says-swiss-re-exec-59197154 . Accessed 7 May 2020.

EIOPA. 2018. Understanding cyber insurance—a structured dialogue with insurance companies. https://www.eiopa.europa.eu/sites/default/files/publications/reports/eiopa_understanding_cyber_insurance.pdf . Accessed 28 May 2018

Elijah, A.V., A. Abdullah, N.Z. JhanJhi, M. Supramaniam, and O.B. Abdullateef. 2019. Ensemble and deep-learning methods for two-class and multi-attack anomaly intrusion detection: An empirical study. International Journal of Advanced Computer Science and Applications 10 (9): 520–528.

Eling, M., and K. Jung. 2018. Copula approaches for modeling cross-sectional dependence of data breach losses. Insurance Mathematics & Economics 82: 167–180. https://doi.org/10.1016/j.insmatheco.2018.07.003 .

Eling, M., and W. Schnell. 2016. What do we know about cyber risk and cyber risk insurance? Journal of Risk Finance 17 (5): 474–491. https://doi.org/10.1108/jrf-09-2016-0122 .

Eling, M., and J. Wirfs. 2019. What are the actual costs of cyber risk events? European Journal of Operational Research 272 (3): 1109–1119. https://doi.org/10.1016/j.ejor.2018.07.021 .

Eling, M. 2020. Cyber risk research in business and actuarial science. European Actuarial Journal 10 (2): 303–333.

Elmasry, W., A. Akbulut, and A.H. Zaim. 2019. Empirical study on multiclass classification-based network intrusion detection. Computational Intelligence 35 (4): 919–954. https://doi.org/10.1111/coin.12220 .

Elsaid, S.A., and N.S. Albatati. 2020. An optimized collaborative intrusion detection system for wireless sensor networks. Soft Computing 24 (16): 12553–12567. https://doi.org/10.1007/s00500-020-04695-0 .

Estepa, R., J.E. Díaz-Verdejo, A. Estepa, and G. Madinabeitia. 2020. How much training data is enough? A case study for HTTP anomaly-based intrusion detection. IEEE Access 8: 44410–44425. https://doi.org/10.1109/ACCESS.2020.2977591 .

European Council. 2021. Cybersecurity: how the EU tackles cyber threats. https://www.consilium.europa.eu/en/policies/cybersecurity/ . Accessed 10 May 2021

Falco, G. et al. 2019. Cyber risk research impeded by disciplinary barriers. Science (American Association for the Advancement of Science) 366 (6469): 1066–1069.

Fan, Z.J., Z.P. Tan, C.X. Tan, and X. Li. 2018. An improved integrated prediction method of cyber security situation based on spatial-time analysis. Journal of Internet Technology 19 (6): 1789–1800. https://doi.org/10.3966/160792642018111906015 .

Fang, Z.J., M.C. Xu, S.H. Xu, and T.Z. Hu. 2021. A framework for predicting data breach risk: Leveraging dependence to cope with sparsity. IEEE Transactions on Information Forensics and Security 16: 2186–2201. https://doi.org/10.1109/tifs.2021.3051804 .

Farkas, S., O. Lopez, and M. Thomas. 2021. Cyber claim analysis using Generalized Pareto regression trees with applications to insurance. Insurance: Mathematics and Economics 98: 92–105. https://doi.org/10.1016/j.insmatheco.2021.02.009 .

Farsi, H., A. Fanian, and Z. Taghiyarrenani. 2019. A novel online state-based anomaly detection system for process control networks. International Journal of Critical Infrastructure Protection 27: 11. https://doi.org/10.1016/j.ijcip.2019.100323 .

Ferrag, M.A., L. Maglaras, S. Moschoyiannis, and H. Janicke. 2020. Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications 50: 19. https://doi.org/10.1016/j.jisa.2019.102419 .

Field, M. 2018. WannaCry cyber attack cost the NHS £92m as 19,000 appointments cancelled. https://www.telegraph.co.uk/technology/2018/10/11/wannacry-cyber-attack-cost-nhs-92m-19000-appointments-cancelled/ . Accessed 9 May 2018.

FitchRatings. 2021. U.S. Cyber Insurance Market Update (Spike in Claims Leads to Decline in 2020 Underwriting Performance). https://www.fitchratings.com/research/insurance/us-cyber-insurance-market-update-spike-in-claims-leads-to-decline-in-2020-underwriting-performance-26-05-2021 .

Fossaceca, J.M., T.A. Mazzuchi, and S. Sarkani. 2015. MARK-ELM: Application of a novel Multiple Kernel Learning framework for improving the robustness of network intrusion detection. Expert Systems with Applications 42 (8): 4062–4080. https://doi.org/10.1016/j.eswa.2014.12.040 .

Franke, U., and J. Brynielsson. 2014. Cyber situational awareness–a systematic review of the literature. Computers & security 46: 18–31.

Freeha, K., K.J. Hwan, M. Lars, and M. Robin. 2021. Data breach management: An integrated risk model. Information & Management 58 (1): 103392. https://doi.org/10.1016/j.im.2020.103392 .

Ganeshan, R., and P. Rodrigues. 2020. Crow-AFL: Crow based adaptive fractional lion optimization approach for the intrusion detection. Wireless Personal Communications 111 (4): 2065–2089. https://doi.org/10.1007/s11277-019-06972-0 .

GAO. 2021. CYBER INSURANCE—Insurers and policyholders face challenges in an evolving market. https://www.gao.gov/assets/gao-21-477.pdf . Accessed 16 May 2021.

Garber, J. 2021. Colonial Pipeline fiasco foreshadows impact of Biden energy policy. https://www.foxbusiness.com/markets/colonial-pipeline-fiasco-foreshadows-impact-of-biden-energy-policy . Accessed 4 May 2021.

Gauthama Raman, M.R., N. Somu, S. Jagarapu, T. Manghnani, T. Selvam, K. Krithivasan, and V.S. Shankar Sriram. 2020. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review 53 (5): 3255–3286. https://doi.org/10.1007/s10462-019-09762-z .

Gavel, S., A.S. Raghuvanshi, and S. Tiwari. 2021. Distributed intrusion detection scheme using dual-axis dimensionality reduction for Internet of things (IoT). Journal of Supercomputing . https://doi.org/10.1007/s11227-021-03697-5 .

GDPR.EU. 2021. FAQ. https://gdpr.eu/faq/ . Accessed 10 May 2021.

Georgescu, T.M., B. Iancu, and M. Zurini. 2019. Named-entity-recognition-based automated system for diagnosing cybersecurity situations in IoT networks. Sensors (switzerland) . https://doi.org/10.3390/s19153380 .

Giudici, P., and E. Raffinetti. 2020. Cyber risk ordering with rank-based statistical models. AStA Advances in Statistical Analysis . https://doi.org/10.1007/s10182-020-00387-0 .

Goh, J., S. Adepu, K.N. Junejo, and A. Mathur. 2016. A dataset to support research in the design of secure water treatment systems. In CRITIS.

Gong, X.Y., J.L. Lu, Y.F. Zhou, H. Qiu, and R. He. 2021. Model uncertainty based annotation error fixing for web attack detection. Journal of Signal Processing Systems for Signal Image and Video Technology 93 (2–3): 187–199. https://doi.org/10.1007/s11265-019-01494-1 .

Goode, S., H. Hoehle, V. Venkatesh, and S.A. Brown. 2017. USER compensation as a data breach recovery action: An investigation of the sony playstation network breach. MIS Quarterly 41 (3): 703–727.

Guo, H., S. Huang, C. Huang, Z. Pan, M. Zhang, and F. Shi. 2020. File entropy signal analysis combined with wavelet decomposition for malware classification. IEEE Access 8: 158961–158971. https://doi.org/10.1109/ACCESS.2020.3020330 .

Habib, M., I. Aljarah, and H. Faris. 2020. A Modified multi-objective particle swarm optimizer-based Lévy flight: An approach toward intrusion detection in Internet of Things. Arabian Journal for Science and Engineering 45 (8): 6081–6108. https://doi.org/10.1007/s13369-020-04476-9 .

Hajj, S., R. El Sibai, J.B. Abdo, J. Demerjian, A. Makhoul, and C. Guyeux. 2021. Anomaly-based intrusion detection systems: The requirements, methods, measurements, and datasets. Transactions on Emerging Telecommunications Technologies 32 (4): 36. https://doi.org/10.1002/ett.4240 .

Heartfield, R., G. Loukas, A. Bezemskij, and E. Panaousis. 2021. Self-configurable cyber-physical intrusion detection for smart homes using reinforcement learning. IEEE Transactions on Information Forensics and Security 16: 1720–1735. https://doi.org/10.1109/tifs.2020.3042049 .

Hemo, B., T. Gafni, K. Cohen, and Q. Zhao. 2020. Searching for anomalies over composite hypotheses. IEEE Transactions on Signal Processing 68: 1181–1196. https://doi.org/10.1109/TSP.2020.2971438

Hindy, H., D. Brosset, E. Bayne, A.K. Seeam, C. Tachtatzis, R. Atkinson, and X. Bellekens. 2020. A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access 8: 104650–104675. https://doi.org/10.1109/ACCESS.2020.3000179 .

Hong, W., D. Huang, C. Chen, and J. Lee. 2020. Towards accurate and efficient classification of power system contingencies and cyber-attacks using recurrent neural networks. IEEE Access 8: 123297–123309. https://doi.org/10.1109/ACCESS.2020.3007609 .

Husák, M., M. Zádník, V. Bartos, and P. Sokol. 2020. Dataset of intrusion detection alerts from a sharing platform. Data in Brief 33: 106530.

IBM Security. 2020. Cost of a Data breach Report. https://www.capita.com/sites/g/files/nginej291/files/2020-08/Ponemon-Global-Cost-of-Data-Breach-Study-2020.pdf . Accessed 19 May 2021.

IEEE. 2021. IEEE Quick Facts. https://www.ieee.org/about/at-a-glance.html . Accessed 11 May 2021.

Jaber, A.N., and S. Ul Rehman. 2020. FCM-SVM based intrusion detection system for cloud computing environment. Cluster Computing—the Journal of Networks Software Tools and Applications 23 (4): 3221–3231. https://doi.org/10.1007/s10586-020-03082-6 .

Jacobs, J., S. Romanosky, B. Edwards, M. Roytman, and I. Adjerid. 2019. Exploit prediction scoring system (epss). arXiv:1908.04856

Jacobsen, A. et al. 2020. FAIR principles: Interpretations and implementation considerations. Data Intelligence 2 (1–2): 10–29. https://doi.org/10.1162/dint_r_00024 .

Jahromi, A.N., S. Hashemi, A. Dehghantanha, R.M. Parizi, and K.K.R. Choo. 2020. An enhanced stacked LSTM method with no random initialization for malware threat hunting in safety and time-critical systems. IEEE Transactions on Emerging Topics in Computational Intelligence 4 (5): 630–640. https://doi.org/10.1109/TETCI.2019.2910243 .

Jang, S., S. Li, and Y. Sung. 2020. FastText-based local feature visualization algorithm for merged image-based malware classification framework for cyber security and cyber defense. Mathematics 8 (3): 13. https://doi.org/10.3390/math8030460 .

Javeed, D., T.H. Gao, and M.T. Khan. 2021. SDN-enabled hybrid DL-driven framework for the detection of emerging cyber threats in IoT. Electronics 10 (8): 16. https://doi.org/10.3390/electronics10080918 .

Johnson, P., D. Gorton, R. Lagerstrom, and M. Ekstedt. 2016. Time between vulnerability disclosures: A measure of software product vulnerability. Computers & Security 62: 278–295. https://doi.org/10.1016/j.cose.2016.08.004 .

Johnson, P., R. Lagerström, M. Ekstedt, and U. Franke. 2018. Can the common vulnerability scoring system be trusted? A Bayesian analysis. IEEE Transactions on Dependable and Secure Computing 15 (6): 1002–1015. https://doi.org/10.1109/TDSC.2016.2644614 .

Junger, M., V. Wang, and M. Schlömer. 2020. Fraud against businesses both online and offline: Crime scripts, business characteristics, efforts, and benefits. Crime Science 9 (1): 13. https://doi.org/10.1186/s40163-020-00119-4 .

Kalutarage, H.K., H.N. Nguyen, and S.A. Shaikh. 2017. Towards a threat assessment framework for apps collusion. Telecommunication Systems 66 (3): 417–430. https://doi.org/10.1007/s11235-017-0296-1 .

Kamarudin, M.H., C. Maple, T. Watson, and N.S. Safa. 2017. A LogitBoost-based algorithm for detecting known and unknown web attacks. IEEE Access 5: 26190–26200. https://doi.org/10.1109/ACCESS.2017.2766844 .

Kasongo, S.M., and Y.X. Sun. 2020. A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Computers & Security 92: 15. https://doi.org/10.1016/j.cose.2020.101752 .

Keserwani, P.K., M.C. Govil, E.S. Pilli, and P. Govil. 2021. A smart anomaly-based intrusion detection system for the Internet of Things (IoT) network using GWO–PSO–RF model. Journal of Reliable Intelligent Environments 7 (1): 3–21. https://doi.org/10.1007/s40860-020-00126-x .

Keshk, M., E. Sitnikova, N. Moustafa, J. Hu, and I. Khalil. 2021. An integrated framework for privacy-preserving based anomaly detection for cyber-physical systems. IEEE Transactions on Sustainable Computing 6 (1): 66–79. https://doi.org/10.1109/TSUSC.2019.2906657 .


Open Access funding provided by the IReL Consortium.

Author information

Authors and Affiliations

University of Limerick, Limerick, Ireland

Frank Cremer, Barry Sheehan, Arash N. Kia, Martin Mullins & Finbarr Murphy

TH Köln University of Applied Sciences, Cologne, Germany

Michael Fortmann & Stefan Materne


Corresponding author

Correspondence to Barry Sheehan .

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 334 kb)

Supplementary file 2 (DOCX 418 kb)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cremer, F., Sheehan, B., Fortmann, M. et al. Cyber risk and cybersecurity: a systematic review of data availability. Geneva Pap Risk Insur Issues Pract 47 , 698–736 (2022). https://doi.org/10.1057/s41288-022-00266-6


Received : 15 June 2021

Accepted : 20 January 2022

Published : 17 February 2022

Issue Date : July 2022

DOI : https://doi.org/10.1057/s41288-022-00266-6


Keywords

  • Cyber insurance
  • Systematic review
  • Cybersecurity

REVIEW article

When we talk about Big Data, what do we really mean? Toward a more precise definition of Big Data

Xiaoyao Han

  • 1 Department of Governance and Innovation, Campus Fryslan, University of Groningen, Leeuwarden, Netherlands
  • 2 Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, Groningen, Netherlands

Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. 
We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.

1 Introduction

As the amount of data being generated continues to grow over the last two decades ( Petroc, 2023 ), the need to process and analyze it became increasingly urgent. This led to the development of new tools and techniques for handling Big Data ( Khalid and Yousaf, 2021 ), such as distributed computing frameworks like Hadoop and Apache Spark, as well as machine learning algorithms like neural networks and decision trees. As researchers and practitioners in various fields began to integrate Big Data into their work, a body of literature emerged discussing its implementation, characteristics and features, as well as its potential benefits and limitations.
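The distributed computing frameworks mentioned above are built around the MapReduce programming model, in which a map phase emits key–value pairs and a reduce phase aggregates them. As a rough illustration only — a single-process Python simulation of the idea, not a real Hadoop or Spark job, and not prescribed by any of the cited works — a word count can be sketched as:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each document, as a mapper would."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum the counts for each word, as a reducer would."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Toy corpus standing in for the massive datasets such frameworks target
docs = ["big data needs new tools", "big data is big"]
word_counts = reduce_phase(map_phase(docs))
```

In an actual Hadoop or Spark deployment, the map and reduce phases would run in parallel across many machines; the programming model itself is what the sketch illustrates.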

The concept of Big Data has become increasingly important for scientific research. Its characteristics have been extensively discussed in the literature, and numerous researchers have summarized the definitions of Big Data and its closely related features ( Khan et al., 2014 ; Chebbi et al., 2015 ). Epistemological discussions are common ( Kitchin, 2014 ; Ekbia et al., 2015 ; Succi and Coveney, 2019 ), and the recent emergence of new quantitative approaches utilizing large text corpora offers new opportunities to understand Big Data from different perspectives. Hansmann and Niemeyer (2014) , for example, apply a text mining approach to extract and characterize the elements of the Big Data concept by analyzing 248 publications relevant to Big Data. They focus on the concept of the term Big Data and summarize four dimensions to describe it: the dimensions of data, IT infrastructure, applied methods, and an applications perspective. Van Altena et al. (2016) analyze a large number of articles from the biomedical literature to give a better understanding of the term Big Data by extracting relevant topics through topic modeling. Akoka et al. (2017) review studies using a systematic mapping approach based on more than a decade of academic papers relevant to Big Data. Their bibliometric review of Big Data publications shows that the research community has made a significant contribution to the field of Big Data, which is further evidenced by the continued increase in the number of scientific publications dealing with the concept.
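As a deliberately simplified, hypothetical stand-in for the corpus-based approaches cited above — real studies apply text mining and proper topic models (e.g., LDA) to hundreds or thousands of publications — the core idea of extracting prominent terms from a set of abstracts can be sketched as follows:

```python
import re
from collections import Counter

# Minimal stopword list for illustration; real pipelines use curated lists
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on"}

def top_terms(abstracts, n=3):
    """Return the n most frequent non-stopword terms across all abstracts."""
    tokens = []
    for text in abstracts:
        tokens.extend(t for t in re.findall(r"[a-z]+", text.lower())
                      if t not in STOPWORDS)
    return [term for term, _ in Counter(tokens).most_common(n)]

# Invented example abstracts, not drawn from any cited corpus
abstracts = [
    "Big data analytics in healthcare",
    "Distributed frameworks for big data",
    "Big data and machine learning",
]
```

Raw term frequency is far cruder than topic modeling, which clusters co-occurring terms into latent topics; the sketch only conveys the quantitative, corpus-driven spirit of these approaches.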

Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. Many authors endeavor to provide comprehensive definitions of Big Data based on their research, aiming to better capture its essence. However, a universally accepted definition is yet to emerge. Instead, a commonly accepted description portrays Big Data through various “V” characteristics (volume, variety, velocity, etc.). Nonetheless, this widely accepted description does not ensure a profound common ground for discussing Big Data in different contexts. As a result, Big Data is still perceived as a broad and vague concept, making it difficult to grasp in interdisciplinary contexts. Over the past few decades, there has been little systematic research on the position and practical implications of the term Big Data in research environments. When different authors spend significant portions of their studies discussing similar definitions and characteristics of Big Data, they are actually referring to different concepts (technology, platforms, or datasets), which is often not explicitly stated. Exploring this ambiguity is crucial, as it forms the basis for research and discussion. Many reviews aim to enrich a comprehensive understanding of Big Data, rather than retrospectively observing its actual usage in different contexts. We believe that inspecting the current usage of Big Data will help to clarify these ambiguities and facilitate clearer communication among researchers by providing a framework for understanding what Big Data entails in different contexts.

To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies ( Kitchenham et al., 2009 , 2010 ) to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term.

The rest of this paper is structured as follows: we first present our systematic literature review methodology. The results are then divided into four sections: an overview of the bibliographic information of the extracted articles, a summary of the most prominent technologies used, a discussion of the technologies that are considered to be Big Data, and a presentation of the perceived benefits and drawbacks of Big Data. Finally, before concluding, we discuss our findings in terms of the understanding of Big Data and its value, and suggest avenues for further research.

2 Methodology

This study employs a SLR methodology to investigate the use of Big Data in scientific research. Our review, which follows the guidelines for SLR defined by Kitchenham et al. (2009) and Kitchenham et al. (2010) , enables a structured and replicable procedure that increases the reliability of research findings. An SLR study is referred to as a tertiary study when it applies the SLR methodology on secondary studies, i.e. it surveys literature reviews instead of primary studies. We selected a tertiary approach for this study due to the extensive range of Big Data technologies and applications in research. Collecting primary data from a large number of papers across domains can be a laborious task. By analyzing secondary data sources instead, our approach offers a comprehensive and holistic overview of the landscape of Big Data in the scientific community. Adopting this approach allows us to better control the bibliographic selection and obtain a comprehensive overview of the field, thereby ensuring high-quality data analysis.

The research questions addressed by this paper are:

1. Which technologies are considered Big Data in different fields?

2. What does Big Data imply in the context of a scientific study?

3. What is the perceived impact of the adoption of Big Data in each research domain?

For the purpose of answering RQ1, we referenced the Big Data technology taxonomy developed by Patgiri (2019) . We anticipated that the reviewed papers would mention a broad and heterogeneous spectrum of technologies, techniques, and applications, which could pose a challenge in synthesizing the results. Therefore, we relied on Patgiri's taxonomy to guide our full-text review and data extraction process to identify the Big Data technologies utilized in the various scientific domains. This taxonomy offers a comprehensive overview of the various technical approaches to Big Data. By utilizing this taxonomy, we aim to initiate a discourse on what Big Data truly represents in each research domain, and how it has been understood by the respective scientific community. This discussion will enable us to further deliberate on RQ2, which examines the perception of Big Data in specific research domains. To structure our investigation for both research questions (RQ1 and RQ2), we use the subject taxonomy of Web of Science (WoS) to classify scientific fields and apply no limitations to the scope of included scientific fields. As a result, both questions encompass a wide range of Big Data usage.

RQ2 is rooted in the fact that the scope of Big Data technologies has yet to be strictly defined or to reach consensus. Through this question we aim to investigate which technologies or objects are perceived as constituting Big Data in specific research domains. By these means we attempt to shed light on how Big Data as a concept is implicitly understood in each domain by linking it to specific technologies. In addition to offering an overview of the technological landscape, RQ3 aims to delve deeper into the impact of Big Data in each domain by examining the benefits and drawbacks of its adoption, as reported by the examined secondary studies. By doing so, we seek to gain a comprehensive understanding of the role of Big Data in scientific domains.

2.1 Study design

This study uses the framework of Population Intervention Comparison Outcome and Context (PICOC) ( Petticrew and Roberts, 2008 ) to guide the research questions and the systematic literature review process, with the aim of exploring the use of Big Data technologies across various scientific domains and understanding the perceived meaning and value of Big Data in the scientific community. The use of the PICOC standard allowed for a structured approach to conducting the study and ensured that all relevant aspects were considered in the data collection, refinement, extraction, and analysis processes ( Mengist et al., 2020 ). The Population of interest consists of studies that explore the state of the art of Big Data in different research domains. Intervention refers to systematic secondary studies, such as systematic literature reviews, mapping studies, and surveys. The Comparison in this study is between the perception of Big Data as a concept and the adoption of related technologies. The desired Outcome of this study is a conceptual mapping of how the term Big Data is applied in research across domains. In terms of Context, the study focuses on research domains that use the term Big Data to explore the potential of its technologies and tools in advancing research, without any restriction on the type of domain.

Figure 1 summarizes the process followed by this SLR. Each of the steps in this process is discussed in more detail in the following.


Figure 1 . Process of the research methodology.

2.2 Data collection

Our aim for data collection was to explore as wide a range of studies in as many different domains as possible. As a result, we decided to retrieve candidate studies through queries on the online databases of Elsevier Scopus, Web of Science, and PubMed. Scopus is a database containing 77 million records from almost 5,000 international publishers ( Scopus Content Coverage Guide, 2023 ), Web of Science is one of the world's leading databases covering all domains of science ( Cumbley and Church, 2013 ), and PubMed is a comprehensive medicine and biomedical database ( Falagas et al., 2007 ). Our search strategy for all three databases, settled after a few stages of pilot searches, was to look for publications with the exact phrase “big data” and either technology, platform, application, or adoption in the title, along with review, survey, or “mapping study” in either the title or keywords.

The specific search query for each database is shown in Figure 2 . As the databases use slightly different tags to identify search fields, the search query expression varied but had the same conceptual content. For example, Scopus used the tag TITLE for the title field, whereas WoS used TI and PubMed used square brackets behind the term. For the keywords field, Scopus automatically combined author keywords, index terms, trade names, and chemical names in the tag KEY. WoS provided two relevant fields for keywords, AK representing author keywords and KP representing keywords derived from the titles of article references based on a special algorithm. PubMed placed author keywords and other terms indicating major concepts in the field Other Term. We obtained 490 results from Scopus, 116 from WoS, and 18 from PubMed, resulting in a total of 624 articles before removing duplicates.
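The per-database field tags described above can be illustrated with a small helper that assembles the same conceptual query for each database. This is a sketch only: the tag names (TITLE/KEY for Scopus, TI/AK for WoS) follow the description in the text, but the exact boolean syntax accepted by each search engine is an assumption, and PubMed's bracket notation is omitted for brevity.

```python
# Sketch: assemble the same conceptual query with per-database field tags.
# Tag names follow the description in the text; the exact operator syntax
# of each search engine is an assumption. PubMed (bracket tags) is omitted.

TITLE_TERMS = ['"big data"', "technology OR platform OR application OR adoption"]
TOPIC_TERMS = ['review OR survey OR "mapping study"']

def build_query(db: str) -> str:
    if db == "scopus":
        title, keys = "TITLE", "KEY"
    elif db == "wos":
        title, keys = "TI", "AK"
    else:
        raise ValueError(f"unsupported database: {db}")
    # Required title terms, combined with AND
    title_part = " AND ".join(f"{title}({t})" for t in TITLE_TERMS)
    # Review/survey/"mapping study" may appear in either title or keywords
    topic_part = " OR ".join(f"{f}({t})" for f in (title, keys) for t in TOPIC_TERMS)
    return f"{title_part} AND ({topic_part})"

print(build_query("scopus"))
```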


Figure 2 . Search queries and screening process.

2.3 Data selection

We established the inclusion and exclusion criteria listed in Table 1 to ensure the relevance of the retrieved studies included in our analysis. In terms of inclusion criteria, we only considered articles that were published in English between 2017 and 2022 (in order to ensure that the studies reflect the current state of the art in each field), and that were secondary studies. Specifically, we focused on comprehensive reviews or systematic literature reviews that provided an overview of Big Data adoption in a particular research field, rather than articles that focused solely on the application of Big Data technologies to address individual research questions or on the improvement of Big Data technologies themselves. Moreover, we only included articles that were published in their entirety in journals, rather than conference proceedings or abstracts. Conversely, we excluded early-access articles and articles not in their final stage to avoid potential bias or incomplete information. Publications in languages other than English were also excluded due to language barriers. Additionally, we excluded primary studies, as our aim was to focus on the adoption, application, and implications of Big Data technologies in specific scientific domains, rather than on specific research questions.


Table 1 . Inclusion and exclusion for data selection.

By implementing these criteria, we were able to ensure that the articles selected for our analysis were recent, relevant, and provided a comprehensive view of the Big Data landscape. Moreover, by focusing on secondary studies, we were able to avoid redundant coverage of primary research studies and instead obtain a broader perspective on the field. Ultimately, this approach allowed us to generate a more comprehensive and robust understanding of the different Big Data technologies, applications, and techniques being used in various scientific domains. This step resulted in a corpus of 76 records after removing the duplicates.
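The deduplication-and-criteria step described above (624 retrieved records reduced to 76 unique, criteria-matching ones) can be sketched as a simple filter. The record fields and the title-based duplicate key below are illustrative assumptions, not the study's actual screening tool.

```python
# Sketch of the selection step: drop duplicates across databases, then keep
# only English-language secondary journal studies from 2017-2022.
# Record fields and the title-based duplicate key are illustrative.

def select(records, year_range=(2017, 2022)):
    seen, kept = set(), []
    for r in records:
        key = r["title"].strip().lower()
        if key in seen:
            continue  # duplicate retrieved from more than one database
        seen.add(key)
        if (r["language"] == "en" and r["secondary"]
                and year_range[0] <= r["year"] <= year_range[1]
                and r["venue"] == "journal"):
            kept.append(r)
    return kept

records = [
    {"title": "Big Data in Health", "language": "en", "secondary": True, "year": 2020, "venue": "journal"},
    {"title": "big data in health ", "language": "en", "secondary": True, "year": 2020, "venue": "journal"},
    {"title": "BD Primer", "language": "en", "secondary": False, "year": 2021, "venue": "journal"},
]
print(len(select(records)))  # 1: one duplicate removed, one primary study excluded
```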

2.4 Quality assessment

To ensure that the papers selected for full-text review meet the requirements of our study, we then conducted a Quality Assessment (QA) based on five standards:

1. Is Big Data adoption discussed at sufficient depth?

2. Is a specific methodology being used in the secondary study?

3. Are the bibliographic sources used included in the study design?

4. Is the number of included primary studies clear in the study?

5. Are the results well-organized?

Firstly, we examined whether the papers provide an overview of Big Data adoption in a particular research field, excluding other aspects such as the application of Big Data technologies that focus on individual research questions or the improvement of Big Data Technologies themselves. Secondly, we assessed whether the selected papers clearly define their research methodology or base it on a specific paradigm. Thirdly, we evaluated whether the papers clearly specify their bibliographic sources for the secondary studies. Fourthly, we documented whether the number of primary papers that are contained in the secondary studies is also clearly stated. Finally, we evaluated whether the results of the papers are well-organized around research questions.

We scored the papers against these criteria in a binary fashion (fulfilled/not fulfilled) and, to be considered for inclusion in our study, set a threshold of at least three out of the five standards. While QA is not in principle meant to serve as an additional filtering step, we decided that, given the wide disparities observed in the quality of the collected secondary studies, setting this threshold was justified. The resulting selected articles primarily use systematic literature reviews, although some pure review articles that are relevant to our study have also been included. This step left 33 studies for further processing.
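The binary scoring and the three-out-of-five threshold amount to a short filter, which might be expressed as follows (the criterion names and paper records are illustrative placeholders, not the study's actual data):

```python
# Sketch: binary quality assessment with a >= 3/5 inclusion threshold.
# The five criteria mirror QA1-QA5 above; the paper records are illustrative.

QA_CRITERIA = ["depth", "methodology", "sources", "primary_count", "organized"]
THRESHOLD = 3

def qa_score(answers: dict) -> int:
    """Count fulfilled criteria (True = fulfilled)."""
    return sum(bool(answers.get(c, False)) for c in QA_CRITERIA)

def passes_qa(answers: dict) -> bool:
    return qa_score(answers) >= THRESHOLD

papers = {
    "S12": dict(depth=True, methodology=True, sources=True, primary_count=False, organized=True),
    "S99": dict(depth=True, methodology=False, sources=False, primary_count=False, organized=True),
}
selected = [pid for pid, ans in papers.items() if passes_qa(ans)]
print(selected)  # ['S12']: 4/5 is kept, 2/5 is dropped
```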

2.5 Snowballing

In addition to the papers collected, we employed the forward snowball sampling method ( Wohlin, 2014 ) to supplement our database. For each article under full-text review passing the QA step, we reviewed all the titles of its references and the articles citing it. We added records that met our inclusion and exclusion criteria and were relevant to our topics. As we are only interested in recent studies on the topic, we did not perform backward snowballing. Ultimately, we selected only one additional study for inclusion, resulting in 34 studies in total for data extraction. All authors actively contributed to each step of the research process, engaging in thorough discussions and collaborations. In every stage outlined above, one researcher took the lead and collaborated with the other two authors to ensure comprehensive coverage and rigorous analysis.

2.6 Data extraction

The data extraction form shown in Table 2 was designed to answer the research questions in this study. The extraction fields can be divided into three categories. The first part concerns demographics-related data to be extracted from each study in order to check for possible bias toward specific fields or publication venues. The second part is used to answer RQ1 and RQ2. For the Research Field aspect, we applied the subject classification from WoS to ensure consistency and homogeneity in the level of discipline described. This also enabled us to make cross-sectional comparisons and statistical summaries. Apart from the WoS categories, we added another field to specify the sub-level Application Domain in which the secondary study takes place. This helps us to better connect the technologies collected with their intended use cases. Because the technologies and applications we collected were described from a wide range of perspectives and suffered from heterogeneity issues, we added one more column to clarify and support the documentation, namely Perspective. This column records the perspective from which the Big Data technologies are described. For example, some papers only summarized the use cases of machine learning methods as applications of Big Data. In such cases, besides the concrete names of methods and models documented in the Technologies column (including specific Big Data applications or platforms), the perspective is marked as machine learning. This helps us to further analyze and categorize the discussed technologies. The third part captures the benefits and shortcomings of Big Data's perceived impact, as reported by the authors of the secondary studies. This helps us to answer RQ3.


Table 2 . Fields of data extraction form.
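The three-part form described above could be represented as a simple record type. The field names below paraphrase the text's description; the actual fields of Table 2, and the example values (year, venue), are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Sketch of one extraction record. Field names paraphrase the three parts
# of the form described in the text; example values are illustrative.
@dataclass
class ExtractionRecord:
    # Part 1: demographics (to check for bias toward fields or venues)
    study_id: str
    year: int
    venue: str
    # Part 2: RQ1/RQ2 -- classification and technologies
    research_field: str          # WoS subject category
    application_domain: str      # sub-level application domain
    technologies: list = field(default_factory=list)
    perspective: str = ""        # e.g. "machine learning"
    # Part 3: RQ3 -- perceived impact as reported by the authors
    benefits: list = field(default_factory=list)
    shortcomings: list = field(default_factory=list)

rec = ExtractionRecord("S194", 2021, "(journal)", "Transportation", "traffic analysis",
                       technologies=["Hadoop", "Spark"], perspective="Big Data ecosystem")
print(rec.research_field)
```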

2.7 Threats to validity

The present study faces several potential threats to its validity. Firstly, the inclusion and exclusion criteria employed could introduce bias into the sample. If the criteria are too narrow, relevant articles may be omitted, while if they are too broad, irrelevant articles may be included, resulting in a less representative sample. To mitigate this threat, we thoroughly reviewed and refined our criteria through piloting to ensure an optimal balance between inclusivity and exclusivity. Secondly, the quality assessment of the chosen articles could be influenced by researcher bias, as different researchers might interpret the quality criteria differently. To address this issue, we utilized a standardized quality assessment tool during the evaluation process to ensure consistent interpretation of the criteria. Moreover, limiting the search to a specific publication date range could result in publication date bias, leading to an outdated or incomplete synthesis of the literature. To mitigate this, we focused on secondary studies published within the latest 5 years, whose surveyed primary studies nevertheless span a wide range of dates. To enhance the efficiency of the quality assessment, we established the quality assessment process before conducting the full-text review. This approach ensured that all selected articles were relevant to the topic and that their quality was assessed rigorously. Other possible threats to the validity of this study include publication bias and selective reporting of results. To mitigate these threats, we conducted a comprehensive search of multiple databases and employed the snowballing method, while carefully screening all articles for possible selective reporting.

3 Results

This section presents the results of our study and is divided into several subsections. Firstly, we provide an overview of the background information of the articles in our corpus, including the distribution of disciplines, publishers, and the databases most frequently cited by the secondary studies. The presentation of the results uses the classification of the secondary studies into research domains extracted for RQ1 and RQ2. In Section 3.2, we discuss our findings on the Big Data technologies extracted from our SLR in order to answer RQ1. In Section 3.3, we discuss the concepts that are perceived as Big Data as reported by the secondary studies in our corpus. Finally, we summarize the impact of the adoption of Big Data technologies in the various domains in Section 3.4.

3.1 Overview of bibliographic results

The following section presents an overview of the created corpus used in the analysis, providing statistics on domain distribution, publishers, and bibliographic sources. Domain distribution helps to identify the research background of the analyzed articles, while publishers' distribution sheds light on those who are active in the field of Big Data. Additionally, the bibliographic sources used by secondary studies provide insights into the popularity of different databases and sources among researchers in the field. This information is crucial since it can affect the quality of the data and the generalizability of the findings and ensure transparency for analysis.

Figure 3 shows the distribution of WoS-defined disciplines of the articles extracted for the study. Health care emerged as the discipline with the most Big Data articles, with COVID-19 as a distinct area of interest that we promoted to its own domain. Transportation and Business & Economics were the next most popular areas of study. While several cross-domain studies did not specify the application domain of Big Data, they did focus on a specific aspect of Big Data applications such as storage or cybersecurity. Disaster response studies identified in the Public, Environmental & Occupational Health category are also notable. Other disciplines appear only once; these include Materials Science (focused on aerospace), Telecommunications (e.g., mobile Big Data), Construction & Building Technology, and Agriculture. It should be noted that this distribution does not necessarily represent the entire scientific community, and no conclusions regarding the overall interest in studying or using Big Data can be drawn from it.


Figure 3 . Discipline distribution of the included articles.

Our analysis of the articles in our corpus, as summarized in Figure 4 , showed that MDPI and Elsevier published the most studies related to Big Data (with nine each), followed by Springer and IEEE (with three each). Eight more publishers have one publication each in our corpus and are omitted from the figure.


Figure 4 . Publishers' distribution of the included secondary studies.

In terms of databases chosen for secondary studies, Figure 5 shows that WoS and Scopus were the most popular sources, being in alignment with our choice of databases for our study, followed by IEEE, ScienceDirect, Google Scholar, Wiley Online Library, Sage, PubMed, and Association for Computing Machinery's (ACM) Digital Library.


Figure 5 . Popularity of databases used by the secondary studies.

Overall, a wide diversity of publishing venues and data sources can be attested for our corpus, providing evidence toward a lack of selection bias for our review.

3.2 Big Data technologies

To address RQ1, we refer to Patgiri (2019) , who presents a taxonomy of Big Data consisting of seven key categories: Semantics, Compute Infrastructure, Storage System, Big Data Management, Big Data Mining, Big Machine Learning, and Security & Privacy. This taxonomy covers various aspects of Big Data technologies, including implementation tools, system architectures, and operational processes. We categorize and describe our findings based on the six perspectives (excluding Semantics) proposed by Patgiri. Since the Semantics perspective primarily concerns the conceptual understanding of Big Data, we exclude it from our summary of technologies.

The data extraction results from our study reveal the use of various Big Data technologies already discussed in Patgiri (2019) , including Xplenty, Apache Cassandra, MongoDB, Hadoop, Datawrapper, RapidMiner, Tableau, KNIME, Storm, Cloudera Distributed Hadoop (CDH), Kafka, Spark, MapReduce, Hive, Pig, Flume, Sqoop, Apache Tez, and Flink. These technologies are used to extract, process, analyze, and visualize large volumes of data across a range of scientific domains.

From the Compute Infrastructure perspective, the most commonly used technology is Hadoop, 1 which is used for distributed computing and data processing. Various disciplines have been found utilizing Hadoop, including health care [S12, S63, S87, S133], environmental science [S77], and transportation [S194]. Other technologies, such as Apache Cassandra 2 and MongoDB, 3 are used for distributed data storage and retrieval. These technologies enable data processing at scale and facilitate parallel processing, enabling users to analyze large volumes of data quickly and efficiently. In terms of storage systems, Apache Cassandra, Hadoop Distributed File System (HDFS), 4 HBase, 5 and MongoDB are popular choices for distributed data storage. These technologies are designed to handle large volumes of data and provide high scalability and fault tolerance. The reported application domains for these technologies include Internet of Things (IoT), healthcare, decision making and electric power data [S94]. [S133] notes that NoSQL database systems such as MongoDB, Cassandra, and HDFS can be used to handle exponential data growth to replace the traditional database management systems. Additionally, technologies such as Flume 6 and Sqoop 7 enable the efficient transfer of data between different storage systems [S74].

Regarding the Big Data Management perspective, frameworks such as Hadoop, MapReduce, Hive, 8 and Pig 9 are used for managing and processing large volumes of data. These technologies enable users to process data in a distributed environment, thereby increasing efficiency and scalability [S74, S133, S194]. Other technologies, such as Apache Tez 10 and Flink, 11 provide high-performance data processing and streaming capabilities [S74]. From the Big Data Mining & Machine Learning aspect, a range of machine learning models is used to analyze large volumes of data, including Artificial Neural Networks (ANN), fuzzy logic, Decision Trees (DT), regression, Support Vector Machines (SVM), Self-Organizing Maps (SOM), Fuzzy C-Means (FCM), K-means, Genetic Algorithms (GA), Convolutional Neural Networks (CNN), Random Forest, Rotation Forest, Bayesian methods, Boosted Regression Trees, Classification And Regression Trees (CART), Conditional Inference Trees, Maxent, Non-negative Matrix Factorization, and others [S19, S27, S65, S96, S178, S203, S207]. These techniques enable users to extract insights from large volumes of data, identify patterns and trends, and make predictions based on data analysis. Finally, as for security and privacy, Intrusion Detection Systems (IDS) are used to detect anomalous behaviors in complex environments. Machine learning models such as unsupervised online deep neural networks and deep learning techniques are used to identify and analyze Controller Area Network (CAN-Bus) attacks. Log parsers applied to logs such as Zookeeper, 12 Proxifier, 13 BGL, HPC, and HDFS are used to secure data and ensure privacy in distributed environments [S88]. These technologies provide access control, authentication, and encryption capabilities to protect data against unauthorized access and misuse.
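As a concrete illustration of the map/shuffle/reduce pattern that frameworks such as Hadoop MapReduce implement at cluster scale, a toy single-process word count might look as follows. This is a didactic sketch only; none of the reviewed studies provide this code.

```python
from collections import defaultdict
from itertools import chain

# Toy single-process illustration of the map/shuffle/reduce pattern that
# frameworks such as Hadoop MapReduce execute in parallel across a cluster.

def map_phase(doc: str):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in doc.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insights", "data at scale"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
print(counts["big"])  # 2
```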

Although many Big Data technologies and their applications are presented in this section, a significant portion of the reviewed studies discuss additional technologies and applications outside of Patgiri's taxonomy that are nevertheless labeled as Big Data. In order to distinguish between these two cases and further deepen the understanding of Big Data and related technologies in the scientific field, we discuss the latter case under the following research question.

3.3 Perceptions of Big Data scope

In this section, we aim to provide an overview of how Big Data is actually used, based on our analysis of the literature, to answer RQ2. Initially, we attempted to use the WoS categories to structure the discussion around domains; however, due to heterogeneity at multiple levels, the differences in technology use were found to be significant even within a single subject. As a result, we summarize the technologies according to the Perspective field recorded in the data extraction form. Table 3 summarizes our findings.


Table 3 . Perceptions of Big Data scope.

3.3.1 Big Data as abstract concepts or collections for application domains

In this category, the studies do not exemplify Big Data technologies in concrete circumstances, but rather treat them as abstract concepts or collections for application domains. For example, [S67] discusses using text mining methods to explore the application of Big Data disciplines. The research focuses on the broad application of Big Data but does not include specific technologies. Based on the presented text mining method, Big Data technologies are identified automatically as keywords without further classification, and no concrete technologies or their use cases are explored. Similarly, [S104] studies the adoption of cloud computing, Big Data, and blockchain technology in Enterprise Resource Planning (ERP), using only publication keywords generated from a literature review and covering a broad spectrum. Why certain technologies are referred to as Big Data, and how they are specifically utilized, is left unexplained, however. In other cases, the term “Big Data” and related expressions such as “Big Data Analytics” and “Big Data Applications” are used as abstract concepts throughout the paper, without a clear definition of the scope of the technology and with little mention of the technology itself. For instance, in an attempt to identify privacy issues in Big Data applications, [S181] uses phrases such as “Big Data analytics” to refer to the technology without focusing on any specific technologies.

3.3.2 Big Data as large data sources

This category of studies focuses on newly established large data sources, such as databases, websites, and crowd-sourcing platforms, which are taken as the application target of Big Data. The term Big Data is not strictly focused on technologies but rather on the volume or velocity of the data, with no specific techniques linked to those data sources. The application of Big Data in this case mainly means the use of large volumes of data. For example, [S132] explores the utilization of Big Data, smart and digital technologies, and artificial intelligence in the field of psoriatic arthritis studies. The authors cite a series of examples of how Big Data, combined with techniques such as artificial neural networks, natural language processing, k-nearest neighbors, and Bayesian models, can help intercept patients with psoriatic arthritis early. In this case, Big Data mainly refers to the repositories, registries, and databases that are generated from surveys, medical insurance data, vital registration data, etc. A similar research study in healthcare is [S47]. The authors list a number of databases worldwide with concrete sample sizes for gastrointestinal and liver diseases; the techniques briefly mentioned for analysis are R, Python, statistics, and Natural Language Processing (NLP). In ecology, ecological datasets such as AmeriFlux are likewise considered Big Data, as indicated by [S16].

3.3.3 Machine learning methods and models

This category refers to Big Data as models, algorithms, and statistical methods for managing large data sets, closely related to Machine Learning, Artificial Intelligence, Data Mining, the Internet of Things, etc. Here, Big Data technology is equated with machine learning or artificial intelligence, sometimes referred to as deep learning. Specific machine learning models are mentioned, such as ANN, Neural Networks (NN), SVM, K-means, and decision trees, as well as improved models based on those methods, applied to practical problems such as predicting a certain disease or optimizing transportation. In our research, the majority of the technologies and applications extracted fall into this category. Popular models and techniques summarized are decision trees, random forests, support vector machines, logistic regression models, naive Bayes models, k-nearest neighbors (k-NN), classic linear statistical models, Bayesian networks, and CNNs. There are also some studies (e.g., [S203, S207]) that summarize the applications of Big Data in their research field by discussing a series of concrete improvements of existing models.
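To make one of the frequently listed models concrete, k-nearest neighbors can be sketched in a few lines. This toy implementation is illustrative only; the reviewed studies use tuned library implementations on far larger data sets.

```python
from collections import Counter
import math

# Toy k-nearest-neighbors classifier, one of the models frequently listed
# in this category. Illustrative only, not any reviewed study's pipeline.

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; returns the majority label
    among the k training points nearest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_predict(train, (0.05, 0.1)))  # "A"
```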

3.3.4 Big Data ecosystem

The Big Data ecosystem category represents a comprehensive structure of the different levels of technology solutions that exist. Various applications such as Hadoop, MapReduce, and NoSQL are commonly mentioned in this category. The authors of those studies show a strong familiarity with how these technologies fit into their respective disciplines, demonstrating a deep understanding of the subject matter. In one study [S194], for example, the authors explained the different layers in an architecture for Big Data relating to traffic, and how applications such as Hadoop, MapReduce, HDFS, Apache Spark, Apache Hive, Hut, and Apache Kafka work together in the system. The article provides valuable insights into the role that these applications play in the overall architecture. Another study [S216] focused on the use of MapReduce and HDFS functionality in Big Data Architecture.

3.3.5 Outside of the classification

Some articles in our study did not fit into the categories outlined in our classification, as it was challenging to extract useful information from them. For instance, some studies, e.g., [S12, S63, S77, S172], presented a comparison of Big Data frameworks sourced from other literature without exploring how these frameworks are utilized in their respective fields after introducing their background discipline. While these frameworks could fall under the fourth category, we found it difficult to ascertain how Big Data is applied and understood in these disciplines. As a result, we did not include them in Table 3 . That articles do not fit into this classification does not imply that the classification is invalid for the study, but rather that we cannot accurately estimate the authors' understanding of Big Data.

3.4 Added value of Big Data

In this section, we summarize the perceived impact of Big Data adoption in each domain, as stated in the secondary studies, in order to answer RQ3. The impact of Big Data is categorized into benefits and shortcomings, which are introduced below. Table 4 provides a summary of the main points extracted from the texts.


Table 4 . Summary of benefits and shortcomings of Big Data.

3.4.1 Benefits

The advantages of Big Data in various fields have been extensively researched and documented. Big Data tools have the ability to store, process and analyze large volumes and varieties of data in real time, enabling researchers to extract valuable insights and improve their performance. In healthcare, the integration of different scientific fields such as informatics, clinical sciences, and analytics has been facilitated by the application of Big Data [S12]. Further reported benefits of Big Data in healthcare include, for example, cost reduction in medical treatments, elimination of illness-related risk factors, disease prediction, and improved preventative care and medication efficiency analysis [S12, S47, S132, S181]. Large sample sizes have enabled reliable capture of small variations in incidence or disease flare, and epidemiological/clinical Big Data has been instrumental in guiding public and global health policies [S47, S132]. In the field of transportation, Big Data technologies have been used to improve the effectiveness of traffic crash detection and prediction research, with the aim of preventing the occurrences of traffic crashes and secondary crashes [S178, S194]. In the ecosystem service research, Big Data and machine learning tools have been used to address data availability, uncertainty, and socio-ecological gaps [S16]. In addition, Big Data analytics platforms have been useful in revealing previously overlooked correlations, market trends, and valuable information from a large amount of Big Data [S66]. Machine learning has also been used in the smart grid area to sift through Big Data and extract useful information that can aid in demand and generation pattern recognition [S51]. Overall, the benefits of Big Data are diverse and far-reaching, and its application has the potential to revolutionize research and improve outcomes in various fields.

3.4.2 Shortcomings

One of the primary challenges with Big Data is the assumption that having vast amounts of data guarantees accurate results. As [S74] notes, Big Data can give a false sense of security because having a lot of data does not necessarily mean that the results are valid. In the field of Psoriatic Arthritis, for example, there are significant variations in socio-demographic characteristics, co-morbidities, and major complication rates between individual (single- or multi-center) and database-based studies [S132]. This inconsistency raises concerns when critically appraising rheumatological and dermatological research, as well as risk adjustment modeling.

Another challenge is that unnecessary utilization of Big Data wastes resources by tying up computing capacity [S74]. With the exponential growth of data, the storage and processing demands of Big Data have increased significantly; unnecessary use exhausts computer resources and makes them less available for other important tasks. This can increase costs for organizations, which must invest in more powerful computer systems to handle the added demand. Big Data also poses physical challenges to current IT architecture, servers, and software. As pointed out by [S51], IoT devices generate huge amounts of data that cannot be handled through conventional analysis techniques. While many data storage technologies have been proposed to store and process growing data, more efficient technologies are required for the acquisition, processing, pre-processing, storage, and management of Big Data [S94].

Security and privacy issues also arise with Big Data. In the healthcare industry, data security and patient privacy issues are a significant concern for authorities and patients [S66]. In the research concerning COVID-19 [S207], for example, there are ethical issues surrounding privacy, the use of personal data to limit the pandemic spread, and the need for security to protect data from being overused by technology.

Finally, there are challenges in managing data consistency, scalability, and integration. For instance, challenges in using Big Data in the environmental sciences include data cleansing, lack of labeled datasets, mismatched data ingestion, high platform costs, and lack of data governance and socio-technical infrastructure [S135]. The transportation industry faces difficulties in collecting data from diverse sources and addressing quality concerns: data collected from different sources must be analyzed to extract meaningful insights, yet it often contains noise and uncertainty that must be addressed before use [S194].

In conclusion, several challenges need to be addressed to ensure that the use of Big Data is effective. These challenges include the need for integrated and comprehensive systems, efficient data acquisition and management technologies, security and privacy concerns, data consistency, scalability, as well as integration issues.

4 Discussion

Going beyond our findings as reported in the previous section, in the following we identify three specific topics emerging from our analysis that need further consideration: how Big Data is understood across scientific domains, how its adoption occurs across disciplines, and how the added value that Big Data technologies bring is perceived.

4.1 Understanding of Big Data

The challenge in analyzing research using Big Data technologies stems from the lack of a clear definition of what a Big Data technology is. The authors of the secondary studies surveyed in this study elaborate on their understanding of Big Data and its applications from different perspectives. This results in a wide spectrum of technologies being labeled as Big Data, while significant differences between them remain, even when they are applied in the same field. Additionally, terms closely related to Big Data technology, such as AI, ML, Big Data analytics, Big Data platform, IoT, and Deep Learning, are often used interchangeably. While these terms may not require definition in each individual article surveyed, their meanings and scopes may overlap in practice. Therefore, to gain a clear understanding of Big Data, it would be helpful to distinguish these terms clearly and use each of them consistently, to avoid confusion. In other words, we argue that when reviewing the literature, the primary task is to understand what authors mean when they use concepts such as Big Data and Big Data technology, rather than simply extracting relevant technical applications and their context.

The results show that the concept of Big Data technology can be very broad. At the same time, these tools and technologies may be specific to a particular discipline and have no interdisciplinary significance. This ambiguity has encouraged the loose, widespread use of Big Data terminology. Our findings also revealed that some articles provide such a vague description of Big Data that no useful information can be extracted and placed under any of the categories in Section 3.3. These articles focus on theoretical knowledge of Big Data and do not delve into its practical applications in different domains; as a result, they merely compare Big Data frameworks cited in other literature without exploring how these frameworks are used in their respective domains. This approach contributes to pervasive references to Big Data without, however, providing a clear understanding of its practical applications.

In conclusion, this study highlights the need for a clear and comprehensive definition of Big Data technology and its practical applications in different domains. By doing so, we can avoid the misuse of Big Data terminology and gain a better understanding of the tools and technologies that are truly related to Big Data.

4.2 Range of disciplines represented in Big Data applications

In this study, we conducted a rigorous systematic literature review to investigate the applications of Big Data and used the WoS subject classification to categorize the related research fields. However, we acknowledge that this approach may not capture the full range of disciplines, as indicated by our analysis of the discipline distribution. While disciplines such as astronomy and high-energy physics are typically associated with managing large volumes of data ( Jacobs, 2009 ), we found that they were not represented in the papers we extracted. The lack of such data-intensive disciplines led us to question why they were not included, since our primary objective was to provide an overview of Big Data applications across the entire scientific community. We took great care to ensure that our SLR was not biased toward any specific domain by carefully selecting keywords and defining our inclusion and exclusion criteria. Additionally, we used comprehensive databases and an auditable and repeatable methodology. The absence of data-intensive subjects in our review suggests that some disciplines may be utilizing Big Data technologies without explicitly using the term “Big Data” in their research papers. One hypothesis is that computer science and related fields may consider the scale of data used as the norm, hence not explicitly mentioning “Big Data.” To understand this phenomenon better, we suggest a further step to investigate the underlying assumptions, such as the choice of terminologies, used in these disciplines as part of future work.
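A key part of what makes an SLR auditable and repeatable, as described above, is that the screening step can be expressed as an explicit, re-runnable rule that records a reason for every decision. The sketch below is purely hypothetical: the keywords, record fields, and exclusion types are invented for illustration and are not our actual protocol. It also illustrates the hypothesis discussed above, namely that a data-intensive paper that never uses the term "Big Data" is simply not captured by term-based selection.

```python
# Hypothetical sketch of an auditable keyword screen for an SLR:
# every decision carries a reason, so selection can be repeated and audited.

INCLUDE = ("big data",)            # must appear in title or abstract (invented)
EXCLUDE = ("editorial", "poster")  # publication types screened out (invented)

def screen(record):
    """Return (keep, reason) for one bibliographic record."""
    text = (record["title"] + " " + record["abstract"]).lower()
    if record["type"].lower() in EXCLUDE:
        return False, "excluded publication type: " + record["type"]
    if not any(kw in text for kw in INCLUDE):
        return False, "no inclusion keyword in title/abstract"
    return True, "matched inclusion criteria"

records = [
    {"title": "Big Data in ecology", "abstract": "survey of tools",
     "type": "review"},
    # Data-intensive astronomy paper that never says "Big Data":
    {"title": "Petabyte-scale sky surveys", "abstract": "astronomy pipelines",
     "type": "review"},
]
decisions = [screen(r) for r in records]
print([keep for keep, _ in decisions])  # -> [True, False]
```

The second record is exactly the kind of study our review could not see: data-intensive by any measure, yet invisible to a search keyed on the term itself.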

4.3 Value of Big Data

The enthusiasm surrounding Big Data arises from the belief that vast amounts of information, combined with the development of technologies, algorithms, and machine learning techniques, can provide innovative insights that traditional research methods cannot achieve (Agrawal et al., 2011; Chen and Zhang, 2014; Khalid and Yousaf, 2021). However, these optimistic views are not undisputed. As Big Data knowledge infrastructures emerge, researchers are increasingly discussing the challenges and limitations they present.

Numerous papers extol the merits of Big Data, with the main claim being that it offers unparalleled opportunities for scientific breakthroughs, leading to transformative research, as discussed in Section 3.4. The most significant advantage of the Big Data approach is its ability to address problems on larger and finer temporal and spatial scales, as well as provide information on which data are reliable or uncertain, thereby mapping ignorance ( Hampton et al., 2013 ; Harford, 2014 ; Isaac et al., 2020 ). Big Data also highlights blind spots and uncertainties in research, revealing gaps in existing knowledge ( Hortal et al., 2015 ). In this study, we have explored the benefits of Big Data tools. They can store, process, and analyze large volumes and varieties of data in real-time, enabling researchers to extract valuable insights and improve their performance. Big Data analytics platforms have proven useful in revealing previously overlooked correlations, market trends, and valuable information from large datasets. Additionally, machine learning has aided in sifting through Big Data to extract useful information for demand and generation pattern recognition in the smart grid.

However, according to Ekbia et al. (2015), who draw on a broad range of literature, Big Data presents both conceptual and practical dilemmas. They argue that the use of Big Data produces an epistemological shift in science, in which predictive modeling and simulation gain more importance than causal explanations based on repeatable experiments testing hypotheses. Rosenheim and Gratton (2017) reject what they perceive as the suggestion, made by the most fervent proponents of Big Data, that knowledge of correlation alone can replace knowledge of causality. They point out that understanding cause-and-effect relationships is critical in fields such as agricultural entomology, where research-oriented recommendations enable farmers to implement management actions that lead to desired outcomes. Harford (2014) points out that conducting a correlation-based analysis without a theoretical framework is inherently vulnerable: without understanding the underlying factors influencing a correlation, it becomes impossible to anticipate and account for factors that could compromise its validity.

Similarly, our research surfaced critical questions regarding whether vast amounts of data guarantee accurate results. Brady (2019) argues that social scientists must grasp the meaning of concepts and predictions generated by convoluted algorithms, weigh the relative value of prediction vs. causal inference, and cope with the ethical challenges their methods raise. Another notable challenge posed by Big Data that we found is managing data consistency, scalability, and heterogeneity across fields. Large amounts of data also challenge IT architecture, servers, and software. This was also attested by Fan et al. (2014), who point out that the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. Furthermore, we found that misuse of Big Data raises ethical issues, a concern already widely discussed in previous literature; see for example Cumbley and Church (2013) and Knoppers and Thorogood (2017).

In conclusion, while Big Data has generated much enthusiasm as a powerful tool for scientific breakthroughs, it is not without its challenges and limitations. Despite these challenges, Big Data holds promise as a tool for innovative research, and future work should continue to address these concerns while exploring new opportunities for knowledge discovery.

5 Conclusion

The lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there is a need to first retrospectively observe what Big Data actually means and refers to in concrete studies. This will help clarify ambiguities and enhance understanding of the role of Big Data in science. It will facilitate clearer communication among researchers by providing a framework for a common understanding of what Big Data entails in different contexts, which is crucial for interdisciplinary collaboration. To address this gap, we conducted a systematic literature review (SLR) of secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term.

Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. All these aspects combined represent the true meaning of Big Data. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.

Author contributions

XH: Conceptualization, Writing – original draft. OG: Conceptualization, Supervision, Writing – review & editing. VA: Conceptualization, Methodology, Supervision, Validation, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. ^ https://hadoop.apache.org/

2. ^ https://cassandra.apache.org/

3. ^ https://www.mongodb.com/

4. ^ https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

5. ^ https://hbase.apache.org/

6. ^ https://flume.apache.org/

7. ^ https://sqoop.apache.org/

8. ^ https://hive.apache.org/

9. ^ https://pig.apache.org/

10. ^ https://tez.apache.org/

11. ^ https://flink.apache.org/

12. ^ https://zookeeper.apache.org/

13. ^ https://www.proxifier.com/

Agrawal, D., Bernstein, P. A., Bertino, E., Davidson, S. B., Dayal, U., Franklin, M. J., et al. (2011). Challenges and opportunities with big data. Cyber Center Technical Reports, 2011–1. Available at: https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1000&context=cctech

Akoka, J., Comyn-Wattiau, I., and Laoufi, N. (2017). Research on Big Data – a systematic mapping study. Comput. Stand. Interf . 54, 105–115. doi: 10.1016/j.csi.2017.01.004

Brady, H. E. (2019). The challenge of Big Data and data science. Ann. Rev. Polit. Sci . 22, 297–323. doi: 10.1146/annurev-polisci-090216-023229

Chebbi, I., Boulila, W., and Farah, I. R. (2015). “Big data: concepts, challenges and applications,” in Computational Collective Intelligence, eds. M. Núñez, N. T. Nguyen, D. Camacho, and B. Trawiński (Cham: Springer International Publishing), 638–647. doi: 10.1007/978-3-319-24306-1_62

Chen, C., and Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf. Sci . 275, 314–347. doi: 10.1016/j.ins.2014.01.015

Cumbley, R., and Church, P. C. (2013). Is “Big Data” creepy? Comput. Law Secur. Rev. 29, 601–609. doi: 10.1016/j.clsr.2013.07.007

Ekbia, H. R., Mattioli, M., Kouper, I., Arave, G., Ghazinejad, A., Bowman, T. D., et al. (2015). Big data, bigger dilemmas: a critical review. J. Assoc. Inform. Sci. Technol . 66, 1523–1545. doi: 10.1002/asi.23294

Falagas, M. E., Pitsouni, E., Malietzis, G., and Pappas, G. (2007). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. FASEB J . 22, 338–342. doi: 10.1096/fj.07-9492LSF

Fan, J., Han, F., and Liu, H. (2014). Challenges of Big Data analysis. Natl. Sci. Rev . 1, 293–314. doi: 10.1093/nsr/nwt032

Hampton, S. E., Strasser, C., Tewksbury, J. J., Gram, W. K., Budden, A. E., Batcheller, A. L., et al. (2013). Big data and the future of ecology. Front. Ecol. Environ . 11, 156–162. doi: 10.1890/120103

Hansmann, T., and Niemeyer, P. (2014). “Big Data - characterizing an emerging research field using topic models,” in 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) (New York, NY: ACM). doi: 10.1109/WI-IAT.2014.15

Harford, T. (2014, March 28). Big data: Are we making a big mistake? Financial Times. Retrieved from: https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0 (accessed August 25, 2024).

Hortal, J., De Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., and Ladle, R. J. (2015). Seven shortfalls that beset large-scale knowledge of biodiversity. Annu. Rev. Ecol. Evol. Syst . 46, 523–549. doi: 10.1146/annurev-ecolsys-112414-054400

Isaac, N. J. B., Jarzyna, M. A., Keil, P., Dambly, L. I., Boersch-Supan, P. H., Browning, E., et al. (2020). Data integration for large-scale models of species distributions. Trends Ecol. Evol . 35, 56–67. doi: 10.1016/j.tree.2019.08.006

Jacobs, A. (2009). The pathologies of big data. Commun. ACM 52, 36–44. doi: 10.1145/1536616.1536632

Khalid, M., and Yousaf, M. H. (2021). A comparative analysis of big data frameworks: an adoption perspective. Appl. Sci . 11:11033. doi: 10.3390/app112211033

Khan, M. A.-U.-D., Uddin, M. J., and Gupta, N. (2014). Seven V's of Big Data understanding Big Data to extract value . doi: 10.1109/ASEEZone1.2014.6820689

Kitchenham, B., Brereton, O. P., Budgen, D., Seed, P. T., Bailey, J. E., Linkman, S., et al. (2009). Systematic literature reviews in software engineering – a systematic literature review. Inf. Softw. Technol . 51, 7–15. doi: 10.1016/j.infsof.2008.09.009

Kitchenham, B., Pretorius, R., Budgen, D., Brereton, O. P., Seed, P. T., Niazi, M., et al. (2010). Systematic literature reviews in software engineering – a tertiary study. Inf. Softw. Technol . 52, 792–805. doi: 10.1016/j.infsof.2010.03.006

Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data Soc . 1:205395171452848. doi: 10.1177/2053951714528481

Knoppers, B. M., and Thorogood, A. (2017). Ethics and Big Data in health. Curr. Opin. Syst. Biol . 4, 53–57. doi: 10.1016/j.coisb.2017.07.001

Mengist, W., Soromessa, T., and Legese, G. (2020). Method for conducting systematic literature review and meta-analysis for environmental science research. MethodsX 7:100777. doi: 10.1016/j.mex.2019.100777

Patgiri, R. (2019). A taxonomy on Big Data: survey. arXiv [Preprint] . doi: 10.48550/arxiv.1808.08474

Petroc, T. (2023). Amount of data created, consumed, and stored 2010-2020, with forecasts to 2025. Statista. Available at: https://www.statista.com/statistics/871513/worldwide-data-created/

Petticrew, M., and Roberts, H. (2008). Systematic Reviews in the Social Sciences . Hoboken, NJ: John Wiley & Sons.

Rosenheim, J. A., and Gratton, C. (2017). Ecoinformatics (Big Data) for agricultural entomology: pitfalls, progress, and promise. Annu. Rev. Entomol . 62, 399–417. doi: 10.1146/annurev-ento-031616-035444

Scopus Content Coverage Guide (2023). In Scopus. Available at: https://assets.ctfassets.net/o78em1y1w4i4/EX1iy8VxBeQKf8aN2XzOp/c36f79db25484cb38a5972ad9a5472ec/Scopus_ContentCoverage_Guide_WEB.pdf

Succi, S., and Coveney, P. V. (2019). Big data: the end of the scientific method? Philos. Trans. A Math. Phys. Eng. Sci . 377:20180145. doi: 10.1098/rsta.2018.0145

Van Altena, A. J., Moerland, P. D., Zwinderman, A. H., and Olabarriaga, S. D. (2016). Understanding big data themes from scientific biomedical literature through topic modeling. J. Big Data 3. doi: 10.1186/s40537-016-0057-0

Wohlin, C. (2014). “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering. doi: 10.1145/2601248.2601268

Table A1. Secondary studies included in our review.

Keywords: Big Data definition, systematic literature review, scientific research, Big Data review, Big Data epistemology

Citation: Han X, Gstrein OJ and Andrikopoulos V (2024) When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data. Front. Big Data 7:1441869. doi: 10.3389/fdata.2024.1441869

Received: 31 May 2024; Accepted: 12 August 2024; Published: 10 September 2024.

Copyright © 2024 Han, Gstrein and Andrikopoulos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaoyao Han, x.han@rug.nl
