Big data

Big data refers to data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy. Big data is commonly described along three dimensions: volume, variety and velocity.

Lately, the term “big data” tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. “There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem.” [2] Analysis of data sets can find new correlations to “spot business trends, prevent diseases, combat crime and so on.” [3] Scientists, business executives, and practitioners of medicine regularly meet difficulties with large data sets in areas including fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, [4] connectomics, complex physics simulations, biology and environmental research. [5]

Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing Internet of Things devices such as mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. [6] [7] The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; [8] as of 2012, 2.5 exabytes (2.5 × 10^18 bytes) of data are generated every day. [9] By 2025, IDC predicts there will be 163 zettabytes of data. [10] One question for large enterprises is determining who should own big-data initiatives that affect the entire organization. [11]
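The doubling figure above implies a compound growth rate that is easy to make concrete. The following is a minimal sketch, assuming smooth exponential growth with a fixed 40-month doubling period; the name `growth_factor` is illustrative and does not come from the cited source.

```python
DOUBLING_PERIOD_MONTHS = 40  # doubling period quoted in the text

def growth_factor(months: float) -> float:
    """Multiplicative growth after `months`, assuming steady exponential growth."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)

annual_factor = growth_factor(12)    # growth over one year
decade_factor = growth_factor(120)   # growth over ten years (three doublings)

print(f"per year: x{annual_factor:.3f}, per decade: x{decade_factor:.0f}")
```

Under this assumption, capacity grows by roughly 23% per year, or eightfold per decade.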

Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work may require “massively parallel software running on tens, hundreds, or even thousands of servers”. [12] What counts as “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. [13]


Some credit John Mashey with coining the term, or at least making it popular. [14] [15] Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. [16] Big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data. [17] Big data “size” is a constantly moving target, as of 2012 ranging from a small number of terabytes to many petabytes of data. [18] Big data requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex, and of a massive scale. [19]

In a 2001 research report [20] and related lectures, META Group (now Gartner) defined data growth challenges and opportunities as being three-dimensional: increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continues to use this “3Vs” model for describing big data. [21] In 2012, Gartner updated its definition: “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” [22] Gartner’s definition of the 3Vs is still widely used, and agrees with a consensus definition stating that “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”. [23] Additionally, some organizations add a new V, “Veracity”, to describe it, [24] a revisionism challenged by some industry authorities. [25] The 3Vs have been expanded to other complementary characteristics of big data: [26] [27]

  • Volume: big data does not sample; it just observes and tracks what happens
  • Velocity: big data is often available in real time
  • Variety: big data draws from text, images, audio and video, and it completes missing pieces through data fusion
  • Machine learning: big data often does not ask why and simply detects patterns [28]
  • Digital footprint: big data is often a cost-free byproduct of digital interaction [27] [29]

The growing maturity of the concept more starkly delineates the difference between big data and Business Intelligence: [30]

  • Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.
  • Big data uses inductive statistics and concepts from nonlinear system identification [31] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density [32] to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors. [31] [33]


Big data can be described by the following characteristics: [26] [27]

  • Volume: the quantity of generated and stored data. The size of the data determines the value and potential insight.
  • Variety: the type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
  • Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
  • Variability: inconsistency of the data set can hamper processes to handle and manage it.
  • Veracity: the quality of captured data can vary greatly, affecting the accuracy of analysis. [34]

Factory work and cyber-physical systems may have a 6C system:

  • Connection (sensor and networks)
  • Cloud (computing and data on demand) [35] [36]
  • Cyber (model and memory)
  • Content / context (meaning and correlation)
  • Community (sharing and collaboration)
  • Customization (personalization and value)

Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. For example, managing a factory requires considering both visible and invisible issues with its components. Information generation algorithms must detect and address invisible issues such as machine degradation and component wear on the factory floor. [37] [38]


Big data repositories have existed in many forms, often built to meet a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s. For many years, WinterCorp published the largest database report. [39]

Teradata Corporation marketed the parallel-processing DBC 1012 system in 1984. Teradata systems were the first to store and analyze 1 terabyte of data, in 1992. Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves according to Kryder’s Law. Teradata installed the first petabyte-class RDBMS-based system in 2007. As of 2017, there are a few dozen petabyte-class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added unstructured data types including XML, JSON, and Avro.

In 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a C++ dialect called ECL. ECL uses an “apply schema on read” method to infer the structure of stored data when it is queried, instead of when it is stored. In 2004, LexisNexis acquired Seisint Inc. [40] and in 2008 acquired ChoicePoint, Inc. [41] and their high-speed parallel processing platform. The two platforms were merged into HPCC (High-Performance Computing Cluster) Systems, and in 2011 HPCC was open-sourced under the Apache v2.0 License. Quantcast File System was available about the same time. [42]
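The “apply schema on read” idea can be illustrated outside ECL. The sketch below is plain Python, not ECL or HPCC code: raw records are stored untouched, and a schema is applied only when they are queried. The names `raw_store`, `SCHEMA`, `read_with_schema` and `query`, and the sample records, are all hypothetical.

```python
import json

# Raw records are stored exactly as they arrive; no schema is enforced on write.
raw_store = [
    '{"name": "Ada", "age": "36", "city": "London"}',
    '{"name": "Linus", "age": "28"}',               # missing "city"
    '{"name": "Grace", "age": "n/a", "extra": 1}',  # unparseable age, extra field
]

# Hypothetical schema: field name -> (converter, default when missing or invalid).
SCHEMA = {"name": (str, ""), "age": (int, None), "city": (str, "unknown")}

def read_with_schema(raw_line, schema):
    """Parse one stored record and coerce it to `schema` at read time."""
    record = json.loads(raw_line)
    row = {}
    for field, (convert, default) in schema.items():
        try:
            row[field] = convert(record[field])
        except (KeyError, ValueError, TypeError):
            row[field] = default
    return row

def query(store, schema, predicate):
    """Apply the schema lazily while scanning the store, then filter rows."""
    rows = (read_with_schema(line, schema) for line in store)
    return [row for row in rows if predicate(row)]

over_30 = query(raw_store, SCHEMA, lambda r: r["age"] is not None and r["age"] > 30)
print(over_30)
```

Because structure is imposed only at query time, the same stored records could later be read under a different schema without rewriting them.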

In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, [43] so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop. [44]
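The two steps described above can be sketched on a single machine. This is a toy word count, not Google's or Hadoop's implementation: threads stand in for cluster nodes, and the shuffle phase that a real framework runs between Map and Reduce is omitted. The function names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_chunk(chunk):
    """Map step: count words within one chunk, independently of other chunks."""
    return Counter(chunk.lower().split())

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

def word_count(chunks, workers=4):
    # Threads stand in for the separate nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partial_counts, Counter())

chunks = ["big data is big", "data is everywhere", "big ideas"]
totals = word_count(chunks)
print(totals.most_common(3))
```

The key property is that `map_chunk` needs no knowledge of other chunks, which is what allows the work to be spread across many machines.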

MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled “Big Data Solution Offering”. [45] The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records. [46]

Studies in 2012 showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework aims to make the processing power transparent to the end user by using a front-end application server. [47]

Big data analytics for manufacturing is marketed as a 5C architecture (connection, conversion, cyber, cognition, and configuration). [48]

The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, reducing the overhead time. [49] [50]


A 2011 McKinsey Global Institute report characterizes the main components and ecosystem of big data as follows: [51]

  • Techniques for analyzing data, such as A/B testing, machine learning and natural language processing
  • Big data technologies, like business intelligence, cloud computing and databases
  • Visualization, such as charts, graphs and other displays of the data

Multidimensional big data can also be represented as tensors, which can be more easily handled by tensor-based computation, [52] such as multilinear subspace learning. [53] Additional technologies applied to big data include massively parallel processing (MPP) databases, search-based applications, data mining, [54] distributed file systems, distributed databases, cloud computing and HPC-based infrastructure (applications, storage and computing resources) [55] and the Internet. [citation needed] Although many approaches and technologies have been developed, it still remains difficult to carry out machine learning with big data. [56]

Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of large data tables in the RDBMS. [57]

DARPA’s Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi. [58]

Practitioners of big data analytics processes are generally hostile to slower shared storage, [59] preferring direct-attached storage (DAS) in its various forms, from solid-state drive (SSD) to high-capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures, namely storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.

Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN is much greater than that of other storage techniques.

Shared storage has advantages as well as disadvantages in big data analytics, but big data analytics practitioners as of 2011 did not favor it. [60]


Big data has grown so much that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year. [3]

Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. [3] Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world’s effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, [8] and predictions put the amount of internet traffic at 667 exabytes annually by 2014. [3] According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data, [61] which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e. in the form of video and audio content).

While many vendors offer off-the-shelf solutions for big data, experts recommend the development of in-house custom-tailored solutions to solve the company’s problem. [62]


The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation, [63] but does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome.

International development

Research on the effective use of information and communication technologies for development (also known as ICT4D) suggests that big data technology can make significant contributions but also present challenges to international development. [64] [65] Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management. [66] [67] [68] Additionally, user-generated data offers new possibilities to give the unheard a voice. [69] However, longstanding challenges for developing regions, such as inadequate technological infrastructure and scarce economic and human resources, exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability. [66]


Based on the TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for transparency in the manufacturing industry, which is the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires a vast amount of data and advanced prediction tools to systematically process data into useful information. [70] A conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data are available to acquire, such as acoustics, vibration, pressure, current, voltage and controller data. The vast amount of sensory data, in addition to historical data, constitutes the big data in manufacturing. The generated big data acts as the input into predictive tools and preventive strategies such as Prognostics and Health Management (PHM). [71] [72]


Big data analytics has helped healthcare improve by providing personalized medicine and predictive analytics, among other things. [73] Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the added adoption of mHealth, eHealth and wearable technologies, the volume of data will continue to increase. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of difficult-to-process data. There is now an even greater need for such environments to pay greater attention to data and information quality. [74] “Big data very often means ‘dirty data’ and the fraction of data inaccuracies increases with data volume growth.” [75] While the data is still being analyzed, it cannot be used. While extensive information in healthcare is now electronic, it fits under the big data umbrella as most of it is unstructured and difficult to use. [76]


A McKinsey Global Institute study found a shortage of 1.5 million highly trained data professionals and managers, [51] and a number of universities, [77] including the University of Tennessee and UC Berkeley, have created programs to meet this demand. Private boot camps such as The Data Incubator and General Assembly have also developed programs to meet it. [78]


To understand how the media uses big data, it is first necessary to provide some context on the mechanisms used in the media process. Nick Couldry and Joseph Turow have suggested that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows, and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer’s mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers whose preferences have been gleaned exclusively through various data-mining activities. [79]

  • Targeting of consumers (for advertising by marketers)
  • Data-capture
  • Data journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics.

Channel 4, the British public-service television broadcaster, is a leader in the field of big data and data analysis. [80]

Internet of Things (IoT)

Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device interconnectivity. Such mappings have been used by the media industry, companies and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical [81] and manufacturing [82] contexts.

Kevin Ashton, a digital innovation expert who is credited with coining the term, [citation needed] defines the Internet of Things in this quote: “[...] we would be able to track and count everything, and greatly reduce waste, loss and cost. We would know when things needed replacing, repairing or recalling, and whether they were fresh or past their best.”

Information Technology

Especially since 2015, big data has come to prominence within business operations as a tool to help employees work more efficiently and streamline the collection and distribution of information technology (IT). The use of big data to resolve IT and data collection issues within an enterprise is called IT Operations Analytics (ITOA). [83] By applying big data principles to the concepts of machine intelligence and deep computing, IT departments can predict potential issues and move to provide solutions before the problems even happen. [83] In this time, ITOA businesses were also beginning to play a major role in systems management by offering platforms that brought individual data silos together and generated insights from the whole of the system rather than from isolated pockets of data.

Case studies


United States of America

  • In 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important issues faced by the government. [84] The initiative is composed of 84 different big data programs spread across six departments. [85]
  • Big data analysis played a large role in Barack Obama’s successful 2012 re-election campaign. [86]
  • The United States Federal Government owns six of the ten most powerful supercomputers in the world. [87]
  • The Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes. [88] [89] [90]


India

  • Big data analysis was tried out for the BJP to win the 2014 Indian general election. [91]
  • The Indian government uses various techniques to ascertain how the Indian electorate is responding to government action.

United Kingdom

Examples of uses of big data in public services:

  • Data on prescription drugs: by connecting the origin, location and time of each prescription, a research unit was able to show the considerable delay between the release of any given drug and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient. [92]
  • Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as ‘meals on wheels’. The connection of data allowed the local authority to avoid any weather-related delay. [93]


  • Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress. [3]
  • Windermere Real Estate uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work. [94]
  • FICO Card Detection System protects accounts worldwide. [95]


  • The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% [96] of these streams, there are 100 collisions of interest per second. [97] [98] [99]
    • As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents an annual rate of 25 petabytes before replication (as of 2012). This becomes nearly 200 petabytes after replication.
    • If all sensor data were recorded in the LHC, the data flow would be extremely hard to work with. The data flow would exceed an annual rate of 150 million petabytes, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5 × 10^20) bytes per day, almost 200 times more than all other sources combined in the world.
  • The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. [100] [101] It is considered one of the most ambitious scientific projects ever undertaken. [102]
  • When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. [3] When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days. [3]
  • Decoding the human genome originally took 10 years; now it can be achieved in less than a day. DNA sequencers have divided the sequencing cost by 10,000 in the last ten years, which is 100 times cheaper than the reduction in cost predicted by Moore’s Law. [103]
  • The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster. [104] [105]
  • Google’s DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and accurate calculations eliminate any “friction points”, or human errors, that could be made by one of the many science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources. [106] [107]
  • 23andMe’s DNA database contains genetic information of over 1,000,000 people worldwide. [108] The company explores selling the “anonymous aggregated genetic data” to other researchers and pharmaceutical companies for research purposes if patients give their consent. [109][110][111][112][113] Ahmad Hariri, professor of psychology and neuroscience at Duke University, who has been using 23andMe in his research since 2009, states that the most important aspect of the company’s new service is that it makes genetic research accessible and relatively cheap for scientists. [109] A study that identified 15 genome sites linked to depression in 23andMe’s database led to a surge in demands to access the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper. [114]
  • Computational fluid dynamics (CFD) and hydrodynamic turbulence research generate massive data sets. The Johns Hopkins Turbulence Databases (JHTDB) contain over 350 terabytes of spatiotemporal fields from direct numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using “virtual sensors” with various access modes, ranging from direct web-browser queries, through access from Matlab, Python, Fortran and C programs executing on clients’ platforms, to cut-out services for downloading raw data. The data have been used in over 150 scientific publications.
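The Large Hadron Collider figures quoted in the list above can be sanity-checked with a short unit conversion. This is only a rough sketch: it assumes decimal units (1 exabyte = 1000 petabytes) and a 365-day year, neither of which is stated in the text, and the variable names are illustrative.

```python
PB_PER_EB = 1000        # assumption: decimal units, 1 exabyte = 1000 petabytes
DAYS_PER_YEAR = 365     # assumption: data flow spread evenly over a full year

unfiltered_pb_per_year = 150e6   # "150 million petabytes annual rate"
recorded_pb_per_year = 25        # recorded by all four experiments, before replication

unfiltered_eb_per_day = unfiltered_pb_per_year / PB_PER_EB / DAYS_PER_YEAR
recorded_percent = recorded_pb_per_year / unfiltered_pb_per_year * 100

collisions_per_second = 600e6    # "nearly 600 million collisions per second"
kept_per_second = 100            # collisions of interest per second
discarded_percent = (1 - kept_per_second / collisions_per_second) * 100

print(f"unfiltered flow: about {unfiltered_eb_per_day:.0f} EB/day")
print(f"recorded share of the flow: {recorded_percent:.7f} %")
print(f"collisions discarded: {discarded_percent:.6f} %")
```

Under these assumptions the unfiltered flow works out to roughly 411 exabytes per day, so “nearly 500” is a generous rounding, while the recorded share is well under the quoted 0.001% and the discarded share is above the quoted 99.99995%.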


Big data can be used to improve training and understanding competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics.[115] Future performance of players could be predicted as well. Thus, players’ value and salary is determined by data collected throughout the season.[116]

The movie MoneyBall demonstrates how big data could be used to scout players and also identify undervalued players.[117]

In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency. [118] Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, using big data, race teams try to predict the time in which they will finish the race beforehand, based on simulations using data collected over the season. [119]


  • eBay.com uses two data warehouses at 7.5 petabytes and 40 PB as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising. [120]
  • Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, with databases of 7.8 TB, 18.5 TB, and 24.7 TB. [121]
  • Facebook handles 50 billion photos from its user base. [122]
  • Google was handling roughly 100 billion searches per month as of August 2012. [123]
  • Oracle NoSQL Database has been tested to reach 1.2M ops/sec with 10 shards. [124]

Research activities

Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach, engaged at Tackling the Challenges of Big Data by MIT Computer Science and Artificial Intelligence Laboratory, and Dr. Amir Esmailpour at UNH Research Group investigated the key features of big data as the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in an encrypted form. Moreover, they proposed an approach for identifying the encoding technique to advance towards an accelerated search over encrypted text, leading to security enhancements in big data. [125]

In March 2012, the White House announced a national “Big Data Initiative” that consisted of six federal departments and agencies committing more than $200 million to big data research projects. [126]

The initiative included a National Science Foundation “Expeditions in Computing” grant of $10 million over 5 years to the AMPLab [127] at the University of California, Berkeley. [128] The AMPLab also received funding from DARPA and a large number of industrial sponsors, and uses big data to attack a wide range of problems, from predicting traffic congestion [129] to fighting cancer. [130]

The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, [131] led by the Energy Department’s Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department’s supercomputers.

The US state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. [132] The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts. [133]

The European Commission is funding the 2-year-long Big Data Public Private Forum through its Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy for the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, its next framework program. [134]

The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets. [135]

At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world. [136]

To make manufacturing more competitive in the United States (and globally), there is a need to integrate more American ingenuity and innovation into manufacturing; therefore, the National Science Foundation has granted funding to the Industry/University Cooperative Research Center for Intelligent Maintenance Systems (IMS) at the University of Cincinnati to focus on developing advanced predictive tools and techniques applicable in a big data environment. [137] In May 2013, the IMS Center held an industry advisory board meeting on the subject of big data.

Computational social sciences: anyone can use Application Programming Interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences. [138] Often these APIs are provided for free. [138] Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behavior and real-world economic indicators. [139] [140] [141] The authors of the study examined Google query logs by the ratio of the volume of searches for the coming year (‘2011’) to the volume of searches for the previous year (‘2009’), which they call the ‘future orientation index’. [142] They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP. The results hint that there may be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.
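The index described above is a simple ratio, which a few lines of code can make concrete. This is a minimal sketch with made-up search volumes for three fictional countries; it does not reproduce the study's data or its exact methodology, and the function and variable names are illustrative.

```python
def future_orientation_index(next_year_searches, previous_year_searches):
    """Ratio of search volume for the coming year to that for the previous year."""
    if previous_year_searches <= 0:
        raise ValueError("previous-year search volume must be positive")
    return next_year_searches / previous_year_searches

# Made-up search volumes (next year, previous year) for three fictional countries.
volumes = {"A": (1200, 1000), "B": (800, 1000), "C": (1500, 1000)}

index_by_country = {
    country: future_orientation_index(nxt, prev)
    for country, (nxt, prev) in volumes.items()
}

# Countries ordered from most to least future-oriented.
ranked = sorted(index_by_country, key=index_by_country.get, reverse=True)
print(index_by_country, ranked)
```

An index above 1 means a country's users searched more for the coming year than for the previous one; comparing such values against per capita GDP is the step the study then takes.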

Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors for stock market moves using Google Trends. [143] Their analysis of Google search volume, published in Scientific Reports, [144] suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets. [145] [146] [147] [148] [149] [150] [151] [152]

Big data sets come with algorithmic challenges that previously did not exist. Hence, there is a need to fundamentally change the ways data is processed. [153]

The Workshops on Algorithms for Modern Massive Data Sets (MMDS) bring together computer scientists, statisticians, mathematicians, and data analysts. [154]

Sampling big data

An important research question that can be asked about big data sets is whether you need to look at the full data to draw certain conclusions about the properties of the data, or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. However, sampling (statistics) enables the selection of data points from within the larger data set to estimate the characteristics of the whole population. For example, there are about 600 million tweets produced every day. Is it necessary to look at all of them to determine the topics being discussed? Is it necessary to look at all the tweets to determine the sentiment on each of the topics? In manufacturing, different types of sensory data, such as acoustics, vibration, pressure, current, voltage and controller data, are available at short time intervals. To make predictions, it may not be necessary to look at all the data; a sample may be sufficient. Big data can be broken down into various data point categories, such as demographic, psychographic, behavioral, and transactional data.

Some work has been done on sampling algorithms for big data. A theoretical formulation for sampling Twitter data has been developed. [155]
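As an illustration of sampling from data that is too large to hold in memory, the following is a minimal sketch of reservoir sampling (often called Algorithm R), a well-known technique for drawing a uniform random sample of fixed size from a stream of unknown length. It is not the Twitter-specific formulation cited above; the function name and parameters are assumptions chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return k items drawn uniformly at random from `stream`,
    reading the stream once and storing at most k items."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a stored one
            # only if the slot falls inside the reservoir, which happens
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 "tweet ids" from a stream of one million without
# materialising the stream as a list.
sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(sample)
```

Because each item is examined once and only k items are kept, memory use stays constant no matter how long the stream is, which is why this family of methods is often discussed for large data sets.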


Critiques of the big data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done. [156] One approach to this criticism is the field of critical data studies.

Critiques of the big data paradigm

“A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data”.[16] In their critique, Snijders, Matzat, and Reips point out that often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson’s assertion that big data will spell the end of theory:[157] focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts.[158] Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analysed, must be complemented by “big judgment,” according to an article in the Harvard Business Review.[159]

Much in the same line, it has been pointed out that decisions based on the analysis of big data are inevitably “informed by the world as it was in the past, or, at best, as it currently is”. [66] Fed by large amounts of data on past experiences, algorithms can predict future development if the future is similar to the past. [160] If the system dynamics of the future change (if it is not a stationary process), the past can say little about the future. In order to make predictions in changing environments, it would be necessary to have a thorough understanding of the system dynamics, which requires theory. [160] As a response to this criticism, it has been suggested to combine big data approaches with computer simulations, such as agent-based models [66] and complex systems. Agent-based models are increasingly getting better at predicting the outcome of social complexities, even in unknown future scenarios, through computer simulations that are based on a collection of mutually interdependent algorithms. [161] [162] In addition, the use of multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, has proven useful as an analytic approach that goes well beyond the bivariate approaches (cross-tabs) typically employed with smaller data sets.

In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis. [163] A new postulate is now accepted in the biosciences: the information provided by data in large volumes (omics) without a prior hypothesis is complementary, and sometimes necessary, to conventional approaches based on experimentation. [164] [165] In the massive approaches, it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. [166] The search logic is reversed and the limits of induction (“Glory of Science and Philosophy scandal”, C. D. Broad, 1926) are to be considered. [citation needed]

Privacy advocates are concerned about the threat to privacy represented by the increasing storage and integration of personally identifiable information; expert panels have published various policy recommendations to conform practice to expectations of privacy. [167] [168] [169]

Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in a context of big data. The use of Big Data should be monitored at national and international levels. [170]

Critiques of big data execution

Ulf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a “fad” in scientific research. [138] Researcher Danah Boyd has raised concerns about the use of big data in science neglecting principles such as choosing a representative sample, by being too concerned about actually handling the huge amounts of data. [171] This approach may lead to results that are biased in one way or another. Integration across heterogeneous data resources, some of which might be considered big data and others not, presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science. [172] In the provocative article “Critical Questions for Big Data”, [173] the authors call big data a part of mythology: “large data sets offer a higher form of intelligence and knowledge […], with the aura of truth, objectivity, and accuracy”. Users of big data are often “lost in the sheer volume of numbers”, and “working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth”. [173] Recent developments in the field of BI, such as pro-active reporting, especially target improvements in the usability of big data through automated filtering of non-useful data and correlations. [174]

Big data analysis is often shallow compared to the analysis of smaller data sets. [175] In many big data projects, there is no large data analysis happening; instead, the challenge is the extract, transform, load part of data preprocessing. [175]

Big data is a buzzword and a “vague term”, [176] [177] but at the same time an “obsession” [177] with entrepreneurs, consultants, scientists and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating flu outbreaks by a factor of two. Similarly, Academy Awards and election predictions based solely on Twitter were more often off than on target. Big data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. In particular, data sources such as Twitter are not representative of the general population, and results drawn from such sources may lead to wrong conclusions. Google Translate, which is based on big data statistical analysis of text, does a good job at translating web pages. However, results from specialized domains may be dramatically skewed. On the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant. Ioannidis argued that “most published research findings are false” [178] due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e., process a large amount of scientific data), the likelihood of a “significant” result actually being false grows fast, even more so when only positive results are published. Moreover, big data analytics results are only as good as the model on which they are predicated. In one example, big data took part in attempting to predict the results of the 2016 US Presidential Election [179] with varying degrees of success. Forbes predicted “If you believe in Big Data analytics, it’s time to start planning for a Hillary Clinton presidency and all that entails.” [180]
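The multiple comparisons problem mentioned above can be illustrated with a short calculation. The sketch below assumes a simplified setting that the article does not specify: every tested hypothesis is actually null, the tests are independent, and each uses the same significance level. The function names are chosen for this example.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests of true null hypotheses, each at level `alpha`."""
    if num_tests < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("num_tests must be >= 0 and alpha must be in [0, 1]")
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests that appear significant by chance alone."""
    if num_tests < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("num_tests must be >= 0 and alpha must be in [0, 1]")
    return num_tests * alpha


# How quickly spurious "discoveries" accumulate as more hypotheses are tested.
for m in (1, 20, 1000):
    print(f"{m:>5} tests: "
          f"P(at least one false positive) = {familywise_error_rate(m):.4f}, "
          f"expected false positives = {expected_false_positives(m):.1f}")
```

Under these assumptions, with 20 tests at the 0.05 level the chance of at least one false positive is already roughly 0.64, which is why large-scale hypothesis testing usually calls for some form of correction.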