Big data

Big data, literally "large data", also called megadata (the recommended term 3) and sometimes Big Data 4, designates sets of data that have become so large that they exceed human intuition and capacity for analysis, and even the capabilities of conventional database or information management tools 5.

The quantitative explosion (and frequent redundancy) of digital data has forced new ways of seeing and analyzing the world 6. New orders of magnitude concern the capture, storage, search, sharing, analysis and visualization of data. The prospects for big data processing are enormous and partly unsuspected; there is often talk of new possibilities for exploring information disseminated by the media 7, for knowledge and evaluation, for trend analysis and foresight (climate, environmental, socio-political, etc.) and for risk management (commercial, insurance, industrial, natural), as well as for religious, cultural and political phenomena 8, but also in terms of genomics or metagenomics 9, for medicine (understanding of brain function, epidemiology, eco-epidemiology, etc.), meteorology and adaptation to climate change, the management of complex energy networks (via smart grids or a future "energy internet"), ecology (functioning and dysfunction of ecological networks, food webs, with GBIF for example), or security and the fight against crime 10. The multiplicity of these applications is already giving rise to a genuine economic ecosystem involving the biggest players in the information technology sector 11.

Some [who?] assume that big data could help companies reduce risk and facilitate decision-making, or create differentiation through predictive analytics and a more personalized and contextualized "customer experience" 12.

Various experts, major institutions (such as MIT 13 in the United States), administrations 14 and specialists in the field of technologies or their uses 15 consider the big data phenomenon to be one of the major IT challenges of the 2010-2020 decade and have made it one of their new research and development priorities; it could notably lead to artificial intelligence being explored by self-learning artificial neural networks 16.

Dimensions 

Big data is accompanied by the development of applications with an analytical purpose, which process data in order to derive meaning from them 34. These analyses are called Big Analytics 35 or "data crushing". They deal with complex quantitative data using distributed computing methods and statistics.

In 2001, a research report by the META Group (now Gartner) 36 defined the challenges inherent in data growth as three-dimensional: complex analyses satisfy the so-called "3V" rule (volume, velocity and variety 37). This model is still widely used today to describe the phenomenon 38.

The global average annual growth rate of the big data technology and services market over the 2011-2016 period is expected to be 31.7%. This market is expected to reach $23.8 billion in 2016 (according to IDC, March 2013). Big data should also account for 8% of European GDP in 2020 (AFDEL, February 2013).

Volume

This is a relative dimension: as Lev Manovich noted in 2011 39, big data once designated "data sets large enough to require supercomputers", but it quickly (in the 1990s-2000s) became possible to use standard software on desktop computers to analyze or co-analyze large data sets 40.

The volume of stored data is growing rapidly: the digital data created worldwide grew from 1.2 zettabytes per year in 2010 to 1.8 zettabytes in 2011 41, then 2.8 zettabytes in 2012, and will rise to 40 zettabytes in 2020. As an example, in January 2013 Twitter generated 7 terabytes of data each day and Facebook 10 terabytes 42. In 2014, Facebook Hive generated 4,000 TB of data per day 43.

It is the technical and scientific facilities (meteorology, etc.) that produce the most data [ref. necessary]. Many pharaonic projects are under way. The "Square Kilometer Array" radio telescope, for example, will produce 50 terabytes of analyzed data per day, at a rate of 7,000 terabytes of raw data per second 44.

Variety

The volume of big data confronts data centers with a real challenge: the variety of the data. These are not traditional relational data; they are raw, semi-structured or even unstructured (although unstructured data must be structured before use 45). They are complex data coming from the Web (Web Mining), from text (Text Mining) and from images (Image Mining). They may be public (Open Data, Web of Data), geo-demographic by block (IP addresses), or the property of consumers (360° profiles) [ref. necessary]. This makes them difficult to use with traditional tools.

The multiplication of tools for collecting data on individuals and objects makes it possible to gather ever more data 46. And the analyses are all the more complex as they increasingly concern the links between data of different natures.

Velocity 

Velocity represents the frequency with which data are generated, captured, shared and updated 47.

Growing data flows must be analyzed in near real time (data stream mining) to meet the needs of time-sensitive processes 48. For example, the systems put in place by stock markets and companies must be able to process these data before a new generation cycle has begun, with the risk that humans lose much of their control over the system when the main operators become "robots" capable of issuing buy or sell orders at the nanosecond (High Frequency Trading) without having all the relevant analysis criteria for the medium and long term.
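
As an illustration of the general idea of data stream mining (not of any real trading system), the following minimal Python sketch processes an unbounded stream one item at a time, keeping only a bounded sliding window in memory and flagging values that deviate strongly from the recent average. The window size and alert threshold are illustrative assumptions.

```python
from collections import deque

def monitor_stream(values, window_size=100, threshold=3.0):
    """Flag values that deviate strongly from the recent sliding-window mean.

    Toy near-real-time stream analysis: items are processed as they arrive,
    with only a bounded window kept in memory instead of the full history.
    """
    window = deque(maxlen=window_size)
    for value in values:
        if len(window) >= 10:  # wait for a minimal history before alerting
            mean = sum(window) / len(window)
            variance = sum((x - mean) ** 2 for x in window) / len(window)
            std = variance ** 0.5
            if std > 0 and abs(value - mean) > threshold * std:
                yield ("alert", value, mean)
        window.append(value)

# Usage: any iterable (socket, message queue, file tail, ...) can feed the monitor.
if __name__ == "__main__":
    import random
    stream = (random.gauss(100, 5) if i % 500 else 500.0 for i in range(2000))
    for event in monitor_stream(stream):
        print(event)
```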

Difference with business intelligence 

While Gartner's 3V definition is still widely adopted (or even extended with additional "V"s according to the inspiration of marketing departments), the maturation of the subject has brought out another, more fundamental criterion of difference with business intelligence, concerning the data and their use 49:

  • Business intelligence: use of descriptive statistics, on data with a high density of information, in order to measure phenomena, detect trends, etc.;
  • Big data: use of inferential statistics, on data with a low density of information 50, whose large volume makes it possible to infer laws (regressions, etc.), thus giving big data (within the limits of inference) predictive capabilities 51.

Synthetically:

  • traditional computing, including business intelligence, is based on a model of the world;
  • with big data, mathematics finds a model in the data 52, 53 (a minimal sketch of this contrast follows below).
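
The contrast can be made concrete with a small, purely hypothetical sketch: descriptive statistics summarize data that have already been observed, while inferential statistics fit a model (here a simple linear regression) from which predictions about unseen cases are made. The data, variable names and coefficients are invented for illustration only.

```python
import statistics
import random

# Descriptive statistics (business intelligence style): measure what happened.
monthly_sales = [120, 135, 128, 150, 160, 158]
print("mean:", statistics.mean(monthly_sales))
print("stdev:", statistics.stdev(monthly_sales))

# Inferential statistics (big data style): infer a law from many low-density
# observations and use it to predict unseen cases (simple linear regression).
random.seed(0)
n = 10_000
x = [random.uniform(0, 10) for _ in range(n)]        # e.g. exposure to an ad
y = [2.0 * xi + 5 + random.gauss(0, 3) for xi in x]  # noisy response

mean_x, mean_y = sum(x) / n, sum(y) / n
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x
print(f"inferred law: y ~ {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 12:", slope * 12 + intercept)
```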

Representation

Models 

Traditional relational databases cannot handle large data volumes. New representation models can guarantee performance at these volumes. These technologies, called Business Analytics & Optimization (BAO), make it possible to manage massively parallel databases 54. Big Data Architecture Frameworks (BDAF) 55 are proposed by the actors in this market, such as MapReduce, created by Google and used in the Hadoop framework. With this system, queries are split and distributed to parallelized nodes, then executed in parallel (map), and the results are collected and returned (reduce). Teradata, Oracle and EMC (through the acquisition of Greenplum) also offer such structures, based on standard servers with optimized configurations. They compete with publishers such as SAP and, more recently, Microsoft 56. Market players rely on highly scalable systems and on NoSQL solutions (MongoDB, Cassandra) rather than on traditional relational databases 57.
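
A minimal sketch of the map/reduce idea described above, independent of Hadoop or any specific framework; the function names and the word-count task are illustrative. The map phase is applied to documents in parallel, and the reduce phase collates the partial results.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_phase(document):
    """Map: emit (word, 1) pairs for one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(mapped_pairs):
    """Reduce: collate the emitted pairs and sum the counts per word."""
    counts = defaultdict(int)
    for pairs in mapped_pairs:
        for word, n in pairs:
            counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    documents = ["big data is big", "data about data"]
    with Pool() as pool:             # map tasks run in parallel
        mapped = pool.map(map_phase, documents)
    print(reduce_phase(mapped))      # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```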

Storage 

To address big data issues, the storage architecture of systems must be rethought, and storage models are multiplying accordingly.

  • Cloud computing: access is via the network, services are on demand and self-service, on shared and configurable computing resources 58. The best-known services are Google BigQuery, Big Data on Amazon Web Services and Microsoft Windows Azure.
  • Hybrid supercomputers: HPC, for High Performance Computing, found in France in national computing centers such as IDRIS and CINES, but also at the CEA or HPC-LR 59.
  • Distributed File System (DFS): data are no longer stored on a single machine, because the quantity is far too large. The data (files) are "cut" into chunks of a defined size, and each chunk is sent to a specific machine using local storage 60. Local storage is preferred over SAN/NAS storage because of network bottlenecks and SAN network interfaces; in addition, SAN-type storage is much more expensive for much lower performance. In distributed storage systems for big data, the principle of "data locality" 61 is introduced: the data are saved where they can be processed (a toy sketch of this chunking and placement follows after this list).
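
The chunking and data-locality principle can be sketched as follows. This is a toy illustration under assumed parameters (chunk size, node names, hash-based placement), not the behaviour of any specific DFS such as HDFS.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, a plausible order of magnitude for DFS chunks
NODES = ["node-1", "node-2", "node-3"]

def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    """Cut a file into fixed-size chunks, as a distributed file system would."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

def place_chunk(chunk, nodes=NODES):
    """Assign a chunk to a node; processing is later scheduled on that same
    node ("data locality") instead of moving the data over the network."""
    digest = hashlib.sha256(chunk).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Usage sketch (hypothetical file name):
# for i, chunk in enumerate(split_into_chunks("measurements.bin")):
#     print(f"chunk {i} -> {place_chunk(chunk)}")
```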

Applications

Big data has applications in many fields: scientific programs (CERN 28, Mastodons), business tools (IBM 29, Amazon Web Services, BigQuery, SAP HANA), sometimes specialized (Teradata, Jaspersoft 30, Pentaho 31, etc.), startups, as well as in the field of open source (Apache Hadoop, Infobright 32, Talend 33, etc.) and open operating software (for example the open big data analysis software H2O).

Scientific research

Big data emerged from scientific research, and in turn it feeds part of it. Thus, CERN's Large Hadron Collider uses about 150 million sensors delivering data 40 million times per second; out of 600 million collisions per second, after filtering there remain 100 collisions of interest per second, i.e. 25 PB of data to be stored per year, and 200 PB after replication 62, 63, 64. Big data analysis tools could refine the exploitation of these data.

When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it gathered in a few weeks more data than had ever been collected in the history of astronomy. It continues at a rate of 200 GB per night, and over ten years (2000-2010) it stored more than 140 terabytes of information. The Large Synoptic Survey Telescope, planned for 2015, is expected to amass as much every five days 65.

Decoding the first human genome took ten years, but today takes less than a week: DNA sequencers have improved by a factor of 10,000 in the last ten years, i.e. 100 times Moore's law (which gives a factor of about 100 over ten years) 66. In biology, massive approaches based on a logic of data mining and inductive research are legitimate and complementary to traditional approaches based on an initial hypothesis 67. Big data has also made its entry into the field of proteins.

The NASA Center for Climate Simulation (NCCS) stores 32 PB of climate observation and simulation data 68.

The social sciences explore corpora as diverse as the content of Wikipedia around the world or the millions of publications and tweets carried over the Internet.

Politics

Big data analysis played an important role in Barack Obama's re-election campaign, in particular for analyzing the political opinions of the population 69. Since 2012, the US Department of Defense has invested more than $250 million annually in big data projects 70. The US government owns six of the ten most powerful supercomputers in the world 71. The National Security Agency is currently building the Utah Data Center; once completed, this data center will be able to store up to a yottabyte of information collected by the NSA on the internet 72. In 2013, big data was one of the seven strategic ambitions of France determined by the Innovation 2030 Commission 73.

Private sector

Nordstrom processes over one million customer transactions per hour, which are imported into databases estimated to contain more than 2.5 PB of information 74. Facebook processes 50 billion photos. In general, the exploration of big data makes it possible to develop customer profiles whose existence was not previously suspected 75.

The use of big data is now part of the strategy of museums as prestigious as the Guggenheim Museum. Using electronic transmitters placed in the rooms, visitors are followed throughout their visit. The museum can thus design new routes based on the most popular works, or decide which exhibitions to put on 76.

Big data is also used in the field of insurance. The increase in the number of connected objects makes it possible to collect a large amount of data in real time, which helps to better understand the people and objects insured 77.

Energy sector

Smart buildings (possibly within smart cities) are buildings characterized by a "hybridization" between digital technology and energy.

These buildings or individual dwellings may produce energy (or even be "energy positive"). They may also produce data on this energy and/or on their energy consumption. Once aggregated and analyzed, these data can be used to understand, or even anticipate, the consumption of users, neighborhoods, cities, etc., according to variations in context, particularly meteorological.

The analysis of the data collected on production (solar, micro-wind, etc.) and on consumption in a building, via connected objects and the smart grid, also makes it possible to manage user consumption in a "personalized" way.

While waiting for significant energy storage infrastructure to be installed, conventional power plants must still be used on cloudy, windless days, and on exceptionally sunny and windy days (e.g. 8 May 2016, when for 4 hours wind and sun generated more than 90% of the country's electricity), coal and gas power plants must reduce their production in time. An extreme case is that of a (predictable) solar eclipse. Managing these peaks and this intermittency now costs more than €500 million per year in Germany and leads to emissions of CO2 and other greenhouse gases that would otherwise be avoided 78. Thanks to the correlations that can emerge from fine analysis of big data, energy operators can better understand fine variations in renewable energy sources and match them with actual demand.

Examples  :

  • In 2009, the National Center for Atmospheric Research (NCAR) in Boulder, Colorado, launched such a system. By mid-2016 it was operational in eight US states. According to Xcel Energy (a Denver, Colorado-based utility with the largest wind capacity in the US), this approach has improved forecasting enough that, since 2009, customers have avoided US$60 million per year in corrective expenditures and the emission of more than a quarter of a million tons of CO2 per year, thanks to reduced use of fossil fuels 78;
  • In 2016, Germany took an important step towards the internet of energy proposed by the futurist Jeremy Rifkin, by experimenting with a process (EWeLiNE 79) for the automatic analysis of energy and meteorological big data.
    Background: with 45,000 megawatts, wind capacity in Germany is the third largest in the world, behind China and the United States, and only China competes with Germany in terms of solar capacity. In 2016, one third of the country's electricity was of renewable origin, and the government targets 80% of the total before 2050 and 35% before 2020 78. This will require developing a "smart grid" allowing an even more intelligent and reactive distribution and storage of energy.
    Experimentation: in June 2016, to better adapt the electricity grid to the intermittent nature of solar and wind energy, as well as to instantaneous, daily and seasonal variations in demand, and to limit reliance on conventional power plants, Germany launched a process (EWeLiNE) for the automatic analysis of big data.
    EWeLiNE brings together three operators (the TSOs Amprion GmbH, TenneT TSO GmbH and 50 Hertz) 78. They receive €7 million (disbursed by the Federal Ministry of Economic Affairs and Energy) 78. Software will exploit big weather data and data of energy interest to predict with increasing precision the instantaneous productive capacity of renewables (because when the wind picks up or a cloud passes over a solar farm, production rises or falls locally and the grid has to adapt). EWeLiNE must improve the real-time and anticipated management of production and consumption through energy-weather forecasting, via a "learning" system for statistically advanced prediction of wind strength (at turbine-hub level) and solar power (at the level of photovoltaic modules).
    Large wind turbines often themselves measure wind conditions at the hub in real time, and some solar panels incorporate light intensity sensors 78. EWeLiNE combines these data with conventional weather data (terrestrial, radar and satellite) and feeds them into sophisticated computer models ("learning systems") to better predict electricity generation over the next 48 hours (or more) 78. The scientific team checks these power forecasts, and the computers "learn" from their errors, allowing the predictive models to become more and more accurate.
    EWeLiNE was first tested (in June 2016) on a few networks of solar panels and wind turbines equipped with sensors. Starting in July, the operators were to gradually expand the system by connecting a growing number of solar and wind facilities transmitting their data in real time, in order to adjust the amount of energy produced nationwide (the goal is to achieve this within two years) 78. We will then approach what J. Rifkin has called the energy internet, except that it also integrates domestic and individual uses (which should be enabled by the spread of smart meters, smart systems and local or mobile energy storage).
    First feedback: the first German results suggest that the approach will work, since the work of German modelers had already yielded good improvements before access to these data. EWeLiNE is neither an adaptation nor a translation of the American NCAR system; the meteorological models and the algorithms converting weather forecasts into power forecasts differ 78 (a purely schematic sketch of such a conversion follows below).
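
To illustrate, purely schematically, what "converting weather forecasts into power forecasts" means, the sketch below applies an idealized turbine power curve to a series of forecast hub-height wind speeds. The curve parameters and the forecast values are invented and do not describe EWeLiNE, NCAR or any real turbine.

```python
def turbine_power_kw(wind_speed_ms,
                     cut_in=3.0, rated_speed=12.0, cut_out=25.0, rated_kw=2000.0):
    """Idealized power curve: no output below cut-in or above cut-out,
    cubic growth between cut-in and rated speed, rated power above that."""
    if wind_speed_ms < cut_in or wind_speed_ms >= cut_out:
        return 0.0
    if wind_speed_ms >= rated_speed:
        return rated_kw
    return rated_kw * ((wind_speed_ms - cut_in) / (rated_speed - cut_in)) ** 3

# A short forecast of hub-height wind speeds (m/s), invented for illustration.
forecast = [4.2, 6.8, 9.5, 11.0, 13.2, 7.1]
power_forecast = [turbine_power_kw(v) for v in forecast]
print([round(p) for p in power_forecast])
```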

Prospects and developments

One of the main productivity issues of big data as it evolves will be the logistics of information, that is, how to ensure that the relevant information arrives at the right place at the right time. This is a micro-economic approach; its effectiveness will thus depend on combining micro- and macro-economic approaches to a problem.

According to an IDC study, digital data created around the world will reach 40 zettabytes by 2020 80. In comparison, Facebook generated about 10 terabytes of data per day in early 2013. The development of massive data hosting seems to have been accelerated by several simultaneous phenomena: the hard drive shortage following the 2011 floods in Thailand, the explosion of the mobile device market (smartphones and tablets in particular), etc. Added to this, the ever-closer democratization of cloud computing, through tools like Dropbox, is bringing big data to the center of information logistics.

In order to make the most of big data, a great deal of progress still has to be made, along three axes.

Data modeling

Current data modeling methods, as well as database management systems, were designed for much lower data volumes. Data mining has fundamentally different characteristics, and current technologies cannot exploit them. In the future, data modeling methods and query languages will be needed that allow:

  • a representation of the data in accordance with the needs of several scientific disciplines;
  • the description of discipline-specific aspects (metadata models);
  • the representation of the provenance of the data;
  • the representation of contextual information on the data;
  • the representation and support of uncertainty;
  • the representation of the quality of the data 81 (a hypothetical sketch of such a self-describing record follows after this list).
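
A hypothetical sketch of what these requirements might look like in practice: a record that carries its provenance, discipline-specific context, uncertainty and quality alongside the value itself. All field names and values are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Observation:
    """A data point that carries the metadata listed above along with it."""
    value: float
    unit: str
    source: str                     # provenance: which instrument / dataset
    discipline: str                 # discipline-specific context
    timestamp: datetime
    uncertainty: float              # e.g. one standard deviation
    quality_flag: str = "unchecked"
    context: dict = field(default_factory=dict)

obs = Observation(
    value=17.3, unit="degC", source="station-042/sensor-3",
    discipline="meteorology", timestamp=datetime(2013, 6, 1, 12, 0),
    uncertainty=0.2, quality_flag="validated",
    context={"site": "rooftop", "calibration": "2013-05-15"},
)
print(obs)
```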

There are many other research topics related to this theme, including: model reduction for PDEs, compressed sensing in imaging, the study of high-order numerical methods, probability, statistics, numerical analysis, deterministic and stochastic partial differential equations, approximation, high-performance computing, algorithmics, etc. A large part of the scientific community, particularly in applied mathematics and computer science, is concerned by this buoyant theme.

Data Management

The need to handle extremely large data is obvious, and today's technologies are not sufficient to do so. Basic concepts of data management identified in the past need to be rethought. For scientific research, for example, it will be essential to reconsider the principle that a query on a DBMS provides a complete and correct answer regardless of the time or resources required. Indeed, the exploratory dimension of data mining means that scientists do not necessarily know what they are looking for. It would be advisable for the DBMS to provide quick and inexpensive answers that are only approximations, but which guide the scientist in his research 81.
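
A toy sketch of this idea of quick, approximate answers: instead of scanning the full dataset, an aggregate is estimated from a random sample, trading exactness for speed and returning a rough error bound as a quality hint. Names, sample sizes and the in-memory "table" are illustrative assumptions, not a description of any DBMS.

```python
import random

def approximate_mean(dataset, sample_size=10_000, seed=42):
    """Estimate the mean from a random sample instead of a full scan,
    returning the estimate and a rough standard error as a quality hint."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    mean = sum(sample) / len(sample)
    variance = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
    std_error = (variance / len(sample)) ** 0.5
    return mean, std_error

# Usage: a large in-memory list stands in for a table the DBMS would otherwise scan.
data = [random.gauss(50, 10) for _ in range(1_000_000)]
estimate, err = approximate_mean(data)
print(f"approximate mean = {estimate:.2f} +/- {err:.2f}")
```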

In the area of customer data, there are also real needs to exploit these data, particularly because of the sharp increase in volume in recent years 82. Big data and related technologies can address a variety of issues, such as shorter customer data analysis times, the ability to analyze all customer data and not just a sample of them, or the retrieval and centralization of new customer data sources to be analyzed in order to identify sources of value for the company.

Data Management Tools

The tools currently used are not in line with the data volumes generated by the exploration of big data. Tools need to be developed to better visualize, analyze and catalog datasets in order to enable a data-driven research perspective 81. Big data research is only just beginning. The amount of data is growing much faster than our knowledge of the field. The GovLab site predicts that there will not be enough data scientists: in 2018, the United States would need 140,000 to 190,000 scientists specializing in big data 70.

Entropy Management

The deluge of data that feeds big data (some of which is illegal or uncontrolled) is often metaphorically compared to a continuous flow of food, oil or energy (which feeds data mining companies and, secondarily, the information society 83). It exposes us to the risk of information overload and could be compared to the equivalent of a "pollution" 40 of cyberspace and the noosphere (metaphorically, big data partly corresponds to a sort of large informational spill, or to a diffuse but growing and continuous eutrophication of the digital world that could lead to dystrophication, or even to dysfunction within digital ecosystems) 84.

Faced with this "informational entropy", some negentropic responses have emerged (Wikipedia is one of them, by sorting and restructuring previously published information).

Other responses have been the creation of increasingly powerful and fast search engines and tools for semantic analysis and for searching data streams.

Nevertheless, the analysis of big data itself tends to generate big data, with a need for storage and servers that seems to grow exponentially.

Energy balance

In parallel with the growth of the mass and flow of data, growing amounts of energy are spent, on the one hand in the race for data mining, encryption/decryption, analytical and authentication tools, and on the other hand in building server farms that need to be cooled, at the expense of the energy and electricity balance of the Web.

Misconceptions

In 2010, human-generated datasets were increasingly being complemented by other data, acquired massively, passively and automatically by a growing number of electronic sensors, in forms that are increasingly interoperable and understandable by computers. The volume of data stored in the world is more than doubling every two years, and as it migrates more and more onto the internet, some see in intelligently used big data a source of information that would make it possible to fight poverty, crime or pollution. At the other end of the spectrum of opinion, others, often defenders of privacy, have a darker vision, fearing or asserting that big data is rather a Big Brother appearing in "new clothes" 85, "in business clothes" 86.

In 2011, in a ten-year review of the Internet for society, danah boyd (of Microsoft Research) and Kate Crawford (University of New South Wales) provocatively challenged six misconceptions about big data: "The automation of research changes the definition of knowledge (...) Claims of objectivity and accuracy are misleading (...) Bigger data are not always better data (...) Not all data are equivalent (...) Accessible does not mean ethical (...) Limited access to big data creates new digital divides" 40 (between rich and poor) 40.

Risks

Several types of risks to privacy and fundamental rights are cited in the literature:

  • Dehumanization: in what Bruce Schneier calls "the golden age of surveillance", most people may feel dehumanized and no longer able to protect the personal and non-personal information concerning them, which is collected, analyzed and sold without their knowledge. As it becomes difficult to do without a credit card, a smartphone or Internet access, they may feel unable to escape constant surveillance or pressure to consume, to vote, etc.
  • IT security breaches: in an increasingly interconnected and Internet-dependent world, online security becomes crucial, for the protection of privacy but also for the economy (for example, in the event of a serious problem there is a risk of a loss of confidence in the security of online purchasing processes, which could have significant economic consequences).
  • “Vassalization of scientific research by commercial companies and their marketing services” 40.
  • Apophenia (unwarranted deductions) 40: access and interpretation biases are numerous ("a corpus is not more scientific or objective because one is able to scrape all the data from a site. Especially since there are many biases (technical, linked to APIs, but also organizational) in the very access to these data, which it would be wrong to consider as total. This access relies only on the goodwill of commercial companies and on the financial means available to researchers and universities") 40;
    In addition, a gender bias exists: most computer science researchers today are men, yet feminist historians and philosophers of science have shown that the gender of the person asking the questions often determines which questions will be asked 87.
  • Misinterpretation of certain data related to otherness, with possible socio-psychological consequences, for example misunderstanding or misinterpretation of the other ("the other is not a given," recalls D. Pucheu 88).
    Another risk is that of a "scarcity of opportunities for individuals to be exposed to things that were not foreseen for them, and thus a drying-up of the public space (as a space for deliberation and for forming projects that are not handled solely on the basis of individual interests), these unforeseen things being precisely constitutive of the common, or of the public space" 89.
  • Exacerbation of the digital divide, as data mining tools offer a few companies increased and almost instant access to billions of digitized data and documents. For those who know how to use these data, and within certain limits, they also offer a certain capacity to produce, sort or distinguish information considered strategic, thereby also allowing them to withhold or, on the contrary, release certain strategic information before others 90. This very privileged and opaque access to information can favor situations of conflict of interest or insider trading. There is a risk of growing inequalities with regard to data and the power that can be exercised over them: Manovich thus distinguishes three categories of actors, fundamentally unequal with regard to data: "those who create the data (whether consciously or by leaving digital traces), those who have the means to collect them, and those who have the competence to analyze them" (2011).
    The latter are few in number but highly privileged (they are often employed by companies and other big data entities and thus have the best access to the data; they help produce or guide the rules that will frame the data and the exploitation of big data). Institutional inequalities are a priori inescapable, but they can be minimized and should at least be studied, as they guide the data and the types of research and applications that will result from them.
  • Monopolistic ownership of certain big data sets collected by a few large companies (Google, Facebook, etc.) or by the public or secret tools of large states (e.g. PRISM), aimed at "capturing the real in order to influence it" 20; a huge amount of data is discreetly (and most of the time legally) collected by specialized companies or by state or intelligence agencies, including on the buying behaviors and Internet interests of all groups and individuals. These data are stored, and sometimes hacked (thus, in 2003, during a search for security vulnerabilities, the company Acxiom, one of the leading data brokers, realized that 1.6 billion consumer records had been hacked via 137 computer attacks carried out from January to July 2003; the stolen information included names, addresses and e-mail addresses of millions of Americans 91, 92, 93, 94). These data are then more or less updated, and eventually rented or sold for marketing and targeted advertising, for scientific studies by pollsters, for influence groups or political parties (which can thus more easily contact their potential constituents), etc. People whose data circulate in this way are generally not informed, have not given informed consent, and can hardly check these data or, above all, remove them from databases that keep them for a potentially unlimited period. Risks of errors and of misuse exist (in the field of insurance and bank loans, for example). According to the report by F. Lescalier entitled "Big Data: the new soothsayers" 95, 80% of global personal data is held by four major players, namely (in alphabetical order) Amazon, Apple, Facebook and Google.
  • Ethically unsustainable excesses, already noted in the grey or dark part 96 of the internet, including on the major social networks (such as Facebook and Twitter, which collect a great deal of data and information about their users and the networks they belong to 97, 98); others call for the adoption of good practices 99 and stricter ethical rules for data mining 100 and for the management of these big data 101, 102.
    Especially since the revelations of the American whistleblower Edward Snowden 103, some worry that, in addition to increasingly invasive (or pervasive 104) monitoring of our activities by Internet service providers 105, legislation may flourish that facilitates (under the pretext of economic convenience and/or national security) the use of tracing tools (via payment, loyalty or health cards, time clocks, surveillance cameras, certain smart grids or home automation tools, certain connected objects that geolocate their owner, etc.). Some of these laws explicitly facilitate or legitimize eavesdropping (the listening to and analysis of telephone conversations, the interception and analysis of emails and network traffic) and the general monitoring of Internet activities, a context that could prepare a generalized Orwellian surveillance of individuals. These authors denounce the emergence of processes and of a context that is increasingly Orwellian 23 and intrinsically difficult to control, and they insist on the importance of privacy 106, "even when one has nothing to hide" 107, 108, or (like B. Schneier in 2008 109 or Culnan & Williams in 2009 110) recall that the notions of security and of protection of privacy and of the autonomy of the individual are not opposed.

Governance and Big Data

Big data requires constant citizen debate 111 as well as adapted modes of governance and surveillance 112, because states, groups or companies with privileged access to big data can extract a large number of "diffuse personal data" which, by cross-referencing and analysis, allow an increasingly precise, intrusive and sometimes illegal profiling (disregarding the protection of privacy) of individuals, groups and companies, and in particular of their social, cultural, religious or professional status (an example is the NSA's PRISM program 113), their personal activities, their travel, purchasing and consumption habits, or even their health. "The rise of big data also brings great responsibilities."
