Data science

Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, and systems for extracting knowledge or insights from data in various forms, either structured or unstructured, [1] [2] similar to data mining.

Data science is a “concept to unify statistics, data analysis and their related methods” in order to “understand and analyze actual phenomena” with data. [3] It uses techniques and theories drawn from mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, cluster analysis, data mining, databases, and visualization.

Turing Award winner Jim Gray imagined data science as a “fourth paradigm” of science (empirical, theoretical, computational and now data-driven) and asserted that “everything about science is changing because of the impact of information technology” and the data deluge. [4] [5]

When Harvard Business Review called it “The Sexiest Job of the 21st Century”, [6] the term became a buzzword, and it is now often applied to business analytics, [7] or even to arbitrary use of data, or used as a sexed-up term for statistics. [8] There is no consensus on a definition or curriculum content. [7] Because of the current popularity of this term, there are many “advocacy efforts” surrounding it. [9]

History

Data science process flowchart from Doing Data Science, Cathy O’Neil and Rachel Schutt, 2013

The term “data science” (originally used interchangeably with “datalogy”) was used as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published the Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of applications.

In 1996, members of the International Federation of Classification Societies (IFCS) met in Kobe for their biennial conference. Here, for the first time, the term data science was included in the title of the conference (“Data Science, Classification, and Related Methods”), [10] after the term was introduced in a roundtable discussion by Chikio Hayashi. [3]

In November 1997, C. F. Jeff Wu gave the inaugural lecture entitled “Statistics = Data Science?” [11] for his appointment to the H. C. Carver Professorship at the University of Michigan. [12] In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer science, usage of the term “data science” and advocated that statistics be renamed data science and statisticians data scientists. [11] Later, he presented his lecture entitled “Statistics = Data Science?” at the P. C. Mahalanobis Memorial Lectures. [13] These lectures honor Prasanta Chandra Mahalanobis, an Indian scientist and statistician and founder of the Indian Statistical Institute.

In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to include “advances in computing with data” in his article “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics,” which was published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique. [14] In his report, Cleveland establishes six technical areas which he believed to encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.

In April 2002, the International Council for Science (ICSU) [15] began the Data Science Journal, [16] a publication covering issues such as the description of data systems, their publication on the internet, applications and legal issues. [17] Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science, [18] which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research. In 2005, the National Science Board published “Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century”, defining data scientists as “information and computer scientists, database and software programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection” whose primary activity is to “conduct creative inquiry and analysis.” [19]

In the 2012 Harvard Business Review article “Data Scientist: The Sexiest Job of the 21st Century”, [6] DJ Patil claims to have coined this term in 2008 with Jeff Hammerbacher to define their jobs at LinkedIn and Facebook, respectively. He asserts that a data scientist is “a new breed”, and that “a shortage of data scientists is becoming a serious constraint in some sectors”, but describes a much more business-oriented role.

In 2013, the IEEE Task Force on Data Science and Advanced Analytics [20] was launched. In 2013, the first European Conference on Data Analysis (ECDA) was organized in Luxembourg, establishing the European Association for Data Science (EuADS). The first international conference, the IEEE International Conference on Data Science and Advanced Analytics, was launched in 2014. [21] In 2014, General Assembly launched a student-paid bootcamp and The Data Incubator launched a competitive free data science fellowship. [22] In 2014, the American Statistical Association section on Statistical Learning and Data Mining renamed its journal to “Statistical Analysis and Data Mining: The ASA Data Science Journal” and in 2016 changed its section name to “Statistical Learning and Data Science.” [9] In 2015, the International Journal on Data Science and Analytics [23] was launched by Springer to publish original work on data science and big data analytics. In September 2015 the Gesellschaft für Klassifikation (GfKl) added the name “Data Science Society” to the name of the Society at the third ECDA conference at the University of Essex, Colchester, UK.

Relationship to Statistics

The popularity of the term “data science” has exploded in business environments and academia, as indicated by a jump in job openings. [24] However, many critical academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in contexts such as graduate degree programs. [7] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician.... Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.” [8] Similarly, in the business sector, multiple researchers and analysts state that data scientists alone are far from sufficient to grant companies a real competitive advantage [25] and consider data scientists to be only one of four job families that companies need in order to leverage big data: data analysts, data scientists, big data developers and big data engineers. [26]

On the other hand, responses to this criticism are just as numerous. In a 2014 Wall Street Journal article, Irving Wladawsky-Berger compares the enthusiasm for data science with the dawn of computer science. He argues that data science, like any other interdisciplinary field, employs methodologies and practices from across academia and industry, but that it will then morph them into a new discipline. He draws attention to the sharp criticisms that computer science, now a well-respected academic discipline, once had to face. [27] Likewise, NYU Stern’s Vasant Dhar, as do many other academic proponents of data science, [27] argues more specifically in December 2013 that data science is different from the existing practice of data analysis across all disciplines, which focuses only on explaining data sets. Data science seeks actionable and consistent patterns for predictive uses. [1] This practical engineering goal takes data science beyond traditional analytics. Now the data in those disciplines and applied fields that lacked solid theories, like health science and social science, could be sought and used to generate powerful predictive models. [1]

In an effort similar to Dhar’s, Stanford professor David Donoho, in September 2015, takes the proposition further by rejecting three simplistic and misleading definitions of data science in lieu of criticisms. [28] First, for Donoho, data science does not equate to big data, in that the size of the data set is not a criterion for distinguishing data science from statistics. [28] Second, data science is not defined by the computing skills needed to handle large data sets, in that these skills are already used for analyses across all disciplines. [28] Third, data science is a heavily applied field where academic programs right now do not sufficiently prepare data scientists for jobs, in that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program. [28] [29] As a statistician, Donoho, following many in his field, champions the broadening of the scope of learning in the form of data science, [28] like John Chambers, who urges statisticians to adopt an inclusive concept of learning from data, [30] or like William Cleveland, who urges prioritizing the extraction of applicable predictive tools from data over explanatory theories. [14] Together, these statisticians envision an increasingly inclusive applied field that grows out of traditional statistics and beyond.

For the future of data science, Donoho projects an ever-growing environment for open science where the data sets used for academic publications are openly accessible. [28] The US National Institutes of Health has already announced plans to enhance the reproducibility and transparency of research data. [31] Other big journals are likewise following suit. [32] [33] This way, the future of data science not only exceeds the boundary of statistical theories in scale and methodology, but will also revolutionize current academia and research paradigms. [28] As Donoho concludes, “the scope and impact of data science will continue to expand enormously in coming decades as scientific data and data about science itself become ubiquitously available.” [28]

References

  1. Dhar, V. (2013). “Data science and prediction”. Communications of the ACM. 56 (12): 64. doi:10.1145/2500499.
  2. Leek, Jeff (2013-12-12). “The key word in ‘Data Science’ is not Data, it is Science”. Simply Statistics.
  3. Hayashi, Chikio (1998-01-01). “What is Data Science? Fundamental Concepts and a Heuristic Example”. In Hayashi, Chikio; Yajima, Keiji; Bock, Hans-Hermann; Ohsumi, Noboru; Tanaka, Yutaka; Baba, Yasumasa. Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Japan. pp. 40-51. doi:10.1007/978-4-431-65950-1_3. ISBN 9784431702085.
  4. Tansley, Stewart; Tolle, Kristin Michele (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research. ISBN 978-0-9825442-0-4.
  5. Bell, G.; Hey, T.; Szalay, A. (2009). “COMPUTER SCIENCE: Beyond the Data Deluge”. Science. 323 (5919): 1297-1298. doi:10.1126/science.1170411. ISSN 0036-8075.
  6. Davenport, Thomas H.; Patil, DJ (Oct 2012). “Data Scientist: The Sexiest Job of the 21st Century”. Harvard Business Review.
  7. “Data Science: What’s The Half-Life Of A Buzzword?”. Forbes. 2013-08-19.
  8. “Nate Silver: What I need from statisticians”. 23 Aug 2013.
  9. Talley, Jill (2016-06-01). “ASA Expands Scope, Outreach to Foster Growth, Collaboration in Data Science”. AMSTATNEWS. American Statistical Association. Retrieved 2017-02-04.
  10. Press, Gil. “A Very Short History Of Data Science”.
  11. Wu, C. F. J. (1997). “Statistics = Data Science?” (PDF). Retrieved 9 October 2014.
  12. “Identity of statistics in science examined”. The University Record, 9 November 1997, The University of Michigan. Retrieved 12 August 2013.
  13. “P. C. Mahalanobis Memorial Lectures, 7th series”. P. C. Mahalanobis Memorial Lectures, Indian Statistical Institute. Archived from the original on 26 Feb 2017. Retrieved 18 Jul 2017.
  14. Cleveland, W. S. (2001). “Data science: an action plan for expanding the technical areas of the field of statistics”. International Statistical Review / Revue Internationale de Statistique, 21-26.
  15. International Council for Science: Committee on Data for Science and Technology. (2012, April). CODATA, The Committee on Data for Science and Technology. Retrieved from International Council for Science: Committee on Data for Science and Technology: http://www.codata.org/
  16. Data Science Journal. (2012, April). Available Volumes. Retrieved from Japan Science and Technology Information Aggregator, Electronic: http://www.jstage.jst.go.jp/browse/dsj/_vols
  17. Data Science Journal. (2002, April). Contents of Volume 1, Issue 1, April 2002. Retrieved from Japan Science and Technology Information Aggregator, Electronic: http://www.jstage.jst.go.jp/browse/dsj/1/0/_contents
  18. The Journal of Data Science. (2003, January). Contents of Volume 1, Issue 1, January 2003. Retrieved from http://www.jds-online.com/v1-1
  19. National Science Board. “Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century”. National Science Foundation. Retrieved 30 June 2013.
  20. “IEEE Task Force on Data Science and Advanced Analytics”.
  21. “2014 IEEE International Conference on Data Science and Advanced Analytics”.
  22. “NY gets new bootcamp for data scientists: It’s free, but harder to get into than Harvard”. VentureBeat. Retrieved 2016-02-22.
  23. “Journal on Data Science and Analytics”.
  24. Darrow, Barb (May 21, 2015). “Data science is still white hot, but nothing lasts forever”. Fortune. Retrieved November 20, 2017.
  25. Miller, Steven (2014-04-10). “Collaborative Approaches Needed to Close the Big Data Skills Gap”. Journal of Organization Design. 3 (1): 26-30. doi:10.7146/jod.9823. ISSN 2245-408X.
  26. De Mauro, Andrea; Greco, Marco; Grimaldi, Michele; Ritala, Paavo. “Human resources for Big Data professions: A systematic classification of job roles and required skill sets”. Information Processing & Management. doi:10.1016/j.ipm.2017.05.004.
  27. Wladawsky-Berger, Irving (May 2, 2014). “Why Do We Need Data Science When We’ve Had Statistics for Centuries?”. The Wall Street Journal. Retrieved November 20, 2017.
  28. Donoho, David (September 2015). “50 Years of Data Science” (PDF). Based on a presentation at the Tukey Centennial Workshop, Princeton, NJ, Sept 18, 2015.
  29. Barlow, Mike (2013). The Culture of Big Data. O’Reilly Media, Inc.
  30. Chambers, John M. (1993-12-01). “Greater or lesser statistics: a choice for future research”. Statistics and Computing. 3 (4): 182-184. doi:10.1007/BF00141776. ISSN 0960-3174.
  31. Collins, Francis S.; Tabak, Lawrence A. (2014-01-30). “NIH plans to enhance reproducibility”. Nature. 505 (7485): 612-613. ISSN 0028-0836. PMC 4058759. PMID 24482835.
  32. McNutt, Marcia (2014-01-17). “Reproducibility”. Science. 343 (6168): 229. doi:10.1126/science.1250475. ISSN 0036-8075. PMID 24436391.
  33. Peng, Roger D. (2009-07-01). “Reproducible research and Biostatistics”. Biostatistics. 10 (3): 405-408. doi:10.1093/biostatistics/kxp014. ISSN 1465-4644.