DataOps

DataOps is an automated, process-oriented methodology, used by big data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now become a new and independent approach to data analytics. [1] DataOps applies to the entire data lifecycle, [2] from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations. [3] From a process and methodology perspective, DataOps applies Agile software development, DevOps [3] and the statistical process control used in lean manufacturing to data analytics. [4]

In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional sequential Waterfall methodology. The Agile methodology is particularly effective when requirements change rapidly. [5]

DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating the testing and deployment of software. This merging of software development and IT operations has improved the velocity, quality, predictability and scalability of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics. [3]

Like lean manufacturing, DataOps uses statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is continuously monitored and verified to be within expected limits. If an anomaly occurs, the data analytics team can be notified through an automated alert. [6]
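The SPC idea described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed DataOps implementation: it assumes that a pipeline metric (here, a hypothetical daily row count) is compared against three-sigma control limits derived from a baseline sample. The names control_limits, check_batch and baseline_row_counts, and the sample numbers, are invented for the example.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits computed from a baseline sample."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def check_batch(value, limits, alert=print):
    """Return True if value is within limits; otherwise send an alert and return False."""
    lower, upper = limits
    if lower <= value <= upper:
        return True
    alert(f"ALERT: value {value} outside control limits [{lower:.1f}, {upper:.1f}]")
    return False


# Hypothetical row counts from recent pipeline runs that were known to be good.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
limits = control_limits(baseline_row_counts)
```

Calling check_batch(1010, limits) passes silently, while check_batch(400, limits) triggers the alert callback. In practice the callback would be wired to whatever notification channel the team uses; print is only a stand-in here.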

DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration [4], agility [5], quality, security, access and ease of use. [7]

History

The term DataOps was originally introduced in a blog post by Andy Palmer at Tamr. [3] DataOps is short for “Data Operations.” [2]

Figure: DataOps heritage from DevOps, Agile, and manufacturing

Goals and Philosophy

The volume of data is forecast to grow at a compound annual growth rate (CAGR) of 32% to 180 zettabytes by 2025 (source: IDC). [7] DataOps seeks to provide tools, processes, and organizational structures to cope with this significant increase in data. [7] Automation streamlines the daily demands of managing large integrated databases, freeing the data team to develop new analytics more efficiently and effectively. [8]

DataOps embraces the need to manage many sources of data, multiple data pipelines and a wide variety of transformations. [3] DataOps seeks to increase the velocity, reliability, and quality of data analytics. [9] It emphasizes communication, collaboration, integration, automation, measurement and cooperation between data scientists, analysts, data/ETL (extract, transform, load) engineers, information technology (IT), and quality assurance/governance. [10] It aims to help organizations rapidly produce insight, turn that insight into operational tools, and continuously improve analytic operations and performance. [10]

Implementation

Toph Whitmore at Blue Hill Research offers the following DataOps leadership principles for the information technology department: [1]

  • “Establish progress and performance measurements at every stage of the data flow. Where possible, benchmark data-flow cycle times.
  • Define rules for an abstracted semantic layer. Ensure everyone is “speaking the same language” and agrees on what the data (and metadata) is and is not.
  • Validate with the “eyeball test”: Include continuous-improvement-oriented human feedback loops. Consumers must be able to trust the data, and that can only come with incremental validation.
  • Automate as many stages of the data flow as possible, including BI, data science, and analytics.
  • Using benchmarked performance information, identify and optimize bottlenecks. This may require investment in commodity hardware, or automation of a formerly human-delivered data-science step in the process.
  • Establish governance discipline, with a particular focus on two-way data control, data ownership, transparency, and comprehensive data-lineage tracking through the entire workflow.
  • Design process for growth and extensibility. The data flow model must be designed to accommodate volume and variety of data. Ensure enabling technologies are priced affordably to scale with that enterprise data growth.”
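The first and fifth principles above (measuring every stage and using the measurements to find bottlenecks) can be illustrated with a small Python sketch. It is only one possible approach and makes several assumptions: a pipeline is modelled as an ordered list of (name, function) pairs, and the stage functions extract, transform and report are hypothetical placeholders rather than part of any real tool.

```python
import time


def run_with_timings(stages, data=None):
    """Run each (name, function) stage in order and record its wall-clock time."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def bottleneck(timings):
    """Return the name of the stage with the longest recorded cycle time."""
    return max(timings, key=timings.get)


# Hypothetical stages standing in for real extract/transform/report steps.
def extract(_):
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def report(rows):
    return {"count": len(rows), "max": rows[-1]}


stages = [("extract", extract), ("transform", transform), ("report", report)]
result, timings = run_with_timings(stages)
```

Recording timings on every run, rather than on demand, gives a history of cycle times against which later runs can be benchmarked.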

DataKitchen has published seven steps to implement DataOps:

  1. Add data and logic tests
  2. Use a version control system
  3. Branch and merge
  4. Use multiple environments
  5. Reuse and containerize
  6. Parameterize your processing
  7. Work without fear: orchestrate the pipelines that publish data and new analytics.
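Two of the steps above, adding data and logic tests (step 1) and parameterizing processing (step 6), lend themselves to a short Python sketch. It is an illustration under stated assumptions, not DataKitchen's own implementation: the order records, the transform step and the check functions are all hypothetical names chosen for the example.

```python
def transform(orders, min_amount=0.0, currency="USD"):
    """Parameterized step: keep orders in `currency` with amount >= `min_amount`."""
    return [
        o for o in orders
        if o["currency"] == currency and o["amount"] >= min_amount
    ]


def check_input_data(orders):
    """Data test: validate the incoming data before it is processed."""
    ids = [o["id"] for o in orders]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate order ids")
    if any(o["amount"] < 0 for o in orders):
        raise ValueError("negative order amount")


def check_transform_logic():
    """Logic test: run the step on a small fixture with a known expected result."""
    sample = [
        {"id": 1, "amount": 5.0, "currency": "USD"},
        {"id": 2, "amount": 25.0, "currency": "USD"},
        {"id": 3, "amount": 50.0, "currency": "EUR"},
    ]
    expected = [{"id": 2, "amount": 25.0, "currency": "USD"}]
    assert transform(sample, min_amount=10.0) == expected
```

The data test runs against every incoming batch, while the logic test runs whenever the code changes. Because the thresholds are parameters rather than constants, the same step can be reused across environments with different settings.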

DataOps Manifesto

The DataOps manifesto consists of 18 DataOps principles that summarize the mission, values, philosophies, goals and best practices of DataOps practitioners.

The Role of DataOps Engineer

The DataOps Engineer orchestrates and automates the data analytics pipeline, promotes features to production and automates quality. [11] In many organizations, the DataOps Engineer is a separate role. In others, it is a shared function. [12]

Ecosystem

The DataOps ecosystem includes:

  • Blue Hill Research
  • Composable Analytics
  • DataKitchen
  • DataOps
  • DataOps Summit
  • Delphix
  • Interana
  • John Snow Labs
  • Nexla
  • Qubole
  • Switchboard Software
  • Talend
  • Tamr
  • Trifacta
  • Unravel Data

References

  1. “DataOps – It’s a Secret”. www.datasciencecentral.com. Retrieved 2017-04-05.
  2. “What is DataOps (data operations)? – Definition from WhatIs.com”. SearchDataManagement. Retrieved 2017-04-05.
  3. “From DevOps to DataOps, By Andy Palmer – Tamr Inc.”. Tamr Inc. 2015-05-07. Retrieved 2017-03-21.
  4. “The DataOps Ecosystem Emerges – Tamr Inc.”. Tamr Inc. 2017-05-04. Retrieved 2017-08-24.
  5. DataKitchen (2017-02-21). “How Software Teams Accelerated Average Frequency Release from 12 Months to Three Weeks”. Medium. Retrieved 2017-08-24.
  6. DataKitchen (2017-03-07). “Lean Manufacturing Secrets That You Can Apply to Data Analytics”. Medium. Retrieved 2017-08-24.
  7. “What is DataOps? | Nexla: Scalable Data Operations Platform for the Machine Learning Age”. www.nexla.com. Retrieved 2017-09-07.
  8. “5 driving trends Big Data in 2017”. CIO Dive. Retrieved 2017-09-07.
  9. “Unravel Data Advances Performance Management Application for Big Data”. Database Trends and Applications. 2017-03-10. Retrieved 2017-09-07.
  10. DataKitchen (2017-03-15). “How to Become a Rising Star with Data Analytics”. data-ops. Retrieved 2017-09-07.
  11. DataKitchen (2017-07-19). “Building a DataOps Team”. data-ops. Retrieved 2017-08-24.
  12. DataKitchen (2017-05-16). “DataOps Engineer Will Be the Sexiest Job in Analytics”. Medium. Retrieved 2017-08-24.