Data lineage

Data lineage includes the data's origins, what happens to it, and where it moves over time. [1] Data lineage provides visibility into the data analytics process and makes it easier to trace errors back to their root cause. [2]

It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance , to address similar validation and debugging challenges. [3] Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. It supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis. "Lineage is a simple type of provenance." [3]

Data lineage can be represented visually to show the data flow from its source to its destination via the various changes and hops on its way in the enterprise environment: how the data is transformed along the way and how it splits or converges after each hop. A simple representation of the data lineage can be shown with dots and lines, where each dot represents a data container for data point(s) and the lines connecting them represent the transformation(s) the data point undergoes between the data containers.

Representation broadly depends on the scope of metadata management and the reference point of interest. Data lineage provides the sources of the data and its intermediate data flow hops from the reference point with backward data lineage, and leads to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined into end-to-end lineage for a reference point, giving a complete audit trail from sources to final destination(s). As the number of data points or hops increases, the complexity of such a representation becomes incomprehensible. Thus, a useful feature of a data lineage view is the ability to simplify it by temporarily masking unwanted peripheral data points. Tools that have the masking feature enable scalability of the view and enhance analysis with the best user experience for both technical and business users.

The scope of the data lineage determines the volume of metadata required to represent it. Usually, data governance and data management determine the scope of the data lineage based on their regulations , enterprise data management strategy, data impact, reporting attributes, and critical data elements of the organization.

Data lineage provides the audit trail of the data points at the highest granular level, but presentation of the lineage may be done at various levels to simplify the vast information, similar to analytic web maps. Data lineage can be visualized at various levels based on the granularity of the view. At a very high level, data lineage shows which systems the data interacts with before it reaches its destination. As the granularity increases, it goes up to the data point level, where it can provide the details of the data point and its historical behavior, attribute properties, trends, and data quality of the data passed through that specific data point in the data lineage.

Data governance plays a key role in metadata management through guidelines, strategies, policies, and implementation. Data quality and master data management help enrich the data lineage with more business value. Even though the final representation of data lineage is provided in one interface, the way the metadata is harnessed and exposed to the data lineage graphical user interface could be entirely different. Thus, data lineage tools can be broadly divided into three categories based on the way metadata is harnessed: data lineage involving software packages for structured data, programming languages , and big data .

Data lineage information includes technical metadata involving data transformations. Enriched data lineage information can include data quality test results, reference data values, data models , business vocabulary , data stewards , program management information , and enterprise information systems linked to the data points and transformations. The masking feature in data lineage visualization allows the tools to incorporate all the enrichments that matter for the specific use case. To represent disparate systems in one common view, "metadata normalization" or standardization may be necessary.

Case for Data Lineage

The world of big data is changing rapidly. Statistics say that 90% of the world's data has been created in the last two years alone. [4] This explosion of data has resulted in the growth of systems and automation at all levels in organizations of all sizes.

Distributed systems like Google MapReduce , [5] Microsoft Dryad, [6] Apache Hadoop [7] (an open-source project) and Google Pregel [8] provide such platforms for businesses and users. However, even with these systems, big data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute across multiple cores, and a large-scale image processing task to estimate geographic information required 400 cores to complete. [9] "The Large Synoptic Survey Telescope is expected to generate more than 50 petabytes of data, while in the bioinformatics sector, the largest genome sequencing houses in the world now store petabytes of data apiece." [10] It is very difficult for a scientist to trace an unknown or unanticipated result.

Big Data Debugging

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Machine learning algorithms and other transformations are applied to the data. Due to the humongous size of the data, there might be unknown features in the data, possibly even outliers. It is quite difficult for a scientist to actually debug an unexpected result.

The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may rerun the entire analytics through a debugger for step-wise debugging, this can be expensive due to the time and resources needed. Auditing and data validation are other major issues, owing in part to the growing ease of sharing and transferring data and the use of third-party data in business enterprises. [11] [12] [13] [14] These problems will only become larger and more acute as these systems and this data continue to grow. As such, more cost-efficient ways of analyzing data-intensive scalable computing (DISC) are crucial to their continued effective use.

Challenges in Big Data Debugging

Massive Scale

According to an EMC / IDC study: [15]

  • 2.8ZB of data were created and replicated in 2012,
  • the digital universe will double every two years between now and 2020, and
  • there will be approximately 5.2TB of data for every man, woman and child on earth in 2020.

Working with this scale of data has become very challenging.

Unstructured Data

Unstructured data usually refers to information that does not reside in a traditional row-column database. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these files may have an internal structure, they are still considered "unstructured" because they do not fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data is growing rapidly. "Big data is not all unstructured data, but IDC estimates that 90 percent of big data is unstructured data." [16]

The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to understand and prepare for analytic use. Beyond issues of structure, there is the sheer volume of this type of data. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive. [17]

Long Runtime

In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware. Some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid computing approach, where many machines are used to solve a problem. Both approaches allow organizations to explore huge data volumes. Even with this level of sophisticated hardware and software, some large-scale image processing tasks take from a few days to a few weeks. [18] Debugging of the data processing is extremely hard due to long run times.

A third approach is that of advanced data discovery solutions, which combine self-service data preparation with visual data discovery, offered by companies such as Trifacta , Alteryx and others. [19]

Another method of tracking data is to provide users with lineage, i.e. the ability to see which data is dependent on which, but the structure of the transformation is lost. Similarly, ETL or mapping software provides transform-level lineage, yet this view typically cannot distinguish between transforms that are logically independent (e.g. transforms that operate on separate columns) and those that are dependent. [20]

Complex Platform

Big data platforms have a very complicated structure, with data distributed among several machines. Typically, the jobs are broken down into several tasks that run on different machines, and the results are later combined. Debugging a big data pipeline becomes very challenging because of the very nature of the system. It is not an easy task for the data scientist to figure out which machine's data contains the outliers and unknown features.

Proposed Solution

Data provenance, or data lineage, can be used to make the debugging of big data pipelines easier. This necessitates the collection of data about data transformations. The following section explains data provenance in more detail.

Data Provenance

Data provenance provides a historical record of the data and its origins. The provenance of data which is generated by complex transformations such as workflows is of considerable value to scientists. [21] From it, one can ascertain the quality of the data based on its ancestral data and derivations, trace back sources of errors, allow automated re-enactment of derivations to update the data, and provide attribution of data sources. Provenance is also essential to the business domain, where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.

The use of data lineage is proposed in distributed systems to trace records through a dataflow, replay the dataflow on a subset of its original inputs and debug data flows. To do so, one needs to keep track of the set of inputs to each operator which were used to derive each of its outputs. Although there are several forms of provenance, such as copy-provenance and how-provenance, [14] [22] the information we need is a simple form of why-provenance, or lineage , as defined by Cui et al. [23]

Lineage Capture

Intuitively, for an operator T producing output o, lineage consists of triplets of the form {I, T, o}, where I is the set of inputs to T used to derive o. Capturing lineage for each operator T in a dataflow enables users to ask questions such as "Which outputs were produced by an input i on operator T?" and "Which inputs produced output o in operator T?" [3] A query that finds the inputs deriving an output is called a backward tracing query, while one that finds the outputs produced by an input is called a forward tracing query. [24] Backward tracing is useful for debugging, while forward tracing is useful for tracking error propagation. [24] Tracing queries also form the basis for replaying an original dataflow. [12] [23] [24] However, to use lineage effectively in a DISC system, it needs to be captured at multiple levels (or granularities) of operators and data.
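
The sketch below is a minimal, hypothetical illustration (not taken from any of the cited systems) of how associations of the form {I, T, o} could be recorded and then queried backwards and forwards; the operator name, record identifiers and helper functions are invented for the example.

```python
from collections import defaultdict

# Hypothetical in-memory lineage store: one list of associations per operator.
# Each association records (set of inputs I, output o) for that operator T.
lineage = defaultdict(list)

def record(operator, inputs, output):
    """Capture one lineage triplet {I, T, o} for operator T."""
    lineage[operator].append((frozenset(inputs), output))

def backward_trace(operator, output):
    """Backward tracing query: which inputs of T were used to derive this output?"""
    return [ins for ins, out in lineage[operator] if out == output]

def forward_trace(operator, input_item):
    """Forward tracing query: which outputs of T were produced using this input?"""
    return [out for ins, out in lineage[operator] if input_item in ins]

# Toy word-count-like operator producing one output per word.
record("wordcount", {"rec1", "rec2"}, ("data", 2))
record("wordcount", {"rec3"}, ("lineage", 1))

print(backward_trace("wordcount", ("data", 2)))  # e.g. [frozenset({'rec1', 'rec2'})]
print(forward_trace("wordcount", "rec3"))        # [('lineage', 1)]
```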

A DISC system consists of several levels of operators and data, and different use cases of lineage dictate the level at which lineage needs to be captured. Lineage can be captured at the level of the job, using files and giving lineage tuples of the form {IF_i, MR Job, OF_i}; lineage can also be captured at the level of each task, using records and giving, for example, lineage tuples of the form {(k_rr, v_rr), map, (k_m, v_m)}. The first form of lineage is called coarse-grain lineage, while the second form is called fine-grain lineage. Integrating lineage across different granularities enables users to ask questions such as "Which file read by a MapReduce job produced this particular output record?" and can be useful in debugging across different operator and data granularities within a dataflow. [3]

Map Reduce Job showing containment relationships

To capture end-to-end lineage in a DISC system, we use the Ibis model, [25] which introduces the notion of containment hierarchies for operators and data. Specifically, Ibis proposes that an operator can be contained within another, and such a relationship between two operators is called operator containment . "Operator containment implies that the contained (or child) operator performs a part of the logical operation of the containing (or parent) operator." [3] For example, a MapReduce task is contained in a job. Similar containment relationships exist for data as well, called data containment. Data containment implies that the contained data is a subset of the containing data (superset).

Containment Hierarchy
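
As an illustration only, and not the actual Ibis model or API, the containment relationships described above might be modeled as parent links for operators and data, so that fine-grain (task/record) lineage can be rolled up to coarse-grain (job/file) lineage; every name below is invented for the example.

```python
# Hypothetical operator and data containment hierarchies.
operator_parent = {
    "map_task_1": "mapreduce_job_7",      # a MapReduce task is contained in a job
    "reduce_task_1": "mapreduce_job_7",
}
data_parent = {
    "record_42": "input_file_A",          # a record is contained in a file
    "record_43": "input_file_A",
}

def containing_operator(op):
    """Walk up the operator containment hierarchy to the top-level parent."""
    while op in operator_parent:
        op = operator_parent[op]
    return op

# Fine-grain lineage captured at the task/record level can be reported at the
# coarse job/file level by replacing each element with its container.
fine_grain = ("record_42", "map_task_1", "record_42_mapped")
coarse_grain = (data_parent["record_42"], containing_operator("map_task_1"))
print(coarse_grain)  # ('input_file_A', 'mapreduce_job_7')
```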

Prescriptive Data Lineage

The concept of prescriptive data lineage combines the logical model (entity) of how data is supposed to flow with the actual lineage for that instance. [26]

Data lineage and provenance typically refer to the way or the steps by which a dataset came to its current state; data lineage also covers all copies or derivatives of the data. However, simply looking backwards at audit or log correlations to determine lineage is insufficient for certain data management cases. For instance, it is impossible to determine with certainty whether the route a data workflow took was correct or in compliance without the logical model.

Only by combining a logical model with atomic forensic events can proper activities be validated, such as:

  1. Authorized copies, joins, or CTAS operations
  2. Mapping of processing to the systems
  3. Ad-Hoc versus established processing sequences

Many certified compliance reports require provenance of the data flow as well as the end-state data for a specific instance. With these types of situations, any deviation from the prescribed path needs to be accounted for and potentially remediated. [27] This marks a shift in thinking from a purely look-back model to a framework which is better suited to capture compliance workflows.

Active vs Lazy Lineage

Lazy lineage collection typically captures only coarse-grain lineage at run time. These systems incur low capture overheads due to the small amount of lineage they capture. However, to answer fine-grain tracing queries, they must replay the data flow on all (or a large part) of its input and collect fine-grain lineage during the replay. This approach is suitable for forensic systems, where a user wants to debug an observed bad output.

Active collection systems capture the lineage of the data flow at run time. The kind of lineage they capture may be coarse-grain or fine-grain, but they do not require any further computation on the data flow after its execution. Active fine-grain lineage collection systems incur higher capture overhead than lazy collection systems. However, they enable sophisticated replay and debugging. [3]

Actors

An actor is an entity that transforms data; it may be a Dryad vertex, an individual map or reduce operator, a MapReduce job, or an entire dataflow pipeline. Actors act as black boxes, and the inputs and outputs of an actor are tapped to capture lineage in the form of associations, where an association is a triplet {i, T, o} that relates an input i with an output o for an actor T. The instrumentation thus captures lineage in a dataflow one actor at a time, piecing it together into a set of associations for each actor. The system developer needs to capture the data an actor reads (from other actors) and the data an actor writes (to other actors). For example, a developer can treat the Hadoop Job Tracker as an actor by recording the set of files read and written by each job. [28]

Associations

An association is a combination of the inputs, the outputs and the operation itself. The operation is represented in terms of a black box, also known as the actor. The associations describe the transformations that are applied to the data. The associations are stored in association tables. Each unique actor is represented by its own association table. An association itself looks like {i, T, o}, where i is the set of inputs to the actor T and o is the set of outputs produced by the actor. Associations are the basic units of data lineage. Individual associations are later combined to construct the entire history of transformations that were applied to the data. [3]
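
A minimal sketch, assuming a simple in-memory representation rather than any particular system, of association tables keyed by actor and of how walking them backwards recovers the history of an output; all identifiers are hypothetical.

```python
# Illustrative sketch only: one association table per actor, where each row is
# an association {i, T, o}. Column names are hypothetical.
association_tables = {
    "map_actor": [
        {"inputs": {"record_1"}, "output": "kv_1"},
        {"inputs": {"record_2"}, "output": "kv_2"},
    ],
    "reduce_actor": [
        {"inputs": {"kv_1", "kv_2"}, "output": "result_1"},
    ],
}

def inputs_of(output, actor, tables):
    """Look up the association row for an output to recover the inputs that produced it."""
    for row in tables.get(actor, []):
        if row["output"] == output:
            return row["inputs"]
    return set()

# The inputs of the reduce actor are themselves outputs of the map actor, so
# walking the tables backwards reconstructs the chain of transformations.
print(inputs_of("result_1", "reduce_actor", association_tables))  # {'kv_1', 'kv_2'}
```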

Architecture

Big data systems scale horizontally, i.e. they increase capacity by adding new hardware or software entities to the distributed system. The distributed system acts as a single entity at the logical level even though it comprises multiple hardware and software entities, and the system should continue to maintain this property after horizontal scaling. An important advantage of horizontal scalability is that it provides the ability to increase capacity on the fly. The biggest plus point is that horizontal scaling can be done using commodity hardware.

The horizontal scaling feature of big data systems should be taken into account while creating the architecture of the lineage store. This is essential because the lineage store itself should also be able to scale in parallel with the big data system. The number of associations and the amount of storage required to store lineage will increase with the size of the system. The architecture of big data systems makes the use of a single, centralized lineage store inappropriate and impossible to scale. The immediate solution to this problem is to distribute the lineage store itself. [3]

The best-case scenario is to use a local lineage store for every machine in the distributed system network. This allows the lineage store to scale horizontally as well. In this design, the lineage of the data transformations applied to the data on a particular machine is stored on the local lineage store of that specific machine. The lineage store typically stores association tables. Each actor is represented by its own association table. The rows are the associations themselves, and the columns represent the inputs and outputs. This design solves two problems. It allows horizontal scaling of the lineage store. If a single centralized lineage store were used, then this information would have to be carried over the network, causing additional network latency. The network latency is avoided by the use of a distributed lineage store. [28]
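
The following is a hypothetical sketch of such a per-machine (local) lineage store; the class and method names are invented and do not correspond to any real system's API.

```python
# Hypothetical per-machine (local) lineage store. Each machine keeps only the
# association tables for transformations it executed, so the store scales
# horizontally with the cluster and no lineage crosses the network at capture time.
class LocalLineageStore:
    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.tables = {}  # actor name -> list of association rows

    def add_association(self, actor, inputs, output):
        """Record one association {i, T, o} for a transformation run on this machine."""
        self.tables.setdefault(actor, []).append(
            {"inputs": set(inputs), "output": output}
        )

    def associations_for(self, actor):
        """Return this machine's rows of the actor's association table."""
        return self.tables.get(actor, [])

# One store per machine in the (toy) cluster.
stores = {m: LocalLineageStore(m) for m in ("machine_1", "machine_2")}
stores["machine_1"].add_association("map_actor", {"record_1"}, "kv_1")
stores["machine_2"].add_association("map_actor", {"record_2"}, "kv_2")
print(stores["machine_2"].associations_for("map_actor"))
```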

Architecture of Lineage Systems

Data flow Reconstruction

The information stored in terms of associations needs to be combined by some means to get the data flow of a particular job. In a distributed system a job is broken down into multiple tasks. One or more instances run a particular task. The results produced on these individual machines are later combined to finish the job. Tasks running on different machines perform multiple transformations on the data on that machine. All the transformations applied to the data on a machine are stored in the local lineage store of that machine. This information needs to be combined to get the lineage of the entire job. The lineage of the entire job should help the data scientist understand the data flow of the job, which can then be used to debug the big data pipeline. The data flow is reconstructed in three stages.

Association tables

The first stage of data flow reconstruction is the computation of the association tables. Association tables exist for each actor in each local lineage store. The entire association table for an actor can be computed by combining these individual association tables. This is generally done using a series of joins based on the actors themselves. In a few scenarios, the tables might also be joined using inputs as the key. Indexes can be used to improve the efficiency of a join. The joined tables need to be stored on a single instance or machine to continue processing. There are multiple schemes used to pick the machine where a join will be computed. The easiest one is the machine with the minimum CPU load. Space constraints should also be kept in mind while picking the instance where the join will happen.
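
As a rough illustration of this first stage, assuming a simple dictionary representation rather than a real lineage store, the per-machine tables for one actor can be concatenated and then filtered on inputs (a simple stand-in for a keyed join); all names are hypothetical.

```python
# Per-machine association tables for one actor, as they might sit in the local stores.
local_tables = {
    "machine_1": [{"inputs": {"record_1"}, "output": "kv_1"}],
    "machine_2": [{"inputs": {"record_2"}, "output": "kv_2"}],
}

def full_association_table(per_machine_tables):
    """Union of the actor's association rows across all local lineage stores."""
    table = []
    for rows in per_machine_tables.values():
        table.extend(rows)
    return table

def select_by_input(table, input_item):
    """Pick the rows whose input set contains a given item (a simple join key)."""
    return [row for row in table if input_item in row["inputs"]]

map_table = full_association_table(local_tables)
print(select_by_input(map_table, "record_2"))  # [{'inputs': {'record_2'}, 'output': 'kv_2'}]
```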

Association graph

The second step in data flow reconstruction is computing an association graph from the lineage information. The graph represents the steps in the data flow. The actors act as vertices and the associations act as edges. Each actor is linked to its upstream and downstream actors in the data flow. An upstream actor of T is one that produces the input of T, while a downstream actor is one that consumes the output of T. Containment relationships are always considered while creating links. The graph consists of three types of links or edges.

Explicitly specified links

The simplest link is an explicitly specified link between two actors. These links are explicitly specified in the code of a machine learning algorithm. When an actor is aware of its exact upstream or downstream actor, it can communicate this information to the lineage API. This information is later used to link these actors during the tracing query. For example, in the MapReduce architecture, each map instance knows the exact record reader instance whose output it consumes. [3]

Logically inferred links

Developers can attach data flow archetypes to each logical actor. A data flow archetype explains how the child types of an actor type arrange themselves in a data flow. With the help of this information, one can infer a link between each actor of a source type and each actor of a destination type. For example, in the MapReduce architecture, the map actor type is the source for reduce, and vice versa. The system infers this from the archetypes and duly links map instances with reduce instances. However, there may be several MapReduce jobs in the data flow, and linking all map instances with all reduce instances can create false links. To prevent this, such links are restricted to actor instances contained within a common actor instance of a containing (or parent) actor type. Thus, map and reduce instances are only linked to each other if they belong to the same job. [3]

Implicit links through data set sharing

In distributed systems, sometimes there are implicit links, which are not specified during execution. For example, an implicit link exists between an actor that wrote a data set and another actor that read it. Such links connect actors which use a common data set for execution. The data set is the output of the first actor and the input of the actor following it. [3]
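
A compact, hypothetical sketch of how the three kinds of links could be materialized as edges of the association graph; the actor names, job identifiers and file names are invented for the example.

```python
# Actors are vertices; edges (upstream, downstream) are added from the three link types.
edges = set()

# 1. Explicitly specified links: an actor reports its exact upstream actor.
edges.add(("record_reader_1", "map_1"))

# 2. Logically inferred links from archetypes: map feeds reduce, but only
#    within the same containing job, to avoid false links across jobs.
job_of = {"map_1": "job_A", "map_2": "job_B", "reduce_1": "job_A"}
for m in ("map_1", "map_2"):
    for r in ("reduce_1",):
        if job_of[m] == job_of[r]:
            edges.add((m, r))

# 3. Implicit links through data set sharing: one actor wrote a data set that
#    another actor later read.
writes = {"reduce_1": "output_file"}
reads = {"loader_1": "output_file"}
for writer, written in writes.items():
    for reader, read in reads.items():
        if written == read:
            edges.add((writer, reader))

print(sorted(edges))
# [('map_1', 'reduce_1'), ('record_reader_1', 'map_1'), ('reduce_1', 'loader_1')]
```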

Topological Sorting

The final step in data flow reconstruction is the topological sorting of the association graph. The directed graph created in the preceding step is topologically sorted to obtain the order in which the actors modified the data. This order of the actors defines the data flow of the big data pipeline or task.
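
As a small illustration, assuming the hypothetical graph from the previous sketch, Python's standard graphlib module can produce such a topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each actor maps to the set of its upstream actors (its predecessors in the data flow).
predecessors = {
    "map_1": {"record_reader_1"},
    "reduce_1": {"map_1"},
    "loader_1": {"reduce_1"},
}

# The topological order is the order in which the actors modified the data.
order = list(TopologicalSorter(predecessors).static_order())
print(order)  # ['record_reader_1', 'map_1', 'reduce_1', 'loader_1']
```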

Tracing & Replay

This is the most crucial step in big data debugging. The captured lineage is combined and processed to obtain the data flow of the pipeline. The data flow helps the data scientist or developer look deeply into the actors and their transformations. This step allows the data scientist to figure out the part of the algorithm that is generating the unexpected output. A big data pipeline can go wrong in two broad ways. The first is the presence of a suspicious actor in the data-flow. The second is the existence of outliers in the data.

The first case can be debugged by tracing the data-flow. By using lineage and data-flow information together, a data scientist can figure out how the inputs are converted into outputs. During this process, actors that behave unexpectedly can be caught. These actors can either be removed from the data-flow or be augmented by new actors, and the modified data-flow can be replayed to test its validity. Debugging faulty actors can involve recursively performing coarse-grain replay on actors in the data-flow, [29] which can be expensive in resources for long data flows. Another approach is to manually inspect lineage logs to find anomalies, [13] [30] which can be tedious and time-consuming across several stages of a data-flow. Moreover, these approaches work only when the data scientist can discover bad outputs. To debug analytics without known bad outputs, the data scientist needs to analyze the data-flow for suspicious behavior in general. However, often, a user may not know the expected normal behavior and cannot specify predicates. Sudden changes in an actor's behavior, such as its average selectivity, are characteristic of an anomaly. Lineage can reflect such changes in actor behavior over time and across different actors, and thus mining lineage for such changes can help identify faulty actors in a multi-stage data-flow.
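
As a toy illustration of this idea (not an implementation of any cited system), the sketch below computes an actor's average selectivity from its captured associations in two runs and flags a large change; the records and the 50% threshold are arbitrary.

```python
def selectivity(associations):
    """Average number of outputs per distinct input for one run of an actor."""
    inputs = set().union(*(a["inputs"] for a in associations))
    return len(associations) / max(len(inputs), 1)

# Captured associations for the same (hypothetical) actor in two successive runs.
run_1 = [{"inputs": {"r1"}, "output": "o1"}, {"inputs": {"r2"}, "output": "o2"}]
run_2 = [{"inputs": {"r1", "r2", "r3", "r4"}, "output": "o1"}]  # suspicious drop

s1, s2 = selectivity(run_1), selectivity(run_2)
if abs(s2 - s1) / s1 > 0.5:  # arbitrary 50% change threshold
    print(f"possible anomaly: selectivity changed from {s1:.2f} to {s2:.2f}")
```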

Tracing Anomalous Actors

The second problem, the existence of outliers in the data, can also be identified by running the data-flow step-wise and looking at the transformed outputs. The data scientist finds a subset of outputs that are not in conformity with the rest of the outputs. The inputs which are causing these bad outputs are the outliers in the data. This problem can be solved by removing the set of outliers from the data and replaying the entire data-flow. It can also be solved by modifying the machine learning algorithm by adding, removing or moving actors in the data-flow. The changes in the data-flow are successful if the replayed data-flow does not produce bad outputs.
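
A minimal sketch of this outlier-debugging loop, using an invented toy transform in place of a real data-flow: flagged outputs are traced back to their inputs through lineage, the offending inputs are removed, and the flow is replayed.

```python
def transform(inputs):
    """Toy stand-in for a data-flow: doubles each input."""
    return [x * 2 for x in inputs]

inputs = [1, 2, 3, 500]                            # 500 is the bad input
lineage = {out: inp for inp, out in zip(inputs, transform(inputs))}

outlier_outputs = [o for o in lineage if o > 100]  # outputs flagged as outliers
bad_inputs = {lineage[o] for o in outlier_outputs} # backward tracing to the culprits

clean_inputs = [i for i in inputs if i not in bad_inputs]
print(transform(clean_inputs))                     # replayed data-flow: [2, 4, 6]
```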

Tracing Outliers in the data

Challenges

Even though the use of data lineage is a novel way of debugging big data pipelines, the process is not simple. The challenges include the scalability of the lineage store, the fault tolerance of the lineage store, the accurate capture of lineage for black-box operators and many others. These challenges must be considered carefully, and trade-offs must be made between them to achieve a realistic design for data lineage capture.

Scalability

DISC systems are primarily batch processing systems designed for high throughput. They execute several jobs per analytics run, with several tasks per job. The overall number of operators executing at any time in a cluster can be very large, depending on the cluster size. Lineage capture for these systems must be able to scale to both large volumes of data and numerous operators to avoid becoming a bottleneck for the DISC analytics.

Fault tolerance

Lineage capture systems must also be fault tolerant to avoid rerunning data flows to capture lineage. At the same time, they must also accommodate failures in the DISC system. To do so, they must be able to identify a failed DISC task and avoid storing duplicate lineage when the task is restarted. A lineage system should also be able to gracefully handle multiple instances of local lineage systems going down. This can be achieved by storing replicas of lineage associations on multiple machines. The replica can act like a backup in the event the real copy is lost.

Black-box operators

Lineage systems for DISC dataflows must be able to capture accurate lineage across black-box operators to enable fine-grain debugging. Current approaches to this include Prober, which seeks to find the minimum set of inputs that can produce a specified output, [31] and dynamic slicing, as used by Zhang et al. [32] to capture lineage for NoSQL operators through binary rewriting to compute dynamic slices. Although such techniques can produce accurate lineage, they can incur significant overheads in capture or tracing. Thus, there is a need for a lineage collection system that allows the capture of lineage for arbitrary operators with reasonable accuracy, and without significant overheads in capture or tracing.

Efficient tracing

Tracing is essential for debugging, during which a user can issue multiple tracing queries. Thus, it is important that tracing has fast turnaround times. The lineage system of Ikeda et al. [24] can perform efficient backward tracing queries for MapReduce dataflows. Lipstick, [33] a lineage system for Pig, [34] while able to perform both backward and forward tracing, is specific to Pig and SQL operators and can only perform coarse-grain tracing for black-box operators. Thus, there is a need for a system that enables efficient forward and backward tracing for generic DISC systems and dataflows with black-box operators.

Sophisticated replay

Replaying only specific inputs or portions of a data-flow is crucial for efficient debugging and simulating what-if scenarios. Ikeda et al. present a methodology for lineage-based refresh, which selectively replays updated inputs to recompute affected outputs. [35] This is useful for debugging, i.e. for re-computing outputs when a bad input has been fixed. However, sometimes a user may want to remove the bad input and replay the lineage of outputs previously affected by the error to produce error-free outputs. We call this exclusive replay. Another use of replay in debugging involves replaying bad inputs for step-wise debugging (called selective replay). Current approaches to using lineage in DISC systems do not address both of these. Thus, there is a need for a lineage system that can perform both exclusive and selective replays to address different debugging needs.

Anomaly detection

Anomaly detection is one of the primary debugging concerns in DISC analytics. In long dataflows with several hundreds of operators or tasks, manual inspection can be tedious and prohibitive. Even if lineage is used to narrow the subset of operators to examine, the lineage of a single output can still span multiple operators. There is a need for an inexpensive automated debugging system, which can substantially narrow the set of potentially faulty operators, with reasonable accuracy, to minimize the amount of manual examination required.

See also

  • Provenance
  • Big data
  • Topological sorting
  • Debugging
  • NoSQL
  • Scalability
  • Directed acyclic graph

References

  1. http://www.techopedia.com/definition/28040/data-lineage
  2. Hoang, Natalie (2017-03-16). "Data Lineage Helps Drive Business Value | Trifacta". Trifacta. Retrieved 2017-09-20.
  3. De, Soumyarupa (2012). Newt: an architecture for lineage-based replay and debugging in DISC systems. UC San Diego: b7355202. Retrieved from: https://escholarship.org/uc/item/3170p7zn
  4. http://newstex.com/2014/07/12/thedataexplosionin2014minutebyminuteinfographic/
  5. Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1): 107-113, January 2008.
  6. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59-72, New York, NY, USA, 2007. ACM.
  7. Apache Hadoop. http://hadoop.apache.org.
  8. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pp. 135-146, New York, NY, USA, 2010. ACM.
  9. Shimin Chen and Steven W. Schlosser. Map-Reduce meets wider varieties of applications. Technical report, Intel Research, 2008.
  10. The data deluge in genomics. https://www-304.ibm.com/connections/blogs/ibmhealthcare/entry/dataoverload in genomics3?lang=de, 2010.
  11. Yogesh L. Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3): 31-36, September 2005.
  12. Ian Foster, Jens Vockler, Michael Wilde, and Yong Zhao. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In 14th International Conference on Scientific and Statistical Database Management, July 2002.
  13. Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google Inc, 2010.
  14. Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. Data provenance: Some basic issues. In Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science, FST TCS 2000, pp. 87-93, London, UK, 2000. Springer-Verlag.
  15. http://www.emc.com/about/news/press/2012/20121211-01.htm
  16. Webopedia. http://www.webopedia.com/TERM/U/unstructured_data.html
  17. Schaefer, Paige (2016-08-24). "Differences Between Structured & Unstructured Data". Trifacta. Retrieved 2017-09-20.
  18. SAS. http://www.sas.com/resources/asset/five-big-data-challenges-article.pdf
  19. "5 Requirements for Effective Self-Service Data Preparation". www.itbusinessedge.com. Retrieved 2017-09-20.
  20. Kandel, Sean (2016-11-04). "Tracking Data Lineage in Financial Services | Trifacta". Trifacta. Retrieved 2017-09-20.
  21. Pasquier, Thomas; Lau, Matthew K.; Trisovic, Ana; Boose, Emery R.; Couturier, Ben; Crosas, Mercè; Ellison, Aaron M.; Gibson, Valerie; Jones, Chris R.; Seltzer, Margo (5 September 2017). "If these data could talk". Scientific Data. 4: 170114. doi:10.1038/sdata.2017.114.
  22. Robert Ikeda and Jennifer Widom. Data lineage: A survey. Technical report, Stanford University, 2009.
  23. Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB Journal, 12(1), 2003.
  24. Robert Ikeda, Hyunjung Park, and Jennifer Widom. Provenance for generalized map and reduce workflows. In Proc. of CIDR, January 2011.
  25. C. Olston and A. Das Sarma. Ibis: A provenance manager for multi-layer systems. In Proc. of CIDR, January 2011.
  26. http://info.hortonworks.com/rs/549-QAL-086/images/Hadoop-Governance-White-Paper.pdf
  27. SEC Small Entity Compliance Guide
  28. Dionysios Logothetis, Soumyarupa De, and Kenneth Yocum. 2013. Scalable lineage capture for debugging DISC analytics. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 17, 15 pages.
  29. Wenchao Zhou, Qiong Fei, Arjun Narayan, Andreas Haeberlen, Boon Thau Loo, and Micah Sherr. Secure network provenance. In Proceedings of the 23rd ACM Symposium on Operating System Principles (SOSP), December 2011.
  30. Rodrigo Fonseca, George Porter, Randy Katz, Scott Shenker, and Ion Stoica. X-Trace: A pervasive network tracing framework. In Proceedings of NSDI '07, 2007.
  31. Anish Das Sarma, Alpa Jain, and Philip Bohannon. PROBER: Ad-Hoc Debugging of Extraction and Integration Pipelines. Technical report, Yahoo, April 2010.
  32. Mingwu Zhang, Xiangyu Zhang, Xiang Zhang, and Sunil Prabhakar. Tracing lineage beyond relational operators. In Proc. Conference on Very Large Data Bases (VLDB), September 2007.
  33. Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, and Julia Stoyanovich. Putting lipstick on pig: Enabling database-style workflow provenance. In Proc. of VLDB, August 2011.
  34. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. of ACM SIGMOD, Vancouver, Canada, June 2008.
  35. Robert Ikeda, Semih Salihoglu, and Jennifer Widom. Provenance-based refresh in data-oriented workflows. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pp. 1659-1668, New York, NY, USA, 2011. ACM.