ML Based Lineage in Databases

Leybovich, Michael; Shmueli, Oded

Abstract:We track the lineage of tuples throughout their database lifetime. That is, we consider a scenario in which tuples (records) that are produced by a query may affect other tuple insertions into the DB, as part of a normal workflow. As time goes on, exact provenance explanations for such tuples become deeply nested, increasingly consuming space, and resulting in decreased clarity and readability. We present a novel approach for approximating lineage tracking, using a Machine Learning (ML) and Natural Language Processing (NLP) technique; namely, word embedding. The basic idea is summarizing (and approximating) the lineage of each tuple via a small set of constant-size vectors (the number of vectors per-tuple is a hyperparameter). Therefore, our solution does not suffer from space complexity blow-up over time, and it "naturally ranks" explanations to the existence of a tuple. We devise an alternative and improved lineage tracking mechanism, that of keeping track of and querying lineage at the column level; thereby, we manage to better distinguish between the provenance features and the textual characteristics of a tuple. We integrate our lineage computations into the PostgreSQL system via an extension (ProvSQL) and extensive experiments exhibit useful results in terms of accuracy against exact, semiring-based, justifications, especially for the column-based (CV) method which exhibits high precision and high per-level recall. In the experiments, we focus on tuples with \textit{multiple generations} of tuples in their lifelong lineage and analyze them in terms of direct and distant lineage.

Comments:	12 pages
Subjects:	Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:2109.06339 [cs.DB]
	(or arXiv:2109.06339v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2109.06339

Computer Science > Databases

Title:ML Based Lineage in Databases

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators