At the end of September, George Chernishev gave a talk at the ACM SIGMOD Moscow Chapter Meeting, where he presented Desbordante, an open-source data profiler intended for the discovery of complex patterns. He discussed Desbordante’s vision, purpose, competitors, and positioning. He also gave examples of formalisms that define discovered patterns. The talk concluded with a discussion of Desbordante’s development history, features, and architecture, as well as a short live demonstration of the tool.
Yesterday, the video of the talk was made available on the chapter's YouTube channel. The slides are also available on the talk page. The talk and all of the materials are in Russian.
Abstract
Desbordante is a high-performance data profiler oriented towards exploratory data analysis. Unlike many other profilers, it focuses on uncovering complex patterns hidden in data, such as functional dependencies (both exact and approximate), conditional dependencies, association rules, and others. These patterns are described by various mathematical formalisms, and their extraction requires sophisticated algorithms. Desbordante also offers data cleaning scenarios based on combinations of these formalisms. For now, we have implemented a typo detection pipeline using approximate and exact dependencies.
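To make the notion of exact and approximate functional dependencies more concrete, here is a minimal Python sketch. It is not Desbordante's code or API: the function names `fd_error` and `violating_rows` and the toy table are hypothetical, and the error measure is the commonly used g3 measure (the minimum fraction of rows that must be removed for the dependency to hold exactly). Real discovery algorithms search over all attribute combinations; this sketch only checks a single given dependency.

```python
from collections import Counter, defaultdict


def _group_rhs_counts(rows, lhs, rhs):
    """Count right-hand-side values for every combination of left-hand-side values."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    return groups


def fd_error(rows, lhs, rhs):
    """g3 error of the dependency lhs -> rhs.

    0.0 means the dependency holds exactly; a small positive value
    means it holds approximately.
    """
    if not rows:
        return 0.0
    groups = _group_rhs_counts(rows, lhs, rhs)
    # In each group, keep only the rows carrying the most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


def violating_rows(rows, lhs, rhs):
    """Indices of rows whose rhs value differs from the majority value of their group."""
    groups = _group_rhs_counts(rows, lhs, rhs)
    result = []
    for index, row in enumerate(rows):
        key = tuple(row[attr] for attr in lhs)
        majority_value, _ = groups[key].most_common(1)[0]
        if row[rhs] != majority_value:
            result.append(index)
    return result


# Hypothetical toy table with one misspelled city name.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "94105", "city": "San Francisco"},
]

print(fd_error(rows, ["zip"], "city"))        # 0.25: holds only approximately
print(fd_error(rows, ["city"], "zip"))        # 0.0: holds exactly
print(violating_rows(rows, ["zip"], "city"))  # [2]: the row with the likely typo
```

In this toy example, `zip -> city` holds approximately but not exactly, and the single violating row is a natural typo candidate. This is only the intuition behind combining exact and approximate dependencies for typo detection; the actual pipeline in Desbordante may work differently.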
Our idea is to open science-intensive exploratory data analysis to the broader public. The aforementioned formalisms have long been known to the database and data mining academic communities; however, they are effectively inaccessible to most other people who would find them useful. The implementations of discovery algorithms were either scattered around the web, required effort to set up and run, or were slow (usually implemented in either Python or Java). At the same time, discovered patterns could be useful to various scientists (for hypothesis formulation), all kinds of analysts, and machine learning professionals (for feature engineering and ablation studies).
Desbordante is inspired by Metanome [1], but differs from it in several ways:
* it is a ready-to-use web application with a user-friendly interface that provides various filters and sorting options
* it offers significantly improved performance [2]: it is implemented in C++ with a completely new codebase
* it is resilient through extensive use of containerization and allows resource scheduling
* it provides pipelines of formalisms that implement data cleaning scenarios
* it has a publicly accessible deployed version
This is a beta version, and we plan to continue developing our tool. More formalisms are soon to follow: for example, we have already implemented metric dependencies in the console version, and algebraic constraints are in the works. Our goals also include extending our library of data cleaning scenarios. Finally, we plan to implement many quality-of-life improvements, such as in-place table updates (interactive tables), downloading results, and allowing users to construct custom pipelines consisting of implemented formalisms.
Desbordante is already deployed and ready to use, and we will be happy to hear your feedback. Note that you have to register to try it on your own datasets. Due to high demand, we currently enforce strict time and memory limits (a task is killed if it exceeds them).
Desbordante is an open-source project, and we welcome contributions. Message us if you are interested.
Links
GitHub: https://github.com/Mstrutov/Desbordante
Deployed version: https://desbordante.unidata-platform.ru/
Documentation: https://mstrutov.github.io/Desbordante/
[1] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with metanome. Proc. VLDB Endow. 8, 12 (August 2015), 1860–1863. https://doi.org/10.14778/2824032.2824086
[2] M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344–354, doi: 10.23919/FRUCT52173.2021.9435469. https://fruct.org/publications/fruct29/files/Strut…