Automation of Data Prep, ML, and Data Science: New Cure or Snake Oil?

Published: 18 June 2021

Abstract

    As machine learning (ML), artificial intelligence (AI), and Data Science grow in practical importance, a large part of the ML/AI software industry claims to have built tools and platforms to automate the entire workflow of ML. That includes vexing problems of data preparation (prep), studied intensively by the database (DB) community for decades, with basically no resolution so far. Such claims by the ML/AI industry face a stunning lack of scientific scrutiny from the DB and ML research worlds, largely due to the lack of meaningful, large, and objective benchmarks. As such tools rapidly gain adoption among enterprises and other customers, this panel will debate whether the new ML/AI industry is basically selling "snake oil" to such users, how to evolve away from the status quo by instituting meaningful new benchmarks, creating new partnerships between industry and academia for this, and other pressing questions in this important arena. We aim to spur vigorous conversations that will hopefully lead to genuine new cures for an age-old affliction in Data Science.

    Information

    Published In

    SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
    June 2021
    2969 pages
    ISBN: 9781450383431
    DOI: 10.1145/3448016
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. automl
    2. benchmark datasets
    3. data cleaning
    4. data preparation
    5. machine learning

    Qualifiers

    • Panel

    Conference

    SIGMOD/PODS '21

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) AI and the democratization of knowledge. Scientific Data 11:1. DOI: 10.1038/s41597-024-03099-1. Online publication date: 5-Mar-2024
    • (2024) Squeezing adaptive deep learning methods with knowledge distillation for on-board cloud detection. Engineering Applications of Artificial Intelligence 132:C. DOI: 10.1016/j.engappai.2023.107835. Online publication date: 1-Jun-2024
    • (2023) Pollock: A Data Loading Benchmark. Proceedings of the VLDB Endowment 16:8 (1870-1882). DOI: 10.14778/3594512.3594518. Online publication date: 1-Apr-2023
    • (2023) The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in Deep Learning Pipelines. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (936-949). DOI: 10.1109/ICDE55515.2023.00077. Online publication date: Apr-2023
    • (2022) BETZE: Benchmarking Data Exploration Tools with (Almost) Zero Effort. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (2385-2398). DOI: 10.1109/ICDE53745.2022.00224. Online publication date: May-2022
