Benchmarks and process management in data science: Will we ever get over the mess?

Usama M. Fayyad, Arno Candel, Eduardo Ariño De La Rubia, Szilárd Pafka, Anthony Chong, Jeong Yoon Lee

Research output: Contribution to Book/Report › Conference contribution › peer-review

Abstract

This panel addresses areas that are widely acknowledged as critical to the success of Data Science projects and to the healthy growth of KDD/Data Science as a field of scientific research. Despite this acknowledged criticality, these areas receive insufficient attention at the major conferences in the field, and there is a shortage of concrete actions and tools to address them in actual practice. These areas are summarized as follows:

1. Ask any data scientist or machine learning practitioner what they spend the majority of their time on, and the answer will most likely indicate that 80% to 90% of their time goes to "Data Chasing", "Data Sourcing", "Data Wrangling", "Data Cleaning", and generally what researchers would refer to, often dismissively, as "Data Preparation". The process of producing statistical or data mining models from data is typically messy and certainly lacks management tools to help manage, replicate, reconstruct, and capture the knowledge that goes into 90% of a data scientist's activities. The intensive data engineering work of exploring and determining the representation of the problem, and the significant amount of data cleaning that ensues, creates a plethora of extracts, files, and other artifacts that are meaningful only to the data scientist who produced them.

2. The severe lack of benchmarks in the field, especially at big data scale, is an impediment to true, objective, measurable progress on performance. The results of each paper depend heavily on the large degrees of freedom an author has in configuring competing models and in choosing which data sets to use (often data that is not available to others for replicating the results).

3. Monitoring the health of models in production, and deploying models into production environments efficiently and effectively, is a black art and often an ignored area. Many models are effectively "orphans" with no means of appropriate health monitoring. The task of deploying a built model to production is frequently beyond the capabilities of a data scientist and beyond the understanding of the IT team. For a typical company, a machine learning or Data Science expert is a major investment; yet these experts are in such high demand that the likelihood of churn is high. Typically, when a data scientist is replaced, the process pretty much starts over with a tabula rasa. In fact, I would argue that most data scientists returning to tasks they built themselves 1-2 years earlier are unable to reconstruct what they did.

For this panel, we have selected a unique set of experts with different experiences and perspectives on these important problems and on how they should be dealt with in real environments. We hope the panel discussion will not only produce recommendations on what to do about these painful impediments to successful project deployments, but also serve as an eye-opener for the research community on the importance of paying close attention to issues of data and model management in KDD, as well as the need to think carefully about the lifecycle of models and how they can be managed, maintained, and deployed systematically. Without addressing these critical deployment and practice issues, our field will be challenged to grow in a healthy and sustainable way. The expert panelists, along with the panel moderator, Usama Fayyad, are listed below with their biographical sketches.

Original language: English
Title of host publication: KDD 2017 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Editors: Stan Matwin, Shipeng Yu, Faisal Farooq
Publisher: Association for Computing Machinery
Pages: 31-32
Number of pages: 2
ISBN (Electronic): 9781450348874
DOIs
State: Published - 13 Aug 2017
Externally published: Yes
Event: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017 - Halifax, Canada
Duration: 13 Aug 2017 - 17 Aug 2017

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume: Part F129685

Conference

Conference: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017
Country/Territory: Canada
City: Halifax
Period: 13/08/17 - 17/08/17

Keywords

  • Accuracy
  • Data benchmarks
  • Memory footprint
  • Model deployment and monitoring
  • Model management
  • Performance benchmarks
  • Software implementations
  • Training speed

