TY - GEN
T1 - Benchmarks and process management in data science
T2 - 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017
AU - Fayyad, Usama M.
AU - Candel, Arno
AU - De La Rubia, Eduardo Ariño
AU - Pafka, Szilárd
AU - Chong, Anthony
AU - Lee, Jeong Yoon
N1 - Publisher Copyright:
© 2017 Copyright held by the owner/author(s).
PY - 2017/8/13
Y1 - 2017/8/13
N2 - This panel aims to address areas that are widely acknowledged to be of critical importance to the success of Data Science projects and to the healthy growth of KDD/Data Science as a field of scientific research. However, despite this acknowledgement of their criticality, these areas receive insufficient attention at the major conferences in the field. Furthermore, there is a lack of concrete actions and tools to address these areas in practice. These areas are summarized as follows: 1. Ask any data scientist or machine learning practitioner what they spend the majority of their time working on, and you will most likely get an answer indicating that 80% to 90% of their time is spent on "Data Chasing", "Data Sourcing", "Data Wrangling", "Data Cleaning", and generally what researchers often dismissively refer to as "Data Preparation". The process of producing statistical or data mining models from data is typically "messy" and certainly lacks management tools to help manage, replicate, reconstruct, and capture all the knowledge that goes into 90% of a Data Scientist's activities. The intensive Data Engineering work that goes into exploring and determining the representation of the problem, and the significant amount of "data cleaning" that ensues, creates a plethora of extracts, files, and other artifacts that are only meaningful to the data scientist. 2. The severe lack of benchmarks in the field, especially ones at big data scale, is an impediment to true, objective, measurable progress on performance. The results of each paper are highly dependent on the large degree of freedom an author has in configuring competing models and in determining which data sets to use (often data that is not available to others for replicating results). 3. Monitoring the health of models in production, and deploying models into production environments efficiently and effectively, is a black art and often an ignored area. Many models are effectively "orphans" with no means of getting appropriate health monitoring. The task of deploying a built model to production is frequently beyond the capabilities of a Data Scientist and the understanding of the IT team. For a typical company, a Machine Learning or Data Science expert is a major investment; yet these people are in such hot demand that the likelihood of churn is high. Typically, when a data scientist is replaced, the process pretty much starts over with a tabula rasa... In fact, I would argue that most data scientists returning to tasks they built themselves 1-2 years earlier are unable to reconstruct what they did. For this panel, we have selected a unique set of experts who have different experiences and perspectives on these important problems and on how they should be dealt with in real environments. It is our hope that the panel discussion will not only produce recommendations on what to do about these painful impediments to successful project deployments, but also serve as an eye-opener for the research community to the importance of paying close attention to issues of Data and Model Management in KDD, as well as the need to think carefully about the lifecycle of models and how they can be managed, maintained, and deployed systematically. Without addressing these critical deployment and practice issues, our field will be challenged to grow in a healthy and sustainable way. The expert panelists, along with the panel moderator, Usama Fayyad, are listed below together with their biographical sketches.
AB - This panel aims to address areas that are widely acknowledged to be of critical importance to the success of Data Science projects and to the healthy growth of KDD/Data Science as a field of scientific research. However, despite this acknowledgement of their criticality, these areas receive insufficient attention at the major conferences in the field. Furthermore, there is a lack of concrete actions and tools to address these areas in practice. These areas are summarized as follows: 1. Ask any data scientist or machine learning practitioner what they spend the majority of their time working on, and you will most likely get an answer indicating that 80% to 90% of their time is spent on "Data Chasing", "Data Sourcing", "Data Wrangling", "Data Cleaning", and generally what researchers often dismissively refer to as "Data Preparation". The process of producing statistical or data mining models from data is typically "messy" and certainly lacks management tools to help manage, replicate, reconstruct, and capture all the knowledge that goes into 90% of a Data Scientist's activities. The intensive Data Engineering work that goes into exploring and determining the representation of the problem, and the significant amount of "data cleaning" that ensues, creates a plethora of extracts, files, and other artifacts that are only meaningful to the data scientist. 2. The severe lack of benchmarks in the field, especially ones at big data scale, is an impediment to true, objective, measurable progress on performance. The results of each paper are highly dependent on the large degree of freedom an author has in configuring competing models and in determining which data sets to use (often data that is not available to others for replicating results). 3. Monitoring the health of models in production, and deploying models into production environments efficiently and effectively, is a black art and often an ignored area. Many models are effectively "orphans" with no means of getting appropriate health monitoring. The task of deploying a built model to production is frequently beyond the capabilities of a Data Scientist and the understanding of the IT team. For a typical company, a Machine Learning or Data Science expert is a major investment; yet these people are in such hot demand that the likelihood of churn is high. Typically, when a data scientist is replaced, the process pretty much starts over with a tabula rasa... In fact, I would argue that most data scientists returning to tasks they built themselves 1-2 years earlier are unable to reconstruct what they did. For this panel, we have selected a unique set of experts who have different experiences and perspectives on these important problems and on how they should be dealt with in real environments. It is our hope that the panel discussion will not only produce recommendations on what to do about these painful impediments to successful project deployments, but also serve as an eye-opener for the research community to the importance of paying close attention to issues of Data and Model Management in KDD, as well as the need to think carefully about the lifecycle of models and how they can be managed, maintained, and deployed systematically. Without addressing these critical deployment and practice issues, our field will be challenged to grow in a healthy and sustainable way. The expert panelists, along with the panel moderator, Usama Fayyad, are listed below together with their biographical sketches.
KW - Accuracy
KW - Data benchmarks
KW - Memory footprint
KW - Model deployment and monitoring
KW - Model management
KW - Performance benchmarks
KW - Software implementations
KW - Training speed
UR - http://www.scopus.com/inward/record.url?scp=85029030495&partnerID=8YFLogxK
U2 - 10.1145/3097983.3120998
DO - 10.1145/3097983.3120998
M3 - Conference contribution
AN - SCOPUS:85029030495
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 31
EP - 32
BT - KDD 2017 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A2 - Matwin, Stan
A2 - Yu, Shipeng
A2 - Farooq, Faisal
PB - Association for Computing Machinery
Y2 - 13 August 2017 through 17 August 2017
ER -