Talk title: Reproducible, Reusable and Robust Data Science: from Theory to Practice
Speaker: Ana Trisovic, Associate Professor, Harvard University
Host: Li Wei (李伟)
Time: December 21, 2020, 9:55-10:40 a.m.
Tencent Meeting ID: 300 853 885
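About the speaker: Ana Trisovic is a Sloan Postdoctoral Fellow at the Institute for Quantitative Social Science (IQSS) at Harvard University. Her research focuses on computational reproducibility, data preservation, and data science. Working with the Dataverse team, she studies how automation, metadata, and encapsulation can facilitate the reuse of research data and code. Previously, she was a CLIR Postdoctoral Fellow at the University of Chicago, where she worked with the Energy Policy Institute (EPIC) and the Library. She completed her PhD in Computer Science at the University of Cambridge in 2018, with a thesis on data preservation and reproducibility at the LHCb experiment at CERN. While at CERN, she worked with the LHCb collaboration and the CERN Open Data and CERN Analysis Preservation groups. During her PhD, she was a Muir Wood Scholar at Newnham College, a member of the CERN Doctoral Student Programme, and a recipient of the Google Anita Borg Memorial Scholarship.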
Abstract: A new challenge in data science and machine learning is to ensure that published results are reliable and robust, which is essential for research verification and trustworthiness. In recent years, however, researchers have had difficulty recreating and replicating machine learning models, leading to a lack of reproducibility, defined as obtaining consistent results using the same input data, methods, and code. Furthermore, a reproducibility crisis has been reported, as many published results cannot be reproduced. To enable reproducibility, a researcher needs actionable steps that facilitate implementation and help mark progress in practice. This talk will focus on actionable steps for enabling reproducibility and reuse. In particular, we will discuss several aspects that can help improve the robustness of a data science analysis, such as data provenance, feature provenance, model provenance, and the software environment. The talk will outline concrete guidelines and checklists as tools for building a reproducible machine learning pipeline and effectively sharing data science results.