Speaker: Ana Trisovic, Associate Professor, Harvard University
Host: Li Wei (李伟)
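Speaker bio: Ana Trisovic is a Sloan Postdoctoral Fellow at the Institute for Quantitative Social Science (IQSS) at Harvard University. Her research focuses on computational reproducibility, data preservation, and data science. Working with the Dataverse team, she studies how automation, metadata, and encapsulation can facilitate the reuse of research data and code. Previously, she was a CLIR Postdoctoral Fellow at the University of Chicago, where she worked with the Energy Policy Institute (EPIC) and the Library. She completed her PhD in computer science at the University of Cambridge in 2018, with a thesis titled "Data preservation and reproducibility at the LHCb experiment at CERN". While at CERN, she worked with the LHCb collaboration and the CERN Open Data and CERN Analysis Preservation groups. During her PhD, she was a Muir Wood Scholar at Newnham College, was admitted to the CERN Doctoral Student Programme, and received the Google Anita Borg Memorial Scholarship.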
Time: June 1, 2021, 8:30-11:00
Tencent Meeting ID: 387 574 450
Title 1: Improving code readability with machine learning: concept and experiment
Abstract: Although authors are increasingly expected to make their data and code available, many of these files are unintelligible or hard to follow in practice, which limits research reproducibility and reuse. This poses a major problem for students and researchers in the same field, who cannot build on previously published findings in a new inquiry. To address this, we identify code readability as an important factor in reuse. We conduct a survey of students and researchers to learn which code features are considered desirable, and we create a machine learning model that predicts code readability from those features.
Title 2: Improving code readability with machine learning: implementation and application
Abstract: This talk proposes an open-source platform that helps improve the reproducibility and readability of research projects involving R code. The platform combines two components: a code readability assessment, based on a machine learning model trained on a code readability survey, and an automatic containerization service that executes the code files uploaded by the user and warns them of any reproducibility errors. With these checks, researchers could ensure the reproducibility and readability of their projects and thereby fast-track their verification and reuse.
Title 3: Using clustering to identify patterns in open research datasets: a concept
Abstract: Over the last decade, research has become more computational, increasing the value of software as a primary scientific asset. Researchers not only use existing software but also develop and publish new code to carry out their studies. We use unsupervised machine learning and principal component analysis (PCA) to examine open research datasets and understand how research data and code cluster together.
Title 4: Using clustering to identify patterns in open research datasets: implementation and implications
Abstract: Sharing data and code for reuse has become increasingly important in scientific work over the past decade. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. We use results from unsupervised machine learning and principal component analysis (PCA) on open research datasets to propose improvements in research cyber-infrastructure.