Open research principles for depression diagnosis with machine learning
Andrew Bailey, a PhD researcher in machine learning for emotion recognition supervised by Professor Mark Plumbley, has been focusing on diagnosing depression and on the challenges this field presents to new researchers.
We noticed two main issues from our preliminary research into machine learning for depression detection:
- Datasets commonly used for depression research, such as the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ), may contain erroneous data (errors in transcription files, corrupted files, etc.). This is a barrier for new researchers, who must clean the dataset before they can begin experimenting.
- Many researchers in this domain do not share their code with their publications. This is problematic: it leads to a lack of reproducibility and transparency, and it makes it harder for new researchers to enter the area of diagnosing depression.
Our approach and challenges
We created two frameworks to tackle these two issues:
- A framework to clean the dataset (due to data protection we were not allowed to share the data itself). Cleaning the dataset was challenging as there were many errors to fix, not all of which had been recorded by previous researchers. To make the cleaning framework robust, we applied software-engineering testing practices to the code.
- A framework to provide a baseline model. Creating this was equally challenging: despite the many technical details reported in research papers, assumptions were still required, as not every parameter can be recorded in a paper. We also used libraries designed to promote reproducibility, allowing anyone to repeat our experiment and reach the same conclusion.
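As a minimal sketch of the testing approach described above, the snippet below pairs a small cleaning step with a unit test, alongside a seed-fixing helper of the kind reproducibility-oriented code often includes. The function names and the specific cleaning rule are hypothetical illustrations, not the actual DAIC-WOZ framework code.

```python
import random

def clean_transcript_line(line: str) -> str:
    """Hypothetical cleaning step: replace stray tab characters and
    collapse repeated whitespace in a transcript line."""
    return " ".join(line.replace("\t", " ").split())

def set_seed(seed: int = 42) -> None:
    """Fix the random seed so that stochastic steps (e.g. data
    shuffling) give the same result on every run."""
    random.seed(seed)

def test_clean_transcript_line() -> None:
    """Unit test guarding the cleaning step: if a future change breaks
    the whitespace handling, this test fails immediately."""
    assert clean_transcript_line("hello\t  world ") == "hello world"

test_clean_transcript_line()
```

Encoding each known dataset error as a test like this means the cleaning framework documents the errors it fixes and catches regressions automatically.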
Creating these two frameworks took a considerable amount of time. Even after completion, we hesitated to publish our code for two reasons:
- Fear of judgement regarding coding quality.
- Fear of plagiarism prior to publication.

Since we encountered these pressures ourselves, we assume that other researchers do too, and that this cultivates a lack of reproducibility and transparency in the research community.
Motivated by the importance of the research area, we felt the benefits outweighed our concerns and we placed both frameworks on GitHub.
As a result, we connected with several international researchers, some of whom helped to improve the quality of the frameworks. This work also prompted conversations encouraging existing researchers in the field to release their code to the community.
Making our frameworks freely available has not only helped promote reproducibility and transparency in this research area but has also resulted in collaboration with fellow researchers and has hopefully made the research area more accessible for new researchers.