Programmers should write code comments, but not on every line of code. We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing commented code to learn the locations that developers choose. Once trained, the model can predict comment locations in new code. Our models achieved a precision of 74% and a recall of 13% in identifying comment-worthy locations. This first success opens the door to future work, both on the new where-to-comment problem and on guiding comment generation. Our code and data are available at http://groups.inf.ed.ac.uk/cup/comment-locator/.
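To make the reported metrics concrete, the following minimal sketch shows how precision and recall could be computed over predicted comment locations. The function name and the line numbers are hypothetical and chosen only for illustration; they are not taken from the paper's code or data.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for predicted vs. actual comment locations."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: what fraction of the suggested locations were really commented.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the really commented locations were suggested.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: line numbers where developers actually wrote comments,
# and line numbers where an imagined model suggested one.
actual_comment_lines = {3, 10, 17, 25, 31, 40, 52, 60}
predicted_comment_lines = {3, 11}

precision, recall = precision_recall(predicted_comment_lines, actual_comment_lines)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A model with high precision and low recall, as in the abstract, makes few suggestions, but most of the suggestions it does make coincide with locations developers chose.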
Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the standard Stanford part-of-speech tagger. POSIT improves on the state of the art in language identification by 10.6% and in PoS/AST tagging by 23.7% in accuracy.
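As an illustration of the kind of output described above, the sketch below represents each token of a mixed-text sentence with a language label and a tag, then extracts contiguous code snippets. It does not reproduce POSIT or its model: the `TaggedToken` type, the `code_spans` helper, and the AST-style tag names are assumptions made for this example, and the tags are written by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    text: str
    language: str  # "nl" for natural language, "code" for source code
    tag: str       # part-of-speech tag or (hypothetical) AST-style tag


def code_spans(tokens):
    """Group consecutive code tokens into snippets (joined by spaces)."""
    spans, current = [], []
    for token in tokens:
        if token.language == "code":
            current.append(token.text)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a snippet that runs to the end of the sentence
        spans.append(" ".join(current))
    return spans


# Hand-labelled example sentence: "Call free(ptr) after use"
sentence = [
    TaggedToken("Call", "nl", "VB"),
    TaggedToken("free", "code", "function_name"),   # hypothetical AST tag
    TaggedToken("(", "code", "punctuation"),        # hypothetical AST tag
    TaggedToken("ptr", "code", "variable"),         # hypothetical AST tag
    TaggedToken(")", "code", "punctuation"),        # hypothetical AST tag
    TaggedToken("after", "nl", "IN"),
    TaggedToken("use", "nl", "NN"),
]

snippets = code_spans(sentence)
print(snippets)
```

Once every token carries a language label, extracting code snippets from mixed text reduces to collecting runs of tokens labelled as code, which is why language identification matters for the reuse problem mentioned above.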
The surprising predictability of source code has triggered a boom in tools using language models for code. Code is much more predictable than natural language, but the reasons are not well understood. We propose a dual channel view of code: code combines a formal channel for specifying execution and a natural language channel, in the form of identifiers and comments, that assists human comprehension. Computers ignore the natural language channel, but developers read both and, when writing code for long-term use and maintenance, consider each channel's audience: computer and human. As developers hold both channels in mind when coding, we posit that the two channels interact and constrain each other; we call these dual channel constraints. Their impact has been neglected. We describe how they can lead to humans writing code in a way more predictable than natural language, highlight pioneering research that has implicitly or explicitly used parts of this theory, and drive new research, such as systematically searching for cross-channel inconsistencies. Dual channel constraints provide an exciting opportunity for truly multi-disciplinary research: for computer scientists they promise improvements to program analysis via a more holistic approach to code, and for psycholinguists they promise a novel environment for studying linguistic processes.
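The two channels can be illustrated with a small sketch. All function names here are hypothetical and chosen for this example only. The first two functions have the same formal channel but differ in their natural language channel; the third shows one possible form of cross-channel inconsistency, where the name says one thing and the computation does another.

```python
def rectangle_area(width, height):
    # Natural language channel: the names and this comment explain intent.
    return width * height


def f(a, b):
    # Same formal channel as rectangle_area, but uninformative identifiers.
    return a * b


def rectangle_perimeter(width, height):
    # Cross-channel inconsistency: the name promises a perimeter,
    # but the formal channel still computes an area.
    return width * height


# The interpreter only sees the formal channel, so these agree on every input.
same_behaviour = all(
    rectangle_area(w, h) == f(w, h) for w in range(5) for h in range(5)
)
print(same_behaviour)
print(rectangle_perimeter(3, 4))
```

A computer executes `rectangle_area` and `f` identically, while a human reader learns far more from the first. The third function runs without error, so only an analysis that reads both channels together could flag it as suspicious.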