Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is
essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching
techniques from Natural Language Processing and adapt them to apply to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens, part-of-speech tags
for natural language words, and predicts the source language of a token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from
the Standard Stanford part-of-speech tagger. POSIT improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.
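
As a rough illustration of the kind of output described above, the sketch below models each token of a mixed-text sentence with a language label and a tag, then pulls out the contiguous code spans. It is not POSIT itself: the `TaggedToken` type, the `code_spans` helper, the label values `"nl"` and `"code"`, and the hand-written tags are all hypothetical names chosen for this example, and no model is trained or run.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class TaggedToken:
    text: str
    lang: str  # hypothetical labels: "nl" (natural language) or "code"
    tag: str   # PoS tag for "nl" tokens, AST-style tag for "code" tokens


def code_spans(tokens):
    """Return each maximal run of consecutive code tokens as one string."""
    spans = []
    for lang, run in groupby(tokens, key=lambda t: t.lang):
        if lang == "code":
            spans.append(" ".join(t.text for t in run))
    return spans


# Hand-labelled example (assumed tags, not produced by any tagger).
sentence = [
    TaggedToken("Call", "nl", "VB"),
    TaggedToken("free", "code", "method_name"),
    TaggedToken("(", "code", "punctuation"),
    TaggedToken("ptr", "code", "variable"),
    TaggedToken(")", "code", "punctuation"),
    TaggedToken("after", "nl", "IN"),
    TaggedToken("use", "nl", "NN"),
]

print(code_spans(sentence))  # ['free ( ptr )']
```

Once every token carries a language label, extracting a snippet reduces to grouping adjacent tokens by that label, which is why token-level language identification matters for reuse.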
Casalnuovo Casey, Barr Earl T., Dash Santanu Kumar, Devanbu Prem (2020) A Theory of Dual Channel Constraints, Proceedings of the 42nd International Conference on Software Engineering (New Ideas and Emerging Results) (ICSE NIER 2020)
Association for Computing Machinery (ACM)
The surprising predictability of source code has triggered a boom
in tools using language models for code. Code is much more predictable
than natural language, but the reasons are not well understood.
We propose a dual channel view of code; code combines a
formal channel for specifying execution and a natural language
channel in the form of identifiers and comments that assists human
comprehension. Computers ignore the natural language channel,
but developers read both and, when writing code for long-term use
and maintenance, consider each channel's audience: computer and
human. As developers hold both channels in mind when coding,
we posit that the two channels interact and constrain each other;
we call these dual channel constraints. Their impact has been neglected.
We describe how they can lead to humans writing code
in a way more predictable than natural language, highlight pioneering
research that has implicitly or explicitly used parts of this
theory, and drive new research, such as systematically searching
for cross-channel inconsistencies. Dual channel constraints provide
an exciting opportunity as truly multi-disciplinary research; for
computer scientists they promise improvements to program analysis
via a more holistic approach to code, and to psycholinguists they
promise a novel environment for studying linguistic processes.
Programmers should write code comments, but not on every line
of code. We have created a machine learning model that suggests
locations where a programmer should write a code comment. We
trained it on existing commented code to learn locations that are
chosen by developers. Once trained, the model can predict locations
in new code. Our models achieved precision of 74% and recall of
13% in identifying comment-worthy locations. This first success
opens the door to future work, both in the new where-to-comment
problem and in guiding comment generation. Our code and data are
available at http://groups.inf.ed.ac.uk/cup/comment-locator/.
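
To make the reported figures concrete, the following minimal sketch computes precision and recall from raw counts. The counts are hypothetical, chosen only so that the rates come out near the 74% and 13% quoted above; they are not taken from the paper's data, and `precision_recall` is a name introduced for this example.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: of 100 suggested locations, 74 were really commented
# by developers, while 495 commented locations were never suggested.
p, r = precision_recall(tp=74, fp=26, fn=495)
print(f"precision={p:.2f} recall={r:.2f}")
```

Read this way, high precision with low recall means the model's suggestions are usually right, but it finds only a small share of the places developers actually commented.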