Publications

Profir-Petru Partachi, Santanu Dash, Christoph Treude, Earl T. Barr (2020) POSIT: Simultaneously Tagging Natural and Programming Languages, In: (ICSE) International Conference on Software Engineering

Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and Developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems — traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to apply to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens, part-of-speech tags for natural language words, and predicts the source language of a token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.

Constantin Cezar Petrescu, Sam Smith, Rafail Giavrimis, Santanu Kumar Dash (2023) Do names echo semantics? A large-scale study of identifiers used in C++’s named casts, In: Journal of Systems and Software 202, 111693, Elsevier

Developers relax restrictions on a type to reuse methods with other types. While type casts are prevalent, in weakly typed languages such as C++ they are also extremely permissive. Assignments where a source expression is cast into a new type and assigned to a target variable of the new type can lead to software bugs if performed without care. In this paper, we propose an information-theoretic approach to identify poor implementations of explicit cast operations. Our approach measures accord between the source expression and the target variable using conditional entropy. We collect casts from 34 components of the Chromium project, which collectively account for 27 MLOC, and sample this dataset uniformly at random to create a manually labelled dataset of 271 casts. Information-theoretic vetting of these 271 casts achieves a peak precision of 81% and a recall of 90%. We additionally present the findings of an in-depth investigation of notable explicit casts, two of which were fixed in recent releases of the Chromium project.

• Information-theoretic approach to identify poor implementations of named casts.
• Detecting poor naming choices for identifiers used in a cast operation.
• Measuring accord between source and target identifiers using conditional entropy.
• In-depth investigation of the use of C++ explicit cast operators from Chromium.
• Provide open-source implementation and dataset of 271 manually labelled casts.
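As a rough illustration of the measure named in this abstract, the sketch below estimates the conditional entropy H(target | source) from a list of (source, target) identifier pairs. It is a textbook estimate, not the paper's implementation: the abstract does not say how identifiers are tokenised or how the entropy is thresholded, and the function name and the sample identifiers here are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(target | source) in bits from (source, target) pairs.

    Low values mean the target name is largely determined by the source
    name; high values mean the source says little about the target.
    """
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)                        # counts of (source, target)
    source_counts = Counter(s for s, _ in pairs)  # counts of source alone

    entropy = 0.0
    for (source, _target), count in joint.items():
        p_joint = count / total                        # p(source, target)
        p_target_given_source = count / source_counts[source]
        entropy -= p_joint * math.log2(p_target_given_source)
    return entropy


# Hypothetical identifier pairs taken from imagined cast assignments.
consistent = [("width", "width"), ("height", "height"), ("width", "width")]
ambiguous = [("value", "count"), ("value", "offset")]

print(conditional_entropy(consistent))
print(conditional_entropy(ambiguous))
```

In this toy setting, a source name that always maps to the same target name contributes no uncertainty, while a source name that maps to several unrelated target names raises the entropy. How such a score would be turned into a decision about a particular cast is left open here.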

Casey Casalnuovo, Earl T. Barr, Santanu Kumar Dash, Prem Devanbu (2020) A Theory of Dual Channel Constraints, In: Proceedings of the 42nd International Conference on Software Engineering (New Ideas and Emerging Results) (ICSE NIER 2020) Association for Computing Machinery (ACM)

The surprising predictability of source code has triggered a boom in tools using language models for code. Code is much more predictable than natural language, but the reasons are not well understood. We propose a dual channel view of code; code combines a formal channel for specifying execution and a natural language channel in the form of identifiers and comments that assists human comprehension. Computers ignore the natural language channel, but developers read both and, when writing code for long-term use and maintenance, consider each channel’s audience: computer and human. As developers hold both channels in mind when coding, we posit that the two channels interact and constrain each other; we call these dual channel constraints. Their impact has been neglected. We describe how they can lead to humans writing code in a way more predictable than natural language, highlight pioneering research that has implicitly or explicitly used parts of this theory, and drive new research, such as systematically searching for cross-channel inconsistencies. Dual channel constraints provide an exciting opportunity as truly multi-disciplinary research; for computer scientists they promise improvements to program analysis via a more holistic approach to code, and to psycholinguists they promise a novel environment for studying linguistic processes.

Annie Louis, Santanu Kumar Dash, Earl T. Barr, Michael D. Ernst, Charles Sutton (2020) Where should I comment my code? A dataset and model for predicting locations that need comments, In: Proceedings of the 42nd International Conference on Software Engineering (New Ideas and Emerging Results) (ICSE NIER 2020) Association for Computing Machinery (ACM)

Programmers should write code comments, but not on every line of code. We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing commented code to learn locations that are chosen by developers. Once trained, the model can predict locations in new code. Our models achieved precision of 74% and recall of 13% in identifying comment-worthy locations. This first success opens the door to future work, both in the new where-to-comment problem and in guiding comment generation. Our code and data are available at http://groups.inf.ed.ac.uk/cup/comment-locator/.