Commit messages are a valuable resource for understanding software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high-level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically "translate" diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k GitHub projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead.
Automatically generating commit messages from diffs using neural machine translation
Siyuan Jiang, A. Armaly, Collin McMillan
Published 2017 in International Conference on Automated Software Engineering
PUBLICATION RECORD
- Publication year
2017
- Venue
International Conference on Automated Software Engineering
- Publication date
2017-08-30
- Fields of study
Computer Science
- Source metadata
Semantic Scholar
CONCEPTS
- commit message quality
The level of usefulness and appropriateness of a commit message for summarizing a code change.
- commit messages
Short textual summaries attached to commits in version control systems.
- corpus
The collection of training examples assembled for the model.
- diffs
Source-code change descriptions that capture edits between repository versions.
- filter
A selection step that removes lower-quality training examples before model training.
- higher-quality commit messages
Commit messages judged suitable for inclusion in training because they better match the desired message quality.
- human-written commit messages
Commit summaries authored by programmers and used as supervision data.
- neural machine translation
A sequence-to-sequence translation approach used to map one textual representation to another.
Aliases: NMT
- quality-assurance filter
A postprocessing check that assesses whether generated commit messages are likely to be acceptable.
Aliases: QA filter
- top 1k GitHub projects
The set of the 1,000 most popular GitHub projects used as the source of training data.
- warning
A returned notice indicating that the system should not output an autogenerated commit message.
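SKETCH
The two filters described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: this record does not specify how either filter works, so the leading-verb heuristic, the confidence threshold, and the names LEADING_VERBS, keep_for_training, filter_corpus, and generate_or_warn are all hypothetical. The translate argument stands in for a trained NMT model that is assumed to return a message together with a confidence score.

```python
from typing import Callable, List, Tuple

# Hypothetical verb list: the source does not say how its training filter
# decides which commit messages are higher quality.
LEADING_VERBS = {"add", "fix", "remove", "update", "use", "refactor"}

WARNING = "WARNING: no reliable commit message could be generated for this diff."


def keep_for_training(message: str, max_words: int = 30) -> bool:
    """Illustrative training filter: keep short messages that start with a known verb."""
    words = message.strip().lower().split()
    return 0 < len(words) <= max_words and words[0] in LEADING_VERBS


def filter_corpus(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop (diff, message) pairs whose message fails the training filter."""
    return [(diff, msg) for diff, msg in pairs if keep_for_training(msg)]


def generate_or_warn(
    diff: str,
    translate: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.5,  # assumed threshold, not taken from the source
) -> str:
    """Illustrative quality-assurance step: return the generated message,
    or a warning when the (assumed) confidence score is too low."""
    message, confidence = translate(diff)
    return message if confidence >= min_confidence else WARNING
```

In this sketch the training filter runs once over the corpus before training, while the quality-assurance step wraps every call to the trained model, so a low-confidence generation yields the warning rather than a poor message.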