Code completion is a widely used tool that removes the element of repeatedly typing the same identifiers when writing code. With the use of different learning and ranking algorithms a code completion system constructs a list of possible completions given the first few characters of the identifier being written. This post presents an approach to improve the accuracy of the ranked candidate lists using historical data of the program being written. Different features are explored and tested in combination with the Perceptron and Logistic Regression Gradient Descent learning algorithms to find what bares the highest significance when predicting the identifier being typed. The experiments aim to prove that classification algorithms modified for rank calculation based on historical data could potentially improve code-completion systems for scripting languages such as Python.
What are the benefits of auto completion?
Code completion is a widely used feature of modern IDEs. It is a tool that provides a list of suggested completions after typing in the first few characters of a word or in this case and identifiers. Its benefits are self-explanatory as it greatly reduces the number of actions needed to complete a line of code. Furthermore, code completion could help developers find the exact identifier that they need. For example, in a large software system one could easily forget the name chosen for a particular method, class or variable that needs to be used on the current line. With the suggestions the auto completion system provides, the developer can save time by not browsing through files to find the exact name of the identifier. Another benefit of code auto-completion is that it allows developers to write longer and more descriptive method names as they do not have to type the whole names when they use them. This provides the opportunity for someone new to start working on the system and rapidly explore the functionality of what has been written so far. Longer identifier names, generally, are a great way of making the code significantly more readable and understandable. Code auto-completion is also important for its ability to provide all implemented methods in different libraries of the language its being used on. For example, in Eclipse simply importing the
javax.swing.JPanel class and creating an object
p of the type
JPanel would provide all the methods I the JPanel class by typing a dot after the object name. Furthermore, almost all of the modern IDEs have this feature and many of them have already dealt with the problem of auto-completion in languages like Java.
Code auto-completion for Python
In Python, which is extensively used as a scripting language, it is significantly harder to implement auto-completion. As it is not necessary to give a class name before using any of the variables and the fact that one can write multiple class definitions in one file, the task of code auto-completion becomes fairly complicated. IDEs use the class name and the dot(.) symbol as a hint to where to look for a particular identifier. But if that hint was removed and the file or system being worked on is over 10 thousand lines of code consisting of thousands of identifiers it would be unreasonable to simply give an alphabetical ordered list of possible completions given the first few characters. Therefore, it would be highly beneficial to implement a ranking algorithm that would provide a list of solutions with the ones that it deems most significant to be on top.
Eclipse is, probably, the most widespread IDE alongside with editors such as Vim, Notepad++, Emacs and Sublime. Eclipse with the PyDev plug in and Sublime work very well for built in functions on Python. However, they do not provide auto-completion for variable names, defined methods and classes. It is an area that requires research and significant work and development. Modern IDEs do have several very helpful features for python developers. Apart from suggesting built-in functions and library methods, Eclipse for example inserts the self parameter when a method is defined inside a class and keeps track of indentation. However, we consider and explore a few ways of implementing a ranking algorithm for auto-completion of variable, method and class names written on an empty line without the hint of the library, class or file they are defined in.
Python is a widely used programming language and exploring the possibility of a fully functional code completion system based on a well-defined ranking algorithm could potentially increase the productivity of Python developers.
The main issues that challenge a successful implementation of an auto-completion algorithm for the Python language are its specific syntax and usage. Python is both used as a scripting language and object-oriented language and does not require type definitions beyond higher level types such as classes or methods. This fact alone limits a syntax-based auto-completion. As mentioned before, most auto-completion systems use class definitions or type definitions to narrow down the suggestions. However, Python does not require such things. Everything can be written on a new line. Furthermore, in most object-oriented languages, methods require type declarations for their input identifiers. Again, Python does not require type declarations even for methods identifiers. As we can see, only the lack of a requirement for type casting and definition when initialising a variable seriously challenges auto-completion. The issue becomes further complicated form the fact that there could be numerous class definitions in one file. This raises the issue that the completion algorithm cannot differentiate between local, instance and global variables and, therefore, cannot reduce the length of the list of possible completions thus making the need for a ranking algorithm that considers all identifiers-obvious.
With this in mind, in order to implement a good auto-completion ranking al- gorithm, it should not depend mainly on syntax or structure-specific features of previously written code, but on patterns and more high-level characteristics of the identifiers. Characteristics like: frequency of use, type, etc. and anything else that was motivated by the extensive research carried on the topic of auto- completion and inferred through reasoning over the issue (both explained in later chapters). Furthermore, auto-completion should not be based simply on the extraction of the identifiers but on an algorithm that captures the aforementioned patterns and ranks the identifiers in order to narrow down the list of possible suggestions.
Related work in autocompletion
There has been a fairly limited research on code auto-completion, especially with the Python language. Most research papers are mainly based on strictly object oriented languages (such as Java) and tend to avoid the topic of scripting languages. However, there has been a lot of research on successfully implemented algorithms for linguistic completion of words and sentences simply because of the more widespread use of this in modern technology: phones (texting), search engines (Google autocomplete and suggestions), etc.
A related paper, even though written after a research on Java, tests an algorithm based on binary feature vectors that are believed to be of even significance for every language. The mentioned features are frequency and class membership. They test a k-NN algorithm with binary vectors for pattern matching and with the help of a good distance measure are able to generate a list of possible completions. Their association rule based code completion system produces recall of 65% and precision 85% and the frequency based code completion system produces 45% recall and 34% precision. This shows that there is, in fact, a pattern in coding languages that somewhat resembles the structure of sentences in English. Therefore, there exists a combination of attributes and the right algorithm to recognize these patterns. Even though, Python and Java have several core differences in term identifier definitions, implemented functions, syntax and structure, they do share the common object-oriented programming concepts and are used in a similar fashion.
The KNN algorithm fits remarkably well to the problem of code completion: when a developer wants to complete code at the usage pattern level, she might start to search for code fragments very similar to the already written ones (e.g. using Google Code) and takes inspiration from the rest of the code.
The concept of classification learning algorithms working well for ranking fits perfectly in this particular scenario. Python, as any other programming language, has its own unique structure and patters that are being followed no matter the essence of the project being written. Instead of looking for more general patterns throughout the code of a software system using a clustering algorithm (such as the k-Nearest-Neighbours), the problem of auto-completion could be looked at from the perspective of the unique identifiers. Instead of looking for a more global pattern in the previously written code, specific characteristics of every single one of the identifiers and their usage throughout could suggest the likelihood of this identifier being used again given the current state. The idea is that a learning algorithm would grasp these patterns for an identifier x given the set of all identifier that share the same initial few characters and their respective feature vectors, return an accurate objective function to calculate the rank (or probability) and suggest the correct identifier for completion. This leads to the core concept of the research: attributes and learning algorithms.
Research shows that there are two classification algorithms that are often used for ranking: Perceptron and Logistic Regression. The Perceptron has been successfully used to ranking and re-ranking in natural language processing and regression, in general, has been often used for ranking in data mining problems. The general concept here is that classification and ranking share the same general idea of a function to calculate likelihood as both look for patterns and similarity of every datum in the dataset. Following this logic combined with the results of similar experiments suggests that this hypothesis of the right feature vectors and learning algorithms should produce good results for the problem of code completion.
Several factors of the data have been mentioned that greatly influence the performance of auto-completion algorithms. The most important one is frequency in both code auto-completion and English words auto-completion. An example of this is the OpenOffice Writer which provides suggestions to complete the word currently being typed only based on words written so far. This, and the fact that a similar implementation and use of frequency and learning algorithms has been used for successful ranking in natural language processing, leads to the conclusion that there is common ground between code completion and human word completion with pattern matching. In word completion algorithms perform well due to the extensive datasets of semantic information and structure of words and sentences. In code completion there is not such previously defined structure, there is not semantic content (POS tags, Phrase structures, etc.). However, by examining the previously written code on the same system auto-completion is to be implemented on, extracting specific features (such as frequency) and checking for repeated patterns of identifiers, should produce similar results to syntactic work completion.
Another paper presents the results of testing a prefix-search auto-completion algorithm based on previously seen identifiers. Even though it is mostly oriented towards on-line auto-completion and inserting newly added identifiers in the preexisting dictionary, the results of testing accuracy of the algorithm suggest that prefix-search reduces the runtime of the algorithm and improves the accuracy as a big part of the training set is disregarded with the addition of more characters to the prefix. The results, from using an algorithm that is only based on prefixes and does not consider anything else, are a 16% accuracy in retrieving the correct identifier given the first 2-3 characters. Considering what class the identifiers has been defined and used in improves the performance significantly increasing the accuracy to 60-70% on the first 3 characters. However, as this algorithm is tested on JAVA, the results may not be that impressive when the algorithm is implemented on Python. But this particular paper motivates the use of other characteristics of the identifier: class membership, recently defined or used identifiers, frequency (again) and time and place of definition and initialization. The final version of their algorithm produces an impressive 80% accuracy.
Second part of the learning algorithms research.