In this post we’re going to continue the exploration, begun in Part 1, of AI algorithms for predicting the Python identifier that will be called next.
Overview
If we look at the problem of developing a ranking algorithm that provides accurate suggestions for code autocompletion in Python at a higher level, there are three main aspects that may influence the overall outcome: the data being worked with, the constraints and accuracy of the algorithm, and the means of testing.
Features
Choosing the right attributes of the dataset is at the core of this research. The combination of learning algorithms and the right attributes is what produces good results. Pattern matching and high-rank suggestions are based on attributes extracted from previously written code. The combination of different attributes makes up the final feature vectors used for training the algorithm. After considerable research into which attributes have the highest significance, the following proved the most promising:

Type of Identifier – the first attribute represents whether the identifier is a class name, a method name or a variable name. Python does not require type definitions, so looking for a type definition before the identifier currently being written is not possible. However, the type of an identifier is important for other reasons. If we consider a rather large piece of code, class names are used significantly less than method names and, especially, variable names. Therefore, assigning different weights to the different types of identifiers would reflect this in the final rank. The assumption here is that the learning algorithm will automatically grasp that, for example, a variable carries more significance than a class and will therefore assign a higher weight to it.
This particular feature is not noted or discussed in most of the previously researched papers. This is because most programming languages require the type of a variable, or of a method’s input, to be declared, so in most languages (such as Java, which is often used as an example) the type of the identifier is inferred from the context of the actual code. It is, though, an important property of the identifiers and should be considered.
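As a minimal sketch (the names and categories here are hypothetical illustrations, not taken from the actual implementation), the identifier type could be encoded as a one-hot binary feature so the learning algorithm can weight each type independently:

```python
# Hypothetical sketch: encode the identifier type as a one-hot binary
# sub-vector, since Python has no explicit type declarations to read.
IDENTIFIER_TYPES = ["class", "method", "variable"]

def encode_type(identifier_type):
    """Return a one-hot vector, e.g. 'variable' -> [0, 0, 1]."""
    return [1 if t == identifier_type else 0 for t in IDENTIFIER_TYPES]

print(encode_type("variable"))  # [0, 0, 1]
print(encode_type("class"))     # [1, 0, 0]
```

With this encoding the algorithm is free to learn, for example, a larger weight on the `variable` dimension than on the `class` dimension.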

Frequency – frequency is probably the most important feature in code autocompletion. It is one of the base features in all ranking and classification algorithms, and also a core feature in the Google and Eclipse autocompletion algorithms. Brief research on the importance of frequency in classification and ranking algorithms reveals that it is indeed a core feature of almost every algorithm.
Frequency is often used for autocompletion on its own. As mentioned before, a good example of this is the OpenOffice Writer. It suggests only words that have been seen before, and the more they have been used, the more likely they are to come up as suggestions before others. As the results in the Testing section show, frequency on its own performs well in comparison with combinations of different attributes and learning algorithms.
However, it is not a reliable attribute to base the whole task of autocompletion on, as its results are not consistent. Its mean accuracy is around 59%, which is reasonably good considering all the uncertainty and issues related to implementing code completion for Python. The variance, as the tests explained later show, ranges around 10% depending on which part of the dataset was used for training and how the files are arranged. This supports the claim that frequency is rather unreliable. The issues with frequency are further discussed and analysed in the Testing and Discussion section.
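A frequency-only baseline like the one described can be sketched in a few lines (this is an illustrative toy, not the code used for the reported 59% result): count every identifier seen so far and suggest the most frequent ones matching the typed prefix.

```python
from collections import Counter

def frequency_rank(seen_identifiers, prefix=""):
    """Rank identifiers purely by how often they have appeared,
    optionally filtered by the prefix currently being typed."""
    counts = Counter(seen_identifiers)
    candidates = [(ident, n) for ident, n in counts.items()
                  if ident.startswith(prefix)]
    # Most frequent first; ties broken alphabetically for stability.
    return [ident for ident, _ in
            sorted(candidates, key=lambda p: (-p[1], p[0]))]

history = ["foo", "bar", "foo", "baz", "foo", "bar"]
print(frequency_rank(history))        # ['foo', 'bar', 'baz']
print(frequency_rank(history, "ba"))  # ['bar', 'baz']
```

The baseline’s weakness is visible even here: the ranking depends entirely on which files happened to be counted, which is exactly the inconsistency discussed above.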

Position of Repetitions – this is a very important factor for the overall performance of the ranking algorithms. Its significance comes from the fact that no matter how many times an identifier has been used, if all the repetitions are contained within, say, 50 lines and we are working on a system of 50 thousand lines of code, then the significance of the frequency is negated: this particular identifier may be a local variable of a certain class that is never used again.
Consider writing a script: if we are writing line 1000 and the variable foo is defined on line 10, used on lines 11, 15 and 22 and never again afterwards, this means the variable is no longer in use and should receive a lower rank. On the contrary, if the variable has just been defined and used twice in the past 3 lines of code, then the chance of it being needed again soon is higher and, therefore, its rank should be higher.
However, there is a rather big and difficult problem related to this particular feature. Most systems consist of multiple files, and information about when something was written cannot always be obtained. In this case, the training step of the algorithms will depend mostly on the pure chance of getting the order of the files right. In such cases, where information about the complete time frame is not available, this feature can be used with certain constraints (e.g. only in the file currently being written).
Imposing such constraints raises other issues. For example, if only the current file is considered for position of repetitions, what should the value be for identifiers not mentioned in this particular file? Another issue is that the parsing of the files would have to be changed to consider multiple files and store more information about each identifier, which would result in higher training times. Including this feature fell outside the scope of this research, because the time allotted to the project was spent on features with better prospects of influencing the overall performance.
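The intuition behind position of repetitions can be sketched as a simple recency score (the linear decay and the 50-line window are illustrative assumptions, not values from the actual project): an identifier last seen on the current line scores 1.0, and the score fades to 0.0 once it is further away than the window.

```python
def recency_score(last_seen_line, current_line, window=50):
    """1.0 if the identifier appeared on the current line, decaying
    linearly to 0.0 once it is more than `window` lines away."""
    distance = current_line - last_seen_line
    return max(0.0, 1.0 - distance / window)

# foo last used on line 22 while we write line 1000: score 0.0.
print(recency_score(22, 1000))   # 0.0
# A variable used 3 lines ago still scores highly (roughly 0.94).
print(recency_score(997, 1000))
```

This makes explicit why frequency alone is negated: a high count from lines 10–22 contributes nothing to the score at line 1000.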

Previous Identifier – this is another feature that ranks amongst those with the highest significance for the good performance of the learning algorithms. Pattern matching plays a major role in autocompletion systems, so it is wise to consider the few previously used identifiers.
However, the issue with this particular attribute is that the computation time grows exponentially with every additional previous identifier taken into account. Furthermore, the data that has to be stored to check for patterns involving several previous identifiers also grows with the size of the project. This is an important attribute of the data, and in order to make use of it without sacrificing processing time, only the single previous identifier is used for testing.
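For n = 1 this reduces to tracking which identifier has followed which, i.e. a bigram table. A minimal sketch (the identifier names are illustrative, and the real implementation builds this during training, as described later):

```python
from collections import defaultdict

# Bigram table: for each identifier, the set of identifiers ever
# seen immediately after it.
bigrams = defaultdict(set)

def observe(sequence):
    """Record each adjacent pair of identifiers from a code stream."""
    for prev, curr in zip(sequence, sequence[1:]):
        bigrams[prev].add(curr)

def previous_identifier_feature(prev, candidate):
    """Binary feature: has `candidate` ever been seen right after `prev`?"""
    return 1 if candidate in bigrams[prev] else 0

observe(["dog", "bone", "dog", "bark"])
print(previous_identifier_feature("dog", "bone"))  # 1
print(previous_identifier_feature("cat", "bone"))  # 0
```

Extending this to n previous identifiers would mean keying the table on n-tuples, which is exactly where the storage and computation growth described above comes from.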

Number of Classes Used In – for all methods and variables, checking whether they have been used in more than one class indicates how globally they are used throughout the system. Therefore, if the number of classes a certain identifier is used in is high, the likelihood of this identifier being used throughout the whole project is also high.
Feature Representation: Binary vectors vs. real-valued vectors
Two methods of representing the attributes of the feature vectors were considered: binary vectors and real-valued vectors. Both feature vector types are a numerical representation of the data and have their own benefits and drawbacks.
Choosing the best way of representing the data depends on the data itself and the features to be considered. The features used for this particular experiment come in both binary and regular numerical values.

Real-valued vectors – one way of representing the attributes in the feature vectors is to keep the actual values. With this representation, it is important to consider how to scale the data, as we have both binary and non-binary values.
Big differences in the data values, however, may result in unexpected or inconsistent behaviour of the algorithms. In this particular case, the feature with the highest values is frequency. Frequency is the most important of the features considered, but if it is kept in its raw values it will significantly overpower the rest of the features. The tests, explained later in the ‘Testing’ section, show that frequency is not necessarily the key factor for accurate autocompletion.
Therefore, if frequency is kept in its raw values, its weight will increase in proportion to the size of the dataset. Considering this, the fact that the same holds for every other non-binary attribute, and the fact that the whole feature vector is a mixture of binary and non-binary attributes, real-valued vectors are not the best way to represent the dataset and its characteristics.

Binary representation – this is the method of representation chosen for testing the performance of the algorithms. Considering that some of the attributes are already binary ‘yes/no’ values, it is the most natural way of representing the data. Furthermore, the algorithms used (Perceptron and Logistic Regression, both explained in a later chapter) perform best with binary feature vectors.
As for the attributes with non-binary values, a few modifications were required in order for them to be represented in binary:
Frequency – to represent frequency as a binary vector, instead of 1 feature the attribute is split into a 3-dimensional vector. Each dimension of the vector shows whether the number of repetitions falls within a certain range. For example, if the frequency of the word dog is 10 and the thresholds for the three dimensions are less than 5, less than 15 and more than 15, then dog falls into the second category.
However, using this representation, if the frequency of dog is 3, then it satisfies both the first and the second constraints and the vector can be constructed in three different ways: (1,1,0), (1,0,0) or (0,1,0). Researching this particular issue showed that there is no definite answer and that testing would be the only way of determining which representation to use for this purpose.
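One resolution of the overlap problem, sketched below with the example thresholds from above (the choice of mutually exclusive buckets is an assumption; the text notes that only testing settles which variant to use), is to make the buckets disjoint so each count maps to exactly one dimension:

```python
def binarize_frequency(freq, thresholds=(5, 15)):
    """Map a raw count onto a 3-dimensional binary vector using
    mutually exclusive buckets: below 5, from 5 to 15, above 15."""
    low, high = thresholds
    if freq < low:
        return [1, 0, 0]
    if freq <= high:
        return [0, 1, 0]
    return [0, 0, 1]

print(binarize_frequency(10))  # [0, 1, 0] -- dog with frequency 10
print(binarize_frequency(3))   # [1, 0, 0] -- no ambiguity remains
```

With disjoint buckets, a frequency of 3 always yields (1,0,0), removing the three-way ambiguity at the cost of fixing one interpretation in advance.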
Position of repetitions – the representation of this attribute is similar to frequency, and so are the issues related to it. Since the purpose of the attribute is to check whether an identifier is contained within a particular range of code lines, a multidimensional vector representation is needed. However, the issue here is again determining the appropriate size of the range. One way of overcoming this is to set the range dynamically depending on the size of the project: for example, if the combined length of all the files is thirty thousand lines, we could simply use 1% of it. But as the project grows at a constant rate, the size of the range of lines grows with it.
This issue, combined with the uncertainty about when each file was written, makes this particular attribute, theoretically, not worth exploring, as the margin for error grows with the size of the project.
Previous identifier – depending on the n previous identifiers checked, there could be n dimensions to represent them. However, for optimal runtime, n = 1 is used when testing the accuracy of the proposed algorithms. What this attribute actually checks is whether the identifier currently being considered for autocompletion has been used with the same previous identifier before.
For example, if the identifier bone has been seen after dog, then if the previous identifier we wrote was dog, the chance that bone is the one we are currently writing increases. The previous identifier proved to be the trickiest feature to implement. The implementation is done during the actual training step of the algorithms. The problem is that one has to know the previous identifier in order to finish constructing the feature vectors for the training instances, and the previous identifier changes constantly. The way this feature is included in the feature vectors is explained thoroughly in the ‘Implementation’ section.
Number of classes used in – this attribute has exactly the same issues as frequency. Again, the solution can only come from extensive testing on datasets of various sizes.
Learning algorithms
There are several Python libraries that provide ready-made implementations of both the Perceptron and Logistic Regression with Gradient Descent. However, the original function and purpose of these algorithms is not in the scope of this research. Instead, they are used for rank calculation, and several key changes need to be made for the algorithms to fit this purpose. This is why both learning algorithms were written from scratch. The reason for choosing these two particular algorithms lies mainly in the research done on previous successful uses of them for ranking. As explained in the ‘Related Work’ chapter, both the Perceptron and Logistic Regression have produced satisfying results when used in implementation scenarios similar to the one examined in this research.
The Perceptron
The Perceptron is a linear classifier for supervised learning. The algorithm takes as input a set of n vectors and learns a linear threshold function. After the weights are trained, the algorithm calculates a weighted sum of the attributes in a feature vector and compares the sum to a predefined threshold to perform binary classification. The training of the Perceptron is guaranteed to converge on a linearly separable dataset; proofs of this are widely available.
Several modifications had to be applied to the original version of the Perceptron algorithm. The most important is that the output of the algorithm is not used for classification. Instead, the weighted sum is calculated to represent a rank of how likely a particular identifier is to be the one currently being typed. As this is an unusual use of the Perceptron, and the existing research on the topic is not sufficient to base important implementation decisions on, extensive testing was needed to make the algorithm fit this purpose.
The first important modification comes from the fact that, in most cases, data about the semantic structure, frequency, etc. extracted from a large dataset will not be linearly separable. Therefore, a constraint is needed to make the algorithm terminate. A reasonable way of settling on a set of weights is to run the algorithm for a predefined number of iterations decided through testing. A few different solutions were considered to make the Perceptron algorithm converge:
A fixed number of iterations – this is the most commonly used technique for making the Perceptron converge on a non-linearly separable dataset. The problem is finding the optimal number of iterations the algorithm needs to reach its best performance. Let us examine the following figure.
It shows a test over 3 separate datasets of different lengths, between ten and thirty thousand lines. This test says nothing about how the accuracy is achieved; it only shows at how many iterations the algorithm stops improving its objective function. The three curves show the performance of the Perceptron over each of the datasets, the x-axis is the number of iterations and the y-axis is the actual accuracy. It becomes obvious that the algorithm ‘converges’ somewhere between 7 and 15 iterations, as the accuracy increases significantly before this point and only marginally (if at all) afterwards.

A threshold for the error count – if this technique were used, there would be no way of knowing what a reasonable error count to stop the algorithm at would be. This measure is very dependent on the size of the dataset and would differ for every dataset, because of the different number of unique identifiers in each.
Stop if the number of errors remains the same for n iterations – after testing it becomes clear that, again, this constraint depends on the size and diversity of the data.
As we can see from the figure, the number of misclassified instances in the training set does decrease as the number of iterations increases. However, especially on sets 3 and 5, there are longer periods where the error count remains the same but afterwards decreases significantly. More specifically, on set 3, the error count did not change for around 10–15 iterations (between 5 and 20), but after that dropped by half. Examining the results of this particular test showed that the errors were not on the same instances but were moving from one to another, because many of the instances’ feature vectors are quite similar. This whole problem (the error count staying the same for a certain number of iterations and then improving) made this way of reaching a final set of weights infeasible. As a result of this test, however, another issue appeared: determining the optimal size of the learning rate step.
The learning rate of the algorithm is the key parameter that ensures fast and accurate definition of a decision boundary for the dataset. Even though the Perceptron is not used for its original purpose of classification, it is important to get a good set of weights that accurately represents a division in the dataset, in order to calculate an accurate rank. Research showed that the most reliable way of setting the learning rate is as a proportion of the number of training instances in the dataset (1/n). It is important to note again that, as this is not the general use of the algorithms and not a classification task, all the decisions made regarding the implementation and use of the algorithms were derived through extensive testing on different datasets and research on similar usage of the Perceptron.
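The modifications above can be sketched as follows (this is an illustrative toy, not the project’s actual implementation): a standard Perceptron update loop capped at a fixed number of iterations, a learning rate of 1/n, and the raw weighted sum exposed as the rank instead of a thresholded class label.

```python
def train_ranking_perceptron(instances, labels, n_iter=15):
    """Train Perceptron weights for a fixed number of iterations
    (no convergence guarantee on non-separable data).
    Labels are +1/-1; the learning rate is 1/n as discussed."""
    n, dim = len(instances), len(instances[0])
    rate = 1.0 / n
    w = [0.0] * dim
    for _ in range(n_iter):
        for x, y in zip(instances, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x))
            if y * activation <= 0:  # misclassified: nudge the weights
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
    return w

def rank(w, x):
    """The raw weighted sum, used directly as the candidate's rank."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data: the first feature indicates a "good" completion.
w = train_ranking_perceptron([[1, 0], [0, 1]], [1, -1])
print(rank(w, [1, 0]) > rank(w, [0, 1]))  # True
```

Candidates are then sorted by `rank` rather than compared to a threshold, which is the key departure from the classical algorithm.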
Logistic Regression with Gradient Descent
The Logistic Regression with Gradient Descent algorithm is very similar to the Perceptron. The core difference is the objective function being trained. Logistic regression trains a sigmoid function whose output can be interpreted as the probability that an instance is a member of a specific class. The initial idea was to use the Perceptron only. However, during the implementation many issues occurred, not with the actual code required, but with the assumptions that had to be made and the parameters that needed to be tested and fitted.
The idea behind learning algorithms is that they fit a mathematical function to the dataset. Generally, there is no strict way of saying that one algorithm is better than another, as it all depends on the data and how it is represented and used.
Here is where the general benefits of Logistic Regression fit the scenario ideally. Logistic Regression makes weaker assumptions about the data and, given all the uncertainty about how the features are related, suits this case perfectly. Another clear benefit, which the research on Logistic Regression points to, is that it can handle non-linear effects and relations. Even though it is a linear classification algorithm, it does not make the strong assumptions of a linear relation between the vectors and the outcome, of normally distributed data, or of equal variance in each group. This makes the algorithm very appropriate for the constraints of the autocompletion task.
As for the second part of the algorithm, Gradient Descent, it is an optimization algorithm. Its role is to update the function of the Logistic Regression in order to train the best weights. Gradient Descent aims to find a local minimum of the objective function trained by the algorithm.
The combination of Logistic Regression and Gradient Descent is a common one. It requires a bigger training set to train good weights, but in this case that is a general characteristic of the dataset and works to the algorithm’s benefit. The expected issue is that the algorithm would not perform as well on smaller datasets. However, the assumption here is that the quality of the features and their significance in representing the relations within the data should be sufficient even if the algorithm does not produce the best weights possible.
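A minimal sketch of the combination (illustrative only; the rate and iteration count are arbitrary assumptions, not the tuned values from this project): batch gradient descent on the negative log-likelihood, with the sigmoid output read as the probability of the positive class and used as the rank.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(instances, labels, rate=0.1, n_iter=100):
    """Batch gradient descent on the negative log-likelihood.
    Labels are 0/1; returns the trained weight vector."""
    dim = len(instances[0])
    w = [0.0] * dim
    for _ in range(n_iter):
        grad = [0.0] * dim
        for x, y in zip(instances, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(dim):
                grad[j] += (p - y) * x[j]  # gradient of the loss
        w = [wj - rate * gj for wj, gj in zip(w, grad)]
    return w

def probability_rank(w, x):
    """Sigmoid output, read as the probability-based rank."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Toy data: the first feature indicates a likely completion.
w = train_logistic_regression([[1, 0], [0, 1]], [1, 0])
print(probability_rank(w, [1, 0]) > probability_rank(w, [0, 1]))  # True
```

Unlike the Perceptron sketch, the rank here is bounded in (0, 1) and has a probabilistic reading, which is the property the section argues makes the combination attractive despite its appetite for larger training sets.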