Artificial Intelligence: What Python identifier will be called next? (Part 2)

In this post we continue exploring AI algorithms for predicting which Python identifier will be called next (see Part 1).

Overview

Looking at the problem of developing a ranking algorithm that provides accurate suggestions for Python code auto-completion at a higher level, several main aspects influence the overall outcome: the data being worked with, the constraints and accuracy of the algorithm, and the means of testing.

Features

Choosing the right attributes of the dataset is at the core of this research. The combination of learning algorithms and the right attributes is what produces good results. Pattern matching and highly ranked suggestions are based on attributes extracted from previously written code, and the combination of different attributes makes up the final feature vectors used for training the algorithm. After considerable research into which attributes carry the highest significance, the ones that proved most promising are listed below, followed by a short extraction sketch:

  • Type of Identifier – the first attribute represents whether the identifier is a class name, a method name or a variable name. Python does not require type definitions, so looking for a type definition before the identifier currently being written is not possible. However, the type of an identifier is important for other reasons. In a reasonably large piece of code, class names are used significantly less than method names and, especially, variable names. Assigning different weights to the different identifier types reflects this in the final rank. The assumption here is that the learning algorithm will automatically grasp that, for example, a variable carries more significance than a class and, therefore, assign a higher weight to it.

    This particular feature is not noted or discussed in most of the previously researched papers, because most programming languages require the types of variables and of method parameters to be declared. In most languages (such as Java, which is often used as an example) the type of an identifier is therefore inferred from the context of the code. It is, though, an important property of the identifiers and should be considered.

  • Frequency – frequency is probably the most important feature in code auto-completion. It is one of the base features in all ranking and classification algorithms and also a core feature in the Google and Eclipse auto-completion algorithms. Even brief research on the importance of frequency in classification and ranking algorithms reveals that it is indeed a core feature for almost every algorithm.

    Frequency is often used for auto-completion on its own. As mentioned before, a good example of this is the OpenOffice Writer: it suggests only words that have been seen before, and the more often they have been used, the more likely they are to come up as suggestions before others. As the results in the Testing section show, frequency on its own performs well in comparison with combinations of different attributes and learning algorithms.

    However, it is not a reliable attribute to base the whole task of auto-completion on, as its results are not consistent. Its mean accuracy is around 59%, which is a fairly good performance considering all the uncertainty and issues related to implementing code completion for Python. The variance, as the tests explained later show, ranges around 10% depending on which part of the dataset has been used for training and how the files are arranged. This supports the claim that frequency on its own is rather unreliable. The issues with frequency are further discussed and analysed in the Testing and Discussion section.

  • Position of Repetitions – this is a very important factor for the overall performance of the ranking algorithms for auto-completion. Its significance comes from the fact that, no matter how many times an identifier has been used, if all the repetitions are contained within, say, 50 lines of a system of 50 thousand lines of code, the significance of the frequency is negated: this particular identifier may simply be a local variable of a certain class that is never used again.

    Consider writing a script: if we are writing line number 1000 and the variable foo is defined on line 10, used on lines 11, 15 and 22 and never again afterwards, this variable is most likely no longer in use and should receive a lower rank. On the contrary, if a variable has just been defined and used twice in the past 3 lines of code, the chance of it being needed again soon is higher and its rank should therefore be higher.

    However, there is a rather big and difficult problem related to this particular feature. Most systems consist of multiple files, and information about when something was written cannot always be obtained. In that case the training step of the algorithms depends mostly on the pure chance of getting the order of the files right. Where information about the complete time frame is not available, this feature can only be used with certain constraints (e.g. only within the file currently being written).

    Imposing such constraints raises other issues. For example, if only the current file is considered for position of repetitions, what should the value be for identifiers not mentioned in this particular file? Another issue is that the parsing of the files would have to change to consider multiple files and store more information about each identifier, which results in higher training times. Including this feature fell out of the scope of this research, because the time set for the completion of the project was spent on features with better prospects of influencing the overall performance.

  • Previous Identifier – this is another feature that ranks amongst those with the highest significance for the performance of the learning algorithms. Pattern matching plays a major role in auto-completion systems, so it is wise to consider the past few previously used identifiers.

    However, the issue with this particular attribute is that the computation time grows exponentially with every additional previous identifier taken into account. Furthermore, the data required to check for patterns involving several previous identifiers also grows with the size of the project. This is an important attribute, and in order to make use of it without sacrificing processing time, only the single previous identifier is used for testing.

  • Number of classes in – for all methods and variables, checking whether they have been used in more than one class shows how globally they are used throughout the system. Therefore, if the number of classes a certain identifier is used in is high, the likelihood of this identifier being used throughout the whole project is also high.
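
As a rough illustration of how such attributes could be extracted from previously written code, here is a minimal sketch using Python's standard tokenize module; the helper name extract_attributes and the exact bookkeeping are assumptions for the example, not the implementation used in the research:

import io
import keyword
import tokenize
from collections import Counter, defaultdict

def extract_attributes(source):
    """Collect raw attributes per identifier (illustrative sketch only):
    frequency, previous-identifier pairs and the classes it appears in."""
    freq = Counter()                   # how often each identifier occurs
    prev_pairs = defaultdict(Counter)  # identifier -> counts of its predecessors
    classes_in = defaultdict(set)      # identifier -> class names it appears in
    prev_name = None
    current_class = None
    expect_class_name = False

    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok_type, tok_val, _, _, _ in tokens:
        if tok_type != tokenize.NAME:
            continue
        if tok_val == 'class':
            expect_class_name = True   # the next NAME token is a class name
            continue
        if keyword.iskeyword(tok_val):
            continue
        if expect_class_name:
            current_class = tok_val
            expect_class_name = False
        freq[tok_val] += 1
        if prev_name is not None:
            prev_pairs[tok_val][prev_name] += 1
        if current_class is not None:
            classes_in[tok_val].add(current_class)
        prev_name = tok_val
    return freq, prev_pairs, classes_in

The raw counts collected this way feed into the feature vectors discussed in the next section.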

Feature Representation: Binary vectors vs. real-valued vectors

Two methods of representing the attributes of the feature vectors were considered: binary vectors and real-valued feature vectors. Both feature vector types are a numerical representation of the data and have their own benefits and drawbacks.

Choosing the best way of representing the data depends on the data itself and the features that are to be considered. The features used for this particular experiment take both binary and regular numerical values.

  • Real-valued vectors – one way of representing the attributes in the feature vectors is keeping their actual values. For this representation it is important to consider how to scale the data, as we have both binary and non-binary values.

    Big differences in the data values, however, may result in unexpected or inconsistent behaviour of the algorithms. In this particular case, the feature with the highest values is frequency. Frequency is the most important of the features considered, but if it is kept in its raw values it significantly overpowers the rest of the features. The tests, explained later in the ‘Testing’ section, show that frequency is not necessarily the key factor for accurate auto-completion.

    Moreover, if frequency is kept in its raw values, its weight increases proportionally to the size of the dataset. Considering this, the fact that the same holds for every other non-binary attribute, and the fact that the whole feature vector is a mixture of binary and non-binary attributes, real-valued vectors are not the best way to represent the dataset and its characteristics.

  • Binary representation – this is the method of representation chosen for testing the performance of the algorithms. Given that some of the attributes are naturally binary (‘yes/no’), it is the most suitable way of representing the data. Furthermore, the algorithms that are used (Perceptron and Logistic Regression, both explained in a later chapter) perform best with binary feature vectors.

    As for the attributes with non-binary values, a few modifications were required in order for them to be represented in binary form:

    1. Frequency – to represent frequency in binary form, the attribute is split into a 3-dimensional vector instead of a single feature. Each dimension in the vector indicates whether the number of repetitions falls within a certain range. For example, if the frequency of the word dog is 10 and the thresholds for the dimensions are less than 5, less than 15 and more than 15, then dog falls in the second category (a sketch after this list illustrates the binning).

      However, using this representation, if the frequency of dog is 3, it satisfies both the first and the second constraint, and the vector can be constructed in three different ways: (1,1,0), (1,0,0) or (0,1,0). Researching this particular issue showed that there is no definite answer to this problem and that testing is the only way of determining which representation to use for this purpose.

    2. Position of repetitions – the representation of this attribute is similar to frequency, and so are the issues related to it. Given that the purpose of the attribute is to check whether an identifier is contained within a particular range of code lines, a multi-dimensional vector is needed. The issue, again, is determining the appropriate size of the range. One way of overcoming this is to set the range dynamically depending on the size of the project; for example, if the combined length of all the files is thirty thousand lines, we could simply use 1% of it. But as the project grows, the size of the range of lines grows with it.

      This issue, combined with the uncertainty of when each file was written, makes this particular attribute, theoretically, not worth exploring, as the margin for error grows with the size of the project.

    3. Previous identifier – depending on the n previous identifiers checked, there could be n dimensions to represent them. However, for the sake of runtime, n=1 is used for testing the accuracy of the proposed algorithms. What this attribute actually checks is whether the identifier currently being considered for a possible auto-completion has previously been used after the same preceding identifier.

      For example, if the identifier bone has been seen after dog, then if the previous identifier we wrote was dog, the chance of bone being the one we are currently writing increases. The previous identifier proved to be the trickiest feature to implement. It is computed during the actual training step of the algorithms: the problem is that one has to know the previous identifier in order to finish constructing the feature vectors for the training instances, and the previous identifier changes constantly. The way this feature was included in the feature vectors is explained thoroughly in the ‘Implementation’ section.

    4. Number of classes in – this attribute has exactly the same issues as frequency. Again, the solution can only come from extensive testing on datasets of various sizes.
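
To make the binning concrete, here is a minimal sketch of how a binary feature vector could be assembled from the raw attributes collected earlier; the thresholds and the helper name build_feature_vector are illustrative assumptions, not the exact values used in the experiments:

def build_feature_vector(identifier, prev_name, freq, prev_pairs, classes_in, identifier_types):
    """Illustrative sketch: one possible binary encoding of the features."""
    vector = []

    # Type of identifier: one-hot over (class, method, variable); identifier_types
    # is a hypothetical lookup table filled in during parsing.
    kind = identifier_types.get(identifier, 'variable')
    vector += [int(kind == 'class'), int(kind == 'method'), int(kind == 'variable')]

    # Frequency: one-hot binning with illustrative thresholds <5, 5-15, >15.
    count = freq[identifier]
    vector += [int(count < 5), int(5 <= count <= 15), int(count > 15)]

    # Previous identifier: has this identifier ever been seen right after prev_name?
    vector.append(int(prev_pairs[identifier][prev_name] > 0))

    # Number of classes used in: binned the same way as frequency.
    n_classes = len(classes_in[identifier])
    vector += [int(n_classes <= 1), int(1 < n_classes <= 3), int(n_classes > 3)]

    return vector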

Learning algorithms

There are several Python libraries that provide ready-made implementations of both the Perceptron and Logistic Regression Gradient Descent algorithms. However, the original function and purpose of these algorithms is not what this research needs: here they are used for rank calculation, and several key changes are required to make them fit this purpose. This is why both learning algorithms were written from scratch. The reason for choosing these two particular algorithms lies mainly in the research done on previous successful uses of them for ranking. As explained in the ‘Related Work’ chapter, both the Perceptron and Logistic Regression have produced satisfying results when used in implementation scenarios similar to the one examined in this research.

The Perceptron


The Perceptron is a linear classifier for supervised classification. The algorithm takes as input a set of n vectors and learns a linear threshold function. After the weights are trained, the algorithm calculates a weighted sum of the attributes in a feature vector and compares the sum to a predefined threshold to perform binary classification. The training of the Perceptron provably converges on a linearly separable dataset; proofs of this can be found online.

Several modifications had to be applied to the original version of the Perceptron algorithm. The most important one is that the output of the algorithm is not used for classification. Instead, the weighted sum is calculated to represent a rank of how likely it is for a particular identifier to be the one currently being typed. Because this is an unusual use of the Perceptron, and the existing research on the topic is not sufficient to base important implementation decisions on, extensive testing had to be performed to make the algorithm fit for this purpose.

The first important modification comes from the fact that, in most cases, data about semantic structure, frequency, etc., extracted from a large dataset will not be linearly separable. Therefore, a constraint is needed to make the algorithm terminate. A reasonable way of settling on a set of weights is running the algorithm for a predefined number of iterations decided on through testing. A few different solutions were considered in order to make the Perceptron algorithm converge:

  • A fixed number of iterations – this is the most commonly used technique for making the Perceptron terminate on a non-linearly separable dataset. The problem is finding the optimal number of iterations required for the algorithm to reach its best performance. Let us examine the following figure.

    It shows a test over 3 separate datasets of different lengths, between ten and thirty thousand lines. This test has nothing to do with how the accuracy is achieved – it only shows after how many iterations the algorithm stops improving its objective function and, in turn, its accuracy. The three curves show the performance of the Perceptron over each of the datasets; the x axis is the number of iterations and the y axis is the actual accuracy. It becomes obvious that the algorithm ‘converges’ somewhere between 7 and 15 iterations, as the accuracy increases significantly before this point and only marginally (if at all) afterwards.

  • A threshold for the error count – if this technique were to be used, there is no way of knowing what a reasonable error count to stop the algorithm would be. This measure depends heavily on the size of the dataset and would be different for every dataset, because of the different number of unique identifiers in each.
  • Stop if the number of errors remains the same for n iterations – after testing it becomes clear that, again, this constraint depends on the size of the dataset and on how diverse the data is.

    As we can see from the figure, the number of misclassified instances in the training set does decrease as the number of iterations increases. However, especially in sets 3 and 5, there are longer periods where the error count remains the same but afterwards decreases significantly. More specifically, on set 3 the error count did not change for around 10-15 iterations (between 5 and 20), but then dropped by half. Examining the results of this particular test showed that the errors were not on the same instances but kept shifting from one to another, because many of the instances’ feature vectors are quite similar. This whole problem – the error count staying the same for a number of iterations and then improving – made this way of reaching a final set of weights infeasible. As a result of this test, however, another issue appeared: determining the optimal size of the learning rate step.

The learning rate of the algorithm is the key parameter that ensures a fast and accurate definition of a decision boundary for the dataset. Even though the Perceptron is not used for its original purpose of classification, it is important to get a good set of weights that accurately represents a division in the dataset in order to calculate an accurate rank. Research showed that the most reliable way of setting the learning rate is as a proportion of the number of training instances in the dataset (1/n). It is important to note again that, as this is not the general use of the algorithm and not a classification task, all decisions regarding the implementation and use of the algorithms are derived through extensive testing on different datasets and research on similar usages of the Perceptron.
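
As a rough illustration of the modifications described above (a fixed number of iterations, a learning rate of 1/n, and the weighted sum reused as a rank rather than a class label), here is a minimal sketch; it is not the exact implementation used in the experiments:

def train_perceptron(instances, labels, iterations=15):
    """Train Perceptron weights on binary feature vectors.
    instances: list of binary feature vectors; labels: 1 or 0 (was this the
    identifier actually typed next?). Illustrative sketch only."""
    n = len(instances)
    dim = len(instances[0])
    weights = [0.0] * dim
    bias = 0.0
    learning_rate = 1.0 / n            # set as a proportion of the dataset size

    for _ in range(iterations):        # fixed iteration count instead of full convergence
        for x, y in zip(instances, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation >= 0 else 0
            error = y - predicted
            if error != 0:
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias

def rank(weights, bias, x):
    """The weighted sum itself is used as the rank of a candidate identifier."""
    return bias + sum(w * xi for w, xi in zip(weights, x))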

Logistic Regression with Gradient Descent

The Logistic Regression Gradient Descent algorithm is very similar to the Perceptron. The core difference is the objective function being trained: logistic regression trains a sigmoid function whose output can be considered the probability of an instance being a member of a specific class. The initial idea was to use the Perceptron only. However, during the implementation many issues occurred, not with the actual code but with the assumptions that had to be made and the parameters that needed to be tested and fitted.

The idea behind learning algorithms is that they fit a mathematical function to the dataset. Generally, there is no strict way of defining which algorithm is better than another, as it all depends on the data, how it is represented and what it is used for.

Here is where the general benefits of Logistic Regression seem to fit the scenario ideally. Logistic Regression makes weaker assumptions about the data, which, given all the uncertainty about how the features are related, suits this case perfectly. Another benefit suggested by the research on Logistic Regression is that it can handle non-linear effects and relations: even though it is a linear classification algorithm, it does not make the strong assumptions that the relation between the vectors and the outcome is linear, that the data is normally distributed, or that each group has equal variance. This makes the algorithm very appropriate for the constraints of the auto-completion task.

As for the second part of the algorithm, Gradient Descent, it is an optimization method. Its role is to update the weights of the Logistic Regression function in order to train the best ones. Gradient Descent aims to find a local minimum of the objective function trained by the algorithm.

The combination of Logistic Regression and Gradient Descent is a common one. It requires a bigger training set in order to train good weights, but in this case a large dataset is a general characteristic of the problem and works to the algorithm’s benefit. The expected issue is that the algorithm would not perform as well on smaller datasets. However, the assumption is that the quality of the features and their significance in representing the relations within the data should be sufficient even if the algorithm does not produce the best weights possible.
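
For completeness, here is a minimal sketch of logistic regression trained with batch gradient descent on the same binary feature vectors, where the sigmoid output is used as the rank; again, this is an illustration rather than the exact implementation, and the iteration count and learning rate are assumptions:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(instances, labels, iterations=100, learning_rate=0.1):
    """Batch gradient descent on the logistic (cross-entropy) loss.
    Illustrative sketch; the parameters are not the tested values."""
    n = len(instances)
    dim = len(instances[0])
    weights = [0.0] * dim
    bias = 0.0

    for _ in range(iterations):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(instances, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y                 # derivative of the loss w.r.t. the activation
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def probability_rank(weights, bias, x):
    """The sigmoid output is interpreted as the rank (probability) of a candidate."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))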

What is Shell Injection and how to avoid it?

It is a well-known fact that developers should be paranoid about security. In this post I am going to briefly describe what Shell Injection is and the general principles to avoid it.

Generally, shell injection can be considered a software bug in input data processing that allows the user to inject code and execute commands on the server machine.

Let’s consider the following example:

import subprocess
from StringIO import StringIO
from wsgiref.simple_server import make_server

def application(environ, start_response):
    path = environ.get('PATH_INFO')[1:]

    stdout = StringIO()
    out, err = subprocess.Popen(['fortune {} | cowsay'.format(path)], stdout=subprocess.PIPE, shell=True).communicate()
    for line in out.splitlines():
        print >> stdout, line
    start_response("200 OK", [('Content-Type','text/plain')])
    return [stdout.getvalue()]

if __name__ == '__main__':
    httpd = make_server('', 8000, application)
    print "Starting cow server on port 8000"
    httpd.serve_forever()

It’s a very simple web application that relies on the fortune and cowsay tools to generate wise quotes and display them to the user in a funny way.

 ______________________________________
/ Do you mean that you not only want a \
| wrong answer, but a certain wrong    |
| answer?                              |
|                                      |
\ -- Tobaben                           /
 --------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

The application accepts user input so that we can show different types of quotations. Suppose the user requests offensive quotations and navigates to the following URL: http://localhost:8000/off. The off parameter is passed to the command we are executing and the output looks something like this:

 ________________________________________
/ You have been bi*chy since Tuesday and \
\ you'll probably get fired today.       /
 ----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

But what’s the problem with the code above? We accept user input which is used to format the command without applying any filtering or normalization to it. A cracker can execute any command on the server and see its output: http://localhost:8000/ cat cowfortune.py #.

Everything should be made as simple as possible, but not simpler.
		-- Albert Einstein
 ________________________________________
/ import subprocess from StringIO import \
| StringIO from wsgiref.simple_server    |
| import make_server                     |
|                                        |
| def application(environ,               |
| start_response):                       |
|                                        |
| path = environ.get('PATH_INFO')[1:]    |
|                                        |
| stdout = StringIO()                    |
|                                        |
| out, err = subprocess.Popen(['fortune  |
| {} | cowsay'.format(path)],            |
| stdout=subprocess.PIPE,                |
| shell=True).communicate()              |
|                                        |
| for line in out.splitlines():          |
|                                        |
| print >> stdout, line                  |
|                                        |
| start_response("200 OK",               |
| [('Content-Type','text/plain')])       |
|                                        |
| return [stdout.getvalue()]             |
|                                        |
| if __name__ == '__main__':             |
|                                        |
| httpd = make_server('', 8000,          |
| application)                           |
|                                        |
| print "Starting cow server on port     |
| 8000"                                  |
|                                        |
\ httpd.serve_forever()                  /
 ----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

As you see, the “cow” answers absolutely everything we ask about, so the solution is to filter the questions.

import subprocess
from StringIO import StringIO
from wsgiref.simple_server import make_server

def application(environ, start_response):
    path = environ.get('PATH_INFO')[1:]
    is_valid = path in ['off', 'some', 'other', 'options']

    stdout = StringIO()
    if is_valid:
        out, err = subprocess.Popen(['fortune {} | cowsay'.format(path)], stdout=subprocess.PIPE, shell=True).communicate()
        for line in out.splitlines():
            print >> stdout, line
    start_response("200 OK", [('Content-Type','text/plain')])
    return [stdout.getvalue()]

if __name__ == '__main__':
    httpd = make_server('', 8000, application)
    print "Starting cow server on port 8000"
    httpd.serve_forever()
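
Whitelisting the input, as above, solves the problem for this application. An additional hardening step, not part of the original example, is to avoid shell=True entirely and build the pipeline from argument lists, so that user input can never be interpreted by a shell. A minimal sketch of that approach:

import subprocess

def cow_quote(category):
    # Validate the user-supplied category against a fixed whitelist first.
    if category not in ('off', 'some', 'other', 'options'):
        return ''
    # No shell is involved: each program gets its arguments as a list,
    # so metacharacters in user input are never interpreted.
    fortune = subprocess.Popen(['fortune', category], stdout=subprocess.PIPE)
    cowsay = subprocess.Popen(['cowsay'], stdin=fortune.stdout, stdout=subprocess.PIPE)
    fortune.stdout.close()  # let fortune receive SIGPIPE if cowsay exits early
    return cowsay.communicate()[0]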

What are the golden rules to avoid injection vulnerabilities?

  • Always verify user input.
  • Never use direct user input to format commands, SQL code or anything that executes.
  • Make sure you think about security on the go and don’t leave it for further refactoring.
  • Be paranoid.

Wireless Embedded Sensor Architecture

Emerging wireless embedded sensors combine sensing, computation, and communication in a single, tiny, resource-constrained device. Their power lies in the ability to deploy a large number of nodes into a physical environment that automatically configure into a distributed sensing platform. Usage scenarios for these devices range from real-time tracking, to monitoring of environmental conditions, to ubiquitous computing environments, to in situ monitoring of the health of structures or equipment. While often referred to as networked sensors, they can also be actuators that extend the reach of cyberspace into the physical world.

The core design challenge for wireless embedded sensors lies in coping with their harsh resource constraints. Embedded processors with kilobytes of memory are used to implement complex, distributed, ad-hoc networking protocols. These constraints derive from the vision that these devices will be produced in vast quantities, will be small and inexpensive, and ideally will operate off of ambient power. As Moore’s law marches on, these devices will get smaller, not just grow more powerful at a given size.

One of the most difficult resource constraints to meet in the context of wireless embedded sensors is power consumption. As physical size decreases, so does energy capacity. Underlying energy constraints end up creating computational and storage limitations. In these highly constrained devices, a new set of architectural issues has arisen. Many devices, such as cell phones and pagers, use specialized communication hardware in ASICs that provide ultra-low-power implementations of the necessary protocol processing. This is possible because they target a narrow set of well-defined applications. The complex tradeoffs between power consumption, bandwidth, and latency can be explored and evaluated with respect to specific performance goals and the optimal protocol can be implemented in hardware. However, the strength of networked embedded devices is their flexibility and universality. The wide range of applications being targeted makes it difficult to develop general-purpose protocols that are efficient for all applications.

Empybit’s Space Tour to Horizon 2020


We are proud to announce that we participated in the Horizon 2020 program for space research and innovative technologies. Alongside the traditional presentation sessions, this year Horizon 2020 really focused on face-to-face meetings with potential partners – companies from all over Europe doing research in different areas: hardware, sensor manufacturing, software development, earth observation and many others.

Empybit participated as a company offering various Information Technology solutions, ranging from hardware to low- and high-level software design. Our solutions were highly valued by other participants of the event. We managed to speak with very interesting people pioneering in the areas of low-orbit satellites, reusable rocket launchers and other fields of space research. Everybody is on the same page – the future of space technology highly depends on innovative ideas in hardware, embedded and high-level software.

Artificial Intelligence: What Python identifier will be called next? (Part 1)

Code completion is a widely used tool that removes the element of repeatedly typing the same identifiers when writing code. Using different learning and ranking algorithms, a code completion system constructs a list of possible completions given the first few characters of the identifier being written. This post presents an approach to improving the accuracy of the ranked candidate lists using historical data of the program being written. Different features are explored and tested in combination with the Perceptron and Logistic Regression Gradient Descent learning algorithms to find what bears the highest significance when predicting the identifier being typed. The experiments aim to show that classification algorithms modified for rank calculation based on historical data could potentially improve code-completion systems for scripting languages such as Python.

What are the benefits of auto completion?


Code completion is a widely used feature of modern IDEs. It is a tool that provides a list of suggested completions after typing the first few characters of a word or, in this case, an identifier. Its benefits are self-explanatory, as it greatly reduces the number of actions needed to complete a line of code. Furthermore, code completion can help developers find the exact identifier that they need. For example, in a large software system one could easily forget the name chosen for a particular method, class or variable that needs to be used on the current line. With the suggestions the auto-completion system provides, the developer saves the time of browsing through files to find the exact name of the identifier. Another benefit of code auto-completion is that it allows developers to write longer and more descriptive method names, as they do not have to type the whole names when they use them. This gives someone new the opportunity to start working on the system and rapidly explore the functionality of what has been written so far. Longer identifier names are, in general, a great way of making the code significantly more readable and understandable. Code auto-completion is also important for its ability to expose all implemented methods in the different libraries of the language it is being used on. For example, in Eclipse, simply importing the javax.swing.JPanel class and creating an object p of type JPanel provides all the methods in the JPanel class by typing a dot after the object name. Furthermore, almost all modern IDEs have this feature and many of them have already dealt with the problem of auto-completion in languages like Java.

Code auto-completion for Python

In Python, which is extensively used as a scripting language, it is significantly harder to implement auto-completion. As it is not necessary to give a class name before using any of the variables, and one can write multiple class definitions in one file, the task of code auto-completion becomes fairly complicated. IDEs use the class name and the dot (.) symbol as a hint of where to look for a particular identifier. If that hint is removed and the file or system being worked on is over 10 thousand lines of code consisting of thousands of identifiers, it would be unreasonable to simply give an alphabetically ordered list of possible completions given the first few characters. Therefore, it would be highly beneficial to implement a ranking algorithm that provides a list of suggestions with the ones it deems most significant on top.

Eclipse is probably the most widespread IDE, alongside editors such as Vim, Notepad++, Emacs and Sublime. Eclipse with the PyDev plug-in, and Sublime, work very well for built-in functions in Python. However, they do not provide auto-completion for variable names, defined methods and classes. It is an area that requires research and significant development. Modern IDEs do have several very helpful features for Python developers: apart from suggesting built-in functions and library methods, Eclipse, for example, inserts the self parameter when a method is defined inside a class and keeps track of indentation. Here, however, we consider and explore a few ways of implementing a ranking algorithm for auto-completion of variable, method and class names written on an empty line, without the hint of the library, class or file they are defined in.

Python is a widely used programming language and exploring the possibility of a fully functional code completion system based on a well-defined ranking algorithm could potentially increase the productivity of Python developers.

The main issues that challenge a successful implementation of an auto-completion algorithm for the Python language are its specific syntax and usage. Python is used both as a scripting language and as an object-oriented language, and does not require type definitions beyond higher-level constructs such as classes or methods. This fact alone limits syntax-based auto-completion. As mentioned before, most auto-completion systems use class definitions or type definitions to narrow down the suggestions; Python does not require such things, and everything can be written on a new line. Furthermore, in most object-oriented languages, methods require type declarations for their parameters. Again, Python does not require type declarations even for method parameters. The lack of a requirement for type declaration when initialising a variable alone seriously challenges auto-completion. The issue is further complicated by the fact that there can be numerous class definitions in one file. As a result, the completion algorithm cannot differentiate between local, instance and global variables and, therefore, cannot reduce the length of the list of possible completions, making the need for a ranking algorithm that considers all identifiers obvious.

With this in mind, a good auto-completion ranking algorithm should not depend mainly on syntax or structure-specific features of previously written code, but on patterns and more high-level characteristics of the identifiers: frequency of use, type, etc., and anything else motivated by the extensive research carried out on the topic of auto-completion and inferred through reasoning over the issue (both explained in later chapters). Furthermore, auto-completion should not be based simply on the extraction of the identifiers but on an algorithm that captures the aforementioned patterns and ranks the identifiers in order to narrow down the list of possible suggestions.

Related work in autocompletion

There has been fairly limited research on code auto-completion, especially for the Python language. Most research papers are based on strictly object-oriented languages (such as Java) and tend to avoid scripting languages. However, there has been a lot of research on successfully implemented algorithms for linguistic completion of words and sentences, simply because of their more widespread use in modern technology: phones (texting), search engines (Google autocomplete and suggestions), etc.

A related paper, even though based on research on Java, tests an algorithm built on binary feature vectors that are believed to be of equal significance for every language. The mentioned features are frequency and class membership. The authors test a k-NN algorithm with binary vectors for pattern matching and, with the help of a good distance measure, are able to generate a list of possible completions. Their association-rule-based code completion system produces a recall of 65% and precision of 85%, and their frequency-based code completion system produces 45% recall and 34% precision. This shows that there is, in fact, a pattern in programming languages that somewhat resembles the structure of sentences in English. Therefore, there exists a combination of attributes and the right algorithm to recognize these patterns. Even though Python and Java have several core differences in terms of identifier definitions, built-in functions, syntax and structure, they do share common object-oriented programming concepts and are used in a similar fashion.

The k-NN algorithm fits remarkably well to the problem of code completion: when a developer wants to complete code at the usage-pattern level, she might start to search for code fragments very similar to the already written ones (e.g. using Google Code) and take inspiration from the rest of the code.

The concept of classification learning algorithms working well for ranking fits perfectly in this particular scenario. Python, as any other programming language, has its own unique structure and patterns that are followed regardless of the essence of the project being written. Instead of looking for more general patterns throughout the code of a software system using a similarity-based algorithm (such as k-Nearest-Neighbours), the problem of auto-completion can be looked at from the perspective of the unique identifiers. Instead of looking for a more global pattern in the previously written code, specific characteristics of every single identifier and its usage throughout the code can suggest the likelihood of this identifier being used again given the current state. The idea is that a learning algorithm would grasp these patterns for an identifier x: given the set of all identifiers that share the same initial few characters and their respective feature vectors, it would return an accurate objective function to calculate the rank (or probability) and suggest the correct identifier for completion. This leads to the core concept of the research: attributes and learning algorithms.
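
As a rough illustration of this idea, candidate identifiers could be filtered by the typed prefix and then ordered by the learned ranking function; the helper name suggest_completions and its parameters are assumptions for the sketch, not part of the actual system:

def suggest_completions(prefix, identifiers, feature_vectors, score, top_n=5):
    """Return the top-n identifiers sharing `prefix`, ordered by the learned rank.
    `score` is any trained ranking function mapping a feature vector to a number,
    e.g. the weighted sum of a Perceptron or the sigmoid output of logistic regression."""
    candidates = [name for name in identifiers if name.startswith(prefix)]
    ranked = sorted(candidates, key=lambda name: score(feature_vectors[name]), reverse=True)
    return ranked[:top_n]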

Research shows that there are two classification algorithms that are often used for ranking: the Perceptron and Logistic Regression. The Perceptron has been successfully used for ranking and re-ranking in natural language processing, and regression in general has often been used for ranking in data mining problems. The general concept is that classification and ranking share the same underlying idea of a function that calculates likelihood, as both look for patterns and similarity across every datum in the dataset. Following this logic, combined with the results of similar experiments, suggests that the hypothesis of the right feature vectors and learning algorithms should produce good results for the problem of code completion.

Several characteristics of the data have been mentioned that greatly influence the performance of auto-completion algorithms. The most important one is frequency, in both code auto-completion and English word auto-completion. An example of this is the OpenOffice Writer, which provides suggestions to complete the word currently being typed based only on words written so far. This, and the fact that a similar use of frequency and learning algorithms has been applied to successful ranking in natural language processing, leads to the conclusion that there is common ground between code completion and human word completion with pattern matching. In word completion, algorithms perform well due to the extensive datasets of semantic information and of word and sentence structure. In code completion there is no such previously defined structure and no semantic content (POS tags, phrase structures, etc.). However, examining the previously written code of the system that auto-completion is to be implemented on, extracting specific features (such as frequency) and checking for repeated patterns of identifiers should produce results similar to those of syntactic word completion.

Another paper presents the results of testing a prefix-search auto-completion algorithm based on previously seen identifiers. Even though it is mostly oriented towards on-line auto-completion and inserting newly added identifiers into the pre-existing dictionary, the results of testing the accuracy of the algorithm suggest that prefix search reduces the runtime and improves the accuracy, as a big part of the training set is disregarded with the addition of more characters to the prefix. The results from using an algorithm that is based only on prefixes and does not consider anything else are 16% accuracy in retrieving the correct identifier given the first 2-3 characters. Considering which class the identifier has been defined and used in improves the performance significantly, increasing the accuracy to 60-70% on the first 3 characters. However, as this algorithm was tested on Java, the results may not be as impressive when the algorithm is implemented for Python. This particular paper nonetheless motivates the use of other characteristics of the identifier: class membership, recently defined or used identifiers, frequency (again), and time and place of definition and initialization. The final version of their algorithm produces an impressive 80% accuracy.

Second part of the learning algorithms research.

Why we develop embedded systems

One can wonder why, in a modern world where hardware offers such great performance, software developers would still use embedded systems. There are many high-level programming languages that need just an OS kernel (preferably Linux or Android) and an interpreter to make anything possible, but at what price?

An embedded system can be designed to do exactly the job that is needed; in most cases it doesn’t need the complexity of a Linux-like OS and, most importantly, doesn’t need hardware performance of multiple gigahertz. If you are developing a video encoding/decoding/streaming program, a few more megahertz might be just what you need, but the most common applications of embedded software require just a few (50-100 MHz, 200 MHz at most).

One fact that you, the reader, might find interesting is that most automotive embedded systems (such as instrument clusters, sensors, body controllers and more) work with processors below 20 MHz for standard models, while the systems of top-class cars with big LCD screens integrate MCUs of up to 200 MHz.

Another aspect that makes powerful systems not applicable everywhere is that, because of the high frequencies they use, not only for clocking but also for communication and other processes, they emit quite powerful electromagnetic pulses and thus create electromagnetic fields. How can such a solution be integrated in an industrial environment? Or imagine an audio device with analog and digital audio going all around the device and a magnetic field that causes disturbance and glitches in the audio data…

From a production point of view, embedded systems are much more adjustable when it comes to price. If your product is a simple one, you can use chips just capable enough for the device to be fully operational. That’s why a common practice is to develop a product with chips that allow more flexibility (i.e. more peripherals, more RAM, more flash, …) and, when software development is over and it is time for production, squeeze the hardware to the minimum to achieve the highest economy of scale.

Empybit supports Business Booster

As graduates of the Technical University of Sofia, we support honourable ideas coming from there. That’s why we are proud to announce that Empybit actively participated in the development of Business Booster – a center for Entrepreneurship and Innovation at the Technical University of Sofia which aims at establishing an environment favourable for entrepreneurship among students.


The educational program offers a unique opportunity for young engineers to realize their real impact on the economy and see how they can employ their high potential in successful ventures. It is constructed using best practices from leading universities in the fields of entrepreneurship and management, such as Harvard Business School, Babson College and Florida International University. The syllabus is established in partnership with the Center for Economic Strategy and Competitiveness (CESC).

Come on, Business Booster! Keep on boosting young engineers to achieve their goals!

Using Python for existing Python code migration

Keeping software design tidy is not an easy task, especially when the code base grows significantly. In some cases systems are composed of thousands of modules which, at a specific moment, need to be updated, migrated or optimized. It is possible that a huge part of the code is auto-generated Python code, and it would be a huge hassle to migrate all that code manually. In any case the work needs to be done, and it clearly needs to be automated.

One option would be to use regular expressions to find the existing patterns. You start describing different migration cases and hooking them to callback functions that need to get the job done. In the end you wind up with a huge number of scenarios that are extremely hard to maintain. The replacement needs to happen in place, and you risk that already migrated code will be matched by the next regular expression. Then you start again.

Fortunately, Python offers a better way to do it, hidden in the standard library and called lib2to3. Yes, this is a tool for migration from Python 2 to 3, but it can be adapted to our needs. For this purpose we just need to extend it with our own fixers. Fixers allow you to directly modify the parse tree by changing the values of nodes and leaves, deleting them and inserting new ones.

In order to set up your code migration project, you need to copy lib2to3 from the standard library into your workspace, delete all fixers from the lib2to3/fixes folder (unless you also want to migrate to Python 3), and you are ready to start writing your own migration rules. Writing your own fixers this way allows you to keep migrations organized and encapsulated, making them more maintainable.

Here is a snippet, just to give an idea of what a fixer looks like:

from lib2to3.fixer_base import BaseFix
from lib2to3.pgen2 import token

class FixName1(BaseFix):

    # Only NAME tokens (identifiers) are handed to match().
    _accept_type = token.NAME

    def match(self, node):
        # Fire only on leaves whose value is the old identifier.
        return node.value == 'oldname'

    def transform(self, node, results):
        # Rename the leaf and mark the tree as changed so the source is rewritten.
        node.value = 'newname'
        node.changed()
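
To actually run such a fixer you can drive it through lib2to3's RefactoringTool. The package name myfixes and the module name fix_name1 below are assumptions for the example (lib2to3 expects fixer modules to be named fix_* and to contain a matching FixName1 class):

from lib2to3.refactor import RefactoringTool

# The list contains dotted module names of fixers, e.g. myfixes/fix_name1.py
# holding the FixName1 class from the snippet above.
tool = RefactoringTool(['myfixes.fix_name1'])

source = "oldname = 42\nprint(oldname)\n"
tree = tool.refactor_string(source, '<example>')
print(str(tree))  # every occurrence of oldname is now newname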

Why we use Python at Empybit

At Empybit we extensively use Python as the primary programming language for the high-level part of the software systems we create. There are quite a lot of reasons for this, and in this post I would like to share some of them.

Python means Agile

When creating low-to-high-level systems there is a serious low-level embedded part of the system that needs to be coded in C, and there are good reasons for that. Python is perfect for controlling those components, and its usage accelerates the development process. Python is also perfect for prototyping specific components of the whole system. It has a huge set of modules for interaction with low-level components and libraries, and its syntax is lightweight and fat-free, which makes code more maintainable.

Python is Flexible

Python is a language that is used extensively in a lot of industries. Developed back in 1991, it has proven its stability. Python wasn’t developed to match some specific needs and doesn’t cover any special industry requirements, which makes it a good choice for a lot of cases. As a company doing custom development, this makes Python perfect for us.

Python accelerates Software Development

With its lightweight syntax, combination of functional and object-oriented approaches, huge collection of open-source libraries and wide area of application, Python accelerates the software development process and reduces the time from planning a project to seeing it in production. This allows focusing on other processes, like testing and quality assurance.

Python integrates nicely with Embedded systems

Written in C, Python integrates nicely with low-level systems. This allows us to design systems where different components interact smoothly, no matter whether it is a hardware device, firmware, server or mobile application. Development of complete software systems is our primary area of expertise, and as a language that works well alongside C, Java and .NET code, Python is perfect for us.