Can Regular Expressions Be Safely Reused Across Languages?
Written by Nikos Vaggalis
Monday, 02 September 2019
It is a poorly kept secret that programmers are huge fans of copying and pasting code snippets, regular expressions included, that are freely available across the web. Copying and pasting code within the boundaries of a single programming language is one thing. But does a regular expression crafted in one language work as assumed when pasted into another, or does it introduce errors in semantics and in performance?
"Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions", a paper presented at the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), attempts to shed light on the question: are regular expressions truly portable?
But first things first. Do programmers engage in copying and re-using regular expressions to begin with? Really, DO they?
To find out, the researchers from Virginia Tech, James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant and Dongyoon Lee, surveyed 159 professional developers on the job in order to understand their perceptions and practices around regular expressions. The findings leave no room for misinterpretation: 94 percent of these developers copy and reuse regex constructs taken from Stack Overflow and other forums, and 47 percent think that regexes are indeed portable across language barriers.
These findings confirmed the researchers' belief that this is a real issue that warrants further investigation. So the next stage was to measure the extent of that reuse.
To do so, they built a regex corpus consisting of 537,806 regexes extracted from 193,524 libraries/modules written in
Based on their polyglot regex corpus, they then explored the issues of portability, starting with semantic portability, where a problem arises when two languages exhibit different features (or behaviors) for the same regex syntax.
To do that, they ran a large set of randomly generated inputs against a large set of complex regular expressions in each language that supported them. This resulted in the so-called "Witness points", which were used as the basis of comparison among all the languages, grouped by every possible pair. The outcome of the comparison was plotted in a chart categorized by Witness type in order to highlight the differences. These categories were:
(1) Match witness: Languages disagree on whether there is a match
(2) Substring witness: Languages agree that there is a match but disagree about the matching substring
(3) Capture witness: Languages agree on the match and the matching substring, but disagree about the division of the substring into any capture groups of the regex
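To make the three categories concrete, here is a minimal Python sketch of how two languages' results for the same regex and input could be classified. It is not the researchers' tooling: the `MatchResult` record, the function names and the placeholder language labels are all assumptions made for illustration, and the results are written by hand rather than produced by real engines.

```python
from itertools import combinations
from typing import Dict, NamedTuple, Optional, Tuple


class MatchResult(NamedTuple):
    """Hypothetical record: what one language reported for one regex on one input."""
    matched: bool
    substring: Optional[str] = None
    captures: Tuple[Optional[str], ...] = ()


def classify_witness(a: MatchResult, b: MatchResult) -> Optional[str]:
    """Return the witness type for two results, or None if they agree."""
    if a.matched != b.matched:
        return "match"        # (1) disagreement on whether there is a match
    if not a.matched:
        return None           # both report no match, nothing left to compare
    if a.substring != b.substring:
        return "substring"    # (2) same verdict, different matching substring
    if a.captures != b.captures:
        return "capture"      # (3) same substring, different capture groups
    return None


def witnesses_by_pair(
    results: Dict[str, MatchResult],
) -> Dict[Tuple[str, str], str]:
    """Compare every possible pair of languages and keep the disagreements."""
    found = {}
    for (lang_a, res_a), (lang_b, res_b) in combinations(sorted(results.items()), 2):
        kind = classify_witness(res_a, res_b)
        if kind is not None:
            found[(lang_a, lang_b)] = kind
    return found


# Hand-written, hypothetical results for one regex and one input.
example = {
    "lang_a": MatchResult(True, "ab", ("a",)),
    "lang_b": MatchResult(False),
    "lang_c": MatchResult(True, "ab", ("b",)),
}
pair_witnesses = witnesses_by_pair(example)
```

The checks run in the same order as the list above, so each pair is assigned the coarsest category in which the two languages disagree.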
The research now turns to the causes of this behavior. Apparently:
The new findings were summarized in the following table:
Yet another conclusion is that these unusual behaviors could not be explained by consulting each language’s regex documentation, so "testing, not reading the manual, is the only way for developers to learn these behaviors".
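In that spirit, a quick probe is often the fastest way to find out what an engine really does with a pasted regex. The following Python sketch is an illustration chosen for this article, not an example taken from the paper, and the `matches` helper and the test input are assumptions. It checks how Python's `re` module treats the `$` anchor when the input ends with a newline, a case that JavaScript, for instance, handles differently by default, since there `$` only matches at the very end of the input unless the multiline flag is set.

```python
import re


def matches(pattern: str, subject: str) -> bool:
    """Return True if the pattern matches anywhere in the subject (re.search)."""
    return re.search(pattern, subject) is not None


SUBJECT = "abc\n"  # hypothetical input with a trailing newline

# In Python, `$` also matches just before a newline that ends the string.
dollar_result = matches(r"^abc$", SUBJECT)

# `\Z` matches only at the very end of the string.
strict_result = matches(r"^abc\Z", SUBJECT)

print("$ :", dollar_result)
print("\\Z:", strict_result)
```

Running the same pattern and input in the target language before reusing the regex there shows directly whether the two engines agree.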
Last Updated ( Tuesday, 03 September 2019 )