GPT For Regex - Pros And Cons
Written by Nikos Vaggalis   
Wednesday, 24 May 2023

RegExGPT is an online playground that lets you enter a source and a target string and have GPT generate the regular expression for a match. But what are the pros and cons of the GPT approach to regex code generation?

Despite their power, regular expressions come with their own challenges: they quickly become unreadable, so that understanding them becomes a matter of deobfuscation, and learning how to use them involves a steep learning curve.

As such, even before the advent of GPT, there were always attempts to ease the path to regex mastery by generating regular expressions from natural language or from an expression of intent.

One such attempt used Genetic Programming, as described in Automatically Generating Regular Expressions with Genetic Programming:

Is it possible to "breed" a correct regular expression so that you don't have to go to the trouble of actually working it out for yourself?

When you construct a computer program you do so through a series of well defined instructions that work on a set of data and produce a desired outcome.

Given our focus here is on regular expressions, let's say that our goal is to match just the alphanumeric characters of the string:

'http://www.google.com'.

Sticking to the traditional way we would have to supply an instruction in the form of a regular expression, that is

'/[a-zA-Z]/g'

But what if we could start the other way around? That is, get the computer to solve problems without being explicitly programmed? How can this be done?

With Genetic Programming we can tell a computer program what we're after and let it generate a new program for us that will produce the same outcome as if we had done it ourselves.
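
To make the traditional approach in the passage above concrete, here is a minimal sketch in Perl, the language used for the tests later on. The variable names are illustrative only. Note that the character class covers letters only; the sample string happens to contain no digits, so nothing is missed here.

```perl
use strict;
use warnings;

my $url = 'http://www.google.com';

# In list context, m//g returns every match; here each match is one letter.
my @letters = $url =~ /[a-zA-Z]/g;

print join('', @letters), "\n";   # prints "httpwwwgooglecom"
```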

Although an experimental initiative, it had some success. Now, with the advent of GPT, this process has become mainstream as well as easier to pull off, plus you also get an explanation in plain English alongside the generated code.

Which brings us to RegExGPT, a web app that uses the power of natural language processing to help developers deal with the difficulty and tedium of regular expressions.

With RegExGPT, you can describe the pattern you're looking for in plain English, and the AI-powered engine will generate the corresponding regular expression for you. The goal is to make regular expressions accessible to everyone, regardless of their level of expertise.

Let's put it to the test. You provide a source string and the desired output and let the AI generate the appropriate expression. I'm using two examples in Perl, taken from Advanced Perl Regular Expressions - The Pattern Code Expression and Advanced Perl Regular Expressions - Extended Constructs:

Test1:

source text : image_of_a_&#x00A3

target text: image_of_a_£

This is the code it generated, together with a good explanation of it:

Test2. Yet another attempt :

source text : myimageऄwithधDevanagariमcharsफ.png

target text : myimagexwithxDevanagarixcharsx.png

generated :


To validate the tests, I executed the generated code by converting it into a full-blown Perl program. Both regexes worked flawlessly.
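
The generated expressions are not reproduced in this text. As a rough sketch of the kind of substitution that solves each test, and not necessarily what RegExGPT produced, something along these lines would do. The helper names decode_hex_entities and mask_devanagari are hypothetical, and the optional trailing semicolon in the first pattern is an assumption made because the sample reference lacks one.

```perl
use strict;
use warnings;

# Test 1: turn a hexadecimal character reference into the character itself.
# The /e modifier evaluates the replacement as Perl code.
sub decode_hex_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);?/chr(hex($1))/ge;
    return $text;
}

# Test 2: replace every character of the Devanagari script with 'x'.
sub mask_devanagari {
    my ($text) = @_;
    $text =~ s/\p{Devanagari}/x/g;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');

print decode_hex_entities('image_of_a_&#x00A3'), "\n";

# The Devanagari letters are written as escapes (U+0904, U+0927, U+092E,
# U+092B) so the sketch does not depend on the encoding of the source file.
print mask_devanagari(
    "myimage\x{0904}with\x{0927}Devanagari\x{092E}chars\x{092B}.png"
), "\n";
```

Perl's Unicode properties such as \p{Devanagari} keep the second pattern short; listing the code point range by hand would also work but is easier to get wrong.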

Let's try some advanced scenarios and edge cases like recursively matching balanced text, the gold standard coming straight from Perl's regex FAQ page:

Can I use Perl regular expressions to match balanced text?

source text:
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.

target text:
<brackets in <nested brackets> >
<another group <nested once <nested twice> > >

To have a reference point, let's check the official answer given by the FAQ in https://regexr.com/:
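
For readers who want to run the reference answer locally, the following sketch is adapted from the approach in perlfaq6: a capture group that recurses into itself with (?1) and uses a possessive quantifier for the non-bracket runs. The layout, comments and variable names here are illustrative rather than a verbatim copy of the FAQ.

```perl
use strict;
use warnings;

my $text = <<'HERE';
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $text =~ m/
    (               # start of capture group 1
    <               # an opening angle bracket
        (?:
            [^<>]++ # one or more non-brackets, possessive
            |
            (?1)    # otherwise recurse into capture group 1
        )*
    >               # the matching closing angle bracket
    )               # end of capture group 1
/xg;

# Prints the two outermost bracket groups, one per line.
print "$_\n" for @groups;
```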

Now to the RegExGPT answer.

The initial code, when wrapped in a Perl program, resulted in a syntax error since the regex was missing a prefix, although the prefix was contained in the explanation GPT provided below the generated code. After adding it, we got the following expression:

$text =~ s/<([^<>]*(?:(?R)[^<>]*)*)>//g;

It looked promising enough, so let's try it out in an actual Perl program (Perl version 5.26.3):

my $text = 'I have some <brackets in <nested brackets> > and <another group <nested once <nested twice> > > and thats it.';

# In list context, m//g returns the contents of capture group 1 for every match.
my @groups = $text =~ m/<([^<>]*(?:(?R)[^<>]*)*)>/g;
print "@groups";

The result is not exactly as expected, since the outermost brackets are missing:

brackets in <nested brackets>  another group <nested once <nested twice> >

But with a bit of tweaking, giving the outer brackets capture groups of their own:

$text =~ s/(<)([^<>]*(?:(?R)[^<>]*)*)(>)//g;

Result:

 < brackets in <nested brackets>  > < another group <nested once <nested twice> >  >
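
As an alternative tweak, and only as a sketch rather than anything RegExGPT produced, the capture groups can be dropped altogether. With no capture groups in the pattern, m//g in list context returns each whole match, so the outer brackets come back attached to their contents.

```perl
use strict;
use warnings;

my $text = 'I have some <brackets in <nested brackets> > and '
         . '<another group <nested once <nested twice> > > and thats it.';

# (?:...) does not capture, so each list element is one complete match.
my @groups = $text =~ m/<[^<>]*(?:(?R)[^<>]*)*>/g;

# Prints the two outermost bracket groups, one per line.
print "$_\n" for @groups;
```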

For the sake of it, let's give it another try by regenerating the response. This time GPT generated:

which turned out to be a syntax error.

So the generated response might require some tweaking to work. More generally, when adopting automatically generated code from tools, extra care ought to be taken. For example, when generating regexes it is good to be on the lookout for generated recipes prone to ReDoS, although in my limited test of RegExGPT I haven't encountered such a case. ReDoS is analyzed in Can Regular Expressions Be Safely Reused Across Languages?:

The Regular expression Denial of Service (ReDoS) is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size).

An attacker can then cause a program using a Regular Expression to enter these extreme situations and then hang for a very long time.

Fortunately the Perl engine protects against it. To tell you the truth, that double star, as in *)*) in the generated output above, looked very suspicious, but it passed the test; otherwise Perl would have spat out an "Infinite recursion in regex" error.

This is due to the protection the Perl engine provides against ReDoS. Perl has explicit defenses against exponential-time behavior: it maintains a cache of visited states in order to short-circuit redundant paths through the NFA, permitting it to evaluate in linear time some searches that take polynomial or exponential time in other engines. Note that Perl uses its own engine, not the PCRE (Perl Compatible Regular Expressions) library that is modeled on it.
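
Engine-level defenses aside, a generated pattern can also be hardened by hand. The sketch below applies one common technique, possessive quantifiers (available since Perl 5.10), to the balanced-brackets pattern from earlier, so that the engine never gives back what a quantifier has already matched. The helper name balanced_groups is hypothetical, and whether a particular pattern actually needs this treatment depends on the engine and the input.

```perl
use strict;
use warnings;

# Returns every outermost balanced <...> group found in the text.
# Call it in list context; each *+ is a possessive (non-backtracking) star.
sub balanced_groups {
    my ($text) = @_;
    return $text =~ m/<[^<>]*+(?:(?R)[^<>]*+)*+>/g;
}

my @found = balanced_groups(
    'I have some <brackets in <nested brackets> > and that is it.'
);
print "$_\n" for @found;

# Unbalanced input simply yields no match instead of a long search.
my @none = balanced_groups(('<' x 30) . 'a');
print scalar(@none), " groups in the unbalanced input\n";
```

Possessive quantifiers are safe here because a run of non-bracket characters and a nested group can never overlap: one starts with anything but a bracket, the other always starts with <.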

In conclusion, mastering Regular Expressions has always been a Black Art that requires a lot of effort. Tools like RegExGPT will certainly improve productivity by acting as the programmer's sidekick. Just don't take their word at face value.

 

More Information

RegExGPT

Related Articles

Automatically Generating Regular Expressions with Genetic Programming

Advanced Perl Regular Expressions - The Pattern Code Expression 

Advanced Perl Regular Expressions - Extended Constructs

Can Regular Expressions Be Safely Reused Across Languages?

 


Last Updated ( Wednesday, 24 May 2023 )