Sep 13, 2019

The dangers of copy and paste: regular expressions may not be as portable across languages as you think

A recent paper presented at the Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19) revealed the hidden dangers of copying and pasting regular expressions across languages.

The paper, titled “Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions,” analyzed 537,806 regexes from 193,524 libraries in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. After building this corpus, the researchers compiled a list of regexes that appeared in more than one language and ran a set of inputs against them to compare how each language evaluated each expression. Approximately 15% of the regexes behaved differently across languages, and 10% showed performance disparities across languages.
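
To make the idea of "different behavior" concrete, here is a minimal Python sketch. The pattern and the helper names (`is_digits`, `is_ascii_digits`) are hypothetical illustrations and are not taken from the paper; the comments about JavaScript describe that language's default behavior without the `m` flag.

```python
import re

# Hypothetical validator, as it might be copied from a JavaScript snippet
# whose author assumed "$" means "end of input".
DIGITS_ONLY = re.compile(r"^\d+$")


def is_digits(text):
    """Return True if Python's re module accepts text as 'digits only'."""
    return DIGITS_ONLY.search(text) is not None


# Python: "$" also matches just before a single trailing newline.
# JavaScript's /^\d+$/ (no "m" flag) rejects the same input.
trailing_newline = is_digits("123\n")

# Python 3: "\d" in a str pattern matches any Unicode decimal digit.
# JavaScript's \d matches only ASCII 0-9.
arabic_indic = is_digits("\u0661\u0662\u0663")

# One stricter Python spelling: fullmatch() must consume the whole string,
# and re.ASCII limits \d to 0-9.
STRICT_DIGITS = re.compile(r"\d+", re.ASCII)


def is_ascii_digits(text):
    """Return True only for a non-empty string of ASCII digits."""
    return STRICT_DIGITS.fullmatch(text) is not None
```

Both `trailing_newline` and `arabic_indic` come out true in Python, while `is_ascii_digits` rejects both inputs. The pattern text is identical to the JavaScript original, yet the set of accepted strings is not, and that is the kind of gap that only shows up when the regex is tested in the target language.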

Despite the inconsistencies in regexes across languages, copying and pasting code is a common practice, particularly when handling regexes that can be difficult to decipher. In a short survey, the researchers found that 94% of developers copy and reuse regex constructs from Stack Overflow. More worryingly, roughly 47% believed that regexes are portable across languages.

How do regex disparities impair software development? First, some of the differences between languages could not be identified through the documentation. According to the researchers, “testing, not reading the manual, is the only way for developers to learn these behaviors.” Incomplete documentation encourages risky regex usage and consumes developer time. Second, poorly performing regexes are a security risk because they expose applications to ReDoS (Regular expression Denial of Service) attacks. In a ReDoS attack, an attacker submits input crafted to trigger a regex’s worst-case matching time, which can tie up the application. A regex that performs acceptably in one language may be slow in another, so developers who port regexes without re-testing them can miss the performance fixes the new language requires.
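
The slowdown behind ReDoS can be sketched in a few lines of Python, whose `re` module uses a backtracking matcher. The patterns, the input length, and the helper `timed_match` are assumptions chosen for illustration, not examples from the paper. Engines that do not backtrack, such as Go's `regexp` package, are not expected to show the same blow-up on this input.

```python
import re
import time


def timed_match(pattern, text):
    """Return (matched, seconds) for one match attempt anchored at the start."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    matched = compiled.match(text) is not None
    return matched, time.perf_counter() - start


# Two patterns that accept the same strings (one or more "a" characters).
NESTED = r"(a+)+$"  # nested quantifier: many ways to split a run of "a"s
FLAT = r"a+$"       # same language, only one way to match

# A run of "a"s followed by a character that forces the match to fail.
# Kept short on purpose: each extra "a" roughly doubles the nested time.
attack = "a" * 18 + "!"

nested_matched, nested_seconds = timed_match(NESTED, attack)
flat_matched, flat_seconds = timed_match(FLAT, attack)

print(f"nested: matched={nested_matched} seconds={nested_seconds:.6f}")
print(f"flat:   matched={flat_matched} seconds={flat_seconds:.6f}")
```

Neither pattern matches the input, but the nested one has to try every way of splitting the run of "a"s before giving up. Running a timing check like this in the language a regex is being ported to is one simple way to catch the problem before an attacker does.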

Regexes are a fundamental part of most programming languages, but without consistency throughout the ecosystem, developers will have to be extra diligent in finding, testing, and deploying them.
