The dangers of copy and paste: regular expressions may not be as portable across languages as you think
A recent paper presented at the Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19) revealed the hidden dangers of copying and pasting regular expressions across languages.
Despite the inconsistencies in regexes across languages, copying and pasting code is a common practice, particularly when handling regexes that can be difficult to decipher. In a short survey, researchers discovered that 94% of developers copy and reuse regex constructs from Stack Overflow. More worryingly, roughly 47% believe regexes are portable across languages.
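One concrete inconsistency, chosen here for illustration and not taken from the paper, is the `$` anchor. In Python, `$` also matches just before a trailing newline, whereas in JavaScript (without the multiline flag) it matches only at the very end of the input. A validation pattern pasted from a JavaScript answer can therefore accept input in Python that the original would reject. The following is a minimal sketch; the function names are hypothetical.

```python
import re

# Pattern as it might be copied unchanged from a JavaScript snippet: /^\d+$/
COPIED = re.compile(r"^\d+$")


def is_digits_copied(text: str) -> bool:
    """Validate with the pasted pattern; Python's `$` tolerates a trailing newline."""
    return COPIED.search(text) is not None


def is_digits_strict(text: str) -> bool:
    """Validate with fullmatch, which requires the entire string to match."""
    return re.fullmatch(r"\d+", text) is not None


if __name__ == "__main__":
    # Print both verdicts side by side for a few sample inputs.
    for sample in ["123", "123\n", "12a"]:
        print(repr(sample), is_digits_copied(sample), is_digits_strict(sample))
```

The two functions disagree only on `"123\n"`, which is the kind of difference that tends to surface through testing.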
How do regex disparities impair software development? First, some of the differences between languages cannot be identified from the documentation. According to the researchers, “testing, not reading the manual, is the only way for developers to learn these behaviors.” Poor documentation encourages risky regex usage and consumes developer resources. Second, poorly performing regexes are a security risk, exposing applications to ReDoS (Regular expression Denial of Service) attacks. In a ReDoS attack, an attacker submits input crafted to trigger the worst-case behavior of a slow regex, tying up the application while it tries to match. When porting regexes between languages, developers may neglect to make the necessary security optimizations.
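To make the ReDoS risk concrete, here is a small sketch using a well-known problematic shape, nested quantifiers, in Python's backtracking `re` module. The pattern and the helper name `time_match` are illustrative assumptions and do not come from the paper.

```python
import re
import time

# Nested quantifiers: a classic shape that backtracks heavily in Python's `re`.
RISKY = re.compile(r"(a+)+$")


def time_match(n: int) -> float:
    """Return the seconds taken to reject n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = RISKY.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing 'b' means the pattern can never match
    return elapsed


if __name__ == "__main__":
    # Matching time grows rapidly as the input gets only slightly longer.
    for n in (10, 14, 18, 22):
        print(n, f"{time_match(n):.4f}s")
```

On a backtracking engine, each additional character roughly doubles the work for this input. Engines that guarantee linear-time matching, such as Go's `regexp` package, do not show this growth, so the same pattern can be harmless in one language and a liability in another.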
Regexes are a fundamental part of most programming languages, but without consistency throughout the ecosystem, developers will have to be extra diligent in finding, testing, and deploying them.