Sep 13, 2019 newsletter

The dangers of copy and paste: regular expressions may not be as portable across languages as you think

A recent paper presented at the Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19) revealed the hidden dangers of copying and pasting regular expressions across languages.

The paper, titled Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions, analyzed 537,806 regexes from 193,524 libraries in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. After building a large corpus of regexes, researchers compiled a list of regexes that were present in multiple languages and ran a set of inputs against those regexes to compare how different languages evaluated each expression. Approximately 15% of regexes exhibited different behavior across languages and 10% of regexes had performance disparities across languages.
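The comparison the researchers describe can be approximated on a small scale. The following is a minimal sketch, not the paper's tooling: it records how Python's `re` module evaluates one regex over a set of probe inputs, so the same table can be produced in a second language and compared. The function name `match_table`, the pattern, and the inputs are illustrative assumptions. The probe `"abc\n"` is chosen because Python's `$` also matches just before a trailing newline, while JavaScript's `$` (without the `m` flag) matches only at the very end of the input.

```python
import re

def match_table(pattern, inputs):
    """Return {input: matched substring or None} for one regex over several inputs."""
    compiled = re.compile(pattern)
    table = {}
    for text in inputs:
        m = compiled.search(text)
        table[text] = m.group(0) if m else None
    return table

# Hypothetical pattern and probe inputs, chosen to expose anchor behavior.
pattern = r"^abc$"
inputs = ["abc", "abc\n", "xabc", "abc\n\n"]

table = match_table(pattern, inputs)
for text, result in table.items():
    print(repr(text), "->", repr(result))
```

Running the equivalent probes in the target language and comparing the two tables is one inexpensive way to catch a behavioral difference before a pasted regex reaches production.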

Despite the inconsistencies in regexes across languages, copying and pasting code is a common practice, particularly when handling regexes that can be difficult to decipher. In a short survey, researchers discovered that 94% of developers copy and reuse regex constructs from Stack Overflow. More worryingly, roughly 47% believe regexes are portable across languages.

How do regex disparities impair software development? First, some of the differences between languages could not be identified through the documentation. According to the researchers, “testing, not reading the manual, is the only way for developers to learn these behaviors.” Poor documentation encourages risky regex usage and consumes developer resources. Second, poorly performing regexes are a security risk, exposing applications to ReDoS (Regular expression Denial of Service) attacks. In a ReDoS attack, an attacker submits input crafted to trigger the worst-case matching time of a regex, which in some regex engines grows far faster than the length of the input and can tie up a server. Because performance characteristics differ between engines, a regex that is safe in one language can become a liability when ported to another if developers do not re-test it.
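To make the ReDoS risk concrete, here is a small sketch that times a failing match against a pattern with nested quantifiers in Python's backtracking `re` engine. The pattern `^(a+)+$` is a textbook example rather than one taken from the paper, and the names `VULNERABLE` and `time_match` are assumptions for illustration. The input sizes are deliberately small; each extra character roughly doubles the work, so larger values can stall the interpreter.

```python
import re
import time

# Nested quantifiers make this pattern prone to catastrophic backtracking
# in backtracking engines such as Python's `re` (illustrative pattern).
VULNERABLE = re.compile(r"^(a+)+$")

def time_match(n):
    """Time a failing match against n 'a' characters followed by '!'."""
    text = "a" * n + "!"
    start = time.perf_counter()
    matched = VULNERABLE.match(text)
    elapsed = time.perf_counter() - start
    return matched, elapsed

# Keep n small: the running time grows roughly exponentially with n.
timings = [(n, *time_match(n)) for n in (10, 14, 18)]
for n, matched, elapsed in timings:
    print(f"n={n:2d} matched={matched is not None} seconds={elapsed:.4f}")
```

Engines with linear-time matching guarantees, such as Go's `regexp` package and Rust's `regex` crate, do not slow down this way on the same input, which is one reason the performance of a regex should be re-measured after porting it.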

Regexes are a fundamental part of most programming languages, but without consistency throughout the ecosystem, developers will have to be extra diligent in finding, testing, and deploying them.


Web scraping public data sources is likely not a violation of anti-hacking law

A recent decision by the Ninth Circuit Court of Appeals in the case HiQ Labs v. LinkedIn holds that scraping a public website is likely not a violation of the Computer Fraud and Abuse Act (CFAA). The court’s decision is not a final one; it concerns a preliminary injunction. HiQ is seeking to preserve its access to LinkedIn’s public data while the lawsuit is pending, so the court has issued its preliminary assessment of the likely outcome of the lawsuit.

HiQ is a data analytics company that scrapes public LinkedIn profile data for its talent management services. LinkedIn sent a cease and desist letter to HiQ, requesting that HiQ stop collecting public profile data from its site. HiQ refused to comply with LinkedIn’s request. To ensure that LinkedIn would not take any measures to disrupt its web scraping, HiQ sued LinkedIn to prove that its scraping activities are legal. So far, the court seems to be siding with HiQ, suggesting that harvesting publicly available data does not constitute hacking.

Designed to ensure that computer hacking crimes did not go unpunished, the CFAA is a federal law enacted in 1986 that prohibits accessing a computer without authorization. The CFAA specifically prohibits circumventing a computer’s access permissions, but what constitutes authorization is a difficult and complex question. Does authorization depend on a computer’s architecture or design (e.g. login requirements) or on what a computer owner wants (e.g. a cease and desist letter)? In the case of HiQ Labs v. LinkedIn, the lack of login requirements to access profile data trumps LinkedIn’s cease and desist letter in determining whether or not HiQ has authorization to access data.

Court interpretations of the CFAA have zigzagged over time between an open internet and a closed one. A recent case, Facebook v. Power Ventures, gave more power to cease and desist letters as a form of authorization control. Power Ventures, however, accessed the data of logged-in Facebook users, so the company did not have proper authorization and needed to comply with the cease and desist letter.

Web scraping has long been a murky subject for companies and engineering teams. For teams that navigate private and public data sources, data ownership remains difficult to define. While many companies hope to protect the data they have collected on their platforms, others are looking to leverage that data to build new platforms. With greater freedom for web scraping, developers will have access to many new and legal data sources, but may also need to take steps to protect their own public data.


What happens when the cloud goes down? AWS outage leads to permanent data loss

In the early morning of August 31st, a data center in the AWS US-EAST-1 region in Northern Virginia experienced a power failure. Backup generators promptly turned on to restore power to the data center. Unfortunately, just a few hours later, those backup generators began to fail, too.

In the immediate aftermath, roughly 7.5% of the EC2 instances and EBS volumes at that facility became unavailable. EBS (Elastic Block Store) is a block storage service that lets teams keep data on volumes that persist even after EC2 instances are shut down. For some teams, EBS stores mission-critical data for applications and services.

Amazon slowly worked to restore its service throughout the morning, but for some the data loss was catastrophic. A few days later, AWS began notifying a small percentage of developers that hardware damage to Amazon’s infrastructure meant that some data could not be recovered. Engineering teams would either need to restore their data from a separate backup, or write off their data as permanently lost.

Despite its strong uptime and data recovery track record, Amazon’s terms state: “We have no liability whatsoever for any damages, liabilities, losses (including any corruption, deletion, or destruction or loss of data, applications or profits), or any other consequences resulting from the foregoing.”

The recent AWS incident highlights the importance of data redundancy and consistent backup creation as core parts of the software development workflow, even if engineering teams are using highly reliable, popular cloud services. Even the cloud, with its sprawling and distributed data centers, is subject to the consequences of unforeseen physical hardware failures.


Small bytes

  • Are you writing too much code? [BETTER PROGRAMMING]
  • In praise of developers who delete code [TECH REPUBLIC]
  • Tired of Stack Overflow: how a community became toxic [ARP242]
  • Personal projects make you a better developer [TNW]
  • Jack Ma and Elon Musk hold debate in Shanghai over artificial intelligence [YOUTUBE]

Tools

  • GitDuck combines both video and source code sharing in one place to help you collaborate more interactively [GITDUCK]
  • Reverse Interview is a set of questions to ask the company during your interview [GITHUB]
  • Daytripper is a multifunctional laser tripwire [GITHUB]
  • Tiler is a tool to create an image using all kinds of other smaller images [GITHUB]
  • Chakra UI is a simple, modular and accessible component library that gives you all the building blocks you need to build your React applications [CHAKRA]
  • Appwrite is a simple to use backend for frontend and mobile apps [GITHUB]
  • Bitmelo is an online JavaScript game maker [BITMELO]
  • Cytoscape.js is a graph theory (network) library for visualisation and analysis [CYTOSCAPE]
  • Regex Crossword is a crossword puzzle game, where the crossword clues are defined using regular expressions [REGEX CROSSWORD]