Oct 04, 2019

GitHub tentatively enters the code search battlefield

As any developer will tell you, searching for solutions to coding questions on most search engines can be a challenge. With the announcement of its CodeSearchNet challenge, heavyweight GitHub finally starts working to fix the world of code search.

Mapping search queries to relevant code is notoriously difficult. Machines must learn semantic code search: retrieving relevant code given a natural language query. Code search today often fails because the words people type into a search box rarely match the terminology that actually appears in code.
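To make that vocabulary gap concrete, here is a minimal sketch of a naive keyword matcher. The two snippets, the function names, and the scoring rule are all hypothetical, chosen only for illustration; they are not taken from CodeSearchNet or from any GitHub tooling.

```python
import re

# Hypothetical mini-corpus for illustration: snippet name -> code text.
SNIPPETS = {
    "read_json": "def load(path):\n    with open(path) as fh:\n        return json.load(fh)",
    "reverse_list": "def flip(items):\n    return items[::-1]",
}


def tokens(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(query, code):
    """Count how many distinct query words appear literally in the code."""
    return len(tokens(query) & tokens(code))


# A natural language query whose intent matches the "read_json" snippet.
natural_query = "how to parse a configuration file"
# A query that happens to reuse the identifiers found in the code.
literal_query = "open json path"

for query in (natural_query, literal_query):
    scores = {name: keyword_score(query, code) for name, code in SNIPPETS.items()}
    print(query, "->", scores)
```

The natural language query shares no words with the snippet that would answer it, so a purely lexical matcher cannot rank it above unrelated code. Only the query that already uses the code's own identifiers scores well, and closing that gap is the job of semantic search.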

Many developers turn to Stack Overflow, which is also working to solve the search problem. Dubbed CROKAGE, Stack Overflow’s experimental tool parses human-generated answers and code snippets to compose better answers.

Who will have superior solutions for developers? GitHub, with more code, or Stack Overflow, with more answers?

GitHub has a massive and incomparable data advantage—nearly 100 million repositories, in fact. Such an asset makes GitHub the 800-pound gorilla that is likely to dominate code search.

Even so, GitHub publicly released its data set, which includes natural language descriptions paired with related code snippets scraped only from open source projects. GitHub is not using its vast troves of private code data and plans to rely on its community of developers and researchers.

Holding such a vital place in the development supply chain, GitHub sees value in deferring to its developer community to advance code search algorithms with transparency. While GitHub may be the obvious winner of the code search race, maintaining developer trust will be critical to everything GitHub does in its pursuit of empowering developers.
