MetaGraph Aims To Be The “Google For DNA,” Giving Scientists Control Of Big Data

Over the past twenty years, scientists have sequenced almost everything they can access—bacterial genomes from soil, viral samples from hospitals, gut microbiomes from people around the world, even the RNA inside single human cells. All of that sequencing output gets funneled into massive archives that have quietly become some of the largest data collections on the planet. 

By volume, these repositories now hold more raw genetic data than Google's index holds webpages. That should make them a goldmine for scientific discovery. In practice, most of the data is out of reach because it is fragmented and nearly impossible to search in its raw form.

That’s why a new tool called MetaGraph, recently published in Nature, is getting a lot of attention. Instead of treating genomic data like something that needs to be cleaned and organized first, it takes the opposite approach by embracing the chaos. 

MetaGraph was developed by a team of computational biologists and informatics researchers led by Gunnar Rätsch and André Kahles, along with several collaborators who specialize in large-scale sequence indexing and graph algorithms. 

Their goal was not to build another reference genome or annotation database, but to make raw sequencing data itself searchable at petabase scale. In practical terms, they wanted a system that works directly on the unassembled reads stored in global archives and still returns accurate biological answers—without reshaping the data to fit existing tools.

“It’s a huge achievement,” says Rayan Chikhi, a biocomputing researcher at the Pasteur Institute in Paris. “They set a new standard” for analyzing raw biological data — including DNA, RNA and protein sequences — from databases that can contain millions of billions of DNA letters, amounting to ‘petabases’ of information, more entries than all the webpages in Google’s vast index.

MetaGraph has been described as a “Google for DNA,” but Chikhi argues it is closer to YouTube’s search engine, which does not just match keywords but analyzes the content itself. MetaGraph searches directly through raw DNA and RNA reads and can detect patterns or variants that were never annotated or even known to exist, making it possible to uncover signals that traditional tools would miss entirely.

To do this, MetaGraph arranges raw sequencing reads into a graph that represents how small fragments of DNA or RNA overlap across many datasets. It doesn’t try to assemble complete genomes. Instead, it captures the relationships between millions of short pieces, which allows the system to track where a particular sequence appears—even if it’s only a tiny fragment shared between distant species or environments.
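The idea of a graph built from overlapping fragments can be illustrated with a toy example. The sketch below is not MetaGraph's code. It assumes a de Bruijn-style layout in which every fixed-length substring (a k-mer) is a node, two nodes are linked when they overlap by all but one letter, and each node carries the labels of the samples it was seen in. The function names, the value of k and the reads themselves are all made up for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_annotated_graph(samples, k):
    """Map each k-mer (a graph node) to the set of sample labels containing it."""
    labels = defaultdict(set)
    for label, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                labels[kmer].add(label)
    return dict(labels)


def successors(graph, kmer):
    """Return nodes whose first k-1 letters equal this node's last k-1 letters."""
    candidates = (kmer[1:] + base for base in "ACGT")
    return [c for c in candidates if c in graph]


# Hypothetical toy reads, not real sequencing data.
samples = {
    "soil": ["ACGTAC", "GTACGG"],
    "gut": ["CGTACG"],
}
graph = build_annotated_graph(samples, k=4)

print(sorted(graph["GTAC"]))      # samples that share this fragment
print(successors(graph, "TACG"))  # where the graph branches after TACG
```

No genome is assembled here. The structure only records which short pieces exist, how they chain together, and which samples contain them, and that is enough to trace a fragment across datasets.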

The graph itself is stored in a compressed format, but remains directly searchable. When a researcher runs a query, MetaGraph doesn’t reprocess entire datasets. It navigates through the graph structure to locate areas where similar patterns have already been observed. This approach makes it possible to search very large collections of raw data in a reasonable amount of time, while still working at the level of the original reads rather than relying on annotations or pre-built references.
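A query against such an index can be sketched in a few lines. The example below assumes a prebuilt lookup table from k-mers to sample labels, stored here as a plain Python dictionary rather than a compressed structure, and a made-up matching rule: a sample is reported when it contains at least a chosen fraction of the query's k-mers. The index contents, the `query` function and the `min_fraction` threshold are illustrative assumptions, not MetaGraph's actual interface.

```python
from collections import Counter

K = 4  # toy k-mer length, chosen only for this example

# Hypothetical prebuilt index: k-mer -> labels of samples containing it.
INDEX = {
    "ACGT": {"soil"},
    "CGTA": {"soil", "gut"},
    "GTAC": {"soil", "gut"},
    "TACG": {"soil", "gut"},
    "ACGG": {"soil"},
}


def query(index, seq, k=K, min_fraction=0.8):
    """Return samples that contain at least min_fraction of the query's k-mers."""
    query_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not query_kmers:
        return {}  # query shorter than k: nothing to look up
    hits = Counter()
    for kmer in query_kmers:
        # One dictionary lookup per k-mer; the raw reads are never rescanned.
        for label in index.get(kmer, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {label: n / total for label, n in hits.items() if n / total >= min_fraction}


result = query(INDEX, "CGTACGG")
print(result)
```

The cost of a lookup depends on the length of the query, not on the size of the archive behind the index, which is the property that makes searching very large collections practical.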

The researchers put MetaGraph to a real-world test with antibiotic resistance. They took 241,384 human gut microbiome samples collected from different parts of the world and asked a simple question: where in these samples are resistance genes hiding? Normally, answering that would mean assembling each dataset, building references, and running separate pipelines across thousands of files. 

That sort of manual work could take weeks or months. MetaGraph did it in about an hour on a high-performance machine. Because the tool searches the raw reads directly, it was able to spot resistance genes even when they appeared only as tiny fragments or in species with no reference genome at all. The system also uncovered geographic patterns that lined up with known differences in antibiotic use.

MetaGraph isn’t the only attempt to make massive sequencing archives searchable. Chikhi himself, together with Artem Babaian, has developed a separate platform called Logan that tackles the problem from a different angle. Instead of indexing raw reads, Logan stitches them into longer stretches of DNA, which allows it to quickly identify full genes and their variants across massive datasets.

That approach led to the discovery of more than 200 million natural versions of a plastic-degrading enzyme. However, assembly-based tools like Logan are optimized for specific targets, and they can miss signals that don’t form clean, complete sequences. Because MetaGraph searches raw data directly, it offers researchers broader scope and potentially more flexibility.

If tools like MetaGraph become widely available, researchers anywhere could mine global datasets without massive infrastructure or custom pipelines. That could accelerate drug discovery, environmental monitoring and personalized medicine. 

Perhaps the most important shift is that future scientific breakthroughs may not require new experiments at all. They could come from data that has been sitting in archives for years, data we already collected but are only now able to truly search and understand.

The post MetaGraph Aims To Be The “Google For DNA,” Giving Scientists Control Of Big Data appeared first on BigDATAwire.