Nvidia Parabricks
{{short description|Suite of free genome analysis software by Nvidia}}
{{cs1 config|name-list-style=vanc|display-authors=6}}
{{Infobox software
| title = Nvidia Parabricks
| developer = Nvidia
| latest release version = 4.3.1-1
| latest release date = July 1, 2024
| platform = Nvidia GPUs
| language = English
| genre = Medical software
| website = {{URL|https://www.nvidia.com/en-us/clara/genomics/}}
}}
The Parabricks company was started at the University of Michigan by Mehrzad Samadi, Ankit Sethia, and Scott Mahlke, and was acquired by Nvidia in 2020.
Nvidia Parabricks is a suite of free software for genome analysis developed by Nvidia, designed to deliver high throughput by using graphics processing unit (GPU) acceleration.{{cite web |title=Clara for Genomics |url=https://www.nvidia.com/en-us/clara/genomics/ |access-date=8 July 2024 |website=NVIDIA}}
Parabricks offers workflows for DNA and RNA analyses and for the detection of germline and somatic mutations, using open-source tools. It is designed to reduce the computing time of genomic data analysis while maintaining the flexibility required for various bioinformatics experiments. Along with the speed of GPU-based processing, Parabricks aims to provide high accuracy, compliance with standard genomic formats, and the ability to scale to very large datasets.
Users can download and run Parabricks pipelines locally or directly deploy them on cloud providers, such as Amazon Web Services, Google Cloud, Oracle Cloud Infrastructure, and Microsoft Azure.
== Accelerated genome analysis fundamentals ==
File:Genome-analysis-pipeline.png
File:DNBSEQ-G400.jpg
The massive reduction in sequencing costs{{Cite web |title=DNA Sequencing Costs: Data |url=https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data |access-date=2024-07-10 |website=www.genome.gov |language=en}} resulted in a significant increase in the size and the availability of genomics data{{cite journal | vauthors = Langmead B, Nellore A | title = Cloud computing for genomic data analysis and collaboration | journal = Nature Reviews. Genetics | volume = 19 | issue = 4 | pages = 208–219 | date = April 2018 | pmid = 29379135 | pmc = 6452449 | doi = 10.1038/nrg.2017.113 }} with the potential of revolutionizing many fields, from medicine to drug design.{{cite journal | vauthors = Ombrello MJ, Sikora KA, Kastner DL | title = Genetics, genomics, and their relevance to pathology and therapy | journal = Best Practice & Research. Clinical Rheumatology | volume = 28 | issue = 2 | pages = 175–189 | date = April 2014 | pmid = 24974057 | pmc = 4149217 | doi = 10.1016/j.berh.2014.05.001 | series = Advances in Paediatric Rheumatology and Translation of Research to Targeted Therapies }}
Starting from a biological sample (e.g., saliva or blood), it is possible to extract the individual's DNA and sequence it with sequencing machinery to translate the biological information into a textual sequence of bases.{{cite journal | vauthors = Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O | title = From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures | journal = Computational and Structural Biotechnology Journal | volume = 20 | pages = 4579–4599 | date = 2022 | pmid = 36090814 | pmc = 9436709 | doi = 10.1016/j.csbj.2022.08.019 }} Then, once the entire genome is obtained through the genome assembly process, the DNA can be analyzed to extract information that is key in several domains, including personalized medicine and medical diagnostics.{{cite book | vauthors = Jain KK | chapter = Basics of Personalized Medicine |date=2009 | title = Textbook of Personalized Medicine |pages=1–27 | veditors = Jain KK |place=New York, NY |publisher=Springer |language=en |doi=10.1007/978-1-4419-0769-1_1 |isbn=978-1-4419-0769-1}}
Typically, genomics data analysis is performed with tools that run on central processing units (CPUs). Recently, several researchers in this field have underlined the limits of the computing power delivered by these tools and have focused their efforts on finding ways to boost the performance of the applications.{{Cite journal | vauthors = Alser M, Bingol Z, Cali DS, Kim J, Ghose S, Alkan C, Mutlu O |date= September 2020 |title=Accelerating Genome Analysis: A Primer on an Ongoing Journey |url=https://ieeexplore.ieee.org/document/9154510 |journal=IEEE Micro |volume=40 |issue=5 |pages=65–75 |doi=10.1109/MM.2020.3013728 |arxiv=2008.00961 |issn=0272-1732}} The issue has been addressed in two ways: developing more efficient algorithms, or offloading the compute-intensive parts to hardware accelerators. Examples of accelerators used in the domain are GPUs, FPGAs, and ASICs.{{cite journal | vauthors = Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S | title = Technology dictates algorithms: recent developments in read alignment | journal = Genome Biology | volume = 22 | issue = 1 | pages = 249 | date = August 2021 | pmid = 34446078 | pmc = 8390189 | doi = 10.1186/s13059-021-02443-7 | doi-access = free }}
In this context, GPUs have revolutionized genomics by exploiting their parallel processing power to accelerate computationally intensive tasks.{{cite journal | vauthors = Taylor-Weiner A, Aguet F, Haradhvala NJ, Gosai S, Anand S, Kim J, Ardlie K, Van Allen EM, Getz G | title = Scaling computational genomics to millions of individuals with GPUs | journal = Genome Biology | volume = 20 | issue = 1 | pages = 228 | date = November 2019 | pmid = 31675989 | pmc = 6823959 | doi = 10.1186/s13059-019-1836-7 | doi-access = free }}{{cite journal | vauthors = Nobile MS, Cazzaniga P, Tangherloni A, Besozzi D | title = Graphics processing units in bioinformatics, computational biology and systems biology | journal = Briefings in Bioinformatics | volume = 18 | issue = 5 | pages = 870–885 | date = September 2017 | pmid = 27402792 | pmc = 5862309 | doi = 10.1093/bib/bbw058 }} GPUs deliver promising results in these scenarios thanks to their architecture, composed of thousands of small cores capable of performing computations in parallel.{{Cite book | vauthors = Cheng J, Grossman M, McKercher T |url=https://books.google.com/books?id=q3DvBQAAQBAJ&dq=professional+cuda+c+programming&pg=PR17 |title=Professional CUDA C Programming |date=2014-09-09 |publisher=John Wiley & Sons |isbn=978-1-118-73932-7 |language=en}} This parallelism allows GPUs to process multiple tasks simultaneously, significantly speeding up computations that can be broken down into independent units. For instance, aligning millions of sequencing reads against a reference genome or performing statistical analyses on large genomic datasets can be completed much faster on GPUs than when using CPUs. 
This facilitates the rapid analysis of genomic data from diverse sources, ranging from individual genomes to large-scale population studies,{{cite journal | vauthors = Zhou C, Lang X, Wang Y, Zhu C | title = gPGA: GPU Accelerated Population Genetics Analyses | journal = PLOS ONE | volume = 10 | issue = 8 | pages = e0135028 | date = 2015-08-06 | pmid = 26248314 | pmc = 4527771 | doi = 10.1371/journal.pone.0135028 | doi-access = free | bibcode = 2015PLoSO..1035028Z }} accelerating the understanding of genetic diseases, genetic diversity, and more complex biological systems.
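The idea of breaking a computation into independent units can be illustrated with a minimal Python sketch. It is not Parabricks code and does not run on a GPU: the function names, the toy reference, and the ungapped Hamming-distance "alignment" are all assumptions chosen for illustration. What it shows is that each read can be placed against the reference without knowing anything about the other reads, which is the property that lets a GPU assign reads to many cores at once.

```python
from concurrent.futures import ThreadPoolExecutor


def best_ungapped_hit(read, reference):
    """Return (position, mismatches) of the best ungapped placement of a read.

    Toy stand-in for read alignment: it slides the read along the reference
    and counts mismatching bases at every offset.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def align_all(reads, reference, workers=4):
    """Place every read independently; no read depends on another read's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: best_ungapped_hit(r, reference), reads))


# Hypothetical toy data, not taken from any real genome.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["TAGCCG", "GGCTA", "ACGAAC"]
hits = align_all(reads, reference)
print(hits)
```

Python threads are used here only to express the independence of the work units; they do not provide the speed-up of a GPU, where thousands of such units execute concurrently.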
== Featured pipelines ==
Parabricks offers end users pipelines, that is, collections of tools organized sequentially to analyze raw data according to the user's requirements. Users can also run the tools provided by Parabricks on their own, still exploiting GPU acceleration to overcome possible computational bottlenecks. Only some of the tools in the suite are GPU-based.{{Cite web |title=Welcome to NVIDIA Parabricks v4.3.1 |url=https://docs.nvidia.com/clara/parabricks/latest/index.html |access-date=2024-07-10 |website=NVIDIA Docs |language=en}}
File:Parabricks-pipeline-overview.png
Overall, all the pipelines share a standard structure. Most of the pipelines are built to analyze FASTQ data resulting from various sequencing technologies (e.g., short- or long-read). Input genomic sequences are firstly aligned and then undergo a quality control process. These two processes provide a BAM or a CRAM file as an intermediate result. Based on this data, the variant calling task that follows employs high-accuracy tools that are already widely used. As output, these pipelines provide the identified mutations in a VCF (or a gVCF).
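The shared structure described above can be written down as a chain of stages in which each stage consumes the file format produced by the previous one. The following Python sketch is only a model of that data flow: the class and function names are hypothetical, and CRAM and gVCF are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: str  # input file format
    produces: str  # output file format


# Simplified model of the common pipeline layout.
PIPELINE = [
    Stage("alignment", "FASTQ", "BAM"),
    Stage("quality control", "BAM", "BAM"),
    Stage("variant calling", "BAM", "VCF"),
]


def format_chain(stages):
    """Check that consecutive stages are compatible and return the format sequence."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(
                f"{prev.name} produces {prev.produces}, "
                f"but {nxt.name} expects {nxt.consumes}"
            )
    return [stages[0].consumes] + [s.produces for s in stages]


print(" -> ".join(format_chain(PIPELINE)))
```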
=== Germline pipeline ===
The germline pipeline offered by Parabricks follows the best practices{{Cite web |date=2015-03-19 |title=Best Practices for Variant Calling with the GATK |url=https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1 |access-date=2024-07-09 |website=@broadinstitute |language=en}} proposed by the Broad Institute in their Genome Analysis ToolKit (GATK).{{Cite web |date=2010-06-08 |title=Genome Analysis Toolkit (GATK) |url=https://www.broadinstitute.org/scientific-community/software/genome-analysis-toolkit-gatk |access-date=2024-07-09 |website=@broadinstitute |language=en}} The germline pipeline operates on the FASTQ files provided as input by the user to call the variants that, belonging to the germ line, can be inherited.
This pipeline aligns the reads with BWA-MEM{{Citation | vauthors = Li H |title=Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM |date=2013-05-26 |arxiv=1303.3997 }}{{Cite web |title=Burrows-Wheeler Aligner |url=https://bio-bwa.sourceforge.net/ |access-date=2024-07-09 |website=bio-bwa.sourceforge.net}} and calls variants using GATK HaplotypeCaller,{{Citation | vauthors = Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Gauthier LD, Levy-Moonshine A |title=Scaling accurate genetic variant discovery to tens of thousands of samples |date=2018-07-24 |url=https://www.biorxiv.org/content/10.1101/201178v3 |access-date=2024-07-09 |language=en |doi=10.1101/201178 }} one of the most widely used tools for germline variant calling.
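As a rough illustration of how such a run might be launched, the Python sketch below only assembles and prints a command line; it does not execute anything. The launcher name {{Mono|pbrun}}, the {{Mono|germline}} subcommand, and the flags {{Mono|--ref}}, {{Mono|--in-fq}}, {{Mono|--out-bam}}, and {{Mono|--out-variants}} are written from recollection of the Parabricks documentation and should be treated as assumptions to verify against the installed version. All file names are hypothetical.

```python
import shlex


def germline_command(ref, fastq_r1, fastq_r2, out_bam, out_vcf):
    """Build an argument list for a germline run (flag names are assumptions)."""
    return [
        "pbrun", "germline",
        "--ref", ref,                    # reference genome (FASTA)
        "--in-fq", fastq_r1, fastq_r2,   # paired-end FASTQ input
        "--out-bam", out_bam,            # aligned, processed reads
        "--out-variants", out_vcf,       # called variants
    ]


# Hypothetical file names, for illustration only.
cmd = germline_command(
    "reference.fasta", "sample_1.fq.gz", "sample_2.fq.gz",
    "sample.bam", "sample.vcf",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to a process launcher.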
=== DeepVariant germline pipeline ===
Besides the pipeline that resorts to HaplotypeCaller to call variants, Parabricks also offers an alternative pipeline that still calls germline variants but is based on DeepVariant.{{cite journal | vauthors = Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT, Gross SS, Dorfman L, McLean CY, DePristo MA | title = A universal SNP and small-indel variant caller using deep neural networks | journal = Nature Biotechnology | volume = 36 | issue = 10 | pages = 983–987 | date = November 2018 | pmid = 30247488 | doi = 10.1038/nbt.4235 }}{{Citation |title=google/deepvariant |date=2024-07-04 |url=https://github.com/google/deepvariant |access-date=2024-07-09 |publisher=Google}} DeepVariant is a variant caller, developed and maintained by Google, capable of identifying mutations using a deep learning-based approach. The core of DeepVariant is a convolutional neural network (CNN) that identifies variants by transforming this task into an image classification operation. In Parabricks, the inference process is accelerated in hardware. For this pipeline, only T4, V100, and A100 GPUs are supported.
Analyses performed with this pipeline use BWA-MEM for the alignment, followed by Google's CNN for variant calling.
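To give an intuition for how variant calling can be framed as image classification, the following Python sketch turns a handful of aligned reads into a small matrix in which rows are reads and columns are reference positions. It is a deliberately simplified, single-channel toy: DeepVariant's actual input encoding is richer, and the function name, base codes, and example reads here are assumptions made for illustration.

```python
# Integer code per base; 0 marks positions a read does not cover.
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def pileup_matrix(reads, window_start, window_len):
    """Encode reads overlapping a window as rows of integer base codes.

    reads: list of (start_position, sequence) pairs, assumed to be
    already aligned without gaps.
    """
    rows = []
    for start, seq in reads:
        row = [0] * window_len
        for offset, base in enumerate(seq):
            col = start + offset - window_start
            if 0 <= col < window_len:
                row[col] = BASE_CODE.get(base, 0)
        rows.append(row)
    return rows


# Two hypothetical reads overlapping the window [101, 105).
matrix = pileup_matrix([(100, "ACGT"), (102, "GTTA")], window_start=101, window_len=4)
for row in matrix:
    print(row)
```

A classifier would receive a matrix of this kind, centered on a candidate site, and predict the genotype at that site.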
=== Human_par pipeline ===
Still compliant with GATK best practices, the human_par pipeline allows users to identify mutations in the entire human genome, including the sex chromosomes X and Y, while respecting their ploidy. For male samples, the pipeline first runs HaplotypeCaller with ploidy 2 on all regions that do not belong to the X and Y chromosomes and on the pseudoautosomal regions. Then, HaplotypeCaller analyzes the X and Y regions outside the pseudoautosomal regions with ploidy 1. For female samples, the pipeline instead runs HaplotypeCaller on the entire genome with ploidy 2.
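The ploidy rule can be summarized in a short Python sketch: diploid calling everywhere except on the parts of X and Y outside the pseudoautosomal regions in male samples, which are called as haploid. This is not Parabricks code. The function names are hypothetical, and the pseudoautosomal coordinates are commonly cited GRCh38 values included as an assumption; they should be checked against the reference build in use.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Assumed GRCh38 pseudoautosomal regions (PAR1 and PAR2); verify before use.
PAR_REGIONS = [
    Region("chrX", 10_001, 2_781_479),
    Region("chrX", 155_701_383, 156_030_895),
    Region("chrY", 10_001, 2_781_479),
    Region("chrY", 56_887_903, 57_217_415),
]


def in_par(chrom, pos):
    """True if the position lies inside a pseudoautosomal region."""
    return any(r.chrom == chrom and r.start <= pos <= r.end for r in PAR_REGIONS)


def calling_ploidy(sex, chrom, pos):
    """Ploidy to use for variant calling at a position, given the sample sex."""
    if sex not in ("male", "female"):
        raise ValueError(f"unknown sex: {sex!r}")
    if sex == "female":
        return 2  # the whole genome is called as diploid
    if chrom in ("chrX", "chrY") and not in_par(chrom, pos):
        return 1  # male X and Y outside the PAR are haploid
    return 2      # autosomes and pseudoautosomal regions


print(calling_ploidy("male", "chrX", 5_000_000))   # outside PAR
print(calling_ploidy("male", "chrX", 1_000_000))   # inside PAR1
```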
The sex of the sample can be determined in two main ways:
- Manually set it with the {{Mono|--sample-sex}} option;
- Specify the X vs. Y ratio with the range options {{Mono|--range-male}} and {{Mono|--range-female}}, and let the tool automatically infer the sex of the sample from the X and Y read counts.
The pipeline requires the user to specify at least one of these three options.
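The automatic inference can be pictured as a comparison of the X-to-Y read-count ratio against the two user-supplied ranges. The Python sketch below is a guess at that logic, not the tool's implementation: the function name is hypothetical, and the default ranges are illustrative values rather than documented defaults.

```python
def infer_sex(x_reads, y_reads, range_male=(1.0, 10.0), range_female=(150.0, 250.0)):
    """Classify a sample from its X:Y read-count ratio.

    Returns "male", "female", or None when the ratio falls in neither range.
    The default ranges are illustrative assumptions.
    """
    if y_reads == 0:
        # No Y reads: the ratio is undefined, so this sketch declines to decide.
        return None
    ratio = x_reads / y_reads
    if range_male[0] <= ratio <= range_male[1]:
        return "male"
    if range_female[0] <= ratio <= range_female[1]:
        return "female"
    return None


print(infer_sex(5_000, 1_000))     # ratio 5
print(infer_sex(200_000, 1_000))   # ratio 200
print(infer_sex(50_000, 1_000))    # ratio 50, in neither range
```

Returning no decision for ambiguous ratios, rather than forcing a label, leaves room for the sex to be set manually in those cases.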
As in the germline pipeline, because it targets germline variants, this pipeline uses BWA-MEM for the alignment, followed by HaplotypeCaller for variant calling.
=== Somatic pipeline ===
Parabricks' somatic pipeline is designed to call somatic variants, i.e., those mutations affecting non-reproductive (somatic) cells. This pipeline can analyze both tumor and non-tumor genomes, offering either tumor-only or tumor/normal analyses for comprehensive examinations.
As in the germline pipeline, the alignment task is carried out using BWA-MEM followed by GATK Mutect{{cite journal | vauthors = Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G | title = Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples | journal = Nature Biotechnology | volume = 31 | issue = 3 | pages = 213–219 | date = March 2013 | pmid = 23396013 | pmc = 3833702 | doi = 10.1038/nbt.2514 }} to identify the possible mutations. Mutect is used instead of HaplotypeCaller due to its focus on somatic mutations, as opposed to germline mutations targeted by HaplotypeCaller.
=== RNA pipeline ===
This pipeline is optimized for short variant discovery (i.e., single-nucleotide polymorphisms (SNPs) and indels) in RNA-seq data. It follows the Broad Institute's best practices for these types of analyses.
It relies on the STAR aligner,{{cite journal | vauthors = Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR | title = STAR: ultrafast universal RNA-seq aligner | journal = Bioinformatics | volume = 29 | issue = 1 | pages = 15–21 | date = January 2013 | pmid = 23104886 | pmc = 3530905 | doi = 10.1093/bioinformatics/bts635 }} a read aligner specialized for RNA sequences, to align the reads, and on HaplotypeCaller to call variants.
== Parabricks tools ==
Parabricks provides a collection of tools to perform genomics analyses, classified into six main categories according to their task. Combined, these tools constitute Parabricks' pipelines; they can also be used on their own.
For FASTQ and BAM files processing, the proposed tools are:
- {{Mono|applybqsr}}
- {{Mono|bam2fq}}
- {{Mono|bamsort}}
- {{Mono|bqsr}}
- {{Mono|fq2bam}}
- {{Mono|fq2bamfast}}
- {{Mono|fq2bam_meth}}
- {{Mono|markdup}}
- {{Mono|minimap2}} (beta)
For calling variants, the proposed tools are:
- {{Mono|deepsomatic}}
- {{Mono|deepvariant}}
- {{Mono|deepvariant_germline}}
- {{Mono|germline}} (GATK Germline Pipeline)
- {{Mono|haplotypecaller}}
- {{Mono|mutectcaller}}
- {{Mono|pacbio_germline}} (beta)
- {{Mono|postpon}}
- {{Mono|prepon}}
- {{Mono|somatic}} (Somatic Variant Caller)
For RNA processing, the proposed tools are:
- {{Mono|rna_fq2bam}}
- {{Mono|starfusion}}
For results quality control, the proposed tools are:
- {{Mono|bammetrics}}
- {{Mono|collectmultiplemetrics}}
For processing variants, the proposed tools are:
- {{Mono|dbsnp}}
For processing gVCF files, the proposed tools are:
- {{Mono|genotypegvcf}}
- {{Mono|indexgvcf}}
== Hardware support ==
Users can download and run Parabricks pipelines on their local servers, allowing for private, on-site data processing and analysis. They can also deploy Parabricks pipelines on cloud platforms, which offer better scalability for larger datasets. Supported cloud providers include AWS, GCP, OCI, and Azure.
In the latest release (v4.3.1-1), Parabricks includes support for the Nvidia GH200 Grace Hopper Superchip, a heterogeneous platform designed for high-performance computing and artificial intelligence that combines an Nvidia Grace CPU and a Hopper GPU.{{Cite book | vauthors = Simakov NA, Jones MD, Furlani TR, Siegmann E, Harrison RJ |chapter=First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads |date=2024-01-11 |title=Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops |chapter-url=https://doi.org/10.1145/3636480.3637097 |series=HPCAsia '24 Workshops |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=36–44 |doi=10.1145/3636480.3637097 |isbn=979-8-4007-1652-2}} This platform enhances application performance using both GPUs and CPUs, offering a programming model aimed at improving performance, portability, and productivity.{{Cite web |title=Grace Hopper Superchip |url=https://docs.nvidia.com/clara/parabricks/latest/GraceHopper.html |access-date=2024-07-10 |website=NVIDIA Docs |language=en}}
== Applications ==
Due to the computational power required by genomics workloads, Parabricks has been applied in research studies across several domains, especially in cancer research.{{Cite journal | vauthors = Crowgey EL, Vats P, Franke K, Burnett G, Sethia A, Harkins T, Druley TE | date = July 2021 |title=Abstract 165: Enhanced processing of genomic sequencing data for pediatric cancers: GPUs and machine learning techniques for variant detection |url=https://aacrjournals.org/cancerres/article/81/13_Supplement/165/667493/Abstract-165-Enhanced-processing-of-genomic |journal=Cancer Research |language=en |volume=81 |issue=13_Supplement |pages=165 |doi=10.1158/1538-7445.AM2021-165 |issn=0008-5472|url-access=subscription }}{{cite journal | vauthors = Ng JK, Vats P, Fritz-Waters E, Sarkar S, Sams EI, Padhi EM, Payne ZL, Leonard S, West MA, Prince C, Trani L, Jansen M, Vacek G, Samadi M, Harkins TT, Pohl C, Turner TN | title = de novo variant calling identifies cancer mutation signatures in the 1000 Genomes Project | journal = Human Mutation | volume = 43 | issue = 12 | pages = 1979–1993 | date = December 2022 | pmid = 36054329 | pmc = 9771978 | doi = 10.1002/humu.24455 }}{{cite journal | vauthors = Lee TH, Jang BS, Chang JH, Kim E, Park JH, Chie EK | title = Genomic landscape of locally advanced rectal adenocarcinoma: Comparison between before and after neoadjuvant chemoradiation and effects of genetic biomarkers on clinical outcomes and tumor response | journal = Cancer Medicine | volume = 12 | issue = 14 | pages = 15664–15675 | date = July 2023 | pmid = 37260182 | pmc = 10417181 | doi = 10.1002/cam4.6169 }}
Scientists from Washington University used the Parabricks DeepVariant pipeline for identifying variants (e.g., SNPs and small indels) in long-read Hi-Fi whole-genome sequencing (WGS) data generated with PacBio's Revio SMRT Cell technology.{{cite journal | vauthors = Manuel JG, Heins HB, Crocker S, Neidich JA, Sadzewicz L, Tallon L, Turner TN | title = High Coverage Highly Accurate Long-Read Sequencing of a Mouse Neuronal Cell Line Using the PacBio Revio Sequencer | journal = bioRxiv | date = June 2023 | pmid = 37333171 | pmc = 10274723 | doi = 10.1101/2023.06.06.543940 }}
In addition to the pipelines, individual components of Parabricks have been used as standalone tools in academic settings. For example, the accelerated DeepVariant has been employed in a novel process to reduce the processing time further for WGS Nanopore data.{{cite journal | vauthors = Goenka SD, Gorzynski JE, Shafin K, Fisk DG, Pesout T, Jensen TD, Monlong J, Chang PC, Baid G, Bernstein JA, Christle JW, Dalton KP, Garalde DR, Grove ME, Guillory J, Kolesnikov A, Nattestad M, Ruzhnikov MR, Samadi M, Sethia A, Spiteri E, Wright CJ, Xiong K, Zhu T, Jain M, Sedlazeck FJ, Carroll A, Paten B, Ashley EA | title = Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing | journal = Nature Biotechnology | volume = 40 | issue = 7 | pages = 1035–1041 | date = July 2022 | pmid = 35347328 | pmc = 9287171 | doi = 10.1038/s41587-022-01221-5 }}
In 2022, Nvidia announced a collaboration with the Broad Institute to provide researchers with the benefits of accelerated computing. The partnership covers Clara, Nvidia's suite of hardware-accelerated biomedical software, which includes Parabricks and MONAI.{{Cite web |title=The Broad Institute and NVIDIA Bring NVIDIA Clara to Terra Cloud Platform Serving 25,000 Researchers Advancing Biomedical Discovery |url=http://nvidianews.nvidia.com/news/the-broad-institute-and-nvidia-bring-nvidia-clara-to-terra-cloud-platform-serving-25-000-researchers-advancing-biomedical-discovery |access-date=2024-07-09 |website=NVIDIA Newsroom |language=en-us}} Similarly, the Regeneron Genetics Center uses Parabricks to expedite the secondary analysis of the exomes sequenced in its high-throughput sequencing center, leveraging the DeepVariant germline pipeline inside its workflows.{{Cite web |title=UK Biobank Advances Genomics Research with NVIDIA Clara Parabricks |url=https://resources.nvidia.com/en-us-hc-genomics/uk-biobank |access-date=2024-07-09 |website=NVIDIA |language=en}}
== References ==
{{Reflist}}
== Further reading ==
{{refbegin}}
- {{cite journal | vauthors = Franke KR, Crowgey EL | title = Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms | journal = Genomics & Informatics | volume = 18 | issue = 1 | pages = e10 | date = March 2020 | pmid = 32224843 | doi = 10.5808/GI.2020.18.1.e10 | publisher = BMC | pmc = 7120354 }}
- {{cite journal | vauthors = O'Connell KA, Yosufzai ZB, Campbell RA, Lobb CJ, Engelken HT, Gorrell LM, Carlson TB, Catana JJ, Mikdadi D, Bonazzi VR, Klenk JA | title = Accelerating genomic workflows using NVIDIA Parabricks | journal = BMC Bioinformatics | volume = 24 | issue = 1 | pages = 221 | date = May 2023 | pmid = 37259021 | pmc = 10230726 | doi = 10.1186/s12859-023-05292-2 | author-mask5 = et al. | doi-access = free }}
{{refend}}
== External links ==
- [https://www.nvidia.com/en-us/clara/ NVIDIA Clara]
- [https://www.nvidia.com/en-us/clara/genomics/ NVIDIA Clara for Genomics]