Draft:BGZF

{{AFC submission|||u=Rare Moon|ns=118|ts=20250613231533}}

{{AFC submission|d|reason|The article was resubmitted without any edits since the previous review.|u=Rare Moon|ns=118|decliner=Caleb Stanford|declinets=20250613205930|ts=20250613171747}}

{{AFC submission|d|nn|u=Rare Moon|ns=118|decliner=SafariScribe|declinets=20250611013954|small=yes|ts=20250608204948}}

{{AFC submission|d|nn|u=Rare Moon|ns=118|decliner=CSMention269|declinets=20250604063106|small=yes|ts=20250603210318}}

{{AFC submission|d|context|u=Rare Moon|ns=118|decliner=Caleb Stanford|declinets=20250510204054|small=yes|ts=20250401234720}}

{{AFC submission|d|nn|u=Rare Moon|ns=118|decliner=Jlwoodwa|declinets=20250401232856|small=yes|ts=20250401230902}}

{{AFC submission|d|v|u=Rare Moon|ns=118|decliner=Liance|declinets=20250401212323|reason2=nn|small=yes|ts=20250401200547}}

{{AFC comment|1=I will hold off on reviewing this one for now and await an independent (3rd opinion) review. Caleb Stanford (talk) 20:49, 19 June 2025 (UTC)}}

{{AFC comment|1=At least one reference shows a basic misunderstanding of referencing, and you need to check them all. A reference must be about the subject, not merely a mention that it has been used in research. Each reference must meet the following tough criteria:

We require references from significant coverage about the topic of the article, and independent of it, in multiple secondary sources which are WP:RS please. See WP:42. Please also see WP:PRIMARY which details the limited permitted usage of primary sources and WP:SELFPUB which has clear limitations on self published sources. Providing sufficient references, ideally one per fact referred to, that meet these tough criteria is likely to allow this article to remain. Lack of them or an inability to find them is likely to mean that the topic is not suitable for inclusion, certainly today. FiddleTimtrent FaddleTalk to me 00:05, 14 June 2025 (UTC)}}

{{AFC comment|1=There are no changes since last submission; please see my helpdesk post for discussion of sources regarding notability and context around submission. Thank you! Rare Moon (talk) 17:18, 13 June 2025 (UTC)}}

{{AFC comment|1=Please see my helpdesk post for discussion of sources regarding notability. Thank you! Rare Moon (talk) 17:18, 13 June 2025 (UTC)}}

{{AFC comment|1=Re: previous review(s): I have added additional citations that discuss BGZF and build upon the features of the algorithm. The GNG comment appears more generic and reflects a lack of WP:AFCPURPOSE; please note that this is a file format and algorithm (used widely in the field) as supported by citations. The key technical algorithm information obviously comes from the original spec, but that does not mean it's not notable. The use of file format / algorithm is discussed in many contexts and those manuscripts do not need to provide algorithm discussion again (as is the norm, to cite the original manuscript). Consider for example gzip article. Thank you! Rare Moon (talk) 20:49, 8 June 2025 (UTC)}}

{{AFC comment|1=Sources lack qualifying WP:GNG. Counter-Strike:Mention 269 06:31, 4 June 2025 (UTC)}}

{{AFC comment|1=The article uses a large amount of technical jargon and does not provide a general introduction to the subject. I am not sure if the topic meets GNG. Caleb Stanford (talk) 20:40, 10 May 2025 (UTC)}}

{{AFC comment|1=Publications by the developers of BGZF are not independent. jlwoodwa (talk) 23:28, 1 April 2025 (UTC)}}

----

{{Short description|File format for block-based Gzip compression}}

{{Draft topics|computing}}

{{AfC topic|stem}}

{{Infobox file format

| name = BGZF (Blocked GNU Zip Format)

| extension = .gz

| mime = application/gzip

| magic = `\x1f\x8b\x08\x04` (initial bytes, standard Gzip magic. BGZF adds specific extra fields in the header of each block.)

| owner = SAMtools project / HTSlib

| released = c. 2009 (along with SAM/BAM format specification)

| container_for = Commonly used for bioinformatics data like SAM, BAM, VCF records

| extended_from = Gzip

| standard = https://samtools.github.io/hts-specs/SAMv1.pdf#page=13.12

| genre = Compressed file format, Indexed file format

| website = {{URL|http://www.htslib.org/doc/bgzip.html}}

}}

Blocked GNU Zip Format (BGZF) is a variant of the gzip file format that uses block compression: data are compressed in independent blocks, each of which is a valid gzip file. This design is widely used in bioinformatics for genomic data compression.{{Cite journal |last1=Lan |first1=Divon |last2=Tobler |first2=Ray |last3=Souilmi |first3=Yassine |last4=Llamas |first4=Bastien |date=2021-08-25 |title=Genozip: a universal extensible genomic data compressor |journal=Bioinformatics (Oxford, England) |volume=37 |issue=16 |pages=2225–2230 |doi=10.1093/bioinformatics/btab102 |issn=1367-4811 |pmc=8388020 |pmid=33585897 |quote=BGZF-block level indexing that is common in standard indexes of genomic file formats}} The block-based design provides efficient storage, random access through indexed queries,{{Cite journal |last=Yamada |first=Taiju |date=2020-04-01 |title=7bgzf: Replacing samtools bgzip deflation for archiving and real-time compression |url=https://www.sciencedirect.com/science/article/abs/pii/S1476927119311375 |journal=Computational Biology and Chemistry |volume=85 |pages=107207 |doi=10.1016/j.compbiolchem.2020.107207 |pmid=32092548 |issn=1476-9271}}{{Cite journal |last1=Danecek |first1=Petr |last2=Bonfield |first2=James K |last3=Liddle |first3=Jennifer |last4=Marshall |first4=John |last5=Ohan |first5=Valeriu |last6=Pollard |first6=Martin O |last7=Whitwham |first7=Andrew |last8=Keane |first8=Thomas |last9=McCarthy |first9=Shane A |last10=Davies |first10=Robert M |last11=Li |first11=Heng |date=2021-02-01 |title=Twelve years of SAMtools and BCFtools |url=https://doi.org/10.1093/gigascience/giab008 |journal=GigaScience |volume=10 |issue=2 |pages=giab008 |doi=10.1093/gigascience/giab008 |issn=2047-217X |pmc=7931819 |pmid=33590861 |quote=[..] both formats can be either plain (uncompressed) or block-compressed with BGZF for random access and compact size.}} and parallel processing, enabling large-scale data processing.{{Cite journal |last1=Hernaez |first1=Mikel |last2=Pavlichin |first2=Dmitri |last3=Weissman |first3=Tsachy |last4=Ochoa |first4=Idoia |date=2019-07-20 |title=Genomic Data Compression |url=https://www.annualreviews.org/content/journals/10.1146/annurev-biodatasci-072018-021229 |journal=Annual Review of Biomedical Data Science |language=en |volume=2 |issue= |pages=19–37 |doi=10.1146/annurev-biodatasci-072018-021229 |issn=2574-3414}}

The format was developed as part of the SAM/BAM specification and SAMtools.{{Cite journal |last1=Li |first1=Heng |last2=Handsaker |first2=Bob |last3=Wysoker |first3=Alec |last4=Fennell |first4=Tim |last5=Ruan |first5=Jue |last6=Homer |first6=Nils |last7=Marth |first7=Gabor |last8=Abecasis |first8=Goncalo |last9=Durbin |first9=Richard |date=15 August 2009 |title=The Sequence Alignment/Map format and SAMtools |journal=Bioinformatics |volume=25 |issue=16 |pages=2078–2079 |doi=10.1093/bioinformatics/btp352 |issn=1367-4803 |pmc=2723002 |pmid=19505943 |collaboration=1000 Genome Project Data Processing Subgroup}} It is a core component of the common BAM format (the binary version of the Sequence Alignment Map format) and is also used to compress and index Variant Call Format (VCF), FASTA, and BED files.{{Cite journal |last1=Bonfield |first1=James K |last2=Marshall |first2=John |last3=Danecek |first3=Petr |last4=Li |first4=Heng |last5=Ohan |first5=Valeriu |last6=Whitwham |first6=Andrew |last7=Keane |first7=Thomas |last8=Davies |first8=Robert M |date=2021-02-01 |title=HTSlib: C library for reading/writing high-throughput sequencing data |url=https://academic.oup.com/gigascience/article/10/2/giab007/6139334 |journal=GigaScience |volume=10 |issue=2 |pages=giab007 |doi=10.1093/gigascience/giab007 |issn=2047-217X |pmc=7931820 |pmid=33594436}} Because each block is itself a valid gzip file, a BGZF file can be decompressed by any standard gzip-compatible tool, ensuring backward compatibility.{{Cite web |title=bgzip(1) manual page |url=https://www.htslib.org/doc/bgzip.html |access-date=2025-06-03 |website=www.htslib.org}} A general-purpose compression utility, bgzip, is distributed with the HTSlib software library.

Uses

BGZF is widely used in bioinformatics to compress large datasets for which efficient random access is required. Because next-generation sequencing data files such as SAM files are large,{{Cite news |last=Weeks |first=N. T. |date=2018 |title=Openmp task parallelism for faster genomic data processing. |url=https://www.openmp.org/wp-content/uploads/OpenMP-Task-Parallelism-for-Faster-Genomic-Data-Processing.pdf |quote=Reading, decoding, sorting, encoding, and writing large sequence alignment files (tens or hundreds of GBs) can be time-consuming and resource intensive.}} they are commonly converted to the binary BAM format, which uses BGZF compression.{{Cite book |last1=Sadikin |first1=Rifki |last2=Arisal |first2=Andria |last3=Omar |first3=Rofithah |last4=Mazni |first4=Nur Hidayah |chapter=Processing next generation sequencing data in map-reduce framework using hadoop-BAM in a computer cluster |date=November 2017 |title=2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE) |chapter-url=https://ieeexplore.ieee.org/document/8285542 |pages=421–425 |doi=10.1109/ICITISEE.2017.8285542|isbn=978-1-5386-0658-2 }}

For random access, an index file is created for a BGZF-compressed file, typically using Tabix. The index stores the file offsets of the compressed blocks alongside the corresponding genomic coordinates. A program can therefore seek directly to the block or blocks containing the queried data, decompress only those blocks, and retrieve the requested information without processing the entire file.
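The lookup step can be illustrated with a deliberately simplified sketch. The index below is a hypothetical toy structure (a sorted list pairing the first genomic position stored in each block with that block's file offset, for a single chromosome); it is not the Tabix index layout, and all values are made up for illustration.

```python
import bisect

# Hypothetical toy index for one chromosome (NOT the Tabix on-disk format):
# each entry is (first genomic position stored in the block, file offset of the block),
# sorted by position. The numbers are illustrative only.
TOY_INDEX = [(1, 0), (50_001, 18_432), (100_001, 37_120)]
BLOCK_STARTS = [start for start, _ in TOY_INDEX]


def block_offset_for(position: int) -> int:
    """Return the file offset of the block whose range covers `position`."""
    # bisect_right finds the first block starting after `position`;
    # the block just before it is the one that covers the position.
    i = bisect.bisect_right(BLOCK_STARTS, position) - 1
    if i < 0:
        raise KeyError(f"position {position} precedes the first indexed block")
    return TOY_INDEX[i][1]
```

A reader would then seek to the returned offset and decompress only that block.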

The format is also widely used to compress Variant Call Format (VCF) files, together with their associated Tabix indexes,{{Cite journal |last=Li |first=Heng |date=2011-03-01 |title=Tabix: fast retrieval of sequence features from generic TAB-delimited files |url=https://academic.oup.com/bioinformatics/article/27/5/718/262743 |journal=Bioinformatics |volume=27 |issue=5 |pages=718–719 |doi=10.1093/bioinformatics/btq671 |issn=1367-4803 |pmc=3042176 |pmid=21208982}} and other large genomic data files such as BED, GFF/GTF, and occasionally FASTQ when indexed access is needed. Many bioinformatics software packages can read and write BGZF-compressed files, including SAMtools, HTSlib, BCFtools and VCFtools,{{Cite journal |last1=Danecek |first1=Petr |last2=Bonfield |first2=James K |last3=Liddle |first3=Jennifer |last4=Marshall |first4=John |last5=Ohan |first5=Valeriu |last6=Pollard |first6=Martin O |last7=Whitwham |first7=Andrew |last8=Keane |first8=Thomas |last9=McCarthy |first9=Shane A |last10=Davies |first10=Robert M |last11=Li |first11=Heng |date=1 February 2021 |title=Twelve years of SAMtools and BCFtools |journal=GigaScience |volume=10 |issue=2 |doi=10.1093/gigascience/giab008 |issn=2047-217X |pmc=7931819 |pmid=33590861}} Picard tools, the GATK, and libraries such as Biopython.{{cite web |title=Bio.bgzf module – Biopython 1.85 documentation |url=https://biopython.org/docs/1.85/api/Bio.bgzf.html |access-date=3 June 2025 |publisher=Biopython}}{{cite web |title=BlockCompressedOutputStream (htsjdk 2.8.1 API) |url=https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/util/BlockCompressedOutputStream.html |access-date=3 June 2025 |publisher=SAMtools/HTSlib}} The standard command-line utility for creating BGZF-compressed files and their corresponding .gzi indexes is bgzip, which is distributed as part of HTSlib.

The block-based design of BGZF has also been adapted to develop more efficient, data-specific compression methods and algorithms.{{Cite journal |last1=Li |first1=Miaoxin |last2=Li |first2=Jiang |last3=Li |first3=Mulin Jun |last4=Pan |first4=Zhicheng |last5=Hsu |first5=Jacob Shujui |last6=Liu |first6=Dajiang J. |last7=Zhan |first7=Xiaowei |last8=Wang |first8=Junwen |last9=Song |first9=Youqiang |last10=Sham |first10=Pak Chung |date=2017-05-19 |title=Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework |journal=Nucleic Acids Research |volume=45 |issue=9 |pages=e75 |doi=10.1093/nar/gkx019 |issn=1362-4962 |pmc=5435951 |pmid=28115622}}

Design schema

A BGZF file consists of a series of concatenated BGZF blocks. Each block, whether in its compressed or uncompressed state, is limited to a maximum size of 64 kilobytes. Each BGZF block is itself a fully compliant gzip archive, adhering to the specifications outlined in RFC 1952.

File:BGZF Block.png

Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:

  1. The F.EXTRA bit in the header is set to indicate that extra fields are present.
  2. The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII 'BC').
  3. The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
  4. The payload of the BGZF extra field is a 16-bit unsigned integer in little-endian format. This integer gives the size of the containing BGZF block minus one.
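The header rules above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the HTSlib implementation; the function name bgzf_block_size is hypothetical. It walks the gzip extra subfields as laid out in RFC 1952, finds the 'BC' subfield, and returns the stored value plus one.

```python
import struct


def bgzf_block_size(block: bytes) -> int:
    """Return the total size in bytes of the BGZF block starting at block[0]."""
    # Fixed 12-byte gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN (little-endian).
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from("<BBBBIBBH", block, 0)
    if (id1, id2, cm) != (0x1F, 0x8B, 8):
        raise ValueError("not a gzip header using DEFLATE")
    if not flg & 0x04:
        raise ValueError("the extra-field flag is not set, so this is not a BGZF block")

    # Walk the extra subfields: SI1, SI2, SLEN, then SLEN bytes of payload.
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", block, pos)
        if (si1, si2, slen) == (66, 67, 2):  # ASCII 'B', 'C', two-byte payload
            (stored,) = struct.unpack_from("<H", block, pos + 4)
            return stored + 1  # the payload holds the block size minus one
        pos += 4 + slen
    raise ValueError("no 'BC' extra subfield found")
```

Applied to a header whose payload bytes are 1b 00 (decimal 27), the function returns 28.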

This block design allows use of an associated index file (storing offsets of each BGZF block) to fetch and decompress only the block of data that pertains to the query, thus avoiding the computational overhead of reading and decompressing all BGZF blocks.
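The following self-contained Python sketch illustrates this idea with the standard library only. The helper names (make_block, block_offsets, read_block) are hypothetical, and the code assumes that the 'BC' subfield is the only extra subfield, so that the block-size field sits at byte offset 16 of every block; real files may carry additional subfields, and production code should use HTSlib or a similar library.

```python
import gzip
import struct
import zlib


def make_block(data: bytes) -> bytes:
    """Build one BGZF-style block from `data` (assumed small enough for a 64 KB block)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # wbits=-15 gives a raw DEFLATE stream
    cdata = comp.compress(data) + comp.flush()
    stored = len(cdata) + 25  # 18-byte header + payload + 8-byte trailer, minus one
    header = struct.pack("<BBBBIBBHBBHH",
                         0x1F, 0x8B, 8, 4, 0, 0, 0xFF,  # gzip header with the extra-field flag set
                         6, 66, 67, 2, stored)          # XLEN, 'B', 'C', SLEN, block size - 1
    trailer = struct.pack("<II", zlib.crc32(data), len(data))  # CRC32 and uncompressed size
    return header + cdata + trailer


def block_offsets(buf: bytes) -> list:
    """Collect the start offset of every block by hopping from header to header."""
    offsets, pos = [], 0
    while pos < len(buf):
        offsets.append(pos)
        (stored,) = struct.unpack_from("<H", buf, pos + 16)  # assumes 'BC' is the first subfield
        pos += stored + 1
    return offsets


def read_block(buf: bytes, offset: int) -> bytes:
    """Decompress only the block that starts at `offset`."""
    (stored,) = struct.unpack_from("<H", buf, offset + 16)
    return gzip.decompress(buf[offset:offset + stored + 1])


chunks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n"]
bgzf_data = b"".join(make_block(c) for c in chunks)
offsets = block_offsets(bgzf_data)
```

Here read_block(bgzf_data, offsets[1]) decompresses only the second chunk, while gzip.decompress(bgzf_data) still recovers the whole stream, because each block is an ordinary gzip member.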

= Random access =

= EOF marker =

The BGZF end-of-file (EOF) marker allows programs to detect erroneously truncated files and to generate warnings or errors for the user. The EOF marker block is an empty BGZF block (a data block of length zero) encoded with the default zlib compression level settings, and consists of the following 28 bytes, shown in hexadecimal:

1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00

The presence of an EOF marker block by itself does not signal the end of the file. However, an EOF marker at the end of a BGZF file indicates that the immediately following physical EOF is the end of the file as intended by the program that wrote it.{{Cite web |title=HTS format specifications |url=https://samtools.github.io/hts-specs/ |access-date=2025-04-01 |website=samtools.github.io}}
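A truncation check based on this marker might look like the following Python sketch. The function name has_eof_marker is hypothetical, and the check simply compares the last 28 bytes of the data with the marker listed above.

```python
import gzip

# The 28-byte EOF marker block, transcribed from the byte listing above.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_eof_marker(data: bytes) -> bool:
    """Return True if `data` ends with the BGZF EOF marker block."""
    return data.endswith(BGZF_EOF)


# The marker is itself a valid gzip file that decompresses to zero bytes.
empty_payload = gzip.decompress(BGZF_EOF)
```

A file that fails this check may have been truncated. Because the marker can in principle also occur elsewhere in a file, only its presence at the very end is meaningful.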

See also

References

{{reflist}}{{Compression methods}}{{Draft categories|:Category:File formats

:Category:Data compression

:Category:Bioinformatics

}}