Silesia corpus

The Silesia corpus is a collection of files intended as a benchmark for testing lossless data compression algorithms. It was created in 2003 as an alternative to the Canterbury corpus and the Calgary corpus, prompted by concerns about how well those corpora represented modern files. It contains a variety of data types, including large text documents, executable files, and databases. {{Cite thesis |last=Deorowicz |first=Sebastian |title=Universal Lossless Data Compression Algorithms |publisher=Silesian University of Technology |url=https://sun.aei.polsl.pl/~sdeor/pub/deo03.pdf |pages=93–95 |archive-url=https://web.archive.org/web/20240828110434/https://sun.aei.polsl.pl/~sdeor/pub/deo03.pdf |archive-date=2024-08-28}}

The corpus consists of 12 files totaling 211,938,580 bytes (about 212 MB). The files were chosen to represent data types that the author expected to grow rapidly in size over time, such as computer programs and databases, along with more traditional compression benchmarks, such as large text files.

{| class="wikitable sortable"
|+ Overview of files, their sizes, descriptions, and data types
! File !! Size (B) !! Description !! Type of data
|-
| dickens || 10192446 || The works of Charles Dickens || English text
|-
| mozilla || 51220480 || Executable files for Mozilla 1.0 || Executable
|-
| mr || 9970564 || MRI images || 3D image
|-
| nci || 33553445 || A database of chemical structures || Database
|-
| ooffice || 6152192 || A shared library from OpenOffice || Executable
|-
| osdb || 10085684 || A sample MySQL database from the Open Source Database Benchmark || Database
|-
| reymont || 6627202 || The text of the book Chłopi by Władysław Reymont || PDF in Polish
|-
| samba || 21606400 || The source code of Samba 2‑2.3 || Source code
|-
| sao || 7251944 || The SAO star catalogue || Binary database
|-
| webster || 41458703 || The 1913 Webster Unabridged Dictionary || HTML
|-
| xml || 5345280 || Collected XML files || XML
|-
| x-ray || 8474240 || A medical X-ray || Image
|-
! colspan="1"| Total
! 211938580
! colspan="2"|
|}


Because it covers a broader and more modern selection of data types, the Silesia corpus is considered a better source of test data for compression algorithms than the Calgary corpus.{{Cite book |last1=Gupta |first1=Apoorv |last2=Bansal |first2=Aman |last3=Khanduja |first3=Vidhi |chapter=Modern lossless compression techniques: Review, comparison and analysis |date=2017-02-22 |title=2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT) |chapter-url=https://ieeexplore.ieee.org/document/8117850 |publisher=IEEE |pages=1–8 |doi=10.1109/ICECCT.2017.8117850 |isbn=978-1-5090-3239-6}}
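As an illustration of how a corpus file might be used as test data, the following Python sketch computes the compression ratio (original size divided by compressed size) for three codecs from the Python standard library. The function names `compression_ratios` and `benchmark_file`, the choice of codecs and compression levels, and the path in the usage note are assumptions made for this example; they are not part of the corpus or of any published benchmark procedure.

```python
import bz2
import lzma
import zlib


def compression_ratios(data: bytes) -> dict[str, float]:
    """Return original_size / compressed_size for three stdlib codecs."""
    if not data:
        raise ValueError("data must be non-empty")
    # Compressed sizes in bytes; the levels are illustrative choices.
    compressed_sizes = {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }
    return {name: len(data) / size for name, size in compressed_sizes.items()}


def benchmark_file(path: str) -> dict[str, float]:
    """Read one file fully into memory and measure its compression ratios."""
    with open(path, "rb") as handle:
        return compression_ratios(handle.read())
```

Assuming the corpus has been unpacked into a local directory named `silesia` (a hypothetical location), `benchmark_file("silesia/dickens")` would return one ratio per codec for the English text file. A higher ratio means the codec reduced that file more.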

References

External links

https://sun.aei.polsl.pl//~sdeor/pub/deo03.pdf

Category:Data compression

Category:Test items
