Wikipedia talk:Database download
{{talk page}}
{{User:MiszaBot/config
|archiveheader = {{talkarchivenav}}
|maxarchivesize = 70K
|counter = 3
|minthreadsleft = 4
|algo = old(365d)
|archive = Wikipedia talk:Database download/Archive %(counter)d
}}
:Please note that questions about the database download are more likely to be answered on the [https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l xmldatadumps-l] or [http://mail.wikipedia.org/mailman/listinfo/wikitech-l wikitech-l] mailing lists than on this talk page.
{{archives}}
How to use multistream?
The "How to use multistream?" shows
"
For multistream, you can get an index file, pages-articles-multistream-index.txt.bz2. The first field of this index is the number of bytes to seek into the compressed archive pages-articles-multistream.xml.bz2, the second is the article ID, the third the article title.
Cut a small part out of the archive with dd using the byte offset as found in the index. You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID.
See https://docs.python.org/3/library/bz2.html#bz2.BZ2Decompressor for info about such multistream files and about how to decompress them with python; see also https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream/README.txt and related files for an old working toy.
"
I have the index and the multistream file, and I can make a live USB flash drive by following
https://trisquel.info/en/wiki/how-create-liveusb
lsblk
umount /dev/sdX*
sudo dd if=/path/to/image.iso of=/dev/sdX bs=8M;sync
but I do not know dd well enough to
"Cut a small part out of the archive with dd using the byte offset as found in the index."
and then
"You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID."
Is there any video or more information on Wikipedia about how to do this, so that I can read Wikipedia pages, or at least their text, off-line?
Thank you for your time.
Other Cody (talk) 22:46, 4 December 2023 (UTC)
:The forum thread at https://trisquel.info/en/forum/how-do-you-cut-wikipedia-database-dump-dd has a user called Magic Banana with information about how to do this, and maybe others as well. Other Cody (talk) 15:44, 26 January 2024 (UTC)
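For anyone following along later, here is a minimal Python sketch of the "cut a small part out of the archive" step described in the quoted instructions above: it scans the index for a title, records the article ID, and works out the byte range to cut, using the next distinct offset in the index as the end of the slice. The file names and the example title are placeholders, not values from this thread.
<syntaxhighlight lang="python">
# Sketch only: locate an article in the multistream index and compute the
# byte range of the compressed block that contains it.
import bz2

INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"  # placeholder file name
TITLE = "Albert Einstein"                                         # placeholder title

offsets = set()       # every distinct stream offset seen in the index
target_offset = None  # offset of the stream holding TITLE
target_id = None

with bz2.open(INDEX, mode="rt", encoding="utf-8") as index:
    for line in index:
        # each line is "offset:page_id:title"; titles may themselves contain colons
        offset, page_id, title = line.rstrip("\n").split(":", 2)
        offsets.add(int(offset))
        if title == TITLE:
            target_offset = int(offset)
            target_id = int(page_id)

if target_offset is None:
    raise SystemExit(f"{TITLE!r} not found in the index")

# The slice runs from this stream's offset to the next distinct offset in the
# index (or to the end of the archive if this is the last stream).
later = sorted(o for o in offsets if o > target_offset)
end_offset = later[0] if later else None

print("article id:", target_id)
print("cut from byte", target_offset, "to", end_offset if end_offset else "end of file")
# Equivalent dd step, as in the reply below:
#   dd if=pages-articles-multistream.xml.bz2 bs=1 skip=<target_offset> \
#      count=<end_offset - target_offset> | bzip2 -dc > slice.xml
</syntaxhighlight>
The printed offsets are exactly the skip and (end minus start) count values to give dd before piping into bzip2 -dc, as the reply below demonstrates.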
:Perhaps a bit on the late side, but while I'm here (and for the benefit of anybody wondering)...
:The multistream archives use bzip2 compression in a mode where the compression algorithm resets periodically, e.g. every megabyte or so. An upshot of this (and the reason the multistream archive is always the better option to choose if you're unsure) is that you don't have to decompress the entire file from the start. If you know roughly where in the archive the data you want is located (which is what the index is for, and why it's needed with the multistream archives), you can jump to that part of the file and read and extract only a very small portion of it. This considerably speeds things up if you only want to access one or two articles, and reduces resource use accordingly.
:A quick look at the index file for jawiki-pages-articles-multistream.xml.bz2 (Japanese dump, 2025-06-01) shows an entry for the page Wikipedia:Sandbox listed as 688:6:Wikipedia:Sandbox, which may be read as "read data from byte offset 688 and extract article no. 6 from that XML stream".
:To achieve this with a combination of dd and bzip2 (assuming you're using Linux), try:
:dd if=./jawiki-20250601-pages-articles-multistream.xml.bz2 bs=1 skip=688 count=1048576 | bzip2 -dc >> ./jpwiki-test.xml.txt
:This uses dd to skip the first 688 bytes of the pages-articles-multistream file and read 1 MiB of compressed data to stdout, which is piped to bzip2; bzip2 decompresses the data into a file named jpwiki-test.xml.txt, which then contains roughly 4 MiB of decompressed XML from the pages-articles file. You can then grep this text file for the desired information, or process it in some other way according to your needs.
:tl;dr: use dd to read out a short section of the compressed data, use bzip2 to decompress that section, and feed the output into a file or wherever you need it. HtH! :-)
:ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking|T·C) 11:01, 14 June 2025 (UTC)
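To round this out with the Python route that the quoted instructions link to (bz2.BZ2Decompressor), here is a rough sketch of the decompression step: it reads one compressed slice from the archive and pulls out the page whose ID matches the index entry. The file name, offset, length and page ID are illustrative placeholders loosely based on the jawiki example above, and the string handling is deliberately simple; a real XML parser would be more robust.
<syntaxhighlight lang="python">
# Sketch only: decompress one bzip2 stream from a multistream dump and
# print the page whose <id> matches an index entry.
import bz2

DUMP = "jawiki-20250601-pages-articles-multistream.xml.bz2"  # placeholder file name
START = 688        # byte offset of the stream holding the page, from the index
LENGTH = 1048576   # read 1 MiB, as in the dd example; more than one stream's worth
PAGE_ID = 6        # page ID from the same index line (688:6:Wikipedia:Sandbox)

with open(DUMP, "rb") as dump:
    dump.seek(START)
    compressed = dump.read(LENGTH)

decomp = bz2.BZ2Decompressor()
xml = decomp.decompress(compressed).decode("utf-8", errors="replace")
# The decompressor stops at the end of the first bzip2 stream; any bytes
# belonging to the next stream land in decomp.unused_data and are ignored.

for chunk in xml.split("</page>"):
    # the page-level <id> appears before the <revision> block
    if f"<id>{PAGE_ID}</id>" in chunk.split("<revision>")[0]:
        title = chunk.split("<title>")[1].split("</title>")[0]
        print("found page:", title)
        print(chunk[:500])  # start of the matching <page> element
        break
</syntaxhighlight>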
A tool for similar multistream compressed files was written for xz compression and lives at https://github.com/kamathln/zeex . It gives a preliminary idea of the approach and could be adapted for bz2 as well. kamathln (talk) 12:21, 22 January 2025 (UTC)
How many "multiple" is "These files expand to multiple terabytes of text." - 4TB Drives are...
...cheap as chips.
In early 2025, a 4 TB disk drive is US$70 while a 4 TB SSD is just US$200, and 24 TB drives are under US$500...
It's clear that the "current version only" dump expands to just 0.086 TB. Can anyone clarify whether the "multiple" a few lines below that means expanding to 2 TB or to 200 TB? Jonathon Barton (talk) 06:17, 16 February 2025 (UTC)
Semi-protected edit request on 20 March 2025
{{edit semi-protected|Wikipedia:Database download|answered=yes}}
within "SQL Schema" section, change the link pointing to tables.sql to either tables-generated.sql or tables.json, I'd go with the former as it's more compact and readable.
the original tables.sql is empty as of Aug. 2024 and will be removed, see https://phabricator.wikimedia.org/T191231
old: https://phabricator.wikimedia.org/diffusion/MW/browse/master/maintenance/tables.sql
to either: https://phabricator.wikimedia.org/source/mediawiki/browse/master/sql/mysql/tables-generated.sql (prefered)
or: https://phabricator.wikimedia.org/source/mediawiki/browse/master/sql/tables.json
quoting from the issue: "we may want to switch to YAML later" hasn't happened yet. YAML would be the most readable format. KlausSchwab (talk) 14:22, 20 March 2025 (UTC)
:{{Done}} -- John of Reading (talk) 17:43, 23 March 2025 (UTC)
Semi-protected edit request on 27 April 2025
{{edit semi-protected|Wikipedia:Database download|answered=yes}}
The compressed size of 19 GB is not the same as the one mentioned at https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia; perhaps one of the pages is stale. 2601:600:8480:2D10:5D63:3732:312A:9F99 (talk) 20:12, 27 April 2025 (UTC)
:Yes, the figures at Wikipedia:Size of Wikipedia#Size of the English Wikipedia database are stale. Each figure is labelled "As of
Non-English dumps from 2025-06-01 now on Archive.org
Noticing that there hadn't been any apparent updates or new torrents for non-English-language versions of the Wikipedia dumps for quite a while, I've cloned the 2025-06-01 dewiki, eswiki and frwiki multistream dumps to Archive.org, which generates web-seeded torrents for uploaded material. These have been added to the MediaWiki Data dump torrents page with magnet links. Although Archive.org supports hosting and distributing material via BitTorrent, it also adds its own small metafiles to each collection, which means Archive.org torrents will not share a swarm with independent torrents of the same underlying file.
Firstly: could anybody with write access to the wikimediadownloads and wikicollections collections on Archive.org please propose or add the following Archive.org items to them:
- wikipedia-20250601-de-pages-articles-index
- wikipedia-20250601-de-pages-articles
- wikipedia-20250601-es-pages-articles-index
- wikipedia-20250601-es-pages-articles
- wikipedia-20250601-fr-pages-articles-index
- wikipedia-20250601-fr-pages-articles
Secondly: since a lot of people seem to want to know the decompressed size of the dumps (information I've added to the Data dump torrents page where I've been able to), there is a shell command that can determine this without extracting the entire file to disk. On Linux and macOS computers with bzip2 and dd installed, the following command may be used:
- bzip2 -dc ./pages-articles-multistream.xml.bz2 | dd bs=1048576 iflag=fullblock of=/dev/null
This decompresses the file and pipes the output through dd, which counts the amount of uncompressed data produced by bzip2 (here, in 1 MiB blocks) before discarding it. The output (which may take a long time to appear) gives the decompressed size of the file: the number of full records is the size in MiB, rounded down.
(e.g. an output of "20301+1 records out" means the file is just over 20,301 MiB in size. The "+1" is the count of incomplete records, almost always 0 or 1; any number higher than one may indicate a damaged or failing storage device.)
Thank you :-)
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking|T·C) 12:59, 10 June 2025 (UTC)
(A long-time Wikipedia user who lost access to his original account many years ago. A curse on passwords! :-)
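For anyone without bzip2 or dd to hand, a rough Python equivalent of the pipeline above: it streams the archive through the bz2 module and counts the decompressed bytes without writing anything to disk. The file name is a placeholder.
<syntaxhighlight lang="python">
# Sketch only: measure the decompressed size of a (multi-stream) .bz2 dump.
import bz2

DUMP = "pages-articles-multistream.xml.bz2"  # placeholder file name
total = 0

with bz2.open(DUMP, "rb") as stream:      # BZ2File handles multi-stream files
    while True:
        chunk = stream.read(1024 * 1024)  # 1 MiB of decompressed data at a time
        if not chunk:
            break
        total += len(chunk)

print(f"{total} bytes (~{total / 2**20:.0f} MiB) decompressed")
</syntaxhighlight>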