User:Beluga

How I downloaded Jay's Google Docs documents
Jay retired and put all his Google Docs links on his user page. I took a backup of all of them except the "folderview" stuff.

I opened the wiki editor for the page and copied the whole contents to a text file.

I used a Python script to extract the URLs.

It has urlextract as a dependency, so I first installed it (on Linux) with

sudo pip install urlextract

This is the script:
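A minimal sketch of it, assuming the copied wikitext was saved as wikitext.txt (the file name is just an assumption):

 #!/usr/bin/env python3
 # Extract every URL from the saved wikitext and print them one per line.
 from urlextract import URLExtract

 with open("wikitext.txt") as f:
     text = f.read()

 extractor = URLExtract()
 for url in extractor.find_urls(text):
     print(url)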

I copied the output to a text editor for further cleanup with find & replace and by throwing out the unneeded parts. I put the documents and spreadsheets into separate text files so that they contained only the IDs, separated by line breaks, like so:

1YznFkDn91kMH6hoNsXkLzjRxzOJGdF1sqYXhPgPAN18
1FXFHzGzu2m9OcWu-m5BYf1Q-OBsDjkqiXtygLfPyhNk
1AmDXLkQiFK5OcOrtuGn7iAcqp2Yhc_eNNnlADk-ji-U
198zpaE2SKD0MIQUmSKb-s9vVCZy5dsdLg5JXhJ82iHg
1zBRA3a8B8lR_Dw-7sJ0jsxKYRw5AlnuvvKnQCYBFWRk

Then I created docx.sh:
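A sketch of what it could look like, assuming the standard Google Docs export endpoint and the ID-per-line input file described above:

 #!/bin/bash
 # Download each Google Docs document listed (one ID per line) in the
 # file given as the first argument, saving it as <ID>.docx.
 while read -r id; do
     curl -L -o "${id}.docx" \
         "https://docs.google.com/document/d/${id}/export?format=docx"
     sleep 1  # be gentle with the server
 done < "$1"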

and xlsx.sh:
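which could be the same loop pointed at the spreadsheet export endpoint:

 #!/bin/bash
 # Same as docx.sh, but for spreadsheets: download each ID as <ID>.xlsx.
 while read -r id; do
     curl -L -o "${id}.xlsx" \
         "https://docs.google.com/spreadsheets/d/${id}/export?format=xlsx"
     sleep 1
 done < "$1"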

I ran them like so to download the files:

./docx.sh docx.txt

All pages under QA
https://wiki.documentfoundation.org/index.php?title=Special%3APrefixIndex&prefix=QA&namespace=0

How I did the initial processing for the OOo Developer's Guide renovation
To convert to wikitext, I ran the equivalent of this in the directory with the separate .odt chapters:
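A rough sketch of that loop, assuming pandoc for the ODT to MediaWiki conversion; the sed patterns and the template/category names are illustrative only:

 #!/bin/bash
 # Convert each chapter to MediaWiki markup, rebrand, and append
 # wiki navigation and categories.
 for f in *.odt; do
     out="${f%.odt}.wiki"
     pandoc -f odt -t mediawiki "$f" -o "$out"
     # Rebrand; the real patterns were more selective, leaving
     # version-specific OOo references untouched.
     sed -i 's/OpenOffice\.org/LibreOffice/g' "$out"
     # Append the navigation template and category (placeholder names).
     sed -i -e '$a {{DevGuideNavigation}}' -e '$a [[Category:Documentation]]' "$out"
 done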

It changes the branding to LibreOffice where appropriate, but leaves OOo in place for cases that refer to a specific version. The sed command adds the TDF wiki navigation and categories.

After unzipping the chapters to examine the images, I noticed the images on wiki.openoffice.org were of higher quality. I set about downloading them all.

I copied all the items from here (and subsequent pages) to a file: https://wiki.openoffice.org/w/index.php?title=Special%3APrefixIndex&prefix=Documentation%2FDevGuide&namespace=0

I did some find & replace text manipulation to get the links (one per line) into the format https://wiki.openoffice.org/wiki/Documentation/DevGuide/Accessibility/A_Simple_Screen_Reader
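The same reformatting could also be scripted; a sed sketch, assuming the copied titles sit one per line in devguide_pages.txt with spaces instead of underscores:

 # Turn page titles into full wiki URLs in place.
 sed -i 's/ /_/g; s|^|https://wiki.openoffice.org/wiki/|' devguide_pages.txt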

I downloaded all the pages, but noticed many of them were 0 bytes. Dennis said that these problematic pages could still be opened in edit mode (some problem with a missing extension in the wiki). So I had to create a more complex downloading pipeline that directs the links of all erroring pages to a new .txt file. The if statement $? == 22 checks for curl exit code 22, which curl returns with -f/--fail when the server replies with an HTTP error. Between downloads it sleeps a little so as not to hammer the server too much. The forward slashes are replaced by plus signs in the filenames.
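A sketch of such a pipeline; the errors.txt name and the URL prefix that gets stripped are placeholders:

 #!/bin/bash
 # Download each wiki page listed (one URL per line) in the file given
 # as the first argument. Failing pages are collected for a second pass.
 while read -r url; do
     # Build a filename from the page path, replacing / with + .
     name=$(echo "$url" | sed 's|https://wiki.openoffice.org/wiki/||; s|/|+|g')
     curl -f -s -o "${name}.html" "$url"
     if [ $? == 22 ]; then
         echo "$url" >> errors.txt
     fi
     sleep 2  # don't hammer the server
 done < "$1"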

Then I downloaded the edit mode links:
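A sketch, continuing the placeholder names from above and assuming edit mode is reached through action=edit:

 #!/bin/bash
 # Fetch the failing pages again, this time in edit mode, where the
 # raw wikitext is still served inside the edit form.
 while read -r url; do
     title=$(echo "$url" | sed 's|https://wiki.openoffice.org/wiki/||')
     name=$(echo "$title" | sed 's|/|+|g')
     curl -f -s -o "${name}.html" \
         "https://wiki.openoffice.org/w/index.php?title=${title}&action=edit"
     sleep 2
 done < errors.txt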

Downloading the actual images was its own special exercise in hula hooping. Xmllint is a cool tool for grabbing elements from markup, but unfortunately it is unable to print just the (href) attribute values of multiple elements at the same time. So instead of the values, I had to print the whole href="blabla" and trim it with awk and sed. The awk expression prints what comes after the = (equal sign). In the sed expression, t without any label conditionally skips all following commands, and d deletes the line. This keeps only the lines containing .png, to be sure no unwanted ones are taken. It also removes the quotes around the URL. Again, care is taken to use + as a separator in place of /. The most important thing in the result is that the image filenames include the chapter and section information.

For downloading through the normal pages:
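A sketch of that pipeline; the exact XPath, whether the href points at the image file itself, and the way the page name ends up in the image filename are assumptions:

 #!/bin/bash
 # For every downloaded page, pull the .png hrefs out of the HTML and
 # fetch the images, prefixing each filename with the page name so the
 # chapter and section information is preserved.
 for f in *.html; do
     page="${f%.html}"
     xmllint --html --xpath '//a/@href' "$f" 2>/dev/null | tr ' ' '\n' \
         | awk -F'=' '{print $2}' \
         | sed -e 's/^"\(.*\.png\)"$/\1/' -e 't' -e 'd' \
         | while read -r href; do
             img=$(echo "$href" | sed 's|/|+|g')
             curl -f -s -L -o "${page}+${img}" "https://wiki.openoffice.org${href}"
             sleep 1
         done
 done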

For downloading through the edit mode pages (targeting wikitext tags instead of any HTML):
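A sketch; the [[Image:...]] pattern and the Special:FilePath redirect are assumptions about how the wikitext tags were targeted:

 #!/bin/bash
 # For every edit-mode page, pull the image names out of the wikitext
 # [[Image:...]] tags and fetch them through the file redirect, again
 # keeping the page name in the resulting filename.
 for f in *.html; do
     page="${f%.html}"
     grep -o '\[\[Image:[^]|]*\.png' "$f" \
         | sed 's/\[\[Image://' \
         | while read -r img; do
             img="${img// /_}"  # wikitext names may contain spaces
             curl -f -s -L -o "${page}+${img}" \
                 "https://wiki.openoffice.org/wiki/Special:FilePath/${img}"
             sleep 1
         done
 done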