User:Korrawit/CleanupWikiLinks

Disclaimer: Some data may change as time goes on. And I've written this two weeks after, so something may be incorrect. :)

Since an upgrading of Mediawiki to version 1.17, there was a problem regarding hard-coded direct links to files, ones with cgi_img_auth.php. The problem is links are not usable now. So, I decided to clean this up by replacing with Media: namespace, which will give you direct link on-the-fly.

Tool
I've used Pywikipediabot.

Setting up

 * 1) Installation
 * 2) Some more setup
 * 3) You have 2 ways to create family file. I named it families/tdf_family.py.
 * 4) Run python generate_family_file.py. This will create it automatically.
 * 5) Manually create it.
 * 6) Create user-config.py as described here

This is my families/tdf_family.py.

And this is my user-config.py.

Process overview
Everything is run from command line, at Pywikipediabot root folder.
 * 1)  to this wiki with login.py
 * 2)  in article (main) namespace with pagegenerators.py
 * 3)  using regular expression with replace.py

Logging in
Just run python login.py and enter your password.

Getting a list of all pages
Use pagegenerators.py to list all pages. Run python pagegenerators.py -start:! > pageall -start:! means start at the beginning of list.

Oops! Problem occurred. The page keywords caused it, because the bot thinks it's in Svn</tt> namespace, but it isn't.

But that page doesn't have any hard-coded links to replace, just skip it. Looking at All pages starting at Svn:keywords, the next page is System Operations</tt>. So, run python pagegenerators.py -start:"System Operations" > pageall2 to get the rest of list. Note that the ideal solution is to rename that page.

Next, combine two files and named pageall</tt>. It now looks like this: And have about thousand lines. We would like to have a list of wikilinks, so gives:

Explanation of regular expression used: Note that we have to escape {}</tt> and </tt>, but not []</tt>.

Replace hard-coded links
Now we get a list of all pages in pageall_link</tt>. Use replace.py to make a fix.

The hard-coded links are in many formats. For example:

So, replacing ones with piping link first:

Explanation of regular expressions used:

Next, fix another pattern: a bare link without []</tt>.

Explanation of regular expressions used:

These two should covers most cases, but we would like to check again with very simple match: This will save all remaining matched pages into page_remaining</tt>. And since we only -save</tt> it, there is no real edits and no harm, so just -always -save</tt>.

And check page_remaining</tt> again for some strange pattern. Pywikipediabot has an option to let we manually edit via browser anyway.

Summary
At first, I replaced links with File:</tt> namespace. But just noticed that Media:</tt> namespace should be more correct. So, result is at Special:Contributions/KorrawitBot.

And as suggested by Florian on website mailing-list, I wrote Help:Editing

TODO

 * Patrol for new version (currently usable) of hard-coded links. Maybe automatically?

Thanks

 * Jan Holesovsky who noticed a problem
 * MediaWiki help
 * Pywikipediabot
 * an example of regular expression description by Bart Kiers from stackoverflow