In two days... (August 2007)

Changelog:

August 13, 2007:Slashdotted!
August 22, 2007:Fixed bugs with "extreme" links in some wiki text
September 9, 2007:Introduced extra script to install the templates used by a page (improves rendering of some pages)
September 17, 2007:It appears that woc.fslab.de (the site offering the standalone wiki renderer) is down. Until they come back up, you can download the latest snapshot of the renderer here.
October 21, 2007:I found an article whose wiki text caused the woc.fslab.de parser to abort - modified show.pl to use the plain wiki content in that case.
March 31, 2008:Robin Paulson notified me that the same process can be used to install Wiktionary offline!
April 10, 2009:Ivan Reyes found out why some wiki texts caused the woc.fslab.de parser to fail, mediawiki_sa.tar.7z updated. Thanks, Ivan!
September 9, 2009:Meng WANG made some modifications for searching in the Chinese (and probably other non-English) wikipedias.
September 19, 2009:James Somers added caching of LaTEX-made images. His patch is available here.
November 6, 2009:The woc.fslab.de repository is apparently no longer accessible... I also found out that PHP now has "namespace" as a keyword, and that the old fslab.de tarball has PHP code that uses a "class Namespace". I therefore patched it to make it work with today's PHP interpreters (the updated woc.fslab.de tarball is here).

Executive summary

It's strong points: Here's a screenshot:
screenshot

What this is and why I built it

Wikipedia needs no introductions: it is one of the best - if not the best - encyclopedias, and it's freely available for everyone.

Everyone can be a relative term, however... It implies availability of an Internet connection. This is not always the case; for example, many people would love to have Wikipedia on their laptop, since this would allow them to instantly check for things they want regardless of their location (business trips, hotels, firewalled meeting rooms, etc). Others simply don't have an Internet connection - or they don't want to dial up one every time they need to look something up.

Up to now, installing a local copy of Wikipedia is not for the faint of heart: it requires a LAMP or WAMP installation (Linux/Windows, Apache, MySQL, php), and it also requires a tedious - and VERY LENGTHY procedure that transforms the "pages/articles" Wikipedia dump file into data of a MySQL database. When I say *lengthy*, I mean it: last time I did this, it took my Pentium4 3GHz machine more than a day to import Wikipedia's XML dump into MySQL. 36 hours, to be precise.

The result of the import process was also not exactly what I wanted: I could search for an article, if I knew it's exact name; but I couldn't use parts of the name to search; if you don't use the exact title, you get nothing. To allow these "free-style" searches to work, one must create the search tables - which I'm told, takes days to build. DAYS!

Wouldn't it be perfect, if we could use the wikipedia "dump" data JUST as they arrive after the download? Without creating a much larger (space-wise) MySQL database? And also be able to search for parts of title names and get back lists of titles with "similarity percentages"?

Follow me...

Identifying the tools

First, let's try to keep this as simple as possible. We'll try to avoid using large, complex tools. If possible, we'll do it using only ready made tools and scripting languages (Perl, Python, PHP). If we need something that runs fast, we'll use C/C++.

What do we need to pull this through?

Anyway, I think I'll stop here. Make sure you have Python, Perl, Php (the big hammers), Xapian and Django (the small ones), install this package, and you will be able to enjoy the best (currently) possible offline browsing of Wikipedia. Just untar it, type 'make' and follow the instructions.

Version wise - some people asked this - I used the following (but since I am only using simple constructs, I believe other versions will work just fine): Perl 5.8.5, Python 2.5, PHP 5.2.1, Xapian 1.0.2 and Django 0.9.6.

Update, September 9, 2007: Some of the pages appear less appealing than what they can - they use wiki templates, which are not installed in woc.fslab.de's renderer by default. These templates are, however, inside the .bz2 files - all we need to do is install them in the renderer. If you meet such a page, execute fixTemplatesOfCurrentPage.pl from within the installation directory: the script (part of the package) will read the data of the last page shown, and install the necessary templates in the renderer. Simply refresh your browser, and the templates will then be rendered correctly.

Update, September 17, 2007: It appears that woc.fslab.de (the site offering the standalone wiki renderer) is down. You can download the latest snapshot of the wiki renderer here, so comment out the subversion checkout command from the Makefile, and just untar this instead.

Update, October 21, 2007: I found an article whose wiki text caused the woc.fslab.de parser to abort - modified show.pl to use the plain wiki content in that case.

Update, March 31, 2008: According to Robin Paulson, the same process described here also works for Wiktionary. I didn't even know there was a Wiktionary!... Thanks, Robin!

Update, February 27, 2012: It is now almost 5 years since I published my technique for offline Wikipedia browsing... Other methods have surfaced in the meantime, and people from the Wikimedia foundation asked me to add a link to Kiwix: an ongoing effort to create an easy to install, open-source, offline Wikipedia reader. Looks interesting...

P.S. Isn't the world of Open Source amazing? I was able to build this in two days, most of which were spent searching for the appropriate tools. Simply unbelievable... toying around with these tools and writing less than 200 lines of code, and... presto!


profile for ttsiodras at Stack Overflow, Q&A for professional and enthusiast programmers
GitHub member ttsiodras
 
Index
 
 
CV
 
 
Updated: Sun Oct 22 14:41:45 2023
 

The comments on this website require the use of JavaScript. Perhaps your browser isn't JavaScript capable; or the script is not being run for another reason. If you're interested in reading the comments or leaving a comment behind please try again with a different browser or from a different connection.