Import wiki content


Importing articles

Articles can be imported into a wiki from an XML file which was generated by an export from another wiki.

  • Ensure you're logged in with sysop privileges
  • Go to the Special:Import page
  • Browse for the XML file you wish to import
  • Click "Upload file"
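
If you have shell access, the same XML file can instead be imported from the command line with the importDump.php maintenance script; a minimal sketch, where dump.xml stands for your export file:

# dump.xml is a placeholder for the XML file exported from the other wiki
php maintenance/importDump.php dump.xml

This route also avoids the upload size limits that can affect Special:Import with large exports.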

Importing a directory of files

This involves running the importImages.php maintenance script, which takes the directory containing the files to import as its first parameter, optionally followed by the file extensions to accept. See run a wiki maintenance script for details.

  • Ensure that the images directory is writable and that directories can be created in it by PHP
  • There are a number of options available, such as --user, --comment and --overwrite (see the second example below)

Example:

php maintenance/importImages.php /path/to/images/directory png jpg gif bmp PNG JPG GIF BMP
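
A second, hedged example combining the options listed above (the user name and comment text are placeholders, not values from the original procedure):

# --user and --comment values here are examples only
php maintenance/importImages.php --user=Rob --comment="Batch import" --overwrite /path/to/images/directory png jpg gif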

Note: the files (perhaps due to our wikia setup) get copied to the root directory (/files) instead of into the files directory of the wiki in question, so until this is fixed you will need to copy them over into the correct structure. Assume the wiki we're dealing with is called bikewiki.

cd /files/bikewiki                                        # where the files wrongly ended up
tar cf ~/bikewiki.import.tar .                            # archive them, preserving the structure
mv ~/bikewiki.import.tar /var/www/wikis/bikewiki/files/bikewiki/
cd /var/www/wikis/bikewiki/files/bikewiki/                # the wiki's real files directory
tar xf bikewiki.import.tar                                # unpack over the current structure

Importing Wikipedia dumps

The Wikipedia exports are enormous and can be very difficult to import. Here's a thread from the Wikitech-l mailing list which may be useful regarding this.

Hi,
	I hate to resurrect an old thread, but for the purpose of 
completeness I would like to post my experience with the import of 
XML dumps of Wikipedia into MediaWiki, so that it might help someone 
else looking for this information. I started this thread after all.

	I was attempting to import the XML/SQL dumps of the English Wikipedia 
http://download.wikimedia.org/enwiki/20081008/ (not the most recent 
version) using the three methods described at 
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps

I.	Using importDump.php:
While this is the recommended method, I did run into memory issues. 
The PHP CLI runs out of memory after a day or two, and then you have 
to restart the import. (The good thing is that after a restart it 
skips quickly over pages it has already imported.) However, it 
crashed too many times, which made me give up on it.
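
One workaround worth trying (an assumption on our part, not something from the thread) is to raise the PHP CLI memory limit for the import run:

# -d memory_limit overrides php.ini for this invocation only
php -d memory_limit=1024M maintenance/importDump.php enwiki-20081008-pages-articles.xml

With a dump this size it may only postpone the crash, but it lets the script get further between restarts.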

II.	Using mwdumper:
This is actually pretty fast, and does not give errors. However I 
could not figure out why it imports only 6.1 million pages, as 
compared to the 7.6 million pages in the dump mentioned above (not 
the most recent dump). The command line output correctly indicates 
that 7.6M pages have been processed, but when you count the entries 
in the page table, only 6.1M show up. I don't know what happens to 
the rest, because as far as I can see there were no errors.
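
For reference, a typical mwdumper invocation pipes its SQL output straight into MySQL; the database name and credentials below are assumptions matching the xml2sql example further down:

# mwdumper converts the XML dump to SQL statements for a 1.5+ schema
java -jar mwdumper.jar --format=sql:1.5 enwiki-20081008-pages-articles.xml.bz2 | mysql -u root -p wikidb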

III.	Using xml2sql:
Actually this is not the recommended way of importing the XML dumps 
according to http://meta.wikimedia.org/wiki/Xml2sql, but it is the 
only way that really worked for me. However, unlike the other tools, 
it needs to be compiled/installed before it can be used. As Joshua 
suggested, a simple:
$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

worked for me.

Notes: Your local MediaWiki will still not look like the online wiki 
(even after you take into account that images do not come with these 
dumps).
1.	For that I first imported the SQL dumps into the other tables 
that were available at http://download.wikimedia.org/enwiki/20081008/ 
(except page, since you have already imported it by now); a rough 
sketch follows these notes.
2.	I next installed the extensions listed in the “Parser hooks” section 
under “Installed extensions” on 
http://en.wikipedia.org/wiki/Special:Version
3.	Finally, I recommend that you use HTML Tidy, because even after 
the above steps the output is screwed up. The settings for HTML Tidy 
go in LocalSettings.php. They are not there by default; you need to 
copy them from includes/DefaultSettings.php. The settings that worked 
for me were:
$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );

And

$wgValidateAllHtml = false;

Ensure this last one is false, otherwise you will get nothing for 
most of the pages.
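
For step 1 of the notes above, the auxiliary SQL dumps can be loaded along these lines (a sketch; the table dumps shown are examples, so check the dump directory for the actual filenames):

# filenames are illustrative; each .sql.gz loads one auxiliary table
gunzip -c enwiki-20081008-categorylinks.sql.gz | mysql -u root -p wikidb
gunzip -c enwiki-20081008-templatelinks.sql.gz | mysql -u root -p wikidb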

I hope the above information helps others who also want to import 
XML dumps of Wikipedia into MediaWiki.

Thanks to all who answered my posts,
O. O.
