Difference between revisions of "Talk:Import CSV data into a wiki"

From Organic Design wiki
(Template: Document idea)
m (Nad moved page Talk:Csv2wiki.pl to Talk:Import CSV data into a wiki: better desc)
 
(36 intermediate revisions by 5 users not shown)
== Description ==
 
csv2wiki imports data from a CSV file into a wiki running MediaWiki. There are two versions of the program. The first is a command-line version written in Perl. The second is a PHP version that runs as a MediaWiki command-line maintenance script, so it can work directly with the wiki database instead of going through HTTP requests.
 
 
== How It Works ==
 
The script uses [[wiki.pl]] to log in and edit wiki articles.
 
 
=== Job File ===
 
 
The job file contains all the information necessary to update your wiki from content in the Source File. Fields in the job file include:
 
 
*csv: ''Source File'' #full path and file name
 
*wiki: ''http://mydomain.com/wiki/index.php5'' #wiki URL (must be the long form, including the index.php)
 
*user: ''username'' #an active user on the wiki
 
*pass: ''password'' #user's password
 
*separator: # delimiter between fields (default is a comma; this is a regular expression, so you can use it to strip white-space, e.g. '''\s*,\s*''', or specify tabs with '''\t''')
 
*title: ''n'' # the column number that gives the article its title
 
*template: # defaults to [[Template:Record]] if there is none
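
As a rough illustration of the fields above, here is a Python sketch (not the Perl of csv2wiki.pl) that parses a job file and applies the separator as a regular expression. It assumes each job line has the form <code>*key: value</code> and that the title column is counted from zero. The function names parse_job and split_record and the file path are made up for this example.

```python
import re

def parse_job(text):
    """Parse job-file lines of the assumed form '*key: value' into a dict."""
    # Defaults taken from the field list: comma separator, Template:Record.
    job = {"separator": ",", "template": "Record"}
    for line in text.splitlines():
        m = re.match(r"^\*\s*(\w+)\s*:\s*(.*?)\s*$", line)
        if m:
            job[m.group(1)] = m.group(2)
    return job

def split_record(line, separator):
    """Split one line of the source file, treating the separator as a regex."""
    return re.split(separator, line.strip())

# Hypothetical job file; a raw string keeps the backslashes in the separator.
job = parse_job(r"""
*csv: /tmp/people.csv
*wiki: http://mydomain.com/wiki/index.php5
*user: username
*pass: password
*separator: \s*,\s*
*title: 0
""")

fields = split_record("Alice ,  Wellington,42", job["separator"])
print(fields)
print("article title:", fields[int(job["title"])])
```

Note that a plain regular-expression split like this does not understand quoted fields, so a value that itself contains the separator would be split in two.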
 
 
=== Source File ===
 
 
*Unicode: Ensure that the CSV file is encoded in UTF-8 if it contains special characters
 
 
 
 
Latest revision as of 00:31, 7 May 2020

Template

A nice approach would be for the script to fetch the template intended for use on the wiki, as named by the persistent template: argument in the job.txt file. The Csv2wiki.pl script can then parse out all the {{{params}}} in that template which are not between <noinclude>...</noinclude> tags, since content between those tags is not acted upon when template instances are transcluded. This gives the script a list of template parameters to check against the column names of the input file, so the uploaded template calls can be restricted to the columns of interest, e.g.

Input has columns: Foo Fodda Fum fi

From template:Foo;

{{Foo|1= |2= |fodda=|fi=123}}

Column Fum is not uploaded as it is not in the template definition.

Approach
  1. It needs to find a template as input; if the template doesn't exist, it throws a warning and uploads everything. It then uses wiki.pl to fetch the article wikitext.
  2. It parses through the template and grabs all {{{params}}}
  3. The unique template parameters are matched against the infile columns
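
The steps above could be sketched as follows. This is a Python illustration under stated assumptions, not the Perl in Csv2wiki.pl: the names template_params and build_call are hypothetical, the template wikitext is supplied as a string rather than fetched through wiki.pl, and column names are matched to parameters case-insensitively (suggested by the Fodda/fodda example, but not confirmed).

```python
import re

def template_params(wikitext):
    """Return the unique {{{param}}} names used outside <noinclude> sections."""
    # Drop <noinclude> blocks: they are not transcluded, so their params don't count.
    body = re.sub(r"<noinclude>.*?</noinclude>", "", wikitext, flags=re.DOTALL)
    # A parameter name runs from '{{{' up to the first '|' (default) or '}'.
    names = re.findall(r"\{\{\{\s*([^{}|]+?)\s*(?=[|}])", body)
    return list(dict.fromkeys(names))  # de-duplicate, keeping first-seen order

def build_call(template, params, header, row):
    """Build a {{Template|param=value}} call from the columns the template uses."""
    by_name = {col.lower(): val for col, val in zip(header, row)}
    args = [f"{p}={by_name[p.lower()]}" for p in params if p.lower() in by_name]
    return "{{" + "|".join([template] + args) + "}}"

# Made-up template text and input row for illustration.
wikitext = "<noinclude>Usage: {{{ignored}}}</noinclude>'''{{{fodda|}}}''' has value {{{fi}}}"
params = template_params(wikitext)
call = build_call("Foo", params, ["Fodda", "Fum", "fi"], ["x", "y", "123"])
print(params)
print(call)
```

Here the Fum column is left out of the generated call because the template never refers to it.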

Todo

Most of the template uploading functionality has been written.

  • Add a prefix: argument which adds to the beginning of article names.
  • Add the use strict pragma

Logging In

Windows

If you need to run Csv2wiki from a Windows machine, you will first need to install ActivePerl. Refer to their site for documentation.

Install and Run

  • Copy your job description file (eg job.txt) to the same directory as the csv2wiki.pl and wiki.pl scripts
  • Open up a command prompt
  • change to the directory containing the csv2wiki.pl script
perl csv2wiki.pl job.txt
  • You can run multiple jobs at the same time by opening multiple cmd windows in Windows. Make sure the jobs are different!

Issues

  • what is the delay time for logging in? (troubleshooting a slow server)
Login should take about a second

Tips

  • Open in Excel
  • Copy the spreadsheet and paste special as values only (gets rid of formulas), then resave
  • Remove or replace all "," characters in the file
  • Save the Excel spreadsheet as Unicode Text
    • Save As, Save as Type, Unicode Text (*.txt)
  • Open and edit the text in Geany (better than Notepad)
  • Search and replace the TAB character with "," (cut and paste a tab character from Notepad to get this)
NOTE: if you want to use tabs, set separator to \t
  • Set the encoding to UTF-8 (Document, Set encoding, Unicode, UTF-8)
  • Save
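
The re-encoding step can also be scripted instead of done by hand in an editor. The Python sketch below is an alternative to the Geany route, not part of csv2wiki. It assumes the Excel "Unicode Text" export is tab-delimited UTF-16 with a byte-order mark, and it keeps the tabs, so the job file would need the separator set to \t. The function name reencode_to_utf8 and the file names are hypothetical.

```python
import os
import tempfile

def reencode_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Rewrite a text file as UTF-8, leaving tabs and line endings untouched."""
    # newline="" stops Python translating line endings in either direction.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Demo on a throwaway file standing in for an Excel export.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "export.txt")
dst = os.path.join(workdir, "export-utf8.txt")
with open(src, "wb") as f:
    f.write("Name\tCity\r\nZoë\tŌtaki\r\n".encode("utf-16"))

reencode_to_utf8(src, dst)

with open(dst, "rb") as f:
    converted = f.read().decode("utf-8")
print(converted.split("\r\n")[1].split("\t"))
```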

Append/Overwrite

  • How does the script indicate whether it should overwrite or append?
Currently this script does not append; it always overwrites an existing wiki page. The code that does this is:
$done  = wikiPageEdit($::wiki,$::prefix . $record[$::title],$text,"$template updated by [[User: $::user]] using csv2wiki.pl");

However, it could easily be modified to do so by adding a conditional that appends article content using wikiPageAppend($::wiki,$::prefix . ...) instead. A new field would have to be added to the input file to signal that content should be appended --Sven 11:57, 22 June 2008 (NZST)

Now implemented append, overwrite, could also implement wikiUpdateTemplate too. --Sven 22:03, 22 June 2008 (NZST)
swell, thanks for updating the example job file, too --Phalseid 17:17, 23 June 2008 (NZST)


Future fork

This script could be made more generic by adding parser rules that define the record structure of any file the script uploads into a wiki. The idea is to maintain a set of parsing rules defined by a template: for the template named by the template= parameter in the job.txt file, the <noinclude> sections would provide the regular expressions for the parsing rules, e.g.

Record: m/^(.+?(\t|\n))$/g;
$1 = m/(.+? ).+?/;
...

The output of the record-describing regular expression can be the input to another regular expression for each parameter of interest to capture. These would be named as 1= etc. in the template. Security issues need to be addressed around the /.../x regular expression option, which needs to be suppressed by running a search-and-replace regular expression over each regular expression specified.

bug

If "*append: 0" is set, it exits without overwriting the file.

I think it should have three settings:

  • append 0 create if new, skip if present
  • append 1 create if new, append if present
  • append 2 create if new, overwrite if present

--Phalseid 08:34, 12 July 2008 (NZST)
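
The three proposed settings amount to a small decision table. A minimal Python sketch of that proposal follows; it is not code from csv2wiki.pl, and the function name choose_action and the action labels are made up for illustration.

```python
def choose_action(append_mode, page_exists):
    """Map the proposed append setting (0, 1 or 2) to an action for one article."""
    actions = {0: "skip", 1: "append", 2: "overwrite"}
    if append_mode not in actions:
        raise ValueError(f"append must be 0, 1 or 2, got {append_mode!r}")
    # Every setting creates the article when it does not exist yet.
    if not page_exists:
        return "create"
    return actions[append_mode]

for mode in (0, 1, 2):
    print(mode, choose_action(mode, page_exists=False), choose_action(mode, page_exists=True))
```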

I will look at your suggestions. Currently wikiPageEdit always overwrites all content. If the new content is the same as the current content then the edit will not be made. wikiPageAppend currently appends content if new key/value parameters are present --Sven 08:31, 16 July 2008 (NZST)
Sorry I missed you on Skype. Does it overwrite all page content, or just the named template? I can't tell because it just seems to skip if 0. --Phalseid 08:59, 16 July 2008 (NZST)
Is the content new? Because both wikiPageEdit and wikiPageAppend probably don't add anything if there is no change in article content --Sven 20:43, 16 July 2008 (NZST)

Did another test yesterday, and it is definitely working for '*append:0' and '*append :1'. One possibility is that the CSV file had an error in it which the Perl script did not detect, but processed anyway. Thinking about the options, articles will be skipped if the content is the same anyway, so that option is redundant. However, the function 'wikiUpdateTemplate' in Wiki.pl might be a good option to add. Then you would have the ability to overwrite, append, or just update the first matched template. --Sven 08:44, 13 August 2008 (NZST)

Note the function naming changes in Wiki.pl: wikiPageEdit => wikiEdit etc. I was probably using an old copy of Wiki.pl when I did the test, one that was not affected by the naming change --Sven 21:48, 3 January 2009 (NZDT)

Unrecognised character

When I am using csv2wiki I get the error: Unrecognized character \x7F at wiki.pl line 466. Compilation failed in require at csv2wiki.pl line 13. Could you please help me?

That sounds like a problem with your text editor, because there are no such characters in the script. What OS and editor are you using? I recommend the Geany text editor (www.geany.org). --nad 20:52, 24 May 2009 (UTC)
I have now used the Geany text editor on a Windows XP system and uploaded the file wiki.pl to the Linux server, but it's still giving the same error
Try copying and pasting from the original again. There definitely is no such character in the source, so somehow it's got corrupted --nad 11:50, 26 May 2009 (UTC)
Try the latest version of wiki.pl, I've cleaned the whitespace in case there were bad characters in it --nad 00:21, 30 May 2009 (UTC)