I plan to look at moving my snipsnap bliki to wordpress, i.e. here. This snip was originally placed on that site. I have transferred it here for one reason: it contains xml tags, which xslt has problems dealing with. This article consists of the notes I made while designing a transfer mechanism.
This page is currently structured as follows: there’s a section called Previous Projects, which documents Ryan Barrett’s code that moved from snipsnap to pyblosxom via subversion; there’s a section on the WXR schema; and there are some notes on how I might just run a script against the file system.
In June 2013, I cleaned up a dump file, which I used to restore the bliki and bring it back online. I hope I have removed all snips that contain the string <![CDATA[…..]]>; these would be the ones in which I inserted Disqus comments, plus this snip, which recorded my notes on WXR. This may make the xslt solution more feasible. I need to buy a book.
Previous Projects
Ryan Barrett at http://snarfed.org has already done this, so I hope his resources are helpful. He used xsltproc, and wordpress uses xml as its import/export medium. I wrote to Ryan, who pointed me at the following:
- http://snarfed.org/2010-08-08_migrated_to_wordpress
- http://snarfed.org/2006-08-23_virtual_housewarming
by him, and also
This last article describes how to change Snipsnap to create an RSS file of more than 10 articles. I have been looking at how to use the {weblog} macro to extend the RSS file, but it seems that this will have no effect on the RSS: if I could get it to work, the blog page would be bigger but the RSS would not. The author of snipsnap to wordpress used RSS as his base source. This has some advantages over using the dump, which is what I was planning to use.
Inspired by Ryan, my first plan was to learn xslt and write an xslt script that would convert the snipsnap dump xml format to WXR, but having seen how simple a wordpress RSS file is, it might be simpler to write a script in bash or python.
Links
- man xsltproc
- tizag’s xslt Tutorial, I like this one.
- filtering with xslt
- quackit’s xslt tutorial, not as good as tizag’s, IMHO
and an article at stackoverflow, which includes a book list
Schema
The namespace files for WXR can be discovered by undertaking an EXPORT. The format and rules are best explained at WXR explained at the Developer’s Tidbit. The namespaces used are:
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.1/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.1/"
>
The WXR content looks like this.
<wp:category>
    <wp:term_id>9</wp:term_id>
    <wp:category_nicename>about-me</wp:category_nicename>
    <wp:category_parent></wp:category_parent>
    <wp:cat_name><![CDATA[About Me]]></wp:cat_name>
</wp:category>
<item>
    <title>Sample Page</title>
    <link>http://wpress.davelevy.info/?page_id=2</link>
    <pubDate>Thu, 02 Jun 2011 23:01:28 +0000</pubDate>
    <dc:creator>admin</dc:creator>
    <guid isPermaLink="false">http://wpress.davelevy.info/?page_id=2</guid>
    <description></description>
    <content:encoded><![CDATA[This is an example page....]]></content:encoded>
    <excerpt:encoded><![CDATA[]]></excerpt:encoded>
    <wp:post_id>2</wp:post_id>
    <wp:post_date>2011-06-03 01:01:28</wp:post_date>
    <wp:post_date_gmt>2011-06-02 23:01:28</wp:post_date_gmt>
    <wp:comment_status>open</wp:comment_status>
    <wp:ping_status>open</wp:ping_status>
    <wp:post_name>sample-page</wp:post_name>
    <wp:status>publish</wp:status>
    <wp:post_parent>0</wp:post_parent>
    <wp:menu_order>0</wp:menu_order>
    <wp:post_type>page</wp:post_type>
    <wp:post_password></wp:post_password>
    <wp:is_sticky>0</wp:is_sticky>
    <wp:postmeta>
        <wp:meta_key>_wp_page_template</wp:meta_key>
        <wp:meta_value><![CDATA[default]]></wp:meta_value>
    </wp:postmeta>
</item>
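As a rough illustration of how a script could emit an item like the one above, here is a minimal sketch using Python's xml.etree.ElementTree. The namespace URIs are the ones listed earlier; the function names, the keys of the snip dictionary and the choice of 'post' as the post type are assumptions made for the example, not anything taken from snipsnap or wordpress. Note that ElementTree escapes markup in text rather than writing CDATA sections, which is equivalent as far as an xml parser is concerned.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the <rss> element shown above.
NS = {
    "excerpt": "http://wordpress.org/export/1.1/excerpt/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wfw": "http://wellformedweb.org/CommentAPI/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.1/",
}
for prefix, uri in NS.items():
    # Ask ElementTree to write wp:, dc: etc. instead of ns0:, ns1: ...
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return the ElementTree name for prefix:tag, e.g. wp:post_name."""
    return "{%s}%s" % (NS[prefix], tag)


def make_item(snip):
    """Build one <item> from a dict; the dict keys are hypothetical."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = snip["title"]
    ET.SubElement(item, "pubDate").text = snip["pubdate"]
    ET.SubElement(item, q("dc", "creator")).text = snip["author"]
    # Markup in the body is escaped (&lt; ...), not wrapped in CDATA.
    ET.SubElement(item, q("content", "encoded")).text = snip["body"]
    ET.SubElement(item, q("wp", "post_name")).text = snip["slug"]
    ET.SubElement(item, q("wp", "status")).text = "publish"
    # Assumption: snips become posts; the sample above is a page.
    ET.SubElement(item, q("wp", "post_type")).text = "post"
    return item
```

Calling ET.tostring(make_item(snip), encoding="unicode") returns the item as a string, with the namespace declarations it needs written on the item element itself.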
I wonder how to assign labels/categories to items.
The input is more complex; so complex that it might be best to use Ryan’s script. I think we have the following cases:
- a simple snip, with no comments, no attachments and no category set, versions = 1
- versions > 1
- comments > 0
- attachments > 1
- type = blog, which may have to be taken from the name attribute, and the transformation needs to deal with the name
- type = category
I have a separate wiki page that deals with the contents of the <content:encoded> element, called Converting a Snip.
Scripting by Hand
This is probably only possible because I used the file system as my persistence provider. The top level directory consists of snips (which are directories), property files and content files; in the case of the top level directory for user ‘Dave’, the content files are two image files. The snip text content is held in content.txt. The other attributes are held in a file called metadata.properties.
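Reading metadata.properties could be as simple as the sketch below. It assumes the file holds plain key=value lines with # or ! comment lines, and it ignores the rest of the Java properties syntax (backslash escapes, continuation lines, ':' separators). The function name is made up for the example.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props
```

Every value comes back as a string, so a time held in milliseconds needs an int() before any date arithmetic.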
I need to write a recursive walk, since snips can contain snips. A directory contains property files, content files and directories, which represent comments, versions and sub-snips. The snip taxonomy is best discovered using the find UNIX utility. I need to flatten it out anyway, since wordpress does not support that hierarchy, although WXR does have a parent field.
I have some python code that looks to extract the necessary metadata.
The time facts are held in milliseconds since the epoch, so the following python code converts them into human readable form and places the result in dictionary D.
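The recursive walk and the flattening might be sketched with os.walk, as below. The assumption here is that any directory holding a content.txt counts as a snip; whether comment and version directories look the same on disk is something the sketch does not know, and find_snips is a name invented for the example.

```python
import os


def find_snips(root):
    """Yield (relative path, depth) for each directory holding content.txt."""
    for dirpath, dirnames, filenames in os.walk(root):
        if "content.txt" in filenames:
            rel = os.path.relpath(dirpath, root)
            # Depth 0 is the root itself, 1 a top level snip, 2 a sub-snip ...
            depth = 0 if rel == "." else rel.count(os.sep) + 1
            yield rel, depth
```

The result is already flat: one entry per snip, with the depth kept in case the parent field is wanted later.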
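import time

myformat = '%a, %d %b %Y %H:%M:%S +0000'
D = {}
# myfunction() stands for whatever returns the snip's time in milliseconds
msecsinepoch = myfunction()
# gmtime() expects seconds, so divide the millisecond value by 1000
pubdate = time.strftime(myformat, time.gmtime(msecsinepoch/1000))
D['publicationdate'] = pubdate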
gmtime requires seconds as an argument, hence msecsinepoch/1000. The other WXR elements need another format statement. myfunction returns the value of time held by snipsnap, in milliseconds; it’s an integer. In the code example above, the dictionary assignment statement hard codes the dictionary key as the string ‘publicationdate’.
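The second format statement could look like the sketch below, which produces both the pubDate form and the form used by wp:post_date_gmt in the sample item. The function and constant names are made up for the example. The local-time wp:post_date is left out, since it depends on a time zone that these notes do not record.

```python
import time

RSS_FORMAT = '%a, %d %b %Y %H:%M:%S +0000'  # <pubDate>
WP_FORMAT = '%Y-%m-%d %H:%M:%S'             # <wp:post_date_gmt>


def wxr_dates(msecsinepoch):
    """Convert milliseconds since the epoch into both WXR date strings."""
    t = time.gmtime(msecsinepoch / 1000)  # gmtime() wants seconds
    return {
        'pubDate': time.strftime(RSS_FORMAT, t),
        'post_date_gmt': time.strftime(WP_FORMAT, t),
    }


# The timestamp of the sample page above, in milliseconds.
print(wxr_dates(1307055688000)['post_date_gmt'])  # 2011-06-02 23:01:28
```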
I also have three snips which contain the ‘.’ character.
I have tried to strip out the snips containing WXR from the dump file, using visual inspection and ‘sed’, but this is proving difficult.
I have tried to exclude the snips containing xml from the xml dump, but chrome renders the xml into html and makes it unusable 🙁
I have tried to use sed to make the XML unrecognisable to the xslt script, but this doesn’t yet work. I may need to change out the “<” character. However, my bliki uses the file system as its persistence technology. I am considering removing the offending snips from the data store using ‘rm’, but on investigation, it might be much easier to write a shell script in bash, or python. 😈
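One way to change out the “<” character in python, rather than sed, is the standard library's xml.sax.saxutils.escape, which replaces &, < and > with entities. This is only a sketch of the idea; whether escaped text is enough to keep the xslt script happy is untested, and neutralise is a name invented for the example.

```python
from xml.sax.saxutils import escape


def neutralise(text):
    """Replace &, < and > with entities so embedded XML is no longer markup."""
    return escape(text)


print(neutralise("<![CDATA[x]]> & y"))  # &lt;![CDATA[x]]&gt; &amp; y
```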
I now have a good dump file.
I eventually transferred them by hand. WordPress’s drag and drop is pretty cool, and a lot of the blog articles didn’t warrant being copied, since they were pointers to the wiki to cover up the poor RSS features of snipsnap.