I have a fairly large wiki (20K pages) and I want to do a number of things by automation. I'm using Python 3.5 and the PyCharm IDE. (I'm in the c/c++/c# tradition, but I have a working knowledge of Python.)
So far, I've been working with a zipped backup — I have code which will analyze the entire site, resolve all indirect references and provide lists of most wanted pages and most referenced pages, but this approach doesn't let me get at things like tags, so I've started digging into the API. (BTW, I'd be happy to post the code if anyone is interested.)
(I started asking questions in the Community forum and got some useful help, but not enough. But one of the pieces of help was a pointer to this site.)
Problem with Whiffle
I started trying to use Whiffle, but have been entirely unable to get it to work. I fixed a few import issues which seem to be related to Whiffle still targeting Python 2.7 libraries, but it still simply raises exceptions. (It's possible it's failing on some subtler incompatibility, of course.) Tracing through the code, it appears to fail on a call to Wikidot using the following URL:
'mlo1:rdSY.......80g@www.wikidot.com'
I got this response back:
'<?xml version="1.0" encoding="UTF-8"?>\n<methodResponse><fault><value><struct><member><name>faultCode</name><value><int>406</int></value></member><member><name>faultString</name><value><string>Site does not exist</string></value></member></struct></value></fault></methodResponse>\n'
(The "…" is an elided section of the ID. I checked to make sure it is correct.) As far as I can see, there is nothing there to specify the wiki to be accessed, and — so far, anyway — I can't figure out why.
At any rate, I messed about with Whiffle some more, finally gave up, and tried using ServerProxy directly, with immediate success.
Using ServerProxy Directly
For example,
s = client.ServerProxy('https://fancyclopedia:rdSY...n80g@www.wikidot.com/xml-rpc-api.php')
onepage = s.pages.get_one({"site": "fancyclopedia", "page": "_default:fapa"})
yielded exactly what I hoped for. But I still ran into problems:
First, a second call through the proxy object "s" (above in the code box) raises an exception (http.client.RemoteDisconnected). It appears that I'm missing something here, but what?
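One guess (an assumption, not something I've confirmed against Wikidot's server behavior): the server may be closing the keep-alive connection between calls, in which case catching the exception and retrying the call works around it. This is a minimal sketch; call_with_retry is a hypothetical helper, and the commented-out usage reuses the proxy object and page names from the example above.

```python
from xmlrpc import client
import http.client

def call_with_retry(func, *args, retries=2):
    """Call func(*args), retrying if the server has dropped the
    keep-alive connection since the last request."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except (http.client.RemoteDisconnected, ConnectionResetError):
            if attempt == retries:
                raise  # still failing after retries: give up

# usage sketch (credentials elided as in the example above):
# s = client.ServerProxy('https://fancyclopedia:API-KEY@www.wikidot.com/xml-rpc-api.php')
# page = call_with_retry(s.pages.get_one,
#                        {"site": "fancyclopedia", "page": "_default:fapa"})
```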
Second, how would I get a list of all pages? A call to pages.select returns a list of around 250 pages and then fails with an error message.
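One approach worth trying, assuming categories.select works as the Wikidot API documentation describes: request the page list one category at a time so each response stays small, then batch any follow-up metadata calls. chunks is a hypothetical helper for that batching; the network calls are left commented out since they need real credentials.

```python
def chunks(seq, size):
    """Yield successive slices of seq, each no longer than size.
    Useful for splitting a long page list into small API batches."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# usage sketch, assuming categories.select returns category names:
# all_pages = []
# for cat in s.categories.select({"site": "fancyclopedia"}):
#     all_pages.extend(s.pages.select({"site": "fancyclopedia",
#                                      "categories": [cat]}))
# for batch in chunks(all_pages, 10):
#     meta = s.pages.get_meta({"site": "fancyclopedia", "pages": batch})
```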
Third, with 20K pages, walking the whole site as a prelude to every test or analysis is going to be pretty slow. I'm thinking it may make sense to build a local database of the site and then keep it up to date by using the Recent Changes feature to download only the pages that have changed. But how do I get at the Recent Changes list? (I could write code to scrape the wiki's Recent Changes page, but that would necessarily be rather ugly and fragile code.)
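Even without a Recent Changes call, the comparison step of such a cache can be sketched as a pure function, assuming pages.get_meta returns a per-page updated_at timestamp (worth verifying against the API docs). pages_to_refresh is a hypothetical helper: it compares cached timestamps against fresh metadata and returns only the pages that need re-downloading.

```python
def pages_to_refresh(local, remote):
    """Return page names whose remote copy is newer than, or missing
    from, the local cache.

    Both arguments map page name -> last-updated timestamp (e.g. the
    updated_at field from pages.get_meta). A page is stale when its
    timestamp differs or it has never been cached."""
    return [name for name, stamp in remote.items()
            if local.get(name) != stamp]
```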
Fourth, is there a way to get the rendered content of pages that embed things like "[[include …]]" and, similarly, "[[listpages …]]"?
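Since pages.get_one appears to return the raw wiki source, one workaround (an assumption, not an API feature) is to fetch the public rendered HTML over plain HTTP, where the server has already expanded include and listpages modules. rendered_url is a hypothetical helper; the site and page names are the ones from the example above.

```python
from urllib.parse import quote

def rendered_url(site, page):
    """Build the public URL for a page's rendered HTML, where
    [[include]] and [[listpages]] have been expanded server-side."""
    return 'https://{}.wikidot.com/{}'.format(site, quote(page))

# usage sketch:
# import urllib.request
# html = urllib.request.urlopen(rendered_url('fancyclopedia', 'fapa')).read()
```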
Many thanks for any help!