While downloading pages from our Fancyclopedia site, I ran into a problem with embedded special characters that were keyed or copied into some of the pages (about 10% so far) when they were initially entered.
page = s.pages.get_one({'site': 'fancyclopedia', 'page': file_name})
print page
{'commented_at': None, 'rating': 0, 'updated_by': 'mlo1', 'title': 'Anticipation', 'created_at': '2011-06-29T00:24:41+00:00', 'title_shown': 'Anticipation', 'updated_at': '2011-06-29T00:24:54+00:00', 'created_by': 'mlo1', 'children': 0, 'content': u'The 2009 [[[Worldcon]]] held in the Palais des congr\xe8s de Montr\xe9al. [[[GoHs]]]: [[[Neil Gaiman]]] (pro) . . . , n-worldcons">Canadian Worldcons</a>.</p>\n', 'commented_by': None, 'comments': 0, 'parent_title': None, 'fullname': 'anticipation', 'parent_fullname': None, 'tags': ['worldcon', 'convention', 'canada'], 'revisions': 1}
content = page['content']
print content
The 2009 [[[Worldcon]]] held in the Palais des congrès de Montréal. [[[GoHs]]]: [[[Neil Gaiman]]] (pro), [[[Elisabeth Vonarburg]]] (pro), [[[David Hartwell]]] (editor), [[[Tom Doherty]]] (publisher) and [[[FGoH]]]: [[[Taral Wayne]]]. [[[Chairmen]]]: [[[René Walling]]] and [[[Robbie Bourget]]]. [[[Ralph Bakshi]]] was originally announced as Artist [[[GoH]]] but withdrew for health reasons. See also [[[Canadian Worldcons]]].
temp.write(content)
Traceback (most recent call last):
  File "D:\My Stuff\Wikidot\chkfiles.py", line 35, in <module>
    temp.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 52: ordinal not in range(128)
The content contains both the '\x' form with two hex digits and the '\u' form with four hex digits, and both caused failures: the '\x' form failed when writing to the file (above), and the '\u' form failed when trying to print:
Traceback (most recent call last):
  File "D:\My Stuff\Wikidot\chkfiles.py", line 31, in <module>
    print content
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 36: character maps to <undefined>
Is there a way to get around this without having to manually correct the existing files?
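One possible direction, offered only as a rough sketch and not a confirmed fix: both tracebacks come from a unicode string being encoded with a codec that lacks the character ('ascii' for the file, 'cp437' for the console). Assuming UTF-8 is acceptable for the saved files, the file can be opened with an explicit encoding, and console output can replace anything cp437 cannot show. The helper names save_text and console_safe, the sample string, and the output file name are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_text(path, text, encoding='utf-8'):
    """Write a unicode string to disk with an explicit encoding (hypothetical helper)."""
    # io.open encodes on write, so the default 'ascii' codec is never used.
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(text)


def console_safe(text, encoding='cp437'):
    """Return text with characters the console codec lacks replaced by '?' (hypothetical helper)."""
    return text.encode(encoding, 'replace').decode(encoding)


# Sample string containing both kinds of character seen in the tracebacks.
content = u'Palais des congr\xe8s de Montr\xe9al \u2022 Anticipation'

# Assumed output location, for illustration only.
out_path = os.path.join(tempfile.mkdtemp(), 'anticipation.txt')
save_text(out_path, content)

# Reading back with the same encoding returns the original text unchanged.
with io.open(out_path, 'r', encoding='utf-8') as handle:
    round_trip = handle.read()

# The accented letters exist in cp437 and survive; the bullet is replaced.
shown = console_safe(content)
```

With this approach the pages on the site would not need manual correction: the saved file keeps every character, and only the console display loses the ones cp437 lacks. Printing console_safe(content) instead of content should then avoid the 'charmap' error.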
Thanks for your help…
Jack Weaver Fanac Fan History Project