Cocoa (momijizukamori) wrote 2021-06-15 01:44 pm
Webfonts as DRM
Writing this up mostly because this is absolutely buckwild, and I am sure some of the other programmers (and probably some of the non-programmers) on my list will be equal parts horrified and fascinated like I am.
As I mentioned previously, I've been working on scripts to scrape JJWXC, a Chinese webnovel site, and turn the scraped content into epubs, for offline reading, etc. I actually have the scripts basically done, but when I was checking the results of running them against paid-only chapters of a novel I had paid for, I noticed there was some weirdness - most of the text was right, but random characters were missing or encoded as random Unicode symbols.
My first two thoughts were that it was either an encoding issue or a font-support issue. JJWXC uses GB 18030, the Chinese national-standard encoding (used mainly for Simplified Chinese), and irritatingly doesn't declare this in the HTTP headers, just in the page metadata, so any tool that doesn't parse the page the way a browser does will incorrectly assume it's something else (either UTF-8 or Latin-1/ISO 8859-1, I didn't dig enough to check) unless you manually override it. And epub requires UTF-8-encoded text, so everything was being re-encoded. The fact that most of the text was fine suggested that the encoding wasn't the issue, though - a mismatch would have garbled far more of the text than what I was seeing (if anyone remembers mojibake - that's the sign of encoding mismatches).
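For the curious, here's a minimal sketch of that encoding step, assuming you already have the raw page bytes in hand. The sample string is made up rather than taken from JJWXC. It decodes with the codec overridden by hand, then re-encodes as UTF-8 for the epub, and also shows the wrong-guess path that produces mojibake.

```python
# Hypothetical sample standing in for the raw bytes fetched from the site.
raw = "第一章 你好".encode("gb18030")

# Wrong guess: Latin-1 never raises an error, it just yields mojibake.
garbled = raw.decode("latin-1")

# Right: override the codec by hand, then re-encode as UTF-8 for the epub.
text = raw.decode("gb18030")
utf8_bytes = text.encode("utf-8")

# The two encodings use different byte counts for the same text.
print(len(raw), len(utf8_bytes))
```

A wrong codec guess garbles every Chinese character on the page, not just a scattering of them, which is why the mostly-fine text pointed away from an encoding problem.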
So it was on to fonts, and I actually hunted down some fonts explicitly designed for good coverage of Simplified Chinese in Unicode, and tried setting those in my epub reader... and the characters were still messed up. But they were rendering fine in my browser, so I went back to poke at the JJWXC page and see what font they were calling that actually worked. And that is when I discovered that what was going on was far, far weirder.
On free chapters, the fontlist specified for the novel text is "'Microsoft YaHei', PingFangSC-Regular, HelveticaNeue-Light, 'Helvetica Neue Light', sans-serif" - the first two are the default Windows and default OSX sans-serif fonts for Simplified Chinese. On paid chapters, however, the list was "jjwxcfont_004hu, 'Microsoft YaHei', PingFangSC-Regular, HelveticaNeue-Light, 'Helvetica Neue Light', sans-serif" (where the series of numbers and letters at the end of the first font varies by chapter). This mystery new font was an embedded webfont, so I downloaded it, converted it to an OpenType Font, and popped it open in FontForge, which is an open source program for viewing/editing font glyphs. When I hid the undefined codepoints, what I got was this - a series of 200 characters assigned to seemingly-random locations within the Unicode Private Use Area, some of which have more common mappings that matched the errors I was seeing in my epub files. I repeated the process with another one of the webfonts, and got this - same 200 characters, different random assignments in the PUA block.
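As a rough sketch of how you might spot these in scraped text, assuming the obfuscated characters all sit in the Basic Multilingual Plane Private Use Area (U+E000 through U+F8FF): the helper name and the sample string below are invented for illustration.

```python
from collections import Counter

# Basic Multilingual Plane Private Use Area: U+E000 through U+F8FF.
PUA_START, PUA_END = 0xE000, 0xF8FF


def pua_characters(text: str) -> Counter:
    """Count each Private Use Area character that appears in text."""
    return Counter(ch for ch in text if PUA_START <= ord(ch) <= PUA_END)


# Made-up sample: ordinary characters with PUA placeholders mixed in.
sample = "我\ue3a1们\uf01c\ue3a1"
for ch, n in pua_characters(sample).most_common():
    print(f"U+{ord(ch):04X} x{n}")
```

Running something like this over a scraped chapter gives the list of codepoints that need a real character assigned to them.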
To back up for a second - a lot of these sites have various 'anti-copy' mechanisms, which usually take the form of nightmarish JavaScript that blocks attempts at using select and copy/paste commands. JJWXC doesn't have any of those, so I thought the worst I had to deal with was the fact that they don't use element ids/classes much, which makes selecting elements programmatically kind of a pain. But I think this webfont weirdness is their version of an anti-piracy method - a lot of these are fairly common characters (I was able to recognize about half of them from their use in Japanese, where they're all taught at the elementary-school level), so it'd be like taking a collection of common basic English words, and replacing them in a text with random symbols. You might be able to read the text still, but it'd be a lot more difficult.
Of course, making it accessible in a browser means that there's always a way to work around the wackiness - at worst, epub 3.0 supports embedded webfonts, so I can theoretically just save the webfonts off and construct additional CSS. That incurs a filesize hit, though (~1 MB per 40 chapters, if each chapter has a different font file), and it limits your choices for viewing fonts, because that character set is only available in one font. The fact that each chapter I've checked (so far) has a different webfont suggests that they're being built programmatically, though. And the fact that I've gotten the same webfont for the same chapter, checking at different times and on different browsers, suggests that it's deterministic. So it may be possible to reverse programmatically, and I'm going to spend some time trying to do that first.
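If the mapping can be recovered, undoing it is the easy part. Here's a sketch under heavy assumptions: it supposes a lookup table per font from PUA codepoint to the real character. The font name comes from the example above, but the codepoints, the characters, and the `decode_chapter` helper are all invented. How to build the table in the first place is the open question and isn't shown here.

```python
# Hypothetical per-font tables: PUA codepoint -> the character its glyph draws.
# The codepoints and characters here are invented for illustration.
FONT_TABLES = {
    "jjwxcfont_004hu": {0xE3A1: "的", 0xF01C: "不"},
}


def decode_chapter(text: str, font_name: str) -> str:
    """Replace PUA placeholders using the table for this chapter's font."""
    # str.translate accepts a dict keyed by codepoint; unmapped characters
    # pass through unchanged.
    return text.translate(FONT_TABLES[font_name])


decoded = decode_chapter("我\ue3a1书\uf01c见了", "jjwxcfont_004hu")
print(decoded)
```

Keying the tables by font name matters because the same codepoint can stand for a different character in each chapter's font.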
no subject
The number is finite, yes, but potentially extremely large - there are 6.4k codepoints in the private use area, so with 200 selections you have ~ 7.4 * 10^759 possibilities (provided I have grabbed the correct permutation formula anyway). That said, I scraped a larger selection of chapters last night and found some repeats, so they're either using a pregenerated subset, or whatever they use for generating is biased towards certain results.
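A quick check of that figure, assuming the 6,400 codepoints are the Basic Multilingual Plane block (U+E000 through U+F8FF) and that an assignment is an ordered pick of 200 distinct codepoints, i.e. 6400! / 6200!:

```python
import math

PUA_SIZE = 0xF8FF - 0xE000 + 1  # 6400 codepoints in the BMP Private Use Area
GLYPHS = 200                    # obfuscated characters per font, per the post

# Ordered selections without repetition: 6400! / (6400 - 200)!
assignments = math.perm(PUA_SIZE, GLYPHS)

# Show the leading digits and the order of magnitude.
digits = str(assignments)
print(f"about {digits[0]}.{digits[1]}e{len(digits) - 1}")
```

This agrees with the ~ 7.4 * 10^759 estimate, so the permutation formula was the right one.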
no subject
Yeah, it really kind of is trap streets for font glyphs. Points for creativity, at least.
no subject
This is so utterly hostile to screenreaders, I want to scream. (I use a screenreader for TC Mandarin because it’s quicker than firing up the dictionary, although the intonation is so off from the vernacular sometimes that I do still have to fire up the dictionary. Illiteracy!)
no subject
Yeah most of these anti-piracy methods are varying degrees of bad-to-terrible for accessibility, which makes me mad beyond my kneejerk 'don't tell me how to enjoy my content' reaction.
no subject
As I said, equal parts horrified and fascinated. It is at least more interesting to try and reverse than fighting with the various nightmare JS libraries for copy-protection.
no subject
TIL the term mojibake; reminds me of dealing with text copied from Word files into an old version of MySQL where all the non-ASCII characters got garbled. We probably didn't catch all of them.
no subject
Better or worse than the sites that trap you in an infinite debugger loop if you dare to open the devtools panel on them? You decide!
(these sites are like 'do you want to see some coding crimes? no? too bad')
no subject
Ooooh! I encountered a site that does this same thing with the English alphabet, but they have just one font, so it was fairly easy to map!
no subject
Was it Chrysanthemum Garden XD? I ended up using stuff I learned on JJWXC to write an extension to scrape that too.
no subject
I think it was! I just wrote a little yaml dictionary by hand, since it was stable.
no subject
Yeah, it's basically a simple substitution cipher, heh.
no subject
They've actually made it worse since then - back in May they added a thing where the text for VIP chapters is sent encrypted in the response, and then decrypted in the browser. And if you try to open the developer console to poke at it, it sends you into an infinite debugger loop (which usually freezes up the tab, if not the whole browser). But at that point fixing it was personal, hahah.