Cocoa (momijizukamori) wrote, 2021-06-15 01:44 pm

Webfonts as DRM

Writing this up mostly because this is absolutely buckwild, and I am sure some of the other programmers (and probably some of the non-programmers) on my list will be equal parts horrified and fascinated like I am.

As I mentioned previously, I've been working on scripts to scrape JJWXC, a Chinese webnovel site, and turn the scraped content into epubs, for offline reading, etc. I actually have the scripts basically done, but when I was checking the results of running them against paid-only chapters of a novel I had paid for, I noticed there was some weirdness - most of the text was right, but random characters were missing or encoded as random Unicode symbols.

My first two thoughts were that it was either an encoding issue or a font-support issue. JJWXC uses GB 18030, the Chinese national-standard character encoding, and irritatingly declares this only in the page metadata, not in the HTTP headers, so any tool that doesn't parse the page the way a browser would incorrectly assumes it to be something else (either UTF-8 or Latin-1/ISO 8859-1, I didn't dig enough to check) unless you manually override it. And epub requires UTF-8 (or UTF-16) text, so everything was being re-encoded. The fact that most of the text was fine suggested that the encoding wasn't the issue, though - I'd have expected far more errors than I was seeing (if anyone remembers mojibake - that's the sign of an encoding mismatch).
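For anyone who wants to see what "manually override it" looks like, here is a minimal Python sketch using only the standard library. It is not the scraper's actual code: the function names `sniff_meta_charset` and `decode_page` and the sample HTML are made up for illustration, and the regex is a rough stand-in for real HTML parsing.

```python
import re

# Rough pattern for a charset declared in a <meta> tag (illustrative, not a full HTML parser).
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw, default="gb18030"):
    """Return the charset named in the page metadata, or a fallback guess."""
    match = META_CHARSET.search(raw)
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw):
    """Decode raw page bytes using the declared charset instead of a tool's default."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")


# Invented sample page, encoded the way the site is described as serving it.
sample_html = (
    '<html><head><meta charset="gb18030"></head>'
    "<body>晋江文学城</body></html>"
).encode("gb18030")

text = decode_page(sample_html)            # correct text
mojibake = sample_html.decode("latin-1")   # what a wrong guess produces
utf8_bytes = text.encode("utf-8")          # re-encoded for the epub
```

A wrong guess like Latin-1 garbles every Chinese character on the page, not a scattered handful, which is why a mostly-correct result points away from an encoding problem.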

So it was on to fonts, and I actually hunted down some fonts explicitly designed for good coverage of Simplified Chinese in Unicode, and tried setting those in my epub reader... and the characters were still messed up. But they were rendering fine in my browser, so I went back to poke at the JJWXC page and see what font they were calling that actually worked. And that is when I discovered that what was going on was far, far weirder.

On free chapters, the font list specified for the novel text is "'Microsoft YaHei', PingFangSC-Regular, HelveticaNeue-Light, 'Helvetica Neue Light', sans-serif" - the first two are the default Windows and default macOS sans-serif fonts for Simplified Chinese. On paid chapters, however, the list is "jjwxcfont_004hu, 'Microsoft YaHei', PingFangSC-Regular, HelveticaNeue-Light, 'Helvetica Neue Light', sans-serif" (where the series of numbers and letters at the end of the first font name varies by chapter). This mystery new font was an embedded webfont, so I downloaded it, converted it to an OpenType font, and popped it open in FontForge, which is an open-source program for viewing/editing font glyphs. When I hid the undefined codepoints, what I got was this - a series of 200 characters assigned to seemingly-random locations within the Unicode Private Use Area, some of which have more common mappings that matched the errors I was seeing in my epub files. I repeated the process with another one of the webfonts, and got this - the same 200 characters, with different random assignments in the PUA block.
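A quick way to confirm that the "random symbols" in scraped text are Private Use Area codepoints is to count them. The sketch below is plain standard-library Python; the function name `find_pua` and the sample string are invented for illustration, and it only checks the main PUA block (U+E000 to U+F8FF), not the supplementary private-use planes.

```python
from collections import Counter

# Bounds of the Basic Multilingual Plane Private Use Area.
PUA_START, PUA_END = 0xE000, 0xF8FF


def find_pua(text):
    """Count each Private Use Area character that appears in the text."""
    return Counter(ch for ch in text if PUA_START <= ord(ch) <= PUA_END)


# Invented sample: ordinary Chinese text with one PUA codepoint used twice.
sample = "他\ue3a1道：\ue3a1好。"
counts = find_pua(sample)

for ch, n in counts.most_common():
    print(f"U+{ord(ch):04X} appears {n} time(s)")
```

Without the site's webfont, these codepoints have no standard glyph, so a reader either shows nothing or shows whatever a fallback font happens to put there.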

To back up for a second - a lot of these sites have various 'anti-copy' mechanisms, which usually take the form of nightmarish JavaScript that blocks attempts at using select and copy/paste commands. JJWXC doesn't have any of those, so I thought the worst I had to deal with was the fact that they don't use element ids/classes much, which makes selecting elements programmatically kind of a pain. But I think this webfont weirdness is their version of an anti-piracy method - a lot of these are fairly common characters (I was able to recognize about half of them from their use in Japanese, where they're all taught at the elementary-school level), so it'd be like taking a collection of common basic English words and replacing them in a text with random symbols. You might still be able to read the text, but it'd be a lot more difficult.

Of course, making it accessible in a browser means that there's always a way to work around the wackiness - at worst, EPUB 3 supports embedded webfonts, so I can theoretically just save the webfonts off and construct additional CSS. That incurs a filesize hit, though (~1 MB per 40 chapters, if each chapter has a different font file), and it limits your choices for viewing fonts, because that character set is only available in one font. The fact that each chapter I've checked (so far) has a different webfont suggests that they're being built programmatically, though. And the fact that I've gotten the same webfont for the same chapter, checking at different times and in different browsers, suggests that it's deterministic, so it may be possible to reverse programmatically. I'm going to spend some time trying to do that first.
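If a chapter's mapping from PUA codepoints to real characters can be recovered somehow, undoing the substitution is the easy part. The sketch below assumes such a mapping already exists; how to derive it from the font is exactly the open question. The names `build_table`, `deobfuscate`, and `unmapped_pua`, and the two-entry mapping, are all hypothetical.

```python
def build_table(pua_to_char):
    """Turn a {pua_char: real_char} dict into a table usable by str.translate."""
    return {ord(pua): real for pua, real in pua_to_char.items()}


def deobfuscate(text, table):
    """Replace every mapped PUA character with its real character."""
    return text.translate(table)


def unmapped_pua(text):
    """List PUA characters still left in the text after translation."""
    return sorted({ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF})


# Hypothetical per-chapter mapping; real values would have to come from the webfont.
chapter_map = {"\ue3a1": "你", "\ue7c2": "的"}
table = build_table(chapter_map)

fixed = deobfuscate("\ue3a1好，我\ue7c2朋友", table)
leftover = unmapped_pua(deobfuscate("\ue3a1好\ue999", table))
```

Since each chapter appears to get its own font, the mapping would need to be rebuilt per chapter. Checking for leftover PUA characters is a cheap way to notice when a mapping is incomplete or belongs to a different chapter.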