momijizukamori: (tits against the rte)
Cocoa ([personal profile] momijizukamori) wrote 2021-06-15 01:44 pm

Webfonts as DRM

Writing this up mostly because this is absolutely buckwild, and I am sure some of the other programmers (and probably some of the non-programmers) on my list will be equal parts horrified and fascinated like I am.

As I mentioned previously, I've been working on scripts to scrape JJWXC, a Chinese webnovel site, and turn the scraped content into epubs, for offline reading, etc. I actually have the scripts basically done, but when I was checking the results of running them against paid-only chapters on a novel I had paid for, I noticed there was some weirdness - most of the text was right but there were random characters missing or encoded as random Unicode symbols.

My first two thoughts were that it was either an encoding issue or a font-support issue. JJWXC uses GB 18030, a Chinese national-standard encoding used mostly for Simplified Chinese text, and irritatingly doesn't declare this in the HTTP headers, just the page metadata, so any tool that doesn't parse the page the way a browser does incorrectly assumes it to be something else (either UTF-8 or Latin-1/ISO 8859-1, I didn't dig enough to check) unless you manually override it. And epub requires UTF-8-encoded text, so everything was being re-encoded. The fact that most of the text was fine suggested that the encoding wasn't the issue though - I'd have expected more errors in the text than what I was seeing (if anyone remembers mojibake - that's the sign of encoding mismatches).
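For the programmers on my list, here's a rough sketch of what "manually override it" looks like in Python. This is an illustration, not my actual scraper code: the function name and the sample string are made up, and it assumes you already have the page body as raw bytes.

```python
def gb18030_to_utf8(raw: bytes) -> bytes:
    """Decode bytes as GB 18030 and re-encode them as UTF-8 for the epub."""
    text = raw.decode("gb18030")
    return text.encode("utf-8")


# Made-up sample standing in for a fetched page body.
sample = "晋江文学城".encode("gb18030")

# Guessing the wrong codec doesn't raise an error here, it just yields mojibake.
wrong = sample.decode("latin-1")

# Decoding with the right codec round-trips cleanly.
right = gb18030_to_utf8(sample).decode("utf-8")
```

If the encoding were the problem, basically every Chinese character would come out like `wrong`, not just a scattered handful.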

So it was on to fonts, and I actually hunted down some fonts explicitly designed for good coverage of Simplified Chinese in Unicode, and tried setting those in my epub reader... and the characters were still messed up. But they were rendering fine in my browser, so I went back to poke at the JJWXC page and see what font they were calling that actually worked. And that is when I discovered that what was going on was far, far weirder.

On free chapters, the fontlist specified for the novel text is "'Microsoft YaHei', PingFangSC-Regular, HelveticaNeue-Light, 'Helvetica Neue Light', sans-serif" - the first two are the default Windows and default OSX sans-serif fonts for Simplified Chinese. On paid chapters, however, the list was "jjwxcfont_004hu, 'Microsoft YaHei', PingFangSC-Regular, HelveticaNeue-Light, 'Helvetica Neue Light', sans-serif" (where the series of numbers and letters at the end of the first font varies by chapter). This mystery new font was an embedded webfont, so I downloaded it, converted it to an OpenType Font, and popped it open in FontForge, which is an open source program for viewing/editing font glyphs. When I hid the undefined codepoints, what I got was this - a series of 200 characters assigned to seemingly-random locations within the Unicode Private Use Area, some of which have more common mappings that matched the errors I was seeing in my epub files. I repeated the process with another one of the webfonts, and got this - same 200 characters, different random assignments in the PUA block.
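A quick way to confirm that the garbage in scraped text really is Private Use Area characters is to check each character's Unicode category, since private-use codepoints are category `Co`. This is just a sketch: the helper names and the sample string (including the two PUA codepoints in it) are invented for illustration and aren't taken from an actual JJWXC font.

```python
import unicodedata
from collections import Counter


def is_private_use(ch: str) -> bool:
    """True if the character sits in a Unicode Private Use Area."""
    return unicodedata.category(ch) == "Co"


def pua_counts(text: str) -> Counter:
    """Count how often each private-use character appears in the text."""
    return Counter(ch for ch in text if is_private_use(ch))


# Invented sample: ordinary characters with two stand-in PUA codepoints mixed in.
sample = "我\ue3a1知道\uf021你\ue3a1"
counts = pua_counts(sample)
```

Running something like this over a whole chapter gives you the list of codepoints that need a mapping, plus how often each one shows up.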

To back up for a second - a lot of these sites have various 'anti-copy' mechanisms, which usually take the form of nightmarish JavaScript that blocks attempts at using select and copy/paste commands. JJWXC doesn't have any of those, so I thought the worst I had to deal with was the fact that they don't use element ids/classes much, which makes selecting elements programmatically kind of a pain. But I think this webfont weirdness is their version of an anti-piracy method - a lot of these are fairly common characters (I was able to recognize about half of them from their use in Japanese, where they're all taught at the elementary-school level), so it'd be like taking a collection of common basic English words, and replacing them in a text with random symbols. You might be able to read the text still, but it'd be a lot more difficult.

Of course, making it accessible in a browser means that there's always a way to work around the wackiness - at worst, epub 3.0 supports embedded webfonts, so I can theoretically just save the webfonts off and construct additional CSS. That incurs a filesize hit, though (~1 MB per 40 chapters, if each chapter has a different font file), and it limits your choices for viewing fonts, because that character set is only available in one font. The fact that each chapter I've checked (so far) has a different webfont suggests that they're being built programmatically, though. And the fact that I've gotten the same webfont for the same chapter, checking at different times and on different browsers, suggests that the process is deterministic, so it may be possible to reverse programmatically. I'm going to spend some time trying to do that first.
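To be clear about what "reversing" would mean: once you know which real character each PUA codepoint stands for in a given chapter's font, fixing the text is a plain substitution. The sketch below assumes that mapping already exists. The two entries in it are invented placeholders, and working out the real mapping per font is the hard part I haven't done yet.

```python
# Hypothetical mapping for one chapter's font: PUA codepoint -> real character.
# These two entries are made up; a real table would have about 200 of them.
chapter_map = {0xE3A1: "的", 0xF021: "是"}


def restore_text(scrambled: str, mapping: dict) -> str:
    """Swap each mapped private-use character for the character it stands for."""
    # str.translate takes a dict keyed by codepoint (int) and leaves
    # unmapped characters untouched.
    return scrambled.translate(mapping)


scrambled = "这\uf021我\ue3a1书"
restored = restore_text(scrambled, chapter_map)
```

Since every chapter seems to get its own font, you'd need one of these tables per chapter, which is why generating them programmatically matters.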
silveradept: A kodama with a trombone. The trombone is playing music, even though it is held in a rest position (Default)

[personal profile] silveradept 2021-06-15 08:19 pm (UTC)(link)
That's an interesting way of going about trying to prevent copying of work. Feels like there's only so many combinations that have to be worked through?
silveradept: A kodama with a trombone. The trombone is playing music, even though it is held in a rest position (Default)

[personal profile] silveradept 2021-06-16 05:08 pm (UTC)(link)
Good skill at figuring it out and making things more enjoyable and accessible for everyone. I guess this particular site decided to use the "dummy streets" option for protecting their work, which seems a bit novel compared to other anti-piracy schemes.
adevyish: Icon of chibi Shizuo emphatically throwing a vending machine at chibi Izaya (tableflip)

[personal profile] adevyish 2021-06-16 05:32 am (UTC)(link)

This is so utterly hostile to screenreaders, I want to scream. (I use a screenreader for TC Mandarin because it’s quicker than firing up the dictionary, although the intonation is so off from the vernacular sometimes that I do still have to fire up the dictionary. Illiteracy!)

brainwane: My smiling face, including a small gold bindi (Default)

[personal profile] brainwane 2021-06-16 09:30 pm (UTC)(link)
MY GOODNESS. I know a few people I'm going to share this with...
sleeplesspotato: tabby kitten looking up (Default)

[personal profile] sleeplesspotato 2021-06-17 07:09 pm (UTC)(link)
It's the kind of ingenuity that makes me wonder if the person(s) who devised it enjoyed the potential frustration it would cause. ^^;;

TIL the term mojibake; reminds me of dealing with text copied from Word files into an old version of MySQL where all the non-ASCII characters got garbled. We probably didn't catch all of them.
kaberett: Trans symbol with Swiss Army knife tools at other positions around the central circle. (Default)

[personal profile] kaberett 2021-06-26 09:45 pm (UTC)(link)
... they WHAT NOW.
brownbetty: (Default)

[personal profile] brownbetty 2021-10-01 03:49 pm (UTC)(link)

Ooooh! I encountered a site that does this same thing with the english alphabet, but they have just one font, so it was fairly easy to map!

brownbetty: (Default)

[personal profile] brownbetty 2021-10-02 12:15 am (UTC)(link)

I think it was! I just wrote a little yaml dictionary by hand, since it was stable.

trobadora: (Default)

[personal profile] trobadora 2022-11-29 09:31 am (UTC)(link)
What the ever-loving fuck. Wow. This is fascinating!
luckyzukky: kara zor-el from dc comics (dc | kara #3)

[personal profile] luckyzukky 2025-07-03 06:35 pm (UTC)(link)
DRM FREAKS MUST BE STOPPED