xml parser skips files with DOCTYPE entries #39

@jhansche

> Task :my-module:gatherModuleInfo
e: Invalid xml file my-module/src/main/res/values/strings.xml
   line 4; column 10: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
   <!DOCTYPE resources [
            ↑
Skipping file parsing

This does not terminate the process; it just skips the file.

I'm not sure why DOCTYPE is disallowed, but it is useful for defining named entities for unusual characters. For example, we use it to define entity aliases like these, which we can then use in our strings:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE resources [
    <!ENTITY ldquo  "&#8220;">
    <!ENTITY rdquo  "&#8221;">
    <!ENTITY lsquo  "&#8216;">
    <!ENTITY rsquo  "&#8217;">
    <!ENTITY hellip "&#8230;">
    <!ENTITY prime  "&#8242;">
    <!ENTITY Prime  "&#8243;">
    <!ENTITY bull   "&#8226;">
    <!ENTITY thinsp "&#8201;">
    <!ENTITY hairsp "&#8202;">
    ]>

Then later we can refer to these standard entity names in the strings:

<string name="string_name">One last thing&hellip;</string>

In this case, we prefer to use &hellip; here, because it is more meaningful for translators, and it is more grammatically correct, i.e., using … vs. three separate periods (...). It also translates differently for some languages: for example, some languages prefer a different type of ellipsis, like the midline (⋯) or vertical ellipsis (⋮).

When the Unicode character is inlined, people reading the file often can't tell that it is a Unicode character rather than its similar-looking ASCII counterpart (e.g., the ' apostrophe vs. the right single quotation mark, rsquo), which is why we use the numeric &#<>; notation. But when that notation is inlined into the string, someone reading the file won't understand what the number represents unless they look it up.

So the workaround is to use the named XML entity: it gives us the exact Unicode character, with a meaningful name, without compromising the character width, which can have an impact on some older editors that aren't well equipped to handle multi-byte Unicode characters.
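
For reference, here is a minimal sketch of how a JAXP-based parser could accept an internal DOCTYPE like ours while still blocking external entities and external DTDs, which I assume is the concern behind the current setting. I don't know how the plugin actually configures its parser; the class and method names below are hypothetical, and only the disallow-doctype-decl feature id comes from the error message above. The other feature ids are the standard SAX/Xerces ones.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not taken from the plugin's source.
class DoctypeDemo {
    // A trimmed-down version of the strings.xml from this issue.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
        + "<!DOCTYPE resources [\n"
        + "    <!ENTITY hellip \"&#8230;\">\n"
        + "]>\n"
        + "<resources>\n"
        + "    <string name=\"string_name\">One last thing&hellip;</string>\n"
        + "</resources>\n";

    // Parses the XML and returns the text of the first <string> element.
    // With disallowDoctype=true, any DOCTYPE makes the parse fail.
    static String parseFirstString(String xml, boolean disallowDoctype) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The feature named in the error message.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", disallowDoctype);
            // Keep external entities and external DTDs disabled either way.
            factory.setFeature(
                "http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature(
                "http://xml.org/sax/features/external-parameter-entities", false);
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            factory.setXIncludeAware(false);

            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("string").item(0).getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers can stay simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Internal entities are expanded when the DOCTYPE is allowed.
        System.out.println(parseFirstString(SAMPLE, false));
        try {
            parseFirstString(SAMPLE, true);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getCause().getMessage());
        }
    }
}
```

With the DOCTYPE allowed, &hellip; resolves to the real ellipsis character in the parsed string; with it disallowed, the same file is rejected as in the log above.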
