Problem
_check_content_type used the full Content-Type header string (lowercased) and matched it with startswith(...) against allowed prefixes.
That is mostly fine when the server sends a bare type like application/pdf. It becomes fragile when vendors send parameters on the same header (e.g. name="…", charset=…). In theory application/force-download; name="…" still starts with application/force-download, so the prefix match should pass, but in practice you can get:
Leading whitespace or a UTF‑8 BOM before the type token, so the string no longer starts with your prefix even though the MIME type is correct.
Confusing logs: logging the lowercased full header is fine, but the allow/deny decision should be based on the standardized MIME essence (type/subtype, no parameters), which is what other stacks use to decide what a response is.
So the fix is to parse the header the usual way and only then apply your allowlist.
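The failure mode can be reproduced with a minimal sketch of the old behaviour. The names old_check and ALLOWED_PREFIXES, and the exact contents of the prefix tuple, are assumptions made for illustration and are not taken from the real code.

```python
# Hypothetical allowlist; the real one lives with _check_content_type.
ALLOWED_PREFIXES = ("image/", "application/pdf", "application/force-download")


def old_check(header_value: str) -> bool:
    """Approximation of the old logic: prefix match on the lowercased full header."""
    return header_value.lower().startswith(ALLOWED_PREFIXES)


# A bare type, or a type followed by parameters, still matches.
print(old_check("application/pdf"))
print(old_check('application/force-download; name="manual.pdf"'))

# Leading whitespace or a BOM defeats the match even though the type is allowed.
print(old_check(" application/pdf; charset=utf-8"))
print(old_check("\ufeffapplication/pdf"))
```

The first two calls return True and the last two return False, which is the inconsistency the change addresses.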
What changed
_content_type_essence(header_value)
Takes everything before the first ; (the essence).
Strips whitespace, lowercases, strips a leading BOM (\ufeff) so odd clients/proxies don’t break the check.
_check_content_type
Reads the raw content-type header once.
Runs startswith on the essence, not on the full header with parameters.
Rejects if the essence is empty (missing or useless header).
Logging uses the raw header string (or (missing header)), so operators still see exactly what the server sent.
Call sites and allowed prefixes (image/, application/pdf, etc.) are unchanged; only how the string is normalized before comparison changes.
Security / SSRF
This does not replace URL / SSRF controls; it only makes post-fetch type checking consistent with how Content-Type is defined (essence vs parameters). The allowlist is not widened: the prefixes are the same, with stricter handling of “empty” and clearer matching on the actual type token.
Risk / regression
Low: same allowed prefixes, strictly more tolerant of benign formatting (whitespace, BOM, parameters). The only stricter case is empty essence after strip (e.g. malformed header), which correctly fails the check.
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
I have reviewed the proposal. These edits clean up the Content-Type string more thoroughly before it is compared against the allowlist of content types.
I have tested this and confirm that I do not get any errors loading PDFs for game manuals with it. Please consider this change: it should be compatible with the existing content type allowlist and should work with any new types added to it.