Module talk:Unicode data
{{Request edit button}}
About RTL
I am researching RTL scripts. I met this:
- A
: 0x41 -- {{#invoke:Unicode data|lookup|name|41}}
: {{#invoke:Unicode data|lookup|script|41}}
: is_rtl: {{#invoke:Unicode data|is|rtl|41}}
- ث
: 0x062B -- {{#invoke:Unicode data|lookup|name|062B}} [https://www.fileformat.info/info/unicode/char/062b/index.htm]
: {{#invoke:Unicode data|lookup|script|062B}}
: is_rtl: {{#invoke:Unicode data|is|rtl|062B}}
- ש
: 0x05E9 -- {{#invoke:Unicode data|lookup|name|05E9}} [https://www.fileformat.info/info/unicode/char/05e9/index.htm]
: {{#invoke:Unicode data|lookup|script|05E9}}
: is_rtl: {{#invoke:Unicode data|is|rtl|05E9}}
- ߖ
: 0x07D6 -- {{#invoke:Unicode data|lookup|name|07D6}} [https://www.fileformat.info/info/unicode/char/07d6/index.htm]
: {{#invoke:Unicode data|lookup|script|07D6}}
: is_rtl: {{#invoke:Unicode data|is|rtl|07D6}}
I'd expect the Arab, Hebr, Nkoo characters to be rtl=true. Am I misunderstanding something? {{ping|Erutuon}} -DePiep (talk) 20:58, 9 January 2021 (UTC)
: {{reply to|DePiep}} The invocation <code><nowiki>{{#invoke:Unicode data|is|rtl|05E9}}</nowiki></code> checks whether the literal characters 05E9 are right-to-left. To check the right-to-leftness of the Hebrew character, put in the literal character or an HTML character reference: <code><nowiki>{{#invoke:Unicode data|is|rtl|ש}}</nowiki></code> or <code><nowiki>{{#invoke:Unicode data|is|rtl|&#x05E9;}}</nowiki></code>. <code>#invoke:Unicode data|is|rtl</code>, as well as <code>#invoke:Unicode data|is|valid_pagename</code> and <code>#invoke:Unicode data|is|Latin</code>, interpret their arguments as strings rather than code points in hexadecimal because the corresponding functions in the module take strings. (They could take hexadecimal arguments if someone edited the module to add another parameter to tell them to interpret their argument this way.) — Eru·tuon 01:02, 10 January 2021 (UTC)
::{{re|Erutuon}} Thanks, will work for me. Great module! (Second code example is
). -DePiep (talk) 17:28, 10 January 2021 (UTC)
- The four characters, is_rtl:
: using &#x...; {{#invoke:Unicode data|is|rtl|A}}
: using &#x...; {{#invoke:Unicode data|is|rtl|ث}}
: using &#x...; {{#invoke:Unicode data|is|rtl|ש}}
: using &#x...; {{#invoke:Unicode data|is|rtl|ߖ}}
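Erutuon's explanation above comes down to the difference between the string "05E9" and the character U+05E9. The following is a minimal plain-Lua sketch of how a hexadecimal argument could be turned into the character it names before it is handed to a string-based check. The helper names <code>hex_to_codepoint</code>, <code>codepoint_to_utf8</code> and <code>hex_to_char</code> are hypothetical and are not part of Module:Unicode data; inside Scribunto, <code>mw.ustring.char</code> would likely be used for the encoding step instead.

```lua
-- Hypothetical helpers for illustration; not part of Module:Unicode data.

-- Parse a hexadecimal string such as "05E9" into a code point number.
local function hex_to_codepoint(hex)
	local codepoint = tonumber(hex, 16)
	if not codepoint or codepoint < 0 or codepoint > 0x10FFFF then
		error("not a valid hexadecimal code point: " .. tostring(hex))
	end
	return codepoint
end

-- Encode a code point as a UTF-8 string using only the standard library.
local function codepoint_to_utf8(cp)
	local floor = math.floor
	if cp < 0x80 then
		return string.char(cp)
	elseif cp < 0x800 then
		return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
	elseif cp < 0x10000 then
		return string.char(0xE0 + floor(cp / 0x1000),
			0x80 + floor(cp / 0x40) % 0x40,
			0x80 + cp % 0x40)
	else
		return string.char(0xF0 + floor(cp / 0x40000),
			0x80 + floor(cp / 0x1000) % 0x40,
			0x80 + floor(cp / 0x40) % 0x40,
			0x80 + cp % 0x40)
	end
end

-- Convert "05E9" to the single character it names.
local function hex_to_char(hex)
	return codepoint_to_utf8(hex_to_codepoint(hex))
end

-- Byte lengths: the string "05E9" is four ASCII bytes, while the
-- character U+05E9 is encoded as two UTF-8 bytes.
print(#"05E9", #hex_to_char("05E9"))
```

With a helper like this, a wrapper could accept either form and pass only real characters on to the existing string-based functions.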
is_pagename
Missing documentation: Hangul, Aliases
is_RTL check?
About {{unichar|0634|ARABIC LETTER SHEEN}} [https://www.fileformat.info/info/unicode/char/0634/index.htm]:
:
I expect true (is_rtl), right? -DePiep (talk) 23:00, 28 March 2022 (UTC)
:Solved: enter the character <ش >, not the U+hex:
:*
Edit request 20 November 2023
{{Edit template-protected|answered=yes}}
Description of suggested change: the module code says "-- No image data modules on Wikipedia yet."
We have them now. Can this be enabled? — Alexis Jazz (talk or ping me) 05:37, 20 November 2023 (UTC)
:Can you sandbox the code? — Martin (MSGJ · talk) 12:46, 20 November 2023 (UTC)
::MSGJ, I don't speak Lua. I edited :Module:Unicode data/sandbox to sync with the current version and I uncommented the block.
returns {{#invoke:Unicode data/sandbox|lookup|image|0xA9}} (:File:{{#invoke:Unicode data/sandbox) so I think this works? — Alexis Jazz (talk or ping me) 21:19, 20 November 2023 (UTC)
: {{done}} I'm not sure I agree with your importing of so many modules from other wikis, but in any event there was never any good reason to comment out that code as opposed to just letting uses of it fail. * Pppery * it has begun... 21:36, 22 November 2023 (UTC)
Edit request 20 April 2024
{{Edit fully-protected|answered=yes}}
Description of suggested change: Creation of p.is_noncharacter() as a separate function
Diff:
{{TextDiff|1=function p.lookup_name(codepoint)
-- U+FDD0-U+FDEF and all code points ending in FFFE or FFFF are Unassigned
-- (Cn) and specifically noncharacters:
-- https://www.unicode.org/faq/private_use.html#nonchar4
if 0xFDD0 <= codepoint and (codepoint <= 0xFDEF
or floor(codepoint % 0x10000) >= 0xFFFE) then
return ("
end|2=function p.is_noncharacter(codepoint)
-- U+FDD0-U+FDEF and all code points ending in FFFE or FFFF are Unassigned
-- (Cn) and specifically noncharacters:
-- https://www.unicode.org/faq/private_use.html#nonchar4
return 0xFDD0 <= codepoint and (codepoint <= 0xFDEF
or floor(codepoint % 0x10000) >= 0xFFFE)
end
function p.lookup_name(codepoint)
if p.is_noncharacter(codepoint) then
return ("
end}} Eievie (talk) 20:48, 20 April 2024 (UTC)
: {{Done}} * Pppery * it has begun... 15:22, 21 April 2024 (UTC)
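For readers who want to check the predicate from the request, here is a standalone sketch that copies the range test from the diff into a local function and runs it on a few boundary values. It runs in plain Lua outside the module, and the sample list is chosen here purely for illustration.

```lua
local floor = math.floor

-- Same range test as in the diff above, written as a standalone function.
-- Noncharacters are U+FDD0..U+FDEF plus every code point whose low 16 bits
-- are FFFE or FFFF.
local function is_noncharacter(codepoint)
	return 0xFDD0 <= codepoint and (codepoint <= 0xFDEF
		or floor(codepoint % 0x10000) >= 0xFFFE)
end

-- Boundary values picked for illustration.
local samples = {
	0x41, 0xFDCF, 0xFDD0, 0xFDEF, 0xFDF0, 0xFFFD, 0xFFFE, 0x1FFFF, 0x10FFFF,
}
for _, codepoint in ipairs(samples) do
	print(("U+%04X"):format(codepoint), is_noncharacter(codepoint))
end
```

The first comparison works as a shortcut because every code point ending in FFFE or FFFF is already above U+FDD0.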
Edit request 1 January 2025
{{Edit template-protected|answered=yes}}
Description of suggested change:
Allow looking up the {{mono|kCantonese}} [https://www.unicode.org/reports/tr38/ Unihan property]. As an example, {{mono|
Diff:
function p.lookup_kCantonese(codepoint)
	local data = loader[('Unihan/kCantonese/%02X'):format(floor(codepoint / 0x1000))]
	if data then
		return data[codepoint]
	end
end
Northern Moonlight 03:54, 1 January 2025 (UTC)
: {{done}} * Pppery * it has begun... 23:05, 13 January 2025 (UTC)
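The function in this request picks a data submodule by dividing the code point by 0x1000, so each submodule covers a block of 4096 code points. The sketch below shows which submodule name a given code point maps to. The <code>loader</code> table here is a hypothetical stand-in for the module's real loader, and the stored reading is a placeholder rather than actual Unihan data.

```lua
local floor = math.floor

-- Name of the data submodule that would hold a code point, using the
-- format string from the diff: one shard per block of 0x1000 code points.
local function kCantonese_shard(codepoint)
	return ('Unihan/kCantonese/%02X'):format(floor(codepoint / 0x1000))
end

-- Hypothetical stand-in for the module's loader: a plain table of shards
-- keyed by name. The reading is a placeholder, not real Unihan data.
local loader = {
	['Unihan/kCantonese/04'] = { [0x4E00] = 'placeholder-reading' },
}

-- Mirrors the shape of the requested function: a missing shard or a
-- missing entry both yield nil.
local function lookup_kCantonese(codepoint)
	local data = loader[kCantonese_shard(codepoint)]
	if data then
		return data[codepoint]
	end
end

print(kCantonese_shard(0x4E00), lookup_kCantonese(0x4E00))
print(kCantonese_shard(0x20000), lookup_kCantonese(0x20000))
```

Sharding this way presumably keeps each data page small, so a lookup loads only the block it needs.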
Edit request 15 June 2025
{{edit template-protected|answered=yes}}
Description of suggested change:
Reorder the name_hooks table so its entries are sorted in codepoint order. binary_range_search assumes the entries are sorted this way; they currently are not, so the search does not work correctly. {{tl|unichar}} is currently broken by this bug, as can be seen in {{section link|CJK Unified Ideographs Extension I|Background}}. Specifically, {{unichar2|2ED9D}} and {{unichar2|2EDE0}} incorrectly appear as reserved. I have made the change in the sandbox.
Diff:
See {{compare pages|1269285012|1295718701|comparison of sandbox with main}} Warudo (talk) 12:20, 15 June 2025 (UTC)
{{TextDiff|1=-- For the algorithm used to generate Hangul Syllable names,
-- see "Hangul Syllable Name Generation" in section 3.12 of the
-- Unicode Specification:
-- https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
local name_hooks = {
{ 0x00, 0x1F, "
{ 0x7F, 0x9F, "
{ 0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension A
{ 0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph
{ 0xAC00, 0xD7A3, function (codepoint) -- Hangul Syllables
local Hangul_data = loader.Hangul
local syllable_index = codepoint - 0xAC00
return ("HANGUL SYLLABLE %s%s%s"):format(
Hangul_data.leads[floor(syllable_index / Hangul_data.final_count)],
Hangul_data.vowels[floor((syllable_index % Hangul_data.final_count)
/ Hangul_data.trail_count)],
Hangul_data.trails[syllable_index % Hangul_data.trail_count]
)
end },
-- High Surrogates, High Private Use Surrogates, Low Surrogates
{ 0xD800, 0xDFFF, "
{ 0xE000, 0xF8FF, "
-- CJK Compatibility Ideographs
{ 0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0x17000, 0x187F7, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph
{ 0x18800, 0x18AFF, function (codepoint)
return ("TANGUT COMPONENT-%03d"):format(codepoint - 0x187FF)
end },
{ 0x18D00, 0x18D08, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph Supplement
{ 0x1B170, 0x1B2FB, "NUSHU CHARACTER-%04X" }, -- Nushu
{ 0x20000, 0x2A6DF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension B
{ 0x2A700, 0x2B739, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension C
{ 0x2B740, 0x2B81D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension D
{ 0x2B820, 0x2CEA1, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension E
{ 0x2CEB0, 0x2EBE0, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension F
-- CJK Compatibility Ideographs Supplement (Supplementary Ideographic Plane)
{ 0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0xE0100, 0xE01EF, function (codepoint) -- Variation Selectors Supplement
return ("VARIATION SELECTOR-%d"):format(codepoint - 0xE0100 + 17)
end},
{ 0x30000, 0x3134A, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension G
{ 0x31350, 0x323AF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension H
{ 0x2EBF0, 0x2EE5D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension I
{ 0xF0000, 0xFFFFD, "
{ 0x100000, 0x10FFFD, "
}|2=-- For the algorithm used to generate Hangul Syllable names,
-- see "Hangul Syllable Name Generation" in section 3.12 of the
-- Unicode Specification:
-- https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
-- binary_range_search assumes these are ordered by codepoint. Do not place them in a random order!
local name_hooks = {
{ 0x00, 0x1F, "
{ 0x7F, 0x9F, "
{ 0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension A
{ 0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph
{ 0xAC00, 0xD7A3, function (codepoint) -- Hangul Syllables
local Hangul_data = loader.Hangul
local syllable_index = codepoint - 0xAC00
return ("HANGUL SYLLABLE %s%s%s"):format(
Hangul_data.leads[floor(syllable_index / Hangul_data.final_count)],
Hangul_data.vowels[floor((syllable_index % Hangul_data.final_count)
/ Hangul_data.trail_count)],
Hangul_data.trails[syllable_index % Hangul_data.trail_count]
)
end },
-- High Surrogates, High Private Use Surrogates, Low Surrogates
{ 0xD800, 0xDFFF, "
{ 0xE000, 0xF8FF, "
-- CJK Compatibility Ideographs
{ 0xF900, 0xFA6D, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0xFA70, 0xFAD9, "CJK COMPATIBILITY IDEOGRAPH-%04X" },
{ 0x17000, 0x187F7, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph
{ 0x18800, 0x18AFF, function (codepoint)
return ("TANGUT COMPONENT-%03d"):format(codepoint - 0x187FF)
end },
{ 0x18D00, 0x18D08, "TANGUT IDEOGRAPH-%04X" }, -- Tangut Ideograph Supplement
{ 0x1B170, 0x1B2FB, "NUSHU CHARACTER-%04X" }, -- Nushu
{ 0x20000, 0x2A6DF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension B
{ 0x2A700, 0x2B739, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension C
{ 0x2B740, 0x2B81D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension D
{ 0x2B820, 0x2CEA1, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension E
{ 0x2CEB0, 0x2EBE0, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension F
{ 0x2EBF0, 0x2EE5D, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension I
{ 0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH-%04X" }, -- CJK Compatibility Ideographs Supplement (Supplementary Ideographic Plane)
{ 0x30000, 0x3134A, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension G
{ 0x31350, 0x323AF, "CJK UNIFIED IDEOGRAPH-%04X" }, -- CJK Ideograph Extension H
{ 0xE0100, 0xE01EF, function (codepoint) -- Variation Selectors Supplement
return ("VARIATION SELECTOR-%d"):format(codepoint - 0xE0100 + 17)
end},
{ 0xF0000, 0xFFFFD, "
{ 0x100000, 0x10FFFD, "
} }} --Warudo (talk) 13:56, 15 June 2025 (UTC)
:{{done}} in Special:Diff/1296263621, thank you. U+2ED9D and U+2EDE0 are now shown correctly. I've also added a test at Template:Unichar/testcases#U+2ED9D – grass radical to show the effect. —andrybak (talk) 22:57, 18 June 2025 (UTC)
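To illustrate why the ordering matters, here is a small self-contained sketch of a binary search over code point ranges. It is a simplified stand-in written for this thread: the module's actual <code>binary_range_search</code> is not quoted above and may differ in detail. The range boundaries are taken from the diff, and the labels in the third field are only for display.

```lua
-- Simplified stand-in for a binary search over { first, last, label }
-- ranges. It is only correct when the ranges are sorted by code point.
local function binary_range_search(codepoint, ranges)
	local low, high = 1, #ranges
	while low <= high do
		local mid = math.floor((low + high) / 2)
		local range = ranges[mid]
		if codepoint < range[1] then
			high = mid - 1
		elseif codepoint <= range[2] then
			return range
		else
			low = mid + 1
		end
	end
	return nil
end

-- Tail of the table in its old order: Extension I comes last.
local unsorted = {
	{ 0x2CEB0, 0x2EBE0, "Extension F" },
	{ 0x2F800, 0x2FA1D, "Compatibility Ideographs Supplement" },
	{ 0xE0100, 0xE01EF, "Variation Selectors Supplement" },
	{ 0x30000, 0x3134A, "Extension G" },
	{ 0x31350, 0x323AF, "Extension H" },
	{ 0x2EBF0, 0x2EE5D, "Extension I" },
}

-- The same entries in code point order, as in the requested change.
local sorted = {
	{ 0x2CEB0, 0x2EBE0, "Extension F" },
	{ 0x2EBF0, 0x2EE5D, "Extension I" },
	{ 0x2F800, 0x2FA1D, "Compatibility Ideographs Supplement" },
	{ 0x30000, 0x3134A, "Extension G" },
	{ 0x31350, 0x323AF, "Extension H" },
	{ 0xE0100, 0xE01EF, "Variation Selectors Supplement" },
}

for _, codepoint in ipairs({ 0x2ED9D, 0x2EDE0 }) do
	local miss = binary_range_search(codepoint, unsorted)
	local hit = binary_range_search(codepoint, sorted)
	print(("U+%04X"):format(codepoint),
		miss and miss[3] or "not found",
		hit and hit[3] or "not found")
end
```

In the unsorted table the search narrows toward the start of the list and never visits the Extension I entry at the end, which matches the symptom of those code points appearing as reserved.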