Chinese Voice to Text — 中文语音转文字

"Chinese" is not one language but a family — Mandarin, Cantonese, Hokkien, Shanghainese, and dozens of others share a writing system but are mutually unintelligible in speech. Voice recognition treats them entirely separately. Mandarin (普通话) is well-supported with high accuracy. Cantonese (廣東話) has dedicated model support. Other varieties have little to no ASR support. Chinese also has the most extreme character selection problem of any language — every syllable maps to dozens of possible characters, and the model must choose correctly based on context. Four tones add another disambiguation layer. This page explains how all of it works.

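The main challenges for Chinese speech recognition are the four-tone system, homophone character selection (more complex than in Japanese), the fundamental difference between Mandarin and Cantonese, and the choice between Simplified and Traditional characters. This page covers each challenge in detail and explains how to get the best results.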

Mandarin vs Cantonese: Two Separate Languages

The most important thing to understand about Chinese speech recognition is that Mandarin and Cantonese are not accents of the same language — they are mutually unintelligible spoken languages that happen to share a largely common writing system. A Mandarin ASR model is completely useless for Cantonese speech, and vice versa. Selecting the wrong locale produces not just errors but near-total gibberish.

🇨🇳

普通话 Mandarin (zh-CN / zh-TW)

Mandarin is the official language of mainland China and Taiwan, and one of four official languages of Singapore. It has four tones plus a neutral tone. Around 920 million native speakers. By far the most supported Chinese variety in ASR — Google, Baidu, Apple, and Microsoft all have mature Mandarin models. Use zh-CN for Simplified Chinese output (mainland), zh-TW for Traditional Chinese output (Taiwan).

🇭🇰

廣東話 Cantonese (zh-HK / yue)

Cantonese is the dominant spoken language of Hong Kong, Macau, and Guangdong province, and is widely spoken by overseas Chinese communities worldwide. It has six tones (some analyses count nine). Around 85 million native speakers. ASR support exists — zh-HK in the Web Speech API routes to a Cantonese model — but it is significantly less mature than Mandarin, with higher error rates and less training data.

The same written sentence — completely different spoken forms

Written: 我去學校

Mandarin: wǒ qù xuéxiào

Tonal pattern: 3rd, 4th, 2nd, 4th

Written: 我去學校

Cantonese: ngóh heui hohk-haauh

Same characters, different pronunciation, drawn from a six-tone system

Other Chinese varieties — minimal ASR support

Hokkien/Minnan (spoken in Fujian, Taiwan as Taiwanese, and across Southeast Asia), Shanghainese/Wu (Shanghai and surrounding areas), Hakka, and other Sinitic languages have no dedicated locale in browser-based speech recognition. These communities must either speak Mandarin for ASR purposes or use specialised apps built for specific varieties.
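The locale rules in this section can be summarised as a small lookup. The following is a minimal sketch in Python, not the tool's actual code: `pick_locale` and the `LOCALES` table are hypothetical names introduced for illustration, and only the three locale tags come from the text above.

```python
# Hypothetical lookup: (spoken variety, output script) -> locale tag.
# Only the three tags themselves are taken from the article.
LOCALES = {
    ("mandarin", "simplified"): "zh-CN",
    ("mandarin", "traditional"): "zh-TW",
    ("cantonese", "traditional"): "zh-HK",
}


def pick_locale(variety, script="simplified"):
    """Return the locale tag for a spoken variety and desired output script."""
    key = (variety.lower(), script.lower())
    if key not in LOCALES:
        # Hokkien, Shanghainese, Hakka and others have no dedicated locale.
        raise ValueError(f"no dedicated locale for {variety!r} with {script!r} output")
    return LOCALES[key]


print(pick_locale("Mandarin", "traditional"))   # zh-TW
print(pick_locale("Cantonese", "traditional"))  # zh-HK
```

Raising an error for unsupported varieties, instead of silently falling back to Mandarin, mirrors the point above: a wrong locale produces gibberish, so it is better to fail loudly.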

Tones and Speech Recognition — 声调与语音识别

Mandarin has four tones plus a neutral tone — each changes the meaning of a syllable entirely. The syllable "ma" produces four completely different words depending on pitch contour:

Tone | Pinyin | Character | Meaning | Contour
1st (high level) | mā | 妈 | mother | 55, flat high
2nd (rising) | má | 麻 | hemp / numb | 35, rises
3rd (dipping) | mǎ | 马 | horse | 214, dips then rises
4th (falling) | mà | 骂 | to scold | 51, falls sharply
Neutral (unstressed) | ma | 吗 | question particle | short, no contour

Crucially, Mandarin ASR models do not rely primarily on tones for character selection — they use tones as one signal among many, but contextual language models do most of the disambiguation work. This has two practical implications: first, non-native speakers with imperfect tones still get reasonable accuracy for common vocabulary, because context compensates. Second, tone alone cannot reliably distinguish characters — the model must consider the full sentence context regardless.

Practical implication for non-native speakers

If you're a non-native Mandarin speaker with imprecise tones — HSK 4–5 level — you will still get reasonable accuracy for common vocabulary because the language model compensates. Where you'll see more errors is in low-frequency words where tone is the only distinguishing factor and context doesn't help. Native speakers with tone sandhi patterns (particularly 3rd tone + 3rd tone → 2nd + 3rd) produce correct output because models are trained on natural connected speech.
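The third-tone sandhi rule mentioned above (3rd + 3rd becomes 2nd + 3rd) is easy to state as code. This is a deliberately simplified sketch: the function name is hypothetical, and it changes every 3rd tone that is followed by another 3rd tone, whereas real speech groups longer runs of 3rd tones by phrase structure.

```python
from typing import List


def apply_third_tone_sandhi(tones: List[int]) -> List[int]:
    """Return the pronounced tones for a sequence of dictionary tones.

    Simplification: a 3rd tone becomes a 2nd tone whenever the syllable
    after it is (in its dictionary form) also a 3rd tone.
    """
    result = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            result[i] = 2
    return result


# 你好 (nǐ hǎo) is written 3-3 but pronounced 2-3.
print(apply_third_tone_sandhi([3, 3]))        # [2, 3]
# 我去学校 (3-4-2-4) has no adjacent 3rd tones, so nothing changes.
print(apply_third_tone_sandhi([3, 4, 2, 4]))  # [3, 4, 2, 4]
```

A recogniser trained on connected speech never applies such a rule explicitly; it simply hears the changed tones in its training data, which is why native sandhi patterns do not cause errors.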

The Character Selection Problem — 同音字:比日语更复杂

Japanese has many homophones. Chinese has dramatically more. Because Chinese characters were assigned to monosyllabic morphemes over millennia, a single syllable with a single tone can correspond to dozens of characters with completely different meanings. The model must select the correct character for every syllable based solely on sentence context — there is no additional acoustic information to help.

Syllable + tone | Character options (selection) | Meanings | Risk level
yì (4th) | 义/意/亿/易/译/艺/议/异 | righteousness/meaning/hundred million/easy/translate/art/discuss/different | Very high
shì (4th) | 是/事/市/室/示/式/视/试 | to be/matter/market/room/show/style/vision/test | Very high
zhì (4th) | 治/制/志/质/至/智/致 | govern/system/aspiration/quality/arrive/wisdom/cause | High
jī (1st) | 机/鸡/积/击/基/激/肌 | machine/chicken/accumulate/hit/base/stimulate/muscle | High
xīn (1st) | 心/新/薪/欣/辛/芯 | heart/new/salary/joy/hardship/core | Medium (context usually clear)

The model handles this with a language model, which in essence predicts the character that is statistically most likely given the surrounding characters. For common words in typical sentence contexts, accuracy is high. It fails most often in three cases: ambiguous single-character words out of context, technical vocabulary with rare characters, and proper nouns (personal names, place names) where frequency statistics don't apply.

Tip: speak in full sentences

Single words dictated in isolation produce more character errors than the same words in sentences. "我需要机票" (I need plane tickets) lets the model choose 机 (machine/plane) correctly from context. "机" alone could be any of at least seven common characters pronounced jī. Always provide sentence context, and finish your thought before pausing.
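To make the role of context concrete, here is a toy sketch of homophone selection for the syllable jī. It is not how any production recogniser works: all counts are invented for illustration, the names are hypothetical, and a real language model looks at far more than the next character.

```python
# Candidate characters for jī (1st tone), as listed in the table above.
CANDIDATES = {"ji1": ["机", "鸡", "积", "击", "基", "激", "肌"]}

# Invented counts, for illustration only: how often each two-character
# sequence appears in some imagined training corpus.
BIGRAM_COUNTS = {"机票": 90, "鸡肉": 80, "肌肉": 70, "基础": 60, "积累": 50}

# Invented standalone frequencies, used when no context is available.
UNIGRAM_COUNTS = {"机": 30, "基": 25, "鸡": 20, "积": 10, "击": 8, "激": 6, "肌": 5}


def choose_character(syllable, next_char=None):
    """Pick a character for a syllable, preferring context over raw frequency."""
    options = CANDIDATES[syllable]
    if next_char is not None:
        scored = [(BIGRAM_COUNTS.get(c + next_char, 0), c) for c in options]
        best_count, best = max(scored)
        if best_count > 0:
            return best
    # No usable context: fall back to the most frequent character overall.
    return max(options, key=lambda c: UNIGRAM_COUNTS.get(c, 0))


print(choose_character("ji1", "票"))  # 机 (as in 机票, plane ticket)
print(choose_character("ji1", "肉"))  # 鸡 (鸡肉 outscores 肌肉 in this toy data)
print(choose_character("ji1"))        # 机 (frequency guess, no context)
```

The second call shows why errors persist even with context: 鸡肉 (chicken meat) and 肌肉 (muscle) are both plausible, so the choice depends on the rest of the sentence.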

Simplified vs Traditional Script — 简体字与繁体字

Chinese is written in two script systems — Simplified (简体, used in mainland China and Singapore) and Traditional (繁體, used in Taiwan, Hong Kong, and Macau). The speech is identical for Mandarin; the output characters differ. The locale you select determines which script you receive:

🇨🇳

zh-CN → Simplified 简体

爱/国/语/书/话 — mainland China and Singapore standard. Fewer strokes, standardised by the PRC in the 1950s–60s.

🇹🇼

zh-TW → Traditional 繁體

愛/國/語/書/話 — Taiwan standard. Full historical characters, also used in academic and classical contexts globally.

🇭🇰

zh-HK → Traditional + Cantonese

Traditional characters with Cantonese-specific characters for particles (囉, 喎, 啩) not used in Mandarin writing.

An important subtlety: zh-CN and zh-TW take the same spoken language as input, and both locales recognise spoken Mandarin. For practical purposes the difference is the output script. A Taiwan speaker can use zh-CN and get Simplified output, or zh-TW and get Traditional output, despite the accent differences between mainland and Taiwan Mandarin. Script selection can be thought of as a post-processing step, not an acoustic one.
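Viewed as post-processing, script selection amounts to mapping characters after recognition. The sketch below uses only the five character pairs shown in the cards above; the mapping table and function name are hypothetical. Real conversion is not strictly one-to-one (for example, Simplified 发 corresponds to either 發 or 髮 depending on the word), so a production converter needs word-level context.

```python
# The five Simplified -> Traditional pairs shown in the locale cards above.
SIMPLIFIED_TO_TRADITIONAL = {
    "爱": "愛",
    "国": "國",
    "语": "語",
    "书": "書",
    "话": "話",
}


def to_traditional(text):
    """Convert known Simplified characters; leave everything else unchanged."""
    return "".join(SIMPLIFIED_TO_TRADITIONAL.get(ch, ch) for ch in text)


print(to_traditional("我爱国语书"))  # 我愛國語書
```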

Mandarin Regional Varieties — 普通话各地变体

Standard Mandarin (普通话 in mainland China, 國語 in Taiwan) is trained on Beijing-based standard speech. Regional varieties of Mandarin differ in accent, vocabulary, and in some cases grammar:

🏙️

Beijing Mandarin — Best Results

Beijing speech is the basis for standard 普通话 and the reference model for all mainland Chinese ASR. The distinctive Beijing erhua (儿化音 — adding /r/ to syllables) is well-handled by the zh-CN model. "这儿" (zhèr), "哪儿" (nǎr), "一点儿" (yīdiǎnr) — the erhua r is recognised and output correctly. Expect word error rates of 5–10% in clear quiet conditions.

🇹🇼

Taiwan Mandarin (國語) — Very Good

Taiwan Mandarin differs from mainland standard in several ways: erhua is mostly absent, the retroflex consonants zh/ch/sh are often pronounced as z/c/s (deretroflexion), and vocabulary differences exist (捷運 vs 地铁 for metro, 機車 vs 摩托车 for motorcycle). The zh-TW model handles Taiwan Mandarin well. Use zh-TW for Traditional character output; the model recognises Taiwan Mandarin phonology correctly.

🇸🇬

Singapore Mandarin (华语) — Good

Singapore Mandarin (华语, Huáyǔ) is spoken with influence from Hokkien, Malay, and English. Vocabulary differences (巴士 for bus, 德士 for taxi) and some distinctive phonological features (shorter vowels, English-influenced intonation in code-switching) affect accuracy slightly. The zh-CN or zh-TW model handles Singapore Mandarin moderately well; heavy Singlish-Mandarin mixing is more challenging.

🏭

Accented Mandarin (Wu/Min/Yue influence) — Challenging

Mandarin spoken by native Wu (Shanghainese), Min (Hokkien/Fujianese), or Yue (Cantonese) speakers carries heavy substrate influence. Shanghai-accented Mandarin flattens tones; Cantonese-accented Mandarin shifts vowels and tones significantly; Min-accented Mandarin merges retroflex consonants. These varieties cause moderate-to-high error rates. Speakers should approximate standard 普通话 pronunciation for formal dictation.
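The error rates quoted in this section can be checked against your own recordings. Because written Chinese has no spaces between words, error rates for Chinese are often computed per character. The following is a minimal sketch of that calculation using Levenshtein edit distance; the function names are hypothetical and the sample sentences are illustrative, not measured output from any model.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong homophone (鸡 for 机) in a five-character sentence.
print(character_error_rate("我需要机票", "我需要鸡票"))  # 0.2
```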

如何开始 — How to Start

1. Select the correct locale: zh-CN (Simplified, mainland), zh-TW (Traditional, Taiwan), or zh-HK (Cantonese, Hong Kong)

Locale selection is critical. zh-CN and zh-TW both recognise Mandarin and differ only in output script. zh-HK routes to a Cantonese model. Choosing the wrong locale for Cantonese speech means near-total failure.

2. Click "Start 🎤" and allow microphone access when prompted

Chrome on desktop provides the best Chinese ASR results.

3. Speak full sentences at a natural Mandarin pace, and pause only at the end of a sentence

Character selection relies on full sentence context. Pausing mid-sentence forces the model to guess without complete information.

4. Copy the text or download it as TXT

Chinese characters (Simplified or Traditional) paste directly into WeChat, Word, and any modern app. Check proper nouns and rare characters manually.

Cantonese Voice Recognition — 粵語語音識別

Cantonese ASR is a separate, less mature field from Mandarin ASR. If you speak Cantonese, select zh-HK — this routes to a Cantonese recognition model. Here are the specific challenges Cantonese presents:

六聲 Six Tones (vs Mandarin's Four)

Cantonese has six phonemically distinct tones (some analyses nine, including the entering tones). The syllable "si" produces six meanings: 詩 (poem, high level), 史 (history, high rising), 試 (try, mid level), 時 (time, low falling), 市 (market, low rising), 事 (matter, low level). Greater tonal complexity means more potential for disambiguation errors than Mandarin.

Cantonese-Specific Written Characters

Cantonese has written characters for colloquial particles and words that don't exist in Mandarin writing: 囉 (lo, a sentence-final particle that often signals resignation), 喎 (wo, hearsay: "I've heard that"), 咋 (zaa3, "only"), and 㗎 (gaa3, an emphatic particle). A Cantonese model must output these correctly; a Mandarin model would have no concept of them.

Written vs Spoken Cantonese

Like Tamil and Arabic, Cantonese has a formal/informal split. Formal writing in Hong Kong uses Standard Written Chinese, whose grammar and vocabulary follow Mandarin, while spoken Cantonese is used in informal digital writing. In Hong Kong, a "written Cantonese" standard has emerged for social media and messaging that closely mirrors spoken Cantonese. The zh-HK model handles this modern written Cantonese better than formal Standard Written Chinese read aloud.

Hong Kong English Code-Switching

Hong Kong Cantonese mixes in English so thoroughly that code-switching is the everyday norm, and the local variety of English even has its own name, 港式英語. "你有冇 check 個 email 呀?" (Have you checked the email?) is typical. The zh-HK model handles common English words in Cantonese sentences moderately well: common tech terms appear in Roman script within the Chinese text. Less familiar English may be phonetically approximated in Cantonese characters.

Measure Words — 量词:A Unique Chinese ASR Challenge

Chinese requires a measure word (量词, liàngcí) between a number and a noun — similar to Japanese counters but even more pervasive. "Three books" is "三本书" (sān běn shū) — the measure word 本 (běn) must appear. "Three people" is "三个人" (sān gè rén) — the measure word 个 (gè) is required. There are over 100 measure words in Mandarin, and selecting the wrong one is grammatically incorrect.

For speech recognition, measure words are actually one of the easier components — native speakers almost never make measure word errors in speech, so the model hears the correct word and simply transcribes it. Where ASR errors occur is when: (1) the measure word is acoustically swallowed in fast speech and the model must infer it, or (2) a non-native speaker uses the wrong measure word, and the model transcribes what was said rather than the correct form. The model transcribes what it hears, not what should have been said.

Chinglish: Code-Switching in Mandarin

Urban Chinese professionals — particularly in tech, finance, and academia — mix English and Mandarin extensively. The pattern differs from Hinglish or Tanglish in that English words often enter as standalone terms within Chinese sentence structure:

"我们明天有个 meeting,你能把 PPT 发我吗?"

"这个 deadline 太紧了,我需要 approve 一下。"

"你 check 一下 data,看看有没有问题。"

In each example, the Latin-script words are English terms embedded in Mandarin sentence structure.

The zh-CN/zh-TW model handles common English tech terms well — "meeting," "PPT," "deadline," "data," "email" appear in Roman script within the Chinese output. Note that unlike Bengali or Tamil, Chinese does not inflect English words with Chinese grammatical suffixes — English words enter as uninflected stems and Chinese grammar handles them syntactically.

Less common English words may be phonetically transliterated into Chinese characters — "approve" may become "阿普鲁夫" or similar. For formal documents, type English terms manually during dictation pauses.
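When reviewing a mixed transcript, it helps to list the English terms that came through in Roman script so they can be checked by hand. This is a small sketch using a regular expression over basic Latin letters; the function name is hypothetical, and terms that were transliterated into Chinese characters will not be found this way.

```python
import re

# Runs of basic Latin letters; Chinese characters and full-width
# punctuation act as separators.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def english_terms(text):
    """Return the Latin-script words in a mixed Chinese/English string."""
    return LATIN_RUN.findall(text)


print(english_terms("我们明天有个 meeting,你能把 PPT 发我吗?"))
# ['meeting', 'PPT']
print(english_terms("你 check 一下 data,看看有没有问题。"))
# ['check', 'data']
```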

提高准确率的技巧 — Tips for Best Accuracy

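✅ Improve accuracy

  • Select the correct locale: zh-CN, zh-TW, or zh-HK
  • Finish the whole sentence before pausing; context determines character selection
  • State the topic first, then the content, to help disambiguation
  • Mandarin: keep tones clear, especially on words with many homophones
  • Cantonese: select zh-HK; do not speak Cantonese with zh-CN selected
  • English terms are best typed manually after a pause
  • Chrome has the best support for Chinese ASR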

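⚠️ Common errors and fixes

  • Wrong homophone chosen: provide more context, then correct manually
  • Errors in personal and place names: always check proper nouns manually
  • Mandarin locale selected for Cantonese speech: switch to zh-HK
  • Accented Mandarin: stay as close to standard 普通话 pronunciation as you can
  • English transliterated into Chinese characters: pause and type the English manually
  • Taiwan vocabulary converted to mainland terms: use the zh-TW locale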

Who Uses Chinese Voice to Text — 谁在使用

💬

WeChat Users

WeChat has over 1.3 billion users, the vast majority communicating in Chinese. Voice-to-text for long WeChat messages — avoiding the slow Pinyin input method — is one of the most common Chinese ASR use cases globally. Dictate the message, review the characters, send.

💼

Business Professionals

Chinese business professionals dictate emails, reports, and meeting notes. Chinese character input via Pinyin IME is slower than Latin-script typing — voice bypasses this entirely, outputting characters directly. Particularly valuable for long documents where Pinyin input overhead is significant.

🎓

Students & Academics

Chinese university students use voice dictation for essay drafts, research notes, and thesis writing. Academic Chinese — formal, structured — matches well with what Mandarin ASR models recognise best. Speaking formal 普通话 and editing the character output is faster than Pinyin-based composition for long texts.

📺

Content Creators

Mandarin YouTube, Bilibili, and Douyin creators use voice-to-text for scripts, captions, and video descriptions. The Chinese-language content market is massive — voice dictation for scripts is a standard part of the workflow for high-volume creators.

🌏

Overseas Chinese Communities

Chinese diaspora in North America, Europe, Australia, and Southeast Asia use voice dictation for Chinese family communication, community organisation, and Chinese language maintenance. Voice removes the friction of Chinese input methods on non-Chinese keyboards and operating systems.

🗣️

Chinese Language Learners

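Advanced Mandarin learners use voice-to-text as a rough tone accuracy checker. If the model outputs the correct character, pronunciation was good enough; an incorrect character can reveal a tone error or mispronunciation. The check is most informative on isolated words and short phrases, where sentence context cannot compensate for a wrong tone. A practical pronunciation feedback tool at HSK 3+ level.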

中文语音命令 — Voice Commands in Chinese

Say these words during dictation to add punctuation. Chinese uses specific punctuation marks — the 。full stop, 、enumeration comma, and 「 」quotation marks differ from Western equivalents:

Punctuation / 标点符号

Say | Inserts
"句号" | 。 (Chinese full stop)
"逗号" | , (Chinese comma)
"顿号" | 、 (enumeration comma)
"问号" | ? (question mark)
"叹号" | ! (exclamation mark)
"冒号" | : (colon)
"分号" | ; (semicolon)
"引号" | “ ” or 「」 (quotation marks)
"破折号" | —— (dash)
"省略号" | …… (ellipsis)

Format / 格式

Say | Action
"换行" | New line
"新段落" | New paragraph
"删除" | Delete last word

Chinese punctuation note

Chinese uses full-width punctuation marks (。,?!) and the enumeration comma 、, which is distinct from the ordinary comma ,. The model outputs Chinese punctuation automatically in Chinese-mode dictation, so you don't need to request full-width marks specifically. The 省略号 ellipsis is six dots (……), not three (…), in Chinese convention.
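The spoken-command tables above can be read as a simple text substitution. The sketch below is one way such a step could work, under the assumption that commands arrive as plain text in the transcript; it is not the tool's actual implementation, and the names are hypothetical. Quotation marks and "删除" are left out because they need pairing and editing state that a plain substitution cannot express.

```python
# Spoken command -> inserted text, taken from the tables above.
COMMANDS = {
    "句号": "。",
    "逗号": ",",
    "顿号": "、",
    "问号": "?",
    "叹号": "!",
    "冒号": ":",
    "分号": ";",
    "破折号": "——",
    "省略号": "……",
    "新段落": "\n\n",
    "换行": "\n",
}


def apply_commands(transcript):
    """Replace spoken punctuation and format commands with their marks."""
    for spoken, mark in COMMANDS.items():
        transcript = transcript.replace(spoken, mark)
    return transcript


print(apply_commands("今天天气很好逗号我们去公园吧句号"))
# 今天天气很好,我们去公园吧。
```

A plain substitution like this also rewrites sentences that merely mention a command word, for example a sentence about the 句号 itself, so a real implementation would need some way to tell commands from content.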

Transcribe Chinese Audio Files: MP3, WAV, MP4

Upload Chinese audio recordings (meetings, lectures, podcasts, interviews) and receive a transcript with timestamps. The Pro plan handles files up to 5 hours long.

View the Pro plan →

常见问题 — FAQ

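Can Cantonese and Mandarin use the same tool?

Not with the same locale setting. Mandarin and Cantonese are different languages, and a model for one cannot recognise the other. To dictate Cantonese you must select zh-HK, which enables the Cantonese recognition model. If you speak Cantonese with zh-CN or zh-TW selected, the output will be near-total gibberish. Cantonese ASR is less mature than Mandarin ASR and has higher error rates, but zh-HK is currently the best option.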

Does it output Simplified or Traditional Chinese?

It depends on your locale selection. zh-CN outputs Simplified Chinese (简体字) — used in mainland China and Singapore. zh-TW outputs Traditional Chinese (繁體字) — used in Taiwan. zh-HK outputs Traditional Chinese with Cantonese-specific characters. The spoken input is the same Mandarin for zh-CN and zh-TW — the locale determines output script only. Switch locales at any time to change the output script.

How does tone accuracy affect character selection?

Tones are one input signal to the model, but contextual language modelling does most of the disambiguation work. This means: (1) non-native speakers with imprecise tones still get reasonable accuracy for common vocabulary because sentence context compensates; (2) tone sandhi (the 3rd tone + 3rd tone → 2nd + 3rd change in connected speech) is handled correctly by models trained on natural speech; (3) in ambiguous sentences where context doesn't help, wrong tones will produce wrong characters more frequently.

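How does Taiwan Mandarin differ from mainland Mandarin?

Pronunciation: in Taiwan Mandarin the retroflex initials (zh, ch, sh) are often pronounced as z, c, s; erhua is almost never used; and intonation is more even. Vocabulary: Taiwan uses 捷運 (mainland: 地铁, metro), 機車 (mainland: 摩托车, motorcycle), 健保 (mainland: 医保, health insurance), and so on. Selecting zh-TW gives Traditional character output, and the model also recognises the Taiwan accent more accurately. If you speak Taiwan Mandarin with zh-CN selected, recognition still works, but the output will be in Simplified characters.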

Can non-native Mandarin speakers use this effectively?

Yes, from HSK 3–4 level upward. The contextual language model compensates for mild tone errors in common vocabulary. The main challenges for non-native speakers: (1) tones on less frequent words where context doesn't disambiguate; (2) retroflex consonants (zh, ch, sh, r) which non-native speakers often pronounce as z, c, s — models handle this partially; (3) the neutral tone, which non-native speakers often over-stress. Accuracy improves significantly with Mandarin proficiency. Many advanced learners use the tool as a pronunciation checkpoint — if the model outputs the correct character, pronunciation was sufficient.

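Are my recordings saved or sent to third parties?

No. The free real-time dictation tool uses the browser's built-in Web Speech API: your speech is handled at the browser level and does not pass through our servers. The Pro file-upload feature sends files to a server for processing, and they are deleted automatically as soon as transcription finishes. We do not store any recordings.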


开始用中文语音输入 — Start Dictating in Chinese

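Free, with no installation and no sign-up. Chinese characters are output automatically, with no Pinyin input method required.

Chrome is recommended for the best Chinese speech recognition results.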