Barefoot hippie's Swiss-knife app
Barefoot simple solutions, no corporate nonsense.

Text Stats — Word, Character, and N-gram Statistics

An offline, browser-based application for counting words, characters, and n-grams, providing unique counts, frequency ratios, and CSV export.

Instructions

Insert or paste text into the input field (use Paste if available).

Consult the main statistics: Words, Characters including spaces, Characters excluding spaces, followed by Unique words, Unique 2-grams, Unique 3-grams.

Examine the frequency tables: Top words, Top 2-word/3-word n-grams, Top characters, and character n-grams.

Use Expand/Collapse to toggle between the Top-5 items and the complete list for each table.

Select Copy CSV or Download CSV to obtain a full export of any table.

Detailed explanation

Methodological principles

  • Local execution: All computation occurs in the browser; no data is transmitted externally.
  • Unicode compliance: Characters are processed as code points, ensuring accurate handling of all scripts and emoji.
  • Word tokenization: Words are defined as sequences of letters or digits, with internal apostrophes (' or ’) and hyphens (‐) preserved. Tokens are lowercased to consolidate variants (e.g., "The" and "the").
  • Character n-grams: Formed directly from characters, including punctuation and whitespace.
  • Word n-grams: Constructed from consecutive lowercased word tokens; the whitespace and punctuation between tokens are ignored.
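
The principles above can be sketched in a few lines. This is an illustrative Python approximation, not the application's own code: the token pattern, the set of accepted apostrophe and hyphen characters, and the function names tokenize, word_ngrams, and char_ngrams are all assumptions made for the example.

```python
import re

# Assumed token pattern: runs of letters/digits, optionally joined by one
# internal apostrophe (' or ’) or hyphen (- or ‐). Illustrative only.
TOKEN_RE = re.compile(r"[^\W_]+(?:['’\-‐][^\W_]+)*")


def tokenize(text):
    """Return the lowercased word tokens found in text."""
    return TOKEN_RE.findall(text.lower())


def word_ngrams(tokens, n):
    """Consecutive n-token sequences, each joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Consecutive n-code-point substrings; punctuation and whitespace count."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, tokenize("The cat's well-known; the cat.") yields the, cat's, well-known, the, cat: the semicolon and full stop act as delimiters, while the internal apostrophe and hyphen stay inside their tokens.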

Primary counters

  • Words: Total number of detected word tokens.
  • Characters including whitespace: Total number of code points, including spaces, line breaks, punctuation, and symbols.
  • Characters excluding whitespace: Total number of code points that are not whitespace (spaces, tabs, line breaks).
  • Unique words: Number of distinct words.
  • Unique 2-word n-grams: Number of distinct consecutive two-word sequences.
  • Unique 3-word n-grams: Number of distinct consecutive three-word sequences.
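
As a rough illustration, the six counters can be derived from one pass over the tokens and characters. The sketch below is Python, not the application's code; the token pattern, the use of str.isspace() as the whitespace test, and the name primary_counters are assumptions.

```python
import re

# Assumed token pattern (letters/digits with internal apostrophes or hyphens).
TOKEN_RE = re.compile(r"[^\W_]+(?:['’\-‐][^\W_]+)*")


def primary_counters(text):
    """Compute the six primary counters for a piece of text."""
    tokens = TOKEN_RE.findall(text.lower())
    bigrams = set(zip(tokens, tokens[1:]))
    trigrams = set(zip(tokens, tokens[1:], tokens[2:]))
    return {
        "words": len(tokens),
        # A Python str is a sequence of code points, so len() counts them.
        "chars_with_spaces": len(text),
        # Assumption: "whitespace" means whatever str.isspace() accepts.
        "chars_no_spaces": sum(1 for ch in text if not ch.isspace()),
        "unique_words": len(set(tokens)),
        "unique_2grams": len(bigrams),
        "unique_3grams": len(trigrams),
    }
```

For the input "the cat sat on the mat the cat" this gives 8 words, 30 characters with spaces, 23 without, 5 unique words, and 6 unique 2-grams and 3-grams each (the 2-gram "the cat" occurs twice).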

Tables and metrics

Each table presents items sorted by frequency. In collapsed mode only the five most frequent items are shown; expanded mode reveals the entire set.

Columns:

  • Item: The character, word, or n-gram.
  • Count: Frequency of the item in the text.
  • Ratio to top: The item's count divided by the count of the most frequent item in the table.
  • Ratio to total: The item's count divided by the sum of all counts in the table.
  • Ratio to top (no spaces) (character tables only): The item's count divided by the count of the most frequent non-whitespace character.
  • Ratio to total (no spaces) (character tables only): The item's count divided by the total number of non-whitespace characters.
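
The column definitions above amount to simple divisions. The Python sketch below shows one way to compute them; it is not the application's code, and the function names are hypothetical. How the application fills the no-spaces columns for whitespace rows is not stated, so this example simply leaves those rows out.

```python
from collections import Counter


def frequency_table(items):
    """Rows of (item, count, ratio_to_top, ratio_to_total), most frequent first."""
    counts = Counter(items)
    if not counts:
        return []
    ranked = counts.most_common()  # ties keep first-seen order
    top = ranked[0][1]
    total = sum(counts.values())
    return [(item, n, n / top, n / total) for item, n in ranked]


def no_space_ratios(text):
    """Map each non-whitespace character to (ratio_to_top, ratio_to_total),
    with both denominators taken over non-whitespace characters only.
    Assumption: whitespace rows are omitted rather than given a value."""
    rows = frequency_table(ch for ch in text if not ch.isspace())
    return {item: (to_top, to_total) for item, _, to_top, to_total in rows}
```

Calling frequency_table(list("aa b")) ranks "a" first with count 2, ratio to top 1.0, and ratio to total 0.5; in no_space_ratios("aa b") the same character has a ratio to total of 2/3, because the space no longer counts toward the denominator.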

CSV export

  • Content: The exported file always contains the complete list of items, never limited to the Top-5.
  • Format: Standard CSV with appropriate escaping and quoting of fields containing commas, quotes, or line breaks.

Design considerations and limitations

  • Lowercasing: Ensures consistent counting of lexical variants differing only in case. Character tables preserve original forms.
  • Token boundaries: Apostrophes and hyphens within tokens are retained; other punctuation acts as delimiters.
  • No-whitespace ratios: Reduce the distorting effect of space characters in frequency measures.
  • Language sensitivity: The method is language-agnostic and does not model the morphology or syntax of any particular language. The counts remain valid, but token boundaries may not match linguistic word segmentation in every language.

Privacy and portability

  • Fully self-contained: requires no external libraries or network connections.
  • Input text never leaves the user's environment.

Warning: Expanding a table to display all items may be computationally intensive if the input text is very large. In such cases the browser may slow down noticeably or even become unresponsive. For long texts it is advisable to use the Top-5 view for quick inspection and, if complete data are required, to prefer the CSV export.