English letter frequencies that include spaces? [SOLVED]

SanjeevKumarSharma 2015-11-01 19:34:10 UTC #1

I'm starting to add bigrams and am wondering whether any of the common ones include a space.

Did the existing sources analyse spaces, with no space bigram turning out frequent enough to end up in the charts?

If there are charts that include spaces, please point me to them.

paul 2015-11-01 19:43:25 UTC #2

I think this has what you need.
http://norvig.com/mayzner.html

tony 2015-11-02 18:04:31 UTC #3

Paul, Great link!
Here's a link to the Google Ngram Viewer too:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

JacobR 2015-11-03 22:26:22 UTC #4

@SanjeevKumarSharma That Google Ngram data is amazing! It doesn't look like it includes "_x" bigrams, but it does have "x_" bigrams. That source is the one to use if you want to do the analysis yourself.

Dr. Norvig's analysis is excellent, thanks for sharing, @paul! Although it won't account for punctuation, the "Letter Counts by Position Within Word" section would probably be the most useful to you. You can see the most common first letter is "T" (which would imply "_T" is a common bigram starting with a space) and that "E" is the most common last letter (implying "E_" is a common bigram).

Watch out though; "E" is also the most common third letter. And in his own analysis, Norvig points out that the word "The" is 7.14% of all words in his dataset. That single word strongly affects any other analysis. What are the two most common bigrams? "TH" and "HE". So whether you want to use "_T" or "E_" or just "_THE_" is up to you. Or just exclude those, because you should already have "THE" programmed in.
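If you only have a word-frequency list, the space bigrams can be derived from it: each occurrence of a word contributes one "_x" for its first letter and one "x_" for its last letter. Here is a minimal Python sketch of that idea. The function name `space_bigrams` and the toy counts are made up for illustration and are not Norvig's figures.

```python
from collections import Counter

def space_bigrams(word_counts):
    """Derive "_x" (space + first letter) and "x_" (last letter + space)
    bigram counts from a mapping of word -> occurrence count."""
    leading = Counter()
    trailing = Counter()
    for word, count in word_counts.items():
        # Keep letters only, so apostrophes or hyphens do not become bigram members.
        letters = [c for c in word.upper() if c.isalpha()]
        if not letters:
            continue
        leading["_" + letters[0]] += count
        trailing[letters[-1] + "_"] += count
    return leading, trailing

# Toy counts, invented for this example only.
sample = {"the": 7, "of": 4, "and": 3, "to": 3, "time": 1}
leading, trailing = space_bigrams(sample)
print(leading.most_common(3))
print(trailing.most_common(3))
```

Running it with and without "the" in the input is a quick way to see how much that one word moves "_T" and "E_".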

Let us know what you end up programming in and find useful.

SanjeevKumarSharma 2015-11-04 14:58:05 UTC #5

Norvig also reports that the commonest starting letter is "t", but in context that is "the" putting its thumb on the scale.

I'll just pull a Pareto: add the chord "S 0MMM" (shifted "the") for " the ", then move on to other high-yield bigrams.
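A rough sketch of that Pareto cut in Python, assuming you already have bigram counts in a dict. The name `pareto_cut`, the 80% default and the toy counts are placeholders for illustration, not measured values.

```python
def pareto_cut(counts, target_share=0.8):
    """Return the most frequent items, in descending order, stopping once
    their combined share of the total has reached target_share."""
    total = sum(counts.values())
    chosen = []
    running = 0
    for item, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or running / total >= target_share:
            break
        chosen.append(item)
        running += count
    return chosen

# Toy counts, invented for this example only.
toy = {"TH": 50, "HE": 30, "IN": 10, "ER": 6, "AN": 4}
print(pareto_cut(toy, 0.8))
```

Anything already covered by a whole-word chord (such as "the") could be removed from the counts before running the cut.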

JonT 2017-01-27 13:53:35 UTC #6

I thought the heading of this topic would answer my question, but not quite. I couldn't find the answer elsewhere on the internet, so I thought I'd ask here. There are many sites that address the frequency of occurrence of the 26 English letters. However, I do not see anything that includes the "space" as one of the characters. I suppose that this could be approximated by some function of the letter-frequency and the average-word-length, but I don't see this explicitly stated anywhere. Anybody?

tony 2017-01-30 22:48:05 UTC #7

@JonT check out this link:
Statistical Distributions of English Text
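The approximation @JonT describes can also be sketched directly. Assuming text made only of words separated by single spaces, N words of average length L give about N*L letters and about N spaces, so the space takes roughly 1 / (L + 1) of all characters, and each letter's share shrinks by the factor L / (L + 1). This ignores punctuation, digits and line breaks, and the function names and the value 5.0 below are placeholders, not measured figures.

```python
def estimate_space_share(avg_word_length):
    """Approximate share of spaces among all characters, assuming single
    spaces between words and no punctuation."""
    return 1.0 / (avg_word_length + 1.0)

def rescale_letter_shares(letter_shares, avg_word_length):
    """Convert letter-only shares (summing to 1) into shares of
    letters-plus-space, adding "_" for the space."""
    space = estimate_space_share(avg_word_length)
    scaled = {letter: share * (1.0 - space) for letter, share in letter_shares.items()}
    scaled["_"] = space
    return scaled

def measured_space_share(text):
    """Measure the same quantity on a sample, counting one space per word gap."""
    words = text.split()
    letters = sum(len(w) for w in words)
    spaces = max(len(words) - 1, 0)
    total = letters + spaces
    return spaces / total if total else 0.0

# 5.0 is a placeholder average word length, not a measured one.
print(estimate_space_share(5.0))
print(measured_space_share("the quick brown fox jumps over the lazy dog"))
```

Comparing the estimate with the measured value on your own sample text shows how far the approximation drifts for that kind of text.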

furman 2017-10-13 03:40:32 UTC #8

Frequencies of characters in English, including the comma, blank, and such, can be found at
www.fitaly.com/board/domper3/posts/136.html .