@SanjeevKumarSharma That Google Ngram data is amazing! It doesn't look like it includes "_x" bigrams, but it does have "x_" bigrams. That source is the one to use if you want to do the analysis yourself.
Dr. Norvig's analysis is excellent, thanks for sharing, @paul! Although it won't account for punctuation, the "Letter Counts by Position Within Word" section would probably be the most useful to you. You can see the most common first letter is "T" (which would imply "_T" is a common bigram starting with a space) and that "E" is the most common last letter (implying "E_" is a common bigram).
Watch out, though: "E" is also the most common third letter. And in his own analysis, Norvig points out that the word "The" is 7.14% of all words in his dataset. That single word strongly affects any other analysis. What are the two most common bigrams? "TH" and "HE". So whether you want to use "_T" or "E_" or just "_THE_" is up to you. Or just exclude those, because you should already have "THE" programmed in.
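If you'd rather measure this on your own text, here is a rough Python sketch of the idea: pad each word with "_" as a space marker (same notation as above), count the bigrams, and optionally skip words you already handle. The function name `boundary_bigrams`, the `exclude` parameter, and the sample sentence are all made up for illustration; this is not how Norvig or the Google Ngram data computes anything.

```python
from collections import Counter

def boundary_bigrams(words, exclude=()):
    """Count letter bigrams, using "_" to mark the space before and after each word.

    Hypothetical helper: `exclude` lists words to skip entirely (e.g. "the").
    """
    skip = {w.upper() for w in exclude}
    counts = Counter()
    for word in words:
        w = word.upper()
        # Ignore tokens containing punctuation or digits, and any excluded words.
        if not w.isalpha() or w in skip:
            continue
        padded = f"_{w}_"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts

# Toy input, invented for this example; substitute your own corpus.
sample = "the cat sat on the mat while the dog ate".split()
with_the = boundary_bigrams(sample)
without_the = boundary_bigrams(sample, exclude=["the"])

print(with_the.most_common(5))
print(without_the.most_common(5))
```

On this toy sentence, excluding "the" drops "_T" from 3 occurrences to 0 and "E_" from 5 to 2, which is the same skew described above, just on a much smaller scale.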
Let us know what you end up programming in and find useful.