New Keyboard Layout Project: Have We Been Mistaken All Along?

Everyone who has designed a prominent keyboard layout, and I mean everyone, assumes that finger travel distance is by far the most important factor. It makes sense on an intuitive level: we should move our fingers around as little as possible. Colemak places the eight most common keys on the home row, as do Arensito, Michael Capewell’s layout, and others. I used to agree.

But have we been mistaken all along?

Enormous benefits can be gained if we are willing to sacrifice a little finger travel distance. I was running my keyboard generator program and it came up with this layout:

b l o u ; j d c p y
h r e a , m t s n i
k x ‘ . z w g f v q

This surprised me at first. I thought, I must have something wrong. The ‘o’ isn’t on the home row. That can’t be right. But then I considered further. Maybe it’s worth it to sacrifice some finger travel distance in order to gain other benefits. This layout boasts great inward rolls (notice ‘he’, ‘in’, ‘is’, ‘re’, ‘it’) and very few outward rolls. With four vowels on one hand and only one on the other, it also has pretty good hand alternation, thus pleasing both the “rolls” crowd and the “alternation” crowd. Same-finger usage is amazingly low, lower than in any other major keyboard layout.
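To make the terms above concrete, here is a minimal Python sketch that classifies a two-letter sequence on this layout as same finger, inward roll, outward roll, or hand alternation. The column-to-finger assignment (each index finger covering the two center columns) and the definition of an inward roll as any same-hand move toward the index finger are assumptions made for illustration; the generator program may count rolls differently.

```python
# Layout from the post, one string per row (apostrophe written as a plain ').
ROWS = ["blou;jdcpy", "hrea,mtsni", "kx'.zwgfvq"]

# Assumed finger for each of the 10 columns: 0-3 are left pinky..index,
# 4-7 are right index..pinky. Index fingers cover the two center columns.
FINGER_OF_COLUMN = [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]

FINGER = {ch: FINGER_OF_COLUMN[col] for row in ROWS for col, ch in enumerate(row)}


def classify(a, b):
    """Classify the digraph a+b under the assumptions above."""
    fa, fb = FINGER[a], FINGER[b]
    if fa == fb:
        return "same finger"
    if (fa < 4) != (fb < 4):
        return "alternation"
    # Left hand: a higher number is closer to the index finger.
    # Right hand: a lower number is closer to the index finger.
    toward_index = fb > fa if fa < 4 else fb < fa
    return "inward roll" if toward_index else "outward roll"


def rates(text):
    """Share of each class among adjacent key pairs (repeated keys are skipped)."""
    pairs = [(a, b) for a, b in zip(text, text[1:])
             if a in FINGER and b in FINGER and a != b]
    counts = {}
    for a, b in pairs:
        kind = classify(a, b)
        counts[kind] = counts.get(kind, 0) + 1
    return {kind: n / len(pairs) for kind, n in counts.items()}


for digraph in ["he", "in", "is", "re", "it", "oe"]:
    print(digraph, classify(digraph[0], digraph[1]))
```

Under these assumptions the five digraphs named above all come out as inward rolls, while ‘oe’ is a same-finger pair. Running `rates` over a large corpus would give percentages comparable in spirit to the program's statistics, though not necessarily the same numbers.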

The trouble is, I’ve never tried a layout like this, nor do I have the time to. I want to stick with the layout I have and try to get faster using that one layout; remembering both it and QWERTY is not too hard, but remembering three layouts is far more difficult. It would be really nice if I had a research grant and could hire a group of 50 or so college students to learn this layout, and compare it to one where finger travel distance is valued more highly.

Perhaps a different sort of layout is better than the conventional type. The trouble is, we don’t really know. But there’s still the possibility that we’ve been mistaken all along.

  1. Atle
    June 30, 2010 at 1:30 pm | #1

    Hi Michael,

    I notice, except for the O, the rest of the layout has extremely good reach (finger travel distance). Even the O is in a prime top-row position – the next best thing.

    Conveniently, the most frequent 14 letters can be found within easy reach in the home and top row. So, on balance, slightly sacrificing the O for extremely good access to the best half of the alphabet is probably a profitable tradeoff in terms of overall reach.

    In other layouts, it bothers me when the relatively frequent U, C, or M falls down into the bottom row.

    • June 30, 2010 at 3:16 pm | #2

      Thank you for your comment. I always appreciate feedback and new ideas.

      You make a good point. What I’m primarily concerned about in this case is not finger travel distance, but that the middle finger may be overworked. Placing O and E on the same finger is an awful lot to be typing. Perhaps the middle finger can handle it; and perhaps it’s worth it for the decreased load on the left pinky finger.

      I agree that it is important to place moderately common letters like U, C, or M in good positions. That said, I don’t think that the bottom row right index finger (QWERTY ‘m’ key) is very difficult to type. Generally, though, my algorithm places somewhat less common things (like ‘w’ or ‘g’) in that position.

      Regarding the semicolon, it is actually relatively common among punctuation. My layouts always use the 4 punctuation keys that are used on Dvorak. I do this because: they are among the most common punctuation; it’s easier to keep things consistent that way; and layouts get much harder to learn when you start moving too much punctuation around. In my own experience, I have no trouble switching between my layout and QWERTY, but when I modified the punctuation I had much more difficulty.

      One other thing to be considered is that the semicolon will end up being placed in one of the worst positions, which may actually be worse than some punctuation outside of the main 30 keys (for example, I find that the QWERTY apostrophe is easier to type than the QWERTY slash).

      • Atle
        June 30, 2010 at 7:06 pm | #3

        Michael: ‘Regarding the semicolon, it is actually relatively common among punctuation.’

        Atle: I doubt the frequency of the semicolon is true anymore.

        Sure, if you use the obsolete ‘classics’ literature from the previous two centuries as the test data, yeah. Youl find a lot of semicolons.

        But if you go to Barnes & Noble and buy a paperback romance novel or even a hi-tek scifi novel, I bet you, no semicolon.

        I consider your use of semicolons above:

        “I do this because: they are among the most common punctuation; it’s easier to keep things consistent that way; and layouts get much harder to learn when you start moving too much punctuation around.”

        Occasionally, I see semicolons used to separate a series of clauses, like ‘capital commas’ so to speak. But, 1, this is grammatically improper anyway. 2, the same sentence is clear with proper commas. 3, the sentence actually uses a colon .. because a colon is useful, as opposed to the semicolons that are gratuitous. Significantly, 4, the semicolons enabled a sprawling (less ideal) writing style.

        Sentences that refuse to succumb to periods may be good for Immanuel Kant, but theyr bad for modern Americans.

        Especially in our era of navigating the information rather than memorizing the information, the information itself must be instantly clear.

        When the concepts are complex, the style must be all the more simple. To paraphrase Einstein, ‘Be as simple as possible, but no simpler.’

        This is a matter of style. Yet objectively, today, I bet the dollar sign is more common than the semicolon. Certainly, informal writing uses exclamations more than semicolons! Except for emoticons ;)

        • Atle
          June 30, 2010 at 7:11 pm | #4

          But :) is probably more frequent than .. ;)

          • feurry
            July 14, 2010 at 12:21 am | #5

            the semicolon is used often in programming. It’s not like it’s in a prime location anyways. It could probably be swapped with k or some other keys and hardly affect this layout.

            If you really want to move some bad punctuation keys, move the up to the number pads onto 1,2 and bring the ! and @ in to replace them. How many times do you type email addresses vs > … probably a whole bunch. I’ve moved a few other keys though on my layout so i actually put ? and ! with , and . to keep things consistent by grouping punctuation keys. I moved @ to the shifted / (but even / is in a different spot on my board)

            i like the way Michael has just used the 30 keys. The other ones can be changed to ones own needs quite easily, and their usage patterns can vary greatly depending on what you type. i.e if you program. Their usage varies too much with how you use your computer. Some people use $ a lot and others /.

            I have used many of Michael’s layouts and other layouts for more than a month each and can say first hand the punctuation keys have never been a sore spot. At the same time, adapting my punctuation mods to any of them has been trivial and are always better than qwerty.

        • June 30, 2010 at 10:09 pm | #6

          Here you may find the frequency of every ASCII printable character. A sizable proportion of my corpus is text that I pulled off the internet (forums, etc), so it’s pretty modern. Semicolons are still relatively common.

          • Atle
            December 30, 2010 at 7:49 pm | #7

            Im not sure how I missed this updated data. But it looks strong. :-D I especially like where it lists the semicolon as less frequent than the dollar sign ‘$’ – as I predicted earlier! And, the semicolon is less frequent than the colon – as I experienced.

            As the frequency of the semicolon is comparable to the dollar sign, it seems safe to remove it from the keyboard.

            Personally, I make SHIFT-comma the semicolon and SHIFT-period the colon, and am very happy with this key assignment. Thus the semicolon and colon dont hog up space, but are conveniently in reach for the times when I need them.

  2. Atle
    June 30, 2010 at 2:17 pm | #8

    On a separate issue, the semicolon is evil.

    Contemporary American English seeks clarity and concision, avoids adverbs and adjectives unless vital, and rejects the semicolon. The semicolon is the opposite of American English. It is obsolete.

    I estimate I use a semicolon about two or three times .. per year. I use grammar checks to *remove* semicolons from preexisting documents, replacing them with a period and capital.

    More frequently I use a colon: a grammatical equal sign.

    The semicolon doesnt belong in any keyboard layout. For most typists, it may be best to swap the colon with the doublequote. However, my idiolect never uses doublequotes either. I use ‘British’ singlequotes, and dont use apostrophes for contractions. I would swap the semicolon with the colon.

    • June 30, 2010 at 10:10 pm | #9

      I am rather fond of the semicolon; I consider it to be dreadfully under-utilized. In run-on sentences, it often can serve to add more clarity and smoothness than breaking up the sentence.

    • dphrei
      November 21, 2010 at 2:59 pm | #10

      i use it almost daily.

  3. Atle
    June 30, 2010 at 7:17 pm | #11

    I wonder what the character frequency of English Wikipedia is. That! is probably the best sample for people who care about keyboards. Brackets and all.

  4. Atle
    July 7, 2010 at 7:31 pm | #12

    Maybe this keyboard can get a reality check? For the algorithm, lock the block of four vowels in place (with the O above E), and leave everything else open. Hypothetically, the outcome should be this same keyboard. However, if the other keyboards rarely exhibit excellent fitness, it suggests this keyboard is a statistical fluke, and perhaps less good than the fitness suggests. Conversely, if different kinds of keyboards exhibit excellent fitness, a block of vowels probably enjoys better efficiency.
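The check described in this comment could be sketched roughly as follows in Python: keep the four vowel keys fixed, shuffle everything else, and hill-climb by swapping only the unlocked keys. Everything here is a stand-in chosen for illustration. The cost function just counts same-finger pairs in a short sample string (the real fitness function weighs many more factors), and `search_with_locked_keys`, `SAMPLE`, and the column-to-finger table are hypothetical names and assumptions, not part of the actual program.

```python
import random

# The posted layout as one 30-character string (plain ' for the apostrophe key).
START = "blou;jdcpy" + "hrea,mtsni" + "kx'.zwgfvq"
LOCKED = set("ouea")  # the vowel block, with O above E

# Assumed finger per column; index fingers take the two center columns.
FINGER_OF_COLUMN = [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]


def same_finger_cost(keys, text):
    """Toy cost: number of adjacent, different characters typed by one finger."""
    finger = {k: FINGER_OF_COLUMN[i % 10] for i, k in enumerate(keys)}
    cost = 0
    for a, b in zip(text, text[1:]):
        if a != b and a in finger and b in finger and finger[a] == finger[b]:
            cost += 1
    return cost


def search_with_locked_keys(start, locked, text, steps=2000, seed=0):
    """Shuffle the unlocked keys, then accept only swaps that do not raise the cost."""
    rng = random.Random(seed)
    keys = list(start)
    free = [i for i, k in enumerate(keys) if k not in locked]
    shuffled = [keys[i] for i in free]
    rng.shuffle(shuffled)
    for i, k in zip(free, shuffled):
        keys[i] = k
    best = same_finger_cost(keys, text)
    for _ in range(steps):
        i, j = rng.sample(free, 2)
        keys[i], keys[j] = keys[j], keys[i]
        cost = same_finger_cost(keys, text)
        if cost <= best:
            best = cost
        else:
            keys[i], keys[j] = keys[j], keys[i]  # undo the swap
    return "".join(keys), best


SAMPLE = "the quick brown fox jumps over the lazy dog and then there were none"
for seed in range(3):
    layout, cost = search_with_locked_keys(START, LOCKED, SAMPLE, seed=seed)
    print(cost, layout[:10], layout[10:20], layout[20:])
```

With a realistic corpus and fitness function, comparing the layouts from many seeds would show whether the runs keep converging on the same keyboard or on many different ones with similar fitness.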

  5. phynnboi
    July 8, 2010 at 6:30 am | #13

    I can’t remember if I’ve mentioned this or not: My testing method of choice was to take TypeFaster’s “Common Words” exercise and translate it from the layout I wanted to test into a layout I already know.

    So, for instance, assume we know Qwerty and want to test Michael’s layout. The first sentence of “Common Words” is,

    The man almost called to the boy.

    This translates from Michael’s layout into Qwerty as,

    Jad hfl fwhekj ifwwdu je jad qepv

    Type that on Qwerty and you’ll get a feel for what it’d be like to type it with Michael’s layout.

    It’s pretty easy to write a script in Perl (or, I assume, Python or Ruby) to do this translation for the whole file (or whatever file you want). Assuming you’re facile enough in your main layout to type garbage at a reasonable speed, that’ll give you a pretty good feel for the new layout without having to spend months actually learning the thing. I found this system invaluable in vetting my evaluation function and finally settling on a homebrew layout I was happy with.
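A minimal sketch of such a translation script in Python is below. It maps each character of the layout under test to the QWERTY key in the same physical position, which is the mapping the example sentence above follows. It only covers the 30 main keys and keeps letter case; shifted punctuation is passed through unchanged, which is a simplifying assumption.

```python
# Rows of the layout being tested and of the layout you already type on.
NEW_ROWS = ["blou;jdcpy", "hrea,mtsni", "kx'.zwgfvq"]
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]

# For each character in the new layout, the QWERTY key in the same position.
KEY_MAP = {
    new: old
    for new_row, old_row in zip(NEW_ROWS, QWERTY_ROWS)
    for new, old in zip(new_row, old_row)
}


def translate(text):
    """Return what to type on QWERTY to mimic typing text on the new layout."""
    out = []
    for ch in text:
        key = KEY_MAP.get(ch.lower())
        if key is None:
            out.append(ch)           # spaces, digits, unmapped symbols
        elif ch.isupper():
            out.append(key.upper())  # keep capitalization
        else:
            out.append(key)
    return "".join(out)


print(translate("The man almost called to the boy."))
# prints: Jad hfl fwhekj ifwwdu je jad qepv
```

To translate a whole exercise file, read it into a string and pass it through `translate` before saving the result.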

  6. Atle
    July 14, 2010 at 12:24 am | #14

    Hi guys, I meant to respond earlier, but I got pulled away by a side project.

    This keyboard is interesting. Iv been making use of it. Il let you know how it goes. (Using it now.)

    A main concern of mine is a ‘private domain area’ to reuse for variant keyboards. This keyboard is actually fantastic for that.

  7. feurry
    July 14, 2010 at 1:19 am | #15

    I’d been using your program for a while back in ’09 trying to come up with layouts that favoured the middle finger. I was tweaking the keyboard position costs and also manually making changes. I wasn’t exactly thinking about sacrificing finger travel distance, but more about sacrificing rolls, since they aren’t my thing.

    My perception is the middle finger is as easy to type in the top row as the pinky in the home row, and far more powerful.

    Anyways, i’d try the layout and comment, but i’m already using a layout with a similar o e ‘ combo, so i’ll comment on that. I love it, and this layout would be pretty darn good too, but i won’t be testing it as i can tell already i don’t like the hre grouping.

    I’m using a modded version of your 2nd layout from sept 12/09. I moved jx; for personal tastes. This is the modded version.
    Hands: 52% 47%
    Fingers: 8% 9% 19% 14% 15% 14% 10% 7%
    y c o u j k m d p w
    i s e a . l h t n r
    x z ‘ , ; v f g b q

    it has even more middle finger action by having t and d on the right.

    I’m much happier with this layout than mtgap 2.0, my previous layout of choice.

    • July 14, 2010 at 7:48 pm | #16

      If you can easily move your middle finger around, I can definitely see how putting more on the middle finger could be a good idea. It looks especially promising because same finger can be so low — td/dt and oe/eo are very rare.

      What are all the statistics for your modded layout?

      • feurry
        July 15, 2010 at 8:33 pm | #17

        here’s the full stats
        Hands: 52% 47%
        Fingers: 8% 9% 19% 14% 15% 14% 10% 7%

        y c o u j k m d p w
        i s e a . l h t n r
        x z ‘ , ; v f g b q

        Fitness: 2313114799
        Distance: 9580900
        Inward rolls: 7.08%
        Outward rolls: 2.23%
        Same hand: 16.85%
        Same finger: 0.68%
        Row change: 7.59%
        Home jump: 0.24%
        To center: 1.60%

        as a bonus here’s another layout with a really low same finger.. about 1% lower than your posted layout. Same hand is much lower though.. so i’m really amazed your layout has such a low same finger and higher same hand.

        Hands: 52% 47%
        Fingers: 8% 9% 19% 14% 14% 14% 10% 7%

        y l o u ; q m d p w
        i h e a , f s t n r
        j k ‘ . z v c g b x

        Fitness: 2308232051
        Distance: 9523500
        Inward rolls: 6.15%
        Outward rolls: 2.89%
        Same hand: 16.54%
        Same finger: 0.58%
        Row change: 7.77%
        Home jump: 0.39%
        To center: 0.77%

        not sure why i didn’t end up using that layout.

        • feurry
          July 15, 2010 at 9:13 pm | #18

          while i’m on low same finger, here’s another one of my layouts that looks pretty kick ass from the stats.

          Hands: 51% 48%
          Fingers: 7% 10% 19% 13% 14% 13% 11% 10%

          q l o u j k y m c p
          a r e i , f h t s n
          z x ‘ . ; v d w g b

          Fitness: 101377344
          Distance: 9580900
          Inward rolls: 7.49%
          Outward rolls: 3.94%
          Same hand: 19.03%
          Same finger: 0.51%
          Row change: 8.42%
          Home jump: 0.29%
          To center: 0.84%

          Despite the numbers this layout wasn’t working for me. I don’t like the 10% on the pinky, nor the ‘or’, ‘er’, and ‘ng’ combinations. Maybe someone else will like it.

  8. Atle
    July 15, 2010 at 5:48 pm | #19

    I also dont seem to mind the o above e, at the mid top. I dont know if it will become an issue later at higher speed, but right now seems fine.

  9. Atle
    July 15, 2010 at 5:59 pm | #20

    ‘The hre grouping’

    Funny enough, Im finding that awkward as well.

    • feurry
      July 15, 2010 at 8:42 pm | #21

      there, here, were aka the ere combo is very common which leads to a lot of direction changing while doing rolls. If er were on the index and middle it would probably be fine.

  10. Atle
    July 15, 2010 at 8:20 pm | #22

    Heh. Surprisingly, the most difficult key for me is the pinky bottom. The need to not activate the touchpad gives the heel less mobility. The wrist flexes up, twists and strains.

    • dphrei
      November 21, 2010 at 3:14 pm | #23

      i would suggest setting a hotkey to toggle the pad on/off. should be in mouse settings.

  11. Atle
    July 16, 2010 at 2:58 am | #24

    These four letters form the base of a family of keyboards:

    N R S T

    Out of curiosity, with no other factors, which sequence has the best rolls?

    • July 16, 2010 at 3:11 am | #25

      Without using a computer program, I can tell you that ST is the most common digraph possible using those four letters, and NT is the second most common. There aren’t any others that are very common. Almost all common digraphs are vowel-consonant or consonant-vowel.

  12. Atle
    July 22, 2010 at 3:47 pm | #27

    So, not all rolls are equal. As a tentative estimate, I rank the rolls from best to worst as follows, where I=index, M=middle, R=ring, P=pinky:

    Inward rolls from easiest to least easy:
    MI,RM,PR,RI,PI,PM.

    Outward rolls from easiest to least easy:
    IM,IR,IP,RP,MP,MR.

    Personally I find the MR roll surprisingly awkward, the R (ring finger) seems uncoordinated and erratic in that sequence. I suspect thats why having the letters ‘ER’ on the fingers MR may be uncomfortable for some.

    Anyway, try ranking the rolls yourselves to see if you experience them similarly.

    After a consensus emerges, determine if certain outward rolls are easier than certain inward rolls. Once the home positions are clear, use it as a basis to estimate the value of rolls involving the center and other rows.
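One way such a ranking could feed into an evaluation function is as a lookup table of per-roll scores. The Python sketch below uses the tentative orderings from this comment; the linear scoring (1.0 for the easiest roll in each list, decreasing evenly) and the name `roll_score` are purely illustrative assumptions, since no actual weights have been agreed on.

```python
# Tentative rankings from the comment, easiest first.
# I = index, M = middle, R = ring, P = pinky.
INWARD = ["MI", "RM", "PR", "RI", "PI", "PM"]
OUTWARD = ["IM", "IR", "IP", "RP", "MP", "MR"]


def roll_score(first, second):
    """Return (direction, score) for a roll from one finger to another.

    The score is 1.0 for the easiest roll in its list and falls linearly;
    this weighting is an assumption, not a measured value.
    """
    pair = first + second
    for direction, ranking in (("inward", INWARD), ("outward", OUTWARD)):
        if pair in ranking:
            position = ranking.index(pair)
            return direction, 1.0 - position / len(ranking)
    raise ValueError(f"not a roll between two different fingers: {pair!r}")


print(roll_score("M", "I"))
print(roll_score("M", "R"))
```

If people's own rankings differ, only the two lists need to change; comparing inward against outward rolls would need a second, combined ranking.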

    • Atle
      December 29, 2010 at 11:24 pm | #28

      Im wondering if the problems with rolls are extremely specific. So, all rolls are great except for these problems.

      I am currently using a personalized i-s-e-a keyboard. The reversing e-s-e sequence disrupts (confuses?) the flow of typing noticeably. Especially at high speeds.

      Mid-Ring-Mid: Specifically, the problem seems to be the roll of mid-ring-mid. By contrast the ring-mid-ring roll seems fine! The problem only exists for high frequency rolls. Using mid-ring-mid is intolerable for the most frequent e-r-e roll, problematic but potentially worth the sacrifice for the e-s-e roll, and simply negligible for less frequent characters.

    • Atle
      December 29, 2010 at 11:41 pm | #29

      I want to call attention to a truly surprising problem. Id like to see what the experiences of other people are.

      I created an h-i-e-a keyboard (hand home) with mid e-o. It has zero outward rolls with regard to the frequent digraphs, and good same finger, distance, and finger distribution. It looks excellent according to the methodology.

      I was shocked to discover the top frequent *inward* roll of pinky-h then mid-e (such as ‘the’, ‘he’, ‘there’, etcetera) is very straining to the wrists. Im half wondering if this and similar usage is responsible for all carpal tunnel problems.

      Specifically, the problem occurs when you exert pressure with both the pinky and the middle finger without exerting the ring finger as well.

      If you lift your ring finger and roll back and forth with your mid and pinky, you can feel out exactly where the position becomes problematic.

      Do you guys have a similar experience?

      If you feel

      • Atle
        December 30, 2010 at 11:24 pm | #30

        I was looking for an ergonomic way to resolve this issue of straining.

        Coincidentally, the program also tends to produce the solution that avoids the straining.

        While the program tends to put the e on the middle finger, it also tends to put the i on the pinky.

        The resulting inward i-e roll is relatively infrequent, and the outward e-i roll seems so infrequent as to be off the radar.

        The infrequency of i-e prevents any issues from arising. Indeed, Iv been using a i-s-e-a variant for a while, and never had issue with straining. It was only after I created a layout with a frequent h-e roll that I even discovered this could be an issue.

        With regard to the other hand, I use r-n-t-h (right home, reversed with pinky on r and index on h). In this case, the inward r-t roll and outward t-r roll are both relatively infrequent. Iv never noticed a problem here either. So their frequency rank, somewhere between 80th and 100th, is probably a good ballpark for a safe frequency, and the i-e roll falls in this range too.

  13. Pieter
    October 26, 2010 at 10:25 pm | #31

    Or – you might drop the keyboard conventions completely and go for different things. Like honeycomb shaped keys – I suppose you all know this paper that I found?

    http://www.almaden.ibm.com/u/zhai/papers/Softkeyboard/UISTCamera.pdf

    • October 26, 2010 at 11:04 pm | #32

      That’s actually quite interesting, I’ve never seen it before. I’ll be sure to look at it more closely later.

  14. Atle
    December 29, 2010 at 11:10 pm | #33

    For a while now, Iv been using a mod of an i-s-e-a keyboard (hand home) with the e-o middle finger. I love having the mid-e-o.

    As others have evidenced, I feel the real reason for its success is the minimal same-finger usage for consecutive characters. I love having the lowest same finger.

    Sharing the e-o on one finger leaves more room for the other home keys to share with their optimal characters. The mid finger can definitely handle the workload, and it is worth the sacrifice of distance. I am impressed the program is accurate enough to predict this.

  15. December 31, 2010 at 1:40 am | #34

    Regarding the rarity of the semicolon:

    The letter frequency on my website is old and there was a bug in my program. It didn’t have much of an effect on more common characters such as letters but it was drastically affecting punctuation. Actual punctuation frequency (according to my data) is this:

    . , _ ( ) ; = ” / – $ * ‘ { } : > [ ] < + \ ! # | @ ? & % ~ ^ `

    Which means that the semicolon is actually more common than $ after all.

    • Atle
      December 31, 2010 at 2:51 am | #35

      . , _ ( ) ; = ” / – $ * ‘ { } : > [ ] < + \ ! # | @ ? & % ~ ^ `

      Huh.

      It seems like the textual corpus for these punctuations includes quite a bit of computer coding. That might explain the strangely high frequency of the spacebar _ (even more frequent than quotation marks?), as well as how the braces {} which are usually quite rare could be even more frequent than brackets [].

      • December 31, 2010 at 3:06 am | #36

        I explain how I calculated it here. While it’s true that most people are not programmers, people who are programmers do a lot of programming, and programs usually contain a lot of punctuation. Maybe I’m a little biased because I’m a programmer.

        • Atle
          December 31, 2010 at 4:55 am | #37

          Actually, same link I was referring to below. Good stuff. I read it last year, but was already rereading it last night when thinking about punctuation.

      • Atle
        December 31, 2010 at 3:06 am | #38

        Computer codes – especially source codes for webpages and wiki editing.

        • December 31, 2010 at 3:24 am | #39

          I don’t think it does include much computer code if at all.

          • Atle
            December 31, 2010 at 4:53 am | #40

            Whats the reason for the remarkably high frequency of spacebar _ ?

  16. Atle
    December 31, 2010 at 4:50 am | #41

    ‘Maybe I’m a little biased because I’m a programmer.’

    Heh, exactly.

    Actually, last night I was studying the data that you compiled, which separates the punctuation into different genres of writing style: casual, news, formal, prose, and programming. The differences in the frequencies of the letters are small. Whether the r or h is more frequent, the c or u, f or w, or x or j, doesnt noticeably affect the ease of typing. It is possible to accurately approximate the comparable frequencies, without needing to be precise. It usually doesnt matter what writing style one uses, its more or less the same letter frequency, and they use the keyboard in roughly the same way.

    However the punctuation frequency differs drastically from one genre to the next.

    It occurred to me. Punctuations and writing style are the same thing. It is precisely the use of punctuation that defines what a genre is – or at least shapes the form that the genre takes. There are accepted and unaccepted uses of punctuation in formal writing, relaxed and pretentious ways of punctuation in casual writing, polished and raw punctuation in prose, clean and messy punctuation in news, and correct and buggy punctuation in coding.

    Once I realized the identity between punctuation and genre, I tried to come up with a way to organize the data meaningfully. I came up with this method that Im pretty happy with.

    For each genre of punctuation systems, I compare their punctuation frequencies against their number frequencies. The results are telling.

    • Generally, the punctuations whose frequencies intermingle the frequencies of the letters of the alphabet (generally near the frequency of the letter b) and are more frequent than any number: these are the must-have punctuations that define the writing style.
    • The punctuations whose frequencies intermingle those of the numbers: these are the supportive punctuations that characterize the mood of the genre.
    • Here in this post I ignore any punctuation that is less frequent than the least-frequent number. I also subdivide these into those that are rare but still part of the repertoire of the genre and those that are alien to the genre, except in accidental circumstances, such as referring to another genre.

    In the genre descriptions below, I give two lines.

    • The first line represents the punctuations whose frequencies are greater than that of any number.
    • The second represents the punctuations whose frequencies are roughly the same as those of the numbers.

    Each genre prefers its favorite punctuations, reusing them as part of its own punctuation system.

    CASUAL
    . ,

    - ‘ ) ( : ” / ! ?

    Casual writing is simple comments, periods and commas. Less elaborate formats. At the same time, casual writing is more likely to type number symbols for the sake of brevity rather than spell the words out. So all of the supportive punctuations are at about the same frequencies as the numbers. (The supportive punctuation includes a reasonably high frequency of parentheses for stray comments that interject into the flow of the conversation – and dashes too.) There is less concern for quotation marks, as there is less need to formally document the casual conversations.

    PROSE
    , . ” ‘ – ! ; ? : ) (

    *

    Prose formally presents casual conversations. These are the novels and narrative descriptions. The high frequency of quotation marks shows hyper concern for exactly which person says what. The high frequency of apostrophes shows the formal presentation of vernacular speech, especially in the form of contractions. Notice also that the numbers are less frequent than the bulk of the punctuation in use because the narrative format prefers to spell out any numbers (such as ‘forty-two’) rather than jot their symbols (‘42’). As far as I know, the chevrons (really inequality signs) are often for narrative markers, such as to visually signify that a character is actually speaking a different language when saying a statement. Note the infrequency of the equal sign, so these chevrons rarely convey numerical computations. Novels similarly use italics to signify to the reader what someone is thinking. As for the asterisk *, every now and then, the author must interrupt the flow of the story to inform the reader about the significance of a term or circumstance.

    FORMAL
    , . – ” ;

    ‘ ) ( : []

    Formal writing includes academic writing as well as other professional writing styles. This genre cares about who makes what claims, using quotation marks to communicate the text of interest, as well as parentheses to document the source, simultaneously with numbers to convey date and page number, and even a relatively high frequency of brackets [] to interrupt the quote with hyper analysis. The formal style also includes frequent use of dashes and semicolons to visually divide complex thoughts.

    NEWS
    , . ” –

    ‘ :

    LOL. News says it all. Crisp clean statements. Done.

    PROGRAMMING
    _ ” . , = ‘ ( : ) > < [ ] /

    @ \ | ? { } – %

    Yeah. And then theres programming, where the punctuation is itself the language. Its like looking at cuneiform.

    The subtle way that punctuation writes genres fascinates. Each style reuses the punctuations for its own distinctive system. The community of each system even enforces what punctuations are appropriate or not.

    With regard to keyboard design, it seems the designers can specialize the keyboard for specific punctuation systems. Likewise each keyboard user can note which punctuations they tend to find themselves using when selecting their layout. The keyboarder can even swap among several keyboard layouts, each with its own punctuation arrangement.

    • December 31, 2010 at 5:52 am | #42

      That’s a really fascinating discovery you’ve made. I think it makes perfect sense.

      Atle :

      Whats the reason for the remarkably high frequency of spacebar _ ?

      (You are referring to the underscore right?)

      I think I have your answer. This is the letter frequency for the C files only:


      (space) 23389
      e 5967
      t 5616
      i 4271
      s 3769
      o 3688
      r 3597
      n 3516
      a 3415
      _ 2404
      l 2370
      c 2348
      d 2072
      / 1909
      u 1736
      m 1729
      * 1712
      p 1697
      ) 1689
      ( 1689
      = 1621
      ; 1620
      f 1455
      h 1243
      g 1208
      b 996
      - 969
      , 942
      . 764
      " 713
      x 683
      y 670
      278
      z 233
      ' 224
      j 201
      & 192
      \ 174
      3 165
      4 115
      5 98
      ! 94
      | 94
      6 89
      8 84
      q 79
      7 61
      9 49
      % 48
      ? 19
      @ 6
      ~ 5
      ^ 4
      $ 1

      _ is one of the most common characters because it is so frequently used in variable names. It’s nearly as common in Ruby (but less common in Java because Java variables are usually written in camelCase instead).
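For anyone who wants to produce a similar table from their own files, here is a minimal Python sketch. It is not the program used for the numbers above; folding letters to lowercase and counting the space but no other whitespace are assumptions, and `count_chars` and `count_files` are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path


def count_chars(text):
    """Count each character, folding letters to lowercase.

    The space is counted; tabs and newlines are ignored (an assumption).
    """
    return Counter(ch.lower() for ch in text if ch == " " or not ch.isspace())


def count_files(paths):
    """Sum the character counts over several files."""
    total = Counter()
    for path in paths:
        total += count_chars(Path(path).read_text(encoding="utf-8", errors="replace"))
    return total


# A one-line C-style sample: underscores show up quickly in variable names.
sample = "int my_var = my_func(a_b);"
for ch, n in count_chars(sample).most_common(5):
    print(repr(ch), n)
```

Passing `Path("src").glob("*.c")` (with a directory of your own) to `count_files` would give per-language counts like the C and Perl tables here.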

      The dollar sign is common for a similar reason. Here are the five most common letters in Perl:


      e 7053
      s 4818
      t 4361
      r 3983
      $ 3963

      $ is used in every variable so it’s very common in Perl, but not nearly as common in any other programming language or in normal writings.

      I produced a new letter frequency for non-programmers and put it up.

      e t a o i n s r h l d c u m g f p w y b , . v k ' " - x 0 j 1 q 2 z ) ( : ! ? 5 ; 3 4 9 / 8 6 7 [ ] % $ | * = _ + > \ < & ^ # @ ` ~ { }

      You’ll notice that underscores are much less common here.

  17. Atle
    December 31, 2010 at 10:15 pm | #43

    LOL, Iv never used the _ character for anything except as a bar to represent an empty space. I even forgot they were called ‘underbars’.

  18. Atle
    December 31, 2010 at 11:34 pm | #44

    Your new stats look excellent.

    Applying my methodology that I describe above, the results feel accurate when using your stats.
    • Must-have punctuations: these punctuations are more frequent than any number.
    • Finetuning punctuations: these are equally frequent with the range of numbers.
    • Repertoire punctuations: the rest of the punctuations, which are less frequent than any number, can further subdivide. A third category has the punctuations that are rare but remain in the repertoire of the genre as backups for specific situations.
    • Almost-never punctuations: The as-of-yet vague fourth category has the punctuations that are alien to the genre and unlikely to find use.

    MUST-HAVE PUNCTUATIONS!
    , . ‘ ” –

    FINETUNING PUNCTUATIONS
    ) ( : ! ? ; /

    REPERTOIRE PUNCTUATIONS
    [ ] % $ | * = _ + > \ < & ^ # @

    ALMOST-NEVER PUNCTUATIONS
    ` ~ { }
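The first two categories follow mechanically from the ranked list in the earlier reply, so they can be reproduced with a short Python sketch. It takes the characters in frequency order and splits the punctuation at the most frequent and least frequent digit. The split between repertoire and almost-never punctuation is a judgment call with no stated threshold, so the sketch lumps both into one "rarer" group; `punctuation_tiers` is a hypothetical name.

```python
# Non-programmer ranking from the earlier reply, most frequent first.
RANKED = r"""e t a o i n s r h l d c u m g f p w y b , . v k ' " - x 0 j 1 q 2 z
) ( : ! ? 5 ; 3 4 9 / 8 6 7 [ ] % $ | * = _ + > \ < & ^ # @ ` ~ { }""".split()


def punctuation_tiers(ranked):
    """Group punctuation by where it falls relative to the digits in the ranking."""
    digit_positions = [i for i, ch in enumerate(ranked) if ch.isdigit()]
    first_digit, last_digit = digit_positions[0], digit_positions[-1]
    tiers = {"must-have": [], "finetuning": [], "rarer": []}
    for i, ch in enumerate(ranked):
        if ch.isalnum():
            continue  # letters and digits are not being classified
        if i < first_digit:
            tiers["must-have"].append(ch)   # more frequent than any number
        elif i < last_digit:
            tiers["finetuning"].append(ch)  # within the range of the numbers
        else:
            tiers["rarer"].append(ch)       # less frequent than any number
    return tiers


for name, chars in punctuation_tiers(RANKED).items():
    print(name, " ".join(chars))
```

Feeding in a ranking built from your own writing would show which punctuation moves between groups for you personally.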

    The resulting categories of frequency feel reliable both to represent the tendencies of the population and to quantify the variations for a particular individual. For example, I use parentheses very frequently. So for me personally, these punctuations that are already on the upper cusp of the Finetuning frequency category probably sneak up into my Must-Have category. Oppositely, I personally avoid the use of doublequotes (using singlequotes instead), semicolon, and number sign. So in my own usage, these 'cusp punctuations' would probably fall down into the next lower category, respectively.

    Your results for the character frequencies work well. Even when they segment into separate categories, each category on its own appears accurate.

    Happy New Year!
