Which Numbers are the Most Common?

February 4, 2011 28 comments

I’ve often wondered which numbers are used most frequently. The low numbers (0, 1, 2) are probably the most common and the single-digit numbers are all among the most common, but how common are larger numbers and which larger numbers are the most common? Today I decided to find out.

I analyzed my text corpus and came up with the following as the 100 most common numbers:


1 2 0 3 5 4 10 000 6 8 7 20 12 2005 15 11 9 30 100 2008 16 2007 14 50 18 25 13 17 19 1812 24 2006 23 2004 21 40 22 26 60 27 70 2001 2003 28 200 2002 29 31 80 500 2000 300 1805 150 35 90 1000 101 45 32 36 07 33 00 08 1809 99 75 1990 1984 34 1999 400 48 800 95 06 44 1807 47 1998 41 85 55 250 83 53 1813 43 02 52 37 39 600 1980 51 03 120 04 64

As you might expect, the single-digit numbers are all among the most common. 10, 000, 20, 12, 2005, 15 and 11 are all more common than the least common single-digit number. I’m not surprised by 10 or 20, but I don’t know how 000 got there. 2005 is also an interesting one which I’ll get to later.
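A minimal sketch of this kind of count, in Python. It is not the program used to produce the lists above, and the function name and sample sentence are made up for illustration. It assumes a "number" is any unbroken run of digits. Under that assumption a comma-separated figure like 10,000 splits into 10 and 000, which is one possible explanation for how 000 ended up so high.

```python
import re
from collections import Counter

def count_numbers(text):
    """Count every unbroken run of digits in `text`, keeping leading zeros."""
    # \d+ matches maximal digit runs, so "10,000" yields "10" and "000".
    return Counter(re.findall(r"\d+", text))

# Made-up sample text, only for illustration.
sample = "In 2005 she paid 10,000 for 2 cars and 1 bike; in 2005 he paid 1."
counts = count_numbers(sample)

# Print each distinct number with its count, most frequent first.
for number, freq in counts.most_common():
    print(number, freq)
```

Keeping the tokens as strings rather than converting them to integers is what lets 07, 00 and 000 show up as separate entries.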

The most common numbers from 10 to 19:

10 12 15 11 16 14 18 13 17 19

This isn’t too surprising either.

Here’s every two-digit number:


10 20 12 15 11 30 16 14 50 18 25 13 17 19 24 23 21 40 22 26 60 27 70 28 29 31 80 35 90 45 32 36 07 33 00 08 99 75 34 48 95 06 44 47 41 85 55 83 53 43 02 52 37 39 51 03 04 64 38 42 46 65 49 01 09 54 66 58 84 89 67 05 59 98 72 56 73 77 62 78 68 76 57 92 63 61 81 82 69 97 88 86 96 71 79 94 93 74 91 87

Every number from 10 to 99 appears at least once, which I suppose is what you’d expect. Lower numbers tend to be more common, as do round numbers like 50, 25, and 40.

I also looked at the most common dates:


2005 2008 2007 1812 2006 2004 2001 2003 2002 2000 1805 1000 1809 1990 1984 1999 1807 1998 1813 1980

This probably means that most of the text corpus is from the late 2000s. Besides recent dates, a few others stand out: 1812, 1805, 1809, and 1984. (1000 is probably not used as a date most of the time.) 1812 probably comes from references to the War of 1812, and 1984 from references to the book. 1805 and 1809 both appear repeatedly in War and Peace, one of the books used in the text corpus.

For our last exercise, let’s find the first number that doesn’t appear. Every number from 0 to 567 appears at least once. The first number not to appear at all is 568.
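Finding the first missing number is a short search once the tokens are collected. The sketch below is hypothetical (the helper name and sample tokens are invented) and assumes that tokens with leading zeros, such as 07, count as the integer 7.

```python
from itertools import count

def first_missing(tokens):
    """Return the smallest non-negative integer that no token represents."""
    # Assumption: "07" and "7" both count as an appearance of 7.
    seen = {int(tok) for tok in tokens}
    # count() yields 0, 1, 2, ...; stop at the first value never seen.
    return next(n for n in count() if n not in seen)

# Made-up tokens, only for illustration.
tokens = ["0", "1", "2", "07", "3", "5", "10"]
print(first_missing(tokens))  # prints 4
```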

I’ve always wondered which numbers are the most common, and now we have an answer.

Categories: Language

Easy-to-Use Keyboard Optimization Program

January 22, 2011 31 comments

I’ve made some modifications to the keyboard optimization program. It is now much easier to use, especially for someone who doesn’t have much experience with computer programming. You can get it here. I added a makefile and a better readme, but more importantly, a command-line user interface. You can now customize the costs, use various features, and even change how the text corpus is weighted.

Fully Optimized Standard Keyboard

January 16, 2011 34 comments

I recently proposed a fully optimized layout built for the Kinesis physical keyboard. By popular request, I have now created a fully optimized layout for the standard keyboard.


= 1 2 3 4 5 6 7 8 9 0 q z
y p o u - k d l c w x / j
i n e a , m h t s r "
( ) ; . _ v f g b '

Fitness: 184428299
Distance: 364416
Inward rolls: 10.16%
Outward rolls: 2.36%
Same hand: 34.93%
Same finger: 1.60%
Row change: 13.17%
Home jump: 0.27%
To center: 2.38%
To outside: 0.52%

It looks strikingly similar to the latest version of the Kinesis layout:


1 2 3 4 5 6 7 8 9 0 q
y p o u - v d l c w x
i n e a , m h t s r "
( ) ; . _ k f g b '
/ =             z j

Fitness: 186751864
Distance: 959128
Inward rolls: 10.16%
Outward rolls: 2.36%
Same hand: 34.94%
Same finger: 1.60%
Row change: 13.40%
Home jump: 0.30%
To center: 2.38%
To outside: 0.39%

Most of what I have to say about fully optimized keyboard layouts has been said. I do find it interesting that the standard and Kinesis layouts look so similar; it looks like the rare keys around the edges have barely any effect at all.

Starting to Fully Optimize the Keyboard

January 4, 2011 24 comments

(Edit: I found a bug in the way rolls were being calculated. MTGAP 0.1 (shown below) is no longer the best layout.)

It’s been a while since I’ve done anything with the New Keyboard Layout Project, but I read a comment on one of my posts and I got to thinking about punctuation. Every keyboard I’ve designed has just been based on the main 30 keys and used .,’; as the four punctuation marks, because those are the ones that Dvorak used. But why use those four punctuation marks? Why not use a different set?

In fact, why not simply try to optimize the entire keyboard instead of just the main 30 keys?

Previously, the answer to that question was that it would be too slow. But now, thanks to a much-improved algorithm, I no longer have that excuse. That means I can evaluate the entire keyboard.

The Physical Keyboard

Changing the size of the keyboard requires rewriting large portions of the program. For this reason I didn’t want to rewrite it for a standard physical keyboard — why design such a highly optimized layout for a suboptimal physical keyboard? Instead, I rewrote the program to optimize on the Kinesis Advantage Pro keyboard. (You can see a good picture of the full keyboard here.) I ignored the thumb pads, tab, shift and caps lock for aesthetic reasons, and the arrow keys and function keys because it is nearly impossible to determine the frequency of those keys. This leaves four rows and 47 keys. The QWERTY keyboard looks like this:

1 2 3 4 5 6 7 8 9 0 -
Q W E R T Y U I O P \
A S D F G H J K L ; '
Z X C V B N M , . /
` =             [ ]

Shifted Keys

My program doesn’t deal with shifted keys, and modifying it to do so would be a much greater task than what I am currently doing. Rather than try to get the program to deal with shifted keys, I decided to simply choose the most common punctuation and put those on the unshifted slots.

There are 26 letters and 10 numbers. Out of 47 spots this leaves 11 spots for punctuation. The 11 most common punctuation marks are:

, . ) ( _ " ; - ' = /

So I pulled off the standard punctuation and stuck those on.
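Ranking punctuation by frequency can be sketched as follows. This is an illustration rather than the code from the optimization program: the function name and sample text are hypothetical, and it assumes "punctuation" means the ASCII characters in Python's `string.punctuation`.

```python
import string
from collections import Counter

def top_punctuation(text, n):
    """Return the `n` most frequent ASCII punctuation characters in `text`."""
    counts = Counter(ch for ch in text if ch in string.punctuation)
    return [ch for ch, _ in counts.most_common(n)]

# Made-up sample text, only for illustration.
sample = "a, b, c. (d)"
print(top_punctuation(sample, 1))  # prints [',']

# 47 key slots minus 26 letters and 10 digits leaves room for punctuation.
slots_for_punctuation = 47 - 26 - 10
print(slots_for_punctuation)  # prints 11
```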

Results

After some (not insignificant) modifications, the program was able to optimize a full-sized keyboard. You can download my earliest functional version of the program here. It hasn’t been extensively tested and it’s messy, but it’s functional.

The first layout it came up with was this one:

Hands: 53% 46%
Fingers: 10% 10% 10% 21% 13% 14% 10% 8%


x 3 6 5 q / " 9 2 8 0
u l c o ; v m d p ) j
a r s e , f h t n i -
( ' w . = k y g b _
4 1             z 7

Fitness: 20014435648
Distance: 97925496
Inward rolls: 16.44%
Outward rolls: 5.40%
Same hand: 48.05%
Same finger: 2.02%
Row change: 21.05%
Home jump: 0.80%
To center: 3.03%
To outside: 0.40%

(I’ve added a new cost: “to outside.” It’s similar to “to center”: it penalizes a layout every time the user has to reach to the outside of the keyboard with the pinky before or after typing a letter on that same hand.)
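One way to read the “to outside” cost is as a count of adjacent key pairs, typed by the same hand, in which at least one key sits in an outside pinky position. The sketch below follows that reading. The hand assignment and the set of outside keys are hypothetical inputs, not the ones the program actually uses.

```python
def to_outside_count(text, hand_of, outside):
    """Count adjacent same-hand key pairs where either key is an outside key.

    hand_of -- dict mapping each key to "L" or "R" (hypothetical assignment)
    outside -- set of keys that need an outward pinky reach (hypothetical)
    """
    hits = 0
    for a, b in zip(text, text[1:]):
        same_hand = a in hand_of and b in hand_of and hand_of[a] == hand_of[b]
        if same_hand and (a in outside or b in outside):
            hits += 1
    return hits

# Toy assignment: "r" and "k" on the right hand, "a" on the left,
# with "k" placed in the outside column.
hand_of = {"r": "R", "k": "R", "a": "L"}
outside = {"k"}
print(to_outside_count("rka", hand_of, outside))  # prints 1
```

Dividing the count by the number of key pairs in the text would give a percentage comparable to the figures in the stat blocks.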

Highly optimized and aesthetically horrible. The number keys, instead of being in a nice straight line, are all over the place. The parentheses aren’t even next to each other. This isn’t much of an issue when you’re dealing with the main 30 characters because there are no real aesthetics to speak of, but once you expand to the full keyboard it becomes a serious problem.

The solution is to require that the program put certain keys in certain places: the number keys on the top row and the parentheses next to each other. There are two fundamental ways to do this: force it to, or give a penalty for not doing so. I found that the best way to keep the number keys in place was simply to tell the computer that it wasn’t allowed to move them. That doesn’t quite work with parentheses though, because they should still be able to move around; they just should stay next to each other. If one moves, the other moves. Forcing them to be next to each other but still be able to move around as a chunk would require adding an extra layer of complexity to the program. The simpler solution is to heavily penalize a keyboard layout every time the parentheses aren’t next to each other.
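Both approaches can be sketched in a few lines. This is an illustration under assumptions, not the program's actual code: the layout is represented as a list of rows, `PAREN_PENALTY` is an invented weight, and “next to each other” is taken to mean neighboring columns in the same row.

```python
import random

LOCKED = set("1234567890")   # keys the optimizer is not allowed to move
PAREN_PENALTY = 1_000_000    # hypothetical weight, large enough to dominate

def find(rows, ch):
    """Return (row, column) of key `ch` in the layout."""
    for r, row in enumerate(rows):
        if ch in row:
            return r, row.index(ch)
    raise ValueError(ch)

def paren_penalty(rows):
    """Penalize the layout unless ( and ) sit side by side in one row."""
    (r1, c1), (r2, c2) = find(rows, "("), find(rows, ")")
    adjacent = r1 == r2 and abs(c1 - c2) == 1
    return 0 if adjacent else PAREN_PENALTY

def random_swap(rows, rng):
    """Swap two keys in place, never touching a locked key."""
    movable = [(r, c) for r, row in enumerate(rows)
               for c, ch in enumerate(row) if ch not in LOCKED]
    (r1, c1), (r2, c2) = rng.sample(movable, 2)
    rows[r1][c1], rows[r2][c2] = rows[r2][c2], rows[r1][c1]

good = [list("1234567890q"), list("ycou()ldpwx")]
bad = [list("1234567890q"), list("ycou(l)dpwx")]
print(paren_penalty(good), paren_penalty(bad))  # prints 0 1000000

random_swap(good, random.Random(0))
print("".join(good[0][:10]))  # prints 1234567890
```

The penalty lets the parentheses wander as long as they end up together, while the locked set removes the number keys from the search entirely.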

After adding these restrictions and tweaking the costs a bit, I came up with this layout:

MTGAP Full 0.1

Hands: 52% 47%
Fingers: 9% 10% 18% 13% 13% 14% 10% 9%


1 2 3 4 5 6 7 8 9 0 q
y c o u ( ) l d p w x
i s e a , m h t n r k
_ v " . ; ' f g b -
/ =             z j

Fitness:       193491944
Distance:      956628
Inward rolls:  8.42%
Outward rolls: 2.20%
Same hand: 36.00%
Same finger: 1.48%
Row change: 17.14%
Home jump: 0.26%
To center: 2.29%
To outside: 0.50%

Some of its numbers are quite impressive. For comparison, here’s Colemak (with punctuation modified a bit to fit on the keyboard):

Hands: 46% 53%
Fingers: 8% 8% 11% 18% 18% 15% 10% 9%


1 2 3 4 5 6 7 8 9 0 -
q w f p g j l u y ; =
a r s t d h n e i o '
z x c v b k m , . /
_ "             ( )

Fitness:       230028740
Distance:      1006256
Inward rolls:  4.53%
Outward rolls: 2.62%
Same hand: 42.86%
Same finger: 2.01%
Row change: 18.93%
Home jump: 0.74%
To center: 7.54%
To outside: 0.48%

My layout beats Colemak on every single metric except “to outside” (and possibly outward rolls, depending on whether you like those or not). Notice that, even though my layout puts ‘o’ (the fourth most common letter) off the home row, it still has lower travel distance than Colemak.

(In case you’re new here, I compare my layout to Colemak because Colemak is my favorite keyboard layout that I didn’t design.)

Also, if you’re interested, here’s Dvorak:

Hands: 44% 55%
Fingers: 8% 8% 12% 14% 16% 13% 13% 11%


7 5 3 1 9 0 2 4 6 8 =
' , . p y f g c r l /
a o e u i d h t n s -
; q j k x b m w v z
_ "             ( )

Fitness:       247807385
Distance:      1020108
Inward rolls:  4.14%
Outward rolls: 1.25%
Same hand: 31.14%
Same finger: 3.16%
Row change: 14.36%
Home jump: 0.50%
To center: 7.39%
To outside: 0.39%

The Interpreter, Part 1 Conclusion

August 21, 2010 4 comments

The final chapter in one man’s first journey to write an interpreter.

Jackson spent the next month or so adding smaller features and fixing bugs. At last, The Interpreter Version 1.0 was ready for release. He named his language Simfpl: Simple Interpreted Mathematically-Oriented Functional Programming Language. He put the source code on the internet for everyone to see.

To be able to use it, you first need to install GMP and MPFR.

EDIT 7/1/13: I’ve moved the code to GitHub.

Categories: The Interpreter

The Interpreter, Chapter 7

July 20, 2010 1 comment

The continuing story of one man’s quest to write an interpreter.

Jackson’s interpreter was coming along smoothly. But for it to have a complete foundation, he still needed to add one final feature: user-defined functions.

Jackson wanted to keep the language definition simple. This would avoid special syntax, making the interpreter easier to write, and also would make the language easier to understand and even more extensible. If statements and while loops were defined not with special syntax, but as functions. Similarly, Jackson wanted function definitions to themselves be functions.

So he set up a “def” function, which would be a function to define other functions. It would take three arguments: the function name, the variable list, and the function definition. So far, so good. He implemented this pretty quickly.

But there was a problem: it was impossible to create a recursive function. If the programmer created a function (f) and tried to call it within its own body, the expression would not compile correctly. For (def) to act like a function, it had to work at runtime, but the compiler needed to know at compile time that a reference to (f) was a function.

After brooding over this problem for some time, Jackson came up with a solution. He created a new data type called a function shell, designed to hold just enough information for the compiler to know what to do. Whenever the compiler saw “def”, it would look for the function name. Then it would find all other references to the function name and convert them into function shells. The program would compile knowing that (f) was a function, and would actually define the function after the (def) function was called at runtime.
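The idea can be sketched as two passes over the program. This is only an illustration in Python: the nested-list representation of expressions and every name here (`FunctionShell`, `collect_defs`, `insert_shells`) are hypothetical, since the story doesn't describe Simfpl's actual internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionShell:
    """Placeholder telling the compiler that `name` will be a function."""
    name: str

def collect_defs(expr, names=None):
    """Pass 1: gather every name introduced by a (def name ...) form."""
    names = set() if names is None else names
    if isinstance(expr, list):
        if len(expr) >= 2 and expr[0] == "def" and isinstance(expr[1], str):
            names.add(expr[1])
        for sub in expr:
            collect_defs(sub, names)
    return names

def insert_shells(expr, names):
    """Pass 2: replace references to defined names with function shells."""
    if isinstance(expr, list):
        if len(expr) >= 2 and expr[0] == "def":
            # Keep the name being defined as-is; convert only the rest.
            return expr[:2] + [insert_shells(sub, names) for sub in expr[2:]]
        return [insert_shells(sub, names) for sub in expr]
    return FunctionShell(expr) if expr in names else expr

# A recursive factorial: (def f (n) (if (= n 0) 1 (* n (f (- n 1)))))
program = ["def", "f", ["n"],
           ["if", ["=", "n", 0], 1, ["*", "n", ["f", ["-", "n", 1]]]]]
compiled = insert_shells(program, collect_defs(program))
print(compiled[3][3][2][0])  # prints FunctionShell(name='f')
```

The recursive call inside the body now carries a shell, so it can be compiled as a function call even though (def) has not run yet.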

All of the foundational features had been implemented. But there was still work to do. Jackson had to implement a few smaller features, fix the myriad bugs that had sprung up, and optimize the code.

Stay tuned for the exciting continuation of The Interpreter!

Categories: The Interpreter

Improved Keyboard Layout Program

I made some modifications to my keyboard layout program, and it now runs about twice as fast. On my laptop, it can score 12,000 layouts per second — this is six times as fast as Michael Capewell’s program.

You can download it here.