Path: utzoo!utgpu!watmath!att!tut.cis.ohio-state.edu!mailrus!ames!lll-winken!uunet!ncrlnk!ncr-sd!hp-sdd!hplabs!hpfcdc!hpldola!hp-lsd!prisma!kolstad
From: kolstad@prisma
Newsgroups: comp.text
Subject: Re: Urban Legends (was Re: Dvorak Keyboard Layout)
Message-ID: <10500004@prisma>
Date: 22 Jul 89 01:12:00 GMT
References: <787@dms>
Lines: 328

I wasn't able to mail this to Mr. Leichter:

-----------------------------------------------------------------------

In comp.text, you say:

> ...But it turns out that that model is just
> plain wrong!  ...
> A side-effect of the Scholes layout is to place many
> of the common "units" on alternating hands, which makes typing them easier.
> Dvorak, on the other hand, tends to place many units under the SAME hand,
> which interferes with typing.

I am not a real fan of the Dvorak keyboard but knew someone who could
hit 160 WPM (Andrew Shapira:  shapira@docsun.rpi.edu).  Because he
could type a couple per cent faster than I, my ego was bruised and I
tried the keyboard for a bit (at the behest of Dan Kopetzky, I
believe).

At any rate, while I never became at all proficient, in the limited
number of tests I did (i.e., writing letters like this one), I found
that your thesis that the units are under the same hand is not borne
out.  It is understood that one will always have a few combinations
that turn out that way (witness the word `recede' on the QWERTY
keyboard); nevertheless, the number of digrams and trigrams that were
true alternations appeared to me to be very high on the Dvorak
keyboard.

If one divides the keyboard like this (I copied this keyboard from an
earlier article and split it as best I could) and runs /usr/dict/words
through a trivial script:

    left              right
/  ,  .  P  Y  --- F  G  C  R  L
 A  O  E  U  I --- D  H  T  N  S  ;
  '  Q  J  K  X --- B  M  W  V  Z

tr "'PYAOEUIQJKXFGCRLDHTNSBMWVZpyaoeuiqjkxfgcrldhtnsbmwvz" \
    llllllllllllrrrrrrrrrrrrrrrlllllllllllrrrrrrrrrrrrrrr < /usr/dict/words

Then we have a file which tells which hand types each letter (here's an
excerpt):
l
lll		<-- obviously bad
lllr
llrrlr
llrlr
lrl		<-- the best we can do
lrlrl		<-- the best we can do
lrlrl		<-- the best we can do
lrlrlr		<-- the best we can do
lrlrlrl		<-- the best we can do
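
For anyone who would rather see the mapping in C than in tr, here is a
minimal sketch of the same left/right split.  It assumes the left-hand
letters are exactly the ones in the diagram above; the names
dvorak_hand, hand_pattern and DVORAK_LEFT are made up for illustration,
and unlike the tr script it leaves the apostrophe alone.

```c
#include <ctype.h>
#include <string.h>

/* Left-hand letters of the Dvorak split (assumed from the diagram). */
static const char DVORAK_LEFT[] = "pyaoeuiqjkx";

/* Return 'l' or 'r' for a letter; anything that is not a letter is
   passed through unchanged. */
char dvorak_hand(char c)
{
    int lower = tolower((unsigned char) c);

    if (!isalpha((unsigned char) c))
        return c;
    return strchr(DVORAK_LEFT, lower) != NULL ? 'l' : 'r';
}

/* Write the hand pattern of word into out, which must have room for
   strlen(word) + 1 bytes. */
void hand_pattern(const char *word, char *out)
{
    while (*word != '\0')
        *out++ = dvorak_hand(*word++);
    *out = '\0';
}
```

With this split, `general' comes out as rlrlrlr, i.e. every letter
changes hands.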

Now if we count the transitions, we should be able to measure the
`goodness' of a keyset.  (I'm doing this in real time as I type, and I
have to think about this for a moment.  For you, it will appear to be
instant cuz you'll get this all at once!)  Let's make a chart:

		number of words w/ n `alternations'
length of word  0  1  2  3  4  5 ...
    2           x  x  
    3           x  x  x
    4           x  x  x  x
    5           x  x  x  x  x  
    6           x  x  x  x  x  x  
    7			and so on...

The program appears as Appendix A below.
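
The quantity being tallied in that chart can be stated in a few lines.
This is only a sketch of the metric (one `alternation' per change of
hand between neighbouring letters), with a function name chosen for
illustration; it works on one l/r string at a time.

```c
#include <stddef.h>

/* Count hand changes in a string of 'l'/'r' marks.  A word of length
   n can have at most n-1 of them, which is why each row of the chart
   is one column longer than the row before it. */
int count_alternations(const char *hands)
{
    int n = 0;
    size_t i;

    if (hands[0] == '\0')
        return 0;
    for (i = 1; hands[i] != '\0'; i++)
        if (hands[i] != hands[i - 1])
            n++;
    return n;
}
```

On the excerpt above, lll scores 0, lllr scores 1, and lrlrl scores 4,
the maximum for a five-letter word.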

tr script < /usr/dict/words | program yields:

             0    1    2    3    4    5    6    7    8    9   10   11
 2 n= 131   48   83 
 3 n= 775   60  517  198 
 4 n=2152   29  864 1163   96 
 5 n=3093   16  462 1902  679   34 
 6 n=3794    3  130 1698 1619  329   15 
 7 n=3929    0   23  913 2013  896   82    2 
 8 n=3484    0    7  366 1299 1441  347   22    2 
 9 n=2970    0    1  121  735 1292  694  121    6    0 
10 n=1883    0    0   22  287  680  673  195   26    0    0 
11 n=1052    0    0    0   66  248  429  266   42    1    0    0 
12 n= 542    0    0    1   13   70  169  203   76   10    0    0    0 
13 n= 260    0    0    0    2   20   55   98   66   15    4    0    0    0 
14 n= 102    0    0    0    1    5   15   27   34   17    3    0    0    0    0 
15 n=  39    0    0    0    0    3    1   10   12    9    4    0    0    0    0    0 
             0    1    2    3    4    5    6    7    8    9   10   11
[29 uninteresting cases of >15 character words omitted]

Now, it doesn't look too bad.  I can't think of a quick metric that
says `oh it's obvious this is great'.  Let's quickly write another tr
script for qwerty to have some raw data to compare:

tr "qwertyasdfgzxcvbyuiophjklnm.'&QWERTYASDFGZXCVBYUIOPHJKLNM" \
    llllllllllllllllrrrrrrrrrrrrrrllllllllllllllllrrrrrrrrrrr \
			< /usr/dict/words > /tmp/alternates

Now how does QWERTY do?
             0    1    2    3    4    5    6    7    8    9   10   11
 2 n= 131   68   63 
 3 n= 775  186  388  201 
 4 n=2152  266  837  790  259 
 5 n=3093  211  795 1168  749  170 
 6 n=3794  142  696 1217 1164  482   93 
 7 n=3929   85  398  983 1249  850  322   42 
 8 n=3484   39  231  616  979  929  505  164   21 
 9 n=2970   18  135  421  691  782  565  280   71    7 
10 n=1883    7   46  154  339  453  460  283  111   29    1 
11 n=1052    3   17   56  135  238  270  173  109   42    9    0 
12 n= 542    0    1    8   50   92  142  111   90   36    9    3    0 
13 n= 260    0    0    5   11   36   39   68   58   32    8    3    0    0 
14 n= 102    0    0    0    6    6   16   25   29   10    6    4    0    0    0 
15 n=  39    0    0    0    1    1    5    7   11   10    2    1    1    0    0    0 
             0    1    2    3    4    5    6    7    8    9   10   11
[29 uninteresting cases of >15 character words omitted]

Unfortunately, I must admit that there doesn't seem to be a tremendously
obvious difference in the alternating behavior.  There is some, but
it's just not overwhelming.  Consider the most common word length,
7 letters:
                     0    1    2    3    4    5    6
dvorak:  7 n=3929    0   23  913 2013  896   82    2 
qwerty:  7 n=3929   85  398  983 1249  850  322   42 

Now qwerty has a few more perfect words and a bunch more almost perfect
ones, but it also has dramatically more `poor' words (0 and 1
alternations).  Let's calculate the average alternations:
dvorak:  ( 0*0+  23*1+ 913*2+ 2013*3+ 896*4+  82*5+  2*6) / 3929 = 3.02723
qwerty:  (85*0+ 398*1+ 983*2+ 1249*3+ 850*4+ 322*5+ 42*6) / 3929 = 2.89463
This shows a very slight (4.58%) improvement for dvorak over qwerty.  I'll
go back and modify the program to calculate this for us (see Appendix C):

 l   n    qwerty dvorak
 2 n= 131 0.4809 0.6336
 3 n= 775 1.0194 1.1781
 4 n=2152 1.4842 1.6162
 5 n=3093 1.9586 2.0818
 6 n=3794 2.3761 2.5762
 7 n=3929 2.8946 3.0272
 8 n=3484 3.3789 3.5250
 9 n=2970 3.7832 3.9912
10 n=1883 4.3542 4.4302
11 n=1052 4.8042 4.9743
12 n= 542 5.4244 5.5277
13 n= 260 5.9769 6.0269
14 n= 102 6.3627 6.4804
15 n=  39 6.9231 6.8974

Well, the dvorak keyboard wins at every word length except 15, but not
by much!
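
Each average in the table above is just the weighted mean of one row of
the earlier histograms.  As a sketch of that arithmetic (the function
name and argument layout are assumptions for illustration, not taken
from Appendix C):

```c
/* Mean number of alternations for one word length.
   counts[j] is the number of words with j alternations,
   for j = 0 .. ncols-1. */
double mean_alternations(const int *counts, int ncols)
{
    long total = 0;     /* number of words in the row */
    long weighted = 0;  /* sum of j * counts[j] */
    int j;

    for (j = 0; j < ncols; j++) {
        total += counts[j];
        weighted += (long) j * counts[j];
    }
    return total > 0 ? (double) weighted / total : 0.0;
}
```

For the 7-letter rows this works out to 11894/3929 for dvorak and
11373/3929 for qwerty.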

Maybe what we REALLY want to know is how much time we spend off the
home row ... could that be the REALLY important metric?

Let's translate into keyboard row numbers for dvorak:

tr "/,.PYFGCRLAOEUIDHTNS;'QJKXBMWVZ&pyfgcrlaoeuidhtnsqjkxbmwvz" \
    1111111111222222222223333333333411111112222222222333333333  \
			< /usr/dict/words > /tmp/alternates

And let's modify the program to calculate on-home-row -vs-
off-home-row (see Appendix D).
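
The listings at the end of this article stop at Appendix C, so what
follows is only a minimal sketch of the kind of count involved, not the
program that produced the numbers below.  It works directly on a word
rather than on tr output; the function name and the home_row argument
are assumptions made for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Count the letters of word that lie on the given home row.
   home_row lists the home-row letters in lower case. */
int home_row_count(const char *word, const char *home_row)
{
    int n = 0;

    for (; *word != '\0'; word++) {
        int c = tolower((unsigned char) *word);

        if (isalpha(c) && strchr(home_row, c) != NULL)
            n++;
    }
    return n;
}
```

With the dvorak home row given as "aoeuidhtns", the word
electroencephalography scores 12 of its 22 letters.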

[My buddy just pointed out to me that few people type the dictionary and
we should use more realistic text like a book or a newsgroup/notesfile.
OOPS.  I'll just continue on this tack for now.]

OK, that done, let's also make a tr script for the qwerty keyboard for
rows:
    tr "qwertyasdfgzxcvbyuiophjklnm.'&QWERTYASDFGZXCVBYUIOPHJKLNM" \
        111111222223333311111222233324111111222223333311111222233 \
			< /usr/dict/words > /tmp/alternates

Running the home key calculation program yields (with a bit of text
editing for ease of reading):

                qwerty                  dvorak

 2 n= 131 nhome=  75 = 28.63%	nhome=  157 = 59.92%
 3 n= 775 nhome= 747 = 32.13%	nhome= 1367 = 58.80%
 4 n=2152 nhome=2914 = 33.85%	nhome= 5196 = 60.36%
 5 n=3093 nhome=4892 = 31.63%	nhome= 9328 = 60.32%
 6 n=3794 nhome=6778 = 29.78%	nhome=14335 = 62.97%
 7 n=3929 nhome=8080 = 29.38%	nhome=17297 = 62.89%
 8 n=3484 nhome=8279 = 29.70%	nhome=18078 = 64.86%
 9 n=2970 nhome=7376 = 27.59%	nhome=17359 = 64.94%
10 n=1883 nhome=4841 = 25.71%	nhome=12474 = 66.25%
11 n=1052 nhome=2827 = 24.43%	nhome= 7604 = 65.71%
12 n= 542 nhome=1579 = 24.28%	nhome= 4323 = 66.47%
13 n= 260 nhome= 796 = 23.55%	nhome= 2260 = 66.86%
14 n= 102 nhome= 335 = 23.46%	nhome=  933 = 65.34%
15 n=  39 nhome= 126 = 21.54%	nhome=  389 = 66.50%
16 n=  15 nhome=  50 = 20.83%	nhome=  157 = 65.42%
17 n=   6 nhome=  20 = 19.61%	nhome=   70 = 68.63%
18 n=   4 nhome=  19 = 26.39%	nhome=   46 = 63.89%
20 n=   1 nhome=   5 = 25.00%	nhome=   11 = 55.00%
21 n=   2 nhome=   8 = 19.05%	nhome=   26 = 61.90%
22 n=   1 nhome=   5 = 22.73%	nhome=   12 = 54.55%
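
The percentages are home-row letters divided by all letters typed at
that word length, i.e. nhome / (n * length).  A small sketch of that
calculation, with a function name chosen for illustration:

```c
/* Percentage of keystrokes on the home row for one table row:
   nhome home-row letters out of nwords words of the given length. */
double home_row_percent(long nhome, long nwords, int length)
{
    long letters = nwords * length;   /* total letters typed */

    return letters > 0 ? 100.0 * nhome / letters : 0.0;
}
```

For the 7-letter row, 3929 words make 27503 letters, so the dvorak
figure is 17297/27503 and the qwerty figure is 8080/27503.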

Well, it appears that the dvorak keyboard stays on the home row about
60-65% of the time and that the qwerty keyboard stays on the home row
about 20-30% of the time (for the most part).  That is an improvement
of a factor of two or more in home row usage.  Not bad.  I'll bet
that's the big difference.

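[electroencephalography is the 22 letter word, by the way.  Its qwerty
tally of 5 is the two l's, two a's and the g; the two h's would make it
7, so the qwerty figures in this table evidently count h as off the
home row and run a little low.]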

So, in summary:
    * Alternation is just a bit better (pretty much always)
    * Home row keys are phenomenally better placed

Now we know.

Thanks for providing fodder for this interesting exercise.

ps:  Re-reading your note and this one, I find that I might have been
a bit more clever about my treatment of common digrams and trigrams.
Oh well.

============================= program listings (appendices) follow =======

		APPENDIX A
----------------------------------- tr script < /usr/dict/words | program
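#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLEN 40		/* words this long or longer are skipped */

int     nlengths[MAXLEN];
int     nalternates[MAXLEN][MAXLEN];

int
main (void)
{
    char    buf[512];
    int     l;		/* length of this word */
    int     n;		/* counter of alternates */
    int     i, j;
    char   *p;
    char    thishand;	/* hand that typed the previous letter */

    while (fgets (buf, sizeof buf, stdin) != NULL)
    {
	buf[strcspn (buf, "\n")] = '\0';	/* fgets keeps the newline */
	l = (int) strlen (buf);
	if (l < 2 || l >= MAXLEN)
	    continue;
	nlengths[l]++;
	p = buf;
	thishand = *p++;
	for (n = 0; *p; p++)
	    if (*p != thishand)
	    {
		/* hand changed: remember the new hand.  (Written the
		   other way round, "*p = thishand", this loop would
		   count letters on the opposite hand from the first
		   letter rather than alternations.) */
		thishand = *p;
		n++;
	    }
	nalternates[l][n]++;
    }
    /* one row per word length, one column per alternation count */
    for (i = 2; i < MAXLEN; i++)
    {
	if (nlengths[i] == 0)
	    continue;
	printf ("%2d n=%4d ", i, nlengths[i]);
	for (j = 0; j < i; j++)
	    printf ("%4d ", nalternates[i][j]);
	printf ("\n");
    }
    return 0;
}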
-------------------------------------------------------

		APPENDIX B
The actual tr script for dvorak:
tr ".'PYAOEUIQJKXFGCRLDHTNSBMWVZ&pyaoeuiqjkxfgcrldhtnsbmwvz" \
    lllllllllllllrrrrrrrrrrrrrrrrlllllllllllrrrrrrrrrrrrrrr \
			< /usr/dict/words > /tmp/alternates

The actual tr script for qwerty:
tr "qwertyasdfgzxcvbyuiophjklnm.'&QWERTYASDFGZXCVBYUIOPHJKLNM" \
    llllllllllllllllrrrrrrrrrrrrrrllllllllllllllllrrrrrrrrrrr \
			< /usr/dict/words > /tmp/alternates

-------------------------------------------------------

		APPENDIX C

The program which computes average alternations:
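#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLEN 40		/* words this long or longer are skipped */

int     nlengths[MAXLEN];
int     nalternates[MAXLEN][MAXLEN];

int
main (void)
{
    char    buf[512];
    int     l;		/* length of this word */
    int     n;		/* counter of alternates */
    int     i, j;
    char   *p;
    char    thishand;	/* hand that typed the previous letter */

    while (fgets (buf, sizeof buf, stdin) != NULL)
    {
	buf[strcspn (buf, "\n")] = '\0';	/* fgets keeps the newline */
	l = (int) strlen (buf);
	if (l < 2 || l >= MAXLEN)
	    continue;
	nlengths[l]++;
	p = buf;
	thishand = *p++;
	for (n = 0; *p; p++)
	    if (*p != thishand)
	    {
		/* hand changed: remember the new hand (assigning the
		   other way round would never update thishand) */
		thishand = *p;
		n++;
	    }
	nalternates[l][n]++;
    }
    for (i = 2; i < MAXLEN; i++)
    {
	double sum;		/* total alternations at this length */
	if (nlengths[i] == 0)
	    continue;
	sum = 0;
	printf ("%2d n=%4d ", i, nlengths[i]);
	for (j = 0; j < i; j++)
		sum += j * nalternates[i][j];
	printf ("%6.4f\n", sum/nlengths[i]);
    }
    return 0;
}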

-------------------------------------------------------