Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site peregrine.UUCP
Path: utzoo!utcs!lsuc!pesnta!pertec!peregrine!chris
From: chris@peregrine.UUCP (Chris Cole)
Newsgroups: net.math
Subject: Average length of keys in index with compression.
Message-ID: <122@peregrine.UUCP>
Date: Sun, 9-Jun-85 14:24:05 EDT
Article-I.D.: peregrin.122
Posted: Sun Jun  9 14:24:05 1985
Date-Received: Mon, 10-Jun-85 14:25:43 EDT
Distribution: net
Organization: Peregrine, Systems, Irvine, Ca
Lines: 25

<>
I do not know if the following problem has been solved. If so, please send
a reference:
Consider an alphabetized list of words which has been compressed by replacing
the characters repeated between subsequent entries with a count. For example,
"hello", "help" and "hi" are stored as:
[0] hello
[3] p
[1] i
The numbers in brackets are the counts of repeated characters. This is a common
way of compressing keys stored in indices. The question is:
What is the average length of the stored words?
For the sake of simplicity, we can:
1. Ignore the fixed overhead of the count of repeated characters 
2. Assume some simple distribution of word lengths (e.g., uniform)
Four cases are of interest:
1. The simple case: The probability of a character in a word is the same for
	all characters.
2. The known probability case: The probability of a character is known but
	not necessarily the same for all characters.
3. The realisitic case: The probability of a character or sequence of
	characters is known.
4. The unknown probability case: Nothing is known about the probability of
	characters.
As usual, please reply via mail and I will post a summary of the replies.