Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.2 9/18/84; site peregrine.UUCP Path: utzoo!utcs!lsuc!pesnta!pertec!peregrine!chris From: chris@peregrine.UUCP (Chris Cole) Newsgroups: net.math Subject: Average length of keys in index with compression. Message-ID: <122@peregrine.UUCP> Date: Sun, 9-Jun-85 14:24:05 EDT Article-I.D.: peregrin.122 Posted: Sun Jun 9 14:24:05 1985 Date-Received: Mon, 10-Jun-85 14:25:43 EDT Distribution: net Organization: Peregrine, Systems, Irvine, Ca Lines: 25 <> I do not know if the following problem has been solved. If so, please send a reference: Consider an alphabetized list of words which has been compressed by replacing the characters repeated between subsequent entries with a count. For example, "hello", "help" and "hi" are stored as: [0] hello [3] p [1] i The numbers in brackets are the counts of repeated characters. This is a common way of compressing keys stored in indices. The question is: What is the average length of the stored words? For the sake of simplicity, we can: 1. Ignore the fixed overhead of the count of repeated characters 2. Assume some simple distribution of word lengths (e.g., uniform) Four cases are of interest: 1. The simple case: The probability of a character in a word is the same for all characters. 2. The known probability case: The probability of a character is known but not necessarily the same for all characters. 3. The realisitic case: The probability of a character or sequence of characters is known. 4. The unknown probability case: Nothing is known about the probability of characters. As usual, please reply via mail and I will post a summary of the replies.