Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!mimsy!dave
From: dave@mimsy.UUCP
Newsgroups: comp.ai
Subject: Re: analysis of unknown data
Message-ID: <5933@mimsy.UUCP>
Date: Mon, 23-Mar-87 14:01:10 EST
Article-I.D.: mimsy.5933
Posted: Mon Mar 23 14:01:10 1987
Date-Received: Wed, 25-Mar-87 00:48:30 EST
References: <5681@mimsy.UUCP> <11160001@hpldolm.HP.COM>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 94

In article <11160001@hpldolm.HP.COM>, ben@hpldolm.HP.COM (Benjamin Ellsworth) writes:
> My first comment on this whole discussion, as I understand it, is that
> it is silly.  We are being asked to find "the" meaning of some large
> file without any context for the file.  Is it text?  Is it integer
> data?  Is it floating point data?  Is it encrypted in any way? The 
> search for meaning in the absence of context is a waste of time. 

    Maybe I am at fault for inadequately describing the problem,
but it is neither silly nor a waste of time.  Apart from these
two comments and the later one about testing for randomness being
ridiculous, Ben's comments are helpful in further detailing the
possibilities.

> What is meaningful in one context is often not meaningful in another.
> However, sometimes, it is.  A file full of integer measurement data will
> usually be indistinguishable from a file of a bit-mapped color image.
> A bunch of integers is a bunch of integers (unless some *recognizable*
> context information is included).  If you take a group of integers and
> make a pretty picture with them, what will you do when I tell you that
> they were process measurements from a ball-bearing factory?  What will
> you do when you interpret a Mandelbrot image as a bad lot of wafers
> in an otherwise well controlled fab?
> I'm sure that you would like to say that you can't make a pretty
> picture with ball bearing data.  Perhaps not in every case, but I know
> of a gentleman who *sells* "art" generated from HP stock performance
> data.  He has given some stock data meaning in a new context.

    I wouldn't like to say you can't have multiple representations of a set of data points.

    However, one man's "art" is simply another man's pictorial or
image-based presentation of stock data (particularly if the raw
stock data was not convoluted by the artist).  In fact, it might
be a useful presentation for certain kinds of trend analysis.

> The best response to this question was the one from  Mr. Adrian
> who suggested that you look for the context(s) that the file
> was used in.  If you can't find the correct context, you cannot
> ascertain the correct meaning.  If the data exists in a vacuum, you can
> choose whatever context that you wish and with enough massaging you
> can make the data meaningful.

    Certainly there is a pitfall in the analytic process; one may
"discover" meaning that was not the intent of the creator of
the data.  So it goes, sometimes.

    "Finding the correct context" and "finding the meaning" are the same thing!

> Random is too loose of a term.  Are they "random" samples from a
> uniform distribution, or "random" samples from a Gaussian distribution?
> In either case is the distribution a real population, or a mathematical
> model of a distribution function?
> I don't want to sound like a flame, but testing for randomness is
> ridiculous!  You *cannot* prove a set of data to be "random."  In fact
> the key to some encryption schemes is to make a dataset appear "random"
> to most simple minded tests.  This does not mean that there is no
> information in the data.  It just means that the context of the
> information is well hidden from such simple minded filters.

    Hmm.  I think what I mean is that if the data set appears to be
drawn from a Gaussian distribution, then I'm not going to apply any
other tests.
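
    As a rough illustration of what "appears to be Gaussian" could mean
in practice, here is a minimal sketch of one common screening test, the
Jarque-Bera statistic, which compares the sample's skewness and kurtosis
with those of a normal distribution.  The function name and the two
example data sets are made up for illustration, and the p-value relies
on an asymptotic chi-square approximation that is poor for small
samples.  Failing to reject normality is not proof of it, which is
Ben's point above.

```python
import math
from statistics import NormalDist


def jarque_bera(samples):
    """Return (JB statistic, approximate p-value) for a list of numbers."""
    n = len(samples)
    if n < 3:
        raise ValueError("need at least 3 samples")
    mean = sum(samples) / n
    # Central moments in population form (divide by n).
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    if m2 == 0:
        raise ValueError("all samples are identical")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Hypothetical example data: evenly spaced normal quantiles, and a flat ramp.
gaussian_like = [NormalDist().inv_cdf((i + 0.5) / 1000) for i in range(1000)]
flat = [i / 999 for i in range(1000)]
```

A small p-value (as `jarque_bera(flat)` gives) says the data do not look
Gaussian; a large one only says this particular test found nothing to
object to.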

> What you are saying when you say that you will test for randomness is
> that you will test to see if the data is meaningful in any known
> context.  Do you know all possible contexts?  Will you live long enough
> to test for all of them?  What happens when the data is meaningful in
> more than one context?

    I can't possibly imagine all conceivable or theoretical contexts,
and I can imagine too many to try.  I am looking for an analytic
process that is more efficient than enumerating all the context
tests I can imagine.  If multiple context tests yield "reasonable"
representations, I might just have to flip a coin or allow for all
interpretations.
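
    One way to avoid blind enumeration is to run a few cheap statistics
first and let them decide which context tests are worth trying.  The
sketch below is only an assumption about what such a first pass might
look like: it computes the Shannon entropy of the byte histogram and the
fraction of printable ASCII bytes.  The function names, labels, and
thresholds (0.95 and 7.5) are arbitrary choices for illustration, not an
established method.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def printable_fraction(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, LF, or CR."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data)


def screen(data: bytes) -> str:
    """Return a rough hint for ordering later tests (thresholds are guesses)."""
    if printable_fraction(data) > 0.95:
        return "text-like"
    if byte_entropy(data) > 7.5:
        return "high-entropy"  # compressed, encrypted, or noise
    return "structured binary"
```

A "text-like" result suggests trying character-level tests first; a
"high-entropy" result suggests that simple filters will find little, and
it says nothing about whether information is present.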

    I never said that the data has no context!  I simply said that
I don't know a priori what its context is.  It *is* the case that
data points can be analysed in the absence of knowledge of the
structure of the function which produced them.  The object is to
detect patterns, if possible, and search for "meaningful"
interpretations.
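
    As one example of detecting a pattern without knowing the function
that produced the points, the sketch below computes the sample
autocorrelation of a numeric sequence and reports the lag at which it is
strongest.  A pronounced peak hints at periodic structure (a record
length, a scan line width, a sampling cycle).  The function names are
hypothetical, and this is just one of many possible pattern tests.

```python
def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given positive lag."""
    n = len(xs)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(xs)")
    mean = sum(xs) / n
    total = sum((x - mean) ** 2 for x in xs)
    if total == 0:
        raise ValueError("sequence is constant")
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / total


def strongest_lag(xs, max_lag):
    """Return the lag in 1..max_lag with the largest absolute autocorrelation."""
    return max(range(1, max_lag + 1),
               key=lambda k: abs(autocorrelation(xs, k)))


# Hypothetical example: a sequence that repeats every 4 values.
periodic = [0, 1, 2, 3] * 25
```

For the `periodic` example the strongest lag is the repeat length; on
real data a peak is only a hint about where to look next.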

    Some of the discussion of this subject sounds like the participants
are frustrated by these two facts:

1.  I *won't* live long enough to apply every possible context
test (discovery by enumeration).

and

2.  They don't know of any more efficient methodology than discovery
by enumeration, ergo the problem is silly or a waste of time.


-- 
       Dave Stoffel (703) 790-5357
       seismo!mimsy!dave
       dave@Mimsy.umd.edu
       Amber Research Group, Inc.