Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!emory!gatech!psuvax1!eds1!wa3wbu!compnect!johncore
From: johncore@compnect.UUCP (John Core )
Newsgroups: comp.lang.c
Subject: Re: Hash??? Not quite clear on what this is...
Summary: just a hash war story
Message-ID: <885@compnect.UUCP>
Date: 29 Nov 90 12:07:50 GMT
References: <17291@netcom.UUCP> <1990Nov22.235522.26487@ux1.cso.uiuc.edu>
Organization: John Core at home, Harrisburg,PA
Lines: 87

In article <1990Nov22.235522.26487@ux1.cso.uiuc.edu>, gordon@osiris.cso.uiuc.edu (John Gordon) writes:
> avery@netcom.UUCP (Avery Colter) writes:
> 
> >I'm seeing all this talk of "Hash Functions".
> 
> >Are you talking about parsing? What is being defined as "hashing"?
> 
> 	Oh, goody.  We talked about hashing the last 2 weeks in class, so I
> get to explain it.
> 
> 	Hashing is: a method of storing items in an array such that they can
> be quickly recalled when needed.  This is possible because hashing performs a
> transmogrification (:-)) on the item and comes up with an array address, so
> you can go get it directly without having to search for it.  
> 
> "Gee, ralph, how many flames do ya think THIS post will get??"

This is not a flame! (well maybe it is.)

Last week I was talking to a prof at the local college, he is teaching a 'C'
course and had a textbook on 'c'. They spent 2 chapters on sorts, all of
them on ram memory sorts. NO file sorts. This is why I will never worry
about losing my job as systems analyst to a bright new wizard out of college.
College programming courses seem to never deal with the REAL (ie buisness)
world. As per the sorts, no-buisness (except maybe the local Macdonalds)
has data bases that fit in ram memory. Your reply indicated hashing
arrays in ram. Even a VAX 9000 with a gig of ram can not in memory sort
a decent buisness database, whether it be invoices, po's , or a mailing
list, unless of course you talk the system administrator into taking
the machine into single user mode to run your application. 

my reply to the original poster:

*>a hash is a unique numerical value for a string between a range of
*>preset values. it can be used for record indexing. On LARGE databases
*>we would create a hash value (between 1 and our max records) of a 
*>key and stick the key at that position in a header file with a pointer
*>to the actual record position of the data file. Programmers are always
*>searching for better hash routines since there is always a collision 
*>possibility (two different keys hashing to the same value)

then you said:
>	2) Many times you do not know the data set beforehand.
>	3) hash functions never use something so simple as strlen().  They
>	   break the item down into bits and twiddle them to produce the
>	   array address.


Actually you have to know a lot, like the key's max length
and how many total records will be in your data base, since you have
to give the hash function a range to deal with. Example:
years back I worked for a firm that wrote the software for General Signals's
electric motor factories. We had to manage a database that stored over
200000 items. It consisted of indevidual parts for various electric motors,
the motors the parts were associated with, supplier, rate of consumption,
lag time from order to deliver and price history. We played it safe and 
presumed the base would never go over 300000 items. We would take an
items search key and hash it to a number between 1 and 360000. The
extra 60000 was the collision overflow. We would then create an index
file that contained the primary key, secondary keys, and a pointer
to the actual record's file position in the data base. If part 
'AB-DDD-104-20-5A' hashed to 7 it's key and actual file position 
(a new item went to the bottom of the master file) was written to
record 7 of the index file. If another key hashed to 7 it was written
to an overflow (collision) area. As the data base grows you have to
watch the collision area. If you have a lot of collisions, you have
to adjust your hashing scheems, and the max records (high hash value)
of your hash index. In our hash calculations we seeded the hash 
equasion with a prime number that was equivelant to 80% of the total
number of records. The prime was kept in a seperate file so we could
change the hash with-out recompiling. I was sent one day to GS's Wisconsin
plant (there software had been running great for 2 years) on the complaint
that the system was really SLOW. Taking up to 5 minutes to retrieve a
record. The cause (it was the last place I looked) , one of our 'new' boys 
and been tuning the hash and used a non prime number. 50% of new entries
were going into collision. Took 2 days to rebuild the index.

--just another old fart telling a war story...


Wizard Systems              |    UUCP:   uunet!wa3wbu!compnect!johncore
P.O. Box 6269               |INTERNET:   johncore@compnect.wa3wbu
Harrisburg, Pa. 17112-6269  |a public bbs since 1978. Data(717)657-4992 & 4997
John Core, SYSOP            |-------------------------------------------------
----------------------------| No matter where you go, there you are!
a woman is just a woman, but a good cigar is a smoke.   -R. Kipling