Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!emory!gatech!psuvax1!eds1!wa3wbu!compnect!johncore From: johncore@compnect.UUCP (John Core ) Newsgroups: comp.lang.c Subject: Re: Hash??? Not quite clear on what this is... Summary: just a hash war story Message-ID: <885@compnect.UUCP> Date: 29 Nov 90 12:07:50 GMT References: <17291@netcom.UUCP> <1990Nov22.235522.26487@ux1.cso.uiuc.edu> Organization: John Core at home, Harrisburg,PA Lines: 87 In article <1990Nov22.235522.26487@ux1.cso.uiuc.edu>, gordon@osiris.cso.uiuc.edu (John Gordon) writes: > avery@netcom.UUCP (Avery Colter) writes: > > >I'm seeing all this talk of "Hash Functions". > > >Are you talking about parsing? What is being defined as "hashing"? > > Oh, goody. We talked about hashing the last 2 weeks in class, so I > get to explain it. > > Hashing is: a method of storing items in an array such that they can > be quickly recalled when needed. This is possible because hashing performs a > transmogrification (:-)) on the item and comes up with an array address, so > you can go get it directly without having to search for it. > > "Gee, ralph, how many flames do ya think THIS post will get??" This is not a flame! (well maybe it is.) Last week I was talking to a prof at the local college, he is teaching a 'C' course and had a textbook on 'c'. They spent 2 chapters on sorts, all of them on ram memory sorts. NO file sorts. This is why I will never worry about losing my job as systems analyst to a bright new wizard out of college. College programming courses seem to never deal with the REAL (ie buisness) world. As per the sorts, no-buisness (except maybe the local Macdonalds) has data bases that fit in ram memory. Your reply indicated hashing arrays in ram. Even a VAX 9000 with a gig of ram can not in memory sort a decent buisness database, whether it be invoices, po's , or a mailing list, unless of course you talk the system administrator into taking the machine into single user mode to run your application. my reply to the original poster: *>a hash is a unique numerical value for a string between a range of *>preset values. it can be used for record indexing. On LARGE databases *>we would create a hash value (between 1 and our max records) of a *>key and stick the key at that position in a header file with a pointer *>to the actual record position of the data file. Programmers are always *>searching for better hash routines since there is always a collision *>possibility (two different keys hashing to the same value) then you said: > 2) Many times you do not know the data set beforehand. > 3) hash functions never use something so simple as strlen(). They > break the item down into bits and twiddle them to produce the > array address. Actually you have to know a lot, like the key's max length and how many total records will be in your data base, since you have to give the hash function a range to deal with. Example: years back I worked for a firm that wrote the software for General Signals's electric motor factories. We had to manage a database that stored over 200000 items. It consisted of indevidual parts for various electric motors, the motors the parts were associated with, supplier, rate of consumption, lag time from order to deliver and price history. We played it safe and presumed the base would never go over 300000 items. We would take an items search key and hash it to a number between 1 and 360000. The extra 60000 was the collision overflow. We would then create an index file that contained the primary key, secondary keys, and a pointer to the actual record's file position in the data base. If part 'AB-DDD-104-20-5A' hashed to 7 it's key and actual file position (a new item went to the bottom of the master file) was written to record 7 of the index file. If another key hashed to 7 it was written to an overflow (collision) area. As the data base grows you have to watch the collision area. If you have a lot of collisions, you have to adjust your hashing scheems, and the max records (high hash value) of your hash index. In our hash calculations we seeded the hash equasion with a prime number that was equivelant to 80% of the total number of records. The prime was kept in a seperate file so we could change the hash with-out recompiling. I was sent one day to GS's Wisconsin plant (there software had been running great for 2 years) on the complaint that the system was really SLOW. Taking up to 5 minutes to retrieve a record. The cause (it was the last place I looked) , one of our 'new' boys and been tuning the hash and used a non prime number. 50% of new entries were going into collision. Took 2 days to rebuild the index. --just another old fart telling a war story... Wizard Systems | UUCP: uunet!wa3wbu!compnect!johncore P.O. Box 6269 |INTERNET: johncore@compnect.wa3wbu Harrisburg, Pa. 17112-6269 |a public bbs since 1978. Data(717)657-4992 & 4997 John Core, SYSOP |------------------------------------------------- ----------------------------| No matter where you go, there you are! a woman is just a woman, but a good cigar is a smoke. -R. Kipling