Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!plaid!chuq From: chuq@plaid.Sun.COM (Chuq Von Rospach) Newsgroups: news.software.b Subject: Re: news software speedup Message-ID: <46774@sun.uucp> Date: 24 Mar 88 06:35:30 GMT References: <649@bms-at.UUCP> <10150@ncc.UUCP> <546@fig.bbn.com> Sender: news@sun.uucp Reply-To: chuq@sun.UUCP (Chuq Von Rospach) Organization: Fictional Reality Lines: 73 >It can get you some speed benefits, but makes updates (e.g., expire, >cancelled articles, new articles) much harder, and some things much >slower. Currently, to add a new article, you need merely lock the active >file for a moment to bump a number. This way you'd have to lock two big >files for a significant amount of time. At one point, I circulated a preliminary design that did something similar. Basically, you took the batch file, and instead of splitting it up, you simply stuck filename and lseek pointers into the history file for each article. No unpacking overhead at all, and significantly reduced filesystem and system call overhead. Expiration is fairly trivial, too. A batchfile gets over a certain age, you zap it and update the pointers. It got shot down then for a number of reasons, generally good ones. o You lose the "Expires:" header. Stuff that is supposed to stay longer can't. o You lose adjustable expirations. You can't expire talk.* faster, because it's all stuck in with everything else. Not useful for small disk systems (and this is back when usenet traffic was really heavy, like almost 8 megabytes a month!) o It isn't clean for locally posted or non-batched articles. At the simplest layer, they're simply batches with single articles. But if you've got lots of local posting or non-batched articles floating around, the system degenerates into a setup WORSE than the current system, because the tree is completely flat. ooph. One possibility is to create two files for every newsgroup, a pointer file, an index file, and a data file. The pointer file has three pieces of data: the message number, the lseek pointer and the Subject line (maybe a fourth, the poster). You add new entries to the bottom of the file, once a day a program runs that expires articles by zapping the data and rewriting the pointer file with the updated pointers. The fun part is the data. You want to avoid doing a copy of the data file each night. So rather than brute force, you get fancy. Since the file is split up into known size blocks, you can simply delete an article in place and mark the block free. When a new article comes in, it looks for a first fit (or best fit, I don't care) in the free space and writes itself into the block, using the index file to look for free space without having to linearly search the data file (perhaps the index file could be put in the data file in the block header info....). If there's no space, append it to the end and increase the size of the file. This would require a program that would allow an admin to scrunch the file as well, to zap holes and fragmentation. But unless you get into a space crunch, you shouldn't have to run it. Thinking about it, this has another interesting advantage. In theory, you could get rid of the expiration phase completely. Instead, set a maximum size for each newsgroup. When a new message comes in, if if doesn't fit in the file, the installation program goes to the oldest message and deletes it, and continues to delete messages until it does fit. This requires a more sophisticate indexing scheme than above, because you'd likely want to get into garbage collection and compacting fragmented space. But ALL of the maintenance of the files would be done at article creation time, with no midnight processing required. Sounds like a neat masters thesis, if you ask me. comments? Chuq Von Rospach chuq@sun.COM Delphi: CHUQ Speed it up. Keep it Simple. Ship it on time. -- Bill Atkinson