Path: utzoo!utgpu!attcan!uunet!lll-winken!ncis.llnl.gov!helios.ee.lbl.gov!pasteur!agate!bionet!csd4.milw.wisc.edu!mailrus!purdue!mentor.cc.purdue.edu!mace.cc.purdue.edu!dls
From: dls@mace.cc.purdue.edu (David L Stevens)
Newsgroups: comp.os.xinu
Subject: Xinu Internet Gateway: a report
Keywords: Internet, Gateway, IP, ICMP, RIP
Message-ID: <1541@mace.cc.purdue.edu>
Date: 22 Jan 89 05:07:23 GMT
Organization: PUCC UNIX Group
Lines: 190

The following is a description of some Xinu work done as a semester project in Doug Comer's graduate level course on Internetworking here at Purdue. We divided the class into groups of 1 or 2 people and built Internet Gateways on top of the Sun version of Xinu V7. This describes the experiences, successes and failures of one of the nine groups. Much of the work is the result of ideas or suggestions from Rajendra Yavatkar, Jim Griffoen, Doug Comer and through discussion with other members of the class.

Because of the size of this project and the limited bandwidth offered by newsgroups, I've left out many details in this description so that I might complete it with a single posting. Feel free to contact me via e-mail for particulars; I'll respond to all queries as time permits.

Othernets on Ether

In order to implement a large number of gateways without actually requiring several dedicated machines and multiple physical networks, we took the approach of simulating a fictional network technology, called "Othernet," on top of physical Ethernet hardware. To keep Ethernet addresses distinct, we used multicast addresses on Ethernet hardware that has programmable address recognition (Sun's standard LANCE Ethernet chip). Each machine would recognize its "real" Ethernet address and some number of Othernet addresses configured at boot/compile time.

Comer gave each group an id (a number) and a simple scheme for computing distinct Othernet hardware and IP addresses based on the group id (and the machine's Ethernet address, for non-broadcast addresses) so that multiple simulations could run on the same (physical) network without interfering.

Othernet Simulation

The Othernets are indistinguishable from real devices in the configuration file. Each has othinit(), othread(), othwrite(), etc. The interrupt function is ethinter(), and one function of othinit() is to associate each simulated Othernet with an Ethernet. It keeps the simulated machine and broadcast addresses in the device control block for the Ethernet.

When a packet arrives, ethinter() demultiplexes it to the appropriate Xinu device by comparing the address in the packet buffer with the address list for the simulated devices. Promiscuous mode (not part of stock V7) is still supported because packets that don't match the simulated addresses go to the ETHER device by default, without an address comparison.

The Othernet device control block, among other things, includes the associated Ethernet device, so it can honor the Ether device's write semaphore.

A Generic Network Interface layer

Multiple interface technologies (real or simulated) bring with them detailed, device-type-dependent information that is not directly relevant to the basic network I/O that IP needs. To hide this information from IP, we added a new layer, the "Network Interface" layer, which includes I/O queues, interface statistics, the maximum transfer unit and IP address information for the purpose of routing. It also includes hardware addresses, stored with size information so that it can accommodate more than just Ethernets.

Each network interface has its own netin() and netout() that do the blocking on reads and writes and queue/dequeue packets bound for IP. They handle other protocols (e.g., ARP and RARP) without an intermediate queue.

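To make the shape of such an interface record concrete, here is a minimal C sketch. It is not the project's code: every name in it (struct netif, struct hwaddr, struct pktq, pq_put(), pq_get(), hw_equal(), NI_MAXHWA, NI_QLEN and the field names) is hypothetical, and the fixed-size ring is only one plausible way to hold packets between an interface and IP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NI_MAXHWA 14            /* hypothetical cap on hardware address bytes */
#define NI_QLEN   8             /* hypothetical per-interface queue depth     */

typedef uint32_t IPaddr;        /* assumption: IPv4 address in host byte order */

/* Hardware address carried with its length, so non-Ethernet links fit too. */
struct hwaddr {
    uint8_t ha_len;
    uint8_t ha_addr[NI_MAXHWA];
};

/* Fixed-size ring of packet pointers between an interface and IP. */
struct pktq {
    void *pq_pkt[NI_QLEN];
    int   pq_head;
    int   pq_count;
};

/* One record per interface: all that IP needs, nothing device specific. */
struct netif {
    IPaddr        ni_ip;        /* interface IP address          */
    IPaddr        ni_mask;      /* subnet mask for routing       */
    unsigned      ni_mtu;       /* maximum transfer unit (bytes) */
    struct hwaddr ni_hwa;       /* our hardware address          */
    struct hwaddr ni_hwb;       /* hardware broadcast address    */
    struct pktq   ni_inq;       /* packets bound for IP          */
    struct pktq   ni_outq;      /* packets bound for the wire    */
    unsigned long ni_ipackets;  /* statistics                    */
    unsigned long ni_opackets;
    unsigned long ni_drops;
};

/* Append a packet; returns 0 on success, -1 if the queue is full. */
int pq_put(struct pktq *q, void *pkt)
{
    assert(pkt != NULL);        /* NULL is reserved to mean "queue empty" */
    if (q->pq_count == NI_QLEN)
        return -1;
    q->pq_pkt[(q->pq_head + q->pq_count) % NI_QLEN] = pkt;
    q->pq_count++;
    return 0;
}

/* Remove and return the oldest packet, or NULL if the queue is empty. */
void *pq_get(struct pktq *q)
{
    void *pkt;

    if (q->pq_count == 0)
        return NULL;
    pkt = q->pq_pkt[q->pq_head];
    q->pq_head = (q->pq_head + 1) % NI_QLEN;
    q->pq_count--;
    return pkt;
}

/* Compare two hardware addresses of any (possibly different) length. */
int hw_equal(const struct hwaddr *a, const struct hwaddr *b)
{
    return a->ha_len == b->ha_len &&
           memcmp(a->ha_addr, b->ha_addr, a->ha_len) == 0;
}
```

Because the length travels with the address, code above this layer can compare or copy hardware addresses without knowing which link technology sits underneath.
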
The IP process acts as a central switch with queues to/from the netin() and netout() processes, so that it may process packets from multiple interfaces without blocking on any one. The functions putp() and getp() handle these queues, among other things.

Note that local packets are routed via IP just like outbound packets. The netout() function for this "local interface" looks up the appropriate UDP port or ICMP id and delivers the packet to the correct process. It handles ICMP functions that don't interact with local processes directly.

A Routing Table

We define routing table entries as follows:

    struct route {
        IPaddr  rt_net;         /* net for this route   */
        IPaddr  rt_mask;        /* mask for this net    */
        IPaddr  rt_gw;          /* next IP hop          */
        short   rt_metric;      /* distance metric      */
        short   rt_intf;        /* interface number     */
        short   rt_key;         /* sort key             */
        short   rt_ttl;         /* time to live         */
        struct route *rt_next;  /* next for this hash   */
        /* stats */
        int     rt_refcnt;      /* current ref count    */
        int     rt_usecnt;      /* use total            */
    };

The routing table is globally locked during lookups, and the reference count prevents newly deleted routes from being freed before all references disappear. Every route has a timer associated with it, with a special value "RT_INF" to mean "don't expire this route" (though we never needed it, since we used RIP and no static routes).

The lookup mechanism is a hash based on the low order portions of the IP network number (not including host or subnet). We order entries with the same hash by the number of bits in the mask, so that matches occur on the most specific (i.e., host) routes first and successively towards the least specific routes (standard IP net routes). Although subnet masks are not defined to follow this ordering, multiple matches on subnets of that sort are ambiguous, and in practice the problem doesn't occur. These semantics also allow for easy implementation of a subnet hierarchy, rather than just mutually exclusive subnet numbering.

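The hash-then-most-specific-first lookup described above can be sketched in C as follows. This is an illustration under stated assumptions, not the project's getrt(): the names rtadd(), rtlookup(), rthash(), netnum(), maskbits() and RT_TSIZE are hypothetical, the struct is trimmed to the fields the sketch uses, addresses are assumed to be in host byte order, and rt_key is assumed to hold the count of 1 bits in the mask. Locking, reference counts and timers are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t IPaddr;        /* assumption: host byte order */
#define RT_TSIZE 16             /* hypothetical hash table size */

struct route {                  /* trimmed to what this sketch needs */
    IPaddr        rt_net;
    IPaddr        rt_mask;
    short         rt_key;       /* assumed: number of 1 bits in rt_mask */
    struct route *rt_next;
};

static struct route *rttable[RT_TSIZE];

/* Classful network number, right aligned (no subnet or host bits). */
static unsigned netnum(IPaddr a)
{
    if ((a & 0x80000000u) == 0)
        return a >> 24;                     /* class A */
    if ((a & 0xC0000000u) == 0x80000000u)
        return a >> 16;                     /* class B */
    return a >> 8;                          /* class C; others lumped in */
}

/* Hash on the low order part of the network number. */
static unsigned rthash(IPaddr a)
{
    return netnum(a) % RT_TSIZE;
}

static short maskbits(IPaddr m)
{
    short n = 0;

    while (m != 0) {
        n += (short)(m & 1u);
        m >>= 1;
    }
    return n;
}

/* Link a route into its hash chain, longest mask first. */
void rtadd(struct route *r)
{
    struct route **pp = &rttable[rthash(r->rt_net)];

    assert((r->rt_net & ~r->rt_mask) == 0); /* no bits outside the mask */
    r->rt_key = maskbits(r->rt_mask);
    while (*pp != NULL && (*pp)->rt_key >= r->rt_key)
        pp = &(*pp)->rt_next;
    r->rt_next = *pp;
    *pp = r;
}

/* First match on the chain is the most specific one; NULL if none. */
struct route *rtlookup(IPaddr dst)
{
    struct route *r;

    for (r = rttable[rthash(dst)]; r != NULL; r = r->rt_next)
        if ((dst & r->rt_mask) == r->rt_net)
            return r;
    return NULL;
}
```

Because the hash ignores subnet and host bits, a host route, a subnet route and the net route for one network all land on the same chain, so a single ordered walk is enough to find the most specific match.
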
If the above scheme fails to find a matching route, the getrt() function returns a distinguished default route (network 0.0.0.0), if one was set.

One shortcoming of this implementation is that routes have a single metric. A better solution would be to have (potentially) multiple (protocol, metric) pairs as outlined in MIB (RFC 1066). Thus, e.g., EGP and RIP metrics would be distinct. As it is, we only implement RIP, so the conflict does not occur (yet).

The IP Process

The IP process is a simple loop that gets a packet off of one of the interface queues (round robin via function "getp()") and routes the packet to an appropriate interface based on the routing table. Some additional complications come from checks for identical IN/OUT interfaces (for ICMP redirects). It is also here that we do directed broadcast interpretation, so that the gateway both receives a directed broadcast (acting as a host) and forwards it to other hosts on the common network. Finally, the IP process computes new time to live and checksum fields for all routed packets and stamps the source IP address for locally generated packets.

The function putp() actually queues the packet on the selected interface's output queue for delivery.

IP Reassembly

For IP reassembly queues, we use a set of generic priority queue functions (also used in the routing table) to order the received fragments by offset. We add fragment packet buffers to the queue of fragments with the same IP address and IP id. When all are present, we allocate a buffer from a special "large buffer pool" and copy the fragments in. After IP header adjustments on the newly created datagram, we return the completed datagram to the higher layers.

We set a timer for each fragment queue and reset it for each fragment we add to the queue. If it expires, we free the fragments and generate the ICMP fragment timed out message.

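The ordering-by-offset and "all present" test described above might look roughly like the following C sketch. It is illustrative only and not the project's code: the names struct frag, struct fragq, fq_insert(), fq_complete() and fq_copy() are hypothetical, offsets are assumed to be already converted to bytes, the caller's buffer stands in for the large buffer pool, and the per-queue timer and IP header fix-ups are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct frag {
    unsigned             f_off;   /* byte offset of the data (IP field * 8) */
    unsigned             f_len;   /* number of data bytes                   */
    int                  f_more;  /* "more fragments" bit from the header   */
    const unsigned char *f_data;  /* fragment payload                       */
    struct frag         *f_next;
};

struct fragq {                    /* one queue per (IP address, IP id) */
    struct frag *fq_head;
};

/* Insert a fragment, keeping the queue sorted by increasing offset. */
void fq_insert(struct fragq *q, struct frag *f)
{
    struct frag **pp = &q->fq_head;

    assert(f->f_off % 8 == 0);    /* IP offsets are multiples of 8 bytes */
    while (*pp != NULL && (*pp)->f_off <= f->f_off)
        pp = &(*pp)->f_next;
    f->f_next = *pp;
    *pp = f;
}

/*
 * Return the datagram's data length if every byte from 0 up to the end
 * of the final fragment is covered, or 0 if something is still missing.
 */
unsigned fq_complete(const struct fragq *q)
{
    const struct frag *f, *last = NULL;
    unsigned next = 0;            /* first byte not yet covered */

    for (f = q->fq_head; f != NULL; f = f->f_next) {
        if (f->f_off > next)
            return 0;             /* hole before this fragment */
        if (f->f_off + f->f_len > next)
            next = f->f_off + f->f_len;
        last = f;
    }
    if (last == NULL || last->f_more)
        return 0;                 /* final fragment not seen yet */
    return next;
}

/* Copy a complete queue into buf; returns bytes copied, or 0 on failure. */
unsigned fq_copy(const struct fragq *q, unsigned char *buf, unsigned buflen)
{
    const struct frag *f;
    unsigned total = fq_complete(q);

    if (total == 0 || total > buflen)
        return 0;
    for (f = q->fq_head; f != NULL; f = f->f_next)
        memcpy(buf + f->f_off, f->f_data, f->f_len);
    return total;
}
```

Keeping the queue sorted makes the completeness test a single pass: any gap shows up as a fragment whose offset lies beyond the bytes covered so far.
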
If any errors occur on a partially completed reassembly queue, we free all of the fragments on the queue and leave a "stub" queue that simply collects other inbound fragments for this queue and discards them; eventually, the timer expires and deletes the queue stub.

Reassembly is part of the local interface's netout() process, since only packets bound for the local host are candidates for reassembly.

IP Fragmentation

Fragmentation is one of the functions of putp(). It computes the fragment size based on the maximum transfer unit of the selected interface and the "right shifted by three" semantics of the IP fragment offset header field. It then breaks packets up as needed, duplicates and corrects the IP header and queues each fragment on the interface's output queue.

ICMP

We implement virtually all of ICMP, with the exceptions listed below. We do this as two functions. The icmp_in() function handles ICMP messages bound for the local machine directly. This includes routing table changes (from redirects), process demultiplexing and delivery (e.g., for echo replies) and all of the requests from other hosts (information, mask, etc.).

The icmp() function is how we generate ICMP messages for a remote host or gateway. We use it throughout the network code for generating redirects, error messages, mask requests, etc.

The ICMP functions we do not support are ICMP SRC QUENCH, TIMESTAMP and the SRC ROUTE error for DESTINATION UNREACHABLE.

Hosts and Gateways

We organized the project into two separate directories for building kernel images, both sharing the same sources. Where appropriate, we used ifdefs or ifndefs on "GATEWAY" to distinguish hosts and gateways.

The hosts configure only a simulated Othernet and otherwise act in isolation. All routing information, mask information, time setting, host name translations, etc., come from the gateway or from packets routed through it. All hosts run the same image.

We select the assigned Internet network number for the simulation at boot (read from the console, after a prompt). The IP host part and Othernet hardware numbers are derived from the hardware addresses, so no special care is needed to ensure they are all unique.

The gateway boots with some knowledge already, including the subnet masks for the interfaces. Everything else it can, including routing information, it acquires dynamically.

RIP

We implement active and passive RIP, Split Horizon and correct interpretation of RIP Infinity. We also translate subnet routes to net routes when broadcasting on non-subnetted networks.

We implement all that current popular RIP implementations do, but we do not support full RFC 1058 RIP. In particular, we do not support Poison Reverse and Triggered Updates.

Unfinished Business

We began an implementation of SGMP. The data structures and basic variable get/set functions are in place, but with no support for network transactions. We also have hooks, but no more, for EGP.

In Closing

Though ambitious in scale, this project provided direct experience with the problems, often subtle, that arise in Internetwork engineering. By actually building a working Internet gateway, we gained experience that would not be possible with classroom lecture alone.

--
+-DLS (dls@mace.cc.purdue.edu)