Description of some internals details of K_TTY

Warning

This file is intended for developers' reference. It probably makes no sense if you don't understand the keyman language, and only little sense to non-programmers. Latter parts are probably best read with a copy of the source code open in another window. I've tried to put the most general stuff at the start, so once it's just too much gibberish, give up and you shouldn't have missed much else that you're likely to understand.

Caching

Certain keyboard definitions make extensive use of the any() command in such a way that many CPU cycles could be saved by the interpreter remembering if the typed key (e.g. "Z") is part of store(foo). E.g.

any(A) + any(foo) > context index(bar,2)
any(B) + any(foo) > index(bar,2) context 
any(C) + any(foo) > context index(bar,2)

Thus the decision was made to store the fact that an any() lookup had occurred on that history item and store id, and what the result (true/false) was. It was later realised that for certain keyboards this was entirely pointless. E.g. consider a keyboard of form:

any(foo) "X" + "Y" > index(fooOUT,1)
any(foo) "X" + "Z" > index(fooOUT,1) "Z"
any(bar) "X" + "Y" > index(barOUT,1)

In this keyboard there is no posibility that any values that might be stored in a cache might be useful, but the caches are taking up valuable kernel memory and cache lookups and updates take cpu cycles. E.g. a keyboard that has 32 stores (some known keyboards have double this) will require 128*32*2/8 bytes, or 1k of additional kernel storage per tty in use, just for the cache. While this may seem a trivial amount on a modern machine, it all adds up when you have a dozen virtual consoles, a few xterms and some more users logged in over the net. Swapping is by no means a thing of the past and the average user does not want to wait for too many page faults every time they press a key.

Too many file formats

Two file output mode (deprecated)

Once everything is compiled, the user has 3 sorts of files to deal with. If the user has kept to the (current) default naming scheme, these are:

Keyboard definition (.kmn) files:: These files hold the keyboard definition in human-readable format.
Support (.supp) files:: These files are also human readable and contain information that the keyboard loader (kttyload) uses to convince itself and the kernel that the .out file is a valid keyboard. In loading the keyboard, the information in this file is loaded into a struct which also has a pointer to the byte code. When switching keyboards the pointer is NULL, but otherwise the struct is filled in the same way. This means that the keyboard loader needs the .supp in order to switch keyboards.
Compiler output (.out) files:: These files contain the binary data which the kernel interprets to change what you type into what we hope you want. They contain nothing but that byte code.

Thus we currently have the messy situation of one input file to the compiler and two output files, both of which are needed to load a keyboard (assuming you didn't just tell the compiler to load the keyboard into the kernel). Keeping these in sync is also an issue, though only becomes a problem if someone accidently scrubbs the .out file and the make rule is checking for the .supp file.

Single file output mode

A number of alternative methods of doing things (rather than the "two-output-files-per-input-file" described above) were briefly considered, but eventually the following thoughts came to have dominance:

Keeping the two files in sync becomes a nightmare when one of the files is not in the current directory.
Human readable is nice for dubugging, but not really necessary.
We can give users access to things via a correctly written "magic" file.
The "human friendly" version may sometime/someplace tempt people to do dangerous things (e.g. change the version numbers).

Thus the decision has been made to switch towards a single binary file format.

The single file format (foo.ktty) includes an 8 byte header "K_TTY b\n" (the "b" standing for binary), the keyboard name (up to 32 bytes), a file format number, the sanity-checking information included in the support file (except as binary data). and the keyboard data itself.

Group data format

In the byte-code, groups of commands in the original input file are preserved, albeit rearranged into the propper search order, and two methods of storing them are permitted by the kernel. It is intended that the compiler should make an inteligent choice regarding how best to store each group.

Fixed length.

In a fixed length group, each rule (a single line in the input) is zero-padded to a given length, named seqlen. This costs us

(max_rule_length - avg_rule_length)*number_of_rules

in "wasted space", and is obviously ideal where all the rules compile to the same length.

Variable length.

In a variable length group, each rule is preceeded by its length (unsigned short). This costs us

sizeof(unsigned short) * number of rules

in wasted space, and also requires a few more CPU cycles on each rule as we need to read the length of the next rule as we go through. It is the obvious choice when we have a some very long rules and the rest are quite short.

Binary data format

The binary data follows the following format (as described in k_tty.h):

Pairs of (unsigned int) offsets to groups, and values of seqlen, (seqlen & ff000000 stores flags for the individual group);
a single (int) 0
initial keyboard flags (may include start-up flags specified by the user and also specifies things like if the cache should be used); (unsigned int)
stored strings (aka stores) (standard zero terminated values). The number of these is not limited, but the kernel keeps a record where the the first MAX_K_STR (currently 255) of them are. These should ideally be sorted such that caching may be optimal, but are currently sorted by (decreasing) length order, which we hope will not be too bad an approximation;
an extra (char) zero;
one or more group(s) of (input, output) rule pairs, each group is terminated by an optional match or nomatch rule and an END_TBL rule (0); The starts of consecutive rules are separated by precisely seqlen bytes. The rules are sorted by decending length and order in the input file. If seqlen (excluding any flags) for a group is zero then we have variable seqlens, and the first sizeof(SEQLNTYPE) bytes in each rule pair hold the length of this rule, excluding the sizeof(SEQLNTYPE). Such groups should also have the group flag F_GRP_VARSEQ set, though if they don't then the kernel fixes this.
an extra END_TBL rule (i.e a null table).

        

Limitations: The block must be small enough that kmalloc can grab
a block that size, and must not be longer than the max value an
unsigned int can hold (since this is 4Gb on a x86, this should be plenty). 
A single compiled `line' of a match rule cannot add up to more than
the length storable in the residual bits of a SEQLNTYPE & F_GRP_SEQLEN.
(16777215 with an int and F_GRP_SEQLEN=0x00FFFFFF) 
There may not be more than 65535 (MAX_SHORT) groups or stores 


There will be a performance penalty if you use more stores
than MAX_K_STR, but I'm not sure how large. Probably negligable,
unless you use late string values a lot. Use of more stores than
MAX_CACHE in any() tests will probably cause a big slowdown, since for each
any() lookup that needs to be done it'll have to hunt through the
(uncached) stores to find the trailing zeros. A large MAX_CACHE 
will eat RAM. On the otherhand, its probably all probably 
negligable on a modern machine anyway.


Error counting 
Whenever the interpreter finds an unexpected command it increments an
error counter (on a per-tty basis) and writes an error message. Such errors
should not exist, and almost certainly represent a programming error in the
compiler or interpreter. However it may be that the keyboard remains
usable, except for a rare occurrence, and it may be more confusing for the
user to be dropped out of the keyboard than not. Thus the interpreter keeps
count of the errors and only "drops" the keyboard when a certain number of
errors have been reached (currently 10).