At first, we hoped to use only the first and second letters in the outline, such as "H: Social Sciences" and "HG: Finance". However, it soon became apparent that this would not provide sufficient detail. For example, "BF" is described as "Psychology, Parapsychology, Occult Sciences", which puts too many disparate topics together in a single node. Computer Science is buried five levels down in the tree, beneath "QA: Mathematics"; in fact, it is deeper in the tree than "Slide Rules", which is only four levels down. Clearly we needed, at a minimum, the nodes listed in the full LCC Outline.
We located a few versions of the LCC Outline on the Web. These were in a flat format, with one page for each of the 21 major LCC categories. After parsing these pages to remove the HTML and correcting misspellings and numerical range errors, we were left with 21 lists of the following form:
|A General Works|
|AC 1-999 Collections, Series, Collected Works|
|1-195 Collections of Monographs, Essays, etc.|
|801-895 Inaugural and Program Dissertations|
|901-995 Pamphlet Collections|
|AE 1-90 Encyclopedias (General)|
|. . .|
|B Philosophy, Psychology, Religion|
|B 1-5739 Philosophy (General).|
|. . .|
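As a rough illustration of the list format above, the following sketch splits one cleaned line into an optional class symbol, an optional numerical range, and a caption. It is not the parser we used; the names `OutlineEntry` and `parse_line` are hypothetical, and the sketch assumes that a class symbol is one to three capital letters followed by whitespace and that elided lines appear as ". . .".

```python
import re
from typing import NamedTuple, Optional


class OutlineEntry(NamedTuple):
    """One line of the cleaned outline (hypothetical structure)."""
    letters: Optional[str]  # class symbol such as "A" or "AC", if present
    low: Optional[int]      # start of the numerical range, if present
    high: Optional[int]     # end of the numerical range, if present
    caption: str            # descriptive text


# Assumption: optional 1-3 capital letters, then an optional "low-high"
# range (or a single number), then the caption.
_LINE = re.compile(
    r"^(?:(?P<letters>[A-Z]{1,3})\s+)?"
    r"(?:(?P<low>\d+)(?:-(?P<high>\d+))?\s+)?"
    r"(?P<caption>.+)$"
)


def parse_line(line: str) -> Optional[OutlineEntry]:
    """Parse one outline line; return None for blank or elided lines."""
    text = line.strip().strip("|").strip()
    if not text or text == ". . .":
        return None
    m = _LINE.match(text)
    if m is None:
        return None
    low = int(m.group("low")) if m.group("low") else None
    # A single number is treated as a range of length one.
    high = int(m.group("high")) if m.group("high") else low
    return OutlineEntry(m.group("letters"), low, high, m.group("caption"))
```

With these assumptions, `parse_line("|AC 1-999 Collections, Series, Collected Works|")` yields the symbol "AC", the range 1 to 999, and the caption text, while a line such as "|1-195 Collections of Monographs, Essays, etc.|" yields no symbol, so it must inherit one from the nearest preceding lettered line. A caption that itself begins with a short all-capital word would be misread as a symbol, which is one reason the output still needs manual checking.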
While we were able to work around problems with the layout of this data and, in particular, to decipher the nesting structure of the letters and numerical ranges, other problems were not as straightforward. Most of the LCC places classification symbols with two alphabetic characters beneath those with one alphabetic character in the tree; in the "K" section this extends to three-letter symbols beneath two-letter ones. For example, in the above list, "AC" is a child of "A". Similarly, "KGC" is a child of "KG". However, the exceptions to this rule make automatic parsing of the data difficult. For example, "DAW" is a child of "D", placed between "DA" and "DB". "E" has no other letters beneath it at all, just numbers. Most top-level categories repeat their one-letter symbol as their first child, so that, for example, "R: Medicine (General)" is a child of "R: Medicine". The "K"s were by far the worst: both "KK" and "KKA" are children of "KJ". Also troublesome is that several nodes in the tree were completely unlabeled and left out, so that a parent node and its grandchildren exist without the intermediate level being explicitly noted. In order to impose a reasonably rigorous tree structure, all these situations had to be discovered and corrected, which meant correcting not only the data but also the classification scheme itself.
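The general rule and its exceptions can be sketched as a default "drop the last letter" rule combined with a hand-maintained override table. This is a minimal illustration, not our actual procedure: the names `PARENT_OVERRIDES`, `parent_of`, and `build_children` are hypothetical, and the override table holds only the exceptions mentioned above, whereas a complete table would be considerably longer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Exceptions to the prefix rule, taken from the examples in the text.
# Assumption: a real table would list many more entries, mostly under "K".
PARENT_OVERRIDES: Dict[str, str] = {
    "DAW": "D",   # sits between "DA" and "DB", directly under "D"
    "KK": "KJ",
    "KKA": "KJ",
}


def parent_of(symbol: str) -> Optional[str]:
    """Return the parent class symbol, or None for a top-level category."""
    if symbol in PARENT_OVERRIDES:
        return PARENT_OVERRIDES[symbol]
    if len(symbol) == 1:
        return None
    # Default rule: "AC" is a child of "A", "KGC" is a child of "KG".
    return symbol[:-1]


def build_children(symbols: Iterable[str]) -> Dict[str, List[str]]:
    """Group symbols under their parents, preserving input order."""
    children: Dict[str, List[str]] = defaultdict(list)
    for symbol in symbols:
        parent = parent_of(symbol)
        if parent is not None:
            children[parent].append(symbol)
    return dict(children)
```

For instance, `build_children(["D", "DA", "DAW", "DB"])` places "DA", "DAW", and "DB" side by side under "D", matching the placement described above. The sketch covers only the lettered symbols; the repeated one-letter child (such as "R: Medicine (General)" under "R: Medicine"), the numerical ranges, and the unlabeled intermediate nodes would each need separate handling.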