Resource and collection "quality" is becoming an increasingly important topic for educational digital libraries. Computational models of quality and automated approaches for computing the quality of digital resources are necessary components of next generation cognitive tools aimed at supporting collection curators in making quality decisions. This research identifies and computes metrics for 16 quality indicators (e.g., cognitive authority, resource currency, cost, and advertising) and employs machine-learning techniques to classify resources into different quality bands based on these indicators. Two experiments were conducted to determine if these indicators could be used to accurately classify resources into different quality bands and to determine which indicators positively or negatively influenced resource classification. The results suggest that resources can be automatically classified into quality bands, and that focusing on a subset of the identified indicators can increase classification accuracy.
In recent years, "quality" has emerged as a dominant, yet poorly understood, concern within national educational digital library efforts such as the National Science Digital Library (www.NSDL.org) and the Digital Library for Earth System Education (www.DLESE.org). Educational digital libraries are deeply concerned with quality for several reasons. First, quality resources and collections are expected to be the hallmark of National Science Foundation (NSF) funded efforts, by both library users (such as teachers) and library sponsors (NSF). Additionally, prior research indicates that the perceived quality of digital library resources and collections is an important factor influencing library use and adoption in formal classroom settings. This suggests that understanding quality per se, and how to develop and manage quality collections, is a critical and growing issue as educational digital libraries mature. As such, these library initiatives are devoting significant resources to establish policies and procedures to support developing, accessioning, and curating quality resources and collections [16, 17, 20].
Within these national educational digital library efforts, library developers engaged in resource selection and collection curation processes are increasingly being tasked with designing and managing collections to reflect specific library policies and goals aimed at promoting quality. Concerns about the quality of library resources often revolve around issues of accuracy of content, appropriateness to intended audience, effective design and information presentation, and completeness of associated documentation or metadata descriptions. As such, quality evaluations require making difficult, complex, time-consuming, and variable human judgments to assess whether resources belong in particular collections or libraries. These judgments are influenced by a variety of factors, for example the information present in the resource, structural and presentational aspects of the resource, and knowledge about the resource creators. Thus, there is a critical need in educational digital libraries for interfaces and services that can serve as cognitive tools [21, 24] to support library developers, and ultimately library users, to more effectively and efficiently assess the quality of educational resources and collections.
Developing tools and the underlying algorithms necessary to support and scale curation processes around quality is an important motivator for the research described in this article. Our long-term research objective is to use state-of-the-art methodologies in machine learning and natural language processing to develop a computational model of quality that approximates expert human judgments. Developing a computational model of quality that approximates expert judgments is a foundational requirement for developing interfaces and tools that can optimize and scaffold the complex human decision processes and procedures associated with collection curation. If the dimensions of quality can be effectively modeled and represented, we can envision a suite of future intelligent collection curation tools based on this underlying computational model including:
For instance, imagine a scenario where Jennifer, a collection curation staff member, needs to determine whether a specific digital collection meets the library's guidelines and policies with respect to resource and metadata quality. Jennifer needs to decide whether or not to recommend this collection for inclusion in the library; it contains over 1,000 digital educational resources such as lesson plans, classroom activities, and laboratory activities. Jennifer would like a quick, objective characterization of the quality of these resources, as well as the collection as a whole. She navigates to a collection support tool that provides an interface for such purposes. This tool contacts a web service that computes a number of quality metrics for each of the resources in the collection and returns these metrics to the tool, where the results are displayed for Jennifer to consider. This display highlights a number of specific resources that appear to be outside the preferred boundaries for particular quality metrics. Jennifer uses this information to quickly identify outliers whose quality differs from the rest, i.e., potentially problematic resources. Such outliers may require further review, may be more appropriate in another collection, or may not be included in the library's collections at all. Jennifer can use this information to provide feedback to the collection developer about the problematic resources, and to decide if the collection meets the library's quality requirements.
This article describes the results of a pilot study that lays critical groundwork towards developing intelligent collection curation tools such as the one described in the scenario. Specifically, we report on our efforts to develop an initial computational model of quality and attendant machine learning algorithms capable of detecting quality variations in digital library resources. The questions guiding this research are:
We first review prior research on understanding and modeling evaluative criteria, processes, and strategies used to assess the quality and credibility of online information. We then review resource metadata elements in an effort to find information that may be indicative of quality. We use this analysis to identify potential indicators of quality that may be useful for discriminating between resources and that can be identified using automated techniques. Next we present the results of a pilot study, where we investigated whether machine-learning algorithms could be trained to recognize various quality indicators and to classify resources into quality bands based on the identified indicators. We also examined which indicators positively or negatively influenced the classification of the resource into a particular quality band. Resources used in this experiment were drawn from existing DLESE collections.
Identifying Quality Indicators
Prior research suggests that quality is a complex and multi-dimensional construct, and that features of information sources can be correlated with human judgments about quality. Within the information sciences, there is a history of research studying the cognitive processes of users making judgments about the relevancy, credibility, and more recently, quality of online information sources. These studies are often domain-independent, focusing on evaluative judgments when engaged in general information seeking activities, and largely geared towards generating design guidelines for web site developers. The reviewed studies utilized a variety of qualitative and quantitative research methods to identify the design elements, content characteristics, and other factors that lead online information sources to be highly rated by users.
Fogg et al. conducted a large-scale online survey, with over 1,400 participants, to identify what factors affect perceptions of web site credibility. Using statistical analysis techniques, they identified five factors boosting web site credibility (real-world feel, ease-of-use, expertise, trustworthiness, and tailoring) and two that hurt credibility (commercial implications and amateurism). Rieh used verbal protocols and post-search interviews to study how people make judgments related to quality and cognitive authority while searching for information on the web. Rieh identified five factors that influenced quality assessments: goodness, accuracy, currency, usefulness, and importance. Factors influencing cognitive authority assessments were trustworthiness, reliability, scholarliness, credibility, officialness, and authoritativeness. Other studies have demonstrated that it is possible to correlate low-level design issues, such as the amount and positioning of text, or the overall portion of a page devoted to graphics, with expert judgments of overall site quality.
To develop a preliminary set of indicators for use in the machine learning pilot study, we conducted an extensive literature review and meta-analysis to identify 16 features of resources or metadata that are potentially useful for detecting quality variations across resources. Table 1 describes each of the quality indicators used in this research, and presents the prior research or motivation for including the indicator in this study. A complete description of this literature review and meta-analysis is beyond the scope of this article; details are available in [7].
We categorized the quality indicators into five categories: provenance, description, content, social authority, and availability. Provenance refers to the authority of the source of the information. Both cognitive authority and site domain are indicators of provenance. Description refers to the amount and quality of resource metadata. Content is concerned with aspects of the information that is displayed to the user. Social authority refers to the idea that if several people think a resource contains good or useful information, the resource must be high quality. And finally, availability refers to the accessibility and technical functionality of a resource.
Machine Learning Study
We approached the problem of detecting quality variations in digital library resources as a classification problem. A common technique for investigating classification problems is supervised machine learning, where a system is trained to perform a task such that its outputs approximate the judgments of humans as represented in a large body of annotated training materials. Of the many supervised learning algorithms available, we chose to use support vector machines (SVM) because SVM has been applied to such tasks as word sense disambiguation, text classification, part-of-speech tagging, web page classification, and question classification with high accuracy. In his tutorial on SVM, Burges summarizes some of the recent applications of SVM, and states that "In most of these cases, SVM generalization performance either matches or is significantly better than that of competing methods."
Automating assessments of quality such that the automated assessments replicate human assessments involves having humans annotate a large body of examples that can be used to train a system and to test the accuracy of the trained system. Collections personnel working with the DLESE project and serving as the domain experts for this study were asked to rank DLESE collections according to their gestalt sense of quality and their personal preferences with respect to quality. We used these collection rankings as the rankings for individual resources within the collections. This methodology introduces inaccuracies into the labeling of the training and testing data, because according to the domain experts, collections contain resources of varying quality. We used these rankings in spite of the possible inaccuracies because having domain experts individually assess several hundred resources would have taken a significant amount of time. Additionally, it was thought that generally the resources would be correctly labeled with a few outliers, and this level of inaccuracy would be acceptable for a pilot study.
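The labeling shortcut described above, where each resource inherits the ranking of its collection, can be sketched roughly as follows. This is only an illustration: the function name, the collection names, and the URLs are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def label_resources(
    collection_rank: Dict[str, str],
    collection_resources: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Give every resource the quality band assigned to its collection."""
    labeled = []
    for collection, resources in collection_resources.items():
        band = collection_rank[collection]  # expert ranking of the whole collection
        for resource in resources:
            labeled.append((resource, band))
    return labeled


# Hypothetical collections and URLs, for illustration only.
ranks = {"collection_x": "A+", "collection_y": "A-"}
members = {
    "collection_x": ["http://example.org/r1", "http://example.org/r2"],
    "collection_y": ["http://example.org/r3"],
}
labels = label_resources(ranks, members)
```

Because the band is copied from the collection, any resource whose quality differs from that of its collection ends up mislabeled, which is the source of the labeling noise acknowledged above.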
The corpus required for machine learning consisted of 600 DLESE resources. In an effort to assess quality by a traditional school grading scheme of A+, A, and A-, this study utilized three classification categories. Each of the three categories contained 200 DLESE resources. Within each classification category, the resources were further divided into training and testing sets where 80% of the resources were used to train the model and the remaining 20% were used to test the accuracy of the trained model; resources were randomly divided into the training and testing sets.
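A minimal sketch of such a per-category random split is shown below. It assumes resources are identified by strings; the function name, the placeholder resource identifiers, and the fixed random seed are assumptions made for reproducibility of the example, not details reported in the study.

```python
import random
from typing import Dict, List, Tuple

Labeled = List[Tuple[str, str]]


def stratified_split(
    by_band: Dict[str, List[str]],
    train_fraction: float = 0.8,
    seed: int = 0,  # assumed seed; the study does not report one
) -> Tuple[Labeled, Labeled]:
    """Randomly split each quality band into training and testing parts."""
    rng = random.Random(seed)
    train: Labeled = []
    test: Labeled = []
    for band, resources in by_band.items():
        shuffled = resources[:]  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(r, band) for r in shuffled[:cut]]
        test += [(r, band) for r in shuffled[cut:]]
    return train, test


# Placeholder identifiers standing in for the 200 resources per category.
corpus = {
    band: [f"{band}-resource-{i}" for i in range(200)]
    for band in ("A+", "A", "A-")
}
train_set, test_set = stratified_split(corpus)
```

With 200 resources per category, this yields 160 training and 40 testing resources in each category, so 480 and 120 overall.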
To compute metrics for the quality indicators, we developed a software application that parses HTML and XML. Once the metrics were computed and the data labeled with the domain experts' rankings, the labeled data was given as input to SVM software. We used LIBSVM, a library for support vector machines, because of its support for multi-class classification. LIBSVM provided an accuracy rating in the form of the percentage of resources it correctly classified. To determine which quality indicators positively or negatively contributed to the classification and to determine which indicators had no effect on the classification, we conducted a series of add-one-in analyses. The first step in an add-one-in analysis is to train the model on each quality indicator individually. Then, starting with the indicator that produced the highest accuracy, successively add indicators based on their individual accuracy ratings. If the addition of the indicator increases the model's accuracy, the indicator is retained; otherwise it is removed. If quality indicators tie for the next one to be added (i.e., they have the same individual accuracies), all possible combinations of adding those indicators are tried. The combination that produces the highest accuracy is retained.
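The add-one-in procedure can be sketched as below. This is one reading of the description, not the code used in the study: `evaluate` is a hypothetical callback that would train a model on the given set of indicators and return its test accuracy, and the indicator names and scores in `toy_scores` are invented purely to exercise the tie-handling step.

```python
from itertools import combinations, groupby
from typing import Callable, FrozenSet, Iterable, List, Tuple

Evaluate = Callable[[FrozenSet[str]], float]


def add_one_in(indicators: Iterable[str], evaluate: Evaluate) -> Tuple[List[str], float]:
    """Select indicators by adding them in order of individual accuracy."""
    # Step 1: score each indicator on its own.
    solo = {name: evaluate(frozenset([name])) for name in indicators}
    ranked = sorted(solo, key=lambda name: solo[name], reverse=True)

    selected: List[str] = []
    best = -1.0  # below any real accuracy, so the top indicator is always kept
    # Step 2: walk through groups of indicators that share an individual accuracy.
    for _, group in groupby(ranked, key=lambda name: solo[name]):
        tied = list(group)
        # Step 3: for ties, try every non-empty combination of the tied indicators.
        candidates = [
            combo
            for size in range(1, len(tied) + 1)
            for combo in combinations(tied, size)
        ]
        scored = [(evaluate(frozenset(selected) | frozenset(c)), c) for c in candidates]
        top_score, top_combo = max(scored, key=lambda pair: pair[0])
        if top_score > best:  # retain only additions that raise accuracy
            selected.extend(top_combo)
            best = top_score
    return selected, best


# Invented accuracies for four hypothetical indicators "a" to "d".
toy_scores = {
    frozenset("a"): 0.70, frozenset("b"): 0.60,
    frozenset("c"): 0.60, frozenset("d"): 0.50,
    frozenset("ab"): 0.70, frozenset("ac"): 0.70,
    frozenset("abc"): 0.80, frozenset("abcd"): 0.75,
}
kept, accuracy = add_one_in("abcd", lambda subset: toy_scores[subset])
```

In this toy run, "b" and "c" tie and neither helps on its own, but adding both together raises the accuracy, so both are retained; "d" lowers the accuracy and is dropped.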
Experiments and Results
We conducted two experiments. To determine if the methodology described above would be useful at all in classifying resources, only the A+ and A- resources were included in the first experiment. Presumably the A+ and A- resources would be the most different in terms of quality thereby simplifying the classification problem. The trained model for this experiment correctly classified 75 of the 80 resources in the test set, producing an accuracy of 93.75%. Since the outcome of this experiment was favorable, we conducted a second experiment where all three categories of resources were included (A+, A, and A- resources). The trained model for this experiment correctly classified 92 of the 120 resources in the testing set, producing an accuracy of 76.67%.
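The reported test-set sizes and accuracies follow directly from the corpus design (200 resources per category, 20% held out for testing). The short check below reproduces the arithmetic; the function and variable names are illustrative only.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Percentage of test resources that were classified correctly."""
    return 100.0 * correct / total


test_per_category = 200 * 20 // 100           # 40 held-out resources per category
experiment_1_total = 2 * test_per_category    # A+ and A- only
experiment_2_total = 3 * test_per_category    # A+, A, and A-

experiment_1 = accuracy_percent(75, experiment_1_total)
experiment_2 = accuracy_percent(92, experiment_2_total)
print(f"{experiment_1:.2f}% {experiment_2:.2f}%")  # prints "93.75% 76.67%"
```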
While the results are favorable, some of the quality indicators could be negatively impacting the classification, which could decrease the accuracy of the models. To determine which quality indicators, if any, were negative contributors, we conducted an add-one-in analysis for each experiment. Table 2 lists the quality indicators and the nature of their contributions to the classification.
Note 1. Resource currency and link count had the same individual accuracies. Individually adding resource currency and link count had no effect on the training; however, adding them both at the same time increased the model's accuracy.
Note 2. Cognitive authority, WWW, and cost all had the same individual accuracies. While individually adding WWW had a negative impact on the training, adding it in combination with cognitive authority and cost produced the highest accuracy compared with any other combination.
Including only the quality indicators that positively contributed to the classification increased the models' accuracy in both experiments. The highest accuracy achieved for the first experiment was 98.75%, and the highest accuracy achieved for the second experiment was 81.67%.
The machine learning approach to computing quality has provided promising results. We identified 16 quality indicators, computed metrics for each indicator, and trained SVM models in an effort to assess resource quality. Two experiments demonstrated that the quality indicators and associated metrics are sensitive enough to detect differences in the quality of resources cataloged in DLESE.
These initial results are particularly encouraging because the data set used in these experiments is skewed: all the resources cataloged in DLESE are considered to be high-quality resources. These experiments were able to successfully differentiate between the A+, A, and A- resources.
The first experiment classified resources into A+ and A- classes with 93.75% accuracy when all quality indicators were utilized in the classification. Eliminating the quality indicators that either negatively contributed to the classification or that had no effect on the classification increased the model's accuracy to 98.75%. The quality indicators producing the 98.75% accuracy were metadata currency, multimedia, site domain, cost, resource currency, and link count. The single best quality indicator for this experiment was metadata currency, which correctly classified 77.5% of the resources on its own.
The second experiment classified resources into A+, A, and A- classes with 76.67% accuracy. Again removing the quality indicators that negatively contributed to the classification or that had no effect on the classification resulted in increasing the model's accuracy to 81.67%. The quality indicators producing the 81.67% accuracy were metadata currency, multimedia, element count, site domain, cognitive authority, WWW, cost, and link count. Again metadata currency was the single best quality indicator, correctly classifying 59.17% of the resources on its own.
This study utilized expert ratings for DLESE collections to train the SVM models, assigning the rating given to the collection as a whole to each of the individual resources in the collection. In effect, we used collection ratings as proxies for individual resource ratings in this pilot study. This methodology, although appropriate for a pilot study, has significant drawbacks in that not all resources within a collection are the same quality. Future work could benefit from having both more accurate and finer-grained training data. Specifically, further research could benefit from having experts rate the quality of resources individually, and by having experts rate resources according to specific quality features, such as provenance, description, content, social authority, and availability. Having these finer grained ratings would allow for training of more focused SVM models, making it possible to provide measurements for specific quality features that may be particularly salient to a library's collection and resource policies.
Future studies should also include educational resources that are not currently cataloged into a digital library such as DLESE. This would require the machine learning classifier to rely solely on the contents of the resource itself, and to not rely on indicators derived from analyses of metadata. In this study, quality metrics were computed only for the first page of a resource. Future work could compute metrics for several pages of a resource, and in so doing could consider using site-level quality indicators (e.g., measures of site size or complexity) in addition to the page-level quality indicators used here.
In this article, we discussed the critical need to provide intelligent decision-support tools to support the complex human judgments and processes involved in making quality determinations, and in curating digital collections to align with library policies on resource and collection quality. We believe that identifying useful quality indicators and developing algorithms for automatically computing quality metrics and classifying resources based on these indicators are important steps towards this goal. We described the results of an extensive literature review and metadata analysis that yielded 16 promising quality indicators and two experiments that examined the utility of these indicators.
While not the explicit focus of this work, we believe that learners will ultimately be important beneficiaries of these research outcomes. Learners often do not have the background knowledge, information seeking skills, or metacognitive skills necessary to make effective evaluative judgments about digital resources. Graham and Metaxas found that students increasingly use the Internet as their primary source of information without critically evaluating the information they find. This capacity to evaluate and discriminate is the highest level of critical thinking skill in Bloom's taxonomy, which is a widely respected taxonomy for thinking about educational goals. If we can model and represent dimensions of quality effectively, we can envision a future where next generation digital library interfaces, utilizing this computational model of quality, actively and routinely support learners to make sophisticated evaluative judgments.
1. Aladwani, A.M. and Palvia, P.C. Developing and validating an instrument for measuring user-perceived web quality. Information and Management, 39 (6). 467-476, 2002.
2. Amento, B., Terveen, L. and Hill, W. Does "authority" mean quality? Predicting expert quality ratings of Web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Athens, Greece, 2000). ACM Press, 296-303.
3. Bloom, B.S., Englehart, M.D., Furst, E.J., Hill, W.H. and Krathwohl, D.R. Taxonomy of Educational Objectives: The Classification of Educational Goals. David McKay, New York, 1956.
4. Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2 (2). 121-167, 1998.
5. Cabezas, C., Resnik, P. and Stevens, J. Supervised sense tagging using support vector machines. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2) (Toulouse, France, 2001).
7. Custard, M. Computing Quality of Web Content for Educational Digital Libraries. M.S. thesis, Department of Computer Science, University of Colorado, Boulder, 2005.
8. Fogg, B.J., Marshall, J., Laraki, O., Osipovich, A., Varma, C., Fang, N., Paul, J., Rangnekar, A., Shon, J., Swani, P. and Treinen, M. What makes Web sites credible? A report on a large quantitative study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Seattle, Washington, USA, 2001). ACM Press, 61-68.
9. Fogg, B.J., Soohoo, C., Danielson, D.R., Marable, L., Stanford, J. and Tauber, E.R. How do users evaluate the credibility of Web sites? A study with over 2,500 participants. In Proceedings of the 2003 Conference on Designing for User Experiences (San Francisco, California, USA, 2003). ACM Press, 1-15.
10. Fritch, J.W. and Cromwell, R.L. Evaluating Internet resources: Identity, affiliation, and cognitive authority in a networked world. Journal of the American Society for Information Science and Technology, 52 (6). 499-507, 2001.
12. Graham, L. and Metaxas, P.T. "Of course it's true; I saw it on the Internet!" Critical thinking in the Internet era. Communications of the ACM, 46 (5). 71-75, 2003.
13. Ivory, M.Y. and Hearst, M.A. Statistical profiles of highly-rated web sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing our world, changing ourselves (Minneapolis, Minnesota, USA, 2002). ACM Press, 367-374.
14. Ivory, M.Y., Sinha, R.R. and Hearst, M.A. Empirically validated web page design metrics. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Seattle, Washington, USA, 2001). ACM Press, 53-60.
15. Joachims, T. A statistical learning model of text classification for support vector machines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, USA, 2001). ACM Press.
16. Kastens, K. How to identify the "best" resources for the reviewed collection of the Digital Library for Earth System Education in column "Another Node on the interNet". Computers and the Geosciences, 27 (3). 375-378,
17. Kastens, K. Quality Workshop Report and Recommendations. Lamont Doherty Earth Observatory, Columbia University, New York, 2003. Available at <http://www.dlese.org/documents/reports/collections/quality_wkshop.html>.
18. Katerattanakul, P. and Siau, K. Measuring information quality of web sites: Development of an instrument. In Proceeding of the 20th International Conference on Information Systems (Charlotte, North Carolina, USA, 1999). Association for Information Systems, 279-285.
19. Murata, M., Ma, Q. and Isahara, H. Comparison of three machine-learning methods for Thai part-of-speech tagging. ACM Transactions on Asian Language Information Processing (TALIP), 1 (2). 145-158, 2002.
20. NSF. National Science, Technology, Engineering, and Mathematics Education Digital Library (NSDL), National Science Foundation, Washington, D.C., 2005, 18.
21. Reeves, T.C., Laffey, J.M. and Marlino, M.R. Using technology as cognitive tools: Research and praxis. In Proceedings of the Annual Conference of the Australasian Society for Computers in Tertiary Education (ASCILITE 97) (Perth, Western Australia, 1997), 481-485.
22. Rieh, S.Y. Judgment of information quality and cognitive authority in the Web. Journal of the American Society for Information Science and Technology, 53 (2). 145-161, 2002.
23. Sumner, T., Khoo, M., Recker, M. and Marlino, M. Understanding educator perceptions of "quality" in digital libraries. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries (Houston, Texas, USA, 2003). IEEE Computer Society, 269-279.
24. Sumner, T. and Marlino, M. Digital libraries and educational practice: A case for new models. In Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2004) (Tucson, Arizona, USA, 2004). ACM Press, 170-178.
25. Sun, A., Lim, E.-P. and Ng, W.-K. Web classification using support vector machine. In Proceedings of the 4th International Workshop on Web Information and Data Management (McLean, Virginia, USA, 2002). ACM Press.
26. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, New York, 1995.
27. Zhang, D. and Lee, W.S. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, Canada, 2003). ACM Press.
28. Zhu, X. and Gauch, S. Incorporating quality metrics in centralized/distributed information retrieval on the World Wide Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Athens, Greece, 2000). ACM Press, 288-295.
(On October 18, the email address for Myra Custard was corrected.)
Copyright © 2005 Myra Custard and Tamara Sumner