To monitor severe acute respiratory syndrome (SARS) infection, a coronavirus protein microarray that harbors proteins from SARS coronavirus (SARS-CoV) and five additional coronaviruses was constructed. These microarrays were used to screen ≈400 Canadian sera from the SARS outbreak, including samples from confirmed SARS-CoV cases, respiratory illness patients, and healthcare professionals. A computer algorithm that uses multiple classifiers to predict samples from SARS patients was developed and used to predict 206 sera from Chinese fever patients. The test assigned patients into two distinct groups: those with antibodies to SARS-CoV and those without. The microarray also identified patients with sera reactive against other coronavirus proteins. Our results correlated well with an indirect immunofluorescence test and demonstrated that viral infection can be monitored for many months after infection. We show that protein microarrays can serve as a rapid, sensitive, and simple tool for large-scale identification of viral-specific antibodies in sera.
In November 2002, an outbreak of severe acute respiratory syndrome (SARS) occurred in southern China and rapidly spread across five continents. SARS was characterized by fever and respiratory compromise; the World Health Organization estimated that SARS infected 8,439 individuals with a mortality rate of ≈9% overall and 40% in people older than 60 years (1). A novel coronavirus, SARS coronavirus (SARS-CoV), was identified as the etiological agent for the illness and was found to be related to, but distinct from, other coronaviruses, including two previously identified human coronaviruses, HCoV-229E and HCoV-OC43, single-stranded RNA viruses that collectively cause ≈30% of common colds in humans (2). Like other coronaviruses, SARS-CoV encodes two RNA-dependent replicases, 1a and 1b, a spike protein, a small envelope protein, a membrane protein, and a nucleocapsid (N) protein, as well as nine predicted proteins that lack significant similarity to any known proteins.
At present, no effective treatment of SARS is available. Isolation and stringent infection-control practices were the sole means to control the epidemic. Hence, rapid, accurate, and early diagnostic tests are necessary to monitor the course of the disease.
The World Health Organization classification for SARS infection in adults is based on four criteria: fever, respiratory symptoms, close proximity to infected individuals, and radiological evidence of lung infiltrates (3). Several diagnostic approaches have also been used for detecting SARS-CoV, including RT-PCR techniques, ELISAs, and the indirect immunofluorescence test (IIFT). RT-PCR is sensitive, specific, and useful during the infection (4–7). However, it is not useful once the infection is cleared and can be challenging to implement in clinical application; the collection of samples such as nasopharyngeal or bronchial alveolar aspirates from SARS patients is dangerous and can put healthcare workers at high risk. ELISAs tend not to be highly sensitive and usually require large amounts of sample (8–11). Moreover, existing ELISAs, such as one manufactured by Euroimmun (Luebeck, Germany), use whole viral extracts, thereby increasing the chance of misdiagnosis due to crossreactivity with proteins from other viruses. Currently, an IIFT kit (Euroimmun) to detect SARS IgG antibody response is considered the serological gold-standard method in the clinic. However, IIFT limitations include (i) difficulty in diagnosis in the urgent acute phases of the disease, (ii) failure to diagnose ≈5% of sera that contain high concentrations of antinuclear factor, and (iii) visual inspection of fluorescently stained cells, which is both subjective and of modest throughput. Thus, more tests for diagnosing the disease need to be developed.
We report the construction of a coronavirus proteome microarray that contains the entire proteomes of the human SARS-CoV and HCoV-229E viruses and the partial proteomes of human HCoV-OC43, mouse MHVA59, bovine coronavirus (BCoV), and feline coronavirus (FIPV). The coronavirus protein microarrays were used to screen serum samples collected from fever and respiratory patients during the period of SARS outbreak in Beijing and Toronto. Algorithms to optimally diagnose SARS-infected patients were devised to generate a microarray test that is rapid, sensitive, accurate, and adaptable for detection of many other types of viral infections.
Development of a Coronavirus Protein Microarray and a SARS Detection Assay.
A protein microarray approach was developed to rapidly identify SARS-CoV and other coronavirus-infected patients with high sensitivity and accuracy. Genes or gene fragments that cover the entire genome of SARS-CoV and the majority of the HCoV-229E and MHVA59 genomes were amplified by PCR and cloned into a yeast expression vector that produces the viral proteins with GST at their N terminus (Fig. 1). Using the limited sequence information available at the time, regions of the BCoV, HCoV-OC43, and feline coronavirus genomes were also cloned (Fig. 1). A total of 82 expression constructs, about one-third (25) of which originate from SARS-CoV and the rest from the other coronaviruses, were purified from yeast cells by using their GST tags. Immunoblot analysis revealed that most purified proteins could be detected and migrated at their expected molecular weights, including the glycoproteins.
Regions of six coronaviruses represented on the microarray. The positions of the cloned and expressed fragments are marked with light-gray bars. The pink bars represent SARS features selected as classifiers in the supervised cluster analysis (both k-NN and LR). The light-blue bars are features bound by the MHVA59-infected mouse serum.
To test whether a protein microarray approach could be used to detect SARS-CoV infection, we fabricated a microarray containing the 82 purified proteins. Serial dilutions prepared from four serum samples collected from Chinese patients clinically diagnosed as SARS-positive and which also tested positive by a local ELISA were used to probe the array. The presence of human-anti-SARS antibodies was detected with Cy3-labeled goat anti-human IgG antibodies (12–16). As shown in Fig. 2A, the sensitivity of the microarray assay is extremely high; reactivity is readily detected at 1:10,000-fold dilution for the strong positive serum and 1:800-fold for the weakly positive sera. The assay is 50-fold more sensitive than ELISAs performed using the same sera. Importantly, <1 μl of serum is needed for the protein microarray assay, which is crucial because the sera from SARS patients are extremely precious.
Analysis of patient serum samples in a protein microarray format. (A) A SARS-CoV-positive serum from a diagnosed SARS-CoV-infected patient in Beijing was tested at eight dilutions. The signals for the five SARS N protein fragments are shown on the chart. The vertical line indicates the detection limit. (B) Examples of coronavirus protein microarrays probed with various sera from SARS-CoV-infected or uninfected individuals. The first image shows probing with an anti-GST antibody. The second image shows probing with a serum from a SARS patient. The N protein and its fragments were the most antigenic proteins on the array [indicated by the yellow boxes (second image)]. The third image shows probing with a serum from a non-SARS patient. The fourth image shows probing with a serum from an MHVA59-infected mouse. Light-blue boxes, the MHV N protein; pink boxes, the BCoV N protein. The red boxes indicate the signals from the human IgG used as the positive controls.
Serum Probing of the Coronavirus Proteome Microarray with Human Sera.
The coronavirus protein microarrays were used to screen sera from 399 Canadian and 203 Chinese infected and noninfected individuals in a double-blind format. The Canadian samples included 181 clinical- and laboratory-confirmed SARS-CoV sera (see Materials and Methods) (3), as well as anonymized clinical samples from patients who had presented with respiratory illness during the outbreak period but who failed to meet the case definition and did not develop SARS. Other SARS-CoV-negative sera were from asymptomatic healthcare workers. The Chinese sera were from patients with fever during the SARS outbreak; some of these were classified as SARS-positive and others as SARS-negative.
To accomplish the screening, each of the 82 purified coronavirus proteins was spotted in duplicate on eight identical blocks per microscope slide. Human IgG protein was also included as positive control (see below). The amount of immobilized coronavirus proteins and protein fragments present on the microarray was quantified by probing with anti-GST antibodies (Fig. 2B).
The serum samples were screened at a 200-fold dilution, and bound antibodies were detected with Cy3-labeled goat anti-human IgG. The signals were analyzed by using algorithms that we developed. Positive sera usually exhibited strong reactivity for ≈10% of the proteins on the microarrays. The full-length SARS N protein and two of its C-terminal derivatives were strongly recognized by the antibodies present in the SARS-CoV-infected patient sera but not in sera from noninfected individuals (Fig. 2B). The C-terminal fragments of the SARS N protein, which contain a short lysine-rich region (KTFPPTEKKDKKKKTDEAQ; amino acids 362–381) unique to SARS-CoV, exhibit the highest antigenic activity (SARS-N-C2; Fig. 2A Right). These results are consistent with previous studies that identified the N proteins of coronaviruses as the most abundant and reactive antigens (11).
Although the N proteins are conserved among coronaviruses, the SARS-CoV-infected sera from the Chinese and Canadian patients showed little crossreactivity with proteins of other coronaviruses on the array, including N proteins. One exception is that many (88%) of the sera from the Chinese patients showed a slight reactivity to the first half of BCoV N protein, which shares ≈40% identity through its first 210 amino acids with the SARS-CoV N protein. Interestingly, sera from infected Canadian patients did not react with this protein. In addition, ≈20% of the sera from both SARS-positive and -negative Canadian individuals specifically recognized the HCoV-229E N protein but not the N proteins from the other species. We expect that many Canadian patients may have been exposed to HCoV-229E (see below).
To further test the specificity of our assays, we probed the coronavirus protein microarray with ≈30 sera from MHVA59-infected and control mice. As shown in Fig. 2B, serum from an infected mouse recognized the MHVA59 N protein, whereas control mouse sera did not react with proteins on the array. This serum also crossreacted with the N protein from BCoV but not with proteins from other coronaviruses. Because the N proteins from MHVA59 and BCoV share 70.7% identity and 87.9% similarity over their entire protein sequences, crossreactivity between these two proteins is not surprising.
In summary, although a few instances of crossreactivity occurred among highly similar proteins, the protein microarray approach demonstrated that different serum samples could be differentiated at a high degree of specificity. Most importantly, the protein microarray was able to distinguish reactivity between the human coronaviruses (HCoV-229E and SARS).
Detection of SARS-Infected Patients in the Canadian Samples.
To determine whether an accurate SARS diagnostic test can be devised by using the protein microarray data, we analyzed the results obtained from the Canadian patients using computational approaches. The sera were first clustered according to the relative signal intensities of all of the coronavirus proteins immobilized on the microarrays in an unsupervised fashion (17). The sera fell into two major groups, which upon subsequent comparison with clinical IIFT data were largely correlated with SARS-positive and -negative sera (Fig. 3). The unsupervised method correctly predicted 138 of 181 infected serum samples (76% sensitivity, with sensitivity defined as the percentage of correct positives of the total positives) and 210 of 218 sera from healthy individuals (96% specificity, with specificity defined as the percentage of correctly classified negatives of the total negatives). In the cluster of markers, five of the SARS N protein fragments associated tightly (Fig. 3, at the bottom). Most of the sera clustered as originating from SARS-infected patients exhibited unambiguous reactivity with this group of markers as expected (Fig. 2B). The SARS sera also exhibited statistically significant binding to one spike protein fragment.
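The sensitivity and specificity definitions given above can be checked with a short calculation. The following is a minimal sketch using only the counts quoted in the text; the function names are illustrative and are not taken from the authors' software.

```python
def sensitivity(true_positives: int, total_positives: int) -> float:
    """Fraction of the truly positive sera that were called positive."""
    return true_positives / total_positives


def specificity(true_negatives: int, total_negatives: int) -> float:
    """Fraction of the truly negative sera that were called negative."""
    return true_negatives / total_negatives


# Counts quoted in the text for the unsupervised clustering step.
sens = sensitivity(138, 181)
spec = specificity(210, 218)

print(f"sensitivity = {sens:.0%}")  # 76%
print(f"specificity = {spec:.0%}")  # 96%
```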
Unsupervised 2D clustering of the Toronto sera and microarray features. The 399 Toronto IgG sera were clustered according to their reactivity to the microarray signals, and the microarray features were clustered according to their serum reactivity. The corresponding Euroimmun IIFT SARS-CoV IgG results are indicated on top of the diagram, where black and white bars represent SARS-positive and -negative sera, respectively. The different coronaviruses are color-coded on the left of the diagram. The yellow color is low or background signal on the arrays, whereas the orange color represents signals above the background level. The black box highlights the features that help classify SARS-infected sera from the microarray assays. All of the classifiers in the black rectangle are SARS N proteins and SARS N fragments.
We next set out to improve our prediction by identifying the meaningful classifiers and conducting a supervised classification. Because only a limited number of proteins showed differences between the SARS-CoV-positive and -negative patients (Fig. 3), we selected the top 10 features that demonstrated the most significant differences between these two types of patients as candidates for classifier selection (18). Many of the selected candidates were SARS N protein fragments.
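The candidate-selection step can be illustrated as a ranking of features by how strongly they separate the two groups. The text does not name the statistic used (it cites ref. 18), so the sketch below assumes Welch's t-statistic purely for illustration. The feature names and intensities are invented toy values, not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(pos, neg):
    """Welch's t-statistic for two groups of signal intensities."""
    standard_error = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / standard_error


def top_features(signals, labels, n):
    """Return the n feature names with the largest |t| between groups.

    signals: {feature_name: [intensity for each serum]}
    labels:  [True if the serum is SARS-positive, else False]
    """
    scores = {}
    for name, values in signals.items():
        pos = [v for v, is_pos in zip(values, labels) if is_pos]
        neg = [v for v, is_pos in zip(values, labels) if not is_pos]
        scores[name] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical toy data: three features measured on six sera.
labels = [True, True, True, False, False, False]
signals = {
    "N_full": [9.1, 8.7, 9.4, 1.2, 0.9, 1.1],
    "spike_a": [3.0, 2.2, 2.8, 1.5, 1.9, 1.4],
    "orf_x": [1.0, 1.3, 0.9, 1.1, 1.2, 1.0],
}

ranked = top_features(signals, labels, n=2)
print(ranked)  # ['N_full', 'spike_a']
```

In the study the same idea would be applied with n = 10 across all 82 arrayed proteins.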
To determine the best classifiers and classification model, we applied two different supervised analysis approaches, k nearest neighbor (k-NN) (19) and logistic regression (LR) (20). k-NN measures the similarity between a new case and all of the known cases; the prediction for the new case is determined by the identities of its k closest neighbors (Fig. 4A). Using this method, five features were selected by the algorithm as the best classifiers: SARS N [pEGH-55 (Y)], SARS N (pEGH-B4), SARS N-C1 (pEGH-B7), 229E-S 1/4, and SARS spike [first half (Y)] (note that 229E-S 1/4 negatively correlates with SARS). The best k value selected by the model is 9, indicating that the nine closest-neighboring samples to the tested case were used for the prediction. At the confidence cutoff of 0.5, this model achieved 91% accuracy with 15 false-positive and 18 false-negative cases [163 of 181 positive cases were correct (90% sensitivity) and 203 of 218 negative sera were correct (93% specificity)] (Table 1).
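A k-NN prediction of this kind can be sketched in a few lines. The example below rests on several assumptions: Euclidean distance, a confidence score equal to the fraction of positive neighbors, and a call of "positive" when that score reaches the cutoff. The text does not specify how the authors computed distance or confidence. The two-feature vectors are invented toy values, and k = 3 is used only because the toy set is small (the study's model selected k = 9).

```python
from math import dist


def knn_confidence(query, training, k):
    """Fraction of the k nearest training sera that are SARS-positive.

    training: list of (feature_vector, is_positive) pairs.
    """
    nearest = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    return sum(1 for _, is_positive in nearest if is_positive) / k


def knn_predict(query, training, k, cutoff=0.5):
    """Call a serum positive when its confidence reaches the cutoff."""
    return knn_confidence(query, training, k) >= cutoff


# Hypothetical training sera described by two classifier signals each.
training = [
    ((9.0, 8.5), True),
    ((8.2, 9.1), True),
    ((7.9, 8.0), True),
    ((1.0, 1.2), False),
    ((0.8, 1.5), False),
    ((1.4, 0.9), False),
]

print(knn_confidence((7.5, 7.0), training, k=3))  # 1.0
print(knn_predict((1.1, 1.0), training, k=3))     # False
```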
Models generated by k-NN (A) and LR (B). The cutoff for the prediction is the probability of 0.5, which is indicated by the black horizontal line: (lane a) signals for the selected classifiers, (lane b) confidence calculated from the classifier signals (range from 0 to 1), and (lane c) the IIFT annotations, where the black and white boxes represent IIFT-positive and -negative, respectively. On the top are depicted the names of the features that were selected by the k-NN and LR models.
We also analyzed our microarray results using LR, which is a generalized linear regression for binary responses (Fig. 4B). The features selected by LR included SARS N-C1 (pEGH-B7), SARS N (pEGH-55) (Y), SARS N (pEGH-B4), and SARS N-C2 (pEGH-B8 #1). The accuracy of this model was 92% (89% sensitivity and 94% specificity). To determine whether k-NN or LR performed better, we used the receiver operating characteristic curve (21) and plotted the rate of true positives against that of false positives at different cutoff points. Using the area under the curve (AUC), we measured the quality of each model and found that both AUC values were close to 0.95, indicating that both models performed equally well. Interestingly, although both LR and k-NN predictions exhibited only ≈92% overlap with the IIFT results (Table 1), 97% of their predictions were shared, indicating that the discrepancy between our models and the standard IIFT test does not depend on the analysis method but rather on the experimental data.
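The area under a receiver operating characteristic curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way, by direct pairwise comparison, as one simple illustration; it is not necessarily how the authors computed their values, and the scores are invented toy numbers.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half.
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier confidences for four positive and four negative sera.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [True, True, True, True, False, False, False, False]

auc = roc_auc(scores, labels)
print(auc)  # 0.875
```

A value of 1.0 means every positive outranks every negative, and 0.5 corresponds to random ordering.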
That both k-NN and LR performed similarly prompted us to repeat the probings of the 33 discrepant sera along with some of those that agreed with the predictions. After these probings, eight reproducibly false-negative samples remained by both methods even after a third round of probings.
To test whether IgM would yield better results than IgG, particularly for patients during the acute phase of the disease, ≈90% of the Toronto sera were also probed for IgM reactivity on the microarray. Except for one serum, the probings performed equal to or worse than the IgG probings, consistent with previous results (22–24).
Validation of the SARS Proteome Array Classification Method.
To further examine the accuracy, sensitivity, and specificity of our approach, we conducted another double-blind experiment using 56 sera collected from Chinese patients; 36 of the patients were diagnosed as SARS-infected, and 20 were diagnosed as uninfected. All of the sera were collected from patients who had recovered from respiratory disease. Of the 56 serum samples, only one serum was misclassified by our models (98% accuracy, 100% sensitivity, and 95% specificity). Importantly, both the k-NN and the LR models predicted this serum to be positive with a confidence value of 1 on a 0 to 1 scale. Taken together, these results demonstrated that our prediction algorithms performed well and accurately identified the SARS-infected samples from a large population.
Comparing the Protein Microarray Results with ELISAs.
To determine how the viral protein microarray compared with the current methods of diagnosis, we compared the performance of two independent ELISA tests on serum samples from both Canada and China. The Euroimmun ELISA was used on all but three of the serum samples taken from Canadian patients and resulted in two false-positive, six false-negative, and 26 borderline (uncertain/inconsistent) classifications. Thus, the Euroimmun ELISA is 91% accurate, as compared with 92% accuracy for the proteome array method. The samples missed by the two assays were not identical.
We also compared the microarray approach with a local ELISA used in China that used only the purified N protein. A set of 147 serum samples collected from fever patients during the SARS outbreak in China was used to probe the coronavirus protein microarray. The SARS status of these patients is not known. Similar to the results presented above, we found 85% agreement between the predictions made from the microarray assay and those made from the ELISA; all 70 sera that were SARS-CoV-positive by the ELISA were also positive by microarray. The microarray identified an additional 21 sera as SARS-CoV-positive that were not found by using the ELISA. Because (i) 15 of the 21 serum samples had confidence scores >0.72, the lowest confidence score for the 56 known Chinese SARS-infected sera presented above, and (ii) the rate of false positives in our assays is <7% (the overall specificity for the sera from characterized patients is >93%), it is likely that most of these samples originated from SARS patients. In summary, these results indicate that the protein microarray method is at least as sensitive as the Euroimmun ELISA and more sensitive than the local Chinese ELISA, and therefore is an excellent assay for detecting SARS.
Anti-SARS Antibodies Can Persist Long After Initial Infection.
One useful feature of a serum test relative to a nucleic acid diagnostic test is that anti-SARS antibodies can potentially be detected after infection. We therefore tested how long anti-SARS antibodies remained present in recovering patients after infection. Serum samples drawn from five Canadian individuals (two with respiratory illness other than SARS and three confirmed SARS-CoV cases) at different times postinfection were tested by using the protein microarrays (Fig. 5). Reactivity to five N proteins (four SARS N proteins and one HCoV-229E N protein) was scored. Sera from non-SARS patients (Patients 1 and 4 in Fig. 5) did not exhibit significant reactivity to any of the four SARS-CoV markers. In contrast, sera from SARS-CoV-positive patients (Patients 2, 3, and 5, Fig. 5) reacted strongly with each of the SARS N peptides, and for the two cases that were monitored over a long period (120–320 days), reactivity remained high for two N peptides. Furthermore, these two SARS-CoV N antigens were the same ones that reacted most strongly in the 36 SARS-confirmed patients from the group of 56 Chinese respiratory patients. These results demonstrate that at least some patients retain reactive antibodies for extended periods, and that these antibodies can be detected by protein microarrays.
Time-course analysis of serum reactivity of five Canadian individuals. (Top) Graphs from two individuals with non-SARS respiratory disease; (Bottom) Results from three SARS patients. The relative levels of antibodies against four of the SARS N protein constructs along with that of HCoV-229E N protein were monitored at different times. The vertical lines indicate the time at which the individuals were diagnosed as SARS-positive by biochemical assays.
Extending the Protein Microarray Approach to Detecting Other Coronaviruses.
Although this study was aimed at developing a systematic screen for SARS-infected sera, proteins from other human coronaviruses such as HCoV-229E were included on the microarray, thus allowing the detection of antibodies directed toward other coronaviruses (25–27). Using 10 HCoV-229E-related proteins as classifiers, we identified 82 serum samples with substantial signal [52 of 218 SARS-CoV-negative (23.9%) and 30 of the 218 SARS-CoV-positive sera (13.8%)]. The presence of 52 HCoV-229E-positive sera in SARS-CoV-negative patients suggests that these patients were or had been infected with HCoV-229E. The observation that many (150) patients are SARS-CoV-positive and lack HCoV-229E antibodies indicates that HCoV-229E and SARS-CoV infections can occur independently of each other. Because these sera were not tested for HCoV-229E infection, the number of false positives and negatives could not be scored. Nonetheless, these results indicate that our approach can likely be used to diagnose infections from related human coronaviruses.
Prediction performance of the two classification methods
In this study, we present the construction and use of a coronavirus protein microarray to screen human sera for antibodies against human SARS and related coronaviruses. We tested >600 sera from two different parts of the world and predicted the nature of serum samples with >90% accuracy. To our knowledge, it is the largest study of this type conducted thus far, and the first to analyze patients from the two major geographical locations of the SARS epidemic. We compared our results with the currently available methods and showed that the coronavirus protein microarray is at least as sensitive as and more specific than the available ELISA tests and has the advantage that multiple antigens from different coronaviruses are tested simultaneously. Thus, this system has enormous potential to be used as an epidemiological tool to screen human and other sera for many types of viral infections as well as other types of disease (e.g., cancer).
Sensitivity and Accuracy of the Protein Microarray Assay.
Using the Euroimmun IIFT plus epidemiological data as reference, the protein microarray assay offered several advantages relative to the commercially available Euroimmun ELISA. First, the assays were sensitive and functioned at high dilutions, allowing small amounts of sera to be used (1/200 dilution was used here instead of the 1/50 commonly used in ELISAs). This is particularly important for SARS research, because the sera are extremely precious and not replaceable. Consistent with an increased sensitivity, more Chinese patients were diagnosed as SARS-positive by using the protein microarray over the Chinese ELISA. Second, the accuracy of our assay is as good as, if not better than, the Euroimmun ELISA: 92% vs. 91% accuracy. Third, our assay has greater reliability, in that multiple antigens are followed, and a weighted scoring scheme based on probabilities was developed, instead of relying on the results of one or a mix of antigens. To our knowledge, a probabilistic test of this type has not been described previously for viral detection using sera, and we expect this approach to be of general utility. Fourth, our assay can monitor the presence of antibodies to multiple viruses allowing their potential simultaneous detection. Fifth, our assay can be automated to robotically probe hundreds of sera in parallel, a major advantage over the visual analysis in IIFT. Finally, unlike IIFT, in which results can be masked by the presence of high concentrations of antinuclear factor (60 such patients were present in our study), the protein array is not affected by such antibodies.
One concern with using protein microarrays is the reproducibility of the assay. After unblinding of the initial screening, we retested the ≈30 sera that exhibited either false-positive or -negative reactions; 22 were correctly reclassified. Furthermore, retesting 97 sera that were correctly classified but were close to the borderline resulted in misclassification of 13%. These results indicate that the assay as performed is 90% reproducible. The reason for this variation is currently unclear. Probing sera in triplicate will increase the reproducibility of the assay to 98% if the majority results are scored.
A subset of eight sera yielded false-negative results, whereas the patients had been classified as SARS-CoV cases using clinical and laboratory tests. This misclassification by the protein microarray assay occurred regardless of the array interpretation method used. We presume that either these patients were misclassified clinically, or IIFT is a more sensitive assay than the protein microarray. Possible explanations for the latter include that IIFT was tested at a lower serum dilution (1/10) as compared to the arrays (1/200), or that the SARS proteins had been purified from yeast cells, which have different posttranslational modifications compared with those of mammalian cells. Some sera may recognize glycosylated antigens modified in humans that are not present on the antigens prepared in yeast (see ref. 28). Consistent with this hypothesis, the infected sera primarily recognized the SARS-CoV-encapsulated N protein but none of the six surface glycoproteins. The purification of viral proteins from human cell lines should relieve this problem.
Specificity of the Coronavirus Microarray for Detecting Different Viral Infections.
Most of the human sera did not crossreact with antigens from other species, indicating the assay is specific. However, 82 individuals had antibodies reactive to HCoV-229E antigens. These were observed both in SARS-CoV-positive and -negative patients. Because these antibodies were observed in both types of patients, the simplest explanation is that these patients were exposed to HCoV-229E (or a closely related virus). It is unlikely that the antibodies present in SARS-CoV-infected patients crossreact with HCoV-229E antigens, because HCoV-229E and SARS-CoV belong to different phylogenetic groups, and their N antigens are only 27% identical. Thus, we expect our protein microarray assay monitors exposure to several types of coronaviruses.
In summary, we have constructed coronavirus protein microarrays that cover proteins from six coronavirus proteomes and have used them to classify sera from potential SARS-infected patients. The approaches developed here are applicable to potentially all viruses and are expected to have great impact in epidemiological studies and possibly in clinical diagnosis.
Materials and Methods
The 399 serum samples tested from Canada included 40 acute and 164 convalescent sera from 92 patients who met the clinical and laboratory criteria for SARS-CoV infection during the 2003 Toronto SARS outbreak. Sera from 112 Toronto patients who presented with non-SARS respiratory illness and 83 sera from health professionals were also included. None of the acute, all 164 of the convalescent, and 17 of the sera from 12 healthcare workers demonstrated IgG antibodies as detected by using the Euroimmun IIFT test. All positive results were repeated, and any unexpected result was confirmed by using the SARS-CoV neutralization assay. The Chinese samples were collected from several hospitals in Beijing by the Beijing Genomics Institute. These sera were collected from 147 nonconfirmed fever and 56 respiratory patients (36 confirmed SARS patients and 20 non-SARS individuals).
Preparation of a Coronavirus Microarray.
The SARS ORFs were amplified by RT-PCR from the SARS-CoV isolate BJ01 (GenBank accession no. AY278488) and cloned into a yeast GST expression vector (pEGH) described previously (12). The same approach was used for the cloning of other coronavirus genes. All clones were confirmed by sequencing their inserts.
The constructs were transformed into yeast, and proteins were purified as described (13). For samples that exhibited low yields, the purification was repeated by using 50-ml cultures and/or up to four purifications. The coronavirus protein microarrays were fabricated by spotting the purified proteins along with positive control proteins onto eight-pad FAST slides (Schleicher & Schuell) using a microarrayer (Bio-Rad). The printed arrays were incubated overnight at 4°C and stored at −20°C.
Serum Assays on Coronavirus Protein Microarrays.
Explaining free and open source software
Communicating the benefits and limitations of free and open source software to the less technically experienced can be (to say the least) challenging, so before attempting to do so, the smart professional will prepare himself or herself with three things:
- A thorough (but concise) description
- The ability to correct misconceptions
- A list of open source applications an organization can (and may already) benefit from.
What It Is
In the introductory technology course I teach for graduate LIS students at Simmons College, students usually grasp the concept of compiled software (vs. scripting) fairly easily, so I often approach the topic of open source as a metaphor.
Imagine that it's your job to buy a cake for a co-worker's birthday. You go to the bakery to see what they've got on display and you find a lovely white cake with a beautiful pink icing with “Happy Birthday!” flowing across it. It's even kosher. But there's a problem. The frosting is pink and the birthday person just can't stand that color. There are other pre-made cakes available but they have even more things wrong with them, like the chocolate cake with “Sto Lat!” written on it. (That's Polish.) You can order a custom cake, of course, but it's more expensive, takes a week and you have your doubts that the bakery can meet your specific demands. What do you do?
Why don't you bake your own cake? It would be pretty hard if you had to guess at the ingredients and just experiment with various ratios of flour, butter and sugar. You might be better off with the pink cake from the bakery. But that's why Betty Crocker invented new recipes. It's a proven way to create a cake. (And a list of ingredients is freely redistributable, by the way.) You can fiddle with the recipe to your heart's content to make the exact cake you need. And when you're done, you can give the list of ingredients to anyone else facing a similar situation.
Once the penny drops, I go on to explain that free and open source software is a lot like the cake that you made from the recipe. It's a creation that owes a lot to the person or people who created the original recipe, but since you were given the building blocks to create this end result, you had the opportunity to alter them to suit your needs. Ultimately, your modifications should help others who have needs similar to yours, and often, they might even be the people from whom you got the recipe in the first place. It's an evolutionary process that feeds back into itself.
Often, a student will then ask where open source software may be purchased. God bless him. I have heard colleagues ask the same thing. While the concept of direct access to application source code can be described in metaphor, the topic of licensing involves taking a look at licenses, those things we all like to click past in our rush to install software. It is necessary to broach the topic, however, since the license is truly what makes free software “free” and open source software “open.” It is also useful to understand the difference between free software, as defined by the Free Software Foundation (FSF), and open source software, as defined by the Open Source Initiative (OSI).
The free software definition was published in 1986 by Richard Stallman, president then and now of the FSF. The definition codifies four essential freedoms that computer software users should be entitled to:
- The freedom to run the program for any purpose.
- The freedom to study how the program works and adapt it to your needs.
- The freedom to redistribute copies so you can help your neighbor.
- The freedom to improve the program and release your improvements to the public, so that the whole community benefits.
With the emphasis on these four freedoms, free software is software that end users have freedom to alter, run and redistribute as they see fit. The label “free software” often causes confusion. “Free,” of course, also carries the connotation of “without cost,” when in reality, cost is not a criterion for free software. To address the ambiguity between “free as in free speech” versus “free as in free beer” mentioned in the free software definition, over the years a number of alternative terms have been suggested, mostly using non-English words that have unequivocal definitions. Software libre, for example, uses the Spanish and French adjective meaning “free” in the same sense as “liberty.” Similarly, software licensed free of charge is sometimes labeled software gratis. While these (and similar) terms have been adopted in non-English speaking countries - and while the FSF officially supports any term conveying the concept of liberty - software gratis and software libre have not been widely adopted in the United States. Instead, cost-free software is generally labeled “freeware” and the main FSF-approved label for liberty-infused software in use is “free software.”
This confusion, along with the confrontational activist stance of the FSF in defending these freedoms, led to the formation of the Open Source Initiative in 1998 and the open source definition. Ten criteria must be met in order for a software distribution to be considered open source:
- 1. Free redistribution - The license must allow end users to redistribute the software, even as part of a larger software package and may not charge royalties for this right.
- 2. Source code - The distribution must make the source code freely available to developers.
- 3. Derived works - The license must permit modifications to be made to the software for redistribution under the same license.
- 4. Integrity of the author's source code - The license may require that modified distributions be renamed, or that modifications be made via patch files rather than modifying the source code.
- 5. No discrimination against persons or groups
- 6. No discrimination against fields of endeavor - This includes commercial or controversial endeavors.
- 7. Distribution of license - The same license must be passed on to others when the program is redistributed.
- 8. License must not be specific to a product - A program may be extracted from a larger distribution and used under the same license.
- 9. License must not restrict other software - The license cannot prescribe the terms of other software with which it is distributed.
- 10. License must be technology-neutral - The license cannot restrict the use of the program to any individual interface or platform.
While the Open Source Initiative has approved over 50 different licenses as meeting the criteria of the organization, a list of the nine most widely used licenses is a sufficient sample to get an overview of the different restrictions and freedoms they provide.
Despite the few ideological differences between open source and free software, for practical purposes they provide the same basic advantages (and challenges) in a library or information science setting. For this reason, they are often referred to under a collective term such as “free and open source software,” FOSS, F/OSS or other terms.
What It Is Not
A more difficult task than defining free and open source software for novices is combating the inevitable assumptions and misinformation that materialize. The most common misconception, alluded to above, is that since the source code is freely distributed without royalty or licensing fee, open source applications are free of cost. While it is possible for a library or other organization to avoid buying a proprietary software package, open source may carry a plethora of hidden costs in development and maintenance, particularly if any customization is to be made to the software. These costs may translate into salaries for additional technical staff or possibly external support, development and/or hosting services such as the consulting service LibLime.
In my experience, the staff most susceptible to the “free lunch” myth are overeager novice technicians or new librarians hoping to stretch dwindling budgets as far as possible. Veteran administrators, on the other hand, seem to regard free and open source solutions with suspicion. While some of their caution is of the “you get what you pay for” variety, the absence of an accountable vendor causes distress to some of the old library guard. Without a contract to point to, non-technical administrators seem to feel that development of library services is in the hands of an unknown middle-aged, unemployed social misfit coding in his parents' basement at three a.m. between reruns of Stargate and Star Trek: Deep Space Nine. It is an image closely associated with hacker culture and all the bugaboos associated with it, such as Bill Gates' attempt to frame FOSS advocates as “modern day Communists.” It is a valid point - not the image of the basement coder, but rather that the people currently working on any given open source application are not doing so to win customers; they are attempting to solve specific problems.
Richard Stallman clearly illustrates this clever problem-solving ability of developers in his 2002 article “On Hacking.” “Playfully doing something difficult” is hacking, according to Stallman, as opposed to the criminal connotation sometimes associated with the word. A hacker enjoys puzzles and finding efficient solutions to seemingly insurmountable problems, something that libraries and information centers seem to have no lack of. An organization must hire either staff or consultants with this special ability to contribute to the larger problem-solving community if its unique concerns are to be addressed in an open source application. It is an entirely new and exciting paradigm to invite these creative people into our space, rather than knowing they are sequestered out of reach behind a vendor's competitive gate.
Along the same lines, non-technical staff are often concerned about the perceived absence of technical support available for open source projects. Library vendors charge an incredible amount of money to support the software they license, typically in the form of yearly maintenance fees. A significant portion of a technology operating budget can be spent this way in exchange for the privilege of calling the vendor when the software fails (one hopes 24/7, but sometimes only 9-5 weekdays), when a bug is identified (that perhaps the vendor has already documented but about which it has not thought to inform customers) or when documentation proves incorrect or inadequate (sometimes referred to as a “training issue” by vendors). On occasion, the maintenance fee even entitles customers to conversations such as this one.
Me: … So, the end result is that the online catalog informs all our patrons that they will be notified by phone when their holds arrive.
Vendor Representative: Okay.
Me: Even if the individual patron will actually be notified by email or snail mail.
Vendor Rep: I see. But the default notification method for the location is “phone.”
Me: I know, but the patron's choice overrides this when the notice is sent.
Vendor Rep: Yes. But this is the way it was designed to work.
Vendor Rep: Well, you're certainly welcome to fill out an enhancement request…
While not all vendor support experiences are as fruitless as the tongue-in-cheek examples above, technical staff will recognize that the promise of support from a proprietary software vendor rarely matches the value - monetary or quality - attributed to it by the vendor or non-technical library staff. Support from a hardware vendor can be exemplary without much effort. If a component breaks, swap it out with a working piece. Software packages, however, are complex systems that usually do not function in such a modular fashion. Well, not in proprietary applications, at any rate.
Free and open source software application users, on the other hand, must rely on development communities for support. Users and developers produce documentation, write installation guides and answer specific support questions in forums, not because they are bound by contract, but because they can learn more about the software that they, too, are using. One can see this type of activity in proprietary library software user groups as well. Indeed, many systems librarians around the world find user group listservs much more illuminating about how an application works than vendor-supplied documentation or training. Many systems librarians also spend a large amount of time writing scripts, developing external applications or finding unintended creative uses for application features. Vendors sometimes even celebrate these accomplishments at annual user group conferences. That recognition is nice, but the upshot of this type of grassroots support of proprietary software is summarized most succinctly by a meme I hear frequently repeated by my colleague Michael Klein: “The workaround has become the work.” When practical issues with open source software are addressed by those using the software, however, the solutions can be immediately returned to the user community as a new release. A user's contribution is not just a workaround. It is the essential work. No vendors are needed and it will not matter if you missed the user group conference.
Misconceptions such as those mentioned above can transform into terrifying specters sure to doom the success of any open source project at an organization. Nevertheless, a well-prepared technical staff member at a library or information-centric organization can circumvent misunderstandings and turn stakeholder anxiety into excitement, provided that an open source alternative is truly the best option. A full cost-benefit analysis should be performed, taking into account all of the factors mentioned above for both proprietary and open source alternatives, including the following:
- Recurring fees, such as maintenance
- Personnel costs for development and maintenance cycles
- Amount and time of additional development required for missing features
- Amount and time of workarounds required for missing features
- The benefit of contributing to the support community
- The “lock in” aspect of committing to a proprietary model
- The ease with which one can (or can't) migrate to a new platform, if necessary.
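The quantifiable factors in the list above can be tallied in a few lines of code. The following is only a minimal sketch: the `Option` class, the `total_cost` function and every dollar figure are hypothetical placeholders invented for illustration, not data from any real library or vendor.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One software alternative; every figure is a hypothetical placeholder."""
    name: str
    annual_fees: float           # recurring fees, such as maintenance
    annual_personnel: float      # staff time for development and maintenance cycles
    one_time_development: float  # building missing features or workarounds
    migration_cost: float        # estimated cost of leaving the platform later


def total_cost(option: Option, years: int) -> float:
    """Sum recurring costs over the period, plus one-time and exit costs."""
    recurring = (option.annual_fees + option.annual_personnel) * years
    return recurring + option.one_time_development + option.migration_cost


# Arbitrary numbers for illustration only; substitute local estimates.
proprietary = Option("Proprietary ILS", 30_000, 10_000, 5_000, 40_000)
open_source = Option("Open source ILS", 0, 35_000, 20_000, 10_000)

for opt in (proprietary, open_source):
    print(f"{opt.name}: {total_cost(opt, years=5):,.0f} over 5 years")
```

A tally like this covers only the factors that have a price. The benefit of contributing to the support community and the risk of lock-in still have to be weighed as judgment calls, and the comparison can tip either way depending on the estimates entered.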
If after this analysis, free or open source software seems to offer significantly more benefits, developing a quick fact sheet for administrators comparing a specific FOSS application to a known proprietary equivalent can quickly impress. Above all, a functional prototype created in the FOSS application will dispel concerns that the software is somehow rudimentary or experimental. Another oft-repeated meme among my colleagues is “working code works,” and it is true. Nothing illustrates your point better than illustrating your point.
You're Soaking In It
If conversations with non-technical staff veer off into the uncomfortable realm of doubt and uncertainty, they may start asking questions like “Are we sure that we are ready to invest in open source software?” or “Do you think we have investigated enough to commit to open source?” This moment is an excellent opportunity to pull out your best impression of Madge, the Palmolive manicurist, and quip, “Commit to open source? Why, you're soaking in it!”
The pervasiveness of the World Wide Web guarantees that nearly every information organization is using free or open source software to perform some function. For example, 43.7% of web browsing is being done with Firefox, an open source application, and Internet Explorer is steadily losing its lead. It only has 50.5% of the market currently. Similarly, 49.82% of web servers are running Apache, which has retained its first-place spot over Microsoft IIS for 12 years. It seems that every month, another visible library announces its website redesign in Drupal. And if an organization hosts a blog or a wiki, the chances that it is an open source package are pretty good. In fact, the chance that your organization's hosted blog is powered by WordPress is pretty good simply because it is supported by one of the most active open source communities in cyberspace.
|Operating Systems||Web/Proxy Servers|
|Instant Messengers||Content Management Systems|
Realizing that your organization is already a hybrid environment helps administrators and staff realize that a relationship with open source can be more like a respectful, close friendship than a toxic, codependent marriage. Open source will let you develop relationships with other software packages, open or proprietary. It is true that there are some great philosophical justifications for using FOSS. Just as many of us cannot ride a bicycle to every destination and instead opt to buy a hybrid car as a compromise, one can consider those noble justifications while being “as open as possible.” (“AOAP” is the meme, as Mr. Klein reminds me.)
A commitment need only be as deep as required by the organization and exploration can be done without an intention to replace proprietary software currently in place. Much has been written about the open source integrated library systems Koha and Evergreen, but if an organization currently has an ILS in place, it may still be worthwhile to install an alternate web OPAC like Scriblio, SOPAC or VuFind instead. The installation can live alongside the current system and, if presented as a public beta, can provide useful data regarding what users prefer from a catalog interface. There are hundreds of library applications in development, though it is advantageous to choose a project with an active development community.
|Repositories||MARC Module for Drupal|
|Digital Asset Factory||Scriblio|
Open source software can unnerve staff and administrators who do not have a full understanding of the concept, the myths and the all-around usefulness of it. Developers and technical staff who can communicate these three things will find it much easier to integrate some truly innovative software into their organization's technical environment.
Copyright © 2008 American Society for Information Science and Technology
Resources Cited in the Article
- 1. U.S. Copyright Office (2008). Recipes. Retrieved August 24, 2008, from www.copyright.gov/fls/fl122.html
- 2. Free Software Foundation (2007). The free software definition. Retrieved August 24, 2008, from www.fsf.org/licensing/essays/free-sw.html
- 3. Tiemann, M. (2006). History of the OSI. Retrieved August 24, 2008, from www.opensource.org/history/
- 4. Coar, K. (2006). The open source definition. Retrieved August 24, 2008, from www.opensource.org/docs/osd/
- 5. Nelson, R. (2006). Open source licenses by category. Retrieved August 24, 2008, from www.opensource.org/licenses/category/
- 6. Brown, A. (2005, January 11). The war on copyright communists: Bill Gates wants software patents to protect his profit, not the public. The Guardian [London, England], p. 22.
- 7. Stallman, R. (2002). On hacking. Retrieved August 24, 2008, from www.stallman.org/articles/on-hacking.html
- 8. W3Schools (2008). Browser statistics. Retrieved September 1, 2008, from www.w3schools.com/browsers/browsers_stats.asp
- 9. Netcraft, Ltd. (2008). August 2008 Web server survey. Retrieved September 1, 2008, from http://news.netcraft.com/archives/2008/08/29/august_2008_web_server_survey.html