
Identifying differentially expressed genes in cDNA microarray experiments: making aging visibleH. Bengtsson^{1}, B. Calder^{2}, I.S. Mian^{3*}, M. Callow^{4}, E. Rubin^{4}, T.P. Speed^{5,6}October, 2001. 
This page accompanies the SAGE KE Discussion Forum on Identifying differentially expressed genes in cDNA microarray experiments: making aging visible.
On this page:  1. Introduction, 2. Experimental setup, 3. Image analysis, 4. Normalization, 5. Identifying differentially expressed genes, 6. Results. 
Transcription profiling monitors simultaneously the abundance of tens of thousands of transcripts in a biological sample. In biogerontology, such profiling is but one arrow in the quiver required to ascertain how the physiological vigor of an organism declines over time. Currently, expression profiling technology comes in a variety of flavors including cDNA microarrays, highdensity nylon membrane arrays, serial analysis of gene expression, short or long oligonucleotide arrays, fiber optic arrays, and so on [20]. The exponential rise in the number of profilingrelated MEDLINE articles is a testament to its popularity. If this experimental tool is to deliver on its promises, however, a growing list of statistical, computational, technological and biological issues will need to be addressed (for incomplete lists, see [21,22]).
This discussion forum focuses on the elemental, but critical issue of statistical approaches to identifying differentially expressed genes given cDNA microarray data (it should be noted that many of the same issues arise with the other technologies). The nature and type of data to be analyzed as well as a need to distinguish between biologically and statistically significant differential expression highlights the multiplicity of potential answers to such a deceptively simple question.
The starting premise for the forum is that a suitable scientific question has been formulated, the appropriate study designed, standard operating procedures employed to extract mRNA samples from relevant specimens, the transcription profiling experiments performed, and images of hybridized microarrays acquired. Accounting for technological and/or experimental confounding factors can reduce observation noise, errors associated with the measurement process or technology itself, but has little influence on model noise, variation due to the stochastic nature of gene expression and the underlying biology. It is assumed that strategies for minimizing observation noise and data analysis are independent of the scientific question. That is, aging is sufficiently similar to cancer, development and other biological processes so that no "new" statistical methods need to be invented to analyze transcription profiling data from biogerontology studies. Thus, employing a study of mouse lipid metabolism in mice ([1]) as the framework within which to structure the discussion forum and to illustrate pertinent issues seems appropriate.
The background corrected fluorescent intensities as well as the eight original scanned images processed using the Spot software [2] for the ApoAI experiments can be downloaded from [3]. The statistical analyses described on this page were performed using the objectoriented addon com.braju.sma [4] to the SMA (Statistics for Microarray Analysis) package [5] written using the R language [12,13]. How it can be done in R is described in the R program [19].
The outline of this article is as follows. In section 2, we describe the setup of the experiments of the study and in section 3 we shortly discuss problem of extracting the signal from the scanned images. In section 4, we will identify systematic variations in the data that do not have a biological source, but are believed to stem from different error sources along the cDNA microarray process, and we will show how different normalization methods can correct for these artifacts. In second 5, we discuss the problems and methods of how to identify differentially expressed genes. Throughout the article we are using the ApoAI data set to exemplify the different ideas. In section 6, we give a list of 40 genes that we identified to be differentially expressed. At the end, we also do a comparison between the genes that are found when Spot is used and those that are found when GenePix is used.
The goal of the study of mouse lipid metabolism in mice was to identify genes with altered expression in two mouse models with very low HDL cholesterol levels (treatment groups) compared to inbred control mice. The first mouse model compared eight mice with the apolipoprotein AI (ApoAI) knockedout with a control group consisting of eight "normal" C57B1/6 mice. The second mouse model compares in a similar way eight scavenger receptor BI transgenic mice (SRBI) with the same control group. To identify the altered gene expressions a cDNA microarray technique was used. For the first model target cDNA was obtained from mRNA for all sixteen mice by reverse transcription and labeled using the redfluorescent dye Cy5. The reference sample was prepared by pooling cDNA from the eight control mice and was labeled with the greenfluorescent dye Cy3. An analog setup was used for the second mouse model. The mix of target and reference cDNA was hybridized to microarrays containing N=5548 cDNA probes. The slides contained 6384 spots, but some of them where empty spots. Here we will focus on the comparison between the eight ApoAI mice (K=8) and the pool of control mice without making use of the fact that a comparison between each of the eight control mice with a pool of the same mice was also done. To include the eight slides from the control group in the analysis both twosample ttests and adjusted pvalues are required, an approach which is beyond the scope of this article. For an analysis of the ApoAI data set using twosample ttests, see [14].
Raw data for one cDNA microarray slide typically consists of two scanned 16bit images. During the scanning a green and a red laser excite the Cy3 and Cy5 dyes and the emission is registered. From these two images at least four different measures for each spot are extracted. The foreground signals, R_{fg} and G_{fg}, are often measured as the mean or the median of the intensities of the pixels that are identified to belong to the spot. There are several ways to define what the background of a spot is and different definitions often result in different background estimates, R_{bg} and G_{bg}. All background identification methods strive to get a representative estimate of the background noise in the area of the spot, a noise that is also expected to be added to the foreground signals. With a good estimate of the foreground and the background signals it is reasonable to believe that the background subtracted signals, R=R_{fg}R_{bg} and G=G_{fg}G_{bg}, reflects the gene expression levels. It is important to remember that the spots are not always easily identifiable regions in the images, but can often be weak, smeared out as comets (as an effect from the washof of the polyLlysine). It also happens that two or more spots are so large that they overlap or that the microarray slides are contaminated with dust and scratches. For reasons like these it is important to have good models that estimates of the foreground and the background signal. Yang et al [8] compare the different foreground and background identification and estimation methods that are used by the most commonly used image analysis software, such as GenePix [9], QuantArray [10], ScanAlyze [11] and Spot [2]. The authors show that the latter gives better foreground and background estimates than the others.
Before any method for identifying differentially expressed genes can be applied, the data has to be normalized. Normalization is a process of removing systematic variation in microarray experiments, which affect the measured gene expression ratios. The sources for these artifacts can be many and they often reveal themselves in different ways. For instance, a too large imbalance in the PMT (photomultiplier tube) voltage setting for the red and the green laser in the scanner will give a bias in the fluorescence intensities towards one of channels. Another reason for having such a bias could be due to different physical properties of the two dyes (green Cy3 and red Cy5 dye), such as differences in halflife, size and weight, and heat and light sensitivity. Other systematic variations could be due to variations in the quality of the cDNA clones printed on the cDNA microarray slides. Varying printing conditions, such as temperature and humidity, during the printing process might further contribute to the variability in the measured gene expression levels, or, as shown below, there could be problems with the printtips.
The normalization process can be dived into two main steps. The first step identifies and normalizes for artifacts within each slide. The withinslide normalization is applied on each slide independently of the others. To be able to compare the measured gene expression levels between replicated slides the spread of expression levels must be the same for all slides. This is taken care of by the second normalization step called betweenslide normalization.
Since the quantity of interest is the relative gene expression level and R and G typically range from 0 to 65535, we transform the data by taking the logarithm of the ratio between the red and the green signals, M=log_{2}(R/G). Another interesting quantity is the average log intensity of the two signals, A=1/2·log_{2}(R·G). The rational for the factor 1/2 is that A will have the same natural range as log_{2}(R) and log_{2}(G) and logbase two is used because the data is originally binary. It is easy to show that this transform is reversible (and therefore the information contained in R and G is preserved). We have that
M = log_{2}(R/G),
A = 1/2·log_{2}(R·G) <=> R = (2^{2A+M})^{1/2}, G = (2^{2AM})^{1/2}. 
(1) 
Many cDNA microarray experiments are set up in such a way that most of the genes are not expected to be differentially expressed. In these cases an M vs. A scatterplot will be more revealing than an R vs. G scatterplot, because most of the data points in an R vs. G plot will be distributed around the diagonal line R=G whereas the same data points in an M vs. A plot will be distributed around the horizontal line M=0. In figure 1 an R vs. G and an M vs. A scatterplot for the same set of data are shown. The normalization methods discussed below are operating in the (M, A) space. Also, if the noise in the two channels is multiplicative, i.e. R=c·R_{true} and G=c·G_{true} the transform to M will remove this noise.
Figure 1. Left: An R vs. G plot (log base 2) of the data points for the last slide in the ApoAI knockout experiment. Right: The same data points plotted in a M vs. A plot (log base 2). M is the log ratio between the red and the green channel and A is the intensity. The dashed lines R=G and M=0 are where the signals in the two channels are equal. 
It is important to have in mind that all the normalization methods used below are based on the assumption that there is only a small fraction of outliers (12%). We believe that this assumption is true for the ApoAI data set.
Taking the mean (or the median) of the log ratios for each slide we should, since we assume that most of the genes to be nondifferentially expressed get values close to zero. A boxplot of the M values for each of the eight ApoAI slides shows that this is not the case (cf. left plot in figure 2). For the slides 2 & 3, 58 the relative expression levels are biased towards the green channel (downregulated genes) and for the slides 1 & 4 there is a small bias towards the red channel (upregulated genes).
Figure 2. Left: Boxandwhisker plot showing the sample distribution of the log intensity ratios for each of the eight ApoAI slides. Since we assume most of the genes to be nondifferentially expressed, ideally this plot should show a median around zero for all slides. Right: M vs. A scatterplot with a lowessfitted curve revealing that the log ratios are intensity dependent. 
A first approach to normalize the data would simply be to subtract the bias. However, a M vs. A plot, e.g. for the data from the third ApoAI slide, shows that the shift in not only a constant shift, but an intensitydependent shift (cf. right plot in figure 2). The solid line in the plot is the lowessfitted curve. The lowess function uses robust (not much affected by outliers) locally linear regression to smooth scatterplots [12,13]. All slides have this intensitydependent property. This suggests that an intensitydependent normalization [3] is needed:
M_{norm} = M  c(A)  (2) 
where c(A) is the lowess fit to the M vs. A plot.
The printing step in the cDNA microarray measurement process can also introduce systematic differences. In the ApoAI data set it is believed that there is an artifact present due to variation in the quality of the print tips [14,15]. This could be due to different lengths of the printtips making some of them leaving more cDNA on the slides than others, which in turn, due to dilution, could result in different concentration of cDNA in the wells on clone plates. Another reason could be deformation of some of the printtip slits after hours of printing. For the ApoAI microarrays a printhead of fourbyfour printtips were used (J=16). Each printtip printed a group of 399 spots, called a printtip group (or grid). The printtip groups are laid out on the slide from lefttoright and from toptobottom counting from the upper left corner to the lower right corner (cf. legend in figure 4). Within each printtip the spots are printed from lefttoright and from toptobottom. In other words, the sixteen spots in the upper left corner in the sixteen printtip groups were printed at the same time, then the spots right next to them were printed and so forth. Even if the slides were printed with a wellorganized clone library, this mapping of spots assures that any interesting genes are located "all over" the slide. For this reason we can assume that the "true" logratios have the same zeromean distribution across the printtip groups. A boxplot of the logratios for the sixteen groups (cf. figure 3) shows that there are systematic differences between the groups and especially that the printtip groups 1316 are different from the others. A plot of the spatial distribution of the top 5% absolute logratios (cf. figure 3) confirms this. This suggests that some kind of normalization within each printtip group is needed.
Figure 3. Before normalization. Left: Boxandwhisker plot of the data points in each of the sixteen printtip groups. The lower four groups, 1316, shows a different behavior than the rest. A perfect slide would have the same average and variance across the printtip groups. Right: The spatial position of the top 5% absolute logratios (M) on the slide. Each small rectangle represents the logratio of a spot and the large fourbyfour rectangles represent the printtip groups. It is clear that there is, especially in the lower four groups, some kind of artifact. 
The (global) lowess normalization method above can be applied to each printtip group individually. The sixteen lowess curves in figure 4 confirms that the groups 1316 are different from the others and that the differences are intensity dependent. Printtip normalization [15] is
M_{j,norm} = M_{j}  c_{j}(A_{j}); j=1,2,...,J  (3) 
where J is the number of printtip groups and c_{j}(A_{j}) is the lowess fit of the log ratios M_{j} and the intensities A_{j} of the spots belonging to printtip group j. When doing a withinprinttip normalization a slideintensitydependent normalization as discussed in the previous section is unnecessary.
Figure 4. The M vs. A plot with sixteen lowessestimated curves, one for each of printtip groups. The legend in the lower right corner shows how the printtip groups are numbered, starting with one in the upper left group and counting lefttoright to sixteen in the lower right group. The data from the four lowest printtip groups, 1316, have different properties than the others. 
Doing a boxplot and a spatial plot (not shown) similar to the ones in figure 3 one gets that the printtip normalization method not only removes the bias and the intensity dependence, but it also removes some of the spatial effects and makes the differences in variance between the groups smaller. However, there might still be differences in the variance, which can be normalized by scaling the logratios within each printtip group. The scale method used in [15] assumes that the logratios in each group, j, follow a normal distribution with zero mean and a variance s_{j}^{2}. Not showing the details, it is straightforward to scale the printtip groups so they get the same variance s^{2}. To assure that the scaling is not affected by outliers the robust statistic MAD (median absolute deviation) is used to estimate the individual variances. The printtip normalization method followed by a scaling step is called the scaled printtip normalization method. The result of scaled printtip normalization is shown in figure 5.
Figure 5 After withinslide normalization. Left: Boxandwhisker plot of the normalized data for each of the sixteen printtip groups. The scaled printtip normalization method first adjusts the data in each printtip group such that their lowess lines became zero for all intensities. Second, it scales the data within each group such that their variances become the same. Right: The spatial distribution on the slide of the top 5% absolute logratios (M) from ApoAI slide 8. 
Scaled printtip normalization applied to all slides separately removes the intensity dependency of the logratios within the printtip groups and this also removed the intensity dependency and the bias within the slides. However, the spread of the logratios might still vary from slide to slide and the same scaling technique that was applied to the printtip groups may be used to scale variances between slides. Betweenslide normalization makes it possible to compare the gene expression levels measured by the different slides. The boxplot in figure 6 shows that scaled printtip normalization followed by betweenslide normalization makes the slides comparable (cf. figure 2). The "average" slide is shown as an M vs. A scatterplot in figure 6. The bars over the axis labels denote the mean of M and the mean of A, respectively.
Figure 6. Left: Boxandwhisker plot showing the sample distribution of the log intensity ratios of the eight ApoAI slides after scaled printtip normalization and betweenslide normalization. Right: The "average" slide where the mean of the logratios and the mean of the intensities have been calculated for all spots across the slides. 
The most simple way to select genes being differentially expressed is to stratify on the absolute average logratios, e.g. by choosing genes that have an M>M_{min}. The main argument against using the average M value for deciding if a gene is differentially expressed or not, is that a large mean might be driven by an outlier, something which is not rare for cDNA microarray measurements. Thus, a better statistic to be used is the traditional tstatistic:
t_{n} = M_{n} / SE_{n} ; n=1,2,...,N  (4) 
where M_{n} is the (sample) mean of the logratios for gene n; (M_{n1}, M_{n2}, ..., M_{nK}). SE_{n}=S / K^{1/2} is the standard error of the mean, where S_{n} is the (sample) standard deviation of the logratios for gene n. Since we do not care if the genes are up or down regulated, we look at the absolute t value, t. For two genes with the same mean of logratios, the gene with the smaller standard error will get the larger t value. The rational for this is that if repeated measurements of a gene expression ratio give very similar results, i.e. the standard error is small, we should trust these measurements more than in the case where they differ a lot.
There are problems with using the tstatistics. A gene with a small absolute mean might get a large t value only because of a small SE, which might not be the true reflection of the gene. Remember that we are not only comparing t values from two genes but from several thousands of genes. For this reason, by chance there will be genes that have larger t values because they have small SE, not because they are differentially expressed. An ad hoc way to solve this problem is to discount genes with a small absolute mean whose standard errors are small. A small SE could be defined as SEs that are among the bottom 1%. Tusher et al. [16] presented another solution to the problem where they add a constant to each SE. Their statstic is referred to as the Sstatistic.
A better and more correct method for dealing with the above problem and for identifying differentially expressed genes is an empirical Bayes method developed by Lönnstedt et al. [17]. The authors introduce a new statistic called the Bstatistic. The B value calculated for gene n is a logposterior odds ratio for the gene to be differentially expressed against it is not differentially expressed given the observed logratios for all genes and all experiments:

(5) 
where n=1,2,...,N. The probability model behind this Bstatistic assumes that the K replicated logratio measurements for gene n, M_{nk}; k=1,2,...,K, are normally distributed with mean m_{n} and variance s_{n}^{2}. However, the mean and the variance are not treated as fixed parameters but as random variables themselves and therefore we look at the distribution of M_{nk} as a conditional distribution M_{nk}m_{n}, s_{n}^{2} ~ N(m_{n}, s_{n}^{2}). The assumption is still that most genes are nondifferentially expressed, i.e. they have m_{n}=0, but a small proportion p of genes have a m_{n}<>0. The expression above can now be written B_{n}=log[P(m_{n}<>0  {M_{nk}} / P(m_{n}=0  {M_{nk}}]. Assuming special conjugatepriori distributions for the mean and the variances (for details see [17]), the authors get the full expression for B_{n} to be:

(6) 
where M_{n} and S_{n}^{2} are the sample mean and sample variance of the M values for gene n and N is the number of genes. Leaving out the details, a, v and c are hyperparameters that are estimated from the logratios from all genes and all slides. Finally, p is the proportion of differentially expressed genes and p is normally the only parameter to be tuned. The Bstatistic works in a similar fashion as the Sstatistic, but the former gives a slightly smaller number of false positives and false negatives [17].
Before presenting the results, a relevant and important question to be asked is how many replicates are that needed to get trustful results. We did all the data analysis discussed above for the case where n=2, 4, 6 and 8 slides were used. Since the number of slides does not affect withinslide normalization we did not have to repeat that step. However, we had to do the betweenslide normalization for each n. The top 40 genes when two, four, six and eight slides were used are listed in table 1. The genes are ranked according to their B values.



Table 1. Genes found as a function of the number of replicates. The table is listing the top 40 (0.6%) ApoAI differentiated genes found when the number of slides used in the analysis is n=2,4,6,8, respectively. In each column, the genes are ranked by their Bstatistic. The number listed are the genes' spot numbers as they occur on the microarray slides. The genes marked with a star (*) are the eight genes listed in [14]. The color and style coding is only used to help locate the same gene across the columns. All genes found when n=8 slides were used are boldfaced. To simplify the identification of these genes in the n=2, n=4 and n=6 columns, they are also color coded and some of them are in italic. Genes not found in the n=8 column are nonbold and colored in silver. 
For n=2 there were no genes with a positive logposterior odds ratio among the top 40 genes. For n=4 there were 12 with positive B values and for n=6 and n=8 all top 40 genes had positive B values. More interestingly, when data from two slides are used, we can only identify 10 out of the top 40 genes found when all slides were used. When four slides were used, 17 out of 40 were found and when six slides were used, 24 out of 40 were found. It must be mentioned that the selection of two, four and six slides was done by taking the first two slides, then the first four and finally the first six slides as they occur in the data set. A more rigorous approach would be to use a bootstrap technique, which repeatedly samples two (four or six) slides from the set of eight slides and then takes the average of the Bvalues for each gene.
Using the Bstatistic on the ApoAI data set we identify a set of genes that we believe to be differentially expressed. The genes are listed table 2, but can also be downloaded in a computerreadable format from [18]. Almost all the identified genes are down regulated, which makes sense since we are looking at a knockout model. A more detailed discussion of the biological relevance of these genes are left to the SAGE KE discussion forum. The B values were calculated assuming 1% of the genes to be differentially expressed (p=0.01). If we instead assume 0.5% of the genes to be differentially expressed (p=0.005) we get that the 40 genes with largest logposterior odds ratio are still ranked the same as in the p=0.01 case.


Table 2. Differentially expressed genes. The columns are 1) the rank (from Bvalues), 2) the spot number, 3) the log ratio, 4) the intensity, 5) the logposterior odds ratio, 6) the Cy5 intensity, 7) the Cy3 intensity, 9) the gene id, and 9) a short description of the gene/est. The list of genes can be downloaded in a computerreadable format from [18]. 
A B vs. average M plot (cf. left plot in figure 7) of the ApoAI data set shows that selecting genes based solely on the average log ratios would not give the same set of genes as if the logposterior odds ratio would be used. In the plot, the 40 (0.6%) genes with highest ranked B values are highlighted red. To the right in figure 7, the same genes are highlighted, but now in an average M vs. average A plot.
Figure 7. Left: The logposterior odds ratio vs. the average log ratios. The Bstatistic is the logposterior odds ratio for each gene, which here is the same as each spot, to be differentially expressed. It is clear from this plot, that identification of differentially expressed genes based only on the average log ratios would not give the same genes as if it was based on the logposterior odds ratio. Small M values tend to correspond to small B values, but the reverse is not true. The genes with the top 0.6% largest Bstatistics are highlighted red. We used the parameter value p=0.01, i.e. we expect about 1% of the genes to be differentially expressed. Right: A scatterplot of the average logratios vs. the average intensities with the same identified genes highlighted. 
In addition to use the Spot software for extracting the signals in the cDNA microarray images, we used the GenePix software. We analyzed the data from GenePix in the exact the same way as we did with the data from Spot and we compared the genes identified. The genes with the top 40 largest B values found by GenePix and the genes with the top 40 largest B values found by Spot are listed in table 3. Both Spot and GenePix resulted in ranking the gene at spot 1496 to be the highest. Spot ranked 2149 second whereas GenePix ranked spot 4941 second and so on. In GenePix's top 40 list there are 15 genes that are found in Spot's top 40 list, or equivalent in GenePix's top 40 list, there are 25 genes that are not found in Spot's top 40 list. Descriptions of these 25 genes are listed in the last column in table 3.
Finally, looking at the B values, Spot seems to produce larger B values than GenePix. This can be interpreted as suggesting that the quality of the data (signal to noise) from Spot is slightly higher than that from GenePix. However, a preferred comparison would be one based on their respective abilities to correctly identify genuinely differentially expressed genes, and on their false positive rates. Such a comparison is not possible at this point.


Table 3. Genes ranked highly by Spot vs genes ranked highly by GenePix. In column 2 are the spot number for the genes found using Spot for the image analysis. In column 6 are the spot number for the genes found by using GenePix for the image analysis. Except for the image analysis the rest of the data analysis was identical. In the last column are the descriptions of the genes that was highly ranked by GenePix but not by Spot listed. For the rest of the description, both for the GenePix and the Spot genes, see table 2. The color and style coding is done in the same fashion as in table 1. 
1  Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden (hb@maths.lth.se) 
2  Health Science Center at San Antonio, The University of Texas, Texas (calder@uthscsa.edu) 
3  Radiation Biology and Environmental Toxicology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California (smian@lbl.gov) 
4  Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, California (mjcallow@lbl.gov, emrubin@lbl.gov) 
5  Department of Statistics, University of California, Berkeley (terry@stat.berkeley.edu) 
6  Division of Genetics & Bioinformatics, Walter & Eliza Hall Institute of Medical Research, Melbourne (terry@wehi.edu.au) 
*  Correspondence should be addressed to Saira Mian, telephone +15104866216, fax +15104866949, email smian@lbl.gov. 
[1]  M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin. Microarray expression profiling identifies genes with altered expression in HDL deficient mice. Genome Research, 10(12):20229, December 2000. 
[2] 
Spot software, CSIRO Mathematical and Information Sciences. http://www.cmis.csiro.au/iap/spot.htm 
[3] 
M. Callow, ApoAI Data, Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley. http://www.stat.berkeley.edu/users/terry/zarray/Html/apodata.html 
[4] 
com.braju.sma  an objectoriented add on to sma (beta) http://www.braju.com/R/com.braju.sma/ 
[5] 
Statistics for Microarray Analysis (sma) http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html 
[6]  R. Ihaka and R. Gentleman, R: A Language for Data Analysis and Graphics, Journal of Computational and Graphical Statistics 5(3):299314, 1996. 
[7] 
The R Project for Statistical Computing http://www.rproject.org/ 
[8] 
Y. H. Yang, M. J. Buckley, S. Dudoit and T. P. Speed. Comparison of methods for image analysis on cDNA microarray data. Technical report #584, University of California, Berkeley. November 2000. http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html 
[9] 
GenePix software, Axon Instruments, Inc. http://www.axon.com/GN_Genomics.html 
[10] 
QuantArray Analysis Software, Packard Instrument Company. http://www.packardbiochip.com/products/quant_soft.htm 
[11] 
M. Eisen, ScanAlyze software, Eisen Lab, Life Science Division, Lawrence Berkeley National Laboratory, Berkeley, California. http://rana.lbl.gov/EisenSoftware.htm 
[12]  Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74:829836. 1979. 
[13]  Cleveland, W. S. LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician, 35:54. 1981. 
[14] 
S. Dudoit, Y. H. Yang, M. J. Callow and T. P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578, University of California, Berkeley. August 2000. http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.html 
[15] 
Y. H. Yang, S. Dudoit, P. Luu and T. P. Speed. Normalization for cDNA Microarray Data. SPIE BiOS 2001, San Jose, California, January 2001. http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html 
[16]  V.G. Tusher, R. Tibshirani and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):511621, April 2001. 
[17] 
I. Lönnstedt and T. P. Speed. Replicated Microarray Data. Statistical Sinica, Accepted. http://www.stat.berkeley.edu/users/terry/zarray/TechReport/Baypap4d.ps 
[18]  List of 40 genes found to be differentially expressed in the ApoAI knockout experiments. http://www.maths.lth.se/matstat/bioinformatics/sageke/apoa1genelist.txt (a commaseparated ASCII file) 
[19] 
Example R program that loads the data, normalizes it and identifies differentially expressed genes using the Bstatistics method. http://www.maths.lth.se/matstat/bioinformatics/sageke/Rcode.txt 
[20] 
L. Shi, DNA Microarray (Genome Chip)  Monitoring the Genome on a Chip http://www.genechips.com/ 
[21] 
T. Speed et al., Statistical problems involving microarray data, University of California, Berkeley http://www.stat.Berkeley.EDU/users/terry/zarray/Html/list.html 
[22] 
W. Li, Papers on microarray data analysis, Rockefeller University http://linkage.rockefeller.edu/wli/microarray/ 
Back to the SAGE KE Discussion Forum on Identifying differentially expressed genes in cDNA microarray experiments: making aging visible.