A Japanese-American Werner patient as a teenager, and at age 48 (Case #1 Epstein et al, 1966, Medicine 45:177). She had eight children, two of whom were also affected. At 48 , she had hair loss and greying, thin extremities, chronic ulcerations of the ankles, atrophy of the skin and her the right eye had been enucleated several years earlier due to acute glaucoma resulting from bilateral cateract extraction at the age of 27. She lived longer than many Werner patients, dying at 57. Source: http://www.hms.harvard.edu/pathol/sinclair/Pages/wernerspage.html

Identifying differentially expressed genes in cDNA microarray experiments: making aging visible

H. Bengtsson1, B. Calder2, I.S. Mian3*, M. Callow4, E. Rubin4, T.P. Speed5,6

October, 2001.

This page accompanies the SAGE KE Discussion Forum on Identifying differentially expressed genes in cDNA microarray experiments: making aging visible.

On this page:1. Introduction, 2. Experimental setup, 3. Image analysis, 4. Normalization, 5. Identifying differentially expressed genes, 6. Results.

1 Introduction

Transcription profiling monitors simultaneously the abundance of tens of thousands of transcripts in a biological sample. In biogerontology, such profiling is but one arrow in the quiver required to ascertain how the physiological vigor of an organism declines over time. Currently, expression profiling technology comes in a variety of flavors including cDNA microarrays, high-density nylon membrane arrays, serial analysis of gene expression, short or long oligonucleotide arrays, fiber optic arrays, and so on [20]. The exponential rise in the number of profiling-related MEDLINE articles is a testament to its popularity. If this experimental tool is to deliver on its promises, however, a growing list of statistical, computational, technological and biological issues will need to be addressed (for incomplete lists, see [21,22]).

This discussion forum focuses on the elemental, but critical issue of statistical approaches to identifying differentially expressed genes given cDNA microarray data (it should be noted that many of the same issues arise with the other technologies). The nature and type of data to be analyzed as well as a need to distinguish between biologically and statistically significant differential expression highlights the multiplicity of potential answers to such a deceptively simple question.

The starting premise for the forum is that a suitable scientific question has been formulated, the appropriate study designed, standard operating procedures employed to extract mRNA samples from relevant specimens, the transcription profiling experiments performed, and images of hybridized microarrays acquired. Accounting for technological and/or experimental confounding factors can reduce observation noise, errors associated with the measurement process or technology itself, but has little influence on model noise, variation due to the stochastic nature of gene expression and the underlying biology. It is assumed that strategies for minimizing observation noise and data analysis are independent of the scientific question. That is, aging is sufficiently similar to cancer, development and other biological processes so that no "new" statistical methods need to be invented to analyze transcription profiling data from biogerontology studies. Thus, employing a study of mouse lipid metabolism in mice ([1]) as the framework within which to structure the discussion forum and to illustrate pertinent issues seems appropriate.

The background corrected fluorescent intensities as well as the eight original scanned images processed using the Spot software [2] for the ApoAI experiments can be downloaded from [3]. The statistical analyses described on this page were performed using the object-oriented add-on com.braju.sma [4] to the SMA (Statistics for Microarray Analysis) package [5] written using the R language [12,13]. How it can be done in R is described in the R program [19].

The outline of this article is as follows. In section 2, we describe the setup of the experiments of the study and in section 3 we shortly discuss problem of extracting the signal from the scanned images. In section 4, we will identify systematic variations in the data that do not have a biological source, but are believed to stem from different error sources along the cDNA microarray process, and we will show how different normalization methods can correct for these artifacts. In second 5, we discuss the problems and methods of how to identify differentially expressed genes. Throughout the article we are using the ApoAI data set to exemplify the different ideas. In section 6, we give a list of 40 genes that we identified to be differentially expressed. At the end, we also do a comparison between the genes that are found when Spot is used and those that are found when GenePix is used.

2 Experimental setup

The goal of the study of mouse lipid metabolism in mice was to identify genes with altered expression in two mouse models with very low HDL cholesterol levels (treatment groups) compared to inbred control mice. The first mouse model compared eight mice with the apolipoprotein AI (ApoAI) knocked-out with a control group consisting of eight "normal" C57B1/6 mice. The second mouse model compares in a similar way eight scavenger receptor BI transgenic mice (SRBI) with the same control group. To identify the altered gene expressions a cDNA microarray technique was used. For the first model target cDNA was obtained from mRNA for all sixteen mice by reverse transcription and labeled using the red-fluorescent dye Cy5. The reference sample was prepared by pooling cDNA from the eight control mice and was labeled with the green-fluorescent dye Cy3. An analog setup was used for the second mouse model. The mix of target and reference cDNA was hybridized to microarrays containing N=5548 cDNA probes. The slides contained 6384 spots, but some of them where empty spots. Here we will focus on the comparison between the eight ApoAI mice (K=8) and the pool of control mice without making use of the fact that a comparison between each of the eight control mice with a pool of the same mice was also done. To include the eight slides from the control group in the analysis both two-sample t-tests and adjusted p-values are required, an approach which is beyond the scope of this article. For an analysis of the ApoAI data set using two-sample t-tests, see [14].

3 Image analysis

Raw data for one cDNA microarray slide typically consists of two scanned 16-bit images. During the scanning a green and a red laser excite the Cy3 and Cy5 dyes and the emission is registered. From these two images at least four different measures for each spot are extracted. The foreground signals, Rfg and Gfg, are often measured as the mean or the median of the intensities of the pixels that are identified to belong to the spot. There are several ways to define what the background of a spot is and different definitions often result in different background estimates, Rbg and Gbg. All background identification methods strive to get a representative estimate of the background noise in the area of the spot, a noise that is also expected to be added to the foreground signals. With a good estimate of the foreground and the background signals it is reasonable to believe that the background subtracted signals, R=Rfg-Rbg and G=Gfg-Gbg, reflects the gene expression levels. It is important to remember that the spots are not always easily identifiable regions in the images, but can often be weak, smeared out as comets (as an effect from the wash-of of the poly-L-lysine). It also happens that two or more spots are so large that they overlap or that the microarray slides are contaminated with dust and scratches. For reasons like these it is important to have good models that estimates of the foreground and the background signal. Yang et al [8] compare the different foreground and background identification and estimation methods that are used by the most commonly used image analysis software, such as GenePix [9], QuantArray [10], ScanAlyze [11] and Spot [2]. The authors show that the latter gives better foreground and background estimates than the others.

4 Normalization

Before any method for identifying differentially expressed genes can be applied, the data has to be normalized. Normalization is a process of removing systematic variation in microarray experiments, which affect the measured gene expression ratios. The sources for these artifacts can be many and they often reveal themselves in different ways. For instance, a too large imbalance in the PMT (photo-multiplier tube) voltage setting for the red and the green laser in the scanner will give a bias in the fluorescence intensities towards one of channels. Another reason for having such a bias could be due to different physical properties of the two dyes (green Cy3 and red Cy5 dye), such as differences in half-life, size and weight, and heat and light sensitivity. Other systematic variations could be due to variations in the quality of the cDNA clones printed on the cDNA microarray slides. Varying printing conditions, such as temperature and humidity, during the printing process might further contribute to the variability in the measured gene expression levels, or, as shown below, there could be problems with the print-tips.

The normalization process can be dived into two main steps. The first step identifies and normalizes for artifacts within each slide. The within-slide normalization is applied on each slide independently of the others. To be able to compare the measured gene expression levels between replicated slides the spread of expression levels must be the same for all slides. This is taken care of by the second normalization step called between-slide normalization.

4.1 Transformation of variables

Since the quantity of interest is the relative gene expression level and R and G typically range from 0 to 65535, we transform the data by taking the logarithm of the ratio between the red and the green signals, M=log2(R/G). Another interesting quantity is the average log intensity of the two signals, A=1/2·log2(R·G). The rational for the factor 1/2 is that A will have the same natural range as log2(R) and log2(G) and log-base two is used because the data is originally binary. It is easy to show that this transform is reversible (and therefore the information contained in R and G is preserved). We have that


M = log2(R/G), A = 1/2·log2(R·G)

<=>

R = (22A+M)1/2, G = (22A-M)1/2.
(1)

Many cDNA microarray experiments are set up in such a way that most of the genes are not expected to be differentially expressed. In these cases an M vs. A scatterplot will be more revealing than an R vs. G scatterplot, because most of the data points in an R vs. G plot will be distributed around the diagonal line R=G whereas the same data points in an M vs. A plot will be distributed around the horizontal line M=0. In figure 1 an R vs. G and an M vs. A scatterplot for the same set of data are shown. The normalization methods discussed below are operating in the (M, A) space. Also, if the noise in the two channels is multiplicative, i.e. R=c·Rtrue and G=c·Gtrue the transform to M will remove this noise.

R vs. G scatterplot <i>M</i> vs. <i>A</i> scatterplot
Figure 1. Left: An R vs. G plot (log base 2) of the data points for the last slide in the ApoAI knock-out experiment. Right: The same data points plotted in a M vs. A plot (log base 2). M is the log ratio between the red and the green channel and A is the intensity. The dashed lines R=G and M=0 are where the signals in the two channels are equal.

It is important to have in mind that all the normalization methods used below are based on the assumption that there is only a small fraction of outliers (1-2%). We believe that this assumption is true for the ApoAI data set.

4.2 Within-slide normalization

4.2.1 Global lowess normalization

Taking the mean (or the median) of the log ratios for each slide we should, since we assume that most of the genes to be non-differentially expressed get values close to zero. A boxplot of the M values for each of the eight ApoAI slides shows that this is not the case (cf. left plot in figure 2). For the slides 2 & 3, 5-8 the relative expression levels are biased towards the green channel (downregulated genes) and for the slides 1 & 4 there is a small bias towards the red channel (upregulated genes).

Boxplot for M across the slides
Figure 2. Left: Box-and-whisker plot showing the sample distribution of the log intensity ratios for each of the eight ApoAI slides. Since we assume most of the genes to be non-differentially expressed, ideally this plot should show a median around zero for all slides. Right: M vs. A scatterplot with a lowess-fitted curve revealing that the log ratios are intensity dependent.

A first approach to normalize the data would simply be to subtract the bias. However, a M vs. A plot, e.g. for the data from the third ApoAI slide, shows that the shift in not only a constant shift, but an intensity-dependent shift (cf. right plot in figure 2). The solid line in the plot is the lowess-fitted curve. The lowess function uses robust (not much affected by outliers) locally linear regression to smooth scatterplots [12,13]. All slides have this intensity-dependent property. This suggests that an intensity-dependent normalization [3] is needed:


Mnorm = M - c(A) (2)

where c(A) is the lowess fit to the M vs. A plot.

4.2.2 Print-tip normalization

The printing step in the cDNA microarray measurement process can also introduce systematic differences. In the ApoAI data set it is believed that there is an artifact present due to variation in the quality of the print tips [14,15]. This could be due to different lengths of the print-tips making some of them leaving more cDNA on the slides than others, which in turn, due to dilution, could result in different concentration of cDNA in the wells on clone plates. Another reason could be deformation of some of the print-tip slits after hours of printing. For the ApoAI microarrays a print-head of four-by-four print-tips were used (J=16). Each print-tip printed a group of 399 spots, called a print-tip group (or grid). The print-tip groups are laid out on the slide from left-to-right and from top-to-bottom counting from the upper left corner to the lower right corner (cf. legend in figure 4). Within each print-tip the spots are printed from left-to-right and from top-to-bottom. In other words, the sixteen spots in the upper left corner in the sixteen print-tip groups were printed at the same time, then the spots right next to them were printed and so forth. Even if the slides were printed with a well-organized clone library, this mapping of spots assures that any interesting genes are located "all over" the slide. For this reason we can assume that the "true" log-ratios have the same zero-mean distribution across the print-tip groups. A boxplot of the log-ratios for the sixteen groups (cf. figure 3) shows that there are systematic differences between the groups and especially that the print-tip groups 13-16 are different from the others. A plot of the spatial distribution of the top 5% absolute log-ratios (cf. figure 3) confirms this. This suggests that some kind of normalization within each print-tip group is needed.


Boxplot for the M values in each of the print-tip groups Spatial plot showing the top 5% differentially expressed genes
Figure 3. Before normalization. Left: Box-and-whisker plot of the data points in each of the sixteen print-tip groups. The lower four groups, 13-16, shows a different behavior than the rest. A perfect slide would have the same average and variance across the print-tip groups. Right: The spatial position of the top 5% absolute log-ratios (|M|) on the slide. Each small rectangle represents the log-ratio of a spot and the large four-by-four rectangles represent the print-tip groups. It is clear that there is, especially in the lower four groups, some kind of artifact.

The (global) lowess normalization method above can be applied to each print-tip group individually. The sixteen lowess curves in figure 4 confirms that the groups 13-16 are different from the others and that the differences are intensity dependent. Print-tip normalization [15] is


Mj,norm = Mj - cj(Aj);     j=1,2,...,J (3)

where J is the number of print-tip groups and cj(Aj) is the lowess fit of the log ratios Mj and the intensities Aj of the spots belonging to print-tip group j. When doing a within-print-tip normalization a slide-intensity-dependent normalization as discussed in the previous section is unnecessary.

<i>M</i> vs. <i>A</i> scatterplot
Figure 4. The M vs. A plot with sixteen lowess-estimated curves, one for each of print-tip groups. The legend in the lower right corner shows how the print-tip groups are numbered, starting with one in the upper left group and counting left-to-right to sixteen in the lower right group. The data from the four lowest print-tip groups, 13-16, have different properties than the others.

4.2.3 Scaled print-tip normalization

Doing a boxplot and a spatial plot (not shown) similar to the ones in figure 3 one gets that the print-tip normalization method not only removes the bias and the intensity dependence, but it also removes some of the spatial effects and makes the differences in variance between the groups smaller. However, there might still be differences in the variance, which can be normalized by scaling the log-ratios within each print-tip group. The scale method used in [15] assumes that the log-ratios in each group, j, follow a normal distribution with zero mean and a variance sj2. Not showing the details, it is straightforward to scale the print-tip groups so they get the same variance s2. To assure that the scaling is not affected by outliers the robust statistic MAD (median absolute deviation) is used to estimate the individual variances. The print-tip normalization method followed by a scaling step is called the scaled print-tip normalization method. The result of scaled print-tip normalization is shown in figure 5.

Scaled print-tip normalized boxplot for the M values in each of the print-tip groups Spatial plot showing the top 5% differentially expressed genes after scaled print-tip normalization
Figure 5 After within-slide normalization. Left: Box-and-whisker plot of the normalized data for each of the sixteen print-tip groups. The scaled print-tip normalization method first adjusts the data in each print-tip group such that their lowess lines became zero for all intensities. Second, it scales the data within each group such that their variances become the same. Right: The spatial distribution on the slide of the top 5% absolute log-ratios (|M|) from ApoAI slide 8.

4.2 Between-slide normalization

Scaled print-tip normalization applied to all slides separately removes the intensity dependency of the log-ratios within the print-tip groups and this also removed the intensity dependency and the bias within the slides. However, the spread of the log-ratios might still vary from slide to slide and the same scaling technique that was applied to the print-tip groups may be used to scale variances between slides. Between-slide normalization makes it possible to compare the gene expression levels measured by the different slides. The boxplot in figure 6 shows that scaled print-tip normalization followed by between-slide normalization makes the slides comparable (cf. figure 2). The "average" slide is shown as an M vs. A scatterplot in figure 6. The bars over the axis labels denote the mean of M and the mean of A, respectively.

The average ApoAI slide.
Figure 6. Left: Box-and-whisker plot showing the sample distribution of the log intensity ratios of the eight ApoAI slides after scaled print-tip normalization and between-slide normalization. Right: The "average" slide where the mean of the log-ratios and the mean of the intensities have been calculated for all spots across the slides.

5 Identifying differentially expressed genes

5.1 Absolute log-ratios and the t-statistic

The most simple way to select genes being differentially expressed is to stratify on the absolute average log-ratios, e.g. by choosing genes that have an |M|>Mmin. The main argument against using the average M value for deciding if a gene is differentially expressed or not, is that a large mean might be driven by an outlier, something which is not rare for cDNA microarray measurements. Thus, a better statistic to be used is the traditional t-statistic:

tn = Mn / SEn ;   n=1,2,...,N (4)

where Mn is the (sample) mean of the log-ratios for gene n; (Mn1, Mn2, ..., MnK). SEn=S / K1/2 is the standard error of the mean, where Sn is the (sample) standard deviation of the log-ratios for gene n. Since we do not care if the genes are up or down regulated, we look at the absolute t value, |t|. For two genes with the same mean of log-ratios, the gene with the smaller standard error will get the larger |t| value. The rational for this is that if repeated measurements of a gene expression ratio give very similar results, i.e. the standard error is small, we should trust these measurements more than in the case where they differ a lot.

There are problems with using the t-statistics. A gene with a small absolute mean might get a large |t| value only because of a small SE, which might not be the true reflection of the gene. Remember that we are not only comparing t values from two genes but from several thousands of genes. For this reason, by chance there will be genes that have larger |t| values because they have small SE, not because they are differentially expressed. An ad hoc way to solve this problem is to discount genes with a small absolute mean whose standard errors are small. A small SE could be defined as SEs that are among the bottom 1%. Tusher et al. [16] presented another solution to the problem where they add a constant to each SE. Their statstic is referred to as the S-statistic.

5.2 The B-statistic

A better and more correct method for dealing with the above problem and for identifying differentially expressed genes is an empirical Bayes method developed by Lönnstedt et al. [17]. The authors introduce a new statistic called the B-statistic. The B value calculated for gene n is a log-posterior odds ratio for the gene to be differentially expressed against it is not differentially expressed given the observed log-ratios for all genes and all experiments:

Bn = log  P({gene n is differentially expressed} | {Mnk})
P({gene n is not differentially expressed} | {Mnk})
(5)

where n=1,2,...,N. The probability model behind this B-statistic assumes that the K replicated log-ratio measurements for gene n, Mnk; k=1,2,...,K, are normally distributed with mean mn and variance sn2. However, the mean and the variance are not treated as fixed parameters but as random variables themselves and therefore we look at the distribution of Mnk as a conditional distribution Mnk|mn, sn2 ~ N(mn, sn2). The assumption is still that most genes are non-differentially expressed, i.e. they have mn=0, but a small proportion p of genes have a mn<>0. The expression above can now be written Bn=log[P(mn<>0 | {Mnk} / P(mn=0 | {Mnk}]. Assuming special conjugate-priori distributions for the mean and the variances (for details see [17]), the authors get the full expression for Bn to be:

Bn = log   p

1-p
 ·  1 

(1+Nc)-1/2
 ·  [  a + Sn2 + Mn2

a + Sn2 + Mn2/(1+Nc)
 ]v+N/2
(6)

where Mn and Sn2 are the sample mean and sample variance of the M values for gene n and N is the number of genes. Leaving out the details, a, v and c are hyperparameters that are estimated from the log-ratios from all genes and all slides. Finally, p is the proportion of differentially expressed genes and p is normally the only parameter to be tuned. The B-statistic works in a similar fashion as the S-statistic, but the former gives a slightly smaller number of false positives and false negatives [17].

5.3 Number of replicates needed

Before presenting the results, a relevant and important question to be asked is how many replicates are that needed to get trustful results. We did all the data analysis discussed above for the case where n=2, 4, 6 and 8 slides were used. Since the number of slides does not affect within-slide normalization we did not have to repeat that step. However, we had to do the between-slide normalization for each n. The top 40 genes when two, four, six and eight slides were used are listed in table 1. The genes are ranked according to their B values.


Number of slides included in the analysis
Rank n=2 n=4 n=6 n=8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
*2149
*1496
*2537
1337
*4139
4357
3023
6033
6061
5287
5345
1049
3251
6130
1494
*1739
761
5730
4951
1911
*2149
*1496
*4139
6050
6061
6130
*1739
*4941
5345
5731
6043
*2537
4942
6129
5729
1658
2963
5700
761
2314
*2149
*4139
*1496
*4941
6124
6061
5345
761
6043
*2537
1658
6129
1270
6050
4930
6045
473
5719
1755
4875
*1496
*2149
*4941
*4139
6043
761
4930
6129
*1739
5719
3153
6061
4875
6116
*2537
4530
2186
5751
4529
2314
Number of slides included in the analysis
Rank n=2 n=4 n=6 n=8
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
3530
1428
1270
3942
4922
3339
4354
4942
5731
5625
6040
6050
3261
26
5660
27
2675
3552
6043
483
6124
5417
483
5749
823
6045
6327
5418
5116
2922
4930
4951
3729
5719
5335
4875
4922
2748
4148
6222
4951
*1739
6130
3153
*5356
2314
5731
6216
2186
6116
4960
3151
5418
5751
2963
1337
4529
5673
4148
5703
*540
1257
5821
5703
6118
5353
2296
4925
6042
4884
6045
6130
5746
6109
5944
4882
4874
5358
5673
4951

Table 1. Genes found as a function of the number of replicates. The table is listing the top 40 (0.6%) ApoAI differentiated genes found when the number of slides used in the analysis is n=2,4,6,8, respectively. In each column, the genes are ranked by their B-statistic. The number listed are the genes' spot numbers as they occur on the microarray slides. The genes marked with a star (*) are the eight genes listed in [14]. The color and style coding is only used to help locate the same gene across the columns. All genes found when n=8 slides were used are bold-faced. To simplify the identification of these genes in the n=2, n=4 and n=6 columns, they are also color coded and some of them are in italic. Genes not found in the n=8 column are non-bold and colored in silver.

For n=2 there were no genes with a positive log-posterior odds ratio among the top 40 genes. For n=4 there were 12 with positive B values and for n=6 and n=8 all top 40 genes had positive B values. More interestingly, when data from two slides are used, we can only identify 10 out of the top 40 genes found when all slides were used. When four slides were used, 17 out of 40 were found and when six slides were used, 24 out of 40 were found. It must be mentioned that the selection of two, four and six slides was done by taking the first two slides, then the first four and finally the first six slides as they occur in the data set. A more rigorous approach would be to use a bootstrap technique, which repeatedly samples two (four or six) slides from the set of eight slides and then takes the average of the B-values for each gene.

6 Results

6.1 Top forty differentially expressed genes in the ApoAI data set

Using the B-statistic on the ApoAI data set we identify a set of genes that we believe to be differentially expressed. The genes are listed table 2, but can also be downloaded in a computer-readable format from [18]. Almost all the identified genes are down regulated, which makes sense since we are looking at a knock-out model. A more detailed discussion of the biological relevance of these genes are left to the SAGE KE discussion forum. The B values were calculated assuming 1% of the genes to be differentially expressed (p=0.01). If we instead assume 0.5% of the genes to be differentially expressed (p=0.005) we get that the 40 genes with largest log-posterior odds ratio are still ranked the same as in the p=0.01 case.


Rank spot M A B R G id Description
1 1496 -1.11
12.0 8.12 2828 6121 484183 est
2 2149 -3.91
11.8 7.31 934 14049 1077520 Apo AI, lipid-Img
3 4941 -1.07
12.9 7.25 5226 10941 737183 similar to yeast sterol desaturase, lipid-Img
4 4139 -1.52
12.3 6.87 2909 8373 374370 EST, Weakly similar to C-5 STEROL DESATURASE [Saccharomyces cerevisiae], lipid-UG
5 6043 -0.54
9.8 5.46 712 1038 870879 5'. gi|2186618|gb|AA461727|AA461727 [2186618], AA461727
6 761 -0.72
12.4 5.18 4068 6710 519093 Glucosidase, alpha, acid
7 4930 -0.57
9.9 4.93 796 1182 493497 ESTs, Highly similar to AROMATIC-L-AMINO-ACID DECARBOXYLASE [Rattus norvegicus], Brain-UG
8 6129 -0.54
11.2 4.80 1954 2834 570543 ESTs, Highly similar to C-8 STEROL ISOMERASE [Magnaporthe grisea], lipid-UG
9 1739 -1.09
13.3 4.74 6758 14418 483614 Apo CIII, lipid-Img
10 5719 -0.54
10.5 4.72 1191 1732 480221 Mouse steroid 21-hydroxylase A (21-OHase A) gene, lipid-UG
11 3153 -0.47
11.6 4.69 2551 3539 540161 Mus musculus alanine:glyoxylate aminotransferase (Agxt1) mRNA, complete cds
12 6061 -0.59
10.3 4.49 1012 1526 871167 5'. gi|2187584|gb|AA462693|AA462693 [2187584], AA462693
13 4875 -0.60
10.1 4.40 897 1359 437224 Isl1, AA003609
14 6116 -0.53
10.2 4.33 953 1376 526650 Murine GABA-A receptor delta-subunit gene, Brain-Img
15 2537 -1.53
13.5 4.16 6655 19208 483614 ESTs, Highly similar to APOLIPOPROTEIN C-III PRECURSOR [Mus musculus], lipid-UG
16 4530 -0.66
12.2 3.94 3758 5928 444794 BRAIN PROTEIN D3, Brain-UG
17 2186 -0.47
8.9 3.90 404 558 MDB0345 MDB0345, MDB0345
18 5751 -0.50
9.8 3.83 728 1031 918176 NEURONAL ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA-5 CHAIN, Brain-Img
19 4529 -0.41
9.9 3.79 818 1089 443390 EST, Moderately similar to APOLIPOPROTEIN B-100 PRECURSOR [Homo sapiens], lipid-UG
20 2314 -0.56
11.9 3.68 3255 4816 372643 Uncoupling protein 2, mitochondrial
21 540 -2.85
11.7 3.67 1231 8890 439353 EST, Highly similar to APOLIPOPROTEIN A-I PRECURSOR [Mus musculus], lipid-UG
22 1257 -0.48
9.6 3.61 672 938 870815 5'. gi|2186896|gb|AA462005|AA462005 [2186896], AA462005
23 5821 0.33
10.5 3.45 1667 1325 MDB1531 MDB1531, MDB1531
24 5703 -0.47
10.5 3.38 1197 1661 1196370 Grb2 (downstream of rec kinase homol), AA794463
25 6118 -0.42
10.2 3.34 982 1318 552012 FK506-BINDING PROTEIN PRECURSOR, heart-UG
26 5353 -0.41
10.6 3.18 1361 1807 1312454 MEK Kinase 3, heart-Img
27 2296 -0.73
10.4 3.11 1059 1763 475851 est
28 4925 -0.57
10.8 3.07 1476 2191 482240 Mus musculus APC-binding protein EB2 mRNA, partial cds, heart-UG
29 6042 -0.51
10.3 2.98 1062 1510 870875 5' similar to TR:G972043 G972043 MRNA; EXPRESSED SEQUENCE TAG ;. gi|2186616|gb|AA461725|AA461725 [2186616], AA461725
30 4884 -0.55
11.0 2.96 1663 2440 467322 Id like, AA036390
31 6045 -0.58
10.7 2.94 1322 1978 870887 5'. gi|2186621|gb|AA461730|AA461730 [2186621], AA461730
32 6130 -0.53
10.3 2.93 1069 1540 572995 Mouse Ki-ras cellular oncogene exon 1 from Y1 adrenal tumor cells (and joined CDS), heart-UG
33 5746 -0.47
10.0 2.89 850 1179 846842 Alpha-1B Adrenergic Receptor, heart-Img
34 6109 0.51
10.8 2.83 2132 1502 516686 Nkx2.6, AA108980
35 5944 0.33
11.8 2.73 4085 3250 478048 NEURONAL PROTEIN 3.1
36 4882 -0.39
10.4 2.60 1210 1583 466584 Ntrk3 (Neurotrophic RTK), AA030917
37 4874 -0.39
10.6 2.42 1361 1785 427319 ephrinB1, AA003297
38 5358 -0.42
11.0 2.40 1799 2413 1362186 NGFI-A BINDING PROTEIN 1, Brain-Img
39 5673 -0.45
9.8 2.39 742 1014 437516 EST- Contaxin precursor e-5, AA003596
40 4951 -0.49
11.1 2.39 1851 2592 876791 beta-3-adrenergic receptor, Brain-Img

Table 2. Differentially expressed genes. The columns are 1) the rank (from B-values), 2) the spot number, 3) the log ratio, 4) the intensity, 5) the log-posterior odds ratio, 6) the Cy5 intensity, 7) the Cy3 intensity, 9) the gene id, and 9) a short description of the gene/est. The list of genes can be downloaded in a computer-readable format from [18].

A B vs. average M plot (cf. left plot in figure 7) of the ApoAI data set shows that selecting genes based solely on the average log ratios would not give the same set of genes as if the log-posterior odds ratio would be used. In the plot, the 40 (0.6%) genes with highest ranked B values are highlighted red. To the right in figure 7, the same genes are highlighted, but now in an average M vs. average A plot.

The average ApoAI slide.
Figure 7. Left: The log-posterior odds ratio vs. the average log ratios. The B-statistic is the log-posterior odds ratio for each gene, which here is the same as each spot, to be differentially expressed. It is clear from this plot, that identification of differentially expressed genes based only on the average log ratios would not give the same genes as if it was based on the log-posterior odds ratio. Small |M| values tend to correspond to small B values, but the reverse is not true. The genes with the top 0.6% largest B-statistics are highlighted red. We used the parameter value p=0.01, i.e. we expect about 1% of the genes to be differentially expressed. Right: A scatterplot of the average log-ratios vs. the average intensities with the same identified genes highlighted.

6.2 Comparison between GenePix and Spot

In addition to use the Spot software for extracting the signals in the cDNA microarray images, we used the GenePix software. We analyzed the data from GenePix in the exact the same way as we did with the data from Spot and we compared the genes identified. The genes with the top 40 largest B values found by GenePix and the genes with the top 40 largest B values found by Spot are listed in table 3. Both Spot and GenePix resulted in ranking the gene at spot 1496 to be the highest. Spot ranked 2149 second whereas GenePix ranked spot 4941 second and so on. In GenePix's top 40 list there are 15 genes that are found in Spot's top 40 list, or equivalent in GenePix's top 40 list, there are 25 genes that are not found in Spot's top 40 list. Descriptions of these 25 genes are listed in the last column in table 3.

Finally, looking at the B values, Spot seems to produce larger B values than GenePix. This can be interpreted as suggesting that the quality of the data (signal to noise) from Spot is slightly higher than that from GenePix. However, a preferred comparison would be one based on their respective abilities to correctly identify genuinely differentially expressed genes, and on their false positive rates. Such a comparison is not possible at this point.


Spot top 40GenePix top 40Gene ranked highly by GenePix but not Spot
Rank spot M A B spot M A B
1 *1496 -1.11
12.0 8.12 *1496 -1.06
11.5 8.24  
2 *2149 -3.91
11.8 7.31 *4941 -1.44
13.0 6.95  
3 *4941 -1.07
12.9 7.25 603 1.64
5.4 5.55 MDB0670, R75477
4 *4139 -1.52
12.3 6.87 6130 -0.71
9.8 4.89  
5 6043 -0.54
9.8 5.46 6124 -0.96
8.4 4.75 Androgen receptor, Brain-UG
6 761 -0.72
12.4 5.18 3489 0.73
9.0 4.57 est
7 4930 -0.57
9.9 4.93 4875 -0.85
9.7 4.32  
8 6129 -0.54
11.2 4.80 6217 0.67
10.5 4.25 5' novel gene, H64597
9 *1739 -1.09
13.3 4.74 5032 0.93
10.0 4.13 MDB0488, R75338
10 5719 -0.54
10.5 4.72 4530 -0.64
12.1 4.04  
11 3153 -0.47
11.6 4.69 *1739 -1.17
13.5 3.82  
12 6061 -0.59
10.3 4.49 *540 -3.37
11.2 3.78  
13 4875 -0.60
10.1 4.40 761 -0.74
12.3 3.55  
14 6116 -0.53
10.2 4.33 6221 0.79
10.6 3.48 novel gene, AA505401
15 *2537 -1.53
13.5 4.16 5700 -0.93
9.4 3.46 Jagged1 (Alagille syndrome), AA619107
16 4530 -0.66
12.2 3.94 5274 -0.61
9.5 3.39 Jak3, AA023709
17 2186 -0.47
8.9 3.90 6284 -0.57
10.2 3.39 est
18 5751 -0.50
9.8 3.83 4855 -0.94
9.4 3.35 5'. gi|2187342|gb|AA462451|AA462451 [2187342], AA462451
19 4529 -0.41
9.9 3.79 4961 -0.70
8.7 3.33 Myosin Binding Protein C, heart-Img
20 2314 -0.56
11.9 3.68 1514 -0.80
7.9 3.20 est
21 *540 -2.85
11.7 3.67 563 -0.97
11.2 2.94 FATTY ACID-BINDING PROTEIN, EPIDERMAL, lipid-UG
22 1257 -0.48
9.6 3.61 1519 -0.72
8.6 2.80 est
23 5821 0.33
10.5 3.45 5283 -0.67
9.5 2.76 Hkr-t1, AA110661
24 5703 -0.47
10.5 3.38 2803 0.88
8.4 2.69 5' similar to PIR:A56136 A56136 jagged protein precursor - rat ;. gi|1287598|gb|W13561|W13561 [1287598], W13561
25 6118 -0.42
10.2 3.34 2235 1.40
8.2 2.68 5' similar to gb:X64229_cds1 DEK PROTEIN (HUMAN), AA445930
26 5353 -0.41
10.6 3.18 5349 -0.71
9.2 2.67 G PROTEIN-COUPLED RECEPTOR KINASE GRK5, Brain-Img
27 2296 -0.73
10.4 3.11 4830 -0.47
11.6 2.61 5'. gi|2192760|gb|AA466620|AA466620 [2192760], AA466620
28 4925 -0.57
10.8 3.07 4862 -0.98
9.1 2.59 5' similar to WP:F32D8.4 CE05783 LACTATE DEHYDROGENASE ;. gi|2187569|gb|AA462678|AA462678 [2187569], AA462678
29 6042 -0.51
10.3 2.98 *2537 -1.42
13.7 2.57  
30 4884 -0.55
11.0 2.96 *5356 -1.98
13.0 2.52 CATECHOL O-METHYLTRANSFERASE, MEMBRANE-BOUND FORM, Brain-Img
31 6045 -0.58
10.7 2.94 6061 -0.85
9.8 2.45  
32 6130 -0.53
10.3 2.93 *2149 -4.79
11.3 2.33  
33 5746 -0.47
10.0 2.89 5244 -0.90
8.8 2.32 5'. gi|2186615|gb|AA461724|AA461724 [2186615], AA461724
34 6109 0.51
10.8 2.83 722 -0.64
10.3 2.31 Mus musculus thrombospondin-4 mRNA, complete cds
35 5944 0.33
11.8 2.73 5358 -0.80
10.3 2.21  
36 4882 -0.39
10.4 2.60 6045 -0.72
10.3 2.21  
37 4874 -0.39
10.6 2.42 5268 -0.68
9.0 2.11 Ash1 homolg4, AA014730
38 5358 -0.42
11.0 2.40 2296 -1.02
9.7 2.02  
39 5673 -0.45
9.8 2.39 3585 0.78
9.7 2.01 B actin, 1030797
40 4951 -0.49
11.1 2.39 *4139 -1.54
12.2 1.98  

Table 3. Genes ranked highly by Spot vs genes ranked highly by GenePix. In column 2 are the spot number for the genes found using Spot for the image analysis. In column 6 are the spot number for the genes found by using GenePix for the image analysis. Except for the image analysis the rest of the data analysis was identical. In the last column are the descriptions of the genes that was highly ranked by GenePix but not by Spot listed. For the rest of the description, both for the GenePix and the Spot genes, see table 2. The color and style coding is done in the same fashion as in table 1.


Authors

1 Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden (hb@maths.lth.se)
2 Health Science Center at San Antonio, The University of Texas, Texas (calder@uthscsa.edu)
3 Radiation Biology and Environmental Toxicology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California (smian@lbl.gov)
4 Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, California (mjcallow@lbl.gov, emrubin@lbl.gov)
5 Department of Statistics, University of California, Berkeley (terry@stat.berkeley.edu)
6 Division of Genetics & Bioinformatics, Walter & Eliza Hall Institute of Medical Research, Melbourne (terry@wehi.edu.au)
* Correspondence should be addressed to Saira Mian, telephone +1-510-486-6216, fax +1-510-486-6949, email smian@lbl.gov.


References

[1] M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin. Microarray expression profiling identifies genes with altered expression in HDL deficient mice. Genome Research, 10(12):2022-9, December 2000.
[2] Spot software, CSIRO Mathematical and Information Sciences.
http://www.cmis.csiro.au/iap/spot.htm
[3] M. Callow, ApoAI Data, Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley.
http://www.stat.berkeley.edu/users/terry/zarray/Html/apodata.html
[4] com.braju.sma - an object-oriented add on to sma (beta)
http://www.braju.com/R/com.braju.sma/
[5] Statistics for Microarray Analysis (sma)
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
[6] R. Ihaka and R. Gentleman, R: A Language for Data Analysis and Graphics, Journal of Computational and Graphical Statistics 5(3):299-314, 1996.
[7] The R Project for Statistical Computing
http://www.r-project.org/
[8] Y. H. Yang, M. J. Buckley, S. Dudoit and T. P. Speed. Comparison of methods for image analysis on cDNA microarray data. Technical report #584, University of California, Berkeley. November 2000.
http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html
[9] GenePix software, Axon Instruments, Inc.
http://www.axon.com/GN_Genomics.html
[10] QuantArray Analysis Software, Packard Instrument Company.
http://www.packardbiochip.com/products/quant_soft.htm
[11] M. Eisen, ScanAlyze software, Eisen Lab, Life Science Division, Lawrence Berkeley National Laboratory, Berkeley, California.
http://rana.lbl.gov/EisenSoftware.htm
[12] Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 74:829-836. 1979.
[13] Cleveland, W. S. LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician, 35:54. 1981.
[14] S. Dudoit, Y. H. Yang, M. J. Callow and T. P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578, University of California, Berkeley. August 2000.
http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.html
[15] Y. H. Yang, S. Dudoit, P. Luu and T. P. Speed. Normalization for cDNA Microarray Data. SPIE BiOS 2001, San Jose, California, January 2001.
http://www.stat.berkeley.edu/users/terry/zarray/Html/normspie.html
[16] V.G. Tusher, R. Tibshirani and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116-21, April 2001.
[17] I. Lönnstedt and T. P. Speed. Replicated Microarray Data. Statistical Sinica, Accepted.
http://www.stat.berkeley.edu/users/terry/zarray/TechReport/Baypap4d.ps
[18] List of 40 genes found to be differentially expressed in the ApoAI knockout experiments. http://www.maths.lth.se/matstat/bioinformatics/sageke/apoa1-genelist.txt (a comma-separated ASCII file)
[19] Example R program that loads the data, normalizes it and identifies differentially expressed genes using the B-statistics method.
http://www.maths.lth.se/matstat/bioinformatics/sageke/Rcode.txt
[20] L. Shi, DNA Microarray (Genome Chip) - Monitoring the Genome on a Chip
http://www.gene-chips.com/
[21] T. Speed et al., Statistical problems involving microarray data, University of California, Berkeley
http://www.stat.Berkeley.EDU/users/terry/zarray/Html/list.html
[22] W. Li, Papers on microarray data analysis, Rockefeller University
http://linkage.rockefeller.edu/wli/microarray/

Back to the SAGE KE Discussion Forum on Identifying differentially expressed genes in cDNA microarray experiments: making aging visible.


ruler
This URL: http://www.maths.lth.se/matstat/bioinformatics/sageke/.
Generated by www.braju.com and hosted by Lund University.
Last updated: October 3, 2001
HTML validate this page!