Tuesday, April 15, 2014

array (gene and exon)

Quality Control of Microarray Data

  • Quality assessment of microarrays should be performed to ensure the data is reliable before normalization. Bioconductor has packages for QC'ing.
    • For affy, use simpleaffy and/or affyPLM to examine percent present calls (from A/P calls), Normalized Unscale Standard Error (NUSE), and Relative Log Expression (RLE). Choose a cutoff for each method, for eg. more than 30% for present.
    • For Agilent, the package arrayQualityMetrics can used. Sample code for getting percent present calls, RLE and NUSE (for Affy)
   library("simpleaffy")
   library("affyPLM")  #for NUSE
   # Read cel files from directory
   data = ReadAffy() 
   # Create affy QA matrix
   data.qc = qc(data)
   # percent present
   pp = percent.present(data.qc)
   # RLE and NUSE
   plmStruc = fitPLM(data)
   RLE(plmStruct, type="stats")
   NUSE(plmStruct, type="stats")
   
   # arrayQualityMetrics using custom CDF (optional)
   library(arrayQualityMetrics)
   CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf")
   eset = rma(CELs)
   arrayQualityMetrics(eset, outdir="QC", force=TRUE)
   
   # arrayQualityMetrics using oligo package
   library(arrayQualityMetrics)
   library(oligo)
   library(pd.huex.1.0.st.v2)
   geneCELs =list.celfiles(full.names = TRUE)
   affyGeneFS = read.celfiles(geneCELs)
   eset.oligoGeneCore = rma(affyGeneFS, target = "core")
   arrayQualityMetrics(eset.oligoGeneCore, outdir="QC", force=TRUE)


   #Agilent (arrayQualityMetrics) on 2-color
   library(arrayQualityMetrics)
   library(limma)
   scanFiles = dir(pattern = ".*.txt$")
   maData = read.maimages(scanFiles, source="agilent")
   arrayQualityMetrics(expressionset=maData, outdir="QC", force = TRUE, do.logtransform = TRUE)

   #Agilent (arrayQualityMetrics) on 1-color
   library(arrayQualityMetrics)
   library(limma)
   scanFiles = dir(pattern = ".*.txt$")
   maData = read.maimages(scanFiles, source="agilent", green.only=TRUE)
   eSet = new("ExpressionSet", exprs = maData$E, annotation =maData$genes[,7])
   arrayQualityMetrics(eSet, outdir="QC", force = TRUE, do.logtransform = TRUE)

Pre-processing Affymetrix microarrays

  • Each probe or gene signal needs to be corrected.
  • General steps (for Affymetrix arrays):
    1. Background Correction, if any
    2. Normalization
    3. Summarization
  • Commonly used methods for background correction and normalization Affy data include MAS5 (by Affy), Robust Multichip Array (RMA) and GC Robust Multichip Array (GCRMA).
  • References for these methods:
    • RMA: Irizarry RA et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249-64, 2003.
    • RMA FAQs
    • GCRMA: Wu Z and Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proceedings of RECOMB ’04.
  • RMA is our preferred method. It's as good (or better) as MAS5 and GCRMA but also faster. On the other hand, all three of these methods are popular.
  • RMA and GCRMA output log2-transformed intensities, whereas MAS5 outputs untransformed intensities.
Sample code for processing Affymetrix arrays using standard Affymetrix probeset definitions (CDF):
    library(affy)
    # Read all CEL files
    CELs = ReadAffy()
    # Normalize with RMA or use justRMA for large data set
    # To use GCRMA use gcrma which ignores MM intensities
    eset = rma(CELs)
    # Get the expression values alone (as a matrix)
    eset.rma.expr = exprs(eset)
    # Affy QC report
    #library(affyQCReport)
    # Print report
    library(arrayQualityMetrics)
    arrayQualityMetrics(eset, outdir="QC", force=TRUE)

  • Clustering samples after normalization can be used to see possible batch effects
        d=dist(t(eset.rma.expr))
        plot(hclust(d, method="average"))
    
  • Analysis of Mouse (or Human) Gene 1.0 ST Arrays
The use of the Affy CDF file for these arrays is not recommended because it is "not supported" by Affymetrix. For these arrays use either a custom CDF (ex: CELs.entrez = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")) or the standard Affymetrix probeset definitions but using the oligo package. The output from the oligo package agrees with the output from commercial Affymetrix analysis software. NOTE: Normalizing these arrays with small sample size (eg. < 5) may not work properly or as expected, eg. normalized values might occur multiple times within, or across, arrays. Possible solutions may include:
Code to normalize using the oligo package and the standard Affymetrix probeset definitions (CDF)
with Human Gene ST Arrays:
library(oligo)
library(pd.hugene.1.0.st.v1)
geneCELs =list.celfiles(full.names = TRUE)
affyGeneFS = read.celfiles(geneCELs)
# Do RMA - transcript level (35556 features)
eset.oligoGeneCore = rma(affyGeneFS, target = "core")
eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore)
write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
with Human Exon ST Arrays:
library(oligo)
library(pd.huex.1.0.st.v2)
geneCELs =list.celfiles(full.names = TRUE)
affyGeneFS = read.celfiles(geneCELs)
# Do RMA - gene level (22011 genes ["transcript clusters"])
eset.oligoGeneCore = rma(affyGeneFS, target = "core")
eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore)
write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
Code to normalize using the affy package and custom probeset definitions (CDF):
library(affy)
library(mogene10stv1mmentrezgcdf)
CELs.entrezGene = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")
# Do RMA - gene level (21225 features)
eset.entrezGene = rma(CELs.entrezGene)
eset.entrezGene.expr = exprs(eset.entrezGene)
write.table(eset.ense.expr, file="RMA_using_affy_customCDF_EntrezGene.txt", sep="\t", quote=F)

Using custom probeset definitions with Affymetrix microarrays


  • At the time they were designed, probes were organized into probesets, with each probeset representing one transcript (or part of a transcript). Since then, our understanding of transcripts and genes have increased, and it's possible to use updated probeset definitions based on gene models according to resources like NCBI Entrez Gene or Ensembl.
  • One source of custom CDFs (chip definition files, describing probeset definitions) is is the BrainArray Microarray Lab at the University of Michigan.
  • For custom CDFs keyed to Entrez Gene IDs, a probeset like 10000_at represents Entrez Gene with ID = 10000 (AKT3). Instead of using the entrez_gene database on canna, we can also use the 'db' package, like mogene10stv1mmentrezg.db to link probeset ID to other information.
  • Annotations of custom CDF probesets can also be access from the corresponding R package [see code below].
  • Some custom CDF files are designed to define each gene by just one probeset, eliminating the issue of summarizing a gene represented by multiple probesets.
Sample code for processing Affymetrix arrays using custom probeset definitions (CDF) and RMA:
    library(affy)
    library(hgu133plus2hsentrezgcdf)
    # Read all CEL files
    CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf")
    # Normalize with RMA or use justRMA for large data set
    eset = rma(CELs)
    # Get the expression values alone (as a matrix)
    eset.rma.expr = exprs(eset)
Sample code for processing Affymetrix arrays using custom probeset definitions (CDF) and GCRMA:
    library(gcrma)
    # Normalize with GCRMA where CEL files are in the directory CEL_Files
    eset = justGCRMA(celfile.path = "CEL_Files", verbose=T, cdfname="mouse4302mmentrezg")
    # Get the expression values alone (as a matrix)
    eset.rma.expr = exprs(eset)

Accessing probeset annotations for Affymetrix arrays

To link standard probeset IDs to genes, one can download a "NetAffx Annotation File" from the  Affymetrix web site (requiring free registration). Custom probeset IDs usually have a gene ID encoded in the probeset ID. For example, using Entrez Gene - based custom probesets, probeset 12960_at represents Crybb1, which has an Entrez Gene ID of 12960. Standard Affymetrix or custom probesets can also be annotated using Bioconductor, as long as one has installed the "db" package corresponding to the CDF file. If one used hgu133plus2hsentrezgcdf as the CDF, then hgu133plus2hsentrezg.db would contain probeset annotations.
    # Use the standard "db" file for the Mouse Genome 430 2.0 array
    library(mouse4302.db)

    # Start with a list of all probesets on the array with a command like
    probesets.all = rownames(eset.rma.expr)
 
    # List all annotation types available (optional)
    ls("package:mouse4302.db")

    # Get some selected information
    gene.IDs = unlist(mget(probesets.all, mouse4302ENTREZID, ifnotfound=NA))
    gene.symbols = unlist(mget(probesets.all, mouse4302SYMBOL, ifnotfound=NA))
    gene.names = unlist(mget(probesets.all, mouse4302GENENAME, ifnotfound=NA))

    # Print them all (and/or merge them with the expression matrix)
    probe.anno = cbind(probesets.all, gene.IDs, gene.symbols, gene.names)
    write.table(probe.anno, file="Mouse4302.probe_anno.txt", quote=F, sep="\t", row.names=F)

    # Get probes for a specific probeset (2_at)
    library(hgu133plus2hsentrezgprobe)
    as.data.frame(hgu133plus2hsentrezgprobe[hgu133plus2hsentrezgprobe$Probe.Set.Name=="2_at",])

Normalizing Agilent microarrays

  • Steps for analyzing Agilent data and code
  1. Read data (*.txt files, one for each array)
  2. Run background correction (if desired)
  3. Use loess for within-array normalization and Aquantile for between-array normalization.
  4. Draw QC plots (eg. MA plot)
  5. Print MA and RG values for all data
 library(limma)
 # Read array scanner files
 scanFiles = dir(pattern = ".*.txt$")
 maData = read.maimages(scanFiles, source="agilent")
 # Do not background subtract
 maData.nobg.0 = backgroundCorrect(maData, method="none", offset=0)
 # Normalize using loess 
 MA.loess.0 = normalizeWithinArrays(maData.nobg.0, method="loess")
 # Normalize between arrays
 MA.loess.q.0 = normalizeBetweenArrays(MA.loess.0, method="Aquantile")
 
 # Write-out normalized values
 # After quantile normalization (M and A values)
 write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$M), file="M_log2ratios_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
 write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$A), file="A_log2intensities_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
 # Get R and G values for verification (optional)
 RG.loess.q.0 = RG.MA(MA.loess.q.0)
 write.table(cbind(RG.loess.q.0$genes, RG.loess.q.0$R ,RG.loess.q.0$G), file="RG_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
Sample R code for processing Agilent data:
Wi-files1\BaRC_Public\BaRC_code\R\normalize_agilent_2color_arrays_GB.R

Choice of offset

You can judge a good value for the offset by inspection of the MA-plots. If you really want a quantitative way to judge this, look at the component fit$df.prior after you use the eBayes() function in limma. The better you stabilise the variances, the larger will be df.prior and the greater will be the power to detect DE genes. Hence the offset which maximises df.prior is, in sense, optimal.
From:  LIMMA: choice of offset... Reply by Gordon Smyth

Other Issues

  • Spot flags can be used to define spot weights, which are taken into consideration for normalization and/or statistics, such as with commands like
    # Define weight function
    spotWeights <- function(qta) {
        mapply(min,1-qta[,"gIsFeatNonUnifOL"],1-qta[,"gIsFeatNonUnifOL"],
        1-qta[,"gIsBGNonUnifOL"],1-qta[,"gIsBGNonUnifOL"],
        1-qta[,"gIsFeatPopnOL"],1-qta[,"gIsFeatPopnOL"],
        1-qta[,"gIsBGPopnOL"],1-qta[,"gIsBGPopnOL"])
        }
    # Use weights for normalization
    maData.weighted = read.maimages(scanFiles, source="agilent", wt.fun=spotWeights)
    # Use weights for differnetial expression statistics
    fit = lmFit(maData.weighted, design, weights = maData.weighted$weights)
    

More Information

Friday, April 11, 2014

R.sample.RMA-MS.call.txt human exon array

##
## RMA-MS sample code using a dummy example
##  - this works as of BioC 2.3 in R-2.8.1 (Windows)
##

## First load data -- create an AffyBatch object called Data

library(ALLMLL); data(MLL.B)
Data <- MLL.B #AffyBatch object, 20 arrays



## Now define some functions (shouldn't need to change any of this)

######################################################################################
# RMA-MS: abatch is AffyBatch object, gn.sig is a vector of signal-possible geneNames
rma.ms <- function(abatch,gn.sig)
{ rma.abatch <- bg.correct(abatch, method="ms", gn.sig=gn.sig)
rma.abatch.norm <- qnorm.subset(rma.abatch,gn.sig)
out.eset <- expresso(rma.abatch.norm,bg.correct=FALSE,normalize=FALSE,
pmcorrect.method="pmonly",summary.method="medianpolish",summary.subset=gn.sig)
return(out.eset)
}
# RMA-MS background correction; based on bg.correct.rma
bg.correct.ms <- function (object, gn.sig, ...)
{ pn <- probeNames(object); t.sig <- is.element(pn,gn.sig)
pm.mat <- pm(object); pm.sig <- apply(pm.mat, 2, bg.adjust.ms, t.sig=t.sig)
pm.mat <- pm.sig; pm(object) <- pm.mat
return(object)
}
bg.adjust.ms <- function (pm, n.pts = 2^14, t.sig, ...)
{ param <- bg.parameters.ms(pm, n.pts, t.sig)
b <- param$sigma; pm <- pm - param$mu - param$alpha * b^2
pm + b * ((1/sqrt(2 * pi)) * exp((-1/2) * ((pm/b)^2)))/pnorm(pm/b)
}
bg.parameters.ms <- function (pm, n.pts = 2^14, t.sig)
{ pm.bg <- pm[!t.sig]; pm.sig <- pm[t.sig]
mu.bg <- median(pm.bg); sd.bg <- mad(pm.bg)
alpha.sig <- mean(pm.sig[pm.sig>mu.bg])
list(alpha = 1/alpha.sig, mu = mu.bg, sigma = sd.bg) # Note 1/alpha.sig
}
upDate.bgcorrect.methods(c(bgcorrect.methods(),"ms"))
# Quantile normalize a subset of genes; based on normalize.AffyBatch.quantiles
qnorm.subset <- function(abatch,subset)
{ pms <- unlist(pmindex(abatch,subset))
noNA <- rowSums(is.na(intensity(abatch)[pms, , drop = FALSE])) == 0
pms <- pms[noNA]
intensity(abatch)[pms, ] <- normalize.quantiles(intensity(abatch)[pms, , drop = FALSE],
copy = FALSE)
return(abatch)
}
######################################################################################



## Now define signal-possible geneNames (gn.sp) and call rma.ms function

library(affy); library(preprocessCore)
gn.sp <- geneNames(Data)[1:1000] # vector of signal-possible geneNames (probeset ids)
eset.ms <- rma.ms(Data,gn.sp) 

## This final eset.ms object is an ExpressionSet object, with only 
## expression estimates for signal-possible geneNames (in gn.sp)

Mapping between Affymetrix exon array Features to Genes

library(biomaRt)
mart.hs <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# get probeset IDs for transcript cluster 2315633
huex <- read.table("~/Downloads/HuEx-1_0-st-v2.na32.hg19.probeset.csv", sep = ",", stringsAsFactors = F, header = T)
probes <- subset(huex, transcript_cluster_id == "2315633")$probeset_id
# get gene symbols
genes <- getBM(attributes = c("affy_huex_1_0_st_v2", "hgnc_symbol"), filters = "affy_huex_1_0_st_v2", values = probes, mart = mart.hs)
genes
#  affy_huex_1_0_st_v2 hgnc_symbol
#1             2315638     B3GALT6
#2             2315642     B3GALT6
#3             2315639     B3GALT6
#4             2315643     B3GALT6
#5             2315644     B3GALT6
#6             2315640     B3GALT6
#7             2315637     B3GALT6
#8             2315645     B3GALT6
#9             2315641     B3GALT6