MAAMD: A Workflow to Standardize Meta-Analyses of Affymetrix Microarray Data

Background

Mandatory deposition of microarray data files for public access, prior to study publication, provides significant opportunities for the scientific community to conduct new bioinformatics analyses within and across multiple datasets. Meta-analyses of microarrays are time-consuming and complex, and they require programming skills. Analytical workflows that automate these tasks simplify processing, improve efficiency, and standardize the analyses.
Methods: We constructed a workflow in Kepler to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data (MAAMD). Two freely available stand-alone software tools, R and AltAnalyze, as well as Bioconductor packages such as GEOquery and arrayQualityMetrics, are embedded in MAAMD. The inputs of MAAMD are user-editable csv files that contain sample information and parameters describing the locations of input files and required tools. The output of MAAMD is a structured folder containing the microarray data and the analysis results. A test analysis using MAAMD was designed to identify the gene orthologues that respond to hypoxia in mice and Drosophila.
Results: MAAMD is an automated Kepler workflow that performs data downloading, data organization, data quality control, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparison. The test analysis identified several gene orthologues responding to hypoxia in both mice and Drosophila, which validates the use of MAAMD. The complete analysis of four independent microarray datasets finished in one hour; the same analysis can take days without MAAMD.
 

Version Requirements

Operating Systems

This workflow has been tested on Windows and Mac OS X. Note: the paths on this page are written for Windows. On a Mac, all paths should use the forward slash "/" rather than the back slash "\", for example "/Applications/AltAnalyze".

 

MAAMD Workflow Download

MAAMD 1.0 and test sample files: This package contains the MAAMD workflow and the input files required for a hypoxia case study. Please configure your operating system following the instructions in 'MAAMD-instruction.pdf' and run MAAMD for the hypoxia case study following Section D. MAAMD-ALL.xml is the workflow that contains the meta-analyses and quality control. MAAMD-ALL-NoQC.xml is the workflow that contains all meta-analyses but no sample quality estimation. The NoQC workflow is recommended when sample quality is already assured, since it saves time and requires fewer Bioconductor packages.

MAAMD 1.0 in separate workflows: This package contains individual MAAMD workflows that let you run each stage of MAAMD separately. Please download this package and decompress it into the folder where MAAMD-ALL.xml is located. MAAMD-Download.xml: download GEO datasets. MAAMD-AltAnalyze.xml: analyze datasets. MAAMD-Comparison: compare analyzed datasets.
 
MAAMD for both local and online microarray data: This package contains a workflow that allows you to analyze not only online microarray data but also your local microarray data. Two versions, a QC version and a NoQC version, are included in the package. For your local dataset, please prepare your input file following the format of 'datasets-local.csv', which is available in the package 'MAAMD 1.0 and test sample files'.
 
The download links are on the workflows page listed under RESOURCES below.

 

RESOURCES

The workflows and more detailed instructions are available at http://zgan.weebly.com/workflows.html

Related video on YouTube: https://www.youtube.com/watch?v=KpQNUuCbxT4

Discussion and Q&A support: https://groups.google.com/forum/#!forum/maamd

Workflow Inputs: 
CSV files:
Edit the input CSV files for the selected data sets, keeping the fixed file format.
Refer to the demo file datasets.csv for the format of the dataset summary, and to datainfo-gse12160.csv, datainfo-gse15879.csv, datainfo-gse14981.csv, and datainfo-gse9400.csv for the format of the sample information in an individual data set.
Note 1: Please do not modify the names of the columns.
Note 2: The suffix ".CEL" is required for both "SampleName" and "NewName".
You can find your data and data information in the GEO database: http://www.ncbi.nlm.nih.gov/geo/
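The two notes above can be checked before a run. The following is a minimal sketch, not part of MAAMD: it assumes the sample-information file is a comma-separated file with a header row, and it treats the ".CEL" suffix as case-sensitive, which is a guess about how MAAMD reads it. The function name and the example rows are hypothetical.

```python
import csv
import io

# Column names taken from Note 2 above.
REQUIRED_COLUMNS = ("SampleName", "NewName")


def find_cel_problems(csv_text):
    """Return a list of readable problems found in a sample-information CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    for row_number, row in enumerate(reader, start=1):
        for name in REQUIRED_COLUMNS:
            value = (row[name] or "").strip()
            # Assumption: the suffix must be upper-case ".CEL" exactly.
            if not value.endswith(".CEL"):
                problems.append(f"data row {row_number}: {name}={value!r} lacks the .CEL suffix")
    return problems


if __name__ == "__main__":
    # Made-up rows for illustration; the "Group" column is hypothetical.
    example = "SampleName,NewName,Group\nGSM1.CEL,ctrl_1.CEL,control\nGSM2,hyp_1.CEL,hypoxia\n"
    for problem in find_cel_problems(example):
        print(problem)
```

To check a real file, read it with open(path, newline="") and pass its contents to find_cel_problems; an empty list means neither rule was violated.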
User alterable parameters: 
Nset: the number of datasets that you want to analyze. Note: If Nset is smaller than the number of datasets listed in datasets.csv, MAAMD will analyze only the first “N” datasets.
WorkPath: the folder where you want to store the data and results.
DataFile: the path of the csv file in which you collect the information for all datasets. Note: This csv file contains the summary of all targeted datasets. Do not point DataFile at the csv files that hold sample information for individual datasets; the paths of those sample information files are assigned inside this csv. Refer to “datasets.csv” as an example.
MAAMDPath: the folder where you store the MAAMD workflows. Note: the reference file homologene.txt must be stored in this same folder.
AltAnalyze: the directory where AltAnalyze is installed, for example, "C:/AltAnalyze_v.2.0.8-Win64".
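Several of these parameters depend on each other, so a quick pre-flight check can catch mistakes before Kepler starts. The sketch below is an illustration under stated assumptions, not MAAMD code: it assumes the DataFile has one header row followed by one row per dataset, and the function name and example paths are hypothetical.

```python
import csv
import os


def check_parameters(nset, data_file, maamd_path):
    """Return a list of problems found in the MAAMD parameters (empty if none)."""
    if not os.path.isfile(data_file):
        return [f"DataFile not found: {data_file}"]

    with open(data_file, newline="") as handle:
        # Skip completely blank lines, which spreadsheet exports sometimes leave behind.
        rows = [row for row in csv.reader(handle) if any(cell.strip() for cell in row)]

    # Assumption: the first non-blank row is the header, every later row is one dataset.
    listed = max(len(rows) - 1, 0)

    problems = []
    if nset < 1:
        problems.append(f"Nset={nset} must be at least 1")
    elif nset > listed:
        problems.append(f"Nset={nset} but only {listed} datasets are listed in DataFile")
    if not os.path.isfile(os.path.join(maamd_path, "homologene.txt")):
        problems.append("homologene.txt is missing from MAAMDPath")
    return problems


if __name__ == "__main__":
    # Example values only; replace them with your own parameters.
    for problem in check_parameters(4, "C:/MAAMD/sample/datasets.csv", "C:/MAAMD/workflow"):
        print(problem)
```

An Nset smaller than the number of listed datasets is not reported, because MAAMD handles that case by analyzing only the first N datasets.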
Output: 

A Kepler display pops up to tell you whether homologous genes have been found across data sets.  If homologous genes are found across selected datasets, a file named "ComparisonSets.txt" can be found in the WorkPath folder.
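If the Kepler pop-up was dismissed too quickly, the same outcome can be read from the WorkPath folder. This is a small sketch with hypothetical function names; it only tests whether "ComparisonSets.txt" exists and shows its first lines as raw text, since the column layout of that file is not described here.

```python
import os


def find_comparison_file(work_path):
    """Return the path of ComparisonSets.txt in work_path, or None if it is absent."""
    candidate = os.path.join(work_path, "ComparisonSets.txt")
    return candidate if os.path.isfile(candidate) else None


def preview(path, max_lines=5):
    """Return the first few lines of a text file without interpreting the columns."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for _, line in zip(range(max_lines), handle)]
```

A return value of None from find_comparison_file corresponds to the case in which no homologous genes were found across the selected datasets, or the run did not finish.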

Validation and Test Plan: 

A case study using MAAMD is described in the 'detailed description document'. The files it requires can be found in the downloaded workflow zip package.

Software Installation: 
1.     Installation of R
  • Download R from http://cran.r-project.org/ and install
  • Add the folder that contains R.exe to the "Path" entry in the ‘system variables’ list.
For Windows,
  • Right click "Computer" and select "Properties"
  • Go to "Advanced system settings" and click the "Advanced" tab
  • Select "Environment Variables…"
  • Scroll down the "System Variables" list and select the variable "Path"
  • Append the folder that contains R.exe to the end of the value, separating it from the existing entries with ";".
    • For 64-bit OS, the path should be like "C:\Program Files\R\R-3.0.0\bin\x64"
    • For 32-bit OS, the path should be like "C:\Program Files\R\R-3.0.0\bin\i386".
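Whether the Path change took effect can be confirmed from a newly opened console. The following sketch is one way to do it, using Python's standard shutil.which; the function name is hypothetical. It also looks for Rscript, which ships in the same bin folder as R.

```python
import shutil


def find_on_path(names=("R", "Rscript")):
    """Return the full path of the first listed program found on PATH, or None."""
    for name in names:
        location = shutil.which(name)
        if location is not None:
            return location
    return None


if __name__ == "__main__":
    location = find_on_path()
    print(location or "R was not found on PATH; re-check the Path variable")
```

Environment variable changes on Windows apply only to consoles opened after the change, so run the check in a new window.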
2.     Installation of Kepler
  • Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html to download and install JDK.
  • Go to https://kepler-project.org/users/downloads to download and install Kepler
  • To check whether Kepler is installed properly, start Kepler by double clicking its icon. You should see Kepler’s graphical user interface. If a dialog for updating modules appears, select ‘yes’; Kepler will automatically restart after installing those updated modules.
  • After installation, go to the directory where Kepler is installed and open the file /common-2.4.0/resources/configurations/configuration.xml. Change the line "<maxWaitTime>300</maxWaitTime>" to "<maxWaitTime>-1</maxWaitTime>". This allows the pop-up webpages to wait until the user makes a decision.
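The maxWaitTime edit can also be scripted, which helps when Kepler is installed on several machines. This is a sketch rather than an official Kepler tool: the function names are hypothetical, the file is assumed to be UTF-8 text, and every maxWaitTime element in the file is set to -1 on the assumption that there is only the one described above. A ".bak" copy of the original file is kept.

```python
import re
import shutil

# Matches the element whatever number it currently holds, for example 300 or -1.
MAX_WAIT = re.compile(r"<maxWaitTime>\s*-?\d+\s*</maxWaitTime>")


def disable_wait_timeout(xml_text):
    """Return xml_text with each maxWaitTime element set to -1."""
    updated, count = MAX_WAIT.subn("<maxWaitTime>-1</maxWaitTime>", xml_text)
    if count == 0:
        raise ValueError("no <maxWaitTime> element found; check the file path")
    return updated


def patch_file(config_path):
    """Rewrite config_path in place, keeping a .bak copy of the original."""
    with open(config_path, encoding="utf-8") as handle:
        original = handle.read()
    updated = disable_wait_timeout(original)
    shutil.copy2(config_path, config_path + ".bak")
    with open(config_path, "w", encoding="utf-8") as handle:
        handle.write(updated)
```

Call patch_file with the full path to configuration.xml inside your Kepler installation while Kepler is closed. Running it a second time is harmless, because a value that is already -1 is simply rewritten as -1.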
3.     Installation of AltAnalyze
  • Go to http://code.google.com/p/altanalyze/downloads/list?can=1&q and download v.2.0.8.
  • Unzip to your desired directory.
  • To make sure AltAnalyze works properly and has installed the species database, open the command-line console; on Windows, go to "Start", select "Run" and type "cmd".
  • Change to the directory where AltAnalyze.py is installed, using a command such as "cd C:\tools\AltAnalyze_v.2.0.8".
  • Type "AltAnalyze.exe". If AltAnalyze starts properly, you should see its graphical user interface. Otherwise, please check the version of your AltAnalyze.
  • Click "Begin Analysis". If this is the first time you have run AltAnalyze, a prompt window will appear indicating that no species database was found. Click "Continue", select the species that you want to analyze, and click "Continue" again. AltAnalyze will download the corresponding resources automatically. Click "Quit" after the download is complete.
4.     Installation of R packages
  • Make sure that you have the permission to update R libraries.
  • On Windows, go to the directory where R is installed, such as "C:\Program Files\R\R-3.0.0", right-click the folder and select "Properties". Under the “Security” tab, edit the permissions and make sure you have the ‘write’ permission.
  • Double-click the R shortcut to open the R console. If both the 32-bit and 64-bit versions are installed, pick the one that matches your OS.
  • Enter the following commands in the R console to install the Bioconductor packages:
            source("http://www.bioconductor.org/biocLite.R")
            biocLite()
            biocLite("affyQCReport")
            biocLite("GEOquery")
            biocLite("arrayQualityMetrics")
  • Enter the following commands to test whether the packages were installed successfully:
            library(affyQCReport)
            library(GEOquery)
            library(arrayQualityMetrics)
These libraries must be installed properly before you run the MAAMD workflow.
Workflow Procedure: 
1. Download the MAAMD zip package and unzip it to C:, so that you have a folder "C:\MAAMD" containing a "workflow" folder and a "sample" folder.
2. Search the GEO database at http://www.ncbi.nlm.nih.gov/geo/. Look for data sets and collect the data set information. Edit the input CSV files for the selected data sets, keeping the fixed file format. Refer to "datasets.csv" for the format of the dataset summary, and to "datainfo-gse9400.csv" and "datainfo-gse33100.csv" for the format of the sample information in an individual data set.
    Note: please do not modify the names of columns.
    Note: The suffix ".CEL" is required for both "SampleName" and "NewName". 
3. Start Kepler and open the MAAMD workflow. Keep the Internet connection open while MAAMD is running.
4. Edit the parameters for MAAMD. Note: all paths must use the forward slash "/" as the path delimiter; the back slash "\" does not work in Kepler.
    Nset: the number of datasets that you want to analyze.
    Note: If Nset is smaller than the number of datasets listed in datasets.csv, MAAMD will analyze only the first “N” datasets.
    WorkPath: the folder where you want to store the data and results.
    DataFile: the path of the csv file in which you collect the information for all datasets.
    Note: This csv file contains the summary of all targeted datasets. Do not point DataFile at the csv files that hold sample information for individual datasets; the paths of those sample information files are assigned inside this csv. Refer to “datasets.csv” as an example.
    MAAMDPath: the folder where you store the MAAMD workflows.
    Note: the reference file homologene.txt must be stored in this same folder.
    AltAnalyze: the directory where AltAnalyze is installed, for example, "C:/AltAnalyze_v.2.0.8-Win64".
5. Click the "run" button.
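The forward-slash rule in step 4 is easy to miss when a path is copied from Windows Explorer. The sketch below shows one way to convert such a path before pasting it into a Kepler parameter; the function name is hypothetical.

```python
def to_kepler_path(path):
    """Return path with every back slash replaced by a forward slash."""
    return path.replace("\\", "/")


if __name__ == "__main__":
    # A raw string (r"...") keeps the back slashes exactly as typed.
    print(to_kepler_path(r"C:\AltAnalyze_v.2.0.8-Win64"))  # C:/AltAnalyze_v.2.0.8-Win64
```

Paths that already use forward slashes, such as Mac paths, pass through unchanged.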