Bioinformatics

The GRT Hub offers a range of bioinformatics services to its clients. Bioinformatics support is available for the full experimental process, from the generation of data in the laboratory through the tertiary statistical analysis of those data.

Bioinformatics Core Overview

Recharge Services

Planning your experiment

Data Analysis Pipeline

Microarray Analysis

NGS analysis

Publication References

Submitting data to NCBI GEO via HPC

Faculty Collaborations

Tutorials

Frequently Asked Questions

Bioinformatics Core Overview

Overview

The Bioinformatics Core has expertise in statistics and in the design and strategic planning of experiments that use optical mapping, high-throughput sequencing, spatial transcriptomics, and SNP, methylation, and CNV microarrays. GRT Hub clients are encouraged to contact the Bioinformatics Core at the planning stage of their experiments to ensure that the bioinformatics requirements of the study are built into the execution of the experiment.

Recharge Services

Next Generation Sequencing (NGS) technologies are revolutionizing biology. However, analyzing NGS data requires dedicated bioinformaticians, who are not available in many labs. At the UCI GRT Hub, we can help you overcome the challenges of analyzing increasingly large and complex NGS data, and we can help you design and analyze a wide variety of high-throughput experiments. The GRT Hub provides basic data analysis and performs defined data analysis routines. These services are typically provided on a recharge basis and are tracked through the iLab system. If you are new to iLab, please review the process for initiating a work request before continuing.

How it works

Following a preliminary consultation, each user will be provided with a Work Plan Agreement along with a quotation for recharge services. The Work Plan Agreement is to be signed by the user’s PI and uploaded into iLab.
In relatively open-ended research projects, prior consultation may indicate the need for greater scientific input from Hub staff, or may show that the pipeline to the project endpoint cannot be completely defined. In such cases it may be necessary to transition to a collaborative effort with bioinformatics staff or research faculty (see Bioinformatics Hub). Depending on the role of bioinformatics staff in such research projects, the Project Agreement will also outline expectations with respect to co-authorship. Only a limited number of these projects are possible.

Prioritization

Our general policy is first-come, first-served. In any event, the Work Plan Agreement will indicate our estimated start and completion dates, and users will be notified of any significant deviations from those dates. CFCCC Members may have priority if delays are anticipated.

Planning your experiment

Please contact the Bioinformatics Director, Jenny Wu, to schedule a consultation about the experimental design and time frame of your experiment.

GRT Hub staff can prepare mRNA, scRNA, small RNA, genomic, ChIP, methyl, exome, ATAC-seq, and multi-ome sequencing libraries. They are experienced with multiple strategy options for exome enrichment, ribosomal RNA depletion, and generation of multiplex libraries in order to ensure maximal return of data from experiments.

To facilitate the preparation of libraries, quality testing, and final titration of samples, the GRT Hub is equipped with a ThermoFisher KingFisher Flex for DNA extraction; a BioRad CFX96 qPCR; a Covaris S220 focused sonicating shearer; a Sage Sciences Blue Pippin electrophoresis system for DNA sizing and protein fractionation; NanoDrop 1000 and One spectrophotometers; Qubit 1.0 and 3.0 fluorometers; an MJ Research Tetrad thermal cycler; an Agilent 2100 Bioanalyzer; and an Agilent MX PRO RTPCR.

The 10x Genomics Chromium or Parse Biosciences systems are used to capture single cells and prepare templates for single cell applications. The 10x Genomics Chromium is available for single cell RNA 3’-end sequencing or genome and exome DNA sequencing. The iScan Infinium BeadArray platform is used for mouse or human SNP, methylation, and CNV analysis.

Short read sequencing is currently performed on the Illumina NovaSeq and MiSeq sequencers. For long read sequencing, the GRT Hub offers the PacBio Sequel II, which is useful for whole genome assembly, targeted RNA sequencing for identification of alternative splice variants, targeted DNA sequencing for haplophase genotyping, and direct sequence identification of DNA base modifications.

Digital gene expression and miRNA analysis are performed on the NanoString nCounter. High-throughput long-range genome optical mapping for genome assembly or identification of structural variants is performed on the BioNano Saphyr system. Spatial transcriptomics is currently performed on the 10x Genomics Visium and NanoString GeoMX platforms, with several developing platforms under investigation in early access.

Data Analysis Pipeline

The IGB has developed an automated pipeline to manage and analyze the high-throughput sequencing data produced by the GRT Hub for its clients at UCI and elsewhere. The pipeline manages the data generated during the runs of the high-throughput sequencers in the facility and performs the primary analysis of the data to extract the sequences and corresponding quality scores for each sample in a given run. The pipeline uses a combination of commercial and open source software, as well as software developed in house in the Bioinformatics Laboratory. The scripts in the first part of the pipeline continuously monitor the sequencing runs at the facility and the data sent to the IGB servers via the campus network during the runs. The intensities of the spots on the images and the base calls are computed directly during the run using Illumina’s Real Time Analysis (RTA) software installed on a computer located in the GRT Hub. The results are sent to a dedicated partition on the IGB servers. Once the sequencer completes a run, the pipeline automatically generates the FASTQ files containing the sequencing reads and the corresponding quality scores. The FASTQ files are then compressed and made available to the client for download via the web server. The client receives an email when their data are ready for download, with a web address and instructions for retrieving the data.
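As an illustration of what the delivered files contain, the following minimal Python sketch reads FASTQ records (four lines each: header, sequence, separator, quality string) and decodes the quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina software; the names `FastqRecord`, `read_fastq`, and `open_fastq` are hypothetical helpers for this example and are not part of the IGB pipeline.

```python
import gzip
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle holding 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


def open_fastq(path: str):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


if __name__ == "__main__":
    # Tiny in-memory example instead of a real download.
    example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5?I\n")
    for record in read_fastq(example):
        mean_q = sum(record.qualities) / len(record.qualities)
        print(record.name, record.sequence, f"mean Q = {mean_q:.1f}")
```

For a downloaded file, `read_fastq(open_fastq("sample.fastq.gz"))` would iterate over the reads in the same way; the file name here is only a placeholder.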

Basic Pipeline Specifications

Secured 30 TB high-quality storage
Data transferred from Illumina instrument to IGB via campus network (high speed compared to Illumina machine)
Data generated:
  Images [~2-4 TB]
  Base calls with intensities [~100-200 GB compressed]
  FASTQ files corresponding to read sequences with quality scores [~15 GB compressed], generated by IGB using Illumina GERALD software

Client Services

User friendly web interface to enter basic information (genome, experiment type, etc.)
Automatic notifications
Ability to download results from secure web ftp portal
Storage duration [Duration available free of charge]:
  Images [2 weeks]
  Base calls [2 months]
  FASTQ files [1 year/downloadable]

IGB web interface for high-throughput sequencing

Microarray Analysis

GRT Hub offers Illumina iScan Infinium BeadArray microarrays for SNP, methylation, and CNV analysis.

You can analyze your microarray data with a variety of open source or commercial software, such as GenomeStudio, which can be downloaded from the Illumina website. The GRT Hub provides data as standard-format IDAT files that can be imported into GenomeStudio. Excellent Illumina YouTube tutorials are available to walk users through the basic elements of analyzing genotyping and methylome arrays.

You can also use R packages such as XXXXX.

Integrated Bioinformatics Tools available for analysis:

HPC3 also provides UCI researchers with integrated tools for analysis and comprehension of biological data. R-based Bioconductor, the Matlab Bioinformatics Toolbox, and Galaxy can be used to analyze both NGS and microarray data. CLCbio Genomic Workbench, a commercial software package specifically designed for biologists to easily analyze high-throughput data, is also available on HPC for GHTF users. See the CLCbio manual for detailed instructions on how to use CLCbio to analyze your NGS and microarray data.

NGS analysis

NGS Data Analysis Workflow

Here we provide a general workflow for NGS data analysis with RNA-Seq, ChIP-Seq, and RRBS-Seq (Reduced Representation Bisulfite Sequencing, a cost-efficient alternative to whole genome bisulfite sequencing).

Getting Started With Your Own Analysis on NGS Data

If you want to do your own analysis and are UCI affiliated, first see HPC to request an account for yourself. Also refer to this page for a brief introduction on how to manipulate data on Linux; it also explains how to transfer your data to and from HPC. Once you have obtained your account and transferred your data to HPC, you can start running software to do your analysis. (Use “module av” to see a full list of the software available on HPC.) Note that HPC is not backed up regularly, so you are responsible for archiving your own data.

Computational Resources Required for NGS Reads Assembly

NGS sequencers produce orders of magnitude more data to sift through, analyze, and share, which increases the complexity of sequencing data analysis workflows. A tightly integrated, scalable high-performance computing platform with intelligent data management is recommended. Here at UCI, we recommend analyzing your NGS data on the high-performance Linux cluster HPC, which is available to the whole campus. If you prefer to use your own desktop or laptop, at least 8 GB of RAM is recommended, depending on your experiments and data volume.

Strategies/Software Involved in Assembly and Alignment

In most NGS data analysis workflows (exome sequencing, RNA-seq, ChIP-seq, etc.), the first analysis step is to map (also called “align”) each of the short reads produced by the sequencer to a reference genome to infer the genomic location from which the read is derived. Depending on the size of the reference genome and the total number of reads, this step can be computationally very challenging. Many open source tools have been developed to solve this problem, and their underlying algorithms largely fall into two categories. One category is hash-based, hashing either the reads (MAQ 2008, ELAND 2007) or the reference genome (Mosaik 2008). A second category is based on the Burrows-Wheeler transform (BWT) and associated data structures, which support fast retrieval of exact or approximate string matches (BWA 2009, Bowtie 2009, Bowtie2 2012). The BWT-based methods are designed to align short reads with no mismatches or a small number of mismatches. As such, the performance of these methods tends to deteriorate as the read length becomes longer and the allowed number of mismatches increases.
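To make the BWT-based category more concrete, here is a toy Python sketch of exact matching by backward search over a Burrows-Wheeler transform. It is only an illustration of the principle under simplifying assumptions: the suffix array is built by naive sorting and the occurrence table is stored in full, whereas real aligners such as BWA and Bowtie use compressed, sampled structures and also handle mismatches. The function names are hypothetical.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a naive BWT-based index (suffix array, first-column offsets, occurrence table)."""
    text = reference + "$"  # the sentinel '$' sorts before A, C, G, T
    # Naive suffix array: fine for a toy example, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps to '$' for suffix 0).
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c] = number of characters in the text that sort before c.
    counts = Counter(text)
    first, total = {}, 0
    for char in sorted(counts):
        first[char] = total
        total += counts[char]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {char: [0] * (len(bwt) + 1) for char in counts}
    for i, bwt_char in enumerate(bwt):
        for char in occ:
            occ[char][i + 1] = occ[char][i] + (1 if char == bwt_char else 0)
    return suffix_array, first, occ


def find_exact(read, index):
    """Return sorted 0-based reference positions where the read matches exactly."""
    suffix_array, first, occ = index
    low, high = 0, len(suffix_array)  # current suffix-array interval [low, high)
    for char in reversed(read):       # backward search: last base first
        if char not in first:
            return []
        low = first[char] + occ[char][low]
        high = first[char] + occ[char][high]
        if low >= high:
            return []
    return sorted(suffix_array[i] for i in range(low, high))


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTAGC")
    print(find_exact("ACGT", index))
    print(find_exact("TAGC", index))
    print(find_exact("GGG", index))
```

Each base of the read narrows the suffix-array interval with two table lookups, which is why the search time depends on the read length rather than on the genome size.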

All of the software mentioned above is available on HPC. Depending on your data volume, the size of your reference genome, and the time that you can spend on the computation, you can choose either category one software (slower, more accurate) or category two software (faster, less sensitive).
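For comparison, the hash-based category can be sketched as a seed-and-verify search. The minimal Python example below hashes every k-mer of the reference genome (the variant in which the reference, rather than the reads, is hashed), looks up the k-mers of a read to find candidate locations, and then verifies each candidate by counting mismatches. The seed length `k`, the mismatch limit, and the function names are assumptions chosen for illustration; they do not reproduce the behavior of any particular aligner.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (position, mismatches) pairs for alignments within the limit."""
    hits = {}
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset  # implied start of the whole read
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            # Verification step: compare the full read against the reference window.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGC"
    index = build_kmer_index(reference, k=4)
    print(map_read("ACGT", reference, index, k=4))
    print(map_read("TTAGCAGATA", reference, index, k=4))  # one substituted base
```

Because a read only needs one error-free seed to be found, this approach tolerates mismatches naturally, at the cost of holding the whole k-mer table in memory.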

Researchers at UCI have also developed their own read aligner, Hobbes, which is both fast and accurate but requires a relatively large amount of memory. Hobbes also provides a free web service that lets you align your reads online, which is a good option if the total number of reads from your study is less than 5 million.

Integrated Bioinformatics Tools

HPC also provides integrated tools for analysis and comprehension of biological data. R-based Bioconductor is open source and available to all users. CLCbio Genomic Workbench, a commercial software package specifically designed for biologists to easily analyze high-throughput data, is also available on HPC for GHTF users. See the CLCbio manual for detailed instructions on how to use CLCbio to analyze your NGS data. The Genome Analysis Toolkit (GATK), developed by the Broad Institute, is also available on HPC for next generation sequencing data analysis and human medical resequencing data analysis.

Publication References

RNA-Seq data analysis references
Chen, G., Wang, C., & Shi, T. (2011). Overview of available methods for diverse RNA-Seq data analyses. Science China. Life sciences, 54(12), 1121–8. doi:10.1007/s11427-011-4255-x
DeLuca, D. S., Levin, J. Z., Sivachenko, A., Fennell, T., Nazaire, M.-D., Williams, C., Reich, M., et al. (2012). RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics (Oxford, England), 28(11), 1530–2. doi:10.1093/bioinformatics/bts196
Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., et al. (2012). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, bbs046. doi:10.1093/bib/bbs046
Feng, H., Qin, Z., & Zhang, X. (2012). Opportunities and Methods for Studying Alternative Splicing in Cancer with RNA-Seq. Cancer letters. doi:10.1016/j.canlet.2012.11.010
Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods, 8(6), 469–477.
Hu, M., Zhu, Y., Taylor, J. M. G., Liu, J. S., & Qin, Z. S. (2012). Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq. Bioinformatics (Oxford, England), 28(1), 63–8. doi:10.1093/bioinformatics/btr616
Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
Mezlini, A. M., Smith, E. J., Fiume, M., Buske, O., Savich, G., Shah, S., Aparicio, S., et al. (2012). iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Research, gr.142232.112. doi:10.1101/gr.142232.112
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621–8. doi:10.1038/nmeth.1226
Rehrauer, H. (n.d.). RNA-seq Quantification. RNA-seq isoform quantification problem: How many transcripts?
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England), 26(1), 139–40. doi:10.1093/bioinformatics/btp616
Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics (Oxford, England), 27(6), 863–4. doi:10.1093/bioinformatics/btr026
Trapnell, C., Hendrickson, D. G., Sauvageau, M., Goff, L., Rinn, J. L., & Pachter, L. (2012). Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology, advance on. doi:10.1038/nbt.2450
Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105–11.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3), 562–78.
Vijay, N., Poelstra, J. W., Künstner, A., & Wolf, J. B. W. (2012). Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Molecular Ecology. doi:10.1111/mec.12014
Wang, K., Singh, D., Zeng, Z., Coleman, S. J., Huang, Y., Savich, G. L., He, X., et al. (2010). MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research, 38(18), e178.
Wang, L., Feng, Z., Wang, X., Wang, X., & Zhang, X. (2010). DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics (Oxford, England), 26(1), 136–8. doi:10.1093/bioinformatics/btp612
Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57–63. doi:10.1038/nrg2484

ChIP-Seq data analysis references
DeLuca, D. S., Levin, J. Z., Sivachenko, A., Fennell, T., Nazaire, M.-D., Williams, C., Reich, M., et al. (2012). RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics (Oxford, England), 28(11), 1530–2. doi:10.1093/bioinformatics/bts196
Feng, J., Liu, T., Qin, B., Zhang, Y., & Liu, X. S. (2012). Identifying ChIP-seq enrichment using MACS. Nature protocols, 7(9), 1728–40. doi:10.1038/nprot.2012.101
Ji, H., Jiang, H., Ma, W., & Wong, W. H. (2011). Using CisGenome to analyze ChIP-chip and ChIP-seq data. Current protocols in bioinformatics, Chapter 2, Unit 2.13. doi:10.1002/0471250953.bi0213s33
Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
Machanick, P., & Bailey, T. L. (2011). MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics (Oxford, England), 27(12), 1696–7. doi:10.1093/bioinformatics/btr189
Narlikar, L., & Jothi, R. (2012). ChIP-Seq data analysis: identification of protein-DNA binding sites with SISSRs peak-finder. Methods in molecular biology (Clifton, N.J.), 802, 305–22. doi:10.1007/978-1-61779-400-1_20
Newkirk, D., Biesinger, J., Chon, A., Yokomori, K., & Xie, X. (2011). AREM: aligning short reads from ChIP-sequencing by expectation maximization. Journal of computational biology : a journal of computational molecular cell biology, 18(11), 1495–505. doi:10.1089/cmb.2011.0185
Rozowsky, J., Euskirchen, G., Auerbach, R. K., Zhang, Z. D., Gibson, T., Bjornson, R., Carriero, N., et al. (2009). PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature biotechnology, 27(1), 66–75. doi:10.1038/nbt.1518
Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics (Oxford, England), 27(6), 863–4. doi:10.1093/bioinformatics/btr026
Wilbanks, E. G., & Facciotti, M. T. (2010). Evaluation of algorithm performance in ChIP-seq peak detection. PloS one, 5(7), e11471. doi:10.1371/journal.pone.0011471
Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., Nusbaum, C., et al. (2008). Model-based analysis of ChIP-Seq (MACS). Genome biology, 9(9), R137. doi:10.1186/gb-2008-9-9-r137

Submitting data to NCBI GEO via HPC

How to Submit

Obtaining GEO Upload Information

1. Create a MyNCBI account
2. Obtain a personal upload directory from NCBI (this should come with a GEO File Transfer Protocol)

Gather Files for Submission

1. Download and fill in the GEO metadata spreadsheet
2. Gather all the raw data files and processed data files into one directory on HPC (transferring data to HPC)
3. Calculate the md5sum for all data files (md5sum examples)
4. Fill in the “checksum” cell with the calculated md5sum for each file
5. Once the metadata spreadsheet is complete, upload it to HPC in the same directory as the data files
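The checksum step can be done with the md5sum command on HPC; as an alternative illustration, the following minimal Python sketch computes the same MD5 value for every file in a directory using only the standard library. The function names are hypothetical, and the files are read in chunks so that large FASTQ files do not need to fit in memory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_directory(directory):
    """Return {file name: MD5 digest} for every regular file in the directory."""
    return {
        path.name: md5_of_file(path)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Calling `checksum_directory(".")` from the submission directory returns a dictionary whose values can be pasted into the “checksum” cells of the metadata spreadsheet.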

FTP to NCBI GEO

NOTE: Once you FTP in, the connection will time out after 60 seconds of inactivity. When that happens, just type ‘exit’ and ftp again.
1. Navigate to the HPC directory containing all the files you want to upload
2. Type ftp followed by the host address (example: ftp ftp-private.ncbi.nlm.nih.gov); the host address should be letter ‘e’ under Step 2 of the GEO File Transfer Protocol
3. You will then be asked to enter the username and password for the FTP server (listed under the host address; this is not your MyNCBI account info)
4. Type prompt n (turns off interactive mode)
5. Then navigate to your personal NCBI GEO upload directory (cd /upload/personaldirectory)
6. Make a directory for your new submission (mkdir geo_upload_example)
7. Navigate into that new directory
8. Type mput * (All the files in the HPC directory you were in will now get uploaded to your GEO submission directory)
9. All of your files should have been uploaded. You can check with ls.
10. Type exit to leave ftp
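The numbered steps above can also be scripted. The following is a minimal sketch using Python’s standard ftplib module, not an official GEO tool: `upload_directory` and `submit` are hypothetical helper names, and the host, username, password, and directory values must be taken from your own GEO File Transfer Protocol instructions. Like mput *, it uploads only the regular files in the local directory and does not descend into subdirectories.

```python
import ftplib
from pathlib import Path


def upload_directory(ftp, local_dir, remote_parent, submission_name):
    """Upload every regular file in local_dir into a new remote submission directory."""
    ftp.cwd(remote_parent)    # step 5: go to your personal upload directory
    ftp.mkd(submission_name)  # step 6: make a directory for the new submission
    ftp.cwd(submission_name)  # step 7: move into it
    uploaded = []
    for path in sorted(Path(local_dir).iterdir()):  # step 8: equivalent of mput *
        if path.is_file():
            with open(path, "rb") as handle:
                ftp.storbinary(f"STOR {path.name}", handle)
            uploaded.append(path.name)
    return uploaded, ftp.nlst()  # step 9: remote listing, to compare with uploaded


def submit(host, user, password, local_dir, remote_parent, submission_name):
    """Log in, upload, and close the connection (steps 2, 3, and 10)."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        return upload_directory(ftp, local_dir, remote_parent, submission_name)
```

A call might look like `submit("ftp-private.ncbi.nlm.nih.gov", user, password, ".", "/upload/personaldirectory", "geo_upload_example")`, where the last two arguments are the placeholder paths from the steps above. Because the script does not wait for typed input between commands, it is less likely than an interactive session to hit the inactivity timeout.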

Notify NCBI of the Upload

1. After the FTP transfer is complete, notify GEO by clicking the “Notify GEO” button.
2. You will need to specify the new GEO submission directory you uploaded the files into.

Finished! You should receive your GEO Accession Code within 5 business days.

Visit the NCBI GEO Submission site for more information.

Faculty Collaborations

The GRT Hub includes participating bioinformatics, statistics, and epidemiology faculty who are interested in collaborative research projects. Some, but not all, of these faculty are highlighted below, along with a combined video presentation of their research interests.

View Presentations

Karen Edwards
Department of Epidemiology
Email
Profile

Research goal is to improve population health and ensure that all segments of the population benefit; consideration of the ethical, legal, and social implications of translating genomic discoveries for public health and clinical practice.

Trina Norden-Krichmar, PhD
Department of Epidemiology
Email
Profile

The overall goal of the lab is to gain a better understanding of the mechanisms underlying the causes of human diseases through the computational analyses of genomic, clinical, and environmental data.

Wei Li, PhD
Department of Biological Chemistry
Division of Biomedical Computation
Email
Profile

Research is focused on the design and application of bioinformatics algorithms to elucidate global regulatory mechanisms in normal development and diseases such as cancer.

Xiaohui Xie, PhD
Department of Computer Science
Email
Profile

Research interests are primarily in the areas of: AI and Machine Learning, Computational Genomics, Computer Vision and Image Analysis, Systems Biology, Medical Image Analysis.

Zhaoxia Yu, PhD
Department of Statistics
Email
Profile

Developer of statistical methods and a problem solver. I have also been devoted to improving scientific rigor and reproducibility by promoting the use of valid statistical methods.

Min Zhang, MD, PhD
Department of Epidemiology and Biostatistics
Email
Profile

Statistical inference for omics data including genomic, epigenetic, transcriptomic, and metabolomic data / Statistical methods for QTL mapping and GWAS including GxG and GxE interactions/ Integrative analysis of multi-omics data and causal network construction / Integration of EHR and molecular data for precision medicine

Jing Zhang, PhD
Department of Computer Science
Email
Profile

Focus on developing computational and statistical methods to uncover the underlying principles of the multi-step and tightly coordinated gene regulation process and understand how genetic variations can result in phenotypic changes and even diseases

Tutorials

As well as offering personalized bioinformatics services, the Genomics Research and Technology Hub (GRT Hub) is about teaching people new skills! We have created detailed step-by-step guides on how to perform some common bioinformatics analyses using the UCI HPC3 cluster. There are many different ways to perform bioinformatics analysis, so think of this as the beginning of a journey. We are getting you started with an example of a general pipeline for a few frequently used types of analysis. If you have questions about, or an interest in, using other programs, please feel free to contact the GRT Hub bioinformatics team for assistance or suggestions.

GRT Hub Workshops

In addition to these guides, the GRT Hub offers hands-on workshops in bioinformatics analysis. Please check our workshop tab to learn about upcoming workshops or to watch presentations from previous workshops.

NOTE:
The guides are provided in the form of Google Docs that are currently accessible only to UCI Google accounts. However, our guides are free to all users, so if you have difficulty accessing a file, please try the “request access” button or email Christina Lin (linc13@uci.edu), who can assist you with this step.

Intro to Bioinformatics Analysis
This is an introduction to basic Linux commands, along with other tips and tricks for navigating the Linux environment, that you will need for setting up the basic processing pipeline from downloading fastq files to producing trimmed reads. It utilizes software such as Vim, fastQC, and trimmomatic. The final output of this process is trimmed reads ready for the alignment step.

Bulk RNA-seq Analysis Pipeline
This guide describes the basic pipeline for Bulk RNA-seq analysis from raw fastq files to counts. We will be using software such as fastQC, trimmomatic, hisat2, and featureCounts. The final output is raw counts that can be further processed using other platforms such as R to perform statistical analysis of differential expression.
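As a small illustration of what can be done with the raw counts, the Python sketch below reads a featureCounts output table and converts the counts to counts per million (CPM). It assumes the standard featureCounts layout: a comment line starting with “#”, then a tab-separated header whose first six columns are Geneid, Chr, Start, End, Strand, and Length, followed by one column per sample. CPM is used here only as a simple example of normalization; it is not a substitute for the statistical analysis of differential expression mentioned above, and the function names are hypothetical.

```python
import csv
import io


def read_featurecounts(handle):
    """Return (sample names, {gene id: [count per sample]}) from a featureCounts table."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[6:]  # columns after Geneid, Chr, Start, End, Strand, Length
    counts = {}
    for row in rows:
        if not row:  # skip blank lines
            continue
        counts[row[0]] = [int(value) for value in row[6:]]
    return samples, counts


def counts_per_million(samples, counts):
    """Scale each sample's counts so that they sum to one million."""
    totals = [sum(values[i] for values in counts.values()) for i in range(len(samples))]
    return {
        gene: [1e6 * value / total if total else 0.0 for value, total in zip(values, totals)]
        for gene, values in counts.items()
    }


if __name__ == "__main__":
    # Tiny made-up table for demonstration only.
    table = io.StringIO(
        "# Program:featureCounts\n"
        "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts1.bam\ts2.bam\n"
        "geneA\tchr1\t1\t100\t+\t100\t10\t30\n"
        "geneB\tchr1\t200\t300\t-\t101\t90\t70\n"
    )
    names, raw = read_featurecounts(table)
    print(names, counts_per_million(names, raw))
```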

Single Cell Gene Expression Analysis Pipeline
This guide describes the basic pipeline for single cell 3′ gene expression analysis from raw fastq files to gene count matrices. This tutorial covers both the 10x Genomics Cell Ranger and Parse Biosciences Split-Pipe. The final output is gene count matrices and other alignment information that can be further processed using other platforms such as Seurat and Scanpy to perform statistical analysis of differential expression.

ATAC-seq Analysis Pipeline
This guide takes you stepwise through the analysis of ATAC-seq data, from raw fastq files to peak calling. We describe examples with fastQC, trimmomatic, bowtie2, and MACS2. The final output files identify chromatin-accessible regions as peaks. The output provides a genome-wide landscape of open chromatin regions that is useful for understanding global epigenetic control of gene regulation.

ChIP-seq Analysis Pipeline
The ChIP-seq (Chromatin ImmunoPrecipitation) pipeline resembles the steps of ATAC-seq analysis, but identifies positions of DNA that are bound by specific proteins (e.g. TFs) and can therefore be precipitated by antibodies. Data are processed from raw fastq files to peak calling, comparing controls with immuno-precipitated (IP) samples. We use the fastQC, trimmomatic, bowtie2, and MACS2 software tools. The final output is a list of identified peak regions along the genome.
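To show what such a peak list looks like in practice, the following minimal Python sketch reads a MACS2 narrowPeak file and keeps the strongest peaks. It assumes the standard ten-column, tab-separated narrowPeak layout (chromosome, start, end, name, score, strand, signal value, -log10 p-value, -log10 q-value, summit offset from start). The threshold and the names `Peak`, `read_narrowpeak`, and `strongest_peaks` are illustrative choices, not settings taken from the guide.

```python
import io
from typing import List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    signal: float       # fold enrichment at the summit
    neg_log10_q: float  # -log10 of the q-value
    summit: int         # absolute summit coordinate (start + offset)


def read_narrowpeak(handle) -> List[Peak]:
    """Parse narrowPeak lines, skipping blank, comment, and track/browser lines."""
    peaks = []
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        start = int(fields[1])
        peaks.append(
            Peak(
                chrom=fields[0],
                start=start,
                end=int(fields[2]),
                name=fields[3],
                signal=float(fields[6]),
                neg_log10_q=float(fields[8]),
                summit=start + int(fields[9]),
            )
        )
    return peaks


def strongest_peaks(peaks, min_neg_log10_q=2.0, top=10):
    """Keep peaks passing the q-value threshold, ordered by decreasing signal."""
    kept = [peak for peak in peaks if peak.neg_log10_q >= min_neg_log10_q]
    return sorted(kept, key=lambda peak: peak.signal, reverse=True)[:top]


if __name__ == "__main__":
    # Two made-up peaks for demonstration only.
    example = io.StringIO(
        "chr1\t100\t300\tpeak_1\t250\t.\t8.5\t12.1\t9.7\t90\n"
        "chr2\t500\t650\tpeak_2\t40\t.\t2.1\t3.0\t1.2\t60\n"
    )
    for peak in strongest_peaks(read_narrowpeak(example)):
        print(peak.chrom, peak.start, peak.end, peak.summit, peak.signal)
```

A `min_neg_log10_q` of 2.0 corresponds to a q-value of 0.01; adjust it to match the cutoff used when calling peaks.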

Methylation Analysis (coming soon)
This guide will walk you through the steps of sequencing data analysis from reads to methylation calls using Bismark.