A growing number of studies employ cell line-based systematic short interfering RNA (siRNA) screens to study gene functions and to identify drug targets. Because siRNA screens are subject to multiple sources of variation that are unique to this format, there is a growing demand for a computational tool that generates normalized values and standardized scores. However, only a few tools are available so far, and their usability is limited. Here, we present siMacro, a fast and easy-to-use Microsoft Office Excel-based tool with a graphical user interface, designed to process single-condition or two-condition synthetic screen datasets. siMacro normalizes position and batch effects, censors outlier samples, and calculates Z-scores and robust Z-scores, with a spreadsheet output of >120,000 samples in under 1 minute.
Human tissue-derived cell lines have served as an effective platform for understanding the molecular biology of diseases and increasingly for drug discovery [
Statistical methods to process whole-genome siRNA screen data have been reported by others [
Here, we present siMacro, a simple GUI-based tool for processing cell-based high-throughput screening datasets. siMacro has been implemented in Visual Basic for Applications (VBA) and packaged as a Microsoft Office Excel add-in. It allows one-step, fast, and easy suppression of outlier values, normalization, and standardization of a complete raw dataset from a genomewide siRNA screen in an intuitive spreadsheet format. The tool processes the data points from a 2-condition genomewide screen with biological triplicates on a standard laptop computer in less than 1 minute.
We assume the screen was done in 96- or 384-well plates and passed the standard quality control metric (by Z or Z' factor, for example). siMacro is robust against sporadic bad wells from triplicate experiments but will not censor an entire plate compromised by massive failure, such as broad contamination. We also assume siRNAs in the library plate are randomly distributed, which is generally true for most commercially available genomewide siRNA libraries. siMacro currently supports 1- or 2-condition screens.
siMacro requires all the individual plate readouts be put into an Excel spreadsheet with the field headers: day or cell batch, plate name, well name, and raw data columns per siRNA or a pool (
Most popular normalization protocols employ either on-plate control-based or sample-based methods. Although there is no golden rule for this, we prefer the latter for most siRNA screens, as 1) cell-based siRNA screening is vulnerable to within-plate variation, such as an edge effect and column/row effects that are not corrected by on-board controls [
Sporadic bad wells are often manually censored one by one, which is inefficient and error-prone in large-scale screens. If the experiment is done in triplicate or more, bad wells can be detected automatically by the inflated coefficient of variation (CV) among the replicates. siMacro identifies bad wells by applying a user-defined cutoff to the CV of the normalized values across replicates. The default is 1%, meaning that the 1% of genes with the highest CV will each have an outlier well masked. This significantly reduces false positives but can also overcorrect. Therefore, siMacro reports flags for all genes with censored wells so that users can decide whether to exclude them from downstream analysis.
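The CV-based censoring described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the siMacro VBA code: the function name `censor_bad_wells`, the `top_fraction` parameter, and the rule for choosing which well to mask (the replicate farthest from the replicate median) are all hypothetical, since the text does not specify how the outlier well within a flagged gene is selected.

```python
import statistics

def censor_bad_wells(replicates, top_fraction=0.01):
    """Mask one outlier well in the genes with the highest replicate CV.

    replicates: dict mapping gene -> list of normalized replicate values.
    Returns (cleaned, flagged); masked wells are set to None.
    """
    # Coefficient of variation per gene: sample SD divided by the mean.
    cv = {}
    for gene, values in replicates.items():
        mean = statistics.mean(values)
        cv[gene] = statistics.stdev(values) / abs(mean) if mean else float("inf")

    # Flag the top fraction of genes ranked by CV (default 1%, as in the text).
    n_flag = round(len(cv) * top_fraction)
    flagged = set(sorted(cv, key=cv.get, reverse=True)[:n_flag])

    cleaned = {}
    for gene, values in replicates.items():
        values = list(values)
        if gene in flagged:
            # Assumption: the bad well is the one farthest from the median.
            med = statistics.median(values)
            worst = max(range(len(values)), key=lambda i: abs(values[i] - med))
            values[worst] = None
        cleaned[gene] = values
    return cleaned, flagged

# Toy data: nine well-behaved genes and one with a failed third replicate.
demo = {f"G{i}": [1.00, 1.01, 0.99] for i in range(9)}
demo["BAD"] = [1.00, 1.02, 0.20]
cleaned, flagged = censor_bad_wells(demo, top_fraction=0.10)
```

Returning the set of flagged genes alongside the cleaned values mirrors the flag column that siMacro reports, so a user can still decide to exclude or rescue a gene afterwards.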
A unified scoring scheme is employed that accounts for batch effects from multiday experiments. Under the assumption of normality, the Z-score, which indicates how many standard deviations an observation lies from the mean, is an intuitive scoring metric. However, it is sensitive to biological outliers (hits) in the data pool, which inflate the standard deviation and thus deflate the scores. An alternative is the robust Z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). siMacro calculates the Z-score and the robust Z-score within the user-provided unit of experiment: e.g., the day, cell batch, or plate. siMacro takes the mean of log2-transformed normalized values from replicates for the calculation. For a 2-condition synthetic phenotype screen, the log2 ratio between the 2 conditions is used per siRNA to calculate the Z-score and the robust Z-score for a synthetic effect. siMacro returns its output, including individual and mean normalized values, Z-score, robust Z-score, and flags for the censored bad wells, directly onto the Excel data sheet as additional columns (
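As a rough illustration of the scoring step, the following Python sketch computes a Z-score and a robust Z-score per batch from the mean of log2-transformed replicate values. It is a sketch under assumptions rather than the siMacro implementation: the function name `score_by_batch`, the row layout, the use of the sample standard deviation, and the 1.4826 scaling of the MAD (a common convention that makes the MAD comparable to the standard deviation for normal data) are not stated in the text.

```python
import math
import statistics
from collections import defaultdict

MAD_SCALE = 1.4826  # assumed convention; not specified in the text

def score_by_batch(rows):
    """rows: list of dicts with 'batch', 'gene', and 'values'
    (normalized replicate values; None marks a censored well).
    Returns {(batch, gene): {'z': ..., 'robust_z': ...}}.
    """
    # Mean of log2-transformed replicates per gene, grouped by batch.
    by_batch = defaultdict(list)
    for row in rows:
        logs = [math.log2(v) for v in row["values"] if v is not None]
        by_batch[row["batch"]].append((row["gene"], statistics.mean(logs)))

    scores = {}
    for batch, items in by_batch.items():
        xs = [x for _, x in items]
        mu = statistics.mean(xs)
        sd = statistics.stdev(xs)  # sample SD (an assumption)
        med = statistics.median(xs)
        mad = statistics.median([abs(x - med) for x in xs])
        # Note: a batch with zero spread (sd or mad of 0) would need a guard.
        for gene, x in items:
            scores[(batch, gene)] = {
                "z": (x - mu) / sd,
                "robust_z": (x - med) / (MAD_SCALE * mad),
            }
    return scores

# Toy batch whose per-gene log2 means are 0, 1, 2, -1, -2.
demo_rows = [
    {"batch": "day1", "gene": "A", "values": [1.0, 1.0, 1.0]},
    {"batch": "day1", "gene": "B", "values": [2.0, 2.0, 2.0]},
    {"batch": "day1", "gene": "C", "values": [4.0, None, 4.0]},
    {"batch": "day1", "gene": "D", "values": [0.5, 0.5, 0.5]},
    {"batch": "day1", "gene": "E", "values": [0.25, 0.25, 0.25]},
]
demo_scores = score_by_batch(demo_rows)
```

For a 2-condition screen, the same function could be applied to the per-siRNA log2 ratio between conditions instead of the single-condition log2 mean; that adaptation is left out here to keep the sketch short.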
The original source, written in R, has been reimplemented in VBA and packaged as an Excel add-in. Since ease of use is the main objective of the add-in, it provides a simple GUI and depends only on Excel. As Excel is widely used and familiar to most biologists, siMacro offers an immediate option for dataset processing together with Excel-based data visualization tools. siMacro runs on Microsoft Office Excel 2007 or later for the Windows OS or Excel 2011 for the Mac OS. On a standard laptop computer with a Pentium dual-core 1.73 GHz processor and 1 GB memory, the operating time is under 60 seconds to process a triplicate 2-condition genomewide dataset.
As a test set, the direct lethality dataset from 21,125 sets of siRNA oligos in a non-small cell lung cancer line, H1155 [
We thank Angelique Whitehurst for the dataset and Hannah Chung for comments. This work was supported by grants from the National Institutes of Health (CA71443 and CA129451), the Welch Foundation (I-1414), and the Cancer Prevention Research Institute of Texas (CPRIT).
Data processing example of siMacro. One of 3 measures of cell viability against siRNAs targeting 21,115 genes in a non-small cell lung cancer line, H1155 [